fcup-bdcc

Big Data & Cloud Computing

Part 1

Use Spark and Google Cloud Platform to analyse samples of the MovieLens dataset of increasing size and difficulty.

Example of PySpark use:

```python
def recommendByTag(singleTag, TFIDF_tags, movies, min_fmax=10, numberOfResults=10, debug=False):
    # start with the most complexity-reducing operation: filter
    # keep only rows matching singleTag, then drop entries with f_max < min_fmax
    df_tag = TFIDF_tags.filter(TFIDF_tags.tag == singleTag)\
                       .filter(TFIDF_tags.f_max >= min_fmax)

    # join to get the movie title, order by descending TF_IDF and
    # ascending lexicographic title, drop unnecessary columns,
    # and limit the result to numberOfResults rows
    df = df_tag.join(movies, 'movieId')\
               .orderBy(['TF_IDF', 'title'], ascending=[0, 1])\
               .select('movieId', 'title', 'TF_IDF')\
               .limit(numberOfResults)
    return df
```
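A hypothetical call, assuming `TFIDF_tags` carries `movieId`, `tag`, `TF_IDF` and `f_max` columns and `movies` maps `movieId` to `title` (the tag value is only illustrative):

```python
# top 5 recommendations for a single tag; 'sci-fi' is a made-up example value
recommendByTag('sci-fi', TFIDF_tags, movies, numberOfResults=5).show()
```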

Part 2

An open-ended problem: use Big Data tools and techniques to analyse a 32GB+ dataset of hospital events. Besides GCP, we used Dask and dask-ml.
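As a rough sketch of the approach (the bucket path and column names below are placeholders, not the project's actual schema), Dask reads the dataset lazily and processes it out of core, while dask-ml mirrors the scikit-learn API on the same partitioned data:

```python
import dask.dataframe as dd
from dask_ml.preprocessing import StandardScaler

# lazily partition the 32GB+ CSV data; nothing is loaded into memory yet
# (the bucket path and column names are hypothetical)
events = dd.read_csv('gs://<bucket>/hospital_events/*.csv')

# build a lazy aggregation; work runs in parallel only on .compute()
admissions_per_unit = events.groupby('care_unit').size().compute()

# dask-ml estimators follow scikit-learn's fit/transform interface
# but operate chunk by chunk; 'length_of_stay' is a hypothetical column
scaled = StandardScaler().fit_transform(events[['length_of_stay']])
```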

Example Plot

Dask-Distributed Dashboard
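The dashboard shown above ships with Dask's distributed scheduler; a minimal sketch of how it is exposed (the worker count is arbitrary):

```python
from dask.distributed import Client

# starting a local cluster also starts the diagnostic dashboard,
# served by default at http://localhost:8787/status
client = Client(n_workers=4)
print(client.dashboard_link)
```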

Authors