fcup-bdcc

Big Data & Cloud Computing

Part 1

Use Spark and Google Cloud Platform to analyse samples of the MovieLens dataset of increasing size and difficulty.

Example of PySpark use:

```python
def recommendByTag(singleTag, TFIDF_tags, movies, min_fmax=10, numberOfResults=10, debug=False):
    # start with the most complexity-reducing operation: filter
    # keep only rows matching singleTag, then drop entries with f_max < min_fmax
    df_tag = TFIDF_tags.filter(TFIDF_tags.tag == singleTag)\
                       .filter(TFIDF_tags.f_max >= min_fmax)

    # join to get the movie title, order by descending TF_IDF and
    # ascending lexicographic title, drop unnecessary columns,
    # and limit the result to numberOfResults rows
    df = df_tag.join(movies, 'movieId')\
               .orderBy(['TF_IDF', 'title'], ascending=[0, 1])\
               .select('movieId', 'title', 'TF_IDF')\
               .limit(numberOfResults)
    return df
```
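A hypothetical call, assuming `TFIDF_tags` carries `movieId`, `tag`, `TF_IDF` and `f_max` columns and `movies` maps `movieId` to `title` (the tag value is only illustrative):

```python
# top 5 recommendations for a single tag; 'sci-fi' is a made-up example value
recommendByTag('sci-fi', TFIDF_tags, movies, numberOfResults=5).show()
```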

Part 2

An open-ended problem: use Big Data tools and techniques to analyse a 32GB+ dataset of hospital events. Besides GCP, we used Dask and dask-ml.
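As a rough sketch of the approach (the bucket path and column names below are placeholders, not the project's actual schema), Dask reads the dataset lazily and processes it out of core, while dask-ml mirrors the scikit-learn API on the same partitioned data:

```python
import dask.dataframe as dd
from dask_ml.preprocessing import StandardScaler

# lazily partition the 32GB+ CSV data; nothing is loaded into memory yet
# (the bucket path and column names are hypothetical)
events = dd.read_csv('gs://<bucket>/hospital_events/*.csv')

# build a lazy aggregation; work runs in parallel only on .compute()
admissions_per_unit = events.groupby('care_unit').size().compute()

# dask-ml estimators follow scikit-learn's fit/transform interface
# but operate chunk by chunk; 'length_of_stay' is a hypothetical column
scaled = StandardScaler().fit_transform(events[['length_of_stay']])
```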

Example Plot

Dask-Distributed Dashboard
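The dashboard shown above ships with Dask's distributed scheduler; a minimal sketch of how it is exposed (the worker count is arbitrary):

```python
from dask.distributed import Client

# starting a local cluster also starts the diagnostic dashboard,
# served by default at http://localhost:8787/status
client = Client(n_workers=4)
print(client.dashboard_link)
```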

Authors