<h1>MovieLens 25M</h1>

<hr>
<h2>0. Introduction</h2>

<p>This dataset describes 5-star rating and free-text tagging activity from MovieLens (https://movielens.org), a movie recommendation service. The dataset is available for download at https://grouplens.org/datasets/movielens/25m/.</p>

<p>The version explored here is the MovieLens 25M Dataset, a stable benchmark version with 25 million ratings and one million tag applications applied to 62,000 movies by 162,000 users. It includes tag genome data with 15 million relevance scores across 1,129 tags. Released 12/2019.</p>

<p>We could discover more information by just reading the dataset's README file, but we are not doing that since the purpose of this notebook is analyzing the exploration process.</p>

<h2>1. Reading Data</h2>

<p>The first step in the exploration process is loading your data to be able to analyze it with your selected tool. Actually, one could argue the first step is ingesting or obtaining this data, but we are not concerned with this part here.</p>

<p>Assuming we already have our data available, let's start by loading it.</p>

In [1]:
# Viewing what is in the directory
!cd ml-25m; ls

genome-scores.csv  links.csv   ratings.csv
genome-tags.csv    movies.csv  tags.csv


Here, we can see that the data lives in the ml-25m directory and is made up of multiple CSV files. Let's load them all at once.

In [2]:
# Importing stuff
import os
import numpy as np
import pandas as pd

In [3]:
dataframes_names = ['genome-scores', 'genome-tags', 'links', 'movies', 'ratings', 'tags']

def read_multiple_csv(filepath, df_names=None):

    '''
    Reads multiple CSV files from a directory with pd.read_csv.
    If df_names is None, it reads every file in the directory.
    
    Receives the path (which must end with a path separator, e.g. './ml-25m/')
    and optionally a list of the dataframe names (file names without '.csv').
    Returns a dict mapping each name to its dataframe.
    
    Requires os and pandas (imported as pd).
    The folder should contain only .csv files (e.g. no .txt) when df_names is None.
    '''
    
    df_list = []
    
    # Here the function uses the names of the dataframes to read the files
    if df_names is not None:
        for i in range(len(df_names)):
            temp_df = pd.read_csv(filepath + df_names[i] + '.csv')
            df_list.append(temp_df)
    
    # If names are not given, the function just reads all data in folder
    else:
        df_names = []
        path, dirs, files = next(os.walk(filepath))
        file_count = len(files)
        for i in range(file_count):
            temp_df = pd.read_csv(filepath + files[i])
            df_list.append(temp_df)
            df_names.append(files[i][:-4])
    
    df_dict = dict(zip(df_names, df_list))
    
    return df_dict

In [4]:
dataframes = read_multiple_csv('./ml-25m/')
print(dataframes.keys())

dict_keys(['movies', 'genome-tags', 'genome-scores', 'tags', 'links', 'ratings'])
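The requirement that the folder contain only csv files can be relaxed by globbing for the extension. The following is a minimal sketch, not the loader used in this notebook; `read_csv_dir` is a hypothetical helper name, and it assumes every `*.csv` file in the directory can be read with the default `pd.read_csv` settings.

```python
from pathlib import Path

import pandas as pd


def read_csv_dir(dirpath):
    """Hypothetical helper: read every *.csv file in dirpath into a dict of DataFrames."""
    frames = {}
    # glob("*.csv") skips non-CSV files; sorted() makes the key order deterministic
    for csv_path in sorted(Path(dirpath).glob("*.csv")):
        # .stem is the file name without its extension, e.g. "movies"
        frames[csv_path.stem] = pd.read_csv(csv_path)
    return frames
```

Because `Path` handles the joining, the directory can be passed with or without a trailing slash, for example `read_csv_dir('./ml-25m')`.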


In [5]:
list(dataframes.values())[0]

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
62418,209157,We (2018),Drama
62419,209159,Window of the Soul (2001),Documentary
62420,209163,Bad Poems (2018),Comedy|Drama
62421,209169,A Girl Thing (2001),(no genres listed)


<h2>2. Dataset Description</h2>

<p>Now we can actually work with our dataframes. Our first step will be checking the dataframes for missing values and standardizing how those missing values are represented.</p>

<i>Notes: Detect the type of each column using NLP to infer the type from the name, uniformize the name format, find whether there are missing values, replace all missing values with a single type of null representation, describe the dataset statistically, check the amount of available data in each column and the intersection amount (the amount of available data on rows that have all data, or on rows that have data in specific columns set by the user).</i>
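As an illustration of the "single type of null representation" step, the sketch below replaces placeholder strings with NaN so that `isnull()` can count them. The placeholder list is an assumption: only "(no genres listed)" appears in the movies output above, the other entries are illustrative, and `standardize_missing` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Assumption: these strings stand in for missing data.
# Only "(no genres listed)" is visible in the movies output; the rest are illustrative.
PLACEHOLDERS = ["(no genres listed)", "", "N/A", "null"]


def standardize_missing(df, placeholders=PLACEHOLDERS):
    """Return a copy of df in which every placeholder string is replaced by NaN."""
    return df.replace(placeholders, np.nan)
```

Applied to the movies dataframe, this would make rows such as "A Girl Thing (2001)" show up as missing in the genres column instead of carrying a text label.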

In [156]:
import time

In [166]:
# first version of the function

def multiple_missing_values_for_time(dataframes_dict):
    '''
    receives dataframes_dict, a dictionary with the keys being the names of the dataframes
    and the values being the dataframes
    
    returns a list of dataframes with each showing the amount of missing values from each
    column of each original dataframe
    '''
    start_time = time.time() # checking speed
    missing_values_df_list = []
    
    for df_name, df in dataframes_dict.items():
        missing_values_df = {}
        
        for i, value in enumerate(df):
            df_missing_values = df.isnull().sum()
            missing_values_df[str(list(df_missing_values.index)[i])] = list(df_missing_values.values)[i]
                
        missing_values_df = pd.DataFrame(missing_values_df, index=['missing'])
        missing_values_df_list.append(missing_values_df)
        
    return 'Execution time: {0} seconds'.format(time.time() - start_time) # should actually return missing_values_df_list

# now vectorizing the inner for loop

def multiple_missing_values_time(dataframes_dict):
    '''
    receives dataframes_dict, a dictionary with the keys being the names of the dataframes
    and the values being the dataframes
    
    returns a list of dataframes with each showing the amount of missing values from each
    column of each original dataframe
    '''
    start_time = time.time() # checking speed
    missing_values_df_list = []
    
    for df_name, df in dataframes_dict.items():
        # computing and storing missing values
        df_missing_values = df.isnull().sum()
        
        # now an array with the columns names and values is created 
        missing_values_df = np.array((df_missing_values.index, df_missing_values.values))
        
        # finally, a dataframe is created out of this array and appended to the missing_values_list
        missing_values_df_list.append(pd.DataFrame([missing_values_df[1]], columns=missing_values_df[0], index=['missing']))
    
    return ('Execution time: {0} seconds'.format(time.time() - start_time)) # should actually return missing_values_df_list


print(multiple_missing_values_for_time(dataframes))
print(multiple_missing_values_time(dataframes))

Execution time: 1.0507614612579346 seconds
Execution time: 0.2858912944793701 seconds


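The second version is roughly 3.7 times as fast as the first (about 0.29 s versus 1.05 s). The gain most likely comes from computing df.isnull().sum() once per dataframe instead of once per column inside the loop, rather than from any parallelism in NumPy. That's the version we'll keep.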
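A single `time.time()` measurement can be noisy, so a comparison like this is easier to trust when each variant is run several times. The sketch below uses `timeit` on a synthetic dataframe; the data size, the column names, and the helper names `per_column` and `once` are assumptions made for illustration, not part of the notebook.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for one of the dataframes (size chosen arbitrarily)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((200_000, 4)), columns=list("abcd"))


def per_column():
    # Mirrors the first version: the full null count is recomputed for every column
    return {col: df.isnull().sum()[col] for col in df}


def once():
    # Mirrors the second version: one null count per dataframe
    return df.isnull().sum().to_dict()


# Run each variant five times and report the total time
t_loop = timeit.timeit(per_column, number=5)
t_once = timeit.timeit(once, number=5)
print(f"per-column recomputation: {t_loop:.3f} s, single pass: {t_once:.3f} s")
```

Both helpers return the same counts, so any difference in timing reflects only how often the null count is computed.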

In [168]:
def multiple_missing_values(dataframes_dict):
    '''
    receives dataframes_dict, a dictionary with the keys being the names of the dataframes
    and the values being the dataframes
    
    returns a list of dataframes with each showing the amount of missing values from each
    column of each original dataframe
    '''
    missing_values_df_list = []
    
    for df_name, df in dataframes_dict.items():
        # computing and storing missing values
        df_missing_values = df.isnull().sum()
        
        # now an array with the columns names and values is created 
        missing_values_df = np.array((df_missing_values.index, df_missing_values.values))
        
        # finally, a dataframe is created out of this array and appended to the missing_values_list
        missing_values_df_list.append(pd.DataFrame([missing_values_df[1]], columns=missing_values_df[0], index=['missing']))
    
    return missing_values_df_list


In [170]:
display(multiple_missing_values(dataframes)[0])

Unnamed: 0,movieId,title,genres
missing,0,0,0
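Indexing the result with `[0]` relies on the order in which the files were read. As a possible alternative, the sketch below returns a dict keyed by dataframe name and builds each one-row summary directly from the Series. `missing_values_summary` is a hypothetical name, not a function defined earlier in this notebook.

```python
import pandas as pd


def missing_values_summary(dataframes_dict):
    """Hypothetical variant: map each dataframe name to a one-row DataFrame of null counts."""
    return {
        # to_frame(name="missing") makes a single column; .T turns it into a single row
        name: df.isnull().sum().to_frame(name="missing").T
        for name, df in dataframes_dict.items()
    }
```

With this shape, the movies summary could be shown with `display(missing_values_summary(dataframes)['movies'])`, regardless of file order.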
