<h1>MovieLens 25M</h1>

<hr>
<h2>0. Introduction</h2>

<p>This dataset describes 5-star rating and free-text tagging activity from MovieLens (https://movielens.org), a movie recommendation service. The dataset is available for download at https://grouplens.org/datasets/movielens/25m/.</p>

<p>The version explored here is the MovieLens 25M Dataset, a stable benchmark version with 25 million ratings and one million tag applications applied to 62,000 movies by 162,000 users. It includes tag genome data with 15 million relevance scores across 1,129 tags. It was released in December 2019.</p>

<p>We could discover more information by just reading the dataset's README file, but we are not doing that since the purpose of this notebook is analyzing the exploration process.</p>

Steps:
<ol>
    <li>Reading Data</li>
    <li>Data Types Inference</li>
    <li>Missing Values Count and Treatment</li>
    <li>Statistical Analysis, Describe</li>
    <li>Numerical Univariate Analysis, Graphical</li>
    <li>Numerical Bivariate Analysis, Graphical</li>
    <li>Probable Outliers</li>
    <li>Correlation</li>
    <li>Dimensionality Reduction</li>
</ol>

<h2>1. Reading Data</h2>

<p>The first step in the exploration process is loading your data to be able to analyze it with your selected tool. Actually, one could argue the first step is ingesting or obtaining this data, but we are not concerned with this part here.</p>

<p>Assuming we already have our data available, let's start by loading it.</p>

In [24]:
# Viewing what is in the directory
!cd ml-25m; ls

genome-scores.csv  links.csv   ratings.csv
genome-tags.csv    movies.csv  tags.csv


Here, we can see that the data is in the ml-25m directory and consists of multiple CSV files. Let's load them all at once.

In [25]:
# Importing stuff
import os
import numpy as np
import pandas as pd

In [50]:
import time

def read_multiple_csv(filepath, df_names=None):

    '''
    Improved version of pd.read_csv, which can read multiple files in a directory
    If df_names == None, it iterates the whole directory.
    
    Receives the path and optionally a list of the dataframes names.
    Returns a dict of the names and dataframes.
    
    uses import os, pandas as pd
    the folder should have only csv files, no .txt
    '''
    start_time = time.time()
    df_list = []
    
    # Here the function uses the names of the dataframes to read the files
    if df_names is not None:
        df_names = np.char.array(df_names)
        filepath = np.full(df_names.shape, filepath)
        reader = filepath + df_names + '.csv'
        
        # reading every file and adding the dataframes one by one, not as a nested list
        temp_df = [pd.read_csv(i) for i in reader]
        df_list.extend(temp_df)
    
    # If names are not given, the function just reads all data in folder
    else:
        df_names = []
        path, dirs, files = next(os.walk(filepath))
        file_count = len(files)
        for i in range(file_count):
            temp_df = pd.read_csv(filepath + files[i])
            df_list.append(temp_df)
            df_names.append(files[i][:-4])
    
    df_dict = dict(zip(df_names, df_list))
    
    return 'Execution time: {0} seconds'.format(time.time() - start_time)

print(read_multiple_csv('./ml-25m/', ['genome-scores', 'genome-tags', 'links', 'movies', 'ratings', 'tags']
))

dataframes_names = ['genome-scores', 'genome-tags', 'links', 'movies', 'ratings', 'tags']

def read_multiple_csv(filepath, df_names=None):

    '''
    Improved version of pd.read_csv, which can read multiple files in a directory
    If df_names == None, it iterates the whole directory.
    
    Receives the path and optionally a list of the dataframes names.
    Returns a dict of the names and dataframes.
    
    uses import os, pandas as pd
    the folder should have only csv files, no .txt
    '''
    start_time = time.time()
    df_list = []
    
    # Here the function uses the names of the dataframes to read the files
    if df_names is not None:
        for i in range(len(df_names)):
            temp_df = pd.read_csv(filepath + df_names[i] + '.csv')
            df_list.append(temp_df)
    
    # If names are not given, the function just reads all data in folder
    else:
        df_names = []
        path, dirs, files = next(os.walk(filepath))
        file_count = len(files)
        for i in range(file_count):
            temp_df = pd.read_csv(filepath + files[i])
            df_list.append(temp_df)
            df_names.append(files[i][:-4])
    
    df_dict = dict(zip(df_names, df_list))
    
    return 'Execution time old: {0} seconds'.format(time.time() - start_time)

# note: I'll probably change this to return a list of dataframes and define an attribute dataframe.name

print(read_multiple_csv('./ml-25m/', ['genome-scores', 'genome-tags', 'links', 'movies', 'ratings', 'tags']
))

Execution time: 7.290629148483276 seconds
Execution time old: 6.957117795944214 seconds


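I tried to optimize the function by building the file paths with vectorized numpy operations, but the vectorized version was no faster (7.29 s versus 6.96 s for the plain loop), most likely because nearly all of the time is spent inside pd.read_csv itself. It is also harder to read, so I'll keep the loop version and come back to this later.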
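A single time.time() measurement of a disk-bound operation is noisy, so a difference of a few tenths of a second may just be run-to-run variation. The sketch below shows one way to get a steadier number with the standard library's timeit module. The helper name best_time is an assumption made for this illustration, and the workload is a toy stand-in for the two read_multiple_csv variants.

```python
import timeit


def best_time(func, *args, repeat=5, **kwargs):
    """Return the fastest wall-clock time, in seconds, over `repeat` calls.

    best_time is a hypothetical helper, not part of the original notebook.
    """
    timer = timeit.Timer(lambda: func(*args, **kwargs))
    # each of the `repeat` measurements runs the call exactly once
    return min(timer.repeat(repeat=repeat, number=1))


# Toy workload standing in for a call such as read_multiple_csv('./ml-25m/')
elapsed = best_time(sum, range(100_000))
print(f'best of 5: {elapsed:.6f} seconds')
```

Taking the minimum over several runs reduces the influence of disk caching and background activity, which tend to only make individual runs slower.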

In [27]:
dataframes_names = ['genome-scores', 'genome-tags', 'links', 'movies', 'ratings', 'tags']

def read_multiple_csv(filepath, df_names=None):

    '''
    Improved version of pd.read_csv, which can read multiple files in a directory
    If df_names == None, it iterates the whole directory.
    
    Receives the path and optionally a list of the dataframes names.
    Returns a dict of the names and dataframes.
    
    uses import os, pandas as pd
    the folder should have only csv files, no .txt
    '''
    df_list = []
    
    # Here the function uses the names of the dataframes to read the files
    if df_names is not None:
        for i in range(len(df_names)):
            temp_df = pd.read_csv(filepath + df_names[i] + '.csv')
            df_list.append(temp_df)
    
    # If names are not given, the function just reads all data in folder
    else:
        df_names = []
        path, dirs, files = next(os.walk(filepath))
        file_count = len(files)
        for i in range(file_count):
            temp_df = pd.read_csv(filepath + files[i])
            df_list.append(temp_df)
            df_names.append(files[i][:-4])
    
    df_dict = dict(zip(df_names, df_list))
    
    return df_dict

In [28]:
dataframes = read_multiple_csv('./ml-25m/')
print(dataframes.keys())

dict_keys(['movies', 'genome-tags', 'genome-scores', 'tags', 'links', 'ratings'])
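The docstring above warns that the folder should contain only CSV files. As a sketch of one way to lift that restriction, the variant below uses pathlib to select only the '*.csv' files and takes each dictionary key from the file stem. The function name read_csv_dir and the throwaway demo directory are assumptions made for this illustration; the demo uses toy data, not the MovieLens files.

```python
from pathlib import Path
import tempfile

import pandas as pd


def read_csv_dir(dirpath, names=None):
    """Read CSV files from dirpath into a dict keyed by file stem.

    names: optional iterable of file stems (without '.csv'); when omitted,
    every '*.csv' file in the directory is read and other files are ignored.
    """
    dirpath = Path(dirpath)
    if names is None:
        paths = sorted(dirpath.glob('*.csv'))
    else:
        paths = [dirpath / f'{name}.csv' for name in names]
    return {path.stem: pd.read_csv(path) for path in paths}


# Self-contained demo on a temporary directory with toy data
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({'movieId': [1, 2]}).to_csv(Path(tmp) / 'movies.csv', index=False)
    pd.DataFrame({'tagId': [1]}).to_csv(Path(tmp) / 'genome-tags.csv', index=False)
    (Path(tmp) / 'README.txt').write_text('not a csv')
    demo = read_csv_dir(tmp)

print(list(demo))
```

Sorting the paths also makes the key order reproducible, whereas the order returned by os.walk depends on the file system.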


In [29]:
list(dataframes.values())[0]

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
62418,209157,We (2018),Drama
62419,209159,Window of the Soul (2001),Documentary
62420,209163,Bad Poems (2018),Comedy|Drama
62421,209169,A Girl Thing (2001),(no genres listed)


Let's define a function that displays multiple dataframes side by side. It's going to be very useful later.

In [40]:
from IPython.display import HTML

def multi_table(table_list):
    ''' Accepts a list of objects that implement _repr_html_ (such as dataframes) and returns
    a table which contains each one in a cell
    '''
    return HTML(
        ''.join(
            ('<table><tr style="background-color:white;">', 
        ''.join(['<td>' + table._repr_html_() + '</td>' for table in table_list]),
        '</tr></table>')
        )
    )

multi_table(dataframes.values())

# source: https://github.com/epmoyer/ipy_table/issues/24

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
62418,209157,We (2018),Drama
62419,209159,Window of the Soul (2001),Documentary
62420,209163,Bad Poems (2018),Comedy|Drama
62421,209169,A Girl Thing (2001),(no genres listed)

Unnamed: 0,tagId,tag
0,1,007
1,2,007 (series)
2,3,18th century
3,4,1920s
4,5,1930s
...,...,...
1123,1124,writing
1124,1125,wuxia
1125,1126,wwii
1126,1127,zombie

Unnamed: 0,movieId,tagId,relevance
0,1,1,0.02875
1,1,2,0.02375
2,1,3,0.06250
3,1,4,0.07575
4,1,5,0.14075
...,...,...,...
15584443,206499,1124,0.11000
15584444,206499,1125,0.04850
15584445,206499,1126,0.01325
15584446,206499,1127,0.14025

Unnamed: 0,userId,movieId,tag,timestamp
0,3,260,classic,1439472355
1,3,260,sci-fi,1439472256
2,4,1732,dark comedy,1573943598
3,4,1732,great dialogue,1573943604
4,4,7569,so bad it's good,1573943455
...,...,...,...,...
1093355,162521,66934,Neil Patrick Harris,1427311611
1093356,162521,103341,cornetto trilogy,1427311259
1093357,162534,189169,comedy,1527518175
1093358,162534,189169,disabled,1527518181

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0
...,...,...,...
62418,209157,6671244,499546.0
62419,209159,297986,63407.0
62420,209163,6755366,553036.0
62421,209169,249603,162892.0

Unnamed: 0,userId,movieId,rating,timestamp
0,1,296,5.0,1147880044
1,1,306,3.5,1147868817
2,1,307,5.0,1147868828
3,1,665,5.0,1147878820
4,1,899,3.5,1147868510
...,...,...,...,...
25000090,162541,50872,4.5,1240953372
25000091,162541,55768,2.5,1240951998
25000092,162541,56176,2.0,1240950697
25000093,162541,58559,4.0,1240953434


<h2>2. Dataset Description</h2>

<p>Now we can actually work with our dataframes. Our first step will be checking the types of each column in each dataframe.</p>

In [41]:
def dfdict_dtypes(dataframes_dict):
    '''
    receives dataframes_dict, a dictionary with the keys being the names of the dataframes
    and the values being the dataframes
    
    returns a list of dataframes with each showing the types of each column of each original 
    dataframe
    '''
    df_types_list = []
    
    for df_name, df in dataframes_dict.items():
        # checking the types of the columns
        df_types = df.dtypes
        
        # now an array with the columns names and values is created 
        types_df = np.array((df_types.index, df_types.values))
        
        # finally, a dataframe is created out of this array and appended to the df_types_list
        df_types_list.append(pd.DataFrame([types_df[1]], columns=types_df[0], index=['type']))
    
    return df_types_list

dfs_types = dfdict_dtypes(dataframes)

for df_name, df in dataframes.items():
    print(df_name)
    print(df.dtypes, end='\n\n')

movies
movieId     int64
title      object
genres     object
dtype: object

genome-tags
tagId     int64
tag      object
dtype: object

genome-scores
movieId        int64
tagId          int64
relevance    float64
dtype: object

tags
userId        int64
movieId       int64
tag          object
timestamp     int64
dtype: object

links
movieId      int64
imdbId       int64
tmdbId     float64
dtype: object

ratings
userId         int64
movieId        int64
rating       float64
timestamp      int64
dtype: object



<p>Now we'll count the missing values in each column of each dataframe, so that we can decide how to treat them.</p>

In [42]:
import time

In [43]:
# first version of the function

def multiple_missing_values_for_time(dataframes_dict):
    '''
    receives dataframes_dict, a dictionary with the keys being the names of the dataframes
    and the values being the dataframes
    
    returns a list of dataframes with each showing the amount of missing values from each
    column of each original dataframe
    '''
    start_time = time.time() # checking speed
    missing_values_df_list = []
    
    for df_name, df in dataframes_dict.items():
        missing_values_df = {}
        
        for i, value in enumerate(df):
            df_missing_values = df.isnull().sum()
            missing_values_df[str(list(df_missing_values.index)[i])] = list(df_missing_values.values)[i]
                
        missing_values_df = pd.DataFrame(missing_values_df, index=['missing'])
        missing_values_df_list.append(missing_values_df)
        
    return 'Execution time: {0} seconds'.format(time.time() - start_time) # should actually return missing_values_df_list

# now vectorizing the inner for loop

def multiple_missing_values_time(dataframes_dict):
    '''
    receives dataframes_dict, a dictionary with the keys being the names of the dataframes
    and the values being the dataframes
    
    returns a list of dataframes with each showing the amount of missing values from each
    column of each original dataframe
    '''
    start_time = time.time() # checking speed
    missing_values_df_list = []
    
    for df_name, df in dataframes_dict.items():
        # computing and storing missing values
        df_missing_values = df.isnull().sum()
        
        # now an array with the columns names and values is created 
        missing_values_df = np.array((df_missing_values.index, df_missing_values.values))
        
        # finally, a dataframe is created out of this array and appended to the missing_values_list
        missing_values_df_list.append(pd.DataFrame([missing_values_df[1]], columns=missing_values_df[0], index=['missing']))
    

    return 'Execution time: {0} seconds'.format(time.time() - start_time) # should actually return missing_values_df_list


print(multiple_missing_values_for_time(dataframes))
print(multiple_missing_values_time(dataframes))

Execution time: 1.05549955368042 seconds
Execution time: 0.2696866989135742 seconds


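The second version is about 4 times as fast as the first one (0.27 s versus 1.06 s). The gain does not come from parallel computing: the first version recomputes df.isnull().sum() for the whole dataframe once per column inside the inner loop, while the second computes it only once per dataframe. That's the one we'll keep.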
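Since isnull().sum() already returns one count per column, the per-dataframe results can also be collected into a single table with one row per dataframe. The sketch below illustrates this on toy data; the names toy and missing_table are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the dict returned by read_multiple_csv
toy = {
    'tags': pd.DataFrame({'userId': [3, 4], 'tag': ['classic', np.nan]}),
    'links': pd.DataFrame({'movieId': [1, 2], 'tmdbId': [np.nan, np.nan]}),
}

# isnull().sum() is computed once per dataframe; the dict of Series becomes
# one column per dataframe, and .T turns those into one row per dataframe
missing_table = pd.DataFrame(
    {name: df.isnull().sum() for name, df in toy.items()}
).T
print(missing_table)
```

Cells for columns that a dataframe does not have come out as NaN, which keeps them distinguishable from a genuine count of zero.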

In [44]:
def multiple_missing_values(dataframes_dict):
    '''
    receives dataframes_dict, a dictionary with the keys being the names of the dataframes
    and the values being the dataframes
    
    returns a list of dataframes with each showing the amount of missing values from each
    column of each original dataframe
    '''
    missing_values_df_list = []
    
    for df_name, df in dataframes_dict.items():
        # computing and storing missing values
        df_missing_values = df.isnull().sum()
        
        # now an array with the columns names and values is created 
        missing_values_df = np.array((df_missing_values.index, df_missing_values.values))
        
        # finally, a dataframe is created out of this array and appended to the missing_values_list
        missing_values_df_list.append(pd.DataFrame([missing_values_df[1]], columns=missing_values_df[0], index=['missing']))
    
    return missing_values_df_list


In [45]:
display(multiple_missing_values(dataframes)[0])

Unnamed: 0,movieId,title,genres
missing,0,0,0


<p>This version is fast enough for now, so we won't optimize it any further.</p>
<p>Now let's display the result for all dataframes.</p>

In [47]:
df_list = multiple_missing_values(dataframes)
multi_table(df_list)

# source: https://github.com/epmoyer/ipy_table/issues/24

Unnamed: 0,movieId,title,genres
missing,0,0,0

Unnamed: 0,tagId,tag
missing,0,0

Unnamed: 0,movieId,tagId,relevance
missing,0,0,0

Unnamed: 0,userId,movieId,tag,timestamp
missing,0,0,16,0

Unnamed: 0,movieId,imdbId,tmdbId
missing,0,0,107

Unnamed: 0,userId,movieId,rating,timestamp
missing,0,0,0,0
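The counts above show missing values only in the tag column of tags (16) and the tmdbId column of links (107). One possible treatment is sketched below on toy data: rows without a tag text are dropped, and tmdbId is cast to pandas' nullable Int64 type so the identifiers stop being floats while the gaps are kept. Whether dropping rows is appropriate depends on the later analysis, so this is an assumption rather than a recommendation, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Toy rows shaped like tags.csv and links.csv, with one gap each
tags = pd.DataFrame({
    'userId': [3, 4, 4],
    'movieId': [260, 1732, 7569],
    'tag': ['classic', np.nan, 'dark comedy'],
})
links = pd.DataFrame({
    'movieId': [1, 2, 3],
    'imdbId': [114709, 113497, 113228],
    'tmdbId': [862.0, np.nan, 15602.0],
})

# a tag application without its text carries no information, so drop those rows
tags_clean = tags.dropna(subset=['tag'])

# keep every link, but store tmdbId as a nullable integer instead of float64
links_clean = links.astype({'tmdbId': 'Int64'})

print(len(tags_clean), links_clean['tmdbId'].dtype)
```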


In [49]:
# displaying columns types
multi_table(dfs_types)

Unnamed: 0,movieId,title,genres
type,int64,object,object

Unnamed: 0,tagId,tag
type,int64,object

Unnamed: 0,movieId,tagId,relevance
type,int64,int64,float64

Unnamed: 0,userId,movieId,tag,timestamp
type,int64,int64,object,int64

Unnamed: 0,movieId,imdbId,tmdbId
type,int64,int64,float64

Unnamed: 0,userId,movieId,rating,timestamp
type,int64,int64,float64,int64


Next comes the statistical analysis: a summary of every column of every dataframe, computed with describe.

In [69]:
def dflist_describe(dataframes_dict):
    '''
    receives dataframes_dict, a dictionary with the keys being the names of the dataframes
    and the values being the dataframes
    
    returns a list of dataframes with each showing the descriptive statistics of each
    column of each original dataframe
    '''
    df_descriptions_list = []
    
    for df_name, df in dataframes_dict.items():
        # computing the descriptive statistics of every column, numeric or not
        df_descriptions = df.describe(include='all')
        
        # appending the resulting dataframe to the df_descriptions_list
        df_descriptions_list.append(df_descriptions)
    
    return df_descriptions_list

# note: try the next one, source https://stackoverflow.com/questions/38783027/jupyter-notebook-display-two-pandas-tables-side-by-side

from IPython.display import display_html

def mydisplay(dfs, names=None, max_tables=6):
    '''
    displays the dataframes in dfs side by side, max_tables per row,
    with the matching entry of names as a header above each one
    '''
    if not names:
        names = list(range(len(dfs)))

    html_rows = ''
    html_th = ''
    html_td = ''
    count = 0

    for df, name in zip(dfs, names):
        # the current row is full, so it is flushed and a new one is started
        if count == max_tables:
            html_rows += f'<tr>{html_th}</tr><tr>{html_td}</tr>'
            html_th = ''
            html_td = ''
            count = 0

        html_th += f'<th style="text-align:center">{name}</th>'
        html_td += f'<td style="vertical-align:top"> {df.to_html(index=False)}</td>'
        count += 1

    # flushing the last, possibly partial, row
    if count != 0:
        html_rows += f'<tr>{html_th}</tr><tr>{html_td}</tr>'

    # wrapping the rows only once, then making every table inline
    html_str = f'<table>{html_rows}</table>'
    html_str = html_str.replace('<table', '<table style="display:inline"')
    display_html(html_str, raw=True)

In [71]:
multi_table(dflist_describe(dataframes)) # I want to show the titles of the dataframes above them, but it's better if I do it after changing dict to list

Unnamed: 0,movieId,title,genres
count,62423.0,62423,62423
unique,,62325,1639
top,,I See You (2019),Drama
freq,,2,9056
mean,122220.387646,,
std,63264.744844,,
min,1.0,,
25%,82146.5,,
50%,138022.0,,
75%,173222.0,,

Unnamed: 0,tagId,tag
count,1128.0,1128
unique,,1128
top,,infidelity
freq,,1
mean,564.5,
std,325.769857,
min,1.0,
25%,282.75,
50%,564.5,
75%,846.25,

Unnamed: 0,movieId,tagId,relevance
count,15584450.0,15584450.0,15584450.0
mean,46022.49,564.5,0.1163679
std,55352.21,325.6254,0.1544722
min,1.0,1.0,0.00025
25%,3853.75,282.75,0.024
50%,8575.5,564.5,0.0565
75%,80186.5,846.25,0.14075
max,206499.0,1128.0,1.0

Unnamed: 0,userId,movieId,tag,timestamp
count,1093360.0,1093360.0,1093344,1093360.0
unique,,,73050,
top,,,sci-fi,
freq,,,8330,
mean,67590.22,58492.76,,1430115000.0
std,51521.14,59687.31,,117738400.0
min,3.0,1.0,,1135429000.0
25%,15204.0,3504.0,,1339262000.0
50%,62199.0,45940.0,,1468929000.0
75%,113642.0,102903.0,,1527402000.0

Unnamed: 0,movieId,imdbId,tmdbId
count,62423.0,62423.0,62316.0
mean,122220.387646,1456706.0,155186.689999
std,63264.744844,2098007.0,153362.6947
min,1.0,1.0,2.0
25%,82146.5,81686.5,36768.75
50%,138022.0,325805.0,86750.5
75%,173222.0,2063724.0,255255.25
max,209171.0,11170940.0,646282.0

Unnamed: 0,userId,movieId,rating,timestamp
count,25000100.0,25000100.0,25000100.0,25000100.0
mean,81189.28,21387.98,3.533854,1215601000.0
std,46791.72,39198.86,1.060744,226875800.0
min,1.0,1.0,0.5,789652000.0
25%,40510.0,1196.0,3.0,1011747000.0
50%,80914.0,2947.0,3.5,1198868000.0
75%,121557.0,8623.0,4.0,1447205000.0
max,162541.0,209171.0,5.0,1574328000.0
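The mean and quartiles of the timestamp columns are hard to read as raw integers. Assuming the values are Unix epoch seconds, which their magnitude suggests, they can be converted to datetimes with pd.to_datetime before being described. The sketch below uses the rounded minimum and maximum from the ratings table above as toy input; the column name rated_at is an assumption made for this illustration.

```python
import pandas as pd

# Toy frame using the rounded min and max timestamps shown in the table above
ratings = pd.DataFrame({
    'userId': [1, 162541],
    'rating': [5.0, 4.0],
    'timestamp': [789652000, 1574328000],
})

# interpreting the integers as seconds since 1970-01-01 (an assumption)
ratings['rated_at'] = pd.to_datetime(ratings['timestamp'], unit='s')

print(ratings['rated_at'].dt.year.tolist())
```

Under that assumption the ratings would span 1995 to 2019, which is consistent with the December 2019 release of the dataset.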
