<h1>MovieLens 25M</h1>

<hr>
<h2>0. Introduction</h2>

<p>This dataset describes 5-star rating and free-text tagging activity from MovieLens (https://movielens.org), a movie recommendation service. The dataset is available for download at https://grouplens.org/datasets/movielens/25m/.</p>

<p>The version explored here is the MovieLens 25M Dataset, a stable benchmark version with 25 million ratings and one million tag applications applied to 62,000 movies by 162,000 users. It includes tag genome data with 15 million relevance scores across 1,129 tags, and was released in 12/2019.</p>

<p>We could learn more by simply reading the dataset's README file, but we will not do that here, since the purpose of this notebook is to analyze the exploration process itself.</p>

<h2>1. Reading Data</h2>

<p>The first step in the exploration process is loading the data into the tool chosen for the analysis. One could argue that the first step is actually ingesting or obtaining the data, but that part is out of scope here.</p>

<p>Assuming we already have our data available, let's start by loading it.</p>

In [1]:
# Viewing what is in the directory
!cd ml-25m; ls

genome-scores.csv  links.csv   ratings.csv
genome-tags.csv    movies.csv  tags.csv


Here, we can see that the data is in the ml-25m directory and is composed of multiple CSV files. Let's load them all at once.
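
The same listing can also be produced from Python, which avoids depending on shell commands such as ls. The following is a minimal sketch; list_csv_files is a hypothetical helper name introduced here for illustration and is not part of the notebook's own code.

```python
from pathlib import Path

def list_csv_files(dirpath):
    """Return the sorted names of the .csv files directly inside dirpath."""
    # glob('*.csv') only matches files in dirpath itself, not in subfolders
    return sorted(p.name for p in Path(dirpath).glob('*.csv'))
```

Assuming the directory is unchanged, list_csv_files('ml-25m') should return the six file names shown in the listing above.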

In [2]:
# Importing stuff
import os
import numpy as np
import pandas as pd

In [19]:
dataframes_names = ['genome-scores', 'genome-tags', 'links', 'movies', 'ratings', 'tags']

def read_multiple_csv(filepath, df_names=None, concat=False):

    '''
    Wrapper around pd.read_csv that reads multiple files in a directory.
    If df_names is None, it reads every file in the directory.
    
    Receives the directory path and, optionally, a list of dataframe names
    (file names without the .csv extension).
    Returns a list of dataframes.
    The concat option is meant to join all dataframes into one (not implemented).
    
    Requires: import os, pandas as pd
    The folder should contain only csv files (no .txt, for example).
    '''
    df_list = []
    
    # If names are given, build each file path from the directory and the name
    if df_names is not None:
        for name in df_names:
            temp_df = pd.read_csv(os.path.join(filepath, name + '.csv'))
            df_list.append(temp_df)
    
    # If names are not given, read every file in the folder.
    # Note: os.walk lists files in arbitrary order, so list positions are not guaranteed
    else:
        path, dirs, files = next(os.walk(filepath))
        for file_name in files:
            temp_df = pd.read_csv(os.path.join(path, file_name))
            df_list.append(temp_df)
    
    # *Not Implemented* If selected, would join the csv files into a larger dataframe
    if concat:
        #df_list = pd.concat(df_list)
        raise NotImplementedError
        
    return df_list

In [20]:
dataframes = read_multiple_csv('./ml-25m/')

In [27]:
dataframes[0]

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
62418,209157,We (2018),Drama
62419,209159,Window of the Soul (2001),Documentary
62420,209163,Bad Poems (2018),Comedy|Drama
62421,209169,A Girl Thing (2001),(no genres listed)


In [28]:
dataframes[1]

Unnamed: 0,tagId,tag
0,1,007
1,2,007 (series)
2,3,18th century
3,4,1920s
4,5,1930s
...,...,...
1123,1124,writing
1124,1125,wuxia
1125,1126,wwii
1126,1127,zombie
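
The two outputs above show movies at position 0 and genome-tags at position 1, which is not the order of the directory listing. When files are read by walking the directory, the order is arbitrary, so list positions can differ between machines. One option is to key the dataframes by file name instead. The following is a minimal sketch; read_csv_dir is a hypothetical helper introduced here for illustration, not the function used in this notebook.

```python
from pathlib import Path

import pandas as pd

def read_csv_dir(dirpath):
    """Read every .csv file in dirpath into a dict keyed by file name (without extension)."""
    tables = {}
    # Sorting makes the iteration order reproducible across machines
    for csv_path in sorted(Path(dirpath).glob('*.csv')):
        # csv_path.stem is the file name without '.csv', e.g. 'genome-tags'
        tables[csv_path.stem] = pd.read_csv(csv_path)
    return tables
```

With this approach, tables = read_csv_dir('./ml-25m/') followed by tables['movies'] would select the movies dataframe by name, regardless of the order in which the files were read.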


<h2>2. Dataset Description</h2>

<p>Now we can actually work with our datasets.</p>