## The Movies Dataset
One of the most common datasets available on Kaggle for building a recommender system is [The Movies Dataset](https://www.kaggle.com/rounakbanik/the-movies-dataset/data).

### Context

These files contain metadata for all 45,000 movies listed in the Full MovieLens Dataset. The dataset consists of movies released on or before July 2017. Data points include cast, crew, plot keywords, budget, revenue, posters, release dates, languages, production companies, countries, TMDB vote counts and vote averages.

This dataset also has files containing 26 million ratings from 270,000 users for all 45,000 movies. Ratings are on a scale of 1-5 and have been obtained from the official GroupLens website.

### Content

This dataset consists of the following files:

movies_metadata.csv: The main Movies Metadata file contains information on 45,000 movies featured in the Full MovieLens dataset. Features include posters, backdrops, budget, revenue, release dates, languages, production countries and companies.

keywords.csv: Contains the movie plot keywords for our MovieLens movies. Available in the form of a stringified JSON Object.

credits.csv: Consists of Cast and Crew Information for all our movies. Available in the form of a stringified JSON Object.

links.csv: The file that contains the TMDB and IMDB IDs of all the movies featured in the Full MovieLens dataset.

links_small.csv: Contains the TMDB and IMDB IDs of a small subset of 9,000 movies of the Full Dataset.

ratings_small.csv: The subset of 100,000 ratings from 700 users on 9,000 movies.

The Full MovieLens Dataset, consisting of 26 million ratings and 750,000 tag applications from 270,000 users on all 45,000 movies in this dataset, is available from the official GroupLens website.
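
The "stringified JSON" columns in `keywords.csv` and `credits.csv` use single quotes, so they are Python literals rather than strict JSON. A minimal sketch of parsing such a cell with `ast.literal_eval` follows. The sample value is made up for illustration and the helper name `parse_names` is hypothetical.

```python
import ast

def parse_names(cell):
    """Return the 'name' of each entry in a stringified list of dicts."""
    entries = ast.literal_eval(cell)
    return [entry['name'] for entry in entries if entry.get('name')]

# Illustrative value only, shaped like a cell of the keywords column
sample_cell = "[{'id': 1, 'name': 'friendship'}, {'id': 2, 'name': 'toy'}]"
keyword_names = parse_names(sample_cell)
print(keyword_names)
```

`ast.literal_eval` only evaluates literals (lists, dicts, strings, numbers), which makes it a safer choice than `eval` for data read from a file.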

Of these, we use three files as input:
[movies_metadata.csv](https://www.kaggle.com/rounakbanik/the-movies-dataset?select=movies_metadata.csv), 
[ratings_small.csv](https://www.kaggle.com/rounakbanik/the-movies-dataset?select=ratings_small.csv) and 
[keywords.csv](https://www.kaggle.com/rounakbanik/the-movies-dataset?select=keywords.csv). 

To make the data easier to work with, we converted it into three smaller CSV files: movies_cbr_small.csv, movie_titles.csv and ratings_scale_small.csv.
The process can be viewed in our notebook below:
[momoMoviesDataProcessing](https://github.com/mohandasnj/momomovies/blob/master/momoMoviesDataProcessing.ipynb).

In [1]:
import numpy as np
import pandas as pd
import ast
import json
from datetime import datetime

In [6]:
# Define file directories
MOVIES_DATA_DIR = './data/'
KAGGLE_DATA_DIR = './kaggledata/'
KAGGLE_MOVIES_CSV_FILE =  KAGGLE_DATA_DIR + 'movies_metadata.csv'
KAGGLE_RATINGS_CSV_FILE = KAGGLE_DATA_DIR + 'ratings_small.csv'
KAGGLE_KEYWORDS_CSV_FILE = KAGGLE_DATA_DIR + 'keywords.csv'
RATING_SCALE_CSV_FILE = MOVIES_DATA_DIR + 'ratings_scale_small.csv'
MOVIES_CBR_CSV_FILE = MOVIES_DATA_DIR + 'movies_cbr_small.csv'
MOVIE_TITLE_CSV_FILE = MOVIES_DATA_DIR + 'movie_titles.csv'
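
`DataFrame.to_csv` does not create missing directories, so the output directory has to exist before the CSV files are written. A small sketch of that step, assuming the same `MOVIES_DATA_DIR` value as in the cell above:

```python
import os

# Assumed to match the MOVIES_DATA_DIR constant defined above
MOVIES_DATA_DIR = './data/'

# exist_ok=True makes the call a no-op when the directory is already there
os.makedirs(MOVIES_DATA_DIR, exist_ok=True)
```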


## Data Preparation
Let's load this data into Python. We use pandas to build three DataFrames: **movies_cbr_small**, **movie_titles**, and **ratings_scale_small**.
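
Once the CSV files exist, they can be read back instead of being rebuilt. The sketch below is an illustration rather than part of the original notebook. It uses a tiny stand-in frame and an in-memory buffer to show two details of the round trip: the functions in this section call `to_csv` without `index=False`, so the index is written as the first column, and list-valued columns such as `genres_list` come back as strings.

```python
import ast
import io
import pandas as pd

# Tiny stand-in frame with the same kind of columns as movies_cbr_small
original = pd.DataFrame({
    'title': ['Example Movie ( 2000 )'],
    'genres': ['Drama|Comedy'],
    'genres_list': [['Drama', 'Comedy']],
})

# Write and re-read through an in-memory buffer instead of a file on disk
buffer = io.StringIO()
original.to_csv(buffer)
buffer.seek(0)

# index_col=0 restores the index that to_csv wrote as the first column
reloaded = pd.read_csv(buffer, index_col=0)

# List-valued columns come back as strings such as "['Drama', 'Comedy']"
reloaded['genres_list'] = reloaded['genres_list'].apply(ast.literal_eval)
```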

In [7]:
#Load movies_metadata.csv and keywords.csv to create movies_cbr_small.csv and 
#create movie_titles.csv using movies_cbr_small.csv 
def create_movies_cbr_small(movies_file_path, movie_keywords_file_path, iscreatecsv=False):
    #Load movies & create movies dataframe
    moviesdf = pd.read_csv(movies_file_path, dtype='unicode')
    #Load movie keywords & create keywords dataframe
    moviekeywordsdf = pd.read_csv(movie_keywords_file_path, dtype='unicode')
    
    #convert moviekeywordsdf["id"] datatype from object to int (int64)
    moviekeywordsdf["id"] = moviekeywordsdf["id"].astype(str).astype('int64')
    moviekeywordsdf["keywords"] = moviekeywordsdf["keywords"].astype(str)
    
    #convert each item of release_date to datetime.date type entity
    moviesdf['release_date'] = pd.to_datetime(moviesdf['release_date'], errors='coerce').apply(lambda x: x.date())
    
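    #drop three rows by position; each position is relative to the frame after the previous drop.
    #These are presumably the malformed rows whose id cannot be converted to int64 below
    moviesdf.drop(moviesdf.index[19730],inplace=True)
    moviesdf.drop(moviesdf.index[29502],inplace=True)
    moviesdf.drop(moviesdf.index[35585],inplace=True)
    moviesdf.reset_index(drop=True, inplace=True)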

    #convert moviesdf["id"] datatype from object to int (int64)
    moviesdf["id"] = moviesdf["id"].astype(str).astype('int64')
    
    movies_with_keywords_df = pd.merge(moviesdf,moviekeywordsdf,left_on='id',right_on='id',how='inner')
    
    # all json columns
    json_columns = ['belongs_to_collection', 'genres', 'production_companies', 'production_countries', 'spoken_languages', 'keywords']
    for column in json_columns:
        # use ast because json data has single quotes in the csv, which is invalid for a json object; it should be " normally
        movies_with_keywords_df[column] = movies_with_keywords_df[column].apply(lambda x: np.nan if pd.isnull(x) else ast.literal_eval(x))
    
    
    movies_cbr_columns = [
        'title',
        'genres',
        'genres_list',
        'release_date',
        'budget',
        'revenue',
        'tmdbid',
        'imdbid',
        'runtime',
        'vote_average',
        'vote_count',
        'keywords',
        'keywords_list'
    ]
    # collect one dict per movie and build the dataframe once after the loop;
    # DataFrame.append was removed in pandas 2.0 and is slow when called per row
    movies_cbr_rows = []

    for i,movie_row in movies_with_keywords_df.iterrows():
        release_date_year = ''
        try:
            if str( movie_row['release_date'] ) == 'NaT':
                continue
            else:
                release_date_year = ' ( ' + str(movie_row.dropna()['release_date'].year) + ' )'
                release_date = movie_row.dropna()['release_date']
        except TypeError:
            continue
           
        # skip movies with a missing title, budget or revenue instead of reusing the previous row's values
        if pd.isnull(movie_row['title']) or pd.isnull(movie_row['revenue']) or pd.isnull(movie_row['budget']):
            continue
        movie_title_new = movie_row['title'] + release_date_year
        budget = movie_row['budget']
        revenue = movie_row['revenue']
        tmdbid = int(movie_row['id'])
        imdbid = movie_row['imdb_id']
        runtime = movie_row['runtime']
        vote_average = movie_row['vote_average']
        vote_count = movie_row['vote_count']
        if movie_row['genres'] is not np.nan and movie_row['keywords'] is not np.nan:
            # keep the non-empty names both as a list and as a '|'-separated string
            movie_row_genres_list = [one_genre['name'] for one_genre in movie_row['genres'] if one_genre['name']]
            movie_row_genres_str = '|'.join(movie_row_genres_list)

            movie_row_keywords_list = [one_keyword['name'] for one_keyword in movie_row['keywords'] if one_keyword['name']]
            movie_row_keywords_str = '|'.join(movie_row_keywords_list)

            movies_cbr_rows.append({'title':movie_title_new,
                                    'genres':movie_row_genres_str,
                                    'genres_list': movie_row_genres_list,
                                    'release_date' : release_date,
                                    'budget':budget,
                                    'revenue':revenue,
                                    'tmdbid':tmdbid,
                                    'imdbid':imdbid,
                                    'runtime':runtime,
                                    'vote_average':vote_average,
                                    'vote_count':vote_count,
                                    'keywords': movie_row_keywords_str,
                                    'keywords_list': movie_row_keywords_list
                                   })

    movies_cbr = pd.DataFrame(movies_cbr_rows, columns=movies_cbr_columns)
        
    
    movies_cbr = movies_cbr.drop_duplicates(subset='title', keep="first")
    movies_cbr['genres'] = movies_cbr['genres'].fillna("").astype('str')
    movies_cbr['release_date'] = pd.to_datetime(movies_cbr['release_date'], errors='coerce').apply(lambda x: x.date())
    movies_cbr["tmdbid"] = movies_cbr["tmdbid"].astype(str).astype('int64')
    # IMDb identifiers carry a 'tt' prefix (e.g. 'tt0114709'), so keep them as text rather than casting to int
    movies_cbr["imdbid"] = movies_cbr["imdbid"].astype(str)
    movies_cbr["budget"] = movies_cbr["budget"].astype(str).astype('int64')
    movies_cbr["revenue"] = movies_cbr["revenue"].astype(str).astype('int64')
    movies_cbr["runtime"] = movies_cbr["runtime"].astype(str).astype(float)
    movies_cbr["vote_average"] = movies_cbr["vote_average"].astype(str).astype(float)
    movies_cbr["vote_count"] = movies_cbr["vote_count"].astype(str).astype('int64')
    movies_cbr_small = movies_cbr.loc[(movies_cbr.budget > 0) & (movies_cbr.revenue > 0),:]
    
    #Create movie_titles.csv using movies_cbr_small.csv
    movie_titles = movies_cbr_small.sort_values(by='title', ascending=True)['title']
    if iscreatecsv :
        movies_cbr_small.to_csv(MOVIES_CBR_CSV_FILE)
        movie_titles.to_csv(MOVIE_TITLE_CSV_FILE)
    return movies_cbr_small, movie_titles
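
The function above removes three rows by hard-coded position before casting `id` to `int64`. A more general alternative, sketched here under the assumption that the problem rows are the ones with a non-numeric `id`, is to let `pd.to_numeric` flag them. The frame below is a made-up stand-in, not real data from `movies_metadata.csv`.

```python
import pandas as pd

# Stand-in for the metadata frame read with dtype='unicode';
# the second id is deliberately malformed
moviesdf = pd.DataFrame({
    'id': ['862', '1997-08-20', '8844'],
    'title': ['Movie A', 'Broken row', 'Movie B'],
})

# Non-numeric ids become NaN and are dropped, whatever their position
moviesdf['id'] = pd.to_numeric(moviesdf['id'], errors='coerce')
moviesdf = moviesdf.dropna(subset=['id']).reset_index(drop=True)
moviesdf['id'] = moviesdf['id'].astype('int64')
```

This does not depend on row positions, so it should keep working if the source file gains or loses rows.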
    

In [8]:
#Create ratings_scale_small dataframe using ratings_small.csv
def create_ratings_scale_small(ratings_small_filepath, iscreatecsv):
    #Load movieswithratings & create ratings_small dataframe
    ratings_small = pd.read_csv(ratings_small_filepath, dtype='unicode')
    
    #convert rating to float and the movieId/userId columns to int
    ratings_small["rating"] = ratings_small["rating"].astype(str).astype(float)
    ratings_small["movieId"] = ratings_small["movieId"].astype(str).astype(int)
    ratings_small["userId"] = ratings_small["userId"].astype(str).astype(int)
    
    #Collect one dict per rating and build the dataframe once after the loop;
    #DataFrame.append was removed in pandas 2.0 and is very slow when called per row
    rating_rows = []
    for j, rating_row in ratings_small.iterrows():
        userid = rating_row['userId']
        movieid = rating_row['movieId']
        rating = rating_row['rating']
        timestamp = rating_row['timestamp']
        scale =''
        sortOrder = 1
        i = float(rating)
        if(i >= 0) and (i <= 1): 
            scale = 'Poor (0-1)'
            sortOrder = 1
        elif(i > 1) and (i <= 2): 
            scale = 'Fair (1-2)'
            sortOrder = 2
        elif(i > 2) and (i <= 3): 
            scale = 'Good (2-3)'
            sortOrder = 3
        elif(i > 3) and (i <= 4): 
            scale = 'Very Good (3-4)'
            sortOrder = 4
        else: 
            scale = 'Excellent (4-5)'
            sortOrder = 5
        rating_rows.append({'userId': userid,
                            'movieId':movieid,
                            'rating':rating,
                            'timestamp':timestamp,
                            'scale':scale,
                            'sortOrder':sortOrder})
    ratings_scale_small_df = pd.DataFrame(rating_rows, columns=['userId','movieId','rating','timestamp','scale','sortOrder'])
        
    if iscreatecsv :
        ratings_scale_small_df.to_csv(RATING_SCALE_CSV_FILE)
        
    return ratings_scale_small_df
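
The rating buckets above can also be computed without a Python loop. The following is a sketch using `pd.cut` on a small made-up frame, not the notebook's own implementation. One difference to keep in mind: a negative rating becomes `NaN` here, whereas the loop's final `else` branch would label it 'Excellent (4-5)'.

```python
import numpy as np
import pandas as pd

# Stand-in ratings; the real frame comes from ratings_small.csv
ratings = pd.DataFrame({'rating': [0.5, 1.0, 2.5, 3.0, 4.0, 4.5, 5.0]})

labels = ['Poor (0-1)', 'Fair (1-2)', 'Good (2-3)', 'Very Good (3-4)', 'Excellent (4-5)']
bins = [0, 1, 2, 3, 4, np.inf]

# Intervals are closed on the right, e.g. (1, 2]; include_lowest=True also admits exactly 0
ratings['scale'] = pd.cut(ratings['rating'], bins=bins, labels=labels,
                          include_lowest=True).astype(str)
# labels=False returns the zero-based bin number, so add 1 to match sortOrder
ratings['sortOrder'] = pd.cut(ratings['rating'], bins=bins, labels=False,
                              include_lowest=True) + 1
```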

In [9]:
#Set the iscreatecsv flag to True and call create_movies_cbr_small to create movies_cbr_small.csv & 
#movie_titles.csv files. 
#Call create_ratings_scale_small to create ratings_scale_small.csv file. 
iscreatecsv = True
movies_cbr_small, movie_titles  = create_movies_cbr_small(KAGGLE_MOVIES_CSV_FILE, KAGGLE_KEYWORDS_CSV_FILE, iscreatecsv)
ratings_scale_small_df = create_ratings_scale_small(KAGGLE_RATINGS_CSV_FILE, iscreatecsv)

