# MovieLens: Preprocessing

* [Easily parallelize your calculations in pandas with parallel-pandas](https://towardsdatascience.com/easily-parallelize-your-calculations-in-pandas-with-parallel-pandas-dc194b82d82f)

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
BASE_PATH    = '../..'
SRC_PATH     = f'{BASE_PATH}/src'
DATASET_PATH = f'{BASE_PATH}/datasets'

In [4]:
import sys
sys.path.append(SRC_PATH)

import numpy as np
import pandas as pd

from parallel_pandas import ParallelPandas
from tmdb_api import TMDbApi
import datetime

In [5]:
DATASET_PATH = '../../datasets'  # same value as the DATASET_PATH already built from BASE_PATH

In [6]:
# Initialize parallel-pandas
ParallelPandas.initialize(n_cpu=24, split_factor=4, disable_pr_bar=False)
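
The `n_cpu=24` value is tied to one particular machine. A minimal sketch, assuming you would rather derive the worker count from the host; it uses only the standard library, and `pick_worker_count` with its `reserve` parameter are hypothetical names, not part of parallel-pandas:

```python
import os


def pick_worker_count(reserve: int = 1) -> int:
    """Return the number of CPUs to use, leaving `reserve` cores free."""
    total = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(1, total - reserve)


n_cpu = pick_worker_count()
print(n_cpu)
```

The result can then be passed as `n_cpu=n_cpu` in the `ParallelPandas.initialize(...)` call.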

1. Load all data sources

In [7]:
movies  = pd.read_csv(f'{DATASET_PATH}/ml-latest-small/movies.csv')
links   = pd.read_csv(f'{DATASET_PATH}/ml-latest-small/links.csv')
ratings = pd.read_csv(f'{DATASET_PATH}/ml-latest-small/ratings.csv')

In [8]:
tmdb_movies = pd.read_csv(f'{DATASET_PATH}/tmdb/tmdb_5000_movies.csv')

In [9]:
tmdb_movies.head(2)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500


In [10]:
movies.head(5)

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [11]:
links.head(5)

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0
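
The `tmdbId` values display as floats (`862.0`). pandas stores an integer column as `float64` when it contains missing values, which is presumably the case here. A sketch, on toy CSV text rather than the real `links.csv`, of reading the column with the nullable `Int64` dtype instead; the third row and its ids are made up for illustration:

```python
import io

import pandas as pd

# Toy stand-in for links.csv; the last row has no tmdbId.
csv_text = "movieId,imdbId,tmdbId\n1,114709,862\n2,113497,8844\n3,999999,\n"

# "Int64" (capital I) is the nullable integer dtype: missing values become <NA>
# and the remaining ids stay integers.
links_demo = pd.read_csv(io.StringIO(csv_text), dtype={"tmdbId": "Int64"})
print(links_demo.dtypes)
```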


In [14]:
ratings.head(5)

Unnamed: 0,userId,movieId,rating,timestamp,datetime
0,1,1,4.0,964982703,2000-07-30 18:45:03
1,1,3,4.0,964981247,2000-07-30 18:20:47
2,1,6,4.0,964982224,2000-07-30 18:37:04
3,1,47,5.0,964983815,2000-07-30 19:03:35
4,1,50,5.0,964982931,2000-07-30 18:48:51
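
The cell that adds the `datetime` column is not shown. One way it might be derived, assuming `timestamp` holds Unix seconds in UTC, is sketched below using the first two rows of the output above:

```python
import pandas as pd

# First two rows of the ratings output shown above.
ratings_demo = pd.DataFrame({
    "userId": [1, 1],
    "movieId": [1, 3],
    "rating": [4.0, 4.0],
    "timestamp": [964982703, 964981247],
})

# unit="s" interprets the integers as seconds since 1970-01-01 00:00:00 UTC.
ratings_demo["datetime"] = pd.to_datetime(ratings_demo["timestamp"], unit="s")
print(ratings_demo[["timestamp", "datetime"]])
```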


2. Create the API client that lets us query movie data directly from TMDB.

In [15]:
api = TMDbApi()

3. Build the movies dataset

In [16]:
movies_links = movies.merge(links, on='movieId')
movies_links.head(5)

Unnamed: 0,movieId,title,genres,imdbId,tmdbId
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0
1,2,Jumanji (1995),Adventure|Children|Fantasy,113497,8844.0
2,3,Grumpier Old Men (1995),Comedy|Romance,113228,15602.0
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,114885,31357.0
4,5,Father of the Bride Part II (1995),Comedy,113041,11862.0


In [17]:
movies_table = movies_links.merge(tmdb_movies, left_on='tmdbId', right_on='id')
movies_table.head(2)

Unnamed: 0,movieId,title_x,genres_x,imdbId,tmdbId,budget,genres_y,homepage,id,keywords,...,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title_y,vote_average,vote_count
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,30000000,"[{""id"": 16, ""name"": ""Animation""}, {""id"": 35, ""...",http://toystory.disney.com/toy-story,862,"[{""id"": 931, ""name"": ""jealousy""}, {""id"": 4290,...",...,"[{""iso_3166_1"": ""US"", ""name"": ""United States o...",1995-10-30,373554033,81.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,,Toy Story,7.7,5269
1,10,GoldenEye (1995),Action|Adventure|Thriller,113189,710.0,58000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 28, ""...",http://www.mgm.com/view/movie/757/Goldeneye/,710,"[{""id"": 701, ""name"": ""cuba""}, {""id"": 769, ""nam...",...,"[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",1995-11-16,352194034,130.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,No limits. No fears. No substitutes.,GoldenEye,6.6,1174


In [18]:
movies_table.shape

(3537, 25)
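
`merge` defaults to an inner join, so MovieLens titles with no matching row in the TMDB file are dropped without notice. A sketch, on toy frames, of counting matched and unmatched rows with `how='left'` and `indicator=True`; the frame contents are illustrative only:

```python
import pandas as pd

# Toy stand-ins for movies_links and tmdb_movies.
links_demo = pd.DataFrame({"movieId": [1, 2, 3], "tmdbId": [862.0, 8844.0, None]})
tmdb_demo = pd.DataFrame({"id": [862, 710], "overview": ["toy text a", "toy text b"]})

# Drop rows without a tmdbId and make the key an integer so both sides share a dtype.
links_clean = links_demo.dropna(subset=["tmdbId"]).astype({"tmdbId": "int64"})

# indicator=True adds a `_merge` column with values 'both', 'left_only' or 'right_only'.
merged = links_clean.merge(
    tmdb_demo, left_on="tmdbId", right_on="id", how="left", indicator=True
)
match_counts = merged["_merge"].value_counts()
print(match_counts)
```

Rows marked `left_only` are the ones an inner join would discard.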

In [19]:
movies_table['image'] = movies_table['tmdbId'].p_apply(lambda movie_id: api.movie_by(movie_id).poster)

<LAMBDA> DONE:   0%|          | 0/3537 [00:00<?, ?it/s]
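
A single failed lookup can abort the whole `p_apply` run. A minimal sketch of a defensive wrapper, assuming the client raises an exception when a movie cannot be fetched (the source does not show how `TMDbApi` reports failures). `safe_poster`, `FakeApi` and `FakeMovie` are hypothetical names used only for illustration:

```python
import math
from typing import Optional


class FakeMovie:
    def __init__(self, poster: str):
        self.poster = poster


class FakeApi:
    """Hypothetical stand-in for TMDbApi, for illustration only."""

    def movie_by(self, movie_id: int) -> FakeMovie:
        if movie_id == 862:
            return FakeMovie("https://example.org/poster.jpg")
        raise LookupError(f"movie {movie_id} not found")


def safe_poster(api, movie_id) -> Optional[str]:
    """Return the poster URL, or None if the id is missing or the lookup fails."""
    if movie_id is None or (isinstance(movie_id, float) and math.isnan(movie_id)):
        return None
    try:
        return api.movie_by(int(movie_id)).poster
    except Exception:
        return None


print(safe_poster(FakeApi(), 862.0))
print(safe_poster(FakeApi(), float("nan")))
```

With the real client, the lambda body would become `safe_poster(api, movie_id)`.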

In [20]:
movies_table = movies_table[['movieId', 'title_x', 'overview', 'image']]
movies_table = movies_table.rename(columns={
    'movieId' : 'id',
    'title_x' : 'name',
    'overview': 'description'
})

In [21]:
movies_table.head(5)

Unnamed: 0,id,name,description,image
0,1,Toy Story (1995),"Led by Woody, Andy's toys live happily in his ...",https://image.tmdb.org/t/p/w500//uXDfjJbdP4ijW...
1,10,GoldenEye (1995),James Bond must unmask the mysterious head of ...,https://image.tmdb.org/t/p/w500//z0ljRnNxIO7CR...
2,11,"American President, The (1995)","Widowed U.S. president Andrew Shepherd, one of...",https://image.tmdb.org/t/p/w500//yObOAYFIHXHkF...
3,14,Nixon (1995),An all-star cast powers this epic look at Amer...,https://image.tmdb.org/t/p/w500//cz2MTGr2wpDZL...
4,15,Cutthroat Island (1995),"Morgan Adams and her slave, William Shaw, are ...",https://image.tmdb.org/t/p/w500//hYdeBZ4BFXivd...


In [22]:
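# Keep only movies that have a poster. notna() is used because an elementwise
# `!= None` comparison does not drop missing values in pandas.
movies_table = movies_table[movies_table['image'].notna()]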
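
When filtering on missing posters, note that an elementwise comparison against `None` does not act as a null check in pandas. A small sketch on toy data (the URL is a placeholder):

```python
import pandas as pd

items_demo = pd.DataFrame({
    "id": [1, 2, 3],
    "image": ["https://example.org/a.jpg", None, float("nan")],
})

# Elementwise `!= None` is True for every row, so it filters nothing.
mask_ne = items_demo["image"] != None  # noqa: E711

# notna() is False for both None and NaN.
mask_notna = items_demo["image"].notna()

print(mask_ne.tolist())
print(mask_notna.tolist())
```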

4. Build the interactions dataset

In [24]:
ratings = ratings[ratings['movieId'].isin(movies_table['id'])]

In [25]:
interactions = ratings.rename(columns={
    'movieId' : 'item_id', 
    'userId'  : 'user_id'
})
interactions = interactions[['item_id', 'user_id', 'rating']]
interactions.shape

(70194, 3)

5. Save both datasets.

In [26]:
movies_table.to_csv(f'{DATASET_PATH}/items.csv', index=False)
interactions.to_csv(f'{DATASET_PATH}/interactions.csv', index=False)
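
Before relying on the saved files, it can help to confirm that every interaction points at an existing item and that a file reads back unchanged. A sketch on toy frames written to a temporary directory; the ratings values are made up:

```python
import os
import tempfile

import pandas as pd

items_demo = pd.DataFrame({"id": [1, 10], "name": ["Toy Story (1995)", "GoldenEye (1995)"]})
interactions_demo = pd.DataFrame({
    "item_id": [1, 1, 10],
    "user_id": [1, 2, 1],
    "rating": [4.0, 3.5, 5.0],
})

# Interactions whose item_id is absent from the items table.
orphans = interactions_demo[~interactions_demo["item_id"].isin(items_demo["id"])]

# Write and re-read the items file to confirm the round trip preserves it.
with tempfile.TemporaryDirectory() as tmp_dir:
    items_path = os.path.join(tmp_dir, "items.csv")
    items_demo.to_csv(items_path, index=False)
    items_back = pd.read_csv(items_path)

print(len(orphans), items_back.shape)
```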

In [27]:
movies_table.shape[0], interactions.shape[0], movies_table.shape[0] * interactions.shape[0]

(3537, 70194, 248276178)
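
The third value is the item count multiplied by the interaction count. A related figure often reported for recommender datasets is the density of the user-item matrix, that is, observed interactions divided by users times items. A sketch on toy data, not the real `interactions` frame:

```python
import pandas as pd

interactions_demo = pd.DataFrame({
    "item_id": [1, 1, 10, 14],
    "user_id": [1, 2, 1, 3],
    "rating": [4.0, 3.5, 5.0, 2.0],
})

n_users = interactions_demo["user_id"].nunique()
n_items = interactions_demo["item_id"].nunique()

# Fraction of the user-item grid that has an observed rating.
density = len(interactions_demo) / (n_users * n_items)
print(n_users, n_items, density)
```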