# Data Collection

In [1]:
# imports
import pandas as pd
import numpy as np

## MovieLens Data Structure
This data comes from the [MovieLens 25M Dataset](https://grouplens.org/datasets/movielens/25m/). MovieLens is an online recommender system that lets its users add ratings and tags to movies. The data includes ratings and tags taken directly from MovieLens users, as well as movie data organized by unique movie IDs. This dataset was collected in 2019 from 162,000 users. It contains 25 million ratings and one million tags for 62,000 movies.

### Movies Data Set
This data set pairs each MovieLens ID with its title, formatted 'Name of Movie (Year)', and its genres, formatted 'genre1|genre2|...'.

In [39]:
movies = pd.read_csv('../Data/Large-Data/MovieLens-25M/movies.csv')
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [43]:
movies.tail()

Unnamed: 0,movieId,title,genres
62418,209157,We (2018),Drama
62419,209159,Window of the Soul (2001),Documentary
62420,209163,Bad Poems (2018),Comedy|Drama
62421,209169,A Girl Thing (2001),(no genres listed)
62422,209171,Women of Devil's Island (1962),Action|Adventure|Drama


In [45]:
movies.isna().sum()

movieId    0
title      0
genres     0
dtype: int64

In [44]:
# This dataset covers 62,423 movies
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62423 entries, 0 to 62422
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  62423 non-null  int64 
 1   title    62423 non-null  object
 2   genres   62423 non-null  object
dtypes: int64(1), object(2)
memory usage: 1.4+ MB


In [48]:
# Let's check that there are no duplicates
print(f"There are {movies['movieId'].nunique()} unique movie IDs.")
print(f"There are {movies['title'].nunique()} unique titles.")

There are 62423 unique movie IDs.
There are 62325 unique titles.


There are 62,423 unique movie IDs, but only 62,325 unique titles. Since `title` includes both the name and the year, this may mean that distinct movies share a title and release year, or it may mean the data contains duplicate entries.

In [50]:
# It looks like no title appears more than twice. Since the movie IDs are all unique, we will leave this for now, but keep it in mind during EDA and modeling
# 62,423 rows - 62,325 unique titles = 98 repeated titles out of more than 62,000
movies['title'].value_counts()

title
The Void (2016)                                2
Seven Years Bad Luck (1921)                    2
Clear History (2013)                           2
Enron: The Smartest Guys in the Room (2005)    2
Deranged (2012)                                2
                                              ..
$ellebrity (Sellebrity) (2012)                 1
Macabre (Macabro) (1980)                       1
Punk's Dead: SLC Punk! 2 (2014)                1
Chinese Hercules (1973)                        1
Women of Devil's Island (1962)                 1
Name: count, Length: 62325, dtype: int64
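
To look at the repeated titles side by side rather than only counting them, one option is `DataFrame.duplicated` with `keep=False`. This is a minimal sketch on a toy frame; `movies_demo` and its values are illustrative stand-ins, not rows taken from the real file.

```python
import pandas as pd

# Toy stand-in for the movies DataFrame (values are illustrative only)
movies_demo = pd.DataFrame({
    'movieId': [1, 2, 3, 4],
    'title': ['The Void (2016)', 'Toy Story (1995)', 'The Void (2016)', 'Jumanji (1995)'],
    'genres': ['Horror', 'Animation|Children', 'Sci-Fi', 'Adventure'],
})

# keep=False flags every row whose title occurs more than once
repeated = movies_demo[movies_demo.duplicated(subset='title', keep=False)]
repeated = repeated.sort_values(['title', 'movieId'])
print(repeated)

# Largest number of times any single title appears
max_repeats = movies_demo['title'].value_counts().max()
print(max_repeats)
```

Comparing the `genres` of each repeated pair is one way to judge whether two rows are the same film entered twice or two different films that share a title and year.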

We'll need to reorganize `genres`, most likely into one boolean column per genre, for example `genres_fantasy`. It would be interesting to look at how genres combine within a movie, and whether certain users prefer movies that span multiple genres over movies that have only one.
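
One possible way to build those boolean columns is `Series.str.get_dummies`, which splits on a separator and returns one indicator column per value. The sketch below uses a small illustrative frame (`genres_demo`) and the `genres_` prefix suggested above; the `n_genres` column is a hypothetical helper for comparing single-genre and multi-genre movies.

```python
import pandas as pd

# Illustrative sample in the same 'genre1|genre2|...' format
genres_demo = pd.DataFrame({
    'movieId': [1, 2, 5],
    'genres': ['Adventure|Animation|Children|Comedy|Fantasy',
               'Adventure|Children|Fantasy',
               'Comedy'],
})

# Split on '|' and build one 0/1 indicator column per genre
dummies = genres_demo['genres'].str.get_dummies(sep='|')

# Convert to booleans and name the columns like genres_fantasy
dummies = dummies.astype(bool).add_prefix('genres_')
dummies.columns = dummies.columns.str.lower()

genres_wide = pd.concat([genres_demo[['movieId']], dummies], axis=1)

# Hypothetical helper column: how many genres each movie carries
genres_wide['n_genres'] = dummies.sum(axis=1)
print(genres_wide)
```

With this approach the placeholder value `(no genres listed)` would become its own column, so it may be worth replacing it with an empty string before splitting.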

### Links Data Set
This data set pairs each MovieLens ID with its IMDb and TMDB IDs. This is needed for combining our data from IMDb with our MovieLens data.

In [51]:
links = pd.read_csv('../Data/Large-Data/MovieLens-25M/links.csv')
links.head()

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


In [86]:
print(f'The shape of the links DataFrame is {links.shape}')
print(f"There are {links['movieId'].nunique()} unique MovieLens IDs.")
print(f"There are {links['imdbId'].nunique()} unique IMDb IDs.")
print(f"There are {links['tmdbId'].nunique()} unique TMDB IDs.")

The shape of the links DataFrame is (62423, 3)
There are 62423 unique MovieLens IDs.
There are 62423 unique IMDb IDs.
There are 62281 unique TMDB IDs.


In [68]:
# Let's check for any missing values
links.isna().sum()

movieId      0
imdbId       0
tmdbId     107
dtype: int64

We have the same number of unique MovieLens IDs as in our `movies` DataFrame, and it also matches the number of unique IMDb IDs. This should make it easy to match IMDb and MovieLens IDs when combining the ratings from each source into one DataFrame. There are some duplicate TMDB IDs as well as 107 missing values, but we do not currently need that column, so we can ignore this for now.
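
One detail to plan for when matching IDs: `imdbId` is stored here as a plain integer (114709 for Toy Story), while IMDb-side data uses string IDs such as `tt0114709`. A minimal sketch of one way to bridge the two, assuming the IMDb ID is `tt` followed by the number left-padded with zeros to at least seven digits; `links_demo`, `imdb_demo`, and the `imdb_tt` column are hypothetical names with illustrative values.

```python
import pandas as pd

# Illustrative rows in the shape of links.csv
links_demo = pd.DataFrame({'movieId': [1, 2], 'imdbId': [114709, 113497]})

# Assumption: IMDb IDs are 'tt' + the number zero-padded to at least 7 digits
links_demo['imdb_tt'] = 'tt' + links_demo['imdbId'].astype(str).str.zfill(7)

# Illustrative rows in the shape of an IMDb ratings table
imdb_demo = pd.DataFrame({'MovieID': ['tt0114709', 'tt0120884'], 'Rating': [9, 10]})

# An inner join keeps only the IMDb ratings whose movie also has a MovieLens ID
merged = imdb_demo.merge(links_demo, left_on='MovieID', right_on='imdb_tt', how='inner')
print(merged[['MovieID', 'movieId', 'Rating']])
```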

### Ratings Data Set
This data contains the ratings from MovieLens users, up to 2019. It has multiple ratings per userId, with each row recording a movieId, rating, and timestamp.

In [69]:
ratings = pd.read_csv('../Data/Large-Data/MovieLens-25M/ratings.csv')
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,296,5.0,1147880044
1,1,306,3.5,1147868817
2,1,307,5.0,1147868828
3,1,665,5.0,1147878820
4,1,899,3.5,1147868510


In [85]:
print(f'The shape of the ratings DataFrame is {ratings.shape}')
print(f"There are {ratings['userId'].nunique()} unique MovieLens user IDs.")
print(f"There are {ratings['movieId'].nunique()} unique movies rated.")

The shape of the ratings DataFrame is (25000095, 4)
There are 162541 unique MovieLens user IDs.
There are 59047 unique movies rated.


Only 59,047 of the 62,423 movies that we have titles for have ratings. We will need to keep this in mind when deciding which movies to include in our recommender system, as we don't want to recommend movies that we have no rating information on.
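
To see which movies have no ratings, a membership test with `isin` is one straightforward option. The sketch below runs on small illustrative frames (`movies_demo`, `ratings_demo`) rather than the real data.

```python
import pandas as pd

# Illustrative stand-ins for the movies and ratings DataFrames
movies_demo = pd.DataFrame({
    'movieId': [1, 2, 3, 4],
    'title': ['A (2000)', 'B (2001)', 'C (2002)', 'D (2003)'],
})
ratings_demo = pd.DataFrame({
    'userId': [1, 1, 2],
    'movieId': [1, 3, 1],
    'rating': [5.0, 3.5, 4.0],
})

# True for movies that appear at least once in the ratings table
has_rating = movies_demo['movieId'].isin(ratings_demo['movieId'])

unrated = movies_demo[~has_rating]
print(f"{len(unrated)} of {len(movies_demo)} movies have no ratings.")
```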

In [71]:
ratings.dtypes

userId         int64
movieId        int64
rating       float64
timestamp      int64
dtype: object

In [79]:
# Let's check for missing values
ratings.isna().sum()

userId       0
movieId      0
rating       0
timestamp    0
dtype: int64

In [75]:
# There are 10 possible ratings on a scale of 0.5-5
print(ratings['rating'].nunique())
print(ratings['rating'].unique())

10
[5.  3.5 4.  2.5 4.5 3.  0.5 2.  1.  1.5]


Our data for these ratings looks pretty clean. We will need to scale the ratings to match the IMDb ratings we have, but there are no incorrect data types or missing values.
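
Because the MovieLens scale runs from 0.5 to 5 in half-point steps, one simple candidate for the rescaling is to double each rating, which lands on the integers 1 to 10. This is only a sketch of that assumed linear mapping; whether a MovieLens 2.5 really means the same thing as an IMDb 5 is a separate question for EDA.

```python
import pandas as pd

# The ten possible MovieLens ratings
ml_ratings = pd.Series([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0])

# Doubling maps 0.5..5.0 onto 1..10; half-point values double exactly in floating point
scaled = (ml_ratings * 2).astype(int)
print(scaled.tolist())
```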

### Tags Data Set
The [MovieLens Data Dictionary](https://files.grouplens.org/datasets/movielens/ml-25m-README.html) defines tags as "user-generated metadata about movies. Each tag is typically a single word or short phrase." Each row of this data set records one tag applied by one user to one movie, and contains the MovieLens user ID, the MovieLens movie ID, the tag, and the timestamp of the tag.

In [53]:
tags = pd.read_csv('../Data/Large-Data/MovieLens-25M/tags.csv')
tags.head()

Unnamed: 0,userId,movieId,tag,timestamp
0,3,260,classic,1439472355
1,3,260,sci-fi,1439472256
2,4,1732,dark comedy,1573943598
3,4,1732,great dialogue,1573943604
4,4,7569,so bad it's good,1573943455


In [83]:
# We have over 1,000,000 tags, about 73,000 of them unique
print(f"The shape of the tags DataFrame is {tags.shape}.")
print(f"There are {tags['tag'].nunique()} unique tags.")
print(f"There are tags for {tags['movieId'].nunique()} unique movies.")

The shape of the tags DataFrame is (1093360, 4).
There are 73050 unique tags.
There are tags for 45251 unique movies.


Only 45,251 of our 62,423 movies have tags. We may want to consider dropping movies that do not have tags or ratings from our data, but we will examine that when we combine the data.

In [55]:
# Let's look for any missing values
tags.isna().sum()

userId        0
movieId       0
tag          16
timestamp     0
dtype: int64

Since we are only interested in the `tag` column, a row with no tag is of no use to us. We can safely drop these rows, since doing so removes only 16 rows out of more than 1,000,000.
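
A minimal sketch of that drop, restricted to the `tag` column so that rows are only removed when the tag itself is missing. `tags_demo` and its values are illustrative.

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for the tags DataFrame, with one missing tag
tags_demo = pd.DataFrame({
    'userId': [3, 3, 4],
    'movieId': [260, 260, 1732],
    'tag': ['classic', np.nan, 'dark comedy'],
    'timestamp': [1439472355, 1439472256, 1573943598],
})

# Drop only the rows where 'tag' is missing, then renumber the index
tags_clean = tags_demo.dropna(subset=['tag']).reset_index(drop=True)
print(tags_clean.shape)
```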

## IEEE IMDb Review Data
This data set comes from [IEEE DataPort](https://ieee-dataport.org/open-access/imdb-users-ratings-dataset). It contains 4,669,820 ratings from 1,499,238 users to 351,109 movies on IMDb. Each row has a userID, movieID, rating, and review date.

In [2]:
# This code comes directly from the IEEE instructions for reading in this data

dataset = np.load("../Data/Large-Data/imdb_reviews.npy")

dataset[0]

'ur4592644,tt0120884,10,16 January 2005'

The data is stored as a NumPy array of comma-separated strings, so we need to reformat it before turning it into a DataFrame.

In [3]:
# Turn into a list of rows split on commas, like a CSV file
data_list = [row.split(',') for row in dataset]

# Convert to pandas DataFrame with labeled columns
imdb_ratings = pd.DataFrame(data_list, columns=['UserID', 'MovieID', 'Rating', 'ReviewDate'])

In [4]:
print(imdb_ratings.dtypes)
imdb_ratings.head()

UserID        object
MovieID       object
Rating        object
ReviewDate    object
dtype: object


Unnamed: 0,UserID,MovieID,Rating,ReviewDate
0,ur4592644,tt0120884,10,16 January 2005
1,ur3174947,tt0118688,3,16 January 2005
2,ur3780035,tt0387887,8,16 January 2005
3,ur4592628,tt0346491,1,16 January 2005
4,ur3174947,tt0094721,8,16 January 2005


In [5]:
# Convert Rating to numeric
imdb_ratings['Rating'] = pd.to_numeric(imdb_ratings['Rating'])

# Convert ReviewDate to datetime
imdb_ratings['ReviewDate'] = pd.to_datetime(imdb_ratings['ReviewDate'])

imdb_ratings.dtypes

UserID                object
MovieID               object
Rating                 int64
ReviewDate    datetime64[ns]
dtype: object

In [6]:
print(f'The shape of the IMDb ratings DataFrame is {imdb_ratings.shape}')
print(f"There are {imdb_ratings['UserID'].nunique()} unique IMDb user IDs.")
print(f"There are {imdb_ratings['MovieID'].nunique()} unique movies rated.")

The shape of the IMDb ratings DataFrame is (4669820, 4)
There are 1499238 unique IMDb user IDs.
There are 351109 unique movies rated.


There are fewer ratings here than in the MovieLens data, but they cover almost 300,000 more movies. This will help make sure we have enough data, and may prevent us from having to remove movies from our recommender system.

In [7]:
# These ratings are on a scale of 1-10, unlike the MovieLens which is 0.5 to 5
# We can scale these later to match, since there are 10 options for each
print(imdb_ratings['Rating'].nunique())
print(imdb_ratings['Rating'].unique())

10
[10  3  8  1  9  7  2  4  6  5]


In [10]:
# Let's save this as a CSV so we don't need to reformat it again
imdb_ratings.to_csv('../Data/Large-Data/ieee_imdb_reviews.csv', index=False)

In [11]:
imdb_reviews = pd.read_csv('../Data/Large-Data/ieee_imdb_reviews.csv')
imdb_reviews.head()

Unnamed: 0,UserID,MovieID,Rating,ReviewDate
0,ur4592644,tt0120884,10,2005-01-16
1,ur3174947,tt0118688,3,2005-01-16
2,ur3780035,tt0387887,8,2005-01-16
3,ur4592628,tt0346491,1,2005-01-16
4,ur3174947,tt0094721,8,2005-01-16
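
One thing to watch when reloading: a CSV file does not store dtypes, so `ReviewDate` comes back as text unless the reader is asked to parse it. The sketch below shows the round trip on a single illustrative row, using an in-memory buffer in place of the real file path.

```python
import io
import pandas as pd

# One illustrative row with a real datetime column
demo = pd.DataFrame({
    'UserID': ['ur4592644'],
    'MovieID': ['tt0120884'],
    'Rating': [10],
    'ReviewDate': pd.to_datetime(['16 January 2005']),
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
demo.to_csv(buffer, index=False)

# Default read: ReviewDate is loaded as text
buffer.seek(0)
plain = pd.read_csv(buffer)

# With parse_dates: ReviewDate is restored as a datetime column
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=['ReviewDate'])

print(plain['ReviewDate'].dtype, parsed['ReviewDate'].dtype)
```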


## Kaggle Movies Data

### Movie Metadata

In [21]:
movies_metadata = pd.read_csv('../Data/Large-Data/kaggle-movies/movies_metadata.csv')
movies_metadata


Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45461,False,,0,"[{'id': 18, 'name': 'Drama'}, {'id': 10751, 'n...",http://www.imdb.com/title/tt6209470/,439050,tt6209470,fa,رگ خواب,Rising and falling between a man and woman.,...,,0.0,90.0,"[{'iso_639_1': 'fa', 'name': 'فارسی'}]",Released,Rising and falling between a man and woman,Subdue,False,4.0,1.0
45462,False,,0,"[{'id': 18, 'name': 'Drama'}]",,111109,tt2028550,tl,Siglo ng Pagluluwal,An artist struggles to finish his work while a...,...,2011-11-17,0.0,360.0,"[{'iso_639_1': 'tl', 'name': ''}]",Released,,Century of Birthing,False,9.0,3.0
45463,False,,0,"[{'id': 28, 'name': 'Action'}, {'id': 18, 'nam...",,67758,tt0303758,en,Betrayal,"When one of her hits goes wrong, a professiona...",...,2003-08-01,0.0,90.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,A deadly game of wits.,Betrayal,False,3.8,6.0
45464,False,,0,[],,227506,tt0008536,en,Satana likuyushchiy,"In a small town live two brothers, one a minis...",...,1917-10-21,0.0,87.0,[],Released,,Satan Triumphant,False,0.0,0.0


In [22]:
movies_metadata.isnull().sum()

adult                        0
belongs_to_collection    40972
budget                       0
genres                       0
homepage                 37684
id                           0
imdb_id                     17
original_language           11
original_title               0
overview                   954
popularity                   5
poster_path                386
production_companies         3
production_countries         3
release_date                87
revenue                      6
runtime                    263
spoken_languages             6
status                      87
tagline                  25054
title                        6
video                        6
vote_average                 6
vote_count                   6
dtype: int64
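
Columns such as `genres` hold Python-style lists of dictionaries stored as strings, so they need parsing before use. A minimal sketch using `ast.literal_eval` from the standard library follows. The frame `meta_demo` is illustrative: its first row is a made-up example with a hypothetical id, and `genre_names` is a hypothetical helper.

```python
import ast
import pandas as pd

# Illustrative rows in the same string format as the genres column
meta_demo = pd.DataFrame({
    'id': [0, 11862, 227506],  # id 0 is a made-up example row
    'genres': [
        "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]",
        "[{'id': 35, 'name': 'Comedy'}]",
        "[]",
    ],
})

def genre_names(raw):
    """Parse a stringified list of dicts and return just the genre names."""
    return [entry['name'] for entry in ast.literal_eval(raw)]

meta_demo['genre_names'] = meta_demo['genres'].apply(genre_names)
print(meta_demo[['id', 'genre_names']])
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for text loaded from a file. It raises an error on malformed input, so real data may need a guard for bad or missing values.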

In [23]:
credits = pd.read_csv('../Data/Large-Data/kaggle-movies/credits.csv')
credits

Unnamed: 0,cast,crew,id
0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",862
1,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",8844
2,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...",15602
3,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...",31357
4,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...",11862
...,...,...,...
45471,"[{'cast_id': 0, 'character': '', 'credit_id': ...","[{'credit_id': '5894a97d925141426c00818c', 'de...",439050
45472,"[{'cast_id': 1002, 'character': 'Sister Angela...","[{'credit_id': '52fe4af1c3a36847f81e9b15', 'de...",111109
45473,"[{'cast_id': 6, 'character': 'Emily Shaw', 'cr...","[{'credit_id': '52fe4776c3a368484e0c8387', 'de...",67758
45474,"[{'cast_id': 2, 'character': '', 'credit_id': ...","[{'credit_id': '533bccebc3a36844cf0011a7', 'de...",227506


In [24]:
keywords = pd.read_csv('../Data/Large-Data/kaggle-movies/keywords.csv')
keywords

Unnamed: 0,id,keywords
0,862,"[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,8844,"[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,15602,"[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
3,31357,"[{'id': 818, 'name': 'based on novel'}, {'id':..."
4,11862,"[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n..."
...,...,...
46414,439050,"[{'id': 10703, 'name': 'tragic love'}]"
46415,111109,"[{'id': 2679, 'name': 'artist'}, {'id': 14531,..."
46416,67758,[]
46417,227506,[]


In [25]:
links_small = pd.read_csv('../Data/Large-Data/kaggle-movies/links_small.csv')
links_small

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0
...,...,...,...
9120,162672,3859980,402672.0
9121,163056,4262980,315011.0
9122,163949,2531318,391698.0
9123,164977,27660,137608.0


In [26]:
ratings_small = pd.read_csv('../Data/Large-Data/kaggle-movies/ratings_small.csv')
ratings_small

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205
...,...,...,...,...
99999,671,6268,2.5,1065579370
100000,671,6269,4.0,1065149201
100001,671,6365,4.0,1070940363
100002,671,6385,2.5,1070979663


## IMDb Title Data

In [17]:
imdb_titles = pd.read_csv('../Data/Large-Data/imdb_title_akas.tsv', sep='\t')
imdb_titles


Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
0,tt0000001,1,Карменсіта,UA,\N,imdbDisplay,\N,0
1,tt0000001,2,Carmencita,DE,\N,\N,literal title,0
2,tt0000001,3,Carmencita - spanyol tánc,HU,\N,imdbDisplay,\N,0
3,tt0000001,4,Καρμενσίτα,GR,\N,imdbDisplay,\N,0
4,tt0000001,5,Карменсита,RU,\N,imdbDisplay,\N,0
...,...,...,...,...,...,...,...,...
38684347,tt9916852,5,Episódio #3.20,PT,pt,\N,\N,0
38684348,tt9916852,6,Episodio #3.20,IT,it,\N,\N,0
38684349,tt9916852,7,एपिसोड #3.20,IN,hi,\N,\N,0
38684350,tt9916856,1,The Wind,DE,\N,imdbDisplay,\N,0


In [31]:
imdb_titles.dtypes

titleId            object
ordering            int64
title              object
region             object
language           object
types              object
attributes         object
isOriginalTitle    object
dtype: object
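
In this file missing values are written as the literal text `\N`, so pandas reads them as ordinary strings and `isOriginalTitle` ends up as `object`. One option, sketched below on a two-row in-memory sample, is to declare `\N` as the missing-value marker at read time. Setting `keep_default_na=False` is an assumption made here so that a value like the region code `NA` is not also treated as missing.

```python
import io
import pandas as pd

# Two illustrative rows in the same tab-separated layout, with \N for missing values
tsv = (
    "titleId\tordering\ttitle\tregion\tlanguage\ttypes\tattributes\tisOriginalTitle\n"
    "tt0000001\t2\tCarmencita\tDE\t\\N\t\\N\tliteral title\t0\n"
    "tt0000001\t7\tCarmencita\t\\N\t\\N\toriginal\t\\N\t1\n"
)

akas_demo = pd.read_csv(
    io.StringIO(tsv),
    sep='\t',
    na_values=['\\N'],                   # treat the literal \N as missing
    keep_default_na=False,               # do not treat strings like 'NA' as missing
    dtype={'isOriginalTitle': 'Int64'},  # nullable integer instead of object
)

print(akas_demo.dtypes)
print(akas_demo.isna().sum())
```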

In [37]:
originals = imdb_titles[imdb_titles['types']=='original']
originals

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
6,tt0000001,7,Carmencita,\N,\N,original,\N,1
8,tt0000002,1,Le clown et ses chiens,\N,\N,original,\N,1
21,tt0000003,6,Pauvre Pierrot,\N,\N,original,\N,1
25,tt0000004,1,Un bon bock,\N,\N,original,\N,1
34,tt0000005,11,Blacksmith Scene,\N,\N,original,\N,1
...,...,...,...,...,...,...,...,...
38684077,tt9916730,2,6 Gunn,\N,\N,original,\N,1
38684079,tt9916754,2,Chico Albuquerque - Revelações,\N,\N,original,\N,1
38684081,tt9916756,2,Pretty Pretty Black Girl,\N,\N,original,\N,1
38684097,tt9916764,2,38,\N,\N,original,\N,1


In [39]:
originals.isna().sum()

titleId            0
ordering           0
title              2
region             0
language           0
types              0
attributes         0
isOriginalTitle    0
dtype: int64

In [40]:
originals = originals.dropna()

In [43]:
print(originals.shape)
print(originals['titleId'].nunique())

(1842205, 8)
1842198


In [44]:
originals = originals.drop_duplicates(subset=['titleId'])
print(originals.shape)
print(originals['titleId'].nunique())

(1842198, 8)
1842198


In [45]:
originals[originals['titleId']=='tt0114709']

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
1003703,tt0114709,22,Toy Story,\N,\N,original,\N,1


In [46]:
imdb_titles = originals[['titleId', 'title']]
imdb_titles

Unnamed: 0,titleId,title
6,tt0000001,Carmencita
8,tt0000002,Le clown et ses chiens
21,tt0000003,Pauvre Pierrot
25,tt0000004,Un bon bock
34,tt0000005,Blacksmith Scene
...,...,...
38684077,tt9916730,6 Gunn
38684079,tt9916754,Chico Albuquerque - Revelações
38684081,tt9916756,Pretty Pretty Black Girl
38684097,tt9916764,38


In [15]:
imdb_titles.to_csv('../Data/Large-Data/imdb_titles.csv', index=False)