# Data Cleaning
In this notebook, we clean the data sets that were reviewed in the Data Collection notebook. For modeling, we need to combine the features we want to use into one dataframe.

This includes:
 - Combining MovieLens and IMDb reviews into one dataframe

In [1]:
# Imports
import pandas as pd

## Combining MovieLens and IMDb Reviews
### Reformatting Movie IDs
Our data from MovieLens comes with three different IDs for each movie, but we will be merging our data using the IMDb ID from our IMDb review data. MovieLens stores the IMDb ID as a bare number, whereas the other data sets format it as "tt" followed by the number zero-padded to at least seven digits (a 9-character code for IDs below ten million). Toy Story, for example, is "tt0114709" in the IMDb data and only "114709" in the MovieLens data.

In [3]:
ml_links = pd.read_csv('../Data/Large-Data/MovieLens-25M/links.csv')
ml_links.head()

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


In [4]:
ml_links.sort_values(by = 'imdbId', ascending=True)

Unnamed: 0,movieId,imdbId,tmdbId
46249,172063,1,16612.0
32269,140539,3,88013.0
50293,180695,7,159895.0
16842,88674,8,105158.0
24188,120869,10,774.0
...,...,...,...
61985,207453,11057912,636593.0
62071,207714,11101550,640427.0
62387,209051,11108064,642749.0
62312,208711,11168100,642203.0


In [6]:
# Add the reformatted ID as a new column called 'imdb_reformatted':
# prefix "tt" and left-pad the number with zeros to at least 7 digits
ml_links['imdb_reformatted'] = 'tt' + ml_links['imdbId'].astype(str).str.zfill(7)

ml_links.sort_values(by = 'imdbId', ascending=True)

Unnamed: 0,movieId,imdbId,tmdbId,imdb_reformatted
46249,172063,1,16612.0,tt0000001
32269,140539,3,88013.0,tt0000003
50293,180695,7,159895.0,tt0000007
16842,88674,8,105158.0,tt0000008
24188,120869,10,774.0,tt0000010
...,...,...,...,...
61985,207453,11057912,636593.0,tt11057912
62071,207714,11101550,640427.0,tt11101550
62387,209051,11108064,642749.0,tt11108064
62312,208711,11168100,642203.0,tt11168100
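
Before merging on the reformatted ID, it may be worth checking that every value has the expected shape. The following is a minimal sketch, not part of the original notebook: `sample_links` is a hypothetical three-row stand-in for `ml_links` built from rows shown in the output above, and the pattern `tt\d{7,}` ("tt" followed by at least seven digits) is an assumption based on the IDs seen here.

```python
import pandas as pd

# Hypothetical stand-in for ml_links, using rows from the output above
sample_links = pd.DataFrame({
    "movieId": [1, 172063, 207453],
    "imdbId": [114709, 1, 11057912],
})

# Same transformation as in the cell above
sample_links["imdb_reformatted"] = (
    "tt" + sample_links["imdbId"].astype(str).str.zfill(7)
)

# Every ID should be "tt" followed by at least seven digits (assumed pattern)
is_valid = sample_links["imdb_reformatted"].str.fullmatch(r"tt\d{7,}")
assert is_valid.all(), "found IDs that do not match the expected format"

# Duplicate IDs would multiply rows in a later merge, so check uniqueness too
assert sample_links["imdb_reformatted"].is_unique
```

The same two checks could be run against the full `ml_links` dataframe by swapping it in for `sample_links`.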


### Reformatting MovieLens Ratings
Our MovieLens ratings are scored out of 5 in half-point increments (0.5 to 5.0), giving 10 possible scores. The IMDb scores also have 10 options, but range from 1 to 10 in whole numbers. Multiplying the MovieLens ratings by 2 puts both on the same 1-10 scale so we can combine them into one set of ratings.
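
As a quick illustration of the mapping between the two scales, here is a small sketch. The variable names are illustrative only, and it assumes the ten MovieLens scores are the half-point steps from 0.5 to 5.0.

```python
import numpy as np

# The ten MovieLens scores: 0.5, 1.0, ..., 5.0 (assumed half-point steps)
movielens_scores = np.arange(1, 11) / 2

# Doubling each score maps it onto the whole numbers 1 through 10
scaled_scores = movielens_scores * 2

for original, scaled in zip(movielens_scores, scaled_scores):
    print(f"{original:.1f} -> {scaled:.0f}")
```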

In [7]:
ml_ratings = pd.read_csv('../Data/Large-Data/MovieLens-25M/ratings.csv')
ml_ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,296,5.0,1147880044
1,1,306,3.5,1147868817
2,1,307,5.0,1147868828
3,1,665,5.0,1147878820
4,1,899,3.5,1147868510


In [8]:
# Multiply the MovieLens rating by 2 so we have the same 1-10 rating scale as IMDb
ml_ratings['scaled_rating'] = ml_ratings['rating'] * 2
ml_ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp,scaled_rating
0,1,296,5.0,1147880044,10.0
1,1,306,3.5,1147868817,7.0
2,1,307,5.0,1147868828,10.0
3,1,665,5.0,1147878820,10.0
4,1,899,3.5,1147868510,7.0
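
With both pieces in place, the scaled ratings can be tagged with the reformatted IMDb ID by joining on `movieId`. The sketch below shows one possible way to do this, not necessarily the approach used later in this project. `sample_links` reuses two rows from the links output above, while the users and ratings in `sample_ratings` are made up for illustration.

```python
import pandas as pd

# Two rows from the links output above
sample_links = pd.DataFrame({"movieId": [1, 2], "imdbId": [114709, 113497]})
sample_links["imdb_reformatted"] = (
    "tt" + sample_links["imdbId"].astype(str).str.zfill(7)
)

# Hypothetical ratings (made-up users and scores, for illustration only)
sample_ratings = pd.DataFrame({
    "userId": [10, 11, 12],
    "movieId": [1, 2, 1],
    "rating": [4.0, 0.5, 5.0],
})
sample_ratings["scaled_rating"] = sample_ratings["rating"] * 2

# Attach the IMDb ID to each rating; validate= raises an error if a movieId
# appears more than once in the links table
merged = sample_ratings.merge(
    sample_links[["movieId", "imdb_reformatted"]],
    on="movieId",
    how="left",
    validate="many_to_one",
)

# Scaled ratings should all fall on the 1-10 scale
assert merged["scaled_rating"].between(1, 10).all()
print(merged)
```

A left join keeps every rating even if its movie has no entry in the links table; those rows would show `NaN` in `imdb_reformatted` and could be inspected or dropped afterwards.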


# Next thing