In [1]:
import pandas as pd
import numpy as np

path = "C:/Users/Admin/Documents/ironhack/streaming_service_recommender/"

## Netflix Merge

#### Goals:

- Match each Netflix TV show to its IMDb ID (tconst)


**NOTE: Each streaming service needs a different sequence of merges to get its complete data, so each service is handled in its own notebook.**


## 1. Import IMDb and Netflix data

In [2]:
imdb = pd.read_pickle(path + "Data/imdb_tv_all.pkl")

In [3]:
netflix = pd.read_csv(path + "Data/netflix_shows.csv")

We will check how many titles we have with the shape attribute, which returns a tuple of (number of rows, number of columns).

In [4]:
netflix.shape

(1915, 5)

## 2. Merge dataframes

We will start by merging on both title and year to get the right IMDB ID (tconst), since matching on title alone could pick up a remake with the same name.
We use a left join so that every original netflix title is kept, whether or not it finds a match.

In [5]:
netflix_genres = netflix.merge(imdb, left_on=["show", "year"], right_on=["originalTitle", "startYear"], how="left")
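To make the join concrete, here is a minimal sketch on made-up rows. The toy frames, titles and placeholder IDs (`ttA`, `ttB`, `ttC`) are invented for illustration and are not taken from the real data. It shows that matching on title and year together picks the right entry when two shows share a title, and that a row without a match keeps NaN in tconst.

```python
import pandas as pd

# Toy stand-ins for the netflix and imdb frames (invented rows, placeholder IDs).
netflix_toy = pd.DataFrame({
    "show": ["Same Name", "Other Show", "Unknown Show"],
    "year": [2005, 2017, 2020],
})
imdb_toy = pd.DataFrame({
    "tconst": ["ttA", "ttB", "ttC"],
    "originalTitle": ["Same Name", "Same Name", "Other Show"],
    "startYear": [2001, 2005, 2017],
})

# Left join on two key pairs: title must equal title AND year must equal year.
merged = netflix_toy.merge(
    imdb_toy,
    left_on=["show", "year"],
    right_on=["originalTitle", "startYear"],
    how="left",
)

# "Same Name" (2005) gets ttB, not the 2001 entry; "Unknown Show" gets NaN.
print(merged[["show", "year", "tconst"]])
```

The left join keeps all three toy shows, so the unmatched one can be picked out afterwards by testing tconst for nulls.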

From the merged data frame, we will build a new data frame called netflix_missing1 holding the titles that didn't find a match (tconst is null), keeping only the original netflix columns.

In [6]:
netflix_missing1 = netflix_genres.loc[
    netflix_genres["tconst"].isna(), ["show", "year", "rating", "imdb", "rotten_tomatoes"]
].reset_index(drop=True)

In [7]:
netflix_missing1.shape

(649, 5)

649 of the 1915 titles are still unmatched.
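An alternative way to count unmatched rows, shown here only as a sketch on invented data, is the `indicator=True` option of `merge`. It adds a `_merge` column that marks each row as `both` or `left_only`, so the check does not depend on which right-hand column happens to be null. The frames and IDs below are hypothetical.

```python
import pandas as pd

# Invented toy data: three shows, two of which exist in the lookup table.
left = pd.DataFrame({"show": ["A", "B", "C"], "year": [2000, 2001, 2002]})
right = pd.DataFrame({
    "originalTitle": ["A", "C"],
    "startYear": [2000, 2002],
    "tconst": ["tt_a", "tt_c"],
})

checked = left.merge(
    right,
    left_on=["show", "year"],
    right_on=["originalTitle", "startYear"],
    how="left",
    indicator=True,  # adds the "_merge" column
)

# Rows present only on the left side found no match.
unmatched = checked[checked["_merge"] == "left_only"]
print(len(unmatched), "of", len(left), "titles unmatched")
```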

We will now merge again, this time on primaryTitle instead of originalTitle, since originalTitle may hold the title in its original language.

In [8]:
netflix_missing1 = netflix_missing1.merge(imdb, left_on=["show", "year"], right_on=["primaryTitle", "startYear"], how="left")

We will create another data frame for the titles that still weren't matched.

In [9]:
netflix_missing2 = netflix_missing1.loc[
    netflix_missing1["tconst"].isna(), ["show", "year", "rating", "imdb", "rotten_tomatoes"]
].reset_index(drop=True)

In [10]:
netflix_missing2.shape

(452, 5)

452 titles are still unmatched.

We will merge again on title only, without the year, since the years obtained from web scraping may not agree with IMDB's startYear.

In [11]:
netflix_missing2 = netflix_missing2.merge(imdb, left_on=["show"], right_on=["originalTitle"], how="left")
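One thing to keep in mind with a title-only merge is that a single show can match several IMDB entries. The sketch below uses invented rows and placeholder IDs to show the fan-out. Dropping duplicates on show, as done later in this notebook, keeps whichever match comes first. Sorting by the gap between the two years before dropping is a hypothetical refinement, not something this notebook does, and it assumes both year columns are numeric.

```python
import pandas as pd

# Invented example: one Netflix show whose title appears twice in the lookup.
shows = pd.DataFrame({"show": ["Remade Show"], "year": [2019]})
catalog = pd.DataFrame({
    "tconst": ["tt_old", "tt_new"],
    "originalTitle": ["Remade Show", "Remade Show"],
    "startYear": [1985, 2018],
})

# Title-only left join: the single show now produces two rows.
by_title = shows.merge(catalog, left_on="show", right_on="originalTitle", how="left")
print(len(by_title))

# Hypothetical tie-break: prefer the entry whose start year is closest.
by_title["year_gap"] = (by_title["startYear"] - by_title["year"]).abs()
closest = by_title.sort_values("year_gap").drop_duplicates("show")
print(closest[["show", "tconst"]])
```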

As after every merge, we will create a data frame with the titles that are still unmatched.

In [12]:
netflix_missing3 = netflix_missing2.loc[
    netflix_missing2["tconst"].isna(), ["show", "year", "rating", "imdb", "rotten_tomatoes"]
].reset_index(drop=True)

In [13]:
netflix_missing3.shape

(324, 5)

324 titles are still unmatched. We will now try the title-only merge on primaryTitle.

In [14]:
netflix_missing3 = netflix_missing3.merge(imdb, left_on=["show"], right_on=["primaryTitle"], how="left")

In [15]:
netflix_missing4 = netflix_missing3.loc[
    netflix_missing3["tconst"].isna(), ["show", "year", "rating", "imdb", "rotten_tomatoes"]
].reset_index(drop=True)

In [16]:
netflix_missing4.shape

(311, 5)
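The four merges so far follow one pattern: try a pair of key columns, keep the matches, and retry the leftovers with a looser pair. As a sketch only, that pattern can be written as a loop. The helper name `match_in_stages`, the toy frames and the placeholder IDs are all hypothetical, and the helper assumes the two frames share no column names (otherwise `merge` would add suffixes).

```python
import pandas as pd

def match_in_stages(left, right, stages, id_col="tconst"):
    """Try each (left_on, right_on) key pair in turn on the still-unmatched rows."""
    remaining = left
    matched_parts = []
    for left_on, right_on in stages:
        merged = remaining.merge(right, left_on=left_on, right_on=right_on, how="left")
        found = merged[id_col].notna()
        if found.any():
            matched_parts.append(merged[found])
        # Carry only the original columns of unmatched rows into the next stage.
        remaining = merged.loc[~found, left.columns].drop_duplicates().reset_index(drop=True)
    matched = pd.concat(matched_parts, ignore_index=True)
    return matched, remaining

# Invented toy data with placeholder IDs.
netflix_toy = pd.DataFrame({
    "show": ["Alpha", "Beta", "Gamma", "Delta"],
    "year": [2010, 2011, 2012, 2013],
})
imdb_toy = pd.DataFrame({
    "tconst": ["tt_1", "tt_2", "tt_3"],
    "originalTitle": ["Alpha", "Béta", "Gamma"],
    "primaryTitle": ["Alpha", "Beta", "Gamma"],
    "startYear": [2010, 2011, 2015],
})

# Same order as the notebook: strict keys first, looser keys later.
stages = [
    (["show", "year"], ["originalTitle", "startYear"]),
    (["show", "year"], ["primaryTitle", "startYear"]),
    (["show"], ["originalTitle"]),
    (["show"], ["primaryTitle"]),
]
matched, missing = match_in_stages(netflix_toy, imdb_toy, stages)
print(matched[["show", "tconst"]])
print(missing)
```

Running the stages from strictest to loosest means a show is only matched on title alone once the title-and-year attempts have failed.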

311 titles are still unmatched. The imdb data frame has no other column we can match on, so we will now import a second data frame from the IMDB data base, title.akas, which lists all the alternative titles created for a single title together with its IMDB ID (titleId).

In [17]:
title_regions = pd.read_csv("C:/Users/Admin/Documents/ironhack/title.akas.tsv.gz", sep="\t", low_memory=False)

In [18]:
title_regions.head()

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
0,tt0000001,1,Карменсіта,UA,\N,imdbDisplay,\N,0
1,tt0000001,2,Carmencita,DE,\N,\N,literal title,0
2,tt0000001,3,Carmencita - spanyol tánc,HU,\N,imdbDisplay,\N,0
3,tt0000001,4,Καρμενσίτα,GR,\N,imdbDisplay,\N,0
4,tt0000001,5,Карменсита,RU,\N,imdbDisplay,\N,0
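As the preview shows, this file marks missing values with the literal text `\N`. The sketch below reads two of the rows above from an in-memory string to show two `read_csv` options that may help with a file like this: `na_values` turns `\N` into real nulls, and `usecols` loads only the columns needed. Whether these options suit the full file is an assumption; the notebook itself reads it without them.

```python
import io
import pandas as pd

# Two rows copied from the preview above, in the file's tab-separated layout.
sample_tsv = (
    "titleId\tordering\ttitle\tregion\tlanguage\ttypes\tattributes\tisOriginalTitle\n"
    "tt0000001\t2\tCarmencita\tDE\t\\N\t\\N\tliteral title\t0\n"
    "tt0000001\t3\tCarmencita - spanyol tánc\tHU\t\\N\timdbDisplay\t\\N\t0\n"
)

regions = pd.read_csv(
    io.StringIO(sample_tsv),
    sep="\t",
    na_values="\\N",  # treat the \N marker as a missing value
    usecols=["titleId", "title", "region", "language"],  # skip unused columns
)
print(regions)
```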


We will merge the missing titles with the new title_regions data frame, matching show against title.

In [19]:
netflix_missing4 = netflix_missing4.merge(title_regions, left_on=["show"], right_on=["title"], how="left")

In [20]:
netflix_missing5 = netflix_missing4.loc[
    netflix_missing4["titleId"].isna(), ["show", "year", "rating", "imdb", "rotten_tomatoes"]
].reset_index(drop=True)

In [21]:
netflix_missing5.shape

(198, 5)

Now only 198 titles are unmatched. We will export this data frame and work on it in the next notebook.

## 3. Export data frames

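We will stack the matched rows (those with a non-null ID) from every step to get a final netflix data frame containing the IMDB ID. Rows matched through title_regions carry the ID in titleId rather than tconst.
We will drop duplicates on show because the title-only merges can return several rows per show, for example when title_regions lists the same title in different regions.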

In [22]:
netflix_genres = netflix_genres[~netflix_genres["tconst"].isna()]

In [23]:
# DataFrame.append was removed in pandas 2.0; pd.concat does the same stacking.
netflix_genres = pd.concat([netflix_genres, netflix_missing1[~netflix_missing1["tconst"].isna()]])

In [24]:
netflix_genres = pd.concat([netflix_genres, netflix_missing2[~netflix_missing2["tconst"].isna()]]).drop_duplicates("show")

In [25]:
netflix_genres = pd.concat([netflix_genres, netflix_missing3[~netflix_missing3["tconst"].isna()]]).drop_duplicates("show")

In [26]:
netflix_genres = pd.concat([netflix_genres, netflix_missing4[~netflix_missing4["titleId"].isna()]]).drop_duplicates("show")
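The stacking step can be sketched on toy data with `pd.concat`, which current pandas versions use in place of `DataFrame.append`. Because the rows matched through title_regions have a titleId but no tconst, the sketch also fills a single ID column from whichever of the two is present. The column name `imdb_id`, the frames and the placeholder IDs are hypothetical, and this coalescing step is not part of the notebook.

```python
import pandas as pd

# Invented rows: one matched via tconst, one matched twice via titleId.
by_tconst = pd.DataFrame({"show": ["Alpha"], "tconst": ["tt_1"]})
by_title_id = pd.DataFrame({
    "show": ["Beta", "Beta"],
    "titleId": ["tt_2", "tt_2"],
    "region": ["US", "GB"],
})

# Stack the frames (columns are unioned, gaps become NaN), one row per show.
combined = pd.concat([by_tconst, by_title_id], ignore_index=True).drop_duplicates("show")

# Hypothetical single ID column: tconst where present, otherwise titleId.
combined["imdb_id"] = combined["tconst"].fillna(combined["titleId"])
print(combined[["show", "imdb_id"]])
```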

In [27]:
netflix_genres.columns

Index(['show', 'year', 'rating', 'imdb', 'rotten_tomatoes', 'tconst',
       'titleType', 'primaryTitle', 'originalTitle', 'isAdult', 'startYear',
       'endYear', 'runtimeMinutes', 'genres', 'titleId', 'ordering', 'title',
       'region', 'language', 'types', 'attributes', 'isOriginalTitle'],
      dtype='object')

In [28]:
netflix_genres = netflix_genres.reset_index(drop=True)

In [30]:
netflix_genres.head()

Unnamed: 0,show,year,rating,imdb,rotten_tomatoes,tconst,titleType,primaryTitle,originalTitle,isAdult,...,runtimeMinutes,genres,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
0,Breaking Bad,2008,18+,9.5,96%,tt0903747,tvSeries,Breaking Bad,Breaking Bad,0.0,...,49,"Crime,Drama,Thriller",,,,,,,,
1,Stranger Things,2016,16+,8.8,93%,tt4574334,tvSeries,Stranger Things,Stranger Things,0.0,...,51,"Drama,Fantasy,Horror",,,,,,,,
2,Sherlock,2010,16+,9.1,78%,tt1475582,tvSeries,Sherlock,Sherlock,0.0,...,88,"Crime,Drama,Mystery",,,,,,,,
3,Better Call Saul,2015,18+,8.7,97%,tt3032476,tvSeries,Better Call Saul,Better Call Saul,0.0,...,46,"Crime,Drama",,,,,,,,
4,The Office,2005,16+,8.9,81%,tt0386676,tvSeries,The Office,The Office,0.0,...,22,Comedy,,,,,,,,


In [31]:
# netflix_genres.to_pickle(path + "Data/netflix_ids.pkl")

We will also drop duplicates from the missing titles data frame.

In [32]:
netflix_missing5 = netflix_missing5.drop_duplicates("show")

In [34]:
netflix_missing5.head()

Unnamed: 0,show,year,rating,imdb,rotten_tomatoes
0,YOU,2018,18+,7.8,91%
1,Marvel's Jessica Jones,2015,18+,8.0,83%
2,HAPPY!,2017,18+,8.2,84%
3,Haikyu!!,2014,16+,8.7,
4,F is for Family,2015,18+,8.0,86%
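Some of the titles in this preview are written in capitals, so one possible reason they found no match is a difference in letter case, since `merge` compares strings exactly. This is only a guess about the cause. The sketch below shows how a case-insensitive key could be built before merging; the helper name `normalize_title`, the `title_key` column, the IMDB-side spellings and the placeholder IDs are all assumptions made for illustration.

```python
import pandas as pd

def normalize_title(titles: pd.Series) -> pd.Series:
    """Lower-case (casefold) and trim titles so case differences don't block a match."""
    return titles.str.casefold().str.strip()

# Netflix-side spellings from the preview; IMDB-side rows are invented.
missing_toy = pd.DataFrame({"show": ["YOU", "HAPPY!"]})
imdb_toy = pd.DataFrame({
    "tconst": ["tt_x", "tt_y"],
    "primaryTitle": ["You", "Happy!"],
})

# Build the same normalized key on both sides, then merge on it.
missing_toy["title_key"] = normalize_title(missing_toy["show"])
imdb_toy["title_key"] = normalize_title(imdb_toy["primaryTitle"])
rematched = missing_toy.merge(imdb_toy, on="title_key", how="left")
print(rematched[["show", "primaryTitle", "tconst"]])
```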


In [35]:
# netflix_missing5.to_pickle(path + "Data/netflix_missing.pkl")