In [1]:
import pandas as pd
import numpy as np
from fuzzywuzzy import fuzz

## Disney Merge

#### Goals:

- Merge Disney TV shows with their IMDB IDs


## 1. Import IMDB and Disney data

In [2]:
imdb = pd.read_pickle("../Data/imdb_tv_all.pkl")

disney = pd.read_csv("Data_Hulu_Disney/disney_shows.csv")

We will check how many titles we have with the shape attribute, which returns a tuple with the number of rows and the number of columns.

In [3]:
disney.shape

(179, 5)

## 2. Merge dataframes

We will start by merging on title and year to get the right IMDB ID (tconst), since a title alone could also match a remake.
We use a left join to keep all the original Disney titles.

In [4]:
disney_genres = disney.merge(imdb, left_on=["show", "year"], right_on=["originalTitle", "startYear"], how="left")
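
One way to see how many rows found a match in a left join like this is the indicator argument of merge. The sketch below uses two tiny made-up frames (shows and basics, with placeholder IDs) rather than the real data, so the names and values are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-ins for the show catalogue and the IMDB title basics table.
shows = pd.DataFrame({"show": ["DuckTales", "DuckTales", "JONAS"],
                      "year": [1987, 2017, 2009]})
basics = pd.DataFrame({"tconst": ["id_1987", "id_2017"],  # placeholder IDs
                       "originalTitle": ["DuckTales", "DuckTales"],
                       "startYear": [1987, 2017]})

# Merging on title AND year keeps the original and the remake apart.
merged = shows.merge(basics, left_on=["show", "year"],
                     right_on=["originalTitle", "startYear"],
                     how="left", indicator=True)

# "both" means a match was found, "left_only" means the show is still missing.
match_counts = merged["_merge"].value_counts()
print(match_counts)
```

Because the join is a left join, every row of shows survives, and the unmatched ones simply carry null values in the columns that come from basics.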

From the merged data frame, we will build a new data frame called disney_missing1 with the titles that didn't find a match.

In [5]:
disney_missing1 = disney_genres[disney_genres["tconst"].isna()].reset_index(drop=True)[
    ["show", "year", "rating", "imdb", "rotten_tomatoes"]
]

In [6]:
disney_missing1.shape

(41, 5)

We are missing 41 of the 179 titles.

We will now merge again, this time using primaryTitle instead of originalTitle, since originalTitle may be in the show's original language.

In [7]:
disney_missing1 = disney_missing1.merge(imdb, left_on=["show", "year"], right_on=["primaryTitle", "startYear"], how="left")

We will create another data frame for the titles that weren't matched.

In [8]:
disney_missing2 = disney_missing1[disney_missing1["tconst"].isna()].reset_index(drop=True)[
    ["show", "year", "rating", "imdb", "rotten_tomatoes"]
]

In [9]:
disney_missing2.shape

(39, 5)

We are now missing 39 titles. Since there is nothing else to match on in this data frame, we will import another file from the IMDB data set, which lists all the alternative titles of each title together with its IMDB ID (titleId).

In [10]:
title_regions = pd.read_csv("C:/Users/Admin/Documents/ironhack/title.akas.tsv.gz", sep="\t", low_memory=False)
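
The IMDB files use \N as a placeholder for missing values, as the preview of this file shows. A minimal sketch of how the file could be read so that \N becomes a proper null is given here. It uses a two-row in-memory sample instead of the real file, and the second row (region NA) is made up to show why keep_default_na=False can matter.

```python
import io
import pandas as pd

# Two sample rows in the same tab-separated layout as title.akas.tsv.
# The second row is invented for illustration.
sample = (
    "titleId\tordering\ttitle\tregion\tlanguage\n"
    "tt0000001\t1\tCarmencita\tDE\t\\N\n"
    "tt0000001\t2\tCarmencita\tNA\t\\N\n"
)

akas = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    usecols=["titleId", "title", "region", "language"],  # load only what is needed
    na_values=["\\N"],       # treat the \N placeholder as missing
    keep_default_na=False,   # keep literal strings such as the region code "NA"
)
print(akas)
```

Restricting usecols also reduces memory use, which may help with a file of this size.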

In [11]:
title_regions.head()

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
0,tt0000001,1,Карменсіта,UA,\N,imdbDisplay,\N,0
1,tt0000001,2,Carmencita,DE,\N,\N,literal title,0
2,tt0000001,3,Carmencita - spanyol tánc,HU,\N,imdbDisplay,\N,0
3,tt0000001,4,Καρμενσίτα,GR,\N,imdbDisplay,\N,0
4,tt0000001,5,Карменсита,RU,\N,imdbDisplay,\N,0


We will merge the missing titles with the new title_regions data frame by title name.

In [12]:
disney_missing2 = disney_missing2.merge(title_regions, left_on=["show"], right_on=["title"], how="left")
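
Merging on the title alone can attach several IMDB IDs to one show, because the same title may belong to different works or appear once per region. The toy example below (made-up frames and placeholder IDs) sketches how the row count grows and how ambiguous shows could be listed for review.

```python
import pandas as pd

# Hypothetical missing shows and a tiny alternative-titles table.
missing = pd.DataFrame({"show": ["Jonas", "Unknown Show"], "year": [2009, 2015]})
akas = pd.DataFrame({
    "titleId": ["id_1", "id_1", "id_2"],   # placeholder IDs
    "title": ["Jonas", "Jonas", "Jonas"],
    "region": ["US", "FR", "DE"],
})

merged = missing.merge(akas, left_on="show", right_on="title", how="left")
print(len(missing), "->", len(merged))  # the merge adds rows

# Count the distinct candidate IDs per show; more than one means the match is ambiguous.
candidates = merged.groupby("show")["titleId"].nunique()
ambiguous = candidates[candidates > 1].index.tolist()
print(ambiguous)
```

Shows listed as ambiguous would need an extra criterion, such as the year, before one ID can be trusted.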

In [13]:
disney_missing3 = disney_missing2[disney_missing2["titleId"].isna()].reset_index(drop=True)[
    ["show", "year", "rating", "imdb", "rotten_tomatoes"]
]

In [14]:
disney_missing3.shape

(21, 5)

Now we are missing only 21 titles. We will try to find the remaining title IDs with fuzzywuzzy.

## 3. Find missing titles using fuzzywuzzy

We will first create a list for missing shows.

In [15]:
disney_shows_missing = [show for show in disney_missing3["show"]]

In [16]:
imdb_titles = [title for title in imdb["primaryTitle"]]

We will use the same function we created in 03.2_a_netflix_missing_title_ids to find the most similar title in the IMDB title basics file.

In [17]:
def find_shows(show):
    matches = []

    for title in imdb_titles:
        # compute the similarity ratio, ignoring case
        ratio = fuzz.ratio(title.lower(), show.lower())

        # keep every match with ratio >= 60
        if ratio >= 60:
            matches.append((title, show, ratio))

    # return None if no match was found
    if len(matches) == 0:
        return None
    # otherwise return the title with the highest ratio
    return sorted(matches, key=lambda x: x[2], reverse=True)[0][0]
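
To illustrate the idea without the fuzzywuzzy dependency, here is a minimal sketch that swaps in difflib.SequenceMatcher from the standard library. Its score only approximates fuzz.ratio, and the helper names (similarity, best_match) are made up for this example.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity on a 0-100 scale (approximation of fuzz.ratio)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def best_match(show, candidates, threshold=60):
    """Return the candidate most similar to show, or None if none reaches the threshold."""
    scored = [(similarity(title, show), title) for title in candidates]
    scored = [pair for pair in scored if pair[0] >= threshold]
    if not scored:
        return None
    # max keeps the first candidate when several share the top score
    return max(scored, key=lambda pair: pair[0])[1]


print(best_match("JONAS", ["Gravity Falls", "Jonas"]))
print(best_match("Big Hero 6 The Series", ["Big Hero 6: The Series", "DuckTales"]))
```

A score above the threshold only means the strings look alike, so the returned titles still need a manual check or a second key such as the year.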

In [18]:
disney_missing3["imdb_titles"] = disney_missing3["show"].apply(lambda x: find_shows(x))

In [21]:
disney_missing3.head(10)

Unnamed: 0,show,year,rating,imdb,rotten_tomatoes,imdb_titles
0,High School Musical: The Musical: The Series,2019,7+,7.3,79%,High School Musical: The Musical - The Series
1,Marvel's Ultimate Spider-Man,2012,7+,7.1,,Ultimate Spider-Man
2,Disney Gallery / Star Wars: The Mandalorian,2020,7+,8.4,,Lego Star Wars: The Padawan Menace
3,I am Luna,2016,,6.8,,Hey I am Luna
4,Big Hero 6 The Series,2017,7+,7.1,,Big Hero 6: The Series
5,Dr. K's Exotic Animal ER,2014,7+,7.9,,Dr K's Exotic Animal ER
6,Coop & Cami Ask The World,2018,all,5.8,,Coop and Cami Ask the World
7,Marvel’s Hulk and the Agents of S.M.A.S.H,2013,7+,6.1,,Hulk and the Agents of S.M.A.S.H.
8,JONAS,2009,,4.6,,Jonas
9,Mighty Ducks: The Animated Series,1996,7+,6.4,,It: The Animated Series


From the first 10 missing titles we can see that the function found a good match for most of them, although a few (for example rows 2 and 9) are clearly different shows. We will now take a look at the titles that didn't find a match.

In [22]:
disney_missing3[disney_missing3["imdb_titles"].isna()]

Unnamed: 0,show,year,rating,imdb,rotten_tomatoes,imdb_titles


Every missing title got a candidate match. We will now merge them with their IMDB ID and the year.

## 4. Merge IMDB IDs with missing titles

We will still merge on the year, in order to be sure that we are not merging a remake or a similar title.

In [23]:
disney_missing3 = disney_missing3.merge(imdb, how="left", left_on=["imdb_titles", "year"], right_on=["primaryTitle", "startYear"])

In [24]:
disney_missing3.head()

Unnamed: 0,show,year,rating,imdb,rotten_tomatoes,imdb_titles,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,High School Musical: The Musical: The Series,2019,7+,7.3,79%,High School Musical: The Musical - The Series,tt8510382,tvSeries,High School Musical: The Musical - The Series,High School Musical: The Musical - The Series,0.0,2019.0,2019.0,31.0,"Comedy,Musical"
1,Marvel's Ultimate Spider-Man,2012,7+,7.1,,Ultimate Spider-Man,tt1722512,tvSeries,Ultimate Spider-Man,Ultimate Spider-Man,0.0,2012.0,2012.0,23.0,"Action,Adventure,Animation"
2,Disney Gallery / Star Wars: The Mandalorian,2020,7+,8.4,,Lego Star Wars: The Padawan Menace,,,,,,,,,
3,I am Luna,2016,,6.8,,Hey I am Luna,,,,,,,,,
4,Big Hero 6 The Series,2017,7+,7.1,,Big Hero 6: The Series,tt5515212,tvSeries,Big Hero 6: The Series,Big Hero 6: The Series,0.0,2017.0,2017.0,21.0,"Action,Adventure,Animation"


We will now take a look at the titles that didn't find a match.

In [25]:
disney_missing3[disney_missing3["tconst"].isna()]

Unnamed: 0,show,year,rating,imdb,rotten_tomatoes,imdb_titles,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
2,Disney Gallery / Star Wars: The Mandalorian,2020,7+,8.4,,Lego Star Wars: The Padawan Menace,,,,,,,,,
3,I am Luna,2016,,6.8,,Hey I am Luna,,,,,,,,,
9,Mighty Ducks: The Animated Series,1996,7+,6.4,,It: The Animated Series,,,,,,,,,
10,Spin and Marty,1955,all,8.2,,Desi and Mari,,,,,,,,,
14,Holiday Magic,2017,,7.0,,Holiday Music,,,,,,,,,
15,Buried Secrets of the Bible with Albert Lin,2019,,5.1,,Ancient Secrets of the Bible,,,,,,,,,
16,Lost Treasures of Egypt,2019,,,,Lost Treasures of the Maya,,,,,,,,,
17,Tut's Treasures: Hidden Secrets,2018,,,,Seaside Secrets,,,,,,,,,
18,Awesome Animals,2013,,,,We Move Animals,,,,,,,,,
19,Paradise Islands,2017,,,,Paradise Island,,,,,,,,,


We can see that most of these titles don't have a similar IMDB title, so we will drop these rows.

In [26]:
disney_missing_ids = disney_missing3[~disney_missing3["tconst"].isna()].reset_index(drop=True).drop_duplicates("tconst")

Now we are missing only 10 titles. We will leave these titles out of our project.

## 5. Join data frames with IMDB IDs

We will join all the data frames without null values to get a final Disney data frame containing the IMDB ID.
We will drop duplicates because the title_regions merge may have produced multiple rows, since some titles have the same name in different regions.

In [27]:
disney_genres = disney_genres[~disney_genres["tconst"].isna()]

In [28]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call
disney_genres = pd.concat([disney_genres, disney_missing1[~disney_missing1["tconst"].isna()]])

In [29]:
disney_genres = pd.concat([disney_genres, disney_missing2[~disney_missing2["titleId"].isna()]])

In [30]:
disney_genres = pd.concat([disney_genres, disney_missing_ids])

In [31]:
disney_genres = disney_genres.drop_duplicates("tconst").reset_index(drop=True)
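
One thing worth checking at this step: drop_duplicates treats null keys as equal to each other, so rows that have a titleId but a null tconst may be reduced to a single row. The toy frame below (made-up shows and placeholder IDs) sketches the effect and one possible alternative, which is to deduplicate on a combined ID column instead.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    "show": ["A", "B", "C", "D"],
    "tconst": ["id_1", np.nan, np.nan, "id_1"],
    "titleId": [np.nan, "id_2", "id_3", np.nan],
})

# Null keys count as duplicates of each other, so shows B and C collapse into one row.
by_tconst = frame.drop_duplicates("tconst")
print(by_tconst["show"].tolist())

# Combining both ID columns first keeps one row per distinct ID.
frame["imdb_id"] = frame["tconst"].fillna(frame["titleId"])
by_imdb_id = frame.drop_duplicates("imdb_id")
print(by_imdb_id["show"].tolist())
```

Comparing the row counts before and after the deduplication is a quick way to confirm whether this affects the real data.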

In [32]:
disney_genres.head()

Unnamed: 0,show,year,rating,imdb,rotten_tomatoes,tconst,titleType,primaryTitle,originalTitle,isAdult,...,genres,titleId,ordering,title,region,language,types,attributes,isOriginalTitle,imdb_titles
0,The Mandalorian,2019,7+,8.7,93%,tt8111088,tvSeries,The Mandalorian,The Mandalorian,0.0,...,"Action,Adventure,Sci-Fi",,,,,,,,,
1,The Simpsons,1989,7+,8.7,85%,tt0096697,tvSeries,The Simpsons,The Simpsons,0.0,...,"Animation,Comedy",,,,,,,,,
2,Gravity Falls,2012,7+,8.9,100%,tt1865718,tvSeries,Gravity Falls,Gravity Falls,0.0,...,"Action,Adventure,Animation",,,,,,,,,
3,Star Wars: The Clone Wars,2008,7+,8.2,93%,tt0458290,tvSeries,Star Wars: The Clone Wars,Star Wars: The Clone Wars,0.0,...,"Action,Adventure,Animation",,,,,,,,,
4,DuckTales,2017,7+,8.2,100%,tt5531466,tvSeries,DuckTales,DuckTales,0.0,...,"Action,Adventure,Animation",,,,,,,,,


We will now calculate the share of the original titles that have an IMDB ID match.

In [33]:
len(disney_genres) / len(disney)

0.8491620111731844

This means that we will be using 85% of the original data.

## 6. Cleaning final data

We will create a final data frame including:
- show
- year
- rating
- imdb
- rotten_tomatoes
- imdb_id
- all data from title basics

First, we need to create the imdb_id column, which holds the tconst value where it is available and the titleId value otherwise.

In [34]:
disney_genres["imdb_id"] = np.where(disney_genres["tconst"].isna(), disney_genres["titleId"], disney_genres["tconst"])
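
The same "take tconst, fall back to titleId" logic can also be written with fillna. The sketch below compares both forms on a made-up frame with placeholder IDs.

```python
import numpy as np
import pandas as pd

ids = pd.DataFrame({"tconst": ["id_1", np.nan, np.nan],
                    "titleId": [np.nan, "id_2", np.nan]})

# Form used in the notebook: pick titleId wherever tconst is null.
with_where = np.where(ids["tconst"].isna(), ids["titleId"], ids["tconst"])

# Equivalent form: fill the nulls of tconst with the value from titleId.
with_fillna = ids["tconst"].fillna(ids["titleId"])

print(list(with_where))
print(with_fillna.tolist())
```

A row stays null only when both columns are null, which is what the isna().value_counts() check looks for.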

In [35]:
disney_genres["imdb_id"].isna().value_counts()

False    152
Name: imdb_id, dtype: int64

This means every title in this final data frame has an IMDB ID. We will now remove every column other than show, year, rating, imdb, rotten_tomatoes and imdb_id, and then merge with title basics again.

In [36]:
to_drop = [col for col in disney_genres.columns if col not in ["show", "year", "rating", "imdb", "rotten_tomatoes", "imdb_id"]]

In [37]:
disney_genres = disney_genres.drop(columns=to_drop)

In [38]:
disney_genres.head()

Unnamed: 0,show,year,rating,imdb,rotten_tomatoes,imdb_id
0,The Mandalorian,2019,7+,8.7,93%,tt8111088
1,The Simpsons,1989,7+,8.7,85%,tt0096697
2,Gravity Falls,2012,7+,8.9,100%,tt1865718
3,Star Wars: The Clone Wars,2008,7+,8.2,93%,tt0458290
4,DuckTales,2017,7+,8.2,100%,tt5531466


Now we will merge in all the data from title basics.

In [39]:
disney_genres = disney_genres.merge(imdb, how="left", left_on="imdb_id", right_on="tconst")

In [40]:
disney_genres.head()

Unnamed: 0,show,year,rating,imdb,rotten_tomatoes,imdb_id,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,The Mandalorian,2019,7+,8.7,93%,tt8111088,tt8111088,tvSeries,The Mandalorian,The Mandalorian,0,2019.0,2019.0,30,"Action,Adventure,Sci-Fi"
1,The Simpsons,1989,7+,8.7,85%,tt0096697,tt0096697,tvSeries,The Simpsons,The Simpsons,0,1989.0,1989.0,22,"Animation,Comedy"
2,Gravity Falls,2012,7+,8.9,100%,tt1865718,tt1865718,tvSeries,Gravity Falls,Gravity Falls,0,2012.0,2012.0,23,"Action,Adventure,Animation"
3,Star Wars: The Clone Wars,2008,7+,8.2,93%,tt0458290,tt0458290,tvSeries,Star Wars: The Clone Wars,Star Wars: The Clone Wars,0,2008.0,2008.0,23,"Action,Adventure,Animation"
4,DuckTales,2017,7+,8.2,100%,tt5531466,tt5531466,tvSeries,DuckTales,DuckTales,0,2017.0,2017.0,21,"Action,Adventure,Animation"


In [41]:
disney_genres = disney_genres.drop(columns="tconst")

## 7. Check null and unique values

In [42]:
disney_genres.isna().sum()

show                 0
year                 0
rating              20
imdb                 3
rotten_tomatoes    131
imdb_id              0
titleType            0
primaryTitle         0
originalTitle        0
isAdult              0
startYear            0
endYear              0
runtimeMinutes       0
genres               0
dtype: int64

In [43]:
disney_genres.nunique(axis=0)

show               151
year                36
rating               3
imdb                43
rotten_tomatoes     11
imdb_id            152
titleType            2
primaryTitle       151
originalTitle      151
isAdult              1
startYear           36
endYear             36
runtimeMinutes      20
genres              43
dtype: int64

From the previous cells we can see that:
- Most values are missing for rotten_tomatoes (131 of 152 rows).
- isAdult has just one value.

We will drop these two columns, since they carry little information.

In [44]:
disney_genres = disney_genres.drop(columns=["rotten_tomatoes", "isAdult"])
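
The two checks above (share of nulls and number of distinct values) can be combined into one small summary table. This is a sketch on a made-up frame, and the 0.5 cut-off is an arbitrary choice for the illustration, not a value taken from this notebook.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    "show": ["A", "B", "C", "D"],
    "rotten_tomatoes": ["93%", np.nan, np.nan, np.nan],
    "isAdult": [0, 0, 0, 0],
})

summary = pd.DataFrame({
    "null_share": frame.isna().mean(),  # fraction of missing values per column
    "n_unique": frame.nunique(),        # distinct non-null values per column
})

# Flag columns that are mostly empty or constant (0.5 is an assumed threshold).
mostly_null = summary["null_share"] > 0.5
constant = summary["n_unique"] <= 1
to_drop = summary[mostly_null | constant].index.tolist()
print(to_drop)
```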

## 8. Change data types

- rating: we will remove the '+' sign and convert it to a number
- runtimeMinutes: we will convert it to a number

#### i. rating

In [45]:
disney_genres["rating"] = [str(i).replace("+", "") for i in disney_genres["rating"]]

In [46]:
disney_genres["rating"].value_counts()

all    73
7      57
nan    20
16      2
Name: rating, dtype: int64

Since rating is a string type, we will convert the 'nan' values to null and 'all' to 0, meaning that the series can be watched by all ages.

In [47]:
disney_genres["rating"] = np.where(disney_genres["rating"] == "nan", None, disney_genres["rating"])
disney_genres["rating"] = np.where(disney_genres["rating"] == "all", 0, disney_genres["rating"])

In [48]:
disney_genres["rating"].value_counts()

0     73
7     57
16     2
Name: rating, dtype: int64

In [49]:
disney_genres["rating"] = pd.to_numeric(disney_genres["rating"], errors="coerce")
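
The rating clean-up can also be sketched as one chain on a small made-up series. Because the column contains nulls, to_numeric gives a float result; the last line shows the nullable Int64 type as one option if whole numbers are preferred (an optional step, not something this notebook does).

```python
import pandas as pd

# Made-up sample with the same kinds of values as the rating column.
rating = pd.Series(["7+", "all", None, "16+"], dtype=object)

cleaned = (
    rating.str.replace("+", "", regex=False)  # "7+" -> "7"
          .replace({"all": "0"})               # "all" -> "0", suitable for all ages
)

# Nulls stay null; anything that is not a number would also become null.
as_float = pd.to_numeric(cleaned, errors="coerce")

# Optional: nullable integer type that can hold missing values.
as_int = as_float.astype("Int64")
print(as_int)
```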

#### ii. runtimeMinutes

In [50]:
disney_genres["runtimeMinutes"] = pd.to_numeric(disney_genres["runtimeMinutes"], errors="coerce")

#### iii. Check final data types

In [51]:
disney_genres.dtypes

show               object
year                int64
rating            float64
imdb              float64
imdb_id            object
titleType          object
primaryTitle       object
originalTitle      object
startYear         float64
endYear           float64
runtimeMinutes    float64
genres             object
dtype: object

## 9. Rename rating and imdb columns

In [52]:
disney_genres = disney_genres.rename(columns={"rating":"age", "imdb":"imdb_rating"})

## 10. Export data

In [53]:
# disney_genres.to_pickle("Data_Hulu_Disney/disney_final_clean.pkl")