In [1]:
import pandas as pd
import numpy as np

path = "C:/Users/Admin/Documents/ironhack/streaming_service_recommender/"

## HBO Merge

#### Goals:

- Match each HBO TV show to its IMDB ID (tconst)

We will follow the same steps as in the 03_a_netflix_imdb_merge notebook.

## 1. Import imdb and hbo data

In [2]:
imdb = pd.read_pickle(path + "Data/imdb_tv_all.pkl")

In [3]:
hbo = pd.read_csv(path + "Data/hbo_shows.csv")

In [4]:
hbo.shape

(200, 5)

We have 200 titles from the original data frame.

## 2. Merge dataframes

We will start by merging on both title and year, so that remakes sharing a title do not get the wrong IMDB ID (tconst).
We use a left join so that every original hbo title is kept, whether or not it finds a match.

In [5]:
hbo_genres = hbo.merge(imdb, left_on=["show", "year"], right_on=["originalTitle", "startYear"], how="left")
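
To illustrate why the year is part of the key, here is a minimal sketch on made-up rows. The titles and IDs are invented for the example and do not come from the IMDB file; `indicator=True` is an optional extra that adds a `_merge` column showing which rows found a match.

```python
import pandas as pd

# Toy data: every title and ID below is invented for illustration.
shows = pd.DataFrame({"show": ["Example Show", "Other Show"], "year": [2004, 2010]})
catalog = pd.DataFrame({
    "tconst": ["tt_demo_1", "tt_demo_2", "tt_demo_3"],
    "originalTitle": ["Example Show", "Example Show", "Unrelated"],
    "startYear": [1978, 2004, 2010],
})

# Left join on title and year: the 1978 entry with the same title is ignored.
merged = shows.merge(
    catalog,
    left_on=["show", "year"],
    right_on=["originalTitle", "startYear"],
    how="left",
    indicator=True,  # adds a "_merge" column: "both" or "left_only"
)
print(merged[["show", "year", "tconst", "_merge"]])
```

Rows marked `left_only` are the ones that still need an ID in the following steps.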

After each merge, we collect the titles that did not find a match into a new data frame (hbo_missing1 here) and carry only those forward to the next step.

In [6]:
hbo_missing1 = hbo_genres[hbo_genres["tconst"].isna()].reset_index(drop=True)[
    ["show", "year", "rating", "imdb", "rotten_tomatoes"]
]

In [7]:
hbo_missing1.shape

(41, 5)

We are missing 41 of the 200 titles.
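
The "keep the unmatched rows and restore the original columns" step is repeated after every merge in this notebook. As a sketch only, it could be wrapped in a small helper; `split_matched` and its parameters are hypothetical names that are not used elsewhere in the notebook, and the demo rows are invented.

```python
import pandas as pd

HBO_COLS = ("show", "year", "rating", "imdb", "rotten_tomatoes")

def split_matched(merged, id_col="tconst", keep_cols=HBO_COLS):
    """Split a left-merged frame into (matched, unmatched) rows.

    The unmatched rows are reduced to the original HBO columns so they
    can be merged again in the next step.
    """
    is_missing = merged[id_col].isna()
    matched = merged[~is_missing]
    unmatched = merged.loc[is_missing, list(keep_cols)].reset_index(drop=True)
    return matched, unmatched

# Invented rows standing in for a merged frame.
demo = pd.DataFrame({
    "show": ["A", "B", "C"],
    "year": [2001, 2002, 2003],
    "rating": ["18+", None, "16+"],
    "imdb": [8.0, 7.5, 6.9],
    "rotten_tomatoes": ["90%", None, None],
    "tconst": ["tt_demo_1", None, None],
})
matched, unmatched = split_matched(demo)
print(len(matched), unmatched.shape)
```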

We will now repeat the merge using primaryTitle instead of originalTitle, since originalTitle can be in the show's original language while the hbo list may use a different title.

In [8]:
hbo_missing1 = hbo_missing1.merge(imdb, left_on=["show", "year"], right_on=["primaryTitle", "startYear"], how="left")

In [9]:
hbo_missing2 = hbo_missing1[hbo_missing1["tconst"].isna()].reset_index(drop=True)[
    ["show", "year", "rating", "imdb", "rotten_tomatoes"]
]

In [10]:
hbo_missing2.shape

(34, 5)

We are now missing 34 titles.

We will merge again without the year, since the years obtained through web scraping may not match the ones in IMDB.

In [11]:
hbo_missing2 = hbo_missing2.merge(imdb, left_on=["show"], right_on=["originalTitle"], how="left")

In [12]:
hbo_missing3 = hbo_missing2[hbo_missing2["tconst"].isna()].reset_index(drop=True)[
    ["show", "year", "rating", "imdb", "rotten_tomatoes"]
]

In [13]:
hbo_missing3.shape

(20, 5)
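
Once the year is dropped from the key, one show can match several IMDB entries that share its title. The sketch below, on invented rows, shows one way to spot such cases and one possible tie-break (keeping the candidate whose start year is closest). The tie-break is an assumption for illustration, not what this notebook does, and it assumes startYear is numeric.

```python
import pandas as pd

# Invented rows: two catalog entries share one title.
shows = pd.DataFrame({"show": ["Example Show"], "year": [2005]})
catalog = pd.DataFrame({
    "tconst": ["tt_demo_1", "tt_demo_2"],
    "originalTitle": ["Example Show", "Example Show"],
    "startYear": [1978, 2004],
})

# Title-only left merge: the single show now produces two rows.
merged = shows.merge(catalog, left_on="show", right_on="originalTitle", how="left")

# Count distinct candidate IDs per show; more than one means an ambiguous match.
candidates = merged.groupby("show")["tconst"].nunique()
ambiguous = candidates[candidates > 1]
print(ambiguous)

# Possible tie-break (assumption): keep the candidate closest in year.
merged["year_gap"] = (merged["startYear"] - merged["year"]).abs()
best = merged.sort_values("year_gap").drop_duplicates("show")
print(best[["show", "tconst", "startYear"]])
```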

We are now missing 20 titles. Next we try primaryTitle, again without the year.

In [14]:
hbo_missing3 = hbo_missing3.merge(imdb, left_on=["show"], right_on=["primaryTitle"], how="left")

In [15]:
hbo_missing4 = hbo_missing3[hbo_missing3["tconst"].isna()].reset_index(drop=True)[
    ["show", "year", "rating", "imdb", "rotten_tomatoes"]
]

In [16]:
hbo_missing4.shape

(19, 5)

We are now missing 19 titles. Since the imdb data frame has no other column to match on, we will import another data set from the IMDB database, which lists every alternative title of each entry together with its IMDB ID (titleId).

In [17]:
title_regions = pd.read_csv("C:/Users/Admin/Documents/ironhack/title.akas.tsv.gz", sep="\t", low_memory=False)

In [18]:
title_regions.head()

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
0,tt0000001,1,Карменсіта,UA,\N,imdbDisplay,\N,0
1,tt0000001,2,Carmencita,DE,\N,\N,literal title,0
2,tt0000001,3,Carmencita - spanyol tánc,HU,\N,imdbDisplay,\N,0
3,tt0000001,4,Καρμενσίτα,GR,\N,imdbDisplay,\N,0
4,tt0000001,5,Карменсита,RU,\N,imdbDisplay,\N,0
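
The preview shows the literal string `\N` wherever IMDB has no value. As a hedged sketch, `read_csv` can turn these into NaN with `na_values` and load only the columns needed for the merge with `usecols`, which may reduce memory use on a file this large. The two sample rows are copied from the preview above and read from memory, so the example runs without the real file; the choice of columns is an assumption.

```python
import io
import pandas as pd

# Two rows copied from the preview above, as a stand-in for title.akas.tsv.gz.
sample = (
    "titleId\tordering\ttitle\tregion\tlanguage\ttypes\tattributes\tisOriginalTitle\n"
    "tt0000001\t2\tCarmencita\tDE\t\\N\t\\N\tliteral title\t0\n"
    "tt0000001\t3\tCarmencita - spanyol tánc\tHU\t\\N\timdbDisplay\t\\N\t0\n"
)

akas = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    na_values="\\N",  # treat IMDB's \N placeholder as missing
    usecols=["titleId", "title", "region", "language"],  # assumed subset
)
print(akas.isna().sum())
```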


We will do the merge of the missing titles with the new title_regions data frame by title name.

In [21]:
hbo_missing4 = hbo_missing4.merge(title_regions, left_on=["show"], right_on=["title"], how="left")

In [22]:
hbo_missing5 = hbo_missing4[hbo_missing4["titleId"].isna()].reset_index(drop=True)[
    ["show", "year", "rating", "imdb", "rotten_tomatoes"]
]

In [23]:
hbo_missing5.shape

(15, 5)

Now we are missing just 15 titles. We will export this data frame and work on it in the next notebook.
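
The four merges against imdb in this section follow one pattern: merge, keep the matches, retry the rest with a looser key. As a sketch only, that pattern could be written as a loop over key pairs. `cascade_merge` is a hypothetical helper, the data is invented, and the sketch assumes the two frames share no column names. It does not cover the final title_regions step, which uses a different ID column.

```python
import pandas as pd

def cascade_merge(left, right, key_sets, id_col="tconst"):
    """Try each (left_on, right_on) pair in turn on the rows still unmatched."""
    base_cols = list(left.columns)
    remaining = left
    matched_parts = []
    for left_on, right_on in key_sets:
        merged = remaining.merge(right, left_on=left_on, right_on=right_on, how="left")
        found = merged[id_col].notna()
        if found.any():
            matched_parts.append(merged[found])
        # Carry only the unmatched rows, with their original columns, forward.
        remaining = merged.loc[~found, base_cols].reset_index(drop=True)
    if matched_parts:
        matched = pd.concat(matched_parts, ignore_index=True)
    else:
        matched = left.iloc[0:0]
    return matched, remaining

# Invented rows: A matches on original title and year, B on primary title
# and year, C on title only, D never matches.
shows = pd.DataFrame({"show": ["A", "B", "C", "D"], "year": [2001, 2002, 2003, 2004]})
catalog = pd.DataFrame({
    "tconst": ["tt_demo_1", "tt_demo_2", "tt_demo_3"],
    "originalTitle": ["A", "B (original)", "C"],
    "primaryTitle": ["A", "B", "C"],
    "startYear": [2001, 2002, 1999],
})
key_sets = [
    (["show", "year"], ["originalTitle", "startYear"]),
    (["show", "year"], ["primaryTitle", "startYear"]),
    (["show"], ["originalTitle"]),
    (["show"], ["primaryTitle"]),
]
matched, remaining = cascade_merge(shows, catalog, key_sets)
print(matched[["show", "tconst"]])
print(remaining)
```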

## 3. Export data frames

We will join the matched rows from every step (those without a null ID) to get a final hbo data frame containing the IMDB ID.
We will drop duplicates on show because merging on title alone can return several rows for one show; for example, title_regions lists the same title once per region. drop_duplicates keeps the first row found for each show.

In [24]:
hbo_genres = hbo_genres[~hbo_genres["tconst"].isna()]

In [25]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
hbo_genres = pd.concat([hbo_genres, hbo_missing1[~hbo_missing1["tconst"].isna()]])

In [26]:
hbo_genres = pd.concat([hbo_genres, hbo_missing2[~hbo_missing2["tconst"].isna()]]).drop_duplicates("show")

In [27]:
hbo_genres = pd.concat([hbo_genres, hbo_missing3[~hbo_missing3["tconst"].isna()]]).drop_duplicates("show")

In [28]:
hbo_genres = pd.concat([hbo_genres, hbo_missing4[~hbo_missing4["titleId"].isna()]]).drop_duplicates("show")

In [29]:
hbo_genres.columns

Index(['show', 'year', 'rating', 'imdb', 'rotten_tomatoes', 'tconst',
       'titleType', 'primaryTitle', 'originalTitle', 'isAdult', 'startYear',
       'endYear', 'runtimeMinutes', 'genres', 'titleId', 'ordering', 'title',
       'region', 'language', 'types', 'attributes', 'isOriginalTitle'],
      dtype='object')
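
The column listing contains both tconst and titleId. Rows matched against imdb carry their ID in tconst, while rows matched through title_regions carry it only in titleId, so neither column is complete on its own. A minimal sketch of combining them into one column follows; the column name `imdb_id` and the demo rows are assumptions for illustration.

```python
import pandas as pd

# Invented rows: one matched via imdb (tconst), one via title_regions (titleId).
demo = pd.DataFrame({
    "show": ["A", "B"],
    "tconst": ["tt_demo_1", None],
    "titleId": [None, "tt_demo_2"],
})

# Prefer tconst and fall back to titleId where tconst is missing.
demo["imdb_id"] = demo["tconst"].fillna(demo["titleId"])
print(demo)
```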

In [30]:
hbo_genres = hbo_genres.reset_index(drop=True)

In [31]:
hbo_genres.head()

Unnamed: 0,show,year,rating,imdb,rotten_tomatoes,tconst,titleType,primaryTitle,originalTitle,isAdult,...,runtimeMinutes,genres,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
0,Game of Thrones,2011,18+,9.3,89%,tt0944947,tvSeries,Game of Thrones,Game of Thrones,0.0,...,57,"Action,Adventure,Drama",,,,,,,,
1,The Wire,2002,18+,9.3,94%,tt0306414,tvSeries,The Wire,The Wire,0.0,...,59,"Crime,Drama,Thriller",,,,,,,,
2,Chernobyl,2019,18+,9.4,96%,tt7366338,tvMiniSeries,Chernobyl,Chernobyl,0.0,...,330,"Drama,History,Thriller",,,,,,,,
3,The Sopranos,1999,18+,9.2,92%,tt0141842,tvSeries,The Sopranos,The Sopranos,0.0,...,55,"Crime,Drama",,,,,,,,
4,Band of Brothers,2001,18+,9.4,94%,tt0185906,tvMiniSeries,Band of Brothers,Band of Brothers,0.0,...,594,"Action,Drama,History",,,,,,,,


In [32]:
# hbo_genres.to_pickle(path + "Data/hbo_ids.pkl")

We will also drop duplicates from the missing titles data frame.

In [33]:
hbo_missing5 = hbo_missing5.drop_duplicates("show")

In [34]:
hbo_missing5 = hbo_missing5.reset_index(drop=True)

In [36]:
hbo_missing5.head()

Unnamed: 0,show,year,rating,imdb,rotten_tomatoes
0,Jonah From Tonga,2014,,7.1,80%
1,We Can Be Heroes: Finding The Australian of th...,2005,18+,8.1,
2,Magnifica 70,2015,16+,7.8,
3,The Shop: Uninterrupted,2018,18+,6.6,
4,Arliss,1996,,7.0,


In [38]:
# hbo_missing5.to_pickle(path + "Data/hbo_missing.pkl")