In [1]:
import pandas as pd
import numpy as np

path = "C:/Users/Admin/Documents/ironhack/streaming_service_recommender/"

## Amazon Merge

#### Goals:

- Match each Amazon TV show to its IMDb ID

We will follow the same steps as in the 03_a_netflix_imdb_merge notebook.

## 1. Import imdb and amazon data

In [2]:
imdb = pd.read_pickle(path + "Data/imdb_tv_all.pkl")

In [3]:
amazon = pd.read_csv(path + "Data/amazon_shows.csv")

In [4]:
amazon.shape

(2136, 5)

We have 2136 titles from the original data frame.

## 2. Merge dataframes

We will start by merging on both title and year to get the right IMDb ID (tconst), because remakes can share a title.
We use a left join so that every original Amazon title is kept, whether or not it finds a match.

In [5]:
amazon_genres = amazon.merge(imdb, left_on=["show", "year"], right_on=["originalTitle", "startYear"], how="left")
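
To see at a glance how many rows found a match, pandas can label each row of a left join through the `indicator` parameter of `merge`. The sketch below uses small toy frames (`amazon_toy` and `imdb_toy` are illustrative stand-ins, not the real data) and assumes the year columns on both sides share a compatible dtype.

```python
import pandas as pd

# Toy stand-ins for the real `amazon` and `imdb` frames (illustrative values only).
amazon_toy = pd.DataFrame({
    "show": ["The Wire", "Vikings", "Peep show"],
    "year": [2002, 2013, 2003],
})
imdb_toy = pd.DataFrame({
    "tconst": ["tt0306414", "tt2306299"],
    "originalTitle": ["The Wire", "Vikings"],
    "startYear": [2002, 2013],
})

merged = amazon_toy.merge(
    imdb_toy,
    left_on=["show", "year"],
    right_on=["originalTitle", "startYear"],
    how="left",
    indicator=True,  # adds a `_merge` column: "both" or "left_only"
)

# Count matched ("both") versus unmatched ("left_only") rows.
match_counts = merged["_merge"].value_counts()
print(match_counts)
```

Rows labelled `left_only` are the same rows that have a null `tconst` after the join.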

From the merged data frame, we will build a new data frame called amazon_missing1 that holds the titles that didn't find a match. We will repeat this after each merge step.

In [6]:
amazon_missing1 = amazon_genres[amazon_genres["tconst"].isna()].reset_index(drop=True)[["show", "year", "rating", "imdb"
                                                                                        , "rotten_tomatoes"]]

In [7]:
amazon_missing1.shape

(1033, 5)

1033 of the 2136 titles are still unmatched.
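
Since the same "merge, then keep the unmatched rows" pattern repeats at every step, it can be wrapped in a small helper. This is only a sketch: `match_step`, `AMAZON_COLS` and the toy frames are hypothetical names introduced here, and the titles and IDs are made up.

```python
import pandas as pd

# Columns of the original Amazon data that every "missing" frame carries forward.
AMAZON_COLS = ["show", "year", "rating", "imdb", "rotten_tomatoes"]

def match_step(left, right, left_on, right_on, id_col="tconst"):
    """Left-join `left` to `right` and split the result into matched and unmatched rows."""
    merged = left.merge(right, left_on=left_on, right_on=right_on, how="left")
    has_id = merged[id_col].notna()
    matched = merged[has_id].reset_index(drop=True)
    # Keep only the original Amazon columns so the next step starts clean.
    missing = merged.loc[~has_id, AMAZON_COLS].reset_index(drop=True)
    return matched, missing

# Toy data with invented titles and placeholder IDs.
amazon_toy = pd.DataFrame({
    "show": ["Show A", "Foreign Show", "Unlisted Show"],
    "year": [2010, 2017, 2020],
    "rating": ["16+", "18+", "7+"],
    "imdb": [8.0, 8.2, 7.1],
    "rotten_tomatoes": ["90%", "93%", "80%"],
})
imdb_toy = pd.DataFrame({
    "tconst": ["tt_a", "tt_b"],
    "primaryTitle": ["Show A", "Foreign Show"],
    "originalTitle": ["Show A", "Titulo Original"],
    "startYear": [2010, 2017],
})

# Step 1: original title + year. Step 2: primary title + year, on what is left.
matched1, missing1 = match_step(amazon_toy, imdb_toy,
                                ["show", "year"], ["originalTitle", "startYear"])
matched2, missing2 = match_step(missing1, imdb_toy,
                                ["show", "year"], ["primaryTitle", "startYear"])
print(len(matched1), len(matched2), len(missing2))
```

Each call returns the rows to keep and the rows to pass to the next, looser matching rule.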

We will now repeat the merge using primaryTitle instead of originalTitle, since originalTitle may hold the title in its original language while the Amazon data uses the English one.

In [9]:
amazon_missing1 = amazon_missing1.merge(imdb, left_on=["show", "year"], right_on=["primaryTitle", "startYear"], how="left")

In [10]:
amazon_missing2 = amazon_missing1[amazon_missing1["tconst"].isna()].reset_index(drop=True)[["show", "year", "rating"
                                                                                            , "imdb", "rotten_tomatoes"]]

In [11]:
amazon_missing2.shape

(968, 5)

We are now missing 968 titles.

We will merge again without the year, since the years taken from web scraping may not agree with IMDb's startYear.

In [12]:
amazon_missing2 = amazon_missing2.merge(imdb, left_on=["show"], right_on=["originalTitle"], how="left")

In [13]:
amazon_missing3 = amazon_missing2[amazon_missing2["tconst"].isna()].reset_index(drop=True)[["show", "year", "rating"
                                                                                            , "imdb", "rotten_tomatoes"]]

In [14]:
amazon_missing3.shape

(724, 5)
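
Merging on the title alone can attach several IMDb IDs to one show whenever a title was reused, for example by a remake. The toy example below (invented titles and placeholder IDs) shows the row count growing and one possible tie-break, which is an assumption of this sketch rather than what the notebook does: keep the candidate whose startYear is closest to the scraped year. It assumes both year columns are numeric.

```python
import pandas as pd

amazon_toy = pd.DataFrame({"show": ["Remade Show", "Unique Show"],
                           "year": [2019, 2015]})
imdb_toy = pd.DataFrame({
    "tconst": ["tt_old", "tt_new", "tt_unique"],
    "originalTitle": ["Remade Show", "Remade Show", "Unique Show"],
    "startYear": [1985, 2019, 2015],
})

merged = amazon_toy.merge(imdb_toy, left_on="show", right_on="originalTitle", how="left")
# The two-row input grows because "Remade Show" matches two IMDb entries.
n_rows = len(merged)

# Every show that picked up more than one candidate ID.
ambiguous = merged[merged.duplicated("show", keep=False)]

# Hypothetical tie-break: prefer the smallest gap between startYear and scraped year.
merged["year_gap"] = (merged["startYear"] - merged["year"]).abs()
best = (merged.sort_values("year_gap")
              .drop_duplicates("show")
              .sort_values("show")
              .reset_index(drop=True))
print(best[["show", "tconst"]])
```

A plain `drop_duplicates("show")` would instead keep whichever candidate happens to come first.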

We are now missing 724 titles. We will now try using the primaryTitle.

In [15]:
amazon_missing3 = amazon_missing3.merge(imdb, left_on=["show"], right_on=["primaryTitle"], how="left")

In [16]:
amazon_missing4 = amazon_missing3[amazon_missing3["tconst"].isna()].reset_index(drop=True)[["show", "year", "rating"
                                                                                            , "imdb", "rotten_tomatoes"]]

In [17]:
amazon_missing4.shape

(713, 5)

We are now missing 713 titles. Since this data frame has no other column to match on, we will import another file from the IMDb database, which lists every alternative title of each entry together with its IMDb ID (titleId).

In [18]:
title_regions = pd.read_csv("C:/Users/Admin/Documents/ironhack/title.akas.tsv.gz", sep="\t", low_memory=False)

In [19]:
title_regions.head()

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
0,tt0000001,1,Карменсіта,UA,\N,imdbDisplay,\N,0
1,tt0000001,2,Carmencita,DE,\N,\N,literal title,0
2,tt0000001,3,Carmencita - spanyol tánc,HU,\N,imdbDisplay,\N,0
3,tt0000001,4,Καρμενσίτα,GR,\N,imdbDisplay,\N,0
4,tt0000001,5,Карменсита,RU,\N,imdbDisplay,\N,0
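
The preview shows `\N` where a value is missing. As an optional variation, `read_csv` can turn that marker into real nulls with `na_values` and load only the needed columns with `usecols`. The sketch below reads a three-row excerpt from an in-memory string instead of the real file; the choice of columns is an assumption about what later steps need.

```python
import io
import pandas as pd

# A three-row excerpt in the same tab-separated layout as the preview above.
sample = (
    "titleId\tordering\ttitle\tregion\tlanguage\ttypes\tattributes\tisOriginalTitle\n"
    "tt0000001\t1\tКарменсіта\tUA\t\\N\timdbDisplay\t\\N\t0\n"
    "tt0000001\t2\tCarmencita\tDE\t\\N\t\\N\tliteral title\t0\n"
    "tt0000001\t3\tCarmencita - spanyol tánc\tHU\t\\N\timdbDisplay\t\\N\t0\n"
)

regions_toy = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    usecols=["titleId", "title", "region", "language"],  # skip columns we don't use
    na_values="\\N",                                     # read the \N marker as NaN
)
print(regions_toy)
```

Loading fewer columns also reduces memory use, which matters for a file with one row per title and region.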


We will do the merge of the missing titles with the new title_regions data frame by title name.

In [20]:
amazon_missing4 = amazon_missing4.merge(title_regions, left_on=["show"], right_on=["title"], how="left")

In [21]:
amazon_missing5 = amazon_missing4[amazon_missing4["titleId"].isna()].reset_index(drop=True)[["show", "year", "rating"
                                                                                             , "imdb", "rotten_tomatoes"]]

In [22]:
amazon_missing5.shape

(596, 5)

We are now missing only 596 titles. We will export this data frame and work on it in the next notebook.

## 3. Export data frames

We will combine the matched rows (those with a non-null ID) from every step into a final amazon data frame containing the IMDB ID.
We will drop duplicates on the show name, because the title-only merges and the title regions merge can return several rows per show when the same title exists for different entries or regions.

In [23]:
amazon_genres = amazon_genres[~amazon_genres["tconst"].isna()]

In [24]:
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
amazon_genres = pd.concat([amazon_genres, amazon_missing1[~amazon_missing1["tconst"].isna()]])

In [25]:
amazon_genres = pd.concat([amazon_genres, amazon_missing2[~amazon_missing2["tconst"].isna()]]).drop_duplicates("show")

In [26]:
amazon_genres = pd.concat([amazon_genres, amazon_missing3[~amazon_missing3["tconst"].isna()]]).drop_duplicates("show")

In [27]:
amazon_genres = pd.concat([amazon_genres, amazon_missing4[~amazon_missing4["titleId"].isna()]]).drop_duplicates("show")
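
The step-by-step combination above can also be sketched as a single `pd.concat` call followed by one `drop_duplicates`. The frames `step1` to `step3` below are toy stand-ins with placeholder IDs. Note that `drop_duplicates` keeps the first row per show, so the order of the list decides which match wins.

```python
import pandas as pd

# Toy matched rows from three successive steps (placeholder IDs).
step1 = pd.DataFrame({"show": ["Show A"], "tconst": ["tt_a"]})
step2 = pd.DataFrame({"show": ["Show B", "Show B"], "tconst": ["tt_b1", "tt_b2"]})
step3 = pd.DataFrame({"show": ["Show A", "Show C"], "tconst": ["tt_a_other", "tt_c"]})

combined = (
    pd.concat([step1, step2, step3], ignore_index=True)
    .drop_duplicates("show")   # keep="first" is the default
    .reset_index(drop=True)
)
print(combined)
```

Placing the strictest match first means a title-and-year match takes precedence over a later title-only match for the same show.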

In [28]:
amazon_genres.columns

Index(['show', 'year', 'rating', 'imdb', 'rotten_tomatoes', 'tconst',
       'titleType', 'primaryTitle', 'originalTitle', 'isAdult', 'startYear',
       'endYear', 'runtimeMinutes', 'genres', 'titleId', 'ordering', 'title',
       'region', 'language', 'types', 'attributes', 'isOriginalTitle'],
      dtype='object')
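
The combined frame carries the ID in two columns: tconst for rows matched against imdb and titleId for rows matched against title_regions. If a single ID column is wanted, one possible approach is to fill the gaps in tconst from titleId. The column name `imdb_id` and the toy rows are hypothetical.

```python
import numpy as np
import pandas as pd

combined_toy = pd.DataFrame({
    "show": ["The Wire", "Some Regional Title"],
    "tconst": ["tt0306414", np.nan],         # set by the merges against imdb
    "titleId": [np.nan, "tt_placeholder"],   # set by the merge against title_regions
})

# Take tconst where present, otherwise fall back to titleId.
combined_toy["imdb_id"] = combined_toy["tconst"].fillna(combined_toy["titleId"])
print(combined_toy[["show", "imdb_id"]])
```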

In [30]:
amazon_genres = amazon_genres.reset_index(drop=True)

In [31]:
amazon_genres.head()

Unnamed: 0,show,year,rating,imdb,rotten_tomatoes,tconst,titleType,primaryTitle,originalTitle,isAdult,...,runtimeMinutes,genres,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
0,The Wire,2002,18+,9.3,94%,tt0306414,tvSeries,The Wire,The Wire,0.0,...,59,"Crime,Drama,Thriller",,,,,,,,
1,The Sopranos,1999,18+,9.2,92%,tt0141842,tvSeries,The Sopranos,The Sopranos,0.0,...,55,"Crime,Drama",,,,,,,,
2,Band of Brothers,2001,18+,9.4,94%,tt0185906,tvMiniSeries,Band of Brothers,Band of Brothers,0.0,...,594,"Action,Drama,History",,,,,,,,
3,Vikings,2013,18+,8.6,93%,tt2306299,tvSeries,Vikings,Vikings,0.0,...,44,"Action,Adventure,Drama",,,,,,,,
4,Mr. Robot,2015,18+,8.5,94%,tt4158110,tvSeries,Mr. Robot,Mr. Robot,0.0,...,49,"Crime,Drama,Thriller",,,,,,,,


In [32]:
# amazon_genres.to_pickle(path + "Data/amazon_ids.pkl")

We will also drop duplicates from the missing titles data frame.

In [33]:
amazon_missing5 = amazon_missing5.drop_duplicates("show")

In [34]:
amazon_missing5.head()

Unnamed: 0,show,year,rating,imdb,rotten_tomatoes
0,Peep show,2003,18+,8.6,96%
1,The Test: A New Era For Australia's Team,2020,,9.0,
2,Made In Abyss,2017,18+,8.4,
3,Chacha Vidhayak Hain Humare,2018,,8.0,
4,Darker than Black,2007,16+,7.8,
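
Some of the remaining titles may differ from IMDb only in capitalization or spacing. A hedged sketch of a normalized match key follows; `normalize_title` is a hypothetical helper, the IMDb spellings "Peep Show" and "Made in Abyss" are assumptions, and the IDs are placeholders.

```python
import pandas as pd

def normalize_title(titles: pd.Series) -> pd.Series:
    """Case-fold, strip surrounding whitespace, and collapse inner whitespace runs."""
    return (titles.str.casefold()
                  .str.strip()
                  .str.replace(r"\s+", " ", regex=True))

missing_toy = pd.DataFrame({"show": ["Peep show", "Made In Abyss"]})
imdb_toy = pd.DataFrame({
    "tconst": ["tt_placeholder_1", "tt_placeholder_2"],
    "primaryTitle": ["Peep Show", "Made in Abyss"],   # assumed IMDb spellings
})

# Build the same normalized key on both sides, then join on it.
missing_toy["key"] = normalize_title(missing_toy["show"])
imdb_toy["key"] = normalize_title(imdb_toy["primaryTitle"])
rematched = missing_toy.merge(imdb_toy, on="key", how="left")
print(rematched[["show", "primaryTitle", "tconst"]])
```

A looser key increases the chance of false matches, so the results would still need a spot check.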


In [35]:
# amazon_missing5.to_pickle(path + "Data/amazon_missing.pkl")