In [1]:
import pandas as pd

# Cleaning Sample History

In [2]:
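# Load the sample viewing history; rename the raw column to "History" so that
# "Title" is free for the parsed title later on.
df = pd.read_csv("Sample-History.csv")
df = df.rename(columns={"Title": "History"})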

In [3]:
df.head()

Unnamed: 0,History,Date
0,Better Call Saul: Season 5: Something Unforgiv...,15/02/23
1,Better Call Saul: Season 5: Bad Choice Road,15/02/23
2,Better Call Saul: Season 5: Bagman,15/02/23
3,Better Call Saul: Season 5: JMM,15/02/23
4,Better Call Saul: Season 5: Dedicado a Max,14/02/23


In [4]:
# Split on the first two ': ' separators into Title, Season, and Episode.
df[['Title', 'Season', 'Episode']] = df['History'].str.split(': ', expand=True, n=2)

In [5]:
# Entries with fewer than two ': ' separators have no Episode; treat them as movies.
df['Type'] = df['Episode'].apply(lambda x: 'Movie' if pd.isna(x) else 'TV')

In [6]:
df.head(50)

Unnamed: 0,History,Date,Title,Season,Episode,Type
0,Better Call Saul: Season 5: Something Unforgiv...,15/02/23,Better Call Saul,Season 5,Something Unforgivable,TV
1,Better Call Saul: Season 5: Bad Choice Road,15/02/23,Better Call Saul,Season 5,Bad Choice Road,TV
2,Better Call Saul: Season 5: Bagman,15/02/23,Better Call Saul,Season 5,Bagman,TV
3,Better Call Saul: Season 5: JMM,15/02/23,Better Call Saul,Season 5,JMM,TV
4,Better Call Saul: Season 5: Dedicado a Max,14/02/23,Better Call Saul,Season 5,Dedicado a Max,TV
5,Better Call Saul: Season 5: Namaste,13/02/23,Better Call Saul,Season 5,Namaste,TV
6,Better Call Saul: Season 5: The Guy for This,13/02/23,Better Call Saul,Season 5,The Guy for This,TV
7,Better Call Saul: Season 5: 50% Off,12/02/23,Better Call Saul,Season 5,50% Off,TV
8,Better Call Saul: Season 5: Magic Man,11/02/23,Better Call Saul,Season 5,Magic Man,TV
9,Better Call Saul: Season 4: Winner,10/02/23,Better Call Saul,Season 4,Winner,TV


Right now, a movie with a colon in its title has only the text before the first colon in `Title` (the rest lands in `Season`). I correct that below:

In [7]:
# Take explicit copies so the assignments below modify independent frames
# rather than slices of df (this avoids pandas' SettingWithCopyWarning).
tv = df[df['Type'] != 'Movie'].copy()
movies = df[df['Type'] == 'Movie'].copy()
# For movies, the full History string is the title, and there is no season.
movies['Title'] = movies['History']
movies['Season'] = None
df = pd.concat([tv, movies], ignore_index=True)


In [8]:
movies.head()

Unnamed: 0,History,Date,Title,Season,Episode,Type
48,Doctor G,27/12/22,Doctor G,,,Movie
49,Glass Onion: A Knives Out Mystery,25/12/22,Glass Onion: A Knives Out Mystery,,,Movie
50,Love Today,04/12/22,Love Today,,,Movie
61,Delhi Belly,04/08/22,Delhi Belly,,,Movie
80,Major (Telugu),03/07/22,Major (Telugu),,,Movie


In [9]:
df[df['Season']=='The Last Airbender']

Unnamed: 0,History,Date,Title,Season,Episode,Type
48,Avatar: The Last Airbender: Book 1: The Boy in...,30/09/22,Avatar,The Last Airbender,Book 1: The Boy in the Iceberg,TV


Issues:

1. TV shows that have a colon in their title need to be handled differently. See `Avatar: The Last Airbender`, which is currently split into the title `Avatar` and the season `The Last Airbender`.
    - Should we call `split()` only on `': Season'`, `': Series'`, `': Book'`, etc.? The obvious issue is that an exhaustive list of such keywords would be hard to compile; other examples include Volume and Chapter.
    - Alternatively, we could keep a list of all TV shows in the Netflix dataset whose titles contain a colon. If a history entry matches a show in that list, its first colon would be excluded from the split. The obvious issue is how to determine that an entry belongs to a listed show, since this has to happen before any cleaning.
    - A potential solution is to split on colon occurrences in reverse order: split only on the last two colons and leave any earlier ones alone.
    
2. Movie titles may not exactly match the Netflix dataset. See `Major (Telugu)`, which appears in the Netflix dataset simply as `Major`, in 3 separate rows for 3 different languages.
    - I think the best solution is to remove the `([language])` substring from the sample dataset and drop duplicates in the Netflix dataset (since the duplicates differ only in language).
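
The reverse-order idea from the last bullet under issue 1 can be sketched in plain Python with `str.rsplit`. This is a minimal sketch, not the final cleaning step: `split_history` is a hypothetical helper, and it assumes that season labels and episode titles never contain `': '` themselves. A movie with two colons in its title would still be misread as a TV episode.

```python
def split_history(history, sep=": "):
    """Split a viewing-history entry into (title, season, episode).

    Hypothetical helper: splits on at most the last two separators, so any
    earlier colons stay inside the title.
    """
    parts = history.rsplit(sep, 2)
    if len(parts) == 3:
        title, season, episode = parts
        return title, season, episode
    # Fewer than two separators: treat the entry as a movie and keep the
    # full string as the title.
    return history, None, None


examples = [
    "Better Call Saul: Season 5: Bagman",
    "Avatar: The Last Airbender: Book 1: The Boy in the Iceberg",
    "Glass Onion: A Knives Out Mystery",
    "Doctor G",
]
results = [split_history(h) for h in examples]
for row in results:
    print(row)
```

pandas exposes the same operation as `Series.str.rsplit`, so `df['History'].str.rsplit(': ', n=2, expand=True)` should be the column-wise counterpart of the `str.split` call in cell 4. Rows with fewer than three parts would still need the movie handling from cell 7.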

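The fix proposed for issue 2 could look roughly like the sketch below. It assumes that a language tag is a single parenthesised alphabetic word at the very end of the title; `strip_language` and `drop_duplicate_titles` are hypothetical helpers, and `netflix_titles` is a made-up stand-in for the Netflix dataset.

```python
import re

# Assumption: a language tag is one parenthesised alphabetic word at the very
# end of the title, as in "Major (Telugu)". Other trailing tags of the same
# shape would be stripped as well.
LANGUAGE_SUFFIX = re.compile(r"\s*\([A-Za-z]+\)\s*$")


def strip_language(title):
    """Remove a trailing "(Language)" tag from a title (hypothetical helper)."""
    return LANGUAGE_SUFFIX.sub("", title)


def drop_duplicate_titles(titles):
    """Keep the first occurrence of each title, preserving order."""
    return list(dict.fromkeys(titles))


history_titles = ["Doctor G", "Major (Telugu)", "Delhi Belly"]
cleaned = [strip_language(t) for t in history_titles]

# Made-up stand-in for the Netflix dataset, where "Major" appears once per language.
netflix_titles = ["Major", "Major", "Major", "Delhi Belly"]
unique_netflix = drop_duplicate_titles(netflix_titles)

print(cleaned)
print(unique_netflix)
```

On the actual DataFrames, the same two steps would presumably map to `Series.str.replace(..., regex=True)` on the sample dataset's `Title` column and `DataFrame.drop_duplicates(subset=...)` on the Netflix dataset's title column, whose name I have not checked here.
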
# Aggregation