# Preprocessing

In [32]:
import pandas as pd
import seaborn as sns

In [19]:
df= pd.read_csv("data/netflix-rotten-tomatoes-metacritic-imdb.csv")

Here is the fraction of values in each column that are NaN (multiply by 100 for percentages).

In [20]:
df.isna().sum() / df.shape[0]

Title                    0.000000
Genre                    0.110465
Tags                     0.004328
Languages                0.126227
Series or Movie          0.000000
Hidden Gem Score         0.135724
Country Availability     0.001227
Runtime                  0.000065
Director                 0.304134
Writer                   0.279716
Actors                   0.124354
View Rating              0.453747
IMDb Score               0.135594
Rotten Tomatoes Score    0.587726
Metacritic Score         0.719897
Awards Received          0.607558
Awards Nominated For     0.505103
Boxoffice                0.741150
Release Date             0.136111
Netflix Release Date     0.000000
Production House         0.667377
Netflix Link             0.000000
IMDb Link                0.148773
Summary                  0.000581
IMDb Votes               0.135724
Image                    0.000000
Poster                   0.235013
TMDb Trailer             0.535271
Trailer Site             0.535271
dtype: float64
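
To make the worst columns easier to spot, the same figures can be shown as sorted percentages. This is a minimal sketch on a hypothetical toy frame (the name `toy` and its values are made up for illustration), not the notebook's actual data.

```python
import pandas as pd

# Hypothetical frame; None marks a missing value
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Boxoffice": [None, None, None, "$1"],
    "Genre": ["Drama", None, "Comedy", "Action"],
})

# isna().mean() is the fraction of missing rows per column;
# scale to percent and sort so the sparsest columns come first
missing_pct = toy.isna().mean().mul(100).sort_values(ascending=False)
print(missing_pct.round(1))
```

On the real data the equivalent call would be `df.isna().mean().mul(100).sort_values(ascending=False)`.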

### Cleaning up tags

We need to convert the comma-separated fields into a more suitable representation.

[This Stack Exchange question on encoding tags for a random forest](https://datascience.stackexchange.com/questions/85488/encoding-tags-for-random-forest) seems relevant.

After that we will need to reduce the dimensionality, perhaps with [the techniques described in this article](https://medium.com/codex/dimensionality-reduction-techniques-for-categorical-continuous-data-75d2bca53100).
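
One possible encoding, sketched here as an assumption rather than a settled choice, is a multi-hot matrix with one 0/1 column per tag. The toy data and the names `toy_tags` and `multi_hot` are hypothetical. With around a thousand distinct tags the real matrix would be wide, which is why dimensionality reduction comes up next.

```python
import pandas as pd

# Hypothetical stand-in for a column of tag lists; None marks a title without tags
toy_tags = pd.Series(
    [["Dramas", "Comedies"], ["Thrillers"], None],
    name="Tags",
)

# One row per (title, tag) pair; titles without tags are dropped for now
exploded = toy_tags.dropna().explode()

# Cross-tabulate title index against tag to get one 0/1 column per tag
multi_hot = pd.crosstab(exploded.index, exploded).clip(upper=1)

# Re-align to the original index so titles without tags become all-zero rows
multi_hot = multi_hot.reindex(toy_tags.index, fill_value=0)
print(multi_hot)
```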

In [21]:
# Convert comma separated fields to lists (rows that are NaN stay NaN)
# Note: splitting on ',' alone keeps any space after the comma, e.g. ' Drama'
cols_to_clean= ['Genre', 'Tags', 'Languages', 'Country Availability', 'Director', 'Writer', 'Actors', 'Production House']
for col in cols_to_clean:
  df[col]= df[col].dropna().map(lambda x: x.split(','))

In [22]:
print(df['Tags'])

0        [Comedy Programmes, Romantic TV Comedies, Horr...
1        [Dramas, Comedies, Films Based on Books, British]
2                                              [Thrillers]
3          [TV Dramas, Romantic TV Dramas, Dutch TV Shows]
4        [Social Issue Dramas, Teen Movies, Dramas, Com...
                               ...                        
15475    [TV Dramas, TV Programmes, TV Comedies, Romant...
15476    [Animal Tales, Family Comedies, Family Adventu...
15477    [TV Comedies, Kids TV, Animal Tales, TV Cartoo...
15478    [TV Comedies, Kids TV, TV Cartoons, TV Program...
15479    [TV Comedies, Kids TV, Animal Tales, TV Cartoo...
Name: Tags, Length: 15480, dtype: object


In [31]:
# count how often each unique tag occurs
unique_tags= {}
for x in df['Tags'].dropna():
  for tag in x:
    if tag in unique_tags:
      unique_tags[tag] += 1
    else:
      unique_tags[tag]= 1

tags= pd.Series(unique_tags).sort_values(ascending=False)
tags

Dramas                      4558
Comedies                    4168
Action & Adventure          2094
TV Dramas                   1207
International Movies        1198
                            ... 
Educación y orientación        1
TV para niños                  1
Historias de animales          1
Dibujos animados               1
Programas de TV y series       1
Length: 1003, dtype: int64
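
As an aside, the counting loop can probably be replaced by pandas built-ins. The sketch below uses hypothetical toy data (`toy_tags`) and should give the same counts as the loop, though the order of tied tags may differ.

```python
import pandas as pd

# Hypothetical tag lists; None marks a title without tags
toy_tags = pd.Series([["Dramas", "Comedies"], ["Dramas"], None, ["Thrillers", "Dramas"]])

# explode() gives one row per tag; value_counts() tallies them, most frequent first
tag_counts = toy_tags.dropna().explode().value_counts()
print(tag_counts)
```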

In [46]:
# count how often each unique value occurs in several list columns
uniques= {}
for col in ['Tags', 'Genre', 'Director', 'Actors']:
  uniques[col]= {}
  for x in df[col].dropna():
    for tag in x:
      if tag in uniques[col]:
        uniques[col][tag] += 1
      else:
        uniques[col][tag]= 1

  uniques[col]= pd.Series(uniques[col]).sort_values(ascending=False)
uniques

{'Tags': Dramas                      4558
 Comedies                    4168
 Action & Adventure          2094
 TV Dramas                   1207
 International Movies        1198
                             ... 
 Educación y orientación        1
 TV para niños                  1
 Historias de animales          1
 Dibujos animados               1
 Programas de TV y series       1
 Length: 1003, dtype: int64,
 'Genre':  Drama          3792
 Comedy          3407
  Thriller       2634
 Drama           2567
  Romance        2338
 Action          2182
  Comedy         1670
 Animation       1649
  Fantasy        1529
  Adventure      1380
  Family         1316
  Crime          1216
  Sci-Fi         1183
  Mystery        1122
 Documentary     1028
 Crime            716
  Horror          696
  Action          628
  History         515
 Biography        433
 Adventure        429
  Music           398
 Horror           374
  Sport           357
  War             328
 Short            234
  Musica
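
The `Genre` counts above list `Drama` and ` Drama` (with a leading space) as separate entries, which suggests the values are separated by a comma plus a space and the split keeps that space. A minimal sketch of stripping each piece follows; the sample strings and the names `raw`, `naive` and `cleaned` are hypothetical.

```python
import pandas as pd

# Hypothetical raw values in the same comma-plus-space style as the Genre column
raw = pd.Series(["Drama, Comedy", "Comedy, Drama, Romance", None])

# Plain split(',') keeps the leading space, so ' Comedy' and 'Comedy' differ
naive = raw.dropna().map(lambda x: x.split(","))

# Stripping each piece makes both spellings count as the same label
cleaned = raw.dropna().map(lambda x: [part.strip() for part in x.split(",")])

print(naive.explode().value_counts())
print(cleaned.explode().value_counts())
```

If this is applied to the real columns, the counts reported in this notebook would change, since the split entries would be merged.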

We can probably collapse the least popular tags into some kind of `Other`.

Tags that appear fewer than 10 times account for about 1,300 tag occurrences in total. Combining them into `Other` would make it one of our more common tags (fourth, between `Action & Adventure` at 2094 and `TV Dramas` at 1207). Not sure if this is acceptable.
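
A minimal sketch of the collapsing step, on hypothetical toy data. The names `tag_lists`, `min_count` and `collapse` are made up for illustration, and the threshold is lowered to 2 so the small example shows the effect.

```python
import pandas as pd

# Hypothetical tag lists, one list per title
tag_lists = pd.Series([
    ["Dramas", "Comedies"],
    ["Dramas", "Dibujos animados"],
    ["Comedies", "TV para niños"],
])
min_count = 2  # the text discusses 10 for the real data

# Count occurrences per tag and collect the ones below the threshold
counts = tag_lists.explode().value_counts()
rare = set(counts[counts < min_count].index)

def collapse(tags):
    # Replace rare tags with 'Other', keeping each label at most once per title
    out = []
    for tag in tags:
        label = "Other" if tag in rare else tag
        if label not in out:
            out.append(label)
    return out

collapsed = tag_lists.map(collapse)
print(collapsed.tolist())
```

Because `Other` is kept at most once per title here, its final count would be at most the raw occurrence total, and lower whenever a title carries several rare tags.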

In [50]:
unique_tags= uniques['Tags']
unique_genres= uniques['Genre']

In [47]:
uniques['Tags'][uniques['Tags'] < 10].sum()

np.int64(1312)

In [49]:
uniques['Genre'][uniques['Genre'] < 10].sum()

np.int64(17)
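
Note that these sums are total occurrences of the rare labels, not the number of distinct rare labels. The sketch below shows the difference on a hypothetical counts Series shaped like `uniques['Tags']`; the values are made up.

```python
import pandas as pd

# Hypothetical counts Series: label -> number of occurrences
counts = pd.Series({"Dramas": 40, "Comedies": 25, "Dibujos animados": 3, "TV para niños": 1})

rare_mask = counts < 10

# How many distinct labels fall below the threshold
n_rare_labels = int(rare_mask.sum())

# How many times those labels appear in total (what the cells above compute)
n_rare_occurrences = int(counts[rare_mask].sum())

print(n_rare_labels, n_rare_occurrences)
```

On the real data, `(uniques['Tags'] < 10).sum()` would give the number of distinct rare tags.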