<h2>Challenge 3 - Machine Learning</h2>

<h4>Objective</h4>
Build a classification model (supervised learning > classification) from a selected dataset.

<h3>Group 7</h3>
<ul>
    <li>Ignacio Mendieta</li>
    <li>Laura Jazmín Chao</li>
    <li>Juan Nicolás Capistrano</li>
    <li>Betiana Srur</li>
    <li>Marecelo Carrizo</li>
    
</ul>
<h3>Data exploration and cleaning for multilabel classification</h3>
    

<a id="section_toc"></a> 
<h2> Table of Contents </h2>

[Libraries](#section_import)

[Dataset](#section_dataset)

[Exploration](#section_exploration)

[Merging the text columns](#section_strings)

[Document cleaning](#section_docs_preprocessing)

[Target column cleaning](#section_target_preprocessing)

[Multilabel binarization](#section_binarizer)

[Data export](#section_export)


<a id="section_import"></a> 
<h3>Libraries</h3>

[back to TOC](#section_toc)

In [1]:
import pandas as pd
import numpy as np
from langdetect import detect
import re

<a id="section_dataset"></a> 
<h3>Dataset</h3>

[back to TOC](#section_toc)

In [2]:
pd.set_option('display.max_columns', 100) # Show all columns
# pd.set_option('display.max_rows', 100) # Show all rows

In [3]:
data_raw = pd.read_csv("../Data/IMDb movies.csv", low_memory=False)

In [4]:
display(data_raw.head(1))
display(data_raw.shape)
display(data_raw.dtypes)
display(data_raw.columns)

Unnamed: 0,imdb_title_id,title,original_title,year,date_published,genre,duration,country,language,director,writer,production_company,actors,description,avg_vote,votes,budget,usa_gross_income,worlwide_gross_income,metascore,reviews_from_users,reviews_from_critics
0,tt0000009,Miss Jerry,Miss Jerry,1894,1894-10-09,Romance,45,USA,,Alexander Black,Alexander Black,Alexander Black Photoplays,"Blanche Bayliss, William Courtenay, Chauncey D...",The adventures of a female reporter in the 1890s.,5.9,154,,,,,1.0,2.0


(85855, 22)

imdb_title_id             object
title                     object
original_title            object
year                      object
date_published            object
genre                     object
duration                   int64
country                   object
language                  object
director                  object
writer                    object
production_company        object
actors                    object
description               object
avg_vote                 float64
votes                      int64
budget                    object
usa_gross_income          object
worlwide_gross_income     object
metascore                float64
reviews_from_users       float64
reviews_from_critics     float64
dtype: object

Index(['imdb_title_id', 'title', 'original_title', 'year', 'date_published',
       'genre', 'duration', 'country', 'language', 'director', 'writer',
       'production_company', 'actors', 'description', 'avg_vote', 'votes',
       'budget', 'usa_gross_income', 'worlwide_gross_income', 'metascore',
       'reviews_from_users', 'reviews_from_critics'],
      dtype='object')

<a id="section_dataset_drop"></a> 
<h4> Data selection: dropping unnecessary columns </h4>

[back to TOC](#section_toc)

In [5]:
data_raw.drop(['year', 'date_published','duration', 'director', 'writer',
              'production_company', 'actors','avg_vote', 'votes', 'budget', 'usa_gross_income',
              'worlwide_gross_income', 'metascore', 'imdb_title_id', 'original_title',
              'reviews_from_users', 'reviews_from_critics'], axis=1,inplace=True)

In [6]:
data_raw.head(1)

Unnamed: 0,title,genre,country,language,description
0,Miss Jerry,Romance,USA,,The adventures of a female reporter in the 1890s.


<a id="section_exploration"></a> 
<h3> Exploration </h3>

[back to TOC](#section_toc)

<h4>Countries and languages</h4>

[back to TOC](#section_toc)

In [7]:
countries = data_raw['country'].value_counts()

In [8]:
top_10_countries = countries[:10]
display(top_10_countries)
# Entries from position 21 onward, in reverse order (rarest first)
other_countries = countries[:20:-1]
# display(other_countries)
# Candidates to drop if we decide not to keep these data

USA          28511
India         6065
UK            4111
Japan         3077
France        3055
Italy         2444
Canada        1802
Germany       1396
Turkey        1351
Hong Kong     1239
Name: country, dtype: int64

In [9]:
languages = data_raw['language'].value_counts()
top_10_languages = languages[:10]

<h4>Counting null values</h4>

[back to TOC](#section_toc)

In [10]:
missing_values_check = data_raw.isnull().sum()
print(missing_values_check)

title             0
genre             0
country          64
language        833
description    2115
dtype: int64


In [11]:
data_raw.dropna(inplace=True)

In [12]:
missing_values_check = data_raw.isnull().sum()
print(missing_values_check)

title          0
genre          0
country        0
language       0
description    0
dtype: int64
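A small sketch of why the per-column null counts and the number of rows removed by `dropna` are different quantities: one row can be null in several columns. In the notebook the column counts add up to 3012, while the row count goes from 85855 to 82887, a difference of 2968 rows. The toy frame and its values are made up for illustration only.

```python
import pandas as pd

# Hypothetical toy data: row "D" is null in two columns at once.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "language": ["English", None, "French", None],
    "description": ["x", "y", None, None],
})

nulls_per_column = toy.isnull().sum()      # nulls counted column by column
rows_before = len(toy)
toy_clean = toy.dropna()                   # drops any row with at least one null
rows_dropped = rows_before - len(toy_clean)

print(nulls_per_column)
print("rows dropped:", rows_dropped)
```

Here the columns hold four nulls in total, but only three rows are removed.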


In [13]:
# Detect the language and keep only the descriptions written in English

# data_usa['descrip_lang'] = data_usa['description'].apply(lambda x: detect(str(str(x).split())[:5]))
# display(data_usa['descrip_lang'].value_counts())
# data_usa_en = data_usa.loc[data_usa['descrip_lang']=='en', :]
# data_usa_en.shape

In [14]:
# Remove this cell if the previous cell is run

data = data_raw

<a id="section_strings"></a>
<h3>Merging the text columns</h3>

[back to TOC](#section_toc)

In [15]:
data.description = data.title + " " + data.description
data.head()

Unnamed: 0,title,genre,country,language,description
0,Miss Jerry,Romance,USA,,Miss Jerry The adventures of a female reporter...
1,The Story of the Kelly Gang,"Biography, Crime, Drama",Australia,,The Story of the Kelly Gang True story of noto...
3,Cleopatra,"Drama, History",USA,English,Cleopatra The fabled queen of Egypt's affair w...
4,L'Inferno,"Adventure, Drama, Fantasy",Italy,Italian,L'Inferno Loosely adapted from Dante's Divine ...
5,"From the Manger to the Cross; or, Jesus of Naz...","Biography, Drama",USA,English,"From the Manger to the Cross; or, Jesus of Naz..."


In [16]:
# If I use the functions from Mark's notebook
#def complete_clean(sentence):
#     clean_html = cleanHtml(sentence)
#     clean_punc = cleanPunc(clean_html)
#     clean_alpha = keepAlpha(clean_punc)
#     return clean_alpha

<a id="section_docs_preprocessing"></a>
<h3>Document cleaning</h3>

[back to TOC](#section_toc)

In [17]:
import unidecode
import re

def clean_text(t):
    # Strip accents and convert to lowercase
    t_lower_no_accents = unidecode.unidecode(t.lower())
    # Remove punctuation: the regex replaces everything that is not whitespace
    # or a word character (and also underscores) with an empty string
    t_lower_no_accents_no_punkt = re.sub(r'([^\s\w]|_)+', '', t_lower_no_accents)
    return t_lower_no_accents_no_punkt
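
To make the regex step concrete, here is a minimal, dependency-free sketch of the same cleaning idea. It is a stand-in under stated assumptions: `clean_text_sketch` is a hypothetical name, and it strips accents with the standard library's `unicodedata` rather than `unidecode`, so characters that have no Unicode decomposition (for example `ł`) will not be transliterated the way `unidecode` would.

```python
import re
import unicodedata

def clean_text_sketch(text):
    """Hypothetical stand-in for clean_text that avoids the unidecode dependency."""
    lowered = text.lower()
    # Decompose accented characters into a base letter plus combining marks
    decomposed = unicodedata.normalize("NFKD", lowered)
    # Drop the combining marks, keeping only the base letters
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Same regex as the notebook: remove anything that is not whitespace or a
    # word character, and remove underscores as well
    return re.sub(r"([^\s\w]|_)+", "", no_accents)

examples = ["L'Inferno", "Sci-Fi, Thriller", "La fête des pères"]
cleaned = [clean_text_sketch(e) for e in examples]
print(cleaned)
```

Because hyphens are removed rather than replaced by a space, labels such as `Sci-Fi` collapse into a single token (`scifi`), which is what lets the genre column be split on whitespace later.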

In [18]:
data['description_clean'] = data['description'].apply(clean_text)

In [19]:
data.head()

Unnamed: 0,title,genre,country,language,description,description_clean
0,Miss Jerry,Romance,USA,,Miss Jerry The adventures of a female reporter...,miss jerry the adventures of a female reporter...
1,The Story of the Kelly Gang,"Biography, Crime, Drama",Australia,,The Story of the Kelly Gang True story of noto...,the story of the kelly gang true story of noto...
3,Cleopatra,"Drama, History",USA,English,Cleopatra The fabled queen of Egypt's affair w...,cleopatra the fabled queen of egypts affair wi...
4,L'Inferno,"Adventure, Drama, Fantasy",Italy,Italian,L'Inferno Loosely adapted from Dante's Divine ...,linferno loosely adapted from dantes divine co...
5,"From the Manger to the Cross; or, Jesus of Naz...","Biography, Drama",USA,English,"From the Manger to the Cross; or, Jesus of Naz...",from the manger to the cross or jesus of nazar...


<a id="section_target_preprocessing"></a>
<h3>Target column cleaning</h3>

[back to TOC](#section_toc)

In [20]:
data['genre_clean'] = data['genre'].apply(clean_text)

In [23]:
genres = pd.unique(data['genre_clean'].str.split(expand=True).stack())
genres

array(['romance', 'biography', 'crime', 'drama', 'history', 'adventure',
       'fantasy', 'war', 'mystery', 'horror', 'western', 'comedy',
       'family', 'action', 'scifi', 'thriller', 'sport', 'animation',
       'musical', 'music', 'filmnoir', 'adult', 'documentary',
       'realitytv', 'news'], dtype=object)

In [24]:
data['genre_list'] = data['genre_clean'].str.split(" ")

In [25]:
data.head(5)

Unnamed: 0,title,genre,country,language,description,description_clean,genre_clean,genre_list
0,Miss Jerry,Romance,USA,,Miss Jerry The adventures of a female reporter...,miss jerry the adventures of a female reporter...,romance,[romance]
1,The Story of the Kelly Gang,"Biography, Crime, Drama",Australia,,The Story of the Kelly Gang True story of noto...,the story of the kelly gang true story of noto...,biography crime drama,"[biography, crime, drama]"
3,Cleopatra,"Drama, History",USA,English,Cleopatra The fabled queen of Egypt's affair w...,cleopatra the fabled queen of egypts affair wi...,drama history,"[drama, history]"
4,L'Inferno,"Adventure, Drama, Fantasy",Italy,Italian,L'Inferno Loosely adapted from Dante's Divine ...,linferno loosely adapted from dantes divine co...,adventure drama fantasy,"[adventure, drama, fantasy]"
5,"From the Manger to the Cross; or, Jesus of Naz...","Biography, Drama",USA,English,"From the Manger to the Cross; or, Jesus of Naz...",from the manger to the cross or jesus of nazar...,biography drama,"[biography, drama]"


In [26]:
data.reset_index(inplace=True, drop=True)

<a id="section_binarizer"></a> 
<h3>Multilabel binarization</h3>

[back to TOC](#section_toc)

In [27]:
from sklearn.preprocessing import MultiLabelBinarizer

In [28]:
binarizer = MultiLabelBinarizer(classes=genres)
binarizer.fit(data['genre_list'])
genre_encoded = binarizer.transform(data['genre_list'])
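
As an illustration of what the binarization produces, the following pure-Python sketch builds the same kind of multi-hot matrix for a fixed class order. It is not scikit-learn's implementation: `multi_hot` is a hypothetical helper, the sample rows are a tiny subset chosen for illustration, and unlike `MultiLabelBinarizer` this sketch raises a `KeyError` on a label that is not in `classes`.

```python
def multi_hot(label_lists, classes):
    """Hypothetical helper: one row per sample, one 0/1 column per class."""
    # Column position of each class, following the order given in `classes`
    position = {label: i for i, label in enumerate(classes)}
    rows = []
    for labels in label_lists:
        row = [0] * len(classes)
        for label in labels:
            row[position[label]] = 1
        rows.append(row)
    return rows

classes = ["romance", "biography", "crime", "drama"]
samples = [["romance"], ["biography", "crime", "drama"], ["drama"]]
encoded = multi_hot(samples, classes)

for labels, row in zip(samples, encoded):
    print(labels, row)
```

Passing an explicit class list, as the notebook does with `classes=genres`, fixes the column order so that it follows the order in which the genres were first seen.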


In [29]:
genre_enc = pd.DataFrame(genre_encoded, columns = binarizer.classes_, index = data.index)

In [30]:
genre_enc.head(10)

Unnamed: 0,romance,biography,crime,drama,history,adventure,fantasy,war,mystery,horror,western,comedy,family,action,scifi,thriller,sport,animation,musical,music,filmnoir,adult,documentary,realitytv,news
0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,1,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
5,1,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
6,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
7,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
8,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
9,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [31]:
# Count the number of movies per genre
counts = []
for genre in genre_enc.columns:
    counts.append((genre, genre_enc[genre].sum()))
df_stats = pd.DataFrame(counts, columns=['genre', 'number of movies'])
df_stats

Unnamed: 0,genre,number of movies
0,romance,13721
1,biography,2329
2,crime,10847
3,drama,45700
4,history,2215
5,adventure,7425
6,fantasy,3702
7,war,2174
8,mystery,5156
9,horror,9368
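
The same per-genre totals can also be obtained directly from the genre lists, without the encoded matrix. This is a minimal sketch on three made-up rows, assuming each movie lists a given genre at most once; `genre_lists` and `genre_counts` are illustrative names.

```python
from collections import Counter

# Hypothetical sample of per-movie genre lists
genre_lists = [
    ["romance"],
    ["biography", "crime", "drama"],
    ["drama", "history"],
]

# Flatten the lists and count how many movies carry each genre
genre_counts = Counter(label for labels in genre_lists for label in labels)
print(genre_counts.most_common())
```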


In [32]:
columns_to_drop = ['documentary', 'realitytv', 'news', 'adult']
indexes_to_drop = []

for genre in columns_to_drop:
    # Boolean mask of the movies whose cleaned genre string contains this genre
    has_genre = data['genre_clean'].str.contains(genre, regex=False)
    # Collect the index labels of the matching rows
    indexes_to_drop.extend(has_genre[has_genre].index.tolist())

In [33]:
indexes_to_drop

[19789, 37174, 42874, 75381, 66018, 15918, 24399]
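
One caveat about matching a genre inside the cleaned string: a substring test can match a longer label that contains the shorter one. Among the 25 genres listed earlier, `music` is contained in `musical`. The four genres dropped here do not appear to be contained in any other label, so the indexes above should not be affected, but a token-based test is the safer pattern. Both helper names below are hypothetical.

```python
def has_genre_substring(genre_clean, genre):
    # Substring test: matches anywhere inside the string
    return genre in genre_clean

def has_genre_token(genre_clean, genre):
    # Token test: matches only whole whitespace-separated labels
    return genre in genre_clean.split()

row = "drama musical"
print(has_genre_substring(row, "music"))  # matches inside "musical"
print(has_genre_token(row, "music"))      # "music" is not one of the labels
print(has_genre_token(row, "musical"))
```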

In [34]:
display(data.shape)
display(genre_enc.shape)

(82887, 8)

(82887, 25)

In [35]:
data_pre = pd.concat([data,genre_enc], axis=1)
data_pre.sample(3)

Unnamed: 0,title,genre,country,language,description,description_clean,genre_clean,genre_list,romance,biography,crime,drama,history,adventure,fantasy,war,mystery,horror,western,comedy,family,action,scifi,thriller,sport,animation,musical,music,filmnoir,adult,documentary,realitytv,news
13811,La rossa maschera del terrore,Horror,UK,English,La rossa maschera del terrore Aristocrat Julia...,la rossa maschera del terrore aristocrat julia...,horror,[horror],0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
17500,The Brain Machine,"Sci-Fi, Thriller",USA,English,The Brain Machine Several people volunteer for...,the brain machine several people volunteer for...,scifi thriller,"[scifi, thriller]",0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0
70522,Rere's Children,Drama,New Zealand,English,Rere's Children Rere's Children is the stunnin...,reres children reres children is the stunning ...,drama,[drama],0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [36]:
data_pre.shape

(82887, 33)

In [37]:
# Drop the rows containing movies from the last four categories
data_pre.drop(index=indexes_to_drop, inplace=True)

In [38]:
counts_2 = []
for genre in genre_enc.columns:
    counts_2.append((genre, data_pre[genre].sum()))
df_stats_2 = pd.DataFrame(counts_2, columns=['genre', 'number of movies'])
df_stats_2

Unnamed: 0,genre,number of movies
0,romance,13720
1,biography,2328
2,crime,10846
3,drama,45698
4,history,2215
5,adventure,7425
6,fantasy,3702
7,war,2173
8,mystery,5156
9,horror,9365


In [39]:
# Drop the adult, documentary, realitytv and news columns
data_pre.drop(columns=columns_to_drop, inplace=True)

<a id="section_export"></a>
<h3>Data export</h3>

[back to TOC](#section_toc)

In [40]:
data_pre.sample(3)

Unnamed: 0,title,genre,country,language,description,description_clean,genre_clean,genre_list,romance,biography,crime,drama,history,adventure,fantasy,war,mystery,horror,western,comedy,family,action,scifi,thriller,sport,animation,musical,music,filmnoir
80322,La fille au bracelet,"Crime, Drama","France, Belgium",French,La fille au bracelet A teenager stands trial f...,la fille au bracelet a teenager stands trial f...,crime drama,"[crime, drama]",0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
63281,Zombie A-Hole,Horror,USA,English,Zombie A-Hole The creators of The Puppet Monst...,zombie ahole the creators of the puppet monste...,horror,[horror],0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
50663,1800 gramów,Comedy,Poland,Polish,1800 gramów A touching story about the most im...,1800 gramow a touching story about the most im...,comedy,[comedy],0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0


In [41]:
# Check whether any row sums to zero

In [42]:
mask = genre_enc.sum(axis=1) == 0

In [43]:
len(mask)

82887

In [44]:
any(mask)

False
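
The mask above is computed on `genre_enc`, which still has all 25 genre columns and every row. A row can only end up with no labels after some columns are removed, so the same check is most informative on the reduced matrix. The following is a sketch on made-up data, not the notebook's frames; all names are illustrative.

```python
# Hypothetical multi-hot rows for three movies and three classes
classes = ["drama", "horror", "news"]
rows = [
    [1, 0, 0],
    [0, 0, 1],  # this movie only has the label that will be dropped
    [0, 1, 0],
]
dropped = {"news"}

# Keep only the columns whose class is not dropped
keep = [i for i, c in enumerate(classes) if c not in dropped]
reduced = [[row[i] for i in keep] for row in rows]

# Positions of rows that are left without any label
empty_rows = [idx for idx, row in enumerate(reduced) if sum(row) == 0]
print(empty_rows)
```

In the notebook, rows carrying any of the four dropped genres are removed before the columns are dropped, which is what keeps the remaining rows from becoming label-free.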

In [45]:
data_pre.columns

Index(['title', 'genre', 'country', 'language', 'description',
       'description_clean', 'genre_clean', 'genre_list', 'romance',
       'biography', 'crime', 'drama', 'history', 'adventure', 'fantasy', 'war',
       'mystery', 'horror', 'western', 'comedy', 'family', 'action', 'scifi',
       'thriller', 'sport', 'animation', 'musical', 'music', 'filmnoir'],
      dtype='object')

In [46]:
data_final = data_pre.drop(['genre', 'country', 'language', 'description', 'genre_list'], axis=1)
data_final.sample(3)

Unnamed: 0,title,description_clean,genre_clean,romance,biography,crime,drama,history,adventure,fantasy,war,mystery,horror,western,comedy,family,action,scifi,thriller,sport,animation,musical,music,filmnoir
35679,Baadshah,baadshah a small time detective is mistaken as...,action comedy crime,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0
46376,House at the End of the Drive,house at the end of the drive can a 46 year ol...,horror thriller,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0
24129,La fête des pères,la fete des peres thomas and stephane are happ...,comedy family romance,1,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0


In [47]:
data_final.to_csv(path_or_buf='../Data/movies_multilabel.csv', sep=',',
                   header=True, encoding='utf8', index=False)
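
The exported file keeps `genre_clean` but not `genre_list`. If the list form is needed after reloading, it can be rebuilt by splitting the cleaned string on whitespace. A minimal round-trip sketch follows, using the standard library's `csv` module and an in-memory buffer as a stand-in for the pandas export; the single sample row is made up.

```python
import csv
import io

# Hypothetical one-row export with a subset of the columns
rows = [{"title": "Cleopatra", "genre_clean": "drama history", "drama": 1, "history": 1}]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
writer.writeheader()
writer.writerows(rows)

# Read the data back and rebuild the list of genres from the cleaned string
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
genre_list = loaded[0]["genre_clean"].split()
print(genre_list)
```

Note that `csv.DictReader` returns every value as a string, so the 0/1 indicator columns would need converting back to integers when read this way.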