<h2>Challenge 3 - Machine Learning</h2>

<h4>Objective</h4>
Build a classification model (supervised learning, classification) from a selected dataset.

<h3>Group 7</h3>
<ul>
    <li>Ignacio Mendieta</li>
    <li>Laura Jazmín Chao</li>
    <li>Juan Nicolás Capistrano</li>
    <li>Betiana Srur</li>
    <li>Marecelo Carrizo</li>
    
</ul>
<h3>Multilabel classification</h3>

<a id="section_toc"></a> 
<h2>Table of Contents</h2>

[Libraries](#section_import)

[Dataset](#section_dataset)

[Stemmer](#section_stemmer)

[Train/test split](#section_train_test_split)

[Classification tests](#section_clf_1)


<a id="section_import"></a> 
<h3>Libraries</h3>

[back to TOC](#section_toc)

In [None]:
# Classification

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV, StratifiedShuffleSplit
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score,confusion_matrix, classification_report
import unidecode
from nltk.corpus import stopwords 

<a id="section_dataset"></a> 
<h3>Dataset</h3>

[back to TOC](#section_toc)

In [2]:
pd.set_option('display.max_columns', 100) # Show all columns
# pd.set_option('display.max_rows', 100) # Show all rows

In [3]:
data = pd.read_csv("../Data/movies_multilabel.csv", low_memory=False)
data.head()

Unnamed: 0,title,description_clean,genre_clean,romance,biography,crime,drama,history,adventure,fantasy,war,mystery,horror,western,comedy,family,action,scifi,thriller,sport,animation,musical,music,filmnoir
0,Miss Jerry,miss jerry the adventures of a female reporter...,romance,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,The Story of the Kelly Gang,the story of the kelly gang true story of noto...,biography crime drama,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,Cleopatra,cleopatra the fabled queen of egypts affair wi...,drama history,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,L'Inferno,linferno loosely adapted from dantes divine co...,adventure drama fantasy,0,0,0,1,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,"From the Manger to the Cross; or, Jesus of Naz...",from the manger to the cross or jesus of nazar...,biography drama,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [4]:
data.shape

(82880, 24)

In [5]:
data.columns

Index(['title', 'description_clean', 'genre_clean', 'romance', 'biography',
       'crime', 'drama', 'history', 'adventure', 'fantasy', 'war', 'mystery',
       'horror', 'western', 'comedy', 'family', 'action', 'scifi', 'thriller',
       'sport', 'animation', 'musical', 'music', 'filmnoir'],
      dtype='object')

In [6]:
data.head(3)

Unnamed: 0,title,description_clean,genre_clean,romance,biography,crime,drama,history,adventure,fantasy,war,mystery,horror,western,comedy,family,action,scifi,thriller,sport,animation,musical,music,filmnoir
0,Miss Jerry,miss jerry the adventures of a female reporter...,romance,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,The Story of the Kelly Gang,the story of the kelly gang true story of noto...,biography crime drama,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,Cleopatra,cleopatra the fabled queen of egypts affair wi...,drama history,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


<a id="section_stemmer"></a>
<h3>Stemmer</h3>

[back to TOC](#section_toc)

In [7]:
from nltk.stem.snowball import SnowballStemmer

In [8]:
stemmer = SnowballStemmer("english")
def stemming(sentence):
    # Stem each whitespace-separated word and rejoin with single spaces
    return " ".join(stemmer.stem(word) for word in sentence.split())


In [9]:
data['description_clean'] = data['description_clean'].apply(stemming)
data.head()

Unnamed: 0,title,description_clean,genre_clean,romance,biography,crime,drama,history,adventure,fantasy,war,mystery,horror,western,comedy,family,action,scifi,thriller,sport,animation,musical,music,filmnoir
0,Miss Jerry,miss jerri the adventur of a femal report in t...,romance,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,The Story of the Kelly Gang,the stori of the kelli gang true stori of noto...,biography crime drama,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,Cleopatra,cleopatra the fabl queen of egypt affair with ...,drama history,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,L'Inferno,linferno loos adapt from dant divin comedi and...,adventure drama fantasy,0,0,0,1,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,"From the Manger to the Cross; or, Jesus of Naz...",from the manger to the cross or jesus of nazar...,biography drama,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


<a id="section_train_test_split"></a>
<h3>Train/test split</h3>

[back to TOC](#section_toc)

In [10]:
train, test = train_test_split(data, random_state=42, test_size=0.30, shuffle=True)
train_text = train['description_clean']
test_text = test['description_clean']

In [11]:
stop_words = stopwords.words('english')
# Extra stop words: frequent plot words and non-English function words
stop_words.extend(['young', 'life', 'man', 'find', 'get',
                   'family', 'la', 'woman', 'il', 'di'])
stop_words.extend(['zero','one','two','three','four','five','six','seven','eight','nine','ten','may','also','across','among','beside','however','yet','within'])
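
Because `description_clean` was stemmed before vectorizing, the vectorizer sees stemmed tokens, while the stop list above holds unstemmed words. Some entries therefore never match: "family" is stemmed to "famili" (the same y-to-i change visible in "stori" and "comedi" in the output above). A minimal sketch of one possible fix, assuming the intent is to filter these words after stemming; `custom_stop_words` is a hypothetical subset used only for illustration:

```python
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

# Hypothetical subset of the custom stop words added above
custom_stop_words = ["family", "woman", "young", "life"]

# Stem each stop word so it has the same form as the stemmed description tokens
stemmed_stop_words = sorted({stemmer.stem(word) for word in custom_stop_words})

for word in custom_stop_words:
    print(word, "->", stemmer.stem(word))
```

The same transformation could be applied to the full `stop_words` list before passing it to `TfidfVectorizer`.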

In [12]:
vectorizer = TfidfVectorizer(strip_accents='unicode', analyzer='word', ngram_range=(1,3), norm='l2', stop_words=stop_words)
vectorizer.fit(train_text)

TfidfVectorizer(ngram_range=(1, 3),
                stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours',
                            'ourselves', 'you', "you're", "you've", "you'll",
                            "you'd", 'your', 'yours', 'yourself', 'yourselves',
                            'he', 'him', 'his', 'himself', 'she', "she's",
                            'her', 'hers', 'herself', 'it', "it's", 'its',
                            'itself', ...],
                strip_accents='unicode')

In [13]:
X_train = vectorizer.transform(train_text)
y_train = train.drop(labels = ['title', 'description_clean', 'genre_clean'], axis=1)

X_test = vectorizer.transform(test_text)
y_test = test.drop(labels = ['title', 'description_clean', 'genre_clean'], axis=1)

<a id="section_clf_1"></a>
<h3>Classification tests</h3>

[back to TOC](#section_toc)

In [14]:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.multiclass import OneVsRestClassifier

In [15]:
genres = pd.unique(data['genre_clean'].str.split(expand=True).stack())
genres

array(['romance', 'biography', 'crime', 'drama', 'history', 'adventure',
       'fantasy', 'war', 'mystery', 'horror', 'western', 'comedy',
       'family', 'action', 'scifi', 'thriller', 'sport', 'animation',
       'musical', 'music', 'filmnoir'], dtype=object)

In [16]:
len(genres)

21
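
The classification report further down passes `target_names=genres`, which labels the rows correctly only if `genres` is in the same order as the indicator columns of `y_train`. In this dataset the two orders coincide, as the `data.columns` and `genres` outputs show. A minimal sketch, on a made-up three-row frame, of why that is not guaranteed and how the names could be taken from the columns instead (`toy`, `meta_cols` and `label_cols` are hypothetical names):

```python
import pandas as pd

# Toy frame with the same layout as the dataset: metadata plus indicator columns
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "description_clean": ["text a", "text b", "text c"],
    "genre_clean": ["drama comedy", "comedy romance", "romance drama"],
    "romance": [0, 1, 1],
    "drama": [1, 0, 1],
    "comedy": [1, 1, 0],
})

meta_cols = ["title", "description_clean", "genre_clean"]

# Label names in column order, which is the order the classifier sees
label_cols = [col for col in toy.columns if col not in meta_cols]

# Label names in order of first appearance, as computed in the notebook
by_appearance = list(pd.unique(toy["genre_clean"].str.split(expand=True).stack()))

print(label_cols)
print(by_appearance)
```

Here the two lists contain the same genres in a different order, so using `label_cols` for `target_names` would be the safer choice.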

#### OneVsRest Individual

In [17]:
classifier_log = OneVsRestClassifier(LogisticRegression(solver='sag'), n_jobs=-1)

for genre in genres:
    print('**Processing {} movies...**'.format(genre))
    
    # Training logistic regression model on train data
    classifier_log.fit(X_train, train[genre])
    
    # calculating test accuracy
    prediction = classifier_log.predict(X_test)
    print('Test accuracy is {}'.format(accuracy_score(test[genre], prediction)))

**Processing romance movies...**
Test accuracy is 0.8484153796653797
**Processing biography movies...**
Test accuracy is 0.9721685971685972
**Processing crime movies...**
Test accuracy is 0.8829633204633205
**Processing drama movies...**
Test accuracy is 0.7023809523809523
**Processing history movies...**
Test accuracy is 0.9740990990990991
**Processing adventure movies...**
Test accuracy is 0.9185971685971686
**Processing fantasy movies...**
Test accuracy is 0.9544723294723295
**Processing war movies...**
Test accuracy is 0.9747023809523809
**Processing mystery movies...**
Test accuracy is 0.9356901544401545
**Processing horror movies...**
Test accuracy is 0.9234234234234234
**Processing western movies...**
Test accuracy is 0.9835907335907336
**Processing comedy movies...**
Test accuracy is 0.7454552767052767
**Processing family movies...**
Test accuracy is 0.9540299227799228
**Processing action movies...**
Test accuracy is 0.8736727799227799
**Processing scifi movies...**
Test accura
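
These accuracies look high largely because most genres are rare: a model that always answers "not this genre" is already right most of the time. The sketch below compares a few of the printed accuracies with that majority-class baseline. The test-set size is taken as 30% of the 82,880 rows, and the positive counts come from the `support` column of the classification report in the multilabel section; `majority_baseline` is a helper introduced here for illustration.

```python
# Test-set size: 30% of the 82,880 rows reported by data.shape
n_test = round(82880 * 0.30)

# genre: (positives in the test set, accuracy printed by the loop above)
reported = {
    "romance": (4044, 0.8484153796653797),
    "biography": (698, 0.9721685971685972),
    "drama": (13670, 0.7023809523809523),
    "history": (646, 0.9740990990990991),
    "western": (438, 0.9835907335907336),
    "comedy": (8375, 0.7454552767052767),
}


def majority_baseline(n_pos, n_total):
    """Accuracy obtained by always predicting the more frequent class."""
    return max(n_pos, n_total - n_pos) / n_total


gains = {}
for genre, (n_pos, acc) in reported.items():
    base = majority_baseline(n_pos, n_test)
    gains[genre] = acc - base
    print(f"{genre:<10} baseline={base:.4f} model={acc:.4f} gain={gains[genre]:+.4f}")
```

For rare genres such as biography and history the model barely improves on the baseline, while drama and comedy show a real gain. Precision and recall per genre are therefore more informative than accuracy here.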

#### OneVsRest Multilabel

In [18]:
classifier_log2 = OneVsRestClassifier(LogisticRegression(solver='sag'), n_jobs=-1)

In [19]:
classifier_log2.fit(X_train, y_train)

OneVsRestClassifier(estimator=LogisticRegression(solver='sag'), n_jobs=-1)

In [20]:
predictions = classifier_log2.predict(X_test)

In [22]:
print("Accuracy = ",accuracy_score(y_test,predictions))
print("\n")

Accuracy =  0.1842020592020592
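
For multilabel targets, `accuracy_score` computes subset accuracy: a movie counts as correct only when every one of its 21 genre flags is predicted exactly. That explains why this figure is far below the per-genre accuracies. A small sketch on made-up labels, comparing it with two label-level metrics from `sklearn.metrics`:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, hamming_loss

# Four samples, three labels; rows 1 and 2 each miss one positive label
y_true_toy = np.array([[1, 0, 0], [0, 1, 1], [1, 1, 0], [0, 0, 1]])
y_pred_toy = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]])

subset_acc = accuracy_score(y_true_toy, y_pred_toy)        # exact-match rows only
ham_loss = hamming_loss(y_true_toy, y_pred_toy)            # fraction of wrong flags
micro_f1 = f1_score(y_true_toy, y_pred_toy, average="micro")

print("Subset accuracy =", subset_acc)
print("Hamming loss    =", ham_loss)
print("Micro F1        =", micro_f1)
```

Only two of the twelve flags are wrong, yet half of the rows fail the exact-match test.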




In [23]:
# Per-genre precision, recall and F1 on the test set

print(classification_report(y_test, predictions, target_names=genres, zero_division=1))

              precision    recall  f1-score   support

     romance       0.57      0.28      0.37      4044
   biography       0.64      0.02      0.04       698
       crime       0.60      0.34      0.44      3285
       drama       0.71      0.78      0.74     13670
     history       0.58      0.01      0.02       646
   adventure       0.74      0.15      0.25      2241
     fantasy       0.37      0.01      0.02      1125
         war       0.56      0.27      0.36       665
     mystery       0.46      0.12      0.18      1565
      horror       0.77      0.47      0.58      2820
     western       0.77      0.10      0.17       438
      comedy       0.70      0.42      0.53      8375
      family       0.65      0.03      0.05      1158
      action       0.69      0.36      0.47      3901
       scifi       0.74      0.19      0.30      1087
    thriller       0.48      0.17      0.25      3372
       sport       0.75      0.04      0.07       310
   animation       0.79    
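
The report shows reasonable precision but very low recall for many genres (for example biography: 0.64 precision, 0.02 recall). One common response, not tried in the original notebook, is to lower the decision threshold below 0.5 using `predict_proba`. The sketch below uses synthetic data from `make_multilabel_classification` as a stand-in for the TF-IDF matrix; the threshold 0.3 is an arbitrary illustrative value.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

# Synthetic stand-in for the TF-IDF features and the genre indicators
X, Y = make_multilabel_classification(
    n_samples=600, n_features=30, n_classes=5, random_state=0
)
X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.3, random_state=0)

clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_tr, Y_tr)
proba = clf.predict_proba(X_te)  # one probability column per label

results = {}
for threshold in (0.5, 0.3):
    # A label is predicted whenever its probability reaches the threshold
    pred = (proba >= threshold).astype(int)
    results[threshold] = (
        precision_score(Y_te, pred, average="micro", zero_division=0),
        recall_score(Y_te, pred, average="micro", zero_division=0),
    )
    print(f"threshold={threshold}: precision={results[threshold][0]:.2f} "
          f"recall={results[threshold][1]:.2f}")
```

Lowering the threshold can only add positive predictions, so recall never decreases, usually at some cost in precision. On the real data the threshold would need to be chosen on a validation split rather than the test set.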

#### ClassifierChain

In [21]:
# Warning: this is very expensive to run!

In [None]:
# from skmultilearn.problem_transform import ClassifierChain

In [None]:
# classifier2 = ClassifierChain(LogisticRegression())

# # Training logistic regression model on train data
# classifier2.fit(X_train, y_train)

# # predict
# predictions = classifier2.predict(X_test)

# # accuracy
# print("Accuracy = ",accuracy_score(y_test,predictions))
# print("\n")

In [None]:
# print(classification_report(y_test, predictions, target_names=genres, zero_division=1))
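
The cells above rely on `skmultilearn`. scikit-learn ships its own `ClassifierChain` in `sklearn.multioutput`, which may be worth trying as an alternative; it is a different implementation and was not used in the original notebook. A minimal sketch on synthetic data (the dataset, sizes and `order="random"` are illustrative assumptions):

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.multioutput import ClassifierChain

# Synthetic stand-in for the TF-IDF features and the genre indicators
X, Y = make_multilabel_classification(
    n_samples=400, n_features=20, n_classes=4, random_state=0
)
X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.3, random_state=0)

# Each classifier in the chain also sees the predictions of the previous ones
chain = ClassifierChain(LogisticRegression(max_iter=1000), order="random", random_state=0)
chain.fit(X_tr, Y_tr)
chain_pred = chain.predict(X_te)

print("Subset accuracy =", accuracy_score(Y_te, chain_pred))
```

A chain fits one model per label in sequence, so with 21 genres and a large sparse n-gram matrix the training cost remains substantial.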