<h2>Challenge 3 - Machine Learning</h2>

<h4>Objective</h4>
Build a classification model (supervised learning > classification) from a selected dataset.

<h3>Group 7</h3>
<ul>
    <li>Ignacio Mendieta</li>
    <li>Laura Jazmín Chao</li>
    <li>Juan Nicolás Capistrano</li>
    <li>Betiana Srur</li>
    <li>Marecelo Carrizo</li>
    
</ul>
<h3>Part II - Classification</h3>

<a id="section_toc"></a> 
<h2> Table of Contents </h2>

[Libraries](#section_import)

[Dataset](#section_dataset)

[Genre selection](#section_genre_selection)

[Exploring representative words](#section_words)

[Word clouds](#section_wordcloud)

[Target encoder](#section_ordinalEncoder)

[Stemmer](#section_stemmer)

[Train/test split](#section_train_test_split)

[First classification test](#section_clf_1)

[Pipeline and GridSearch](#section_clf_pipeline)

$\hspace{.5cm}$[1. Logistic Regression](#section_logreg)

$\hspace{.5cm}$[2. Multinomial NB](#section_multiNB)


<a id="section_import"></a> 
<h3>Libraries</h3>

[back to TOC](#section_toc)

In [1]:
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV, StratifiedShuffleSplit
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

import unidecode
from nltk.corpus import stopwords 
stop_words = stopwords.words('english');

from wordcloud import WordCloud, STOPWORDS

In [2]:
pd.set_option('display.max_columns', 100) # Show all columns
# pd.set_option('display.max_rows', 100) # Show all rows

<a id="section_dataset"></a> 
<h3>Dataset</h3>

[back to TOC](#section_toc)

In [3]:
data = pd.read_csv("data/movies_preprocesado.csv", low_memory=False)
data.head()

Unnamed: 0,description_clean,genre_unique
0,the adventures of a female reporter in the 1890s,romance
1,true story of notorious australian outlaw ned ...,biography
2,the fabled queen of egypts affair with roman g...,drama
3,loosely adapted from dantes divine comedy and ...,adventure
4,an account of the life of jesus christ based o...,biography


In [4]:
data['genre_unique'].value_counts()

drama          24473
comedy         23103
action         11971
crime           5453
horror          4998
adventure       3486
animation       2045
biography       2022
thriller        1317
romance          730
western          611
family           596
mystery          585
fantasy          486
scifi            410
musical          312
war               98
music             72
history           71
filmnoir          29
sport             16
adult              2
documentary        1
Name: genre_unique, dtype: int64

In [5]:
data.shape

(82887, 2)

In [6]:
data.columns

Index(['description_clean', 'genre_unique'], dtype='object')

<a id="section_genre_selection"></a>
<h3>Genre selection</h3>

[back to TOC](#section_toc)

In [7]:
genres = data['genre_unique'].unique()
genres

array(['romance', 'biography', 'drama', 'adventure', 'crime', 'western',
       'fantasy', 'comedy', 'horror', 'family', 'action', 'mystery',
       'history', 'scifi', 'animation', 'musical', 'music', 'thriller',
       'war', 'filmnoir', 'sport', 'adult', 'documentary'], dtype=object)

In [8]:
# Keep only the four most frequent genres ('romance' was considered and left out)
selected_genres = ['action', 'comedy', 'drama', 'horror']
mask_filter = data['genre_unique'].isin(selected_genres)

# .copy() avoids a SettingWithCopyWarning when new columns are added later
data = data.loc[mask_filter, :].copy()

In [9]:
data['genre_unique'].value_counts()

drama     24473
comedy    23103
action    11971
horror     4998
Name: genre_unique, dtype: int64

<a id="section_words"></a>
<h3>Exploring representative words per genre</h3>

[back to TOC](#section_toc)

In [None]:
# stop_words.extend(list(STOPWORDS))

In [None]:
# stop_words = list(set(stop_words))

In [10]:
# Extra stop words: generic frequent terms, Italian articles/prepositions, number words and connectives
stop_words.extend(['young', 'life', 'man', 'woman', 'family', 'find', 'get', 'la', 'il', 'di'])
stop_words.extend(['zero','one','two','three','four','five','six','seven','eight','nine','ten','may','also','across','among','beside','however','yet','within'])

In [11]:
vectorizer=CountVectorizer(stop_words=stop_words)

In [12]:
clases = data['genre_unique'].unique()

for clase in clases:
    # Term counts over the descriptions of this genre only
    X = vectorizer.fit_transform(data.loc[data['genre_unique'] == clase, 'description_clean'])
    counts = np.asarray(X.sum(axis=0)).ravel()

    # Term indices sorted from most to least frequent
    indices = np.argsort(counts)[::-1]

    # get_feature_names() was removed in scikit-learn 1.2;
    # on versions older than 1.0 call get_feature_names() instead
    terms = np.array(vectorizer.get_feature_names_out())

    print('\n Clase ', clase)
    print(terms[indices[:30]])
    


 Clase  drama
['love' 'story' 'new' 'girl' 'lives' 'father' 'years' 'mother' 'war'
 'wife' 'world' 'old' 'film' 'son' 'finds' 'school' 'home' 'friends'
 'town' 'daughter' 'boy' 'falls' 'small' 'day' 'meets' 'husband' 'time'
 'death' 'becomes' 'people']

 Clase  comedy
['love' 'new' 'friends' 'girl' 'story' 'school' 'comedy' 'years' 'old'
 'wife' 'lives' 'finds' 'father' 'gets' 'film' 'time' 'town' 'day' 'world'
 'falls' 'group' 'back' 'home' 'friend' 'help' 'make' 'small' 'mother'
 'decides' 'go']

 Clase  horror
['group' 'house' 'friends' 'new' 'killer' 'mysterious' 'night' 'home'
 'people' 'girl' 'town' 'years' 'old' 'death' 'haunted' 'horror' 'evil'
 'dead' 'soon' 'must' 'film' 'college' 'strange' 'finds' 'couple' 'small'
 'school' 'students' 'back' 'becomes']

 Clase  action
['police' 'must' 'group' 'war' 'gang' 'new' 'world' 'story' 'love' 'gets'
 'cop' 'son' 'take' 'father' 'help' 'finds' 'fight' 'agent' 'city' 'crime'
 'death' 'years' 'town' 'officer' 'team' 'friends' 'back' 'r

<a id="section_wordcloud"></a>
<h3>Word clouds</h3>

[back to TOC](#section_toc)

In [None]:
# One figure with a 2x3 grid; each genre is drawn in its own panel
plt.figure(figsize=(20,12.5))
for i, genre in enumerate(clases, start=1):
    subset = data[data['genre_unique']==genre]
    text = subset['description_clean'].values
    cloud = WordCloud(stopwords=set(stop_words),
                      background_color='white',
                      collocations=False,
                      width=2500,
                      height=1800,
                      max_words=100).generate(" ".join(text))
    
    plt.subplot(2, 3, i)
    plt.axis('off')
    plt.title(genre,fontsize=20)
    plt.imshow(cloud)
plt.show()



<a id="section_ordinalEncoder"></a>
<h3>Target encoder</h3>

[back to TOC](#section_toc)

In [13]:
data.head(2)

Unnamed: 0,description_clean,genre_unique
2,the fabled queen of egypts affair with roman g...,drama
6,an epic italian film quo vadis influenced many...,drama


In [14]:
# LabelEncoder can also be used here
from sklearn.preprocessing import OrdinalEncoder
# Passing the categories as a parameter raised the error "The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()",
# so the default (categories inferred from the data, in sorted order) is used
ord_enc = OrdinalEncoder()
data["target_genre"] = ord_enc.fit_transform(data[["genre_unique"]])

data.sample(3)

Unnamed: 0,description_clean,genre_unique,target_genre
5809,a struggling artist becomes a new york city pr...,drama,2.0
50709,everything that can go wrong goes wrong when c...,comedy,1.0
40457,a weekend reunion becomes a confrontation with...,horror,3.0
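The error mentioned in the comment above is probably caused by passing a flat array: `OrdinalEncoder` documents `categories` as a list with one list of labels per input column. The following is a minimal sketch under that assumption, using a small hypothetical frame (`toy`) and an explicit order (`genre_order`) that are not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical stand-in for the notebook's `genre_unique` column.
toy = pd.DataFrame({"genre_unique": ["drama", "comedy", "horror", "action", "drama"]})

# `categories` takes one list of labels per input column, hence the nested list.
genre_order = ["action", "comedy", "drama", "horror"]
ord_enc = OrdinalEncoder(categories=[genre_order])

# fit_transform returns a 2-D array of shape (n_samples, 1); ravel() flattens it.
toy["target_genre"] = ord_enc.fit_transform(toy[["genre_unique"]]).ravel()
print(toy)
```

With this order the codes match the ones shown in the sample above (comedy 1.0, drama 2.0, horror 3.0), because the default encoder also sorts the labels alphabetically.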


<a id="section_stemmer"></a>
<h3>Stemmer</h3>

[back to TOC](#section_toc)

In [15]:
from nltk.stem.snowball import SnowballStemmer

In [16]:
stemmer = SnowballStemmer("english")
def stemming(sentence):
    # Stem every whitespace-separated token and rebuild the sentence
    return " ".join(stemmer.stem(word) for word in sentence.split())



In [17]:
data['description_clean'] = data['description_clean'].apply(stemming)
data.head()

Unnamed: 0,description_clean,genre_unique,target_genre
2,the fabl queen of egypt affair with roman gene...,drama,2.0
6,an epic italian film quo vadi influenc mani of...,drama,2.0
7,richard of gloucest use manipul and murder to ...,drama,2.0
8,after dr friedrich wife becom mental unstabl a...,drama,2.0
11,lesli swayn an adventur in order to obtain eno...,drama,2.0
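Once the descriptions are stemmed, a stop word whose stem differs from its surface form (for example `family`) no longer matches the text. One possible remedy, shown here only as a sketch, is to stem the stop-word list as well. The short list `toy_stop_words` is hypothetical; in the notebook it would be `stop_words`.

```python
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

def stemming(sentence):
    # Stem every whitespace-separated token and rebuild the sentence
    return " ".join(stemmer.stem(word) for word in sentence.split())

# Hypothetical short list; the notebook's full `stop_words` would be used instead.
toy_stop_words = ["family", "because", "young"]

# Stem each stop word so it matches the tokens of the stemmed descriptions.
stemmed_stop_words = sorted({stemmer.stem(word) for word in toy_stop_words})

print(stemmed_stop_words)
print(stemming("the fabled queen"))
```

The stemmed list could then be passed as `stop_words` to the vectorizers that run on the stemmed text.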


<a id="section_train_test_split"></a>
<h3>Train/test split</h3>

[back to TOC](#section_toc)

In [18]:
train, test = train_test_split(data, random_state=42, test_size=0.30, shuffle=True)
train_text = train['description_clean']
test_text = test['description_clean']
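The four genres are unbalanced (drama has roughly five times as many rows as horror), so a plain random split can shift the class proportions slightly between train and test. A possible alternative, sketched here on a hypothetical toy frame, is the `stratify` argument of `train_test_split`, which keeps the proportions equal in both parts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame with unbalanced genres (not the real dataset).
toy = pd.DataFrame({
    "description_clean": [f"text {i}" for i in range(100)],
    "genre_unique": ["drama"] * 40 + ["comedy"] * 30 + ["action"] * 20 + ["horror"] * 10,
})

# stratify keeps each genre's share the same in the train and test parts.
toy_train, toy_test = train_test_split(
    toy, random_state=42, test_size=0.30, shuffle=True,
    stratify=toy["genre_unique"],
)

print(toy_test["genre_unique"].value_counts())
```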

In [19]:
train_text.head()

78514    base on the bestsel book of the same name mons...
26635    ryu dhalsim and vega are sent out to stop m bi...
12257    after a black man daughter is kill by the kkk ...
74031    two young peopl figur out how to shape their f...
17225    older group of worker are forc to attend the h...
Name: description_clean, dtype: object

In [20]:
vectorizer = TfidfVectorizer(min_df=10,ngram_range=(1,3),max_features=20000, stop_words=stop_words)
vectorizer.fit(train_text)

TfidfVectorizer(max_features=20000, min_df=10, ngram_range=(1, 3),
                stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours',
                            'ourselves', 'you', "you're", "you've", "you'll",
                            "you'd", 'your', 'yours', 'yourself', 'yourselves',
                            'he', 'him', 'his', 'himself', 'she', "she's",
                            'her', 'hers', 'herself', 'it', "it's", 'its',
                            'itself', ...])

In [21]:
X_train = vectorizer.transform(train_text)
X_test = vectorizer.transform(test_text)
y_train = train['genre_unique']
y_test = test['genre_unique']

<a id="section_clf_1"></a>
<h3>First classification test</h3>

[back to TOC](#section_toc)

In [22]:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

In [23]:
genres = data['genre_unique'].unique()
len(genres)

4

In [24]:
classifier_log = LogisticRegression(penalty='l2', solver='lbfgs', multi_class='multinomial', verbose=1, n_jobs=-1)
classifier_log.fit(X_train, y_train)
prediction = classifier_log.predict(X_test)
print('Test accuracy is {}'.format(accuracy_score(y_test, prediction)))

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.


Test accuracy is 0.619809956620533


[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    3.3s finished


In [None]:
print(classification_report(y_test, prediction, zero_division=1))

In [None]:
classifier_log.classes_

In [None]:
print(confusion_matrix(y_test, prediction))

In [None]:
conf_matrix = confusion_matrix(y_test, prediction)
conf_mat_df = pd.DataFrame(data=conf_matrix, 
                           index=classifier_log.classes_, 
                           columns=classifier_log.classes_)


In [None]:
conf_mat_df

In [None]:
heatmap = sns.heatmap(conf_mat_df, annot=True, fmt='d', cmap='YlGnBu')
plt.xlabel("Predicted") 
plt.ylabel("Observed")
plt.show()
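With unbalanced classes, raw counts in the heatmap are dominated by the large genres. A row-normalized matrix can be easier to read, since its diagonal is the recall of each class. The sketch below uses hypothetical labels (`toy_true`, `toy_pred`) in place of `y_test` and `prediction`.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hypothetical labels standing in for y_test and prediction.
toy_true = ["drama", "drama", "drama", "drama", "comedy", "comedy", "horror", "action"]
toy_pred = ["drama", "drama", "comedy", "drama", "comedy", "drama", "horror", "drama"]
labels = ["action", "comedy", "drama", "horror"]

# normalize="true" divides each row by the number of observed samples of that class.
norm_matrix = confusion_matrix(toy_true, toy_pred, labels=labels, normalize="true")
norm_df = pd.DataFrame(norm_matrix, index=labels, columns=labels)

print(norm_df.round(2))
```

The resulting frame can be passed to `sns.heatmap` with `fmt='.2f'` instead of `fmt='d'`.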

## SVC

In [None]:
# from sklearn.svm import SVC
# svc = SVC(kernel = 'rbf')
# svc.fit(X_train, y_train)

# y_pred_svc = svc.predict(X_test)

# print("Accuracy of Support Vector Classifier is: {}%".format(accuracy_score(y_test, y_pred_svc) * 100))
# print("Confusion Matrix of Support Vector Classifier is: \n{}".format(confusion_matrix(y_test, y_pred_svc)))
# print("{}".format(classification_report(y_test, y_pred_svc)))

<a id="section_clf_pipeline"></a>
<h3>Pipeline and GridSearch</h3>

[back to TOC](#section_toc)

<a id="section_logreg"></a>
<h4>Logistic Regression</h4>

[back to TOC](#section_toc)

In [None]:
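# The pipeline vectorizes the text itself, so it receives the (stemmed) descriptions directly
data_x = data['description_clean']
data_y = data['genre_unique']
X_train_pipe, X_test_pipe, y_train_pipe, y_test_pipe = train_test_split(data_x, data_y, random_state=42, test_size=0.33, shuffle=True)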

In [None]:

pipeline = Pipeline([('tfidf', TfidfVectorizer(stop_words=stop_words)), 
                     ('log', LogisticRegression())])

parameters = {'tfidf__ngram_range': [(1, 1), (1, 2), (1, 3)],
              'log__penalty': ['l2'],
              'log__C': [0.01, 0.1, 1],
              "log__class_weight": ['balanced', None],
              "log__solver" : ['sag', 'lbfgs']
}

# skf=StratifiedKFold(n_splits=3, random_state=3, shuffle=True)
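With an integer `cv` and a classifier, `GridSearchCV` already uses stratified folds, but without shuffling. The commented-out `StratifiedKFold` above would add seeded shuffling. This sketch shows how such an object could be passed as `cv`; the toy corpus, `toy_pipeline` and `toy_parameters` are hypothetical and much smaller than the notebook's data and grid.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for X_train_pipe / y_train_pipe.
toy_texts = [
    "police chase gang city", "cop fights crime boss", "agent stops bomb plot",
    "soldier leads team war", "officer hunts killer gang", "fighter seeks revenge city",
    "friends plan funny wedding", "family trip goes wrong", "school prank gets wild",
    "couple swaps jobs laughs", "neighbors feud over dog", "roommates fake a party",
]
toy_labels = ["action"] * 6 + ["comedy"] * 6

toy_pipeline = Pipeline([("tfidf", TfidfVectorizer()),
                         ("log", LogisticRegression(max_iter=1000))])
toy_parameters = {"tfidf__ngram_range": [(1, 1), (1, 2)],
                  "log__C": [0.1, 1]}

# Shuffled, seeded folds that keep the class ratio in every split.
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=3)

toy_grid = GridSearchCV(toy_pipeline, toy_parameters, cv=skf)
toy_grid.fit(toy_texts, toy_labels)

print(toy_grid.best_params_)
```

`best_params_` lists only the tuned values, which is usually easier to read than the full `best_estimator_.steps`.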


In [None]:
grid = GridSearchCV(
    pipeline, parameters, cv=3, n_jobs=2, verbose=1)
grid.fit(X_train_pipe, y_train_pipe)

In [None]:
print("Best parameters set:")
grid.best_estimator_.steps

In [None]:
print("Applying best classifier on test data:")
predictions = grid.best_estimator_.predict(X_test_pipe)

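# The labels are strings, so the report names its rows itself, in sorted order.
# `genres` is in order of appearance and would mislabel the rows if passed as target_names.
print(classification_report(y_test_pipe, predictions))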

In [None]:
print("Accuracy = ",accuracy_score(y_test_pipe,predictions))
print("\n")

<a id="section_multiNB"></a>
<h4>Multinomial NB</h4>

[back to TOC](#section_toc)

In [None]:
pipeline1 = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words=stop_words)),
    ('clf', MultinomialNB(
        fit_prior=True, class_prior=None))
])
parameters1 = {
#     'tfidf__max_df': [50, 100, 200],
    'tfidf__ngram_range': [(1, 1), (1, 2), (1, 3)],
#     'tfidf__norm':['l1', 'l2']
    'clf__alpha': (1e-2, 1e-3)
}

In [None]:
grid1 = GridSearchCV(
    pipeline1, parameters1, cv=3, n_jobs=2, verbose=1)
grid1.fit(X_train_pipe, y_train_pipe)

In [None]:
print("Best parameters set:")
grid1.best_estimator_.steps

In [None]:
grid1.best_score_

In [None]:
print("Applying best classifier on test data:")
best_clf = grid1.best_estimator_
predictions1 = best_clf.predict(X_test_pipe)

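# The labels are strings, so the report names its rows itself, in sorted order.
# `genres` is in order of appearance and would mislabel the rows if passed as target_names.
print(classification_report(y_test_pipe, predictions1))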

In [None]:
print("Accuracy = ",accuracy_score(y_test_pipe, predictions1))
print("\n")