# Metrolyrics Text Classification

## Text Classification
In this section, you'll build a text classifier, determining the genre of a song based on its lyrics.

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline 
import seaborn as sns

# from sklearn import metrics
# from sklearn.model_selection import train_test_split
# from sklearn.decomposition import PCA
# from sklearn.manifold import TSNE

# from sklearn.pipeline import Pipeline
# from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
# from sklearn.feature_extraction.text import TfidfTransformer
# from sklearn.naive_bayes import MultinomialNB
# from sklearn.ensemble import ExtraTreesClassifier

In [2]:
# pd.read_parquet needs a parquet engine: install pyarrow or fastparquet first
df = pd.read_parquet('lyrics.parquet')

### Text classification using Bag-of-Words
Build a Naive Bayes classifier on bag-of-words features.  
You will need to split your dataset into train and test sets.

Our dataset is fairly large, so holding out 10% for testing still leaves roughly 1,000 songs per genre.  
I would even consider a smaller test fraction if I had reason to believe that more training data would substantially improve the classifier.

In [413]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()

# Hold out 10% for testing: with about 10,000 songs per genre, that is roughly 1,000 test songs per genre
X_train, X_test, y_train, y_test = \
  train_test_split(df.sent, df.genre, test_size=0.1, random_state=0)

# Join each song's tokens into one string; the vocabulary is fitted on the train set only
X_train_counts = count_vect.fit_transform([" ".join(sent) for sent in X_train])
X_test_counts = count_vect.transform([" ".join(sent) for sent in X_test])

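Note that the count vectorizer is fitted on the train set only. Fitting it on the full dataset would let information from the test set leak into training, and this kind of leakage is easy to introduce by accident.  
The safest way to avoid it is to use pipelines, but here we'll stick to the explicit flow.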
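
For reference, the pipeline approach could look roughly like the sketch below. The toy documents and labels are made up for illustration and are not the lyrics dataset. The point is only that `fit` learns the vocabulary from the training documents alone, so the test documents cannot influence it.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy stand-ins for the lyrics and genres (illustrative only, not the real data)
train_docs = ["truck road whiskey", "beat flow rhyme", "road home truck", "rhyme beat street"]
train_labels = ["Country", "Hip-Hop", "Country", "Hip-Hop"]
test_docs = ["whiskey road", "street flow"]

pipe = Pipeline([
    ("counts", CountVectorizer()),    # vocabulary is learned inside pipe.fit()
    ("nb", MultinomialNB(alpha=1)),   # same smoothing as in this notebook
])

pipe.fit(train_docs, train_labels)      # fits the vectorizer and the classifier on train data only
predictions = pipe.predict(test_docs)   # test documents are only transformed, never fitted
print(predictions)
```

Because the vectorizer lives inside the pipeline, the same object can also be passed to cross-validation utilities without refitting the vocabulary on held-out folds by mistake.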

In [414]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix, classification_report

model = MultinomialNB(alpha = 1)
model.fit(X_train_counts, y_train)

MultinomialNB(alpha=1, class_prior=None, fit_prior=True)

Show the confusion matrix.

Note that when the target values are the class names as strings, the report below includes them, which makes it much easier to interpret.

In [415]:
# Predict
# confusion_matrix orders rows and columns by sorted label, which matches model.classes_
genres = model.classes_

print('Performance on the train set:')
y_pred = model.predict(X_train_counts)
cm = confusion_matrix(y_train, y_pred, labels=genres)
print(pd.DataFrame(cm, index=genres, columns=genres))

print('\nPerformance on the test set:')
y_pred = model.predict(X_test_counts)
cm = confusion_matrix(y_test, y_pred, labels=genres)
print(pd.DataFrame(cm, index=genres, columns=genres))

Performance on the train set:
         Country  Hip-Hop  Metal   Pop  Rock
Country     6534       60    150   239   551
Hip-Hop      243     7016    198   798   550
Metal        181      267   6600   267  1233
Pop         1834      557    433  4016  2643
Rock        1931      432    933   779  6534

Performance on the test set:
         Country  Hip-Hop  Metal  Pop  Rock
Country      639        8     19   23    85
Hip-Hop       32      781     21   96    64
Metal         21       46    685   35   179
Pop          211       87     57  374   333
Rock         286       69    169  119   559
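
One caveat when attaching genre names to the matrix: by default, `confusion_matrix` orders its rows and columns by the sorted label values, so the names used as the DataFrame index and columns have to follow that same order, or an explicit `labels=` list has to be passed. A minimal sketch with made-up labels (the toy data and the `order` list are hypothetical):

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Toy labels (illustrative only)
y_true = ["Pop", "Rock", "Country", "Pop", "Rock"]
y_pred = ["Pop", "Pop", "Country", "Rock", "Rock"]

# Default: rows and columns follow the sorted labels (Country, Pop, Rock)
default_cm = confusion_matrix(y_true, y_pred)

# Explicit order: pass the same list to labels= and to the DataFrame
order = ["Pop", "Rock", "Country"]
ordered_cm = confusion_matrix(y_true, y_pred, labels=order)
print(pd.DataFrame(ordered_cm, index=order, columns=order))
```

A quick sanity check is to compare the row sums of the matrix with the `support` column of the classification report: they should agree genre by genre.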


Show the classification report: precision, recall, and F1 for each class.

In [416]:
y_pred = model.predict(X_test_counts)
print( classification_report(y_test, y_pred ))

             precision    recall  f1-score   support

    Country       0.54      0.83      0.65       774
    Hip-Hop       0.79      0.79      0.79       994
      Metal       0.72      0.71      0.71       966
        Pop       0.58      0.35      0.44      1062
       Rock       0.46      0.47      0.46      1202

avg / total       0.61      0.61      0.60      4998
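
The numbers in the report can be derived directly from the confusion matrix, which is a useful way to check that both are being read correctly. The sketch below uses the test-set counts of the Naive Bayes model, with rows (true genre) and columns (predicted genre) in the sorted order Country, Hip-Hop, Metal, Pop, Rock; under that order the row sums equal the `support` column of the report.

```python
import numpy as np

# Test-set confusion matrix of the Naive Bayes model
# (rows = true genre, columns = predicted genre, sorted label order)
genres = ["Country", "Hip-Hop", "Metal", "Pop", "Rock"]
cm = np.array([
    [639,   8,  19,  23,  85],
    [ 32, 781,  21,  96,  64],
    [ 21,  46, 685,  35, 179],
    [211,  87,  57, 374, 333],
    [286,  69, 169, 119, 559],
])

true_positives = np.diag(cm)
precision = true_positives / cm.sum(axis=0)   # column sums: songs predicted as each genre
recall = true_positives / cm.sum(axis=1)      # row sums: true songs of each genre (support)
f1 = 2 * precision * recall / (precision + recall)

for name, p, r, f in zip(genres, precision, recall, f1):
    print(f"{name:>8}  precision={p:.2f}  recall={r:.2f}  f1={f:.2f}")
```

Rounded to two decimals, these values reproduce the per-genre rows of the report above.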



A confusion matrix is often easier to read as a heatmap.  
Seaborn provides a simple syntax for this and can annotate each cell with its count.

In [2]:
import seaborn as sns
from sklearn.metrics import confusion_matrix

# annot=True writes the count in each cell; fmt='d' formats it as an integer
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='d')

When we created our dataset, we kept a roughly equal number of songs from each genre. Without this step there would be many more Rock songs, and the heatmap would be dominated by that row and column, washing out all other genres.  
In such cases it is often helpful to normalize the heatmap by row or by column:

In [None]:
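cm = confusion_matrix(y_test, y_pred)
# Divide each column by its sum: the diagonal then holds the precision of each genre
normalized_cm = cm / cm.sum(axis=0, keepdims=True)
print(normalized_cm)
sns.heatmap(normalized_cm, annot=True, fmt='.2f')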

In [None]:
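cm = confusion_matrix(y_test, y_pred)
# Divide each row by its sum: the diagonal then holds the recall of each genre.
# keepdims=True keeps the sums as a column so the division broadcasts across rows.
normalized_cm = cm / cm.sum(axis=1, keepdims=True)
print(normalized_cm)
sns.heatmap(normalized_cm, annot=True, fmt='.2f')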
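
Row and column normalization answer different questions (what fraction of each true genre was recognized versus what fraction of each predicted genre was correct), and NumPy broadcasting makes it easy to mix them up. A small sketch on a made-up 2x2 matrix shows the difference:

```python
import numpy as np

# Toy confusion matrix (rows = true class, columns = predicted class)
cm = np.array([[8,  2],
               [3, 27]])

by_column = cm / cm.sum(axis=0, keepdims=True)  # each column sums to 1 (precision on the diagonal)
by_row = cm / cm.sum(axis=1, keepdims=True)     # each row sums to 1 (recall on the diagonal)

# Without keepdims the row sums form a flat vector that broadcasts along the last axis,
# so column j is divided by the sum of row j: this is NOT a row normalization
wrong = cm / cm.sum(axis=1)

print(by_row)
print(wrong)
```

With a balanced test set the row sums are nearly equal, so the mistake in `wrong` is easy to miss; checking that the rows (or columns) sum to 1 catches it.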

### Text classification using Word Vectors
#### Average word vectors
Do the same, using a classifier that averages the word vectors of words in the document.

In [417]:
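# Note: this sums the vectors of the in-vocabulary words (w2v holds the word vectors trained earlier);
# divide by the number of such words to get a true average
X_train_vec = np.array([sum(w2v[w] for w in sent if w in w2v) for sent in X_train])
X_test_vec = np.array([sum(w2v[w] for w in sent if w in w2v) for sent in X_test])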
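
A length-independent variant takes the mean instead of the sum and handles songs that contain no known word: the built-in `sum` over an empty generator returns the integer `0` rather than a vector, which would make the resulting array ragged. The sketch below is one possible way to do this; the toy `w2v` dictionary and the `average_vector` helper are hypothetical stand-ins, not the notebook's trained model.

```python
import numpy as np

# Hypothetical stand-in for the trained word vectors: any mapping word -> vector works here
w2v = {
    "love": np.array([1.0, 0.0, 2.0]),
    "baby": np.array([3.0, 2.0, 0.0]),
}
dim = 3  # embedding dimension of the toy vectors

def average_vector(tokens, vectors, dim):
    """Mean of the vectors of known tokens; a zero vector if no token is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

songs = [["love", "baby", "yeah"], ["zzz"]]   # "yeah" and "zzz" are out of vocabulary
X_vec = np.array([average_vector(s, w2v, dim) for s in songs])
print(X_vec)
```

Whether the mean or the sum works better is an empirical question; the mean simply removes the dependence on song length.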

In [428]:
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train_vec, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [429]:
# Predict
genres = model.classes_  # sorted label order, the same order confusion_matrix uses

y_pred = model.predict(X_train_vec)
cm = confusion_matrix(y_train, y_pred, labels=genres)
print(pd.DataFrame(cm, index=genres, columns=genres))

y_pred = model.predict(X_test_vec)
cm = confusion_matrix(y_test, y_pred, labels=genres)
print(pd.DataFrame(cm, index=genres, columns=genres))

         Country  Hip-Hop  Metal   Pop  Rock
Country     4700       64    250   743  1777
Hip-Hop      135     6872    293   921   584
Metal        170      225   6502   371  1280
Pop         1174      660    725  4348  2576
Rock        1758      269   1818  1877  4887
         Country  Hip-Hop  Metal  Pop  Rock
Country      483        7     29   80   175
Hip-Hop       16      784     25  104    65
Metal         16       28    742   38   142
Pop          116       74     74  515   283
Rock         221       27    230  208   516


The section above raises an interesting question: have we introduced leakage by using word vectors trained on the entire dataset, including the test data? What do you think?

#### TfIdf Weighting
Do the same, using a classifier that averages the word vectors of words in the document, weighting each word by its TfIdf.


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vect = TfidfVectorizer(smooth_idf=True, sublinear_tf=False, norm=None, analyzer='word')
# Fit the IDF weights on the train set only, then reuse them for the test set
X_train_tfidf = tfidf_vect.fit_transform([" ".join(sent) for sent in X_train])
X_test_tfidf = tfidf_vect.transform([" ".join(sent) for sent in X_test])

In [None]:
# TODO: Weight each word by its tfidf value
X_train_vec = np.array([sum(w2v[w] for w in sent if w in w2v) for sent in X_train])
X_test_vec = np.array([sum(w2v[w] for w in sent if w in w2v) for sent in X_test])
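
One possible way to approach the TODO is sketched below, written by hand in NumPy rather than on top of the fitted vectorizer so that every step is visible. It assumes raw term counts and the smoothed IDF formula idf(t) = ln((1 + n) / (1 + df(t))) + 1, which is the formula scikit-learn documents for `smooth_idf=True`. The toy `docs`, the toy `w2v` dictionary, and the `tfidf_weighted_vector` helper are hypothetical.

```python
import math
from collections import Counter

import numpy as np

# Hypothetical toy data: tokenized songs and a stand-in word -> vector mapping
docs = [["love", "love", "baby"], ["baby", "truck"]]
w2v = {"love": np.array([1.0, 0.0]), "baby": np.array([0.0, 1.0])}
dim = 2

# Smoothed IDF, computed from the training documents only
n_docs = len(docs)
doc_freq = Counter(t for doc in docs for t in set(doc))
idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

def tfidf_weighted_vector(tokens, vectors, idf, dim):
    """Weighted mean of word vectors, each word weighted by tf * idf."""
    total = np.zeros(dim)
    weight_sum = 0.0
    for token, tf in Counter(tokens).items():
        # Skip words without a vector or without an IDF (unseen during training)
        if token in vectors and token in idf:
            weight = tf * idf[token]
            total += weight * vectors[token]
            weight_sum += weight
    return total / weight_sum if weight_sum > 0 else total

X_vec = np.array([tfidf_weighted_vector(d, w2v, idf, dim) for d in docs])
print(X_vec)
```

In the first toy song, "love" is both more frequent and rarer across documents than "baby", so it dominates the weighted average. For the test set, the same `idf` dictionary fitted on the training songs would be reused.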

### Text classification using ConvNet
Do the same, using a ConvNet.  
The ConvNet should get as input a 2D matrix in which each column is the embedding vector of a single word, with the words kept in their original order. Use zero padding so that all matrices have the same length.  
Some songs might be very long. Trim them so you keep a maximum of 128 words (after cleaning stop words and rare words).  
Initialize the embedding layer using the word vectors that you've trained before, but allow them to change during training.  
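
The trimming and padding step can be sketched as follows. This is only an illustration under stated assumptions: songs are assumed to be already mapped to integer word indices, index 0 is assumed to be reserved for padding, and the helper name `pad_or_trim` is hypothetical. With an embedding layer, padding the indices before the lookup is a common way to obtain equally sized inputs.

```python
import numpy as np

MAX_LEN = 128   # maximum number of words kept per song, as specified above
PAD_ID = 0      # assumption: index 0 is reserved for padding

def pad_or_trim(token_ids, max_len=MAX_LEN, pad_id=PAD_ID):
    """Keep the first max_len ids and right-pad shorter songs with pad_id."""
    trimmed = list(token_ids)[:max_len]
    return trimmed + [pad_id] * (max_len - len(trimmed))

# Hypothetical songs already mapped to integer word ids (1-based, 0 = padding)
songs = [[5, 2, 9], list(range(1, 301))]
batch = np.array([pad_or_trim(s) for s in songs])
print(batch.shape)
```

Every row of `batch` now has exactly `MAX_LEN` entries, so the songs can be stacked into a single array and fed to the network in mini-batches.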

Extra: Try training the ConvNet with 2 slight modifications:
1. freezing the weights trained using Word2vec (preventing them from updating)
1. random initialization of the embedding layer

You are encouraged to try this question on your own.  

You might prefer to get ideas from the paper "Convolutional Neural Networks for Sentence Classification" (Kim 2014, [link](https://arxiv.org/abs/1408.5882)).

Several PyTorch implementations of the paper are available online (see for example [this repo](https://github.com/prakashpandey9/Text-Classification-Pytorch), which implements a CNN and other architectures for text classification). If you get stuck, they can serve as a reference for your own code.