# Word Embedding - Home Assignment
## Dr. Omri Allouche 2018. YData Deep Learning Course

[Open in Google Colab](https://colab.research.google.com/github/omriallouche/deep_learning_course/blob/master/DL_word_embedding_assignment.ipynb)
    
    
In this exercise, you'll use word vectors trained on a corpus of 380,000 lyrics of songs from MetroLyrics (https://www.kaggle.com/gyani95/380000-lyrics-from-metrolyrics).  
The dataset contains these fields for each song, in CSV format:
1. index
1. song
1. year
1. artist
1. genre
1. lyrics

Before doing this exercise, we recommend that you go over the "Bag of Words Meets Bags of Popcorn" tutorial (https://www.kaggle.com/c/word2vec-nlp-tutorial).

Other recommended resources:
- https://rare-technologies.com/word2vec-tutorial/
- https://www.kaggle.com/pierremegret/gensim-word2vec-tutorial

In [None]:
!conda install spacy

In [None]:
# Install needed packages
!pip install gensim

In [None]:
# import needed packages
import pandas as pd
import numpy as np
# import operator
import multiprocessing  # needed below for multiprocessing.cpu_count()
import logging
# from tqdm import tqdm
import re
import string
# import warnings
# warnings.filterwarnings("ignore")

import matplotlib.pyplot as plt
%matplotlib inline
 
import seaborn as sns
# sns.set_style("darkgrid")

from collections import Counter
# from scipy.spatial import distance

from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import ExtraTreesClassifier

In [None]:
import spacy

from spacy.util import minibatch, compounding
from spacy.tokenizer import Tokenizer

In [None]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

nltk.download('punkt')
nltk.download('stopwords')

In [None]:
STOP_WORDS = set(stopwords.words('english'))
PUNCT = dict.fromkeys(map(ord, string.punctuation))
CPUS = multiprocessing.cpu_count()

In [None]:
import gensim
from gensim.models import Word2Vec
from gensim.models import KeyedVectors
from gensim.test.utils import get_tmpfile
from gensim.scripts.glove2word2vec import glove2word2vec

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.WARNING)

In [None]:
df = pd.read_csv('lyrics.csv', index_col=0)
print('df.shape:', df.shape)
print('df.columns:', df.columns)
df.head()

In [None]:
df = df.dropna().reset_index(drop=True)

### Exploratory Data Analysis
Let's examine the data a bit first. We can print the numbers, or plot a bar chart of the counts.

In [None]:
df.groupby('genre')['lyrics'].count().sort_values()

In [None]:
sns.countplot(data=df, y='genre', order=df.groupby('genre')['lyrics'].count().sort_values(ascending=False).index, orient='h')

We see that there are a lot of Rock songs, and very few Indie and Folk songs. How can this class imbalance affect our classifier performance?

This class imbalance might affect our algorithms if we're using only the most common words, since these words will be biased by their prevalence in Rock and Pop songs.  


We also see songs classified as "Not Available" or "Other". We should probably remove these from our dataset:

In [None]:
df = df[ ~df.genre.isin(['Not Available', 'Other'])]

Next, let's examine the distribution of the length of songs. We expect most songs to be around 3-5 minutes, and therefore have a certain length.

In [None]:
df['num_chars'] = df['lyrics'].str.len()

In [None]:
df['num_chars'].hist(bins=100)

We see a large peak at roughly 1000 characters, but a long tail of long documents, and quite a few documents that are very short. Let's check the CDF (Cumulative Distribution Function):

In [None]:
from scipy.stats import percentileofscore, scoreatpercentile
from numpy import sort, arange, nanpercentile, diff

def get_cdf(data, ignore_nan=False):
    if ignore_nan:
        data = data[ ~np.isnan(data) ]
    values = sort(data)
    percentiles = arange(len(values))/float(len(values))
    return values, percentiles

def plot_cdf(data, **kwargs):
    ignore_nan = kwargs.pop('ignore_nan', False)
    ax = kwargs.pop('ax', plt.gca())
    values, percentiles = get_cdf(data, ignore_nan=ignore_nan)
    ax.plot(values, percentiles, **kwargs )
    return values, percentiles

plot_cdf(df['num_chars']);
plt.xlim((0, 3000))
plt.xlabel('Number of Characters'); plt.ylabel('Proportion'); plt.title('CDF - Number of Characters');

We indeed see that some documents are very short (remember this is the number of characters, including spaces, punctuation marks and new line characters). We will examine the short and long documents later.

Next, let's examine the length of songs by genre, using a box plot:

In [None]:
sns.boxplot(data=df, y='genre', x='num_chars')
plt.xlim(0, 4000);

We see that Hip-Hop songs tend to be much longer than the others, but don't see very large differences between other genres. We also note that there are many songs that are very long.

In [None]:
sns.violinplot(data=df, y='genre', x='num_chars')
plt.xlim(0, 4000);

We see that Electronic and Pop songs have large variability in length compared with Jazz and Country.

Next, let's examine the longest documents:

In [None]:
df.sort_values('num_chars').tail(50)

Let's examine a few of the records with many characters. We can see that these aren't actually songs: they are, for example, interviews or full albums. We don't expect our classifier to classify these correctly or learn from them, so it is better to remove them from our corpus.

In [None]:
print( df.loc[255126].lyrics )

In [None]:
print( df.loc[230543].lyrics )

In [None]:
print( df.query('num_chars>4000 & num_chars<4500').iloc[2]['lyrics'] )

On the other end, let's examine a few of the short songs in our dataset:

In [None]:
df.sort_values('num_chars').head(10)

We see that many songs have only special characters and no actual lyrics. We also see comments in [] blocks. We'd like to make sure we remove these from our data.

In [None]:
df = df[ df['num_chars']>200 ]

Below we look at the number of words and unique number of words for each genre - NOT IMPORTANT.

In [None]:
df['num_words'] = df['lyrics'].apply(lambda x: len(x.replace('\n', ' ').split()) )

In [None]:
df['num_unique_words'] = df['lyrics'].apply(lambda x: len( set( x.replace('\n', ' ').lower().split()) ) )

In [None]:
df['average_word_length'] = df['num_chars'] / df['num_words']

In [None]:
df.head()

In [None]:
sns.boxplot(data=df, y='genre', x='average_word_length')
plt.xlim(0, 10)

In [None]:
sns.boxplot(data=df, y='genre', x='num_unique_words')
plt.xlim(0, 200)

In [None]:
for genre in df.genre.unique():
    d = df[ df.genre==genre ]
    # Pass x and y by keyword; recent seaborn versions no longer accept them positionally
    sns.regplot(x=d['num_chars'], y=d['num_unique_words'], x_bins=30, label=genre)
plt.legend()

### Sample data 
In order to explore efficiently, we don't need all of the data we have. Let's create a subset of it, making sure the number of songs from each category generally matches. 

We first focus only on the most common genres - Rock, Pop, Hip-Hop, Metal and Country.

In [None]:
df = df[ df.genre.isin(['Rock', 'Pop', 'Hip-Hop', 'Metal', 'Country'])]
genre_counts = df['genre'].value_counts()

In [None]:
weights = [ 1/genre_counts.loc[v] for v in df.genre.values ]

In [None]:
s = df.sample(n=50000, random_state=10, weights=weights)

In [None]:
s.genre.value_counts()

In [None]:
df = s

### Train word vectors
Train word vectors using the Skipgram Word2vec algorithm and the gensim package.
Make sure you perform the following:
- Tokenize words
- Lowercase all words
- Remove punctuation marks
- Remove rare words
- Remove stopwords

Use 300 as the dimension of the word vectors. Try different context sizes.
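
The preprocessing steps listed above can be sketched on a toy corpus as follows. This is a minimal illustration, not the notebook's actual pipeline: `TOY_STOP_WORDS`, `simple_tokenize` and `drop_rare_words` are hypothetical names introduced here, and the stopword set is a tiny stand-in for NLTK's English list.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; the notebook uses NLTK's English list instead.
TOY_STOP_WORDS = {"the", "a", "i", "you", "and", "to", "my", "me"}

def simple_tokenize(text, stop_words=TOY_STOP_WORDS):
    """Lowercase, keep alphabetic runs only (drops punctuation), and remove stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in stop_words]

def drop_rare_words(docs, min_count=2):
    """Remove tokens that occur fewer than `min_count` times in the whole corpus."""
    counts = Counter(t for doc in docs for t in doc)
    return [[t for t in doc if counts[t] >= min_count] for doc in docs]

toy_lyrics = [
    "I love you, baby!",
    "Baby, you love the rain...",
    "Guitar and love",
]
tokenized = [simple_tokenize(s) for s in toy_lyrics]
cleaned = drop_rare_words(tokenized, min_count=2)
print(cleaned)
```

When training with gensim, the `min_count` argument of `Word2Vec` already ignores words below a frequency threshold, so an explicit rare-word filter like the one above is optional for that step.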

In [None]:
def nltk_tokenize(text):
    text = re.sub(r"[^a-zA-Z]", " ", text)
    text = re.sub(r"[\[*\]]", " ", text)
    text = text.translate(PUNCT)
    # Lowercase before the stopword check; otherwise capitalized stopwords such as "The" slip through
    return [word for word in nltk.word_tokenize(text.lower()) if word not in STOP_WORDS]

In [None]:
print( df.iloc[0].lyrics )

In [None]:
" ".join( nltk_tokenize( df.iloc[0].lyrics ) )

In [None]:
df['sent'] = df['lyrics'].apply(nltk_tokenize)

Let's find the most common and rare words. We can count each genre separately, and then easily combine the counts to get the background distribution. We need to remember that the number of songs in each genre is different.

In [None]:
df['num_words'] = df.sent.apply(lambda x: len(x))
df = df[ df.num_words > 5]

Finally, let's train word vectors on our new corpus. 

In [None]:
import gensim
# gensim >= 4.0 names the dimension parameter `vector_size` (it was `size` in gensim 3.x)
w2v = gensim.models.Word2Vec(df.sent, sg=1, min_count=4, vector_size=50, workers=CPUS*2-1)

### Review most similar words
Get initial evaluation of the word vectors by analyzing the most similar words for a few interesting words in the text. 

Choose words yourself, and find the most similar words to them.

In [None]:
words_to_check = ['love', 'hate', 'lonely', 'heartache', 'success', 'guitar', 'god', 'beer', 'gun', 'police']
for word in words_to_check:
    print(word, ' -> ', ['{} ({:.2f}), '.format(tup[0], tup[1]) for tup in w2v.wv.similar_by_word(word, topn=5)])
    print()

In [None]:
import gensim
# gensim >= 4.0 names the dimension parameter `vector_size` (it was `size` in gensim 3.x)
w2v = gensim.models.Word2Vec(df.sent, sg=1, min_count=5, vector_size=300, workers=CPUS*2-1)

In [None]:
words_to_check = ['love', 'hate', 'lonely', 'heartache', 'success', 'guitar', 'god', 'beer', 'gun', 'police']
for word in words_to_check:
    print(word, ' -> ', ['{} ({:.2f}), '.format(tup[0], tup[1]) for tup in w2v.wv.similar_by_word(word, topn=5)])
    print()

### Word Vectors Algebra
We've seen in class examples of algebraic games on the word vectors (e.g. king - man + woman ≈ queen). 

Try a few vector algebra terms, and evaluate how well they work. Try to use the Cosine distance and compare it to the Euclidean distance.
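
To make the comparison between the two distances concrete, here is a small sketch on hand-made vectors. The 3-dimensional `toy_vectors` and the word "empress" are assumptions chosen for illustration; with the trained model the vectors would come from `w2v.wv`. The helper names are hypothetical as well.

```python
import numpy as np

# Hypothetical 3-d vectors for illustration only; real vectors would come from w2v.wv
toy_vectors = {
    "king":    np.array([0.9, 0.8, 0.1]),
    "man":     np.array([0.1, 0.9, 0.1]),
    "woman":   np.array([0.1, 0.1, 0.9]),
    "queen":   np.array([0.9, 0.1, 0.8]),  # close to the target in absolute position
    "empress": np.array([2.7, 0.0, 2.7]),  # same direction as the target, but a much larger norm
}

def cosine_distance(u, v):
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def euclidean_distance(u, v):
    return np.linalg.norm(u - v)

def nearest(target, vectors, distance, exclude=()):
    """Return the word whose vector is closest to `target` under `distance`."""
    candidates = [w for w in vectors if w not in exclude]
    return min(candidates, key=lambda w: distance(target, vectors[w]))

# king - man + woman
target = toy_vectors["king"] - toy_vectors["man"] + toy_vectors["woman"]
query_words = ("king", "man", "woman")

by_cosine = nearest(target, toy_vectors, cosine_distance, exclude=query_words)
by_euclidean = nearest(target, toy_vectors, euclidean_distance, exclude=query_words)
print("cosine:", by_cosine, "| euclidean:", by_euclidean)
```

In this toy setup the cosine distance prefers "empress" because it ignores vector length, while the Euclidean distance prefers "queen" because it is sensitive to length. Whether the two measures disagree on the lyrics vectors is something to check empirically.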

In [None]:
model = w2v

In [None]:
# Wrap each call in print(): a notebook cell only displays the value of its last expression
print( model.wv.most_similar_cosmul(positive=['woman', 'king'], negative=['man'])[0] )
print( model.wv.most_similar(positive=['woman', 'king'], negative=['man'])[0] )
print( model.wv.most_similar_cosmul(positive=['girl', 'brother'], negative=['boy'])[0] )
print( model.wv.most_similar_cosmul(positive=["man","daughter"],negative=["woman"]) )
print( model.wv.most_similar(positive=['king', 'woman'], negative=['man']) )
print( model.wv.most_similar(positive=['mother', 'he'], negative=['father']) )
print( model.wv.most_similar(positive=['strong', 'small'], negative=['weak']) )

## Sentiment Analysis
Estimate sentiment of words using word vectors.  
In this section, we'll use the SemEval-2015 English Twitter Sentiment Lexicon.  
The lexicon was used as an official test set in the SemEval-2015 shared Task #10: Subtask E, and contains a polarity score for words in range -1 (negative) to 1 (positive) - http://saifmohammad.com/WebPages/SCL.html#OPP

Build a classifier for the sentiment of a word given its word vector. Split the data into train and test sets, and report the model performance on both sets.

We start by downloading the data and extracting it. We will create a dictionary whose keys are the words and whose values are their sentiment scores.  
Note that the sentiment dataset contains terms of multiple words and hashtags - we will remove these from the dataset.

In [None]:
!wget http://saifmohammad.com/WebDocs/lexiconstoreleaseonsclpage/SemEval2015-English-Twitter-Lexicon.zip
!unzip SemEval2015-English-Twitter-Lexicon.zip
!head SemEval2015-English-Twitter-Lexicon/SemEval2015-English-Twitter-Lexicon.txt

In [None]:
!wget http://nlp.stanford.edu/data/glove.twitter.27B.zip
!unzip glove.twitter.27B.zip

In [None]:
# https://radimrehurek.com/gensim/scripts/glove2word2vec.html
if False:
    tmp_file = get_tmpfile("w2v.twitter.27B.100d.txt")
    glove2word2vec('glove.twitter.27B.100d.txt', tmp_file)
    model = KeyedVectors.load_word2vec_format(tmp_file)

In [None]:
with open('SemEval2015-English-Twitter-Lexicon.txt', 'r') as f:
    scores, twitter_sentiment_words = zip(*[line.strip().split() for line in f.readlines()])

twitter_sentiment_words = [w[1:] if w.startswith('#') else w for w in twitter_sentiment_words]
X = []
y = []

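# Look words up in the vectors trained on the lyrics corpus.
# Going through `.wv` works in both gensim 3.x and 4.x; indexing the model directly fails in 4.x.
for i in range(len(scores)):
    if twitter_sentiment_words[i] in w2v.wv:
        X.append(w2v.wv[twitter_sentiment_words[i]])
        y.append(scores[i])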

X = np.array(X, dtype='float')
y = np.array(y, dtype='float')

X_train, X_test, y_train, y_test = \
  train_test_split(X, y, test_size=0.1, random_state=0) # Hold out 10% of the lexicon words for testing

Next let's build a regressor to predict the sentiment of a word from its word vector. We use the word vectors trained on our corpus of song lyrics.  
Let's try a few different algorithms, and choose the one with the smallest MSE on the test set:

In [None]:
from sklearn import metrics

from sklearn.neural_network import MLPRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR

models = [
    LinearRegression(),
    MLPRegressor(max_iter=1000, tol=1e-5, hidden_layer_sizes=(300,200,100)),
    RandomForestRegressor(),
    GradientBoostingRegressor(),
    SVR()
]
score = {}
for model in models:
    model_name = str(model).split('(')[0]
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    score[model_name] = metrics.mean_squared_error(y_test, y_pred)

print(score)
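
The task asks for performance on both the train and the test set, while the loop above records only the test MSE. The sketch below shows one way to report both. It uses synthetic data and a plain NumPy least-squares fit as stand-ins (assumptions), so that it runs without the lexicon or the trained vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumptions): 200 "word vectors" of dimension 10 with noisy linear scores
X_toy = rng.normal(size=(200, 10))
true_w = rng.normal(size=10)
y_toy = X_toy @ true_w + rng.normal(scale=0.1, size=200)

# 90/10 split, mirroring test_size=0.1
idx = rng.permutation(len(X_toy))
train_idx, test_idx = idx[:180], idx[180:]

def add_intercept(A):
    """Append a column of ones so the fit includes a bias term."""
    return np.hstack([A, np.ones((len(A), 1))])

# Ordinary least squares on the training portion only
coef, *_ = np.linalg.lstsq(add_intercept(X_toy[train_idx]), y_toy[train_idx], rcond=None)

def mse(A, target):
    return float(np.mean((add_intercept(A) @ coef - target) ** 2))

train_mse = mse(X_toy[train_idx], y_toy[train_idx])
test_mse = mse(X_toy[test_idx], y_toy[test_idx])
print({"train_mse": train_mse, "test_mse": test_mse})
```

With the scikit-learn models in the notebook, the analogous step would be to call `metrics.mean_squared_error(y_train, model.predict(X_train))` next to the existing test-set call. A large gap between the two numbers suggests overfitting.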

In [None]:
model = models[-1]

Use your trained model from the previous question to predict the sentiment score of words in the lyrics corpus that are not part of the original sentiment dataset. Review the words with the highest positive and negative sentiment. Do the results make sense?

In [None]:
# Note: freqs_bg (corpus-wide word counts) is built in the "Visualize Word Vectors" section below; run that cell first.
known_words = set(twitter_sentiment_words)
test_words = []
for w, _ in freqs_bg.most_common(10000):
    if w not in known_words and w in w2v.wv:
        test_words.append(w)
    if len(test_words) == 1000:
        break

similar_words_features = np.array([w2v.wv[w] for w in test_words], 'float')
sentiment_scores = model.predict(similar_words_features)

df_predicted_sentiment = pd.Series(sentiment_scores, index=test_words)

print(' --- top 20 negative sentiment score --')
print( df_predicted_sentiment.sort_values().head(20) )

print(' --- top 20 positive sentiment score --')
print( df_predicted_sentiment.sort_values().tail(20) )

### Visualize Word Vectors
In this section, you'll plot words on a 2D grid based on their inner similarity. We'll use the tSNE transformation to reduce dimensions from 300 to 2. You can get sample code from https://www.kaggle.com/pierremegret/gensim-word2vec-tutorial or other tutorials online.

Perform the following:
- Keep only the 3,000 most frequent words (after removing stopwords)
- For this list, compute for each word its relative abundance in each of the genres
- Compute the ratio between the proportion of each word in each genre and the proportion of the word in the entire corpus (the background distribution)
- Pick the top 50 words for each genre. These words give good indication for that genre. Join the words from all genres into a single list of top significant words. 
- Compute tSNE transformation to 2D for all words, based on their word vectors
- Plot the list of the top significant words in 2D. Next to each word output its text. The color of each point should indicate the genre for which it is most significant.

You might prefer to use a different number of points or a slightly different methodology for improved results.  
Analyze the results.

In [None]:
!pip install tqdm

In [None]:
from tqdm import tqdm 

freqs = {}
for i,r in tqdm(df.iterrows()):
    if r.genre not in freqs:
        freqs[ r.genre ] = Counter()
    freqs[ r.genre ].update(r.sent)

In [None]:
freqs_bg = Counter()
[freqs_bg.update(v) for v in freqs.values()];

Let's check the most common words for each genre. Remember, here we only consider word occurrences. A word might be common in all genres. We will later examine words that are common in a specific genre more than the others.

In [None]:
freqs_bg.most_common(20)

In [None]:
_ = {genre: [v[0] for v in counts.most_common(20)] for genre, counts in freqs.items()}
pd.DataFrame(_)

Can you spot the genre with the least amount of "love"?

To find the most common words for each genre, we will search for words that are both common and have a large ratio between their relative frequency in a genre and their relative frequency in the background.
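
As a worked example of this ratio, consider the toy counts below. The numbers are made up for illustration, and `toy_freqs` and `genre_ratio` are hypothetical names, not part of the notebook.

```python
from collections import Counter

# Toy token counts per genre (made-up numbers for illustration)
toy_freqs = {
    "Country": Counter({"love": 50, "truck": 30, "beer": 20}),
    "Metal":   Counter({"love": 10, "blood": 60, "beer": 30}),
}

# Background distribution: the sum of the per-genre counts
background = Counter()
for counts in toy_freqs.values():
    background.update(counts)
total_bg = sum(background.values())

def genre_ratio(word, genre):
    """Relative frequency of `word` in `genre` divided by its relative frequency overall."""
    p_genre = toy_freqs[genre][word] / sum(toy_freqs[genre].values())
    p_bg = background[word] / total_bg
    return p_genre / p_bg

top_word = {}
for genre in toy_freqs:
    ranked = sorted(background, key=lambda w: genre_ratio(w, genre), reverse=True)
    top_word[genre] = ranked[0]
    print(genre, [(w, round(genre_ratio(w, genre), 2)) for w in ranked])
```

A ratio above 1 means the word is over-represented in the genre relative to the corpus as a whole. Here "love" is frequent everywhere, so its ratio for Country is lower than that of "truck", which appears only in Country.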


In [None]:
total_counts = {genre: sum(counts.values()) for genre, counts in freqs.items()}
total_counts

In [None]:
total_counts_bg = sum(freqs_bg.values())
word_freq = pd.DataFrame({w: freqs_bg[w] / total_counts_bg for w, v in freqs_bg.most_common(2000)}, index=['bg']).T
word_freq.head()

In [None]:
for genre in total_counts.keys():
    word_freq[genre] = [ freqs[genre].get(w, 0)/total_counts[genre]/r['bg'] for w,r in word_freq.iterrows() ]

In [None]:
genres = df.genre.unique()
for genre in genres:
    print( genre )
    print( ", ".join(word_freq.sort_values(genre, ascending=False).index[:50].values ) )
    print()

In [None]:
import itertools
words = list(set(list(itertools.chain.from_iterable([word_freq.sort_values(genre, ascending=False).index[:50].values for genre in genres]))))

In [None]:
from sklearn.manifold import TSNE
X = np.array([w2v.wv[w] for w in words])  # look vectors up via .wv (required in gensim 4.x)
X_embedded = TSNE(n_components=2).fit_transform(X)

In [None]:
X_embedded.shape

In [None]:
colors = ['r', 'g', 'b', 'k', 'y']
# idxmax returns the genre label with the largest ratio; restrict to genre columns so 'bg' is excluded
word_genre = np.array( [word_freq.loc[w, list(genres)].idxmax() for w in words] )
c = [colors[list(genres).index(g)] for g in word_genre]

In [None]:
plt.figure(figsize=(20,12))

x = X_embedded[:,0];
y = X_embedded[:,1];

for i,genre in enumerate(genres):
    ids = np.where( word_genre==genre )[0]
    plt.plot(x[ids], y[ids], '.', color=colors[i], label=genre)
for i, word in enumerate(words):
    plt.annotate(word, alpha=0.5, xy=(x[i], y[i]), xytext=(5, 2),
                 textcoords='offset points', ha='right', va='bottom', size=12)
plt.legend();

## Text Classification
In this section, you'll build a text classifier, determining the genre of a song based on its lyrics.

### Text classification using Bag-of-Words
Build a Naive Bayes classifier based on the Bag-of-Words representation.  
You will need to divide your dataset into train and test sets.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()

X_train, X_test, y_train, y_test = \
  train_test_split(df.sent, df.genre, test_size=0.1, random_state=0) # We used 10,000 songs from each category - 1,000 songs in the test set seem like a lot. 

X_train_counts = count_vect.fit_transform([" ".join(sent) for sent in X_train])
X_test_counts = count_vect.transform([" ".join(sent) for sent in X_test])

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix, classification_report

model = MultinomialNB(alpha = 1)
model.fit(X_train_counts, y_train)

Show the confusion matrix.

In [None]:
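# Predict
# Pass labels=genres so rows and columns match the DataFrame labels;
# by default confusion_matrix orders the classes alphabetically, not in the order of `genres`.
y_pred = model.predict(X_train_counts)
cm = confusion_matrix(y_train, y_pred, labels=genres)
print( pd.DataFrame(cm, index=genres, columns=genres) )

y_pred = model.predict(X_test_counts)
cm = confusion_matrix(y_test, y_pred, labels=genres)
print( pd.DataFrame(cm, index=genres, columns=genres) )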

Show the classification report - precision, recall, f1 for each class.

In [None]:
print( classification_report(y_test, y_pred ))

In [None]:
import seaborn as sns
sns.heatmap( confusion_matrix(y_test, y_pred) )

### Text classification using Word Vectors
#### Average word vectors
Do the same, using a classifier that averages the word vectors of words in the document.

In [None]:
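# Average (rather than sum) the vectors of in-vocabulary words; assumes each song has at least one such word
X_train_vec = np.array([np.mean([w2v.wv[w] for w in sent if w in w2v.wv], axis=0) for sent in X_train])
X_test_vec = np.array([np.mean([w2v.wv[w] for w in sent if w in w2v.wv], axis=0) for sent in X_test])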

In [None]:
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train_vec, y_train)

In [None]:
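# Predict
# Pass labels=genres so rows and columns match the DataFrame labels;
# by default confusion_matrix orders the classes alphabetically, not in the order of `genres`.
y_pred = model.predict(X_train_vec)
cm = confusion_matrix(y_train, y_pred, labels=genres)
print( pd.DataFrame(cm, index=genres, columns=genres) )

y_pred = model.predict(X_test_vec)
cm = confusion_matrix(y_test, y_pred, labels=genres)
print( pd.DataFrame(cm, index=genres, columns=genres) )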

#### TfIdf Weighting
Do the same, using a classifier that averages the word vectors of words in the document, weighting each word by its TfIdf.


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vect = TfidfVectorizer(smooth_idf=True, sublinear_tf=False, norm=None, analyzer='word')
X_train_tfidf = tfidf_vect.fit_transform([" ".join(sent) for sent in X_train])
X_test_tfidf = tfidf_vect.transform([" ".join(sent) for sent in X_test])

In [None]:
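# Note: this still computes the unweighted average; the TfIdf weights are not applied yet
X_train_vec = np.array([np.mean([w2v.wv[w] for w in sent if w in w2v.wv], axis=0) for sent in X_train])
X_test_vec = np.array([np.mean([w2v.wv[w] for w in sent if w in w2v.wv], axis=0) for sent in X_test])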
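
The cell above does not yet apply any weights. One possible way to weight each word vector by its TfIdf is sketched below on a toy corpus. The documents, the 2-dimensional `emb` vectors and the function name `tfidf_weighted_vector` are assumptions for illustration. The IDF formula is the smoothed variant that scikit-learn uses when `smooth_idf=True`.

```python
import math
from collections import Counter
import numpy as np

# Toy corpus and hypothetical 2-d embeddings (illustration only)
docs = [["love", "love", "guitar"], ["love", "beer"], ["love", "gun"]]
emb = {
    "love":   np.array([1.0, 0.0]),
    "guitar": np.array([0.0, 1.0]),
    "beer":   np.array([0.0, -1.0]),
}  # "gun" is deliberately left out of the vocabulary

# Smoothed IDF: ln((1 + n_docs) / (1 + doc_freq)) + 1
n_docs = len(docs)
doc_freq = Counter(w for doc in docs for w in set(doc))
idf = {w: math.log((1 + n_docs) / (1 + d)) + 1 for w, d in doc_freq.items()}

def tfidf_weighted_vector(doc, dim=2):
    """Average the word vectors of `doc`, weighting each word by tf * idf."""
    tf = Counter(w for w in doc if w in emb)  # skip out-of-vocabulary words
    if not tf:
        return np.zeros(dim)  # no known words: fall back to a zero vector
    weights = {w: count * idf[w] for w, count in tf.items()}
    total = sum(weights.values())
    return sum(weights[w] * emb[w] for w in weights) / total

weighted = tfidf_weighted_vector(docs[0])
unweighted = np.mean([emb[w] for w in docs[0]], axis=0)
print("weighted:  ", weighted)
print("unweighted:", unweighted)
```

Because "guitar" appears in only one document, its IDF is higher than that of "love", so the weighted vector leans further towards "guitar" than the plain average does. In the notebook, the per-word IDF values could be read from the fitted vectorizer's `vocabulary_` and `idf_` attributes instead of being recomputed.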

### Text classification using ConvNet
Do the same, using a ConvNet.  
The ConvNet should get as input a 2D matrix where each column is an embedding vector of a single word, and words are in order. Use zero padding so that all matrices have a similar length.  
Some songs might be very long. Trim them so you keep a maximum of 128 words (after cleaning stop words and rare words).  
Initialize the embedding layer using the word vectors that you've trained before, but allow them to change during training.  
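
The input preparation described above (trimming to 128 words, zero padding, and an embedding matrix initialized from pre-trained vectors) might look like the following sketch. It is framework-agnostic NumPy code. `build_vocab`, `encode` and the use of index 0 for padding are assumptions, and random numbers stand in for the trained Word2vec vectors.

```python
import numpy as np

MAX_LEN = 128  # maximum number of tokens kept per song
PAD_ID = 0     # assumption: index 0 is reserved for padding

def build_vocab(docs):
    """Map each token to an integer id, starting at 1 so that 0 can mean 'padding'."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab) + 1)
    return vocab

def encode(doc, vocab, max_len=MAX_LEN):
    """Convert tokens to ids, trim to `max_len`, and right-pad with PAD_ID."""
    ids = [vocab[t] for t in doc if t in vocab][:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))

toy_docs = [["love", "guitar"], ["love"] * 200]  # one short song, one overly long song
vocab = build_vocab(toy_docs)
batch = np.array([encode(d, vocab) for d in toy_docs])

# Embedding matrix: row i holds the vector of token id i; row 0 stays zero for padding.
# Random values stand in for the trained Word2vec vectors here.
dim = 300
rng = np.random.default_rng(0)
embedding_matrix = np.zeros((len(vocab) + 1, dim))
for token, i in vocab.items():
    embedding_matrix[i] = rng.normal(size=dim)

inputs = embedding_matrix[batch]  # one (MAX_LEN, dim) matrix per song
print(batch.shape, inputs.shape)
```

Each `inputs[i]` has one word per row; its transpose gives the layout with one embedding vector per column. In PyTorch, a matrix like `embedding_matrix` could be loaded with `torch.nn.Embedding.from_pretrained(..., freeze=False)` so that the vectors keep updating during training, and with `freeze=True` for the frozen variant.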

Extra: Try training the ConvNet with 2 slight modifications:
1. freezing the weights trained using Word2vec (preventing them from updating)
1. random initialization of the embedding layer

You are encouraged to try this question on your own.  

You might prefer to get ideas from the paper "Convolutional Neural Networks for Sentence Classification" (Kim 2014, [link](https://arxiv.org/abs/1408.5882)).

There are several implementations of the paper code in PyTorch online (see for example [this repo](https://github.com/prakashpandey9/Text-Classification-Pytorch) for a PyTorch implementation of CNN and other architectures for text classification). If you get stuck, they might provide you with a reference for your own code.