**References**


https://www.datacamp.com/community/tutorials/stemming-lemmatization-python

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html

In [None]:
import numpy as np
import pandas as pd

In [None]:
from matplotlib import pyplot as plt
%matplotlib inline
import seaborn as sns

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

*****

In [None]:
songs_df = pd.read_csv('./songs_train.csv', header=0, low_memory=False)
songs_df.shape

In [None]:
songs_df.columns

In [None]:
# the first row

songs_df.iloc[0,:]

In [None]:
# data type of each variable (column)

songs_df.dtypes

***

##  Explore categorical/text variables

In [None]:
songs_df.dtypes.value_counts()

In [None]:
songs_df.dtypes[songs_df.dtypes==object]

In [None]:
songs_df[songs_df.dtypes[songs_df.dtypes==object].index].head()

In [None]:
text_variables = ['artist_mbtags','terms','location','title']
songs_df[text_variables].head(20)

In [None]:
songs_df[text_variables].nunique()

In [None]:
# number of missing values in each column

songs_df[text_variables].isnull().sum()

****

## Feature engineering of `title` variable
Our goal here is to extract the most important keywords from the `title` column and use them to represent each title as a feature vector instead of plain text.

In [None]:
title_corpus = songs_df.title.tolist()
title_corpus[0:5]

### 0) Visual exploration of corpus

Visual exploration will help us detect noise in the corpus (so that we clean it in the next step).

We will be looking for two types of noise:

- non-word characters; we will do this by visualizing the distribution of characters in corpus.
- non-English words; we will do this via simple visual inspection.

#### QUESTIONS

Execute the cells in this section then answer the following questions.

1. Are there any **strange** characters in this corpus, i.e. that are **not** English word letters (a-z), punctuation, or numbers?
2. If you answered yes to the above question, are any of these strange characters **very frequent** ? Justify your answer. *Hint*: you can use `fdist1` to determine the frequency of any character.
(If yes, then you will need to make sure these characters are removed in the next step ...)
3. Based on the distribution plot below, the top 20 characters cover what fraction of all character occurrences in this corpus?
4. Based on simple visual inspection of the corpus, do you notice any non-English words? If yes, are there only a few or a lot? (If yes, then we would need a way of removing them in the next step ...)
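
As a minimal sketch of the kind of counting these questions rely on, the block below uses only the standard library (`collections.Counter` rather than NLTK's `FreqDist`) on a small made-up corpus. The names `toy_titles` and `top_k_coverage` are hypothetical and are not part of the notebook; the same idea should carry over to `fdist1`.

```python
from collections import Counter

# Hypothetical toy corpus standing in for `title_corpus`
toy_titles = ["Love Me Do", "Señorita", "99 Problems", "Hey Jude!"]

# Count every character across all titles (newline-joined, as in the notebook)
char_counts = Counter("\n".join(toy_titles))

# Characters outside the ASCII range are candidates for "strange" characters
non_ascii = {ch: n for ch, n in char_counts.items() if ord(ch) > 127}

def top_k_coverage(counts, k):
    """Fraction of all character occurrences covered by the k most frequent characters."""
    total = sum(counts.values())
    return sum(n for _, n in counts.most_common(k)) / total

print(non_ascii)
print(round(top_k_coverage(char_counts, 5), 3))
```

Comparing the count of a strange character with the total number of characters gives a quick sense of whether it is frequent enough to matter.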

In [None]:
from nltk import FreqDist

In [None]:
# convert the corpus from list of strings to a single string (i.e. sequence of characters)

corpus_char_list = "\n".join(title_corpus)
type(corpus_char_list),len(corpus_char_list)

In [None]:
# Create an instance of FreqDist class and then use it to count the number of occurrences of each character in the corpus

fdist1 = FreqDist([c for c in corpus_char_list])
type(fdist1)

In [None]:
# The FreqDist data type is in fact similar to a dictionary

fdist1

In [None]:
print("Total number of characters in this corpus:",fdist1.N())

print("Number of DISTINCT characters in this corpus:",fdist1.B())

In [None]:
print('List of distinct characters in the corpus, sorted by their Unicode values:\n')
print(sorted(list(fdist1.keys())))

In [None]:
# We can now obtain the number of occurrences of any character using the fdist1 object

print("The number of occurrences of the character 'z':", fdist1['z'])

for x in ['a','b','c','/','[',';','-','?','!','~']:
    print("The number of occurrences of the character '%c': %d" % (x, fdist1[x]))

In [None]:
#?fdist1.most_common

In [None]:
# The most_common() method sorts the characters in decreasing order of frequency

print('The 10 most frequent characters in the corpus and their corresponding number of occurrences:')
fdist1.most_common(10)

In [None]:
#?fdist1.plot

In [None]:
# The plot() method of the FreqDist class plots the distribution of the most frequent characters
fig=fdist1.plot(20,cumulative=True)

*****

### 1) Text cleaning

We are now going to apply the following sequence of text cleaning operations to **every** document in the corpus. 

- a) remove non-word useless characters (if there are any)
- b) convert to lowercase
- c) tokenize (convert sequence of characters to sequence of words)
- d) remove stop words
- e) remove useless words (too short or too long words)
- f) stemming

We will do most of these tasks using functions from the ``NLTK`` library.

### a)b)c) Remove useless characters & convert to lowercase & tokenize

**QUESTIONS**

Execute the cells in this section then answer the following questions:

1. What does the instance of RegexpTokenizer class do; which characters does it keep?
2. Modify the value of `tokenization_regexp` so that it only keeps a-z characters then re-execute the cells in this section.
3. What is the smallest and largest number of words in a song title?
4. How many song titles contain the word 'love'?
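
To experiment with a tokenization pattern before plugging it into `RegexpTokenizer`, it can help to try it with Python's built-in `re` module first. The sketch below assumes that `RegexpTokenizer`, in its default mode, returns the substrings matched by the pattern, so `re.findall` should behave similarly; treat it as an approximation rather than a description of NLTK's internals. The names `pattern`, `sample` and `tokens` are illustrative only.

```python
import re

# Same pattern as `tokenization_regexp` in this notebook, written as a raw string
pattern = re.compile(r"[^_\W]+")

sample = "It's still early - it is just 3 o'clock now!! :) (_) "

# Lowercase first, then collect every maximal run of characters matched by the pattern
tokens = pattern.findall(sample.lower())
print(tokens)
```

Changing the pattern and re-running this cell shows directly which characters survive tokenization.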

In [None]:
from nltk.tokenize import RegexpTokenizer

In [None]:
# create an instance of the RegexpTokenizer class

tokenization_regexp = r'[^_\W]+'  # raw string so that \W is not treated as a string escape
tokenizer = RegexpTokenizer(tokenization_regexp)

In [None]:
# tokenize an example document by calling the tokenize() method of this class
tokenizer.tokenize("It's still early - it is just 3 o'clock now!! :) (_) ")

In [None]:
# apply tokenizer to each document (song title) in our corpus
title_corpus_words = [tokenizer.tokenize(doc.lower()) for doc in title_corpus]
type(title_corpus_words),len(title_corpus_words)

In [None]:
# the first 5 documents in the tokenized corpus
title_corpus_words[0:5]

We will now visualize the distribution of the number of words per song title. This is just to verify the results of tokenization ...

In [None]:
# create list containing the number of words in each document
df = pd.Series([len(doc) for doc in title_corpus_words])

In [None]:
df.describe()

In [None]:
# Cumulative frequency distribution of number of words in each song title

df.plot.hist(title='song title length (in number of words)', cumulative=True)
fig=plt.xlabel('number of words')

In [None]:
# Display song titles that contain only one word

L = [doc[0] for doc in title_corpus_words if len(doc)==1]
print(len(L))
print(L[0:20])

In [None]:
# Display documents that contain 20 or more words

L = [' '.join(doc) for doc in title_corpus_words if len(doc)>=20]
print(len(L))
L

### d) Remove stopwords
We will use NLTK's default list of stop words for the English language.

In [None]:
# Load list of stopwords from NLTK library
from nltk.corpus import stopwords

In [None]:
# You might need to download the set of stop words the first time
import nltk
nltk.download('stopwords')

In [None]:
# Load stop words
stop_words_en = stopwords.words('english')

In [None]:
print(type(stop_words_en))
print(len(stop_words_en))

In [None]:
# Show the first 10 stop words
stop_words_en[:10]

In [None]:
# Remove stop words from our corpus
title_corpus_words_2 = [[word for word in doc  if word not in stop_words_en] for doc in title_corpus_words]
type(title_corpus_words_2),len(title_corpus_words_2)

### e) remove useless words based on word length

- Very short words are usually not very meaningful.
- Very long words might be either spelling mistakes, or elongated words.

**QUESTIONS**

1. Remove all words that contain <= 2 characters or >= 15 characters from all titles in `title_corpus_words_2`. Put the result in a new list called `title_corpus_words_3`.
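
The general pattern for this kind of filtering is a nested list comprehension that keeps one (possibly empty) list per document, so the corpus length does not change. The sketch below runs on a made-up corpus; `toy_docs`, `filter_by_length`, `min_len` and `max_len` are hypothetical names, and the thresholds are only illustrative and should be adapted to the ones the question asks for.

```python
# Hypothetical toy corpus of tokenized titles
toy_docs = [
    ["i", "love", "rock", "n", "roll"],
    ["supercalifragilisticexpialidocious", "song"],
    ["go"],
]

def filter_by_length(docs, min_len, max_len):
    """Keep only words with min_len <= len(word) <= max_len, one list per document."""
    return [[w for w in doc if min_len <= len(w) <= max_len] for doc in docs]

# Illustrative thresholds: keep words of 3 to 14 characters
filtered = filter_by_length(toy_docs, min_len=3, max_len=14)
print(filtered)
```

Note that a title can end up with no words at all after filtering; it is kept as an empty list so that documents stay aligned with the rows of `songs_df`.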

In [None]:
# create set of all distinct words in corpus
distinct_words_set = {word for doc in title_corpus_words_2 for word in doc}
type(distinct_words_set),len(distinct_words_set)

In [None]:
words_len_df = pd.Series([len(word) for word in distinct_words_set], index=list(distinct_words_set))

In [None]:
# Summary statistics of word length
words_len_df.describe()

In [None]:
# Histogram of word length
fig = words_len_df.plot.hist(title="word length")

In [None]:
fig = words_len_df.plot.hist(title="word length", cumulative=True)

In [None]:
# how many words have length <= 2
words_len_df[words_len_df<=2].count()

In [None]:
# which words have length <= 2
print(sorted(words_len_df[words_len_df<=2].index.tolist()))

In [None]:
# how many words contain 15 or more characters
words_len_df[words_len_df>=15].count()

In [None]:
print(sorted(words_len_df[words_len_df>=15].index.tolist()))

In [None]:
# remove words that contain <= 2 characters or >= 15

title_corpus_words_3 = ...


In [None]:
# verify type and length

assert type(title_corpus_words_3)==list and len(title_corpus_words_3)==len(title_corpus_words_2) and type(title_corpus_words_3[0])==list

### f) Stemming
- All stemming methods are heuristic; there is no perfect stemming method; they all make mistakes.
- We will try the famous Porter method from the `NLTK` library.

#### QUESTIONS

Execute the cells below then answer the following questions:

1. This stemming method reduces/shrinks the vocabulary by how much (fraction)?
2. Give 3 example errors of this stemming method; where an error is when two unrelated words are mapped to the same stem word.
3. Do you think there are too many stemming errors, so that we should NOT use stemming at all? Explain.
4. Can you suggest a simple way (based on simple string operations) to reduce the amount of errors of this stemming method?
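
To make the notions of "vocabulary reduction" and "stemming error" concrete, here is a small self-contained sketch. It deliberately does NOT use the Porter stemmer: `naive_stem` is a hypothetical toy suffix-stripper written only for this illustration, and `vocab` is a made-up vocabulary. The grouping and the reduction formula are the parts that should transfer to the real stemmer.

```python
from collections import defaultdict

def naive_stem(word):
    """Toy stemmer (NOT Porter): strip one common suffix if at least 3 letters remain."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical vocabulary before stemming
vocab = {"love", "loves", "loved", "sing", "singing", "singed", "news", "new"}

# Group the original words by the stem they are mapped to
groups = defaultdict(list)
for w in sorted(vocab):
    groups[naive_stem(w)].append(w)

# Fraction by which stemming shrinks the vocabulary
reduction = 1 - len(groups) / len(vocab)

# Stems shared by several words; some of these groups mix unrelated words
collisions = {stem: ws for stem, ws in groups.items() if len(ws) > 1}
print(reduction)
print(collisions)
```

In this toy example, "singed" ends up with "sing" and "news" with "new", which is the kind of error question 2 asks about.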

In [None]:
from nltk.stem import PorterStemmer

In [None]:
# create instance of class
stemmer = PorterStemmer()

In [None]:
title_corpus_words_4 = [[stemmer.stem(word) for word in doc] for doc in title_corpus_words_3]
type(title_corpus_words_4),len(title_corpus_words_4)

In [None]:
title_corpus_words_4[0:5]

Analyze the results of the stemming method

In [None]:
# Number of distinct words BEFORE stemming
len({word for doc in title_corpus_words_3 for word in doc})

In [None]:
# Number of distinct words AFTER stemming
len({word for doc in title_corpus_words_4 for word in doc})

In [None]:
# Create dictionary containing each word and its corresponding stemmed word
distinct_words = {word for doc in title_corpus_words_3 for word in doc} # set of distinct words in corpus BEFORE stemming
from collections import defaultdict
d1 = defaultdict(list)
for w in distinct_words:
    d1[stemmer.stem(w)].append(w)
len(d1)

In [None]:
# key = stemmed word, value = list of all words mapped to this stemmed word
d1

In [None]:
# Display words which are mapped to the SAME word
for k,v in d1.items():
    if len(v)>1:
        print(k,v)

*****

#### Prepare clean corpus for BOW in Scikit-learn
The BOW module in the ``Scikit-Learn`` library requires the input documents as a list of strings, not as a list of word lists. Therefore we are going to concatenate the words of each document in our cleaned tokenized corpus.

In [None]:
# concatenate the words in the cleaned corpus
title_corpus_clean = [' '.join(doc) for doc in title_corpus_words_4]
len(title_corpus_clean)

In [None]:
title_corpus[0:5]

In [None]:
title_corpus_clean[0:5]

*****

### 2) Vector representation using boolean BOW

**QUESTIONS**

1. Create an instance of the `CountVectorizer` class, with min_df=3 and max_df=0.9. Put this instance in a variable called `title_vec`. Make sure it corresponds to **boolean** bag-of-words.
2. Fit this instance on the text documents in `title_corpus_clean`. How many words are there in the vocabulary?
3. Create the document-term matrix for `title_corpus_clean`, and put the result in a variable called `title_dtm`. What is the size of this matrix? 
4. What are the minimum and maximum values of this matrix?  Does this make sense?
5. How many rows of this matrix contain all zeros? How many rows contain only one non-zero value?  Hint: call the sum() method with argument axis=1, then convert the result to an array ...
6. Calculate the cosine similarity between all pairs of rows in `title_dtm` by calling the `cosine_similarity` function from the `sklearn.metrics.pairwise` module (this function has already been imported for you above).
7. Print all pairs of song titles that have cosine similarity above 0.9.
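
Before working with scikit-learn, it may help to see what a boolean bag-of-words and cosine similarity compute on a tiny example. The sketch below is hand-rolled in pure Python and is not scikit-learn's implementation; `toy_clean`, `doc_sets`, `dtm`, `cosine` and the threshold `min_df = 2` are assumptions made only for this illustration.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical cleaned titles standing in for `title_corpus_clean`
toy_clean = ["love song", "love song", "sad love song", "rock roll", "love"]

# Document frequency: number of documents containing each word at least once
doc_sets = [set(doc.split()) for doc in toy_clean]
doc_freq = Counter(word for s in doc_sets for word in s)

# Keep words that appear in at least `min_df` documents (illustrative threshold)
min_df = 2
vocab = sorted(w for w, n in doc_freq.items() if n >= min_df)

# Boolean bag-of-words: 1 if the word occurs in the document, else 0
dtm = [[1 if w in s else 0 for w in vocab] for s in doc_sets]

def cosine(u, v):
    """Cosine similarity of two vectors; defined here as 0.0 when either is all zeros."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)

# Index pairs of documents whose rows are nearly identical
similar_pairs = [
    (i, j)
    for i, j in combinations(range(len(dtm)), 2)
    if cosine(dtm[i], dtm[j]) > 0.9
]
print(vocab)
print(dtm)
print(similar_pairs)
```

Two effects are visible here: pruning rare words can leave a document with an all-zero row ("rock roll"), and it can make different titles indistinguishable ("sad love song" and "love song").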

*****

## Optional Bonus Questions

Answer the question that seems easier to you.

### Question 1

1. Cluster the song titles using the feature vectors in `title_dtm`. You are free to use any clustering method (for example kmeans in sklearn ...). Also, you will need to select an appropriate number of clusters.
2. Print the song titles in the largest cluster.

### Question 2

Copy this file into a new .ipynb file and then repeat all the work for the `terms` column instead of the `title` column.