#  **Hands-on of course 3 : Embedding**

# **PART 1 : LSA Demonstrator**

In this tutorial, you will learn how to use Latent Semantic Analysis (LSA) to discover hidden topics in a set of documents in an unsupervised way.
Later, you will use the LSA values as feature vectors to classify documents with known categories.

## Imports

In [None]:
import os, ssl
if (not os.environ.get('PYTHONHTTPSVERIFY', '') and getattr(ssl, '_create_unverified_context', None)):
    ssl._create_default_https_context = ssl._create_unverified_context

In [None]:
!pip install gensim==4.1.2

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting gensim==4.1.2
  Downloading gensim-4.1.2-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (24.1 MB)
Installing collected packages: gensim
  Attempting uninstall: gensim
    Found existing installation: gensim 3.6.0
    Uninstalling gensim-3.6.0:
      Successfully uninstalled gensim-3.6.0
Successfully installed gensim-4.1.2


In [None]:
import gensim
from gensim.test.utils import get_tmpfile
print(gensim.__version__)

4.1.2


In [None]:
#import modules
import os
import pandas as pd
import numpy as np
from string import punctuation

import nltk
from nltk import WordNetLemmatizer, word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
import matplotlib.pyplot as plt

nltk.download("stopwords")
nltk.download('punkt')
nltk.download("wordnet")
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

## Preprocessing function

In [None]:
stop_words = nltk.corpus.stopwords.words("english")
stop_char = stop_words + list(punctuation)

In [None]:
def preprocessing(sentence):
    """Basic processing of a document, word by word.
    Outputs a list of processed tokens.
    """
    # Tokenization
    tokens = word_tokenize(sentence)
    # Lowercase, strip apostrophes, and drop stopwords and punctuation
    tokens = [token.lower().replace("'", "") for token in tokens if token.lower() not in stop_char]

    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(token) for token in tokens]

    # Drop tokens of two characters or fewer
    tokens = [token for token in tokens if len(token) > 2]

    return tokens

## A. Example on few sentences

In [None]:
docA = 'I believe cats are better animals than dogs, I love cats !'
docB = 'I saw this movie named cats, it was quite bad'
docC = 'The cat jumped over the gate'

docD = 'Artificial intelligence is fun'
docE = 'Business and data science / artificial intelligence combination is the key'
docF = 'Data science is the future and data is the new black gold'
docs = [docA, docB, docC, docD, docE, docF]
docs

['I believe cats are better animals than dogs, I love cats !',
 'I saw this movie named cats, it was quite bad',
 'The cat jumped over the gate',
 'Artificial intelligence is fun',
 'Business and data science / artificial intelligence combination is the key',
 'Data science is the future and data is the new black gold']

In [None]:
import re
# Remove digits from each token, then drop tokens that became empty
def strip_digit(tokens):
    tokens = [re.sub(r"\d+", "", token) for token in tokens]
    tokens = [token for token in tokens if token != ""]
    return tokens

strip_digit(["the",'code',"will", "delete", "100", "but", "not","100km"])

['the', 'code', 'will', 'delete', 'but', 'not', 'km']

### Preprocessing

**Question 1 : Complete the code in order to preprocess docs**


In [None]:

simple_clean_docs = []
for doc in docs: 
  ### START CODE HERE
  preprocessed_doc = preprocessing(doc)
  ### END CODE HERE
  simple_clean_docs.append(preprocessed_doc)

simple_corpus = [' '.join(sentence) for sentence in simple_clean_docs]
simple_corpus


### TF-IDF vectorization
To convert the text data into a document-term matrix, we are going to use `TfidfVectorizer` from the `sklearn` library.

**Question 2 : Complete the code in order to apply the TF-IDF vectorization to the simple corpus**

In [None]:
# START CODE HERE
simple_vectorizer = TfidfVectorizer() # Initialization of Tf IDF
simple_vect_corpus = simple_vectorizer.fit_transform(simple_corpus) # apply tfidf to simple corpus
# END CODE HERE
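
To make the weighting less of a black box, here is a minimal NumPy sketch of a TF-IDF computation on a toy corpus. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation of each row, which is what scikit-learn documents as its default; the function name `tfidf_matrix` and the toy documents are hypothetical and only serve the illustration.

```python
import numpy as np

def tfidf_matrix(tokenized_docs):
    """Return (vocabulary, matrix) with one L2-normalised TF-IDF row per document."""
    vocabulary = sorted({token for doc in tokenized_docs for token in doc})
    index = {term: j for j, term in enumerate(vocabulary)}

    # Raw term counts: rows are documents, columns are terms.
    counts = np.zeros((len(tokenized_docs), len(vocabulary)))
    for i, doc in enumerate(tokenized_docs):
        for token in doc:
            counts[i, index[token]] += 1

    # Smoothed inverse document frequency (assumed formula, see lead-in).
    n_docs = len(tokenized_docs)
    doc_freq = (counts > 0).sum(axis=0)
    idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

    # Weight the counts, then scale every row to unit Euclidean length.
    weighted = counts * idf
    norms = np.linalg.norm(weighted, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # keep empty documents as all-zero rows
    return vocabulary, weighted / norms

# Hypothetical, already preprocessed toy documents.
toy_docs = [["cat", "love", "cat"], ["cat", "movie", "bad"], ["data", "science"]]
vocab, matrix = tfidf_matrix(toy_docs)
print(vocab)
print(matrix.round(2))
```

A term that appears in many documents gets a lower IDF, so a rare term can outweigh a frequent one even with a smaller raw count.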

In [None]:
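# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() needs scikit-learn >= 1.0
simple_dictionary = np.array(simple_vectorizer.get_feature_names_out())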
simple_df_tfidf = pd.DataFrame(simple_vect_corpus.todense(), columns = simple_dictionary)
simple_df_tfidf.head()

### Singular Value Decomposition

**Question 3 : Apply SVD**

To perform Singular Value Decomposition, you can use `TruncatedSVD`. You must specify the number of topics (latent features) you expect; the default value is 2. Here we keep 2 components because we expect to discover 2 topics in this corpus. Later, you'll see how to choose this number.

In [None]:
# START CODE HERE
simple_svd = TruncatedSVD(n_components=2) # Initialize SVD with n_components = 2
simple_lsa = simple_svd.fit_transform(simple_df_tfidf) # Apply SVD to simple_df_tfidf
# END CODE HERE
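
As a rough illustration of what a truncated SVD produces, the sketch below decomposes a small hypothetical document-term matrix with `np.linalg.svd` and keeps the first `k` components. It assumes the usual LSA reading: the document coordinates are `U_k * S_k` and the topic-term loadings are the first `k` rows of `Vt`. `TruncatedSVD` may use a different solver, so signs and values can differ slightly from this sketch.

```python
import numpy as np

# Hypothetical 4-document x 5-term weight matrix (two groups of documents).
X = np.array([
    [1.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 0.5, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0, 0.5],
    [0.0, 0.0, 0.5, 1.0, 1.0],
])

# Thin SVD: X = U @ diag(singular_values) @ Vt
U, singular_values, Vt = np.linalg.svd(X, full_matrices=False)

k = 2
doc_topic = U[:, :k] * singular_values[:k]  # documents in the k-dimensional latent space
topic_term = Vt[:k, :]                      # loading of every term on each topic

# Best rank-k approximation of X and its largest absolute error.
rank_k_approx = doc_topic @ topic_term
print(doc_topic.round(2))
print(np.abs(X - rank_k_approx).max())
```

Projecting the original matrix onto the topic loadings (`X @ topic_term.T`) gives the same document coordinates, which is how unseen documents can be mapped into the latent space.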

In [None]:
simple_topic_encoded_df = pd.DataFrame(simple_lsa, columns=['topic_1', 'topic_2'])
simple_topic_encoded_df['corpus'] = simple_corpus
simple_topic_encoded_df

### Deep dive into dictionary

In [None]:
simple_dictionary

In [None]:
simple_encoding_matrix = pd.DataFrame(simple_svd.components_, index=['topic_1', 'topic_2'], columns=simple_dictionary).T
simple_encoding_matrix

**Question 4 : What are the top words for each topic?**

In [None]:
# START CODE HERE
simple_encoding_matrix['abs_topic_1'] = np.abs(simple_encoding_matrix['topic_1']) # GET ABSOLUTE VALUE OF COLUMN TOPIC 1
simple_encoding_matrix['abs_topic_2'] = np.abs(simple_encoding_matrix['topic_2']) # GET ABSOLUTE VALUE OF COLUMN TOPIC 2
# END CODE HERE
simple_encoding_matrix.sort_values('abs_topic_1', ascending=False)


In [None]:
simple_encoding_matrix.sort_values('abs_topic_2', ascending=False)
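
The same ranking by absolute loading can be written as a small reusable helper. This is only a sketch: `top_words_per_topic`, `toy_components` and `toy_vocabulary` are hypothetical names, and the toy loadings are made up for the illustration.

```python
import numpy as np

def top_words_per_topic(components, vocabulary, n_top=3):
    """For each topic (row), return the n_top terms with the largest absolute loading."""
    vocabulary = np.asarray(vocabulary)
    result = []
    for loadings in components:
        # Sort term indices by decreasing absolute loading.
        order = np.argsort(-np.abs(loadings))[:n_top]
        result.append(vocabulary[order].tolist())
    return result

# Made-up loadings: 2 topics x 4 terms.
toy_components = np.array([
    [0.70, 0.10, -0.60, 0.05],
    [0.05, -0.80, 0.10, 0.55],
])
toy_vocabulary = ["cat", "data", "dog", "science"]
print(top_words_per_topic(toy_components, toy_vocabulary, n_top=2))
```

With the objects of this notebook, the call would presumably look like `top_words_per_topic(simple_svd.components_, simple_dictionary)`.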

## B. On larger corpus
We will use the NLTK Gutenberg corpus, a small selection of texts from the Project Gutenberg electronic text archive, which contains some 25,000 free electronic books and is hosted at http://www.gutenberg.org/.

We will use two books:

1.   Alice in Wonderland by Lewis Carroll
2.   Hamlet by Shakespeare



In [None]:
nltk.download('gutenberg')
alice_raw = nltk.corpus.gutenberg.raw('carroll-alice.txt')
hamlet_raw = nltk.corpus.gutenberg.raw('shakespeare-hamlet.txt')


### Preprocessing

In [None]:
alice_sentences = nltk.sent_tokenize(alice_raw)

alice_sentence_clean = []
for sent in alice_sentences:
    if len(sent)>0:
        alice_sentence_clean.append(preprocessing(sent))
    
print("Number of sentences after cleaning:", len(alice_sentence_clean))
alice_sentence_clean[50]

In [None]:
hamlet_sentences = nltk.sent_tokenize(hamlet_raw)

hamlet_sentence_clean = []
for sent in hamlet_sentences:
    if len(sent)>0:
        hamlet_sentence_clean.append(preprocessing(sent))
    
print("Number of sentences after cleaning:", len(hamlet_sentence_clean))
hamlet_sentence_clean[50]

### TF-IDF vectorization

In [None]:
corpus_alice = pd.concat([pd.Series((' '.join(sentence) for sentence in alice_sentence_clean), name='sentence'), 
                          pd.Series(np.ones(len(alice_sentence_clean)), name='is_Alice')], axis=1)

corpus_hamlet = pd.concat([pd.Series((' '.join(sentence) for sentence in hamlet_sentence_clean), name='sentence'), 
                          pd.Series(np.zeros(len(hamlet_sentence_clean)), name='is_Alice')], axis=1)

corpus = pd.concat([corpus_alice, corpus_hamlet]).reset_index(drop=True)

In [None]:
corpus

**Question 5 : Apply TF IDF to sentences in corpus**

In [None]:
vectorizer = TfidfVectorizer(min_df=3)

# START CODE HERE 
vect_corpus = vectorizer.fit_transform(corpus['sentence'])

# END CODE HERE

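# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() needs scikit-learn >= 1.0
dictionary = np.array(vectorizer.get_feature_names_out())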
df_tfidf = pd.DataFrame(vect_corpus.todense(), columns = dictionary)
df_tfidf.sample(5)

### Sparsity of the matrix

**Question 6 : What are the dimensions of the TF-IDF matrix? What does each dimension represent?**

In [None]:
# START CODE HERE
df_tfidf.shape
# END CODE HERE
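
Since this section is about sparsity, it can help to measure it directly. The sketch below computes the fraction of zero entries of a dense matrix; the helper name `sparsity` and the toy matrix are hypothetical.

```python
import numpy as np

def sparsity(matrix):
    """Fraction of entries that are exactly zero."""
    matrix = np.asarray(matrix)
    return 1.0 - np.count_nonzero(matrix) / matrix.size

# Toy document-term matrix: 3 documents x 4 terms, 4 non-zero weights.
toy_matrix = np.array([
    [0.0, 0.7, 0.0, 0.7],
    [0.0, 0.0, 1.0, 0.0],
    [0.5, 0.0, 0.0, 0.0],
])
print(f"{sparsity(toy_matrix):.2%} of the entries are zero")
```

On this notebook's data, `sparsity(df_tfidf.values)` should give the corresponding figure. For the SciPy sparse matrix `vect_corpus`, the `nnz` attribute gives the number of stored entries without densifying.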


**Question 7 : Which words have, on average, the highest TF-IDF weight in the corpus?**

In [None]:
# START CODE HERE
df_tfidf_mean = df_tfidf.mean()

# END CODE HERE
df_tfidf_mean = df_tfidf_mean.sort_values(ascending=False).to_frame(name='tfidf mean')
df_tfidf_mean[:15].plot(kind='bar')
plt.show()

### Singular Value Decomposition

In [None]:
svd = TruncatedSVD(n_components=2)
lsa = svd.fit_transform(df_tfidf)

**Question 8 : Exclude sentences with length <= 15**



In [None]:
topic_encoded_df = pd.DataFrame(lsa, columns=['topic_1', 'topic_2'])
topic_encoded_df['sentence'] = corpus['sentence']
topic_encoded_df['is_Alice'] = corpus['is_Alice']

# START CODE HERE
topic_encoded_df['len'] = topic_encoded_df['sentence'].str.split().str.len() # compute length of each sentence ( number of words)
topic_encoded_df[topic_encoded_df['len']>15] # Filter on sentences with length > 15
# END CODE HERE

### Deep dive into dictionary

In [None]:
dictionary[:10]

In [None]:
encoding_matrix = pd.DataFrame(svd.components_, index=['topic_1', 'topic_2'], columns=dictionary).T
encoding_matrix

In [None]:
encoding_matrix['abs_topic_1'] = np.abs(encoding_matrix['topic_1'])
encoding_matrix['abs_topic_2'] = np.abs(encoding_matrix['topic_2'])
encoding_matrix.sort_values('abs_topic_1', ascending=False).head(10)

In [None]:
encoding_matrix.sort_values('abs_topic_2', ascending=False).head(10)

### Plot topic encoded data

We are going to plot each sentence in the space of the two latent features. Points are colored according to the `is_Alice` binary variable.

In [None]:
fig, ax = plt.subplots(figsize=(10,10))

for val in topic_encoded_df['is_Alice'].unique():
    topic_1 = topic_encoded_df[topic_encoded_df['is_Alice']==val]['topic_1'].values
    topic_2 = topic_encoded_df[topic_encoded_df['is_Alice']==val]['topic_2'].values
    color = "red" if val else "blue"
    label = "Alice in Wonderland" if val else "Hamlet"
    # Pass the color explicitly so that it is actually used
    ax.scatter(topic_1, topic_2, alpha=0.5, color=color, label=label)
    
ax.set_xlabel('First Topic')
ax.set_ylabel('Second Topic')
ax.axvline(linewidth=0.5)
ax.axhline(linewidth=0.5)
ax.legend()

## Select the best number of components for SVD

In [None]:
svd.explained_variance_ratio_

We will create a function that calculates the number of components required to pass a given explained-variance threshold.

This function takes as input a long list of explained variance ratios (computed with a number of components close to the original number of features/terms) and the threshold to reach.

In [None]:
def select_n_components(var_ratio, var_threshold):
    # Set initial variance explained so far
    total_variance = 0.0
    n_components = 0
    
    # For the explained variance of each feature:
    for explained_variance in var_ratio:
        total_variance += explained_variance
        n_components += 1
    
        if total_variance >= var_threshold:
            break
            
    # Return the number of components
    return n_components
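
The same selection can also be expressed with a cumulative sum instead of an explicit loop. The sketch below is one possible vectorised variant, not a replacement for the function above; the name `select_n_components_vectorized` and the toy ratios are hypothetical.

```python
import numpy as np

def select_n_components_vectorized(var_ratio, var_threshold):
    """Smallest number of leading components whose cumulative ratio reaches the threshold."""
    cumulative = np.cumsum(var_ratio)
    reached = cumulative >= var_threshold
    if not reached.any():
        # Threshold never reached: keep every component, like the loop version.
        return len(cumulative)
    # argmax returns the index of the first True value.
    return int(np.argmax(reached)) + 1

toy_ratios = [0.4, 0.3, 0.2, 0.1]
print(select_n_components_vectorized(toy_ratios, 0.5))
```

With these toy ratios, the first two components together explain 70% of the variance, so two components are enough to pass a 50% threshold.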

**Question 9 : Select the optimal number of components to apply SVD explaining 50% of variance**

In [None]:
large_svd = TruncatedSVD(n_components=df_tfidf.shape[1]-1)
large_lsa = large_svd.fit_transform(df_tfidf)
# START CODE HERE
threshold = 0.5
n_opt = select_n_components(large_svd.explained_variance_ratio_, threshold)
# END CODE HERE
print(f"The optimal number of components to explain {threshold*100}% of the variance is {n_opt}")

In [None]:
fig, ax = plt.subplots(figsize=(10,10))

explained_variance = pd.Series(large_svd.explained_variance_ratio_.cumsum())
explained_variance.plot()

ax.xaxis.set_ticks(np.arange(0, len(explained_variance), 100))

ax.set_xlabel('Number of Topics')
ax.set_ylabel('Percentage of explained variance')
ax.set_title('Percentage of explained variance by number of topics')

**Question 10 : Apply SVD with the optimal number of components and, for each of the first 10 topics, select the top words**

In [None]:
# START CODE HERE
optimal_svd = TruncatedSVD(n_components=n_opt)
optimal_lsa = optimal_svd.fit_transform(df_tfidf)
# END CODE HERE
optimal_encoding_matrix = pd.DataFrame(optimal_svd.components_, index=[f'topic_{i}' for i in range(n_opt)], columns=dictionary).T

In [None]:
for i in range(10):
    # START CODE HERE
    optimal_encoding_matrix[f'abs_topic_{i}'] = np.abs(optimal_encoding_matrix[f'topic_{i}']) # get Absolute value of column topic i
    top_words = optimal_encoding_matrix.sort_values(f'abs_topic_{i}', ascending=False).index[:5] # get top 5 words

    # END CODE HERE
    print(f"Top words for topic {i} are : ")
    print(top_words)
    print()
    print()

# **PART 2 : WORD2VEC**

In [None]:
corpus_alice = pd.concat([pd.Series((' '.join(sentence) for sentence in alice_sentence_clean), name='sentence'), 
                          pd.Series(np.ones(len(alice_sentence_clean)), name='is_Alice')], axis=1)

corpus_hamlet = pd.concat([pd.Series((' '.join(sentence) for sentence in hamlet_sentence_clean), name='sentence'), 
                          pd.Series(np.zeros(len(hamlet_sentence_clean)), name='is_Alice')], axis=1)

corpus = pd.concat([corpus_alice, corpus_hamlet]).reset_index(drop=True)

In [None]:
gensim_corpus = [corp.split(" ") for corp in corpus.sentence]
gensim_corpus

In [None]:
len(gensim_corpus)

In [None]:
gensim_corpus[0]

__Question 11 : Create a temporary file path and make sure the file name ends with the ".model" extension__

In [None]:
# START CODE HERE
path = get_tmpfile("word2vec_lesson.model")
# END CODE HERE

__Question 12 : Instantiate your word2vec model__

*This module implements the word2vec family of algorithms: skip-gram and CBOW models.*

**window** = Maximum distance between the current and predicted word within a sentence.

**min_count** = Ignores all words with total frequency lower than this.

**workers** = Use these many worker threads to train the model (=faster training with multicore machines).

**seed** = Seed for the random number generator.
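
To make `window` and `min_count` more concrete, here is a simplified pure-Python sketch of how (center, context) training pairs could be derived from tokenised sentences. It is an illustration only, not gensim's implementation: real implementations add details such as random window shrinking and negative sampling that are omitted here, and the function name `training_pairs` is hypothetical.

```python
from collections import Counter

def training_pairs(sentences, window=2, min_count=1):
    """List (center, context) pairs for words at most `window` positions apart."""
    # Count every token once over the whole corpus.
    counts = Counter(token for sentence in sentences for token in sentence)
    pairs = []
    for sentence in sentences:
        # Drop words that are rarer than min_count before building pairs.
        kept = [token for token in sentence if counts[token] >= min_count]
        for i, center in enumerate(kept):
            start = max(0, i - window)
            stop = min(len(kept), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    pairs.append((center, kept[j]))
    return pairs

toy_sentences = [["alice", "fell", "down", "rabbit", "hole"]]
pairs = training_pairs(toy_sentences, window=1)
print(pairs)
```

A larger `window` yields more pairs per word and tends to capture broader, more topical context, while a larger `min_count` shrinks the vocabulary.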

In [None]:
# START CODE HERE
model = gensim.models.Word2Vec(window=3, min_count=5, workers=4, seed=1) 

# END CODE HERE

__Question 13 : Define the vocabulary of your model__

Build vocabulary from a sequence of sentences. (use model.build_vocab)

In [None]:
## START CODE HERE
model.build_vocab(gensim_corpus[:3000])
# END CODE HERE

__Question 14 : Train your word2vec model__

use model.train with epochs = 50 

In [None]:
## START CODE HERE
model.train(gensim_corpus[:3000], total_examples=model.corpus_count, epochs=50)

# END CODE HERE

__Question 15 : Save your word2vec model; give the same path as in your temporary file__

In [None]:
## START CODE HERE
model.save(path)  # reuse the temporary path created in Question 11
# END CODE HERE

In [None]:
model = gensim.models.Word2Vec.load(path)

__Question 16 : Get the weight vector of a word; this is the numerical vector representation of your word__ (use model.wv)




In [None]:
## START CODE HERE
list(model.wv["lord"])

# END CODE HERE

__Question 17 : Get the 10 most similar words to "lord" and "alice"__ (use model.wv.most_similar)

In [None]:
# START CODE HERE / MOST SIMILAR WORDS TO "lord"
model.wv.most_similar("lord", topn=10)

# END CODE HERE

In [None]:
# START CODE HERE / MOST SIMILAR WORDS TO "alice"
model.wv.most_similar("alice", topn=10)
# END CODE HERE
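
`most_similar` ranks words by cosine similarity between their vectors. The sketch below reproduces that idea with NumPy on a few made-up vectors; `most_similar_cosine` and `toy_vectors` are hypothetical names, and the values are not taken from the trained model.

```python
import numpy as np

def most_similar_cosine(query, vectors, topn=2):
    """Rank the other words by cosine similarity to `query`."""
    q = vectors[query]
    q_norm = np.linalg.norm(q)
    scores = []
    for word, vec in vectors.items():
        if word == query:
            continue  # a word is trivially identical to itself
        cosine = float(q @ vec / (q_norm * np.linalg.norm(vec)))
        scores.append((word, cosine))
    # Highest similarity first.
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up 3-dimensional vectors for the illustration.
toy_vectors = {
    "king": np.array([0.9, 0.1, 0.0]),
    "queen": np.array([0.8, 0.2, 0.1]),
    "rabbit": np.array([0.0, 0.1, 0.9]),
}
ranking = most_similar_cosine("king", toy_vectors)
print(ranking)
```

Cosine similarity only depends on the angle between two vectors, so a word with a long vector is not favoured over one with a short vector.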

## Create Word Embedding of Words

---


__Question 18 : Get the embedding dict of your corpus__

In [None]:
embedding_matrix = dict()
# START CODE HERE
# embedding_matrix[word]= word2vec representation of the word
for word in model.wv.index_to_key:
    embedding_matrix[word] = list(model.wv[word]) # get numpy vector of a word (wv = word vector)

# END CODE HERE

__Question 19 : Transform it into a pandas DataFrame and look at a few lines of your embedding matrix__

In [None]:
# START CODE HERE
embedding_matrix = pd.DataFrame(embedding_matrix)
embedding_matrix.head()

# END CODE HERE
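
Building a DataFrame directly from a dict of vectors gives one column per word. If one row per word is more convenient, for example to look up a word with `.loc`, `pd.DataFrame.from_dict` with `orient="index"` is one option. The sketch below uses made-up vectors under the hypothetical name `toy_embeddings`.

```python
import numpy as np
import pandas as pd

# Made-up 3-dimensional embeddings for the illustration.
toy_embeddings = {
    "lord": np.array([0.1, 0.2, 0.3]),
    "alice": np.array([0.4, 0.5, 0.6]),
}

by_column = pd.DataFrame(toy_embeddings)                         # one column per word
by_row = pd.DataFrame.from_dict(toy_embeddings, orient="index")  # one row per word

print(by_column.shape, by_row.shape)
print(by_row.loc["alice"].tolist())
```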