<center><img src="https://github.com/insaid2018/Term-1/blob/master/Images/INSAID_Full%20Logo.png?raw=true" width="240" height="100" /></center>

**<center><h3>NLP 1 Module 2 Assignment Questions</h3></center>**

----
# **Table of Contents**
----

**1.** [**Problem Statement**](#section1)<br>
**2.** [**Import Libraries**](#section2)<br>
**3.** [**Data Preprocessing**](#section3)<br>
**4.** [**Visualize Word Vector from Anna Karenina by Leo Tolstoy** ](#section4)<br>
   - **4.1** [**Creating Word Vectors**](#section401)
   - **4.2** [**Plot t-SNE Object**](#section402)  

**5.** [**Visualize Word Vector from War and Peace by Leo Tolstoy** ](#section5)<br>
   - **5.1** [**Creating Word Vectors**](#section501)
   - **5.2** [**Plot t-SNE Object**](#section502)

----
<a id=section1></a>
# **1. Problem Statement**
----

- Perform **word vector visualization** on the following novels by the Russian writer **Leo Tolstoy**:
1. **Anna Karenina by Leo Tolstoy** 
2. **War and Peace by Leo Tolstoy**


- The **visualization** can be useful to **understand** how **Word2Vec** works and how to interpret **relations** between vectors captured from your **texts** before using them in neural networks or other machine learning algorithms.

- As training data, we will use **2 novels** by the Russian writer **Leo Tolstoy**, who is regarded as one of the greatest authors of all time.

- **T-SNE** is a non-linear dimensionality reduction technique used to visualize **multi-dimensional** data.

----
<a id=section2></a>
# **2. Import Libraries**
----

In [None]:
!pip install gensim

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm

from sklearn.manifold import TSNE
import re
import codecs
import multiprocessing
import nltk

nltk.download('punkt')
import gensim
from gensim.models import Word2Vec

----
<a id=section3></a>
# **3. Data Preprocessing**
----

<a id=section301></a>
### **3.1 Importing the Dataset**

In [None]:
# Downloading the dataset from github on Colab.
# If this command doesn't work on your local system then, download the file manually from your browser.
# To download the file from your browser, open this link: https://github.com/insaid2018/DeepLearning/raw/master/Data/data.zip
# Then place this file in the same folder as your notebook, and skip this cell.
!wget https://github.com/insaid2018/DeepLearning/raw/master/Data/data.zip

In [None]:
# Unzipping the data.zip file containing the datasets.
!unzip -qq data.zip

In [None]:
!ls

<a id=section302></a>
### **3.2 Preprocessing the Data**

In [None]:
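# Function used to preprocess the text in the data files.
def preprocess_text(text):
    # Replace every run of characters that is not a Latin letter, Cyrillic letter
    # (including ё/Ё, which lies outside the а-я range), or digit with one space.
    text = re.sub('[^a-zA-Zа-яА-ЯёЁ0-9]+', ' ', text)
    # Collapse repeated spaces and trim the ends.
    text = re.sub(' +', ' ', text)
    return text.strip()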
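
To see what this cleaning step does to a sentence, here is a minimal, self-contained sketch. The helper name `clean_sentence` and the sample string are made up for illustration; the helper mirrors the two substitutions above, with the digit range written as `0-9`.

```python
import re

def clean_sentence(text):
    # Hypothetical stand-in for preprocess_text: every run of characters that is
    # not a Latin letter, Cyrillic letter, or digit becomes a single space.
    text = re.sub('[^a-zA-Zа-яА-ЯёЁ0-9]+', ' ', text)
    # Collapse any remaining repeated spaces, then trim the ends.
    text = re.sub(' +', ' ', text)
    return text.strip()

# Illustrative sentence with punctuation that should disappear.
sample = "Все счастливые семьи похожи друг на друга, -- сказал он."
cleaned = clean_sentence(sample.lower())
print(cleaned)
```

Only letters, digits, and single spaces remain, which is the form the later training step expects.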

In [None]:
# Function used to prepare the data for creating word vectors using Gensim.
def prepare_for_w2v(filename_from, filename_to, lang):
    # The source novels are stored in the windows-1251 (Cyrillic) encoding.
    with codecs.open(filename_from, "r", encoding='windows-1251') as f:
        raw_text = f.read()
    # Write one lowercased, cleaned sentence per line, encoded as UTF-8.
    with open(filename_to, 'w', encoding='utf-8') as f:
        for sentence in nltk.sent_tokenize(raw_text, lang):
            print(preprocess_text(sentence.lower()), file=f)
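
The output file holds one cleaned sentence per line. The training step in section 3.3 reads it with Gensim's `LineSentence`, which treats each line as one sentence of whitespace-separated tokens. The sketch below imitates that round trip with the standard library only; the two sentences and the file name `train_demo.txt` are made up for illustration.

```python
import os
import tempfile

# Two already-cleaned sentences, invented for this demonstration.
sentences = [
    "все смешалось в доме облонских",
    "жена узнала что муж был в связи",
]

# Write one sentence per line, the same way prepare_for_w2v uses print(..., file=f).
path = os.path.join(tempfile.mkdtemp(), "train_demo.txt")
with open(path, "w", encoding="utf-8") as f:
    for sentence in sentences:
        print(sentence, file=f)

# Read the file back: each line becomes one list of whitespace-separated tokens.
with open(path, encoding="utf-8") as f:
    token_lists = [line.split() for line in f]

print(token_lists[0])
```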

<a id=section303></a>
### **3.3 Creating Word Vectors using Gensim**


- A word vector (word embedding) represents a word in **numerical** form such that the vector reflects its **semantic meaning** and the contexts in which it is used.

- Words that appear in similar contexts will have **similar vectors**. 
 For example, the vectors for **"Delhi", "Mumbai"**, and **"Chennai"** will be close together, while they'll be **far** away from those for **"House"** and **"Fort"**.

- **Gensim** provides the **Word2Vec** class for working with a Word2Vec model.

- Gensim also provides tools for loading **pre-trained** word embeddings in several formats and for **querying** a loaded **embedding**.
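
"Close together" is usually measured with cosine similarity. The sketch below uses hand-made 3-dimensional vectors, invented for illustration and not taken from any trained model, to show how related words score higher than unrelated ones.

```python
import numpy as np

def cosine_similarity(u, v):
    # Cosine of the angle between two vectors: 1.0 means the same direction.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical toy vectors; real Word2Vec vectors here would have 200 dimensions.
toy_vectors = {
    "delhi": np.array([0.9, 0.8, 0.1]),
    "mumbai": np.array([0.8, 0.9, 0.2]),
    "house": np.array([0.1, 0.2, 0.9]),
}

sim_cities = cosine_similarity(toy_vectors["delhi"], toy_vectors["mumbai"])
sim_mixed = cosine_similarity(toy_vectors["delhi"], toy_vectors["house"])
print("delhi vs mumbai:", round(sim_cities, 3))
print("delhi vs house:", round(sim_mixed, 3))
```

With these toy values, the two city vectors point in nearly the same direction, while the city and "house" vectors do not.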


In [None]:
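# Function used to create word2vec embedding of data using Gensim.
def train_word2vec(filename):
    # LineSentence streams the file one line (one sentence) at a time.
    data = gensim.models.word2vec.LineSentence(filename)
    # vector_size is the gensim 4.x name of the embedding dimension (called size in gensim 3.x).
    return Word2Vec(data, vector_size=200, window=5, min_count=50, workers=multiprocessing.cpu_count())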

----
<a id=section4></a>
# **4. Visualizing Word2Vec Vectors for Anna Karenina Dataset**
----

<a id=section401></a>
### **4.1 Creating Word Vectors**

**<h4>Question 1:** Prepare Anna Karenina Dataset for applying Word2Vec.</h4>

<details>

**<summary>Hint:</summary>**

- Use the **`prepare_for_w2v()`** function.

- Pass the following parameters to **`prepare_for_w2v()`** in `run_prepare_ak`:

  - **filename_from**: 'Anna Karenina by Leo Tolstoy (ru).txt'

  - **filename_to**: 'train_anna_karenina_ru.txt'

  - **lang**: 'russian'

</details>

In [None]:
def run_prepare_ak():
    # Write code to prepare Anna Karenina data for Word2Vec using prepare_for_w2v here.

In [None]:
run_prepare_ak()

**<h4>Question 2:** Train Word2Vec Model for Anna Karenina dataset.</h4>

<details>

**<summary>Hint:</summary>**

- Use **`train_word2vec()`** function and pass training data **'train_anna_karenina_ru.txt'** to the function as parameter. 

</details>

In [None]:
def run_train_word2vec_ak():
    model_ak = # Write code to train word2vec model for Anna Karenina data using train_word2vec here.

    return model_ak

In [None]:
model_ak = run_train_word2vec_ak()

In [None]:
words_ak = []
embeddings_ak = []

for word in model_ak.wv.index_to_key:  # gensim 4.x; on gensim 3.x use list(model_ak.wv.vocab)
    embeddings_ak.append(model_ak.wv[word])
    words_ak.append(word)

In [None]:
words_ak[1]

- The next cell shows what the **word embedding** looks like for the **word** at index 1 (`"каренина"` in the original run).

In [None]:
embeddings_ak[1]

<a id=section402></a>
### **4.2 Plot T-SNE object**

#### **T-Distributed Stochastic Neighbor Embedding**

- **T-distributed Stochastic Neighbor Embedding** (T-SNE) is a **non-linear dimensionality reduction** technique that helps visualize **multi-dimensional** data. 

- T-SNE **maps** **multi-dimensional** data to **2** or more **dimensions** in such a way that points which are **close** in the original **representation** stay **close** in the T-SNE **representation**, while points that are **far** apart tend to stay **far** apart. 

- In other words, **T-SNE** gives a new **data representation** that **preserves** neighbourhood **relations**.

- The **visualization** can be useful to understand how **Word2Vec** works and how to **interpret** **relations** between vectors **captured** from your texts before using them in **neural networks** or other machine learning algorithms.
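
To make "neighbourhood relation" concrete, the sketch below computes the kind of neighbour probabilities t-SNE starts from: each point assigns a Gaussian-weighted probability to every other point. This is a simplified illustration, not the full algorithm. The function name `neighbor_probabilities` is hypothetical, and a single fixed bandwidth `sigma` is assumed, whereas t-SNE tunes a bandwidth per point to match the chosen perplexity.

```python
import numpy as np

def neighbor_probabilities(points, sigma=1.0):
    # Squared Euclidean distance between every pair of rows.
    diffs = points[:, None, :] - points[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=-1)
    # Gaussian affinity: nearby points get weights near 1, distant points near 0.
    affinities = np.exp(-sq_dists / (2.0 * sigma ** 2))
    # A point is never counted as its own neighbour.
    np.fill_diagonal(affinities, 0.0)
    # Normalise each row so that row i holds the probabilities p(j | i).
    return affinities / affinities.sum(axis=1, keepdims=True)

# Two points close together and one far away (made-up coordinates).
points = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
probs = neighbor_probabilities(points)
print(np.round(probs, 3))
```

The first point places almost all of its probability on its near neighbour. t-SNE then searches for a low-dimensional layout whose neighbour probabilities match these as closely as possible.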

**<h4>Question 3:** Create TSNE object for Anna Karenina data.</h4>
 
<details>

**<summary>Hint:</summary>**

- Create a **TSNE object** using **`TSNE()`**.

- Pass the following **parameters** to it:

  - `perplexity`(smooth measure of the effective number of neighbors) = **40**.

  - `n_components` (dimension of the output space) = **2**

  - `init` (initialization of the embedding) = **'pca'**

  - `n_iter` (number of iterations) = **3500** (scikit-learn 1.5 and later name this parameter `max_iter`)

  - `random_state` equal to **32**.

</details>

In [None]:
def run_tsne_ak():
    tsne_ak = # Write code to create TSNE object here.

    return tsne_ak

In [None]:
tsne_ak = run_tsne_ak()

**<h4>Question 4:** Fit the created TSNE object.</h4>

<details>

**<summary>Hint:</summary>**

- Apply **`fit_transform`** method on `tsne_ak` object.

- Pass parameter `embeddings_ak` into the **`fit_transform`** method.

</details>

In [None]:
def fit_tsne_ak(embeddings_ak):
    embeddings_ak = # Write code to fit the tsne_ak object here.
    
    return embeddings_ak

In [None]:
embeddings_ak = fit_tsne_ak(embeddings_ak)

In [None]:
# Function to plot the TSNE values; pass `words` to annotate each point with its word.
def tsne_plot(label, embeddings, words=(), a=1):
    plt.figure(figsize=(16, 9))
    colors = cm.rainbow(np.linspace(0, 1, 1))
    x = embeddings[:,0]
    y = embeddings[:,1]
    plt.scatter(x, y, c=colors, alpha=a, label=label)
    for i, word in enumerate(words):
        plt.annotate(word, alpha=0.5, xy=(x[i], y[i]), xytext=(5, 2), 
                     textcoords='offset points', ha='right', va='bottom', size=10)
    plt.legend(loc=4)
    plt.grid(True)
    plt.savefig("{}.png".format(label), format='png', dpi=150, bbox_inches='tight')
    plt.show()

**<h4>Question 5:** Plot the TSNE plot.</h4>

<details>

**<summary>Hint:</summary>**

- Use the **`tsne_plot()`** function.

- Pass following parameters in **`tsne_plot()`**:

  - `label`: 'Anna Karenina by Leo Tolstoy'

  - `embeddings`: embeddings_ak

  - `a`: 0.2

</details>

In [None]:
def run_tsne_plot_ak():
    # Write code to plot the TSNE values of Anna Karenina data here.

In [None]:
run_tsne_plot_ak()

**<h4>Question 6:** Plot the TSNE plot with words.</h4>

<details>

**<summary>Hint:</summary>**

- Use the **`tsne_plot()`** function.

- Pass into **`tsne_plot`**:

  - `label`: 'Anna Karenina by Leo Tolstoy'

  - `embeddings`: embeddings_ak

  - `words`: words_ak

  - `a`: 0.2

</details>

In [None]:
def run_tsne_plot_words_ak():
    # Write code to plot the TSNE values of Anna Karenina data with names here.

In [None]:
run_tsne_plot_words_ak()

----
<a id=section5></a>
# **5. Visualizing Word2Vec Vectors from War and Peace Dataset**
----

<a id=section501></a>
### **5.1 Creating Word Vectors**

**<h4>Question 7:** Prepare War and Peace Dataset for applying Word2Vec.</h4>

<details>

**<summary>Hint:</summary>**

- Use the **`prepare_for_w2v()`** function.

- Pass the following parameters to **`prepare_for_w2v()`** in `run_prepare_wp`:

  - **filename_from**: 'War and Peace by Leo Tolstoy (ru).txt'

  - **filename_to**: 'train_war_and_peace_ru.txt'

  - **lang**: 'russian'

</details>

In [None]:
def run_prepare_wp():
    # Write code to prepare War and Peace data for Word2Vec using prepare_for_w2v here.

In [None]:
run_prepare_wp()

**<h4>Question 8:** Train Word2Vec Model for War and Peace dataset.</h4>

<details>

**<summary>Hint:</summary>**

- Use **`train_word2vec()`** function and pass training data **'train_war_and_peace_ru.txt'** to the function as parameter. 

</details>

In [None]:
def run_train_word2vec_wp():
    model_wp = # Write code to train word2vec model for War and Peace data using train_word2vec here.

    return model_wp

In [None]:
model_wp = run_train_word2vec_wp()

In [None]:
words_wp = []
embeddings_wp = []

for word in model_wp.wv.index_to_key:  # gensim 4.x; on gensim 3.x use list(model_wp.wv.vocab)
    embeddings_wp.append(model_wp.wv[word])
    words_wp.append(word)

In [None]:
words_wp[45]

In [None]:
embeddings_wp[45]

<a id=section502></a>
### **5.2 Plot T-SNE Object**

**<h4>Question 9:** Create TSNE object for War and Peace data.</h4>

<details>

**<summary>Hint:</summary>**

- Create a TSNE object using **`TSNE()`**.

- Pass the following parameters to it:

  - `perplexity`(smooth measure of the effective number of neighbors) = **40**   

  - `n_components` (dimension of the output space) = **2**  

  - `init` (initialization of the embedding) = **'pca'**

  - `n_iter` (number of iterations) = **3500** (scikit-learn 1.5 and later name this parameter `max_iter`)

  - `random_state` equal to **32**. 

</details>

In [None]:
def run_tsne_wp():
    tsne_wp = # Write code to create TSNE object here.

    return tsne_wp

In [None]:
tsne_wp = run_tsne_wp()

**<h4>Question 10:** Fit the created TSNE object.</h4>

<details>

**<summary>Hint:</summary>**

- Apply **`fit_transform`** method on `tsne_wp` object.

- Pass parameter **embeddings_wp** into the **`fit_transform`** method.

</details>

In [None]:
def fit_tsne_wp(embeddings_wp):
    embeddings_wp = # Write code to fit the tsne_wp object here.
    
    return embeddings_wp

In [None]:
embeddings_wp = fit_tsne_wp(embeddings_wp)

**<h4>Question 11:** Plot the TSNE plot.</h4>

<details>

**<summary>Hint:</summary>**

- Use the **`tsne_plot()`** function.

- Pass into **`tsne_plot()`**:

  - `label`: 'War and Peace by Leo Tolstoy'

  - `embeddings`: embeddings_wp

  - `a`: 0.2

</details>

In [None]:
def run_tsne_plot_wp():
    # Write code to plot the TSNE values of War and Peace data here.

In [None]:
run_tsne_plot_wp()

**<h4>Question 12:** Plot the TSNE with words.</h4>

<details>

**<summary>Hint:</summary>**

- Use the **`tsne_plot()`** function.

- Pass into **`tsne_plot`**:

  - `label`: 'War and Peace by Leo Tolstoy with words'

  - `embeddings`: embeddings_wp

  - `words`: words_wp

  - `a`: 0.15

</details>

In [None]:
def run_tsne_plot_words_wp():
    # Write code to plot the TSNE values of War and Peace data with names here.

In [None]:
run_tsne_plot_words_wp()


<a id=section504></a>
### **5.4 3-Dimensional Representation of Words using T-SNE**

In [None]:
prepare_for_w2v('War and Peace by Leo Tolstoy (ru).txt', 'train_war_and_peace_ru.txt', 'russian')
model_wp = train_word2vec('train_war_and_peace_ru.txt')

words_wp = []
embeddings_wp = []
for word in model_wp.wv.index_to_key:  # gensim 4.x; on gensim 3.x use list(model_wp.wv.vocab)
    embeddings_wp.append(model_wp.wv[word])
    words_wp.append(word)
    
tsne_wp_3d = TSNE(perplexity=30, n_components=3, init='pca', n_iter=3500, random_state=12)
embeddings_wp_3d = tsne_wp_3d.fit_transform(embeddings_wp)

In [None]:
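from mpl_toolkits.mplot3d import Axes3D  # registers the '3d' projection on older Matplotlib versions


def tsne_plot_3d(title, label, embeddings, a=1):
    fig = plt.figure(figsize=(10, 7))
    # Create the 3D axes through the figure so that they are attached to it.
    ax = fig.add_subplot(111, projection='3d')
    colors = cm.rainbow(np.linspace(0, 1, 1))
    # Call scatter on the 3D axes: plt.scatter would read the third array as marker sizes.
    ax.scatter(embeddings[:, 0], embeddings[:, 1], embeddings[:, 2], c=colors, alpha=a, label=label)
    ax.legend(loc=4)
    ax.set_title(title)
    plt.show()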


tsne_plot_3d('Visualizing Embeddings using t-SNE', 'War and Peace', embeddings_wp_3d, a=0.5)