<div style="color:white; background-color: black; padding: 20px; border-radius:8px; font-size:26px"><b style="font-weight: 700;"><center>LEARNING NLP </center></b></div>

<div style="color:white; background-color: black; padding: 20px; border-radius:8px; font-size:20px"><b style="font-weight: 700;"><center> Text Representation </center></b></div>

<div style="background-color:  #eddcd2; padding: 10px;">

### Experimental Data

</div>


Tweets about Covid, collected from [here](https://www.kaggle.com/datasets/lakshmi25npathi/coronavirus-tweets-dataset)

Game of Thrones, collected from [here](https://www.kaggle.com/datasets/khulasasndh/game-of-thrones-books)

fastText English Word Vectors from [here](https://www.kaggle.com/code/nkitgupta/text-representations/input)


### **Common Terms in NLP:**
- **Corpus**: the entire collection of texts used for analysis or modeling (for machine learning tasks, or to train NLP models).
    - *Example*: The entire collection of Wikipedia articles in English, a collection of legal documents, or a set of medical research papers can all be considered corpora.
- **Vocabulary** ($V$): the set of unique words (or tokens) present in a corpus. It represents the entire lexical repertoire of a language or domain.
    - *Example*: In a corpus of scientific articles, the vocabulary would include terms like "hypothesis," "experiment," "data," etc.
- **Document**: a single unit of text within the corpus. It can be an article, a paragraph, a sentence, or any chunk of text that is treated as a separate entity for analysis.
    - *Example*: In a corpus of news articles, each individual news article is considered a document.
- **Word**: the basic linguistic unit, which can be a single word or a tokenized element of a text. Tokens can be words, punctuation marks, or other meaningful units.
    - *Example*: In the sentence "The quick brown fox jumps over the lazy dog," the words are "The," "quick," "brown," "fox," "jumps," "over," "the," "lazy," and "dog."
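
The terms above can be made concrete with a small sketch. It assumes a toy two-document corpus and a naive lowercase, whitespace-based tokenizer (real pipelines usually tokenize more carefully); the variable names are illustrative only.

```python
# Toy corpus: a list of documents (each document is one string).
corpus = [
    "The quick brown fox jumps over the lazy dog",
    "The lazy dog sleeps",
]

# Tokenize each document with a naive lowercase + whitespace split.
tokenized_docs = [doc.lower().split() for doc in corpus]

# The vocabulary V is the set of unique tokens across the whole corpus.
vocabulary = sorted({token for doc in tokenized_docs for token in doc})

print("Documents in corpus:", len(corpus))
print("Tokens per document:", [len(doc) for doc in tokenized_docs])
print("Vocabulary size |V|:", len(vocabulary))
print("Vocabulary:", vocabulary)
```

Note that lowercasing merges "The" and "the" into one vocabulary entry; whether that is desirable depends on the task.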

---

**Core Intuition**

In the context of Natural Language Processing (NLP), "core intuition" refers to the fundamental understanding or insight that underlies the way humans process and understand language. It is the foundational principle that guides the development of NLP models and algorithms.

Here are some core intuitions in NLP:

- *Semantics and Syntax*:
    - Understanding that language has both semantic (meaning) and syntactic (grammar and structure) components. This intuition guides the development of models that can grasp the meaning of sentences and understand their grammatical structure.

- *Context Matters*:
    - Recognizing that the meaning of a word or phrase can change based on the surrounding context. For example, the word "bank" has different meanings in "river bank" and "financial bank."

- *Ambiguity*:
    - Acknowledging that language can be ambiguous, with words or phrases having multiple possible interpretations. NLP models need to be able to handle this ambiguity.

- *Sequential Information*:
    - Appreciating that the order of words in a sentence carries important information. This is particularly crucial for tasks like machine translation or sentiment analysis.

- *Domain Specificity*:
    - Understanding that language can be highly specialized and context-dependent, especially in technical or specialized fields. NLP models may need to adapt to different domains.

- *Pragmatics and Contextual Inference*:
    - Realizing that communication often relies on pragmatic aspects, such as implicatures and presuppositions. NLP models may need to infer information beyond the explicit content of a sentence.

- *Linguistic Variation*:
    - Recognizing that language use can vary widely based on factors like dialect, socio-cultural context, and individual idiosyncrasies. NLP models need to be robust to these variations.

- *Figurative Language*:
    - Understanding that language can include metaphors, similes, and other figurative expressions. NLP models may need to recognize and interpret these non-literal uses of language.

These core intuitions serve as the basis for developing NLP models that can perform tasks like sentiment analysis, text classification, machine translation, question answering, and more. They guide researchers and practitioners in designing algorithms that can effectively process and understand human language.

<div class="list-group" id="list-tab" role="tablist">

## TABLE OF CONTENTS

- <a href='#1'>1. IMPORTING LIBRARIES</a>
- <a href='#2'>2. LOADING CLEANED DATA</a>
- <a href='#3'>3. DATA SPLITTING</a>
- <a href='#4'>4. TEXT REPRESENTATION</a>
    - <a href='#4-1'>4.1 BASIC TEXT REPRESENTATION (VECTORIZATION)</a>
        - <a href='#4-1-1'>4.1.1 One Hot Encoding</a>
        - <a href='#4-1-2'>4.1.2 Bag of Words</a>
        - <a href='#4-1-3'>4.1.3 Bag of N-grams</a>
        - <a href='#4-1-4'>4.1.4 Term frequency - Inverse document frequency </a>
    - <a href='#4-2'>4.2 DISTRIBUTED TEXT REPRESENTATION (VECTORIZATION)</a>
        - <a href='#4-2-1'>4.2.1 Word2Vec Word Embedding </a>
        - <a href='#4-2-2'>4.2.2 GloVe Word Embeddings </a>
        - <a href='#4-2-3'>4.2.3 FastText Word Embeddings </a>
        - <a href='#4-2-4'>4.2.4 Visualizing Embeddings </a>

</div>

# <a id='1'>1. Importing Libraries </a>


In [5]:
import pandas as pd
import numpy as np                          # for working with arrays and matrices

pd.set_option('display.max_rows', 500)      # Set max number of rows displayed
pd.set_option('display.max_columns', 500)   # Set max number of columns displayed
pd.set_option('display.width', 1000)

# Regex pkg
import re

# String and time module
import string, time

# Visualizations
import matplotlib.pyplot as plt             # for creating plots
from matplotlib.colors import ListedColormap
%matplotlib inline
import seaborn as sns
import plotly

# Split pkgs
from sklearn.model_selection import train_test_split

from scipy.stats import skew
import statsmodels.api as sm

# Save and load pkgs
from pickle import dump, load

import warnings
warnings.filterwarnings("ignore")
warnings.filterwarnings(action='ignore', category=DeprecationWarning)

# <a id='2'>2. Loading cleaned dataset </a>

In [2]:
import pickle

with open('cleaned_tweets_data.pkl', 'rb') as file:
    df = pickle.load(file)

# <a id='3'>3. Train-Test Split of Data </a>

In [3]:
# SPLIT DATA
X_train, X_test, Y_train, Y_test = train_test_split(df.drop('Sentiment', axis = 1),
                                                    df['Sentiment'],
                                                    train_size=0.8,                            # <--- 80% train and 20% test
                                                    random_state=42)

print(X_train.shape)
print(Y_train.shape)
print(X_test.shape)
print(Y_test.shape)

(32925, 6)
(32925,)
(8232, 6)
(8232,)


### **Feature Extraction from Text**

**Feature Extraction from Text** involves **converting raw text data into a numerical format** that can be used as input for machine learning models. This is a crucial step in natural language processing (NLP) tasks, as most machine learning algorithms require numerical input. The extracted features are then used as input for machine learning models to perform tasks like classification, regression, clustering, and more. The choice of feature extraction technique depends on the specific NLP task, the nature of the data, and the characteristics of the text corpus.

Here are **some common techniques** for feature extraction from text:

- *Bag of Words (BoW)*:
    - BoW represents text data as a collection of words, disregarding grammar and word order. It creates a vocabulary of all unique words in a corpus and counts the frequency of each word in a document. Each document is then represented as a vector where each element corresponds to the frequency of a word in the vocabulary.
<br>

- *Term Frequency-Inverse Document Frequency (TF-IDF)*:
    - TF-IDF is a numerical statistic that reflects the importance of a word in a document relative to a collection of documents (corpus). It takes into account both the frequency of a word in a document (Term Frequency) and the rarity of the word in the entire corpus (Inverse Document Frequency).
<br>

- *Word Embeddings*:
    - Word embeddings are dense, low-dimensional vectors that represent words in a continuous vector space. Techniques like Word2Vec, GloVe, and FastText learn these embeddings by considering the context in which words appear. They capture semantic relationships between words and are effective in capturing word similarity and analogy.
<br>

- *Word Counts and Character Counts*:
    - Simple features like the total number of words in a document, average word length, or frequency of specific characters can also be used as features.
<br>

- *N-grams*:
    - N-grams are sequences of $N$ consecutive words. For example, Bi-grams consist of pairs of adjacent words. By considering sequences of words, N-grams can capture more context compared to BoW.
<br>

- *Part-of-Speech (POS) Tagging*:
    - POS tagging assigns a grammatical label to each word in a sentence (e.g., noun, verb, adjective). These tags can be used as features to capture linguistic information.
<br>

- *Sentiment Scores*:
    - Sentiment analysis tools can be used to assign sentiment scores to text, indicating the sentiment (positive, negative, neutral) expressed in the text.
<br>

- *Topic Modeling*:
    - Topic modeling techniques like Latent Dirichlet Allocation (LDA) can be used to extract topics from a collection of documents. The distribution of topics in a document can be used as features.
<br>

- *Syntactic Features*:
    - Features related to sentence structure, such as the presence of specific grammatical constructs (e.g., passive voice, conditional clauses), can be used.
<br>

- *Dependency Parsing*:
    - Features based on syntactic relationships between words in a sentence can be used to capture structural information.
<br>

- *Lexical Diversity Measures*:
    - Metrics like type-token ratio or TTR (ratio of unique words to total words) can be used to measure the richness and diversity of vocabulary in a document.
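
A few of the simpler features in the list above (word and character counts, type-token ratio, and N-grams) can be sketched in plain Python. This is a minimal illustration that assumes naive lowercase, whitespace-based tokenization; the function name `simple_text_features` is made up for this example.

```python
def simple_text_features(text, n=2):
    """Return a few hand-crafted features for one document."""
    tokens = text.lower().split()
    # N-grams: every run of n consecutive tokens.
    ngrams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return {
        "word_count": len(tokens),
        "char_count": len(text),
        "avg_word_length": sum(len(t) for t in tokens) / len(tokens) if tokens else 0.0,
        # Type-token ratio: unique tokens divided by total tokens.
        "type_token_ratio": len(set(tokens)) / len(tokens) if tokens else 0.0,
        "ngrams": ngrams,
    }

features = simple_text_features("the cat sat on the mat")
for name, value in features.items():
    print(f"{name}: {value}")
```

Features like these are cheap to compute and can be concatenated with BoW or TF-IDF vectors, although they carry far less information on their own.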



# <a id='4'> WORD EMBEDDINGS </a>

In NLP, **Word Embedding** is a term used for the representation of words for text analysis, typically in the form of **a real-valued vector that encodes the meaning of the word such that the words that are closer in the vector space are expected to be similar in meaning**.

A **Word Embedding**, or word vector, is defined as a **numeric vector input that allows words with similar meanings to have similar representations**. It can approximate meaning and represent a word in a lower-dimensional space.

The idea behind word embeddings is not to assign an arbitrary vector to each word, but to assign vectors so that words with similar meanings have similar vectors and therefore lie close to each other in the vector space.
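
"Closer in the vector space" is commonly measured with cosine similarity. The sketch below uses hypothetical 3-dimensional vectors invented purely for illustration (real embeddings are learned from data and typically have tens to hundreds of dimensions); `cosine_similarity` and `toy_embeddings` are names chosen for this example.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, NOT taken from any trained model.
toy_embeddings = {
    "king":  [0.90, 0.80, 0.10],
    "queen": [0.85, 0.90, 0.15],
    "apple": [0.10, 0.20, 0.95],
}

sim_king_queen = cosine_similarity(toy_embeddings["king"], toy_embeddings["queen"])
sim_king_apple = cosine_similarity(toy_embeddings["king"], toy_embeddings["apple"])

print(f"king vs queen: {sim_king_queen:.3f}")
print(f"king vs apple: {sim_king_apple:.3f}")
```

With these made-up values, "king" and "queen" point in almost the same direction, while "king" and "apple" do not, which is the behavior a good embedding is expected to show for related and unrelated words.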

Taxonomy of Word Embedding:

![Models_proscons](figures/Word_embedding.png)


## <a id='4-1'>4.1 BASIC TEXT REPRESENTATION</a>

### <a id='4-1-1'>4.1.1 One Hot Encoding</a>

**One Hot Encoding** is a technique used in Natural Language Processing (NLP) to **convert categorical data, such as words or labels, into a numerical format** that can be used as input for machine learning algorithms. It's particularly useful when dealing with categorical features in text data.

Here's how One Hot Encoding works in the context of NLP:

- *Word Tokenization*:
    - The first step in NLP is to break down text data into individual words or tokens. Each unique word becomes a category that we want to represent numerically.

- *Creating a <b>Vocabulary</b>*:
    - Next, we create a vocabulary, which is a set of all unique words in the corpus. This vocabulary will serve as the basis for One Hot Encoding.

- *Assigning Indices*:
    - Each word in the vocabulary is assigned a unique index. For example, the first word might be assigned index 0, the second word index 1, and so on.

- *One Hot Encoding*:
    - For each word in a sentence or document, a vector of all zeros is created, with a length equal to the size of the vocabulary. The element at the index corresponding to the word's index in the vocabulary is set to 1, while all other elements remain 0. This way, each word is represented as a binary vector.

    - For example, if we have a vocabulary with 10 words, and the word "apple" has an index of 3, its One Hot Encoding vector will look like [0, 0, 0, 1, 0, 0, 0, 0, 0, 0].

    - This encoding gives each word a unique representation, and the vectors are mutually orthogonal: the dot product of the vectors of any two different words is 0, so every pair of distinct words appears equally dissimilar.

- *Handling Out-of-Vocabulary Words*:
    - If a word is encountered in the future that is not in the original vocabulary, it can be handled by assigning a special "Out of Vocabulary" (OOV) token or by extending the vocabulary and re-encoding the data.

- *Sparse Matrix Representation*:
    - In practice, One Hot Encoding results in a sparse matrix where most of the values are zero. This is because most documents only contain a small subset of the entire vocabulary.
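
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration assuming whitespace tokenization and a two-document toy corpus; the helper names `build_vocabulary` and `one_hot` are invented for this example.

```python
def build_vocabulary(documents):
    """Map each unique token to an integer index (sorted for reproducibility)."""
    tokens = sorted({tok for doc in documents for tok in doc.lower().split()})
    return {tok: idx for idx, tok in enumerate(tokens)}

def one_hot(token, vocab):
    """Return a vector of zeros with a single 1 at the token's index."""
    vector = [0] * len(vocab)
    vector[vocab[token]] = 1
    return vector

docs = ["people watch campusx", "people write comment"]
vocab = build_vocabulary(docs)

# A document becomes a sequence of one-hot vectors, one per token.
encoded_doc = [one_hot(tok, vocab) for tok in docs[0].split()]

print(vocab)
for tok, vec in zip(docs[0].split(), encoded_doc):
    print(f"{tok:>8} -> {vec}")

# Vectors of two different words never share a non-zero position.
dot = sum(a * b for a, b in zip(one_hot("people", vocab), one_hot("watch", vocab)))
print("dot(people, watch) =", dot)
```

Because a document is encoded as one vector per token, its representation grows with the document length and with the vocabulary size.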

<u>Notes: </u>
One Hot Encoding is a straightforward and widely used technique for representing categorical data in NLP tasks like text classification, sentiment analysis, and more.
- **Advantages**:
    - Intuitive, easy to implement
- **Disadvantages**:
    - *Sparsity*: each vector has as many dimensions as the vocabulary and is almost entirely zeros. For very large vocabularies this is inefficient in memory and computation and can contribute to overfitting. In such cases, word embeddings (e.g., Word2Vec, GloVe) are preferred, as they provide more compact and semantically meaningful representations of words.
    - *No fixed size*: a document is encoded as one vector per word, so documents of different lengths produce representations of different sizes.
    - *Out Of Vocabulary problem (OOV)*: the issue that arises when new, unseen data contains words that were not present in the vocabulary used for encoding. Such a word cannot be represented with One Hot Encoding. For example, if the word "neologism" was not present in the training data and is encountered during testing, the model does not know how to represent it. The appropriate strategy depends on the specific NLP task and the nature of the OOV words.
        - **Strategies for Handling OOV Words**:
            - *Ignoring the Word*: One option is to simply ignore OOV words. This may work if the OOV words are not critical for the task.
            - *Replacing with a Special Token*: Replace OOV words with a special token like `<UNK>` (unknown) to indicate that the word is not recognized.
            - *Expanding the Vocabulary*: If OOV words are common or critical for the task, consider retraining the model with an expanded vocabulary that includes them.
    - *No capture of semantics*: all distinct words are equally far apart, so the encoding carries no notion of similarity between words.
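
The "special token" strategy for OOV words can be sketched as follows. This is one possible arrangement, assuming index 0 is reserved for the unknown token and whitespace tokenization is good enough; `build_vocab_with_unk` and `encode` are hypothetical helper names.

```python
UNK = "<UNK>"

def build_vocab_with_unk(train_docs):
    """Build a token -> index map from training data, reserving index 0 for <UNK>."""
    tokens = sorted({tok for doc in train_docs for tok in doc.lower().split()})
    vocab = {UNK: 0}
    for tok in tokens:
        vocab[tok] = len(vocab)
    return vocab

def encode(doc, vocab):
    """Map each token to its index; unseen tokens fall back to the <UNK> index."""
    return [vocab.get(tok, vocab[UNK]) for tok in doc.lower().split()]

train_docs = ["people watch campusx", "people write comment"]
vocab_unk = build_vocab_with_unk(train_docs)

# "coin", "a" and "neologism" never appeared in the training documents.
ids = encode("people coin a neologism", vocab_unk)

print(vocab_unk)
print(ids)
```

All unseen words collapse onto the same index, so the model can still process the text but loses whatever information those words carried.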





### <a id='4-1-2'>4.1.2 Bag of Words</a>

**The Bag of Words (BoW)** method is a fundamental technique in Natural Language Processing (NLP) used for text processing and feature extraction. It is called a "bag" because it **treats text data as an unordered collection or bag of words, disregarding grammar, word order, and context**. BoW is widely used in various NLP tasks like sentiment analysis, document classification, and information retrieval. The BoW model represents a document as a vector of word frequencies, where each dimension of the vector corresponds to a unique word in the vocabulary. The value at each dimension is the number of times that word appears in the document (absolute frequency).

Similar to one-hot encoding, BoW maps words to unique integer IDs between 1 and $|V|$. Each document in the corpus is then converted into a vector of $|V|$ dimensions in which the $i$-th component, with $i = w_{id}$, is simply the number of times the word $w$ occurs in the document. In other words, each word in $V$ is scored by its occurrence count in the document.

Here are the **key steps** and concepts in the Bag of Words method:

- *Tokenization*:
    - The first step is to break down a piece of text into individual words or tokens. This process may involve removing punctuation and handling special cases like contractions.

- *Vocabulary Building*:
    - Once the text is tokenized, a vocabulary is constructed. This vocabulary consists of all unique words (or tokens) that appear in the corpus (collection of documents).

- *Word Frequency Count*:
    - For each document in the corpus, a vector is created where each element represents the frequency of a word in the document. These vectors can be very high-dimensional, with each dimension corresponding to a word in the vocabulary.

- *Sparse Matrix Representation*:
    - The result of BoW is often represented as a sparse matrix. A sparse matrix is a data structure that only stores non-zero elements, which are the counts of words in this case. This is efficient in terms of memory.

- *Normalization (Optional)*:
    - Depending on the specific task, the frequency counts can be normalized to make them more comparable across different documents. Common normalization techniques include TF-IDF (Term Frequency-Inverse Document Frequency).

- *Feature Vectors*:
    - Each document is represented as a feature vector where each element corresponds to the frequency of a specific word in the vocabulary. The order of the words does not matter, hence the term "bag of words".

- *Loss of Contextual Information*:
    - One limitation of BoW is that it completely ignores the order of words and any contextual information. For example, "not good" and "good not" would be represented the same way.

- *High Dimensionality*:
    - BoW can lead to high-dimensional feature spaces, especially for large vocabularies and extensive documents. This can impact the efficiency of some machine learning algorithms.

- *Application in Machine Learning*:
    - BoW vectors are commonly used as input features for various machine learning models. For example, in sentiment analysis, these vectors can be fed into a classifier to predict the sentiment of a document.
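
The key steps above can be reproduced by hand with `collections.Counter`. This is a minimal sketch assuming whitespace tokenization and a three-document toy corpus; `bag_of_words`, `corpus_bow`, and `vocab_bow` are names chosen for this example.

```python
from collections import Counter

def bag_of_words(doc, vocab):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(doc.lower().split())
    # Counter returns 0 for words that do not occur in the document.
    return [counts[word] for word in vocab]

corpus_bow = ["not good", "good not", "good good movie"]

# Vocabulary: sorted unique tokens of the whole corpus.
vocab_bow = sorted({tok for doc in corpus_bow for tok in doc.lower().split()})
vectors = [bag_of_words(doc, vocab_bow) for doc in corpus_bow]

print("Vocabulary:", vocab_bow)
for doc, vec in zip(corpus_bow, vectors):
    print(f"{doc!r:>20} -> {vec}")
```

The first two documents receive identical vectors even though their word order differs, which is exactly the loss of contextual information described above.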

<u>Notes: </u>
BoW is a powerful and versatile technique, but it may not be suitable for tasks where word order or context is crucial (e.g., language translation or tasks requiring understanding of semantics). In such cases, more advanced techniques like word embeddings or deep learning models may be more appropriate.

- **Advantages**:
    - *Simple and intuitive*
- **Disadvantages**:
    - *Sparsity*
    - *OOV*
    - *Not considering Ordering*
    - *Close or similar vectors with completely different meanings*


**`CountVectorizer()`**: It is a tool provided by scikit-learn for text processing in machine learning. It is used to convert a collection of text documents to a matrix of token counts.
Some of its parameters are:
- **`max_df`** (maximum document frequency):
    - This parameter sets an upper threshold on the document frequency of a term, i.e. the number of documents in which the term appears. Terms above this threshold are ignored. It can be set as an absolute count (e.g., max_df=5 means ignore terms that occur in more than 5 documents) or as a proportion (e.g., max_df=0.85 means ignore terms that occur in more than 85% of the documents).
- **`min_df`** (minimum document frequency):
    - This parameter sets a lower threshold on the document frequency of a term. Terms below this threshold are ignored. It can be set as an absolute count (e.g., min_df=2 means ignore terms that occur in fewer than 2 documents) or as a proportion (e.g., min_df=0.1 means ignore terms that occur in fewer than 10% of the documents).
- **`max_features`**:
    - This parameter specifies the maximum number of features (terms) to include in the vocabulary. It selects the max_features most frequently occurring terms in the dataset. This can be useful to limit the number of features in cases where the vocabulary is very large.
- **`binary`**:
    - When set to True, this parameter specifies that the CountVectorizer should return binary values (0 or 1) instead of the count of occurrences. This means that a term either occurs or it doesn't in a document.
- **`ngram_range`**:
    - This parameter takes a tuple of two values (*min_n, max_n*), where min_n is the minimum size of the n-grams and max_n is the maximum size of the n-grams. For example, `ngram_range=(1, 2)` extracts both unigrams and bigrams.

In [5]:
# Example

df_test = pd.DataFrame({'text': ['people watch campusx', 'campusx watch campusx', 'people write comment', 'campusx write comment'], 'output': [1,1,0,0]})

df_test

|   | text | output |
|---|------|--------|
| 0 | people watch campusx | 1 |
| 1 | campusx watch campusx | 1 |
| 2 | people write comment | 0 |
| 3 | campusx write comment | 0 |


In [6]:
from sklearn.feature_extraction.text import CountVectorizer

# Initiate the Count vectorizer
my_cv = CountVectorizer()

# Apply the transformation
my_bow = my_cv.fit_transform(df_test['text'])

# Get Vocabulary
print('Vocabulary: \n', my_cv.vocabulary_)

print('\n')

# Get the Features names
my_bow_feats = my_cv.get_feature_names_out()
print('Features names:\n', my_bow_feats)

print('\n')

# Get the dimension of the Vocabulary
print('Number of unique words in the Corpus: \n', len(my_bow_feats))

print('\n')

# Get the Word Frequency Count for each Document
print('Word Frequency Count for each Document of the Corpus:')
print(my_bow[0].toarray())
print(my_bow[1].toarray())
print(my_bow[2].toarray())
print(my_bow[3].toarray())

Vocabulary: 
 {'people': 2, 'watch': 3, 'campusx': 0, 'write': 4, 'comment': 1}


Features names:
 ['campusx' 'comment' 'people' 'watch' 'write']


Number of unique words in the Corpus: 
 5


Word Frequency Count for each Document of the Corpus:
[[1 0 1 1 0]]
[[2 0 0 1 0]]
[[0 1 1 0 1]]
[[1 1 0 0 1]]


In [7]:
my_cv.transform(['campusx watch and write comment of campusx']).toarray()


array([[2, 1, 0, 1, 1]], dtype=int64)

**BoW to Tweets Data**

In [39]:
cv = CountVectorizer(min_df = 1,
                     max_df = 0.90,
                     # max_features = 1000,
                     binary = False)

In [40]:
# Train data
bow_tweets_train = X_train['LemmatizedTweets']

# fit_Transform train data
bow_tweets_train = cv.fit_transform(bow_tweets_train)


# Test data
bow_tweets_test = X_test['LemmatizedTweets']

# Transform test data
bow_tweets_test = cv.transform(bow_tweets_test)

print('BOW cv_train:', bow_tweets_train.shape)
print('BOW cv_test:', bow_tweets_test.shape)


BOW cv_train: (32925, 49601)
BOW cv_test: (8232, 49601)
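
The shapes printed above also show why the BoW matrix is kept in sparse form. The back-of-the-envelope sketch below estimates how much memory a fully dense copy of the training matrix would need, assuming 8-byte `int64` counts (the dtype shown in the earlier `transform` output); the exact footprint on a given machine may differ.

```python
# Shape of the training BoW matrix, taken from the printed output.
n_rows, n_cols = 32925, 49601

# Assumption: each count is stored as an 8-byte int64 value.
bytes_per_value = 8

dense_cells = n_rows * n_cols
dense_bytes = dense_cells * bytes_per_value

print(f"Cells in a dense matrix: {dense_cells:,}")
print(f"Approximate dense size : {dense_bytes / 1e9:.1f} GB")
```

A sparse matrix stores only the non-zero counts, so its footprint scales with the number of word occurrences rather than with the full rows-times-columns grid.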


In [41]:
# Create dataframes from the sparse matrices

# todense() converts a sparse matrix into a dense one: a regular 2D array in which
# every cell, including the zeros, is stored explicitly. This can consume far more memory.
df_bow_tweets_train = pd.DataFrame(bow_tweets_train.todense())
display(df_bow_tweets_train.head())
df_bow_tweets_test = pd.DataFrame(bow_tweets_test.todense())
display(df_bow_tweets_test.head())

[5 rows x 49601 columns; every displayed value is 0]


[Output: first five rows (0 to 4) of a DataFrame with 49,601 columns (0 to 49600); nearly all displayed entries are 0.]


### <a id='4-1-3'>4.1.3 Bag of N-grams</a>

The **Bag of N-grams model** is an extension of the Bag of Words (BoW) model used in Natural Language Processing (NLP). While BoW treats each word as a separate entity and disregards word order, **the Bag of N-grams model considers sequences of N consecutive words, or "N-grams," in a text**.

The corpus **vocabulary**, $V$, **is a collection of all unique n-grams across the text corpus**. Then, each document in the corpus is represented by a vector of length $|V|$. This vector simply contains the frequency counts of n-grams present in the document and zero for the n-grams that are not present.

Bag of N-grams model:
- *N-grams*:
    - An N-gram is a contiguous sequence of N items from a given sample of text or speech. In the context of language, these "items" are usually words.

- *Types of N-grams*:
    - <u>Unigrams (N=1) </u>: Each word is treated individually. The sentence "The quick brown fox" would be represented as ["The", "quick", "brown", "fox"].
    - <u>Bigrams (N=2) </u>: Sequences of two consecutive words are considered. The same sentence would be represented as ["The quick", "quick brown", "brown fox"].
    - <u>Trigrams (N=3) </u>: Sequences of three consecutive words are considered. The sentence would be represented as ["The quick brown", "quick brown fox"].
    - And so on for higher values of N.

- *Bag of N-grams*:
    - Like the Bag of Words model, the Bag of N-grams model represents text data as a collection of features, but instead of individual words, it considers N-gram sequences.

- *Feature Extraction*:
    - The process involves creating a vocabulary of all unique N-grams in the corpus. Each document is then represented as a vector, where each element corresponds to the frequency of an N-gram in the vocabulary.
      Example: Consider the sentence: "The quick brown fox jumps over the lazy dog." For bigrams, the Bag of N-grams representation might include features like "The quick", "quick brown", "brown fox", "fox jumps", "jumps over", "over the", "the lazy", "lazy dog".

- *Preserving Context*:
    - By considering sequences of words, the Bag of N-grams model captures some degree of context, which can be important for tasks like sentiment analysis, where the meaning of a sentence can change drastically based on word order.

- *Limitations*:
    - As N increases, the number of possible N-grams grows exponentially (up to $W^N$ for $W$ distinct words), so the vocabulary observed in a corpus can become very large, leading to high-dimensional, sparse feature spaces and increased computational requirements.

The Bag of N-grams model is useful for tasks where word order and context are important, such as machine translation, part-of-speech tagging, and some types of sentiment analysis. It provides a compromise between the simplicity of Bag of Words and the context-awareness of more sophisticated models like recurrent neural networks (RNNs) or transformers.
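
As a rough illustration of the N-gram definitions above, the following sketch extracts word N-grams with plain Python. The helper name `word_ngrams` and the whitespace tokenization are assumptions made for this example; they are not part of the notebook or of scikit-learn.

```python
def word_ngrams(text, n):
    """Return the list of contiguous word n-grams in `text` (whitespace tokenization)."""
    tokens = text.split()
    # Slide a window of size n over the token list
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "The quick brown fox"
unigrams = word_ngrams(sentence, 1)
bigrams = word_ngrams(sentence, 2)
trigrams = word_ngrams(sentence, 3)

print(unigrams)
print(bigrams)
print(trigrams)
```

A sentence with fewer than N words yields an empty list, which is why very short documents can end up with all-zero vectors for large N.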

<u>Notes </u>:
- **Advantages**:
    - *Easy implementation*.
    - *It captures some context and word-order information in the form of n-grams*.
    - Thus, *the resulting vector space can capture some semantic similarity*. **Documents having the same n-grams will have their vectors closer to each other in Euclidean space as compared to documents with completely different n-grams**.
- **Disadvantages**:
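    - *As $n$ increases, dimensionality (and therefore sparsity) increases rapidly, which slows down downstream algorithms*.
    - *It provides no way to handle out-of-vocabulary (OOV) n-grams*: n-grams not seen during fitting are simply ignored.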
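
The two notes above (shared n-grams bring vectors closer, and unseen n-grams are lost) can be checked on a toy corpus. This is a minimal sketch, assuming scikit-learn's `CountVectorizer` with bigrams; the corpus and the unseen sentence `"dogs bark loudly"` are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the cat sat",
          "the cat sat in the hat",
          "the cat with the hat"]

bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
vectors = bigram_vectorizer.fit_transform(corpus).toarray()

# Euclidean distances from the first document to the other two
dist_0_1 = np.linalg.norm(vectors[0] - vectors[1])
dist_0_2 = np.linalg.norm(vectors[0] - vectors[2])
print(dist_0_1, dist_0_2)

# A sentence made only of unseen bigrams maps to the all-zero vector (OOV)
oov_vector = bigram_vectorizer.transform(["dogs bark loudly"]).toarray()
print(oov_vector)
```

Documents 0 and 1 share two bigrams ("the cat", "cat sat"), while documents 0 and 2 share only one, so the first distance comes out smaller.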

In [11]:
# Example 1

df_test

| | text | output |
|---|---|---|
| 0 | people watch campusx | 1 |
| 1 | campusx watch campusx | 1 |
| 2 | people write comment | 0 |
| 3 | campusx write comment | 0 |


In [12]:
cv = CountVectorizer(ngram_range =(3,3))      # <--- change ngram_range to see different results

In [13]:
bow = cv.fit_transform(df_test['text'])

In [14]:
# Get the vocabulary
print(cv.vocabulary_)

{'people watch campusx': 2, 'campusx watch campusx': 0, 'people write comment': 3, 'campusx write comment': 1}


In [15]:
print(bow[0].toarray())
print(bow[1].toarray())

[[0 0 1 0]]
[[1 0 0 0]]


In [16]:
# Example 2
# Bag of 1-gram (unigram)
sample_boN = CountVectorizer(ngram_range = (1,1))

sample_corpus = ['the cat sat',
                 'the cat sat in the hat',
                 'the cat with the hat']

sample_boN.fit(sample_corpus)

def get_boN_representation(text):
    return sample_boN.transform(text)

print(f"Unigram Vocabulary mapping for given sample corpus : \n {sample_boN.vocabulary_}")
print("\nBag of 1-gram (unigram) Representation of sentence 'the cat cat sat in the hat'")
print(get_boN_representation(["the cat cat sat in the hat"]).toarray())

Unigram Vocabulary mapping for given sample corpus : 
 {'the': 4, 'cat': 0, 'sat': 3, 'in': 2, 'hat': 1, 'with': 5}

Bag of 1-gram (unigram) Representation of sentence 'the cat cat sat in the hat'
[[2 1 1 1 2 0]]


In [17]:
# Example 3
# Bag of 2-gram (bigram)
sample_boN = CountVectorizer(ngram_range = (2,2))

sample_corpus = ['the cat sat',
                 'the cat sat in the hat',
                 'the cat with the hat']

sample_boN.fit(sample_corpus)

def get_boN_representation(text):
    return sample_boN.transform(text)

print(f"Bigram Vocabulary mapping for given sample corpus : \n {sample_boN.vocabulary_}")
print("\nBag of 2-gram (bigram) Representation of sentence 'the cat cat sat in the hat'")
print(get_boN_representation(["the cat cat sat in the hat"]).toarray())

Bigram Vocabulary mapping for given sample corpus : 
 {'the cat': 4, 'cat sat': 0, 'sat in': 3, 'in the': 2, 'the hat': 5, 'cat with': 1, 'with the': 6}

Bag of 2-gram (bigram) Representation of sentence 'the cat cat sat in the hat'
[[1 0 1 1 1 1 0]]


In [18]:
# Example 4
# Bag of 3-gram (trigram)

sample_boN = CountVectorizer(ngram_range = (3, 3))

sample_corpus = ["the cat sat", "the cat sat in the hat", "the cat with the hat"]

sample_boN.fit(sample_corpus)

def get_boN_representation(text):
    return sample_boN.transform(text)

print(f"Trigram Vocabulary mapping for given sample corpus : \n {sample_boN.vocabulary_}")
print("\nBag of 3-gram (trigram) Representation of sentence 'the cat cat sat in the hat'")
print(get_boN_representation(["the cat cat sat in the hat"]).toarray())

Trigram Vocabulary mapping for given sample corpus : 
 {'the cat sat': 4, 'cat sat in': 0, 'sat in the': 3, 'in the hat': 2, 'the cat with': 5, 'cat with the': 1, 'with the hat': 6}

Bag of 3-gram (trigram) Representation of sentence 'the cat cat sat in the hat'
[[1 0 1 1 0 0 0]]


### <a id='4-1-4'>4.1.4 Term Frequency-Inverse Document Frequency (TF-IDF)</a>

**TF-IDF** **is a statistical measure** used in Natural Language Processing (NLP) and information retrieval **to evaluate the importance of a word within a document relative to a collection of documents, often referred to as a corpus**.

TF-IDF aims to quantify the importance of a given word relative to other words in the document and in the corpus.

The intuition behind TF-IDF is as follows: if a word $w$ appears many times in a sentence $S_1$ but rarely in the other sentences of the corpus, then $w$ must be of great importance to $S_1$. The importance of $w$ should increase in proportion to its frequency in $S_1$ (how many times the word occurs in $S_1$), but at the same time decrease in proportion to its frequency in the other sentences of the corpus. Here each sentence plays the role of a document. Mathematically, this is captured using two quantities: *TF* and *IDF*. The two are then multiplied to arrive at the TF-IDF score.
- **TF (term frequency)**: It measures the frequency of a term (word) within a document. It indicates how often a word appears in a document relative to the total number of words in that document.
    Mathematical Expression of TF:
    $$\text{TF}(t,d) = \displaystyle\frac{\text{Number of occurrences of term $t$ in document $d$}}{\text{Total number of terms in the document $d$}}$$

Example: In a document with 100 words, if the word "apple" appears 5 times, its TF score is 0.05.

- **IDF (Inverse Document frequency)**: It measures the importance of a term across a collection of documents (corpus). It assigns a weight to each term based on how common or rare it is in the entire corpus. Words that are common across many documents receive a lower weight, while rare words receive a higher weight.
     It’s a well-known fact that stop words like *is, are, am*, etc., are not important, even though they occur frequently. To account for such cases, IDF weighs down the terms that are very common across a corpus and weighs up the rare terms. IDF of a term $t$ is calculated as follows:
     $$\text{IDF}(t) = \log_e\displaystyle\frac{\text{Total number of documents in the corpus}}{\text{Number of documents with term $t$ in them}}$$

Example: If there are 1,000,000 documents in the corpus and the word "apple" appears in 10,000 of them, its IDF score is $\log_e(100) \approx 4.6$.

- **TF-IDF score**: The TF-IDF score of a term in a document is the product of its TF and IDF scores. It indicates how important a term is to a specific document in the context of the entire corpus.
 $$\text{TF-IDF}(t,d) = \text{TF}(t,d)\cdot\text{IDF}(t)$$

Example: If "apple" has a TF of 0.05 and an IDF of 4.6, its TF-IDF score is 0.23.
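
The arithmetic of the "apple" examples can be checked directly. This is a small sketch using only the standard library; the variable names are chosen for this example, and the natural logarithm is used because the IDF formula above is written with $\log_e$.

```python
import math

# Numbers taken from the "apple" examples in the text
term_count_in_doc = 5
doc_length = 100
n_docs = 1_000_000
n_docs_with_term = 10_000

tf = term_count_in_doc / doc_length
idf = math.log(n_docs / n_docs_with_term)   # natural logarithm of 100
tf_idf = tf * idf

print(f"TF = {tf:.2f}, IDF = {idf:.3f}, TF-IDF = {tf_idf:.3f}")
```

With these numbers the IDF comes out at roughly 4.6 and the TF-IDF score at roughly 0.23; a base-10 logarithm would give an IDF of 2 instead, so the base should always be stated.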

Similar to BoW, **we can use the TF-IDF vectors to calculate the similarity between two texts using a similarity measure like Euclidean distance or cosine similarity**. TF-IDF is a commonly used representation in application scenarios such as information retrieval and text classification. However, even though TF-IDF is better than the vectorization methods we saw earlier in terms of capturing similarities between words, it still suffers from the curse of high dimensionality.
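
As a hedged illustration of the similarity computation mentioned above, the sketch below compares three short texts with cosine similarity. It assumes scikit-learn's `TfidfVectorizer` with default settings and `sklearn.metrics.pairwise.cosine_similarity`; the toy corpus is an assumption made for this example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = ["the cat sat",
          "the cat sat in the hat",
          "the cat with the hat"]

# Each row is the TF-IDF vector of one document
tfidf_matrix = TfidfVectorizer().fit_transform(corpus)

# Pairwise cosine similarity between all documents (3 x 3 matrix)
similarity = cosine_similarity(tfidf_matrix)
print(similarity.round(3))
```

Each document has similarity 1 with itself, and the off-diagonal entries fall between 0 and 1 depending on how many weighted terms two documents share.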

<u>Notes </u>:
- **Advantages**:
    - Widely used in *information retrieval*. It captures a bit of the semantics of the sentence by giving distinctive terms more weight.
    - It yields a *fixed-length encoding for any sentence of arbitrary length*.

- **Disadvantages**:
    - Its *implementation is not as easy* as that of BoW, OHE, or N-grams.
    - The feature vectors are high-dimensional representations. The *dimensionality increases with the size of the vocabulary*.
    - *OOV* problem: words not seen during fitting are ignored.

In [26]:
from sklearn.feature_extraction.text import TfidfVectorizer


In [27]:
# Example 1

tfidf = TfidfVectorizer()

tfidf.fit_transform(df_test['text']).toarray()

array([[0.49681612, 0.        , 0.61366674, 0.61366674, 0.        ],
       [0.8508161 , 0.        , 0.        , 0.52546357, 0.        ],
       [0.        , 0.57735027, 0.57735027, 0.        , 0.57735027],
       [0.49681612, 0.61366674, 0.        , 0.        , 0.61366674]])

In [28]:
# Get the IDF scores for each word of the corpus
print(tfidf.idf_)

# Get the features (words)
print(tfidf.get_feature_names_out())

[1.22314355 1.51082562 1.51082562 1.51082562 1.51082562]
['campusx' 'comment' 'people' 'watch' 'write']


In [33]:
# Example 2

tfidf = TfidfVectorizer()

sample_corpus = ['the cat sat',
                 'the cat sat in the hat',
                 'the cat with the hat']
tfidf_rep = tfidf.fit_transform(sample_corpus)

print("IDF Values for sample corpus :", tfidf.idf_)

print("TF-IDF Representation for sentence 'the cat sat in the hat' :")
print(tfidf.transform(["the cat sat in the hat"]).toarray())

IDF Values for sample corpus : [1.         1.28768207 1.69314718 1.28768207 1.         1.69314718]
TF-IDF Representation for sentence 'the cat sat in the hat' :
[[0.29903422 0.385061   0.50630894 0.385061   0.59806843 0.        ]]
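
The IDF values printed above (for example 1.0 for a term that occurs in every document) differ from the textbook formula $\log_e(N/\text{df})$, which would give 0 in that case. With its default settings (`smooth_idf=True`, `norm='l2'`), scikit-learn's `TfidfVectorizer` uses the smoothed variant $\ln\frac{1+N}{1+\text{df}(t)} + 1$, multiplies it by the raw term counts, and then L2-normalizes each row. The sketch below re-derives the values by hand under those default settings; the names `manual_idf` and `manual_tfidf` are introduced only for this example.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

sample_corpus = ['the cat sat',
                 'the cat sat in the hat',
                 'the cat with the hat']

# Raw term counts per document
counts = CountVectorizer().fit_transform(sample_corpus).toarray()

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)          # number of documents containing each term

# Smoothed IDF used by scikit-learn when smooth_idf=True (the default)
manual_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts by IDF, then L2-normalize each row (norm='l2')
weighted = counts * manual_idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

vectorizer = TfidfVectorizer()
reference = vectorizer.fit_transform(sample_corpus).toarray()

print(manual_idf)
print(np.allclose(manual_idf, vectorizer.idf_))
print(np.allclose(manual_tfidf, reference))
```

Because of the row normalization, the printed TF-IDF values are not simple TF times IDF products, which is worth keeping in mind when comparing them with the hand-computed examples earlier in this section.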


**TF-IDF to Tweets Data**

In [48]:
tfidf = TfidfVectorizer(min_df = 1,
                        max_df = 0.90,
                        norm = 'l2',               # It controls the normalization term used in the TF-IDF computation. It can take values like 'l1', 'l2', or None. 'l2' is commonly used, and it normalizes the term vectors to have a Euclidean norm of 1. This normalization is often applied to prevent longer documents from having a larger impact solely due to their length.
                        use_idf = True,            # It determines whether to enable the inverse-document-frequency reweighting. If True, it scales down the importance of words that occur in many documents, giving more weight to terms that are rare across documents and less weight to terms that are common.
                        max_features = 2000,       # It specifies the maximum number of features (words) to consider. It limits the vocabulary size to the top max_features ordered by term frequency across the corpus. If not specified, there is no limit. By setting max_features, you are limiting the vocabulary size to the most frequent words, which can help in reducing the dimensionality of the resulting TF-IDF matrix.
                        stop_words = 'english')    #  If set to 'english', it will remove common English stop words ('the', 'is', 'and', etc.) from the input data. This is generally a good practice for text data, since stop words are often considered noise in natural language processing tasks.

In [49]:
# Train data
tfidf_tweets_train_cv = X_train['LemmatizedTweets']

# Fit the vectorizer on the train data and transform it
tfidf_tweets_train = tfidf.fit_transform(tfidf_tweets_train_cv)

# Test data
tfidf_tweets_test_cv = X_test['LemmatizedTweets']

# Transform test data
tfidf_tweets_test = tfidf.transform(tfidf_tweets_test_cv)

print('TF-IDF cv_train:', tfidf_tweets_train.shape)
print('TF-IDF cv_test:', tfidf_tweets_test.shape)

TF-IDF cv_train: (32925, 2000)
TF-IDF cv_test: (8232, 2000)


In [50]:
# Create dataframes with the sparse matrices

df_tfidf_tweets_train = pd.DataFrame(tfidf_tweets_train.todense())    # <---  It is used to convert a sparse matrix into a dense matrix.  In a dense matrix, all values are stored, including the zeros, which can consume a lot more memory. The resulting dense matrix is a regular 2D array where all the cells are explicitly represented.
display(df_tfidf_tweets_train.head())
df_tfidf_tweets_test = pd.DataFrame(tfidf_tweets_test.todense())
display(df_tfidf_tweets_test.head())

[Output: first five rows (0 to 4) of `df_tfidf_tweets_train`, with 2000 columns (0 to 1999). Almost every entry is 0.0; only a handful of nonzero TF-IDF weights are visible, for example 0.442051, 0.41163 and 0.300317 in row 0.]


*[Cell output: preview of rows 0 to 4 of a DataFrame with 2,000 columns labelled 0 to 1999. Almost all entries are 0.0, with a handful of non-zero weights per row (for example 0.271063 and 0.323641 in row 0).]*


## <a id='4-2'>4.2 DISTRIBUTED TEXT REPRESENTATION</a>

<u> <b> Important Notes about Word Embedding models </b> </u>

1. All text representations are inherently biased based on what they saw in training data. For example, an embedding model trained heavily on technology news or articles is likely to identify Apple as being closer to, say, Microsoft or Facebook than to an orange or pear.
2. Unlike the basic vectorization approaches, pre-trained embeddings are generally large files (several gigabytes), which may pose problems in certain deployment scenarios. We need to account for this when using them; otherwise, it can become an engineering bottleneck. For example, the pre-trained Word2vec model takes ~4.5 GB of RAM.

### <a id='4-2-1'>4.2.1 Word2vec Word Embeddings</a>

**Word Vector**: word vectors place words in a vector space where similar words cluster together and dissimilar words lie far apart.

### **Continuous Bag of Words (CBOW)**

**Continuous Bag of Words (CBOW)** is a word embedding model and one of the two architectures used in the Word2Vec framework. CBOW **aims to predict the target word based on its context, which is formed by the surrounding words in a given text**.

CBOW is particularly useful when you have a large amount of training data and you want to generate word embeddings that capture syntactic and semantic relationships between words. It is computationally efficient and can be trained on large datasets.

**DETAILED EXPLANATION**

**Objective**:
- The primary goal of CBOW is **to learn distributed representations** (**vectors**) of words. *These representations are continuous and capture semantic relationships between words*.

**Architecture**:
- **CBOW is a shallow neural network model with one hidden layer**. It has *an input layer*, *a projection layer* (hidden layer), and *an output layer*. **The input and output layers have the same number of neurons, which is equal to the size of the vocabulary**.
![CBoW](figures/CBoW.png)

**Input**:
- The input to the CBOW model is a **set of context words** (surrounding words) that are used to predict a target word. *The number of context words is determined by a fixed-size window around the target word*.

**Output**:
- The output layer of CBOW **predicts the probability distribution of the target word given the context words**.

**Word Vectors**:
- **The weights of the projection layer serve as the word vectors** (also called word embeddings). These vectors are what CBOW is primarily interested in.

**Training Process**:
- During training, **the model is presented with pairs of context words and their corresponding target words**. The *weights of the neural network are adjusted using techniques like backpropagation and gradient descent to minimize the prediction error*.

**Context Window**:
- The context window **determines the number of words to consider on either side of the target word**. For example, with a context window of 2, if the target word is in the middle, the model considers 2 words to the left and 2 words to the right.
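As a minimal sketch of how a context window turns a sentence into CBOW training examples, the helper below builds (context, target) pairs. The function name `cbow_pairs` and the choice to keep edge words with a truncated context are assumptions made for illustration. Here `window` counts words on each side of the target.

```python
def cbow_pairs(tokens, window=2):
    """Return (context_words, target_word) pairs for a symmetric window."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]        # up to `window` words before the target
        right = tokens[i + 1:i + 1 + window]       # up to `window` words after the target
        context = left + right
        if context:                                # skip tokens that have no neighbours
            pairs.append((context, target))
    return pairs

sentence = "I am reading the book".split()
for context, target in cbow_pairs(sentence, window=2):
    print(context, "->", target)
```

With `window=2`, the middle word *reading* gets the full context of two words on the left and two on the right, while words near the sentence boundary get fewer.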

**Word Probability**:
- The output of the model is a probability distribution over the entire vocabulary. Each element of the output vector represents the **probability of a specific word being the target word given the context**.

**Loss Function**:
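- The loss used in CBOW is typically the **cross-entropy** (negative log-likelihood) computed on a **softmax** output, which **measures the difference between predicted probabilities and actual labels**. In practice, implementations often approximate the full softmax with hierarchical softmax or negative sampling.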
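For reference, the objective for a single training example can be written as follows, assuming the plain full-softmax formulation (practical implementations often replace it with cheaper approximations). The notation is assumed for illustration and is not taken from the notebook: $C$ is the set of context words, $v_c$ the input vector of context word $c$, $u_w$ the output vector of word $w$, $w_t$ the target word, and $V$ the vocabulary.

$$
h = \frac{1}{|C|}\sum_{c \in C} v_c, \qquad
p(w_t \mid C) = \frac{\exp\left(u_{w_t}^{\top} h\right)}{\sum_{w \in V} \exp\left(u_w^{\top} h\right)}, \qquad
\mathcal{L} = -\log p(w_t \mid C)
$$

Minimizing $\mathcal{L}$ pushes the predicted probability of the true target word towards 1.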

**Word Similarity**:
- After training, **words with similar meanings tend to have similar vector representations**. This means that the vectors for semantically similar words will be located close to each other in the vector space.

----

**<span style="color:red"> Illustrated example of CBoW (taken from [Here](https://pyimagesearch.com/2022/07/11/word2vec-a-study-of-embeddings-in-nlp/)) </span>**

CBoW is a technique where, given the neighboring words, the center word is predicted.

Consider the sentence: *I am reading the book*
- then, the input pairs and labels for a window size of 3 would be:
    - *I*, *reading*, for the label *am*
    - *am*, *the*, for the label *reading*
    - *reading*, *book*, for the label *the*
- The figure shows the sketch of the example:
![CBoWexample](figures/cbow_example.png)
- Assume that the sentence is the complete input, then the vocabulary size is 5.
- Assume there are 3 embedding dimensions for simplicity
- Consider the example of the input - label pair of (*I, reading*) *-* (*am*)
    1. We start with the one-hot encodings of *I* and *reading* (each of shape $1 \times 5$).
    2. We multiply each encoding by an encoding matrix of shape $5 \times 3$ and combine the results (typically by averaging) into a $1 \times 3$ hidden layer.
    3. We multiply the hidden layer by a $3 \times 5$ decoding matrix and obtain a prediction of shape $1 \times 5$.
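The three steps above can be sketched in NumPy. This is an illustration only: the weights are random rather than trained, the context vectors are combined by averaging (an assumption), and the names `W_enc`, `W_dec` and `one_hot` are made up for this sketch.

```python
import numpy as np

vocab = ["I", "am", "reading", "the", "book"]
word_to_idx = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 3                      # vocabulary size and embedding dimension

rng = np.random.default_rng(0)            # random weights stand in for trained ones
W_enc = rng.normal(size=(V, D))           # encoding matrix, 5 x 3
W_dec = rng.normal(size=(D, V))           # decoding matrix, 3 x 5

def one_hot(word):
    """Return the 1 x V one-hot row vector of a word."""
    vec = np.zeros((1, V))
    vec[0, word_to_idx[word]] = 1.0
    return vec

# Steps 1-2: each one-hot encoding (1 x 5) times the encoding matrix gives a 1 x 3 vector;
# the context vectors are averaged into a single hidden layer.
context = ["I", "reading"]
hidden = np.mean([one_hot(w) @ W_enc for w in context], axis=0)    # 1 x 3

# Step 3: hidden layer times the decoding matrix gives 1 x 5 scores,
# which a softmax turns into probabilities over the vocabulary.
scores = hidden @ W_dec
probs = np.exp(scores - scores.max())
probs /= probs.sum()

print(hidden.shape, probs.shape)
print(dict(zip(vocab, probs.round(3)[0])))
```

Multiplying a one-hot vector by the encoding matrix simply selects one row of that matrix, which is why the rows of the encoding matrix end up being the word vectors.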

---
**<span style="color:red"> Another detailed example and further explanation can be found in this Kaggle [Notebook](https://www.kaggle.com/code/nkitgupta/text-representations) </span>**

### **Example of implementation of CBoW**

One of the most commonly used implementations is the one in **gensim**. We have to choose several hyperparameters, such as the vector size and the context window.

**`KeyedVectors`**: Class from the `gensim.models` module that provides an interface for working with word vectors or embeddings

In [58]:
import gensim
from gensim.test.utils import common_texts         # It imports a set of common example texts provided by Gensim. These texts are used as a sample corpus to train the Word2Vec model.
from gensim.models import Word2Vec, KeyedVectors

print('Sentences on which we are gonna train our CBOW Word2Vec model:\n')
print(common_texts)

# Train the CBOW Word2Vec model and store it in 'CBOW_Word2Vec_model'
CBOW_Word2Vec_model = Word2Vec(common_texts,        # Input data, which is a list of tokenized sentences or documents. In this case, it's the set of common example texts provided by Gensim.
                               vector_size = 10,    # Dimensionality of the Word Vectors to 10: Each word will be represented as a vector of 10 numbers.
                               window = 5,          # maximum distance between the current and predicted word within a sentence (size of the context window)
                               min_count = 1,      # minimum number of times a word must appear in the corpus to be considered during training. Words with very low frequencies are often removed to reduce noise in the model.
                               workers = 8,        # number of CPU cores to be used for training the model. More workers can lead to faster training.
                               sg = 0)             # It specifies the training algorithm. sg=0 indicates that the CBOW algorithm should be used.

# Save the trained CBOW Word2Vec model
CBOW_Word2Vec_model.save("CBOW_Word2Vec_model.w2v")
print("Model Saved")

Sentences on which we are gonna train our CBOW Word2Vec model:

[['human', 'interface', 'computer'], ['survey', 'user', 'computer', 'system', 'response', 'time'], ['eps', 'user', 'interface', 'system'], ['system', 'human', 'system', 'eps'], ['user', 'response', 'time'], ['trees'], ['graph', 'trees'], ['graph', 'minors', 'trees'], ['graph', 'minors', 'survey']]
Model Saved


**The CBOW_Word2Vec_model can be used to obtain word embeddings for words in the corpus or for downstream natural language processing tasks.**

**`.wv`**: stands for *"word vectors"*, and it's a sub-attribute of the Word2Vec model in Gensim. It provides access to word vectors.

 **most_similar([word])**:
 - Method used to find words that are most similar to a given word based on their vector representations in the Word2vec model. It returns a list of words similar to 'word' along with their similarity scores. This method is commonly used to find words that are semantically similar or related to a given word. It's useful in tasks like **synonym identification**, **semantic similarity tasks**, and more.

In [59]:
CBOW_Word2Vec_model.wv.most_similar('human',
                                    topn = 5)       # It specifies the number of similar words to retrieve

[('graph', 0.3586882948875427),
 ('system', 0.22743132710456848),
 ('time', 0.1153423935174942),
 ('interface', 0.09816545248031616),
 ('survey', 0.01448808517307043)]

In [60]:
# Show the Word Vector of 'human' based on the trained model 'CBOW_Word2Vec_model'
CBOW_Word2Vec_model.wv['human']

array([-0.00410223, -0.08368949, -0.05600012,  0.07104538,  0.0335254 ,
        0.0722567 ,  0.06800248,  0.07530741, -0.03789154, -0.00561806],
      dtype=float32)

### **Skip-gram**

The **Skip-gram model** is a type of word embedding model used in Natural Language Processing (NLP) and is one of the architectures within the Word2Vec framework. It is **designed to learn distributed representations of words by predicting the context words given a target word in a given text**. It is trained using large amounts of unstructured text data and can capture the context and semantic similarity between words.

**DETAILED EXPLANATION**

**Objective**:
 - The primary goal of the Skip-gram model is **to learn distributed representations (vectors) of words**. These representations are continuous and capture semantic relationships between words.

**Architecture**:
 - The Skip-gram model is a **shallow neural network model with one hidden layer**. It has *an input layer, a projection layer (hidden layer), and an output layer*. **The input and output layers have the same number of neurons, which is equal to the size of the vocabulary**.
![SkipGram model](figures/SkipGram_model.png)

**Input and Output**:
- In the Skip-gram model, **the input is a single target word**, and **the output is a probability distribution over the entire vocabulary of words**.

**Context Words**:
- Unlike Continuous Bag of Words (CBOW), which predicts a target word based on its context, the Skip-gram model **predicts the surrounding context words based on the current target word**.

**Training Process**:
- During training, the model is presented with pairs of target words and their corresponding context words. The weights of the neural network are adjusted using techniques like backpropagation and gradient descent to minimize the prediction error.

**Word Vectors**:
- The weights of the projection layer serve as the word vectors (also called word embeddings). These vectors are what the Skip-gram model is primarily interested in.

**Context Window**:
- The context window determines the number of words to consider on either side of the target word. For example, with a context window of 2, if the target word is in the middle, the model considers 2 words to the left and 2 words to the right.
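A minimal sketch of how Skip-gram training pairs could be generated from a context window: each center word is paired with every neighbor inside the window. The function name `skipgram_pairs` is hypothetical, and `window` counts words on each side of the center word.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center_word, context_word) pairs for a symmetric window."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:                     # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

sentence = "I am reading the book".split()
print(skipgram_pairs(sentence, window=1))
```

Compared with CBOW, where one example groups all context words to predict a single target, Skip-gram produces one example per (center, context) pair.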

**Word Probability**:
- The output of the model is a probability distribution over the entire vocabulary. Each element of the output vector represents the probability of a specific word being in the context given the target word.

**Semantic Similarity**:
- After training, words with similar meanings tend to have similar vector representations. This means that the vectors for semantically similar words will be located close to each other in the vector space.

**Use Cases**:
- The Skip-gram model is useful for capturing semantic relationships and can be used in various NLP tasks such as **sentiment analysis**, **machine translation**, and more.

The Skip-gram model, along with Continuous Bag of Words (CBOW), forms the basis of Word2Vec, which has been instrumental in advancing the state-of-the-art in various NLP tasks.

----

**<span style="color:red"> Illustrated example of Skip-Gram (taken from [Here](https://pyimagesearch.com/2022/07/11/word2vec-a-study-of-embeddings-in-nlp/)) </span>**

Skip-Gram is a technique where, given the center word, its neighboring words are predicted.

Consider the sentence: *I am reading the book*
- then, the Skip-Gram pairs for a window size of 3 would be:
    - *am*, for labels *I* and *reading*
    - *reading*, for labels *am* and *the*
    - *the*, for labels *reading* and *book*
- The figure shows the sketch of the example:
![skipgram_example](skipgram_example.png)
- Assume that the sentence is the complete input, then the vocabulary size is 5.
- Assume there are 3 embedding dimensions for simplicity
- Consider the example of the input - label pair of (*am*) *-* (*I, reading*)
    1. We start with the encoding matrix, where we grab the vector located at the index of our center word (*am* in this case).
    2. We transpose the vector, so we now have a $3 \times 1$ vector representation of the word *am* (since we are directly grabbing a row of the encoding matrix, this **will not** be a one-hot encoding).
    3. We multiply the decoding matrix of shape $5 \times 3$ by this vector representation, which gives a $5 \times 1$ vector. After a softmax, this vector is a probability distribution over the whole vocabulary, ideally pointing to the indices of the neighboring words of our input center word. In this example, the output should point to the indices of *I* and *reading*.
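These steps can be sketched in NumPy as follows. The weights are random rather than trained, the decoding matrix is assumed to have shape $5 \times 3$ so that the product with the $3 \times 1$ vector is defined, and the names `W_enc` and `W_dec` are made up for this sketch.

```python
import numpy as np

vocab = ["I", "am", "reading", "the", "book"]
word_to_idx = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 3                      # vocabulary size and embedding dimension

rng = np.random.default_rng(0)            # random weights stand in for trained ones
W_enc = rng.normal(size=(V, D))           # encoding matrix, 5 x 3
W_dec = rng.normal(size=(V, D))           # decoding matrix, 5 x 3 (assumed shape)

# Step 1: grab the row of the encoding matrix that belongs to the center word.
center_vec = W_enc[word_to_idx["am"]]     # shape (3,)

# Step 2: turn it into a 3 x 1 column vector.
center_col = center_vec.reshape(D, 1)

# Step 3: (5 x 3) @ (3 x 1) gives one score per vocabulary word; a softmax
# converts the scores into a probability distribution over the vocabulary.
scores = W_dec @ center_col               # 5 x 1
probs = np.exp(scores - scores.max())
probs /= probs.sum()

print(probs.ravel().round(3))
```

After training, the largest probabilities would ideally sit at the indices of *I* and *reading*. With random weights, as here, the distribution is arbitrary.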

---

**<span style="color:red"> Another detailed example and further explanation can be found in this Kaggle [Notebook](https://www.kaggle.com/code/nkitgupta/text-representations) </span>**

### **Example of implementation of Skip-Gram model**

In [61]:
# Train the Skip-Gram Word2Vec model and store it in 'SkipGram_Word2Vec_model'
SkipGram_Word2Vec_model = Word2Vec(common_texts,
                                   vector_size=10,
                                   window=5,
                                   min_count=1,
                                   workers=8,
                                   sg=1)  # It specifies the training algorithm. sg=1 indicates that the Skip-Gram algorithm should be used.

# Save the trained Skip-Gram Word2Vec model
SkipGram_Word2Vec_model.save("SkipGram_Word2Vec_model.w2v")
print("Model Saved")

Model Saved


In [62]:
SkipGram_Word2Vec_model.wv.most_similar('human',
                                        topn = 5)

[('graph', 0.3586882948875427),
 ('system', 0.22743132710456848),
 ('time', 0.1153423935174942),
 ('interface', 0.09816545248031616),
 ('survey', 0.01448808517307043)]

In [65]:
# Show the Word Vector of 'human' based on the trained model 'SkipGram_Word2Vec_model'
SkipGram_Word2Vec_model.wv['human']

array([-0.00410223, -0.08368949, -0.05600012,  0.07104538,  0.0335254 ,
        0.0722567 ,  0.06800248,  0.07530741, -0.03789154, -0.00561806],
      dtype=float32)

### **Word2Vec model**

**Word2vec** is a technique for natural language processing (NLP) published in 2013. It is designed to represent words as continuous vectors in a high-dimensional space, such that semantically similar words are located close to each other in this space. The word2vec algorithm **uses a neural network model** to learn word associations from a large corpus of text. Once trained, such a **model can detect synonymous words or suggest additional words for a partial sentence**. Word2vec is not a single algorithm; rather, it is **a family of model architectures and optimizations** that can be used to learn word embeddings from large datasets.

Word2Vec is a representation of words in which words with similar meanings have similar representations. Word2vec operationalizes this by projecting the meaning of the words into a vector space where words with similar meanings tend to cluster together, and words with very different meanings are far from one another.

**DETAILED EXPLANATION**:

**Two Models**:
- Word2Vec consists of two models:
    - *Continuous Bag of Words (CBOW)*: This model predicts the current word based on its context (surrounding words).
    - *Skip-gram*: This model predicts the surrounding words given the current word.

**Neural Network Architecture**:
- Both CBOW and Skip-gram models are shallow neural networks with one hidden layer. The input layer and output layer have the same number of neurons, which is equal to the size of the vocabulary.

**Word Context**:
- The context of a word is determined by a fixed-size window of adjacent words. CBOW tries to predict the target word based on this context, while Skip-gram tries to predict the context words from the target word.

**Word Vectors**:
- The weights of the hidden layer of the neural network serve as the word vectors (also called word embeddings). These vectors are what Word2Vec is primarily interested in.

**Training Process**:
- During training, the model is presented with a large corpus of text. It adjusts the weights of the neural network using techniques like backpropagation and gradient descent to minimize the prediction error.

**Semantic Similarity**:
- After training, words that have similar meanings tend to have similar vector representations. For example, the vectors for "king" and "queen" are likely to be close to each other in the vector space.

**Arithmetic Operations**:
- One of the interesting properties of Word2Vec is that vector operations in the embedding space can capture semantic relationships. For example, "king - man + woman" might be close to "queen" in the vector space.
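To make the arithmetic concrete, here is a toy sketch with hand-made 2-D vectors. The vectors and the meaning attached to each axis are invented for illustration and are not taken from any trained model.

```python
import numpy as np

# Hand-made 2-D vectors: axis 0 ~ "royalty", axis 1 ~ "gender" (illustrative only).
toy = {
    "king":  np.array([0.9, 0.9]),
    "queen": np.array([0.9, -0.9]),
    "man":   np.array([0.1, 0.9]),
    "woman": np.array([0.1, -0.9]),
    "apple": np.array([-0.8, 0.1]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Remove the "male" direction from 'king' and add the "female" direction.
query = toy["king"] - toy["man"] + toy["woman"]

# Rank the remaining words by cosine similarity to the query vector,
# leaving out the three words that were used to build it.
candidates = {w: v for w, v in toy.items() if w not in {"king", "man", "woman"}}
best = max(candidates, key=lambda w: cosine(query, candidates[w]))
print(best)
```

In this constructed example the query vector lands exactly on *queen*. With real embeddings the match is only approximate, which is why the result is described as "close to" the expected word.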

**Pre-trained models**
Word2Vec models can be pre-trained on large corpora and then used in downstream NLP tasks like sentiment analysis, machine translation, and more. Pre-trained models are available for various languages and domains.
Training your own word embeddings is a pretty expensive process (in terms of both time and computing). Thankfully, for many scenarios, it’s not necessary to train your own embeddings, since someone has done the hard work of training word embeddings on a large corpus, such as Wikipedia, news articles, or even the entire web, and has put words and their corresponding vectors on the web. These embeddings can be downloaded and used to get the vectors for the words you want.
Some of the most popular pre-trained embeddings are Word2vec by Google, GloVe by Stanford, and fastText embeddings by Facebook, to name a few.

**Advantages**:
- Word2Vec provides dense, continuous vector representations of words, which capture semantic relationships.
- It's computationally efficient and can be trained on large datasets.

---
Learn more about gensim model: [Here](https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html#sphx-glr-auto-examples-tutorials-run-word2vec-py)

All available models you can download with gensim: [Here](https://github.com/RaRe-Technologies/gensim-data)

### **<span style="color:blue"> Pre-trained Word2Vec model </span>**
I will use the **pre-trained** weights of Word2Vec trained on part of the Google News corpus (about 100 billion words), distributed as the **`word2vec-google-news-300` model**, which is almost 2 GB in size. This model consists of 300-dimensional vectors for 3 million words and phrases.
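A back-of-the-envelope estimate of the memory needed for the vectors alone, assuming each value is stored as a 4-byte float32 (an assumption about the in-memory layout). The vocabulary index and other bookkeeping add overhead on top, and a compressed download can be smaller than this figure.

```python
n_words = 3_000_000      # words and phrases in the model
dim = 300                # dimensionality of each word vector
bytes_per_value = 4      # float32 (assumed)

raw_bytes = n_words * dim * bytes_per_value
print(f"{raw_bytes / 1e9:.1f} GB")   # → 3.6 GB for the raw vector matrix alone
```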



**Downloading gensim word2vec-google-news-300 model locally** (run only once!)
The model will be downloaded only if it has not been downloaded before.

In [12]:
import gensim.downloader as api
# word2vec_model = api.load('word2vec-google-news-300')

**Using the pre-downloaded model in the notebook**:

In [13]:
word2vec_model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

I can see the **Word vector** for each word in the loaded Word2vec model:

In [16]:
# Example:
# Shape of the word vector "man" in the loaded Word2Vec model

word2vec_model['man'].shape       # <--- we see that the vector is 300-dimensional

(300,)

In [15]:
# Word vector "man":

word2vec_model['man']

array([ 0.32617188,  0.13085938,  0.03466797, -0.08300781,  0.08984375,
       -0.04125977, -0.19824219,  0.00689697,  0.14355469,  0.0019455 ,
        0.02880859, -0.25      , -0.08398438, -0.15136719, -0.10205078,
        0.04077148, -0.09765625,  0.05932617,  0.02978516, -0.10058594,
       -0.13085938,  0.001297  ,  0.02612305, -0.27148438,  0.06396484,
       -0.19140625, -0.078125  ,  0.25976562,  0.375     , -0.04541016,
        0.16210938,  0.13671875, -0.06396484, -0.02062988, -0.09667969,
        0.25390625,  0.24804688, -0.12695312,  0.07177734,  0.3203125 ,
        0.03149414, -0.03857422,  0.21191406, -0.00811768,  0.22265625,
       -0.13476562, -0.07617188,  0.01049805, -0.05175781,  0.03808594,
       -0.13378906,  0.125     ,  0.0559082 , -0.18261719,  0.08154297,
       -0.08447266, -0.07763672, -0.04345703,  0.08105469, -0.01092529,
        0.17480469,  0.30664062, -0.04321289, -0.01416016,  0.09082031,
       -0.00927734, -0.03442383, -0.11523438,  0.12451172, -0.02

In [17]:
# Example:

word2vec_model.most_similar(['man'])

[('woman', 0.7664012908935547),
 ('boy', 0.6824871301651001),
 ('teenager', 0.6586930155754089),
 ('teenage_girl', 0.6147903203964233),
 ('girl', 0.5921714305877686),
 ('suspected_purse_snatcher', 0.571636438369751),
 ('robber', 0.5585119128227234),
 ('Robbery_suspect', 0.5584409832954407),
 ('teen_ager', 0.5549196600914001),
 ('men', 0.5489763021469116)]

**doesnt_match()**:
- Method used to find the word in a list of words that doesn't fit in with the others, based on their vector representations in the Word2Vec model. The method returns the word that is considered an outlier or doesn't fit in with the others in terms of their vector representations. This method can be used to identify the word in a list that is semantically dissimilar to the others. It's useful in tasks like **anomaly detection** or **quality control** in NLP applications.
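The idea behind this kind of outlier search can be sketched as follows: normalize every vector, average them, and return the word that is least similar to that average. This is a simplified illustration with made-up 3-D vectors and a hypothetical helper `odd_one_out`, not gensim's actual code.

```python
import numpy as np

def odd_one_out(vectors):
    """Return the key whose unit vector is least similar to the mean unit vector."""
    words = list(vectors)
    unit = np.array([vectors[w] / np.linalg.norm(vectors[w]) for w in words])
    mean = unit.mean(axis=0)
    mean /= np.linalg.norm(mean)
    sims = unit @ mean                 # cosine similarity of each word to the mean
    return words[int(np.argmin(sims))]

# Made-up vectors: the food words point roughly the same way, 'car' does not.
toy = {
    "apple":  np.array([0.9, 0.1, 0.0]),
    "banana": np.array([0.8, 0.2, 0.1]),
    "orange": np.array([0.85, 0.15, 0.05]),
    "car":    np.array([0.0, 0.1, 0.9]),
}
print(odd_one_out(toy))
```

Because the average is dominated by the majority group, the approach works best when only one word in the list is out of place.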

In [18]:
# Example:

# in this example, other than 'car' the rest is some form of food
print(word2vec_model.doesnt_match(['apple', 'banana', 'orange', 'car', 'cheese', 'juice']))

car


**similarity('word1', 'word2')**:
- Method used to calculate the **cosine similarity** between the vector representations of two words, 'word1' and 'word2', in the Word2Vec model. The method returns a similarity score between -1 and 1. A higher score indicates that the words are more similar in meaning based on their vector representations. A score of 1 means the two vectors point in the same direction, a score near 0 means they are unrelated, and a score of -1 means they point in opposite directions. This method is **used to quantify the semantic similarity between two words based on their context in the training data**.
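Cosine similarity is the dot product of two vectors divided by the product of their lengths. A minimal NumPy sketch with toy vectors (the helper name `cosine_similarity` is chosen for illustration):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity([1, 2], [2, 4]))     # same direction, close to 1
print(cosine_similarity([1, 0], [0, 1]))     # orthogonal, 0
print(cosine_similarity([1, 2], [-1, -2]))   # opposite direction, close to -1
```

Since the score depends only on the angle between the vectors, scaling a vector does not change its similarity to other vectors.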

In [20]:
word2vec_model.similarity('man','woman')

0.76640123
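
To make the score concrete, here is a minimal sketch of cosine similarity computed by hand. The vectors are hypothetical toy values chosen to show the three reference cases (same direction, perpendicular, opposite); they are not taken from the model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors (not taken from the model)
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])    # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])        # perpendicular vectors
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])        # opposite directions

print(same_direction, orthogonal, opposite)
```

Note that the vector length does not matter: `[1, 2]` and `[2, 4]` are treated as identical because only the direction is compared.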

**Arithmetic with word vectors to find the words most similar to the resulting vector**:

In [25]:
# Example 1:

# Get the word vectors for 'king', 'man', 'woman' and 'queen'
vec_king = word2vec_model['king']
vec_man = word2vec_model['man']
vec_woman = word2vec_model['woman']
vec_queen = word2vec_model['queen']

# Vector resultant of some arithmetic between the previous word vectors
vec = vec_king - vec_man + vec_woman

# Get the most similar word vectors to vector vec based on the Word2Vec model
word2vec_model.most_similar([vec])

[('king', 0.8449392318725586),
 ('queen', 0.7300517559051514),
 ('monarch', 0.645466148853302),
 ('princess', 0.6156251430511475),
 ('crown_prince', 0.5818676352500916),
 ('prince', 0.5777117609977722),
 ('kings', 0.5613663792610168),
 ('sultan', 0.5376775860786438),
 ('Queen_Consort', 0.5344247817993164),
 ('queens', 0.5289887189865112)]

In [27]:
# Another way to get the most similar word vectors to vector vec
word2vec_model.most_similar(positive = ['king', 'woman'], negative = ['man'])    # <--- positive means: + 'king' + 'woman', and negative means - 'man'

[('queen', 0.7118193507194519),
 ('monarch', 0.6189674139022827),
 ('princess', 0.5902431011199951),
 ('crown_prince', 0.5499460697174072),
 ('prince', 0.5377321839332581),
 ('kings', 0.5236844420433044),
 ('Queen_Consort', 0.5235945582389832),
 ('queens', 0.5181134343147278),
 ('sultan', 0.5098593831062317),
 ('monarchy', 0.5087411999702454)]
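
The two cells above return slightly different scores, and 'king' appears only in the first list. In Gensim, `most_similar(positive=..., negative=...)` scales each input vector to unit length before combining them, and it leaves the input words out of the results. The sketch below mimics that behaviour on hypothetical 2-dimensional toy vectors; `analogy` is an illustrative helper, not part of Gensim.

```python
import math

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(u, v):
    return sum(a * b for a, b in zip(unit(u), unit(v)))

def analogy(vectors, positive, negative, topn=3):
    """Rank words by cosine similarity to sum(positive) - sum(negative), skipping the input words."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    excluded = set(positive) | set(negative)
    scored = [(w, cosine(query, v)) for w, v in vectors.items() if w not in excluded]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical 2-d embeddings: axis 0 ~ "royalty", axis 1 ~ "gender"
toy_vectors = {
    "king":  [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man":   [0.1, 0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.7, 0.05],
}

ranked = analogy(toy_vectors, positive=["king", "woman"], negative=["man"])
print(ranked)
```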

In [28]:
# Example 2 (less explanatory):
vec = word2vec_model['INR'] - word2vec_model['Cuba'] + word2vec_model['England']

word2vec_model.most_similar([vec])

[('INR', 0.6407418251037598),
 ('Rs', 0.4681122899055481),
 ('Rs1', 0.4643361568450928),
 ('Rs.##', 0.4495318531990051),
 ('Rs.1', 0.44255903363227844),
 ('Rs3', 0.43901121616363525),
 ('Rs###_crore', 0.4389852285385132),
 ('Rs.#.##', 0.43212899565696716),
 ('Rs.3', 0.43179401755332947),
 ('Rs####_crore', 0.431002676486969)]

### **Ways to solve the OOV problem:**

1. A simple approach that often works is *to exclude those words from the feature extraction process* so we don’t have to worry about how to get their representations.

2. Another way to deal with the OOV problem for word embeddings is *to create vectors that are initialized randomly*.

3. There are also other approaches that handle the OOV problem by modifying the training process by *bringing in characters and other subword-level linguistic components*. The key idea is that one can potentially handle the OOV problem by using subword information, such as morphological properties (e.g., prefixes, suffixes, word endings, etc.), or by using character representations. **fastText**, from Facebook AI research, is one of the popular algorithms that follows this approach.


In [19]:
# Sometimes the word you pass in doesn't have a vector
# for more info, review the section: Out of Vocabulary (OOV)

try:
    vec_unknown = word2vec_model['asdasd']
except KeyError:
    print("The word 'asdasd' does not appear in this model")

The word 'asdasd' does not appear in this model
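
The first two strategies listed above (dropping OOV words, or giving them a random vector) can be sketched with a plain dictionary standing in for the embedding table. Everything here is illustrative: `lookup` is a hypothetical helper, the toy vectors are made up, and the range used for random initialization is an arbitrary assumption.

```python
import random

def lookup(vectors, word, dim, strategy="skip", rng=None):
    """Return the vector for `word`, or handle an OOV word with the chosen strategy.

    strategy="skip"   -> return None so the caller can drop the word
    strategy="random" -> return a small random vector and remember it for next time
    """
    if word in vectors:
        return vectors[word]
    if strategy == "skip":
        return None
    if strategy == "random":
        rng = rng or random.Random(0)
        # Assumed range; any small symmetric interval would serve for this sketch
        vectors[word] = [rng.uniform(-0.25, 0.25) for _ in range(dim)]
        return vectors[word]
    raise ValueError(f"unknown strategy: {strategy}")

# Hypothetical toy embedding table
toy_vectors = {"king": [0.9, 0.8], "queen": [0.9, -0.8]}

tokens = ["king", "asdasd", "queen"]
kept = [t for t in tokens if lookup(toy_vectors, t, dim=2, strategy="skip") is not None]
random_vec = lookup(toy_vectors, "asdasd", dim=2, strategy="random")
print(kept, random_vec)
```

Storing the random vector means the same unknown word always maps to the same vector afterwards, which keeps downstream features consistent.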


### **<span style="color:blue"> Training a Word2vec model </span>**

Training a Word2Vec model involves the following steps:

**Data Preparation**:
- Gather a large corpus of text data. This could be a collection of articles, books, tweets, or any other type of text.

**Preprocessing**:
- Clean and preprocess the text data. This may include tasks like removing punctuation, converting text to lowercase, handling special characters, and tokenizing the text into words or subword units.

**Building Vocabulary**:
- Create a vocabulary from the preprocessed text data. The vocabulary consists of a unique set of words or subword units that occur in the corpus.

**Word2Vec Architecture**:
- Choose between two Word2Vec architectures: Continuous Bag of Words (CBOW) or Skip-gram. CBOW predicts a target word given its context, while Skip-gram predicts context words given a target word.

**Model Initialization**:
- Set hyperparameters like the embedding dimension (size of word vectors), context window size, and others.

**Training**:
- Use the preprocessed text data to train the Word2Vec model. During training, the model adjusts its weights to predict context words based on target words (for Skip-gram) or vice versa (for CBOW).

**Word Vectors**:
- After training, the weights between the input layer and the hidden layer serve as word vectors (word embeddings). These vectors represent words in a dense vector space (for example, 100 or 300 dimensions).

**Evaluation**:
- Optionally, evaluate the quality of the word vectors. This can be done by measuring the similarity between words, performing analogy tasks, or using them as features in downstream NLP tasks.

**Application**:
- Use the trained Word2Vec model for various NLP tasks like sentiment analysis, machine translation, named entity recognition, and more. The pre-trained word vectors can also be used in tasks where transfer learning is beneficial.

Keep in mind that the specific implementation details may vary depending on the library or framework you're using. Gensim (Python) ships a ready-to-use Word2Vec implementation, while frameworks such as TensorFlow and PyTorch can be used to implement the model yourself.
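
To illustrate the difference between the two architectures, the sketch below builds the training examples each one would see for a single sentence. It is a simplified view under stated assumptions: a fixed window and no subsampling, whereas real implementations add refinements such as randomly shrinking the window and subsampling frequent words. The function names are hypothetical.

```python
def skipgram_pairs(tokens, window):
    """(target, context) pairs: each word predicts every word within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((target, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

def cbow_examples(tokens, window):
    """(context_words, target) examples: the surrounding words jointly predict the center word."""
    examples = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, target))
    return examples

sentence = ["the", "wildlings", "are", "dead"]
sg = skipgram_pairs(sentence, window=1)
cb = cbow_examples(sentence, window=1)
print(sg)
print(cb)
```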

In [2]:
import os

**`from gensim.utils import simple_preprocess`**

The **simple_preprocess** function from the *gensim.utils* module is a utility **used for text processing in the context of natural language processing (NLP) tasks**.

Here's why it's useful:

*Text Tokenization*:
- **simple_preprocess** is primarily used for tokenizing text. It takes a text input and converts it into a list of tokens (words or subword units) based on certain criteria.

*Preprocessing*:
- It performs basic preprocessing tasks like converting text to lowercase, removing punctuation, and handling special characters. This is important for standardizing text data before further analysis.

*Control over Tokenization*:
- You can customize the behavior of simple_preprocess with its parameters: `min_len` and `max_len` set the minimum and maximum token length (tokens outside this range are dropped), and `deacc` controls whether accent marks are removed. Lowercasing is always applied.

*Compatibility with Gensim Models*:
- Gensim, a popular library for natural language processing, often requires text data to be preprocessed and tokenized before using it with models like Word2Vec, Doc2Vec, etc. simple_preprocess provides a convenient way to prepare text data for use with Gensim models.

*Memory Efficiency*:
- **simple_preprocess** works on one document (string) at a time and returns a plain list of tokens. This makes it easy to apply while iterating over a large corpus file by file or sentence by sentence, instead of loading and processing everything at once.

*Suitability for Gensim Pipelines*:
- It fits well into data preprocessing pipelines within the Gensim framework. You can use it as a first step in preparing text data for subsequent NLP tasks.

*Simplicity and Ease of Use*:
- As the name implies, simple_preprocess is straightforward to use. It doesn't require complex configurations or parameter tuning, making it accessible for users at various levels of expertise.
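
To show what this kind of preprocessing does to a sentence, here is a rough pure-Python stand-in. It is an approximation written for illustration only (the name `rough_preprocess` and the regular expression are assumptions, not Gensim's code); it lowercases the text, keeps runs of non-digit word characters, and drops tokens shorter than `min_len` or longer than `max_len`.

```python
import re

# Runs of "word" characters that are not digits
TOKEN_PATTERN = re.compile(r"(?:(?!\d)\w)+")

def rough_preprocess(text, min_len=2, max_len=15):
    """Lowercase, split into alphabetic tokens, and drop tokens that are too short or too long."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len and not t.startswith("_")]

tokens = rough_preprocess("Ser Waymar Royce asked with just the hint of a smile.")
print(tokens)
```

Note how the one-letter word "a" and the final period disappear, which is the same effect visible in the tokenized sentences printed further below.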

**Load, tokenize and apply basic preprocessing to the Game of Thrones books**

In [30]:
from nltk import sent_tokenize                # <--- used for segmenting text into individual sentences
from gensim.utils import simple_preprocess

# Processes a collection of text files stored in a directory ('data'):

story = []                                             # ---> Initializes an empty list named 'story'. This list will be used to store tokenized sentences.
for filename in os.listdir('data'):                    # ---> Iterates over the files in the 'data' directory. It assumes that the text files to be processed are located in a directory named 'data'.

    with open(os.path.join('data', filename)) as f:    # ---> Opens the current file. "os.path.join()" builds the file path from the directory ('data') and the filename. The 'with' block closes the file automatically.
        corpus = f.read()                              # ---> Reads the contents of the file and stores it in the variable 'corpus'.

    raw_sent = sent_tokenize(corpus)                   # ---> Uses sent_tokenize to split the 'corpus' into individual sentences. The resulting list of sentences is stored in the variable 'raw_sent'.

    for sent in raw_sent:                              # ---> Iterates over each sentence in 'raw_sent'
        story.append(simple_preprocess(sent))          # ---> Applies the simple_preprocess function to each sentence (tokenization and basic preprocessing, explained above).


In [33]:
# Size of 'story'
print(len(story))

# Print the first 100 tokenized sentences of 'story'
print(story[:100])

145020
[['game', 'of', 'thrones', 'book', 'one', 'of', 'song', 'of', 'ice', 'and', 'fire', 'by', 'george', 'martin', 'prologue', 'we', 'should', 'start', 'back', 'gared', 'urged', 'as', 'the', 'woods', 'began', 'to', 'grow', 'dark', 'around', 'them'], ['the', 'wildlings', 'are', 'dead'], ['do', 'the', 'dead', 'frighten', 'you'], ['ser', 'waymar', 'royce', 'asked', 'with', 'just', 'the', 'hint', 'of', 'smile'], ['gared', 'did', 'not', 'rise', 'to', 'the', 'bait'], ['he', 'was', 'an', 'old', 'man', 'past', 'fifty', 'and', 'he', 'had', 'seen', 'the', 'lordlings', 'come', 'and', 'go'], ['dead', 'is', 'dead', 'he', 'said'], ['we', 'have', 'no', 'business', 'with', 'the', 'dead'], ['are', 'they', 'dead'], ['royce', 'asked', 'softly'], ['what', 'proof', 'have', 'we'], ['will', 'saw', 'them', 'gared', 'said'], ['if', 'he', 'says', 'they', 'are', 'dead', 'that', 'proof', 'enough', 'for', 'me'], ['will', 'had', 'known', 'they', 'would', 'drag', 'him', 'into', 'the', 'quarrel', 'sooner', 'or', 'l

**TRAIN the Word2Vec model**

In [66]:
# Instantiate the Word2vec model
Word2Vec_model2 = gensim.models.Word2Vec(window = 10,       # maximum distance between the current and predicted word within a sentence (size of the context window)
                               min_count = 2,     # minimum number of times a word must appear in the corpus to be considered during training. Words with very low frequencies are often removed to reduce noise in the model.
                               workers = 4)       # number of CPU cores to be used for training the model. More workers can lead to faster training.

In [67]:
# Build the model's vocabulary from the tokenized sentences in 'story'
Word2Vec_model2.build_vocab(story)

In [68]:
# TRAIN the Word2Vec model
# train() returns a tuple: (number of words effectively trained on, total number of raw words processed)
Word2Vec_model2.train(story,                                   # ---> Always expected to be tokenized sentences
            total_examples = Word2Vec_model2.corpus_count,     # ---> Total number of training examples (sentences). 'corpus_count' was set by build_vocab() above
            epochs = Word2Vec_model2.epochs)                   # ---> Number of training passes over the dataset. 'epochs' holds the value configured on the model (Gensim's default is 5)

(6570207, 8628190)

**Once trained, `Word2Vec_model2` can be used to obtain word embeddings for words in the corpus or for downstream natural language processing tasks.**

In [70]:
Word2Vec_model2.wv.most_similar('daenerys')

[('stormborn', 0.8233193159103394),
 ('targaryen', 0.7718337774276733),
 ('unburnt', 0.7110245823860168),
 ('viserys', 0.7089210748672485),
 ('queen', 0.6991424560546875),
 ('princess', 0.698495626449585),
 ('elia', 0.691801905632019),
 ('rhaegar', 0.6882088780403137),
 ('margaery', 0.6851475834846497),
 ('myrcella', 0.6821972131729126)]

In [71]:
Word2Vec_model2.wv.doesnt_match(['jon', 'rikon', 'robb', 'arya', 'sansa', 'bran'])

'jon'

In [72]:
Word2Vec_model2.wv.doesnt_match(['cersei', 'jaime', 'bronn', 'tyrion'])

'bronn'

In [73]:
print(Word2Vec_model2.wv['king'])

print(Word2Vec_model2.wv['king'].shape)

[ 0.8385213  -0.32660002  2.4832273   1.3739705  -0.68670666  1.5288645
 -0.9064945  -0.14985189 -0.6579333  -1.2585018  -1.9412525   0.46761927
  1.4651912   1.920611   -2.7219985  -0.7131552  -0.6748674   3.2017539
  1.318852    2.3343842   0.6258875  -0.5407192   0.96264243 -3.1946523
 -1.2901669   1.8450948  -0.10530413 -2.2906604   1.6824007   0.46300274
 -3.3540008   0.70043564  1.9128523  -0.29968357  3.0447953  -3.237226
 -1.7406158  -1.6336908   0.57345444 -4.112034   -0.87156993  2.1654003
  1.284714   -0.8237764  -0.06316835 -1.8724283   0.2829679  -2.7961738
  3.126344   -2.533017   -2.325688   -0.5829145  -2.4539921  -4.4085603
  2.1058238  -2.7390673  -0.20640023  0.5620899   0.46157134  0.5732121
  1.2047747   1.9023302  -0.8390119   0.50373393  0.3043873   2.4742224
 -1.2991458  -0.4343109   0.59889376 -2.1053457   0.8134864   2.1276329
  4.6359115  -0.04990909  2.4811332   0.3449794  -0.04451819 -1.9887319
 -0.34851268 -0.89618486  1.1765779   3.2417228  -1.589542   -0

In [74]:
Word2Vec_model2.wv.similarity('arya', 'sansa')

0.8375704

In [75]:
Word2Vec_model2.wv.similarity('cersei', 'sansa')

0.77544475

In [76]:
Word2Vec_model2.wv.similarity('tywin', 'sansa')

0.24595071

In [77]:
# Get the normalized vectors for each word of the Vocabulary (Useful for PCA)
print(Word2Vec_model2.wv.get_normed_vectors())

# Get the total of normalized vectors and the size of each (the same value for all)
print(Word2Vec_model2.wv.get_normed_vectors().shape)

[[-0.14291956 -0.16906518  0.09170673 ... -0.04753628  0.08504093
   0.04904262]
 [-0.16266577 -0.15573785  0.11031783 ... -0.05131987 -0.02999355
   0.18084556]
 [ 0.06822678 -0.06784158 -0.09801825 ...  0.00143532  0.03326532
  -0.05417931]
 ...
 [ 0.03302287  0.04284379 -0.06652787 ...  0.00112755  0.03193364
  -0.08434646]
 [-0.02375463  0.10225476  0.11504482 ...  0.00067619  0.13837829
  -0.06792125]
 [-0.03034963  0.03301172  0.10083866 ...  0.0054536   0.02690871
  -0.06748521]]
(17453, 100)


In [78]:
# Get the Vocabulary, sorted by their index based on the Word2Vec model
y = Word2Vec_model2.wv.index_to_key    # --> attribute of the model's KeyedVectors ('wv') holding the list of words (keys) in the vocabulary, ordered by their index. Handy when you need to map a position in the vocabulary (or a row of the vector matrix) back to its word.

print(y)

print(len(y))

17453


### PCA for Vectorization of Words

In the context of Word2Vec vectorization, PCA can be applied for several reasons:

*Reducing Computation Costs*:
- Word2Vec models often generate high-dimensional word embeddings (e.g., 100-dimensional or 300-dimensional vectors). Operating on such high-dimensional data can be computationally expensive. PCA can help reduce the dimensionality of the vectors, making subsequent operations more efficient.

*Visualizing Word Embeddings*:
- PCA can be used to project high-dimensional word embeddings onto a lower-dimensional space (e.g., 2D or 3D). This allows for visualization of the relationships between words. While it's not always feasible to visualize high-dimensional spaces, lower-dimensional representations obtained via PCA can provide insights.

*Removing Redundant Information*:
- Some dimensions in high-dimensional space may contain redundant or noisy information. PCA builds new axes (principal components) ordered by how much variance they explain, so keeping only the first few discards the directions that contribute least.

*Handling Noise in the Data*:
- Word embeddings may contain noise or irrelevant information. By reducing the dimensionality, PCA can filter out some of this noise, potentially improving the quality of the embeddings.

*Facilitating Downstream Tasks*:
- In some NLP tasks, lower-dimensional embeddings may be more suitable. For example, tasks like clustering, classification, or visualization may benefit from lower-dimensional representations.

*Interpretability*:
- Lower-dimensional embeddings are easier to interpret and understand compared to high-dimensional vectors. They can provide insights into the underlying relationships between words.

*Storage and Memory Efficiency*:
- High-dimensional vectors require more storage space and memory compared to lower-dimensional representations. PCA can help reduce these requirements.

*Maintaining Semantic Information*:
- PCA keeps the directions of greatest variance while reducing dimensionality. This means that after PCA, the resulting vectors can still capture much of the semantic information present in the original embeddings, although an aggressive reduction (e.g., down to 2 or 3 dimensions) inevitably loses some detail.

<u> **Note** </u>:
It's important to note that while PCA can be beneficial in some cases, **it's not always necessary or suitable for every application of Word2Vec. The decision to use PCA should be based on the specific requirements and constraints of the task at hand**.

In [79]:
# Apply PCA for Visualization

from sklearn.decomposition import PCA

pca = PCA(n_components = 3)

X = pca.fit_transform(Word2Vec_model2.wv.get_normed_vectors())

In [80]:
# Show the dimensions of the transformed vectors
X.shape

(17453, 3)
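
To see what the projection involves, and how much information survives it, here is a minimal NumPy sketch of PCA via singular value decomposition. It runs on random toy data standing in for the embedding matrix (an assumption made so the example is self-contained), and `pca_project` is a hypothetical helper rather than the scikit-learn implementation.

```python
import numpy as np

def pca_project(vectors, n_components):
    """Project row vectors onto their top principal components using an SVD."""
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:n_components].T
    explained = singular_values**2 / np.sum(singular_values**2)
    return projected, explained[:n_components]

rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(50, 10))     # stand-in for (vocab_size, vector_size)

projected, explained_ratio = pca_project(toy_embeddings, n_components=3)
print(projected.shape)
print(explained_ratio, explained_ratio.sum())
```

The sum of the explained ratios indicates what fraction of the total variance the retained components keep. A fitted scikit-learn `PCA` object exposes the same quantity as `explained_variance_ratio_`, which is worth checking before reading too much into a 3D plot.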

**Visualization of the PCA components for each word of Vocabulary**

In [81]:
import plotly.express as px

fig = px.scatter_3d(X[:100],
                    x = 0,
                    y = 1,
                    z = 2,
                    color = y[:100],    # --> Specifies the color of each data point (here, one color per word)
                    width = 2000,       # --> Figure size in pixels, set on the figure itself
                    height = 1800)

fig.show()

### **Doc2Vec model**

**Doc2Vec is an extension of the Word2Vec model that is used for generating vector representations of documents**. While Word2Vec generates vector representations for individual words, **Doc2Vec generates vector representations for entire documents, including sentences, paragraphs, or even entire articles**.

Here's how Doc2Vec works:

- **Paragraph Vector**:
    In Doc2Vec, **each document is represented as a unique vector**. This vector is sometimes referred to as a "paragraph vector" or "document vector". It captures the semantics and context of the entire document.

- **Similarity and Operations**:
    **Doc2Vec allows you to compute similarity between documents and perform operations similar to Word2Vec**. For example, you can find documents similar to a given document, or find the document that best matches a set of words.

- **Two Training Models**:
    Like Word2Vec, **Doc2Vec has two training models: Distributed Memory (DM) and Distributed Bag of Words (DBOW)**.

    - **Distributed Memory (DM)**: This model predicts a word from its surrounding context words together with the document vector. The document vector acts as a memory of what is missing from the local context.
    - **Distributed Bag of Words (DBOW)**: This model predicts words sampled from the document using only the document vector. It ignores the surrounding context words and the word order.

- **Combining Word Vectors**:
    **In** the **DM model, word vectors and document vectors are combined to make predictions**. This allows the model to capture both the semantics of individual words and the context of the entire document.

- **Training Process**:
    The training process **involves updating the vectors to minimize the prediction errors**. **As a result, each document vector becomes good at predicting the words that occur in that document**.

- **Applications**:
    Doc2Vec is used in a **wide range of applications, including document classification, sentiment analysis, information retrieval, and recommendation systems**.

- **Gensim Implementation**:
    Gensim is a popular Python library for training and using Doc2Vec models. It provides an easy-to-use interface for creating and working with document vectors.
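
The difference between the two training models can be sketched by listing the training examples each would build for one document. This is a simplified illustration with a fixed window; the tag `DOC_0` and the function names are hypothetical.

```python
def dm_examples(doc_tag, tokens, window):
    """Distributed Memory: (document tag + context words) -> target word."""
    examples = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append(((doc_tag, context), target))
    return examples

def dbow_examples(doc_tag, tokens):
    """Distributed Bag of Words: document tag alone -> each word of the document."""
    return [(doc_tag, word) for word in tokens]

doc = ["the", "wildlings", "are", "dead"]
dm = dm_examples("DOC_0", doc, window=1)
dbow = dbow_examples("DOC_0", doc)
print(dm[1])
print(dbow)
```

In Gensim's `Doc2Vec`, the `dm` parameter selects between the two (`dm=1` for Distributed Memory, `dm=0` for DBOW).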

### <a id='4-2-2'>4.2.2 GloVe Word Embeddings</a>

**GloVe (Global Vectors for Word Representation)** is an unsupervised learning algorithm for generating word embeddings. It was introduced by researchers at Stanford University in 2014. The **main objective** of GloVe is to **create word vectors (embeddings) that represent the semantic relationships between words**. These vectors are numerical representations of words in a continuous vector space.

**Co-occurrence Matrix**:
- **GloVe starts by constructing a co-occurrence matrix from a large corpus of text**. This matrix captures how often words co-occur with each other within a specified context window. Rows and columns correspond to words, and each cell contains the number of times the word in the row occurs in the context of the word in the column.

**Optimization Objective**:
- GloVe aims to learn word embeddings in such a way that their dot product equals the logarithm of the words' co-occurrence probabilities. The objective function combines the information from the co-occurrence matrix to achieve this goal.

**Training Process**:
- The training process of GloVe involves **minimizing a loss function** that measures the weighted squared difference between the dot product of two word vectors (plus bias terms) and the logarithm of their co-occurrence count in the matrix.
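
For reference, the objective described above is usually written as a weighted least-squares loss. The notation follows the GloVe paper and is not defined elsewhere in this notebook: $X_{ij}$ is the co-occurrence count of words $i$ and $j$, $w_i$ and $\tilde{w}_j$ are the word and context vectors, and $b_i$, $\tilde{b}_j$ are bias terms.

$$
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2
$$

$$
f(x) = \begin{cases} (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases}
$$

The weighting function $f$ caps the influence of very frequent pairs and gives zero weight to pairs that never co-occur. The GloVe paper reports using $x_{\max} = 100$ and $\alpha = 3/4$ in its experiments.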

**Word Similarities**:
- **Once trained, GloVe embeddings place words with similar meanings close to each other in the vector space**. This enables operations like word analogy tasks (e.g., "king" - "man" + "woman" = "queen") and finding semantically related words.

**Dimensionality**:
- **GloVe allows users to specify the desired dimensionality of the resulting word vectors**. For example, a common choice is 50, 100, or 300 dimensions.

**Pre-trained Models**:
- Pre-trained GloVe vectors are freely available (for example, the `glove.6B` vectors used below, trained on Wikipedia and Gigaword) and can be used for various NLP tasks. These models have been trained on large text corpora and capture general linguistic patterns.

**Applications**:
- GloVe embeddings are used in a variety of NLP tasks, including **sentiment analysis**, **machine translation**, **question answering**, and more.

**Advantages**:
- GloVe is known for its **efficiency and ability to capture both syntactic and semantic relationships between words**.

**Limitations**:
- GloVe assigns a single vector to each word, so it does **not capture polysemy** (multiple meanings of a word) the way contextual models do. Additionally, vectors for rare words tend to be less reliable, and words not present in the training corpus have no vector at all.

For detailed knowledge about GloVe Word Embedding, refer to [this article](https://towardsdatascience.com/light-on-math-ml-intuitive-guide-to-understanding-glove-embeddings-b13b4f19c010)

---

**Advantages of GloVe**:
1. The goal of GloVe is straightforward: to make the word vectors capture linear relationships (linear substructures) in the vector space. As a result, it is reported to perform better than Word2vec on word analogy tasks.
2. GloVe adds more practical meaning to word vectors by considering the relationships between word pairs rather than between individual words.
3. GloVe gives lower weight to highly frequent word pairs, which prevents uninformative stop words like “the” and “an” from dominating the training process.

**Disadvantages of GloVe**:
1. The model is trained on the co-occurrence matrix of words, which takes a lot of memory to store. In particular, if you change the hyperparameters related to the co-occurrence matrix, you have to rebuild the matrix, which is very time-consuming.
2. Like Word2Vec, it cannot deal directly with OOV words.

Advantages and disadvantages can be found on [this site](https://www.quora.com/What-are-the-advantages-and-disadvantages-of-Word2vec-and-GloVe)

---
**Differences between GloVe and Word2Vec Embeddings**

The first important aspect in the comparison between GloVe and Word2Vec is that, **unlike Word2vec, GloVe does not rely just on local statistics (local context information of words), but incorporates global statistics (word co-occurrence) to obtain word vectors**. Thus, GloVe can be used to find relations between words such as synonyms, company-product relations, zip codes and cities, etc. This is why, depending on the NLP task at hand, we sometimes need to use GloVe instead of Word2Vec: Word2Vec relies only on local information, that is, the semantics learned for a given word are only affected by its surrounding words.

**Neither Word2vec nor GloVe solves problems like**:
1. How to learn the representation for out-of-vocabulary words.
2. How to separate some opposite word pairs. For example, “good” and “bad” are usually located very close to each other in the vector space, which may limit the performance of word vectors in NLP tasks like sentiment analysis.

---

**<span style="color:red"> Illustrated example of GloVe Word Embedding </span>**

Let's take the sentence: <span style="font-size: 20px;"> The cat sat on the mat </span>

Word2Vec cannot capture information like:
- is "the" a special context of the words "cat" and "mat"?

or

- is "the" just a stopword?

The GloVe method is built on an important idea: you can derive semantic relationships between words from the *co-occurrence matrix*. Given a corpus having $V$ words, the co-occurrence matrix $X$ will be a $V\times V$ matrix, where the entry in the $i$-th row and $j$-th column, $X_{ij}$, denotes how many times word $i$ has co-occurred with word $j$. An example co-occurrence matrix might look as follows.

![Co_occurrenceMatrix](GloVe_coocurrenceMatrix.png)

This is the co-occurrence matrix for the sentence "the cat sat on the mat" with a window size of 1. You can observe that it is a symmetric matrix.
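
As a minimal sketch, the counts for this sentence can be reproduced with a few lines of Python. The helper `cooccurrence_counts` is hypothetical and uses plain counts; the reference GloVe implementation additionally down-weights pairs by their distance, which makes no difference with a window size of 1.

```python
from collections import defaultdict

def cooccurrence_counts(tokens, window):
    """Count how often each ordered pair of words appears within `window` positions of each other."""
    counts = defaultdict(int)
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(word, tokens[j])] += 1
    return counts

tokens = "the cat sat on the mat".split()
counts = cooccurrence_counts(tokens, window=1)

# Print the counts as a V x V table (rows and columns in alphabetical order)
vocab = sorted(set(tokens))
print("      " + " ".join(f"{w:>4}" for w in vocab))
for row in vocab:
    print(f"{row:>4}  " + " ".join(f"{counts[(row, col)]:>4}" for col in vocab))
```

Because the window extends equally to the left and right of each word, `counts[(a, b)]` always equals `counts[(b, a)]`, which is why the matrix is symmetric.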

**`glove2word2vec`**: function from the *glove2word2vec* script in the Gensim library, used to convert pre-trained GloVe word embeddings to the Word2Vec format.

After running the code, a file named *glove.6B.100d.txt.word2vec* should be created in the current working directory (the output file name below is a relative path). This file will contain the word embeddings in the Word2Vec format, so they can be loaded with Gensim's `KeyedVectors`.

In [83]:
from gensim.scripts.glove2word2vec import glove2word2vec

# Assign a file path to the variable 'Glove_path'. This should point to the location of your GloVe embeddings file (glove.6B.100d.txt in this case).
Glove_path = "D:/git/Laboratory/NLP/Learning_NLP/data/glove.6B.100d.txt"

# Set the name of the output file after conversion
word2vec_output_file = 'glove.6B.100d.txt.word2vec'

# Apply the function 'glove2word2vec' to convert the embeddings and save them in Word2Vec format.
glove2word2vec(Glove_path, word2vec_output_file)        # 'Glove_path' is the path and 'word2vec_output_file' is the desired output file name

(400000, 100)
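
The tuple returned above is the vocabulary size and vector dimension (400,000 words, 100 dimensions). The only difference between the two text formats is a header line with exactly these two numbers, which the Word2Vec format expects and the GloVe file lacks. The sketch below shows the idea on a made-up two-word example; `glove_to_word2vec_lines` is a hypothetical helper, not the Gensim function.

```python
def glove_to_word2vec_lines(glove_lines):
    """Prepend the '<vocab_size> <vector_size>' header that the Word2Vec text format expects."""
    rows = [line.rstrip("\n") for line in glove_lines if line.strip()]
    vector_size = len(rows[0].split(" ")) - 1      # first field is the word itself
    header = f"{len(rows)} {vector_size}"
    return [header] + rows

# Hypothetical two-word, three-dimensional GloVe file contents
toy_glove = [
    "the 0.1 0.2 0.3\n",
    "cat 0.4 0.5 0.6\n",
]

converted = glove_to_word2vec_lines(toy_glove)
print(converted)
```

Recent Gensim releases (4.x) also accept `no_header=True` in `KeyedVectors.load_word2vec_format`, which reads the GloVe text file directly, so the conversion step may be optional depending on the installed version.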

**Load the pre-trained GloVe word vectors. The Glove_model can be used to perform various operations on the vectors, such as finding similar words or performing vector arithmetic:**

In [84]:
# Load the Stanford GloVe model
from gensim.models import KeyedVectors

# Path of the converted file created in the previous step
filename = './glove.6B.100d.txt.word2vec'

# Load the pre-trained GloVe_model as an instance of KeyedVectors
Glove_model = KeyedVectors.load_word2vec_format(filename,            # file path of the GloVe word vectors file in Word2Vec format.
                                                binary = False)      # indicates that the file is not in binary format. In this case, GloVe vectors are usually stored in plain text format

Performing some operations with the pre-trained model **Glove_model**:

In [85]:
print('Most similar words to word "human" : ')
Glove_model.most_similar('human')

Most similar words to word "human" : 


[('animal', 0.7462460398674011),
 ('rights', 0.7322573661804199),
 ('humans', 0.6607711911201477),
 ('animals', 0.6567643284797668),
 ('body', 0.6552972197532654),
 ('nature', 0.6483666896820068),
 ('beings', 0.6467736959457397),
 ('organization', 0.6319881081581116),
 ('scientific', 0.630841076374054),
 ('common', 0.6211239099502563)]

In [86]:
print('Glove Word Embedding of word "human":')
Glove_model['human']

Glove Word Embedding of word "human":


array([ 3.3864e-01,  5.9663e-01,  5.3322e-01,  3.1404e-01,  1.5321e-01,
        3.1749e-01, -4.2940e-01, -2.9150e-01, -2.1047e-03, -3.9309e-01,
       -8.5441e-01, -8.0708e-02,  1.2118e+00,  6.9316e-02,  8.0613e-03,
        8.7888e-01,  3.1908e-02,  5.8655e-01, -5.4892e-01, -7.8468e-03,
        1.7327e-01, -2.6693e-01,  4.2802e-01,  6.6123e-02,  5.1847e-01,
        7.7226e-01,  2.0608e-01, -4.5836e-01,  3.5485e-01,  7.1547e-01,
        6.0855e-01,  2.0254e-01, -4.8756e-01,  5.7974e-01,  8.6728e-02,
       -5.1852e-01, -3.7274e-01,  1.0014e+00, -2.9259e-01,  3.2290e-01,
       -9.7563e-01, -2.2288e-01, -2.3335e-01, -2.6891e-01,  1.4612e-01,
        1.2004e-01, -2.0402e-01, -9.4647e-02, -1.5402e+00, -5.9510e-02,
        1.0887e+00, -2.4998e-01, -2.5808e-01,  1.2798e+00, -1.2849e-01,
       -1.4511e+00, -2.4686e-01, -9.5046e-02,  1.7425e+00,  1.1977e-01,
       -1.9206e-01,  4.4368e-01, -1.6453e-01, -7.6663e-01,  1.1100e+00,
        4.6748e-01, -2.4673e-02,  4.7179e-03,  6.9761e-01, -2.29

### <a id='4-2-3'>4.2.3 FastText Word Embedding</a>

**FastText** is an extension of the Word2Vec model developed by Facebook's AI Research (FAIR) lab. While Word2Vec operates at the word level, FastText **operates at the subword level**.
**FastText is a powerful tool** for working with text data, especially **in scenarios where you need to handle rare or out-of-vocabulary words**. It has found wide applications in NLP research and industry settings.

- *Subword Embeddings*:
    FastText breaks words into smaller units called **"subwords"** or **character "n-grams"**. For example, with n = 3 the word "fast" is represented by the n-grams `<fa`, `fas`, `ast`, and `st>` (where `<` and `>` mark the word boundaries), plus the whole word `<fast>`. This allows **FastText to capture morphological information and handle out-of-vocabulary words**.
<br>

- *Character-level Embeddings*:
    **FastText represents a word as the sum of its character n-gram vectors** (together with a vector for the whole word, when the word is in the vocabulary), which means it can **generate embeddings for words even if they are not present in the training data**.
<br>

- *Handling Rare Words*:
    FastText is particularly **effective at handling rare words, misspellings, and noisy text data**, because such words usually still share many character n-grams with words seen during training.
<br>

- *Pre-trained Models*:
    **Pre-trained FastText models are available for multiple languages and domains**. These models have been trained on large text corpora and can be used in various NLP tasks.
<br>

- *Training Objective*:
    **FastText's embedding objective follows Word2Vec: it predicts the context words that surround a given word**. This is similar to Word2Vec's Skip-gram model, but FastText represents the given word by its subwords instead of the whole word alone. To prepare the training data, each word in turn is taken as the "target word", and the words around it (in the simplest case, the word that follows it) are its "context words". This means you will predict the surrounding word for a given word.
<br>

- *Efficiency*:
    FastText is known for its efficiency in training and generating embeddings, even for very large datasets.
<br>

- *Applications*:
    FastText embeddings can be used in various NLP applications such as **text classification, sentiment analysis, machine translation, and more**.
<br>

- *Similarity and Arithmetic Operations*:
    Like Word2Vec, you can perform operations like **finding similar words or performing vector arithmetic with FastText embeddings**.
<br>

- *Supervised Learning*:
    **FastText can also be used for supervised tasks, where it learns to predict labels from input text**. This makes it a versatile tool for various NLP tasks.
<br>

- *Usage in Gensim*:
    **You can use the Gensim library to train and work with FastText models**. Gensim provides an easy-to-use interface for training and using FastText embeddings.
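
The subword idea from the first bullets can be sketched in a few lines. The helper `char_ngrams` below is hypothetical; it adds the boundary markers `<` and `>` and, as an assumption matching FastText's usual defaults, extracts n-grams of length 3 to 6 plus the whole marked word.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of '<word>' (with boundary markers), plus the whole marked word."""
    marked = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        grams.extend(marked[i:i + n] for i in range(len(marked) - n + 1))
    if marked not in grams:
        grams.append(marked)        # the full word is kept as its own unit
    return grams

trigrams = char_ngrams("where", min_n=3, max_n=3)
print(trigrams)

# A misspelling still shares several n-grams with the correctly spelled word
shared = set(char_ngrams("language")) & set(char_ngrams("langauge"))
print(sorted(shared))
```

Because a word vector is built from the vectors of these n-grams, the misspelled "langauge" ends up with a representation related to "language" even if it never appeared in the training data.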

---
- **Advantages**:
    - *Subword Information*: FastText can handle out-of-vocabulary words and rare words effectively by breaking them down into subword units (n-grams). This allows it to capture morphological information.
    - *Character-level Embeddings*: Because word vectors are built from character n-grams, it can generate embeddings for words even if they are not present in the training data.
    - *Efficient Handling of Rare Words*: FastText is particularly effective at handling rare words, misspellings, and noisy text data.
    - *Language Agnostic*: It relies on character n-grams rather than language-specific rules, so it can be applied to a wide variety of languages.
    - *Pre-trained Models*: Pre-trained FastText models are available for various languages and domains, making it easy to leverage existing embeddings.
    - *Supervised Learning*: FastText can be used for supervised tasks where it learns to predict labels from input text. This makes it a versatile tool for various NLP tasks.
    - *Training Efficiency*: FastText is known for its efficiency in training and generating embeddings, even for very large datasets.

- **Disadvantages**:
    - *Increased Memory Consumption*: Due to the subword information and character-level embeddings, FastText models can be memory-intensive compared to traditional word embeddings.

    - *Slower Prediction Time*: While training is efficient, predicting word vectors can be slower compared to models like Word2Vec.

    - *Difficulty in Interpretation*: The subword information can make the interpretation of embeddings more complex compared to traditional word embeddings.

    - *Lack of Context Awareness*: FastText may not capture contextual information as well as more sophisticated models like contextual embeddings (e.g., BERT).

    - *Dependency on Parameter Tuning*: The performance of FastText can be sensitive to parameter choices, so fine-tuning may be required for optimal results.

    - *May Not Be Ideal for All Tasks*: While FastText is versatile, there are specialized models (e.g., BERT for contextual understanding) that may outperform it in specific NLP tasks.
---
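The out-of-vocabulary behaviour and the memory cost listed above come from the same mechanism: every n-gram is hashed into a fixed table of bucket vectors, and a word vector is assembled from the rows its n-grams point to. The sketch below illustrates this under loud assumptions: the table is filled with random numbers instead of trained values, `NUM_BUCKETS` and `DIM` are tiny hypothetical sizes, and the FNV-1a style hash only mirrors the idea, not necessarily the exact library behaviour.

```python
import random

NUM_BUCKETS = 1000  # hypothetical size; real models use a far larger table
DIM = 4             # hypothetical embedding size


def fnv1a(text):
    """32-bit FNV-1a hash, deterministic across runs (unlike Python's built-in hash)."""
    h = 2166136261
    for byte in text.encode("utf-8"):
        h ^= byte
        h = (h * 16777619) % 2**32
    return h


def char_ngrams(word, min_n=3, max_n=6):
    wrapped = f"<{word}>"
    return [wrapped[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(wrapped) - n + 1)]


# Random numbers stand in for trained n-gram vectors (assumption for illustration)
rng = random.Random(0)
bucket_vectors = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(NUM_BUCKETS)]


def oov_vector(word):
    """Average the bucket vectors of all n-grams of a word (3+ character words assumed)."""
    rows = [bucket_vectors[fnv1a(g) % NUM_BUCKETS] for g in char_ngrams(word)]
    return [sum(column) / len(rows) for column in zip(*rows)]


def shared_ngrams(a, b):
    return set(char_ngrams(a)) & set(char_ngrams(b))


print(oov_vector("covid19"))                           # a vector, even for an unseen word
print(sorted(shared_ngrams("language", "languages")))  # overlap that ties related words together
```

Because the table has a fixed number of rows regardless of vocabulary size, it is what drives memory use, and different n-grams can collide in the same bucket.
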

**<span style="color:red"> Example:</span>**

Let's construct some training examples. By scanning through the text, I will prepare a "context word" and a "target word" according to the Skip-Gram and FastText methodologies:

Consider the sentence:
<blockquote style="font-size: 20px;"> <b> I like natural language processing </b> </blockquote>

![FastText_example](figures/FastText_example.png)

Comparing the training examples of Skip-Gram and FastText, you can see that Skip-Gram treats each word as a single, indivisible unit, while FastText represents each word as a bag of character n-grams. This representation of the input word is the main difference between FastText word embeddings and Skip-Gram (or CBOW) word embeddings.
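
The training pairs for the example sentence can also be generated in code. This is a minimal sketch: the function names are hypothetical, and the window size of 1 is an assumption chosen to keep the output short (the figure may use a different window).

```python
def skipgram_pairs(tokens, window=1):
    """Return (target, context) pairs for every word and its neighbours within `window`."""
    pairs = []
    for i, target in enumerate(tokens):
        first = max(0, i - window)
        last = min(len(tokens), i + window + 1)
        for j in range(first, last):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


def char_ngrams(word, min_n=3, max_n=6):
    wrapped = f"<{word}>"
    return [wrapped[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(wrapped) - n + 1)]


tokens = "I like natural language processing".split()

# Skip-Gram view: the target is one indivisible word
for target, context in skipgram_pairs(tokens, window=1):
    print(target, "->", context)

# FastText view: the same target is described by its character n-grams
print(char_ngrams("natural", min_n=3, max_n=3))
# ['<na', 'nat', 'atu', 'tur', 'ura', 'ral', 'al>']
```

The (target, context) pairs are identical in both methods. What changes is how the target is fed to the model: as one vector lookup in Skip-Gram, or as a combination of n-gram vectors in FastText.
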

### **<span style="color:blue"> Pre-trained FastText model </span>**
I will use the **pre-trained** vectors stored in the **Word vectors trained on Wikipedia 2017, UMBC webbase corpus, and statmt.org** dataset. In total, the training data contains 16B tokens.



In [8]:
import os

# "load_facebook_model" loads models saved in Facebook's native FastText binary format.
# It is imported for reference only; the ".vec" file below is plain text and is read with KeyedVectors.
from gensim.models.fasttext import load_facebook_model

from gensim.models import FastText, KeyedVectors

# Print the working directory
print(os.getcwd())

# Path to the pre-trained fastText vectors
filename = './data/wiki-news-300d-1M.vec'

# Load the pre-trained fastText vectors as an instance of KeyedVectors
fasttext_model = KeyedVectors.load_word2vec_format(filename)

print("Most similar words to word 'human': ")
fasttext_model.most_similar('human')

D:\git\Laboratory\NLP\Learning_NLP
Most similar words to word 'human': 


[('non-human', 0.7691742181777954),
 ('Human', 0.7620595693588257),
 ('nonhuman', 0.7084148526191711),
 ('beings', 0.7024695873260498),
 ('humans', 0.6974276304244995),
 ('animal', 0.6924618482589722),
 ('humanity', 0.6476197838783264),
 ('human-', 0.6355127692222595),
 ('mammalian', 0.6191367506980896),
 ('natural', 0.6171244978904724)]
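
The scores returned by `most_similar` are cosine similarities between word vectors. As a minimal sketch of what that number measures, the function below computes it from scratch. The three short vectors are made-up values for illustration, not real embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot product divided by the product of norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


a = [1.0, 2.0, 0.0]  # made-up vector
b = [2.0, 4.0, 0.0]  # same direction as a, twice as long
c = [0.0, 0.0, 3.0]  # orthogonal to a

print(cosine_similarity(a, b))  # close to 1.0: same direction
print(cosine_similarity(a, c))  # 0.0: no shared direction
```

Only the direction of the vectors matters, not their length, which is why words with very different frequencies can still score as close neighbours.
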

In [9]:
print("fastText Word Embeddings of word 'human' ")
fasttext_model['human']

fastText Word Embeddings of word 'human' 


array([ 8.800e-03, -1.230e-02,  3.650e-02,  1.136e-01, -8.000e-03,
        1.574e-01,  9.080e-02,  1.293e-01, -1.400e-03,  1.081e-01,
       -8.610e-02, -3.450e-02, -5.610e-02,  3.300e-03, -4.000e-04,
        1.650e-02,  8.540e-02,  4.670e-02, -1.632e-01,  6.200e-02,
        5.800e-03,  1.162e-01, -7.390e-02,  1.525e-01, -6.330e-02,
        6.780e-02, -1.114e-01, -3.440e-02,  4.310e-02,  6.050e-02,
       -1.349e-01,  5.660e-02, -7.210e-02,  1.785e-01,  5.520e-02,
       -8.580e-02, -7.610e-02,  1.387e-01, -3.760e-02,  8.000e-03,
        1.093e-01, -6.550e-02,  7.130e-02, -1.020e-01, -6.250e-02,
       -6.340e-02, -6.770e-02, -8.940e-02,  5.400e-03,  6.740e-02,
       -8.720e-02,  1.085e-01, -7.148e-01,  5.700e-03, -4.660e-02,
        3.980e-02, -7.630e-02,  9.780e-02, -3.300e-03,  1.379e-01,
       -1.192e-01,  3.700e-03, -1.599e-01, -9.340e-02, -1.021e-01,
       -2.812e-01,  1.539e-01, -2.750e-02, -3.230e-02, -3.860e-02,
       -1.403e-01, -7.450e-02, -7.510e-02,  1.232e-01,  8.800e

### **<span style="color:blue"> Training a FastText model </span>**

The FastText model in Gensim has several parameters that can be adjusted during initialization. The most commonly used parameters are:

- **`sentences`**:
    This parameter is used to specify the input data for training the FastText model. It can be a list of sentences or an iterable that yields sentences.

- **`vector_size`**:
    This parameter determines the dimensionality of the word vectors. For example, if set to 100, each word will be represented by a vector of 100 dimensions.

- **`window`**:
    It defines the maximum distance between the current and predicted word within a sentence. The default value is 5.

- **`min_count`**:
    This parameter specifies the minimum frequency of a word in the corpus for it to be considered during training. Words that occur fewer times will be ignored. The default value is 5.

- **`sg`**:
    Skip-gram vs CBOW (Continuous Bag of Words). If sg is set to 1, the Skip-gram model is used; if it is set to 0 (the default), CBOW is used.

- **`negative`**:
    It specifies how many "negative" samples should be drawn for each positive sample during training. More negative samples can lead to a more accurate model, but also require more computation.

- **`ns_exponent`**:
    This parameter sets the exponent used to shape the negative sampling distribution. The default value is 0.75.

- **`alpha`**:
    The initial learning rate for training. It will gradually decrease as the training progresses.

- **`min_n`** and **`max_n`**:
    These parameters specify the minimum and maximum size of character n-grams to be used for training. By default, min_n=3 and max_n=6.

- **`word_ngrams`**:
    In Facebook's fastText, this parameter is the maximum length of word n-grams. Gensim only supports the default value of 1, which means regular single-word handling.

- **`workers`**:
    The number of CPU cores to use for training. By default, it's set to 3.

- **`epochs`**:
    The number of iterations over the corpus during training.
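
To make the effect of `ns_exponent` more tangible, the sketch below computes the probability of drawing each word as a negative sample for different exponents. It is an illustration of the idea only: the function name and the toy token list are hypothetical, and the library builds its sampling table internally in its own way.

```python
from collections import Counter


def negative_sampling_probs(tokens, ns_exponent=0.75):
    """Probability of picking each word as a negative sample: count ** exponent, normalised."""
    counts = Counter(tokens)
    weights = {word: count ** ns_exponent for word, count in counts.items()}
    total = sum(weights.values())
    return {word: weight / total for word, weight in weights.items()}


# Hypothetical toy corpus: one very frequent word and two rare ones
tokens = ["the"] * 8 + ["tweet"] + ["covid"]

for exponent in (1.0, 0.75, 0.0):
    probs = negative_sampling_probs(tokens, exponent)
    print(exponent, {word: round(p, 3) for word, p in probs.items()})
```

An exponent of 1.0 samples words in proportion to their raw frequency, 0.0 samples all words equally, and 0.75 sits in between, so frequent words are drawn somewhat less often than their counts alone would suggest.
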



In [10]:
from gensim.models import FastText
from gensim.test.utils import common_texts

my_fasttext_model = FastText(common_texts,
                             vector_size = 100,
                             min_count = 1,
                             window = 5,
                             sg = 1)

In [12]:
print("Most similar words of word 'computer' :")
my_fasttext_model.wv.most_similar('computer')

Most similar words of word 'computer' :


[('user', 0.15659411251544952),
 ('response', 0.12383826076984406),
 ('eps', 0.030704911798238754),
 ('system', 0.025573883205652237),
 ('interface', 0.0058587524108588696),
 ('survey', -0.03156976401805878),
 ('minors', -0.0545564740896225),
 ('human', -0.0668589174747467),
 ('time', -0.06855931878089905),
 ('trees', -0.10636083036661148)]

In [13]:
# Word Embedding for Word "computer" using fastText
my_fasttext_model.wv['computer']

array([ 2.96936167e-04,  3.31060466e-04, -8.77768325e-04,  3.39444174e-04,
       -5.01747418e-04, -2.04214524e-03, -1.24066719e-03, -1.94044539e-03,
        1.34510931e-03, -2.41268426e-03,  9.18505422e-04, -1.03151030e-03,
       -7.63410062e-04,  7.31222244e-05,  1.38286629e-03,  5.19435504e-04,
       -2.98849802e-04, -1.19464763e-03, -1.17238448e-03, -6.08951552e-04,
       -6.78338984e-04,  3.92779708e-04,  9.88251195e-05,  8.12689308e-04,
        5.81971311e-04,  7.01953366e-04, -7.36806658e-04, -1.03962549e-03,
       -6.25258312e-04, -2.40496884e-04, -1.19316357e-03, -2.65940849e-04,
        7.36046524e-04, -7.21505727e-04, -1.27508014e-03,  1.24231781e-04,
        3.77583550e-04, -1.33155228e-03, -2.73441360e-03, -3.04829708e-04,
        9.28272377e-04, -7.28168816e-04, -1.12919568e-03, -3.21931177e-04,
       -2.06016310e-04, -1.04854174e-04, -6.22976047e-04, -1.61377620e-03,
        9.91107081e-04,  9.22983818e-05,  3.68000241e-04, -5.37839776e-04,
        1.13322982e-03,  

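The results obtained are not very good. `common_texts` is a tiny toy corpus (nine short documents) and the model was trained with mostly default settings, so **the hyperparameters need to be tuned** and, for meaningful similarities, a much larger corpus is needed.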