🧠 **Introduction to NLP (Natural Language Processing)**

Natural Language Processing (NLP) is a branch of AI that allows machines to understand, interpret, and respond to human language. NLP bridges the gap between human communication and computer understanding.

🔍 **Common Applications of NLP:**

1. Chatbots and Virtual Assistants (like ChatGPT, Siri)

2. Sentiment Analysis (e.g., is a tweet positive or negative?)

3. Machine Translation (e.g., Google Translate)

4. Text Summarization

5. Search Engines (Google, Bing)

6. Spam Detection

In [None]:
! pip install nltk



In [None]:
corpus = (
    '''
    The children are playing in the garden. She enjoys reading books on the weekend. They were walking slowly along the river. He studies hard to achieve his goals.
    '''
)

In [None]:
print(corpus)


    The children are playing in the garden. She enjoys reading books on the weekend. They were walking slowly along the river. He studies hard to achieve his goals.
    


When working with text (called a corpus), we follow a few key preprocessing steps before feeding it into an NLP model.

1. **Tokenization**

Tokenization is usually the first step in an NLP pipeline. It splits text into smaller units (tokens) such as words, sentences, or punctuation marks.

### 📦 Types of Tokenization

| **Tokenizer**              | **Description**                                                 |
|---------------------------|-----------------------------------------------------------------|
| **Word Tokenizer**         | Splits text into individual words                              |
| **Sentence Tokenizer**     | Splits text into complete sentences                            |
| **WordPunct Tokenizer**    | Splits words and punctuation separately                        |
| **Treebank Word Tokenizer**| Follows Penn Treebank conventions (splits contractions such as "don't"); expects one sentence at a time |


In [None]:
import nltk
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [None]:
from nltk.tokenize import word_tokenize
print(word_tokenize(corpus))

['The', 'children', 'are', 'playing', 'in', 'the', 'garden', '.', 'She', 'enjoys', 'reading', 'books', 'on', 'the', 'weekend', '.', 'They', 'were', 'walking', 'slowly', 'along', 'the', 'river', '.', 'He', 'studies', 'hard', 'to', 'achieve', 'his', 'goals', '.']


In [None]:
from nltk.tokenize import sent_tokenize
print(sent_tokenize(corpus))

['\n    The children are playing in the garden.', 'She enjoys reading books on the weekend.', 'They were walking slowly along the river.', 'He studies hard to achieve his goals.']


In [None]:
from nltk.tokenize import wordpunct_tokenize
print(wordpunct_tokenize(corpus))

['The', 'children', 'are', 'playing', 'in', 'the', 'garden', '.', 'She', 'enjoys', 'reading', 'books', 'on', 'the', 'weekend', '.', 'They', 'were', 'walking', 'slowly', 'along', 'the', 'river', '.', 'He', 'studies', 'hard', 'to', 'achieve', 'his', 'goals', '.']
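
On this corpus `word_tokenize` and `wordpunct_tokenize` return the same tokens, because the text contains no contractions or hyphens. As a rough illustration of where they diverge, the sketch below applies the regular expression `\w+|[^\w\s]+`, the pattern that NLTK's `WordPunctTokenizer` is built on, using only Python's `re` module. The sample sentence and the function name are assumptions added for illustration.

```python
import re

# Pattern behind NLTK's WordPunctTokenizer: runs of word characters,
# or runs of characters that are neither word characters nor whitespace.
WORDPUNCT_PATTERN = re.compile(r"\w+|[^\w\s]+")

def wordpunct_sketch(text):
    """Split text into alphanumeric runs and punctuation runs."""
    return WORDPUNCT_PATTERN.findall(text)

sample = "Don't stop, it's state-of-the-art!"  # illustrative sentence, not from the corpus
tokens = wordpunct_sketch(sample)
print(tokens)
```

Contractions are cut at the apostrophe here, whereas `word_tokenize` would keep `n't` and `'s` together as single tokens.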


In [None]:
from nltk.tokenize import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()
print(tokenizer.tokenize(corpus))

['The', 'children', 'are', 'playing', 'in', 'the', 'garden.', 'She', 'enjoys', 'reading', 'books', 'on', 'the', 'weekend.', 'They', 'were', 'walking', 'slowly', 'along', 'the', 'river.', 'He', 'studies', 'hard', 'to', 'achieve', 'his', 'goals', '.']


2. **Stemmer**

A stemmer is a tool in NLP that reduces a word to its base or root form, called the stem. The stem may not always be a valid word (for example, the Porter stemmer turns "quickly" into "quickli"), but it helps group related words together for analysis.

Types of stemmer:

| **Stemmer**         | **Description**                                                  |
|---------------------|------------------------------------------------------------------|
| Porter Stemmer      | Rule-based, widely used, moderate stemming                       |
| Lancaster Stemmer   | Very aggressive, can chop off too much                           |
| Snowball Stemmer    | Refined Porter, supports multiple languages                      |
| Regexp Stemmer      | Uses regular expressions to remove suffixes (customizable)       |


In [None]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

sentence = "The runners were running quickly towards the finishing line."
tokens = word_tokenize(sentence)
print("Tokens:", tokens)

# Initialize the Porter Stemmer
porter = PorterStemmer()
print("PorterStemmer:", [porter.stem(word) for word in tokens])


Tokens: ['The', 'runners', 'were', 'running', 'quickly', 'towards', 'the', 'finishing', 'line', '.']
PorterStemmer: ['the', 'runner', 'were', 'run', 'quickli', 'toward', 'the', 'finish', 'line', '.']


In [None]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import LancasterStemmer

sentence = "The runners were running quickly towards the finishing line."
tokens = word_tokenize(sentence)
print("Tokens:", tokens)

lancaster = LancasterStemmer()
print("LancasterStemmer:", [lancaster.stem(word) for word in tokens])

Tokens: ['The', 'runners', 'were', 'running', 'quickly', 'towards', 'the', 'finishing', 'line', '.']
LancasterStemmer: ['the', 'run', 'wer', 'run', 'quick', 'toward', 'the', 'fin', 'lin', '.']


In [None]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import SnowballStemmer

sentence = "The runners were running quickly towards the finishing line."
tokens = word_tokenize(sentence)
print("Tokens:", tokens)

snowball = SnowballStemmer("english")
print("SnowballStemmer:", [snowball.stem(word) for word in tokens])

Tokens: ['The', 'runners', 'were', 'running', 'quickly', 'towards', 'the', 'finishing', 'line', '.']
SnowballStemmer: ['the', 'runner', 'were', 'run', 'quick', 'toward', 'the', 'finish', 'line', '.']


In [None]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import RegexpStemmer

sentence = "The runners were running quickly towards the finishing line."
tokens = word_tokenize(sentence)
print("Tokens:", tokens)

regex = RegexpStemmer('ing$|s$|e$|ners$', min=4)
print("RegexpStemmer:", [regex.stem(word) for word in tokens])

Tokens: ['The', 'runners', 'were', 'running', 'quickly', 'towards', 'the', 'finishing', 'line', '.']
RegexpStemmer: ['The', 'run', 'wer', 'runn', 'quickly', 'toward', 'the', 'finish', 'lin', '.']
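
To make the `RegexpStemmer` result less of a black box, here is a minimal sketch of the same idea with Python's `re` module: words shorter than a minimum length are left alone, and otherwise every match of the suffix pattern is deleted. The function name `regexp_stem` and its `min_length` parameter are illustrative, not NLTK's API.

```python
import re

def regexp_stem(word, pattern, min_length=0):
    """Delete every match of pattern from word, skipping words shorter than min_length."""
    if len(word) < min_length:
        return word
    return re.sub(pattern, "", word)

tokens = ['The', 'runners', 'were', 'running', 'quickly', 'towards',
          'the', 'finishing', 'line', '.']
stems = [regexp_stem(word, r"ing$|s$|e$|ners$", min_length=4) for word in tokens]
print(stems)
```

The sketch shows why "running" becomes "runn": the rule only strips "ing" and knows nothing about doubled consonants, which is the kind of case the Porter and Snowball rules handle.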


3. **Lemmatization**

Lemmatization is the process of reducing a word to its dictionary form (lemma). Unlike stemming, lemmatization returns real words, and the result depends on the word's part of speech (POS). Some lemmatizers infer the POS from context, while others need it passed in.
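
As a toy illustration of why the part of speech matters, the sketch below looks words up in a tiny hand-written table keyed by (word, POS). The table and the `toy_lemmatize` function are hypothetical; real lemmatizers consult a full lexical resource such as WordNet.

```python
# Hypothetical mini-dictionary: (surface form, POS) -> lemma.
LEMMA_TABLE = {
    ("running", "v"): "run",
    ("were", "v"): "be",
    ("runners", "n"): "runner",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word unchanged if it is not in the table."""
    return LEMMA_TABLE.get((word.lower(), pos), word)

as_verb = toy_lemmatize("running", pos="v")
as_noun = toy_lemmatize("running", pos="n")
print(as_verb, as_noun)
```

The same surface form maps to "run" when treated as a verb and stays "running" when treated as a noun.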


### 🌱 Types of Lemmatizers

| **Lemmatizer**        | **Library/Tool** | **Description**                                                        |
|-----------------------|------------------|------------------------------------------------------------------------|
| **WordNetLemmatizer** | NLTK             | Dictionary-based lemmatizer using WordNet; accepts a POS argument (default: noun) |
| **spaCy Lemmatizer**  | spaCy            | Context-aware lemmatizer using statistical models                      |
| **TextBlob Lemmatizer**| TextBlob        | Simple wrapper around WordNet; easy to use for beginners               |



In [None]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Downloads
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')

# Sample sentence
sentence = "The runners were running quickly towards the finishing line."

# Tokenize sentence
tokens = word_tokenize(sentence)
print("Tokens:", tokens)

# Initialize lemmatizer
lemmatizer = WordNetLemmatizer()

# Basic lemmatization (default POS = noun)
print("Lemmatized (default noun):", [lemmatizer.lemmatize(word) for word in tokens])

print("Lemmatized (verb):", [lemmatizer.lemmatize(word,pos='v') for word in tokens])

print("Lemmatized (adjective):", [lemmatizer.lemmatize(word,pos='a') for word in tokens])

Tokens: ['The', 'runners', 'were', 'running', 'quickly', 'towards', 'the', 'finishing', 'line', '.']
Lemmatized (default noun): ['The', 'runner', 'were', 'running', 'quickly', 'towards', 'the', 'finishing', 'line', '.']
Lemmatized (verb): ['The', 'runners', 'be', 'run', 'quickly', 'towards', 'the', 'finish', 'line', '.']
Lemmatized (adjective): ['The', 'runners', 'were', 'running', 'quickly', 'towards', 'the', 'finishing', 'line', '.']


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [None]:
#  Install and load spaCy
!pip install -q spacy
import spacy

# Load English model
# (if it is not installed yet, run: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

# Sample sentence
sentence = "The runners were running quickly towards the finishing line."

# Process sentence with spaCy
doc = nlp(sentence)

# All tokens and their lemmas
print(" Tokens:", [token.text for token in doc])
print("Lemmatized (auto POS):", [token.lemma_ for token in doc])

# Filter: Lemmas of words that are verbs
verbs = [token.lemma_ for token in doc if token.pos_ == "VERB"]
print(" Lemmatized (verbs only):", verbs)

# Filter: Lemmas of words that are adjectives
adjectives = [token.lemma_ for token in doc if token.pos_ == "ADJ"]
print(" Lemmatized (adjectives only):", adjectives)


 Tokens: ['The', 'runners', 'were', 'running', 'quickly', 'towards', 'the', 'finishing', 'line', '.']
Lemmatized (auto POS): ['the', 'runner', 'be', 'run', 'quickly', 'towards', 'the', 'finish', 'line', '.']
 Lemmatized (verbs only): ['run', 'finish']
 Lemmatized (adjectives only): []


In [None]:
#  Install TextBlob and download corpora
!pip install -q textblob
from textblob import TextBlob
import nltk
nltk.download('punkt')
!python -m textblob.download_corpora
# Sample sentence
sentence = "The runners were running quickly towards the finishing line."

# Create TextBlob object
blob = TextBlob(sentence)

# Tokenize and lemmatize
print(" Tokens:", blob.words)

# Lemmatized words (TextBlob assumes default noun)
lemmatized_default = [word.lemmatize() for word in blob.words]
print(" Lemmatized (default):", lemmatized_default)

# Simulated POS lemmatization using TextBlob tags
print(" Tagged words:", blob.tags)

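# Apply lemmatization with a simple Penn Treebank -> WordNet POS mapping:
# JJ* -> 'a' (adjective), NN* -> 'n' (noun), VB* -> 'v' (verb), RB* -> 'r' (adverb).
# Any other tag falls back to the noun default.
penn_to_wordnet = {'J': 'a', 'N': 'n', 'V': 'v', 'R': 'r'}
lemmatized_with_pos = [word.lemmatize(penn_to_wordnet.get(tag[0], 'n')) for word, tag in blob.tags]
print(" Lemmatized (with POS):", lemmatized_with_pos)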



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package conll2000 to /root/nltk_data...
[nltk_data]   Package conll2000 is already up-to-date!
[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
Finished.
 Tokens: ['The', 'runners', 'were', 'running', 'quickly', 'towards', 'the', 'finishing', 'line']
 Lemmatized (default): ['The', 'runner', 'were', 'running', 'quickly', 'towards', 'the', 'finishing', 'lin

4. **Stopwords**

Stopwords are common words in a language that are usually filtered out before processing text in NLP tasks. These words often don’t carry significant meaning on their own and are mostly used for grammar or sentence structure.



In [None]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')
nltk.download('stopwords')

text = "The cat is sitting on the mat."
print(text)
tokens = word_tokenize(text)
stop_words = set(stopwords.words('english'))

tokens = [word.lower() for word in tokens]
print("Lowerwords:", tokens)
filtered = [word for word in tokens if word.lower() not in stop_words]
print("After Stopword Removal:", filtered)


The cat is sitting on the mat.
Lowerwords: ['the', 'cat', 'is', 'sitting', 'on', 'the', 'mat', '.']
After Stopword Removal: ['cat', 'sitting', 'mat', '.']
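
Note that the full stop survives the filter above, because punctuation is not in NLTK's stopword list. A minimal sketch of a filter that also drops non-alphabetic tokens follows; the small `STOP_WORDS` set and the `remove_stopwords` function are illustrative stand-ins, not NLTK's list or API.

```python
# Small illustrative stopword set; NLTK's English list is much longer.
STOP_WORDS = {"the", "is", "on", "a", "an", "in", "and"}

def remove_stopwords(tokens, stop_words=STOP_WORDS):
    """Lowercase tokens, then drop stopwords and non-alphabetic tokens."""
    lowered = [token.lower() for token in tokens]
    return [token for token in lowered if token.isalpha() and token not in stop_words]

tokens = ["The", "cat", "is", "sitting", "on", "the", "mat", "."]
filtered = remove_stopwords(tokens)
print(filtered)
```

Whether punctuation should be removed depends on the task; sentiment models, for instance, sometimes benefit from keeping "!" or "?".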


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


5. **Named Entity Recognition**

Named Entity Recognition (NER) is a Natural Language Processing (NLP) technique used to identify and classify named entities in text into predefined categories such as:

### 🏷️ Named Entity Categories

| **Category**     | **Example**             |
|------------------|--------------------------|
| **Person**       | Elon Musk, Sachin        |
| **Organization** | Google, NASA             |
| **Location**     | India, New York          |
| **Date**         | 26 March 2025            |
| **Time**         | 5 PM                     |
| **Money**        | $500, ₹1000              |


In [None]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
nltk.download('maxent_ne_chunker_tab')

# Sample text
text = "Elon Musk founded SpaceX in California on 14 March 2002."

# Tokenize and POS tagging
words = nltk.word_tokenize(text)
pos_tags = nltk.pos_tag(words)

# Named Entity Chunking
ne_tree = nltk.ne_chunk(pos_tags)
print(" Named Entities:")
print(ne_tree)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package maxent_ne_chunker_tab to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker_tab.zip.


 Named Entities:
(S
  (PERSON Elon/NNP)
  (PERSON Musk/NNP)
  founded/VBD
  (ORGANIZATION SpaceX/NNP)
  in/IN
  (GPE California/NNP)
  on/IN
  14/CD
  March/NNP
  2002/CD
  ./.)
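
In the tree above, "Elon" and "Musk" come out as two separate PERSON chunks, and the date is not tagged at all. One simple post-processing heuristic, sketched here as an assumption rather than anything NLTK does itself, is to merge consecutive tokens that share a label. The `tagged` list is a hand-typed, flattened copy of the output above.

```python
from itertools import groupby

# Flattened version of the chunker output: (token, label) pairs,
# with None for tokens outside any entity. Typed in by hand for illustration.
tagged = [
    ("Elon", "PERSON"), ("Musk", "PERSON"), ("founded", None),
    ("SpaceX", "ORGANIZATION"), ("in", None), ("California", "GPE"),
    ("on", None), ("14", None), ("March", None), ("2002", None), (".", None),
]

def merge_adjacent_entities(pairs):
    """Join consecutive tokens that share the same entity label."""
    entities = []
    for label, group in groupby(pairs, key=lambda pair: pair[1]):
        if label is not None:
            entities.append((" ".join(token for token, _ in group), label))
    return entities

entities = merge_adjacent_entities(tagged)
print(entities)
```

The heuristic would wrongly join two different people who happen to be mentioned back to back, so it is only a rough fix.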


**Text representation (vectorization):**

Machine learning models operate on numbers, not raw text, so words must be converted into numerical form (vectors). Common techniques:

1. One-hot encoding
2. Bag of Words (BoW)
3. N-grams
4. Term Frequency – Inverse Document Frequency (TF-IDF)
5. Word2Vec
6. Average Word2Vec (AvgWord2Vec)

1. **One Hot encoding:**

Each word is represented by a binary vector. The length of the vector equals the vocabulary size. Only one position is 1, others are 0.

In [None]:
#  Install scikit-learn if not already
!pip install -q scikit-learn

from sklearn.feature_extraction.text import CountVectorizer

#  Sample sentences
texts = [
    "apple banana cherry",
    "banana apple banana",
    "cherry banana apple",
    "apple banana apple"
]

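#  Initialize CountVectorizer
# binary=True records only whether each word occurs in a sentence (1) or not (0).
# Each row is therefore a binary presence vector for the whole sentence;
# per-word one-hot vectors are built by hand in the next cell.
vectorizer = CountVectorizer(binary=True)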

# Fit and transform
X = vectorizer.fit_transform(texts)

# Show results
print(" Vocabulary:", vectorizer.get_feature_names_out())
print("One-Hot Encoded Matrix:\n", X.toarray())


 Vocabulary: ['apple' 'banana' 'cherry']
One-Hot Encoded Matrix:
 [[1 1 1]
 [1 1 0]
 [1 1 1]
 [1 1 0]]


In [None]:
sentence = "apple banana cherry banana apple"

# Step 1: Tokenize sentence into words
tokens = sentence.split()

# Step 2: Create a vocabulary (unique words)
vocab = sorted(set(tokens))
print(" Vocabulary:", vocab)

# Step 3: Create word-to-index mapping
word_to_index = {word: idx for idx, word in enumerate(vocab)}
print(" Word to Index Mapping:", word_to_index)

# Step 4: One-Hot Encode each word
ohe_vectors = []
for word in tokens:
    vector = [0] * len(vocab)
    vector[word_to_index[word]] = 1
    ohe_vectors.append(vector)

# Step 5: Print results
print("\n One-Hot Encoded Vectors:")
for word, vec in zip(tokens, ohe_vectors):
    print(f"{word}: {vec}")


 Vocabulary: ['apple', 'banana', 'cherry']
 Word to Index Mapping: {'apple': 0, 'banana': 1, 'cherry': 2}

 One-Hot Encoded Vectors:
apple: [1, 0, 0]
banana: [0, 1, 0]
cherry: [0, 0, 1]
banana: [0, 1, 0]
apple: [1, 0, 0]


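Note:

Advantage:
1. Easy to implement in Python (for example, with scikit-learn).

Disadvantage:
1. Produces sparse vectors (mostly zeros), which can lead to overfitting in ML models.
2. ML algorithms need a fixed input size, but the number of one-hot vectors varies with sentence length.
3. No semantic meaning is captured.
4. Out-of-vocabulary (OOV) words cannot be represented.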
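
To make the out-of-vocabulary problem concrete: in the manual example above, `word_to_index[word]` raises a `KeyError` for any word that was not seen when the vocabulary was built. One common workaround, sketched here as an assumption and not part of the original example, is to reserve a shared `<UNK>` slot for unknown words.

```python
def one_hot(word, word_to_index):
    """Return a one-hot list for word; unknown words map to the shared <UNK> slot."""
    vector = [0] * len(word_to_index)
    index = word_to_index.get(word, word_to_index["<UNK>"])
    vector[index] = 1
    return vector

# Vocabulary from the example above plus a hypothetical <UNK> entry.
word_to_index = {"apple": 0, "banana": 1, "cherry": 2, "<UNK>": 3}

known = one_hot("banana", word_to_index)
unknown = one_hot("mango", word_to_index)
print(known, unknown)
```

All unknown words collapse onto the same vector, so the model can process them but cannot tell them apart.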



2. **BOW:**
Bag of Words (BoW) is a basic and popular technique for converting text into numerical features for machine learning models.

 Key Idea:
 Treat each document or sentence as a "bag" (collection) of words. Ignore grammar and word order, usually after lowercasing the text and removing stopwords, and simply count how many times each word appears.
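
Before reaching for a library, the counting step can be sketched in plain Python with `collections.Counter`. The `bag_of_words` function is an illustrative helper that assumes whitespace tokenization, which is simpler than what scikit-learn does.

```python
from collections import Counter

def bag_of_words(documents):
    """Return (vocabulary, count vectors) for whitespace-tokenized documents."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)  # missing words count as 0
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

vocabulary, vectors = bag_of_words(["apple banana apple", "banana cherry"])
print(vocabulary)
print(vectors)
```

Every document gets a vector of the same length (the vocabulary size), regardless of how many words it contains.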

In [None]:
#  Import required library
from sklearn.feature_extraction.text import CountVectorizer

#  Sample sentences
sentences = [
    "apple banana apple",
    "banana cherry"
]

#  Initialize CountVectorizer (BoW)
vectorizer = CountVectorizer()

# Fit and transform the sentences into BoW vectors
X = vectorizer.fit_transform(sentences)

# Get feature (vocabulary) names
vocab = vectorizer.get_feature_names_out()
print(" Vocabulary:", vocab)

# Convert sparse matrix to array for display
bow_array = X.toarray()
print("\n Bag of Words Vectors:")
for i, vec in enumerate(bow_array):
    print(f"Sentence {i+1}: {vec}")


 Vocabulary: ['apple' 'banana' 'cherry']

 Bag of Words Vectors:
Sentence 1: [2 1 0]
Sentence 2: [0 1 1]


Note:
Advantage:
1. Simple and intuitive.
2. Produces fixed-size input vectors for ML algorithms.

Disadvantage:
1. Sparse matrix (mostly zeros), which can lead to overfitting in ML models.
2. Word order is lost.
3. Out-of-vocabulary (OOV) words are ignored.
4. No semantic meaning is captured.

3. **N-grams**

An N-gram is a sequence of N consecutive words from a given text. N-grams help capture context and word patterns that Bag of Words misses. They are useful for language modeling, text classification, and spam detection.

Example: "New York" (bigram) is more meaningful than "New", "York" separately.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Sample text
text = ["I love natural language processing"]

#  Unigram (n=1)
vectorizer_uni = CountVectorizer(ngram_range=(1,1))
X_uni = vectorizer_uni.fit_transform(text)
print(" Unigrams:", vectorizer_uni.get_feature_names_out())

#  Bigram (n=2)
vectorizer_bi = CountVectorizer(ngram_range=(2,2))
X_bi = vectorizer_bi.fit_transform(text)
print(" Bigrams:", vectorizer_bi.get_feature_names_out())

#  Trigram (n=3)
vectorizer_tri = CountVectorizer(ngram_range=(3,3))
X_tri = vectorizer_tri.fit_transform(text)
print(" Trigrams:", vectorizer_tri.get_feature_names_out())


 Unigrams: ['language' 'love' 'natural' 'processing']
 Bigrams: ['language processing' 'love natural' 'natural language']
 Trigrams: ['love natural language' 'natural language processing']
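
The word "I" is missing from the output above because `CountVectorizer` lowercases the text and, with its default token pattern, keeps only tokens of two or more characters. It also lists features alphabetically. For comparison, here is a minimal plain-Python sketch that keeps every token and preserves the original order; the `ngrams` helper is illustrative.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token sequences as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I love natural language processing".lower().split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```

A sentence with k tokens yields k - n + 1 n-grams, so the feature space grows quickly as n increases.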


4. **TF-IDF**

TF-IDF stands for:

1. Term Frequency (TF): How often a word appears in a document.

2. Inverse Document Frequency (IDF): How rare or unique a word is across all documents.

Goal: Highlight words that are important to a document and downweight words that are common across documents, such as “the”, “is”, and “and”.
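
The two parts can be combined by hand. The sketch below follows the default settings of scikit-learn's `TfidfVectorizer`: a smoothed IDF, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, where `n` is the number of documents and `df(t)` is the number of documents containing term `t`, followed by L2 normalization of each document vector. The `tfidf` function and its whitespace tokenization are simplifying assumptions.

```python
import math
from collections import Counter

def tfidf(documents):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalized)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {term: freq * idf[term] for term, freq in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

# "I" is left out by hand, since TfidfVectorizer's default tokenizer drops one-character tokens.
documents = [
    "love machine learning",
    "machine learning is amazing",
    "enjoy learning new things about ai",
]
vectors = tfidf(documents)
print(round(vectors[0]["love"], 6), round(vectors[0]["machine"], 6))
```

"learning" appears in every document, so it gets the lowest IDF, while "love" appears in only one and ends up with the highest weight in the first document.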

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

#  Sample documents
documents = [
    "I love machine learning",
    "Machine learning is amazing",
    "I enjoy learning new things about AI"
]

#  Create TF-IDF Vectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the documents
X = vectorizer.fit_transform(documents)

# Show the vocabulary
print(" Vocabulary:", vectorizer.get_feature_names_out())

# Convert to array for viewing
tfidf_matrix = X.toarray()

# Print TF-IDF matrix
import pandas as pd
df = pd.DataFrame(tfidf_matrix, columns=vectorizer.get_feature_names_out())
print("\n TF-IDF Matrix:")
print(df)


 Vocabulary: ['about' 'ai' 'amazing' 'enjoy' 'is' 'learning' 'love' 'machine' 'new'
 'things']

 TF-IDF Matrix:
      about        ai   amazing     enjoy  ...      love   machine       new    things
0  0.000000  0.000000  0.000000  0.000000  ...  0.720333  0.547832  0.000000  0.000000
1  0.000000  0.000000  0.584483  0.000000  ...  0.000000  0.444514  0.000000  0.000000
2  0.432385  0.432385  0.000000  0.432385  ...  0.000000  0.000000  0.432385  0.432385

[3 rows x 10 columns]


Note:

Advantage:

1. Easy and intuitive.
2. Fixed-size vectors (one dimension per vocabulary word).
3. Word importance is captured.

Disadvantage:

1. Sparse matrix.
2. Out-of-vocabulary (OOV) words are ignored.

5. **Word2Vec**

Word2Vec is a popular technique in NLP for converting words into dense numerical vectors that capture meaning and context.

Unlike Bag of Words or TF-IDF, Word2Vec does not just count words; it learns from context using a shallow neural network.

###  Word2Vec Models

| **Model**                      | **Goal**                                             |
|-------------------------------|------------------------------------------------------|
| **CBOW (Continuous Bag of Words)** | Predict the current word based on surrounding context |
| **Skip-gram**                 | Predict the surrounding context words from the current word |
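
The difference between the two models is easiest to see in the training examples they consume. The sketch below builds both kinds of examples from one sentence with a context window of one word on each side. The `training_pairs` function is illustrative; it shows only how the examples are formed, not the neural network that learns from them.

```python
def training_pairs(tokens, window=1):
    """Build (context words, target) examples for CBOW and (target, context word) pairs for Skip-gram."""
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        cbow.append((context, target))                       # many context words -> one target
        skipgram.extend((target, word) for word in context)  # one target -> each context word
    return cbow, skipgram

tokens = "i love natural language".split()
cbow, skipgram = training_pairs(tokens, window=1)
print(cbow)
print(skipgram)
```

CBOW produces one example per position, while Skip-gram produces one example per (target, neighbor) pair.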


In [None]:
# Step 1: Force reinstall compatible numpy version
!pip install numpy==1.23.5 --upgrade --force-reinstall

# Step 2: Reinstall gensim to recompile properly with the new numpy
!pip install gensim --upgrade --force-reinstall


Collecting numpy==1.23.5
  Downloading numpy-1.23.5-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.3 kB)
Downloading numpy-1.23.5-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.1 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.26.4
    Uninstalling numpy-1.26.4:
      Successfully uninstalled numpy-1.26.4


In [1]:
# Install gensim if not already
!pip install -q gensim

#  Download Google’s pretrained Word2Vec model (if not already present)
import gensim.downloader as api

# Load the model (this will download it if not cached)
print(" Loading Google Word2Vec Model...")
model = api.load("word2vec-google-news-300")
print(" Model loaded!")

#  Test the model
# 1. Get the vector for a word
word = "king"
print(f"\n Vector for '{word}':")
print(model[word])  # 300-dimensional vector

# 2. Find similar words
print(f"\n Words similar to '{word}':")
print(model.most_similar(word))

# 3. Word analogy example
print("\n Word analogy: 'king' - 'man' + 'woman' ≈")
print(model.most_similar(positive=['king', 'woman'], negative=['man'], topn=1))



 Loading Google Word2Vec Model...
 Model loaded!

 Vector for 'king':
[ 1.25976562e-01  2.97851562e-02  8.60595703e-03  1.39648438e-01
 -2.56347656e-02 -3.61328125e-02  1.11816406e-01 -1.98242188e-01
  5.12695312e-02  3.63281250e-01 -2.42187500e-01 -3.02734375e-01
 -1.77734375e-01 -2.49023438e-02 -1.67968750e-01 -1.69921875e-01
  3.46679688e-02  5.21850586e-03  4.63867188e-02  1.28906250e-01
  1.36718750e-01  1.12792969e-01  5.95703125e-02  1.36718750e-01
  1.01074219e-01 -1.76757812e-01 -2.51953125e-01  5.98144531e-02
  3.41796875e-01 -3.11279297e-02  1.04492188e-01  6.17675781e-02
  1.24511719e-01  4.00390625e-01 -3.22265625e-01  8.39843750e-02
  3.90625000e-02  5.85937500e-03  7.03125000e-02  1.72851562e-01
  1.38671875e-01 -2.31445312e-01  2.83203125e-01  1.42578125e-01
  3.41796875e-01 -2.39257812e-02 -1.09863281e-01  3.32031250e-02
 -5.46875000e-02  1.53198242e-02 -1.62109375e-01  1.58203125e-01
 -2.59765625e-01  2.01416016e-02 -1.63085938e-01  1.35803223e-03
 -1.44531250e-01 -5.

In [2]:

print(" Vector for 'king':", model['king'][:10])  # print first 10 values

print(" Similar to 'king':", model.most_similar('king')[:3])

 Vector for 'king': [ 0.12597656  0.02978516  0.00860596  0.13964844 -0.02563477 -0.03613281
  0.11181641 -0.19824219  0.05126953  0.36328125]
 Similar to 'king': [('kings', 0.7138045430183411), ('queen', 0.6510956883430481), ('monarch', 0.6413194537162781)]
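
The scores returned by `most_similar` are cosine similarities between word vectors. Here is a minimal sketch of that measure and of the analogy arithmetic, using made-up 3-dimensional vectors instead of the real 300-dimensional ones. The `toy_vectors` values and both helper functions are assumptions for illustration; gensim additionally normalizes the vectors before combining them.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(positive, negative, vectors):
    """Return the word closest to sum(positive) - sum(negative), excluding the input words."""
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    for word in positive:
        query = [q + x for q, x in zip(query, vectors[word])]
    for word in negative:
        query = [q - x for q, x in zip(query, vectors[word])]
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine_similarity(query, vectors[w]))

# Made-up vectors: the dimensions loosely stand for royalty, maleness, femaleness.
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.1, 0.2, 0.1],
}

best = analogy(positive=["king", "woman"], negative=["man"], vectors=toy_vectors)
print(best)
```

Cosine similarity compares direction rather than length, so two words count as similar when their vectors point the same way.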


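Note:

Advantage:

1. Dense, low-dimensional vectors (300 dimensions here) instead of a sparse matrix.
2. Semantic information is captured.
3. The vector dimension is fixed, regardless of vocabulary size.
4. OOV is reduced in practice because pretrained models cover a very large vocabulary, although a word absent from the model still has no vector.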
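
The Avgword2vec technique listed earlier is commonly understood as averaging the Word2Vec vectors of the words in a sentence to get a single fixed-size sentence vector. A minimal sketch under that assumption follows, using made-up 2-dimensional vectors; with a real model the vectors would come from a lookup such as `model[word]`. The `average_vector` helper is illustrative.

```python
def average_vector(tokens, vectors):
    """Average the vectors of known tokens; return None if no token is known."""
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        return None
    dims = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dims)]

# Made-up vectors for illustration; "xyzzy" is deliberately out of vocabulary.
toy_vectors = {"good": [1.0, 0.0], "movie": [0.0, 1.0]}
sentence_vector = average_vector(["good", "movie", "xyzzy"], toy_vectors)
print(sentence_vector)
```

Unknown words are skipped here, which is one simple way to handle words the model has never seen.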

