# Word2Vec Model

### The main problems with BoW and TF-IDF

- They do not capture word position in the text, semantics, or co-occurrence across different documents.
- TF-IDF gives more weight to uncommon words.
- There is a very high chance of overfitting.
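The first limitation can be illustrated with a small sketch. The helper `bag_of_words` and the three-word vocabulary are hypothetical names chosen for this illustration only. Two sentences with opposite meanings end up with the same count vector because word order is discarded.

```python
from collections import Counter


def bag_of_words(sentence, vocabulary):
    """Return the count of each vocabulary word in the sentence (hypothetical helper)."""
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocabulary]


# Toy vocabulary, assumed for this example only
vocabulary = ["bites", "dog", "man"]

first = bag_of_words("dog bites man", vocabulary)
second = bag_of_words("man bites dog", vocabulary)

print(first)
print(second)
# The two vectors are identical, so the model cannot tell the sentences apart
print(first == second)
```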

### Solution

***Word2Vec***:
- Each word is represented as a vector of 32 or more dimensions instead of a single number.
- The semantic information and the relations between different words are preserved.
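As a rough sketch of why dense vectors help, related words can be compared with cosine similarity. The four-dimensional vectors below are made-up values for illustration, not output from a trained model, and `cosine_similarity` is a hypothetical helper written with the standard library only.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hand-written toy vectors (assumption: real embeddings are learned, not typed in)
toy_vectors = {
    "algebra": [0.9, 0.8, 0.1, 0.0],
    "geometry": [0.8, 0.9, 0.2, 0.1],
    "vehicle": [0.0, 0.1, 0.9, 0.8],
}

sim_related = cosine_similarity(toy_vectors["algebra"], toy_vectors["geometry"])
sim_unrelated = cosine_similarity(toy_vectors["algebra"], toy_vectors["vehicle"])

# Vectors pointing in similar directions score close to 1
print(f"algebra vs geometry: {sim_related:.3f}")
print(f"algebra vs vehicle:  {sim_unrelated:.3f}")
```

With a single number per word there is no notion of direction, so this kind of comparison is not available.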
   
### Visual representation - word2vec

<img src="vr-word2vec.jpeg">

### Steps to create Word2Vec

- Tokenize the sentences
- Create histograms (word counts)
- Take the most frequent words
- Create a matrix of all the unique words. It also represents the co-occurrence relations between the words
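The four steps above can be sketched with the standard library. This is a count-based illustration under stated assumptions: the toy sentences, `top_k`, and the context `window` are values chosen for this example. gensim's `Word2Vec` does not build this matrix explicitly; it learns the vectors by training a shallow neural network on the same kind of context windows.

```python
from collections import Counter

# Toy corpus, assumed for this example only
sentences = [
    "algebra was a unifying theory",
    "algebra gave mathematics a new path",
    "mathematics was essentially geometry",
]

# Step 1: tokenize each sentence
tokenized = [sentence.split() for sentence in sentences]

# Step 2: build a histogram of word counts
histogram = Counter(word for tokens in tokenized for word in tokens)

# Step 3: keep the most frequent words
top_k = 4
vocab = [word for word, _ in histogram.most_common(top_k)]
index = {word: i for i, word in enumerate(vocab)}

# Step 4: count how often two vocabulary words appear within `window` positions
window = 2
matrix = [[0] * len(vocab) for _ in vocab]
for tokens in tokenized:
    for pos, word in enumerate(tokens):
        if word not in index:
            continue
        start = max(0, pos - window)
        stop = min(len(tokens), pos + window + 1)
        for ctx_pos in range(start, stop):
            context = tokens[ctx_pos]
            if ctx_pos != pos and context in index:
                matrix[index[word]][index[context]] += 1

print(vocab)
for word, row in zip(vocab, matrix):
    print(f"{word:12s} {row}")
```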

## Import Libraries 

In [5]:
# A shell command does not raise a Python exception, so try/except adds nothing here
!pip install gensim

Collecting gensim
  Using cached gensim-4.0.1-cp38-cp38-win_amd64.whl (23.9 MB)
Installing collected packages: gensim
Successfully installed gensim-4.0.1


In [31]:
import re

import nltk
from gensim.models import Word2Vec
from nltk.corpus import stopwords

In [9]:
english_text = """Perhaps one of the most significant advances in  made by Arabic mathematics began at this time with the work of al-Khwarizmi, namely the beginnings of algebra. It is important to understand just how significant this new idea was. It was a revolutionary move away from the Greek concept of mathematics which was essentially geometry. Algebra was a unifying theory which allowedrational numbers,irrational numbers, geometrical magnitudes, etc., to all be treated as \"algebraic objects\". It gave mathematics a whole new development path so much broader in concept to that which had existed before, and provided a vehicle for future development of the subject. Another important aspect of the introduction of algebraic ideas was that it allowed mathematics to be applied to itselfin a way which had not happened before."""

In [7]:
english_text

'Perhaps one of the most significant advances in  made by Arabic mathematics began at this time with the work of al-Khwarizmi, namely the beginnings of algebra. It is important to understand just how significant this new idea was. It was a revolutionary move away from the Greek concept of mathematics which was essentially geometry. Algebra was a unifying theory which allowedrational numbers,irrational numbers, geometrical magnitudes, etc., to all be treated as "algebraic objects". It gave mathematics a whole new development path so much broader in concept to that which had existed before, and provided a vehicle for future development of the subject. Another important aspect of the introduction of algebraic ideas was that it allowed mathematics to be applied to itselfin a way which had not happened before.'

### Data Cleaning

In [10]:
english_sentences = nltk.sent_tokenize(english_text)

In [12]:
# Build the stopword set once instead of on every iteration
# (assumes the NLTK "stopwords" corpus has already been downloaded)
english_stopwords = set(stopwords.words("english"))
english_corpus = []

for sentence in english_sentences:
    # keep letters only; every other character becomes a space
    cleaning_text = re.sub('[^a-zA-Z]', ' ', sentence)
    # convert to lower case
    cleaning_text = cleaning_text.lower()
    # split the sentence into tokens
    cleaning_text = cleaning_text.split()
    # remove stopwords (no lemmatization is applied here)
    sentence_tokens = [word for word in cleaning_text if word not in english_stopwords]
    english_corpus.append(' '.join(sentence_tokens))

In [13]:
english_corpus

['perhaps one significant advances made arabic mathematics began time work al khwarizmi namely beginnings algebra',
 'important understand significant new idea',
 'revolutionary move away greek concept mathematics essentially geometry',
 'algebra unifying theory allowedrational numbers irrational numbers geometrical magnitudes etc treated algebraic objects',
 'gave mathematics whole new development path much broader concept existed provided vehicle future development subject',
 'another important aspect introduction algebraic ideas allowed mathematics applied itselfin way happened']

## Word2vec

In [59]:
english_corpus_tokens = [nltk.word_tokenize(sent) for sent in english_corpus]
english_corpus_tokens

[['perhaps',
  'one',
  'significant',
  'advances',
  'made',
  'arabic',
  'mathematics',
  'began',
  'time',
  'work',
  'al',
  'khwarizmi',
  'namely',
  'beginnings',
  'algebra'],
 ['important', 'understand', 'significant', 'new', 'idea'],
 ['revolutionary',
  'move',
  'away',
  'greek',
  'concept',
  'mathematics',
  'essentially',
  'geometry'],
 ['algebra',
  'unifying',
  'theory',
  'allowedrational',
  'numbers',
  'irrational',
  'numbers',
  'geometrical',
  'magnitudes',
  'etc',
  'treated',
  'algebraic',
  'objects'],
 ['gave',
  'mathematics',
  'whole',
  'new',
  'development',
  'path',
  'much',
  'broader',
  'concept',
  'existed',
  'provided',
  'vehicle',
  'future',
  'development',
  'subject'],
 ['another',
  'important',
  'aspect',
  'introduction',
  'algebraic',
  'ideas',
  'allowed',
  'mathematics',
  'applied',
  'itselfin',
  'way',
  'happened']]

In [70]:
model = Word2Vec(sentences=english_corpus_tokens, max_vocab_size=100, window=5, min_count=1, workers=4)
model.save("word2vec.model")

***min_count*** is a frequency threshold: any word that appears fewer than `min_count` times is dropped from the vocabulary. Because our corpus is very small, we use `min_count=1` so that every word is kept.
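The effect of the threshold can be sketched without gensim. The helper `filter_by_min_count` and the toy sentences are hypothetical; gensim applies its own pruning internally while building the vocabulary, so this only illustrates the idea.

```python
from collections import Counter


def filter_by_min_count(tokenized_sentences, min_count):
    """Drop every word whose total corpus frequency is below min_count."""
    counts = Counter(word for sentence in tokenized_sentences for word in sentence)
    return [
        [word for word in sentence if counts[word] >= min_count]
        for sentence in tokenized_sentences
    ]


# Toy tokenized corpus, assumed for this example only
toy_sentences = [
    ["mathematics", "algebra", "new"],
    ["mathematics", "geometry"],
    ["algebra", "mathematics"],
]

kept_1 = filter_by_min_count(toy_sentences, min_count=1)
kept_2 = filter_by_min_count(toy_sentences, min_count=2)

print(kept_1)  # nothing is removed
print(kept_2)  # words seen only once ("new", "geometry") are removed
```

On a corpus of only six sentences, a higher threshold would remove most of the vocabulary, which is why `min_count=1` is a reasonable choice here.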

## Find the most similar words

In [71]:
sims = model.wv.most_similar('mathematics', topn=10)
sims

[('happened', 0.21884359419345856),
 ('essentially', 0.21634718775749207),
 ('one', 0.19545789062976837),
 ('magnitudes', 0.172089621424675),
 ('provided', 0.1691863238811493),
 ('allowedrational', 0.15168647468090057),
 ('applied', 0.14179691672325134),
 ('etc', 0.12344272434711456),
 ('objects', 0.11337737739086151),
 ('way', 0.10848793387413025)]

In [72]:
for index, word in enumerate(model.wv.index_to_key):
    print(f"word #{index}/{len(model.wv.index_to_key)} is {word}")

word #0/57 is mathematics
word #1/57 is algebra
word #2/57 is important
word #3/57 is significant
word #4/57 is numbers
word #5/57 is concept
word #6/57 is algebraic
word #7/57 is new
word #8/57 is development
word #9/57 is essentially
word #10/57 is greek
word #11/57 is away
word #12/57 is move
word #13/57 is revolutionary
word #14/57 is idea
word #15/57 is understand
word #16/57 is happened
word #17/57 is unifying
word #18/57 is beginnings
word #19/57 is namely
word #20/57 is khwarizmi
word #21/57 is al
word #22/57 is work
word #23/57 is time
word #24/57 is began
word #25/57 is arabic
word #26/57 is made
word #27/57 is advances
word #28/57 is one
word #29/57 is geometry
word #30/57 is allowedrational
word #31/57 is theory
word #32/57 is way
word #33/57 is itselfin
word #34/57 is applied
word #35/57 is allowed
word #36/57 is ideas
word #37/57 is introduction
word #38/57 is aspect
word #39/57 is another
word #40/57 is subject
word #41/57 is future
word #42/57 is vehicle
word #43/57 is 

### Get the vector of a given word

In [74]:
significant_vec = model.wv["significant"]
significant_vec

array([-8.2448432e-03,  9.3017407e-03, -1.9731275e-04, -1.9713643e-03,
        4.6059717e-03, -4.0967749e-03,  2.7472277e-03,  6.9532283e-03,
        6.0662227e-03, -7.5135226e-03,  9.3805743e-03,  4.6787788e-03,
        3.9655888e-03, -6.2458627e-03,  8.4573152e-03, -2.1487135e-03,
        8.8288533e-03, -5.3552347e-03, -8.1358906e-03,  6.8190722e-03,
        1.6721791e-03, -2.2069688e-03,  9.5202010e-03,  9.4941193e-03,
       -9.7839115e-03,  2.5004668e-03,  6.1588893e-03,  3.8725769e-03,
        2.0207369e-03,  4.3002027e-04,  6.7495904e-04, -3.8218624e-03,
       -7.1397247e-03, -2.0996237e-03,  3.9335918e-03,  8.8206399e-03,
        9.2637045e-03, -5.9778020e-03, -9.4062285e-03,  9.7687161e-03,
        3.4348422e-03,  5.1636407e-03,  6.2810839e-03, -2.8027596e-03,
        7.3330300e-03,  2.8249149e-03,  2.8792685e-03, -2.3841150e-03,
       -3.1231495e-03, -2.3775839e-03,  4.2859283e-03,  7.4012700e-05,
       -9.5855510e-03, -9.6692564e-03, -6.1551640e-03, -1.2997685e-04,
      

In [77]:
print(model.wv.most_similar(positive=['new', 'gave'], topn=5))

[('introduction', 0.17697592079639435), ('irrational', 0.17059855163097382), ('namely', 0.14671754837036133), ('happened', 0.14007315039634705), ('away', 0.13959093391895294)]
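When several words are passed in `positive`, the query is, to my understanding, built by averaging the unit-normalized vectors of those words and then ranking the remaining words by cosine similarity. The sketch below reproduces that idea with the standard library. `toy_most_similar` and the two-dimensional vectors are hypothetical and are not gensim's implementation or values from the trained model.

```python
import math


def unit(vector):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def toy_most_similar(vectors, positive, topn=5):
    """Rank words by cosine similarity to the mean of the `positive` word vectors."""
    normed = {word: unit(vec) for word, vec in vectors.items()}
    dims = len(next(iter(normed.values())))
    # Average the unit vectors of the query words, then renormalize
    query = unit([
        sum(normed[word][d] for word in positive) / len(positive)
        for d in range(dims)
    ])
    # Query words themselves are excluded from the results
    scores = [
        (word, sum(a * b for a, b in zip(query, vec)))
        for word, vec in normed.items()
        if word not in positive
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up 2-D vectors for illustration only
toy_vectors = {
    "new": [1.0, 0.0],
    "gave": [0.0, 1.0],
    "introduction": [0.7, 0.7],
    "away": [-1.0, 0.2],
}

result = toy_most_similar(toy_vectors, positive=["new", "gave"], topn=2)
print(result)
```

The low scores in the notebook output (around 0.1 to 0.2) are consistent with a model trained on only six sentences, so the ranking there should not be read as meaningful.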


### Printing Dependencies

In [78]:
%load_ext watermark

In [79]:
%watermark --iversions

gensim: 4.0.1
nltk  : 3.5
re    : 2.2.1

