BOOK NAME : Bhargav-Srinivasa-Desikan-_Bhargav-Srinivasa-Desikan_-Natural-Language-Processing-and-Computational

### USING GENSIM
#### FOR CREATING BOW AND TFIDF MODELS

In [1]:
documents = [u"Football club Arsenal defeat local rivals this weekend.",
             u"Weekend football frenzy takes over London.", 
             u"Bank open for takeover bids after losing millions.",
             u"London football clubs bid to move to Wembley stadium.",
             u"Arsenal bid 50 million pounds for striker Kane.", 
             u"Financial troubles result in loss of millions for bank.", 
             u"Western bank files for bankruptcy after financial losses.", 
             u"London football club is taken over by oil millionaire from Russia.", 
             u"Banking on finances not working for Russia."]

The above is our corpus, that is, the collection of all our documents.

In [2]:
#import the required libraries
import spacy  #for preprocessing and cleaning
from gensim import corpora #for creating our models

In [6]:
#load the English language model
nlp = spacy.load('en_core_web_sm')
texts = []
for document in documents:
    doc = nlp(document)
    text = []
    for token in doc:
        #keep the lemma of every token that is not a stop word, punctuation, or a number
        if not token.is_stop and not token.is_punct and not token.like_num:
            text.append(token.lemma_)
    texts.append(text)

In [7]:
texts #gives the cleaned corpus

[['football', 'club', 'arsenal', 'defeat', 'local', 'rival', 'weekend'],
 ['weekend', 'football', 'frenzy', 'take', 'london'],
 ['bank', 'open', 'takeover', 'bid', 'lose', 'million'],
 ['london', 'football', 'club', 'bid', 'wembley', 'stadium'],
 ['arsenal', 'bid', 'pound', 'striker', 'kane'],
 ['financial', 'trouble', 'result', 'loss', 'million', 'bank'],
 ['western', 'bank', 'file', 'bankruptcy', 'financial', 'loss'],
 ['london', 'football', 'club', 'take', 'oil', 'millionaire', 'russia'],
 ['bank', 'finance', 'work', 'russia']]

In [10]:
#next we obtain a BOW model using gensim
from gensim import corpora
dictionery = corpora.Dictionary(texts)
print(dictionery.token2id)

{'arsenal': 0, 'club': 1, 'defeat': 2, 'football': 3, 'local': 4, 'rival': 5, 'weekend': 6, 'frenzy': 7, 'london': 8, 'take': 9, 'bank': 10, 'bid': 11, 'lose': 12, 'million': 13, 'open': 14, 'takeover': 15, 'stadium': 16, 'wembley': 17, 'kane': 18, 'pound': 19, 'striker': 20, 'financial': 21, 'loss': 22, 'result': 23, 'trouble': 24, 'bankruptcy': 25, 'file': 26, 'western': 27, 'millionaire': 28, 'oil': 29, 'russia': 30, 'finance': 31, 'work': 32}


**NOTE: Remember the type of input that corpora.Dictionary() expects in order to build the BOW model. The input given here is a list of lists, where each inner list contains the tokens of the corresponding document.**

In [12]:
#once we have the dictionary, we can represent each document in our corpus as (word_id, word_count) pairs
corpus = [dictionery.doc2bow(text) for text in texts]
corpus

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1)],
 [(3, 1), (6, 1), (7, 1), (8, 1), (9, 1)],
 [(10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1)],
 [(1, 1), (3, 1), (8, 1), (11, 1), (16, 1), (17, 1)],
 [(0, 1), (11, 1), (18, 1), (19, 1), (20, 1)],
 [(10, 1), (13, 1), (21, 1), (22, 1), (23, 1), (24, 1)],
 [(10, 1), (21, 1), (22, 1), (25, 1), (26, 1), (27, 1)],
 [(1, 1), (3, 1), (8, 1), (9, 1), (28, 1), (29, 1), (30, 1)],
 [(10, 1), (30, 1), (31, 1), (32, 1)]]

**Here you can see that each document is now represented as a bag of words. Each tuple stands for one token in the form (word_id, word_count).**
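To make the (word_id, word_count) representation concrete, here is a minimal pure-Python sketch of what building a dictionary and converting a document to a bag of words involves. It is not Gensim's implementation: the helper names `build_token2id` and `to_bow` are hypothetical, and assigning ids to unseen tokens in alphabetical order within each document is an assumption, chosen because it reproduces the ids printed above. The two sample documents are copied from the cleaned corpus.

```python
from collections import Counter

# First two documents of the cleaned corpus shown earlier.
sample_texts = [
    ['football', 'club', 'arsenal', 'defeat', 'local', 'rival', 'weekend'],
    ['weekend', 'football', 'frenzy', 'take', 'london'],
]

def build_token2id(tokenized_docs):
    """Map every distinct token to an integer id (hypothetical helper)."""
    token2id = {}
    for tokens in tokenized_docs:
        # Assumption: unseen tokens receive ids in sorted order per document.
        for token in sorted(set(tokens)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def to_bow(tokens, token2id):
    """Convert one tokenized document into sorted (word_id, word_count) pairs."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

token2id = build_token2id(sample_texts)
bow_corpus = [to_bow(tokens, token2id) for tokens in sample_texts]
print(token2id)
print(bow_corpus)
```

Note that the word order of the document is discarded: only which words occur, and how often, survives the conversion.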

In [17]:
#we can also easily obtain a TFIDF model using gensim
from gensim.models import tfidfmodel
tfidf = tfidfmodel.TfidfModel(corpus)

#to see our tfidf corpus
for doc in tfidf[corpus]:
    print(doc)

[(0, 0.3292179861221233), (1, 0.24046829370585296), (2, 0.4809365874117059), (3, 0.1774993848325406), (4, 0.4809365874117059), (5, 0.4809365874117059), (6, 0.3292179861221233)]
[(3, 0.24212967666975266), (6, 0.4490913847888623), (7, 0.6560530929079719), (8, 0.32802654645398593), (9, 0.4490913847888623)]
[(10, 0.18797844084016113), (11, 0.25466485399352906), (12, 0.5093297079870581), (13, 0.3486540744136096), (14, 0.5093297079870581), (15, 0.5093297079870581)]
[(1, 0.29431054749542984), (3, 0.21724253258131512), (8, 0.29431054749542984), (11, 0.29431054749542984), (16, 0.5886210949908597), (17, 0.5886210949908597)]
[(0, 0.354982288765831), (11, 0.25928712547209604), (18, 0.5185742509441921), (19, 0.5185742509441921), (20, 0.5185742509441921)]
[(10, 0.19610384738673725), (13, 0.3637247180792822), (21, 0.3637247180792822), (22, 0.3637247180792822), (23, 0.5313455887718271), (24, 0.5313455887718271)]
[(10, 0.18286519950508276), (21, 0.3391702611796705), (22, 0.3391702611796705), (25, 0.495

**Here as well, each tuple stands for one token, now in the form (word_id, tfidf_score).**

**NOTE: Again, remember the input that tfidfmodel.TfidfModel() expects, which here is the corpus of documents in BOW form.**
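The scores above can be checked by hand. The sketch below assumes the weighting is term frequency times log2(N / document frequency), followed by scaling each document to unit length; this assumption reproduces the first printed scores, but it is a plain-Python illustration rather than Gensim's own code. The helper name `tfidf_transform` is hypothetical, and `bow_corpus` is copied from the BOW output shown earlier.

```python
import math
from collections import Counter

# BOW corpus copied from the output of dictionery.doc2bow above.
bow_corpus = [
    [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1)],
    [(3, 1), (6, 1), (7, 1), (8, 1), (9, 1)],
    [(10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1)],
    [(1, 1), (3, 1), (8, 1), (11, 1), (16, 1), (17, 1)],
    [(0, 1), (11, 1), (18, 1), (19, 1), (20, 1)],
    [(10, 1), (13, 1), (21, 1), (22, 1), (23, 1), (24, 1)],
    [(10, 1), (21, 1), (22, 1), (25, 1), (26, 1), (27, 1)],
    [(1, 1), (3, 1), (8, 1), (9, 1), (28, 1), (29, 1), (30, 1)],
    [(10, 1), (30, 1), (31, 1), (32, 1)],
]

def tfidf_transform(corpus):
    """Weight each (word_id, count) pair by tf * log2(N / df), then L2-normalize."""
    num_docs = len(corpus)
    # Document frequency: in how many documents each word id occurs.
    doc_freq = Counter(word_id for doc in corpus for word_id, _ in doc)
    weighted_corpus = []
    for doc in corpus:
        raw = [(word_id, count * math.log2(num_docs / doc_freq[word_id]))
               for word_id, count in doc]
        norm = math.sqrt(sum(weight * weight for _, weight in raw))
        if norm == 0:
            weighted_corpus.append([])
        else:
            weighted_corpus.append([(word_id, weight / norm) for word_id, weight in raw])
    return weighted_corpus

tfidf_corpus = tfidf_transform(bow_corpus)
for doc in tfidf_corpus[:2]:
    print(doc)
```

Words that occur in many documents, such as football (id 3), receive a lower score than words that occur in only one, such as defeat (id 2).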

### N-GRAMS 

When working with textual data, context can be very important. As we
discussed before, we sometimes lose this context in vector representations,
knowing only the count of each word. N-grams, and in particular bi-grams,
are going to help us solve this problem, at least to some extent.

An n-gram is a contiguous sequence of n items in the text. In our case, the
items are words, but depending on the use case they could also be letters,
syllables, or, in the case of speech, phonemes. A bi-gram is an n-gram
with n = 2.

One way bi-grams are calculated in the text is by estimating the
conditional probability of a token given the preceding token. They can also
just be formed from every pair of words that appear next to each other, but it
is more useful for us to use bi-grams that are more likely to appear as a
pair. Such a bi-gram is called a collocation. What this means is that we're
trying to find pairs of words that are more likely to appear next to each
other. For example, New York or Machine Learning could be two possible
pairs of words created by bi-grams. In other words, based on the training
data (usually the corpus), we identify that the word York follows the word
New with high probability, and that it is worth considering New York as one
entity. We must be careful to get rid of stop words before running a bi-gram
model on our corpus, as meaningless bi-grams could otherwise be formed. The
Gensim bi-gram model is basically an implementation of collocation
identification.

We can clearly see how this is useful: we can now pick up phrases from
our corpus, and New York certainly provides us with more information
than the words New and York separately. This means it can be added to our
preprocessing pipeline.
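The idea of scoring adjacent word pairs can be sketched with plain counting. The example below estimates the conditional probability of a word given the preceding word, and also computes a simple count-based collocation score in the spirit of the scores used by collocation detectors. The toy sentences, the helper names, the `min_count` value, and the exact score formula are all assumptions made for illustration; this is not Gensim's implementation.

```python
from collections import Counter

# Hypothetical toy corpus, already tokenized with stop words removed.
sentences = [
    ["new", "york", "weather", "forecast"],
    ["flight", "new", "york", "delay"],
    ["new", "york", "subway", "map"],
    ["new", "phone", "release"],
    ["weather", "forecast", "rain"],
]

# Count single words and adjacent word pairs within each sentence.
unigram_counts = Counter(token for sent in sentences for token in sent)
bigram_counts = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))

def conditional_probability(first, second):
    """Estimate P(second | first) from adjacent-pair counts."""
    return bigram_counts[(first, second)] / unigram_counts[first]

def collocation_score(first, second, min_count=1):
    """Assumed score: (count(a b) - min_count) * vocab_size / (count(a) * count(b))."""
    vocab_size = len(unigram_counts)
    pair_count = bigram_counts[(first, second)]
    return (pair_count - min_count) * vocab_size / (
        unigram_counts[first] * unigram_counts[second])

for pair in [("new", "york"), ("new", "phone"), ("weather", "forecast")]:
    print(pair, conditional_probability(*pair), collocation_score(*pair))
```

Subtracting `min_count` discounts pairs that were seen only a handful of times, so a pair such as new phone, observed once, does not qualify as a collocation, while new york does.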

In [21]:
from gensim.models import Phrases

bigram = Phrases(texts)

In [24]:
text = [bigram[line] for line in texts]
text

[['football', 'club', 'arsenal', 'defeat', 'local', 'rival', 'weekend'],
 ['weekend', 'football', 'frenzy', 'take', 'london'],
 ['bank', 'open', 'takeover', 'bid', 'lose', 'million'],
 ['london', 'football', 'club', 'bid', 'wembley', 'stadium'],
 ['arsenal', 'bid', 'pound', 'striker', 'kane'],
 ['financial', 'trouble', 'result', 'loss', 'million', 'bank'],
 ['western', 'bank', 'file', 'bankruptcy', 'financial', 'loss'],
 ['london', 'football', 'club', 'take', 'oil', 'millionaire', 'russia'],
 ['bank', 'finance', 'work', 'russia']]

Each line now has any detected bi-grams merged into single tokens. It should be noted
that our toy example is too small for any bi-grams to be detected, so the output above
is identical to texts; on such a small corpus, any bi-grams that did form would likely be meaningless.
Since creating new phrases adds words to our dictionary, this step must be done before we create our dictionary. Hence, once the bi-grams have been generated, we proceed to create our dictionary as done at the start of this module.
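To see why phrase detection has to come before dictionary creation, consider what happens once a collocation has been accepted. The sketch below merges accepted pairs into single tokens; the helper `merge_bigrams`, the set of accepted collocations, the sample sentence, and the underscore delimiter are all assumptions for illustration rather than Gensim's own code.

```python
def merge_bigrams(tokens, collocations, delimiter="_"):
    """Join each accepted adjacent pair into one token, scanning left to right."""
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in collocations:
            merged.append(delimiter.join(pair))
            i += 2  # skip both words of the merged pair
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Hypothetical set of pairs that a collocation model has accepted.
accepted = {("new", "york")}
sentence = ["flight", "new", "york", "delay"]

merged_sentence = merge_bigrams(sentence, accepted)
print(merged_sentence)

# The merged token is a new vocabulary entry that the original tokens lack.
print(set(merged_sentence) - set(sentence))
```

A dictionary built from the unmerged tokens would have no id for the merged token, which is why the dictionary is created only after this step.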

**One popular preprocessing technique involves removing both high-frequency and low-frequency words. We can do this in Gensim with the dictionary module. Let's say we would like to get rid of words that occur in fewer than 20 documents, or in more than 50% of the documents; we would add the following:**

In [38]:
dictionery.filter_extremes(no_below = 20, no_above = 0.5)
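With only nine documents, no word can occur in 20 of them, so the thresholds above would empty the toy dictionary. The plain-Python sketch below applies the same idea with a smaller lower bound so the effect is visible. The helper `keep_tokens_by_doc_freq` is hypothetical, and treating both bounds as inclusive is an assumption; `texts` is copied from the cleaned corpus shown earlier.

```python
from collections import Counter

# Cleaned corpus copied from the output of the preprocessing step.
texts = [
    ['football', 'club', 'arsenal', 'defeat', 'local', 'rival', 'weekend'],
    ['weekend', 'football', 'frenzy', 'take', 'london'],
    ['bank', 'open', 'takeover', 'bid', 'lose', 'million'],
    ['london', 'football', 'club', 'bid', 'wembley', 'stadium'],
    ['arsenal', 'bid', 'pound', 'striker', 'kane'],
    ['financial', 'trouble', 'result', 'loss', 'million', 'bank'],
    ['western', 'bank', 'file', 'bankruptcy', 'financial', 'loss'],
    ['london', 'football', 'club', 'take', 'oil', 'millionaire', 'russia'],
    ['bank', 'finance', 'work', 'russia'],
]

def keep_tokens_by_doc_freq(tokenized_docs, no_below, no_above):
    """Return tokens found in at least no_below documents and at most
    no_above (a fraction) of all documents."""
    num_docs = len(tokenized_docs)
    # Document frequency: count each token once per document it appears in.
    doc_freq = Counter(token for tokens in tokenized_docs for token in set(tokens))
    max_docs = no_above * num_docs
    return {token for token, df in doc_freq.items() if no_below <= df <= max_docs}

kept = keep_tokens_by_doc_freq(texts, no_below=2, no_above=0.5)
print(sorted(kept))
print(len(kept), "of", len({t for doc in texts for t in doc}), "tokens kept")
```

On a real corpus the thresholds are tuned to its size; values such as 20 documents only make sense when there are thousands of documents.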

### POS TAGGING

In [67]:
import spacy
from pprint import pprint

In [60]:
#load the English language model
nlp = spacy.load('en_core_web_sm')

In [61]:
#example sentences to be POS tagged
sent_0 = nlp(u'Mathieu and I went to the park.')
sent_1 = nlp(u'If Clement was asked to take out the garbage, he would refuse.')
sent_2 = nlp(u'Baptiste was in charge of the refuse treatment center.')
sent_3 = nlp(u'Marie took out her rather suspicious and fishy cat to go fish for fish.')

In [70]:
for token in sent_0:
    pprint((token.text, token.pos_, token.tag_))

('Mathieu', 'PROPN', 'NNP')
('and', 'CCONJ', 'CC')
('I', 'PRON', 'PRP')
('went', 'VERB', 'VBD')
('to', 'ADP', 'IN')
('the', 'DET', 'DT')
('park', 'NOUN', 'NN')
('.', 'PUNCT', '.')


In [71]:
for token in sent_1:
    print((token.text, token.pos_, token.tag_))

('If', 'ADP', 'IN')
('Clement', 'PROPN', 'NNP')
('was', 'VERB', 'VBD')
('asked', 'VERB', 'VBN')
('to', 'PART', 'TO')
('take', 'VERB', 'VB')
('out', 'PART', 'RP')
('the', 'DET', 'DT')
('garbage', 'NOUN', 'NN')
(',', 'PUNCT', ',')
('he', 'PRON', 'PRP')
('would', 'VERB', 'MD')
('refuse', 'VERB', 'VB')
('.', 'PUNCT', '.')


In [72]:
for token in sent_2:
    print((token.text, token.pos_, token.tag_))

('Baptiste', 'PROPN', 'NNP')
('was', 'VERB', 'VBD')
('in', 'ADP', 'IN')
('charge', 'NOUN', 'NN')
('of', 'ADP', 'IN')
('the', 'DET', 'DT')
('refuse', 'ADJ', 'JJ')
('treatment', 'NOUN', 'NN')
('center', 'NOUN', 'NN')
('.', 'PUNCT', '.')


In [73]:
sent_4 = nlp(u"Harshal is very good at NLP.")
for token in sent_4:
    print((token.text, token.pos_, token.tag_))

('Harshal', 'PROPN', 'NNP')
('is', 'VERB', 'VBZ')
('very', 'ADV', 'RB')
('good', 'ADJ', 'JJ')
('at', 'ADP', 'IN')
('NLP', 'PROPN', 'NNP')
('.', 'PUNCT', '.')


In [74]:
for token in sent_3:
    print((token.text, token.pos_, token.tag_))

('Marie', 'PROPN', 'NNP')
('took', 'VERB', 'VBD')
('out', 'PART', 'RP')
('her', 'PRON', 'PRP')
('rather', 'ADV', 'RB')
('suspicious', 'ADJ', 'JJ')
('and', 'CCONJ', 'CC')
('fishy', 'ADJ', 'JJ')
('cat', 'NOUN', 'NN')
('to', 'PART', 'TO')
('go', 'VERB', 'VB')
('fish', 'NOUN', 'NN')
('for', 'ADP', 'IN')
('fish', 'NOUN', 'NN')
('.', 'PUNCT', '.')
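The printed tuples can be post-processed with ordinary Python. As a small illustration, the sketch below copies the (text, pos, tag) tuples from the output above into a list and tallies the coarse-grained tags; the variable names are hypothetical, and no spaCy call is involved.

```python
from collections import Counter

# (text, coarse POS, fine-grained tag) tuples copied from the output for sent_3.
tagged_tokens = [
    ('Marie', 'PROPN', 'NNP'), ('took', 'VERB', 'VBD'), ('out', 'PART', 'RP'),
    ('her', 'PRON', 'PRP'), ('rather', 'ADV', 'RB'), ('suspicious', 'ADJ', 'JJ'),
    ('and', 'CCONJ', 'CC'), ('fishy', 'ADJ', 'JJ'), ('cat', 'NOUN', 'NN'),
    ('to', 'PART', 'TO'), ('go', 'VERB', 'VB'), ('fish', 'NOUN', 'NN'),
    ('for', 'ADP', 'IN'), ('fish', 'NOUN', 'NN'), ('.', 'PUNCT', '.'),
]

# Tally how often each coarse-grained tag occurs in the sentence.
pos_counts = Counter(pos for _, pos, _ in tagged_tokens)
print(pos_counts.most_common())

# Collect every tag assigned to the surface form "fish".
fish_tags = [(pos, tag) for text, pos, tag in tagged_tokens if text == 'fish']
print(fish_tags)
```

Both occurrences of fish receive the same tag in this output, even though the first one, in "go fish", is used as a verb, which shows that a statistical tagger can mislabel ambiguous words.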


In [78]:
from spacy import displacy

In [84]:
displacy.render(sent_0, style = 'ent', jupyter = True)

In [86]:
displacy.render(sent_1, style = 'dep', jupyter = True)