# Introduction to Natural Language Processing

## The Plan

![](https://cdn-images-1.medium.com/max/2400/1*BiVCmiQtCBIdBNcaOKjurg.png)

## Text Representation

* Machines cannot understand words in their natural form.
* Encoding words into numeric form solves this problem.
* The process of converting textual information into numbers is called vectorization.
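
As a minimal sketch of the idea, the snippet below gives each distinct word an integer id and encodes a sentence as a list of ids. The helper names `build_vocab` and `encode` are invented for this illustration and do not come from any library.

```python
def build_vocab(sentences):
    """Map each distinct lowercase word to an integer id, in order of first appearance."""
    vocab = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab


def encode(sentence, vocab):
    """Replace each known word with its id; words missing from the vocabulary are skipped."""
    return [vocab[w] for w in sentence.lower().split() if w in vocab]


vocab = build_vocab(["the cat sat", "the dog sat"])
encoded = encode("the dog sat", vocab)
print(vocab)
print(encoded)
```

Plain ids like these carry no notion of frequency or meaning, which is what the representations below add.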

## Things we will cover in text representation
* Bag of Words
    * Count Vectors
    * Tf-idf Vectors
* Word embeddings

In [1]:
data = [
    "it was the best of times, it was the worst of times",
    "it was the age of wisdom, it was the age of foolishness",
    "it was the epoch of belief, it was the epoch of incredulity",
    "it was the season of Light, it was the season of Darkness",
]
data

['it was the best of times, it was the worst of times',
 'it was the age of wisdom, it was the age of foolishness',
 'it was the epoch of belief, it was the epoch of incredulity',
 'it was the season of Light, it was the season of Darkness']

## Bag of Words 

### Count Vectors

In [2]:
from sklearn.feature_extraction.text import CountVectorizer

In [3]:
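def count_vec(text):
    # Learn the vocabulary, then count each term's occurrences per document
    vectorizer = CountVectorizer()
    vectorizer.fit(text)
    doc_term_matrix = vectorizer.transform(text)
    # Return a dense array plus the term -> column index mapping
    return doc_term_matrix.toarray(), vectorizer.vocabulary_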

In [4]:
vector, vocab = count_vec(data)

In [5]:
vocab

{'it': 7,
 'was': 13,
 'the': 11,
 'best': 2,
 'of': 9,
 'times': 12,
 'worst': 15,
 'age': 0,
 'wisdom': 14,
 'foolishness': 5,
 'epoch': 4,
 'belief': 1,
 'incredulity': 6,
 'season': 10,
 'light': 8,
 'darkness': 3}

In [6]:
vector

array([[0, 0, 1, 0, 0, 0, 0, 2, 0, 2, 0, 2, 2, 2, 0, 1],
       [2, 0, 0, 0, 0, 1, 0, 2, 0, 2, 0, 2, 0, 2, 1, 0],
       [0, 1, 0, 0, 2, 0, 1, 2, 0, 2, 0, 2, 0, 2, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 2, 1, 2, 2, 2, 0, 2, 0, 0]])
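
To make the matrix above easier to read, here is a minimal sketch that rebuilds its first row without scikit-learn. It assumes `CountVectorizer`'s default behaviour (lowercasing, tokens of two or more word characters, columns in alphabetical order). The names `docs`, `tokenize`, `terms` and `count_vector` are invented for this sketch.

```python
import re
from collections import Counter

docs = [
    "it was the best of times, it was the worst of times",
    "it was the age of wisdom, it was the age of foolishness",
    "it was the epoch of belief, it was the epoch of incredulity",
    "it was the season of Light, it was the season of Darkness",
]


def tokenize(text):
    # Lowercase, then keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())


# One column per distinct term, in alphabetical order
terms = sorted({tok for doc in docs for tok in tokenize(doc)})


def count_vector(text):
    # Count each term in the document; terms that are absent count as 0
    counts = Counter(tokenize(text))
    return [counts[term] for term in terms]


first_row = count_vector(docs[0])
print(terms)
print(first_row)
```

Each position in `first_row` lines up with the index stored for that word in `vocab`, so column 7 is the count of "it" and column 12 the count of "times".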

#### Advantages
* Simple to understand and implement
* Provides information about word presence and frequency

#### Disadvantages
* Sparsity: most entries in each vector are zero
* Word order is lost
* All words are treated as equally important
* No information about relations between words
* No context
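
The loss of word order is easy to see with a small sketch: two sentences with opposite meanings produce exactly the same bag of words. The example sentences are made up for illustration.

```python
from collections import Counter

a = "the dog bit the man"
b = "the man bit the dog"

# A bag of words only records how often each word occurs
bag_a = Counter(a.split())
bag_b = Counter(b.split())

print(bag_a)
print(bag_a == bag_b)
```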


### Tf-idf Vectorization

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [8]:
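def tf_idf_vec(text):
    # Learn the vocabulary and idf weights, then weight each term per document
    vectorizer = TfidfVectorizer()
    vectorizer.fit(text)
    doc_term_matrix = vectorizer.transform(text)
    # Return a dense array plus the term -> column index mapping
    return doc_term_matrix.toarray(), vectorizer.vocabulary_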

In [9]:
vector,vocab = tf_idf_vec(data)

In [10]:
vocab

{'it': 7,
 'was': 13,
 'the': 11,
 'best': 2,
 'of': 9,
 'times': 12,
 'worst': 15,
 'age': 0,
 'wisdom': 14,
 'foolishness': 5,
 'epoch': 4,
 'belief': 1,
 'incredulity': 6,
 'season': 10,
 'light': 8,
 'darkness': 3}

In [11]:
vector

array([[0.        , 0.        , 0.31072843, 0.        , 0.        ,
        0.        , 0.        , 0.32430197, 0.        , 0.32430197,
        0.        , 0.32430197, 0.62145686, 0.32430197, 0.        ,
        0.31072843],
       [0.62145686, 0.        , 0.        , 0.        , 0.        ,
        0.31072843, 0.        , 0.32430197, 0.        , 0.32430197,
        0.        , 0.32430197, 0.        , 0.32430197, 0.31072843,
        0.        ],
       [0.        , 0.31072843, 0.        , 0.        , 0.62145686,
        0.        , 0.31072843, 0.32430197, 0.        , 0.32430197,
        0.        , 0.32430197, 0.        , 0.32430197, 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , 0.31072843, 0.        ,
        0.        , 0.        , 0.32430197, 0.31072843, 0.32430197,
        0.62145686, 0.32430197, 0.        , 0.32430197, 0.        ,
        0.        ]])
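
The numbers above can be reproduced by hand. The sketch below assumes `TfidfVectorizer`'s defaults: raw term counts, a smoothed idf of `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each row. The names `docs`, `tokenize`, `idf` and `tfidf_vector` are invented for this sketch.

```python
import math
import re
from collections import Counter

docs = [
    "it was the best of times, it was the worst of times",
    "it was the age of wisdom, it was the age of foolishness",
    "it was the epoch of belief, it was the epoch of incredulity",
    "it was the season of Light, it was the season of Darkness",
]


def tokenize(text):
    # Lowercase, then keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())


token_lists = [tokenize(d) for d in docs]
terms = sorted({t for toks in token_lists for t in toks})
n_docs = len(docs)

# Document frequency: in how many documents does each term occur?
df = Counter(t for toks in token_lists for t in set(toks))

# Smoothed inverse document frequency
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in terms}


def tfidf_vector(tokens):
    # Term count times idf, then scale the row to unit Euclidean length
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in terms]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]


row0 = tfidf_vector(token_lists[0])
print([round(x, 8) for x in row0])
```

Words that occur in every document ("it", "was", "the", "of") get the minimum idf of 1, while a word such as "times" that occurs in only one document gets a larger weight. That is why "times" scores about 0.62 even though it appears as often as "it", which scores about 0.32.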

#### Advantages
* Simple to understand and implement
* Provides information about word presence and frequency
* Provides information about word importance

#### Disadvantages
* Sparsity: most entries in each vector are zero
* Word order is lost
* No information about relations between words
* No context

## Word Embeddings

* http://ronxin.github.io/wevi/
* https://projector.tensorflow.org/

### Implementation of Word2Vec via Gensim

In [12]:
import nltk
nltk.download('abc')

[nltk_data] Downloading package abc to
[nltk_data]     /home/parasmehan123/nltk_data...
[nltk_data]   Unzipping corpora/abc.zip.


True

In [13]:
from nltk.corpus import abc

In [14]:
import gensim

In [15]:
# splitting sentences into tokens, Word2Vec Model takes list of lists as input where each sublist contains
# tokens for a sentence.
abc.sents()

[['PM', 'denies', 'knowledge', 'of', 'AWB', 'kickbacks', 'The', 'Prime', 'Minister', 'has', 'denied', 'he', 'knew', 'AWB', 'was', 'paying', 'kickbacks', 'to', 'Iraq', 'despite', 'writing', 'to', 'the', 'wheat', 'exporter', 'asking', 'to', 'be', 'kept', 'fully', 'informed', 'on', 'Iraq', 'wheat', 'sales', '.'], ['Letters', 'from', 'John', 'Howard', 'and', 'Deputy', 'Prime', 'Minister', 'Mark', 'Vaile', 'to', 'AWB', 'have', 'been', 'released', 'by', 'the', 'Cole', 'inquiry', 'into', 'the', 'oil', 'for', 'food', 'program', '.'], ...]

In [16]:
# List all word tokens in the 'abc' corpus
nltk.corpus.abc.words()

['PM', 'denies', 'knowledge', 'of', 'AWB', 'kickbacks', ...]

In [17]:
# Train a Word2Vec model on the tokenized sentences of the corpus
model = gensim.models.Word2Vec(abc.sents())

In [18]:
# Words in the trained model's vocabulary
# (gensim 4.x renamed `wv.vocab` to `wv.key_to_index`)
X = list(model.wv.vocab)
X

['PM',
 'denies',
 'knowledge',
 'of',
 'AWB',
 'kickbacks',
 'The',
 'Prime',
 'Minister',
 'has',
 'denied',
 'he',
 'knew',
 'was',
 'paying',
 'to',
 'Iraq',
 'despite',
 'writing',
 'the',
 'wheat',
 'exporter',
 'asking',
 'be',
 'kept',
 'fully',
 'informed',
 'on',
 'sales',
 '.',
 'Letters',
 'from',
 'John',
 'Howard',
 'and',
 'Deputy',
 'Mark',
 'Vaile',
 'have',
 'been',
 'released',
 'by',
 'Cole',
 'inquiry',
 'into',
 'oil',
 'for',
 'food',
 'program',
 'In',
 'one',
 'letters',
 'Mr',
 'asks',
 'managing',
 'director',
 'Andrew',
 'Lindberg',
 'remain',
 'in',
 'close',
 'contact',
 'with',
 'Government',
 'Opposition',
 "'",
 's',
 'Gavan',
 'O',
 'Connor',
 'says',
 'letter',
 'sent',
 '2002',
 ',',
 'same',
 'time',
 'though',
 'a',
 'trucking',
 'company',
 'He',
 'can',
 'longer',
 'wipe',
 'its',
 'hands',
 'illicit',
 'payments',
 'which',
 '$',
 '290',
 'million',
 '"',
 'responsibility',
 'this',
 'must',
 'lay',
 'may',
 'at',
 'feet',
 'Coalition',
 'minist

In [19]:
data = model.wv.most_similar('science')
data

[('law', 0.9384385347366333),
 ('policy', 0.9277134537696838),
 ('agriculture', 0.9268344640731812),
 ('general', 0.9261758923530579),
 ('media', 0.9194194674491882),
 ('practice', 0.9192597270011902),
 ('discussion', 0.913200318813324),
 ('reservoir', 0.9107450246810913),
 ('heritage', 0.9105319976806641),
 ('board', 0.9103513956069946)]

In [20]:
data = model.wv.most_similar('AWB')
data

[('Federal', 0.8237030506134033),
 ('Court', 0.8147238492965698),
 ('government', 0.8067227602005005),
 ('inquiry', 0.8026444315910339),
 ('company', 0.7983270883560181),
 ('Government', 0.7962547540664673),
 ('exporter', 0.7610700130462646),
 ('Labor', 0.7486500144004822),
 ('veto', 0.7433414459228516),
 ('tabled', 0.7348763942718506)]

In [21]:
similarity_two_words = model.wv.similarity('science','AWB')
print(similarity_two_words)

0.5374271026077122
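
The score above is a cosine similarity: the dot product of the two word vectors divided by the product of their lengths. Here is a minimal sketch of that formula on small made-up vectors. The function name `cosine_similarity` and the example vectors are illustrative only.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Vectors pointing the same way score close to 1, whatever their length
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])

# Perpendicular vectors score 0
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])

print(same_direction, orthogonal)
```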


In [22]:
try:
    similarity_two_words = model.wv.similarity('india','delhi')
except Exception as e: 
    print(e)

"word 'india' not in vocabulary"


  


### Pre-trained Word Embeddings

Several pre-trained embedding models are available, such as Google's Word2Vec, Godin, fastText, and GloVe.

In [1]:
from gensim.models import KeyedVectors

In [2]:
path= '/home/parasmehan123/AI Sumer School 12 July 2019/GoogleNews-vectors-negative300.bin' 

In [None]:
# Load vectors directly from the file
model = KeyedVectors.load_word2vec_format(path, binary=True)

In [49]:
# Access vectors for specific words with a keyed lookup:
model['easy']

array([ 3.06640625e-01,  6.83593750e-02, -1.60156250e-01,  1.19628906e-01,
       -6.56127930e-03,  4.39453125e-03,  1.44531250e-01,  6.20117188e-02,
        7.17773438e-02,  2.67333984e-02,  9.91210938e-02, -2.30712891e-02,
        5.66406250e-02, -1.74804688e-01, -5.32226562e-02,  8.98437500e-02,
        2.94921875e-01, -6.59179688e-02,  1.35742188e-01, -1.73828125e-01,
        7.32421875e-02,  2.08007812e-01,  7.27539062e-02,  2.19726562e-01,
       -5.02929688e-02, -1.15234375e-01, -1.80664062e-01, -4.29153442e-06,
       -1.69921875e-01, -7.61718750e-02, -4.30297852e-03,  1.71875000e-01,
        2.57812500e-01, -1.33789062e-01,  3.95507812e-02,  4.24194336e-03,
       -2.80761719e-02, -1.54296875e-01,  1.76757812e-01,  6.68945312e-02,
        2.71484375e-01, -1.43554688e-01,  4.02343750e-01, -1.19140625e-01,
       -2.58789062e-02, -5.63964844e-02,  3.78417969e-02,  4.29687500e-02,
        2.92968750e-02, -2.11181641e-02, -4.15039062e-02,  6.29882812e-02,
       -1.90429688e-02, -

In [50]:
# see the shape of the vector (300,)
model['easy'].shape

(300,)

What is so great about these vectors?

In [31]:
result = model.similar_by_vector(model['king']-model['man']+model['woman'])
result[1]

MemoryError: 

In [34]:
result = model.similar_by_vector(model['Delhi']-model['India']+model['France'])
result[1]

MemoryError: 
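
One possible cause of a `MemoryError` here is that the full set of pre-trained vectors does not fit comfortably in RAM. A hedged workaround is to load only part of the file using the `limit` argument of `load_word2vec_format`. The wrapper name `load_limited_vectors` and the default of 200,000 are arbitrary choices made for this sketch.

```python
def load_limited_vectors(path, limit=200_000):
    """Load only the first `limit` vectors from a word2vec binary file.

    The function name and the default limit are illustrative choices.
    """
    # Imported inside the function so it can be defined before gensim is installed
    from gensim.models import KeyedVectors

    return KeyedVectors.load_word2vec_format(path, binary=True, limit=limit)
```

Calling `model = load_limited_vectors(path)` would then replace the full load. Word2vec files are typically ordered from most to least frequent word, so a limit tends to keep the common words, but rarer words will be missing from the vocabulary.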

In [85]:
result = model.similar_by_vector(model['NASA']-model['USA']+model['India'])
result

[('NASA', 0.8089109659194946),
 ('Nasa', 0.7152836918830872),
 ('ISRO', 0.6773823499679565),
 ('Isro', 0.6696537733078003),
 ('astronauts', 0.6047458648681641),
 ('NASAs', 0.603725790977478),
 ('space_shuttle', 0.6025021076202393),
 ('Research_Organisation_Isro', 0.5952415466308594),
 ('spacecraft', 0.5853996276855469),
 ('orbiter', 0.5807898640632629)]
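
The query vector in these analogies often stays closest to one of its own input words, as with 'NASA' above, which is why the earlier cells read `result[1]` rather than `result[0]`. The sketch below shows the same arithmetic on made-up two-dimensional vectors and drops the input words before ranking. The `toy` vectors and the `analogy` helper are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
import math

# Made-up 2-d vectors; real embeddings have hundreds of dimensions
toy = {
    "king": [0.9, 0.8],
    "queen": [0.9, 0.2],
    "man": [0.1, 0.8],
    "woman": [0.1, 0.2],
    "apple": [0.2, 0.9],
}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def analogy(a, b, c, vectors):
    """Rank words by similarity to vectors[a] - vectors[b] + vectors[c], excluding a, b and c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    scored = [(w, cosine(target, vectors[w])) for w in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranked = analogy("king", "man", "woman", toy)
print(ranked[0][0])
```

Gensim's `most_similar(positive=[...], negative=[...])` also leaves the input words out of its results, so it can be used in place of `similar_by_vector` when that behaviour is wanted.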

Make sentence vectors

In [33]:
# Unlike spaCy, there is no sentence-level API here: look up each token's vector one by one
vectors = [model[x] for x in "This is some text I am processing with Spacy".split(' ')]

In [35]:
import numpy as np
vectors=np.array(vectors)

In [36]:
embedding_matrix=np.vstack(vectors)

In [37]:
type(embedding_matrix)

numpy.ndarray

In [38]:
embedding_matrix

array([[-0.2890625 ,  0.19921875,  0.16015625, ...,  0.12792969,
         0.12109375, -0.22949219],
       [ 0.00704956, -0.07324219,  0.171875  , ...,  0.01123047,
         0.1640625 ,  0.10693359],
       [ 0.17871094,  0.09130859, -0.00165558, ...,  0.125     ,
         0.08056641,  0.01672363],
       ...,
       [-0.09033203,  0.04394531,  0.11621094, ..., -0.3359375 ,
        -0.15234375,  0.00254822],
       [-0.02490234,  0.02197266, -0.03540039, ...,  0.01080322,
        -0.01879883, -0.06884766],
       [ 0.06054688,  0.09326172, -0.07373047, ..., -0.07177734,
        -0.02893066, -0.02185059]], dtype=float32)

In [39]:
embedding_matrix.shape

(9, 300)
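
The matrix above holds one vector per token. One common and simple way to turn it into a single sentence vector is to average the token vectors, skipping any token the model does not know. The sketch below uses a plain dictionary of made-up 3-d vectors as a stand-in for a real model. `average_vector` and `toy_vectors` are hypothetical names.

```python
def average_vector(tokens, lookup):
    """Average the vectors of the tokens found in `lookup`; return None if none are found."""
    found = [lookup[t] for t in tokens if t in lookup]
    if not found:
        return None
    dim = len(found[0])
    return [sum(vec[i] for vec in found) / len(found) for i in range(dim)]


# Hypothetical 3-d vectors standing in for a real embedding model
toy_vectors = {
    "this": [1.0, 0.0, 2.0],
    "is": [3.0, 2.0, 0.0],
    "text": [2.0, 4.0, 4.0],
}

sentence = "this is unknown text".split()
sentence_vec = average_vector(sentence, toy_vectors)
print(sentence_vec)
```

Averaging gives every sentence a vector of the same size, at the cost of discarding word order.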

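#### Advantages
* Simple to use through libraries such as Gensim
* Dense: each word is a short vector of real numbers (300 dimensions here) rather than a long, mostly zero one
* Provides information about relations between words
* Word order is kept when a sentence is stored as a sequence of word vectors, as in the matrix above

#### Disadvantages
* No context: each word has one fixed vector, whatever sentence it appears in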
