# Natural Language Processing

- The ability of a computer program to understand and interpret human language

- Types of text data:
    - Messages
    - Email
    - Any article on the internet

- Applications:
    - Chatbots (FAQ-based, conversational)
    - Sentiment analysis (e.g. reviews of a product or a movie on the internet)
    - Topic modelling (e.g. what are the trending topics on Quora?)
    - Machine translation
    - Text completion
    - Home assistants
    - etc.
    

- Topics:
    - Regular expressions
    - Reading text data (using a context manager)
    - Text processing
        - Tokenisation
        - Stemming & lemmatisation
        - Stop words
        - Converting text data to vectors:
            - One-hot encoding
            - Bag of Words
            - TF-IDF
            - Count Vectoriser
    - NLTK and spaCy
        - POS
        - NER
    - Training models on text data
    - Projects:
        - Email classification (spam / ham classification)
        - Sentiment analysis

In [None]:
!pip install spacy

In [None]:
# !python -m spacy download en_core_web_sm     #----> install english core library

In [5]:
!pip install nltk



In [184]:
import re
import numpy as np

In [8]:
import spacy

In [12]:
import pandas as pd

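- Terminology
    - Corpus ---> the entire dataset (a collection of documents)
    - Document ---> one text item in the corpus (e.g. one message, email or review)
    - Token ---> one unit of a document after splitting: a word or a sentence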

In [11]:
text_data = ["I like this book", "I dont like football", "I like reading stories", "I am a football fan"]
text_data   #----->corpus

['I like this book',
 'I dont like football',
 'I like reading stories',
 'I am a football fan']

In [15]:
text_series = pd.Series(text_data)
text_series

0          I like this book
1      I dont like football
2    I like reading stories
3       I am a football fan
dtype: object

    "I like this book",  #--->Document
    "I dont like football",  # ----> Document
    "I like reading stories", # -----> Document
    "I am a football fan"

    Tokenisation
        - Splitting the text into words or sentences (word tokenisation and sentence tokenisation)
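
As a dependency-free illustration of both kinds of tokenisation, here is a minimal sketch using only the `re` module. The regular expressions and the helper names `sentence_tokenize` and `word_tokenize_simple` are assumptions made for this example. They are much cruder than the spaCy and NLTK tokenisers used in this notebook (for instance, they would split wrongly after an abbreviation such as "Corp.").

```python
import re

def sentence_tokenize(text):
    """Split text into sentences after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def word_tokenize_simple(text):
    """Return runs of word characters and single punctuation marks as tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

sample = "I am liking football very much. I watch football every day"
sentences = sentence_tokenize(sample)
words = word_tokenize_simple(sample)
print(sentences)
print(words)
```

On this sample the sketch yields two sentences and twelve word tokens, with the full stop kept as its own token.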

In [16]:
text_data

['I like this book',
 'I dont like football',
 'I like reading stories',
 'I am a football fan']

    Document --> 'I like this book very much'

    Tokens:
            I, like, this, book, very, much   (unigrams)
            I like, like this, this book, book very, very much   (bigrams)
            I like this, like this book, this book very, book very much   (trigrams)
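
The n-gram lists above can be generated with a sliding window over the token list. This is a minimal sketch; the helper name `ngrams` is an assumption for this example, and plain whitespace splitting stands in for a real tokeniser.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, each joined into one string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I like this book very much".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(unigrams)
print(bigrams)
print(trigrams)
```

A document with `k` tokens has `k - n + 1` n-grams, which is why the six-word document gives five bigrams and four trigrams.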
    

In [169]:
# !python -m spacy download en_core_web_lg

# Tokenisation
    - Word Tokenisation
    - Sentence Tokenisation

In [171]:
nlp = spacy.load('en_core_web_sm')

In [19]:
text_data

['I like this book',
 'I dont like football',
 'I like reading stories',
 'I am a football fan']

In [57]:
single_text = "I am liking football very much. I watch football every day"  # Document which has 2 sentences

doc = nlp(single_text)

In [58]:
type(doc)

spacy.tokens.doc.Doc

In [59]:
type(single_text)

str

In [60]:
# Tokenize the document into sentences
tokenized_sentences = list(doc.sents)
tokenized_sentences

[I am liking football very much., I watch football every day]

In [61]:
# Word tokenisation
for sentence in doc.sents:
    for word in sentence:
        print(word)

I
am
liking
football
very
much
.
I
watch
football
every
day


In [62]:
list(doc)  # ---> word tokens

[I, am, liking, football, very, much, ., I, watch, football, every, day]

In [63]:
list(doc.sents)  # ----> sentence tokens

[I am liking football very much., I watch football every day]

In [64]:
for token in doc:
    print(token)

I
am
liking
football
very
much
.
I
watch
football
every
day


In [65]:
### Try this in NLTK

# from nltk.tokenize import sent_tokenize, word_tokenize

# sent_tokenize(single_text)
# word_tokenize(single_text)

In [66]:
print(dir(doc[2]))

['_', '__bytes__', '__class__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__len__', '__lt__', '__ne__', '__new__', '__pyx_vtable__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__unicode__', 'ancestors', 'check_flag', 'children', 'cluster', 'conjuncts', 'dep', 'dep_', 'doc', 'ent_id', 'ent_id_', 'ent_iob', 'ent_iob_', 'ent_kb_id', 'ent_kb_id_', 'ent_type', 'ent_type_', 'get_extension', 'has_dep', 'has_extension', 'has_head', 'has_morph', 'has_vector', 'head', 'i', 'idx', 'iob_strings', 'is_alpha', 'is_ancestor', 'is_ascii', 'is_bracket', 'is_currency', 'is_digit', 'is_left_punct', 'is_lower', 'is_oov', 'is_punct', 'is_quote', 'is_right_punct', 'is_sent_end', 'is_sent_start', 'is_space', 'is_stop', 'is_title', 'is_upper', 'lang', 'lang_', 'left_edge', 'lefts', 'lemma', 'lemma_', 'lex', 'lex_id', 'like_email', 'like

In [67]:
doc[2]

liking

In [68]:
doc[2].is_currency

False

In [69]:
doc[2].is_lower

True

In [70]:
doc[2].is_space

False

In [74]:
doc[2], doc[2].lemma_   # ----> lemmatisation: converts the token to its dictionary (root) form

(liking, 'like')

- Lemmatisation

        good   ----> good
        better ----> good
        best   ---> good


        go     ---> go
        going  ---> go
        gone   ---> go
        went   ---> go

In [78]:
doc[0] ,doc[0].is_alpha

(I, True)

# Part of Speech (POS)

In [93]:
single_text = 'I am liking football very much. I watched football every day'

token_list = []
lemma_list = []
pos_list   = []

for token in nlp(single_text):
    token_list.append(token)
    lemma_list.append(token.lemma_)
    pos_list.append(token.pos_)
    # print(f"{token},\t  { token.lemma_}")
    
df = pd.DataFrame({"token":token_list,
                  "lemma":lemma_list,
                  "pos":pos_list})
df

      token     lemma    pos
0         I         I   PRON
1        am        be    AUX
2    liking      like   VERB
3  football  football   NOUN
4      very      very    ADV
5      much      much    ADV
6         .         .  PUNCT
7         I         I   PRON
8   watched     watch   VERB
9  football  football   NOUN


# Named Entity Recognition (NER)

In [100]:
single_text = 'I am working at Facebook Inc in US. Facebook and Google are big companies with $200 Billion worth. Tesla is also big company in France. '

for entity in nlp(single_text).ents:
    print(f"{entity}---> {entity.label_}---> {spacy.explain(entity.label_)}")

Facebook Inc---> ORG---> Companies, agencies, institutions, etc.
US---> GPE---> Countries, cities, states
Google---> ORG---> Companies, agencies, institutions, etc.
$200 Billion---> MONEY---> Monetary values, including unit
France---> GPE---> Countries, cities, states


In [102]:
from spacy import displacy

In [103]:
displacy.render(nlp(single_text), style = "ent")

## Stemming and Lemmatisation

- Stemming (rule-based: strips or rewrites suffixes such as 'y', 'er' or 'ing' to reach a root form; the result may not be a meaningful word, and the exact output depends on the stemmer used)

        happy    --> happi
        happier  --> happi
        happiest --> happi

        ability  --> abiliti

        going    --> go
        seeing   --> see
        caring   --> car   (the expected root is 'care', so the stem is meaningless)



- Lemmatisation (gives meaningful words; takes grammar and English vocabulary into account)

        good   ----> good
        better ----> good
        best   ---> good


        go     ---> go
        going  ---> go
        gone   ---> go
        went   ---> go
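
To make the contrast concrete, here is a toy sketch of both ideas. It is not the Porter algorithm or any library stemmer: `SUFFIX_RULES`, `toy_stem`, `TOY_LEMMAS` and `toy_lemmatize` are hypothetical names, and the rules and the lookup table are hand-written only to reproduce the examples listed above.

```python
# Toy suffix-stripping stemmer: an illustration only, NOT the Porter algorithm.
SUFFIX_RULES = [("iest", "i"), ("ier", "i"), ("ing", ""), ("y", "i")]

def toy_stem(word):
    """Apply the first matching suffix rule, keeping a stem of at least 2 letters."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: len(word) - len(suffix)] + replacement
    return word

# Tiny hand-written table standing in for a real dictionary-based lemmatiser.
TOY_LEMMAS = {"better": "good", "best": "good",
              "going": "go", "gone": "go", "went": "go"}

def toy_lemmatize(word):
    """Look the word up in the table; fall back to the word itself."""
    return TOY_LEMMAS.get(word, word)

for w in ["happy", "happier", "happiest", "ability", "going", "seeing", "caring"]:
    print(w, "-->", toy_stem(w))

for w in ["better", "best", "gone", "went"]:
    print(w, "-->", toy_lemmatize(w))
```

The stemmer needs no vocabulary but produces non-words such as `happi` and the misleading `car`; the lemmatiser returns real words but only for entries its dictionary knows.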

# Filtering the words

In [106]:
earnings_text= """Microsoft Corp. today announced the following results for the quarter ended December 31, 2021, as compared to the corresponding period of last fiscal year:

·         Revenue was $51.7 billion and increased 20%
·         Operating income was $22.2 billion and increased 24%
·         Net income was $18.8 billion and increased 21%
·         Diluted earnings per share was $2.48 and increased 22%
“Digital technology is the most malleable resource at the world’s disposal to overcome constraints and reimagine everyday work and life,” said Satya Nadella, chairman and chief executive officer of Microsoft. “As tech as a percentage of global GDP continues to increase, we are innovating and investing across diverse and growing markets, with a common underlying technology stack and an operating model that reinforces a common strategy, culture, and sense of purpose.”
“Solid commercial execution, represented by strong bookings growth driven by long-term Azure commitments, increased Microsoft Cloud revenue to $22.1 billion, up 32% year over year” said Amy Hood, executive vice president and chief financial officer of Microsoft."""

In [107]:
earnings_text

'Microsoft Corp. today announced the following results for the quarter ended December 31, 2021, as compared to the corresponding period of last fiscal year:\n\n·         Revenue was $51.7 billion and increased 20%\n·         Operating income was $22.2 billion and increased 24%\n·         Net income was $18.8 billion and increased 21%\n·         Diluted earnings per share was $2.48 and increased 22%\n“Digital technology is the most malleable resource at the world’s disposal to overcome constraints and reimagine everyday work and life,” said Satya Nadella, chairman and chief executive officer of Microsoft. “As tech as a percentage of global GDP continues to increase, we are innovating and investing across diverse and growing markets, with a common underlying technology stack and an operating model that reinforces a common strategy, culture, and sense of purpose.”\n“Solid commercial execution, represented by strong bookings growth driven by long-term Azure commitments, increased Microsoft

In [109]:
doc = nlp(earnings_text)
filtered_tokens = []

for token in doc:
    if token.pos_ not in ["SPACE","PUNCT","X"]:
        filtered_tokens.append(token)
filtered_tokens[:15]

[Microsoft,
 Corp.,
 today,
 announced,
 the,
 following,
 results,
 for,
 the,
 quarter,
 ended,
 December,
 31,
 2021,
 as]

## StopWords

    - Commonly used words such as I, me, they, them, there, this, that, which carry little meaning on their own

In [352]:
text = "I will eat good food then go to sleep and then i will go and read something"

doc = nlp(text)

# Keep only the tokens that spaCy does not flag as stop words
filtered_text = [token.text for token in doc if not token.is_stop]

text_without_stop = " ".join(filtered_text)
text_without_stop

'eat good food sleep read'

# Vectorisation of the data
    - Convert the text documents into a numerical format

    - Bag of Words (BoW)
        - Count Vectorisation
        - TF-IDF (Term Frequency - Inverse Document Frequency)
    - Word2Vec

In [110]:
single_text

'I am working at Facebook Inc in US. Facebook and Google are big companies with $200 Billion worth. Tesla is also big company in France. '

### Converting a simple text corpus to a numerical matrix using CountVectorizer

In [159]:
text_data = ["I like this book", 
             "I dont like football", 
             "I am reading this book", 
             "I am a football fan"]

In [160]:
text_df = pd.DataFrame({"features":text_data, "label":[1,0,1,0]})
text_df

                 features  label
0        I like this book      1
1    I dont like football      0
2  I am reading this book      1
3     I am a football fan      0


In [116]:
from sklearn.feature_extraction.text import CountVectorizer

In [119]:
'''Count Vectorizer (hand-worked counts, one row per document)

        i   like   this   book   dont   football   am   reading   a   fan
Row1    1     1      1      1      0        0       0      0      0    0
Row2    1     1      0      0      1        1       0      0      0    0
Row3    1     0      1      1      0        0       1      1      0    0
Row4    1     0      0      0      0        1       1      0      1    1

'''

print()




In [161]:
vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(text_data)

    - Sparse matrix ---> mostly zeros with very few non-zero values, so only the non-zero entries are stored

In [127]:
text_data

['I like this book',
 'I dont like football',
 'I am reading this book',
 'I am a football fan']

In [124]:
vectors.toarray()  # ----> Text data in vectorized format

array([[0, 1, 0, 0, 0, 1, 0, 1],
       [0, 0, 1, 0, 1, 1, 0, 0],
       [1, 1, 0, 0, 0, 0, 1, 1],
       [1, 0, 0, 1, 1, 0, 0, 0]], dtype=int64)

In [126]:
vectorizer.get_feature_names_out()  # -----> 'i' and 'a' are dropped: the default token pattern keeps only tokens of 2+ characters

array(['am', 'book', 'dont', 'fan', 'football', 'like', 'reading', 'this'],
      dtype=object)
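
The matrix and feature names above can be reproduced by hand, which also shows why `i` and `a` disappear. This is a minimal sketch using only the standard library, assuming `CountVectorizer`'s default settings (lowercasing, a token pattern that keeps runs of two or more word characters, and an alphabetically sorted vocabulary); the helper name `tokenize` is an assumption for this example.

```python
import re

corpus = ["I like this book",
          "I dont like football",
          "I am reading this book",
          "I am a football fan"]

def tokenize(text):
    """Lowercase the text and keep only tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

# Vocabulary: the sorted set of all tokens seen in the corpus.
vocabulary = sorted({tok for doc in corpus for tok in tokenize(doc)})

# One row per document, one column per vocabulary word, each cell a raw count.
count_matrix = [[tokenize(doc).count(word) for word in vocabulary]
                for doc in corpus]

print(vocabulary)
for row in count_matrix:
    print(row)
```

The nested-list result is dense, whereas scikit-learn returns a sparse matrix that stores only the non-zero counts.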

In [162]:
feature_set = pd.DataFrame(data = vectors.toarray(), columns=vectorizer.get_feature_names_out())
feature_set["label"] = text_df["label"]
feature_set   #---> data prepared for machine learning model 

   am  book  dont  fan  football  like  reading  this  label
0   0     1     0    0         0     1        0     1      1
1   0     0     1    0         1     1        0     0      0
2   1     1     0    0         0     0        1     1      1
3   1     0     0    1         1     0        0     0      0


In [163]:
from sklearn import svm
clf_svm = svm.SVC(kernel = "linear")

X_train =  vectors.toarray()
y_train = text_df["label"]

clf_svm.fit(X_train, y_train)

SVC(kernel='linear')

In [145]:
y_train

0    1
1    0
2    1
3    0
Name: label, dtype: int64

In [157]:
text_data

['I like this book',
 'I dont like football',
 'I am reading this book',
 'I am a football fan']

In [172]:
pred_text = ["I like this book","People watch football", "i  read books","football is good","books"]
pred_vector = (vectorizer.transform(pred_text)).toarray()


# print(pred_vector)

clf_svm.predict(pred_vector)

array([1, 0, 0, 0, 0], dtype=int64)

In [139]:
pred_vector

<3x8 sparse matrix of type '<class 'numpy.int64'>'
	with 2 stored elements in Compressed Sparse Row format>

#### Another Example

In [147]:


train_x = ["i love the book", "this is a great book", "the fit is great", "i love the shoes" ]
train_y = ["BOOKS", "BOOKS", "CLOTHING", "CLOTHING"]


vectorizer = CountVectorizer()
train_x_vectors = vectorizer.fit_transform(train_x)

In [148]:
from sklearn import svm
clf_svm = svm.SVC(kernel = "linear")
clf_svm.fit(train_x_vectors, train_y)

SVC(kernel='linear')

In [156]:
test_x = ["shoes fit well", "i love to read this","interesting book to read", "i like the story"]
test_x_vectors = vectorizer.transform(test_x)
clf_svm.predict(test_x_vectors)

array(['CLOTHING', 'BOOKS', 'BOOKS', 'CLOTHING'], dtype='<U8')

    - Disadvantages of Count Vectorisation:
            - It only considers words that are present in the training data; words it has not seen are ignored.
            - It doesn't capture the semantic relationship between words (e.g. "book" and "chapter" are related, as are "shoes" and "shirt").
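
The first disadvantage is easy to see with a fixed vocabulary. In this standard-library sketch (the helper names `tokenize` and `to_counts` are assumptions for this example), a sentence made only of unseen words turns into an all-zero vector, so a classifier gets no signal from it.

```python
import re

train_docs = ["i love the book", "this is a great book",
              "the fit is great", "i love the shoes"]

def tokenize(text):
    """Lowercase the text and keep only tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

# The vocabulary is frozen after training.
vocabulary = sorted({tok for doc in train_docs for tok in tokenize(doc)})

def to_counts(text):
    """Count only words already in the training vocabulary; ignore the rest."""
    tokens = tokenize(text)
    return [tokens.count(word) for word in vocabulary]

seen = to_counts("great shoes")
unseen = to_counts("wonderful novel chapter")

print(vocabulary)
print(seen)
print(unseen)
```

"novel" and "chapter" are clearly about books, but the count vector cannot express that because neither word occurred in the training sentences.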

# TF-IDF Vectorizer

    - TF-IDF
        - Term Frequency (TF)               - number of times a word appears in a document / total number of words in that document
        - Inverse Document Frequency (IDF)  - log(total number of documents / number of documents containing the word)
        - TF-IDF                            - Term Frequency (TF) * Inverse Document Frequency (IDF)
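
The formulas above are the textbook version. The sketch below works through the variant that, as far as I know, `TfidfVectorizer` applies by default: raw counts as TF, a smoothed IDF of `ln((1 + N) / (1 + df)) + 1`, and each row scaled to unit Euclidean length. It uses only the standard library; the helper names `tokenize` and `tfidf_row` are assumptions for this example.

```python
import math
import re

corpus = ["I like this book",
          "I dont like football",
          "I am reading this book",
          "I am a football fan"]

def tokenize(text):
    """Lowercase the text and keep only tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(doc) for doc in corpus]
vocabulary = sorted({tok for doc in tokenized for tok in doc})
n_docs = len(corpus)

# Document frequency: number of documents that contain each word.
df = {word: sum(word in doc for doc in tokenized) for word in vocabulary}

# Smoothed IDF (assumed scikit-learn default): ln((1 + N) / (1 + df)) + 1
idf = {word: math.log((1 + n_docs) / (1 + df[word])) + 1 for word in vocabulary}

def tfidf_row(tokens):
    """Multiply raw counts by IDF, then scale the row to unit Euclidean length."""
    raw = [tokens.count(word) * idf[word] for word in vocabulary]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm if norm else 0.0 for v in raw]

tfidf_matrix = [tfidf_row(doc) for doc in tokenized]

for row in tfidf_matrix:
    print([round(v, 6) for v in row])
```

Words that occur in only one document (`dont`, `fan`, `reading`) receive a higher weight than words shared by two documents, which is the point of the IDF term.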
       

In [177]:
text_data

['I like this book',
 'I dont like football',
 'I am reading this book',
 'I am a football fan']

In [173]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [174]:
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(text_data)

In [176]:
vectors.toarray()

array([[0.        , 0.57735027, 0.        , 0.        , 0.        ,
        0.57735027, 0.        , 0.57735027],
       [0.        , 0.        , 0.66767854, 0.        , 0.52640543,
        0.52640543, 0.        , 0.        ],
       [0.46580855, 0.46580855, 0.        , 0.        , 0.        ,
        0.        , 0.59081908, 0.46580855],
       [0.52640543, 0.        , 0.        , 0.66767854, 0.52640543,
        0.        , 0.        , 0.        ]])

In [178]:
vectorizer.get_feature_names_out() 

array(['am', 'book', 'dont', 'fan', 'football', 'like', 'reading', 'this'],
      dtype=object)

In [179]:
feature_set = pd.DataFrame(data = vectors.toarray(), columns=vectorizer.get_feature_names_out())
feature_set["label"] = text_df["label"]
feature_set

         am      book      dont       fan  football      like   reading      this  label
0  0.000000  0.577350  0.000000  0.000000  0.000000  0.577350  0.000000  0.577350      1
1  0.000000  0.000000  0.667679  0.000000  0.526405  0.526405  0.000000  0.000000      0
2  0.465809  0.465809  0.000000  0.000000  0.000000  0.000000  0.590819  0.465809      1
3  0.526405  0.000000  0.000000  0.667679  0.526405  0.000000  0.000000  0.000000      0


In [180]:
from sklearn import svm
clf_svm = svm.SVC(kernel = "linear")

X_train =  vectors.toarray()
y_train = text_df["label"]

clf_svm.fit(X_train, y_train)

SVC(kernel='linear')

In [189]:
pred_text = ["I like this book","People watch football", "i  read books","football is good","books", 'I am reading this book']
pred_vector = (vectorizer.transform(pred_text)).toarray()


# print(pred_vector)

clf_svm.predict(pred_vector)

array([1, 0, 0, 0, 0, 1], dtype=int64)

    - Disadvantages of TF-IDF (the same as for Count Vectorisation):
            - It only considers words that are present in the training data; words it has not seen are ignored.
            - It doesn't capture the semantic relationship between words (e.g. "book" and "chapter" are related, as are "shoes" and "shirt").

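# Word2Vec
    - Represents each word as a dense vector (embedding), so that words used in similar contexts get similar vectors
    - Captures the semantic relationship between words
    - Can handle words that are not in the classifier's training data, as long as the embedding model knows them
    - Gives a similarity score between words

    'KING - MAN + WOMAN = QUEEN'

    - Note: the cells below use spaCy's built-in vectors (`doc.vector`, by default the average of the token vectors) instead of training a Word2Vec model. The small `en_core_web_sm` pipeline loaded earlier does not ship pretrained word vectors; `en_core_web_md` and `en_core_web_lg` do.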
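
The analogy can be sketched with vector arithmetic and cosine similarity. The two-dimensional vectors below are made up by hand for this example (first axis "royalty", second axis "gender"); real embeddings are learned from text and have many more dimensions. The names `toy_vectors` and `cosine_similarity` are assumptions for this example.

```python
import math

# Hand-made toy vectors (royalty, gender): an assumption, not learned embeddings.
toy_vectors = {
    "king":  [1.0,  1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0,  1.0],
    "woman": [0.0, -1.0],
    "apple": [0.1,  0.1],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# king - man + woman, computed component by component.
target = [k - m + w for k, m, w in zip(toy_vectors["king"],
                                       toy_vectors["man"],
                                       toy_vectors["woman"])]

# Search the remaining words for the one closest to the target vector.
candidates = {word: vec for word, vec in toy_vectors.items()
              if word not in {"king", "man", "woman"}}
best_match = max(candidates, key=lambda word: cosine_similarity(target, candidates[word]))

print(target)
print(best_match)
```

Excluding the three input words from the candidates is a common convention for analogy queries, since otherwise one of the inputs is often the nearest neighbour.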

In [282]:
text_data = ['I like this book',
             'I dont like football',
             "Books are great for learning",
             'Football is a good sport']

In [283]:
text = "I love football"

nlp(text).vector

array([ 0.10108825,  0.1753288 , -0.46511328, -0.58186936, -0.24014316,
        0.18683529,  0.1151624 ,  0.2941464 ,  0.65600866, -0.15250449,
        0.13406175, -0.53378177,  0.7110953 , -0.19966024,  0.03958007,
       -0.47140816, -0.03733912,  0.2771263 , -0.8190035 , -0.4270077 ,
        0.7018841 , -0.60961455,  0.7150829 ,  0.45296928,  1.0245095 ,
        0.40628394,  1.095893  , -0.83086866,  0.4250836 ,  0.10669468,
       -0.2223094 ,  0.28379348, -0.39372268,  0.2548144 , -0.08345608,
       -0.44437304, -0.44359565, -0.35237408, -0.2669944 ,  0.22077931,
       -0.19576967,  0.65629363,  0.5119625 , -0.20124204,  0.06920663,
       -0.3317858 ,  0.3664886 ,  0.49511528, -0.7734642 , -0.41126522,
       -0.16300905, -0.09808165,  0.03496218,  0.8300558 , -0.56442803,
       -0.0997112 , -0.37365708,  0.4113036 ,  0.254392  , -0.02073604,
       -0.58385825, -0.09051654, -0.24396247,  0.0168677 , -0.81754273,
       -1.1022238 , -0.5932955 , -0.04311176,  0.41030392, -0.74

In [284]:
docs = [ nlp(x) for x in   text_data]

docs

[I like this book,
 I dont like football,
 Books are great for learning,
 Football is a good sport]

In [285]:
X_train_w2v = []
for sent in docs:
    X_train_w2v.append(sent.vector)

In [286]:
text_data

['I like this book',
 'I dont like football',
 'Books are great for learning',
 'Football is a good sport']

In [287]:
y_train = ["BOOKS","FOOTBALL","BOOKS","FOOTBALL"]

In [288]:
y_train

['BOOKS', 'FOOTBALL', 'BOOKS', 'FOOTBALL']

In [289]:
# X_train_w2v

In [290]:
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier()
rf.fit(X_train_w2v, y_train)
# clf_svm_w2v = svm.SVC(kernel = "linear")
# clf_svm_w2v.fit(X_train_w2v, train_y)

RandomForestClassifier()

### prediction

In [292]:
pred_text = ["I like this book",
             "People watch football", 
             "i  read books",
             "football is good",
             "books", 
             'I am reading this book',
            "i like this chapter",
            "i am reading lesson 3",
            "The page is good",
            "Book content is good",
            "Books are good for knowledge",
            "Sports is good for health"]
X_test_w2v = []

for sent in [ nlp(x) for x in  pred_text]:
    X_test_w2v.append(sent.vector)
    
    
preds = rf.predict(X_test_w2v)
result = pd.DataFrame({"Input":pred_text, "Prediction":preds})
result

                    Input Prediction
0        I like this book      BOOKS
1   People watch football   FOOTBALL
2            i read books      BOOKS
3        football is good      BOOKS
4                   books      BOOKS
5  I am reading this book      BOOKS
6     i like this chapter      BOOKS
7   i am reading lesson 3      BOOKS
8        The page is good      BOOKS
9    Book content is good      BOOKS


### Another classification example

In [319]:


train_x = ["i love the book", "this is a great book", "the shirt is good to  wear", "the shoes are nice"]
train_y = ["BOOKS", "BOOKS", "CLOTHING", "CLOTHING"]



# train_x = ["love book", " great book", " shirt good wear", "shoes nice"] ---> after removing the stop words

train_x_vectors = []
for sent in [ nlp(x) for x in  train_x]:
    train_x_vectors.append(sent.vector)


In [332]:
docs[0][2], docs[0][1].is_stop

(the, True)

In [320]:
train_y

['BOOKS', 'BOOKS', 'CLOTHING', 'CLOTHING']

In [321]:
from sklearn import svm
clf_svm = svm.SVC(kernel = "linear")
clf_svm.fit(train_x_vectors, train_y)

SVC(kernel='linear')

In [322]:
test_x = ["shoes fit well",
          "i love to read this",
          "interesting book to read", 
          "i like the story",
          "pants are good fit",
         "I am writing a novel",
         "This hat is good",
         "i am reading a chapter",
         "I am wearing trousers"]


test_x_vectors = []

for sent in [ nlp(x) for x in  test_x]:
    test_x_vectors.append(sent.vector)
    
pred = clf_svm.predict(test_x_vectors)

result = pd.DataFrame({"Input":test_x, "Prediction":pred})
result


                      Input Prediction
0            shoes fit well   CLOTHING
1       i love to read this   CLOTHING
2  interesting book to read   CLOTHING
3          i like the story      BOOKS
4        pants are good fit   CLOTHING
5      I am writing a novel      BOOKS
6          This hat is good   CLOTHING
7    i am reading a chapter      BOOKS
8     I am wearing trousers   CLOTHING


In [323]:
train_x

['i love the book',
 'this is a great book',
 'the shirt is good to  wear',
 'the shoes are nice']