# NLP Project I

Introduction

Domain: Digital content management

Context: Blogs, posts, and articles are written every second, and it is hard to infer anything about a writer from the text alone. We are going to build a classifier that predicts multiple attributes of the author of a given text, framed as a multi-label classification problem.
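To make the multi-label framing concrete, here is a minimal sketch of how several author attributes can be turned into one binary indicator row per post. The two records are taken from the `blog.head(5)` preview further down; the helper name `to_label_set` is hypothetical, and scikit-learn's `MultiLabelBinarizer` produces the same kind of matrix.

```python
# Two author records copied from the blog.head(5) preview in this notebook.
authors = [
    {"gender": "male", "age": 15, "topic": "Student", "sign": "Leo"},
    {"gender": "male", "age": 33, "topic": "InvestmentBanking", "sign": "Aquarius"},
]

def to_label_set(record):
    """Turn one author's attributes into a set of string labels (hypothetical helper)."""
    return {str(record[key]) for key in ("gender", "age", "topic", "sign")}

label_sets = [to_label_set(a) for a in authors]

# Fixed, sorted label vocabulary so every row uses the same column order.
classes = sorted(set().union(*label_sets))

# One row per post, one column per label: 1 if the label applies, else 0.
indicator = [[1 if c in labels else 0 for c in classes] for labels in label_sets]

for row in indicator:
    print(row)
```

Each row has exactly four ones, one per attribute. The models trained later in this notebook use `topic` alone as the target, which is a single-label, multi-class setup.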

Data Description: Over 600,000 posts from more than 19 thousand bloggers. The Blog Authorship Corpus consists of the collected posts of 19,320 bloggers gathered from blogger.com in August 2004. The corpus incorporates a total of 681,288 posts and over 140 million words, or approximately 35 posts and 7,250 words per person. Each blog is presented as a separate file, the name of which indicates a blogger id# and the blogger's self-provided gender, age, industry, and astrological sign.

In [1]:
import random
import re
import warnings
from collections import Counter

import numpy as np
import pandas as pd
import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem.snowball import SnowballStemmer

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

import seaborn as sns
import matplotlib.pyplot as plt  # the plotting interface lives in matplotlib.pyplot
%matplotlib inline

warnings.filterwarnings('ignore')

Load the Dataset

In [2]:
blog = pd.read_csv('blogtext.csv')

In [3]:
blog.head(5)

Unnamed: 0,id,gender,age,topic,sign,date,text
0,2059027,male,15,Student,Leo,"14,May,2004","Info has been found (+/- 100 pages,..."
1,2059027,male,15,Student,Leo,"13,May,2004",These are the team members: Drewe...
2,2059027,male,15,Student,Leo,"12,May,2004",In het kader van kernfusie op aarde...
3,2059027,male,15,Student,Leo,"12,May,2004",testing!!! testing!!!
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",Thanks to Yahoo!'s Toolbar I can ...


Data Analysis - Inferences

In [4]:
blog.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 681284 entries, 0 to 681283
Data columns (total 7 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   id      681284 non-null  int64 
 1   gender  681284 non-null  object
 2   age     681284 non-null  int64 
 3   topic   681284 non-null  object
 4   sign    681284 non-null  object
 5   date    681284 non-null  object
 6   text    681284 non-null  object
dtypes: int64(2), object(5)
memory usage: 36.4+ MB


Age and id have int datatype while the others - gender, topic, sign, date and text have object datatype

In [5]:
blog.shape

(681284, 7)

The dataset is quite large with 680k+ records

In [6]:
blog.sample(10)

Unnamed: 0,id,gender,age,topic,sign,date,text
76553,1601316,female,24,indUnk,Taurus,"12,April,2004",Sometimes I'm a cold hearted Bitch....
582219,4091880,male,42,Transportation,Gemini,"12,August,2004",First posting ever to this blog.......We mu...
390312,3906531,male,15,Education,Gemini,"06,July,2004",I finally joined in !!! !!!!
453725,1743209,male,17,Student,Gemini,"08,August,2004",As you may have noticed - I have go...
161967,2952149,male,13,Student,Scorpio,"13,March,2004",Today was a pretty busy day. First my m...
535620,3815511,male,14,Student,Sagittarius,"03,July,2004",Today i cut myself on glass and bu...
160828,756402,female,16,indUnk,Cancer,"19,August,2003",Well Neil left today. Everybody who's g...
416668,3644782,female,23,Engineering,Libra,"08,April,2004",*Bleh* Things are ... slow at the momen...
333672,3668625,female,25,indUnk,Sagittarius,"13,July,2004",Jeepers! I think that time is playing t...
558938,2173787,female,17,Student,Capricorn,"09,November,2003",I spent my afternoon not doing the thin...


For analyzing the blogs, useful attributes are gender, age, topic, sign and text

In [7]:
blog['topic'].value_counts()

indUnk                     251015
Student                    153903
Technology                  42055
Arts                        32449
Education                   29633
Communications-Media        20140
Internet                    16006
Non-Profit                  14700
Engineering                 11653
Law                          9040
Publishing                   7753
Science                      7269
Government                   6907
Consulting                   5862
Religion                     5235
Fashion                      4851
Marketing                    4769
Advertising                  4676
BusinessServices             4500
Banking                      4049
Chemicals                    3928
Telecommunications           3891
Accounting                   3832
Military                     3128
Museums-Libraries            3096
Sports-Recreation            3038
HumanResources               3010
RealEstate                   2870
Transportation               2326
Manufacturing 

The target attribute, topic, is heavily skewed towards indUnk and Student.
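The size of the skew can be quantified directly from the counts above. This is a small sketch in plain Python that uses only the top three rows of the `value_counts()` output and the row count from `blog.shape`; it also computes the accuracy of a trivial model that always predicts the most frequent class, which is a useful reference point for the models built later.

```python
# Counts copied from the value_counts() output above (top three classes only).
topic_counts = {"indUnk": 251015, "Student": 153903, "Technology": 42055}
total_posts = 681284  # number of rows reported by blog.shape

shares = {topic: count / total_posts for topic, count in topic_counts.items()}

# Accuracy of a trivial model that always predicts the most frequent class.
majority_class = max(topic_counts, key=topic_counts.get)
majority_baseline = shares[majority_class]

for topic, share in shares.items():
    print(f"{topic:<12} {share:.1%}")
print(f"Always predicting {majority_class!r} scores {majority_baseline:.1%} accuracy")
```

On the full dataset, indUnk alone covers roughly 37% of the posts, and indUnk plus Student together cover close to 60%.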

In [8]:
blog['gender'].value_counts()

male      345193
female    336091
Name: gender, dtype: int64

Gender attribute is quite balanced

Check for missing values

In [9]:
blog.isna().sum()

id        0
gender    0
age       0
topic     0
sign      0
date      0
text      0
dtype: int64

There seem to be no missing values in the dataset

In [10]:
!pip install langdetect



In [10]:
# As the dataset is large, use fewer records
blog_df = blog.sample(100000)

In [11]:
blog_df.shape

(100000, 7)

Data Preprocessing

Eliminate Non-English textual data

In [13]:
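from langdetect import detect

def detect_english(text):
    """Return True only when langdetect identifies the text as English."""
    try:
        return detect(text) == 'en'
    except Exception:
        # detect() raises on empty or symbol-only text; treat such rows as non-English
        return False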

In [14]:
blog_df = blog_df[blog_df['text'].apply(detect_english)]

In [12]:
blog_df.shape

(100000, 7)

In [13]:
blog_df.isna().sum()

id        0
gender    0
age       0
topic     0
sign      0
date      0
text      0
dtype: int64

Preprocess unstructured data to make it consumable for model training

Eliminate All special Characters and Numbers

In [14]:
blog_df.head()

Unnamed: 0,id,gender,age,topic,sign,date,text
471789,3732266,female,17,Government,Cancer,"24,June,2004",YEhey.papalabas na ang long-awaited 'Th...
3754,3887270,female,17,Student,Leo,"25,July,2004","PHirST tyme for 'Cmoore' (haha), ..."
90549,3113729,male,27,Communications-Media,Capricorn,"07,June,2004","[Saddle Creek] • May 25, 2..."
563566,3399714,male,24,Military,Gemini,"06,July,2004","So, I really have nothing much to talk ..."
603747,942828,female,34,indUnk,Cancer,"24,January,2003",Two notes for those devoted...


In [15]:
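# Raw string avoids an invalid-escape warning. \w also matches digits and underscores,
# so this drops punctuation and symbols but keeps numbers.
pattern = r"[^\w ]"
blog_df.text = blog_df.text.apply(lambda s: re.sub(pattern, "", s))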
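The heading mentions removing numbers as well as special characters, but `\w` matches digits, so the pattern above keeps them (the cleaned preview below still shows `May 25 2004`). The sketch below compares the notebook's pattern with a hypothetical letters-only variant; the sample sentence is made up for illustration, loosely based on one of the previewed posts.

```python
import re

keep_word_chars = r"[^\w ]"        # the notebook's pattern: drops punctuation only
keep_letters_only = r"[^A-Za-z ]"  # hypothetical variant: also drops digits and underscores

# Illustrative sample, not an actual row from the dataset.
sample = "[Saddle Creek] • May 25, 2004: it's a long-awaited post!"

cleaned_default = re.sub(keep_word_chars, "", sample)
cleaned_letters = re.sub(keep_letters_only, "", sample)
# Collapse the runs of spaces left behind by removed tokens.
cleaned_letters = " ".join(cleaned_letters.split())

print(repr(cleaned_default))
print(repr(cleaned_letters))
```

The letters-only variant is ASCII-specific and would also strip accented characters, so it is only a reasonable choice once non-English posts have been filtered out.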

In [16]:
blog_df.head()

Unnamed: 0,id,gender,age,topic,sign,date,text
471789,3732266,female,17,Government,Cancer,"24,June,2004",YEheypapalabas na ang longawaited The n...
3754,3887270,female,17,Student,Leo,"25,July,2004",PHirST tyme for Cmoore haha goin ...
90549,3113729,male,27,Communications-Media,Capricorn,"07,June,2004",Saddle Creek May 25 2004 ...
563566,3399714,male,24,Military,Gemini,"06,July,2004",So I really have nothing much to talk a...
603747,942828,female,34,indUnk,Cancer,"24,January,2003",Two notes for those devoted...


Lowercase all textual data

In [17]:
blog_df.text = blog_df.text.apply(lambda s: s.lower())

Remove all Stopwords

In [18]:
!pip install stopwords



In [19]:
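# Only the stop-word list is used below, so nltk.download('stopwords') would be enough;
# downloading every package is slow and timed out in this run.
nltk.download('all')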

[nltk_data] Error loading all: <urlopen error [WinError 10060] A
[nltk_data]     connection attempt failed because the connected party
[nltk_data]     did not properly respond after a period of time, or
[nltk_data]     established connection failed because connected host
[nltk_data]     has failed to respond>


False

In [20]:
# Use a separate name so the imported nltk `stopwords` module is not overwritten
stop_words = set(stopwords.words('english'))
blog_df.text = blog_df.text.apply(lambda s: ' '.join([word for word in s.split() if word not in stop_words]))
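The order of the cleaning steps matters for contractions. NLTK's English list stores them with the apostrophe (for example "don't"), but punctuation was already stripped in an earlier cell, so "dont" no longer matches and survives the filter. The sketch below illustrates this with a tiny hypothetical stop-word set and made-up helper names; it is not the NLTK list itself.

```python
import re

# Small hypothetical stop-word set; NLTK's English list contains contractions
# in their apostrophe form, such as "don't".
stop_words = {"i", "to", "the", "don't"}

def remove_stop_words(text):
    return " ".join(w for w in text.split() if w not in stop_words)

def strip_punctuation(text):
    return re.sub(r"[^\w ]", "", text)

raw = "i don't want to read the blog"

# Notebook order: punctuation first, so "don't" becomes "dont" and slips through.
punct_then_stop = remove_stop_words(strip_punctuation(raw))

# Alternative order: filter stop words while the apostrophe is still present.
stop_then_punct = strip_punctuation(remove_stop_words(raw))

print(punct_then_stop)
print(stop_then_punct)
```

This is consistent with tokens such as `dont`, `doesnt`, and `wont` showing up among the most frequent TF-IDF features later in the notebook.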

Remove all extra white spaces

In [21]:
blog_df.text = blog_df.text.apply(lambda s: s.strip())

In [22]:
blog_df.head()

Unnamed: 0,id,gender,age,topic,sign,date,text
471789,3732266,female,17,Government,Cancer,"24,June,2004",yeheypapalabas na ang longawaited notebook pri...
3754,3887270,female,17,Student,Leo,"25,July,2004",phirst tyme cmoore haha goin pacific mallasian...
90549,3113729,male,27,Communications-Media,Capricorn,"07,June,2004",saddle creek may 25 2004 urllink good lifes of...
563566,3399714,male,24,Military,Gemini,"06,July,2004",really nothing much talk discuss really quiet ...
603747,942828,female,34,indUnk,Cancer,"24,January,2003",two notes devoted venerable newman urllink mr ...


Build a base Classification model

Create dependent and independent variables

In [23]:
X = blog_df['text']
y = blog_df['topic']

Split data into train and test

In [24]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42,test_size = 0.2)

Vectorize data

In [27]:
cv = CountVectorizer(ngram_range=(1,2))
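With `ngram_range=(1,2)` the vectorizer counts both single words and pairs of adjacent words, which is why the vocabulary becomes very large. The following is a rough pure-Python sketch of the feature generation; `word_ngrams` is a hypothetical helper, and the real `CountVectorizer` additionally lowercases the text and, with its default token pattern, ignores single-character tokens.

```python
def word_ngrams(tokens, min_n=1, max_n=2):
    """Return all contiguous n-grams with min_n <= n <= max_n, joined by spaces."""
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

# First few tokens of one of the cleaned posts previewed above.
tokens = "really nothing much talk discuss".split()
features = word_ngrams(tokens)

print(features)
```

A post with n tokens contributes up to n unigrams and n - 1 bigrams, and most bigrams are rare, so the feature space grows much faster than with unigrams alone.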

In [28]:
cv.fit(X_train)

# The bigram vocabulary is too large to print in full (doing so exceeded the
# notebook's IOPub data rate limit), so report its size instead
print('Vocabulary size: ', len(cv.vocabulary_))



In [29]:
cv.get_feature_names()  # removed in scikit-learn 1.2; newer versions use cv.get_feature_names_out()

['00',
 '00 45',
 '00 bayern',
 '00 became',
 '00 chance',
 '00 couple',
 '00 denmark',
 '00 doesnt',
 '00 election',
 '00 extra',
 '00 horny',
 '00 innings',
 '00 less',
 '00 meters',
 '00 morning',
 '00 new',
 '00 oran',
 '00 pitch',
 '00 pretty',
 '00 really',
 '00 second',
 '00 seconds',
 '00 third',
 '00 well',
 '00 wouldnt',
 '000',
 '000 000',
 '000 023',
 '000 approximate',
 '000 attend',
 '000 bipolar',
 '000 cheers',
 '000 crore',
 '000 day',
 '000 duration',
 '000 era',
 '000 jobs',
 '000 lbs',
 '000 leaving',
 '000 lolol',
 '000 men',
 '000 mile',
 '000 oh',
 '000 peeps',
 '000 people',
 '000 placidus',
 '000 pounds',
 '000 someone',
 '000 songs',
 '000 square',
 '000 straight',
 '000 thats',
 '000 times',
 '0000',
 '0000 3847',
 '0000 mirage',
 '0000 new',
 '0000 wont',
 '00000',
 '00000 asked',
 '000000',
 '000000 1337',
 '000000 colour',
 '000000 even',
 '000000 fontsize',
 '000000 scrollbarhighlightcolor',
 '000000 urllink',
 '00000000',
 '00000000 eight',
 '00000000 ke

In [30]:
X_train_cv = cv.transform(X_train)

In [31]:
X_test_cv = cv.transform(X_test)

In [32]:
blog_df['topic'].value_counts()

indUnk                     35141
Student                    21548
Technology                  5981
Arts                        4534
Education                   4066
Communications-Media        2809
Internet                    2254
Non-Profit                  2072
Engineering                 1658
Law                         1217
Publishing                  1128
Science                     1006
Government                   997
Consulting                   836
Religion                     722
Marketing                    691
Advertising                  681
Fashion                      666
BusinessServices             589
Banking                      553
Chemicals                    553
Accounting                   541
Telecommunications           534
Military                     468
Sports-Recreation            452
Museums-Libraries            449
HumanResources               421
RealEstate                   400
Manufacturing                360
Transportation               332
Biotech   

Transform Labels

In [33]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(y_train)
y_train_enc = le.transform(y_train)
y_test_enc = le.transform(y_test)

In [34]:
y_train_enc

array([34,  4, 14, ..., 39, 21, 34])

In [35]:
y_test_enc

array([39,  3, 34, ..., 34, 39, 39])

In [36]:
y_train

373712               Student
342773                  Arts
257659           Engineering
537473              Internet
442852    Telecommunications
                 ...        
51834                 indUnk
300534            Publishing
458420                indUnk
653566                   Law
286758               Student
Name: topic, Length: 76377, dtype: object
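The integer codes come from the sorted order of the class names. Because uppercase letters sort before lowercase ones, `indUnk` ends up with the highest code, which matches the `39` values in the encoded arrays above. A minimal pure-Python equivalent, using a hypothetical four-class subset of the topics:

```python
labels = ["Student", "Arts", "Engineering", "indUnk", "Student"]

# Sorted unique labels define the code for each class (uppercase sorts before lowercase).
classes = sorted(set(labels))
to_code = {label: code for code, label in enumerate(classes)}

encoded = [to_code[label] for label in labels]
decoded = [classes[code] for code in encoded]

print(classes)
print(encoded)
```

In scikit-learn the same mapping is available as `le.classes_`, and `le.inverse_transform` converts codes back to names. Passing `target_names=le.classes_` to `classification_report` would label the report rows with topic names instead of integers.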

Build a base model for Supervised Learning - Classification

In [37]:
rfmodel1 = RandomForestClassifier(n_estimators = 100, criterion = 'entropy', min_samples_leaf = 3, random_state = 42)

rfmodel1.fit(X_train_cv, y_train_enc)

RandomForestClassifier(criterion='entropy', min_samples_leaf=3, random_state=42)

In [38]:
y_pred = rfmodel1.predict(X_test_cv)

Performance Metrics

In [39]:
print('Accuracy: ', accuracy_score(y_test_enc, y_pred))

Accuracy:  0.37083006022518983


In [40]:
print(classification_report(y_test_enc, y_pred))
cm = confusion_matrix(y_test_enc, y_pred)
print(cm)

              precision    recall  f1-score   support

           0       0.00      0.00      0.00        97
           1       0.00      0.00      0.00       109
           2       0.00      0.00      0.00        30
           3       0.00      0.00      0.00        43
           4       0.00      0.00      0.00       923
           5       0.00      0.00      0.00        16
           6       0.00      0.00      0.00       106
           7       0.00      0.00      0.00        58
           8       0.00      0.00      0.00       104
           9       0.00      0.00      0.00       113
          10       0.00      0.00      0.00       601
          11       0.00      0.00      0.00        28
          12       0.00      0.00      0.00       158
          13       0.00      0.00      0.00       853
          14       1.00      0.00      0.01       341
          15       0.00      0.00      0.00        11
          16       1.00      0.01      0.02       107
          17       0.00    

TF - IDF Vectorization

In [41]:
tf = TfidfVectorizer(max_features = 200)

tf.fit(X_train)

TfidfVectorizer(max_features=200)
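As a rough illustration of what the TF-IDF weights represent, the sketch below computes them by hand on a three-document toy corpus. It follows the smoothed formula documented for scikit-learn's default settings, idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalisation; the documents and variable names are made up for the example.

```python
import math
from collections import Counter

# Toy corpus, not taken from the dataset.
docs = [
    "work school work",
    "school friends",
    "work home",
]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(tokenized)

# Document frequency: number of documents containing each term.
doc_freq = Counter(w for doc in tokenized for w in set(doc))

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}

def tfidf_vector(doc_tokens):
    counts = Counter(doc_tokens)
    raw = [counts[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(x * x for x in raw))  # L2 normalisation
    return [x / norm for x in raw] if norm else raw

vectors = [tfidf_vector(doc) for doc in tokenized]

for doc, vec in zip(docs, vectors):
    print(doc, [round(x, 3) for x in vec])
```

Terms that occur in fewer documents receive a larger idf, so they weigh more than common terms with the same count. With `max_features = 200` only the 200 most frequent terms are kept, which is why the vocabulary printed below consists mostly of everyday words.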

In [42]:
print('\nWord indexes:')
print(tf.vocabulary_)


Word indexes:
{'tomorrow': 172, 'need': 113, 'also': 2, 'everything': 40, 'must': 110, 'im': 76, 'going': 56, 'wont': 190, 'let': 86, 'post': 130, 'first': 44, 'part': 124, 'blog': 17, 'dont': 32, 'read': 135, 'want': 181, 'today': 170, 'times': 169, 'much': 108, 'hate': 66, 'job': 78, 'school': 144, 'thinking': 164, 'guess': 61, 'maybe': 102, 'getting': 51, 'people': 125, 'things': 162, 'believe': 12, 'another': 4, 'life': 87, 'morning': 106, 'day': 27, 'home': 70, 'found': 45, 'one': 123, 'work': 191, 'came': 20, 'hope': 71, 'doesnt': 30, 'come': 23, 'make': 97, 'anyone': 5, 'feel': 41, 'bad': 11, 'actually': 0, 'friends': 47, 'find': 43, 'someone': 148, 'still': 152, 'ill': 75, 'went': 187, 'good': 58, 'made': 96, 'friend': 46, 'house': 73, 'nice': 117, 'see': 145, 'old': 122, 'well': 186, 'back': 10, 'game': 49, 'night': 118, 'last': 82, 'week': 185, 'got': 59, 'looking': 93, 'sure': 154, 'like': 88, 'real': 136, 'big': 15, 'next': 116, 'year': 195, 'play': 128, 'ive': 77, 'urllin

In [43]:
X_train_tf = tf.transform(X_train)
X_test_tf = tf.transform(X_test)

In [46]:
rfmodel2 = RandomForestClassifier(n_estimators = 100, criterion = 'entropy', min_samples_leaf = 3, random_state = 42)

rfmodel2.fit(X_train_tf, y_train_enc)

RandomForestClassifier(criterion='entropy', min_samples_leaf=3, random_state=42)

In [47]:
y_pred_tf = rfmodel2.predict(X_test_tf)

In [48]:
print('Accuracy: ', accuracy_score(y_test_enc, y_pred_tf))

Accuracy:  0.3739198743126473


In [49]:
print(classification_report(y_test_enc, y_pred_tf))
cm = confusion_matrix(y_test_enc, y_pred_tf)
print(cm)

              precision    recall  f1-score   support

           0       0.00      0.00      0.00        97
           1       0.00      0.00      0.00       109
           2       0.00      0.00      0.00        30
           3       0.00      0.00      0.00        43
           4       0.29      0.01      0.01       923
           5       0.00      0.00      0.00        16
           6       0.00      0.00      0.00       106
           7       0.00      0.00      0.00        58
           8       0.00      0.00      0.00       104
           9       0.00      0.00      0.00       113
          10       0.00      0.00      0.00       601
          11       0.00      0.00      0.00        28
          12       0.00      0.00      0.00       158
          13       0.25      0.00      0.00       853
          14       0.00      0.00      0.00       341
          15       0.00      0.00      0.00        11
          16       0.00      0.00      0.00       107
          17       0.00    

In [50]:
from sklearn.svm import LinearSVC

svc = LinearSVC(C=1.0, penalty='l1', dual=False, loss='squared_hinge')

In [51]:
svc_cv = svc.fit(X_train_cv, y_train_enc)
y_pred_svccv = svc_cv.predict(X_test_cv)

In [52]:
print('Accuracy: ', accuracy_score(y_test_enc, y_pred_svccv))

Accuracy:  0.3473684210526316


In [53]:
print(classification_report(y_test_enc, y_pred_svccv))
cm = confusion_matrix(y_test_enc, y_pred_svccv)
print(cm)

              precision    recall  f1-score   support

           0       0.21      0.14      0.17        97
           1       0.10      0.08      0.09       109
           2       0.11      0.07      0.08        30
           3       0.24      0.09      0.13        43
           4       0.19      0.14      0.16       923
           5       0.00      0.00      0.00        16
           6       0.23      0.13      0.17       106
           7       0.10      0.03      0.05        58
           8       0.19      0.14      0.16       104
           9       0.04      0.02      0.02       113
          10       0.14      0.09      0.11       601
          11       0.15      0.07      0.10        28
          12       0.15      0.09      0.12       158
          13       0.21      0.15      0.17       853
          14       0.19      0.11      0.14       341
          15       0.00      0.00      0.00        11
          16       0.34      0.25      0.29       107
          17       0.16    

In [54]:
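# Fit a fresh estimator so svc_cv is not silently refit on the TF-IDF features
svc_tf = LinearSVC(C=1.0, penalty='l1', dual=False, loss='squared_hinge').fit(X_train_tf, y_train_enc)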
y_pred_svctf = svc_tf.predict(X_test_tf)

In [55]:
print('Accuracy: ', accuracy_score(y_test_enc, y_pred_svctf))

Accuracy:  0.3799423932966745


In [56]:
print(classification_report(y_test_enc, y_pred_svctf))
cm = confusion_matrix(y_test_enc, y_pred_svctf)
print(cm)

              precision    recall  f1-score   support

           0       0.00      0.00      0.00        97
           1       0.00      0.00      0.00       109
           2       0.00      0.00      0.00        30
           3       0.00      0.00      0.00        43
           4       0.00      0.00      0.00       923
           5       0.00      0.00      0.00        16
           6       0.00      0.00      0.00       106
           7       0.00      0.00      0.00        58
           8       0.00      0.00      0.00       104
           9       0.00      0.00      0.00       113
          10       0.00      0.00      0.00       601
          11       0.00      0.00      0.00        28
          12       0.00      0.00      0.00       158
          13       0.00      0.00      0.00       853
          14       0.00      0.00      0.00       341
          15       0.00      0.00      0.00        11
          16       0.00      0.00      0.00       107
          17       0.00    

Performance Insights

For the Random Forest model, the Count and TF-IDF vectorizers gave nearly identical accuracy (37.1% and 37.4%), with weighted precision and recall marginally better for the Count vectorizer. For the SVM, the Count vectorizer gave a lower accuracy (34.7%) than the TF-IDF vectorizer (38.0%). However, the weighted precision of the Count vectorizer model (0.31) is higher than that of the TF-IDF model (0.23), while its recall is lower. Overall, the two vectorizers performed about the same. Note that the comparison is not like-for-like: the Count vectorizer used the full unigram and bigram vocabulary, whereas TF-IDF was capped at 200 features.

The SVM on TF-IDF vectors had the highest accuracy at 38.0%, although the per-class rows visible in its report are all 0.00, so most of that accuracy likely comes from the largest classes. One possible reason the SVM edged out the Random Forest is that I kept the hyperparameter min_samples_leaf = 3 for the RF model, which makes it run faster but is not necessarily the most accurate setting. The SVM, on the other hand, used 'L1' regularization with C = 1, a moderate regularization strength.

The SVM with 'L1' regularization and C = 1 ran relatively fast and gave the best accuracy of the four runs. A larger C (weaker regularization) could lead to overfitting, while a smaller C could lead to underfitting; other values of C were not tried here.

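The dataset is imbalanced, and we want classes with more examples to contribute more to the summary score, so the weighted-average precision and recall are the preferred metrics. Weighted averages can hide poor performance on rare classes, as the 0.00 rows in the reports above show. For context, indUnk makes up about 37% of the posts (251,015 of 681,284 in the full dataset), so all four accuracies are close to what always predicting indUnk would achieve.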
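The difference between the macro and weighted averages in the classification report can be shown with a few lines of arithmetic. The per-class precision and support values below are hypothetical and chosen only to mimic a skewed label set; they are not results from the models above.

```python
# Hypothetical per-class precision and support, chosen to mimic a skewed label set.
per_class = {
    "indUnk":         {"precision": 0.40, "support": 700},
    "Student":        {"precision": 0.30, "support": 250},
    "Transportation": {"precision": 0.00, "support": 50},
}

total_support = sum(c["support"] for c in per_class.values())

# Macro average: every class counts equally.
macro_precision = sum(c["precision"] for c in per_class.values()) / len(per_class)

# Weighted average: each class counts in proportion to its support.
weighted_precision = sum(
    c["precision"] * c["support"] for c in per_class.values()
) / total_support

print(f"macro:    {macro_precision:.3f}")
print(f"weighted: {weighted_precision:.3f}")
```

In this toy case the weighted figure is noticeably higher than the macro figure because the class with zero precision has very little support. Reporting both makes it easier to see whether a model is ignoring the small classes.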

## Chatbot

In [57]:
import json

In [91]:
with open('GLBot.json') as file:
    corpus = json.load(file)
    
print(corpus)

{'intents': [{'tag': 'Intro', 'patterns': ['hi', 'how are you', 'is anyone there', 'hello', 'whats up', 'hey', 'yo', 'listen', 'please help me', 'i am learner from', 'i belong to', 'aiml batch', 'aifl batch', 'i am from', 'my pm is', 'blended', 'online', 'i am from', 'please guide me', 'hey ya', 'talking to you for first time'], 'responses': ['Hello! how can i help you ?', 'Hello! How may I assist you ?'], 'context_set': ''}, {'tag': 'Exit', 'patterns': ['thank you', 'thanks', 'cya', 'see you', 'later', 'see you later', 'goodbye', 'i am leaving', 'have a Good day', 'you helped me', 'thanks a lot', 'thanks a ton', 'you are the best', 'great help', 'too good', 'you are a good learning buddy'], 'responses': ['I hope I was able to assist you, Good Bye', 'Good Bye, I hope your query is resolved'], 'context_set': ''}, {'tag': 'Olympus', 'patterns': ['olympus', 'explain me how olympus works', 'I am not able to understand olympus', 'olympus window not working', 'no access to olympus', 'unable 

In [92]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\prash\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [93]:
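W = []      # all tokens seen in the patterns (becomes the vocabulary)
L = []      # intent tags (class labels)
doc_x = []  # tokenized patterns
doc_y = []  # tag for each tokenized pattern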

for intent in corpus['intents']:
    for pattern in intent['patterns']:
        w_temp = nltk.word_tokenize(pattern)
        W.extend(w_temp)
        doc_x.append(w_temp)
        doc_y.append(intent['tag'])
        
    if intent['tag'] not in L:
        L.append(intent['tag'])

In [94]:
#Stemming
from nltk.stem import PorterStemmer
 
stemmer = PorterStemmer()
W = [stemmer.stem(w.lower()) for w in W if w != '?'] #stemming or learning the root word
W = sorted(list(set(W))) #sorted words
L = sorted(L) #sorted list of tags or labels

In [95]:
Train = []
Target = []
out_empty = [0 for _ in range(len(L))]

for x, doc in enumerate(doc_x):
    bag=[]
    
    w_temp = [stemmer.stem(w.lower()) for w in doc]
    
    for w in W:
        if w in w_temp:
            bag.append(1)
        else:
            bag.append(0)
            
    output_row = out_empty[:]
    output_row[L.index(doc_y[x])] = 1
    
    Train.append(bag)
    Target.append(output_row)

In [96]:
import tensorflow as tf
from tensorflow.keras import models, layers  # same Keras as tf.keras.optimizers below

nnmodel = models.Sequential()
nnmodel.add(layers.Dense(128, input_dim = len(Train[0]), activation = 'relu'))
nnmodel.add(layers.Dense(64, activation = 'relu'))
nnmodel.add(layers.Dense(32, activation = 'relu'))
nnmodel.add(layers.Dense(len(Target[0]), activation = 'softmax'))

adam = tf.keras.optimizers.Adam(learning_rate=0.01)
nnmodel.compile(loss='categorical_crossentropy', optimizer=adam, metrics=["accuracy"])

print(nnmodel.summary())

# Training the model (convert the Python lists to arrays before fitting)
nnmodel.fit(np.array(Train), np.array(Target), epochs=200, verbose=1)

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_10 (Dense)            (None, 128)               19840     
                                                                 
 dense_11 (Dense)            (None, 64)                8256      
                                                                 
 dense_12 (Dense)            (None, 32)                2080      
                                                                 
 dense_13 (Dense)            (None, 8)                 264       
                                                                 
Total params: 30,440
Trainable params: 30,440
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/200
...
Epoch 200/200


<keras.callbacks.History at 0x23405d809d0>
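The parameter counts in the model summary can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The input width of 154 is not printed anywhere in the notebook; it is inferred here from the first layer's count, since 19840 = (154 + 1) * 128, and it corresponds to the size of the stemmed vocabulary `W`.

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit for a fully connected layer."""
    return n_inputs * n_units + n_units

# Layer widths from the model definition; 154 input features is inferred from
# the first layer's parameter count (19840 = (154 + 1) * 128).
layer_sizes = [154, 128, 64, 32, 8]

params = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(params)

print(params)
print(total)
```

The per-layer values and the total agree with the `Param #` column and the `Total params` line in the summary above. The final width of 8 is the number of intent tags in the corpus.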

In [97]:
def tokenize_stem(text):
    # Normalise exactly as the training vocabulary W was built
    # (lowercase + Porter stem); a lemmatizer would produce tokens
    # that do not match the stemmed vocabulary
    tokens = nltk.word_tokenize(text)
    return [stemmer.stem(word.lower()) for word in tokens]

def bag_of_words(text, vocab):
    # 1 where a vocabulary word occurs in the text, else 0
    tokens = set(tokenize_stem(text))
    bow = [1 if word in tokens else 0 for word in vocab]
    return np.array(bow)

In [98]:
def pred_label(text, vocab, labels): 
    bow = bag_of_words(text, vocab)
    result = nnmodel.predict(np.array([bow]))[0]
    thresh = 0.2
    y_pred = [[idx, res] for idx, res in enumerate(result) if res > thresh]

    y_pred.sort(key=lambda x: x[1], reverse=True)
    return_list = []
    for r in y_pred:
        return_list.append(labels[r[0]])
    return return_list
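One property of the thresholding step is worth spelling out: when no class probability exceeds the threshold, the returned list is empty, so any caller that indexes the first element needs to handle that case. The sketch below reproduces the filtering and sorting logic on plain lists; the function name and the probability values are hypothetical, and the three tags are ones that appear in the corpus printed earlier.

```python
def labels_above_threshold(probabilities, labels, thresh=0.2):
    """Return labels whose probability exceeds thresh, most likely first."""
    candidates = [(p, label) for p, label in zip(probabilities, labels) if p > thresh]
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    return [label for _, label in candidates]

# A confident prediction over three tags from the corpus.
confident = labels_above_threshold([0.10, 0.65, 0.25], ["Exit", "Intro", "Olympus"])

# A flat distribution over eight hypothetical tags: nothing clears the threshold.
flat_labels = [f"tag{i}" for i in range(8)]
uncertain = labels_above_threshold([0.125] * 8, flat_labels)

print(confident)
print(uncertain)
```

With eight output classes, a completely uncertain softmax output assigns 0.125 to each class, which is below the 0.2 threshold, so an empty result is a realistic outcome for inputs unlike anything in the training patterns.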

In [104]:
# Running the chatbot
print("BOT : Chat with the bot[Type 'quit' to stop] !")
print("\nBOT : If answer is not  right[Type '*'] !")
while True:

    inp = input("\n\nYou: ").strip()

    if inp.lower() == "quit":
        break

    if inp == "*":
        print("\nBOT:Please rephrase your question and try again")
        continue

    intents = pred_label(inp, W, L)
    if not intents:
        # No tag cleared the confidence threshold in pred_label
        print("\nBOT:Please rephrase your question and try again")
        continue

    tag = intents[0]
    for i in corpus["intents"]:
        if i["tag"] == tag:
            print("\nBOT : ", random.choice(i["responses"]))
            break

BOT : Chat with the bot[Type 'quit' to stop] !

BOT : If answer is not  right[Type '*'] !


You: hello

BOT :  Hello! How may I assist you ?


You: who are you

BOT :  I am your virtual learning assistant


You: access to olympus

BOT :  Connect with your Program Manager to access Olympus the learning portal


You: not able to understand ada boosting

BOT :  Link: Machine Learning wiki 


You: what is neural network

BOT :  Link: Neural Nets wiki


You: *

BOT:Please rephrase your question and try again


You: stupid

BOT :  Please use respectful words


You: you did not help me

BOT :  Transferring the request to your PM


You: quit
