
# Part A

**• DOMAIN:** Digital content management

**CONTEXT:** Classification is probably the most popular task that you would deal with in real life. Text in the form of blogs, posts, articles, etc.
are written every second. It is a challenge to predict the information about the writer without knowing about him/her. We are going to create a
classifier that predicts multiple features of the author of a given text. We have designed it as a Multi label classification problem.

**• DATA DESCRIPTION:** Over 600,000 posts from more than 19 thousand bloggers. The Blog Authorship Corpus consists of the collected posts of
19,320 bloggers gathered from blogger.com in August 2004. The corpus incorporates a total of 681,288 posts and over 140 million words - or
approximately 35 posts and 7250 words per person. Each blog is presented as a separate file, the name of which indicates a blogger id# and
the blogger’s self-provided gender, age, industry, and astrological sign. (All are labelled for gender and age but for many, industry and/or sign is
marked as unknown.) All bloggers included in the corpus fall into one of three age groups:

• 8240 "10s" blogs (ages 13-17),

• 8086 "20s" blogs (ages 23-27) and

• 2994 "30s" blogs (ages 33-47)


For each age group, there is an equal number of male and female bloggers. Each blog in the corpus includes at least 200 occurrences of
common English words. All formatting has been stripped with two exceptions. Individual posts within a single blogger are separated by the
date of the following post and links within a post are denoted by the label urllink.

**• PROJECT OBJECTIVE:** To build an NLP classifier which can use input text parameters to determine the label(s) of the blog. Specific to this case
study, you can consider the text of the blog, the ‘text’ feature, as the independent variable and ‘topic’ as the dependent variable.



**Steps and tasks:**

1. Read and Analyse Dataset.


In [None]:
import warnings
warnings.filterwarnings('ignore')
import nltk
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from zipfile import ZipFile

file_name = "/content/drive/MyDrive/blogs.zip"
  
# use a name other than `zip` so the built-in zip() is not shadowed
with ZipFile(file_name, 'r') as zf:
    zf.printdir()
    print('Extracting all the files now...')
    zf.extractall()
    print('Done!')


File Name                                             Modified             Size
blogtext.csv                                   2019-09-20 22:33:20    800419647
Extracting all the files now...
Done!


In [None]:
#selecting subset of the data due to memory issues and notebook crashing
blog_data = pd.read_csv('/content/blogtext.csv',nrows = 10000,index_col=False) 
blog_data.head()

Unnamed: 0,id,gender,age,topic,sign,date,text
0,2059027,male,15,Student,Leo,"14,May,2004","Info has been found (+/- 100 pages,..."
1,2059027,male,15,Student,Leo,"13,May,2004",These are the team members: Drewe...
2,2059027,male,15,Student,Leo,"12,May,2004",In het kader van kernfusie op aarde...
3,2059027,male,15,Student,Leo,"12,May,2004",testing!!! testing!!!
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",Thanks to Yahoo!'s Toolbar I can ...


In [None]:
blog_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      10000 non-null  int64 
 1   gender  10000 non-null  object
 2   age     10000 non-null  int64 
 3   topic   10000 non-null  object
 4   sign    10000 non-null  object
 5   date    10000 non-null  object
 6   text    10000 non-null  object
dtypes: int64(2), object(5)
memory usage: 547.0+ KB


In [None]:
blog_data.isnull().sum()

id        0
gender    0
age       0
topic     0
sign      0
date      0
text      0
dtype: int64

In [None]:
blog_data['topic'].value_counts()

indUnk                     3287
Technology                 2654
Fashion                    1622
Student                    1137
Education                   270
Marketing                   156
Engineering                 127
Internet                    118
Communications-Media         99
BusinessServices             91
Sports-Recreation            80
Non-Profit                   71
InvestmentBanking            70
Science                      63
Arts                         45
Consulting                   21
Museums-Libraries            17
Banking                      16
Automotive                   14
Law                          11
LawEnforcement-Security      10
Religion                      9
Accounting                    4
Publishing                    4
HumanResources                2
Telecommunications            2
Name: topic, dtype: int64

In [None]:
blog_data['gender'].value_counts()

male      5916
female    4084
Name: gender, dtype: int64

**A. Clearly write outcome of data analysis**

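-No null values are present in the dataset.

-The `id` and `date` columns can be dropped since they are not used to predict the topic.

-Datatypes can be changed as required, e.g. `age` from int to object so that it is treated as a category.

-The topic labels are heavily imbalanced: `indUnk`, `Technology`, `Fashion` and `Student` account for 8,700 of the 10,000 rows, while several topics have fewer than 10 posts.

-Since the full dataset is large (about 800 MB), only the first 10,000 rows are loaded for analysis.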



Dropping date and ID column

In [None]:
blog_data.drop(labels=['id','date'], axis=1,inplace=True)

In [None]:
blog_data['age']=blog_data['age'].astype('object') #changing dtype to object for age column

**B. Clean the Structured Data**


i. Missing value analysis and imputation

In [None]:
print('Missing/Null values:',blog_data.isnull().sum())

Missing/Null values: gender    0
age       0
topic     0
sign      0
text      0
dtype: int64


In [None]:
pip install langdetect

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting langdetect
  Downloading langdetect-1.0.9.tar.gz (981 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 981.5/981.5 KB 19.5 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: langdetect
  Building wheel for langdetect (setup.py) ... done
  Created wheel for langdetect: filename=langdetect-1.0.9-py3-none-any.whl size=993242 sha256=f37142961363ec7f65526b76c07cf1d5e22dc224147587bf7b4d0ab14bb16650
  Stored in directory: /root/.cache/pip/wheels/13/c7/b0/79f66658626032e78fc1a83103690ef6797d551cb22e56e734
Successfully built langdetect
Installing collected packages: langdetect
Successfully installed langdetect-1.0.9


ii. Eliminate Non-English textual data.

In [None]:
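from langdetect import detect_langs

for text in blog_data['text']:
    try:
        lang = detect_langs(text)[0].lang
        if lang != 'en':
            # NOTE: a pandas Series has no .remove() method, so this line raises
            # AttributeError, which the bare except below swallows. No rows are
            # dropped, which is why the shape printed next is still (10000, 5).
            blog_data['text'].remove(text)
    except:
        pass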

In [None]:
blog_data.shape

(10000, 5)
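
The shape is unchanged because the loop above never drops a row. One way to filter rows is to build a boolean mask first and index the DataFrame with it afterwards. The following is a minimal sketch, not the notebook's method: `english_mask` and `toy_detect` are hypothetical names, and `toy_detect` is only a stand-in so the example runs without `langdetect` installed.

```python
def english_mask(texts, detect):
    """Return one boolean per text: True when `detect` reports English."""
    mask = []
    for text in texts:
        try:
            mask.append(detect(text) == 'en')
        except Exception:
            # Assumption: a detector may raise on empty or symbol-only text;
            # such rows are treated as non-English here.
            mask.append(False)
    return mask


def toy_detect(text):
    """Hypothetical stand-in for a real language detector."""
    if not text.strip():
        raise ValueError("no features in text")
    dutch_markers = {'het', 'van', 'op', 'je'}
    hits = sum(word in dutch_markers for word in text.lower().split())
    return 'nl' if hits >= 2 else 'en'


texts = ["testing!!! testing!!!", "In het kader van kernfusie op aarde", ""]
mask = english_mask(texts, toy_detect)
print(mask)
```

In the notebook, the same idea would be `blog_data = blog_data[english_mask(blog_data['text'], detect)]` with `detect` imported from `langdetect`. All row counts and scores shown in the rest of this notebook come from the unfiltered 10,000 rows.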

2. Preprocess unstructured data to make it consumable for model training.

A. Eliminate All special Characters and Numbers 

In [None]:
import re
blog_data['clean_text'] = blog_data['text'].apply(lambda x: re.sub(r'[^A-Za-z]+',' ',x))

In [None]:
blog_data.head()

Unnamed: 0,gender,age,topic,sign,text,clean_text
0,male,15,Student,Leo,"Info has been found (+/- 100 pages,...",Info has been found pages and MB of pdf files...
1,male,15,Student,Leo,These are the team members: Drewe...,These are the team members Drewes van der Laa...
2,male,15,Student,Leo,In het kader van kernfusie op aarde...,In het kader van kernfusie op aarde MAAK JE E...
3,male,15,Student,Leo,testing!!! testing!!!,testing testing
4,male,33,InvestmentBanking,Aquarius,Thanks to Yahoo!'s Toolbar I can ...,Thanks to Yahoo s Toolbar I can now capture t...


B. Lowercase all textual data

In [None]:
blog_data['clean_text'] = blog_data['clean_text'].apply(lambda x: x.lower())
blog_data.head()

Unnamed: 0,gender,age,topic,sign,text,clean_text
0,male,15,Student,Leo,"Info has been found (+/- 100 pages,...",info has been found pages and mb of pdf files...
1,male,15,Student,Leo,These are the team members: Drewe...,these are the team members drewes van der laa...
2,male,15,Student,Leo,In het kader van kernfusie op aarde...,in het kader van kernfusie op aarde maak je e...
3,male,15,Student,Leo,testing!!! testing!!!,testing testing
4,male,33,InvestmentBanking,Aquarius,Thanks to Yahoo!'s Toolbar I can ...,thanks to yahoo s toolbar i can now capture t...


C. Remove all Stopwords

In [None]:
from nltk.corpus import stopwords
# keep the set under its own name so the imported `stopwords` module is not overwritten
stop_words = set(stopwords.words('english'))

blog_data['clean_text']=blog_data['clean_text'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))
blog_data.head()

Unnamed: 0,gender,age,topic,sign,text,clean_text
0,male,15,Student,Leo,"Info has been found (+/- 100 pages,...",info found pages mb pdf files wait untill team...
1,male,15,Student,Leo,These are the team members: Drewe...,team members drewes van der laag urllink mail ...
2,male,15,Student,Leo,In het kader van kernfusie op aarde...,het kader van kernfusie op aarde maak je eigen...
3,male,15,Student,Leo,testing!!! testing!!!,testing testing
4,male,33,InvestmentBanking,Aquarius,Thanks to Yahoo!'s Toolbar I can ...,thanks yahoo toolbar capture urls popups means...


D. Remove all extra white spaces

In [None]:
blog_data['clean_text']=blog_data['clean_text'].apply(lambda x: x.strip())
blog_data.head()

Unnamed: 0,gender,age,topic,sign,text,clean_text
0,male,15,Student,Leo,"Info has been found (+/- 100 pages,...",info found pages mb pdf files wait untill team...
1,male,15,Student,Leo,These are the team members: Drewe...,team members drewes van der laag urllink mail ...
2,male,15,Student,Leo,In het kader van kernfusie op aarde...,het kader van kernfusie op aarde maak je eigen...
3,male,15,Student,Leo,testing!!! testing!!!,testing testing
4,male,33,InvestmentBanking,Aquarius,Thanks to Yahoo!'s Toolbar I can ...,thanks yahoo toolbar capture urls popups means...
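
Steps A to D can also be combined into a single function, which makes it easy to apply the same cleaning to new text at prediction time. This is a minimal sketch: `clean_text` and `TOY_STOP_WORDS` are hypothetical names, and the tiny stop word set is only an illustration (the notebook uses NLTK's English list).

```python
import re

# Assumption: a small illustrative stop word set, not NLTK's full English list.
TOY_STOP_WORDS = {'has', 'been', 'and', 'of', 'the', 'to', 'i', 'can', 's', 'now'}


def clean_text(text, stop_words=TOY_STOP_WORDS):
    text = re.sub(r'[^A-Za-z]+', ' ', text)                     # A. keep letters only
    text = text.lower()                                         # B. lowercase
    tokens = [w for w in text.split() if w not in stop_words]   # C. drop stop words
    return ' '.join(tokens)                                     # D. single spaces, no padding


cleaned = clean_text("Thanks to Yahoo!'s Toolbar I can now capture 100 URLs")
print(cleaned)
```

Because `split()` followed by `' '.join(...)` already collapses runs of whitespace and removes leading and trailing spaces, no separate `strip()` is needed in this version.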


# 3. Build a base Classification model

A. Create dependent and independent variables

In [None]:
data = blog_data[['clean_text','topic']].copy()  # copy so that adding a column below does not modify a view of blog_data
data.head()

Unnamed: 0,clean_text,topic
0,info found pages mb pdf files wait untill team...,Student
1,team members drewes van der laag urllink mail ...,Student
2,het kader van kernfusie op aarde maak je eigen...,Student
3,testing testing,Student
4,thanks yahoo toolbar capture urls popups means...,InvestmentBanking


In [None]:
data['CategoryId'] = data['topic'].factorize()[0]
data.head()

Unnamed: 0,clean_text,topic,CategoryId
0,info found pages mb pdf files wait untill team...,Student,0
1,team members drewes van der laag urllink mail ...,Student,0
2,het kader van kernfusie op aarde maak je eigen...,Student,0
3,testing testing,Student,0
4,thanks yahoo toolbar capture urls popups means...,InvestmentBanking,1


In [None]:
x = data['clean_text']
y = data['CategoryId']

In [None]:
from sklearn import preprocessing
lb = preprocessing.LabelBinarizer()
lb.fit_transform(y)

array([[1, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 1, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0]])

In [None]:
x.head()

0    info found pages mb pdf files wait untill team...
1    team members drewes van der laag urllink mail ...
2    het kader van kernfusie op aarde maak je eigen...
3                                      testing testing
4    thanks yahoo toolbar capture urls popups means...
Name: clean_text, dtype: object

B. Split data into train and test.

In [None]:
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test=train_test_split(x,y,test_size=0.2) #splitting into 80(train) and 20(test)
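
The split above is random and unseeded, so the scores change from run to run, and with topics as rare as 2 posts a class can end up entirely in one partition. A stratified split keeps each class's share roughly the same in both partitions. The following pure-Python sketch illustrates the idea under stated assumptions: `stratified_split` is a hypothetical helper, and the toy labels are made up.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_indices, test_indices), splitting each class separately."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)          # fixed seed makes the split reproducible
    train_idx, test_idx = [], []
    for idxs in by_label.values():
        rng.shuffle(idxs)
        # at least one test sample per class, unless the class has a single member
        n_test = max(1, round(len(idxs) * test_size)) if len(idxs) > 1 else 0
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)


labels = ['indUnk'] * 10 + ['Technology'] * 5 + ['Law'] * 2
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

In scikit-learn, `train_test_split` accepts `stratify=y` and `random_state=<int>` for the same purpose.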

C. Vectorize data using any one vectorizer

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}', 
                      ngram_range=(1, 3), stop_words = 'english')

corpus = list(X_train)+list(X_test)

In [None]:
count_vect.fit(corpus)

vectXtrain = count_vect.transform(X_train)
vectXtest = count_vect.transform(X_test)

In [None]:
count_vect.get_feature_names_out()[:10]

array(['aa', 'aa amazing', 'aa amazing things', 'aa anger',
       'aa anger management', 'aa compared', 'aa compared tougher',
       'aa keeps', 'aa keeps saying', 'aa nice'], dtype=object)
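
Note that the vectorizer above is fitted on the training and test text together, so the vocabulary contains n-grams that occur only in the test set. A common alternative is to learn the vocabulary from the training text alone and ignore unseen words when transforming the test text. The sketch below shows that idea in plain Python with unigram counts; `fit_vocabulary` and `transform` are hypothetical helpers, not scikit-learn functions.

```python
def fit_vocabulary(train_docs):
    """Map every token seen in the training documents to a column index."""
    vocab = sorted({tok for doc in train_docs for tok in doc.split()})
    return {tok: i for i, tok in enumerate(vocab)}


def transform(docs, vocab):
    """Count vocabulary tokens per document; tokens outside the vocabulary are skipped."""
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for tok in doc.split():
            if tok in vocab:
                row[vocab[tok]] += 1
        rows.append(row)
    return rows


train_docs = ["testing testing", "team members mail"]
test_docs = ["team testing popups"]          # 'popups' never appears in training

vocab = fit_vocabulary(train_docs)
train_counts = transform(train_docs, vocab)
test_counts = transform(test_docs, vocab)
print(vocab)
print(test_counts)
```

With scikit-learn the equivalent is `count_vect.fit(X_train)` followed by `transform` on both `X_train` and `X_test`.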

In [None]:
label_counts = {}

for label in data.topic:
    if label in label_counts:
        label_counts[label] += 1
    else:
        label_counts[label] = 1

label_counts

{'Student': 1137,
 'InvestmentBanking': 70,
 'indUnk': 3287,
 'Non-Profit': 71,
 'Banking': 16,
 'Education': 270,
 'Engineering': 127,
 'Science': 63,
 'Communications-Media': 99,
 'BusinessServices': 91,
 'Sports-Recreation': 80,
 'Arts': 45,
 'Internet': 118,
 'Museums-Libraries': 17,
 'Accounting': 4,
 'Technology': 2654,
 'Law': 11,
 'Consulting': 21,
 'Automotive': 14,
 'Religion': 9,
 'Fashion': 1622,
 'Publishing': 4,
 'Marketing': 156,
 'LawEnforcement-Security': 10,
 'HumanResources': 2,
 'Telecommunications': 2}

D. Build a base model for Supervised Learning - Classification.

In [None]:
from sklearn.ensemble import RandomForestClassifier
rfc=RandomForestClassifier()
rfc.fit(vectXtrain, y_train)
pred = rfc.predict(vectXtest)

E. Clearly print Performance Metrics. 

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_recall_fscore_support as score

precision, recall, f1score, support = score(y_test, pred, average='micro')

print('Accuracy score: ', accuracy_score(y_test, pred))
print('Precision score:', precision)
print('F1 score: ', f1score)
print('Recall score: ',recall )

Accuracy score:  0.4725
Precision score: 0.4725
F1 score:  0.4725
Recall score:  0.4725
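
All four scores above are identical, which is expected: when every sample has exactly one label, micro-averaged precision, recall and F1 all reduce to accuracy. Macro averaging, which weights every class equally, is more sensitive to the rare topics. The sketch below demonstrates the difference on made-up labels; `micro_and_macro_precision` is a hypothetical helper written for illustration.

```python
def micro_and_macro_precision(y_true, y_pred):
    labels = sorted(set(y_true) | set(y_pred))
    tp = {label: 0 for label in labels}   # true positives per predicted class
    fp = {label: 0 for label in labels}   # false positives per predicted class
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[p] += 1
        else:
            fp[p] += 1

    # micro: pool the counts over all classes, then divide once
    micro = sum(tp.values()) / (sum(tp.values()) + sum(fp.values()))

    # macro: compute precision per class, then take the unweighted mean
    per_class = [tp[l] / (tp[l] + fp[l]) if (tp[l] + fp[l]) else 0.0 for l in labels]
    macro = sum(per_class) / len(labels)
    return micro, macro


# Made-up, imbalanced example: class 0 dominates and class 1 is never predicted correctly.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 2, 1]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
micro, macro = micro_and_macro_precision(y_true, y_pred)
print(accuracy, micro, macro)
```

With scikit-learn, passing `average='macro'` to `precision_recall_fscore_support` gives the per-class average instead.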


# 4. Improve Performance of model.

A. Experiment with other vectorisers. 

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tf_idf = TfidfVectorizer(min_df=3,  max_features=None, 
             strip_accents='unicode', analyzer='word',token_pattern=r'\w{1,}',
             ngram_range=(1, 3), use_idf=True,smooth_idf=True,sublinear_tf=True,
             stop_words = 'english')



tf_idf.fit(list(X_train) + list(X_test))
Xtrain_tf =  tf_idf.transform(X_train) 
Xtest_tf = tf_idf.transform(X_test)
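
To make the settings above concrete, the following sketch computes TF-IDF weights for one tiny document by hand, using the formulas scikit-learn documents for `sublinear_tf=True` (tf becomes 1 + ln(tf)) and `smooth_idf=True` (idf = ln((1 + n) / (1 + df)) + 1), followed by L2 normalisation. The toy documents and the `tfidf_row` helper are assumptions made for illustration.

```python
import math


def tfidf_row(doc_tokens, all_docs):
    """Return {term: weight} for one document, L2-normalised."""
    n = len(all_docs)
    weights = {}
    for term in set(doc_tokens):
        tf = 1 + math.log(doc_tokens.count(term))       # sublinear term frequency
        df = sum(term in doc for doc in all_docs)       # number of documents containing the term
        idf = math.log((1 + n) / (1 + df)) + 1          # smoothed inverse document frequency
        weights[term] = tf * idf
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


docs = [["team", "testing", "testing"], ["team", "members"], ["team", "mail"]]
row = tfidf_row(docs[0], docs)
print(row)
```

Here "team" occurs in every document, so it receives a much lower weight than "testing", which is specific to the first document.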

B. Build classifier Models using other algorithms than base model

Logistic Regression model

In [None]:
from sklearn.linear_model import LogisticRegression

lrmodel=LogisticRegression(solver='lbfgs')

lrmodel.fit(Xtrain_tf, y_train)

lr_pred = lrmodel.predict(Xtest_tf) 

In [None]:
precision, recall, f1score, support = score(y_test, lr_pred, average='micro')
print('Accuracy score: ', accuracy_score(y_test, lr_pred))
print('Precision score:', precision)
print('F1 score: ', f1score)
print('Recall score: ',recall )

Accuracy score:  0.61
Precision score: 0.61
F1 score:  0.61
Recall score:  0.61


**Hyperparameter Tuning**

In [None]:
from sklearn.model_selection import RandomizedSearchCV


model = LogisticRegression()
solvers = ['newton-cg', 'lbfgs', 'liblinear']
penalty = ['l1','l2']
c_values = [100, 10, 1.0, 0.1]
# define the search space; RandomizedSearchCV samples combinations from it
grid = dict(solver=solvers,penalty=penalty,C=c_values)
random_search = RandomizedSearchCV(model, grid, scoring='accuracy')
grid_result = random_search.fit(Xtrain_tf, y_train)
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

Best: 0.664875 using {'solver': 'lbfgs', 'penalty': 'l2', 'C': 100}
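
The search space above includes solver/penalty pairs that `LogisticRegression` does not support: `newton-cg` and `lbfgs` accept the `l2` penalty but not `l1`. Sampled candidates with such a pair fail to fit and contribute no usable score. The sketch below counts the valid combinations and shows one way to express only those; the `SUPPORTED` table is written by hand for the two penalties searched here.

```python
from itertools import product

solvers = ['newton-cg', 'lbfgs', 'liblinear']
penalties = ['l1', 'l2']
c_values = [100, 10, 1.0, 0.1]

# Penalties each solver accepts, restricted to the two penalties searched here.
SUPPORTED = {'newton-cg': {'l2'}, 'lbfgs': {'l2'}, 'liblinear': {'l1', 'l2'}}

all_combos = list(product(solvers, penalties, c_values))
valid = [combo for combo in all_combos if combo[1] in SUPPORTED[combo[0]]]
print(len(all_combos), len(valid))

# The same valid space written as a list of dicts, one per compatible group.
param_distributions = [
    {'solver': ['newton-cg', 'lbfgs'], 'penalty': ['l2'], 'C': c_values},
    {'solver': ['liblinear'], 'penalty': ['l1', 'l2'], 'C': c_values},
]
```

`RandomizedSearchCV` accepts such a list of dicts as its parameter distributions, so `param_distributions` could be passed in place of `grid`.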


In [None]:
lrmodel2=LogisticRegression(solver='lbfgs', penalty = 'l2', C = 100)

lrmodel2.fit(Xtrain_tf, y_train)

lr_pred2 = lrmodel2.predict(Xtest_tf) 

In [None]:
precision, recall, f1score, support = score(y_test, lr_pred2, average='micro')
print('Accuracy score: ', accuracy_score(y_test, lr_pred2))
print('Precision score:', precision)
print('F1 score: ', f1score)
print('Recall score: ',recall )

Accuracy score:  0.6685
Precision score: 0.6685
F1 score:  0.6685
Recall score:  0.6685


# 5. Share insights on relative performance comparison.

**A. Which vectorizer performed better? Probable reason?**

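Answer: The TF-IDF vectorizer performed better in the runs above: accuracy rose from about 47% (count vectors with random forest) to 61% (TF-IDF with logistic regression), and to about 67% after hyperparameter tuning. A probable reason is that, unlike the count vectorizer, TF-IDF does not rely on raw word counts alone but also weights each term by how informative it is across the corpus, so very common words contribute less. The TF-IDF vectorizer was also configured with `min_df=3`, which drops rare n-grams and reduces the input dimensions, leading to a less complex model than with the count vectorizer. Since the classifier changed together with the vectorizer, the improvement cannot be attributed to the vectorizer alone.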

**B. Which model outperformed? Probable reason?**

Answer: The logistic regression model performed better than the base random forest model (61% vs. about 47% accuracy). This is probably due in part to the change of vectorizer, since the logistic regression model was built on TF-IDF features instead of the count vectors used for the initial random forest model.

**C. Which parameter/hyperparameter significantly helped to improve performance?Probable reason?**

Answer: Model performance improved after randomized search selected the hyperparameters 'solver': 'lbfgs', 'penalty': 'l2', 'C': 100. The improvement is most likely due to C: the solver is the same as before tuning and L2 is the default penalty for logistic regression, so the only value that changed is C, from the default 1.0 to 100. C is the inverse of the regularization strength, so a larger C regularizes less and lets the model fit the sparse TF-IDF features more closely.

**D. According to you, which performance metric should be given most importance, why?.**

Answer: I feel the most important metric depends on the type of problem and data we are dealing with. Accuracy is a reasonable default for classification problems with balanced classes, but with imbalanced classes we should give more importance to precision, recall, F1 score or AUC-ROC. The topic labels here are heavily imbalanced (3,287 `indUnk` posts against 2 `HumanResources` posts), so F1 is a better guide than accuracy for this dataset. For regression problems, I think it should be MAE, MSE/RMSE, etc.

*****************************************

# Part B

**• DOMAIN:** Customer support

**• CONTEXT:** Great Learning has an academic support department which receives numerous support requests every day throughout the year.
Teams are spread across geographies and try to provide support round the year. Sometimes there are circumstances where due to heavy
workload certain request resolutions are delayed, impacting the company’s business. Some of the requests are very generic where a proper
resolution procedure delivered to the user can solve the problem. The company is looking to design an automation which can interact with
the user, understand the problem and display the resolution procedure [if found as a generic request] or redirect the request to an actual human
support executive if the request is complex or not in its database.

**• DATA DESCRIPTION:** A sample corpus is attached for your reference. Please enhance/add more data to the corpus using your linguistics skills.


**• PROJECT OBJECTIVE:** Design a python based interactive semi - rule based chatbot which can do the following:
1. Start chat session with greetings and ask what the user is looking for. 
2. Accept dynamic text based questions from the user. Reply back with relevant answer from the designed corpus.
3. End the chat session only if the user requests to end else ask what the user is looking for. Loop continues till the user asks to end it.

Loading/Reading data file

In [2]:
import json

# read the intents corpus; the context manager closes the file after loading
with open('GL Bot.json') as f:
    data = json.load(f)


In [3]:
print(data)

{'intents': [{'tag': 'Intro', 'patterns': ['hi', 'how are you', 'is anyone there', 'hello', 'whats up', 'hey', 'yo', 'listen', 'please help me', 'i am learner from', 'i belong to', 'aiml batch', 'aifl batch', 'i am from', 'my pm is', 'blended', 'online', 'i am from', 'hey ya', 'talking to you for first time'], 'responses': ['Hello! how can i help you ?'], 'context_set': ''}, {'tag': 'Exit', 'patterns': ['thank you', 'thanks', 'cya', 'see you', 'later', 'see you later', 'goodbye', 'i am leaving', 'have a Good day', 'you helped me', 'thanks a lot', 'thanks a ton', 'you are the best', 'great help', 'too good', 'you are a good learning buddy'], 'responses': ['I hope I was able to assist you, Good Bye'], 'context_set': ''}, {'tag': 'Olympus', 'patterns': ['olympus', 'explain me how olympus works', 'I am not able to understand olympus', 'olympus window not working', 'no access to olympus', 'unable to see link in olympus', 'no link visible on olympus', 'whom to contact for olympus', 'lot of p

Data Preprocessing

In [4]:
import warnings
warnings.filterwarnings("ignore")
import nltk
nltk.download('punkt')
nltk.download('wordnet')

tags = [] #empty list to store tags/classes
docs = [] #empty list to store documents
words=[] #empty list to store all tokenized words

for x in data['intents']:
  for y in x['patterns']:
    tokens = nltk.word_tokenize(y)
    words.extend(tokens)

    docs.append((tokens, x['tag']))

    if x['tag'] not in tags:
      tags.append(x['tag'])

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


In [5]:
print('Tags:',tags)
print('Documents:',docs)
print('Words:',words)

Tags: ['Intro', 'Exit', 'Olympus', 'SL', 'NN', 'Bot', 'Profane', 'Ticket']
Documents: [(['hi'], 'Intro'), (['how', 'are', 'you'], 'Intro'), (['is', 'anyone', 'there'], 'Intro'), (['hello'], 'Intro'), (['whats', 'up'], 'Intro'), (['hey'], 'Intro'), (['yo'], 'Intro'), (['listen'], 'Intro'), (['please', 'help', 'me'], 'Intro'), (['i', 'am', 'learner', 'from'], 'Intro'), (['i', 'belong', 'to'], 'Intro'), (['aiml', 'batch'], 'Intro'), (['aifl', 'batch'], 'Intro'), (['i', 'am', 'from'], 'Intro'), (['my', 'pm', 'is'], 'Intro'), (['blended'], 'Intro'), (['online'], 'Intro'), (['i', 'am', 'from'], 'Intro'), (['hey', 'ya'], 'Intro'), (['talking', 'to', 'you', 'for', 'first', 'time'], 'Intro'), (['thank', 'you'], 'Exit'), (['thanks'], 'Exit'), (['cya'], 'Exit'), (['see', 'you'], 'Exit'), (['later'], 'Exit'), (['see', 'you', 'later'], 'Exit'), (['goodbye'], 'Exit'), (['i', 'am', 'leaving'], 'Exit'), (['have', 'a', 'Good', 'day'], 'Exit'), (['you', 'helped', 'me'], 'Exit'), (['thanks', 'a', 'lot'],

In [6]:
nltk.download('omw-1.4')
lemmer = nltk.stem.WordNetLemmatizer()

#each words to lower case, remove punctuations if any, lemmatize

puncts = [',','!','?',';'] 

words = [lemmer.lemmatize(word.lower()) for word in words if word not in puncts]
words = sorted(list(set(words)))

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


In [7]:
print('Words:',words)

Words: ['a', 'able', 'access', 'activation', 'ada', 'adam', 'aifl', 'aiml', 'am', 'an', 'ann', 'anyone', 'are', 'artificial', 'backward', 'bad', 'bagging', 'batch', 'bayes', 'belong', 'best', 'blended', 'bloody', 'boosting', 'bot', 'buddy', 'classification', 'contact', 'create', 'cross', 'cya', 'day', 'deep', 'did', 'diffult', 'do', 'ensemble', 'epoch', 'explain', 'first', 'for', 'forest', 'forward', 'from', 'function', 'good', 'goodbye', 'gradient', 'great', 'hate', 'have', 'hell', 'hello', 'help', 'helped', 'hey', 'hi', 'hidden', 'hour', 'how', 'hyper', 'i', 'imputer', 'in', 'intelligence', 'is', 'jerk', 'joke', 'knn', 'later', 'layer', 'learner', 'learning', 'leaving', 'link', 'listen', 'logistic', 'lot', 'machine', 'me', 'ml', 'my', 'naive', 'name', 'nb', 'net', 'network', 'neural', 'no', 'not', 'of', 'olympus', 'olypus', 'on', 'online', 'operation', 'opertions', 'otimizer', 'parameter', 'piece', 'please', 'pm', 'problem', 'propagation', 'random', 'regression', 'relu', 'screw', 'se

In [8]:
tags = sorted(list(set(tags)))
print('Tags:', tags)

Tags: ['Bot', 'Exit', 'Intro', 'NN', 'Olympus', 'Profane', 'SL', 'Ticket']


**Create training data for the ML/DL classifier**


In [9]:
train_data = []

emp_output = [0] * len(tags) #empty output list equal to number of classes


for x in docs:
    pattern_words = [lemmer.lemmatize(w.lower()) for w in x[0]]

    # bag of words: 1 if the vocabulary word occurs in the pattern, else 0
    bow = [1 if word in pattern_words else 0 for word in words]

    # one-hot encode the tag of this pattern
    output_row = list(emp_output)
    output_row[tags.index(x[1])] = 1
    train_data.append([bow, output_row])

In [10]:
import random
import numpy as np

random.shuffle(train_data)
# each row holds two lists of different lengths, so an object array is required
train_data = np.array(train_data, dtype=object)

In [11]:
print(train_data)

[[list([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0])
  list([0, 0, 1, 0, 0, 0, 0, 0])]
 [list([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0])


In [12]:
X_train = list(train_data[:,0])
y_train = list(train_data[:,1])

In [13]:
print(X_train)
print(y_train)

[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

Model Building

In [14]:
import tensorflow
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Dropout
from tensorflow.keras import optimizers

model = Sequential()
model.add(Dense(128, input_shape=(len(X_train[0]),), activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(len(y_train[0]), activation='softmax'))

adam = optimizers.Adam()
model.compile(loss='categorical_crossentropy', optimizer=adam, metrics=['accuracy'])

history = model.fit(np.array(X_train), np.array(y_train), epochs=200, batch_size=5, verbose=1)


Epoch 1/200
Epoch 2/200
...
Epoch 78
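
The output layer of the model uses a softmax activation and is trained with categorical cross-entropy. As a reminder of what those two compute, here is a minimal pure-Python sketch; Keras uses its own vectorised implementation, and the scores below are made up for illustration.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(one_hot, probs):
    """Loss for one sample: minus the log-probability of the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs) if t)


logits = [0.5, 0.1, 2.0]                         # made-up scores for three classes
probs = softmax(logits)
loss = categorical_crossentropy([0, 0, 1], probs)
print(probs, loss)
```

The loss approaches 0 as the probability assigned to the true class approaches 1.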

Save the model

In [15]:
model.save('Glbot.h5')  # save() takes the file path; the training history is not an argument

Function to clean sentences

In [16]:
def cleaned_sentence(text):
    text_words = nltk.word_tokenize(text)
    text_words = [lemmer.lemmatize(word.lower()) for word in text_words]
    return text_words

Function to return BOW array

In [17]:
def bow(text, words, show_details=True):

    text_words = cleaned_sentence(text)

    bag_of_words = [0]*len(words) 
    for s in text_words:
        for i, w in enumerate(words):
            if w == s:
                bag_of_words[i] = 1  # mark this vocabulary word as present
                if show_details:
                    print("found in bag: %s" % w)
    return np.array(bag_of_words)

**Function to predict classes**

In [18]:
def pred_tag(text, model):
   
    p = bow(text, words,show_details=False)
    pred = model.predict(np.array([p]))[0]
    error = 0.20
    result = [[i,r] for i,r in enumerate(pred) if r>error]
    
    result.sort(key=lambda x: x[1], reverse=True)
    return_list = []
    
    for r in result:
        return_list.append({"intent": tags[r[0]], "probability": str(r[1])})
    return return_list
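
The value 0.20 in `pred_tag` acts as a probability threshold: intents at or below it are discarded and the rest are ranked from most to least likely. With 8 intents the largest softmax output can be as low as 0.125, so the returned list can be empty and the caller needs to handle that case. The sketch below shows the same filtering on made-up probability vectors; `rank_intents` is a hypothetical helper.

```python
def rank_intents(probs, tags, threshold=0.20):
    """Keep intents whose probability exceeds the threshold, most likely first."""
    ranked = [(tag, p) for tag, p in zip(tags, probs) if p > threshold]
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked


tags = ['Bot', 'Exit', 'Intro', 'NN', 'Olympus', 'Profane', 'SL', 'Ticket']

# Made-up model outputs: one confident prediction and one completely uncertain one.
confident = rank_intents([0.02, 0.03, 0.60, 0.05, 0.25, 0.02, 0.02, 0.01], tags)
uncertain = rank_intents([0.125] * 8, tags)
print(confident)
print(uncertain)
```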

**Function to get responses from trained model**

In [19]:
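def get_Response(ints, intents_json):
    # default reply when no intent cleared the threshold or the tag is unknown
    result = 'Please re-phrase your query!'
    if not ints:
        return result
    tag = ints[0]['intent']
    for i in intents_json['intents']:
        if i['tag'] == tag:
            result = random.choice(i['responses'])
            break
    return result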

In [20]:
def Glbot_response(text):
    ints = pred_tag(text, model)
    result = get_Response(ints, data)
    return result

**Function to initiate chat**

In [31]:
endchat_list = ['exit','break','quit','see you later','chat with you later','end the chat','bye','ok bye']

def initiate_chat():
    print("Bot: My name is Greatbot. Let's have a conversation!\n\n")
    while True:
        inp = input().strip().lower()  # normalise once instead of on every comparison
        if inp in endchat_list:
            break
        if inp in ('', '*'):
            print('Please re-phrase your query!')
            print("-"*50)
        else:
            print(f"Bot: {Glbot_response(inp)}"+'\n')
            print("-"*50)
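
Because `initiate_chat` reads from the keyboard, it is awkward to test. One option is to separate the loop logic from the input source so that a scripted conversation can be replayed. The following is a minimal sketch under stated assumptions: `run_chat`, `END_WORDS` and `echo_respond` are hypothetical names, and `echo_respond` merely stands in for `Glbot_response`.

```python
END_WORDS = {'exit', 'quit', 'bye'}   # a subset of endchat_list, for illustration


def run_chat(user_lines, respond, end_words=END_WORDS):
    """Return the bot's replies for a scripted conversation."""
    bot_lines = []
    for line in user_lines:
        text = line.strip().lower()
        if text in end_words:
            break                                  # the user asked to end the session
        if not text:
            bot_lines.append('Please re-phrase your query!')
        else:
            bot_lines.append(respond(text))
    return bot_lines


def echo_respond(text):
    """Hypothetical stand-in for Glbot_response."""
    return f"you said: {text}"


replies = run_chat(['Hi there', '', 'bye', 'ignored'], echo_respond)
print(replies)
```

Anything after the end word is never processed, which mirrors the `break` in the interactive loop.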

In [33]:
initiate_chat()

Bot: My name is Greatbot. Let's have a conversation!


Hi there
Bot: Hello! how can i help you ?

--------------------------------------------------
who is this
Bot: I am your virtual learning assistant

--------------------------------------------------
How to use olympus
Bot: Link: Olympus wiki

--------------------------------------------------
what is machine learning
Bot: Link: Machine Learning wiki 

--------------------------------------------------
Are you stupid
Bot: Please use respectful words

--------------------------------------------------
create a ticket
Bot: Tarnsferring the request to your PM

--------------------------------------------------
Thank you
Bot: I hope I was able to assist you, Good Bye

--------------------------------------------------
bye




                                                                END OF NOTEBOOK


