# **Project: Question Answering**
____

### Benaboud Oumaïma

**Loading Dataset**

In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [None]:
import pandas as pd
path = "/content/drive/MyDrive/NLP_datasets/intents.csv"
df = pd.read_csv(path)
df

Unnamed: 0,tags,patterns
0,greeting,Hi there
1,greeting,How are you
2,greeting,Is anyone there?
3,greeting,Hey
4,greeting,Hola
...,...,...
905,disadvantges of using k-means,what can be the possible reasons for not using...
906,advantages of knn algorithm,what are the advanatges of knn algorithm
907,advantages of knn algorithm,mention some of the advanatges of knn algorithm
908,advantages of knn algorithm,what are the benefits of using knn algorithm


## Pre-processing the data

**Remove punctuation**

In [None]:
import string
def remove_punctuation(text):
    return text.translate(str.maketrans("", "", string.punctuation))
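
As a quick check of what `remove_punctuation` does, the sketch below applies the same `str.translate` call to two short strings. The second string is a made-up sample. Note that hyphenated terms are fused rather than split, which is why `k-means` shows up as `kmeans` in the later outputs.

```python
import string

def remove_punctuation(text):
    # Delete every character listed in string.punctuation (ASCII punctuation only)
    return text.translate(str.maketrans("", "", string.punctuation))

samples = ["Is anyone there?", "disadvantages of k-means"]
cleaned = [remove_punctuation(s) for s in samples]
print(cleaned)  # ['Is anyone there', 'disadvantages of kmeans']
```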

In [None]:
df_no_punct = df["patterns"].apply(remove_punctuation)
df["df_no_punct"]=df_no_punct
df

Unnamed: 0,tags,patterns,df_no_punct
0,greeting,Hi there,Hi there
1,greeting,How are you,How are you
2,greeting,Is anyone there?,Is anyone there
3,greeting,Hey,Hey
4,greeting,Hola,Hola
...,...,...,...
905,disadvantges of using k-means,what can be the possible reasons for not using...,what can be the possible reasons for not using...
906,advantages of knn algorithm,what are the advanatges of knn algorithm,what are the advanatges of knn algorithm
907,advantages of knn algorithm,mention some of the advanatges of knn algorithm,mention some of the advanatges of knn algorithm
908,advantages of knn algorithm,what are the benefits of using knn algorithm,what are the benefits of using knn algorithm


**Tokenization and lowercase**

In [None]:
import nltk
nltk.download('punkt')
def tokenize_text(text):
    tokens = nltk.word_tokenize(text)
    lower_tokens = [token.lower() for token in tokens]
    return lower_tokens


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [None]:
df_token = df["df_no_punct"].apply(tokenize_text)
df["Tokens"]=df_token
df

Unnamed: 0,tags,patterns,df_no_punct,Tokens
0,greeting,Hi there,Hi there,"[hi, there]"
1,greeting,How are you,How are you,"[how, are, you]"
2,greeting,Is anyone there?,Is anyone there,"[is, anyone, there]"
3,greeting,Hey,Hey,[hey]
4,greeting,Hola,Hola,[hola]
...,...,...,...,...
905,disadvantges of using k-means,what can be the possible reasons for not using...,what can be the possible reasons for not using...,"[what, can, be, the, possible, reasons, for, n..."
906,advantages of knn algorithm,what are the advanatges of knn algorithm,what are the advanatges of knn algorithm,"[what, are, the, advanatges, of, knn, algorithm]"
907,advantages of knn algorithm,mention some of the advanatges of knn algorithm,mention some of the advanatges of knn algorithm,"[mention, some, of, the, advanatges, of, knn, ..."
908,advantages of knn algorithm,what are the benefits of using knn algorithm,what are the benefits of using knn algorithm,"[what, are, the, benefits, of, using, knn, alg..."


**Remove Stopwords**

In [None]:
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))  # build the set once instead of reloading the list for every token
def remove_stopwords(tokens):
    return [token for token in tokens if token not in stop_words]

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
df_clean = df["Tokens"].apply(remove_stopwords)
df["clean_df"]=df_clean
df

Unnamed: 0,tags,patterns,df_no_punct,Tokens,clean_df
0,greeting,Hi there,Hi there,"[hi, there]",[hi]
1,greeting,How are you,How are you,"[how, are, you]",[]
2,greeting,Is anyone there?,Is anyone there,"[is, anyone, there]",[anyone]
3,greeting,Hey,Hey,[hey],[hey]
4,greeting,Hola,Hola,[hola],[hola]
...,...,...,...,...,...
905,disadvantges of using k-means,what can be the possible reasons for not using...,what can be the possible reasons for not using...,"[what, can, be, the, possible, reasons, for, n...","[possible, reasons, using, kmeans, algorithm]"
906,advantages of knn algorithm,what are the advanatges of knn algorithm,what are the advanatges of knn algorithm,"[what, are, the, advanatges, of, knn, algorithm]","[advanatges, knn, algorithm]"
907,advantages of knn algorithm,mention some of the advanatges of knn algorithm,mention some of the advanatges of knn algorithm,"[mention, some, of, the, advanatges, of, knn, ...","[mention, advanatges, knn, algorithm]"
908,advantages of knn algorithm,what are the benefits of using knn algorithm,what are the benefits of using knn algorithm,"[what, are, the, benefits, of, using, knn, alg...","[benefits, using, knn, algorithm]"
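
The table above shows that row 1 ("How are you") is left with no tokens after stopword removal. The sketch below reproduces that effect and lists the emptied rows. It assumes a tiny hand-picked stop list in place of NLTK's English list, so it runs without any downloads. Rows that end up empty become all-zero vectors after vectorization, so it may be worth checking how many there are.

```python
# Hypothetical, hand-picked stop list for illustration; NLTK's English list is much longer
STOP_WORDS = {"how", "are", "you", "is", "there", "what", "the", "of"}

def remove_stopwords(tokens):
    # Set membership keeps the filter cheap for every token
    return [t for t in tokens if t not in STOP_WORDS]

rows = [["hi", "there"], ["how", "are", "you"], ["is", "anyone", "there"]]
cleaned = [remove_stopwords(r) for r in rows]

# Indices of rows that lost all of their tokens
empty_rows = [i for i, toks in enumerate(cleaned) if not toks]
print(cleaned)     # [['hi'], [], ['anyone']]
print(empty_rows)  # [1]
```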


**Stemming and Lemmatization**

In [None]:
from nltk.stem import PorterStemmer, WordNetLemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')
stemmer = PorterStemmer()
wnl = WordNetLemmatizer()
def stemming(tokens):
    return [stemmer.stem(token) for token in tokens]
def lemmatization(tokens):
    return [wnl.lemmatize(token) for token in tokens]

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


In [None]:
stem_df = df['clean_df'].apply(stemming)
df["stem_df"]=stem_df
lemma_df = df['clean_df'].apply(lemmatization)
df["lemma_words"]=lemma_df
df

Unnamed: 0,tags,patterns,df_no_punct,Tokens,clean_df,stem_df,lemma_words
0,greeting,Hi there,Hi there,"[hi, there]",[hi],[hi],[hi]
1,greeting,How are you,How are you,"[how, are, you]",[],[],[]
2,greeting,Is anyone there?,Is anyone there,"[is, anyone, there]",[anyone],[anyon],[anyone]
3,greeting,Hey,Hey,[hey],[hey],[hey],[hey]
4,greeting,Hola,Hola,[hola],[hola],[hola],[hola]
...,...,...,...,...,...,...,...
905,disadvantges of using k-means,what can be the possible reasons for not using...,what can be the possible reasons for not using...,"[what, can, be, the, possible, reasons, for, n...","[possible, reasons, using, kmeans, algorithm]","[possibl, reason, use, kmean, algorithm]","[possible, reason, using, kmeans, algorithm]"
906,advantages of knn algorithm,what are the advanatges of knn algorithm,what are the advanatges of knn algorithm,"[what, are, the, advanatges, of, knn, algorithm]","[advanatges, knn, algorithm]","[advanatg, knn, algorithm]","[advanatges, knn, algorithm]"
907,advantages of knn algorithm,mention some of the advanatges of knn algorithm,mention some of the advanatges of knn algorithm,"[mention, some, of, the, advanatges, of, knn, ...","[mention, advanatges, knn, algorithm]","[mention, advanatg, knn, algorithm]","[mention, advanatges, knn, algorithm]"
908,advantages of knn algorithm,what are the benefits of using knn algorithm,what are the benefits of using knn algorithm,"[what, are, the, benefits, of, using, knn, alg...","[benefits, using, knn, algorithm]","[benefit, use, knn, algorithm]","[benefit, using, knn, algorithm]"


In [None]:
df['df_clean'] = df['lemma_words'].apply(lambda x: ' '.join(x))
df

Unnamed: 0,tags,patterns,df_no_punct,Tokens,clean_df,stem_df,lemma_words,df_clean
0,greeting,Hi there,Hi there,"[hi, there]",[hi],[hi],[hi],hi
1,greeting,How are you,How are you,"[how, are, you]",[],[],[],
2,greeting,Is anyone there?,Is anyone there,"[is, anyone, there]",[anyone],[anyon],[anyone],anyone
3,greeting,Hey,Hey,[hey],[hey],[hey],[hey],hey
4,greeting,Hola,Hola,[hola],[hola],[hola],[hola],hola
...,...,...,...,...,...,...,...,...
905,disadvantges of using k-means,what can be the possible reasons for not using...,what can be the possible reasons for not using...,"[what, can, be, the, possible, reasons, for, n...","[possible, reasons, using, kmeans, algorithm]","[possibl, reason, use, kmean, algorithm]","[possible, reason, using, kmeans, algorithm]",possible reason using kmeans algorithm
906,advantages of knn algorithm,what are the advanatges of knn algorithm,what are the advanatges of knn algorithm,"[what, are, the, advanatges, of, knn, algorithm]","[advanatges, knn, algorithm]","[advanatg, knn, algorithm]","[advanatges, knn, algorithm]",advanatges knn algorithm
907,advantages of knn algorithm,mention some of the advanatges of knn algorithm,mention some of the advanatges of knn algorithm,"[mention, some, of, the, advanatges, of, knn, ...","[mention, advanatges, knn, algorithm]","[mention, advanatg, knn, algorithm]","[mention, advanatges, knn, algorithm]",mention advanatges knn algorithm
908,advantages of knn algorithm,what are the benefits of using knn algorithm,what are the benefits of using knn algorithm,"[what, are, the, benefits, of, using, knn, alg...","[benefits, using, knn, algorithm]","[benefit, use, knn, algorithm]","[benefit, using, knn, algorithm]",benefit using knn algorithm


**Vectorization**

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['df_clean']).toarray()
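
To make the TF-IDF step less of a black box, here is a minimal pure-Python sketch of the weighting. It assumes scikit-learn's default settings (smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row) and uses a plain whitespace split instead of the library's tokenizer, so it is an approximation rather than a drop-in replacement. The three sample documents are illustrative.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, rows) using smoothed IDF and L2-normalised rows."""
    tokenized = [d.split() for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    n = len(docs)
    # Document frequency: number of documents that contain each term
    df_counts = Counter(t for toks in tokenized for t in set(toks))
    idf = {t: math.log((1 + n) / (1 + df_counts[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        row = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in row))
        # An empty document keeps an all-zero row
        rows.append([v / norm if norm else 0.0 for v in row])
    return vocab, rows

docs = ["advantage knn algorithm", "benefit using knn algorithm", "hi"]
vocab, rows = tfidf(docs)
print(vocab)
print([round(v, 3) for v in rows[0]])
```

Terms that occur in fewer documents (here `advantage`) receive a larger weight than terms shared across documents (here `knn`).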

In [None]:
Y = pd.get_dummies(df["tags"]).to_numpy()
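
The one-hot encoding matters later, because a predicted column index has to be mapped back to a tag. The sketch below shows the idea in plain Python, assuming the columns follow the sorted order of the unique labels, which is how `pd.get_dummies` orders them for string labels. The helper names `one_hot` and `decode` are hypothetical.

```python
def one_hot(labels):
    # Sorted unique labels define the column order
    classes = sorted(set(labels))
    index = {c: i for i, c in enumerate(classes)}
    rows = [[1 if index[lab] == j else 0 for j in range(len(classes))]
            for lab in labels]
    return classes, rows

def decode(row, classes):
    # The position of the largest value (argmax) recovers the label
    return classes[max(range(len(row)), key=row.__getitem__)]

tags = ["greeting", "advantages of knn algorithm", "greeting",
        "disadvantges of using k-means"]
classes, Y = one_hot(tags)
print(classes)
print(Y[0], decode(Y[0], classes))  # [0, 0, 1] greeting
```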

## Knowledge Discovery

**Splitting data into training and testing sets**

In [None]:
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.1, random_state=42)

In [None]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(819, 564)
(91, 564)
(819, 255)
(91, 255)


**Sequential Neural Network Model**

This model is a sequential neural network with three dense layers. The first layer has 128 neurons, the second has 64, and the output layer has one neuron per intent with softmax activation, so the network outputs a probability for each intent. A dropout layer with a rate of 0.5 follows each of the two hidden layers to help prevent overfitting. The optimizer is Stochastic Gradient Descent (SGD) with Nesterov accelerated gradient, a momentum of 0.9, and a learning rate of 0.01.
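
For a sense of scale, the sketch below counts the trainable parameters of the three dense layers, taking the input width (564) and the number of classes (255) from the shapes printed earlier. The helper `dense_params` is a hypothetical name. Dropout layers add no parameters.

```python
def dense_params(n_in, n_out):
    # Weight matrix (n_in x n_out) plus one bias per output unit
    return n_in * n_out + n_out

n_features, n_classes = 564, 255  # X_train.shape[1] and y_train.shape[1]
layer_sizes = [n_features, 128, 64, n_classes]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)
print(per_layer, total)  # [72320, 8256, 16575] 97151
```

With roughly 97,000 parameters and 819 training samples, the network can memorise the training set easily, which is one reason regularisation such as dropout is useful here.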

In [None]:
from tensorflow.keras.layers import Dense, Dropout
import tensorflow as tf
from tensorflow.keras import Sequential
# Create model - 3 layers. First layer 128 neurons, second layer 64 neurons and 3rd output layer contains number of neurons
# equal to number of intents to predict output intent with softmax
model = Sequential()
model.add(Dense(128, input_shape=(len(X_train[0]),), activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(len(y_train[0]), activation='softmax'))

# Compile model. Stochastic gradient descent with Nesterov accelerated gradient gives good results for this model
# `learning_rate` replaces the deprecated `lr` argument
sgd = tf.keras.optimizers.legacy.SGD(learning_rate=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(
    loss='categorical_crossentropy',
    optimizer=sgd,
    metrics=['accuracy'])

# Fit the model
history = model.fit(X_train, y_train,
    epochs=800,
    batch_size=8,
    verbose=1)


Epoch 1/800
...
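
The optimizer settings used above (momentum, Nesterov look-ahead, and a small learning-rate decay) can be illustrated on a one-dimensional toy problem. The sketch below is a simplified reading of the update rule, assuming the Keras-style formulation with inverse-time decay. It is not the library's implementation, and the function name `nesterov_sgd` is hypothetical.

```python
def nesterov_sgd(grad, w0, lr=0.01, momentum=0.9, decay=1e-6, steps=500):
    w, v = w0, 0.0
    for t in range(steps):
        lr_t = lr / (1.0 + decay * t)    # inverse-time learning-rate decay
        g = grad(w)
        v = momentum * v - lr_t * g      # velocity update
        w = w + momentum * v - lr_t * g  # Nesterov look-ahead step
    return w

# Toy objective f(w) = (w - 3)^2 with gradient 2 * (w - 3); the minimum is at w = 3
w_final = nesterov_sgd(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(round(w_final, 4))
```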

In [None]:
from sklearn.metrics import accuracy_score, classification_report
import numpy as np

# Predict the class index for each test sample
y_pred = np.argmax(model.predict(X_test), axis=-1)
# Keep y_test one-hot so the cell can be re-run; decode into a separate variable
y_true = np.argmax(y_test, axis=-1)

# Calculate the accuracy
accuracy = accuracy_score(y_true, y_pred)
print('Accuracy:', accuracy)

# Print the classification report. Row labels are class indices (columns of Y).
# zero_division=0 avoids warnings for classes with no true or no predicted samples.
print(classification_report(y_true, y_pred, zero_division=0))

Accuracy: 0.7912087912087912
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         1
          13       1.00      1.00      1.00         1
          18       1.00      1.00      1.00         1
          21       1.00      1.00      1.00         1
          23       0.00      0.00      0.00         0
          31       1.00      1.00      1.00         1
          33       1.00      1.00      1.00         1
          36       0.00      0.00      0.00         4
          37       1.00      1.00      1.00         1
          41       0.00      0.00      0.00         1
          47       1.00      1.00      1.00         1
          52       1.00      1.00      1.00         1
          54       1.00      1.00      1.00         1
          57       1.00      1.00      1.00         1
          60       1.00      1.00      1.00         1
          61       0.00      0.00      0.00         1
          62       1.00      1.00      1.00         

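
Several rows in the report show 0.00 with a support of 0 or 1, which is expected when 91 test samples are spread over 255 classes. The sketch below computes per-class precision and recall by hand to show where those zeros come from. The labels are toy values rather than the notebook's test set, and `precision_recall` is a hypothetical helper.

```python
def precision_recall(y_true, y_pred, label, zero_division=0.0):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    # Fall back to zero_division when a denominator is zero (metric undefined)
    precision = tp / (tp + fp) if tp + fp else zero_division
    recall = tp / (tp + fn) if tp + fn else zero_division
    return precision, recall

# Toy labels (illustrative only)
y_true = [13, 13, 36, 36, 0]
y_pred = [13, 13, 23, 36, 36]
results = {label: precision_recall(y_true, y_pred, label) for label in (13, 23, 36, 0)}
print(results)
```

Class 23 is predicted once but never occurs in `y_true`, so its recall is undefined. Class 0 occurs once but is never predicted, so its precision is undefined. Both cases fall back to 0.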


In [None]:
from tensorflow.keras.models import save_model

# Save the model
model.save('qa_model.h5')

In [None]:
import pickle

# Save the vectorizer
with open('vectorizer.pkl', 'wb') as f:
    pickle.dump(vectorizer, f)
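
The saved model outputs class indices, so a fresh session also needs the label order to turn an index back into a tag. One possible approach, sketched below, is to store the label list next to the model. The file name `labels.json` and the two-item label list are placeholders. In the notebook the list would come from `list(pd.get_dummies(df["tags"]).columns)`.

```python
import json
import os
import tempfile

# Illustrative label list standing in for the real get_dummies column order
labels = ["advantages of knn algorithm", "greeting"]

# Hypothetical file name, written to a temporary directory for this sketch
path = os.path.join(tempfile.mkdtemp(), "labels.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(labels, f)

# Later, at inference time, restore the same order
with open(path, "r", encoding="utf-8") as f:
    restored = json.load(f)

print(restored[1])  # greeting
```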


**Testing**

In [None]:
from tensorflow.keras.models import load_model
import pickle

# Load the model
model = load_model('qa_model.h5')

# Load the vectorizer
with open('vectorizer.pkl', 'rb') as f:
    vectorizer = pickle.load(f)

In [None]:
import random
import json
# loading responses
with open("/content/drive/MyDrive/NLP_datasets/responses.json", "r") as f:
    responses = json.load(f)

In [None]:
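import numpy as np
# Class labels in the same column order that pd.get_dummies used to build Y
tag_labels = pd.get_dummies(df["tags"]).columns

# transforming input and predicting intent
def predict_tag(inp_str):
    # Apply the same preprocessing as the training patterns, so that inputs
    # such as "k-means" or plural forms map onto the fitted vocabulary
    tokens = lemmatization(remove_stopwords(tokenize_text(remove_punctuation(inp_str))))
    inp_vector = vectorizer.transform([' '.join(tokens)]).toarray()
    prediction = model.predict(inp_vector)
    return tag_labels[np.argmax(prediction)]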

In [None]:
def chatbot_response(msg):
    tag = predict_tag(msg)
    list_of_intents = responses['intents']
    for intent in list_of_intents:
        if intent['tag'] == tag:
            result = random.choice(intent['responses'])
            break
    else:
        # No intent matched the predicted tag
        result = "Sorry, I don't have a response for that."
    return result
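
The lookup in `chatbot_response` relies on Python's `for ... else`: the `else` branch runs only when the loop finishes without hitting `break`. The sketch below isolates that pattern. The `responses` dictionary is a hypothetical stand-in for the contents of `responses.json` (the `goodbye` intent is invented), and each intent has a single response so the output is deterministic.

```python
import random

# Hypothetical stand-in for the contents of responses.json
responses = {
    "intents": [
        {"tag": "greeting", "responses": ["Hi there, how can I help?"]},
        {"tag": "goodbye", "responses": ["See you later"]},
    ]
}

def pick_response(tag, intents):
    for intent in intents:
        if intent["tag"] == tag:
            result = random.choice(intent["responses"])
            break
    else:
        # Runs only when the loop ended without hitting `break`
        result = "Sorry, I don't have a response for that."
    return result

print(pick_response("greeting", responses["intents"]))
print(pick_response("unknown", responses["intents"]))
```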


In [None]:
question = "hola"
answer = chatbot_response(question)
print(answer)

Hi there, how can I help?


**AI Chat BOT**

In [None]:
def start_chat():
    print("---------------------------------------------")
    print("---------------  AI Chat bot  ---------------")
    print("---------------------------------------------")
    print("Ask any questions...")
    print("My goal is to understand your queries and respond as accurately as possible.")
    print("Type EXIT to quit...")
    print("---------------------------------------------")
    while True:
        inp = input("Ask anything... : ")
        if inp.upper() == "EXIT":
            break
        if inp:
            response = chatbot_response(inp)
            print("Response... : ", response)

In [None]:
start_chat()

---------------------------------------------
---------------  AI Chat bot  ---------------
---------------------------------------------
Ask any questions...
My goal is to understand your queries and respond as accurately as possible.
Type EXIT to quit...
---------------------------------------------
Ask anything... : hola
Response... :  Good to see you again
Ask anything... : what is the difference between classification and neural network ?
Response... :  A neural  network is a series of algorithms that endeavors to recognize underlying relationships in a set of data through a process that mimics the way the human brain operates. In this sense, neural networks refer to systems of neurons, either organic or artificial in nature.
Ask anything... : exit
