# Therapy Chatbot

The dataset contains 80 user responses to a therapy chatbot, stored in the response_text column. The bot prompted: 'Describe a time when you have acted as a resource for someone else', and each row holds one user's reply. If a response is 'not flagged', the user can continue talking to the bot. If it is 'flagged', the user is referred to help. The goal is to predict, from the text of each response, whether it is flagged.

### Libraries and Utilities

In [25]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import re
import nltk
from nltk.corpus import stopwords
nltk.download("stopwords") 
from nltk.stem.wordnet import WordNetLemmatizer


from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

from wordcloud import WordCloud, STOPWORDS

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/nicholas/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Loading Data

In [26]:
data = pd.read_csv("data/Sheet_1.csv",encoding= "latin1" )
# data.drop(["Unnamed: 3","Unnamed: 4","Unnamed: 5",
#            "Unnamed: 6","Unnamed: 7",], axis = 1, inplace =True)
data = pd.concat([data["class"],data["response_text"]], axis = 1)

data.dropna(axis=0, inplace =True)
data.head(10)

Unnamed: 0,class,response_text
0,not_flagged,I try and avoid this sort of conflict
1,flagged,Had a friend open up to me about his mental ad...
2,flagged,I saved a girl from suicide once. She was goin...
3,not_flagged,i cant think of one really...i think i may hav...
4,not_flagged,Only really one friend who doesn't fit into th...
5,not_flagged,a couple of years ago my friends was going to ...
6,flagged,Roommate when he was going through death and l...
7,flagged,i've had a couple of friends (you could say mo...
8,not_flagged,Listened to someone talk about relationship tr...
9,flagged,I will always listen. I comforted my sister wh...


### 0 to Not Flagged and 1 to Flagged

In [27]:
data["class"] = [1 if each == "flagged" else 0 for each in data["class"]]
data.head()

Unnamed: 0,class,response_text
0,0,I try and avoid this sort of conflict
1,1,Had a friend open up to me about his mental ad...
2,1,I saved a girl from suicide once. She was goin...
3,0,i cant think of one really...i think i may hav...
4,0,Only really one friend who doesn't fit into th...
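
The list comprehension above maps every value other than "flagged" to 0, so a misspelled or missing label would silently become "not flagged". A minimal sketch of a stricter encoding follows; `LABEL_TO_INT` and `encode_labels` are hypothetical names introduced here for illustration, and the sample labels are made up.

```python
# Hypothetical mapping for illustration; the two label strings come from the dataset.
LABEL_TO_INT = {"not_flagged": 0, "flagged": 1}


def encode_labels(labels):
    """Map label strings to integers, raising on anything unexpected."""
    encoded = []
    for label in labels:
        if label not in LABEL_TO_INT:
            raise ValueError(f"unexpected label: {label!r}")
        encoded.append(LABEL_TO_INT[label])
    return encoded


# Made-up sample, not rows from the dataset.
sample_labels = ["not_flagged", "flagged", "flagged", "not_flagged"]
encoded = encode_labels(sample_labels)
print(encoded)  # → [0, 1, 1, 0]
```

With a DataFrame, the same idea could be applied by passing the label column to `encode_labels` before assigning it back.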


In [28]:
data.response_text[16]

'I have helped advise friends who have faced circumstances similar to mine'

### Regular Expression

We can remove non-letter characters from the text with a regular expression.
The lower() method returns a copy of the string with all uppercase characters converted to lowercase. If the string has no uppercase characters, the result is identical to the original.

In [29]:
first_text = data.response_text[16]
text = re.sub("[^a-zA-Z]"," ",first_text)
text = text.lower() 
print(text)

i have helped advise friends who have faced circumstances similar to mine
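
Response 16 contains only letters and spaces, so it does not show what the pattern does to punctuation and digits. The sketch below wraps the same two steps in a hypothetical helper, `clean_text`, and runs it on a made-up sentence. The whitespace collapsing at the end is an addition for readability and is not part of the cell above.

```python
import re


def clean_text(text):
    """Replace every non-letter with a space, lowercase, and collapse whitespace."""
    letters_only = re.sub("[^a-zA-Z]", " ", text)
    lowered = letters_only.lower()
    # split() with no argument drops runs of spaces left behind by the substitution
    return " ".join(lowered.split())


# Made-up input, chosen to include an apostrophe, a comma, a digit, and "!"
example = "Doesn't fit, 2 friends!"
cleaned = clean_text(example)
print(cleaned)  # → doesn t fit friends
```

Note that the apostrophe splits "Doesn't" into "doesn" and "t", so contractions do not survive this kind of cleaning intact.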


### Irrelevant Words (Stopwords)

In [30]:
text = nltk.word_tokenize(text)
stop_words = set(stopwords.words("english"))  # build the set once, not once per word
text = [word for word in text if word not in stop_words]
print(text)

['helped', 'advise', 'friends', 'faced', 'circumstances', 'similar', 'mine']
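
The filtering step itself is a set-membership test per token. The sketch below reproduces it without NLTK, using a tiny hypothetical stop-word set (`TOY_STOPWORDS`) that covers only the words dropped from this one sentence. NLTK's English list is much longer.

```python
# Hypothetical, minimal stop-word set for illustration only.
TOY_STOPWORDS = frozenset({"i", "have", "who", "to"})


def remove_stopwords(tokens, stop_words=TOY_STOPWORDS):
    """Return the tokens that are not stop words, preserving their order."""
    return [token for token in tokens if token not in stop_words]


# The cleaned text of response 16, split on whitespace instead of word_tokenize.
tokens = "i have helped advise friends who have faced circumstances similar to mine".split()
filtered = remove_stopwords(tokens)
print(filtered)
# → ['helped', 'advise', 'friends', 'faced', 'circumstances', 'similar', 'mine']
```

Because the set is built once and membership tests on a set take constant time on average, the cost of filtering grows with the number of tokens rather than with the size of the stop-word list.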


### Lemmatization

In [31]:
lemmatizer = WordNetLemmatizer()
text = [(lemmatizer.lemmatize(lemmatizer.lemmatize(lemmatizer.lemmatize(word, "n"),pos = "v"),pos="a")) for word in text]
print(text)

['help', 'advise', 'friend', 'face', 'circumstance', 'similar', 'mine']


### All Words

In [32]:
description_list = []
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))  # built once, reused for every response

for description in data.response_text:
    # Keep letters only, then lowercase
    description = re.sub("[^a-zA-Z]", " ", description)
    description = description.lower()

    # Tokenize and drop stop words
    description = nltk.word_tokenize(description)
    description = [word for word in description if word not in stop_words]

    # Lemmatize each word as a noun, then as a verb, then as an adjective
    description = (lemmatizer.lemmatize(lemmatizer.lemmatize(lemmatizer.lemmatize(word, "n"), pos="v"), pos="a") for word in description)

    description = " ".join(description)
    description_list.append(description)

In [33]:
description_list[16]

'help advise friend face circumstance similar mine'

### Bag of Words

In [34]:
# max_features = 100
# count_vectorizer = CountVectorizer(max_features=max_features)
# sparce_matrix = count_vectorizer.fit_transform(description_list).toarray()
# print("Top {} Most Used Words: {}".format(max_features,count_vectorizer.get_feature_names()))

from sklearn.feature_extraction.text import CountVectorizer
max_feature = 500

cv = CountVectorizer(max_features = max_feature, stop_words = "english")

sparce_matrix = cv.fit_transform(description_list).toarray() # x

print("Most frequently used {} words {}".format(max_feature, cv.get_feature_names_out()))

Most frequently used 500 words ['able' 'absolutely' 'acquaintance' 'act' 'action' 'activity' 'addiction'
 'adequate' 'admit' 'advice' 'advise' 'age' 'ago' 'agony' 'alcoholic'
 'allow' 'anniversary' 'answer' 'anxiety' 'anxious' 'appose' 'ask'
 'attention' 'aunt' 'avoid' 'away' 'bad' 'basically' 'bedroom' 'best'
 'big' 'bite' 'blow' 'blue' 'blunt' 'book' 'boyfriend' 'break' 'bring'
 'brother' 'bunch' 'calm' 'camp' 'campsite' 'cancer' 'car' 'care' 'catch'
 'category' 'cause' 'chance' 'change' 'chat' 'circumstance' 'clean'
 'cocaine' 'come' 'comfort' 'commit' 'common' 'complete' 'completely'
 'concern' 'confine' 'conflict' 'convince' 'cop' 'cope' 'counselor'
 'countless' 'couple' 'crazy' 'cut' 'cutter' 'damn' 'date' 'day' 'deal'
 'death' 'define' 'depress' 'depression' 'desire' 'diagnose' 'dialog'
 'die' 'difficulty' 'disorder' 'doc' 'dont' 'douche' 'drag' 'drive' 'drug'
 'dump' 'ear' 'early' 'email' 'emotional' 'encourage' 'end' 'entire'
 'essential' 'esteem' 'eventually' 'everyday' 'ex' 
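
Conceptually, a bag-of-words matrix has one row per document and one column per vocabulary term, and each cell holds how often that term occurs in that document. The sketch below builds such a matrix by hand with `collections.Counter`. It is a simplified illustration, not a reimplementation of `CountVectorizer`, which also applies its own tokenization, stop-word filtering, and `max_features` cutoff. The first document is the processed response 16; the second is made up.

```python
from collections import Counter

docs = [
    "help advise friend face circumstance similar mine",  # processed response 16
    "friend help friend cope depression",                 # made-up example
]

# Vocabulary: every distinct token across all documents, in alphabetical order.
vocabulary = sorted({token for doc in docs for token in doc.split()})


def to_count_vector(doc, vocabulary):
    """Count how often each vocabulary term occurs in one document."""
    counts = Counter(doc.split())
    return [counts[term] for term in vocabulary]


matrix = [to_count_vector(doc, vocabulary) for doc in docs]

print(vocabulary)
# → ['advise', 'circumstance', 'cope', 'depression', 'face', 'friend', 'help', 'mine', 'similar']
for row in matrix:
    print(row)
# → [1, 1, 0, 0, 1, 1, 1, 1, 1]
# → [0, 0, 1, 1, 0, 2, 1, 0, 0]
```

Each row has exactly as many entries as the vocabulary has terms, which is why a fitted classifier later expects that same number of features at prediction time.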

### Naive Bayes

In [35]:
y = data.iloc[:,0].values
x = sparce_matrix

#### Train Test Split

In [36]:
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size = 0.2, random_state = 42)

#### Fit the Model

In [37]:
nb = GaussianNB()
nb.fit(x_train,y_train)
y_pred = nb.predict(x_test)
# score() expects the feature matrix, not the predictions. Passing
# y_pred.reshape(-1,1) instead raises:
# ValueError: X has 1 features, but GaussianNB is expecting 397 features as input.
print("Accuracy: {}".format(round(nb.score(x_test, y_test), 2)))

# Equivalent check with sklearn.metrics:
# from sklearn import metrics
# print("The accuracy of the Gaussian Naive Bayes model is", metrics.accuracy_score(y_test, y_pred))
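
The relationship between `score` and the predictions can be checked on a toy matrix. In the sketch below, `x_toy` and `y_toy` are made-up values, not rows from the dataset, and the model is evaluated on its own training data only to show the shapes each call expects. `score(X, y)` takes the feature matrix and predicts internally, while `accuracy_score(y_true, y_pred)` takes the labels and the predictions.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB

# Made-up count matrix: 6 documents, 3 vocabulary terms.
x_toy = np.array([
    [2, 0, 0],
    [3, 1, 0],
    [2, 0, 1],
    [0, 2, 3],
    [0, 3, 2],
    [1, 2, 3],
])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_nb = GaussianNB()
toy_nb.fit(x_toy, y_toy)

# predict() needs the same number of columns the model was fitted on (3 here).
toy_pred = toy_nb.predict(x_toy)

# Two routes to the same number: features + labels, or labels + predictions.
score_from_model = toy_nb.score(x_toy, y_toy)
score_from_metric = accuracy_score(y_toy, toy_pred)

print(score_from_model == score_from_metric)  # → True
```

Passing anything with a different number of columns to `score` or `predict`, such as a reshaped prediction vector, fails the feature-count check.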