In [1]:
from IPython.display import Image, display; display(Image(url="https://www.google.com/url?sa=i&url=https%3A%2F%2Fwww.sprinklr.com%2Fblog%2Fchatbot-examples%2F&psig=AOvVaw3GjLwPVFaNAUG6e4xKJYH2&ust=1705391165437000&source=images&cd=vfe&opi=89978449&ved=0CBMQjRxqFwoTCJDLi8yZ34MDFQAAAAAdAAAAABAI"))



## <div style="color:white;display:fill;border-radius:8px;background-color:#800080;font-size:150%; letter-spacing:1.0px"><p style="padding: 15px;color:white;"><b><b><span style='color:white'><span style='color:#F1A424'> | </span> </span></b>Defining the Question</b></p></div>

## <b><span style='color:#F1A424'>|</span> Executive Summary:</b> 

**Mental health, fundamentally a state of well-being, is crucial for individuals to realize their abilities, manage life's normal stresses, work productively, and contribute to their communities. Despite the rising global prevalence of mental health issues, including a 13% increase over the last decade noted by the WHO, access to effective treatments remains uneven, particularly among urban youths who face distinct challenges and stressors.
 Saidika, a burgeoning mental health service provider for urban youth, has encountered challenges due to the growing demand for mental health services. The volume of clients has impeded the prompt allocation of therapy resources, particularly for urgent cases, prompting the need for innovative solutions to enhance the efficiency and effectiveness of mental health care delivery. By leveraging the capabilities of AI and advancements in NLP, the project aims to bridge the gap between the growing demand for mental health services and the current limitations in supply and accessibility.**


## <b><span style='color:#F1A424'>|</span> Problem Statement:</b> 

**Saidika's platform is currently unable to efficiently handle the increasing influx of clients seeking mental health services. The inability to quickly triage and prioritize client needs is leading to potential delays in addressing urgent cases, which could have severe consequences on the well-being of individuals in need.**

## <b><span style='color:#F1A424'>|</span> Proposed Solution:</b> 

**The main objective is to integrate an advanced AI-powered mental health chatbot into Saidika's existing platform to optimize client management processes, ensuring timely and appropriate allocation of therapy resources to those in need.**


## <b><span style='color:#F1A424'>|</span> Specific Objectives:</b> 
- **Client Categorization: To develop a chatbot that can accurately categorize clients based on their responses, distinguishing between varying levels of care requirements and scheduling clients based on their assessed needs and therapists' availability, optimizing the use of Saidika's resources.**
- **Urgency Escalation: To ensure the chatbot is capable of rapidly identifying and escalating urgent cases to therapists, facilitating prompt intervention.**
- **Service Accessibility: To broaden access to mental health care by providing a 24/7 chatbot service that will offer real-time interaction to clients who require immediate attention or a platform to express their concerns, bridging the gap until a professional is available.**
- **Resource Optimization: To aid therapists in managing their workload more effectively by allowing the chatbot to handle routine inquiries and non-urgent interactions.**
- **Data Collection and Analysis: To gather and analyze interaction data to continually improve the chatbot’s performance and the platform’s services.**
- **User Experience Enhancement: To create a user-friendly chatbot interface that provides a supportive environment for clients to express their concerns.**
- **Integration and Compatibility: To seamlessly integrate the chatbot into both web and mobile applications, ensuring functionality across various devices.**


## <b><span style='color:#F1A424'>|</span> Project Impact:</b> 

**The successful implementation of the mental health chatbot is expected to significantly improve the scalability of Saidika's services, enabling them to handle a greater volume of clients without sacrificing the quality of care. This technological solution aims not only to streamline operations but also to provide a critical early support system for individuals seeking mental health assistance. The chatbot's ability to analyze data will also furnish Saidika with valuable insights, driving policy and decision-making to better serve the community's mental health needs. Ultimately, the project endeavors to foster a more resilient urban youth population, better equipped to contribute positively to their communities.**

## <div style="color:white;display:fill;border-radius:8px;background-color:#800080;font-size:150%; letter-spacing:1.0px"><p style="padding: 12px;color:white;"><b><b><span style='color:white'><span style='color:#F1A424'>|</span></span></b>DATA PERTINENCE AND ATTRIBUTION</b></p></div>



**The business aims to gain valuable insights into mental health trends, sentiments, and urgency levels by leveraging a diverse dataset acquired from public domain resources and Saidika's private, anonymized user data with proper consent and privacy law adherence. The data primarily consists of information gathered from health forums, Reddit, a dedicated mental health forum, and Beyond Blue.**

**Data Preparation:**

**Data Sources: Public domain resources and private Saidika user data.**

**Variable Types:**

- **Categorical variables: Representing various types of mental health issues.**

- **Binary variables: Indicating urgency levels.**
- **Continuous variables: Expressing sentiment scores associated with mental health discussions.**

**Preprocessing Steps:**

- **Text data cleaning: Removal of identifiable information.**

- **Tokenization: Breaking down text into tokens.**

- **Lemmatization: Reducing words to their base or root form.**

- **Vectorization: Converting text into numerical vectors suitable for Natural Language Processing (NLP) tasks.**
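
The preprocessing steps above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the pipeline used later in this notebook: the regular expressions, the helper names (`clean`, `tokenize`, `vectorize`), and the two sample posts are all hypothetical, and lemmatization is left out because it needs an external resource such as WordNet.

```python
import re
from collections import Counter

# Hypothetical patterns for identifiable information (emails and @handles)
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
HANDLE_RE = re.compile(r"@\w+")
TOKEN_RE = re.compile(r"[a-z']+")


def clean(text):
    """Remove identifiable information and lowercase the text."""
    # Emails are removed before handles so that "@example" inside an
    # address is not mistaken for a user handle
    text = EMAIL_RE.sub(" ", text)
    text = HANDLE_RE.sub(" ", text)
    return text.lower()


def tokenize(text):
    """Split cleaned text into lowercase word tokens."""
    return TOKEN_RE.findall(text)


def vectorize(tokens, vocabulary):
    """Return a bag-of-words count vector aligned with `vocabulary`."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


# Hypothetical posts, not taken from the dataset
posts = [
    "I feel anxious, email me at jane@example.com",
    "@helper I feel tired and anxious",
]
token_lists = [tokenize(clean(post)) for post in posts]
vocabulary = sorted({token for tokens in token_lists for token in tokens})
vectors = [vectorize(tokens, vocabulary) for tokens in token_lists]
```

Each post becomes a list of counts with one position per vocabulary word, which is the simplest form of the numerical vectors mentioned in the vectorization step.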

**Libraries Used:**

- **BeautifulSoup: Utilized for parsing and extracting data from HTML content.**

- **Python Libraries (NLTK, spaCy): Applied for NLP tasks such as tokenization, lemmatization, and other text processing operations.**

**Algorithms:**

- **Logistic Regression: Employed for analyzing categorical and binary variables, predicting urgency levels based on mental health issues.**

- **LSTM (Long Short-Term Memory): Utilized for sequence modeling in NLP, capturing dependencies in sentiment scores over the course of discussions.**

- **BERT (Bidirectional Encoder Representations from Transformers): Implemented for advanced contextualized embeddings, enhancing understanding of the nuanced context within mental health discourse.**

- **GPT (Generative Pre-trained Transformer): Employed for generating human-like text responses and comprehending the context of mental health discussions.**

**Overall, the objective is to extract meaningful insights, patterns, and correlations from this rich dataset, contributing to a deeper understanding of mental health issues, sentiments, and urgency levels, ultimately informing strategies for better mental health support and intervention.**








## <div style="color:white;display:fill;border-radius:8px;background-color:#800080;font-size:150%; letter-spacing:1.0px"><p style="padding: 12px;color:white;"><b><b><span style='color:white'><span style='color:#F1A424'>1 |</span></span></b>Data Loading & Preparation</b></p></div>

## <b>1.1 <span style='color:#F1A424'>|</span> Importing Necessary Libraries</b> 

In [2]:
import re
import string
import numpy as np
import random
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns  #plotting statistical graphs
%matplotlib inline
from plotly import graph_objs as go
import plotly.express as px
import plotly.figure_factory as ff
import squarify
from collections import Counter

# Load the Text Cleaning Package
import neattext.functions as nfx

from PIL import Image
# WordCloud visualizes text data; the size of each word reflects its frequency
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.metrics import confusion_matrix,roc_auc_score,classification_report
from sklearn.compose import ColumnTransformer

from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier,AdaBoostClassifier,GradientBoostingClassifier,ExtraTreesClassifier
from sklearn.linear_model import RidgeClassifier,SGDClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB


import nltk
from nltk.corpus import stopwords

from tqdm import tqdm  # progress bars for long-running loops
import os
#import spacy #for training the NER model tokenize words
#import random
#from spacy.util import compounding
#from spacy.util import minibatch


# Use the full option names so the lookups stay unambiguous
pd.set_option('display.max_colwidth', 400)
pd.set_option('display.html.use_mathjax', False)


import warnings
warnings.filterwarnings("ignore")

## <b>1.2 <span style='color:#F1A424'>|</span>Loading in our Data</b> 

In [4]:
# load the dataset -> feature extraction -> data visualization -> data cleaning -> train test split
# -> model building -> model training -> model evaluation -> model saving -> streamlit application deploy

# load the dataset just using specific features
df = pd.read_csv('reddit_data.csv')

df.head()

FileNotFoundError: [Errno 2] No such file or directory: 'reddit_data.csv'
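
The error above means the CSV was not found relative to the notebook's working directory. One possible way to fail earlier with a clearer message is sketched below; `load_dataset` is a hypothetical helper, not part of the original notebook, and the file location is an assumption to adjust to your own setup.

```python
from pathlib import Path

import pandas as pd


def load_dataset(path, columns=None):
    """Load a CSV, failing early with a clearer message if it is missing."""
    csv_path = Path(path)
    if not csv_path.is_file():
        raise FileNotFoundError(
            f"Expected dataset at {csv_path.resolve()}; "
            "check the working directory or pass the full path."
        )
    # usecols=None loads every column; pass a list to load specific features
    return pd.read_csv(csv_path, usecols=columns)
```

With the file in place, `df = load_dataset('reddit_data.csv')` would behave like the `pd.read_csv` call above, and passing `columns=[...]` restricts the load to specific features.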

## <div style="color:white;display:fill;border-radius:8px;background-color:#800080;font-size:150%; letter-spacing:1.0px"><p style="padding: 12px;color:white;"><b><b><span style='color:white'><span style='color:#F1A424'>2 |</span></span></b> Data Quality Checks</b></p></div>
   
- **Another crucial step in any project involves ensuring the quality of your data. Remember that your model’s performance is directly tied to the data it processes. Therefore, take the time to remove duplicates and handle missing values appropriately.**

- **Here we check for missing values and duplicates and remove any unnecessary variables/features/columns. Because the data is free text, the usual numeric outlier checks do not apply.**

In [None]:
df.info()

## <b>2.1 <span style='color:#F1A424'>|</span> Checking for NaN Values</b> 

In [None]:
print(df.isna().sum())
print("*"*40)

**As noted earlier, we do not have any null values.**

## <b>2.2 <span style='color:#F1A424'>|</span> Checking for Sentence Length Consistency</b> 

In [None]:
df['tweet'].apply(len).value_counts()

**This gives an overview of the number of characters per tweet, since `len` applied to a string counts characters rather than words. Some entries contain five or fewer characters and are too short to be useful for constructing our predictive model.**

In [None]:
sum(df['tweet'].apply(len) > 5) , sum(df['tweet'].apply(len) <= 5)

**We have `43464` tweets with more than 5 characters and only `14` tweets with 5 or fewer characters.**
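
Because `len` on a string counts characters, it helps to compare character and word counts on the same text. The sketch below uses a small hypothetical DataFrame, not the real dataset, to show both measures and the character-based filter used in this section.

```python
import pandas as pd

# Hypothetical sample; the notebook itself works on df['tweet']
sample = pd.DataFrame(
    {"tweet": ["ok", "I feel low", "help", "cannot sleep at all"]}
)

char_len = sample["tweet"].str.len()              # characters per entry
word_len = sample["tweet"].str.split().str.len()  # whitespace-separated words

# Keep entries longer than 5 characters, mirroring the filter in this section
kept = sample[char_len > 5]
```

Filtering on `word_len` instead would be the choice if the intent is a minimum number of words rather than characters.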

In [None]:
print("Shape of the dataset before filtering:")
print(df.shape)
print("*"*40)
df = df[df['tweet'].apply(len) > 5]
print("Shape of the dataset after filtering:")
print(df.shape)

## <b>2.3 <span style='color:#F1A424'>|</span> Checking for Duplicates</b> 

In [None]:
print(df.duplicated().sum())
print("*"*40)

**We will have to check if indeed these are duplicate values.**

In [None]:
# Check whether the duplicate values are indeed duplicates.
# head() keeps the sorted order so repeated tweets appear next to each other.
df[df.duplicated(subset=['tweet'], keep=False)].sort_values(by='tweet').head(10)

In [None]:
df[df['id'] == 1094014959636410368]

**We can see that the dataset does indeed contain duplicate tweets. We will go ahead and drop these duplicate entries, even though the `21772` duplicates account for almost half of our data.**
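
Note that `duplicated()` and `drop_duplicates()` compare entire rows by default, whereas the inspection above groups by the `tweet` column only. The sketch below, on hypothetical rows, shows how the two notions of "duplicate" can give different counts.

```python
import pandas as pd

# Hypothetical rows: two entries share the same text under different ids
sample = pd.DataFrame({
    "id": [1, 2, 3, 3],
    "tweet": ["feeling low", "feeling low", "need help", "need help"],
})

full_row_dupes = sample.duplicated().sum()              # identical in every column
text_dupes = sample.duplicated(subset=["tweet"]).sum()  # identical text only

deduped_rows = sample.drop_duplicates()                  # drops full-row repeats
deduped_text = sample.drop_duplicates(subset=["tweet"])  # keeps first of each text
```

If the same text can appear under different ids, passing `subset=['tweet']` is the variant that removes those repeats as well; whether that is desirable depends on how the data was collected.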

In [None]:
df = df.drop_duplicates()

print(df.duplicated().sum())
print("*"*40)

In [None]:
df.info()

## <div style="color:white;display:fill;border-radius:8px;background-color:#800080;font-size:150%; letter-spacing:1.0px"><p style="padding: 12px;color:white;"><b><b><span style='color:white'><span style='color:#F1A424'>3 |</span></span></b> Data Preprocessing</b></p></div>

- **Preprocessing procedures include tokenizing (splitting), stemming, and lemmatization; which ones to apply depends on the model you choose to use.**

## <b>3.1 <span style='color:#F1A424'>|</span> Text Cleaning (Source Text)</b> 
+ Mentions / User handles
+ Hashtags
+ URLs
+ Special Characters
+ Whitespaces
+ Emojis
+ Contractions
+ Stopwords

**For cleaning our text we will be using the NeatText Library. NeatText is a simple NLP package for cleaning textual data and text preprocessing. It offers a variety of features for cleaning unstructured text data, reducing noise (such as special characters and stopwords), and extracting specific information from the text. It can be used via an object-oriented approach or a functional/method-oriented approach, providing flexibility in its usage. The package includes classes such as TextCleaner, TextExtractor, and TextMetrics for different text processing tasks.**

https://pypi.org/project/neattext/

In [None]:
# load the text cleaning packages

import neattext as nt
import neattext.functions as nfx

# Methods and attributes of the module
dir(nt)

### <b>3.1.1 <span style='color:#F1A424'>|</span> Mentions / User Handles</b> 

In [None]:
# Noise scan
df['tweet'].apply(lambda x: nt.TextFrame(x).noise_scan()['text_noise'])

In [None]:
# Ensure all entries in 'tweet' column are strings
df['tweet'] = df['tweet'].astype(str)

# Now apply the clean_text function
df['clean_tweet'] = df['tweet'].apply(lambda x: nfx.clean_text(x, puncts=False, stopwords=False))

In [None]:
# Extract userhandles into another column before removing them
df['userhandle'] = df['clean_tweet'].apply(nfx.extract_userhandles)

In [None]:
# Remove the userhandles
df['clean_tweet'] = df['clean_tweet'].apply(nfx.remove_userhandles)

df[['tweet', 'clean_tweet', 'userhandle']].head()

### <b>3.1.2 <span style='color:#F1A424'>|</span> Hashtags</b> 

In [None]:
# Extract hashtags into another column before removing them
df['hashtags'] = df['clean_tweet'].apply(nfx.extract_hashtags)

df[['tweet', 'clean_tweet', 'hashtags']].head()

In [None]:
# Remove hashtags
df['clean_tweet'] = df['clean_tweet'].apply(nfx.remove_hashtags)

df[['tweet', 'clean_tweet', 'hashtags']].head()

### <b>3.1.3 <span style='color:#F1A424'>|</span> URLs</b> 

In [None]:
# Extract URLs into another column before removing them
# If we removed special characters (e.g. '//') first, the function would be unable to detect the URLs
df['urls'] = df['clean_tweet'].apply(nfx.extract_urls)

df[['tweet', 'clean_tweet', 'urls']].sample(5)

In [None]:
df[['tweet', 'clean_tweet', 'urls']].loc[15]

In [None]:
df[['tweet', 'clean_tweet', 'urls']].loc[16515]

In [None]:
df[['tweet', 'clean_tweet', 'urls']].loc[12827]

In [None]:
# Remove URLS
df['clean_tweet'] = df['clean_tweet'].apply(nfx.remove_urls)

In [None]:
df[['tweet', 'clean_tweet', 'urls']].loc[15]
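
The ordering concern noted in this section can be illustrated with plain regular expressions. The patterns below are hypothetical stand-ins chosen for the example; NeatText's internal patterns may differ.

```python
import re

# Hypothetical patterns for illustration only
URL_RE = re.compile(r"https?://\S+")
SPECIAL_RE = re.compile(r"[^A-Za-z0-9\s]")

text = "Resources here: https://example.org/help #support"

# Order 1: extract and remove URLs first, then strip special characters
urls_first = URL_RE.findall(text)
cleaned_first = SPECIAL_RE.sub("", URL_RE.sub("", text))

# Order 2: strip special characters first, then look for URLs
stripped = SPECIAL_RE.sub("", text)
urls_after = URL_RE.findall(stripped)
```

In the second order the URL collapses into a single run of letters, so no URL is detected and the residue stays in the cleaned text.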

### <b>3.1.4 <span style='color:#F1A424'>|</span> Special Characters</b> 

In [None]:
# Remove special characters

df['clean_tweet'] = df['clean_tweet'].apply(nfx.remove_special_characters)

df[['tweet', 'clean_tweet']].sample(5)

### <b>3.1.5 <span style='color:#F1A424'>|</span> Multiple Whitespaces</b> 

In [None]:
# Remove whitespaces
df['clean_tweet'] = df['clean_tweet'].apply(nfx.remove_multiple_spaces)

df[['tweet', 'clean_tweet']].head()

### <b>3.1.6 <span style='color:#F1A424'>|</span> Emojis</b> 

In [None]:
# Remove emojis
df['clean_tweet'] = df['clean_tweet'].apply(nfx.remove_emojis)

df[['tweet', 'clean_tweet']].sample(5)

### <b>3.1.7 <span style='color:#F1A424'>|</span> Contractions</b> 

In [None]:
import contractions

# Apply the contractions.fix function to the clean_tweet column
df['clean_tweet'] = df['clean_tweet'].apply(contractions.fix)

df[['tweet', 'clean_tweet']].head()

### <b>3.1.8 <span style='color:#F1A424'>|</span> Stopwords</b> 

In [None]:
# Extract stopwords
df['clean_tweet'].apply(lambda x: nt.TextExtractor(x).extract_stopwords())

In [None]:
# Remove the stop words

df['clean_tweet'] = df['clean_tweet'].apply(nfx.remove_stopwords)

df[['tweet', 'clean_tweet']].head()

In [None]:
# Noise Scan after cleaning text
df['clean_tweet'].apply(lambda x: nt.TextFrame(x).noise_scan()['text_noise'])

## <b>3.2 <span style='color:#F1A424'>|</span> Linguistic Processing (Clean Text)</b> 

+ Tokenization
+ Stemming / Lemmatization
+ Parts of Speech Tagging
+ Calculating Sentiment Based on Polarity & Subjectivity

### <b>3.2.1 <span style='color:#F1A424'>|</span> Tokenization</b> 

In [None]:
test_sample = df['clean_tweet'].loc[12827]

test_sample

In [None]:
from nltk.tokenize import RegexpTokenizer

basic_token_pattern = r"(?u)\b\w\w+\b"

tokenizer = RegexpTokenizer(basic_token_pattern)

tokenizer.tokenize(test_sample)

In [None]:
# Tokenise the clean_tweet column
df['preprocessed_tweet'] = df['clean_tweet'].apply(lambda x: tokenizer.tokenize(x))

# df.iloc[100]["preprocessed_tweet"][:20]

In [None]:
df[['clean_tweet', 'preprocessed_tweet']].iloc[100]
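
To see what `basic_token_pattern` keeps and drops, the same pattern can be tried with the standard `re` module, which should behave like the tokenizer above for this pattern. The sample sentence is hypothetical.

```python
import re

# Same pattern as above: runs of two or more word characters
basic_token_pattern = r"(?u)\b\w\w+\b"

sample = "i feel so tired a lot of days"
tokens = re.findall(basic_token_pattern, sample)

# Words from the sample that the pattern did not keep
dropped = [word for word in sample.split() if word not in tokens]
```

Single-character words such as "i" and "a" are dropped, which is worth keeping in mind when first-person language matters for the analysis.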

### <b>3.2.2 <span style='color:#F1A424'>|</span> Stemming / Lemmatization</b> 

In [None]:
# Instantiate the lemmatizer once rather than on every row
# (requires the WordNet corpus, e.g. nltk.download('wordnet'))
lemmatizer = nltk.stem.WordNetLemmatizer()

# Define a function to lemmatise the tokens
def lemmatise_tokens(tokens):
    # lemmatize() treats each token as a noun unless a POS tag is passed
    return [lemmatizer.lemmatize(token) for token in tokens]

# Lemmatise the tokens
df['lemma_preprocessed_tweet'] = df['preprocessed_tweet'].apply(lemmatise_tokens)

# df.iloc[100]["lemma_preprocessed_tweet"][:20]

In [None]:
df[['clean_tweet', 'lemma_preprocessed_tweet']].iloc[260]

In [None]:
# Instantiate the stemmer once rather than on every row
stemmer = nltk.stem.PorterStemmer()

# Define a function to stem the tokens
def stem_tokens(tokens):
    return [stemmer.stem(token) for token in tokens]

# Stem the tokens
df['stemma_preprocessed_tweet'] = df['preprocessed_tweet'].apply(stem_tokens)

# df.iloc[100]["stemma_preprocessed_tweet"][:20]

In [None]:
df[['clean_tweet', 'stemma_preprocessed_tweet']].iloc[200]

### <b>3.2.3 <span style='color:#F1A424'>|</span> Calculating Sentiment Based on Polarity & Subjectivity</b>

TextBlob is a Python library for processing textual data, including sentiment analysis. It builds on the Natural Language Toolkit (NLTK) to achieve its tasks. When a sentence is passed into TextBlob, its `sentiment` property returns two outputs: polarity and subjectivity. The polarity score is a float within the range [-1, 1], where -1 indicates a negative sentiment and 1 indicates a positive sentiment. The subjectivity score is a float within the range [0, 1], where 0 is very objective and 1 is very subjective.

In [None]:
from textblob import TextBlob

# Create a function to get the subjectivity
def getSubjectivity(text):
  return TextBlob(text).sentiment.subjectivity

# Create a function to get the polarity
def getPolarity(text):
  return TextBlob(text).sentiment.polarity

# Create two new columns 'Subjectivity' & 'Polarity'
df['Subjectivity'] = df['clean_tweet'].apply(getSubjectivity)
df['Polarity'] = df['clean_tweet'].apply(getPolarity)

# Show the new dataframe with columns 'Subjectivity' & 'Polarity'
df[['clean_tweet','Subjectivity','Polarity']].head()

In [None]:
# Create a function to label the polarity score as negative, neutral or positive
def getAnalysis(score):
  if score < 0:
    return 'Negative'
  elif score == 0:
    return 'Neutral'
  else:
    return 'Positive'
  
df['sentiment'] = df['Polarity'].apply(getAnalysis)

# Show the dataframe
df[['clean_tweet','Subjectivity','Polarity','sentiment']].head()

In [None]:
df['sentiment'].value_counts()
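
Labelling only an exact zero as neutral means that scores very close to zero still count as positive or negative. The sketch below shows one possible variant with a configurable neutral band; `label_polarity`, the `neutral_band` parameter, and the sample scores are hypothetical and not part of the original notebook.

```python
from collections import Counter


def label_polarity(score, neutral_band=0.0):
    """Map a polarity score in [-1, 1] to a sentiment label.

    Scores whose absolute value is at most `neutral_band` count as Neutral;
    the default of 0.0 labels only an exact zero as Neutral.
    """
    if score > neutral_band:
        return "Positive"
    if score < -neutral_band:
        return "Negative"
    return "Neutral"


# Hypothetical polarity scores, not taken from the dataset
scores = [-0.6, -0.02, 0.0, 0.03, 0.4]
strict = Counter(label_polarity(s) for s in scores)
banded = Counter(label_polarity(s, neutral_band=0.05) for s in scores)
```

With the default band the function matches the exact-zero rule used above; a small band such as 0.05 moves near-zero scores into the neutral class, which changes the class balance reported by `value_counts()`.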

In [None]:
# # using VADER
# from nltk.sentiment.vader import SentimentIntensityAnalyzer
# analyser = SentimentIntensityAnalyzer()

# # Create a function to get the sentiment scores
# def sentiment_analyzer_scores(text):
#     score = analyser.polarity_scores(text)
#     return score

# # Get the compound sentiment scores
# df['compound_sentiment'] = df['clean_tweet'].apply(lambda x: sentiment_analyzer_scores(x)['compound'])

# # Get the sentiment scores whereby there is positive, negative and neutral sentiment
# df['sentiment'] = df['compound_sentiment'].apply(lambda x: 'positive' if x >= 0.05 else ('negative' if x <= -0.05 else 'neutral'))

# df[['clean_tweet', 'compound_sentiment', 'sentiment']].head()

In [None]:
# df['sentiment'].value_counts()

In [None]:
df['preprocessed_tweet']

In [None]:
df['lemma_preprocessed_tweet'] = df['lemma_preprocessed_tweet'].apply(lambda x: ' '.join(x))

In [None]:
df['stemma_preprocessed_tweet'] = df['stemma_preprocessed_tweet'].apply(lambda x: ' '.join(x))

df['preprocessed_tweet'] = df['preprocessed_tweet'].apply(lambda x: ' '.join(x))

In [None]:
df['preprocessed_tweet']

In [None]:
df.info()

In [None]:
# Save the dataframe to CSV as 'interim_data.csv' in the data folder
# df.to_csv('interim_data.csv', index=False)