# Capstone

# <u>Executive Summary</u>

### What is your goal?

* My long-term goal is to build a chatbot for the domestic violence and mental health domain. It will act as a friend and check in with the user at least weekly. During each conversation, it will monitor the user's sentiment and watch for risk factors that indicate the need for early intervention and referral to a human before stigma arises.
* We have split the project into phases: 
    * Phase 1 (capstone goal): Scraping Data from 1 forum, EDA including preliminary Topic Modelling and Sentiment Analysis
    * Phase 2: Increase dataset size and rerun tasks in phase 1
    * Phase 3: Introduce subject matter experts and continue to fine-tune models. Grow the dataset until it is large enough to train recurrent neural networks such as LSTMs and GRUs, which mitigate the vanishing gradient problem better than plain RNNs.
    * Phase 4: Refine model for deployment to incorporate it into the chatbot
    
The remainder of this report and executive summary will be written in the context of reporting on preliminary findings for Phase 1.

### What are your metrics?

* The metrics or key success factors have been split separately for scraping, topic modelling and sentiment analysis:
    * Scraping: successfully scraped 1 conversational thread in the depression sub-forum of beyondblue
    * Topic Modelling using LDA: Accurate identification of distinct topics in the beyondblue mental health forum. The top words of each topic should overlap little with the top words of other topics
    * Sentiment Analysis: Extract the overall sentiment of postings in beyondblue and individual sentiment of each posting in the sub-forum
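
One way to make the "little overlap between topics" criterion measurable is the Jaccard overlap between the top words of each pair of topics. The sketch below is only an illustration: the function name `topic_overlap` and the example word lists are hypothetical and are not output from the LDA model in this project.

```python
from itertools import combinations


def topic_overlap(topics):
    """Return the Jaccard overlap for every pair of topics.

    `topics` is a list of word lists (the top words of each topic).
    The result maps (i, j) index pairs to a value between 0.0 and 1.0,
    where 0.0 means the two topics share no top words.
    """
    overlaps = {}
    for (i, words_a), (j, words_b) in combinations(enumerate(topics), 2):
        set_a, set_b = set(words_a), set(words_b)
        union = set_a | set_b
        overlaps[(i, j)] = len(set_a & set_b) / len(union) if union else 0.0
    return overlaps


# Made-up top words for three topics, for illustration only
example_topics = [
    ["feel", "depression", "day", "help"],
    ["sleep", "day", "work", "tired"],
    ["family", "support", "friend", "talk"],
]

for pair, score in topic_overlap(example_topics).items():
    print(pair, round(score, 3))
```

A low score for every pair would suggest the topics are distinct. What counts as "low" is a judgment call that would need to be agreed before evaluating the model.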

### What were your findings?

Based on the key phrases from the topic models and a visual examination of the posts, the forum's users offer a lot of positive reinforcement and respond with empathy to people who arrive in a state of depression. As a result, there is hardly any negative sentiment in the posts, and the overall sentiment is neutral to positive. This is the kind of sentiment a counsellor would typically express towards a person seeking counsel for depression.

LDA did not surface conclusive risk-factor topics, so we will explore alternative algorithms such as latent semantic analysis to detect more meaningful topics. LDA was still a useful initial model for a high-level understanding of the terms that forum users employ most frequently.

### What risks/limitations/assumptions affect these findings?

Due to time constraints, the models selected for the first iteration were simple ones, intended to give preliminary findings on which topics LDA produces and what the sentiment of the postings is.

During EDA, the stopword list was extended manually with pronouns, based on visual examination. This will not scale well when new data is added, so in the next iteration we will implement part-of-speech tagging to overcome this limitation.
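
As a rough sketch of how part-of-speech tagging could replace the hand-maintained pronoun list, the function below drops any token whose tag is a Penn Treebank pronoun tag. It assumes a tagger callable that takes a list of tokens and returns `(token, tag)` pairs; `nltk.pos_tag` has that shape, provided its tagger model has been downloaded. The names `drop_pronouns` and `toy_tagger` are hypothetical, and `toy_tagger` is only a stand-in so the example runs without extra downloads.

```python
# Penn Treebank tags for personal, possessive and wh-pronouns
PRONOUN_TAGS = {"PRP", "PRP$", "WP", "WP$"}


def drop_pronouns(tokens, tagger):
    """Return the tokens whose part-of-speech tag is not a pronoun tag.

    `tagger` is any callable mapping a token list to (token, tag) pairs.
    """
    return [token for token, tag in tagger(tokens) if tag not in PRONOUN_TAGS]


def toy_tagger(tokens):
    """Stand-in tagger for illustration: tags a few known pronouns as PRP."""
    known_pronouns = {"i", "you", "he", "she", "we", "they", "me", "it"}
    return [
        (token, "PRP" if token.lower() in known_pronouns else "NN")
        for token in tokens
    ]


print(drop_pronouns(["I", "feel", "tired", "and", "you", "listened"], toy_tagger))
```

With a real tagger, the pronoun filter would adapt to new data automatically instead of relying on a list built by visual inspection.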

Manually evaluating the sentiment analysis is time-consuming and will not scale, so we seek to address this limitation in future iterations.

# 1 SCRAPING DATA

In [6]:
#General imports
import numpy as np
import pandas as pd
import math
from decimal import Decimal
import re
import string

# webscraping and related imports
import requests
import bs4
from bs4 import BeautifulSoup
import time

%config InlineBackend.figure_format = 'retina'

## 1.1 Beautiful Soup scraping

In [8]:
url = "https://www.beyondblue.org.au/get-support/online-forums/depression/depression-fight-it-or-embrace-it-"
# Request the URL stated above
response = requests.get(url)

# The status code shows how the target server responded to the request.
# Ex. 200 = OK, 400 = Bad Request, 403 = Forbidden, 404 = Not Found
print('Status Code: ',response.status_code)

# Parse the page with the lxml parser so Python can navigate its individual elements rather than treating it as one long string.
soup = BeautifulSoup(response.text, "lxml")
# Print soup as an indented tree that is easier to read
print(soup.prettify())

Status Code:  200
<!DOCTYPE html>
<!--[if lt IE 7]> <html class="no-js lt-ie10 lt-ie9 lt-ie8 lt-ie7" lang="en"> <![endif]-->
<!--[if IE 7]>    <html class="no-js lt-ie10 lt-ie9 lt-ie8" lang="en"> <![endif]-->
<!--[if IE 8]>    <html class="no-js lt-ie10 lt-ie9" lang="en"> <![endif]-->
<!--[if IE 9]>    <html class="no-js lt-ie10" lang="en"> <![endif]-->
<!--[if gt IE 9]><!-->
<html class="no-js" lang="en">
 <!--<![endif]-->
 <head id="Head1">
  <meta charset="utf-8"/>
  <meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
  <meta content="beyondblue" property="og:title"/>
  <meta content="website" property="og:type"/>
  <meta content="https://www.beyondblue.org.au/App_Themes/standard/images/BB_FB_logo.png" property="og:image"/>
  <title>
   DEPRESSION: Fight it or embrace it?
  </title>
  <!-- Default = user scalable, no responsive -->
  <meta content="width=device-width, initial-scale=1, minimum-scale=1, maximum-scale=1, user-scalable=no" name="viewport"/>
  <!-- modernizr 

In [9]:
#Function to extract date
def extract_age_from_result(soup):
    age = []
    for media in soup.find_all(name="div", attrs={"class":"media"}):
        for div in media.find_all(name="div", attrs={"class":"sfforumPostAge"}):
            # tag["..."] looks up an HTML attribute, so read the element text instead
            age.append(div.text.strip())
    return age

In [10]:
#Function to extract author
def extract_author_from_result(soup):
    authors = []
    for media in soup.find_all(name="div", attrs={"class":"media"}):
        for span in media.find_all(name="span", attrs={"class":"sfforumUser"}):
            # tag["..."] looks up an HTML attribute, so read the element text instead
            authors.append(span.text.strip())
    return authors

In [11]:
#Function to extract badge title
def extract_badge_from_result(soup):
    badges = []
    for media in soup.find_all(name="div", attrs={"class":"media"}):
        for div in media.find_all(name="div", attrs={"class":"badgeTitle"}):
            # tag["..."] looks up an HTML attribute, so read the element text instead
            badges.append(div.text.strip())
    return badges

In [12]:
#Function to extract like counter
def extract_likes_from_result(soup):
    likes = []
    for media in soup.find_all(name="div", attrs={"class":"media"}):
        for span in media.find_all(name="span", attrs={"class":"likeCounter"}):
            # tag["..."] looks up an HTML attribute, so read the element text instead
            likes.append(span.text.strip())
    return likes

In [13]:
#Function to extract post content
def extract_comments_from_result(soup):
    comments = []
    for media in soup.find_all(name="div", attrs={"class":"media"}):
        for div in media.find_all(name="div", attrs={"class":"sfContentBlock"}):
            # tag["..."] looks up an HTML attribute, so read the element text instead
            comments.append(div.text.strip())
    return comments
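
The five extraction functions differ only in the tag name and CSS class they look for, so they could be folded into one helper. The sketch below is a possible shape, not the notebook's actual code: `extract_text_by_class` is a hypothetical name, and the sample HTML is invented to mimic the class names used above, so the real page structure may differ. It uses the built-in `html.parser` backend so that it runs without `lxml`.

```python
from bs4 import BeautifulSoup


def extract_text_by_class(soup, tag_name, class_name):
    """Return the stripped text of every <tag_name class="class_name"> element."""
    return [
        element.get_text(strip=True)
        for element in soup.find_all(tag_name, class_=class_name)
    ]


# Invented sample markup for illustration only
sample_html = """
<ul>
  <li class="sfforumThreadPost">
    <div class="media">
      <span class="sfforumUser"> UserA </span>
      <span class="likeCounter">4</span>
    </div>
  </li>
  <li class="sfforumThreadPost">
    <div class="media">
      <span class="sfforumUser">UserB</span>
      <span class="likeCounter"></span>
    </div>
  </li>
</ul>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
print(extract_text_by_class(sample_soup, "span", "sfforumUser"))
print(extract_text_by_class(sample_soup, "span", "likeCounter"))
```

An empty like counter comes back as an empty string rather than being skipped, which keeps the lists aligned by post.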

In [14]:
header=["date","author","badge","likes","content"]
postings=pd.DataFrame(columns=header)

print(postings)

Empty DataFrame
Columns: [date, author, badge, likes, content]
Index: []


In [15]:
import re # the standard library module is sufficient for this pattern
#scraping code:
record_max = 6 # exclusive upper bound: pages 1 to 5 are scraped
step = 1
for start in range(1,record_max,step):
    url = "https://www.beyondblue.org.au/get-support/online-forums/depression/depression-fight-it-or-embrace-it-/page/"+ str(start)
    time.sleep(2) # pause between requests to avoid overloading the server
    response = requests.get(url)
    # Print the status code of the request just made, followed by the page number.
    #Ex. 200 = OK, 400 = Bad Request, 403 = Forbidden, 404 = Not Found
    print('Status Code: ',response.status_code, start)
    # Parse the page with the lxml parser so Python can navigate its individual elements rather than treating it as one long string.
    soup = BeautifulSoup(response.text, "lxml")
    for li in soup.find_all(name="li", attrs={"class":"sfforumThreadPost"}):
        #row number used as the index of this forum post in the dataframe
        num = (len(postings) + 1) 
        #creating an empty list to hold the data for each posting
        posting = []
        a = li.find(name="div", attrs={"class":"sfforumPostAge"}) # grab posting date
        # findall returns a list of matches, so the date is stored as a one-element list
        posting_date = re.findall(r"\d{1,2}\s[A-Za-z]+\s\d{4}", a.text)
        posting.append(posting_date)
        b = li.find(name="span", attrs={"class":"sfforumUser"}) # grab user
        posting.append(b.text.strip())
        c = li.find(name="div", attrs={"class":"badgeTitle"}) # grab badge title
        if not c:
            posting.append("No Badge") # for new user with no badge
        else:
            posting.append(c.text.strip())
        d = li.find(name="span", attrs={"class":"likeCounter"}) # grab like counter
        posting.append(d.text.strip())
        e = li.find(name="div", attrs={"class":"sfContentBlock"}) # grab comments
        posting.append(e.text.strip())
        postings.loc[num] = posting #appending list of forum post info to dataframe at index num

print("Scraping complete")

Status Code:  200 1
Status Code:  200 2
Status Code:  200 3
Status Code:  200 4
Status Code:  200 5
Scraping complete
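
The page loop could also be written so that a page is parsed only when its request succeeded. The sketch below is one possible shape under stated assumptions: `iter_pages` and `fetch_with_requests` are hypothetical names, and the `fetch` argument is any callable that returns a `(status_code, html)` pair, which keeps the loop testable without network access.

```python
import time

BASE_URL = (
    "https://www.beyondblue.org.au/get-support/online-forums/"
    "depression/depression-fight-it-or-embrace-it-/page/"
)


def iter_pages(fetch, first_page=1, last_page=5, delay_seconds=2.0):
    """Yield (page_number, html) for every page that returns HTTP 200.

    `fetch` takes a URL and returns a (status_code, html) tuple.
    """
    for page in range(first_page, last_page + 1):
        status_code, html = fetch(BASE_URL + str(page))
        # Report the status of the request that was just made
        print("Status Code:", status_code, page)
        if status_code == 200:
            yield page, html
        if page < last_page:
            # Pause between requests to avoid overloading the server
            time.sleep(delay_seconds)


def fetch_with_requests(url):
    """Fetch a URL with requests; imported lazily so the sketch loads without it."""
    import requests

    response = requests.get(url, timeout=10)
    return response.status_code, response.text


# Usage (performs real HTTP requests, so it is left commented out):
# for page, html in iter_pages(fetch_with_requests):
#     soup = BeautifulSoup(html, "lxml")
```

Skipping pages with a non-200 status avoids parsing error pages as if they were forum posts.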


In [16]:
postings

| | date | author | badge | likes | content |
|---|---|---|---|---|---|
| 1 | [3 March 2018] | Doolhof | Community Champion | 4 | Right now I feel like I don't have the energy ... |
| 2 | [3 March 2018] | quirkywords | Community Champion | 3 | Mrs dool.\nI am sending you a big reassuring h... |
| 3 | [3 March 2018] | Summer Rose | Valued Contributor | 2 | Hi Doolhof\nI'm sorry that you feel so low. Y... |
| 4 | [4 March 2018] | Doolhof | Community Champion | 2 | Hi Quirky,\nThanks for the virtual hug, I need... |
| 5 | [4 March 2018] | Doolhof | Community Champion | 3 | Hi Summer Rose,\nThanks for your encouragement... |
| 6 | [4 March 2018] | demonblaster | Valued Contributor | | Dools so sorry to hear you're in deep darl.\nI... |
| 7 | [4 March 2018] | demonblaster | Valued Contributor | | Dear Dools how are you feeling today darl 🤗\nI... |
| 8 | [4 March 2018] | Doolhof | Community Champion | 1 | Dear DB,\nThank you so very much. I have been ... |
| 9 | [4 March 2018] | demonblaster | Valued Contributor | | Hi all \nYeah it does pull us under its so dam... |
| 10 | [5 March 2018] | Doolhof | Community Champion | 1 | Hi DB,\nWoke up this morning wondering why I h... |
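
The table shows two things that later analysis would probably need cleaned: dates are stored as one-element lists, and the like counter is blank when a post has no likes. The helpers below are a minimal sketch of that cleanup. `parse_post_date` and `parse_likes` are hypothetical names, and the sketch assumes the raw text contains a date such as "3 March 2018" with a full English month name, as the table suggests.

```python
import re
from datetime import datetime

# Same shape as the dates shown in the table, e.g. "3 March 2018"
DATE_PATTERN = re.compile(r"\d{1,2}\s[A-Za-z]+\s\d{4}")


def parse_post_date(raw_text):
    """Return the first date found in raw_text as a datetime, or None."""
    match = DATE_PATTERN.search(raw_text)
    if match is None:
        return None
    try:
        return datetime.strptime(match.group(0), "%d %B %Y")
    except ValueError:
        # The text matched the pattern but is not a valid calendar date
        return None


def parse_likes(raw_text):
    """Return the like count as an int, treating a blank counter as 0."""
    cleaned = raw_text.strip()
    return int(cleaned) if cleaned.isdigit() else 0


print(parse_post_date("3 March 2018"))
print(parse_likes(""), parse_likes("4"))
```

Storing real `datetime` values and integer like counts would let the dataframe be sorted and aggregated directly in the EDA step.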


In [17]:
postings.to_excel("postings.xlsx") # Write scraped results to file

# [See /Gopinaath/capstone-eda.ipynb for continuation with EDA](./capstone-eda.ipynb)