# Run Sentiment Analysis on Forum Data
This notebook loads the Youbemom forum data, computes VADER sentiment scores for each post (with and without URLs), and writes the results back to the database.

## Data Sources
- youbemom-merged.db (created with 1.1-Merge_Databases.ipynb)

## Changes
- 2020-08-13: Set up data cleaning
- 2020-08-20: Added t-tests
- 2020-08-26: Added plots
- 2020-09-14: Added more plots
- 2020-09-15: Compared parent and child sentiment
- 2020-12-10: Changed data set
- 2020-12-13: Moved data analysis to new file
- 2020-12-15: Created new sentiment table, removed urls from strings

## Database Structure
- threads
  - id: automatically assigned
  - url: URL of the top post
  - subforum: subforum of the post
  - dne: post does not exist
- posts
  - id: automatically assigned
  - family_id: id of the thread this post belongs to (threads.id)
  - message_id: the unique id of the message from the HTML
  - parent_id: id of the post this post is responding to, 0 if top post
  - date_recorded: date the data was fetched
  - date_created: date the post was created
  - title: title of the post
  - body: body of the post
  - subforum: subforum of the post
  - deleted: whether the post has been deleted
  - text: title + body
  - text_no_url: text without URLs
  - neg_sentiment: VADER negative score of text
  - neu_sentiment: VADER neutral score of text
  - pos_sentiment: VADER positive score of text
  - compound_sentiment: VADER compound score of text
  - neg_sentiment_no_url: VADER negative score of text_no_url
  - neu_sentiment_no_url: VADER neutral score of text_no_url
  - pos_sentiment_no_url: VADER positive score of text_no_url
  - compound_sentiment_no_url: VADER compound score of text_no_url

## Imports

In [1]:
import re
import sqlite3
from datetime import datetime
from math import floor
from pathlib import Path

import pandas as pd
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

from youbemom import create_connection

## Functions
For formatting the data

In [2]:
def format_data(df):
    """ format the data frame from sql by removing the
        deleted-post placeholder from titles and creating
        a text column from title and body
    :param df: data frame
    :return df: formatted data frame
    """
    # raw string so the escaped period is not an invalid escape sequence
    df['title'] = df['title'].replace(r'This post has been deleted\.', '', regex=True)
    df['text'] = df['title'] + " " + df['body']
    return df

In [3]:
def remove_urls(df):
    """ removes urls from text strings and creates
        new column of text without urls
    :param df: data frame
    :return df: formatted data frame
    """
    pattern = r'(http|ftp|https):\/\/[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)'
    regex_pat = re.compile(pattern, flags=re.IGNORECASE)
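    # pass regex=True explicitly: pandas 2.0 changed the default to
    # regex=False, which rejects compiled patterns
    df['text_no_url'] = df['text'].str.replace(regex_pat, "", regex=True)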
    return df
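
As a quick check of what the pattern in `remove_urls` strips, the sketch below applies the same expression to plain strings with the standard `re` module rather than to a data frame. The helper name `strip_urls` and the sample strings are made up for illustration.

```python
import re

# Same pattern as in remove_urls above
URL_PATTERN = re.compile(
    r'(http|ftp|https):\/\/[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)',
    flags=re.IGNORECASE,
)

def strip_urls(text):
    """Hypothetical helper: remove every URL matched by URL_PATTERN."""
    return URL_PATTERN.sub("", text)

samples = [
    "try this recipe https://www.example.com/recipes?id=42 it is great",
    "no links here",
    "bare domains like example.com are left alone",
]
cleaned = [strip_urls(s) for s in samples]
```

Only the URL itself is removed. The whitespace on either side is kept, so a double space remains where the link was, and addresses without a scheme are not matched.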

For creating the sentiment values

In [4]:
def sentiment_analyzer_scores(sentence, analyzer):
    """ create sentiment scores with the VADER analyzer
    :param sentence: sentence to create scores for
    :param analyzer: VADER sentiment analyzer
    :return score: a dictionary of scores (neg, neu, pos, compound)
    """
    score = analyzer.polarity_scores(sentence)
    return score
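
The `text` column built from `title + " " + body` can hold a missing value when either part is NULL in the database, and VADER expects a string. One possible guard is a thin wrapper that returns all-zero scores for non-string input. This is a sketch, not part of the notebook: `safe_scores` and `StubAnalyzer` are hypothetical names, and the stub stands in for `SentimentIntensityAnalyzer` only so the example runs without the VADER package.

```python
ZERO_SCORES = {'neg': 0.0, 'neu': 0.0, 'pos': 0.0, 'compound': 0.0}

class StubAnalyzer:
    """Hypothetical stand-in exposing the same polarity_scores method as VADER."""
    def polarity_scores(self, text):
        # Fixed placeholder output; real scores depend on the text
        return {'neg': 0.0, 'neu': 0.5, 'pos': 0.5, 'compound': 0.4}

def safe_scores(sentence, analyzer):
    """Return analyzer scores for strings, all zeros for anything else (None, NaN)."""
    if not isinstance(sentence, str):
        return dict(ZERO_SCORES)
    return analyzer.polarity_scores(sentence)

stub = StubAnalyzer()
scores_missing = safe_scores(None, stub)
scores_nan = safe_scores(float('nan'), stub)
scores_text = safe_scores("so happy", stub)
```

With the real analyzer, the wrapper would be applied as `df['text'].apply(lambda x: safe_scores(x, analyzer))`. Whether all zeros is the right default for missing text is a judgment call; dropping those rows is another option.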

## File Locations

In [5]:
p = Path.cwd()
path_parent = p.parents[0]

In [6]:
path_db = path_parent / "database" / "youbemom-merged.db"
path_db = str(path_db)
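
Because the path is built relative to the current working directory, running the notebook from a different folder points it at a file that does not exist. If `create_connection` wraps `sqlite3.connect` (an assumption; its source is not shown here), SQLite would silently create an empty database at that path. The sketch below checks for the file first. `resolve_db_path` is a hypothetical helper, not part of the `youbemom` module, and the demonstration uses a throwaway directory.

```python
import tempfile
from pathlib import Path

def resolve_db_path(base_dir, filename="youbemom-merged.db"):
    """Hypothetical helper: build <base_dir>/database/<filename> and
    raise if the file is missing instead of letting SQLite create it."""
    path_db = Path(base_dir) / "database" / filename
    if not path_db.is_file():
        raise FileNotFoundError(f"database not found: {path_db}")
    return str(path_db)

# Demonstration in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    # No database folder yet, so the helper should raise
    try:
        resolve_db_path(tmp)
        missing_detected = False
    except FileNotFoundError:
        missing_detected = True

    # Create an empty placeholder file and resolve again
    db_dir = Path(tmp) / "database"
    db_dir.mkdir()
    (db_dir / "youbemom-merged.db").touch()
    found_name = Path(resolve_db_path(tmp)).name
```

In this notebook the call would be `resolve_db_path(Path.cwd().parents[0])`.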

## Load Data

In [7]:
conn = create_connection(path_db)

In [8]:
df = pd.read_sql_query("SELECT * FROM posts", conn)

## Format Data
Format the data to create a text column from the title and body and a text_no_url column that removes all urls from the text.

In [9]:
df = format_data(df)

In [10]:
df = remove_urls(df)

To write the sentiment scores to a separate table, uncomment and run the cell below to drop the title and body columns; otherwise, leave it commented out.

In [11]:
# df.drop('title', axis=1, inplace=True)
# df.drop('body', axis=1, inplace=True)

## Add Sentiment Scores

In [12]:
analyzer = SentimentIntensityAnalyzer()

With the text as collected:

In [13]:
sentiment = df['text'].apply(lambda x: sentiment_analyzer_scores(x, analyzer))

In [14]:
df['neg_sentiment'] = sentiment.apply(lambda x: x.get('neg', 0))
df['neu_sentiment'] = sentiment.apply(lambda x: x.get('neu', 0))
df['pos_sentiment'] = sentiment.apply(lambda x: x.get('pos', 0))
df['compound_sentiment'] = sentiment.apply(lambda x: x.get('compound', 0))

With the URLs removed:

In [15]:
sentiment_no_urls = df['text_no_url'].apply(lambda x: sentiment_analyzer_scores(x, analyzer))

In [16]:
df['neg_sentiment_no_url'] = sentiment_no_urls.apply(lambda x: x.get('neg', 0))
df['neu_sentiment_no_url'] = sentiment_no_urls.apply(lambda x: x.get('neu', 0))
df['pos_sentiment_no_url'] = sentiment_no_urls.apply(lambda x: x.get('pos', 0))
df['compound_sentiment_no_url'] = sentiment_no_urls.apply(lambda x: x.get('compound', 0))

To explore how removing URLs changes the text and the scores, uncomment any of the following:

In [17]:
# sum(sentiment != sentiment_no_urls)
# sum(df['text'] != df['text_no_url'])
# df[sentiment != sentiment_no_urls]
# df[(sentiment == sentiment_no_urls) & (df['text'] != df['text_no_url'])]

## Write to Database

In [18]:
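nrow = len(df.index)
nchunks = 167  # number of chunks the write is split into
chunksize = floor(nrow / nchunks)  # rows per chunk; the last chunk also takes the remainder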

In [19]:
# Write in chunks: the first chunk replaces the existing posts table,
# later chunks append, and the last chunk takes any remainder rows.
for i in range(nchunks):
    if i == 0:
        df[0 : chunksize].to_sql('posts', conn, if_exists='replace', index=False)
    elif i == nchunks - 1:
        df[chunksize * i :].to_sql('posts', conn, if_exists='append', index=False)
    else:
        df[chunksize * i : chunksize * (i + 1)].to_sql('posts', conn, if_exists='append', index=False)
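
The loop above splits the frame into `nchunks` slices of `floor(nrow / nchunks)` rows, with the last slice absorbing the remainder. The slice arithmetic can be checked on its own, without a database. `chunk_bounds` is a hypothetical helper written for this illustration, and the row count of 1000 is arbitrary.

```python
from math import floor

def chunk_bounds(nrow, nchunks):
    """Yield (start, stop) row bounds: nchunks slices of floor(nrow / nchunks)
    rows each, with the last slice extended to nrow to take the remainder."""
    chunksize = floor(nrow / nchunks)
    for i in range(nchunks):
        start = chunksize * i
        stop = nrow if i == nchunks - 1 else chunksize * (i + 1)
        yield start, stop

bounds = list(chunk_bounds(1000, 167))
sizes = [stop - start for start, stop in bounds]
```

Every row is covered exactly once, but the slices are uneven: with 1000 rows and 167 chunks, the first 166 slices hold 5 rows each and the last holds 170. `DataFrame.to_sql` also accepts a `chunksize` argument that batches the rows of a single call, which may be a simpler alternative.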

In [20]:
conn.commit()
conn.close()