# Run Sentiment Analysis on Forum Data
This notebook loads the Youbemom forum data and calculates VADER sentiment scores for each post.

## Data Sources
- youbemom-merged.db (created with 1.1-Merge_Databases.ipynb)

## Changes
- 2020-08-13: Set up data cleaning
- 2020-08-20: Added t-tests
- 2020-08-26: Added plots
- 2020-09-14: Added more plots
- 2020-09-15: Compared parent and child sentiment
- 2020-12-10: Changed data set
- 2020-12-13: Moved data analysis to new file

## Database Structure
- threads
  - id: automatically assigned
  - url: url of top post
  - subforum: subforum of post
  - dne: post does not exist
- posts
  - id: automatically assigned
  - family_id: thread->id
  - message_id: the unique id of the message from the html
  - parent_id: id of post this post is responding to, 0 if top post
  - date_recorded: date the data is fetched
  - date_created: date the data was created
  - title: title of the post
  - body: body of the post
  - subforum: subforum of post
  - deleted: has post been deleted

## Imports

In [1]:
import sqlite3
import pandas as pd
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from datetime import datetime
from pathlib import Path
from youbemom import create_connection

## Functions
For formatting the data

In [2]:
def format_data(df):
    """ format the data frame from sql so dates are in
        datetime format and creates text column from
        title and body
    :param df: data frame
    :return df: formatted data frame
    """
    df['date_recorded'] = pd.to_datetime(df['date_recorded'])
    df['date_created'] = pd.to_datetime(df['date_created'])
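    # text = title + body
    # raw string so "\." is a regex escape, not an invalid string escape
    df['title'] = df['title'].replace(r'This post has been deleted\.', '', regex=True)
    # fill missing values first; otherwise a NaN title or body makes text NaN
    df['text'] = df['title'].fillna('') + " " + df['body'].fillna('')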
    return df
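
A small sketch of why missing values matter when building the text column. The three posts here are made up for illustration; the point is only that string concatenation in pandas propagates NaN, so a post with a missing body would lose its title as well unless the gaps are filled first.

```python
import pandas as pd

# Made-up posts: a normal post, a deleted-title post, and one with no body.
posts = pd.DataFrame({
    "title": ["Need advice", "This post has been deleted.", "Hello"],
    "body": ["on sleep training", "", None],
})

# Strip the deletion notice from the title, as format_data does.
title = posts["title"].str.replace(r"This post has been deleted\.", "", regex=True)

# Naive concatenation: a missing body turns the whole text into NaN.
naive = title + " " + posts["body"]

# Filling missing values first keeps whatever text is available.
safe = (title.fillna("") + " " + posts["body"].fillna("")).str.strip()

print(naive.isna().tolist())
print(safe.tolist())
```

The `.str.strip()` at the end is optional; it only trims the separator space left behind when one side is empty.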

In [3]:
# def clean_text(text):
#     """ cleans the input text of punctuation, extra
#         spaces, and makes letters lower case
#     :param text: text (= title + body here)
#     :return clean: clean text
#     """
#     clean = "".join([t for t in text if t not in string.punctuation])
#     clean = re.sub(" +", " ", clean)
#     clean = clean.strip()
#     clean = clean.lower()
#     return clean

In [4]:
# def remove_stopwords(text):
#     """ remove all stop words from the text
#         using stopwords from nltk.corpus
#     :param text: text with stopwords
#     :return words: text without stopwords
#     """
#     words = [w for w in text if w not in stopwords.words('english')]
#     return words

For creating the sentiment values

In [5]:
def sentiment_analyzer_scores(sentence, analyzer):
    """ create sentiment scores with the VADER analyzer
    :param sentence: sentence to create scores for
    :param analyzer: VADER sentiment analyzer
    :return score: a dictionary of scores (neg, neu, pos, compound)
    """
    score = analyzer.polarity_scores(sentence)
    return score

## File Locations

In [7]:
p = Path.cwd()
path_parent = p.parents[0]

In [8]:
path_db = path_parent / "database" / "youbemom-merged.db"
path_db = str(path_db)

## Load Data

In [10]:
conn = sqlite3.connect(path_db)
df = pd.read_sql_query("SELECT * from posts", conn)
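
The posts table has roughly 16.7 million rows, so loading it in one call needs well over a gigabyte of memory. If that becomes a problem, one option is to read it in chunks. The sketch below uses a tiny in-memory table as a stand-in for youbemom-merged.db; the table contents and the chunk size of 3 are made up for illustration.

```python
import sqlite3

import pandas as pd

# Stand-in for youbemom-merged.db: a small in-memory "posts" table.
conn = sqlite3.connect(":memory:")
pd.DataFrame({
    "id": list(range(1, 8)),
    "title": [f"title {i}" for i in range(1, 8)],
    "body": [f"body {i}" for i in range(1, 8)],
}).to_sql("posts", conn, index=False)

# With chunksize set, read_sql_query returns an iterator of DataFrames
# instead of a single DataFrame.
chunks = pd.read_sql_query("SELECT id, title, body FROM posts", conn, chunksize=3)

chunk_sizes = []
for chunk in chunks:
    # Per-chunk processing (formatting, scoring) would go here.
    chunk_sizes.append(len(chunk))

conn.close()
print(chunk_sizes)
```

Selecting only the columns that are needed, rather than `SELECT *`, also reduces memory use.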

## Format Data
Format the data to convert the dates to datetimes and create a text column from the title and body. The filter that keeps only posts created after Jan. 1, 2019 (the scraper picked up one post from 2015) is commented out below, so all posts are currently kept. Each post is also labeled with a period (before, march, or during). I want to see if there is a difference between parent and child posts, so I made an "is_parent" indicator.

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16696050 entries, 0 to 16696049
Data columns (total 10 columns):
 #   Column         Dtype 
---  ------         ----- 
 0   id             int64 
 1   family_id      int64 
 2   message_id     object
 3   parent_id      object
 4   date_recorded  object
 5   date_created   object
 6   title          object
 7   body           object
 8   subforum       object
 9   deleted        int64 
dtypes: int64(3), object(7)
memory usage: 1.2+ GB


In [12]:
df = format_data(df)

In [None]:
# df = df[(df['date_created']>pd.Timestamp(2019,1,1))]

In [13]:
# strict cutoff at March 1 so all of Feb. 28 and Feb. 29 (2020 is a leap year) count as 'before'
df['before'] = df['date_created'] < pd.Timestamp(2020,3,1)
df['during'] = df['date_created'] >= pd.Timestamp(2020,4,1)
df['march'] = ~df['before'] & ~df['during']
df.loc[df['before'], 'period'] = 'before'
df.loc[df['march'], 'period'] = 'march'
df.loc[df['during'], 'period'] = 'during'
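
A quick check of how the period boundaries behave. A `pd.Timestamp` built from a year, month, and day is midnight at the start of that day, so a `<=` comparison against Feb. 28 excludes posts made later that day and all of Feb. 29. The sketch below assumes the intent is "before" = anything before March 1, 2020 and "during" = April 1, 2020 onward; the sample dates are made up.

```python
import pandas as pd

# Made-up timestamps around the period boundaries.
dates = pd.Series(pd.to_datetime([
    "2020-02-28 15:30",
    "2020-02-29 09:00",
    "2020-03-15 12:00",
    "2020-03-31 23:59",
    "2020-04-01 00:00",
]))

# A cutoff at midnight on Feb. 28 misses posts made later that day.
midnight_cutoff = dates <= pd.Timestamp(2020, 2, 28)

# Assumed intent: half-open intervals [.., Mar 1), [Mar 1, Apr 1), [Apr 1, ..).
march_start = pd.Timestamp(2020, 3, 1)
april_start = pd.Timestamp(2020, 4, 1)

period = pd.Series("march", index=dates.index)
period.loc[dates < march_start] = "before"
period.loc[dates >= april_start] = "during"

print(midnight_cutoff.tolist())
print(period.tolist())
```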

In [14]:
df['is_parent'] = df['parent_id'] == ""

In [15]:
df['weekday'] = df['date_created'].dt.day_name()

In [17]:
df['week_n'] = df['date_created'].dt.isocalendar().week

In [18]:
# weekday number (Monday=0 ... Sunday=6); .dt.day would give the day of the month
df['weekday_n'] = df['date_created'].dt.dayofweek

In [19]:
df['month'] = df['date_created'].dt.month_name()

In [20]:
df['month_n'] = df['date_created'].dt.month
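
For reference, here is what each `.dt` accessor returns for two sample dates. This is a sketch with made-up dates. It assumes `weekday_n` is meant to be the weekday number (`dt.dayofweek`, Monday=0), and `day_of_month` is a hypothetical column name added only to show how `dt.day` differs.

```python
import pandas as pd

# Two made-up dates: a Sunday and the following Monday.
dates = pd.Series(pd.to_datetime(["2020-03-15", "2020-03-16"]))

features = pd.DataFrame({
    "weekday": dates.dt.day_name(),         # "Sunday", "Monday"
    "weekday_n": dates.dt.dayofweek,        # Monday=0 ... Sunday=6
    "day_of_month": dates.dt.day,           # 1-31, not a weekday number
    "week_n": dates.dt.isocalendar().week,  # ISO 8601 week number
    "month": dates.dt.month_name(),
    "month_n": dates.dt.month,
})

print(features)
```

ISO weeks start on Monday, so the Sunday and the Monday above fall in different weeks.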

## Add Sentiment Scores

In [21]:
analyzer = SentimentIntensityAnalyzer()

With the text as collected:

In [22]:
sentiment = df['text'].apply(lambda x: sentiment_analyzer_scores(x, analyzer))

In [23]:
df['neg_sentiment'] = sentiment.apply(lambda x: x.get('neg', 0))
df['neu_sentiment'] = sentiment.apply(lambda x: x.get('neu', 0))
df['pos_sentiment'] = sentiment.apply(lambda x: x.get('pos', 0))
df['compound_sentiment'] = sentiment.apply(lambda x: x.get('compound', 0))
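
An alternative sketch for unpacking the score dictionaries in one pass instead of four separate `apply` calls. The first dictionary is the example output shown in this notebook; the second is made up. The result can be joined back onto the posts with `df.join(scores)`.

```python
import pandas as pd

# A Series of VADER-style score dicts (second entry is made up).
sentiment = pd.Series([
    {"neg": 0.0, "neu": 0.747, "pos": 0.253, "compound": 0.5267},
    {"neg": 0.3, "neu": 0.7, "pos": 0.0, "compound": -0.4},
])

# Build one column per key, then rename to match the *_sentiment columns.
scores = pd.DataFrame(sentiment.tolist(), index=sentiment.index)
scores = scores.add_suffix("_sentiment")

print(scores.columns.tolist())
```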

Example sentiment:

In [25]:
sentiment[5]

{'neg': 0.0, 'neu': 0.747, 'pos': 0.253, 'compound': 0.5267}

I can also do this without stop words or punctuation, but VADER includes punctuation in its sentiment calculation. "!", ALLCAPS, and degree modifiers (like "extremely" or "marginally") affect the magnitude of a sentiment. Conjunctions like "but" can flip the sentiment polarity. Below, we can see that the shift in sentiment is the same whether or not we include stop words and punctuation and lowercase everything, but I'll use the as-collected text instead of the clean text in the rest of the analysis.
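
To make the effect of emphasis concrete, here is a minimal sketch that scores a few variants of one sentence. The sentences are made up, and the import is guarded only so the cell still runs where vaderSentiment is not installed. No expected scores are listed; the thing to look at is the ordering of the compound values.

```python
try:
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
except ImportError:
    # Assumption: vaderSentiment may be missing outside this notebook's environment.
    SentimentIntensityAnalyzer = None

# Made-up sentences that differ only in emphasis or degree modifier.
variants = {
    "plain": "The service was great",
    "emphatic": "The service was GREAT!!!",
    "boosted": "The service was extremely great",
    "dampened": "The service was marginally great",
}

compound = {}
if SentimentIntensityAnalyzer is not None:
    analyzer = SentimentIntensityAnalyzer()
    for name, text in variants.items():
        # polarity_scores returns a dict with neg, neu, pos, and compound.
        compound[name] = analyzer.polarity_scores(text)["compound"]
        print(f"{name:>9}: {compound[name]:+.4f}  {text}")
```

Stripping punctuation and lowercasing would collapse the "plain" and "emphatic" variants into the same string, which is the main reason to score the as-collected text.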

In [26]:
df.to_sql('posts', conn, if_exists='replace', index=False)