# Sentiment Analysis Preprocessing

Preprocessing specifically for sentiment analysis

Preprocessing Notes:  
For sentiment analysis, we use the uncleaned version of the data. We want to leave the text close to its original form (no lowercasing, no lemmatizing, etc.) because VADER uses capitalization and punctuation as intensity cues.  
See https://towardsdatascience.com/are-you-scared-vader-understanding-how-nlp-pre-processing-impacts-vader-scoring-4f4edadbc91d 

In [132]:
#import json
import pandas as pd
from nltk import tokenize
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
#import nltk
#nltk.download('vader_lexicon')
import re
import regex
import matplotlib.pyplot as plt

## Read In Data

In [3]:
dfraw = pd.read_json('../data/metadata_w_2020articles.json')


In [137]:
df = dfraw.T.reset_index().rename(columns={'index':'uuid'})

## Preprocessing

In [138]:
df.columns

Index(['uuid', 'source', 'year', 'article_text', 'title', 'full_article_text'], dtype='object')

In [139]:
# remove newline, tab, and carriage-return characters from the title and text columns
df['title'] = df['title'].apply(lambda x: re.sub(r'[\n\t\r]', '', x))
df['article_text'] = df['article_text'].apply(lambda x: re.sub(r'[\n\t\r]', '', x))
df['full_article_text'] = df['full_article_text'].apply(lambda x: re.sub(r'[\n\t\r]', '', x))


# remove curly single quotes and straight double quotes; these characters led to neutral VADER sentiment scores
df['title'] = df['title'].apply(lambda x: re.sub(r'[’‘"]' , '', x))
df['article_text'] = df['article_text'].apply(lambda x: re.sub(r'[’‘"]', '', x))
df['full_article_text'] = df['full_article_text'].apply(lambda x: re.sub(r'[’‘"]', '', x))
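
To see what the two substitutions above do, here is a minimal sketch on a made-up string. The sample text and the helper name `strip_chars` are illustrative assumptions, not part of the dataset or the notebook. Two side effects are visible: deleting `\n` without inserting a space joins the neighbouring sentences, and deleting `’` also removes the apostrophe in contractions.

```python
import re

# Same two character classes as in the cell above.
WHITESPACE_CHARS = re.compile(r'[\n\t\r]')
QUOTE_CHARS = re.compile(r'[’‘"]')

def strip_chars(text):
    """Hypothetical helper: apply both removals in the same order as the notebook."""
    text = WHITESPACE_CHARS.sub('', text)
    return QUOTE_CHARS.sub('', text)

# Made-up sample: straight double quotes, a newline, and a curly apostrophe.
sample = 'He said "no".\nWe don’t agree.'
cleaned = strip_chars(sample)
print(cleaned)  # He said no.We dont agree.
```

Whether the loss of the apostrophe changes VADER scores for contractions is not checked here.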



In [140]:
# drop zero-length articles

#how many empty articles?
print("Number of empty articles: ", len(df[df['article_text'].str.len() == 0]))
print("Number of empty articles: ", len(df[df['full_article_text'].str.len() == 0]))
print("Article count pre drop: ", df.shape[0])

df = df[df['article_text'].str.len() > 0].copy()  # .copy() avoids SettingWithCopyWarning when columns are added later

print("Article count post drop: ",  df.shape[0])


Number of empty articles:  9
Number of empty articles:  9
Article count pre drop:  37044
Article count post drop:  37035


We notice that some sentences are missing a space between them. Without the space, the sentences are run together and the tokenizer treats them as one sentence rather than two. This seems to happen where there should be new paragraphs (perhaps a paragraph identifier in the raw scraped data was replaced with "" rather than " ").

We use regex to find these sentences and add a space to them.
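
The effect on sentence splitting can be sketched on a made-up string. The splitter below is a naive stand-in for a real sentence tokenizer, and both the sample text and the name `naive_sentences` are assumptions for illustration only.

```python
import re

def naive_sentences(text):
    """Naive stand-in for a sentence tokenizer: split on whitespace after . ? or !"""
    return re.split(r'(?<=[.?!])\s+', text)

# Zero-width match: a lowercase letter plus end punctuation behind, a capital letter ahead.
gap = re.compile(r'(?<=[a-z][?.!])(?=[A-Z])')

sample = 'The bill passed.Congress adjourned. Members went home.'
fixed = gap.sub(' ', sample)  # insert a space at each matched position

print(naive_sentences(sample))  # two items: the first two sentences stay fused
print(naive_sentences(fixed))   # three items
```

Because the pattern matches a position rather than characters, `sub` only inserts a space and never removes text.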

In [141]:
# looking at text to investigate the issue

test_num = 3

print("TITLE: ",df['title'][test_num])
print("TEXT 512: ", df['article_text'][test_num])
print("TEXT Full: ", df['full_article_text'][test_num])

TITLE:  Trump slams relief bill, calls on Congress to increase stimulus money
TEXT 512:  President Trump on Tuesday evening blasted Congress over the already-passed COVID-19 relief package and called on both chambers to send him a new bill increasing stimulus checks from $600 to $2,000.The president expressed dismay with the $2.3 trillion package that Congress passed Monday, which includes $900 billion in coronavirus relief and $1.4 trillion to fund the government until October, conflating the two bills and saying the spending goals were misguided.“A few months ago, Congress started negotiation


In [142]:
# Look for pattern: lowercase then ?.!, directly followed by a capital letter
missing_space_regex = re.compile(r'(?<=[a-z][?.!])(?=[A-Z])') #(?![\d\s])
# Look for pattern: lowercase then ?.!, directly followed by a quote and then a capital letter
missing_space_regex_quote = re.compile(r'(?<=[a-z][?.!])(?=["“][A-Z])')

# With quotes, it is hard to decide programmatically which sentence a quote belongs to,
# e.g.: ... a “big win” for U.S. special forces.“Big win for our very elite ...
# Since the missing-space problem seems to occur at paragraph breaks,
# we assume the space should be added before the opening quote.

def add_missing_spaces(text):
    return missing_space_regex.sub(' ', text)
    
def add_missing_spaces_quote(text): 
    return missing_space_regex_quote.sub(' ',text)


#quotetest = 'Test."Big'
#print(add_missing_spaces_quote(quotetest))

# Apply the function to the 'text' columns
df['corrected_article_text'] = df['article_text'].apply(add_missing_spaces)
df['corrected_article_text'] = df['corrected_article_text'].apply(add_missing_spaces_quote)

df['corrected_article_text_full'] = df['full_article_text'].apply(add_missing_spaces)
df['corrected_article_text_full'] = df['corrected_article_text_full'].apply(add_missing_spaces_quote)


In [143]:
#confirm it looks right now

test_num = 3

print("TEXT 512 original: ", df['article_text'][test_num])
print("TEXT 512 corrected: ", df['corrected_article_text'][test_num])
print("TEXT Full corrected: ", df['corrected_article_text_full'][test_num])

TEXT 512 original:  President Trump on Tuesday evening blasted Congress over the already-passed COVID-19 relief package and called on both chambers to send him a new bill increasing stimulus checks from $600 to $2,000.The president expressed dismay with the $2.3 trillion package that Congress passed Monday, which includes $900 billion in coronavirus relief and $1.4 trillion to fund the government until October, conflating the two bills and saying the spending goals were misguided.“A few months ago, Congress started negotiation
TEXT 512 corrected:  President Trump on Tuesday evening blasted Congress over the already-passed COVID-19 relief package and called on both chambers to send him a new bill increasing stimulus checks from $600 to $2,000.The president expressed dismay with the $2.3 trillion package that Congress passed Monday, which includes $900 billion in coronavirus relief and $1.4 trillion to fund the government until October, conflating the two bills and saying the spending go
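
In the corrected output above, `$2,000.The` is still joined: the character before the period is a digit, which `[a-z]` in the lookbehind does not match. One possible extension, offered as an assumption rather than the method used in this notebook, is to accept a digit in the lookbehind as well. The name `missing_space_regex_digits` is hypothetical, and the variant has not been checked against the full corpus.

```python
import re

# Pattern from the notebook: only a lowercase letter may precede the punctuation.
missing_space_regex = re.compile(r'(?<=[a-z][?.!])(?=[A-Z])')
# Hypothetical variant: also accept a digit before the punctuation.
missing_space_regex_digits = re.compile(r'(?<=[a-z0-9][?.!])(?=[A-Z])')

sample = 'checks from $600 to $2,000.The president expressed dismay.'

original = missing_space_regex.sub(' ', sample)
extended = missing_space_regex_digits.sub(' ', sample)

print(original)  # unchanged: the period follows a digit
print(extended)  # a space is inserted before "The"
```

The variant would also insert a space in tokens such as `v2.X`, where a digit and a period are followed by a capital letter, so it would need spot-checking before being applied to the articles.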

In [144]:
#df = df[0:10]

# replace the orig cols with corrected
df['article_text'] = df['corrected_article_text']
df['full_article_text'] = df['corrected_article_text_full']


# concatenate shortened text and title
# this may not be that interesting for sentiment analysis
# not reshortening here
df['title_text'] = (df['title'] + '. ' +  df['article_text'])#.apply(lambda x: x[:512])

In [145]:
df.columns


Index(['uuid', 'source', 'year', 'article_text', 'title', 'full_article_text',
       'corrected_article_text', 'corrected_article_text_full', 'title_text'],
      dtype='object')

In [146]:
df = df.drop(['corrected_article_text','corrected_article_text_full'], axis=1)


In [147]:
print(df.columns)
print(len(df))


Index(['uuid', 'source', 'year', 'article_text', 'title', 'full_article_text',
       'title_text'],
      dtype='object')
37035


In [148]:
# write the cleaned dataset for sentiment analysis
df.to_csv('../data/2020articles_cleaned_forsentiment.csv')