# SENTIMENT ANALYSIS 


### Text Mining on Earnings Calls during a Pandemic as a Means to Predict End-Of-The-Month Stock Performances
####  Olin School of Business <br> Jose Luis Rodriguez  <br> jlr@wustl.edu <br> Fall 2021

In [2]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import metrics
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/jlroo/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [3]:
%%capture
# Define the text normalization function (normalize_corpus, used below)
%run ./'02-Normalization.ipynb'

[nltk_data] Downloading package stopwords to /Users/jlroo/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/jlroo/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/jlroo/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /Users/jlroo/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## Lexicon-Based Sentiment Analysis (Unsupervised Machine Learning)

### VADER Lexicon

We will use the **VADER lexicon**, available through the NLTK module. VADER stands for **Valence Aware Dictionary and sEntiment Reasoner**. The lexicon encodes both the **polarity** (positive/negative) and the **intensity** of the sentiment.

You can read about how VADER was created here (it's an engaging and accessible read): http://comp.social.gatech.edu/papers/icwsm14.vader.hutto.pdf.
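To make the polarity/intensity distinction concrete, here is a minimal sketch with a toy lexicon. The words and ratings are made up for illustration (they are not read from the real VADER lexicon, which rates entries on a scale from -4 to +4), and `describe` is a hypothetical helper.

```python
# Illustrative valence ratings on a -4..+4 scale (made-up values, not real VADER entries).
toy_lexicon = {'good': 1.9, 'great': 3.1, 'weak': -1.9, 'terrible': -3.1}

def describe(word):
    """Return (polarity, intensity) for a word in the toy lexicon."""
    valence = toy_lexicon[word]
    polarity = 'positive' if valence > 0 else 'negative'  # sign gives the polarity
    intensity = abs(valence)                               # magnitude gives the intensity
    return polarity, intensity

for word in toy_lexicon:
    print(word, describe(word))
```

In this toy setup "good" and "great" share a polarity but differ in intensity, which is the extra information a valence lexicon carries over a plain positive/negative word list.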


In [4]:
analyzer = SentimentIntensityAnalyzer()

In [5]:
def analyze_sentiment_vader_lexicon(review, threshold=0.1, verbose=False):
    # 'compound' is VADER's normalized overall score, between -1 and 1
    scores = analyzer.polarity_scores(review)
    # Binary label: scores below the threshold are labeled negative
    binary_sentiment = 'positive' if scores['compound'] >= threshold else 'negative'
    if verbose:
        print('VADER Polarity (Binary):', binary_sentiment)
        print('VADER Score:', round(scores['compound'], 2))
    return binary_sentiment, scores['compound']
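
The threshold decides where the binary split falls, so the same compound score can change label between the default of 0.1 and the 0.2 used later in this notebook. The sketch below isolates just that thresholding step; `label_from_compound` is a hypothetical stand-in that takes a compound score directly instead of calling the analyzer, and the scores are invented.

```python
def label_from_compound(compound, threshold=0.1):
    """Apply the same binary rule as above to an already-computed compound score."""
    return 'positive' if compound >= threshold else 'negative'

# Invented compound scores, including neutral (0.0) and borderline (0.15) cases.
for compound in (-0.6, 0.0, 0.15, 0.9):
    print(compound,
          label_from_compound(compound, threshold=0.1),
          label_from_compound(compound, threshold=0.2))
```

Note that a neutral score of 0.0 is labeled negative under either threshold, because the rule has only two classes.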

In [6]:
hrl_df = pd.read_csv('data/hrl_mrk21.csv')
hrl_df['date_market'] = pd.to_datetime(hrl_df['date_market'])
# Month key formatted as 'MM-2021' (the year is hard-coded because the data covers 2021)
hrl_df['market_month'] = hrl_df['date_market'].apply(lambda i: str(i.month).zfill(2) + '-2021')

In [7]:
month_sent = []
threshold = 0.2
for month in hrl_df['market_month'].unique():
    month_data = hrl_df[hrl_df['market_month'] == month]
    # Collect every non-empty line from all transcripts dated in this month
    cps = []
    for text in month_data['corpus']:
        cps.extend([i.strip() for i in text.split('\n') if i.strip() != ''])
    # Normalize the lines and pool them into a single document per month
    corpus = ' '.join(normalize_corpus(cps))
    sent = analyze_sentiment_vader_lexicon(corpus,
                                           threshold=threshold,
                                           verbose=False)
    month_sent.append(sent)

In [8]:
VADER_polarity_df = pd.DataFrame(month_sent, columns = ['VADER Polarity','VADER Score'])
VADER_polarity_df

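|   | VADER Polarity | VADER Score |
|---|----------------|-------------|
| 0 | positive | 1.0 |
| 1 | positive | 1.0 |
| 2 | positive | 1.0 |
| 3 | positive | 1.0 |
| 4 | positive | 1.0 |
| 5 | positive | 1.0 |
| 6 | positive | 1.0 |
| 7 | positive | 1.0 |
| 8 | positive | 1.0 |
| 9 | positive | 1.0 |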
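
Every month above scores 1.0. A likely reason is that VADER's compound score squashes the summed word valences with `x / sqrt(x^2 + alpha)` (with `alpha = 15` in the NLTK implementation), so a long pooled document saturates near +1 or -1. The sketch below reproduces only that normalization step in plain Python. The valence sums are invented, and the real analyzer also applies its own heuristics (negation, punctuation, capitalization), so treat this as an illustration under those assumptions.

```python
import math

def normalize(score, alpha=15.0):
    """Squash a raw valence sum into (-1, 1), mirroring VADER's compound normalization."""
    return score / math.sqrt(score * score + alpha)

# Invented per-sentence valence sums, repeated to mimic a long monthly corpus.
sentence_sums = [1.5, -0.5, 2.0, 0.0, 1.0] * 40

# Pooling everything into one document: the large total saturates the score.
pooled = normalize(sum(sentence_sums))

# Scoring each sentence separately and averaging keeps the result away from the ceiling.
averaged = sum(normalize(s) for s in sentence_sums) / len(sentence_sums)

print('pooled compound:  ', round(pooled, 2))
print('averaged compound:', round(averaged, 2))
```

If the monthly scores need to discriminate between months, scoring sentences (or paragraphs) individually and averaging the compound values is one option worth trying; whether it helps on this data would need to be checked.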


In [9]:
month_df = pd.DataFrame({'count': hrl_df.groupby(['market_month', 'direction'])['direction'].count()}).reset_index()
neg = month_df[month_df['direction'] == 'negative'].reset_index(drop=True)
pos = month_df[month_df['direction'] == 'positive'].reset_index(drop=True)
# Net change per month = positive days minus negative days
# (rows align by position, so every month must have both directions)
net_direction = pos.copy()
net_direction['net_change'] = pos['count'] - neg['count']
net_direction['direction'] = net_direction['net_change'].apply(lambda i: 'negative' if i < 0 else 'positive')
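
The cell above pairs positive and negative counts by row position, which only works if every month contains both directions. As a hedged alternative, the sketch below computes the same net direction with `unstack(fill_value=0)`, so a month with no negative (or no positive) days still gets a row. The `toy` frame is invented data that reuses the notebook's `market_month` and `direction` column names.

```python
import pandas as pd

# Invented daily data: one row per trading day with its price direction.
toy = pd.DataFrame({
    'market_month': ['01-2021', '01-2021', '01-2021', '02-2021', '02-2021', '03-2021'],
    'direction':    ['positive', 'positive', 'negative', 'negative', 'negative', 'positive'],
})

# Count days per (month, direction); missing combinations become 0 instead of NaN.
counts = (toy.groupby(['market_month', 'direction'])
             .size()
             .unstack(fill_value=0)
             .reindex(columns=['negative', 'positive'], fill_value=0))

# Net change per month and the resulting direction label.
net = (counts['positive'] - counts['negative']).rename('net_change').reset_index()
net['direction'] = net['net_change'].apply(lambda i: 'negative' if i < 0 else 'positive')
print(net)
```

Because the counts are aligned on the month label rather than on row position, the subtraction stays correct even when a month is one-sided.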