# Sentiment Analysis of Financial News Headlines with Market Comparison

## Cardiff University School of Computer Science and Informatics - Final Year Project

The growth of the internet and the digital economy, along with technical advances in computer and data science, has supported a wave of alternative data sources that can be used to measure and predict the financial markets. One of these non-traditional metrics is public opinion mining, commonly referred to as sentiment analysis. This study investigates the hypothesis that the sentiment of financial news headlines both reflects and directs the performance of the U.S. stock market. It does so by testing for a significant correlation between the polarity of the sentiment and the change in price of a security, which would count as evidence against the controversial ‘efficient market hypothesis’. To evaluate the public's sentiment, a large dataset of financial news headlines covering a broad period is required. A natural language processing and machine learning classification model is then built to predict the sentiment polarity of each headline. Finally, statistical analysis is conducted on the data to test for any significant correlation within the results. The study supports the hypothesis to an extent, showing that the sentiment of financial news headlines relating to the overall U.S. market directly reflects the price of the U.S. market index. However, no correlation could be found between the price of an individual stock and the sentiment of the headlines directly relating to it. Additionally, there was no evidence to suggest that the daily sentiment of a security had any influence over its price change on the subsequent day.

In [None]:
"""Collect Headline Data"""

import pandas

cnbcData = pandas.read_csv('Data/Original/cnbc_headlines.csv')
guardianData = pandas.read_csv('Data/Original/guardian_headlines.csv')
reutersData = pandas.read_csv('Data/Original/reuters_headlines.csv')


In [None]:
"""Collect and Clean Market Data"""

import yfinance

def getMarketData(ticker):
    """
    Takes a stock market ticker (string) as a parameter and returns the close prices
    of that stock (list of float values) each trading day between 2017-12-22 and 2020-07-18.
    
    EXAMPLE: '^GSPC' -> [3962.18, 3965.72, 413.21, ...]
    
    ARGS:
        ticker (string) - A stock market ticker value. 
    
    RETURNS:
        securityData (pandas Series of float values) - Stock close price values, indexed by trading date, between 2017-12-22 and 2020-07-18.
    """
    securityData = yfinance.download(ticker, start='2017-12-22', end='2020-07-18')
    securityData = securityData['Close']
    return securityData


## Data Cleaning

In [None]:
"""Initial Clean Headline Data"""

from datetime import datetime, timedelta

def cleanData(dataSet):
    """
    Takes a news headline pandas dataframe as a parameter and removes any irrelevant
    data (for the scope of the project) that may be present, returning the cleaned data frame.
    
    ARGS:
        dataSet (pandas data frame) - Financial news headline data frame containing irrelevant data columns. 
    
    RETURNS:
        dataSet (pandas data frame) - Cleaned financial news headline data frame.
    """
    dataSet = dataSet.dropna()
    try:
        dataSet = dataSet.drop('Description', axis=1)
    except KeyError: #Raised when the data set has no 'Description' column
        print("Headline data set contains no description")
    dataSet = dataSet.drop_duplicates(subset=['Headlines'], keep='first')
    dataSet.reset_index(drop=True, inplace=True)
    return dataSet

def dateConversion(date):
    """
    Takes date values found within the CNBC data set as parameters and returns them converted to a format
    that can be manipulated to generate 'datetime64' values.
    
    ARGS: 
        date (string) - An incorrectly formatted string date value. 
    
    RETURNS:
        date (string) - A correctly formatted string date value.
    """
    date = date.replace("Sept", "Sep").replace("March", "Mar").replace("April", "Apr").replace("June", "Jun").replace("July", "Jul")
    if date[0].isspace():
        date = date.replace(" ", "0", 1)
        date = date.replace(",  ", ", 0", 1)
    return date

#CNBC

cnbcData = cleanData(cnbcData)
dateFormat = '%I:%M  %p ET %a, %d %b %Y'
dates = []
for item in cnbcData.iloc[:, 1].values:
    item = dateConversion(item)
    dates.append(datetime.strptime(item, dateFormat).strftime("%m-%d-%Y"))
#pandas.to_datetime is used because unit-less 'datetime64' casts are rejected by pandas 2
cnbcData["Time"] = pandas.to_datetime(dates, format="%m-%d-%Y")
cnbcData.rename(columns={"Time":"Date"}, inplace = True)

#Guardian

guardianData = cleanData(guardianData)
guardianData["Time"] = pandas.to_datetime(guardianData["Time"], errors = 'coerce', format="%d-%b-%y")
guardianData.rename(columns={"Time":"Date"}, inplace = True)

#Reuters

reutersData = cleanData(reutersData)
reutersData["Time"] = pandas.to_datetime(reutersData["Time"])
reutersData.rename(columns={"Time":"Date"}, inplace = True)
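
The CNBC conversion above can be illustrated in isolation. The following is a minimal sketch using only the standard library; the helper name `parse_cnbc_timestamp` and the sample timestamps are assumptions made for illustration, inferred from the `dateFormat` string rather than taken from the data set.

```python
from datetime import datetime

# Month spellings that %b (abbreviated month name) cannot parse, mapped to short forms.
MONTH_FIXES = {"Sept": "Sep", "March": "Mar", "April": "Apr", "June": "Jun", "July": "Jul"}
# A run of whitespace in a strptime format matches one or more whitespace characters.
CNBC_FORMAT = "%I:%M %p ET %a, %d %b %Y"

def parse_cnbc_timestamp(raw):
    """Parse a CNBC-style timestamp string into a datetime (hypothetical helper)."""
    cleaned = raw.strip()  # strptime does not skip leading whitespace
    for long_name, short_name in MONTH_FIXES.items():
        cleaned = cleaned.replace(long_name, short_name)
    return datetime.strptime(cleaned, CNBC_FORMAT)

parsed = parse_cnbc_timestamp(" 7:51  PM ET Fri, 17 July 2020")
print(parsed.strftime("%m-%d-%Y"))  # 07-17-2020
```

Stripping the leading space is an alternative to padding it with a zero; either way the string must start with the hour for the format to match.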


In [None]:
"""Combine Headline Data"""

dataSets = [cnbcData, guardianData, reutersData]
headlineData = pandas.concat(dataSets)
print("Pre Cleaning: ")
headlineData.info()
headlineData = headlineData.sort_values(by="Date")
headlineData = cleanData(headlineData)
print(" ")
print("Post Cleaning: ")
headlineData.info()
headlineData.to_csv('Data/all_headlines.csv')


## Exploratory Data Analysis 

In [None]:
"""Exploratory Data Analysis - Investigating Distribution"""

import matplotlib.pyplot as plt

fig, axs = plt.subplots(1, 2, figsize=(16, 2))

#Time Series Plot
spy = getMarketData('SPY')
axs[0].plot(spy, 'tab:red')
axs[0].set_title('S&P 500 Index Time Series')
axs[0].axes.get_xaxis().set_ticks([])
axs[0].set_xlabel('Date (December 2017 - July 2020)')
axs[0].set_ylabel('Price (USD)')

#Frequency Distribution Plot
headlineQuantity = headlineData["Headlines"].groupby([headlineData["Date"].dt.year, headlineData["Date"].dt.month]).count().tolist()
axs[1].bar(list(range(len(headlineQuantity))), headlineQuantity)
axs[1].set_title('Number Of Headlines Published Monthly')
axs[1].axes.get_xaxis().set_ticks([])
axs[1].set_xlabel('Month (December 2017 - July 2020)')
axs[1].set_ylabel('Headlines')


In [None]:
"""Exploratory Data Analysis - Vocabulary Representation"""

from wordcloud import WordCloud, STOPWORDS

fig, axs = plt.subplots(1, 3, figsize=(17, 6))

reutersDataString = " ".join(reutersData["Headlines"].to_list())
guardianDataString = " ".join(guardianData["Headlines"].to_list())
cnbcDataString = " ".join(cnbcData["Headlines"].to_list())
#Generating Word Clouds For Each Dataset
reutersWordcloud = WordCloud(stopwords=STOPWORDS, background_color='black', max_words=250).generate(reutersDataString)
guardianWordcloud = WordCloud(stopwords=STOPWORDS, background_color='black', max_words=250).generate(guardianDataString)
cnbcWordcloud = WordCloud(stopwords=STOPWORDS, background_color='black', max_words=250).generate(cnbcDataString)

#Word Cloud Plot
axs[0].imshow(reutersWordcloud, interpolation='bilinear')
axs[0].set_title('Reuters Word Cloud')
axs[0].axis("off")
axs[1].imshow(guardianWordcloud, interpolation='bilinear')
axs[1].set_title('Guardian Word Cloud')
axs[1].axis("off")
axs[2].imshow(cnbcWordcloud, interpolation='bilinear')
axs[2].set_title('CNBC Word Cloud')
axs[2].axis("off")


In [None]:
"""Exploratory Data Analysis - Word Type Frequency"""

import nltk, nltk.classify
from collections import Counter
import re

typeList = []
#Basic Pre-Processing
headlineString = " ".join(headlineData["Headlines"])
headlineString = re.sub(r'[^a-zA-Z]', " ", headlineString.lower())
tokenizedHeadlines = nltk.word_tokenize(str(headlineString))
wordTypeList = nltk.pos_tag(tokenizedHeadlines) #Word Type Identification
for wordType in wordTypeList:
    typeList.append(wordType[1])
wordsTypeFrequency = Counter(typeList) #Count Word Type Frequency
wordsTypeFrequency = {k: v for k, v in sorted(wordsTypeFrequency.items(), key=lambda item: item[1])}

label = list(wordsTypeFrequency.keys())
frequency = list(wordsTypeFrequency.values())
    
#Word Type Frequency Plot
plt.figure(figsize=(16, 2))
plt.bar(label, frequency)
plt.xlabel('Word Type')
plt.ylabel('Frequency')
plt.title('Word Type And Respective Frequency Of Occurrence In All Headlines')


## Data Pre-Processing 

In [None]:
"""Pre-Processing"""

from nltk.corpus import words, stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

#nltk.download()

def preProcessing(text):
    """
    Processes a raw text string (parameter) through various NLP pre-processing techniques
    and returns a reduced list of words suitable for producing a BoW model.
            
    EXAMPLE: '#024: Only the best stock can survive in this economy!' -> ['only', 'best', 'stock', 'survive', 'economy']
    
    ARGS:
        text (string) - The raw text headline under consideration.
    
    RETURNS:
        text (list of string values) - A list of words that passed the pre-processing criteria from the string passed in.
    """
    text = namedEntityRecognition(text)[0] #Named Entity Recognition
    text = re.sub(r'[^a-zA-Z]', " ", text.lower()).split() #Formatting Words
    text = list(set(text) - set(stopwords.words('english'))) #Stop Word Removal
    text = list(set(text) & set(words.words())) #Non-Word Removal 
    #Build a new list: reassigning the loop variable would leave 'text' unchanged
    text = [stemmer.stem(lemmatizer.lemmatize(word)) for word in text] #Lemmatization And Stemming
    return text
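
The formatting and stop word steps of `preProcessing` can be tried without the NLTK corpora. This is a minimal sketch under stated assumptions: `STOP_WORDS` is a deliberately tiny, hypothetical list (the notebook relies on NLTK's English stop words), and named entity recognition, non-word removal, lemmatization and stemming are left out.

```python
import re

# Hypothetical, deliberately small stop-word list used only for this illustration.
STOP_WORDS = {"the", "can", "in", "this", "a", "of", "and", "is"}

def simple_pre_process(text):
    """Lower-case, strip non-letters, drop stop words and de-duplicate, keeping order."""
    tokens = re.sub(r"[^a-zA-Z]", " ", text.lower()).split()
    seen = set()
    kept = []
    for token in tokens:
        if token not in STOP_WORDS and token not in seen:
            seen.add(token)
            kept.append(token)
    return kept

print(simple_pre_process("#024: Only the best stock can survive in this economy!"))
# ['only', 'best', 'stock', 'survive', 'economy']
```

Unlike the set operations used in the notebook, this version keeps the original word order, which makes the output easier to inspect by eye.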


In [None]:
"""Named Entity Recognition"""

import spacy
from spacy import displacy

NER = spacy.load("en_core_web_sm")

def namedEntityRecognition(text):
    """
    Locates and classifies specific named entities within unstructured text (parameter) and classifies
    them into pre-defined categories. In this custom function, company names being considered within the 
    scope of the problem are identified, removed, and stored.
            
    EXAMPLE: 'This is why I hate Walmart and Amazon!' -> 'This is why I hate COMPANY and COMPANY', ['WMT', 'AMZN']
    
    ARGS:
        text (string) - The raw text headline under consideration.
    
    RETURNS:
        text (string) - The raw text headline under consideration with named entities removed.
        securities (list of string values) - A list of the securities ticker strings identified within the headline. 
    """
    #Keys are lower case company names (the headline is lower-cased before matching); values are tickers
    selectedSecurities = {'s&p': 'SPY', 'google': 'GOOGL', 'amazon': 'AMZN', 'apple': 'AAPL',
                          'microsoft': 'MSFT','visa': 'V', 'johnson': 'JNJ', 'walmart': 'WMT',
                          'exxon': 'XOM', 'facebook': 'FB', 'tesla': 'TSLA'}
    securityNames = selectedSecurities.keys()
    text = text.lower().split() #Names are matched as whole, unstemmed tokens
    securities = []
    for security in securityNames:
        if security in text:
            text[text.index(security)] = 'COMPANY' #Company Name Removal
            securities.append(selectedSecurities[security]) #Identify for Analysis 
    if len(securities) == 0:
        securities.append('^GSPC')
    text = ' '.join(text)
    text = NER(text)
    text = ' '.join([t.text if not t.ent_type_ else t.ent_type_ for t in text]).lower()
    return text, securities
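
The company lookup step can be sketched separately from spaCy. The example below is an illustration only: `COMPANY_TICKERS` is a hypothetical subset of the mapping, and `tag_companies` is an assumed helper that matches words with a regular expression so that trailing punctuation does not prevent a match.

```python
import re

# Hypothetical subset of the company-to-ticker mapping, for illustration only.
COMPANY_TICKERS = {"walmart": "WMT", "amazon": "AMZN", "apple": "AAPL"}

def tag_companies(headline, fallback="^GSPC"):
    """Replace known company names with COMPANY and collect their tickers."""
    tickers = []

    def substitute(match):
        word = match.group(0)
        ticker = COMPANY_TICKERS.get(word.lower())
        if ticker is None:
            return word  # not a tracked company, leave the word untouched
        if ticker not in tickers:
            tickers.append(ticker)
        return "COMPANY"

    masked = re.sub(r"[A-Za-z&]+", substitute, headline)
    # Headlines naming no tracked company fall back to the market index.
    return masked, tickers or [fallback]

print(tag_companies("This is why I hate Walmart and Amazon!"))
# ('This is why I hate COMPANY and COMPANY!', ['WMT', 'AMZN'])
```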


In [None]:
"""Generate Word Set (Bag-Of-Words Representation)"""

import numpy 
import time

wordSet = set()
start = time.time()
for headline in headlineData["Headlines"].to_list():
    wordSet.update(preProcessing(headline)) #Add Word If Not Present In BoW
end = time.time()
wordSet = sorted(wordSet) #Sorted Unique Words, As numpy.union1d Produced
with open("Data/wordset.txt", "w") as textfile:
    for element in wordSet:
        textfile.write(element + "\n")

print("Runtime: " + str(round((end - start), 2)))


In [None]:
"""Dimensionality Reduction"""

headlineString = " ".join(headlineData["Headlines"])
headlineString = re.sub(r'[^a-zA-Z]', " ", headlineString.lower()).split()
print('Pre Pre-Processing Dimensionality: ' + str(len(set(headlineString))))

wordList = open('Data/wordset.txt', 'r')
wordSet = wordList.read().split()
print('Post Pre-Processing Dimensionality: ' + str(len(wordSet)))


In [None]:
"""Feature Extraction"""

def featureExtraction(words, document):
    """
    Extracts feature values from a pre-processed document by counting how often each word
    of the BoW model occurs in it, returning the result as a frequency dictionary.
            
    EXAMPLE: ['microsoft', 'a', 'employee', 'awful', 'finance'], ['a', 'microsoft', 'employee', 'provides', 'awful', 'tales', 'of', 'neglect'] 
    -> {'microsoft': 1, 'a': 1, 'employee': 1, 'awful': 1, 'finance': 0}
    
    ARGS:
        words (list of string values) - Words contained within the BoW model.
        document (list of string values) - The pre-processed (tokenised) headline under consideration.
    
    RETURNS:
        wordFrequency (dictionary) - The words and their corresponding frequency of occurrence in the headline.
    """
    wordFrequency = dict.fromkeys(words, 0)
    for word in document:
        if word in wordFrequency: #Dictionary lookup avoids scanning the word list
            wordFrequency[word] = document.count(word) #Frequency Of Occurrence
    return wordFrequency
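
The same bag-of-words counting can be written with `collections.Counter`. This is a minimal sketch on made-up tokens; `bag_of_words_vector` is a hypothetical name, not a function used elsewhere in the notebook.

```python
from collections import Counter

def bag_of_words_vector(vocabulary, tokens):
    """Count how often each vocabulary word occurs in a tokenised document."""
    counts = Counter(tokens)  # missing words count as 0
    return {word: counts[word] for word in vocabulary}

vocabulary = ["microsoft", "employee", "awful", "finance"]
tokens = ["microsoft", "employee", "provides", "awful", "tales", "awful"]
print(bag_of_words_vector(vocabulary, tokens))
# {'microsoft': 1, 'employee': 1, 'awful': 2, 'finance': 0}
```

Words outside the vocabulary ('provides', 'tales') are ignored, so every document maps to a vector of the same length.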


In [None]:
"""Annotating Training Data"""

import random

#Data Access
wordList = open('Data/wordset.txt', 'r')
wordSet = wordList.read().split()
headlineData = pandas.read_csv('Data/all_headlines.csv')

#Data Frame Construction (DataFrame.append was removed in pandas 2.0)
annotatedTrainingData = pandas.DataFrame(columns=['Document', 'Sentiment'] + wordSet)
#trainingDataCorpus = pandas.read_csv('Data/training_data.csv')

def annotateTrainingData():
    """
    This method allows a user to manually annotate randomly selected headlines from a dataset with their
    corresponding sentiment. This is utilised to construct a test-train data set as a .csv file.
            
    EXAMPLE:
    
    >>> 'This is why I love Google's financial strategy' - Enter the sentiment of this headline:
    >>>
    
    """
    #Instructions
    print("Evaluate the overall sentiment of each headline:")
    print("If positive enter 'positive', if negative enter 'negative' and if neutral enter 'neutral'.")
    print("Enter 'END' at any point to stop the process!")
    entry = ""
    while entry != "END" and len(headlineData) > 0:
        headlineIndex = random.choice(headlineData.index.to_list()) #Random Headline Selection
        headline = namedEntityRecognition(headlineData.loc[headlineIndex, "Headlines"])[0]
        entry = input(headline + " - Enter the sentiment of this headline: ") #Labeling
        if entry != 'END':
            #Process Dataset For Machine Learning
            sentiment = entry
            features = list(featureExtraction(wordSet, preProcessing(headline)).values())
            #Add Entry To Data Frame
            annotatedTrainingData.loc[len(annotatedTrainingData)] = [headline, sentiment] + features
            annotatedTrainingData.to_csv('Data/Training/annotated_training_data.csv')
            #Remove The Labelled Headline So It Cannot Be Selected Again
            headlineData.drop(headlineIndex, axis=0, inplace=True)
            headlineData.to_csv('Data/all_headlines.csv')

#annotateTrainingData()

In [None]:
"""External Training Data (Document-Term Matrix)"""

#Data Access
externalData = pandas.read_csv('Data/Training/external_data.csv', encoding = 'unicode_escape', engine ='python')
externalData = cleanData(externalData)
wordList = open('Data/wordset.txt', 'r')
wordSet = wordList.read().split()

#Data Frame Construction (DataFrame.append was removed in pandas 2.0)
externalTrainingData = pandas.DataFrame(columns=['Document', 'Sentiment'] + wordSet)

headlines = externalData["Headlines"].to_list()
sentiment = externalData["Sentiment"].to_list()

#Process Dataset For Machine Learning
for row in range(len(headlines)): #Start At 0 So The First Headline Is Included
    headlines[row] = namedEntityRecognition(headlines[row])[0] 
    features = list(featureExtraction(wordSet, preProcessing(headlines[row])).values())
    entry = [headlines[row], sentiment[row]]
    for feature in features:
        entry.append(feature)
    #Add Entry To Data Frame
    externalTrainingData.loc[len(externalTrainingData)] = entry
externalTrainingData.to_csv('Data/Training/external_training_data.csv')


In [None]:
"""Combine Training Data""" #Redundant With No Annotation 

annotatedTrainingData = pandas.read_csv('Data/Training/annotated_training_data.csv')
externalTrainingData = pandas.read_csv('Data/Training/external_training_data.csv')

labledData = pandas.concat([annotatedTrainingData, externalTrainingData])
labledData = labledData.sample(frac = 1) #Shuffle Data
labledData.to_csv('Data/Training/all_labled_data.csv')


In [None]:
"""Feeding""" 

labledData = pandas.read_csv('Data/Training/all_labled_data.csv')

#Classifier Training Format
structuredLabledData = []
for index, row in labledData.iterrows():
    features = {}
    sentiment = row['Sentiment']
    for feature in wordSet:
        features[feature] = row[feature]
    structuredLabledData.append((features, sentiment))
    

In [None]:
"""Test-Train Split"""

#80-20 Split (first 80% for training, remaining 20% for testing)
splitIndex = int(len(structuredLabledData) * 0.8)
trainingData, testingData = structuredLabledData[:splitIndex], structuredLabledData[splitIndex:]
xTrain, yTrain, xTest, yTest = [], [], [], []
for observation in trainingData:
    xTrain.append(observation[0])
    yTrain.append(observation[1])
for observation in testingData:
    xTest.append(observation[0])
    yTest.append(observation[1])
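
An 80-20 split should leave the two parts disjoint and together cover every observation. The sketch below shows one way to check that with the standard library; `split_observations` and its `seed` parameter are assumptions for illustration (the notebook shuffles the labelled data earlier, with `sample(frac = 1)`).

```python
import random

def split_observations(observations, train_fraction=0.8, seed=0):
    """Shuffle a copy of the observations and split it into train and test parts."""
    shuffled = list(observations)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    split_index = int(len(shuffled) * train_fraction)
    return shuffled[:split_index], shuffled[split_index:]

train, test = split_observations(range(10))
print(len(train), len(test))  # 8 2
```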


In [None]:
"""Standardisation"""

from sklearn.preprocessing import StandardScaler 

scaler = StandardScaler()

numericFeatureList = []
for features in xTrain:
    numericFeatures = list(features.values()) #Use Each Observation, Not Only The First
    numericFeatureList.append(numericFeatures)

#Standardize Data 
scaler.fit(numericFeatureList)
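
Fitting a scaler only learns each feature's mean and standard deviation; the features change once the transform is applied. The following sketch reproduces that arithmetic by hand on toy data. `standardise_columns` is a hypothetical helper, and it assumes the population standard deviation with constant columns left unscaled, which is how `StandardScaler` behaves.

```python
from statistics import mean, pstdev

def standardise_columns(rows):
    """Return rows with every column shifted to mean 0 and scaled to unit variance."""
    columns = list(zip(*rows))
    means = [mean(column) for column in columns]
    # A constant column has zero spread; divide by 1.0 so its values become 0.
    deviations = [pstdev(column) or 1.0 for column in columns]
    return [[(value - m) / d for value, m, d in zip(row, means, deviations)]
            for row in rows]

scaled = standardise_columns([[0, 1], [2, 1], [4, 1]])
print([round(row[0], 3) for row in scaled])  # [-1.225, 0.0, 1.225]
```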


## Model Evaluation

In [None]:
"""Classifier Testing"""

import sklearn
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, ConfusionMatrixDisplay

def evaluationMeasures(classifier, testData):
    """
    Takes a classification model and testing data as parameters and produces a series of
    evaluation metrics and plots to determine the ability of the model.
    
    ARGS:
        classifier (function) - Machine learning classification model.
        testData (data frame array) - Headlines and their corresponding sentiment. 
    
    EXAMPLE:
        
        Accuracy: 0.81
        Recall: 0.74
        Precision: 0.62
        F1 Score: 0.77
        
    """
    #Confusion Matrix
    testFeatures = [feature for (feature, label) in testData]
    testLabels = [label for (feature, label) in testData]
    testPredicted = [classifier.classify(feature) for feature in testFeatures]
    confusionMatrix = sklearn.metrics.confusion_matrix(testLabels, testPredicted)
    display = ConfusionMatrixDisplay(confusion_matrix = confusionMatrix)
    #Alternate Metrics (scikit-learn expects the true labels first, then the predictions)
    accuracy = accuracy_score(testLabels, testPredicted) #Accuracy
    recall = recall_score(testLabels, testPredicted, average=None) #Recall
    precision = precision_score(testLabels, testPredicted, average=None) #Precision
    f1Score = f1_score(testLabels, testPredicted, average=None) #F1 Score
    #Output
    display.plot()
    print(" ")
    print("Accuracy: " + str(round(accuracy, 2)))
    print("Recall: " + str(round(recall[0], 2)))
    print("Precision: " + str(round(precision[0], 2)))
    print("F1 Score: " + str(round(f1Score[0], 2)))
    print(" ")
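
Precision and recall are not symmetric in the true and predicted labels, so the argument order matters. A minimal sketch computing them by hand for one class on made-up labels; `per_class_scores` is a hypothetical helper, not part of scikit-learn.

```python
def per_class_scores(true_labels, predicted_labels, target):
    """Return (precision, recall, f1) for a single target class."""
    pairs = list(zip(true_labels, predicted_labels))
    true_positive = sum(1 for t, p in pairs if t == target and p == target)
    false_positive = sum(1 for t, p in pairs if t != target and p == target)
    false_negative = sum(1 for t, p in pairs if t == target and p != target)
    predicted = true_positive + false_positive
    actual = true_positive + false_negative
    precision = true_positive / predicted if predicted else 0.0
    recall = true_positive / actual if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

truth = ["positive", "positive", "positive", "negative"]
predicted = ["positive", "negative", "negative", "negative"]
print(per_class_scores(truth, predicted, "positive"))
```

Here every headline predicted positive really is positive (precision 1.0), but only one of the three positive headlines is found (recall 1/3). Swapping the two lists swaps the two figures.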
    

In [None]:
"""Testing Baseline"""

from sklearn.dummy import DummyClassifier

baselineClassifier = DummyClassifier(strategy="stratified")
baselineClassifier.fit(xTrain, yTrain)
print(" ")
print("Stratified Classification Baseline: ")
print("Accuracy: " + str(round(baselineClassifier.score(xTest, yTest), 2)))
print(" ")


In [None]:
"""Naive Bayes Evaluation (No Cross Validation)"""

nbClassifier = nltk.NaiveBayesClassifier.train(trainingData)
print("Naive Bayes Classifier: ")
evaluationMeasures(nbClassifier, testingData)


In [None]:
"""Informative Features"""

nbClassifier.show_most_informative_features(10)


In [None]:
"""Cross Validation"""

import sklearn.model_selection

def crossValidation(classifier, trainingData):
    """
    Performs cross validation on a classifier and training data (parameters) and returns the trained classifier.
    
    ARGS:
        classifier (function) - Machine learning classification model.
        trainingData (data frame array) - Headlines and their corresponding sentiment. 
    
    RETURNS:
        classifier (function) - Trained machine learning classification model.
    """
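    KFoldCV = sklearn.model_selection.KFold(n_splits=10) #10 Folds
    KFoldAccuracy = []
    for trainIndex, testIndex in KFoldCV.split(trainingData): #Fold Testing
        #Index by position: a first-to-last slice would also include the held-out fold
        trainFold = [trainingData[index] for index in trainIndex]
        testFold = [trainingData[index] for index in testIndex]
        classifier = classifier.train(trainFold)
        KFoldAccuracy.append(nltk.classify.accuracy(classifier, testFold))
    print("Mean Fold Accuracy: " + str(round(numpy.mean(KFoldAccuracy), 2)))
    return classifier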
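
In k-fold cross validation the training indices of a middle fold are not contiguous, because the held-out fold sits between them. The sketch below generates unshuffled folds with the standard library to make this visible; `k_fold_indices` is a hypothetical stand-in for the index pairs produced by `KFold.split`.

```python
def k_fold_indices(n_samples, n_splits):
    """Yield (train_indices, test_indices) for consecutive, unshuffled folds."""
    # The first n_samples % n_splits folds receive one extra sample.
    fold_sizes = [n_samples // n_splits + (1 if fold < n_samples % n_splits else 0)
                  for fold in range(n_splits)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

for train, test in k_fold_indices(10, 3):
    print(train, test)
```

For the second fold the training indices are `[0, 1, 2, 3, 7, 8, 9]`, so slicing from the first to the last training index would sweep the test samples 4 to 6 back into training.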


In [None]:
"""Cross Validation Impact"""

print("Evaluation Measures After Cross Validation (Naive Bayes): ")
nbClassifier = crossValidation(nbClassifier, trainingData) #Held-out test data is excluded from training
print(" ")
evaluationMeasures(nbClassifier, testingData)


In [None]:
"""Support Vector Machine Evaluation"""

from sklearn.svm import SVC
from nltk.classify.scikitlearn import SklearnClassifier

svmClassifier = nltk.classify.SklearnClassifier(SVC())
svmClassifier = crossValidation(svmClassifier, trainingData) #Held-out test data is excluded from training
print("Support Vector Machine Classifier: ")
evaluationMeasures(svmClassifier, testingData)


In [None]:
"""Logistic Regression Evaluation"""

from sklearn.linear_model import LogisticRegression

lrClassifier = SklearnClassifier(LogisticRegression(max_iter=500))
lrClassifier = crossValidation(lrClassifier, trainingData) #Held-out test data is excluded from training
print("Logistic Regression Classifier: ")
evaluationMeasures(lrClassifier, testingData)


In [None]:
"""Calculate Learning Curve"""

from sklearn.model_selection import learning_curve

svmEstimator = SVC()

trainSizes, trainScores, testScores, fitTimes, scoreTimes = learning_curve(svmEstimator, numericFeatureList, yTrain, cv=10, return_times=True)


In [None]:
"""Plot Learning Curve"""

fig, axs = plt.subplots(1, 3, figsize=(18, 5))

axs[0].plot(trainSizes, numpy.mean(trainScores, axis=1), 'tab:blue', label = 'Train')
axs[0].plot(trainSizes, numpy.mean(testScores, axis=1), 'tab:red', label = 'Validation')
axs[0].set_xlabel('Experience')
axs[0].set_ylabel('Score')
axs[0].set_title('SVM Learning Curve')
axs[0].legend(loc='lower right')
axs[0].grid()

axs[1].plot(trainSizes, numpy.mean(fitTimes, axis=1))
axs[1].set_xlabel('Experience')
axs[1].set_ylabel('Fit Time')
axs[1].set_title('SVM Model Scalability')
axs[1].grid()

axs[2].plot(numpy.mean(fitTimes, axis=1), numpy.mean(testScores, axis=1))
axs[2].set_xlabel('Fit Time')
axs[2].set_ylabel('Score')
axs[2].set_title('SVM Model Performance')
axs[2].grid()


In [None]:
"""Optimising Model (Hyper Parameter Tuning - Kernel)"""

from sklearn.svm import LinearSVC

svmClassifierOptamised = nltk.classify.SklearnClassifier(LinearSVC())
svmClassifierOptamised = crossValidation(svmClassifierOptamised, trainingData) #Held-out test data is excluded from training
print("Support Vector Machine Classifier: ")
evaluationMeasures(svmClassifierOptamised, testingData)


## Model Implementation

In [None]:
"""Final Pipeline"""

#Data Access
wordList = open('Data/complete-wordset.txt', 'r')
wordSet = wordList.read().split()

def classificationPipeline(document):
    """
    Outlines the final classification pipeline process for an individual document.
            
    EXAMPLE: 'Apple's sales plummet in the latest quarter' -> ['Apple's sales plummet in the latest quarter', ['AAPL'], 'negative']
    
    ARGS:
        document (string) - A financial news headline.
    
    RETURNS:
        [headline, securities, sentiment] - A list containing the headline (string), securities identified within the
        headline (list of string values) and the classified sentiment (string).
    """
    headline = document
    document = preProcessing(document) #Pre-Processing
    securities = namedEntityRecognition(headline)[1] #NER
    features = (featureExtraction(wordSet, document)) #Feature Extraction
    sentiment = svmClassifierOptamised.classify(features) #Classification
    return [headline, securities, sentiment]


In [None]:
"""Sentiment Processing"""

#Data Access
unclassifiedData = pandas.read_csv('Data/all_headlines.csv')
unclassifiedData = cleanData(unclassifiedData)

#Data Frame Construction
classifiedData = pandas.DataFrame(data={'Document': [], 'Securities': [], 'Sentiment': [], 'Date': []})

headlines = unclassifiedData["Headlines"].to_list()

start = time.time()
for row in range(0, len(headlines)): #Headline Processing
    entry = classificationPipeline(headlines[row])
    entry.append(pandas.to_datetime(unclassifiedData["Date"][row]))
    classifiedData.loc[len(classifiedData)] = entry
classifiedData.to_csv('Data/classified_data.csv')
end = time.time()

print("Runtime: " + str(round((end - start), 2)) + 's')

## Result Analysis 

In [None]:
"""Sentiment Results - Distribution"""

#Data Access
classifiedData = pandas.read_csv('Data/classified_data.csv')

#Sentiment Frequency
sentiments = classifiedData["Sentiment"].to_list()
sentimentFrequency = Counter(sentiments)

fig, axs = plt.subplots(1, 2, figsize=(16, 2))

#Sentiment Frequency Plot
labels = sentimentFrequency.keys()
frequency = sentimentFrequency.values()
axs[0].bar(labels, frequency, color=['grey', 'red', 'green'])
axs[0].set_xlabel('Sentiment Polarity')
axs[0].set_ylabel('Sentiment Occurrence')
axs[0].set_title('Sentiment Occurrence From Classification')

classifiedData = classifiedData[classifiedData['Sentiment'] != 'neutral'] #Objectivity Filtering
sentiments = classifiedData["Sentiment"].to_list()
print("Number of Subjective Headlines: " + str(len(sentiments)))

#Security Frequency
securities = classifiedData["Securities"].to_list()
securities = ' '.join(securities)
securities = re.sub(r'[^a-zA-Z]', " ", securities).split()
securitiesFrequency = Counter(securities)
securitiesFrequency.pop('GSPC', None) #Exclude The Market Index Fallback (No Error If Absent)

#Security Frequency Plot
labels = securitiesFrequency.keys()
frequency = securitiesFrequency.values()
axs[1].bar(labels, frequency)
axs[1].set_xlabel('Security')
axs[1].set_ylabel('Security Occurrence')
axs[1].set_title('Security Occurrence From Classification')


In [None]:
"""Average Sentiment"""

def calculateAverageSentiment(sentiments):
    """
    Calculates and returns an average sentiment numeric value based on a list of
    sentiments passed in as a parameter.
            
    EXAMPLE: ['positive', 'negative', 'negative', 'positive', 'positive'] -> 0.6
    
    ARGS:
        sentiments (list of string values) - A list of the classified sentiments for individual securities. 
    
    RETURNS:
        average (float value) - The sentiment average numeric representation of the values passed in.
    """
    totalSum = 0
    for sentiment in sentiments:
        if sentiment == 'positive': #Value Assignment
            totalSum += 1
    average = round(totalSum/len(sentiments), 3) #Average Calculation
    return average
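
The average above is the share of positive headlines, so 1 means all positive and 0 means all negative. A minimal sketch of applying it per day, using made-up records; `daily_positive_share` is a hypothetical helper and assumes neutral headlines have already been filtered out.

```python
from collections import defaultdict

def daily_positive_share(records):
    """Map each date to the share of its headlines classified as positive.

    records: iterable of (date_string, sentiment) pairs.
    """
    by_day = defaultdict(list)
    for day, sentiment in records:
        by_day[day].append(sentiment)
    return {day: round(sum(s == "positive" for s in sentiments) / len(sentiments), 3)
            for day, sentiments in sorted(by_day.items())}

records = [("2020-01-02", "positive"), ("2020-01-02", "negative"),
           ("2020-01-03", "positive")]
print(daily_positive_share(records))
# {'2020-01-02': 0.5, '2020-01-03': 1.0}
```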


In [None]:
"""Data Normalization"""

def normalizeMarketData(marketData):
    """
    Takes a list of close price values and returns them normalized into a range between 0 and 1.
            
    EXAMPLE: [10, 6, 8.5, 7, 2, 3] -> [1.0, 0.5, 0.8125, 0.625, 0.0, 0.125]
    
    ARGS:
        marketData (list of float values) - A list of stock close price values. 
    
    RETURNS:
        normalizedPriceList (list of float values) - A list of stock close price values normalized between 0 and 1.
    """
    marketData = (marketData - numpy.min(marketData)) / (numpy.max(marketData) - numpy.min(marketData)) #Normalize
    normalizedPriceList = []
    for price in marketData: #Cleaning List
        normalizedPriceList.append(price)
    return normalizedPriceList
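
Min-max normalization maps the smallest value to 0 and the largest to 1 through (x - min) / (max - min). A worked sketch on a short price list; `min_max_normalise` is a hypothetical pure-Python counterpart, and returning zeros for a constant list is an assumption made here to avoid dividing by zero.

```python
def min_max_normalise(values):
    """Scale values linearly so the minimum maps to 0 and the maximum to 1."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        return [0.0 for _ in values]  # constant input: no spread to scale by
    return [(value - lowest) / (highest - lowest) for value in values]

print(min_max_normalise([10, 6, 8.5, 7, 2, 3]))
# [1.0, 0.5, 0.8125, 0.625, 0.0, 0.125]
```

With a minimum of 2 and a range of 8, the price 6 becomes (6 - 2) / 8 = 0.5.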


In [None]:
"""Time Series (Sentiment and Price)"""

def timeSeries(security):
    """
    Generates a Time Series Plot for an individual security and its headline sentiment value.
    
    ARGS:
        security (String) - Security Ticker.
    
    RETURNS:
        monthlySentiment (list of float values) - List of average monthly sentiment values for an individual security.
        monthlyPrice (list of float values) -  List of monthly normalized price values for an individual security.
    """
    marketData = getMarketData(security)
    normalizedPriceList = normalizeMarketData(marketData)
    monthlyPrice = normalizedPriceList[::21] #Monthly Close Price
    #Monthly Sentiment Results for Security
    securityDates = pandas.to_datetime(classifiedData["Date"]) #Group by the dates of the classified rows
    monthlySentimentData = classifiedData.loc[classifiedData['Securities'] == "['" + security + "']"]['Sentiment'].groupby([securityDates.dt.year, securityDates.dt.month])
    missingValues = len(monthlyPrice) - len(monthlySentimentData)
    monthlySentiment = []
    for sentiment in monthlySentimentData:
        sentimentList = sentiment[1].to_list()
        sentimentAverage = calculateAverageSentiment(sentimentList)
        monthlySentiment.append(sentimentAverage)
    if missingValues > 0: #Pad Months Without Headlines With A Neutral Value
        insertPlace = len(monthlyPrice) // missingValues
        for x in range(0, missingValues):
            monthlySentiment.insert(insertPlace, 0.5)
            insertPlace += insertPlace
    #Months with Insufficient Data
    monthlySentiment = monthlySentiment[2:]
    monthlyPrice = monthlyPrice[2:]
    return monthlySentiment, monthlyPrice
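
Months with no headlines for a security are given the neutral value 0.5. One alternative to inserting at computed positions is to key the averages by month and look each calendar month up explicitly. The sketch below shows that idea with hypothetical helpers (`month_range`, `fill_missing_months`) and made-up values; it is not the method used above.

```python
def month_range(start, end):
    """Return every (year, month) pair from start to end, inclusive."""
    year, month = start
    months = []
    while (year, month) <= end:
        months.append((year, month))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return months

def fill_missing_months(monthly_values, months, neutral=0.5):
    """Look up each month's value, using the neutral value where none exists."""
    return [monthly_values.get(month, neutral) for month in months]

months = month_range((2019, 11), (2020, 2))
print(fill_missing_months({(2019, 11): 0.7, (2020, 1): 0.4}, months))
# [0.7, 0.5, 0.4, 0.5]
```

Keying by month keeps each sentiment value aligned with its own calendar month, whichever months are missing.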
    

In [None]:
"""Scatter Plot (Sentiment Against Price Change)"""

def scatterPlot(security):
    """
    Generates a Scatter Plot for an individual security and its headline sentiment value.
    
    ARGS: 
        security (String) - Security Ticker.
    
    RETURNS:
        dailySentimentValues (list of float values) - List of average daily sentiment values for an individual security.
        dailyPriceChangeValues (list of float values) - List of daily normalized price change values for an individual security.
    """
    marketData = getMarketData(security) #Collect Data
    dailyPriceChange = numpy.diff(marketData) #Calculate Change
    dailyPriceChange = normalizeMarketData(dailyPriceChange) #Normalize Price Values
    #Daily Sentiment Results for Security
    dailySentimentData = classifiedData.loc[classifiedData['Securities'] == "['" + security + "']"]['Sentiment'].groupby(classifiedData["Date"])
    #Ignore non-trading days
    marketData = marketData.reset_index()
    tradingDates = marketData['Date'].to_list()
    formattedTradingDates = []
    for date in tradingDates:
        date = str(date.date())
        formattedTradingDates.append(date)
    dailySentiment = []
    for date in formattedTradingDates:
        dailySentiment.append('NA')
    for sentiment in dailySentimentData:
        if sentiment[0] in formattedTradingDates:
            sentimentList = sentiment[1].to_list()
            sentimentAverage = calculateAverageSentiment(sentimentList)
            dailySentiment[formattedTradingDates.index(sentiment[0])] = sentimentAverage
    dailySentimentValues, dailyPriceChangeValues = [], []
    for x in range(0, len(dailySentiment) - 1):
        if dailySentiment[x] != 'NA':
            dailySentimentValues.append(dailySentiment[x])
            dailyPriceChangeValues.append(dailyPriceChange[x])
    return dailySentimentValues, dailyPriceChangeValues
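
Because the price change at position x is the difference between day x + 1 and day x, each day's sentiment is paired with the move into the following trading day. A minimal sketch of that pairing on made-up numbers; `pair_sentiment_with_next_change` is a hypothetical helper that uses `None` where the code above uses 'NA'.

```python
def pair_sentiment_with_next_change(prices, sentiments):
    """Pair each day's sentiment with the price change to the next trading day.

    prices: close prices per trading day.
    sentiments: same length, None for days without headlines.
    """
    pairs = []
    for day in range(len(prices) - 1):  # the last day has no following change
        if sentiments[day] is not None:
            pairs.append((sentiments[day], prices[day + 1] - prices[day]))
    return pairs

print(pair_sentiment_with_next_change([100, 102, 101, 105], [0.8, None, 0.2, 1.0]))
# [(0.8, 2), (0.2, 4)]
```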
    

In [None]:
"""Visualize Results"""

from scipy import stats

def securityResults(security):
    """
    Generates an overview of the results for an individual security and its headline sentiment. 
    Visualisations include a time series plot and scatter plot. Metrics include a correlation 
    coefficient, test statistic and p-value.
    
    ARGS: 
        security (String) - Security Ticker.
    
    """
    fig, axs = plt.subplots(1, 2, figsize=(16, 2))
    #Compute each series once (every call downloads the market data again)
    monthlySentiment, monthlyPrice = timeSeries(security)
    dailySentiment, dailyPriceChange = scatterPlot(security)
    #Plot Time Series 
    reflection = stats.linregress(monthlySentiment[:28], monthlyPrice[:28])
    axs[0].plot(monthlySentiment, 'tab:blue', label='Headline Sentiment')
    axs[0].plot(monthlyPrice, 'tab:red', label=security + ' Price')
    axs[0].set_title('Monthly ' + security + ' Headline Sentiment and Market Price')
    axs[0].set_xlabel('(February 2018 - July 2020)')
    axs[0].legend(loc='upper left')
    #Plot Scatter Graph Change
    axs[1].scatter(dailySentiment, dailyPriceChange, c='black', marker='x')
    direction = stats.linregress(dailySentiment, dailyPriceChange)
    x1 = numpy.linspace(numpy.min(dailySentiment), numpy.max(dailySentiment), 500)
    y1 = direction.slope * x1 + direction.intercept
    axs[1].plot(x1, y1,'-r')
    axs[1].set_title('Daily ' + security + ' Headline Sentiment Against Market Price Change')
    axs[1].set_xlabel('Daily ' + security + ' Headline Sentiment')
    axs[1].set_ylabel('Daily ' + security + ' Price Change (Normalized)')
    #Statistical Analysis: rvalue is the correlation coefficient, slope / stderr is the
    #t-statistic of the zero-slope test and pvalue is its two-sided p-value
    print('Direction Correlation: ' + str(round(direction.rvalue, 2)))
    print('Direction T-Value: ' + str(round(direction.slope / direction.stderr, 2)))
    print('Direction P-Value: ' + str(round(direction.pvalue, 2)))
    print('Reflection Correlation: ' + str(round(reflection.rvalue, 2)))
    print('Reflection T-Value: ' + str(round(reflection.slope / reflection.stderr, 2)))
    print('Reflection P-Value: ' + str(round(reflection.pvalue, 2)))
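
The correlation coefficient and its test statistic can also be derived by hand, which helps when reading the figures printed above. The sketch below computes Pearson's r and the statistic t = r * sqrt((n - 2) / (1 - r^2)), which follows a t-distribution with n - 2 degrees of freedom under the null hypothesis of no correlation. The data and function names are made up for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(spread_x * spread_y)

def correlation_t_statistic(r, n):
    """t-statistic for testing whether a correlation r from n pairs differs from 0."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

sentiment = [1, 2, 3, 4, 5]     # hypothetical values
price_change = [2, 1, 4, 3, 5]  # hypothetical values
r = pearson_r(sentiment, price_change)
print(round(r, 2), round(correlation_t_statistic(r, len(sentiment)), 2))
# 0.8 2.31
```

A large r from only a handful of points can still fail to reach significance, which is why the coefficient and the p-value are reported together.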


In [None]:
"""S&P 500 Results"""

securityResults('^GSPC')


In [None]:
"""Apple Results"""

securityResults('AAPL')


In [None]:
"""Walmart Results"""

securityResults('WMT')


In [None]:
"""Amazon Results"""

securityResults('AMZN')


In [None]:
"""Google Results"""

securityResults('GOOGL')


In [None]:
"""Microsoft Results"""

securityResults('MSFT')


Author: Harvey Allen (1926159)