In [1]:
# test

# Intro

team name

Mention limitations from the given data

report location/background/info: Prepare a report containing your results from the analysis. It should contain the following: Intro, data cleaning/pre-processing, visualizations (at least 3), analysis/results, conclusion/future improvements.

In [170]:
# Imports
import pandas as pd
import numpy as np
import plotly.express as px
import string
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
import spacy
from spacy import displacy

In [42]:
# Remove automatic formatting with symbol usage (Ex: $ sign -> MathJax)
pd.options.display.html.use_mathjax=False

In [43]:
# Read training data
train_data = pd.read_csv('../data/advanced_trainset.csv')
train_data.head()

Unnamed: 0,Sentence,Sentiment
0,According to the Finnish-Russian Chamber of Co...,neutral
1,The Swedish buyout firm has sold its remaining...,neutral
2,$SPY wouldn't be surprised to see a green close,positive
3,Shell's $70 Billion BG Deal Meets Shareholder ...,negative
4,SSH COMMUNICATIONS SECURITY CORP STOCK EXCHANG...,negative


In [44]:
# Read testing data
test_data = pd.read_csv('../data/advanced_testset.csv')
test_data.head()

Unnamed: 0,Sentence
0,Earnings per share ( EPS ) dropped to EUR 0.21...
1,$SONC Amazing run since middle of March - obvi...
2,"Ruukki Romania , the local arm of Finnish meta..."
3,Self-service and automation are in a bigger ro...
4,Alma Media 's operating profit amounted to EUR...


In [48]:
# Read supplementary stock ticker data
stocks = pd.read_csv('../data/stock_tickers.csv')
stocks.head()

Unnamed: 0,Symbol,Name,Last Sale,Net Change,% Change,Market Cap,Country,IPO Year,Volume,Sector,Industry
0,A,Agilent Technologies Inc. Common Stock,$134.87,-1.06,-0.78%,40476290000.0,United States,1999.0,2070939,Capital Goods,Electrical Products
1,AA,Alcoa Corporation Common Stock,$84.15,-1.95,-2.265%,15519010000.0,,2016.0,4585478,Basic Industries,Metal Fabrications
2,AAC,Ares Acquisition Corporation Class A Ordinary ...,$9.83,0.02,0.204%,1228750000.0,,2021.0,186747,Finance,Business Services
3,AACG,ATA Creativity Global American Depositary Shares,$1.21,-0.06,-4.724%,37966070.0,China,,7154,Miscellaneous,Service to the Health Industry
4,AACI,Armada Acquisition Corp. I Common Stock,$9.9781,0.1181,1.198%,206641500.0,United States,2021.0,174251,Consumer Durables,Consumer Electronics/Appliances


In [171]:
# Set spacy NLP English pipeline
nlp = spacy.load('en_core_web_sm')

# EDA

Look through the dataset for things that catch your eye. What proportion of responses are negative, positive, and neutral? Do you see any imbalances in the data? What else do you find? Please provide charts and visualizations to support your claim.

In [25]:
sentiment_counts = train_data['Sentiment'].value_counts().to_frame().reset_index()
sentiment_counts

Unnamed: 0,index,Sentiment
0,neutral,2363
1,positive,1383
2,negative,636


In [26]:
fig = px.bar(sentiment_counts, x='index', y='Sentiment', \
             title="Sentiment Counts in Training Data", labels={'index':'Sentiment', 'Sentiment': 'Count'})
fig.show()

As we see from this bar chart, there is a significant imbalance in the number of observations we have for neutral, positive, and negative sentences. This will mean... TODO: HERE

## Subject of Sentences

Another point of interest is to identify the subject of the sentence. This gives us an idea of what the sentiment is directed towards. For example, if the sentence is "AAPL is popping off," we would want to identify the sentiment as well as what the sentiment is directed towards. This process is a combination of EDA and feature engineering, so we will include visualizations here and the actual data manipulation in the **Feature Engineering** section.

TODO: talk about tokenization here

In [178]:
sent = a.loc[0]['Sentence']
doc=nlp(sent)
displacy.render(doc, style="dep")

In [179]:
displacy.render(doc, style="ent")

With this visualization, we can see the breakdown of the sentence and determine the subjects as well as the relations between different words. However, as we can see in this example, the spacy NLP processing is not quite able to identify complex sentence tokens such as the "Finnish-Russian Chamber of Commerce." Thus, we will have to select multiple groups to identify as subjects...

## Negative Sentences

In [106]:
# Convert Sentence series of negative sentiment into a string for EDA purposes
neg_words = train_data[train_data['Sentiment'] == 'negative']['Sentence'].str.cat(sep=' ')
neg_words = neg_words.split(' ')

In [107]:
# Grab as many words as possible while ignoring numbers, or incorrectly formatted words
neg_words = [word.strip().lower() for word in neg_words if not any(c for c in word.strip() if c not in string.ascii_letters + "'")]

In [109]:
Counter(neg_words).most_common()[:10]

[('the', 524),
 ('in', 346),
 ('of', 314),
 ('to', 287),
 ('eur', 228),
 ('a', 185),
 ('mn', 164),
 ('from', 151),
 ('and', 149),
 ('for', 125)]

As we can see from the top 10 most common words in negative sentiment, it's impossible to gauge distinct or important words that correlate with negative sentiment. In order to find the more important words, we can calculate the term frequency - inverse data frequency score for each word and identify highest weighted words.

TODO: discuss tf-idf formula and reasoning here.

In [115]:
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(neg_words)
feature_names = vectorizer.get_feature_names()
dense = vectors.todense()
denselist = dense.tolist()
df = pd.DataFrame(denselist, columns=feature_names)

In [120]:
df

Unnamed: 0,aaland,ab,aberdeen,about,above,abp,acanb,acando,accommodation,according,...,year,years,yesterday,yet,yit,york,you,yoy,zoltan,zone
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9518,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9519,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9520,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9521,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


# Feature Engineering

Do you need to make any changes to “Sentence” to make it more digestible for your model? Will you make any restrictions to your sample? Even if you don’t choose to make any changes to the data, please describe your reasoning.

We want to map the sentences to a specific stock, market, or even country. In order to do this, we imported the stock_tickers.csv file (renamed) from the NASDAQ website to try matching words in the sentences to a specific ticker. However, our data exploration revealed that some sentences didn't contain clean tickers or even any tickers at all, so we'll be attempting to create a category that contains information on the `subject` of the sentence.

To do this, we will use spacy, (TODO: describe spacy here!!)...

In [190]:
def get_subject(sent):
    '''
    Tokenizes and identifies the subject of the sentence using spacy's English pipeline.
    '''
    doc=nlp(sent)
    sub_toks = [tok for tok in doc if (tok.dep_ == "nsubj" or tok.pos_ == "PROPN")]
    return sub_toks

In [191]:
a = train_data.copy()
a

Unnamed: 0,Sentence,Sentiment
0,According to the Finnish-Russian Chamber of Co...,neutral
1,The Swedish buyout firm has sold its remaining...,neutral
2,$SPY wouldn't be surprised to see a green close,positive
3,Shell's $70 Billion BG Deal Meets Shareholder ...,negative
4,SSH COMMUNICATIONS SECURITY CORP STOCK EXCHANG...,negative
...,...,...
4377,Investments in product development stood at 6....,neutral
4378,HSBC Says Unit to Book $585 Million Charge on ...,negative
4379,RISING costs have forced packaging producer Hu...,negative
4380,"In the building and home improvement trade , s...",neutral


In [192]:
a['Subject'] = a['Sentence'].apply(get_subject)

In [193]:
a

Unnamed: 0,Sentence,Sentiment,Subject
0,According to the Finnish-Russian Chamber of Co...,neutral,"[Chamber, Commerce, companies, Finland, Russia]"
1,The Swedish buyout firm has sold its remaining...,neutral,"[firm, Finland]"
2,$SPY wouldn't be surprised to see a green close,positive,[SPY]
3,Shell's $70 Billion BG Deal Meets Shareholder ...,negative,"[Shell, BG, Shareholder, Skepticism]"
4,SSH COMMUNICATIONS SECURITY CORP STOCK EXCHANG...,negative,"[SSH, COMMUNICATIONS, SECURITY, CORP, STOCK, E..."
...,...,...,...
4377,Investments in product development stood at 6....,neutral,[Investments]
4378,HSBC Says Unit to Book $585 Million Charge on ...,negative,"[HSBC, Unit, Book]"
4379,RISING costs have forced packaging producer Hu...,negative,"[RISING, costs, Huhtamaki, Hampshire]"
4380,"In the building and home improvement trade , s...",neutral,"[sales, EUR, mn]"


# Model Building

Create a NLP model that uses the “Sentence” as an input, using “Sentiment”  as labels. Ideally, you will compare the results of several different models to find the optimal choice. What led you to choose your final model? Did you run into any roadblocks? Please describe your process in depth. Make sure to train your model on the training set only.

# Model Testing

Please report the performance of your model on the training set. How does your model perform? Please report your accuracy and F1 score. Also, using the test set, please provide a CSV of your predicted values for “Sentiment” with your submission.

# References

https://www.nasdaq.com/market-activity/stocks/screener