Project Sentiment

Using Natural Language Processing to predict Tesla stock movement based on news article sentiment from the New York Times

Data Gathering

News articles from The New York Times discussing Tesla were sourced for sentiment analysis. Article information was retrieved from API requests and web scraping.

The New York Times: API Requests

The New York Times Developer Network: Article Search API https://developer.nytimes.com/docs/articlesearch-product/1/overview

The Article Search API of The New York Times was queried for articles containing the term 'tesla' published between January 1, 2010 (the year that Tesla launched its IPO) and May 31, 2019. The query returned a total of 2,540 article hits.

The following information was requested for each article document:

Web URL
Snippet (Headline)
Publication Date
Identifier
Lead Paragraph
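
A minimal sketch of how such a query could be issued is shown below. It is not the project's own notebook code: the helper names are hypothetical, the API key is a placeholder, and the response field names (`web_url`, `snippet`, `pub_date`, `_id`, `lead_paragraph`) are assumed to correspond to the five items listed above.

```python
import json
import urllib.parse
import urllib.request

# Endpoint of the Article Search API (version 2).
BASE_URL = "https://api.nytimes.com/svc/search/v2/articlesearch.json"


def build_query_url(api_key, page, term="tesla",
                    begin_date="20100101", end_date="20190531"):
    """Build the request URL for one page of search results."""
    params = {
        "q": term,
        "begin_date": begin_date,  # YYYYMMDD
        "end_date": end_date,      # YYYYMMDD
        "page": page,
        "api-key": api_key,
    }
    return BASE_URL + "?" + urllib.parse.urlencode(params)


def extract_fields(doc):
    """Keep only the five requested fields from one article document."""
    keys = ("web_url", "snippet", "pub_date", "_id", "lead_paragraph")
    return {key: doc.get(key) for key in keys}


def fetch_page(api_key, page):
    """Request one page of results and return the trimmed article records."""
    with urllib.request.urlopen(build_query_url(api_key, page)) as resp:
        payload = json.load(resp)
    return [extract_fields(doc) for doc in payload["response"]["docs"]]
```

Results are paginated, so collecting all hits means calling `fetch_page` repeatedly with increasing page numbers and pausing between calls to stay within the API's rate limits.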

The New York Times: Web Scraping

The text of each article was retrieved by accessing its web URL and extracting the main body. Of the 2,540 web URLs, 1,829 yielded full-text articles. The remaining 711 articles, most of them published before 2013, were not included in the dataset.
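
The extraction step could look roughly like the sketch below, which uses only the Python standard library. It assumes that the article body lives in `<p>` tags; the class and function names are hypothetical, and the page markup the project actually targeted is not documented here.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text found inside <p> tags."""

    def __init__(self):
        super().__init__()
        self._in_p = False     # True while inside a <p> element
        self._current = []     # text pieces of the paragraph being read
        self.paragraphs = []   # finished paragraphs

    def _flush(self):
        # Join the collected pieces and normalise whitespace.
        text = " ".join("".join(self._current).split())
        if text:
            self.paragraphs.append(text)
        self._current = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()      # also closes an unterminated paragraph
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._current.append(data)


def extract_body(html):
    """Return the paragraph text of a page, one paragraph per line."""
    parser = ParagraphExtractor()
    parser.feed(html)
    return "\n".join(parser.paragraphs)
```

An empty return value signals that no body text was found, which is one way articles could end up excluded from the dataset.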

Natural Language Processing

Natural Language Processing (NLP) was applied to each article snippet (headline), lead paragraph and article body. Two sentiment analysis toolsets were applied to the article texts.

VADER Sentiment Analysis

Valence Aware Dictionary and Sentiment Reasoner (VADER) https://www.nltk.org/_modules/nltk/sentiment/vader.html

Because VADER is designed to evaluate short texts, it was applied only to the snippet (headline) and lead paragraph of each article. Negative, neutral and positive proportions were recorded for each article, along with a compound score.

TextBlob Sentiment Analysis

TextBlob https://textblob.readthedocs.io/en/dev/

TextBlob Sentiment Analysis was applied to the snippet (headline), lead paragraph and article body of each article. Polarity and subjectivity scores were recorded for each article.

Aggregating Records by Date

To merge with the daily closing stock price of Tesla, articles were grouped by publication date. For dates with multiple articles, each sentiment score was averaged across all articles published that day. Additionally, the total number of articles retrieved for each day was recorded to quantify news intensity.
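
The grouping described above can be sketched in pandas as follows. The column names and the three sample rows are made up for illustration and do not come from the project's data.

```python
import pandas as pd

# Hypothetical per-article scores; column names are illustrative only.
articles = pd.DataFrame({
    "pub_date": ["2013-01-25T10:00:00+0000",
                 "2013-01-25T18:30:00+0000",
                 "2013-01-28T09:15:00+0000"],
    "body_polarity": [0.10, 0.30, -0.20],
    "snippet_compound": [0.50, -0.10, 0.00],
})

# Reduce each timestamp to a calendar date so that articles group by day.
articles["date"] = pd.to_datetime(articles["pub_date"], utc=True).dt.date

# Mean of every sentiment score per day, plus the number of articles that day.
score_cols = ["body_polarity", "snippet_compound"]
daily = articles.groupby("date")[score_cols].mean()
daily["article_count"] = articles.groupby("date").size()
```

The resulting `daily` frame has one row per date and can then be joined to the price data on that date.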

Final DataFrame with Sentiment Analysis by Date

The final pandas DataFrame of sentiment analysis for The New York Times articles discussing Tesla contains 1,829 articles grouped into 1,032 unique days. Fifteen feature columns were engineered from article information using NLP:

Daily Records Beginning January 25, 2013

TextBlob polarity and subjectivity: article body, lead paragraph and snippet (headline)
VADER negative, neutral, positive and compound scores: lead paragraph and snippet (headline)
Total article count

The distribution of each sentiment feature is shown below:

Baseline Models

Classification models were used to predict the binary outcome of whether the stock price of Tesla moved up (1) or down (0) on a given day.
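
One way to derive such a label is sketched below. It assumes that "moved up" means the closing price is higher than the previous trading day's close, which this README does not state explicitly, and it counts an unchanged close as 0. The prices and column names are made up for illustration.

```python
import pandas as pd

# Hypothetical closing prices; real values would come from the price file.
prices = pd.DataFrame({
    "Date": pd.to_datetime(["2013-01-24", "2013-01-25",
                            "2013-01-28", "2013-01-29"]),
    "Close": [36.0, 37.0, 36.5, 36.5],
}).set_index("Date")

# 1 if the close is above the previous trading day's close, otherwise 0.
prices["moved_up"] = (prices["Close"].diff() > 0).astype(int)

# The first row has no previous close to compare against, so it is dropped.
prices = prices.iloc[1:]
```

The same construction could be applied to the Nasdaq series to obtain a binary market-movement feature.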

Features

VADER compound score of snippet (continuous)
VADER positive sentiment of snippet (continuous)
VADER negative sentiment of snippet (continuous)
VADER neutral sentiment of snippet (continuous)
TextBlob article polarity (continuous)
TextBlob article subjectivity (continuous)
Article count for the day (continuous)
Daily trading volume of Tesla (continuous)
Nasdaq movement (binary)
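
The sketch below shows how a nine-feature matrix of this shape could be passed to the two baseline classifiers with scikit-learn. The data is random placeholder data, so the scores it produces say nothing about the project's results. The chronological train/test split and the feature scaling are assumptions, not documented choices of the project.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Random placeholder data standing in for the nine features and the label.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 9))
y = rng.integers(0, 2, size=200)

# shuffle=False keeps row order, so the test set is the last 25% of rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, shuffle=False)

models = {
    # Scaling is an added assumption; it helps the solver converge.
    "logistic_regression": make_pipeline(StandardScaler(),
                                         LogisticRegression()),
    "gaussian_nb": GaussianNB(),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "f1": f1_score(y_test, pred, zero_division=0),
    }
```

With real data, the fitted logistic regression coefficients are available through `models["logistic_regression"][-1].coef_`.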

Logistic Regression Model

The model provided the following coefficients:

Accuracy and F1 score:

Gaussian Naive Bayes Model

The Gaussian Naive Bayes model provided the following results:

Final Model

Several versions of the models were tested, adding and removing various features. The best results were yielded by the logistic regression model with the following features:

TextBlob article polarity (continuous)
TextBlob article subjectivity (continuous)
Article count for the day (continuous)
Daily trading volume of Tesla (continuous)
Nasdaq movement (binary)

In this case, TextBlob sentiment analysis of article bodies proved to have better predictive power than VADER sentiment analysis of article headlines/snippets.
