# Objective

Develop a minimum viable model that predicts the direction in which a stock's price will move.

## The Data

### Input Variables

1. Sentiment
    - Bullish, Bearish, Total_Compound
2. Financial
3. Technical

### Target Variable

1. 1-day price direction
2. 2-day price direction
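
The notebook does not spell out how these direction labels are computed. The following is a minimal sketch, assuming "direction" means whether the closing price N trading days later is higher than the current close. The frame `prices`, the helper `add_direction`, and the column names `direction_1d`/`direction_2d` are hypothetical and used only for illustration.

```python
import pandas as pd

# Hypothetical toy price data; the values are illustrative only.
prices = pd.DataFrame({
    "stock": ["AAA"] * 4 + ["BBB"] * 4,
    "datetime": list(pd.date_range("2021-06-01", periods=4, freq="D")) * 2,
    "close": [10.0, 11.0, 10.5, 12.0, 5.0, 4.0, 4.5, 4.2],
})


def add_direction(df: pd.DataFrame, horizon: int) -> pd.Series:
    """Return 1.0 if the close `horizon` rows ahead is higher, 0.0 if not, NaN if unknown."""
    # Shift within each stock so one ticker's prices never leak into another's.
    future = df.groupby("stock")["close"].shift(-horizon)
    direction = (future > df["close"]).astype(float)
    # Rows without a future price have no label yet.
    return direction.where(future.notna())


# Sort so that "next row" means "next trading day" for each stock.
prices = prices.sort_values(["stock", "datetime"]).reset_index(drop=True)
prices["direction_1d"] = add_direction(prices, 1)
prices["direction_2d"] = add_direction(prices, 2)
print(prices)
```

The last one or two rows of each stock end up with missing labels, because the future close is not yet known for them.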

# Import Libraries

In [1]:
import pandas as pd
import numpy as np
import plotly.graph_objects as go

import requests
import json

# Cleaning the Data

In [2]:
# Import data and convert date column to datetime datatype
data = pd.read_csv('historic_sentiment_analysis.csv')
data['date'] = pd.to_datetime(data['date'])

In [3]:
data.head()

Unnamed: 0,stock,Bearish,Neutral,Bullish,Total_Compound,date,assetType,assetMainType,cusip,symbol,...,bookValuePerShare,shortIntToFloat,shortIntDayToCover,divGrowthRate3Year,dividendPayAmount,dividendPayDate,beta,vol1DayAvg,vol10DayAvg,vol3MonthAvg
0,CLOV,0.036,0.749,0.215,0.328,2021-06-03,EQUITY,EQUITY,18914F103,CLOV,...,0.0,0.0,0.0,0.0,0.0,,0.0,13468700.0,13468699.0,477110200.0
1,CLNE,0.017,0.789,0.194,0.398,2021-06-03,EQUITY,EQUITY,184499101,CLNE,...,0.0,0.0,0.0,0.0,0.0,,1.8433,5293610.0,5293614.0,143419800.0
2,TLRY,0.117,0.786,0.097,0.018,2021-06-03,EQUITY,EQUITY,88688T100,TLRY,...,0.0,0.0,0.0,0.0,0.0,,0.0,28527700.0,28527703.0,493355600.0
3,AAPL,0.08,0.72,0.2,0.174,2021-06-03,EQUITY,EQUITY,37833100,AAPL,...,0.0,0.0,0.0,0.0,0.22,00:00.0,1.20359,73329560.0,73329559.0,2016039000.0
4,WKHS,0.119,0.764,0.117,-0.019,2021-06-03,EQUITY,EQUITY,98138J206,WKHS,...,0.0,0.0,0.0,0.0,0.0,,2.63773,11332520.0,11332520.0,279540900.0


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 335 entries, 0 to 334
Data columns (total 100 columns):
 #   Column                              Non-Null Count  Dtype         
---  ------                              --------------  -----         
 0   stock                               335 non-null    object        
 1   Bearish                             335 non-null    float64       
 2   Neutral                             335 non-null    float64       
 3   Bullish                             335 non-null    float64       
 4   Total_Compound                      335 non-null    float64       
 5   date                                335 non-null    datetime64[ns]
 6   assetType                           335 non-null    object        
 7   assetMainType                       335 non-null    object        
 8   cusip                               335 non-null    object        
 9   symbol                              335 non-null    object        
 10  description              

## Unnecessary Columns

Let's dig into the dividend data.

In [5]:
data[['divYield', 'divAmount', 'divDate', 'dividendYield', 'dividendAmount', 'dividendDate']].head(10)

Unnamed: 0,divYield,divAmount,divDate,dividendYield,dividendAmount,dividendDate
0,0.0,0.0,,0.0,0.0,
1,0.0,0.0,,0.0,0.0,
2,0.0,0.0,,0.0,0.0,
3,0.7,0.88,00:00.0,0.7,0.88,00:00.0
4,0.0,0.0,,0.0,0.0,
5,0.0,0.0,,0.0,0.0,
6,0.0,0.0,,0.0,0.0,
7,0.0,0.0,,0.0,0.0,
8,0.71,0.88,00:00.0,0.71,0.88,00:00.0
9,0.09,0.64,00:00.0,0.09,0.64,00:00.0


Most of the values are null or zero because most stocks don't pay dividends.

There are also duplicate columns (e.g. divAmount and dividendAmount).

For simplicity, let's consolidate these columns into one as follows:
1. Remove the dividendDate/divDate columns; the payment dates are redundant for this project.
2. Remove the divYield/dividendYield columns; we treat the yield as carrying the same signal as divAmount.
3. Keep divAmount, which covers what we need from the 6 columns:
    - Whether the stock pays a dividend or not
    - How much is paid per share owned
In [6]:
data.drop(['divYield', 'divDate', 'dividendYield', 'dividendAmount', 'dividendDate', 'dividendPayDate'], axis=1, inplace=True)

Several columns are identifiers, duplicates, or empty; we don't need them for this project.

In [7]:
data.drop(['cusip',
           'assetType',
           'description',
           'assetMainType',
           'symbol',
           'securityStatus',
           'symbol.1',
           'bidTick',
           'exchangeName',
           'peRatio.1'], axis=1, inplace=True)

Categorical columns

In [8]:
data.select_dtypes(include='object')

Unnamed: 0,stock,bidId,askId,lastId,exchange
0,CLOV,P,P,P,q
1,CLNE,Q,P,P,q
2,TLRY,P,P,P,q
3,AAPL,P,P,D,q
4,WKHS,P,P,D,q
...,...,...,...,...,...
330,CRSR,V,K,D,q
331,AMD,Q,Q,D,q
332,CLNE,Q,Q,D,q
333,AMZN,V,V,D,q


In [9]:
print(data['bidId'].nunique())
print(data['askId'].nunique())
print(data['lastId'].nunique())
print(data['exchange'].nunique())

11
11
10
1


The exchange column has only 1 unique value, so it is unlikely to add any predictive power.
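
The per-column prints above can also be collapsed into a single call. The following is a minimal sketch on a hypothetical `toy` frame; only the column names echo the notebook, the values are made up.

```python
import pandas as pd

# Hypothetical toy frame; only the column names echo the notebook.
toy = pd.DataFrame({
    "bidId": ["P", "Q", "P", "V"],
    "lastId": ["P", "P", "D", "D"],
    "exchange": ["q", "q", "q", "q"],
})

# Count the distinct values of every column in one pass.
unique_counts = toy.nunique()

# Columns with at most one distinct value carry no signal for a model.
constant_cols = unique_counts[unique_counts <= 1].index.tolist()
print(constant_cols)  # ['exchange']
```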

In [10]:
data.drop(['exchange'], axis=1, inplace=True)

In [11]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 335 entries, 0 to 334
Data columns (total 83 columns):
 #   Column                              Non-Null Count  Dtype         
---  ------                              --------------  -----         
 0   stock                               335 non-null    object        
 1   Bearish                             335 non-null    float64       
 2   Neutral                             335 non-null    float64       
 3   Bullish                             335 non-null    float64       
 4   Total_Compound                      335 non-null    float64       
 5   date                                335 non-null    datetime64[ns]
 6   bidPrice                            335 non-null    float64       
 7   bidSize                             335 non-null    int64         
 8   bidId                               335 non-null    object        
 9   askPrice                            335 non-null    float64       
 10  askSize                   

## Boolean Values

In [12]:
data.select_dtypes(include='boolean')

Unnamed: 0,marginable,shortable,delayed,realtimeEntitled
0,True,True,True,False
1,True,True,True,False
2,True,True,True,False
3,True,True,True,False
4,True,True,True,False
...,...,...,...,...
330,True,True,True,False
331,True,True,True,False
332,True,True,True,False
333,True,True,True,False


In [13]:
print(data['marginable'].nunique())
print(data['shortable'].nunique())
print(data['delayed'].nunique())
print(data['realtimeEntitled'].nunique())

1
1
1
1


Each of these columns holds a single value, so none of them provides any useful information.

In [14]:
data.drop(['marginable', 'shortable', 'delayed', 'realtimeEntitled'], axis=1, inplace=True)

In [15]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 335 entries, 0 to 334
Data columns (total 79 columns):
 #   Column                              Non-Null Count  Dtype         
---  ------                              --------------  -----         
 0   stock                               335 non-null    object        
 1   Bearish                             335 non-null    float64       
 2   Neutral                             335 non-null    float64       
 3   Bullish                             335 non-null    float64       
 4   Total_Compound                      335 non-null    float64       
 5   date                                335 non-null    datetime64[ns]
 6   bidPrice                            335 non-null    float64       
 7   bidSize                             335 non-null    int64         
 8   bidId                               335 non-null    object        
 9   askPrice                            335 non-null    float64       
 10  askSize                   

In [16]:
#data = data.transpose(copy=True).drop_duplicates().transpose(copy=True)

## Null Values

In [17]:
data.isna().sum().sum()

0

No null values remain, so we're good to go.

## Columns with minimal unique values

A column with a single value is unlikely to provide any predictive power.

In [18]:
# Drop every column that has at most one distinct value
constant_columns = [column for column in data.columns if data[column].nunique() <= 1]
data.drop(columns=constant_columns, inplace=True)

# Bring in price data with TDAmeritrade API

In [53]:
# Date range of our dataset
print(data['date'].min())
print(data['date'].max())
print(data['date'].max() - data['date'].min())

2021-06-03 00:00:00
2021-07-15 00:00:00
42 days 00:00:00


In [51]:
api_key = "***REMOVED***"
frames = []

# Request 2 months of daily candles for each stock in the dataset
for stock in data['stock'].unique():
    url = f'https://api.tdameritrade.com/v1/marketdata/{stock}/pricehistory'
    params = {'apikey': api_key, 'periodType': 'month', 'period': 2,
              'frequencyType': 'daily', 'frequency': 1}
    raw_data = requests.get(url, params=params).json()
    raw_data = pd.json_normalize(raw_data, record_path=['candles'])
    # Timestamps arrive as milliseconds since the epoch
    raw_data['datetime'] = pd.to_datetime(raw_data['datetime'], unit='ms')
    raw_data['stock'] = stock
    frames.append(raw_data)

# Concatenate once at the end instead of growing the frame inside the loop
price_data = pd.concat(frames, ignore_index=True)

price_data

Unnamed: 0,open,high,low,close,volume,datetime,stock
0,7.352,7.685,7.185,7.47,7968626,2021-05-14 05:00:00,CLOV
1,7.590,7.930,6.590,6.82,38561966,2021-05-17 05:00:00,CLOV
2,6.770,7.200,6.600,6.97,16185710,2021-05-18 05:00:00,CLOV
3,6.740,6.910,6.519,6.84,9210396,2021-05-19 05:00:00,CLOV
4,6.870,7.180,6.760,7.13,9340391,2021-05-20 05:00:00,CLOV
...,...,...,...,...,...,...,...
1465,11.850,12.690,11.550,12.46,1629803,2021-07-08 05:00:00,ASTS
1466,12.590,12.850,12.050,12.76,1625977,2021-07-09 05:00:00,ASTS
1467,14.040,14.240,12.810,13.30,4519166,2021-07-12 05:00:00,ASTS
1468,13.020,13.200,12.250,12.47,2175529,2021-07-13 05:00:00,ASTS


In [52]:
price_data[price_data['stock'] == 'AAPL']

Unnamed: 0,open,high,low,close,volume,datetime,stock
126,126.25,127.89,125.85,127.45,81917951,2021-05-14 05:00:00,AAPL
127,126.82,126.93,125.17,126.27,74244624,2021-05-17 05:00:00,AAPL
128,126.56,126.99,124.78,124.85,63342929,2021-05-18 05:00:00,AAPL
129,123.16,124.915,122.86,124.69,92611989,2021-05-19 05:00:00,AAPL
130,125.23,127.72,125.1,127.31,76857123,2021-05-20 05:00:00,AAPL
131,127.82,128.0,125.21,125.43,79295436,2021-05-21 05:00:00,AAPL
132,126.01,127.94,125.94,127.1,63092945,2021-05-24 05:00:00,AAPL
133,127.82,128.32,126.32,126.9,72009482,2021-05-25 05:00:00,AAPL
134,126.955,127.39,126.42,126.85,56575920,2021-05-26 05:00:00,AAPL
135,126.44,127.64,125.08,125.28,94625601,2021-05-27 05:00:00,AAPL
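
The candle timestamps above carry a 05:00:00 time component, while the sentiment `date` column sits at midnight, so a join on the raw values would find no matches. The following is a minimal sketch of one way to align them, assuming the calendar date of each candle is the right join key. The frames `sentiment` and `candles` are hypothetical miniatures with made-up values.

```python
import pandas as pd

# Hypothetical miniature versions of the two frames; values are illustrative only.
sentiment = pd.DataFrame({
    "stock": ["AAA", "AAA"],
    "date": pd.to_datetime(["2021-06-03", "2021-06-04"]),
    "Bullish": [0.20, 0.25],
})
candles = pd.DataFrame({
    "stock": ["AAA", "AAA"],
    "datetime": pd.to_datetime(["2021-06-03 05:00:00", "2021-06-04 05:00:00"]),
    "close": [100.0, 101.0],
})

# Drop the time-of-day component so both frames share a midnight date key.
candles["date"] = candles["datetime"].dt.normalize()

# A left join keeps every sentiment row, even when no candle matches.
merged = sentiment.merge(
    candles[["stock", "date", "close"]],
    on=["stock", "date"],
    how="left",
)
print(merged)
```

Sentiment rows dated on weekends or market holidays would come back with a missing `close` under this approach, so they would need separate handling.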
