# Objective

Develop a minimum viable model that predicts which direction a stock's price will move.

## The Data

### Input Variables

1. Sentiment
    - Bullish, Bearish, Total_Compound
2. Financial
3. Technical

### Target Variable

1. 1-day price direction
2. 2-day price direction
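
The two targets above can be sketched as binary labels derived from closing prices. This is a minimal illustration, not the notebook's own labeling code: the `add_direction_labels` helper, the `direction_1d`/`direction_2d` column names, and the toy prices are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical closing prices for one ticker (values made up for illustration)
prices = pd.DataFrame({
    "date": pd.to_datetime(
        ["2021-06-01", "2021-06-02", "2021-06-03", "2021-06-04", "2021-06-07"]
    ),
    "close": [100.0, 102.0, 101.0, 101.5, 99.0],
})


def add_direction_labels(df, horizons=(1, 2)):
    """Add one label per horizon: 1.0 if the close is higher h rows later, else 0.0."""
    out = df.sort_values("date").reset_index(drop=True)
    for h in horizons:
        future_close = out["close"].shift(-h)
        # The last h rows have no future price, so their label stays NaN
        label = (future_close > out["close"]).astype("float")
        out[f"direction_{h}d"] = label.where(future_close.notna())
    return out


labeled = add_direction_labels(prices)
print(labeled)
```

Leaving the final rows as NaN, rather than labeling them 0, keeps rows without a known outcome from being treated as "down" days.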

# Import Libraries

In [38]:
import pandas as pd
import numpy as np
import plotly.graph_objects as go

import requests
import json

# Cleaning the Data

In [2]:
# Import data and convert date column to datetime datatype
data = pd.read_csv('historic_sentiment_analysis.csv')
data['date'] = pd.to_datetime(data['date'])

In [3]:
data.head()

Unnamed: 0,stock,Bearish,Neutral,Bullish,Total_Compound,date,assetType,assetMainType,cusip,symbol,...,bookValuePerShare,shortIntToFloat,shortIntDayToCover,divGrowthRate3Year,dividendPayAmount,dividendPayDate,beta,vol1DayAvg,vol10DayAvg,vol3MonthAvg
0,CLOV,0.036,0.749,0.215,0.328,2021-06-03,EQUITY,EQUITY,18914F103,CLOV,...,0.0,0.0,0.0,0.0,0.0,,0.0,13468700.0,13468699.0,477110200.0
1,CLNE,0.017,0.789,0.194,0.398,2021-06-03,EQUITY,EQUITY,184499101,CLNE,...,0.0,0.0,0.0,0.0,0.0,,1.8433,5293610.0,5293614.0,143419800.0
2,TLRY,0.117,0.786,0.097,0.018,2021-06-03,EQUITY,EQUITY,88688T100,TLRY,...,0.0,0.0,0.0,0.0,0.0,,0.0,28527700.0,28527703.0,493355600.0
3,AAPL,0.08,0.72,0.2,0.174,2021-06-03,EQUITY,EQUITY,37833100,AAPL,...,0.0,0.0,0.0,0.0,0.22,00:00.0,1.20359,73329560.0,73329559.0,2016039000.0
4,WKHS,0.119,0.764,0.117,-0.019,2021-06-03,EQUITY,EQUITY,98138J206,WKHS,...,0.0,0.0,0.0,0.0,0.0,,2.63773,11332520.0,11332520.0,279540900.0


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 325 entries, 0 to 324
Data columns (total 100 columns):
 #   Column                              Non-Null Count  Dtype         
---  ------                              --------------  -----         
 0   stock                               325 non-null    object        
 1   Bearish                             325 non-null    float64       
 2   Neutral                             325 non-null    float64       
 3   Bullish                             325 non-null    float64       
 4   Total_Compound                      325 non-null    float64       
 5   date                                325 non-null    datetime64[ns]
 6   assetType                           325 non-null    object        
 7   assetMainType                       325 non-null    object        
 8   cusip                               325 non-null    object        
 9   symbol                              325 non-null    object        
 10  description              

## Unnecessary Columns

Let's dig into the dividend data.

In [5]:
data[['divYield', 'divAmount', 'divDate', 'dividendYield', 'dividendAmount', 'dividendDate']].head(10)

Unnamed: 0,divYield,divAmount,divDate,dividendYield,dividendAmount,dividendDate
0,0.0,0.0,,0.0,0.0,
1,0.0,0.0,,0.0,0.0,
2,0.0,0.0,,0.0,0.0,
3,0.7,0.88,00:00.0,0.7,0.88,00:00.0
4,0.0,0.0,,0.0,0.0,
5,0.0,0.0,,0.0,0.0,
6,0.0,0.0,,0.0,0.0,
7,0.0,0.0,,0.0,0.0,
8,0.71,0.88,00:00.0,0.71,0.88,00:00.0
9,0.09,0.64,00:00.0,0.09,0.64,00:00.0


Most of the values are null or zero because most stocks don't pay dividends.

There are also duplicate columns (e.g., divAmount and dividendAmount).

For simplicity, let's consolidate these columns into one as follows:
1. Remove the dividendDate/divDate columns, since keeping them would be redundant.
2. Remove the divYield column, since it carries largely the same information as divAmount.
3. Keep divAmount, which captures the information from all 6 columns:
    - Whether the stock pays a dividend or not
    - How much is paid per share owned

In [6]:
data.drop(['divYield', 'divDate', 'dividendYield', 'dividendAmount', 'dividendDate', 'dividendPayDate'], axis=1, inplace=True)
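
Before dropping columns as duplicates, it can be worth confirming that they really hold identical values. The sketch below runs that check on a toy frame. The `dividends` frame and the `is_duplicate` dictionary are illustrative names, and the values are loosely based on the preview above rather than the full dataset.

```python
import pandas as pd

# Toy frame mirroring the dividend columns in the preview above
dividends = pd.DataFrame({
    "divAmount": [0.0, 0.0, 0.88, 0.64],
    "dividendAmount": [0.0, 0.0, 0.88, 0.64],
    "divYield": [0.0, 0.0, 0.70, 0.09],
    "dividendYield": [0.0, 0.0, 0.70, 0.09],
})

pairs = [("divAmount", "dividendAmount"), ("divYield", "dividendYield")]

# Series.equals compares values element by element and treats
# NaNs in the same position as equal
is_duplicate = {
    (left, right): dividends[left].equals(dividends[right])
    for left, right in pairs
}
print(is_duplicate)
```

If any pair came back False on the real data, the two columns would need a closer look before one of them is discarded.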

Several columns are identifiers, duplicates or empty. We don't need them for this project.

In [7]:
data.drop(['cusip',
           'assetType',
           'description',
           'assetMainType',
           'symbol',
           'securityStatus',
           'symbol.1',
           'bidTick',
           'exchangeName',
           'peRatio.1'], axis=1, inplace=True)

Categorical columns

In [8]:
data.select_dtypes(include='object')

Unnamed: 0,stock,bidId,askId,lastId,exchange
0,CLOV,P,P,P,q
1,CLNE,Q,P,P,q
2,TLRY,P,P,P,q
3,AAPL,P,P,D,q
4,WKHS,P,P,D,q
...,...,...,...,...,...
320,TSLA,Z,H,D,q
321,WISH,Q,N,D,q
322,MSFT,N,N,D,q
323,AMZN,H,H,D,q


In [9]:
print(data['bidId'].nunique())
print(data['askId'].nunique())
print(data['lastId'].nunique())
print(data['exchange'].nunique())

11
11
10
1


The exchange column has only one unique value, so it is unlikely to add any predictive power.

In [10]:
data.drop(['exchange'], axis=1, inplace=True)
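
The per-column `nunique()` calls can also be collapsed into a single call on the frame. A minimal sketch on a toy frame follows. The `sample`, `unique_counts`, and `constant_columns` names are illustrative, and the rows are a small subset of the table above.

```python
import pandas as pd

# Toy frame with a few of the categorical columns shown above
sample = pd.DataFrame({
    "stock": ["CLOV", "CLNE", "TLRY", "AAPL"],
    "bidId": ["P", "Q", "P", "P"],
    "lastId": ["P", "P", "P", "D"],
    "exchange": ["q", "q", "q", "q"],
})

# DataFrame.nunique returns the number of distinct values per column
unique_counts = sample.nunique()
print(unique_counts)

# Columns with a single distinct value are candidates for removal
constant_columns = unique_counts[unique_counts <= 1].index.tolist()
print(constant_columns)
```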

In [11]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 325 entries, 0 to 324
Data columns (total 83 columns):
 #   Column                              Non-Null Count  Dtype         
---  ------                              --------------  -----         
 0   stock                               325 non-null    object        
 1   Bearish                             325 non-null    float64       
 2   Neutral                             325 non-null    float64       
 3   Bullish                             325 non-null    float64       
 4   Total_Compound                      325 non-null    float64       
 5   date                                325 non-null    datetime64[ns]
 6   bidPrice                            325 non-null    float64       
 7   bidSize                             325 non-null    int64         
 8   bidId                               325 non-null    object        
 9   askPrice                            325 non-null    float64       
 10  askSize                   

## Boolean Values

In [12]:
data.select_dtypes(include='boolean')

Unnamed: 0,marginable,shortable,delayed,realtimeEntitled
0,True,True,True,False
1,True,True,True,False
2,True,True,True,False
3,True,True,True,False
4,True,True,True,False
...,...,...,...,...
320,True,True,True,False
321,True,True,True,False
322,True,True,True,False
323,True,True,True,False


In [13]:
print(data['marginable'].nunique())
print(data['shortable'].nunique())
print(data['delayed'].nunique())
print(data['realtimeEntitled'].nunique())

1
1
1
1


Each of these columns holds a single value, so none of them provides useful information.

In [14]:
data.drop(['marginable', 'shortable', 'delayed', 'realtimeEntitled'], axis=1, inplace=True)

In [15]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 325 entries, 0 to 324
Data columns (total 79 columns):
 #   Column                              Non-Null Count  Dtype         
---  ------                              --------------  -----         
 0   stock                               325 non-null    object        
 1   Bearish                             325 non-null    float64       
 2   Neutral                             325 non-null    float64       
 3   Bullish                             325 non-null    float64       
 4   Total_Compound                      325 non-null    float64       
 5   date                                325 non-null    datetime64[ns]
 6   bidPrice                            325 non-null    float64       
 7   bidSize                             325 non-null    int64         
 8   bidId                               325 non-null    object        
 9   askPrice                            325 non-null    float64       
 10  askSize                   

In [16]:
#data = data.transpose(copy=True).drop_duplicates().transpose(copy=True)
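
The commented-out line above hints at removing duplicate columns by transposing the frame. One way this could look is sketched below on a toy frame. The values are made up, and `peRatio` is assumed to be the original of the `peRatio.1` column dropped earlier.

```python
import pandas as pd

# Toy frame: peRatio.1 repeats peRatio exactly (values made up for illustration)
sample = pd.DataFrame({
    "peRatio": [12.5, 0.0, 30.1],
    "peRatio.1": [12.5, 0.0, 30.1],
    "beta": [1.2, 0.0, 2.6],
})

# Transpose so that columns become rows, then flag rows that repeat an earlier one
is_repeat = sample.T.duplicated()

# Keep only the first occurrence of each distinct column
deduplicated = sample.loc[:, ~is_repeat.to_numpy()]
print(deduplicated.columns.tolist())
```

Transposing a frame with mixed dtypes upcasts everything to object, so on a wide frame this approach may be slow. Comparing only the suspected pairs is a cheaper alternative.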

## Null Values

In [17]:
data.isna().sum().sum()

0

There are no null values, so we're good to go.
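
A null count of zero does not rule out placeholder values: the preview at the top shows many 0.0 entries in fields such as bookValuePerShare. The sketch below measures the share of zeros per column, on the assumption (not confirmed by the source) that some zeros stand in for missing data. The `sample` and `zero_share` names are illustrative.

```python
import pandas as pd

# Toy frame using a few columns and values from the preview at the top
sample = pd.DataFrame({
    "bookValuePerShare": [0.0, 0.0, 0.0, 0.0],
    "beta": [0.0, 1.8433, 0.0, 1.20359],
    "vol1DayAvg": [13468700.0, 5293610.0, 28527700.0, 73329560.0],
})

# Share of rows equal to zero in each column, largest first
zero_share = (sample == 0).mean().sort_values(ascending=False)
print(zero_share)
```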

## Columns with minimal unique values

Columns with a single unique value are unlikely to provide any predictive power.

In [18]:
# Collect columns with at most one unique value, then drop them in a single call
constant_columns = [column for column in data.columns if data[column].nunique() <= 1]
data.drop(constant_columns, axis=1, inplace=True)

# Bring in price data with TDAmeritrade API

In [37]:
# Request one year of daily candles for a single ticker
api_key = "***REMOVED***"
stock = 'AAPL'
url = f'https://api.tdameritrade.com/v1/marketdata/{stock}/pricehistory?apikey={api_key}&periodType=year&period=1&frequencyType=daily&frequency=1'
raw_data = requests.get(url).json()

# Flatten the list of candles into one row per trading day
aapl_price_data = pd.json_normalize(raw_data, record_path=['candles'])
aapl_price_data

Unnamed: 0,open,high,low,close,volume,datetime
0,97.2650,99.9550,95.2575,95.4775,191649140,1594616400000
1,94.8400,97.2550,93.8775,97.0575,170989364,1594702800000
2,98.9900,99.2475,96.4900,97.7250,153197932,1594789200000
3,96.5625,97.4050,95.9050,96.5225,110577672,1594875600000
4,96.9875,97.1475,95.8400,96.3275,92186900,1594962000000
...,...,...,...,...,...,...
248,143.5350,144.8900,142.6600,144.5700,104911589,1625634000000
249,141.5800,144.0600,140.6650,143.2400,105575458,1625720400000
250,142.7500,145.6500,142.6522,145.1100,99890800,1625806800000
251,146.2100,146.3200,144.0000,144.5000,76299719,1626066000000
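
The datetime column in the output above appears to hold milliseconds since the Unix epoch, so it needs converting before it can be matched against the sentiment data's date column. The sketch below uses the first three candles from that output. The `timestamp` and `date` column names are illustrative choices.

```python
import pandas as pd

# First three candles from the output above
candles = pd.DataFrame({
    "close": [95.4775, 97.0575, 97.7250],
    "datetime": [1594616400000, 1594702800000, 1594789200000],
})

# Interpret the integers as milliseconds since the Unix epoch (UTC)
candles["timestamp"] = pd.to_datetime(candles["datetime"], unit="ms")

# Drop the time of day so each row carries only its calendar date
candles["date"] = candles["timestamp"].dt.normalize()
print(candles[["date", "close"]])
```

Each timestamp falls at 05:00 UTC, so normalizing in UTC yields the same calendar date as the US trading day for these rows.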
