# Objective

Develop a minimum viable model that predicts which direction a stock's price will move.

## The Data

### Input Variables

1. Sentiment
    - Bullish, Bearish, Total_Compound
2. Financial
3. Technical

### Target Variable

1. 1-day price direction
2. 2-day price direction
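
As a minimal sketch of how these targets could be built, assume "direction" means whether the closing price N trading days later is higher than the current close. That definition, the toy ticker `XYZ`, its prices, and the `direction_1d`/`direction_2d` column names are assumptions for illustration, not something the notebook defines.

```python
import pandas as pd

# Toy closing prices for one hypothetical ticker (values are made up)
prices = pd.DataFrame({
    'stock': ['XYZ'] * 4,
    'date': pd.to_datetime(['2021-06-03', '2021-06-04', '2021-06-07', '2021-06-08']),
    'close': [10.0, 11.0, 10.5, 12.0],
})
prices = prices.sort_values(['stock', 'date'])

for horizon in (1, 2):
    # Close `horizon` trading days ahead, computed within each ticker
    future_close = prices.groupby('stock')['close'].shift(-horizon)
    # 1.0 = price rose, 0.0 = price fell or stayed flat
    direction = (future_close > prices['close']).astype(float)
    # Leave NaN where no future close is available yet
    prices[f'direction_{horizon}d'] = direction.where(future_close.notna())

print(prices)
```

Shifting within each ticker keeps one stock's future price from leaking into another stock's label, and the last rows of each ticker stay empty because their outcome is not yet known.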

# Import Libraries

In [116]:
import pandas as pd
import numpy as np
import plotly.graph_objects as go

import requests
import json
import datetime

# Cleaning the Data

In [117]:
# Import data and convert date column to datetime datatype
data = pd.read_csv('historic_sentiment_analysis.csv')
data['date'] = pd.to_datetime(data['date'])

In [118]:
data.head()

Unnamed: 0,stock,Bearish,Neutral,Bullish,Total_Compound,date,assetType,assetMainType,cusip,symbol,...,bookValuePerShare,shortIntToFloat,shortIntDayToCover,divGrowthRate3Year,dividendPayAmount,dividendPayDate,beta,vol1DayAvg,vol10DayAvg,vol3MonthAvg
0,CLOV,0.036,0.749,0.215,0.328,2021-06-03,EQUITY,EQUITY,18914F103,CLOV,...,0.0,0.0,0.0,0.0,0.0,,0.0,13468700.0,13468699.0,477110200.0
1,CLNE,0.017,0.789,0.194,0.398,2021-06-03,EQUITY,EQUITY,184499101,CLNE,...,0.0,0.0,0.0,0.0,0.0,,1.8433,5293610.0,5293614.0,143419800.0
2,TLRY,0.117,0.786,0.097,0.018,2021-06-03,EQUITY,EQUITY,88688T100,TLRY,...,0.0,0.0,0.0,0.0,0.0,,0.0,28527700.0,28527703.0,493355600.0
3,AAPL,0.08,0.72,0.2,0.174,2021-06-03,EQUITY,EQUITY,37833100,AAPL,...,0.0,0.0,0.0,0.0,0.22,00:00.0,1.20359,73329560.0,73329559.0,2016039000.0
4,WKHS,0.119,0.764,0.117,-0.019,2021-06-03,EQUITY,EQUITY,98138J206,WKHS,...,0.0,0.0,0.0,0.0,0.0,,2.63773,11332520.0,11332520.0,279540900.0


In [119]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 335 entries, 0 to 334
Data columns (total 100 columns):
 #   Column                              Non-Null Count  Dtype         
---  ------                              --------------  -----         
 0   stock                               335 non-null    object        
 1   Bearish                             335 non-null    float64       
 2   Neutral                             335 non-null    float64       
 3   Bullish                             335 non-null    float64       
 4   Total_Compound                      335 non-null    float64       
 5   date                                335 non-null    datetime64[ns]
 6   assetType                           335 non-null    object        
 7   assetMainType                       335 non-null    object        
 8   cusip                               335 non-null    object        
 9   symbol                              335 non-null    object        
 10  description              

## Unnecessary Columns

Let's dig into dividend data. 

In [120]:
data[['divYield', 'divAmount', 'divDate', 'dividendYield', 'dividendAmount', 'dividendDate']].head(10)

Unnamed: 0,divYield,divAmount,divDate,dividendYield,dividendAmount,dividendDate
0,0.0,0.0,,0.0,0.0,
1,0.0,0.0,,0.0,0.0,
2,0.0,0.0,,0.0,0.0,
3,0.7,0.88,00:00.0,0.7,0.88,00:00.0
4,0.0,0.0,,0.0,0.0,
5,0.0,0.0,,0.0,0.0,
6,0.0,0.0,,0.0,0.0,
7,0.0,0.0,,0.0,0.0,
8,0.71,0.88,00:00.0,0.71,0.88,00:00.0
9,0.09,0.64,00:00.0,0.09,0.64,00:00.0


Most of the values are null or zero because most of these stocks don't pay dividends.

Also, there are duplicate columns (e.g. divAmount and dividendAmount).

For simplicity, let's consolidate these columns into one as follows:
1. Remove the date columns (divDate, dividendDate, dividendPayDate). Keeping them would be redundant.
2. Remove the yield columns (divYield, dividendYield) and the duplicate dividendAmount column. We treat them as carrying the same information as divAmount.
3. Keep divAmount, which captures what we need from these columns:
    - Whether the stock pays a dividend or not
    - How much is paid per share owned

In [121]:
data.drop(['divYield', 'divDate', 'dividendYield', 'dividendAmount', 'dividendDate', 'dividendPayDate'], axis=1, inplace=True)
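
Before dropping columns that look like duplicates, it can be worth confirming that they really match. The sketch below does this on a small stand-in frame whose values are taken from the preview above; the `dividends` and `identical` names are hypothetical, and on the real dataset the same check would run against `data`.

```python
import pandas as pd

# Small stand-in for the dividend columns (values taken from the preview above)
dividends = pd.DataFrame({
    'divAmount': [0.0, 0.0, 0.88, 0.64],
    'dividendAmount': [0.0, 0.0, 0.88, 0.64],
    'divYield': [0.0, 0.0, 0.7, 0.09],
    'dividendYield': [0.0, 0.0, 0.7, 0.09],
})

pairs = [('divAmount', 'dividendAmount'), ('divYield', 'dividendYield')]

# Series.equals compares values element by element and treats NaNs in the same position as equal
identical = {pair: dividends[pair[0]].equals(dividends[pair[1]]) for pair in pairs}
print(identical)
```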

Several columns are identifiers, duplicates, or empty. We don't need them for this project.

In [122]:
data.drop(['cusip',
           'assetType',
           'description',
           'assetMainType',
           'symbol',
           'securityStatus',
           'symbol.1',
           'bidTick',
           'exchangeName',
           'peRatio.1'], axis=1, inplace=True)

Next, let's look at the categorical columns.

In [123]:
data.select_dtypes(include='object')

Unnamed: 0,stock,bidId,askId,lastId,exchange
0,CLOV,P,P,P,q
1,CLNE,Q,P,P,q
2,TLRY,P,P,P,q
3,AAPL,P,P,D,q
4,WKHS,P,P,D,q
...,...,...,...,...,...
330,CRSR,V,K,D,q
331,AMD,Q,Q,D,q
332,CLNE,Q,Q,D,q
333,AMZN,V,V,D,q


In [124]:
print(data['bidId'].nunique())
print(data['askId'].nunique())
print(data['lastId'].nunique())
print(data['exchange'].nunique())

11
11
10
1


The exchange column has only 1 unique value, so it cannot help distinguish one row from another and is unlikely to add predictive power.

In [125]:
data.drop(['exchange'], axis=1, inplace=True)

In [126]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 335 entries, 0 to 334
Data columns (total 83 columns):
 #   Column                              Non-Null Count  Dtype         
---  ------                              --------------  -----         
 0   stock                               335 non-null    object        
 1   Bearish                             335 non-null    float64       
 2   Neutral                             335 non-null    float64       
 3   Bullish                             335 non-null    float64       
 4   Total_Compound                      335 non-null    float64       
 5   date                                335 non-null    datetime64[ns]
 6   bidPrice                            335 non-null    float64       
 7   bidSize                             335 non-null    int64         
 8   bidId                               335 non-null    object        
 9   askPrice                            335 non-null    float64       
 10  askSize                   

## Boolean Values

In [127]:
data.select_dtypes(include='boolean')

Unnamed: 0,marginable,shortable,delayed,realtimeEntitled
0,True,True,True,False
1,True,True,True,False
2,True,True,True,False
3,True,True,True,False
4,True,True,True,False
...,...,...,...,...
330,True,True,True,False
331,True,True,True,False
332,True,True,True,False
333,True,True,True,False


In [128]:
print(data['marginable'].nunique())
print(data['shortable'].nunique())
print(data['delayed'].nunique())
print(data['realtimeEntitled'].nunique())

1
1
1
1


Each of these columns holds a single value across all rows, so none of them provides useful information.

In [129]:
data.drop(['marginable', 'shortable', 'delayed', 'realtimeEntitled'], axis=1, inplace=True)

In [130]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 335 entries, 0 to 334
Data columns (total 79 columns):
 #   Column                              Non-Null Count  Dtype         
---  ------                              --------------  -----         
 0   stock                               335 non-null    object        
 1   Bearish                             335 non-null    float64       
 2   Neutral                             335 non-null    float64       
 3   Bullish                             335 non-null    float64       
 4   Total_Compound                      335 non-null    float64       
 5   date                                335 non-null    datetime64[ns]
 6   bidPrice                            335 non-null    float64       
 7   bidSize                             335 non-null    int64         
 8   bidId                               335 non-null    object        
 9   askPrice                            335 non-null    float64       
 10  askSize                   

In [131]:
#data = data.transpose(copy=True).drop_duplicates().transpose(copy=True)
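
The commented-out line above would remove duplicate columns by transposing the frame twice. As a sketch of an alternative, the example below flags duplicated columns and selects the rest from the original frame. The `frame` data is made up for illustration; `peRatio.1` stands in for a column that exactly copies another.

```python
import pandas as pd

# Made-up values; 'peRatio.1' is an exact copy of 'peRatio'
frame = pd.DataFrame({
    'peRatio': [12.5, 30.1, 8.2],
    'peRatio.1': [12.5, 30.1, 8.2],
    'beta': [1.2, 0.0, 2.6],
})

# After transposing, each original column is a row, so duplicated() marks repeated columns
is_duplicate = frame.T.duplicated()

# Select the non-duplicated columns from the original frame
deduplicated = frame.loc[:, ~is_duplicate.to_numpy()]
print(list(deduplicated.columns))
```

Selecting from the original frame, rather than transposing back, keeps each column's original dtype when the frame mixes numeric and text columns.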

## Null Values

In [132]:
data.isna().sum().sum()

0

No values are missing, so we're good to go.

## Columns with minimal unique values

Columns that contain a single value are unlikely to provide any predictive power.

In [133]:
# Collect the columns that hold at most one unique value, then drop them in one call
constant_columns = [column for column in data.columns if data[column].nunique() <= 1]
data.drop(columns=constant_columns, inplace=True)

# Bring in price data with TDAmeritrade API

In [134]:
# Date range of our dataset
print(data['date'].min().date())
print(data['date'].max().date())
print(data['date'].max().date() - data['date'].min().date())

2021-06-03
2021-07-15
42 days, 0:00:00


The dataset spans 42 days, so our API call should request about 2 months of daily price history to cover the full range.

In [137]:
api_key = "***REMOVED***"
price_data = pd.DataFrame()

for stock in data['stock'].unique():
    # Request 2 months of daily candles for this ticker
    url = f'https://api.tdameritrade.com/v1/marketdata/{stock}/pricehistory?apikey={api_key}&periodType=month&period=2&frequencyType=daily&frequency=1'
    raw_data = requests.get(url).json()
    raw_data = pd.json_normalize(raw_data, record_path=['candles'])
    raw_data.rename(columns={'datetime': 'date'}, inplace=True)
    # Convert epoch milliseconds to calendar dates
    raw_data['date'] = pd.to_datetime(raw_data['date'], unit='ms').dt.date
    raw_data['stock'] = stock
    price_data = pd.concat([price_data, raw_data], ignore_index=True)

# Keep only the columns needed to compute price direction
price_data = price_data[['date', 'stock', 'close']]
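
The parsing step can also be pulled into a small helper so it can be checked without calling the API. This is a sketch only: `candles_to_frame` and `sample_payload` are hypothetical names, and the sample assumes only what the cell above relies on, namely a `candles` list whose records carry a `datetime` in epoch milliseconds and a `close` price. The prices in the sample are made up.

```python
import pandas as pd

def candles_to_frame(payload, stock):
    """Turn one price-history response (a dict with a 'candles' list) into date/stock/close rows."""
    candles = payload.get('candles', [])
    if not candles:
        # Empty or unexpected response: return an empty frame with the expected columns
        return pd.DataFrame(columns=['date', 'stock', 'close'])
    frame = pd.json_normalize(candles)
    # Convert epoch milliseconds to calendar dates
    frame['date'] = pd.to_datetime(frame['datetime'], unit='ms').dt.date
    frame['stock'] = stock
    return frame[['date', 'stock', 'close']]

# Hypothetical payload: timestamps are epoch milliseconds, prices are made up
sample_payload = {
    'candles': [
        {'datetime': 1622696400000, 'close': 10.0},
        {'datetime': 1622782800000, 'close': 10.5},
    ]
}
sample_prices = candles_to_frame(sample_payload, 'XYZ')
print(sample_prices)
```

Returning an empty frame for a response without candles lets the download loop continue when one ticker has no data, instead of raising inside `pd.json_normalize`.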

In [138]:
# Filter out dates to match those of the 'data' dataframe
filter_ = (price_data['date'] >= data['date'].min().date()) & (price_data['date'] <= data['date'].max().date())
price_data = price_data[filter_]

In [139]:
print(price_data.shape)
print(data.shape)

(1015, 3)
(335, 72)


In [143]:
price_data

Unnamed: 0,date,stock,close
13,2021-06-03,CLOV,8.94
14,2021-06-04,CLOV,9.00
15,2021-06-07,CLOV,11.92
16,2021-06-08,CLOV,22.15
17,2021-06-09,CLOV,16.92
...,...,...,...
1465,2021-07-08,ASTS,12.46
1466,2021-07-09,ASTS,12.76
1467,2021-07-12,ASTS,13.30
1468,2021-07-13,ASTS,12.47


# Put All of This in a DataFrame

In [151]:
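# Print each sentiment row next to the closing price for the same stock and date
for ind in price_data.index:
    for indx in data.index:
        # price_data dates are datetime.date objects, so compare against the date part of the Timestamp
        if price_data['date'][ind] == data['date'][indx].date() and price_data['stock'][ind] == data['stock'][indx]:
            print(price_data['date'][ind], price_data['stock'][ind], price_data['close'][ind], data.loc[indx])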

2021-06-03 CLOV 8.94 stock                       CLOV
Bearish                    0.036
Neutral                    0.749
Bullish                    0.215
Total_Compound             0.328
                        ...     
dividendPayAmount              0
beta                           0
vol1DayAvg           1.34687e+07
vol10DayAvg          1.34687e+07
vol3MonthAvg          4.7711e+08
Name: 0, Length: 72, dtype: object
2021-06-04 CLOV 9.0 stock                       CLOV
Bearish                     0.04
Neutral                     0.78
Bullish                     0.18
Total_Compound             0.231
                        ...     
dividendPayAmount              0
beta                           0
vol1DayAvg           1.74203e+07
vol10DayAvg          1.74203e+07
vol3MonthAvg         4.83602e+08
Name: 6, Length: 72, dtype: object
2021-06-07 CLOV 11.92 stock                       CLOV
Bearish                    0.027
Neutral                    0.815
Bullish                    0.158
Total_Com
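
The nested loop above compares every price row with every sentiment row and only prints the matches. One way to get the matched rows into a single DataFrame is a merge on `stock` and `date`. The sketch below uses small stand-in frames built from values shown earlier in this notebook; `sentiment`, `prices`, and `combined` are hypothetical names, and on the real data the same call would join `data` and `price_data`.

```python
import datetime
import pandas as pd

# Stand-in for `data`: sentiment rows with a datetime64 date column
sentiment = pd.DataFrame({
    'stock': ['CLOV', 'CLOV'],
    'date': pd.to_datetime(['2021-06-03', '2021-06-04']),
    'Total_Compound': [0.328, 0.231],
})

# Stand-in for `price_data`: dates stored as datetime.date objects
prices = pd.DataFrame({
    'date': [datetime.date(2021, 6, 3), datetime.date(2021, 6, 4), datetime.date(2021, 6, 7)],
    'stock': ['CLOV'] * 3,
    'close': [8.94, 9.00, 11.92],
})

# Convert the price dates so both key columns share the same dtype
prices['date'] = pd.to_datetime(prices['date'])

# Inner join: keep only (stock, date) pairs present in both frames
combined = sentiment.merge(prices, on=['stock', 'date'], how='inner')
print(combined)
```

An inner join drops price rows that have no sentiment record for that day, which matches what the loop's equality check selects.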

# Prepare for Modeling