# Logistic Regression

In [1]:
# Data Manipulation
import numpy as np
import pandas as pd
from datetime import datetime

# Plotting graphs
import matplotlib.pyplot as plt

# Machine learning
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import MinMaxScaler

# Technical-analysis features: pip install --upgrade ta
# https://github.com/bukosabino/ta
# https://medium.datadriveninvestor.com/predicting-the-stock-market-with-python-bba3cf4c56ef
from ta import add_all_ta_features

# Date-part features: pip install fastai
# https://docs.fast.ai/tabular.core.html
# https://www.analyticsvidhya.com/blog/2018/10/predicting-stock-price-machine-learningnd-deep-learning-techniques-python/
from fastai.tabular.all import add_datepart

### (Part1) Baseline Model without technical features

In this part, we run a logistic regression with three features (`High`, `Low`, `Open`) scraped from Yahoo Finance. We use Apple (AAPL) stock price data from 2022-01-01 to 2022-04-28 (about four months).

In [2]:
df = pd.read_csv('data/AAPL_data.csv')

In [3]:
df = df.dropna()  # assign the result; dropna() does not modify df in place
df = df[df['Date'] >= '2022-01-01']
df

Unnamed: 0,Date,High,Low,Open,Close,Volume,Adj Close
1179,2022-01-03,182.880005,177.710007,177.830002,182.009995,104487900.0,181.778397
1180,2022-01-04,182.940002,179.119995,182.630005,179.699997,99310400.0,179.471344
1181,2022-01-05,180.169998,174.639999,179.610001,174.919998,94537600.0,174.697418
1182,2022-01-06,175.300003,171.639999,172.699997,172.000000,96904000.0,171.781143
1183,2022-01-07,174.139999,171.029999,172.889999,172.169998,86709100.0,171.950928
...,...,...,...,...,...,...,...
1255,2022-04-22,167.869995,161.500000,166.460007,161.789993,84775200.0,161.789993
1256,2022-04-25,163.169998,158.460007,161.119995,162.880005,96046400.0,162.880005
1257,2022-04-26,162.339996,156.720001,162.250000,156.800003,95623200.0,156.800003
1258,2022-04-27,159.789993,155.380005,155.910004,156.570007,88063200.0,156.570007


**Add classification variable**  
- y = 1 if the next day's closing price is higher than today's closing price  
- y = -1 otherwise, i.e. the next day's closing price is lower than or equal to today's (the last row has no next day, so the comparison is false and it also gets -1)

In [4]:
X = df[['High', 'Low', 'Open']]
y = np.where(df['Close'].shift(-1) > df['Close'], 1, -1)
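
To make the labelling rule concrete, here is a minimal sketch on a made-up four-day price series (the values are illustrative, not taken from the AAPL data). It shows that `shift(-1)` leaves the last row without a next-day close, so the comparison is false and the row is labelled -1 by default. Dropping that row before building `y`, as in the second half, is one possible way to avoid the spurious label; it is a suggestion, not what the notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical closing prices for four consecutive trading days
close = pd.Series([100.0, 101.0, 99.0, 99.0])

# Same rule as the notebook: compare tomorrow's close with today's
next_close = close.shift(-1)             # last element becomes NaN
y = np.where(next_close > close, 1, -1)  # NaN > x is False, so the last label is -1

# Possible alternative: drop the row that has no next-day close
labeled = pd.DataFrame({"Close": close, "next_close": next_close}).dropna()
y_clean = np.where(labeled["next_close"] > labeled["Close"], 1, -1)

print(y.tolist())        # labels including the last row
print(y_clean.tolist())  # labels without the last row
```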

**Split data to train and test set**

In [5]:
split = int(0.8 * len(df))
X_train, X_test, y_train, y_test = X[:split], X[split:], y[:split], y[split:]

In [6]:
print(len(X_train), len(X_test))

64 17
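
The split above is chronological: the first 80% of rows train the model and the last 20% test it. With only 17 test rows, a single split gives a noisy estimate. As a hedged sketch of one alternative, scikit-learn's `TimeSeriesSplit` produces several walk-forward folds in which the training rows always precede the test rows. The fold count of 4 is an arbitrary choice for illustration, and the placeholder array stands in for the real feature matrix.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_rows = 81                    # 64 training + 17 test rows, as printed above
split = int(0.8 * n_rows)      # reproduces the notebook's single split point

X_placeholder = np.zeros((n_rows, 1))   # stand-in for the real features

# Walk-forward folds: each test block comes strictly after its training block
tscv = TimeSeriesSplit(n_splits=4)      # n_splits=4 is an illustrative choice
folds = list(tscv.split(X_placeholder))

for train_idx, test_idx in folds:
    print(f"train rows 0..{train_idx.max()}, test rows {test_idx.min()}..{test_idx.max()}")
```

The resulting folds could be passed as `cv=tscv` to `cross_val_score`, which is already imported at the top of the notebook.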


**Logistic Regression**  
- train the model with X_train and y_train

In [7]:
model = LogisticRegression(max_iter=10_000)  # max_iter must be a finite integer; recent scikit-learn releases reject float('inf')
model = model.fit(X_train, y_train)

In [8]:
pd.DataFrame(zip(X.columns, np.transpose(model.coef_)))

Unnamed: 0,0,1
0,High,[0.28853731241410807]
1,Low,[0.12436727913018701]
2,Open,[-0.4761359649808954]


In [9]:
y_predict = model.predict(X_test)

**Evaluate the model**

In [10]:
metrics.confusion_matrix(y_test, y_predict)

array([[7, 4],
       [4, 2]])

In [11]:
print(metrics.classification_report(y_test, y_predict))

              precision    recall  f1-score   support

          -1       0.64      0.64      0.64        11
           1       0.33      0.33      0.33         6

    accuracy                           0.53        17
   macro avg       0.48      0.48      0.48        17
weighted avg       0.53      0.53      0.53        17



### Conclusion: Accuracy of this model is 0.53.

In [12]:
model.score(X_test, y_test)

0.5294117647058824
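
To put the 0.53 in context, the short sketch below recomputes the accuracy from the confusion matrix printed above and compares it with a majority-class baseline, i.e. always predicting the more frequent test label. The baseline is not part of the original analysis; it is added here only as a reference point.

```python
import numpy as np

# Confusion matrix from the output above (rows: true -1, 1; columns: predicted -1, 1)
cm = np.array([[7, 4],
               [4, 2]])

total = cm.sum()
accuracy = np.trace(cm) / total                  # correct predictions on the diagonal
majority_baseline = cm.sum(axis=1).max() / total # always predict the most frequent true class

print(f"model accuracy:    {accuracy:.2f}")
print(f"majority baseline: {majority_baseline:.2f}")
```

On this test window, 11 of the 17 days are labelled -1, so always predicting -1 would score higher than the fitted model.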

### (Part2) Baseline Model with selected technical features

In this part, we run a logistic regression with selected technical features. We continue to use Apple stock price data from 2022-01-01 to 2022-04-28 (about four months), but earlier data are also used to calculate the technical features.
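
The reason earlier data are needed is that rolling indicators are undefined until a full window of history is available. The sketch below illustrates this with a simple moving average on made-up prices. The window of 3 is a hypothetical value chosen for readability, not the setting used by the `ta` library.

```python
import pandas as pd

# Hypothetical closing prices
close = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0])

window = 3  # illustrative window length
sma = close.rolling(window=window).mean()

# The first (window - 1) values are NaN because the window is not yet full
print(sma.tolist())
```

Computing the indicators on the full history first and filtering to 2022 afterwards keeps the first rows of the study period from starting with missing values.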

In [13]:
df = pd.read_csv('data/AAPL_data.csv')
df = df.dropna()  # assign the result; dropna() does not modify df in place
df

Unnamed: 0,Date,High,Low,Open,Close,Volume,Adj Close
0,2017-04-28,36.075001,35.817501,36.022499,35.912498,83441600.0,33.907143
1,2017-05-01,36.799999,36.240002,36.275002,36.645000,134411600.0,34.598736
2,2017-05-02,37.022499,36.709999,36.884998,36.877499,181408800.0,34.818253
3,2017-05-03,36.872501,36.067501,36.397499,36.764999,182788000.0,34.712040
4,2017-05-04,36.785000,36.452499,36.630001,36.632500,93487600.0,34.586937
...,...,...,...,...,...,...,...
1255,2022-04-22,167.869995,161.500000,166.460007,161.789993,84775200.0,161.789993
1256,2022-04-25,163.169998,158.460007,161.119995,162.880005,96046400.0,162.880005
1257,2022-04-26,162.339996,156.720001,162.250000,156.800003,95623200.0,156.800003
1258,2022-04-27,159.789993,155.380005,155.910004,156.570007,88063200.0,156.570007


**Add date features**

Produced features: `Year`, `Month`, `Week`, `Day`, `Dayofweek`, `Dayofyear`, `Is_month_end`, `Is_month_start`, `Is_quarter_end`, `Is_quarter_start`, `Is_year_end`, `Is_year_start`

In [14]:

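df["Date"] = pd.to_datetime(df.Date, format="%Y-%m-%d")
df.index = df['Date']
df = df.sort_index(ascending=True, axis=0)  # keep the sorted frame (it was previously assigned to an unused variable)
add_datepart(df, 'Date', drop=False)        # adds the date-part columns in place
df.drop('Elapsed', axis=1, inplace=True)
df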

Date (index),Date,High,Low,Open,Close,Volume,Adj Close,Year,Month,Week,Day,Dayofweek,Dayofyear,Is_month_end,Is_month_start,Is_quarter_end,Is_quarter_start,Is_year_end,Is_year_start
2017-04-28,2017-04-28,36.075001,35.817501,36.022499,35.912498,83441600.0,33.907143,2017,4,17,28,4,118,False,False,False,False,False,False
2017-05-01,2017-05-01,36.799999,36.240002,36.275002,36.645000,134411600.0,34.598736,2017,5,18,1,0,121,False,True,False,False,False,False
2017-05-02,2017-05-02,37.022499,36.709999,36.884998,36.877499,181408800.0,34.818253,2017,5,18,2,1,122,False,False,False,False,False,False
2017-05-03,2017-05-03,36.872501,36.067501,36.397499,36.764999,182788000.0,34.712040,2017,5,18,3,2,123,False,False,False,False,False,False
2017-05-04,2017-05-04,36.785000,36.452499,36.630001,36.632500,93487600.0,34.586937,2017,5,18,4,3,124,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2022-04-22,2022-04-22,167.869995,161.500000,166.460007,161.789993,84775200.0,161.789993,2022,4,16,22,4,112,False,False,False,False,False,False
2022-04-25,2022-04-25,163.169998,158.460007,161.119995,162.880005,96046400.0,162.880005,2022,4,17,25,0,115,False,False,False,False,False,False
2022-04-26,2022-04-26,162.339996,156.720001,162.250000,156.800003,95623200.0,156.800003,2022,4,17,26,1,116,False,False,False,False,False,False
2022-04-27,2022-04-27,159.789993,155.380005,155.910004,156.570007,88063200.0,156.570007,2022,4,17,27,2,117,False,False,False,False,False,False
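
For readers without fastai installed, several of the date features listed above can be reproduced with plain pandas. This is a hedged sketch of an equivalent, not the internals of `add_datepart`; it covers only a subset of the columns and uses the first two dates from the table as input.

```python
import pandas as pd

# First two trading days from the table above
dates = pd.to_datetime(pd.Series(["2017-04-28", "2017-05-01"]))

features = pd.DataFrame({
    "Year": dates.dt.year,
    "Month": dates.dt.month,
    "Week": dates.dt.isocalendar().week.astype(int),  # ISO week number
    "Day": dates.dt.day,
    "Dayofweek": dates.dt.dayofweek,                  # Monday = 0
    "Dayofyear": dates.dt.dayofyear,
    "Is_month_end": dates.dt.is_month_end,
    "Is_month_start": dates.dt.is_month_start,
})

print(features)
```

The values agree with the corresponding columns in the output above for these two rows.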


**Add technical features**

Produced features: all indicators provided by the `ta` library (https://github.com/bukosabino/ta)

In [15]:
df = add_all_ta_features(
    df, high="High", low="Low", open="Open", close="Close", volume="Volume")



**Limit number of technical features based on previous literature**

In [16]:
df = df[df['Date'] >= '2022-01-01']
selected_features = ['trend_sma_fast', 'trend_ema_fast', 'momentum_stoch_rsi_k', 'momentum_stoch_rsi_d', 'momentum_rsi',
                     'trend_macd', 'momentum_wr', 'volume_adi', 'momentum_roc', 'volume_obv',
                     'volatility_bbh', 'volatility_bbl']
basic_features = ['High', 'Low', 'Open', 'Volume', 'Year', 'Month', 'Week', 'Day', 'Dayofweek']

X = df[selected_features + basic_features]
y = np.where(df['Close'].shift(-1) > df['Close'], 1, -1)

**Split data to train and test set**

In [17]:
split = int(0.8 * len(df))
X_train, X_test, y_train, y_test = X[:split], X[split:], y[:split], y[split:]

**Logistic Regression**  

In [18]:
model = LogisticRegression(max_iter=10_000)  # max_iter must be a finite integer; recent scikit-learn releases reject float('inf')
model = model.fit(X_train, y_train)

In [19]:
pd.DataFrame(zip(X.columns, np.transpose(model.coef_)))

Unnamed: 0,0,1
0,trend_sma_fast,[-1.9567440375481505e-15]
1,trend_ema_fast,[-2.1192741882152557e-15]
2,momentum_stoch_rsi_k,[1.2477339851180678e-17]
3,momentum_stoch_rsi_d,[-5.172434148890053e-18]
4,momentum_rsi,[2.1099867789274797e-15]
5,trend_macd,[-2.760733907149884e-16]
6,momentum_wr,[4.04066712449573e-15]
7,volume_adi,[5.486440335450997e-10]
8,momentum_roc,[1.1332935622194192e-15]
9,volume_obv,[-1.2605746784832788e-09]


**Evaluate the model**

In [20]:
y_predict = model.predict(X_test)
metrics.confusion_matrix(y_test, y_predict)
print(metrics.classification_report(y_test, y_predict))

              precision    recall  f1-score   support

          -1       0.70      0.64      0.67        11
           1       0.43      0.50      0.46         6

    accuracy                           0.59        17
   macro avg       0.56      0.57      0.56        17
weighted avg       0.60      0.59      0.59        17



### Conclusion: Accuracy of this model is 0.59.

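Accuracy improved from 0.53 to 0.59 after adding the selected features. With only 17 test samples, this corresponds to one additional correct prediction (10 instead of 9), so the gain should be read with caution.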

In [21]:
model.score(X_test, y_test)

0.5882352941176471

### (Part3) Baseline Model with full set of features

In this part, we run a logistic regression with the full set of 100 features. We continue to use Apple stock price data from 2022-01-01 to 2022-04-28 (about four months), but earlier data are also used to calculate the technical features.

**Include all technical features except the ones dropped below**

In [22]:
X = df.drop(['Close', 'trend_psar_down', 'trend_psar_up', 'Date', 'Adj Close'], axis=1)
X.isnull().sum().sort_values()

High                       0
trend_visual_ichimoku_a    0
trend_cci                  0
trend_adx_neg              0
trend_adx_pos              0
                          ..
volatility_bbl             0
volatility_bbh             0
volatility_bbm             0
volatility_kcw             0
others_cr                  0
Length: 100, dtype: int64

**Split data to train and test set**

In [23]:
split = int(0.8 * len(df))
X_train, X_test, y_train, y_test = X[:split], X[split:], y[:split], y[split:]

**Logistic Regression**  

In [24]:

model = LogisticRegression(max_iter=10_000)  # max_iter must be a finite integer; recent scikit-learn releases reject float('inf')
model = model.fit(X_train, y_train)
pd.DataFrame(zip(X.columns, np.transpose(model.coef_)))

Unnamed: 0,0,1
0,High,[-1.7245743109029675e-13]
1,Low,[-2.1521384714623148e-13]
2,Open,[-3.2369722637925777e-13]
3,Volume,[1.2948357642623333e-08]
4,Year,[2.279136102545938e-13]
...,...,...
95,momentum_pvo_hist,[9.297594426822735e-16]
96,momentum_kama,[-1.3442427008200508e-13]
97,others_dr,[1.15369013471174e-13]
98,others_dlr,[1.157195338593034e-13]


**Evaluate the model**

In [25]:
y_predict = model.predict(X_test)
metrics.confusion_matrix(y_test, y_predict)
print(metrics.classification_report(y_test, y_predict))

              precision    recall  f1-score   support

          -1       0.58      0.64      0.61        11
           1       0.20      0.17      0.18         6

    accuracy                           0.47        17
   macro avg       0.39      0.40      0.40        17
weighted avg       0.45      0.47      0.46        17



### Conclusion: Accuracy of this model is 0.47.

The accuracy dropped, most likely because of over-fitting: the model now has 100 features but only 64 training samples.

In [26]:
model.score(X_test, y_test)

0.47058823529411764
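
One way to check the over-fitting explanation is to compare training and test accuracy. The sketch below does this on purely synthetic data with the same shape as Part 3 (100 features, 64 training rows, 17 test rows) and labels that are independent of the features. It is an illustration of the effect under those assumptions, not a rerun of the notebook's experiment.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_train, n_test, n_features = 64, 17, 100

# Random features and random labels: there is nothing real to learn
X = rng.normal(size=(n_train + n_test, n_features))
y = rng.choice([-1, 1], size=n_train + n_test)

X_train, X_test = X[:n_train], X[n_train:]
y_train, y_test = y[:n_train], y[n_train:]

model = LogisticRegression(max_iter=10_000).fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

With more features than training rows the model can fit the training labels almost perfectly even though they are noise, which is the pattern to look for by calling `model.score(X_train, y_train)` on the real data.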

### Adding data from StockTwits

We scraped additional data from StockTwits, dating back to 12/31/2021. For each day, we count the number of posts labeled "bearish" and the number labeled "bullish". We then rerun the regression with the selected features plus these StockTwits sentiment features.
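
The daily counting step could look like the following minimal sketch. The input frame and its column names (`created_at`, `label`) are hypothetical, since the raw scraped posts are not shown in this notebook; only the output columns `bullish` and `bearish` match the names used later.

```python
import pandas as pd

# Hypothetical per-post data: one row per StockTwits post
posts = pd.DataFrame({
    "created_at": ["2022-01-03 09:31", "2022-01-03 15:02",
                   "2022-01-03 16:40", "2022-01-04 10:15"],
    "label": ["bullish", "bearish", "bullish", "bullish"],
})

# Truncate timestamps to calendar days
posts["Date"] = pd.to_datetime(posts["created_at"]).dt.normalize()

# Count posts per day and label; days without a given label get 0
daily_counts = (
    posts.groupby(["Date", "label"]).size()
    .unstack(fill_value=0)
    .reindex(columns=["bullish", "bearish"], fill_value=0)
    .reset_index()
)
daily_counts.columns.name = None

print(daily_counts)
```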

In [27]:
df = pd.read_csv('data/AAPL_data.csv')
df = df.dropna()  # assign the result; dropna() does not modify df in place
df

Unnamed: 0,Date,High,Low,Open,Close,Volume,Adj Close
0,2017-04-28,36.075001,35.817501,36.022499,35.912498,83441600.0,33.907143
1,2017-05-01,36.799999,36.240002,36.275002,36.645000,134411600.0,34.598736
2,2017-05-02,37.022499,36.709999,36.884998,36.877499,181408800.0,34.818253
3,2017-05-03,36.872501,36.067501,36.397499,36.764999,182788000.0,34.712040
4,2017-05-04,36.785000,36.452499,36.630001,36.632500,93487600.0,34.586937
...,...,...,...,...,...,...,...
1255,2022-04-22,167.869995,161.500000,166.460007,161.789993,84775200.0,161.789993
1256,2022-04-25,163.169998,158.460007,161.119995,162.880005,96046400.0,162.880005
1257,2022-04-26,162.339996,156.720001,162.250000,156.800003,95623200.0,156.800003
1258,2022-04-27,159.789993,155.380005,155.910004,156.570007,88063200.0,156.570007


In [28]:

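df["Date"] = pd.to_datetime(df.Date, format="%Y-%m-%d")
df.index = df['Date']
df = df.sort_index(ascending=True, axis=0)  # keep the sorted frame (it was previously assigned to an unused variable)
add_datepart(df, 'Date', drop=False)        # adds the date-part columns in place
df.drop('Elapsed', axis=1, inplace=True)
df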

Date (index),Date,High,Low,Open,Close,Volume,Adj Close,Year,Month,Week,Day,Dayofweek,Dayofyear,Is_month_end,Is_month_start,Is_quarter_end,Is_quarter_start,Is_year_end,Is_year_start
2017-04-28,2017-04-28,36.075001,35.817501,36.022499,35.912498,83441600.0,33.907143,2017,4,17,28,4,118,False,False,False,False,False,False
2017-05-01,2017-05-01,36.799999,36.240002,36.275002,36.645000,134411600.0,34.598736,2017,5,18,1,0,121,False,True,False,False,False,False
2017-05-02,2017-05-02,37.022499,36.709999,36.884998,36.877499,181408800.0,34.818253,2017,5,18,2,1,122,False,False,False,False,False,False
2017-05-03,2017-05-03,36.872501,36.067501,36.397499,36.764999,182788000.0,34.712040,2017,5,18,3,2,123,False,False,False,False,False,False
2017-05-04,2017-05-04,36.785000,36.452499,36.630001,36.632500,93487600.0,34.586937,2017,5,18,4,3,124,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2022-04-22,2022-04-22,167.869995,161.500000,166.460007,161.789993,84775200.0,161.789993,2022,4,16,22,4,112,False,False,False,False,False,False
2022-04-25,2022-04-25,163.169998,158.460007,161.119995,162.880005,96046400.0,162.880005,2022,4,17,25,0,115,False,False,False,False,False,False
2022-04-26,2022-04-26,162.339996,156.720001,162.250000,156.800003,95623200.0,156.800003,2022,4,17,26,1,116,False,False,False,False,False,False
2022-04-27,2022-04-27,159.789993,155.380005,155.910004,156.570007,88063200.0,156.570007,2022,4,17,27,2,117,False,False,False,False,False,False


In [29]:
df = add_all_ta_features(
    df, high="High", low="Low", open="Open", close="Close", volume="Volume")



In [30]:
def convert(date_string):
    year, month, day = [int(i) for i in date_string.split('-')]
    return datetime(year=year, month=month, day=day)

In [31]:
df_sentiment = pd.read_csv('data/AAPL_byday_RoBERTa.csv')
df_sentiment.date = df_sentiment.date.apply(convert)
df_sentiment.rename(columns={'date':'Date'}, inplace=True)

df = df[df['Date'] >= '2022-01-01']
df.index=np.array(range(len(df)))
selected_features = ['trend_sma_fast', 'trend_ema_fast', 'momentum_stoch_rsi_k', 'momentum_stoch_rsi_d', 'momentum_rsi',
                     'trend_macd', 'momentum_wr', 'volume_adi', 'momentum_roc', 'volume_obv',
                     'volatility_bbh', 'volatility_bbl']
basic_features = ['High', 'Low', 'Open', 'Volume', 'Year', 'Month', 'Week', 'Day', 'Dayofweek']
sentiment_features = ['bullish', 'bearish']

df = df.merge(df_sentiment, how='inner', on='Date').fillna(0)

X = df[selected_features + basic_features + sentiment_features]
y = np.where(df['Close'].shift(-1) > df['Close'], 1, -1)

split = int(0.8 * len(df))
X_train, X_test, y_train, y_test = X[:split], X[split:], y[:split], y[split:]

model = LogisticRegression(max_iter=10_000)  # max_iter must be a finite integer; recent scikit-learn releases reject float('inf')
model = model.fit(X_train, y_train)

pd.DataFrame(zip(X.columns, np.transpose(model.coef_)))

Unnamed: 0,0,1
0,trend_sma_fast,[-1.956744038632702e-15]
1,trend_ema_fast,[-2.1192741893912765e-15]
2,momentum_stoch_rsi_k,[1.24773398576919e-17]
3,momentum_stoch_rsi_d,[-5.172434152175762e-18]
4,momentum_rsi,[2.1099867800998575e-15]
5,trend_macd,[-2.760733908651547e-16]
6,momentum_wr,[4.040667126723224e-15]
7,volume_adi,[5.486440334712738e-10]
8,momentum_roc,[1.1332935628455332e-15]
9,volume_obv,[-1.2605746785901506e-09]
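
The inner merge in the cell above keeps only trading days that appear in both the price data and the sentiment data, so any day without sentiment rows is silently dropped. As a hedged sketch, a left merge with `indicator=True` makes such days visible before they disappear. The sentiment values below are made up for illustration; the closing prices are the first three from the AAPL table.

```python
import pandas as pd

prices = pd.DataFrame({
    "Date": pd.to_datetime(["2022-01-03", "2022-01-04", "2022-01-05"]),
    "Close": [182.009995, 179.699997, 174.919998],
})

# Hypothetical daily sentiment counts; 2022-01-04 is deliberately missing
sentiment = pd.DataFrame({
    "Date": pd.to_datetime(["2022-01-03", "2022-01-05", "2022-01-08"]),
    "bullish": [5, 3, 1],
    "bearish": [2, 4, 0],
})

# A left merge keeps every trading day and flags the ones without sentiment
checked = prices.merge(sentiment, how="left", on="Date", indicator=True)
missing = checked[checked["_merge"] == "left_only"]

# The notebook's inner merge, for comparison
inner = prices.merge(sentiment, how="inner", on="Date")

print(f"trading days without sentiment: {len(missing)}")
print(f"rows kept by the inner merge:   {len(inner)}")
```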


In [32]:
y_predict = model.predict(X_test)
metrics.confusion_matrix(y_test, y_predict)
print(metrics.classification_report(y_test, y_predict))

              precision    recall  f1-score   support

          -1       0.70      0.64      0.67        11
           1       0.43      0.50      0.46         6

    accuracy                           0.59        17
   macro avg       0.56      0.57      0.56        17
weighted avg       0.60      0.59      0.59        17



In [33]:
model.score(X_test, y_test)

0.5882352941176471

We see that adding the StockTwits sentiment features did not change the model's test accuracy. This is likely because the model is dominated by a small subset of features with comparatively massive coefficients, such as `Volume`, `volume_obv`, and `volume_adi`. These coefficients are orders of magnitude larger than the others (roughly 1e-9 to 1e-10 versus 1e-15 or smaller in the table above), so the values of these features largely determine the model's predictions. Simply standardizing features will not address this problem in the case of logistic regression, but we will clearly have to explore feature normalization techniques.
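
As one possible starting point for the normalization work mentioned above, the sketch below chains `MinMaxScaler` (already imported at the top of this notebook) with `LogisticRegression` in a scikit-learn pipeline, so that the scaler is fitted on the training rows only. The data are synthetic, with one volume-like column and one oscillator-like column on very different scales; this is not the experiment behind the accuracy figures in this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
n_rows = 81

# Synthetic features on very different scales (illustrative only)
X = np.column_stack([
    rng.uniform(5e7, 2e8, n_rows),   # volume-like magnitude
    rng.uniform(0, 100, n_rows),     # oscillator-like magnitude
])
y = rng.choice([-1, 1], size=n_rows)

split = int(0.8 * n_rows)
X_train, X_test, y_train, y_test = X[:split], X[split:], y[:split], y[split:]

# The pipeline fits the scaler on the training rows and reuses it at predict time
pipe = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_train, y_train)

scaled_train = pipe.named_steps["minmaxscaler"].transform(X_train)
y_pred = pipe.predict(X_test)

print(scaled_train.min(axis=0), scaled_train.max(axis=0))
```

Fitting the scaler inside the pipeline avoids leaking test-period minima and maxima into training, which matters for a chronological split like the one used here.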

### Conclusion

Overall, the accuracy of the basic logistic regression model was at most 0.59. The model performed best when we included a moderate number of features. Normalizing the features then improved the accuracy to 0.65. Unfortunately, incorporating the sentiment of pre-labeled posts did not improve the model's predictive power, largely because, after feature normalization, a few features have coefficients which are orders of magnitude larger than those of the other features. We will have to continue to explore feature selection and normalization techniques in order to refine our model. Additionally, to increase the accuracy, we may need to use more sophisticated ML models than logistic regression.

This notebook follows the tutorial below, adapted to our dataset:  
https://blog.quantinsti.com/machine-learning-logistic-regression-python/#:~:text=Logistic%20Regression%20and%20Linear%20regression&text=It%20is%20a%20classification%20problem,as%20a%20generalized%20linear%20model.