# HW3-1 S&P 500 Stock Prediction

## About Data

Data set: S&P 500 stock prices for specific dates; see test.csv and train.csv for the exact dates.

There are 6 columns: Date, Open Price, Close Price, High Price, Low Price, Volume.

The goal is to predict stock movement, that is, whether the price goes up or down.

## Preprocessing

In [2]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import linear_model, metrics, model_selection, neural_network, svm
from sklearn.preprocessing import StandardScaler

In [3]:
train_dataset = "train.csv"
train = pd.read_csv(train_dataset)

# Date is not treated as a price-related feature, so drop it
train = train.drop(["Date"], axis=1)

train.info() # show info about the training dataset: 2262 rows
train.head() # show the first 5 rows

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2262 entries, 0 to 2261
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Open Price   2262 non-null   float64
 1   Close Price  2262 non-null   float64
 2   High Price   2262 non-null   float64
 3   Low Price    2262 non-null   float64
 4   Volume       2262 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 88.5 KB


| | Open Price | Close Price | High Price | Low Price | Volume |
|---|---|---|---|---|---|
| 0 | 902.99 | 931.8 | 934.73 | 899.35 | 4048270080 |
| 1 | 929.17 | 927.45 | 936.63 | 919.53 | 5413910016 |
| 2 | 931.17 | 934.7 | 943.85 | 927.28 | 5392620032 |
| 3 | 927.45 | 906.65 | 927.45 | 902.37 | 4704940032 |
| 4 | 905.73 | 909.73 | 910.0 | 896.81 | 4991549952 |


In [6]:
test_dataset = "test.csv"
test = pd.read_csv(test_dataset)

test = test.drop(["Date"], axis=1)
test.info() # 252 rows for testing
test.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 252 entries, 0 to 251
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Open Price   252 non-null    float64
 1   Close Price  252 non-null    float64
 2   High Price   252 non-null    float64
 3   Low Price    252 non-null    float64
 4   Volume       252 non-null    int64  
dtypes: float64(4), int64(1)
memory usage: 10.0 KB


| | Open Price | Close Price | High Price | Low Price | Volume |
|---|---|---|---|---|---|
| 0 | 2683.73 | 2695.81 | 2695.89 | 2682.36 | 1846463232 |
| 1 | 2697.85 | 2713.06 | 2714.37 | 2697.77 | 2090595328 |
| 2 | 2719.31 | 2723.99 | 2729.29 | 2719.07 | 2100767744 |
| 3 | 2731.33 | 2743.15 | 2743.45 | 2727.92 | 1918869120 |
| 4 | 2742.67 | 2747.71 | 2748.51 | 2737.6 | 1894823936 |


In [7]:
# We only predict whether the price goes up or down, not the exact price.
# Open, High, Low and Volume are the training features; Close is only used to derive the label.
# Note: a day's close is not always equal to the next day's open.
def data_split(dataset):
    o = dataset['Open Price']
    c = dataset['Close Price']

    # Features: every column except Close Price
    x = dataset.drop(columns=['Close Price'])
    # Label: 1 if the close is higher than the open, 0 if lower or equal
    y = (c > o).astype(int).tolist()

    return x, y
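
As a quick check of the labelling rule, the sketch below applies it to the first five training rows printed by `train.head()`. The frame name `sample` is only used for this illustration and is not part of the notebook.

```python
import pandas as pd

# First five training rows, copied from the train.head() output
sample = pd.DataFrame({
    "Open Price": [902.99, 929.17, 931.17, 927.45, 905.73],
    "Close Price": [931.80, 927.45, 934.70, 906.65, 909.73],
})

# 1 when the close is above the open, 0 when it is lower or equal
labels = (sample["Close Price"] > sample["Open Price"]).astype(int).tolist()
print(labels)  # [1, 0, 1, 0, 1]
```

Three of the five days count as "up" under this rule.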

In [8]:
# evaluation function for training results
def evaluate(model, x_train, y_train, x_test, y_test):
    model.fit(x_train, y_train)

    train_predict = model.predict(x_train)
    test_predict = model.predict(x_test)

    train_acc = metrics.accuracy_score(y_train, train_predict)
    test_acc = metrics.accuracy_score(y_test, test_predict)

    return train_acc, test_acc

## Logistic Regression

1. Volume is several orders of magnitude larger than the prices, and the results are clearly better after dropping it.

Q: What about dividing Volume by a constant?

A: It works slightly better, which suggests that features on very different scales affect logistic regression a lot.

2. What about linear regression?

A: Not good. Linear regression fits a line to predict a continuous value, while the target here is a binary up/down label, so it is not a suitable model for predicting stock movement.
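
The scale effect from point 1 can also be handled without dropping or manually dividing Volume, by standardizing every feature inside a pipeline. The sketch below is a minimal illustration on synthetic data, not a result from this notebook; the two made-up columns, the seed and the variable names are assumptions for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-ins: a small price-like move and a volume-like column around 4e9
move = rng.normal(0.0, 10.0, n)      # informative feature
volume = rng.normal(4e9, 1e9, n)     # uninformative feature on a huge scale
X = np.column_stack([move, volume])
y = (move > 0).astype(int)           # toy label, not real market data

# StandardScaler brings both columns to a comparable scale before the fit
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression())
scaled_lr.fit(X, y)
scaled_acc = scaled_lr.score(X, y)
print(scaled_acc)
```

Keeping the scaler inside the pipeline means the same training statistics are reused when the model later scores test data.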


In [7]:
x_train, y_train = data_split(train)
x_test, y_test = data_split(test)
clf = linear_model.LogisticRegression(multi_class="auto", solver="lbfgs", max_iter=100, penalty='l2')
train_acc, test_acc = evaluate(clf, x_train, y_train, x_test, y_test)

print('----Without Any Processing----')
print('Training accuracy: ' + str(train_acc))
print('Testing accuracy: ' + str(test_acc))

x_train_nv = x_train.drop(['Volume'], axis=1)
x_test_nv = x_test.drop(['Volume'], axis=1)
train_acc, test_acc = evaluate(clf, x_train_nv, y_train, x_test_nv, y_test)

print('----After Dropping Volume----')
print('Training accuracy: ' + str(train_acc))
print('Testing accuracy: ' + str(test_acc))

# Rescale Volume to millions on copies, leaving x_train and x_test unchanged
x_train_dv = x_train.assign(Volume=x_train['Volume'] / 1000000)
x_test_dv = x_test.assign(Volume=x_test['Volume'] / 1000000)

train_acc, test_acc = evaluate(clf, x_train_dv, y_train, x_test_dv, y_test)
print('----After Dividing Volume----')
print('Training accuracy: ' + str(train_acc))
print('Testing accuracy: ' + str(test_acc))

clf2 = linear_model.SGDClassifier(loss="perceptron", eta0=1, learning_rate="constant", penalty=None)
train_acc, test_acc = evaluate(clf2, x_train_nv, y_train, x_test_nv, y_test)
print('----Implement SGDClassifier----')
print('Training accuracy: ' + str(train_acc))
print('Testing accuracy: ' + str(test_acc))

----Without Any Processing----
Training accuracy: 0.5455349248452697
Testing accuracy: 0.503968253968254
----After Dropping Volume----
Training accuracy: 0.847922192749779
Testing accuracy: 0.8214285714285714
----After Dividing Volume----
Training accuracy: 0.8488063660477454
Testing accuracy: 0.8293650793650794
----Implement SGDClassifier----
Training accuracy: 0.4557913351016799
Testing accuracy: 0.49603174603174605
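
Accuracies close to 0.5 in the output above are easier to judge against a majority-class baseline, since a model that predicts one class for every day can score in that range. A minimal sketch with scikit-learn's `DummyClassifier` follows; the short label lists are made-up placeholders, and in the notebook the `y_train` and `y_test` lists returned by `data_split` would take their place.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Placeholder labels for illustration only (1 = up, 0 = down or flat)
y_train_demo = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
y_test_demo = [1, 0, 1, 1, 0]

# The baseline ignores the features, so a dummy feature matrix is enough
X_train_demo = np.zeros((len(y_train_demo), 1))
X_test_demo = np.zeros((len(y_test_demo), 1))

# Always predict the most frequent training label
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train_demo, y_train_demo)

baseline_acc = accuracy_score(y_test_demo, baseline.predict(X_test_demo))
print(baseline_acc)  # 0.6
```

A model is only useful here if it clearly beats this number on the test set.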


## Neural Network

1. The magnitude of Volume does not seem to matter much for accuracy, possibly because the network learns its own features from the inputs.

2. The result without Volume looks slightly better, but it is still bad.

3. The features fed into the network should be normalized; otherwise their scale affects the training result.

However, normalizing the features without Volume somehow gives a worse result:

```
----After Normalized----
Training accuracy: 0.6710875331564987
Testing accuracy: 0.623015873015873
```

Dropping Volume may leave too few features to build a good classifier.

4. Enlarging the hidden layers does not give a better result. However, very small hidden layers ((2,0) for example) can be useless and give terrible accuracy.

In [17]:
x_train, y_train = data_split(train)
x_test, y_test = data_split(test)
x_train_nv = x_train.drop(['Volume'], axis=1)
x_test_nv = x_test.drop(['Volume'], axis=1)
clf = neural_network.MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(5,), random_state=1)

train_acc, test_acc = evaluate(clf, x_train, y_train, x_test, y_test)

print('----Without Any Processing----')
print('Training accuracy: ' + str(train_acc))
print('Testing accuracy: ' + str(test_acc))

train_acc, test_acc = evaluate(clf, x_train_nv, y_train, x_test_nv, y_test)

print('----After Dropping Volume----')
print('Training accuracy: ' + str(train_acc))
print('Testing accuracy: ' + str(test_acc))

# A multi-layer perceptron is sensitive to feature scaling, so normalizing should give a better result
scaler = StandardScaler()
scaler.fit(x_train)
x_train_nm = scaler.transform(x_train)
x_test_nm = scaler.transform(x_test)

train_acc, test_acc = evaluate(clf, x_train_nm, y_train, x_test_nm, y_test)

print('----After Normalized----')
print('Training accuracy: ' + str(train_acc))
print('Testing accuracy: ' + str(test_acc))

----Without Any Processing----
Training accuracy: 0.45446507515473034
Testing accuracy: 0.49603174603174605
----After Dropping Volume----
Training accuracy: 0.5455349248452697
Testing accuracy: 0.503968253968254
----After Normalized----
Training accuracy: 0.8483642793987621
Testing accuracy: 0.8293650793650794


### Trying another preprocessing approach to see if it improves the result

Compute the difference between the open price and the high/low prices.

These differences may be better features for training, and the training accuracy does turn out slightly better.
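
To make the idea concrete, the sketch below computes both differences for the first three training rows printed by `train.head()`. The frame names `rows` and `features` are illustrative only.

```python
import pandas as pd

# First three training rows, copied from the train.head() output
rows = pd.DataFrame({
    "Open Price": [902.99, 929.17, 931.17],
    "High Price": [934.73, 936.63, 943.85],
    "Low Price": [899.35, 919.53, 927.28],
})

# Column-wise subtraction computes each feature for every row at once
features = pd.DataFrame({
    "difference_low": rows["Open Price"] - rows["Low Price"],
    "difference_high": rows["Open Price"] - rows["High Price"],
})
print(features.round(2))
```

In these three rows, the two up days (rows 0 and 2) open much closer to the low than to the high, while the down day (row 1) opens closer to the high. That hints at why these two numbers may separate the classes, although three rows prove nothing on their own.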

### Conclusion:

How the data is represented matters a lot for an ML or NN model: the same data presented in a different way changes the training result.

That is, preprocessing is important.

In [9]:
# Two features per day: open minus low, and open minus high
x_train_df = pd.DataFrame({
    "difference_low": x_train['Open Price'] - x_train['Low Price'],
    "difference_high": x_train['Open Price'] - x_train['High Price'],
})

x_test_df = pd.DataFrame({
    "difference_low": x_test['Open Price'] - x_test['Low Price'],
    "difference_high": x_test['Open Price'] - x_test['High Price'],
})

clf = neural_network.MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(5,), random_state=1)

train_acc, test_acc = evaluate(clf, x_train_df, y_train, x_test_df, y_test)

print('----Using Another Logic----')
print('Training accuracy: ' + str(train_acc))
print('Testing accuracy: ' + str(test_acc))

----Using Another Logic----
Training accuracy: 0.852343059239611
Testing accuracy: 0.8293650793650794


## Other implementation: SVC

SVC finds an optimal hyperplane that separates the classes.

It might suit this case because the goal is simple: to determine whether the price goes up or down.

The features are also simple to interpret.

Results: the training accuracy looks good at first sight, but the testing accuracy shows that the model is severely overfitted.

The RBF kernel is fast and easy to train, but it overfits in this case.

Switching to a linear kernel below takes much longer to train, but the result is much more reasonable and comparable to the other methods.

The result is similar to that of logistic regression.

Still, a linear classifier may reflect the actual situation of stock prices as well: no definite pattern can be found from just the open and high/low prices.
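
One way to check whether such overfitting comes from the kernel or from the feature scale is to fit both kernels on standardized inputs and compare the gap between training and testing accuracy. The sketch below does this on synthetic data, so its numbers say nothing about the S&P 500 results; the toy features, the labelling rule and all variable names are assumptions for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 400

# Synthetic stand-ins for the distances from the open to the low and to the high
dist_low = np.abs(rng.normal(0.0, 10.0, n))
dist_high = np.abs(rng.normal(0.0, 10.0, n))
X = np.column_stack([dist_low, dist_high])
y = (dist_high > dist_low).astype(int)  # toy rule, not the real market

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

scores = {}
for kernel in ("rbf", "linear"):
    # The scaler is fitted on the training split only, inside the pipeline
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel, gamma="auto"))
    model.fit(X_tr, y_tr)
    scores[kernel] = (model.score(X_tr, y_tr), model.score(X_te, y_te))

print(scores)  # {kernel: (training accuracy, testing accuracy)}
```

A large gap between the two numbers for one kernel points to overfitting; similar numbers for both kernels suggest the scale of the inputs mattered more than the kernel choice.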

In [10]:
x_train, y_train = data_split(train)
x_test, y_test = data_split(test)

# Rescale Volume to millions on copies, leaving x_train and x_test unchanged
x_train_dv = x_train.assign(Volume=x_train['Volume'] / 1000000)
x_test_dv = x_test.assign(Volume=x_test['Volume'] / 1000000)

scaler = StandardScaler()
scaler.fit(x_train_dv)
x_train_nm = scaler.transform(x_train_dv)
x_test_nm = scaler.transform(x_test_dv)

clf = svm.SVC(gamma='auto')

# Note: the model is evaluated on x_train_nv / x_test_nv (prices without Volume, from the earlier cell);
# the scaled arrays above are not used here
train_acc, test_acc = evaluate(clf, x_train_nv, y_train, x_test_nv, y_test)

print('----Without Any Processing----')
print('Training accuracy: ' + str(train_acc))
print('Testing accuracy: ' + str(test_acc))

----Without Any Processing----
Training accuracy: 0.9973474801061007
Testing accuracy: 0.503968253968254


In [11]:
clf = svm.SVC(kernel='linear', probability=True, gamma='auto') # a poly kernel might take too long to fit

train_acc, test_acc = evaluate(clf, x_train_nv, y_train, x_test_nv, y_test)

print('----Change Another Kernel----')
print('Training accuracy: ' + str(train_acc))
print('Testing accuracy: ' + str(test_acc))

----Change Another Kernel----
Training accuracy: 0.8465959328028294
Testing accuracy: 0.8293650793650794


## Conclusion

1. How we treat the data is important: the same model given a different representation of the data leads to different results.

2. Using only the open/high/low prices and volume is enough to predict reasonably well whether the close price will be higher or lower than the open.

All three methods reach over 80% accuracy with a simple implementation, but going beyond 90% is a long way off.

3. However, actual stock price changes depend heavily on real-life events, and their effect is rapid.

The S&P 500 is a capitalization-weighted index of 500 major companies in America and reflects the overall economic situation.

There are four major drops in the history of the S&P 500:

- The dot-com bubble in 2000

- The financial crisis of 2007–08

- China–United States trade war started in 2018

- COVID-19 epidemic in 2020

Using a crawler to collect breaking news and analyzing whether its content is positive or negative for the economy might achieve a better prediction, with more than 83% accuracy.

## References

- https://en.wikipedia.org/wiki/S%26P_500_Index

- stock price chart in google search

- https://stats.stackexchange.com/questions/43538/what-is-the-difference-between-logistic-regression-and-neural-networks

- https://medium.com/jameslearningnote/%E8%B3%87%E6%96%99%E5%88%86%E6%9E%90-%E6%A9%9F%E5%99%A8%E5%AD%B8%E7%BF%92-%E7%AC%AC3-4%E8%AC%9B-%E6%94%AF%E6%8F%B4%E5%90%91%E9%87%8F%E6%A9%9F-support-vector-machine-%E4%BB%8B%E7%B4%B9-9c6c6925856b

- https://scikit-learn.org/stable/modules/neural_networks_supervised.html