<p style="text-align: center;"> <span style="color:firebrick"> <font size="5"> <b> USC Marshall School of Business </b> </font> </span> </p>

<p style="text-align: center;"> <b> <font size="5"> DSO 530 - Final Project </font> </b> </p>

<p style="text-align: center;"> <b> Spring 2021 </b> </p>

## <span style="color:black"> <font size="3">By: Ningchuan Peng</font></span>

For this project I am going to examine European call option pricing data on the S&P 500. A European call option gives the holder the right (but not the obligation) to purchase an asset at a given time for a given price. Valuing such an option is tricky because it depends on the future value of the underlying asset. 

There are two data sets, `option_train.csv` and `option_test_wolabel.csv`. The training data set has information on 1,680 separate options. In particular, for each option we have recorded:
- Value (C): Current option value 
- S: Current asset value 
- K: Strike price of option 
- r: Annual interest rate 
- tau: Time to maturity (in years) 
- BS: The Black-Scholes formula was applied to this data (using some 𝜎) to get C_pred. If an option has C_pred – C > 0, i.e., the prediction overestimated the option value, we label that option (Over); otherwise, we label it (Under). 
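
For reference, the Black-Scholes price of a European call and the resulting Over/Under label can be sketched as follows. The 𝜎 used to build the BS column is not given in the data, so the inputs below are textbook values chosen purely for illustration, and `norm_cdf`, `bs_call` and `bs_label` are hypothetical helper names rather than anything from the project files.

```python
import math

def norm_cdf(x):
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def bs_call(S, K, r, tau, sigma):
    """Black-Scholes price of a European call on a non-dividend-paying asset."""
    d1 = (math.log(S / K) + (r + 0.5 * sigma ** 2) * tau) / (sigma * math.sqrt(tau))
    d2 = d1 - sigma * math.sqrt(tau)
    return S * norm_cdf(d1) - K * math.exp(-r * tau) * norm_cdf(d2)

def bs_label(c_pred, c_obs):
    """Over if the formula overestimates the observed value, otherwise Under."""
    return "Over" if c_pred - c_obs > 0 else "Under"

# illustrative inputs only (sigma is an assumption, not taken from the data)
c_pred = bs_call(S=100.0, K=100.0, r=0.05, tau=1.0, sigma=0.2)
print(round(c_pred, 4))
print(bs_label(c_pred, c_obs=9.0))
```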

The test data set is similar except it has only 1,120 options and is missing the Value and BS variables.

The core idea of the project is to use the training data to build statistical/ML models with 
1. Value as the response (i.e., a regression problem) and then 
2. BS as the response (i.e., a classification problem). 

The other four variables will be used as the predictors. Ultimately, I will select what I consider to be the most accurate approach and use it to make predictions for C (regression) and BS (classification) on the 1,120 options in the test data set.

In [1]:
# import the packages
import itertools
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.metrics import roc_curve, auc, mean_squared_error
from sklearn.model_selection import train_test_split, StratifiedKFold, KFold, cross_val_score, GridSearchCV
from sklearn.linear_model import LogisticRegression, LinearRegression, Ridge, Lasso
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, VotingClassifier, VotingRegressor
from sklearn.svm import SVC, SVR
from sklearn.gaussian_process import GaussianProcessClassifier, GaussianProcessRegressor

# create standardization and normalization method
mms = MinMaxScaler()
stdsc = StandardScaler()
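
To make the difference between the two scalers concrete, here is a plain-Python sketch of what each one computes for a single column. `standardize` and `min_max` are illustrative names, not scikit-learn functions; `StandardScaler` divides by the population standard deviation, which is why `pstdev` is used. The strike prices are the five K values shown in the first rows of the training data.

```python
from statistics import mean, pstdev

def standardize(col):
    """Subtract the mean and divide by the population standard deviation."""
    m, s = mean(col), pstdev(col)
    return [(x - m) / s for x in col]

def min_max(col):
    """Rescale linearly so the smallest value maps to 0 and the largest to 1."""
    lo, hi = min(col), max(col)
    return [(x - lo) / (hi - lo) for x in col]

strikes = [420, 465, 415, 460, 410]
print([round(z, 3) for z in standardize(strikes)])
print([round(z, 3) for z in min_max(strikes)])
```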

In [2]:
# read the data and recode the BS label as 0/1
option = pd.read_csv("option_train.csv")
option['BS'] = option['BS'].map({'Under': 0, 'Over': 1})
# move the first column to the end so that the predictors come first
new_seq = option.columns.tolist()[1:] + option.columns.tolist()[:1]
option = option[new_seq]
option.head()

Unnamed: 0,S,K,tau,r,BS,Value
0,431.623898,420,0.34127,0.03013,0,21.670404
1,427.015526,465,0.166667,0.03126,1,0.125
2,427.762336,415,0.265873,0.03116,0,20.691244
3,451.711658,460,0.063492,0.02972,1,1.035002
4,446.718974,410,0.166667,0.02962,0,39.55302


In [3]:
# cross-validation helper: returns the per-fold scores and their mean
# for 'BS' (classification) the scores are error rates (1 - accuracy);
# for 'Value' (regression) they are R-squared values
def get_error(ml_method, std_method, output_col='BS', input_cols=['S','K','tau','r'], ran_state=0):
    # 10-fold cross validation (stratified for the class label)
    if output_col == 'BS':
        kfolds = StratifiedKFold(n_splits = 10, random_state = ran_state, shuffle = True)
    elif output_col == 'Value':
        kfolds = KFold(n_splits = 10, random_state = ran_state, shuffle = True)
    
    # standardization or normalization (skipped when std_method is None)
    if std_method is not None:
        X_train = std_method.fit_transform(option[input_cols].values)
    else:
        X_train = option[input_cols].values
    
    # cross validation with the default scorer: accuracy for classifiers, R-squared for regressors
    cv_ers = cross_val_score(ml_method, 
                             X_train, 
                             option[output_col], 
                             cv = kfolds)
    # convert accuracy to error rate for the classification problem
    if output_col == 'BS':
        mean_cv_er = 1-np.mean(cv_ers)
        cv_ers = 1-cv_ers
    else:
        mean_cv_er = np.mean(cv_ers)
    return cv_ers, mean_cv_er

# for testing
print('Linear Regression:', get_error(LinearRegression(), stdsc, 'Value', ['S','K','tau','r'])[1])
print('Logistic Regression:', get_error(LogisticRegression(), mms, 'BS', ['S','K','tau','r'])[1])

Linear Regression: 0.9107826304176896
Logistic Regression: 0.08630952380952372
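
Note that `get_error` scales the full training set once, before the folds are split, so each held-out fold contributes to the scaling statistics. The output above cannot tell us whether this changes the scores materially here. A stricter variant fits the scaler on the training part of each fold only. The plain-Python sketch below shows the idea on a single made-up feature; `fold_indices` and `scale_within_fold` are hypothetical helpers, not part of the notebook.

```python
from statistics import mean, pstdev

def fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def scale_within_fold(values, test_idx):
    """Standardize both parts with statistics from the training part only."""
    held_out = set(test_idx)
    train = [v for i, v in enumerate(values) if i not in held_out]
    m, s = mean(train), pstdev(train)
    scaled_train = [(v - m) / s for v in train]
    scaled_test = [(values[i] - m) / s for i in test_idx]
    return scaled_train, scaled_test

# made-up feature values, for illustration only
values = [410.0, 415.0, 420.0, 460.0, 465.0, 430.0, 445.0, 455.0, 425.0, 440.0]
for test_idx in fold_indices(len(values), k=5):
    scaled_train, scaled_test = scale_within_fold(values, test_idx)
    # the held-out values are scaled with statistics they did not influence
    print(test_idx, [round(v, 3) for v in scaled_test])
```

In scikit-learn the same effect is usually obtained by wrapping the scaler and the model in a `Pipeline` and passing that pipeline to `cross_val_score`.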


In [4]:
# create basic models for classification
cal_models = {'LOR': LogisticRegression(),
              'LDA': LinearDiscriminantAnalysis(), 
              'KNN': KNeighborsClassifier(), 
              'DeT': DecisionTreeClassifier(random_state=0), 
              'RF': RandomForestClassifier(random_state=0), 
              'GAU': GaussianProcessClassifier(random_state=0), 
              'SVC': SVC(probability=True),
              'AB': AdaBoostClassifier(random_state=0),
              'QDA': QuadraticDiscriminantAnalysis()}

In [5]:
# compare standardized (first line) and min-max normalized (second line) inputs for each basic model
for i in cal_models.keys():
    print(i, get_error(cal_models[i], stdsc, 'BS')[1])
    print(i, get_error(cal_models[i], mms, 'BS')[1])
    print('-----')
    
# The result below shows that standardized inputs give an equal or lower CV error for most models
# (DeT and SVC are marginally better with min-max scaling), so standardization is used from here on

LOR 0.08571428571428574
LOR 0.08630952380952372
-----
LDA 0.08690476190476204
LDA 0.08690476190476204
-----
KNN 0.08333333333333337
KNN 0.08511904761904743
-----
DeT 0.08750000000000002
DeT 0.08630952380952372
-----
RF 0.07202380952380949
RF 0.07261904761904747
-----
GAU 0.06607142857142845
GAU 0.08452380952380945
-----
SVC 0.06845238095238082
SVC 0.06785714285714273
-----
AB 0.08571428571428563
AB 0.08571428571428563
-----
QDA 0.0797619047619047
QDA 0.0797619047619047
-----


In [6]:
# a voting classifier combines several classifiers into one and may give a better result

# we use VOT(KNN, RF, GAU) as an example
VOT = VotingClassifier(estimators=[('KNN', KNeighborsClassifier()), 
                                   ('RF', RandomForestClassifier(random_state=0)),
                                   ('GAU', GaussianProcessClassifier(random_state=0))],
                        voting='soft',
                        weights=[1, 1, 1])
models = {'KNN': KNeighborsClassifier(), 'RF': RandomForestClassifier(random_state=0),
          'GAU': GaussianProcessClassifier(random_state=0), 'VOT': VOT}
rows = []

for i in models.keys():
    # per-fold CV error rates for this model
    answer = get_error(models[i], stdsc, 'BS', ran_state=2)
    for j in range(10):
        # the column is named 'R-square', but it stores the CV error rate (1 - accuracy)
        rows.append({'Method': i, 'R-square': answer[0][j]})

# DataFrame.append was removed in pandas 2.0, so the rows are collected in a list first
data = pd.DataFrame(rows, columns=["Method", "R-square"])

display(data.groupby(['Method']).agg({'R-square':['mean', 'std']}))
# The result below shows that the voting classifier has the lowest mean CV error of the four;
# its spread across folds is smaller than that of KNN and GAU but slightly larger than that of RF

Method,R-square mean,R-square std
GAU,0.070238,0.025223
KNN,0.078571,0.029134
RF,0.072024,0.023045
VOT,0.064881,0.024534
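
With `voting='soft'`, the ensemble averages the class probabilities of its members (weighted by `weights`) and predicts the class with the largest average. A minimal sketch of that rule follows; the probabilities are made up and simply stand in for the outputs of the three classifiers, and `soft_vote` is an illustrative name.

```python
def soft_vote(probas, weights):
    """Weighted average of per-model class probabilities, then argmax."""
    total = sum(weights)
    n_classes = len(probas[0])
    avg = [sum(w * p[c] for p, w in zip(probas, weights)) / total
           for c in range(n_classes)]
    label = max(range(n_classes), key=lambda c: avg[c])
    return avg, label

# hypothetical [P(Under), P(Over)] from three classifiers for one option
probas = [[0.55, 0.45], [0.55, 0.45], [0.10, 0.90]]
avg, label = soft_vote(probas, weights=[1, 1, 1])
print([round(a, 3) for a in avg], label)
```

Here two of the three models lean towards Under, so a hard (majority) vote would return 0, while the soft vote returns 1 because the third model is far more confident.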


In [7]:
# to find the best combination for the voting classifier, we run a best subset selection over all combinations of the basic models
# create empty dictionaries for the results
mean_rs_dict = {}
std_rs_dict = {}

# iterate over the number of models in the voting classifier
for i in range(1, len(cal_models)+1):
    
    # iterate over every combination of that many models
    for j in list(itertools.combinations(cal_models, i)):
        name_list = []
        vot_list = []
        
        # collect the (name, model) pairs of this combination
        for k in j:
            name_list.append(k)
            vot_list.append((k, cal_models[k]))
        
        VOT = VotingClassifier(estimators = vot_list,
                               voting='soft',
                               weights=[1]*len(vot_list))
        answer = get_error(VOT, stdsc, 'BS')
        mean_rs_dict[' '.join([str(elem) for elem in name_list])] = answer[1]
        std_rs_dict[' '.join([str(elem) for elem in name_list])] = np.std(answer[0])
        #print(name_list)
        #print(answer[1])
        #print('-----')

# show the final result (mean_rs and std_rs are the mean and standard deviation of the CV error rate)
result = pd.Series(mean_rs_dict).to_frame().merge(pd.Series(std_rs_dict).to_frame(),
                                                  left_index=True, right_index=True)
result.rename(columns={'0_x':'mean_rs', '0_y':'std_rs'}, inplace=True)
result = result.sort_values(['mean_rs', 'std_rs'], ascending=[True, True])
display(result.head(10))

Unnamed: 0,mean_rs,std_rs
RF GAU QDA,0.065476,0.026352
RF SVC AB QDA,0.065476,0.027016
RF GAU AB QDA,0.066071,0.026143
GAU,0.066071,0.027206
RF GAU SVC,0.066071,0.026143
RF GAU SVC AB,0.066071,0.026143
RF SVC,0.066071,0.028228
RF SVC QDA,0.066667,0.027251
GAU AB,0.066667,0.02802
RF SVC AB,0.066667,0.025226
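
The loop above evaluates every non-empty subset of the nine basic classifiers. The sketch below counts those subsets so that the cost of the search is explicit; it uses only the model names from `cal_models`, not the models themselves.

```python
import itertools

names = ['LOR', 'LDA', 'KNN', 'DeT', 'RF', 'GAU', 'SVC', 'AB', 'QDA']

# every non-empty combination, from single models up to all nine together
subsets = [combo
           for size in range(1, len(names) + 1)
           for combo in itertools.combinations(names, size)]

print(len(subsets))        # 2**9 - 1 = 511 candidate ensembles
print(len(subsets) * 10)   # 5110 ensemble fits with 10-fold cross validation
```

Each ensemble fit trains all of its member models, so the search is considerably more expensive than the count of 511 candidates alone suggests.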


In [8]:
# based on the best subset selection above, the combination of three models (RF, GAU, QDA) gives the lowest mean CV error
# (tied with RF SVC AB QDA, but with a smaller standard deviation)
# next, we run a grid search for each of the three models to find its best hyperparameters

# transform the data for grid search
kfolds = StratifiedKFold(n_splits = 10, random_state = 0, shuffle = True)
input_cols=['S','K','tau','r']
X_train = stdsc.fit_transform(option[input_cols].values)
y_train = option['BS']

In [9]:
# Grid search for Random Forest Classifier
# 'auto' was removed from max_features in scikit-learn 1.3; for classifiers it was the same as 'sqrt'
rf_params = {'n_estimators': [100, 200, 300],
             'criterion': ['gini', 'entropy'],
             'min_samples_leaf': [10, 20],
             'max_features': ['sqrt', 'log2']}

rf_gs = GridSearchCV(RandomForestClassifier(), rf_params, cv = kfolds)
rf_gs.fit(X_train, y_train)

print('Grid Search for RandomForestClassifier',rf_gs.best_params_)
rf_best = rf_gs.best_estimator_

Grid Search for RandomForestClassifier {'criterion': 'entropy', 'max_features': 'log2', 'min_samples_leaf': 10, 'n_estimators': 100}


In [10]:
# grid search for Gaussian Process Classifier
gau_params = {'max_iter_predict': [100, 150, 200],
              'warm_start': [True, False],
              'multi_class':[ 'one_vs_rest', 'one_vs_one']}

gau_gs = GridSearchCV(GaussianProcessClassifier(), gau_params, cv = kfolds)
gau_gs.fit(X_train, y_train)

print('Grid Search for GaussianProcessClassifier',gau_gs.best_params_)
gau_best = gau_gs.best_estimator_

Grid Search for GaussianProcessClassifier {'max_iter_predict': 100, 'multi_class': 'one_vs_rest', 'warm_start': True}


In [11]:
# grid search for Quadratic Discriminant Analysis
qda_params = {'reg_param': [0, 0.1, 0.2, 0.3, 0.4, 0.5],
              'store_covariance': [True, False],}

qda_gs = GridSearchCV(QuadraticDiscriminantAnalysis(), qda_params, cv = kfolds)
qda_gs.fit(X_train, y_train)

print('Grid Search for QuadraticDiscriminantAnalysis',qda_gs.best_params_)
qda_best = qda_gs.best_estimator_

Grid Search for QuadraticDiscriminantAnalysis {'reg_param': 0, 'store_covariance': True}
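
`GridSearchCV` scores every combination in the parameter grid on every fold and then, with its default `refit=True`, refits the best one on the full data. A small sketch of that enumeration for the QDA grid above, using only the standard library:

```python
import itertools

qda_params = {'reg_param': [0, 0.1, 0.2, 0.3, 0.4, 0.5],
              'store_covariance': [True, False]}

# Cartesian product of all parameter values, one dict per candidate
keys = list(qda_params)
candidates = [dict(zip(keys, values))
              for values in itertools.product(*(qda_params[k] for k in keys))]

n_folds = 10
print(len(candidates))             # 12 parameter combinations
print(len(candidates) * n_folds)   # 120 cross-validation fits
```

`store_covariance` only controls whether the class covariance matrices are kept on the fitted object, so it should not affect the predictions; the search effectively tunes `reg_param`.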


In [12]:
# build the best voting classifier
best_vot = VotingClassifier(estimators=[('rf_best', rf_best),
                                        ('gau_best',gau_best),
                                        ('qda_best', qda_best)],
                            voting='soft',
                            weights=[1, 1, 1])

# use all the data we have to train the model
final_X = stdsc.fit_transform(option[['S','K','tau','r']].values)
final_y = option['BS']
best_vot.fit(final_X, final_y)

VotingClassifier(estimators=[('rf_best',
                              RandomForestClassifier(criterion='entropy',
                                                     max_features='log2',
                                                     min_samples_leaf=10)),
                             ('gau_best',
                              GaussianProcessClassifier(warm_start=True)),
                             ('qda_best',
                              QuadraticDiscriminantAnalysis(reg_param=0,
                                                            store_covariance=True))],
                 voting='soft', weights=[1, 1, 1])

In [13]:
# read the data for prediction
opt_pre = pd.read_csv("option_test_wolabel.csv")

# transform the data for prediction and then do the prediction
opt_pre_X = stdsc.transform(opt_pre[['S','K','tau','r']].values)
opt_pre_y = best_vot.predict(opt_pre_X)

In [14]:
# the final result is:
opt_pre_y

array([1, 0, 0, ..., 1, 0, 0], dtype=int64)
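
The predictions are coded 0/1 because of the mapping applied when the training data were read. If the labels are needed in their original form, the mapping can be inverted, as in this sketch (the short list only stands in for `opt_pre_y`):

```python
# the mapping used when reading option_train.csv, and its inverse
label_to_code = {'Under': 0, 'Over': 1}
code_to_label = {code: label for label, code in label_to_code.items()}

preds = [1, 0, 0, 1, 0, 0]   # stand-in values, not the full prediction array
labels = [code_to_label[p] for p in preds]
print(labels)
```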