## Question 2: Explain the role of the regularization parameter C in a Support Vector Machine (SVM) model. How does varying C affect the model’s bias and variance trade-off?

In Support Vector Machines (SVMs), the regularisation parameter C is a key hyperparameter that controls how strongly the model penalises margin violations. It balances two goals: maximising the margin and minimising classification errors on the training data. This balance affects how the SVM performs on both training and unseen data.

The following is the role and impact of the regularisation parameter C:

__Role of C:__
C controls how tolerant the SVM is of misclassified training samples. Smaller values of C favour a wider margin: some training samples are allowed to fall inside the margin or be misclassified in order to keep the model simple.
Larger values of C make the model fit the training data more tightly to minimise classification errors, which narrows the margin and can lead to more complex decision boundaries.
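
For reference, the standard soft-margin formulation makes this trade-off explicit. The symbols below ($w$, $b$, $\xi_i$, $x_i$, $y_i$, $n$) are the usual textbook notation and are introduced here as an assumption, since the answer itself does not define them:

$$
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^2 + C\sum_{i=1}^{n}\xi_i
\quad\text{subject to}\quad y_i\,(w^\top x_i + b)\ \ge\ 1-\xi_i,\qquad \xi_i \ge 0 .
$$

The first term rewards a wide margin and the second penalises margin violations through the slack variables $\xi_i$, so C sets the relative weight of the two.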

__Effect on the bias-variance trade-off:__

- __Small C (wider margin, high bias, low variance):__
    - A smaller C value results in a wider margin and allows some training samples to be misclassified.
    - This leads to higher bias because the model prioritises a wide margin over classifying every training sample correctly, so it accepts a larger training error.
    - High-bias models are simpler and less sensitive to noise in the data, so they may generalise better to unseen data, although they underfit if C is too small.

- __Large C (narrower margin, low bias, high variance):__
    - A larger C value makes the model fit the training data more tightly to minimise classification errors.
    - This results in low bias because the model prioritises classifying the training data correctly, i.e., it seeks a smaller training error.
    - Low-bias models are more complex and more sensitive to noise in the training data, so they may overfit and perform poorly on unseen data.

Thus, the choice of C determines where the SVM sits on the bias-variance trade-off. __Smaller C values produce high-bias, low-variance models, which tend to suit noisier data, while larger C values produce low-bias, high-variance models, which tend to suit cleaner data.__ Choosing an appropriate C is therefore important, and methods such as cross-validation are typically used to find a value that lets the model perform well on both training and test data.
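
The effect of C can be seen in a small experiment. The sketch below is illustrative only: it assumes a synthetic dataset from `make_classification` with some label noise, an RBF kernel, and an arbitrary grid of C values, none of which come from the answer above. It reports training accuracy, 5-fold cross-validated accuracy, and the number of support vectors for each C.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

# Synthetic, noisy binary classification data (an assumption for illustration)
X, y = make_classification(n_samples=400, n_features=10, n_informative=4,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

results = {}
for C in [0.01, 0.1, 1, 10, 100]:
    # Fit once on the full training split to inspect the training fit
    model = SVC(kernel='rbf', C=C).fit(X_train, y_train)
    # Cross-validated accuracy estimates performance on unseen data
    cv_acc = cross_val_score(SVC(kernel='rbf', C=C), X_train, y_train, cv=5).mean()
    results[C] = {
        'train_acc': model.score(X_train, y_train),
        'cv_acc': cv_acc,
        'n_support': int(model.n_support_.sum()),
    }
    print(f"C={C:<6} train={results[C]['train_acc']:.3f} "
          f"cv={cv_acc:.3f} support_vectors={results[C]['n_support']}")

# Pick the C with the best cross-validated accuracy
best_C = max(results, key=lambda c: results[c]['cv_acc'])
print(f"Best C by cross-validation: {best_C}")
```

Training accuracy typically rises with C, while cross-validated accuracy tends to peak at an intermediate value; the gap between the two is a rough indicator of variance.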

## Question 3: Follow the 7-steps to model building for your selected ticker.

In [78]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_auc_score
from sklearn.preprocessing import StandardScaler

### Step 1: Data Collection
Use the tushare SDK to download the daily market data for stock 600073 from 2010-01-01 to 2022-12-31, including:
- Trading Date,
- Open Price,
- High Price,
- Low Price,
- Closing Price,
- Previous Close Price,
- Price Change,
- Price Change Percentage,
- Volume,
- Turnover.

In [79]:
# Fetch the daily market data for a single stock via tushare
# import tushare as ts
#
# ts.set_token('<YOUR_TUSHARE_TOKEN>')
# pro = ts.pro_api()
# # Shanghai Maling, 600073.SH
# df = pro.daily(ts_code='600073.SH', start_date='2010-01-01', end_date='2022-12-31')
# print(df)

# open csv file
# import csv
# with open('600073.csv', 'r') as file:
#     csv_reader = csv.reader(file)
#     for row in csv_reader:
#         print(row)

# use pandas to read csv
data = pd.read_csv('600073.csv')
print(data.head())

  Stock Code  Trade date  Open Price  High Price  Low Price  Close Price  \
0  600073.SH    20211231        7.95        8.13       7.94         8.09   
1  600073.SH    20211230        7.92        7.99       7.92         7.97   
2  600073.SH    20211229        8.01        8.05       7.93         7.94   
3  600073.SH    20211228        8.07        8.09       8.00         8.03   
4  600073.SH    20211227        7.99        8.07       7.96         8.07   

   Previous Close Price  Price Change  Price Change Percentage     Volume  \
0                  7.97          0.12                   1.5056  150337.07   
1                  7.94          0.03                   0.3778   64586.06   
2                  8.03         -0.09                  -1.1208   90722.45   
3                  8.07         -0.04                  -0.4957   88413.60   
4                  7.99          0.08                   1.0013  123911.38   

     Turnover  
0  121128.014  
1   51445.668  
2   72347.819  
3   70987.108  
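
The `head()` output shows the most recent trading day first. Since the features computed later depend on row order, it can help to parse the date column and sort the rows chronologically before going further. The sketch below is a minimal illustration on a three-row sample copied from the output above; the name `sample` is hypothetical, and the `%Y%m%d` format is an assumption based on how the dates are printed.

```python
import pandas as pd

# Three rows copied from the head() output, newest first
sample = pd.DataFrame({
    'Trade date': [20211231, 20211230, 20211229],
    'Close Price': [8.09, 7.97, 7.94],
})

# Parse the integer dates (assumed to be YYYYMMDD) into datetimes
sample['Trade date'] = pd.to_datetime(sample['Trade date'].astype(str), format='%Y%m%d')

# Sort so that the oldest trading day comes first
sample = sample.sort_values('Trade date').reset_index(drop=True)
print(sample)
```

With the rows in chronological order, functions such as `pct_change`, `diff`, `rolling` and `ewm` look backwards in time, which is what a "past" feature requires.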


### Step 2: Choose and Compute Features
The following features are used:
1. O-C: the difference between the daily closing and opening prices (computed as close minus open).
2. H-L: the difference between the daily high and low prices.
3. Sign: a binary indicator of the direction of the daily price change (1 for a rise of more than 0.25%, 0 otherwise).
4. Past Returns: the price return over a past period; here, the return over the past 5 trading days.
5. Momentum: captures the trend and speed of price changes; here, the day-over-day change in the closing price.
6. SMA: the simple moving average, which smooths price data by averaging over a window; here, the average closing price over the last 20 trading days.
7. EMA: the exponential moving average, a recursive moving average that assigns higher weights to the latest data and is used to track the latest price developments.

In [80]:
# Compute the selected features
data['O-C'] = data['Close Price'] - data['Open Price']
data['H-L'] = data['High Price'] - data['Low Price']
# Sign is the label column: 1 indicates a positive trend, 0 a negative trend.
# Daily changes of 0.25% or less (including all declines) are labelled 0.
data['Sign'] = np.where(data['Price Change Percentage'] > 0.25, 1, 0)
# Note: the rows are ordered newest-first (see the head() output), so pct_change,
# diff, rolling and ewm below run in reverse chronological order.
data['Past Returns'] = data['Close Price'].pct_change(5)
data['Momentum'] = data['Close Price'].diff()
data['SMA'] = data['Close Price'].rolling(window=20).mean()
Nobs = 20  # length of the EMA time window
alpha = 2 / (Nobs + 1)
data['EMA'] = data['Close Price'].ewm(alpha=alpha, adjust=False).mean()

# Note: 'Sign' is used both as a feature here and as the label below,
# which leaks the target into the features.
features = data[['O-C', 'H-L', 'Sign', 'Past Returns', 'Momentum', 'SMA', 'EMA']]
# features = data[['O-C', 'H-L', 'Sign', 'Past Returns']]
labels = data['Sign']
print(features)

       O-C   H-L  Sign  Past Returns  Momentum      SMA        EMA
0     0.14  0.19     1           NaN       NaN      NaN   8.090000
1     0.05  0.07     1           NaN     -0.12      NaN   8.078571
2    -0.07  0.12     0           NaN     -0.03      NaN   8.065374
3    -0.04  0.09     0           NaN      0.09      NaN   8.062005
4     0.08  0.11     1           NaN      0.04      NaN   8.062767
...    ...   ...   ...           ...       ...      ...        ...
2820  0.24  0.26     1     -0.059459     -0.09  10.3635  10.570931
2821 -0.30  0.40     0     -0.075540     -0.16  10.3950  10.543223
2822 -0.18  0.25     0     -0.035326      0.37  10.4540  10.553392
2823  0.12  0.28     1      0.006500      0.19  10.5320  10.580688
2824  0.19  0.38     1      0.020893     -0.09  10.6050  10.596813

[2825 rows x 7 columns]
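
The EMA column can be checked by hand. With `adjust=False`, pandas applies the recursion EMA_t = alpha * P_t + (1 - alpha) * EMA_{t-1}, starting from the first price. The sketch below replays that recursion on the five closing prices shown in the `head()` output, in the same row order as the notebook; the names `close` and `ema_manual` are introduced here for illustration.

```python
import pandas as pd

# The first five closing prices, in the same row order as the notebook output
close = pd.Series([8.09, 7.97, 7.94, 8.03, 8.07])

Nobs = 20
alpha = 2 / (Nobs + 1)

# Manual recursion: the first EMA value is simply the first price
ema_manual = [close.iloc[0]]
for price in close.iloc[1:]:
    ema_manual.append(alpha * price + (1 - alpha) * ema_manual[-1])

# The same calculation through pandas
ema_pandas = close.ewm(alpha=alpha, adjust=False).mean()

for manual, from_pandas in zip(ema_manual, ema_pandas):
    print(f"{manual:.6f}  {from_pandas:.6f}")
```

The second value, 8.09 + (2/21) * (7.97 - 8.09), comes to roughly 8.078571, which matches row 1 of the EMA column printed above.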


### Step 3: Data Preprocessing
The main purpose of standardizing the feature data with `StandardScaler` is to rescale each feature to have mean 0 and standard deviation 1. This is a data preprocessing step with the following aims:
- Feature scaling: different features may have very different ranges of values. Standardization puts them on a similar scale, which prevents features with large numeric ranges from having an excessive impact on the model.
- More even regularization: when a model is regularized, standardization ensures the penalty treats all features comparably, which can help limit over-fitting.
- Improved model performance: certain machine learning algorithms, such as Support Vector Machines and K-Nearest Neighbors, are very sensitive to the scale of the features, so standardization can improve their performance and help them converge. Tree-based models such as random forests are largely insensitive to feature scale.

`StandardScaler` standardizes each feature by subtracting its mean and dividing by its standard deviation, so the transformed feature has mean 0 and standard deviation 1. It does not change the shape of the distribution, so the result is not necessarily normally distributed.

After standardization, `scaled_features` contains the standardized feature data that can be used to build machine learning models. Standardized data are often better suited to machine learning algorithms because all features share a common scale, which contributes to model stability and performance.

In [81]:
scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)
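
Two properties of `StandardScaler` are worth checking on a toy example: the scaled training data has mean 0 and standard deviation 1, and a skewed feature stays skewed after scaling. The sketch below uses synthetic exponential data (an assumption, not the stock data) and fits the scaler on the training portion only, which is a common way to keep test-set statistics out of the preprocessing step.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# A single, strongly right-skewed synthetic feature
rng = np.random.default_rng(0)
X = rng.exponential(scale=2.0, size=(1000, 1))
X_train, X_test = X[:800], X[800:]

# Learn the mean and standard deviation from the training portion only
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)  # reuses the training statistics

# Skewness of standardized data is the mean of its cubed values
skewness = float((X_train_s ** 3).mean())

print(f"train mean={X_train_s.mean():.3f}, std={X_train_s.std():.3f}")
print(f"test  mean={X_test_s.mean():.3f}, std={X_test_s.std():.3f}")
print(f"train skewness after scaling={skewness:.2f}")
```

The test-set mean and standard deviation are close to, but not exactly, 0 and 1, because they are measured against the training statistics.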

### Step 4: Split the Data into Training and Test Sets
X_train: this is the array containing the training feature data which will be used to train the machine learning model.

X_test: this is the array containing the test feature data, which will be used to evaluate the performance of the model.

y_train: this is the array containing the training labels (target values) corresponding to the feature data of the training set. It is used to train the model to learn how to make predictions.

y_test: this is the array containing the test labels, corresponding to the feature data of the test set. It is used to evaluate the performance of the model on new data.

The main purpose of splitting the dataset into a training set and a test set is to evaluate the generalization ability of the model. By splitting the data into two separate sets, you can test the model's performance on data not seen during training. This helps to determine if the model is able to make accurate predictions on unseen data and to check for overfitting or underfitting problems.

In the code, the `train_test_split` function splits the `scaled_features` and `labels` data into a training set and a test set at a specified ratio (`test_size=0.2`). The `random_state` parameter sets the random seed so that the split is reproducible.

Once the dataset is split, we can use the training set to train the model and then use the test set to evaluate the model's performance, such as calculating accuracy, generating confusion matrices, plotting ROC curves, and so on. This helps to determine if the model generalizes enough for use in real-world prediction tasks.

In [82]:
# test_size=0.2 means that 20% of the data will be allocated to the test set, while 80% will be used for the training set
# random_state=42 is the seed value used to control the random splitting of the dataset. Specifying the same random_state value will ensure that you get the same random split every time you run the code, making the results repeatable.
X_train, X_test, y_train, y_test = train_test_split(scaled_features, labels, test_size=0.2, random_state=42)
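
The sizes produced by this split can be confirmed on placeholder arrays with the same number of rows as the feature table (2825). The sketch below also shows a chronological alternative using `shuffle=False`, which is sometimes preferred for time-ordered data because the test rows then all come from one end of the series. The arrays and the `*_seq` names are hypothetical stand-ins, not the notebook's variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the feature table
X = np.arange(2825 * 2).reshape(2825, 2)
y = np.arange(2825) % 2

# Random split, as in the notebook cell
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)  # (2260, 2) (565, 2)

# Chronological alternative: no shuffling, the last 20% of rows form the test set
X_train_seq, X_test_seq, y_train_seq, y_test_seq = train_test_split(
    X, y, test_size=0.2, shuffle=False)
print(X_test_seq[0], X_test_seq[-1])  # first and last test rows
```

The 565 test rows match the `support` total in the classification report.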

### Step 5: Model Construction (Random Forest Classifier)
`RandomForestClassifier` is the scikit-learn class used to create a random forest model.
The `n_estimators` parameter specifies the number of decision trees in the forest. Here, a setting of 100 means that the random forest will include 100 decision trees.
The `random_state` parameter fixes the random seed so that the training process is reproducible.

The call `rf_classifier.fit(X_train, y_train)` fits (trains) the model on the training data, where:
`X_train` is the training feature data, containing the features used to train the model.
`y_train` is the training label, i.e., the target variable, corresponding to the training feature data.
Through this process, the random forest classifier learns how to classify based on the training data. The trained model can then be used to predict the class labels of new data points.

In [83]:
# Running this cell originally raised an error:
# ValueError: Input X contains NaN.
# The data contains missing values (NaN), and RandomForestClassifier does not handle missing values by default.
# To solve this, SimpleImputer replaces each missing value with the mean of the corresponding feature.
from sklearn.impute import SimpleImputer

# Fit the imputer on the training set only, then apply the same means to the test set
imputer = SimpleImputer(strategy='mean')
X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(X_test)
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train, y_train)
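
One way to keep imputation and model fitting together is a scikit-learn pipeline, so that the same imputation is applied automatically at both training and prediction time. The sketch below is a minimal illustration on synthetic data with injected missing values; the data and the name `model` are assumptions, not part of the notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# Synthetic features and labels (labels are computed before values go missing)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Knock out roughly 5% of the entries to imitate missing data
X[rng.random(X.shape) < 0.05] = np.nan

# The imputer's means are learned from the training rows only
model = make_pipeline(
    SimpleImputer(strategy='mean'),
    RandomForestClassifier(n_estimators=100, random_state=42),
)
model.fit(X[:160], y[:160])

# The pipeline imputes the held-out rows with the training means before predicting
pred = model.predict(X[160:])
print(pred[:10])
```

Because the imputer sits inside the pipeline, there is no separate step to forget when new data arrives.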

### Step 6: Model Prediction

In [84]:
y_pred = rf_classifier.predict(X_test)

### Step 6 (optional): Hyperparameter Tuning

In [85]:
# from sklearn.model_selection import GridSearchCV
#
# # Define the hyperparameter grid to search
# param_grid = {
#     'n_estimators': [50, 100, 150],
#     'max_depth': [None, 10, 20, 30],
#     'min_samples_split': [2, 5, 10]
# }
#
# # Search the hyperparameters with cross-validation
# grid_search = GridSearchCV(rf_classifier, param_grid, cv=5, scoring='accuracy', verbose=2)
# grid_search.fit(X_train, y_train)
#
# # Print the best hyperparameter settings
# print("Best Hyperparameters:")
# print(grid_search.best_params_)
#
# # Retrain the model with the best hyperparameters
# best_rf_classifier = grid_search.best_estimator_
# best_rf_classifier.fit(X_train, y_train)
#
# # Evaluate performance on the test data
# y_pred = best_rf_classifier.predict(X_test)
# accuracy = accuracy_score(y_test, y_pred)
# roc_auc = roc_auc_score(y_test, y_pred)
# conf_matrix = confusion_matrix(y_test, y_pred)
# class_report = classification_report(y_test, y_pred)
#
# print(f"Accuracy: {accuracy}")
# print(f"ROC AUC: {roc_auc}")
# print(f"Confusion Matrix:\n{conf_matrix}")
# print(f"Classification Report:\n{class_report}")
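
For reference, a scaled-down, runnable version of the grid search might look like the sketch below. It assumes synthetic data and a deliberately small grid so that it finishes quickly; the grid values and `cv=3` are illustrative choices, not tuned recommendations for the stock data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (an assumption for illustration)
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# A small grid: 2 x 2 = 4 candidate settings
param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [None, 10],
}

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring='accuracy',
)
grid_search.fit(X_train, y_train)

# With the default refit=True, best_estimator_ is already retrained
# on the whole training set using the best parameters
best_model = grid_search.best_estimator_
test_acc = best_model.score(X_test, y_test)

print("Best hyperparameters:", grid_search.best_params_)
print(f"Best CV accuracy: {grid_search.best_score_:.3f}")
print(f"Test accuracy: {test_acc:.3f}")
```

Since `GridSearchCV` refits the best estimator by default, a second call to `fit` on the best estimator is not needed.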

### Step 7: Model Evaluation

In [86]:
accuracy = accuracy_score(y_test, y_pred)
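# Use the predicted probability of the positive class; hard labels give only a single ROC operating point
roc_auc = roc_auc_score(y_test, rf_classifier.predict_proba(X_test)[:, 1])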
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

print(f"Accuracy: {accuracy}")
print(f"ROC AUC: {roc_auc}")
print(f"Confusion Matrix:\n{conf_matrix}")
print(f"Classification Report:\n{class_report}")

Accuracy: 1.0
ROC AUC: 1.0
Confusion Matrix:
[[326   0]
 [  0 239]]
Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       326
           1       1.00      1.00      1.00       239

    accuracy                           1.00       565
   macro avg       1.00      1.00      1.00       565
weighted avg       1.00      1.00      1.00       565
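
A perfect score on every metric is unusual for market data. One likely cause here is that `Sign` appears in both `features` and `labels`, so the model can read the answer directly from its inputs. The sketch below shows one possible leakage-free setup under stated assumptions: synthetic random-walk prices stand in for the real data, rows are in chronological order, the label is the next day's direction (using the same 0.25% threshold), and the split is chronological. The column name `Target` and the reduced feature set are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Synthetic geometric random-walk prices, oldest row first (an assumption)
rng = np.random.default_rng(0)
n = 1000
close = 10 * np.exp(np.cumsum(rng.normal(0, 0.01, n)))
open_ = close * (1 + rng.normal(0, 0.005, n))
df = pd.DataFrame({'Open Price': open_, 'Close Price': close})

# Features use only information available at the close of day t
df['O-C'] = df['Close Price'] - df['Open Price']
df['Past Returns'] = df['Close Price'].pct_change(5)
df['SMA'] = df['Close Price'].rolling(window=20).mean()
feature_cols = ['O-C', 'Past Returns', 'SMA']  # the label is not a feature

# Label: does the NEXT day's close rise by more than 0.25%?
next_return_pct = df['Close Price'].pct_change().shift(-1) * 100
df['Target'] = (next_return_pct > 0.25).astype(int)

# Drop the last row (no next day) and the warm-up rows with NaN features
df = df.iloc[:-1].dropna(subset=feature_cols)

# Chronological split: train on the first 80%, test on the most recent 20%
split = int(len(df) * 0.8)
train, test = df.iloc[:split], df.iloc[split:]

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(train[feature_cols], train['Target'])
acc = accuracy_score(test['Target'], model.predict(test[feature_cols]))
print(f"Accuracy without leakage: {acc:.3f}")
```

On random-walk data there is nothing to learn, so the accuracy here should sit near the base rate of the majority class; a result far above that on real data would be worth checking for leakage again.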

