## Question 1: What are voting classifiers in ensemble learning?

A voting classifier is an ensemble learning method that combines the predictions of several different machine learning models into a single classification decision. Each model votes on the category a sample belongs to, and the ensemble aggregates those votes. There are two main variants:

1. __Hard Voting__:
In hard voting, each model predicts a category label for a given input sample, and the final prediction is the category that receives the most votes (majority vote).
__EXAMPLE__: Suppose three different models A, B, and C each classify the same sample. Model A predicts category 1, model B predicts category 2, and model C predicts category 1. Category 1 receives two votes and category 2 only one, so hard voting chooses category 1.
__Hard voting is applicable to both binary and multi-category classification problems.__ It is a simple but effective way to combine several different types of models and obtain more stable and accurate classification results.

2. __Soft Voting__:
Unlike hard voting, soft voting takes into account the probability or score each model assigns to every category, not just the predicted label. Each model estimates a probability for each category, the probabilities are averaged across models, and the category with the highest average is chosen.
__EXAMPLE__: Suppose three different models A, B, and C classify the same sample. Model A estimates a probability of 0.7 for category 1 and 0.3 for category 2; model B estimates 0.4 for category 1 and 0.6 for category 2; and model C estimates 0.6 for category 1 and 0.4 for category 2. Averaging gives category 1 a probability of (0.7 + 0.4 + 0.6) / 3 ≈ 0.57 and category 2 a probability of (0.3 + 0.6 + 0.4) / 3 ≈ 0.43, so soft voting selects category 1.
__Soft voting requires models that output class probabilities and is well suited to multi-category classification problems__ because it uses the probabilistic information from every model: a confident prediction counts for more than an uncertain one.
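
The two worked examples above can be reproduced with a few lines of plain Python. This is a minimal sketch rather than a library implementation; the function names `hard_vote` and `soft_vote` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def hard_vote(labels):
    """Return the label predicted by the most models (majority vote)."""
    # most_common(1) yields the (label, count) pair with the highest count;
    # on a tie, the label that was seen first wins.
    return Counter(labels).most_common(1)[0][0]


def soft_vote(probabilities):
    """Average per-category probabilities across models.

    `probabilities` is a list with one dict per model, mapping
    category -> estimated probability. Returns (winner, averages).
    """
    categories = probabilities[0].keys()
    averages = {
        c: sum(p[c] for p in probabilities) / len(probabilities)
        for c in categories
    }
    winner = max(averages, key=averages.get)
    return winner, averages


# Hard voting example: models A, B, C predict categories 1, 2, 1.
hard_result = hard_vote([1, 2, 1])

# Soft voting example: per-category probabilities from models A, B, C.
soft_result, soft_averages = soft_vote([
    {1: 0.7, 2: 0.3},
    {1: 0.4, 2: 0.6},
    {1: 0.6, 2: 0.4},
])

print("hard voting picks category", hard_result)
print("soft voting picks category", soft_result, "with averages", soft_averages)
```

Both schemes pick category 1 here, but they can disagree: a single very confident model can outweigh two lukewarm ones under soft voting, which hard voting cannot express.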

__Advantages of voting classifiers include:__
- Improves robustness, because the final decision combines the opinions of multiple models.
- Reduces variance, especially when a single model may overfit the data.
- Allows different types of models to be combined for better performance.
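
In practice, a voting ensemble is rarely written by hand. The sketch below uses scikit-learn's `VotingClassifier` on synthetic data; the dataset, the three base models, and their settings are assumptions made for illustration and are not taken from the text above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (illustrative only).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)


def build_ensemble(voting):
    """Combine three different model types under the given voting scheme."""
    return VotingClassifier(
        estimators=[
            ("lr", LogisticRegression(max_iter=1000)),
            ("tree", DecisionTreeClassifier(max_depth=4, random_state=0)),
            ("nb", GaussianNB()),
        ],
        voting=voting,  # "hard" = majority vote, "soft" = averaged probabilities
    )


scores = {}
for voting in ("hard", "soft"):
    ensemble = build_ensemble(voting).fit(X_train, y_train)
    scores[voting] = ensemble.score(X_test, y_test)
    print(f"{voting} voting test accuracy: {scores[voting]:.3f}")
```

Soft voting only works when every base model exposes `predict_proba`, which is why the three estimators chosen here all provide it.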

---

## Question 2: Explain the role of the regularization parameter C in a Support Vector Machine (SVM) model. How does varying C affect the model’s bias and variance trade-off?

In Support Vector Machines (SVMs), the regularisation parameter C is a key hyperparameter that controls how strongly the model penalises misclassified training samples. C balances the trade-off between maximising the margin and minimising classification errors on the training set. This balance affects the performance of the SVM model on both training and unseen data.

The role and impact of the regularisation parameter C are as follows:

__Role of C:__
C controls how tolerant the SVM model is of errors during training. Smaller values of C encourage a wider margin, i.e., some training samples are allowed to be misclassified or to fall inside the margin in order to keep the model simple.
Larger values of C make the model fit the training data more tightly to minimise classification errors, i.e., the margin becomes narrower, which can lead to more complex decision boundaries.
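
For reference, the role of C can be read off the standard soft-margin SVM objective. The symbols below ($w$, $b$, $\xi_i$, $x_i$, $y_i$, $n$) are not used elsewhere in this answer and follow the usual textbook notation, with labels $y_i \in \{-1, +1\}$:

$$
\min_{w,\,b,\,\xi} \; \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
y_i \left( w^\top x_i + b \right) \ge 1 - \xi_i, \qquad \xi_i \ge 0 .
$$

The first term favours a wide margin (the margin width is $2 / \lVert w \rVert$), while the second term penalises the margin violations $\xi_i$. A larger C puts more weight on the violations; a smaller C puts more weight on the margin.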

__Effect on the bias-variance trade-off:__

- __Small C (wider margin, high bias, low variance):__
    - A smaller C value results in a wider margin, allowing some training samples to be misclassified.
    - This leads to higher bias because the model prioritises a wide margin over classifying every training sample correctly.
    - Such a model is simpler and less sensitive to noise in the data, so it may perform better on unseen data, although a C that is too small can cause underfitting.

- __Large C (narrower margin, low bias, high variance):__
    - A larger C value makes the model fit the training data more tightly to minimise classification errors.
    - This results in low bias, as the model is more concerned with classifying the training data correctly, i.e., it seeks a smaller training error.
    - Such a model may be more complex and sensitive to noise in the training data, and may therefore overfit and perform poorly on unseen data.

Thus, the choice of C determines where the SVM model sits on the bias-variance trade-off. __Smaller C values produce high bias, low variance models, which tend to suit noisier data, while larger C values produce low bias, high variance models, which tend to suit cleaner data.__ Choosing an appropriate value of C is key, and it is usually selected with methods such as cross-validation so that the model performs well on unseen data as well as on the training data.
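
One way to see this trade-off is to sweep C and compare training accuracy with cross-validated accuracy. The following is a sketch using scikit-learn on synthetic, deliberately noisy data; the dataset, the RBF kernel, and the grid of C values are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data with roughly 10% of labels flipped to simulate noise.
X, y = make_classification(
    n_samples=300, n_features=8, flip_y=0.1, random_state=0
)

c_values = [0.01, 0.1, 1, 10, 100]
results = {}
for c in c_values:
    # Scale features first: SVMs are sensitive to feature magnitudes.
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=c))
    train_acc = model.fit(X, y).score(X, y)          # accuracy on the data it was fit to
    cv_acc = cross_val_score(model, X, y, cv=5).mean()  # estimate on held-out folds
    results[c] = (train_acc, cv_acc)
    print(f"C={c:<6} train accuracy={train_acc:.3f}  5-fold CV accuracy={cv_acc:.3f}")

# Pick the C with the best cross-validated accuracy.
best_c = max(results, key=lambda c: results[c][1])
print("best C by cross-validation:", best_c)
```

A growing gap between training accuracy and cross-validated accuracy as C increases is the typical sign of rising variance, while low accuracy on both at small C points to high bias.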

---

## Question 3: Follow the 7 steps of model building for your selected ticker.

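### Step 1: Data collection and preprocessing
Use the tushare SDK to fetch daily market data for stock 600073 from 2010-01-01 to 2022-12-31, including trade date, open, high, low, close, previous close, price change, percentage change, volume, and turnover amount.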

```python
import tushare as ts

# Authenticate with the tushare Pro API.
ts.set_token('7cb6ebc6b67bc4757d18b217c149110ad8f2654766fef3b0a18828ee')
pro = ts.pro_api()
# 上海梅林 (Shanghai Maling), 600073.SH: fetch daily bars for the date range.
df = pro.daily(ts_code='600073.SH', start_date='2010-01-01', end_date='2022-12-31')
print(df)
```

```
        ts_code trade_date   open   high    low  close  pre_close  change  \
0     600073.SH   20211231   7.95   8.13   7.94   8.09       7.97    0.12
1     600073.SH   20211230   7.92   7.99   7.92   7.97       7.94    0.03
2     600073.SH   20211229   8.01   8.05   7.93   7.94       8.03   -0.09
3     600073.SH   20211228   8.07   8.09   8.00   8.03       8.07   -0.04
4     600073.SH   20211227   7.99   8.07   7.96   8.07       7.99    0.08
...         ...        ...    ...    ...    ...    ...        ...     ...
2820  600073.SH   20100108  10.20  10.44  10.18  10.44      10.28    0.16
2821  600073.SH   20100107  10.58  10.65  10.25  10.28      10.65   -0.37
2822  600073.SH   20100106  10.83  10.89  10.64  10.65      10.84   -0.19
2823  600073.SH   20100105  10.72  10.88  10.60  10.84      10.75    0.09
2824  600073.SH   20100104  10.56  10.94  10.56  10.75      10.57    0.18

      pct_chg        vol      amount
0      1.5056  150337.07  121128.014
```
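
The printed rows are ordered newest-first and `trade_date` is shown as a `YYYYMMDD` string, so a natural preprocessing step is to parse the dates and sort the frame chronologically. The sketch below is an assumption about what that step could look like, not the author's code: the helper name `preprocess` is hypothetical, and a three-row sample copied from the output above stands in for the full `df` so the example runs without a tushare token.

```python
import pandas as pd

# Three rows copied from the printed output, standing in for the full `df`.
sample = pd.DataFrame({
    "ts_code": ["600073.SH"] * 3,
    "trade_date": ["20211231", "20211230", "20211229"],
    "open": [7.95, 7.92, 8.01],
    "high": [8.13, 7.99, 8.05],
    "low": [7.94, 7.92, 7.93],
    "close": [8.09, 7.97, 7.94],
})


def preprocess(raw):
    """Parse dates, sort oldest-first, index by date, and drop incomplete rows."""
    out = raw.copy()
    out["trade_date"] = pd.to_datetime(out["trade_date"], format="%Y%m%d")
    out = out.sort_values("trade_date").set_index("trade_date")
    # Drop rows with missing prices so later calculations are well defined.
    return out.dropna(subset=["open", "high", "low", "close"])


clean = preprocess(sample)
print(clean)
```

Sorting oldest-first matters for any later step that looks backwards in time, such as rolling windows or lagged returns.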

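### Step 2: Feature engineering
### Step 3: Data labelling
### Step 4: Data splitting
### Step 5: Model selection and construction
### Step 6: Hyperparameter tuning
### Step 7: Model evaluation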