# Grading
**Written test (individual) - 60%**
<br>
The written test consists of 40 multiple-choice questions, each with 4 possible answers, and covers all the theory presented in the lectures. There will be no Python-specific questions. The questions are designed to test your understanding by asking you to apply the knowledge from the lectures.
<br>

**Project report (group) - 40%**
<br>
1. The actual content of the final report is 20 pages maximum. The appendix may be 20 pages maximum. The content counts from the start of your introduction to the last sentence of your last section (conclusion). 
2. Follow the guidelines in the Writing Guidelines regarding appendices, bibliography and formatting.
3. Only use screenshots of your graphs. Make proper tables for the other components of data understanding (e.g. descriptive statistics, formulas). That is to say – no screenshots of code output. 
4. If you want to show some code to provide evidence towards fulfilling the LOs, put your code in an appendix. You may either make separate entries within that appendix with the relevant code snippets, or put your code in its entirety in one appendix and refer to the relevant lines within the text. Put the code in monospace font to discern it from regular text. Use the Word plugin for code to do this. This means also not putting in screenshots of your code.
5. You take responsibility for paraphrasing correctly (or quoting if necessary) and the ethical use of sources, including accurate source referencing following APA guidelines. Use the standard Word functionality under References for this.

---

## 1. Introduction to AI
### What do we want to achieve with AI?
- **Now**: Narrow AI
  - Automate processes
  - Improve decision-making
  - Build productivity tools
- **Later**: AGI & Superintelligence
  - Perform any task a human can
  - Solve novel problems
  - Show creativity and common sense

### Narrow AI (Machine Learning)
- Works with **available data**
- Focused on **specific tasks**
- Core functions:
  - Pattern recognition: recommendation systems, facial recognition
  - Prediction: maintenance, demand forecasting
  - Content generation: ChatGPT, Copilot, DALL·E

### What makes Narrow AI work?
- **Main ingredients**:
  - Data
  - Model of the data distribution
  - Probability & inferential statistics
  - Model tuning

### AGI (Artificial General Intelligence)
| Narrow AI | AGI |
|-----------|-----|
| Task-specific | Cross-domain |
| Based on past data | Can handle novelty |
| Pattern recognition | Pattern interpretation & reasoning |
| Increases efficiency | Matches/exceeds human flexibility |

<br>

> Adding complexity or data ≠ more intelligence.
>
> Intelligence requires reasoning beyond seen data.

---

## 2. Business Understanding (CRISP-DM Phase)
### Key Questions
- **Business Objective**: What is the organization trying to achieve?
- **Business Success Criteria**: When is that considered successful? (Define explicit KPIs.)
- **Data Mining Goal**: How can data science help?
- **Data Mining Success Criteria**: When is the data science effort successful? (Define a target for a metric such as accuracy or RMSE.)

---

## 3. Evaluation & Baseline Models
### Why use a baseline model?
- Understand the basic performance achievable on your data
- Identify data or modeling issues
- Iterate faster with simpler models
- Give stakeholders interpretable results
- Provide a fair benchmark for advanced models

### How to define Data Mining Success Criteria?
Use one or more of the following methods:
- **Relative improvement over baseline**: "15% improvement over mean prediction"
- **Business impact as a metric**: "10% waste reduction = €500 savings/month"
- **Industry/regulatory benchmarks**: "False negatives < 5% when predicting disease"
- **Statistical significance**: "Improve math scores by 6 points, which is beyond the natural ±3-point fluctuation"
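The first criterion, relative improvement over a baseline, can be sketched in plain Python as follows. The data values and the helper names `mae` and `relative_improvement` are made up for illustration; only the 15% threshold comes from the example above.

```python
from statistics import mean

def mae(actual, predicted):
    """Mean absolute error between two equally long sequences."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))

def relative_improvement(baseline_error, model_error):
    """Fraction by which the model error is lower than the baseline error."""
    return (baseline_error - model_error) / baseline_error

# Hypothetical target values and model predictions (illustrative only).
y_train = [10, 12, 14, 16]
y_test = [11, 13, 15]
model_pred = [11.5, 12.5, 15.5]

# Baseline model: always predict the mean of the training target.
baseline_pred = [mean(y_train)] * len(y_test)

baseline_mae = mae(y_test, baseline_pred)
model_mae = mae(y_test, model_pred)
improvement = relative_improvement(baseline_mae, model_mae)

# Success criterion: at least 15% better than the mean prediction.
meets_criterion = improvement >= 0.15
print(f"baseline MAE={baseline_mae:.2f}, model MAE={model_mae:.2f}, "
      f"improvement={improvement:.1%}, success={meets_criterion}")
```

The baseline is computed from the training target only, so the comparison on the test set stays fair.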


---

# 4. Data Understanding  
## 4.1 Preliminary Data Inspection
- Data types and structure (e.g., numerical, categorical)  
- Variable distributions (check for normality)  
- Correlations between variables  
- Missing values  
- Is the dataset suitable for modeling? What are the potential issues?  

## 4.2 Distributions  
**Purpose**: Understand the shape, skewness, and normality of variables.  

**Visualization methods**:  
- Histogram: shows the frequency distribution of a numerical variable   
- Bar plot: compares values across the groups of a categorical variable   
- Boxplot: shows central tendency, spread, and outliers  

Normality can be checked visually (histogram, Q-Q plot) or with a dedicated normality test such as Shapiro-Wilk. Note that the **t-test** and **ANOVA** compare group means and *assume* approximately normal data; they do not test for normality themselves. <br> 
If the distribution is not normal, apply a **log transformation**: it suits right-skewed data, shortening the tail and reducing the influence of outliers. <br>   
Alternatively, consider using **non-linear models**.  

## 4.3 Outlier Detection  
**Purpose**: Identify and handle extreme values that may bias the model.  

**Detection methods**:  
- **Boxplot method**:  
  - Lower bound = Q1 - 1.5 × IQR  
  - Upper bound = Q3 + 1.5 × IQR  
- **Z-score method**:  
  - Z = (x - mean) / standard deviation  
  - Z < -3 or Z > 3 is considered an outlier  
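Both detection rules can be sketched in plain Python. The readings are hypothetical, the function names are illustrative, and the quartiles use simple linear interpolation (libraries may use slightly different quantile conventions).

```python
from statistics import mean, pstdev

def quartiles(values):
    """Return (Q1, Q3) using linear interpolation between sorted values."""
    data = sorted(values)

    def percentile(p):
        pos = p * (len(data) - 1)
        lower = int(pos)
        upper = min(lower + 1, len(data) - 1)
        return data[lower] + (pos - lower) * (data[upper] - data[lower])

    return percentile(0.25), percentile(0.75)

def iqr_outliers(values):
    """Values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] (the boxplot rule)."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

def zscore_outliers(values, threshold=3.0):
    """Values whose absolute z-score exceeds the threshold."""
    mu, sigma = mean(values), pstdev(values)
    return [v for v in values if abs((v - mu) / sigma) > threshold]

# Hypothetical sensor readings with one extreme value.
readings = [10, 11, 12, 12, 13, 13, 14, 15, 16] * 2 + [100]

print(iqr_outliers(readings))
print(zscore_outliers(readings))
```

Flagged values are candidates for investigation, not automatic deletions.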

**Handling strategies**:  
- Investigate the cause: data entry error or device malfunction?  
- Delete: only if clearly erroneous  
- Retain: if it is a meaningful extreme value  
- Important: document all changes to avoid introducing bias  

Outlier ≠ Error: always investigate the reason.  

## 4.4 Feature Scaling  
**Purpose**: Bring variables to a similar scale so that models converge better and no single variable dominates.  

**Common scaling methods**:

| Method           | Description                                      | Suitable scenarios                                   |
|------------------|--------------------------------------------------|------------------------------------------------------|
| StandardScaler   | Mean = 0, Std = 1 (Z-score standardization)      | Linear models, neural networks, gradient descent     |
| MinMaxScaler     | Rescales values to [0, 1] range                  | Image processing, interpretable scales               |
| RobustScaler     | Based on median and IQR, resistant to outliers   | Data with many outliers                              |

**Note**:  
- Always fit the scaler on the **training set only**, then apply it to the test set, to avoid data leakage.  
- Do **not** scale binary variables or categorical variables.  
- Scaling is **not needed** for tree-based models.  
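The "training set only" rule can be illustrated with a hand-written z-score scaler. This is a minimal sketch in plain Python with made-up values; `fit_standard_scaler` and `transform` are illustrative names, not a library API.

```python
from statistics import mean, pstdev

def fit_standard_scaler(train_values):
    """Learn mean and standard deviation from the training data only."""
    return mean(train_values), pstdev(train_values)

def transform(values, mu, sigma):
    """Apply z-score standardization with previously fitted parameters."""
    return [(v - mu) / sigma for v in values]

# Hypothetical feature values, already split into train and test.
train = [2.0, 4.0, 6.0, 8.0]
test = [5.0, 10.0]

mu, sigma = fit_standard_scaler(train)      # fitted on the training set only
train_scaled = transform(train, mu, sigma)
test_scaled = transform(test, mu, sigma)    # the test set reuses the train parameters

print(train_scaled)
print(test_scaled)
```

Fitting the scaler on the full dataset would let information about the test set leak into the training features.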

## 4.5 Correlation & Multicollinearity  
**Purpose**: Understand relationships between variables and identify redundant model inputs.  

**Correlation metrics**:

| Metric           | Description                                  | Notes                                  |
|------------------|----------------------------------------------|----------------------------------------|
| Pearson's r      | Measures linear correlation; values in [-1, 1] | Most common; for continuous variables |
| Spearman's ρ     | Measures monotonic rank correlation; suits ordinal (ranked) data | Captures non-linear monotonic patterns |
| Distance corr.   | Captures all types of dependencies, including non-linear ones | Powerful but harder to interpret |

**Tip**: Always use scatter plots to visually confirm that a correlation is plausible.  
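The difference between Pearson's r and Spearman's ρ shows up on a monotonic but non-linear relationship. The following plain-Python sketch uses made-up data; the `ranks` helper assumes there are no tied values.

```python
from statistics import mean

def pearson(x, y):
    """Pearson's r: covariance divided by the product of the standard deviations."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def ranks(values):
    """Rank values from 1 to n (this sketch assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(x, y):
    """Spearman's rho is Pearson's r computed on the ranks."""
    return pearson(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]  # monotonic but clearly non-linear

pearson_r = pearson(x, y)
spearman_rho = spearman(x, y)
print(f"Pearson r = {pearson_r:.3f}, Spearman rho = {spearman_rho:.3f}")
```

Spearman's ρ reaches 1 because the ordering is perfectly preserved, while Pearson's r stays below 1 because the points do not lie on a straight line.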

### 4.5.1 Multicollinearity  
**Problems**:  
- Unstable coefficients that are difficult to interpret  
- Poor generalization  

**Detection**:  
- Correlation heatmap  
- Variance Inflation Factor (VIF): a VIF above 5 (or, more leniently, 10) indicates high multicollinearity

**Solutions**:  
- Drop one variable (choose the less relevant or less meaningful one)  
- Combine variables (e.g., average them or use PCA)  
- Apply regularization methods (e.g., Ridge or Lasso regression)  

# 5. Data Understanding & Preparation
Machine learning type flowchart:<br>
**Question: Is labeled data available, or can a target value be generated?**  <br><br>

Yes → **Supervised Learning**
- **Regression**: Used to predict continuous values (e.g., housing price, temperature)
  - Types: Linear Regression / Non-Linear Regression  
- **Classification**: Used to predict categories (e.g., spam detection)  
  - Types: Linear Classification / Non-Linear Classification  

<br>

No → **Unsupervised Learning**
- **Clustering**: Grouping data without target values (e.g., customer segmentation)  

## 5.1 Modeling Assumptions & Feature Selection
Modeling assumptions (about the rows)
- **Representative**: Samples should represent the overall population  
- **IID (Independent and Identically Distributed)**: Each row should be independent and drawn from the same distribution

<br>

Column level (per variable)
- **Measurement Level**: Type of each variable (e.g., numerical, categorical)  
- **Distribution**: Shape of each variable's distribution (e.g., skewness, outliers)  

<br>

Between columns
- **Relationships**: Correlations among the features and between features and the target  


## 5.2 Data Type Recognition
- **Independent Data**: Observations are unrelated (e.g., random sampling)  
- **Autocorrelation**: Observations close in time or space are more similar (e.g., temperature)  
- **Intraclass Correlation**: Observations within the same group are correlated (e.g., experimental groups)  

<br>
How to tell: combine the purpose of the dataset with visualization (e.g., scatterplot, autocorrelation plot)


## 5.3 Binning
- **Definition**: Convert continuous variables into categories   
- **Example**: Wind speed is divided into:  
  - "low" < 10 m/s  
  - "medium" 10–15 m/s  
  - "high" > 15 m/s
- **Purpose**: Simplify the model input, for example for classification models  
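The wind speed example can be written as a small function. This is a plain-Python sketch; the function name and the sample speeds are illustrative, and the boundary values 10 and 15 m/s are assigned to "medium" as in the ranges above.

```python
def bin_wind_speed(speed_ms):
    """Map a wind speed in m/s to the categories low / medium / high."""
    if speed_ms < 10:
        return "low"
    if speed_ms <= 15:
        return "medium"
    return "high"

wind_speeds = [3.2, 10.0, 12.5, 15.0, 18.7]
wind_bins = [bin_wind_speed(s) for s in wind_speeds]
print(wind_bins)  # ['low', 'medium', 'medium', 'medium', 'high']
```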

## 5.4 Lag Feature Engineering
- **Definition**: Use past observations as new features (for daily data:)  
  - Lag 1 = **yesterday's value**  
  - Lag 7 = **the value on the same day last week**
- **When to Use**: Time series modeling (e.g., predicting today's temperature)  
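A minimal plain-Python sketch of lag features for a daily series follows. The temperatures are made up and `lag_feature` is an illustrative helper; rows that lack a full history are dropped before modeling.

```python
def lag_feature(series, lag):
    """Shift `series` by `lag` steps; positions without history become None."""
    if lag <= 0:
        raise ValueError("lag must be a positive integer")
    return ([None] * lag + list(series))[:len(series)]

# Hypothetical daily temperatures, oldest first.
daily_temps = [15, 17, 16, 18, 20, 19, 21, 22]

lag_1 = lag_feature(daily_temps, 1)  # yesterday's value
lag_7 = lag_feature(daily_temps, 7)  # same day last week

# Keep only rows where every lag feature is available.
rows = [
    {"target": t, "lag_1": l1, "lag_7": l7}
    for t, l1, l7 in zip(daily_temps, lag_1, lag_7)
    if l1 is not None and l7 is not None
]
print(rows)
```

Each additional lag costs rows at the start of the series, which matters for short time series.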

## 5.5 Autocorrelation & Trend Visualization

### 5.5.1 Autocorrelation
- Measures the similarity between a time series and a lagged version of itself  
- Bars outside the confidence interval in the autocorrelation plot → significant autocorrelation exists  
  
### 5.5.2 Trend Visualization
- **Histogram**: Shows the frequency distribution of numerical variables
- **Bar plot**: Compares values across the groups of a categorical variable
- **Box plot**: Visualizes distribution and outliers
- **Scatterplot**: Shows the relationship between two variables
- **Line plot**: Suitable for time series
- **Heatmap**: Displays correlations or other matrix values

## 5.6 Categorical Encoding
### 5.6.1 Encoding Methods
| Method               | Explanation                | Suitable Models                          |
|----------------------|----------------------------|------------------------------------------|
| One-hot encoding     | One column per category    | Non-linear models                        |
| Dummy encoding       | n categories → n-1 columns | Linear models (avoids multicollinearity) |

**Python Tools**:
```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
encoded = pd.get_dummies(df, drop_first=True)  # dummy encoding; df is your own DataFrame
encoder = OneHotEncoder()  # one-hot encoding: one column per category
```

## 5.7 Merging

| Join Type  | Behavior                                                             |
|------------|----------------------------------------------------------------------|
| Inner Join | Keeps only rows whose keys appear in both tables                     |
| Left Join  | Keeps all rows of the left table plus the matching rows of the right |
| Right Join | Keeps all rows of the right table plus the matching rows of the left |
| Full Join  | Keeps all rows of both tables; unmatched values are filled with NaN  |

## 5.8 Missing Values

### 5.8.1 Types of Missingness
| Type | Explanation | Example |
|--------------|----------------------------------|------------------------------------|
| MCAR         | Missing Completely At Random | Data lost because of a hardware failure |
| MAR          | Missing At Random: missingness depends on other observed variables | Young people are less willing to report their screen time |
| MNAR         | Missing Not At Random: missingness depends on the missing value itself | Heavy drug users conceal how often they use |

### 5.8.2 Why It Matters
- May introduce bias  
- Reduces statistical power  
- Affects model training

### 5.8.3 Imputation Methods
| Method | When to Use | Advantages | Disadvantages |
|------------------------|--------------------------|--------------------|----------------------------------|
| Deletion (`dropna`) | MCAR | Simple | Information loss |
| Mean/median imputation | MCAR/MAR | Fast | Distorts the distribution |
| Forward/backward fill | Time series | Preserves the trend | Unsuitable for abrupt changes |
| Hot-deck/cold-deck imputation | Similar records are available to borrow from | Preserves the distribution | May introduce bias |
| Interpolation | A trend can be inferred | Several methods available | Unsuitable for data with jumps |
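Three of the methods in the table can be compared on one small series. This is a plain-Python sketch with made-up data in which `None` marks a missing value; the function names are illustrative.

```python
from statistics import mean

def mean_impute(values):
    """Replace every missing value with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def forward_fill(values):
    """Carry the last observed value forward."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

def linear_interpolate(values):
    """Fill gaps between two observed points linearly; edges stay None."""
    result = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            result[i] = values[left] + step * (i - left)
    return result

series = [1.0, None, None, 4.0, 6.0, None]

print(mean_impute(series))
print(forward_fill(series))
print(linear_interpolate(series))
```

Mean imputation ignores the trend, forward fill repeats the last value, and interpolation follows the trend but cannot fill the trailing gap.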

# 6. Regression Models

CRISP-DM is a framework for data science projects with six phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment.
- Data Understanding: Explore and summarize the data (e.g., check patterns, outliers, or correlations) to prepare for modeling. Data understanding helps select an appropriate model by revealing patterns such as linear or non-linear relationships.
- Modeling: In this phase, you need to:
  - Split the data: Divide the data into training and test sets to establish a test design. Splitting ensures the model is tested on unseen data, simulating real-world use.
  - Build and train models: Train the model on the training set and make predictions on the test set.

Supervised learning develops predictive models from input (features) and output (target) data, and covers two main tasks: classification and regression.
- Supervised learning: The model learns from data with known correct answers (the target). For example, given data on house size (input) and price (output), the model learns to predict price from size.
- Classification: Predict categories (e.g., "Will it rain?" → Yes/No).
- Regression: Predict numerical values (e.g., "What is the house price?" → 300,000), such as sales or temperature.

Core concepts of model training:
- X (features/independent variables): The model inputs, which are known to us, for example columns in a dataset such as house size or age.
- y (target/dependent variable): The model output, which we want to predict, such as house price.

Training process:
- Initialize the model (e.g., with random weights).
- Predict y values.
- Calculate the error (the difference between predicted and actual values).
- Update the model parameters (e.g., adjust the weights).
- Repeat until a stopping condition is met (e.g., the error is small enough).
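The training steps above can be sketched as gradient descent on a one-feature linear model. This is a minimal plain-Python illustration, not how any particular library trains: the learning rate, stopping tolerance, and data are assumptions, and the parameters start at zero rather than at random values.

```python
def train_linear_model(x, y, learning_rate=0.05, max_epochs=5000, tolerance=1e-12):
    """Fit y ≈ w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0                      # initialize the model
    n = len(x)
    previous_loss = float("inf")
    for _ in range(max_epochs):
        predictions = [w * xi + b for xi in x]                # predict y values
        errors = [p - yi for p, yi in zip(predictions, y)]    # calculate the error
        loss = sum(e * e for e in errors) / n
        if abs(previous_loss - loss) < tolerance:             # stopping condition
            break
        previous_loss = loss
        grad_w = 2 * sum(e * xi for e, xi in zip(errors, x)) / n
        grad_b = 2 * sum(errors) / n
        w -= learning_rate * grad_w                           # update the parameters
        b -= learning_rate * grad_b
    return w, b

# Hypothetical data generated from y = 2x + 1.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]

w, b = train_linear_model(x, y)
print(f"w = {w:.3f}, b = {b:.3f}")
```

The loop stops either when the loss no longer changes noticeably or after `max_epochs` iterations.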

Assumptions before modeling:<br>
Before modeling, we need to check whether the data meets certain conditions (assumptions); for example, linear regression assumes a linear relationship between the features and the target. These conditions ensure that the results are reliable and can be interpreted correctly.<br>
If the assumptions are violated, we may need to switch to another model or state in the report that the results may be distorted.

## 6.1 Linear Regression
Linear regression finds the straight line that lies closest to all data points, minimizing the error (the difference between predicted and actual values). It is simple and interpretable, which makes it a good starting point for numerical prediction.<br>
The standard simple linear regression formula is: y = β0 + β1 * X + ϵ
- y: the target (dependent variable); the model's prediction is β0 + β1 * X.
- β0: the intercept, i.e. the value of y when X = 0 (baseline).
- β1: the slope, i.e. how much y changes when X increases by 1.
- X: the independent variable.
- ϵ: the error term (the difference between the actual value and the predicted value).

Multiple linear regression and coefficient interpretation: with multiple independent variables (features), we get a regression plane instead of a regression line. Each coefficient indicates how strongly that feature affects the target variable. Multiplying each feature's coefficient by the feature's standard deviation gives a measure of its "actual impact" on y that can be compared across features.

Types of regularized linear models (Lasso and Ridge):
1. Lasso regression (L1 regularization): Shrinks the coefficients of unimportant features to exactly 0, which performs feature selection; suitable when only a few features are useful.
2. Ridge regression (L2 regularization): Shrinks coefficients without setting them to 0, which reduces the risk of overfitting; suitable when all features may be useful.
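The different shrinkage behavior can be seen in the simplest possible case: one feature and no intercept, where both penalized problems have a closed-form solution. This sketch is an illustration under those assumptions; the objective functions in the docstrings, the penalty strength `lam`, and the data are chosen for the example.

```python
def ridge_coefficient(x, y, lam):
    """Minimizer of 0.5 * sum((y - b*x)^2) + 0.5 * lam * b^2 (one feature, no intercept)."""
    xy = sum(a * b for a, b in zip(x, y))
    xx = sum(a * a for a in x)
    return xy / (xx + lam)

def lasso_coefficient(x, y, lam):
    """Minimizer of 0.5 * sum((y - b*x)^2) + lam * |b| (soft-thresholding)."""
    xy = sum(a * b for a, b in zip(x, y))
    xx = sum(a * a for a in x)
    shrunk = max(abs(xy) - lam, 0.0)
    return (shrunk if xy > 0 else -shrunk) / xx

x = [1.0, 2.0, 3.0]
y_strong = [2.0, 4.0, 6.0]     # strongly related to x (unpenalized slope = 2)
y_weak = [0.1, -0.1, 0.05]     # almost unrelated to x

ridge_strong = ridge_coefficient(x, y_strong, lam=2.0)
lasso_strong = lasso_coefficient(x, y_strong, lam=2.0)
ridge_weak = ridge_coefficient(x, y_weak, lam=2.0)
lasso_weak = lasso_coefficient(x, y_weak, lam=2.0)

print(ridge_strong, lasso_strong)  # both shrunk below 2
print(ridge_weak, lasso_weak)      # ridge: small but non-zero; lasso: exactly 0
```

With several features, libraries solve the same kind of problem iteratively, but the qualitative behavior is the same: Ridge shrinks, Lasso shrinks and can zero out.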

### 6.1.1 Linear Regression Assumptions
Linear regression relies on four key assumptions:

| Assumption | Meaning | Check Method | Fix Method | Example |
|----------------|-------------|---------------|---------------|---------------|
| **Linear relationship** | Each feature's relationship with the target must be linear. | Scatter plot / pairplot | Use non-linear models | If sales increase steadily with ad spend, the relationship is linear; if they plateau after a certain point, it is non-linear. |
| **Independence of observations** | Each data point (row) should not affect the others. Time series data (e.g., daily sales) often violates this, because today's value may depend on yesterday's. | Durbin-Watson test | Add lag variables / use time series models |  |
| **No multicollinearity** | Features should not be strongly correlated with each other (e.g., house size and number of rooms are highly correlated and carry similar information). | Heatmap, VIF (values above 5 signal a problem) | Remove or combine variables | When predicting car prices, if engine size and horsepower are highly correlated, remove one or combine them. |
| **Normality of residuals / Homoscedasticity** | Errors should be normally distributed with constant variance. Residuals are the differences between actual and predicted values; normality means the residuals follow a bell-shaped (normal) distribution; homoscedasticity means the variance of the residuals is constant across all predicted values (no funnel shape in the residual plot). | Residual plots, Q-Q plots | Transform variables (e.g., log-transform y) or change models | If the residual plot shows a funnel shape (larger errors at higher predictions), the variance is not constant, and a transformation or a different model is needed. |

## 6.2 Non-linear Models
If X and y are not linearly related (e.g., sales increase rapidly at first and then slow down), use one of the following models:
- Decision tree regression: Simple but prone to overfitting.
- Random forest regression: Averages the predictions of many trees, which makes it more robust and reduces overfitting.
- Gradient boosting regression: Builds trees sequentially, each correcting the errors of the previous ones; usually more accurate.

## 6.3 Time Series Models
Standard models such as linear regression are unsuitable for time series because the data points depend on each other over time (autocorrelation). Time series data, i.e. data ordered by time (e.g., daily stock prices, weather, sales), requires special models:
- Moving average: Predicts the next period as the average of the previous periods (e.g., the average of the last 7 days' sales predicts tomorrow's), which smooths out noise.
- (S)ARIMA(X): An advanced model family for forecasting trends and seasonality.

## 6.4 Regression Metrics

| Metric | Description | Applicable Scenarios |
|--------------------------|---------------------|---------------------------|
| **MAE (Mean Absolute Error)** | The average absolute difference between predicted and actual values, in the same units as the target. All errors are weighted equally. If MAE = 10,000 for house prices, predictions are off by 10,000 on average. | Everyday use |
| **MSE / RMSE (Mean Squared Error / Root Mean Squared Error)** | MSE squares the errors and averages them; RMSE additionally takes the square root, which returns the metric to the units of the target. Both emphasize large errors, so they suit cases where large errors are costly (e.g., drug dosage prediction). | High-risk areas (e.g., healthcare) |
| **R² (Coefficient of Determination)** | The proportion of variance explained by the model; only reliable for **linear models**. | Closer to 1 is better (note: for non-linear models R² may be meaningless or negative) |
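The metrics in the table can be computed by hand to see how they react to one large error. This is a plain-Python sketch with made-up house prices (in thousands); `regression_metrics` is an illustrative helper.

```python
from math import sqrt

def regression_metrics(actual, predicted):
    """Return MAE, MSE, RMSE and R² for two equally long sequences."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = sqrt(mse)
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)                    # unexplained variation
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)   # total variation
    r2 = 1 - ss_res / ss_tot
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}

actual = [100, 200, 300, 400]
predicted = [110, 190, 300, 440]   # one prediction is off by 40

metrics = regression_metrics(actual, predicted)
print(metrics)
```

Here MAE is 15 while RMSE is about 21.2: the single error of 40 is weighted more heavily once the errors are squared.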

## 6.5 Data Splitting and Validation (Train-Test Split & Cross-Validation)

**Train-Test Split**:
- Training set: Used to fit the model (e.g., 80%).
- Test set: Used to evaluate the model on unseen data (e.g., 20%).

For example, to predict exam scores, train on 80% of the students and test on the remaining 20% to check whether the model can predict the scores of new students.

**Cross-Validation (K-Fold)**:
K-fold cross-validation divides the data into K parts, trains on K-1 parts, tests on the remaining part, and repeats this K times. With K = 5, the data is divided into 5 parts; each round trains on 4 parts and tests on 1, using a different test part every time, and the results are averaged to obtain a reliable performance estimate. This is especially suitable for small datasets: it makes maximum use of the data and guards against an over-optimistic estimate from a single lucky split, which helps detect overfitting. For example, with 100 house prices, 5-fold CV creates 5 groups of 20; each round trains on 80 and tests on 20, and the 5 errors are averaged to evaluate the model.
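The fold bookkeeping for the 100-sample, 5-fold example can be sketched in plain Python. `k_fold_indices` is an illustrative helper that does not shuffle; in practice the data is usually shuffled first for i.i.d. data.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(100, 5))
for fold_number, (train, test) in enumerate(folds, start=1):
    print(f"fold {fold_number}: train={len(train)} rows, test rows {test[0]}..{test[-1]}")
```

Every row is used for testing exactly once and for training in the other four rounds.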

**Time Series Cross-Validation**:
For time series or other non-i.i.d. (not independent and identically distributed) data, use `TimeSeriesSplit` to split the data chronologically instead of randomly. Time series data (e.g., stock prices) has temporal dependencies, so random splitting causes data leakage (using future data to predict the past). A chronological split trains on earlier data and tests on later data (e.g., train on days 1–80 and test on days 81–100, then train on days 1–100 and test on days 101–120), which gives realistic predictions for time-dependent data and prevents misleadingly good results.
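The day 1–80 / 81–100 example corresponds to an expanding training window. The sketch below reproduces it in plain Python; `expanding_window_splits` and its parameters are illustrative and not the `TimeSeriesSplit` API. Indices are 0-based, so index 0 is day 1.

```python
def expanding_window_splits(n_samples, initial_train_size, test_size):
    """Yield (train_indices, test_indices); the test block always follows the train block."""
    train_end = initial_train_size
    while train_end + test_size <= n_samples:
        train = list(range(train_end))
        test = list(range(train_end, train_end + test_size))
        yield train, test
        train_end += test_size

splits = list(expanding_window_splits(n_samples=120, initial_train_size=80, test_size=20))
for train, test in splits:
    print(f"train on days 1-{train[-1] + 1}, test on days {test[0] + 1}-{test[-1] + 1}")
```

In every split the largest training index is smaller than the smallest test index, so the model never sees the future.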

## 6.6 Comparison of Common Regression Models (Summary)

| Model Type | Advantages | Applicable Scenarios |
|----------------|-----------|------------|
| **Linear Regression** | Simple and interpretable | Clear linear relationships, high-quality data |
| **Ridge / Lasso** | Handle correlated features: Ridge shrinks the influence of correlated features, Lasso sets unimportant features to zero. Regularization prevents overfitting | Many features, possibly correlated |
| **Random Forest** | Suitable for non-linear data; combines multiple decision trees. Robust against overfitting | Complex relationships between features |
| **SVM Regression** | Suitable for high-dimensional, non-linear data | Number of features far exceeds the number of samples |
| **Logistic Regression** | Despite its name, a classification model: it predicts the probability of an event rather than an arbitrary numeric target | Binary classification / probability output problems |

# 7. Classification

Classification is a type of supervised learning where the goal is to predict discrete class labels based on input data.

Classification models are widely used in real-world scenarios such as disease diagnosis, spam detection, and credit scoring.

## 7.1 Types of Classification

- **Binary Classification**: Two possible classes, e.g., "yes/no", "spam/not spam".
- **Multiclass Classification**: Three or more mutually exclusive classes, e.g., "dog", "cat", "horse", "mouse".
- **Multilabel Classification**: Each sample may belong to multiple classes simultaneously, e.g., an image that contains both a dog and a cat.

---

## 7.2 Classification Models

A variety of linear and non-linear models can be used for classification tasks.

### 7.2.1 Logistic Regression

Logistic regression estimates probabilities and performs binary classification.

- Uses a sigmoid (logistic) function to map the input to a probability between 0 and 1
- Applies a threshold to decide the class
- Assumes a linear relationship between the features and the log-odds of the outcome
- Sensitive to outliers

### 7.2.2 Support Vector Machines

Support Vector Machines (SVM) find the hyperplane that maximizes the margin between classes.

- Works well in high-dimensional spaces
- Key hyperparameters:
  - **C** controls the margin size (regularization)
  - **gamma** controls the influence of individual samples
- Non-linear kernels can handle non-linear data
- Library used: `sklearn.svm.SVC`

### 7.2.3 Decision Tree

Decision trees split the data based on feature values to form a tree of decision paths.

- Each node represents a feature, each branch a decision rule, and each leaf a classification result
- Interpretable and intuitive
- Easily overfits; complexity is controlled via pre-pruning (e.g., `max_depth`) or post-pruning
- Library used: `sklearn.tree.DecisionTreeClassifier`

### 7.2.4 Random Forest

Random Forest builds multiple decision trees and combines their results through voting.

- An ensemble method
- Uses bagging (random sampling of the rows) and random feature selection
- More robust and less prone to overfitting
- Can compute feature importance
- Library used: `sklearn.ensemble.RandomForestClassifier`

### 7.2.5 Gradient Boosting

Gradient Boosting combines weak learners, with each new model correcting the errors of the previous one.

- Strong predictive performance; can handle missing data and outliers (native missing-value support depends on the implementation)
- Susceptible to overfitting; computationally expensive
- Library used: `sklearn.ensemble.GradientBoostingClassifier`


## 7.3 Evaluation Metrics

### 7.3.1 Confusion Matrix

The confusion matrix evaluates the performance of a classification model.

| Term | Definition |
|------|---------------------|
| TP | True Positives: positive samples correctly predicted as positive |
| TN | True Negatives: negative samples correctly predicted as negative |
| FP | False Positives: negative samples incorrectly predicted as positive |
| FN | False Negatives: positive samples incorrectly predicted as negative |

### 7.3.2 Key Metrics

- **Accuracy**: Share of correct predictions among all samples: (TP + TN) / (TP + TN + FP + FN)
- **Precision**: Share of predicted positives that are truly positive: TP / (TP + FP)
- **Recall**: Share of actual positives that are correctly predicted: TP / (TP + FN)
- **F1-Score**: Harmonic mean of precision and recall: 2 × Precision × Recall / (Precision + Recall)
- Recommended: `sklearn.metrics.classification_report` generates these metrics in one report
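The formulas can be checked by hand on a set of confusion-matrix counts. The counts below are hypothetical and chosen to resemble an imbalanced problem; the sketch assumes that none of the denominators is zero.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts: 18 actual positives among 1000 samples.
accuracy, precision, recall, f1 = classification_metrics(tp=8, tn=980, fp=2, fn=10)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Accuracy is 0.988 even though fewer than half of the actual positives are found (recall ≈ 0.444), which is why accuracy alone can mislead on imbalanced data.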

### 7.3.3 Threshold Adjustment

- The default threshold is 0.5; changing it shifts the trade-off between recall and precision
- For example, lowering the threshold raises recall but also increases the number of false positives
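The effect of the threshold can be sketched with a sigmoid and two cut-offs. The linear scores and labels are made up for illustration, and the helper names are hypothetical.

```python
from math import exp

def sigmoid(z):
    """Map a linear score to a probability between 0 and 1."""
    return 1.0 / (1.0 + exp(-z))

def classify(probabilities, threshold=0.5):
    """Predict class 1 when the probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]

def recall_and_false_positives(actual, predicted):
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    return tp / (tp + fn), fp

# Hypothetical linear scores (w·x + b) and true labels.
scores = [-2.0, -1.0, -0.5, 0.3, 1.0, 2.5]
labels = [0, 0, 1, 0, 1, 1]
probabilities = [sigmoid(s) for s in scores]

recall_default, fp_default = recall_and_false_positives(labels, classify(probabilities, 0.5))
recall_low, fp_low = recall_and_false_positives(labels, classify(probabilities, 0.25))

print(f"threshold 0.50: recall={recall_default:.2f}, false positives={fp_default}")
print(f"threshold 0.25: recall={recall_low:.2f}, false positives={fp_low}")
```

With the lower threshold all positives are caught, at the cost of one additional false positive.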

### 7.3.4 ROC-AUC

- The ROC curve plots TPR (recall) against FPR (false positive rate) at different thresholds
- AUC measures how well the model separates the positive and negative classes
- AUC = 1 is perfect classification; AUC = 0.5 equals random guessing
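AUC can also be read as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The plain-Python sketch below computes it by comparing all positive/negative pairs; the labels and scores are made up.

```python
def roc_auc(labels, scores):
    """AUC as the share of positive/negative pairs that are ranked correctly (ties count 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

labels = [0, 0, 1, 0, 1, 1]
good_scores = [0.1, 0.2, 0.6, 0.3, 0.8, 0.9]    # every positive scores above every negative
mixed_scores = [0.1, 0.7, 0.6, 0.3, 0.8, 0.2]   # some pairs are ranked the wrong way round

auc_good = roc_auc(labels, good_scores)
auc_mixed = roc_auc(labels, mixed_scores)
print(auc_good, auc_mixed)
```

The pairwise loop is quadratic in the number of samples, so it is only meant for small illustrative datasets.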


## 7.4 Imbalanced Classes

### 7.4.1 Problem Description

In real-world datasets, one class often has far more samples than the others, e.g., in fraud detection or spam detection, where the class of interest (fraud) is much rarer than the rest (non-fraud).

This biases the model toward the majority class and makes metrics such as accuracy misleading.

### 7.4.2 Handling Methods

- Collect more data
- Resampling: either oversample the minority class or undersample the majority class
- Use ensemble methods such as Random Forest or Gradient Boosting
- Use anomaly detection models for extreme imbalance

### 7.4.3 重采样技术（Resampling Techniques）

- **随机欠采样（Random Undersampling）**：随机丢弃多数类样本  
  Randomly discard majority class samples
- **SMOTE（合成少数类过采样）**：根据最近邻创建少数类合成样本  
  SMOTE: Synthetic Minority Oversampling Technique creates synthetic minority samples

注意：只应在训练集上进行重采样，测试集保持原分布。  
Note: Apply resampling only to the training set to preserve true test distribution.
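
Random undersampling can be sketched with the standard library alone. The training rows below are hypothetical `(sample_id, label)` pairs, and the test set is deliberately left out because it should not be resampled.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical training rows as (sample_id, label): 95 majority, 5 minority.
train_rows = [(i, 0) for i in range(95)] + [(i, 1) for i in range(95, 100)]

minority = [row for row in train_rows if row[1] == 1]
majority = [row for row in train_rows if row[1] == 0]

# Keep every minority row and an equally sized random subset of the majority.
majority_kept = random.sample(majority, k=len(minority))
balanced_train = minority + majority_kept
random.shuffle(balanced_train)

print(len(balanced_train), sum(label for _, label in balanced_train))
```

The balanced training set has 10 rows, 5 per class. Undersampling discards information from the majority class, which is the price paid for the balance.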


## 7.5 Train-Test Split

To keep evaluation metrics stable, use stratified sampling.

- Use the `stratify` parameter in `train_test_split()`
- Use `StratifiedKFold` for stratified cross-validation
- Both ensure that each class is represented proportionally in every split or fold
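
A short sketch of a stratified split, assuming a hypothetical dataset with a 10% positive rate:

```python
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 negatives (0) and 10 positives (1).
X = [[i] for i in range(100)]
y = [0] * 90 + [1] * 10

# stratify=y keeps the class proportions the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(sum(y_train), len(y_train))  # positives and size of the training part
print(sum(y_test), len(y_test))    # positives and size of the test part
```

Without `stratify`, a small test set could by chance contain very few or even no positives, which would make recall and precision unstable or undefined.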


## 7.6 Metric Selection Scenarios

- **Task 1: Identify whether a plant is poisonous**  
  Focus on **Recall**: do not miss any poisonous plant

- **Task 2: Predict whether an expensive therapy is needed**  
  Focus on **Recall**: a false alarm is preferable to a missed diagnosis

- **Task 3: Predict a dog's breed (100 breeds, evenly distributed)**  
  Balanced multi-class problem, so **Accuracy** or **F1-score** is suitable

## 7.7 Summary

Classification modeling is a crucial step in data science, involving model selection, evaluation metrics, class imbalance handling, and threshold tuning.

Choosing metrics and techniques carefully improves both the practical usefulness and the interpretability of models.

# 8. Unsupervised Learning & Optimization

This chapter covers unsupervised learning methods (especially clustering) and how to optimize model performance through hyperparameter tuning. It also explains the bias-variance trade-off, overfitting, and underfitting.

## 8.1 Unsupervised Learning

Unsupervised learning works with unlabeled data and aims to discover structure or patterns based on similarities among samples.

Common applications include:

- Grouping similar documents or articles
- Customer segmentation based on purchasing behavior
- Grouping species based on genetic similarities

These methods usually rely on distance metrics such as Euclidean distance, and their results tend to be hard to interpret.

## 8.2 Clustering Methods

Clustering methods group data points based on distance or density.

- **Centroid-based**: e.g., KMeans
- **Connectivity-based (hierarchical)**: e.g., Agglomerative Clustering
- **Density-based**: e.g., DBSCAN
- **Graph-based**
- **Distribution-based**


## 8.3 KMeans Clustering

KMeans is one of the most commonly used clustering algorithms and requires the number of clusters *k* to be set in advance.

Steps:

1. Randomly pick *k* initial centroids
2. Assign each point to the nearest centroid
3. Recalculate each centroid as the mean of the points assigned to it
4. Repeat steps 2 and 3 until the centroids stop changing, which drives the Within-Cluster Sum of Squares (WSS) toward a (local) minimum

Characteristics: sensitive to outliers and feature scaling; requires a predefined *k*.

Library: `sklearn.cluster.KMeans`
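
The four steps can be sketched in plain Python to show what happens inside the algorithm. This is a teaching sketch for 2-D points, not how `sklearn.cluster.KMeans` is implemented; the function name `kmeans` and the sample points are hypothetical.

```python
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Plain-Python KMeans for 2-D points given as (x, y) tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # step 1: pick k initial centroids

    def sq_dist(p, c):
        return (p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2

    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [sq_dist(p, c) for c in centroids]
            clusters[distances.index(min(distances))].append(p)

        # Step 3: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                new_centroids.append((
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                ))
            else:
                new_centroids.append(old)

        # Step 4: stop once the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids

    # WSS: squared distance of every point to its nearest centroid.
    wss = sum(min(sq_dist(p, c) for c in centroids) for p in points)
    return centroids, clusters, wss


# Two hypothetical, clearly separated groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (8.5, 9.0), (9.0, 8.5)]
centroids, clusters, wss = kmeans(points, k=2)
print(sorted(len(c) for c in clusters), round(wss, 3))
```

Because the initial centroids are random, different seeds can lead to different local minima on harder data, which is why library implementations typically rerun the algorithm from several starting points.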


### 8.3.1 Choosing the Number of Clusters: Elbow Method

Used to determine an appropriate number of clusters *k*.

- Plot WSS (within-cluster sum of squares) against *k*
- Locate the "elbow" point where WSS stops decreasing significantly, and use that *k* as the suggested number of clusters
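
A sketch of the elbow method with scikit-learn, where WSS is available as the fitted model's `inertia_` attribute. The three synthetic blobs are an assumption for illustration, and the values are printed rather than plotted to keep the example short.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three well-separated blobs (assumption for illustration).
rng = np.random.default_rng(0)
centers = [(0, 0), (6, 6), (0, 6)]
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(40, 2)) for c in centers])

# Fit KMeans for k = 1..6 and record the WSS of each fit.
wss_by_k = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wss_by_k[k] = model.inertia_

for k, wss in wss_by_k.items():
    print(k, round(wss, 1))
```

On data like this, WSS drops sharply up to *k* = 3 and only marginally afterwards, so the elbow sits at 3. Real data often shows a less obvious bend, in which case the silhouette score can serve as a second opinion.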


### 8.3.2 Cluster Quality: Silhouette Score

- Measures how close each point is to the samples in its own cluster compared with its separation from other clusters
- Values range over [-1, 1]: closer to 1 is better, 0 indicates overlapping clusters, and negative values suggest the point may be assigned to the wrong cluster

Function: `sklearn.metrics.silhouette_score(X, labels)`
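
The silhouette score can also be used to compare candidate values of *k*. The sketch below does this on three synthetic blobs, which are an assumption made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three well-separated blobs (assumption for illustration).
rng = np.random.default_rng(0)
centers = [(0, 0), (6, 6), (0, 6)]
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(40, 2)) for c in centers])

# The silhouette score needs at least 2 clusters, so start at k = 2.
silhouette_by_k = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    silhouette_by_k[k] = silhouette_score(X, labels)

best_k = max(silhouette_by_k, key=silhouette_by_k.get)
print(best_k, round(silhouette_by_k[best_k], 3))
```

Unlike WSS, the silhouette score does not keep improving as *k* grows, so the highest value points directly at a candidate number of clusters.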



## 8.4 Hierarchical Clustering

Builds a hierarchy of clusters, either bottom-up or top-down.

- **Agglomerative**: bottom-up; start with every point as its own cluster and merge step by step based on similarity
- **Divisive**: top-down; start with all data in one cluster and split step by step based on dissimilarity

The result can be shown as a dendrogram; cutting it at a chosen height determines the number of clusters.
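
A minimal agglomerative example using scikit-learn's `AgglomerativeClustering`. The six one-dimensional points are hypothetical, and the choice of Ward linkage is an assumption rather than something prescribed by the lecture.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Six hypothetical 1-D points forming two obvious groups.
X = np.array([[1.0], [1.2], [0.8], [9.0], [9.3], [8.7]])

# Bottom-up merging; asking for two clusters corresponds to cutting the
# dendrogram at a height where two branches remain.
agg = AgglomerativeClustering(n_clusters=2, linkage="ward")
labels = agg.fit_predict(X)
print(labels)
```

The first three points end up in one cluster and the last three in the other. Which cluster gets label 0 or 1 is arbitrary.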


# 9. Overfitting & Underfitting


## 9.1 Definitions

- **Overfitting**: the model fits the training data too closely and generalizes poorly to new (test) data
- **Underfitting**: the model fails to capture the underlying patterns in the data

How to identify:

- Overfitting: low training error, high test error
- Underfitting: high error on both the training and test sets
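
The train-versus-test comparison can be sketched as follows. The noisy synthetic dataset and the two tree depths are assumptions chosen to make the gap visible.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data for illustration; flip_y=0.2 assigns random labels to
# 20% of the samples, which adds label noise.
X, y = make_classification(n_samples=600, n_features=20, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Compare a depth-limited tree with an unrestricted one.
results = {}
for name, depth in [("shallow", 1), ("deep", None)]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    results[name] = (tree.score(X_train, y_train), tree.score(X_test, y_test))

for name, (train_acc, test_acc) in results.items():
    print(f"{name}: train={train_acc:.2f}, test={test_acc:.2f}")
```

The unrestricted tree memorizes the training set, noise included, so its training accuracy is perfect while its test accuracy is clearly lower. That gap is the signature of overfitting. The depth-limited tree shows train and test scores that are much closer to each other.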


## 9.2 Bias-Variance Trade-off

- **Bias**: the difference between the model's average prediction and the true value
- **Variance**: how much the model's predictions vary across different training datasets

- High bias, low variance → underfitting (e.g., linear models on non-linear data)
- Low bias, high variance → overfitting (e.g., deep decision trees, kNN with a small *k*)


## 9.3 Causes & Solutions for Overfitting

### Causes

- Training set too small
- Too much noise in the data
- Trained for too long (especially in deep learning)
- Model too complex

### Fixes

- Increase the training set size
- Clean the data (handle outliers and missing values)
- Early stopping
- Regularization (e.g., L1, L2)
- Dimensionality reduction (remove irrelevant or weakly related features)
- Hyperparameter tuning


# 10. Hyperparameter Tuning

## 10.1 Overview

- **Parameters**: values the model learns from the data during training
- **Hyperparameters**: settings chosen before training that control the model structure or the training process (e.g., learning rate, tree depth)

### Why tune?

- To find the best-performing combination of settings
- Computationally expensive, but it can improve performance noticeably


## 10.2 Tuning Methods

### Grid Search

- Exhaustive search over all hyperparameter combinations
- Function: `sklearn.model_selection.GridSearchCV`

### Random Search

- Randomly samples the hyperparameter space
- More efficient, but may miss the optimal combination
- Function: `sklearn.model_selection.RandomizedSearchCV`


## 10.3 Example Search Space

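```python
# Search space for a tree ensemble; these names match hyperparameters of
# scikit-learn's random forest and gradient boosting estimators.
gr_space = {
    'max_depth': [3, 5, 7, 10],                 # maximum depth of each tree
    'n_estimators': [100, 200, 300, 400, 500],  # number of trees
    'max_features': [10, 20, 30, 40],           # features tried per split (data needs >= 40 features)
    'min_samples_leaf': [1, 2, 4],              # minimum samples in a leaf
}
# A full grid search tries 4 * 5 * 4 * 3 = 240 combinations.
```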
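
A search space like this is passed to `GridSearchCV` as `param_grid`. The sketch below uses a much smaller, hypothetical grid (`small_space`) and synthetic data so that it runs quickly; the choice of `RandomForestClassifier` is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data for illustration only.
X, y = make_classification(n_samples=300, n_features=12, random_state=0)

# Reduced, hypothetical grid: 2 x 2 = 4 combinations.
small_space = {
    "max_depth": [3, 5],
    "n_estimators": [20, 50],
}

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid=small_space,
    scoring="accuracy",
    cv=3,  # every combination is evaluated with 3-fold cross-validation
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

The number of model fits is the number of combinations times the number of folds (here 4 × 3 = 12, plus one final refit), which is why large grids become expensive quickly.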

## 10.4 K-Fold Cross-Validation

Used to evaluate each hyperparameter combination during tuning.

- Split the training set into K subsets (folds)
- In each round, use one fold for validation and the remaining folds for training
- Average the results across all K rounds

Libraries: `sklearn.model_selection.KFold` or `StratifiedKFold`
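
The splitting logic itself is simple enough to sketch in plain Python. The helper name `k_fold_indices` is hypothetical; it produces contiguous folds without shuffling.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
for train, validation in folds:
    print("train:", train, "validation:", validation)
```

Every sample is used for validation exactly once and for training in the other rounds. In practice, `KFold` or `StratifiedKFold` also offer shuffling and, for the latter, class-proportional folds.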


# 11. Feature Importance

A trained model can show which features matter most for its predictions.

- Feature importance can help simplify the model or explain its behavior
- Different models measure importance differently (e.g., tree models typically use information gain or impurity decrease)
- For Random Forest or Gradient Boosting models, feature importance can be read directly from an attribute of the fitted model

Attribute: `model.feature_importances_`
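
A sketch of reading and ranking the importances from a fitted random forest. The synthetic data and the feature names are hypothetical; only columns 0 and 1 carry signal by construction.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# shuffle=False keeps the 2 informative features in columns 0 and 1;
# the other 4 columns are noise (synthetic data, illustration only).
X, y = make_classification(
    n_samples=400, n_features=6, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0,
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # hypothetical names

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = forest.feature_importances_  # impurity-based, sums to 1

# Rank features from most to least important.
ranked = sorted(zip(feature_names, importances), key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

Impurity-based importances are computed on the training data and can favor features with many distinct values, so they are best read as a ranking aid rather than an exact measure of effect.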

# Summary

- Unsupervised learning is used to explore structure in data (e.g., clustering)
- Clustering methods include KMeans, hierarchical, and density-based clustering; evaluation metrics include WSS and the silhouette score
- Training should avoid overfitting and underfitting, which requires understanding the bias-variance trade-off
- Overfitting may be caused by a small training set, noisy data, training for too long, or an overly complex model
- Hyperparameter tuning (grid search or random search) combined with cross-validation helps find the best model configuration
- Feature importance helps with model interpretation and simplification
- Remember: high training accuracy does not make a good model; generalization is key

# 12. Evaluation & Recommendations

Evaluation is the CRISP-DM phase that follows modeling and precedes deployment. It focuses on interpreting model results, analyzing their impact, and providing actionable recommendations.


## 12.1 Goal of the Evaluation Phase

During the evaluation phase, answer:

- What are your results?
- What do they mean, and what is their impact on the business?
- Did the model achieve the business objectives?
- What are the next steps?


## 12.2 Hourglass Method

- In the conclusion, zoom out step by step from the specific model results to the overall project objectives
- Explicitly connect results to the original hypotheses or business goals
- Reflect on the choices made during modeling and their impact


## 12.3 Interpreting Metrics

Evaluation should be grounded in interpretable metrics, explained in layman's terms:

- **MAE = 0.6** → on average, predictions are off by 0.6 units
- Whether this is good or bad depends on the unit and context (e.g., 0.6 may be a large error if the unit is grams of a drug dosage)
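
The calculation behind such a statement is short. The values below are hypothetical and were chosen so that the MAE comes out at about 0.6; MSE is shown next to it for comparison.

```python
# Hypothetical true values and predictions, in the same unit.
y_true = [10.0, 12.0, 9.5, 11.0]
y_pred = [10.5, 11.0, 9.5, 11.9]

abs_errors = [abs(t - p) for t, p in zip(y_true, y_pred)]

# MAE: average absolute error, expressed in the unit of the target.
mae = sum(abs_errors) / len(abs_errors)
# MSE: average squared error, expressed in squared units.
mse = sum(e ** 2 for e in abs_errors) / len(abs_errors)

print(round(mae, 3), round(mse, 3))
```

Because MAE stays in the original unit, it translates directly into a sentence such as "predictions are off by about 0.6 units on average", whereas MSE is in squared units and weights large errors more heavily.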


## 12.4 Reflect on Data Preparation

When evaluating results, also reflect on:

- How did your data understanding and preparation steps affect modeling?
- Were there unintended side effects or overlooked factors?
- Knowing what you know now, what would you have done differently in data preparation?


# 13. Recommendations

## 13.1 Actionable and Specific

Good recommendations should:

- Be directly derived from the project steps and results
- Address the stakeholders' actual problems
- Include how to implement them and why they help
- Avoid vague generalities or issues outside your scope


## 13.2 SMART Recommendations

Effective recommendations should follow the SMART criteria:

- **S**pecific
- **M**easurable
- **A**chievable
- **R**elevant
- **T**ime-bound

Example:

- ❌ "We should try more data techniques." → Too vague
- ✅ "Applying a log transformation to `battery_power` in the next phase could reduce the regression model's MAE." → SMART


## 13.3 Honest Self-Reflection

Honesty is key in evaluation and recommendations:

- Acknowledge what went wrong and explain how to avoid it
- Show your process, not just the output
- Offer a reasonable, self-aware critique that points to improvements, not excuses or guesses


# 14. Knowledge Check & Final Advice

## 14.1 Mock Questions

1. **Why scale features using only the training set?**  
   ✅ B. To simulate how the model handles unseen data

2. **If the business goal is to reduce unnecessary follow-ups, which success criterion is most relevant?**  
   ✅ C. High precision

3. **If a feature has a right-skewed distribution, which transformation can improve model performance?**  
   ✅ D. Log transform

4. **Why consider MAE, and not only MSE, when evaluating a regression model?**  
   ✅ C. MAE is easier for business stakeholders to interpret