### Axis ranges  

`plt.ylim` and `plt.xlim` set the range of the y-axis and the x-axis respectively. For example, we can make the y-axis start at 50:

In [None]:
import matplotlib.pyplot as plt

plt.rcParams['font.sans-serif'] = 'SimHei'
plt.figure(figsize=(8, 6))
x = ['1月', '2月', '3月', '4月', '5月', '6月', '7月', '8月', '9月', '10月', '11月', '12月']
y1 = [123, 145, 152, 182, 147, 138, 189, 201, 203, 211, 201, 182]
y2 = [102, 121, 138, 154, 171, 178, 199, 231, 228, 202, 231, 271]

plt.title("销售趋势图", fontdict={'family':'SimHei', 'color': 'k', 'size': 15}, loc='left')
plt.plot(x, y1, linestyle='-.', marker='o', markersize=10, color='r', label='华东')  # plot the East China series
plt.plot(x, y2, linestyle='-', marker='o', markersize=10, color='y', label='华中')  # plot the Central China series
plt.xlabel("月份", fontdict={'family':'SimHei', 'color': 'k', 'size': 12}, labelpad=10)
plt.ylabel("销售额（万元）", fontdict={'family':'SimHei', 'color': 'k', 'size': 12}, labelpad=10)
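plt.legend(loc='best', fontsize=12)  # best: matplotlib picks the least obstructive position automatically
# Set the axis range
plt.ylim(50, 300)
# Add grid lines (the old `b=` keyword was renamed to `visible=`; passing True positionally works in both cases)
plt.grid(True, axis='y', linestyle='--', linewidth=1, color='grey')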
plt.show()
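
As a smaller illustration of the same idea, the sketch below uses the object-oriented `Axes.set_ylim` / `Axes.get_ylim` methods instead of `plt.ylim`. The data values are made up for the example. It shows that fixing only the lower bound leaves the automatically chosen upper bound in place.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot([1, 2, 3, 4], [123, 145, 152, 182], marker='o')  # illustrative data

# Fix only the lower bound; the upper bound keeps its autoscaled value
ax.set_ylim(bottom=50)
lower, upper = ax.get_ylim()
print(lower, upper)

# Both bounds can also be given explicitly, like plt.ylim(50, 300)
ax.set_ylim(50, 300)
ylim_explicit = ax.get_ylim()
print(ylim_explicit)
```

In a notebook the figure is rendered at the end of the cell; in a script you would finish with `plt.show()`.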

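### Multiple subplots  

Sometimes a single chart is not enough to make the point, and the data needs to be shown across several subplots: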

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

plt.rcParams['font.sans-serif'] = 'SimHei'

x = ["深圳", "广州", "北京", "上海"]
y = [1, 3, 2, 5]

plt.subplot(2, 2, 1)
plt.bar(x, y)

plt.subplot(2, 2, 2)
plt.pie(y, labels=x)

plt.subplot(2, 2, 3)
plt.plot(x, y)

plt.subplot(2, 2, 4)
plt.barh(x, y)

plt.show()
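
The same 2x2 layout can also be built with `plt.subplots`, which returns the figure and an array of Axes objects in one call. This is only a sketch of that alternative style; the category labels and panel titles are placeholders chosen for the example.

```python
import matplotlib.pyplot as plt

x = ["A", "B", "C", "D"]  # placeholder category labels
y = [1, 3, 2, 5]

# plt.subplots returns the figure and a 2x2 array of Axes
fig, axes = plt.subplots(2, 2, figsize=(8, 6))
axes[0, 0].bar(x, y)
axes[0, 1].pie(y, labels=x)
axes[1, 0].plot(x, y)
axes[1, 1].barh(x, y)

# Give every panel a title and let matplotlib fix the spacing
for ax, title in zip(axes.flat, ["bar", "pie", "plot", "barh"]):
    ax.set_title(title)
fig.tight_layout()
```

Addressing panels as `axes[row, col]` tends to be easier to follow than repeated `plt.subplot(2, 2, n)` calls once each panel needs its own title or axis settings.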

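# Date handling

In feature engineering, date columns can usually be split into useful features such as year, month, day, quarter and day of the week. Below are some common ways to process date data, with example code: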

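1. Extract basic date features
    The following common features can be extracted from a date:
    
    - year (`year`)
    - month (`month`)
    - day (`day`)
    - day of the week (`weekday`)
    - quarter (`quarter`)
    - week of the year (`weekofyear` or `isocalendar().week`)
    - day of the year (`dayofyear`)
    - whether the date falls on a weekend (`is_weekend`)
    
    Example code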
    
    ```python
    import pandas as pd
    
    # Sample data
    data = {'date': ['2010-01-01', '2010-05-15', '2010-12-31']}
    df = pd.DataFrame(data)
    
    # Convert the strings to datetime
    df['date'] = pd.to_datetime(df['date'])
    
    # Extract features
    df['year'] = df['date'].dt.year
    df['month'] = df['date'].dt.month
    df['day'] = df['date'].dt.day
    df['weekday'] = df['date'].dt.weekday  # 0: Monday, 6: Sunday
    df['is_weekend'] = df['weekday'].isin([5, 6]).astype(int)  # 1 if Saturday or Sunday
    df['quarter'] = df['date'].dt.quarter  # quarter of the year
    df['dayofyear'] = df['date'].dt.dayofyear  # day of the year
    df['weekofyear'] = df['date'].dt.isocalendar().week  # ISO week of the year
    ```

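2. Compute time differences

    The difference between two dates yields new numeric features, such as how many days or months have passed since an event.
    
    Example code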
    ```python
    
    # Reference date
    df['reference_date'] = pd.to_datetime('2023-01-01')
    
    # Difference in days
    df['days_diff'] = (df['reference_date'] - df['date']).dt.days
    
    # Difference in months (approximate: the day of the month is ignored)
    df['months_diff'] = (df['reference_date'].dt.year - df['date'].dt.year) * 12 + \
                        (df['reference_date'].dt.month - df['date'].dt.month)
    ```

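3. Create cyclical features
    Some features (such as month or day of the week) are cyclical; sine and cosine transforms capture this wrap-around relationship.
    
    Example code
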
    ```python
    import numpy as np
    
    # Encode the month as cyclical features
    df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)
    df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)
    
    # Encode the day of the week as cyclical features
    df['weekday_sin'] = np.sin(2 * np.pi * df['weekday'] / 7)
    df['weekday_cos'] = np.cos(2 * np.pi * df['weekday'] / 7)
    ```

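4. Handle time periods
    If a date marks the start or end of a period, interval features can be derived from it, for example the distance to the nearest holiday or whether the date is the start or end of a quarter.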
    
    ```python
    # Flag dates that are the start or the end of a quarter
    df['is_quarter_start'] = df['date'].dt.is_quarter_start.astype(int)
    df['is_quarter_end'] = df['date'].dt.is_quarter_end.astype(int)
    ```
    

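5. Convert to categorical variables
    If the distribution of a date feature matters to the model, the feature can be discretized into categories. For example:
    
    group years into "early", "middle" and "recent";
    split dates into working days and non-working days.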
    ```python
    
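    # Group by decade; right=False makes the bins [2000, 2010) and [2010, 2020),
    # so 2010 is labelled '2010s' instead of falling into '2000s'
    df['year_group'] = pd.cut(df['year'], bins=[2000, 2010, 2020], right=False, labels=['2000s', '2010s'])
    
    # Split dates into working days and non-working days
    df['is_workday'] = (~df['weekday'].isin([5, 6])).astype(int)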
    ```
    

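6. Use the elapsed time directly
    If the date value represents the passage of time, such as an order time, it can be converted directly into a number of days, hours or seconds.
    
    Example code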
    ```python
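    # Convert dates to Unix timestamps in seconds
    # (Series.view is deprecated in recent pandas, so divide the offset from the epoch instead)
    df['timestamp'] = (df['date'] - pd.Timestamp('1970-01-01')) // pd.Timedelta(seconds=1)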
    ```
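
To see why the sine/cosine transform in step 3 helps, the sketch below compares the gap between December and January before and after encoding. The helper `encode_month` is a hypothetical name introduced only for this illustration.

```python
import numpy as np

def encode_month(month):
    """Map a month number (1-12) to a point on the unit circle."""
    angle = 2 * np.pi * month / 12
    return np.array([np.sin(angle), np.cos(angle)])

# On the raw scale December and January look far apart
raw_gap = abs(12 - 1)

# On the circle they are as close as any other pair of neighbouring months
cyclic_gap_dec_jan = np.linalg.norm(encode_month(12) - encode_month(1))
cyclic_gap_jan_feb = np.linalg.norm(encode_month(1) - encode_month(2))
print(raw_gap, cyclic_gap_dec_jan, cyclic_gap_jan_feb)
```

Both cyclical gaps are the same chord length, which is the property a model cannot recover from the raw month number alone.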

# One-Hot Encoding

In [None]:
from sklearn.preprocessing import OneHotEncoder
import numpy as np

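# Sample data
data = [['red'], ['blue'], ['green'], ['blue'], ['red']]
X = np.array(data)

# Create the OneHotEncoder (scikit-learn >= 1.2 uses `sparse_output`; the old `sparse` keyword has been removed)
encoder = OneHotEncoder(sparse_output=False)

# Fit and transform the data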
encoded_data = encoder.fit_transform(X)

print(encoded_data)
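
Two details are worth checking when using the encoder on new data: the column order of the output and what happens to categories that were not seen during fitting. The sketch below assumes you want unseen values encoded as an all-zero row, which is what `handle_unknown='ignore'` does; the value `'purple'` is invented for the example.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([['red'], ['blue'], ['green'], ['blue'], ['red']])

# handle_unknown='ignore' encodes unseen categories as an all-zero row
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(X)

# categories_ gives the column order of the encoded matrix
categories = encoder.categories_[0].tolist()
print(categories)

# The default output is a sparse matrix; .toarray() makes it dense
new_rows = encoder.transform(np.array([['green'], ['purple']])).toarray()
print(new_rows)
```

Without `handle_unknown='ignore'`, transforming a value that was absent at fit time raises an error by default.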


# Target Encoder

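Target encoding follows a Bayesian idea: each category value is encoded as a weighted average of the prior (the overall target mean) and the posterior (the target mean within that category).

When the relationship between a feature and the label is unclear, Target Encoding is worth trying as a way to surface signal.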



In [1]:
import pandas as pd
import numpy as np
import category_encoders as ce

# Create sample training and test data
train_data = {
    'id': [1, 2, 3, 4, 5],
    'category': ['A', 'B', 'A', 'B', 'C'],
    'feature1': [10, 20, 30, 40, 50],
    'feature2': [5, 15, 25, 35, 45],
    'target': [1, 0, 1, 0, 1]
}

test_data = {
    'id': [6, 7, 8],
    'category': ['B', 'C', 'A'],
    'feature1': [25, 35, 45],
    'feature2': [10, 20, 30]
}

# Convert to DataFrames
train_df = pd.DataFrame(train_data)
test_df = pd.DataFrame(test_data)

# Name of the target column
label = 'target'

# Instantiate the TargetEncoder
TE = ce.TargetEncoder(smoothing=10)

# Collect all feature columns
features = test_df.columns.tolist()

# Apply target encoding column by column
for col in features:
    if col not in ['id', label]:  # skip 'id' and the target column
        TE.fit(train_df[col], train_df[label])  # fit the encoder on the training data
        train_df[col] = TE.transform(train_df[col])  # transform the training data
        test_df[col] = TE.transform(test_df[col])  # transform the test data

# Display a summary of the data
def display_summary(df, name):
    print(f"\n{name} Summary:")
    print("-" * 30)
    print("\nData Info:")
    df.info()
    print("\nFirst Rows:")
    display(df.head().T)

display_summary(train_df, "Transformed Train Dataset")
display_summary(test_df, "Transformed Test Dataset")


Transformed Train Dataset Summary:
------------------------------

Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   id        5 non-null      int64  
 1   category  5 non-null      float64
 2   feature1  5 non-null      int64  
 3   feature2  5 non-null      int64  
 4   target    5 non-null      int64  
dtypes: float64(1), int64(4)
memory usage: 328.0 bytes

First Rows:


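|          | 0       | 1        | 2       | 3        | 4        |
|----------|---------|----------|---------|----------|----------|
| id       | 1.0     | 2.0      | 3.0     | 4.0      | 5.0      |
| category | 0.65674 | 0.514889 | 0.65674 | 0.514889 | 0.652043 |
| feature1 | 10.0    | 20.0     | 30.0    | 40.0     | 50.0     |
| feature2 | 5.0     | 15.0     | 25.0    | 35.0     | 45.0     |
| target   | 1.0     | 0.0      | 1.0     | 0.0      | 1.0      |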



Transformed Test Dataset Summary:
------------------------------

Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   id        3 non-null      int64  
 1   category  3 non-null      float64
 2   feature1  3 non-null      int64  
 3   feature2  3 non-null      int64  
dtypes: float64(1), int64(3)
memory usage: 224.0 bytes

First Rows:


|          | 0        | 1        | 2       |
|----------|----------|----------|---------|
| id       | 6.0      | 7.0      | 8.0     |
| category | 0.514889 | 0.652043 | 0.65674 |
| feature1 | 25.0     | 35.0     | 45.0    |
| feature2 | 10.0     | 20.0     | 30.0    |
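
The encoded `category` values above can be reproduced by hand, which makes the "weighted average of prior and posterior" concrete. The sketch below is an assumption about how the weighting works: it uses a sigmoid weight on the category count with `min_samples_leaf = 20`, a value not set in the cell above and chosen here because it reproduces the printed numbers. Check the `category_encoders` documentation for the version you have installed before relying on it.

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({
    'category': ['A', 'B', 'A', 'B', 'C'],
    'target': [1, 0, 1, 0, 1],
})

smoothing = 10         # same value as passed to TargetEncoder above
min_samples_leaf = 20  # assumed default, not set explicitly above

prior = train['target'].mean()  # overall target mean
stats = train.groupby('category')['target'].agg(['count', 'mean'])

# Sigmoid weight: moves toward 1 as a category gets more samples
weight = 1 / (1 + np.exp(-(stats['count'] - min_samples_leaf) / smoothing))

# Weighted average of the prior and the per-category mean
encoding = prior * (1 - weight) + stats['mean'] * weight
print(encoding.round(6))
```

With only one or two rows per category the weight stays small, so every encoded value remains close to the prior of 0.6.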


# Model templates

## XGBoost

In [None]:
import xgboost as xgb
from xgboost import plot_importance
from sklearn.metrics import mean_absolute_percentage_error
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
 
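# XGBoost training; X (features) and y (targets) are assumed to be defined already
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
 
# Note: the reg:gamma objective expects strictly positive targets
model = xgb.XGBRegressor(max_depth=5, learning_rate=0.5, n_estimators=160, objective='reg:gamma')
model.fit(X_train, y_train)
 
# Predict on the test set
ans = model.predict(X_test)
print(mean_absolute_percentage_error(y_test, ans))  # argument order is (y_true, y_pred)
# Plot feature importance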
plot_importance(model)
plt.show()
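
`mean_absolute_percentage_error` takes the true values first and the predictions second. The order matters because the absolute error is divided by the first argument. A small sketch with made-up numbers:

```python
from sklearn.metrics import mean_absolute_percentage_error

y_true = [100, 200]  # made-up ground truth
y_pred = [110, 180]  # made-up predictions

# Correct order: errors are scaled by the true values
mape_correct = mean_absolute_percentage_error(y_true, y_pred)

# Swapped order: errors are scaled by the predictions instead
mape_swapped = mean_absolute_percentage_error(y_pred, y_true)
print(mape_correct, mape_swapped)
```

The two results differ, so a swapped call silently reports a different metric.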

## CatBoost

In [None]:
from catboost import CatBoostRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
# Create sample data
X, y = make_regression(n_samples=10000, n_features=10, noise=0.1, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the model
model = CatBoostRegressor(
    iterations=1000,
    learning_rate=0.1,
    depth=6
)

model.fit(
    X_train, y_train,
    eval_set=(X_val, y_val),
    verbose=100,
    early_stopping_rounds=50
)


## Hyperparameter tuning with Optuna

### KFold

In [None]:
import warnings
import numpy as np
import optuna
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import KFold
from sklearn.datasets import make_regression
import pandas as pd

# Suppress all warnings
warnings.filterwarnings('ignore')

# X (a pandas DataFrame) and y (a pandas Series) are assumed to be defined already

# Define the Optuna objective function
def objective(trial):
    # Hyperparameter search space
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 1000, step=100),
        "max_depth": trial.suggest_int("max_depth", 3, 10),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1.0),
        "reg_alpha": trial.suggest_float("reg_alpha", 0.0, 10.0),
        "reg_lambda": trial.suggest_float("reg_lambda", 0.0, 10.0),
        "random_state": 42
    }
    
    # K-fold cross-validation
    kf = KFold(n_splits=5, shuffle=True, random_state=0)
    errors = []
    
    for train_index, val_index in kf.split(X):
        X_train, X_val = X.iloc[train_index], X.iloc[val_index]
        y_train, y_val = y.iloc[train_index], y.iloc[val_index]
        
        # Initialize and train the XGBoost model
        model = XGBRegressor(**params, early_stopping_rounds=50)
        model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
        
        # Predict on the validation fold
        val_preds = model.predict(X_val)
        fold_mape = mean_absolute_percentage_error(y_val, val_preds)
        errors.append(fold_mape)
    
    # Average MAPE across folds (no square root: this is MAPE, not RMSE)
    mape = np.mean(errors)
    print(mape)
    return mape

# Run the hyperparameter search with Optuna
study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=30)

# Print the best hyperparameters
print("Best trial:", study.best_trial.params)
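
If RMSE is the metric you actually want from cross-validation, one common approach is to average the per-fold MSE and take the square root once at the end. The sketch below shows only that aggregation step. To stay self-contained it swaps in `LinearRegression` and synthetic data from `make_regression` in place of XGBoost and your own `X`, `y`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic stand-in data (NumPy arrays, so plain indexing instead of .iloc)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_mse = []
for train_index, val_index in kf.split(X):
    model = LinearRegression()  # stand-in for the XGBoost model
    model.fit(X[train_index], y[train_index])
    val_preds = model.predict(X[val_index])
    fold_mse.append(mean_squared_error(y[val_index], val_preds))

# RMSE: average the per-fold MSE first, then take the square root
rmse = float(np.sqrt(np.mean(fold_mse)))
print(rmse)
```

Inside an Optuna objective, `rmse` would be the value to return in place of the averaged MAPE.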


In [None]:
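# Optuna's visualization helpers: optimization history, contour plots, parameter importances, etc.
from optuna.visualization import plot_optimization_history, plot_contour, plot_param_importances, plot_parallel_coordinate, plot_edf
# Optimization history plot for the study created above
fig1 = plot_optimization_history(study)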
fig1.show()

### Time Series

In [None]:
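from sklearn.model_selection import TimeSeriesSplit

# train_df is assumed to be sorted in time order
tscv = TimeSeriesSplit(n_splits=5)
for train_index, val_index in tscv.split(train_df):  # split() returns a lazy generator
    print(len(train_index), len(val_index))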
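
To make the splitting behaviour visible, the sketch below runs `TimeSeriesSplit` on a small array of 12 positions, a stand-in for 12 time-ordered rows, and prints the indices of every fold.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

samples = np.arange(12)  # stand-in for 12 rows sorted by time
tscv = TimeSeriesSplit(n_splits=3)

splits = []
for train_index, val_index in tscv.split(samples):
    splits.append((train_index.tolist(), val_index.tolist()))
    print("train:", train_index.tolist(), "val:", val_index.tolist())
```

Each validation block comes strictly after its training block and the training window grows from fold to fold, so the model is never validated on data older than what it was trained on.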

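## 8. Querying a DataFrame with SQL  
pandasql lets us work with a pandas DataFrame using SQL syntax.  

A DataFrame offers many ways to query, but sometimes you just want to use SQL 😋, and it often makes the query logic easier to show.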

In [None]:
!pip install pandasql

In [None]:
from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())
#Define a function called SQL
def SQL(query):
    return(pysqldf(query))

In [None]:
import pandas as pd
data = {'apples': [3, 2, 0, 1], 'oranges': [0, 3, 7, 2]}
data = pd.DataFrame(data)
query = '''
    SELECT *
    FROM data
    WHERE oranges > 0
'''

SQL(query)
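
For comparison, the same filter can be written with pandas' own `DataFrame.query`, which needs no extra package. This sketch rebuilds the small frame so it runs on its own.

```python
import pandas as pd

data = pd.DataFrame({'apples': [3, 2, 0, 1], 'oranges': [0, 3, 7, 2]})

# Equivalent of: SELECT * FROM data WHERE oranges > 0
result = data.query('oranges > 0')
print(result)
```

SQL through pandasql tends to pay off once the logic involves joins or grouped aggregates that read more naturally as a single statement.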

Deduplicating strings
```python
ls = set()
for l in """
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from copy import deepcopy
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import warnings
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
""".splitlines():
    if l.strip():  # skip the blank line produced by the leading newline
        ls.add(l)
```
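
A `set` discards the original order of the lines. If the order of first appearance should be kept, one option is `dict.fromkeys`, since dict keys preserve insertion order in Python 3.7+. A short sketch with a shortened import list:

```python
lines = """
import pandas as pd
import numpy as np
import pandas as pd
import warnings
import numpy as np
"""

# dict keys keep insertion order, so each first occurrence stays in place
unique_lines = list(dict.fromkeys(l for l in lines.splitlines() if l.strip()))
print("\n".join(unique_lines))
```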
