# Data Preprocessing (Regression)

## 1. Topic  
Hands-on practice in data preprocessing and feature engineering  
    
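## 2. Before Class  
2.1 Learners have completed the exploratory data analysis lab and know the basics of common data analysis libraries such as pandas, Matplotlib, Seaborn, and sklearn.  
2.2 Prepare the dataset for the exercise (this session uses the used-car dataset, used_car_train.csv and used_car_testA.csv, with price as the prediction target).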
    
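## 3. Learning Objectives  
3.1 Master the basic workflow and methods of data preprocessing.  
3.2 Become familiar with common feature engineering techniques.  
3.3 Learn to choose suitable preprocessing and feature engineering methods for a given requirement.  
3.4 Be able to analyze and resolve problems that arise during data preprocessing.  
3.5 Be able to apply feature engineering techniques to practical problems.  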
    
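## 4. Key Points  
4.1 Format conversion: convert table columns to the appropriate type, such as numeric, categorical, or datetime features.  
4.2 Data cleaning: remove or otherwise handle duplicates, outliers, and missing values so that the data is more accurate and reliable.  
4.3 Feature transformation: transform the raw features, for example by standardization, normalization, or discretization, to make the data easier to learn from.  
4.4 Feature selection: use statistical tests, mutual information, filter methods, and wrapper methods to pick the set of features most relevant to the prediction target.  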
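
The three transformations named in 4.3 can be sketched with pandas alone. This is a minimal illustration, not the course's own implementation: the frame `df`, the derived column names, and the choice of three equal-width bins are assumptions made for the example.

```python
import pandas as pd

# Illustrative values standing in for one numeric feature (assumed, not the full dataset).
df = pd.DataFrame({"power": [60, 0, 163, 193, 68]})

# Standardization (z-score): shift to zero mean and scale to unit standard deviation.
df["power_std"] = (df["power"] - df["power"].mean()) / df["power"].std(ddof=0)

# Normalization (min-max): rescale the values to the [0, 1] range.
p_min, p_max = df["power"].min(), df["power"].max()
df["power_minmax"] = (df["power"] - p_min) / (p_max - p_min)

# Discretization: cut the range into three equal-width bins, labeled 0, 1, 2.
df["power_bin"] = pd.cut(df["power"], bins=3, labels=False)

print(df)
```

sklearn offers equivalent transformers (for example StandardScaler and MinMaxScaler) that are usually preferable once the same transformation has to be applied to both a training set and a test set.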
    
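## 5. Difficult Points  
5.1 Choosing a suitable data preprocessing method for a given requirement.  
5.2 Handling outliers and missing values during data preprocessing.  
5.3 Methods for feature selection, feature extraction, and feature construction.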

## Implementation Steps  
### Step 1: Start Jupyter Notebook  
+ Type "cmd" in the search bar to open a Command Prompt window.  
+ Enter the command "jupyter notebook" and press Enter to start Jupyter Notebook.  
    
### Step 2: Create a New Notebook  
+ In the Jupyter web interface, click the "New" button in the upper-right corner.  
+ Select the "Python 3" kernel to create a new Python 3 notebook.  
    
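### Step 3: Import the Required Libraries
+ We typically use numpy and pandas for data cleaning; sklearn for feature processing (such as encoding and discretization) and feature selection; and matplotlib and seaborn for visualization.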

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set a font that can render Chinese characters in charts
plt.rcParams['font.sans-serif'] = ['SimHei']
# Render the minus sign correctly in charts
plt.rcParams['axes.unicode_minus'] = False

### Step 4: Load the Dataset

In [4]:
# Load the CSV files with pandas
train = pd.read_csv('used_car_train.csv')
test = pd.read_csv('used_car_testA.csv')
train

Unnamed: 0,SaleID name regDate model brand bodyType fuelType gearbox power kilometer notRepairedDamage regionCode seller offerType creatDate price v_0 v_1 v_2 v_3 v_4 v_5 v_6 v_7 v_8 v_9 v_10 v_11 v_12 v_13 v_14
0,0 736 20040402 30.0 6 1.0 0.0 0.0 60 12.5 0.0 ...
1,1 2262 20030301 40.0 1 2.0 0.0 0.0 0 15.0 - 43...
2,2 14874 20040403 115.0 15 1.0 0.0 0.0 163 12.5...
3,3 71865 19960908 109.0 10 0.0 0.0 1.0 193 15.0...
4,4 111080 20120103 110.0 5 1.0 0.0 0.0 68 5.0 0...
...,...
149995,149995 163978 20000607 121.0 10 4.0 0.0 1.0 16...
149996,149996 184535 20091102 116.0 11 0.0 0.0 0.0 12...
149997,149997 147587 20101003 60.0 11 1.0 1.0 0.0 90 ...
149998,149998 45907 20060312 34.0 10 3.0 1.0 0.0 156 ...


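The data was not read correctly: instead of a two-dimensional table with one column per feature, every row ended up in a single column. The fields in the file are separated by spaces, whereas pandas' read_csv uses a comma as its default delimiter. The delimiter therefore has to be set explicitly to a space.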

In [5]:
train = pd.read_csv('used_car_train.csv', sep=' ')
test = pd.read_csv('used_car_testA.csv', sep=' ')
train

Unnamed: 0,SaleID,name,regDate,model,brand,bodyType,fuelType,gearbox,power,kilometer,...,v_5,v_6,v_7,v_8,v_9,v_10,v_11,v_12,v_13,v_14
0,0,736,20040402,30.0,6,1.0,0.0,0.0,60,12.5,...,0.235676,0.101988,0.129549,0.022816,0.097462,-2.881803,2.804097,-2.420821,0.795292,0.914762
1,1,2262,20030301,40.0,1,2.0,0.0,0.0,0,15.0,...,0.264777,0.121004,0.135731,0.026597,0.020582,-4.900482,2.096338,-1.030483,-1.722674,0.245522
2,2,14874,20040403,115.0,15,1.0,0.0,0.0,163,12.5,...,0.251410,0.114912,0.165147,0.062173,0.027075,-4.846749,1.803559,1.565330,-0.832687,-0.229963
3,3,71865,19960908,109.0,10,0.0,0.0,1.0,193,15.0,...,0.274293,0.110300,0.121964,0.033395,0.000000,-4.509599,1.285940,-0.501868,-2.438353,-0.478699
4,4,111080,20120103,110.0,5,1.0,0.0,0.0,68,5.0,...,0.228036,0.073205,0.091880,0.078819,0.121534,-1.896240,0.910783,0.931110,2.834518,1.923482
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
149995,149995,163978,20000607,121.0,10,4.0,0.0,1.0,163,15.0,...,0.280264,0.000310,0.048441,0.071158,0.019174,1.988114,-2.983973,0.589167,-1.304370,-0.302592
149996,149996,184535,20091102,116.0,11,0.0,0.0,0.0,125,10.0,...,0.253217,0.000777,0.084079,0.099681,0.079371,1.839166,-2.774615,2.553994,0.924196,-0.272160
149997,149997,147587,20101003,60.0,11,1.0,1.0,0.0,90,6.0,...,0.233353,0.000705,0.118872,0.100118,0.097914,2.439812,-1.630677,2.290197,1.891922,0.414931
149998,149998,45907,20060312,34.0,10,3.0,1.0,0.0,156,15.0,...,0.256369,0.000252,0.081479,0.083558,0.081498,2.075380,-2.633719,1.414937,0.431981,-1.659014
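
The effect of the sep argument can be reproduced on a tiny in-memory sample, without the course files. This is a minimal sketch: the string `raw` and the names `wrong` and `right` are made up for the illustration and contain only three of the real columns.

```python
import io
import pandas as pd

# Two-row stand-in for a space-delimited file (assumed sample, not the real data).
raw = "SaleID name regDate\n0 736 20040402\n1 2262 20030301\n"

# With the default comma separator, each line is parsed as a single field.
wrong = pd.read_csv(io.StringIO(raw))

# With sep=' ', each line is split on single spaces and the columns are recovered.
right = pd.read_csv(io.StringIO(raw), sep=" ")

print(wrong.shape)  # (2, 1)
print(right.shape)  # (2, 3)
```

Checking the shape right after loading is a quick way to catch a wrong delimiter before any further processing.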


### Step 5: Browse Summary Information About the Data  
5.1 Use info() to inspect the column names and missing values

In [6]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 31 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   SaleID             150000 non-null  int64  
 1   name               150000 non-null  int64  
 2   regDate            150000 non-null  int64  
 3   model              149999 non-null  float64
 4   brand              150000 non-null  int64  
 5   bodyType           145494 non-null  float64
 6   fuelType           141320 non-null  float64
 7   gearbox            144019 non-null  float64
 8   power              150000 non-null  int64  
 9   kilometer          150000 non-null  float64
 10  notRepairedDamage  150000 non-null  object 
 11  regionCode         150000 non-null  int64  
 12  seller             150000 non-null  int64  
 13  offerType          150000 non-null  int64  
 14  creatDate          150000 non-null  int64  
 15  price              150000 non-null  int64  
 16  v_
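
The listing shows regDate stored as int64 even though its values (for example 20040402) encode a date. One possible format conversion is sketched below; it assumes the values follow a YYYYMMDD layout, and the frame `sample` and the new column names are hypothetical.

```python
import pandas as pd

# Small stand-in frame; the values mimic the regDate layout seen in the table output.
sample = pd.DataFrame({"regDate": [20040402, 20030301, 19960908]})

# Convert the integers to strings, then parse them with an explicit format.
# errors="coerce" turns values that do not match the format into NaT instead of raising.
sample["regDate_dt"] = pd.to_datetime(
    sample["regDate"].astype(str), format="%Y%m%d", errors="coerce"
)

# Once the column is a datetime, components such as the year are easy to extract.
sample["reg_year"] = sample["regDate_dt"].dt.year

print(sample.dtypes)
```

Using errors="coerce" is a deliberate choice here: malformed dates then show up as missing values that can be counted and handled, rather than stopping the conversion.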

In [7]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 30 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   SaleID             50000 non-null  int64  
 1   name               50000 non-null  int64  
 2   regDate            50000 non-null  int64  
 3   model              50000 non-null  float64
 4   brand              50000 non-null  int64  
 5   bodyType           48587 non-null  float64
 6   fuelType           47107 non-null  float64
 7   gearbox            48090 non-null  float64
 8   power              50000 non-null  int64  
 9   kilometer          50000 non-null  float64
 10  notRepairedDamage  50000 non-null  object 
 11  regionCode         50000 non-null  int64  
 12  seller             50000 non-null  int64  
 13  offerType          50000 non-null  int64  
 14  creatDate          50000 non-null  int64  
 15  v_0                50000 non-null  float64
 16  v_1                500
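
The two listings differ by one column: train has 31 columns and test has 30, and price appears only in train. A quick way to confirm such a difference programmatically is sketched below, using empty toy frames (`train_demo` and `test_demo` are hypothetical names) that carry only a few of the real column names.

```python
import pandas as pd

# Toy frames with a subset of the columns; only the column names matter here.
train_demo = pd.DataFrame(columns=["SaleID", "name", "price", "v_0"])
test_demo = pd.DataFrame(columns=["SaleID", "name", "v_0"])

# Columns that exist in the training frame but not in the test frame.
only_in_train = train_demo.columns.difference(test_demo.columns).tolist()

print(only_in_train)  # ['price']
```

Applied to the real frames, the same call would be train.columns.difference(test.columns).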

5.2 Use isnull() and sum() to count missing values