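Import the required libraries → read the data → explore the data → preprocess the data: data cleaning, data transformation (feature construction, standardization and normalization, **encoding categorical data as numbers**), **dimensionality reduction**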

# I. Handling Categorical Data

In [2]:
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer
# Show the output of every expression in a cell, not only the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [3]:
# Load the data
housing = pd.read_csv('dataset/housing.csv')
X_cla = housing[['ocean_proximity']]
X_cla['ocean_proximity'].value_counts()


ocean_proximity
<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
Name: count, dtype: int64

`value_counts()` is called on the `ocean_proximity` column to count how many times each distinct category occurs.

## 1. Ordinal categorical data

In [4]:
X_cla

Unnamed: 0,ocean_proximity
0,NEAR BAY
1,NEAR BAY
2,NEAR BAY
3,NEAR BAY
4,NEAR BAY
...,...
20635,INLAND
20636,INLAND
20637,INLAND
20638,INLAND


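In the `ocean_proximity` column, values such as "NEAR BAY" and "INLAND" describe where a property sits relative to the ocean. Whether they count as ordinal data depends on whether those descriptions have an inherent order. For example, if we take "NEAR BAY" to mean closer to the bay and "INLAND" to mean farther from it, the labels can be treated as ordered. If that order also correlates with the model's target (such as house price), ordinal encoding may be the more meaningful choice.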

In [5]:
from sklearn.preprocessing import OrdinalEncoder
encoder = OrdinalEncoder()  # initialize the encoder
X_Ordinal = encoder.fit_transform(X_cla)
X_Ordinal

array([[3.],
       [3.],
       [3.],
       ...,
       [1.],
       [1.],
       [1.]])

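fit: learns every distinct category in the `ocean_proximity` column of `X_cla` and assigns each one an integer code.

transform: converts the categories to their codes.

The transformed data `X_Ordinal` is a NumPy array of those codes, stored as floats. Only part of the output is shown here; each number is the code of one `ocean_proximity` category.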

In [6]:
encoder.categories_

[array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],
       dtype=object)]
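
As a minimal sketch of how the codes relate to `categories_`, the toy example below (made-up rows, not the housing data) shows that a category's position in `encoder.categories_[0]` is its code, and that `inverse_transform` recovers the original labels. The names `toy`, `mapping`, and `decoded` are illustrative only.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy column with three of the housing categories (illustrative rows only)
toy = pd.DataFrame({"ocean_proximity": ["NEAR BAY", "INLAND", "<1H OCEAN", "INLAND"]})

encoder = OrdinalEncoder()
codes = encoder.fit_transform(toy)

# By default the categories are sorted; the index of each category is its code
mapping = {cat: float(i) for i, cat in enumerate(encoder.categories_[0])}
print(mapping)

# inverse_transform maps the codes back to the original labels
decoded = encoder.inverse_transform(codes)
print(decoded.ravel())
```

Because the default order is just the sorted order of the strings, it carries no meaning about distance to the ocean.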

**To specify the category order yourself, set the `categories` parameter.**

In [7]:
from sklearn.preprocessing import OrdinalEncoder
order=['NEAR BAY','<1H OCEAN','INLAND','ISLAND','NEAR OCEAN']
encoder=OrdinalEncoder(categories=[order])
X_Ordinal_Give=encoder.fit_transform(X_cla)
X_Ordinal_Give

array([[0.],
       [0.],
       [0.],
       ...,
       [2.],
       [2.],
       [2.]])

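The list `order` contains the categories of the `ocean_proximity` feature arranged in a chosen sequence. The sequence is defined by hand and can follow business requirements or the logic of the data.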
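
The effect of `categories` is easier to see on a feature with an obvious order. The sketch below uses a hypothetical `size` column (not part of the housing data) to compare the default, alphabetical codes with codes from an explicit order.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordered feature, used only for illustration
sizes = pd.DataFrame({"size": ["small", "large", "medium", "small"]})

# Default: categories are sorted alphabetically (large < medium < small)
default_codes = OrdinalEncoder().fit_transform(sizes).ravel()

# Explicit order: small < medium < large
ordered = OrdinalEncoder(categories=[["small", "medium", "large"]])
ordered_codes = ordered.fit_transform(sizes).ravel()

print(default_codes)
print(ordered_codes)
```

With the explicit order, a larger code really does mean a larger size, which is the property an ordinal encoding is supposed to preserve.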

## 2. Nominal (unordered) categorical data

Method 1: use `toarray()`

In [9]:
from sklearn.preprocessing import OneHotEncoder
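# Returns a sparse matrix by default
encoder=OneHotEncoder()
# Fit the encoder and transform the data to one-hot form
X_OneHot_sparse=encoder.fit_transform(X_cla)
# Convert the sparse matrix to a dense NumPy array with toarray()
X_OneHot=X_OneHot_sparse.toarray()
# Display the result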
X_OneHot

array([[0., 0., 0., 1., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 1., 0.],
       ...,
       [0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0.]])

In [10]:
encoder.categories_

[array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],
       dtype=object)]

Method 2: pass `sparse_output=False` when initializing the encoder

In [11]:
from sklearn.preprocessing import OneHotEncoder
encoder=OneHotEncoder(sparse_output=False)
X_OneHot=encoder.fit_transform(X_cla)
colu=encoder.categories_[0]  # category names of the single encoded column
X_OneHot=pd.DataFrame(X_OneHot,columns=colu)
X_OneHot

Unnamed: 0,<1H OCEAN,INLAND,ISLAND,NEAR BAY,NEAR OCEAN
0,0.0,0.0,0.0,1.0,0.0
1,0.0,0.0,0.0,1.0,0.0
2,0.0,0.0,0.0,1.0,0.0
3,0.0,0.0,0.0,1.0,0.0
4,0.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...
20635,0.0,1.0,0.0,0.0,0.0
20636,0.0,1.0,0.0,0.0,0.0
20637,0.0,1.0,0.0,0.0,0.0
20638,0.0,1.0,0.0,0.0,0.0


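The `sparse_output` parameter requires scikit-learn 1.2 or later; earlier versions call it `sparse`. To check the installed version, run `import sklearn` followed by `print(sklearn.__version__)`.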

In [12]:
encoder.categories_

[array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],
       dtype=object)]

##### What are the differences and connections between OrdinalEncoder and OneHotEncoder?
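
One way to explore this question is to run both encoders on the same small column. The sketch below uses made-up rows and illustrative variable names. It shows that `OrdinalEncoder` produces one column of codes while `OneHotEncoder` produces one 0/1 column per category, and that both rely on the same sorted category list.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Toy column (illustrative rows only)
toy = pd.DataFrame({"ocean_proximity": ["NEAR BAY", "INLAND", "<1H OCEAN", "INLAND"]})

ordinal = OrdinalEncoder().fit_transform(toy)          # shape (4, 1): one code per row
onehot = OneHotEncoder().fit_transform(toy).toarray()  # shape (4, 3): one column per category

print(ordinal.shape, onehot.shape)

# Connection: the position of the 1 in each one-hot row equals the ordinal code
print(onehot.argmax(axis=1))
print(ordinal.ravel())
```

The ordinal codes imply an order and a distance between categories; the one-hot columns do not, at the cost of a wider feature matrix.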

# II. Dimensionality Reduction

In [25]:
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OrdinalEncoder
housing = pd.read_csv('dataset/housing.csv')
X = housing.drop('median_house_value',axis = 1)
X['ocean_proximity'] = X_Ordinal
X

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,ocean_proximity
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,3.0
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,3.0
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,3.0
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,3.0
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,3.0
...,...,...,...,...,...,...,...,...,...
20635,-121.09,39.48,25.0,1665.0,374.0,845.0,330.0,1.5603,1.0
20636,-121.21,39.49,18.0,697.0,150.0,356.0,114.0,2.5568,1.0
20637,-121.22,39.43,17.0,2254.0,485.0,1007.0,433.0,1.7000,1.0
20638,-121.32,39.43,18.0,1860.0,409.0,741.0,349.0,1.8672,1.0


In [26]:
X.isnull().sum()

longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
ocean_proximity         0
dtype: int64

In [28]:
# Impute missing values
X_miss=X[['total_bedrooms']]
impute=SimpleImputer(strategy='median')
filled_values=impute.fit_transform(X_miss)
filled_values=pd.DataFrame(filled_values,columns=['total_bedrooms'])
X['total_bedrooms']=filled_values
# Min-max scaling
scale=MinMaxScaler()
Stand_X=scale.fit_transform(X)
Stand_X=pd.DataFrame(Stand_X,columns=X.columns)  # convert the array back to a DataFrame
Stand_X
X.isnull().sum()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,ocean_proximity
0,0.211155,0.567481,0.784314,0.022331,0.019863,0.008941,0.020556,0.539668,0.75
1,0.212151,0.565356,0.392157,0.180503,0.171477,0.067210,0.186976,0.538027,0.75
2,0.210159,0.564293,1.000000,0.037260,0.029330,0.013818,0.028943,0.466028,0.75
3,0.209163,0.564293,1.000000,0.032352,0.036313,0.015555,0.035849,0.354699,0.75
4,0.209163,0.564293,1.000000,0.041330,0.043296,0.015752,0.042427,0.230776,0.75
...,...,...,...,...,...,...,...,...,...
20635,0.324701,0.737513,0.470588,0.042296,0.057883,0.023599,0.054103,0.073130,0.25
20636,0.312749,0.738576,0.333333,0.017676,0.023122,0.009894,0.018582,0.141853,0.25
20637,0.311753,0.732200,0.313725,0.057277,0.075109,0.028140,0.071041,0.082764,0.25
20638,0.301793,0.732200,0.333333,0.047256,0.063315,0.020684,0.057227,0.094295,0.25


longitude             0
latitude              0
housing_median_age    0
total_rooms           0
total_bedrooms        0
population            0
households            0
median_income         0
ocean_proximity       0
dtype: int64

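This cell uses `SimpleImputer` to fill the missing values in `total_bedrooms` with the column median, then applies `MinMaxScaler` to the whole dataset so that every feature lies between 0 and 1. Scaling like this can help many machine learning algorithms, particularly those that are sensitive to feature scale, such as gradient-based linear and logistic regression, neural networks, and PCA. Putting all features on a common scale keeps any one feature from dominating simply because of its units.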
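
With its default range of 0 to 1, `MinMaxScaler` rescales each column as `(x - min) / (max - min)`. The sketch below checks that formula by hand against the scaler on a small made-up array; `X_demo`, `manual`, and `scaled` are illustrative names.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two made-up features on very different scales
X_demo = np.array([[1.0, 100.0],
                   [2.0, 300.0],
                   [4.0, 500.0]])

# Manual min-max scaling, column by column
col_min = X_demo.min(axis=0)
col_max = X_demo.max(axis=0)
manual = (X_demo - col_min) / (col_max - col_min)

# The same transformation done by scikit-learn
scaled = MinMaxScaler().fit_transform(X_demo)

print(manual)
print(np.allclose(manual, scaled))
```

Each column's minimum maps to 0 and its maximum to 1, regardless of the original units.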

In [29]:
from sklearn.decomposition import PCA
# Keep 3 principal components
pca=PCA(n_components=3)
X_pca=pca.fit_transform(Stand_X)  # reduce the data; the result is displayed below
pd.DataFrame(X_pca)

Unnamed: 0,0,1,2
0,0.595048,0.099259,0.154594
1,0.534706,0.111291,-0.273266
2,0.624337,0.088149,0.366688
3,0.626105,0.090379,0.374978
4,0.627622,0.092204,0.383064
...,...,...,...
20635,0.132178,0.395521,-0.058097
20636,0.117894,0.409467,-0.187086
20637,0.112544,0.405984,-0.215389
20638,0.118692,0.410855,-0.193428


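This cell applies PCA to the preprocessed data. Keeping only the first three principal components reduces the data from nine features to three, which cuts its complexity and computational cost. How much of the original variance those three components retain is not shown here; it can be checked with `pca.explained_variance_ratio_`. Dimensionality reduction matters most for high-dimensional datasets, where it can make analysis and model training more efficient. Fewer dimensions can also simplify training and may help reduce overfitting. Wrapping the result in a DataFrame makes it easier to inspect and analyze further.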
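
The following sketch shows how the retained variance could be inspected. It runs on random placeholder data rather than the housing set, so the printed ratios only demonstrate the API and say nothing about the notebook's result.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random placeholder data with 9 features (NOT the housing data)
rng = np.random.default_rng(0)
X_demo = rng.random((200, 9))

pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X_demo)

# Fraction of the total variance captured by each kept component
ratios = pca.explained_variance_ratio_
retained = ratios.sum()

print(X_reduced.shape)
print(ratios)
print(f"variance retained by 3 components: {retained:.2%}")
```

If the retained share is low on real data, keeping more components is one option to consider.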

In [30]:
X_pca.shape

(20640, 3)

# III. Modular Data Preprocessing

In [33]:
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import Normalizer
from sklearn.preprocessing import OrdinalEncoder  
from sklearn.preprocessing import OneHotEncoder
import pandas as pd 

In [34]:
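# Handle missing values
def X_missHand(X_miss):
    impute = SimpleImputer()  # the default strategy fills with the column mean
    filled_values = impute.fit_transform(X_miss)
    return filled_values

# Scale the features
# way: '均值方差' (z-score standardization), '离差方差' (min-max scaling), '归一化' (per-sample normalization)
def X_trans(X,way):
    # Check that the way parameter is valid
    if way not in ['均值方差', '离差方差', '归一化']:
        raise ValueError("Invalid way parameter. Must be one of: '均值方差', '离差方差', '归一化'.")
    if way == '均值方差':
        scale = StandardScaler()
    elif way == '离差方差':
        scale = MinMaxScaler()
    else:
        scale = Normalizer()
    stand_X = scale.fit_transform(X)
    return stand_X

# Reduce dimensionality with PCA
def X_PCA(X):
    pca = PCA(n_components=3)
    X_pca = pca.fit_transform(X)
    return X_pca

# Encode categorical data (ordinal or one-hot)
def classHand(X_cla,way,order=None):
    if way == 'Ordinal':
        if order:
            encoder = OrdinalEncoder(categories=[order])
        else:
            encoder = OrdinalEncoder()
    elif way == 'OneHot':
        # `sparse` was renamed to `sparse_output` in scikit-learn 1.2
        encoder = OneHotEncoder(sparse_output=False)
    else:
        raise ValueError("Invalid way parameter. Must be 'Ordinal' or 'OneHot'.")
    return encoder.fit_transform(X_cla)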

In [35]:
housing = pd.read_csv('dataset/housing.csv')
X = housing.drop('median_house_value',axis = 1)
X_cla = X[['ocean_proximity']]
Y = housing[['median_house_value']]
# 1. Handle missing values
filled = X_missHand(X[['total_bedrooms']])
X['total_bedrooms'] = filled
# 2. Encode the categorical data
X_cla = classHand(X_cla,'Ordinal')
X['ocean_proximity'] = X_cla
# 3. Scale the features (min-max)
Stand_X = X_trans(X,'离差方差')
Stand_X = pd.DataFrame(Stand_X,columns=X.columns)
# 4. Reduce dimensionality
X_pca = X_PCA(Stand_X)
pd.DataFrame(X_pca)

Unnamed: 0,0,1,2
0,0.595051,0.099264,0.154605
1,0.534706,0.111287,-0.273249
2,0.624340,0.088155,0.366701
3,0.626108,0.090385,0.374991
4,0.627625,0.092210,0.383078
...,...,...,...
20635,0.132181,0.395523,-0.058087
20636,0.117897,0.409469,-0.187079
20637,0.112547,0.405984,-0.215380
20638,0.118694,0.410856,-0.193419
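
The same sequence of steps could also be expressed with scikit-learn's own composition tools. The sketch below is an alternative to the helper functions, not the approach used in this notebook: it chains imputation, ordinal encoding, min-max scaling, and PCA with `ColumnTransformer` and `Pipeline`. The rows in `toy` are made up and only loosely modeled on the housing columns.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder

# Made-up rows for illustration, with one missing value
toy = pd.DataFrame({
    "housing_median_age": [41.0, 21.0, 52.0, 52.0, 30.0, 25.0],
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0, 1627.0, 1665.0],
    "total_bedrooms": [129.0, 1106.0, np.nan, 235.0, 280.0, 374.0],
    "median_income": [8.3, 8.3, 7.3, 5.6, 3.8, 1.6],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND", "INLAND", "<1H OCEAN", "INLAND"],
})
numeric_cols = ["housing_median_age", "total_rooms", "total_bedrooms", "median_income"]

# Impute numeric columns and encode the categorical column in one step
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric_cols),
    ("cat", OrdinalEncoder(), ["ocean_proximity"]),
])

# Then scale every resulting column and reduce to 3 components
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("scale", MinMaxScaler()),
    ("pca", PCA(n_components=3)),
])

X_out = pipeline.fit_transform(toy)
print(X_out.shape)
```

A pipeline object keeps the fitted imputer, encoder, scaler, and PCA together, so the identical transformation can later be applied to new data with `pipeline.transform`.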


# IV. Splitting the Data into Training and Test Sets

In [36]:
from sklearn.model_selection import train_test_split

In [37]:
X_train,X_test,Y_train,Y_test = train_test_split(X_pca,Y,test_size=0.3,random_state=42) # the test set takes 30%
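
As a small illustration of what `test_size` and `random_state` do, the sketch below splits ten made-up samples. The arrays and variable names are placeholders, not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up samples with two features each
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# test_size=0.3 puts 30% of the rows in the test set
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)
print(len(X_tr), len(X_te))

# The same random_state reproduces the same split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)
print(np.array_equal(X_te, X_te2))
```

Fixing `random_state` makes results comparable across runs, because every run evaluates on the same held-out rows.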

# V. Building and Training the Model

In [55]:
from sklearn.tree import DecisionTreeRegressor
tree_reg=DecisionTreeRegressor() # initialize the estimator
tree_reg.fit(X_train,Y_train) # fit the estimator on the training set
Y_test_pred=tree_reg.predict(X_test) # predict on the test set
Y_train_pred=tree_reg.predict(X_train)
# Y_test_pred
# Y_train_pred

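This cell trains a decision tree regression model (`DecisionTreeRegressor`) and uses it to predict. A decision tree is a supervised learning algorithm used for both classification and regression; it builds a model by learning decision rules that link the input features to the target.

The code first imports the `DecisionTreeRegressor` class from `sklearn.tree` and creates an instance of it. This is where the model is initialized and where parameters such as the maximum depth or the minimum number of samples needed to split a node could be set; this example uses the defaults.

The model is then trained on `X_train` and the corresponding targets `Y_train`. During training the tree learns from the features and searches for the best split points, aiming to reduce the prediction error as much as possible at each node.

After training, the `predict` method is applied to the test set `X_test` to produce `Y_test_pred`, that is, predictions for new data based on the rules learned during training. The same method is applied to `X_train` to produce `Y_train_pred`, which is typically used to check how closely the model fits the training data.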

In [56]:
from sklearn.metrics import mean_squared_error
import numpy as np
tree_mse=mean_squared_error(Y_train,Y_train_pred)
tree_rmse=np.sqrt(tree_mse)
tree_rmse

0.0

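### Evaluating the decision tree regression model
First, `mean_squared_error` is imported from `sklearn.metrics`. It computes the mean of the squared differences between predictions and true values. Here it compares `Y_train` (the actual training targets) with `Y_train_pred` (the model's predictions on the training set).

The resulting `tree_mse` is a scalar: the average squared prediction error. MSE is a common metric for regression models, and smaller values mean more accurate predictions.

Next, NumPy's `sqrt` function takes the square root of `tree_mse` to give `tree_rmse`. RMSE is easier to interpret because it is in the same units as the target variable. Like MSE, it weights large errors more heavily, because the errors are squared before they are averaged. It can be read as the typical size of the gap between predicted and actual values.

Finally, displaying `tree_rmse` gives a single quantitative measure of how well the model fits the training set. A low training RMSE means a close fit to the training data, and a high one suggests underfitting. The value of 0.0 here means the unconstrained tree reproduces the training targets exactly, which is a warning sign of overfitting; only the test-set error shows how well the model generalizes.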
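
To make the two metrics concrete, here is a worked example in plain Python on three made-up values. It follows the definitions directly: square each error, average the squares to get the MSE, then take the square root to get the RMSE.

```python
import math

# Made-up targets and predictions, for illustration only
y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]

# Squared error for each sample: (3-2)^2, (5-5)^2, (7-9)^2
squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]

mse = sum(squared_errors) / len(squared_errors)  # mean of the squared errors
rmse = math.sqrt(mse)                            # back in the units of the target

print(squared_errors)
print(mse, rmse)
```

The single error of 2 contributes four times as much to the MSE as the error of 1, which is why both metrics are sensitive to large misses.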

In [57]:
from sklearn.metrics import mean_squared_error
tree_mse=mean_squared_error(Y_test,Y_test_pred)
tree_rmse=np.sqrt(tree_mse)
tree_rmse

102807.62260893072

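This cell evaluates the decision tree on the test set, that is, on data the model has not seen, which is the key check of its ability to generalize. A lower RMSE generally means better performance; a higher RMSE may mean the model fits the data poorly, or that it has overfit the training data and fails to generalize. Here the test RMSE is about 102,808 while the training RMSE was 0.0, so overfitting is the likely cause. This kind of evaluation guides model iteration and algorithm selection and is a basic step in model development and validation.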
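
The gap between training and test error can be reproduced on synthetic data. The sketch below fits an unconstrained tree to a noisy sine curve. The data and variable names are assumptions for illustration, and the numbers it prints are unrelated to the housing results.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a sine curve plus noise (NOT the housing data)
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.3, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

# No depth limit: the tree can keep splitting until every leaf is pure
tree = DecisionTreeRegressor(random_state=42).fit(X_tr, y_tr)

train_rmse = np.sqrt(mean_squared_error(y_tr, tree.predict(X_tr)))
test_rmse = np.sqrt(mean_squared_error(y_te, tree.predict(X_te)))
print(train_rmse, test_rmse)
```

The tree memorizes the noise in the training rows, so its training error is essentially zero while its error on held-out rows is not.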

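# Exercise 3: Applying a Decision Tree Regression Model to House Price Prediction
## 1. Data preprocessing
### 1.1 Missing values: fill the missing values in the `total_bedrooms` feature with a suitable technique, and briefly explain why you chose that strategy.
### 1.2 Categorical data: apply a suitable encoding to the `ocean_proximity` feature, and explain why you chose it.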

In [92]:
# Complete the exercise here
from sklearn.impute import SimpleImputer
import pandas as pd 
from sklearn.preprocessing import OrdinalEncoder
# Load the data
housing = pd.read_csv('dataset/housing.csv')
X = housing.drop('median_house_value',axis = 1)
# Fill the missing values in total_bedrooms with the median
imputer=SimpleImputer(strategy='median')
# Apply the imputer to the column that has missing values
filled_values=imputer.fit_transform(X[['total_bedrooms']])
# Write the filled values back
X['total_bedrooms'] = filled_values

# Encode ocean_proximity as ordinal codes
encoder=OrdinalEncoder()
X['ocean_proximity']=encoder.fit_transform(X[['ocean_proximity']])
print(X.head())

   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \
0    -122.23     37.88                41.0        880.0           129.0   
1    -122.22     37.86                21.0       7099.0          1106.0   
2    -122.24     37.85                52.0       1467.0           190.0   
3    -122.25     37.85                52.0       1274.0           235.0   
4    -122.25     37.85                52.0       1627.0           280.0   

   population  households  median_income  ocean_proximity  
0       322.0       126.0         8.3252              3.0  
1      2401.0      1138.0         8.3014              3.0  
2       496.0       177.0         7.2574              3.0  
3       558.0       219.0         5.6431              3.0  
4       565.0       259.0         3.8462              3.0  


### 1.3 Scale the processed features. Use MinMaxScaler to scale all numeric features in the dataset, and explain why scaling matters for this dataset.

In [89]:
# Complete the exercise here
from sklearn.preprocessing import MinMaxScaler
# Initialize the scaler
scaler = MinMaxScaler()
# Select the numeric columns to scale
numeric_columns=['total_rooms','total_bedrooms','population','households','median_income']
X[numeric_columns] = scaler.fit_transform(X[numeric_columns])
print(X.head())

   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \
0    -122.23     37.88                41.0     0.022331        0.019863   
1    -122.22     37.86                21.0     0.180503        0.171477   
2    -122.24     37.85                52.0     0.037260        0.029330   
3    -122.25     37.85                52.0     0.032352        0.036313   
4    -122.25     37.85                52.0     0.041330        0.043296   

   population  households  median_income  ocean_proximity  
0    0.008941    0.020556       0.539668              3.0  
1    0.067210    0.186976       0.538027              3.0  
2    0.013818    0.028943       0.466028              3.0  
3    0.015555    0.035849       0.354699              3.0  
4    0.015752    0.042427       0.230776              3.0  


Scaling helps avoid numerical problems: very large values can cause issues such as floating-point overflow, or make an algorithm less efficient.

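## 2. Model training and prediction
### Train a decision tree regression model on the split dataset and use it to make predictions. Explain why you chose a decision tree regression model, and discuss the split ratio and its potential effect on model evaluation.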

In [93]:
# Complete the exercise here
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
import numpy as np
# Target column
Y=housing['median_house_value']
# Split into training and test sets
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.3,random_state=42) # the test set takes 30%
# Decision tree regression model
tree_reg=DecisionTreeRegressor(random_state=42) # initialize the estimator

# Train the model
tree_reg.fit(X_train,Y_train) # fit the estimator on the training set
# Make predictions
Y_test_pred=tree_reg.predict(X_test) # predict on the test set

# Compute the mean squared error
tree_mse=mean_squared_error(Y_test,Y_test_pred)
# Take the square root to get the RMSE
tree_rmse=np.sqrt(tree_mse)
tree_rmse

69925.19065813976

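Reasons for choosing a decision tree regression model:

1. Interpretability: the decision process of a tree is transparent and shows directly how the features affect the prediction.
2. Non-linear relationships: a decision tree can capture non-linear relationships in the data.

Split ratio and its potential effect on model evaluation:

1. Training set: the data used to train the model.
2. Validation set: the data used for model selection and hyperparameter tuning.
3. Test set: the data used for the final assessment of the model's ability to generalize.

The code above uses a 70/30 train/test split (`test_size=0.3`) with no separate validation set. A larger test share gives a more stable error estimate but leaves less data for training.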

## 3. What to do when prediction performance is poor

In [91]:
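# Complete the exercise here
tree_reg=DecisionTreeRegressor(max_depth=10,min_samples_split=20,min_samples_leaf=10,random_state=42) # initialize the estimator with limits on tree growth
tree_reg.fit(X_train,Y_train) # fit the estimator on the training set
Y_test_pred=tree_reg.predict(X_test) # predict on the test set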

tree_mse=mean_squared_error(Y_test,Y_test_pred)
tree_rmse=np.sqrt(tree_mse)
tree_rmse

59982.46380466434
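
The values `max_depth=10`, `min_samples_split=20`, and `min_samples_leaf=10` above were chosen by hand. One possible next step, sketched here on synthetic data rather than the housing set, is to let cross-validation pick such values with `GridSearchCV`. The parameter grid and the data are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (NOT the housing data)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(400, 2))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] + rng.normal(0, 0.3, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

# Candidate settings that limit how far the tree can grow
param_grid = {
    "max_depth": [3, 5, 10, None],
    "min_samples_leaf": [1, 10, 20],
}

# 5-fold cross-validation on the training set only; the score is negative RMSE
search = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_grid,
    scoring="neg_root_mean_squared_error",
    cv=5,
)
search.fit(X_tr, y_tr)

best_cv_rmse = -search.best_score_
test_rmse = -search.score(X_te, y_te)  # score() uses the same negative-RMSE scorer
print(search.best_params_)
print(best_cv_rmse, test_rmse)
```

Because the search uses only the training folds, the test set stays untouched until the final check, which keeps the reported test RMSE an honest estimate.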