In scikit-learn, data preprocessing typically involves the following steps:

1. Import the required libraries and modules:
```python
import numpy as np
import pandas as pd
from sklearn import preprocessing
```

2. Load the dataset:
```python
data = pd.read_csv('data.csv')
```

3. Clean the data: remove duplicate rows, handle missing values, and so on.
```python
data = data.drop_duplicates()
data = data.dropna()
```
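To make the effect of these two calls concrete, here is a minimal sketch on a made-up DataFrame. The column names and values are hypothetical and only illustrate the behavior.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one exact duplicate row and one missing value.
toy = pd.DataFrame({
    "age": [63, 67, 67, 37],
    "chol": [233.0, 286.0, 286.0, np.nan],
})

deduped = toy.drop_duplicates()  # keeps the first (67, 286.0) row, drops the repeat
cleaned = deduped.dropna()       # drops every row that still contains a NaN

print(len(toy), len(deduped), len(cleaned))
```

Note that `dropna()` discards the whole row even if only one column is missing, so on small datasets imputation may be preferable.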

4. Select features: keep only the features relevant to the task.
```python
selected_features = ['feature1', 'feature2', 'feature3']
data = data[selected_features]
```

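5. Standardize the data: rescale each feature to zero mean and unit standard deviation. This only shifts and scales the values; it does not change the shape of their distribution, so the result is not necessarily normally distributed.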
```python
scaler = preprocessing.StandardScaler()
data_scaled = scaler.fit_transform(data)
```
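Standardization computes `z = (x - mean) / std` per feature. The sketch below, using hypothetical values, checks the resulting mean and standard deviation and shows one common way to avoid leaking test-set statistics: fit the scaler on the training data and reuse it for the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical values: two features on very different scales.
X_train_demo = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test_demo = np.array([[2.0, 400.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_demo)  # learns mean and std from training data
X_test_scaled = scaler.transform(X_test_demo)        # reuses the training statistics

print(X_train_scaled.mean(axis=0))  # approximately 0 for each feature
print(X_train_scaled.std(axis=0))   # approximately 1 for each feature
print(X_test_scaled)
```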

6. Normalize the data: rescale each feature to a specified range (for example, 0 to 1).
```python
min_max_scaler = preprocessing.MinMaxScaler()
data_normalized = min_max_scaler.fit_transform(data)
```
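Min-max scaling maps each value to `(x - min) / (max - min)`, optionally stretched to another range through the `feature_range` parameter. A small sketch with hypothetical values compares the scaler output with the manual formula:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

values = np.array([[120.0], [130.0], [160.0]])  # hypothetical single feature

scaled_01 = MinMaxScaler().fit_transform(values)                         # range [0, 1]
scaled_pm1 = MinMaxScaler(feature_range=(-1, 1)).fit_transform(values)   # range [-1, 1]

# The same result computed by hand for the default range.
manual = (values - values.min()) / (values.max() - values.min())

print(scaled_01.ravel())
print(scaled_pm1.ravel())
```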

7. Encode class labels: convert class labels into integer codes.
```python
# `labels` is assumed to be a 1-D array-like of class labels (e.g. a DataFrame column).
label_encoder = preprocessing.LabelEncoder()
encoded_labels = label_encoder.fit_transform(labels)
```
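As an illustration, the sketch below encodes a short list of hypothetical string labels. `LabelEncoder` assigns integer codes in sorted order of the distinct labels, and `inverse_transform` maps the codes back.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels, chosen only for illustration.
labels = ["normal", "reversible defect", "fixed defect", "normal"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)
decoded = encoder.inverse_transform(encoded)

print(list(encoder.classes_))  # distinct labels in sorted order
print(list(encoded))           # integer code of each input label
```

The integer codes imply an ordering, which is why this encoder is intended for target labels rather than for nominal input features.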

8. One-hot encode: convert categorical values into one-hot vectors.
```python
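# OneHotEncoder expects 2-D input; `labels` is assumed to be a 1-D NumPy array here.
one_hot_encoder = preprocessing.OneHotEncoder()
one_hot_labels = one_hot_encoder.fit_transform(labels.reshape(-1, 1))  # sparse matrix by default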
```
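A minimal sketch of one-hot encoding on a hypothetical integer-coded column follows. It assumes `handle_unknown="ignore"` is acceptable for the use case, so that a category unseen during fitting is encoded as an all-zero row instead of raising an error.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical category codes, reshaped to a single 2-D column.
chest_pain = np.array([1, 4, 4, 3]).reshape(-1, 1)

encoder = OneHotEncoder(handle_unknown="ignore")
one_hot = encoder.fit_transform(chest_pain).toarray()  # convert sparse output to dense

# A code that was not present during fitting becomes an all-zero row.
unseen = encoder.transform(np.array([[2]])).toarray()

print(encoder.categories_[0])
print(one_hot)
print(unseen)
```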

In [13]:
!pip install ucimlrepo
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
heart_disease = fetch_ucirepo(id=45) 
  
# data (as pandas dataframes) 
X = heart_disease.data.features 
y = heart_disease.data.targets 
# Optional (disabled): inspect the dataset metadata and variable descriptions.
# print(heart_disease.metadata)
# print(heart_disease.variables)

In [14]:
X.head()

   age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope   ca  thal
0   63    1   1       145   233    1        2      150      0      2.3      3  0.0   6.0
1   67    1   4       160   286    0        2      108      1      1.5      2  3.0   3.0
2   67    1   4       120   229    0        2      129      1      2.6      2  2.0   7.0
3   37    1   3       130   250    0        0      187      0      3.5      3  0.0   3.0
4   41    0   2       130   204    0        2      172      0      1.4      1  0.0   3.0


In [15]:
X.shape

(303, 13)

In [16]:
X.isnull().sum()

age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          4
thal        2
dtype: int64

In [17]:
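# Alternative (disabled): impute missing values from the 3 nearest neighbours.
# Note that fit_transform returns a NumPy array rather than a DataFrame.
# from sklearn.impute import KNNImputer
# imputer = KNNImputer(n_neighbors=3)
# X = imputer.fit_transform(X)

# Fill the missing values in `ca` and `thal` with the column means.
# Both columns appear to hold discrete codes, so the mean may not be a valid code.
X = X.fillna(X.mean())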
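Since `ca` and `thal` appear to contain discrete codes, one alternative to mean imputation is to fill with the most frequent value. The sketch below uses `SimpleImputer` on a hypothetical toy frame; it is an option to consider, not what the notebook itself does.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame mimicking the two columns that contain missing values.
toy = pd.DataFrame({
    "ca": [0.0, 3.0, 0.0, np.nan],
    "thal": [6.0, 3.0, 3.0, np.nan],
})

imputer = SimpleImputer(strategy="most_frequent")
filled = pd.DataFrame(
    imputer.fit_transform(toy), columns=toy.columns, index=toy.index
)

print(filled)
print(toy.mean())  # what mean imputation would have inserted instead
```

In this toy example the mode keeps the filled values within the existing codes, whereas the column means (1.0 for `ca`, 4.0 for `thal`) would insert a `thal` value that never occurs in the data.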

In [18]:
X.isnull().sum()

age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
dtype: int64

In [19]:
# Optional (disabled): generate an exploratory profiling report for X.
# !pip install pandas-profiling
# import pandas_profiling
# profile = pandas_profiling.ProfileReport(X)
# profile

## Without Data Preprocessing

In [20]:
X

     age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope        ca  thal
0     63    1   1       145   233    1        2      150      0      2.3      3  0.000000   6.0
1     67    1   4       160   286    0        2      108      1      1.5      2  3.000000   3.0
2     67    1   4       120   229    0        2      129      1      2.6      2  2.000000   7.0
3     37    1   3       130   250    0        0      187      0      3.5      3  0.000000   3.0
4     41    0   2       130   204    0        2      172      0      1.4      1  0.000000   3.0
...  ...  ...  ..       ...   ...  ...      ...      ...    ...      ...    ...       ...   ...
298   45    1   1       110   264    0        0      132      0      1.2      2  0.000000   7.0
299   68    1   4       144   193    1        0      141      0      3.4      2  2.000000   7.0
300   57    1   4       130   131    0        0      115      1      1.2      2  1.000000   7.0
301   57    0   2       130   236    0        2      174      0      0.0      2  1.000000   3.0


In [21]:
X.shape

(303, 13)

In [22]:
y = y['num']  # select the target column as a 1-D Series

In [23]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=888)
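If the target classes are imbalanced, a purely random split can leave rare classes under-represented in the test set. As a hedged illustration on synthetic labels (not the heart-disease data), the `stratify` parameter of `train_test_split` keeps the class proportions the same in both parts:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 samples of class 0 and 20 of class 1.
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 80 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=888, stratify=y_demo
)

print(np.bincount(y_tr))  # class counts in the training part
print(np.bincount(y_te))  # class counts in the test part
```

Stratification requires every class to have at least two samples, so it may not be applicable when a class is extremely rare.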

In [24]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(max_depth=5,n_estimators=100,random_state=5)
model.fit(X_train,y_train)

## Inspecting the Test Set

In [25]:
X_test.head()

     age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope   ca  thal
91    62    0   4       160   164    0        2      145      0      6.2      3  3.0   7.0
184   60    0   4       158   305    0        2      161      0      0.0      1  0.0   3.0
217   46    0   4       138   243    0        2      152      1      0.0      2  0.0   3.0
226   47    1   4       112   204    0        0      143      0      0.1      1  0.0   3.0
156   51    1   4       140   299    0        0      173      1      1.6      1  0.0   7.0


In [26]:
X_test.iloc[2]  # a single sample

age          46.0
sex           0.0
cp            4.0
trestbps    138.0
chol        243.0
fbs           0.0
restecg       2.0
thalach     152.0
exang         1.0
oldpeak       0.0
slope         2.0
ca            0.0
thal          3.0
Name: 217, dtype: float64

In [27]:
model.predict(X_test)

array([2, 0, 0, 0, 0, 0, 2, 0, 0, 3, 0, 0, 0, 2, 3, 3, 0, 0, 0, 0, 0, 3,
       0, 2, 0, 2, 2, 0, 0, 0, 3, 0, 3, 0, 0, 0, 0, 0, 0, 2, 0, 3, 2, 2,
       0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 3], dtype=int64)

In [28]:
model.predict_proba(X_test)

array([[1.58096272e-01, 9.57276355e-02, 2.87063130e-01, 1.87776598e-01,
        2.71336365e-01],
       [6.97505006e-01, 1.88745436e-01, 4.68916367e-02, 4.76158150e-02,
        1.92421064e-02],
       [6.44452464e-01, 1.75511114e-01, 8.27127102e-02, 6.28793787e-02,
        3.44443337e-02],
       [7.56356166e-01, 1.60596869e-01, 4.63805661e-02, 3.06856057e-02,
        5.98079250e-03],
       [4.26272843e-01, 1.44274861e-01, 2.35465535e-01, 1.51026559e-01,
        4.29602022e-02],
       [9.11518584e-01, 5.32216002e-02, 2.22695409e-02, 1.26578572e-02,
        3.32417582e-04],
       [1.06248375e-01, 1.58963212e-01, 3.44087982e-01, 2.50957429e-01,
        1.39743003e-01],
       [3.66132306e-01, 1.88206021e-01, 2.22156154e-01, 1.44585260e-01,
        7.89202587e-02],
       [4.70051370e-01, 2.17504444e-01, 2.29831450e-01, 6.42016542e-02,
        1.84110813e-02],
       [3.83334622e-02, 1.06237044e-01, 2.59333057e-01, 5.47053811e-01,
        4.90426257e-02],
       [4.32667408e-01, 1.8994

In [29]:
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)
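The two outputs are related: each row of `predict_proba` holds one probability per class (in the order given by `classes_`) and sums to 1, and `predict` returns the class with the highest probability. The sketch below checks this on synthetic stand-in data generated with `make_classification`, not on the heart-disease dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic three-class data used only for illustration.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=5, n_classes=3, random_state=0
)

clf = RandomForestClassifier(max_depth=5, n_estimators=100, random_state=5)
clf.fit(X_demo, y_demo)

proba = clf.predict_proba(X_demo)                          # shape: (n_samples, n_classes)
labels_from_proba = clf.classes_[np.argmax(proba, axis=1)]  # most probable class per row

rows_sum_to_one = np.allclose(proba.sum(axis=1), 1.0)
same_as_predict = np.array_equal(labels_from_proba, clf.predict(X_demo))

print(proba.shape, rows_sum_to_one, same_as_predict)
```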

In [30]:
from sklearn.metrics import confusion_matrix

In [31]:
confusion_matrix_model = confusion_matrix(y_test,y_pred)

In [32]:
confusion_matrix_model

array([[30,  0,  0,  1,  0],
       [ 9,  0,  4,  5,  0],
       [ 2,  0,  1,  1,  0],
       [ 1,  0,  4,  1,  0],
       [ 0,  0,  1,  1,  0]], dtype=int64)
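In scikit-learn's convention, row `i` of the confusion matrix counts samples whose true class is `i`, and column `j` counts samples predicted as class `j`. A minimal sketch that derives overall accuracy and per-class recall from the matrix printed above (the variable names are illustrative):

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows: true class, columns: predicted class).
cm = np.array([
    [30, 0, 0, 1, 0],
    [9, 0, 4, 5, 0],
    [2, 0, 1, 1, 0],
    [1, 0, 4, 1, 0],
    [0, 0, 1, 1, 0],
])

accuracy = np.trace(cm) / cm.sum()               # correct predictions / all predictions
recall_per_class = np.diag(cm) / cm.sum(axis=1)  # correct per true class / samples of that class

print(f"accuracy = {accuracy:.3f}")  # 32 correct out of 61 test samples
print(np.round(recall_per_class, 3))
```

The diagonal shows that class 0 is recognized in 30 of 31 cases, while classes 1 and 4 are never predicted at all (their columns are all zero), so overall accuracy alone hides how poorly the minority classes are handled.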