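### Case Analysis Steps
1. Get the data

2. Basic data processing

2.1 Handle missing values

2.2 Determine the feature values and target values

2.3 Split the data

3. Feature engineering (standardization)

4. Machine learning (logistic regression)

5. Model evaluation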
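Steps 2.3 through 5 can also be chained into a single scikit-learn pipeline. The sketch below is only an illustration: it uses synthetic data from `make_classification` as a stand-in for the real dataset (an assumption, chosen so the block runs without the CSV file), and the names `x_demo`, `y_demo`, `pipeline`, and `demo_accuracy` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 683 samples with 9 features.
x_demo, y_demo = make_classification(n_samples=683, n_features=9, random_state=0)

# Step 2.3: split the data.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=21
)

# Steps 3 and 4: standardization followed by logistic regression.
# The pipeline fits the scaler on the training data only.
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
pipeline.fit(x_tr, y_tr)

# Step 5: accuracy on the held-out data.
demo_accuracy = pipeline.score(x_te, y_te)
print(f'Accuracy on the synthetic data: {demo_accuracy:.3f}')
```

A pipeline keeps the scaler and the model together, which makes it harder to fit the scaler on test data by mistake.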

In [1]:
# Import the required libraries.
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [11]:
# 1. Get the data.
data = pd.read_csv('breast-cancer-wisconsin.csv')
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 699 entries, 0 to 698
Data columns (total 11 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   Sample code number           699 non-null    int64 
 1   Clump Thickness              699 non-null    int64 
 2   Uniformity of Cell Size      699 non-null    int64 
 3   Uniformity of Cell Shape     699 non-null    int64 
 4   Marginal Adhesion            699 non-null    int64 
 5   Single Epithelial Cell Size  699 non-null    int64 
 6   Bare Nuclei                  699 non-null    object
 7   Bland Chromatin              699 non-null    int64 
 8   Normal Nucleoli              699 non-null    int64 
 9   Mitoses                      699 non-null    int64 
 10  Class                        699 non-null    int64 
dtypes: int64(10), object(1)
memory usage: 60.2+ KB
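`info()` reports 699 non-null values in every column, yet `Bare Nuclei` has dtype `object`. This usually means missing values are stored as a text placeholder rather than as NaN. A minimal sketch for counting such placeholders, using a hypothetical miniature frame `demo` in place of the real file:

```python
import pandas as pd

# Hypothetical miniature frame (assumption); the real data comes from the CSV.
demo = pd.DataFrame({
    'Bare Nuclei': ['1', '10', '?', '4', '?'],
    'Class': [2, 4, 2, 4, 2],
})

# isin() flags every cell equal to the placeholder; sum() counts them per column.
placeholder_counts = demo.isin(['?']).sum()
print(placeholder_counts)
```

Running the same check on the real `data` frame would show which columns contain `'?'` before any rows are dropped.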


In [12]:
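# 2.1 Handle missing values: replace the '?' placeholder with NaN, then drop those rows.
# Keyword form: data = data.replace(to_replace='?', value=np.nan)  (np.NAN no longer exists in NumPy 2.0)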
data = data.replace('?', np.nan)
data = data.dropna()
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 683 entries, 0 to 698
Data columns (total 11 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   Sample code number           683 non-null    int64 
 1   Clump Thickness              683 non-null    int64 
 2   Uniformity of Cell Size      683 non-null    int64 
 3   Uniformity of Cell Shape     683 non-null    int64 
 4   Marginal Adhesion            683 non-null    int64 
 5   Single Epithelial Cell Size  683 non-null    int64 
 6   Bare Nuclei                  683 non-null    object
 7   Bland Chromatin              683 non-null    int64 
 8   Normal Nucleoli              683 non-null    int64 
 9   Mitoses                      683 non-null    int64 
 10  Class                        683 non-null    int64 
dtypes: int64(10), object(1)
memory usage: 64.0+ KB
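After `dropna()`, `Bare Nuclei` is still of dtype `object`, because the remaining values are digit strings. Later steps can work with this, but converting the column explicitly makes the intent clearer. The following is a sketch on a hypothetical miniature frame `demo`, not part of the original notebook:

```python
import pandas as pd

# Hypothetical miniature frame (assumption) mimicking the raw CSV contents.
demo = pd.DataFrame({
    'Bare Nuclei': ['1', '10', '?', '4'],
    'Class': [2, 4, 2, 4],
})

# errors='coerce' turns anything that is not a number (here '?') into NaN.
demo['Bare Nuclei'] = pd.to_numeric(demo['Bare Nuclei'], errors='coerce')

# Drop the rows with NaN, then cast the now-complete column to integers.
demo = demo.dropna().astype({'Bare Nuclei': 'int64'})
print(demo.dtypes)
```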


In [13]:
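# 2.2 Determine the feature values and target values.
# Column 0 (Sample code number) is only an ID, so it is skipped; the last column (Class) is the target.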
x = data.iloc[:, 1:-1]
y = data.Class
print(f'x.head(): {x.head()}')
print(f'y.head(): {y.head()}')

x.head():    Clump Thickness  Uniformity of Cell Size  Uniformity of Cell Shape  \
0                5                        1                         1   
1                5                        4                         4   
2                3                        1                         1   
3                6                        8                         8   
4                4                        1                         1   

   Marginal Adhesion  Single Epithelial Cell Size Bare Nuclei  \
0                  1                            2           1   
1                  5                            7          10   
2                  1                            2           2   
3                  1                            3           4   
4                  3                            2           1   

   Bland Chromatin  Normal Nucleoli  Mitoses  
0                3                1        1  
1                3                2        1  
2                3 

In [6]:
# 2.3 Split the data (80% training, 20% test).
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=21)
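The split above is purely random. When the classes are imbalanced, `train_test_split` also accepts a `stratify` argument that keeps the class ratio the same in both parts. The sketch below uses hypothetical labels (70 samples of class 2 and 30 of class 4) to show the effect; stratification is an optional variation, not something the original notebook does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels (assumption): 70 of class 2, 30 of class 4.
y_demo = np.array([2] * 70 + [4] * 30)
x_demo = np.arange(100).reshape(-1, 1)

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=21, stratify=y_demo
)

# With stratify, the 70/30 ratio is preserved in both parts.
print('test  class 4 count:', (y_te == 4).sum())
print('train class 4 count:', (y_tr == 4).sum())
```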

In [7]:
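# 3. Feature engineering (standardization): fit the scaler on the training set only, then reuse it on the test set.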
transfer = StandardScaler()
x_train = transfer.fit_transform(x_train)
x_test = transfer.transform(x_test)
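To make the standardization step concrete: `StandardScaler` subtracts the per-column mean and divides by the per-column standard deviation, both computed on the training data. The sketch below checks this against a manual NumPy computation on small hypothetical arrays (`train` and `test` are made up for illustration).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy arrays (assumption), two features each.
train = np.array([[1.0, 10.0], [3.0, 20.0], [5.0, 30.0]])
test = np.array([[3.0, 40.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train
test_scaled = scaler.transform(test)        # reuses the training statistics

# Manual version: z = (x - mean) / std, with the population std (ddof=0).
mean = train.mean(axis=0)
std = train.std(axis=0)
manual_test = (test - mean) / std

print(np.allclose(test_scaled, manual_test))
```

Calling `transform` rather than `fit_transform` on the test set is what keeps test-set statistics from leaking into the model.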

In [14]:
# 4. Machine learning (logistic regression): train the model.
estimator = LogisticRegression()
estimator.fit(x_train, y_train)

Parameter          Value
-----------------  -------
penalty            'l2'
dual               False
tol                0.0001
C                  1.0
fit_intercept      True
intercept_scaling  1
class_weight       None
random_state       None
solver             'lbfgs'
max_iter           100
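For a two-class problem, the fitted model computes a linear score and passes it through the sigmoid function to obtain a probability. The sketch below verifies this on synthetic data; the data, the labelling rule, and the names `x_demo`, `y_demo`, and `model` are assumptions made for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data (assumption): 200 samples, 3 features, labels 2 and 4.
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(200, 3))
y_demo = np.where(x_demo[:, 0] + 0.5 * x_demo[:, 1] > 0, 4, 2)

model = LogisticRegression().fit(x_demo, y_demo)

# Linear score z = w . x + b for every sample.
z = model.decision_function(x_demo)

# Sigmoid of the score gives the probability of the second class in model.classes_.
p_manual = 1.0 / (1.0 + np.exp(-z))
p_sklearn = model.predict_proba(x_demo)[:, 1]

print(model.classes_)
print(np.allclose(p_manual, p_sklearn))
```

`predict` then returns the second class whenever the score is positive, which corresponds to a probability above 0.5.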


In [15]:
# 4. Machine learning (logistic regression): predict on the test set.
y_predict = estimator.predict(x_test)
print(f'Predictions: {y_predict}')

Predictions: [4 4 2 2 2 2 4 4 2 4 2 4 2 2 2 2 2 4 2 2 2 4 4 2 4 4 2 4 4 2 4 4 4 4 2 2 2
 2 2 2 4 4 2 2 2 2 2 4 2 2 2 4 2 2 4 4 2 2 2 4 4 4 2 2 4 2 2 2 2 2 4 2 2 2
 4 2 4 2 4 2 2 2 2 4 2 4 2 2 2 4 2 2 4 2 2 2 4 2 2 4 2 2 2 4 4 4 2 2 4 2 4
 2 2 2 2 4 2 2 2 2 2 4 2 2 2 2 4 4 4 2 2 4 2 2 2 4 2]
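The predictions are the raw class codes 2 and 4. Assuming the standard encoding of this dataset (2 for benign, 4 for malignant), they can be mapped to readable labels. The sketch below uses the first five predicted values from the output above; `label_names` is a hypothetical helper dictionary.

```python
import numpy as np
import pandas as pd

# First five predicted values, copied from the output above.
y_pred_demo = np.array([4, 4, 2, 2, 2])

# Assumed encoding of the Class column: 2 = benign, 4 = malignant.
label_names = {2: 'benign', 4: 'malignant'}

readable = pd.Series(y_pred_demo).map(label_names)
print(readable.value_counts())
```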


In [17]:
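# 5. Model evaluation: both calls compute the same accuracy on the test set.
print(f'Accuracy (estimator.score): {estimator.score(x_test, y_test)}')
print(f'Accuracy (accuracy_score): {accuracy_score(y_test, y_predict)}')

Accuracy (estimator.score): 0.9708029197080292
Accuracy (accuracy_score): 0.9708029197080292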
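Accuracy alone does not show which kind of mistake the model makes. In a screening task, a malignant case predicted as benign is typically the costlier error. A confusion matrix together with recall and precision for class 4 makes this visible. The sketch below uses small hypothetical label lists, not the notebook's actual results.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true and predicted labels (assumption), 2 = benign, 4 = malignant.
y_true_demo = [2, 2, 2, 2, 4, 4, 4, 4]
y_pred_demo = [2, 2, 2, 4, 4, 4, 4, 2]

# Rows are true classes, columns are predicted classes, in the order [2, 4].
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=[2, 4])

# Recall: share of truly malignant samples that were predicted malignant.
recall = recall_score(y_true_demo, y_pred_demo, pos_label=4)

# Precision: share of malignant predictions that were truly malignant.
precision = precision_score(y_true_demo, y_pred_demo, pos_label=4)

print(cm)
print(f'Recall (class 4): {recall}')
print(f'Precision (class 4): {precision}')
```

On the real data, the same calls would take `y_test` and `y_predict` in place of the hypothetical lists.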
