# Project: Assessing and Cleaning Cardiovascular Disease Data

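## Analysis Goal
The goal of the overall analysis is to explore how patient characteristics relate to cardiovascular disease, so that patients can be advised on self-monitoring to avoid or reduce its impact.
This project focuses on practicing how to assess data for cleanliness and tidiness and then, based on that assessment, cleaning the data so it is ready for the next stage of analysis.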

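## Introduction
The dataset contains cardiovascular disease data: 70,000 patient records with several related features. These include basic information such as age and gender, and physical measurements such as height, weight, and systolic blood pressure.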

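What each column means:
- id: patient identifier
- age: age, in days
- gender: gender, 1 = female, 2 = male
- height: height, in centimeters
- weight: weight, in kilograms
- ap_hi: systolic blood pressure
- ap_lo: diastolic blood pressure
- cholesterol: cholesterol, 1 = normal, 2 = above normal, 3 = well above normal
- gluc: glucose, 1 = normal, 2 = above normal, 3 = well above normal
- smoke: whether the patient smokes
- alco: whether the patient drinks alcohol
- active: whether the patient is physically active
- cardio: whether the patient has cardiovascular disease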

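## Reading the Data
Import the libraries needed for the analysis, then use Pandas' `read_csv` function to parse the raw data file `cardio.csv` into a DataFrame and assign it to the variable `original_data`.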

In [1]:
import pandas as pd

In [2]:
original_data = pd.read_csv("../../[赠送] 练习数据集（持续更新）/cardio.csv")
original_data

,id;age;gender;height;weight;ap_hi;ap_lo;cholesterol;gluc;smoke;alco;active;cardio
0,0;18393;2;168;62.0;110;80;1;1;0;0;1;0
1,1;20228;1;156;85.0;140;90;3;1;0;0;1;1
2,2;18857;1;165;64.0;130;70;3;1;0;0;0;1
3,3;17623;2;169;82.0;150;100;1;1;0;0;1;1
4,4;17474;1;156;56.0;100;60;1;1;0;0;0;0
...,...
69995,99993;19240;2;168;76.0;120;80;1;1;1;0;1;0
69996,99995;22601;1;158;126.0;140;90;2;2;0;0;1;1
69997,99996;19066;2;183;105.0;180;90;3;1;0;1;0;1
69998,99998;22431;1;163;72.0;135;80;1;2;0;0;0;1
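The output above shows every field packed into one column: the values are separated by semicolons, while `read_csv` splits on commas by default. As a side note, here is a minimal sketch of parsing such data directly with the `sep` parameter. The in-memory sample is a hypothetical stand-in for `cardio.csv`, and its layout (a plain semicolon-delimited header and rows) is an assumption about the raw file. The rest of this notebook still splits the column by hand as practice.

```python
import io
import pandas as pd

# Hypothetical in-memory stand-in for the raw file, built from two rows shown above.
raw = io.StringIO(
    "id;age;gender;height;weight;ap_hi;ap_lo;cholesterol;gluc;smoke;alco;active;cardio\n"
    "0;18393;2;168;62.0;110;80;1;1;0;0;1;0\n"
    "1;20228;1;156;85.0;140;90;3;1;0;0;1;1\n"
)

# sep=";" tells read_csv to split fields on semicolons instead of commas.
parsed = pd.read_csv(raw, sep=";")
print(parsed.shape)
print(parsed.dtypes)
```

Parsed this way, each field lands in its own column and the numeric types are inferred during loading.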


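## Assessing the Data
In this part, I assess the data contained in the `original_data` DataFrame created in the previous part.

The assessment covers two aspects: structure and content, that is, tidiness and cleanliness. Structural problems are violations of the three criteria "each column is a variable, each row is an observation, each cell is a single value". Content problems include missing, duplicated, and invalid data.

### Assessing tidiness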

In [3]:
original_data.sample(10)

,id;age;gender;height;weight;ap_hi;ap_lo;cholesterol;gluc;smoke;alco;active;cardio
46386,66248;21946;1;154;83.0;150;90;3;1;0;0;1;1
8531,12177;23348;1;161;60.0;110;70;1;1;0;0;0;0
58645,83693;18334;1;168;69.0;120;80;1;1;0;0;1;0
38464,54919;20736;1;169;79.0;120;80;1;1;0;0;1;0
43605,62304;20655;2;165;63.0;140;100;2;1;0;0;1;1
68116,97286;23324;1;150;96.0;110;60;1;1;0;0;1;1
32194,45985;21130;1;162;58.0;120;80;1;1;0;0;1;0
23090,32983;18259;2;168;58.0;110;80;1;1;1;1;1;0
26355,37650;19723;1;155;70.0;150;100;1;1;0;0;1;0
20712,29561;19635;1;162;66.0;120;80;1;1;0;0;1;0


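Judging from the 10 sampled rows, the data does not meet the criteria "each column is a variable, each row is an observation, each cell is a single value". Each row is one record of a patient's measurements, but a single column holds all of that patient's fields, separated by semicolons. This is a structural problem, and the column needs to be split.

The structure has to be cleaned before the assessment of the content can continue.

## Cleaning the Data Structure
To keep the cleaned data separate from the raw data, we create a new variable `cleaned_data` as a copy of `original_data`. All subsequent cleaning steps are applied to `cleaned_data`.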

In [4]:
cleaned_data = original_data.copy()
cleaned_data

,id;age;gender;height;weight;ap_hi;ap_lo;cholesterol;gluc;smoke;alco;active;cardio
0,0;18393;2;168;62.0;110;80;1;1;0;0;1;0
1,1;20228;1;156;85.0;140;90;3;1;0;0;1;1
2,2;18857;1;165;64.0;130;70;3;1;0;0;0;1
3,3;17623;2;169;82.0;150;100;1;1;0;0;1;1
4,4;17474;1;156;56.0;100;60;1;1;0;0;0;0
...,...
69995,99993;19240;2;168;76.0;120;80;1;1;1;0;1;0
69996,99995;22601;1;158;126.0;140;90;2;2;0;0;1;1
69997,99996;19066;2;183;105.0;180;90;3;1;0;1;0;1
69998,99998;22431;1;163;72.0;135;80;1;2;0;0;0;1


### Splitting the column

In [5]:
cleaned_data[["id", "age", "gender", "height", "weight", "ap_hi", "ap_lo", "cholesterol", "gluc", "smoke", "alco", "active", "cardio"]] = cleaned_data['id;age;gender;height;weight;ap_hi;ap_lo;cholesterol;gluc;smoke;alco;active;cardio'].str.split(";", expand = True)
cleaned_data

,id;age;gender;height;weight;ap_hi;ap_lo;cholesterol;gluc;smoke;alco;active;cardio,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
0,0;18393;2;168;62.0;110;80;1;1;0;0;1;0,0,18393,2,168,62.0,110,80,1,1,0,0,1,0
1,1;20228;1;156;85.0;140;90;3;1;0;0;1;1,1,20228,1,156,85.0,140,90,3,1,0,0,1,1
2,2;18857;1;165;64.0;130;70;3;1;0;0;0;1,2,18857,1,165,64.0,130,70,3,1,0,0,0,1
3,3;17623;2;169;82.0;150;100;1;1;0;0;1;1,3,17623,2,169,82.0,150,100,1,1,0,0,1,1
4,4;17474;1;156;56.0;100;60;1;1;0;0;0;0,4,17474,1,156,56.0,100,60,1,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
69995,99993;19240;2;168;76.0;120;80;1;1;1;0;1;0,99993,19240,2,168,76.0,120,80,1,1,1,0,1,0
69996,99995;22601;1;158;126.0;140;90;2;2;0;0;1;1,99995,22601,1,158,126.0,140,90,2,2,0,0,1,1
69997,99996;19066;2;183;105.0;180;90;3;1;0;1;0;1,99996,19066,2,183,105.0,180,90,3,1,0,1,0,1
69998,99998;22431;1;163;72.0;135;80;1;2;0;0;0;1,99998,22431,1,163,72.0,135,80,1,2,0,0,0,1
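The cell above types out the thirteen column names by hand. Since the packed column's name is itself the semicolon-joined header, the names can also be derived from it. The following is a minimal sketch on a hypothetical three-field stand-in, not the notebook's actual cell:

```python
import pandas as pd

# Small stand-in for original_data: one column whose name is the ";"-joined header.
packed_name = "id;age;gender"
df = pd.DataFrame({packed_name: ["0;18393;2", "1;20228;1"]})

# Split the packed column name to get the new column names,
# then split each row's value into the same number of columns.
new_columns = packed_name.split(";")
df[new_columns] = df[packed_name].str.split(";", expand=True)

# Drop the packed column once its contents have been spread out.
df = df.drop(columns=packed_name)
print(df)
```

The resulting values are still strings, so type conversion remains a separate step.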


### Dropping the original unsplit column

In [6]:
cleaned_data = cleaned_data.drop("id;age;gender;height;weight;ap_hi;ap_lo;cholesterol;gluc;smoke;alco;active;cardio", axis = 1)
cleaned_data

,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
0,0,18393,2,168,62.0,110,80,1,1,0,0,1,0
1,1,20228,1,156,85.0,140,90,3,1,0,0,1,1
2,2,18857,1,165,64.0,130,70,3,1,0,0,0,1
3,3,17623,2,169,82.0,150,100,1,1,0,0,1,1
4,4,17474,1,156,56.0,100,60,1,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
69995,99993,19240,2,168,76.0,120,80,1,1,1,0,1,0
69996,99995,22601,1,158,126.0,140,90,2,2,0,0,1,1
69997,99996,19066,2,183,105.0,180,90,3,1,0,1,0,1
69998,99998,22431,1,163,72.0,135,80,1,2,0,0,0,1


### Setting id as the DataFrame index

In [7]:
cleaned_data.set_index("id", inplace = True)

## Continuing the Assessment: Data Content

### Assessing cleanliness

In [8]:
cleaned_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 70000 entries, 0 to 99999
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   age          70000 non-null  object
 1   gender       70000 non-null  object
 2   height       70000 non-null  object
 3   weight       70000 non-null  object
 4   ap_hi        70000 non-null  object
 5   ap_lo        70000 non-null  object
 6   cholesterol  70000 non-null  object
 7   gluc         70000 non-null  object
 8   smoke        70000 non-null  object
 9   alco         70000 non-null  object
 10  active       70000 non-null  object
 11  cardio       70000 non-null  object
dtypes: object(12)
memory usage: 6.9+ MB


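The output shows 70,000 observations and no missing values.
However, every column has dtype `object` (strings), while the values should be integers or floats, so the columns need a type conversion.

#### Assessing duplicated data

Given what the variables mean, any single variable may legitimately repeat across patients. For this dataset we therefore only need to check for rows in which all variables are identical.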

In [9]:
cleaned_data.duplicated().sum()

24

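The result is 24, so 24 rows repeat an earlier row in all 12 feature columns. Because `id` is now the index, it is not part of this comparison: these rows carry different ids and may be different patients who happen to share the same values. This notebook therefore keeps them and does not treat them as duplicates to remove.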
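`duplicated()` compares column values only and ignores the index, and by default it flags only the second and later occurrences in each group. A minimal sketch on hypothetical data shows how `keep=False` lists every member of a duplicate group for inspection:

```python
import pandas as pd

# Hypothetical sample: ids 0 and 7 share identical feature values.
df = pd.DataFrame(
    {"age": [50, 55, 50], "ap_hi": [110, 140, 110]},
    index=pd.Index([0, 1, 7], name="id"),
)

# Default keep="first": only the later occurrence is counted.
n_repeats = int(df.duplicated().sum())

# keep=False marks every member of a duplicate group.
all_members = df[df.duplicated(keep=False)]
print(n_repeats)
print(all_members)
```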

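#### Assessing inconsistent data
Inconsistent data could occur in the `gender`, `cholesterol`, `gluc`, `smoke`, `alco`, `active`, and `cardio` variables, so we check whether several different values are used to refer to the same category.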

In [10]:
print(cleaned_data["gender"].value_counts())
print(cleaned_data["cholesterol"].value_counts())
print(cleaned_data["gluc"].value_counts())

gender
1    45530
2    24470
Name: count, dtype: int64
cholesterol
1    52385
2     9549
3     8066
Name: count, dtype: int64
gluc
1    59479
3     5331
2     5190
Name: count, dtype: int64


In [11]:
print(cleaned_data["smoke"].value_counts())
print(cleaned_data["alco"].value_counts())
print(cleaned_data["active"].value_counts())
print(cleaned_data["cardio"].value_counts())

smoke
0    63831
1     6169
Name: count, dtype: int64
alco
0    66236
1     3764
Name: count, dtype: int64
active
1    56261
0    13739
Name: count, dtype: int64
cardio
0    35021
1    34979
Name: count, dtype: int64


Each variable contains only the expected codes, so there is no inconsistent data.
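Reading `value_counts()` output by eye works for a handful of columns. The same check can also be stated as a rule: every value must belong to the set of codes given in the column definitions. Below is a minimal sketch on a hypothetical sample; the `allowed` dictionary and the injected bad value are illustrative assumptions.

```python
import pandas as pd

# Allowed codes per column, following the column definitions in this document.
# The values are strings because the columns have not been converted yet.
allowed = {"gender": ["1", "2"], "cholesterol": ["1", "2", "3"], "smoke": ["0", "1"]}

# Hypothetical sample; the "3" in gender is an injected error.
df = pd.DataFrame(
    {"gender": ["1", "2", "3"], "cholesterol": ["1", "3", "2"], "smoke": ["0", "1", "0"]}
)

# For each column, count the values that fall outside its allowed set.
unexpected = {col: int((~df[col].isin(codes)).sum()) for col, codes in allowed.items()}
print(unexpected)
```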

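#### Assessing invalid or erroneous data
The columns must first be converted to numeric types. Only then can the DataFrame's `describe` method give a quick overview of the summary statistics.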

In [12]:
cleaned_data[["age","gender","ap_hi", "ap_lo", "cholesterol", "gluc", "smoke", "alco", "active", "cardio"]] = cleaned_data[["age","gender","ap_hi", "ap_lo", "cholesterol", "gluc", "smoke", "alco", "active", "cardio"]].astype(int)
cleaned_data[["height","weight"]] = cleaned_data[["height","weight"]].astype(float)
cleaned_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 70000 entries, 0 to 99999
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   age          70000 non-null  int64  
 1   gender       70000 non-null  int64  
 2   height       70000 non-null  float64
 3   weight       70000 non-null  float64
 4   ap_hi        70000 non-null  int64  
 5   ap_lo        70000 non-null  int64  
 6   cholesterol  70000 non-null  int64  
 7   gluc         70000 non-null  int64  
 8   smoke        70000 non-null  int64  
 9   alco         70000 non-null  int64  
 10  active       70000 non-null  int64  
 11  cardio       70000 non-null  int64  
dtypes: float64(2), int64(10)
memory usage: 6.9+ MB
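`astype(int)` raises an error as soon as it meets a string that is not a valid number. That did not happen here, but when it does, `pd.to_numeric` with `errors="coerce"` can help locate the offending entries. This is a minimal sketch on hypothetical values, not a step the notebook needed:

```python
import pandas as pd

# Hypothetical values; "n/a" is an injected bad entry.
s = pd.Series(["110", "80", "n/a"])

# errors="coerce" turns unparseable strings into NaN instead of raising.
converted = pd.to_numeric(s, errors="coerce")

# The positions that became NaN point back at the original bad strings.
bad_entries = s[converted.isna()]
print(bad_entries.tolist())
```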


In [13]:
cleaned_data.describe()

,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
count,70000.0,70000.0,70000.0,70000.0,70000.0,70000.0,70000.0,70000.0,70000.0,70000.0,70000.0,70000.0
mean,19468.865814,1.349571,164.359229,74.20569,128.817286,96.630414,1.366871,1.226457,0.088129,0.053771,0.803729,0.4997
std,2467.251667,0.476838,8.210126,14.395757,154.011419,188.47253,0.68025,0.57227,0.283484,0.225568,0.397179,0.500003
min,10798.0,1.0,55.0,10.0,-150.0,-70.0,1.0,1.0,0.0,0.0,0.0,0.0
25%,17664.0,1.0,159.0,65.0,120.0,80.0,1.0,1.0,0.0,0.0,1.0,0.0
50%,19703.0,1.0,165.0,72.0,120.0,80.0,1.0,1.0,0.0,0.0,1.0,0.0
75%,21327.0,2.0,170.0,82.0,140.0,90.0,2.0,1.0,0.0,0.0,1.0,1.0
max,23713.0,2.0,250.0,200.0,16020.0,11000.0,3.0,3.0,1.0,1.0,1.0,1.0


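`ap_hi` and `ap_lo` should not contain negative values, yet their minimums are -150 and -70.
We therefore first select the observations where `ap_hi` or `ap_lo` is negative and examine them more closely.
(The maximums of 16020 and 11000 are also implausible for blood pressure; this project only handles the negative values.)
In addition, codes such as 0, 1, and 2 are easy to misread, so the values should be replaced with descriptive ones:
- `age`: stored in days; convert to years
- `gender`: replace 1 with "Female" and 2 with "Male"
- `cholesterol`: replace 1 with "typical", 2 with "elevated", and 3 with "significantly_elevated"
- `gluc`: replace 1 with "typical", 2 with "elevated", and 3 with "significantly_elevated"
- `smoke`: replace 0 with False and 1 with True
- `alco`: replace 0 with False and 1 with True
- `active`: replace 0 with False and 1 with True
- `cardio`: replace 0 with False and 1 with True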

In [14]:
cleaned_data[cleaned_data["ap_hi"] < 0]

id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
6525,15281,1,165.0,78.0,-100,80,2,1,0,0,1,0
22881,22108,2,161.0,90.0,-115,70,1,1,0,0,1,0
29313,15581,1,153.0,54.0,-100,70,1,1,0,0,1,0
34295,18301,1,162.0,74.0,-140,90,1,1,0,0,1,1
36025,14711,2,168.0,50.0,-120,80,2,1,0,0,0,1
50055,23325,2,168.0,59.0,-150,80,1,1,0,0,1,1
66571,23646,2,160.0,59.0,-120,80,1,1,0,0,0,0


In [15]:
cleaned_data[cleaned_data["ap_lo"] < 0]

id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
85816,22571,1,167.0,74.0,15,-70,1,1,0,0,1,1


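The results show that the rows with a negative `ap_hi` or `ap_lo` include patients both with and without cardiovascular disease. A negative blood pressure reading is invalid and would distort the statistics, so these observations should be removed in the cleaning step.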
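Before removing anything, it helps to know how many readings are affected in each column. A minimal sketch, using a few rows adapted from the outputs above as a stand-in for `cleaned_data`:

```python
import pandas as pd

# Stand-in rows adapted from the outputs above.
df = pd.DataFrame({"ap_hi": [110, -100, 15], "ap_lo": [80, 80, -70]})

# Comparing the frame with 0 gives a boolean frame; summing it counts
# the negative readings column by column.
negatives = (df[["ap_hi", "ap_lo"]] < 0).sum()
print(negatives)
```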

## Cleaning the Data

Based on the conclusions of the assessment, the cleaning steps are:
- `age`: stored in days; convert to years
- `gender`: replace 1 with "Female" and 2 with "Male"
- `cholesterol`: replace 1 with "typical", 2 with "elevated", and 3 with "significantly_elevated"
- `gluc`: replace 1 with "typical", 2 with "elevated", and 3 with "significantly_elevated"
- `smoke`: replace 0 with False and 1 with True
- `alco`: replace 0 with False and 1 with True
- `active`: replace 0 with False and 1 with True
- `cardio`: replace 0 with False and 1 with True
- Remove the observations where `ap_hi` or `ap_lo` is negative

In [16]:
# Convert age from days to years (365-day years), rounded to the nearest whole year.
cleaned_data["age"] = (cleaned_data["age"] / 365).round().astype(int)
cleaned_data

id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
0,50,2,168.0,62.0,110,80,1,1,0,0,1,0
1,55,1,156.0,85.0,140,90,3,1,0,0,1,1
2,52,1,165.0,64.0,130,70,3,1,0,0,0,1
3,48,2,169.0,82.0,150,100,1,1,0,0,1,1
4,48,1,156.0,56.0,100,60,1,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
99993,53,2,168.0,76.0,120,80,1,1,1,0,1,0
99995,62,1,158.0,126.0,140,90,2,2,0,0,1,1
99996,52,2,183.0,105.0,180,90,3,1,0,1,0,1
99998,61,1,163.0,72.0,135,80,1,2,0,0,0,1
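The conversion above rounds to the nearest year, which is not always the same as the number of years a person has completed. For example, 17474 days is about 47.87 years, which rounds to 48 even though only 47 full years have passed. A minimal sketch comparing the two conventions follows; both assume 365-day years and ignore leap days.

```python
import pandas as pd

# The first few values of the age column shown earlier, in days.
age_days = pd.Series([18393, 20228, 17474])

# Nearest whole year, as in the cell above.
rounded = (age_days / 365).round().astype(int)

# Whole years already completed (floor division).
completed = age_days // 365

print(rounded.tolist())    # [50, 55, 48]
print(completed.tolist())  # [50, 55, 47]
```

Which convention is appropriate depends on how age will be used in the later analysis; this notebook keeps the rounded value.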


In `gender`, replace 1 with "Female" and 2 with "Male".

In [17]:
cleaned_data["gender"] = cleaned_data["gender"].replace({1: "Female", 2 : "Male"})
cleaned_data

id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
0,50,Male,168.0,62.0,110,80,1,1,0,0,1,0
1,55,Female,156.0,85.0,140,90,3,1,0,0,1,1
2,52,Female,165.0,64.0,130,70,3,1,0,0,0,1
3,48,Male,169.0,82.0,150,100,1,1,0,0,1,1
4,48,Female,156.0,56.0,100,60,1,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
99993,53,Male,168.0,76.0,120,80,1,1,1,0,1,0
99995,62,Female,158.0,126.0,140,90,2,2,0,0,1,1
99996,52,Male,183.0,105.0,180,90,3,1,0,1,0,1
99998,61,Female,163.0,72.0,135,80,1,2,0,0,0,1


In `cholesterol`, replace 1 with "typical", 2 with "elevated", and 3 with "significantly_elevated".

In [18]:
cleaned_data["cholesterol"] = cleaned_data["cholesterol"].replace({1: "typical", 2 : "elevated", 3 : "significantly_elevated"})
cleaned_data

id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
0,50,Male,168.0,62.0,110,80,typical,1,0,0,1,0
1,55,Female,156.0,85.0,140,90,significantly_elevated,1,0,0,1,1
2,52,Female,165.0,64.0,130,70,significantly_elevated,1,0,0,0,1
3,48,Male,169.0,82.0,150,100,typical,1,0,0,1,1
4,48,Female,156.0,56.0,100,60,typical,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
99993,53,Male,168.0,76.0,120,80,typical,1,1,0,1,0
99995,62,Female,158.0,126.0,140,90,elevated,2,0,0,1,1
99996,52,Male,183.0,105.0,180,90,significantly_elevated,1,0,1,0,1
99998,61,Female,163.0,72.0,135,80,typical,2,0,0,0,1


- `gluc`: replace 1 with "typical", 2 with "elevated", and 3 with "significantly_elevated"
- `smoke`: replace 0 with False and 1 with True
- `alco`: replace 0 with False and 1 with True
- `active`: replace 0 with False and 1 with True
- `cardio`: replace 0 with False and 1 with True

In [19]:
cleaned_data["gluc"] = cleaned_data["gluc"].replace({1: "typical", 2 : "elevated", 3 : "significantly_elevated"})
# Convert the 0/1 flags to real booleans rather than the strings "False"/"True".
flag_columns = ["smoke", "alco", "active", "cardio"]
cleaned_data[flag_columns] = cleaned_data[flag_columns].astype(bool)
cleaned_data

id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
0,50,Male,168.0,62.0,110,80,typical,typical,False,False,True,False
1,55,Female,156.0,85.0,140,90,significantly_elevated,typical,False,False,True,True
2,52,Female,165.0,64.0,130,70,significantly_elevated,typical,False,False,False,True
3,48,Male,169.0,82.0,150,100,typical,typical,False,False,True,True
4,48,Female,156.0,56.0,100,60,typical,typical,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...
99993,53,Male,168.0,76.0,120,80,typical,typical,True,False,True,False
99995,62,Female,158.0,126.0,140,90,elevated,elevated,False,False,True,True
99996,52,Male,183.0,105.0,180,90,significantly_elevated,typical,False,True,False,True
99998,61,Female,163.0,72.0,135,80,typical,elevated,False,False,False,True
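After the replacement, `cholesterol` and `gluc` hold plain strings, so the ranking typical < elevated < significantly_elevated is no longer encoded in the data. One option, shown here as a sketch on hypothetical codes and not as part of the notebook's cleaning, is an ordered categorical dtype:

```python
import pandas as pd

codes = pd.Series([1, 3, 2, 1])
labels = {1: "typical", 2: "elevated", 3: "significantly_elevated"}

# map() translates each code through the dictionary;
# a code missing from the dictionary would become NaN.
named = codes.map(labels)

# An ordered categorical keeps the severity ranking, so sorting
# and comparisons follow it instead of alphabetical order.
level = pd.CategoricalDtype(list(labels.values()), ordered=True)
named = named.astype(level)

print(named.max())
print((named > "typical").sum())
```

Note that a categorical dtype is not stored in a CSV file, so it would have to be reapplied after reading the saved data back.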


Remove the observations where `ap_hi` or `ap_lo` is negative.

In [20]:
cleaned_data = cleaned_data[cleaned_data["ap_hi"] >= 0]
len(cleaned_data[cleaned_data["ap_hi"]<0])

0

In [21]:
cleaned_data = cleaned_data[cleaned_data["ap_lo"] >= 0]
print(len(cleaned_data[cleaned_data["ap_lo"]<0]))
cleaned_data

0


id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
0,50,Male,168.0,62.0,110,80,typical,typical,False,False,True,False
1,55,Female,156.0,85.0,140,90,significantly_elevated,typical,False,False,True,True
2,52,Female,165.0,64.0,130,70,significantly_elevated,typical,False,False,False,True
3,48,Male,169.0,82.0,150,100,typical,typical,False,False,True,True
4,48,Female,156.0,56.0,100,60,typical,typical,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...
99993,53,Male,168.0,76.0,120,80,typical,typical,True,False,True,False
99995,62,Female,158.0,126.0,140,90,elevated,elevated,False,False,True,True
99996,52,Male,183.0,105.0,180,90,significantly_elevated,typical,False,True,False,True
99998,61,Female,163.0,72.0,135,80,typical,elevated,False,False,False,True
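The two filtering cells above can also be written as one combined condition. A minimal sketch on a hypothetical four-row stand-in for `cleaned_data`:

```python
import pandas as pd

# Hypothetical stand-in; ids 6525 and 85816 carry a negative reading.
df = pd.DataFrame(
    {"ap_hi": [110, -100, 15, 140], "ap_lo": [80, 80, -70, 90]},
    index=pd.Index([0, 6525, 85816, 1], name="id"),
)

# Keep rows where both readings are non-negative. The .copy() makes the
# result an independent DataFrame, which avoids chained-assignment warnings
# if columns are modified later.
valid = df[(df["ap_hi"] >= 0) & (df["ap_lo"] >= 0)].copy()
print(len(df) - len(valid), "rows removed")
```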


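## Saving the Cleaned Data

After cleaning, save the clean, tidy data to a new file named `cardio_cleaned.csv`. Note that `id` is the index, so writing with `index=False` leaves `id` out of the saved file, as the read-back output below confirms.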

In [22]:
cleaned_data.head()

id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
0,50,Male,168.0,62.0,110,80,typical,typical,False,False,True,False
1,55,Female,156.0,85.0,140,90,significantly_elevated,typical,False,False,True,True
2,52,Female,165.0,64.0,130,70,significantly_elevated,typical,False,False,False,True
3,48,Male,169.0,82.0,150,100,typical,typical,False,False,True,True
4,48,Female,156.0,56.0,100,60,typical,typical,False,False,False,False


In [23]:
cleaned_data.to_csv("cardio_cleaned.csv", index = False)

In [24]:
pd.read_csv("cardio_cleaned.csv")

,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
0,50,Male,168.0,62.0,110,80,typical,typical,False,False,True,False
1,55,Female,156.0,85.0,140,90,significantly_elevated,typical,False,False,True,True
2,52,Female,165.0,64.0,130,70,significantly_elevated,typical,False,False,False,True
3,48,Male,169.0,82.0,150,100,typical,typical,False,False,True,True
4,48,Female,156.0,56.0,100,60,typical,typical,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...
69987,53,Male,168.0,76.0,120,80,typical,typical,True,False,True,False
69988,62,Female,158.0,126.0,140,90,elevated,elevated,False,False,True,True
69989,52,Male,183.0,105.0,180,90,significantly_elevated,typical,False,True,False,True
69990,61,Female,163.0,72.0,135,80,typical,elevated,False,False,False,True
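If the `id` values are needed in later analysis, one alternative to the cell above is to write the index to the file and restore it on reading. A minimal sketch on a hypothetical two-row frame, using an in-memory buffer in place of a file on disk:

```python
import io
import pandas as pd

# Hypothetical stand-in for cleaned_data, with id as the index.
df = pd.DataFrame(
    {"age": [50, 55], "smoke": [False, True]},
    index=pd.Index([0, 1], name="id"),
)

buffer = io.StringIO()   # in-memory stand-in for a file on disk
df.to_csv(buffer)        # index=True is the default, so id is written as the first column
buffer.seek(0)

# index_col="id" turns the id column back into the index when reading.
restored = pd.read_csv(buffer, index_col="id")
print(restored)
```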
