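Introduction: This dataset contains medical charges together with personal attributes of each beneficiary. It can be used to explore how medical charges relate to personal circumstances, including physical and social factors.
Variables:
- age: age of the insurance beneficiary
- sex: sex of the insurance beneficiary (female/male)
- bmi: body mass index, i.e. weight divided by the square of height (kg/m^2); the ideal range is 18.5-24.9
- children: number of dependent children of the beneficiary
- smoker: whether the beneficiary is a smoker
- region: the beneficiary's residential region in the United States (northeast, southeast, southwest, northwest)
- charges: medical charges paid by the health insurance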
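
To make the `bmi` definition concrete, here is a minimal sketch of the formula. The function name `bmi` and the sample person (70 kg, 1.75 m) are illustrative assumptions and do not come from the dataset.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in metres squared."""
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return weight_kg / height_m ** 2


# Hypothetical person, not taken from insurance.csv
value = bmi(70, 1.75)
print(round(value, 2))  # 22.86

# Compare against the ideal range quoted in the variable list
in_ideal_range = 18.5 <= value <= 24.9
print(in_ideal_range)  # True
```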

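## Reading the Data
Import the library needed for the analysis, then use Pandas' `read_csv` function to parse the contents of the raw data file "insurance.csv" into a DataFrame and assign it to the variable `original_data`.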

In [14]:
import pandas as pd

In [16]:
original_data = pd.read_csv("../../[赠送] 练习数据集（持续更新）/insurance.csv")
original_data

,age,sex,bmi,children,smoker,region,charges
0,19,female,27.900,0,yes,southwest,16884.92400
1,18,male,33.770,1,no,southeast,1725.55230
2,28,male,33.000,3,no,southeast,4449.46200
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.880,0,no,northwest,3866.85520
...,...,...,...,...,...,...,...
1333,50,male,30.970,3,no,northwest,10600.54830
1334,18,female,31.920,0,no,northeast,2205.98080
1335,18,female,36.850,0,no,southeast,1629.83350
1336,21,female,25.800,0,no,southwest,2007.94500
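
For readers who do not have the file at hand, the parsing step can be tried on an in-memory string. This is only a sketch: the three rows are copied from the output above, and the names `csv_text` and `sample` are illustrative.

```python
import io

import pandas as pd

# First three rows of the dataset as shown in the output above
csv_text = """age,sex,bmi,children,smoker,region,charges
19,female,27.900,0,yes,southwest,16884.92400
18,male,33.770,1,no,southeast,1725.55230
28,male,33.000,3,no,southeast,4449.46200
"""

# read_csv accepts any file-like object, so StringIO stands in for the file
sample = pd.read_csv(io.StringIO(csv_text))
print(sample.shape)  # (3, 7)
print(sample.dtypes)
```

Numeric columns are inferred automatically, which is why `age` and `children` come out as integers and `bmi` and `charges` as floats.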


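## Assessing the Data
In this part, I assess the data contained in the `original_data` DataFrame created in the previous part.
The assessment covers two aspects: structure and content, that is, tidiness and cleanliness. Structural problems are violations of the three criteria "each column is a variable, each row is an observation, each cell is a value". Content problems include missing data, duplicate data, invalid data, and so on.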

### Assessing Tidiness

In [19]:
original_data.sample(10)

,age,sex,bmi,children,smoker,region,charges
699,23,female,39.27,2,no,southeast,3500.6123
894,62,male,32.11,0,no,northeast,13555.0049
987,45,female,27.645,1,no,northwest,28340.18885
42,41,male,21.78,1,no,southeast,6272.4772
474,54,male,25.1,3,yes,southwest,25382.297
391,19,female,37.43,0,no,northwest,2138.0707
1197,41,male,33.55,0,no,southeast,5699.8375
912,59,female,26.695,3,no,northwest,14382.70905
536,33,female,38.9,3,no,southwest,5972.378
790,39,female,41.8,0,no,southeast,5662.225


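Judging from the 10 sampled rows, the data satisfies "each column is a variable, each row is an observation, each cell is a value". Specifically, each row is the medical charge record of one beneficiary and each column is one attribute of that beneficiary, so there are no structural problems.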

### Assessing Cleanliness

In [20]:
original_data.describe()

,age,bmi,children,charges
count,1338.0,1338.0,1338.0,1338.0
mean,39.207025,30.663397,1.094918,13270.422265
std,14.04996,6.098187,1.205493,12110.011237
min,18.0,15.96,0.0,1121.8739
25%,27.0,26.29625,0.0,4740.28715
50%,39.0,30.4,1.0,9382.033
75%,51.0,34.69375,2.0,16639.912515
max,64.0,53.13,5.0,63770.42801


In [21]:
original_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


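The output shows 1338 observations, and every column has 1338 non-null values, so there are no missing values.
The data types are appropriate, so no variable needs type conversion.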
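
The same conclusion can also be checked directly rather than read off the `info()` output. The sketch below uses a toy frame with made-up values, one of which is deliberately missing, to show what `isna().sum()` reports.

```python
import pandas as pd

# Toy frame with hypothetical values; one bmi entry is deliberately missing
toy = pd.DataFrame({
    "age": [19, 18, 28],
    "bmi": [27.9, None, 33.0],
    "smoker": ["yes", "no", "no"],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# True if any cell in the whole frame is missing
has_missing = bool(toy.isna().any().any())
print(has_missing)  # True

# Data type of each column, for checking whether conversion is needed
print(toy.dtypes)
```

Run against `original_data`, every count would be expected to be 0, in line with the `info()` output above.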

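#### Assessing Duplicate Data

Given the meaning of the variables, any single variable can legitimately repeat across observations, so there is no need to check individual columns for duplicates.
We only check for rows in which all variables are identical.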

In [33]:
original_data[original_data.duplicated()]

,age,sex,bmi,children,smoker,region,charges
581,19,male,30.59,0,no,northwest,1639.5631


One row is an exact duplicate of another row across all variables, and it can be removed.
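
By default `duplicated()` flags only the later copies, which is why a single row appears above. To inspect a duplicate next to the row it repeats, `keep=False` can be passed. A minimal sketch on a toy frame with hypothetical values:

```python
import pandas as pd

# Toy frame: rows 0 and 1 are identical in every column
toy = pd.DataFrame({
    "age": [19, 19, 28],
    "sex": ["male", "male", "male"],
    "charges": [1639.5631, 1639.5631, 4449.462],
})

# Default keep="first": only the repeated occurrence is flagged
later_copies = toy[toy.duplicated()]
print(later_copies.index.tolist())  # [1]

# keep=False: every member of a duplicated group is flagged
all_copies = toy[toy.duplicated(keep=False)]
print(all_copies.index.tolist())  # [0, 1]
```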

#### Assessing Inconsistent Data

In [26]:
original_data["sex"].value_counts()

sex
male      676
female    662
Name: count, dtype: int64

In [28]:
original_data["smoker"].value_counts()

smoker
no     1064
yes     274
Name: count, dtype: int64

In [29]:
original_data["region"].value_counts()

region
southeast    364
southwest    325
northwest    325
northeast    324
Name: count, dtype: int64

Based on this assessment, no inconsistent values were found in `sex`, `smoker`, or `region`.
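
The three checks above can also be written as one comparison against the set of values each column is expected to contain. This is a sketch only: the toy frame includes a deliberately inconsistent "Male", and the `expected` mapping is an assumption based on the categories seen in the outputs above.

```python
import pandas as pd

# Toy frame with one deliberately inconsistent spelling ("Male")
toy = pd.DataFrame({
    "sex": ["female", "male", "Male"],
    "smoker": ["yes", "no", "no"],
})

# Assumed allowed categories for each column
expected = {
    "sex": {"female", "male"},
    "smoker": {"yes", "no"},
}

# Values present in the data but absent from the allowed set
unexpected = {
    col: set(toy[col].unique()) - allowed
    for col, allowed in expected.items()
}
print(unexpected)  # {'sex': {'Male'}, 'smoker': set()}
```

An empty set for every column corresponds to the "no inconsistent data" result reported above.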

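## Cleaning the Data

Based on the conclusions of the assessment above, the cleaning we need to do is:
- Remove observations that are exact duplicates across all variables

To keep the cleaned data separate from the original data, we create a new variable `cleaned_data` as a copy of `original_data`. All subsequent cleaning steps are applied to `cleaned_data`.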

In [35]:
cleaned_data = original_data.copy()
cleaned_data

,age,sex,bmi,children,smoker,region,charges
0,19,female,27.900,0,yes,southwest,16884.92400
1,18,male,33.770,1,no,southeast,1725.55230
2,28,male,33.000,3,no,southeast,4449.46200
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.880,0,no,northwest,3866.85520
...,...,...,...,...,...,...,...
1333,50,male,30.970,3,no,northwest,10600.54830
1334,18,female,31.920,0,no,northeast,2205.98080
1335,18,female,36.850,0,no,southeast,1629.83350
1336,21,female,25.800,0,no,southwest,2007.94500


In [39]:
cleaned_data = cleaned_data.drop_duplicates()
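
A quick way to confirm that the removal worked is to re-run the duplicate check on the result. The sketch below uses a toy frame with hypothetical values. It also shows that `drop_duplicates` keeps the original index labels, so a gap remains unless the index is reset.

```python
import pandas as pd

# Toy frame: rows 0 and 1 are exact duplicates
toy = pd.DataFrame({
    "age": [19, 19, 28],
    "charges": [1639.5631, 1639.5631, 4449.462],
})

deduped = toy.drop_duplicates()

# No fully duplicated rows should remain
print(deduped.duplicated().sum())  # 0

# The surviving rows keep their old labels, leaving a gap at 1
print(deduped.index.tolist())  # [0, 2]

# Optional: renumber the rows from 0 without keeping the old labels
renumbered = deduped.reset_index(drop=True)
print(renumbered.index.tolist())  # [0, 1]
```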

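After cleaning, save the clean, tidy data to a new file named insurance_cleaned.csv.
Because the row index carries no meaning here, pass `index=False` so that it is not written to the file.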

In [40]:
cleaned_data.to_csv("insurance_cleaned.csv", index = False)

In [41]:
pd.read_csv("./insurance_cleaned.csv")

,age,sex,bmi,children,smoker,region,charges
0,19,female,27.900,0,yes,southwest,16884.92400
1,18,male,33.770,1,no,southeast,1725.55230
2,28,male,33.000,3,no,southeast,4449.46200
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.880,0,no,northwest,3866.85520
...,...,...,...,...,...,...,...
1332,50,male,30.970,3,no,northwest,10600.54830
1333,18,female,31.920,0,no,northeast,2205.98080
1334,18,female,36.850,0,no,southeast,1629.83350
1335,21,female,25.800,0,no,southwest,2007.94500
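
The effect of `index=False` is easiest to see in a round trip. The sketch below writes a toy frame (hypothetical values, with a gap in the index like the one left by dropping a duplicate) to CSV text both ways and reads it back. When `to_csv` is called without a path it returns the CSV as a string, so no file is created.

```python
import io

import pandas as pd

# Toy frame whose index has a gap, as after dropping a duplicate row
toy = pd.DataFrame(
    {"age": [19, 28], "charges": [1639.5631, 4449.462]},
    index=[0, 2],
)

# Index written to the CSV: it comes back as an extra unnamed column
with_index = pd.read_csv(io.StringIO(toy.to_csv()))
print(with_index.columns.tolist())  # ['Unnamed: 0', 'age', 'charges']

# Index omitted: only the real variables are stored
without_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))
print(without_index.columns.tolist())  # ['age', 'charges']

# On reading, pandas assigns a fresh 0-based index
print(without_index.index.tolist())  # [0, 1]
```

This also explains why the row labels in the reloaded file above run continuously to 1335 even though a row was removed from the middle of the data.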
