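# Project: Assessing and Cleaning a K-Pop Idol Dataset

## Analysis Goals

The goal of this analysis is to use K-Pop idol data and demographic analysis to uncover systematic patterns in how the Korean entertainment industry selects idols and how companies set strategy. Examples include differences between male and female idols in physical selection criteria such as height, weight, and BMI; whether the leading entertainment companies (SM, JYP, YG) apply distinct aesthetic standards; whether debut age has changed over the years; and how foreign idols are distributed.

This hands-on project practices assessing how clean and tidy the data is and then cleaning it based on that assessment, producing data that is ready for the next stage of analysis.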

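## Introduction

The raw dataset contains information on more than 1,700 K-Pop (Korean Wave) idols, including stage name, full name, Korean name, gender, date of birth, height, weight, birthplace, and associations with former and other groups.

The columns have the following meanings:
- `Stage Name`: the name the idol is known by on stage. No two idols in the same group share a stage name.
- `Full Name`: the idol's full name
- `Korean Name`: the idol's name in Korean
- `K Stage Name`: the idol's stage name in Korean. No two idols in the same group share a Korean stage name.
- `Date of Birth`: the idol's date of birth
- `Group`: the K-Pop group the idol belongs to
- `Debut`: the idol's debut date
- `Company`: the entertainment company the idol belongs to
- `Country`: the idol's country of origin
- `Second Country`: the idol's second country, if any
- `Height`: the idol's height
- `Weight`: the idol's weight
- `Birthplace`: the idol's place of birth
- `Other Group`: another group the idol belongs to; this may be a renamed group or a sub-unit of the original group
- `Former Group`: the K-Pop group the idol previously belonged to
- `Gender`: the idol's gender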

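## Reading the Data

First import the pandas library, which the analysis depends on. Use pandas' `read_csv` function to read the raw file `"kpop_idols.csv"`, parse it into a DataFrame, and assign it to the variable `original_data`. Then look at the first five rows of that DataFrame.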

In [1]:
import pandas as pd

In [2]:
original_data=pd.read_csv("kpop_idols.csv")

In [3]:
original_data.head()

Unnamed: 0,Stage Name,Full Name,Korean Name,K Stage Name,Date of Birth,Group,Debut,Company,Country,Second Country,Height,Weight,Birthplace,Other Group,Former Group,Gender
0,2Soul,Kim Younghoon,김영훈,이솔,10/09/1997,7 O'clock,26/08/2014,Jungle,South Korea,,172.0,55.0,,,,M
1,A.M,Seong Hyunwoo,성현우,에이엠,31/12/1996,Limitless,9/07/2019,ONO,South Korea,,181.0,62.0,,,,M
2,Ace,Jang Wooyoung,장우영,에이스,28/08/1992,VAV,31/10/2015,A team,South Korea,,177.0,63.0,,,,M
3,Aeji,Kwon Aeji,권애지,애지,25/10/1999,Hash Tag,11/10/2017,LUK,South Korea,,163.0,,Daegu,,,F
4,AhIn,Lee Ahin,이아인,아인,27/09/1999,MOMOLAND,9/11/2016,Double Kick,South Korea,,160.0,44.0,Wonju,,,F


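## Assessing the Data

The second step is to assess the data held in the `original_data` DataFrame built in the previous step.

The assessment covers two aspects: structure and content, that is, tidiness and cleanliness. Structural problems are violations of the standard "1. each column is a variable, 2. each row is an observation, 3. each cell is a single value". Content problems include missing data, duplicate data, and invalid data.

### Assessing Tidiness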

In [4]:
original_data.sample(n=10, random_state=30)

Unnamed: 0,Stage Name,Full Name,Korean Name,K Stage Name,Date of Birth,Group,Debut,Company,Country,Second Country,Height,Weight,Birthplace,Other Group,Former Group,Gender
1741,Yuna,Seo Yuna,서유나,유나,30/12/1992,AoA,30/07/2012,FNC,South Korea,,163.0,45.0,Busan,AOA BLACK,,F
1117,Ni-Ki,Nishimura Riki,니시무라 리키,니키,9/12/2005,ENHYPEN,30/11/2020,Be:lift,Japan,,,,,,,M
426,Hanse,Do Hanse,도한세,한세,25/09/1997,VICTON,9/11/2016,Plan A,South Korea,,,,,,,M
1080,Muzin,Kim Hyunwoo,김현우,무진,29/03/2001,BAE173,19/11/2020,PocketDol,South Korea,,,,,,,M
1486,The8,Xu Minghao,쉬밍하오,디에잇,7/11/1997,Seventeen,26/05/2015,Pledis,China,,175.0,53.0,Anshan,,,M
197,Damhee,Park Damhee,박담희,담희,21/06/2000,ARTBEAT,16/11/2022,AB Creative,South Korea,,,,,,,F
706,JinE,Shin Hyejin,신혜진,진이,22/01/1995,,,,South Korea,,160.0,48.0,Pohang,,Oh My Girl,F
314,Eunice,Heo Sooyeon,허수연,유니스,2/09/1991,DIA,14/07/2015,MBK,South Korea,,166.0,49.0,Busan,,,F
1190,Ruka,Kawai Ruka,카와이 루카,루카,20/03/2002,BABYMONSTER,0/01/1900,YG,Japan,,,,,,,F
1677,Yoon,Ji Hayoon,지하윤,윤,8/07/1997,Gate9,26/01/1999,JYP| SidusHQ,South Korea,,162.0,,Busan,,,F


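A sample of 10 rows shows that the data does not fully meet the standard "1. each column is a variable, 2. each row is an observation, 3. each cell is a single value". Each row is one idol and each column is a variable describing that idol, so the first two criteria hold. However, in the row with index `1677` the `Company` variable holds two values. This happens because an idol may sign with a new company after a contract expires (contracts are sometimes terminated early, but that is usually costly and not the common case).

The `Company` variable therefore violates the "each cell is a single value" criterion, and the column needs to be split. Because one goal of this analysis is to see whether companies differ in how they select idols, we keep only the company under which the idol first debuted, so that the study focuses on the initial selection, and drop the columns holding later companies.

### Assessing Cleanliness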

In [5]:
original_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1778 entries, 0 to 1777
Data columns (total 16 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Stage Name      1778 non-null   object 
 1   Full Name       1769 non-null   object 
 2   Korean Name     1768 non-null   object 
 3   K Stage Name    1777 non-null   object 
 4   Date of Birth   1776 non-null   object 
 5   Group           1632 non-null   object 
 6   Debut           1632 non-null   object 
 7   Company         1632 non-null   object 
 8   Country         1778 non-null   object 
 9   Second Country  62 non-null     object 
 10  Height          836 non-null    float64
 11  Weight          566 non-null    float64
 12  Birthplace      834 non-null    object 
 13  Other Group     140 non-null    object 
 14  Former Group    264 non-null    object 
 15  Gender          1778 non-null   object 
dtypes: float64(2), object(14)
memory usage: 222.4+ KB


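The output shows 1778 observations in total. `Stage Name`, `Country`, and `Gender` have no missing values; every other variable does. `Date of Birth` and `Debut` should be date types, and `Gender` is better stored as `category`, so these columns need type conversion.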
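As a complement to reading the `info()` output, the missing values per column can also be tabulated directly. The following is a minimal sketch on a small made-up DataFrame (`toy` and its values are illustrative assumptions, not taken from the dataset); the same two expressions could be applied to `original_data`.

```python
import pandas as pd

# Toy stand-in for original_data; the values are illustrative only.
toy = pd.DataFrame({
    "Stage Name": ["A", "B", "C", "D"],
    "Height": [172.0, None, 160.0, None],
    "Weight": [55.0, None, None, None],
})

# Count and share of missing values per column, largest count first.
missing = pd.DataFrame({
    "missing": toy.isnull().sum(),
    "share": toy.isnull().mean().round(2),
}).sort_values("missing", ascending=False)

print(missing)
```

Sorting by the count puts the columns that need the most attention at the top of the table.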

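#### Assessing Missing Data

Knowing that `Full Name` and other variables have missing values, we extract the rows with missing values for each variable.

`Second Country`, `Birthplace`, `Other Group`, and `Former Group` have many missing values, but none of these four variables affects the analysis, so the columns can simply be dropped.

If `Stage Name`, `Full Name`, `Korean Name`, and `K Stage Name` were all missing at once, the record might be invalid. The `info` output has already shown that `Stage Name` has no missing values, and the other three variables have no bearing on the analysis, so those three columns can be dropped in a later step.

Next we extract the missing values for the key variables used in this analysis.

Extract the rows where `Date of Birth` is missing.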

In [6]:
original_data[original_data["Date of Birth"].isnull()]

Unnamed: 0,Stage Name,Full Name,Korean Name,K Stage Name,Date of Birth,Group,Debut,Company,Country,Second Country,Height,Weight,Birthplace,Other Group,Former Group,Gender
1127,On,Kim Dongwoo,김동우,온,,ABLUE,23/10/2022,J-Star,South Korea,,177.0,65.0,,,,M
1172,Roa,,,로아,,X:IN,11/04/2023,Escrow,South Korea,,,,,,,F


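`Date of Birth` is an important variable for the later analysis of idol characteristics. A record where it is missing or invalid is treated as unable to contribute meaningful information, so rows with a missing `Date of Birth` can be dropped later.

Extract the rows where `Group` is missing.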

In [7]:
original_data[original_data["Group"].isnull()]

Unnamed: 0,Stage Name,Full Name,Korean Name,K Stage Name,Date of Birth,Group,Debut,Company,Country,Second Country,Height,Weight,Birthplace,Other Group,Former Group,Gender
11,Ailee,Lee Yejin,이예진,에일리,30/05/1989,,,,South Korea,USA,165.0,,Denver,,,F
12,Aini,Kim Heejung,김희정,아이니,13/07/1991,,,,South Korea,,163.0,44.0,,,Pink Fantasy,F
16,AleXa,Kim Seri,김세리,알렉사,6/12/1996,,,,USA,,,,,,,F
29,Arang,Son Mnjung,손민정,아랑,8/03/2000,,,,South Korea,,,,,,NeonPunch,F
46,B.I,Kim Hanbin,김한빈,비아이,22/10/1996,,,,South Korea,,,,Goyang,,iKON,M
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1708,Youngmin,Lim Youngmin,임영민,영민,25/12/1995,,,,South Korea,,,,Busan,MXM,AB6IX,M
1714,Yubin,Cho Yubin,조유빈,유빈,9/10/1999,,,,South Korea,,156.0,40.0,,,Pink Fantasy,F
1725,Yujeong,Kim Yujeong,김유정,유정,14/02/1992,,,,South Korea,,161.0,42.0,Seoul,,LABOUM,F
1740,Yulhee,Kim Yulhee,김율희,율희,27/11/1997,,,,South Korea,,166.0,52.0,Bucheon,,LABOUM,F


In the output, rows with a missing `Group` appear to have `Debut` and `Company` missing as well. To verify this guess, add conditions to the filter.

In [8]:
original_data[(original_data["Group"].isnull())&(original_data["Debut"].notna())]

Unnamed: 0,Stage Name,Full Name,Korean Name,K Stage Name,Date of Birth,Group,Debut,Company,Country,Second Country,Height,Weight,Birthplace,Other Group,Former Group,Gender


In [9]:
original_data[(original_data["Group"].isnull())&(original_data["Company"].notna())]

Unnamed: 0,Stage Name,Full Name,Korean Name,K Stage Name,Date of Birth,Group,Debut,Company,Country,Second Country,Height,Weight,Birthplace,Other Group,Former Group,Gender


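Both filters return no rows, which confirms the guess: wherever `Group` is missing, `Debut` and `Company` are missing too.
`Debut` and `Company` are the key variables for analyzing debut age and differences between company standards. Rows where both are missing or invalid cannot contribute useful data, so rows with a missing `Group` should be dropped in the cleaning step.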
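The two filters used to verify the guess can also be folded into a single check. The sketch below uses a made-up three-row DataFrame (`toy` is an assumption for illustration) and tests whether, in every row, the three columns are either all missing or all present.

```python
import pandas as pd

# Illustrative rows: one complete record, one with all three fields missing.
toy = pd.DataFrame({
    "Group": ["VAV", None, "DIA"],
    "Debut": ["31/10/2015", None, "14/07/2015"],
    "Company": ["A team", None, "MBK"],
})

mask = toy[["Group", "Debut", "Company"]].isnull()

# True when every row has either all three values missing or none missing.
co_missing = bool((mask.any(axis=1) == mask.all(axis=1)).all())
print(co_missing)
```

Note that this check is slightly stricter than the two filters in the text, because it also fails when `Debut` or `Company` is missing while `Group` is present.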

Extract the rows where `Debut` is missing.

In [10]:
original_data[original_data["Debut"].isnull()]

Unnamed: 0,Stage Name,Full Name,Korean Name,K Stage Name,Date of Birth,Group,Debut,Company,Country,Second Country,Height,Weight,Birthplace,Other Group,Former Group,Gender
11,Ailee,Lee Yejin,이예진,에일리,30/05/1989,,,,South Korea,USA,165.0,,Denver,,,F
12,Aini,Kim Heejung,김희정,아이니,13/07/1991,,,,South Korea,,163.0,44.0,,,Pink Fantasy,F
16,AleXa,Kim Seri,김세리,알렉사,6/12/1996,,,,USA,,,,,,,F
29,Arang,Son Mnjung,손민정,아랑,8/03/2000,,,,South Korea,,,,,,NeonPunch,F
46,B.I,Kim Hanbin,김한빈,비아이,22/10/1996,,,,South Korea,,,,Goyang,,iKON,M
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1708,Youngmin,Lim Youngmin,임영민,영민,25/12/1995,,,,South Korea,,,,Busan,MXM,AB6IX,M
1714,Yubin,Cho Yubin,조유빈,유빈,9/10/1999,,,,South Korea,,156.0,40.0,,,Pink Fantasy,F
1725,Yujeong,Kim Yujeong,김유정,유정,14/02/1992,,,,South Korea,,161.0,42.0,Seoul,,LABOUM,F
1740,Yulhee,Kim Yulhee,김율희,율희,27/11/1997,,,,South Korea,,166.0,52.0,Bucheon,,LABOUM,F


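`Debut` is the key variable for analyzing debut age. Rows where it is missing or invalid cannot contribute useful data, so rows with a missing `Debut` should be dropped in the cleaning step.

Extract the rows where `Company` is missing.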

In [11]:
original_data[original_data["Company"].isnull()]

Unnamed: 0,Stage Name,Full Name,Korean Name,K Stage Name,Date of Birth,Group,Debut,Company,Country,Second Country,Height,Weight,Birthplace,Other Group,Former Group,Gender
11,Ailee,Lee Yejin,이예진,에일리,30/05/1989,,,,South Korea,USA,165.0,,Denver,,,F
12,Aini,Kim Heejung,김희정,아이니,13/07/1991,,,,South Korea,,163.0,44.0,,,Pink Fantasy,F
16,AleXa,Kim Seri,김세리,알렉사,6/12/1996,,,,USA,,,,,,,F
29,Arang,Son Mnjung,손민정,아랑,8/03/2000,,,,South Korea,,,,,,NeonPunch,F
46,B.I,Kim Hanbin,김한빈,비아이,22/10/1996,,,,South Korea,,,,Goyang,,iKON,M
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1708,Youngmin,Lim Youngmin,임영민,영민,25/12/1995,,,,South Korea,,,,Busan,MXM,AB6IX,M
1714,Yubin,Cho Yubin,조유빈,유빈,9/10/1999,,,,South Korea,,156.0,40.0,,,Pink Fantasy,F
1725,Yujeong,Kim Yujeong,김유정,유정,14/02/1992,,,,South Korea,,161.0,42.0,Seoul,,LABOUM,F
1740,Yulhee,Kim Yulhee,김율희,율희,27/11/1997,,,,South Korea,,166.0,52.0,Bucheon,,LABOUM,F


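`Company` is the key variable for analyzing differences between company standards. Rows where it is missing or invalid cannot contribute useful data, so rows with a missing `Company` should be dropped in the cleaning step.

Extract the rows where `Height` is missing.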

In [12]:
original_data[original_data["Height"].isnull()]

Unnamed: 0,Stage Name,Full Name,Korean Name,K Stage Name,Date of Birth,Group,Debut,Company,Country,Second Country,Height,Weight,Birthplace,Other Group,Former Group,Gender
5,Ahra,Go Ahra,고아라,아라,21/02/2001,Favorite,5/07/2017,Astory,South Korea,,,,Yeosu,,,F
6,Ahyeon,Jung Ahyeon,정아현,아현,11/04/2007,BABYMONSTER,0/01/1900,YG,South Korea,,,,,,,F
7,Ahyoon,Choi Subin,최수빈,아윤,23/10/2004,BOTOPASS,26/08/2020,WKS ENE,South Korea,,,,,,,F
8,Ahyoon,Shin Ahyoon,신아윤,아윤,24/09/2003,Queenz Eye,24/10/2022,Big Mountain,South Korea,,,,Seoul,,,F
9,Ahyoung,Cho Jayoung,조자영,아영,26/05/1991,Dal Shabet,3/01/2011,Happy Face,South Korea,,,,Seoul,,,F
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1766,YY,Kim Moonyong,김문용,와이와이,30/08/1991,UNVS,23/02/2020,CHITWN,South Korea,,,,,,,M
1768,Zero,Nasukawa Shota,나스카와 쇼타,제로,20/01/2003,T1419,21/09/2007,CJ E&M,Japan,,,,,,,M
1771,Zin,Jin Hyunbin,진현빈,지인,31/08/2001,bugAboo,25/10/2021,A team,South Korea,,,,,,,F
1775,Zuho,Bae Juho,백주호,주호,4/07/1996,SF9,5/10/2016,FNC,South Korea,,,,,,,M


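The output shows that `Height` has many missing values. Because `Height` is a key variable for the analysis, the missing values will be filled with the median of each gender group.

Extract the rows where `Weight` is missing.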

In [13]:
original_data[original_data["Weight"].isnull()]

Unnamed: 0,Stage Name,Full Name,Korean Name,K Stage Name,Date of Birth,Group,Debut,Company,Country,Second Country,Height,Weight,Birthplace,Other Group,Former Group,Gender
3,Aeji,Kwon Aeji,권애지,애지,25/10/1999,Hash Tag,11/10/2017,LUK,South Korea,,163.0,,Daegu,,,F
5,Ahra,Go Ahra,고아라,아라,21/02/2001,Favorite,5/07/2017,Astory,South Korea,,,,Yeosu,,,F
6,Ahyeon,Jung Ahyeon,정아현,아현,11/04/2007,BABYMONSTER,0/01/1900,YG,South Korea,,,,,,,F
7,Ahyoon,Choi Subin,최수빈,아윤,23/10/2004,BOTOPASS,26/08/2020,WKS ENE,South Korea,,,,,,,F
8,Ahyoon,Shin Ahyoon,신아윤,아윤,24/09/2003,Queenz Eye,24/10/2022,Big Mountain,South Korea,,,,Seoul,,,F
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1768,Zero,Nasukawa Shota,나스카와 쇼타,제로,20/01/2003,T1419,21/09/2007,CJ E&M,Japan,,,,,,,M
1771,Zin,Jin Hyunbin,진현빈,지인,31/08/2001,bugAboo,25/10/2021,A team,South Korea,,,,,,,F
1774,Zoa,Cho Hyewon,조혜원,조아,31/05/2005,Weeekly,30/07/2020,Play M,South Korea,,170.0,,,,,F
1775,Zuho,Bae Juho,백주호,주호,4/07/1996,SF9,5/10/2016,FNC,South Korea,,,,,,,M


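The output shows that `Weight` has many missing values. Because `Weight` is a key variable for the analysis, the missing values will be filled with the median of each gender group.

#### Assessing Duplicate Data

Judging by what the variables mean, no single variable in this dataset is a unique identifier. By industry convention, however, two idols in the same group do not share a stage name, so the same combination of `Group` and `Stage Name` should not appear more than once.

First check `Stage Name` for duplicates.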

In [14]:
original_data["Stage Name"].value_counts().sort_index()

2Soul    1
A-min    1
A-ra     1
A.M      1
Ace      1
        ..
Ziu      1
Zoa      1
Zuho     1
Zuny     1
ra.L     1
Name: Stage Name, Length: 1469, dtype: int64

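The result has 1469 distinct values, fewer than the 1778 records in the dataset, and `Stage Name` is known to have no missing values, so `Stage Name` contains duplicates. Next, extract the rows with a duplicated `Stage Name`.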

In [15]:
original_data[original_data["Stage Name"].duplicated(keep=False)]

Unnamed: 0,Stage Name,Full Name,Korean Name,K Stage Name,Date of Birth,Group,Debut,Company,Country,Second Country,Height,Weight,Birthplace,Other Group,Former Group,Gender
7,Ahyoon,Choi Subin,최수빈,아윤,23/10/2004,BOTOPASS,26/08/2020,WKS ENE,South Korea,,,,,,,F
8,Ahyoon,Shin Ahyoon,신아윤,아윤,24/09/2003,Queenz Eye,24/10/2022,Big Mountain,South Korea,,,,Seoul,,,F
17,Alice,Song Joohee,송주희,앨리스,21/03/1990,Hello Venus,9/05/2012,Fantagio,South Korea,,166.0,47.0,Wonju,,,F
18,Alice,Cheon Jaeyoung,천재영,앨리스,20/02/2002,TRACER,3/04/2022,Gleamedia,South Korea,,,,,,,F
22,Andy,Lui Chunyeung,루이쯔양,앤디,6/10/1994,7 O'clock,26/08/2014,Jungle,Hong Kong,,,,,,,M
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1744,Yuna,Shin Yuna,신유나,유나,9/12/2003,ITZY,12/02/2019,JYP,South Korea,,,,Suwon,,,F
1747,Yunji,Kim Yunji,김윤지,윤지,26/08/1996,ARIAZ,23/10/2019,Rising Star,South Korea,,,,,,,F
1748,Yunji,Lee Yunji,이윤지,윤지,4/04/1992,Playback,25/06/2015,Coridel,South Korea,,,,,,,F
1758,Yuri,Jo Yuri,조유리,유리,22/10/2001,IZ*ONE,29/10/2018,Off The Record,South Korea,,,,Busan,,,F


The output is consistent with the industry convention mentioned above: where `Stage Name` repeats, `Group` does not. To verify this pattern, add a condition to the filter.

In [16]:
original_data[original_data.duplicated(subset=["Stage Name","Group"],keep=False)]

Unnamed: 0,Stage Name,Full Name,Korean Name,K Stage Name,Date of Birth,Group,Debut,Company,Country,Second Country,Height,Weight,Birthplace,Other Group,Former Group,Gender
1453,Taeha,Yoo Taeha,유태하,태하,5/10/1995,,,,South Korea,,162.0,,,,Berry Good,F
1454,Taeha,Kim Taeha,김태하,태하,3/06/1998,,,,South Korea,,,,Jeonju,,MOMOLAND,F


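The output shows two records in which `Stage Name` and `Group` both repeat. In both records `Group` is missing (pandas treats missing values as equal when checking for duplicates), and the other key variables `Debut` and `Company` are missing as well. These records cannot contribute useful data, so they will be dropped.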
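The behaviour behind this result is that `duplicated` considers two missing values to be equal. A minimal sketch on made-up rows (the `toy` frame is an illustrative assumption) shows the difference between checking all rows and checking only rows with a known group.

```python
import numpy as np
import pandas as pd

# Two idols share a stage name with no group recorded; two share a stage
# name but belong to different groups.
toy = pd.DataFrame({
    "Stage Name": ["Taeha", "Taeha", "Yuna", "Yuna"],
    "Group": [np.nan, np.nan, "AoA", "ITZY"],
})

# NaN values in "Group" compare as equal here, so the first two rows match.
dup_all = toy.duplicated(subset=["Stage Name", "Group"], keep=False)

# Restricting the check to rows with a known group removes that effect.
known = toy.dropna(subset=["Group"])
dup_known = known.duplicated(subset=["Stage Name", "Group"], keep=False)

print(dup_all.tolist())
print(dup_known.tolist())
```

If the rows with a missing `Group` are removed first, the duplicate check only reports genuine clashes within a group.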

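#### Assessing Inconsistent Data

Inconsistent data may exist in the `Company` and `Group` variables, so we check whether several different values refer to the same company or group.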

In [17]:
pd.set_option('display.max_rows', None)
original_data["Company"].value_counts().sort_index()

143                        3
A team                    12
A100                       3
A2Z                        7
AB Creative               14
ADOR                       5
ALLART                     5
ANS                       13
AO                         2
ARA-LINE                   2
ATTRAKT                    4
About                      2
All-S                      9
Alseulbit                  5
Amuse                      6
Around Us                  5
Asia Bridge                5
Astory                     6
BG                        14
Barunson                   4
Be:lift                    7
Beat                       5
Big Hit                   12
Big Mountain               6
Big Planet Made            3
Blockberry                12
Bluedot                    6
Box                        4
Brand New                 13
Brave                     13
C-JeS                      3
C9                        21
CHITWN                     5
CJ E&M                    13
CT            

The output shows no cases where several different values refer to the same company, so no handling is needed.

In [18]:
original_data["Group"].value_counts().sort_index()

(G)I-DLE             5
100%                 5
14U                 14
15&                  2
1TEAM                3
1the9                3
24K                  7
2EYES                4
2NE1                 4
2PM                  6
3YE                  3
4TEN                 4
4minute              5
7 O'clock            6
8TURN                8
9Muses               5
A.C.E                5
AB6IX                4
ABLUE                6
AIMERS               6
ALICE                7
ANS                  7
APRIL                6
AREAL                4
ARGON                6
ARIAZ                6
ARTBEAT              7
ASTRO                6
ATBO                 7
ATEEZ                8
After School         5
AoA                  4
Apink                5
Astin                7
B.A.P                6
B.I.G                5
B.O.Y                2
B1A4                 3
BABYMONSTER          7
BADKIZ               5
BAE173               9
BESTie               2
BIGBANG              4
BLACKPINK  

In [19]:
# Restore the default display setting
pd.reset_option('display.max_rows')

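The output shows no cases where several different values refer to the same group, so no handling is needed.

#### Assessing Invalid and Erroneous Data

The DataFrame's `describe` method gives summary statistics for the numeric columns.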

In [20]:
original_data.describe()

Unnamed: 0,Height,Weight
count,836.0,566.0
mean,170.643541,57.224382
std,7.805094,16.831308
min,150.0,38.0
25%,164.0,48.0
50%,170.0,57.0
75%,177.0,63.0
max,190.0,170.0


These summary statistics are not split by gender. Even so, the maximum `Weight` is 170, which is clearly implausible for an idol and would distort the later analysis.
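Since the selection criteria are expected to differ by gender, it can help to look at the same statistics per gender group. The following is a small sketch on made-up numbers (the `toy` frame is an assumption; only the column names come from the dataset).

```python
import pandas as pd

# Illustrative values; 165.0 mimics a height that ended up in the Weight column.
toy = pd.DataFrame({
    "Gender": ["F", "F", "M", "M"],
    "Weight": [44.0, 165.0, 62.0, 63.0],
})

# Per-gender summary makes an outlier stand out within its own group.
summary = toy.groupby("Gender")["Weight"].agg(["min", "median", "max"])
print(summary)
```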

So we first extract the observations with a `Weight` of 100 or more and assess what they mean.

In [21]:
original_data[original_data["Weight"]>=100]

Unnamed: 0,Stage Name,Full Name,Korean Name,K Stage Name,Date of Birth,Group,Debut,Company,Country,Second Country,Height,Weight,Birthplace,Other Group,Former Group,Gender
203,Dan-a,Park Seungyeon,박승연,단아,26/06/1993,Matilda,18/03/2016,Box,South Korea,,,160.0,,,,F
290,Ellyn,Bang Sunhee,방선희,엘린,19/10/2002,Girlkind,16/01/2018,Nextlevel,South Korea,,,170.0,,,,F
392,Haena,Lee Haena,이해나,해나,2/06/1991,Matilda,18/03/2016,Box,South Korea,,,165.0,,,,F
759,JK,Kim Jikang,김지강,지강,17/03/1998,Girlkind,16/01/2018,Nextlevel,South Korea,,,161.0,,,,F
992,Medic Jin,Bae Yujin,배유진,메딕진,25/08/1996,Girlkind,16/01/2018,Nextlevel,South Korea,,,167.0,,,,F
1198,Saebyeol,Han Saebyeol,한새별,새별,23/05/1996,Matilda,18/03/2016,Box,South Korea,,,168.0,,,,F
1233,Semmi,Oh Heesun,오희선,세미,22/10/1995,Matilda,18/03/2016,Box,South Korea,,,165.0,,,,F
1239,Seokcheol,Lee Seokchul,이석철,석철,11/01/2000,TheEastLight,15/11/2018,Stardium,South Korea,,,170.0,,,,M
1410,Sun J,Jeon Heesun,전희선,썬제이,13/02/2001,Girlkind,16/01/2018,Nextlevel,South Korea,,,165.0,,,,F
1572,Xeheun,Lee Seheun,이세흔,세흔,11/07/1999,Girlkind,16/01/2018,Nextlevel,South Korea,,,162.0,,,,F


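Judging by the values, these look more like `Height` data. Because these are key variables for the analysis, however, we cannot simply assume that the values are heights, so the erroneous rows with a `Weight` of 100 or more will be dropped in a later step.

We also need to check the dates for logical consistency, starting with whether `Date of Birth` falls within a plausible range. We assume that an idol is at least 10 and at most 80 years old, that is, born between 1935 and 2015.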

In [22]:
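# Extract the four-digit year at the end of each date-of-birth string
s1=pd.Series(original_data["Date of Birth"].str.slice(-4,10))
# Convert to float (an int column cannot hold NaN)
s1=s1.astype(float)
# Select the values outside the plausible range
s1[(s1<1935)|(s1>2015)]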

Series([], Name: Date of Birth, dtype: float64)

The result is empty, so `Date of Birth` contains no out-of-range values.

We also need to check that the `Debut` year is not earlier than the `Date of Birth` year.

In [23]:
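# Extract the four-digit year at the end of each debut-date string
s2=pd.Series(original_data["Debut"].str.slice(-4,10))
# Convert to float (an int column cannot hold NaN)
s2=s2.astype(float)
# Select the rows whose debut year is earlier than the birth year
s2[s2<s1]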

6       1900.0
27      2002.0
38      1900.0
152     1900.0
430     1900.0
432     2002.0
1136    1900.0
1181    1900.0
1190    1900.0
Name: Debut, dtype: float64

The output shows 9 records whose debut year is earlier than the birth year. These are erroneous and need to be dropped in a later step.
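The year-based check can be extended to full dates. The sketch below assumes the raw strings are in day/month/year order (suggested by values such as `31/12/1996`) and uses made-up rows; the last row in particular is invented to show an illogical pair. Parsing with `errors="coerce"` also surfaces strings that are not valid dates.

```python
import pandas as pd

# Illustrative rows only; the last row is made up to show an illogical pair.
toy = pd.DataFrame({
    "Date of Birth": ["20/03/2002", "31/12/1996", None, "17/05/2004"],
    "Debut": ["0/01/1900", "9/07/2019", "23/10/2022", "1/03/2002"],
})

# errors="coerce" turns strings that do not match the format into NaT.
birth = pd.to_datetime(toy["Date of Birth"], format="%d/%m/%Y", errors="coerce")
debut = pd.to_datetime(toy["Debut"], format="%d/%m/%Y", errors="coerce")

# Present in the raw column but not a valid date (for example a placeholder).
unparseable = toy["Debut"].notna() & debut.isna()
# Debut on or before the date of birth; comparisons involving NaT are False.
illogical = debut <= birth

print(toy[unparseable | illogical])
```

Keeping the two masks separate distinguishes placeholder strings from dates that are valid but inconsistent with the date of birth.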

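## Cleaning the Data

The third step is to clean the data based on the assessment. The cleaning tasks are:

- Split the `Company` column and keep only the first company
- Convert the `Date of Birth` variable to a datetime type
- Convert the `Debut` variable to a datetime type
- Convert the `Gender` variable to the `category` type
- Drop the `Full Name`, `Korean Name`, `K Stage Name`, `Second Country`, `Birthplace`, `Other Group`, and `Former Group` columns
- Drop observations with a missing `Group`
- Drop observations with a missing `Date of Birth`
- Drop observations with a missing `Debut`
- Drop observations with a missing `Company`
- Drop observations with a `Weight` of 100 or more
- Fill missing `Height` values with the per-gender median
- Fill missing `Weight` values with the per-gender median
- Drop observations in which `Stage Name` and `Group` both repeat
- Drop observations in which `Debut` is earlier than `Date of Birth`

To keep the cleaned data separate from the raw data, create a new variable `cleaned_data` as a copy of `original_data`. All subsequent cleaning steps are applied to `cleaned_data`.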

In [24]:
cleaned_data=original_data.copy()
cleaned_data.head()

Unnamed: 0,Stage Name,Full Name,Korean Name,K Stage Name,Date of Birth,Group,Debut,Company,Country,Second Country,Height,Weight,Birthplace,Other Group,Former Group,Gender
0,2Soul,Kim Younghoon,김영훈,이솔,10/09/1997,7 O'clock,26/08/2014,Jungle,South Korea,,172.0,55.0,,,,M
1,A.M,Seong Hyunwoo,성현우,에이엠,31/12/1996,Limitless,9/07/2019,ONO,South Korea,,181.0,62.0,,,,M
2,Ace,Jang Wooyoung,장우영,에이스,28/08/1992,VAV,31/10/2015,A team,South Korea,,177.0,63.0,,,,M
3,Aeji,Kwon Aeji,권애지,애지,25/10/1999,Hash Tag,11/10/2017,LUK,South Korea,,163.0,,Daegu,,,F
4,AhIn,Lee Ahin,이아인,아인,27/09/1999,MOMOLAND,9/11/2016,Double Kick,South Korea,,160.0,44.0,Wonju,,,F


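#### Split the `Company` column and keep only the first company:

In [25]:
# Split the column on "|" (preview only; the result is not stored)
cleaned_data["Company"].str.split("|",expand=True)

# Assign the split parts to named columns
cleaned_data[["Company","C1","C2"]]=cleaned_data["Company"].str.split("|",expand=True)

# Drop the columns that are no longer needed
cleaned_data=cleaned_data.drop(["C1","C2"],axis=1)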
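The split above relies on the column producing exactly three parts. An alternative that does not depend on the number of parts is to take the first element of each split directly. This is a minimal sketch on a made-up Series (`company` is an illustrative stand-in for `cleaned_data["Company"]`); the added `str.strip()` is an assumption to guard against stray spaces around the separator.

```python
import pandas as pd

# Illustrative values, including a multi-company cell and a missing value.
company = pd.Series(["Jungle", "JYP| SidusHQ", None])

# Split on the literal "|" and keep only the first part of each cell.
first_company = company.str.split("|", regex=False).str[0].str.strip()

print(first_company.tolist())
```

Missing values stay missing, so the later `dropna` step on `Company` is unaffected.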

#### Convert the `Date of Birth` variable to a datetime type:

In [26]:
cleaned_data["Date of Birth"]=pd.to_datetime(cleaned_data["Date of Birth"])
cleaned_data["Date of Birth"]

0      1997-10-09
1      1996-12-31
2      1992-08-28
3      1999-10-25
4      1999-09-27
          ...    
1773   1994-09-06
1774   2005-05-31
1775   1996-04-07
1776   1993-01-27
1777   1994-08-12
Name: Date of Birth, Length: 1778, dtype: datetime64[ns]
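One point worth checking in this output: the raw strings look day-first (a value such as `31/12/1996` can only be read as day/month/year), yet index 0, whose raw value is `10/09/1997`, appears as `1997-10-09`. This suggests that ambiguous values were read month-first. If the file is indeed day-first throughout, passing an explicit format avoids the ambiguity. A minimal sketch on three raw values taken from the rows shown earlier:

```python
import pandas as pd

# Raw strings as they appear in the source file (rows 0, 1 and 1775).
raw = pd.Series(["10/09/1997", "31/12/1996", "4/07/1996"])

# An explicit day/month/year format removes the guesswork.
parsed = pd.to_datetime(raw, format="%d/%m/%Y")

print(parsed.dt.strftime("%Y-%m-%d").tolist())
# ['1997-09-10', '1996-12-31', '1996-07-04']
```

The same consideration applies to the `Debut` column, since it uses the same string layout. Changing the parse would also change the downstream outputs recorded in this notebook, so they would need to be regenerated.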

#### Convert the `Debut` variable to a datetime type:

In [27]:
cleaned_data["Debut"]=pd.to_datetime(cleaned_data["Debut"])
cleaned_data["Debut"]

ParserError: month must be in 1..12: 0/01/1900

Parsing `"0/01/1900"` as a date fails here. `"0/01/1900"` probably stands for a missing date or a placeholder. We first extract the observations whose `Debut` is `"0/01/1900"` and assess them.

In [28]:
cleaned_data[cleaned_data["Debut"]=="0/01/1900"]

Unnamed: 0,Stage Name,Full Name,Korean Name,K Stage Name,Date of Birth,Group,Debut,Company,Country,Second Country,Height,Weight,Birthplace,Other Group,Former Group,Gender
6,Ahyeon,Jung Ahyeon,정아현,아현,2007-11-04,BABYMONSTER,0/01/1900,YG,South Korea,,,,,,,F
38,Asa,Enami Asa,에나미 아사,아사,2006-04-17,BABYMONSTER,0/01/1900,YG,Japan,,,,,,,F
152,Chiquita,Riracha Phondechaphiphat,리라차 폰데차피팟,치키타,2009-02-17,BABYMONSTER,0/01/1900,YG,Thailand,,,,,,,F
430,Haram,Shin Haram,신하람,하람,2007-10-17,BABYMONSTER,0/01/1900,YG,South Korea,,,,,,,F
1136,Pharita,Pharita Chaikong,파리따 차이콩,파리타,2005-08-26,BABYMONSTER,0/01/1900,YG,Thailand,,,,Bangkok,,,F
1181,Rora,Lee Dain,이다인,로라,2008-05-08,BABYMONSTER,0/01/1900,YG,South Korea,,,,,,,F
1190,Ruka,Kawai Ruka,카와이 루카,루카,2002-03-20,BABYMONSTER,0/01/1900,YG,Japan,,,,,,,F


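In the output, the rows whose `Debut` is `"0/01/1900"` have a debut date earlier than their `Date of Birth`, which is illogical, so these rows can be dropped before the conversion.

Drop the erroneous values and check how many remain afterwards: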

In [29]:
cleaned_data=cleaned_data[cleaned_data["Debut"]!="0/01/1900"]

In [31]:
len(cleaned_data[cleaned_data["Debut"]=="0/01/1900"])

0

With the erroneous data removed, convert the `Debut` variable to a datetime type:

In [32]:
cleaned_data["Debut"]=pd.to_datetime(cleaned_data["Debut"])
cleaned_data["Debut"]

0      2014-08-26
1      2019-09-07
2      2015-10-31
3      2017-11-10
4      2016-09-11
          ...    
1773   2014-08-27
1774   2020-07-30
1775   2016-05-10
1776          NaT
1777   2013-07-03
Name: Debut, Length: 1771, dtype: datetime64[ns]

#### Convert the `Gender` variable to the `category` type:

In [35]:
cleaned_data["Gender"]=cleaned_data["Gender"].astype("category")
cleaned_data["Gender"]

0       M
1       M
2       M
3       F
4       F
       ..
1773    F
1774    F
1775    M
1776    M
1777    F
Name: Gender, Length: 1771, dtype: category
Categories (2, object): ['F', 'M']

#### Drop the `Full Name`, `Korean Name`, `K Stage Name`, `Second Country`, `Birthplace`, `Other Group`, and `Former Group` columns

In [36]:
# Pass the column labels directly instead of a sub-DataFrame
cleaned_data=cleaned_data.drop(columns=["Full Name","Korean Name","K Stage Name","Second Country","Birthplace","Other Group","Former Group"])

In [37]:
cleaned_data.head()

Unnamed: 0,Stage Name,Date of Birth,Group,Debut,Company,Country,Height,Weight,Gender
0,2Soul,1997-10-09,7 O'clock,2014-08-26,Jungle,South Korea,172.0,55.0,M
1,A.M,1996-12-31,Limitless,2019-09-07,ONO,South Korea,181.0,62.0,M
2,Ace,1992-08-28,VAV,2015-10-31,A team,South Korea,177.0,63.0,M
3,Aeji,1999-10-25,Hash Tag,2017-11-10,LUK,South Korea,163.0,,F
4,AhIn,1999-09-27,MOMOLAND,2016-09-11,Double Kick,South Korea,160.0,44.0,F


#### Drop observations with a missing `Group` and check the number of missing values afterwards:

In [38]:
cleaned_data.dropna(subset=["Group"],inplace=True)

In [39]:
cleaned_data["Group"].isnull().sum()

0

#### Drop observations with a missing `Date of Birth` and check the number of missing values afterwards:

In [40]:
cleaned_data.dropna(subset=["Date of Birth"],inplace=True)

In [41]:
cleaned_data["Date of Birth"].isnull().sum()

0

#### Drop observations with a missing `Company` and check the number of missing values afterwards:

In [42]:
cleaned_data.dropna(subset=["Company"],inplace=True)

In [43]:
cleaned_data["Company"].isnull().sum()

0

#### Drop observations with a `Weight` of 100 or more and check how many such observations remain:

In [44]:
cleaned_data=cleaned_data[cleaned_data["Weight"]<100]

In [45]:
len(cleaned_data[cleaned_data["Weight"]>=100])

0
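A detail to be aware of with this filter: a comparison such as `Weight < 100` evaluates to `False` for missing values, so the filter also removes rows whose `Weight` is missing. That may explain why the final `info()` output lists only 512 rows and would leave nothing for the per-gender median fill of `Weight` to do. A minimal sketch of a filter that keeps missing values, on made-up rows (`toy` is an illustrative assumption):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Stage Name": ["A", "B", "C"],
    "Weight": [55.0, np.nan, 165.0],
})

# NaN < 100 is False, so row B is dropped together with row C.
strict = toy[toy["Weight"] < 100]

# Negating the "too large" condition keeps rows with a missing weight.
keep_missing = toy[~(toy["Weight"] >= 100)]

print(strict["Stage Name"].tolist())
print(keep_missing["Stage Name"].tolist())
```

Which variant is appropriate depends on whether rows with a missing weight are meant to be imputed or excluded.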

#### Fill missing `Height` values with the per-gender median and check whether any missing values remain:

In [46]:
cleaned_data["Height"]=cleaned_data.groupby('Gender')['Height'].transform(lambda x:x.fillna(x.median()))

In [47]:
cleaned_data["Height"].isnull().sum()

0

#### Fill missing `Weight` values with the per-gender median and check whether any missing values remain:

In [48]:
cleaned_data['Weight']=cleaned_data.groupby('Gender')['Weight'].transform(lambda x:x.fillna(x.median()))

In [49]:
cleaned_data['Weight'].isnull().sum()

0
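To make the group-wise fill concrete, here is a small worked example on made-up heights (`toy` is an illustrative assumption). It uses `transform("median")` followed by `fillna`, an equivalent way of writing the lambda-based fill used above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Gender": ["F", "F", "F", "M", "M", "M"],
    "Height": [160.0, 166.0, np.nan, 172.0, 181.0, np.nan],
})

# Median of each gender group, broadcast back to the original row order.
medians = toy.groupby("Gender")["Height"].transform("median")

# Fill only the missing entries; observed values are left untouched.
toy["Height"] = toy["Height"].fillna(medians)

print(toy["Height"].tolist())
# [160.0, 166.0, 163.0, 172.0, 181.0, 176.5]
```

When a large share of a column is imputed this way, many rows end up with the same value, which narrows the spread within each gender group; this is worth keeping in mind when interpreting later comparisons.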

#### Drop observations in which `Stage Name` and `Group` both repeat and check for remaining duplicates:

In [50]:
cleaned_data.drop_duplicates(subset=['Stage Name','Group'],inplace=True)

In [51]:
len(cleaned_data[cleaned_data.duplicated(subset=["Stage Name","Group"],keep=False)])

0

#### Drop observations in which `Debut` is earlier than `Date of Birth` and check whether any illogical observations remain:

In [52]:
cleaned_data=cleaned_data[cleaned_data["Debut"]>cleaned_data["Date of Birth"]]

In [53]:
len(cleaned_data[cleaned_data["Debut"]<=cleaned_data["Date of Birth"]])

0

## Saving the Cleaned Data

After cleaning, save the clean and tidy data to a new file named `"kpop_idols_cleaned.csv"`.

In [54]:
cleaned_data.head()

Unnamed: 0,Stage Name,Date of Birth,Group,Debut,Company,Country,Height,Weight,Gender
0,2Soul,1997-10-09,7 O'clock,2014-08-26,Jungle,South Korea,172.0,55.0,M
1,A.M,1996-12-31,Limitless,2019-09-07,ONO,South Korea,181.0,62.0,M
2,Ace,1992-08-28,VAV,2015-10-31,A team,South Korea,177.0,63.0,M
4,AhIn,1999-09-27,MOMOLAND,2016-09-11,Double Kick,South Korea,160.0,44.0,F
14,Aki,2001-12-02,MAJORS,2021-09-03,ANS,South Korea,168.0,48.0,F


In [55]:
cleaned_data.to_csv("kpop_idols_cleaned.csv",index=False)

In [56]:
pd.read_csv("kpop_idols_cleaned.csv").head()

Unnamed: 0,Stage Name,Date of Birth,Group,Debut,Company,Country,Height,Weight,Gender
0,2Soul,1997-10-09,7 O'clock,2014-08-26,Jungle,South Korea,172.0,55.0,M
1,A.M,1996-12-31,Limitless,2019-09-07,ONO,South Korea,181.0,62.0,M
2,Ace,1992-08-28,VAV,2015-10-31,A team,South Korea,177.0,63.0,M
3,AhIn,1999-09-27,MOMOLAND,2016-09-11,Double Kick,South Korea,160.0,44.0,F
4,Aki,2001-12-02,MAJORS,2021-09-03,ANS,South Korea,168.0,48.0,F
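CSV files do not store column types, so the datetime and `category` conversions are not preserved when the file is read back with a plain `read_csv`. A minimal sketch of restoring them at load time, using an in-memory string in place of the saved file (the one-row `csv_text` is an illustrative assumption built from the first row above):

```python
import io
import pandas as pd

# Stand-in for the contents of the saved file.
csv_text = (
    "Stage Name,Date of Birth,Debut,Gender\n"
    "2Soul,1997-10-09,2014-08-26,M\n"
)

# parse_dates and dtype restore the types chosen during cleaning.
restored = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["Date of Birth", "Debut"],
    dtype={"Gender": "category"},
)

print(restored.dtypes)
```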


In [57]:
cleaned_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 512 entries, 0 to 1773
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   Stage Name     512 non-null    object        
 1   Date of Birth  512 non-null    datetime64[ns]
 2   Group          512 non-null    object        
 3   Debut          512 non-null    datetime64[ns]
 4   Company        512 non-null    object        
 5   Country        512 non-null    object        
 6   Height         512 non-null    float64       
 7   Weight         512 non-null    float64       
 8   Gender         512 non-null    category      
dtypes: category(1), datetime64[ns](2), float64(2), object(4)
memory usage: 36.6+ KB
