# US Baby Names Data Analysis

### Introduction:

We use a subset of the [US Baby Names dataset](https://www.kaggle.com/kaggle/us-baby-names) from Kaggle, covering names from 2004 to 2014.


### Step 1. Import the necessary Python libraries

In [16]:
import pandas as pd

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/US_Baby_Names/US_Baby_Names_right.csv).

### Step 3. Assign it to a variable called baby_names.

In [17]:
baby_names = pd.read_csv('https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/US_Baby_Names/US_Baby_Names_right.csv')
baby_names.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1016395 entries, 0 to 1016394
Data columns (total 7 columns):
Unnamed: 0    1016395 non-null int64
Id            1016395 non-null int64
Name          1016395 non-null object
Year          1016395 non-null int64
Gender        1016395 non-null object
State         1016395 non-null object
Count         1016395 non-null int64
dtypes: int64(4), object(3)
memory usage: 54.3+ MB
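
The `Unnamed: 0` column listed above is the name pandas assigns to a CSV column whose header cell is empty, which usually means a saved row index. As a minimal sketch, assuming the file has that layout, `index_col=0` reads the column as the index instead. The three-row `sample_csv` below is a hypothetical stand-in for the real file, so the example runs without a download.

```python
import io
import pandas as pd

# Hypothetical three-row sample that mimics the file's layout:
# the first header cell is empty, so pandas would name it "Unnamed: 0".
sample_csv = io.StringIO(
    ",Id,Name,Year,Gender,State,Count\n"
    "11349,11350,Emma,2004,F,AK,62\n"
    "11350,11351,Madison,2004,F,AK,48\n"
    "11351,11352,Hannah,2004,F,AK,46\n"
)

# index_col=0 uses the first column as the row index instead of a data column
sample = pd.read_csv(sample_csv, index_col=0)
print(sample.columns.tolist())
```

With this option only `Id` would remain to be removed in Step 5.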


### Step 4. See the first 10 rows of baby_names

In [18]:
baby_names.head(10)

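| | Unnamed: 0 | Id | Name | Year | Gender | State | Count |
|---|---|---|---|---|---|---|---|
| 0 | 11349 | 11350 | Emma | 2004 | F | AK | 62 |
| 1 | 11350 | 11351 | Madison | 2004 | F | AK | 48 |
| 2 | 11351 | 11352 | Hannah | 2004 | F | AK | 46 |
| 3 | 11352 | 11353 | Grace | 2004 | F | AK | 44 |
| 4 | 11353 | 11354 | Emily | 2004 | F | AK | 41 |
| 5 | 11354 | 11355 | Abigail | 2004 | F | AK | 37 |
| 6 | 11355 | 11356 | Olivia | 2004 | F | AK | 33 |
| 7 | 11356 | 11357 | Isabella | 2004 | F | AK | 30 |
| 8 | 11357 | 11358 | Alyssa | 2004 | F | AK | 29 |
| 9 | 11358 | 11359 | Sophia | 2004 | F | AK | 28 |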


### Step 5. Delete the columns 'Unnamed: 0' and 'Id'

In [19]:
# deletes Unnamed: 0
del baby_names['Unnamed: 0']

# deletes Id
del baby_names['Id']

baby_names.head()

| | Name | Year | Gender | State | Count |
|---|---|---|---|---|---|
| 0 | Emma | 2004 | F | AK | 62 |
| 1 | Madison | 2004 | F | AK | 48 |
| 2 | Hannah | 2004 | F | AK | 46 |
| 3 | Grace | 2004 | F | AK | 44 |
| 4 | Emily | 2004 | F | AK | 41 |
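
`del` removes a column in place. As an alternative sketch, `DataFrame.drop(columns=...)` removes several columns in one call and returns a new DataFrame, so the original stays available. The two-row `df` below is a hypothetical miniature of `baby_names`.

```python
import pandas as pd

# Hypothetical miniature version of baby_names
df = pd.DataFrame({
    "Unnamed: 0": [11349, 11350],
    "Id": [11350, 11351],
    "Name": ["Emma", "Madison"],
    "Count": [62, 48],
})

# drop returns a new DataFrame and leaves df itself unchanged
trimmed = df.drop(columns=["Unnamed: 0", "Id"])
print(trimmed.columns.tolist())
```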


### Step 6. Count the gender distribution in baby_names

In [20]:
baby_names['Gender'].value_counts()

F    558846
M    457549
Name: Gender, dtype: int64
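
Note that `value_counts` counts rows, and each row here is one name in one year, state, and gender, not one baby. If the goal were the number of births per gender, summing `Count` within each gender would be the more direct measure. The sketch below contrasts the two on a hypothetical three-row frame called `toy`.

```python
import pandas as pd

# Hypothetical toy data with the same columns as baby_names
toy = pd.DataFrame({
    "Name": ["Emma", "Emma", "Jacob"],
    "Gender": ["F", "F", "M"],
    "State": ["AK", "AL", "AK"],
    "Count": [62, 10, 100],
})

# Number of rows per gender (what value_counts reports)
rows_per_gender = toy["Gender"].value_counts()

# Number of births per gender (sum of Count within each gender)
births_per_gender = toy.groupby("Gender")["Count"].sum()

print(rows_per_gender)
print(births_per_gender)
```

In this toy frame `F` has more rows but fewer births, which shows that the two measures can disagree.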

### Step 7. Group the dataset by name and assign to names

In [32]:
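# The Year column is not needed here, so it could be deleted
# del baby_names["Year"]

# Group by name and sum only the Count column to get each name's total occurrences
# (selecting Count keeps Year, Gender and State out of the sum)
names = baby_names.groupby("Name")[["Count"]].sum()

# print the first 5 observations
print("---- first 5 observations -----")
print(names.head())

# print the size of the dataset
print("---- size of the dataset -----")
print(names.shape)

# sort the names by total count, from highest to lowest
names.sort_values("Count", ascending=False).head()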

---- first 5 observations -----
         Count
Name          
Aaban       12
Aadan       23
Aadarsh      5
Aaden     3426
Aadhav       6
---- size of the dataset -----
(17632, 1)


| Name | Count |
|---|---|
| Jacob | 242874 |
| Emma | 214852 |
| Michael | 214405 |
| Ethan | 209277 |
| Isabella | 204798 |


In [28]:
names

| Name | Count |
|---|---|
| Aaban | 12 |
| Aadan | 23 |
| Aadarsh | 5 |
| Aaden | 3426 |
| Aadhav | 6 |
| ... | ... |
| Zyra | 42 |
| Zyrah | 11 |
| Zyren | 6 |
| Zyria | 59 |
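
When only the totals are needed, selecting the `Count` column before aggregating gives a Series, and `nlargest` returns the top entries without sorting the whole result. This is a minimal sketch on a hypothetical five-row frame called `toy`, not the notebook's own cell.

```python
import pandas as pd

# Hypothetical toy data with the same columns as baby_names
toy = pd.DataFrame({
    "Name": ["Emma", "Jacob", "Emma", "Jacob", "Noah"],
    "Year": [2004, 2004, 2005, 2005, 2005],
    "State": ["AK", "AK", "AK", "AL", "AL"],
    "Count": [62, 70, 50, 30, 20],
})

# Selecting Count first means only that column is summed
totals = toy.groupby("Name")["Count"].sum()

# The two names with the largest totals
top2 = totals.nlargest(2)
print(top2)
```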


### Step 8. How many different names are there?

In [22]:
# names was built by grouping on Name, so every name in it is unique
# taking the length of names is therefore enough
len(names)

17632
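
The same count can be taken directly from the ungrouped data with `Series.nunique`, which may be convenient when no grouped frame exists yet. The sketch below uses a hypothetical four-row frame called `toy` and checks that both routes agree.

```python
import pandas as pd

# Hypothetical toy data: "Emma" appears twice
toy = pd.DataFrame({
    "Name": ["Emma", "Jacob", "Emma", "Noah"],
    "Count": [1, 2, 3, 4],
})

# Number of distinct names, without grouping first
distinct = toy["Name"].nunique()

# Number of groups produced by grouping on Name
n_groups = toy.groupby("Name").ngroups

print(distinct, n_groups)
```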

### Step 9. Which name occurs most often?

In [60]:
# idxmax is a built-in method of pd.Series
# In IPython or Jupyter, enter pd.Series.idxmax? to view its help
print("--- most frequent name ----")
print(names.Count.idxmax())

# Explanation:
# the index of names holds the names, and its single column (Count) holds the number of occurrences
# idxmax() returns the index label of the largest Count value
print("---- index of names -------")
print(names.index)
print("---- columns of names -------")
print(names.columns)
# OR

# Translator's note: a more familiar alternative is to
# sort with sort_values and then read the first row
names.sort_values("Count", ascending=False).head(1).index[0]
# names[names.Count == names.Count.max()]


--- most frequent name ----
Jacob
---- index of names -------
Index(['Aaban', 'Aadan', 'Aadarsh', 'Aaden', 'Aadhav', 'Aadhya', 'Aadi',
       'Aadin', 'Aadit', 'Aaditya',
       ...
       'Zymire', 'Zyon', 'Zyonna', 'Zyquan', 'Zyquavious', 'Zyra', 'Zyrah',
       'Zyren', 'Zyria', 'Zyriah'],
      dtype='object', name='Name', length=17632)
---- columns of names -------
Index(['Count'], dtype='object')


'Jacob'
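
One caveat: when several labels share the maximum, `idxmax` returns only the first of them, while the boolean-mask form keeps every tied label. The sketch below uses a hypothetical Series called `counts` with a deliberate tie.

```python
import pandas as pd

# Hypothetical counts in which two names tie for the maximum
counts = pd.Series({"Ava": 10, "Liam": 25, "Mia": 25}, name="Count")

# idxmax returns the first label that holds the maximum value
first_max = counts.idxmax()

# The boolean mask keeps all labels that hold the maximum value
all_max = counts[counts == counts.max()].index.tolist()

print(first_max)  # Liam
print(all_max)    # ['Liam', 'Mia']
```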

### Step 10. How many different names have the fewest occurrences?

In [24]:
len(names[names.Count == names.Count.min()])

2578
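
An equivalent way to get this number is to sum the boolean mask, since `True` counts as 1. That avoids building the filtered frame when only the count matters. The sketch below uses a hypothetical four-name Series called `counts`.

```python
import pandas as pd

# Hypothetical totals in which two names share the minimum
counts = pd.Series(
    {"Aaban": 5, "Aadan": 23, "Aadarsh": 5, "Aaden": 3426}, name="Count"
)

# True where a name has the minimum count
is_min = counts == counts.min()

# Summing the mask counts the True values
n_min = int(is_min.sum())
print(n_min)  # 2
```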

### Step 11. Which names have the median number of occurrences?

In [25]:
names[names.Count == names.Count.median()]

| Name | Count |
|---|---|
| Aishani | 49 |
| Alara | 49 |
| Alysse | 49 |
| Ameir | 49 |
| Anely | 49 |
| ... | ... |
| Sriram | 49 |
| Trinton | 49 |
| Vita | 49 |
| Yoni | 49 |
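
The equality test works here because the median, 49, is a value that occurs in the data. With an even number of values the median is the average of the two middle ones and may match no row at all. The sketch below shows that case on a hypothetical Series called `counts` and one possible fallback, taking the names closest to the median.

```python
import pandas as pd

# Hypothetical totals: four values, so the median is the mean of 10 and 20
counts = pd.Series({"A": 5, "B": 10, "C": 20, "D": 40}, name="Count")

med = counts.median()

# No name has exactly the median count, so this selection is empty
exact = counts[counts == med]

# Fallback: keep the names whose count is nearest to the median
distance = (counts - med).abs()
closest = counts[distance == distance.min()]

print(med)
print(closest)
```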


### Step 12. What is the standard deviation of the name counts?

In [57]:
names.Count.std()

11006.069468
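
By default `Series.std` computes the sample standard deviation (`ddof=1`, dividing by n - 1), whereas NumPy's `np.std` defaults to the population form (`ddof=0`). The sketch below compares the two on a small hypothetical Series called `counts`, chosen so that the population value comes out as a round number.

```python
import numpy as np
import pandas as pd

# Hypothetical values: mean is 5 and squared deviations sum to 32
counts = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

# pandas default: sample standard deviation, sqrt(32 / 7)
sample_std = counts.std()

# Population standard deviation, sqrt(32 / 8) = 2.0
population_std = counts.std(ddof=0)

# NumPy's default matches the population form
numpy_std = np.std(counts.to_numpy())

print(sample_std, population_std, numpy_std)
```

For a dataset as large as this one the difference between the two forms is negligible, but it can matter on small samples.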

### Step 13. Get summary statistics for the name counts.

In [27]:
names.describe()

| | Count |
|---|---|
| count | 17632.0 |
| mean | 2008.932169 |
| std | 11006.069468 |
| min | 5.0 |
| 25% | 11.0 |
| 50% | 49.0 |
| 75% | 337.0 |
| max | 242874.0 |
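
The mean (about 2009) is far above the median (49), which suggests a strongly right-skewed distribution in which a few very popular names dominate. For such data it can help to request extra percentiles through the `percentiles` argument of `describe`. The sketch below does this on a hypothetical six-value Series called `counts`.

```python
import pandas as pd

# Hypothetical skewed counts, loosely echoing the values in the summary
counts = pd.Series([5, 5, 11, 49, 337, 242874], name="Count")

# Request the 10th, 50th and 90th percentiles instead of the default quartiles
summary = counts.describe(percentiles=[0.1, 0.5, 0.9])
print(summary)
```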
