
## Exploring Census Income Data

---

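In this challenge, you will use Pandas to explore the data and answer several questions about the [<i class="fa fa-external-link-square" aria-hidden="true"> Adult dataset</i>](https://archive.ics.uci.edu/ml/datasets/Adult). The Adult dataset is built from census income data; it contains a number of features, and its target is categorical.

First, we load the dataset and look at summary statistics for its numeric columns.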

In [13]:
import warnings
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
warnings.filterwarnings('ignore')

In [14]:
data = pd.read_csv(
    'adult.data.csv')
data.describe()

|  | age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week |
|---|---|---|---|---|---|---|
| count | 32561.0 | 32561.0 | 32561.0 | 32561.0 | 32561.0 | 32561.0 |
| mean | 38.581647 | 189778.4 | 10.080679 | 1077.648844 | 87.30383 | 40.437456 |
| std | 13.640433 | 105550.0 | 2.57272 | 7385.292085 | 402.960219 | 12.347429 |
| min | 17.0 | 12285.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 25% | 28.0 | 117827.0 | 9.0 | 0.0 | 0.0 | 40.0 |
| 50% | 37.0 | 178356.0 | 10.0 | 0.0 | 0.0 | 40.0 |
| 75% | 48.0 | 237051.0 | 12.0 | 0.0 | 0.0 | 45.0 |
| max | 90.0 | 1484705.0 | 16.0 | 99999.0 | 4356.0 | 99.0 |


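The leading columns of the DataFrame are features, and the last column, `salary`, is the target. From here on, you need to add the code required to answer each challenge question yourself.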

---

<i class="fa fa-question-circle" aria-hidden="true"> Question:</i> How many men and women are in the dataset?

In [15]:
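# Fill in code to answer the question. The system cannot grade automatically,
# so check your result against the reference answers linked at the end.
male = data[data['sex'] == 'Male'].shape[0]  # number of men
female = data[data['sex'] == 'Female'].shape[0]  # number of women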
print(male, female)

21790 10771


In [16]:
data['sex'].value_counts()

Male      21790
Female    10771
Name: sex, dtype: int64

<i class="fa fa-question-circle" aria-hidden="true"> Question:</i> What is the average age of women in the dataset?

In [17]:
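female = data[data['sex'] == 'Female']
# numeric_only=True skips the text columns, which recent pandas versions
# refuse to average; the `age` row of the output answers the question.
female.mean(numeric_only=True)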

age                   36.858230
fnlwgt            185746.311206
education-num         10.035744
capital-gain         568.410547
capital-loss          61.187633
hours-per-week        36.410361
dtype: float64
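
The cell above averages every numeric column, although only `age` is needed. A minimal sketch of selecting that one column directly with `.loc` follows; the `toy` frame and its values are hypothetical stand-ins for `adult.data.csv`, not real rows.

```python
import pandas as pd

# Hypothetical toy rows standing in for adult.data.csv; the values are made up.
toy = pd.DataFrame({
    'sex': ['Female', 'Male', 'Female', 'Male'],
    'age': [30, 40, 50, 20],
})

# Boolean mask selects the rows, the column label selects `age`,
# so the result is a Series of ages for women only.
female_mean_age = toy.loc[toy['sex'] == 'Female', 'age'].mean()
print(female_mean_age)
```

On the real data, the same expression with `data` in place of `toy` should return a single number instead of a whole Series of column means.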

<i class="fa fa-question-circle" aria-hidden="true"> Question:</i> What proportion of people in the dataset are German citizens?

In [18]:
data['native-country'].value_counts(normalize=True)['Germany']

0.004207487485028101
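
An alternative way to get a single proportion is to average a boolean mask, since `True` counts as 1 and `False` as 0. The sketch below uses a hypothetical `toy` frame with invented rows; only the column name `native-country` comes from the dataset.

```python
import pandas as pd

# Hypothetical toy rows; the real column has many more countries.
toy = pd.DataFrame({
    'native-country': ['Germany', 'United-States', 'Mexico',
                       'Germany', 'United-States'],
})

# The mean of a boolean Series is the fraction of True values.
german_share = (toy['native-country'] == 'Germany').mean()
print(german_share)
```

This avoids computing the share of every country when only one is of interest.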

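<i class="fa fa-question-circle" aria-hidden="true"> Question:</i> What are the mean and standard deviation of age for people earning more than 50K per year (`>50K`) and for those earning less (`<=50K`)?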

In [19]:
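over = data[data['salary'] == '>50K']
over.describe()  # not rendered: a notebook cell shows only its last expression
below = data[data['salary'] == '<=50K']
below.describe()  # rendered below: statistics for the <=50K group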

|  | age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week |
|---|---|---|---|---|---|---|
| count | 24720.0 | 24720.0 | 24720.0 | 24720.0 | 24720.0 | 24720.0 |
| mean | 36.783738 | 190340.9 | 9.595065 | 148.752468 | 53.142921 | 38.84021 |
| std | 14.020088 | 106482.3 | 2.436147 | 963.139307 | 310.755769 | 12.318995 |
| min | 17.0 | 12285.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 25% | 25.0 | 117606.0 | 9.0 | 0.0 | 0.0 | 35.0 |
| 50% | 34.0 | 179465.0 | 9.0 | 0.0 | 0.0 | 40.0 |
| 75% | 46.0 | 239023.0 | 10.0 | 0.0 | 0.0 | 40.0 |
| max | 90.0 | 1484705.0 | 16.0 | 41310.0 | 4356.0 | 99.0 |
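
The table above covers only one income group and every numeric column. One possible way to get just the age mean and standard deviation for both groups at once is `groupby` with `agg`. The `toy` frame here is hypothetical and its numbers are invented for illustration.

```python
import pandas as pd

# Hypothetical toy rows standing in for adult.data.csv.
toy = pd.DataFrame({
    'salary': ['>50K', '<=50K', '>50K', '<=50K'],
    'age': [40, 20, 50, 30],
})

# One row per salary class, one column per statistic.
# pandas' std uses the sample standard deviation (ddof=1) by default.
age_stats = toy.groupby('salary')['age'].agg(['mean', 'std'])
print(age_stats)
```

Applied to `data`, this should give a two-row table with a `mean` and a `std` column, one row per salary class.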


<i class="fa fa-question-circle" aria-hidden="true"> Question:</i> Have all people earning more than 50K per year completed at least high school?

In [20]:
over['education'].value_counts()  # No: some high earners stopped before finishing high school (e.g. 10th, 11th)

Bachelors       2221
HS-grad         1675
Some-college    1387
Masters          959
Prof-school      423
Assoc-voc        361
Doctorate        306
Assoc-acdm       265
10th              62
11th              60
7th-8th           40
12th              33
9th               27
5th-6th           16
1st-4th            6
Name: education, dtype: int64
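
Reading the answer off a frequency table works, but the check can also be expressed as a single boolean. The sketch below assumes that the listed labels are the ones that count as "high school or above"; that grouping is an assumption for illustration, and the `toy` rows are made up.

```python
import pandas as pd

# Hypothetical toy rows standing in for adult.data.csv.
toy = pd.DataFrame({
    'salary': ['>50K', '>50K', '<=50K', '>50K'],
    'education': ['Bachelors', '10th', 'HS-grad', 'HS-grad'],
})

# Assumption: these education labels mean "high school or above".
high_school_or_above = {
    'HS-grad', 'Some-college', 'Assoc-voc', 'Assoc-acdm',
    'Bachelors', 'Masters', 'Prof-school', 'Doctorate',
}

high_earners = toy[toy['salary'] == '>50K']
# True only if every high earner has one of the labels above.
all_finished_high_school = bool(
    high_earners['education'].isin(high_school_or_above).all()
)
print(all_finished_high_school)
```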

<i class="fa fa-question-circle" aria-hidden="true"> Question:</i> Use `groupby` and `describe` to summarize the age distribution for each combination of race and sex.

In [22]:
race = data.groupby(by=['race', 'sex'])[['age']].describe()
print(race)

                               age                                          \
                             count       mean        std   min   25%   50%   
race               sex                                                       
Amer-Indian-Eskimo Female    119.0  37.117647  13.114991  17.0  27.0  36.0   
                   Male      192.0  37.208333  12.049563  17.0  28.0  35.0   
Asian-Pac-Islander Female    346.0  35.089595  12.300845  17.0  25.0  33.0   
                   Male      693.0  39.073593  12.883944  18.0  29.0  37.0   
Black              Female   1555.0  37.854019  12.637197  17.0  28.0  37.0   
                   Male     1569.0  37.682600  12.882612  17.0  27.0  36.0   
Other              Female    109.0  31.678899  11.631599  17.0  23.0  29.0   
                   Male      162.0  34.654321  11.355531  17.0  26.0  32.0   
White              Female   8642.0  36.811618  14.329093  17.0  25.0  35.0   
                   Male    19174.0  39.652498  13.436029  17.0  

<i class="fa fa-question-circle" aria-hidden="true"> Question:</i> Count how many high-income men are married and how many are unmarried (including divorced and separated).

In [10]:
male_over = data[(data['salary'] == '>50K') & (data['sex'] == 'Male')]
male_over['marital-status'].value_counts()['Divorced']

284
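
The cell above reports only the `Divorced` count, while the question asks for married versus unmarried totals. A minimal sketch of that split follows, assuming that every "married" category in `marital-status` starts with the prefix `Married` and that everything else counts as unmarried. The `toy` frame is hypothetical.

```python
import pandas as pd

# Hypothetical toy rows standing in for adult.data.csv.
toy = pd.DataFrame({
    'sex': ['Male', 'Male', 'Male', 'Female', 'Male'],
    'salary': ['>50K', '>50K', '>50K', '>50K', '<=50K'],
    'marital-status': ['Married-civ-spouse', 'Divorced', 'Never-married',
                       'Married-civ-spouse', 'Separated'],
})

male_high = toy[(toy['sex'] == 'Male') & (toy['salary'] == '>50K')]

# Assumption: married categories share the 'Married' prefix.
is_married = male_high['marital-status'].str.startswith('Married')
married_count = int(is_married.sum())
unmarried_count = int((~is_married).sum())
print(married_count, unmarried_count)
```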

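<i class="fa fa-question-circle" aria-hidden="true"> Question:</i> Find the maximum number of hours worked per week and how many people work that many hours, then compute the share of that group earning more than 50K.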

In [11]:
# Longest working week recorded in the dataset
max_hours = data['hours-per-week'].max()
# Everyone who works that many hours per week
longest = data[data['hours-per-week'] == max_hours]
# Share of each salary class within that group
longest['salary'].value_counts(normalize=True)

<=50K    0.705882
>50K     0.294118
Name: salary, dtype: float64
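
The output above shows only the salary shares; the maximum itself and the head count are not printed. The sketch below reports all three values on a hypothetical `toy` frame with invented rows.

```python
import pandas as pd

# Hypothetical toy rows standing in for adult.data.csv.
toy = pd.DataFrame({
    'hours-per-week': [40, 99, 99, 60, 99, 99],
    'salary': ['<=50K', '>50K', '<=50K', '<=50K', '<=50K', '<=50K'],
})

max_hours = toy['hours-per-week'].max()
longest = toy[toy['hours-per-week'] == max_hours]
n_people = len(longest)
# Fraction of the longest-hours group that earns more than 50K.
high_share = (longest['salary'] == '>50K').mean()
print(max_hours, n_people, high_share)
```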

<i class="fa fa-question-circle" aria-hidden="true"> Question:</i> For each country, compute the average weekly working hours of people earning more than 50K and of those earning less.

In [12]:
over = data[data['salary'] == '>50K']
below = data[data['salary'] == '<=50K']
over_describe = over.groupby(by=['native-country'])[['hours-per-week']].mean()
below_describe = below.groupby(by=['native-country'])[['hours-per-week']].mean()
print(over_describe)
print(below_describe)

                    hours-per-week
native-country                    
?                        45.547945
Cambodia                 40.000000
Canada                   45.641026
China                    38.900000
Columbia                 50.000000
Cuba                     42.440000
Dominican-Republic       47.000000
Ecuador                  48.750000
El-Salvador              45.000000
England                  44.533333
France                   50.750000
Germany                  44.977273
Greece                   50.625000
Guatemala                36.666667
Haiti                    42.750000
Honduras                 60.000000
Hong                     45.000000
Hungary                  50.000000
India                    46.475000
Iran                     47.500000
Ireland                  48.000000
Italy                    45.400000
Jamaica                  41.100000
Japan                    47.958333
Laos                     40.000000
Mexico                   46.575758
Nicaragua           
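
Printing two separate tables makes the groups hard to compare. One possible alternative is to group by both keys and `unstack` the salary level, which puts the two averages side by side. The `toy` frame is hypothetical and its numbers are invented.

```python
import pandas as pd

# Hypothetical toy rows standing in for adult.data.csv.
toy = pd.DataFrame({
    'native-country': ['Japan', 'Japan', 'Japan', 'Mexico', 'Mexico'],
    'salary': ['>50K', '<=50K', '<=50K', '>50K', '<=50K'],
    'hours-per-week': [50, 40, 30, 45, 40],
})

# Rows: countries. Columns: salary classes. Cells: mean weekly hours.
mean_hours = (
    toy.groupby(['native-country', 'salary'])['hours-per-week']
    .mean()
    .unstack('salary')
)
print(mean_hours)
```

A country with no rows in one salary class would show `NaN` in that column, which makes the missing group visible rather than silently dropping the country.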

---

<div style="background-color: #e6e6e6; margin-bottom: 10px; padding: 1%; border: 1px solid #ccc; border-radius: 6px;text-align: center;"><a href="https://nbviewer.jupyter.org/github/shiyanlou/mlcourse-answers/tree/master/" title="Challenge reference answers"><i class="fa fa-file-code-o" aria-hidden="true"> View the challenge reference answers</i></a></div>