<img src="../../img/ods_stickers.jpg" />

## Exploring Census Income Data

---

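In this challenge, you will use Pandas to explore data and answer several questions about the [<i class="fa fa-external-link-square" aria-hidden="true"> Adult data set</i>](https://archive.ics.uci.edu/ml/datasets/Adult). The Adult data set contains census income records with a number of features and a categorical target.

First, we load the data set and preview it.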

In [40]:
import warnings
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
warnings.filterwarnings('ignore')

In [41]:
data = pd.read_csv('adult.data.csv')
data.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


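All leading columns of the DataFrame are features, and the last column, `salary`, is the target. From here on, add whatever code you need to answer each challenge question.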

---

<i class="fa fa-question-circle" aria-hidden="true"> Question:</i> How many men and women are in the data set?

In [42]:
# Fill in code to answer the question. There is no automatic grading; compare your result with the reference answer linked at the end.
data[data['sex']=='Male'].shape[0]


21790

In [43]:
data[data['sex']=='Female'].shape[0]

10771

In [44]:
data['sex'].value_counts()

Male      21790
Female    10771
Name: sex, dtype: int64

<i class="fa fa-question-circle" aria-hidden="true"> Question:</i> What is the average age of women in the data set?

In [45]:
female = data[data['sex'] == 'Female']
# numeric_only=True skips the text columns; the answer is the `age` row
female.mean(numeric_only=True)

age                   36.858230
fnlwgt            185746.311206
education-num         10.035744
capital-gain         568.410547
capital-loss          61.187633
hours-per-week        36.410361
dtype: float64
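
The question only needs the `age` column, so the row filter and the column selection can be combined in one `.loc` call. A minimal sketch, assuming a small hypothetical frame `toy` in place of the real data:

```python
import pandas as pd

# Hypothetical toy rows, not taken from the Adult data set.
toy = pd.DataFrame({
    'sex': ['Female', 'Male', 'Female', 'Male'],
    'age': [30, 40, 50, 60],
})

# Keep only the rows for women and only the `age` column, then average.
female_mean_age = toy.loc[toy['sex'] == 'Female', 'age'].mean()
print(female_mean_age)
```

On the real data the same expression with `data` in place of `toy` should reproduce the `age` value shown above.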

<i class="fa fa-question-circle" aria-hidden="true"> Question:</i> What proportion of people in the data set are German citizens?

In [46]:
data['native-country'].value_counts(normalize=True)

United-States                 0.895857
Mexico                        0.019748
?                             0.017905
Philippines                   0.006081
Germany                       0.004207
Canada                        0.003716
Puerto-Rico                   0.003501
El-Salvador                   0.003255
India                         0.003071
Cuba                          0.002918
England                       0.002764
Jamaica                       0.002488
South                         0.002457
China                         0.002303
Italy                         0.002242
Dominican-Republic            0.002150
Vietnam                       0.002058
Guatemala                     0.001966
Japan                         0.001904
Poland                        0.001843
Columbia                      0.001812
Taiwan                        0.001566
Haiti                         0.001351
Iran                          0.001321
Portugal                      0.001136
Nicaragua                

In [47]:
data['native-country'].value_counts(normalize=True)['Germany']

0.004207487485028101
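
An alternative that avoids building the full frequency table: the mean of a boolean mask equals the share of `True` values. A minimal sketch on a hypothetical frame `toy`:

```python
import pandas as pd

# Hypothetical toy rows, not taken from the Adult data set.
toy = pd.DataFrame({
    'native-country': ['Germany', 'United-States', 'United-States', 'Mexico'],
})

# True counts as 1 and False as 0, so the mean is the proportion of matches.
german_share = (toy['native-country'] == 'Germany').mean()
print(german_share)
```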

<i class="fa fa-question-circle" aria-hidden="true"> Question:</i> What are the mean and standard deviation of age for people earning more than 50K per year and for those earning 50K or less?

In [48]:
over = data[data['salary']=='>50K']
over.describe()

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week
count,7841.0,7841.0,7841.0,7841.0,7841.0,7841.0
mean,44.249841,188005.0,11.611657,4006.142456,195.00153,45.473026
std,10.519028,102541.8,2.385129,14570.378951,595.487574,11.012971
min,19.0,14878.0,2.0,0.0,0.0,1.0
25%,36.0,119101.0,10.0,0.0,0.0,40.0
50%,44.0,176101.0,12.0,0.0,0.0,40.0
75%,51.0,230959.0,13.0,0.0,0.0,50.0
max,90.0,1226583.0,16.0,99999.0,3683.0,99.0


In [49]:
below = data[data['salary']=='<=50K']
below.describe()

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week
count,24720.0,24720.0,24720.0,24720.0,24720.0,24720.0
mean,36.783738,190340.9,9.595065,148.752468,53.142921,38.84021
std,14.020088,106482.3,2.436147,963.139307,310.755769,12.318995
min,17.0,12285.0,1.0,0.0,0.0,1.0
25%,25.0,117606.0,9.0,0.0,0.0,35.0
50%,34.0,179465.0,9.0,0.0,0.0,40.0
75%,46.0,239023.0,10.0,0.0,0.0,40.0
max,90.0,1484705.0,16.0,41310.0,4356.0,99.0
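
The two `describe` tables contain the answer in their `mean` and `std` rows for `age`. If only those two numbers are needed, a `groupby` with `agg` returns them for both income groups at once. A minimal sketch on a hypothetical frame `toy`:

```python
import pandas as pd

# Hypothetical toy rows, not taken from the Adult data set.
toy = pd.DataFrame({
    'salary': ['>50K', '>50K', '<=50K', '<=50K'],
    'age': [40, 50, 20, 30],
})

# One row per salary group, one column per requested statistic.
age_stats = toy.groupby('salary')['age'].agg(['mean', 'std'])
print(age_stats)
```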


<i class="fa fa-question-circle" aria-hidden="true"> Question:</i> Do all people earning more than 50K per year have at least a high school education?

In [50]:
over['education'].value_counts()

Bachelors       2221
HS-grad         1675
Some-college    1387
Masters          959
Prof-school      423
Assoc-voc        361
Doctorate        306
Assoc-acdm       265
10th              62
11th              60
7th-8th           40
12th              33
9th               27
5th-6th           16
1st-4th            6
Name: education, dtype: int64
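
The counts above include levels such as `10th` and `9th`, so the answer is no. The same check can be expressed as a single boolean. The sketch below assumes that the labels in `hs_or_above` are the ones that count as "high school or above"; that list and the frame `toy` are illustrative assumptions, not part of the original notebook:

```python
import pandas as pd

# Assumption: these education labels count as high school or above.
hs_or_above = ['HS-grad', 'Some-college', 'Assoc-voc', 'Assoc-acdm',
               'Bachelors', 'Masters', 'Prof-school', 'Doctorate']

# Hypothetical toy rows, not taken from the Adult data set.
toy = pd.DataFrame({
    'salary': ['>50K', '>50K', '<=50K'],
    'education': ['Bachelors', '10th', 'HS-grad'],
})

rich = toy[toy['salary'] == '>50K']
# True only if every high earner has one of the listed education levels.
all_hs = rich['education'].isin(hs_or_above).all()
print(all_hs)
```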

<i class="fa fa-question-circle" aria-hidden="true"> Question:</i> Use `groupby` and `describe` to summarize the age distribution for each combination of race and sex.

In [51]:
rate = data.groupby(by=['race','sex'])[['age']].describe()


In [52]:
print(rate)

                               age                                          \
                             count       mean        std   min   25%   50%   
race               sex                                                       
Amer-Indian-Eskimo Female    119.0  37.117647  13.114991  17.0  27.0  36.0   
                   Male      192.0  37.208333  12.049563  17.0  28.0  35.0   
Asian-Pac-Islander Female    346.0  35.089595  12.300845  17.0  25.0  33.0   
                   Male      693.0  39.073593  12.883944  18.0  29.0  37.0   
Black              Female   1555.0  37.854019  12.637197  17.0  28.0  37.0   
                   Male     1569.0  37.682600  12.882612  17.0  27.0  36.0   
Other              Female    109.0  31.678899  11.631599  17.0  23.0  29.0   
                   Male      162.0  34.654321  11.355531  17.0  26.0  32.0   
White              Female   8642.0  36.811618  14.329093  17.0  25.0  35.0   
                   Male    19174.0  39.652498  13.436029  17.0  

<i class="fa fa-question-circle" aria-hidden="true"> Question:</i> Among men with high income, count how many are married and how many are not (including divorced and separated).

In [53]:
male_over = data[(data['salary']=='>50K')&(data['sex']=='Male')]
male_over['marital-status'].value_counts()

Married-civ-spouse       5938
Never-married             325
Divorced                  284
Separated                  49
Widowed                    39
Married-spouse-absent      23
Married-AF-spouse           4
Name: marital-status, dtype: int64

In [54]:
male_over['marital-status'].value_counts()['Divorced']

284
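
The cells above list each marital status separately. To get the two totals the question asks for, one option is to treat every status that starts with `Married` as married and everything else as not married. This grouping rule is an assumption (it counts `Married-spouse-absent` as married), and `toy` is a hypothetical frame:

```python
import pandas as pd

# Hypothetical toy rows, not taken from the Adult data set.
toy = pd.DataFrame({
    'sex': ['Male', 'Male', 'Male', 'Female', 'Male'],
    'salary': ['>50K', '>50K', '>50K', '>50K', '<=50K'],
    'marital-status': ['Married-civ-spouse', 'Divorced', 'Married-AF-spouse',
                       'Never-married', 'Separated'],
})

rich_men = toy[(toy['salary'] == '>50K') & (toy['sex'] == 'Male')]
# Assumption: any status beginning with "Married" counts as married.
is_married = rich_men['marital-status'].str.startswith('Married')
n_married = int(is_married.sum())
n_not_married = int((~is_married).sum())
print(n_married, n_not_married)
```

Under the same rule, the `value_counts` output above adds up to 5938 + 23 + 4 = 5965 married and 325 + 284 + 49 + 39 = 697 not married.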

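<i class="fa fa-question-circle" aria-hidden="true"> Question:</i> Find the maximum number of hours worked per week in the data set and the number of people who work that many hours, then compute the proportion of that group earning more than 50K.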

In [55]:
longest = data['hours-per-week'].describe()
print(longest)

count    32561.000000
mean        40.437456
std         12.347429
min          1.000000
25%         40.000000
50%         40.000000
75%         45.000000
max         99.000000
Name: hours-per-week, dtype: float64


In [56]:
max_hours = longest['max']
# Use a separate name so the describe() result in `longest` is not overwritten
longest_workers = data[data['hours-per-week'] == max_hours]
longest_workers['salary'].value_counts(normalize=True)

<=50K    0.705882
>50K     0.294118
Name: salary, dtype: float64
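
The question also asks how many people work the maximum number of hours, which the cells above do not print. A minimal sketch that returns all three values, using a hypothetical frame `toy`:

```python
import pandas as pd

# Hypothetical toy rows, not taken from the Adult data set.
toy = pd.DataFrame({
    'hours-per-week': [40, 99, 99, 60, 99],
    'salary': ['<=50K', '>50K', '<=50K', '>50K', '<=50K'],
})

max_hours = toy['hours-per-week'].max()
hardest = toy[toy['hours-per-week'] == max_hours]
n_people = len(hardest)                            # people working the maximum
share_over = (hardest['salary'] == '>50K').mean()  # share of them above 50K
print(max_hours, n_people, share_over)
```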

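<i class="fa fa-question-circle" aria-hidden="true"> Question:</i> For each country, compute the average weekly working hours of people earning more than 50K and of those earning 50K or less.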

In [57]:
over = data[data['salary'] == '>50K']
below = data[data['salary'] == '<=50K']
# Mean weekly hours per country for people earning more than 50K
over_describe = over.groupby(by=['native-country'])[['hours-per-week']].mean()
print(over_describe)

                    hours-per-week
native-country                    
?                        45.547945
Cambodia                 40.000000
Canada                   45.641026
China                    38.900000
Columbia                 50.000000
Cuba                     42.440000
Dominican-Republic       47.000000
Ecuador                  48.750000
El-Salvador              45.000000
England                  44.533333
France                   50.750000
Germany                  44.977273
Greece                   50.625000
Guatemala                36.666667
Haiti                    42.750000
Honduras                 60.000000
Hong                     45.000000
Hungary                  50.000000
India                    46.475000
Iran                     47.500000
Ireland                  48.000000
Italy                    45.400000
Jamaica                  41.100000
Japan                    47.958333
Laos                     40.000000
Mexico                   46.575758
Nicaragua           

In [58]:
below_describe = below.groupby(by=['native-country'])[['hours-per-week']].mean()
print(below_describe)

                            hours-per-week
native-country                            
?                                40.164760
Cambodia                         41.416667
Canada                           37.914634
China                            37.381818
Columbia                         38.684211
Cuba                             37.985714
Dominican-Republic               42.338235
Ecuador                          38.041667
El-Salvador                      36.030928
England                          40.483333
France                           41.058824
Germany                          39.139785
Greece                           41.809524
Guatemala                        39.360656
Haiti                            36.325000
Holand-Netherlands               40.000000
Honduras                         34.333333
Hong                             39.142857
Hungary                          31.300000
India                            38.233333
Iran                             41.440000
Ireland    
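
The two printed tables can also be produced as one table with a column per income group, which makes the countries easier to compare side by side. A minimal sketch using a two-key `groupby` followed by `unstack`, on a hypothetical frame `toy`:

```python
import pandas as pd

# Hypothetical toy rows, not taken from the Adult data set.
toy = pd.DataFrame({
    'native-country': ['Japan', 'Japan', 'Cuba', 'Cuba', 'Cuba'],
    'salary': ['>50K', '<=50K', '>50K', '<=50K', '<=50K'],
    'hours-per-week': [50, 40, 45, 30, 40],
})

# Group by country and salary, average the hours, then pivot the
# salary level into columns so each country occupies a single row.
hours_table = (
    toy.groupby(['native-country', 'salary'])['hours-per-week']
       .mean()
       .unstack()
)
print(hours_table)
```

A country with no rows in one income group would get `NaN` in that column.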

---

<div style="background-color: #e6e6e6; margin-bottom: 10px; padding: 1%; border: 1px solid #ccc; border-radius: 6px;text-align: center;"><a href="https://nbviewer.jupyter.org/github/shiyanlou/mlcourse-answers/tree/master/" title="Challenge reference answers"><i class="fa fa-file-code-o" aria-hidden="true"> View the challenge reference answers</i></a></div>