<img src="../../img/ods_stickers.jpg" />

## Exploring Census Income Data

---

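In this challenge, you will use Pandas to explore the data and answer several questions about the [<i class="fa fa-external-link-square" aria-hidden="true"> Adult dataset</i>](https://archive.ics.uci.edu/ml/datasets/Adult). The Adult dataset is census income data: it contains multiple features, and its target value is categorical.

First, we load and preview the dataset.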

In [1]:
import warnings
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
warnings.filterwarnings('ignore')

In [2]:
data = pd.read_csv('adult.data.csv')
data.head()

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


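All columns of the DataFrame except the last are features; the final `salary` column is the target. From here on, you need to add the code required to answer each challenge question yourself.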

---

In [3]:
data.describe()

| | age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week |
|---|---|---|---|---|---|---|
| count | 32561.0 | 32561.0 | 32561.0 | 32561.0 | 32561.0 | 32561.0 |
| mean | 38.581647 | 189778.4 | 10.080679 | 1077.648844 | 87.30383 | 40.437456 |
| std | 13.640433 | 105550.0 | 2.57272 | 7385.292085 | 402.960219 | 12.347429 |
| min | 17.0 | 12285.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 25% | 28.0 | 117827.0 | 9.0 | 0.0 | 0.0 | 40.0 |
| 50% | 37.0 | 178356.0 | 10.0 | 0.0 | 0.0 | 40.0 |
| 75% | 48.0 | 237051.0 | 12.0 | 0.0 | 0.0 | 45.0 |
| max | 90.0 | 1484705.0 | 16.0 | 99999.0 | 4356.0 | 99.0 |


<i class="fa fa-question-circle" aria-hidden="true"> Question:</i> How many men and women are in the dataset?

In [4]:
male_counts = data[data['sex'] == 'Male'].shape[0]
female_counts = data[data['sex'] == 'Female'].shape[0]
print(male_counts, female_counts)

21790 10771


In [5]:
data['sex'].value_counts()

Male      21790
Female    10771
Name: sex, dtype: int64

<i class="fa fa-question-circle" aria-hidden="true"> Question:</i> What is the average age of the women in the dataset?

In [6]:
data[data['sex']=='Female']['age'].mean()

36.85823043357163
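
The chained indexing above works, but the row filter and the column selection can also be combined in a single `.loc` call. The following is a minimal sketch on a made-up toy frame (the values are invented for illustration and are not taken from the Adult dataset):

```python
import pandas as pd

# Toy stand-in for the Adult data; the values are invented for illustration.
toy = pd.DataFrame({
    'sex': ['Male', 'Female', 'Female', 'Male'],
    'age': [39, 28, 40, 50],
})

# Boolean mask for the rows, column label for the column, in one .loc call.
female_mean_age = toy.loc[toy['sex'] == 'Female', 'age'].mean()
print(female_mean_age)
```

Swapping `toy` for `data` should give the same value as the cell above.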

<i class="fa fa-question-circle" aria-hidden="true"> Question:</i> What proportion of the people in the dataset are German citizens?

In [7]:
data['native-country'].value_counts(normalize=True)['Germany']

0.004207487485028101
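
A proportion can also be read as the mean of a boolean mask, which avoids building the full frequency table. This sketch uses a hypothetical toy frame and, like the cell above, treats `native-country` as a stand-in for citizenship:

```python
import pandas as pd

# Toy stand-in; the country values are invented for illustration.
toy = pd.DataFrame({
    'native-country': ['United-States', 'Germany', 'Cuba', 'United-States'],
})

# True counts as 1 and False as 0, so the mean of the mask is the share of matches.
german_share = (toy['native-country'] == 'Germany').mean()
print(german_share)
```

The two approaches agree as long as the column has no missing values: `value_counts` drops `NaN` by default, whereas the mask counts every row in the denominator.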

<i class="fa fa-question-circle" aria-hidden="true"> Question:</i> What are the mean and standard deviation of age for people earning more than 50K per year and for those earning 50K or less?

In [8]:
data[data['salary']=='>50K'].describe()

| | age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week |
|---|---|---|---|---|---|---|
| count | 7841.0 | 7841.0 | 7841.0 | 7841.0 | 7841.0 | 7841.0 |
| mean | 44.249841 | 188005.0 | 11.611657 | 4006.142456 | 195.00153 | 45.473026 |
| std | 10.519028 | 102541.8 | 2.385129 | 14570.378951 | 595.487574 | 11.012971 |
| min | 19.0 | 14878.0 | 2.0 | 0.0 | 0.0 | 1.0 |
| 25% | 36.0 | 119101.0 | 10.0 | 0.0 | 0.0 | 40.0 |
| 50% | 44.0 | 176101.0 | 12.0 | 0.0 | 0.0 | 40.0 |
| 75% | 51.0 | 230959.0 | 13.0 | 0.0 | 0.0 | 50.0 |
| max | 90.0 | 1226583.0 | 16.0 | 99999.0 | 3683.0 | 99.0 |


In [9]:
data[data['salary']=='<=50K'].describe()

| | age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week |
|---|---|---|---|---|---|---|
| count | 24720.0 | 24720.0 | 24720.0 | 24720.0 | 24720.0 | 24720.0 |
| mean | 36.783738 | 190340.9 | 9.595065 | 148.752468 | 53.142921 | 38.84021 |
| std | 14.020088 | 106482.3 | 2.436147 | 963.139307 | 310.755769 | 12.318995 |
| min | 17.0 | 12285.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 25% | 25.0 | 117606.0 | 9.0 | 0.0 | 0.0 | 35.0 |
| 50% | 34.0 | 179465.0 | 9.0 | 0.0 | 0.0 | 40.0 |
| 75% | 46.0 | 239023.0 | 10.0 | 0.0 | 0.0 | 40.0 |
| max | 90.0 | 1484705.0 | 16.0 | 41310.0 | 4356.0 | 99.0 |
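
The two `describe()` calls report every numeric column, although the question only asks about age. A more targeted alternative is to group by `salary` and aggregate just the two statistics needed. This is a minimal sketch on an invented toy frame, not the Adult data itself:

```python
import pandas as pd

# Toy stand-in; ages and salary labels are invented for illustration.
toy = pd.DataFrame({
    'salary': ['>50K', '<=50K', '>50K', '<=50K'],
    'age': [40, 20, 50, 30],
})

# One row per salary group, with only the mean and (sample) standard deviation of age.
age_stats = toy.groupby('salary')['age'].agg(['mean', 'std'])
print(age_stats)
```

Run against `data`, this should reproduce the `age` figures in the two tables above: a mean of about 44.25 and a standard deviation of about 10.52 for `>50K`, and about 36.78 and 14.02 for `<=50K`.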


<i class="fa fa-question-circle" aria-hidden="true"> Question:</i> Has everyone who earns more than 50K per year completed at least a high-school education?

In [10]:
data[data['salary']=='>50K']['education'].value_counts()

Bachelors       2221
HS-grad         1675
Some-college    1387
Masters          959
Prof-school      423
Assoc-voc        361
Doctorate        306
Assoc-acdm       265
10th              62
11th              60
7th-8th           40
12th              33
9th               27
5th-6th           16
1st-4th            6
Name: education, dtype: int64
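
The counts above already answer the question: levels below `HS-grad` (12th, 11th, 10th, 9th, 7th-8th, 5th-6th and 1st-4th) account for 244 of the high earners, so the answer is no. The same check can be expressed as a yes/no test. The sketch below runs on an invented toy frame and assumes that `education-num` orders the education levels with `HS-grad` at 9, as the preview rows suggest:

```python
import pandas as pd

# Toy stand-in; rows are invented for illustration.
toy = pd.DataFrame({
    'salary': ['>50K', '>50K', '<=50K', '>50K'],
    'education': ['Bachelors', '11th', 'HS-grad', 'HS-grad'],
    'education-num': [13, 7, 9, 9],
})

# Assumption: education-num is ordinal and HS-grad corresponds to 9.
HS_GRAD_NUM = 9

rich = toy[toy['salary'] == '>50K']

# True only if every high earner reached at least the HS-grad level.
all_hs_or_above = bool((rich['education-num'] >= HS_GRAD_NUM).all())

# Number of high earners below the HS-grad level.
below_hs_count = int((rich['education-num'] < HS_GRAD_NUM).sum())

print(all_hs_or_above, below_hs_count)
```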

<i class="fa fa-question-circle" aria-hidden="true"> Question:</i> Use `groupby` and `describe` to summarize the age distribution for each combination of race and sex.

In [11]:
data['race'].value_counts()

White                 27816
Black                  3124
Asian-Pac-Islander     1039
Amer-Indian-Eskimo      311
Other                   271
Name: race, dtype: int64

In [12]:
data.groupby(by=['race','sex'])[['age']].describe()
# Comparable to a pivot table in Excel

| race | sex | age count | age mean | age std | age min | age 25% | age 50% | age 75% | age max |
|---|---|---|---|---|---|---|---|---|---|
| Amer-Indian-Eskimo | Female | 119.0 | 37.117647 | 13.114991 | 17.0 | 27.0 | 36.0 | 46.0 | 80.0 |
| Amer-Indian-Eskimo | Male | 192.0 | 37.208333 | 12.049563 | 17.0 | 28.0 | 35.0 | 45.0 | 82.0 |
| Asian-Pac-Islander | Female | 346.0 | 35.089595 | 12.300845 | 17.0 | 25.0 | 33.0 | 43.75 | 75.0 |
| Asian-Pac-Islander | Male | 693.0 | 39.073593 | 12.883944 | 18.0 | 29.0 | 37.0 | 46.0 | 90.0 |
| Black | Female | 1555.0 | 37.854019 | 12.637197 | 17.0 | 28.0 | 37.0 | 46.0 | 90.0 |
| Black | Male | 1569.0 | 37.6826 | 12.882612 | 17.0 | 27.0 | 36.0 | 46.0 | 90.0 |
| Other | Female | 109.0 | 31.678899 | 11.631599 | 17.0 | 23.0 | 29.0 | 39.0 | 74.0 |
| Other | Male | 162.0 | 34.654321 | 11.355531 | 17.0 | 26.0 | 32.0 | 42.0 | 77.0 |
| White | Female | 8642.0 | 36.811618 | 14.329093 | 17.0 | 25.0 | 35.0 | 46.0 | 90.0 |
| White | Male | 19174.0 | 39.652498 | 13.436029 | 17.0 | 29.0 | 38.0 | 49.0 | 90.0 |


<i class="fa fa-question-circle" aria-hidden="true"> Question:</i> Among high-earning men, count how many are married and how many are unmarried (including divorced and separated).

In [13]:
# To pick out specific statuses, index the result with a list of labels:
# data[(data['salary'] == '>50K') & (data['sex'] == 'Male')]['marital-status'].value_counts()[['Divorced', 'Separated']]
data[(data['salary'] == '>50K') & (data['sex']=='Male')]['marital-status'].value_counts()

Married-civ-spouse       5938
Never-married             325
Divorced                  284
Separated                  49
Widowed                    39
Married-spouse-absent      23
Married-AF-spouse           4
Name: marital-status, dtype: int64
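
The frequency table lists every status separately, while the question asks for two totals. One way to collapse it, sketched here on an invented toy frame, is to treat every status that starts with `Married` as married and everything else as unmarried. That split is an assumption: it puts `Married-spouse-absent` in the married group and `Widowed` in the unmarried group.

```python
import pandas as pd

# Toy stand-in; rows are invented for illustration.
toy = pd.DataFrame({
    'sex': ['Male', 'Male', 'Male', 'Female', 'Male'],
    'salary': ['>50K', '>50K', '>50K', '>50K', '<=50K'],
    'marital-status': ['Married-civ-spouse', 'Divorced', 'Married-AF-spouse',
                       'Married-civ-spouse', 'Never-married'],
})

# Keep only high-earning men.
rich_men = toy[(toy['sex'] == 'Male') & (toy['salary'] == '>50K')]

# Assumption: any status beginning with "Married" counts as married.
is_married = rich_men['marital-status'].str.startswith('Married')

married_count = int(is_married.sum())
unmarried_count = int((~is_married).sum())
print(married_count, unmarried_count)
```

Under the same assumption, the counts above add up to 5938 + 23 + 4 = 5965 married and 325 + 284 + 49 + 39 = 697 unmarried high-earning men.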

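<i class="fa fa-question-circle" aria-hidden="true"> Question:</i> Find the maximum number of hours worked per week, the number of people who work that many hours, and the proportion of them who earn more than 50K.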

In [14]:
data['hours-per-week'].max()

99

In [15]:
hours_per_week_max = data['hours-per-week'].max()
data[data['hours-per-week'] == hours_per_week_max].shape[0]

85

In [22]:
data[data['hours-per-week'] == hours_per_week_max]['salary'].value_counts(normalize=True)
# Note: select the ['salary'] column first; only then can .value_counts be applied to it

<=50K    0.705882
>50K     0.294118
Name: salary, dtype: float64

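<i class="fa fa-question-circle" aria-hidden="true"> Question:</i> For each country, compute the average weekly working hours of people earning more than 50K and of those earning 50K or less.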

In [None]:
# Group by country and salary bracket so that both income groups are covered at once
data.groupby(by=['native-country', 'salary'])[['hours-per-week']].mean()
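
For a side-by-side comparison, a wide table with one row per country and one column per income group may be easier to read. The sketch below uses `pd.crosstab` with `values` and `aggfunc` on an invented toy frame; the countries and hours are illustrative only:

```python
import pandas as pd

# Toy stand-in; rows are invented for illustration.
toy = pd.DataFrame({
    'native-country': ['Cuba', 'Cuba', 'Germany', 'Germany', 'Germany'],
    'salary': ['<=50K', '>50K', '<=50K', '<=50K', '>50K'],
    'hours-per-week': [40, 50, 30, 40, 60],
})

# Rows: countries. Columns: salary groups. Cells: mean weekly hours.
mean_hours = pd.crosstab(
    toy['native-country'],
    toy['salary'],
    values=toy['hours-per-week'],
    aggfunc='mean',
)
print(mean_hours)
```

On the real data, a country with no rows in one of the income groups would show `NaN` in that cell.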

---

<div style="background-color: #e6e6e6; margin-bottom: 10px; padding: 1%; border: 1px solid #ccc; border-radius: 6px;text-align: center;"><a href="https://nbviewer.jupyter.org/github/shiyanlou/mlcourse-answers/tree/master/" title="Challenge reference answers"><i class="fa fa-file-code-o" aria-hidden="true"> View the challenge reference answers</i></a></div>