<img src="../../img/ods_stickers.jpg" />

## Exploring Census Income Data

---

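In this challenge, you will use Pandas to explore the data and answer several questions about the [<i class="fa fa-external-link-square" aria-hidden="true"> Adult dataset</i>](https://archive.ics.uci.edu/ml/datasets/Adult). The Adult dataset comes from census income records; it contains several features, and its target is a categorical variable.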

First, we load the dataset and preview it.

In [49]:
import warnings
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
warnings.filterwarnings('ignore')

In [50]:
data = pd.read_csv(
    'adult.data.csv')
data.head()

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


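All leading columns of the DataFrame are features; the last column, `salary`, is the target. From here on, you need to add the code required to answer each challenge question yourself.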

---

<i class="fa fa-question-circle" aria-hidden="true"> Question:</i> How many men and women are in the dataset?

In [51]:
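# Add code here to answer the question. Grade your work yourself against the reference answers at the end; the system cannot grade it automatically.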


In [52]:
data['sex'].value_counts()

Male      21790
Female    10771
Name: sex, dtype: int64

<i class="fa fa-question-circle" aria-hidden="true"> Question:</i> What is the average age of women in the dataset?

In [53]:
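# numeric_only=True skips the text columns, which pandas 2.0 and later refuse to average
data[data['sex'] == 'Female'].mean(numeric_only=True)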

age                   36.858230
fnlwgt            185746.311206
education-num         10.035744
capital-gain         568.410547
capital-loss          61.187633
hours-per-week        36.410361
dtype: float64
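
The output above averages every numeric column, although the question only asks about `age`. A minimal sketch of selecting just that column follows; the `toy` DataFrame and its rows are made up for illustration and are not taken from the Adult data.

```python
import pandas as pd

# Hypothetical stand-in for the Adult data (rows invented for this sketch)
toy = pd.DataFrame({
    'sex': ['Male', 'Female', 'Female', 'Male'],
    'age': [39, 30, 40, 50],
})

# The boolean mask selects the rows; the column label keeps only `age`
female_mean_age = toy.loc[toy['sex'] == 'Female', 'age'].mean()
print(female_mean_age)
```

Applied to `data` instead of `toy`, the same expression should return a single number rather than a whole Series.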

<i class="fa fa-question-circle" aria-hidden="true"> Question:</i> What proportion of people in the dataset are German citizens?

In [54]:
data['native-country'].value_counts(normalize=True)['Germany']

0.004207487485028101
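
An alternative way to get a proportion is to average a boolean mask, since `True` counts as 1 and `False` as 0. The sketch below assumes a small invented `toy` DataFrame, not the real dataset.

```python
import pandas as pd

# Hypothetical rows, used only to illustrate the technique
toy = pd.DataFrame({
    'native-country': ['United-States', 'Germany', 'Cuba', 'United-States'],
})

# The mean of a boolean Series is the fraction of True values
german_share = (toy['native-country'] == 'Germany').mean()
print(german_share)
```

Unlike indexing into `value_counts`, this form returns 0.0 instead of raising a `KeyError` when the country does not occur at all.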

<i class="fa fa-question-circle" aria-hidden="true"> Question:</i> What are the mean and standard deviation of age for people earning more than 50K per year and for those earning 50K or less?

In [55]:
data[data['salary'] == '<=50K'].describe()

| | age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week |
|---|---|---|---|---|---|---|
| count | 24720.0 | 24720.0 | 24720.0 | 24720.0 | 24720.0 | 24720.0 |
| mean | 36.783738 | 190340.9 | 9.595065 | 148.752468 | 53.142921 | 38.84021 |
| std | 14.020088 | 106482.3 | 2.436147 | 963.139307 | 310.755769 | 12.318995 |
| min | 17.0 | 12285.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 25% | 25.0 | 117606.0 | 9.0 | 0.0 | 0.0 | 35.0 |
| 50% | 34.0 | 179465.0 | 9.0 | 0.0 | 0.0 | 40.0 |
| 75% | 46.0 | 239023.0 | 10.0 | 0.0 | 0.0 | 40.0 |
| max | 90.0 | 1484705.0 | 16.0 | 41310.0 | 4356.0 | 99.0 |


In [56]:
data[data['salary'] == '>50K'].describe()

| | age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week |
|---|---|---|---|---|---|---|
| count | 7841.0 | 7841.0 | 7841.0 | 7841.0 | 7841.0 | 7841.0 |
| mean | 44.249841 | 188005.0 | 11.611657 | 4006.142456 | 195.00153 | 45.473026 |
| std | 10.519028 | 102541.8 | 2.385129 | 14570.378951 | 595.487574 | 11.012971 |
| min | 19.0 | 14878.0 | 2.0 | 0.0 | 0.0 | 1.0 |
| 25% | 36.0 | 119101.0 | 10.0 | 0.0 | 0.0 | 40.0 |
| 50% | 44.0 | 176101.0 | 12.0 | 0.0 | 0.0 | 40.0 |
| 75% | 51.0 | 230959.0 | 13.0 | 0.0 | 0.0 | 50.0 |
| max | 90.0 | 1226583.0 | 16.0 | 99999.0 | 3683.0 | 99.0 |
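
The two `describe` calls report many statistics for every numeric column. If only the mean and standard deviation of `age` are needed, one possible shortcut is a single `groupby` with `agg`. The sketch below runs on an invented `toy` DataFrame rather than the real data.

```python
import pandas as pd

# Hypothetical rows, used only to illustrate the technique
toy = pd.DataFrame({
    'salary': ['<=50K', '<=50K', '>50K', '>50K'],
    'age': [20, 30, 40, 50],
})

# One row per salary group, one column per requested statistic
age_stats = toy.groupby('salary')['age'].agg(['mean', 'std'])
print(age_stats)
```

Note that `std` here is the sample standard deviation (`ddof=1`), the same convention `describe` uses.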


<i class="fa fa-question-circle" aria-hidden="true"> Question:</i> Do all people earning more than 50K per year have at least a high-school education?

In [63]:
temp = data[data['salary'] == '>50K']['education']
temp.value_counts(normalize=True)

Bachelors       0.283255
HS-grad         0.213621
Some-college    0.176891
Masters         0.122306
Prof-school     0.053947
Assoc-voc       0.046040
Doctorate       0.039026
Assoc-acdm      0.033797
10th            0.007907
11th            0.007652
7th-8th         0.005101
12th            0.004209
9th             0.003443
5th-6th         0.002041
1st-4th         0.000765
Name: education, dtype: float64
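
Reading the answer off the frequency table works, but the yes/no question can also be checked directly with `isin`. In the sketch below, the `toy` DataFrame is invented, and the `below_hs` list is an assumption about which education labels count as "below high school".

```python
import pandas as pd

# Hypothetical rows, used only to illustrate the technique
toy = pd.DataFrame({
    'salary': ['>50K', '>50K', '<=50K', '>50K'],
    'education': ['Bachelors', '10th', 'HS-grad', 'HS-grad'],
})

# Assumption: these labels mean the person did not finish high school
below_hs = ['Preschool', '1st-4th', '5th-6th', '7th-8th',
            '9th', '10th', '11th', '12th']

high_earners = toy[toy['salary'] == '>50K']
# True only if no high earner has one of the listed education levels
all_finished_hs = not high_earners['education'].isin(below_hs).any()
print(all_finished_hs)
```

In the toy data one high earner has a `10th` education, so the check prints `False`.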

<i class="fa fa-question-circle" aria-hidden="true"> Question:</i> Use `groupby` and `describe` to summarize the age distribution for each combination of race and sex.

In [None]:
data.groupby(by=['race','sex'])[['age']].describe()

<i class="fa fa-question-circle" aria-hidden="true"> Question:</i> Count how many high-income men are married and how many are unmarried (including divorced and separated).

In [None]:
male_over = data[(data['salary'] == '>50K') & (data['sex'] == 'Male')]
male_over['marital-status'].value_counts()
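
`value_counts` lists every marital status separately, so the married and unmarried totals still have to be added up by hand. One way to collapse them is sketched below on an invented `toy` DataFrame. It assumes that every status whose label starts with `Married` counts as married and everything else as unmarried.

```python
import pandas as pd

# Hypothetical rows, used only to illustrate the technique
toy = pd.DataFrame({
    'sex': ['Male', 'Male', 'Male', 'Female', 'Male'],
    'salary': ['>50K', '>50K', '>50K', '>50K', '<=50K'],
    'marital-status': ['Married-civ-spouse', 'Divorced', 'Married-AF-spouse',
                       'Never-married', 'Separated'],
})

rich_men = toy[(toy['salary'] == '>50K') & (toy['sex'] == 'Male')]

# Assumption: labels beginning with "Married" mean married
is_married = rich_men['marital-status'].str.startswith('Married')
married_count = int(is_married.sum())
unmarried_count = int((~is_married).sum())
print(married_count, unmarried_count)
```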

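<i class="fa fa-question-circle" aria-hidden="true"> Question:</i> Find the maximum number of hours worked per week in the dataset and how many people work that many hours, then compute the share of that group earning more than 50K.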

In [None]:
longest = data['hours-per-week'].max()
longest_people = data[data['hours-per-week'] == longest]
longest_people['salary'].value_counts(normalize=True)
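
The cell above yields the salary shares but does not print the maximum itself or the head count, both of which the question asks for. A sketch that reports all three numbers follows; it uses an invented `toy` DataFrame.

```python
import pandas as pd

# Hypothetical rows, used only to illustrate the technique
toy = pd.DataFrame({
    'hours-per-week': [40, 99, 99, 60, 99, 99],
    'salary': ['<=50K', '>50K', '<=50K', '>50K', '<=50K', '<=50K'],
})

max_hours = toy['hours-per-week'].max()
longest = toy[toy['hours-per-week'] == max_hours]

n_people = len(longest)                                  # size of the group
share_over_50k = (longest['salary'] == '>50K').mean()    # fraction earning >50K
print(max_hours, n_people, share_over_50k)
```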

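<i class="fa fa-question-circle" aria-hidden="true"> Question:</i> For each country, compute the average weekly working hours of people earning more than 50K and of those earning 50K or less.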

In [None]:
over = data[data['salary'] == '>50K']
below = data[data['salary'] == '<=50K']
over_people = over.groupby(by=['native-country'])[['hours-per-week']]
below_people = below.groupby(by=['native-country'])[['hours-per-week']]
over_people.mean()

In [None]:
below_people.mean()
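
Splitting the data into two frames gives two separate result tables. Grouping by country and salary together, then calling `unstack`, is one way to put both averages side by side. The sketch below uses an invented `toy` DataFrame.

```python
import pandas as pd

# Hypothetical rows, used only to illustrate the technique
toy = pd.DataFrame({
    'native-country': ['Cuba', 'Cuba', 'Germany', 'Germany', 'Germany'],
    'salary': ['<=50K', '>50K', '<=50K', '>50K', '>50K'],
    'hours-per-week': [40, 50, 30, 40, 60],
})

# Rows are countries, columns are the two salary groups
mean_hours = (
    toy.groupby(['native-country', 'salary'])['hours-per-week']
       .mean()
       .unstack('salary')
)
print(mean_hours)
```

A country that has no one in a given salary group would show `NaN` in that column, which makes the gap visible instead of silently dropping the row.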

---

<div style="background-color: #e6e6e6; margin-bottom: 10px; padding: 1%; border: 1px solid #ccc; border-radius: 6px;text-align: center;"><a href="https://nbviewer.jupyter.org/github/shiyanlou/mlcourse-answers/tree/master/" title="Challenge reference answers"><i class="fa fa-file-code-o" aria-hidden="true"> View the challenge reference answers</i></a></div>