**<center>[mlcourse.ai](https://mlcourse.ai) – Open Machine Learning Course** </center><br>

Author: [Yury Kashnitsky](https://www.linkedin.com/in/festline/). Translated and edited by [Sergey Isaev](https://www.linkedin.com/in/isvforall/), [Artem Trunov](https://www.linkedin.com/in/datamove/), [Anastasia Manokhina](https://www.linkedin.com/in/anastasiamanokhina/), and [Yuanyuan Pao](https://www.linkedin.com/in/yuanyuanpao/). All content is distributed under the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license.


**Same assignment as a [Kaggle Kernel](https://www.kaggle.com/kashnitsky/a1-demo-pandas-and-uci-adult-dataset) + [solution](https://www.kaggle.com/kashnitsky/a1-demo-pandas-and-uci-adult-dataset-solution).**

**In this task you should use Pandas to answer a few questions about the [Adult](https://archive.ics.uci.edu/ml/datasets/Adult) dataset. (You don't have to download the data – it's already in the repository.) Choose the answers in the [web-form](https://docs.google.com/forms/d/1uY7MpI2trKx6FLWZte0uVh3ULV4Cm_tDud0VDFGCOKg).**

Unique values of features (for more information please see the link above):
- `age`: continuous;
- `workclass`: `Private`, `Self-emp-not-inc`, `Self-emp-inc`, `Federal-gov`, `Local-gov`, `State-gov`, `Without-pay`, `Never-worked`;
- `fnlwgt`: continuous;
- `education`: `Bachelors`, `Some-college`, `11th`, `HS-grad`, `Prof-school`, `Assoc-acdm`, `Assoc-voc`, `9th`, `7th-8th`, `12th`, `Masters`, `1st-4th`, `10th`, `Doctorate`, `5th-6th`, `Preschool`;
- `education-num`: continuous;
- `marital-status`: `Married-civ-spouse`, `Divorced`, `Never-married`, `Separated`, `Widowed`, `Married-spouse-absent`, `Married-AF-spouse`;
- `occupation`: `Tech-support`, `Craft-repair`, `Other-service`, `Sales`, `Exec-managerial`, `Prof-specialty`, `Handlers-cleaners`, `Machine-op-inspct`, `Adm-clerical`, `Farming-fishing`, `Transport-moving`, `Priv-house-serv`, `Protective-serv`, `Armed-Forces`;
- `relationship`: `Wife`, `Own-child`, `Husband`, `Not-in-family`, `Other-relative`, `Unmarried`;
- `race`: `White`, `Asian-Pac-Islander`, `Amer-Indian-Eskimo`, `Other`, `Black`;
- `sex`: `Female`, `Male`;
- `capital-gain`: continuous;
- `capital-loss`: continuous;
- `hours-per-week`: continuous;
- `native-country`: `United-States`, `Cambodia`, `England`, `Puerto-Rico`, `Canada`, `Germany`, `Outlying-US(Guam-USVI-etc)`, `India`, `Japan`, `Greece`, `South`, `China`, `Cuba`, `Iran`, `Honduras`, `Philippines`, `Italy`, `Poland`, `Jamaica`, `Vietnam`, `Mexico`, `Portugal`, `Ireland`, `France`, `Dominican-Republic`, `Laos`, `Ecuador`, `Taiwan`, `Haiti`, `Columbia`, `Hungary`, `Guatemala`, `Nicaragua`, `Scotland`, `Thailand`, `Yugoslavia`, `El-Salvador`, `Trinadad&Tobago`, `Peru`, `Hong`, `Holand-Netherlands`;
- `salary`: `>50K`, `<=50K`.

In [1]:
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# show up to 100 columns when displaying a DataFrame
pd.set_option("display.max_columns", 100)
# draw plots inline in the Jupyter notebook
%matplotlib inline
# suppress warnings; comment out the next line if you'd like to see them
warnings.filterwarnings("ignore")

# display floats with 2 decimal places
pd.options.display.float_format = '{:,.2f}'.format

In [2]:
# The data is read from the course's GitHub repository. To save Internet traffic,
# you can instead point DATA_URL at the data/ folder in the root of your local
# clone of https://github.com/Yorko/mlcourse.ai
DATA_URL = "https://raw.githubusercontent.com/Yorko/mlcourse.ai/master/data/"

In [3]:
data = pd.read_csv(DATA_URL + "adult.data.csv")
data.head()

,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


**1. How many men and women (*sex* feature) are represented in this dataset?**

In [4]:
data.sex.value_counts()

Male      21790
Female    10771
Name: sex, dtype: int64

**2. What is the average age (*age* feature) of women?**

In [5]:
data[data.sex == 'Female'].age.mean()

36.85823043357163

**3. What is the percentage of German citizens (*native-country* feature)?**

In [6]:
(data['native-country'].value_counts(normalize=True)['Germany'] * 100).round(2)

0.42
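
The one-liner above relies on `value_counts(normalize=True)` returning shares that sum to 1. A minimal sketch of the same idea on a hypothetical toy series (the `countries` values are invented for illustration and are not taken from the Adult data):

```python
import pandas as pd

# Hypothetical toy data: 10 people, 1 of them with native country "Germany".
countries = pd.Series(
    ["United-States"] * 7 + ["Germany"] + ["Mexico"] * 2,
    name="native-country",
)

# normalize=True turns counts into fractions of the total.
shares = countries.value_counts(normalize=True) * 100

# Percentage for a single category, rounded to 2 decimal places.
germany_pct = round(shares["Germany"], 2)
print(germany_pct)
```

Multiplying by 100 before rounding keeps two decimal places of the percentage rather than of the fraction.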

**4-5. What are the mean and standard deviation of age for those who earn more than 50K per year (*salary* feature) and for those who earn 50K or less per year?**

In [7]:
data.salary.unique()

array(['<=50K', '>50K'], dtype=object)

In [8]:
data[data.salary == '>50K'].age.describe()[['mean', 'std']]

mean   44.25
std    10.52
Name: age, dtype: float64

In [9]:
data[data.salary == '<=50K'].age.describe()[['mean', 'std']]

mean   36.78
std    14.02
Name: age, dtype: float64
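
The two filtered `describe()` calls above can also be collapsed into a single `groupby`. A minimal sketch, assuming a hypothetical toy frame `toy` with the same `age` and `salary` column names (the values are invented for illustration):

```python
import pandas as pd

# Hypothetical toy data; values are invented for illustration only.
toy = pd.DataFrame({
    "age": [30, 40, 50, 20, 60],
    "salary": [">50K", ">50K", ">50K", "<=50K", "<=50K"],
})

# One row per salary group, one column per requested statistic.
# "std" is the sample standard deviation (ddof=1), as in describe().
age_stats = toy.groupby("salary")["age"].agg(["mean", "std"])
print(age_stats)
```

This returns both groups in one table, which makes them easier to compare side by side.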

**6. Is it true that all people who earn more than 50K have at least a high school education (*education* feature equal to `Bachelors`, `Prof-school`, `Assoc-acdm`, `Assoc-voc`, `Masters` or `Doctorate`)?**

In [10]:
data[data.salary == '>50K'].education.unique()  # obviously false

array(['HS-grad', 'Masters', 'Bachelors', 'Some-college', 'Assoc-voc',
       'Doctorate', 'Prof-school', 'Assoc-acdm', '7th-8th', '12th',
       '10th', '11th', '9th', '5th-6th', '1st-4th'], dtype=object)
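
Instead of reading the answer off the array by eye, the check can be made explicit with `isin()` and `all()`. A minimal sketch on a hypothetical toy frame (the rows are invented; only the list of education levels comes from the question):

```python
import pandas as pd

# Hypothetical toy data; rows are invented for illustration only.
toy = pd.DataFrame({
    "education": ["Bachelors", "HS-grad", "Masters", "9th"],
    "salary": [">50K", ">50K", "<=50K", ">50K"],
})

# Education levels listed in the question.
higher_ed = ["Bachelors", "Prof-school", "Assoc-acdm",
             "Assoc-voc", "Masters", "Doctorate"]

high_earners = toy[toy["salary"] == ">50K"]

# True only if every high earner has one of the listed education levels.
all_higher_ed = high_earners["education"].isin(higher_ed).all()

# Education values of high earners that are not in the list.
counterexamples = sorted(set(high_earners["education"]) - set(higher_ed))
print(all_higher_ed, counterexamples)
```

A single counterexample is enough to answer the question with "false".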

**7. Display age statistics for each race (*race* feature) and each gender (*sex* feature). Use *groupby()* and *describe()*. Find the maximum age of men of `Amer-Indian-Eskimo` race.**

In [11]:
data.groupby(['race', 'sex']).describe()  # the maximum age of Amer-Indian-Eskimo men is 82

,,age,age,age,age,age,age,age,age,fnlwgt,fnlwgt,fnlwgt,fnlwgt,fnlwgt,fnlwgt,fnlwgt,fnlwgt,education-num,education-num,education-num,education-num,education-num,education-num,education-num,education-num,capital-gain,capital-gain,capital-gain,capital-gain,capital-gain,capital-gain,capital-gain,capital-gain,capital-loss,capital-loss,capital-loss,capital-loss,capital-loss,capital-loss,capital-loss,capital-loss,hours-per-week,hours-per-week,hours-per-week,hours-per-week,hours-per-week,hours-per-week,hours-per-week,hours-per-week
race,sex,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max
Amer-Indian-Eskimo,Female,119.0,37.12,13.11,17.0,27.0,36.0,46.0,80.0,119.0,112950.73,93207.97,12285.0,31387.0,87950.0,163027.5,445168.0,119.0,9.7,2.33,2.0,9.0,10.0,11.0,16.0,119.0,544.61,2451.59,0.0,0.0,0.0,0.0,15024.0,119.0,14.46,157.76,0.0,0.0,0.0,0.0,1721.0,119.0,36.58,11.05,4.0,35.0,40.0,40.0,84.0
Amer-Indian-Eskimo,Male,192.0,37.21,12.05,17.0,28.0,35.0,45.0,82.0,192.0,125715.36,85063.25,13769.0,48197.75,113091.0,182656.0,356015.0,192.0,9.07,2.27,2.0,9.0,9.0,10.0,16.0,192.0,675.26,2929.75,0.0,0.0,0.0,0.0,27828.0,192.0,46.4,286.56,0.0,0.0,0.0,0.0,1980.0,192.0,42.2,11.6,3.0,40.0,40.0,45.0,84.0
Asian-Pac-Islander,Female,346.0,35.09,12.3,17.0,25.0,33.0,43.75,75.0,346.0,147452.08,76401.63,19914.0,86879.25,131986.0,175705.75,379046.0,346.0,10.39,2.8,1.0,9.0,10.0,13.0,15.0,346.0,778.44,7675.23,0.0,0.0,0.0,0.0,99999.0,346.0,50.85,296.53,0.0,0.0,0.0,0.0,2258.0,346.0,37.44,12.48,1.0,35.0,40.0,40.0,99.0
Asian-Pac-Islander,Male,693.0,39.07,12.88,18.0,29.0,37.0,46.0,90.0,693.0,166175.87,88552.95,14878.0,98350.0,147719.0,200117.0,506329.0,693.0,11.25,2.78,1.0,9.0,11.0,13.0,16.0,693.0,1827.81,10947.53,0.0,0.0,0.0,0.0,99999.0,693.0,120.37,472.92,0.0,0.0,0.0,0.0,2457.0,693.0,41.47,12.39,1.0,40.0,40.0,45.0,99.0
Black,Female,1555.0,37.85,12.64,17.0,28.0,37.0,46.0,90.0,1555.0,212971.39,109971.26,19752.0,142666.5,193553.0,253759.0,930948.0,1555.0,9.55,2.21,1.0,9.0,9.0,10.0,16.0,1555.0,516.59,5312.75,0.0,0.0,0.0,0.0,99999.0,1555.0,45.45,299.1,0.0,0.0,0.0,0.0,4356.0,1555.0,36.83,9.42,2.0,35.0,40.0,40.0,99.0
Black,Male,1569.0,37.68,12.88,17.0,27.0,36.0,46.0,90.0,1569.0,242920.64,134145.97,21856.0,156410.0,221196.0,298601.0,1268339.0,1569.0,9.42,2.38,1.0,9.0,9.0,10.0,16.0,1569.0,702.45,4962.11,0.0,0.0,0.0,0.0,99999.0,1569.0,75.19,370.98,0.0,0.0,0.0,0.0,2824.0,1569.0,40.0,10.91,1.0,40.0,40.0,40.0,99.0
Other,Female,109.0,31.68,11.63,17.0,23.0,29.0,39.0,74.0,109.0,172519.64,77766.67,24562.0,119890.0,171199.0,219441.0,388741.0,109.0,8.9,3.03,2.0,7.0,9.0,10.0,14.0,109.0,254.67,1317.33,0.0,0.0,0.0,0.0,7688.0,109.0,36.28,231.8,0.0,0.0,0.0,0.0,1740.0,109.0,35.93,10.3,6.0,30.0,40.0,40.0,65.0
Other,Male,162.0,34.65,11.36,17.0,26.0,32.0,42.0,77.0,162.0,213679.1,92187.36,25610.0,150726.75,208516.5,253334.75,481175.0,162.0,8.8,3.36,1.0,8.0,9.0,10.0,16.0,162.0,1392.19,11093.71,0.0,0.0,0.0,0.0,99999.0,162.0,77.75,370.99,0.0,0.0,0.0,0.0,2179.0,162.0,41.85,11.08,5.0,40.0,40.0,40.0,98.0
White,Female,8642.0,36.81,14.33,17.0,25.0,35.0,46.0,90.0,8642.0,183549.97,101710.29,19395.0,115914.75,175810.5,224836.5,1484705.0,8642.0,10.13,2.37,1.0,9.0,10.0,12.0,16.0,8642.0,573.61,4763.13,0.0,0.0,0.0,0.0,99999.0,8642.0,65.39,352.33,0.0,0.0,0.0,0.0,4356.0,8642.0,36.3,12.19,1.0,30.0,40.0,40.0,99.0
White,Male,19174.0,39.65,13.44,17.0,29.0,38.0,49.0,90.0,19174.0,188987.39,103714.6,18827.0,117381.0,178662.5,236858.75,1455435.0,19174.0,10.14,2.66,1.0,9.0,10.0,13.0,16.0,19174.0,1368.67,8442.83,0.0,0.0,0.0,0.0,99999.0,19174.0,102.26,434.16,0.0,0.0,0.0,0.0,3770.0,19174.0,42.67,12.19,1.0,40.0,40.0,50.0,99.0
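
Since the question only concerns age, the output can be narrowed by selecting the `age` column before calling `describe()`, and a single value can be read from the result with `.loc`. A minimal sketch on a hypothetical toy frame (values invented for illustration):

```python
import pandas as pd

# Hypothetical toy data; values are invented for illustration only.
toy = pd.DataFrame({
    "race": ["White", "White", "Amer-Indian-Eskimo",
             "Amer-Indian-Eskimo", "Amer-Indian-Eskimo"],
    "sex": ["Male", "Female", "Male", "Male", "Female"],
    "age": [45, 38, 29, 61, 52],
})

# Selecting "age" first restricts describe() to that one column.
age_by_group = toy.groupby(["race", "sex"])["age"].describe()

# The row index is a (race, sex) pair; "max" is one of the describe() columns.
max_age = age_by_group.loc[("Amer-Indian-Eskimo", "Male"), "max"]
print(max_age)
```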


**8. Among whom is the proportion of those who earn a lot (`>50K`) greater: married or single men (*marital-status* feature)? Consider as married those who have a *marital-status* starting with *Married* (`Married-civ-spouse`, `Married-spouse-absent` or `Married-AF-spouse`), the rest are considered bachelors.**

In [12]:
data['married'] = data['marital-status'].apply(lambda x: x.startswith('Married'))

In [13]:
data[data.married == 1].salary.value_counts(normalize=True) # married people (both sexes)

<=50K   0.56
>50K    0.44
Name: salary, dtype: float64

In [14]:
data[data.married == 0].salary.value_counts(normalize=True) # not married people (both sexes)

<=50K   0.94
>50K    0.06
Name: salary, dtype: float64
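
The question asks about men specifically, whereas the cells above compute the shares over the whole dataset. A minimal sketch of how the selection could be restricted to men, using a hypothetical toy frame (rows invented for illustration; the labels `married` and `single` are also assumptions, not names from the original):

```python
import pandas as pd

# Hypothetical toy data; rows are invented for illustration only.
toy = pd.DataFrame({
    "sex": ["Male", "Male", "Male", "Male", "Female", "Female"],
    "marital-status": ["Married-civ-spouse", "Married-AF-spouse",
                       "Never-married", "Divorced",
                       "Married-civ-spouse", "Widowed"],
    "salary": [">50K", "<=50K", "<=50K", "<=50K", ">50K", "<=50K"],
})

# Keep men only, then label each man as married or single.
men = toy[toy["sex"] == "Male"]
status = men["marital-status"].str.startswith("Married").map(
    {True: "married", False: "single"}
)

# The mean of a boolean series is the share of True values.
high_share = (men["salary"] == ">50K").groupby(status).mean()
print(high_share)
```

Applied to the full dataset, the same pattern would give the proportions for men only; the numbers may differ from the outputs shown above.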

**9. What is the maximum number of hours a person works per week (*hours-per-week* feature)? How many people work such a number of hours, and what is the percentage of those who earn a lot (`>50K`) among them?**

In [15]:
maximum = data['hours-per-week'].max()
maximum

99

In [16]:
count = len(data[data['hours-per-week'] == maximum])
count

85

In [17]:
data[data['hours-per-week'] == maximum].salary.value_counts(normalize=True)

<=50K   0.71
>50K    0.29
Name: salary, dtype: float64
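
The three cells above can be combined into one short computation that yields the percentage directly. A minimal sketch on a hypothetical toy frame (values invented for illustration):

```python
import pandas as pd

# Hypothetical toy data; values are invented for illustration only.
toy = pd.DataFrame({
    "hours-per-week": [40, 99, 99, 60, 99, 99],
    "salary": ["<=50K", ">50K", "<=50K", ">50K", "<=50K", "<=50K"],
})

max_hours = toy["hours-per-week"].max()

# Rows of people who work the maximum number of hours.
at_max = toy[toy["hours-per-week"] == max_hours]
n_people = len(at_max)

# Share of high earners among them, as a percentage.
pct_rich = (at_max["salary"] == ">50K").mean() * 100
print(max_hours, n_people, pct_rich)
```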

**10. Compute the average number of hours worked per week (*hours-per-week*) for those who earn a little and those who earn a lot (*salary*) for each country (*native-country*). What are these values for Japan?**

In [18]:
data.groupby(['native-country', 'salary'], as_index=False) \
    .agg({'hours-per-week': 'mean'}).query('`native-country` == "Japan"')

,native-country,salary,hours-per-week
47,Japan,<=50K,41.0
48,Japan,>50K,47.96
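
An alternative to `groupby` plus `query` is `pivot_table`, which puts countries in rows and salary groups in columns, so a single country can be looked up by its label. A minimal sketch on a hypothetical toy frame (values invented for illustration):

```python
import pandas as pd

# Hypothetical toy data; values are invented for illustration only.
toy = pd.DataFrame({
    "native-country": ["Japan", "Japan", "Japan", "Cuba", "Cuba"],
    "salary": ["<=50K", ">50K", ">50K", "<=50K", ">50K"],
    "hours-per-week": [40, 50, 46, 38, 45],
})

# Rows: countries; columns: salary groups; cells: mean weekly hours.
mean_hours = toy.pivot_table(
    index="native-country",
    columns="salary",
    values="hours-per-week",
    aggfunc="mean",
)

# Look up one country by its row label.
japan_hours = mean_hours.loc["Japan"]
print(japan_hours)
```

The wide layout makes it easy to compare the two salary groups within each country.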

