# 02807 - Week 3 Exercises: Exploratory data analysis with Pandas


## Learning objectives:

* Get hands-on experience performing exploratory data analysis with Pandas


## Readings:

Main reading:

* [Chapter 3: Data Manipulation with Pandas](https://jakevdp.github.io/PythonDataScienceHandbook/03.00-introduction-to-pandas.html). Python Data Science Handbook.

Recommended readings and tutorials:

* [Kaggle Pandas tutorials](https://www.kaggle.com/learn/pandas). Great hands-on, step-by-step introduction.
* [Python for Data Analysis Book](https://wesmckinney.com/pages/book.html). Excellent in-depth presentation of Pandas, by the creator of Pandas. Chapters 5 and onwards.


## Exercises:

* This week, you'll work on an assignment from the Open Machine Learning Course. 

* It comes with solutions and will score your work (check the info below).

* Of course, you should try to find your own solution before looking at the given one. Otherwise, you won't learn much.

* Enjoy and good luck!



---




<center>
<img src="https://habrastorage.org/files/fd4/502/43d/fd450243dd604b81b9713213a247aa20.jpg">
    
## [mlcourse.ai](mlcourse.ai) – Open Machine Learning Course

<center>Author: [Yury Kashnitskiy](http://yorko.github.io) <br>
Translated and edited by [Sergey Isaev](https://www.linkedin.com/in/isvforall/), [Artem Trunov](https://www.linkedin.com/in/datamove/), [Anastasia Manokhina](https://www.linkedin.com/in/anastasiamanokhina/), and [Yuanyuan Pao](https://www.linkedin.com/in/yuanyuanpao/) <br>All content is distributed under the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license.


# <center>  Exploratory data analysis with Pandas


**In this task you should use Pandas to answer a few questions about the [Adult](https://archive.ics.uci.edu/ml/datasets/Adult) dataset.** 

Choose the answers in the [web-form](https://docs.google.com/forms/d/1uY7MpI2trKx6FLWZte0uVh3ULV4Cm_tDud0VDFGCOKg). 

This is a demo version of an assignment, so by submitting the form, you'll see a link to the solution .ipynb file.

Unique values of all features (for more information, please see the links above):
- `age`: continuous.
- `workclass`: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
- `fnlwgt`: continuous.
- `education`: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
- `education-num`: continuous.
- `marital-status`: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
- `occupation`: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
- `relationship`: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
- `race`: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
- `sex`: Female, Male.
- `capital-gain`: continuous.
- `capital-loss`: continuous.
- `hours-per-week`: continuous.
- `native-country`: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.   
- `salary`: >50K, <=50K.

In [1]:
import pandas as pd

In [2]:
url = 'https://raw.githubusercontent.com/Yorko/mlcourse.ai/master/data/adult.data.csv'
data = pd.read_csv(url)
data.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


**1. How many men and women (*sex* feature) are represented in this dataset?** 

In [13]:
# Your code here
sex = data['sex']
nms = sex.value_counts()
print(nms)

Male      21790
Female    10771
Name: sex, dtype: int64
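
If shares are wanted alongside the raw counts, `value_counts` also accepts `normalize=True`. The sketch below uses a small made-up frame (`toy` is hypothetical, not the Adult data) purely to show the call.

```python
import pandas as pd

# Toy stand-in for the real `sex` column; the values are made up.
toy = pd.DataFrame({"sex": ["Male", "Female", "Male", "Male"]})

counts = toy["sex"].value_counts()                # absolute counts per category
shares = toy["sex"].value_counts(normalize=True)  # fractions that sum to 1

print(counts)
print(shares)
```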


**2. What is the average age (*age* feature) of women?**

In [16]:
# Your code here
print(data.age[data.sex=='Female'].mean())

36.85823043357163
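
The same filter can be written with `.loc`, which selects rows and the column in one step and also works for column names that are not valid attribute names (such as `hours-per-week`). A minimal sketch on made-up data, where `toy` is a hypothetical frame:

```python
import pandas as pd

# Toy data; ages are invented for illustration only.
toy = pd.DataFrame({
    "sex": ["Female", "Male", "Female", "Male"],
    "age": [30, 40, 50, 20],
})

# Boolean mask picks the rows, "age" picks the column.
mean_age_women = toy.loc[toy["sex"] == "Female", "age"].mean()
print(mean_age_women)
```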


**3. What is the percentage of German citizens (*native-country* feature)?**

In [46]:
# Your code here
import numpy as np
nms = data['native-country'].value_counts()
# Fraction (not percent) of rows whose native-country is Germany
print(nms.Germany/np.sum(nms))

0.004207487485028101
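
The value printed above is a fraction, while the question asks for a percentage. One way to get a percentage directly, sketched on a hypothetical `toy` frame, is to take the mean of a boolean mask and multiply by 100:

```python
import pandas as pd

# Toy data; the country values are made up.
toy = pd.DataFrame({
    "native-country": ["Germany", "United-States", "United-States", "Japan"],
})

# The mean of a boolean Series is the fraction of True values.
germany_pct = (toy["native-country"] == "Germany").mean() * 100
print(f"{germany_pct:.2f}%")
```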


**4-5. What are the mean and standard deviation of age for those who earn more than 50K per year (*salary* feature) and those who earn less than 50K per year?**

In [49]:
# Your code here
print(data.salary)
age_over_50 = data[data.salary=='>50K'].age
ave = age_over_50.mean()
std = age_over_50.std()
print('mean: {}, std: {}'.format(ave, std))
age_under_50 = data[data.salary=='<=50K'].age
ave = age_under_50.mean()
std = age_under_50.std()
print('mean: {}, std: {}'.format(ave, std))

0        <=50K
1        <=50K
2        <=50K
3        <=50K
4        <=50K
         ...  
32556    <=50K
32557     >50K
32558    <=50K
32559    <=50K
32560     >50K
Name: salary, Length: 32561, dtype: object
mean: 44.24984058155847, std: 10.519027719851826
mean: 36.78373786407767, std: 14.02008849082488
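
Both salary groups can also be summarised in a single call with `groupby` and `agg`. This is only a sketch on invented numbers (`toy` is hypothetical); note that `std` here is the sample standard deviation, the same default as `Series.std()`.

```python
import pandas as pd

# Toy data; ages are invented for illustration only.
toy = pd.DataFrame({
    "salary": [">50K", ">50K", "<=50K", "<=50K"],
    "age": [40, 50, 20, 30],
})

# One row per salary group, one column per statistic.
age_stats = toy.groupby("salary")["age"].agg(["mean", "std"])
print(age_stats)
```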


**6. Is it true that people who earn more than 50K have at least high school education? (*education* = *HS-grad*, *Some-college*, *Bachelors*, *Prof-school*, *Assoc-acdm*, *Assoc-voc*, *Masters* or *Doctorate*)**

In [62]:
# Your code here
print(np.unique(data.education.values))
edu_over_50 = data[data.salary=='>50K'].education
# Grade levels below HS-grad to look for among high earners
below_hs = ['10th', '11th', '12th', '1st-4th', '5th-6th', '7th-8th', '9th']
edu_under_hs = edu_over_50[edu_over_50.isin(below_hs)].value_counts()
print(edu_under_hs)

['10th' '11th' '12th' '1st-4th' '5th-6th' '7th-8th' '9th' 'Assoc-acdm'
 'Assoc-voc' 'Bachelors' 'Doctorate' 'HS-grad' 'Masters' 'Preschool'
 'Prof-school' 'Some-college']
10th       62
11th       60
7th-8th    40
12th       33
9th        27
5th-6th    16
1st-4th     6
Name: education, dtype: int64
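
Since the question is a yes/no one, it can also be answered with a single boolean by testing membership in the list of qualifying levels from the question. A minimal sketch, assuming a made-up `toy` frame:

```python
import pandas as pd

# Education levels the question counts as "at least high school".
hs_or_higher = [
    "HS-grad", "Some-college", "Bachelors", "Prof-school",
    "Assoc-acdm", "Assoc-voc", "Masters", "Doctorate",
]

# Toy data; rows are invented for illustration only.
toy = pd.DataFrame({
    "salary": [">50K", ">50K", "<=50K"],
    "education": ["Bachelors", "10th", "Preschool"],
})

high_earners = toy.loc[toy["salary"] == ">50K", "education"]
# True only if every high earner has one of the listed levels.
all_have_hs = bool(high_earners.isin(hs_or_higher).all())
print(all_have_hs)
```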


**7. Display age statistics for each race (*race* feature) and each gender (*sex* feature). Use *groupby()* and *describe()*. Find the maximum age of men of *Amer-Indian-Eskimo* race.**

In [87]:
# Your code here
print(data.groupby(['race','sex']).age.describe())

                             count       mean        std   min   25%   50%  \
race               sex                                                       
Amer-Indian-Eskimo Female    119.0  37.117647  13.114991  17.0  27.0  36.0   
                   Male      192.0  37.208333  12.049563  17.0  28.0  35.0   
Asian-Pac-Islander Female    346.0  35.089595  12.300845  17.0  25.0  33.0   
                   Male      693.0  39.073593  12.883944  18.0  29.0  37.0   
Black              Female   1555.0  37.854019  12.637197  17.0  28.0  37.0   
                   Male     1569.0  37.682600  12.882612  17.0  27.0  36.0   
Other              Female    109.0  31.678899  11.631599  17.0  23.0  29.0   
                   Male      162.0  34.654321  11.355531  17.0  26.0  32.0   
White              Female   8642.0  36.811618  14.329093  17.0  25.0  35.0   
                   Male    19174.0  39.652498  13.436029  17.0  29.0  38.0   

                             75%   max  
race               sex
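
Wide `describe()` tables wrap when printed, so it can be easier to pull out the single cell of interest. The sketch below, on a hypothetical `toy` frame, indexes the result with a `(race, sex)` tuple and the `max` column; on the real data the tuple would be `('Amer-Indian-Eskimo', 'Male')`.

```python
import pandas as pd

# Toy data; rows are invented for illustration only.
toy = pd.DataFrame({
    "race": ["White", "White", "Black", "Black"],
    "sex": ["Male", "Male", "Female", "Male"],
    "age": [30, 60, 45, 25],
})

stats = toy.groupby(["race", "sex"])["age"].describe()

# Select one cell from the MultiIndex table: row (race, sex), column "max".
max_age = stats.loc[("White", "Male"), "max"]

# Equivalent shortcut when only the maximum is needed.
same_max = toy.groupby(["race", "sex"])["age"].max().loc[("White", "Male")]
print(max_age, same_max)
```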

**8. Among whom is the proportion of those who earn a lot (>50K) greater: married or single men (*marital-status* feature)? Consider as married those who have a *marital-status* starting with *Married* (Married-civ-spouse, Married-spouse-absent or Married-AF-spouse), the rest are considered bachelors.**

In [98]:
# Your code here
# Note: this computes the marital split among men who earn >50K
earn_over_50_men = data[(data.salary=='>50K') & (data.sex=='Male')]
married_statuses = ['Married-civ-spouse', 'Married-spouse-absent', 'Married-AF-spouse']
married = earn_over_50_men[earn_over_50_men['marital-status'].isin(married_statuses)]
ratio_m = len(married) / len(earn_over_50_men)
print('Married Proportion: ', ratio_m)
print('Bachelor Proportion: ', 1-ratio_m)

Married Proportion:  0.8953767637346143
Bachelor Proportion:  0.10462323626538572
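
The cell above reports how male high earners split into married and single. The question can also be read the other way round: within each group of men (married, single), what share earns >50K? A sketch of that reading on made-up data follows; `toy` and the label names `married`/`single` are hypothetical.

```python
import pandas as pd

# Toy data; rows are invented for illustration only.
toy = pd.DataFrame({
    "sex": ["Male", "Male", "Male", "Male", "Male", "Female"],
    "marital-status": [
        "Married-civ-spouse", "Married-AF-spouse", "Never-married",
        "Divorced", "Married-spouse-absent", "Married-civ-spouse",
    ],
    "salary": [">50K", "<=50K", "<=50K", "<=50K", ">50K", ">50K"],
})

men = toy[toy["sex"] == "Male"]

# Label each man as married or single based on the status prefix.
status = men["marital-status"].str.startswith("Married").map(
    {True: "married", False: "single"}
)

# Share of high earners within each group (mean of a boolean mask).
share_high = (men["salary"] == ">50K").groupby(status).mean()
print(share_high)
```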


**9. What is the maximum number of hours a person works per week (*hours-per-week* feature)? How many people work such a number of hours, and what is the percentage of those who earn a lot (>50K) among them?**

In [102]:
# Your code here
max_hours = data['hours-per-week'].max()
print('The maximum number of working hours per week: ', max_hours)
working_hard = data[data['hours-per-week']==max_hours]
print('People number: ', len(working_hard))
print('Proportion: ', len(working_hard[working_hard.salary=='>50K'])/len(working_hard))

The maximum number of working hours per week:  99
People number:  85
Proportion:  0.29411764705882354


**10. Count the average time of work (*hours-per-week*) for those who earn a little and a lot (*salary*) for each country (*native-country*). What will these be for Japan?**

In [106]:
# Your code here
country_list = np.unique(data['native-country'])
d = {}
for nation in country_list:
    df = data[data['native-country']==nation]
    ave_low = df[df.salary=='<=50K']
    ave_low = ave_low['hours-per-week'].mean()
    ave_high = df[df.salary=='>50K']
    ave_high = ave_high['hours-per-week'].mean()
    d[nation+'_low'] = ave_low
    d[nation+'_high'] = ave_high
    
print('Japan low: ', d['Japan_low'])
print('Japan high: ', d['Japan_high'])

Japan low:  41.0
Japan high:  47.958333333333336
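
The loop above can be replaced by a two-key `groupby`, with `unstack()` turning the salary level into columns so each country becomes one row. This is a sketch on a hypothetical `toy` frame, not the real data; combinations with no rows show up as `NaN`.

```python
import pandas as pd

# Toy data; rows are invented for illustration only.
toy = pd.DataFrame({
    "native-country": ["Japan", "Japan", "Japan", "Cuba"],
    "salary": ["<=50K", ">50K", ">50K", "<=50K"],
    "hours-per-week": [40, 50, 46, 35],
})

# Mean hours per (country, salary) pair, reshaped to one row per country.
avg_hours = (
    toy.groupby(["native-country", "salary"])["hours-per-week"]
    .mean()
    .unstack()
)

print(avg_hours.loc["Japan"])
```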
