# CMPSC 448: Homework #1
# Exploratory Data Analysis with `pandas`

## Objectives

In this assignment, you are asked to analyze the UCI Adult data set, which contains demographic information about US residents. The data was extracted from the census bureau database found at

http://www.census.gov/ftp/pub/DES/www/welcome.html

The features of data with possible values of each feature are listed below:

| Feature Name| Possible Values  |
|------|------|
| age | continuous|
| workclass| Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked|
| fnlwgt| continuous|
| education | Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool|
|education-num | continuous|
|marital-status | Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse|
|occupation | Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces|
|relationship | Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried |
|race | White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black|
|sex | Female, Male|
|capital-gain| continuous|
|capital-loss | continuous|
|hours-per-week | continuous |
|native-country | United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands |
|salary | >50K,<=50K |


Please complete the tasks in the Jupyter notebook by answering the following 8 questions.

In [1]:
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', 100)
# to draw pictures in jupyter notebook
%matplotlib inline 
import matplotlib.pyplot as plt
import seaborn as sns
# we don't like warnings
# you can comment the following 2 lines if you'd like to
import warnings
warnings.filterwarnings('ignore')


In [2]:
data = pd.read_csv('adult.data.csv')
print("\n".join(data.columns))

age
 workclass
 fnlwgt
 education
 education-num
 marital-status
 occupation
 relationship
 race
 sex
 capital-gain
 capital-loss
 hours-per-week
 native-country
 salary
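Every column name after `age` in the listing above starts with a space, which is why the later cells index with `' sex'` and compare against `' Female'`. As a minimal sketch of one alternative, assuming the file separates fields with a comma followed by a space, `skipinitialspace=True` in `pd.read_csv` can drop that whitespace at load time. The CSV text below is a made-up two-row sample, not the real `adult.data.csv`.

```python
import io
import pandas as pd

# Made-up sample that mimics the ", "-separated layout of the real file.
raw = "age, sex, salary\n39, Male, <=50K\n28, Female, >50K\n"

# skipinitialspace=True skips the blanks that follow each delimiter,
# in the header row as well as in the data rows.
clean = pd.read_csv(io.StringIO(raw), skipinitialspace=True)

print(list(clean.columns))
print(clean.loc[clean["sex"] == "Female", "age"].mean())
```

The rest of this notebook keeps the space-prefixed names, so this option would only apply if every later lookup were updated to match.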


In [3]:
data.shape

(32561, 15)

In [4]:
data.head()

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary |
|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


### 1. How many men and women (sex feature) are represented in this dataset?

In [5]:
print(data[' sex'].value_counts())

 sex
Male      21790
Female    10771
Name: count, dtype: int64


### 2. What is the average age (age feature) of women?

In [6]:
print(data[data[' sex'] == ' Female']['age'].mean())

36.85823043357163


### 3. What is the percentage of German citizens (native-country feature)?


In [14]:
print((data[data[' native-country'] == ' Germany'].shape[0] / data.shape[0]) * 100)

0.42074874850281013
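The percentage above is a row count divided by the total number of rows. An equivalent formulation, sketched here on a made-up four-row frame (the real notebook would use `data` and the space-prefixed names), takes the mean of a boolean mask, since the mean of a boolean Series is the fraction of `True` values.

```python
import pandas as pd

# Made-up sample; the values are illustrative only.
sample = pd.DataFrame(
    {"native-country": ["Germany", "United-States", "Japan", "United-States"]}
)

# Mean of a boolean mask = share of rows where the condition holds.
share_germany = (sample["native-country"] == "Germany").mean() * 100
print(share_germany)
```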


###  4. What are the mean and standard deviation of age for those who earn more than 50K per year (salary feature) and those who earn less than 50K per year?

In [8]:
moreThan = data[data[' salary'] == ' >50K']['age']
lessThan = data[data[' salary'] == ' <=50K']['age']
print(moreThan.mean())
print(moreThan.std())
print(lessThan.mean())
print(lessThan.std())

44.24984058155847
10.519027719851772
36.78373786407767
14.020088490824813
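The four numbers above could also be produced in one labeled table. The sketch below groups by salary and aggregates age with `mean` and `std`; it runs on a made-up five-row frame with plain column names, so the figures it prints are not the homework answers.

```python
import pandas as pd

# Made-up sample; column names here have no leading space.
sample = pd.DataFrame(
    {
        "age": [30, 40, 50, 20, 60],
        "salary": ["<=50K", "<=50K", ">50K", "<=50K", ">50K"],
    }
)

# One row per salary group, one column per statistic.
age_by_salary = sample.groupby("salary")["age"].agg(["mean", "std"])
print(age_by_salary)
```

Labeling the rows by salary group avoids having to remember which printed line belongs to which group.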


### 5. Is it true that people who earn more than 50K have at least high school education? (education – Bachelors, Prof-school, Assoc-acdm, Assoc-voc, Masters or Doctorate feature)

In [13]:
print(data[data[' salary'] == ' >50K'][' education'].value_counts())
## Not true. Some have less than a high school education

 education
Bachelors       2221
HS-grad         1675
Some-college    1387
Masters          959
Prof-school      423
Assoc-voc        361
Doctorate        306
Assoc-acdm       265
10th              62
11th              60
7th-8th           40
12th              33
9th               27
5th-6th           16
1st-4th            6
Name: count, dtype: int64
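The conclusion above is read off the table by eye. A programmatic check might look like the sketch below, which counts high earners whose education label falls in a "below high school" set. Which labels belong in that set is an assumption made here for illustration, and the frame is made-up sample data.

```python
import pandas as pd

# Made-up sample; the real notebook would use `data`.
sample = pd.DataFrame(
    {
        "education": ["Bachelors", "10th", "HS-grad", "Masters", "9th"],
        "salary": [">50K", ">50K", ">50K", "<=50K", "<=50K"],
    }
)

# Assumption: these labels count as less than a completed high school education.
below_hs = {"Preschool", "1st-4th", "5th-6th", "7th-8th", "9th", "10th", "11th", "12th"}

high_earners = sample[sample["salary"] == ">50K"]
n_below = high_earners["education"].isin(below_hs).sum()

# Any count above zero means the claim is false for this sample.
print(n_below)
```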


### 6.  Display age statistics for each race (race feature) and each gender (sex feature). 

Hint: Use `groupby()` and `describe()` functions of DataFrame. Find the maximum age of men of Amer-Indian-Eskimo race.

In [10]:
print(data.groupby(' race')['age'].describe())
print(data.groupby(' sex')['age'].describe())
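# Max age for the Amer-Indian-Eskimo group = 82.0 (read from the race-only table, so both sexes combined; isolating men requires grouping by race and sex together)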

                      count       mean        std   min   25%   50%   75%  \
 race                                                                       
Amer-Indian-Eskimo    311.0  37.173633  12.447130  17.0  28.0  35.0  45.5   
Asian-Pac-Islander   1039.0  37.746872  12.825133  17.0  28.0  36.0  45.0   
Black                3124.0  37.767926  12.759290  17.0  28.0  36.0  46.0   
Other                 271.0  33.457565  11.538865  17.0  25.0  31.0  41.0   
White               27816.0  38.769881  13.782306  17.0  28.0  37.0  48.0   

                     max  
 race                     
Amer-Indian-Eskimo  82.0  
Asian-Pac-Islander  90.0  
Black               90.0  
Other               77.0  
White               90.0  
          count       mean        std   min   25%   50%   75%   max
 sex                                                               
Female  10771.0  36.858230  14.013697  17.0  25.0  35.0  46.0  90.0
Male    21790.0  39.433547  13.370630  17.0  29.0  38.0  48.0  90.0
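The two tables above describe race and sex separately, so neither one isolates men of a given race. A sketch of grouping on both keys at once follows, using a made-up frame in which the oldest Amer-Indian-Eskimo respondent is deliberately female to show that the per-race maximum and the per-race-and-sex maximum can differ.

```python
import pandas as pd

# Made-up sample; ages are illustrative only.
sample = pd.DataFrame(
    {
        "race": ["Amer-Indian-Eskimo"] * 3 + ["White"] * 2,
        "sex": ["Male", "Female", "Male", "Male", "Female"],
        "age": [45, 82, 60, 30, 70],
    }
)

# Grouping on a list of columns yields one row per (race, sex) pair.
stats = sample.groupby(["race", "sex"])["age"].describe()

# Select a single cell by its (race, sex) index tuple and the statistic name.
max_age = stats.loc[("Amer-Indian-Eskimo", "Male"), "max"]
print(max_age)
```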

### 7. What is the maximum number of hours a person works per week (hours-per-week feature)? How many people work such a number of hours, and what is the percentage of those who earn a lot (>50K) among them?


In [11]:
maximum = data[' hours-per-week'].max()
numPpl = data[data[' hours-per-week'] == maximum].shape[0]
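# Share of high earners among those working the maximum hours: divide by the group size (numPpl), not by the hour count
fiftyOrGreater = (data[(data[' hours-per-week'] == maximum) & (data[' salary'] == ' >50K')].shape[0] / numPpl) * 100
print(maximum)
print(numPpl)
print(round(fiftyOrGreater, 2))

99
85
29.41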
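The share asked for in this question is a fraction of the people who work the maximum number of hours, so its denominator is the size of that group. A minimal sketch on made-up data (four people at the maximum, one of them a high earner) filters once and then takes the mean of a boolean mask:

```python
import pandas as pd

# Made-up sample; the real notebook would use `data`.
sample = pd.DataFrame(
    {
        "hours-per-week": [99, 99, 99, 99, 40, 40],
        "salary": [">50K", "<=50K", "<=50K", "<=50K", ">50K", "<=50K"],
    }
)

max_hours = sample["hours-per-week"].max()
at_max = sample[sample["hours-per-week"] == max_hours]

# Fraction of the max-hours group that earns more than 50K, as a percentage.
share_rich = (at_max["salary"] == ">50K").mean() * 100
print(max_hours, len(at_max), share_rich)
```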


### 8. Count the average time of work (hours-per-week) for those who earn a little and a lot (salary) for each country (native-country). What will these be for Japan?

In [12]:
timeLittle = data[data[' salary'] == ' <=50K'][' hours-per-week'].mean()
timeLot = data[data[' salary'] == ' >50K'][' hours-per-week'].mean()
timeLittleJapan = data[(data[' salary'] == ' <=50K') & (data[' native-country'] == ' Japan')][' hours-per-week'].mean()
timeLotJapan = data[(data[' salary'] == ' >50K') & (data[' native-country'] == ' Japan')][' hours-per-week'].mean()
print(timeLittle)
print(timeLot)
print(timeLittleJapan)
print(timeLotJapan)

38.840210355987054
45.473026399693914
41.0
47.958333333333336
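The cell above reports the overall averages and the averages for Japan, while the question asks for every country. One way to get the full breakdown, sketched on a made-up frame with plain column names, is to group by country and salary together and pivot the salary level into columns with `unstack()`.

```python
import pandas as pd

# Made-up sample; hours are illustrative only.
sample = pd.DataFrame(
    {
        "native-country": ["Japan", "Japan", "Japan", "Cuba", "Cuba"],
        "salary": ["<=50K", ">50K", ">50K", "<=50K", "<=50K"],
        "hours-per-week": [40, 50, 46, 35, 45],
    }
)

# Rows: countries. Columns: salary groups. Cells: mean weekly hours.
hours = (
    sample.groupby(["native-country", "salary"])["hours-per-week"]
    .mean()
    .unstack()
)

print(hours)
print(hours.loc["Japan"])
```

A country with no rows in one salary group shows `NaN` in that cell, as Cuba does for `>50K` in this sample.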
