# CMPSC 448: Homework #1
# Exploratory Data Analysis with `pandas`

## Objectives

In this assignment, you are asked to analyze the UCI Adult data set, which contains demographic information about US residents. The data was extracted from the Census Bureau database found at

http://www.census.gov/ftp/pub/DES/www/welcome.html

The features of the data set and the possible values of each feature are listed below:

| Feature Name | Possible Values |
|------|------|
| age | continuous |
| workclass | Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked |
| fnlwgt | continuous |
| education | Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool |
| education-num | continuous |
| marital-status | Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse |
| occupation | Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces |
| relationship | Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried |
| race | White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black |
| sex | Female, Male |
| capital-gain | continuous |
| capital-loss | continuous |
| hours-per-week | continuous |
| native-country | United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands |
| salary | >50K, <=50K |


Please complete the tasks in the Jupyter notebook by answering the following 8 questions.

In [10]:
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', 100)
# render plots inline in the Jupyter notebook
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
# we don't like warnings
# you can comment the following 2 lines if you'd like to
import warnings
warnings.filterwarnings('ignore')


In [41]:
data = pd.read_csv('adult.data.csv')
# Remove the whitespace around the column names
data.columns = data.columns.str.strip()

# Cast the numeric columns to int
numeric_cols = ["age", "fnlwgt", "education-num", "capital-gain", "capital-loss", "hours-per-week"]
data[numeric_cols] = data[numeric_cols].astype(int)

print("\n".join(data.columns))

age
workclass
fnlwgt
education
education-num
marital-status
occupation
relationship
race
sex
capital-gain
capital-loss
hours-per-week
native-country
salary
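
The answers below call `.str.strip()` on string columns each time they are compared. As a minimal sketch of one alternative, assuming the padding in the file is only leading spaces after each comma, `read_csv` can drop it at load time with `skipinitialspace=True`. The inline CSV text here is a hypothetical two-row stand-in for `adult.data.csv`, not real rows from the file.

```python
import io

import pandas as pd

# Hypothetical stand-in for the raw file: every field after a comma is space-padded.
raw = "age, sex, salary\n39, Male, <=50K\n28, Female, >50K\n"

# skipinitialspace=True drops the spaces that follow each delimiter,
# in the header row as well as in the data rows.
toy = pd.read_csv(io.StringIO(raw), skipinitialspace=True)

print(list(toy.columns))
print(toy["sex"].tolist())
```

With the values cleaned once at load time, later filters can compare against `"Female"` or `">50K"` directly.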


In [42]:
data.shape

(32561, 15)

In [43]:
data.head()

|   | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary |
|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


### 1. How many men and women (sex feature) are represented in this dataset?

In [44]:
# Your answer (code + results)
gender_counts = data['sex'].value_counts()
print(gender_counts)

sex
Male      21790
Female    10771
Name: count, dtype: int64


### 2. What is the average age (age feature) of women?

In [54]:
# Your answer (code + results)
avg_woman_age = data[data["sex"].str.strip() == "Female"]["age"].mean()
print("average age: ",avg_woman_age)

average age:  36.85823043357163


### 3. What is the percentage of German citizens (native-country feature)?


In [55]:
# Your answer (code + results)
count_german = data[data["native-country"].str.strip() == "Germany"].shape[0]
total_count = data.shape[0]
print((count_german/total_count)*100, "% are Germans")

0.42074874850281013 % are Germans
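
The same percentage can also be read from normalized value counts, which gives the share of every country in one call. This is a minimal sketch on a hypothetical four-row Series standing in for the `native-country` column, assuming the values have already been stripped of whitespace.

```python
import pandas as pd

# Hypothetical stand-in for data["native-country"] (already stripped).
countries = pd.Series(
    ["United-States", "Germany", "United-States", "Mexico"],
    name="native-country",
)

# normalize=True returns fractions that sum to 1; scale them to percentages.
share = countries.value_counts(normalize=True) * 100

# .get() falls back to 0.0 if the label never occurs in the column.
german_pct = share.get("Germany", 0.0)
print(f"{german_pct:.2f}% are Germans")
```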


### 4. What are the mean and standard deviation of age for those who earn more than 50K per year (salary feature) and those who earn 50K or less per year?

In [61]:
# Your answer (code + results)
greater_than = data[data["salary"].str.strip() == ">50K"]["age"]
less_than = data[data["salary"].str.strip() == "<=50K"]["age"]
printable = f"Earn more than 50K] Mean: {greater_than.mean()} Standard Deviation: {greater_than.std()}\nEarn less than 50K] Mean: {less_than.mean()} Standard Deviation: {less_than.std()}"

print(printable)

Earn more than 50K] Mean: 44.24984058155847 Standard Deviation: 10.519027719851826
Earn less than 50K] Mean: 36.78373786407767 Standard Deviation: 14.02008849082488
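
A grouped aggregation computes both statistics for both salary groups in a single expression. The sketch below uses a hypothetical four-row frame in place of `data` and assumes the `salary` labels carry no surrounding whitespace.

```python
import pandas as pd

# Hypothetical stand-in for the salary and age columns of the data set.
toy = pd.DataFrame({
    "salary": [">50K", ">50K", "<=50K", "<=50K"],
    "age": [40, 50, 20, 30],
})

# One row per salary group, one column per statistic.
# "std" is the sample standard deviation (ddof=1), the same default as Series.std().
age_by_salary = toy.groupby("salary")["age"].agg(["mean", "std"])
print(age_by_salary)
```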


### 5. Is it true that people who earn more than 50K have at least high school education? (education – Bachelors, Prof-school, Assoc-acdm, Assoc-voc, Masters or Doctorate feature)

In [75]:
# Your answer (code + results)
# "at least a high school education" means HS-grad or any higher level
education = ['Bachelors', 'HS-grad', 'Masters', 'Some-college', 'Assoc-acdm', 'Assoc-voc', 'Doctorate', 'Prof-school']
at_least_hs = data[(data["salary"].str.strip() == ">50K")]["education"].str.strip().isin(education)

print(at_least_hs.value_counts())
print("\nThe counts above show that 244 people who make over 50K have not completed high school, meaning that the statement is false")


education
True     7597
False     244
Name: count, dtype: int64

The counts above show that 244 people who make over 50K have not completed high school, meaning that the statement is false
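
A check that avoids hand-typing the list of education labels could use the numeric `education-num` column instead. This is only a sketch: it assumes that `education-num` is ordered by attainment and that 9 corresponds to `HS-grad`, which matches the rows shown by `data.head()` (HS-grad is 9, 11th is 7, Bachelors is 13). The frame and the constant `HS_GRAD_LEVEL` are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in rows; education-num values follow the pattern seen in data.head().
toy = pd.DataFrame({
    "education": ["Bachelors", "HS-grad", "11th", "Bachelors"],
    "education-num": [13, 9, 7, 13],
    "salary": [">50K", ">50K", ">50K", "<=50K"],
})

HS_GRAD_LEVEL = 9  # assumption: education-num of HS-grad

high_earners = toy[toy["salary"] == ">50K"]
has_hs = high_earners["education-num"] >= HS_GRAD_LEVEL

# The statement holds only if every high earner reaches the threshold.
all_have_hs = bool(has_hs.all())
num_without_hs = int((~has_hs).sum())
print(all_have_hs, num_without_hs)
```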


### 6. Display age statistics for each race (race feature) and each gender (sex feature).

Hint: Use `groupby()` and `describe()` functions of DataFrame. Find the maximum age of men of Amer-Indian-Eskimo race.

In [78]:
# Your answer (code + results)
age_stats = data.groupby(['race', 'sex'])['age'].describe()
print(age_stats)

print("\nFrom the table above, we can see that the max age for Amer-Indian-Eskimo men is 82 years old")

                             count       mean        std   min   25%   50%  \
race               sex                                                       
Amer-Indian-Eskimo Female    119.0  37.117647  13.114991  17.0  27.0  36.0   
                   Male      192.0  37.208333  12.049563  17.0  28.0  35.0   
Asian-Pac-Islander Female    346.0  35.089595  12.300845  17.0  25.0  33.0   
                   Male      693.0  39.073593  12.883944  18.0  29.0  37.0   
Black              Female   1555.0  37.854019  12.637197  17.0  28.0  37.0   
                   Male     1569.0  37.682600  12.882612  17.0  27.0  36.0   
Other              Female    109.0  31.678899  11.631599  17.0  23.0  29.0   
                   Male      162.0  34.654321  11.355531  17.0  26.0  32.0   
White              Female   8642.0  36.811618  14.329093  17.0  25.0  35.0   
                   Male    19174.0  39.652498  13.436029  17.0  29.0  38.0   

                             75%   max  
race               sex
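
Because a wide `describe()` table wraps when printed, it can be easier to pull the single value of interest out by its index labels. A minimal sketch on a hypothetical five-row frame follows; with the real data, the same `.loc` lookup would apply to `age_stats`, assuming the `race` and `sex` values contain no stray whitespace.

```python
import pandas as pd

# Hypothetical stand-in for the race, sex and age columns.
toy = pd.DataFrame({
    "race": ["Amer-Indian-Eskimo"] * 3 + ["White"] * 2,
    "sex": ["Male", "Male", "Female", "Male", "Female"],
    "age": [30, 45, 50, 60, 25],
})

stats = toy.groupby(["race", "sex"])["age"].describe()

# to_string() renders the whole table without wrapping the columns.
print(stats.to_string())

# Rows are indexed by (race, sex) tuples; columns are the statistic names.
max_age = stats.loc[("Amer-Indian-Eskimo", "Male"), "max"]
print(max_age)
```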

### 7. What is the maximum number of hours a person works per week (hours-per-week feature)? How many people work such a number of hours, and what is the percentage of those who earn a lot (>50K) among them?


In [88]:
# Your answer (code + results)
max_hours = data['hours-per-week'].max()
print(f"The maximum number of hours a person works a week is {max_hours}")

max_workers = data[data['hours-per-week'] == max_hours]
num_max_workers = max_workers.shape[0]
print(f"There are {num_max_workers} workers that work {max_hours} hours a week")

num_greater = max_workers[max_workers["salary"].str.strip() == ">50K"].shape[0]
print(f"{(num_greater/num_max_workers)*100}% of workers that work {max_hours} hours a week make more than $50K a year")

The maximum number of hours a person works a week is 99
There are 85 workers that work 99 hours a week
29.411764705882355% of workers that work 99 hours a week make more than $50K a year
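
The share of high earners can also be computed as the mean of a boolean mask, since `True` counts as 1 and `False` as 0. The sketch below runs on a hypothetical five-row frame and assumes the `salary` labels are already stripped.

```python
import pandas as pd

# Hypothetical stand-in for the hours-per-week and salary columns.
toy = pd.DataFrame({
    "hours-per-week": [99, 40, 99, 99, 60],
    "salary": [">50K", "<=50K", "<=50K", "<=50K", ">50K"],
})

max_hours = toy["hours-per-week"].max()
top = toy[toy["hours-per-week"] == max_hours]

# Mean of the boolean mask is the fraction of True values.
rich_pct = (top["salary"] == ">50K").mean() * 100
print(f"{len(top)} people work {max_hours} hours; {rich_pct:.1f}% of them earn >50K")
```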


### 8. Compute the average time of work (hours-per-week) for those who earn a little and a lot (salary) for each country (native-country). What are these values for Japan?

In [95]:
# Your answer (code + results)
data["native-country"] = data['native-country'].str.strip()  # remove spaces around country names
average_hours = data.groupby(["native-country","salary"])["hours-per-week"].mean()

print(average_hours.loc['Japan'])

salary
<=50K    41.000000
>50K     47.958333
Name: hours-per-week, dtype: float64
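
To compare all countries at once, the grouped means can be pivoted so that each salary level becomes a column. This sketch uses a hypothetical five-row frame; a country with no rows in one salary group shows `NaN` in that column.

```python
import pandas as pd

# Hypothetical stand-in for the native-country, salary and hours-per-week columns.
toy = pd.DataFrame({
    "native-country": ["Japan", "Japan", "Japan", "Cuba", "Cuba"],
    "salary": ["<=50K", ">50K", ">50K", "<=50K", "<=50K"],
    "hours-per-week": [40, 50, 46, 35, 45],
})

# unstack() moves the inner index level (salary) into the columns.
table = (
    toy.groupby(["native-country", "salary"])["hours-per-week"]
    .mean()
    .unstack()
)
print(table)

japan = table.loc["Japan"]
```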
