# CMPSC 448: Homework #1
# Spring 2022
# Exploratory Data Analysis with `pandas`

## Objectives

In this assignment, you will analyze the UCI Adult data set, which contains demographic information about US residents. The data was extracted from the census bureau database found at

http://www.census.gov/ftp/pub/DES/www/welcome.html

The features and their possible values are listed below:

| Feature Name | Possible Values |
|------|------|
| age | continuous |
| workclass | Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked |
| fnlwgt | continuous |
| education | Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool |
| education-num | continuous |
| marital-status | Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse |
| occupation | Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces |
| relationship | Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried |
| race | White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black |
| sex | Female, Male |
| capital-gain | continuous |
| capital-loss | continuous |
| hours-per-week | continuous |
| native-country | United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands |
| salary | >50K, <=50K |


Please complete the tasks in the Jupyter notebook by answering the following 8 questions.

In [3]:
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', 100)
# to draw pictures in jupyter notebook
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
# we don't like warnings
# you can comment the following 2 lines if you'd like to
import warnings
warnings.filterwarnings('ignore')


In [4]:
data = pd.read_csv('adult.data.csv')
print("\n".join(data.columns))

age
 workclass
 fnlwgt
 education
 education-num
 marital-status
 occupation
 relationship
 race
 sex
 capital-gain
 capital-loss
 hours-per-week
 native-country
 salary
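Every printed column name except `age` starts with a space, which is why the cells below index with keys such as `' sex'`. The following is a minimal sketch of one way to avoid that, assuming a small in-memory sample (made-up rows, hypothetical name `sample_csv`) in place of `adult.data.csv`. The rest of this notebook keeps the original space-prefixed names, so this is an alternative and not what the cells below rely on.

```python
import io

import pandas as pd

# Hypothetical sample that mimics the ", " separators in adult.data.csv
sample_csv = (
    "age, workclass, sex, salary\n"
    "39, State-gov, Male, <=50K\n"
    "28, Private, Female, <=50K\n"
)

# skipinitialspace=True drops the blank that follows each comma,
# in the header row as well as in the values
clean = pd.read_csv(io.StringIO(sample_csv), skipinitialspace=True)
print(list(clean.columns))
print(clean["sex"].tolist())
```

With names cleaned this way, a lookup would read `clean["sex"]` instead of `data[' sex']`.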


In [5]:
data.shape

(32561, 15)

In [6]:
data.head()

|  | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary |
|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


### 1. How many men and women (sex feature) are represented in this dataset?

In [7]:
data[' sex'].value_counts()
# 21790 Men and 10771 Women

 Male      21790
 Female    10771
Name:  sex, dtype: int64

### 2. What is the average age (age feature) of women?

In [9]:
data[[" sex", "age"]].groupby(" sex").mean()
# 36.858230

| sex | age |
|------|------|
| Female | 36.85823 |
| Male | 39.433547 |


### 3. What is the percentage of German citizens (native-country feature)?


In [10]:
data[" native-country"].value_counts()
# 137 / 32561 = 0.00420749, i.e. about 0.42%

 United-States                 29170
 Mexico                          643
 ?                               583
 Philippines                     198
 Germany                         137
 Canada                          121
 Puerto-Rico                     114
 El-Salvador                     106
 India                           100
 Cuba                             95
 England                          90
 Jamaica                          81
 South                            80
 China                            75
 Italy                            73
 Dominican-Republic               70
 Vietnam                          67
 Guatemala                        64
 Japan                            62
 Poland                           60
 Columbia                         59
 Taiwan                           51
 Haiti                            44
 Iran                             43
 Portugal                         37
 Nicaragua                        34
 Peru                             31
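The share can also be read directly from `value_counts(normalize=True)`, which returns fractions of the total instead of raw counts. A minimal sketch, assuming a made-up toy column (`countries`) in place of `data[" native-country"]`; the last line repeats the arithmetic with the counts reported above.

```python
import pandas as pd

# Toy column standing in for data[" native-country"] (values are made up)
countries = pd.Series([" United-States"] * 7 + [" Germany"] * 2 + [" Mexico"])

# normalize=True returns each value's fraction of all rows
shares = countries.value_counts(normalize=True)
germany_pct = 100 * shares[" Germany"]
print(f"{germany_pct:.2f}%")

# The same arithmetic with the counts from the notebook output
print(f"{100 * 137 / 32561:.2f}%")  # 0.42%
```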
 

###  4. What are the mean and standard deviation of age for those who earn more than 50K per year (salary feature) and those who earn less than 50K per year?

In [12]:
data.groupby(" salary")["age"].mean()
# rich and poor means (44.249841, 36.783738)
data.groupby(" salary")["age"].std()
# rich and poor std (10.519028, 14.020088)

 salary
 <=50K    14.020088
 >50K     10.519028
Name: age, dtype: float64
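A notebook cell displays only its last expression, so the cell above shows the standard deviations but not the means. One way to get both in a single table is `agg()` with a list of statistics. A minimal sketch, assuming a made-up toy frame (`toy`) that reuses the notebook's space-prefixed `" salary"` column name:

```python
import pandas as pd

# Toy frame with the notebook's column names (values are made up)
toy = pd.DataFrame({
    " salary": [" >50K", " >50K", " <=50K", " <=50K"],
    "age": [40, 50, 20, 30],
})

# agg() with a list returns one column per statistic for each group
age_stats = toy.groupby(" salary")["age"].agg(["mean", "std"])
print(age_stats)
```

On the real data the equivalent call would be `data.groupby(" salary")["age"].agg(["mean", "std"])`.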

### 5. Is it true that people who earn more than 50K have at least high school education? (education – Bachelors, Prof-school, Assoc-acdm, Assoc-voc, Masters or Doctorate feature)

In [28]:
# education levels that count as at least a high school diploma
atLeastHS = [' Bachelors', ' Prof-school', ' Assoc-acdm', ' Assoc-voc', ' Masters', ' Doctorate', ' HS-grad', ' Some-college']

# select the high earners by column name instead of looping over row positions
highEarners = data[data[' salary'] == ' >50K']
total = len(highEarners)
educated = int(highEarners[' education'].isin(atLeastHS).sum())
print("People making >50K: ", total)
print("People with at least hs: ", educated)

# People making >50K:  7841
# People with at least hs:  7597
# Not true for everyone: 7597 of the 7841 people making >50K have at least HS

People making >50K:  7841
People with at least hs:  7597
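Since 7597 is smaller than 7841, the statement does not hold for every high earner. The yes/no question can also be answered directly with a boolean Series. A minimal sketch, assuming a made-up toy frame (`toy`) and the same list of qualifying education levels used in the cell above:

```python
import pandas as pd

# Toy frame with the notebook's column names (values are made up)
toy = pd.DataFrame({
    " salary": [" >50K", " >50K", " >50K", " <=50K"],
    " education": [" Masters", " HS-grad", " 9th", " 11th"],
})
at_least_hs = [" Bachelors", " Prof-school", " Assoc-acdm", " Assoc-voc",
               " Masters", " Doctorate", " HS-grad", " Some-college"]

high_earners = toy[toy[" salary"] == " >50K"]
has_hs = high_earners[" education"].isin(at_least_hs)

# all() is True only if every high earner passes the check
print(bool(has_hs.all()))
# mean() of a boolean Series is the fraction of True values
print(has_hs.mean())
```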


### 6. Display age statistics for each race (race feature) and each gender (sex feature).

Hint: Use `groupby()` and `describe()` functions of DataFrame. Find the maximum age of men of Amer-Indian-Eskimo race.

In [39]:
data[[" race", "age"]].groupby(" race").describe()
data[[" sex", "age"]].groupby(" sex").describe()

#count, mean, std, min, 25%, 50%, 75%, max
# Amer-Indian-Eskimo	311.0	37.173633	12.447130	17.0	28.0	35.0	45.5	82.0
# Asian-Pac-Islander	1039.0	37.746872	12.825133	17.0	28.0	36.0	45.0	90.0
# Black	3124.0	37.767926	12.759290	17.0	28.0	36.0	46.0	90.0
# Other	271.0	33.457565	11.538865	17.0	25.0	31.0	41.0	77.0
# White	27816.0	38.769881	13.782306	17.0	28.0	37.0	48.0	90.0

# Female	10771.0	36.858230	14.013697	17.0	25.0	35.0	46.0	90.0
# Male	21790.0	39.433547	13.370630	17.0	29.0	38.0	48.0	90.0

# max age in the Amer-Indian-Eskimo group (both sexes combined) is 82 years old

| sex | age count | age mean | age std | age min | age 25% | age 50% | age 75% | age max |
|------|------|------|------|------|------|------|------|------|
| Female | 10771.0 | 36.85823 | 14.013697 | 17.0 | 25.0 | 35.0 | 46.0 | 90.0 |
| Male | 21790.0 | 39.433547 | 13.37063 | 17.0 | 29.0 | 38.0 | 48.0 | 90.0 |
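The hint asks about men of one race, which needs both keys at once; the cell above groups by race and by sex separately. A minimal sketch of grouping by both columns, assuming a made-up toy frame (`toy`) with the notebook's column names:

```python
import pandas as pd

# Toy frame with the notebook's column names (values are made up)
toy = pd.DataFrame({
    " race": [" Amer-Indian-Eskimo", " Amer-Indian-Eskimo",
              " Amer-Indian-Eskimo", " White", " White"],
    " sex": [" Male", " Male", " Female", " Male", " Female"],
    "age": [30, 45, 50, 60, 25],
})

# One row of statistics per (race, sex) pair
by_race_sex = toy.groupby([" race", " sex"])["age"].describe()
print(by_race_sex)

# Look up a single statistic for one group via the MultiIndex
max_age = by_race_sex.loc[(" Amer-Indian-Eskimo", " Male"), "max"]
print(max_age)
```

On the real data the same lookup would start from `data.groupby([" race", " sex"])["age"].describe()`.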


### 7. What is the maximum number of hours a person works per week (hours-per-week feature)? How many people work such a number of hours, and what is the percentage of those who earn a lot (>50K) among them?


In [47]:
# data[[" hours-per-week", "age"]].groupby(" hours-per-week").describe()
# data[[" salary", " hours-per-week"]].groupby(" salary").describe()

# longest working week in the data set, looked up instead of hard-coded
max_hours = data[" hours-per-week"].max()
# everyone who works that many hours, selected by column name
busiest = data[data[" hours-per-week"] == max_hours]
busy = len(busiest)
rich_and_busy = int((busiest[" salary"] == " >50K").sum())

print(f"People working {max_hours}: ", busy)
print("and are making >50K: ", rich_and_busy)
print(f"Percent of {max_hours} hour workers that make >50K: ", round(100 * rich_and_busy / busy, 2))

# max number of hours is 99
# 85 people work 99 hours
# 29.41 percent of 99-hour workers are making >50K

People working 99:  85
and are making >50K:  25
Percent of 99 hour workers that make >50K:  29.41


### 8. Count the average time of work (hours-per-week) for those who earn a little and a lot (salary) for each country (native-country). What will these be for Japan?

In [1]:
# mean hours-per-week for every (native-country, salary) pair;
# selecting the column first keeps non-numeric columns out of mean()
hours = data.groupby([' native-country', ' salary'])[' hours-per-week'].mean()
# rows for Japan only
hours.loc[' Japan']
# <=50K: 41 hours
# >50K: 47.958333 hours
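To compare the two salary groups side by side for each country, the grouped means can be pivoted with `unstack()`. A minimal sketch, assuming a made-up toy frame (`toy`) with the notebook's column names; this cell must run after the one that loads `data` if it is adapted to the real data set.

```python
import pandas as pd

# Toy frame with the notebook's column names (values are made up)
toy = pd.DataFrame({
    " native-country": [" Japan", " Japan", " Japan", " Cuba"],
    " salary": [" <=50K", " >50K", " >50K", " <=50K"],
    " hours-per-week": [40, 50, 46, 38],
})

# Mean hours per (country, salary) pair, with salary levels as columns
mean_hours = (
    toy.groupby([" native-country", " salary"])[" hours-per-week"]
    .mean()
    .unstack()
)
print(mean_hours.loc[" Japan"])
```

Countries with no rows in one salary group show `NaN` in that column.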