# UCI Adult Sociodemographic

In [1]:
import pandas as pd
import numpy as np

In [2]:
adults = pd.read_csv(
    '00.01_adult.csv', 
    names=[
        'age', 
        'workclass', 
        'fnlwgt', 
        'education', 
        'education_num', 
        'marital_status', 
        'occupation', 
        'relationship', 
        'race', 
        'sex', 
        'capital_gain', 
        'capital_loss', 
        'hours_per_week', 
        'native_country', 
        'salary'
    ],
    skipinitialspace=True
)

- age: continuous.
- workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
- fnlwgt: continuous.
- education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
- education-num: continuous.
- marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
- occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
- relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
- race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
- sex: Female, Male.
- capital-gain: continuous.
- capital-loss: continuous.
- hours-per-week: continuous.
- native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
- salary: >50K, <=50K.
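The raw file marks unknown values with `?` (it shows up as a `native_country` value in the grouped output further down). A minimal sketch of how these could be read as missing values instead, assuming the same comma-plus-space layout as the raw file; the two rows and the trimmed column list below are made up for illustration:

```python
import io
import pandas as pd

# Two made-up rows in the same "comma + space" layout as the raw file.
raw = io.StringIO(
    "39, State-gov, United-States, <=50K\n"
    "28, ?, ?, >50K\n"
)

sample = pd.read_csv(
    raw,
    names=['age', 'workclass', 'native_country', 'salary'],  # trimmed column list
    skipinitialspace=True,  # drop the space that follows each comma
    na_values='?',          # read '?' as NaN instead of as a category
)

# Number of missing entries per column.
missing_per_column = sample.isna().sum()
print(missing_per_column)
```

Whether to treat `?` as missing or keep it as its own category depends on the question being asked; the rest of this notebook keeps it as a regular value.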

In [3]:
adults.head()

,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,salary
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


## How many men and women are represented in this dataset?

We can split our dataset by sex, then use the `size()` method from DataFrameGroupBy to count the number of rows in each group.

In [4]:
adults.groupby('sex').size().reset_index(name='count')

,sex,count
0,Female,10771
1,Male,21790


Another, easier way is to use the `value_counts()` method on the sex column.

In [5]:
adults.sex.value_counts()

Male      21790
Female    10771
Name: sex, dtype: int64
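To see that the two approaches agree, here is a small sketch on a made-up four-row frame (the `toy` data is hypothetical, not taken from the dataset). It also shows `normalize=True`, which returns shares instead of counts:

```python
import pandas as pd

# Hypothetical toy data: three men and one woman.
toy = pd.DataFrame({'sex': ['Male', 'Female', 'Male', 'Male']})

by_group = toy.groupby('sex').size()   # index sorted by group key
by_value = toy['sex'].value_counts()   # sorted by count, largest first

# Same counts either way; only the row order differs.
print(by_group.to_dict())
print(by_value.to_dict())

# Shares rather than counts.
shares = toy['sex'].value_counts(normalize=True)
print(shares.to_dict())
```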

Let's store these values for future use.

In [6]:
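# Look the counts up by label; positions [0] and [1] depend on the sort order.
sex_counts = adults.sex.value_counts()
male_adults_count = sex_counts['Male']
female_adults_count = sex_counts['Female']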

## What is the average age of women?

It is easier to work with each sex separately, so let's first create boolean masks based on sex.

In [7]:
female_mask = adults.sex == 'Female'
male_mask = adults.sex == 'Male'

We can then pass a mask inside the bracket operator to keep only the subset of the population we want, in this case female adults.

In [8]:
adults[female_mask].age.apply('mean')

36.85823043357163

The average age of women in this dataset is 36.86 y.o.
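As a quick illustration of what a mask is, the sketch below builds one on a made-up four-row frame (hypothetical data) and uses it with `.loc` to select rows and a column in one step:

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    'sex': ['Female', 'Male', 'Female', 'Male'],
    'age': [30, 40, 50, 20],
})

# A mask is a boolean Series with one True/False entry per row.
mask = toy['sex'] == 'Female'
print(mask.tolist())

# Keep the rows where the mask is True, take the age column, then average.
female_mean_age = toy.loc[mask, 'age'].mean()
print(female_mean_age)
```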

## What is the percentage of German citizens?

Similarly, let's subset the population to select only adults whose native country is Germany.

In [9]:
german_mask = adults.native_country == 'Germany'
german_adults = adults[german_mask]

To get the percentage, we divide the number of rows in `german_adults` by the number of rows in `adults`, then multiply by 100.

In [10]:
len(german_adults) / len(adults) * 100

0.42074874850281013

About 0.42% of the adults in the dataset have Germany as their native country.
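A shortcut worth knowing: the mean of a boolean mask is the fraction of `True` entries, so the percentage can be computed without building the subset first. A minimal sketch on made-up data (the `toy` frame is hypothetical):

```python
import pandas as pd

# Hypothetical toy data: 2 of 5 rows are from Germany.
toy = pd.DataFrame(
    {'native_country': ['Germany', 'Japan', 'Germany', 'Cuba', 'Japan']}
)

german_mask = toy['native_country'] == 'Germany'

# True counts as 1 and False as 0, so the mean is the share of True rows.
german_share = german_mask.mean() * 100

# Same result via row counts, as in the cell above.
by_length = len(toy[german_mask]) / len(toy) * 100

print(german_share, by_length)
```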

## What are the mean and standard deviation of age for those who earn more than 50k per year and those who earn less than 50k per year?

In [11]:
more_50K_mask = adults.salary == '>50K'
less_50K_mask = adults.salary == '<=50K'

We can pass a list of function names, `['mean', 'std']`, to the `apply()` method to get the mean and standard deviation in one call.

In [12]:
adults[more_50K_mask].age.apply(['mean', 'std'])

mean    44.249841
std     10.519028
Name: age, dtype: float64

In [13]:
adults[less_50K_mask].age.apply(['mean', 'std'])

mean    36.783738
std     14.020088
Name: age, dtype: float64

The average age of adults earning more than 50K is 44.25 $\pm$ 10.52 years, compared with 36.78 $\pm$ 14.02 years for those earning 50K or less.
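The two masked computations can also be done in a single pass by grouping on salary. This is a sketch on made-up data (the `toy` frame is hypothetical), using `agg()` with a list of function names:

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    'salary': ['>50K', '<=50K', '>50K', '<=50K'],
    'age': [50, 30, 40, 20],
})

# One row per salary group, one column per statistic.
age_stats = toy.groupby('salary')['age'].agg(['mean', 'std'])
print(age_stats)
```

Note that `std` here is the sample standard deviation (`ddof=1`), which is the pandas default.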

## Is it true that people who earn more than 50k have at least high school education? 

In [14]:
adults[more_50K_mask].education.unique()

array(['HS-grad', 'Masters', 'Bachelors', 'Some-college', 'Assoc-voc',
       'Doctorate', 'Prof-school', 'Assoc-acdm', '7th-8th', '12th',
       '10th', '11th', '9th', '5th-6th', '1st-4th'], dtype=object)

It is not true that people with more than 50K salary have at least a high school education: the list includes levels below HS-grad, such as 7th-8th and 1st-4th.

In [15]:
set(adults.education.unique()) - set(adults[more_50K_mask].education.unique())

{'Preschool'}

With the exception of Preschool, every education category contains adults who earn more than 50K.
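The same question can be answered numerically with `education_num`. The sketch below assumes that `education_num` is an ordinal code that grows with education level and that HS-grad corresponds to 9 (consistent with the rows shown by `head()`, where HS-grad is 9 and 11th is 7); both the threshold constant and the `toy` frame are illustrative, not part of the original analysis:

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    'education_num': [13, 9, 7, 4],
    'salary': ['>50K', '>50K', '>50K', '<=50K'],
})

HS_GRAD_NUM = 9  # assumption: education_num code for HS-grad

high_earners = toy[toy['salary'] == '>50K']

# True only if every high earner is at HS-grad level or above.
all_at_least_hs = (high_earners['education_num'] >= HS_GRAD_NUM).all()

# Number of high earners below HS-grad level.
below_hs_count = (high_earners['education_num'] < HS_GRAD_NUM).sum()

print(all_at_least_hs, below_hs_count)
```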

## Display the age statistics for each race and each gender.

In [16]:
adults.groupby(['race', 'sex'])['age'].describe()

race,sex,count,mean,std,min,25%,50%,75%,max
Amer-Indian-Eskimo,Female,119.0,37.117647,13.114991,17.0,27.0,36.0,46.0,80.0
Amer-Indian-Eskimo,Male,192.0,37.208333,12.049563,17.0,28.0,35.0,45.0,82.0
Asian-Pac-Islander,Female,346.0,35.089595,12.300845,17.0,25.0,33.0,43.75,75.0
Asian-Pac-Islander,Male,693.0,39.073593,12.883944,18.0,29.0,37.0,46.0,90.0
Black,Female,1555.0,37.854019,12.637197,17.0,28.0,37.0,46.0,90.0
Black,Male,1569.0,37.6826,12.882612,17.0,27.0,36.0,46.0,90.0
Other,Female,109.0,31.678899,11.631599,17.0,23.0,29.0,39.0,74.0
Other,Male,162.0,34.654321,11.355531,17.0,26.0,32.0,42.0,77.0
White,Female,8642.0,36.811618,14.329093,17.0,25.0,35.0,46.0,90.0
White,Male,19174.0,39.652498,13.436029,17.0,29.0,38.0,49.0,90.0


In [17]:
adults.groupby('race')['age'].describe()

race,count,mean,std,min,25%,50%,75%,max
Amer-Indian-Eskimo,311.0,37.173633,12.44713,17.0,28.0,35.0,45.5,82.0
Asian-Pac-Islander,1039.0,37.746872,12.825133,17.0,28.0,36.0,45.0,90.0
Black,3124.0,37.767926,12.75929,17.0,28.0,36.0,46.0,90.0
Other,271.0,33.457565,11.538865,17.0,25.0,31.0,41.0,77.0
White,27816.0,38.769881,13.782306,17.0,28.0,37.0,48.0,90.0


In [18]:
adults.groupby('sex')['age'].describe()

sex,count,mean,std,min,25%,50%,75%,max
Female,10771.0,36.85823,14.013697,17.0,25.0,35.0,46.0,90.0
Male,21790.0,39.433547,13.37063,17.0,29.0,38.0,48.0,90.0


## What is the maximum age of men of Amer-Indian-Eskimo race?

In [19]:
adults[(adults.race == 'Amer-Indian-Eskimo') & male_mask].age.max()

82

## Among whom is the proportion of those who earn a lot (>50k) greater: married or single men? Consider as married those whose marital-status starts with Married; the rest are considered bachelors.

In [20]:
adults.marital_status.unique()

array(['Never-married', 'Married-civ-spouse', 'Divorced',
       'Married-spouse-absent', 'Separated', 'Married-AF-spouse',
       'Widowed'], dtype=object)

Inspecting the dataset, we can see that there are 7 different values for marital status. The subset of the population we want consists of 'Married-spouse-absent', 'Married-civ-spouse', and 'Married-AF-spouse'. All of these values share the prefix 'Married', and we can use this pattern to create a mask.

In [21]:
married_mask = adults.marital_status.str.split('-').str[0] == 'Married'

In [22]:
married_men = adults[(married_mask) & (male_mask)]

Then, applying `value_counts()` to the salary column, we get the count per salary group for married men.

In [23]:
married_men.salary.value_counts()

<=50K    7576
>50K     5965
Name: salary, dtype: int64

Similarly, for bachelor men, the count per salary group is

In [24]:
bachelor_mask = adults.marital_status.str.split('-').str[0] != 'Married'

In [25]:
bachelor_men = adults[bachelor_mask & male_mask]
bachelor_men.salary.value_counts()

<=50K    7552
>50K      697
Name: salary, dtype: int64

In [26]:
len(bachelor_men) / len(adults[male_mask]), len(married_men) / len(adults[male_mask])

(0.37856815052776505, 0.621431849472235)

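Among men, about 37.86% are bachelors while 62.14% are married. From the counts above, 5965 of 13541 married men (about 44.05%) earn more than 50K, compared with 697 of 8249 bachelor men (about 8.45%), so the proportion of high earners is greater among married men.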
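The proportion of high earners per group can also be computed directly, without counting by hand. This is a sketch on made-up data (the `toy` frame and the `married`/`bachelor` labels are hypothetical); it uses `str.startswith()` as an alternative to splitting on `-`, and the fact that the mean of a boolean Series is the share of `True` values:

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    'sex': ['Male', 'Male', 'Male', 'Male', 'Female'],
    'marital_status': ['Married-civ-spouse', 'Married-AF-spouse',
                       'Never-married', 'Divorced', 'Married-civ-spouse'],
    'salary': ['>50K', '<=50K', '<=50K', '<=50K', '>50K'],
})

men = toy[toy['sex'] == 'Male']

# Label each man by whether his marital status starts with 'Married'.
status = (men['marital_status']
          .str.startswith('Married')
          .map({True: 'married', False: 'bachelor'}))

# Share of men earning >50K within each label.
share_by_group = (men['salary'] == '>50K').groupby(status).mean()
print(share_by_group)
```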

In [27]:
adults.marital_status.value_counts()

Married-civ-spouse       14976
Never-married            10683
Divorced                  4443
Separated                 1025
Widowed                    993
Married-spouse-absent      418
Married-AF-spouse           23
Name: marital_status, dtype: int64

## What is the maximum number of hours a person works per week? How many people work such a number of hours, and what is the percentage of those who earn a lot (>50k) among them?

In [28]:
adults.hours_per_week.max()

99

The maximum time a person works per week is 99 hours!

In [29]:
max_hours_per_week_mask = adults.hours_per_week == adults.hours_per_week.max()
len(adults[max_hours_per_week_mask])

85

And there are 85 people who work that long.

In [30]:
(len(adults[max_hours_per_week_mask & more_50K_mask]) 
 / len(adults[max_hours_per_week_mask]) 
 * 100)

29.411764705882355

Lastly, 29.41% of the people who work 99 hours per week earn more than 50K.

## Compute the average working time per week for those who earn a little (<=50k) and a lot (>50k) in each country. What are these values for Japan?

In [31]:
(adults.groupby(['native_country', 'salary'])
    .hours_per_week
    .agg('mean'))

native_country  salary
?               <=50K     40.164760
                >50K      45.547945
Cambodia        <=50K     41.416667
                >50K      40.000000
Canada          <=50K     37.914634
                            ...    
United-States   >50K      45.505369
Vietnam         <=50K     37.193548
                >50K      39.200000
Yugoslavia      <=50K     41.600000
                >50K      49.500000
Name: hours_per_week, Length: 82, dtype: float64

Alternatively, we can use `crosstab()` to get the mean hours per week for each combination of country and salary.

In [32]:
pd.crosstab(
    adults.native_country, 
    adults.salary, 
    values=adults.hours_per_week, 
    aggfunc='mean'
).head()

native_country,<=50K,>50K
?,40.16476,45.547945
Cambodia,41.416667,40.0
Canada,37.914634,45.641026
China,37.381818,38.9
Columbia,38.684211,50.0


In [33]:
japan_mask = adults.native_country == 'Japan'

In [34]:
(adults[japan_mask]
    .groupby(['native_country', 'salary'])
    .hours_per_week.agg('mean')
    .reset_index())

,native_country,salary,hours_per_week
0,Japan,<=50K,41.0
1,Japan,>50K,47.958333
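Since the grouped result is indexed by `(native_country, salary)`, one country can also be picked out of it with `.loc`, without filtering the frame first. A minimal sketch on made-up data (the `toy` frame and its numbers are hypothetical):

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    'native_country': ['Japan', 'Japan', 'Japan', 'Cuba'],
    'salary': ['<=50K', '>50K', '>50K', '<=50K'],
    'hours_per_week': [40, 50, 46, 38],
})

# Series with a two-level index: (native_country, salary).
mean_hours = toy.groupby(['native_country', 'salary'])['hours_per_week'].mean()

# Selecting on the first level leaves a Series indexed by salary.
japan_hours = mean_hours.loc['Japan']
print(japan_hours)

# unstack() turns the salary level into columns, similar to the crosstab layout.
wide = mean_hours.unstack('salary')
print(wide)
```

Combinations that do not occur in the data (here, Cuba with `>50K`) appear as `NaN` in the wide layout.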
