In this challenge you must analyze demographic data using Pandas. You are given a dataset of demographic data that was extracted from the 1994 Census database. Here is a sample of what the data looks like:

|    |   age | workclass        |   fnlwgt | education   |   education-num | marital-status     | occupation        | relationship   | race   | sex    |   capital-gain |   capital-loss |   hours-per-week | native-country   | salary   |
|---:|------:|:-----------------|---------:|:------------|----------------:|:-------------------|:------------------|:---------------|:-------|:-------|---------------:|---------------:|-----------------:|:-----------------|:---------|
|  0 |    39 | State-gov        |    77516 | Bachelors   |              13 | Never-married      | Adm-clerical      | Not-in-family  | White  | Male   |           2174 |              0 |               40 | United-States    | <=50K    |
|  1 |    50 | Self-emp-not-inc |    83311 | Bachelors   |              13 | Married-civ-spouse | Exec-managerial   | Husband        | White  | Male   |              0 |              0 |               13 | United-States    | <=50K    |
|  2 |    38 | Private          |   215646 | HS-grad     |               9 | Divorced           | Handlers-cleaners | Not-in-family  | White  | Male   |              0 |              0 |               40 | United-States    | <=50K    |
|  3 |    53 | Private          |   234721 | 11th        |               7 | Married-civ-spouse | Handlers-cleaners | Husband        | Black  | Male   |              0 |              0 |               40 | United-States    | <=50K    |
|  4 |    28 | Private          |   338409 | Bachelors   |              13 | Married-civ-spouse | Prof-specialty    | Wife           | Black  | Female |              0 |              0 |               40 | Cuba             | <=50K    |




## Answer the questions below

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df = pd.read_csv("adult.data.csv")
df.head()

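|    |   age | workclass        |   fnlwgt | education   |   education-num | marital-status     | occupation        | relationship   | race   | sex    |   capital-gain |   capital-loss |   hours-per-week | native-country   | salary   |
|---:|------:|:-----------------|---------:|:------------|----------------:|:-------------------|:------------------|:---------------|:-------|:-------|---------------:|---------------:|-----------------:|:-----------------|:---------|
|  0 |    39 | State-gov        |    77516 | Bachelors   |              13 | Never-married      | Adm-clerical      | Not-in-family  | White  | Male   |           2174 |              0 |               40 | United-States    | <=50K    |
|  1 |    50 | Self-emp-not-inc |    83311 | Bachelors   |              13 | Married-civ-spouse | Exec-managerial   | Husband        | White  | Male   |              0 |              0 |               13 | United-States    | <=50K    |
|  2 |    38 | Private          |   215646 | HS-grad     |               9 | Divorced           | Handlers-cleaners | Not-in-family  | White  | Male   |              0 |              0 |               40 | United-States    | <=50K    |
|  3 |    53 | Private          |   234721 | 11th        |               7 | Married-civ-spouse | Handlers-cleaners | Husband        | Black  | Male   |              0 |              0 |               40 | United-States    | <=50K    |
|  4 |    28 | Private          |   338409 | Bachelors   |              13 | Married-civ-spouse | Prof-specialty    | Wife           | Black  | Female |              0 |              0 |               40 | Cuba             | <=50K    |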


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education-num   32561 non-null  int64 
 5   marital-status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital-gain    32561 non-null  int64 
 11  capital-loss    32561 non-null  int64 
 12  hours-per-week  32561 non-null  int64 
 13  native-country  32561 non-null  object
 14  salary          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


You must use Pandas to answer the following questions:

1. How many people of each race are represented in this dataset? This should be a Pandas series with race names as the index labels. (race column)

In [9]:
race_df = df.groupby('race')['race'].value_counts().reset_index().sort_values(by=['count'], ascending = False)
race_df

|    | race               |   count |
|---:|:-------------------|--------:|
|  4 | White              |   27816 |
|  2 | Black              |    3124 |
|  1 | Asian-Pac-Islander |    1039 |
|  0 | Amer-Indian-Eskimo |     311 |
|  3 | Other              |     271 |
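
The question asks for a Series indexed by race name, whereas the cell above produces a DataFrame. As a minimal sketch, `Series.value_counts` returns such a Series directly. The frame `toy` below is made-up illustration data, not the census file, and the name `race_count` is only a suggestion.

```python
import pandas as pd

# Toy stand-in for the census data (hypothetical rows, not real counts).
toy = pd.DataFrame({"race": ["White", "Black", "White", "Other", "White", "Black"]})

# value_counts returns a Series whose index holds the unique values,
# sorted by count in descending order by default.
race_count = toy["race"].value_counts()
print(race_count)
```

On the real data the same call would be `df['race'].value_counts()`.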


2. What is the average age of men?


In [12]:
men_df = df[df['sex']=='Male']
print("The average age of men is: ", round(men_df['age'].mean(), 2))

The average age of men is:  39.43


3. What is the percentage of people who have a Bachelor's degree?


In [16]:
edu_df = df.education.value_counts(normalize=True).reset_index()
edu_df['proportion'] = edu_df['proportion'].apply(lambda x: f'{round(x*100, 2)}%')
edu_df.head()

|    | education    | proportion   |
|---:|:-------------|:-------------|
|  0 | HS-grad      | 32.25%       |
|  1 | Some-college | 22.39%       |
|  2 | Bachelors    | 16.45%       |
|  3 | Masters      | 5.29%        |
|  4 | Assoc-voc    | 4.24%        |
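
The table above lists every education level; the Bachelors row answers the question. If only that single number is needed, one possible shortcut is to average a boolean mask, since the mean of a boolean Series is the fraction of `True` values. The sketch below uses a small made-up frame (`toy`), not the census data.

```python
import pandas as pd

# Toy stand-in for the census data (hypothetical rows).
toy = pd.DataFrame(
    {"education": ["Bachelors", "HS-grad", "Masters", "Bachelors", "HS-grad"]}
)

# True where the row has a Bachelor's degree; the mean of the mask
# is the share of such rows, which we scale to a percentage.
has_bachelors = toy["education"] == "Bachelors"
pct_bachelors = round(has_bachelors.mean() * 100, 1)
print(pct_bachelors)
```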


4. What percentage of people with advanced education (Bachelors, Masters, or Doctorate) make more than 50K?


In [19]:
high_edu_df = df[df['education'].isin(['Bachelors', 'Masters', 'Doctorate'])]
per_high_ed = high_edu_df['salary'].value_counts(normalize=True).reset_index()
per_high_ed['proportion'] = per_high_ed['proportion'].apply(lambda x: f'{round(x*100, 2)}%')
per_high_ed.head()

|    | salary   | proportion   |
|---:|:---------|:-------------|
|  0 | <=50K    | 53.46%       |
|  1 | >50K     | 46.54%       |


5. What percentage of people without advanced education make more than 50K?


In [21]:
not_high_edu_df = df[~df['education'].isin(['Bachelors', 'Masters', 'Doctorate'])]
per_not_high_ed = not_high_edu_df['salary'].value_counts(normalize=True).reset_index()
per_not_high_ed['proportion'] = per_not_high_ed['proportion'].apply(lambda x: f'{round(x*100, 2)}%')
per_not_high_ed.head()

|    | salary   | proportion   |
|---:|:---------|:-------------|
|  0 | <=50K    | 82.63%       |
|  1 | >50K     | 17.37%       |
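
Questions 4 and 5 apply the same calculation to complementary subsets, so the logic can be shared. The sketch below assumes a hypothetical helper, `pct_rich`, and runs on made-up rows rather than the census file.

```python
import pandas as pd

# Toy stand-in for the census data (hypothetical rows).
toy = pd.DataFrame({
    "education": ["Bachelors", "Masters", "HS-grad", "11th", "Doctorate", "HS-grad"],
    "salary": [">50K", "<=50K", "<=50K", "<=50K", ">50K", ">50K"],
})


def pct_rich(frame: pd.DataFrame) -> float:
    """Percentage of rows in `frame` with salary '>50K' (hypothetical helper)."""
    return round((frame["salary"] == ">50K").mean() * 100, 1)


# One mask selects the advanced-education group; its negation selects the rest.
advanced = toy["education"].isin(["Bachelors", "Masters", "Doctorate"])
higher_education_rich = pct_rich(toy[advanced])
lower_education_rich = pct_rich(toy[~advanced])
print(higher_education_rich, lower_education_rich)
```

Reusing one mask and its negation guarantees the two groups never overlap and together cover every row.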


6. What is the minimum number of hours a person works per week?


In [22]:
min_hours = df['hours-per-week'].min()
print(f'The minimum number of hours a person works per week is: {min_hours}')

The minimum number of hours a person works per week is: 1


7. What percentage of the people who work the minimum number of hours per week have a salary of more than 50K?


In [23]:
min_df = df[df['hours-per-week']==min_hours]
min_salary = min_df['salary'].value_counts(normalize=True).reset_index()
min_salary['proportion'] = min_salary['proportion'].apply(lambda x: f'{round(x*100, 2)}%')
min_salary.head()

|    | salary   | proportion   |
|---:|:---------|:-------------|
|  0 | <=50K    | 90.0%        |
|  1 | >50K     | 10.0%        |


8. What country has the highest percentage of people that earn >50K and what is that percentage?


In [43]:
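# Caveat: this computes each country's share of all >50K earners,
# not the share of each country's people who earn >50K, which is what the question asks.
df_country = df[df['salary'] == '>50K'][['native-country']].value_counts(normalize=True).reset_index()
df_country['proportion'] = df_country['proportion'].apply(lambda x: f'{round(x*100, 2)}%')
df_country.head()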

|    | native-country   | proportion   |
|---:|:-----------------|:-------------|
|  0 | United-States    | 91.46%       |
|  1 | ?                | 1.86%        |
|  2 | Philippines      | 0.78%        |
|  3 | Germany          | 0.56%        |
|  4 | India            | 0.51%        |
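
The proportions above are normalized over all >50K earners, so they show where high earners come from, and the largest country dominates simply because it has the most rows. The question asks instead for the share of each country's own people who earn >50K, which means dividing by per-country totals. A minimal sketch of that calculation on made-up data (the country names `A` and `B` are placeholders) might look like this:

```python
import pandas as pd

# Toy stand-in: country A has many rows, country B has few (hypothetical data).
toy = pd.DataFrame({
    "native-country": ["A", "A", "A", "A", "A", "A", "B", "B"],
    "salary": [">50K", ">50K", "<=50K", "<=50K", "<=50K", "<=50K", ">50K", "<=50K"],
})

is_rich = toy["salary"] == ">50K"

# Share of all >50K earners coming from each country (what the cell above computes).
share_of_rich = toy.loc[is_rich, "native-country"].value_counts(normalize=True) * 100

# Share of each country's own rows that earn >50K (what the question asks).
# The mean of a boolean mask within each group is that group's fraction of True values.
rich_rate = is_rich.groupby(toy["native-country"]).mean() * 100

top_country = rich_rate.idxmax()
top_pct = round(rich_rate.max(), 1)
print(top_country, top_pct)
```

In the toy data, `A` supplies most of the high earners, yet `B` has the higher within-country rate, so the two measures can rank countries differently.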


9. Identify the most popular occupation for those who earn >50K in India.

In [48]:
df_india = df[(df['native-country']=='India') & (df['salary']=='>50K')]
df_popular_occ = df_india.groupby('occupation')['occupation'].value_counts().reset_index().sort_values(by = ['count'], ascending=False)
df_popular_occ.head()

|    | occupation      |   count |
|---:|:----------------|--------:|
|  3 | Prof-specialty  |      25 |
|  1 | Exec-managerial |       8 |
|  2 | Other-service   |       2 |
|  5 | Tech-support    |       2 |
|  0 | Adm-clerical    |       1 |
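
The top row of the table, Prof-specialty, is the answer. When only the name of the most common occupation is wanted, one option is to chain `value_counts()` with `idxmax()`, which returns the index label with the largest count. The sketch below uses made-up rows, not the census data.

```python
import pandas as pd

# Toy stand-in for the census data (hypothetical rows).
toy = pd.DataFrame({
    "native-country": ["India", "India", "India", "Cuba", "India"],
    "salary": [">50K", ">50K", "<=50K", ">50K", ">50K"],
    "occupation": ["Prof-specialty", "Exec-managerial", "Sales", "Sales", "Prof-specialty"],
})

# Keep only high earners from India, then count occupations among them.
mask = (toy["native-country"] == "India") & (toy["salary"] == ">50K")
occupation_counts = toy.loc[mask, "occupation"].value_counts()

# idxmax returns the index label (the occupation name) with the highest count.
top_occupation = occupation_counts.idxmax()
print(top_occupation)
```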
