# Demographic Data Analyzer
This notebook analyzes census data provided by freeCodeCamp. The goal is to write a function that uses Pandas to answer the following questions:

- How many people of each race are represented in this dataset? This should be a Pandas series with race names as the index labels. (race column)
- What is the average age of men?
- What is the percentage of people who have a Bachelor's degree?
- What percentage of people with advanced education (Bachelors, Masters, or Doctorate) make more than 50K?
- What percentage of people without advanced education make more than 50K?
- What is the minimum number of hours a person works per week?
- What percentage of the people who work the minimum number of hours per week have a salary of more than 50K?
- What country has the highest percentage of people that earn >50K and what is that percentage?
- Identify the most popular occupation for those who earn >50K in India.

**Here you will find the link to the assignment and csv file:** https://repl.it/@freeCodeCamp/fcc-demographic-data-analyzer#README.md


In [None]:
# Importing needed libraries
import pandas as pd
import numpy as np

In [None]:
df = pd.read_csv('adult.data.txt') # Importing the data

In [None]:
df.head() # Previewing the data 

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [None]:
df.shape # Checking how large the data is via rows and columns 

(32561, 15)

In [None]:
df.describe() # Checking the stats of the data

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week
count,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0
mean,38.581647,189778.4,10.080679,1077.648844,87.30383,40.437456
std,13.640433,105550.0,2.57272,7385.292085,402.960219,12.347429
min,17.0,12285.0,1.0,0.0,0.0,1.0
25%,28.0,117827.0,9.0,0.0,0.0,40.0
50%,37.0,178356.0,10.0,0.0,0.0,40.0
75%,48.0,237051.0,12.0,0.0,0.0,45.0
max,90.0,1484705.0,16.0,99999.0,4356.0,99.0


In [None]:
df.isna().sum() # Checking for any missing data 

age               0
workclass         0
fnlwgt            0
education         0
education-num     0
marital-status    0
occupation        0
relationship      0
race              0
sex               0
capital-gain      0
capital-loss      0
hours-per-week    0
native-country    0
salary            0
dtype: int64

In [None]:
df.info() # Checking the data types of each column

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
age               32561 non-null int64
workclass         32561 non-null object
fnlwgt            32561 non-null int64
education         32561 non-null object
education-num     32561 non-null int64
marital-status    32561 non-null object
occupation        32561 non-null object
relationship      32561 non-null object
race              32561 non-null object
sex               32561 non-null object
capital-gain      32561 non-null int64
capital-loss      32561 non-null int64
hours-per-week    32561 non-null int64
native-country    32561 non-null object
salary            32561 non-null object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


# The Function
The skeleton of this function was provided by freeCodeCamp; my job is to fill in the portions that answer the questions listed in the notebook description. Before writing the function, I will work out the answer to each question separately, then bring everything together at the end.

# Let's Get Started: Answering The Questions

# Race Count: This Must Be Reported In A Series With The Race As The Index

In [None]:
# Count the rows for each race and check the type of the result
racecount = df['race'].value_counts()
print(racecount)
print( f'\n Data Type: {type(racecount)}') 

White                 27816
Black                  3124
Asian-Pac-Islander     1039
Amer-Indian-Eskimo      311
Other                   271
Name: race, dtype: int64

 Data Type: <class 'pandas.core.series.Series'>


# What Is The Average Age Of Men?

In [None]:
# Locate the males in df and then calculate the mean of their ages
males = df[df['sex'] == 'Male']
males['age'].mean()

39.43354749885268
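
The same filter-then-average step can also be written with `groupby`, which returns the mean age for every value of `sex` at once. This is only a sketch: the small `toy` frame is made-up illustration data, not rows from the census file.

```python
import pandas as pd

# Hypothetical toy data standing in for the census DataFrame
toy = pd.DataFrame({
    'sex': ['Male', 'Female', 'Male', 'Female'],
    'age': [30, 40, 50, 20],
})

# Mean age for each sex; the result is a Series indexed by the values of 'sex'
mean_age_by_sex = toy.groupby('sex')['age'].mean()

# Selecting the 'Male' label gives the same number as filtering first
average_age_men = mean_age_by_sex['Male']
print(average_age_men)
```

Filtering first is just as valid when only one group is of interest; `groupby` is mainly convenient when the other groups are wanted for comparison.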

In [None]:
males.head() # Quick view of the males 

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
7,52,Self-emp-not-inc,209642,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,45,United-States,>50K


# What Is The Percentage Of People Who Have A Bachelor's Degree?

In [None]:
percentagebachelors = df.loc[df['education'] == 'Bachelors']
len(percentagebachelors) # Length of the people who have bachelors degree

5355

In [None]:
len(df) # Length of full data frame

32561

In [None]:
(len(percentagebachelors)/len(df))*100 # Percentage of people with bachelors degree

16.44605509658794
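
The three cells above can be collapsed into one expression, because the mean of a boolean Series is the fraction of `True` values. A minimal sketch on made-up data (the `toy` frame is an assumption for illustration only):

```python
import pandas as pd

# Hypothetical toy data standing in for the census DataFrame
toy = pd.DataFrame({
    'education': ['Bachelors', 'HS-grad', 'Masters', 'Bachelors'],
})

# True where the row has a Bachelors degree, False otherwise
has_bachelors = toy['education'] == 'Bachelors'

# The mean of True/False values is the share of True values
percentage_bachelors = has_bachelors.mean() * 100
print(percentage_bachelors)
```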

# What Percentage Of People With Advanced Education (`Bachelors`, `Masters`, Or `Doctorate`) Make More Than 50K?

In [None]:

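# Note: dividing by len(df) gives the share of the whole dataset that holds this degree and earns >50K
Bachelors = df.loc[(df['education'] == 'Bachelors') & (df['salary'] == '>50K')]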
len(Bachelors)/len(df)*100

6.821043579742637

In [None]:
Masters = df.loc[(df['education'] == 'Masters') & (df['salary'] == '>50K')]
(len(Masters)/len(df))*100

2.945241239519671

In [None]:
Doctorate = df.loc[(df['education'] == 'Doctorate') & (df['salary'] == '>50K')]
(len(Doctorate)/len(df))*100

0.9397745769478825

In [None]:
df['education'].value_counts() # The amount of people for each level of education

HS-grad         10501
Some-college     7291
Bachelors        5355
Masters          1723
Assoc-voc        1382
11th             1175
Assoc-acdm       1067
10th              933
7th-8th           646
Prof-school       576
9th               514
12th              433
Doctorate         413
5th-6th           333
1st-4th           168
Preschool          51
Name: education, dtype: int64

# What Percentage Of People Without Advanced Education Make More Than 50K?

In [None]:
Some_college = df.loc[(df['education'] == 'Some-college') & (df['salary'] == '>50K')]
print(f'Some College: {(len(Some_college)/len(df))*100}%')

HS_grad = df.loc[(df['education'] == 'HS-grad') & (df['salary'] == '>50K')]
print(f'High School Grad: {(len(HS_grad)/len(df))*100}%') 

Some College: 4.259697183747428%
High School Grad: 5.144190903227788%


# With And Without `Bachelors`, `Masters`, Or `Doctorate`

In [None]:
higher_education = df[df['education'].isin(['Bachelors', 'Masters', 'Doctorate'])]
lower_education = df[~df['education'].isin(['Bachelors', 'Masters', 'Doctorate'])]

# Percentage With Salary >50K

In [None]:
higher_education_rich = higher_education[higher_education['salary'] == '>50K']['salary'].value_counts() / higher_education.shape[0] * 100
print(f'Higher Education %: {higher_education_rich}%')

# The question asks who earns more than 50K, so filter on '>50K' here as well
lower_education_rich = lower_education[lower_education['salary'] == '>50K']['salary'].value_counts() / lower_education.shape[0] * 100
print(f'\n Lower Education %: {lower_education_rich}%')

Higher Education %: >50K    46.535843
Name: salary, dtype: float64%

 Lower Education %: >50K    17.37136
Name: salary, dtype: float64%
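
Because `value_counts()` returns a Series, the percentages above print with a name and dtype attached. If plain numbers are preferred, one option is to take the mean of a boolean mask instead. The sketch below uses a made-up `toy` frame, and the names `advanced` and `rich` are illustrative rather than part of the assignment.

```python
import pandas as pd

# Hypothetical toy data standing in for the census DataFrame
toy = pd.DataFrame({
    'education': ['Bachelors', 'Masters', 'HS-grad', 'HS-grad', 'Doctorate', '11th'],
    'salary':    ['>50K',      '<=50K',   '>50K',    '<=50K',   '>50K',      '<=50K'],
})

# Boolean masks: who has advanced education, and who earns more than 50K
advanced = toy['education'].isin(['Bachelors', 'Masters', 'Doctorate'])
rich = toy['salary'] == '>50K'

# Share of >50K earners within each group, as plain floats
higher_education_rich = rich[advanced].mean() * 100
lower_education_rich = rich[~advanced].mean() * 100

print(f'Higher Education %: {higher_education_rich}')
print(f'Lower Education %: {lower_education_rich}')
```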


# What Is The Minimum Number Of Hours A Person Works Per Week (Hours-Per-Week Feature)?

In [None]:
minworkhours = df['hours-per-week'].min()
minworkhours

1

# What Percentage Of The People Who Work The Minimum Number Of Hours Per Week Have A Salary Of >50K?

In [None]:
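minworkers = df.loc[df['hours-per-week'] == minworkhours] # Everyone who works the minimum number of hours
numminworkers = minworkers.loc[minworkers['salary'] == '>50K'] # Those among them who earn >50K
richpercentage = (len(numminworkers)/len(minworkers))*100 # Divide by the minimum-hours group, not the whole data frame
print(f'Rich Percentage: {richpercentage}%')

Rich Percentage: 10.0%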


In [None]:
numminworkers.head() # Quick view of who earned >50K while working the least amount of hours

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
189,58,State-gov,109567,Doctorate,16,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,1,United-States,>50K
20072,65,?,76043,HS-grad,9,Married-civ-spouse,?,Husband,White,Male,0,0,1,United-States,>50K


# What Country Has The Highest Percentage Of People That Earn >50K?

In [None]:
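highestearningcountry = df.loc[df['salary'] =='>50K']
highestearningcountry['native-country'].value_counts().head(1) # Country with the most >50K earners (a raw count, not a percentage)

United-States    7171
Name: native-country, dtype: int64

In [None]:
# Note: 7171/len(df) is the share of the whole dataset made up of US >50K earners, not the >50K rate within the US
print(f' US >50K earners as a share of the dataset: {(7171/len(df))*100}%')

 US >50K earners as a share of the dataset: 22.023279383311323%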
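
Counting >50K rows per country favours countries with many respondents. The question asks for a rate, so each country's >50K count has to be divided by that country's own total. A minimal sketch of that calculation on made-up data follows; the country labels `A`, `B`, `C` and the names `country_totals`, `country_rich`, and `rich_share` are illustrative assumptions.

```python
import pandas as pd

# Hypothetical toy data: A has the most >50K rows, B has the highest >50K rate
toy = pd.DataFrame({
    'native-country': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'C'],
    'salary': ['>50K', '>50K', '>50K', '<=50K', '<=50K', '<=50K',
               '>50K', '>50K', '<=50K'],
})

# Rows per country, and >50K rows per country
country_totals = toy['native-country'].value_counts()
country_rich = toy.loc[toy['salary'] == '>50K', 'native-country'].value_counts()

# Division aligns on the country label; countries with no >50K rows become NaN
rich_share = country_rich / country_totals * 100

# idxmax returns the index label (the country); NaN entries are skipped by default
highest_earning_country = rich_share.idxmax()
highest_earning_country_percentage = rich_share.max()

print(highest_earning_country, highest_earning_country_percentage)
```

In this toy frame the largest raw count belongs to `A`, while the highest rate belongs to `B`, which is why the two approaches can disagree.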


# Identify The Most Popular Occupation For Those Who Earn >50K In India.

In [None]:
topINoccupation = df.loc[(df['native-country'] == 'India') & (df['salary']== '>50K')]
topINoccupation['occupation'].value_counts().head(1)

Prof-specialty    25
Name: occupation, dtype: int64
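
`value_counts().head(1)` shows both the occupation and its count. When only the occupation name is needed, `idxmax()` returns the index label of the largest count, while `max()` returns the count itself. A small sketch on made-up rows:

```python
import pandas as pd

# Hypothetical toy data standing in for the census DataFrame
toy = pd.DataFrame({
    'native-country': ['India', 'India', 'India', 'Cuba', 'India'],
    'salary':         ['>50K', '>50K', '>50K', '>50K', '<=50K'],
    'occupation':     ['Prof-specialty', 'Sales', 'Prof-specialty', 'Sales', 'Sales'],
})

# Keep only the >50K earners whose native country is India
rich_in_india = toy[(toy['native-country'] == 'India') & (toy['salary'] == '>50K')]
occupation_counts = rich_in_india['occupation'].value_counts()

top_occupation = occupation_counts.idxmax()  # the label, e.g. an occupation name
top_count = occupation_counts.max()          # the number of rows with that label

print(top_occupation, top_count)
```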

# Pulling It All Together: The Function

In [None]:

def calculate_demographic_data(print_data=True):
    # Read data from file
    df = pd.read_csv('adult.data.txt')

    # How many of each race are represented in this dataset? This should be a Pandas series with race names as the index labels.
    race_count = df['race'].value_counts()

    # Decimals are rounded to one place, as the freeCodeCamp assignment asks.

    # What is the average age of men?
    average_age_men = round(df[df['sex'] == 'Male']['age'].mean(), 1)

    # What is the percentage of people who have a Bachelor's degree?
    percentage_bachelors = round(df[df['education'] == 'Bachelors'].shape[0] / df.shape[0] * 100, 1)

    # What percentage of people with advanced education (`Bachelors`, `Masters`, or `Doctorate`) make more than 50K?
    # What percentage of people without advanced education make more than 50K?

    # with and without `Bachelors`, `Masters`, or `Doctorate`
    higher_education = df[df['education'].isin(['Bachelors', 'Masters', 'Doctorate'])]
    lower_education = df[~df['education'].isin(['Bachelors', 'Masters', 'Doctorate'])]

    # percentage with salary >50K (row counts give plain numbers instead of one-element Series)
    higher_education_rich = round(higher_education[higher_education['salary'] == '>50K'].shape[0] / higher_education.shape[0] * 100, 1)

    lower_education_rich = round(lower_education[lower_education['salary'] == '>50K'].shape[0] / lower_education.shape[0] * 100, 1)

    # What is the minimum number of hours a person works per week (hours-per-week feature)?
    min_work_hours = df['hours-per-week'].min()

    # What percentage of the people who work the minimum number of hours per week have a salary of >50K?
    num_min_workers = df[df['hours-per-week'] == min_work_hours]

    rich_percentage = round(num_min_workers[num_min_workers['salary'] == '>50K'].shape[0] / num_min_workers.shape[0] * 100, 1)

    # What country has the highest percentage of people that earn >50K?
    # Divide each country's >50K count by that country's total (the division aligns on country name)
    country_totals = df['native-country'].value_counts()
    country_rich = df[df['salary'] == '>50K']['native-country'].value_counts()
    rich_share_by_country = country_rich / country_totals * 100

    highest_earning_country = rich_share_by_country.idxmax()

    highest_earning_country_percentage = round(rich_share_by_country.max(), 1)

    # Identify the most popular occupation for those who earn >50K in India.
    # idxmax returns the occupation name; max would return its count
    top_IN_occupation = df[(df['native-country'] == 'India') & (df['salary'] == '>50K')]['occupation'].value_counts().idxmax()


    # DO NOT MODIFY BELOW THIS LINE

    if print_data:
        print("Number of each race:\n", race_count) 
        print("Average age of men:", average_age_men)
        print(f"Percentage with Bachelors degrees: {percentage_bachelors}%")
        print(f"Percentage with higher education that earn >50K: {higher_education_rich}%")
        print(f"Percentage without higher education that earn >50K: {lower_education_rich}%")
        print(f"Min work time: {min_work_hours} hours/week")
        print(f"Percentage of rich among those who work fewest hours: {rich_percentage}%")
        print("Country with highest percentage of rich:", highest_earning_country)
        print(f"Highest percentage of rich people in country: {highest_earning_country_percentage}%")
        print("Top occupations in India:", top_IN_occupation)

    return {
        'race_count': race_count,
        'average_age_men': average_age_men,
        'percentage_bachelors': percentage_bachelors,
        'higher_education_rich': higher_education_rich,
        'lower_education_rich': lower_education_rich,
        'min_work_hours': min_work_hours,
        'rich_percentage': rich_percentage,
        'highest_earning_country': highest_earning_country,
        'highest_earning_country_percentage':
        highest_earning_country_percentage,
        'top_IN_occupation': top_IN_occupation
    }


# Testing Out The Function

In [None]:
calculate_demographic_data()

Number of each race:
 White                 27816
Black                  3124
Asian-Pac-Islander     1039
Amer-Indian-Eskimo      311
Other                   271
Name: race, dtype: int64
Average age of men: 39.4
Percentage with Bachelors degrees: 16.4%
Percentage with higher education that earn >50K: 46.5%
Percentage without higher education that earn >50K: 17.4%
Min work time: 1 hours/week
Percentage of rich among those who work fewest hours: 10.0%
Country with highest percentage of rich: Iran
Highest percentage of rich people in country: 41.9%
Top occupations in India: Prof-specialty


{'race_count': White                 27816
 Black                  3124
 Asian-Pac-Islander     1039
 Amer-Indian-Eskimo      311
 Other                   271
 Name: race, dtype: int64,
 'average_age_men': 39.4,
 'percentage_bachelors': 16.4,
 'higher_education_rich': 46.5,
 'lower_education_rich': 17.4,
 'min_work_hours': 1,
 'rich_percentage': 10.0,
 'highest_earning_country': 'Iran',
 'highest_earning_country_percentage': 41.9,
 'top_IN_occupation': 'Prof-specialty'}
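
One way to sanity-check the returned dictionary is to confirm that each entry has the kind of value its question calls for. The helper below is a hypothetical sketch, not part of the freeCodeCamp skeleton. It assumes that `race_count` should be a Series, that the country and occupation answers should be names (strings), and that every other answer should be a single number.

```python
import numbers
import pandas as pd

def check_result_types(result):
    """Return the keys whose values are not of the assumed kind (hypothetical helper)."""
    expected = {
        'race_count': pd.Series,
        'average_age_men': numbers.Real,
        'percentage_bachelors': numbers.Real,
        'higher_education_rich': numbers.Real,
        'lower_education_rich': numbers.Real,
        'min_work_hours': numbers.Real,
        'rich_percentage': numbers.Real,
        'highest_earning_country': str,
        'highest_earning_country_percentage': numbers.Real,
        'top_IN_occupation': str,
    }
    # A missing key yields None, which matches none of the expected kinds
    return [key for key, kind in expected.items()
            if not isinstance(result.get(key), kind)]
```

Calling `check_result_types(calculate_demographic_data(print_data=False))` would return an empty list when every entry matches these assumptions, and otherwise the names of the entries worth a second look.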