# Demographic Data Analyzer

This exercise is part of the *Data Analysis with Python* course from [freeCodeCamp](https://www.freecodecamp.org/learn/data-analysis-with-python/).

**Task.** Given a set of demographic data extracted from a census conducted in 1994, answer the following questions:

1. How many people of each race are represented in this dataset?
2. What is the average age of men?
3. What percentage of people have a `Bachelors` degree?
4. What percentage of people with advanced education (`Bachelors`, `Masters`, or `Doctorate`) earn more than 50K? What percentage of people without advanced education earn more than 50K?
5. What is the minimum number of hours a person works per week?
6. What percentage of the people who work the minimum number of hours per week have a salary of more than 50K?
7. Which country has the highest percentage of people who earn more than 50K, and what is that percentage?
8. What is the most popular occupation among those who earn more than 50K in India?

In [1]:
# Import libraries
import pandas as pd

In [2]:
# Read data
df = pd.read_csv("adult.data.csv")
df.head(3)

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |


In [5]:
df.shape

(32561, 15)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education-num   32561 non-null  int64 
 5   marital-status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital-gain    32561 non-null  int64 
 11  capital-loss    32561 non-null  int64 
 12  hours-per-week  32561 non-null  int64 
 13  native-country  32561 non-null  object
 14  salary          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 2.6+ MB


The dataset has 32,561 observations and 15 columns. The variables include age, sex, marital status, education level, occupation, and salary, among others. `df.info()` reports no null values, and the data type of each column looks appropriate.
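
A non-null count only shows that pandas found no `NaN` cells. Some data files mark unknown values with a placeholder string instead. The sketch below shows both checks on a tiny made-up frame. The `"?"` placeholder is an assumption for illustration and was not verified against `adult.data.csv` in this notebook.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from adult.data.csv.
toy = pd.DataFrame({
    "workclass": ["Private", "?", "State-gov"],
    "occupation": ["Sales", "?", "Adm-clerical"],
    "age": [25, 41, 39],
})

# Nulls as pandas sees them (the same information df.info() summarizes).
null_counts = toy.isna().sum()

# Hypothetical placeholder check: count cells equal to "?" in the text columns.
placeholder_counts = toy[["workclass", "occupation"]].eq("?").sum()

print(null_counts.to_dict())
print(placeholder_counts.to_dict())
```

If the placeholder count is nonzero on the real file, those rows are worth keeping in mind when reading the percentages further down.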

**1. How many of each race are represented in this dataset?**

In [14]:
race_count = df["race"].value_counts()
print("Number of each race:\n", race_count)

Number of each race:
 White                 27816
Black                  3124
Asian-Pac-Islander     1039
Amer-Indian-Eskimo      311
Other                   271
Name: race, dtype: int64
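
Absolute counts are easier to compare when shown next to their share of the total. A minimal sketch, using made-up rows rather than the real file, with `value_counts(normalize=True)`:

```python
import pandas as pd

# Made-up rows for illustration only; not taken from adult.data.csv.
toy = pd.DataFrame({"race": ["White", "White", "Black", "White", "Other"]})

counts = toy["race"].value_counts()                # absolute counts
shares = toy["race"].value_counts(normalize=True)  # fractions that sum to 1

# Both Series share the same index labels, so they align when combined.
summary = pd.DataFrame({"count": counts, "percent": (shares * 100).round(1)})
print(summary)
```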


**2. What is the average age of men?**

In [10]:
average_age_men = round(df[df["sex"] == "Male"]["age"].mean(), 1)
print("Average age of men:", average_age_men)

Average age of men: 39.4
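
The same figure can be read off a grouped mean, which also returns the value for the other group as a point of comparison. A sketch on made-up rows:

```python
import pandas as pd

# Made-up rows for illustration only; not taken from adult.data.csv.
toy = pd.DataFrame({
    "sex": ["Male", "Female", "Male", "Female"],
    "age": [30, 40, 50, 20],
})

# One mean per value of "sex", rounded to one decimal like the cell above.
mean_age_by_sex = toy.groupby("sex")["age"].mean().round(1)
print(mean_age_by_sex["Male"])
```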


**3. What is the percentage of people who have a Bachelor's degree?**

In [8]:
percentage_bachelors = round(df[df["education"] == "Bachelors"].shape[0] / df.shape[0] * 100, 1)
print(f"Percentage with Bachelors degrees: {percentage_bachelors}%")

Percentage with Bachelors degrees: 16.4%
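
An equivalent way to get a percentage is to take the mean of the boolean mask, since `True` counts as 1 and `False` as 0. A sketch on made-up rows:

```python
import pandas as pd

# Made-up rows for illustration only; not taken from adult.data.csv.
toy = pd.DataFrame({"education": ["Bachelors", "HS-grad", "Masters", "Bachelors"]})

# The mean of a boolean Series is the fraction of True values.
is_bachelors = toy["education"] == "Bachelors"
pct_bachelors = round(is_bachelors.mean() * 100, 1)
print(f"{pct_bachelors}%")
```

This avoids dividing two `shape[0]` values by hand, though both forms give the same number.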


**4. What percentage of people with advanced education (`Bachelors`, `Masters`, or `Doctorate`) make more than 50K? What percentage of people without advanced education make more than 50K?**

In [9]:
# with and without `Bachelors`, `Masters`, or `Doctorate`
higher_education = df[df.education.isin(["Bachelors","Masters","Doctorate"])]
lower_education = df[~df.education.isin(["Bachelors","Masters","Doctorate"])]

# percentage with salary >50K
higher_education_rich = round(higher_education[higher_education["salary"] == ">50K"].shape[0] / higher_education.shape[0] * 100, 1)
lower_education_rich = round(lower_education[lower_education["salary"] == ">50K"].shape[0] / lower_education.shape[0] * 100, 1)
   
print(f"Percentage with higher education that earn >50K: {higher_education_rich}%")
print(f"Percentage without higher education that earn >50K: {lower_education_rich}%")

Percentage with higher education that earn >50K: 46.5%
Percentage without higher education that earn >50K: 17.4%
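
Both percentages can also come from a single grouped mean instead of two filtered frames. The sketch below uses made-up rows, and the group labels `"advanced"` and `"other"` are hypothetical names chosen for the example:

```python
import pandas as pd

# Made-up rows for illustration only; not taken from adult.data.csv.
toy = pd.DataFrame({
    "education": ["Bachelors", "Masters", "HS-grad", "HS-grad", "Doctorate", "11th"],
    "salary": [">50K", "<=50K", "<=50K", ">50K", ">50K", "<=50K"],
})

advanced = toy["education"].isin(["Bachelors", "Masters", "Doctorate"])
high_income = toy["salary"] == ">50K"

# Label each row by education group, then average the >50K flag per group.
group = advanced.map({True: "advanced", False: "other"})
pct_by_group = (high_income.groupby(group).mean() * 100).round(1)
print(pct_by_group.to_dict())
```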


**5. What is the minimum number of hours a person works per week?**

In [11]:
min_work_hours = df["hours-per-week"].min()
print(f"Min work time: {min_work_hours} hours/week")

Min work time: 1 hours/week


**6. What percentage of the people who work the minimum number of hours per week have a salary of >50K?**

In [12]:
num_min_workers = df[df["hours-per-week"] == min_work_hours]
rich_percentage = round(num_min_workers[num_min_workers["salary"] == ">50K"].shape[0] / num_min_workers.shape[0] * 100, 1)
print(f"Percentage of rich among those who work fewest hours: {rich_percentage}%")

Percentage of rich among those who work fewest hours: 10.0%


**7. What country has the highest percentage of people that earn >50K and what is that percentage?**

In [21]:
highest_country = (df[df["salary"] == ">50K"]["native-country"].value_counts() / df["native-country"].value_counts() * 100).sort_values(ascending = False)
highest_earning_country = highest_country.idxmax()
highest_earning_country_percentage = round(highest_country.max(), 1)
print("Country with highest percentage of rich:", highest_earning_country)
print(f"Highest percentage of rich people in country: {highest_earning_country_percentage}%")

Country with highest percentage of rich: Iran
Highest percentage of rich people in country: 41.9%
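
The division in the cell above relies on pandas aligning the two `value_counts()` results by country label. A country with no >50K rows is missing from the numerator, so its ratio becomes `NaN`, and `idxmax()` skips `NaN` by default. A sketch of that behavior on made-up rows:

```python
import pandas as pd

# Made-up rows for illustration only; not taken from adult.data.csv.
toy = pd.DataFrame({
    "native-country": ["Iran", "Iran", "Peru", "India", "India", "India"],
    "salary": [">50K", "<=50K", "<=50K", ">50K", ">50K", "<=50K"],
})

rich_counts = toy.loc[toy["salary"] == ">50K", "native-country"].value_counts()
all_counts = toy["native-country"].value_counts()

# Division aligns on the country labels; "Peru" has no >50K rows, so it is NaN.
pct = rich_counts / all_counts * 100

# idxmax and max skip NaN by default, so such countries are ignored.
best_country = pct.idxmax()
best_pct = round(pct.max(), 1)

# Optional: treat missing numerators as 0 so every country gets a number.
pct_filled = rich_counts.div(all_counts, fill_value=0) * 100

print(best_country, best_pct)
```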


**8. Identify the most popular occupation for those who earn >50K in India.**

In [13]:
top_IN_occupation = df.loc[(df["native-country"] == "India") & (df["salary"] == ">50K")]["occupation"].value_counts().idxmax()
print("Top occupations in India:", top_IN_occupation)

Top occupations in India: Prof-specialty
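
`value_counts().idxmax()` returns a single label, so a tie between occupations would go unnoticed. As a hedged alternative, `Series.mode()` returns every value that shares the highest count. A sketch on made-up rows that contain a deliberate tie:

```python
import pandas as pd

# Made-up rows for illustration only; not taken from adult.data.csv.
toy = pd.DataFrame({
    "native-country": ["India", "India", "India", "India", "Peru"],
    "salary": [">50K", ">50K", ">50K", ">50K", ">50K"],
    "occupation": ["Prof-specialty", "Exec-managerial",
                   "Prof-specialty", "Exec-managerial", "Sales"],
})

mask = (toy["native-country"] == "India") & (toy["salary"] == ">50K")

# mode() keeps all occupations tied for the highest count.
top_occupations = toy.loc[mask, "occupation"].mode().tolist()
print(top_occupations)
```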
