## Demographic data analysis

This analysis uses the "Census Income Data Set" from the UCI Machine Learning Repository.

The dataset contains information from the 1994 United States census.

More information:
https://archive.ics.uci.edu/ml/datasets/census+income

### Importing the required libraries

In [1]:
import pandas as pd

### Reading the dataset and inspecting its contents

In [2]:
df = pd.read_csv("data/adult.data.csv")

In [3]:
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [4]:
df.shape

(32561, 15)

In [5]:
df.columns

Index(['age', 'workclass', 'fnlwgt', 'education', 'education-num',
       'marital-status', 'occupation', 'relationship', 'race', 'sex',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
       'salary'],
      dtype='object')

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education-num   32561 non-null  int64 
 5   marital-status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital-gain    32561 non-null  int64 
 11  capital-loss    32561 non-null  int64 
 12  hours-per-week  32561 non-null  int64 
 13  native-country  32561 non-null  object
 14  salary          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 2.6+ MB


### Number of people of each race in the dataset
Answers the question: *How many people of each race are represented in this dataset?*

In [7]:
race_df = df["race"].value_counts()

In [8]:
race_df 

White                 27816
Black                  3124
Asian-Pac-Islander     1039
Amer-Indian-Eskimo      311
Other                   271
Name: race, dtype: int64

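The counts above can also be expressed as shares of the total. The following is a minimal sketch on a hypothetical toy DataFrame (the `toy` frame and its values are invented for illustration and are not taken from the census file); it relies on the `normalize=True` option of `value_counts`.

```python
import pandas as pd

# Hypothetical toy data standing in for the "race" column of the census file.
toy = pd.DataFrame({"race": ["White", "White", "White", "Black", "Other"]})

# Absolute counts, computed the same way as in the cell above.
race_counts = toy["race"].value_counts()

# Relative shares in percent: normalize=True divides each count by the total.
race_share = (toy["race"].value_counts(normalize=True) * 100).round(1)

print(race_counts)
print(race_share)
```

Applied to the real `df`, the same two calls would report each race both as a count and as a percentage of all rows.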
### Average age of men
Answers the question: *What is the average age of men?*

In [9]:
filt_avage_men = (df["sex"] == "Male")

In [10]:
average_age_men = df.loc[filt_avage_men, "age"].mean()

In [11]:
print(f"The average age of men is {round(average_age_men, 1)} years")

The average age of men is 39.4 years

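As an alternative to filtering one group at a time, `groupby` returns the mean age for every value of `sex` in a single call. This is a sketch on hypothetical toy data (the `toy` frame and its ages are made up for illustration), not the notebook's own method.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the census file.
toy = pd.DataFrame({
    "sex": ["Male", "Female", "Male", "Female"],
    "age": [40, 30, 50, 36],
})

# Group the rows by sex, then average the "age" column within each group.
mean_age_by_sex = toy.groupby("sex")["age"].mean()

print(mean_age_by_sex.round(1))
```

On the real `df`, `mean_age_by_sex["Male"]` should match the value computed with the boolean filter above.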

### Percentage of people who have a Bachelor's degree
Answers the question: *What is the percentage of people who have a Bachelor's degree?*

In [12]:
percentage_bachelors = ((df["education"] == "Bachelors").sum() / len(df) * 100)

In [13]:
print(f"The percentage of people who have a Bachelor's degree is: {round(percentage_bachelors, 1)}%")

The percentage of people who have a Bachelor's degree is: 16.4%


### Percentage of people with and without an advanced degree (`Bachelors`, `Masters`, or `Doctorate`)
Answers the question: Percentage of people with and without `Bachelors`, `Masters`, or `Doctorate`.

In [14]:
filt_education = df["education"].isin(["Bachelors", "Masters", "Doctorate"])

In [15]:
higher_education = df[filt_education]
# ~ is the idiomatic way to negate a boolean mask (NumPy does not support unary minus on booleans).
lower_education = df[~filt_education]

In [16]:
higher_education.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
5,37,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,<=50K
8,31,Private,45781,Masters,14,Never-married,Prof-specialty,Not-in-family,White,Female,14084,0,50,United-States,>50K


In [17]:
lower_education.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
6,49,Private,160187,9th,5,Married-spouse-absent,Other-service,Not-in-family,Black,Female,0,0,16,Jamaica,<=50K
7,52,Self-emp-not-inc,209642,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,45,United-States,>50K
10,37,Private,280464,Some-college,10,Married-civ-spouse,Exec-managerial,Husband,Black,Male,0,0,80,United-States,>50K

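The cells above split the data into two groups but do not print the percentages named in the heading. One way to obtain them is to take the mean of the boolean mask, since `True` counts as 1 and `False` as 0. The sketch below uses a hypothetical toy DataFrame; `toy`, `advanced_levels`, `pct_with` and `pct_without` are illustrative names that do not appear in the notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for the "education" column.
toy = pd.DataFrame({
    "education": ["Bachelors", "HS-grad", "Masters", "11th",
                  "Doctorate", "Some-college", "HS-grad", "9th"],
})

advanced_levels = ["Bachelors", "Masters", "Doctorate"]

# True where the person holds one of the three degrees.
has_advanced = toy["education"].isin(advanced_levels)

# The mean of a boolean Series is the fraction of True values.
pct_with = has_advanced.mean() * 100
pct_without = (~has_advanced).mean() * 100

print(f"With an advanced degree: {pct_with:.1f}%")
print(f"Without an advanced degree: {pct_without:.1f}%")
```

The two percentages always add up to 100, which is a quick sanity check when running this against the real `df`.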

### Percentage of people with and without an advanced degree who earn more than USD 50K

The possible values in the `salary` column are the following:

In [18]:
df["salary"].unique()

array(['<=50K', '>50K'], dtype=object)

#### % of people _WITH_ an advanced degree who earn more than USD 50K

In [19]:
higher_education_rich = (higher_education["salary"] == ">50K").sum() / len(higher_education) * 100

In [20]:
print(f"The percentage of people WITH an advanced degree who earn more than USD 50K is: {round(higher_education_rich, 1)}%")

The percentage of people WITH an advanced degree who earn more than USD 50K is: 46.5%


#### % of people _WITHOUT_ an advanced degree who earn more than USD 50K

In [21]:
lower_education_rich = (lower_education["salary"] == ">50K").sum() / len(lower_education) * 100

In [22]:
print(f"The percentage of people WITHOUT an advanced degree who earn more than USD 50K is: {round(lower_education_rich, 1)}%")

The percentage of people WITHOUT an advanced degree who earn more than USD 50K is: 17.4%

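Both rates can also be computed in one step by grouping the ">50K" indicator by education group. This is a sketch under assumptions: the `toy` frame and the labels `"advanced"` and `"other"` are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the census file.
toy = pd.DataFrame({
    "education": ["Bachelors", "Masters", "Doctorate", "Bachelors",
                  "HS-grad", "11th", "Some-college", "HS-grad"],
    "salary": [">50K", "<=50K", ">50K", "<=50K",
               "<=50K", ">50K", "<=50K", "<=50K"],
})

# Label each row by education group (the label names are illustrative).
group = (
    toy["education"]
    .isin(["Bachelors", "Masters", "Doctorate"])
    .map({True: "advanced", False: "other"})
)

# Boolean indicator: True where the salary is above USD 50K.
earns_over_50k = toy["salary"] == ">50K"

# The mean of the indicator within each group is the share earning >50K.
rate_by_group = earns_over_50k.groupby(group).mean() * 100

print(rate_by_group.round(1))
```

This avoids building the two intermediate DataFrames, at the cost of being slightly less explicit than the step-by-step cells above.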

### Minimum number of hours a person works per week
Answers the question: *What is the minimum number of hours a person works per week (hours-per-week feature)?*

In [23]:
min_work_hours = df["hours-per-week"].min()

In [24]:
print(f"The minimum number of hours a person works is: {min_work_hours} hours per week")

The minimum number of hours a person works is: 1 hours per week


### Percentage of people who work the minimum number of hours per week and have a salary above USD 50K
Answers the question: *What percentage of the people who work the minimum number of hours per week have a salary of >50K?*

#### Number of people who work the minimum number of hours per week:

In [25]:
filt_min_work_hours = (df["hours-per-week"] == df["hours-per-week"].min())

In [26]:
num_min_workers = len(df.loc[filt_min_work_hours].index)

In [27]:
print(f"The number of people who work the minimum number of hours is: {num_min_workers} people")

The number of people who work the minimum number of hours is: 20 people


#### % of people who work the minimum number of hours and earn more than USD 50K

In [28]:
filt_rich_percentage = (df["hours-per-week"] == df["hours-per-week"].min()) & (df["salary"] == ">50K")

In [29]:
rich_percentage = len(df[filt_rich_percentage]) / num_min_workers * 100

In [30]:
print(f"The percentage of people who work the minimum number of hours and earn more than USD 50K is: {round(rich_percentage, 1)}%")

The percentage of people who work the minimum number of hours and earn more than USD 50K is: 10.0%

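The same calculation can be written by first selecting the minimum-hours workers and then taking the mean of the ">50K" indicator within that subset. A minimal sketch on hypothetical toy data (the `toy` frame and its values are invented for illustration):

```python
import pandas as pd

# Hypothetical toy data with the same column names as the census file.
toy = pd.DataFrame({
    "hours-per-week": [1, 40, 1, 20, 1, 1],
    "salary": ["<=50K", ">50K", ">50K", "<=50K", "<=50K", "<=50K"],
})

# Keep only the rows of people who work the minimum number of hours.
min_hours = toy["hours-per-week"].min()
min_workers = toy[toy["hours-per-week"] == min_hours]
num_min_workers = len(min_workers)

# Share of that subset earning more than USD 50K, in percent.
rich_percentage = (min_workers["salary"] == ">50K").mean() * 100

print(f"{num_min_workers} people work {min_hours} hour(s) per week")
print(f"{rich_percentage:.1f}% of them earn more than USD 50K")
```

Filtering once and reusing `min_workers` keeps the denominator and the numerator tied to the same subset.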

### Country with the highest percentage of people who earn more than USD 50K
Answers the question: *What country has the highest percentage of people that earn >50K?*

#### Country with the highest proportion of people with income above USD 50K

In [31]:
df.loc[(df["salary"] == ">50K"), "native-country"].value_counts().head()

United-States    7171
?                 146
Philippines        61
Germany            44
India              40
Name: native-country, dtype: int64

In [32]:
df["native-country"].value_counts().head()

United-States    29170
Mexico             643
?                  583
Philippines        198
Germany            137
Name: native-country, dtype: int64

In [33]:
(df.loc[(df["salary"] == ">50K"), "native-country"].value_counts() / df["native-country"].value_counts()).head(15)

?                     0.250429
Cambodia              0.368421
Canada                0.322314
China                 0.266667
Columbia              0.033898
Cuba                  0.263158
Dominican-Republic    0.028571
Ecuador               0.142857
El-Salvador           0.084906
England               0.333333
France                0.413793
Germany               0.321168
Greece                0.275862
Guatemala             0.046875
Haiti                 0.090909
Name: native-country, dtype: float64

In [34]:
# Divide the per-country count of >50K earners by the per-country total and sort
# the ratios in descending order; the first index label is the country with the
# highest proportion of people with income above USD 50K.

highest_earning_country = (df.loc[(df["salary"] == ">50K"), "native-country"].value_counts() / df["native-country"].value_counts()).sort_values(ascending=False).index[0]

In [35]:
print(f"The country with the highest percentage of people who earn more than USD 50K is: {highest_earning_country}")

The country with the highest percentage of people who earn more than USD 50K is: Iran


#### Percentage for the country with the highest proportion of people with income above USD 50K

In [36]:
highest_earning_country_percentage_df = (df.loc[(df["salary"] == ">50K"), "native-country"].value_counts() / df["native-country"].value_counts() * 100).sort_values(ascending=False)

In [37]:
highest_earning_country_percentage_df.head()

Iran      41.860465
France    41.379310
India     40.000000
Taiwan    39.215686
Japan     38.709677
Name: native-country, dtype: float64

In [38]:
highest_earning_country_percentage = (df.loc[(df["salary"] == ">50K"), "native-country"].value_counts() / df["native-country"].value_counts() * 100).sort_values(ascending=False).values[0]

In [39]:
print(f"The percentage for the country with the highest proportion of people who earn more than USD 50K is: {round(highest_earning_country_percentage, 1)}%")

The percentage for the country with the highest proportion of people who earn more than USD 50K is: 41.9%

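The per-country ratio can also be computed with a single `groupby`, which additionally makes it easy to show how many rows each country contributes. The sketch below uses a hypothetical toy DataFrame with placeholder country names `"A"`, `"B"` and `"C"`; the column name `pct_over_50k` is likewise illustrative.

```python
import pandas as pd

# Hypothetical toy data; "A", "B" and "C" are placeholder country names.
toy = pd.DataFrame({
    "native-country": ["A", "A", "A", "A", "B", "B", "C"],
    "salary": [">50K", "<=50K", "<=50K", "<=50K", ">50K", ">50K", "<=50K"],
})

# Boolean indicator: True where the salary is above USD 50K.
earns_over_50k = toy["salary"] == ">50K"

# Per country: share of True values ("mean") and number of rows ("size").
summary = earns_over_50k.groupby(toy["native-country"]).agg(["mean", "size"])
summary["pct_over_50k"] = summary["mean"] * 100

# Country with the highest percentage, and that percentage.
top_country = summary["pct_over_50k"].idxmax()
top_pct = summary["pct_over_50k"].max()

print(summary)
print(f"{top_country}: {top_pct:.1f}%")
```

A country with no >50K earners (like `"C"` here) gets 0.0 with this approach, whereas dividing two `value_counts` results yields `NaN` for it because the label is missing from the numerator. The `size` column is worth inspecting as well, since a percentage based on only a few rows is less informative.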

### Most popular occupation among those who earn more than USD 50K in India
Answers the question: *Identify the most popular occupation for those who earn >50K in India*

In [40]:
df.groupby(["native-country", "salary", "occupation"])["age"].count()

native-country  salary  occupation       
?               <=50K   ?                    23
                        Adm-clerical         40
                        Craft-repair         48
                        Exec-managerial      43
                        Farming-fishing       5
                                             ..
Yugoslavia      <=50K   Transport-moving      1
                >50K    Exec-managerial       2
                        Farming-fishing       1
                        Machine-op-inspct     1
                        Other-service         2
Name: age, Length: 642, dtype: int64

In [41]:
df.groupby(["native-country", "salary", "occupation"])["age"].count().loc["India", ">50K"]

occupation
Adm-clerical         1
Exec-managerial      8
Other-service        2
Prof-specialty      25
Sales                1
Tech-support         2
Transport-moving     1
Name: age, dtype: int64

In [42]:
top_IN_occupation = df.groupby(["native-country", "salary", "occupation"])["age"].count().loc["India", ">50K"].sort_values(ascending=False).index[0]

In [43]:
print(f"The most popular occupation in India among those who earn more than USD 50K is: {top_IN_occupation}")

The most popular occupation in India among those who earn more than USD 50K is: Prof-specialty

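A shorter route to the same answer is to filter the rows first and then count occupations, instead of grouping the whole dataset by three columns. This is a sketch on hypothetical toy data (the `toy` frame and the name `top_occupation` are invented for illustration):

```python
import pandas as pd

# Hypothetical toy data with the same column names as the census file.
toy = pd.DataFrame({
    "native-country": ["India", "India", "India", "India", "Cuba"],
    "salary": [">50K", ">50K", ">50K", "<=50K", ">50K"],
    "occupation": ["Prof-specialty", "Prof-specialty", "Sales", "Sales", "Sales"],
})

# Rows for people from India who earn more than USD 50K.
mask = (toy["native-country"] == "India") & (toy["salary"] == ">50K")

# Count occupations within that subset and take the most frequent label.
top_occupation = toy.loc[mask, "occupation"].value_counts().idxmax()

print(top_occupation)
```

Filtering before counting touches only the rows of interest, which tends to be easier to read than indexing into a three-level grouped result.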