## *01. Data visualization*
* Data visualization is the process of representing data, and the information it contains, in visual or graphical form so that it can be understood more quickly and easily. It is used in many fields, including data science, statistics, market research, and business intelligence.

* Its advantages include:

    * Understanding and communicating complex patterns and relationships in the data more quickly and effectively.
    * Discovering new and valuable insights in the data.
    * Identifying problems and opportunities more quickly and easily.
    * Detecting and correcting errors in the data.
    * Improving communication and collaboration among the people who work with the data.

In [1]:
# Libraries
import pandas as pd
pd.set_option('display.max_columns', None)

In [2]:
# File paths
file_path = '../datasets/adult.data'
# Column names: values of the first column of col_names.txt, as a list
col_names = pd.read_csv('../datasets/col_names.txt').T.iloc[0].tolist()

# Read the data
data = pd.read_csv(filepath_or_buffer=file_path,
                   names=col_names)

In [11]:
# Show the first 5 rows of the dataset
data.head()

,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [12]:
# Show the last 5 rows of the dataset
data.tail()

,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
32556,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
32557,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
32558,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
32559,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,<=50K
32560,52,Self-emp-inc,287927,HS-grad,9,Married-civ-spouse,Exec-managerial,Wife,White,Female,15024,0,40,United-States,>50K


In [13]:
# Show 5 randomly sampled rows of the dataset
data.sample(5)

,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
8210,31,Private,35378,Some-college,10,Divorced,Adm-clerical,Not-in-family,White,Female,0,0,45,United-States,<=50K
1932,28,Private,162551,7th-8th,4,Married-civ-spouse,Machine-op-inspct,Other-relative,Asian-Pac-Islander,Female,0,0,48,China,<=50K
18889,26,Private,162312,Some-college,10,Never-married,Adm-clerical,Not-in-family,Asian-Pac-Islander,Male,0,0,20,Philippines,<=50K
15822,24,Private,169532,Assoc-acdm,12,Never-married,Adm-clerical,Own-child,White,Female,0,0,15,United-States,<=50K
20807,55,Self-emp-not-inc,184702,HS-grad,9,Married-civ-spouse,Sales,Husband,White,Male,0,0,40,United-States,<=50K


In [52]:
# Shape (rows, columns)
data.shape

(32561, 15)

In [14]:
# Index
data.index

RangeIndex(start=0, stop=32561, step=1)

In [15]:
# Columns
data.columns

Index(['age', 'workclass', 'fnlwgt', 'education', 'education-num',
       'marital-status', 'occupation', 'relationship', 'race', 'sex',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
       'income'],
      dtype='object')

### *Indexing*

In [17]:
# Select rows 10 through 20 (the end of the slice, 21, is excluded)
data[10:21]

,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
10,37,Private,280464,Some-college,10,Married-civ-spouse,Exec-managerial,Husband,Black,Male,0,0,80,United-States,>50K
11,30,State-gov,141297,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,Asian-Pac-Islander,Male,0,0,40,India,>50K
12,23,Private,122272,Bachelors,13,Never-married,Adm-clerical,Own-child,White,Female,0,0,30,United-States,<=50K
13,32,Private,205019,Assoc-acdm,12,Never-married,Sales,Not-in-family,Black,Male,0,0,50,United-States,<=50K
14,40,Private,121772,Assoc-voc,11,Married-civ-spouse,Craft-repair,Husband,Asian-Pac-Islander,Male,0,0,40,?,>50K
15,34,Private,245487,7th-8th,4,Married-civ-spouse,Transport-moving,Husband,Amer-Indian-Eskimo,Male,0,0,45,Mexico,<=50K
16,25,Self-emp-not-inc,176756,HS-grad,9,Never-married,Farming-fishing,Own-child,White,Male,0,0,35,United-States,<=50K
17,32,Private,186824,HS-grad,9,Never-married,Machine-op-inspct,Unmarried,White,Male,0,0,40,United-States,<=50K
18,38,Private,28887,11th,7,Married-civ-spouse,Sales,Husband,White,Male,0,0,50,United-States,<=50K
19,43,Self-emp-not-inc,292175,Masters,14,Divorced,Exec-managerial,Unmarried,White,Female,0,0,45,United-States,>50K


In [18]:
# Select a single column
data['education']

0          Bachelors
1          Bachelors
2            HS-grad
3               11th
4          Bachelors
            ...     
32556     Assoc-acdm
32557        HS-grad
32558        HS-grad
32559        HS-grad
32560        HS-grad
Name: education, Length: 32561, dtype: object

In [19]:
# Position-based indexing: the first row, as a one-row DataFrame
data.iloc[[0]]

,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K


In [20]:
# Position-based indexing: the row at position 1337
data.iloc[[1337]]

,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
1337,40,Private,240124,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,40,United-States,>50K
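
The two cells above pass a list of positions to `.iloc`, which is why the result is a one-row DataFrame. A minimal sketch of the difference between a list and a bare integer, using a hypothetical toy frame (`toy` and its values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy frame with made-up values, standing in for the real dataset
toy = pd.DataFrame({"age": [39, 50, 38],
                    "education": ["Bachelors", "Bachelors", "HS-grad"]})

row_as_frame = toy.iloc[[0]]   # list of positions -> DataFrame with one row
row_as_series = toy.iloc[0]    # single position -> Series indexed by column name

print(type(row_as_frame).__name__, row_as_frame.shape)
print(type(row_as_series).__name__, row_as_series.shape)
```

The DataFrame form keeps the tabular layout, while the Series form is convenient when individual fields are needed, for example `row_as_series["age"]`.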


In [21]:
# Label-based indexing
# All rows of the columns "education", "marital-status" and "relationship"
data.loc[:, ['education', 'marital-status', 'relationship']]

,education,marital-status,relationship
0,Bachelors,Never-married,Not-in-family
1,Bachelors,Married-civ-spouse,Husband
2,HS-grad,Divorced,Not-in-family
3,11th,Married-civ-spouse,Husband
4,Bachelors,Married-civ-spouse,Wife
...,...,...,...
32556,Assoc-acdm,Married-civ-spouse,Wife
32557,HS-grad,Married-civ-spouse,Husband
32558,HS-grad,Widowed,Unmarried
32559,HS-grad,Never-married,Own-child


In [22]:
# First value of the "education" column (note the leading space in the result)
data.loc[0, 'education']

' Bachelors'

In [23]:
# The first 10 rows of the columns "education", "marital-status" and "relationship"
# (.loc slices by label and includes both endpoints, so 0:9 yields 10 rows)
data.loc[0:9, ['education', 'marital-status', 'relationship']]

,education,marital-status,relationship
0,Bachelors,Never-married,Not-in-family
1,Bachelors,Married-civ-spouse,Husband
2,HS-grad,Divorced,Not-in-family
3,11th,Married-civ-spouse,Husband
4,Bachelors,Married-civ-spouse,Wife
5,Masters,Married-civ-spouse,Wife
6,9th,Married-spouse-absent,Not-in-family
7,HS-grad,Married-civ-spouse,Husband
8,Masters,Never-married,Not-in-family
9,Bachelors,Married-civ-spouse,Husband
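
Label slices with `.loc` include both endpoints, whereas positional slices (`.iloc` and plain `[]` on rows) exclude the end. A small sketch of that difference on a hypothetical toy frame with the default integer index (`toy` is illustrative only):

```python
import pandas as pd

# Toy frame with a default RangeIndex 0..11
toy = pd.DataFrame({"value": list("abcdefghijkl")})

by_label = toy.loc[0:10]      # label slice: both endpoints included
by_position = toy.iloc[0:10]  # positional slice: end excluded
plain_slice = toy[0:10]       # plain [] row slicing is positional as well

print(len(by_label), len(by_position), len(plain_slice))
```

With the default index the labels and positions coincide, so the only visible difference is the extra row returned by `.loc`.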


### *Indexing (filtering) by conditions*

In [28]:
data['marital-status'].value_counts()

 Married-civ-spouse       14976
 Never-married            10683
 Divorced                  4443
 Separated                 1025
 Widowed                    993
 Married-spouse-absent      418
 Married-AF-spouse           23
Name: marital-status, dtype: int64
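
When proportions are more informative than raw counts, `value_counts` accepts `normalize=True`. A minimal sketch on a hypothetical four-value Series (the values are made up; only the column name mirrors the dataset):

```python
import pandas as pd

# Made-up sample; the leading spaces mimic the raw labels
status = pd.Series([" Divorced", " Never-married", " Never-married", " Widowed"],
                   name="marital-status")

shares = status.value_counts(normalize=True)  # fractions that sum to 1
print(shares)
```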

In [38]:
data[data.loc[:, 'marital-status'] == 'Separated']

,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income


In [33]:
# The previous cell returns an empty DataFrame. Why?
# Listing the unique values shows that every label starts with a space
data.loc[:, 'marital-status'].unique()

array([' Never-married', ' Married-civ-spouse', ' Divorced',
       ' Married-spouse-absent', ' Separated', ' Married-AF-spouse',
       ' Widowed'], dtype=object)
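
The leading spaces come from the `, ` separators in the raw file. The cells that follow simply include the space in each comparison; as an alternative, the strings can be normalized. A hedged sketch with two options, run on a hypothetical three-row extract rather than the real file (`raw` and its contents are made up for illustration):

```python
import io
import pandas as pd

# Hypothetical extract that mimics the ", " separators of the raw file
raw = "39, Never-married\n54, Separated\n49, Separated\n"
cols = ["age", "marital-status"]

as_is = pd.read_csv(io.StringIO(raw), names=cols)
print((as_is["marital-status"] == "Separated").sum())  # leading spaces prevent any match

# Option 1: strip the whitespace after loading
stripped = as_is["marital-status"].str.strip()
print((stripped == "Separated").sum())

# Option 2: let the parser skip spaces that follow the delimiter
clean = pd.read_csv(io.StringIO(raw), names=cols, skipinitialspace=True)
print((clean["marital-status"] == "Separated").sum())
```

Option 2 cleans every text column at load time, while option 1 is useful when the DataFrame has already been read.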

In [36]:
data[data.loc[:, 'marital-status'] == ' Separated']

,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
21,54,Private,302146,HS-grad,9,Separated,Other-service,Unmarried,Black,Female,0,0,20,United-States,<=50K
43,49,Private,94638,HS-grad,9,Separated,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
71,31,Private,309974,Bachelors,13,Separated,Sales,Own-child,Black,Female,0,0,40,United-States,<=50K
157,71,Self-emp-not-inc,494223,Some-college,10,Separated,Sales,Unmarried,Black,Male,0,1816,2,United-States,<=50K
159,42,Private,228456,Bachelors,13,Separated,Other-service,Other-relative,Black,Male,0,0,50,United-States,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32481,52,Private,301229,Assoc-voc,11,Separated,Sales,Unmarried,White,Female,0,0,40,United-States,<=50K
32509,32,Private,192965,HS-grad,9,Separated,Sales,Not-in-family,White,Female,0,0,45,United-States,<=50K
32529,29,Private,125976,HS-grad,9,Separated,Sales,Unmarried,White,Female,0,0,35,United-States,<=50K
32540,45,State-gov,252208,HS-grad,9,Separated,Adm-clerical,Own-child,White,Female,0,0,40,United-States,<=50K


In [31]:
data[data.loc[:, 'marital-status'] == ' Widowed']

,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
147,36,Private,108293,HS-grad,9,Widowed,Other-service,Unmarried,White,Female,0,0,24,United-States,<=50K
167,46,State-gov,102628,Masters,14,Widowed,Protective-serv,Unmarried,White,Male,0,0,40,United-States,<=50K
169,66,Local-gov,54826,Assoc-voc,11,Widowed,Prof-specialty,Not-in-family,White,Female,0,0,20,United-States,<=50K
228,75,Private,314209,Assoc-voc,11,Widowed,Adm-clerical,Not-in-family,White,Female,0,0,20,Columbia,<=50K
252,59,Local-gov,171328,10th,6,Widowed,Other-service,Unmarried,Black,Female,0,0,30,United-States,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32445,61,Private,190682,HS-grad,9,Widowed,Craft-repair,Not-in-family,Black,Female,0,1669,50,United-States,<=50K
32446,53,Private,158993,HS-grad,9,Widowed,Machine-op-inspct,Unmarried,Black,Female,0,0,38,United-States,<=50K
32460,62,Private,128092,HS-grad,9,Widowed,Adm-clerical,Not-in-family,White,Female,0,0,32,United-States,<=50K
32465,66,Private,269665,HS-grad,9,Widowed,Exec-managerial,Not-in-family,White,Female,0,0,25,United-States,<=50K


In [41]:
# Adult women (18 or older) who work 40 or more hours per week
data.loc[(data.loc[:, 'age'] >= 18) &
         (data.loc[:, 'sex'] == ' Female') &
         (data.loc[:, 'hours-per-week'] >= 40)]

,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
5,37,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,<=50K
8,31,Private,45781,Masters,14,Never-married,Prof-specialty,Not-in-family,White,Female,14084,0,50,United-States,>50K
19,43,Self-emp-not-inc,292175,Masters,14,Divorced,Exec-managerial,Unmarried,White,Female,0,0,45,United-States,>50K
24,59,Private,109015,HS-grad,9,Divorced,Tech-support,Unmarried,White,Female,0,0,40,United-States,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32543,45,Local-gov,119199,Assoc-acdm,12,Divorced,Prof-specialty,Unmarried,White,Female,0,0,48,United-States,<=50K
32546,37,Private,198216,Assoc-acdm,12,Divorced,Tech-support,Not-in-family,White,Female,0,0,40,United-States,<=50K
32549,43,State-gov,255835,Some-college,10,Divorced,Adm-clerical,Other-relative,White,Female,0,0,40,United-States,<=50K
32558,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K


In [51]:
# Insight 1
print(f"""Of all the records in our dataset, {(6647 / data.shape[0] * 100):0.2f}% are adult women who work 40 or more hours per week""")

Of all the records in our dataset, 20.41% are adult women who work 40 or more hours per week
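
The count in the cell above is hard-coded. A minimal sketch of deriving both the count and the percentage from the boolean mask itself, so the figure stays in sync with the filter. The frame `toy` and its values are hypothetical and stand in for `data`:

```python
import pandas as pd

# Made-up rows; column names and leading spaces mirror the dataset
toy = pd.DataFrame({
    "age": [17, 28, 37, 45, 52],
    "sex": [" Female", " Female", " Female", " Male", " Female"],
    "hours-per-week": [40, 40, 35, 60, 50],
})

mask = ((toy["age"] >= 18) &
        (toy["sex"] == " Female") &
        (toy["hours-per-week"] >= 40))

count = int(mask.sum())    # True counts as 1, so the sum is the number of matches
share = mask.mean() * 100  # the mean of a boolean mask is the matching fraction

print(f"{count} of {len(toy)} rows ({share:0.2f}%)")
```

The same pattern applies to any combination of conditions, since only the mask changes.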


In [72]:
# Men whose native country is Mexico and who work in Tech-support
data.loc[(data.loc[:, 'sex'] == ' Male') &
         (data.loc[:, 'native-country'] == ' Mexico') &
         (data.loc[:, 'occupation'] == ' Tech-support')]

,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
6040,25,Private,306352,Assoc-voc,11,Married-civ-spouse,Tech-support,Husband,White,Male,0,0,40,Mexico,<=50K
23723,25,Private,245628,Some-college,10,Never-married,Tech-support,Own-child,White,Male,0,0,15,Mexico,<=50K
32031,35,Private,195516,7th-8th,4,Married-civ-spouse,Tech-support,Husband,White,Male,0,0,40,Mexico,<=50K


In [73]:
# Insight 2
print(f"""Of all the records in our dataset, {(3 / data.shape[0] * 100):0.2f}% are men whose native country is Mexico and who work in Tech-support""")

Of all the records in our dataset, 0.01% are men whose native country is Mexico and who work in Tech-support


---
---