
# Loading modules (libraries)


### Inspecting a DataFrame

When you have a DataFrame to work with, the first step is to explore it and see what it contains. Several methods and attributes help with this:

* **.head()** returns the first rows of the DataFrame (five by default).
* **.info()** shows information about each column, such as its data type and the number of non-null values, from which the number of missing values follows.
* **.shape** returns a tuple with the number of rows and columns of the DataFrame.
* **.describe()** computes summary statistics for each column (numeric columns only, by default).

In [None]:
import numpy as np
import pandas as pd

In [None]:
path = "https://raw.githubusercontent.com/stivenlopezg/Modulo-Python-3/master/data/diabetes.csv"

schema = {'PatientID': 'category',
          'Diabetic': 'category'}

diabetes = pd.read_csv(path, dtype=schema)
print(f"The dataset has {diabetes.shape[0]} observations and {diabetes.shape[1]} variables.")
diabetes.head()

The dataset has 15000 observations and 10 variables.


Unnamed: 0,PatientID,Pregnancies,PlasmaGlucose,DiastolicBloodPressure,TricepsThickness,SerumInsulin,BMI,DiabetesPedigree,Age,Diabetic
0,1354778,0,171,80,34,23,43.509726,1.213191,21,0
1,1147438,8,92,93,47,36,21.240576,0.158365,23,0
2,1640031,7,115,47,52,35,41.511523,0.079019,23,0
3,1883350,9,103,78,25,304,29.582192,1.28287,43,1
4,1424119,1,85,59,27,35,42.604536,0.549542,22,0


In [None]:
diabetes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15000 entries, 0 to 14999
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype   
---  ------                  --------------  -----   
 0   PatientID               15000 non-null  category
 1   Pregnancies             15000 non-null  int64   
 2   PlasmaGlucose           15000 non-null  int64   
 3   DiastolicBloodPressure  15000 non-null  int64   
 4   TricepsThickness        15000 non-null  int64   
 5   SerumInsulin            15000 non-null  int64   
 6   BMI                     15000 non-null  float64 
 7   DiabetesPedigree        15000 non-null  float64 
 8   Age                     15000 non-null  int64   
 9   Diabetic                15000 non-null  category
dtypes: category(2), float64(2), int64(6)
memory usage: 1.7 MB


In [None]:
diabetes.describe()

Unnamed: 0,Pregnancies,PlasmaGlucose,DiastolicBloodPressure,TricepsThickness,SerumInsulin,BMI,DiabetesPedigree,Age
count,15000.0,15000.0,15000.0,15000.0,15000.0,15000.0,15000.0,15000.0
mean,3.224533,107.856867,71.220667,28.814,137.852133,31.509646,0.398968,30.137733
std,3.39102,31.981975,16.758716,14.555716,133.068252,9.759,0.377944,12.089703
min,0.0,44.0,24.0,7.0,14.0,18.200512,0.078044,21.0
25%,0.0,84.0,58.0,15.0,39.0,21.259887,0.137743,22.0
50%,2.0,104.0,72.0,31.0,83.0,31.76794,0.200297,24.0
75%,6.0,129.0,85.0,41.0,195.0,39.259692,0.616285,35.0
max,14.0,192.0,117.0,93.0,799.0,56.034628,2.301594,77.0


In [None]:
diabetes.describe(exclude='number')

Unnamed: 0,PatientID,Diabetic
count,15000,15000
unique,14895,2
top,1127499,0
freq,2,10000


The dataset has 10 variables and 15000 records. The variables are:

* *PatientID*: The patient's ID.
* *Pregnancies*: The number of pregnancies.
* *PlasmaGlucose*: Plasma glucose concentration at 2 hours in an oral glucose tolerance test.
* *DiastolicBloodPressure*: Diastolic blood pressure.
* *TricepsThickness*: Triceps skinfold thickness.
* *SerumInsulin*: Two-hour serum insulin.
* *BMI*: Body mass index.
* *DiabetesPedigree*: Diabetes pedigree function.
* *Age*: Age in years.
* *Diabetic*: Whether the person has diabetes (Yes=1, No=0).

### Parts of a DataFrame

It is useful to know that a DataFrame consists of three components, stored as attributes:

* **.values**: A two-dimensional *NumPy* array of values.
* **.columns**: An index of columns: the column names.
* **.index**: An index for the rows: row numbers or row names.
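
The three components can be inspected on a small made-up DataFrame. This is only a sketch: the `toy` frame and its values are assumptions for illustration, not part of the diabetes data. It also uses `.to_numpy()`, which the pandas documentation recommends over `.values`, and shows why mixed column types produce an `object` array like the one printed for `diabetes.values`.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
toy = pd.DataFrame(
    {"name": ["a", "b", "c"], "score": [1.5, 2.0, 3.5]},
    index=["r1", "r2", "r3"],
)

row_labels = list(toy.index)       # labels of the rows
column_labels = list(toy.columns)  # names of the columns

# Text and floats together force a generic object dtype.
values = toy.to_numpy()

# A purely numeric selection keeps a numeric dtype.
numeric_values = toy[["score"]].to_numpy()

print(row_labels, column_labels)
print(values.shape, values.dtype, numeric_values.dtype)
```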

In [None]:
diabetes.index

RangeIndex(start=0, stop=15000, step=1)

In [None]:
diabetes.columns

Index(['PatientID', 'Pregnancies', 'PlasmaGlucose', 'DiastolicBloodPressure',
       'TricepsThickness', 'SerumInsulin', 'BMI', 'DiabetesPedigree', 'Age',
       'Diabetic'],
      dtype='object')

In [None]:
diabetes.values

array([['1354778', 0, 171, ..., 1.213191354, 21, '0'],
       ['1147438', 8, 92, ..., 0.158364981, 23, '0'],
       ['1640031', 7, 115, ..., 0.079018568, 23, '0'],
       ...,
       ['1742742', 0, 93, ..., 0.427048955, 24, '0'],
       ['1099353', 0, 132, ..., 0.302257208, 23, '0'],
       ['1386396', 3, 114, ..., 0.14736285, 34, '1']], dtype=object)

### Sorting rows

To understand the data a little better, we can sort the rows with the **.sort_values()** method.
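
As a minimal sketch on made-up data (the `toy` frame and its `group` and `age` columns are assumptions for illustration), sorting by several keys takes one direction per key, and the method returns a new DataFrame unless `inplace=True` is passed:

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
toy = pd.DataFrame({"group": ["b", "a", "b", "a"], "age": [30, 25, 22, 41]})

# Sort by group (A to Z) and, within each group, by age from oldest to youngest.
ordered = toy.sort_values(by=["group", "age"], ascending=[True, False])

# sort_values returns a new DataFrame by default; toy keeps its original order.
print(ordered)
print(toy["age"].tolist())
```

The original row labels travel with the rows, which is why the sorted output of the notebook shows a shuffled index.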


In [None]:
diabetes.sort_values(by=['PatientID', 'Age'], ascending=[True, True], inplace=True)

# diabetes = diabetes.sort_values(by=['PatientID', 'Age'], ascending=[True, True])

In [None]:
diabetes.head(10)

Unnamed: 0,PatientID,Pregnancies,PlasmaGlucose,DiastolicBloodPressure,TricepsThickness,SerumInsulin,BMI,DiabetesPedigree,Age,Diabetic
2294,1000038,0,117,88,32,607,43.906257,0.810828,23,0
12189,1000069,3,76,81,19,47,36.775178,0.084195,38,0
12633,1000118,0,144,95,7,34,37.317559,0.166225,21,0
1515,1000183,1,164,102,22,40,49.822233,0.294454,24,0
5928,1000326,3,116,43,45,207,29.83378,1.344658,46,1
3986,1000340,0,92,24,11,37,31.619982,0.124932,23,0
3532,1000471,1,53,59,37,51,35.250901,0.618805,22,0
11262,1000482,7,90,52,33,190,19.080371,0.250355,35,0
9611,1000510,1,88,60,8,20,21.141686,0.268369,23,0
6255,1000652,2,131,74,42,206,28.126824,0.196287,44,1


### Selecting columns

Often you will not need every variable in your dataset, or you will want to experiment with only a few. Square brackets (**[]**) select just the columns of interest. For example, to select the `salary` column of the DataFrame `data`, write `data["salary"]`. To select several columns, pass a list of names, as in `diabetes[["Age", "Diabetic"]]`.
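
A short sketch of the difference between single and double brackets. The `data` frame below is a made-up stand-in for the `data`/`salary` example in the text; its other columns and values are assumptions.

```python
import pandas as pd

# Hypothetical `data` frame matching the `salary` example in the text.
data = pd.DataFrame({"name": ["Ana", "Luis"], "salary": [1200, 1500], "age": [31, 45]})

one_column = data["salary"]             # a single name returns a Series
as_frame = data[["salary"]]             # a list with one name returns a one-column DataFrame
two_columns = data[["name", "salary"]]  # a list of names returns a DataFrame

print(type(one_column).__name__, type(as_frame).__name__, two_columns.shape)
```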


In [None]:
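diabetes.Diabetic      # attribute access
diabetes["Diabetic"]   # bracket access returns the same Series; only this last expression is displayed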

2294     0
12189    0
12633    0
1515     0
5928     1
        ..
11580    1
3473     1
1326     0
11026    0
4012     0
Name: Diabetic, Length: 15000, dtype: category
Categories (2, object): ['0', '1']

In [None]:
diabetes[["Age", "Diabetic"]].sample(n=10, random_state=42)

Unnamed: 0,Age,Diabetic
6327,22,0
2484,25,0
14206,57,1
2796,46,1
10299,23,1
3946,46,1
808,35,0
14709,30,1
7915,39,1
12416,21,0


In [None]:
import time

start = time.time()

diabetes.iloc[:, [5, 9]]

end = time.time() - start
print(end)

0.001051187515258789


In [None]:
start = time.time()

diabetes.loc[:, ["SerumInsulin", "Diabetic"]]

end = time.time() - start
print(end)

0.0009334087371826172



### Selecting rows

To find a subset of rows that matches some criterion, perhaps the most common approach is to build a logical expression that evaluates to **True** or **False** for each row and pass it inside square brackets.
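
The idea can be sketched on a few made-up rows (the `toy` frame is an assumption; the column names mirror the diabetes data). Each comparison yields a boolean Series, and the combined mask decides which rows survive:

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
toy = pd.DataFrame({"Age": [21, 27, 28, 35], "BMI": [25.0, 31.0, 22.0, 29.0]})

# Wrap each comparison in parentheses: & and | bind more tightly
# than comparison operators such as >= and <.
mask = (toy["Age"] >= 21) & (toy["Age"] < 28) & (toy["BMI"] < 30)

selected = toy[mask]   # rows where the mask is True
rejected = toy[~mask]  # ~ negates the mask

print(mask.tolist())
print(selected)
```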


In [None]:
diabetes.reset_index(drop=True, inplace=True)

In [None]:
indices = diabetes[(diabetes["Age"] >= 21) & (diabetes['Age'] < 28) & (diabetes["BMI"] < 30)].index

diabetes.iloc[indices, :]

Unnamed: 0,PatientID,Pregnancies,PlasmaGlucose,DiastolicBloodPressure,TricepsThickness,SerumInsulin,BMI,DiabetesPedigree,Age,Diabetic
8,1000510,1,88,60,8,20,21.141686,0.268369,23,0
10,1000677,10,114,79,35,161,21.295850,0.181434,23,0
12,1000963,0,101,53,8,28,20.415181,0.502161,24,0
19,1001867,6,166,55,42,38,21.899396,0.116333,22,0
25,1002229,1,73,96,35,498,20.846682,0.738083,22,0
...,...,...,...,...,...,...,...,...,...,...
14980,1998920,3,144,90,35,162,19.147830,0.290215,23,0
14993,1999551,0,129,49,10,169,20.197111,0.175101,24,0
14995,1999765,14,146,75,44,57,22.873571,0.098536,22,1
14997,1999864,7,84,48,21,152,29.801928,0.098014,26,0


In [None]:
diabetes.loc[(diabetes["Age"] >= 21) & (diabetes['Age'] < 28) & (diabetes["BMI"] < 30), ["Age", "BMI", "Diabetic"]]

Unnamed: 0,Age,BMI,Diabetic
8,23,21.141686,0
10,23,21.295850,0
12,24,20.415181,0
19,22,21.899396,0
25,22,20.846682,0
...,...,...,...
14980,23,19.147830,0
14993,24,20.197111,0
14995,22,22.873571,1
14997,26,29.801928,0


In [None]:
diabetes.query("21 <= Age < 28 and BMI < 30")  # same bounds as the two filters above

Unnamed: 0,PatientID,Pregnancies,PlasmaGlucose,DiastolicBloodPressure,TricepsThickness,SerumInsulin,BMI,DiabetesPedigree,Age,Diabetic
8,1000510,1,88,60,8,20,21.141686,0.268369,23,0
10,1000677,10,114,79,35,161,21.295850,0.181434,23,0
12,1000963,0,101,53,8,28,20.415181,0.502161,24,0
19,1001867,6,166,55,42,38,21.899396,0.116333,22,0
25,1002229,1,73,96,35,498,20.846682,0.738083,22,0
...,...,...,...,...,...,...,...,...,...,...
14980,1998920,3,144,90,35,162,19.147830,0.290215,23,0
14993,1999551,0,129,49,10,169,20.197111,0.175101,24,0
14995,1999765,14,146,75,44,57,22.873571,0.098536,22,1
14997,1999864,7,84,48,21,152,29.801928,0.098014,26,0


### Adding new columns

Adding new columns to a DataFrame is one of the most common operations. When building a model with machine learning or statistical algorithms, this step is called feature engineering.

We can create columns from scratch or derive them from other columns, for example by combining columns or changing their units.
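
A minimal sketch of both kinds of derived column, on made-up data. The `toy` frame and the `height_cm`, `weight_kg`, `height_m`, `bmi` and `tall` names are assumptions chosen for illustration; they do not exist in the diabetes dataset.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
toy = pd.DataFrame({"height_cm": [160.0, 175.0], "weight_kg": [64.0, 70.0]})

# Change of units: centimetres to metres.
toy["height_m"] = toy["height_cm"] / 100

# Combine two columns into a new one (body mass index = kg / m^2).
toy["bmi"] = toy["weight_kg"] / toy["height_m"] ** 2

# assign() returns a copy with the extra column and leaves toy untouched.
with_flag = toy.assign(tall=toy["height_m"] > 1.70)

print(with_flag)
```

Assigning with brackets modifies the DataFrame in place, while `assign()` is convenient when the original should stay unchanged.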


In [None]:
diabetes.drop(labels="A", axis=1, inplace=True, errors="ignore")  # "A" may not exist yet; errors="ignore" avoids a KeyError

In [None]:
diabetes.head()

Unnamed: 0,PatientID,Pregnancies,PlasmaGlucose,DiastolicBloodPressure,TricepsThickness,SerumInsulin,BMI,DiabetesPedigree,Age,Diabetic
0,1000038,0,117,88,32,607,43.906257,0.810828,23,0
1,1000069,3,76,81,19,47,36.775178,0.084195,38,0
2,1000118,0,144,95,7,34,37.317559,0.166225,21,0
3,1000183,1,164,102,22,40,49.822233,0.294454,24,0
4,1000326,3,116,43,45,207,29.83378,1.344658,46,1


In [None]:
diabetes['A'] = np.nan

diabetes.head()

Unnamed: 0,PatientID,Pregnancies,PlasmaGlucose,DiastolicBloodPressure,TricepsThickness,SerumInsulin,BMI,DiabetesPedigree,Age,Diabetic,A
0,1000038,0,117,88,32,607,43.906257,0.810828,23,0,
1,1000069,3,76,81,19,47,36.775178,0.084195,38,0,
2,1000118,0,144,95,7,34,37.317559,0.166225,21,0,
3,1000183,1,164,102,22,40,49.822233,0.294454,24,0,
4,1000326,3,116,43,45,207,29.83378,1.344658,46,1,


In [None]:
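diabetes.drop(labels='A', axis=1, inplace=True)  # Optional: if you run this, re-create "A" before calling pop below

In [None]:
# diabetes['A'] = np.nan  # uncomment if "A" was dropped in the previous cell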

a = diabetes.pop("A")

type(a)

pandas.core.series.Series

In [None]:
diabetes.insert(loc=0, column="A", value=a)

diabetes.head()

Unnamed: 0,A,PatientID,Pregnancies,PlasmaGlucose,DiastolicBloodPressure,TricepsThickness,SerumInsulin,BMI,DiabetesPedigree,Age,Diabetic
0,,1000038,0,117,88,32,607,43.906257,0.810828,23,0
1,,1000069,3,76,81,19,47,36.775178,0.084195,38,0
2,,1000118,0,144,95,7,34,37.317559,0.166225,21,0
3,,1000183,1,164,102,22,40,49.822233,0.294454,24,0
4,,1000326,3,116,43,45,207,29.83378,1.344658,46,1


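### Lambda functions

Lambda functions are anonymous functions: short functions defined in a single expression with the `lambda` keyword, without a `def` statement or a name.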
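
As a small sketch, a lambda and a `def` can express the same function, and lambdas are most useful inline, for instance as a sort key. The `patients` list below is made-up data.

```python
# A lambda is an expression that evaluates to a function object.
square = lambda x: x ** 2

# Equivalent named function written with def.
def square_def(x):
    return x ** 2

# Inline use as a sort key: order the tuples by their second element.
patients = [("ana", 34), ("luis", 21), ("eva", 27)]  # hypothetical data
by_age = sorted(patients, key=lambda patient: patient[1])

print(square(3), square_def(3), by_age)
```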

In [None]:
calculate_square = lambda x: x ** 2

calculate_square(2)

4

In [None]:
suma = lambda x, y: x + y

suma(5, 7)

12

In [None]:
diabetes["diabetes_apply"] = diabetes["Diabetic"].apply(lambda x: "Si" if x == "1" else "No")
diabetes.head()

Unnamed: 0,A,PatientID,Pregnancies,PlasmaGlucose,DiastolicBloodPressure,TricepsThickness,SerumInsulin,BMI,DiabetesPedigree,Age,Diabetic,diabetes_apply
0,,1000038,0,117,88,32,607,43.906257,0.810828,23,0,No
1,,1000069,3,76,81,19,47,36.775178,0.084195,38,0,No
2,,1000118,0,144,95,7,34,37.317559,0.166225,21,0,No
3,,1000183,1,164,102,22,40,49.822233,0.294454,24,0,No
4,,1000326,3,116,43,45,207,29.83378,1.344658,46,1,Si


In [None]:
diabetes["diabetes_map"] = diabetes["Diabetic"].map({'1': "Si", "0": "No"})
diabetes.head()

Unnamed: 0,A,PatientID,Pregnancies,PlasmaGlucose,DiastolicBloodPressure,TricepsThickness,SerumInsulin,BMI,DiabetesPedigree,Age,Diabetic,diabetes_apply,diabetes_map
0,,1000038,0,117,88,32,607,43.906257,0.810828,23,0,No,No
1,,1000069,3,76,81,19,47,36.775178,0.084195,38,0,No,No
2,,1000118,0,144,95,7,34,37.317559,0.166225,21,0,No,No
3,,1000183,1,164,102,22,40,49.822233,0.294454,24,0,No,No
4,,1000326,3,116,43,45,207,29.83378,1.344658,46,1,Si,Si



### Descriptive statistics

You can use methods such as:

* **.mean()**
* **.median()**
* **.min()**
* **.max()**
* **.quantile()**

and many others to compute statistics on the data. In some cases, however, you will need your own custom functions. The **.agg()** method lets you apply custom functions to a DataFrame and apply functions to more than one column at a time, which makes aggregations more concise.
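
A minimal sketch of **.agg()** with a custom function, on made-up data. The `toy` frame and the `value_range` helper are assumptions for illustration, not something defined elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
toy = pd.DataFrame({"Age": [21, 35, 50], "BMI": [20.0, 30.5, 41.0]})

def value_range(column):
    """Hypothetical custom statistic: difference between the max and the min."""
    return column.max() - column.min()

# The custom function is applied to each selected column.
ranges = toy[["Age", "BMI"]].agg(value_range)

# Several built-in statistics at once, given by name.
summary = toy[["Age", "BMI"]].agg(["min", "max", "mean"])

print(ranges)
print(summary)
```

With a single function the result is a Series with one value per column; with a list of functions it is a DataFrame with one row per statistic.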


In [None]:
diabetes["sum_age_serum_insulin"] = diabetes[["Age", "SerumInsulin"]].sum(axis=1)
diabetes.head()

Unnamed: 0,A,PatientID,Pregnancies,PlasmaGlucose,DiastolicBloodPressure,TricepsThickness,SerumInsulin,BMI,DiabetesPedigree,Age,Diabetic,diabetes_apply,diabetes_map,sum_age_serum_insulin
0,,1000038,0,117,88,32,607,43.906257,0.810828,23,0,No,No,630
1,,1000069,3,76,81,19,47,36.775178,0.084195,38,0,No,No,85
2,,1000118,0,144,95,7,34,37.317559,0.166225,21,0,No,No,55
3,,1000183,1,164,102,22,40,49.822233,0.294454,24,0,No,No,64
4,,1000326,3,116,43,45,207,29.83378,1.344658,46,1,Si,Si,253


In [None]:
diabetes[["Age", "SerumInsulin"]].mean(axis=1)

0        315.0
1         42.5
2         27.5
3         32.0
4        126.5
         ...  
14995     39.5
14996     72.5
14997     89.0
14998     95.5
14999    117.5
Length: 15000, dtype: float64

### Removing duplicate data

To check for duplicated data, two useful methods are:

* **.duplicated()**
* **.drop_duplicates()**

Both methods share two important arguments. The first is *subset*, which restricts the search for duplicates to the given columns. The second is *keep*, which determines which duplicate to keep: `'first'`, `'last'`, or `False` to mark or drop all of them.
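
The effect of *keep* can be sketched on a few made-up rows (the `toy` frame is an assumption; only the `PatientID` column name comes from the dataset):

```python
import pandas as pd

# Hypothetical toy data: patient "p1" appears twice.
toy = pd.DataFrame({"PatientID": ["p1", "p2", "p1", "p3"], "Age": [30, 41, 31, 25]})

# keep=False flags every row whose PatientID is repeated.
all_repeats = toy.duplicated(subset=["PatientID"], keep=False)

# keep="first" (the default) flags only the later occurrences.
later_repeats = toy.duplicated(subset=["PatientID"], keep="first")

first_kept = toy.drop_duplicates(subset=["PatientID"], keep="first")
none_kept = toy.drop_duplicates(subset=["PatientID"], keep=False)

print(all_repeats.tolist(), later_repeats.tolist())
print(first_kept["PatientID"].tolist(), none_kept["PatientID"].tolist())
```

Note that `keep=False` discards every copy of a repeated ID, which is a stricter choice than keeping one record per patient.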


In [None]:
diabetes[diabetes.duplicated(subset=["PatientID"], keep=False)].shape[0]/diabetes.shape[0]

0.014

In [None]:
diabetes.drop_duplicates(subset=["PatientID"], keep=False, inplace=True)

### Grouping

We can group data with the **.groupby()** method, whose first argument, *by*, takes a string or a list with the columns to group by.


In [None]:
table = diabetes.groupby(by="Diabetic", as_index=False)[["Pregnancies", "BMI"]].mean()  # select several columns with a list
table


Unnamed: 0,Diabetic,Pregnancies,BMI
0,0,2.251927,30.063947
1,1,5.175862,34.441364


In [None]:
diabetes.groupby(by="Diabetic")[["Pregnancies", "BMI"]].mean().reset_index()


Unnamed: 0,Diabetic,Pregnancies,BMI
0,0,2.251927,30.063947
1,1,5.175862,34.441364


In [None]:
table = diabetes.groupby(by=["Diabetic", "diabetes_map"])[["Pregnancies", "BMI"]].median()


In [None]:
table.index

MultiIndex([('0', 'No'),
            ('0', 'Si'),
            ('1', 'No'),
            ('1', 'Si')],
           names=['Diabetic', 'diabetes_map'])

In [None]:
table

Diabetic,diabetes_map,Pregnancies,BMI
0,No,1.0,28.453579
0,Si,,
1,No,,
1,Si,5.0,33.727606
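
The rows with missing values in this table are consistent with how pandas handles categorical grouping keys: with `observed=False`, every combination of categories is reported, including combinations that never occur in the data. The sketch below uses made-up data and assumes both keys have `category` dtype, as `Diabetic` does here; the `toy` frame and its column names are illustrative.

```python
import pandas as pd

# Hypothetical toy data with two categorical keys.
toy = pd.DataFrame(
    {
        "flag": pd.Categorical(["0", "0", "1"]),
        "label": pd.Categorical(["No", "No", "Si"]),
        "value": [1.0, 3.0, 5.0],
    }
)

# observed=False: all 2 x 2 category combinations appear, unobserved ones as NaN.
all_combos = toy.groupby(["flag", "label"], observed=False)["value"].median()

# observed=True: only combinations present in the data are kept.
seen_only = toy.groupby(["flag", "label"], observed=True)["value"].median()

print(all_combos)
print(seen_only)
```

Passing `observed` explicitly is a reasonable habit, since the default has changed across pandas versions.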


In [None]:
table.loc[[('0', 'No')], "Pregnancies"]

Diabetic  diabetes_map
0         No              1.0
Name: Pregnancies, dtype: float64

In [None]:
table = diabetes.groupby(by=["Diabetic"])[["Pregnancies", "BMI"]].agg(['min', 'max', 'mean'])
table


Diabetic,Pregnancies min,Pregnancies max,Pregnancies mean,BMI min,BMI max,BMI mean
0,0,11,2.251927,18.200512,51.418626,30.063947
1,0,14,5.175862,18.218614,56.034628,34.441364


In [None]:
table.columns

MultiIndex([('Pregnancies',  'min'),
            ('Pregnancies',  'max'),
            ('Pregnancies', 'mean'),
            (        'BMI',  'min'),
            (        'BMI',  'max'),
            (        'BMI', 'mean')],
           )
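
When a two-level column index like this one is inconvenient, two possible ways to get flat column names are sketched below on made-up data. The `toy` frame and the output names `pregnancies_max` and `bmi_mean` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
toy = pd.DataFrame(
    {
        "Diabetic": ["0", "0", "1", "1"],
        "Pregnancies": [0, 2, 4, 6],
        "BMI": [20.0, 30.0, 32.0, 36.0],
    }
)

# Option 1, named aggregation: output_name=(input_column, function).
flat = toy.groupby("Diabetic").agg(
    pregnancies_max=("Pregnancies", "max"),
    bmi_mean=("BMI", "mean"),
)

# Option 2: keep the two-level result and join both levels with "_".
multi = toy.groupby("Diabetic")[["Pregnancies", "BMI"]].agg(["min", "max"])
multi.columns = ["_".join(pair) for pair in multi.columns]

print(flat)
print(multi.columns.tolist())
```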

In [None]:
table.index

CategoricalIndex(['0', '1'], categories=['0', '1'], ordered=False, name='Diabetic', dtype='category')

In [None]:
table.loc[["0"], "Pregnancies"].agg("min").iloc[0]  # .iloc[0] makes the positional lookup explicit

0.0

In [None]:
table.loc["0", [("Pregnancies", "min"), ("BMI", 'max')]]

Pregnancies  min     0.000000
BMI          max    51.418626
Name: 0, dtype: float64