# Sampling functions

In [60]:
import pandas as pd
import numpy as np
import random

In [61]:
econdata = pd.read_csv("../data/econdata.csv")
econdata.head()

Unnamed: 0,id,geo_point_2d,geo_shape,clave_cat,delegacion,perimetro,tipo,nom_id
0,0,"19.424781053,-99.1327537959","{""type"": ""Polygon"", ""coordinates"": [[[-99.1332...",307_130_11,Cuauhtémoc,B,Mercado,Pino Suárez
1,1,"19.4346139576,-99.1413808393","{""type"": ""MultiPoint"", ""coordinates"": [[-99.14...",002_008_01,Cuautémoc,A,Museo,Museo Nacional de Arquitectura Palacio de Bell...
2,2,"19.4340695945,-99.1306348409","{""type"": ""MultiPoint"", ""coordinates"": [[-99.13...",006_002_12,Cuautémoc,A,Museo,Santa Teresa
3,3,"19.42489472,-99.12073393","{""type"": ""MultiPoint"", ""coordinates"": [[-99.12...",323_102_06,Venustiano Carranza,B,Hotel,Balbuena
4,4,"19.42358238,-99.12451093","{""type"": ""MultiPoint"", ""coordinates"": [[-99.12...",323_115_12,Venustiano Carranza,B,Hotel,real


In [62]:
econdata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 230 entries, 0 to 229
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   id            230 non-null    int64 
 1   geo_point_2d  229 non-null    object
 2   geo_shape     229 non-null    object
 3   clave_cat     230 non-null    object
 4   delegacion    230 non-null    object
 5   perimetro     230 non-null    object
 6   tipo          230 non-null    object
 7   nom_id        229 non-null    object
dtypes: int64(1), object(7)
memory usage: 14.5+ KB


## Simple random sampling

Every element of a population has the same probability of being chosen for the sample.
<br>

The pandas `sample(n)` method returns `n` random elements (rows) from the DataFrame.

In [63]:
# Random sample of 8 elements
aleat_8 = econdata.sample(n=8)
aleat_8

Unnamed: 0,id,geo_point_2d,geo_shape,clave_cat,delegacion,perimetro,tipo,nom_id
129,129,,,005_125_01,Cuauhtémoc,A,Mercado,Abelardo
175,175,"19.4291934671,-99.1323328561","{""type"": ""MultiPoint"", ""coordinates"": [[-99.13...",006_073_11,Cuautémoc,A,Museo,La Ciudad de México
43,43,"19.4343713052,-99.1291939496","{""type"": ""MultiPoint"", ""coordinates"": [[-99.12...",006_003_19,Cuautémoc,A,Hotel,Valencia
132,132,"19.4248733452,-99.1202942813","{""type"": ""MultiPoint"", ""coordinates"": [[-99.12...",323_102_02,Venustiano Carranza,B,Hotel,De Casa
92,92,"19.4383554723,-99.1327563513","{""type"": ""MultiPoint"", ""coordinates"": [[-99.13...",004_082_19,Cuautémoc,A,Hotel,Tuxpan
98,98,"19.4409812032,-99.1401537957","{""type"": ""MultiPoint"", ""coordinates"": [[-99.14...",003_069_08,Cuautémoc,B,Hotel,"Riva Palacio, S.A. DE C.V."
158,158,"19.4362349051,-99.1302332694","{""type"": ""MultiPoint"", ""coordinates"": [[-99.13...",005_129_16,Cuautémoc,A,Museo,De La Luz
210,210,"19.43082385,-99.12366058","{""type"": ""MultiPoint"", ""coordinates"": [[-99.12...",323_029_08,Venustiano Carranza,B,Hotel,Hispano


With `sample(frac=...)` we get a random fraction of the DataFrame.

In [64]:
# A 25% fraction of the data.
frac_25 = econdata.sample(frac=.25)
frac_25.head()

Unnamed: 0,id,geo_point_2d,geo_shape,clave_cat,delegacion,perimetro,tipo,nom_id
114,114,"19.43836668,-99.14752899","{""type"": ""MultiPoint"", ""coordinates"": [[-99.14...",003_103_30,Cuautémoc,A,Hotel,Marconi
64,64,"19.44281242,-99.13974599","{""type"": ""MultiPoint"", ""coordinates"": [[-99.13...",003_053_01,Cuautémoc,B,Hotel,San Martin
144,144,"19.43846278,-99.14185407","{""type"": ""MultiPoint"", ""coordinates"": [[-99.14...",003_097_17,Cuautémoc,A,Hotel,Covadonga
181,181,"19.4451852396,-99.1478597989","{""type"": ""MultiPoint"", ""coordinates"": [[-99.14...",012_103_01,Cuautémoc,B,Hotel,Yale
54,54,"19.4263645964,-99.1399088724","{""type"": ""MultiPoint"", ""coordinates"": [[-99.13...",001_076_12,Cuautémoc,B,Hotel,"Cadillac, S.A. DE C.V."


In [65]:
frac_25.shape

(58, 8)
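
Because `sample` draws different rows on every call, a seed helps when a sample has to be reproduced later. The following is a minimal sketch, assuming a toy 230-row DataFrame (`toy`, a hypothetical stand-in for `econdata`) and using the `random_state` parameter of `sample`:

```python
import pandas as pd

# Toy stand-in for econdata: 230 rows, hypothetical values
toy = pd.DataFrame({"id": range(230)})

# The same seed selects the same rows on every call
first_draw = toy.sample(n=8, random_state=0)
second_draw = toy.sample(n=8, random_state=0)
same_rows = first_draw.index.equals(second_draw.index)

# Without replacement (the default), no row is selected twice
no_repeats = first_draw.index.is_unique

# A quarter of the 230 rows
quarter = toy.sample(frac=0.25, random_state=0)
print(same_rows, no_repeats, quarter.shape)
```

With 230 rows, `frac=0.25` yields 58 rows, which matches the shape shown above.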

## Systematic sampling

A sampling technique in which a fixed rule is followed to select the elements of a population.
<br>

For this example we build a function that performs systematic sampling.

In [66]:
def systematic_sampling(data, step):
  """Return every `step`-th row of `data`, starting at position 0."""
  indexes = np.arange(0, len(data), step=step)
  sample = data.iloc[indexes]
  return sample

We call the function to create the sample.
We take every 5th element of the DataFrame, starting from the first one.

In [67]:
sample = systematic_sampling(econdata, 5)
sample.head()

Unnamed: 0,id,geo_point_2d,geo_shape,clave_cat,delegacion,perimetro,tipo,nom_id
0,0,"19.424781053,-99.1327537959","{""type"": ""Polygon"", ""coordinates"": [[[-99.1332...",307_130_11,Cuauhtémoc,B,Mercado,Pino Suárez
5,5,"19.4263287068,-99.1207277209","{""type"": ""MultiPoint"", ""coordinates"": [[-99.12...",323_161_11,Venustiano Carranza,B,Hotel,Baño San Tiago
10,10,"19.4441424478,-99.14600807","{""type"": ""MultiPoint"", ""coordinates"": [[-99.14...",003_048_10,Cuautémoc,B,Hotel,Moctezuma
15,15,"19.42413788,-99.1324515","{""type"": ""MultiPoint"", ""coordinates"": [[-99.13...",307_153_11,Cuautémoc,B,Hotel,San Lucas
20,20,"19.4357307042,-99.1326583218","{""type"": ""MultiPoint"", ""coordinates"": [[-99.13...",004_098_26,Cuautémoc,A,Museo,La Caricatura
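
The function above always starts at the first row. A common variant picks a random starting offset smaller than `step`, so that every row has a chance of being selected. This variant is not part of the original notebook: the name `systematic_sampling_random_start`, the `seed` parameter, and the toy DataFrame are assumptions made for this sketch.

```python
import numpy as np
import pandas as pd

def systematic_sampling_random_start(data, step, seed=None):
    """Take every `step`-th row, starting at a random offset in [0, step)."""
    rng = np.random.default_rng(seed)
    start = int(rng.integers(0, step))  # random offset: 0, 1, ..., step - 1
    indexes = np.arange(start, len(data), step)
    return data.iloc[indexes]

# Toy stand-in for econdata: 230 rows, hypothetical values
toy = pd.DataFrame({"id": range(230)})
sys_sample = systematic_sampling_random_start(toy, 5, seed=1)
print(len(sys_sample))
```

For 230 rows and a step of 5, the sample has 46 rows whichever offset is drawn.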


## Stratified sampling

It consists of dividing a population into mutually exclusive, homogeneous subgroups or segments (strata) and then drawing random samples from each of them.
<br>

**Steps to create stratified samples:**
1. Create the stratification variable.
2. Check the proportion of each stratum in the population.
3. Define the sampling.
4. Build a new table with the corresponding (adjusted) proportions of the strata.


In [68]:
# 1. Create the stratification variable
econdata["estratos"] = econdata["delegacion"] + "," + econdata["tipo"]

# 2. Proportion of each stratum in the population
(econdata["estratos"].value_counts()/len(econdata)).sort_values(ascending=False)


Cuautémoc,Hotel                0.643478
Cuautémoc,Museo                0.156522
Venustiano Carranza,Hotel      0.078261
Cuauhtémoc,Mercado             0.073913
Venustiano Carranza,Mercado    0.047826
Name: estratos, dtype: float64
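
As a side note, the same proportions can be obtained with the `normalize=True` option of `value_counts`, which also returns the values sorted in descending order. This is a minimal sketch on a toy column; the strata labels and counts below are hypothetical.

```python
import pandas as pd

# Toy strata column (hypothetical labels and counts)
toy = pd.DataFrame({"estratos": ["A,Hotel"] * 6 + ["A,Museo"] * 3 + ["B,Hotel"] * 1})

# Manual version, as in the cell above
manual = (toy["estratos"].value_counts() / len(toy)).sort_values(ascending=False)

# Built-in version: relative frequencies instead of raw counts
normalized = toy["estratos"].value_counts(normalize=True)

print(normalized)
```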

For the sampling we use the following adjusted proportions instead of the observed ones:
1. Hotels in Cuautémoc: $50\%$
2. Museums in Cuautémoc: $20\%$
3. Hotels in Venustiano Carranza: $10\%$
4. Markets in Cuauhtémoc: $10\%$
5. Markets in Venustiano Carranza: $10\%$

📌 We adjust the proportions so that they add up to $100\%$.

In [69]:
def stratified_data(data, stratified_column, values, proportions, random_state=None):
  """Draw len(data) rows in total, split among the strata in `values` by `proportions`."""
  samples = []
  sampled_rows = 0

  for i, value in enumerate(values):
    if i == len(values) - 1:
      # The last stratum takes the remaining rows so the total equals len(data)
      ratio = len(data) - sampled_rows
    else:
      ratio = int(len(data) * proportions[i])

    filtered_df = data[data[stratified_column] == value] # Keep only the source rows that belong to this stratum
    # 3. Sample the stratum according to its proportion
    temp_df = filtered_df.sample(replace=True, n=ratio, random_state=random_state) # Sample of `ratio` rows, drawn with replacement
    samples.append(temp_df)
    sampled_rows += ratio

  # 4. Table with the corresponding proportions of each stratum
  return pd.concat(samples)

📌 **NOTE**: In the `stratified_data` function, the `replace=True` parameter used when drawing the random sample allows rows of `filtered_df` to be repeated. This is required when the number of elements to draw (`n=ratio`) is greater than the length of that DataFrame. As a result we get a new DataFrame with repeated rows, so at the end we can clean the sample by removing the duplicated elements.
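
To see where replacement is actually needed, the requested rows per stratum can be compared with the rows available. The sketch below reproduces the arithmetic of `stratified_data` for the 230-row population. The stratum sizes in `available` are an assumption: they are recovered by rounding the proportions printed earlier multiplied by 230.

```python
population = 230
proportions = [0.5, 0.2, 0.1, 0.1, 0.1]

# Observed proportions, copied from the value_counts output above
observed = [0.643478, 0.156522, 0.078261, 0.073913, 0.047826]
# Assumption: stratum sizes recovered by rounding proportion * population
available = [round(p * population) for p in observed]

# Same rule as stratified_data: truncate, and give the remainder to the last stratum
requested = [int(population * p) for p in proportions[:-1]]
requested.append(population - sum(requested))

# True where the stratum has fewer rows than the sample asks for
needs_replacement = [r > a for r, a in zip(requested, available)]

print(requested)          # [115, 46, 23, 23, 23]
print(available)          # [148, 36, 18, 17, 11]
print(needs_replacement)  # [False, True, True, True, True]
```

Under these assumptions, only the first stratum has enough rows to be sampled without replacement. The other four ask for more rows than exist, which is why duplicates appear.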

In [70]:
values = ["Cuautémoc,Hotel","Cuautémoc,Museo","Venustiano Carranza,Hotel","Cuauhtémoc,Mercado","Venustiano Carranza,Mercado"]
proportions = [0.5, 0.2, 0.1, 0.1, 0.1]

stratified_df = stratified_data(econdata, "estratos",values, proportions, random_state=42)
stratified_df

Unnamed: 0,id,geo_point_2d,geo_shape,clave_cat,delegacion,perimetro,tipo,nom_id,estratos
164,164,"19.4388741511,-99.1413308257","{""type"": ""MultiPoint"", ""coordinates"": [[-99.14...",003_113_03,Cuautémoc,B,Hotel,Dos Naciones,"Cuautémoc,Hotel"
142,142,"19.4263681354,-99.1327278126","{""type"": ""MultiPoint"", ""coordinates"": [[-99.13...",006_127_14,Cuautémoc,A,Hotel,Ambar,"Cuautémoc,Hotel"
27,27,"19.4348360773,-99.1463945583","{""type"": ""MultiPoint"", ""coordinates"": [[-99.14...",002_016_01,Cuautémoc,B,Hotel,Hilton Centro Histórico,"Cuautémoc,Hotel"
168,168,"19.4349726565,-99.147766133","{""type"": ""MultiPoint"", ""coordinates"": [[-99.14...",002_014_23,Cuautémoc,B,Hotel,One Alameda,"Cuautémoc,Hotel"
113,113,"19.43374405,-99.13550135","{""type"": ""MultiPoint"", ""coordinates"": [[-99.13...",001_012_13,Cuautémoc,A,Hotel,San Antonio,"Cuautémoc,Hotel"
...,...,...,...,...,...,...,...,...,...
128,128,"19.4270781084,-99.1210175514","{""type"": ""Polygon"", ""coordinates"": [[[-99.1214...",323_061_04(123),Venustiano Carranza,B,Mercado,San Ciprian,"Venustiano Carranza,Mercado"
37,37,"19.4271233834,-99.125111772","{""type"": ""Polygon"", ""coordinates"": [[[-99.1251...",323_065_01,Venustiano Carranza,B,Mercado,Dulceria,"Venustiano Carranza,Mercado"
163,163,"19.4265454033,-99.1224859032","{""type"": ""Polygon"", ""coordinates"": [[[-99.1231...",323_063_05,Venustiano Carranza,B,Mercado,,"Venustiano Carranza,Mercado"
156,156,"19.4255480371,-99.1249308096","{""type"": ""Polygon"", ""coordinates"": [[[-99.1253...",323_138_04 (3),Venustiano Carranza,B,Mercado,Mariscos,"Venustiano Carranza,Mercado"


We remove the duplicated elements:

In [73]:
cleaned_df = stratified_df.drop_duplicates(keep="first")
print(f"shape={cleaned_df.shape}")
cleaned_df

shape=(140, 9)


Unnamed: 0,id,geo_point_2d,geo_shape,clave_cat,delegacion,perimetro,tipo,nom_id,estratos
164,164,"19.4388741511,-99.1413308257","{""type"": ""MultiPoint"", ""coordinates"": [[-99.14...",003_113_03,Cuautémoc,B,Hotel,Dos Naciones,"Cuautémoc,Hotel"
142,142,"19.4263681354,-99.1327278126","{""type"": ""MultiPoint"", ""coordinates"": [[-99.13...",006_127_14,Cuautémoc,A,Hotel,Ambar,"Cuautémoc,Hotel"
27,27,"19.4348360773,-99.1463945583","{""type"": ""MultiPoint"", ""coordinates"": [[-99.14...",002_016_01,Cuautémoc,B,Hotel,Hilton Centro Histórico,"Cuautémoc,Hotel"
168,168,"19.4349726565,-99.147766133","{""type"": ""MultiPoint"", ""coordinates"": [[-99.14...",002_014_23,Cuautémoc,B,Hotel,One Alameda,"Cuautémoc,Hotel"
113,113,"19.43374405,-99.13550135","{""type"": ""MultiPoint"", ""coordinates"": [[-99.13...",001_012_13,Cuautémoc,A,Hotel,San Antonio,"Cuautémoc,Hotel"
...,...,...,...,...,...,...,...,...,...
128,128,"19.4270781084,-99.1210175514","{""type"": ""Polygon"", ""coordinates"": [[[-99.1214...",323_061_04(123),Venustiano Carranza,B,Mercado,San Ciprian,"Venustiano Carranza,Mercado"
204,204,"19.4260286762,-99.1249971994","{""type"": ""Polygon"", ""coordinates"": [[[-99.1253...",323_138_01,Venustiano Carranza,B,Mercado,Florería,"Venustiano Carranza,Mercado"
49,49,"19.4264953358,-99.1248854383","{""type"": ""Polygon"", ""coordinates"": [[[-99.1252...",323_139_01,Venustiano Carranza,B,Mercado,Dulceria Don Goloso,"Venustiano Carranza,Mercado"
156,156,"19.4255480371,-99.1249308096","{""type"": ""Polygon"", ""coordinates"": [[[-99.1253...",323_138_04 (3),Venustiano Carranza,B,Mercado,Mariscos,"Venustiano Carranza,Mercado"
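
The effect of this cleanup can be seen in isolation on a small case. This is a minimal sketch, assuming a toy 11-row DataFrame (hypothetical) from which 23 rows are drawn with replacement:

```python
import pandas as pd

# Toy stratum with 11 rows (hypothetical values)
toy = pd.DataFrame({"id": range(11)})

# Drawing 23 rows from 11 is only possible with replacement, so rows repeat
oversampled = toy.sample(n=23, replace=True, random_state=42)

# Dropping duplicates keeps the first occurrence of each repeated row
unique_rows = oversampled.drop_duplicates(keep="first")

print(len(oversampled), len(unique_rows))
```

After the cleanup, the sample can never contain more rows than the stratum itself has.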


In [92]:
for i in range(len(values)):
  value = values[i]
  total_by = cleaned_df[cleaned_df["estratos"] == value]
  print(total_by["estratos"].value_counts())

Cuautémoc,Hotel    77
Name: estratos, dtype: int64
Cuautémoc,Museo    28
Name: estratos, dtype: int64
Venustiano Carranza,Hotel    13
Name: estratos, dtype: int64
Cuauhtémoc,Mercado    13
Name: estratos, dtype: int64
Venustiano Carranza,Mercado    9
Name: estratos, dtype: int64


In the end, our actual stratified sample has 140 elements, about 61% of the total population (140 of 230), grouped as follows:
- 77 hotels in Cuautémoc
- 28 museums in Cuautémoc
- 13 hotels in Venustiano Carranza
- 13 markets in Cuauhtémoc
-  9 markets in Venustiano Carranza
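
Because duplicates were dropped, the share of each stratum in the cleaned sample no longer matches the target proportions exactly. The arithmetic can be checked with the counts printed above:

```python
# Counts per stratum in the cleaned sample, copied from the output above
counts = {
    "Cuautémoc,Hotel": 77,
    "Cuautémoc,Museo": 28,
    "Venustiano Carranza,Hotel": 13,
    "Cuauhtémoc,Mercado": 13,
    "Venustiano Carranza,Mercado": 9,
}

total = sum(counts.values())            # 140
coverage = round(total / 230, 3)        # 0.609 of the population
shares = {name: round(n / total, 3) for name, n in counts.items()}

print(total, coverage)
print(shares)  # 0.55, 0.2, 0.093, 0.093, 0.064
```

The hotel stratum in Cuautémoc ends up at 55% instead of the 50% target, and the markets in Venustiano Carranza at about 6.4% instead of 10%.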