# PreViaje
## Humai Data Analytics Exam

This notebook works through a few exercises from the exam of the Data Analytics with Python course at [Humai](https://ihum.ai/). The data comes from PreViaje, a tourism pre-sale program in Argentina that reimburses 50% of the value of a trip as credit.

### Imports and getting the data
I'm going to work with three Pandas DataFrames: _df_provincias_, _df_vuelos_ and _df_pagos_.

In [2]:
import requests
import io
import pandas as pd
import urllib3
urllib3.disable_warnings()

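def generate_df(url):
    """Download a CSV file from `url` and load it into a DataFrame."""
    # verify=False skips TLS certificate verification (hence disable_warnings above)
    response = requests.get(url, verify=False)
    response.raise_for_status()  # fail early on HTTP errors
    csv_data = response.content.decode('utf-8')
    return pd.read_csv(io.StringIO(csv_data))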

df_provincias = generate_df(url='http://datos.yvera.gob.ar/dataset/09679fe3-7379-481d-a36a-6b1e3d7374b1/resource/42a60de1-a6d5-473b-a2c0-415e25ad7f31/download/personas_beneficiarias.csv')
df_provincias = df_provincias[df_provincias['genero'] != 'X']
df_vuelos = generate_df(url='http://datos.yvera.gob.ar/dataset/09679fe3-7379-481d-a36a-6b1e3d7374b1/resource/9d4db872-0a51-4042-9daa-e55bc7a3044c/download/viajes_origen_destino_mes.csv')
df_pagos = generate_df(url='http://datos.yvera.gob.ar/dataset/09679fe3-7379-481d-a36a-6b1e3d7374b1/resource/2eaac913-0273-41cb-b1b2-c810e59d2334/download/comprobantes_fecha.csv')

### What's on each DataFrame?

df_provincias contains five columns with information about the people who used the program:

0. _provincia_: the province of the beneficiaries.
1. _tramo_edad_: the age group of the beneficiaries.
2. _genero_: the gender of the beneficiaries.
3. _personas_beneficiarias_: the number of beneficiaries.
4. _edicion_: the edition of the program.

In [3]:
df_provincias.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 768 entries, 0 to 808
Data columns (total 5 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   provincia               768 non-null    object
 1   tramo_edad              768 non-null    object
 2   genero                  768 non-null    object
 3   personas_beneficiarias  768 non-null    int64 
 4   edicion                 768 non-null    object
dtypes: int64(1), object(4)
memory usage: 36.0+ KB


df_vuelos contains six columns with information about the flights taken by beneficiaries:

0. _mes_inicio_: the month in which the trip started.
1. _provincia_origen_: the province of origin.
2. _provincia_destino_: the destination province.
3. _viajes_: how many trips were registered.
4. _viajeros_: how many travelers were registered.
5. _edicion_: the edition of the program.

In [4]:
df_vuelos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11033 entries, 0 to 11032
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   mes_inicio         11033 non-null  object
 1   provincia_origen   11033 non-null  object
 2   provincia_destino  11033 non-null  object
 3   viajes             11033 non-null  int64 
 4   viajeros           11033 non-null  int64 
 5   edicion            11033 non-null  object
dtypes: int64(2), object(4)
memory usage: 517.3+ KB


df_pagos contains twelve columns with information about the payments made by beneficiaries.

In [5]:
df_pagos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4603 entries, 0 to 4602
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   fecha_carga   4603 non-null   object 
 1   comprobantes  4603 non-null   int64  
 2   monto         4603 non-null   float64
 3   clae          4603 non-null   int64  
 4   clae6_desc    4603 non-null   object 
 5   edicion       4603 non-null   object 
 6   clae2         4603 non-null   int64  
 7   clae2_desc    4603 non-null   object 
 8   clae3         4603 non-null   int64  
 9   clae3_desc    4603 non-null   object 
 10  letra         4603 non-null   object 
 11  letra_desc    4603 non-null   object 
dtypes: float64(1), int64(4), object(7)
memory usage: 431.7+ KB


### Analyze the DataFrames according to the following instructions

- df_provincias:
   - "genero" column: review the unique values that the column has.
   - Column "personas_beneficiarias": are there null or invalid values?

- df_vuelos:
   - "viajes" column: are there null or invalid values?

- df_pagos:
   - "fecha_carga" column: convert the column to datetime format.

In [7]:
# start by looking at the head of the df
df_provincias.head()

| | provincia | tramo_edad | genero | personas_beneficiarias | edicion |
|---|---|---|---|---|---|
| 0 | Chaco | 18 a 29 | Femenino | 215 | previaje 1 |
| 1 | Chaco | 30 a 44 años | Femenino | 478 | previaje 1 |
| 2 | Chaco | 45 a 59 años | Femenino | 238 | previaje 1 |
| 3 | Chaco | 60 años y más | Femenino | 103 | previaje 1 |
| 4 | Chaco | 18 a 29 | Masculino | 179 | previaje 1 |


`value_counts()` with the default `dropna=True` counts the occurrences of each unique value, excluding NaNs. To check whether there are any null values, we can call `isnull()` on the column.
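
As a small illustration of the `dropna` behaviour described above, here is a sketch on a toy Series. The data is made up for the example and is not taken from the PreViaje files.

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical data) with one missing value
genero = pd.Series(["Femenino", "Masculino", "Femenino", np.nan], name="genero")

counts_default = genero.value_counts()               # NaN is left out
counts_with_nan = genero.value_counts(dropna=False)  # NaN gets its own row
n_missing = int(genero.isnull().sum())               # number of missing entries

print(counts_default)
print(counts_with_nan)
print(n_missing)
```

Passing `dropna=False` is a quick way to see missing values in the same table as the valid categories.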

In [8]:
df_provincias.genero.value_counts()

Femenino     384
Masculino    384
Name: genero, dtype: int64

In [12]:
df_provincias.personas_beneficiarias.isnull().value_counts() # there are no nulls

False    768
Name: personas_beneficiarias, dtype: int64

In [15]:
df_provincias.personas_beneficiarias.isna().value_counts() # isna() is an alias of isnull(), so the result is the same

False    768
Name: personas_beneficiarias, dtype: int64

The number of beneficiaries should be a positive integer (and the column's dtype is int64). `.describe()` shows the min, max, quartiles, mean and std; a non-positive min would indicate an invalid value. (I could have used `.min()`, but `describe()` takes no extra effort and shows a few other things.)

In [16]:
df_provincias.personas_beneficiarias.describe()

count       768.000000
mean       3001.373698
std        8632.113857
min           4.000000
25%         204.500000
50%         539.500000
75%        1605.250000
max      103922.000000
Name: personas_beneficiarias, dtype: float64
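
The same validity check can also be written as explicit boolean tests. The following is a minimal sketch on made-up numbers, assuming that "invalid" means a count of zero or below.

```python
import pandas as pd

# Toy counts (hypothetical values); the -3 stands in for an invalid entry
beneficiarios = pd.Series([215, 478, -3, 103], name="personas_beneficiarias")

has_nulls = bool(beneficiarios.isna().any())
is_integer = pd.api.types.is_integer_dtype(beneficiarios)
invalid_rows = beneficiarios[beneficiarios <= 0]  # non-positive counts

print(has_nulls, is_integer)
print(invalid_rows)
```

An empty `invalid_rows` would confirm what the `min` in `describe()` suggests.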

In [17]:
# now with the flights
df_vuelos.head()

| | mes_inicio | provincia_origen | provincia_destino | viajes | viajeros | edicion |
|---|---|---|---|---|---|---|
| 0 | 2021-01 | Buenos Aires | Buenos Aires | 6920 | 22641 | previaje 1 |
| 1 | 2021-01 | Buenos Aires | Catamarca | 24 | 56 | previaje 1 |
| 2 | 2021-01 | Buenos Aires | Chaco | 15 | 27 | previaje 1 |
| 3 | 2021-01 | Buenos Aires | Chubut | 446 | 1203 | previaje 1 |
| 4 | 2021-01 | Buenos Aires | Ciudad Autónoma de Buenos Aires | 112 | 272 | previaje 1 |


In [21]:
nulls = df_vuelos.viajes.isnull().value_counts()
nans = df_vuelos.viajes.isna().value_counts()
print(nulls)
print(nans)

False    11033
Name: viajes, dtype: int64
False    11033
Name: viajes, dtype: int64


In [22]:
# payments
df_pagos.head()

| | fecha_carga | comprobantes | monto | clae | clae6_desc | edicion | clae2 | clae2_desc | clae3 | clae3_desc | letra | letra_desc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020-10-08 | 130 | 3652753.17 | 511000 | Servicio de transporte aéreo de pasajeros | previaje 1 | 51 | Transporte aéreo | 511 | Servicio de transporte aéreo de pasajeros | H | SERVICIO DE TRANSPORTE Y ALMACENAMIENTO |
| 1 | 2020-10-08 | 35 | 1627464.98 | 791100 | Servicios minoristas de agencias de viajes | previaje 1 | 79 | Agencias de viajes, servicios de reservas y ac... | 791 | Servicios de agencias de viaje y otras activid... | N | ACTIVIDADES ADMINISTRATIVAS Y SERVICIOS DE APOYO |
| 2 | 2020-10-09 | 222 | 5757986.58 | 511000 | Servicio de transporte aéreo de pasajeros | previaje 1 | 51 | Transporte aéreo | 511 | Servicio de transporte aéreo de pasajeros | H | SERVICIO DE TRANSPORTE Y ALMACENAMIENTO |
| 3 | 2020-10-09 | 3 | 121289.0 | 551022 | Servicios de alojamiento en hoteles, hostería... | previaje 1 | 55 | Servicios de alojamiento | 551 | Servicios de alojamiento, excepto en camping | I | SERVICIOS DE ALOJAMIENTO Y SERVICIOS DE COMIDA |
| 4 | 2020-10-09 | 4 | 186166.5 | 551023 | Servicios de alojamiento en hoteles, hostería... | previaje 1 | 55 | Servicios de alojamiento | 551 | Servicios de alojamiento, excepto en camping | I | SERVICIOS DE ALOJAMIENTO Y SERVICIOS DE COMIDA |


In [23]:
df_pagos.fecha_carga = pd.to_datetime(df_pagos.fecha_carga)
df_pagos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4603 entries, 0 to 4602
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   fecha_carga   4603 non-null   datetime64[ns]
 1   comprobantes  4603 non-null   int64         
 2   monto         4603 non-null   float64       
 3   clae          4603 non-null   int64         
 4   clae6_desc    4603 non-null   object        
 5   edicion       4603 non-null   object        
 6   clae2         4603 non-null   int64         
 7   clae2_desc    4603 non-null   object        
 8   clae3         4603 non-null   int64         
 9   clae3_desc    4603 non-null   object        
 10  letra         4603 non-null   object        
 11  letra_desc    4603 non-null   object        
dtypes: datetime64[ns](1), float64(1), int64(4), object(6)
memory usage: 431.7+ KB
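
When the date format is known in advance, it can be passed explicitly, and unparseable entries can be turned into `NaT` instead of raising. This is a hedged sketch on toy strings; the `%Y-%m-%d` format is an assumption based on the values shown in `head()` above.

```python
import pandas as pd

# Toy dates (hypothetical); the last entry is deliberately malformed
fechas = pd.Series(["2020-10-08", "2020-10-09", "not a date"])

# errors="coerce" turns anything that does not match the format into NaT
parsed = pd.to_datetime(fechas, format="%Y-%m-%d", errors="coerce")
n_bad = int(parsed.isna().sum())

print(parsed.dtype, n_bad)
```

Counting the `NaT` values afterwards shows how many rows failed to parse.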


### From the DataFrame `df_provincias`, operate on the column `tramo_edad` to separate its values into new columns `min_edad` and `max_edad`

Note:
- The columns must be of type integer (int)
- If in any case the upper limit of an age range is not clarified, take 100 years as the maximum age.

In [24]:
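# Split on " a": "30 a 44 años" -> "30", " 44", "ños"; keep only the first two pieces
df_provincias[["min_edad", "max_edad"]] = df_provincias.tramo_edad.str.split(" a", expand=True)[[0,1]]
df_provincias.min_edad = pd.to_numeric(df_provincias.min_edad, downcast='integer')
# "60 años y más" leaves a non-numeric second piece, which becomes NaN and is then filled with 100
df_provincias.max_edad = pd.to_numeric(df_provincias.max_edad, downcast='integer', errors='coerce').fillna(100).astype('Int8')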
df_provincias.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 768 entries, 0 to 808
Data columns (total 7 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   provincia               768 non-null    object
 1   tramo_edad              768 non-null    object
 2   genero                  768 non-null    object
 3   personas_beneficiarias  768 non-null    int64 
 4   edicion                 768 non-null    object
 5   min_edad                768 non-null    int8  
 6   max_edad                768 non-null    Int8  
dtypes: Int8(1), int64(1), int8(1), object(4)
memory usage: 38.2+ KB
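
An alternative that does not depend on where " a" happens to appear in the text is to extract the numbers with a regular expression. The sketch below uses the four age-group labels seen in the data; the regex and the `edades` name are illustrative choices, not part of the original solution.

```python
import pandas as pd

tramos = pd.Series(["18 a 29", "30 a 44 años", "45 a 59 años", "60 años y más"])

# First number is the lower bound; an optional " a <number>" gives the upper bound
edades = tramos.str.extract(r"^(?P<min_edad>\d+)(?: a (?P<max_edad>\d+))?")
edades["min_edad"] = edades["min_edad"].astype(int)
# A missing upper bound (open-ended group) is replaced by 100, as the exercise asks
edades["max_edad"] = pd.to_numeric(edades["max_edad"]).fillna(100).astype(int)

print(edades)
```

The named groups become the column names directly, so no renaming step is needed.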


### How many people aged 44 or younger were beneficiaries in edition 1 of the PreViaje?

In [32]:
# sum the number of beneficiaries whose min_edad < 45
# note: no filter on edicion is applied here, so the total below covers every edition in the data
df_provincias.personas_beneficiarias[df_provincias.min_edad < 45].sum()

1222699
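
To restrict the count to a single edition, the age condition can be combined with a condition on `edicion`. This is a minimal sketch on made-up rows, so its result is not comparable to the figure above.

```python
import pandas as pd

# Toy rows (hypothetical numbers) with the same column names used above
df = pd.DataFrame({
    "edicion": ["previaje 1", "previaje 1", "previaje 2", "previaje 1"],
    "min_edad": [18, 30, 18, 45],
    "personas_beneficiarias": [100, 200, 400, 800],
})

# Both conditions must hold: edition 1 and an age group starting below 45
mask = (df["edicion"] == "previaje 1") & (df["min_edad"] < 45)
total_ed1_under_45 = int(df.loc[mask, "personas_beneficiarias"].sum())

print(total_ed1_under_45)  # 300
```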

### Analyzing the DataFrames `df_provincias` and `df_vuelos`, an interest arises in obtaining a table with the data from the **2nd `edition`** of PreViaje.

We want to obtain a table with the totals by province of:
- Beneficiaries
- Number of trips
- Number of passengers on trips

Prepare the tables to join them, obtaining the necessary data from each one.

Note: In the case of trips, we are interested in the province **of origin**.

In [33]:
df_prov_pre2 = df_provincias.query('edicion == "previaje 2"')
df_vuelos_pre2 = df_vuelos.query('edicion == "previaje 2"')

In [34]:
# select the numeric column explicitly so the text columns are not passed to sum()
provs_pre2 = df_prov_pre2.groupby("provincia")[["personas_beneficiarias"]].sum()


In [35]:
# select the numeric columns explicitly so the text columns are not passed to sum()
vuelos_pre2 = df_vuelos_pre2.groupby("provincia_origen")[["viajes", "viajeros"]].sum()


### Join both DataFrames on the `provincia` and `provincia_origen` columns.

#### Then, from the new table, put together a dictionary with the values of the province of Chaco with the following structure:
```python
{
   'personas_beneficiarias': [int],
   'viajes': [int],
   'viajeros': [int]
}
```

Note:
- Use the DataFrame method `.to_dict('list')`

In [36]:
totales_pre2 = pd.merge(provs_pre2, vuelos_pre2, left_index=True, right_index=True)
totales_pre2

| provincia | personas_beneficiarias | viajes | viajeros |
|---|---|---|---|
| Buenos Aires | 520431 | 550513 | 1706730 |
| Catamarca | 2741 | 2473 | 7534 |
| Chaco | 12985 | 13951 | 47315 |
| Chubut | 12298 | 14076 | 38075 |
| Ciudad Autónoma de Buenos Aires | 280764 | 420311 | 1227153 |
| Corrientes | 7480 | 8073 | 26576 |
| Córdoba | 105977 | 130407 | 439270 |
| Entre Ríos | 24066 | 25686 | 80129 |
| Formosa | 1968 | 1717 | 5426 |
| Jujuy | 5175 | 5687 | 16562 |


In [37]:
totales_pre2[totales_pre2.index == "Chaco"].to_dict('list')

{'personas_beneficiarias': [12985], 'viajes': [13951], 'viajeros': [47315]}
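
If the province names are kept as regular columns instead of indexes, the join can name both key columns explicitly. The following sketch uses two toy tables with invented numbers to show that variant.

```python
import pandas as pd

# Toy per-province totals (hypothetical numbers)
beneficiarios = pd.DataFrame({
    "provincia": ["Chaco", "Chubut"],
    "personas_beneficiarias": [10, 20],
})
vuelos = pd.DataFrame({
    "provincia_origen": ["Chaco", "Chubut"],
    "viajes": [3, 4],
    "viajeros": [7, 9],
})

totales = (
    beneficiarios
    .merge(vuelos, left_on="provincia", right_on="provincia_origen")
    .drop(columns="provincia_origen")  # redundant after the join
    .set_index("provincia")
)

# .loc with a list keeps a one-row DataFrame, so to_dict("list") yields lists
chaco = totales.loc[["Chaco"]].to_dict("list")
print(chaco)
```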

### Working with `df_vuelos`, create a table that shows the total number of travelers by `provincia_origen` and `provincia_destino`.

- Filter by **edition 3** of PreViaje.
- `provincia_origen` must remain as the index.
- The values must be of type **integer**.
- Fill in missing values with 0.

#### Create a dictionary with the following values calculated from passengers with `provincia_destino` Santa Fe
```python
{
   'total': int,
   'promedio': float,
   'mediana': int
}
```

In [38]:
vuelos_pre3 = df_vuelos.query("edicion == 'previaje 3'")

In [39]:
pv3_sum = pd.pivot_table(vuelos_pre3,
               index = "provincia_origen",
               columns= "provincia_destino",
               values = "viajeros",
               aggfunc = "sum").fillna(0).apply(lambda x: x.astype(int) if x.dtype == 'float' else x)

pv3_mean = pd.pivot_table(vuelos_pre3,
               index = "provincia_origen",
               columns= "provincia_destino",
               values = "viajeros",
               aggfunc = "mean").fillna(0).apply(lambda x: x.astype(int) if x.dtype == 'float' else x)

pv3_median = pd.pivot_table(vuelos_pre3,
               index = "provincia_origen",
               columns= "provincia_destino",
               values = "viajeros",
               aggfunc = "median").fillna(0).apply(lambda x: x.astype(int) if x.dtype == 'float' else x)
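
`pivot_table` also accepts a `fill_value` argument, which can replace the separate `fillna(0)` step. A small sketch on invented trips, where one origin/destination pair is missing on purpose:

```python
import pandas as pd

# Toy trips (hypothetical numbers); there is no Chaco -> Chaco row
viajes = pd.DataFrame({
    "provincia_origen": ["Chaco", "Chaco", "Chubut", "Chubut"],
    "provincia_destino": ["Chubut", "Chubut", "Chaco", "Chubut"],
    "viajeros": [5, 7, 2, 4],
})

tabla = pd.pivot_table(
    viajes,
    index="provincia_origen",
    columns="provincia_destino",
    values="viajeros",
    aggfunc="sum",
    fill_value=0,   # fill missing origin/destination pairs with 0
).astype(int)       # make the integer dtype explicit

print(tabla)
```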

In [42]:
# the pv3_sum table has the province of origin as the index and the destination province as columns
pv3_sum.head()

| provincia_origen / provincia_destino | Buenos Aires | Catamarca | Chaco | Chubut | Ciudad Autónoma de Buenos Aires | Corrientes | Córdoba | Entre Ríos | Formosa | Jujuy | ... | Neuquén | Río Negro | Salta | San Juan | San Luis | Santa Cruz | Santa Fe | Santiago del Estero | Tierra del Fuego, Antártida e Islas del Atlántico Sur | Tucumán |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Buenos Aires | 45644 | 2686 | 217 | 15586 | 7276 | 2029 | 18106 | 12949 | 182 | 10266 | ... | 10855 | 29027 | 22466 | 2655 | 5986 | 20050 | 1945 | 3189 | 16383 | 4480 |
| Catamarca | 161 | 188 | 1 | 28 | 435 | 3 | 231 | 8 | 0 | 38 | ... | 10 | 52 | 69 | 46 | 9 | 26 | 20 | 64 | 32 | 47 |
| Chaco | 596 | 22 | 58 | 344 | 1590 | 124 | 732 | 72 | 8 | 139 | ... | 182 | 549 | 538 | 15 | 171 | 368 | 57 | 75 | 665 | 47 |
| Chubut | 1067 | 27 | 18 | 1324 | 2547 | 16 | 658 | 219 | 3 | 146 | ... | 292 | 790 | 445 | 113 | 68 | 530 | 54 | 61 | 358 | 56 |
| Ciudad Autónoma de Buenos Aires | 29258 | 1699 | 270 | 10637 | 2372 | 1644 | 11099 | 6438 | 178 | 7691 | ... | 6821 | 17376 | 14050 | 2167 | 2163 | 13184 | 1467 | 1348 | 10451 | 2835 |


In [43]:
# the other two tables have the mean and median travelers by origin province for each destination province
pv3_mean.head()

| provincia_origen / provincia_destino | Buenos Aires | Catamarca | Chaco | Chubut | Ciudad Autónoma de Buenos Aires | Corrientes | Córdoba | Entre Ríos | Formosa | Jujuy | ... | Neuquén | Río Negro | Salta | San Juan | San Luis | Santa Cruz | Santa Fe | Santiago del Estero | Tierra del Fuego, Antártida e Islas del Atlántico Sur | Tucumán |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Buenos Aires | 15214 | 895 | 72 | 5195 | 2425 | 676 | 6035 | 4316 | 60 | 3422 | ... | 3618 | 9675 | 7488 | 885 | 1995 | 6683 | 648 | 1063 | 5461 | 1493 |
| Catamarca | 53 | 94 | 1 | 14 | 145 | 1 | 77 | 8 | 0 | 19 | ... | 5 | 26 | 34 | 23 | 4 | 8 | 10 | 32 | 16 | 23 |
| Chaco | 198 | 7 | 29 | 114 | 530 | 62 | 244 | 24 | 4 | 69 | ... | 91 | 183 | 269 | 7 | 57 | 122 | 28 | 37 | 221 | 23 |
| Chubut | 355 | 13 | 9 | 441 | 849 | 5 | 219 | 109 | 1 | 48 | ... | 97 | 263 | 148 | 37 | 22 | 176 | 27 | 20 | 119 | 28 |
| Ciudad Autónoma de Buenos Aires | 9752 | 566 | 90 | 3545 | 790 | 548 | 3699 | 2146 | 59 | 2563 | ... | 2273 | 5792 | 4683 | 722 | 721 | 4394 | 489 | 449 | 3483 | 945 |


In [44]:
pv3_median.head()

| provincia_origen / provincia_destino | Buenos Aires | Catamarca | Chaco | Chubut | Ciudad Autónoma de Buenos Aires | Corrientes | Córdoba | Entre Ríos | Formosa | Jujuy | ... | Neuquén | Río Negro | Salta | San Juan | San Luis | Santa Cruz | Santa Fe | Santiago del Estero | Tierra del Fuego, Antártida e Islas del Atlántico Sur | Tucumán |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Buenos Aires | 17850 | 1321 | 73 | 6105 | 3264 | 978 | 7933 | 5021 | 54 | 4700 | ... | 4348 | 13104 | 10788 | 1217 | 2701 | 8518 | 864 | 1086 | 7126 | 2039 |
| Catamarca | 61 | 94 | 1 | 14 | 197 | 1 | 102 | 8 | 0 | 19 | ... | 5 | 26 | 34 | 23 | 4 | 2 | 10 | 32 | 16 | 23 |
| Chaco | 219 | 10 | 29 | 93 | 615 | 62 | 287 | 32 | 4 | 69 | ... | 91 | 220 | 269 | 7 | 44 | 132 | 28 | 37 | 108 | 23 |
| Chubut | 412 | 13 | 9 | 471 | 1169 | 5 | 275 | 109 | 1 | 56 | ... | 135 | 276 | 170 | 11 | 15 | 204 | 27 | 28 | 169 | 28 |
| Ciudad Autónoma de Buenos Aires | 10586 | 812 | 110 | 4504 | 1142 | 787 | 4366 | 2386 | 81 | 3681 | ... | 2628 | 7226 | 6779 | 884 | 1047 | 5782 | 700 | 577 | 4480 | 1359 |


In [71]:
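# now for the province of Santa Fe
# note: the result printed further down is indexed by provincia_destino, so these queries
# keep the Santa Fe *origin* row (travelers from Santa Fe to each destination),
# not the Santa Fe destination column that the exercise asks about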
pv3_sum_SF = pv3_sum.query("provincia_destino == 'Santa Fe'").transpose().rename(columns={"Santa Fe": "total"})
pv3_mean_SF = pv3_mean.query("provincia_destino == 'Santa Fe'").transpose().rename(columns={"Santa Fe": "promedio"})
pv3_median_SF = pv3_median.query("provincia_destino == 'Santa Fe'").transpose().rename(columns={"Santa Fe": "mediana"})

In [72]:
santa_fe = pd.concat([pv3_sum_SF, pv3_mean_SF, pv3_median_SF], axis=1)

In [73]:
santa_fe

| provincia_destino | total | promedio | mediana |
|---|---|---|---|
| Buenos Aires | 6652 | 2217 | 1570 |
| Catamarca | 1562 | 520 | 564 |
| Chaco | 31 | 15 | 15 |
| Chubut | 5271 | 1757 | 1994 |
| Ciudad Autónoma de Buenos Aires | 5439 | 1813 | 2588 |
| Corrientes | 537 | 268 | 268 |
| Córdoba | 8486 | 2828 | 4021 |
| Entre Ríos | 1964 | 654 | 877 |
| Formosa | 23 | 11 | 11 |
| Jujuy | 2106 | 702 | 836 |
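
One possible reading of the requested structure is a single scalar per key, computed over the Santa Fe destination column of the origin-by-destination totals. The sketch below follows that reading on a toy table with invented numbers; `tabla`, `destino_sf` and `resumen` are names chosen for the example.

```python
import pandas as pd

# Toy origin x destination totals (hypothetical numbers)
tabla = pd.DataFrame(
    {"Santa Fe": [10, 20, 60], "Chaco": [1, 2, 3]},
    index=pd.Index(["Chaco", "Chubut", "Santa Fe"], name="provincia_origen"),
)
tabla.columns.name = "provincia_destino"

destino_sf = tabla["Santa Fe"]  # select by column, i.e. by destination
resumen = {
    "total": int(destino_sf.sum()),
    "promedio": float(destino_sf.mean()),
    "mediana": int(destino_sf.median()),
}

print(resumen)  # {'total': 90, 'promedio': 30.0, 'mediana': 20}
```

Selecting the column by label avoids any ambiguity about whether the filter applies to origins or destinations.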


In [74]:
santa_fe.to_dict('dict')

{'total': {'Buenos Aires': 6652,
  'Catamarca': 1562,
  'Chaco': 31,
  'Chubut': 5271,
  'Ciudad Autónoma de Buenos Aires': 5439,
  'Corrientes': 537,
  'Córdoba': 8486,
  'Entre Ríos': 1964,
  'Formosa': 23,
  'Jujuy': 2106,
  'La Pampa': 129,
  'La Rioja': 1305,
  'Mendoza': 6792,
  'Misiones': 4866,
  'Neuquén': 2296,
  'Río Negro': 7070,
  'Salta': 6144,
  'San Juan': 562,
  'San Luis': 2256,
  'Santa Cruz': 6029,
  'Santa Fe': 1740,
  'Santiago del Estero': 627,
  'Tierra del Fuego, Antártida e Islas del Atlántico Sur': 4554,
  'Tucumán': 1067},
 'promedio': {'Buenos Aires': 2217,
  'Catamarca': 520,
  'Chaco': 15,
  'Chubut': 1757,
  'Ciudad Autónoma de Buenos Aires': 1813,
  'Corrientes': 268,
  'Córdoba': 2828,
  'Entre Ríos': 654,
  'Formosa': 11,
  'Jujuy': 702,
  'La Pampa': 64,
  'La Rioja': 435,
  'Mendoza': 2264,
  'Misiones': 1622,
  'Neuquén': 765,
  'Río Negro': 2356,
  'Salta': 2048,
  'San Juan': 187,
  'San Luis': 752,
  'Santa Cruz': 2009,
  'Santa Fe': 580,
  'San