## Load libraries

In [2]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import os
import sys
sys.path.append("../libs")
sys.path.append("../")
from definitions import ROOT_DIR
import utils

# Set pandas display options:
# no scientific notation, two decimal places, comma as thousands separator
pd.options.display.float_format = '{:,.2f}'.format

# ETL

## 1. Load dataset

In [3]:
# Load dataset from data_files/internet.xlsx - sheet: 'Ingresos '
df = utils.get_xls_sheet_data('/data_files/internet.xlsx', 'Ingresos ')
df.head()

Unnamed: 0,Año,Trimestre,Ingresos (miles de pesos),Periodo
0,2024,2,442032166.69,Abr-Jun 2024
1,2024,1,346198986.13,Ene-Mar 2024
2,2023,4,167376014.8,Oct-Dic 2023
3,2023,3,133106593.41,Jul-Sept 2023
4,2023,2,118060280.28,Jun-Mar 2023


## 2. Handling missing values

### Look for missing values in all cells

In [4]:
#Find missing values
df.isnull().sum()

Año                          0
Trimestre                    0
Ingresos (miles de pesos)    0
Periodo                      0
dtype: int64

#### There are no missing values

## 3. Look for Duplicates

### Find duplicates for complete rows

In [5]:
#Find duplicates by complete row
df.duplicated().sum()

0

#### There are no fully duplicated rows

### Find duplicated rows for year and quarter ('Trimestre')

In [6]:
#Find duplicated rows by row, for year and quarter
df.duplicated(subset=['Año', 'Trimestre']).sum()

0

#### There are no duplicated rows for year and quarter
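
Had the count been non-zero, the offending rows would need to be inspected rather than just counted. The following is a minimal sketch of how that could look, using a small hand-built sample (the values are made up for illustration and are not from the dataset). Passing `keep=False` marks every member of a duplicated group, not only the repeats.

```python
import pandas as pd

# Illustrative sample only: two rows share the same year and quarter
sample = pd.DataFrame({
    "Año": [2023, 2023, 2024],
    "Trimestre": [1, 1, 1],
    "Ingresos (miles de pesos)": [1.0, 2.0, 3.0],
})

# keep=False flags all rows of each duplicated (Año, Trimestre) pair
mask = sample.duplicated(subset=["Año", "Trimestre"], keep=False)
dupes = sample[mask]
print(dupes)
```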

## 4. Finding outliers

In [7]:
#Finding outliers
df.describe()

Unnamed: 0,Año,Trimestre,Ingresos (miles de pesos)
count,42.0,42.0,42.0
mean,2019.0,2.45,50016480.78
std,3.73,1.13,87102080.46
min,2014.0,1.0,2984054.21
25%,2016.0,1.25,7055326.25
50%,2019.0,2.0,20475265.73
75%,2021.0,3.0,44850901.45
max,2033.0,4.0,442032166.69


#### The summary statistics show an outlier in the year column ("Año"): the maximum value is 2033, which is impossible for historical data. We review the year column in detail next.
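
A range check can surface this kind of value directly instead of reading it off `describe()`. The sketch below assumes a hypothetical cutoff, `LAST_VALID_YEAR`, set to 2024 because that is the latest year seen in the first rows of the dataset; the sample rows are illustrative.

```python
import pandas as pd

# Illustrative sample; only the 2033 entry mirrors the anomaly seen above
sample = pd.DataFrame({
    "Año": [2024, 2033, 2023, 2014],
    "Trimestre": [1, 1, 2, 1],
})

# Assumption: no record should be dated after this year
LAST_VALID_YEAR = 2024

# Rows whose year lies beyond the valid range
suspicious = sample[sample["Año"] > LAST_VALID_YEAR]
print(suspicious)
```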

In [8]:
df.groupby(['Año','Trimestre']).sum()

Año,Trimestre,Ingresos (miles de pesos),Periodo
2014,1,2984054.21,Ene-Mar 2014
2014,2,3270816.2,Abr-Jun 2014
2014,3,3478637.74,Jul-Sept 2014
2014,4,3950440.78,Oct-Dic 2014
2015,1,4876385.32,Ene-Mar 2015
2015,2,4701790.72,Abr-Jun 2015
2015,3,5153738.88,Jul-Sept 2015
2015,4,5376899.21,Oct-Dic 2015
2016,1,5936844.89,Ene-Mar 2016
2016,2,6534240.6,Abr-Jun 2016


#### There is no 1st-quarter data for 20**2**3, but there is 1st-quarter data for 20**3**3. This is most likely a typing error, and the correct year should be 2023. The "Periodo" value for that row is "Ene-Mar 2023", which supports the hypothesis. We will correct this error.
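
The same cross-check between "Año" and "Periodo" can be done for every row at once. This is a minimal sketch, assuming each "Periodo" label ends with a four-digit year as in the rows shown earlier; the sample frame is hand-built for illustration.

```python
import pandas as pd

# Illustrative sample: the first row reproduces the suspected typo
sample = pd.DataFrame({
    "Año": [2033, 2023, 2023],
    "Trimestre": [1, 2, 3],
    "Periodo": ["Ene-Mar 2023", "Abr-Jun 2023", "Jul-Sept 2023"],
})

# Pull the trailing four-digit year out of the label
year_in_label = sample["Periodo"].str.extract(r"(\d{4})$", expand=False).astype(int)

# Rows where the year column disagrees with the label
mismatches = sample[sample["Año"] != year_in_label]
print(mismatches)
```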

In [9]:
#Change year 2033 to 2023
df['Año'] = df['Año'].replace(2033, 2023)
df.groupby(['Año','Trimestre']).sum()

Año,Trimestre,Ingresos (miles de pesos),Periodo
2014,1,2984054.21,Ene-Mar 2014
2014,2,3270816.2,Abr-Jun 2014
2014,3,3478637.74,Jul-Sept 2014
2014,4,3950440.78,Oct-Dic 2014
2015,1,4876385.32,Ene-Mar 2015
2015,2,4701790.72,Abr-Jun 2015
2015,3,5153738.88,Jul-Sept 2015
2015,4,5376899.21,Oct-Dic 2015
2016,1,5936844.89,Ene-Mar 2016
2016,2,6534240.6,Abr-Jun 2016


#### Income ("Ingresos (miles de pesos)") grows consistently. The first quarter of 2024 roughly doubles the previous quarter, which may be due to the volatility of the currency in Argentina; this will be explored in detail in the EDA section.
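
To put a number on that jump, quarter-over-quarter growth can be computed with `pct_change`. The sketch below uses the five most recent rows shown in the `df.head()` output; the `growth_pct` column name is an assumption introduced for illustration.

```python
import pandas as pd

# The five most recent quarters, as shown by df.head() earlier
recent = pd.DataFrame({
    "Año": [2024, 2024, 2023, 2023, 2023],
    "Trimestre": [2, 1, 4, 3, 2],
    "Ingresos (miles de pesos)": [
        442032166.69, 346198986.13, 167376014.80, 133106593.41, 118060280.28,
    ],
})

# pct_change compares each row with the one before it,
# so the rows must be in chronological order first
recent = recent.sort_values(["Año", "Trimestre"]).reset_index(drop=True)

# Growth relative to the previous quarter, in percent
recent["growth_pct"] = recent["Ingresos (miles de pesos)"].pct_change() * 100
print(recent[["Año", "Trimestre", "growth_pct"]])
```

The first row has no previous quarter in this sample, so its growth is `NaN`. For the first quarter of 2024 the growth comes out at roughly 107%, that is, slightly more than double the fourth quarter of 2023.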

## 5. Data Types

### We will review the data types for each column.

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42 entries, 0 to 41
Data columns (total 4 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Año                        42 non-null     int64  
 1   Trimestre                  42 non-null     int64  
 2   Ingresos (miles de pesos)  42 non-null     float64
 3   Periodo                    42 non-null     object 
dtypes: float64(1), int64(2), object(1)
memory usage: 1.4+ KB


#### The data types are consistent with the data provided. Only the "Periodo" column has the object dtype, and it holds string values. We could convert it to the string dtype, but since it is redundant with the year ("Año") and quarter ("Trimestre") columns, we will drop it instead.

In [11]:
#drop Periodo column
df = df.drop('Periodo', axis=1)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42 entries, 0 to 41
Data columns (total 3 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Año                        42 non-null     int64  
 1   Trimestre                  42 non-null     int64  
 2   Ingresos (miles de pesos)  42 non-null     float64
dtypes: float64(1), int64(2)
memory usage: 1.1 KB


## 6. New columns

### Create a new column with the year and quarter

In [12]:
#Create a new column with the year and quarter, e.g. 2024T2
df['Periodo'] = (df['Año'].astype(str) + 'T' + df['Trimestre'].astype(str)).astype('string')
df.head()

Unnamed: 0,Año,Trimestre,Ingresos (miles de pesos),Periodo
0,2024,2,442032166.69,2024T2
1,2024,1,346198986.13,2024T1
2,2023,4,167376014.8,2023T4
3,2023,3,133106593.41,2023T3
4,2023,2,118060280.28,2023T2


# Save dataset

In [13]:
df.to_parquet(ROOT_DIR + '/data_files/ingresos_clean.parquet')