# Dataset extraction
First, we will extract the datasets from the ZIP files and create CSV/Parquet files for easier loading later.

We have two ZIP files: one containing daily consumption data and another containing hourly consumption data. These datasets are not related to each other.

In [25]:
# Necessary imports to read the datasets
import pandas as pd
import zipfile
import os
import numpy as np

We have two zip files:
1) **Daily Consumption**
2) **Hourly Consumption**

# DAILY DATA
Inside the file containing the daily consumption data, we have two sub-datasets: one aggregated by census section and another aggregated by economic activity.

## First dataset: Census Section
For this dataset, we keep only the rows for the municipalities of Badalona, Barcelona, and L'Hospitalet de Llobregat, as they are the three main sources of data.


In [29]:
# Route to the ZIP file
zip_route1 = '../data/daily_consumption_dataset.zip'

# We open the first ZIP file
with zipfile.ZipFile(zip_route1, 'r') as zip_file:
    # Then we list the different files inside
    file_list = zip_file.namelist()
    print("Files in the ZIP:", file_list)
    
    # Load the first CSV (data aggregated by census section)
    with zip_file.open(file_list[0]) as csv_file1:
        # low_memory=False makes pandas read the whole file before inferring each
        # column's dtype, which avoids mixed-type warnings on large files
        df_census_section = pd.read_csv(csv_file1, low_memory=False)
    
    # Load the second CSV (data for industrial/commercial use)
    with zip_file.open(file_list[1]) as csv_file2:
        # Same option as above: infer each column's dtype from the whole file
        df_industrial_commercial_use = pd.read_csv(csv_file2, low_memory=False)

# We display the first few rows of each DataFrame for verification
print("##### DATA AGGREGATED BY CENSUS SECTION:")
print(df_census_section.head())

print("\n\n\n##### DATA FOR INDUSTRIAL/COMMERCIAL USE:")
print(df_industrial_commercial_use.head())

Files in the ZIP: ['daily_dataset.csv', 'daily_dataset_economic_activity.csv']
##### DATA AGGREGATED BY CENSUS SECTION:
  Secció censal/Sección censal/Census section Districte/Distrito/District  \
0                                  0801501001                          01   
1                                  0801501001                          01   
2                                  0801501001                          01   
3                                  0801501001                          01   
4                                  0801501001                          01   

  Municipi/Municipio/Municipality Data/Fecha/Date  \
0                        BADALONA      2021-05-26   
1                        BADALONA      2021-05-26   
2                        BADALONA      2021-05-27   
3                        BADALONA      2021-05-27   
4                        BADALONA      2021-05-28   

                       Ús/Uso/Use  \
0  Comercial/Comercial/Commercial   
1     Domèstic/Doméstico
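
The loading cell picks the two CSV files by position (`file_list[0]`, `file_list[1]`), which depends on the order of entries in the archive. A minimal sketch of selecting a member by name instead is shown below. The helper `find_member` and the tiny in-memory ZIP are hypothetical and only stand in for the real archive; the file names are the ones printed above.

```python
import io
import zipfile


def find_member(zip_file, wanted_name):
    """Return the archive path whose base name equals wanted_name (hypothetical helper)."""
    matches = [
        name for name in zip_file.namelist()
        if name.rsplit("/", 1)[-1] == wanted_name
    ]
    if len(matches) != 1:
        raise ValueError(f"Expected exactly one {wanted_name!r}, found {matches}")
    return matches[0]


# Build a tiny in-memory ZIP standing in for the real archive (toy contents)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("daily_dataset_economic_activity.csv", "a,b\n1,2\n")
    zf.writestr("daily_dataset.csv", "a,b\n3,4\n")

# Select the census file by name, regardless of its position in the archive
with zipfile.ZipFile(buffer, "r") as zf:
    census_member = find_member(zf, "daily_dataset.csv")
    with zf.open(census_member) as handle:
        census_text = handle.read().decode("utf-8")

print(census_member)
```

In the notebook, the open handle could be passed to `pd.read_csv` exactly as in the cell above.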

In [30]:
# Municipalities we are interested in
selected_municipalities = ["BARCELONA", "BADALONA", "L'HOSPITALET LLOBR."]

# Create df1 with only the selected municipalities
df1 = df_census_section[df_census_section["Municipi/Municipio/Municipality"].isin(selected_municipalities)]

# Create df2 with municipalities that are NOT in the selected list
df2 = df_census_section[~df_census_section["Municipi/Municipio/Municipality"].isin(selected_municipalities)]
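
The split above evaluates the same `isin` condition twice. As a small sketch on toy data (the values below are illustrative, not taken from the dataset), the boolean mask can be built once and reused for both halves; the names `is_selected`, `df_selected`, and `df_other` are assumptions introduced here.

```python
import pandas as pd

# Toy stand-in for df_census_section (values are illustrative only)
df = pd.DataFrame({
    "Municipi/Municipio/Municipality": ["BADALONA", "PALLEJA", "BARCELONA", "EL PAPIOL"],
    "value": [1, 2, 3, 4],
})
selected_municipalities = ["BARCELONA", "BADALONA", "L'HOSPITALET LLOBR."]

# Build the boolean mask once and reuse it for both halves of the split
is_selected = df["Municipi/Municipio/Municipality"].isin(selected_municipalities)
df_selected = df[is_selected].copy()   # .copy() gives an independent frame
df_other = df[~is_selected].copy()

print(len(df_selected), len(df_other))
```

Taking explicit copies keeps later modifications, such as renaming columns, clearly separate from the original DataFrame.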


In [31]:
# Count the number of rows in each dataframe
num_rows_df1 = df1.shape[0]
num_rows_df2 = df2.shape[0]

# Get unique municipalities in each dataframe
unique_municipalities_df1 = df1["Municipi/Municipio/Municipality"].unique()
unique_municipalities_df2 = df2["Municipi/Municipio/Municipality"].unique()

# Print the information in a clear and structured way
print("### Information about df1 (Barcelona, Badalona, L'Hospitalet de Llobregat) ###")
print(f"Total rows: {num_rows_df1}")
print("Included municipalities:", ", ".join(unique_municipalities_df1))
print("\n")

print("### Information about df2 (Other municipalities) ###")
print(f"Total rows: {num_rows_df2}")
print("Included municipalities:", ", ".join(unique_municipalities_df2))


### Information about df1 (Barcelona, Badalona, L'Hospitalet de Llobregat) ###
Total rows: 3547935
Included municipalities: BADALONA, BARCELONA, L'HOSPITALET LLOBR.


### Information about df2 (Other municipalities) ###
Total rows: 27027
Included municipalities: PALLEJA, EL PAPIOL


### Observation
Only **27,027** rows come from Pallejà and El Papiol, which is less than 1% of the dataset. We will therefore focus on Badalona, Barcelona, and L'Hospitalet de Llobregat, which together account for **3,547,935** rows.
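
To put the two row counts in proportion, the share of rows outside the three selected municipalities can be computed directly from the totals printed above. This is only a quick arithmetic check; the variable names are assumptions.

```python
# Row counts taken from the printed output above
rows_selected = 3_547_935   # Badalona, Barcelona, L'Hospitalet de Llobregat
rows_other = 27_027         # Pallejà and El Papiol

# Fraction of the whole dataset that belongs to the other municipalities
share_other = rows_other / (rows_selected + rows_other)
print(f"Share of rows from other municipalities: {share_other:.2%}")
```

On the full DataFrame, `df_census_section["Municipi/Municipio/Municipality"].value_counts(normalize=True)` would give the same kind of breakdown per municipality.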

In [33]:
# Now we are going to rename the columns for easier access
df1.columns = [
    "Census section", 
    "District", 
    "Municipality", 
    "Date", 
    "Use", 
    "Number of meters", 
    "Accumulated consumption (L/day)"
]
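
Assigning a list to `df1.columns` renames by position, so it silently mislabels data if the column order ever changes. A possible alternative, sketched here on a toy DataFrame, is an explicit mapping from the original trilingual headers to the short names. The mapping below covers only three headers visible in the printed output, and the names `column_names` and `toy` are assumptions. Note that taking the text after the last `/` would not work as a shortcut for headers that contain units such as `(L/day)`.

```python
import pandas as pd

# Explicit old-name -> new-name mapping (only a subset of columns, for illustration)
column_names = {
    "Municipi/Municipio/Municipality": "Municipality",
    "Data/Fecha/Date": "Date",
    "Ús/Uso/Use": "Use",
}

# Toy DataFrame whose column order differs from the mapping order
toy = pd.DataFrame({
    "Data/Fecha/Date": ["2021-05-26"],
    "Municipi/Municipio/Municipality": ["BADALONA"],
    "Ús/Uso/Use": ["Comercial/Comercial/Commercial"],
})

# Fail loudly if an expected source column is absent
missing = set(column_names) - set(toy.columns)
if missing:
    raise KeyError(f"Columns not found: {sorted(missing)}")

# rename matches by name, so the column order does not matter
renamed = toy.rename(columns=column_names)
print(list(renamed.columns))
```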

In [34]:
# We save the file as a csv
df1 = df1.reset_index(drop=True)
df1.to_csv("../data/datasets/daily_consumption_census.csv", index=False)


In [35]:
# Let's check if the file is correctly stored

df_loaded = pd.read_csv("../data/datasets/daily_consumption_census.csv", low_memory=False)
if df1.equals(df_loaded):
    print("The CSV File is correctly stored.")
else:
    print("The CSV file differs from the original.")

The CSV File is correctly stored.
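
The check passed for this dataset. As a general caveat when reloading such files, CSV does not store data types, so a code column that contains only digits can lose its leading zeros when pandas infers it as an integer. The sketch below illustrates this on toy data with an in-memory buffer; the two rows are illustrative and the variable names are assumptions.

```python
import io

import pandas as pd

# Toy data: census-section codes stored as text with a leading zero
df = pd.DataFrame({
    "Census section": ["0801501001", "0801501002"],
    "Number of meters": [12, 161],
})

# Write to an in-memory CSV instead of a file on disk
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Default type inference turns the all-digit codes into integers
buffer.seek(0)
inferred = pd.read_csv(buffer)

# Forcing the column to text keeps the codes intact
buffer.seek(0)
as_text = pd.read_csv(buffer, dtype={"Census section": str})

print(inferred["Census section"].tolist())
print(as_text["Census section"].tolist())
```

Passing `dtype` for identifier columns when reading the saved CSV later is one way to guard against this.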


In [36]:
df_loaded

| | Census section | District | Municipality | Date | Use | Number of meters | Accumulated consumption (L/day) |
|---|---|---|---|---|---|---|---|
| 0 | 0801501001 | 01 | BADALONA | 2021-05-26 | Comercial/Comercial/Commercial | 12 | 843 |
| 1 | 0801501001 | 01 | BADALONA | 2021-05-26 | Domèstic/Doméstico/Domestic | 161 | 4891 |
| 2 | 0801501001 | 01 | BADALONA | 2021-05-27 | Comercial/Comercial/Commercial | 12 | 2173 |
| 3 | 0801501001 | 01 | BADALONA | 2021-05-27 | Domèstic/Doméstico/Domestic | 173 | 15458 |
| 4 | 0801501001 | 01 | BADALONA | 2021-05-28 | Comercial/Comercial/Commercial | 12 | 1836 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 3547930 | | | L'HOSPITALET LLOBR. | 2023-12-30 | Domèstic/Doméstico/Domestic | 318 | 4199 |
| 3547931 | | | L'HOSPITALET LLOBR. | 2023-12-30 | Industrial/Industrial/Industrial | 12 | 25802 |
| 3547932 | | | L'HOSPITALET LLOBR. | 2023-12-31 | Comercial/Comercial/Commercial | 8 | 1277 |
| 3547933 | | | L'HOSPITALET LLOBR. | 2023-12-31 | Domèstic/Doméstico/Domestic | 318 | 5046 |


## Second dataset: Economic Activity
We apply the same steps to the second CSV file, which was already loaded into `df_industrial_commercial_use`.

In [38]:
# Municipalities we are interested in
selected_municipalities = ["BARCELONA", "BADALONA", "L'HOSPITALET LLOBR."]

# Create df1 with only the selected municipalities
df1 = df_industrial_commercial_use[df_industrial_commercial_use["Municipi/Municipio/Municipality"].isin(selected_municipalities)]

# Create df2 with municipalities that are NOT in the selected list
df2 = df_industrial_commercial_use[~df_industrial_commercial_use["Municipi/Municipio/Municipality"].isin(selected_municipalities)]

In [39]:
# Count the number of rows in each dataframe
num_rows_df1 = df1.shape[0]
num_rows_df2 = df2.shape[0]

# Get unique municipalities in each dataframe
unique_municipalities_df1 = df1["Municipi/Municipio/Municipality"].unique()
unique_municipalities_df2 = df2["Municipi/Municipio/Municipality"].unique()

# Print the information in a clear and structured way
print("### Information about df1 (Barcelona, Badalona, L'Hospitalet de Llobregat) ###")
print(f"Total rows: {num_rows_df1}")
print("Included municipalities:", ", ".join(unique_municipalities_df1))
print("\n")

print("### Information about df2 (Other municipalities) ###")
print(f"Total rows: {num_rows_df2}")
print("Included municipalities:", ", ".join(unique_municipalities_df2))


### Information about df1 (Barcelona, Badalona, L'Hospitalet de Llobregat) ###
Total rows: 7833423
Included municipalities: BARCELONA, L'HOSPITALET LLOBR.


### Information about df2 (Other municipalities) ###
Total rows: 0
Included municipalities: 


### Observation

This dataset contains no rows from municipalities outside the selected list, and none from Badalona either: only Barcelona and L'Hospitalet de Llobregat appear. Let's rename the columns and save the dataset as a CSV file.
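
The printed list of included municipalities can also be compared against the selection to see which requested municipalities have no rows at all. A minimal sketch, using the values printed above (the variable names `present` and `missing` are assumptions):

```python
# Municipalities requested in the filter
selected_municipalities = ["BARCELONA", "BADALONA", "L'HOSPITALET LLOBR."]

# Municipalities actually present in df1, copied from the printed output above
present = ["BARCELONA", "L'HOSPITALET LLOBR."]

# Requested municipalities that have no rows in this dataset
missing = sorted(set(selected_municipalities) - set(present))
print("Selected municipalities with no rows:", missing)
```

In the notebook, `present` would come from `df1["Municipi/Municipio/Municipality"].unique()`.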

In [41]:
df1

| | Districte/Distrito/District | Municipi/Municipio/Municipality | Data/Fecha/Date | Ús/Uso/Use | Activitat econòmica/Actividad económica/Economic activity | Descripció activitat econòmica/Descripción actividad económica/Economic activity description | Nombre de comptadors/Número de contadores/Number of meters | Consum acumulat (L/dia)/Consumo acumulado (L/día)/Accumulated consumption (L/day) |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | BARCELONA | 2021-01-01 | Comercial/Comercial/Commercial | A/011 | Directors de cinema i teatre | 1 | 0 |
| 1 | 1 | BARCELONA | 2021-01-01 | Comercial/Comercial/Commercial | A/012 | Ajudants de direcció | 1 | 547 |
| 2 | 1 | BARCELONA | 2021-01-01 | Comercial/Comercial/Commercial | A/015 | Operadors cameres cinema, tv i vídeo | 1 | 0 |
| 3 | 1 | BARCELONA | 2021-01-01 | Comercial/Comercial/Commercial | A/019 | Altres activ. cinema, teatre, circ n.c.a.a | 2 | 13 |
| 4 | 1 | BARCELONA | 2021-01-01 | Comercial/Comercial/Commercial | A/021 | Directors coreogràfics | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7833418 | > | L'HOSPITALET LLOBR. | 2023-12-31 | Industrial/Industrial/Industrial | I/6121 | * Productes alimentaris, begudes i tabac | 1 | 1 |
| 7833419 | > | L'HOSPITALET LLOBR. | 2023-12-31 | Industrial/Industrial/Industrial | I/672 | Serveis en cafeteries | 1 | 1090 |
| 7833420 | > | L'HOSPITALET LLOBR. | 2023-12-31 | Industrial/Industrial/Industrial | I/EMA | COMUNICACIó EMA TARIFA C1A | 1 | 541 |
| 7833421 | > | L'HOSPITALET LLOBR. | 2023-12-31 | Industrial/Industrial/Industrial | I/TAN | LOCALS TANCATS | 1 | 0 |


In [42]:
df1.columns = [
    "District", 
    "Municipality", 
    "Date", 
    "Use", 
    "Economic activity", 
    "Economic activity description",
    "Number of meters",
    "Accumulated consumption (L/day)"
]

In [54]:
# We save the file as a csv
df1 = df1.reset_index(drop=True)
df1.to_csv("../data/datasets/daily_consumption_economic.csv", index=False)


In [55]:
# Let's check if the file is correctly stored

df_loaded = pd.read_csv("../data/datasets/daily_consumption_economic.csv", low_memory=False)
if df1.equals(df_loaded):
    print("The CSV File is correctly stored.")
else:
    print("The CSV file differs from the original.")

The CSV File is correctly stored.


# HOURLY DATA
The second ZIP file contains a single Parquet file with all the hourly data. We read it, rename the columns, and save it again as a Parquet file, which takes less space than a CSV.

In [56]:
# Now we will read the hourly consumption of a subset of meters
# Route to the ZIP file containing the Parquet file
zip_route2 = '../data/hourly_consumption_dataset.zip'

# Open the ZIP file and read the Parquet file inside
with zipfile.ZipFile(zip_route2, 'r') as zip_file:
    # List the files in the ZIP (should contain only one Parquet file)
    file_list = zip_file.namelist()
    print("Files in the ZIP:", file_list)
    
    # Read the Parquet file
    with zip_file.open(file_list[0]) as parquet_file:
        df_consumption = pd.read_parquet(parquet_file)

# Optional: display the first few rows of the DataFrame for verification
print("HOURLY CONSUMPTION DATA:")
print(df_consumption.head())


Files in the ZIP: ['hourly_dataset.parquet']
HOURLY CONSUMPTION DATA:
                               Pòlissa/Póliza/Policy  \
0  9a588b6ba55d2c7baed4c039328f3bfb3fbab2c78dbe30...   
1  9a588b6ba55d2c7baed4c039328f3bfb3fbab2c78dbe30...   
2  9a588b6ba55d2c7baed4c039328f3bfb3fbab2c78dbe30...   
3  9a588b6ba55d2c7baed4c039328f3bfb3fbab2c78dbe30...   
4  9a588b6ba55d2c7baed4c039328f3bfb3fbab2c78dbe30...   

  Tecnologia/Tecnología/Technology  \
0                                R   
1                                R   
2                                R   
3                                R   
4                                R   

   Diàmetre comptador (cm)/Diámetro contador (cm)/Counter diameter (cm)  \
0                                               15.0                      
1                                               15.0                      
2                                               15.0                      
3                                               15.0            

In [57]:
df_consumption.columns = [
    "Policy", 
    "Technology", 
    "Counter diameter (cm)", 
    "Use", 
    "Type of housing", 
    "Date", 
    "Reading index (L/h)"
]

In [58]:
# Reset index
df_consumption = df_consumption.reset_index(drop=True)

# Save as a Parquet file
df_consumption.to_parquet("../data/datasets/hourly_consumption.parquet", index=False)

In [59]:
# Let's check if the file is correctly stored
df_loaded = pd.read_parquet("../data/datasets/hourly_consumption.parquet", engine="pyarrow")

if df_consumption.equals(df_loaded):
    print("The Parquet file is correctly stored.")
else:
    print("The Parquet file differs from the original.")

The Parquet file is correctly stored.
