# First Exploratory Data Analysis
This first notebook analyzes each data file individually in order to understand its structure and decide how to proceed during data processing.

## Files
The data consists of 7 CSV files, which provide both weather data and data on the different users and their energy generation and consumption habits.

All files are located in a single folder named "data". They are read sequentially, and for each one the notebook reports the data type and summary statistics of every column.

# Imports
The packages required for the data analysis are listed below.

In [9]:
import polars as pl
import matplotlib.pyplot as plt
from pathlib import Path
import numpy as np
from IPython.display import Markdown, display

# Functions
The functions required for the data analysis are defined below.

In [10]:
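def display_column(col: pl.Series):
    """Print the dtype, counts, and basic statistics of a single column."""
    dtype = col.dtype
    display(Markdown(f"**{col.name}:**\t{dtype}"))

    numeric_types = [pl.Int64, pl.Float64]
    date_types = [pl.Date, pl.Datetime]

    # count() excludes nulls, so count + null_count equals the column length.
    print(f"\tcount: {col.count()}")
    print(f"\tnº nulls: {col.null_count()}")
    if dtype in numeric_types or dtype in date_types:
        # A min/max range is meaningful for both numeric and temporal columns.
        print(f"\tmin: {col.min()}")
        print(f"\tmax: {col.max()}")

        if dtype in numeric_types:
            print(f"\tmean: {col.mean()}")
            print(f"\tstd: {col.std()}")

    elif dtype == pl.Categorical:
        # For categorical columns, list the distinct values instead.
        print(f"\tvalues: {col.unique()}")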

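def describe_data(df: pl.DataFrame):
    """Show the first rows of a DataFrame, then summarize every column."""
    display(df.head(3))
    for column in df.columns:
        display_column(df.get_column(column))


def describe_full_data(data_folder: Path):
    """Read and describe every CSV file found in data_folder."""
    # Sort so the files are always processed in the same (alphabetical) order.
    files = sorted(data_folder.glob("*.csv"))

    for file in files:
        display(Markdown(f"### {file.name}"))
        # try_parse_dates lets polars infer Date/Datetime columns from text.
        data = pl.read_csv(file, try_parse_dates=True)
        describe_data(data)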


In [11]:
data_path = Path("../data/").resolve()
data_path

WindowsPath('D:/Clases/4o Curso/2o Cuatrimestre/TFG/TFG-ProduccionElectrica/codigo/data')

In [12]:
describe_full_data(data_path)

### client.csv

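| product_type | county | eic_count | installed_capacity | is_business | date | data_block_id |
|---|---|---|---|---|---|---|
| i64 | i64 | i64 | f64 | i64 | date | i64 |
| 1 | 0 | 108 | 952.89 | 0 | 2021-09-01 | 2 |
| 2 | 0 | 17 | 166.4 | 0 | 2021-09-01 | 2 |
| 3 | 0 | 688 | 7207.88 | 0 | 2021-09-01 | 2 |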


**product_type:**	Int64

	count: 41919
	nº nulls: 0
	min: 0
	max: 3
	mean: 1.8989956821489062
	std: 1.0817126643589838


**county:**	Int64

	count: 41919
	nº nulls: 0
	min: 0
	max: 15
	mean: 7.297096781888881
	std: 4.780750373851959


**eic_count:**	Int64

	count: 41919
	nº nulls: 0
	min: 5
	max: 1517
	mean: 73.34511796560032
	std: 144.0643893026519


**installed_capacity:**	Float64

	count: 41919
	nº nulls: 0
	min: 5.5
	max: 19314.31
	mean: 1450.771451370499
	std: 2422.233120188548


**is_business:**	Int64

	count: 41919
	nº nulls: 0
	min: 0
	max: 1
	mean: 0.5367733008898113
	std: 0.49865183856739687


**date:**	Date

	count: 41919
	nº nulls: 0
	min: 2021-09-01
	max: 2023-05-29


**data_block_id:**	Int64

	count: 41919
	nº nulls: 0
	min: 2
	max: 637
	mean: 322.8988764044944
	std: 182.07572371646717


### electricity_prices.csv

| forecast_date | euros_per_mwh | origin_date | data_block_id |
|---|---|---|---|
| datetime[μs] | f64 | datetime[μs] | i64 |
| 2021-09-01 00:00:00 | 92.51 | 2021-08-31 00:00:00 | 1 |
| 2021-09-01 01:00:00 | 88.9 | 2021-08-31 01:00:00 | 1 |
| 2021-09-01 02:00:00 | 87.35 | 2021-08-31 02:00:00 | 1 |


**forecast_date:**	Datetime(time_unit='us', time_zone=None)

	count: 15286
	nº nulls: 0
	min: 2021-09-01 00:00:00
	max: 2023-05-30 23:00:00


**euros_per_mwh:**	Float64

	count: 15286
	nº nulls: 0
	min: -10.06
	max: 4000.0
	mean: 157.06417571634174
	std: 121.14862497158866


**origin_date:**	Datetime(time_unit='us', time_zone=None)

	count: 15286
	nº nulls: 0
	min: 2021-08-31 00:00:00
	max: 2023-05-29 23:00:00


**data_block_id:**	Int64

	count: 15286
	nº nulls: 0
	min: 1
	max: 637
	mean: 318.9907104540102
	std: 183.89030108081295


### forecast_weather.csv

| latitude | longitude | origin_datetime | hours_ahead | temperature | dewpoint | cloudcover_high | cloudcover_low | cloudcover_mid | cloudcover_total | 10_metre_u_wind_component | 10_metre_v_wind_component | data_block_id | forecast_datetime | direct_solar_radiation | surface_solar_radiation_downwards | snowfall | total_precipitation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| f64 | f64 | datetime[μs] | i64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | i64 | datetime[μs] | f64 | f64 | f64 | f64 |
| 57.6 | 21.7 | 2021-09-01 02:00:00 | 1 | 15.655786 | 11.553613 | 0.904816 | 0.019714 | 0.0 | 0.905899 | -0.411328 | -9.106137 | 1 | 2021-09-01 03:00:00 | 0.0 | 0.0 | 0.0 | 0.0 |
| 57.6 | 22.2 | 2021-09-01 02:00:00 | 1 | 13.003931 | 10.689844 | 0.886322 | 0.004456 | 0.0 | 0.886658 | 0.206347 | -5.355405 | 1 | 2021-09-01 03:00:00 | 0.0 | 0.0 | 0.0 | 0.0 |
| 57.6 | 22.7 | 2021-09-01 02:00:00 | 1 | 14.206567 | 11.671777 | 0.729034 | 0.005615 | 0.0 | 0.730499 | 1.451587 | -7.417905 | 1 | 2021-09-01 03:00:00 | 0.0 | 0.0 | 0.0 | 0.0 |


**latitude:**	Float64

	count: 3424512
	nº nulls: 0
	min: 57.6
	max: 59.7
	mean: 58.650000000000006
	std: 0.6873864546060713


**longitude:**	Float64

	count: 3424512
	nº nulls: 0
	min: 21.7
	max: 28.2
	mean: 24.94999999999998
	std: 2.015564731359616


**origin_datetime:**	Datetime(time_unit='us', time_zone=None)

	count: 3424512
	nº nulls: 0
	min: 2021-09-01 02:00:00
	max: 2023-05-30 02:00:00


**hours_ahead:**	Int64

	count: 3424512
	nº nulls: 0
	min: 1
	max: 48
	mean: 24.5
	std: 13.853401124226906


**temperature:**	Float64

	count: 3424512
	nº nulls: 0
	min: -27.499395751953102
	max: 31.810693359375023
	mean: 5.743912920130573
	std: 7.844205564571225


**dewpoint:**	Float64

	count: 3424512
	nº nulls: 0
	min: -29.683569335937477
	max: 23.680566406250023
	mean: 2.411946331585833
	std: 7.121431885390269


**cloudcover_high:**	Float64

	count: 3424512
	nº nulls: 0
	min: 0.0
	max: 1.0000076293945312
	mean: 0.39466538707738674
	std: 0.44404250997993655


**cloudcover_low:**	Float64

	count: 3424512
	nº nulls: 0
	min: 0.0
	max: 1.0000076293945312
	mean: 0.43464533793686067
	std: 0.4386346186601451


**cloudcover_mid:**	Float64

	count: 3424512
	nº nulls: 0
	min: 0.0
	max: 1.0000076293945312
	mean: 0.35906934760703196
	std: 0.42015560381711575


**cloudcover_total:**	Float64

	count: 3424512
	nº nulls: 0
	min: 0.0
	max: 1.0000076293945312
	mean: 0.6819927145740771
	std: 0.40096288460112794


**10_metre_u_wind_component:**	Float64

	count: 3424512
	nº nulls: 0
	min: -17.577178955078125
	max: 22.573198318481445
	mean: 1.255446288060258
	std: 3.9953004119000584


**10_metre_v_wind_component:**	Float64

	count: 3424512
	nº nulls: 0
	min: -22.116119384765625
	max: 19.314369201660156
	mean: 0.7250110057142549
	std: 4.223751603346017


**data_block_id:**	Int64

	count: 3424512
	nº nulls: 0
	min: 1
	max: 637
	mean: 319.0
	std: 183.88586099564722


**forecast_datetime:**	Datetime(time_unit='us', time_zone=None)

	count: 3424512
	nº nulls: 0
	min: 2021-09-01 03:00:00
	max: 2023-06-01 02:00:00


**direct_solar_radiation:**	Float64

	count: 3424512
	nº nulls: 0
	min: -0.773333333334449
	max: 954.4222222222215
	mean: 151.1882116442292
	std: 256.5069027992335


**surface_solar_radiation_downwards:**	Float64

	count: 3424510
	nº nulls: 2
	min: -0.3258333333333212
	max: 848.7144444444439
	mean: 110.76420663802111
	std: 187.44437281644247


**snowfall:**	Float64

	count: 3424512
	nº nulls: 0
	min: -3.814697265625e-06
	max: 0.0048329830169677734
	mean: 2.533922635490404e-05
	std: 0.0001222840090302696


**total_precipitation:**	Float64

	count: 3424512
	nº nulls: 0
	min: -1.5290977898985147e-05
	max: 0.01651620864868164
	mean: 7.863859249567455e-05
	std: 0.000278087973486791


### gas_prices.csv

| forecast_date | lowest_price_per_mwh | highest_price_per_mwh | origin_date | data_block_id |
|---|---|---|---|---|
| date | f64 | f64 | date | i64 |
| 2021-09-01 | 45.23 | 46.32 | 2021-08-31 | 1 |
| 2021-09-02 | 45.62 | 46.29 | 2021-09-01 | 2 |
| 2021-09-03 | 45.85 | 46.4 | 2021-09-02 | 3 |


**forecast_date:**	Date

	count: 637
	nº nulls: 0
	min: 2021-09-01
	max: 2023-05-30


**lowest_price_per_mwh:**	Float64

	count: 637
	nº nulls: 0
	min: 28.1
	max: 250.0
	mean: 95.03675039246471
	std: 47.55229451330453


**highest_price_per_mwh:**	Float64

	count: 637
	nº nulls: 0
	min: 34.0
	max: 305.0
	mean: 107.75463108320251
	std: 54.743666079706344


**origin_date:**	Date

	count: 637
	nº nulls: 0
	min: 2021-08-31
	max: 2023-05-29


**data_block_id:**	Int64

	count: 637
	nº nulls: 0
	min: 1
	max: 637
	mean: 319.0
	std: 184.03034170121694


### historical_weather.csv

| datetime | temperature | dewpoint | rain | snowfall | surface_pressure | cloudcover_total | cloudcover_low | cloudcover_mid | cloudcover_high | windspeed_10m | winddirection_10m | shortwave_radiation | direct_solar_radiation | diffuse_radiation | latitude | longitude | data_block_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| datetime[μs] | f64 | f64 | f64 | f64 | f64 | i64 | i64 | i64 | i64 | f64 | i64 | f64 | f64 | f64 | f64 | f64 | f64 |
| 2021-09-01 00:00:00 | 14.2 | 11.6 | 0.0 | 0.0 | 1015.9 | 31 | 31 | 0 | 11 | 7.083333 | 8 | 0.0 | 0.0 | 0.0 | 57.6 | 21.7 | 1.0 |
| 2021-09-01 00:00:00 | 13.9 | 11.5 | 0.0 | 0.0 | 1010.7 | 33 | 37 | 0 | 0 | 5.111111 | 359 | 0.0 | 0.0 | 0.0 | 57.6 | 22.2 | 1.0 |
| 2021-09-01 00:00:00 | 14.0 | 12.5 | 0.0 | 0.0 | 1015.0 | 31 | 34 | 0 | 0 | 6.333333 | 355 | 0.0 | 0.0 | 0.0 | 57.6 | 22.7 | 1.0 |


**datetime:**	Datetime(time_unit='us', time_zone=None)

	count: 1710802
	nº nulls: 0
	min: 2021-09-01 00:00:00
	max: 2023-05-30 10:00:00


**temperature:**	Float64

	count: 1710802
	nº nulls: 0
	min: -23.7
	max: 32.6
	mean: 5.7409677449523615
	std: 8.025647396635456


**dewpoint:**	Float64

	count: 1710802
	nº nulls: 0
	min: -25.9
	max: 23.8
	mean: 2.2403116783824197
	std: 7.224356820215283


**rain:**	Float64

	count: 1710802
	nº nulls: 0
	min: 0.0
	max: 16.8
	mean: 0.04962011968655634
	std: 0.20791129631705607


**snowfall:**	Float64

	count: 1710802
	nº nulls: 0
	min: 0.0
	max: 2.66
	mean: 0.016048958324809067
	std: 0.07462935561057461


**surface_pressure:**	Float64

	count: 1710802
	nº nulls: 0
	min: 942.9
	max: 1049.3
	mean: 1009.2815146346568
	std: 13.08891494992898


**cloudcover_total:**	Int64

	count: 1710802
	nº nulls: 0
	min: 0
	max: 100
	mean: 60.91269591688577
	std: 37.76904750088249


**cloudcover_low:**	Int64

	count: 1710802
	nº nulls: 0
	min: 0
	max: 100
	mean: 46.685927418836314
	std: 40.74759834341767


**cloudcover_mid:**	Int64

	count: 1710802
	nº nulls: 0
	min: 0
	max: 100
	mean: 34.406980468809365
	std: 38.327692910054715


**cloudcover_high:**	Int64

	count: 1710802
	nº nulls: 0
	min: 0
	max: 100
	mean: 36.051408053065174
	std: 41.35852074534182


**windspeed_10m:**	Float64

	count: 1710802
	nº nulls: 0
	min: 0.0
	max: 21.75
	mean: 4.849871385856851
	std: 2.475450310599722


**winddirection_10m:**	Int64

	count: 1710802
	nº nulls: 0
	min: 0
	max: 360
	mean: 197.86941913792478
	std: 89.9379784466276


**shortwave_radiation:**	Float64

	count: 1710802
	nº nulls: 0
	min: 0.0
	max: 849.0
	mean: 106.49050445346685
	std: 179.94491172141247


**direct_solar_radiation:**	Float64

	count: 1710802
	nº nulls: 0
	min: 0.0
	max: 754.0
	mean: 64.45291740365045
	std: 133.40995132346063


**diffuse_radiation:**	Float64

	count: 1710802
	nº nulls: 0
	min: 0.0
	max: 388.0
	mean: 42.0375870498164
	std: 61.95225075947365


**latitude:**	Float64

	count: 1710802
	nº nulls: 0
	min: 57.6
	max: 59.7
	mean: 58.649998772505526
	std: 0.6873870908607653


**longitude:**	Float64

	count: 1710802
	nº nulls: 0
	min: 21.7
	max: 28.2
	mean: 24.94999853869706
	std: 2.015564373635757


**data_block_id:**	Float64

	count: 1710802
	nº nulls: 0
	min: 1.0
	max: 637.0
	mean: 319.2707782665674
	std: 183.72979800619706


### train.csv

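| county | is_business | product_type | target | is_consumption | datetime | data_block_id | row_id | prediction_unit_id |
|---|---|---|---|---|---|---|---|---|
| i64 | i64 | i64 | f64 | i64 | datetime[μs] | i64 | i64 | i64 |
| 0 | 0 | 1 | 0.713 | 0 | 2021-09-01 00:00:00 | 0 | 0 | 0 |
| 0 | 0 | 1 | 96.59 | 1 | 2021-09-01 00:00:00 | 0 | 1 | 0 |
| 0 | 0 | 2 | 0.0 | 0 | 2021-09-01 00:00:00 | 0 | 2 | 1 |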


**county:**	Int64

	count: 2018352
	nº nulls: 0
	min: 0
	max: 15
	mean: 7.297034412233347
	std: 4.780990335833066


**is_business:**	Int64

	count: 2018352
	nº nulls: 0
	min: 0
	max: 1
	mean: 0.5368260838545507
	std: 0.49864211889842713


**product_type:**	Int64

	count: 2018352
	nº nulls: 0
	min: 0
	max: 3
	mean: 1.8989274417940973
	std: 1.0817658113000004


**target:**	Float64

	count: 2017824
	nº nulls: 528
	min: 0.0
	max: 15480.274
	mean: 274.85556009889854
	std: 909.5023780198597


**is_consumption:**	Int64

	count: 2018352
	nº nulls: 0
	min: 0
	max: 1
	mean: 0.5
	std: 0.5000001238634751


**datetime:**	Datetime(time_unit='us', time_zone=None)

	count: 2018352
	nº nulls: 0
	min: 2021-09-01 00:00:00
	max: 2023-05-31 23:00:00


**data_block_id:**	Int64

	count: 2018352
	nº nulls: 0
	min: 0
	max: 637
	mean: 321.8745986824895
	std: 182.6343140006769


**row_id:**	Int64

	count: 2018352
	nº nulls: 0
	min: 0
	max: 2018351
	mean: 1009175.5
	std: 582648.179597259


**prediction_unit_id:**	Int64

	count: 2018352
	nº nulls: 0
	min: 0
	max: 68
	mean: 33.04537563318985
	std: 19.590593978268064


### weather_station_to_county_mapping.csv

| county_name | longitude | latitude | county |
|---|---|---|---|
| str | f64 | f64 | i64 |
| null | 21.7 | 57.6 | null |
| null | 21.7 | 57.9 | null |
| null | 21.7 | 58.2 | null |


**county_name:**	String

	count: 49
	nº nulls: 63


**longitude:**	Float64

	count: 112
	nº nulls: 0
	min: 21.7
	max: 28.2
	mean: 24.949999999999978
	std: 2.024623199288969


**latitude:**	Float64

	count: 112
	nº nulls: 0
	min: 57.6
	max: 59.69999999999998
	mean: 58.649999999999984
	std: 0.6904757466824936


**county:**	Int64

	count: 49
	nº nulls: 63
	min: 0
	max: 15
	mean: 7.061224489795919
	std: 4.870866466662212
