# AWS Glue Studio Notebook
##### You are now running an AWS Glue Studio notebook. To start using your notebook, you need to start an AWS Glue Interactive Session.


In [2]:
#### Optional: Run this cell to see available notebook commands ("magics").
%help

Welcome to the Glue Interactive Sessions Kernel
For more information on available magic commands, please type %help in any new cell.

Please view our Getting Started page to access the most up-to-date information on the Interactive Sessions kernel: https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions.html
Installed kernel version: 1.0.5 



# Available Magic Commands

## Sessions Magic

----
    %help                             Return a list of descriptions and input types for all magic commands. 
    %profile            String        Specify a profile in your aws configuration to use as the credentials provider.
    %region             String        Specify the AWS region in which to initialize a session. 
                                      Default from ~/.aws/config on Linux or macOS, 
                                      or C:\Users\ USERNAME \.aws\config" on Windows.
    %idle_timeout       Int           The number of minutes of inactivity after which a session will timeout. 
                                      Default: 2880 minutes (48 hours).
    %timeout            Int           The number of minutes after which a session will timeout. 
                                      Default: 2880 minutes (48 hours).
    %session_id_prefix  String        Define a String that will precede all session IDs in the format 
                                      [session_id_prefix]-[session_id]. If a session ID is not provided,
                                      a random UUID will be generated.
    %status                           Returns the status of the current Glue session including its duration, 
                                      configuration and executing user / role.
    %session_id                       Returns the session ID for the running session.
    %list_sessions                    Lists all currently running sessions by ID.
    %stop_session                     Stops the current session.
    %glue_version       String        The version of Glue to be used by this session. 
                                      Currently, the only valid options are 2.0, 3.0 and 4.0. 
                                      Default: 2.0.
    %reconnect          String        Specify a live session ID to switch/reconnect to the sessions.
----

## Selecting Session Types

----
    %streaming          String        Sets the session type to Glue Streaming.
    %etl                String        Sets the session type to Glue ETL.
    %glue_ray           String        Sets the session type to Glue Ray.
    %session_type       String        Specify a session_type to be used. Supported values: streaming, etl and glue_ray. 
----

## Glue Config Magic 
*(common across all session types)*

----

    %%configure         Dictionary    A json-formatted dictionary consisting of all configuration parameters for 
                                      a session. Each parameter can be specified here or through individual magics.
    %iam_role           String        Specify an IAM role ARN to execute your session with.
                                      Default from ~/.aws/config on Linux or macOS, 
                                      or C:\Users\%USERNAME%\.aws\config` on Windows.
    %number_of_workers  int           The number of workers of a defined worker_type that are allocated 
                                      when a session runs.
                                      Default: 5.
    %additional_python_modules  List  Comma separated list of additional Python modules to include in your cluster 
                                      (can be from Pypi or S3).
    %%tags        Dictionary          Specify a json-formatted dictionary consisting of tags to use in the session.
    
    %%assume_role Dictionary, String  Specify a json-formatted dictionary or an IAM role ARN string to create a session 
                                      for cross account access.
                                      E.g. {valid arn}
                                      %%assume_role 
                                      'arn:aws:iam::XXXXXXXXXXXX:role/AWSGlueServiceRole' 
                                      E.g. {credentials}
                                      %%assume_role
                                      {
                                            "aws_access_key_id" : "XXXXXXXXXXXX",
                                            "aws_secret_access_key" : "XXXXXXXXXXXX",
                                            "aws_session_token" : "XXXXXXXXXXXX"
                                       }
----

                                      
## Magic for Spark Sessions (ETL & Streaming)

----
    %worker_type        String        Set the type of instances the session will use as workers. 
    %connections        List          Specify a comma separated list of connections to use in the session.
    %extra_py_files     List          Comma separated list of additional Python files From S3.
    %extra_jars         List          Comma separated list of additional Jars to include in the cluster.
    %spark_conf         String        Specify custom spark configurations for your session. 
                                      E.g. %spark_conf spark.serializer=org.apache.spark.serializer.KryoSerializer
----
                                      
## Magic for Ray Session

----
    %min_workers        Int           The minimum number of workers that are allocated to a Ray session. 
                                      Default: 1.
    %object_memory_head Int           The percentage of free memory on the instance head node after a warm start. 
                                      Minimum: 0. Maximum: 100.
    %object_memory_worker Int         The percentage of free memory on the instance worker nodes after a warm start. 
                                      Minimum: 0. Maximum: 100.
----

## Action Magic

----

    %%sql               String        Run SQL code. All lines after the initial %%sql magic will be passed
                                      as part of the SQL code.  
    %matplot      Matplotlib figure   Visualize your data using the matplotlib library.
                                      E.g. 
                                      import matplotlib.pyplot as plt
                                      # Set X-axis and Y-axis values
                                      x = [5, 2, 8, 4, 9]
                                      y = [10, 4, 8, 5, 2]
                                      # Create a bar chart 
                                      plt.bar(x, y) 
                                      # Show the plot
                                      %matplot plt    
    %plotly            Plotly figure  Visualize your data using the plotly library.
                                      E.g.
                                      import plotly.express as px
                                      #Create a graphical figure
                                      fig = px.line(x=["a","b","c"], y=[1,3,2], title="sample figure")
                                      #Show the figure
                                      %plotly fig

  
                
----



In [2]:
%%configure
{
    "--datalake-formats":"delta",
    "--conf":"spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog --conf spark.delta.logStore.class=org.apache.spark.sql.delta.storage.S3SingleDriverLogStore"
}

The following configurations have been updated: {'--datalake-formats': 'delta', '--conf': 'spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog --conf spark.delta.logStore.class=org.apache.spark.sql.delta.storage.S3SingleDriverLogStore'}


####  Run this cell to set up and start your interactive session.


In [1]:
%idle_timeout 2880
%glue_version 4.0
%worker_type G.1X
%number_of_workers 5

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql import functions as F
from pyspark.sql import types as T
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)

Current idle_timeout is None minutes.
idle_timeout has been set to 2880 minutes.
Setting Glue version to: 4.0
Previous worker type: None
Setting new worker type to: G.1X
Previous number of workers: None
Setting new number of workers to: 5
Trying to create a Glue session for the kernel.
Session Type: glueetl
Worker Type: G.1X
Number of Workers: 5
Idle Timeout: 2880
Session ID: 98fc663c-1c75-4e5b-b976-afdee97f8946
Applying the following default arguments:
--glue_kernel_version 1.0.5
--enable-glue-datacatalog true
--datalake-formats delta
--conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog --conf spark.delta.logStore.class=org.apache.spark.sql.delta.storage.S3SingleDriverLogStore
Waiting for session 98fc663c-1c75-4e5b-b976-afdee97f8946 to get into ready status...
Session 98fc663c-1c75-4e5b-b976-afdee97f8946 has been created.



In [2]:
# Define the variables we will use
database="delta_lake_db"
table_name="table_silver_folder"
additional_options={"path":"s3://bucket-for-requests/data/api_database/silver_folder/"}




# In the gold notebook we study outliers and add some derived variables that contribute information to the analysis.

## We start by loading the data

In [3]:
df = glueContext.create_data_frame.from_catalog(
    database=database,
    table_name=table_name,
    additional_options=additional_options
)



In [4]:
df.printSchema() # Check that the data loaded correctly.

root
 |-- address: string (nullable = true)
 |-- bathrooms_count: long (nullable = true)
 |-- country: string (nullable = true)
 |-- typology: string (nullable = true)
 |-- district: string (nullable = true)
 |-- exterior: boolean (nullable = true)
 |-- floor: string (nullable = true)
 |-- hasLift: boolean (nullable = true)
 |-- municipality: string (nullable = true)
 |-- neighborhood: string (nullable = true)
 |-- newDevelopment: boolean (nullable = true)
 |-- price: double (nullable = true)
 |-- price_per_m2: double (nullable = true)
 |-- propertyCode: string (nullable = true)
 |-- propertyType: string (nullable = true)
 |-- province: string (nullable = true)
 |-- rooms_count: long (nullable = true)
 |-- size_m2: double (nullable = true)
 |-- status: string (nullable = true)


In [5]:
data_type = dict(df.dtypes)['rooms_count']
print(f'Column data type: {data_type}') # Check the column's type via dtypes: printSchema reports it as long, which dtypes names bigint.

Column data type: bigint


In [6]:
df.show(2)

+--------------------+---------------+-------+--------+-----------+--------+-----+-------+------------------+------------+--------------+---------+------------+------------+------------+--------+-----------+-------+--------------+
|             address|bathrooms_count|country|typology|   district|exterior|floor|hasLift|      municipality|neighborhood|newDevelopment|    price|price_per_m2|propertyCode|propertyType|province|rooms_count|size_m2|        status|
+--------------------+---------------+-------+--------+-----------+--------+-----+-------+------------------+------------+--------------+---------+------------+------------+------------+--------+-----------+-------+--------------+
|      calle Talavera|              5|     es|  chalet| Somosaguas|    null| null|   null|Pozuelo de Alarcón|  Somosaguas|         false|2190000.0|      3763.0|   100017563|      chalet|  Madrid|          6|  582.0|          good|
|Lanzarote del lag...|              3|     es|  chalet|Los Molinos|    null|

In [7]:
summaries=df.summary().cache()
summaries.show() # Show the main summary statistics.

+-------+---------------+------------------+-------+--------+----------+------------------+--------------------+--------------------+------------------+------------------+--------------------+------------+--------+------------------+------------------+------+
|summary|        address|   bathrooms_count|country|typology|  district|             floor|        municipality|        neighborhood|             price|      price_per_m2|        propertyCode|propertyType|province|       rooms_count|           size_m2|status|
+-------+---------------+------------------+-------+--------+----------+------------------+--------------------+--------------------+------------------+------------------+--------------------+------------+--------+------------------+------------------+------+
|  count|           6483|              6483|   6483|    6483|      6418|              5259|                6483|                5256|              6483|              6483|                6483|        6483|    6483|      

The following stands out:

* There are still missing values of floor in the dataset, but this is because there are different property types, such as chalets, where the dwelling is understood to be the whole set of floors.
* The price variable has a maximum of 1,850,000,000. That figure suggests the presence of outliers, and the same seems to happen in rooms_count, size_m2 and price_per_m2. We take these to be ultra-luxury properties, but their presence makes the analysis harder.

With that said, the next step is the outlier study.


# Outlier study.
To complete this step we create a function that:
* Computes the 25th and 75th percentiles.
* Computes the interquartile range.
* Defines the outlier bounds.
* Prints the number of outliers.
* Caps the outliers, collapsing them to the bounds at the tails, if that option is chosen.
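The steps above can be sketched in plain Python, independent of Spark, on a small made-up list of prices. The helper names `iqr_bounds` and `cap`, the sample values, and the use of exact quartiles from the standard library are assumptions for illustration; the notebook itself relies on Spark's approximate quantiles. The multiplier `k=3` mirrors the `3 * iqr` used in the notebook's function.

```python
from statistics import quantiles

def iqr_bounds(values, k=3.0):
    # Hypothetical helper: exact quartiles (inclusive method), then IQR fences.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def cap(values, lower, upper):
    # Hypothetical helper: clamp every value into [lower, upper].
    return [min(max(v, lower), upper) for v in values]

# Made-up sample data with one extreme value.
prices = [100, 110, 120, 130, 140, 150, 160, 170, 5000]

lower, upper = iqr_bounds(prices)
outliers = [p for p in prices if p < lower or p > upper]
capped = cap(prices, lower, upper)

print(f"bounds: ({lower}, {upper})")
print(f"outliers: {outliers}")
print(f"capped maximum: {max(capped)}")
```

With these sample values the quartiles are 120 and 160, so the fences are 0 and 280 and only the value 5000 is capped.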

In [8]:
# Build the function using the boxplot threshold criterion (IQR fences).

def calculate_outliers(df, column, mode="check"):
    # Compute the quartiles (approximate, 5% relative error) and the IQR
    quantiles = df.approxQuantile(column, [0.25, 0.75], 0.05)
    iqr = quantiles[1] - quantiles[0]
    lower_bound = quantiles[0] - 3 * iqr
    upper_bound = quantiles[1] + 3 * iqr
    
    # Select the outliers
    outliers = df.filter((F.col(column) < lower_bound) | (F.col(column) > upper_bound))
    
    # In "check" mode, only print the number of outliers
    if mode == "check":
        print(f"Outliers in {column}: {outliers.count()}")
    
    # In "cap" mode, print the number of outliers and cap them at the bounds.
    elif mode == "cap":
        print(f"Replacing {outliers.count()} values in {column}")
        df = df.withColumn(
            column,
            F.when(F.col(column) < lower_bound, lower_bound)
             .when(F.col(column) > upper_bound, upper_bound)
             .otherwise(F.col(column))
        )
        outliers_after_capping = df.filter((F.col(column) < lower_bound) | (F.col(column) > upper_bound))
        print(f"Outliers remaining after capping: {outliers_after_capping.count()}")
    
    return df

# The usual multiplier is 1.5 times the IQR, but we use 3 times the IQR so that only extreme outliers are flagged.




### We create a function that applies the previous function to every numeric column.

In [9]:
def apply_outliers_to_numeric(df, mode="check"):
    # Identify the numeric columns
    numerical_cols = [c for c, t in df.dtypes if t in ('int', 'double', 'float', 'long', 'bigint')]
    
    # Apply calculate_outliers to each column in the list
    for col in numerical_cols:
        df=calculate_outliers(df, col, mode)
    
    return df




In [10]:
df_without_outliers=apply_outliers_to_numeric(df, mode="cap")

Replacing 31 values in bathrooms_count
Outliers remaining after capping: 0
Replacing 196 values in price
Outliers remaining after capping: 0
Replacing 10 values in price_per_m2
Outliers remaining after capping: 0
Replacing 15 values in rooms_count
Outliers remaining after capping: 0
Replacing 318 values in size_m2
Outliers remaining after capping: 0


In [11]:
df_without_outliers.summary().show() # Check that the data have been collapsed at the tails.

+-------+---------------+------------------+-------+--------+----------+------------------+--------------------+--------------------+-----------------+------------------+--------------------+------------+--------+------------------+------------------+------+
|summary|        address|   bathrooms_count|country|typology|  district|             floor|        municipality|        neighborhood|            price|      price_per_m2|        propertyCode|propertyType|province|       rooms_count|           size_m2|status|
+-------+---------------+------------------+-------+--------+----------+------------------+--------------------+--------------------+-----------------+------------------+--------------------+------------+--------+------------------+------------------+------+
|  count|           6483|              6483|   6483|    6483|      6418|              5259|                6483|                5256|             6483|              6483|                6483|        6483|    6483|          

The results show that the data have indeed been collapsed at the tails. For example, the maximum price of a property is now €3,653,000. 

# We now do feature engineering to add information to the analysis and to the future prediction model.

In [12]:
# Feature engineering
df_price_per_district=df_without_outliers.groupBy(F.col("district"))\
                                        .agg(F.mean("price")\
                                        .alias("avg_price_per_district"))




In [13]:
df_price_per_district.show()

+--------------------+----------------------+
|            district|avg_price_per_district|
+--------------------+----------------------+
|           Chamartín|           1658541.416|
|       Zona Estación|            1193608.04|
|           Las Lomas|     2252090.909090909|
|       Casco Antiguo|     368897.8333333333|
|            El Burgo|    515714.28571428574|
|Pol. Industrial n...|              195300.0|
|         Laguna Park|     296466.3333333333|
|            Chamberí|    1240040.1756756757|
|              Tetuán|      568637.182572614|
|Yucatán- Las Corn...|    1695666.6666666667|
|          Miramadrid|     707592.3076923077|
|             Bonanza|    1183777.7777777778|
|Centro - Ayuntami...|              203262.5|
|               Reyes|    194743.22222222222|
|El Arroyo - La Fu...|              197360.0|
|         Universidad|              222000.0|
|            Noroeste|              213462.5|
|Encinar de los Reyes|     1413482.142857143|
|    Derechos Humanos|            

In [14]:
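# Note: this joins the original df, so the capped values in df_without_outliers only feed the district averages.
df_joined=df.join(df_price_per_district, on="district", how="left")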
df_joined.show()

+--------------------+--------------------+---------------+-------+--------+--------+-----+-------+-------------------+--------------------+--------------+---------+------------+------------+------------+--------+-----------+-------+--------------+----------------------+
|            district|             address|bathrooms_count|country|typology|exterior|floor|hasLift|       municipality|        neighborhood|newDevelopment|    price|price_per_m2|propertyCode|propertyType|province|rooms_count|size_m2|        status|avg_price_per_district|
+--------------------+--------------------+---------------+-------+--------+--------+-----+-------+-------------------+--------------------+--------------+---------+------------+------------+------------+--------+-----------+-------+--------------+----------------------+
|          Somosaguas|      calle Talavera|              5|     es|  chalet|    null| null|   null| Pozuelo de Alarcón|          Somosaguas|         false|2190000.0|      3763.0|   100

In [15]:
# Define a tolerance margin
tolerance=0.1
# Create a new column that classifies each property relative to the average price of its district.
df_joined=df_joined.withColumn("price_category",
                               F.when(F.col("price")>F.col("avg_price_per_district")*(1+tolerance), "high")\
                               .when(F.col("price")<F.col("avg_price_per_district")*(1-tolerance), "low")\
                               .otherwise("medium")
                              )
df_joined.show()

+--------------------+--------------------+---------------+-------+--------+--------+-----+-------+-------------------+--------------------+--------------+---------+------------+------------+------------+--------+-----------+-------+--------------+----------------------+--------------+
|            district|             address|bathrooms_count|country|typology|exterior|floor|hasLift|       municipality|        neighborhood|newDevelopment|    price|price_per_m2|propertyCode|propertyType|province|rooms_count|size_m2|        status|avg_price_per_district|price_category|
+--------------------+--------------------+---------------+-------+--------+--------+-----+-------+-------------------+--------------------+--------------+---------+------------+------------+------------+--------+-----------+-------+--------------+----------------------+--------------+
|          Somosaguas|      calle Talavera|              5|     es|  chalet|    null| null|   null| Pozuelo de Alarcón|          Somosaguas
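The two feature engineering steps above (a per-district average price, then a tolerance band around it) can be sketched in plain Python, independent of Spark. The function names `district_means` and `price_category`, the district labels "A" and "B", and the prices are all made up for illustration; only the 10% tolerance and the "high"/"low"/"medium" labels come from the notebook.

```python
from collections import defaultdict

def district_means(listings):
    # Hypothetical helper: accumulate (sum, count) per district, then divide.
    totals = defaultdict(lambda: [0.0, 0])
    for row in listings:
        acc = totals[row["district"]]
        acc[0] += row["price"]
        acc[1] += 1
    return {district: s / n for district, (s, n) in totals.items()}

def price_category(price, district_avg, tolerance=0.1):
    # Same band as the notebook: above avg*(1+tol) is "high",
    # below avg*(1-tol) is "low", anything in between is "medium".
    if price > district_avg * (1 + tolerance):
        return "high"
    if price < district_avg * (1 - tolerance):
        return "low"
    return "medium"

# Made-up listings for illustration.
listings = [
    {"district": "A", "price": 100_000.0},
    {"district": "A", "price": 200_000.0},
    {"district": "A", "price": 300_000.0},
    {"district": "B", "price": 500_000.0},
]

means = district_means(listings)
categories = [price_category(r["price"], means[r["district"]]) for r in listings]
print(categories)
```

One difference worth keeping in mind: in the Spark version, rows whose district is null find no match in the left join, so both comparisons evaluate to null and those rows fall through to "medium".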

### Finally, we need to add a date variable to allow more detailed analyses and, above all, to track how prices evolve in future queries.

To do this, we add the current date, the day on which we cleaned and processed the data.

In [16]:
df_joined=df_joined.withColumn("date", F.current_date()).cache()
df_joined.show(3, vertical=True)

-RECORD 0--------------------------------------
 district               | Somosaguas           
 address                | calle Talavera       
 bathrooms_count        | 5                    
 country                | es                   
 typology               | chalet               
 exterior               | null                 
 floor                  | null                 
 hasLift                | null                 
 municipality           | Pozuelo de Alarcón   
 neighborhood           | Somosaguas           
 newDevelopment         | false                
 price                  | 2190000.0            
 price_per_m2           | 3763.0               
 propertyCode           | 100017563            
 propertyType           | chalet               
 province               | Madrid               
 rooms_count            | 6                    
 size_m2                | 582.0                
 status                 | good                 
 avg_price_per_district | 1852000.0     

In [17]:
df_joined.printSchema() # At this point I notice that the floor column is of type string. I want to cast it to an integer type for consistency.

root
 |-- district: string (nullable = true)
 |-- address: string (nullable = true)
 |-- bathrooms_count: long (nullable = true)
 |-- country: string (nullable = true)
 |-- typology: string (nullable = true)
 |-- exterior: boolean (nullable = true)
 |-- floor: string (nullable = true)
 |-- hasLift: boolean (nullable = true)
 |-- municipality: string (nullable = true)
 |-- neighborhood: string (nullable = true)
 |-- newDevelopment: boolean (nullable = true)
 |-- price: double (nullable = true)
 |-- price_per_m2: double (nullable = true)
 |-- propertyCode: string (nullable = true)
 |-- propertyType: string (nullable = true)
 |-- province: string (nullable = true)
 |-- rooms_count: long (nullable = true)
 |-- size_m2: double (nullable = true)
 |-- status: string (nullable = true)
 |-- avg_price_per_district: double (nullable = true)
 |-- price_category: string (nullable = false)
 |-- date: date (nullable = false)


In [28]:
distinct_floors = df_joined.select("floor").distinct().orderBy("floor") # Explore the column: it contains non-numeric string values.
distinct_floors.show(30)

+-----+
|floor|
+-----+
| null|
|   -1|
|    1|
|   10|
|   11|
|   12|
|   13|
|   14|
|   15|
|   17|
|   18|
|    2|
|   21|
|   22|
|   24|
|    3|
|    4|
|    5|
|    6|
|    7|
|    8|
|    9|
|   bj|
|   en|
|   ss|
|   st|
+-----+


In [29]:
df_floor=df_joined.filter((F.col("floor") =="en") | (F.col("floor")=="ss"))
df_floor.show(30) # Explore the rows where floor takes one of the selected string values.

+-------------------+--------------------+---------------+-------+--------+--------+-----+-------+------------+--------------------+--------------+---------+------------+------------+------------+--------+-----------+-------+------+----------------------+--------------+----------+
|           district|             address|bathrooms_count|country|typology|exterior|floor|hasLift|municipality|        neighborhood|newDevelopment|    price|price_per_m2|propertyCode|propertyType|province|rooms_count|size_m2|status|avg_price_per_district|price_category|      date|
+-------------------+--------------------+---------------+-------+--------+--------+-----+-------+------------+--------------------+--------------+---------+------------+------------+------------+--------+-----------+-------+------+----------------------+--------------+----------+
|Barrio de Salamanca|avenida de Felipe II|              1|     es|    flat|    true|   ss|  false|      Madrid|                Goya|         false| 385000

In [30]:
df_filtered_count=df_joined.filter(F.col("floor").isin("en", "ss", "st", "bj")) \
                           .groupBy("floor", "propertyType") \
                           .count() \
                           .orderBy("floor")

df_filtered_count.show() # Look at the distribution of property types.

+-----+------------+-----+
|floor|propertyType|count|
+-----+------------+-----+
|   bj|        flat|  628|
|   bj|      studio|   34|
|   bj|      duplex|   71|
|   en|      duplex|    3|
|   en|      studio|    1|
|   en|        flat|   60|
|   ss|        flat|   23|
|   ss|      studio|    1|
|   st|        flat|   13|
+-----+------------+-----+


Given these results, we can assume these dwellings are either flats for one person (studios), regular flats, or two-storey flats (duplexes).

## It would be logical to assume that "bj" means ground floor (bajo), but the other abbreviations are not entirely clear. "en" may mean mezzanine (entreplanta) and "ss" or "st" basement (sótano). 
## However, since we cannot be sure, I choose to replace bj with 0 (as it is the most frequent label) and the remaining values with the mode
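The replacement rule described above can be sketched in plain Python, independent of Spark. The function name `normalize_floor` and the constants are hypothetical; the value 3 is used as the mode because that is the value the notebook substitutes, and returning `None` for any other unexpected label is an assumption of this sketch.

```python
# Assumed constants for illustration, taken from the rule described above.
GROUND_FLOOR_LABEL = "bj"          # mapped to floor 0
UNCLEAR_LABELS = {"en", "ss", "st"}
MODE_FLOOR = 3                      # value the notebook uses as the mode

def normalize_floor(raw):
    # Missing values stay missing.
    if raw is None:
        return None
    if raw == GROUND_FLOOR_LABEL:
        return 0
    if raw in UNCLEAR_LABELS:
        return MODE_FLOOR
    try:
        return int(raw)
    except ValueError:
        # Assumption: any other non-numeric label is treated as missing.
        return None

floors = [None, "-1", "bj", "en", "ss", "st", "12"]
normalized = [normalize_floor(f) for f in floors]
print(normalized)
```

Doing the label mapping before the integer conversion matters: converting first would turn every non-numeric label into a missing value and lose the "bj" information.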

In [31]:
df_joined=df_joined.withColumn(
    "floor",
    F.when(F.col("floor")=="bj", 0) # Replace "bj" values with 0.
     .when(F.col("floor").isin("en", "ss", "st"), 3) # Replace the unclear values with the mode.
     .otherwise(F.col("floor")) 
)




In [35]:
check=df_joined.filter(F.col("floor").isin("en", "ss", "st", "bj")) \
                           .groupBy("floor", "propertyType") \
                           .count() \
                           .orderBy("floor")
check.show() # Check that the transformations worked.

+-----+------------+-----+
|floor|propertyType|count|
+-----+------------+-----+
+-----+------------+-----+


In [36]:
df_casted=df_joined.withColumn("floor", F.col("floor").cast(T.IntegerType())) # Cast the floor column to integer type.




In [37]:
df_casted.printSchema()

root
 |-- district: string (nullable = true)
 |-- address: string (nullable = true)
 |-- bathrooms_count: long (nullable = true)
 |-- country: string (nullable = true)
 |-- typology: string (nullable = true)
 |-- exterior: boolean (nullable = true)
 |-- floor: integer (nullable = true)
 |-- hasLift: boolean (nullable = true)
 |-- municipality: string (nullable = true)
 |-- neighborhood: string (nullable = true)
 |-- newDevelopment: boolean (nullable = true)
 |-- price: double (nullable = true)
 |-- price_per_m2: double (nullable = true)
 |-- propertyCode: string (nullable = true)
 |-- propertyType: string (nullable = true)
 |-- province: string (nullable = true)
 |-- rooms_count: long (nullable = true)
 |-- size_m2: double (nullable = true)
 |-- status: string (nullable = true)
 |-- avg_price_per_district: double (nullable = true)
 |-- price_category: string (nullable = false)
 |-- date: date (nullable = false)


In [None]:
df_casted.describe().show()

# Once these transformations are done, we save the results to S3 and to the tables ready for serving. 
### We also save the results to a CSV folder so we can download the file, apply the prediction model to it, and serve the data through Tableau.

In [None]:
gold_path="s3://bucket-for-requests/data/api_database/gold_folder/"
df_casted.write\
        .format("delta")\
        .mode("overwrite")\
        .save(gold_path)

In [None]:
gold_csv_path="s3://bucket-for-requests/data/api_database/gold_folder_csv/"
df_casted.write.csv(gold_csv_path, mode="overwrite", header=True)

In [None]:
job.commit()