# Use case with Spark

## Prerequisites

Start your HDFS cluster :

Start your Spark cluster :

Check that all of the following containers are up:

smaster, sworker1, sworker2 


### Data download

This lab is based on a dataset from data.gouv.fr, which you uploaded to HDFS during the HDFS lab.

If you have not done so, run the following cell:


In [None]:
!$HADOOP_HOME/bin/hdfs dfs -mkdir -p /data/permanent/rawdata/trade/
!$HADOOP_HOME/bin/hdfs dfs -put /home/jovyan/data/2021-mutations-immobilieres.csv /data/permanent/rawdata/trade/

The dataset covers all real estate transfers in France in 2021.

Here is the description of each column:

```
id_mutation : Transaction identifier (not stable, used to group rows)
date_mutation : Transaction date in ISO-8601 format (YYYY-MM-DD)
numero_disposition : Disposition number
nature_mutation : Nature of the transaction
valeur_fonciere : Property value (decimal separator = point)
adresse_numero : Street number of the address
adresse_suffixe : Suffix of the street number (B, T, Q)
adresse_code_voie : FANTOIR street code (4 characters)
adresse_nom_voie : Street name of the address
code_postal : Postal code (5 characters)
code_commune : INSEE municipality code (5 characters)
nom_commune : Municipality name (with accents)
ancien_code_commune : Former INSEE municipality code (if different at the time of the transaction)
ancien_nom_commune : Former municipality name (if different at the time of the transaction)
code_departement : INSEE department code (2 or 3 characters)
id_parcelle : Parcel identifier (14 characters)
ancien_id_parcelle : Former parcel identifier (if different at the time of the transaction)
numero_volume : Volume number
lot_1_numero : Number of lot 1
lot_1_surface_carrez : Carrez surface area of lot 1
lot_2_numero : Number of lot 2
lot_2_surface_carrez : Carrez surface area of lot 2
lot_3_numero : Number of lot 3
lot_3_surface_carrez : Carrez surface area of lot 3
lot_4_numero : Number of lot 4
lot_4_surface_carrez : Carrez surface area of lot 4
lot_5_numero : Number of lot 5
lot_5_surface_carrez : Carrez surface area of lot 5
nombre_lots : Number of lots
code_type_local : Premises type code
type_local : Premises type label
surface_reelle_bati : Actual built surface area
nombre_pieces_principales : Number of main rooms
code_nature_culture : Land-use type code
nature_culture : Land-use type label
code_nature_culture_speciale : Special land-use type code
nature_culture_speciale : Special land-use type label
surface_terrain : Land surface area
longitude : Longitude of the center of the parcel concerned (WGS-84)
latitude : Latitude of the center of the parcel concerned (WGS-84)
```

You are now ready to analyze the data.

The following documentation may help you with this homework:

https://sparkbyexamples.com/pyspark-tutorial/


## 1 - Dataframe

Open a connection to your Spark cluster :

In [None]:
# Do not forget to close the connection at the end of the lab
# spark.stop()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("spark://smaster:7077").appName("TPDF02").getOrCreate()

Check your connection :

In [None]:
spark

<font color='red'>Q1 - Load the dataset into the mutationsDF dataframe, inferring its schema.
    
Note that the schema should be inferred automatically when the dataframe is created.  
The dataset is gzip-compressed (.gz), but Spark can read it directly, just like a plain CSV file.
</font>

In [None]:
#Insert your code here

<font color='red'>Q2 - Display the schema of the mutationsDF dataframe.
</font>

In [None]:
#Insert your code here

<font color='red'>Q3 - Display one row of your mutationsDF dataframe.
    </font>

In [None]:
#Insert your code here

<font color='red'>Q3b - From the mutationsDF dataframe, create a new dataframe CleanMutationDF that satisfies the following constraints:
* type_local is not null
* type_local is 'Maison' or 'Appartement'
* only the following attributes are selected: id_mutation, nature_mutation, type_local, date_mutation, valeur_fonciere


</font>

In [None]:
#Insert your code here

<font color='red'>Q4 - Save your CleanMutationDF dataframe as a Parquet file (/dataspark/mutations-immobilieres.parquet).
</font>

In [None]:
#Insert your code here

If you need to remove the Parquet file, run the following cell:

In [None]:
!rm -rf /dataspark/mutations-immobilieres.parquet

<font color='red'>Q5 - Load the Parquet file ('/dataspark/mutations-immobilieres.parquet') into a mutationsPDF dataframe.</font>

In [None]:
#Insert your code here

<font color='red'>Q6 - How many rows does mutationsPDF contain?</font>

In [None]:
#Insert your code here



<div class="alert alert-block alert-info">
 Please note that there may be several rows for the same transaction. All rows that belong to a single transaction share the same identifier (i.e. the same value) in the id_mutation column. For instance, there are two rows with the value 2021-887 in the id_mutation column.
</div>

<font color='red'>Q7 - Select all rows with id_mutation 2021-15481 into the singleTrDF dataframe.
</font>


In [None]:
#Insert your code here




<font color='red'>Q8 - The singleTrDF dataframe contains 3 rows for the same transaction. How could you filter out the duplicate rows?
</font>

In [None]:
#Insert your code here

<font color='red'>Q9 - From the mutationsPDF dataframe, create a new dataframe mutationDistinctPDF that satisfies the following constraints:
* no duplicate rows
* only rows where nature_mutation = 'Vente'
</font>

In [None]:
#Insert your code here

<font color='red'>Q9b - From the mutationDistinctPDF dataframe, compare the sales amount (valeur_fonciere) of 'Maison' (house) and 'Appartement' (apartment) by month.
</font>


<div class="alert alert-block alert-info">
Note that you can use the month function from the pyspark.sql.functions module to extract the month value from a date.
</div>


In [None]:
from pyspark.sql.functions import month

#Insert your code here


<font color='red'>Q10 - Determine the month in which the sales amount is highest for 'Maison' and for 'Appartement'.
</font>

In [None]:
#Insert your code here

## 2 - SQL

In this exercise, you will query the dataset using SQL.


<font color='red'>Q11 - From the mutationDistinctPDF dataframe, create a view named mutationSalesV.</font>

In [None]:
#Insert your code here

<font color='red'>Q12 - From the mutationSalesV view, compare the sales amount (valeur_fonciere) of 'Maison' (house) and 'Appartement' (apartment) by month using SQL.

</font>

In [None]:
#Insert your code here


Close your Spark connection.

In [None]:
spark.stop()