# Machine Learning in PySpark (Spark MLlib)
In this notebook we use the same data set as in the data processing notebook; this time the goal is to build a model that predicts rental prices in Lausanne.

## Import libraries

Start a Spark session:

## 1. Import data
`path = '../../data/lausanne_rental.csv'`

## 2. Preprocessing

### 2.1 Insight
Look at the first 5 rows and compute summary statistics. Do you observe anything strange in the summary statistics?

### 2.2 Transformations, modifications

**Converting data types** (A)

Convert the surface area, number of rooms, and rent columns to appropriate data types.

In [None]:
df = ...  # complete

**Summary Statistics again**

Just out of curiosity, can you have a look at the apartment(s) with the highest `Rent`?

**Dealing with the outliers**

Can you detect the outliers in the columns `SurfaceArea`, `NumRooms`, and `Rent`? How about using the [Tukey method](https://en.wikipedia.org/wiki/Outlier#Tukey's_fences)?
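As a reminder of what the Tukey method computes, here is a minimal plain-Python sketch that runs without Spark. The function name `tukey_fences` and the rent values are made up for illustration; the multiplier `k = 1.5` is the conventional choice, not something this notebook prescribes.

```python
import statistics

def tukey_fences(values, k=1.5):
    """Return the (lower, upper) Tukey fences for a list of numbers."""
    # quantiles(n=4) returns the three quartile cut points: Q1, median, Q3
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1  # interquartile range
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical monthly rents; the last value is deliberately extreme
rents = [1200, 1350, 1500, 1650, 1800, 2000, 2100, 25000]
lower, upper = tukey_fences(rents)

# Anything outside the fences is flagged as an outlier
outliers = [r for r in rents if r < lower or r > upper]
print(lower, upper, outliers)
```

Values below the lower fence or above the upper fence are treated as outliers; a larger `k` makes the rule more tolerant.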

Can you describe what the following function is doing, considering the Tukey method?

In [42]:
def get_whiskers(df, colname, WHIS):
    q1 = df.approxQuantile(colname, [0.25], 0.01)[0]
    q3 = df.approxQuantile(colname, [0.75], 0.01)[0]
    iqn = q3 - q1
    return [q1 - WHIS*iqn, q3 + WHIS*iqn]

`SurfaceArea`

`NumRooms`

`Rent`

Can you think of another variable that you could define and use to detect outliers?

**Missing data**

Print the number of missing values in each column.
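The counting logic itself is simple and can be sketched in plain Python before translating it to Spark. The rows below are invented, and `count_missing` is a hypothetical helper; only the column names come from this notebook.

```python
def count_missing(rows, columns):
    """Count, for each column, how many rows have no value (None)."""
    return {
        col: sum(1 for row in rows if row.get(col) is None)
        for col in columns
    }

# Hypothetical rows standing in for the rental data
rows = [
    {"Rent": 1500.0, "NumRooms": 2.0, "SurfaceArea": 55.0},
    {"Rent": None, "NumRooms": 3.5, "SurfaceArea": None},
    {"Rent": 2100.0, "NumRooms": None, "SurfaceArea": None},
]

missing = count_missing(rows, ["Rent", "NumRooms", "SurfaceArea"])
print(missing)
```

In Spark the same idea is usually expressed as one aggregate per column over a null check, so that all counts are computed in a single pass.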

**Dealing with the missing data**

Question: what would you do with the missing data? What would you do with `NumRooms`? With `Bookmark`? With `Rent`?

`Bookmark` (A)

`Rent` (A)

`NumRooms` (A)

`SurfaceArea` (A)

**Check missing data again**

## 3. Building prediction model

### 3.1 Feature engineering

Can you get `ZipCode` from the `Address`?
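One possible approach is a regular expression. The sketch below is plain Python and assumes addresses of the form `"Street 12, 1004 Lausanne"`, that is, a four-digit postal code followed by the town name; check this against the real `Address` values before relying on it. The helper name `extract_zip` is hypothetical.

```python
import re

# Assumption: the zip code is a standalone 4-digit number followed by
# whitespace and then a letter (the start of the town name).
ZIP_PATTERN = re.compile(r"\b(\d{4})\s+[^\W\d]")

def extract_zip(address):
    """Return the 4-digit zip code found in the address, or None."""
    match = ZIP_PATTERN.search(address)
    return match.group(1) if match else None

print(extract_zip("Avenue de Cour 12, 1004 Lausanne"))
print(extract_zip("address without a postal code"))
```

A similar pattern can be passed to `regexp_extract` from `pyspark.sql.functions` to create the column directly on the DataFrame.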

Can you get average rental price per zip code? 

Can you get average rental price per zip code per number of rooms? For instance, I would like to know what is the average price for a 2-room apartment in the zip code of 1004. 
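Both questions are group-by aggregations that differ only in the grouping key. The plain-Python sketch below shows the logic on invented rows; `average_rent_by` is a hypothetical helper, and in Spark the same result would typically come from `groupBy` followed by an average aggregate.

```python
from collections import defaultdict

def average_rent_by(rows, keys):
    """Average the Rent of rows grouped by the given key columns."""
    totals = defaultdict(lambda: [0.0, 0])  # group -> [sum of rents, count]
    for row in rows:
        group = tuple(row[k] for k in keys)
        totals[group][0] += row["Rent"]
        totals[group][1] += 1
    return {group: total / n for group, (total, n) in totals.items()}

# Hypothetical apartments; the values are made up
rows = [
    {"ZipCode": "1004", "NumRooms": 2.0, "Rent": 1500.0},
    {"ZipCode": "1004", "NumRooms": 2.0, "Rent": 1700.0},
    {"ZipCode": "1004", "NumRooms": 3.5, "Rent": 2500.0},
    {"ZipCode": "1007", "NumRooms": 2.0, "Rent": 1800.0},
]

per_zip = average_rent_by(rows, ["ZipCode"])
per_zip_and_rooms = average_rent_by(rows, ["ZipCode", "NumRooms"])
print(per_zip)
print(per_zip_and_rooms)
```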

**One-hot encoding?**

Question: when do we need one-hot encoding?
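To make the term concrete, here is a plain-Python sketch of what one-hot encoding produces for a categorical column. The helper `one_hot` and the zip codes are illustrative only. Spark's own encoders behave a little differently: by default `StringIndexer` orders categories by frequency, and the one-hot encoder drops the last category and returns sparse vectors.

```python
def one_hot(values):
    """Map each categorical value to a 0/1 vector with a single 1."""
    categories = sorted(set(values))  # fixed, reproducible category order
    position = {cat: i for i, cat in enumerate(categories)}
    vectors = []
    for value in values:
        vector = [0] * len(categories)
        vector[position[value]] = 1
        vectors.append(vector)
    return categories, vectors

# Hypothetical zip codes
categories, vectors = one_hot(["1004", "1007", "1004", "1012"])
print(categories)
print(vectors)
```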

In [None]:
from pyspark.ml.feature import StringIndexer
# Spark 2.3/2.4 name; from Spark 3.0 on this class is called OneHotEncoder
from pyspark.ml.feature import OneHotEncoderEstimator

In [76]:
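indexer = StringIndexer... # complete (constructor only)
# Both classes are estimators: fit on the data first, then transform
df = indexer.fit(df).transform(df)
encoder = OneHotEncoderEstimator.. # complete (constructor only)
df = encoder.fit(df).transform(df)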

### 3.2 Prediction

In [None]:
from pyspark.ml.regression import LinearRegression, RandomForestRegressor, GBTRegressor
from pyspark.ml import Pipeline
from pyspark.sql.types import *
from pyspark.ml.feature import VectorAssembler

**Train test split**

### 3.3 Evaluation

In [None]:
from pyspark.ml.evaluation import RegressionEvaluator

**Relative error:**

Calculate the relative error of prediction on the test set. 
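The notebook does not define relative error explicitly. The sketch below assumes the common definition, the absolute prediction error divided by the true value, averaged over all rows. The helper name and the numbers are hypothetical.

```python
def mean_relative_error(actual, predicted):
    """Mean of |predicted - actual| / |actual| over all pairs.

    Assumes no actual value is zero.
    """
    errors = [abs(p - a) / abs(a) for a, p in zip(actual, predicted)]
    return sum(errors) / len(errors)

# Hypothetical true rents and model predictions
actual = [1000.0, 2000.0]
predicted = [1100.0, 1800.0]
error = mean_relative_error(actual, predicted)
print(error)
```

On a Spark DataFrame the same quantity can be computed by adding a column with the per-row relative error and then averaging that column.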

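### 3.4 Cross-validation and grid search

Some hints on how to do cross-validation and grid search with PySpark. The cells below assume that `rf`, `pipeline`, `evaluator`, `train` and `test` were defined in the previous sections.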

In [299]:
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

In [439]:
paramGrid = ParamGridBuilder().addGrid(rf.maxDepth, [5, 6]).build()

In [440]:
crossval = CrossValidator(
    estimator=pipeline,
    estimatorParamMaps=paramGrid,
    evaluator=evaluator,
    numFolds=4,
)

In [441]:
cvModel = crossval.fit(train)

In [442]:
pred = cvModel.transform(test)

In [None]:
evaluator.evaluate(pred)
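To get a feeling for the cost of the search above, the plain-Python sketch below enumerates the combinations of parameter setting and fold for a grid of two `maxDepth` values and 4 folds. The round-robin split and the row count of 12 are simplifications for illustration; Spark assigns rows to folds randomly.

```python
from itertools import product

def k_fold_indices(n_rows, num_folds):
    """Yield (training, validation) row indices for each fold."""
    # Round-robin assignment of rows to folds (illustration only)
    folds = [list(range(i, n_rows, num_folds)) for i in range(num_folds)]
    for i, validation in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation

# Same grid as in the cells above: two candidate depths
param_grid = [{"maxDepth": depth} for depth in (5, 6)]
splits = list(k_fold_indices(n_rows=12, num_folds=4))

# One model fit per (parameter setting, fold) pair
fits = [
    (params, len(training), len(validation))
    for params, (training, validation) in product(param_grid, splits)
]
print(len(fits))
```

Every parameter setting is fitted once per fold, so the number of fits grows with the product of grid size and `numFolds`; the best setting is then refitted on the full training set.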