# Project

### Deadline: 26th of January

---

## Scope & Ground Rules


### Part 1 - 50% | Scope

It is composed of `5` small assignments with guidelines. The score is **evenly distributed across all questions** (each question accounts for 10% of the final score).

### Part 2 - 50% | Scope

Part 2 entails a project using the same dataset. The goal is to demonstrate your data preprocessing skills. As the output of this project, you should deliver the **notebook with the code you have written**.

Please apply at least **6 transformations** to the feature set you have at hand (or to the features for which the transformation makes sense). Each transformation should be accompanied by an explanation. Last but not least, compare the benefit of each transformation against the baseline score or the last best score.

Regarding the variables you have to use throughout Part 2, there are 6 in total: you are free to choose 3 of them, while I have picked the remaining 3 for you:
* YearBuilt
* LotFrontage
* MasVnrType

Make your baseline progressive, i.e. treat the score from the previous transformation as the new baseline if it shows an improvement. Example:

    Baseline - subset of transformations [None]  = 60% accuracy
    Iteration #1 - subset of transformations [A]     = 64% accuracy -> new baseline
    Iteration #2 - subset of transformations [A,B]   = 63% accuracy    (e.g. feature scaling or imputation; at least 6 transformations in total)
    Iteration #3 - subset of transformations [A,B,C] = 68% accuracy -> new baseline
    ...
    Iteration #N - subset of transformations [A,B,C,..., N]
    (where A, B, C are each a transformation that uses 1 or N features.)

Transformation example: encoding `color` & `country` with `One-Hot-Encoding`.

The `target` variable should be used to compute the accuracy (please use the `target` you created in exercise 2.1 of Part 1).
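
The progressive-baseline bookkeeping in the example above can be sketched as follows. This is a minimal illustration, not a required implementation: the function name `track_progressive_baseline`, the `evaluate` callable, and the hard-coded scores (copied from the example) are assumptions made for the sketch. In the notebook, `evaluate` would apply the listed transformations, fit a model, and return its accuracy against `target`.

```python
from typing import Callable, List, Sequence, Tuple

History = List[Tuple[Tuple[str, ...], float, bool]]


def track_progressive_baseline(
    transformations: Sequence[str],
    evaluate: Callable[[Tuple[str, ...]], float],
) -> History:
    """Score cumulative subsets and flag each one that beats the current baseline."""
    baseline = evaluate(())  # score with no transformations applied
    history: History = [((), baseline, False)]
    applied: Tuple[str, ...] = ()
    for name in transformations:
        applied = applied + (name,)  # subsets grow cumulatively: [A], [A,B], ...
        score = evaluate(applied)
        improved = score > baseline
        if improved:
            baseline = score  # only an improvement becomes the new baseline
        history.append((applied, score, improved))
    return history


# Hypothetical scores copied from the example above (not real results).
example_scores = {
    (): 0.60,
    ("A",): 0.64,
    ("A", "B"): 0.63,
    ("A", "B", "C"): 0.68,
}

history = track_progressive_baseline(["A", "B", "C"], example_scores.__getitem__)
for subset, score, is_new_baseline in history:
    flag = "-> new baseline" if is_new_baseline else ""
    print(list(subset), f"{score:.0%}", flag)
```

Note that a transformation which does not improve the score (B in the example) still stays in the cumulative subset; only the reference score is left unchanged.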


**If you have any questions, please contact me via Slack or email.**


---

**IMPORTANT NOTES to keep in mind**

a) Code readability is taken into account in the evaluation, so please keep your code simple and readable, and explain your operations where necessary.

b) Make sure that the evaluator can re-run the notebook from the beginning, i.e. before you deliver the assignment, go to the menu bar at the top of your notebook -> `Kernel` -> `Restart & Run all`. Validate that all outputs are as you expect.

----


## How can I deliver the project?



**Email contact**

Please email me at `antony_costa_1995@hotmail.com` with the following subject:

`[MPPD Project] <Your Name>`


**Deliverable**

1) Notebook with the code used for both parts, 1 and 2.

2) The notebook **NAME** should follow the notation:

```
 <Your Name>_MPPD_project.ipynb
```

E.g. `AntonyCosta_MPPD_project.ipynb`

---

## Setup

Feel free to add any Python packages you need.

---

# Part 1

## 1- Load Data

1.1- Load **house_prices_final_project.csv** into a Pandas DataFrame. The `data_description.txt` file describes each column.

In [1]:
import pandas as pd 

In [2]:

df = pd.read_csv('data/house_prices_final_project.csv')

In [3]:
df.head(5)  

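| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 12 | 2008 | WD | Normal | 250000 |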


1.2- Print the dataset's total number of `observations` and `variables`

In [4]:
observations, variables = df.shape

observations, variables

(1460, 81)

---

### Please find below the subset of columns we are going to consider for the rest of the assignment

In [1]:
columns_list = ['FullBath',
                'TotRmsAbvGrd',
                'Fireplaces',
                'GarageYrBlt',
                'GarageCars',
                'GarageArea',
                'LotFrontage',
                'WoodDeckSF',
                'OpenPorchSF',
                'SaleType',
                'SaleCondition',
                'SalePrice']

1.3- Create a new dataframe which is a subset of the original dataframe based on the columns listed above.

In [5]:
# Define the subset of columns
columns_list = [
    'FullBath', 'TotRmsAbvGrd', 'Fireplaces', 'GarageYrBlt', 'GarageCars',
    'GarageArea', 'LotFrontage', 'WoodDeckSF', 'OpenPorchSF',
    'SaleType', 'SaleCondition', 'SalePrice'
]

# Create a DataFrame with only the selected columns
df_subset = df[columns_list]

# Show the first rows to confirm
df_subset.head()

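| | FullBath | TotRmsAbvGrd | Fireplaces | GarageYrBlt | GarageCars | GarageArea | LotFrontage | WoodDeckSF | OpenPorchSF | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 8 | 0 | 2003.0 | 2 | 548 | 65.0 | 0 | 61 | WD | Normal | 208500 |
| 1 | 2 | 6 | 1 | 1976.0 | 2 | 460 | 80.0 | 298 | 0 | WD | Normal | 181500 |
| 2 | 2 | 6 | 1 | 2001.0 | 2 | 608 | 68.0 | 0 | 42 | WD | Normal | 223500 |
| 3 | 1 | 7 | 1 | 1998.0 | 3 | 642 | 60.0 | 0 | 35 | WD | Abnorml | 140000 |
| 4 | 2 | 9 | 1 | 2000.0 | 3 | 836 | 84.0 | 192 | 84 | WD | Normal | 250000 |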


## 2- Creating Labels

2.1- Create the `target` column based on `SalePrice`. The split should be done at the median value to create 2 new buckets. The `Min->Median` bucket should be assigned the value `0`, while the other bucket (`Median->Max`) should be assigned the value `1`.



Note: you are free to decide the bucket boundaries.

Use bins.


In [7]:
# The median of SalePrice is the boundary between the two buckets
median_price = df_subset['SalePrice'].median()
# (min - 1, median] -> 0 and (median, max] -> 1; the lower edge is shifted so the minimum price is included
bins = [df_subset['SalePrice'].min() - 1, median_price, df_subset['SalePrice'].max()]
labels = [0, 1]

# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
df_subset = df_subset.assign(target=pd.cut(df_subset['SalePrice'], bins=bins, labels=labels).astype(int))


## 3- Handling Missing Values

3.1- List the number of missing values per column

3.2- Take care of the missing values in the column `LotFrontage`
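
One possible way to approach 3.1 and 3.2 is sketched below on a small made-up frame rather than the real dataset. Median imputation is only an assumed strategy chosen for illustration; dropping rows or using another statistic may suit the data better, and the variable names (`toy`, `missing_per_column`, `lot_frontage_median`) are hypothetical.

```python
import pandas as pd

# Made-up rows for illustration only (not taken from the real dataset)
toy = pd.DataFrame({
    "LotFrontage": [65.0, None, 80.0, None, 60.0],
    "GarageArea": [548, 460, 608, 642, 836],
})

# 3.1: number of missing values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# 3.2: fill the gaps with the column median (an assumed strategy);
# median() ignores missing values by default
lot_frontage_median = toy["LotFrontage"].median()
toy["LotFrontage"] = toy["LotFrontage"].fillna(lot_frontage_median)
```

Whatever strategy is chosen, re-running `isna().sum()` afterwards is a quick check that no gaps remain in the column.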

## 4- Handling Categorical Data

4.1- Split the categorical features into a `df_categorical` dataframe

4.2- Apply OHE to `SaleType`
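
A minimal sketch of 4.1 and 4.2 on made-up rows, assuming that "categorical" means every non-numeric column. It uses `select_dtypes` to split the frame and `pd.get_dummies` for the one-hot encoding; the frame `toy` and the name `sale_type_ohe` are hypothetical.

```python
import pandas as pd

# Made-up rows for illustration only (not taken from the real dataset)
toy = pd.DataFrame({
    "SaleType": ["WD", "New", "WD", "COD"],
    "SaleCondition": ["Normal", "Partial", "Normal", "Abnorml"],
    "GarageArea": [548, 460, 608, 642],
})

# 4.1: keep only the non-numeric (categorical) columns
df_categorical = toy.select_dtypes(exclude="number")

# 4.2: one indicator column per distinct SaleType value
sale_type_ohe = pd.get_dummies(df_categorical["SaleType"], prefix="SaleType", dtype=int)
print(sale_type_ohe)
```

Each row of `sale_type_ohe` contains exactly one `1`, in the column that matches that row's original `SaleType` value.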

## 5- Feature Scaling

5.1- Apply feature scaling to the variable `GarageArea`. Make sure that the new range falls between `-1/3` and `3`.
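
Scaling to an arbitrary range `[a, b]` can be done with min-max scaling: `x' = a + (x - min) * (b - a) / (max - min)`. The sketch below applies that formula with `a = -1/3` and `b = 3` to a few made-up `GarageArea` values; the variable names are hypothetical and this is only one way to meet the requirement.

```python
import pandas as pd

# Made-up values for illustration only (not taken from the real dataset)
garage_area = pd.Series([548, 460, 608, 642, 836], name="GarageArea")

new_min, new_max = -1 / 3, 3

# Min-max scaling: the smallest value maps to new_min, the largest to new_max
old_min, old_max = garage_area.min(), garage_area.max()
garage_area_scaled = new_min + (garage_area - old_min) * (new_max - new_min) / (old_max - old_min)

print(garage_area_scaled.min(), garage_area_scaled.max())
```

If scikit-learn is available, `MinMaxScaler(feature_range=(-1/3, 3))` applies the same formula.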

---

## End of Part 1

---

# Part 2

Remark:
* you shall use 6 variables for the assessment
* 3 out of the 6 features are designated in the section at the top of the notebook
* the 3 remaining variables are up to you to choose
* you can consider any variable from the original dataset during this assessment

Above all, take this opportunity to practice :)

**Good luck!**