## DAB200 -- Graded Lab 1

In this lab, you will gain some experience in **denoising** a dataset in the context of a specific objective. 

**Overall Objective**: Create a model that predicts rent prices as well as possible for typical New York City apartments.

**Data set**: make sure you use the data with the same number as your group number!

| Group | Data set |
| :-: | :-: |
| 1 | rent_1.csv |
| 2 | rent_2.csv |
| etc. | etc. |

**Important Notes:**
 - This lab is more open-ended so be prepared to think on your own, in a logical way, in order to solve the problem at hand
     - You should be able to support any decision you make with logical evidence
 - The data looks like the data we have been using in class but it has other **surprises**
     - Be sure to investigate the data in a way that allows you to discover all these surprises
 - Use [Chapter 5](https://mlbook.explained.ai/prep.html) of the textbook as a **guide**, except:
     - you only need to use **random forest** models;
     - exclude Section 5.5; 
 - Code submitted for this lab should be:
     - error free
         - to make sure this is the case, before submitting, close all Jupyter notebooks, exit Anaconda, reload the lab notebook and execute all cells
     - final code
         - this means that I don't want to see every piece of code you try as you work through this lab but only the final code; only the code that fulfills the objective
 - Use the **out-of-bag score** to evaluate models
     - Read Section 5.2 carefully so that you use this method properly
     - The oob score that you provide should be the average of 10 runs
 - Don't make assumptions!

I have broken the lab down into 4 main parts. 

### Part 0

Please provide the following information:
 - Group Number: 12
 - Group Members
     - Jonathan Alberto Calle Zuniga (0825959)
     - Jonathan Chukwuma OTEH (0775057)
     - Name (Student ID)

     

### Part 1 - Create and evaluate an initial model

#### Code (15 marks)

In [3]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

In [14]:
rent_12 = pd.read_csv("https://raw.githubusercontent.com/joncalle/ML1/main/rent_12.csv")
rent_12 = rent_12[(rent_12.building_id != '0')]
print(rent_12.shape) # print rows, columns
rent_12.head(2)      # dump first 2 rows

(16672, 15)


| | bathrooms | bedrooms | building_id | created | description | display_address | features | latitude | longitude | manager_id | photos | price | street_address | interest_level | num_desc_words |
| :- | :- | :- | :- | :- | :- | :- | :- | :- | :- | :- | :- | :- | :- | :- | :- |
| 0 | 1.0 | 1 | 6257ec70258e72c2f9f32cb92f1d3449 | 2016-06-16 07:49:37 | Amazing modern rental, Central air and central... | North 12th Street | ['Elevator', 'Laundry in Building', 'New Const... | 40.7199 | -73.9538 | 5ba989232d0489da1b5f2c45f6688adc | ['https://photos.renthop.com/2/7172218_c8961ee... | 2850 | 210 North 12th Street | -3 | 51 |
| 1 | 1.0 | 2 | e3c4f2223d1deb777fc7941dbb41047c | 2016-04-26 02:38:18 | Spacious and bright full-floor two bedroom, 1.... | East 55th Street | ['Laundry in Unit', 'Dishwasher', 'Hardwood Fl... | 40.7593 | -73.9689 | 76a7b8c8e01b7192330128f82a3445fb | ['https://photos.renthop.com/2/6924750_ea06008... | 4000 | 157 East 55th Street | 1 | 182 |


In [15]:
# Select the relevant features and the target (price) from our dataset
rent = rent_12[['bathrooms', 'bedrooms', 'longitude', 'latitude', 'price']]
rent.head(2)

| | bathrooms | bedrooms | longitude | latitude | price |
| :- | :- | :- | :- | :- | :- |
| 0 | 1.0 | 1 | -73.9538 | 40.7199 | 2850 |
| 1 | 1.0 | 2 | -73.9689 | 40.7593 | 4000 |


In [16]:
# Separate the features and target columns.
X_train, y_train = rent.drop('price', axis=1), rent['price']

# Create an initial model
rf = RandomForestRegressor(n_estimators = 100, # number of trees in the forest 
                           n_jobs = -1,        # using all processors
                           oob_score = True)   # estimate the generalization score
rf.fit(X_train, y_train)                       # Build a forest of trees from the training set (X, y).
print(f"OOB score {rf.oob_score_:.4f}")        # Score of the training dataset obtained using an out-of-bag estimate.

OOB score 0.4829


#### Explanation (5 marks)

Please provide an explanation and justification for the code submitted in **Part 1** in the context of the overall objective. 

We imported the libraries we need (pandas, NumPy, and scikit-learn) and loaded the Group 12 dataset from the provided CSV file into a DataFrame.
After previewing the data, we kept the columns we considered most relevant: `bathrooms`, `bedrooms`, `longitude`, and `latitude` as features, and `price` as the target.
We then trained a Random Forest Regressor on these columns and used its out-of-bag (OOB) score to evaluate it.
The OOB score of 0.4829 indicates poor performance: the model explains less than half of the variance in price, which suggests the raw data contains noise that needs to be addressed before the model can predict typical rents well.
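
To illustrate why a low OOB score can point to noisy data rather than a weak model, here is a minimal sketch on synthetic data (not the lab's dataset). The feature ranges, the price formula, the value `1_000_000`, and the helper `oob_r2` are all hypothetical choices made for illustration. The sketch fits the same kind of forest twice, once on clean targets and once after a handful of targets are replaced with absurd values.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for rent data (hypothetical, not the lab's dataset).
n = 500
X = rng.uniform(0, 1, size=(n, 2))
y_clean = 1000 + 2000 * X[:, 0] + 1000 * X[:, 1] + rng.normal(0, 50, size=n)

# Corrupt a handful of targets with absurd values, as a data-entry error might.
y_noisy = y_clean.copy()
bad_rows = rng.choice(n, size=5, replace=False)
y_noisy[bad_rows] = 1_000_000

def oob_r2(X, y, seed=0):
    """Fit a random forest and return its out-of-bag R^2."""
    rf = RandomForestRegressor(n_estimators=50, oob_score=True, random_state=seed)
    rf.fit(X, y)
    return rf.oob_score_

clean_score = oob_r2(X, y_clean)
noisy_score = oob_r2(X, y_noisy)
print(f"clean OOB R^2: {clean_score:.4f}")
print(f"noisy OOB R^2: {noisy_score:.4f}")
```

In this sketch only 1% of the rows are corrupted, yet the forest cannot predict those rows from their neighbours, so they dominate the squared error and pull the OOB score down.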

### Part 2 - Denoise the data

This section should only include the code necessary to **denoise** the data, NOT the code necessary to identify inconsistencies, problems, errors, etc. in the data. 

#### Code (25 marks)

In [17]:
# Keep listings with a rent price above 1,000 and below 10,000
rent_clean = rent[(rent.price>1000) & (rent.price<10000)]

# Keep apartments with more than zero bathrooms and more than zero bedrooms
rent_clean = rent_clean[(rent_clean.bathrooms>0) & (rent_clean.bedrooms>0)]

# Keep apartments whose number of bathrooms is plausible for the number of bedrooms
# (each condition is parenthesized explicitly so the & / | grouping is unambiguous)
rent_clean = rent_clean[((rent_clean.bedrooms==1) & (rent_clean.bathrooms<2)) |
                  ((rent_clean.bedrooms==2) & (rent_clean.bathrooms<3)) |
                  ((rent_clean.bedrooms==3) & (rent_clean.bathrooms<4)) |
                  ((rent_clean.bedrooms==4) & (rent_clean.bathrooms<5)) |
                  ((rent_clean.bedrooms==5) & (rent_clean.bathrooms<6)) |
                  ((rent_clean.bedrooms==6) & (rent_clean.bathrooms<7))]

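# Keep listings inside an approximate bounding box for New York City
# Latitude range: approximately 40.4774 to 40.9176 (degrees north)
# Longitude range: approximately -74.2591 to -73.7004 (negative values are degrees west)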
rent_clean = rent_clean[(rent_clean['latitude']>40.4774) &
                    (rent_clean['latitude']<40.9176) &
                    (rent_clean['longitude']>=-74.2591) &
                    (rent_clean['longitude']<=-73.7004)]


print(rent_clean.shape) # print rows, columns
rent_clean.head()

(12047, 5)


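| | bathrooms | bedrooms | longitude | latitude | price |
| :- | :- | :- | :- | :- | :- |
| 0 | 1.0 | 1 | -73.9538 | 40.7199 | 2850 |
| 1 | 1.0 | 2 | -73.9689 | 40.7593 | 4000 |
| 2 | 2.0 | 2 | -73.9857 | 40.7691 | 5725 |
| 4 | 1.0 | 1 | -73.9935 | 40.7301 | 3500 |
| 5 | 2.0 | 2 | -73.9903 | 40.7468 | 7485 |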
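
The six bedroom/bathroom conditions in the cell above all follow one pattern: `bathrooms < bedrooms + 1` for 1 to 6 bedrooms. As a possible alternative, the rule could be written as a single vectorized mask. This is only a sketch: the helper name `plausible_bathrooms`, its `max_bedrooms` parameter, and the sample rows are hypothetical, and the equivalence assumes `bedrooms` holds whole numbers.

```python
import pandas as pd

# Small hypothetical sample covering accepted and rejected combinations.
sample = pd.DataFrame({
    "bedrooms":  [1, 1, 2, 3, 6, 7, 2],
    "bathrooms": [1.0, 2.0, 2.5, 4.0, 6.5, 1.0, 3.0],
})

def plausible_bathrooms(df, max_bedrooms=6):
    """Boolean mask: 1..max_bedrooms bedrooms and bathrooms < bedrooms + 1."""
    enough_bedrooms = df["bedrooms"].between(1, max_bedrooms)  # inclusive on both ends
    not_too_many_baths = df["bathrooms"] < df["bedrooms"] + 1
    return enough_bedrooms & not_too_many_baths

mask = plausible_bathrooms(sample)
print(sample[mask])
```

The explicit six-line version makes each accepted combination visible at a glance, while the compact version is easier to extend if the bedroom limit changes.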


### Part 3 - Create and evaluate a final model

#### Code (15 marks)

In [18]:
# The OOB score reported is the average over 10 runs
X, y = rent_clean.drop('price', axis=1), rent_clean['price']
oob_scores = []
numRuns = 10    # number of runs

for i in range(numRuns):
    rf = RandomForestRegressor(n_estimators = 100, # number of trees in the forest
                               n_jobs = -1,        # use all processors
                               oob_score = True)   # compute the out-of-bag score
    # No train/test split is needed here: the OOB score is computed from the
    # samples each tree did not see in its bootstrap sample.
    rf.fit(X, y)
    oob_scores.append(rf.oob_score_)

avg_oob = sum(oob_scores) / len(oob_scores)
print(f"Average OOB score over {numRuns} runs: {avg_oob:.4f}")

Average OOB score over 10 runs: 0.8075
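
Because each forest is built from random bootstrap samples, the OOB score varies slightly from run to run, which is why an average is reported. A minimal sketch of summarizing that variation is shown here on synthetic data from `make_regression` (not the lab's dataset). The helper `repeated_oob` and its parameters are hypothetical names, and each run is seeded only so that the sketch is reproducible.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (hypothetical, not the lab's dataset).
X, y = make_regression(n_samples=300, n_features=4, noise=10.0, random_state=0)

def repeated_oob(X, y, n_runs=10, n_estimators=100):
    """Fit one forest per run and return the OOB R^2 of every run."""
    scores = []
    for run in range(n_runs):
        rf = RandomForestRegressor(n_estimators=n_estimators,
                                   oob_score=True,
                                   random_state=run)  # a different seed per run
        rf.fit(X, y)
        scores.append(rf.oob_score_)
    return np.array(scores)

scores = repeated_oob(X, y)
print(f"mean OOB R^2: {scores.mean():.4f} "
      f"(std {scores.std():.4f} over {len(scores)} runs)")
```

Reporting the standard deviation alongside the mean gives a rough sense of whether a difference between two models is larger than the run-to-run noise.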


#### Explanation (5 marks)

Please provide an explanation and justification for the code submitted in **Part 3** in the context of the overall objective. 

### Part 4 - Document the problems (35 marks)

In this part, please use the table below to document your understanding of all the data issues you discovered. Note that **no code** should be included, as that should be covered in **Part 2**. Also, note that even if one line of code fixed a few problems, you should list each problem separately in the table below, so be sure you have investigated the data properly. For example, if the list `[-6, 5, 0, 50]` represents heights of adults, the -6, 0, and 50 would represent three data issues to be included in the table below, even though one line of code may be able to address all of them. 

| Data issue discovered | Why is this a problem? | How did you fix it? | Why is this fix appropriate? |
| :- | :- | :- | :- | 
|  example problem 1  | example explanation    | example fix  | example explanation about why this fix is appropriate   |
|  example problem 2  | example explanation    | example fix  | example explanation about why this fix is appropriate   |
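
The adult-heights example in the instructions above can be sketched in code to show how one filter may hide several distinct issues. This is an illustration only: the unit (feet), the bounds `MIN_HEIGHT` and `MAX_HEIGHT`, and the helper `describe_issue` are assumptions, not part of the lab.

```python
heights = [-6, 5, 0, 50]  # the example list from the instructions

# Hypothetical plausibility bounds, assuming heights are measured in feet.
MIN_HEIGHT, MAX_HEIGHT = 3, 9

def describe_issue(value):
    """Return a label for the problem with one value, or None if it looks plausible."""
    if value < 0:
        return "negative height"
    if value == 0:
        return "zero height"
    if value > MAX_HEIGHT:
        return "implausibly large height"
    if value < MIN_HEIGHT:
        return "implausibly small height"
    return None

# Map each problematic value to its own issue label.
issues = {}
for h in heights:
    label = describe_issue(h)
    if label is not None:
        issues[h] = label

for value, label in issues.items():
    print(f"{value}: {label}")
```

A single range filter such as `MIN_HEIGHT <= h <= MAX_HEIGHT` would remove all three values at once, but each one is a different kind of problem and would get its own row in the table.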
