---
# Welcome to CE880
### This is your Week 10 problem notebook

For this problem set, we'll be using a Jupyter notebook. Please upload this notebook to [Google Colab](https://colab.research.google.com/). 

Paris Housing is a synthetic dataset of house prices in an urban environment. 
https://github.com/sagihaider/CE880_2021/blob/main/Data/ParisHousingClass.csv 

All attributes are numeric except `category`, which is a text label. They are listed below:
* squareMeters
* numberOfRooms
* hasYard
* hasPool
* floors - number of floors
* cityCode - zip code
* cityPartRange - the higher the range, the more exclusive the neighbourhood is
* numPrevOwners - number of previous owners
* made - year
* isNewBuilt
* hasStormProtector
* basement - basement square meters
* attic - attic square meters
* garage - garage size
* hasStorageRoom
* hasGuestRoom - number of guest rooms
* price - price of a house
* category - Luxury or Basic

In [2]:
import numpy as np 
import pandas as pd 
url = 'https://raw.githubusercontent.com/sagihaider/CE880_2021/main/Data/ParisHousingClass.csv'
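# Note: index_col=0 turns the first column (squareMeters) into the index,
# so it is not among the feature columns later on.
trainData = pd.read_csv(url,index_col=0)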
print(trainData.head())

              numberOfRooms  hasYard  hasPool  floors  cityCode  \
squareMeters                                                      
75523                     3        0        1      63      9373   
80771                    39        1        1      98     39381   
55712                    58        0        1      19     34457   
32316                    47        0        0       6     27939   
70429                    19        1        1      90     38045   

              cityPartRange  numPrevOwners  made  isNewBuilt  \
squareMeters                                                   
75523                     3              8  2005           0   
80771                     8              6  2015           1   
55712                     6              8  2021           0   
32316                    10              4  2012           0   
70429                     3              7  1990           1   

              hasStormProtector  basement  attic  garage  hasStorageRoom  \
squar

In [3]:
from sklearn.preprocessing import LabelEncoder

def label_encoded(feat):
    """Encode a column of string labels as integers and print the class order."""
    le = LabelEncoder()
    le.fit(feat)
    print(feat.name, le.classes_)
    return le.transform(feat)

trainData['category'] = label_encoded(trainData['category'])

category ['Basic' 'Luxury']
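
The printed class order suggests that `LabelEncoder` sorts the labels, so `Basic` becomes 0 and `Luxury` becomes 1. A minimal sketch on a small made-up series (the `toy` values are hypothetical and not taken from the dataset) shows the mapping and how to reverse it:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy labels, for illustration only
toy = pd.Series(['Luxury', 'Basic', 'Basic', 'Luxury'], name='category')

le = LabelEncoder()
encoded = le.fit_transform(toy)           # classes are sorted alphabetically
decoded = le.inverse_transform(encoded)   # map the integers back to the labels

print(le.classes_.tolist())   # ['Basic', 'Luxury']
print(encoded.tolist())       # [1, 0, 0, 1]
print(decoded.tolist())       # ['Luxury', 'Basic', 'Basic', 'Luxury']
```

Keeping the fitted encoder around (rather than discarding it inside a helper) makes it possible to turn predictions back into readable labels later.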


In [4]:
from sklearn.model_selection import train_test_split,GridSearchCV

y=trainData['category']
x=trainData.drop('category',axis=1)
xtrain,xtest,ytrain,ytest=train_test_split(x,y,test_size=0.2,random_state=42)
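
The split above keeps 20% of the rows for testing. As an optional variant that the notebook itself does not use, passing `stratify` keeps the class proportions the same in both parts. A minimal sketch on synthetic data (the names `X_demo` and `y_demo` are hypothetical):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced toy data: 80 samples of class 0, 20 of class 1
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
y_demo = np.array([0] * 80 + [1] * 20)

# stratify=y_demo preserves the 80/20 class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

print(X_tr.shape, X_te.shape)     # (80, 4) (20, 4)
print(y_tr.mean(), y_te.mean())   # share of class 1 in each split
```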

---
## Question 1: 

#### Classification
Using the dataset, which I have split into training and test sets (xtrain, xtest, ytrain, ytest), train a machine learning model that reaches a test accuracy of 100%. You are free to use any classification model, such as Decision Tree, Random Forest, KNN, or SVM. 

Hint: Use grid search to find the best model across different parameter settings. 
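
As a rough illustration of the hint, here is a minimal grid search over a single decision tree. It runs on synthetic data from `make_classification`, and the grid values are arbitrary choices for the sketch, not recommended settings for the Paris Housing data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical, not the Paris Housing dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

# Every combination of these values is tried: 3 x 3 = 9 candidates
param_grid = {'max_depth': [3, 5, 7], 'min_samples_leaf': [1, 3, 5]}

search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                      param_grid, scoring='accuracy', cv=3)
search.fit(X_tr, y_tr)

print(search.best_params_)        # best combination found by cross-validation
print(search.best_score_)         # mean cross-validated accuracy of that combination
print(search.score(X_te, y_te))   # accuracy of the refitted best model on the test set
```

By default `GridSearchCV` refits the best combination on the whole training set, which is why `search.score` can be called directly afterwards.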

In [5]:
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

param_range = [3, 5, 7, 10]
param_range_1 = [1.0, 0.5, 0.1]
n_estimators = [50,100,150]
learning_rates = [.1,.2,.3]

pipe_dt = Pipeline([('DT',DecisionTreeClassifier(random_state=42))])
pipe_rf = Pipeline([('RF',RandomForestClassifier(random_state=42))])
pipe_knn = Pipeline([('KNN', KNeighborsClassifier())])
pipe_xgb = Pipeline([('XGB', XGBClassifier(random_state=42))])

dt_param_grid = [{'DT__criterion': ['gini', 'entropy'],
                   'DT__min_samples_leaf': param_range,
                   'DT__max_depth': param_range,
                   'DT__min_samples_split': param_range[1:]}]
rf_param_grid = [{'RF__min_samples_leaf': param_range,
                   'RF__max_depth': param_range,
                   'RF__min_samples_split': param_range[1:]}]
knn_param_grid = [{'KNN__n_neighbors': param_range,
                   'KNN__metric': ['euclidean', 'manhattan']}]
xgb_param_grid = [{'XGB__max_depth': param_range,
                    'XGB__min_child_weight': param_range[:2],
                    'XGB__n_estimators': n_estimators}]

dt_grid_search = GridSearchCV(estimator=pipe_dt,
        param_grid=dt_param_grid,
        scoring='accuracy',
        cv=3)
rf_grid_search = GridSearchCV(estimator=pipe_rf,
        param_grid=rf_param_grid,
        scoring='accuracy',
        cv=3)
knn_grid_search = GridSearchCV(estimator=pipe_knn,
        param_grid=knn_param_grid,
        scoring='accuracy',
        cv=3)

xgb_grid_search = GridSearchCV(estimator=pipe_xgb,
        param_grid=xgb_param_grid,
        scoring='accuracy',
        cv=3)

grids = [dt_grid_search, rf_grid_search, knn_grid_search, xgb_grid_search]
for pipe in grids:
    pipe.fit(xtrain, ytrain)

In [7]:
grid_dict = {0: 'Decision Trees', 
             1: 'Random Forest', 2: 'K-Nearest Neighbors', 3: 'XGBoost'}
for i, model in enumerate(grids):
    print('{} Test Accuracy: {}'.format(grid_dict[i],
    model.score(xtest,ytest)))
    print('{} Best Params: {}'.format(grid_dict[i],
                                      model.best_params_))

Decision Trees Test Accuracy: 1.0
Decision Trees Best Params: {'DT__criterion': 'gini', 'DT__max_depth': 3, 'DT__min_samples_leaf': 3, 'DT__min_samples_split': 5}
Random Forest Test Accuracy: 1.0
Random Forest Best Params: {'RF__max_depth': 7, 'RF__min_samples_leaf': 3, 'RF__min_samples_split': 5}
K-Nearest Neighbors Test Accuracy: 0.871
K-Nearest Neighbors Best Params: {'KNN__metric': 'euclidean', 'KNN__n_neighbors': 10}
XGBoost Test Accuracy: 1.0
XGBoost Best Params: {'XGB__max_depth': 3, 'XGB__min_child_weight': 3, 'XGB__n_estimators': 50}
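
KNN scores lower than the tree-based models in the output above. One possible reason, which is an assumption and has not been checked against this dataset, is that KNN relies on distances, so features on a large scale can dominate the others. A minimal sketch on synthetic data of adding a `StandardScaler` step to the pipeline:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: one small-scale informative feature, one large-scale noise feature
rng = np.random.default_rng(0)
n = 400
signal = rng.normal(size=n)
noise = rng.normal(scale=1000.0, size=n)
X_demo = np.column_stack([signal, noise])
y_demo = (signal > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

# Same classifier, with and without a scaling step in front of it
pipe_raw = Pipeline([('KNN', KNeighborsClassifier(n_neighbors=5))])
pipe_scaled = Pipeline([('scale', StandardScaler()),
                        ('KNN', KNeighborsClassifier(n_neighbors=5))])

acc_raw = pipe_raw.fit(X_tr, y_tr).score(X_te, y_te)
acc_scaled = pipe_scaled.fit(X_tr, y_tr).score(X_te, y_te)
print(acc_raw, acc_scaled)
```

Because the scaler sits inside the pipeline, it is fitted only on the training folds during cross-validation, so the same pipeline could be passed to `GridSearchCV` with keys such as `KNN__n_neighbors`.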


In [8]:
from sklearn import metrics
def my_model(xtrain,xtest,ytrain,ytest):
    """Train a classifier and return its test accuracy as a percentage.

    The target is a test accuracy of 100%. You are free to use any
    classification model, such as Decision Tree, Random Forest, KNN, or SVM.
    """
    # YOUR CODE HERE
    # Hyperparameters taken from the best XGBoost grid-search result above
    xgb_clf = XGBClassifier(random_state=42, max_depth = 3, min_child_weight = 3, n_estimators = 50)
    # Fit the classifier to the training data
    xgb_clf.fit(xtrain, ytrain)
    # Predict on the held-out test set and convert accuracy to a percentage
    ypred = xgb_clf.predict(xtest)
    score = metrics.accuracy_score(ytest, ypred)*100
    return score



In [10]:
# Check your solution by running this cell
import math
assert math.isclose(my_model(xtrain,xtest,ytrain,ytest), 100.0, rel_tol = 0.05)
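
Note that the check above uses `math.isclose` with `rel_tol = 0.05`, so it accepts scores within about 5% of 100 rather than requiring exactly 100. A small sketch with made-up scores illustrates the tolerance:

```python
import math

# Hypothetical scores compared against the 100.0 target with a 5% relative tolerance
exact = math.isclose(100.0, 100.0, rel_tol=0.05)
near = math.isclose(96.0, 100.0, rel_tol=0.05)    # difference 4 is within 5% of 100
far = math.isclose(94.0, 100.0, rel_tol=0.05)     # difference 6 exceeds 5% of 100

print(exact)  # True
print(near)   # True
print(far)    # False
```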