## Choosing the appropriate machine learning algorithm

Choosing the right ML algorithm is half systematic and half case-specific. What does that mean? On the one hand, there are a few general steps you can rely on whenever you select an ML model. On the other hand, your problem or data might be unusual and need to be analyzed carefully before the right model can be chosen.

So, for any machine learning problem, we need to ask the following questions to select the right model:

1- Is our data labeled or not?

- Yes ---> Supervised Learning. Is the output "label" discrete or continuous?
    - Discrete ---> Classification
    - Continuous ---> Regression
- No ---> Unsupervised Learning (e.g., Clustering)

A special case is Anomaly Detection ---> the goal here is to identify data points that are simply unusual (e.g., a fraud detection problem where we look for highly unusual credit card spending patterns).

2- Do we need to interact with the environment to optimize the objective function?

- Yes ---> Reinforcement Learning
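The screening questions above can be sketched as a small helper. This is only an illustration: `suggest_learning_family` and its argument names are hypothetical and are not used anywhere else in this notebook.

```python
def suggest_learning_family(labeled, target_kind=None, interacts_with_environment=False):
    """Map the screening questions to a broad family of ML methods."""
    # Question 2: learning by interacting with an environment.
    if interacts_with_environment:
        return "Reinforcement Learning"
    # Question 1: no labels means there is nothing to supervise the model with.
    if not labeled:
        return "Unsupervised Learning"
    # Labeled data: the type of the label decides the task.
    if target_kind == "discrete":
        return "Supervised Learning: Classification"
    if target_kind == "continuous":
        return "Supervised Learning: Regression"
    raise ValueError("target_kind must be 'discrete' or 'continuous' for labeled data")


# Our case: labeled data with a continuous target (the rental price).
print(suggest_learning_family(labeled=True, target_kind="continuous"))
```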


Now that we have a clear picture of our data, we can answer the previous questions directly: yes, we have labeled data with a continuous target ---> Supervised Learning "Regression"

That leads us to a smaller set of ML algorithms that deal with continuous labeled data: Linear Regression, Bayesian Linear Regression, Decision Forest Regression, Boosted Decision Tree Regression, Fast Forest Quantile Regression, Poisson Regression, Ordinal Regression, and Neural Network Regression.

Here we reach the second side of model selection mentioned above: it is a case-specific task! This requires awareness of the many factors that characterize our problem, mainly: accuracy, linearity, training time, the number of parameters (e.g., number of iterations, error tolerance, etc.), and the number of features.

Finally, we need to highlight that there is "no free lunch" with machine learning models: no single model is best on every requirement, so we need to decide which requirement(s) we are willing to pay more for!

So, we have a large number of observations for the three cities.

In our particular problem, the rental price is the target variable to predict. For the three cities, the price shows a continuous increase over time. In other words, the later you book a listing on Airbnb, the cheaper the price you can get!

In the previous step, our exploratory analysis showed non-linear relationships between our features and the average price. However, this does not mean that there is no correlation between them. Some of our features correlate well with the price, such as the accommodation capacity, represented mainly by whether rooms are shared and by the number of available bedrooms, beds, and bathrooms. Even features that show no correlation with the price can help us model our target variable. For example, the number of review stars shows no direct correlation with the average price; however, it can help explain other relationships, such as the variation of the price among different neighborhoods.


# Linear Regression

Linear models, whether simple or Bayesian regression models, are very popular for regression problems because they are simple and easy to interpret. However, in our case, I think they are not suitable because:

1- The relationship between the price and the predictors is non-linear, as shown in the exploratory analysis. So, those linear models will fail to capture that non-linearity.

2- Also, with the large number of features (>10) we have, linear regression models can overfit unless they are regularized.
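To illustrate the first point, here is a minimal sketch on made-up data (not the Airbnb data): an ordinary least-squares line fitted to a purely quadratic relationship. The helper `fit_line` is hypothetical and written in plain Python so that the arithmetic stays visible.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single feature: y ~ slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up, perfectly quadratic relationship: y = x^2.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

slope, intercept = fit_line(xs, ys)
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
mse_line = sum(r ** 2 for r in residuals) / len(residuals)

# The best line is flat (slope 0), so it explains none of the curvature.
print(slope, intercept, mse_line)
```

Even though y is fully determined by x here, the best straight line is horizontal and leaves a large error, which is the failure mode described above.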


#  Decision Forest Regression 

Decision trees are non-parametric models that split the data in the form of a tree structure! A tree performs a sequence of simple tests on each data point, following a binary tree until a leaf node (decision) is reached. A decision forest combines many such trees and averages their predictions.

It can be one of our candidates for modeling our data. Since we have some non-linearity in our data, Decision Forest Regression (DFR) can represent the non-linear relationships in the Airbnb data. Also, since we have a fairly large number of samples, it helps that DFR is efficient in both computation and memory usage during training and prediction.
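The "sequence of simple tests" can be made concrete with a tiny hand-written tree. This is a sketch only: the thresholds and leaf prices below are made up for illustration, and `toy_tree` and `predict_one` are hypothetical names. Only the feature names `Bedrooms_int` and `shared_room` come from our data.

```python
# Hypothetical regression tree: every internal node tests one feature against a threshold,
# every leaf stores a predicted price. All numbers are invented for illustration.
toy_tree = {
    "feature": "Bedrooms_int", "threshold": 1.5,
    "left": {
        "feature": "shared_room", "threshold": 0.5,
        "left": {"value": 90.0},
        "right": {"value": 45.0},
    },
    "right": {"value": 220.0},
}


def predict_one(node, row):
    """Walk from the root to a leaf, going left when the tested feature is <= the threshold."""
    while "value" not in node:
        branch = "left" if row[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["value"]


listing = {"Bedrooms_int": 1, "shared_room": 1}
print(predict_one(toy_tree, listing))
```

A forest would hold many such trees, each grown on a different bootstrap sample, and average their leaf values.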


# Boosted Decision Tree Regression

It is simply a mix of two classical ML techniques, where boosting is employed to create an ensemble of regression trees!

"Boosting" means that each tree is dependent on prior trees. The algorithm learns by fitting the residuals of the trees that preceded it. Thus, boosting in a decision tree ensemble tends to improve accuracy with some small risk of less coverage.
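The residual-fitting idea can be sketched in plain Python with one-split trees (stumps) on a single made-up feature. This is a simplified illustration for squared error, not how XGBoost is implemented. `fit_stump`, `stump_predict`, `boost`, and `mse` are hypothetical helpers.

```python
def fit_stump(xs, targets):
    """Find the single threshold split that minimizes the squared error of the targets."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the maximum so both sides are non-empty
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((t - left_mean) ** 2 for t in left)
                 + sum((t - right_mean) ** 2 for t in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def stump_predict(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump_predict(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, stumps, predictions


def mse(ys, predictions):
    return sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)


# Made-up non-linear data: y = x^2.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [x ** 2 for x in xs]

base, stumps, boosted = boost(xs, ys)
base_mse = mse(ys, [base] * len(ys))
boosted_mse = mse(ys, boosted)
print(base_mse, boosted_mse)
```

Each stump on its own is a weak model, but because every round targets what the previous rounds got wrong, the training error of the sum keeps shrinking.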

### XGBoost 


XGBoost is an implementation of the gradient boosted trees algorithm that is commonly used for classification and regression problems. Gradient boosting is a supervised learning algorithm that builds an ensemble (set) of weaker models (trees) and sums their estimates to predict a target variable more accurately.

XGBoost is also a strong candidate for modeling our Airbnb data, as it performs very well with large data sets and a large number of features. In addition, it is very good at capturing non-linearity in the relationship between the target variable and the predictors. However, since it combines many models (trees), it can be a bit difficult to interpret.


# Fast Forest Quantile Regression

Quantile regression is useful if we want to understand more about the distribution of the predicted value, rather than get a single mean prediction value. This method has many applications, including:

- Predicting prices
- Estimating student performance or applying growth charts to assess child development
- Discovering predictive relationships in cases where there is only a weak relationship between variables

As regression trees and random forests are generally great at learning complex, highly non-linear relationships, Fast Forest Quantile Regression can perform very well with our data. Also, it is easy to interpret and understand. On the other hand, larger random forest ensembles slow down training and require more memory.
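What "predicting a quantile rather than the mean" means can be shown with the pinball (quantile) loss. The sketch below is plain Python and unrelated to any forest implementation: it searches for the constant prediction that minimizes the average pinball loss over a handful of New York prices taken from the tables in this notebook. `pinball_loss` and `best_constant` are hypothetical helpers.

```python
def pinball_loss(y_true, y_pred, q):
    """Penalize under-prediction with weight q and over-prediction with weight (1 - q)."""
    diff = y_true - y_pred
    return q * diff if diff >= 0 else (q - 1) * diff


def best_constant(values, q):
    """Try every observed value as a constant prediction and keep the one with the lowest loss."""
    return min(values, key=lambda c: sum(pinball_loss(v, c, q) for v in values))


# A few New York prices that appear in this notebook's output tables.
prices = [59, 65, 112, 119, 132, 187, 198, 263, 264, 317, 330]

for q in (0.1, 0.5, 0.9):
    print(q, best_constant(prices, q))
```

For q = 0.5 the minimizer is the median price, while small and large q pull the prediction toward the cheap and expensive ends of the distribution.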

# K-nearest neighbors (KNN)

KNN is a non-parametric method used for both classification and regression problems. The algorithm uses "feature similarity" to predict values of any new data points. This means that the new point is assigned a value based on how closely it resembles the points in the training set.

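KNN is a suitable algorithm for predicting sales or prices since it averages the outputs of similar examples. However, with large data sets like ours, it can slow down prediction very badly, because every new point has to be compared with the stored training points (the training stage itself does little more than store the data).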
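A minimal sketch of KNN regression in plain Python, assuming Euclidean distance and an unweighted average of the k nearest targets. The training rows (bedrooms, bathrooms) and prices below are made up, and `knn_predict` is a hypothetical helper.

```python
import math


def knn_predict(train_X, train_y, query, k=3):
    """Average the targets of the k training points closest to the query."""
    # Every prediction scans the whole training set, which is why KNN gets slow on large data.
    distances = sorted(
        (math.dist(row, query), target) for row, target in zip(train_X, train_y)
    )
    nearest = distances[:k]
    return sum(target for _, target in nearest) / len(nearest)


# Made-up listings: (bedrooms, bathrooms) -> price.
train_X = [(1, 1), (1, 1), (2, 1), (3, 2), (4, 2)]
train_y = [60, 70, 110, 200, 260]

print(knn_predict(train_X, train_y, query=(1, 1), k=2))
```

In practice the features would need to be scaled first, since distances are dominated by whichever feature has the largest numeric range.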


# Neural Network Regression

Any class of statistical models can be termed a neural network if it uses adaptive weights and can approximate non-linear functions of its inputs. Thus, neural network regression is suited to problems where a more traditional regression model cannot fit a solution.

A NN acts as a black box because of its model complexity, so it is very difficult to interpret. However, it is one of the most powerful ML models and can capture many types of relationships between the feature variables. In addition, a NN tends to perform better as we have more data. So, NN is a suitable model to select for our goal of Airbnb price prediction.


# Poisson Regression

Poisson regression is intended for regression models that predict numeric values, typically counts. Therefore, we can use this method only if the values we are trying to predict fit the following conditions:

- The response variable has a Poisson distribution.
- Counts cannot be negative. The method will fail outright if you attempt to use it with negative labels.
- A Poisson distribution is a discrete distribution; therefore, it is not meaningful to use this method with non-whole numbers.
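The last two conditions are easy to check mechanically before fitting anything. A minimal sketch, with a hypothetical helper `is_valid_poisson_target`. It only checks that the labels are non-negative whole numbers and does not test whether they actually follow a Poisson distribution.

```python
def is_valid_poisson_target(values):
    """Return True if every label is a non-negative whole number."""
    return all(v >= 0 and float(v).is_integer() for v in values)


print(is_valid_poisson_target([0, 3, 12]))        # count-like labels
print(is_valid_poisson_target([4.916667, 5.0]))   # fractional labels
print(is_valid_poisson_target([-1, 2]))           # negative labels
```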



# Ordinal Regression

Ordinal regression is used when the label or target column contains numbers, but the numbers represent a ranking or order rather than a numeric measurement.

Both Poisson and ordinal regression have requirements that do not match the nature of our data and features. So, there is no point in even trying them!

# Conclusion

In conclusion, based on the following factors:

1- The nature of our problem and the target variable (rental price),

2- The previous analysis of how different ML techniques suit a regression problem,

3- The nature of our data set,

the following techniques have been selected to be applied to our data in order to predict the rental price: Boosted Decision Tree Regression (XGBoost), Decision Forest Regression, and Artificial Neural Networks. A single Decision Tree is also fitted below for comparison.



Let's start!

In [2]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score
import warnings
warnings.filterwarnings('ignore')
from pathlib import Path

In [3]:
from bnb.data_processing.read import read_csvfile

In [4]:
Data_dir = Path('C:/Users/KBrig/AirBnBRepo/AirBnB/data')
Data_dir

WindowsPath('C:/Users/KBrig/AirBnBRepo/AirBnB/data')

In [5]:
NewYork_CSV = Data_dir/'NewYorkData_clean.csv'
Bangkok_CSV = Data_dir/'BangkokData_clean.csv'
Cairo_CSV = Data_dir/'CairoData_clean.csv'

In [6]:
NewYorkData = read_csvfile(NewYork_CSV)
NewYorkData.head(3)

Unnamed: 0,Accuracy_stars,CheckIn,CheckIn_stars,CheckOut,Cleanliness_stars,Communication_stars,Latitude,Location_stars,Longitude,Price,...,partial_address,Group_ID,AvgPrice_neighborhood,AvgNormPrice_neighborhood,avg_rating_stars,Amenities_length,day_of_month,Year_week,weeks_before_booking,ListingsNo_in_weekth
0,5.0,2019-03-21,5.0,2019-03-26,5.0,5.0,40.73321,5.0,-74.00242,244,...,"West Village, Manhattan, Manhattan Community ...",193,242.59375,12,4.916667,28,21,12,2,142
1,0.0,2019-03-07,0.0,2019-03-12,0.0,0.0,40.73181,0.0,-74.00606,188,...,"West Village, Manhattan, Manhattan Community ...",193,242.59375,12,0.0,10,7,10,0,168
2,5.0,2019-03-07,5.0,2019-03-12,5.0,5.0,40.82484,5.0,-73.9526,59,...,"Hamilton Heights, Manhattan, Manhattan Commun...",74,90.1875,4,5.0,12,7,10,0,168


In [7]:
BangkokData = read_csvfile(Bangkok_CSV)
BangkokData.head(3)

Unnamed: 0,level_0,index,CheckIn,CheckOut,Price,Super_host,Latitude,Longitude,Accuracy_stars,Communication_stars,...,Group_ID,home_renting_trend,Super_host_string,avg_rating_stars,price_stars_corr_per_nhood,Amenities_length,day_of_month,Year_week,weeks_before_booking,ListingsNo_in_weekth
0,3742,3742,2019-08-27,2019-09-01,52.0,1,13.74306,100.55585,5.0,5.0,...,39,up-down,Super,5.0,0.078627,25,27,35,24,301
1,2626,2626,2019-07-02,2019-07-07,67.0,0,13.74169,100.55655,5.0,5.0,...,58,no change,not,5.0,0.133509,22,2,27,16,304
2,2859,2859,2019-07-16,2019-07-21,67.0,0,13.74169,100.55655,5.0,5.0,...,8,no change,not,5.0,-0.227533,26,16,29,18,304


In [8]:
CairoData = read_csvfile(Cairo_CSV)
CairoData.head(3)

Unnamed: 0,index,CheckIn,CheckOut,Price,Super_host,Latitude,Longitude,Accuracy_stars,Communication_stars,Cleanliness_stars,...,Group_ID,avg_rating_stars,price_stars_corr_per_nhood,price_stars_corr_Zvalue,avg_price_stars_corr,Amenities_length,day_of_month,Year_week,weeks_before_booking,ListingsNo_in_weekth
0,3201,2019-08-01,2019-08-06,39.0,1,30.05867,31.44884,0.0,0.0,0.0,...,81,0.0,-0.015865,-0.015867,-0.015865,23,1,31,20,304
1,908,2019-04-11,2019-04-16,27.0,0,29.96925,31.26652,0.0,0.0,0.0,...,21,0.0,0.04517,0.0452,0.04517,24,11,15,4,305
2,4605,2019-10-10,2019-10-15,51.0,1,29.95475,31.26336,4.0,4.5,4.5,...,81,4.416667,-0.015865,-0.015867,-0.015865,26,10,41,30,302


### Choose only the features of interest, as selected in the previous feature selection step

- Number of weeks till booking
- The week of the year
- The number of listings available in every week of the year
- Latitude
- Longitude
- Neighborhood_ID
- No. of Bathrooms
- No. of Bedrooms
- No. of Beds
- Shared_bathroom
- Shared_bedroom
- No. of Amenities
- No. of Reviewers
- Avg. number of review stars
- Super host


In [9]:
Selected_list = ['Latitude','Longitude','ListingsNo_in_weekth','weeks_before_booking','Year_week','Bathrooms_num_int','Bedrooms_int','Beds_int','shared_bathroom','shared_room','Reviewers_no','Group_ID','avg_rating_stars','Amenities_length','Super_host']

## New York

In [10]:
NY_X = NewYorkData[Selected_list]
NY_Y = NewYorkData['Price']

## Data Splitting

In [11]:
from sklearn.model_selection import train_test_split

In [12]:
X_train1, X_test1, y_train1, y_test1 = train_test_split(NY_X,NY_Y,test_size=0.3, random_state=5)

In [13]:
print(X_train1.shape)
print(X_test1.shape)
print(y_train1.shape)
print(y_test1.shape)

(2769, 15)
(1188, 15)
(2769,)
(1188,)
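The shapes above follow from simple arithmetic on the 3,957 New York rows (2,769 + 1,188). A minimal sketch, assuming the test count is obtained by rounding test_size * n_samples up and the remainder goes to training. This assumption matches the printed shapes. `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size, rounding the test count up."""
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


n_rows = 2769 + 1188  # total New York rows, taken from the shapes printed above
print(split_sizes(n_rows, 0.3))
```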


### Random Forest Regression (RF)

In [14]:
from sklearn.ensemble import RandomForestRegressor
rf_regressor = RandomForestRegressor(n_estimators=400, criterion='mse', random_state=5)
rf_regressor.fit(X_train1, y_train1)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=400, n_jobs=None,
           oob_score=False, random_state=5, verbose=0, warm_start=False)

In [15]:
y_pred1_RF = rf_regressor.predict(X_test1)

In [16]:
df = pd.DataFrame({'Actual':y_test1, 'Predicted': y_pred1_RF})  
df.head(10)

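|      | Actual | Predicted |
|------|--------|-----------|
| 3198 | 264 | 273.59 |
| 3897 | 263 | 331.2075 |
| 2069 | 119 | 131.6 |
| 2871 | 112 | 113.5425 |
| 3575 | 187 | 245.72 |
| 1714 | 132 | 131.8075 |
| 342 | 65 | 54.4875 |
| 3928 | 317 | 365.8425 |
| 3362 | 198 | 247.9675 |
| 1015 | 330 | 299.2875 |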


In [18]:
from sklearn import metrics
MAE_RF = metrics.mean_absolute_error(y_test1, y_pred1_RF)
MSE_RF = metrics.mean_squared_error(y_test1, y_pred1_RF)
RMSE_RF = np.sqrt(metrics.mean_squared_error(y_test1, y_pred1_RF))

In [19]:
print('Mean Absolute Error:',MAE_RF)  
print('Mean Squared Error:',MSE_RF)  
print('Root Mean Squared Error:',RMSE_RF)  

Mean Absolute Error: 20.212573653198653
Mean Squared Error: 1010.7090101694023
Root Mean Squared Error: 31.79165000702861
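To make the three metrics concrete, the sketch below recomputes them by hand for only the ten Actual/Predicted rows displayed earlier. The numbers therefore differ from the full test-set values printed above. It uses plain Python rather than sklearn.metrics so that each formula is visible.

```python
import math

# The ten rows shown by df.head(10) for the New York random forest.
actual = [264, 263, 119, 112, 187, 132, 65, 317, 198, 330]
predicted = [273.59, 331.2075, 131.6, 113.5425, 245.72,
             131.8075, 54.4875, 365.8425, 247.9675, 299.2875]

errors = [p - a for a, p in zip(actual, predicted)]

mae = sum(abs(e) for e in errors) / len(errors)   # average absolute error
mse = sum(e ** 2 for e in errors) / len(errors)   # average squared error
rmse = math.sqrt(mse)                             # back in the units of the price

print(mae, mse, rmse)
```

Because squaring emphasizes large misses, RMSE is never smaller than MAE, and a large gap between the two hints at a few badly predicted listings.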


### Decision Tree Regression (DT)

In [20]:
from sklearn.tree import DecisionTreeRegressor
DT_regressor = DecisionTreeRegressor(max_depth=2)

In [21]:
DT_regressor.fit(X_train1, y_train1)

DecisionTreeRegressor(criterion='mse', max_depth=2, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')

In [22]:
y_pred_DT = DT_regressor.predict(X_test1)

In [23]:
MAE_DT = metrics.mean_absolute_error(y_test1, y_pred_DT)
MSE_DT = metrics.mean_squared_error(y_test1, y_pred_DT)
RMSE_DT = np.sqrt(metrics.mean_squared_error(y_test1, y_pred_DT))

In [24]:
print('Mean Absolute Error:', MAE_DT)  
print('Mean Squared Error:',MSE_DT)  
print('Root Mean Squared Error:',RMSE_DT) 

Mean Absolute Error: 45.680676587357716
Mean Squared Error: 3888.132986264667
Root Mean Squared Error: 62.35489544746801


### Boosted Decision Tree Regression (XGBoost)

In [25]:
import xgboost as xgb
xgb_model = xgb.XGBRegressor(objective="reg:linear", random_state=42)

xgb_model.fit(X_train1, y_train1)



XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bynode=1, colsample_bytree=1, gamma=0,
       importance_type='gain', learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='reg:linear', random_state=42,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=None, subsample=1, verbosity=1)

In [26]:
y_pred_XGB = xgb_model.predict(X_test1)

In [27]:
MAE_XGB = metrics.mean_absolute_error(y_test1, y_pred_XGB)
MSE_XGB = metrics.mean_squared_error(y_test1, y_pred_XGB)
RMSE_XGB = np.sqrt(metrics.mean_squared_error(y_test1, y_pred_XGB))

In [28]:
print('Mean Absolute Error:', MAE_XGB)
print('Mean Squared Error:', MSE_XGB)  
print('Root Mean Squared Error:', RMSE_XGB)

Mean Absolute Error: 26.19879113142739
Mean Squared Error: 1402.5676271332873
Root Mean Squared Error: 37.450869511044566


### Neural Network Regression (NN)

In [29]:
from sklearn.neural_network import MLPRegressor
NN_regressor = MLPRegressor(activation='relu')

In [30]:
NN_regressor.fit(X_train1, y_train1)

MLPRegressor(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
       random_state=None, shuffle=True, solver='adam', tol=0.0001,
       validation_fraction=0.1, verbose=False, warm_start=False)

In [31]:
y_pred1_NN = NN_regressor.predict(X_test1)

In [32]:
MAE_NN = metrics.mean_absolute_error(y_test1, y_pred1_NN)
MSE_NN = metrics.mean_squared_error(y_test1, y_pred1_NN)
RMSE_NN = np.sqrt(metrics.mean_squared_error(y_test1, y_pred1_NN))

In [33]:
print('Mean Absolute Error:',MAE_NN)
print('Mean Squared Error:', MSE_NN)  
print('Root Mean Squared Error:', RMSE_NN)

Mean Absolute Error: 48.87504074164213
Mean Squared Error: 4167.743371960598
Root Mean Squared Error: 64.55806202141292


In [34]:
models= ["RF", "DT","XGboost","NN"]
models

['RF', 'DT', 'XGboost', 'NN']

In [35]:
Mean_Abs_Error =[MAE_RF ,MAE_DT ,MAE_XGB , MAE_NN]
Mean_Abs_Error

[20.212573653198653, 45.680676587357716, 26.19879113142739, 48.87504074164213]

In [36]:
Mean_Sqrt_Error = [MSE_RF, MSE_DT, MSE_XGB, MSE_NN]
Mean_Sqrt_Error

[1010.7090101694023, 3888.132986264667, 1402.5676271332873, 4167.743371960598]

In [37]:
Root_Mean_Sqrd_Error = [RMSE_RF, RMSE_DT, RMSE_XGB, RMSE_NN]
Root_Mean_Sqrd_Error

[31.79165000702861, 62.35489544746801, 37.450869511044566, 64.55806202141292]

In [38]:
d = {'model': models,'Mean_Abs_Error': Mean_Abs_Error,'Mean_Sqrt_Error': Mean_Sqrt_Error,'Root_Mean_Sqrd_Error':Root_Mean_Sqrd_Error}

In [40]:
pd.DataFrame(data=d)

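|   | model | Mean_Abs_Error | Mean_Sqrt_Error | Root_Mean_Sqrd_Error |
|---|-------|----------------|-----------------|----------------------|
| 0 | RF | 20.212574 | 1010.70901 | 31.79165 |
| 1 | DT | 45.680677 | 3888.132986 | 62.354895 |
| 2 | XGboost | 26.198791 | 1402.567627 | 37.45087 |
| 3 | NN | 48.875041 | 4167.743372 | 64.558062 |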
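The comparison can also be read off programmatically. A small sketch that ranks the four New York models by RMSE, using the values from the table above (lower is better). The dictionary `rmse_by_model` is a hypothetical name.

```python
# RMSE values copied from the New York summary table.
rmse_by_model = {"RF": 31.79165, "DT": 62.354895, "XGboost": 37.45087, "NN": 64.558062}

# Sort model names from lowest to highest error.
ranking = sorted(rmse_by_model, key=rmse_by_model.get)
print(ranking)
```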


## Hyperparameter Tuning with Randomized Search

### Random Forest Regression

In [41]:
from pprint import pprint

In [42]:
# Look at parameters used by our current forest
print('Parameters currently in use:\n')
pprint(rf_regressor.get_params())

Parameters currently in use:

{'bootstrap': True,
 'criterion': 'mse',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 400,
 'n_jobs': None,
 'oob_score': False,
 'random_state': 5,
 'verbose': 0,
 'warm_start': False}


In [43]:
# Randomized Search CV
from sklearn.model_selection import RandomizedSearchCV

In [44]:
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 100, stop = 1200, num = 12)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(5, 30, num = 6)]
# max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10, 15, 100]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 5, 10]
# Method of selecting samples for training each tree
# bootstrap = [True, False]

# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf}

In [45]:
pprint(random_grid)

{'max_depth': [5, 10, 15, 20, 25, 30],
 'max_features': ['auto', 'sqrt'],
 'min_samples_leaf': [1, 2, 5, 10],
 'min_samples_split': [2, 5, 10, 15, 100],
 'n_estimators': [100,
                  200,
                  300,
                  400,
                  500,
                  600,
                  700,
                  800,
                  900,
                  1000,
                  1100,
                  1200]}


In [46]:
# Use the random grid to search for best hyperparameters
# First create the base model to tune
rf = RandomForestRegressor()
# Random search of parameters, using 3 fold cross validation, search across 100 different combinations, and use all available cores
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 100, cv = 3, verbose=2, random_state=42, n_jobs = -1)

In [47]:
#Fit the random search model
rf_random.fit(X_train1,y_train1)

Fitting 3 folds for each of 100 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:   19.9s
[Parallel(n_jobs=-1)]: Done 146 tasks      | elapsed:  1.2min
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed:  2.2min finished


RandomizedSearchCV(cv=3, error_score='raise-deprecating',
          estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False),
          fit_params=None, iid='warn', n_iter=100, n_jobs=-1,
          param_distributions={'n_estimators': [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1100, 1200], 'max_features': ['auto', 'sqrt'], 'max_depth': [5, 10, 15, 20, 25, 30], 'min_samples_split': [2, 5, 10, 15, 100], 'min_samples_leaf': [1, 2, 5, 10]},
          pre_dispatch='2*n_jobs', random_state=42, refit=True,
          return_train_score='warn', scoring=None, verbose=2)
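To put n_iter = 100 in perspective, the grid above can be enumerated by hand. The sketch below is plain Python and is not how scikit-learn implements the search: it counts the combinations in the grid and draws one random candidate, which is conceptually what a randomized search does 100 times.

```python
import random

# Same grid as above, rebuilt here so the snippet is self-contained.
random_grid = {
    "n_estimators": list(range(100, 1300, 100)),
    "max_features": ["auto", "sqrt"],
    "max_depth": [5, 10, 15, 20, 25, 30],
    "min_samples_split": [2, 5, 10, 15, 100],
    "min_samples_leaf": [1, 2, 5, 10],
}

# Size of the full grid: the product of the number of options per parameter.
n_combinations = 1
for options in random_grid.values():
    n_combinations *= len(options)

# Share of the grid visited by 100 random candidates.
coverage = 100 / n_combinations

# One random candidate: pick one option per parameter.
rng = random.Random(42)
candidate = {name: rng.choice(options) for name, options in random_grid.items()}

print(n_combinations, round(100 * coverage, 1))
print(candidate)
```

An exhaustive grid search with 3-fold cross validation would need three fits per combination, so sampling a small share of the grid trades completeness for a much shorter run.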

In [48]:
rf_random.best_params_

{'n_estimators': 700,
 'min_samples_split': 5,
 'min_samples_leaf': 1,
 'max_features': 'auto',
 'max_depth': 30}

In [49]:
y_pred1_RF = rf_random.predict(X_test1)

In [50]:
from sklearn import metrics
print('Mean Absolute Error:',metrics.mean_absolute_error(y_test1, y_pred1_RF))  
print('Mean Squared Error:',metrics.mean_squared_error(y_test1, y_pred1_RF))  
print('Root Mean Squared Error:',np.sqrt(metrics.mean_squared_error(y_test1, y_pred1_RF)))  

Mean Absolute Error: 20.50350600616722
Mean Squared Error: 1026.937584433325
Root Mean Squared Error: 32.04586688534615


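For comparison, here are the same metrics before tuning. The tuned forest is marginally worse on this test split, so the randomized search did not improve on the initial settings here: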

In [52]:
print('Mean Absolute Error:',MAE_RF)  
print('Mean Squared Error:',MSE_RF)  
print('Root Mean Squared Error:',RMSE_RF)

Mean Absolute Error: 20.212573653198653
Mean Squared Error: 1010.7090101694023
Root Mean Squared Error: 31.79165000702861
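The size of the difference is easier to judge as a relative change. A minimal sketch using the MAE and RMSE values printed above. The dictionary names are hypothetical.

```python
# Test-set metrics for the New York random forest, copied from the outputs above.
before_tuning = {"MAE": 20.212573653198653, "RMSE": 31.79165000702861}
after_tuning = {"MAE": 20.50350600616722, "RMSE": 32.04586688534615}

# Positive values mean the error went up after tuning.
change_pct = {
    name: 100 * (after_tuning[name] - before_tuning[name]) / before_tuning[name]
    for name in before_tuning
}

for name, pct in change_pct.items():
    print(f"{name}: {pct:+.2f}%")
```

A change of this size is small enough that it could come from the particular train/test split, so it should not be read as evidence that either configuration is better.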


## Bangkok

In [141]:
Bngkok_X = BangkokData[Selected_list]
Bngkok_Y = BangkokData['Price']

In [142]:
X_train2, X_test2, y_train2, y_test2 = train_test_split(Bngkok_X,Bngkok_Y,test_size=0.3, random_state=5)
print(X_train2.shape)
print(X_test2.shape)
print(y_train2.shape)
print(y_test2.shape)

(5425, 15)
(2326, 15)
(5425,)
(2326,)


### Random Forest Regression

In [143]:
rf_classifier = RandomForestRegressor(n_estimators=400, criterion='mse', random_state=5)
rf_classifier.fit(X_train2, y_train2)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=400, n_jobs=None,
           oob_score=False, random_state=5, verbose=0, warm_start=False)

In [166]:
y_pred2_RFR = rf_classifier.predict(X_test2)

In [167]:
df = pd.DataFrame({'Actual':y_test2, 'Predicted': y_pred2_RFR})  
df.head(10)

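|      | Actual | Predicted |
|------|--------|-----------|
| 3213 | 84.0 | 83.635 |
| 1445 | 54.0 | 76.7825 |
| 3913 | 43.0 | 42.095 |
| 6396 | 58.0 | 57.7375 |
| 7175 | 60.0 | 66.155 |
| 3067 | 33.0 | 34.4425 |
| 4280 | 48.0 | 47.86 |
| 1290 | 117.0 | 113.3475 |
| 1075 | 34.0 | 37.59 |
| 7559 | 66.0 | 66.6975 |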


In [168]:
MAE_RF =  metrics.mean_absolute_error(y_test2, y_pred2_RFR)
MSE_RF = metrics.mean_squared_error(y_test2, y_pred2_RFR)  
RMSE_RF = np.sqrt(metrics.mean_squared_error(y_test2, y_pred2_RFR))

In [169]:
from sklearn import metrics  
print('Mean Absolute Error:', MAE_RF)  
print('Mean Squared Error:', MSE_RF)  
print('Root Mean Squared Error:', RMSE_RF)  

Mean Absolute Error: 2.967570937231298
Mean Squared Error: 37.246637419389515
Root Mean Squared Error: 6.103002328312641


### Decision Tree Regression

In [148]:
from sklearn.tree import DecisionTreeRegressor
DT_regressor = DecisionTreeRegressor(max_depth=2)

In [149]:
DT_regressor.fit(X_train2, y_train2)

DecisionTreeRegressor(criterion='mse', max_depth=2, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')

In [150]:
y_pred2_DT = DT_regressor.predict(X_test2)

In [151]:
MAE_DT = metrics.mean_absolute_error(y_test2, y_pred2_DT)
MSE_DT = metrics.mean_squared_error(y_test2, y_pred2_DT)
RMSE_DT = np.sqrt(metrics.mean_squared_error(y_test2, y_pred2_DT))

In [152]:
print('Mean Absolute Error:', MAE_DT)  
print('Mean Squared Error:', MSE_DT)  
print('Root Mean Squared Error:', RMSE_DT)

Mean Absolute Error: 16.077043088844782
Mean Squared Error: 480.7835438074452
Root Mean Squared Error: 21.92677686773515


### Boosted Decision Tree Regression (XGBoost)

In [153]:
import xgboost as xgb
xgb_model = xgb.XGBRegressor(objective="reg:linear", random_state=42)

In [154]:
xgb_model.fit(X_train2, y_train2)



XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bynode=1, colsample_bytree=1, gamma=0,
       importance_type='gain', learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='reg:linear', random_state=42,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=None, subsample=1, verbosity=1)

In [155]:
y_pred2_XGB = xgb_model.predict(X_test2)

In [None]:
MAE_XGB = metrics.mean_absolute_error(y_test2, y_pred2_XGB)
MSE_XGB = metrics.mean_squared_error(y_test2, y_pred2_XGB)
RMSE_XGB = np.sqrt(metrics.mean_squared_error(y_test2, y_pred2_XGB))

In [156]:
print('Mean Absolute Error:', MAE_XGB)  
print('Mean Squared Error:', MSE_XGB )  
print('Root Mean Squared Error:',RMSE_XGB) 

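Mean Absolute Error: 26.19879113142739
Mean Squared Error: 1402.5676271332873
Root Mean Squared Error: 37.450869511044566

Note: these three values are identical to the New York XGBoost results. The cell that computes MAE_XGB, MSE_XGB, and RMSE_XGB for Bangkok is marked In [None], i.e. it was not executed, so the printed numbers are stale New York values rather than Bangkok metrics. Rerun that cell before drawing conclusions about XGBoost on Bangkok.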


### Neural Network Regression

In [157]:
from sklearn.neural_network import MLPRegressor
NN_regressor = MLPRegressor(activation='relu')

In [158]:
NN_regressor.fit(X_train2, y_train2)

MLPRegressor(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
       random_state=None, shuffle=True, solver='adam', tol=0.0001,
       validation_fraction=0.1, verbose=False, warm_start=False)

In [159]:
y_pred2_NN = NN_regressor.predict(X_test2)

In [161]:
MAE_NN = metrics.mean_absolute_error(y_test2, y_pred2_NN)
MSE_NN = metrics.mean_squared_error(y_test2, y_pred2_NN)
RMSE_NN = np.sqrt(metrics.mean_squared_error(y_test2, y_pred2_NN))

In [162]:
print('Mean Absolute Error:', MAE_NN)  
print('Mean Squared Error:', MSE_NN)  
print('Root Mean Squared Error:', RMSE_NN) 

Mean Absolute Error: 15.662627306881554
Mean Squared Error: 425.64247871790775
Root Mean Squared Error: 20.63110464124274


In [163]:
Mean_Abs_Error =[MAE_RF ,MAE_DT ,MAE_XGB , MAE_NN]
Mean_Sqrt_Error = [MSE_RF, MSE_DT, MSE_XGB, MSE_NN]
Root_Mean_Sqrd_Error = [RMSE_RF, RMSE_DT, RMSE_XGB, RMSE_NN]

In [164]:
d = {'model': models,'Mean_Abs_Error': Mean_Abs_Error,'Mean_Sqrt_Error': Mean_Sqrt_Error,'Root_Mean_Sqrd_Error':Root_Mean_Sqrd_Error}

In [165]:
pd.DataFrame(data=d)

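|   | model | Mean_Abs_Error | Mean_Sqrt_Error | Root_Mean_Sqrd_Error |
|---|-------|----------------|-----------------|----------------------|
| 0 | RF | 2.967571 | 37.246637 | 6.103002 |
| 1 | DT | 16.077043 | 480.783544 | 21.926777 |
| 2 | XGboost | 26.198791 | 1402.567627 | 37.45087 |
| 3 | NN | 15.662627 | 425.642479 | 20.631105 |

Note: the XGboost row repeats the New York values, because the Bangkok XGBoost metrics cell was not executed. It should not be compared with the other rows until that cell is rerun.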


## Cairo

In [170]:
Cairo_X = CairoData[Selected_list]
Cairo_Y = CairoData['Price']

In [171]:
X_train3, X_test3, y_train3, y_test3 = train_test_split(Cairo_X,Cairo_Y,test_size=0.3, random_state=5)
print(X_train3.shape)
print(X_test3.shape)
print(y_train3.shape)
print(y_test3.shape)

(5531, 15)
(2371, 15)
(5531,)
(2371,)


### Random Forest Regression

In [172]:
rf_classifier = RandomForestRegressor(n_estimators=400, criterion='mse', random_state=5)
rf_classifier.fit(X_train3, y_train3)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=400, n_jobs=None,
           oob_score=False, random_state=5, verbose=0, warm_start=False)

In [173]:
y_pred3_RFR = rf_classifier.predict(X_test3)

In [174]:
df = pd.DataFrame({'Actual':y_test3, 'Predicted': y_pred3_RFR})  
df.head(10)

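|      | Actual | Predicted |
|------|--------|-----------|
| 3564 | 68.0 | 65.8325 |
| 6015 | 40.0 | 38.4775 |
| 7518 | 20.0 | 27.3175 |
| 4707 | 67.0 | 65.8375 |
| 4272 | 59.0 | 62.79 |
| 7715 | 25.0 | 22.115 |
| 6862 | 120.0 | 66.6525 |
| 2617 | 13.0 | 13.5925 |
| 2005 | 78.0 | 77.91 |
| 855 | 40.0 | 42.715 |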


In [175]:
from sklearn import metrics 
MAE_RF =  metrics.mean_absolute_error(y_test3, y_pred3_RFR)
MSE_RF =  metrics.mean_squared_error(y_test3, y_pred3_RFR)
RMSE_RF = np.sqrt(metrics.mean_squared_error(y_test3, y_pred3_RFR))

In [176]:
 
print('Mean Absolute Error:', MAE_RF)  
print('Mean Squared Error:',MSE_RF)  
print('Root Mean Squared Error:',RMSE_RF) 

Mean Absolute Error: 4.018936102910165
Mean Squared Error: 141.08917577235343
Root Mean Squared Error: 11.878096470914581


### Decision Tree Regression

In [177]:
from sklearn.tree import DecisionTreeRegressor
DT_regressor = DecisionTreeRegressor(max_depth=2)

In [178]:
DT_regressor.fit(X_train3, y_train3)

DecisionTreeRegressor(criterion='mse', max_depth=2, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')

In [179]:
y_pred3_DT = DT_regressor.predict(X_test3)

In [180]:
MAE_DT = metrics.mean_absolute_error(y_test3, y_pred3_DT)
MSE_DT = metrics.mean_squared_error(y_test3, y_pred3_DT)
RMSE_DT = np.sqrt(metrics.mean_squared_error(y_test3, y_pred3_DT))

In [181]:
print('Mean Absolute Error:',MAE_DT)  
print('Mean Squared Error:', MSE_DT)  
print('Root Mean Squared Error:',RMSE_DT)

Mean Absolute Error: 21.473754123679146
Mean Squared Error: 1113.280607746576
Root Mean Squared Error: 33.36585991318935


### Boosted Decision Tree Regression (XGBoost)

In [182]:
import xgboost as xgb
xgb_model = xgb.XGBRegressor(objective="reg:linear", random_state=42)

In [183]:
xgb_model.fit(X_train3, y_train3)



XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bynode=1, colsample_bytree=1, gamma=0,
       importance_type='gain', learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='reg:linear', random_state=42,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=None, subsample=1, verbosity=1)

In [186]:
y_pred3_XGB = xgb_model.predict(X_test3)

In [187]:
MAE_XGB = metrics.mean_absolute_error(y_test3, y_pred3_XGB)
MSE_XGB = metrics.mean_squared_error(y_test3, y_pred3_XGB)
RMSE_XGB = np.sqrt(metrics.mean_squared_error(y_test3, y_pred3_XGB))

In [188]:
print('Mean Absolute Error:',MAE_XGB)  
print('Mean Squared Error:',MSE_XGB)  
print('Root Mean Squared Error:',RMSE_XGB) 

Mean Absolute Error: 14.868262046804311
Mean Squared Error: 476.7570508650813
Root Mean Squared Error: 21.83476702108546


### Neural Network Regression

In [189]:
from sklearn.neural_network import MLPRegressor
NN_regressor = MLPRegressor(activation='relu')

In [190]:
NN_regressor.fit(X_train3, y_train3)

MLPRegressor(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
       random_state=None, shuffle=True, solver='adam', tol=0.0001,
       validation_fraction=0.1, verbose=False, warm_start=False)

In [191]:
y_pred3_NN = NN_regressor.predict(X_test3)

In [192]:
MAE_NN = metrics.mean_absolute_error(y_test3, y_pred3_NN)
MSE_NN = metrics.mean_squared_error(y_test3, y_pred3_NN)
RMSE_NN = np.sqrt(metrics.mean_squared_error(y_test3, y_pred3_NN))

In [193]:
print('Mean Absolute Error:',MAE_NN)  
print('Mean Squared Error:',MSE_NN)  
print('Root Mean Squared Error:',RMSE_NN)

Mean Absolute Error: 21.766187987979162
Mean Squared Error: 1063.260350488555
Root Mean Squared Error: 32.60767318421471


In [196]:
Mean_Abs_Error =[MAE_RF ,MAE_DT ,MAE_XGB , MAE_NN]
Mean_Sqrt_Error = [MSE_RF, MSE_DT, MSE_XGB, MSE_NN]
Root_Mean_Sqrd_Error = [RMSE_RF, RMSE_DT, RMSE_XGB, RMSE_NN]

In [197]:
d = {'model': models,'Mean_Abs_Error': Mean_Abs_Error,'Mean_Sqrt_Error': Mean_Sqrt_Error,'Root_Mean_Sqrd_Error':Root_Mean_Sqrd_Error}

In [198]:
pd.DataFrame(data=d)

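|   | model | Mean_Abs_Error | Mean_Sqrt_Error | Root_Mean_Sqrd_Error |
|---|-------|----------------|-----------------|----------------------|
| 0 | RF | 4.018936 | 141.089176 | 11.878096 |
| 1 | DT | 21.473754 | 1113.280608 | 33.36586 |
| 2 | XGboost | 14.868262 | 476.757051 | 21.834767 |
| 3 | NN | 21.766188 | 1063.26035 | 32.607673 |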
