## Real Estate Price Predictor

In [1]:
import pandas as pd

In [2]:
housing = pd.read_csv(r"C:\Users\ASUS\OneDrive\Desktop\Python Codes\RealEstate Project\HousingData.csv")

In [32]:
housing.describe()

| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 800603.0 | 800603.0 | 800603.0 | 800603.0 | 800603.0 | 792802.0 | 800603.0 | 800603.0 | 800603.0 | 800603.0 | 800603.0 | 800603.0 | 800603.0 |
| mean | 3.580023 | 11.250923 | 11.14989 | 0.068526 | 0.55469 | 6.286778 | 68.608647 | 3.784658 | 9.478653 | 407.348898 | 18.499106 | 356.620099 | 12.877019 |
| std | 8.544039 | 23.203295 | 6.822098 | 0.252646 | 0.115169 | 0.703106 | 28.068194 | 2.096674 | 8.676061 | 167.680777 | 2.198247 | 90.752204 | 7.781402 |
| min | 0.00632 | 0.0 | 0.46 | 0.0 | 0.385 | 3.561 | 2.9 | 1.1296 | 1.0 | 187.0 | 12.6 | 0.32 | 1.73 |
| 25% | 0.08221 | 0.0 | 5.19 | 0.0 | 0.449 | 5.885 | 45.0 | 2.1 | 4.0 | 279.0 | 17.4 | 374.71 | 7.01 |
| 50% | 0.26169 | 0.0 | 9.69 | 0.0 | 0.538 | 6.209 | 77.3 | 3.1523 | 5.0 | 330.0 | 19.1 | 391.34 | 11.45 |
| 75% | 3.67367 | 12.5 | 18.1 | 0.0 | 0.624 | 6.63 | 94.1 | 5.118 | 24.0 | 666.0 | 20.2 | 396.21 | 17.11 |
| max | 88.9762 | 100.0 | 27.74 | 1.0 | 0.871 | 8.78 | 100.0 | 12.1265 | 24.0 | 711.0 | 23.0 | 396.9 | 76.0 |


In [4]:
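# If there are missing values in the dataset, one option is to fill them with the column median:

# housing["attribute_name"] = housing["attribute_name"].fillna(housing["attribute_name"].median())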
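
A minimal sketch of the median fill described in the comment above, run on a small made-up DataFrame. The name `toy` and its values are illustrative assumptions, not taken from HousingData.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing RM value.
toy = pd.DataFrame({"RM": [6.0, np.nan, 5.0, 7.0]})

# Median of the observed values; NaN is skipped by default.
rm_median = toy["RM"].median()

# fillna returns a new Series, so assign it back to keep the result.
toy["RM"] = toy["RM"].fillna(rm_median)
print(toy["RM"].tolist())
```

The median computed on the training data also has to be reused on the test data, which is likely why the notebook later switches to SimpleImputer, since it stores the fitted median.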

## Plotting Histograms

In [5]:
%matplotlib inline

In [6]:
import matplotlib.pyplot as plt

In [7]:
#housing.hist(bins=50, figsize=(20, 15))

## Train & Test Splitting

In [8]:
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
print(f"Rows in train set: {len(train_set)} \nRows in test set:  {len(test_set)}")

Rows in train set: 800603 
Rows in test set:  200151


The "CHAS" attribute is binary, and 1s are far rarer than 0s (its mean in the summary above is about 0.07), so a purely random split can leave the train and test sets with different proportions of 1s.
We therefore use stratified sampling, so that the train and test sets keep the same ratio of 0s to 1s.

In [9]:
from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

for train_index, test_index in split.split(housing, housing["CHAS"]):
    # split() yields positional indices, so select rows with iloc rather than loc
    strat_train_set = housing.iloc[train_index]
    strat_test_set = housing.iloc[test_index]
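
One way to check that stratification did its job is to compare the share of 1s in each subset. The sketch below uses a small synthetic frame (the name `toy` and the 10% share of 1s are assumptions for illustration); on the real data the same check would be applied to `strat_train_set["CHAS"]` and `strat_test_set["CHAS"]`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical toy frame: 10% ones in a binary column, like a rare CHAS flag.
toy = pd.DataFrame({"CHAS": [1] * 10 + [0] * 90, "x": np.arange(100)})

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(split.split(toy, toy["CHAS"]))

# split() yields positional indices, so iloc is used here.
toy_train = toy.iloc[train_idx]
toy_test = toy.iloc[test_idx]

# For a 0/1 column, the mean is the share of 1s.
train_ratio = toy_train["CHAS"].mean()
test_ratio = toy_test["CHAS"].mean()
print(train_ratio, test_ratio)
```

If the two printed ratios are close to each other, the split preserved the class proportions.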

In [10]:
housing = strat_train_set.copy()

## Trying out Attribute Combinations

In [11]:
housing["TAXRM"] = housing['TAX']/housing['RM']
#housing.head()

## Looking for Correlations

In [12]:
corr_matrix = housing.corr()
corr_matrix['MEDV'].sort_values(ascending=False)

MEDV       1.000000
RM         0.667561
ZN         0.340292
B          0.318056
DIS        0.233479
CHAS       0.164970
AGE       -0.368418
RAD       -0.378947
CRIM      -0.380472
NOX       -0.411422
PTRATIO   -0.448149
TAX       -0.459397
INDUS     -0.463433
TAXRM     -0.527574
LSTAT     -0.564170
Name: MEDV, dtype: float64

In [13]:
#housing.plot(kind = "scatter", x="TAXRM", y ="MEDV")

In [14]:
housing = strat_train_set.drop("MEDV", axis=1)
housing_labels = strat_train_set["MEDV"].copy()

In [15]:
# from pandas.plotting import scatter_matrix
# attributes = ["MEDV", "RM", "ZN", "B"]
# scatter_matrix(housing[attributes], figsize=(12,8))

## SciKit-Learn Design

Scikit-Learn has three primary types of objects:

Estimators: estimate some parameters from a dataset. An imputer is an example; it has both a fit() and a transform() method. The fit() method takes the dataset and computes the internal parameters.

Transformers: the transform() method takes input and returns output based on what was learned in fit(). There is also a convenience method, fit_transform(), which fits and then transforms in one call.

Predictors: the LinearRegression model is an example of a predictor. fit() and predict() are its two main methods. It also provides a score() method, which evaluates the predictions.
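
The three roles can be seen in a short sketch. The toy arrays `X` and `y` are made up for illustration; the classes and methods are the ones named above.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: one feature with a missing value.
X = np.array([[1.0], [2.0], [np.nan], [3.0]])
y = np.array([2.0, 4.0, 4.0, 6.0])

# Estimator and transformer: fit() learns the median, transform() applies it.
imputer = SimpleImputer(strategy="median")
imputer.fit(X)
learned_median = imputer.statistics_[0]
X_filled = imputer.transform(X)

# Predictor: fit() learns coefficients, predict() estimates targets,
# and score() returns R^2 for a regressor.
reg = LinearRegression()
reg.fit(X_filled, y)
preds = reg.predict(X_filled)
r2 = reg.score(X_filled, y)
print(learned_median, preds, r2)
```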

## Feature Scaling

There are two main feature scaling methods:
1. Min-max scaling (normalization): (value - min) / (max - min). Sklearn provides the MinMaxScaler class for this.
2. Standardization: (value - mean) / std. Sklearn provides the StandardScaler class for this.
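
As a quick check of the two formulas, the sketch below computes both by hand with NumPy and compares them with the Sklearn classes. The column `x` is a made-up example.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single-feature column.
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

# Min-max scaling by hand: (value - min) / (max - min), result lies in [0, 1].
manual_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization by hand: (value - mean) / std.
# StandardScaler uses the population standard deviation (ddof=0),
# which is also the default for np.std.
manual_standard = (x - x.mean()) / x.std()

sk_minmax = MinMaxScaler().fit_transform(x)
sk_standard = StandardScaler().fit_transform(x)

print(np.allclose(manual_minmax, sk_minmax))
print(np.allclose(manual_standard, sk_standard))
```

Min-max scaling bounds the output to a fixed range but is sensitive to outliers; standardization has no fixed range but is less affected by a single extreme value.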

## Creating a Pipeline

In [16]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="median")
imputer.fit(housing)
X = imputer.transform(housing)
housing_tr = pd.DataFrame(X, columns = housing.columns)

In [17]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
my_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('std_scaler', StandardScaler()),
])

In [18]:
housing_num_tr = my_pipeline.fit_transform(housing)

## Selecting Desired Model

In [19]:
#from sklearn.linear_model import LinearRegression
#from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
#model = LinearRegression()
#model = DecisionTreeRegressor()
model = RandomForestRegressor()
model.fit(housing_num_tr, housing_labels)

In [20]:
some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]
prepared_data = my_pipeline.transform(some_data)

In [21]:
model.predict(prepared_data)

array([14.9, 18.7, 25. , 23.3, 23.1])

In [22]:
list(some_labels)

[14.9, 18.7, 25.0, 23.3, 23.1]

In [23]:
import numpy as np
from sklearn.metrics import mean_squared_error
housing_predictions = model.predict(housing_num_tr)
MSE = mean_squared_error(housing_labels, housing_predictions)
RMSE = np.sqrt(MSE)

In [24]:
RMSE

1.8756733444410157e-13

## Using Cross Validation

In [25]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, housing_num_tr, housing_labels, scoring="neg_mean_squared_error", cv=10)
RMSE_scores = np.sqrt(-scores)

In [26]:
RMSE_scores

array([1.70879924e-13, 1.69746725e-13, 1.70420543e-13, 1.69274334e-13,
       1.69515816e-13, 1.69813218e-13, 1.69468086e-13, 1.69742699e-13,
       1.67937482e-13, 1.69551919e-13])

In [27]:
def print_scores(scores):
    print("Scores: ", scores)
    print("Mean: ", scores.mean())
    print("Standard Deviation: ", scores.std())

In [28]:
print_scores(RMSE_scores)

Scores:  [1.70879924e-13 1.69746725e-13 1.70420543e-13 1.69274334e-13
 1.69515816e-13 1.69813218e-13 1.69468086e-13 1.69742699e-13
 1.67937482e-13 1.69551919e-13]
Mean:  1.6963507464082803e-13
Standard Deviation:  7.26868104452815e-16


## Saving the model

In [29]:
from joblib import dump, load
dump(model, 'Real_Estate.joblib')

['Real_Estate.joblib']
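
The `load` function imported above reads a saved model back. A minimal round-trip sketch follows, using a small synthetic dataset and a temporary file; the names `toy_model`, `restored`, and the file name are illustrative assumptions.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.ensemble import RandomForestRegressor

# Hypothetical toy training data.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=50)

toy_model = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

# Save to a temporary location and load the model back.
path = os.path.join(tempfile.mkdtemp(), "toy_model.joblib")
dump(toy_model, path)
restored = load(path)

# The restored model should reproduce the original predictions.
same = np.array_equal(toy_model.predict(X), restored.predict(X))
print(same)
```

Because the model in this notebook was fitted on the output of `my_pipeline`, new inputs need the same transformation before prediction, so it may be worth saving the fitted pipeline alongside the model in the same way.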

## Testing the Model on Test Data

In [30]:
X_test = strat_test_set.drop("MEDV", axis=1)
Y_test = strat_test_set["MEDV"].copy()
X_test_prepared = my_pipeline.transform(X_test)
final_predictions = model.predict(X_test_prepared)
final_MSE = mean_squared_error(Y_test, final_predictions)
final_RMSE = np.sqrt(final_MSE)

In [31]:
final_RMSE

1.8789303351818316e-13