# Analysing

## Loading 

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Raw = pd.read_csv('housing.csv')
Raw.head()

In [None]:
Raw.info()

In [None]:
print(type(Raw['RAD']))
print(type(Raw['CHAS']))
# Only the last expression in a cell is displayed, so print both counts explicitly
print(Raw['RAD'].value_counts())
print(Raw['CHAS'].value_counts())

In [None]:
# Summary statistics for every numeric column
Raw.describe()

## Plotting the Graphs

In [None]:
# Raw.hist(figsize=(20,30), bins=20)

##### Dropping CHAS
CHAS has a low correlation with the price (MEDV) and barely affects it, so it is better to drop it.

In [None]:
Raw.drop(columns='CHAS', inplace=True)
Raw.corr()

In [None]:
import seaborn as sns
plt.figure(figsize=(15,8))
sns.heatmap(Raw.corr(), cmap= 'PuOr', annot= True, fmt='.2f', alpha=.9)
# plt.show()

## Splitting the Data
We also check the split with respect to CHAS.

CHAS is selected because it is a categorical feature and should be distributed evenly between the two sets.

In [None]:
from sklearn.model_selection import train_test_split

Train1, Test1 = train_test_split(Raw, test_size=.2, random_state=np.random.randint(len(Raw)))
print(Train1.shape, Test1.shape)

Checking the split of the CHAS values by counting them.<br>

This will not work here because CHAS has already been dropped; it would work before the drop.

In [None]:
# print("CHAS column in Train", Train1['CHAS'].value_counts())    
# print("CHAS column in Test",Test1['CHAS'].value_counts())

#### Stratified Sampling
Stratified sampling splits the data so that each part keeps the same proportions of a chosen feature.<br>
Below is an example that stratifies on CHAS for an even distribution.

In [None]:
# Copied for reference and kept commented out; this kind of sampling is not used here

# from sklearn.model_selection import StratifiedShuffleSplit
# split = StratifiedShuffleSplit(n_splits=1, test_size=.2, random_state=423)
# for train_idx, test_idx in split.split(Raw, Raw['CHAS']): # CHAS -> any int type
#     st_train = Raw.iloc[train_idx]  # split() yields positional indices
#     st_test = Raw.iloc[test_idx]

# st_train.info()
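
A minimal runnable sketch of the same idea, assuming scikit-learn's `train_test_split` with its `stratify` argument rather than `StratifiedShuffleSplit`. The `demo` frame and its 7% share of `CHAS = 1` are made up for illustration and are not the housing data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data (assumption): a rare binary flag
# plus one numeric column.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "CHAS": np.array([1] * 35 + [0] * 465),  # hypothetical 7% / 93% mix
    "RM": rng.normal(6.0, 0.7, size=500),
})

# stratify=... keeps the proportions of CHAS the same in both parts.
st_train, st_test = train_test_split(
    demo, test_size=0.2, random_state=42, stratify=demo["CHAS"]
)

# Share of CHAS == 1 in the full data, the train part and the test part.
print(demo["CHAS"].mean(), st_train["CHAS"].mean(), st_test["CHAS"].mean())
```

With a plain random split the three shares can drift apart, which matters most when one category is rare.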

## Finding Correlation

In [None]:
Train1.corr()

Correlation with the price column (MEDV) only,
arranged in descending order

In [None]:
PrcCorr = Raw.corr().MEDV.sort_values(ascending= False)
PrcCorr

We can note a high correlation with **RM** and **LSTAT**, and a moderate correlation with **PTRATIO, INDUS, TAX, NOX**.

This means they have high importance in our dataset.
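
Sorting in descending order puts strongly negative correlations at the bottom of the list, so ranking by absolute value can make the strongest relationships easier to spot. The sketch below assumes a small synthetic frame (`demo`, with made-up coefficients) rather than the housing data.

```python
import numpy as np
import pandas as pd

# Synthetic data (assumption): MEDV rises with RM, falls with LSTAT,
# and is unrelated to NOISE.
rng = np.random.default_rng(1)
n = 200
rm = rng.normal(6.0, 0.7, size=n)
lstat = rng.normal(12.0, 5.0, size=n)
demo = pd.DataFrame({
    "RM": rm,
    "LSTAT": lstat,
    "NOISE": rng.normal(size=n),
    "MEDV": 5.0 * rm - 0.8 * lstat + rng.normal(0.0, 1.0, size=n),
})

# Pearson correlation of every feature with the target, target itself removed.
corr = demo.corr()["MEDV"].drop("MEDV")

# Order by strength (absolute value) while keeping the sign visible.
ranked = corr.reindex(corr.abs().sort_values(ascending=False).index)
print(ranked)
```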

#### Plotting the Relation between the attributes

In [None]:
from pandas.plotting import scatter_matrix
# Select the attributes whose pairwise relationships we want to plot
Attib = ['MEDV', 'RM', 'LSTAT', 'DIS', 'AGE']
# scatter_matrix(Raw[Attib], figsize=(15,15))

#### Dropping the Columns

In [None]:
Train_Price = Train1.MEDV
Train1 = Train1.drop(columns='MEDV')  # reassign rather than mutate the split in place

Test_Price = Test1.MEDV
Test1 = Test1.drop(columns='MEDV')

### Missing Attributes

The most common approach is to fill missing values with the mean or median.

#### Imputer
An imputer fills in the missing data points in the dataset using a strategy such as the mean or median.

In [None]:
# from sklearn.impute import SimpleImputer
# imputer = SimpleImputer(strategy='median')
# imputer.fit(Raw)

# imputer.statistics_ -> the per-column values learned by the imputer
# Here we use the median as the strategy to fill the missing values
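
A small runnable sketch of median imputation with `SimpleImputer`. The four-row `demo` frame is invented for illustration; only the column names echo the housing data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame (assumption) with one missing value per column.
demo = pd.DataFrame({
    "RM": [6.0, np.nan, 7.0, 5.0],
    "AGE": [60.0, 80.0, np.nan, 40.0],
})

imputer = SimpleImputer(strategy="median")
imputer.fit(demo)

# statistics_ holds the median learned for each column, ignoring NaN.
print(imputer.statistics_)

# transform() returns a NumPy array, so wrap it to keep the column names.
filled = pd.DataFrame(imputer.transform(demo), columns=demo.columns)
print(filled)
```

Fitting on the training set and then calling `transform` on the test set reuses the training medians, so no information from the test data leaks into preprocessing.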

## Pipeline

We create a pipeline so that the preprocessing steps can be reused.


In [None]:
from sklearn.pipeline import Pipeline  
from sklearn.preprocessing import StandardScaler

my_pipe = Pipeline([('std_scalar', StandardScaler())])


In [None]:
np_Train = my_pipe.fit_transform(Train1)
type(np_Train)
# The dataset has been converted to a NumPy array
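
A pipeline can chain several preprocessing steps. The sketch below assumes an imputation step in front of the scaler and uses random arrays (`X_train`, `X_test`) in place of the housing data. It is fitted on the training part only and then reused unchanged on the test part.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Random stand-in data (assumption) with one missing value in the train part.
rng = np.random.default_rng(2)
X_train = rng.normal(50.0, 10.0, size=(80, 3))
X_test = rng.normal(50.0, 10.0, size=(20, 3))
X_train[0, 1] = np.nan

prep = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

# fit_transform learns medians, means and standard deviations from the train part.
train_prepared = prep.fit_transform(X_train)
# transform reuses those learned values; nothing is re-fitted on the test part.
test_prepared = prep.transform(X_test)

print(type(train_prepared).__name__)
print(train_prepared.mean(axis=0).round(6), train_prepared.std(axis=0).round(6))
```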

# Making Model
Different models are created under separate headings. Each cell assigns to `model`, so whichever cell was run last is the model used afterwards.

### Linear Regression

In [63]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()

### Decision Tree Regressor

In [80]:
from sklearn.tree import DecisionTreeRegressor
model = DecisionTreeRegressor()

### Random Forest 

In [117]:
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()

## Training and Testing

In [118]:
model.fit(np_Train, Train_Price)

### Checking the data 
Testing the model on the training data only

In [119]:
data_Prep = my_pipe.transform(Train1)

### Predicting the Values

In [120]:
Predicted = model.predict(data_Prep)
# Predicted

In [121]:
# list(Train_Price)

## Evaluating the Model

#### Mean Squared Error

In [122]:
from sklearn.metrics import mean_squared_error

mse = mean_squared_error(Train_Price, Predicted)  # argument order is (y_true, y_pred)
rmse = np.sqrt(mse)

In [123]:
print("MSE : ",mse)
print("RMSE : ",rmse)

MSE :  1.3836961980198004
RMSE :  1.1763061667864367


**NOTE**: With the Decision Tree Regressor we get 0 error on the training data, which is due to the model overfitting.<br>
We need another method for estimating the error: **Cross Validation**
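
The effect can be reproduced on synthetic data. In this sketch (random data and a made-up target, not the housing set) an unconstrained decision tree memorises the training rows, so its training RMSE is zero while the error on held-out rows is not.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem (assumption): linear signal plus noise.
rng = np.random.default_rng(3)
X = rng.uniform(0.0, 10.0, size=(300, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0.0, 2.0, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# No depth limit: the tree can grow one leaf per training row.
tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)

train_rmse = np.sqrt(mean_squared_error(y_tr, tree.predict(X_tr)))
test_rmse = np.sqrt(mean_squared_error(y_te, tree.predict(X_te)))
print("train RMSE:", train_rmse)
print("held-out RMSE:", test_rmse)
```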

#### Cross Validation

In [124]:
from sklearn.model_selection import cross_val_score
score = -cross_val_score(model, data_Prep, Train_Price, scoring= 'neg_mean_squared_error' , cv=10)
# cross_val_score args -> model, data, labels, scoring, cv (number of folds)
# scikit-learn scorers follow "higher is better", so MSE is exposed as neg_mean_squared_error and negated above

rmse_scr = np.sqrt(score)
rmse_scr

array([3.2791478 , 2.60972771, 4.28055219, 2.57537495, 3.67417139,
       3.68503398, 2.75934258, 3.27281898, 2.25355583, 2.03893109])

Printing the mean and standard deviation of the scores

In [125]:
def printscr(score):
    # print("Scores :", score)
    print("Mean :", score.mean())
    print("STD :", score.std())

In [126]:
printscr(rmse_scr)

Mean : 3.0428656505770126
STD : 0.6759392590978517


# Saving the Model

After trying several models, we noted that Random Forest is the best performing of them all.

We will save the model as a `.joblib` file, which can then be used for deployment.

In [127]:
from joblib import dump, load
dump(model , 'HousePricing.joblib')

['HousePricing.joblib']
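
Because the model was trained on scaled data, whatever loads the file needs the same scaler as well. One option, sketched here under assumptions, is to persist the scaler and the model together in a single `Pipeline` and check that the reloaded object predicts identically. The data, the `full_model` name and the file name `demo_model.joblib` are all hypothetical.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic training data (assumption).
rng = np.random.default_rng(5)
X = rng.normal(size=(100, 4))
y = 2.0 * X[:, 0] + rng.normal(0.0, 0.1, size=100)

# Scaler and model travel together, so raw features can be passed at predict time.
full_model = Pipeline([
    ("std_scaler", StandardScaler()),
    ("forest", RandomForestRegressor(n_estimators=20, random_state=0)),
]).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_model.joblib")  # hypothetical file name
    dump(full_model, path)
    restored = load(path)

# The reloaded pipeline should reproduce the original predictions.
same = np.allclose(full_model.predict(X[:5]), restored.predict(X[:5]))
print(same)
```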

# Testing the Model

In [128]:
Test_Prep = my_pipe.transform(Test1)
Final_Predection = model.predict(Test_Prep)

In [129]:
final_mse = mean_squared_error(Test_Price, Final_Predection)
# The RMSE must come from the test-set MSE, not the training `mse`
final_rmse = np.sqrt(final_mse)

In [131]:
print("Final MSE", final_mse)
print("Final RMSE", final_rmse)

Final MSE 12.09530609803921
Final RMSE 3.47783...
