#### Goal of the task: this part of the California housing market project is about choosing the best features to predict the price!

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split # to separate the test set from the training set
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer # to fill missing values in the dataset with the mean, the median, etc.
from sklearn.preprocessing import OneHotEncoder, StandardScaler, LabelEncoder # to preprocess the dataset
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) # to avoid deprecation warnings

In [2]:
#Retrieve the California housing price data
from sklearn import datasets
data = datasets.fetch_california_housing(data_home=None, download_if_missing=True, return_X_y=False)
print(data.DESCR)

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block group
        - HouseAge      median house age in block group
        - AveRooms      average number of rooms per household
        - AveBedrms     average number of bedrooms per household
        - Population    block group population
        - AveOccup      average number of household members
        - Latitude      block group latitude
        - Longitude     block group longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived

In [18]:
#Creation of a Dataframe to manipulate data with pandas
dataset=pd.DataFrame(columns=data['feature_names'],data=data['data'])
dataset.loc[:,'Price']=data['target']
dataset.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,Price
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422


#### In part 1 of this project, the correlation matrix showed us that there are some outliers! Let's remove them!

In [4]:
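#Removing outliers
mask=(dataset['AveRooms'] < 10) & (dataset['AveBedrms'] < 10) & (dataset['Population'] < 15000) & (dataset['AveOccup'] < 10 ) & (dataset['Price'] < 5)
# Note: .loc returns a new DataFrame. Its result is not assigned here
# (e.g. dataset = dataset.loc[mask,:]), so `dataset` itself is left unchanged
# and the outputs that follow still cover all 20640 rows.
dataset.loc[mask,:]
dataset.head()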

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,Price
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422
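
Boolean indexing with `.loc` returns a new DataFrame instead of modifying the original, so a filter only takes effect once its result is stored. The following is a minimal sketch on a toy frame; the values and the names `toy` and `filtered` are made up for illustration and are not part of the project data.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({"AveRooms": [5.0, 141.9, 6.2], "Price": [2.1, 1.5, 5.00001]})
mask = (toy["AveRooms"] < 10) & (toy["Price"] < 5)

toy.loc[mask, :]                    # the result is discarded, toy is unchanged
rows_before = len(toy)

filtered = toy.loc[mask, :].copy()  # store the filtered rows in a new variable
rows_after = len(filtered)

print(rows_before, rows_after)
```

Storing the result under a new name keeps the raw data available for comparison; reassigning to the same name would work as well.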


#### To learn more about the dataset, we display some summary statistics

In [5]:
#Shape of the dataset
print("The shape of the dataset is :")
display(dataset.shape)
#The columns of the dataset
print("The columns of the dataset :")
display(dataset.columns)
#The type of the columns of the dataset 
print("The Type of columns of the dataset :")
display(dataset.dtypes)
#Some statistical information about the dataset
print(" Some statistical information about the dataset :")
display(dataset.describe(include="all"))
#The percentage of missing values in the columns of the dataset
print(" The pourcentage of missing value in the columns of the dataset:")
display(100*dataset.isnull().sum()/dataset.shape[0])

The shape of the dataset is :


(20640, 9)

The columns of the dataset :


Index(['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup',
       'Latitude', 'Longitude', 'Price'],
      dtype='object')

The Type of columns of the dataset :


MedInc        float64
HouseAge      float64
AveRooms      float64
AveBedrms     float64
Population    float64
AveOccup      float64
Latitude      float64
Longitude     float64
Price         float64
dtype: object

 Some statistical information about the dataset :


Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,Price
count,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0
mean,3.870671,28.639486,5.429,1.096675,1425.476744,3.070655,35.631861,-119.569704,2.068558
std,1.899822,12.585558,2.474173,0.473911,1132.462122,10.38605,2.135952,2.003532,1.153956
min,0.4999,1.0,0.846154,0.333333,3.0,0.692308,32.54,-124.35,0.14999
25%,2.5634,18.0,4.440716,1.006079,787.0,2.429741,33.93,-121.8,1.196
50%,3.5348,29.0,5.229129,1.04878,1166.0,2.818116,34.26,-118.49,1.797
75%,4.74325,37.0,6.052381,1.099526,1725.0,3.282261,37.71,-118.01,2.64725
max,15.0001,52.0,141.909091,34.066667,35682.0,1243.333333,41.95,-114.31,5.00001


 The pourcentage of missing value in the columns of the dataset:


MedInc        0.0
HouseAge      0.0
AveRooms      0.0
AveBedrms     0.0
Population    0.0
AveOccup      0.0
Latitude      0.0
Longitude     0.0
Price         0.0
dtype: float64

#### Separate the target variable from the features

In [6]:
target_variable='Price'
X = dataset.drop(target_variable, axis = 1)
Y=dataset.loc[:,target_variable]
print('The features are :')
print(X.head())
print('The target variable is:')
print(Y.head())


The features are :
   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88   
1  8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86   
2  7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85   
3  5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85   
4  3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85   

   Longitude  
0    -122.23  
1    -122.22  
2    -122.24  
3    -122.25  
4    -122.25  
The target variable is:
0    4.526
1    3.585
2    3.521
3    3.413
4    3.422
Name: Price, dtype: float64


#### We add nonlinear transformations of each feature (powers and inverses) so that the linear regression can capture nonlinear relationships

In [7]:
for c in X.columns :
    X.loc[:,c+'_2']=X[c]**2
    X.loc[:,c+'_3']=X[c]**3
    X.loc[:,c+'_4']=X[c]**4
    X.loc[:,c+'_inverse1']=1/X[c]
    X.loc[:,c+'_inverse2']=1/(X[c]**2)
    
X.head()


Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedInc_2,MedInc_3,...,Latitude_2,Latitude_3,Latitude_4,Latitude_inverse1,Latitude_inverse2,Longitude_2,Longitude_3,Longitude_4,Longitude_inverse1,Longitude_inverse2
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,69.308955,577.010912,...,1434.8944,54353.799872,2058922.0,0.026399,0.000697,14940.1729,-1826137.0,223208800.0,-0.008181,6.7e-05
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22,68.913242,572.076387,...,1433.3796,54267.751656,2054577.0,0.026413,0.000698,14937.7284,-1825689.0,223135700.0,-0.008182,6.7e-05
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24,52.669855,382.246204,...,1432.6225,54224.761625,2052407.0,0.02642,0.000698,14942.6176,-1826586.0,223281800.0,-0.008181,6.7e-05
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,31.844578,179.702136,...,1432.6225,54224.761625,2052407.0,0.02642,0.000698,14945.0625,-1827034.0,223354900.0,-0.00818,6.7e-05
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,14.793254,56.897815,...,1432.6225,54224.761625,2052407.0,0.02642,0.000698,14945.0625,-1827034.0,223354900.0,-0.00818,6.7e-05
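
The same expansion can be wrapped in a small helper that leaves its input untouched. This is only a sketch: `add_nonlinear_terms` and the toy frame are hypothetical names introduced here, and the column suffixes simply mirror the ones used in the loop above.

```python
import pandas as pd

def add_nonlinear_terms(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df extended with powers 2 to 4 and two inverse terms per column."""
    new_cols = {}
    for c in df.columns:
        new_cols[c + "_2"] = df[c] ** 2
        new_cols[c + "_3"] = df[c] ** 3
        new_cols[c + "_4"] = df[c] ** 4
        new_cols[c + "_inverse1"] = 1 / df[c]
        new_cols[c + "_inverse2"] = 1 / (df[c] ** 2)
    # Build all derived columns first, then attach them in a single concat
    return pd.concat([df, pd.DataFrame(new_cols, index=df.index)], axis=1)

# Toy input, invented for illustration only
toy = pd.DataFrame({"a": [1.0, 2.0, 4.0]})
expanded = add_nonlinear_terms(toy)
print(expanded.columns.tolist())
```

Each original column yields five derived ones, so 8 features become 48. Note that the inverse terms grow very large for values close to zero, and the fourth powers grow very large for extreme values, which makes these columns sensitive to outliers.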


In [8]:
#Dividing into train_set and test_set
print("Dividing into train and test sets")
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)


Dividing into train and test sets


#### Since we have only numerical columns, no missing values and no useless columns to drop, the only preprocessing we need is standardization!


In [9]:
scaler=StandardScaler()
X_train=scaler.fit_transform(X_train)
X_test=scaler.transform(X_test)
print('X_train transformed:')
print(X_train[0:5,:])
print('X_test transformed:')
print(X_test[0:5,:])


X_train transformed:
[[ 0.19001247 -1.79507596  0.90771428  0.1492426  -1.04760128  0.07408222
  -0.8400624   1.00389865 -0.03156475 -0.13764538 -0.15242575 -0.50220226
  -0.38115136 -1.25405214 -0.94149982 -0.75642616  2.4276708   1.07905239
   0.1074445  -0.00578186 -0.01174711 -1.24379542 -0.8004914  -0.00676086
  -0.01319392 -0.01092391 -0.77174054 -0.81962195 -0.23236969 -0.0401889
  -0.01545074  0.61986301 -0.00599379 -0.01113825 -0.01104955 -0.01057464
  -0.85051526 -0.78193791 -0.83670418 -0.83231992 -0.82689265  0.8437691
   0.8441503  -1.0012677   0.99857629 -0.99582641 -1.00897104  1.01140828]
 [ 0.26931072  1.85553889 -0.04200187  0.00918616  0.05210918  0.04527606
   0.98536392 -1.43477229  0.02763894 -0.10511859 -0.13793703 -0.54900732
  -0.39935857  2.2984189   2.63337202  2.88420585 -0.62070276 -0.20608152
  -0.0339419  -0.01647491 -0.0124402  -0.20769891 -0.23506222 -0.02095571
  -0.0139799  -0.01096032 -0.2982761  -0.36319584 -0.08012175 -0.03212363
  -0.01509512 -0.1

#### Train the model

In [10]:
regressor=LinearRegression()
regressor.fit(X_train,Y_train)
print('The model is done training !!!')

The model is done training !!!


In [11]:
print("R2 score on training set : ", regressor.score(X_train, Y_train))
print("R2 score on test set : ", regressor.score(X_test, Y_test))

R2 score on training set :  0.7178730937672136
R2 score on test set :  -234.05196932486686
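
A negative R2 may look surprising, but it follows from the definition $R^2 = 1 - SS_{res}/SS_{tot}$: as soon as the predictions are worse than always predicting the mean of the target, $SS_{res}$ exceeds $SS_{tot}$ and the score drops below zero. Here is a small sketch with made-up numbers (`y_true`, `y_bad` and `r2_manual` are illustrative names, not part of the project):

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up target values and deliberately poor predictions
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_bad = np.array([10.0, -5.0, 12.0, -3.0])

def r2_manual(y, y_hat):
    """R2 computed directly from its definition."""
    ss_res = np.sum((y - y_hat) ** 2)       # squared error of the model
    ss_tot = np.sum((y - y.mean()) ** 2)    # squared error of the mean predictor
    return 1.0 - ss_res / ss_tot

r2_bad = r2_manual(y_true, y_bad)
r2_mean = r2_manual(y_true, np.full_like(y_true, y_true.mean()))
print(r2_bad, r2_mean, r2_score(y_true, y_bad))
```

A few test rows with enormous prediction errors are enough to push the score far below zero.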


#### The score is good on the training set but very poor on the test set. This is a sign of overfitting: with 48 columns (the 8 original features plus 40 derived ones), the model has many coefficients to fit and too much complexity for the information available in the rows, so it does not generalize. We therefore do a bit of feature selection to keep only the columns that we really need!

#### The Sequential Feature Selector adds (forward selection) or removes (backward selection) features to form a feature subset in a greedy fashion. At each stage, this estimator chooses the best feature to add or remove based on the cross-validation score of an estimator. In the case of unsupervised learning, this Sequential Feature Selector looks only at the features (X), not the desired outputs (y).
#### Resource: https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SequentialFeatureSelector.html
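
To see the greedy forward selection at work on a case where the answer is known, here is a minimal sketch on synthetic data. Everything in it (`X_toy`, `y_toy`, the coefficients, `cv=5`) is an assumption chosen for illustration, not a setting from this project.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 6))
# Only columns 0 and 3 drive the target; the other four columns are pure noise
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 3] + rng.normal(scale=0.1, size=200)

selector = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=2,
    direction="forward",   # start from no features and add one at a time
    cv=5,                  # each candidate is scored by 5-fold cross-validation
)
selector.fit(X_toy, y_toy)

selected = np.flatnonzero(selector.get_support())
print(selected)
```

`get_support()` returns the same boolean mask as the `support_` attribute, so it can be used to index a list of column names.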

In [15]:
from sklearn.feature_selection import  SequentialFeatureSelector
feature_selector =  SequentialFeatureSelector(regressor, n_features_to_select = 20)
feature_selector.fit(X_train, Y_train)#Learns the features to select from X.
features_list = X.columns
best_features = features_list[feature_selector.support_]
print("According to the forward selection algorithm, the following features should be kept: ")
print(best_features.to_list())

According to the forward selection algorithm, the following features should be kept: 
['MedInc', 'HouseAge', 'Population', 'Latitude', 'MedInc_2', 'MedInc_3', 'MedInc_4', 'AveRooms_inverse1', 'AveRooms_inverse2', 'AveBedrms_inverse1', 'AveOccup_inverse1', 'AveOccup_inverse2', 'Latitude_2', 'Latitude_3', 'Latitude_4', 'Latitude_inverse1', 'Latitude_inverse2', 'Longitude_4', 'Longitude_inverse1', 'Longitude_inverse2']


In [16]:
X_best=X.loc[:,best_features]
print("Dividing into train and test sets")
X_train, X_test, Y_train, Y_test = train_test_split(X_best, Y, test_size=0.2, random_state=0)
print("Preprocessing X_train")
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
print(X_train[0:5,:])
print("Preprocessing X_test")
X_test = scaler.transform(X_test)
print(X_test[0:5,:])
print("Train model")
regressor = LinearRegression()
regressor.fit(X_train, Y_train)
print("R2 score on training set : ", regressor.score(X_train, Y_train))
print("R2 score on test set : ", regressor.score(X_test, Y_test))

Dividing into train and test sets
Preprocessing X_train
[[ 0.19001247 -1.79507596 -1.04760128 -0.8400624  -0.03156475 -0.13764538
  -0.15242575 -1.24379542 -0.8004914  -0.77174054 -0.85051526 -0.78193791
  -0.83670418 -0.83231992 -0.82689265  0.8437691   0.8441503  -0.99582641
  -1.00897104  1.01140828]
 [ 0.26931072  1.85553889  0.05210918  0.98536392  0.02763894 -0.10511859
  -0.13793703 -0.20769891 -0.23506222 -0.2982761  -0.67940396 -0.66041613
   0.97036347  0.9536373   0.93516241 -1.01032104 -1.02035961  1.45014232
   1.42420993 -1.41883605]
 [ 0.02989505 -0.20785212 -0.35295521 -0.8400624  -0.14478302 -0.19657849
  -0.17730605  0.26302542  0.09043429  0.12216714 -0.04462252 -0.15191835
  -0.83670418 -0.83231992 -0.82689265  0.8437691   0.8441503  -0.78308808
  -0.7766335   0.77507429]
 [-1.26447048  0.74448219 -0.59179448 -0.75581196 -0.74941761 -0.41887833
  -0.24731697  1.709896    1.359286    0.42914343 -0.95856513 -0.85527295
  -0.75763671 -0.75831148 -0.75780683  0.74883572

### With this new R2 score on the test set, we see that there is no more overfitting! Hurray, we did it! How?
### Overfitting occurs when the model has too many columns relative to the information available in the rows, so it fits the noise of the training set instead of patterns that generalize. The feature selector reduces the number of columns, and therefore the complexity of the model, which reduces the overfitting!
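
One possible way to keep scaling, selection and regression together is a scikit-learn `Pipeline`, so that every step is fitted on the training set only and then applied unchanged to the test set. The sketch below runs on synthetic data; the step names, `X_toy`, `y_toy` and the number of selected features are assumptions for illustration, not the project's settings.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 8))
# Two informative columns (1 and 5) and six noise columns
y_toy = 2.0 * X_toy[:, 1] + X_toy[:, 5] + rng.normal(scale=0.1, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

model = Pipeline([
    ("scale", StandardScaler()),
    ("select", SequentialFeatureSelector(LinearRegression(), n_features_to_select=2)),
    ("regress", LinearRegression()),
])
model.fit(X_tr, y_tr)

train_r2 = model.score(X_tr, y_tr)
test_r2 = model.score(X_te, y_te)
print("R2 train:", train_r2, "R2 test:", test_r2)
```

When the two scores are close, as they should be on this toy problem, the model generalizes about as well as it fits, which is the behavior the feature selection is meant to restore.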