<a href="https://colab.research.google.com/github/mdkamrulhasan/data_mining_kdd/blob/main/notebooks/Regression_California_housing_normalization_LR.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

What will we cover today?


*   **Data/feature scaling**
*   **Linear Regression**


We are going to learn how to train and test a regression model. More specifically, we will use the LinearRegression model available through the sklearn module.

In [24]:
import numpy as np
import pandas as pd
# Regression modeling package (sklearn)
from sklearn.linear_model import LinearRegression
from sklearn import datasets
# from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
# visualization
import plotly.express as px

# [Dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html)

In [2]:
# Load the dataset (returns a Bunch object; the feature DataFrame is in .data)
from sklearn.datasets import fetch_california_housing
df = fetch_california_housing(as_frame=True)
df.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'feature_names', 'DESCR'])

In [3]:
df.data.head(2)

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 |


# Identifying the features and the labels

In [4]:
# Load the dataset in the form of features and labels
X, y = fetch_california_housing(return_X_y=True)
X.shape, y.shape

((20640, 8), (20640,))

Inspect the first few label values (y):

In [5]:
y[:10]

array([4.526, 3.585, 3.521, 3.413, 3.422, 2.697, 2.992, 2.414, 2.267,
       2.611])

In [6]:
# plotting one feature against the label
feature_column_index = 2 # you can choose any valid column index
fig = px.scatter(x=X[:, feature_column_index], y=y)
fig.show()

# Splitting data into train, test splits

In [13]:
TEST_PROP = 0.5 # proportion of the data held out for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=TEST_PROP, random_state=0)
X_train.shape, y_train.shape, X_test.shape, y_test.shape

((10320, 8), (10320,), (10320, 8), (10320,))
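Conceptually, a shuffled split permutes the row indices and then slices them into two groups. The sketch below mimics that idea with NumPy on a small made-up array. The helper name `simple_split` is hypothetical, and it will not reproduce the exact rows that `train_test_split` picks for a given `random_state`.

```python
import numpy as np

def simple_split(X, y, test_prop, seed=0):
    """Hypothetical helper: shuffle row indices, then slice off the test share."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))               # shuffled row indices
    n_test = int(np.ceil(test_prop * len(X)))   # number of rows held out for testing
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Toy data: 10 rows, 2 features (made up for illustration)
X_toy = np.arange(20, dtype=float).reshape(10, 2)
y_toy = np.arange(10, dtype=float)

X_tr, X_te, y_tr, y_te = simple_split(X_toy, y_toy, test_prop=0.5)
print(X_tr.shape, X_te.shape)
```

Features and labels are indexed with the same shuffled indices, so every row stays paired with its own label.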

# Data scaling: [Min-Max scaling](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html)

In [15]:
# data is not scaled (different columns have different scale/unit values)
X_train[:, :3]

array([[ 1.245     , 42.        ,  3.62425447],
       [ 2.4432    , 24.        ,  3.3991684 ],
       [ 2.4659    , 17.        ,  9.74772727],
       ...,
       [ 3.1977    , 31.        ,  3.64122137],
       [ 5.6315    , 34.        ,  4.54059829],
       [ 1.3882    , 15.        ,  3.9295302 ]])

In [7]:
# Data scaling package loading
from sklearn.preprocessing import MinMaxScaler

# scaling model initialization
scaler = MinMaxScaler()

In [11]:
# scaling model fitting

In [14]:
# must fit on training features only
scaler.fit(X_train)

In [17]:
# scale both training and test data with the scaler fitted on the training set
X_train = scaler.transform(X_train)
# MUST NOT refit the scaler on the test data (only transform is allowed)
X_test = scaler.transform(X_test)

In [18]:
# checking whether data is scaled or not
X_train[:, :3]

array([[0.0513855 , 0.80392157, 0.02077844],
       [0.13401884, 0.45098039, 0.01906863],
       [0.13558434, 0.31372549, 0.06729367],
       ...,
       [0.18605261, 0.58823529, 0.02090732],
       [0.35389857, 0.64705882, 0.02773918],
       [0.06126122, 0.2745098 , 0.02309738]])
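To make the fit/transform distinction concrete, here is a minimal NumPy sketch of min-max scaling with the default (0, 1) range. The toy arrays are made up, and the sketch assumes no column is constant (which would divide by zero); it is not scikit-learn's implementation.

```python
import numpy as np

# Toy train/test features (made-up numbers, two columns on different scales)
X_train_toy = np.array([[1.0, 100.0],
                        [3.0, 300.0],
                        [5.0, 200.0]])
X_test_toy = np.array([[2.0, 400.0]])

# "fit": learn the per-column min and max from the training data only
col_min = X_train_toy.min(axis=0)
col_max = X_train_toy.max(axis=0)

# "transform": apply the same training statistics to both sets
X_train_scaled = (X_train_toy - col_min) / (col_max - col_min)
X_test_scaled = (X_test_toy - col_min) / (col_max - col_min)

print(X_train_scaled)
print(X_test_scaled)  # second value is 1.5: test data can fall outside [0, 1]
```

Because the test set reuses the training minimum and maximum, a test value larger than anything seen in training maps above 1. That is expected and is not a reason to refit the scaler.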

# Model Instantiation

In [19]:
# Create linear regression object
regr = LinearRegression()

# Model Training

In [20]:
# Train the model using the training set
regr.fit(X_train, y_train)
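`LinearRegression` fits an ordinary least squares model: it looks for the weights and intercept that minimize the sum of squared residuals. The following is a rough NumPy sketch of the same idea on synthetic, noise-free data. The variable names are illustrative, and this is not how scikit-learn is implemented internally.

```python
import numpy as np

# Synthetic data from a known relationship: y = 1 + 2*x1 - 3*x2 (no noise)
rng = np.random.default_rng(0)
X_toy = rng.uniform(size=(50, 2))
y_toy = 1.0 + 2.0 * X_toy[:, 0] - 3.0 * X_toy[:, 1]

# Append a column of ones so the intercept is learned as an extra weight
A = np.column_stack([X_toy, np.ones(len(X_toy))])

# Solve min_w ||A w - y||^2
w, *_ = np.linalg.lstsq(A, y_toy, rcond=None)
coef, intercept = w[:2], w[2]
print(coef, intercept)  # close to [2, -3] and 1
```

After fitting the model in the notebook, the analogous quantities are available as `regr.coef_` and `regr.intercept_`.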

# Making predictions and evaluating (on the training data)

- A quick check of how well the model fits the data it was trained on.

In [21]:
# Training error
y_pred = regr.predict(X_train)
# The mean squared error
mse = mean_squared_error(y_train, y_pred)
print("Mean squared error: %.2f" % mse)

Mean squared error: 0.52
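The mean squared error is the average of the squared differences between true and predicted values. A small sketch with made-up numbers shows the computation by hand, along with its square root (RMSE), which is expressed in the same units as the target.

```python
import numpy as np

y_true_toy = np.array([3.0, 2.0, 4.0])   # made-up targets
y_pred_toy = np.array([2.5, 2.0, 5.0])   # made-up predictions

mse_toy = np.mean((y_true_toy - y_pred_toy) ** 2)  # average squared residual
rmse_toy = np.sqrt(mse_toy)                        # back in the target's units
print("MSE: %.4f, RMSE: %.4f" % (mse_toy, rmse_toy))
# MSE: 0.4167, RMSE: 0.6455
```

The California housing target is the median house value in units of $100,000, so an MSE of 0.52 corresponds to an RMSE of about 0.72, or roughly $72,000.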


# Making predictions and evaluating (on the test data)

- This is the more interesting metric, because it is measured on data the model has not seen during training.

In [22]:
# Test error
y_pred = regr.predict(X_test)
# The mean squared error
mse = mean_squared_error(y_test, y_pred)
print("Mean squared error: %.2f" % mse)

Mean squared error: 0.53
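An MSE value is easier to judge next to a simple reference point. One common choice, sketched here with made-up numbers, is a baseline that always predicts the mean of the training labels. The arrays below are illustrative and are not results from this notebook.

```python
import numpy as np

# Made-up values for illustration
y_train_toy = np.array([1.0, 2.0, 3.0, 4.0])
y_test_toy = np.array([2.0, 3.0, 5.0])
y_pred_toy = np.array([2.5, 3.0, 4.0])   # pretend model predictions on the test set

# Baseline: always predict the mean of the training labels
baseline_pred = np.full_like(y_test_toy, y_train_toy.mean())

mse_model = np.mean((y_test_toy - y_pred_toy) ** 2)
mse_baseline = np.mean((y_test_toy - baseline_pred) ** 2)
print("model MSE: %.3f, baseline MSE: %.3f" % (mse_model, mse_baseline))
# model MSE: 0.417, baseline MSE: 2.250
```

The same comparison could be made in the notebook by evaluating `np.full_like(y_test, y_train.mean())` against `y_test` with `mean_squared_error`.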
