---
## Data Pre-Processing

- **Data Pre-Processing** transforms the input data before training, with the goal of training a better model.
- Pre-Processing doesn't always improve the model; whether a given technique helps depends on the dataset and on the model.
- Common techniques include Normalization and Standardization.

In [1]:
import pandas as pd

data = pd.read_csv("datasets/prostate_cancer.txt")

x = data.drop(["id", "lpsa", "train"], axis=1)
y = data["lpsa"]

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

---
---
## Normalization

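- Using __MinMaxScaler__ to rescale each input feature to the range [0, 1].
- Mathematical formula: **x_scaled = (x - x_min) / (x_max - x_min)**, where x_min and x_max are computed per feature on the training data.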
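
As a quick check of the formula, the sketch below applies it by hand and compares the result with `MinMaxScaler`. The array `x_demo` is hypothetical toy data, not the prostate dataset, and the variable names are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: 4 samples, 2 features (not the prostate dataset).
x_demo = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 40.0],
                   [5.0, 50.0]])

# Manual min-max scaling, computed per feature (column).
x_min = x_demo.min(axis=0)
x_max = x_demo.max(axis=0)
x_manual = (x_demo - x_min) / (x_max - x_min)

# The same transformation done by scikit-learn.
x_sklearn = MinMaxScaler().fit_transform(x_demo)

print(np.allclose(x_manual, x_sklearn))  # do both approaches agree?
print(x_manual.min(axis=0), x_manual.max(axis=0))  # per-feature range after scaling
```

In the notebook the scaler is fitted on `x_train` only, so values in `x_test_scaled` can fall slightly outside [0, 1] when the test set contains values beyond the training minimum or maximum.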

In [2]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(x_train)

x_train_scaled = scaler.transform(x_train)
x_test_scaled = scaler.transform(x_test)

In [3]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(x_train_scaled, y_train)

y_pred = model.predict(x_test_scaled)

mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"MSE: {mse}")
print(f"MAE: {mae}")
print(f"R^2: {r2}")

MSE: 0.3469917891761397
MAE: 0.4278470015625614
R^2: 0.7575175130204983


In [4]:
from sklearn.linear_model import Ridge

ridge_model = Ridge(alpha=18.3) # HyperParameter Tuning - 'alpha'. Ranges -> 1, 10, 100, ...
ridge_model.fit(x_train_scaled, y_train)

y_pred = ridge_model.predict(x_test_scaled)

mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"MSE: {mse}")
print(f"MAE: {mae}")
print(f"R^2: {r2}")

MSE: 0.7440678082142982
MAE: 0.6026275820702806
R^2: 0.48003549869128614


In [5]:
from sklearn.linear_model import Lasso

lasso_model = Lasso(alpha=0.02) # HyperParameter Tuning - 'alpha'. Ranges -> 0.1, 0.2, 0.3, ...  
lasso_model.fit(x_train_scaled, y_train)

y_pred = lasso_model.predict(x_test_scaled)

mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"MSE: {mse}")
print(f"MAE: {mae}")
print(f"R^2: {r2}")

MSE: 0.41681329122746413
MAE: 0.4791156410292423
R^2: 0.7087253168067278


---
## Results (R2 score):
- **LinearRegression:**   0.7575175130204983
- **Ridge(alpha=18.3):**  0.48003549869128614
- **Lasso(alpha=0.02):**  0.7087253168067278
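
The Ridge score above is far below the other two, which suggests that the chosen `alpha` is too strong for min-max-scaled features. Instead of picking `alpha` by hand, it can be selected by cross-validation on the training data. The following is a minimal sketch under stated assumptions: the data is synthetic (generated with `make_regression`, not the prostate dataset), the `alpha` grid is arbitrary, and the names `pipe`, `alpha_grid`, and `search` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Hypothetical synthetic data standing in for the notebook's dataset.
x_syn, y_syn = make_regression(n_samples=100, n_features=8, noise=10.0, random_state=42)
xs_train, xs_test, ys_train, ys_test = train_test_split(
    x_syn, y_syn, test_size=0.2, random_state=42
)

# Putting the scaler inside the pipeline refits it on each CV training fold,
# so the validation folds never influence the scaling parameters.
pipe = Pipeline([("scaler", MinMaxScaler()), ("ridge", Ridge())])

# Arbitrary illustrative grid; a real search might use a finer or wider range.
alpha_grid = {"ridge__alpha": [0.01, 0.1, 1, 10, 100]}

search = GridSearchCV(pipe, alpha_grid, cv=5, scoring="r2")
search.fit(xs_train, ys_train)

best_alpha = search.best_params_["ridge__alpha"]
test_r2 = search.score(xs_test, ys_test)  # R^2 of the refitted best pipeline

print(f"Best alpha: {best_alpha}")
print(f"Test R^2: {test_r2}")
```

The test set is used only once, for the final score, so the selected `alpha` is not tuned against it.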

---
---
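## Standardization

- Using __StandardScaler__ to standardize each input feature to zero mean and unit variance.
- Mathematical formula: **x_scaled = (x - x_mean) / x_std**, where x_mean and x_std are computed per feature on the training data.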
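
As a quick check of the formula, the sketch below applies it by hand and compares the result with `StandardScaler`. The array `z_demo` is hypothetical toy data, not the prostate dataset. Note that `StandardScaler` uses the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 4 samples, 2 features (not the prostate dataset).
z_demo = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 40.0],
                   [5.0, 50.0]])

# Manual standardization, computed per feature (column).
z_mean = z_demo.mean(axis=0)
z_std = z_demo.std(axis=0)  # ddof=0 (population std), matching StandardScaler
z_manual = (z_demo - z_mean) / z_std

# The same transformation done by scikit-learn.
z_sklearn = StandardScaler().fit_transform(z_demo)

print(np.allclose(z_manual, z_sklearn))  # do both approaches agree?
print(z_manual.mean(axis=0), z_manual.std(axis=0))  # per-feature mean and std after scaling
```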

In [6]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(x_train)

x_train_scaled = scaler.transform(x_train)
x_test_scaled = scaler.transform(x_test)

In [7]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(x_train_scaled, y_train)

y_pred = model.predict(x_test_scaled)

mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"MSE: {mse}")
print(f"MAE: {mae}")
print(f"R^2: {r2}")

MSE: 0.34699178917613993
MAE: 0.4278470015625612
R^2: 0.7575175130204982


In [8]:
from sklearn.linear_model import Ridge

ridge_model = Ridge(alpha=18) # HyperParameter Tuning - 'alpha'. Ranges -> 1, 10, 100, ...
ridge_model.fit(x_train_scaled, y_train)

y_pred = ridge_model.predict(x_test_scaled)

mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"MSE: {mse}")
print(f"MAE: {mae}")
print(f"R^2: {r2}")

MSE: 0.3685483742447039
MAE: 0.43985953784830245
R^2: 0.7424534840686281


In [9]:
from sklearn.linear_model import Lasso

lasso_model = Lasso(alpha=0.02) # HyperParameter Tuning - 'alpha'. Ranges -> 0.1, 0.2, 0.3, ...  
lasso_model.fit(x_train_scaled, y_train)

y_pred = lasso_model.predict(x_test_scaled)

mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"MSE: {mse}")
print(f"MAE: {mae}")
print(f"R^2: {r2}")

MSE: 0.35185138962764706
MAE: 0.4303792864209458
R^2: 0.7541215594562771


---
## Results (R2 score):
- **LinearRegression:**   0.7575175130204982
- **Ridge(alpha=18):**    0.7424534840686281
- **Lasso(alpha=0.02):**  0.7541215594562771
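
The LinearRegression score is the same (up to floating-point rounding) under both scalers, while Ridge and Lasso change. This is expected: ordinary least squares with an intercept gives the same predictions under any per-feature linear rescaling, whereas the Ridge and Lasso penalties depend on the size of the coefficients and therefore on the feature scale. The sketch below illustrates this on hypothetical synthetic data (not the prostate dataset); `predict_with` is a helper introduced only for this example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical synthetic data standing in for the notebook's dataset.
x_syn, y_syn = make_regression(n_samples=100, n_features=8, noise=10.0, random_state=0)
x_fit, x_new = x_syn[:80], x_syn[80:]
y_fit = y_syn[:80]


def predict_with(scaler, model):
    """Fit the scaler and model on the first 80 rows, predict the remaining 20."""
    x_fit_scaled = scaler.fit_transform(x_fit)
    x_new_scaled = scaler.transform(x_new)
    return model.fit(x_fit_scaled, y_fit).predict(x_new_scaled)


# Unpenalized linear regression: predictions should match across scalers.
ols_same = np.allclose(
    predict_with(MinMaxScaler(), LinearRegression()),
    predict_with(StandardScaler(), LinearRegression()),
)

# Ridge with the same alpha: predictions should differ across scalers.
ridge_same = np.allclose(
    predict_with(MinMaxScaler(), Ridge(alpha=18)),
    predict_with(StandardScaler(), Ridge(alpha=18)),
)

print(f"LinearRegression predictions identical: {ols_same}")
print(f"Ridge predictions identical: {ridge_same}")
```

A practical consequence is that a value of `alpha` tuned for one scaler does not carry over to another, so it needs to be re-tuned whenever the scaling changes.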