# **EXERCISES**

JUDIT LLORENS DÍAZ

## Exercise 01 --> Linear Regression model

## **IMPORTANT TIPS**
 - **Split data into train and test in order to evaluate the model with unseen data**
 - **Normalize data to avoid scaling issues when fitting the model (especially for the LR model)**
 - **Train your model and apply it to the test data.**
 - **Evaluate the model with the `score` function, for both train and test datasets**
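
The four tips above can be sketched end to end as follows. This is only a minimal illustration on synthetic stand-in data (an assumption, so that it runs without downloading anything); the names `X_demo`, `y_demo`, `scaler` and `model` are hypothetical and are not used elsewhere in this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 500 samples, 3 features on very different scales
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3)) * np.array([1.0, 50.0, 1000.0])
y_demo = X_demo @ np.array([2.0, 0.05, 0.001]) + rng.normal(scale=0.5, size=500)

# Tip 1: split, so the model can be evaluated on unseen data
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3, random_state=123)

# Tip 2: normalize; the scaler is fitted on the training data only
scaler = StandardScaler()
Xtr_sc = scaler.fit_transform(Xtr)
Xte_sc = scaler.transform(Xte)

# Tip 3: train the model on the scaled training data
model = LinearRegression().fit(Xtr_sc, ytr)

# Tip 4: evaluate with score() on both datasets
demo_r2_train = model.score(Xtr_sc, ytr)
demo_r2_test = model.score(Xte_sc, yte)
print(f"R2 Train: {demo_r2_train:.3f}  R2 Test: {demo_r2_test:.3f}")
```

The scaler only ever sees the training rows when it is fitted, so no information from the test set leaks into the model.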

In [None]:
import pandas as pd

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

In [None]:
data = fetch_california_housing() 

In [None]:
print(data.keys())

In [None]:
X = pd.DataFrame(data["data"], columns=data["feature_names"])
y = pd.Series(data["target"], name=data["target_names"][0])

In [None]:
type(data)

In [None]:
print(data["DESCR"])

In [None]:
X.head()

In [None]:
y.head()

## Build model

### Remove location variables we don't need!

In [None]:
X = X.drop(columns=["Latitude","Longitude"])
# we don't need them to predict the average house value

### Train/Test split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y,              # arrays or matrices to split
    test_size=0.3,     # proportion of the data held out for testing (30% test, 70% training)
    random_state=123,  # any integer; makes the split reproducible
    shuffle=True,      # shuffle the data before splitting
)

In [None]:
X.shape

In [None]:
y.shape

In [None]:
X_train.shape

In [None]:
y_train.shape

In [None]:
X_test.shape
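
To see where the shapes above come from, here is a small sketch on a toy array with the same number of rows as the California housing data (20640). The arrays `X_toy` and `y_toy` are hypothetical placeholders, not the real features. It also checks that repeating the call with the same `random_state` returns the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 20640  # number of rows in the California housing data
X_toy = np.arange(n_samples * 2).reshape(n_samples, 2)  # placeholder features
y_toy = np.arange(n_samples)                            # placeholder target

# Same arguments as in the notebook, called twice
split_a = train_test_split(X_toy, y_toy, test_size=0.3, random_state=123, shuffle=True)
split_b = train_test_split(X_toy, y_toy, test_size=0.3, random_state=123, shuffle=True)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)  # 70% of the rows for training, 30% for testing

# With a fixed random_state the two test sets contain exactly the same rows
same_split = np.array_equal(split_a[1], split_b[1])
print(same_split)
```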

### Features Standardization

In [None]:
sc = StandardScaler()

In [None]:
X_train_sc = sc.fit_transform(X_train)  # fit and transform in one step, on the TRAINING data only
X_train_sc = pd.DataFrame(X_train_sc, columns=X_train.columns)

In [None]:
X_train.head()

In [None]:
X_train_sc.head()
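
What `StandardScaler` does can be reproduced by hand: each column is centered with the training mean and divided by the training standard deviation. The sketch below uses random stand-in data (an assumption) and hypothetical names such as `train_demo` and `scaler_demo`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Random stand-in data (assumption): two columns with different means and spreads
rng = np.random.default_rng(1)
train_demo = rng.normal(loc=[3.0, 30.0], scale=[2.0, 12.0], size=(200, 2))
test_demo = rng.normal(loc=[3.0, 30.0], scale=[2.0, 12.0], size=(50, 2))

scaler_demo = StandardScaler().fit(train_demo)

# z = (x - mean_train) / std_train, using the population standard deviation (ddof=0)
mean_train = train_demo.mean(axis=0)
std_train = train_demo.std(axis=0)
manual_train = (train_demo - mean_train) / std_train
manual_test = (test_demo - mean_train) / std_train  # test data reuses the TRAINING statistics

print(np.allclose(scaler_demo.transform(train_demo), manual_train))
print(np.allclose(scaler_demo.transform(test_demo), manual_test))
```

After scaling, the training columns have mean 0 and standard deviation 1. The test columns are only approximately so, because they are scaled with statistics they did not contribute to.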

### Train ML model

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
reg = LinearRegression()  # create the linear regression model

In [None]:
reg.fit(X_train_sc, y_train)  # fit the model on the scaled training features and the training target

### Evaluate model

**Question 1:** What are the $R^2$ metrics for train and test?

The $R^2$ metric can be calculated directly with the `score` function
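
For reference, the value returned by `score` is the coefficient of determination, $R^2 = 1 - SS_{res}/SS_{tot}$. A minimal sketch of that formula on made-up numbers (the helper `r2_by_hand` and the toy values are hypothetical):

```python
import numpy as np

def r2_by_hand(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # squared error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared error of always predicting the mean
    return 1.0 - ss_res / ss_tot

# Made-up values, for illustration only
y_true_demo = [1.0, 2.0, 3.0, 4.0]
y_pred_demo = [1.5, 2.0, 2.5, 4.0]
r2_demo = r2_by_hand(y_true_demo, y_pred_demo)
print(r2_demo)
```

A model that always predicts the mean of the target gets $R^2 = 0$, and a perfect model gets $R^2 = 1$.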

In [None]:
# R2 for training
r2_train = reg.score(X_train_sc, y_train)  # score() is a method of the fitted regression model; it returns R2

In [None]:
r2_train

In [None]:
X_test.head()

In [None]:
# R2 for test: first scale X_test with the scaler fitted on the training data
X_test_sc = sc.transform(X_test)
X_test_sc = pd.DataFrame(X_test_sc, columns=X_test.columns)

In [None]:
X_test_sc.head()

In [None]:
r2_test = reg.score(X_test_sc, y_test)

In [None]:
print(f"R2 Train: {r2_train}")
print(f"R2 Test: {r2_test}")

- R2 Train: 0.5383. This is the $R^2$ score of your model on the training dataset: the model explains approximately 53.83% of the variance of the target in the training data.

- R2 Test: 0.5426. This is the $R^2$ score of your model on the testing dataset: the model explains around 54.26% of the variance in the testing data. It is a measure of how well your model generalizes to unseen data.

In both cases, a higher $R^2$ score indicates a better fit. Here the two scores are nearly equal, so there is no sign of overfitting, but at about 0.54 the model leaves almost half of the variance unexplained.

R-squared (R²) measures the proportion of the variance in the dependent variable (the target) that is explained by the independent variables (the features) in your model. Calculating it for both the training and test datasets lets you evaluate the performance of your model.

Here's how you can interpret and compare the R-squared values for your training and test datasets:

1. **Training R-squared (R²):** This value measures how well your model fits the training data. A high R-squared on the training data suggests that your model explains a significant portion of the variance in the training data. However, a very high training R-squared combined with a much lower test R-squared is a sign of overfitting, where the model has memorized the training data but does not generalize well to new, unseen data.

2. **Test R-squared (R²):** This value measures how well your model generalizes to new, unseen data. A high R-squared on the test data indicates that your model can explain a significant portion of the variance in new data that it hasn't seen during training.

To determine if your linear regression model is good or if you should consider trying another model, consider the following scenarios:

- If both the training and test R-squared values are relatively high (close to 1), it suggests that your model is performing well on both the training and test data. This is a good sign, but you should also consider other evaluation metrics and examine the residuals (the differences between predicted and actual values) to ensure there is no systematic bias in your model.

- If the training R-squared is high, but the test R-squared is significantly lower, it may indicate overfitting. In this case, your model is fitting the training data too closely and might not generalize well to new data. You might need to simplify your model, regularize it, or increase the amount of training data to reduce overfitting.

- If both the training and test R-squared values are low, it suggests that your model may not be capturing the underlying relationships in the data. In this case, you should explore other models, feature engineering, or collect more relevant data.

- It's also a good practice to consider other evaluation metrics, such as mean squared error (MSE), root mean squared error (RMSE), and mean absolute error (MAE), in addition to R-squared, to get a more comprehensive understanding of your model's performance.

Remember that the interpretation of R-squared should be done in the context of your specific problem and dataset. It's important to strike a balance between model complexity and generalization when assessing your model's quality.
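
The error metrics mentioned above (MSE, RMSE and MAE) can be sketched directly from their definitions. The helper `regression_errors` and the toy values are hypothetical and for illustration only; `sklearn.metrics` also provides ready-made functions for these.

```python
import numpy as np

def regression_errors(y_true, y_pred):
    """Return MSE, RMSE and MAE for two equally long sequences."""
    err = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    mse = float(np.mean(err ** 2))          # mean squared error
    rmse = mse ** 0.5                       # root mean squared error, same units as the target
    mae = float(np.mean(np.abs(err)))       # mean absolute error
    return {"MSE": mse, "RMSE": rmse, "MAE": mae}

# Made-up values, for illustration only
errors_demo = regression_errors([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
print(errors_demo)
```

Unlike $R^2$, RMSE and MAE are expressed in the units of the target, which for `MedHouseVal` means hundreds of thousands of dollars.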

### Make estimations with the model

EXAMPLE: Imagine that my wife and I want to sell our house at 528-426 W Scott Ave, Clovis, CA 93612, but we have no idea what price to ask. Our house is 30 years old, with 6 rooms and 3 bedrooms. Our geographic block group has 300 people, and our income is 60K. What price (the target) does the model estimate?

In [None]:
new_sample = [{
    "MedInc": 6,
    "HouseAge": 30,
    "AveRooms": 6,
    "AveBedrms": 3,
    "Population": 300,
    "AveOccup": 2
}]

new_sample = pd.DataFrame.from_dict(new_sample, orient="columns")

In [None]:
new_sample

#### Scale features

In [None]:
new_sample_sc = sc.transform(new_sample)
new_sample_sc = pd.DataFrame(new_sample_sc, columns=new_sample.columns)

In [None]:
new_sample_sc

#### Predict

In [None]:
my_house_est_price = reg.predict(new_sample_sc)

# reg is the linear model created and fitted above; here it predicts the price for a new, unseen sample

In [None]:
my_house_est_price
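
Under the hood, a linear regression prediction is just the intercept plus the dot product of the coefficients and the (scaled) features. A small sketch on synthetic data (an assumption; `lin`, `X_lin` and `x_new` are hypothetical names) that checks this against `predict`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (assumption): a linear target with a little noise
rng = np.random.default_rng(2)
X_lin = rng.normal(size=(100, 3))
y_lin = 1.5 + X_lin @ np.array([0.8, -0.3, 0.1]) + rng.normal(scale=0.1, size=100)

lin = LinearRegression().fit(X_lin, y_lin)

x_new = np.array([[0.5, -1.0, 2.0]])  # one new sample with 3 features

by_predict = lin.predict(x_new)
by_formula = lin.intercept_ + x_new @ lin.coef_  # y = b0 + b1*x1 + b2*x2 + b3*x3

print(by_predict, by_formula)
```

This is why the new sample has to go through the same scaler as the training data: the coefficients were learned on scaled features.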

### Let's do some graphics
Represent the `MedInc` against the target `MedHouseVal`

In [None]:
import plotly.express as px
import plotly.graph_objects as go

In [None]:
fig = go.Figure(layout_template="none", layout_title="MedInc -vs- MedHouseVal")

# add test data
fig.add_trace(go.Scattergl(   # the scattergl option is more suitable in case we have many samples (>1000) to plot
    x=X_test.MedInc,
    y=y_test,
    mode="markers",
    name="Rest of Houses",
))

# add my house
fig.add_trace(go.Scatter(
    x=new_sample.MedInc,
    y=my_house_est_price,
    mode="markers",
    marker=dict(
        size=15,
        symbol="star-triangle-up-dot"
    ),
    name="My House"
))

fig.update_xaxes(title="MedInc (x10K $)")
fig.update_yaxes(title="MedHouseVal (x100K $)")

### SECOND PROCESS

In [None]:
data = fetch_california_housing()
X = pd.DataFrame(data["data"], columns=data["feature_names"])
y = pd.Series(data["target"], name=data["target_names"][0])

### Determine location of reference points
You can use Google Maps for this

In [None]:
sfLocation = [37.756728, -122.447001]
sjLocation = [37.312811, -121.857302]
sdLocation = [32.731034, -117.162596]
myLocation = [36.811032, -119.721304]

### Calculate new variables

In [None]:
%%time
from geopy import distance  # assumption: `distance` is geopy's distance module (pip install geopy)

# Distance in km from each block group to each reference city
X["distance2SF"] = X.apply(lambda x: distance.distance(x[["Latitude","Longitude"]], sfLocation).km, axis="columns")
X["distance2SJ"] = X.apply(lambda x: distance.distance(x[["Latitude","Longitude"]], sjLocation).km, axis="columns")
X["distance2SD"] = X.apply(lambda x: distance.distance(x[["Latitude","Longitude"]], sdLocation).km, axis="columns")

In [None]:
X.head()
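
As a dependency-free sanity check of the new distance columns, the great-circle (haversine) distance can be sketched with the standard library only. Note that this is a swapped-in technique: it treats the Earth as a sphere with an assumed radius of 6371 km, whereas `distance.distance` (assuming it is geopy's) computes a geodesic on an ellipsoid, so the two values will differ slightly. The helper `haversine_km` is hypothetical.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two (latitude, longitude) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

# Reference points taken from the notebook
sf = (37.756728, -122.447001)
sj = (37.312811, -121.857302)

sf_to_sj_km = haversine_km(*sf, *sj)
print(round(sf_to_sj_km, 1))
```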

### Remove location variables

In [None]:
X = X.drop(columns=["Latitude","Longitude"])

### Train/Test split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y,              # arrays or matrices to split
    test_size=0.3,     # proportion of the data held out for testing
    random_state=123,  # any integer; makes the split reproducible
    shuffle=True,      # shuffle the data before splitting
)

### Features Standardization

In [None]:
sc = StandardScaler()

In [None]:
X_train_sc = sc.fit_transform(X_train)
X_train_sc = pd.DataFrame(X_train_sc, columns=X_train.columns)

In [None]:
X_train.head()

In [None]:
X_train_sc.head()

### Train ML model

In [None]:
reg = LinearRegression()

In [None]:
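reg.fit(X_train_sc, y_train)  # fit on the scaled features, which is what predict() receives later

### Evaluate model

In [None]:
# Scale the test features with the scaler fitted on the training data
X_test_sc = sc.transform(X_test)
X_test_sc = pd.DataFrame(X_test_sc, columns=X_test.columns)

# R2 for training
r2_train = reg.score(X_train_sc, y_train)

# R2 for test
r2_test = reg.score(X_test_sc, y_test)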

In [None]:
print(f"R2 Train: {r2_train}")
print(f"R2 Test: {r2_test}")

### Make estimations with the model

**Question 3** - What is the recommended sale price of my house now?

In [None]:
new_sample = [{
    "MedInc": 6,
    "HouseAge": 30,
    "AveRooms": 6,
    "AveBedrms": 3,
    "Population": 300,
    "AveOccup": 2,
    "distance2SF": distance.distance(myLocation, sfLocation).km,
    "distance2SJ": distance.distance(myLocation, sjLocation).km,
    "distance2SD": distance.distance(myLocation, sdLocation).km,
}]

new_sample = pd.DataFrame.from_dict(new_sample, orient="columns")

In [None]:
new_sample

#### Scale features

In [None]:
new_sample_sc = sc.transform(new_sample)
new_sample_sc = pd.DataFrame(new_sample_sc, columns=new_sample.columns)

In [None]:
new_sample_sc

#### Predict

In [None]:
my_house_est_price = reg.predict(new_sample_sc)

### Let's do some graphics
Represent the `MedInc` against the target `MedHouseVal`

In [None]:
import plotly.express as px
import plotly.graph_objects as go

In [None]:
fig = go.Figure(layout_template="none", layout_title="MedInc -vs- MedHouseVal")

# add test data
fig.add_trace(go.Scattergl(   # the scattergl option is more suitable in case we have many samples (>1000) to plot
    x=X_test.MedInc,
    y=y_test,
    mode="markers",
    name="Rest of Houses",
))

# add my house
fig.add_trace(go.Scatter(
    x=new_sample.MedInc,
    y=my_house_est_price,
    mode="markers",
    marker=dict(
        size=15,
        symbol="star-triangle-up-dot"
    ),
    name="My House"
))

fig.update_xaxes(title="MedInc (x10K $)")
fig.update_yaxes(title="MedHouseVal (x100K $)")

In [None]:
import joblib
joblib.dump(reg, "lr1.pkl")
my_model = joblib.load("lr1.pkl")
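
Note that `lr1.pkl` contains only the regression. If the saved model was fitted on standardized features, a new session also needs the fitted scaler before it can predict. One possible option, which is not what this notebook does, is to bundle both steps in a scikit-learn `Pipeline` and save that instead. The sketch below uses synthetic data and a hypothetical file name (`lr_pipeline.pkl`) written to a temporary folder.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 4 features on very different scales
rng = np.random.default_rng(3)
X_p = rng.normal(size=(200, 4)) * np.array([1.0, 10.0, 100.0, 1000.0])
y_p = X_p @ np.array([1.0, 0.1, 0.01, 0.001]) + rng.normal(scale=0.1, size=200)

# Scaler and regression are fitted together and travel together
pipe = make_pipeline(StandardScaler(), LinearRegression()).fit(X_p, y_p)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "lr_pipeline.pkl")  # hypothetical file name
    joblib.dump(pipe, path)
    restored = joblib.load(path)

# The restored pipeline accepts RAW (unscaled) rows and scales them internally
raw_rows = X_p[:5]
same_predictions = np.allclose(pipe.predict(raw_rows), restored.predict(raw_rows))
print(same_predictions)
```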