# Supervised Machine Learning: Linear Regression

## Linear Regression: Unscaled vs. Scaled Data
In this demo, we follow the ML process:
1. **Remember:** Load and inspect the data.
2. **Formulate:** Build a linear regression model first on raw (unscaled) data.
3. **Predict:** Evaluate the model's performance.

Then we apply feature scaling and rebuild the model to compare results.
We use the Student Performance dataset from Kaggle to predict the "Performance Index" of students.

In [3]:
# Import necessary libraries
import pandas as pd
import numpy as np

# Download data from Kaggle (requires the Kaggle CLI, installable with `pip install kaggle`, and API credentials)
!kaggle datasets download -d nikhil7280/student-performance-multiple-linear-regression
!unzip student-performance-multiple-linear-regression.zip

# Import dataframe
df = pd.read_csv("Student_Performance.csv")
df

In [None]:
# Convert extracurricular activities to numeric
df["Extracurricular Activities"] = df["Extracurricular Activities"].map({"Yes":1, "No":0})

# Define the features and target variable based on the dataset
feature_vars = ["Hours Studied", "Previous Scores", "Sleep Hours",
                "Sample Question Papers Practiced", "Extracurricular Activities"]
X = df[feature_vars]
y = df["Performance Index"]  # Target: Performance Index

# Display a preview of the dataset
print("Dataset preview:")
print(X.head())
print("\nTarget variable preview:")
print(y.head())

## Part 1: Linear Regression on Unscaled Data
In this section, we build a [linear regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression.fit) model on the raw data.
This helps us see the effect of differing scales on the coefficients.
We start by [splitting our data into training and testing sets](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html#sklearn.model_selection.train_test_split).

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Split the raw data (80% training, 20% testing)
X_train_raw, X_test_raw, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the linear regression model on unscaled data
lin_reg_raw = LinearRegression()
lin_reg_raw.fit(X_train_raw, y_train)

# Make predictions on the test set
y_pred_raw = lin_reg_raw.predict(X_test_raw)
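
To make the 80/20 split less of a black box, here is a minimal sketch of the idea using only NumPy. The function `simple_train_test_split` and the toy arrays are hypothetical names introduced for illustration; this is not scikit-learn's implementation, and it will not select the same rows as `train_test_split` even with the same seed.

```python
import numpy as np

def simple_train_test_split(X, y, test_size=0.2, seed=42):
    """Shuffle row indices once, then use the first `test_size` fraction as the test set."""
    rng = np.random.default_rng(seed)
    n_samples = len(X)
    n_test = int(np.ceil(test_size * n_samples))
    shuffled = rng.permutation(n_samples)
    test_idx, train_idx = shuffled[:n_test], shuffled[n_test:]
    # Index X and y with the same indices so features and targets stay aligned
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Toy data (made up for illustration): row i is [2i, 2i + 1] and its target is i
X_demo = np.arange(20, dtype=float).reshape(10, 2)
y_demo = np.arange(10, dtype=float)

X_tr, X_te, y_tr, y_te = simple_train_test_split(X_demo, y_demo)
print("Train shape:", X_tr.shape, "Test shape:", X_te.shape)
```

The key point is that the same shuffled indices are applied to both the features and the target, so each row keeps its own label.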

In [None]:
# Note: root_mean_squared_error requires scikit-learn 1.4 or later
from sklearn.metrics import mean_squared_error, root_mean_squared_error, r2_score

# Evaluate model performance
mse_raw = mean_squared_error(y_test, y_pred_raw)
rmse_raw = root_mean_squared_error(y_test, y_pred_raw)

r2_raw = r2_score(y_test, y_pred_raw)

print("Unscaled Data Model:")
print(f"Mean Squared Error: {mse_raw:.2f}")
print(f"Root Mean Squared Error: {rmse_raw:.2f}")
print(f"R² Score: {r2_raw:.2f}")

### Notes on Unscaled Model:
- **Coefficients (Unscaled):**
    - Each coefficient represents the change in the Performance Index for a one-unit change in the respective feature, holding all other features constant.
    - For example, if "Hours Studied" has a coefficient of 2.85, the model predicts that each additional hour studied raises the Performance Index by 2.85 points on average (assuming other factors remain constant).
    - However, because features are in different units (e.g., hours vs. scores), comparing these coefficients directly may be misleading.

- **R² Score:**
    - This metric indicates the proportion of the variance in the target variable explained by the model.
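    - An R² close to 1 suggests a very good fit, while an R² near 0 indicates the model explains little of the variance. On held-out data, R² can even be negative when the model predicts worse than simply using the mean of the target.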

- **MSE & RMSE:**
    - MSE measures the average squared difference between actual and predicted values.
    - RMSE, being the square root of MSE, gives an error metric in the same units as the target.
    - Lower RMSE values indicate better predictive performance.
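
The three metrics above can be computed by hand in a few lines. The sketch below uses made-up actual and predicted values (not the Student Performance data) and plain Python, so each step of the definitions is visible.

```python
import math

# Made-up actual and predicted values, for illustration only
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 7.5, 10.0]

n = len(y_true)
residuals = [actual - predicted for actual, predicted in zip(y_true, y_pred)]

# MSE: mean of the squared residuals
mse = sum(r ** 2 for r in residuals) / n

# RMSE: square root of MSE, expressed in the target's own units
rmse = math.sqrt(mse)

# R²: 1 - (residual sum of squares / total sum of squares around the mean)
y_mean = sum(y_true) / n
ss_res = sum(r ** 2 for r in residuals)
ss_tot = sum((actual - y_mean) ** 2 for actual in y_true)
r2 = 1 - ss_res / ss_tot

print(f"MSE:  {mse:.4f}")   # MSE:  0.3750
print(f"RMSE: {rmse:.4f}")  # RMSE: 0.6124
print(f"R²:   {r2:.4f}")    # R²:   0.9250
```

Here the residual sum of squares is 1.5 and the total sum of squares is 20, so R² = 1 - 1.5/20 = 0.925.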

In [None]:
# View our model's coefficients
print("Model Coefficients (Unscaled):")
print(pd.Series(lin_reg_raw.coef_,
                index=X.columns))
print("\nModel Intercept (Unscaled):")
print(lin_reg_raw.intercept_)
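
Because unscaled coefficients are tied to each feature's units, re-expressing a feature in different units changes its coefficient without changing the model. The sketch below illustrates this on synthetic data (the `hours` and `score` arrays are made up, and `fit_slope_intercept` is a hypothetical helper built on `np.linalg.lstsq`): measuring study time in minutes instead of hours shrinks the slope by a factor of 60.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (made up): study time in hours and a noisy score
hours = rng.uniform(1, 9, size=200)
score = 10.0 + 3.0 * hours + rng.normal(0, 1, size=200)

def fit_slope_intercept(x, y):
    """Ordinary least squares for a single feature via np.linalg.lstsq."""
    design = np.column_stack([np.ones_like(x), x])  # column of ones models the intercept
    (intercept, slope), *_ = np.linalg.lstsq(design, y, rcond=None)
    return slope, intercept

slope_hours, intercept_hours = fit_slope_intercept(hours, score)
slope_minutes, intercept_minutes = fit_slope_intercept(hours * 60, score)

print(f"Slope per hour:   {slope_hours:.4f}")
print(f"Slope per minute: {slope_minutes:.4f}")
print(f"Ratio:            {slope_hours / slope_minutes:.1f}")
```

The smaller per-minute slope does not mean study time matters less; only the unit changed. This is why raw coefficient magnitudes should not be compared across features measured in different units.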

### Manually Computing a Prediction from Our Model
- In this section, we'll calculate a predicted value by hand (i.e., by multiplying the model's coefficients by the original feature values and adding the intercept).
- This mirrors exactly what the model does internally.

- **Why is this helpful?**
   - It reinforces how linear regression makes its predictions using the equation: `prediction = intercept + (coef_1 * x_1) + (coef_2 * x_2) + ...`
   - It helps us see the individual impact of each feature on the final prediction.
   - It confirms that the manual approach matches the `model.predict()` output.

#### 1. Extract the coefficients and intercept from our trained model

In [None]:
coef_series = pd.Series(lin_reg_raw.coef_, index=X.columns)
intercept = lin_reg_raw.intercept_

print("Coefficients (Unscaled):")
print(coef_series)
print("\nIntercept:", intercept)

#### 2. Select a single row of our data (e.g., the second row)
- We select only the columns that were used as features in our model.
- The row's values represent the actual data for Hours Studied, Previous Scores, etc.

In [None]:
# This row's feature values will be multiplied by our coefficients.
row_index = 1  # for demonstration
row_features = X.iloc[row_index]  # features only
print("Feature values (Row", row_index, "):\n", row_features)

#### 3. Compute the manual prediction

In [None]:
manual_prediction = (row_features * coef_series).sum() + intercept
print("\nManual Prediction for Row", row_index, ":", manual_prediction)

**Explanation:**
- We multiply each feature value by its corresponding coefficient and sum them up.
- Then, we add the intercept.
- This is precisely the linear regression equation:
$$
\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n
$$

Where:
 - $\beta_0$ is the intercept
 - $\beta_i$ is the coefficient for feature $x_i$

Thus, `manual_prediction` should match what the model would predict internally.

#### 4. Compare to `model.predict()` for confirmation

In [None]:
# Pass a one-row DataFrame so the model receives 2D input with the original feature names
model_prediction = lin_reg_raw.predict(X.iloc[[row_index]])
print("Model Prediction from lin_reg_raw.predict():", model_prediction[0])

### Observation
- The `manual_prediction` and `model_prediction` should be nearly identical (up to minor floating-point differences).
- If they match, we've confirmed our understanding of how the model uses coefficients and intercept to make a prediction.
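
The same check extends from one row to every row at once. The sketch below uses synthetic stand-ins (`X_demo`, `coef_demo`, and `intercept_demo` are made-up values, not the fitted model from this notebook) to show that the row-by-row sum and a single matrix-vector product give the same predictions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-ins (made up): 5 rows, 3 features, and hypothetical fitted parameters
X_demo = rng.normal(size=(5, 3))
coef_demo = np.array([2.0, -1.0, 0.5])
intercept_demo = 4.0

# Row by row: intercept + sum of (coefficient * feature value)
loop_predictions = np.array([
    intercept_demo + sum(c * x for c, x in zip(coef_demo, row))
    for row in X_demo
])

# All rows at once: matrix-vector product plus the intercept
vectorized_predictions = X_demo @ coef_demo + intercept_demo

print(np.allclose(loop_predictions, vectorized_predictions))  # True
```

With the fitted model from this notebook, the analogous expression would be `X.to_numpy() @ lin_reg_raw.coef_ + lin_reg_raw.intercept_`.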

### Why This Matters
- **Transparency:** It shows exactly how each feature influences the final predicted value.
- **Verification:** Confirms our "manual" math aligns with the model's internal computation.
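- **Interpretability:** By inspecting the coefficients, we see the direction (positive or negative) and per-unit size of each feature's effect on the Performance Index, and we can discuss whether the magnitudes make sense given the domain context. Comparing magnitudes across features requires putting the features on a common scale, which Part 2 addresses.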

## Part 2: Linear Regression on Scaled Data
Now we apply feature scaling using `StandardScaler` and rebuild the model.
`StandardScaler` rescales each feature to zero mean and unit variance, which puts all features on a common scale and makes the coefficients easier to compare.
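
As a rough illustration of what standardization does, the sketch below reproduces the calculation with NumPy on synthetic data (all arrays are made up). It assumes the common practice of learning the mean and standard deviation from the training rows only and reusing them for the test rows. It uses the population standard deviation (`ddof=0`), which is also what `StandardScaler` uses.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic features on very different scales (made up for illustration)
X_train_demo = np.column_stack([rng.uniform(1, 9, 80), rng.uniform(40, 99, 80)])
X_test_demo = np.column_stack([rng.uniform(1, 9, 20), rng.uniform(40, 99, 20)])

# Learn the statistics from the training rows only
train_mean = X_train_demo.mean(axis=0)
train_std = X_train_demo.std(axis=0)  # population standard deviation (ddof=0)

# Apply the same shift and scale to both sets
X_train_z = (X_train_demo - train_mean) / train_std
X_test_z = (X_test_demo - train_mean) / train_std

print("Train means after scaling:", X_train_z.mean(axis=0).round(6))
print("Train stds after scaling: ", X_train_z.std(axis=0).round(6))
print("Test means after scaling: ", X_test_z.mean(axis=0).round(3))
```

The training columns end up with mean 0 and standard deviation 1 by construction. The test columns are only approximately standardized, because they are transformed with statistics they did not contribute to.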

In [None]:
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training features only, then apply it to both sets.
# Fitting on the full dataset would leak test-set statistics into training.
scaler = StandardScaler()
X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train_raw),
                              columns=X.columns, index=X_train_raw.index)
X_test_scaled = pd.DataFrame(scaler.transform(X_test_raw),
                             columns=X.columns, index=X_test_raw.index)

# Initialize and train the linear regression model on scaled data
lin_reg_scaled = LinearRegression()
lin_reg_scaled.fit(X_train_scaled, y_train)

# Make predictions on the test set
y_pred_scaled = lin_reg_scaled.predict(X_test_scaled)

# Evaluate model performance
mse_scaled = mean_squared_error(y_test, y_pred_scaled)
r2_scaled = r2_score(y_test, y_pred_scaled)
rmse_scaled = root_mean_squared_error(y_test, y_pred_scaled)

print("\nScaled Data Model:")
print(f"Mean Squared Error: {mse_scaled:.2f}")
print(f"Root Mean Squared Error: {rmse_scaled:.2f}")
print(f"R² Score: {r2_scaled:.2f}")
print("Model Coefficients (Scaled):")
print(pd.Series(lin_reg_scaled.coef_, index=X.columns))

### Notes on Scaled Model:
- **Coefficients (Scaled):**
    - After scaling, each coefficient indicates the change in the Performance Index for a one standard deviation change in that feature.
    - This standardization makes it easier to compare the relative importance of features.
    - For example, a coefficient with a larger absolute value means that feature has a larger effect on the target per standard deviation change; the sign gives the direction of the effect.

- **R² and RMSE Comparison:**
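    - For ordinary least squares, standardizing the features does not change the model's predictions, so R² and RMSE should match the unscaled model up to floating-point differences.
    - The benefit of scaling here is interpretive: it puts coefficients for features measured in different units on a common footing.
    - Scaling is also an important preprocessing step for many other algorithms, such as regularized regression (Ridge, Lasso), k-nearest neighbors, and models trained with gradient descent.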
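
The relationship between the two sets of coefficients can be checked directly. The sketch below fits ordinary least squares with `np.linalg.lstsq` on synthetic data (the arrays and the helper `fit_ols` are made up for illustration) and shows that each standardized coefficient equals the raw coefficient times that feature's standard deviation, while the predictions are unchanged.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic data (made up): two features on different scales
X_demo = np.column_stack([rng.uniform(1, 9, 300), rng.uniform(40, 99, 300)])
y_demo = 5.0 + 2.5 * X_demo[:, 0] + 1.0 * X_demo[:, 1] + rng.normal(0, 2, 300)

def fit_ols(X, y):
    """Return (coefficients, intercept) from ordinary least squares."""
    design = np.column_stack([np.ones(len(X)), X])  # column of ones models the intercept
    params, *_ = np.linalg.lstsq(design, y, rcond=None)
    return params[1:], params[0]

# Standardize with the population standard deviation (ddof=0)
mean, std = X_demo.mean(axis=0), X_demo.std(axis=0)
X_z = (X_demo - mean) / std

coef_raw, intercept_raw = fit_ols(X_demo, y_demo)
coef_z, intercept_z = fit_ols(X_z, y_demo)

pred_raw = X_demo @ coef_raw + intercept_raw
pred_z = X_z @ coef_z + intercept_z

print("Raw coefficients:         ", coef_raw.round(3))
print("Standardized coefficients:", coef_z.round(3))
print("Raw coefficients * std:   ", (coef_raw * std).round(3))
print("Same predictions:", np.allclose(pred_raw, pred_z))
```

In this setup the intercept of the standardized fit also equals the mean of the target, since every standardized feature has mean zero on the data used for fitting.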

# Conclusion
In this demo, we:
- Built and evaluated a linear regression model on unscaled data.
- Re-trained the model after applying feature scaling.
- Observed that the overall performance metrics (**MSE**, **RMSE**, and **R²**) are essentially unchanged by scaling for linear regression, while scaling makes the coefficients directly comparable across features.

### Key Takeaways:
- **Coefficients:** On unscaled data, coefficients are tied to the original units, which can be hard to compare.
  After scaling, coefficients represent the effect of a one standard deviation change in the feature.
- **R² Score:** Reflects the proportion of variance in the target variable explained by the model.
- **MSE (and RMSE):** Lower values indicate better model performance; RMSE provides an error measure in the target's units.

This process reflects the "remember-formulate-predict" approach in machine learning.