<a href="https://colab.research.google.com/github/martatolos/eae-dsaa-2025/blob/main/regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Regression

> Goal of the session:
>
> - At the end of this activity, you will understand the basics of linear regression and the analysis to be performed before training a model.
>
> Scope of the session:
>
> - Walk through the basic concepts of data exploration and analysis.
> - Train a linear regression model using the `sklearn` library and observe how inference with the trained model works.

## 1. Setup

### Dependencies

- ``numpy`` 2.0.2
- ``pandas`` 2.2.2
- ``scikit-learn`` 1.6.1
- ``seaborn`` 0.13.2

> [!Note]
> Jupyter notebooks allow installing packages with the ``%pip`` magic command, which works like running ``pip`` in the terminal but installs into the environment of the running kernel.
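
To confirm that the pinned versions are the ones actually available in the running kernel, the standard library can report what is installed. The following is a minimal sketch; the helper name `installed_version` is an assumption made for illustration and is not part of the notebook.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of ``package``, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Distribution names as published on PyPI (note: "scikit-learn", not "sklearn").
for name in ["numpy", "pandas", "scikit-learn", "seaborn"]:
    print(f"{name}: {installed_version(name)}")
```

A value of `None` means the package is missing from the current environment and the install cell should be run first.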

In [None]:
%pip install numpy==2.0.2 pandas==2.2.2 scikit-learn==1.6.1 seaborn==0.13.2

### Imports

After installing the dependencies, we import the libraries that we will use in this notebook.

In [None]:
import numpy as np  # For numerical operations.
import pandas as pd  # Most commonly used library for data manipulation and analysis.
import seaborn as sns  # For data visualization.
from sklearn.linear_model import LinearRegression  # For linear regression modeling.
from sklearn.preprocessing import OneHotEncoder  # For one-hot encoding categorical variables.

### Data

We will use the **carseats** dataset. This dataset contains information about sales of child car seats at 400 different stores. The goal is to predict the sales of car seats based on various features such as price, location, and other attributes.

In [None]:
car_seats = pd.read_csv(
    "https://raw.githubusercontent.com/intro-stat-learning/ISLP/refs/heads/main/ISLP/data/Carseats.csv"
)

## 2. Analysis

In [None]:
car_seats

The **carseats** dataset we loaded is a dataframe with 400 observations on the following 11 variables:

- **Sales**: Unit sales (in thousands) at each location
- **CompPrice**: Price charged by competitor at each location
- **Income**: Community income level (in thousands of dollars)
- **Advertising**: Local advertising budget for company at each location (in thousands of dollars)
- **Population**: Population size in region (in thousands)
- **Price**: Price company charges for car seats at each site
- **ShelveLoc**: A factor with levels Bad, Good and Medium indicating the quality of the shelving location for the car seats at each site
- **Age**: Average age of the local population
- **Education**: Education level at each location
- **Urban**: A factor with levels No and Yes to indicate whether the store is in an urban or rural location
- **US**: A factor with levels No and Yes to indicate whether the store is in the US or not

In [None]:
car_seats.dtypes

- What is a float?
- What does 64 mean?
- What is a category?
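
As a starting point for these questions, NumPy can describe its own number types. The sketch below only inspects `np.float64`, the default floating-point type that pandas uses for decimal columns; the literal stored in `value` is an arbitrary example.

```python
import numpy as np

# Machine limits for 64-bit floating-point numbers.
info = np.finfo(np.float64)

print(info.bits)  # Number of bits used to store one value.
print(np.dtype(np.float64).itemsize)  # The same size expressed in bytes.
print(info.precision)  # Approximate number of reliable decimal digits.

# A 32-bit float keeps fewer digits, so the same literal is rounded more coarsely.
value = 0.1234567890123456789
print(np.float64(value), np.float32(value))
```

The number in the type name is therefore the storage size in bits: more bits give more precision at the cost of more memory per value.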

In [None]:
car_seats.describe()

What conclusions can we draw from the table above? Should we care about the value ranges? What's missing in this table?

In [None]:
# Convert object columns to category dtype
car_seats["ShelveLoc"] = car_seats["ShelveLoc"].astype("category")
car_seats["Urban"] = car_seats["Urban"].astype("category")
car_seats["US"] = car_seats["US"].astype("category")

car_seats.describe(include=["category"])

#### The convenience of proper types

String values of categorical variables are often not recognized as such and are assigned the ``object`` type. This can lead to problems when we want to use these variables in our analysis. We can convert them to categorical variables using the ``astype`` method, which makes them easier to work with and also saves memory.

When a string column is converted from ``object`` to ``category``, the actual values are encoded as integers and the mapping is stored in the column's metadata. This allows for more efficient storage and faster operations on the column, as the underlying data is now represented as integers rather than strings. The original string values can still be accessed using the mapping, so we don't lose any information in the process.
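
The effect can be observed on a small example. This is a minimal sketch using a hypothetical toy column (not the carseats data) with three labels repeated many times; it shows the integer codes, the stored mapping, and the memory footprint of both representations.

```python
import pandas as pd

# Hypothetical toy column: three distinct labels repeated many times.
labels = pd.Series(["Bad", "Good", "Medium", "Medium"] * 1000, dtype="object")
as_category = labels.astype("category")

# Distinct labels are stored once; each row only holds a small integer code.
print(list(as_category.cat.categories))
print(as_category.cat.codes.head(4).tolist())

# Compare the memory footprint (in bytes) of both representations.
object_bytes = labels.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(object_bytes, category_bytes)

# The original strings can be recovered from the codes and the mapping.
print(as_category.astype("object").equals(labels))
```

The saving grows with the number of rows and shrinks as the number of distinct values approaches the number of rows.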

### Pairplot and correlation matrix

In [None]:
sns.pairplot(car_seats)  # hue="ShelveLoc"

What can we derive from the above diagram?

In [None]:
correlation = car_seats.corr(numeric_only=True)
correlation
# sns.heatmap(correlation, annot=True)

Heatmaps are a quick way to spot correlations between variables.
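
Each cell of the correlation matrix is a Pearson correlation coefficient, which is what ``corr`` computes by default. The sketch below recomputes one coefficient by hand on hypothetical toy data and compares it with the pandas result; the helper `pearson` and the numbers are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: sales fall as the price rises.
toy = pd.DataFrame(
    {
        "Price": [90.0, 100.0, 110.0, 120.0, 130.0],
        "Sales": [11.0, 9.5, 9.0, 7.0, 6.5],
    }
)


def pearson(x: pd.Series, y: pd.Series) -> float:
    """Pearson correlation: co-variation of x and y scaled by their individual spreads."""
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    numerator = (x_centered * y_centered).sum()
    denominator = np.sqrt((x_centered**2).sum() * (y_centered**2).sum())
    return float(numerator / denominator)


manual = pearson(toy["Price"], toy["Sales"])
from_pandas = toy.corr(numeric_only=True).loc["Price", "Sales"]
print(manual, from_pandas)
```

Values close to 1 or -1 indicate a strong linear relationship, while values near 0 indicate a weak one. The coefficient only captures linear relationships.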

In [None]:
for column_name in car_seats.columns:
    sns.displot(car_seats[column_name], height=3)

In [None]:
sns.boxplot(car_seats)

## 3. Preparation

Can a (linear) regression algorithm work with categorical features?

In [None]:
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(car_seats[["ShelveLoc"]])
shelve_loc_ohe = enc.transform(car_seats[["ShelveLoc"]])
shelve_loc_ohe.toarray()

In [None]:
list(enc.categories_[0])

In [None]:
column_names = ["ShelveLoc" + category for category in list(enc.categories_[0])]
shelve_loc = pd.DataFrame(shelve_loc_ohe.toarray(), columns=column_names)
pd.concat([car_seats, shelve_loc], axis=1)

What's the difference between nominal and ordinal categorical features?
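
pandas can represent both kinds of categorical feature. The following is a small sketch on a hypothetical toy column that mirrors the ``ShelveLoc`` levels; the order Bad < Medium < Good is an assumption about how the levels should be ranked.

```python
import pandas as pd

# Hypothetical toy column mirroring the ``ShelveLoc`` levels.
shelf = pd.Series(["Medium", "Bad", "Good", "Medium"])

# Nominal: the categories carry no order; pandas sorts them alphabetically by default.
nominal = shelf.astype("category")
print(list(nominal.cat.categories))

# Ordinal: the order is stated explicitly, so codes and comparisons become meaningful.
quality = pd.CategoricalDtype(categories=["Bad", "Medium", "Good"], ordered=True)
ordinal = shelf.astype(quality)
print(ordinal.cat.codes.tolist())
print((ordinal >= "Medium").tolist())
```

For a nominal feature the integer codes are arbitrary labels, which is why one-hot encoding is usually preferred. For an ordinal feature the codes reflect a real ranking.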

In [None]:
# Setting binary columns to 1/0 or True/False has the same effect when applying to most regressors. Boolean columns
# require less memory.
car_seats["UrbanEnc"] = car_seats["Urban"].transform(lambda boolean: boolean == "Yes")

In [None]:
car_seats["USEnc"] = car_seats["US"].transform(lambda boolean: boolean == "Yes")

In [None]:
car_seats["ShelveLocEnc"] = car_seats["ShelveLoc"].transform(
    lambda loc: 1 if loc == "Bad" else 2 if loc == "Medium" else 3
)

Is this a safe way to convert categorical values into numerical values? What if we have many different values?
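
One risk of the chained conditional is that any unexpected value silently falls through to the last branch. The sketch below illustrates this on a hypothetical toy column containing a misspelled level, and contrasts it with an explicit dictionary mapping.

```python
import pandas as pd

# Hypothetical toy column containing a value the encoding does not anticipate.
shelf = pd.Series(["Bad", "Medium", "Good", "good"])

# The chained conditional sends every unexpected value to the last branch.
with_lambda = shelf.apply(lambda loc: 1 if loc == "Bad" else 2 if loc == "Medium" else 3)
print(with_lambda.tolist())

# An explicit mapping leaves unexpected values as NaN, so they can be detected.
mapping = {"Bad": 1, "Medium": 2, "Good": 3}
with_map = shelf.map(mapping)
print(shelf[with_map.isna()].tolist())
```

A dictionary also scales better to many levels, since the mapping is plain data that can be inspected, extended, or loaded from a file.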

In [None]:
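shelve_loc_dict = {"Bad": 1, "Medium": 2, "Good": 3}
# Mapping a category column yields another category column, so we cast the result to a plain integer column.
car_seats["ShelveLocEnc"] = car_seats["ShelveLoc"].map(shelve_loc_dict).astype(int)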

In [None]:
car_seats

What else could we do before training our linear regression model?
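
One common step, offered here as a suggestion rather than as the only answer, is to hold out part of the data so that the model can later be evaluated on rows it has not seen. The sketch below uses a hypothetical toy frame and plain pandas; scikit-learn offers `train_test_split` in `sklearn.model_selection` for the same purpose.

```python
import pandas as pd

# Hypothetical toy frame standing in for the prepared dataset.
toy = pd.DataFrame({"Price": range(100, 110), "Sales": range(10, 0, -1)})

# Draw 80% of the rows at random for training; a fixed seed keeps the split reproducible.
train = toy.sample(frac=0.8, random_state=42)

# Every row that was not drawn forms the held-out set.
held_out = toy.drop(train.index)

print(len(train), len(held_out))
```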

## 4. Linear Regression

In [None]:
X = car_seats[["Income", "Advertising", "Price", "Age", "CompPrice", "ShelveLocEnc"]]
y = car_seats["Sales"]

reg = LinearRegression().fit(X, y)

In [None]:
data = {"Income": [10], "Advertising": [20], "Price": [100], "Age": [20], "CompPrice": [100], "ShelveLocEnc": [3]}
X_new = pd.DataFrame(data)
reg.predict(X_new)

In [None]:
coefficients = reg.coef_
coefficients

In [None]:
intercept = reg.intercept_
intercept

Try to obtain the same result as the model by using the values in ``data``, ``coefficients`` and ``intercept`` variables.
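
As a hint, a linear regression prediction is the intercept plus the weighted sum of the feature values. Writing $x_i$ for the feature values in ``data``, $\beta_i$ for the entries of ``coefficients`` (assumed to be in the same column order as ``X``), $\beta_0$ for the ``intercept``, and $p$ for the number of features:

$$
\hat{y} = \beta_0 + \sum_{i=1}^{p} \beta_i x_i
$$

In this notebook $p = 6$, one term per column of ``X``.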

In [None]:
def predict_sales(data: np.ndarray, coefficients: np.ndarray, intercept: float) -> float:
    """Generate a prediction of sales based on the coefficients and intercept of a linear regression model.

    ``data`` contains the input feature values, in the same order as the columns used to train the model.

    :param data: An array containing the input features for the prediction.
    :param coefficients: The coefficients of the linear regression model.
    :param intercept: The intercept of the linear regression model.
    :return: The predicted sales.
    """
    # Add your code here

In [None]:
raw_data = np.array([value[0] for value in data.values()], dtype=np.float64)

predicted_sales = predict_sales(raw_data, coefficients, intercept)
model_sales = reg.predict(X_new)[0]

print(f"Predicted sales: {predicted_sales}")
print(f"Model sales: {model_sales}")

if round(predicted_sales, 3) == round(model_sales, 3):
    print("The predicted sales match the model's prediction.")

else:
    print("The predicted sales do not match the model's prediction.")