<a href="https://colab.research.google.com/github/alexanderkersten/eae-dsaa/blob/main/regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. Setup

In [None]:
!pip install ISLP

In [None]:
from ISLP import load_data
car_seats = load_data('Carseats')

A simulated data set containing sales of child car seats at 400 different stores.

# 2. Analysis

In [None]:
car_seats

A data frame with 400 observations on the following 11 variables.

- **Sales**
Unit sales (in thousands) at each location

- **CompPrice**
Price charged by competitor at each location

- **Income**
Community income level (in thousands of dollars)

- **Advertising**
Local advertising budget for company at each location (in thousands of dollars)

- **Population**
Population size in region (in thousands)

- **Price**
Price company charges for car seats at each site

- **ShelveLoc**
A factor with levels Bad, Good and Medium indicating the quality of the shelving location for the car seats at each site

- **Age**
Average age of the local population

- **Education**
Education level at each location

- **Urban**
A factor with levels No and Yes to indicate whether the store is in an urban or rural location

- **US**
A factor with levels No and Yes to indicate whether the store is in the US or not

In [None]:
car_seats.dtypes

What is a float?
What does 64 mean?
What is a category?
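
As a small sketch of what these questions point at: `float64` is a floating-point number stored in 64 bits, while a `category` column stores each distinct label once and keeps a small integer code per row. The frame `toy` and its values below are made up for illustration and are not taken from Carseats.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not part of the Carseats data set
toy = pd.DataFrame({
    "Sales": [9.5, 11.22, 10.06],
    "ShelveLoc": pd.Categorical(["Bad", "Good", "Medium"]),
})

print(toy.dtypes)

# float64: each value occupies 64 bits
bits = np.finfo(np.float64).bits
print(bits)

# category: the labels are stored once, each row holds an integer code
labels = list(toy["ShelveLoc"].cat.categories)
codes = toy["ShelveLoc"].cat.codes.tolist()
print(labels, codes)
```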

In [None]:
car_seats.describe()

What conclusions can we draw from the table above? Should we care about the value ranges? What is missing from this table?

In [None]:
car_seats.describe(include=["category"])

In [None]:
import seaborn as sns
sns.pairplot(car_seats)  # hue="ShelveLoc"

What can we infer from the plot above?

In [None]:
correlation = car_seats.corr(numeric_only=True)
correlation
# sns.heatmap(correlation, annot=True)

In [None]:
for series_name, series in car_seats.items():
  sns.displot(series, height=3)

In [None]:
sns.boxplot(data=car_seats)

# 3. Preparation

Can a (linear) regression algorithm work with categorical features?

In [None]:
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(car_seats[["ShelveLoc"]])
shelve_loc_ohe = enc.transform(car_seats[["ShelveLoc"]])
shelve_loc_ohe.toarray()

In [None]:
list(enc.categories_[0])

In [None]:
column_names = ["ShelveLoc" + category for category in list(enc.categories_[0])]
shelve_loc = pd.DataFrame(shelve_loc_ohe.toarray(), columns=column_names)
pd.concat([car_seats, shelve_loc], axis=1)
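
The same kind of encoding can also be sketched with `pandas.get_dummies`, shown here on a made-up frame `toy` rather than on `car_seats`. Setting `drop_first=True` is one possible choice, not something the notebook prescribes: it drops one level so that the remaining dummy columns are not perfectly collinear with the intercept of a linear model.

```python
import pandas as pd

# Hypothetical toy data, not part of the Carseats data set
toy = pd.DataFrame({
    "Price": [120, 83, 80, 97],
    "ShelveLoc": pd.Categorical(["Bad", "Good", "Medium", "Bad"]),
})

# One 0/1 column per level except the first one ("Bad"), which becomes the baseline
encoded = pd.get_dummies(
    toy,
    columns=["ShelveLoc"],
    prefix="ShelveLoc",
    prefix_sep="",
    drop_first=True,
    dtype=int,
)
print(encoded)
```

A row with zeros in both dummy columns then corresponds to the baseline level `Bad`.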

What's the difference between nominal and ordinal categorical features?
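
One way to make the distinction concrete is a pandas ordered categorical. In this sketch the values are invented for illustration: the shelf quality is treated as ordinal (Bad < Medium < Good), while a Yes/No flag is treated as nominal.

```python
import pandas as pd

# Ordinal: the levels have a meaningful order, declared explicitly
shelve = pd.Series(pd.Categorical(
    ["Bad", "Good", "Medium"],
    categories=["Bad", "Medium", "Good"],
    ordered=True,
))
ordinal_codes = shelve.cat.codes.tolist()
below_good = (shelve < "Good").tolist()  # comparisons respect the declared order
print(ordinal_codes, below_good)

# Nominal: the levels are just labels, no order is implied
urban = pd.Series(pd.Categorical(["Yes", "No", "Yes"]))
print(urban.cat.ordered)
```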

In [None]:
car_seats["UrbanEnc"] = car_seats.Urban.transform(lambda boolean: 1 if boolean == "Yes" else 0)

In [None]:
car_seats["USEnc"] = car_seats.US.transform(lambda boolean: 1 if boolean == "Yes" else 0)

In [None]:
car_seats["ShelveLocEnc"] = car_seats.ShelveLoc.transform(lambda loc: 1 if loc == "Bad" else 2 if loc == "Medium" else 3)

Is this a safe way to convert categorical values into numerical values? What if we have many different values?

In [None]:
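# Explicit mapping from each level to its rank
shelve_loc_dict = {"Bad": 1, "Medium": 2, "Good": 3}
# astype(int) guarantees a plain integer column rather than a categorical one
car_seats["ShelveLocEnc"] = car_seats.ShelveLoc.map(shelve_loc_dict).astype(int)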
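
When there are many columns or many levels, scikit-learn's `OrdinalEncoder` is one possible alternative to hand-written dictionaries. The sketch below uses a made-up frame `toy`. Note that the codes start at 0 rather than 1, so they differ by a constant from the dictionary above.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data, not part of the Carseats data set
toy = pd.DataFrame({"ShelveLoc": ["Bad", "Good", "Medium", "Bad"]})

# The order of the list defines the codes: Bad -> 0, Medium -> 1, Good -> 2
encoder = OrdinalEncoder(categories=[["Bad", "Medium", "Good"]])
ordinal = encoder.fit_transform(toy[["ShelveLoc"]]).ravel()
print(ordinal)
```

With the default settings, the encoder raises an error if it meets a value that was not listed in `categories`, so an unexpected level does not slip through silently.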

In [None]:
car_seats

What else could we do before training our linear regression model?
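
Two common candidates are holding out a test set and standardizing the features. The following is a minimal sketch on synthetic data. The names `X_toy` and `y_toy` and the numbers used to generate them are assumptions for illustration, not properties of Carseats.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical values)
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    "Price": rng.normal(115, 20, size=100),
    "Income": rng.normal(70, 25, size=100),
})
y_toy = 14 - 0.05 * X_toy["Price"] + rng.normal(0, 1, size=100)

# Keep 20% of the rows aside for evaluating the model later
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0
)

# Fit the scaler on the training rows only, then apply it to both parts
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
```

Fitting the scaler on the training rows only keeps information about the test rows out of the preprocessing step.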

# 4. Regression

In [None]:
from sklearn.linear_model import LinearRegression

X = car_seats[["Income", "Advertising", "Price", "Age", "CompPrice", "ShelveLocEnc"]]
y = car_seats.Sales

reg = LinearRegression().fit(X, y)

In [None]:
data = {"Income": [10], "Advertising": [20], "Price": [100], "Age": [20], "CompPrice": [100], "ShelveLocEnc": [3]}
X_new = pd.DataFrame(data)
reg.predict(X_new)

In [None]:
reg.coef_

In [None]:
reg.intercept_
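
The raw `coef_` array is easier to read when each value is paired with its feature name, and `score` gives the R² of the fit. The sketch below uses synthetic data with known coefficients (-0.05 for `Price`, 0.1 for `Advertising`, chosen arbitrarily for illustration) so that the recovered values can be checked. It does not reproduce results for Carseats.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data with known coefficients (hypothetical values)
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    "Price": rng.uniform(50, 150, size=200),
    "Advertising": rng.uniform(0, 30, size=200),
})
y_toy = (
    10
    - 0.05 * X_toy["Price"]
    + 0.1 * X_toy["Advertising"]
    + rng.normal(0, 0.1, size=200)
)

model = LinearRegression().fit(X_toy, y_toy)

# Pair each coefficient with the name of its feature
coefficients = pd.Series(model.coef_, index=X_toy.columns)
print(coefficients)

# R^2 on the data the model was trained on (optimistic compared to held-out data)
r2 = model.score(X_toy, y_toy)
print(r2)
```

Each coefficient is the change in the predicted target for a one-unit increase in that feature while the other features are held fixed.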