## Linear Regression code along

- we have labels -> supervised learning
- clustering -> unsupervised learning
- predict a real-valued (float) number -> regression
- predict discrete (categorical) values -> classification

In [4]:
import pandas as pd

df = pd.read_csv("../../data/Advertising.csv", index_col=0)

df.head()

      TV  radio  newspaper  sales
1  230.1   37.8       69.2   22.1
2   44.5   39.3       45.1   10.4
3   17.2   45.9       69.3    9.3
4  151.5   41.3       58.5   18.5
5  180.8   10.8       58.4   12.9


## EDA on dataset

In [12]:
df.shape, df.columns, df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 200 entries, 1 to 200
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   TV         200 non-null    float64
 1   radio      200 non-null    float64
 2   newspaper  200 non-null    float64
 3   sales      200 non-null    float64
dtypes: float64(4)
memory usage: 7.8 KB


((200, 4), Index(['TV', 'radio', 'newspaper', 'sales'], dtype='object'), None)

In [10]:
df.describe()

               TV      radio  newspaper      sales
count  200.000000  200.000000  200.000000  200.000000
mean   147.042500   23.264000   30.554000   14.022500
std     85.854236   14.846809   21.778621    5.217457
min      0.700000    0.000000    0.300000    1.600000
25%     74.375000    9.975000   12.750000   10.375000
50%    149.750000   22.900000   25.750000   12.900000
75%    218.825000   36.525000   45.100000   17.400000
max    296.400000   49.600000  114.000000   27.000000


In [14]:
print(f"{df.shape[0]} samples")
print(f"{df.shape[1] -1} features")
print("sales column is our label/target")

200 samples
3 features
sales column is our label/target


## Divide data into X and y

In [None]:
# store the feature columns in X (drop the column that will be y)
# store only the target/label in y (the column dropped from X)
# 3 features (TV, radio, newspaper) -> X, 1 target/label (sales) -> y

# tuple unpacking
# X = design matrix / feature matrix / features / independent variable
# y = target variable / label / dependent variable
X, y = df.drop("sales", axis="columns"), df["sales"]
X, y

(        TV  radio  newspaper
 1    230.1   37.8       69.2
 2     44.5   39.3       45.1
 3     17.2   45.9       69.3
 4    151.5   41.3       58.5
 5    180.8   10.8       58.4
 ..     ...    ...        ...
 196   38.2    3.7       13.8
 197   94.2    4.9        8.1
 198  177.0    9.3        6.4
 199  283.6   42.0       66.2
 200  232.1    8.6        8.7
 
 [200 rows x 3 columns],
 1      22.1
 2      10.4
 3       9.3
 4      18.5
 5      12.9
        ... 
 196     7.6
 197     9.7
 198    12.8
 199    25.5
 200    13.4
 Name: sales, Length: 200, dtype: float64)

In [None]:
# a DataFrame is made up of several Series
type(X), type(y)

(pandas.core.frame.DataFrame, pandas.core.series.Series)

In [17]:
X.shape, y.shape

((200, 3), (200,))

## Scikit-learn steps

1. train|test split or train|val|test split
2. scale dataset
    - many algorithms require scaling, some don't
    - there are different types of scaling (e.g. feature standardization, min-max scaling)
    - fit the scaler on the training data only, then scale both training and test data with the training data's parameters to avoid data leakage
3. Fit algorithm to training data
4. Predict on test data
5. Evaluation metrics
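
The five steps above can be sketched end to end in a few lines. This is a minimal illustration, not the notebook's own code: it uses synthetic stand-in data, and the names `X_demo`, `y_demo`, `pipe` and `demo_mae` are made up for the example. Using `make_pipeline` is one way to make sure the scaler is fit on the training data only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (an assumption, not the Advertising dataset):
# 200 samples, 3 features, target is a noisy linear combination of them.
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 100, size=(200, 3))
y_demo = 3.0 + X_demo @ np.array([0.05, 0.2, 0.0]) + rng.normal(0, 1, size=200)

# 1. train|test split
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

# 2 + 3. the pipeline fits the scaler on the training data, then fits the model
pipe = make_pipeline(MinMaxScaler(), LinearRegression())
pipe.fit(X_tr, y_tr)

# 4. predict on test data (the scaler reuses the training min/max)
y_hat = pipe.predict(X_te)

# 5. evaluation metric
demo_mae = mean_absolute_error(y_te, y_hat)
print(f"{demo_mae = }")
```

The sections below walk through the same steps one at a time, without a pipeline, so each intermediate result can be inspected.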

## 1. train|test split

In [None]:
from sklearn.model_selection import train_test_split

# help(train_test_split)

# random_state makes the shuffle reproducible (same split every run); test_size=0.33 puts 33% of the data in the test set and leaves 67% for training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

print(f"{X_train.shape = }")
print(f"{y_train.shape = }")
print(f"{X_test.shape = }")
print(f"{y_test.shape = }")

X_train.shape = (134, 3)
y_train.shape = (134,)
X_test.shape = (66, 3)
y_test.shape = (66,)
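
The split sizes can be checked by hand. The sketch below assumes that, for a float `test_size`, the number of test samples is rounded up and the training set gets the remainder; the variable names are made up for the example.

```python
import math

n_samples = 200   # rows in the dataset
test_size = 0.33  # fraction passed to train_test_split

# Assumption: the test count is rounded up, the rest goes to training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(f"{n_train = }, {n_test = }")
```

Under that assumption the counts agree with the shapes printed above (134 training rows, 66 test rows).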


In [24]:
X_train.head()

        TV  radio  newspaper
43   293.6   27.7        1.8
190   18.7   12.1       23.4
91   134.3    4.9        9.3
137   25.6   39.0        9.3
52   100.4    9.6        3.6


In [25]:
y_train.head()

43     20.7
190     6.7
91     11.2
137     9.5
52     10.7
Name: sales, dtype: float64

## 2. feature scaling

- min-max scaling
- values are transformed into the range 0 to 1
- for ordinary linear regression you usually don't need to scale the data first

In [27]:
from sklearn.preprocessing import MinMaxScaler

# create an instance of the MinMaxScaler class
scaler = MinMaxScaler()
type(scaler)

sklearn.preprocessing._data.MinMaxScaler

In [28]:
scaler

MinMaxScaler()


In [None]:
# fit the scaler on X_train only (learns each feature's min and max)
scaler.fit(X_train)

# transform X_train
scaled_X_train = scaler.transform(X_train)
# transform X_test using the parameters learned from X_train
scaled_X_test = scaler.transform(X_test)

# the training data now lies exactly in the range 0 to 1
print(f"{scaled_X_train.min() = }")
print(f"{scaled_X_train.max() = }")

# scaled_X_test does not land exactly on 0 and 1 because the scaler was fit on X_train only; this is expected and means nothing from X_test has leaked
# if the test data's min and max were exactly 0 and 1, the scaler was probably fit on the test data as well (data leakage)
print(f"{scaled_X_test.min() = }")
print(f"{scaled_X_test.max() = }")

scaled_X_train.min() = np.float64(0.0)
scaled_X_train.max() = np.float64(1.0)
scaled_X_test.min() = np.float64(0.005964214711729622)
scaled_X_test.max() = np.float64(1.1302186878727631)


In [32]:
# now a NumPy array instead of a DataFrame, but with the same shape
scaled_X_train, scaled_X_train.shape

(array([[0.99053094, 0.55846774, 0.01491054],
        [0.06087251, 0.24395161, 0.22962227],
        [0.45180927, 0.09879032, 0.08946322],
        [0.08420697, 0.78629032, 0.08946322],
        [0.33716605, 0.19354839, 0.03280318],
        [0.26885357, 0.        , 0.08846918],
        [0.63476496, 0.36491935, 0.25149105],
        [0.59621238, 0.6733871 , 0.38170974],
        [0.42272574, 0.74395161, 0.78429423],
        [0.70645925, 0.41532258, 0.10337972],
        [0.4808928 , 0.59072581, 0.1222664 ],
        [0.62292864, 0.88508065, 0.0139165 ],
        [0.74974636, 0.08669355, 0.49204771],
        [0.81501522, 0.76612903, 0.22763419],
        [0.0557998 , 0.92540323, 0.68588469],
        [0.40514034, 0.57459677, 0.13817097],
        [0.30098072, 0.19959677, 0.35188867],
        [0.64389584, 0.57862903, 0.17793241],
        [0.25295908, 0.21774194, 0.05666004],
        [0.65099763, 0.37096774, 0.6500994 ],
        [0.2874535 , 0.72177419, 0.48707753],
        [0.90023673, 0.88306452, 0

In [34]:
type(scaled_X_train), scaled_X_train.shape, scaled_X_train[:5]

(numpy.ndarray,
 (134, 3),
 array([[0.99053094, 0.55846774, 0.01491054],
        [0.06087251, 0.24395161, 0.22962227],
        [0.45180927, 0.09879032, 0.08946322],
        [0.08420697, 0.78629032, 0.08946322],
        [0.33716605, 0.19354839, 0.03280318]]))

## 3. Linear regression

$y = w_0 + w_1x_1 + w_2x_2 + w_3x_3$

In [35]:
from sklearn.linear_model import LinearRegression

# make instance/object of the LinearRegression class
model = LinearRegression()
model

LinearRegression()


In [None]:
# fit the model to the scaled X_train together with the true values y_train
model.fit(scaled_X_train, y_train)

# w1, w2, w3 (the weights for TV, radio, newspaper)
print(f"Parameters/weights: {model.coef_}")

# w0 (the intercept)
print(f"Intercept: {model.intercept_}")

Parameters/weights: [13.20747617  9.75285112  0.61108329]
Intercept: 2.7911595196243653


## 4. Prediction

### predict test sample

In [43]:
X_test.iloc[0]

TV           163.3
radio         31.6
newspaper     52.9
Name: 96, dtype: float64

In [39]:
# predict expects a 2D array of shape (n_samples, n_features), like the training data, so reshape the single sample
sample_features = scaled_X_test[0].reshape(1, -1)
sample_features

array([[0.54988164, 0.63709677, 0.52286282]])

In [41]:
model.predict(sample_features)

array([16.58673085])

In [42]:
y_test.iloc[0]

np.float64(16.9)
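
The prediction for this sample can be reproduced by hand from $y = w_0 + w_1x_1 + w_2x_2 + w_3x_3$. The sketch below copies the intercept, weights and scaled sample from the printed outputs above, so the result is only accurate to the printed precision; the variable names are made up for the example.

```python
import numpy as np

# Values copied from the outputs above
intercept = 2.7911595196243653                               # w0
weights = np.array([13.20747617, 9.75285112, 0.61108329])    # w1, w2, w3
sample = np.array([0.54988164, 0.63709677, 0.52286282])      # scaled x1, x2, x3

# y = w0 + w1*x1 + w2*x2 + w3*x3, written as a dot product
manual_prediction = intercept + weights @ sample
print(f"{manual_prediction = }")
```

Up to rounding this gives the same value as `model.predict(sample_features)`, about 16.59, compared with the true value 16.9.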

### predict on whole test data

In [None]:
y_pred = model.predict(scaled_X_test)
# compare the first 5 values of y_pred with the true values in y_test
y_pred[:5]

array([16.58673085, 21.18622524, 21.66752973, 10.81086512, 22.25210881])

In [None]:
# the first 5 true values to compare against
y_test.iloc[:5]

96     16.9
16     22.4
31     21.4
159     7.3
129    24.7
Name: sales, dtype: float64

## 5. Evaluate

Common metrics for regression:
- mae - mean absolute error
- mse - mean squared error
- rmse - root mean squared error

Outliers are penalized more heavily by mse and rmse, because the errors are squared before they are averaged, so large deviations dominate.

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np

# compare the true test values with the predicted values
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
# square root of mse
rmse = np.sqrt(mse)

# the numbers are used to compare different models (e.g. KNN, random forest, linear regression) and see which has the lowest errors on a dataset
# lower numbers are better
print(f"{mae = }")
print(f"{mse = }")
print(f"{rmse = }")

mae = 1.4937750024728977
mse = 3.72792833068152
rmse = np.float64(1.9307843822347228)
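
The three metrics can also be written out directly with NumPy. The sketch below uses only the first five true and predicted values printed earlier, so its numbers differ from the full test-set results above; the `_demo` names are made up for the example.

```python
import numpy as np

# First five true values and predictions, copied from the outputs above
y_true_demo = np.array([16.9, 22.4, 21.4, 7.3, 24.7])
y_pred_demo = np.array(
    [16.58673085, 21.18622524, 21.66752973, 10.81086512, 22.25210881]
)

errors = y_true_demo - y_pred_demo

mae_demo = np.mean(np.abs(errors))   # mean of absolute errors
mse_demo = np.mean(errors ** 2)      # mean of squared errors
rmse_demo = np.sqrt(mse_demo)        # back in the same unit as the target

print(f"{mae_demo = }")
print(f"{mse_demo = }")
print(f"{rmse_demo = }")
```

The fourth sample (7.3 against a prediction of about 10.8) has the largest error, and squaring makes it weigh more in mse and rmse than it does in mae.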
