# Multiple Linear Regression


## Overview

Multiple Linear Regression extends Simple Linear Regression to model the relationship between one dependent variable and two or more independent variables. It assumes the dependent variable is a linear function of the independent variables: each feature contributes through a single coefficient, with the other features held fixed.

## Mathematical Model

The model can be represented by the equation:

\[ y = b_0 + b_1x_1 + b_2x_2 + \cdots + b_nx_n \]

Where:

-   \( y \) is the dependent variable (target).
-   \( x_1, x_2, \ldots, x_n \) are the independent variables (features).
-   \( b_0 \) is the intercept.
-   \( b_1, b_2, \ldots, b_n \) are the coefficients of the features.
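
To make the equation concrete, here is a minimal NumPy sketch. The coefficient values and the small feature matrix are made up for illustration. It evaluates the equation for every row and then recovers the coefficients with ordinary least squares.

```python
import numpy as np

# Hypothetical coefficients, chosen only for illustration.
b0 = 1.0                    # intercept b_0
b = np.array([2.0, -0.5])   # b_1, b_2

# Four observations of two features (x_1, x_2); also made up.
X = np.array([
    [1.0, 2.0],
    [3.0, 4.0],
    [0.0, 6.0],
    [2.0, 1.0],
])

# y = b_0 + b_1*x_1 + b_2*x_2, evaluated for every row at once.
y = b0 + X @ b
print(y)  # values: 2.0, 5.0, -2.0, 4.5

# Least squares recovers the coefficients; the column of ones stands in for b_0.
X_design = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(np.round(coef, 6))
```

Because `y` was generated without noise, the recovered coefficients match `b0` and `b` up to floating-point error. With real data they would only be estimates.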

## Implementation Steps

1. **Data Preparation**

    - Load the dataset.
    - Split the data into training and testing sets.
    - (Optional) Standardize or normalize the features.

2. **Model Training**

    - Use the `fit` method to train the model on the training data.

3. **Prediction**

    - Use the `predict` method to make predictions on new data.

## Example Code

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Load the data
data = pd.read_csv("data/prediction/sales.csv")
features = data[["Feature1", "Feature2", "Feature3"]].values  # Replace with your actual feature names
target = data["Target"]

# Split the data
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.33, random_state=0)

# Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Initialize and train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)
```
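
The example above needs a CSV file that may not be at hand. The sketch below runs the same steps on synthetic data so it can be executed as-is. The generated data, the coefficient values, and the use of `r2_score` as a quick check are assumptions added for illustration; they are not part of the original example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the CSV: 3 features, known coefficients, a little noise.
rng = np.random.default_rng(0)
features = rng.normal(size=(100, 3))
target = 3.0 + features @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.33, random_state=0
)

# Fit the scaler on the training split only, then reuse it on the test split.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# R^2 on held-out data: close to 1 here because the data is nearly noise-free.
score = r2_score(y_test, y_pred)
print(f"test R^2: {score:.3f}")
```

Fitting the scaler on the training split alone keeps information from the test rows out of the model.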


In [2]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

In [3]:
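# Column names are Turkish: ulke = country, boy = height, kilo = weight,
# yas = age, cinsiyet = gender (e = male, k = female).
data = pd.read_csv(r"data/prediction/data.csv")
data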

,ulke,boy,kilo,yas,cinsiyet
0,tr,130,30,10,e
1,tr,125,36,11,e
2,tr,135,34,10,k
3,tr,133,30,9,k
4,tr,129,38,12,e
5,tr,180,90,30,e
6,tr,190,80,25,e
7,tr,175,90,35,e
8,tr,177,60,22,k
9,us,185,105,33,e


In [4]:
le = LabelEncoder()
ohe = OneHotEncoder()

In [5]:
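country = data["ulke"].values

# Map the country codes to integers, then expand them into one 0/1 column per country.
country = le.fit_transform(country)
country = country.reshape(-1, 1)  # OneHotEncoder expects a 2-D array
country = ohe.fit_transform(country).toarray()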

country

array([[0., 1., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.]])
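
As a side note, `OneHotEncoder` can also take the string column directly, which skips the `LabelEncoder` step. The sketch below uses a four-row sample built from the country codes above; the sample frame itself is hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample using the same country codes as the notebook.
countries = pd.DataFrame({"ulke": ["tr", "us", "fr", "tr"]})

ohe = OneHotEncoder()
# Double brackets keep the input 2-D, which the encoder requires.
encoded = ohe.fit_transform(countries[["ulke"]]).toarray()

# Categories are sorted, so the columns come out in the order fr, tr, us.
print(ohe.categories_[0])
print(encoded)
```

Reading the column order from `ohe.categories_` avoids hard-coding the names when building a DataFrame from the result.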

In [6]:
genders = data["cinsiyet"].values

# Same encoding for gender: column 0 marks "e", column 1 marks "k".
genders = le.fit_transform(genders)
genders = genders.reshape(-1, 1)
genders = ohe.fit_transform(genders).toarray()

genders

array([[1., 0.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [0., 1.]])

In [7]:
# LabelEncoder sorts its classes alphabetically, so the columns are fr, tr, us.
country_df = pd.DataFrame(data=country, columns=["fr", "tr", "us"])
country_df

,fr,tr,us
0,0.0,1.0,0.0
1,0.0,1.0,0.0
2,0.0,1.0,0.0
3,0.0,1.0,0.0
4,0.0,1.0,0.0
5,0.0,1.0,0.0
6,0.0,1.0,0.0
7,0.0,1.0,0.0
8,0.0,1.0,0.0
9,0.0,0.0,1.0


In [8]:
# Translate the numeric column names: boy = height, kilo = weight, yas = age.
numeric_df = data[["boy", "kilo", "yas"]].rename(
    columns={"boy": "height", "kilo": "weight", "yas": "age"}
)
numeric_df

,height,weight,age
0,130,30,10
1,125,36,11
2,135,34,10
3,133,30,9
4,129,38,12
5,180,90,30
6,190,80,25
7,175,90,35
8,177,60,22
9,185,105,33


In [9]:
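# Keep one gender column (1.0 = k, 0.0 = e); the other one-hot column carries the same information.
genders_df = pd.DataFrame(data=genders[:, 1], columns=["gender"])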
genders_df

,gender
0,0.0
1,0.0
2,1.0
3,1.0
4,0.0
5,0.0
6,0.0
7,0.0
8,1.0
9,0.0


In [10]:
# Feature table: country dummies plus height, weight, and age.
final_df = pd.concat([country_df, numeric_df], axis=1)
final_df

,fr,tr,us,height,weight,age
0,0.0,1.0,0.0,130,30,10
1,0.0,1.0,0.0,125,36,11
2,0.0,1.0,0.0,135,34,10
3,0.0,1.0,0.0,133,30,9
4,0.0,1.0,0.0,129,38,12
5,0.0,1.0,0.0,180,90,30
6,0.0,1.0,0.0,190,80,25
7,0.0,1.0,0.0,175,90,35
8,0.0,1.0,0.0,177,60,22
9,0.0,0.0,1.0,185,105,33


In [11]:
# Predict gender from the other columns; hold out about a third of the rows for testing.
x_train, x_test, y_train, y_test = train_test_split(final_df, genders_df, test_size=0.33, random_state=0)

In [12]:
lr = LinearRegression()
lr.fit(x_train, y_train)

In [13]:
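pred = lr.predict(x_test)
# The regression output is continuous, so threshold it at 0.5 to get a 0/1 gender label.
threshold = 0.5
pred_binary = (pred > threshold).astype(int)
pred_binary == y_test  # True where the predicted label matches the actual one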

,gender
20,False
10,True
14,True
13,True
1,True
21,False
11,True
19,True
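
The comparison above can be condensed into a single accuracy figure. This short sketch copies the eight True/False values from the output rather than recomputing them, so it runs without the dataset.

```python
import numpy as np

# Match flags copied from the output above, in the same row order.
matches = np.array([False, True, True, True, True, False, True, True])

# The mean of a boolean array is the fraction of True values.
accuracy = matches.mean()
print(accuracy)  # 0.75 (6 of 8 test rows predicted correctly)
```

In the notebook itself, the same number should be obtainable from the comparison result, for example with `(pred_binary == y_test).values.mean()`.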


### Other: Predicting Height


In [17]:
# Use height as the target this time; the 3:4 slice keeps it as a one-column DataFrame.
height_df = final_df.iloc[:, 3:4]
height_df

,height
0,130
1,125
2,135
3,133
4,129
5,180
6,190
7,175
8,177
9,185


In [42]:
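# Features for predicting height: country dummies, weight and age, plus gender.
part_1 = final_df.iloc[:, :3]   # fr, tr, us
part_2 = final_df.iloc[:, 4:]   # weight, age (height at column 3 is skipped)
part_3 = genders_df
other_final_df = pd.concat([part_1, part_2, part_3], axis=1)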
other_final_df

,fr,tr,us,weight,age,gender
0,0.0,1.0,0.0,30,10,0.0
1,0.0,1.0,0.0,36,11,0.0
2,0.0,1.0,0.0,34,10,1.0
3,0.0,1.0,0.0,30,9,1.0
4,0.0,1.0,0.0,38,12,0.0
5,0.0,1.0,0.0,90,30,0.0
6,0.0,1.0,0.0,80,25,0.0
7,0.0,1.0,0.0,90,35,0.0
8,0.0,1.0,0.0,60,22,1.0
9,0.0,0.0,1.0,105,33,0.0


In [43]:
x_train, x_test, y_train, y_test = train_test_split(other_final_df, height_df, test_size=.33, random_state=0)

In [49]:
lr = LinearRegression()
lr.fit(x_train, y_train)
pred = lr.predict(x_test)
result = pred - y_test  # prediction error (predicted minus actual height) per test row
result

,height
20,18.266387
10,-12.128385
14,-4.206136
13,-3.693314
1,5.82889
21,7.961384
11,-4.872173
19,-1.731011
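
The per-row errors above are easier to compare across models once they are reduced to a summary metric. The sketch below computes the mean absolute error and root mean squared error from the eight residuals copied out of the output; the choice of these two metrics is an assumption, since the notebook does not name one.

```python
import numpy as np

# Residuals (prediction minus actual height) copied from the output above.
residuals = np.array([
    18.266387, -12.128385, -4.206136, -3.693314,
    5.82889, 7.961384, -4.872173, -1.731011,
])

mae = np.abs(residuals).mean()            # mean absolute error
rmse = np.sqrt((residuals ** 2).mean())   # root mean squared error
print(f"MAE: {mae:.2f}, RMSE: {rmse:.2f}")
```

RMSE weights large misses more heavily than MAE, so the single error of about 18 on row 20 pulls it above the MAE.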


### Backward Elimination


In [52]:
import statsmodels.api as sm

In [79]:
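# Start with all six predictors: fr, tr, us, weight, age, gender.
# No separate constant is added; the three country dummies sum to 1 and act as the intercept.
X_l = other_final_df.iloc[:, [0, 1, 2, 3, 4, 5]].values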
X_l = np.array(X_l, dtype=float)
model = sm.OLS(height_df.values, X_l).fit()
model.summary()

Dep. Variable:,y,R-squared:,0.885
Model:,OLS,Adj. R-squared:,0.849
Method:,Least Squares,F-statistic:,24.69
Date:,"Fri, 06 Sep 2024",Prob (F-statistic):,5.41e-07
Time:,18:33:33,Log-Likelihood:,-73.95
No. Observations:,22,AIC:,159.9
Df Residuals:,16,BIC:,166.4
Df Model:,5,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
x1,103.4708,9.878,10.475,0.000,82.530,124.412
x2,97.7050,7.463,13.091,0.000,81.883,113.527
x3,93.8734,11.312,8.298,0.000,69.892,117.855
x4,0.9211,0.119,7.737,0.000,0.669,1.174
x5,0.0814,0.221,0.369,0.717,-0.386,0.549
x6,10.5980,5.052,2.098,0.052,-0.112,21.308

Omnibus:,1.031,Durbin-Watson:,2.759
Prob(Omnibus):,0.597,Jarque-Bera (JB):,0.624
Skew:,0.407,Prob(JB):,0.732
Kurtosis:,2.863,Cond. No.,678.0


In [80]:
# Drop age (column 4), which had the highest p-value (0.717) in the previous fit.
X_l = other_final_df.iloc[:, [0, 1, 2, 3, 5]].values
X_l = np.array(X_l, dtype=float)
model = sm.OLS(height_df.values, X_l).fit()
model.summary()

Dep. Variable:,y,R-squared:,0.884
Model:,OLS,Adj. R-squared:,0.857
Method:,Least Squares,F-statistic:,32.47
Date:,"Fri, 06 Sep 2024",Prob (F-statistic):,9.32e-08
Time:,18:34:26,Log-Likelihood:,-74.043
No. Observations:,22,AIC:,158.1
Df Residuals:,17,BIC:,163.5
Df Model:,4,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
x1,104.5490,9.193,11.373,0.000,85.153,123.944
x2,97.9693,7.238,13.536,0.000,82.699,113.240
x3,95.4352,10.220,9.338,0.000,73.873,116.998
x4,0.9405,0.104,9.029,0.000,0.721,1.160
x5,11.1093,4.733,2.347,0.031,1.123,21.096

Omnibus:,0.871,Durbin-Watson:,2.719
Prob(Omnibus):,0.647,Jarque-Bera (JB):,0.459
Skew:,0.351,Prob(JB):,0.795
Kurtosis:,2.91,Cond. No.,596.0
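
The adjusted R-squared values in the two summaries above can be reproduced by hand, which shows why removing age raises the figure even though plain R-squared barely moves. This sketch uses the usual formula for a model with an intercept. It assumes statsmodels treats the three country dummies as an implied intercept, which the reported degrees of freedom suggest; `adjusted_r2` is a hypothetical helper name.

```python
def adjusted_r2(r2, n_obs, df_resid):
    """Adjusted R^2 for a model that contains an intercept (explicit or implied)."""
    return 1 - (1 - r2) * (n_obs - 1) / df_resid


# Values read off the two summaries above: R-squared, No. Observations, Df Residuals.
full = adjusted_r2(0.885, 22, 16)         # all six predictors
without_age = adjusted_r2(0.884, 22, 17)  # age removed

print(round(full, 3), round(without_age, 3))  # 0.849 0.857
```

Dropping a predictor frees one residual degree of freedom, so the penalty term shrinks; when R-squared hardly changes, the adjusted value goes up.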


In [81]:
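# Also drop gender (column 5, p = 0.031 in the previous fit), leaving the country dummies and weight.
X_l = other_final_df.iloc[:, [0, 1, 2, 3]].values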
X_l = np.array(X_l, dtype=float)
model = sm.OLS(height_df.values, X_l).fit()
model.summary()

Dep. Variable:,y,R-squared:,0.847
Model:,OLS,Adj. R-squared:,0.821
Method:,Least Squares,F-statistic:,33.16
Date:,"Fri, 06 Sep 2024",Prob (F-statistic):,1.52e-07
Time:,18:35:04,Log-Likelihood:,-77.131
No. Observations:,22,AIC:,162.3
Df Residuals:,18,BIC:,166.6
Df Model:,3,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
x1,119.8136,7.265,16.491,0.000,104.550,135.077
x2,109.8084,5.804,18.919,0.000,97.615,122.002
x3,114.4212,6.984,16.382,0.000,99.747,129.095
x4,0.7904,0.092,8.595,0.000,0.597,0.984

Omnibus:,2.925,Durbin-Watson:,2.855
Prob(Omnibus):,0.232,Jarque-Bera (JB):,1.499
Skew:,0.605,Prob(JB):,0.473
Kurtosis:,3.416,Cond. No.,369.0


In [59]:
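# Experiment: append an explicit constant column of ones ("temp").
# With all three country dummies kept, this column is redundant (the dummies already sum to 1).
pd.concat([other_final_df, pd.DataFrame(np.ones((len(other_final_df), 1)).astype(int), columns=["temp"])], axis=1)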

,fr,tr,us,weight,age,gender,temp
0,0.0,1.0,0.0,30,10,0.0,1
1,0.0,1.0,0.0,36,11,0.0,1
2,0.0,1.0,0.0,34,10,1.0,1
3,0.0,1.0,0.0,30,9,1.0,1
4,0.0,1.0,0.0,38,12,0.0,1
5,0.0,1.0,0.0,90,30,0.0,1
6,0.0,1.0,0.0,80,25,0.0,1
7,0.0,1.0,0.0,90,35,0.0,1
8,0.0,1.0,0.0,60,22,1.0,1
9,0.0,0.0,1.0,105,33,0.0,1
