# 2- Multiple Linear Regression

## THEORY

**The main goal is to find the linear function that describes the relationship between the dependent variable and the independent variables.**

* The coefficients are estimated by minimizing the sum of squared errors (ordinary least squares).
* Multiple linear regression works well on datasets with many observations; the number of observations should exceed the number of estimated coefficients.
* Any number of independent variables may be used, but there must be exactly one dependent variable and it must be continuous. Categorical independent variables have to be encoded numerically first.

The goals of a data scientist or analyst are:
1. To predict the values of the dependent variable from the variables found to affect it.
2. To determine which of the independent variables thought to affect the dependent variable actually do so, in which direction and in what way, and to describe the relationship between them.
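
As a minimal sketch of the least-squares idea described above, the following example generates synthetic data from a known linear function and recovers its coefficients by minimizing the sum of squared errors with NumPy. The data, the coefficient values, and all variable names are assumptions made for illustration; they are not part of the Advertising dataset used later.

```python
import numpy as np

# Synthetic data (assumed for illustration): y = 3 + 2*x1 - 1*x2 + noise
rng = np.random.default_rng(0)
n = 200
X = rng.uniform(0, 10, size=(n, 2))
true_beta = np.array([3.0, 2.0, -1.0])  # intercept, x1, x2
y = true_beta[0] + X @ true_beta[1:] + rng.normal(0, 0.1, size=n)

# Add a column of ones so the first coefficient acts as the intercept
X_design = np.column_stack([np.ones(n), X])

# np.linalg.lstsq returns the beta that minimizes ||y - X_design @ beta||^2
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)

residuals = y - X_design @ beta_hat
sse = float(np.sum(residuals ** 2))  # sum of squared errors at the optimum

print("estimated coefficients:", beta_hat)
print("sum of squared errors:", sse)
```

Because the noise is small, the estimated coefficients should land close to the values used to generate the data. Any other coefficient vector gives a sum of squared errors at least as large.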

## APPLICATION

### Examining and Manipulating the Dataset

* Libraries Used

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split, cross_val_score

import warnings
warnings.filterwarnings("ignore", category=UserWarning)

* Loading and Examining the Dataset

You can download the dataset [here](https://mrkizmaz-s3data.s3.eu-west-1.amazonaws.com/DataSets/Advertising.csv).

In [2]:
url = "https://mrkizmaz-s3data.s3.eu-west-1.amazonaws.com/DataSets/Advertising.csv"

ad = pd.read_csv(url, usecols = [1, 2, 3, 4])
df = ad.copy()
df.head()

|  | TV | radio | newspaper | sales |
|---|---|---|---|---|
| 0 | 230.1 | 37.8 | 69.2 | 22.1 |
| 1 | 44.5 | 39.3 | 45.1 | 10.4 |
| 2 | 17.2 | 45.9 | 69.3 | 9.3 |
| 3 | 151.5 | 41.3 | 58.5 | 18.5 |
| 4 | 180.8 | 10.8 | 58.4 | 12.9 |


In [3]:
df.shape, df.ndim, df.size # shape, number of dimensions, and total element count of the dataset

((200, 4), 2, 800)

In [4]:
df.isnull().values.any() # are there any missing (null) values in the dataset?

False

In [5]:
df.info() # column types and non-null counts

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   TV         200 non-null    float64
 1   radio      200 non-null    float64
 2   newspaper  200 non-null    float64
 3   sales      200 non-null    float64
dtypes: float64(4)
memory usage: 6.4 KB


In [6]:
df.describe() # summary statistics of the dataset

|  | TV | radio | newspaper | sales |
|---|---|---|---|---|
| count | 200.0 | 200.0 | 200.0 | 200.0 |
| mean | 147.0425 | 23.264 | 30.554 | 14.0225 |
| std | 85.854236 | 14.846809 | 21.778621 | 5.217457 |
| min | 0.7 | 0.0 | 0.3 | 1.6 |
| 25% | 74.375 | 9.975 | 12.75 | 10.375 |
| 50% | 149.75 | 22.9 | 25.75 | 12.9 |
| 75% | 218.825 | 36.525 | 45.1 | 17.4 |
| max | 296.4 | 49.6 | 114.0 | 27.0 |


In [7]:
df.corr() # pairwise correlations between the variables

|  | TV | radio | newspaper | sales |
|---|---|---|---|---|
| TV | 1.0 | 0.054809 | 0.056648 | 0.782224 |
| radio | 0.054809 | 1.0 | 0.354104 | 0.576223 |
| newspaper | 0.056648 | 0.354104 | 1.0 | 0.228299 |
| sales | 0.782224 | 0.576223 | 0.228299 | 1.0 |


### Modeling with Statsmodels

#### Model Setup

In [8]:
# Method 1
X1 = df.drop('sales', axis = 1) # independent variables (all columns except sales)
y1 = df['sales'] # dependent variable

X1 = sm.add_constant(X1) # sm.OLS does not add an intercept by itself, so a constant column is required
lm1 = sm.OLS(y1, X1)
model1 = lm1.fit()
model1.summary() # summary table of the fitted model

|  |  |  |  |
|---|---|---|---|
| Dep. Variable: | sales | R-squared: | 0.897 |
| Model: | OLS | Adj. R-squared: | 0.896 |
| Method: | Least Squares | F-statistic: | 570.3 |
| Date: | Fri, 18 Feb 2022 | Prob (F-statistic): | 1.58e-96 |
| Time: | 01:56:08 | Log-Likelihood: | -386.18 |
| No. Observations: | 200 | AIC: | 780.4 |
| Df Residuals: | 196 | BIC: | 793.6 |
| Df Model: | 3 |  |  |
| Covariance Type: | nonrobust |  |  |

|  | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 2.9389 | 0.312 | 9.422 | 0.000 | 2.324 | 3.554 |
| TV | 0.0458 | 0.001 | 32.809 | 0.000 | 0.043 | 0.049 |
| radio | 0.1885 | 0.009 | 21.893 | 0.000 | 0.172 | 0.206 |
| newspaper | -0.0010 | 0.006 | -0.177 | 0.860 | -0.013 | 0.011 |

|  |  |  |  |
|---|---|---|---|
| Omnibus: | 60.414 | Durbin-Watson: | 2.084 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 151.241 |
| Skew: | -1.327 | Prob(JB): | 1.44e-33 |
| Kurtosis: | 6.332 | Cond. No. | 454.0 |
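
The Adj. R-squared and F-statistic entries of the summary can be reproduced from R-squared, the number of observations, and the number of predictors. The sketch below uses the standard textbook formulas with the values reported for this model; the variable names are illustrative assumptions.

```python
# Values reported for the model in this section
r2 = 0.8972106381789521  # R-squared at full precision (model1.rsquared)
n = 200                  # No. Observations
p = 3                    # Df Model (number of predictors, excluding the constant)

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

# F-statistic for the null hypothesis that all slope coefficients are zero
f_stat = (r2 / p) / ((1 - r2) / (n - p - 1))

print(round(adj_r2, 3))  # 0.896, matching Adj. R-squared
print(round(f_stat, 1))  # 570.3, matching F-statistic
```

Note that n - p - 1 = 196 is the Df Residuals entry of the same table.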


In [9]:
X1.head()

|  | const | TV | radio | newspaper |
|---|---|---|---|---|
| 0 | 1.0 | 230.1 | 37.8 | 69.2 |
| 1 | 1.0 | 44.5 | 39.3 | 45.1 |
| 2 | 1.0 | 17.2 | 45.9 | 69.3 |
| 3 | 1.0 | 151.5 | 41.3 | 58.5 |
| 4 | 1.0 | 180.8 | 10.8 | 58.4 |


In [10]:
y1.head()

0    22.1
1    10.4
2     9.3
3    18.5
4    12.9
Name: sales, dtype: float64

In [11]:
model1.params # estimated model parameters

const        2.938889
TV           0.045765
radio        0.188530
newspaper   -0.001037
dtype: float64

In [12]:
model1.summary().tables[1] # coefficient table

|  | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 2.9389 | 0.312 | 9.422 | 0.000 | 2.324 | 3.554 |
| TV | 0.0458 | 0.001 | 32.809 | 0.000 | 0.043 | 0.049 |
| radio | 0.1885 | 0.009 | 21.893 | 0.000 | 0.172 | 0.206 |
| newspaper | -0.0010 | 0.006 | -0.177 | 0.860 | -0.013 | 0.011 |


In [13]:
model1.f_pvalue # overall significance of the model (p-value of the F-test)

1.575227256092437e-96

In [14]:
print("Model significance (p-value): %.5f" %model1.f_pvalue) # the model is significant if p-value < 0.05

Model significance (p-value): 0.00000
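
A fixed-point format with five decimals prints a very small p-value as zero, which hides its actual magnitude. As a small sketch, the p-value reported above can be printed in scientific notation instead (the variable name is an illustrative assumption):

```python
p_value = 1.575227256092437e-96  # value of model1.f_pvalue reported above

print("%.5f" % p_value)   # fixed-point, 5 decimals: 0.00000
print(f"{p_value:.3e}")   # scientific notation: 1.575e-96
print(p_value < 0.05)     # True, so the model is significant at the 5% level
```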


In [15]:
model1.rsquared * 100 # percentage of the variance in sales explained by the model (R-squared)

89.72106381789521

In [16]:
# Method 2
# dependent variable: sales, independent variables: TV, radio, newspaper
lm2 = smf.ols(formula = 'sales ~ TV + radio + newspaper', data = df)
model2 = lm2.fit()
model2.summary()

|  |  |  |  |
|---|---|---|---|
| Dep. Variable: | sales | R-squared: | 0.897 |
| Model: | OLS | Adj. R-squared: | 0.896 |
| Method: | Least Squares | F-statistic: | 570.3 |
| Date: | Fri, 18 Feb 2022 | Prob (F-statistic): | 1.58e-96 |
| Time: | 01:56:08 | Log-Likelihood: | -386.18 |
| No. Observations: | 200 | AIC: | 780.4 |
| Df Residuals: | 196 | BIC: | 793.6 |
| Df Model: | 3 |  |  |
| Covariance Type: | nonrobust |  |  |

|  | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 2.9389 | 0.312 | 9.422 | 0.000 | 2.324 | 3.554 |
| TV | 0.0458 | 0.001 | 32.809 | 0.000 | 0.043 | 0.049 |
| radio | 0.1885 | 0.009 | 21.893 | 0.000 | 0.172 | 0.206 |
| newspaper | -0.0010 | 0.006 | -0.177 | 0.860 | -0.013 | 0.011 |

|  |  |  |  |
|---|---|---|---|
| Omnibus: | 60.414 | Durbin-Watson: | 2.084 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 151.241 |
| Skew: | -1.327 | Prob(JB): | 1.44e-33 |
| Kurtosis: | 6.332 | Cond. No. | 454.0 |


In [17]:
model2.params # model parameters

Intercept    2.938889
TV           0.045765
radio        0.188530
newspaper   -0.001037
dtype: float64

In [18]:
print("Model significance (p-value): %.5f" %model2.f_pvalue) # should be less than 0.05

Model significance (p-value): 0.00000


In [19]:
model2.rsquared * 100 # R-squared of the model as a percentage

89.72106381789521

In [20]:
# Conclusion: both approaches fit the same OLS model and give the same results, so either one can be used in practice.

In [21]:
# Model equation
denklem_stats = "Equation obtained with Statsmodels: Sales = {} + TV * {} + radio * {} + newspaper * {}".format(
    model1.params['const'],  # label-based access; positional access such as params[0] is deprecated in recent pandas
    model1.params['TV'],
    model1.params['radio'],
    model1.params['newspaper'])
denklem_stats

'Equation obtained with Statsmodels: Sales = 2.9388893694594076 + TV * 0.04576464545539764 + radio * 0.1885300169182042 + newspaper * -0.0010374930424762382'

#### Prediction

In [22]:
df.head()

|  | TV | radio | newspaper | sales |
|---|---|---|---|---|
| 0 | 230.1 | 37.8 | 69.2 | 22.1 |
| 1 | 44.5 | 39.3 | 45.1 | 10.4 |
| 2 | 17.2 | 45.9 | 69.3 | 9.3 |
| 3 | 151.5 | 41.3 | 58.5 | 18.5 |
| 4 | 180.8 | 10.8 | 58.4 | 12.9 |


* For example, what is the predicted value of sales when the advertising spend is 130 units on TV, 40 units on radio, and 10 units on newspaper?

In [23]:
birimler = [[1, 130, 40, 10]] # the value for the constant (intercept) column must be given as 1

In [24]:
model1.predict(birimler)

array([16.41911902])
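
The same prediction can be checked by hand by plugging the spend values into the fitted equation. This is a small sketch that copies the coefficients reported by model1.params in this section; the dictionary names are illustrative assumptions.

```python
# Coefficients copied from the model1.params output
params = {
    "const": 2.9388893694594076,
    "TV": 0.04576464545539764,
    "radio": 0.1885300169182042,
    "newspaper": -0.0010374930424762382,
}
# Advertising spend for the example; the constant term is always multiplied by 1
spend = {"const": 1, "TV": 130, "radio": 40, "newspaper": 10}

# Prediction = intercept + sum of (coefficient * spend) over the three channels
prediction = sum(params[name] * spend[name] for name in params)
print(round(prediction, 8))  # 16.41911902
```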

In [25]:
# Computing the model's performance metrics

In [26]:
y1.head() # actual y values

0    22.1
1    10.4
2     9.3
3    18.5
4    12.9
Name: sales, dtype: float64

In [27]:
ypred1 = model1.predict(X1) # y values predicted from the independent variables

In [28]:
ypred1[0:5]

0    20.523974
1    12.337855
2    12.307671
3    17.597830
4    13.188672
dtype: float64

In [29]:
mse = mean_squared_error(y1, ypred1) # MSE
rmse_stats = np.sqrt(mse) # RMSE

In [30]:
rmse_stats # model error (RMSE, computed on the data the model was fitted on) [important]

1.66857014072257

In [31]:
r2_stats = r2_score(y1, ypred1)

In [32]:
r2_stats * 100 # R-squared of the model as a percentage [important]

89.72106381789521

In [33]:
print(denklem_stats)
print(f"Model error (RMSE): {rmse_stats}")
print("Model R-squared (%): ", r2_stats * 100)

Equation obtained with Statsmodels: Sales = 2.9388893694594076 + TV * 0.04576464545539764 + radio * 0.1885300169182042 + newspaper * -0.0010374930424762382
Model error (RMSE): 1.66857014072257
Model R-squared (%):  89.72106381789521
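
To make the two metrics concrete, the sketch below computes RMSE and R-squared directly from their definitions with NumPy. It uses only the first five actual and predicted values shown earlier in this section, so its results illustrate the formulas and will not equal the full-dataset figures. The helper names rmse and r2 are illustrative assumptions.

```python
import numpy as np

# First five actual and predicted values from the outputs above (illustrative subset)
y_true = np.array([22.1, 10.4, 9.3, 18.5, 12.9])
y_pred = np.array([20.523974, 12.337855, 12.307671, 17.597830, 13.188672])

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """R-squared: 1 - (residual sum of squares / total sum of squares)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1 - ss_res / ss_tot)

print("RMSE on the subset:", rmse(y_true, y_pred))
print("R-squared on the subset:", r2(y_true, y_pred))
```

RMSE is expressed in the same units as sales, while R-squared is unitless, which is why the two are usually reported together.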


### Modeling with scikit-learn

#### Model Setup

In [34]:
X = df.drop('sales', axis = 1) # independent variables
y = df['sales'] # dependent variable

# 80% training set, 20% test set
X_train, X_test, y_train, y_test = train_test_split(X,
                                                   y,
                                                   test_size = 0.20,
                                                   random_state = 42)
# create the model object and fit it
lm = LinearRegression() # model object
model = lm.fit(X_train, y_train) # fits the model on the training set
model

LinearRegression()

In [35]:
model.intercept_ # intercept

2.9790673381226256

In [36]:
model.coef_ # coefficients

array([0.04472952, 0.18919505, 0.00276111])

In [37]:
model.score(X_train, y_train) # R-squared of the model on the training set

0.8957008271017817

In [38]:
denklem_sklearn = "Equation obtained with scikit-learn: Sales = {} + TV * {} + radio * {} + newspaper * {}".format(
    model.intercept_,
    model.coef_[0],
    model.coef_[1],
    model.coef_[2])
print(denklem_sklearn)

Equation obtained with scikit-learn: Sales = 2.9790673381226256 + TV * 0.044729517468716326 + radio * 0.1891950542343766 + newspaper * 0.0027611143413672056

#### Prediction

In [39]:
y.head(10) # actual y values

0    22.1
1    10.4
2     9.3
3    18.5
4    12.9
5     7.2
6    11.8
7    13.2
8     4.8
9    10.6
Name: sales, dtype: float64

In [40]:
model.predict([[130,40,10]]) # example prediction: what is sales for 130 TV, 40 radio, 10 newspaper?

array([16.38931792])

In [41]:
ypred = model.predict(X_test)
ypred[0:10] # predicted y values for the first ten rows of the test set

array([16.4080242 , 20.88988209, 21.55384318, 10.60850256, 22.11237326,
       13.10559172, 21.05719192,  7.46101034, 13.60634581, 15.15506967])

In [42]:
mse = mean_squared_error(y_test, ypred) # MSE
rmse_sklearn = np.sqrt(mse) # RMSE

In [43]:
rmse_sklearn # model error (RMSE on the test set) [important]

1.7815996615334502

In [44]:
r2_sklearn = r2_score(y_test, ypred)

In [45]:
r2_sklearn * 100 # R-squared on the test set as a percentage [important]

89.9438024100912

In [46]:
print(denklem_sklearn)
print(f"Model error (RMSE): {rmse_sklearn}")
print("Model R-squared (%): ", r2_sklearn * 100)

Equation obtained with scikit-learn: Sales = 2.9790673381226256 + TV * 0.044729517468716326 + radio * 0.1891950542343766 + newspaper * 0.0027611143413672056
Model error (RMSE): 1.7815996615334502
Model R-squared (%):  89.9438024100912


#### Model Tuning / Model Validation

* The train/test split depends on random_state (42, 120, 10, ...); when it changes, the fitted parameters and the test scores change as well.
* To reduce this dependence on a single split, cross-validation is applied to the model.
* This gives a more reliable estimate of the model's performance.
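
As a rough sketch of what k-fold cross-validation does, the example below splits synthetic data into five folds, fits a least-squares model on four of them, and scores the held-out fold, repeating this for every fold. It is written with NumPy only and is not how scikit-learn implements cross_val_score. The helper names, the synthetic data, and the choice to shuffle the rows before splitting are all assumptions made for illustration.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients for y ~ intercept + X."""
    X_design = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
    return beta

def r2_score_np(X, y, beta):
    """R-squared of the fitted coefficients on (X, y)."""
    pred = np.column_stack([np.ones(len(X)), X]) @ beta
    return 1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)

def k_fold_r2(X, y, k=5, seed=0):
    """Shuffle the rows, split them into k folds, and score each held-out fold."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        beta = fit_ols(X[train_idx], y[train_idx])
        scores.append(r2_score_np(X[test_idx], y[test_idx], beta))
    return np.array(scores)

# Synthetic data, assumed for illustration only
rng = np.random.default_rng(42)
X = rng.uniform(0, 100, size=(200, 3))
y = 3 + X @ np.array([0.05, 0.2, 0.0]) + rng.normal(0, 1.5, size=200)

scores = k_fold_r2(X, y, k=5)
print("per-fold R-squared:", scores)
print("mean R-squared:", scores.mean())
```

Every observation is used for testing exactly once, so the mean of the per-fold scores depends less on one particular split than a single train/test evaluation does.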

In [47]:
# R-squared values from 5-fold cross-validation on the training set
cross_val_score(model, X_train, y_train, cv = 5, scoring = "r2")

array([0.71981527, 0.92992247, 0.92652848, 0.91883369, 0.80234225])

In [48]:
r2_real = cross_val_score(model, X_train, y_train, cv = 5, scoring = "r2").mean() # mean of the R-squared values

In [49]:
r2_real * 100 # cross-validated R-squared as a percentage [important]

85.94884313276513

In [50]:
rmse_real = np.sqrt(-cross_val_score(model, X_train, y_train, cv = 10, scoring = "neg_mean_squared_error").mean()) # RMSE (note: 10 folds here, 5 folds for R-squared above)

In [51]:
rmse_real # cross-validated model error (RMSE) [important]

1.7201075950880973

In [54]:
print("Metrics obtained with cross-validation:")
print("Model error (RMSE): ", rmse_real)
print("Model R-squared (%): ", r2_real * 100)

Metrics obtained with cross-validation:
Model error (RMSE):  1.7201075950880973
Model R-squared (%):  85.94884313276513


## CONCLUSION

* Equation obtained with Statsmodels: Sales = 2.9388893694594076 + TV * 0.04576464545539764 + radio * 0.1885300169182042 + newspaper * -0.0010374930424762382
    * Model error (RMSE, on the full dataset): **1.66857014072257**
    * Model R-squared (%): **89.72106381789521**
* Equation obtained with scikit-learn: Sales = 2.9790673381226256 + TV * 0.044729517468716326 + radio * 0.1891950542343766 + newspaper * 0.0027611143413672056
    * Model error (RMSE, on the test set): **1.7815996615334502**
    * Model R-squared (%): **89.9438024100912**
* Metrics obtained with cross-validation:
    * Model error (RMSE): **1.7201075950880973**
    * Model R-squared (%): **85.94884313276513**
