# 7- ElasticNet (ENET) Regression

## THEORY

**The goal is the same as in ridge and lasso regression, but elastic net combines the two: it applies a ridge-style (L2) shrinkage penalty together with lasso-style (L1) variable selection.**
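How the two penalties are mixed can be made concrete with a small sketch. It assumes the parameterization used by scikit-learn's `ElasticNet`, where `alpha` sets the overall penalty strength and `l1_ratio` sets the share of the L1 part. The helper `elastic_net_penalty` and the vector `w` are hypothetical names introduced only for illustration.

```python
import numpy as np

def elastic_net_penalty(coef, alpha=1.0, l1_ratio=0.5):
    """Penalty term in the form used by scikit-learn's ElasticNet objective."""
    coef = np.asarray(coef, dtype=float)
    l1 = np.sum(np.abs(coef))   # lasso-style part: sum of absolute values
    l2 = np.sum(coef ** 2)      # ridge-style part: sum of squares
    return alpha * l1_ratio * l1 + 0.5 * alpha * (1.0 - l1_ratio) * l2

w = np.array([3.0, 0.0, -4.0])  # hypothetical coefficient vector

penalty_mix = elastic_net_penalty(w, alpha=1.0, l1_ratio=0.5)  # both parts
penalty_l1 = elastic_net_penalty(w, alpha=1.0, l1_ratio=1.0)   # pure lasso penalty
penalty_l2 = elastic_net_penalty(w, alpha=1.0, l1_ratio=0.0)   # pure ridge-style penalty
print(penalty_mix, penalty_l1, penalty_l2)
```

With `l1_ratio = 1` only the L1 term remains, and with `l1_ratio = 0` only the squared (L2) term remains; values in between blend the two.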

## APPLICATION

### Exploring and Preparing the Dataset

* Libraries Used

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

import warnings
warnings.filterwarnings("ignore", category = UserWarning)

* Exploring the Dataset

You can download the dataset [here](https://mrkizmaz-s3data.s3.eu-west-1.amazonaws.com/DataSets/Hitters.csv).

In [3]:
hit = pd.read_csv("https://mrkizmaz-s3data.s3.eu-west-1.amazonaws.com/DataSets/Hitters.csv")
df = hit.copy()
df.head()

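| | AtBat | Hits | HmRun | Runs | RBI | Walks | Years | CAtBat | CHits | CHmRun | CRuns | CRBI | CWalks | League | Division | PutOuts | Assists | Errors | Salary | NewLeague |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 293 | 66 | 1 | 30 | 29 | 14 | 1 | 293 | 66 | 1 | 30 | 29 | 14 | A | E | 446 | 33 | 20 | NaN | A |
| 1 | 315 | 81 | 7 | 24 | 38 | 39 | 14 | 3449 | 835 | 69 | 321 | 414 | 375 | N | W | 632 | 43 | 10 | 475.0 | N |
| 2 | 479 | 130 | 18 | 66 | 72 | 76 | 3 | 1624 | 457 | 63 | 224 | 266 | 263 | A | W | 880 | 82 | 14 | 480.0 | A |
| 3 | 496 | 141 | 20 | 65 | 78 | 37 | 11 | 5628 | 1575 | 225 | 828 | 838 | 354 | N | E | 200 | 11 | 3 | 500.0 | N |
| 4 | 321 | 87 | 10 | 39 | 42 | 30 | 2 | 396 | 101 | 12 | 48 | 46 | 33 | N | E | 805 | 40 | 4 | 91.5 | N |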


In [4]:
df.isnull().values.any() # are there any missing values in the dataset?

True

In [5]:
df = df.dropna(axis = 0) # drops the rows that contain missing values
df.isnull().values.any()

False
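Before dropping rows it can help to see which columns actually contain the missing values. The following is a minimal sketch on a hypothetical three-row frame (`toy` is not the Hitters data), using the same `isnull` and `dropna` calls as the cells above.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame with one missing Salary value
toy = pd.DataFrame({
    "Hits": [66, 81, 130],
    "Salary": [np.nan, 475.0, 480.0],
})

missing_per_column = toy.isnull().sum()  # count of missing values per column
toy_clean = toy.dropna(axis=0)           # drop every row with a missing value

print(missing_per_column)
print(len(toy), "->", len(toy_clean))
```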

In [6]:
df.info() # general information about the dataset

<class 'pandas.core.frame.DataFrame'>
Int64Index: 263 entries, 1 to 321
Data columns (total 20 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   AtBat      263 non-null    int64  
 1   Hits       263 non-null    int64  
 2   HmRun      263 non-null    int64  
 3   Runs       263 non-null    int64  
 4   RBI        263 non-null    int64  
 5   Walks      263 non-null    int64  
 6   Years      263 non-null    int64  
 7   CAtBat     263 non-null    int64  
 8   CHits      263 non-null    int64  
 9   CHmRun     263 non-null    int64  
 10  CRuns      263 non-null    int64  
 11  CRBI       263 non-null    int64  
 12  CWalks     263 non-null    int64  
 13  League     263 non-null    object 
 14  Division   263 non-null    object 
 15  PutOuts    263 non-null    int64  
 16  Assists    263 non-null    int64  
 17  Errors     263 non-null    int64  
 18  Salary     263 non-null    float64
 19  NewLeague  263 non-null    object 
dtypes: float64

In [7]:
df.describe().T # summary statistics of the dataset

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| AtBat | 263.0 | 403.642586 | 147.307209 | 19.0 | 282.5 | 413.0 | 526.0 | 687.0 |
| Hits | 263.0 | 107.828897 | 45.125326 | 1.0 | 71.5 | 103.0 | 141.5 | 238.0 |
| HmRun | 263.0 | 11.619772 | 8.757108 | 0.0 | 5.0 | 9.0 | 18.0 | 40.0 |
| Runs | 263.0 | 54.745247 | 25.539816 | 0.0 | 33.5 | 52.0 | 73.0 | 130.0 |
| RBI | 263.0 | 51.486692 | 25.882714 | 0.0 | 30.0 | 47.0 | 71.0 | 121.0 |
| Walks | 263.0 | 41.114068 | 21.718056 | 0.0 | 23.0 | 37.0 | 57.0 | 105.0 |
| Years | 263.0 | 7.311787 | 4.793616 | 1.0 | 4.0 | 6.0 | 10.0 | 24.0 |
| CAtBat | 263.0 | 2657.543726 | 2286.582929 | 19.0 | 842.5 | 1931.0 | 3890.5 | 14053.0 |
| CHits | 263.0 | 722.186312 | 648.199644 | 4.0 | 212.0 | 516.0 | 1054.0 | 4256.0 |
| CHmRun | 263.0 | 69.239544 | 82.197581 | 0.0 | 15.0 | 40.0 | 92.5 | 548.0 |


In [8]:
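df.select_dtypes(include = "number").corr() # correlation between the numeric variables (object columns excluded explicitly, since newer pandas versions raise an error on them)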

| | AtBat | Hits | HmRun | Runs | RBI | Walks | Years | CAtBat | CHits | CHmRun | CRuns | CRBI | CWalks | PutOuts | Assists | Errors | Salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AtBat | 1.0 | 0.963969 | 0.555102 | 0.899829 | 0.796015 | 0.624448 | 0.012725 | 0.207166 | 0.225341 | 0.212422 | 0.237278 | 0.221393 | 0.132926 | 0.309607 | 0.342117 | 0.325577 | 0.394771 |
| Hits | 0.963969 | 1.0 | 0.530627 | 0.91063 | 0.788478 | 0.587311 | 0.018598 | 0.206678 | 0.235606 | 0.189364 | 0.238896 | 0.219384 | 0.122971 | 0.299688 | 0.303975 | 0.279876 | 0.438675 |
| HmRun | 0.555102 | 0.530627 | 1.0 | 0.631076 | 0.849107 | 0.440454 | 0.113488 | 0.217464 | 0.217496 | 0.492526 | 0.258347 | 0.349858 | 0.227183 | 0.250931 | -0.161602 | -0.009743 | 0.343028 |
| Runs | 0.899829 | 0.91063 | 0.631076 | 1.0 | 0.778692 | 0.697015 | -0.011975 | 0.171811 | 0.191327 | 0.229701 | 0.237831 | 0.202335 | 0.1637 | 0.27116 | 0.179258 | 0.192609 | 0.419859 |
| RBI | 0.796015 | 0.788478 | 0.849107 | 0.778692 | 1.0 | 0.569505 | 0.129668 | 0.278126 | 0.292137 | 0.44219 | 0.307226 | 0.387777 | 0.233619 | 0.312065 | 0.062902 | 0.150155 | 0.449457 |
| Walks | 0.624448 | 0.587311 | 0.440454 | 0.697015 | 0.569505 | 1.0 | 0.134793 | 0.26945 | 0.270795 | 0.349582 | 0.332977 | 0.312697 | 0.42914 | 0.280855 | 0.102523 | 0.081937 | 0.443867 |
| Years | 0.012725 | 0.018598 | 0.113488 | -0.011975 | 0.129668 | 0.134793 | 1.0 | 0.915681 | 0.897844 | 0.722371 | 0.876649 | 0.863809 | 0.837524 | -0.020019 | -0.085118 | -0.156512 | 0.400657 |
| CAtBat | 0.207166 | 0.206678 | 0.217464 | 0.171811 | 0.278126 | 0.26945 | 0.915681 | 1.0 | 0.995057 | 0.801676 | 0.982747 | 0.95073 | 0.906712 | 0.053393 | -0.007897 | -0.070478 | 0.526135 |
| CHits | 0.225341 | 0.235606 | 0.217496 | 0.191327 | 0.292137 | 0.270795 | 0.897844 | 0.995057 | 1.0 | 0.786652 | 0.984542 | 0.946797 | 0.890718 | 0.067348 | -0.013144 | -0.068036 | 0.54891 |
| CHmRun | 0.212422 | 0.189364 | 0.492526 | 0.229701 | 0.44219 | 0.349582 | 0.722371 | 0.801676 | 0.786652 | 1.0 | 0.825625 | 0.927903 | 0.810878 | 0.093822 | -0.188886 | -0.165369 | 0.524931 |


In [9]:
# The categorical variables in the dataset need to be converted to numeric dummy (0/1) variables,
dummy = pd.get_dummies(df[['League', 'Division', 'NewLeague']])
dummy.head()

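| | League_A | League_N | Division_E | Division_W | NewLeague_A | NewLeague_N |
|---|---|---|---|---|---|---|
| 1 | 0 | 1 | 0 | 1 | 0 | 1 |
| 2 | 1 | 0 | 0 | 1 | 1 | 0 |
| 3 | 0 | 1 | 1 | 0 | 0 | 1 |
| 4 | 0 | 1 | 1 | 0 | 0 | 1 |
| 5 | 1 | 0 | 0 | 1 | 1 | 0 |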


In [10]:
# the categorical variables must be dropped from the main dataset
df1 = df.drop(['League', 'Division', 'NewLeague'], axis = 1)
df1.head()

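| | AtBat | Hits | HmRun | Runs | RBI | Walks | Years | CAtBat | CHits | CHmRun | CRuns | CRBI | CWalks | PutOuts | Assists | Errors | Salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 315 | 81 | 7 | 24 | 38 | 39 | 14 | 3449 | 835 | 69 | 321 | 414 | 375 | 632 | 43 | 10 | 475.0 |
| 2 | 479 | 130 | 18 | 66 | 72 | 76 | 3 | 1624 | 457 | 63 | 224 | 266 | 263 | 880 | 82 | 14 | 480.0 |
| 3 | 496 | 141 | 20 | 65 | 78 | 37 | 11 | 5628 | 1575 | 225 | 828 | 838 | 354 | 200 | 11 | 3 | 500.0 |
| 4 | 321 | 87 | 10 | 39 | 42 | 30 | 2 | 396 | 101 | 12 | 48 | 46 | 33 | 805 | 40 | 4 | 91.5 |
| 5 | 594 | 169 | 4 | 74 | 51 | 35 | 11 | 4408 | 1133 | 19 | 501 | 336 | 194 | 282 | 421 | 25 | 750.0 |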


In [11]:
# dummy and df1 are concatenated to form the updated dataset,
df = pd.concat([df1, dummy[['League_N', 'Division_W', 'NewLeague_N']]], axis = 1)
df.head()

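| | AtBat | Hits | HmRun | Runs | RBI | Walks | Years | CAtBat | CHits | CHmRun | CRuns | CRBI | CWalks | PutOuts | Assists | Errors | Salary | League_N | Division_W | NewLeague_N |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 315 | 81 | 7 | 24 | 38 | 39 | 14 | 3449 | 835 | 69 | 321 | 414 | 375 | 632 | 43 | 10 | 475.0 | 1 | 1 | 1 |
| 2 | 479 | 130 | 18 | 66 | 72 | 76 | 3 | 1624 | 457 | 63 | 224 | 266 | 263 | 880 | 82 | 14 | 480.0 | 0 | 1 | 0 |
| 3 | 496 | 141 | 20 | 65 | 78 | 37 | 11 | 5628 | 1575 | 225 | 828 | 838 | 354 | 200 | 11 | 3 | 500.0 | 1 | 0 | 1 |
| 4 | 321 | 87 | 10 | 39 | 42 | 30 | 2 | 396 | 101 | 12 | 48 | 46 | 33 | 805 | 40 | 4 | 91.5 | 1 | 0 | 1 |
| 5 | 594 | 169 | 4 | 74 | 51 | 35 | 11 | 4408 | 1133 | 19 | 501 | 336 | 194 | 282 | 421 | 25 | 750.0 | 0 | 1 | 0 |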
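The three steps above (create dummies, drop the original categorical columns, concatenate one dummy per variable) can also be sketched as a single `pd.get_dummies` call with `columns=` and `drop_first=True`. The frame `toy` below is a hypothetical miniature, not the Hitters data; it only illustrates that the first category of each variable is dropped and the remaining indicator is kept.

```python
import pandas as pd

# Hypothetical miniature frame with two categorical columns
toy = pd.DataFrame({
    "Hits": [81, 130, 141],
    "League": ["N", "A", "N"],
    "Division": ["W", "W", "E"],
})

# Encode the listed columns in place and keep one indicator per variable
encoded = pd.get_dummies(toy, columns=["League", "Division"], drop_first=True)
print(encoded.columns.tolist())
```

Keeping a single indicator per two-level variable avoids carrying two perfectly redundant columns (for example `League_A` and `League_N`) into the model.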


### Building the Model

In [13]:
X = df.drop('Salary', axis = 1) # independent variables (everything except Salary)
y = df['Salary'] # dependent variable

In [14]:
# 20% test set, 80% training set,
X_train, X_test, y_train, y_test = train_test_split(X,
                                                   y,
                                                   test_size = 0.20,
                                                   random_state = 42)
# create the model object and fit it
enet = ElasticNet()
enetmodel = enet.fit(X_train, y_train)
enetmodel

ElasticNet()

In [15]:
enetmodel.intercept_ # intercept

9.66017233160926

In [16]:
enetmodel.coef_ # coefficient values

array([ -1.61694103,   7.46019777,   2.60346369,  -2.49584787,
        -0.49720266,   5.38019309,   6.07654408,  -0.23306998,
         0.07107404,  -0.60880473,   1.74638227,   0.99206863,
        -0.75861262,   0.2545778 ,   0.2288989 ,  -0.82299964,
        15.41006115, -35.23558671,   4.76373134])
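A bare coefficient array is hard to read without the matching column names. The following is a minimal sketch that pairs each coefficient with its feature name and sorts by magnitude. It uses synthetic data from `make_regression` and hypothetical names (`X_toy`, `toy_model`, `f0` to `f3`), so the numbers it prints are unrelated to the Hitters results.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

# Synthetic stand-in data with named columns (assumption, not the Hitters set)
X_arr, y_toy = make_regression(n_samples=100, n_features=4, noise=5.0, random_state=0)
X_toy = pd.DataFrame(X_arr, columns=["f0", "f1", "f2", "f3"])

toy_model = ElasticNet().fit(X_toy, y_toy)

# Label each coefficient with its column name
coef_table = pd.Series(toy_model.coef_, index=X_toy.columns)

# Order by absolute size so the most influential features come first
order = coef_table.abs().sort_values(ascending=False).index
coef_sorted = coef_table.reindex(order)
print(coef_sorted)
```

Because the features here are not standardized, coefficient magnitudes depend on each column's scale, so the ordering is only a rough guide.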

In [17]:
rscore = enetmodel.score(X_train, y_train) # R² of the model on the training set

In [18]:
rscore # [important]

0.5808662063571429

### Prediction

In [20]:
ypred = enetmodel.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, ypred)) # test-set error (RMSE) of the model

In [21]:
rmse # [important]

356.7878012642823

In [22]:
print(f"Values obtained without cross-validation; \nTraining R² (%): {rscore * 100} \nTest RMSE: {rmse}")

Values obtained without cross-validation; 
Training R² (%): 58.08662063571429 
Test RMSE: 356.7878012642823
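Note that the score above is computed on the training data while the RMSE is computed on the test data. The following is a minimal sketch of how both metrics could be reported on the held-out set, using `r2_score` (imported at the top but not used so far). It runs on synthetic data with hypothetical names (`X_toy`, `toy_model`, and so on), so its values do not correspond to the Hitters results.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption, not the Hitters set)
X_toy, y_toy = make_regression(n_samples=200, n_features=10, noise=20.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.20, random_state=42)

toy_model = ElasticNet().fit(X_tr, y_tr)
pred = toy_model.predict(X_te)

r2_train = toy_model.score(X_tr, y_tr)               # R² on the data the model was fit on
r2_test = r2_score(y_te, pred)                       # R² on unseen data
rmse_test = np.sqrt(mean_squared_error(y_te, pred))  # RMSE on unseen data

# The same RMSE written out by hand: square root of the mean squared residual
rmse_manual = np.sqrt(np.mean((y_te - pred) ** 2))

print(r2_train, r2_test, rmse_test)
```

Comparing `r2_train` with `r2_test` gives a quick indication of how much of the training fit carries over to new observations.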


### Model Tuning / Model Validation

In [27]:
# goal: find the alpha that minimizes the cross-validated error,
enetcv_model = ElasticNetCV(cv = 10, random_state = 1).fit(X_train, y_train)

In [28]:
enetcv_model.alpha_ # optimum alpha value

1179.7218882369168
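With its default settings, `ElasticNetCV` searches over `alpha` only and keeps `l1_ratio` fixed at 0.5. As a possible extension, a list of `l1_ratio` candidates can be passed so that the ridge/lasso mix is tuned as well. The sketch below assumes synthetic data and a hypothetical candidate grid `l1_grid`; it is not the configuration used in this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV

# Synthetic stand-in data (assumption, not the Hitters set)
X_toy, y_toy = make_regression(
    n_samples=150, n_features=8, n_informative=3, noise=10.0, random_state=0
)

# Hypothetical candidate mixes between ridge-like (low) and lasso (1.0)
l1_grid = [0.1, 0.5, 0.9, 1.0]

# Cross-validate over alpha and l1_ratio together
toy_cv = ElasticNetCV(l1_ratio=l1_grid, cv=5, random_state=1).fit(X_toy, y_toy)

print(toy_cv.alpha_, toy_cv.l1_ratio_)
```

The selected pair is available as `alpha_` and `l1_ratio_`, and both would need to be passed on if a separate `ElasticNet` is refit afterwards.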

In [29]:
# final model,
enetcv_final = ElasticNet(alpha = enetcv_model.alpha_).fit(X_train, y_train)
ypred_final = enetcv_final.predict(X_test)
rmse_final = np.sqrt(mean_squared_error(y_test, ypred_final))

In [30]:
rmse_final

378.23482159854785

In [34]:
rscore_final = enetcv_final.score(X_train, y_train) # R² of the final model on the training set

In [35]:
rscore_final # [important]

0.531895654493352

In [None]:
print(f"Values obtained after cross-validation; \nTraining R² (%): {rscore_final * 100} \nTest RMSE: {rmse_final}")

### Conclusion

* Values obtained without cross-validation:
    * Training R² (%): 58.08662063571429
    * Test RMSE: **356.7878012642823**
* Values obtained after cross-validation:
    * Training R² (%): 53.1895654493352
    * Test RMSE: **378.23482159854785**

* On the same dataset,
    * the optimum error of the PCR model was **333.6546715301251**.
    * the optimum error of the PLS regression model was **333.60890630848354**.
    * the optimum error of the Ridge regression model was **383.0073770592842**.
    * the optimum error of the Lasso regression model was **358.60412684844744**.
    * the optimum error of the ElasticNet regression model was **378.23482159854785**.

Among the linear regression models applied to this dataset, the PLS regression model gave the lowest error.