In [1]:
from sklearn import datasets, ensemble, preprocessing, tree, linear_model
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# 1. Dataset

## 1.a. Loading the data

In [2]:
data = datasets.fetch_california_housing()

In [3]:
df = pd.DataFrame(data.data, columns=data.feature_names)
df['MEDV'] = data.target

In [4]:
print(data.DESCR)

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block
        - HouseAge      median house age in block
        - AveRooms      average number of rooms
        - AveBedrms     average number of bedrooms
        - Population    block population
        - AveOccup      average house occupancy
        - Latitude      house block latitude
        - Longitude     house block longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
http://lib.stat.cmu.edu/datasets/

The target variable is the median house value for California districts.

This dataset was derived from the 1990 U.S. census, using one row per census
block group. A block group is the smallest geographical unit for which the U.S.
Census Bur

In [5]:
df.head(20)

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MEDV |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |
| 5 | 4.0368 | 52.0 | 4.761658 | 1.103627 | 413.0 | 2.139896 | 37.85 | -122.25 | 2.697 |
| 6 | 3.6591 | 52.0 | 4.931907 | 0.951362 | 1094.0 | 2.128405 | 37.84 | -122.25 | 2.992 |
| 7 | 3.12 | 52.0 | 4.797527 | 1.061824 | 1157.0 | 1.788253 | 37.84 | -122.25 | 2.414 |
| 8 | 2.0804 | 42.0 | 4.294118 | 1.117647 | 1206.0 | 2.026891 | 37.84 | -122.26 | 2.267 |
| 9 | 3.6912 | 52.0 | 4.970588 | 0.990196 | 1551.0 | 2.172269 | 37.84 | -122.25 | 2.611 |


## 1.b. Summarizing the data

Display the descriptive statistics of the variables.
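
A minimal sketch of one way to do this with `DataFrame.describe`. To stay self-contained it retypes a few columns from the first five rows printed above into a small frame (the name `sample` is hypothetical); in the notebook the same call would be made on `df`.

```python
import pandas as pd

# First five rows printed above (subset of columns), retyped so the
# snippet runs without downloading the dataset.
sample = pd.DataFrame(
    {
        "MedInc": [8.3252, 8.3014, 7.2574, 5.6431, 3.8462],
        "HouseAge": [41.0, 21.0, 52.0, 52.0, 52.0],
        "AveRooms": [6.984127, 6.238137, 8.288136, 5.817352, 6.281853],
        "Population": [322.0, 2401.0, 496.0, 558.0, 565.0],
        "MEDV": [4.526, 3.585, 3.521, 3.413, 3.422],
    }
)

# count, mean, std, min, quartiles and max for every numeric column;
# transposed so that each variable occupies one row.
summary = sample.describe().T
print(summary)
```

Transposing is only a readability choice; `sample.describe()` alone returns the same numbers with variables as columns.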

# 2. Preprocessing

## 2.a. Inputs and target variable

`MEDV` is the target variable; every other variable in the table can be used as an input for prediction.

> Separate the dependent and independent variables.
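
The split can be sketched as follows. The small `sample` frame is a hypothetical stand-in built from three of the rows printed earlier; in the notebook the same two lines would be applied to `df`.

```python
import pandas as pd

# Three rows from the data preview (subset of columns), for illustration only.
sample = pd.DataFrame(
    {
        "MedInc": [8.3252, 8.3014, 7.2574],
        "HouseAge": [41.0, 21.0, 52.0],
        "MEDV": [4.526, 3.585, 3.521],
    }
)

X = sample.drop(columns=["MEDV"])  # independent variables (inputs)
y = sample["MEDV"]                 # dependent variable (target)
print(X.shape, y.shape)
```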

## 2.b. "Feature Engineering"

If you can think of new variables that might improve your model's performance, come back here after training your first model and add them.
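
As an illustration only, the sketch below derives two ratio features from existing columns. The helper `add_ratio_features` and the column names `BedrmsPerRoom` and `Households` are hypothetical, and whether such features actually help is something to verify with cross-validation. The `Households` column assumes that dividing the block population by the average occupancy approximates the number of households.

```python
import pandas as pd

# Three rows from the data preview (subset of columns), for illustration only.
sample = pd.DataFrame(
    {
        "AveRooms": [6.984127, 6.238137, 8.288136],
        "AveBedrms": [1.02381, 0.97188, 1.073446],
        "Population": [322.0, 2401.0, 496.0],
        "AveOccup": [2.555556, 2.109842, 2.80226],
    }
)


def add_ratio_features(frame):
    """Return a copy of `frame` with two hypothetical ratio features added."""
    return frame.assign(
        # Share of rooms that are bedrooms.
        BedrmsPerRoom=frame["AveBedrms"] / frame["AveRooms"],
        # Assumption: population / average occupancy ~ number of households.
        Households=frame["Population"] / frame["AveOccup"],
    )


engineered = add_ratio_features(sample)
print(engineered.head())
```

Because `assign` returns a new frame, the original `sample` is left unchanged, which makes it easy to compare models with and without the extra columns.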

## 2.c. Transformations required by the model

Which transformations do the *bagging* and *boosting* models you will build on decision trees require?

Apply the transformations you consider mandatory in this section.
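
One way to explore this question empirically is to fit the same tree on raw and on standardized inputs and compare the predictions. The sketch below does so on synthetic data (an assumption made to keep it runnable offline, not the housing data). It only probes sensitivity to feature scaling; it says nothing about other preprocessing such as handling missing values or categorical columns.

```python
import numpy as np
from sklearn import preprocessing, tree

rng = np.random.default_rng(0)

# Synthetic stand-in: two features on very different scales.
X = rng.normal(size=(300, 2)) * np.array([1.0, 100.0])
y = np.sin(X[:, 0]) + 0.01 * X[:, 1] + rng.normal(scale=0.1, size=300)
X_train, X_test, y_train = X[:200], X[200:], y[:200]

# Standardize using statistics from the training part only.
scaler = preprocessing.StandardScaler().fit(X_train)

raw_tree = tree.DecisionTreeRegressor(max_depth=4, random_state=0)
raw_tree.fit(X_train, y_train)

scaled_tree = tree.DecisionTreeRegressor(max_depth=4, random_state=0)
scaled_tree.fit(scaler.transform(X_train), y_train)

pred_raw = raw_tree.predict(X_test)
pred_scaled = scaled_tree.predict(scaler.transform(X_test))

# Compare the two sets of held-out predictions.
print("predictions match:", np.allclose(pred_raw, pred_scaled))
```

The same comparison can be repeated with an ensemble from `sklearn.ensemble` in place of the single tree.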

# 3. Modeling

A baseline linear model for predicting house prices is built below:

In [6]:
from sklearn import linear_model, model_selection, metrics, pipeline, compose

# Baseline: linear regression on a union of several views of the raw features.
lm = pipeline.make_pipeline(
    pipeline.make_union(
        preprocessing.FunctionTransformer(lambda x: x),  # raw features
        preprocessing.KBinsDiscretizer(15, encode='onehot-dense'),  # fine bins
        preprocessing.KBinsDiscretizer(3, encode='onehot-dense'),  # coarse bins
        preprocessing.FunctionTransformer(lambda x: 1. / x),  # reciprocals
        # log1p of every feature except the coordinates (Longitude is negative)
        preprocessing.FunctionTransformer(lambda x: np.log1p(x.drop(['Latitude', 'Longitude'], axis=1))),
        ),
    preprocessing.StandardScaler(),
    # Fit on log1p(target); predictions are mapped back with expm1.
    compose.TransformedTargetRegressor(linear_model.LinearRegression(), func=np.log1p, inverse_func=np.expm1)
    # linear_model.LinearRegression()
    )

# 10-fold cross-validation on shuffled data, reporting train and test scores.
lmcv = model_selection.cross_validate(
    lm,
    df.drop(['MEDV'], axis=1),
    df['MEDV'],
    cv=model_selection.KFold(
        n_splits=10,
        shuffle=True,
        random_state=42
        ),
    return_train_score=True,
    scoring={
        'r2': metrics.make_scorer(metrics.r2_score),
        # MAPE in percent
        'mape': metrics.make_scorer(
            lambda a, b: 100.*metrics.mean_absolute_percentage_error(a, b)
            ),
        # WAPE in percent: MAPE with each sample weighted by its true value
        'wape': metrics.make_scorer(
            lambda a, b: 100.*metrics.mean_absolute_percentage_error(a, b, sample_weight=a)
            )
    }
)
pd.DataFrame(lmcv).mean()

fit_time       0.410629
score_time     0.026716
test_r2        0.749593
train_r2       0.754881
test_mape     21.718011
train_mape    21.468726
test_wape     19.252674
train_wape    19.077249
dtype: float64
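
The `wape` scorer in the baseline cell is MAPE with `sample_weight` set to the true values. For positive targets, weighting each relative error $|y_i - \hat{y}_i| / y_i$ by $y_i$ reduces to $\sum_i |y_i - \hat{y}_i| / \sum_i y_i$. The toy numbers below are made up purely to check that identity; they are not taken from the housing data.

```python
import numpy as np
from sklearn import metrics

# Made-up values, chosen so the arithmetic is easy to follow by hand.
y_true = np.array([1.0, 2.0, 4.0])
y_pred = np.array([1.5, 2.0, 3.0])

# Plain MAPE: unweighted mean of the relative errors 0.5, 0.0 and 0.25.
mape = 100.0 * metrics.mean_absolute_percentage_error(y_true, y_pred)

# WAPE as computed in the baseline cell: MAPE weighted by the true values.
wape = 100.0 * metrics.mean_absolute_percentage_error(
    y_true, y_pred, sample_weight=y_true
)

# WAPE written out directly: total absolute error over total true value.
wape_direct = 100.0 * np.abs(y_true - y_pred).sum() / y_true.sum()

print(mape, wape, wape_direct)
```

Compared with MAPE, WAPE gives more influence to errors on high-valued targets, so the two metrics can rank models differently.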

## 3.a. Beating the baseline model

- Build an ensemble model that surpasses this baseline on all three test metrics ($R^2$, MAPE, WAPE).
- **Your model's average training time (fit time) per fold must be under 0.5 seconds.**
- Check your models' performance with 10-fold cross-validation. **Make sure the dataset is shuffled before the K-fold split.**


*If the function you use for validation (for example `model_selection.cross_validate`) accepts a `cv` parameter, pass it as follows:*
```python
cv=model_selection.KFold(n_splits=10, shuffle=True, random_state=42)
```
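
A minimal sketch of how the splitter and the fit-time check fit together. It runs on synthetic data from `make_regression` so that it works offline, and the random forest settings are arbitrary placeholders rather than a tuned solution; in the notebook the inputs and `MEDV` would take the place of `X` and `y`.

```python
from sklearn import datasets, ensemble, metrics, model_selection

# Synthetic regression data (assumption: stands in for the housing features).
X, y = datasets.make_regression(
    n_samples=500, n_features=8, noise=10.0, random_state=0
)

# Placeholder candidate: a small random forest with arbitrary settings.
model = ensemble.RandomForestRegressor(
    n_estimators=20, max_depth=8, random_state=0
)

# Same shuffled 10-fold splitter as required by the assignment.
cv = model_selection.KFold(n_splits=10, shuffle=True, random_state=42)

results = model_selection.cross_validate(
    model,
    X,
    y,
    cv=cv,
    return_train_score=True,
    scoring={"r2": metrics.make_scorer(metrics.r2_score)},
)

# `fit_time` holds one entry per fold; its mean is the quantity under the limit.
print("mean fit time (s):", results["fit_time"].mean())
print("mean test R^2:", results["test_r2"].mean())
```

The MAPE and WAPE scorers from the baseline cell can be added to the `scoring` dictionary in the same way.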

In [31]:
# freestyle

### 3.a.I Evaluation

- What departures from the default parameters did your model need in order to meet the time limit?
- Which of the two approaches, bagging or boosting, served you better for this task? Explain why.

## 3.b. New horizons

Raise the time limit to 1 second and build a model with an $R^2$ score above 80%.

## 3.c. Maximum performance

Raise the time limit to 180 seconds and build a model with an **$R^2$ score above 83%**.

> If you run a parameter search for each fold, the time spent on the search also counts toward the 180 seconds.
>
> You do not have to stick to the `sklearn` library. For gradient boosting in particular, you may try other libraries as well.