# 🏠 House Prices - Advanced Regression Techniques
## End-to-End Machine Learning Pipeline

This notebook contains a production-grade ML pipeline, built from scratch, for Kaggle's **House Prices** competition.

---

### 📌 Contents
| # | Section |
|---|---------|
| 1 | Library Setup & Imports |
| 2 | Data Loading & Exploratory Analysis |
| 3 | Outlier Removal |
| 4 | Target Variable Transformation |
| 5 | Missing Data Strategy |
| 6 | Ordinal Encoding |
| 7 | Feature Engineering |
| 8 | One-Hot Encoding & Box-Cox |
| 9 | Train/Test Split & Final Cleanup |
| 10 | Cross-Validation Setup |
| 11 | Linear Models (Baseline) |
| 12 | XGBoost |
| 13 | LightGBM |
| 14 | CatBoost |
| 15 | Optuna Hyperparameter Optimization |
| 16 | OOF Prediction Generation (Stacking) |
| 17 | Coverage Analysis |
| 18 | Meta-Model (Stacking) |
| 19 | Weighted Blending |
| 20 | Final Ensemble & Submission |

---

> **Expected LB score:** `~0.115 RMSE` (top 5%)  
> **Evaluation metric:** RMSLE (RMSE computed on log(SalePrice))
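
The metric can be sketched as follows. This is a minimal illustration with made-up prices; the helper name `rmsle` is an assumption of this example, and it uses `log1p` to match the transformation this notebook applies to the target.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """RMSE between log1p-transformed true and predicted prices."""
    log_true = np.log1p(np.asarray(y_true, dtype=float))
    log_pred = np.log1p(np.asarray(y_pred, dtype=float))
    return float(np.sqrt(np.mean((log_true - log_pred) ** 2)))

# Made-up prices, for illustration only
y_true = np.array([100_000.0, 200_000.0, 400_000.0])
y_pred = y_true * 1.10           # every prediction is 10% too high

score = rmsle(y_true, y_pred)
# A constant relative error gives nearly the same log-space error for
# cheap and expensive houses, so the score is close to log(1.10).
print(f"RMSLE: {score:.4f}   log(1.10): {np.log(1.10):.4f}")
```

Because the error is measured in log space, a 10% miss on a cheap house costs about as much as a 10% miss on an expensive one.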

---
## 1. Library Setup & Imports

XGBoost, LightGBM, and CatBoost come preinstalled in the Kaggle environment.  
Optuna must be installed separately.

In [1]:
!pip install optuna -q

import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
from scipy import stats
from scipy.special import boxcox1p
from scipy.stats import boxcox_normmax

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.linear_model import Lasso, Ridge, ElasticNet, RidgeCV
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.feature_selection import VarianceThreshold

import xgboost as xgb
import lightgbm as lgb
from catboost import CatBoostRegressor

import optuna
optuna.logging.set_verbosity(optuna.logging.WARNING)

SEED    = 42
N_FOLDS = 10
np.random.seed(SEED)

print("✅ All libraries loaded.")

✅ All libraries loaded.


---
## 2. Data Loading & Exploratory Analysis

In Kaggle notebooks, the data files live under `/kaggle/input/house-prices-advanced-regression-techniques/`.

In [2]:
train_orig = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/train.csv')
test       = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/test.csv')

print(f"Train : {train_orig.shape}")
print(f"Test  : {test.shape}")
print("\nSalePrice statistics:")
print(train_orig['SalePrice'].describe().round(0))

Train : (1460, 81)
Test  : (1459, 80)

SalePrice statistics:
count      1460.0
mean     180921.0
std       79443.0
min       34900.0
25%      129975.0
50%      163000.0
75%      214000.0
max      755000.0
Name: SalePrice, dtype: float64


---
## 3. Outlier Removal

> ⚠️ **Critical design decision:** The Kaggle competition host recommends removing only the **2** genuine anomalies with `GrLivArea > 4000`: very large houses that sold for unusually low prices.  
> In our earlier experiments, when 9 rows were dropped, the model never saw any high-priced houses (`$300k+`) and its maximum prediction got stuck at `$281k`.  
> For that reason we remove **only** the 2 houses below.

In [3]:
train = train_orig.copy()

# Only the genuine anomalies: 2 houses that are large but abnormally cheap
train = train[~((train['GrLivArea'] > 4000) & (train['SalePrice'] < 300_000))]
train = train.reset_index(drop=True)

print(f"Original row count    : {len(train_orig)}")
print(f"After cleaning        : {len(train)}")
print(f"Removed               : {len(train_orig) - len(train)} rows")
print(f"Max SalePrice kept    : ${train['SalePrice'].max():,.0f}")

Original row count    : 1460
After cleaning        : 1458
Removed               : 2 rows
Max SalePrice kept    : $755,000


---
## 4. Target Variable: Log Transformation

Because the competition's evaluation metric is **RMSLE** (RMSE in log space), we apply `log1p` to the target variable.

This gives us two advantages:
1. The right-skewed distribution becomes closer to normal, which makes it easier for the models to fit  
2. At prediction time, converting back to actual prices with `expm1` is trivial

> Check: `y_train.max()` should be `~13.53`, which corresponds to `$755,000`
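
The check above can be reproduced without the dataset. The snippet below is a small sketch using only the standard library; the only input is the maximum price quoted in this notebook.

```python
import math

max_price = 755_000                      # maximum SalePrice reported in this notebook
log_price = math.log1p(max_price)        # forward transform: log(1 + x)
recovered = math.expm1(log_price)        # inverse transform: exp(y) - 1

print(f"log1p({max_price}) = {log_price:.4f}")
print(f"expm1({log_price:.4f}) = {recovered:,.0f}")
```

The round trip is lossless up to floating-point error, which is why predictions made in log space can be converted back to prices with a single `expm1` call.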

In [4]:
y_train = np.log1p(train['SalePrice'])

print("y_train (log space) statistics:")
print(y_train.describe().round(4))
print(f"\nMax: {y_train.max():.4f} → ${np.expm1(y_train.max()):,.0f}")
print(f"Min: {y_train.min():.4f} → ${np.expm1(y_train.min()):,.0f}")

y_train (log space) statistics:
count    1458.0000
mean       12.0240
std         0.3997
min        10.4603
25%        11.7747
50%        12.0015
75%        12.2737
max        13.5345
Name: SalePrice, dtype: float64

Max: 13.5345 → $755,000
Min: 10.4603 → $34,900


---
## 5. Combining the Data & Missing Data Strategy

We **concatenate** the train and test sets before applying the transformations, so that both sets end up with consistent encodings.

### Missing Data Groups

| Group | Strategy | Columns / Rationale |
|-------|----------|---------------------|
| **Semantic NA** | `"None"` string | PoolQC, Alley, Fence, FireplaceQu... |
| **Numeric NA** | `0` | GarageArea, BsmtFinSF1, MasVnrArea... |
| **LotFrontage** | Neighborhood median | Houses in the same neighborhood tend to have similar frontage |
| **Others** | Mode | MSZoning, Electrical... |
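
The four strategies in the table can be tried on a toy frame first. The sketch below uses hypothetical values; only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; only the column names are real
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B"],
    "PoolQC":       [np.nan, "Gd", np.nan, np.nan, np.nan],
    "GarageArea":   [480.0, np.nan, 300.0, 0.0, 520.0],
    "LotFrontage":  [60.0, np.nan, 80.0, 50.0, np.nan],
    "MSZoning":     ["RL", "RL", np.nan, "RM", "RL"],
})

toy["PoolQC"] = toy["PoolQC"].fillna("None")        # semantic NA: feature is absent
toy["GarageArea"] = toy["GarageArea"].fillna(0)     # numeric NA: no garage
toy["LotFrontage"] = (                              # median within each neighborhood
    toy.groupby("Neighborhood")["LotFrontage"]
       .transform(lambda s: s.fillna(s.median()))
)
toy["MSZoning"] = toy["MSZoning"].fillna(toy["MSZoning"].mode()[0])  # most frequent value

print(toy)
```

Each group gets the fill value that matches what the missing entry means, rather than one global default.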

In [5]:
n_train  = len(train)
n_test   = len(test)

all_data = pd.concat(
    [train.drop('SalePrice', axis=1), test],
    axis=0
).reset_index(drop=True)

print(f"all_data shape : {all_data.shape}")
print(f"Train part     : [0:{n_train}]")
print(f"Test part      : [{n_train}:{n_train+n_test}]")
print("\n15 columns with the most missing values (%):")
missing = all_data.isnull().sum()
missing = missing[missing > 0].sort_values(ascending=False)
print((missing / len(all_data) * 100).round(1).head(15).to_string())

all_data shape : (2917, 80)
Train part     : [0:1458]
Test part      : [1458:2917]

15 columns with the most missing values (%):
PoolQC          99.7
MiscFeature     96.4
Alley           93.2
Fence           80.4
MasVnrType      60.5
FireplaceQu     48.7
LotFrontage     16.7
GarageQual       5.5
GarageYrBlt      5.5
GarageCond       5.5
GarageFinish     5.5
GarageType       5.4
BsmtExposure     2.8
BsmtCond         2.8
BsmtQual         2.8


In [6]:
# ── GROUP 1: "NA" means "not present" ───────────────────────────────────
none_fill_cols = [
    'PoolQC', 'MiscFeature', 'Alley', 'Fence', 'FireplaceQu',
    'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond',
    'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1',
    'BsmtFinType2', 'MasVnrType'
]
for col in none_fill_cols:
    all_data[col] = all_data[col].fillna('None')

# ── GROUP 2: Numeric "not present" → 0 ──────────────────────────────────
zero_fill_cols = [
    'GarageYrBlt', 'GarageArea', 'GarageCars',
    'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF',
    'BsmtFullBath', 'BsmtHalfBath', 'MasVnrArea'
]
for col in zero_fill_cols:
    all_data[col] = all_data[col].fillna(0)

# ── GROUP 3: LotFrontage → neighborhood median ──────────────────────────
all_data['LotFrontage'] = all_data.groupby('Neighborhood')['LotFrontage'].transform(
    lambda x: x.fillna(x.median())
)

# ── GROUP 4: Remaining truly missing values → mode ──────────────────────
mode_fill_cols = [
    'MSZoning', 'Electrical', 'KitchenQual', 'Exterior1st',
    'Exterior2nd', 'SaleType', 'Functional', 'Utilities'
]
for col in mode_fill_cols:
    all_data[col] = all_data[col].fillna(all_data[col].mode()[0])

assert all_data.isnull().sum().sum() == 0, "There are still missing values!"
print(f"✅ All missing values filled. Remaining: {all_data.isnull().sum().sum()}")

✅ All missing values filled. Remaining: 0


---
## 6. Ordinal Encoding

Quality and condition variables are **ordinal**. Instead of One-Hot Encoding, we encode them with a meaningful numeric ordering.

**Why?**
- The `Ex > Gd > TA > Fa > Po` hierarchy gives the model direct information
- With One-Hot Encoding this ordering would be lost
- Linear models in particular benefit from the ordering
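
The difference between the two encodings is easy to see on a few rows. This is a sketch on a hypothetical sample of `KitchenQual`; the mapping is the same `quality_map` used in this section.

```python
import pandas as pd

# Hypothetical sample of a quality column; 0 is reserved for "None"
quality_map = {"None": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
kitchen = pd.Series(["TA", "Gd", "Ex", "Fa", "TA"], name="KitchenQual")

ordinal = kitchen.map(quality_map)        # one numeric column, order preserved
one_hot = pd.get_dummies(kitchen)         # one column per level, order lost

print(ordinal.tolist())
print(one_hot.columns.tolist())
```

The ordinal version keeps a single column in which `Ex` is greater than `Gd`; the one-hot version produces four unrelated indicator columns.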

In [7]:
quality_map = {'None': 0, 'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5}

ordinal_quality_cols = [
    'ExterQual', 'ExterCond', 'BsmtQual', 'BsmtCond',
    'HeatingQC', 'KitchenQual', 'FireplaceQu',
    'GarageQual', 'GarageCond', 'PoolQC'
]
for col in ordinal_quality_cols:
    all_data[col] = all_data[col].map(quality_map).fillna(0).astype(int)

all_data['BsmtExposure'] = all_data['BsmtExposure'].map(
    {'None':0,'No':1,'Mn':2,'Av':3,'Gd':4}).fillna(0).astype(int)
all_data['BsmtFinType1'] = all_data['BsmtFinType1'].map(
    {'None':0,'Unf':1,'LwQ':2,'Rec':3,'BLQ':4,'ALQ':5,'GLQ':6}).fillna(0).astype(int)
all_data['BsmtFinType2'] = all_data['BsmtFinType2'].map(
    {'None':0,'Unf':1,'LwQ':2,'Rec':3,'BLQ':4,'ALQ':5,'GLQ':6}).fillna(0).astype(int)
all_data['GarageFinish'] = all_data['GarageFinish'].map(
    {'None':0,'Unf':1,'RFn':2,'Fin':3}).fillna(0).astype(int)
all_data['PavedDrive']   = all_data['PavedDrive'].map(
    {'N':0,'P':1,'Y':2}).fillna(0).astype(int)
all_data['LotShape']     = all_data['LotShape'].map(
    {'IR3':0,'IR2':1,'IR1':2,'Reg':3}).fillna(0).astype(int)
all_data['LandSlope']    = all_data['LandSlope'].map(
    {'Sev':0,'Mod':1,'Gtl':2}).fillna(0).astype(int)
all_data['LandContour']  = all_data['LandContour'].map(
    {'Low':0,'HLS':1,'Bnk':2,'Lvl':3}).fillna(0).astype(int)
all_data['Functional']   = all_data['Functional'].map(
    {'Sal':0,'Sev':1,'Maj2':2,'Maj1':3,'Mod':4,'Min2':5,'Min1':6,'Typ':7}).fillna(0).astype(int)

print("✅ Ordinal encoding complete.")
print(f"Remaining categorical columns: {all_data.select_dtypes('object').shape[1]}")

✅ Ordinal encoding complete.
Remaining categorical columns: 24


---
## 7. Feature Engineering

> **In practice, the largest RMSE improvement tends to come from this step.**

### Generated Feature Categories

| Category | Examples | Rationale |
|----------|----------|-----------|
| **Area aggregation** | `TotalSF`, `TotalBathrooms` | Basement + 1st floor + 2nd floor jointly determine the price |
| **Time metrics** | `HouseAge`, `IsRemodeled` | Old and poorly maintained house = low price |
| **Quality × Area** | `QualArea`, `BsmtScore` | A large but low-quality house ≠ a large, high-quality house |
| **Polynomial terms** | `OverallQual²`, `OverallQual³` | The relationship between OverallQual and price is not linear |
| **Boolean flags** | `HasPool`, `HasGarage` | Presence/absence alone is predictive |

In [8]:
# ── Area aggregation ────────────────────────────────────────────────────
all_data['TotalSF']        = all_data['TotalBsmtSF'] + all_data['1stFlrSF'] + all_data['2ndFlrSF']
all_data['TotalBsmtFin']   = all_data['BsmtFinSF1']  + all_data['BsmtFinSF2']
all_data['TotalPorchSF']   = (all_data['OpenPorchSF'] + all_data['EnclosedPorch'] +
                               all_data['3SsnPorch']   + all_data['ScreenPorch'] +
                               all_data['WoodDeckSF'])
all_data['TotalBathrooms'] = (all_data['FullBath']     + 0.5 * all_data['HalfBath'] +
                               all_data['BsmtFullBath'] + 0.5 * all_data['BsmtHalfBath'])

# ── Time-based features ─────────────────────────────────────────────────
all_data['HouseAge']         = (all_data['YrSold'] - all_data['YearBuilt']).clip(lower=0)
all_data['YearsSinceRemod']  = (all_data['YrSold'] - all_data['YearRemodAdd']).clip(lower=0)
all_data['IsRemodeled']      = (all_data['YearRemodAdd'] != all_data['YearBuilt']).astype(int)
all_data['IsNewlyRemodeled'] = (all_data['YearRemodAdd'] == all_data['YrSold']).astype(int)
all_data['IsNew']            = (all_data['YearBuilt']    == all_data['YrSold']).astype(int)
all_data['GarageAge']        = (all_data['YrSold'] - all_data['GarageYrBlt']).clip(lower=0)

# ── Quality × Area interactions ─────────────────────────────────────────
all_data['QualArea']     = all_data['OverallQual'] * all_data['GrLivArea']
all_data['QualTotalSF']  = all_data['OverallQual'] * all_data['TotalSF']
all_data['CondQual']     = all_data['OverallQual'] * all_data['OverallCond']
all_data['KitchenScore'] = all_data['KitchenQual'] * all_data['GrLivArea']
all_data['ExterScore']   = all_data['ExterQual']   * all_data['TotalSF']
all_data['BsmtScore']    = all_data['BsmtQual']    * all_data['TotalBsmtSF']
all_data['GarageScore']  = all_data['GarageQual']  * all_data['GarageArea']

# ── Polynomial terms ────────────────────────────────────────────────────
all_data['OverallQual_sq'] = all_data['OverallQual'] ** 2
all_data['OverallQual_cu'] = all_data['OverallQual'] ** 3
all_data['GrLivArea_sq']   = all_data['GrLivArea']   ** 2
all_data['TotalSF_sq']     = all_data['TotalSF']     ** 2

# ── Boolean flags ───────────────────────────────────────────────────────
all_data['HasPool']      = (all_data['PoolArea']     > 0).astype(int)
all_data['HasGarage']    = (all_data['GarageArea']   > 0).astype(int)
all_data['HasBasement']  = (all_data['TotalBsmtSF']  > 0).astype(int)
all_data['HasFireplace'] = (all_data['Fireplaces']   > 0).astype(int)
all_data['Has2ndFloor']  = (all_data['2ndFlrSF']     > 0).astype(int)
all_data['HasMasVnr']    = (all_data['MasVnrArea']   > 0).astype(int)
all_data['HasWoodDeck']  = (all_data['WoodDeckSF']   > 0).astype(int)
all_data['HasPorch']     = (all_data['TotalPorchSF'] > 0).astype(int)

print("✅ Feature engineering complete.")
print(f"   Total number of columns: {all_data.shape[1]}")

✅ Feature engineering complete.
   Total number of columns: 109


---
## 8. One-Hot Encoding & Box-Cox Transformation

### One-Hot Encoding
For the remaining nominal (unordered) categorical variables we use `pd.get_dummies`.

### Box-Cox Transformation
Continuous numeric variables with high skewness (`|skewness| > 0.5`) are brought closer to a normal distribution.

> **Why `boxcox1p`?**  
> `boxcox1p(x, λ) = boxcox(x+1, λ)`, so it is safe for zero values. The standard Box-Cox transform requires strictly positive inputs.

> **Filter:** Binary (0/1) and low-cardinality columns are **excluded** from Box-Cox (only columns with `nunique > 10` are considered).
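
The identity quoted above can be checked numerically. The sketch below uses a hypothetical array that contains a zero, and the fixed `λ = 0.15` that this notebook uses as its fallback value.

```python
import numpy as np
from scipy.special import boxcox, boxcox1p

x = np.array([0.0, 1.0, 10.0, 100.0])   # includes a zero, as many area columns do
lam = 0.15                               # the fallback lambda used in this notebook

shifted = boxcox(x + 1, lam)             # Box-Cox applied to x + 1
direct = boxcox1p(x, lam)                # the same transform in one call

print(direct)
print(np.allclose(shifted, direct))
```

A zero input maps to zero and larger values are compressed, which is what reduces right skew.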

In [9]:
# ── One-Hot Encoding ────────────────────────────────────────────────────
cat_cols  = all_data.select_dtypes(include='object').columns.tolist()
print(f"{len(cat_cols)} nominal columns will be one-hot encoded.")
all_data  = pd.get_dummies(all_data, columns=cat_cols, drop_first=False)
print(f"Total columns after encoding: {all_data.shape[1]}")

24 nominal columns will be one-hot encoded.
Total columns after encoding: 256


In [10]:
# ── Box-Cox Transformation ──────────────────────────────────────────────
numeric_cols      = all_data.select_dtypes(include=[np.number]).columns
continuous_cols   = [c for c in numeric_cols if all_data[c].nunique() > 10]

skewed = all_data[continuous_cols].apply(
    lambda x: stats.skew(x.dropna())
).sort_values(ascending=False)

high_skew = skewed[abs(skewed) > 0.5].index
print(f"Number of columns to Box-Cox transform: {len(high_skew)}")

for col in high_skew:
    try:
        # Estimate lambda on the shifted data, then transform the column
        lam = boxcox_normmax(all_data[col].dropna() + 1)
        all_data[col] = boxcox1p(all_data[col], lam)
    except Exception:
        # Fall back to a fixed lambda if the estimate fails
        all_data[col] = boxcox1p(all_data[col], 0.15)

print("✅ Box-Cox transformation complete.")

Number of columns to Box-Cox transform: 34
✅ Box-Cox transformation complete.


---
## 9. Train/Test Split & Final Cleanup

In this step of the pipeline we:
1. Split `all_data` back into train and test
2. Drop features with very low variance (which carry little information)
3. Replace any remaining `Inf` and `NaN` values with the median
4. Run sanity checks, so that any error is caught here
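
Step 2 can be illustrated on synthetic data. In this sketch the two columns are hypothetical dummy variables; the threshold `0.01` is the one used in this notebook.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

n = 200
balanced = np.tile([0.0, 1.0], n // 2)   # half zeros, half ones: variance 0.25
rare = np.zeros(n)
rare[0] = 1.0                            # a single 1 in 200 rows: variance ~0.005
X_train = np.column_stack([balanced, rare])
X_test = np.ones((5, 2))                 # hypothetical test rows

selector = VarianceThreshold(threshold=0.01)
selector.fit(X_train)                    # variances come from the train rows only
keep = selector.get_support()            # boolean mask of retained columns

X_train_sel = X_train[:, keep]
X_test_sel = X_test[:, keep]             # the same mask is reused for test
print(keep, X_train_sel.shape, X_test_sel.shape)
```

A dummy column that is 1 for only a handful of rows falls below the threshold and is dropped from both sets.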

In [11]:
X_train_raw = all_data.iloc[:n_train].copy()
X_test_raw  = all_data.iloc[n_train:].copy()

# Drop low-variance columns (the selector is fit on train only)
selector = VarianceThreshold(threshold=0.01)
selector.fit(X_train_raw)
X_train_clean = X_train_raw.loc[:, selector.get_support()].copy()
X_test_clean  = X_test_raw.loc[:,  selector.get_support()].copy()
print(f"Feature count: {X_train_raw.shape[1]} → {X_train_clean.shape[1]}")

# Inf and NaN cleanup: replace Inf with NaN, then fill with the train median.
# Assignment is used instead of inplace=True, which may act on a temporary copy.
X_train_clean = X_train_clean.replace([np.inf, -np.inf], np.nan)
X_test_clean  = X_test_clean.replace([np.inf, -np.inf], np.nan)
nan_cols = X_train_clean.columns[X_train_clean.isnull().any() | X_test_clean.isnull().any()]
for col in nan_cols:
    fill_val = X_train_clean[col].median()
    X_train_clean[col] = X_train_clean[col].fillna(fill_val)
    X_test_clean[col]  = X_test_clean[col].fillna(fill_val)

# Sanity checks
assert X_train_clean.isnull().sum().sum() == 0, "Missing values in train!"
assert X_test_clean.isnull().sum().sum()  == 0, "Missing values in test!"
assert len(y_train) == len(X_train_clean),       "Size mismatch!"

print("✅ Data ready.")
print(f"   Train : {X_train_clean.shape}")
print(f"   Test  : {X_test_clean.shape}")
print(f"   y max : {y_train.max():.4f} → ${np.expm1(y_train.max()):,.0f}  ← all good if this shows $755k")

Feature count: 256 → 188
✅ Data ready.
   Train : (1458, 188)
   Test  : (1459, 188)
   y max : 13.5345 → $755,000  ← all good if this shows $755k


---
## 10. Cross-Validation Setup

**Why 10-fold?**  
The Ames training set has ~1,460 samples, so 10-fold CV leaves ~146 validation samples in each fold and trains on ~90% of the data.  
Compared with 5-fold, this tends to give a more **stable** CV estimate, at the cost of twice as many model fits.
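
The fold sizes can be verified directly. This sketch uses a placeholder feature matrix with the 1,458 training rows left after outlier removal; the names `X_dummy`, `kf_demo`, and `fold_sizes` are introduced only for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 1458                       # training rows after outlier removal
X_dummy = np.zeros((n_samples, 1))     # placeholder features; only the row count matters

fold_sizes = {}
for k in (5, 10):
    kf_demo = KFold(n_splits=k, shuffle=True, random_state=42)
    val_sizes = [len(val_idx) for _, val_idx in kf_demo.split(X_dummy)]
    fold_sizes[k] = sorted(val_sizes)
    train_share = 1 - max(val_sizes) / n_samples
    print(f"{k:2d}-fold: validation sizes {min(val_sizes)}-{max(val_sizes)}, "
          f"train share ~{train_share:.0%}")
```

With 10 folds each model trains on about 90% of the rows, compared with about 80% for 5 folds.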

In [12]:
kf = KFold(n_splits=N_FOLDS, shuffle=True, random_state=SEED)

def rmsle_cv(model, X, y, cv=kf):
    scores = cross_val_score(
        model, X, y,
        scoring='neg_mean_squared_error',
        cv=cv, n_jobs=-1
    )
    return np.sqrt(-scores)

def print_score(name, scores):
    print(f"{name:22s} → RMSE: {scores.mean():.5f} ± {scores.std():.5f}")

print("✅ CV setup ready. KFold(n_splits=10, shuffle=True, random_state=42)")

✅ CV setup ready. KFold(n_splits=10, shuffle=True, random_state=42)


---
## 11. Linear Models: Regularized Baseline

Although gradient boosting models are often the strongest predictors on tabular data, Lasso/Ridge are included in the pipeline for two reasons:

1. **Ensemble contribution:** Linear models capture **linear signals** that tree models struggle to capture
2. **Diversity:** Weakly correlated predictions add more value to the ensemble

**RobustScaler** is preferred because it is more robust to outliers than StandardScaler.
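
The effect of an outlier on the two scalers can be seen on a single column. The values below are hypothetical living areas with one extreme entry.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Hypothetical living areas with one extreme value
area = np.array([[1000.0], [1200.0], [1400.0], [1600.0], [1800.0], [20000.0]])

robust = RobustScaler().fit_transform(area).ravel()       # (x - median) / IQR
standard = StandardScaler().fit_transform(area).ravel()   # (x - mean) / std

print("robust  :", np.round(robust, 2))
print("standard:", np.round(standard, 2))
```

With `StandardScaler` the single extreme value inflates the standard deviation, so the five typical houses are squeezed into a narrow range; `RobustScaler` keeps them spread out because the median and IQR barely move.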

In [13]:
lasso = make_pipeline(RobustScaler(), Lasso(alpha=0.0005, max_iter=10000, random_state=SEED))
ridge = make_pipeline(RobustScaler(), Ridge(alpha=10.0))
enet  = make_pipeline(RobustScaler(), ElasticNet(alpha=0.0005, l1_ratio=0.9, max_iter=10000, random_state=SEED))

for name, model in [('Lasso', lasso), ('Ridge', ridge), ('ElasticNet', enet)]:
    scores = rmsle_cv(model, X_train_clean, y_train)
    print_score(name, scores)

Lasso                  → RMSE: 0.10913 ± 0.01534
Ridge                  → RMSE: 0.10890 ± 0.01494
ElasticNet             → RMSE: 0.10917 ± 0.01543


---
## 12. XGBoost

We use **early stopping** to find the optimal number of trees while guarding against overfitting.  
We then pass that number to CV as a fixed value.

**Key parameters:**
- `max_depth=4` → Shallow trees, which tend to generalize better on small datasets such as Ames
- `learning_rate=0.01` → Slow learning, more stable convergence
- `subsample=0.7 / colsample_bytree=0.7` → Stochastic boosting, which dampens overfitting
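
The early-stopping rule itself is simple. The following is a library-independent sketch of the idea, not XGBoost's implementation; the function name and the validation curve are hypothetical.

```python
def best_iteration_with_patience(val_scores, patience):
    """Return (best_index, stop_index) for a lower-is-better metric.

    Training stops once `patience` rounds have passed without
    improving on the best validation score seen so far.
    """
    best_idx, best_score = 0, float("inf")
    for i, score in enumerate(val_scores):
        if score < best_score:
            best_idx, best_score = i, score
        elif i - best_idx >= patience:
            return best_idx, i
    return best_idx, len(val_scores) - 1

# Hypothetical validation curve: improves until round 5, then drifts upward
curve = [0.40, 0.30, 0.22, 0.15, 0.12, 0.11, 0.112, 0.113, 0.115, 0.118, 0.12]
best_idx, stop_idx = best_iteration_with_patience(curve, patience=3)
print(f"best round: {best_idx}, stopped at round: {stop_idx}")
```

The same pattern appears in the XGBoost log in this section: training stops 200 rounds after the best round.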

In [14]:
import numpy as np
import pandas as pd

print("🧹 Starting deep cleanup...")

# 1. Clip extreme values to a sane upper/lower bound (capping).
# No variable in the Ames dataset needs to exceed 10 billion in magnitude.
X_train_clean = X_train_clean.clip(lower=-1e10, upper=1e10)
X_test_clean = X_test_clean.clip(lower=-1e10, upper=1e10)

# 2. Cast to float32, the precision XGBoost uses internally.
# Compared with float64 this also halves memory usage.
X_train_clean = X_train_clean.astype('float32')
X_test_clean = X_test_clean.astype('float32')

# 3. Remove any leftovers that clipping and casting may have produced
X_train_clean = X_train_clean.replace([np.inf, -np.inf], 0).fillna(0)
X_test_clean = X_test_clean.replace([np.inf, -np.inf], 0).fillna(0)

print("✅ Dataset hardened for XGBoost's float32 engine and ready!")

🧹 Starting deep cleanup...
✅ Dataset hardened for XGBoost's float32 engine and ready!


In [15]:
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train_clean, y_train, test_size=0.1, random_state=SEED
)

xgb_es = xgb.XGBRegressor(
    n_estimators          = 5000,
    learning_rate         = 0.01,
    max_depth             = 4,
    min_child_weight      = 0,
    subsample             = 0.7,
    colsample_bytree      = 0.7,
    reg_alpha             = 0.00006,
    reg_lambda            = 1.0,
    eval_metric           = 'rmse',
    early_stopping_rounds = 200,
    random_state          = SEED,
    n_jobs                = -1
)
xgb_es.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=500)
best_xgb = xgb_es.best_iteration
print(f"\nXGBoost optimal n_estimators: {best_xgb}")

xgb_model = xgb.XGBRegressor(
    n_estimators     = best_xgb,
    learning_rate    = 0.01,
    max_depth        = 4,
    min_child_weight = 0,
    subsample        = 0.7,
    colsample_bytree = 0.7,
    reg_alpha        = 0.00006,
    reg_lambda       = 1.0,
    random_state     = SEED,
    n_jobs           = -1
)
scores = rmsle_cv(xgb_model, X_train_clean, y_train)
print_score('XGBoost', scores)

[0]	validation_0-rmse:0.41137
[500]	validation_0-rmse:0.11474
[1000]	validation_0-rmse:0.10998
[1211]	validation_0-rmse:0.11017

XGBoost optimal n_estimators: 1011
XGBoost                → RMSE: 0.11477 ± 0.01438


---
## 13. LightGBM

LightGBM usually runs **faster** than XGBoost and performs comparably on this dataset.

**Difference from XGBoost:** model complexity is controlled with `num_leaves` rather than `max_depth`.  
`num_leaves=31` (LightGBM's default) is close to 2^5 = 32, a balanced starting value.
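
The relationship between the two complexity controls can be written down in a few lines. The helper names below are hypothetical and exist only for this sketch.

```python
def max_leaves_for_depth(max_depth):
    """A binary tree of depth d has at most 2**d leaves."""
    return 2 ** max_depth

def leaves_within_depth(num_leaves, max_depth):
    """Cap num_leaves so it never exceeds what max_depth allows."""
    return min(num_leaves, max_leaves_for_depth(max_depth))

for depth in (3, 4, 5, 6):
    print(depth, max_leaves_for_depth(depth), leaves_within_depth(31, depth))
```

When both parameters are set, the smaller limit wins. The Optuna section of this notebook applies a slightly stricter cap of `2**max_depth - 1`.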

In [16]:
lgb_es = lgb.LGBMRegressor(
    n_estimators      = 3000,
    learning_rate     = 0.01,
    num_leaves        = 31,
    min_child_samples = 20,
    subsample         = 0.7,
    subsample_freq    = 1,
    colsample_bytree  = 0.7,
    reg_alpha         = 0.0,
    reg_lambda        = 1.0,
    random_state      = SEED,
    n_jobs            = -1,
    verbosity         = -1
)
lgb_es.fit(
    X_tr, y_tr,
    eval_set=[(X_val, y_val)],
    callbacks=[lgb.early_stopping(100, verbose=False),
               lgb.log_evaluation(200)]
)

best_lgb = lgb_es.best_iteration_
val_pred = lgb_es.predict(X_val)
val_rmse = np.sqrt(mean_squared_error(y_val, val_pred))

print(f"LightGBM optimal n_estimators : {best_lgb}")
print(f"LightGBM Val RMSE             : {val_rmse:.5f}")

lgb_model = lgb_es

[200]	valid_0's l2: 0.0218616
[400]	valid_0's l2: 0.0151606
[600]	valid_0's l2: 0.0138596
[800]	valid_0's l2: 0.0135886
[1000]	valid_0's l2: 0.013413
LightGBM optimal n_estimators : 1024
LightGBM Val RMSE             : 0.11549


In [17]:
# Use the early-stopping validation score directly; no separate CV run
best_lgb = lgb_es.best_iteration_

# Report the score on the validation set (instead of CV)
val_pred     = lgb_es.predict(X_val)
val_rmse     = np.sqrt(mean_squared_error(y_val, val_pred))

print(f"LightGBM optimal n_estimators : {best_lgb}")
print(f"LightGBM Val RMSE (ES)        : {val_rmse:.5f}")

# Reuse lgb_es as the LightGBM model for the later steps.
# Note: lgb_es still has n_estimators=3000, so a later fit() without
# eval_set/callbacks trains all 3000 trees (early stopping is not applied).
lgb_model = lgb_es

LightGBM optimal n_estimators : 1024
LightGBM Val RMSE (ES)        : 0.11549


---
## 14. CatBoost

With its ordered boosting approach, CatBoost reduces the **prediction shift** problem and tends to overfit less.  
It adds useful diversity to the ensemble alongside XGBoost and LightGBM.

In [18]:
cat_es = CatBoostRegressor(
    iterations          = 5000,
    learning_rate       = 0.01,
    depth               = 6,
    l2_leaf_reg         = 3.0,
    bagging_temperature = 0.2,
    od_type             = 'Iter',
    od_wait             = 200,
    random_seed         = SEED,
    verbose             = 500,
    eval_metric         = 'RMSE'
)
cat_es.fit(X_tr, y_tr, eval_set=(X_val, y_val), use_best_model=True)
best_cat = cat_es.best_iteration_
print(f"CatBoost optimal iterations: {best_cat}")

cat_model = CatBoostRegressor(
    iterations    = best_cat,
    learning_rate = 0.01,
    depth         = 6,
    l2_leaf_reg   = 3.0,
    random_seed   = SEED,
    verbose       = 0
)
scores = rmsle_cv(cat_model, X_train_clean, y_train)
print_score('CatBoost', scores)

0:	learn: 0.3951314	test: 0.4117810	best: 0.4117810 (0)	total: 58ms	remaining: 4m 50s
500:	learn: 0.1067807	test: 0.1203536	best: 0.1203536 (500)	total: 2.42s	remaining: 21.7s
1000:	learn: 0.0852457	test: 0.1092770	best: 0.1092487 (999)	total: 4.74s	remaining: 18.9s
1500:	learn: 0.0714896	test: 0.1063979	best: 0.1063825 (1486)	total: 7.1s	remaining: 16.6s
2000:	learn: 0.0608496	test: 0.1056994	best: 0.1056370 (1962)	total: 9.44s	remaining: 14.2s
Stopped by overfitting detector  (200 iterations wait)

bestTest = 0.105447638
bestIteration = 2207

Shrink model to first 2208 iterations.
CatBoost optimal iterations: 2207
CatBoost               → RMSE: 0.11314 ± 0.01310


---
## 15. Optuna: LightGBM Hyperparameter Optimization

Bayesian optimization usually finds good hyperparameters with far fewer trials than grid search.

> ⏱️ **Duration:** 100 trials take roughly 15-25 minutes.  
> Selecting a **GPU** accelerator on Kaggle only helps if LightGBM is configured to use it; the code below runs on the CPU.  
> For a quick test you can set `n_trials=30`.

In [19]:
import optuna
optuna.logging.set_verbosity(optuna.logging.WARNING)

def objective_lgb(trial):
    params = {
        'n_estimators'      : trial.suggest_int('n_estimators', 400, 1200),
        'learning_rate'     : trial.suggest_float('learning_rate', 0.01, 0.05, log=True),
        'max_depth'         : trial.suggest_int('max_depth', 3, 6),
        'num_leaves'        : trial.suggest_int('num_leaves', 10, 31),
        'min_child_samples' : trial.suggest_int('min_child_samples', 10, 30),
        'subsample'         : trial.suggest_float('subsample', 0.6, 0.9),
        'colsample_bytree'  : trial.suggest_float('colsample_bytree', 0.6, 0.9),
        'reg_alpha'         : trial.suggest_float('reg_alpha', 1e-3, 10.0, log=True),
        'reg_lambda'        : trial.suggest_float('reg_lambda', 1e-3, 10.0, log=True),
        'random_state'      : SEED,
        
        # Deadlock fix: use 1 instead of -1 here, since cross_val_score
        # already runs the folds in parallel (n_jobs=-1 in rmsle_cv)
        'n_jobs'            : 1,
        
        'verbosity'         : -1
    }
    
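    # Keep num_leaves consistent with max_depth (study.best_params still
    # reports the sampled value, not the clamped one)
    max_leaves = (2 ** params['max_depth']) - 1
    if params['num_leaves'] > max_leaves:
        params['num_leaves'] = max_leaves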

    model  = lgb.LGBMRegressor(**params)
    scores = rmsle_cv(model, X_train_clean, y_train, kf)
    return scores.mean()

study_lgb = optuna.create_study(direction='minimize')
study_lgb.optimize(objective_lgb, n_trials=100, show_progress_bar=True)

print(f"LightGBM Best RMSE : {study_lgb.best_value:.5f}")
print(f"LightGBM Best Params: {study_lgb.best_params}")

LightGBM Best RMSE : 0.11532
LightGBM Best Params: {'n_estimators': 1200, 'learning_rate': 0.026606822954448738, 'max_depth': 3, 'num_leaves': 25, 'min_child_samples': 17, 'subsample': 0.8294819686672625, 'colsample_bytree': 0.8947685138545435, 'reg_alpha': 0.37911564938180414, 'reg_lambda': 0.03604165390879584}


---
## 16. OOF (Out-of-Fold) Prediction Generation

This is the heart of stacking.

**How does it work?**
1. In each fold the model is trained **on the other 9 folds**
2. It predicts on the 1 fold it **has never seen** → OOF prediction
3. 10 different predictions are produced for the test set → they are averaged

This lets us feed base-model predictions to the Level-2 model **without data leakage**.

> ⏱️ **Duration:** 6 models × 10 folds = 60 fits. This can take ~15-30 minutes.
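
The three steps can be sketched on synthetic data before running the full loop. Everything below (the data, the `Ridge` base model, the variable names) is an assumption of this illustration, not the Ames pipeline itself.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

# Synthetic regression data (an assumption of this sketch, not the Ames data)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=200)
X_new = rng.normal(size=(50, 5))           # stands in for the test set

cv = KFold(n_splits=10, shuffle=True, random_state=42)
oof = np.zeros(len(X))                     # one held-out prediction per training row
test_folds = np.zeros((len(X_new), cv.get_n_splits()))

for i, (tr_idx, val_idx) in enumerate(cv.split(X)):
    model = Ridge(alpha=1.0).fit(X[tr_idx], y[tr_idx])   # fit on the other 9 folds
    oof[val_idx] = model.predict(X[val_idx])             # predict the unseen fold
    test_folds[:, i] = model.predict(X_new)              # one test prediction per fold

test_pred = test_folds.mean(axis=1)        # average the 10 test predictions
oof_rmse = float(np.sqrt(np.mean((y - oof) ** 2)))
print(f"OOF RMSE: {oof_rmse:.4f}")
```

Every training row receives exactly one prediction, and that prediction comes from a model that never saw the row during fitting.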

In [20]:
def get_oof_and_test_preds(model, X_train, y_train, X_test, cv, model_name='Model'):
    n_train = X_train.shape[0]
    n_test  = X_test.shape[0]
    n_folds = cv.get_n_splits()

    oof_preds  = np.zeros(n_train)
    test_preds = np.zeros((n_test, n_folds))

    for i, (tr_idx, val_idx) in enumerate(cv.split(X_train, y_train)):
        X_tr_f  = X_train.iloc[tr_idx];  y_tr_f  = y_train.iloc[tr_idx]
        X_val_f = X_train.iloc[val_idx]; y_val_f = y_train.iloc[val_idx]

        model.fit(X_tr_f, y_tr_f)
        oof_preds[val_idx]   = model.predict(X_val_f)
        test_preds[:, i]     = model.predict(X_test)

        fold_rmse = np.sqrt(mean_squared_error(y_val_f, oof_preds[val_idx]))
        print(f"  [{model_name}] Fold {i+1}/{n_folds}  RMSE: {fold_rmse:.5f}")

    oof_rmse = np.sqrt(mean_squared_error(y_train, oof_preds))
    print(f"  [{model_name}] OOF RMSE: {oof_rmse:.5f}  |  "
          f"OOF max: {oof_preds.max():.3f} / y_train max: {y_train.max():.3f}\n")
    return oof_preds, test_preds.mean(axis=1), oof_rmse

In [21]:
models_l1 = {
    'lasso'     : lasso,
    'ridge'     : ridge,
    'elasticnet': enet,
    'xgb'       : xgb_model,
    'lgb'       : lgb_model,
    'catboost'  : cat_model,
}

oof_dict  = {}
test_dict = {}
rmse_dict = {}

for name, model in models_l1.items():
    print(f"{'='*55}")
    print(f"  {name.upper()}: generating OOF predictions...")
    oof_pred, test_pred, oof_rmse = get_oof_and_test_preds(
        model, X_train_clean, y_train, X_test_clean, kf, model_name=name
    )
    oof_dict[name]  = oof_pred
    test_dict[name] = test_pred
    rmse_dict[name] = oof_rmse

print("\n📊 OOF RMSE summary:")
for name, rmse in sorted(rmse_dict.items(), key=lambda x: x[1]):
    print(f"  {name:22s}: {rmse:.5f}")

  LASSO: generating OOF predictions...
  [lasso] Fold 1/10  RMSE: 0.09172
  [lasso] Fold 2/10  RMSE: 0.12883
  [lasso] Fold 3/10  RMSE: 0.10599
  [lasso] Fold 4/10  RMSE: 0.10192
  [lasso] Fold 5/10  RMSE: 0.13691
  [lasso] Fold 6/10  RMSE: 0.10048
  [lasso] Fold 7/10  RMSE: 0.12850
  [lasso] Fold 8/10  RMSE: 0.10065
  [lasso] Fold 9/10  RMSE: 0.10145
  [lasso] Fold 10/10  RMSE: 0.09460
  [lasso] OOF RMSE: 0.11018  |  OOF max: 13.462 / y_train max: 13.534

  RIDGE: generating OOF predictions...
  [ridge] Fold 1/10  RMSE: 0.09216
  [ridge] Fold 2/10  RMSE: 0.12854
  [ridge] Fold 3/10  RMSE: 0.10590
  [ridge] Fold 4/10  RMSE: 0.10192
  [ridge] Fold 5/10  RMSE: 0.13438
  [ridge] Fold 6/10  RMSE: 0.09909
  [ridge] Fold 7/10  RMSE: 0.12961
  [ridge] Fold 8/10  RMSE: 0.10166
  [ridge] Fold 9/10  RMSE: 0.10288
  [ridge] Fold 10/10  RMSE: 0.09409
  [ridge] OOF RMSE: 0.11005  |  OOF max: 13.443 / y_train max: 13.534

  ELASTICNET: generating OOF predictions...
  [elasticnet] Fold 1/10 

---
## 17. Coverage Analysis: Detecting Prediction Compression

> ⚠️ **This check is critical.**

If a model's `OOF max` falls below `90%` of `y_train max`, the model is **systematically** underpredicting high-priced houses.

**Expected:** coverage `≥ 90%` for every model

Note that the ratio is computed in log space, where 90% of 13.53 is about 12.18 (roughly $195k), so this threshold is a loose lower bound.

In [22]:
print("🔍 OOF Coverage Analysis")
print(f"{'Model':22s} {'OOF Max':>10} {'y_train Max':>12} {'Coverage %':>12}")
print("-" * 60)
for name, oof in oof_dict.items():
    coverage = (oof.max() / y_train.max()) * 100
    flag = "✅" if coverage >= 90 else "⚠️  PROBLEM!"
    print(f"{flag}  {name:20s} {oof.max():10.4f} {y_train.max():12.4f} {coverage:11.1f}%")

🔍 OOF Coverage Analysis
Model                     OOF Max  y_train Max   Coverage %
------------------------------------------------------------
✅  lasso                   13.4618      13.5345        99.5%
✅  ridge                   13.4428      13.5345        99.3%
✅  elasticnet              13.4616      13.5345        99.5%
✅  xgb                     13.3300      13.5345        98.5%
✅  lgb                     13.3566      13.5345        98.7%
✅  catboost                13.3314      13.5345        98.5%
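
The ratio above is computed on log-scale values, where even large price gaps shrink to small differences. As a complementary view, the sketch below re-expresses two of the printed maxima in price space. It assumes the target was built with `log1p` (the notebook inverts with `expm1` later on); `coverage_pct` is a hypothetical helper, not part of the pipeline.

```python
import numpy as np

def coverage_pct(pred_max, true_max):
    """Return pred_max as a percentage of true_max."""
    return pred_max / true_max * 100.0

# Maxima copied from the coverage table (log scale).
y_train_max_log = 13.5345
oof_max_log = {"lasso": 13.4618, "xgb": 13.3300}

for name, oof_max in oof_max_log.items():
    log_cov = coverage_pct(oof_max, y_train_max_log)
    # Assumption: the target is log1p(SalePrice), so expm1 inverts it.
    price_cov = coverage_pct(np.expm1(oof_max), np.expm1(y_train_max_log))
    print(f"{name:6s} log-space: {log_cov:5.1f}%   price-space: {price_cov:5.1f}%")

# The 90% log-space threshold expressed as a price.
threshold_price = np.expm1(0.90 * y_train_max_log)
print(f"90% of the log maximum corresponds to about ${threshold_price:,.0f}")
```

Under these assumptions, a model could pass the 90% log-space check while its largest prediction sits well below the highest sale price, so the price-space percentage may be the more sensitive of the two.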


---
## 18. Level-2 Meta Model: Stacking

The OOF predictions are used as the **input features** of the Level-2 model.

**Why `RidgeCV`?**
- Ridge uses all base-model predictions in a balanced way (it does not zero them out the way Lasso does)
- With `cv=kf`, the best alpha is selected automatically
- Lightweight and interpretable

In [23]:
meta_train = pd.DataFrame(oof_dict)
meta_test  = pd.DataFrame(test_dict)

meta_model = RidgeCV(alphas=np.logspace(-4, 4, 100), cv=kf)
meta_model.fit(meta_train, y_train)

print(f"Meta model optimal alpha: {meta_model.alpha_:.6f}")
print("\nBase model weights:")
coef = pd.Series(meta_model.coef_, index=meta_train.columns).sort_values(ascending=False)
print(coef.round(4).to_string())

# Note: the meta-model is scored on the same OOF rows it was fit on,
# so the RMSE below is in-sample at the meta level and slightly optimistic.
stacked_train_log = meta_model.predict(meta_train)
stacked_test_log  = meta_model.predict(meta_test)
stacked_oof_rmse  = np.sqrt(mean_squared_error(y_train, stacked_train_log))
print(f"\n🏆 Stacking OOF RMSE: {stacked_oof_rmse:.5f}")

Meta model optimal alpha: 4.862602

Base model weights:
ridge         0.2119
lasso         0.1892
elasticnet    0.1881
catboost      0.1429
lgb           0.1411
xgb           0.1323

🏆 Stacking OOF RMSE: 0.10843
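
The stacking score is computed on the same rows the meta-model was fit on. One way to get a less optimistic estimate is to cross-validate the meta-model itself. The sketch below does this with `cross_val_predict` on synthetic stand-ins for the OOF matrix and the target; the data, sizes, and noise levels are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_predict

rng = np.random.default_rng(42)

# Synthetic stand-ins (assumption): a log-price-like target and three
# base-model OOF columns that are noisy copies of it.
n = 500
y = rng.normal(12.0, 0.4, size=n)
meta_X = np.column_stack([y + rng.normal(0.0, s, size=n) for s in (0.10, 0.12, 0.15)])

kf = KFold(n_splits=10, shuffle=True, random_state=42)
meta = RidgeCV(alphas=np.logspace(-4, 4, 50))

# In-sample score: fit and predict on the same rows.
in_sample_pred = meta.fit(meta_X, y).predict(meta_X)
in_sample_rmse = np.sqrt(mean_squared_error(y, in_sample_pred))

# Cross-validated score: each row is predicted by a meta-model
# that did not see that row during fitting.
cv_pred = cross_val_predict(meta, meta_X, y, cv=kf)
cv_rmse = np.sqrt(mean_squared_error(y, cv_pred))

print(f"In-sample meta RMSE      : {in_sample_rmse:.5f}")
print(f"Cross-validated meta RMSE: {cv_rmse:.5f}")
```

With only a handful of meta-features the two numbers tend to be close, but the cross-validated one is the fairer figure to compare against the base models' OOF RMSE.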


---
## 19. Weighted Blending: Weights Optimized with SciPy

As an alternative to stacking, we take a weighted average of the base-model predictions.  
SciPy's `minimize` function starts from equal weights and searches for the non-negative weights, normalized to sum to 1, that minimize the blended OOF RMSE.

In [24]:
from scipy.optimize import minimize

model_names = list(oof_dict.keys())
oof_array   = np.array([oof_dict[n]  for n in model_names]).T
test_array  = np.array([test_dict[n] for n in model_names]).T

def blend_rmse(weights):
    w = np.abs(weights) / np.sum(np.abs(weights))
    return np.sqrt(mean_squared_error(y_train, oof_array @ w))

result    = minimize(blend_rmse, np.ones(len(model_names)) / len(model_names),
                     bounds=[(0,1)]*len(model_names), method='SLSQP',
                     options={'maxiter':1000, 'ftol':1e-10})
optimal_w = np.abs(result.x) / np.sum(np.abs(result.x))

print("📊 Optimized Blending Weights:")
for name, w in sorted(zip(model_names, optimal_w), key=lambda x: -x[1]):
    print(f"  {name:22s}: {w:.4f}")

blended_train_log = oof_array  @ optimal_w
blended_test_log  = test_array @ optimal_w
blend_oof_rmse    = np.sqrt(mean_squared_error(y_train, blended_train_log))

print(f"\n🏆 Blending OOF RMSE : {blend_oof_rmse:.5f}")
print(f"🏆 Stacking OOF RMSE: {stacked_oof_rmse:.5f}")

📊 Optimized Blending Weights:
  ridge                 : 0.4847
  lasso                 : 0.1774
  lgb                   : 0.1454
  catboost              : 0.1145
  xgb                   : 0.0780
  elasticnet            : 0.0000

🏆 Blending OOF RMSE : 0.10830
🏆 Stacking OOF RMSE: 0.10843
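
The cell above enforces "weights sum to 1" by normalizing inside the objective. An alternative formulation, sketched below, passes the same requirement to SLSQP as an explicit equality constraint. This is not the notebook's method; the synthetic target and the three noisy "model" columns are assumptions made only so the example runs on its own.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumption): target plus three noisy "model" columns.
n = 400
y = rng.normal(12.0, 0.4, size=n)
oof = np.column_stack([y + rng.normal(0.0, s, size=n) for s in (0.10, 0.12, 0.20)])

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def blend_loss(w):
    # RMSE of the weighted average of the model columns.
    return rmse(y, oof @ w)

k = oof.shape[1]
result = minimize(
    blend_loss,
    x0=np.full(k, 1.0 / k),                  # start from equal weights
    method="SLSQP",
    bounds=[(0.0, 1.0)] * k,                 # non-negative weights
    constraints={"type": "eq", "fun": lambda w: np.sum(w) - 1.0},  # sum to 1
)
weights = result.x

single_rmses = [rmse(y, oof[:, j]) for j in range(k)]
print("weights     :", np.round(weights, 4))
print("blend RMSE  :", round(blend_loss(weights), 5))
print("best single :", round(min(single_rmses), 5))
```

Stating the constraint explicitly keeps the objective a plain function of the weights, which can make the optimizer's result easier to interpret than weights that are rescaled after the fact.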


---
## 20. Final Ensemble & Submission

We combine the stacking and blending predictions with a single mixing weight.  
The optimal alpha is found by a grid search over the OOF predictions.

**Final step:** `expm1` maps the predictions from log space back to actual price space.

In [25]:
# Find the optimal stacking/blending weight
best_alpha, best_rmse = 0.5, 999.0
for alpha in np.arange(0.0, 1.01, 0.05):
    combo_rmse = np.sqrt(mean_squared_error(
        y_train, alpha * stacked_train_log + (1-alpha) * blended_train_log
    ))
    if combo_rmse < best_rmse:
        best_rmse  = combo_rmse
        best_alpha = alpha

print(f"Optimal alpha (stacking weight)  : {best_alpha:.2f}")
print(f"Optimal (1-alpha) (blend weight) : {1-best_alpha:.2f}")
print(f"Final Ensemble OOF RMSE          : {best_rmse:.5f}")

final_test_log = best_alpha * stacked_test_log + (1 - best_alpha) * blended_test_log
final_pred     = np.expm1(final_test_log)

Optimal alpha (stacking weight)  : 0.10
Optimal (1-alpha) (blend weight) : 0.90
Final Ensemble OOF RMSE          : 0.10830
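
To make the log-space round trip concrete, here is a minimal sketch showing that RMSE on `log1p` values matches an RMSLE computed from raw prices, and that `expm1` restores the original scale. It assumes the target was transformed with `log1p`; the prices and the helper names `rmse` and `rmsle_from_prices` are hypothetical and used only for illustration.

```python
import numpy as np

def rmse(a, b):
    """Root mean squared error between two arrays."""
    return float(np.sqrt(np.mean((a - b) ** 2)))

def rmsle_from_prices(actual_price, predicted_price):
    """RMSLE computed directly from raw prices (log1p variant)."""
    return rmse(np.log1p(actual_price), np.log1p(predicted_price))

# Hypothetical prices, used only for illustration.
actual = np.array([120_000.0, 163_000.0, 214_000.0, 755_000.0])
predicted = np.array([125_000.0, 156_000.0, 208_000.0, 700_000.0])

# Training-time view: the models work with log1p(price).
actual_log = np.log1p(actual)
predicted_log = np.log1p(predicted)

rmse_log = rmse(actual_log, predicted_log)      # what the OOF scores measure
rmsle = rmsle_from_prices(actual, predicted)    # same metric, price inputs

# expm1 undoes log1p, returning predictions to price space.
restored = np.expm1(predicted_log)

print(f"RMSE in log space: {rmse_log:.5f}")
print(f"RMSLE on prices  : {rmsle:.5f}")
print("Round trip ok    :", np.allclose(restored, predicted))
```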


In [26]:
# ── Sanity check ─────────────────────────────────────────────
sub = pd.DataFrame({'Id': test['Id'], 'SalePrice': final_pred})

assert len(sub) == 1459
assert sub['SalePrice'].isnull().sum() == 0
assert (sub['SalePrice'] > 0).all()

print("📊 Submission vs Train Comparison:")
print(f"{'Percentile':>12} {'Train Actual':>15} {'Predicted':>15} {'Diff %':>10}")
print("-" * 55)
for q in [0.50, 0.75, 0.90, 0.95, 0.99, 1.00]:
    real = train_orig['SalePrice'].quantile(q)
    pred = sub['SalePrice'].quantile(q)
    diff = (pred - real) / real * 100
    flag = "✅" if abs(diff) < 15 else "⚠️"
    print(f"{flag} {q*100:>10.0f}%  {real:>14,.0f}  {pred:>14,.0f}  {diff:>+9.1f}%")

sub.to_csv('submission_final.csv', index=False)
print("\n✅ submission_final.csv saved!")
print(f"   Min : ${sub['SalePrice'].min():>10,.0f}")
print(f"   Max : ${sub['SalePrice'].max():>10,.0f}")
print(f"   Mean: ${sub['SalePrice'].mean():>10,.0f}")

📊 Submission vs Train Comparison:
  Percentile    Train Actual       Predicted     Diff %
-------------------------------------------------------
✅         50%         163,000         156,001       -4.3%
✅         75%         214,000         208,153       -2.7%
✅         90%         278,000         283,656       +2.0%
✅         95%         326,100         327,197       +0.3%
✅         99%         442,567         448,471       +1.3%
✅        100%         755,000         843,652      +11.7%

✅ submission_final.csv saved!
   Min : $    49,924
   Max : $   843,652
   Mean: $   176,911


In [27]:
# ── Summary report ───────────────────────────────────────────
print("=" * 60)
print("📋  FINAL PIPELINE SUMMARY REPORT")
print("=" * 60)
print(f"\n🗂️  Data : Train={n_train} | Test={n_test} | Features={X_train_clean.shape[1]}")
print("\n📈 OOF RMSE Scores (ascending):")
for name, rmse in sorted(rmse_dict.items(), key=lambda x: x[1]):
    print(f"   {name:22s}: {rmse:.5f}")
print("\n🔗 Ensemble:")
print(f"   Stacking RMSE  : {stacked_oof_rmse:.5f}")
print(f"   Blending RMSE  : {blend_oof_rmse:.5f}")
print(f"   Final RMSE     : {best_rmse:.5f}")
print(f"   Stack α={best_alpha:.2f}  |  Blend (1-α)={1-best_alpha:.2f}")
print(f"\n🎯 Expected Kaggle LB: ~{best_rmse+0.003:.4f} – {best_rmse+0.008:.4f}")
print("=" * 60)

📋  FINAL PIPELINE SUMMARY REPORT

🗂️  Data : Train=1458 | Test=1459 | Features=188

📈 OOF RMSE Scores (ascending):
   ridge                 : 0.11005
   lasso                 : 0.11018
   elasticnet            : 0.11022
   catboost              : 0.11391
   xgb                   : 0.11569
   lgb                   : 0.11793

🔗 Ensemble:
   Stacking RMSE  : 0.10843
   Blending RMSE  : 0.10830
   Final RMSE     : 0.10830
   Stack α=0.10  |  Blend (1-α)=0.90

🎯 Expected Kaggle LB: ~0.1113 – 0.1163
