<a href="https://colab.research.google.com/github/nahidmaleki/Cross-Validation/blob/main/CrossValidation%26Search.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# 🧮 Insurance Cost Regression: Cross-Validation & Hyperparameter Search (Raw → Refined)
**Goal**: Build a clean, educational workflow on Kaggle’s Medical Cost (insurance.csv):

Baseline with minimal preprocessing, quick CV check.

Refined pipeline: scale, encode, then CV + hyperparameter search (manual grid for SVR + RandomizedSearchCV across linear, regularized, and SVR models), then compare metrics.

### ⚠️ Note on dataset ethics:
Contains personal attributes (age, sex, smoking, region). Use only for learning; avoid sensitive inferences and consider fairness.

### 📌 What you’ll get
Reproducible splits (KFold CV, fixed seed).

Preprocessing: Min-Max scaling (numeric) + get_dummies (categorical, redundant dummies dropped).

Model suite: LR, Ridge, Lasso, ElasticNet, SVR (Random Forest and Gradient Boosting are imported but not part of the search space shown here).

Searches: SVR ParameterGrid with log-target (TransformedTargetRegressor) + RandomizedSearchCV over multiple models.

Evaluation: R² (primary) and RMSE; ranked trials log; best model refit on train and tested on hold-out.

Lightweight, classroom-ready code; minimal imports (pandas, scikit-learn, kagglehub).
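
To make the "reproducible splits" point concrete, here is a minimal sketch on hypothetical toy data (`X_demo` and `fold_indices` are illustrative names, not part of the notebook). It checks that `KFold` with `shuffle=True` and a fixed `random_state` yields the same folds on every run.

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(20).reshape(10, 2)  # hypothetical toy data: 10 rows, 2 features


def fold_indices(seed):
    """Return the validation indices of each fold for a given seed."""
    kf = KFold(n_splits=5, shuffle=True, random_state=seed)
    return [val_idx.tolist() for _, val_idx in kf.split(X_demo)]


same_seed_a = fold_indices(42)
same_seed_b = fold_indices(42)

# Identical seeds give identical folds, so CV scores are comparable across runs
print(same_seed_a == same_seed_b)
```

Passing the same `cv` object to every model, as the cells below do, means each candidate is scored on the same folds.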

### 📥 Load Kaggle “Medical Cost (insurance.csv)” + quick scan

In [1]:
# Requires: `pip install kagglehub` (once)
import kagglehub  # downloads and caches Kaggle datasets
import pandas as pd

path = kagglehub.dataset_download("mirichoi0218/insurance")
df = pd.read_csv(f"{path}/insurance.csv")  # main df

df.head()   # first rows


Using Colab cache for faster access to the 'insurance' dataset.


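| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |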


In [2]:
df.shape    # (rows, columns)


(1338, 7)

In [3]:
df.info()   # dtypes & not-null counts


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


In [4]:
df.describe(include="all")    # summary (numeric+categorical)


| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| count | 1338.0 | 1338 | 1338.0 | 1338.0 | 1338 | 1338 | 1338.0 |
| unique | | 2 | | | 2 | 4 | |
| top | | male | | | no | southeast | |
| freq | | 676 | | | 1064 | 364 | |
| mean | 39.207025 | | 30.663397 | 1.094918 | | | 13270.422265 |
| std | 14.04996 | | 6.098187 | 1.205493 | | | 12110.011237 |
| min | 18.0 | | 15.96 | 0.0 | | | 1121.8739 |
| 25% | 27.0 | | 26.29625 | 0.0 | | | 4740.28715 |
| 50% | 39.0 | | 30.4 | 1.0 | | | 9382.033 |
| 75% | 51.0 | | 34.69375 | 2.0 | | | 16639.912515 |


In [5]:
df.isna().sum().sum()   # total missing values across all columns


np.int64(0)
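
The chained `.sum().sum()` collapses everything into one number. As a small sketch on a hypothetical frame (`toy` is made up for illustration and deliberately contains gaps), the first `.sum()` alone gives the per-column breakdown:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two missing cells (not the insurance data)
toy = pd.DataFrame({
    "age": [19, np.nan, 28],
    "bmi": [27.9, 33.8, np.nan],
    "sex": ["female", "male", "male"],
})

per_column = toy.isna().sum()    # Series: missing count for each column
total = int(per_column.sum())    # scalar: missing cells in the whole frame

print(per_column)
print("total missing:", total)
```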

### ✂️ Train–test split (regression target = charges)

In [6]:
from sklearn.model_selection import train_test_split
# Features (mixed types for now; scaling and encoding are applied in later cells)
X = df.drop(columns=["charges"])
y = df["charges"]

# Split (no stratify for regression)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
    )
print("Train: ", X_train.shape, "Test: ", X_test.shape)


Train:  (1070, 6) Test:  (268, 6)


### 📏 Min–Max scale numeric columns (fit on train only)

In [16]:
from sklearn.preprocessing import MinMaxScaler

num_cols = X_train.select_dtypes(include="number").columns

# Copy first to avoid pandas SettingWithCopyWarning on the split frames
X_train, X_test = X_train.copy(), X_test.copy()

scaler = MinMaxScaler()
# Fit on train only, then reuse the same min/max on test.
# Whole-column assignment lets the int64 columns (age, children) become float.
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])
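
As a rough illustration of what "fit on train only" implies, the sketch below uses hypothetical age values (not taken from the dataset) and reproduces the transform by hand as `(x - train_min) / (train_max - train_min)`. Because the minimum and maximum come from the training rows, test values outside that range map outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train = np.array([[18.0], [40.0], [64.0]])  # hypothetical ages seen in training
test = np.array([[16.0], [41.0], [70.0]])   # hypothetical unseen ages

scaler = MinMaxScaler().fit(train)          # learns min=18, max=64 from train only

# Manual version of the same transform, using the training min and max
manual = (test - train.min(axis=0)) / (train.max(axis=0) - train.min(axis=0))
scaled = scaler.transform(test)

print(np.allclose(manual, scaled))          # both routes agree
print(scaled.ravel())                       # first value is below 0, last is above 1
```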


### 🔤 Encode categoricals (one-hot via get_dummies)

In [8]:
X_train["region"].unique()


array(['northwest', 'northeast', 'southeast', 'southwest'], dtype=object)

In [9]:
X_train["region"].nunique()


4

In [10]:
X_train.shape


(1070, 6)

In [17]:
# Encode categoricals with pandas; drop redundant dummies; align columns
cat_cols = X_train.select_dtypes(include="object").columns

X_train = pd.get_dummies(X_train, columns=cat_cols)
X_test  = pd.get_dummies(X_test,  columns=cat_cols)

# Drop chosen redundant binaries (keep sex_female, smoker_yes);
# errors="ignore" skips a column that is absent from one of the frames
redundant = ["sex_male", "smoker_no"]
X_train = X_train.drop(columns=redundant, errors="ignore")
X_test  = X_test.drop(columns=redundant, errors="ignore")

# Ensure same columns in train/test
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)
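
The `reindex` step matters when a category shows up in train but not in test. The following sketch uses two tiny hypothetical frames (not the real split) to show the mismatch and how `fill_value=0` repairs it:

```python
import pandas as pd

# Hypothetical frames: test happens to contain only two of the four regions
train = pd.DataFrame({"region": ["northwest", "northeast", "southeast", "southwest"]})
test = pd.DataFrame({"region": ["northwest", "southeast"]})

train_enc = pd.get_dummies(train, columns=["region"])
test_enc = pd.get_dummies(test, columns=["region"])

# Dummy columns that exist in train but were never created for test
missing = sorted(set(train_enc.columns) - set(test_enc.columns))

# Align test to the train layout; absent dummies are filled with 0
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(missing)
print(list(test_aligned.columns) == list(train_enc.columns))
```

Without the alignment, a model fitted on the train columns would reject the narrower test frame at predict time.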


In [12]:
X_train.shape


(1070, 11)
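
One alternative to the manual steps above, not used in this notebook, is to put scaling and encoding inside a scikit-learn `Pipeline` so that they are re-fit on the training part of each CV fold. The sketch below assumes a synthetic frame (`toy` and `target` are fabricated for illustration only) and a `Ridge` model chosen arbitrarily.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Synthetic stand-in for the insurance data (illustration only)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "age": rng.integers(18, 65, n),
    "bmi": rng.normal(30, 6, n),
    "smoker": rng.choice(["yes", "no"], n),
    "region": rng.choice(["northwest", "northeast", "southeast", "southwest"], n),
})
target = (250 * toy["age"] + 300 * toy["bmi"]
          + 20000 * (toy["smoker"] == "yes") + rng.normal(0, 1000, n))

# Numeric columns are min-max scaled, categorical columns are one-hot encoded
preprocess = ColumnTransformer([
    ("num", MinMaxScaler(), ["age", "bmi"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["smoker", "region"]),
])
pipe = Pipeline([("prep", preprocess), ("model", Ridge(alpha=1.0))])

# The preprocessing is fit inside each fold, so validation rows never influence it
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, toy, target, scoring="r2", cv=cv)
print(scores.round(3), scores.mean().round(3))
```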

In [18]:
# Grid search over SVR (including kernel) using ParameterGrid + log-target
from sklearn.svm import SVR
from sklearn.model_selection import ParameterGrid, KFold, cross_val_score
from sklearn.compose import TransformedTargetRegressor
from sklearn.metrics import r2_score
import numpy as np

grid = {
    "kernel": ["rbf", "linear", "poly"],
    "C": [1.0, 3.0, 10.0, 30.0],
    "epsilon": [0.1, 0.5, 1.0],
    "gamma": ["scale", "auto"],   # ignored for linear; OK to keep
    "degree": [2, 3],             # used only for poly
}

cv = KFold(n_splits=5, shuffle=True, random_state=42)
best, best_r2 = None, -1e9

for p in ParameterGrid(grid):
    model = TransformedTargetRegressor(
        regressor=SVR(**p),
        func=np.log1p, inverse_func=np.expm1
    )
    r2 = cross_val_score(model, X_train, y_train, scoring="r2", cv=cv).mean()
    if r2 > best_r2:
        best, best_r2 = p, r2
    print(p, f"CV_R2={r2:.3f}")

best_model = TransformedTargetRegressor(
    regressor=SVR(**best),
    func=np.log1p, inverse_func=np.expm1
).fit(X_train, y_train)

print("Best:", best, f"Test_R2={r2_score(y_test, best_model.predict(X_test)):.3f}")


{'C': 1.0, 'degree': 2, 'epsilon': 0.1, 'gamma': 'scale', 'kernel': 'rbf'} CV_R2=0.809
{'C': 1.0, 'degree': 2, 'epsilon': 0.1, 'gamma': 'scale', 'kernel': 'linear'} CV_R2=0.414
{'C': 1.0, 'degree': 2, 'epsilon': 0.1, 'gamma': 'scale', 'kernel': 'poly'} CV_R2=0.794
{'C': 1.0, 'degree': 2, 'epsilon': 0.1, 'gamma': 'auto', 'kernel': 'rbf'} CV_R2=0.730
{'C': 1.0, 'degree': 2, 'epsilon': 0.1, 'gamma': 'auto', 'kernel': 'linear'} CV_R2=0.414
{'C': 1.0, 'degree': 2, 'epsilon': 0.1, 'gamma': 'auto', 'kernel': 'poly'} CV_R2=0.637
{'C': 1.0, 'degree': 2, 'epsilon': 0.5, 'gamma': 'scale', 'kernel': 'rbf'} CV_R2=0.681
{'C': 1.0, 'degree': 2, 'epsilon': 0.5, 'gamma': 'scale', 'kernel': 'linear'} CV_R2=0.461
{'C': 1.0, 'degree': 2, 'epsilon': 0.5, 'gamma': 'scale', 'kernel': 'poly'} CV_R2=0.679
{'C': 1.0, 'degree': 2, 'epsilon': 0.5, 'gamma': 'auto', 'kernel': 'rbf'} CV_R2=0.673
{'C': 1.0, 'degree': 2, 'epsilon': 0.5, 'gamma': 'auto', 'kernel': 'linear'} CV_R2=0.461
{'C': 1.0, 'degree': 2, 'epsilon'
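
The log above shows the linear kernel scoring the same for `gamma='scale'` and `gamma='auto'`, since that kernel ignores `gamma`. `ParameterGrid` also accepts a list of dicts, so a possible variant (a sketch, not what the cell above runs) is to attach `gamma` and `degree` only to the kernels that use them. Counting the combinations shows the saving:

```python
from sklearn.model_selection import ParameterGrid

# Same flat grid as in the cell above
flat_grid = {
    "kernel": ["rbf", "linear", "poly"],
    "C": [1.0, 3.0, 10.0, 30.0],
    "epsilon": [0.1, 0.5, 1.0],
    "gamma": ["scale", "auto"],
    "degree": [2, 3],
}

# Conditional variant: each kernel only gets the parameters it actually uses
shared = {"C": [1.0, 3.0, 10.0, 30.0], "epsilon": [0.1, 0.5, 1.0]}
conditional_grid = [
    {"kernel": ["linear"], **shared},
    {"kernel": ["rbf"], "gamma": ["scale", "auto"], **shared},
    {"kernel": ["poly"], "gamma": ["scale", "auto"], "degree": [2, 3], **shared},
]

n_flat = len(ParameterGrid(flat_grid))
n_conditional = len(ParameterGrid(conditional_grid))
print(n_flat, n_conditional)  # 144 84
```

Each entry of `ParameterGrid(conditional_grid)` can still be unpacked with `SVR(**p)`, so the loop body would not need to change.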

In [19]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor  # imported, but not in the search space below
from sklearn.svm import SVR

pipe = Pipeline([("model", LinearRegression())])  # placeholder; replaced by grids below

param_distributions = [
    {"model": [LinearRegression()], "model__fit_intercept": [True, False]},
    {"model": [Ridge()], "model__alpha": [1e-3, 1e-2, 1e-1, 1, 10, 100], "model__fit_intercept": [True, False]},
    {"model": [Lasso(max_iter=5000)], "model__alpha": [1e-3, 1e-2, 1e-1, 1, 10], "model__fit_intercept": [True, False]},
    {"model": [ElasticNet(max_iter=5000)], "model__alpha": [1e-3, 1e-2, 1e-1, 1, 10], "model__l1_ratio": [0.05, 0.2, 0.5, 0.8, 0.95], "model__fit_intercept": [True, False]},
    {"model": [SVR()], "model__C": [0.1, 1, 10, 100], "model__gamma": ["scale", "auto"], "model__epsilon": [0.01, 0.1, 0.5]},
]
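
If tree ensembles should take part in the search, the same list-of-dicts pattern can be extended. The sketch below is an assumption about how that might look, run on synthetic `make_regression` data; the hyperparameter values and the name `ensemble_space` are illustrative, not tuned or taken from the notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, RandomizedSearchCV
from sklearn.pipeline import Pipeline

# Synthetic regression data (illustration only)
X_toy, y_toy = make_regression(n_samples=150, n_features=5, noise=10.0, random_state=0)

pipe = Pipeline([("model", Ridge())])  # placeholder step, swapped out by the search

# Hypothetical search space: one dict per model family
ensemble_space = [
    {"model": [Ridge()], "model__alpha": [0.1, 1, 10]},
    {"model": [RandomForestRegressor(random_state=42)],
     "model__n_estimators": [20, 50],
     "model__max_depth": [None, 4, 8]},
    {"model": [GradientBoostingRegressor(random_state=42)],
     "model__n_estimators": [20, 50],
     "model__learning_rate": [0.05, 0.1],
     "model__max_depth": [2, 3]},
]

search = RandomizedSearchCV(
    pipe,
    ensemble_space,
    n_iter=6,
    scoring="neg_root_mean_squared_error",
    cv=KFold(n_splits=3, shuffle=True, random_state=42),
    random_state=42,
)
search.fit(X_toy, y_toy)

# Which model families were actually sampled in these 6 trials
tried = {type(p["model"]).__name__ for p in search.cv_results_["params"]}
print(tried)
print(search.best_params_)
```

With a small `n_iter`, some families may not be sampled at all, which is worth checking before drawing conclusions about which model "won".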


In [20]:
from sklearn.model_selection import KFold, RandomizedSearchCV
from sklearn.metrics import r2_score

cv = KFold(n_splits=5, shuffle=True, random_state=42)

rs = RandomizedSearchCV(
    estimator=pipe,
    param_distributions=param_distributions,
    n_iter=50,
    scoring="neg_root_mean_squared_error",
    cv=cv,
    n_jobs=-1,
    random_state=42,
    refit=True,
)

rs.fit(X_train, y_train)

y_pred = rs.predict(X_test)
r2   = r2_score(y_test, y_pred)

print(rs.best_estimator_)
print({"R2": round(r2, 3)})


Pipeline(steps=[('model', Lasso(alpha=10, max_iter=5000))])
{'R2': 0.783}
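
The cell above reports only R² on the hold-out set. A ranked trial log and a hold-out RMSE can be read from a fitted search roughly as follows. This is a sketch on synthetic data with a single `Lasso` family; `trials`, `test_rmse`, and `test_r2` are illustrative names.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic data (illustration only)
X_toy, y_toy = make_regression(n_samples=200, n_features=5, noise=15.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

search = RandomizedSearchCV(
    Lasso(max_iter=5000),
    {"alpha": [1e-3, 1e-2, 1e-1, 1, 10]},
    n_iter=5,
    scoring="neg_root_mean_squared_error",
    cv=5,
    random_state=42,
).fit(X_tr, y_tr)

# Ranked trial log: the scorer is negated RMSE, so flip the sign for readability
trials = (
    pd.DataFrame(search.cv_results_)
    .assign(cv_rmse=lambda d: -d["mean_test_score"])
    .sort_values("rank_test_score")
    [["rank_test_score", "param_alpha", "cv_rmse", "std_test_score"]]
)
print(trials)

# Hold-out metrics for the refit best estimator
y_hat = search.predict(X_te)
test_rmse = float(np.sqrt(mean_squared_error(y_te, y_hat)))
test_r2 = r2_score(y_te, y_hat)
print({"RMSE": round(test_rmse, 3), "R2": round(test_r2, 3)})
```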
