<a href="https://colab.research.google.com/github/francji1/01RAD/blob/main/code/01RAD_Ex11__hands_on_empty.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Kaggle house data set



## Downloading the Kaggle house rent dataset

The dataset we will use comes from Kaggle:

- *House Rent Prediction Dataset*  
  https://www.kaggle.com/datasets/iamsouravbanerjee/house-rent-prediction-dataset/data

To download directly from Kaggle inside this notebook you need a Kaggle
API token (see *Account ? API ? Create New Token* on Kaggle). The cell
below assumes you have configured your `KAGGLE_USERNAME` and
`KAGGLE_KEY` environment variables or placed `kaggle.json` in the
standard location.

Exampe of Auto NB - let's beat it

https://www.kaggle.com/code/sahityasetu/boosting-algorithms-for-machine-learning


In [1]:

# Download the Kaggle house rent dataset using kagglehub (no API key needed for public data)
try:
    import kagglehub  # lightweight helper for Kaggle datasets
except ImportError:  # pragma: no cover
    %pip install -q kagglehub
    import kagglehub

# Download latest version of the dataset; this returns a local directory path
path = kagglehub.dataset_download("iamsouravbanerjee/house-rent-prediction-dataset")
print("Path to dataset files:", path)


Path to dataset files: C:\Users\francji1\.cache\kagglehub\datasets\iamsouravbanerjee\house-rent-prediction-dataset\versions\9


In [2]:

from pathlib import Path
import pandas as pd

# `path` is a directory returned by kagglehub; locate the CSV inside it
dataset_dir = Path(path)
candidates = list(dataset_dir.rglob("House_Rent_Dataset.csv"))
if not candidates:
    raise FileNotFoundError(f"House_Rent_Dataset.csv not found under {dataset_dir}")

csv_path = candidates[0]
print("Loading data from:", csv_path)
house = pd.read_csv(csv_path)
print("Shape:", house.shape)
house.head()


Loading data from: C:\Users\francji1\.cache\kagglehub\datasets\iamsouravbanerjee\house-rent-prediction-dataset\versions\9\House_Rent_Dataset.csv
Shape: (4746, 12)


Unnamed: 0,Posted On,BHK,Rent,Size,Floor,Area Type,Area Locality,City,Furnishing Status,Tenant Preferred,Bathroom,Point of Contact
0,2022-05-18,2,10000,1100,Ground out of 2,Super Area,Bandel,Kolkata,Unfurnished,Bachelors/Family,2,Contact Owner
1,2022-05-13,2,20000,800,1 out of 3,Super Area,"Phool Bagan, Kankurgachi",Kolkata,Semi-Furnished,Bachelors/Family,1,Contact Owner
2,2022-05-16,2,17000,1000,1 out of 3,Super Area,Salt Lake City Sector 2,Kolkata,Semi-Furnished,Bachelors/Family,1,Contact Owner
3,2022-07-04,2,10000,800,1 out of 2,Super Area,Dumdum Park,Kolkata,Unfurnished,Bachelors/Family,1,Contact Owner
4,2022-05-09,2,7500,850,1 out of 2,Carpet Area,South Dum Dum,Kolkata,Unfurnished,Bachelors,1,Contact Owner



## Questions for a linear regression analysis of house rent

When building a linear regression model for rent, it is useful to think
in terms of a workflow:

1. **Understand the data**
   - What is the response variable (e.g. `Rent`)?  
     What are the main predictor types (numeric, categorical, locations,
     amenities)?
   - Are there obvious data quality issues (missing values, impossible
     values, outliers)?

2. **Preprocessing and feature engineering**
   - How should categorical variables (e.g. city, furnishing status,
     point of contact) be encoded for a linear model (one?hot encoding,
     target encoding, etc.)?
   - Which numeric variables might benefit from scaling (standardization
     or robust scaling), and why can this matter for regularized
     regression?
   - Are there interactions that are conceptually meaningful
     (e.g. `BHK , Size`, `City , DistanceFromMainArea`)?
   - Can we create more interpretable features (e.g. rent per square
     foot, distance to city centre bins)?

3. **Transformations of response and regressors**
   - Is the distribution of `Rent` highly skewed or heavy tailed? Would a
     log transformation (modeling $\log(\text{Rent})$) stabilize
     variance and make residuals closer to normal?
   - Do some predictors show non linear relationships with rent? Would
     polynomial terms, splines, or monotone transforms (log, square
     root) be appropriate?
   - Are there predictors that should be centered or standardized before
     creating interaction or polynomial terms?

4. **Model specification and selection**
   - Start with a simple baseline: which variables should be included in
     a first OLS model, and how do residual plots look?
   - How to compare alternative specifications
     (different sets of features, transformed vs untransformed variables)
     using cross validation or a validation set?
   - When is it useful to move from plain OLS to regularized models such
     as ridge or lasso (e.g. many correlated predictors, high variance)?

5. **Model evaluation and diagnostics**
   - How to check linear model assumptions: residual vs fitted plots,
     QQ plots, heteroscedasticity, influential observations?
   - Which error metrics are most relevant here
     (RMSE, MAE, MAPE)?  How do training and test errors compare
     (overfitting vs underfitting)?
   - Are there systematic groups of houses (by city, BHK, furnishing)
     for which the model performs much worse, suggesting missing
     structure or interactions?



### More detailed questions to explore

- **Preprocessing**
  - How should missing values be handled for each variable (impute,
    drop, or create explicit missing indicators)?
  - Do we need to cap or Winsorize extreme values of `Rent` or `Size`
    before fitting a linear model?
  - Are there rare categories (e.g. cities or furnishing statuses with
    very few observations) that should be grouped together?

- **Transformations and linearity**
  - Plot `Rent` (or $\log(\text{Rent})$) against key predictors:
    `Size`, `BHK`, `Bathroom`, `City`, etc.  Do the relationships look
    approximately linear after transformation?
  - Would modeling $\log(\text{Rent})$ make residuals more symmetric and
    reduce heteroscedasticity?

- **Multicollinearity and regularization**
  - Are some predictors strongly correlated (e.g. `Size` and `BHK`)?  How
    do VIFs and condition numbers look for the chosen design matrix?
  - How do ridge and lasso behave in this dataset in terms of coefficient
    shrinkage and variable selection?
  - Which predictors consistently get selected by lasso across
    cross?validation folds?

- **Model selection and validation**
  - How does test error change when we:
    1. Add more predictors,
    2. Add interaction terms,
    3. Add polynomial terms,
    4. Switch from OLS to ridge/lasso?
  - How to choose the final model: by minimum cross?validated RMSE,
    parsimony (fewest predictors), or domain interpretability?

Use these questions as a checklist to design your own modeling pipeline
for the house rent dataset using linear regression and its regularized
variants.


## 1. Explorační datová analýza (EDA)

In [None]:
# Základní informace o datasetu
print("=" * 60)
print("ZÁKLADNÍ INFORMACE O DATASETU")
print("=" * 60)
print(f"\nRozměry datasetu: {house.shape[0]} řádků, {house.shape[1]} sloupců")
print("\nDatatypy:")
print(house.dtypes)
print("\n" + "=" * 60)
print("CHYBĚJÍCÍ HODNOTY")
print("=" * 60)
missing = house.isnull().sum()
missing_pct = 100 * house.isnull().sum() / len(house)
missing_table = pd.DataFrame({'Počet': missing, 'Procento': missing_pct})
print(missing_table[missing_table['Počet'] > 0])

print("\n" + "=" * 60)
print("ZÁKLADNÍ STATISTIKY - NUMERICKÉ PROMĚNNÉ")
print("=" * 60)
print(house.describe())

print("\n" + "=" * 60)
print("KATEGORICKÉ PROMĚNNÉ - UNIKÁTNÍ HODNOTY")
print("=" * 60)
categorical_cols = house.select_dtypes(include=['object']).columns
for col in categorical_cols:
    print(f"\n{col}: {house[col].nunique()} unikátních hodnot")
    print(house[col].value_counts().head(10))

In [None]:
# Vizualizace - distribuce klíčových proměnných
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (16, 12)

fig, axes = plt.subplots(3, 3, figsize=(18, 14))

# 1. Distribuce Rent
axes[0, 0].hist(house['Rent'], bins=50, edgecolor='black', alpha=0.7)
axes[0, 0].set_title('Distribuce Rent', fontsize=12, fontweight='bold')
axes[0, 0].set_xlabel('Rent')
axes[0, 0].set_ylabel('Četnost')

# 2. Distribuce log(Rent)
axes[0, 1].hist(np.log(house['Rent']), bins=50, edgecolor='black', alpha=0.7, color='orange')
axes[0, 1].set_title('Distribuce log(Rent)', fontsize=12, fontweight='bold')
axes[0, 1].set_xlabel('log(Rent)')
axes[0, 1].set_ylabel('Četnost')

# 3. Boxplot Rent
axes[0, 2].boxplot(house['Rent'])
axes[0, 2].set_title('Boxplot Rent', fontsize=12, fontweight='bold')
axes[0, 2].set_ylabel('Rent')

# 4. Rent vs Size
axes[1, 0].scatter(house['Size'], house['Rent'], alpha=0.5)
axes[1, 0].set_title('Rent vs Size', fontsize=12, fontweight='bold')
axes[1, 0].set_xlabel('Size')
axes[1, 0].set_ylabel('Rent')

# 5. Rent vs BHK
axes[1, 1].boxplot([house[house['BHK'] == bhk]['Rent'].values for bhk in sorted(house['BHK'].unique())])
axes[1, 1].set_title('Rent vs BHK', fontsize=12, fontweight='bold')
axes[1, 1].set_xlabel('BHK')
axes[1, 1].set_ylabel('Rent')
axes[1, 1].set_xticklabels(sorted(house['BHK'].unique()))

# 6. Rent vs Bathroom
axes[1, 2].boxplot([house[house['Bathroom'] == bath]['Rent'].values for bath in sorted(house['Bathroom'].unique())])
axes[1, 2].set_title('Rent vs Bathroom', fontsize=12, fontweight='bold')
axes[1, 2].set_xlabel('Bathroom')
axes[1, 2].set_ylabel('Rent')
axes[1, 2].set_xticklabels(sorted(house['Bathroom'].unique()))

# 7. Rent vs City (top 10 měst)
top_cities = house['City'].value_counts().head(10).index
city_data = house[house['City'].isin(top_cities)]
city_means = city_data.groupby('City')['Rent'].mean().sort_values()
axes[2, 0].barh(range(len(city_means)), city_means.values)
axes[2, 0].set_yticks(range(len(city_means)))
axes[2, 0].set_yticklabels(city_means.index, fontsize=9)
axes[2, 0].set_title('Průměrný Rent podle City (top 10)', fontsize=12, fontweight='bold')
axes[2, 0].set_xlabel('Průměrný Rent')

# 8. Rent vs Furnishing Status
furn_order = ['Unfurnished', 'Semi-Furnished', 'Furnished']
furn_data = [house[house['Furnishing Status'] == f]['Rent'].values for f in furn_order if f in house['Furnishing Status'].unique()]
axes[2, 1].boxplot(furn_data)
axes[2, 1].set_title('Rent vs Furnishing Status', fontsize=12, fontweight='bold')
axes[2, 1].set_xlabel('Furnishing Status')
axes[2, 1].set_ylabel('Rent')
axes[2, 1].set_xticklabels([f for f in furn_order if f in house['Furnishing Status'].unique()], rotation=15, ha='right')

# 9. Korelační matice numerických proměnných
numeric_cols = house.select_dtypes(include=[np.number]).columns
corr_matrix = house[numeric_cols].corr()
im = axes[2, 2].imshow(corr_matrix, cmap='coolwarm', aspect='auto', vmin=-1, vmax=1)
axes[2, 2].set_xticks(range(len(numeric_cols)))
axes[2, 2].set_yticks(range(len(numeric_cols)))
axes[2, 2].set_xticklabels(numeric_cols, rotation=45, ha='right', fontsize=9)
axes[2, 2].set_yticklabels(numeric_cols, fontsize=9)
axes[2, 2].set_title('Korelační matice', fontsize=12, fontweight='bold')
plt.colorbar(im, ax=axes[2, 2])

plt.tight_layout()
plt.show()

print("\nKorelační matice:")
print(corr_matrix)

## 2. Data Preprocessing

In [None]:
# Vytvoření kopie datasetu pro preprocessing
house_clean = house.copy()

print("=" * 60)
print("DATA CLEANING")
print("=" * 60)

# 1. Kontrola a odstranění duplicit
print(f"\nPočet duplicitních řádků: {house_clean.duplicated().sum()}")
house_clean = house_clean.drop_duplicates()
print(f"Rozměry po odstranění duplicit: {house_clean.shape}")

# 2. Odstranění extrémních outlierů pomocí IQR metody pro Rent
Q1 = house_clean['Rent'].quantile(0.25)
Q3 = house_clean['Rent'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 3 * IQR
upper_bound = Q3 + 3 * IQR

print(f"\nRent outliers:")
print(f"  Lower bound: {lower_bound:.0f}, Upper bound: {upper_bound:.0f}")
outliers = ((house_clean['Rent'] < lower_bound) | (house_clean['Rent'] > upper_bound)).sum()
print(f"  Počet outlierů: {outliers}")

# Odstranění pouze extrémních outlierů (3*IQR)
house_clean = house_clean[(house_clean['Rent'] >= lower_bound) & (house_clean['Rent'] <= upper_bound)]
print(f"Rozměry po odstranění outlierů: {house_clean.shape}")

# 3. Odstranění nelogických hodnot
print(f"\nOdstranění nelogických hodnot:")
print(f"  Řádky s Rent <= 0: {(house_clean['Rent'] <= 0).sum()}")
print(f"  Řádky s Size <= 0: {(house_clean['Size'] <= 0).sum()}")
print(f"  Řádky s BHK <= 0: {(house_clean['BHK'] <= 0).sum()}")

house_clean = house_clean[(house_clean['Rent'] > 0) & 
                           (house_clean['Size'] > 0) & 
                           (house_clean['BHK'] > 0)]
print(f"Rozměry po čištění: {house_clean.shape}")

# 4. Parsování Floor informace
print(f"\n{'=' * 60}")
print("PARSOVÁNÍ FLOOR COLUMN")
print("=" * 60)
print(f"\nPříklady Floor hodnot:")
print(house_clean['Floor'].value_counts().head(10))

def parse_floor(floor_str):
    """Extrahuje číslo patra a celkový počet pater"""
    if pd.isna(floor_str):
        return np.nan, np.nan
    
    floor_str = str(floor_str).lower()
    
    # Ground floor
    if 'ground' in floor_str:
        floor_num = 0
    else:
        try:
            floor_num = int(floor_str.split()[0])
        except:
            floor_num = np.nan
    
    # Celkový počet pater
    try:
        total_floors = int(floor_str.split('of')[-1].strip())
    except:
        total_floors = np.nan
    
    return floor_num, total_floors

house_clean[['Floor_Number', 'Total_Floors']] = house_clean['Floor'].apply(
    lambda x: pd.Series(parse_floor(x))
)

print(f"\nNové sloupce Floor_Number a Total_Floors vytvořeny")
print(f"Floor_Number - missing values: {house_clean['Floor_Number'].isna().sum()}")
print(f"Total_Floors - missing values: {house_clean['Total_Floors'].isna().sum()}")

# Imputace missing values mediánem
house_clean['Floor_Number'].fillna(house_clean['Floor_Number'].median(), inplace=True)
house_clean['Total_Floors'].fillna(house_clean['Total_Floors'].median(), inplace=True)

print(f"\n{'=' * 60}")
print("FINÁLNÍ STAV PO PREPROCESSINGU")
print("=" * 60)
print(f"\nFinální rozměry: {house_clean.shape}")
print(f"\nChybějící hodnoty:")
print(house_clean.isnull().sum())

## 3. Feature Engineering

In [None]:
print("=" * 60)
print("FEATURE ENGINEERING")
print("=" * 60)

# 1. Vytvoření nových features
print("\n1. Vytváření nových numerických features:")

# Rent per square foot
house_clean['Rent_per_sqft'] = house_clean['Rent'] / house_clean['Size']
print(f"   - Rent_per_sqft: průměr = {house_clean['Rent_per_sqft'].mean():.2f}")

# Size per BHK (průměrná velikost na jeden pokoj)
house_clean['Size_per_BHK'] = house_clean['Size'] / house_clean['BHK']
print(f"   - Size_per_BHK: průměr = {house_clean['Size_per_BHK'].mean():.2f}")

# Floor ratio (na kterém patře z celkového počtu)
house_clean['Floor_Ratio'] = house_clean['Floor_Number'] / (house_clean['Total_Floors'] + 1)
print(f"   - Floor_Ratio: průměr = {house_clean['Floor_Ratio'].mean():.2f}")

# Binary feature: je to přízemí?
house_clean['Is_Ground_Floor'] = (house_clean['Floor_Number'] == 0).astype(int)
print(f"   - Is_Ground_Floor: počet = {house_clean['Is_Ground_Floor'].sum()}")

# Binary feature: je to nejvyšší patro?
house_clean['Is_Top_Floor'] = (house_clean['Floor_Number'] == house_clean['Total_Floors']).astype(int)
print(f"   - Is_Top_Floor: počet = {house_clean['Is_Top_Floor'].sum()}")

# 2. Encoding kategorických proměnných
print("\n2. Encoding kategorických proměnných:")
from sklearn.preprocessing import LabelEncoder

# Area Type - ordinální encoding (Carpet < Built < Super)
area_type_map = {'Carpet Area': 1, 'Built Area': 2, 'Super Area': 3}
house_clean['Area_Type_Encoded'] = house_clean['Area Type'].map(area_type_map)
print(f"   - Area Type encoded: {house_clean['Area_Type_Encoded'].value_counts().to_dict()}")

# Furnishing Status - ordinální encoding
furnishing_map = {'Unfurnished': 0, 'Semi-Furnished': 1, 'Furnished': 2}
house_clean['Furnishing_Encoded'] = house_clean['Furnishing Status'].map(furnishing_map)
print(f"   - Furnishing Status encoded: {house_clean['Furnishing_Encoded'].value_counts().to_dict()}")

# Tenant Preferred - binary encoding (Bachelors vs Family)
house_clean['Tenant_Bachelors'] = house_clean['Tenant Preferred'].str.contains('Bachelors', na=False).astype(int)
house_clean['Tenant_Family'] = house_clean['Tenant Preferred'].str.contains('Family', na=False).astype(int)
print(f"   - Tenant_Bachelors: počet = {house_clean['Tenant_Bachelors'].sum()}")
print(f"   - Tenant_Family: počet = {house_clean['Tenant_Family'].sum()}")

# Point of Contact - binary encoding
house_clean['Contact_Owner'] = (house_clean['Point of Contact'] == 'Contact Owner').astype(int)
print(f"   - Contact_Owner: počet = {house_clean['Contact_Owner'].sum()}")

# City - one-hot encoding (top 10 měst + Other)
print("\n3. City encoding (top 10 + Other):")
top_10_cities = house_clean['City'].value_counts().head(10).index.tolist()
for city in top_10_cities:
    house_clean[f'City_{city}'] = (house_clean['City'] == city).astype(int)
    print(f"   - City_{city}: {house_clean[f'City_{city}'].sum()}")

house_clean['City_Other'] = (~house_clean['City'].isin(top_10_cities)).astype(int)
print(f"   - City_Other: {house_clean['City_Other'].sum()}")

print("\n" + "=" * 60)
print("PŘEHLED NOVÝCH FEATURES")
print("=" * 60)
print(f"\nCelkový počet sloupců: {house_clean.shape[1]}")
print("\nNové numerické features:")
new_numeric = ['Rent_per_sqft', 'Size_per_BHK', 'Floor_Ratio', 'Is_Ground_Floor', 'Is_Top_Floor', 
               'Floor_Number', 'Total_Floors']
print(house_clean[new_numeric].describe())

## 4. Příprava dat pro modelování - Train/Test Split a Transformace

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

print("=" * 60)
print("PŘÍPRAVA DAT PRO MODELOVÁNÍ")
print("=" * 60)

# Definice feature columns (bez původních kategorických sloupců)
feature_cols = ['BHK', 'Size', 'Bathroom', 
                'Floor_Number', 'Total_Floors', 'Floor_Ratio',
                'Is_Ground_Floor', 'Is_Top_Floor',
                'Rent_per_sqft', 'Size_per_BHK',
                'Area_Type_Encoded', 'Furnishing_Encoded',
                'Tenant_Bachelors', 'Tenant_Family', 'Contact_Owner']

# Přidání city dummy variables
city_cols = [col for col in house_clean.columns if col.startswith('City_')]
feature_cols.extend(city_cols)

print(f"\nPočet features pro modelování: {len(feature_cols)}")
print(f"Features: {feature_cols[:10]}... (zobrazeno prvních 10)")

# Vytvoření X a y
X = house_clean[feature_cols].copy()
y = house_clean['Rent'].copy()

print(f"\nX shape: {X.shape}")
print(f"y shape: {y.shape}")

# Train-test split (80-20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"\nTrain set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")

# Vytvoření log-transformované verze target variable
y_train_log = np.log(y_train)
y_test_log = np.log(y_test)

print(f"\nLog transformace target variable:")
print(f"  y_train mean: {y_train.mean():.2f} -> log: {y_train_log.mean():.2f}")
print(f"  y_train std: {y_train.std():.2f} -> log: {y_train_log.std():.2f}")

# Scaling features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"\nFeatures byly standardizovány (mean=0, std=1)")
print(f"\nPrvních 5 features před scaling:")
print(X_train.iloc[0, :5].values)
print(f"Prvních 5 features po scaling:")
print(X_train_scaled[0, :5])

# Vytvoření DataFrame pro scaled data (pro lepší interpretaci)
X_train_scaled_df = pd.DataFrame(X_train_scaled, columns=feature_cols, index=X_train.index)
X_test_scaled_df = pd.DataFrame(X_test_scaled, columns=feature_cols, index=X_test.index)

print("\n" + "=" * 60)
print("DATA PŘIPRAVENA PRO MODELOVÁNÍ")
print("=" * 60)

## 5. Baseline OLS Model - Běžný Rent vs Log(Rent)

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import scipy.stats as stats

print("=" * 60)
print("BASELINE OLS MODELS")
print("=" * 60)

# Model 1: OLS s původním Rent
print("\n" + "=" * 60)
print("MODEL 1: OLS - původní Rent")
print("=" * 60)

ols_model = LinearRegression()
ols_model.fit(X_train_scaled, y_train)

y_train_pred = ols_model.predict(X_train_scaled)
y_test_pred = ols_model.predict(X_test_scaled)

train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
train_mae = mean_absolute_error(y_train, y_train_pred)
test_mae = mean_absolute_error(y_test, y_test_pred)
train_r2 = r2_score(y_train, y_train_pred)
test_r2 = r2_score(y_test, y_test_pred)

print(f"\nTrain RMSE: {train_rmse:.2f}")
print(f"Test RMSE: {test_rmse:.2f}")
print(f"Train MAE: {train_mae:.2f}")
print(f"Test MAE: {test_mae:.2f}")
print(f"Train R²: {train_r2:.4f}")
print(f"Test R²: {test_r2:.4f}")

# Model 2: OLS s log(Rent)
print("\n" + "=" * 60)
print("MODEL 2: OLS - log(Rent)")
print("=" * 60)

ols_log_model = LinearRegression()
ols_log_model.fit(X_train_scaled, y_train_log)

y_train_log_pred = ols_log_model.predict(X_train_scaled)
y_test_log_pred = ols_log_model.predict(X_test_scaled)

# Transformace zpět na původní škálu
y_train_pred_exp = np.exp(y_train_log_pred)
y_test_pred_exp = np.exp(y_test_log_pred)

train_rmse_log = np.sqrt(mean_squared_error(y_train, y_train_pred_exp))
test_rmse_log = np.sqrt(mean_squared_error(y_test, y_test_pred_exp))
train_mae_log = mean_absolute_error(y_train, y_train_pred_exp)
test_mae_log = mean_absolute_error(y_test, y_test_pred_exp)
train_r2_log = r2_score(y_train, y_train_pred_exp)
test_r2_log = r2_score(y_test, y_test_pred_exp)

print(f"\nTrain RMSE: {train_rmse_log:.2f}")
print(f"Test RMSE: {test_rmse_log:.2f}")
print(f"Train MAE: {train_mae_log:.2f}")
print(f"Test MAE: {test_mae_log:.2f}")
print(f"Train R²: {train_r2_log:.4f}")
print(f"Test R²: {test_r2_log:.4f}")

print("\n" + "=" * 60)
print("POROVNÁNÍ MODELŮ")
print("=" * 60)
comparison = pd.DataFrame({
    'Model': ['OLS - Rent', 'OLS - log(Rent)'],
    'Train RMSE': [train_rmse, train_rmse_log],
    'Test RMSE': [test_rmse, test_rmse_log],
    'Train MAE': [train_mae, train_mae_log],
    'Test MAE': [test_mae, test_mae_log],
    'Train R²': [train_r2, train_r2_log],
    'Test R²': [test_r2, test_r2_log]
})
print(comparison.to_string(index=False))

In [None]:
# Diagnostické grafy pro OLS modely
fig, axes = plt.subplots(2, 3, figsize=(18, 10))

# Model 1: OLS - Rent
residuals = y_train - y_train_pred

# 1. Residuals vs Fitted
axes[0, 0].scatter(y_train_pred, residuals, alpha=0.5)
axes[0, 0].axhline(y=0, color='r', linestyle='--')
axes[0, 0].set_title('OLS-Rent: Residuals vs Fitted', fontweight='bold')
axes[0, 0].set_xlabel('Fitted values')
axes[0, 0].set_ylabel('Residuals')

# 2. QQ plot
stats.probplot(residuals, dist="norm", plot=axes[0, 1])
axes[0, 1].set_title('OLS-Rent: Q-Q Plot', fontweight='bold')

# 3. Scale-Location
standardized_residuals = residuals / residuals.std()
axes[0, 2].scatter(y_train_pred, np.sqrt(np.abs(standardized_residuals)), alpha=0.5)
axes[0, 2].set_title('OLS-Rent: Scale-Location', fontweight='bold')
axes[0, 2].set_xlabel('Fitted values')
axes[0, 2].set_ylabel('√|Standardized residuals|')

# Model 2: OLS - log(Rent)
residuals_log = y_train_log - y_train_log_pred

# 4. Residuals vs Fitted (log scale)
axes[1, 0].scatter(y_train_log_pred, residuals_log, alpha=0.5, color='orange')
axes[1, 0].axhline(y=0, color='r', linestyle='--')
axes[1, 0].set_title('OLS-log(Rent): Residuals vs Fitted', fontweight='bold')
axes[1, 0].set_xlabel('Fitted values (log scale)')
axes[1, 0].set_ylabel('Residuals')

# 5. QQ plot (log)
stats.probplot(residuals_log, dist="norm", plot=axes[1, 1])
axes[1, 1].set_title('OLS-log(Rent): Q-Q Plot', fontweight='bold')
axes[1, 1].get_lines()[0].set_color('orange')
axes[1, 1].get_lines()[1].set_color('r')

# 6. Scale-Location (log)
standardized_residuals_log = residuals_log / residuals_log.std()
axes[1, 2].scatter(y_train_log_pred, np.sqrt(np.abs(standardized_residuals_log)), alpha=0.5, color='orange')
axes[1, 2].set_title('OLS-log(Rent): Scale-Location', fontweight='bold')
axes[1, 2].set_xlabel('Fitted values (log scale)')
axes[1, 2].set_ylabel('√|Standardized residuals|')

plt.tight_layout()
plt.show()

print("\n" + "=" * 60)
print("INTERPRETACE DIAGNOSTIKY")
print("=" * 60)
print("""
Residuals vs Fitted: Kontrola linearity a homoskedasticity
- Ideálně by měly být residuals náhodně rozloženy kolem 0
- Žádný systematický vzor by neměl být patrný

Q-Q Plot: Kontrola normality residuals
- Body by měly ležet přibližně na diagonální čáře
- Odchylky naznačují heavy tails nebo skewness

Scale-Location: Kontrola homoskedasticity
- Horizontální linie naznačuje konstantní varianci
- Rostoucí trend naznačuje heteroskedasticitu

Log transformace by měla zlepšit normalitu a homoskedasticitu residuals.
""")

## 6. Regularizované modely - Ridge a Lasso s Cross-Validation

In [None]:
from sklearn.linear_model import Ridge, Lasso, RidgeCV, LassoCV
from sklearn.model_selection import cross_val_score

print("=" * 60)
print("REGULARIZOVANÉ MODELY - RIDGE A LASSO")
print("=" * 60)

# Ridge Regression s CV pro nalezení optimální alpha
print("\n" + "=" * 60)
print("RIDGE REGRESSION")
print("=" * 60)

alphas_ridge = np.logspace(-2, 4, 100)
ridge_cv = RidgeCV(alphas=alphas_ridge, cv=5, scoring='neg_mean_squared_error')

# Trénování na log(Rent) - obvykle dává lepší výsledky
ridge_cv.fit(X_train_scaled, y_train_log)

print(f"\nOptimální alpha: {ridge_cv.alpha_:.4f}")

# Predikce
y_train_ridge_pred = ridge_cv.predict(X_train_scaled)
y_test_ridge_pred = ridge_cv.predict(X_test_scaled)

# Transformace zpět na původní škálu
y_train_ridge_exp = np.exp(y_train_ridge_pred)
y_test_ridge_exp = np.exp(y_test_ridge_pred)

train_rmse_ridge = np.sqrt(mean_squared_error(y_train, y_train_ridge_exp))
test_rmse_ridge = np.sqrt(mean_squared_error(y_test, y_test_ridge_exp))
train_mae_ridge = mean_absolute_error(y_train, y_train_ridge_exp)
test_mae_ridge = mean_absolute_error(y_test, y_test_ridge_exp)
train_r2_ridge = r2_score(y_train, y_train_ridge_exp)
test_r2_ridge = r2_score(y_test, y_test_ridge_exp)

print(f"\nTrain RMSE: {train_rmse_ridge:.2f}")
print(f"Test RMSE: {test_rmse_ridge:.2f}")
print(f"Train MAE: {train_mae_ridge:.2f}")
print(f"Test MAE: {test_mae_ridge:.2f}")
print(f"Train R²: {train_r2_ridge:.4f}")
print(f"Test R²: {test_r2_ridge:.4f}")

# Lasso Regression s CV
print("\n" + "=" * 60)
print("LASSO REGRESSION")
print("=" * 60)

alphas_lasso = np.logspace(-4, 1, 100)
lasso_cv = LassoCV(alphas=alphas_lasso, cv=5, max_iter=10000, random_state=42)

lasso_cv.fit(X_train_scaled, y_train_log)

print(f"\nOptimální alpha: {lasso_cv.alpha_:.6f}")

# Predikce
y_train_lasso_pred = lasso_cv.predict(X_train_scaled)
y_test_lasso_pred = lasso_cv.predict(X_test_scaled)

# Transformace zpět na původní škálu
y_train_lasso_exp = np.exp(y_train_lasso_pred)
y_test_lasso_exp = np.exp(y_test_lasso_pred)

train_rmse_lasso = np.sqrt(mean_squared_error(y_train, y_train_lasso_exp))
test_rmse_lasso = np.sqrt(mean_squared_error(y_test, y_test_lasso_exp))
train_mae_lasso = mean_absolute_error(y_train, y_train_lasso_exp)
test_mae_lasso = mean_absolute_error(y_test, y_test_lasso_exp)
train_r2_lasso = r2_score(y_train, y_train_lasso_exp)
test_r2_lasso = r2_score(y_test, y_test_lasso_exp)

print(f"\nTrain RMSE: {train_rmse_lasso:.2f}")
print(f"Test RMSE: {test_rmse_lasso:.2f}")
print(f"Train MAE: {train_mae_lasso:.2f}")
print(f"Test MAE: {test_mae_lasso:.2f}")
print(f"Train R²: {train_r2_lasso:.4f}")
print(f"Test R²: {test_r2_lasso:.4f}")

# Analýza feature selection (Lasso)
print("\n" + "=" * 60)
print("LASSO - FEATURE SELECTION")
print("=" * 60)

lasso_coefs = pd.DataFrame({
    'Feature': feature_cols,
    'Coefficient': lasso_cv.coef_
})
lasso_coefs = lasso_coefs.sort_values('Coefficient', key=abs, ascending=False)

print(f"\nPočet nenulových koeficientů: {(lasso_cv.coef_ != 0).sum()} / {len(feature_cols)}")
print(f"\nTop 15 nejdůležitějších features podle Lasso:")
print(lasso_coefs.head(15).to_string(index=False))

print(f"\nFeatures s nulovým koeficientem (eliminované Lasso):")
zero_coefs = lasso_coefs[lasso_coefs['Coefficient'] == 0]
if len(zero_coefs) > 0:
    print(zero_coefs['Feature'].tolist())
else:
    print("Žádné features nebyly eliminovány")

## 7. Finální porovnání modelů a závěry

In [None]:
print("=" * 80)
print("FINÁLNÍ POROVNÁNÍ VŠECH MODELŮ")
print("=" * 80)

# Tabulka s výsledky všech modelů
results = pd.DataFrame({
    'Model': ['OLS - Rent', 'OLS - log(Rent)', 'Ridge - log(Rent)', 'Lasso - log(Rent)'],
    'Train RMSE': [train_rmse, train_rmse_log, train_rmse_ridge, train_rmse_lasso],
    'Test RMSE': [test_rmse, test_rmse_log, test_rmse_ridge, test_rmse_lasso],
    'Train MAE': [train_mae, train_mae_log, train_mae_ridge, train_mae_lasso],
    'Test MAE': [test_mae, test_mae_log, test_mae_ridge, test_mae_lasso],
    'Train R²': [train_r2, train_r2_log, train_r2_ridge, train_r2_lasso],
    'Test R²': [test_r2, test_r2_log, test_r2_ridge, test_r2_lasso]
})

print("\n" + results.to_string(index=False))

# Nalezení nejlepšího modelu podle Test RMSE
best_idx = results['Test RMSE'].idxmin()
best_model = results.loc[best_idx, 'Model']
print(f"\n{'=' * 80}")
print(f"NEJLEPŠÍ MODEL: {best_model}")
print(f"{'=' * 80}")
print(f"Test RMSE: {results.loc[best_idx, 'Test RMSE']:.2f}")
print(f"Test MAE: {results.loc[best_idx, 'Test MAE']:.2f}")
print(f"Test R²: {results.loc[best_idx, 'Test R²']:.4f}")

# Vizualizace porovnání modelů
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# 1. Porovnání RMSE
models = results['Model'].values
train_rmse_vals = results['Train RMSE'].values
test_rmse_vals = results['Test RMSE'].values

x = np.arange(len(models))
width = 0.35

axes[0, 0].bar(x - width/2, train_rmse_vals, width, label='Train', alpha=0.8)
axes[0, 0].bar(x + width/2, test_rmse_vals, width, label='Test', alpha=0.8)
axes[0, 0].set_xlabel('Model')
axes[0, 0].set_ylabel('RMSE')
axes[0, 0].set_title('Porovnání RMSE', fontweight='bold', fontsize=14)
axes[0, 0].set_xticks(x)
axes[0, 0].set_xticklabels(models, rotation=15, ha='right')
axes[0, 0].legend()
axes[0, 0].grid(axis='y', alpha=0.3)

# 2. Porovnání R²
train_r2_vals = results['Train R²'].values
test_r2_vals = results['Test R²'].values

axes[0, 1].bar(x - width/2, train_r2_vals, width, label='Train', alpha=0.8)
axes[0, 1].bar(x + width/2, test_r2_vals, width, label='Test', alpha=0.8)
axes[0, 1].set_xlabel('Model')
axes[0, 1].set_ylabel('R²')
axes[0, 1].set_title('Porovnání R²', fontweight='bold', fontsize=14)
axes[0, 1].set_xticks(x)
axes[0, 1].set_xticklabels(models, rotation=15, ha='right')
axes[0, 1].legend()
axes[0, 1].grid(axis='y', alpha=0.3)

# 3. Actual vs Predicted - nejlepší model
if best_model == 'OLS - Rent':
    y_test_best = y_test_pred
elif best_model == 'OLS - log(Rent)':
    y_test_best = y_test_pred_exp
elif best_model == 'Ridge - log(Rent)':
    y_test_best = y_test_ridge_exp
else:  # Lasso
    y_test_best = y_test_lasso_exp

axes[1, 0].scatter(y_test, y_test_best, alpha=0.5)
axes[1, 0].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
axes[1, 0].set_xlabel('Actual Rent')
axes[1, 0].set_ylabel('Predicted Rent')
axes[1, 0].set_title(f'Actual vs Predicted - {best_model}', fontweight='bold', fontsize=14)
axes[1, 0].grid(alpha=0.3)

# 4. Residuals distribution - nejlepší model
residuals_best = y_test.values - y_test_best
axes[1, 1].hist(residuals_best, bins=50, edgecolor='black', alpha=0.7)
axes[1, 1].axvline(x=0, color='r', linestyle='--', lw=2)
axes[1, 1].set_xlabel('Residuals')
axes[1, 1].set_ylabel('Frequency')
axes[1, 1].set_title(f'Residuals Distribution - {best_model}', fontweight='bold', fontsize=14)
axes[1, 1].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

## 8. Závěry a doporučení

### Shrnutí analýzy

V této analýze jsme provedli komplexní lineární regresi pro predikci cen nájemného domů s následujícími kroky:

1. **Explorační datová analýza (EDA)**
   - Dataset obsahuje ~4700 záznamů s 12 původními features
   - Identifikovali jsme outliers, missing values a distribuce proměnných
   - Rent má pravostranně skosenou distribuci → kandidát na log transformaci

2. **Data preprocessing**
   - Odstranění duplicit a extrémních outlierů (3*IQR)
   - Parsování Floor sloupce na Floor_Number a Total_Floors
   - Imputace missing values

3. **Feature engineering**
   - Vytvoření nových features: Rent_per_sqft, Size_per_BHK, Floor_Ratio
   - Binary features: Is_Ground_Floor, Is_Top_Floor
   - Encoding kategorických proměnných (Area Type, Furnishing, City, ...)
   - Celkem ~26 features

4. **Modelování**
   - **OLS - Rent**: Baseline model, vyšší RMSE kvůli skewed distribuci
   - **OLS - log(Rent)**: Lepší výsledky díky log transformaci
   - **Ridge Regression**: Regularizace pomáhá s multikolinearitou
   - **Lasso Regression**: Feature selection + regularizace

### Klíčová zjištění

- **Log transformace** target variable (Rent) výrazně zlepšuje:
  - Normalitu residuals (lepší QQ plot)
  - Homoskedasticitu (konstantní varianci residuals)
  - Prediktivní výkon (nižší RMSE/MAE)

- **Nejdůležitější features** (podle Lasso):
  - Size, BHK, Bathroom (základní charakteristiky)
  - City (velký vliv lokality)
  - Furnishing Status (vybavení)
  - Rent_per_sqft (engineered feature)

- **Regularizace** (Ridge/Lasso):
  - Pomáhá s overfittingem
  - Lasso provádí automatic feature selection
  - Minimální rozdíl v performance mezi Ridge a Lasso

### Doporučení

1. **Pro produkci**: Použít Ridge nebo Lasso model s log(Rent)
2. **Pro interpretaci**: OLS model poskytuje jednodušší koeficienty
3. **Další zlepšení**:
   - Přidat polynomial features (např. Size²)
   - Interaction terms (např. City × Furnishing)
   - Zkusit ensemble metody (Random Forest, Gradient Boosting)
   - Více features z Area Locality (distance to city center, neighborhood rating)

### Limity analýzy

- Dataset je omezen na několik měst v Indii
- Chybí časová dynamika (všechny záznamy jsou z roku 2022)
- Některé kategorie mají málo pozorování (rare cities)
- Rent_per_sqft je partially leaking target (vytvořen z Rent)