### Problem Statement

You are a data scientist / AI engineer working on a regression problem. You have been provided with the **Ames Housing dataset** to predict house sale prices.

Your task is to build and evaluate regression models. You will start with basic models and then move to a more advanced one, the **XGBoost Regressor**. Finally, you will explore several XGBoost hyperparameters to see how they affect performance.


**Import Necessary Libraries**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import warnings
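# Silence deprecation notices only, so genuine runtime warnings stay visible
warnings.filterwarnings('ignore', category=FutureWarning)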

# Set plot style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

## Ames Housing Dataset Analysis

### Task 1: Data Preparation and Exploration
1. Import the data from `house_prices.csv`.
2. Display the number of rows and columns.
3. Display the first few rows.
4. Check for missing values and data types.
5. Generate descriptive statistics.

In [None]:
# Step 1: Import the data
df = pd.read_csv('house_prices.csv')

# Step 2: Display rows and columns
print(f"Rows: {df.shape[0]}, Columns: {df.shape[1]}")

# Step 3: Display first few rows
display(df.head())

# Step 4: Check for missing values and data types
df.info()  # prints directly; wrapping it in print() would also print "None"
print(df.isnull().sum())

# Step 5: Generate descriptive statistics
display(df.describe())
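
On a wide table, a per-column summary of the gaps is easier to scan than the full `df.info()` output. The following is a minimal sketch under stated assumptions: the helper name `missing_summary` and the toy frame are hypothetical stand-ins, not part of `house_prices.csv`.

```python
import numpy as np
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return dtype, missing count, and missing percentage per column (hypothetical helper)."""
    summary = pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isna().sum(),
        "pct_missing": frame.isna().mean() * 100,
    })
    # Keep only columns that actually have gaps, worst first
    return summary[summary["n_missing"] > 0].sort_values("n_missing", ascending=False)

# Toy frame standing in for the real data (values are made up)
toy = pd.DataFrame({
    "LotArea": [8450, 9600, np.nan, 9550],
    "Alley": [None, None, "Pave", None],
    "SalePrice": [208500, 181500, 223500, 140000],
})
print(missing_summary(toy))
```

In the notebook, the same helper could be called as `missing_summary(df)` to decide which columns need imputation or removal before modelling.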

### Task 2: Exploratory Data Analysis (EDA)
1. Visualize the distribution of the target variable `SalePrice`.
2. Create a correlation heatmap of all numeric features, including the target.

In [None]:
# Step 1: Distribution of target
sns.histplot(df['SalePrice'], kde=True, color='blue')
plt.title('Distribution of SalePrice')
plt.show()

# Step 2: Correlation heatmap
plt.figure(figsize=(12, 8))
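# numeric_only=True skips text columns, which would otherwise raise an error in pandas 2.x
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm', fmt='.2f')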
plt.title('Correlation Heatmap')
plt.show()
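
When there are many numeric columns, an annotated heatmap gets crowded. Ranking features by the strength of their correlation with the target is a compact complement. This is a minimal sketch, assuming Pearson correlation is an acceptable first look; `top_correlations` and the toy columns are hypothetical.

```python
import pandas as pd

def top_correlations(frame: pd.DataFrame, target: str, n: int = 10) -> pd.Series:
    """Rank numeric features by absolute Pearson correlation with `target` (hypothetical helper)."""
    numeric = frame.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)
    # Sort by magnitude but report the signed value
    order = corr.abs().sort_values(ascending=False).index
    return corr.loc[order].head(n)

# Toy frame standing in for the real data (values are made up)
toy = pd.DataFrame({
    "area": [50, 60, 80, 100, 120],
    "age": [40, 35, 20, 10, 5],
    "noise": [3, 1, 4, 1, 5],
    "zone": ["A", "B", "A", "C", "B"],
    "SalePrice": [100, 125, 160, 195, 240],
})
print(top_correlations(toy, "SalePrice", n=3))
```

In the notebook this could be called as `top_correlations(df, 'SalePrice')`. Correlation only captures linear association, so a low-ranked feature can still matter to a tree-based model.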

### Task 3: Model Training Using Basic Models
1. Split the data into training and test sets (80/20).
2. Train a Linear Regression model.
3. Train a Decision Tree Regressor.
4. Evaluate both using R2 Score, Mean Squared Error (MSE), and Mean Absolute Error (MAE).
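
The three metrics are easy to compute from their definitions, which helps when interpreting them: MSE is in squared target units, MAE is in target units, and R2 compares the model's squared error with that of always predicting the mean. A minimal NumPy sketch follows; the helper name `regression_metrics` is hypothetical, and RMSE is included as an extra that the task list does not ask for.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, MAE, and R2 from their definitions (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = np.mean(residuals ** 2)            # average squared error
    mae = np.mean(np.abs(residuals))         # average absolute error
    ss_res = np.sum(residuals ** 2)          # error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of predicting the mean
    r2 = 1.0 - ss_res / ss_tot
    return {"mse": mse, "rmse": np.sqrt(mse), "mae": mae, "r2": r2}

# Made-up values, chosen only to keep the arithmetic easy to follow
metrics = regression_metrics([200, 150, 300, 250], [210, 140, 290, 270])
print(metrics)
```

Because MSE squares the residuals, a single large miss (the 20-unit error here) weighs more heavily in MSE than in MAE.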

In [None]:
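# Step 1: Split data
# Assumption: the CSV may still contain raw Ames text columns and gaps, so keep
# numeric features only and impute with training-set medians before fitting.
X = df.drop('SalePrice', axis=1).select_dtypes(include='number')
y = df['SalePrice']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
train_medians = X_train.median()
X_train, X_test = X_train.fillna(train_medians), X_test.fillna(train_medians)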

# Function to evaluate models
def evaluate_model(y_true, y_pred, model_name):
    mse = mean_squared_error(y_true, y_pred)
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    print(f"--- {model_name} Evaluation ---")
    print(f"MSE: {mse:.2f}")
    print(f"MAE: {mae:.2f}")
    print(f"R2 Score: {r2:.4f}\n")

# Step 2: Linear Regression
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)
evaluate_model(y_test, y_pred_lr, "Linear Regression")

# Step 3: Decision Tree Regressor
dt = DecisionTreeRegressor(random_state=42)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
evaluate_model(y_test, y_pred_dt, "Decision Tree Regressor")

### Task 4: Model Training Using XGBoost Regressor
1. Initialize and train an XGBoost Regressor model.
2. Evaluate the model using R2 Score, MSE, and MAE.
3. Display feature importances.

In [None]:
# Step 1: XGBoost Regressor
xgb = XGBRegressor(random_state=42)
xgb.fit(X_train, y_train)

# Step 2: Evaluation
y_pred_xgb = xgb.predict(X_test)
evaluate_model(y_test, y_pred_xgb, "XGBoost Regressor (Default)")

# Step 3: Feature Importances
importances = pd.DataFrame({'feature': X.columns, 'importance': xgb.feature_importances_})
importances = importances.sort_values('importance', ascending=False)
sns.barplot(x='importance', y='feature', data=importances, palette='viridis')
plt.title('Feature Importances')
plt.show()

### Task 5: Exploring Various Parameters in XGBoost Regressor
1. Train an XGBoost model with the following parameters:
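    - `n_estimators`: Number of boosting rounds, i.e. trees added to the ensemble (e.g., 200)
    - `learning_rate`: Shrinkage applied to each tree's contribution; smaller values reduce overfitting but usually need more boosting rounds (e.g., 0.05)
    - `max_depth`: Maximum depth of each tree; deeper trees capture more interactions but overfit more easily (e.g., 5)
    - `subsample`: Fraction of training rows sampled for each tree (e.g., 0.7)
    - `colsample_bytree`: Fraction of columns sampled when constructing each tree (e.g., 0.7)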
2. Evaluate the tuned model and compare its performance to the default model.
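
To build intuition for how `n_estimators`, `learning_rate`, and `subsample` interact, the loop below boosts one-split trees (stumps) on squared error with plain NumPy. It is a simplified sketch, not XGBoost's algorithm: it has no regularization, no second-order terms, and no column sampling, and tree depth is fixed at 1. The function names and the synthetic data are hypothetical.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the single threshold split on 1-D x that minimizes squared error."""
    best = None
    for threshold in np.unique(x)[:-1]:
        left = residual[x <= threshold]
        right = residual[x > threshold]
        sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    return best[1:]

def boost(x, y, n_estimators=200, learning_rate=0.05, subsample=0.7, seed=42):
    """Toy gradient boosting: each stump fits the current residuals on a row sample."""
    rng = np.random.default_rng(seed)
    pred = np.full_like(y, y.mean(), dtype=float)   # start from the mean prediction
    n_rows = max(2, int(subsample * len(x)))
    for _ in range(n_estimators):
        rows = rng.choice(len(x), size=n_rows, replace=False)   # row subsampling
        threshold, left_val, right_val = fit_stump(x[rows], (y - pred)[rows])
        # Shrink each stump's contribution by the learning rate
        pred += learning_rate * np.where(x <= threshold, left_val, right_val)
    return pred

# Synthetic 1-D regression problem (not the housing data)
rng = np.random.default_rng(0)
x = np.linspace(0, 6, 80)
y = np.sin(x) + rng.normal(scale=0.1, size=x.size)

def train_mse(**params):
    return float(np.mean((y - boost(x, y, **params)) ** 2))

baseline_mse = float(np.mean((y - y.mean()) ** 2))
few_rounds = train_mse(n_estimators=10, learning_rate=0.05)
many_rounds = train_mse(n_estimators=200, learning_rate=0.05)
print(f"mean-only baseline: {baseline_mse:.4f}")
print(f"10 rounds at 0.05:  {few_rounds:.4f}")
print(f"200 rounds at 0.05: {many_rounds:.4f}")
```

These are training errors only, so they show how quickly the ensemble fits, not how well it generalizes. With a small learning rate, ten rounds barely move away from the mean prediction, which is why a low `learning_rate` is normally paired with a larger `n_estimators`.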

In [None]:
# Step 1: Tuned XGBoost Regressor
xgb_tuned = XGBRegressor(
    n_estimators=200,
    learning_rate=0.05,
    max_depth=5,
    subsample=0.7,
    colsample_bytree=0.7,
    random_state=42
)
xgb_tuned.fit(X_train, y_train)

# Step 2: Evaluation
y_pred_tuned = xgb_tuned.predict(X_test)
evaluate_model(y_test, y_pred_tuned, "XGBoost Regressor (Tuned)")

# Comparison
print(f"Default R2 Score: {r2_score(y_test, y_pred_xgb):.4f}")
print(f"Tuned R2 Score: {r2_score(y_test, y_pred_tuned):.4f}")
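
Comparing R2 alone can hide differences in error magnitude, so it can help to collect every metric for every model in one table. The sketch below assumes predictions are gathered in a dictionary keyed by model name; `comparison_table` is a hypothetical helper, and the toy arrays stand in for `y_test` and the prediction arrays above.

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def comparison_table(y_true, predictions: dict) -> pd.DataFrame:
    """Build one row of metrics per model, best R2 first (hypothetical helper)."""
    rows = {
        name: {
            "MSE": mean_squared_error(y_true, y_pred),
            "MAE": mean_absolute_error(y_true, y_pred),
            "R2": r2_score(y_true, y_pred),
        }
        for name, y_pred in predictions.items()
    }
    return pd.DataFrame.from_dict(rows, orient="index").sort_values("R2", ascending=False)

# Made-up values standing in for y_test and each model's predictions
y_toy = [200, 150, 300, 250]
toy_predictions = {
    "model_a": [210, 140, 290, 270],
    "model_b": [205, 150, 295, 255],
}
table = comparison_table(y_toy, toy_predictions)
print(table.round(3))
```

In the notebook, the dictionary could map names such as `"Linear Regression"` and `"XGBoost (Tuned)"` to `y_pred_lr` and `y_pred_tuned`, with `y_test` as the first argument.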