# **Training Notebook**

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestRegressor
from scipy.stats import randint
import pickle  # For saving the model and scaler

# **1. Getting the Data Ready:**
- **Loading the Data:** The code starts by loading a dataset of house sales and their prices from a CSV file into a pandas DataFrame. Think of it as opening up a spreadsheet full of house details.
- **Cleaning and Organizing:** It then cleans up the data and derives a few new columns.
  - It converts the 'date' column from text into a proper datetime type.
  - It extracts the year and month of the sale into separate 'sale_year' and 'sale_month' columns.
  - It removes the 'id' and 'date' columns: 'id' is only a record identifier, and the useful parts of 'date' have already been extracted.
  - It calculates the age of each house at the time of sale as the sale year minus the year it was built.
  - It creates a 'renovated' column that is 1 if a renovation year is recorded ('yr_renovated' greater than 0) and 0 otherwise.

In [2]:
# Load the dataset
df = pd.read_csv("/content/kc_house_data.csv")

# Data Preprocessing: parse the sale date and split it into year and month
df['date'] = pd.to_datetime(df['date'])
df['sale_year'] = df['date'].dt.year
df['sale_month'] = df['date'].dt.month
df = df.drop(['id', 'date'], axis=1)  # Drop 'id' and 'date'

df['age'] = df['sale_year'] - df['yr_built']
df['renovated'] = (df['yr_renovated'] > 0).astype(int)  # 1 if a renovation year is recorded, else 0
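
To make the derived columns concrete, the sketch below applies the same steps to a tiny hand-made frame. The two rows and their ISO-style date strings are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical rows, invented for illustration only
toy = pd.DataFrame({
    'date': ['2014-10-13', '2015-02-25'],
    'yr_built': [1955, 2001],
    'yr_renovated': [1991, 0],
})

# Same derivations as in the preprocessing cell
toy['date'] = pd.to_datetime(toy['date'])
toy['sale_year'] = toy['date'].dt.year
toy['sale_month'] = toy['date'].dt.month
toy['age'] = toy['sale_year'] - toy['yr_built']            # age at time of sale
toy['renovated'] = (toy['yr_renovated'] > 0).astype(int)   # 1 = renovated

print(toy[['sale_year', 'sale_month', 'age', 'renovated']])
```

The first house is 59 years old at sale and flagged as renovated; the second is 14 years old and not renovated.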

# **2. Selecting the Important Features:**
- **Choosing the Right Columns:** The code selects the 22 columns (features) used to predict house prices and stores them in X; the target, 'price', is stored in y. The features cover the number of bedrooms and bathrooms, square footage, location (zip code, latitude, longitude), age, and whether the house is on the waterfront, along with the columns derived in the previous step.

In [3]:
# Feature Selection
features = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'waterfront',
            'view', 'condition', 'grade', 'sqft_above', 'sqft_basement', 'yr_built',
            'yr_renovated', 'zipcode', 'lat', 'long', 'sqft_living15', 'sqft_lot15',
            'age', 'renovated', 'sale_year', 'sale_month']

X = df[features]
y = df['price']

# **3. Splitting the Data:**
- **Creating Training and Testing Sets:** The code divides the data into two parts: it holds out 20% of the rows as a testing set (test_size=0.2) and keeps the remaining 80% as a training set. The training set is used to teach the model how to predict prices, and the testing set is used to evaluate how well the model has learned. Think of it as using one set of practice problems to learn and a separate set to take the final exam. Fixing random_state=42 makes the split reproducible from run to run.

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
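
As a small sanity check of what the split does, the sketch below runs the same call on ten made-up rows. The arrays are hypothetical stand-ins for X and y.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for X and y: 10 rows, 2 features
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42)

# Repeating the call with the same random_state gives the same split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42)

print("train rows:", len(X_tr), "test rows:", len(X_te))
print("same test rows on rerun:", np.array_equal(X_te, X_te2))
```

With 10 rows and test_size=0.2, 8 rows go to training and 2 to testing.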

# **4. Scaling the Data:**
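- **Normalizing the Values:** The code uses a "scaler" (MinMaxScaler) to rescale each feature to the range 0 to 1, based on the minimum and maximum seen in the training set. Some features have very large values (like square footage), while others have very small values (like number of bedrooms), and many models learn more effectively when all features are on a similar scale. Tree-based models such as Random Forest split on thresholds and are largely unaffected by this kind of rescaling, so the step is not strictly required here, but it is harmless and keeps the pipeline reusable with other models. Note that the scaler is fitted on the training set only and then applied to the testing set, so no information from the test data leaks into training.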

In [5]:
# Scaling
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
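
The sketch below shows what MinMaxScaler computes, using a single made-up square-footage column. The numbers are hypothetical; the point is that the minimum and maximum come from the training data, so a test value outside the training range lands outside 0 to 1.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical square-footage values
train = np.array([[1000.0], [2000.0], [3000.0]])
test = np.array([[1500.0], [4000.0]])

toy_scaler = MinMaxScaler()
train_scaled = toy_scaler.fit_transform(train)  # learns min=1000, max=3000
test_scaled = toy_scaler.transform(test)        # reuses the training min/max

# The same transformation written out by hand: (x - min) / (max - min)
manual = (test - train.min(axis=0)) / (train.max(axis=0) - train.min(axis=0))

print(test_scaled.ravel())  # 1500 maps to 0.25, 4000 maps to 1.5
print(np.allclose(test_scaled, manual))
```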

# **5. Hyperparameter Tuning:**
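- **Finding the Best Settings:** The code uses a technique called **"Randomized Search"** to find good settings (hyperparameters) for the Random Forest model. Instead of trying every possible combination, it randomly samples 20 combinations (n_iter=20) of the number of trees, maximum depth, and minimum samples per split and per leaf. Each combination is scored with 3-fold cross-validation (cv=3) on the training set using negative mean squared error, and the combination with the best average score is refit on the full training set. This is like trying different combinations of ingredients to find the perfect recipe.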

In [6]:
# Hyperparameter Tuning (RandomizedSearchCV)
param_distributions = {
    'n_estimators': randint(50, 200),
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': randint(2, 11),
    'min_samples_leaf': randint(1, 5)
}

random_search = RandomizedSearchCV(RandomForestRegressor(random_state=42),
                                    param_distributions,
                                    n_iter=20,
                                    cv=3,
                                    scoring='neg_mean_squared_error',
                                    n_jobs=-1,
                                    random_state=42)

random_search.fit(X_train_scaled, y_train)
best_rf_model = random_search.best_estimator_
best_params = random_search.best_params_
print("\nBest Parameters:", best_params)



Best Parameters: {'max_depth': 20, 'min_samples_leaf': 2, 'min_samples_split': 4, 'n_estimators': 124}
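
To see what kind of candidates the search evaluates, the sketch below draws a few combinations from the same distributions with scikit-learn's ParameterSampler. The seed and the number of draws are arbitrary choices for the example, so the printed combinations will not match the 20 tried above. Note that scipy's randint(low, high) excludes the upper bound, so 'n_estimators' ranges from 50 to 199.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

param_distributions = {
    'n_estimators': randint(50, 200),     # integers 50..199
    'max_depth': [None, 10, 20, 30],      # picked uniformly from the list
    'min_samples_split': randint(2, 11),  # integers 2..10
    'min_samples_leaf': randint(1, 5)     # integers 1..4
}

# Draw 5 candidate combinations (seed chosen arbitrarily for this sketch)
samples = list(ParameterSampler(param_distributions, n_iter=5, random_state=0))
for candidate in samples:
    print(candidate)
```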


# **6. Model Evaluation:**
- **Testing the Model:** The code uses the testing set to evaluate how well the model can predict prices for houses it has not seen. It calculates the Mean Squared Error (MSE), its square root (RMSE), and R-squared. The RMSE is in the same units as the price, so it can be read as a typical prediction error in dollars (about 147,759 here). The R-squared tells you how much of the variance in the house prices is explained by the model (about 0.86 here).

In [7]:
# Model Evaluation (on the test set)
from sklearn.metrics import mean_squared_error, r2_score

y_rf_pred = best_rf_model.predict(X_test_scaled)
rf_mse = mean_squared_error(y_test, y_rf_pred)
rf_rmse = np.sqrt(rf_mse)
rf_r2 = r2_score(y_test, y_rf_pred)

print("\nRandom Forest Regressor Model Evaluation (Tuned):")
print("Mean Squared Error:", rf_mse)
print("Root Mean Squared Error:", rf_rmse)
print("R-squared:", rf_r2)


Random Forest Regressor Model Evaluation (Tuned):
Mean Squared Error: 21832650023.193043
Root Mean Squared Error: 147758.75616420517
R-squared: 0.8555819233183328
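
For reference, the sketch below computes the same three metrics by hand on four made-up prices and checks them against scikit-learn. The values are hypothetical and unrelated to the results above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true and predicted prices
y_true = np.array([300000.0, 450000.0, 520000.0, 700000.0])
y_pred = np.array([320000.0, 430000.0, 500000.0, 760000.0])

# MSE: average of the squared errors
mse = np.mean((y_true - y_pred) ** 2)
# RMSE: square root of MSE, in the same units as the price
rmse = np.sqrt(mse)
# R-squared: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print("MSE:", mse, "RMSE:", rmse, "R-squared:", r2)
print(np.isclose(mse, mean_squared_error(y_true, y_pred)),
      np.isclose(r2, r2_score(y_true, y_pred)))
```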


# **7. Feature Importance:**
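- **Figuring Out What Matters Most:** The code reads the model's feature importances, which measure how much each feature contributed to reducing prediction error across the trees in the forest. The values sum to 1, so each can be read as a share of the total. This can give you insights into what aspects of a house have the biggest impact on its price: here 'grade', 'sqft_living', and 'lat' together account for about 75% of the total importance.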

In [8]:
# Feature Importance
feature_importances = best_rf_model.feature_importances_
feature_importance_df = pd.DataFrame({'Feature': features, 'Importance': feature_importances})
feature_importance_df = feature_importance_df.sort_values('Importance', ascending=False)
print("\nFeature Importances (Random Forest):")
print(feature_importance_df)


Feature Importances (Random Forest):
          Feature  Importance
8           grade    0.318736
2     sqft_living    0.277698
14            lat    0.152921
15           long    0.061868
5      waterfront    0.030999
16  sqft_living15    0.030321
18            age    0.022729
9      sqft_above    0.017443
13        zipcode    0.013705
3        sqft_lot    0.012033
11       yr_built    0.011572
17     sqft_lot15    0.010626
1       bathrooms    0.010472
6            view    0.010028
10  sqft_basement    0.005253
21     sale_month    0.004940
7       condition    0.002354
0        bedrooms    0.002233
4          floors    0.001422
12   yr_renovated    0.001185
20      sale_year    0.001118
19      renovated    0.000345
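
The sketch below illustrates how these importances behave on synthetic data where only the first of three features drives the target. The data, the feature names, and the model settings are made up for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic data: only the first column influences the target
X_toy = rng.random((200, 3))
y_toy = 10 * X_toy[:, 0] + rng.normal(scale=0.1, size=200)

rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_toy, y_toy)
imp = rf.feature_importances_

# Hypothetical feature names for display
for name, value in zip(['signal', 'noise_a', 'noise_b'], imp):
    print(name, round(value, 4))
print("sum of importances:", imp.sum())
```

The informative column receives nearly all of the importance. On real data, correlated features (for example 'sqft_living' and 'sqft_above') may share credit, so a low score does not necessarily mean a feature is irrelevant.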


# **8. Saving the Model and Scaler:**
- **Storing the Results:** The code saves the trained model and the scaler to files using pickle. This allows you to load them later and predict prices for new houses without having to retrain the model from scratch. The scaler is saved alongside the model because new inputs must be transformed in exactly the same way as the training data before prediction. Think of it as saving your perfect recipe so you can use it again and again.

In [9]:
# Save the Model and Scaler
with open('model.pkl', 'wb') as f:
    pickle.dump(best_rf_model, f)

with open('scaler.pkl', 'wb') as f:
    pickle.dump(scaler, f)

print("\nModel and Scaler saved to model.pkl and scaler.pkl")


Model and Scaler saved to model.pkl and scaler.pkl
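
A minimal sketch of the reverse step, loading both files and predicting for new rows, is shown below. To stay self-contained it trains a small stand-in model on synthetic data and writes to a temporary directory; the data, the three-feature layout, and the temporary paths are assumptions for the example. With the real files, the new rows would need the same 22 features in the same order as during training.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Stand-in training data (synthetic, for illustration only)
X_toy = rng.random((50, 3)) * 1000
y_toy = X_toy[:, 0] * 200 + 50000

toy_scaler = MinMaxScaler().fit(X_toy)
toy_model = RandomForestRegressor(n_estimators=10, random_state=0)
toy_model.fit(toy_scaler.transform(X_toy), y_toy)

# Save both objects, as in the cell above (temporary directory here)
tmp_dir = tempfile.mkdtemp()
model_path = os.path.join(tmp_dir, 'model.pkl')
scaler_path = os.path.join(tmp_dir, 'scaler.pkl')
with open(model_path, 'wb') as f:
    pickle.dump(toy_model, f)
with open(scaler_path, 'wb') as f:
    pickle.dump(toy_scaler, f)

# Later: load both and predict for new, unscaled rows
with open(model_path, 'rb') as f:
    loaded_model = pickle.load(f)
with open(scaler_path, 'rb') as f:
    loaded_scaler = pickle.load(f)

new_rows = rng.random((2, 3)) * 1000
loaded_pred = loaded_model.predict(loaded_scaler.transform(new_rows))
print(loaded_pred)
```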


In [10]:
import pickle
import gzip

# Save the Model with Gzip compression
with gzip.open('model.pkl.gz', 'wb') as f:
    pickle.dump(best_rf_model, f)

# Save the Scaler with Gzip compression
with gzip.open('scaler.pkl.gz', 'wb') as f:
    pickle.dump(scaler, f)

print("\nModel and Scaler saved to model.pkl.gz and scaler.pkl.gz")



Model and Scaler saved to model.pkl.gz and scaler.pkl.gz
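
Reading a gzip-compressed pickle back works the same way, with gzip.open in 'rb' mode in place of open. The sketch below round-trips a small stand-in scaler through a temporary file; the scaler and the path are made up for the example.

```python
import gzip
import os
import pickle
import tempfile

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Stand-in scaler fitted on a hypothetical column with range 0..10
toy_scaler = MinMaxScaler().fit(np.array([[0.0], [10.0]]))

path = os.path.join(tempfile.mkdtemp(), 'scaler.pkl.gz')
with gzip.open(path, 'wb') as f:
    pickle.dump(toy_scaler, f)

# Load the compressed file back
with gzip.open(path, 'rb') as f:
    loaded = pickle.load(f)

print(loaded.transform(np.array([[5.0]])))  # 5 is halfway between 0 and 10
```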
