Absolutely, let's dive into a more complex project. How about building a machine learning model to predict house prices using the famous "California Housing Prices" dataset? This project will involve the following steps:

    Understanding the problem
    Loading and exploring the data
    Preprocessing the data
    Building and training a machine learning model
    Evaluating the model
    Fine-tuning the model for better performance
    Making predictions with the model

Understanding the Problem

The goal is to predict the median house value for each district in California from features describing its location, housing stock, population, and income.

Dataset Features

    Longitude
    Latitude
    Housing Median Age
    Total Rooms
    Total Bedrooms
    Population
    Households
    Median Income
    Ocean Proximity (categorical)
    Median House Value (target variable)

Step 1: Setting Up Your Environment

Make sure you have the following libraries installed:

pip install pandas numpy scikit-learn matplotlib seaborn

Step 2: Loading and Exploring the Data

First, let’s load the dataset and explore it.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split

# Load the dataset
url = "https://raw.githubusercontent.com/ageron/handson-ml/master/datasets/housing/housing.csv"
df = pd.read_csv(url)

# Display the first few rows of the DataFrame
print(df.head())

# Display basic statistics
print(df.describe())

# Check for missing values
print(df.isnull().sum())
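By default, describe() summarizes only the numeric columns, so the text column ocean_proximity is easy to overlook at this stage. Here is a minimal sketch of a few extra checks, run on a small hand-made stand-in DataFrame (the name sample and its values are made up for illustration, not taken from the real file):

```python
import numpy as np
import pandas as pd

# Small stand-in for the housing DataFrame (values are illustrative only)
sample = pd.DataFrame({
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
    "median_income": [8.3252, 8.3014, 7.2574, 5.6431],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND", "ISLAND"],
})

# Count missing values per column
missing_counts = sample.isnull().sum()
print(missing_counts)

# Count how often each category appears
category_counts = sample["ocean_proximity"].value_counts()
print(category_counts)

# List the non-numeric columns, which will need encoding before modeling
text_columns = sample.select_dtypes(exclude="number").columns.tolist()
print(text_columns)
```

The same three calls should work on df once it is loaded.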

Step 3: Visualizing the Data

Plotting the data helps reveal relationships between the features and the target, as well as anomalies such as capped values.

# Scatter plot of median income vs. median house value
plt.figure(figsize=(10, 6))
plt.scatter(df['median_income'], df['median_house_value'], alpha=0.1)
plt.xlabel('Median Income')
plt.ylabel('Median House Value')
plt.title('Median Income vs Median House Value')
plt.show()

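# Correlation matrix (numeric columns only; 'ocean_proximity' is text and would raise an error in recent pandas)
corr_matrix = df.corr(numeric_only=True)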
plt.figure(figsize=(10, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Matrix')
plt.show()
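A heatmap can be hard to read when the main question is which features track the target. As a rough sketch, the correlations with median_house_value can also be sorted into a ranked list. The DataFrame below is a made-up stand-in, and the numeric_only argument is assumed to be available (it requires a reasonably recent pandas):

```python
import pandas as pd

# Illustrative stand-in data; the real df has many more rows and columns
sample = pd.DataFrame({
    "median_income": [1.0, 2.0, 3.0, 4.0, 5.0],
    "population": [900.0, 300.0, 1200.0, 500.0, 700.0],
    "median_house_value": [100000.0, 150000.0, 220000.0, 260000.0, 330000.0],
    "ocean_proximity": ["INLAND", "INLAND", "NEAR BAY", "NEAR BAY", "NEAR OCEAN"],
})

# Correlation of every numeric column with the target, strongest first
target_corr = (
    sample.corr(numeric_only=True)["median_house_value"]
    .sort_values(ascending=False)
)
print(target_corr)
```

The target always correlates perfectly with itself, so the interesting entries start at the second row.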

Step 4: Preprocessing the Data

We'll handle missing values, encode the categorical variable, and standardize the numeric features. The imputer and scaler live inside the pipeline so that they are fitted on the training data only, which avoids leaking information from the test set.

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Separating features and target variable
X = df.drop('median_house_value', axis=1)
y = df['median_house_value']

# Encode categorical feature ("ocean_proximity"); ignore categories not seen during fitting
cat_features = ["ocean_proximity"]
cat_transformer = OneHotEncoder(handle_unknown='ignore')

# Impute missing values (e.g. in 'total_bedrooms') with the median, then standardize
num_features = X.drop("ocean_proximity", axis=1).columns
num_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Combining transformers
preprocessor = ColumnTransformer(
    transformers=[
        ('num', num_transformer, num_features),
        ('cat', cat_transformer, cat_features)
    ])

# Splitting data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
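To see what a preprocessor of this kind produces, it can help to run one on a handful of rows. The sketch below builds a similar ColumnTransformer on a toy DataFrame (toy, toy_preprocessor, and the values are hypothetical). In this sketch the median imputer sits inside the numeric pipeline, which is one way to keep imputation tied to the data the pipeline is fitted on:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with one missing value and three categories (illustrative only)
toy = pd.DataFrame({
    "total_bedrooms": [100.0, np.nan, 300.0, 200.0],
    "median_income": [2.0, 4.0, 6.0, 8.0],
    "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "NEAR OCEAN"],
})

toy_preprocessor = ColumnTransformer(transformers=[
    ("num", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]), ["total_bedrooms", "median_income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["ocean_proximity"]),
])

transformed = toy_preprocessor.fit_transform(toy)
# The result may be a sparse matrix; convert it to a dense array for inspection
if hasattr(transformed, "toarray"):
    transformed = transformed.toarray()

# 2 scaled numeric columns followed by one column per category
print(transformed.shape)
print(transformed.round(2))
```

Each row ends with a one-hot block containing a single 1, and each scaled numeric column has a mean of approximately zero.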

Step 5: Building and Training the Model

We'll use a Random Forest Regressor, an ensemble of decision trees that can capture non-linear relationships between the features and the target.

from sklearn.ensemble import RandomForestRegressor

# Create a pipeline for the data processing and model
model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', RandomForestRegressor(n_estimators=100, random_state=42))
])

# Train the model
model_pipeline.fit(X_train, y_train)
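Once a random forest is trained, its feature_importances_ attribute gives a rough indication of which inputs the trees relied on. The sketch below illustrates this on synthetic data, where the column names signal and noise are invented so that the expected ranking is obvious:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)

# Synthetic data: the target depends strongly on "signal" and not at all on "noise"
X_toy = pd.DataFrame({
    "signal": rng.normal(size=200),
    "noise": rng.normal(size=200),
})
y_toy = 3.0 * X_toy["signal"] + rng.normal(scale=0.1, size=200)

forest = RandomForestRegressor(n_estimators=50, random_state=42)
forest.fit(X_toy, y_toy)

# Importances sum to 1; larger values mean the feature was used more for splitting
importances = pd.Series(
    forest.feature_importances_, index=X_toy.columns
).sort_values(ascending=False)
print(importances)
```

For the pipeline in this tutorial, the fitted forest should be reachable as model_pipeline.named_steps['regressor']. Matching importances to column names would then rely on the preprocessor's get_feature_names_out(), assuming a scikit-learn version that supports it for every step.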

Step 6: Evaluating the Model

We'll evaluate the model's performance on the test set.

from sklearn.metrics import mean_squared_error, mean_absolute_error

# Make predictions
y_pred = model_pipeline.predict(X_test)

# Calculate performance metrics
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mse)

print(f"Mean Absolute Error: {mae}")
print(f"Mean Squared Error: {mse}")
print(f"Root Mean Squared Error: {rmse}")
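To make the three metrics concrete, here is a small worked example on four made-up predictions, computing each metric by hand with NumPy and checking the result against scikit-learn:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Four illustrative true values and predictions (in dollars)
y_true = np.array([200000.0, 150000.0, 300000.0, 250000.0])
y_hat = np.array([210000.0, 140000.0, 280000.0, 250000.0])

errors = y_true - y_hat  # [-10000, 10000, 20000, 0]

mae_manual = np.mean(np.abs(errors))   # average absolute error: 10000.0
mse_manual = np.mean(errors ** 2)      # average squared error: 150000000.0
rmse_manual = np.sqrt(mse_manual)      # back in dollars, roughly 12247

print(mae_manual, mse_manual, rmse_manual)

# The library functions should agree with the manual calculation
print(np.isclose(mae_manual, mean_absolute_error(y_true, y_hat)))
print(np.isclose(mse_manual, mean_squared_error(y_true, y_hat)))
```

RMSE is larger than MAE here because squaring gives the single 20,000 error more weight. MAE and RMSE are both in the target's units (dollars), which makes them easier to interpret than MSE.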

Step 7: Fine-Tuning the Model

We'll use GridSearchCV to fine-tune the hyperparameters of the RandomForestRegressor.

from sklearn.model_selection import GridSearchCV

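# Define parameter grid
# ('auto' is no longer accepted by recent scikit-learn; 1.0 uses all features, which is what 'auto' meant for regressors)
param_grid = {
    'regressor__n_estimators': [50, 100, 200],
    'regressor__max_features': [1.0, 'sqrt', 'log2']
}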

# Create the grid search
grid_search = GridSearchCV(model_pipeline, param_grid, cv=3, scoring='neg_mean_squared_error', verbose=2, n_jobs=-1)

# Run the grid search
grid_search.fit(X_train, y_train)

# Best parameters
print("Best parameters:", grid_search.best_params_)

# Best estimator
best_model = grid_search.best_estimator_

# Evaluate the best model
y_best_pred = best_model.predict(X_test)
best_rmse = np.sqrt(mean_squared_error(y_test, y_best_pred))

print(f"Best Root Mean Squared Error: {best_rmse}")
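Because the scoring is neg_mean_squared_error, the scores GridSearchCV reports are negative MSE values, which are awkward to read. One possible way to compare all the candidates, sketched here on synthetic data with a deliberately tiny grid (toy_search and the data are hypothetical), is to load cv_results_ into a DataFrame and convert the scores to RMSE:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)

# Synthetic regression data standing in for the housing features
X_toy = rng.normal(size=(120, 4))
y_toy = 2.0 * X_toy[:, 0] + X_toy[:, 1] + rng.normal(scale=0.1, size=120)

toy_search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid={"n_estimators": [10, 30], "max_features": [1.0, "sqrt"]},
    cv=3,
    scoring="neg_mean_squared_error",
)
toy_search.fit(X_toy, y_toy)

# One row per parameter combination; flip the sign and take the root
results = pd.DataFrame(toy_search.cv_results_)
results["mean_rmse"] = np.sqrt(-results["mean_test_score"])
summary = results[
    ["param_n_estimators", "param_max_features", "mean_rmse"]
].sort_values("mean_rmse")
print(summary)

# RMSE derived from the best mean cross-validated MSE
best_cv_rmse = np.sqrt(-toy_search.best_score_)
print(best_cv_rmse)
```

The mean_rmse column is the square root of the mean MSE across folds, which is not quite the same as the mean of the per-fold RMSE values. With the tutorial's pipeline, the parameter columns would carry the regressor__ prefix, for example param_regressor__n_estimators.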

Step 8: Making Predictions

Finally, you can use your best model to make predictions on new data.

# Example new data
new_data = pd.DataFrame({
    'longitude': [-122.23],
    'latitude': [37.88],
    'housing_median_age': [41.0],
    'total_rooms': [880.0],
    'total_bedrooms': [129.0],
    'population': [322.0],
    'households': [126.0],
    'median_income': [8.3252],
    'ocean_proximity': ['NEAR BAY']
})

# Make prediction
predicted_value = best_model.predict(new_data)
print(f"Predicted Median House Value: {predicted_value[0]}")

Conclusion

Congratulations! You've built a more complex machine learning model to predict house prices using the California Housing Prices dataset. You've learned how to preprocess data, build and train a model, evaluate and fine-tune it, and make predictions.

