<a href="https://colab.research.google.com/github/kaziwahidaltaher-droid/.github/blob/main/notebooks/Getting_started_with_google_colab_ai.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Colab is making it easier than ever to integrate powerful Generative AI capabilities into your projects. We are launching a public preview of a simple and intuitive Python library (google.colab.ai) that gives access to state-of-the-art language models directly within Colab environments for Pro and Pro+ subscribers. This means subscribers can spend less time on configuration and setup and more time bringing their ideas to life. With just a few lines of code, you can now perform a variety of tasks:
- Generate text
- Translate languages
- Write creative content
- Categorize text

Happy Coding!


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/googlecolab/colabtools/blob/main/notebooks/Getting_started_with_google_colab_ai.ipynb)

In [114]:
# @title List available models
from google.colab import ai

ai.list_models()

['google/gemini-2.5-flash', 'google/gemini-2.5-flash-lite']

## Choosing a Model
The model names give you a hint about their capabilities and intended use:

Pro: These are the most capable models, ideal for complex reasoning, creative tasks, and detailed analysis.

Flash: These models are optimized for high speed and efficiency, making them great for summarization, chat applications, and tasks requiring rapid responses.

Gemma: These are lightweight, open-weight models suitable for a variety of text generation tasks and are great for experimentation.
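
As a minimal sketch of using these naming hints in code, the snippet below picks a model name from a list by family suffix. `pick_model` is a hypothetical helper written for this illustration, not part of `google.colab.ai`, and the list is copied from the `ai.list_models()` output shown earlier rather than fetched live.

```python
def pick_model(available, preferred_families):
    """Return the first available model whose name ends with a preferred family.

    Hypothetical helper for illustration; not part of google.colab.ai.
    """
    for family in preferred_families:
        for name in available:
            # Match on the suffix so "flash" does not also match "flash-lite".
            if name.endswith(family):
                return name
    return None


# Names copied from the ai.list_models() output shown above.
available = ['google/gemini-2.5-flash', 'google/gemini-2.5-flash-lite']

# Prefer a Pro model, then fall back to Flash, then Flash-Lite.
chosen = pick_model(available, ["pro", "flash", "flash-lite"])
print(chosen)
```

In a live session the `available` list would come from `ai.list_models()`, and the chosen name could be passed as `model_name` to `ai.generate_text`.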

In [115]:
# @title Simple batch generation example
# Only text-to-text input/output is supported
from google.colab import ai

response = ai.generate_text("What is the capital of France?")
print(response)

APIStatusError: Error code: 402 - {'message': 'Colab Models is only available to Colab Pro and Pro+ subscribers.', 'type': 'invalid_request_error'}

## Visualize results (optional)

### Subtask:
Visualize the predictions of the tuned model.

**Reasoning**:
Create a scatter plot of the actual vs. predicted values from the tuned Ridge model, add labels, title, a diagonal line for perfect predictions, and a grid.

In [None]:
import matplotlib.pyplot as plt

# Create a scatter plot of actual vs. predicted values for the tuned Ridge model
plt.figure(figsize=(8, 6))
plt.scatter(y_test_engineered, y_pred_tuned_ridge, alpha=0.5)

# Add labels and title
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Actual vs. Predicted Values (Tuned Ridge Regression)')

# Add a diagonal line for perfect predictions
# Determine the range for the diagonal line based on both actual and predicted values
plot_range_tuned = [min(y_test_engineered.min(), y_pred_tuned_ridge.min()), max(y_test_engineered.max(), y_pred_tuned_ridge.max())]
plt.plot(plot_range_tuned, plot_range_tuned, color='red', linestyle='--')

# Add a grid
plt.grid(True)

# Show the plot
plt.show()

## Compare with previous models

### Subtask:
Compare the performance of the tuned model to the previously trained models.

**Reasoning**:
Print the performance metrics for all models and compare them to summarize the impact of feature engineering and hyperparameter tuning.

In [116]:
from sklearn.metrics import mean_squared_error, r2_score
import pandas as pd

# Store performance metrics in a dictionary
performance_metrics = {
    "Initial Linear Regression (Original Features)": {"MSE": mse, "R2": r2},
    "Ridge Regression (Original Features, Default)": {"MSE": mse_ridge, "R2": r2_ridge},
    "Linear Regression (Engineered Features)": {"MSE": mse_engineered, "R2": r2_engineered},
    "Tuned Ridge Regression (Engineered Features, Tuned)": {"MSE": mse_tuned_ridge, "R2": r2_tuned_ridge}
}

# Print the performance metrics in a formatted way
print("--- Model Performance Comparison ---")
for model_name, metrics in performance_metrics.items():
    print(f"\n{model_name}:")
    print(f"  Mean Squared Error (MSE): {metrics['MSE']:.2f}")
    print(f"  R-squared (R2) Score: {metrics['R2']:.4f}")

# Summarize the findings
print("\n--- Performance Comparison Summary ---")

# Determine the best model based on MSE and R2
best_mse_model = min(performance_metrics, key=lambda k: performance_metrics[k]['MSE'])
best_r2_model = max(performance_metrics, key=lambda k: performance_metrics[k]['R2'])

if best_mse_model == best_r2_model:
    print(f"The {best_mse_model} performed the best based on both MSE (lower is better) and R2 (higher is better).")
else:
    print(f"The {best_mse_model} performed the best based on MSE (lower is better).")
    print(f"The {best_r2_model} performed the best based on R2 (higher is better).")

print("\nImpact of Feature Engineering and Hyperparameter Tuning:")

# Short aliases keep the comparisons below readable
lr_original = performance_metrics["Initial Linear Regression (Original Features)"]
ridge_default = performance_metrics["Ridge Regression (Original Features, Default)"]
lr_engineered = performance_metrics["Linear Regression (Engineered Features)"]
ridge_tuned = performance_metrics["Tuned Ridge Regression (Engineered Features, Tuned)"]

# Compare Engineered Linear Regression to Initial Linear Regression
if lr_engineered["MSE"] < lr_original["MSE"] and lr_engineered["R2"] > lr_original["R2"]:
    print("- Feature engineering improved the performance of the Linear Regression model.")
else:
    print("- Feature engineering did not significantly improve the performance of the Linear Regression model.")

# Compare Tuned Ridge (Engineered) to Linear Regression (Engineered)
if ridge_tuned["MSE"] < lr_engineered["MSE"] and ridge_tuned["R2"] > lr_engineered["R2"]:
    print("- Hyperparameter tuning of the Ridge model with engineered features further improved performance compared to the Linear Regression model with engineered features.")
else:
    print("- Hyperparameter tuning of the Ridge model with engineered features did not significantly improve performance compared to the Linear Regression model with engineered features.")

# Compare Tuned Ridge (Engineered) to Default Ridge (Original)
if ridge_tuned["MSE"] < ridge_default["MSE"] and ridge_tuned["R2"] > ridge_default["R2"]:
    print("- Hyperparameter tuning of the Ridge model with engineered features improved performance compared to the Ridge model with default hyperparameters.")
else:
    print("- Hyperparameter tuning of the Ridge model with engineered features did not significantly improve performance compared to the Ridge model with default hyperparameters.")

--- Model Performance Comparison ---

Initial Linear Regression (Original Features):
  Mean Squared Error (MSE): 4634658406.22
  R-squared (R2) Score: 0.6636

Ridge Regression (Original Features, Default):
  Mean Squared Error (MSE): 4634651616.32
  R-squared (R2) Score: 0.6636

Linear Regression (Engineered Features):
  Mean Squared Error (MSE): 4552463037.86
  R-squared (R2) Score: 0.6696

Tuned Ridge Regression (Engineered Features, Tuned):
  Mean Squared Error (MSE): 4552359506.89
  R-squared (R2) Score: 0.6696

--- Performance Comparison Summary ---
The Tuned Ridge Regression (Engineered Features, Tuned) performed the best based on both MSE (lower is better) and R2 (higher is better).

Impact of Feature Engineering and Hyperparameter Tuning:
- Feature engineering improved the performance of the Linear Regression model.
- Hyperparameter tuning of the Ridge model with engineered features further improved performance compared to the Linear Regression model with engineered features.
-
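
To put the printed comparison in perspective, the following sketch computes the relative MSE reduction at each step, plus the RMSE of each model. The MSE values are copied from the output above; `pct_reduction` is a hypothetical helper introduced only for this illustration.

```python
import math

# MSE values copied from the printed comparison above.
mse_by_model = {
    "Initial Linear Regression (Original Features)": 4634658406.22,
    "Ridge Regression (Original Features, Default)": 4634651616.32,
    "Linear Regression (Engineered Features)": 4552463037.86,
    "Tuned Ridge Regression (Engineered Features, Tuned)": 4552359506.89,
}


def pct_reduction(before, after):
    """Percentage by which `after` is lower than `before` (hypothetical helper)."""
    return 100.0 * (before - after) / before


baseline = mse_by_model["Initial Linear Regression (Original Features)"]
engineered = mse_by_model["Linear Regression (Engineered Features)"]
tuned = mse_by_model["Tuned Ridge Regression (Engineered Features, Tuned)"]

feature_gain = pct_reduction(baseline, engineered)
tuning_gain = pct_reduction(engineered, tuned)

print(f"Feature engineering lowered MSE by {feature_gain:.2f}%")
print(f"Tuning lowered MSE by a further {tuning_gain:.4f}%")

# RMSE is in the same units as the target, which makes it easier to read.
for name, mse_value in mse_by_model.items():
    print(f"{name}: RMSE = {math.sqrt(mse_value):,.0f}")
```

Read this way, nearly all of the gain comes from the engineered features. The tuned Ridge model's improvement over Linear Regression on the same features is very small, so "best" here means best by a narrow margin.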

## Train the final model

### Subtask:
Train the chosen model with the best hyperparameters on the entire training set.

**Reasoning**:
Get the best estimator from the GridSearchCV object and train it on the entire engineered training data.

In [None]:
# Get the best estimator from the GridSearchCV object
best_ridge_model = grid_search.best_estimator_

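# Train the best estimator on the entire engineered training data.
# Note: with GridSearchCV's default refit=True, best_estimator_ has already been
# refit on all the data passed to grid_search.fit(), so this call is only needed
# if the search was run on a subset of the data or with refit=False.
best_ridge_model.fit(X_train_engineered, y_train_engineered)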

print("Best Ridge Regression model trained on the entire engineered training set.")

## Update features for modeling

### Subtask:
Select the updated set of features (including the new ones) for training the model.

In [None]:
import matplotlib.pyplot as plt
import pandas as pd

# Select the log-transformed features to visualize
features_to_visualize_log = ['total_rooms_log', 'median_income_log']

# Create histograms for each selected log-transformed feature
df[features_to_visualize_log].hist(bins=50, figsize=(10, 5))
plt.tight_layout() # Adjust layout to prevent overlap
plt.show()

In [None]:
# Update the features list to include log-transformed features and exclude original skewed ones
features_engineered_transformed = ['longitude', 'latitude', 'housing_median_age',
                                   'total_bedrooms', 'population', 'households',
                                   'rooms_per_household', 'bedrooms_per_room', 'population_per_household',
                                   'total_rooms_log', 'median_income_log']

# Create a new DataFrame X_engineered_transformed by selecting these columns from df
X_engineered_transformed = df[features_engineered_transformed]

# Keep the target variable y as it is (the 'median_house_value' column from df)
# y was already defined in a previous step as df['median_house_value']

# Print the head of X_engineered_transformed and y to verify
print("Head of X_engineered_transformed:")
display(X_engineered_transformed.head())

print("\nHead of y:")
display(y.head())

## Apply Transformations to Skewed Features

### Subtask:
Apply log transformation to skewed numerical features identified during exploration.

**Reasoning**:
Apply log transformation to 'total_rooms' and 'median_income' to reduce skewness and display the head of the DataFrame to show the transformed columns.

In [None]:
import numpy as np

# Apply log transformation to 'total_rooms' and 'median_income'
# Add a small constant (e.g., 1) before taking the log to handle potential zero values,
# although based on describe() output, these columns don't have zeros.
# Using np.log1p which calculates log(1+x) is a robust way to handle this.
df['total_rooms_log'] = np.log1p(df['total_rooms'])
df['median_income_log'] = np.log1p(df['median_income'])

# Display the head of the DataFrame to verify the new transformed columns
display(df.head())
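
The effect of `log1p` can be sketched without the dataset. The values below are hypothetical stand-ins for a right-skewed column such as `total_rooms`; the standard-library `math.log1p` and `math.expm1` mirror what `np.log1p` and `np.expm1` do element-wise.

```python
import math

# Hypothetical right-skewed values standing in for a column such as total_rooms.
values = [0, 10, 100, 1_000, 10_000]

logged = [math.log1p(v) for v in values]     # log(1 + x), defined at x = 0
restored = [math.expm1(v) for v in logged]   # inverse transform: exp(x) - 1

for raw, log_value in zip(values, logged):
    print(f"{raw:>6} -> {log_value:.3f}")

# The raw values span four orders of magnitude; after the transform the
# spread is under 10, so the long right tail is compressed.
spread_raw = max(values) - min(values)
spread_log = max(logged) - min(logged)
print(spread_raw, round(spread_log, 3))
```

Because `expm1` inverts `log1p`, transformed values can be mapped back to the original scale when needed.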

In [None]:
import matplotlib.pyplot as plt
import pandas as pd

# Select a few numerical features to visualize
features_to_visualize = ['housing_median_age', 'total_rooms', 'median_income', 'median_house_value']

# Create histograms for each selected feature
df[features_to_visualize].hist(bins=50, figsize=(15, 10))
plt.tight_layout() # Adjust layout to prevent overlap
plt.show()

In [None]:
import matplotlib.pyplot as plt

# Create a scatter plot of median_income vs. median_house_value
plt.figure(figsize=(10, 6))
plt.scatter(df['median_income'], df['median_house_value'], alpha=0.5)

# Add labels and title
plt.xlabel('Median Income')
plt.ylabel('Median House Value')
plt.title('Relationship between Median Income and Median House Value')

# Add a grid for better readability
plt.grid(True)

# Show the plot
plt.show()

## Split the data (if necessary)

### Subtask:
Split the updated dataset (`X_engineered`, `y`) into training and testing sets.

**Reasoning**:
Split the features and target into training and testing sets using train_test_split as instructed and print their shapes.

In [None]:
from sklearn.model_selection import train_test_split

# Split the dataset into training and testing sets
X_train_engineered, X_test_engineered, y_train_engineered, y_test_engineered = train_test_split(X_engineered, y, test_size=0.2, random_state=42)

# Print the shapes of the resulting sets
print(f"Shape of X_train_engineered: {X_train_engineered.shape}")
print(f"Shape of X_test_engineered: {X_test_engineered.shape}")
print(f"Shape of y_train_engineered: {y_train_engineered.shape}")
print(f"Shape of y_test_engineered: {y_test_engineered.shape}")

**Reasoning**:
Select the updated set of features for training the model, including the engineered features.

In [None]:
# Define the list of features, including the original and engineered ones
features_engineered = ['longitude', 'latitude', 'housing_median_age', 'total_rooms',
                       'total_bedrooms', 'population', 'households', 'median_income',
                       'rooms_per_household', 'bedrooms_per_room', 'population_per_household']

# Create a new DataFrame X_engineered by selecting these columns from df
X_engineered = df[features_engineered]

# Keep the target variable y as it is (the 'median_house_value' column from df)
# y was already defined in a previous step as df['median_house_value']

# Print the head of X_engineered and y to verify
print("Head of X_engineered:")
display(X_engineered.head())

print("\nHead of y:")
display(y.head())

## Identify potential new features

### Subtask:
Determine which existing features can be combined or transformed to create meaningful new features (e.g., ratios, polynomial features, interaction terms).

**Reasoning**:
Describe the rationale for choosing features to create based on the analysis of existing features and potential relationships.

In [None]:
# Rationale for choosing new features:
# Based on domain knowledge and common practices in housing price prediction,
# ratios of existing features can provide more meaningful insights into the
# characteristics of a housing block group than the raw counts alone.

# 1. Rooms per household ('rooms_per_household'):
#    This ratio (total_rooms / households) can indicate the average number of rooms
#    available per household in a block group. It might be a strong predictor
#    of housing value, as larger houses (more rooms per household) are often
#    associated with higher values.

# 2. Bedrooms per room ('bedrooms_per_room'):
#    This ratio (total_bedrooms / total_rooms) can provide an idea of the
#    proportion of rooms that are bedrooms. A higher ratio might indicate
#    a different type of housing stock which could influence the median house value.

# 3. Population per household ('population_per_household'):
#    This ratio (population / households) represents the average household size.
#    Larger household sizes in a block group might correlate with different housing
#    demands and potentially impact housing values.

# These ratios normalize the counts by the number of households or rooms,
# making them potentially more robust indicators than the raw counts themselves.

## Create new features

### Subtask:
Write code to generate the new features and add them to the DataFrame.

**Reasoning**:
Generate the new features by calculating the ratios as described in the instructions and add them as new columns to the DataFrame. Then, display the head of the updated DataFrame to confirm the changes.

In [None]:
# Calculate 'rooms_per_household' and add it as a new column
df['rooms_per_household'] = df['total_rooms'] / df['households']

# Calculate 'bedrooms_per_room' and add it as a new column
df['bedrooms_per_room'] = df['total_bedrooms'] / df['total_rooms']

# Calculate 'population_per_household' and add it as a new column
df['population_per_household'] = df['population'] / df['households']

# Display the head of the DataFrame to verify the new columns
display(df.head())
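
One thing to watch with ratio features is a zero denominator: with pandas float columns, dividing by zero yields `inf` (or `NaN` for 0/0) rather than raising an error, and those values can upset a linear model later. The sketch below shows one possible guard in plain Python. `safe_ratio` and the two example rows are hypothetical and are not taken from the dataset.

```python
import math


def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    if denominator == 0:
        return math.nan
    return numerator / denominator


# Hypothetical block groups; the second one has no households.
rows = [
    {"total_rooms": 5612.0, "total_bedrooms": 1283.0, "population": 1015.0, "households": 472.0},
    {"total_rooms": 120.0, "total_bedrooms": 30.0, "population": 0.0, "households": 0.0},
]

for row in rows:
    row["rooms_per_household"] = safe_ratio(row["total_rooms"], row["households"])
    row["bedrooms_per_room"] = safe_ratio(row["total_bedrooms"], row["total_rooms"])
    row["population_per_household"] = safe_ratio(row["population"], row["households"])
    print(row["rooms_per_household"], row["bedrooms_per_room"], row["population_per_household"])
```

Rows flagged with NaN can then be dropped or imputed before the features are passed to a model.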

In [None]:
# Display summary statistics of the DataFrame
display(df.describe())

## Visualize the results (optional)

### Subtask:
Visualize the predictions of the new model versus the actual values.

**Reasoning**:
Create a scatter plot of the actual vs. predicted values from the Ridge model, add labels, title, a diagonal line for perfect predictions, and a grid.

In [None]:
import matplotlib.pyplot as plt

# Create a scatter plot of actual vs. predicted values for the Ridge model
plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred_ridge, alpha=0.5)

# Add labels and title
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Actual vs. Predicted Values (Ridge Regression)')

# Add a diagonal line for perfect predictions
# Determine the range for the diagonal line based on both actual and predicted values
plot_range = [min(y_test.min(), y_pred_ridge.min()), max(y_test.max(), y_pred_ridge.max())]
plt.plot(plot_range, plot_range, color='red', linestyle='--')

# Add a grid
plt.grid(True)

# Show the plot
plt.show()

In [None]:
import matplotlib.pyplot as plt

# Plot the original data points
plt.scatter(X, y, color='blue', label='Original Data')

# Plot the regression line
# We need to predict y values for the range of X values to plot the line
plt.plot(X, model.predict(X), color='red', label='Regression Line')

plt.xlabel('Features (X)')
plt.ylabel('Target (y)')
plt.title('Linear Regression Example')
plt.legend()
plt.grid(True)
plt.show()

In [None]:
import numpy as np
from sklearn.linear_model import LinearRegression

# Generate some sample data
# X represents the features (input), y represents the target (output)
X = np.array([1, 2, 3, 4, 5, 6]).reshape(-1, 1) # Reshape for scikit-learn
y = np.array([2, 4, 5, 4, 5, 6])

# Create a Linear Regression model
model = LinearRegression()

# Train the model using the data
model.fit(X, y)

# Make a prediction
new_X = np.array([7]).reshape(-1, 1)
prediction = model.predict(new_X)

print(f"Features (X):\n{X}")
print(f"Target (y):\n{y}")
print(f"Prediction for X={new_X[0][0]}: {prediction[0]}")

In [None]:
# @title Choose a different model
from google.colab import ai

response = ai.generate_text("What is the capital of England?", model_name='google/gemini-2.5-flash-lite')
print(response)

For longer text generations, you can stream the response. This displays the output token by token as it's generated, rather than waiting for the entire response to complete. This provides a more interactive and responsive experience. To enable this, simply set stream=True.

In [None]:
# @title Simple streaming example
from google.colab import ai

stream = ai.generate_text("Tell me a short story.", stream=True)
for text in stream:
  print(text, end='')
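
When streaming, it is often useful to keep the full text as well as display it. The sketch below illustrates that pattern without calling the service: `fake_stream` is a hypothetical stand-in for `ai.generate_text(..., stream=True)` and assumes, as the loop above does, that the stream yields plain text chunks.

```python
def fake_stream(text, chunk_size=8):
    """Yield `text` in small pieces, imitating a streamed response.

    Hypothetical stand-in for ai.generate_text(..., stream=True).
    """
    for start in range(0, len(text), chunk_size):
        yield text[start:start + chunk_size]


story = "Once upon a time, a notebook learned to tell stories."

chunks = []
for piece in fake_stream(story):
    print(piece, end='')   # show each piece as soon as it arrives
    chunks.append(piece)   # keep it so the full text is available afterwards
print()

full_text = ''.join(chunks)
```

Collecting chunks in a list and joining once at the end avoids repeated string concatenation inside the loop.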

In [None]:
#@title Text formatting setup
# This code is not necessary for colab.ai, but is useful for formatting text chunks
import sys

class LineWrapper:
    def __init__(self, max_length=80):
        self.max_length = max_length
        self.current_line_length = 0

    def print(self, text_chunk):
        i = 0
        n = len(text_chunk)
        while i < n:
            start_index = i
            while i < n and text_chunk[i] not in ' \n': # Find end of word
                i += 1
            current_word = text_chunk[start_index:i]

            delimiter = ""
            if i < n: # If not end of chunk, we found a delimiter
                delimiter = text_chunk[i]
                i += 1 # Consume delimiter

            if current_word:
                needs_leading_space = (self.current_line_length > 0)

                # Case 1: Word itself is too long for a line (must be broken)
                if len(current_word) > self.max_length:
                    if needs_leading_space: # Newline if current line has content
                        sys.stdout.write('\n')
                        self.current_line_length = 0
                    for char_val in current_word: # Break the long word
                        if self.current_line_length >= self.max_length:
                            sys.stdout.write('\n')
                            self.current_line_length = 0
                        sys.stdout.write(char_val)
                        self.current_line_length += 1
                # Case 2: Word doesn't fit on current line (print on new line)
                elif self.current_line_length + (1 if needs_leading_space else 0) + len(current_word) > self.max_length:
                    sys.stdout.write('\n')
                    sys.stdout.write(current_word)
                    self.current_line_length = len(current_word)
                # Case 3: Word fits on current line
                else:
                    if needs_leading_space:
                        # Define punctuation that should not have a leading space
                        # when they form an entire "word" (token) following another word.
                        no_leading_space_punctuation = {
                            ",", ".", ";", ":", "!", "?",        # Standard sentence punctuation
                            ")", "]", "}",                     # Closing brackets
                            "'s", "'S", "'re", "'RE", "'ve", "'VE", # Common contractions
                            "'m", "'M", "'ll", "'LL", "'d", "'D",
                            "n't", "N'T",
                            "...", "…"                            # Ellipses
                        }
                        if current_word not in no_leading_space_punctuation:
                            sys.stdout.write(' ')
                            self.current_line_length += 1
                    sys.stdout.write(current_word)
                    self.current_line_length += len(current_word)

            if delimiter == '\n':
                sys.stdout.write('\n')
                self.current_line_length = 0
            elif delimiter == ' ':
                # If line is full and a space delimiter arrives, it implies a wrap.
                if self.current_line_length >= self.max_length:
                    sys.stdout.write('\n')
                    self.current_line_length = 0

        sys.stdout.flush()


In [None]:
# @title Formatted streaming example
from google.colab import ai

wrapper = LineWrapper()
for chunk in ai.generate_text('Give me a long-winded description about the evolution of the Roman Empire.', model_name='google/gemini-2.5-flash', stream=True):
  wrapper.print(chunk)
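
A custom wrapper like `LineWrapper` is only needed because chunks arrive incrementally. For a complete, non-streamed response, the standard-library `textwrap` module is a simpler alternative. The snippet below is a sketch of that option; the `text` value is a placeholder, not real model output.

```python
import textwrap

# Placeholder standing in for a complete (non-streamed) model response.
text = "This sentence stands in for a complete model response. " * 4

# Re-flow the text so that no line is longer than 80 characters.
wrapped = textwrap.fill(text, width=80)
print(wrapped)
```

The trade-off is latency: `textwrap.fill` needs the whole string up front, whereas `LineWrapper` can print as each chunk arrives.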

# Task
Train a linear regression model on a more complex dataset from "/content/sample_data/".

## Load the dataset

### Subtask:
Load a more complex dataset, such as one from the `/content/sample_data/` directory, into a pandas DataFrame.


**Reasoning**:
The subtask is to load a dataset from `/content/sample_data/` into a pandas DataFrame. I need to import pandas and then use `pd.read_csv` to load a file. Displaying the head of the dataframe will confirm the data has been loaded.



In [None]:
import pandas as pd

# Load a sample dataset from /content/sample_data/ - choosing california_housing_train.csv
try:
    df = pd.read_csv('/content/sample_data/california_housing_train.csv')
    # Display the first few rows
    display(df.head())
except FileNotFoundError:
    print("File not found. Please check the file path.")
except Exception as e:
    print(f"An error occurred: {e}")

## Explore and preprocess the data

### Subtask:
Inspect the dataset, handle any missing values, and select the features and target variable for the linear regression model.


**Reasoning**:
Display DataFrame info, check for missing values, and handle them if necessary before selecting features and target variables.



In [None]:
# Display info about the DataFrame
df.info()

# Check for missing values
print("\nMissing values per column:")
print(df.isnull().sum())

# Handle missing values if any.
# The isnull().sum() output will show if there are any missing values.
# Based on the output of isnull().sum(), if 'total_bedrooms' has missing values,
# we will fill them with the median as it's a common practice for numerical features
# and less sensitive to outliers than the mean.
if df['total_bedrooms'].isnull().sum() > 0:
    median_bedrooms = df['total_bedrooms'].median()
    # Assign the result back: fillna(..., inplace=True) on a selected column is a
    # chained assignment that recent pandas versions warn about
    df['total_bedrooms'] = df['total_bedrooms'].fillna(median_bedrooms)
    print("\nMissing values in 'total_bedrooms' filled with median.")

# Also fill missing values in 'total_rooms' with the median
if df['total_rooms'].isnull().sum() > 0:
    median_rooms = df['total_rooms'].median()
    df['total_rooms'] = df['total_rooms'].fillna(median_rooms)
    print("Missing values in 'total_rooms' filled with median.")

print(df.isnull().sum()) # Verify after filling

# Select features (X) and target variable (y)
# We will use several numerical features as predictors for 'median_house_value'
features = ['longitude', 'latitude', 'housing_median_age', 'total_rooms',
            'total_bedrooms', 'population', 'households', 'median_income']
X = df[features]
y = df['median_house_value']

print("\nFeatures (X) and Target (y) selected.")
display(X.head())
display(y.head())
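
The comment above prefers the median for imputation because it is less sensitive to outliers than the mean. A small sketch with hypothetical numbers (not taken from the dataset) makes the difference concrete:

```python
import statistics

# Hypothetical bedroom counts for a handful of typical block groups.
bedrooms = [280.0, 310.0, 330.0, 350.0, 400.0]

# The same values plus one extreme block group.
with_outlier = bedrooms + [6400.0]

mean_shift = statistics.mean(with_outlier) - statistics.mean(bedrooms)
median_shift = statistics.median(with_outlier) - statistics.median(bedrooms)

print(f"Mean moved by   {mean_shift:.1f}")
print(f"Median moved by {median_shift:.1f}")
```

A single extreme value drags the mean far more than the median, which is why a median fill is the safer default for skewed count columns.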

## Split the data

### Subtask:
Split the dataset into training and testing sets.


**Reasoning**:
Split the features and target into training and testing sets using train_test_split as instructed and print their shapes.



In [None]:
from sklearn.model_selection import train_test_split

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Print the shapes of the resulting sets
print(f"Shape of X_train: {X_train.shape}")
print(f"Shape of X_test: {X_test.shape}")
print(f"Shape of y_train: {y_train.shape}")
print(f"Shape of y_test: {y_test.shape}")
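
To make the split concrete, here is a simplified standard-library sketch of what a seeded 80/20 split does with 17000 rows. `split_indices` is a hypothetical helper; scikit-learn's `train_test_split` uses its own shuffling, so the specific rows selected will differ even with the same seed value.

```python
import random


def split_indices(n_rows, test_size=0.2, seed=42):
    """Shuffle row indices and split them into train and test lists.

    Simplified, hypothetical sketch; not scikit-learn's implementation.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)   # seeded, so the split is repeatable
    n_test = round(n_rows * test_size)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(17000)
print(len(train_idx), len(test_idx))
```

Fixing the seed, as `random_state=42` does above, means every model in the notebook is evaluated on the same held-out rows, which keeps their metrics comparable.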

## Train the model

### Subtask:
Train the linear regression model on the training data.


**Reasoning**:
Train the linear regression model using the training data (X_train and y_train).



In [None]:
from sklearn.linear_model import LinearRegression

# Instantiate a Linear Regression model object
model = LinearRegression()

# Fit the model to the training data
model.fit(X_train, y_train)

print("Linear Regression model trained successfully.")

## Evaluate the model

### Subtask:
Evaluate the trained model's performance on the testing data using appropriate metrics.


**Reasoning**:
Use the trained model to predict on `X_test`, then compute the Mean Squared Error (MSE) and R-squared (R2) score against `y_test`.



In [None]:
from sklearn.metrics import mean_squared_error, r2_score

# Use the trained model to make predictions on the testing features
y_pred = model.predict(X_test)

# Calculate the Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)

# Calculate the R-squared score
r2 = r2_score(y_test, y_pred)

# Print the calculated metrics
print(f"Mean Squared Error (MSE): {mse}")
print(f"R-squared (R2) Score: {r2}")
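
For reference, both metrics are simple to compute by hand. The sketch below implements the standard definitions in plain Python: MSE is the mean of squared residuals, and R2 is one minus the ratio of the residual sum of squares to the total sum of squares. The function names and the four sample values are made up for illustration.

```python
def mean_squared_error_py(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score_py(y_true, y_pred):
    """1 - SS_res / SS_tot: the share of variance explained by the predictions."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Small made-up example.
y_true_demo = [3.0, -0.5, 2.0, 7.0]
y_pred_demo = [2.5, 0.0, 2.0, 8.0]

mse_demo = mean_squared_error_py(y_true_demo, y_pred_demo)
r2_demo = r2_score_py(y_true_demo, y_pred_demo)
print(mse_demo, round(r2_demo, 4))
```

A model that always predicts the mean of `y_true` scores an R2 of 0 under this definition, which gives a baseline for interpreting the score.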

## Visualize the results (optional)

### Subtask:
Visualize the predictions versus the actual values.


**Reasoning**:
Create a scatter plot of actual vs. predicted values and add a diagonal line for perfect predictions.



In [None]:
import matplotlib.pyplot as plt

# Create a scatter plot
plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred, alpha=0.5)

# Add labels and title
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Actual vs. Predicted Values')

# Add a diagonal line for perfect predictions
# Determine the range for the diagonal line
plot_range = [min(y_test.min(), y_pred.min()), max(y_test.max(), y_pred.max())]
plt.plot(plot_range, plot_range, color='red', linestyle='--')

# Add a grid
plt.grid(True)

# Show the plot
plt.show()

## Summary:

### Data Analysis Key Findings

*   The `california_housing_train.csv` dataset was successfully loaded, containing 17000 entries and 9 columns of `float64` data type.
*   No missing values were found in the dataset, eliminating the need for imputation.
*   The dataset was split into training (13600 samples) and testing (3400 samples) sets, with features including 'longitude', 'latitude', 'housing\_median\_age', 'total\_rooms', 'total\_bedrooms', 'population', 'households', and 'median\_income', and the target being 'median\_house\_value'.
*   A Linear Regression model was successfully trained on the training data.
*   The model achieved a Mean Squared Error (MSE) of approximately $4,634,658,406.22$ (an RMSE of about 68,078, in the same units as `median_house_value`) and an R-squared (R2) score of approximately 0.6636 on the testing data.

### Insights or Next Steps

*   The R-squared score of 0.66 means the model explains about 66% of the variance in median house values on the test set, so there is room for improvement.
*   Further steps could involve exploring feature engineering, trying different regression algorithms (e.g., Ridge, Lasso, or more complex models), or performing hyperparameter tuning to potentially improve the model's performance.


# Task
Generate a plan to train and evaluate a new regression model on the existing dataset, compare its performance to the previously trained linear regression model, and summarize the findings.

## Choose a new model

### Subtask:
Select a different regression algorithm to try (e.g., Ridge, Lasso, Decision Tree Regressor).


**Reasoning**:
Choose a different regression algorithm and record the choice and its rationale as comments in the next cell.



In [None]:
# Choosing Ridge Regression as an alternative regression model.
# Ridge is a linear model with L2 regularization.
# It can help to prevent overfitting, especially when dealing with multicollinearity
# among predictor variables, which might be present in this dataset.
# This choice is a common next step after trying simple Linear Regression.
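To make the effect of L2 regularization concrete: Ridge minimizes the squared error plus `alpha` times the squared L2 norm of the coefficients, so its coefficients are pulled toward zero relative to ordinary least squares. The sketch below illustrates this on synthetic, nearly collinear data (not the housing data). All names in it (`X_demo`, `y_demo`, `ols`, `ridge`) are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)

# Synthetic data (not the housing data): two nearly collinear predictors
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)
X_demo = np.column_stack([x1, x2])
y_demo = 3.0 * x1 + rng.normal(scale=0.5, size=200)

ols = LinearRegression().fit(X_demo, y_demo)
ridge = Ridge(alpha=1.0).fit(X_demo, y_demo)

# The L2 penalty shrinks the coefficient vector relative to least squares
ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)
print(f"OLS coefficient norm:   {ols_norm:.3f}")
print(f"Ridge coefficient norm: {ridge_norm:.3f}")
```

With collinear predictors, least squares can split the effect between the two columns in an unstable way, while the Ridge coefficients stay small and stable.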

## Train the new model

### Subtask:
Instantiate and train the chosen model using the training data (`X_train`, `y_train`).


**Reasoning**:
Instantiate and train the Ridge model using the training data.



In [None]:
from sklearn.linear_model import Ridge

# Instantiate a Ridge model object with default parameters
ridge_model = Ridge()

# Fit the Ridge model to the training data
ridge_model.fit(X_train, y_train)

print("Ridge Regression model trained successfully.")

## Evaluate the new model

### Subtask:
Evaluate the new model's performance on the testing data (`X_test`, `y_test`) using appropriate metrics (e.g., MSE, R2).


**Reasoning**:
Calculate MSE and R2 for the Ridge model using the test data and predictions, and print the results.



In [None]:
from sklearn.metrics import mean_squared_error, r2_score

# Use the trained ridge_model to make predictions on the testing features
y_pred_ridge = ridge_model.predict(X_test)

# Calculate the Mean Squared Error (MSE) for the Ridge model
mse_ridge = mean_squared_error(y_test, y_pred_ridge)

# Calculate the R-squared score for the Ridge model
r2_ridge = r2_score(y_test, y_pred_ridge)

# Print the calculated metrics for the Ridge model
print(f"Ridge Model Mean Squared Error (MSE): {mse_ridge}")
print(f"Ridge Model R-squared (R2) Score: {r2_ridge}")

## Compare model performance

### Subtask:
Compare the performance of the new model to the linear regression model we already trained.


**Reasoning**:
Compare the performance metrics of the two models and print a summary.



In [None]:
# Print the performance metrics for the Linear Regression model
print(f"Linear Regression Model Mean Squared Error (MSE): {mse}")
print(f"Linear Regression Model R-squared (R2) Score: {r2}")

# Print the performance metrics for the Ridge Regression model
print(f"Ridge Model Mean Squared Error (MSE): {mse_ridge}")
print(f"Ridge Model R-squared (R2) Score: {r2_ridge}")

# Compare and summarize the performance (direction of the difference only;
# the metrics printed above show how large the gap actually is)
if mse_ridge < mse:
    print("\nSummary: The Ridge Regression model has a lower MSE than the Linear Regression model.")
elif mse_ridge > mse:
    print("\nSummary: The Linear Regression model has a lower MSE than the Ridge Regression model.")
else:
    print("\nSummary: Both models have the same MSE.")

if r2_ridge > r2:
    print("Summary: The Ridge Regression model has a higher R2 than the Linear Regression model.")
elif r2_ridge < r2:
    print("Summary: The Linear Regression model has a higher R2 than the Ridge Regression model.")
else:
    print("Summary: Both models have the same R2.")

## Visualize the results (optional)

### Subtask:
Visualize the predictions of the new model versus the actual values.


**Reasoning**:
Create a scatter plot of the actual vs. predicted values from the Ridge model, add labels, title, a diagonal line for perfect predictions, and a grid.



In [None]:
import matplotlib.pyplot as plt

# Create a scatter plot of actual vs. predicted values for the Ridge model
plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred_ridge, alpha=0.5)

# Add labels and title
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Actual vs. Predicted Values (Ridge Regression)')

# Add a diagonal line for perfect predictions
# Determine the range for the diagonal line based on both actual and predicted values
plot_range = [min(y_test.min(), y_pred_ridge.min()), max(y_test.max(), y_pred_ridge.max())]
plt.plot(plot_range, plot_range, color='red', linestyle='--')

# Add a grid
plt.grid(True)

# Show the plot
plt.show()

## Summary:

### Data Analysis Key Findings

*   Ridge Regression was chosen as an alternative model to Linear Regression.
*   The Ridge Regression model was successfully trained on the training data.
*   The Ridge Regression model achieved a Mean Squared Error (MSE) of approximately 4,634,651,616.32 on the test set.
*   The Ridge Regression model achieved an R-squared (R2) score of approximately 0.6636 on the test set.
*   Comparing to the Linear Regression model (MSE: 4,634,658,406.22, R2: 0.6636), the Ridge Regression model performed slightly better based on MSE and similarly based on R2.
*   A scatter plot visualizing the actual versus predicted values for the Ridge model was generated, showing the distribution of predictions relative to a perfect prediction line.

### Insights or Next Steps

*   Although Ridge Regression showed a minor improvement, the performance difference compared to Linear Regression is minimal, suggesting that the L2 regularization had a limited impact on this specific dataset.
*   Further model exploration with different algorithms (e.g., tree-based models like Random Forest or Gradient Boosting) or hyperparameter tuning of the Ridge model could potentially yield better performance.


# Task
Perform feature engineering on the `df` DataFrame to create new features, train a model with the updated features, and evaluate its performance.

## Identify potential new features

### Subtask:
Determine which existing features can be combined or transformed to create meaningful new features (e.g., ratios, polynomial features, interaction terms).


**Reasoning**:
Describe the rationale for choosing features to create based on the analysis of existing features and potential relationships.



In [None]:
# Rationale for choosing new features:
# Based on domain knowledge and common practices in housing price prediction,
# ratios of existing features can provide more meaningful insights into the
# characteristics of a housing block group than the raw counts alone.

# 1. Rooms per household ('rooms_per_household'):
#    This ratio (total_rooms / households) can indicate the average number of rooms
#    available per household in a block group. It might be a strong predictor
#    of housing value, as larger houses (more rooms per household) are often
#    associated with higher values.

# 2. Bedrooms per room ('bedrooms_per_room'):
#    This ratio (total_bedrooms / total_rooms) can provide an idea of the
#    proportion of rooms that are bedrooms. A higher ratio might indicate
#    a different type of housing stock which could influence the median house value.

# 3. Population per household ('population_per_household'):
#    This ratio (population / households) represents the average household size.
#    Larger household sizes in a block group might correlate with different housing
#    demands and potentially impact housing values.

# These ratios normalize the counts by the number of households or rooms,
# making them potentially more robust indicators than the raw counts themselves.

## Create new features

### Subtask:
Write code to generate the new features and add them to the DataFrame.


**Reasoning**:
Generate the new features by calculating the ratios described above and add them as new columns to the DataFrame. Then display the head of the updated DataFrame to confirm the changes.



In [None]:
# Calculate 'rooms_per_household' and add it as a new column
df['rooms_per_household'] = df['total_rooms'] / df['households']

# Calculate 'bedrooms_per_room' and add it as a new column
df['bedrooms_per_room'] = df['total_bedrooms'] / df['total_rooms']

# Calculate 'population_per_household' and add it as a new column
df['population_per_household'] = df['population'] / df['households']

# Display the head of the DataFrame to verify the new columns
display(df.head())
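The ratios above divide by `households` and `total_rooms`. The notebook does not say whether either column can be zero, so the following is only a precautionary sketch: a hypothetical helper, `add_ratio_features`, that builds the same three ratios on made-up rows and marks any division-by-zero result as missing instead of leaving `inf` in the data.

```python
import numpy as np
import pandas as pd

def add_ratio_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `frame` with the three ratio features added."""
    out = frame.copy()
    out["rooms_per_household"] = out["total_rooms"] / out["households"]
    out["bedrooms_per_room"] = out["total_bedrooms"] / out["total_rooms"]
    out["population_per_household"] = out["population"] / out["households"]
    # Division by zero yields +/-inf in pandas; mark those entries as missing
    return out.replace([np.inf, -np.inf], np.nan)

# Toy rows (made up for illustration); the last row has zero households
toy = pd.DataFrame({
    "total_rooms": [10, 20, 8],
    "total_bedrooms": [2, 5, 2],
    "population": [6, 10, 4],
    "households": [2, 5, 0],
})
toy_features = add_ratio_features(toy)
print(toy_features)
```

Rows flagged this way would still need to be dropped or imputed before fitting, since scikit-learn's linear models do not accept missing values.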

## Update features for modeling

### Subtask:
Select the updated set of features (including the new ones) for training the model.


**Reasoning**:
Select the updated set of features for training the model, including the engineered features.



In [None]:
# Define the list of features, including the original and engineered ones
features_engineered = ['longitude', 'latitude', 'housing_median_age', 'total_rooms',
                       'total_bedrooms', 'population', 'households', 'median_income',
                       'rooms_per_household', 'bedrooms_per_room', 'population_per_household']

# Create a new DataFrame X_engineered by selecting these columns from df
X_engineered = df[features_engineered]

# Keep the target variable y as it is (the 'median_house_value' column from df)
# y was already defined in a previous step as df['median_house_value']

# Print the head of X_engineered and y to verify
print("Head of X_engineered:")
display(X_engineered.head())

print("\nHead of y:")
display(y.head())

## Split the data (if necessary)

### Subtask:
Split the updated dataset (`X_engineered`, `y`) into training and testing sets.


**Reasoning**:
Split the features and target into training and testing sets using train_test_split as instructed and print their shapes.



In [None]:
from sklearn.model_selection import train_test_split

# Split the dataset into training and testing sets
X_train_engineered, X_test_engineered, y_train_engineered, y_test_engineered = train_test_split(
    X_engineered, y, test_size=0.2, random_state=42
)

# Print the shapes of the resulting sets
print(f"Shape of X_train_engineered: {X_train_engineered.shape}")
print(f"Shape of X_test_engineered: {X_test_engineered.shape}")
print(f"Shape of y_train_engineered: {y_train_engineered.shape}")
print(f"Shape of y_test_engineered: {y_test_engineered.shape}")
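Comparing this model with the earlier ones is only fair if both are scored on the same test rows. `train_test_split` chooses rows from the number of samples and the `random_state`, not from the column contents, so adding columns does not change the split. The sketch below checks this on two hypothetical arrays. It assumes the earlier split in the notebook also used `test_size=0.2` and `random_state=42` and that the row order of `df` has not changed, neither of which this section shows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Two hypothetical feature tables with the same rows but different columns
rows = np.arange(100)
X_original_demo = np.column_stack([rows, rows * 2])
X_extended_demo = np.column_stack([rows, rows * 2, rows * 3])

# Same test_size and random_state for both splits
_, test_a = train_test_split(X_original_demo, test_size=0.2, random_state=42)
_, test_b = train_test_split(X_extended_demo, test_size=0.2, random_state=42)

# The first column holds the row id, so equal columns mean equal test rows
same_rows = np.array_equal(test_a[:, 0], test_b[:, 0])
print(f"Same test rows in both splits: {same_rows}")
```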

## Train a model with new features

### Subtask:
Train a regression model (either the Linear Regression or Ridge model again, or a new one) using the updated features (`X_train_engineered`, `y_train_engineered`).


**Reasoning**:
Instantiate and train a Linear Regression model using the engineered training data.



In [None]:
from sklearn.linear_model import LinearRegression

# Instantiate a Linear Regression model object
model_engineered = LinearRegression()

# Fit the model to the engineered training data
model_engineered.fit(X_train_engineered, y_train_engineered)

print("Linear Regression model trained successfully using engineered features.")

## Evaluate the model with new features

### Subtask:
Evaluate the performance of the model trained with the new features using appropriate metrics (e.g., MSE, R2) on the testing data (`X_test_engineered`, `y_test_engineered`).


**Reasoning**:
Calculate MSE and R2 for the model trained with engineered features using the test data and predictions, and print the results.



In [None]:
from sklearn.metrics import mean_squared_error, r2_score

# Use the trained model_engineered to make predictions on the engineered testing features
y_pred_engineered = model_engineered.predict(X_test_engineered)

# Calculate the Mean Squared Error (MSE) for the model with engineered features
mse_engineered = mean_squared_error(y_test_engineered, y_pred_engineered)

# Calculate the R-squared score for the model with engineered features
r2_engineered = r2_score(y_test_engineered, y_pred_engineered)

# Print the calculated metrics for the model with engineered features
print(f"Model with Engineered Features Mean Squared Error (MSE): {mse_engineered}")
print(f"Model with Engineered Features R-squared (R2) Score: {r2_engineered}")

## Compare performance

### Subtask:
Compare the performance of the model with and without the new features.


**Reasoning**:
Print the performance metrics for both models and compare them based on MSE and R2 to determine which model performed better.



In [None]:
# Print the performance metrics for the Linear Regression model trained on original features
print(f"Linear Regression Model (Original Features) MSE: {mse}")
print(f"Linear Regression Model (Original Features) R2 Score: {r2}")

# Print the performance metrics for the Linear Regression model trained on engineered features
print(f"Linear Regression Model (Engineered Features) MSE: {mse_engineered}")
print(f"Linear Regression Model (Engineered Features) R2 Score: {r2_engineered}")

# Compare and summarize the performance based on MSE
if mse_engineered < mse:
    print("\nSummary: The Linear Regression model with engineered features performed better than the model with original features based on MSE (lower is better).")
elif mse_engineered > mse:
    print("\nSummary: The Linear Regression model with original features performed better than the model with engineered features based on MSE (lower is better).")
else:
    print("\nSummary: Both Linear Regression models performed similarly based on MSE.")

# Compare and summarize the performance based on R2
if r2_engineered > r2:
    print("Summary: The Linear Regression model with engineered features performed better than the model with original features based on R2 (higher is better).")
elif r2_engineered < r2:
    print("Summary: The Linear Regression model with original features performed better than the model with engineered features based on R2 (higher is better).")
else:
    print("Summary: Both Linear Regression models performed similarly based on R2.")

## Visualize results (optional)

### Subtask:
Visualize the predictions of the model trained with new features versus the actual values.


**Reasoning**:
Create a scatter plot of the actual vs. predicted values from the model trained with engineered features, add labels, title, a diagonal line for perfect predictions, and a grid.



In [None]:
import matplotlib.pyplot as plt

# Create a scatter plot of actual vs. predicted values for the engineered features model
plt.figure(figsize=(8, 6))
plt.scatter(y_test_engineered, y_pred_engineered, alpha=0.5)

# Add labels and title
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Actual vs. Predicted Values (Linear Regression with Engineered Features)')

# Add a diagonal line for perfect predictions
# Determine the range for the diagonal line based on both actual and predicted values
plot_range_engineered = [min(y_test_engineered.min(), y_pred_engineered.min()), max(y_test_engineered.max(), y_pred_engineered.max())]
plt.plot(plot_range_engineered, plot_range_engineered, color='red', linestyle='--')

# Add a grid
plt.grid(True)

# Show the plot
plt.show()

## Summary:

### Data Analysis Key Findings

*   Three new features were created: 'rooms\_per\_household', 'bedrooms\_per\_room', and 'population\_per\_household'.
*   The dataset was successfully split into training (80%) and testing (20%) sets, resulting in training sets with 13600 samples and testing sets with 3400 samples.
*   A Linear Regression model was trained using the dataset with the engineered features.
*   The model trained with engineered features achieved a Mean Squared Error (MSE) of approximately 4,552,463,037.86 and an R-squared (R2) score of approximately 0.670 on the test set.
*   Comparing the performance to the model with original features (MSE of ~4.63 billion and R2 of 0.664), the model with engineered features showed slightly better performance based on both metrics (lower MSE and higher R2).

### Insights or Next Steps

*   The engineered features slightly improved the Linear Regression model's performance, suggesting that these ratios capture some valuable information about housing values. Further investigation into other feature engineering techniques or non-linear models could potentially yield more significant improvements.
*   Visualizing the actual vs. predicted values confirms that the model's predictions generally follow the trend of the actual values, although there is still a notable spread, particularly for higher house values, indicating room for further model improvement.


# Task
Explore Ridge or Lasso Regression for hyperparameter tuning.

## Choose a model for tuning

### Subtask:
Select either Ridge or Lasso Regression for hyperparameter tuning.


## Define the parameter grid

### Subtask:
Specify the range of hyperparameter values to search over.


**Reasoning**:
Define a range of alpha values for hyperparameter tuning of the Ridge model and store them in a dictionary named `param_grid`.



In [None]:
import numpy as np

# Choose a range of alpha values for Ridge Regression
# Using a logarithmic scale from 0.1 to 100.
alpha_range = np.logspace(-1, 2, 100)

# Create a dictionary named param_grid
param_grid = {'alpha': alpha_range}

print("Hyperparameter grid for Ridge Regression:")
print(param_grid)
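As a quick check of what this grid contains, the sketch below rebuilds the same array under a separate, illustrative name (`alpha_demo`) and inspects its endpoints and spacing. `np.logspace(-1, 2, 100)` returns 100 values from 10^-1 to 10^2 with a constant ratio between neighbours.

```python
import numpy as np

alpha_demo = np.logspace(-1, 2, 100)

# Endpoints are 10**-1 and 10**2
print(alpha_demo[0], alpha_demo[-1])

# Consecutive values share a constant ratio (geometric spacing)
ratios = alpha_demo[1:] / alpha_demo[:-1]
print(ratios.min(), ratios.max())
```

If a search ends up selecting the smallest or largest value in such a grid, the optimum may lie outside it, and widening the exponent range (for example `np.logspace(-4, 2, 100)`) is a common follow-up.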

## Set up cross-validation

### Subtask:
Choose a cross-validation strategy (e.g., K-Fold) to evaluate different parameter combinations robustly.


**Reasoning**:
Instantiate a KFold object with the specified parameters for cross-validation.



In [None]:
from sklearn.model_selection import KFold

# Instantiate a KFold object
# n_splits=5 is a common choice for K-Fold cross-validation
# shuffle=True shuffles the data before splitting, which is important if the data
# has any inherent order.
# random_state ensures that the shuffling is the same each time the code is run,
# making the results reproducible.
kf = KFold(n_splits=5, shuffle=True, random_state=42)

print("KFold cross-validation strategy instantiated.")
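To see what this strategy does, the sketch below applies the same `KFold` settings to ten toy samples and prints the index split for each fold. The names `X_small`, `kf_demo`, and `held_out` are illustrative only.

```python
import numpy as np
from sklearn.model_selection import KFold

X_small = np.arange(10).reshape(-1, 1)  # ten toy samples
kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)

held_out = []
for fold, (train_idx, val_idx) in enumerate(kf_demo.split(X_small), start=1):
    print(f"Fold {fold}: train={train_idx}, validation={val_idx}")
    held_out.extend(val_idx.tolist())
```

Each sample lands in the validation part of exactly one fold, so every row contributes to the cross-validated score once.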

## Perform hyperparameter search

### Subtask:
Use a tuning method (e.g., GridSearchCV or RandomizedSearchCV) to find the best hyperparameters.


**Reasoning**:
Instantiate and fit GridSearchCV to find the best hyperparameters for the Ridge model using the defined parameter grid and cross-validation strategy. Then, print the best parameters and best score.



In [None]:
from sklearn.model_selection import GridSearchCV

# Instantiate a GridSearchCV object
# estimator: The model to tune (ridge_model; GridSearchCV clones it, so its earlier fit is not reused)
# param_grid: The grid of hyperparameters to search over
# scoring: The metric to optimize (Negative Mean Squared Error, as GridSearchCV maximizes scores)
# cv: The cross-validation strategy (kf)
grid_search = GridSearchCV(estimator=ridge_model, param_grid=param_grid,
                           scoring='neg_mean_squared_error', cv=kf)

# Fit the GridSearchCV object to the training data
grid_search.fit(X_train_engineered, y_train_engineered)

# Print the best hyperparameters found by the search
print("Best hyperparameters found by GridSearchCV:")
print(grid_search.best_params_)

# Print the best cross-validation score achieved during the search
# Note: The score is negative MSE, so we take the negative to get the actual MSE
print("\nBest cross-validation MSE score:")
print(-grid_search.best_score_)
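The Ridge penalty acts on the raw coefficients, so it is sensitive to feature scale, and the housing features span very different ranges. The notebook does not scale its features. The sketch below shows one possible variant, not the notebook's method: a `StandardScaler` and `Ridge` combined in a pipeline and tuned with `GridSearchCV` on synthetic data. The data and the names `pipeline` and `search` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data with features on very different scales
X_demo = rng.normal(size=(300, 3)) * np.array([1.0, 100.0, 10_000.0])
y_demo = X_demo @ np.array([2.0, 0.05, 0.0003]) + rng.normal(size=300)

# make_pipeline names each step after its lowercased class name ("ridge"),
# so the alpha parameter is addressed as "ridge__alpha"
pipeline = make_pipeline(StandardScaler(), Ridge())
search = GridSearchCV(
    estimator=pipeline,
    param_grid={"ridge__alpha": np.logspace(-3, 3, 13)},
    scoring="neg_mean_squared_error",
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
)
search.fit(X_demo, y_demo)

best_alpha = search.best_params_["ridge__alpha"]
print(f"Best alpha: {best_alpha}")
print(f"Best cross-validated MSE: {-search.best_score_:.3f}")
```

Keeping the scaler inside the pipeline means it is refit on the training part of each fold, so the validation part never influences the scaling.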

## Train the final model

### Subtask:
Train the chosen model with the best hyperparameters on the entire training set.


**Reasoning**:
Get the best estimator from the GridSearchCV object and train it on the entire engineered training data.



In [None]:
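# Get the best estimator from the GridSearchCV object.
# With the default refit=True, GridSearchCV has already refit this estimator
# on the entire training data passed to fit(), using the best alpha.
best_ridge_model = grid_search.best_estimator_

# Refitting here is therefore redundant; it is kept to make the step explicit
# and reproduces the same model.
best_ridge_model.fit(X_train_engineered, y_train_engineered)

print("Best Ridge Regression model trained on the entire engineered training set.")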
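`GridSearchCV` refits the best configuration on the full training data by default (`refit=True`), so `best_estimator_` is ready to use as soon as the search finishes. The sketch below checks this on synthetic data: fitting the best estimator again on the same data leaves its coefficients unchanged. The data and variable names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(120, 4))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=120)

# refit=True is the default, so the best model is refit on all of X_demo
search = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_demo, y_demo)

# best_estimator_ is already fitted at this point
coef_after_search = search.best_estimator_.coef_.copy()

# Fitting it again on the same data reproduces the same coefficients
search.best_estimator_.fit(X_demo, y_demo)
coef_after_refit = search.best_estimator_.coef_

print(np.allclose(coef_after_search, coef_after_refit))
```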

## Evaluate the tuned model

### Subtask:
Evaluate the performance of the tuned model on the testing data.


**Reasoning**:
Calculate MSE and R2 for the tuned Ridge model using the test data and predictions, and print the results.



In [None]:
from sklearn.metrics import mean_squared_error, r2_score

# Use the trained best_ridge_model to make predictions on the engineered testing features
y_pred_tuned_ridge = best_ridge_model.predict(X_test_engineered)

# Calculate the Mean Squared Error (MSE) for the tuned Ridge model
mse_tuned_ridge = mean_squared_error(y_test_engineered, y_pred_tuned_ridge)

# Calculate the R-squared score for the tuned Ridge model
r2_tuned_ridge = r2_score(y_test_engineered, y_pred_tuned_ridge)

# Print the calculated metrics for the tuned Ridge model
print(f"Tuned Ridge Model Mean Squared Error (MSE): {mse_tuned_ridge}")
print(f"Tuned Ridge Model R-squared (R2) Score: {r2_tuned_ridge}")

## Compare with previous models

### Subtask:
Compare the performance of the tuned model to the previously trained models.


**Reasoning**:
Print the performance metrics for all models and compare them to summarize the impact of feature engineering and hyperparameter tuning.



In [None]:
# Print the performance metrics for all models for comparison

# Initial Linear Regression model (original features)
print(f"Initial Linear Regression Model (Original Features) MSE: {mse}")
print(f"Initial Linear Regression Model (Original Features) R2 Score: {r2}")

# Ridge Regression model (original features, default hyperparameters)
print(f"Ridge Regression Model (Original Features, Default Hyperparameters) MSE: {mse_ridge}")
print(f"Ridge Regression Model (Original Features, Default Hyperparameters) R2 Score: {r2_ridge}")

# Linear Regression model (engineered features)
print(f"Linear Regression Model (Engineered Features) MSE: {mse_engineered}")
print(f"Linear Regression Model (Engineered Features) R2 Score: {r2_engineered}")

# Tuned Ridge Regression model (engineered features, tuned hyperparameters)
print(f"Tuned Ridge Regression Model (Engineered Features, Tuned Hyperparameters) MSE: {mse_tuned_ridge}")
print(f"Tuned Ridge Regression Model (Engineered Features, Tuned Hyperparameters) R2 Score: {r2_tuned_ridge}")

# Summarize the findings
print("\n--- Performance Comparison Summary ---")

# Collect the metrics by model so the best one can be selected directly
# (chained comparisons would mislabel the winner when two models tie)
mse_by_model = {
    "Initial Linear Regression model with original features": mse,
    "Ridge Regression model with default hyperparameters": mse_ridge,
    "Linear Regression model with engineered features": mse_engineered,
    "Tuned Ridge Regression model with engineered features": mse_tuned_ridge,
}
r2_by_model = {
    "Initial Linear Regression model with original features": r2,
    "Ridge Regression model with default hyperparameters": r2_ridge,
    "Linear Regression model with engineered features": r2_engineered,
    "Tuned Ridge Regression model with engineered features": r2_tuned_ridge,
}

# Compare MSE
best_mse_model = min(mse_by_model, key=mse_by_model.get)
print(f"The {best_mse_model} performed the best based on MSE (lower is better).")

# Compare R2
best_r2_model = max(r2_by_model, key=r2_by_model.get)
print(f"The {best_r2_model} performed the best based on R2 (higher is better).")

print("\nImpact of Feature Engineering and Hyperparameter Tuning:")
# These checks only test whether both metrics moved in the right direction;
# they say nothing about how large or statistically meaningful the change is.
if mse_engineered < mse and r2_engineered > r2:
    print("- Feature engineering improved both MSE and R2 of the Linear Regression model.")
else:
    print("- Feature engineering did not improve both MSE and R2 of the Linear Regression model.")

if mse_tuned_ridge < mse_engineered and r2_tuned_ridge > r2_engineered:
    print("- The tuned Ridge model with engineered features improved both MSE and R2 compared to the Linear Regression model with engineered features.")
else:
    print("- The tuned Ridge model with engineered features did not improve both MSE and R2 compared to the Linear Regression model with engineered features.")

# Note: the default Ridge model used the original features, so this comparison
# reflects feature engineering and tuning together.
if mse_tuned_ridge < mse_ridge and r2_tuned_ridge > r2_ridge:
    print("- The tuned Ridge model with engineered features improved both MSE and R2 compared to the Ridge model with default hyperparameters and original features.")
else:
    print("- The tuned Ridge model with engineered features did not improve both MSE and R2 compared to the Ridge model with default hyperparameters and original features.")

## Visualize results (optional)

### Subtask:
Visualize the predictions of the tuned model.


**Reasoning**:
Create a scatter plot of the actual vs. predicted values from the tuned Ridge model, add labels, title, a diagonal line for perfect predictions, and a grid.



In [None]:
# Create a scatter plot of actual vs. predicted values for the tuned Ridge model
plt.figure(figsize=(8, 6))
plt.scatter(y_test_engineered, y_pred_tuned_ridge, alpha=0.5)

# Add labels and title
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Actual vs. Predicted Values (Tuned Ridge Regression)')

# Add a diagonal line for perfect predictions
# Determine the range for the diagonal line based on both actual and predicted values
plot_range_tuned = [min(y_test_engineered.min(), y_pred_tuned_ridge.min()), max(y_test_engineered.max(), y_pred_tuned_ridge.max())]
plt.plot(plot_range_tuned, plot_range_tuned, color='red', linestyle='--')

# Add a grid
plt.grid(True)

# Show the plot
plt.show()

## Summary:

### Data Analysis Key Findings

*   Hyperparameter tuning of the Ridge Regression model with engineered features selected `alpha` = 0.1, the smallest value in the search grid (0.1 to 100), so the optimum may lie below the range that was searched.
*   The best cross-validation Mean Squared Error (MSE) achieved during the grid search was approximately 5,070,263,224.18.
*   The tuned Ridge Regression model with engineered features achieved a Mean Squared Error (MSE) of approximately 4,552,359,506.89 and an R-squared (R2) score of approximately 0.67 on the testing data.
*   Feature engineering improved the performance of the Linear Regression model compared to using original features (test MSE fell from about 4.63 billion to about 4.55 billion).
*   The tuned Ridge model's test MSE is only marginally lower than that of the Linear Regression model with engineered features (4,552,359,506.89 vs. 4,552,463,037.86). It also improves on the Ridge model with default hyperparameters, but that model was trained on the original features, so most of that gain appears to come from feature engineering rather than tuning.
*   The tuned Ridge Regression model with engineered features had the lowest MSE and highest R2 score among the models evaluated, although its margin over the Linear Regression model with engineered features is negligible.
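To put the four test-set results on one scale, the sketch below takes the MSE values reported in this notebook's summaries and expresses each as a percentage reduction relative to the initial Linear Regression model. The dictionary and variable names are illustrative. The numbers are copied from the summaries, not recomputed.

```python
# Test-set MSE values as reported in the summaries of this notebook
reported_mse = {
    "Linear Regression (original features)": 4_634_658_406.22,
    "Ridge, default alpha (original features)": 4_634_651_616.32,
    "Linear Regression (engineered features)": 4_552_463_037.86,
    "Tuned Ridge (engineered features)": 4_552_359_506.89,
}

baseline = reported_mse["Linear Regression (original features)"]

# Relative MSE reduction of each model versus the baseline, in percent
reduction_pct = {
    name: 100.0 * (baseline - value) / baseline
    for name, value in reported_mse.items()
}

for name, pct in reduction_pct.items():
    print(f"{name}: {pct:.4f}% lower MSE than baseline")
```

On these figures, the engineered features account for a reduction of roughly 1.8%, while regularization and tuning each change the MSE by far less than 0.01%.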

### Insights or Next Steps

*   The visualization of actual vs. predicted values for the tuned Ridge model shows a reasonable spread around the perfect prediction line, suggesting the model captures a significant portion of the variance but still has notable errors for some predictions.
*   Further exploration could involve trying other regression algorithms (e.g., Lasso, ElasticNet, Gradient Boosting) or more advanced feature engineering techniques to potentially improve the model's performance further.
