# Predicting House Prices in California with `LinearRegression()`

In this lab you will inspect, analyze, and visualize house price data from different districts in California, US. After performing some analysis, EDA, and feature engineering, you will build your own `LinearRegression()` model with scikit-learn.

In [None]:
import pandas as pd

# Part 1 - Inspection and Cleaning


#### Import and Inspect your data

Read the `housing.csv` file and make use of some methods to understand your data better. Below is an explanation of the features you are going to work with:

1. **longitude:**  geographical coordinate, east to west position of district
2. **latitude:**  geographical coordinate, north to south position of district
3. **housing_median_age:** the median age of houses in district
4. **total_rooms:** sum of all rooms in district
5. **total_bedrooms:** sum of all bedrooms in district
6. **population:** total population in district
7. **households:** total households in district
8. **median_income:** median household income in district 
9. **median_house_value:** median house value in district
10. **ocean_proximity:** district's proximity to the ocean

In [None]:
# Load the dataset; housing.csv is assumed to be comma-separated (the pandas default)
file_path = '../data/housing.csv'  # Replace with the correct path if different
housing_data = pd.read_csv(file_path)

In [None]:
# Define the correct column names for the dataset
column_names = [
    "longitude", "latitude", "housing_median_age", "total_rooms",
    "total_bedrooms", "population", "households", "median_income",
    "median_house_value", "ocean_proximity"
]

# Check that the number of columns matches the expected number
if housing_data.shape[1] < len(column_names):
    raise ValueError(f"Dataset has fewer columns ({housing_data.shape[1]}) than expected ({len(column_names)}).")
elif housing_data.shape[1] > len(column_names):
    print(f"Dataset has more columns ({housing_data.shape[1]}) than expected ({len(column_names)}). Extra columns will be ignored.")

# Create a new dataset with the correct column names
# (.copy() avoids SettingWithCopyWarning when columns are added later)
housing_data_cleaned = housing_data.iloc[:, :len(column_names)].copy()  # Select only the relevant columns
housing_data_cleaned.columns = column_names

# Display information about the cleaned dataset
print("\nFirst Few Rows of the Dataset:")
print(housing_data_cleaned.head())  # Show first 5 rows

print("\nSummary Statistics:")
print(housing_data_cleaned.describe())  # Summary statistics for numeric columns

# Check the distribution of unique values in each column (for categorical or numeric variables)
print("\nUnique Values per Column:")
for col in housing_data_cleaned.columns:
    print(f"{col}: {housing_data_cleaned[col].nunique()} unique values")

#### Histograms
Make histograms of all your numeric columns in order to get a good understanding of the distribution of your data points. What do you see?

In [None]:
# Plot histograms for all numeric columns in the cleaned dataset
import matplotlib.pyplot as plt

# Select numeric columns
numeric_columns = housing_data_cleaned.select_dtypes(include=['float64', 'int64']).columns

# Plot histograms for numeric features
housing_data_cleaned[numeric_columns].hist(bins=30, figsize=(14, 10), edgecolor='black')
plt.suptitle("Histograms of Numeric Features", fontsize=16)
plt.tight_layout()
plt.show()

## Observations:
- __Longitude and Latitude__: These geographical features may show a uniform distribution or clusters based on the locations in California.
- __Housing Median Age__: Typically, this will be slightly right-skewed, with most houses being relatively new.
- __Total Rooms, Total Bedrooms, Population, Households__: These features are often heavily right-skewed, as most districts have smaller counts with a few outliers having very high values.
- __Median Income__: This may show a normal-like distribution but skewed towards higher or lower income levels depending on the area's economic conditions.
- __Median House Value__: Likely right-skewed, as most house prices cluster towards lower values with a few outliers in high-value districts.

#### Let's create some features and tidy up our data

1. Locate your NaN values and make a decision on how to handle them. Drop, fill with mean, or something else, it is entirely up to you. 

In [None]:
# Show data types, non-null counts, and memory usage (info() prints directly)
housing_data_cleaned.info()

# Count the missing values in each column
nan_summary = housing_data_cleaned.isnull().sum()
print("\nMissing Values:")
print(nan_summary)

# Handle missing values: Numeric columns
numeric_columns = housing_data_cleaned.select_dtypes(include=['float64', 'int64']).columns
housing_data_cleaned[numeric_columns] = housing_data_cleaned[numeric_columns].fillna(
    housing_data_cleaned[numeric_columns].mean()
)

# Handle missing values: Categorical columns
categorical_columns = housing_data_cleaned.select_dtypes(include=['object']).columns
if len(categorical_columns) > 0:
    housing_data_cleaned[categorical_columns] = housing_data_cleaned[categorical_columns].fillna(
        housing_data_cleaned[categorical_columns].mode().iloc[0]
    )

# Recheck for missing values to confirm handling
nan_summary_after = housing_data_cleaned.isnull().sum()

print("\nMissing Values After Handling:")
print(nan_summary_after)
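
The choice between dropping and filling can be compared on a small example. The sketch below uses a made-up two-column frame (the values are illustrative and not read from `housing.csv`) to show how `dropna()`, a mean fill, and a median fill treat the same missing entry. The median is offered here only as one possible "something else"; it tends to be less affected by extreme values in skewed columns.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with one missing entry in "total_bedrooms"
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
})

# Option 1: drop every row that contains a NaN
dropped = toy.dropna()

# Option 2: fill with the column mean (sensitive to extreme values)
mean_filled = toy.fillna(toy.mean())

# Option 3: fill with the column median (more robust for skewed columns)
median_filled = toy.fillna(toy.median())

print("Rows before/after dropping:", len(toy), len(dropped))
print("Mean fill:  ", mean_filled.loc[1, "total_bedrooms"])
print("Median fill:", median_filled.loc[1, "total_bedrooms"])
```

Dropping loses the whole district, while either fill keeps the row at the cost of inventing a value, so the right choice depends on how many rows are affected.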

2. Create three new columns by using simple arithmetic operations. Create one column with "rooms per household", one with "population per household",  and one with "bedrooms per room".

In [None]:
import numpy as np

# Create new columns based on arithmetic operations
housing_data_cleaned["rooms_per_household"] = housing_data_cleaned["total_rooms"] / housing_data_cleaned["households"]
housing_data_cleaned["population_per_household"] = housing_data_cleaned["population"] / housing_data_cleaned["households"]

housing_data_cleaned["bedrooms_per_room"] = np.where(
    housing_data_cleaned["total_rooms"] == 0,
    0,  # Assign 0 where total_rooms is 0 to avoid dividing by zero
    housing_data_cleaned["total_bedrooms"] / housing_data_cleaned["total_rooms"]
)

# Display the first few rows to verify the new columns
print("Dataset with New Columns:")
print(housing_data_cleaned[["rooms_per_household", "population_per_household", "bedrooms_per_room"]].head())

3. If you check the largest and smallest values of your "rooms per household" column you will see two outliers and two values that are just wrong. Drop the four values by index.

In [None]:
# Recreate the "rooms_per_household" column if it does not exist
if "rooms_per_household" not in housing_data_cleaned.columns:
    housing_data_cleaned["rooms_per_household"] = housing_data_cleaned["total_rooms"] / housing_data_cleaned["households"]

# Check the largest and smallest values in the "rooms_per_household" column
largest_values = housing_data_cleaned["rooms_per_household"].nlargest(2)
smallest_values = housing_data_cleaned["rooms_per_household"].nsmallest(2)

print("Largest Values (Outliers):")
print(largest_values)

print("\nSmallest Values (Outliers):")
print(smallest_values)

# Drop the rows with the identified outliers by index
outlier_indices = largest_values.index.tolist() + smallest_values.index.tolist()
housing_data_cleaned = housing_data_cleaned.drop(index=outlier_indices)

# Verify the updated dataset
print("\nUpdated Dataset after Removing Outliers:")
print(housing_data_cleaned["rooms_per_household"].describe())

## Steps Taken:
- Check for Missing Column: If "rooms_per_household" is missing, it will be recreated before proceeding.
- Identify Outliers: Used nlargest(2) and nsmallest(2) to identify extreme values.
- Drop Outliers: Combined the indices of the largest and smallest values and removed them from the dataset.

This ensures the column is always available for processing. 
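
The steps above can be tried on a toy Series first. This is a minimal sketch with invented ratios and invented index labels, only meant to show that `nlargest` and `nsmallest` return the original index labels, which is what makes dropping by index possible.

```python
import pandas as pd

# Hypothetical rooms-per-household ratios; the index labels are made up
rooms_per_household = pd.Series(
    [0.8, 1.0, 4.9, 5.2, 5.5, 6.1, 132.5, 141.9],
    index=[10, 11, 12, 13, 14, 15, 16, 17],
)

# Collect the index labels of the two largest and two smallest values
to_drop = (
    rooms_per_household.nlargest(2).index.tolist()
    + rooms_per_household.nsmallest(2).index.tolist()
)

# Drop those rows by index label
trimmed = rooms_per_household.drop(index=to_drop)

print("Dropped index labels:", sorted(to_drop))
print(trimmed.describe())
```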

# Part 2 - Exploratory Data Analysis



#### Let's find out what factors have an influence on our predicting variable

1. Let's check out the distribution of our "median house value". Visualize your results with 100 bins.

In [None]:
# Plot the distribution of "median_house_value" with 100 bins
housing_data_cleaned["median_house_value"].hist(bins=100, figsize=(10, 6), edgecolor="black")
plt.title("Distribution of Median House Value", fontsize=16)
plt.xlabel("Median House Value", fontsize=14)
plt.ylabel("Frequency", fontsize=14)
plt.grid(False)
plt.show()

The histogram shows the distribution of the median_house_value variable with 100 bins.

### Observations:
- Right Skewed: The distribution appears slightly right-skewed, with most house values clustered on the lower end of the spectrum.
- Capping: There might be a cap in the data at certain points (e.g., at the higher end), suggesting that some values could be artificially limited.

2. Check out which variables correlate the most with "median house value".

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Select only the numeric columns of the DataFrame
numeric_data = housing_data_cleaned.select_dtypes(include=["float64", "int64"])

# Compute the correlation matrix
correlation_matrix = numeric_data.corr()

# Create a heatmap to visualize the correlations
plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap="coolwarm", cbar=True)
plt.title("Correlation Heatmap", fontsize=16)
plt.show()

In [None]:
# Display the top correlations with "median_house_value"
top_correlations = correlation_matrix["median_house_value"].sort_values(ascending=False)
print("Top Correlations with Median House Value:")
print(top_correlations)

### Observations:
- Correlation Heatmap:
    - Visualizes how strongly features are correlated with one another and with the target variable (median_house_value).
    - Positive correlations (close to +1) indicate a direct relationship, while negative correlations (close to -1) show an inverse relationship.
- Top Correlations:
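    - ocean_proximity is categorical, so it does not appear in the numeric correlation matrix. Among the numeric features, median_income is typically by far the most positively correlated with median_house_value in this dataset.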
    - Features like latitude and population show a weaker, negative correlation with house prices.
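
To make the ranking logic concrete, here is a small sketch on invented data (five hypothetical districts, not rows from `housing.csv`). It shows that a text column is left out of the correlation matrix once only numeric columns are selected, and how sorting the target's column gives the ranking.

```python
import pandas as pd

# Hypothetical districts; all values are invented for illustration
toy = pd.DataFrame({
    "median_income": [1.5, 2.0, 3.5, 5.0, 8.0],
    "latitude": [38.0, 37.5, 36.0, 34.5, 34.0],
    "ocean_proximity": ["INLAND", "INLAND", "NEAR BAY", "NEAR OCEAN", "NEAR BAY"],
    "median_house_value": [90000.0, 120000.0, 210000.0, 300000.0, 450000.0],
})

# Keep numeric columns only; a text column cannot take part in a
# Pearson correlation matrix
numeric = toy.select_dtypes(include="number")

# Correlation of every numeric feature with the target, strongest first
corr_with_target = (
    numeric.corr()["median_house_value"]
    .drop("median_house_value")
    .sort_values(ascending=False)
)
print(corr_with_target)
```

Dropping the target from its own column avoids the trivial correlation of 1.0 showing up at the top of the ranking.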

3. Let's check out the distribution of the column that has the highest correlation to "median house value". Visualize your results with 100 bins.

In [None]:
# Recalculate correlations on the numeric columns only (text columns such as
# "ocean_proximity" would make DataFrame.corr() fail in recent pandas versions)
correlation_matrix = housing_data_cleaned.select_dtypes(include="number").corr()
top_correlations = correlation_matrix["median_house_value"].sort_values(ascending=False)

# Identify the column with the highest correlation to "median_house_value" (excluding itself)
highest_corr_column = top_correlations.index[1]  # Skip the target itself

# Plot the distribution of the column with the highest correlation
housing_data_cleaned[highest_corr_column].hist(bins=100, figsize=(10, 6), edgecolor="black")
plt.title(f"Distribution of {highest_corr_column}", fontsize=16)
plt.xlabel(highest_corr_column, fontsize=14)
plt.ylabel("Frequency", fontsize=14)
plt.grid(False)
plt.show()

### Observations:
- Skewness or Clustering:
    - The correlation matrix only covers numeric columns, so the selected feature is numeric (a categorical column such as ocean_proximity cannot be picked here).
    - The histogram shows whether the data is skewed, roughly normally distributed, or contains outliers.
- Data Patterns:
    - This visualization helps in understanding how the most important feature behaves and whether transformations or further cleaning are needed.

4. Visualize the "median house value" and "median income" in a jointplot (kind="reg"). What do you see?

In [None]:
# Import Seaborn for visualization
import seaborn as sns
import matplotlib.pyplot as plt

# Create a jointplot for "median_house_value" and "median_income" with a regression line
sns.jointplot(
    data=housing_data_cleaned,
    x="median_income",
    y="median_house_value",
    kind="reg",
    height=8
)

# Display the plot
plt.suptitle("Jointplot of Median House Value vs Median Income", y=1.02, fontsize=16)
plt.show()

### Observations:
- Positive Correlation:
    - The scatterplot shows a clear positive correlation between median_income and median_house_value.
    - As income increases, house value tends to increase.
- Capping:
    - There appears to be a cap on median_house_value, likely at the higher end (e.g., 500,000). This suggests the data may have a threshold or ceiling.
- Regression Line:
    - The regression line indicates a strong linear trend between the two variables.

5. Make the same visualization as above, but change the kind parameter to "kde". What extra information does this type of visualization convey that the one above does not?

In [None]:
# Create a jointplot for "median_house_value" and "median_income" with KDE visualization
import seaborn as sns
import matplotlib.pyplot as plt

sns.jointplot(
    data=housing_data_cleaned,
    x="median_income",
    y="median_house_value",
    kind="kde",
    height=8
)

# Display the plot
plt.suptitle("Jointplot of Median House Value vs Median Income (KDE)", y=1.02, fontsize=16)
plt.show()

### Extra Information from KDE Visualization:
- Density Patterns:
    - The KDE plot shows areas of high density where the data points are concentrated, which may not be as obvious in a scatterplot.
    - The darker regions indicate where most data points are clustered.
- Outlier Insights:
    - Unlike the scatterplot, KDE smooths the data, making it easier to spot clusters and less impacted by individual outliers.
- Relationship Overview:
    - The KDE gives a broader view of the relationship, highlighting trends and distributions, rather than focusing on individual data points.

#### Let's get schwifty with some EDA

1. Create a new categorical column from the "median income" with the following quantiles `[0, 0.25, 0.5, 0.75, 0.95, 1]`, label them `["Low", "Below_Average", "Above_Average", "High", "Very High"]`, and name the column "income_cat".

In [None]:
# Create a new categorical column "income_cat" based on the quartiles of "median_income"
housing_data_cleaned["income_cat"] = pd.qcut(
    housing_data_cleaned["median_income"],
    q=[0, 0.25, 0.5, 0.75, 0.95, 1],  # Define the quantile ranges
    labels=["Low", "Below_Average", "Above_Average", "High", "Very High"]  # Labels for the categories
)

# Display the first few rows of the new column to verify
print("New Column 'income_cat':")
print(housing_data_cleaned[["median_income", "income_cat"]].head())

# Check the distribution of the new "income_cat" column
print("\nDistribution of Income Categories:")
print(housing_data_cleaned["income_cat"].value_counts())

### Observations:
- Income Categories:
    - The dataset now includes an income_cat column that categorizes median_income into:
        - Low: Bottom 25%.
        - Below_Average: 25th–50th percentile.
        - Above_Average: 50th–75th percentile.
        - High: 75th–95th percentile.
        - Very High: Top 5%.
- Distribution:
    - The value counts show how many rows fall into each category.
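
The category sizes follow from the quantile edges rather than from the data itself. As a minimal sketch, assuming 100 distinct, evenly spaced hypothetical incomes, `pd.qcut` with the same edges and labels splits them into shares of 25%, 25%, 25%, 20%, and 5%.

```python
import numpy as np
import pandas as pd

# 100 distinct, evenly spaced hypothetical incomes (not real data)
incomes = pd.Series(np.linspace(0.5, 15.0, 100), name="median_income")

# Same quantile edges and labels as in the exercise
income_cat = pd.qcut(
    incomes,
    q=[0, 0.25, 0.5, 0.75, 0.95, 1],
    labels=["Low", "Below_Average", "Above_Average", "High", "Very High"],
)

# Count how many values fall into each category
counts = income_cat.value_counts(sort=False)
print(counts)
```

On real data with tied values the shares can deviate slightly from these proportions.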

2. Using the Seaborn library, plot the count of your new column and set the `hue` to "ocean_proximity". What interesting things can you see?

In [None]:
# Plot the count of the new column "income_cat" with hue set to "ocean_proximity"
plt.figure(figsize=(12, 6))
sns.countplot(
    data=housing_data_cleaned,
    x="income_cat",
    hue="ocean_proximity",
    palette="viridis"
)

# Add title and labels
plt.title("Income Categories Count by Ocean Proximity", fontsize=16)
plt.xlabel("Income Category", fontsize=14)
plt.ylabel("Count", fontsize=14)
plt.legend(title="Ocean Proximity")
plt.tight_layout()
plt.show()

### Observations:
- Distribution by Income Categories:
    - The counts follow directly from the quantile edges: "Low", "Below_Average", and "Above_Average" each hold about 25% of the districts, and "High" about 20%.
    - "Very High" holds only the top 5%, so it has by far the fewest districts.
- Effect of Ocean Proximity:
    - Near Ocean and Near Bay regions appear to have higher representation in the higher income categories.
    - Inland regions dominate the lower income categories, indicating that proximity to the ocean correlates with higher income.

3. Create two barplots where you set y="median_house_value" on both, and the x is first "income cat" and then "ocean_proximity". How do these two graphs complement what you saw in the graph in your previous question?

In [None]:
# Barplot 1: Median house value by income category
plt.figure(figsize=(15, 6))
sns.barplot(
    data=housing_data_cleaned,
    x="income_cat",
    y="median_house_value",
    hue="income_cat",  # Set hue to the same as x for consistent coloring
    dodge=False,  # Disable dodge to align bars
    palette="viridis"
)
plt.title("Median House Value by Income Category", fontsize=16)
plt.xlabel("Income Category", fontsize=14)
plt.ylabel("Median House Value", fontsize=14)
plt.legend([], [], frameon=False)  # Disable legend
plt.tight_layout()
plt.show()

print()

# Barplot 2: Median house value by ocean proximity
plt.figure(figsize=(15, 6))
sns.barplot(
    data=housing_data_cleaned,
    x="ocean_proximity",
    y="median_house_value",
    hue="ocean_proximity",  # Set hue to x for consistent coloring
    dodge=False,
    palette="viridis"
)
plt.title("Median House Value by Ocean Proximity", fontsize=16)
plt.xlabel("Ocean Proximity", fontsize=14)
plt.ylabel("Median House Value", fontsize=14)
plt.legend([], [], frameon=False)  # Disable legend
plt.tight_layout()
plt.show()

### Observations:
- Barplot 1: Median House Value by Income Category:
    - The median_house_value increases significantly as income categories rise.
    - The "Very High" income category shows the highest house prices, confirming the strong correlation between income and house prices.
- Barplot 2: Median House Value by Ocean Proximity:
    - House values are highest in districts close to the water (for example NEAR BAY and NEAR OCEAN).
    - Inland areas show the lowest median house values, reinforcing the impact of proximity to the ocean on house prices.

4. Create a pivoted dataframe where you have the values of the "income cat" column as indices and the values of the "ocean_proximity" column as columns. Also drop the "ISLAND" column that you'll get.

In [None]:
# Create a pivot table using "income_cat" as index and "ocean_proximity" as columns
pivot_table = housing_data_cleaned.pivot_table(
    index="income_cat",
    columns="ocean_proximity",
    values="median_house_value",
    aggfunc="mean",
    observed=False  # Explicitly specify observed=False to match current behavior
)

# Drop the "ISLAND" column if it exists
if "ISLAND" in pivot_table.columns:
    pivot_table = pivot_table.drop(columns=["ISLAND"])

# Display the pivoted dataframe
print("Pivoted DataFrame (Mean Median House Value):")
print(pivot_table)

### Observations:
- Income Categories: Rows represent the different income_cat categories (Low, Below_Average, etc.).
- Ocean Proximity: Columns represent proximity to the ocean (e.g., Near Ocean, Inland).
- Values: The table shows the average median_house_value for each income category and ocean proximity.
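
How each cell of the pivot is computed can be checked by hand on a tiny frame. The sketch below uses five invented rows; each cell is the mean of `median_house_value` over the rows that share that income category and ocean proximity.

```python
import pandas as pd

# Five hypothetical districts; values are invented for illustration
toy = pd.DataFrame({
    "income_cat": ["Low", "Low", "High", "High", "High"],
    "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "NEAR BAY", "NEAR BAY"],
    "median_house_value": [100000.0, 150000.0, 200000.0, 300000.0, 400000.0],
})

# Rows become income categories, columns become proximity classes,
# and each cell holds the mean house value of the matching rows
pivot = toy.pivot_table(
    index="income_cat",
    columns="ocean_proximity",
    values="median_house_value",
    aggfunc="mean",
)
print(pivot)
```

The ("High", "NEAR BAY") cell averages the two matching rows, 300000 and 400000.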

5. Turn your pivoted dataframe into a heatmap. The heatmap should have annotations in integer format.

In [None]:
# Import Seaborn for heatmap visualization
import seaborn as sns
import matplotlib.pyplot as plt

# Create a heatmap from the pivot table
plt.figure(figsize=(12, 8))
sns.heatmap(
    pivot_table,
    annot=True,  # Add annotations
    fmt=".0f",  # Format annotations as integers
    cmap="coolwarm",  # Use a diverging color palette
    cbar=True  # Show color bar
)

# Add title and axis labels
plt.title("Heatmap of Median House Value by Income Category and Ocean Proximity", fontsize=16)
plt.xlabel("Ocean Proximity", fontsize=14)
plt.ylabel("Income Category", fontsize=14)
plt.tight_layout()
plt.show()

### Observations:
- Color Gradient:
    - The heatmap uses the coolwarm palette to show how the median_house_value changes across income_cat and ocean_proximity.
    - Higher values are represented by warmer colors (reds), and lower values by cooler colors (blues).
- Annotations:
    - The exact average house value for each combination is displayed, rounded to the nearest integer.
- Insights:
    - House prices increase significantly with higher income categories.
    - Regions closer to the ocean consistently show higher house prices.

# Part 3 - Preparing your Data



#### Splitting, Preparing and Engineering some Features

1. Let's drop the "income_cat" column as it has already served its purpose. We don't need it for our model because we already have "median income".
Not dropping "income_cat" would lead to multicollinearity.

In [None]:
# Drop the helper column; errors="ignore" keeps the cell safe to re-run
housing_data_cleaned = housing_data_cleaned.drop(columns=["income_cat"], errors="ignore")

# Confirm that "income_cat" is no longer among the columns
print(housing_data_cleaned.columns)

2. Select your floating point columns and standardize your data by calculating the Z-score. You can apply the `stats.zscore()` method in a lambda function. Save your results to a variable called `z_scored`. 

In [None]:
import scipy.stats as stats
from scipy.stats import zscore

In [None]:
# Select floating-point columns
float_columns = housing_data_cleaned.select_dtypes(include=['float64']).columns

# Standardize the data by calculating the Z-score for each numeric column
z_scored = housing_data_cleaned[float_columns].apply(zscore, nan_policy='omit')

# Display the first few rows of the standardized data
print("Z-Scored Data:")
print(z_scored.head())

### Observations:
- The floating-point columns have been standardized using Z-scores.
- Z-scores represent the number of standard deviations a data point is from the mean.
- This process ensures that all floating-point features are on the same scale.
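
The Z-score itself is just `(x - mean) / std`. Here is a minimal sketch on invented numbers that computes it by hand with pandas. One detail worth knowing: `scipy.stats.zscore` uses the population standard deviation (`ddof=0`) by default, while pandas' `.std()` defaults to `ddof=1`, so `ddof=0` is passed explicitly to match.

```python
import numpy as np
import pandas as pd

# Hypothetical values for two float columns
toy = pd.DataFrame({
    "median_income": [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0],
    "housing_median_age": [10.0, 20.0, 30.0, 40.0, 50.0, 20.0, 30.0, 40.0],
})

# z = (x - mean) / std, using the population standard deviation (ddof=0)
z_manual = (toy - toy.mean()) / toy.std(ddof=0)

print(z_manual.round(2))
print("Column means:", z_manual.mean().round(6).tolist())
print("Column stds: ", z_manual.std(ddof=0).round(6).tolist())
```

After standardizing, every column has mean 0 and standard deviation 1, which is what puts the features on the same scale.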

3. Turn the only categorical column into dummies. Be wary of the dummy trap, to avoid multicollinearity.

In [None]:
# Turn the categorical column "ocean_proximity" into dummy variables while avoiding the dummy trap
housing_data_encoded = pd.get_dummies(
    housing_data_cleaned,
    columns=["ocean_proximity"],  # Specify the categorical column to encode
    drop_first=True  # Avoid the dummy trap by dropping the first category
)

# Display the first few rows of the encoded dataset
print("Dataset with Dummy Variables:")
print(housing_data_encoded.head())

# Verify the new columns
print("\nColumns after encoding:")
print(housing_data_encoded.columns)
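
Why `drop_first=True` avoids the dummy trap can be seen on a toy column. In this sketch (invented rows, with `dtype=int` chosen only to make the sums easy to read), the full set of dummies always sums to 1 per row, which duplicates the information carried by the model's intercept; dropping one category removes that redundancy.

```python
import pandas as pd

# Hypothetical categorical column with three classes
toy = pd.DataFrame({
    "ocean_proximity": ["INLAND", "NEAR BAY", "NEAR OCEAN", "INLAND"],
})

# All categories kept: one column per class
all_dummies = pd.get_dummies(toy, columns=["ocean_proximity"], dtype=int)

# First category (alphabetically "INLAND") dropped; it becomes the baseline
reduced = pd.get_dummies(toy, columns=["ocean_proximity"], drop_first=True, dtype=int)

# With every category kept, each row sums to exactly 1
print(all_dummies.sum(axis=1).tolist())
print(reduced.columns.tolist())
```

A row with zeros in every remaining dummy column then stands for the dropped baseline category.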

4. Save our predicting variable to `y`.

In [None]:
# Save the predicting variable (target) to 'y'
y = housing_data_encoded["median_house_value"]

# Display the first few rows of 'y' to verify
print("Predicting Variable (y):")
print(y.head())

5. Concatenate `z_scored` and `dummies` and drop the predicting variable. Save to the variable `X`.

In [None]:
# Make sure 'z_scored' is a DataFrame before concatenating
if not isinstance(z_scored, pd.DataFrame):
    z_scored = pd.DataFrame(z_scored, columns=["z_scored"], index=housing_data_encoded.index)

# Keep only the dummy columns created from "ocean_proximity"
dummies = housing_data_encoded.filter(like="ocean_proximity_")

# Concatenate z_scored with the dummies, then drop the target so that its
# z-scored copy cannot leak into the feature matrix
X = pd.concat([z_scored, dummies], axis=1).drop(columns=["median_house_value"])

# Show the first rows of the feature matrix (X) to verify
print("Feature Matrix (X):")
print(X.head())

# Part 4 - Machine Learning 




#### Train, Test, Split

1. Import `train_test_split` and split your data accordingly. Choose an appropriate test size.

In [None]:
# Import train_test_split from sklearn
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
# Set test_size to 20% of the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display the shapes of the resulting datasets
print("Shapes of Training and Testing Sets:")
print(f"X_train: {X_train.shape}, X_test: {X_test.shape}")
print(f"y_train: {y_train.shape}, y_test: {y_test.shape}")

#### Building and Training our Model

2. Build, fit and train a `LinearRegression` model. 

In [None]:
# Rebuild X from the z-scored features and the "ocean_proximity" dummies,
# dropping the target so it cannot leak into the features
dummies = housing_data_encoded.filter(like="ocean_proximity_")
X = pd.concat([z_scored, dummies], axis=1).drop(columns=["median_house_value"])

# Check for non-numeric columns (boolean dummies count as numeric here)
non_numeric_columns = X.select_dtypes(exclude=["number", "bool"]).columns
print("Non-numeric columns:", non_numeric_columns)

# Drop non-numeric columns if any exist
if len(non_numeric_columns) > 0:
    X = X.drop(columns=non_numeric_columns)

# Split the cleaned data into training and testing sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Initialize and train the Linear Regression model
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)

# Display model coefficients and intercept
print("Model Coefficients:")
print(model.coef_)
print("\nModel Intercept:")
print(model.intercept_)

3. In a scatterplot, visualize the y_train on your x-axis and your predictions on the y-axis. How do your training predictions look?

In [None]:
# Generate predictions for the training set
y_train_predictions = model.predict(X_train)

# Create a scatterplot to visualize y_train vs. predictions
plt.figure(figsize=(10, 6))
plt.scatter(y_train, y_train_predictions, alpha=0.5)
plt.title("Scatterplot of Actual vs. Predicted (Training Set)", fontsize=16)
plt.xlabel("Actual Values (y_train)", fontsize=14)
plt.ylabel("Predicted Values", fontsize=14)
plt.grid(True)
plt.tight_layout()
plt.show()

4. From the sklearn metrics module, print the mean_squared_error and R^2-score. What do the metrics tell us?

In [None]:
from sklearn import metrics

In [None]:
# Ensure data is split and prepared
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Re-split the dataset if needed
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Re-train the model if needed
model = LinearRegression()
model.fit(X_train, y_train)

# Generate predictions for the training set
y_train_predictions = model.predict(X_train)

# Calculate metrics for the training set
mse = mean_squared_error(y_train, y_train_predictions)
r2 = r2_score(y_train, y_train_predictions)

# Print the metrics
print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"R^2 Score: {r2:.2f}")

### Metrics Interpretation:
- MSE (Mean Squared Error):
    - Indicates the average squared difference between the predicted and actual values.
    - Lower values indicate better fit.
- R2 Score:
    - Explains the proportion of variance in the dependent variable that is predictable from the independent variables.
    - Values close to 1 indicate a good fit.
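
Both metrics can be computed by hand, which may help when interpreting them. The sketch below uses four invented actual/predicted pairs and plain NumPy; `mean_squared_error` and `r2_score` from scikit-learn should return the same numbers for the same inputs.

```python
import numpy as np

# Hypothetical actual and predicted values
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 320.0, 380.0])

# MSE: mean of the squared residuals
residuals = y_true - y_pred
mse = np.mean(residuals ** 2)

# R^2: 1 minus (residual sum of squares / total sum of squares)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"MSE: {mse:.2f}")
print(f"R^2: {r2:.2f}")
```

Note that the MSE is expressed in squared units of the target (squared dollars for house values), which is why it looks very large on this dataset.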

#### Final Predictions

1. Now you are ready to make predictions on the test data. Do that and visualize your results in a new scatterplot.

In [None]:
# Ensure data is split and prepared
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

# Split the dataset again if needed
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Generate predictions for the test set
y_test_predictions = model.predict(X_test)

# Create a scatterplot to visualize y_test vs. predictions
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_test_predictions, alpha=0.5)
plt.title("Scatterplot of Actual vs. Predicted (Test Set)", fontsize=16)
plt.xlabel("Actual Values (y_test)", fontsize=14)
plt.ylabel("Predicted Values", fontsize=14)
plt.grid(True)
plt.tight_layout()
plt.show()

2. Print the mean_squared_error and R^2-score again. What has happened?

In [None]:
# Ensure necessary imports
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt

# Split the data again
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Generate predictions for the test set
y_test_predictions = model.predict(X_test)

# Calculate metrics for the test set
mse_test = mean_squared_error(y_test, y_test_predictions)
r2_test = r2_score(y_test, y_test_predictions)

# Print the metrics
print(f"Mean Squared Error (MSE) on Test Set: {mse_test:.2f}")
print(f"R^2 Score on Test Set: {r2_test:.2f}")

### Explanation:
- Dataset Split: Ensures X_train, X_test, y_train, and y_test are properly defined.
- Model Training: Fits the Linear Regression model on the training set.
- Metrics Calculation: Computes the MSE and R² score on the test set.

3. There is another metric called root mean squared error (RMSE), which is the square root of the MSE. Calculate the RMSE.

In [None]:
# Use the updated method for RMSE calculation to avoid deprecation warnings
import numpy as np
from sklearn.metrics import mean_squared_error

# Calculate RMSE directly using numpy's sqrt on MSE
rmse_test = np.sqrt(mean_squared_error(y_test, y_test_predictions))

# Print RMSE
print(f"Root Mean Squared Error (RMSE) on Test Set: {rmse_test:.2f}")

# Bonus Questions 1

1. Create a dataframe with two columns, one consisting of the y_test and one of your model's predictions.

In [None]:
# Create a DataFrame with actual and predicted values for the test set
import pandas as pd

results_df = pd.DataFrame({
    "Actual Values": y_test,
    "Predicted Values": y_test_predictions
})

# Display the first few rows of the DataFrame
print(results_df.head())

2. Make a series from your new dataframe by calculating the prediction error in absolute numbers. Save this series to the variable name `absolute_errors`.

In [None]:
# Ensure necessary imports
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

# Ensure X and y are defined
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Generate predictions for the test set
y_test_predictions = model.predict(X_test)

# Create a DataFrame with actual and predicted values
results_df = pd.DataFrame({
    "Actual Values": y_test,
    "Predicted Values": y_test_predictions
})

# Absolute prediction error for each test district
absolute_errors = (results_df["Actual Values"] - results_df["Predicted Values"]).abs()

# Display the first few absolute errors
print("Absolute Errors:")
print(absolute_errors.head())

3. If you take the mean of your series, you will get the mean absolute error (MAE), which is another metric for linear regression.
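
As a minimal sketch of that last step, the example below builds a small frame of invented actual and predicted values, takes the mean of the absolute errors, and compares it with the RMSE. The column names mirror the ones used in this notebook, but the numbers are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical actual and predicted values; the last prediction is far off
results = pd.DataFrame({
    "Actual Values": [100.0, 200.0, 300.0, 400.0],
    "Predicted Values": [110.0, 190.0, 320.0, 340.0],
})

# Absolute error per row, then its mean (the MAE)
errors = results["Actual Values"] - results["Predicted Values"]
absolute_errors = errors.abs()
mae = absolute_errors.mean()

# RMSE for comparison: squaring weights the large miss more heavily
rmse = np.sqrt((errors ** 2).mean())

print(f"MAE:  {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
```

The MAE stays in the units of the target and treats all misses linearly, whereas the RMSE grows faster when a few predictions are far off.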

# Bonus Question 2 - Build a Random Forest Regressor

1. Build, fit and train a `RandomForestRegressor` model. Do this by following the same steps that you followed when building your `LinearRegression`.

In [None]:
from sklearn.ensemble import RandomForestRegressor

In [None]:
from sklearn.metrics import mean_squared_error, r2_score

# Reuse the feature matrix X and target y prepared in Part 3
# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Build and fit the RandomForestRegressor model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Make predictions on the test data
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Mean Squared Error: {mse}')
print(f'R² Score: {r2}')

2. Make predictions on the test data and evaluate your results.