## Notebook setup

### Import libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import make_friedman1
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error, root_mean_squared_error, r2_score

## Exercise 1: Generate and explore the dataset

**Tasks**:

1. Use `make_friedman1()` to generate a regression dataset with 1000 samples and 5 features. Set `random_state=315` for reproducibility. Store the features in `X` and the target in `y`.

2. Convert `X` and `y` to a pandas DataFrame and Series respectively. Name the features `feature_0`, `feature_1`, etc., and name the target `target`.

3. Display basic information about the dataset:
   - Shape of features and target
   - First few rows
   - Summary statistics

4. Create a figure with 5 subplots (one for each feature) showing scatter plots of each feature vs. the target.

**Hints**:

- `make_friedman1()` returns the features and target as NumPy arrays
  - Example: `X, y = make_friedman1(n_samples=100, n_features=5, random_state=315)` (this example uses 100 samples; set `n_samples` to the value the task asks for)

- To convert to pandas:
  - `X_df = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(X.shape[1])])`
  - `y_series = pd.Series(y, name='target')`

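- The Friedman #1 regression problem generates output using the formula:
  - y = 10 * sin(π * X₀ * X₁) + 20 * (X₂ - 0.5)² + 10 * X₃ + 5 * X₄ + noise
  - The relationship is **non-linear** in X₀, X₁, and X₂, and only the first 5 features affect the target; any additional features requested through `n_features` are independent of `y`
  - The noise term is scaled by the `noise` parameter, which defaults to `0.0`, so a call that does not set `noise` produces a noise-free target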
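
One way to convince yourself of the formula is to rebuild the target by hand and compare it with what the generator returns. The sketch below is illustrative only: it uses a small demo dataset (the names `X_demo`, `y_demo`, and `y_formula` are made up for this example) and relies on `noise` being left at its default of `0.0`, so the match should be exact up to floating-point error.

```python
import numpy as np
from sklearn.datasets import make_friedman1

# Small demo dataset; noise is left at its default of 0.0
X_demo, y_demo = make_friedman1(n_samples=100, n_features=5, random_state=315)

# Rebuild the target by hand from the Friedman #1 formula
y_formula = (
    10 * np.sin(np.pi * X_demo[:, 0] * X_demo[:, 1])
    + 20 * (X_demo[:, 2] - 0.5) ** 2
    + 10 * X_demo[:, 3]
    + 5 * X_demo[:, 4]
)

# With no noise, the generated target and the hand-built one should agree
formula_matches = np.allclose(y_demo, y_formula)
print(X_demo.shape, y_demo.shape, formula_matches)
```

If you pass a non-zero `noise` value, the two arrays will no longer match exactly, because random Gaussian noise is added on top of the formula.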

In [None]:
# Your code here

## Exercise 2: Train and evaluate models

**Tasks**:

1. Split the data into training and testing sets using an 80-20 split. Use `random_state=315`.

2. Train a `LinearRegression` model on the training data.

3. Train a `DecisionTreeRegressor` model on the training data. Use `random_state=315`.

4. For both models, calculate:
   - Training RMSE
   - Testing RMSE
   - Training R² score
   - Testing R² score

5. Print a comparison table showing these metrics for both models.

6. Create a figure with 2 subplots:
   - Left: Scatter plot of true vs. predicted values for linear regression
   - Right: Scatter plot of true vs. predicted values for decision tree
   - Add a diagonal reference line (y=x) to each plot

**Hints**:

- Use `train_test_split()` with `test_size=0.2`
  - Example: `X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=315)`

- To calculate metrics:
  - First make predictions: `y_pred = model.predict(X_test)`
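  - Then calculate: `rmse = root_mean_squared_error(y_test, y_pred)` and `r2 = r2_score(y_test, y_pred)`
  - For the training metrics, repeat both steps with `X_train` and `y_train`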

- To add a reference line to a plot:
  - `lims = [y_test.min(), y_test.max()]`
  - `plt.plot(lims, lims, 'k--', alpha=0.3)`
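
The metric and reference-line hints can be tried out on a handful of made-up numbers before applying them to the real models. This is a minimal sketch, not the exercise solution: `y_true_demo` and `y_pred_demo` are hypothetical stand-ins for `y_test` and `model.predict(X_test)`. It computes RMSE as the square root of `mean_squared_error`, which gives the same value as `root_mean_squared_error` for a single target and also runs on scikit-learn versions before 1.4.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true and predicted values, standing in for y_test and model.predict(X_test)
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.5, 5.5, 6.0, 9.0])

# RMSE computed as the square root of the mean squared error
rmse_demo = np.sqrt(mean_squared_error(y_true_demo, y_pred_demo))
r2_demo = r2_score(y_true_demo, y_pred_demo)
print(f'RMSE: {rmse_demo:.3f}, R²: {r2_demo:.3f}')

# Limits that cover both the true and the predicted values
lims = [
    min(y_true_demo.min(), y_pred_demo.min()),
    max(y_true_demo.max(), y_pred_demo.max()),
]

fig, ax = plt.subplots()
ax.scatter(y_true_demo, y_pred_demo)
ax.plot(lims, lims, 'k--', alpha=0.3)  # y = x reference line
ax.set_xlabel('True values')
ax.set_ylabel('Predicted values')
```

Points on the dashed line are perfect predictions; the further a point sits from the line, the larger that sample's error.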

In [None]:
# Your code here

## Exercise 3: Investigate model performance with cross-validation

**Tasks**:

1. Use 5-fold cross-validation to evaluate both models on the full dataset. Use `neg_root_mean_squared_error` as the scoring metric.

2. Create a boxplot comparing the cross-validation RMSE scores for both models.

3. Calculate and print the mean and standard deviation of the cross-validation scores for each model.

4. Answer the following questions:
   - Which model performs better overall?
   - Is the difference in mean RMSE large compared with the spread of scores across folds?
   - Does either model show signs of overfitting? (Hint: compare training vs. testing performance from Exercise 2)

**Hints**:

- Use `cross_val_score()` with `cv=5` and `scoring='neg_root_mean_squared_error'`
  - Example: `scores = cross_val_score(model, X, y, cv=5, scoring='neg_root_mean_squared_error')`
  - Note: scikit-learn negates RMSE so that a higher score is always better; multiply the scores by -1 to recover positive RMSE values

- To create a boxplot of multiple datasets:
  - `plt.boxplot([scores1, scores2], tick_labels=['Model 1', 'Model 2'])` (Matplotlib 3.9 and later; on older versions use `labels=` instead of `tick_labels=`)
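
The cross-validation pattern can be sketched on a separate demo problem. Everything here is an illustrative assumption rather than the exercise solution: the data is a smaller dataset with a different seed, and the two models are shallow decision trees chosen only to have something to compare. The tick labels are set with `set_xticklabels()` so the plot does not depend on which keyword the installed Matplotlib version expects in `boxplot()`.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_friedman1
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Demo data only: a smaller dataset with a different seed than the exercise
X_demo, y_demo = make_friedman1(n_samples=200, n_features=5, random_state=0)

# Two illustrative models (hypothetical choices, not the ones the exercise asks for)
demo_models = {
    'Tree (depth 2)': DecisionTreeRegressor(max_depth=2, random_state=0),
    'Tree (depth 5)': DecisionTreeRegressor(max_depth=5, random_state=0),
}

cv_rmse = {}
for name, model in demo_models.items():
    neg_scores = cross_val_score(
        model, X_demo, y_demo, cv=5, scoring='neg_root_mean_squared_error'
    )
    cv_rmse[name] = -neg_scores  # flip the sign to get positive RMSE values
    print(f'{name}: mean RMSE = {cv_rmse[name].mean():.3f}, std = {cv_rmse[name].std():.3f}')

# One box per model; boxes are drawn at x positions 1, 2, ...
fig, ax = plt.subplots()
ax.boxplot(list(cv_rmse.values()))
ax.set_xticks([1, 2])
ax.set_xticklabels(list(cv_rmse.keys()))
ax.set_ylabel('RMSE')
```

Each box summarizes five RMSE values, one per fold, so a box that is both lower and narrower indicates a model that is more accurate and more stable across folds.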

In [None]:
# Your code here

## Exercise 4: Investigate why the models perform differently

**Tasks**:

1. Create visualizations to understand the relationship between features and target:
   - For features 0 and 1: Create a 2D scatter plot colored by the target value (use a colormap)
   - For features 2, 3, and 4: Create individual scatter plots vs. target

2. Based on the Friedman #1 formula (y = 10 * sin(π * X₀ * X₁) + 20 * (X₂ - 0.5)² + 10 * X₃ + 5 * X₄ + noise):
   - Identify which relationships are linear
   - Identify which relationships are non-linear
   - Explain how this affects each model's performance

3. Create residual plots for both models (residuals on the y-axis vs. predicted values on the x-axis):
   - What patterns do you see in the linear regression residuals?
   - What patterns do you see in the decision tree residuals?
   - What do these patterns tell you about each model's ability to capture the underlying relationships?

4. (Optional) Try to improve the linear regression model by adding polynomial features for the non-linear relationships. Does this improve performance?

**Hints**:

- For a 2D scatter plot with color mapping:
  - `plt.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis')`
  - `plt.colorbar(label='Target')`

- Residuals are calculated as: `residuals = y_true - y_predicted`

- The decision tree can capture non-linear relationships by splitting the feature space, while linear regression assumes linear relationships
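
To see what "assumes linear relationships" means for the residuals, it can help to look at a one-feature toy problem first. The sketch below is a simplified illustration under stated assumptions: the data (`x_toy`, `y_toy`) is made up, noise-free, and has the same quadratic shape as the X₂ term, and `PolynomialFeatures` with `make_pipeline` is just one possible way to add a squared term. Neither is imported in the setup cell, so the block imports them itself.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Toy data (hypothetical): one feature with a quadratic relationship and no noise
x_toy = np.linspace(0, 1, 50).reshape(-1, 1)
y_toy = 20 * (x_toy[:, 0] - 0.5) ** 2

# A straight line cannot follow the curve, so the residuals keep its U shape
linear_model = LinearRegression().fit(x_toy, y_toy)
linear_residuals = y_toy - linear_model.predict(x_toy)

# Adding a squared feature lets the same linear model fit the curve
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(x_toy, y_toy)
poly_residuals = y_toy - poly_model.predict(x_toy)

print(f'Max |residual|, linear:     {np.abs(linear_residuals).max():.3f}')
print(f'Max |residual|, polynomial: {np.abs(poly_residuals).max():.6f}')
```

The linear residuals are positive at both ends of the range and negative in the middle, which is the kind of systematic pattern to look for in a residual plot. Note that a squared term alone would not capture the sin(π * X₀ * X₁) interaction in the full dataset.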

In [None]:
# Your code here

## Reflection

Based on your analysis, answer the following questions:

1. **Model performance**: Which model performed better and why?

2. **Linear assumptions**: What happens when you apply linear regression to non-linear data?

3. **Model complexity**: What are the trade-offs between simpler models (linear regression) and more complex models (decision trees)?

4. **Real-world implications**: In what situations would you prefer:
   - A linear regression model?
   - A decision tree model?
   - Consider factors like interpretability, performance, and data characteristics.