# Lab 3: Linear Regression

In this lab, we will walk through the basic implementation of several regression methods. First, we will cover Simple Linear Regression, for when we have one predictor (X). Next, we will cover Multiple Linear Regression, for when we have multiple predictors. We will then cover two regularization methods.

## Part 1: Simple Linear Regression

In Part 1, we will perform Simple Linear Regression using Python.
We will use the `pandas`, `numpy`, `scipy`, and `scikit-learn` libraries for data manipulation and modeling. We will also visualize the results using `matplotlib`.

### Step 1: Import Libraries

First, we need to import the necessary libraries. Run the following code:

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from scipy import stats
```

#### <font color='red'>**TRY IT**</font> &#x1f9e0;: Get everything imported!

### Step 2: Load the Dataset
Next, we will load a dataset. For this tutorial, we will use some simulated house pricing data.

```python
# Manually creating the synthetic dataset
data = {
    'SquareFootage': [3532, 3407, 2453, 1635, 1563, 2531, 1833, 1077, 2578, 2628,
                      3447, 3942, 3448, 3162, 1505, 3399, 2935, 3022, 3697, 2501],
    'NumBedrooms': [2, 1, 2, 5, 4, 1, 4, 1, 3, 4,
                    1, 2, 4, 4, 4, 1, 2, 2, 2, 1],
    'AgeOfHouse': [31, 10, 23, 35, 11, 28, 34, 0, 0, 36,
                   5, 38, 40, 17, 15, 4, 41, 42, 31, 1],
    'Price': [838560.919253, 804484.724846, 563445.633404, 432226.399244, 408372.745913,
              548757.706247, 440905.015175, 248148.490314, 625590.329047, 644081.556976,
              817637.919091, 953896.908492, 846710.103200, 799152.331779, 385981.010762,
              806350.318599, 659057.167194, 687884.192489, 886352.154312, 563846.859079]
}

# Create the DataFrame
df = pd.DataFrame(data)

# Display the DataFrame
print(df)
```

#### <font color='red'>**TRY IT**</font> &#x1f9e0;: Create this dataset!

### Step 3: Split the Data
Now we will split the data into training and testing sets. Let's select `SquareFootage` as our single predictor. We want to keep 80% of the data aside for training and leave the remaining 20% for testing.

```python
# split the data into test and train sets
X_train, X_test, y_train, y_test = train_test_split(df[['XXX']], df['YYY'], test_size=XXX, random_state=26)
```
#### <font color='red'>**TRY IT**</font> &#x1f9e0;: Split the data!
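
As a rough illustration of what `train_test_split` returns, the sketch below splits a small toy array rather than the housing data. The names `X_toy`, `y_toy`, and the 25% test fraction are assumptions chosen only for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical, not the housing dataset): 20 rows, one feature column
X_toy = np.arange(20).reshape(-1, 1)
y_toy = 3.0 * X_toy.ravel() + 5.0

# test_size is the fraction held out for testing; random_state fixes the shuffle
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0
)

# With 20 rows and test_size=0.25, 15 rows go to training and 5 to testing
print(X_tr.shape, X_te.shape)
```

The features must be two-dimensional (rows by columns), which is why the lab selects the predictor with double brackets.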

### Step 4: Visualize the Data
Let's visualize the data to understand the relationship between the feature and the target variable.

```python
# visualize our data
plt.scatter(XXXX, YYYY, color='blue')
plt.title('Scatter Plot of Feature vs Target')
plt.xlabel('Feature')
plt.ylabel('Target')
plt.grid()
plt.show()
```

#### <font color='red'>**TRY IT**</font> &#x1f9e0;:  Visualize our data. Can you make an estimate for the intercept and slope just by looking at it?

### Step 5: Create and Train the Model
Now, we'll create a linear regression model and fit it to our training data. Documentation for LinearRegression can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html).

```python
# specify the model
model = LinearRegression()

# fit the model
model.fit(XXX, XXX)

# print the slope and intercept coefficients
beta1 = model.XXX[0] # we index with 0 because the coefficients are returned as a NumPy array with one entry per predictor
beta0 = model.XXX
print(f'Coefficient: {beta1}')
print(f'Intercept: {beta0}')
```
#### <font color='red'>**TRY IT**</font> &#x1f9e0;: The time has come, train your first model!
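
To see what a fitted model exposes, here is a minimal sketch on noise-free toy data generated from y = 2x + 1, so the recovered slope and intercept are known in advance. The names `x_toy`, `y_toy`, and `toy_model` are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data (hypothetical): points that lie exactly on y = 2x + 1
x_toy = np.array([[1.0], [2.0], [3.0], [4.0]])
y_toy = 2.0 * x_toy.ravel() + 1.0

# Fit ordinary least squares
toy_model = LinearRegression().fit(x_toy, y_toy)

# coef_ is an array with one entry per feature; intercept_ is a single number
print(toy_model.coef_)
print(toy_model.intercept_)
```

Because the toy points are collinear, the fit should recover a slope of about 2 and an intercept of about 1, up to floating-point error.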

### Step 6: Visualize the Regression Line
After training the model, we can make predictions on the training data to plot the line.

```python
# Make predictions
y_pred = model.XXX(XXX)

# Plotting
plt.figure(figsize=(10, 6))

# Scatter plot of actual data
plt.scatter(df['XXX'], df['YYY'], color='blue', label='Actual Data')

# Plotting the regression line
plt.plot(X_train, y_pred, color='orange', linewidth=2, label='Regression Line')

# Add titles and labels
plt.title('Linear Regression Model')
plt.xlabel('Feature')
plt.ylabel('Target')
plt.legend()
plt.grid()
plt.show()
```

#### <font color='red'>**TRY IT**</font> &#x1f9e0;: Use the code above to plot the data points and the regression line for the training data (we have not moved on to test yet!).


### Step 7: Evaluate Accuracy of Coefficients
Now, we will evaluate the accuracy of the coefficients using standard errors.

```python
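# extract the training columns into arrays for easier processing
# (the model was fit on the training set, so the standard errors must use the same rows)
X = X_train['XXX'].values
y = y_train.values

# get number of training observations
n = len(y)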

# Calculate the means
X_mean = np.mean(X)
y_mean = np.mean(y)

# Calculate residuals
residuals = y_train - y_pred

# Calculate the residual standard error (RSE)
rse = XXX

# Calculate the SE for each beta
se_beta1 = rse / np.sqrt(np.sum((X - X_mean) ** 2))
se_beta0 = rse * np.sqrt((1/n) + (X_mean**2 / np.sum((X - X_mean) ** 2)))

# Print the coefficients and their standard errors
beta1 = model.coef_[0]
beta0 = model.intercept_

# Print the results
print(f"Beta0 (Intercept): {beta0}, Standard Error: {se_beta0}")
print(f"Beta1 (Slope): {beta1}, Standard Error: {se_beta1}")
```

#### <font color='red'>**TRY IT**</font> &#x1f9e0;: Evaluate the model you just trained. Are the coefficients reasonable estimates?

**Interpretation of Beta1 and its SE**: For each additional square foot, the predicted house price increases by about `Beta1` dollars, on average (with a single predictor, there is nothing else to hold constant). A standard error of about `se_beta1` dollars means that if we repeated this study many times with different samples of houses, the estimated price increase per square foot would typically differ from the true value by about `se_beta1` dollars.

**Interpretation of Beta0 and its SE:** `Beta0` represents the predicted house price when square footage is zero. While this value does not have a meaningful real-world interpretation (since a house can't have zero square feet!), it serves as the baseline that positions the regression line. A standard error of about `se_beta0` dollars means that if we repeated this study many times with different samples of houses, the estimated baseline price would typically vary by about `se_beta0`.
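
One way to sanity-check hand-computed standard errors is to compare them with `scipy.stats.linregress`, which reports `stderr` (slope) and `intercept_stderr` (intercept) in recent SciPy versions. The sketch below does this on a small toy dataset. The data and the variable names are assumptions for illustration, not part of the lab's dataset.

```python
import numpy as np
from scipy import stats

# Toy data (hypothetical): roughly linear with a little noise
x_toy = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y_toy = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2])
n_toy = len(x_toy)

# Reference fit from SciPy
fit = stats.linregress(x_toy, y_toy)

# Residual standard error with n - 2 degrees of freedom
resid = y_toy - (fit.intercept + fit.slope * x_toy)
rse_toy = np.sqrt(np.sum(resid ** 2) / (n_toy - 2))

# Standard errors using the same formulas as in the lab code
sxx = np.sum((x_toy - x_toy.mean()) ** 2)
se_slope = rse_toy / np.sqrt(sxx)
se_intercept = rse_toy * np.sqrt(1 / n_toy + x_toy.mean() ** 2 / sxx)

print(se_slope, fit.stderr)
print(se_intercept, fit.intercept_stderr)
```

Each printed pair should agree up to floating-point error.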



Next, we can calculate the **t-statistic** for `Beta1` to help us determine if there truly is a relationship between our predictor and target variable.


```python
# Calculate t-statistic for Beta1
t_stat = XXX / XXX

# Calculate the p-value (two-tailed)
p_value = 2 * (1 - stats.t.cdf(np.abs(t_stat), n-2))

# display results
print(f"t-statistic for Beta1: {t_stat}, p-value: {p_value}")
```

#### <font color='red'>**TRY IT**</font> &#x1f9e0;: Is there truly a relationship between our predictor and target?
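
As a side note on the p-value calculation, `stats.t.sf` (the survival function, equal to 1 minus the CDF) gives the same tail probability and tends to be more accurate when the t-statistic is very large. The sketch below compares the two forms. The t-statistic of 2.5 and the 18 degrees of freedom are made-up values for illustration.

```python
from scipy import stats

# Hypothetical values, chosen only to illustrate the calculation
t_example = 2.5
dof = 18

# Two-tailed p-value written as 1 - CDF
p_cdf = 2 * (1 - stats.t.cdf(abs(t_example), dof))

# Same quantity via the survival function
p_sf = 2 * stats.t.sf(abs(t_example), dof)

print(p_cdf, p_sf)
```

A p-value below the chosen significance level (commonly 0.05) is evidence against the null hypothesis that the slope is zero.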

### Step 8: Evaluate the Model

**Residual Standard Error (RSE)**: We will calculate the Residual Standard Error to give us an idea of the average distance that the observed values fall from the predicted values.

```python
# get the training set residuals and calculate RSE
residuals = y_train - y_pred
rse = np.sqrt(np.sum(residuals**2) / (len(y_train) - 2))  # n - 2 degrees of freedom with one predictor
print(f'Residual Standard Error (RSE): {rse:.2f}')
```

**R-squared Value**: We will evaluate the model using the R-squared value to tell us the proportion of variance in the response variable that is explained by the predictor variable.

```python
# calculate r squared
r2 = r2_score(y_train, y_pred)
print(f'R-squared: {r2:.2f}')
```
#### <font color='red'>**TRY IT**</font> &#x1f9e0;: What do RSE and R-squared tell you about this model?
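
To make the two metrics concrete, the sketch below computes R-squared by hand from the residual and total sums of squares and compares it with `r2_score`. It also contrasts RSE, which divides by n - 2 for one predictor, with the root mean squared error, which divides by n. The observed and predicted values are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical observed and predicted values
y_obs = np.array([3.0, 5.0, 7.0, 10.0, 11.0])
y_hat = np.array([3.2, 4.9, 7.1, 9.4, 11.4])

# Residual and total sums of squares
rss = np.sum((y_obs - y_hat) ** 2)
tss = np.sum((y_obs - y_obs.mean()) ** 2)

# R-squared: share of the variance in y accounted for by the predictions
r2_manual = 1 - rss / tss

# RMSE divides by n; RSE divides by n - 2 (one predictor plus intercept)
rmse = np.sqrt(mean_squared_error(y_obs, y_hat))
rse_manual = np.sqrt(rss / (len(y_obs) - 2))

print(r2_manual, r2_score(y_obs, y_hat))
print(rmse, rse_manual)
```

On small samples the gap between RMSE and RSE is noticeable. It shrinks as n grows.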

## Part 2: Multiple Linear Regression

In Part 2, we will perform Multiple Linear Regression using Python. This allows us to use multiple predictors. Let's say we want to include `AgeOfHouse` in our model. The code for splitting data, specifying the model, and fitting the model is entirely the same as with Simple Linear Regression. The only difference is in how we specify our X values:

```python
# Prepare the data
X = df[['SquareFootage', 'AgeOfHouse']].values  # IVs
```

#### <font color='red'>**TRY IT**</font> &#x1f9e0;: Create the X values as above, then reuse the earlier code to specify y, split the data, specify and fit the model, and predict on the `X_train` data, storing the result as `y_train_pred`.
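
For intuition about how the fitted object changes with two predictors, here is a minimal sketch on noise-free toy data, so the true coefficients are known. The generating coefficients (4, -2, and intercept 7) and all variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data (hypothetical): two predictors, response built without noise
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(30, 2))
y_toy = 4.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + 7.0

# Same API as simple regression; only the number of columns in X changes
toy_model = LinearRegression().fit(X_toy, y_toy)

# coef_ now holds one slope per column of X, in column order
print(toy_model.coef_)
print(toy_model.intercept_)
```

Each coefficient is read as the expected change in the response for a one-unit change in that predictor, holding the other predictor fixed.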





Then we can evaluate the model and its fit:

```python
# Calculate F-statistic
residuals = y_train - y_train_pred # Residuals
n = len(y_train)  # Number of observations
p = X_train.shape[1]  # Number of predictors
rss = np.sum(residuals**2)  # Residual sum of squares
tss = np.sum((y_train - np.mean(y_train))**2)  # Total sum of squares
f_statistic = ((tss - rss) / p) / (rss / (n - p - 1)) # get F
f_p_value = 1 - stats.f.cdf(f_statistic, p, n - p - 1) # get p
print(f"F-statistic: {f_statistic}, p-value: {f_p_value}")

# Calculate R-squared
r_squared = model.score(X_train, y_train)
print(f"R-squared: {r_squared}")

# Calculate RSE (n - p - 1 degrees of freedom with p predictors)
rse = np.sqrt(rss / (n - p - 1))
print(f"Residual Standard Error (RSE): {rse}")
```

#### <font color='red'>**TRY IT**</font> &#x1f9e0;: Assess the model overall. Is there at least one predictor which is related to our target? How good is the fit?
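
The F-statistic and R-squared are two views of the same sums of squares: F can also be written as (R² / p) / ((1 - R²) / (n - p - 1)). The sketch below checks this identity on simulated data. The simulated coefficients, sample size, and variable names are assumptions for illustration only.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression

# Simulated data (hypothetical): two predictors plus Gaussian noise
rng = np.random.default_rng(1)
n_obs, n_pred = 40, 2
X_toy = rng.normal(size=(n_obs, n_pred))
y_toy = 1.5 * X_toy[:, 0] + 0.5 * X_toy[:, 1] + rng.normal(size=n_obs)

toy_model = LinearRegression().fit(X_toy, y_toy)
y_fit = toy_model.predict(X_toy)

# F from the sums of squares, as in the lab code
rss_toy = np.sum((y_toy - y_fit) ** 2)
tss_toy = np.sum((y_toy - y_toy.mean()) ** 2)
dof_resid = n_obs - n_pred - 1
f_from_ss = ((tss_toy - rss_toy) / n_pred) / (rss_toy / dof_resid)

# The same F written in terms of R-squared
r2_toy = toy_model.score(X_toy, y_toy)
f_from_r2 = (r2_toy / n_pred) / ((1 - r2_toy) / dof_resid)

# Upper-tail probability of the F distribution
p_val = stats.f.sf(f_from_ss, n_pred, dof_resid)

print(f_from_ss, f_from_r2, p_val)
```

A large F with a small p-value suggests that at least one predictor is related to the target. R-squared and RSE then describe how closely the model fits.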