---
### Getting more technical (THURSDAY)

Talk about and define:
- Mean Squared Error (MSE)
- residuals
- total sum of squares
- R²


#### Mean Squared Error (MSE)

MSE stands for Mean Squared Error, which is a widely used loss function for regression problems. It measures the average squared difference between the predicted and actual values. Here's the formula for MSE:

$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

where:

  - $n$ is the number of observations
  - $y_i$ is the actual value of the $i$-th observation
  - $\hat{y}_i$ is the predicted value for the $i$-th observation

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Generate random data
np.random.seed(123)
x = np.random.rand(50)
y = 2*x + 0.5*np.random.randn(50)

# Fit a linear regression model
coeffs = np.polyfit(x, y, 1)
y_pred = np.polyval(coeffs, x)

# Compute MSE
mse = np.mean((y - y_pred)**2)
print("MSE:", mse)

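The MSE reports the average squared difference between the predicted and actual values. Because the differences are squared, large errors count much more than small ones, and the result is expressed in squared units of y.

The MSE is used both to fit models (as the loss that is minimized) and to evaluate them.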
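
To make the formula concrete, here is a small worked example with three made-up observations. The numbers and variable names are hypothetical and chosen only so the arithmetic is easy to follow by hand:

```python
import numpy as np

# Hypothetical actual and predicted values (not taken from the data above)
y_true = np.array([3.0, 5.0, 7.0])
y_hat = np.array([2.0, 5.0, 9.0])

errors = y_true - y_hat            # one error per observation
squared_errors = errors**2         # squaring removes the sign
mse_small = squared_errors.mean()  # (1 + 0 + 4) / 3

print("Errors:", errors)
print("Squared errors:", squared_errors)
print("MSE:", mse_small)
```

Note how the error of 2 contributes 4 to the sum, while the error of 1 contributes only 1.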

Below, we plot the data and the regression line using Matplotlib.

In [None]:
# Plot the data and the regression line
plt.scatter(x, y, label='Data')
plt.plot(x, y_pred, color='r', label='Linear regression')
plt.legend()
plt.show()

#### Residuals

Residuals are the differences between the actual values of the dependent variable and the predicted values of the dependent variable. 

In the context of MSE, the residuals are the differences y - ŷ that get squared and averaged. Residuals are useful for evaluating the performance of the model and checking for patterns or trends that the model may have missed.

In [None]:
import numpy as np

# Generate random data
np.random.seed(123)
x = np.random.rand(50)
y = 2*x + 0.5*np.random.randn(50)

# Fit a linear regression model
coeffs = np.polyfit(x, y, 1)
y_pred = np.polyval(coeffs, x)

# Compute residuals
residuals = y - y_pred
print("Residuals:", residuals)

In this example, we first generate some random data using NumPy. Then, we fit a linear regression model using the np.polyfit function and make predictions on the input data using the np.polyval function. Finally, we compute the residuals as the difference between the actual values of y and the predicted values ŷ. The residuals are printed to the console.
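
The link between residuals and the MSE can be checked directly. The sketch below regenerates the same data so that it runs on its own. It assumes a least-squares fit with an intercept (which is what np.polyfit with degree 1 gives), in which case the residuals should also sum to approximately zero:

```python
import numpy as np

# Same data and fit as above
np.random.seed(123)
x = np.random.rand(50)
y = 2*x + 0.5*np.random.randn(50)

coeffs = np.polyfit(x, y, 1)
y_pred = np.polyval(coeffs, x)
residuals = y - y_pred

# The MSE is simply the mean of the squared residuals
mse_from_residuals = np.mean(residuals**2)

# For a least-squares line with an intercept, residuals sum to ~0
residual_sum = residuals.sum()

print("MSE from residuals:", mse_from_residuals)
print("Sum of residuals:", residual_sum)
```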

After model fitting, it is important to examine the residuals to ensure that they are randomly distributed around zero and do not exhibit any patterns or trends. One way to visualize the residuals is to create a scatter plot of the residuals against the predicted values. If the residuals are randomly distributed around zero, the plot should not exhibit any patterns or trends. If there is a pattern or trend, it suggests that the model is not capturing some aspect of the data and may need to be adjusted.

The code below creates a scatter plot of the residuals against the predicted values, with a red line at y=0 marking where the residuals should be centered. For a well-fitted model, the points should be spread roughly evenly above and below the red line.

In [None]:
import matplotlib.pyplot as plt

# Create scatter plot of residuals
plt.scatter(y_pred, residuals)
plt.axhline(y=0, color='r', linestyle='-')
plt.xlabel('Predicted values')
plt.ylabel('Residuals')
plt.title('Residual plot')
plt.show()
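
For contrast, here is a sketch of what a problematic residual pattern can look like. The data are hypothetical: y is generated from a quadratic in x (names such as x_curved are made up for this illustration), but we still fit a straight line. The residuals then follow a systematic pattern instead of scattering randomly around zero:

```python
import numpy as np

# Hypothetical noise-free curved data: y = x^2 on [-1, 1]
x_curved = np.linspace(-1, 1, 21)
y_curved = x_curved**2

# Fit a straight line even though the true relationship is quadratic
line_coeffs = np.polyfit(x_curved, y_curved, 1)
line_pred = np.polyval(line_coeffs, x_curved)
line_residuals = y_curved - line_pred

# Residuals are positive at both ends and negative in the middle
print("Residuals at the ends:", line_residuals[0], line_residuals[-1])
print("Residual in the middle:", line_residuals[10])
```

Plotting line_residuals against x_curved would show a U shape, which is a sign that the straight line is missing the curvature in the data.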

### Linear regression using scikit-learn (generalized linear regression)


In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

After importing the libraries, we need some sample data to work with. Let's say we want to establish a relationship between the number of hours studied and the grades obtained by students. We can create NumPy arrays with some made-up data like this:

In [None]:
x = np.array([2, 4, 6, 8, 10, 12, 14, 16, 18, 20]).reshape((-1, 1))
y = np.array([60, 80, 90, 92, 94, 94, 96, 98, 98, 100])

Here, we've created two NumPy arrays: x, which contains the number of hours studied, and y, which contains the corresponding grades obtained by the students.

Now, we can create a LinearRegression object and fit it to our data:

In [None]:
model = LinearRegression()
model.fit(x, y)

This created a linear regression model and fit it to our data.
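
After fitting, scikit-learn stores the estimated slope and intercept on the model as coef_ and intercept_. The sketch below re-creates the data so that it is self-contained and prints both values; the names slope and intercept are just illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same hours-studied data as above
x = np.array([2, 4, 6, 8, 10, 12, 14, 16, 18, 20]).reshape((-1, 1))
y = np.array([60, 80, 90, 92, 94, 94, 96, 98, 98, 100])

model = LinearRegression()
model.fit(x, y)

slope = model.coef_[0]        # change in predicted grade per extra hour
intercept = model.intercept_  # predicted grade at 0 hours studied

print("Slope:", slope)
print("Intercept:", intercept)
```

The fitted line is then grade ≈ intercept + slope × hours, which is the rule predict() applies to new inputs.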

We can now use the predict() method of our model to make predictions for new data points. For example, if we want to predict the grades of students who have studied for 15 hours, we can do this:

In [None]:
x_new = np.array([15]).reshape((-1, 1))
y_new = model.predict(x_new)
print(y_new)

The code above prints the predicted grade for a student who has studied for 15 hours.
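
To see the relationship between hours studied and grades, one option is to plot the data together with the fitted line. This is a minimal sketch in the same style as the earlier Matplotlib plot; it re-creates the data and model so it runs on its own, and the axis labels are our own choice:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Same hours-studied data as above
x = np.array([2, 4, 6, 8, 10, 12, 14, 16, 18, 20]).reshape((-1, 1))
y = np.array([60, 80, 90, 92, 94, 94, 96, 98, 98, 100])

model = LinearRegression().fit(x, y)
y_line = model.predict(x)  # fitted values along the regression line

# Plot the data and the regression line
plt.scatter(x.ravel(), y, label='Data')
plt.plot(x.ravel(), y_line, color='r', label='Linear regression')
plt.xlabel('Hours studied')
plt.ylabel('Grade')
plt.legend()
plt.show()
```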

We can also quantify how well the line fits the data by computing R-squared (R²) from the residuals:

In [None]:
def myR2(y,y_hat):
    # Calculate R-squared
    residuals = y - y_hat

    ss_res = np.sum(residuals**2)
    ss_tot = np.sum((y - np.mean(y))**2)
    r_squared = 1 - (ss_res / ss_tot)
    print(r_squared)
    
    return r_squared

In [None]:
y_hat = model.predict(x)  # predictions for the data the model was fit to
r2 = myR2(y, y_hat)

In the code above, we compare the observed grades y with the predictions y_hat of the fitted linear regression model.

We then calculate the residuals of the model, the sum of squares of the residuals (ss_res), and the total sum of squares (ss_tot) of the data.

Finally, we calculate the R-squared (R²) value using the formula 1 - (ss_res / ss_tot).
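
Written out, with $y_i$ the observed values, $\hat{y}_i$ the predicted values, and $\bar{y}$ the mean of the observed values, the quantities computed in myR2 are:

$SS_{res} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

$SS_{tot} = \sum_{i=1}^{n} (y_i - \bar{y})^2$

$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$

Since $SS_{res} = n \cdot MSE$, a lower MSE on a fixed dataset means a higher R².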

R-squared (R²) is a statistical measure that represents the proportion of variance in the dependent variable (y) that is explained by the independent variable(s) (x) in a linear regression model. In other words, it measures how well the observed data points fit the regression line.

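For a least-squares linear regression with an intercept, evaluated on the data it was fit to, R-squared values range from 0 to 1, where a value of 0 indicates that the model does not explain any of the variation in the dependent variable and a value of 1 indicates that the model explains all of it. In other settings (for example, when evaluating on new data) R-squared can be negative, which means the model does worse than always predicting the mean of y.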
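
The two endpoints can be reproduced with a quick sketch. The helper r_squared below is a hypothetical stand-in that uses the same formula as myR2 (without the print), so that the example runs on its own:

```python
import numpy as np

def r_squared(y, y_hat):
    """R² = 1 - SS_res / SS_tot, the same formula used in myR2."""
    ss_res = np.sum((y - y_hat)**2)
    ss_tot = np.sum((y - np.mean(y))**2)
    return 1 - ss_res / ss_tot

# Grades from the hours-studied example
y = np.array([60, 80, 90, 92, 94, 94, 96, 98, 98, 100])

# Always predicting the mean explains none of the variation
r2_mean = r_squared(y, np.full(y.shape, y.mean()))

# Predicting every value exactly explains all of it
r2_perfect = r_squared(y, y.astype(float))

print("Mean-only predictions:", r2_mean)
print("Perfect predictions:", r2_perfect)
```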

An R-squared value of 0.8, for example, indicates that 80% of the variation in the dependent variable is explained by the independent variable(s) in the model. R-squared is commonly used as a goodness-of-fit measure to evaluate how well a linear regression model fits the data.
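
As a sanity check, the hand-computed value can be compared with scikit-learn's own implementations, r2_score from sklearn.metrics and the score() method of the fitted model. The sketch below re-creates the data and model so that it is self-contained; the variable names are illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Same hours-studied data as above
x = np.array([2, 4, 6, 8, 10, 12, 14, 16, 18, 20]).reshape((-1, 1))
y = np.array([60, 80, 90, 92, 94, 94, 96, 98, 98, 100])

model = LinearRegression().fit(x, y)
y_hat = model.predict(x)

# R² by hand, using the same formula as myR2
ss_res = np.sum((y - y_hat)**2)
ss_tot = np.sum((y - np.mean(y))**2)
r2_manual = 1 - ss_res / ss_tot

# R² from scikit-learn, computed two ways
r2_sklearn = r2_score(y, y_hat)
r2_model = model.score(x, y)

print("Manual R²:", r2_manual)
print("r2_score:", r2_sklearn)
print("model.score:", r2_model)
```

All three values should agree up to floating-point rounding.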

Our R-squared (R²) was OK in this case: not too good, not too bad. We will discuss R², sums of squares, and related quantities in more detail later.

Before we do that, let's think about good and bad models.