# STOR 320: Introduction to Data Science
## Lab 9

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm

1. The following code generates two feature columns and the target values. Based on this code, what is the true linear model relating `Target` to the two features?

In [3]:
np.random.seed(0)

# Generate feature matrix X (100 samples, 2 features)
X = np.random.rand(100, 2)

# Define true coefficients for generating y
true_beta = np.array([5, 2, -3])  # Intercept, beta_1, beta_2

# Generate y with some added noise
y = true_beta[0] + X @ true_beta[1:] + np.random.normal(0, 0.5, X.shape[0])

# Combine X and y into a pandas DataFrame
data = pd.DataFrame(np.column_stack((X, y)), columns=['Feature_1', 'Feature_2', 'Target'])
data

,Feature_1,Feature_2,Target
0,0.548814,0.715189,4.515377
1,0.602763,0.544883,4.030911
2,0.423655,0.645894,3.335893
3,0.437587,0.891773,2.980945
4,0.963663,0.383442,5.527985
...,...,...,...
95,0.398221,0.209844,4.879017
96,0.186193,0.944372,2.610245
97,0.739551,0.490459,4.848061
98,0.227415,0.254356,5.037529


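The true model is $y = 5 + 2 F_1 - 3 F_2 + \varepsilon$, where $F_1$ and $F_2$ are the two features and $\varepsilon \sim N(0, 0.5^2)$ is the added noise.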

2. From the `data` table, separate X (features) and y (target). In other words, create a $100 \times 2$ NumPy matrix for X and a NumPy vector for y.

In [5]:
x = data[["Feature_1", "Feature_2"]]
y = data["Target"]

In [6]:
x

,Feature_1,Feature_2
0,0.548814,0.715189
1,0.602763,0.544883
2,0.423655,0.645894
3,0.437587,0.891773
4,0.963663,0.383442
...,...,...
95,0.398221,0.209844
96,0.186193,0.944372
97,0.739551,0.490459
98,0.227415,0.254356


In [7]:
y

0     4.515377
1     4.030911
2     3.335893
3     2.980945
4     5.527985
        ...   
95    4.879017
96    2.610245
97    4.848061
98    5.037529
99    4.160183
Name: Target, Length: 100, dtype: float64
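
The task asks for NumPy objects, while the cells above keep `x` and `y` as a pandas DataFrame and Series. A minimal sketch of the conversion, assuming the same data-generating code as In [3]; the names `X_np` and `y_np` are illustrative and are not used elsewhere in the lab.

```python
import numpy as np
import pandas as pd

# Recreate the lab's data (same code as In [3]) so the sketch is self-contained.
np.random.seed(0)
X = np.random.rand(100, 2)
true_beta = np.array([5, 2, -3])
y = true_beta[0] + X @ true_beta[1:] + np.random.normal(0, 0.5, X.shape[0])
data = pd.DataFrame(np.column_stack((X, y)),
                    columns=['Feature_1', 'Feature_2', 'Target'])

# Hypothetical names: X_np and y_np do not appear in the original lab.
X_np = data[["Feature_1", "Feature_2"]].to_numpy()  # 2-D array of features
y_np = data["Target"].to_numpy()                    # 1-D array of targets

print(type(X_np).__name__, X_np.shape)
print(type(y_np).__name__, y_np.shape)
```

Keeping the pandas objects, as the lab does, also works for the later steps: `np.column_stack` accepts a DataFrame, and `statsmodels` uses the column names to label the coefficients.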

3. Add a column of ones to X to account for the intercept term in the coefficient vector.

In [11]:
x_with_intercept = np.column_stack((np.ones(x.shape[0]), x))
x_with_intercept

array([[1.        , 0.5488135 , 0.71518937],
       [1.        , 0.60276338, 0.54488318],
       [1.        , 0.4236548 , 0.64589411],
       [1.        , 0.43758721, 0.891773  ],
       [1.        , 0.96366276, 0.38344152],
       [1.        , 0.79172504, 0.52889492],
       [1.        , 0.56804456, 0.92559664],
       [1.        , 0.07103606, 0.0871293 ],
       [1.        , 0.0202184 , 0.83261985],
       [1.        , 0.77815675, 0.87001215],
       [1.        , 0.97861834, 0.79915856],
       [1.        , 0.46147936, 0.78052918],
       [1.        , 0.11827443, 0.63992102],
       [1.        , 0.14335329, 0.94466892],
       [1.        , 0.52184832, 0.41466194],
       [1.        , 0.26455561, 0.77423369],
       [1.        , 0.45615033, 0.56843395],
       [1.        , 0.0187898 , 0.6176355 ],
       [1.        , 0.61209572, 0.616934  ],
       [1.        , 0.94374808, 0.6818203 ],
       [1.        , 0.3595079 , 0.43703195],
       [1.        , 0.6976312 , 0.06022547],
       [1.

4. Compute the estimate of the parameter vector $\beta$ manually using NumPy's matrix operations.
- Hint: You can refer to the lecture notes to find the closed-form solution for the regression coefficients.
- Hint: You can use `np.linalg.inv` to calculate the inverse of a matrix.
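
The lecture notes are not reproduced here. In standard ordinary least squares theory, the closed-form solution is the normal-equations estimator. It is stated below under the assumptions that $X$ is the design matrix including the column of ones from step 3 and that $X^\top X$ is invertible:

$$
\hat{\beta} = (X^\top X)^{-1} X^\top y
$$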

In [13]:
# Closed-form OLS estimate: beta_hat = (X^T X)^{-1} X^T y
XTX_inv = np.linalg.inv(x_with_intercept.T @ x_with_intercept)
beta_hat = XTX_inv @ x_with_intercept.T @ y
print(f"Manually calculated beta: {beta_hat}")

Manually calculated beta: [ 5.05725163  1.78685022 -2.98521247]
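
The explicit inverse follows the hint, but the same estimate can be obtained without forming $(X^\top X)^{-1}$. The sketch below is an alternative to the lab's approach, not a replacement for it: it compares the inverse-based result with `np.linalg.solve` and `np.linalg.lstsq`. The names `X_design`, `beta_inv`, `beta_solve`, and `beta_lstsq` are illustrative.

```python
import numpy as np

# Recreate the lab's data (same code as In [3]) so the sketch is self-contained.
np.random.seed(0)
X = np.random.rand(100, 2)
true_beta = np.array([5, 2, -3])
y = true_beta[0] + X @ true_beta[1:] + np.random.normal(0, 0.5, X.shape[0])

# Design matrix with a leading column of ones for the intercept.
X_design = np.column_stack((np.ones(X.shape[0]), X))

# Normal equations with an explicit inverse, as in the cell above.
beta_inv = np.linalg.inv(X_design.T @ X_design) @ X_design.T @ y

# Same linear system, solved without computing the inverse.
beta_solve = np.linalg.solve(X_design.T @ X_design, X_design.T @ y)

# Least-squares solver applied directly to the design matrix.
beta_lstsq, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print(np.allclose(beta_inv, beta_solve), np.allclose(beta_inv, beta_lstsq))
```

For a small, well-conditioned problem like this one the three results agree to floating-point precision. Solving the system directly is generally preferred when the design matrix is close to collinear.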


5. Compute $R^2$ manually. You can refer to the lecture notes for the definition of R-squared.

In [15]:
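# Fitted values from the manually estimated coefficients
y_pred = x_with_intercept @ beta_hat
# Residual and total sums of squares
SS_res = np.sum((y - y_pred) ** 2)
SS_total = np.sum((y - np.mean(y)) ** 2)
# R-squared: share of the variation in y explained by the model
R_squared = 1 - (SS_res / SS_total)
R_squared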

0.8308337212672314
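
The `statsmodels` summary in part 6 also reports an adjusted R-squared. As a sketch that goes slightly beyond what the lab asks, the code below computes R-squared as above, then the adjusted value using the standard formula $1 - (1 - R^2)\frac{n - 1}{n - k}$, where $k$ counts the estimated coefficients including the intercept. It also checks that, for a model with an intercept, R-squared equals the squared correlation between `y` and the fitted values. All variable names are illustrative.

```python
import numpy as np

# Recreate the lab's data (same code as In [3]) so the sketch is self-contained.
np.random.seed(0)
X = np.random.rand(100, 2)
true_beta = np.array([5, 2, -3])
y = true_beta[0] + X @ true_beta[1:] + np.random.normal(0, 0.5, X.shape[0])

# Fit by the normal equations and compute fitted values.
X_design = np.column_stack((np.ones(X.shape[0]), X))
beta_hat = np.linalg.solve(X_design.T @ X_design, X_design.T @ y)
y_pred = X_design @ beta_hat

# R-squared from the residual and total sums of squares.
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# Adjusted R-squared: n observations, k coefficients (intercept included).
n, k = X_design.shape
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k)

# With an intercept, R-squared equals the squared correlation of y and y_pred.
r2_corr = np.corrcoef(y, y_pred)[0, 1] ** 2

print(f"R-squared: {r2:.4f}, adjusted: {adj_r2:.4f}, corr^2: {r2_corr:.4f}")
```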

6. **Comparison with statsmodels**

Fit a linear regression model using `statsmodels`, then compare the manually calculated coefficients and R-squared value with those reported by `statsmodels`. Are they the same?

In [17]:
x_sm = sm.add_constant(x)
mod = sm.OLS(y, x_sm)
results = mod.fit()
results.summary()

Dep. Variable:,Target,R-squared:,0.831
Model:,OLS,Adj. R-squared:,0.827
Method:,Least Squares,F-statistic:,238.2
Date:,"Sun, 03 Nov 2024",Prob (F-statistic):,3.74e-38
Time:,11:37:57,Log-Likelihood:,-64.212
No. Observations:,100,AIC:,134.4
Df Residuals:,97,BIC:,142.2
Df Model:,2,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,5.0573,0.130,39.044,0.000,4.800,5.314
Feature_1,1.7869,0.166,10.750,0.000,1.457,2.117
Feature_2,-2.9852,0.164,-18.217,0.000,-3.310,-2.660

Omnibus:,0.41,Durbin-Watson:,2.045
Prob(Omnibus):,0.815,Jarque-Bera (JB):,0.502
Skew:,0.144,Prob(JB):,0.778
Kurtosis:,2.807,Cond. No.,5.58
