# Practical Exercises 
These exercises practice the NumPy skills covered so far.

# 1. Regression coefficients
The formula for the regression coefficients is

$\beta = (X'X)^{-1}X'Y$

However, the data is stored in an awkward format: instead of a table with one row per record, all values have been saved in a single flat array, i.e. a 1xN vector. In other words, the data was changed from this:

<img src="../assets/data_before.png" alt="" width="500"/>

to that:

<img src="../assets/data_after.png" alt="" width="700"/>

The array contains the following variables: 

- Sale (in Dollars) - Amount of money received by the store
- Pack Size - Number of bottles per item
- State Bottle Cost - Cost of producing the bottle 
- Packs Sold - Amount of bottles sold
- Bottle Volume (in ml) - How many ml each bottle has
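
To make the flattening concrete, here is a minimal sketch on a toy array. The three records and their values are made up for illustration, and the sketch assumes the flat array stores the data record after record (row-major), with each record holding the five variables in the order listed above.

```python
import numpy as np

# Hypothetical toy records (3 rows, 5 variables), stored as strings.
# Column order assumed: Sale, Pack Size, State Bottle Cost, Packs Sold, Bottle Volume
records = np.array([
    ["10.0", "6", "1.5", "2", "750"],
    ["24.0", "12", "2.0", "1", "1000"],
    ["7.5", "6", "1.25", "1", "375"],
])

# Flattening reproduces the awkward layout: one long 1-D array
flat = records.ravel()

# reshape(-1, 5) restores one row per record; -1 lets NumPy infer the row count
restored = flat.reshape(-1, 5).astype(float)

print(flat.shape)
print(restored.shape)
```

If the values had instead been stored variable by variable (column-major), the reshape would need `order="F"` or a transpose, so it is worth checking a few rows against the original table.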



Question: Determine the regression coefficients of the following OLS regression:

$\text{Sale} = \beta_0 + \beta_1 \cdot \text{Pack Size} + \beta_2 \cdot \text{State Bottle Cost} + \beta_3 \cdot \text{Packs Sold} + \beta_4 \cdot \text{Bottle Volume} + \epsilon$
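
Before working on the real data, it can help to check the coefficient formula on a case where the answer is known. The sketch below uses synthetic, noise-free data with made-up coefficients (1, 2, -3); the variable names are illustrative and not part of the exercise data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Two synthetic regressors
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)

# Noise-free response with known coefficients: intercept 1, slopes 2 and -3
y = 1.0 + 2.0 * x1 - 3.0 * x2

# Design matrix: a column of ones for the intercept, then the regressors
X = np.column_stack([np.ones(n), x1, x2])

# beta = (X'X)^{-1} X'Y
beta = np.linalg.inv(X.T @ X) @ (X.T @ y)

print(np.round(beta, 6))
```

Because the response contains no noise, the estimated coefficients should match the chosen values up to floating-point error.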

In [6]:
import pickle

def load_data(path: str):
    with open(path,'rb') as f:
        data = pickle.load(f)
    return data

data = load_data("../../data/data.pkl")

In [8]:
# Numpy Way
import numpy as np

# Reshaping the flat array of 500,000 values into a 100,000 x 5 matrix (one row per record)
reshaped_data = data.reshape(100_000,-1)

# Changing the string variables to floats
float_data = reshaped_data.astype(float)

# Separating the Sale variable from the rest
independent = float_data[:,1:]
Y = float_data[:,0]

# Creating a column with only ones and add that to the numpy array as a column (this is done for the intercept)
ones = np.ones(independent.shape[0])
X = np.c_[ones, independent]

# Applying regression coefficient formula
X_prime = np.transpose(X)  

inverse_part = np.linalg.inv(np.dot(X_prime, X))
X_prime_Y = np.dot(X_prime, Y)
beta = np.dot(inverse_part, X_prime_Y)

# Printing the coefficients (intercept first, then the regressors in column order)
print(beta)

[-3.88928013e+01 -4.62402519e+00  9.48100848e+00  1.53183949e+01
 -1.88215965e-02]
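
Inverting $X'X$ explicitly follows the formula directly, but it can lose accuracy when regressors are nearly collinear. As an alternative, `np.linalg.lstsq` solves the same least-squares problem without forming the inverse. The following is a minimal sketch on synthetic data; the sample size, coefficients, and noise level are assumptions chosen for illustration, not values from the exercise.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1_000

# Synthetic design matrix: intercept column plus four random regressors
features = rng.normal(size=(n, 4))
X = np.column_stack([np.ones(n), features])

# Made-up coefficients and a small amount of noise
true_beta = np.array([0.5, 1.0, -2.0, 3.0, 0.25])
y = X @ true_beta + rng.normal(scale=0.1, size=n)

# Explicit formula, as used in the cell above
beta_inv = np.linalg.inv(X.T @ X) @ (X.T @ y)

# Least-squares solver; rcond=None selects NumPy's current default cutoff
beta_lstsq, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)

# On well-conditioned data both approaches should agree
print(np.allclose(beta_inv, beta_lstsq))
```

On the exercise data, the same call would be `np.linalg.lstsq(X, Y, rcond=None)`, and its first return value should match `beta` closely.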


In [11]:
import statsmodels.api as sm

# Reshaping the flat array of 500,000 values into a 100,000 x 5 matrix (one row per record)
reshaped_data = data.reshape(100_000,-1)

# Changing the string variables to floats (np.float was removed in NumPy 1.24; the builtin float is equivalent)
float_data = reshaped_data.astype(float)

# Separating the Sale variable from the rest
independent = float_data[:,1:]
Y = float_data[:,0]

# Creating a column with only ones and add that to the numpy array as a column (this is done for the intercept)
ones = np.ones(independent.shape[0])
X = np.c_[ones, independent]

# Defining statistical model
model = sm.OLS(Y, X)

# Fitting the results
results = model.fit()

# Printing the entire OLS summary statistics
results.summary()

ModuleNotFoundError: No module named 'statsmodels'