# Linear Regression

In this lab, I implemented linear regression with multiple variables to help a company decide whether to focus their efforts on their mobile app experience or their website based on the Yearly Amount Spent. 

## 1 - Packages 

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

import copy
import math
%matplotlib inline

## 2 -  Problem Statement

This dataset is having data of customers who buys clothes online. The store offers in-store style and clothing advice sessions. Customers come in to the store, have sessions/meetings with a personal stylist, then they can go home and order either on a mobile app or website for the clothes they want.

The company is trying to decide whether to focus their efforts on their mobile app experience or their website.

## 3 - Dataset

  - `x_train` is Email, Address, Avatar, Avg. Session Length, Time on App, Time on Website, Length of Membership
  - `y_train` is Yearly Amount Spent  
  - Both `X_train` and `y_train` are numpy arrays.

In [4]:
data = pd.read_csv('Ecommerce-Customers.data')

### Data Prepocessing

In [77]:
data.drop(columns=['Email','Address','Avatar'],axis = 1, inplace = True)

KeyError: "['Email' 'Address' 'Avatar'] not found in axis"

In [78]:
data.head()

Unnamed: 0,Avg. Session Length,Time on App,Time on Website,Length of Membership,Yearly Amount Spent
0,1.456351,0.60728,2.493589,0.550107,1.118654
1,-1.136502,-0.949464,0.206556,-0.870927,-1.351783
2,-0.052723,-0.727139,0.049681,0.572067,-0.148501
3,1.26301,1.67639,-0.335978,-0.413996,1.041684
4,0.279838,0.74777,0.471737,0.914422,1.263224


In [9]:
#check for null data 
data.isnull().sum()

Avg. Session Length     0
Time on App             0
Time on Website         0
Length of Membership    0
Yearly Amount Spent     0
dtype: int64

###### No missing data

In [43]:
# Standardize numerical features
scaler = StandardScaler()
numerical_features = ['Avg. Session Length', 'Time on App', 'Time on Website', 'Length of Membership', 'Yearly Amount Spent']
data[numerical_features] = scaler.fit_transform(data[numerical_features])

In [44]:
data.head()

Unnamed: 0,Avg. Session Length,Time on App,Time on Website,Length of Membership,Yearly Amount Spent
0,1.456351,0.60728,2.493589,0.550107,1.118654
1,-1.136502,-0.949464,0.206556,-0.870927,-1.351783
2,-0.052723,-0.727139,0.049681,0.572067,-0.148501
3,1.26301,1.67639,-0.335978,-0.413996,1.041684
4,0.279838,0.74777,0.471737,0.914422,1.263224


In [45]:
X = data.drop("Yearly Amount Spent", axis=1)
y = data["Yearly Amount Spent"]

In [48]:
# Create a Ridge regression model with a regularization strength (alpha)
alpha = 0.01
ridge = Ridge(alpha=alpha)

# Fit the model to the data
ridge.fit(X,y)

# Evaluate the model
ridge_score = ridge.score(X, y)

#### Checking the dimensions of the dataset

Another useful way to get familiar with your data is to view its dimensions.

In [49]:
m,n = data.shape
print(m,n)

500 5


## 4 - Linear regression

I will fit the linear regression parameters $(w,b)$ to the dataset.
- The model function for linear regression
    $$f_{w,b}(x) = wx + b$$
    

- To train a linear regression model, you want to find the best $(w,b)$ parameters that fit your dataset.  

    - To compare how one choice of $(w,b)$ is better or worse than another choice, you can evaluate it with a cost function $J(w,b)$
      - $J$ is a function of $(w,b)$. That is, the value of the cost $J(w,b)$ depends on the value of $(w,b)$.
  
    - The choice of $(w,b)$ that fits your data the best is the one that has the smallest cost $J(w,b)$.


- To find the values $(w,b)$ that gets the smallest possible cost $J(w,b)$, you can use a method called **gradient descent**. 
  - With each step of gradient descent, your parameters $(w,b)$ come closer to the optimal values that will achieve the lowest cost $J(w,b)$.

## 5 - Compute Cost

Gradient descent involves repeated steps to adjust the value of your parameter $(w,b)$ to gradually get a smaller and smaller cost $J(w,b)$.
- At each step of gradient descent, it will be helpful for you to monitor your progress by computing the cost $J(w,b)$ as $(w,b)$ gets updated. 

#### Cost function
The cost function for multiple linear regression $J(w,b)$ is defined as

$$J(w,b) = \frac{1}{2m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})^2$$ 

- $m$ is the number of training examples in the dataset

#### Model prediction

- For linear regression, the prediction of the model $f_{w,b}$ for an example $x^{(i)}$ is representented as:

$$ f_{w,b}(x^{(i)}) = w.x^{(i)} + b$$

This is the equation for a line, with an intercept $b$ and a slope $w$

In [50]:
# split the data 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [51]:
# convert the sets into NumPy arrays
X_train = np.array(X_train)
y_train = np.array(y_train)

In [52]:
def compute_cost(x, y, w, b): # Computes the cost function for linear regressio
    

    # number of training examples
    m = len(y) 
    
    total_cost = 0
    
    for i in range(m):
        fw_b = np.dot(w, x[i]) + b
        cost = (fw_b - y[i])**2
        total_cost += cost
        
    total_cost = total_cost/(2*m)

    return total_cost

In [53]:
# Compute cost with some initial values for paramaters w, b
initial_w = 2
initial_b = 1

cost = compute_cost(X_train, y_train, initial_w, initial_b)

print(type(cost))
print(f'Cost at initial w,b: {cost}')

<class 'numpy.ndarray'>
Cost at initial w,b: [2.29497162 1.95781955 2.88833104 1.47599975]


## 6 - Gradient descent 

The gradient descent algorithm is:

$$\begin{align*}& \text{repeat until convergence:} \; \lbrace \newline \; & \phantom {0000} b := b -  \alpha \frac{\partial J(w,b)}{\partial b} \newline       \; & \phantom {0000} w := w -  \alpha \frac{\partial J(w,b)}{\partial w} \tag{1}  \; & 
\newline & \rbrace\end{align*}$$

where, n is the number of features, parameters $w_j$,  $b$, are updated simultaneously and where  

$$
\begin{align}
\frac{\partial J(\mathbf{w},b)}{\partial w_j}  &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)})x_{j}^{(i)} \tag{6}  \\
\frac{\partial J(\mathbf{w},b)}{\partial b}  &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)}) \tag{7}
\end{align}
$$
* m is the number of training examples in the data set

    
*  $f_{w,b}(x^{(i)})$ is the model's prediction, while $y^{(i)}$, is the target value

In [72]:
# compute_gradient
def compute_gradient(x, y, w, b): # Computes the gradient for linear regression 

    
    # Number of training examples
    m,n = x.shape
    
    dj_dw = np.zeros((n,))
    dj_db = 0.
    
    for i in range(m):
        err = (np.dot(x[i], w) + b) - y[i]  
        dj_dw += err * x[i]
        dj_db += err                       
    dj_dw = dj_dw / m                                
    dj_db = dj_db / m  
        
    return dj_dw, dj_db

In [73]:
n = X_train.shape[1] if len(X_train.shape) > 1 else 1

In [74]:
# Compute and display gradient with w initialized to zeroes
initial_w = np.zeros((n,))
initial_b = 0

tmp_dj_dw, tmp_dj_db = compute_gradient(X_train, y_train, initial_w, initial_b)
print(f'dj_db at initial w,b: {tmp_dj_db}')
print(f'dj_dw at initial w,b: \n {tmp_dj_dw}') 

dj_db at initial w,b: -0.03851065874308216
dj_dw at initial w,b: 
 [-0.35085457 -0.5330541  -0.01141393 -0.8593005 ]


<a name="toc_15456_5.2"></a>
## Gradient Descent With Multiple Variables


In [75]:
def gradient_descent(X, y, w_in, b_in, cost_function, gradient_function, alpha, num_iters): 
    
    # An array to store cost J and w's at each iteration 
    J_history = []
    w = copy.deepcopy(w_in)  #avoid modifying global w within function
    b = b_in
    
    for i in range(num_iters):

        # Calculate the gradient and update the parameters
        dj_db,dj_dw = gradient_function(X, y, w, b)   

        # Update Parameters using w, b, alpha and gradient
        w = w - alpha * dj_dw               
        b = b - alpha * dj_db               
      
        # Save cost J at each iteration
        if i<100000:      # prevent resource exhaustion 
            J_history.append( cost_function(X, y, w, b))

        # Print cost every at intervals 10 times or as many iterations if < 10
        if i % math.ceil(num_iters / 10) == 0:
            print(f"Iteration {i:4d}: Cost {J_history[-1][-1]:8.2f}   ")
        
    return w, b, J_history 

In [76]:
# initialize parameters
initial_w = np.zeros_like(initial_w)
initial_b = 0.

# some gradient descent settings
iterations = 1000
alpha = 5.0e-7
w_final, b_final, J_hist = gradient_descent(X_train, y_train, initial_w, initial_b, compute_cost, compute_gradient, alpha, iterations)

print(f"b,w found by gradient descent: {b_final},{w_final} ")
m,_ = X_train.shape
for i in range(m):
    print(f"prediction: {np.dot(X_train[i], w_final) + b_final}, target value: {y_train[i]}")

Iteration    0: Cost     0.53   
Iteration  100: Cost     0.53   
Iteration  200: Cost     0.53   
Iteration  300: Cost     0.53   
Iteration  400: Cost     0.53   
Iteration  500: Cost     0.53   
Iteration  600: Cost     0.53   
Iteration  700: Cost     0.53   
Iteration  800: Cost     0.53   
Iteration  900: Cost     0.53   
b,w found by gradient descent: [1.75421943e-04 2.66521840e-04 5.70193881e-06 4.29640394e-04],[1.92116098e-05 1.91888576e-05 1.92539974e-05 1.91481184e-05] 
prediction: [1.79092735e-04 2.70192632e-04 9.37273064e-06 4.33311186e-04], target value: 1.7389747837531464
prediction: [ 1.69710623e-04  2.60810520e-04 -9.38124685e-09  4.23929074e-04], target value: -0.2534591670590334
prediction: [2.48795411e-04 3.39895309e-04 7.90754071e-05 5.03013862e-04], target value: 0.6379286901624921
prediction: [ 1.02349388e-04  1.93449286e-04 -6.73706161e-05  3.56567839e-04], target value: -0.5233308362994558
prediction: [ 1.47937701e-04  2.39037598e-04 -2.17823033e-05  4.02156152