# Logistic Regression

<p>This project classifies emails as "spam" (1) or "non-spam" (0) using a logistic regression model. The hypothesis/decision rule in a logistic regression model is given by</p>

$$h_\theta(x) = \sigma(\theta^T x) \\ \text{where } \sigma(z) = \frac{1}{1 + e^{-z}} \text{ is the sigmoid function}$$
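
As a minimal sketch of how the hypothesis turns into a 0/1 decision, the snippet below evaluates $h_\theta(x)$ for every row of a data matrix and thresholds it at 0.5. The names `sigmoid`, `hypothesis`, and `predict_label` are illustrative and not part of the project code, and $\theta$ is assumed to be stored as a 1-D array of length $d$.

```python
import numpy as np


def sigmoid(z):
    # sigma(z) = 1 / (1 + exp(-z)), computed as exp(-log(1 + exp(-z)))
    # so that large |z| does not overflow
    z = np.asarray(z, dtype=float)
    return np.exp(-np.logaddexp(0.0, -z))


def hypothesis(theta, X):
    # h_theta(x) for each row x of X (assumed shapes: X is (n, d), theta is (d,))
    return sigmoid(X @ theta)


def predict_label(theta, X, threshold=0.5):
    # decision rule: predict 1 ("spam") when h_theta(x) >= threshold, else 0
    return (hypothesis(theta, X) >= threshold).astype(int)


theta = np.array([1.0, -1.0])
X = np.array([[2.0, 0.0],    # theta^T x =  2 -> h is about 0.88
              [0.0, 2.0]])   # theta^T x = -2 -> h is about 0.12
print(hypothesis(theta, X))
print(predict_label(theta, X))
```

Because the sigmoid is monotonic, thresholding $h_\theta(x)$ at 0.5 amounts to checking the sign of $\theta^T x$.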

<p>Since logistic regression does not have a closed-form solution, we use gradient descent to obtain the parameters $\theta$. The loss function is the negative log-likelihood (NLL) with L2 regularization. For a given set of parameters $\theta$, the loss $l(\theta)$ is</p>

$$l(\theta) = NLL(\theta) + \frac{\lambda}{2}\|\theta\|^2 \\ \text{where } NLL(\theta) = -\sum_{i=1}^{n} \left[ y_i\log(h_\theta(x_i)) + (1 - y_i)\log(1 - h_\theta(x_i)) \right]$$
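
The loss can be evaluated directly from its definition, for example to monitor convergence. The sketch below is one way to do so, not the project's implementation: `regularized_nll` is a hypothetical helper, $\theta$ and $y$ are assumed to be 1-D arrays, and `np.logaddexp` is used so that $\log(h)$ and $\log(1 - h)$ stay finite for large $|\theta^T x|$.

```python
import numpy as np


def regularized_nll(theta, X, y, lam):
    # Assumed shapes: X is (n, d), y is (n,) with 0/1 entries, theta is (d,)
    z = X @ theta
    # -log(h(x))     = log(1 + exp(-z)) = logaddexp(0, -z)
    # -log(1 - h(x)) = log(1 + exp(z))  = logaddexp(0,  z)
    nll = np.sum(y * np.logaddexp(0.0, -z) + (1.0 - y) * np.logaddexp(0.0, z))
    # L2 penalty: (lambda / 2) * ||theta||^2
    return nll + 0.5 * lam * np.sum(theta ** 2)


X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
y = np.array([1.0, 0.0])

# With theta = 0 every prediction is 0.5, so the loss is n * log(2)
print(regularized_nll(np.zeros(2), X, y, lam=10.0))
```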

<p>Gradient descent does not need the value of the loss itself, only its gradient with respect to $\theta$. For a given $n \times d$ matrix $X$ of data, an $n \times 1$ vector of labels (0/1) $y$, and the corresponding $n \times 1$ vector of predictions $\hat{y}$, the gradient of the loss function, written as a $1 \times d$ row vector, is</p>

$$\nabla l(\theta) = (\hat{y} - y)^{T} \cdot X + \lambda \cdot \theta$$
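
One way to gain confidence in this gradient expression is to compare it against a central finite-difference approximation of the loss. The sketch below does this on small synthetic data. The helper names, the random data, and the step size `h` are all assumptions made for illustration, and $\theta$ and $y$ are stored as 1-D arrays.

```python
import numpy as np


def loss(theta, X, y, lam):
    # regularized negative log-likelihood, written with logaddexp for stability
    z = X @ theta
    nll = np.sum(y * np.logaddexp(0.0, -z) + (1.0 - y) * np.logaddexp(0.0, z))
    return nll + 0.5 * lam * np.sum(theta ** 2)


def analytic_grad(theta, X, y, lam):
    # (y_hat - y)^T X + lambda * theta, with y_hat = sigmoid(X theta)
    y_hat = np.exp(-np.logaddexp(0.0, -(X @ theta)))
    return (y_hat - y) @ X + lam * theta


def numerical_grad(f, theta, h=1e-6):
    # central differences, one coordinate at a time
    g = np.zeros_like(theta)
    for j in range(theta.size):
        step = np.zeros_like(theta)
        step[j] = h
        g[j] = (f(theta + step) - f(theta - step)) / (2.0 * h)
    return g


rng = np.random.default_rng(0)              # synthetic data, for the check only
X = rng.normal(size=(20, 3))
y = (rng.random(20) < 0.5).astype(float)
theta = rng.normal(size=3)
lam = 10.0

numeric = numerical_grad(lambda t: loss(t, X, y, lam), theta)
max_abs_diff = np.max(np.abs(analytic_grad(theta, X, y, lam) - numeric))
print(max_abs_diff)                         # should be close to zero
```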


## Prepare data

Load the dataset file spambase_data.csv using pandas, and then split the dataset into a train set and a test set.

In [3]:
import pandas as pd

# read in raw dataset
spam_df = pd.read_csv("spambase_data.csv")
spam_df.head()

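| Unnamed: 0 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.64 | 0.64 | 0.0 | 0.32 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.778 | 0.0 | 0.0 | 3.756 | 61 | 278 | 1 |
| 1 | 0.21 | 0.28 | 0.5 | 0.0 | 0.14 | 0.28 | 0.21 | 0.07 | 0.0 | 0.94 | ... | 0.0 | 0.132 | 0.0 | 0.372 | 0.18 | 0.048 | 5.114 | 101 | 1028 | 1 |
| 2 | 0.06 | 0.0 | 0.71 | 0.0 | 1.23 | 0.19 | 0.19 | 0.12 | 0.64 | 0.25 | ... | 0.01 | 0.143 | 0.0 | 0.276 | 0.184 | 0.01 | 9.821 | 485 | 2259 | 1 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.63 | 0.0 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.0 | 0.137 | 0.0 | 0.137 | 0.0 | 0.0 | 3.537 | 40 | 191 | 1 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.63 | 0.0 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.0 | 0.135 | 0.0 | 0.135 | 0.0 | 0.0 | 3.537 | 40 | 191 | 1 |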


In [4]:
from sklearn.model_selection import train_test_split

# split the dataset into a train set (80%) and a test set (20%);
# the last column holds the 0/1 label, all other columns are features
percent_test = 0.2
X_train, X_test, y_train, y_test = train_test_split(
    spam_df.iloc[:, :-1], spam_df.iloc[:, -1], test_size=percent_test
)

# convert DataFrames to NumPy arrays; labels become (n, 1) column vectors
X_train = X_train.to_numpy()
X_test = X_test.to_numpy()
y_train = y_train.to_numpy().reshape(X_train.shape[0], 1)
y_test = y_test.to_numpy().reshape(X_test.shape[0], 1)

## Gradient Descent

Using the loss gradient equation above, implement gradient descent (using only the train set) to find the parameters $\theta$ of the logistic regression model.

In [5]:
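import numpy as np


def sig(x):
    """Numerically stable sigmoid."""
    # np.where evaluates both branches, so only exp(-|x|) is computed:
    # it lies in (0, 1] and cannot overflow for large |x|
    e = np.exp(-np.abs(x))
    return np.where(x >= 0, 1 / (1 + e), e / (1 + e))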


def gradient_descent(X, y, lr=0.00001, alpha=10, n_steps=3000):
    """
    Arguments:
        X {ndarray} -- (num_points, num_features) shaped ndarray of features
        y {ndarray} -- (num_points, 1) shaped ndarray of 0/1 labels
    Keyword Arguments:
        n_steps {int} -- number of iterations in gradient descent
        lr {float} -- learning rate, step size in gradient descent
        alpha {float} -- coefficient of l2 regularization (aka lambda)
    Returns:
        w {ndarray} -- (1, num_features) shaped ndarray of weights for log. reg.
    """
    num_points, num_features = X.shape
    
    w = np.random.rand(1, num_features)         # initialize weights
    
    for n in range(n_steps):
        
        y_pred = sig(X @ w.transpose())         # make prediction
                
        nll_gr = (y_pred - y).transpose() @ X   # gradient of loss function
        l2_gr = alpha * w
        nll_l2_gr =  nll_gr + l2_gr
        
        if (n%250 == 0):
            print('ccr = {:.5f}'.format(((y_pred > 0.5) == y).mean()))
            
        w -= lr * nll_l2_gr        # update w
        
    return w 

In [6]:
w = gradient_descent(X_train, y_train)

ccr = 0.39864
ccr = 0.62527
ccr = 0.43777
ccr = 0.62772
ccr = 0.49538
ccr = 0.73125
ccr = 0.63152
ccr = 0.74592
ccr = 0.48342
ccr = 0.54239
ccr = 0.48940
ccr = 0.66033


## Compute CCR
Report the correct classification rate (CCR) of the model on train data and test data. The CCR is defined as 

$$CCR = \frac{\text{number of correct predictions}}{\text{number of samples}}$$
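
To make the definition concrete, here is a small sketch of the CCR on hand-made values. The function name `ccr`, its `threshold` parameter, and the example arrays are hypothetical and chosen only for illustration.

```python
import numpy as np


def ccr(y_true, y_prob, threshold=0.5):
    # flatten both inputs so (n,) and (n, 1) arrays are handled alike
    y_true = np.asarray(y_true).ravel()
    y_hat = (np.asarray(y_prob).ravel() >= threshold).astype(int)
    # fraction of samples whose predicted label matches the true label
    return float(np.mean(y_hat == y_true))


y_true = np.array([1, 0, 1, 1])
y_prob = np.array([0.9, 0.2, 0.4, 0.5])   # thresholded predictions: 1, 0, 0, 1
print(ccr(y_true, y_prob))                # 3 of 4 correct -> 0.75
```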

In [7]:
# predict on test data and train data
y_pred_test = sig(X_test @ w.transpose()) >= 0.5
y_pred_train = sig(X_train @ w.transpose()) >= 0.5

# calculate CCR
ccr_test = (y_pred_test == y_test).mean()
ccr_train = (y_pred_train == y_train).mean()
print('Test ccr = {:.5f}, Train ccr = {:.5f}'.format(ccr_test, ccr_train))

Test ccr = 0.64929, Train ccr = 0.64022


  
