# Logistic Regression Evaluative

- In this evaluative, you will be implementing a logistic regression model from scratch.
- You will use `train.csv` as the training dataset to train your model and evaluate its performance on `test.csv`.
- Use **ONLY** a logistic regression model.
- You will be evaluated on the basis of the accuracy score on the test dataset.
- **DO NOT** change the notebook name. The notebook name should be `eval.ipynb`.

Guidelines to be followed:
- You can refer to the previous labs as you wish. Using pre-implemented setups is not allowed and will be given a 0.
- Submit your results in the format shown in the `sample_submission.csv` file.

In [22]:
#These are the only imports allowed.
import os
import numpy as np
from matplotlib import pyplot
import pandas as pd
%matplotlib inline

In [23]:
# Seeding numpy for deterministic behavior
# DO NOT CHANGE THIS (unless you know what you are doing)
np.random.seed(42)

## Data Preprocessing

In [24]:
df = pd.read_csv('train.csv')
df.describe()

Unnamed: 0,Age,Fee,PhotoAmt,target
count,10383.0,10383.0,10383.0,10383.0
mean,11.770105,23.912646,3.612058,0.734277
std,19.487016,80.72063,3.175399,0.441739
min,0.0,0.0,0.0,0.0
25%,2.0,0.0,2.0,0.0
50%,4.0,0.0,3.0,1.0
75%,12.0,0.0,5.0,1.0
max,255.0,2000.0,30.0,1.0


In [25]:
df.head()

Unnamed: 0,Type,Age,Breed1,Gender,Color1,Color2,MaturitySize,FurLength,Vaccinated,Sterilized,Health,Fee,PhotoAmt,target
0,Dog,2,Mixed Breed,Male,Black,Brown,Medium,Medium,Yes,No,Healthy,0,3,1
1,Dog,3,Jack Russell Terrier,Female,Brown,White,Medium,Short,Yes,No,Healthy,500,1,1
2,Cat,3,Domestic Short Hair,Female,Gray,White,Small,Medium,No,No,Healthy,0,1,1
3,Dog,2,Mixed Breed,Female,Black,Brown,Medium,Medium,Yes,No,Healthy,0,7,1
4,Dog,12,Poodle,Male,Brown,Cream,Medium,Medium,Yes,Yes,Healthy,0,8,1


In [26]:
df = pd.get_dummies(df, columns=['Type','Breed1','Gender','Color1','Color2','MaturitySize','FurLength','Vaccinated','Sterilized','Health']).astype(float)
df.head()

Unnamed: 0,Age,Fee,PhotoAmt,target,Type_Cat,Type_Dog,Breed1_0,Breed1_Abyssinian,Breed1_Akita,Breed1_American Bulldog,...,FurLength_Short,Vaccinated_No,Vaccinated_Not Sure,Vaccinated_Yes,Sterilized_No,Sterilized_Not Sure,Sterilized_Yes,Health_Healthy,Health_Minor Injury,Health_Serious Injury
0,2.0,0.0,3.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0
1,3.0,500.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0
2,3.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
3,2.0,0.0,7.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0
4,12.0,0.0,8.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0


In [27]:
df_min = df.min()
df_max = df.max()

# Scale the features
df = (df - df_min) / (df_max - df_min)

df.head()

Unnamed: 0,Age,Fee,PhotoAmt,target,Type_Cat,Type_Dog,Breed1_0,Breed1_Abyssinian,Breed1_Akita,Breed1_American Bulldog,...,FurLength_Short,Vaccinated_No,Vaccinated_Not Sure,Vaccinated_Yes,Sterilized_No,Sterilized_Not Sure,Sterilized_Yes,Health_Healthy,Health_Minor Injury,Health_Serious Injury
0,0.007843,0.0,0.1,1.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0
1,0.011765,0.25,0.033333,1.0,0.0,1.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0
2,0.011765,0.0,0.033333,1.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
3,0.007843,0.0,0.233333,1.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0
4,0.047059,0.0,0.266667,1.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0
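
Min-max scaling divides by `max - min`, so a column that is constant in the training data would cause a division by zero. The sketch below is one way to guard against that and to keep the training statistics so the same transform can be reused on the test set. `fit_min_max` and `apply_min_max` are hypothetical helper names, not part of the lab, and the toy frame is for illustration only.

```python
import pandas as pd

def fit_min_max(frame):
    """Return the per-column minimum and range, with zero ranges replaced by 1."""
    col_min = frame.min()
    col_range = frame.max() - col_min
    # A constant column has range 0; dividing by 1 instead leaves it at 0.
    col_range = col_range.replace(0, 1)
    return col_min, col_range

def apply_min_max(frame, col_min, col_range):
    """Scale a frame with statistics computed on the training data."""
    return (frame - col_min) / col_range

# Toy data, for illustration only
toy = pd.DataFrame({"Age": [2.0, 3.0, 12.0], "Const": [1.0, 1.0, 1.0]})
toy_min, toy_range = fit_min_max(toy)
scaled = apply_min_max(toy, toy_min, toy_range)
```

Calling `apply_min_max` on the test frame with the training `col_min` and `col_range` keeps both sets on the same scale.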


Conversion of data to numpy arrays

In [34]:
x = df.copy().drop('target', axis=1).to_numpy()
y = df.copy()['target'].to_numpy().reshape(-1, 1)

## Learning Weights

Randomly initialize the weights using samples from a standard normal distribution

In [52]:
w = np.random.randn(x.shape[1] + 1, 1)  # one weight per feature, plus one for the bias term

Make a prediction using the random weights

In [58]:
# Define the sigmoid function
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Function to make predictions
def predict(x, w):
    # Add bias term (intercept) as a leading column of ones
    features_with_bias = np.hstack((np.ones((x.shape[0], 1)), x))

    # Calculate the linear combination of features and weights
    z = np.dot(features_with_bias, w)

    # Apply the sigmoid function to get predicted probabilities
    probabilities = sigmoid(z)

    # Classify based on a threshold (e.g., 0.5)
    predictions = (probabilities >= 0.5).astype(int)

    return predictions

predictions = predict(x, w)
print(predictions)

Define the Loss Function

In [59]:
def mse_loss_fn(y_true, y_pred):
    return 0.5 * np.mean((y_true - y_pred)**2)
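
The mean squared error above can serve as a loss, but logistic regression is more commonly trained with binary cross-entropy, which is also what `costFunction` in this notebook computes. A minimal sketch, assuming `y_pred` holds probabilities rather than thresholded labels; the name `bce_loss_fn` and the clipping constant `eps` are illustrative choices.

```python
import numpy as np

def bce_loss_fn(y_true, y_pred, eps=1e-12):
    """Binary cross-entropy between labels in {0, 1} and predicted probabilities."""
    # Clip so that np.log never receives exactly 0 or 1
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# Predicting 0.5 for every example gives a loss of log(2), whatever the labels are
example_loss = bce_loss_fn(np.array([1.0, 0.0, 1.0]), np.array([0.5, 0.5, 0.5]))
```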

Define the gradient of the loss function with respect to the weights

In [62]:
def costFunction(theta, x, y):
    """
    Compute cost and gradient for logistic regression. 
    
    Parameters
    ----------
    theta : array_like
        The parameters for logistic regression. This is a vector
        of shape (n+1, ).
    
    x : array_like
        The input dataset of shape (m x n+1) where m is the total number
        of data points and n is the number of features. We assume the 
        intercept has already been added to the input.
    
    y : array_like
        Labels for the input. This is a vector of shape (m, ).
    
    Returns
    -------
    J : float
        The computed value for the cost function. 
    
    grad : array_like
        A vector of shape (n+1, ) which is the gradient of the cost
        function with respect to theta, at the current values of theta.
        
    Instructions
    ------------
    Compute the cost of a particular choice of theta. You should set J to 
    the cost. Compute the partial derivatives and set grad to the partial
    derivatives of the cost w.r.t. each parameter in theta.
    """
    # Initialize some useful values
    y = np.ravel(y)  # flatten (m, 1) labels to (m, ) so the dot products below return scalars
    m = y.size  # number of training examples

    # You need to return the following variables correctly 
    J = 0
    grad = np.zeros(theta.shape)

    # ====================== YOUR CODE HERE ======================

    # Compute the hypothesis (predicted probabilities)
    h = 1 / (1 + np.exp(-np.dot(x, theta)))

    # Clip a copy of h so that np.log never receives exactly 0 or 1
    h_safe = np.clip(h, 1e-12, 1 - 1e-12)

    # Compute the cost function J (binary cross-entropy)
    J = (-1 / m) * (np.dot(y, np.log(h_safe)) + np.dot(1 - y, np.log(1 - h_safe)))

    # Compute the gradient
    grad = (1 / m) * np.dot(x.T, h - y)

    # =============================================================
    return J, grad
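
A finite-difference check is a quick way to confirm that an analytic gradient agrees with its cost. The sketch below is self-contained: `cost_and_grad` is a compact stand-in for `costFunction`, and the toy data, the helper `numerical_grad`, and the step size `eps` are assumptions made for illustration.

```python
import numpy as np

def cost_and_grad(theta, x, y):
    """Cross-entropy cost and gradient; x already includes the intercept column."""
    h = 1 / (1 + np.exp(-x @ theta))
    cost = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
    grad = x.T @ (h - y) / y.size
    return cost, grad

def numerical_grad(theta, x, y, eps=1e-6):
    """Central-difference approximation of the gradient of the cost."""
    approx = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        cost_plus = cost_and_grad(theta + step, x, y)[0]
        cost_minus = cost_and_grad(theta - step, x, y)[0]
        approx[i] = (cost_plus - cost_minus) / (2 * eps)
    return approx

# Small random problem, for illustration only
rng = np.random.default_rng(0)
x_toy = np.hstack((np.ones((20, 1)), rng.normal(size=(20, 3))))
y_toy = (rng.random(20) > 0.5).astype(float)
theta_toy = rng.normal(size=4)

_, analytic = cost_and_grad(theta_toy, x_toy, y_toy)
max_diff = np.max(np.abs(analytic - numerical_grad(theta_toy, x_toy, y_toy)))
```

A `max_diff` that is tiny relative to the gradient entries suggests the analytic gradient is consistent with the cost.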

Write the Gradient Descent Algorithm and train the model

In [65]:
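m, n = x.shape  # number of training examples and number of features

# costFunction expects the intercept column to be part of its input
x_with_bias = np.hstack((np.ones((m, 1)), x))
initial_theta = np.zeros(n + 1)

# Flatten y from (m, 1) to (m, ) to match the shapes costFunction documents
cost, grad = costFunction(initial_theta, x_with_bias, y.ravel())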
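
Batch gradient descent repeatedly steps the weights against the gradient of the cost. The loop below is a minimal sketch on synthetic data; `gradient_descent`, the learning rate `lr`, and the iteration count `n_iters` are assumptions to tune, not values prescribed by this lab.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def gradient_descent(x, y, lr=0.1, n_iters=1000):
    """Fit logistic regression weights; x must already contain the intercept column."""
    theta = np.zeros(x.shape[1])
    history = []
    for _ in range(n_iters):
        h = sigmoid(x @ theta)
        # Record the cross-entropy cost (clipped for numerical safety)
        h_safe = np.clip(h, 1e-12, 1 - 1e-12)
        history.append(-np.mean(y * np.log(h_safe) + (1 - y) * np.log(1 - h_safe)))
        # Step against the gradient of the mean cross-entropy
        theta -= lr * (x.T @ (h - y)) / y.size
    return theta, history

# Synthetic, linearly separable data for illustration only
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 2))
labels = (features[:, 0] + features[:, 1] > 0).astype(float)
x_demo = np.hstack((np.ones((200, 1)), features))

theta_demo, cost_history = gradient_descent(x_demo, labels)
```

Plotting `cost_history` with `pyplot.plot` is a simple way to see whether the chosen learning rate is converging.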

Print the Training Accuracy
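
Accuracy is the fraction of predictions that match the labels. A short sketch, assuming predictions and labels are 0/1 arrays; `accuracy` is a hypothetical helper name. Both inputs are flattened with `np.ravel` so that comparing a column vector of shape (m, 1) with a flat array of shape (m, ) does not broadcast into an (m, m) matrix.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of matching labels, after flattening both inputs."""
    return np.mean(np.ravel(y_true) == np.ravel(y_pred))

# Toy labels of shape (4, 1) against flat predictions of shape (4, )
train_acc = accuracy(np.array([[1], [0], [1], [1]]), np.array([1, 0, 0, 1]))
print(f"Training accuracy: {train_acc:.2%}")
```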

## Predictions on Test Set

Preprocess the test set in the same way as the training set.

**NOTE:** The number of features in the test set may differ from the number of features in the training set, because some categorical features may have a different number of categories in the two sets. This is a common problem in data preprocessing, and you will have to find a way around it. (If you did the previous labs, you might have encountered the same problem.)

**NOTE:** Depending on your method of preprocessing, you might not encounter this error
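
One common way to handle mismatched one-hot columns is to reindex the encoded test frame onto the training columns: categories missing from the test set are filled with 0, and categories never seen in training are dropped. A sketch on toy frames follows; the variable names are illustrative, and this is only one of several possible approaches.

```python
import pandas as pd

train_raw = pd.DataFrame({"Age": [2, 3, 12], "Type": ["Dog", "Cat", "Dog"]})
test_raw = pd.DataFrame({"Age": [5, 1], "Type": ["Dog", "Rabbit"]})

train_enc = pd.get_dummies(train_raw, columns=["Type"]).astype(float)
test_enc = pd.get_dummies(test_raw, columns=["Type"]).astype(float)

# Keep exactly the training columns, in the training order.
# Missing columns (Type_Cat) become 0; extra ones (Type_Rabbit) are dropped.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0.0)
```

In the notebook, the label column `target` would need to be left out of the column list if `test.csv` carries no labels.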

Convert the test set to numpy array(s)

Use the learned weights to make predictions on the test set

Save these predictions as a csv file called `submission.csv` in the format given in `sample_submission.csv`
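
The checks in the submission cells expect exactly two columns, `ID` and `target`, with one row per test example. The sketch below writes such a file; the toy predictions and the use of row positions as `ID` values are assumptions, so take the real IDs from `test.csv` or `sample_submission.csv`.

```python
import numpy as np
import pandas as pd

# Toy predictions standing in for the model's output on the test set
test_predictions = np.array([[1], [0], [1]])

submission = pd.DataFrame({
    "ID": np.arange(len(test_predictions)),  # assumption: IDs are row positions
    "target": test_predictions.ravel().astype(int),
})

# index=False keeps the file to the two expected columns
submission.to_csv("submission.csv", index=False)
```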

## Submission Cells

We will now zip and prepare the notebook and csv for submission.

Preliminary checks to ensure `submission.csv` is in the correct format.

In [None]:
df_temp = pd.read_csv('submission.csv')
test_temp = pd.read_csv('test.csv')
assert len(df_temp.columns) == 2, "Number of columns in the submission file is not correct, check the submission format"
assert list(df_temp.columns) == ['ID', 'target'] , "Column names are not correct, check the submission format"
assert df_temp['target'].isin([0, 1]).all(), "The prediction should be 0 or 1 only"
assert len(df_temp) == len(test_temp), "Number of rows in the submission file is not correct"

Preparing the submission zip<br>
Note: Ensure that your notebook has been saved up to this point under the name `eval.ipynb`.

In [None]:
import shutil
import os

if not os.path.exists('temp'):
    os.makedirs('temp')

if os.path.exists('submission.csv'):
    shutil.copy('submission.csv','temp/submission.csv')

if os.path.exists('eval.ipynb'):
    shutil.copy('eval.ipynb',os.path.join('temp','eval.ipynb'))

shutil.make_archive('submission', 'zip', 'temp')
shutil.rmtree('temp')

Submit the `submission.zip` file to Kaggle.