# Logistic Regression Evaluative

- In this evaluative, you will be implementing a logistic regression model from scratch.
- You will use `train.csv` as your training dataset to train your model and evaluate its performance on `test.csv`.
- Use **ONLY** a logistic regression model.
- You will be evaluated on the basis of the accuracy score on the test dataset.
- **DO NOT** change the notebook name. The notebook name should be `eval.ipynb`.

Guidelines to be followed:
- You may refer to previous labs as you wish. Using pre-implemented setups is not allowed and will be given a 0.
- You are to submit your results in the format shown in the sample submissions.csv file.

In [1]:
#These are the only imports allowed.
import os
import numpy as np
from matplotlib import pyplot
import pandas as pd
%matplotlib inline

In [2]:
# Seeding numpy for deterministic behavior
# DO NOT CHANGE THIS (unless you know what you are doing)
np.random.seed(42)

## Data Preprocessing

In [3]:
# Read the CSV file into a pandas DataFrame
df = pd.read_csv('train.csv')

df.head()

Unnamed: 0,Type,Age,Breed1,Gender,Color1,Color2,MaturitySize,FurLength,Vaccinated,Sterilized,Health,Fee,PhotoAmt,target
0,Dog,2,Mixed Breed,Male,Black,Brown,Medium,Medium,Yes,No,Healthy,0,3,1
1,Dog,3,Jack Russell Terrier,Female,Brown,White,Medium,Short,Yes,No,Healthy,500,1,1
2,Cat,3,Domestic Short Hair,Female,Gray,White,Small,Medium,No,No,Healthy,0,1,1
3,Dog,2,Mixed Breed,Female,Black,Brown,Medium,Medium,Yes,No,Healthy,0,7,1
4,Dog,12,Poodle,Male,Brown,Cream,Medium,Medium,Yes,Yes,Healthy,0,8,1


In [4]:
#Do one hot encoding

# Assuming your DataFrame is named df
columns_to_encode = ['Type', 'Breed1', 'Gender', 'Color1', 'Color2', 'MaturitySize', 'FurLength', 'Vaccinated', 'Sterilized', 'Health']

# Apply one-hot encoding using get_dummies
df = pd.get_dummies(df, columns=columns_to_encode)

df.head()

Unnamed: 0,Age,Fee,PhotoAmt,target,Type_Cat,Type_Dog,Breed1_0,Breed1_Abyssinian,Breed1_Akita,Breed1_American Bulldog,...,FurLength_Short,Vaccinated_No,Vaccinated_Not Sure,Vaccinated_Yes,Sterilized_No,Sterilized_Not Sure,Sterilized_Yes,Health_Healthy,Health_Minor Injury,Health_Serious Injury
0,2,0,3,1,False,True,False,False,False,False,...,False,False,False,True,True,False,False,True,False,False
1,3,500,1,1,False,True,False,False,False,False,...,True,False,False,True,True,False,False,True,False,False
2,3,0,1,1,True,False,False,False,False,False,...,False,True,False,False,True,False,False,True,False,False
3,2,0,7,1,False,True,False,False,False,False,...,False,False,False,True,True,False,False,True,False,False
4,12,0,8,1,False,True,False,False,False,False,...,False,False,False,True,False,False,True,True,False,False


In [6]:
# get_dummies produces boolean columns, and NumPy does not support `-` on booleans
# (this raised "TypeError: numpy boolean subtract ... is not supported"), so cast to float first
df = df.astype(float)

# Store the min/max values to be used at test time
df_min = df.min()
df_max = df.max()

# Scale the features
df = (df - df_min) / (df_max - df_min)

df.head()

In [None]:
x = df.copy().drop('target', axis=1).to_numpy() # (N, D)
y = df.copy()['target'].to_numpy().reshape(-1, 1) # (N, 1)

## Learning Weights

Randomly initialize the weights using samples from a standard normal distribution

In [None]:
w = np.random.randn(1, x.shape[1])
b = np.random.randn(1, 1)

Make a prediction using the random weights

In [None]:
# Compute the linear combination of weights and features plus bias
z = np.dot(x, w.T) + b

# Apply the logistic sigmoid function to get the predicted probability
predicted_probabilities = 1 / (1 + np.exp(-z))

# Make a binary prediction (0 or 1) based on a threshold (e.g., 0.5)
# For example, if predicted_probability is greater than 0.5, classify as 1, otherwise as 0
predictions = (predicted_probabilities > 0.5).astype(int)

# Print the predicted probabilities and binary predictions
print("Predicted Probabilities:")
print(predicted_probabilities)

print("Binary Predictions:")
print(predictions)

Predicted Probabilities:
[[0.0127581 ]
 [0.00272123]
 [0.06873924]
 ...
 [0.04211053]
 [0.00359642]
 [0.67139339]]
Binary Predictions:
[[0]
 [0]
 [0]
 ...
 [0]
 [0]
 [1]]


In [None]:
# Calculate the number of correct predictions
correct_predictions = np.sum(predictions == y)

# Calculate the total number of data points
total_data_points = len(y)

# Calculate the accuracy as the ratio of correct predictions to total data points
accuracy = correct_predictions / total_data_points

# Convert accuracy to percentage by multiplying by 100
accuracy_percentage = accuracy * 100

print("Accuracy:", accuracy_percentage, "%")

Accuracy: 27.516132139073484 %


Define the Loss Function

Write the Gradient Descent Algorithm and train the model

Define the gradient of the loss function with respect to the weights
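
For reference, the loss and gradient implemented in the next code cell can be written as follows. The notation is taken from that code: $m$ is the number of data points, $X$ is the feature matrix including the bias column, $y$ holds the 0/1 labels, and $\theta$ is the parameter vector.

$$
h = \sigma(X\theta), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}
$$

$$
L(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log h_i + (1 - y_i) \log(1 - h_i) \right]
$$

$$
\nabla_\theta L = \frac{1}{m} X^\top (h - y)
$$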

In [None]:
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def binary_cross_entropy_loss(y, y_pred):
    m = len(y)
    y_pred = np.clip(y_pred, 1e-15, 1 - 1e-15)
    loss = - (1/m) * np.sum(y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred))
    return loss

def gradient_descent(x, y, alpha, num_iterations):
    m, n = x.shape  # m is the number of data points, n is the number of features (including bias)
    theta = np.zeros((n, 1))  # Initialize theta with zeros

    for iteration in range(num_iterations):
        z = np.dot(x, theta)
        h = sigmoid(z)

        gradient = np.dot(x.T, (h - y)) / m

        # Update theta using the gradient and learning rate alpha
        theta -= alpha * gradient

        # Compute the loss for monitoring
        cost = binary_cross_entropy_loss(y, h)
        print(f"Iteration {iteration + 1}/{num_iterations}, Loss: {cost}")

    return theta

# Example usage:
# Assuming X is your feature matrix (including a column of ones for the bias term)
# and y is your binary labels (0 or 1)

# Set hyperparameters
learning_rate = 0.001
num_iterations = 3000

# Add a column of ones to X for the bias term
X_with_bias = np.column_stack((np.ones((x.shape[0], 1)), x))

# Apply gradient descent to optimize theta
optimized_theta = gradient_descent(X_with_bias, y, learning_rate, num_iterations)


Iteration 1/3000, Loss: 0.6931471805599453
Iteration 2/3000, Loss: 0.6928312884205975
Iteration 3/3000, Loss: 0.6925162764888377
Iteration 4/3000, Loss: 0.692202142273235
Iteration 5/3000, Loss: 0.6918888832886183
Iteration 6/3000, Loss: 0.6915764970560693
Iteration 7/3000, Loss: 0.6912649811029139
Iteration 8/3000, Loss: 0.6909543329627151
Iteration 9/3000, Loss: 0.690644550175266
Iteration 10/3000, Loss: 0.6903356302865807
Iteration 11/3000, Loss: 0.6900275708488869
Iteration 12/3000, Loss: 0.6897203694206184
Iteration 13/3000, Loss: 0.6894140235664067
Iteration 14/3000, Loss: 0.6891085308570732
Iteration 15/3000, Loss: 0.6888038888696211
Iteration 16/3000, Loss: 0.6885000951872271
Iteration 17/3000, Loss: 0.688197147399233
Iteration 18/3000, Loss: 0.6878950431011382
Iteration 19/3000, Loss: 0.6875937798945911
Iteration 20/3000, Loss: 0.6872933553873799
Iteration 21/3000, Loss: 0.6869937671934252
Iteration 22/3000, Loss: 0.686695012932771
Iteration 23/3000, Loss: 0.6863970902315767
...
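
One way to gain confidence in the gradient expression used inside `gradient_descent` is to compare it against a finite-difference estimate on a small random problem. The sketch below is illustrative only: `bce_loss`, `analytic_gradient`, `numerical_gradient`, and the toy arrays are hypothetical names and data that are not part of the notebook, and a local random generator is used so the global NumPy seed is left untouched.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def bce_loss(theta, x, y):
    # Same loss as binary_cross_entropy_loss, written as a function of theta
    h = np.clip(sigmoid(x @ theta), 1e-15, 1 - 1e-15)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def analytic_gradient(theta, x, y):
    # Same expression as the gradient inside gradient_descent
    return x.T @ (sigmoid(x @ theta) - y) / len(y)

def numerical_gradient(theta, x, y, eps=1e-6):
    # Central finite differences, one parameter at a time
    grad = np.zeros_like(theta)
    for j in range(theta.shape[0]):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (bce_loss(theta + step, x, y) - bce_loss(theta - step, x, y)) / (2 * eps)
    return grad

# Toy data (made up); a local generator keeps np.random.seed(42) unaffected
rng = np.random.default_rng(0)
x_toy = rng.normal(size=(20, 3))
y_toy = (rng.random((20, 1)) > 0.5).astype(float)
theta_toy = rng.normal(size=(3, 1))

max_diff = np.max(np.abs(
    analytic_gradient(theta_toy, x_toy, y_toy) - numerical_gradient(theta_toy, x_toy, y_toy)
))
print("Largest gap between analytic and numerical gradient:", max_diff)
```

If the two gradients agree to within a small tolerance, the update rule is at least consistent with the loss being monitored.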

Print the Training Accuracy

In [None]:
def accuracy(y_true, y_pred):
    # Compare predicted labels to actual labels and calculate the accuracy
    correct_predictions = np.sum(y_true == y_pred)
    total_predictions = len(y_true)
    accuracy = correct_predictions / total_predictions
    return accuracy

# Assuming you have a set of actual labels y_true and predicted labels y_pred
# Calculate accuracy
accuracy_score = accuracy(y, (optimized_theta[0] + np.dot(x, optimized_theta[1:])) > 0)
print(f"Accuracy: {accuracy_score * 100:.2f}%")


Accuracy: 73.43%


## Predictions on Test Set

Preprocess the test set in the same way as the training set.
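
"The same way" includes feature scaling: the test features would normally be scaled with the minimum and maximum computed on the training set, not recomputed on the test set. The following is a minimal sketch under that assumption, using made-up toy frames (`train_toy`, `test_toy`) in place of the real data. The guard against constant columns is an addition that the notebook itself does not have.

```python
import pandas as pd

# Toy stand-ins for numeric columns of train.csv / test.csv (values are made up)
train_toy = pd.DataFrame({"Age": [2, 3, 12], "Fee": [0, 500, 0]})
test_toy = pd.DataFrame({"Age": [60, 2], "Fee": [0, 150]})

# Statistics come from the training data only
train_min = train_toy.min()
train_max = train_toy.max()

# Guard against constant columns, where max == min would divide by zero
value_range = (train_max - train_min).replace(0, 1)

train_scaled = (train_toy - train_min) / value_range
test_scaled = (test_toy - train_min) / value_range  # reuse the training statistics

print(test_scaled)
```

Note that scaled test values can fall outside the range [0, 1] when the test set contains values beyond the training minimum or maximum.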

**NOTE:** The number of features in the test set may differ from the number of features in the training set, because some categorical features may have a different number of categories in the two sets. This is a common problem in data preprocessing, and you will have to find a way around it. (If you did the previous labs, you might have encountered the same problem.)

**NOTE:** Depending on your method of preprocessing, you might not encounter this error
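
One possible way around the column mismatch, sketched here on made-up toy frames, is to reindex the encoded test frame onto the encoded training columns. `train_toy`, `test_toy`, and the breed values are illustrative assumptions; `DataFrame.reindex` with `columns=` and `fill_value=` is standard pandas.

```python
import pandas as pd

# Toy frames (made up): "Akita" appears only in train, "Beagle" only in test
train_toy = pd.DataFrame({"Breed1": ["Poodle", "Akita"], "Age": [2, 3]})
test_toy = pd.DataFrame({"Breed1": ["Poodle", "Beagle"], "Age": [60, 1]})

train_enc = pd.get_dummies(train_toy, columns=["Breed1"])
test_enc = pd.get_dummies(test_toy, columns=["Breed1"])

# Keep exactly the training columns, in training order:
# categories missing from the test set are filled with 0,
# and categories seen only in the test set are dropped
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(test_aligned.columns))
```

Because all missing columns are created in a single call, this approach also avoids adding columns one at a time in a loop.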

In [None]:
df_test = pd.read_csv('test.csv')
id_columns = df_test['ID']
df_test = df_test.drop('ID', axis=1)
df_test.head()

Unnamed: 0,Type,Age,Breed1,Gender,Color1,Color2,MaturitySize,FurLength,Vaccinated,Sterilized,Health,Fee,PhotoAmt
0,Dog,60,Mixed Breed,Female,Black,Brown,Large,Short,Yes,Yes,Healthy,0,5
1,Cat,2,Domestic Short Hair,Female,Black,Gray,Small,Short,No,No,Healthy,0,3
2,Dog,1,Mixed Breed,Female,Black,Brown,Medium,Short,No,No,Healthy,0,5
3,Cat,2,Domestic Medium Hair,Female,Black,White,Small,Medium,No,No,Healthy,0,3
4,Cat,12,Domestic Medium Hair,Female,Black,White,Medium,Medium,Yes,Yes,Healthy,150,3


In [None]:
# Apply one-hot encoding using get_dummies
df_test = pd.get_dummies(df_test, columns=columns_to_encode)

df_test.head()

Unnamed: 0,Age,Fee,PhotoAmt,Type_Cat,Type_Dog,Breed1_Abyssinian,Breed1_American Shorthair,Breed1_Australian Kelpie,Breed1_Australian Terrier,Breed1_Beagle,...,FurLength_Short,Vaccinated_No,Vaccinated_Not Sure,Vaccinated_Yes,Sterilized_No,Sterilized_Not Sure,Sterilized_Yes,Health_Healthy,Health_Minor Injury,Health_Serious Injury
0,60,0,5,0,1,0,0,0,0,0,...,1,0,0,1,0,0,1,1,0,0
1,2,0,3,1,0,0,0,0,0,0,...,1,1,0,0,1,0,0,1,0,0
2,1,0,5,0,1,0,0,0,0,0,...,1,1,0,0,1,0,0,1,0,0
3,2,0,3,1,0,0,0,0,0,0,...,0,1,0,0,1,0,0,1,0,0
4,12,150,3,1,0,0,0,0,0,0,...,0,0,0,1,0,0,1,1,0,0


In [None]:
missing_features = set(df.columns) - set(df_test.columns)
for feature in missing_features:
    df_test[feature] = 0
df_test = df_test[df.columns]

# Now, df_test contains all missing columns filled with zeros
df_test.head()



Unnamed: 0,Age,Fee,PhotoAmt,target,Type_Cat,Type_Dog,Breed1_0,Breed1_Abyssinian,Breed1_Akita,Breed1_American Bulldog,...,FurLength_Short,Vaccinated_No,Vaccinated_Not Sure,Vaccinated_Yes,Sterilized_No,Sterilized_Not Sure,Sterilized_Yes,Health_Healthy,Health_Minor Injury,Health_Serious Injury
0,60,0,5,0,0,1,0,0,0,0,...,1,0,0,1,0,0,1,1,0,0
1,2,0,3,0,1,0,0,0,0,0,...,1,1,0,0,1,0,0,1,0,0
2,1,0,5,0,0,1,0,0,0,0,...,1,1,0,0,1,0,0,1,0,0
3,2,0,3,0,1,0,0,0,0,0,...,0,1,0,0,1,0,0,1,0,0
4,12,150,3,0,1,0,0,0,0,0,...,0,0,0,1,0,0,1,1,0,0


Convert the test set to numpy array(s)

In [None]:
missing_cols = set(df_test.columns) - set(df.columns)
missing_cols

set()

In [None]:
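# Convert to numpy array
# Note: 'target' in df_test is only the zero-filled placeholder column added above, not real labels
x_test = df_test.copy().drop('target', axis=1).to_numpy() # (N, D)
y_test = df_test.copy()['target'].to_numpy().reshape(-1, 1) # (N, 1)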

Use the Learnt weights to make predictions on the test set


In [7]:
# Requires `sigmoid` and `optimized_theta` from the training cell above. Running this cell
# before that one raises "NameError: name 'sigmoid' is not defined", so run the training cell first.

# Apply the sigmoid function to get predicted probabilities
y_prob = sigmoid(x_test @ optimized_theta[1:] + optimized_theta[0])

# Make binary predictions based on a threshold (e.g., 0.5)
y_pred = (y_prob > 0.5).astype(int).reshape(-1)

print(id_columns.shape)
print(y_pred.shape)

Save these predictions as a CSV file called `submission.csv` in the format given in `sample_submission.csv`.

In [None]:
df = pd.DataFrame({'ID': id_columns, 'target': y_pred})
df.head()

Unnamed: 0,ID,target
0,1,1
1,2,1
2,3,1
3,4,1
4,5,1


In [None]:
df.to_csv('submission.csv', index=False)

Preliminary checks to ensure `submission.csv` is in the correct format.

## Submission Cells

We will now zip and prepare the notebook and csv for submission.

In [None]:
df_temp = pd.read_csv('submission.csv')
test_temp = pd.read_csv('test.csv')
assert len(df_temp.columns) == 2, "Number of columns in the submission file is not correct, check the submission format"
assert list(df_temp.columns) == ['ID', 'target'] , "Column names are not correct, check the submission format"
assert df_temp['target'].isin([0, 1]).all(), "The prediction should be 0 or 1 only"
assert len(df_temp) == len(test_temp), "Number of rows in the submission file is not correct"

Making the submission zip ready.<br>
Note: Ensure that your notebook has been saved up to this point with the name `eval.ipynb`.

In [None]:
import shutil
import os

if not os.path.exists('temp'):
    os.makedirs('temp')

if os.path.exists('submission.csv'):
    shutil.copy('submission.csv','temp/submission.csv')

if os.path.exists('eval.ipynb'):
    shutil.copy('eval.ipynb',os.path.join('temp','eval.ipynb'))

shutil.make_archive('submission', 'zip', 'temp')
shutil.rmtree('temp')

Submit the `submission.zip` file to Kaggle.