# Logistic Regression: A Vectorized Approach Without Machine Learning Libraries

## 1. Introduction

## 2. Preprocessing

## 3. Initialize Coefficients

## 4. Log-Odds Function

## 5. Sigmoid Function

## 6. Log-Loss Function

## 7. Gradient Descent

## 8. Find Best Threshold

## 9. Performance Metrics

## 10. Submission





# 1. Introduction
This notebook applies logistic regression to the Titanic dataset using only NumPy and pandas. I did not include data exploration visuals, but I did some minor data cleaning before fitting the model. As an overview of the algorithm: I initialized the coefficients and the intercept to zero, used the log-odds function to compute the logarithm of the odds for each passenger, and applied a sigmoid function to restrict the values to the range 0 to 1. I then used gradient descent to find the optimal coefficients and intercept. These are passed through the sigmoid function again, and the resulting sigmoid values are used to pick the best classification threshold. I included accuracy, recall, and precision functions. Submission results are also included.

# 2. Preprocessing

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

train_data = pd.read_csv("/kaggle/input/titanic/train.csv")
#train_data.head(10)

/kaggle/input/titanic/gender_submission.csv
/kaggle/input/titanic/test.csv
/kaggle/input/titanic/train.csv


In [2]:
#Looking at the numeric variables
train_data.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


The Age column has missing values (714 of 891 rows are populated), while the other numeric columns are complete. One approach is to group the data by another variable, compute the average age within each group, and impute that average into the missing values. I will group by each person's title, since people with the same title are likely to have similar ages.


In [3]:
train_data["Name"] = train_data["Name"].str.split(',').str[1]
train_data["Name"] = train_data["Name"].str.split('.').str[0]
train_data["Name"] = train_data["Name"].str.strip()

In [4]:
x = train_data.groupby('Name').agg(['count']).index.get_level_values('Name')
x

Index(['Capt', 'Col', 'Don', 'Dr', 'Jonkheer', 'Lady', 'Major', 'Master',
       'Miss', 'Mlle', 'Mme', 'Mr', 'Mrs', 'Ms', 'Rev', 'Sir', 'the Countess'],
      dtype='object', name='Name')

The names have now been reduced to titles. Next, the average age within each title group is used to fill the missing ages.

In [5]:
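# Fill missing ages with the mean age of passengers sharing the same title.
# Selecting "Age" before transform keeps the mean away from non-numeric columns.
train_data["Age"] = train_data.groupby("Name")["Age"].transform(lambda x: x.fillna(x.mean()))
# Encode sex as 0 (female) or 1 (male); assigning back avoids an in-place edit on a column view
train_data['Sex'] = train_data['Sex'].map({'female':0,'male':1})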
train_data.head()

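| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Mr | 1 | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Mrs | 0 | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Miss | 0 | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Mrs | 0 | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Mr | 1 | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |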
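The title extraction and group-mean imputation described above can be sketched on a tiny made-up frame. The rows, the `toy` variable, and the `Title` column are illustrative assumptions, and the regular expression is one possible alternative to the split-based cells in the notebook.

```python
import numpy as np
import pandas as pd

# Illustrative rows only; not taken from the Titanic files
toy = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley",
        "Allen, Mr. William Henry",
        "Moran, Mr. James",
    ],
    "Age": [22.0, 38.0, 35.0, np.nan],
})

# Capture the text between the comma and the first period, e.g. "Mr"
toy["Title"] = toy["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Replace each missing age with the mean age of rows that share the title
toy["Age"] = toy.groupby("Title")["Age"].transform(lambda s: s.fillna(s.mean()))

print(toy[["Title", "Age"]])
```

Here the missing age belongs to a "Mr" row, so it is filled with the mean of the two known "Mr" ages.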


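Logistic regression does not strictly require standardized data, but the next cell min-max scales every feature to the range 0 to 1 so that all features are on a comparable scale for gradient descent. The major focus of this notebook is the algorithm, so I will use only the six features I consider most potentially useful: Pclass, Sex, Age, SibSp, Parch, and Fare.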

In [6]:
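# Keep the six feature columns (same columns the positional boolean mask selected)
train_data_log = train_data[["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare"]]
# Min-max scale each feature to the range [0, 1]
normalized_data_train=(train_data_log-train_data_log.min())/(train_data_log.max()-train_data_log.min())
# Survived is the label column
train_labels_log = train_data["Survived"]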
normalized_data_train.head()

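| | Pclass | Sex | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 0.271174 | 0.125 | 0.0 | 0.014151 |
| 1 | 0.0 | 0.0 | 0.472229 | 0.125 | 0.0 | 0.139136 |
| 2 | 1.0 | 0.0 | 0.321438 | 0.0 | 0.0 | 0.015469 |
| 3 | 0.0 | 0.0 | 0.434531 | 0.125 | 0.0 | 0.103644 |
| 4 | 1.0 | 1.0 | 0.434531 | 0.0 | 0.0 | 0.015713 |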


# 3. Initializing

In [7]:
def initial_coefs_intercept(data):
    """Function takes a pandas df as input and returns initialized coefficients and intercept"""
    coefficients = []
    intercept = 0
    for i in range(len(data.columns)):
        coefficients.append(0)
    return [coefficients, intercept]

# Unpack both return values from a single call
initial_coefs, initial_intercept = initial_coefs_intercept(normalized_data_train)
print(initial_coefs)
print(initial_intercept)

[0, 0, 0, 0, 0, 0]
0
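Since the rest of the notebook relies on NumPy, the same zero initialization could also be written without a loop. This is a minimal sketch; `init_zeros` is a hypothetical helper name, not a function from the notebook.

```python
import numpy as np

def init_zeros(n_features):
    """Return a zero coefficient vector of length n_features and a zero intercept."""
    return np.zeros(n_features), 0.0

coefs, intercept = init_zeros(6)  # six features, as in the cell above
print(coefs, intercept)
```

Returning a NumPy array rather than a list means later updates can use array arithmetic directly.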


# 4. Log-Odds

In [8]:
#log_odds
def log_odds(data,coefficients,intercept):
    """Takes pandas dataframe, list of coefficients and an intercept value and returns
    an array of the log odds of each row (one value per observation)"""
    return np.dot(data,coefficients) + intercept

l_o= log_odds(normalized_data_train,initial_coefs, initial_intercept)


# 5. Sigmoid

In [9]:
def sigmoid(log_odds_vars):
    """Takes log odds calculated with the log odds functions and returns the sigmoid transformed values
    restricting the values from 0 to 1"""
    sigmoid_values = 1/(1+np.exp(-log_odds_vars))
    
    return sigmoid_values

sigmoid_vals = sigmoid(l_o)
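To make the two steps concrete, here is a small worked example on a made-up 3x2 feature matrix. The matrix `X` and the non-zero coefficients are illustrative assumptions. It shows that zero coefficients and a zero intercept give a probability of 0.5 for every row, which is the starting point for gradient descent.

```python
import numpy as np

# Illustrative feature matrix: 3 rows (observations), 2 features
X = np.array([[0.0, 1.0],
              [1.0, 0.5],
              [0.5, 0.0]])

# Zero initialization: every log-odds value is 0, so every probability is 0.5
z_zero = X @ np.zeros(2) + 0.0
p_zero = 1.0 / (1.0 + np.exp(-z_zero))

# Made-up non-zero coefficients and intercept
coefs = np.array([2.0, -1.0])
intercept = 0.5
z = X @ coefs + intercept          # one log-odds value per row
p = 1.0 / (1.0 + np.exp(-z))       # sigmoid squeezes each value into (0, 1)

print(p_zero)
print(z, p)
```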


# 6. Log-Loss
Log-loss is the loss function. Gradient descent uses the derivative of the loss function to decide how to update the coefficients and the intercept.
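For reference, the standard binary log-loss and its gradients can be written as below. The symbols are notation introduced here rather than names from the notebook: $n$ is the number of rows, $x_i$ the feature vector of row $i$, $y_i$ its label, $w$ the coefficients, $b$ the intercept, and $p_i = \sigma(w^\top x_i + b)$ the sigmoid of the log-odds.

$$
L(w, b) = -\frac{1}{n}\sum_{i=1}^{n}\Big[\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \Big]
$$

$$
\frac{\partial L}{\partial w} = -\frac{1}{n}\sum_{i=1}^{n} (y_i - p_i)\, x_i,
\qquad
\frac{\partial L}{\partial b} = -\frac{1}{n}\sum_{i=1}^{n} (y_i - p_i)
$$

A gradient descent step moves $w$ and $b$ in the direction opposite to these gradients, scaled by the learning rate.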

In [10]:
def log_loss(probabilities, labels):
    """Determines the log loss given a set of sigmoid values (probabilities) and a set of training data labels"""
    #start_time = time.time()
    data_length = len(labels)
    labels = np.array(labels)
    
    left_half = np.dot(labels,np.log(probabilities+.0001)) #small epsilon so log(0) is never evaluated
    right_half = np.dot(1-labels,np.log(1-probabilities+.0001))
    loss = (-1/data_length) * (left_half + right_half)

    #print("--- %s seconds ---" % (time.time() - start_time)) 
    return loss
    
#print(log_loss(sigmoid_vals,train_labels_log))
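As a quick sanity check on the loss, the sketch below computes log-loss on made-up probabilities. It uses `np.clip` to keep probabilities away from 0 and 1, which is an alternative to the additive epsilon used in the cell above, not the notebook's method. The function name `log_loss_clipped` and the `eps` default are assumptions.

```python
import numpy as np

def log_loss_clipped(probabilities, labels, eps=1e-15):
    """Mean binary log-loss, with probabilities clipped into [eps, 1 - eps]."""
    p = np.clip(np.asarray(probabilities, dtype=float), eps, 1 - eps)
    y = np.asarray(labels, dtype=float)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

labels = [0, 1, 1, 0]
# Predicting 0.5 everywhere (the zero-initialized model) gives a loss of ln(2)
loss_half = log_loss_clipped([0.5, 0.5, 0.5, 0.5], labels)
# Confident, correct predictions give a lower loss
loss_good = log_loss_clipped([0.1, 0.9, 0.8, 0.2], labels)

print(loss_half, loss_good)
```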

# 7. Gradient Descent
> ### Using the functions created earlier, together with the gradients for the coefficients and the intercept, I ran gradient descent with a learning rate of .0005 for 50,000 iterations, matching the call in the cell below. I tried several settings and this one had the best performance.

In [11]:
def find_coefficients(data, coefficients, intercept,labels,learning_rate, iterations):
    """Runs gradient descent for a fixed number of iterations and returns the fitted
    coefficients and intercept"""
    coefs = coefficients
    for i in range(iterations):
        l_odds = log_odds(data,coefs,intercept)
        sig_vals = sigmoid(l_odds)
        data_transpose = np.transpose(learning_rate * data)
        # Note: sig_vals*(1-sig_vals) is the sigmoid derivative, so this update follows the
        # gradient of a squared-error loss; the log-loss gradient uses (labels-sig_vals) alone
        coefs = np.dot(data_transpose,(labels-sig_vals) * sig_vals*(1-sig_vals)) + coefs
        intercept = intercept + learning_rate * np.dot((labels-sig_vals), (sig_vals*(1-sig_vals)))
    print(coefs, intercept)
    return coefs, intercept
best_coefs= find_coefficients(normalized_data_train,initial_coefs, initial_intercept,train_labels_log,.0005, 50000)

[-2.51954182 -2.7993534  -3.57881625 -2.86774287 -0.35887018  0.71276646] 4.140646511275284
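For comparison, here is a minimal sketch of gradient descent that follows the log-loss gradient directly, with the error term `y - p` and no extra sigmoid-derivative factor. It is a variant run on a made-up one-feature dataset, not the update used in the cell above. The function names, the averaged gradient, and the default learning rate and iteration count are all assumptions chosen for the toy data.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mean_log_loss(p, y, eps=1e-15):
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def fit_log_loss(X, y, learning_rate=0.5, iterations=2000):
    """Gradient descent on the mean log-loss; returns (coefs, intercept)."""
    coefs = np.zeros(X.shape[1])
    intercept = 0.0
    for _ in range(iterations):
        p = sigmoid(X @ coefs + intercept)
        error = y - p                                    # log-loss error term
        coefs += learning_rate * (X.T @ error) / len(y)  # averaged over rows
        intercept += learning_rate * error.mean()
    return coefs, intercept

# Illustrative data: larger x means the label is more likely to be 1
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

initial_loss = mean_log_loss(sigmoid(np.zeros(len(y))), y)
coefs, intercept = fit_log_loss(X, y)
final_loss = mean_log_loss(sigmoid(X @ coefs + intercept), y)
print(initial_loss, final_loss)
```

Averaging the gradient over the rows keeps the step size independent of the dataset size, which is why a larger learning rate works here than in the summed update above.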


With the fitted coefficients and intercept in hand, pass them through the log-odds and sigmoid functions to get the sigmoid values (predicted probabilities) for the training data.

In [12]:
best_coef = best_coefs[0]
best_int = best_coefs[1]

v = sigmoid(log_odds(normalized_data_train,best_coef,best_int))

#print(v)

# 8. Threshold

In [13]:
def find_threshold(sigmoid_vals, labels):
    """Takes sigmoid vals from best coefficients and best intercept plus the training labels
    and returns the classifier threshold with the best training accuracy"""
    labels = np.array(labels)
    vals = []
    accuracies = []
    
    for num in range(1000):
        threshold = num/1000
        vals.append(threshold)
        predictions = []
        # classify each sigmoid value against the candidate threshold
        for i in sigmoid_vals:
            if i > threshold:
                predictions.append(1)
            else:
                predictions.append(0)
        
        accuracy = 0
        for j in range(len(predictions)):
            if predictions[j] == labels[j]:
                accuracy += 1
        accuracies.append(accuracy/len(predictions))
    indx = accuracies.index(max(accuracies))
    print("Best accuracy on training set:")
    print(max(accuracies))
    best_threshold = vals[indx]
    return best_threshold
    
# pass the values and labels explicitly instead of reading the globals inside the function
best_thresh = find_threshold(v, train_labels_log)
print(best_thresh)

Best accuracy on training set:
0.8148148148148148
0.556
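The same grid search can be expressed with NumPy broadcasting instead of nested loops. This is a sketch under stated assumptions: `best_threshold` and `n_steps` are hypothetical names, and the probabilities and labels are made up.

```python
import numpy as np

def best_threshold(probabilities, labels, n_steps=1000):
    """Return (threshold, accuracy) for the most accurate threshold on a 0..1 grid."""
    probs = np.asarray(probabilities, dtype=float)
    actual = np.asarray(labels) == 1
    thresholds = np.arange(n_steps) / n_steps          # 0.000, 0.001, ..., 0.999
    # Compare every probability with every threshold: shape (n_steps, n_rows)
    preds = probs[None, :] > thresholds[:, None]
    accuracies = (preds == actual).mean(axis=1)
    best = int(np.argmax(accuracies))                  # first maximum, like list.index
    return thresholds[best], accuracies[best]

t, acc = best_threshold([0.1, 0.4, 0.6, 0.9], [0, 0, 1, 1])
print(t, acc)
```

The comparison matrix holds one row per candidate threshold, so memory use grows with `n_steps` times the number of observations. That is small for a dataset of this size.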


# 9. Performance

In [14]:
def calculate_precision(sigmoid_vals, threshold, labels):
    "Precision is  True Positives / (True Positives + False Positives)"
    predictions = []
    true_positives = 0
    false_positives = 0
    for i in sigmoid_vals:
        if i > threshold:
            predictions.append(1)
        else:
            predictions.append(0)
    
    
    for i in range(len(labels)):
        if labels[i] == 1 and labels[i] == predictions[i]:
            true_positives += 1
        elif labels[i] == 0 and labels[i] != predictions[i]:
            false_positives += 1
    
    return true_positives/(true_positives + false_positives)
    
print("Precision:")
print(calculate_precision(v, best_thresh, train_labels_log))

    

Precision:
0.8241758241758241


In [15]:
def calculate_recall(sigmoid_vals, threshold, labels):
    "Recall is  True Positives / (True Positives + False Negatives)"
    predictions = []
    true_positives = 0
    false_negatives = 0
    for i in sigmoid_vals:
        if i > threshold:
            predictions.append(1)
        else:
            predictions.append(0)
    
    
    for i in range(len(labels)):
        if labels[i] == 1 and labels[i] == predictions[i]:
            true_positives += 1
        elif labels[i] == 1 and labels[i] != predictions[i]:
            false_negatives += 1
    
    return true_positives/(true_positives + false_negatives)
    
    
    
print("Recall")          
print(calculate_recall(v, best_thresh, train_labels_log))
    

Recall
0.6578947368421053
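Both metrics come from the same confusion counts, so they can be computed together with boolean arrays. The sketch below uses made-up probabilities and labels, and `precision_recall` is a hypothetical helper name. It assumes at least one predicted positive and one actual positive; otherwise a denominator would be zero.

```python
import numpy as np

def precision_recall(probabilities, threshold, labels):
    """Return (precision, recall) for predictions made at the given threshold."""
    preds = np.asarray(probabilities) > threshold
    actual = np.asarray(labels) == 1
    tp = np.sum(preds & actual)      # predicted 1, actually 1
    fp = np.sum(preds & ~actual)     # predicted 1, actually 0
    fn = np.sum(~preds & actual)     # predicted 0, actually 1
    return tp / (tp + fp), tp / (tp + fn)

precision, recall = precision_recall(
    [0.9, 0.8, 0.3, 0.2, 0.7], 0.5, [1, 0, 1, 0, 1]
)
print(precision, recall)
```

In this toy case there are two true positives, one false positive, and one false negative, so both metrics equal 2/3.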


# 10. Submission

In [16]:
test_data = pd.read_csv("/kaggle/input/titanic/test.csv")


test_data["Name"] = test_data["Name"].str.split(',').str[1]
test_data["Name"] = test_data["Name"].str.split('.').str[0]
test_data["Name"] = test_data["Name"].str.strip()
# Encode sex as 0 (female) or 1 (male), as in the training data
test_data['Sex'] = test_data['Sex'].map({'female':0,'male':1})


# Fill missing ages with the mean age of test passengers sharing the same title
test_data["Age"] = test_data.groupby("Name")["Age"].transform(lambda x: x.fillna(x.mean()))


# Same six features as in training; note the scaling uses the test set's own min and max
test_data_log = test_data[["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare"]]
normalized_data_test=(test_data_log-test_data_log.min())/(test_data_log.max()-test_data_log.min())
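The cell above scales the test features with the test set's own minimum and maximum. One common alternative, sketched here on made-up frames, is to compute the minimum and maximum on the training data and reuse them for the test data, so that both sets are mapped with the same transformation. The helper names `fit_min_max` and `apply_min_max` are hypothetical, and this is not what the notebook's submission used.

```python
import pandas as pd

def fit_min_max(df):
    """Return the per-column minimum and maximum of a DataFrame."""
    return df.min(), df.max()

def apply_min_max(df, col_min, col_max):
    """Scale a DataFrame using previously computed column statistics."""
    return (df - col_min) / (col_max - col_min)

# Illustrative values only
train = pd.DataFrame({"Age": [10.0, 20.0, 30.0], "Fare": [0.0, 50.0, 100.0]})
test = pd.DataFrame({"Age": [15.0, 40.0], "Fare": [25.0, 100.0]})

col_min, col_max = fit_min_max(train)
scaled_test = apply_min_max(test, col_min, col_max)
print(scaled_test)
```

With this approach a test value outside the training range lands outside 0 to 1, as the Age of 40 does here.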


In [17]:
pred_test = sigmoid(log_odds(normalized_data_test,best_coef,best_int))

In [18]:
#looping through sigmoid values and comparing to threshold value. If greater than threshold, predict class 1.
#otherwise predict class 0
classifier = []
for i in range(len(pred_test)):
    if pred_test[i] > best_thresh:
        classifier.append(1)
    else:
        classifier.append(0)
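The loop above can also be written as a single vectorized comparison. A minimal sketch with made-up probabilities and an illustrative threshold, where the variable names are assumptions:

```python
import numpy as np

pred_probs = np.array([0.12, 0.56, 0.91, 0.40])   # illustrative sigmoid values
threshold = 0.5                                    # illustrative threshold

# True where the probability exceeds the threshold, then cast to 0/1
predicted_classes = (pred_probs > threshold).astype(int)
print(predicted_classes.tolist())
```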

In [19]:
data = {'PassengerId': test_data["PassengerId"].values, 'Survived':classifier} 
df_submission = pd.DataFrame(data)

df_submission.to_csv("submission_log_regression2.csv",index=False)

#Accuracy was 0.758 on testing set

Accuracy on the testing set was 0.758. This could probably be improved by trying different learning rates and numbers of iterations.