In [28]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Read in the data

In [29]:
## Importing the dataset
data = pd.read_csv('data/divorce_data.csv', sep=';')

# Data Exploration

In [30]:
# Get the shape of the dataset
num_rows, num_cols = data.shape

# Check for missing values
missing_values = data.isnull().sum().sum()

# Check the balance of the target variable
divorce_counts = data['Divorce'].value_counts()

num_rows, num_cols, missing_values, divorce_counts


(170,
 55,
 0,
 Divorce
 0    86
 1    84
 Name: count, dtype: int64)

The dataset contains 170 rows (i.e., couples) and 55 columns (54 predictors and 1 target). There are no missing values in the dataset, which is good as it simplifies the preprocessing steps.

The target variable "Divorce" is nearly balanced, with 86 non-divorced couples (value 0) and 84 divorced couples (value 1). This matters because a heavily imbalanced target can bias a model toward the majority class and make accuracy a misleading metric.

# Data Preprocessing

In [31]:
from sklearn.model_selection import train_test_split

# Separate features and target
features = data.drop('Divorce', axis=1)
target = data['Divorce']

# Split the data into training and test sets
features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.2, random_state=42)

features_train.shape, features_test.shape, target_train.shape, target_test.shape


((136, 54), (34, 54), (136,), (34,))

The data has been successfully split into training and test sets. We have 136 instances in the training set and 34 instances in the test set. Each instance has 54 features.
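
With only 170 rows, a purely random split can leave the test set with a slightly different class ratio from the full dataset. The sketch below uses synthetic stand-in data (the `demo_*` names are illustrative and not part of this notebook) to show how the `stratify` argument of `train_test_split` keeps the class proportions nearly identical in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 170 rows, 86 zeros and 84 ones
rng = np.random.default_rng(0)
demo_features = rng.integers(0, 5, size=(170, 54))
demo_target = np.array([0] * 86 + [1] * 84)

# stratify=demo_target keeps the class proportions (nearly) equal in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    demo_features, demo_target, test_size=0.2, random_state=42,
    stratify=demo_target)

# Class counts in the training and test splits
print(np.bincount(y_tr), np.bincount(y_te))
```

Whether stratification changes the results reported here is not something this sketch can show; it only illustrates the option.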

# Feature Importance

In this step, we'll use a decision tree-based method to rank the importance of the features in predicting divorce. This will help us identify the key predictors of divorce.

We'll use the Random Forest algorithm from scikit-learn for this. A Random Forest is an ensemble of Decision Trees that is often used for feature selection because it provides a measure of the importance of each feature.

In [32]:
from sklearn.ensemble import RandomForestClassifier

# Initialize the Random Forest classifier
rf = RandomForestClassifier(random_state=42)

# Fit the model to the training data
rf.fit(features_train, target_train)

# Get feature importances
importances = rf.feature_importances_

# Create a DataFrame of features and importances
feature_importances = pd.DataFrame({
    'Feature': features.columns,
    'Importance': importances
})

# Sort the DataFrame by importance
feature_importances = feature_importances.sort_values(by='Importance', ascending=False)

feature_importances



The Random Forest has ranked the features by their importance in predicting the target variable "Divorce".

The five most important features, according to this model, are:

1.) Q40 with an importance of approximately 0.0966

2.) Q17 with an importance of approximately 0.0951

3.) Q18 with an importance of approximately 0.0920

4.) Q19 with an importance of approximately 0.0896

5.) Q12 with an importance of approximately 0.0896

These results suggest that these questions may be particularly important in predicting divorce.
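
For context, scikit-learn normalizes impurity-based importances so that they sum to 1, which means an average feature among 54 would score about 1/54 ≈ 0.0185. The top values above are also close to one another, so their exact order may depend on the random seed. The following sketch, on synthetic data from `make_classification` (an assumption for illustration, not the divorce dataset), refits the forest with several seeds and prints the top three feature indices each time.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only for illustration
X_demo, y_demo = make_classification(
    n_samples=170, n_features=10, n_informative=4, random_state=0)

top3_per_seed = []
importance_sums = []
for seed in range(5):
    forest = RandomForestClassifier(n_estimators=100, random_state=seed)
    forest.fit(X_demo, y_demo)
    importances = forest.feature_importances_
    importance_sums.append(importances.sum())
    # Indices of the three largest importances, highest first
    top3_per_seed.append(np.argsort(importances)[::-1][:3].tolist())

print(importance_sums)   # each sum should be very close to 1
print(top3_per_seed)     # the ranking may or may not vary between seeds
```

If the top-ranked features stay the same across seeds, the ranking can be trusted more than a single fit would suggest.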

# Feature Selection

As a starting point, let's choose the top 10 features. However, we can adjust this number later if necessary. Now, let's select these top features from our training and test datasets.

In [33]:
# Select the top 10 features
top_features = feature_importances['Feature'][:10].tolist()

# Select these top features from the training and test data
features_train_selected = features_train[top_features]
features_test_selected = features_test[top_features]

features_train_selected.head()


Unnamed: 0,Q40,Q17,Q18,Q19,Q12,Q20,Q16,Q11,Q15,Q26
69,0,4,4,4,4,4,4,4,4,4
138,0,0,0,0,0,0,0,0,0,0
2,3,3,3,3,4,2,3,3,3,2
93,0,0,0,0,0,0,0,0,0,0
136,0,1,0,0,1,0,0,0,0,0
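
Since the number of selected features may be adjusted later, it can help to wrap the selection in a small function. This is a minimal sketch: `select_top_features` and the importance values in `demo_importances` are hypothetical and exist only for illustration; the table is assumed to have the same `Feature` and `Importance` columns as `feature_importances`.

```python
import pandas as pd

def select_top_features(importance_table, n):
    """Return the names of the n features with the largest importance."""
    return importance_table.nlargest(n, 'Importance')['Feature'].tolist()

# Hypothetical importance values, for illustration only
demo_importances = pd.DataFrame({
    'Feature': ['Q1', 'Q2', 'Q3', 'Q4'],
    'Importance': [0.10, 0.40, 0.20, 0.30],
})

print(select_top_features(demo_importances, 2))  # ['Q2', 'Q4']
```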


# Implementing k-Nearest Neighbors with scikit-learn

To train and evaluate a k-NN model using scikit-learn, we can use the KNeighborsClassifier class. After training the model, we use it to make predictions on the test set and then compute the accuracy and F1 score.

In [34]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, f1_score

# Initialize the KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)

# Fit the model to the training data
knn.fit(features_train_selected, target_train)

# Predict the target for the test data
target_pred_knn = knn.predict(features_test_selected)

# Compute accuracy
accuracy_knn = accuracy_score(target_test, target_pred_knn)

# Compute F1 score
f1_knn = f1_score(target_test, target_pred_knn)

accuracy_knn, f1_knn


(0.9705882352941176, 0.9743589743589743)

The k-Nearest Neighbors (k-NN) model from scikit-learn achieved an accuracy of approximately 0.971 (or 97.1%) and an F1 score of approximately 0.974 on the test data.
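
Both scores are consistent with a single misclassification among the 34 test instances: 33/34 ≈ 0.9706, and an F1 of 38/39 ≈ 0.9744 corresponds to 19 true positives and one error. The sketch below reproduces that arithmetic. The counts are inferred from the reported scores, not read from a confusion matrix, and treating the single error as a false positive is an assumption (a false negative would give the same two numbers).

```python
def accuracy_and_f1(tp, fp, fn, tn):
    """Compute accuracy and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return accuracy, f1

# Inferred counts: 19 true positives, 14 true negatives, 1 error (assumed FP)
acc, f1 = accuracy_and_f1(tp=19, fp=1, fn=0, tn=14)
print(acc, f1)
```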

# Implementing k-Nearest Neighbors from Scratch

Now let's implement the k-NN algorithm from scratch. The steps are as follows:

1.) Calculate the Euclidean distance between two instances.

2.) Get the k nearest neighbors of a given test instance.

3.) Predict the class of the test instance by taking the mode of the class labels of the k nearest neighbors.

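Before defining the distance function, let's convert our DataFrames to NumPy arrays, which makes the row-wise arithmetic simpler. Note that these arrays are built from features_train and features_test, so they contain all 54 features rather than the 10 selected above.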

In [35]:

X_train = features_train.to_numpy()
y_train = target_train.to_numpy()
X_test = features_test.to_numpy()
y_test = target_test.to_numpy()

print("Data before conversion to Numpy arrays:")
print("features_train:")
print(features_train.head())
print("target_train:")
print(target_train.head())
print("features_test:")
print(features_test.head())
print("target_test:")
print(target_test.head())

print("\nData after conversion to Numpy arrays:")
print("X_train:")
print(X_train)
print("y_train:")
print(y_train)
print("X_test:")
print(X_test)
print("y_test:")
print(y_test)

Data before conversion to Numpy arrays:
features_train:
     Q1  Q2  Q3  Q4  Q5  Q6  Q7  Q8  Q9  Q10  ...  Q45  Q46  Q47  Q48  Q49  \
69    4   4   4   3   4   2   4   4   4    3  ...    4    0    4    4    4   
138   0   0   1   0   0   0   0   1   1    0  ...    3    3    3    3    0   
2     2   2   2   2   1   3   2   1   1    2  ...    2    3    2    3    1   
93    0   1   0   1   0   0   0   0   0    1  ...    1    1    1    2    1   
136   0   0   2   0   0   0   0   0   0    0  ...    2    3    1    2    1   

     Q50  Q51  Q52  Q53  Q54  
69     3    4    4    4    4  
138    1    3    3    3    1  
2      1    1    2    2    2  
93     1    1    0    0    0  
136    2    1    2    2    0  

[5 rows x 54 columns]
target_train:
69     1
138    0
2      1
93     0
136    0
Name: Divorce, dtype: int64
features_test:
     Q1  Q2  Q3  Q4  Q5  Q6  Q7  Q8  Q9  Q10  ...  Q45  Q46  Q47  Q48  Q49  \
139   3   1   1   0   0   0   0   0   0    0  ...    3    3    2    2    0   
30    3 

In [36]:
def calculate_euclidean_distance(instance1, instance2):
    """
    Calculate the Euclidean distance between two instances.
    - instance1: first instance
    - instance2: second instance
    """
    return np.sqrt(np.sum((instance1 - instance2) ** 2)) # subtracts corresponding elements of the two arrays

print(f"Euclidean distance between the first and second training instance: {calculate_euclidean_distance(X_train[0], X_train[1])}")


Euclidean distance between the first and second training instance: 21.97726097583591
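
As a quick sanity check, the same formula can be compared against NumPy's built-in L2 norm on a pair of points whose distance is known. The points `a` and `b` are made up for illustration (a 3-4-5 right triangle).

```python
import numpy as np

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

manual = np.sqrt(np.sum((a - b) ** 2))  # same formula as calculate_euclidean_distance
builtin = np.linalg.norm(a - b)         # NumPy's Euclidean (L2) norm

print(manual, builtin)  # both are 5.0
```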


Next, we'll implement the function to get the k nearest neighbors of a given test instance. This function computes the Euclidean distance from the test instance to each training instance, sorts the distances, and returns the indices of the k training instances with the smallest distances.

In [40]:
def get_k_nearest_neighbors(X_train, x_test, k):
    """
    Get the k nearest neighbors of a test instance.
    - X_train: training features
    - x_test: test instance
    - k: number of neighbors to return
    """
    # Calculate the Euclidean distance from the test instance to each training instance
    distances = np.array([calculate_euclidean_distance(x_train, x_test) for x_train in X_train])
    
    # Get the indices of the k training instances with the smallest distances
    nearest_neighbors = distances.argsort()[:k]
    
    return nearest_neighbors

# Test the function
print(f"Indices of the 3 nearest neighbors of the first test instance: {get_k_nearest_neighbors(X_train, X_test[0], 3)}")


Indices of the 3 nearest neighbors of the first test instance: [122  89  33]
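
The list comprehension above calls the distance function once per training row. A possible alternative, sketched here under the assumption that `X_train` is a 2-D NumPy array and `x_test` a 1-D array of the same width, lets broadcasting compute all distances in one expression. The name `k_nearest_indices_vectorized` and the toy points are illustrative only.

```python
import numpy as np

def k_nearest_indices_vectorized(X_train, x_test, k):
    """Return the indices of the k training rows closest to x_test."""
    # Broadcasting subtracts x_test from every row of X_train at once
    distances = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))
    # A stable sort resolves equal distances by training-set order
    return np.argsort(distances, kind="stable")[:k]

# Toy data: distances from the origin are 0, 1, about 7.07, and 2
toy_train = np.array([[0, 0], [1, 0], [5, 5], [0, 2]])
toy_query = np.array([0, 0])
print(k_nearest_indices_vectorized(toy_train, toy_query, 3))  # [0 1 3]
```

On a dataset of this size the speed difference is unlikely to matter; the main benefit is shorter code.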


Now, we implement the function to predict the class of a test instance. This function will get the k nearest neighbors of the test instance, find the most common class label among these neighbors, and return this class label as the prediction.

In [51]:
from scipy import stats
def predict_with_k_nearest_neighbors(X_train, y_train, x_test, k):
    """
    Predict the class of a test instance using the k nearest neighbors.
    - X_train: training features
    - y_train: training target values
    - x_test: test instance
    - k: number of neighbors to consider
    """
    # Get the k nearest neighbors of the test instance
    nearest_neighbors = get_k_nearest_neighbors(X_train, x_test, k)
    
    # Get the class labels of the nearest neighbors
    class_labels = y_train[nearest_neighbors]
   
    # Predict the most common class label
    prediction = stats.mode(class_labels)[0]
    
    return prediction

# Test the function
prediction = predict_with_k_nearest_neighbors(X_train, y_train, X_test[0], 3)
print(f"Predicted class: {prediction}, Actual class: {y_test[0]}")


Predicted class: 0, Actual class: 0
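
The majority vote itself does not require SciPy. As a minimal sketch using only the standard library (the helper name `majority_vote` is hypothetical), `collections.Counter` can count the neighbor labels and return the most frequent one. With two classes and an odd k such as 3, a tie cannot occur.

```python
from collections import Counter

def majority_vote(labels):
    """Return the most frequent label; on a tie, the label seen first wins."""
    return Counter(labels).most_common(1)[0][0]

print(majority_vote([0, 1, 1]))  # 1
```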


The next step is to use this function to make predictions for multiple test instances. 

In [52]:
def predict_with_k_nearest_neighbors_multiple(X_train, y_train, X_test, k):
    """
    Predict the class of multiple test instances using the k nearest neighbors.
    - X_train: training features
    - y_train: training target values
    - X_test: test features
    - k: number of neighbors to consider
    """
    # Make predictions for each test instance
    predictions = [predict_with_k_nearest_neighbors(X_train, y_train, x_test, k) for x_test in X_test]
    
    return predictions

# Test the function
predictions = predict_with_k_nearest_neighbors_multiple(X_train, y_train, X_test, 3)
print(f"Predicted classes: {predictions[:10]}, Actual classes: {y_test[:10]}")


Predicted classes: [0, 1, 0, 1, 0, 0, 0, 1, 0, 1], Actual classes: [0 1 0 1 0 0 0 1 0 1]


The custom implementation of the k-Nearest Neighbors (k-NN) algorithm appears to work as intended: its predictions match the actual classes for the first 10 instances in the test set.
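
A stronger check than matching ten labels is to compare a from-scratch k-NN against scikit-learn on identical inputs. The sketch below does this on synthetic, continuous data (an assumption chosen so that exact distance ties are practically impossible); `knn_predict` is a compact stand-in for the functions defined above, not a copy of them.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Synthetic training and test data, for illustration only
X_tr = rng.normal(size=(100, 5))
y_tr = (X_tr[:, 0] + X_tr[:, 1] > 0).astype(int)
X_te = rng.normal(size=(20, 5))

def knn_predict(X_train, y_train, X_test, k):
    """Predict labels by majority vote among the k nearest training rows."""
    predictions = []
    for x in X_test:
        distances = np.sqrt(((X_train - x) ** 2).sum(axis=1))
        nearest = np.argsort(distances)[:k]
        # bincount counts each label; argmax picks the most frequent one
        predictions.append(np.bincount(y_train[nearest]).argmax())
    return np.array(predictions)

scratch = knn_predict(X_tr, y_tr, X_te, k=3)
reference = KNeighborsClassifier(n_neighbors=3).fit(X_tr, y_tr).predict(X_te)

# Fraction of test points on which the two implementations agree
print((scratch == reference).mean())
```

Agreement on every test point would indicate that both use the same distance metric and voting rule, at least for inputs without ties.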


Now that we have implemented and tested the k-NN algorithm from scratch, let's evaluate its performance on the entire test set. We'll compute the accuracy and F1 score as we did before.

In [53]:
# Make predictions for the entire test set
predictions = predict_with_k_nearest_neighbors_multiple(X_train, y_train, X_test, 3)

# Compute accuracy
accuracy_knn_custom = accuracy_score(y_test, predictions)

# Compute F1 score
f1_knn_custom = f1_score(y_test, predictions)

accuracy_knn_custom, f1_knn_custom


(0.9705882352941176, 0.9743589743589743)

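The custom implementation of the k-Nearest Neighbors (k-NN) algorithm achieved an accuracy of approximately 0.971 (or 97.1%) and an F1 score of approximately 0.974 on the test data. These scores are identical to those of the scikit-learn model. Note, however, that the custom model was given all 54 features, while the scikit-learn model was trained on the 10 selected features, so the matching scores do not by themselves show that the two implementations are equivalent. Now, let's compute the permutation feature importance for both the scikit-learn and custom k-NN models.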

# Permutation Feature Importance