In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Read in the data

In [6]:
# Import the dataset (the file is semicolon-separated)
data = pd.read_csv('data/divorce_data.csv', sep=';')

# Data Exploration

In [7]:
# Get the shape of the dataset
num_rows, num_cols = data.shape

# Check for missing values
missing_values = data.isnull().sum().sum()

# Check the balance of the target variable
divorce_counts = data['Divorce'].value_counts()

num_rows, num_cols, missing_values, divorce_counts


(170,
 55,
 0,
 Divorce
 0    86
 1    84
 Name: count, dtype: int64)

The dataset contains 170 rows (one per couple) and 55 columns: 54 predictors and 1 target. There are no missing values, so no imputation or row removal is needed during preprocessing.

The target variable "Divorce" is nearly balanced, with 86 non-divorced couples (value 0) and 84 divorced couples (value 1). This is beneficial because a strongly imbalanced target can bias a model toward the majority class and make accuracy a misleading metric.
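As a quick sanity check on the balance claim, the class shares and the accuracy of a trivial majority-class predictor can be computed from the counts above. This is a minimal sketch that rebuilds the target column from the reported counts (86 and 84) rather than reading the CSV; `toy_target`, `class_shares`, and `majority_baseline` are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Rebuild a stand-in target column from the counts reported above
# (assumption: 86 zeros and 84 ones, as in the value_counts output).
toy_target = pd.Series([0] * 86 + [1] * 84, name="Divorce")

# Share of each class instead of raw counts
class_shares = toy_target.value_counts(normalize=True)

# Accuracy obtained by always predicting the most frequent class
majority_baseline = class_shares.max()

print(class_shares.round(3))
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
```

Any model trained later should clearly beat this baseline of roughly 0.51 to be considered useful.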

# Data Preprocessing

In [11]:
from sklearn.model_selection import train_test_split

# Separate features and target
features = data.drop('Divorce', axis=1)
target = data['Divorce']

# Split the data into training and test sets
features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.2, random_state=42)

features_train.shape, features_test.shape, target_train.shape, target_test.shape


((136, 54), (34, 54), (136,), (34,))

The split leaves 136 instances in the training set and 34 instances in the test set (an 80/20 split). Each instance has 54 features.
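With only 34 test rows, the class proportions in the test set can drift from those of the full data by chance. One option, not used in the original notebook, is to pass `stratify=` to `train_test_split`. The sketch below runs on synthetic stand-in data with the same shape and class counts as the real dataset; `X_demo`, `y_demo`, and the 0 to 4 answer scale are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in: 170 rows, 54 answers on a 0-4 scale (assumption),
# with 86 zeros and 84 ones as the target.
X_demo = rng.integers(0, 5, size=(170, 54))
y_demo = np.array([0] * 86 + [1] * 84)

# stratify=y_demo keeps the class ratio nearly identical in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

print(X_tr.shape, X_te.shape)
print("Positive share, train:", round(float(y_tr.mean()), 3))
print("Positive share, test: ", round(float(y_te.mean()), 3))
```

For a target this balanced the effect is small, but stratifying removes one source of run-to-run variation when the test set is tiny.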

# Feature Importance

In this step, we'll use a tree-based method to rank the features by how much they contribute to predicting divorce. This will help us identify the key predictors.

We'll use the Random Forest algorithm from scikit-learn for this. A Random Forest is an ensemble of decision trees. It is often used for feature selection because a fitted forest exposes an importance score for each feature (`feature_importances_`), based on how much that feature reduces impurity across the trees.

In [12]:
from sklearn.ensemble import RandomForestClassifier

# Initialize the Random Forest classifier
rf = RandomForestClassifier(random_state=42)

# Fit the model to the training data
rf.fit(features_train, target_train)

# Get feature importances
importances = rf.feature_importances_

# Create a DataFrame of features and importances
feature_importances = pd.DataFrame({
    'Feature': features.columns,
    'Importance': importances
})

# Sort the DataFrame by importance
feature_importances = feature_importances.sort_values(by='Importance', ascending=False)

# Show the ten highest-ranked features
feature_importances.head(10)


   Feature  Importance
39     Q40    0.096593
16     Q17    0.095148
17     Q18    0.091988
18     Q19    0.089627
11     Q12    0.089565
19     Q20    0.063145
15     Q16    0.057146
10     Q11    0.055608
14     Q15    0.047422
25     Q26    0.041697


The Random Forest has ranked the features by their importance in predicting the target variable "Divorce".

The five most important features, according to this model, are:

- Q40, with an importance of approximately 0.0966
- Q17, with an importance of approximately 0.0951
- Q18, with an importance of approximately 0.0920
- Q19, with an importance of approximately 0.0896
- Q12, with an importance of approximately 0.0896

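These results suggest that these five questions may be particularly useful for predicting divorce: together they account for about 46% of the total importance. Their individual scores are very close (roughly 0.090 to 0.097), so the exact order among them should not be over-interpreted.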
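Impurity-based importances are derived from training-set statistics, so they do not directly measure how much a feature helps on unseen data. A common cross-check is scikit-learn's `permutation_importance`, evaluated on held-out rows. The sketch below is not part of the original analysis and runs on synthetic data from `make_classification`; in the notebook one would presumably pass `rf`, `features_test`, and `target_test` instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: 170 rows, 10 features, 3 informative)
X_demo, y_demo = make_classification(
    n_samples=170, n_features=10, n_informative=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_tr, y_tr)

# Shuffle one column at a time on held-out data and measure the score drop
perm = permutation_importance(
    forest, X_te, y_te, n_repeats=10, random_state=42)

# Rank features from largest to smallest mean drop in accuracy
ranking = perm.importances_mean.argsort()[::-1]
for idx in ranking[:5]:
    print(f"feature {idx}: {perm.importances_mean[idx]:.4f} "
          f"+/- {perm.importances_std[idx]:.4f}")
```

If the two rankings broadly agree, that gives some extra confidence in the selected features; large disagreements would be worth investigating before relying on either ranking.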

# Feature Selection

As a starting point, let's keep the top 10 features; this number can be adjusted later if necessary. We select these columns from both the training and test datasets. Note that the ranking was computed from the training set only, so the test set plays no part in the selection.
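One way to revisit the choice of 10 is to compare several values of k by cross-validation on the training data. The sketch below is an illustration under stated assumptions, not the notebook's method: it wraps the selection step in a `Pipeline` with `SelectFromModel` so the ranking is recomputed inside each fold, it runs on synthetic data, and the candidate values 3, 5, and 10 are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training split (assumption: 136 rows, 54 features)
X_demo, y_demo = make_classification(
    n_samples=136, n_features=54, n_informative=8, random_state=0)

scores_by_k = {}
for k in (3, 5, 10):
    pipe = Pipeline([
        # threshold=-np.inf makes max_features the only selection criterion
        ("select", SelectFromModel(
            RandomForestClassifier(n_estimators=50, random_state=42),
            threshold=-np.inf, max_features=k)),
        ("tree", DecisionTreeClassifier(random_state=42)),
    ])
    # Mean accuracy over 5 folds; the selector is refit within each fold
    scores_by_k[k] = cross_val_score(pipe, X_demo, y_demo, cv=5).mean()

for k, score in scores_by_k.items():
    print(f"top {k:>2} features: mean CV accuracy = {score:.3f}")
```

Keeping the selector inside the pipeline avoids scoring a fold with features that were chosen using that fold's own labels.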

In [13]:
# Select the top 10 features
top_features = feature_importances['Feature'][:10].tolist()

# Select these top features from the training and test data
features_train_selected = features_train[top_features]
features_test_selected = features_test[top_features]

features_train_selected.head()


     Q40  Q17  Q18  Q19  Q12  Q20  Q16  Q11  Q15  Q26
69     0    4    4    4    4    4    4    4    4    4
138    0    0    0    0    0    0    0    0    0    0
2      3    3    3    3    4    2    3    3    3    2
93     0    0    0    0    0    0    0    0    0    0
136    0    1    0    0    1    0    0    0    0    0


# Implementing a Decision Tree with scikit-learn

In [14]:
from sklearn.tree import DecisionTreeClassifier

# Initialize the Decision Tree classifier
dt = DecisionTreeClassifier(random_state=42)

# Fit the model to the training data
dt.fit(features_train_selected, target_train)

# Predict the target for the test data
target_pred = dt.predict(features_test_selected)

target_pred

array([0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1,
       1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0])

We have successfully trained a Decision Tree model using scikit-learn and made predictions on the test data.

Now, let's evaluate the performance of this model. We'll use two common metrics for binary classification problems: accuracy and F1 score. Accuracy is the proportion of correct predictions (both true positives and true negatives) among the total number of cases examined. The F1 score is the harmonic mean of precision and recall, providing a balance between these two metrics.
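To make the two definitions concrete, here is a small worked example that computes both metrics by hand from the four confusion counts. The labels are made up for illustration and are not the notebook's predictions.

```python
# Toy labels (hypothetical, not taken from the divorce dataset)
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives

# Accuracy: share of all predictions that are correct
accuracy = (tp + tn) / len(pairs)

# Precision: share of predicted positives that are truly positive
precision = tp / (tp + fp)
# Recall: share of true positives that were found
recall = tp / (tp + fn)

# F1: harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, "
      f"recall={recall:.2f}, f1={f1:.2f}")
```

scikit-learn's `accuracy_score` and `f1_score` in `sklearn.metrics` compute the same quantities directly from the true and predicted labels.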