# What leads to divorce?

### Table of Contents

* **[Overview](#Overview)**

* **[Data Exploration](#Data-Exploration)**  
    * [Check for Data Quality](#Check-for-Data-Quality)
        * [Missing Values Check](#Missing-Values-Check)
        * [Zero Values Check](#Zero-Values-Check)
        * [Unique Values Check](#Unique-Values-Check)
        * [Duplicate Values Check](#Duplicate-Values-Check)
* **[Data Visualization & Analysis](#Data-Visualization-&-Analysis)**
    * [Finding Outliers](#Finding-Outliers)
    * [Find Correlations of Features](#Find-Correlations-of-Features)
* **[Data Preparation](#Data-Preparation)**
    * [Data Cleanup](#Data-Cleanup)
        * [Handling Missing Values](#Handling-Missing-Values)
        * [Handling Outliers](#Handling-Outliers)

# Overview


<p>This dataset was collected from <a href='https://www.kaggle.com'>Kaggle</a>, at the URL below.</p>
<p><a href='https://www.kaggle.com/datasets/andrewmvd/divorce-prediction'>Divorce Prediction Dataset</a></p>
<p>The dataset contains real survey data in which the features have been masked for privacy; the accompanying reference file explains each feature. No personal information is revealed in the data.</p>
<p>This analysis aims to identify the factors that contribute to divorce.</p>

In [1]:
# Standard library
import math
import time

# Data handling and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# scikit-learn: preprocessing, model selection, models, and metrics
from sklearn.decomposition import PCA
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    f1_score,
    precision_score,
    recall_score,
)
from sklearn.exceptions import ConvergenceWarning


# Data Exploration

In [2]:
# Set all columns visible
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

In [3]:
# Load reference data into dataframe
datareference = pd.read_csv('data/divorce-dataset-1/reference.tsv', sep = '|')

This dataset contains responses from 170 couples to the 54 questions of the Divorce Predictors Scale (DPS), which is based on Gottman couples therapy.
The couples come from various regions of Turkey, and the records were collected in face-to-face interviews with couples who were either already divorced or happily married.
All responses were recorded on a 5-point scale (0=Never, 1=Seldom, 2=Averagely, 3=Frequently, 4=Always).

Source: <a href='https://www.kaggle.com/datasets/andrewmvd/divorce-prediction'>https://www.kaggle.com/datasets/andrewmvd/divorce-prediction</a>
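
Since every answer is expected to fall on the 0 to 4 scale described above, a quick range check can confirm that before any further analysis. The following is only a minimal sketch on a small made-up frame: the `sample` data and the variable names are hypothetical and not part of the original notebook. Once the feature data is loaded, the real dataframe would take the place of `sample`.

```python
import pandas as pd

# Hypothetical three-row sample standing in for the real questionnaire data.
sample = pd.DataFrame({
    "Q1": [0, 4, 2],
    "Q2": [1, 3, 5],   # 5 is deliberately outside the 0-4 scale
    "Divorce": [1, 0, 1],
})

# Select only the question columns (names starting with "Q").
question_cols = [c for c in sample.columns if c.startswith("Q")]

# True wherever a response is not one of the five documented scale values.
out_of_range = ~sample[question_cols].isin([0, 1, 2, 3, 4])

# Number of invalid responses per question.
invalid_counts = out_of_range.sum()
print(invalid_counts)
```

A column with a non-zero count would point to a data-entry or parsing problem worth investigating before modelling.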

In [4]:
# Display the reference data with left-aligned cells and headers
left_aligned_refdata = datareference.style.set_properties(**{'text-align': 'left'}).set_table_styles([dict(selector='th', props=[('text-align', 'left')])])
left_aligned_refdata

| | atribute_id | description |
|---|---|---|
| 0 | 1 | If one of us apologizes when our discussion deteriorates, the discussion ends. |
| 1 | 2 | I know we can ignore our differences, even if things get hard sometimes. |
| 2 | 3 | When we need it, we can take our discussions with my spouse from the beginning and correct it. |
| 3 | 4 | When I discuss with my spouse, to contact him will eventually work. |
| 4 | 5 | The time I spent with my wife is special for us. |
| 5 | 6 | We don't have time at home as partners. |
| 6 | 7 | We are like two strangers who share the same environment at home rather than family. |
| 7 | 8 | I enjoy our holidays with my wife. |
| 8 | 9 | I enjoy traveling with my wife. |
| 9 | 10 | Most of our goals are common to my spouse. |


In [5]:
# Load the feature data (semicolon-separated) into a dataframe
data = pd.read_csv('data/divorce-dataset-1/divorce_data.csv', sep = ';')

In [6]:
data

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,Q11,Q12,Q13,Q14,Q15,Q16,Q17,Q18,Q19,Q20,Q21,Q22,Q23,Q24,Q25,Q26,Q27,Q28,Q29,Q30,Q31,Q32,Q33,Q34,Q35,Q36,Q37,Q38,Q39,Q40,Q41,Q42,Q43,Q44,Q45,Q46,Q47,Q48,Q49,Q50,Q51,Q52,Q53,Q54,Divorce
0,2,2,4,1,0,0,0,0,0,0,1,0,1,1,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,1,1,2,1,2,0,1,2,1,3,3,2,1,1,2,3,2,1,3,3,3,2,3,2,1,1
1,4,4,4,4,4,0,0,4,4,4,4,3,4,0,4,4,4,4,3,2,1,1,0,2,2,1,2,0,1,1,0,4,2,3,0,2,3,4,2,4,2,2,3,4,2,2,2,3,4,4,4,4,2,2,1
2,2,2,2,2,1,3,2,1,1,2,3,4,2,3,3,3,3,3,3,2,1,0,1,2,2,2,2,2,3,2,3,3,1,1,1,1,2,1,3,3,3,3,2,3,2,3,2,3,1,1,1,2,2,2,1
3,3,2,3,2,3,3,3,3,3,3,4,3,3,4,3,3,3,3,3,4,1,1,1,1,2,1,1,1,1,3,2,3,2,2,1,1,3,3,4,4,2,2,3,2,3,2,2,3,3,3,3,2,2,2,1
4,2,2,1,1,1,1,0,0,0,0,0,1,0,1,1,1,1,1,2,1,1,0,0,0,0,2,1,2,1,1,1,1,1,1,0,0,0,0,2,1,0,2,3,0,2,2,1,2,3,2,2,2,1,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
165,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,4,3,4,0,0,4,0,1,0,1,0,0,0,0,1,0,4,1,1,4,2,2,2,0
166,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,3,1,3,4,1,2,2,2,2,3,2,2,0
167,1,1,0,0,0,0,0,0,0,1,0,1,1,0,0,1,0,0,0,1,0,0,0,0,1,1,1,0,0,1,1,1,0,1,0,0,1,1,1,2,1,3,3,0,2,3,0,2,0,1,1,3,0,0,0
168,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,4,1,2,1,1,0,4,3,3,2,2,3,2,4,3,1,0


# Check for Data Quality

This section checks the dataset for missing values, zero values, unique values, and duplicate rows.


## Missing Values Check

Count the missing values in each column of the dataset.

In [7]:
# Missing values check
missing_values_count = data.isna().sum()
print(missing_values_count)

Q1         0
Q2         0
Q3         0
Q4         0
Q5         0
Q6         0
Q7         0
Q8         0
Q9         0
Q10        0
Q11        0
Q12        0
Q13        0
Q14        0
Q15        0
Q16        0
Q17        0
Q18        0
Q19        0
Q20        0
Q21        0
Q22        0
Q23        0
Q24        0
Q25        0
Q26        0
Q27        0
Q28        0
Q29        0
Q30        0
Q31        0
Q32        0
Q33        0
Q34        0
Q35        0
Q36        0
Q37        0
Q38        0
Q39        0
Q40        0
Q41        0
Q42        0
Q43        0
Q44        0
Q45        0
Q46        0
Q47        0
Q48        0
Q49        0
Q50        0
Q51        0
Q52        0
Q53        0
Q54        0
Divorce    0
dtype: int64
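
With 55 columns, the per-column listing is long to scan. A shorter summary can report the total number of missing values and list only the columns that contain any. This is a sketch under assumptions: the `sample` frame and the variable names below are hypothetical, chosen only to illustrate the idea.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing response, for illustration only.
sample = pd.DataFrame({
    "Q1": [0, 4, np.nan],
    "Q2": [1, 3, 2],
    "Divorce": [1, 0, 1],
})

# Missing values per column, as in the check above.
missing_per_column = sample.isna().sum()

# Keep only the columns that actually contain missing values.
columns_with_missing = missing_per_column[missing_per_column > 0]
total_missing = int(missing_per_column.sum())

print(f"Total missing values: {total_missing}")
print(columns_with_missing)
```

Applied to a dataframe with no missing values, `columns_with_missing` would be empty and the total would be 0.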


## Zero Values Check

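Count the zero values in each column. Because 0 is a valid response on the 5-point scale (0=Never) and a valid value of the `Divorce` label, these zeros are legitimate entries rather than missing data.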


In [8]:
# Zero values check
data.eq(0).sum()

Q1          69
Q2          59
Q3          51
Q4          75
Q5          82
Q6          86
Q7         114
Q8          81
Q9          84
Q10         62
Q11         71
Q12         58
Q13         47
Q14         66
Q15         69
Q16         75
Q17         73
Q18         79
Q19         77
Q20         81
Q21         78
Q22         87
Q23         90
Q24         72
Q25         63
Q26         72
Q27         77
Q28         85
Q29         81
Q30         72
Q31         44
Q32         46
Q33         71
Q34         50
Q35         85
Q36         88
Q37         49
Q38         64
Q39         50
Q40         72
Q41         55
Q42         44
Q43         16
Q44         60
Q45         28
Q46         22
Q47         33
Q48         10
Q49         28
Q50         19
Q51         12
Q52         23
Q53         31
Q54         50
Divorce     86
dtype: int64
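
Raw zero counts are easier to compare across columns when expressed as a share of all rows. For instance, the 86 zeros in `Divorce` out of 170 couples mean the two classes are roughly balanced. The sketch below shows one way to compute such percentages; the `sample` frame is hypothetical and stands in for the real data.

```python
import pandas as pd

# Hypothetical four-row sample, for illustration only.
sample = pd.DataFrame({
    "Q1": [0, 0, 2, 4],
    "Q2": [1, 0, 3, 2],
    "Divorce": [1, 1, 0, 0],
})

# eq(0) gives a boolean frame; its column mean is the fraction of zeros.
zero_share = sample.eq(0).mean() * 100
print(zero_share.round(1))
```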

## Unique Values Check

Count the unique values in each column, along with the percentage of null values.


In [9]:
# Unique values check
def unique_values_and_null_percentage(dataframe):
    """
    Check the count of unique values and the percentage of null values for each column in a DataFrame.

    Parameters:
    - dataframe: Pandas DataFrame.

    Returns:
    - A DataFrame with column names, the count of unique values, and the percentage of null values.
    """
    unique_count = dataframe.nunique()
    null_percentage = dataframe.isnull().mean() * 100

    result_df = pd.DataFrame({
        'Column': unique_count.index,
        'Unique Count': unique_count.values,
        'Null Percentage': null_percentage.values
    })

    result_df = result_df.sort_values(by='Unique Count', ascending=False).reset_index(drop=True)
    return result_df[['Column', 'Unique Count', 'Null Percentage']]

result = unique_values_and_null_percentage(data)
print(result)

     Column  Unique Count  Null Percentage
0        Q1             5              0.0
1       Q42             5              0.0
2       Q31             5              0.0
3       Q32             5              0.0
4       Q33             5              0.0
5       Q34             5              0.0
6       Q35             5              0.0
7       Q36             5              0.0
8       Q37             5              0.0
9       Q38             5              0.0
10      Q39             5              0.0
11      Q40             5              0.0
12      Q41             5              0.0
13      Q43             5              0.0
14      Q29             5              0.0
15      Q44             5              0.0
16      Q45             5              0.0
17      Q46             5              0.0
18      Q47             5              0.0
19      Q48             5              0.0
20      Q49             5              0.0
21      Q50             5              0.0
22      Q51

## Duplicate Values Check

Count the rows that exactly duplicate an earlier row in the dataset.

In [10]:
duplicate_count = data.duplicated().sum()
print(f"Number of duplicate rows: {duplicate_count}")


Number of duplicate rows: 20
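
Before deciding what to do with duplicates, it can help to look at them. On a coarse 5-point scale, two different couples can plausibly give identical answers, so whether to drop these rows is a judgment call rather than an automatic step. The following sketch, using a hypothetical `sample` frame and hypothetical variable names, shows how the duplicate groups could be inspected and, if desired, removed.

```python
import pandas as pd

# Hypothetical sample in which rows 0, 1, and 3 are identical.
sample = pd.DataFrame({
    "Q1": [0, 0, 2, 0],
    "Q2": [1, 1, 3, 1],
    "Divorce": [0, 0, 1, 0],
})

# keep=False marks every member of a duplicate group, not just the repeats.
duplicate_rows = sample[sample.duplicated(keep=False)]
print(duplicate_rows)

# Drop repeats, keeping the first occurrence of each distinct row.
deduplicated = sample.drop_duplicates().reset_index(drop=True)
print(f"Rows before: {len(sample)}, after: {len(deduplicated)}")
```

Note that `duplicated()` with its default `keep='first'` counts only the repeats, which is why the count it reports is smaller than the number of rows shown with `keep=False`.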


# Data Visualization & Analysis

## Finding Outliers
   

## Find Correlations of Features

# Data Preparation

## Data Cleanup

## Handling Missing Values

## Handling Outliers