# What leads to divorce?

### Table of Contents

* **[Overview](#Overview)**

* **[Data Exploration](#Data-Exploration)**  
    * [Check for Data Quality](#Check-for-Data-Quality)
        * [Missing Values Check](#Missing-Values-Check)
        * [Zero Values Check](#Zero-Values-Check)
        * [Unique Values Check](#Unique-Values-Check)
        * [Duplicate Values Check](#Duplicate-Values-Check)
* **[Data Visualization & Analysis](#Data-Visualization-&-Analysis)**
    * [Finding Outliers](#Finding-Outliers)
    * [Find Correlations of Features](#Find-Correlations-of-Features)
* **[Data Preparation](#Data-Preparation)**
    * [Data Cleanup](#Data-Cleanup)
        * [Handling Missing Values](#Handling-Missing-Values)
        * [Handling Outliers](#Handling-Outliers)

# Overview


<p>This dataset was collected from <a href='kaggle.com'>Kaggle</a> at the URL below.</p>
<p><a href='https://www.kaggle.com/datasets/andrewmvd/divorce-prediction'>Divorce Prediction Dataset</a></p>
<p>The dataset contains real data that has been masked for privacy. The reference file explains each feature, and no personal information is revealed in the data.</p>
<p>This analysis aims to identify the factors that contribute to divorce.</p>

In [1]:
# Standard library
import math
import time

# Third-party libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


# Data Exploration

In [2]:
# Show all columns and full cell contents when displaying dataframes
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

In [3]:
# Load the reference data (feature descriptions) into a dataframe
datareference = pd.read_csv('data/divorce-dataset-1/reference.tsv', sep='|')

This dataset covers 170 couples and their answers to the 54 questions of the Divorce Predictors Scale (DPS), which is based on Gottman couples therapy.
The couples come from various regions of Turkey, and the records were collected in face-to-face interviews with couples who were either already divorced or happily married.
All responses were collected on a 5-point scale (0=Never, 1=Seldom, 2=Averagely, 3=Frequently, 4=Always).

Source: <a href='https://www.kaggle.com/datasets/andrewmvd/divorce-prediction'>https://www.kaggle.com/datasets/andrewmvd/divorce-prediction</a>
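
The description above implies a property that can be checked directly: every response should fall between 0 and 4. The following is a minimal sketch of such a check. The helper name `out_of_scale_counts`, the column names, and the small synthetic frame are illustrative assumptions and not part of the original notebook; on the real data the helper would be applied to the question columns once they are loaded.

```python
import pandas as pd

def out_of_scale_counts(frame, low=0, high=4):
    """Count, per column, the values outside the [low, high] response scale."""
    outside = (frame < low) | (frame > high)
    return outside.sum()

# Synthetic stand-in for the survey answers (hypothetical column names)
sample = pd.DataFrame({
    "Q1": [0, 2, 4, 1],
    "Q2": [3, 5, 0, 4],  # 5 lies outside the 0 to 4 scale
})

scale_report = out_of_scale_counts(sample)
print(scale_report)
```

A column with a non-zero count would point to a coding or parsing problem worth inspecting before any modelling.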

In [4]:
# Display the reference data with left-aligned columns and headers
left_aligned_refdata = datareference.style.set_properties(**{'text-align': 'left'}).set_table_styles([dict(selector='th', props=[('text-align', 'left')])])
left_aligned_refdata

| Unnamed: 0 | atribute_id | description |
|---|---|---|
| 0 | 1 | If one of us apologizes when our discussion deteriorates, the discussion ends. |
| 1 | 2 | I know we can ignore our differences, even if things get hard sometimes. |
| 2 | 3 | When we need it, we can take our discussions with my spouse from the beginning and correct it. |
| 3 | 4 | When I discuss with my spouse, to contact him will eventually work. |
| 4 | 5 | The time I spent with my wife is special for us. |
| 5 | 6 | We don't have time at home as partners. |
| 6 | 7 | We are like two strangers who share the same environment at home rather than family. |
| 7 | 8 | I enjoy our holidays with my wife. |
| 8 | 9 | I enjoy traveling with my wife. |
| 9 | 10 | Most of our goals are common to my spouse. |
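
An `Unnamed: 0` column in pandas output usually means the file stores a row index under a blank header. Here is a minimal sketch, assuming that is the case for `reference.tsv`: passing `index_col=0` to `pd.read_csv` turns that column into the index instead of a data column. The inline text below is a made-up stand-in for the file, not its actual contents.

```python
import io

import pandas as pd

# Hypothetical stand-in for the reference file: blank first header, '|' separator
raw_text = (
    "|atribute_id|description\n"
    "0|1|First question text\n"
    "1|2|Second question text\n"
)

# index_col=0 uses the first column as the row index rather than as data
reference = pd.read_csv(io.StringIO(raw_text), sep="|", index_col=0)
print(reference)
```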


In [5]:
# Load the feature data into a dataframe
data = pd.read_csv('data/divorce-dataset-1/divorce_data.csv', sep=';')

In [None]:
data

# Check for Data Quality

This section checks the dataset for common data quality issues: missing values, zero values, unique values, and duplicate rows.


## Missing Values Check

This check counts the missing values in each column of the dataset.

In [None]:
# Missing values check
missing_values_count = data.isna().sum()
print(missing_values_count)
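
With more than 50 columns, the full list of counts is long to scan. A minimal sketch of a narrower report that keeps only the columns that actually have missing values follows. The helper name `columns_with_missing` and the synthetic frame are illustrative assumptions.

```python
import numpy as np
import pandas as pd

def columns_with_missing(frame):
    """Return missing-value counts for columns that have at least one, largest first."""
    counts = frame.isna().sum()
    return counts[counts > 0].sort_values(ascending=False)

# Synthetic stand-in for the survey data (hypothetical column names)
sample = pd.DataFrame({
    "Q1": [0, 1, np.nan],
    "Q2": [4, 3, 2],
    "Q3": [np.nan, np.nan, 1],
})

missing_only = columns_with_missing(sample)
print(missing_only)
```

An empty result means no column has missing values.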

## Zero Values Check

This check counts the zero values in each column of the dataset.


In [None]:
# Zero values check
data.eq(0).sum()
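
On the 0 to 4 response scale, 0 means "Never", so a zero is a valid answer rather than a missing measurement. The counts are therefore easier to interpret next to the other response levels. The sketch below computes the share of each level per column; the helper name `response_shares` and the synthetic frame are illustrative assumptions.

```python
import pandas as pd

def response_shares(frame, levels=(0, 1, 2, 3, 4)):
    """Return the fraction of answers at each response level, one row per column."""
    shares = {
        column: frame[column]
        .value_counts(normalize=True)
        .reindex(list(levels), fill_value=0.0)
        for column in frame.columns
    }
    # Keys become columns, so transpose to get one row per survey question
    return pd.DataFrame(shares).T

# Synthetic stand-in for the survey data (hypothetical column names)
sample = pd.DataFrame({"Q1": [0, 0, 4, 2], "Q2": [1, 1, 1, 3]})

shares = response_shares(sample)
print(shares)
```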

## Unique Values Check

This check reports the number of unique values and the percentage of null values in each column.


In [None]:
# Unique values check
def unique_values_and_null_percentage(dataframe):
    """
    Check the count of unique values and the percentage of null values for each column in a DataFrame.

    Parameters:
    - dataframe: Pandas DataFrame.

    Returns:
    - A DataFrame with column names, the count of unique values, and the percentage of null values.
    """
    unique_count = dataframe.nunique()
    null_percentage = dataframe.isnull().mean() * 100

    result_df = pd.DataFrame({
        'Column': unique_count.index,
        'Unique Count': unique_count.values,
        'Null Percentage': null_percentage.values
    })

    result_df = result_df.sort_values(by='Unique Count', ascending=False).reset_index(drop=True)
    return result_df[['Column', 'Unique Count', 'Null Percentage']]

result = unique_values_and_null_percentage(data)
print(result)

## Duplicate Values Check

This check counts the fully duplicated rows in the dataset.

In [None]:
duplicate_count = data.duplicated().sum()
print(f"Number of duplicate rows: {duplicate_count}")
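
If the count is non-zero, it helps to look at the duplicated rows before deciding what to do with them. The sketch below uses a synthetic frame with hypothetical column names. Note that with discrete answers on a short scale, two different respondents can legitimately give identical answers, so dropping duplicates is a judgment call rather than an automatic step.

```python
import pandas as pd

# Synthetic stand-in for the survey data (hypothetical column names)
sample = pd.DataFrame({"Q1": [0, 1, 0, 2], "Q2": [4, 3, 4, 1]})

# keep=False marks every member of a duplicate group, not only the later repeats
duplicate_rows = sample[sample.duplicated(keep=False)]
print(duplicate_rows)

# Optional cleanup: keep the first occurrence of each distinct row
deduplicated = sample.drop_duplicates().reset_index(drop=True)
print(len(deduplicated))
```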


# Data Visualization & Analysis

## Finding Outliers
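
One possible starting point for this section is the interquartile range (IQR) rule, sketched below. This is an assumption about how outliers could be flagged, not the method used in this notebook; the helper name `iqr_outlier_counts`, the multiplier `k=1.5`, and the synthetic frame are all illustrative. Because every answer lies on a 0 to 4 scale, a flagged value is an unusual response rather than a measurement error.

```python
import pandas as pd

def iqr_outlier_counts(frame, k=1.5):
    """Count, per column, the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # Compare each column against its own bounds
    outside = frame.lt(lower, axis="columns") | frame.gt(upper, axis="columns")
    return outside.sum()

# Synthetic stand-in for the survey data (hypothetical column names)
sample = pd.DataFrame({
    "Q1": [0, 0, 0, 0, 0, 0, 0, 4],  # one answer far from the rest
    "Q2": [0, 1, 2, 3, 4, 2, 1, 3],  # spread across the scale
})

outlier_counts = iqr_outlier_counts(sample)
print(outlier_counts)
```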
   

## Find Correlations of Features
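
A possible starting point for this section is to rank the questions by their correlation with the outcome. The sketch below computes Spearman correlation as Pearson correlation on ranks, which suits ordinal answers. It is an illustration under assumptions: the target column name `Divorce`, the helper name `correlations_with_target`, and the synthetic frame are hypothetical and not taken from this notebook.

```python
import pandas as pd

def correlations_with_target(frame, target):
    """Spearman correlation of each feature with the target, strongest first."""
    # Spearman correlation equals Pearson correlation computed on ranks
    ranked = frame.rank()
    corr = ranked.drop(columns=[target]).corrwith(ranked[target])
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Synthetic stand-in for the survey data (hypothetical column names)
sample = pd.DataFrame({
    "Q1": [0, 1, 3, 4],
    "Q2": [4, 3, 1, 0],
    "Q3": [2, 0, 2, 1],
    "Divorce": [0, 0, 1, 1],
})

target_corr = correlations_with_target(sample, "Divorce")
print(target_corr)
```

A heatmap of the full feature correlation matrix, for example with `sns.heatmap(frame.corr())`, would complement this ranking by showing which questions move together.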

# Data Preparation

## Data Cleanup

## Handling Missing Values

## Handling Outliers