# CDS6314 Data Mining: Association Rule Mining on Marital Satisfaction Data

**Group Members:**

| Student ID    | Name                            |
| -------------- | ------------------------------ |
| 1211102409    | CHUA KAI ZHENG                 |
| 1211102696    | LEE JIA MENG                   |
| 1211103527    | MUHAMMAD IRFAN HAQIEF BIN RAZAK |
| 1211100917    | NATALIE TAN LI YI              |

**Group Number:** 

## 1. Introduction

This project aims to perform association rule mining on the "Marital satisfaction, sex, age, marriage duration, religion, number of children, economic status, education, and collectivistic values: Data from 33 countries" dataset (Sorokowski et al., 2017). The primary objective is to uncover interesting associations and patterns within this data, particularly focusing on factors related to marital satisfaction.

We will follow a data mining pipeline involving data exploration, preprocessing, application of association rule mining algorithms, and evaluation of the discovered rules. This notebook documents the Python implementation of these steps.
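To make the evaluation step concrete, the sketch below computes the three standard rule metrics (support, confidence, and lift) on a handful of toy transactions. The item labels and transactions are hypothetical and are not drawn from the dataset, and the helper names `support`, `confidence`, and `lift` are our own for illustration rather than a library API.

```python
# Toy transactions (hypothetical; NOT taken from the marital satisfaction data).
transactions = [
    {"Religious", "HighSatisfaction"},
    {"Religious", "HighSatisfaction", "HasChildren"},
    {"HasChildren"},
    {"Religious"},
    {"HighSatisfaction", "HasChildren"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) from the transactions."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def lift(antecedent, consequent, transactions):
    """Confidence relative to the baseline frequency of the consequent."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)


# Metrics for the illustrative rule {Religious} -> {HighSatisfaction}
rule_support = support({"Religious", "HighSatisfaction"}, transactions)
rule_confidence = confidence({"Religious"}, {"HighSatisfaction"}, transactions)
rule_lift = lift({"Religious"}, {"HighSatisfaction"}, transactions)

print(f"support={rule_support:.2f}, confidence={rule_confidence:.2f}, lift={rule_lift:.2f}")
```

A lift above 1 indicates that the antecedent and consequent co-occur more often than they would if they were independent; a lift near 1 suggests the rule carries little information.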

**Exploratory Questions:**
1.  
2. 
3.  
4.  

### Step 1: Import Necessary Libraries
We start by importing the libraries we'll need.

In [9]:
# Step 1: Import Necessary Libraries
import pandas as pd 
import numpy as np   
import matplotlib.pyplot as plt 
import seaborn as sns 


# Set display options for pandas to make outputs easier to read during exploration
pd.set_option('display.max_columns', None)      # Ensures all columns of a DataFrame are displayed
pd.set_option('display.max_rows', 100)         # Displays up to 100 rows (useful for peeking at data)
pd.set_option('display.width', 1000)          # Adjusts the display width in the console/notebook for wider tables

print("Step 1: Libraries imported successfully!")

Step 1: Libraries imported successfully!


## 2. Data Loading and Initial Understanding

The first step in our analysis is to load the dataset. The data is stored in an Excel workbook named "Marital satisfaction_Data.xlsx", which the cell below reads with `pd.read_excel`. This produces a pandas DataFrame, a tabular data structure similar to a spreadsheet.

In [11]:
# Step 2.1: Load the Dataset

# Define the path to your dataset file. 
# If the file is in the same directory as your Jupyter Notebook, you can just use the filename.
# Otherwise, you'll need to provide the full path to the file.
file_path = 'Marital satisfaction_Data.xlsx' 

try:
    # Attempt to read the Excel file into a pandas DataFrame.
    df_raw = pd.read_excel(file_path)
    print("Dataset loaded successfully!")
    
    # Display the dimensions of the loaded DataFrame (number of rows, number of columns)
    print(f"Dataset dimensions (rows, columns): {df_raw.shape}")
    
except FileNotFoundError:
    # If the file is not found at the specified path, print an error message.
    print(f"Error: The file '{file_path}' was not found. Please check the file path and filename.")
    print("Please ensure the dataset file is in the correct location or update the 'file_path' variable.")
    # Create an empty DataFrame if the file is not found to prevent errors in subsequent cells.
    # In a real project, you might want to stop execution here or handle this more robustly.
    df_raw = pd.DataFrame() 

# Display the first 5 rows of the dataset to get an initial look at its structure and content.
# This helps verify that the data has been loaded correctly.
if not df_raw.empty:
    print("\nFirst 5 rows of the raw dataset (df_raw.head()):")
    print(df_raw.head())

Dataset loaded successfully!
Dataset dimensions (rows, columns): (7180, 31)

First 5 rows of the raw dataset (df_raw.head()):
      Country   Sex (1-M, 2-F)   Age  Marriage duration (years)  Number of children  Number of brought up children  Education (1-no formal education, 2-primary school, 3-secondary school, 4-high school or technical college, 5-bachelor or master degree)   Material status (1-much better than average in my country, 2-better than average in my country, 3-similar to average in my country, 4-worse than average in my country, 5-much worse than average in my country)  Religion (1-Protestant, 2-Catholic,  3-Jewish, 4-Muslim, 5-Buddhist, 6-None, 7-Jehovah, 8-Evangelic, 9-Spiritualism, 10-Other - very specific, 11-Orthodox, 12-Hinduism)   Religiosity (1-not religious at all, 7-extremely religious)  Pension (1-strongly agree, 4-neither agree nor disagree, 7-strongly disagree)  Marriage and Relationships Questionnaire (MRQ) (1-yes, 3-neither yes nor no, 5-no)  Unnamed: 12  U

### 2.2. Initial Data Inspection and Creating a Working Copy

Now that the data is loaded, we'll perform some basic inspections to get a better understanding of its characteristics. This includes:
* Viewing the data types of each column.
* Getting summary statistics for numerical columns.
* Checking for missing values.

It's also good practice to create a working copy of the raw dataset. This way, the original loaded data (`df_raw`) remains unchanged, and we can perform all modifications and preprocessing steps on the copy (`df`).
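As a minimal sketch of the inspection and working-copy idea, the example below uses a small toy DataFrame (hypothetical values, not the real data) to count missing values and to show that changes made to a `.copy()` do not affect the original. The names `raw` and `work` stand in for `df_raw` and `df`.

```python
import pandas as pd

# Toy stand-in for df_raw (hypothetical values; one missing Age on purpose).
raw = pd.DataFrame({"Age": [25, None, 33], "Sex": [1, 2, 2]})

# Missing values per column, and as a percentage of all cells.
missing = raw.isnull().sum()
pct_missing = missing.sum() / raw.size * 100
print(missing)
print(f"Overall missing: {pct_missing:.2f}%")

# .copy() makes a deep copy by default, so edits to `work` leave `raw` untouched.
work = raw.copy()
work.columns = ["Age_Years", "Sex_Code"]

print(raw.columns.tolist())   # original labels are preserved
print(work.columns.tolist())  # only the copy was renamed
```

By contrast, a plain assignment such as `work = raw` would only create a second name for the same object, so any in-place change would show up in both.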

In [12]:
# Step 2.2: Perform Initial Data Inspection and Create Working Copy

if not df_raw.empty:
    print("\n--- Dataset Information (df_raw.info()) ---")
    df_raw.info()

    print("\n--- Basic Descriptive Statistics for Numerical Columns (df_raw.describe()) ---")
    print(df_raw.describe())

    print("\n--- Missing Values Count per Column (df_raw.isnull().sum()) ---")
    # .isnull() creates a boolean DataFrame (True where value is missing/NaN, False otherwise).
    # .sum() on this boolean DataFrame counts the number of True values per column.
    missing_values = df_raw.isnull().sum()
    print("Columns with missing values (if any):")
    print(missing_values[missing_values > 0]) # This filters to show only columns that actually have missing values.
    total_missing = missing_values.sum()
    print(f"\nTotal number of missing values in the entire dataset: {total_missing}")
    
    percentage_missing = (total_missing / (df_raw.shape[0] * df_raw.shape[1])) * 100
    print(f"Overall percentage of missing data: {percentage_missing:.2f}%")

    # Create a working copy of the DataFrame. All subsequent modifications will be done on 'df'.
    df = df_raw.copy()
    print(f"\nWorking copy 'df' created. Shape of 'df': {df.shape}")
    print("Initial inspection complete. The working DataFrame 'df' is now ready for further processing.")

else:
    print("Raw dataset (df_raw) is empty. Cannot proceed with inspection or create a working copy.")


--- Dataset Information (df_raw.info()) ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7180 entries, 0 to 7179
Data columns (total 31 columns):
 #   Column                                                                                                                                                                                                            Non-Null Count  Dtype  
---  ------                                                                                                                                                                                                            --------------  -----  
 0   Country                                                                                                                                                                                                           7179 non-null   object 
 1   Sex (1-M, 2-F)                                                                                                                

### 2.3. Column Naming and Understanding

The raw source file has a two-line header, which pandas can misinterpret. The first line contains descriptive names for some variables, while the remaining columns receive generic `Unnamed: N` labels. The second line (which becomes the first data row when only the first line is read as the header) appears to hold item numbers that map to the questionnaire. Let's examine the loaded column names and the first few data rows to confirm this interpretation, and then rename the columns for clarity and ease of use.
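One way to check for a stray item-number row is sketched below on a toy frame. It assumes the scale columns in that row hold consecutive integers starting at 1 and that the demographic columns are empty there; the real file's numbering is not shown in this notebook, so treat the detection rule and the toy column labels as assumptions to adapt.

```python
import pandas as pd

# Toy frame mimicking the loaded shape (hypothetical values and labels):
# row 0 plays the role of the second header line holding item numbers.
toy = pd.DataFrame({
    "Age": [None, 34, 51],
    "MRQ (1-yes, 5-no)": [1, 2, 4],
    "Unnamed: 2": [2, 1, 5],
    "Unnamed: 3": [3, 1, 3],
})

# Columns that belong to the questionnaire scale in this toy example.
scale_cols = [c for c in toy.columns if c.startswith(("MRQ", "Unnamed"))]

# Assumed rule: the first row is an item-number row if it reads 1, 2, 3, ...
first_row = toy.loc[0, scale_cols].tolist()
looks_like_item_numbers = first_row == list(range(1, len(scale_cols) + 1))

# Drop that row only when the rule matches; otherwise keep the data as loaded.
if looks_like_item_numbers:
    cleaned = toy.iloc[1:].reset_index(drop=True)
else:
    cleaned = toy.copy()

print(looks_like_item_numbers)
print(cleaned)
```

Before dropping anything from the real data, it is worth printing the suspected row and comparing it with the questionnaire by eye.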

**Reference for Column Mapping (based on CSV structure and Questionnaire):**


We will systematically rename the columns based on the questionnaire for better readability and programmatic access.

In [13]:
# Step 2.3: Data Cleaning - Standardizing and Renaming Columns

if not df_raw.empty: # Check if df_raw was loaded successfully in previous cells
    
    print("Original column names (from df.columns before any renaming in this cell):")
    print(df.columns.tolist())
    print(f"Number of columns initially: {len(df.columns)}")

    # The source file has descriptive names for the first few columns,
    # but the scale items (MRQ, KMSS, CI) are not individually labeled in the header,
    # so pandas assigns them generic 'Unnamed: N' column names.
    # The second row of the file (read as the first data row with header=0,
    # or as part of a MultiIndex with header=[0, 1]) contains item numbers.
    # For simplicity, we assume pandas used the first row as headers and define the full list of new names.

    # *** THIS IS THE MOST CRITICAL PART OF THIS STEP ***
    # You MUST ensure this 'final_column_names' list:
    # 1. Has the EXACT same number of elements as columns in your DataFrame 'df'.
    # 2. The order of names correctly maps to the actual order of columns in your 'df'.
    # Inspect your `df.head()` and `df.columns.tolist()` output carefully to confirm.
    
    final_column_names = [
        'Country', 'Sex', 'Age', 'Marriage_Duration', 'Num_Children', 
        'Num_BroughtUp_Children', 'Education_Level', 'Material_Status', 
        'Religion', 'Religiosity_Score', 'Pension_View',
        # MRQ Items (9 items - from Questionnaire Q11-Q19)
        # These names are placeholders; ensure they match the actual column order in your source file.
        'MRQ_EnjoyCompany_Q11', 'MRQ_Happy_Q12', 'MRQ_SpouseAttractive_Q13', 'MRQ_EnjoyDoingThings_Q14',
        'MRQ_EnjoyCuddling_Q15', 'MRQ_RespectSpouse_Q16', 'MRQ_ProudOfSpouse_Q17', 
        'MRQ_RomanticSide_Q18', 'MRQ_LoveSpouse_Q19', # This is Q19, paper suggests exclusion later
        # KMSS Items (3 items - from Questionnaire Q20-Q22)
        'KMSS_SatWithMarriage_Q20', 'KMSS_SatWithSpouse_Q21', 'KMSS_SatWithRelationship_Q22',
        # Collectivism-Individualism Items (8 items - Scale 2 from Questionnaire)
        # CI_Nat refers to "In this society..." questions (Q2.1 to Q2.4 in questionnaire)
        'CI_Nat1_ChildrenPrideParentsAccompl', 'CI_Nat2_ParentsPrideChildrenAccompl',
        'CI_Nat3_AgingParentsLiveWithChildren', 'CI_Nat4_ChildrenLiveWithParentsUntilMarried',
        # CI_Indiv refers to "I think..." questions (Q2.5 to Q2.8 in questionnaire)
        'CI_Indiv1_ChildrenShouldPrideParentsAccompl', 'CI_Indiv2_ParentsShouldPrideChildrenAccompl',
        'CI_Indiv3_AgingParentsShouldLiveWithChildren', 'CI_Indiv4_ChildrenCanLiveWithParentsUntilMarried'
    ]
    
    # Check for column count mismatch BEFORE attempting to rename
    if len(final_column_names) == df.shape[1]:
        df.columns = final_column_names
        print("\nColumns successfully renamed to 'final_column_names'.")
    else:
        print("\nCRITICAL WARNING: Column count mismatch!")
        print(f"DataFrame 'df' has {df.shape[1]} columns.")
        print(f"The 'final_column_names' list has {len(final_column_names)} names.")
        print("Renaming was skipped because a mismatched list would mislabel the columns.")
        print("Please carefully inspect the source file and the `df.columns.tolist()` output printed above.")
        print("Adjust the 'final_column_names' list to have the correct number and order of names.")
        print("Common issues: extra empty columns at the end of the file, or pandas misinterpreting initial headers.")
        print("For example, if you have extra columns, you might need to slice df first: df = df.iloc[:, :len(final_column_names)]")
        # It's safer to stop or raise an error here if counts don't match.
        # For demonstration, we'll print what we have:
        print("\nCURRENT df.columns before trying to force rename (if mismatch):")
        print(df.columns.tolist())

    # Let's verify the renaming by looking at the head and the new column list
    if 'Country' in df.columns and len(final_column_names) == df.shape[1] : # A quick check if first column name is as expected
        print("\nFirst 5 rows of 'df' after renaming columns:")
        print(df.head())
        print("\nList of new column names in 'df':")
        print(df.columns.tolist())
        print("\nData types after renaming (should be unchanged by renaming itself):")
        print(df.dtypes.value_counts()) # See a summary of data types
    elif len(final_column_names) != df.shape[1]:
        print("\nSkipping display of df.head() due to column renaming issues.")
    else:
        print("\nColumn 'Country' not found, renaming might have been incomplete or incorrect. Please verify.")

else:
    print("DataFrame 'df_raw' (and thus 'df') is empty. Cannot rename columns.")

Original column names (from df.columns before any renaming in this cell):
['Country ', 'Sex (1-M, 2-F)', 'Age', 'Marriage duration (years)', 'Number of children', 'Number of brought up children', 'Education (1-no formal education, 2-primary school, 3-secondary school, 4-high school or technical college, 5-bachelor or master degree) ', 'Material status (1-much better than average in my country, 2-better than average in my country, 3-similar to average in my country, 4-worse than average in my country, 5-much worse than average in my country)', 'Religion (1-Protestant, 2-Catholic,  3-Jewish, 4-Muslim, 5-Buddhist, 6-None, 7-Jehovah, 8-Evangelic, 9-Spiritualism, 10-Other - very specific, 11-Orthodox, 12-Hinduism) ', 'Religiosity (1-not religious at all, 7-extremely religious)', 'Pension (1-strongly agree, 4-neither agree nor disagree, 7-strongly disagree)', 'Marriage and Relationships Questionnaire (MRQ) (1-yes, 3-neither yes nor no, 5-no)', 'Unnamed: 12', 'Unnamed: 13', 'Unnamed: 14', 'Un

## 3. Data Exploration (EDA - Exploratory Data Analysis)

Now that our data is loaded and we have a basic understanding of its structure and initial quality, we will perform Exploratory Data Analysis (EDA). EDA is crucial for:
* Understanding the distribution of individual variables (univariate analysis).
* Identifying potential relationships or correlations between pairs of variables (bivariate analysis).
* Visualizing these distributions and relationships.
* Informing our data preprocessing steps (e.g., how to handle outliers, how to categorize variables).

The insights from EDA will be documented in the "Data Exploration" section of our technical report.
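The EDA goals listed above can be sketched on a toy frame as follows. The values, the bin edges, and the labels are hypothetical choices for illustration only (including the assumed 1 to 7 range for the KMSS item); the actual cut points should come from the distributions observed in the real data.

```python
import pandas as pd

# Toy data using the renamed column labels (hypothetical values).
toy = pd.DataFrame({
    "Sex": [1, 2, 2, 1, 2, 1],
    "Age": [23, 35, 47, 52, 61, 29],
    "KMSS_SatWithMarriage_Q20": [7, 6, 3, 5, 2, 6],
})

# Univariate: frequency of each category.
sex_counts = toy["Sex"].value_counts().sort_index()

# Categorize numeric variables; association rule mining needs discrete items.
# Bin edges here are illustrative assumptions, not values from the source.
toy["Age_Group"] = pd.cut(
    toy["Age"], bins=[0, 30, 50, 120], labels=["<=30", "31-50", ">50"]
)
toy["Satisfaction"] = pd.cut(
    toy["KMSS_SatWithMarriage_Q20"], bins=[0, 3, 5, 7], labels=["Low", "Medium", "High"]
)

# Bivariate: contingency table of two categorical variables.
age_vs_satisfaction = pd.crosstab(toy["Age_Group"], toy["Satisfaction"])

print(sex_counts)
print(age_vs_satisfaction)
```

The contingency table is also a quick preview of which item pairs are frequent enough to survive a minimum-support threshold later on.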