# Individual Data Mining Assignment

# Introduction

### Contextualising the problem

**Data mining** is a field that focuses on extracting insights and patterns from large datasets. That in itself is a strong argument for using such a process when tackling business problems, but there is more to it.



As such, this project proposes to answer the following problem statement and research questions:



### Expanding on the business objectives of the project (Business Understanding)



### Data Mining Methodology 

The Cross-Industry Standard Process for Data Mining (CRISP-DM) methodology, pictured below, was used to build the deliverable for this assignment. It organises a data mining project into six phases: Business Understanding, Data Understanding, Data Preparation, Modelling, Evaluation, and Deployment.

<div style="text-align: center;">
    <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/b/b9/CRISP-DM_Process_Diagram.png/639px-CRISP-DM_Process_Diagram.png" alt="CRISP-DM Process" style="width:400px;height:400px;margin-top: 20px;">
</div>

Figure 2: CRISP-DM Process Diagram 

1. *Business Understanding*: This process involves 

2. *Data Understanding*: The dataset contains 

3. *Data Preparation*: This is the 

4. *Modelling*: In this phase 

5. *Evaluation*: The models will be evaluated based on the success criteria discussed in business objective i.e. 

6. *Deployment*: This step is omitted.

# Data Understanding

### Importing modules

In [197]:
%pip install -r requirements.txt -q

Note: you may need to restart the kernel to use updated packages.





In [198]:
import pandas as pd
import os
DATA_FOLDER = 'data'
RANDOM_STATE = 0

### Exploration of Arthritis variable

In [199]:
# Load the questionnaire dataset
questionnaire_df = pd.read_csv(os.path.join(DATA_FOLDER, "questionnaire.csv"), encoding='ISO-8859-1')

# MCQ160A: Ever told you had arthritis
# MCQ195: What type of arthritis
# Keep only respondents with an answer to the arthritis question
filtered_questionnaire_df = questionnaire_df[questionnaire_df['MCQ160A'].notnull()]

merged_df = filtered_questionnaire_df

# Merge with the other datasets on the respondent sequence number (SEQN).
# With empty suffixes, pandas raises an error if any non-key column names overlap.
for csv_file in ['demographic.csv', 'labs.csv']:
    temp_df = pd.read_csv(os.path.join(DATA_FOLDER, csv_file), encoding='ISO-8859-1')
    merged_df = pd.merge(merged_df, temp_df, on='SEQN', how='inner', suffixes=('', ''))

merged_df

Unnamed: 0,SEQN,ACD011A,ACD011B,ACD011C,ACD040,ACD110,ALQ101,ALQ110,ALQ120Q,ALQ120U,...,URXUTL,URDUTLLC,URXUTU,URDUTULC,URXUUR,URDUURLC,URXPREG,URXUAS,LBDB12,LBDB12SI
0,73557,1.0,,,,,1.0,,1.0,3.0,...,,,,,,,,,524.0,386.7
1,73558,1.0,,,,,1.0,,7.0,1.0,...,,,,,,,,,507.0,374.2
2,73559,1.0,,,,,1.0,,0.0,,...,,,,,,,,,732.0,540.2
3,73561,1.0,,,,,1.0,,0.0,,...,,,,,,,,,225.0,166.1
4,73562,,,,4.0,,1.0,,5.0,3.0,...,,,,,,,,,750.0,553.5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5583,83723,,,,4.0,,1.0,,3.0,3.0,...,,,,,,,,,621.0,458.3
5584,83724,1.0,,,,,1.0,,0.0,,...,,,,,,,,,837.0,617.7
5585,83726,,,,1.0,,,,,,...,,,,,,,,,,
5586,83727,,,,3.0,,1.0,,1.0,2.0,...,,,,,,,,,720.0,531.4
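
Because every later step depends on this merge, it can be worth confirming that `SEQN` is unique in each file and seeing which respondents the inner join leaves out. The sketch below is only an illustration: `left` and `right` are small made-up frames, not the NHANES files, and the check relies on the `validate` and `indicator` options of `pd.merge`.

```python
import pandas as pd

# Hypothetical toy frames standing in for questionnaire.csv and demographic.csv
left = pd.DataFrame({"SEQN": [1, 2, 3], "MCQ160A": [1.0, 2.0, 2.0]})
right = pd.DataFrame({"SEQN": [2, 3, 4], "RIDAGEYR": [45, 61, 30]})

# validate="one_to_one" raises a MergeError if SEQN is duplicated on either side
inner = pd.merge(left, right, on="SEQN", how="inner", validate="one_to_one")

# An outer merge with indicator=True marks rows that exist in only one frame,
# i.e. the respondents an inner join would drop
audit = pd.merge(left, right, on="SEQN", how="outer", indicator=True)
dropped = sorted(audit.loc[audit["_merge"] != "both", "SEQN"].tolist())

print(len(inner))  # number of rows kept by the inner join
print(dropped)     # SEQN values present in only one of the two frames
```

Applied to the real files, the same two calls would show how many respondents are lost at each merge before any modelling decisions are made.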


In [200]:
# Load the variable descriptions
variable_descriptions = pd.read_excel('./variable_description.xlsx')
variable_descriptions['Variable Name'] = variable_descriptions['Variable Name'].str.upper()

# Filter descriptions for columns present in the dataframe
descriptions = variable_descriptions[variable_descriptions['Variable Name'].isin(merged_df.columns)]

# Identify columns without descriptions
no_explanation = set(merged_df.columns) - set(descriptions['Variable Name'])

# Make a dictionary of descriptions
descriptions = descriptions.set_index('Variable Name')['Variable Description'].to_dict()

print(descriptions)

{'SEQN': 'Respondent sequence number', 'RIDSTATR': 'Interview and examination status of the participant.', 'RIAGENDR': 'Gender of the participant.', 'RIDAGEYR': 'Age in years of the participant at the time of screening. Individuals 80 and over are topcoded at 80 years of age.', 'RIDAGEMN': 'Age in months of the participant at the time of screening. Individuals aged 959 months and older are topcoded at 959 months.', 'RIDRETH1': 'Recode of reported race and Hispanic origin information', 'DMDCITZN': '{Are you/Is SP} a citizen of the United States? [Information about citizenship is being collected by the U.S. Public Health Service to perform health related research. Providing this information is voluntary and is collected under the authority of the Public Health Service Act. There will be no effect on pending immigration or citizenship petitions.]', 'DMDYRSUS': 'Length of time the participant has been in the US.', 'DMDEDUC3': 'What is the highest grade or level of school {you have/SP has} 

In [201]:
print("Columns without descriptions:")
print(no_explanation)

Columns without descriptions:
{'URXUCR.y', 'LBDR62.y', 'LBXAPB', 'LBDR72.y', 'LBDR53.x', 'LBDR42.x', 'LBDR62.x', 'WTSAF2YR.x', 'LBDR69.y', 'PHAFSTMN.y', 'LBDR82.x', 'LBDR33.x', 'LBDR73.y', 'LBDR55.y', 'LBDR35.x', 'LBDR11.y', 'LBDR67.y', 'LBDR51.y', 'LBDR11.x', 'LBDR82.y', 'LBDR71.x', 'LBDR64.x', 'LBDR70.x', 'LBDR68.y', 'LBDR16.y', 'LBDR54.x', 'LBDR06.y', 'LBDR81.y', 'LBDR35.y', 'WTSAF2YR.y', 'LBDR67.x', 'LBDR18.y', 'LBDR55.x', 'LBDR84.y', 'LBDR16.x', 'LBDRHP.y', 'WTSB2YR.x', 'LBDR66.x', 'SMAQUEX.x', 'LBDR54.y', 'LBDR84.x', 'LBDR71.y', 'LBDRPI.y', 'LBDR53.y', 'LBDR72.x', 'LBDR89.y', 'WTSA2YR.x', 'LBDR59.y', 'LBDR61.y', 'LBDR68.x', 'LBDR59.x', 'LBDR45.y', 'WTSA2YR.y', 'LBDR58.x', 'LBDR69.x', 'LBDR31.x', 'LBDR45.x', 'LBDR89.x', 'LBDRPCR.x', 'LBDR58.y', 'LBDR18.x', 'LBDR52.y', 'LBDR66.y', 'LBDR70.y', 'LBDR61.x', 'LBDR39.x', 'LBDR51.x', 'LBDR73.x', 'PHAFSTMN.x', 'LBDR83.y', 'WTSB2YR.y', 'PHAFSTHR.x', 'LBDR52.x', 'LBDR81.x', 'LBDR33.y', 'LBDR64.y', 'LBDRPCR.y', 'LBDR56.x', 'LBDR26.y', 'LBDR4
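
Many of the undocumented names end in `.x` or `.y`, which are the default suffixes R's `merge` adds to duplicated column names. Before dropping them, it may be worth checking whether the base name (without the suffix) does have a description. The following is a minimal sketch under that assumption. `data` and `codebook` are hypothetical stand-ins for `merged_df` and `variable_descriptions`, and the description given for `LBDR62` is a placeholder rather than the real codebook text.

```python
import pandas as pd

# Hypothetical stand-ins for merged_df and variable_description.xlsx
data = pd.DataFrame(columns=["SEQN", "RIAGENDR", "LBDR62.x", "LBDR62.y"])
codebook = pd.DataFrame({
    "Variable Name": ["seqn", "RIAGENDR", "LBDR62"],
    "Variable Description": [
        "Respondent sequence number",
        "Gender of the participant.",
        "placeholder description",  # not the real codebook entry
    ],
})
codebook["Variable Name"] = codebook["Variable Name"].str.upper()

# Same set difference as in the cell above
described = set(codebook["Variable Name"])
missing = sorted(set(data.columns) - described)

# Undocumented columns whose name without the final ".suffix" is documented
base_match = [col for col in missing if col.rsplit(".", 1)[0] in described]

print(missing)
print(base_match)
```

If `base_match` is non-empty on the real data, those columns are probably duplicates of documented variables rather than genuinely unknown ones.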

In [202]:
# Drop the columns without descriptions
merged_df = merged_df.drop(columns=no_explanation)
merged_df.head(4)

Unnamed: 0,SEQN,ACD011A,ACD011B,ACD011C,ACD040,ACD110,ALQ101,ALQ110,ALQ120Q,ALQ120U,...,URXUTL,URDUTLLC,URXUTU,URDUTULC,URXUUR,URDUURLC,URXPREG,URXUAS,LBDB12,LBDB12SI
0,73557,1.0,,,,,1.0,,1.0,3.0,...,,,,,,,,,524.0,386.7
1,73558,1.0,,,,,1.0,,7.0,1.0,...,,,,,,,,,507.0,374.2
2,73559,1.0,,,,,1.0,,0.0,,...,,,,,,,,,732.0,540.2
3,73561,1.0,,,,,1.0,,0.0,,...,,,,,,,,,225.0,166.1


In [203]:
"MCQ180A" in merged_df.columns

True

In [204]:
arthritis_columns = [
    #= Arthritis Diagnosis and Details =#
    'MCQ160A',  # Ever told had arthritis (direct indicator of arthritis diagnosis)
    'MCQ180A',  # Age when first told had arthritis (helps determine age of onset)
    'MCQ195',   # Type of arthritis (specifies arthritis type, e.g., osteoarthritis, rheumatoid arthritis)

    #= Physical Functioning and Limitations (Arthritis often leads to functional impairments) =#
    'PFQ049',   # Kept from working due to health problem (assesses impact on employment)
    'PFQ051',   # Limited in kind or amount of work (evaluates work limitations)
    'PFQ054',   # Difficulty walking without equipment (mobility issues common in arthritis)
    'PFQ059',   # Limited in any way due to health problem (overall limitation assessment)

    #= Specific Activity Difficulties (Arthritis affects daily living activities) =#
    'PFQ061A',  # Difficulty managing money (fine motor skills may be affected by hand arthritis)
    'PFQ061B',  # Difficulty walking 1/4 mile (measures mobility limitations)
    'PFQ061C',  # Difficulty walking up 10 steps (assesses lower body strength and pain)
    'PFQ061D',  # Difficulty stooping, crouching, kneeling (common challenges with joint pain)
    'PFQ061E',  # Difficulty lifting/carrying 10 lbs (upper body strength and joint function)
    'PFQ061F',  # Difficulty doing household chores (impact on independent living)
    'PFQ061G',  # Difficulty preparing meals (fine motor skills and standing tolerance)
    'PFQ061H',  # Difficulty walking between rooms (indicates severe mobility issues)
    'PFQ061I',  # Difficulty standing up from a chair (lower body strength and joint flexibility)
    'PFQ061J',  # Difficulty getting in/out of bed (overall mobility and stiffness)
    'PFQ061K',  # Difficulty eating (hand dexterity and grip strength)
    'PFQ061L',  # Difficulty dressing (range of motion and fine motor skills)
    'PFQ061M',  # Difficulty standing for 2 hours (endurance and joint pain)
    'PFQ061N',  # Difficulty sitting for 2 hours (joint stiffness and discomfort)
    'PFQ061O',  # Difficulty reaching over head (shoulder joint issues)
    'PFQ061P',  # Difficulty grasping small objects (hand and finger arthritis)
    'PFQ061Q',  # Difficulty going out socially (mobility and pain affecting social participation)
    'PFQ061R',  # Difficulty in social activities (impact on social life and mental health)
    'PFQ061S',  # Difficulty relaxing at home (chronic pain affecting rest)
    'PFQ061T',  # Difficulty pushing/pulling large objects (strength and joint function)

    #= Pain and Health Management Strategies =#
    'MCQ075',   # Current joint symptoms (assesses presence of symptoms)
    'MCQ370A',  # Controlling weight to lower disease risk (weight management is crucial in arthritis)
    'MCQ370B',  # Increasing physical activity to lower disease risk (exercise can improve symptoms)
    'MCQ370C',  # Reducing sodium/salt intake (dietary factors may influence inflammation)
    'MCQ370D',  # Reducing fat/calorie intake (supports weight management)

    #= Weight and Body Measures (Obesity is a risk factor for arthritis) =#
    'WHQ030',   # Self-perceived weight status (perception may influence management)
    'WHQ040',   # Desire to change weight (motivation for weight loss)
    'WHD010',   # Current height (needed for BMI calculation)
    'WHD020',   # Current weight (needed for BMI calculation)
    'WHD050',   # Weight 1 year ago (assesses recent weight changes)
    'WHD110',   # Weight 10 years ago (long-term weight trends)
    'WHD120',   # Weight at age 25 (baseline weight for comparison)
    'WHD130',   # Height at age 25 (to verify height consistency)

    #= Physical Activity (Physical activity affects arthritis symptoms and progression) =#
    'PAQ605',   # Engages in vigorous-intensity work activity (may impact joint health)
    'PAQ610',   # Days per week of vigorous work activity (frequency of high-impact activity)
    'PAQ620',   # Engages in moderate-intensity work activity (assesses activity level)
    'PAQ625',   # Days per week of moderate work activity (frequency)
    'PAQ650',   # Engages in vigorous recreational activities (exercise habits)
    'PAQ655',   # Days per week of vigorous recreational activities (frequency)
    'PAQ665',   # Engages in moderate recreational activities (beneficial for joint health)
    'PAQ670',   # Days per week of moderate recreational activities (frequency)
    'PAQ706',   # Days physically active for at least 60 minutes (overall activity level)
    'PAQ710',   # Hours per day watching TV (sedentary behavior contributing to stiffness)
    'PAQ715',   # Hours per day using computer (sedentary time)

    #= Comorbid Conditions and Related Health Factors =#
    'MCQ160N',  # Ever told had gout (related arthritic condition)
    'MCQ160K',  # Ever told had chronic bronchitis (systemic inflammation relevance)
    'MCQ160L',  # Ever told had liver condition (possible medication side effects)
    'OSQ060',   # Ever told had osteoporosis (bone health relevance)
    'OSQ072',   # Ever prescribed medicine for osteoporosis (treatment overlaps)
    'RXQ510',   # Ever told to take low-dose aspirin (anti-inflammatory usage)
    'MCQ053',   # Treatment for anemia (could be related to chronic disease)

    # #= Family Health History (Genetic predisposition) =#
    # 'MCQ300A',  # Family history of heart attack/angina (shared risk factors)
    # 'MCQ300B',  # Family history of asthma (possible autoimmune link)
    # 'MCQ300C',  # Family history of diabetes (metabolic syndrome relevance)

    # #= Sleep Patterns (Chronic pain affects sleep quality) =#
    # 'SLD010H',  # Hours of sleep on weekdays (sleep duration)
    # 'SLQ050',   # Trouble sleeping (indicative of pain interference)
    # 'SLQ060',   # Sleep disorder diagnosis (sleep health)

    # #= Mental Health Status (Chronic conditions impact mental well-being) =#
    # 'DPQ010',   # Little interest or pleasure in doing things (depressive symptom)
    # 'DPQ020',   # Feeling down, depressed, or hopeless (mental health indicator)
    # 'DPQ030',   # Trouble sleeping (overlap with sleep variables)
    # 'DPQ040',   # Feeling tired or having little energy (could be due to pain or depression)
    # 'DPQ050',   # Poor appetite or overeating (possible stress response)
    # 'DPQ060',   # Feeling bad about yourself (self-esteem issues)
    # 'DPQ070',   # Trouble concentrating (affects daily functioning)
    # 'DPQ080',   # Moving or speaking slowly (psychomotor retardation)
    # 'DPQ090',   # Thoughts of self-harm (critical mental health concern)
    # 'DPQ100',   # Difficulty with daily activities (overall functioning)
]

In [205]:
# Filter the merged_df to only include the columns specified in arthritis_columns.
# .copy() makes df an independent DataFrame, which avoids pandas' SettingWithCopyWarning.
df = merged_df[arthritis_columns].copy()
# Fill missing arthritis types with 0 (assign back instead of using inplace=True on a column)
df['MCQ195'] = df['MCQ195'].fillna(0)
df


Unnamed: 0,MCQ160A,MCQ180A,MCQ195,PFQ049,PFQ051,PFQ054,PFQ059,PFQ061A,PFQ061B,PFQ061C,...,PAQ706,PAQ710,PAQ715,MCQ160N,MCQ160K,MCQ160L,OSQ060,OSQ072,RXQ510,MCQ053
0,1.0,62.0,9.0,2.0,2.0,2.0,2.0,1.0,1.0,1.0,...,,2.0,8.0,2.0,2.0,2.0,2.0,,1.0,2.0
1,2.0,,0.0,2.0,2.0,2.0,2.0,,,,...,,4.0,8.0,2.0,2.0,2.0,2.0,,1.0,2.0
2,2.0,,0.0,2.0,2.0,2.0,2.0,5.0,1.0,1.0,...,,4.0,0.0,2.0,2.0,2.0,1.0,2.0,2.0,2.0
3,1.0,70.0,9.0,2.0,1.0,2.0,,1.0,1.0,2.0,...,,1.0,1.0,2.0,1.0,2.0,2.0,,1.0,2.0
4,1.0,45.0,2.0,2.0,1.0,2.0,,1.0,2.0,1.0,...,,5.0,8.0,1.0,2.0,2.0,2.0,,1.0,2.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5583,2.0,,0.0,2.0,2.0,2.0,2.0,1.0,1.0,1.0,...,,3.0,8.0,2.0,2.0,2.0,2.0,,2.0,2.0
5584,1.0,3.0,1.0,2.0,2.0,2.0,2.0,1.0,1.0,1.0,...,,2.0,8.0,2.0,2.0,2.0,2.0,,1.0,2.0
5585,2.0,,0.0,2.0,2.0,2.0,2.0,,,,...,,3.0,0.0,2.0,2.0,2.0,2.0,,2.0,2.0
5586,2.0,,0.0,2.0,2.0,2.0,2.0,,,,...,,2.0,1.0,2.0,2.0,2.0,,,,2.0
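
Since `MCQ195` is the variable the research question aims to predict, a quick look at its class balance is useful before going further. The sketch below uses only the nine `MCQ195` values visible in the rows displayed above (after `fillna(0)`), so the counts illustrate the technique and say nothing about the full dataset.

```python
import pandas as pd

# MCQ195 values from the nine rows displayed above; not the full dataset
sample = pd.Series([9.0, 0.0, 0.0, 9.0, 2.0, 0.0, 1.0, 0.0, 0.0], name="MCQ195")

# Absolute and relative frequency of each code, ordered by code value
counts = sample.value_counts().sort_index()
shares = sample.value_counts(normalize=True).sort_index().round(2)

print(pd.DataFrame({"count": counts, "share": shares}))
```

Running the same two `value_counts` calls on `df['MCQ195']` would show how dominant the filled-in `0` class is and whether any codes need checking against the codebook before they are treated as arthritis types.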


In [206]:
# Set the maximum fraction of missing values allowed per column
threshold = 0.20

# Calculate the fraction of missing values for each column
missing_percentage = df.isnull().mean()

# Select the columns whose missing fraction exceeds the threshold
columns_to_drop = missing_percentage[missing_percentage > threshold].index

# Drop those columns, then drop any rows that still contain missing values
test = df.drop(columns=columns_to_drop)
test.dropna(inplace=True)

test

Unnamed: 0,MCQ160A,MCQ195,PFQ049,PFQ051,PFQ054,MCQ370A,MCQ370B,MCQ370C,MCQ370D,WHQ030,...,PAQ605,PAQ620,PAQ650,PAQ665,PAQ710,PAQ715,MCQ160N,MCQ160K,MCQ160L,MCQ053
0,1.0,9.0,2.0,2.0,2.0,1.0,2.0,1.0,2.0,3.0,...,2.0,2.0,2.0,2.0,2.0,8.0,2.0,2.0,2.0,2.0
1,2.0,0.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,3.0,...,2.0,1.0,2.0,2.0,4.0,8.0,2.0,2.0,2.0,2.0
2,2.0,0.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,3.0,...,2.0,1.0,2.0,1.0,4.0,0.0,2.0,2.0,2.0,2.0
3,1.0,9.0,2.0,1.0,2.0,1.0,2.0,2.0,2.0,2.0,...,2.0,1.0,2.0,2.0,1.0,1.0,2.0,1.0,2.0,2.0
4,1.0,2.0,2.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,...,1.0,2.0,2.0,2.0,5.0,8.0,1.0,2.0,2.0,2.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5582,2.0,0.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,3.0,...,2.0,2.0,1.0,1.0,2.0,8.0,2.0,2.0,2.0,2.0
5583,2.0,0.0,2.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,...,2.0,2.0,2.0,1.0,3.0,8.0,2.0,2.0,2.0,2.0
5584,1.0,1.0,2.0,2.0,2.0,1.0,1.0,1.0,1.0,3.0,...,2.0,2.0,2.0,1.0,2.0,8.0,2.0,2.0,2.0,2.0
5585,2.0,0.0,2.0,2.0,2.0,2.0,1.0,2.0,2.0,3.0,...,2.0,2.0,1.0,2.0,3.0,0.0,2.0,2.0,2.0,2.0
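
The two-step cleaning above (drop sparse columns, then drop incomplete rows) is easier to follow on a small example. The frame `toy` below is made up for illustration. Its columns are 20%, 40% and 0% missing, so only one of them exceeds the 0.20 threshold.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: "a" is 20% missing, "b" is 40% missing, "c" is complete
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0, 5.0],
    "b": [np.nan, np.nan, 3.0, 4.0, 5.0],
    "c": [1.0, 2.0, 3.0, 4.0, 5.0],
})
threshold = 0.20

# Fraction of missing values per column
missing_fraction = toy.isnull().mean()

# Strict ">" means a column at exactly the threshold is kept
to_drop = missing_fraction[missing_fraction > threshold].index.tolist()

# Drop sparse columns first, then any row that still has a missing value
kept = toy.drop(columns=to_drop).dropna()

print(to_drop)     # columns above the threshold
print(kept.shape)  # rows and columns that survive both steps
```

Column `a` sits exactly at the threshold and is kept, but the row where it is missing is then removed by `dropna`. The order matters: dropping rows first would discard every row with any gap in a sparse column.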


## **Main Research Question:** To what extent can Machine Learning be used to predict the type of arthritis among adults in the NHANES population?

#### Sub Questions:
- What significant associations exist between demographic characteristics, ... and arthritis types among adults in the NHANES dataset?
- Which features are most influential in predicting arthritis types, and how does feature selection impact model performance?
- Among various classification models and evaluation metrics, which combination yields the best performance in predicting arthritis types?

<div style="text-align: center;">
    <img src="https://i.imgur.com/5JlOAjj.png" alt="Flow Diagram" style="width:720px;height:400px;">
</div>

<div style="text-align: center;">
    <img src="https://i.imgur.com/Nlwxj75.png" alt="Flow Diagram" style="width:720px;height:400px;">
</div>