# Data Cleaning and Preprocessing

This notebook cleans and preprocesses the NHANES (August 2021 - August 2023) dataset so it can be used to train a machine learning model that predicts diabetes risk.

### Key Columns
SEQN - Respondent sequence number (the key on which the individual files are merged)

## Table of Contents
1. [Importing Libraries](#import-libraries)
2. [Loading the Dataset](#loading-the-dataset)
3. [Initial Data Exploration](#data-exploration)
4. [Data Cleaning](#data-cleaning)
5. [Data Preprocessing](#data-preprocessing)
6. [Exploratory Data Analysis (EDA)](#eda)


# Import Libraries 

In [2]:
import pandas as pd
import numpy as np  
import matplotlib.pyplot as plt  
import seaborn as sns
import os

# Loading the Dataset

In [20]:
folder_path = '../data/raw/csv_files/'

# List the CSV files in the folder; sort so the merge order is reproducible
csv_files = sorted(f for f in os.listdir(folder_path) if f.endswith('.csv'))

merged_df = None
# Load each CSV file into a pandas DataFrame and outer-merge on the 'SEQN' column
for csv_file in csv_files:
    file_path = os.path.join(folder_path, csv_file)
    df = pd.read_csv(file_path)

    if merged_df is None:
        merged_df = df
    else:
        merged_df = pd.merge(merged_df, df, on='SEQN', how='outer')
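
The merge loop above works cleanly only if every file has one row per `SEQN` and the files share no column other than `SEQN`; otherwise an outer merge multiplies rows or adds `_x`/`_y` suffixes. A minimal sketch of a guarded merge follows. The helper name `merge_on_key` and the toy frames are hypothetical and not part of the notebook.

```python
from functools import reduce

import pandas as pd


def merge_on_key(frames, key="SEQN"):
    """Outer-merge DataFrames on `key`, failing fast on shared columns or repeated keys."""
    def _merge(left, right):
        overlap = (set(left.columns) & set(right.columns)) - {key}
        if overlap:
            raise ValueError(f"Columns present in more than one file: {sorted(overlap)}")
        # validate="one_to_one" raises a MergeError if either side repeats a key
        return pd.merge(left, right, on=key, how="outer", validate="one_to_one")

    return reduce(_merge, frames)


# Toy stand-ins for two questionnaire files (values are made up)
alq = pd.DataFrame({"SEQN": [1.0, 2.0], "ALQ111": [1.0, 2.0]})
bpq = pd.DataFrame({"SEQN": [2.0, 3.0], "BPQ020": [1.0, 2.0]})

merged = merge_on_key([alq, bpq])
print(merged.shape)
```

Respondents that appear in only one file are kept, with NaN in the columns from the other file, which matches the `how='outer'` behaviour of the loop.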


# Data Exploration

In [21]:
# Check merged dataframe structure
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11933 entries, 0 to 11932
Data columns (total 81 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   SEQN      11933 non-null  float64
 1   ALQ111    5481 non-null   float64
 2   ALQ121    4922 non-null   float64
 3   ALQ130    4069 non-null   float64
 4   ALQ142    4082 non-null   float64
 5   ALQ270    2366 non-null   float64
 6   ALQ280    2362 non-null   float64
 7   ALQ151    4901 non-null   float64
 8   ALQ170    2358 non-null   float64
 9   BPQ020    8498 non-null   float64
 10  BPQ030    2968 non-null   float64
 11  BPQ150    2969 non-null   float64
 12  BPQ080    8498 non-null   float64
 13  BPQ101D   8498 non-null   float64
 14  DBQ010    1441 non-null   float64
 15  DBD030    1121 non-null   float64
 16  DBD041    1441 non-null   float64
 17  DBD050    1153 non-null   float64
 18  DBD055    1440 non-null   float64
 19  DBD061    1356 non-null   float64
 20  DBQ073A   869 non-null    fl
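
The `info()` output shows `SEQN` stored as `float64` even though it is an identifier with no missing values. If an integer key is preferred, the cast could look like the sketch below, which assumes every ID is a whole number; the toy frame is made up for illustration.

```python
import pandas as pd

# Toy frame mimicking the merged data: SEQN parsed as float64 (values are made up)
df = pd.DataFrame({
    "SEQN": [130378.0, 130379.0, 130380.0],
    "ALQ111": [None, 1.0, 1.0],
})

# Cast only when it is lossless: no missing IDs and no fractional parts
seqn = df["SEQN"]
if seqn.notna().all() and (seqn % 1 == 0).all():
    df["SEQN"] = seqn.astype("int64")

print(df.dtypes)
```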

In [22]:
# Show the first 5 rows of the merged dataframe
merged_df.head()

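|   | SEQN | ALQ111 | ALQ121 | ALQ130 | ALQ142 | ALQ270 | ALQ280 | ALQ151 | ALQ170 | BPQ020 | ... | SMQ040 | SMD641 | SMD650 | SMD100MN | SMQ621 | SMD630 | SMAQUEX2 | WTPH2YR | LBXTC | LBDTCSI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 130378.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | ... | 3.0 | NaN | NaN | NaN | NaN | NaN | 1.0 | 56042.12941 | 264.0 | 6.83 |
| 1 | 130379.0 | 1.0 | 2.0 | 3.0 | 0.0 | NaN | NaN | 2.0 | NaN | 1.0 | ... | 3.0 | NaN | NaN | NaN | NaN | NaN | 1.0 | 37435.705647 | 214.0 | 5.53 |
| 2 | 130380.0 | 1.0 | 10.0 | 1.0 | 0.0 | NaN | NaN | 2.0 | NaN | 2.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | 85328.844519 | 187.0 | 4.84 |
| 3 | 130381.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 130382.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |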


In [23]:
# Check for missing values
merged_df.isnull().sum()

SEQN            0
ALQ111       6452
ALQ121       7011
ALQ130       7864
ALQ142       7851
            ...  
SMD630      11910
SMAQUEX2     2918
WTPH2YR      3865
LBXTC        5043
LBDTCSI      5043
Length: 81, dtype: int64
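
Raw counts are easier to compare as a share of the 11,933 rows: `SMD630`, for example, is missing in 11,910 of them, about 99.8%. A minimal sketch of ranking columns by missing share and flagging the sparsest ones follows. The `threshold` value and the toy frame are assumptions for illustration, not choices made in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame (values are made up)
df = pd.DataFrame({
    "SEQN": [1, 2, 3, 4, 5],
    "SMD630": [np.nan, np.nan, np.nan, np.nan, 13.0],
    "LBXTC": [264.0, 214.0, np.nan, 187.0, np.nan],
})

# The mean of a boolean mask is the fraction of missing values per column
missing_share = df.isnull().mean().sort_values(ascending=False)

threshold = 0.5  # hypothetical cut-off for "too sparse to use"
sparse_cols = missing_share[missing_share > threshold].index.tolist()

print(missing_share)
print(sparse_cols)
```

On the real data the same two lines would run on `merged_df`; whether a sparse column is dropped or kept depends on why it is missing, since many NHANES questions are asked only of a subset of respondents.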

In [24]:
# Summary statistics
merged_df.describe()   

|   | SEQN | ALQ111 | ALQ121 | ALQ130 | ALQ142 | ALQ270 | ALQ280 | ALQ151 | ALQ170 | BPQ020 | ... | SMQ040 | SMD641 | SMD650 | SMD100MN | SMQ621 | SMD630 | SMAQUEX2 | WTPH2YR | LBXTC | LBDTCSI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 11933.0 | 5481.0 | 4922.0 | 4069.0 | 4082.0 | 2366.0 | 2362.0 | 4901.0 | 2358.0 | 8498.0 | ... | 3243.0 | 273.0 | 1185.0 | 1175.0 | 772.0 | 23.0 | 9015.0 | 8068.0 | 6890.0 | 6890.0 |
| mean | 136344.0 | 1.109104 | 5.030679 | 5.842959 | 4.742283 | 4.838123 | 3.545301 | 1.821261 | 4.396098 | 1.659449 | ... | 2.3395 | 12.728938 | 18.998312 | 0.414468 | 1.281088 | 13.173913 | 1.095618 | 37744.395761 | 181.541074 | 4.694643 |
| std | 3444.904716 | 0.385114 | 4.314321 | 54.996448 | 7.326042 | 7.785415 | 7.133496 | 0.458352 | 45.252453 | 0.542253 | ... | 0.900889 | 14.662369 | 86.343555 | 0.653359 | 3.599612 | 1.922408 | 0.294084 | 30937.952799 | 42.31614 | 1.094357 |
| min | 130378.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 9.0 | 1.0 | 0.0 | 62.0 | 1.6 |
| 25% | 133361.0 | 1.0 | 2.0 | 1.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 1.0 | ... | 1.0 | 3.0 | 5.0 | 0.0 | 1.0 | 12.0 | 1.0 | 18092.615464 | 151.0 | 3.9 |
| 50% | 136344.0 | 1.0 | 5.0 | 2.0 | 4.0 | 4.0 | 0.0 | 2.0 | 1.0 | 2.0 | ... | 3.0 | 10.0 | 10.0 | 0.0 | 1.0 | 13.0 | 1.0 | 30264.726858 | 178.0 | 4.6 |
| 75% | 139327.0 | 1.0 | 8.0 | 3.0 | 9.0 | 9.0 | 7.0 | 2.0 | 2.0 | 2.0 | ... | 3.0 | 20.0 | 20.0 | 1.0 | 1.0 | 14.0 | 1.0 | 49006.051233 | 207.0 | 5.35 |
| max | 142310.0 | 9.0 | 99.0 | 999.0 | 99.0 | 99.0 | 99.0 | 9.0 | 999.0 | 9.0 | ... | 3.0 | 99.0 | 999.0 | 9.0 | 99.0 | 16.0 | 2.0 | 241728.857241 | 438.0 | 11.33 |
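
Several maxima in the summary are 9, 99, or 999 (for example `ALQ111`, `ALQ121`, and `ALQ130`), which look like questionnaire response codes rather than real quantities. NHANES questionnaires commonly reserve 7/77/777 for "Refused" and 9/99/999 for "Don't know", and such codes inflate statistics: `ALQ130` has a median of 2.0 but a standard deviation of about 55. A minimal sketch of converting these codes to NaN follows. The `SENTINELS` mapping is an assumption for illustration and should be checked against the codebook for each variable before use.

```python
import numpy as np
import pandas as pd

# Hypothetical per-column sentinel codes; verify each against the NHANES codebook
SENTINELS = {
    "ALQ111": [7, 9],
    "ALQ121": [77, 99],
    "ALQ130": [777, 999],
}

# Toy frame (values are made up)
df = pd.DataFrame({
    "ALQ111": [1.0, 2.0, 9.0],
    "ALQ121": [5.0, 99.0, 10.0],
    "ALQ130": [2.0, 999.0, 3.0],
})

cleaned = df.copy()
for col, codes in SENTINELS.items():
    # Treat "Refused" / "Don't know" codes as missing rather than as numbers
    cleaned[col] = cleaned[col].replace(codes, np.nan)

print(cleaned.describe().loc["max"])
```

The mapping is kept per column because a value such as 9 is a sentinel in a yes/no item but can be a legitimate answer in a count or frequency item.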


# Data Cleaning