## Section I. Introduction
The dataset chosen for this study is the [***"Pediatric Anemia Dataset: Hematological Indicators and Diagnostic Classification"***](https://data.mendeley.com/datasets/y7v7ff3wpj/1). This dataset contains hematological parameters used to support the diagnosis of anemia among patients.

Anemia is a medical condition characterized by a deficiency in healthy red blood cells or hemoglobin, which reduces the blood’s capacity to transport oxygen to body tissues. This condition remains a significant public health concern, particularly in tropical and subtropical regions. Early detection and appropriate treatment using hematological indicators such as hemoglobin level and red blood cell count are essential in addressing this condition.

The goal of this study is to predict the clinical diagnostic outcome of anemia from demographic and hematological parameters. This is a binary classification task: the model classifies each patient as either anemic or non-anemic.

## Section II. Description of the Dataset
The dataset used in this study was obtained from the publicly available anemia clinical dataset published on [Mendeley Data](https://data.mendeley.com/datasets/y7v7ff3wpj/1). The data were collected from anemia patients at Aalok Healthcare Ltd., located in Dhaka, Bangladesh, on October 9, 2023.

Each row in the dataset represents one patient's record, and each column represents a specific attribute. The dataset consists of **1,000 observations** and **8 features**, with an additional target variable (`Decision_Class`).

The features of the dataset are described below:

- **`Gender`**: biological sex of the patient; `m` for male or `f` for female
- **`Age`**: age of the patient (years)
- **`Hb (hemoglobin)`**: measure of the blood's capacity to carry oxygen (g/dL)
- **`RBC (red blood cell count)`**: number of red blood cells per unit volume (million/μL)
- **`PCV (packed cell volume)`**: percentage of red blood cells in blood volume
- **`MCV (mean corpuscular volume)`**: average size of red blood cells (fL)
- **`MCH (mean corpuscular hemoglobin)`**: average hemoglobin content per red blood cell (pg/cell)
- **`MCHC (mean corpuscular hemoglobin concentration)`**: concentration of hemoglobin in red blood cells (g/dL)
- **`Decision_Class`**: binary indicator for the diagnostic outcome (0, 1)

## Import

In [None]:
import numpy as np
import pandas as pd
import csv

## Section III. Data Preparation

### Reading the Dataset
The first step is to load the dataset `anemia.csv`.

In [None]:
anemia_df = pd.read_csv('anemia.csv')

To quickly view the structure of the dataset, we use the function `head()`.

In [None]:
anemia_df.head(10)

In [None]:
anemia_df.shape

The output of `shape` shows that the dataset contains **1,000 rows** and **9 columns** (8 features plus the target variable), confirming that the dataset has been loaded successfully.

To improve readability and consistency, the dataset column names were renamed.

In [None]:
anemia_df = anemia_df.rename(columns={'Gender': 'gender', 
                              'Age': 'age', 
                              'Hb': 'hb', 
                              'RBC': 'rbc', 
                              'PCV': 'pcv', 
                              'MCV': 'mcv',
                              'MCH': 'mch',
                              'MCHC': 'mchc',
                              'Decision_Class': 'decision'})
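By default, `rename()` silently ignores any key in the mapping that does not match an existing column, so a typo in a column name would go unnoticed. The following is a minimal sketch, on a small made-up frame rather than the real dataset, of how passing `errors='raise'` surfaces such mismatches. The frame `toy_df` and the misspelled key `'Haemoglobin'` are illustrative assumptions.

```python
import pandas as pd

# Hypothetical three-column frame standing in for the real dataset.
toy_df = pd.DataFrame({'Gender': ['f'], 'Hb': [11.2], 'Decision_Class': [1]})

column_map = {'Gender': 'gender', 'Hb': 'hb', 'Decision_Class': 'decision'}

# errors='raise' makes rename() fail if a key is not an existing column.
renamed_df = toy_df.rename(columns=column_map, errors='raise')
renamed_columns = list(renamed_df.columns)

# A misspelled source name (assumed for illustration) now raises KeyError
# instead of being skipped silently.
try:
    toy_df.rename(columns={'Haemoglobin': 'hb'}, errors='raise')
    typo_detected = False
except KeyError:
    typo_detected = True

print(renamed_columns, typo_detected)
```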

### Data Cleaning

#### Checking multiple representations
To ensure consistent formatting, multiple representations of values in the **gender** column were checked and standardized. 
The values of the **gender** column were first inspected using the `unique()` function to identify all existing categorical representations in the dataset.

In [None]:
anemia_df['gender'].unique()

The output confirmed that gender was represented using two categories: 'f' for female and 'm' for male. We can then proceed with binary encoding. 

In [None]:
gender_scale = {'f': 0, 'm': 1}
anemia_df['gender'] = anemia_df['gender'].map(gender_scale)

anemia_df['gender']

This transformation converted gender into numerical values, where **female** was encoded as **0** and **male** as **1**. This step is necessary since many machine learning algorithms require numerical input.
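One caveat with `map()` is that any value missing from the dictionary becomes `NaN` without warning. The sketch below shows a defensive version of the encoding on a made-up series. The values `'F'` and `'m '` are hypothetical and were not observed in this dataset, where `unique()` returned only `'f'` and `'m'`.

```python
import pandas as pd

# Hypothetical gender values with inconsistent case and whitespace.
toy_gender = pd.Series(['f', 'm', 'F', 'm '])
gender_scale = {'f': 0, 'm': 1}

# Mapping the raw values leaves the non-matching entries as NaN.
naive_unmapped = int(toy_gender.map(gender_scale).isna().sum())

# Normalizing first (strip whitespace, lowercase) lets every value match.
normalized = toy_gender.str.strip().str.lower()
encoded = normalized.map(gender_scale)
unmapped_count = int(encoded.isna().sum())

print(naive_unmapped, unmapped_count, encoded.tolist())
```

Checking that the count of unmapped values is zero right after encoding is a cheap way to confirm that no category was lost.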

#### Checking data types
To check whether each variable is stored in an appropriate format, we inspect the `.dtypes` attribute and then call `info()` for a fuller summary.

In [None]:
anemia_df.dtypes

In [None]:
anemia_df.info()

The output shows that all variables are stored using appropriate and consistent data types. For discrete variables such as gender, age, and the target classification variable, the data are stored as integer values, while continuous hematological parameters are stored as floating-point values.
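The same check can be made programmatically instead of by reading the printed output. This is a minimal sketch on a made-up frame that uses the renamed column names; the values are invented, and only four of the nine columns are included for brevity.

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype

# Hypothetical rows mimicking the cleaned dataset's layout.
toy_df = pd.DataFrame({
    'gender': [0, 1],
    'age': [4, 9],
    'hb': [10.5, 12.1],
    'decision': [1, 0],
})

# Group the columns by the kind of dtype they are stored as.
integer_columns = [c for c in toy_df.columns if is_integer_dtype(toy_df[c])]
float_columns = [c for c in toy_df.columns if is_float_dtype(toy_df[c])]

print(integer_columns, float_columns)
```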

#### Handling missing values
To check for missing values in the dataset, we chain `isnull()` with `sum()`, which counts the missing entries in each column.

In [None]:
print(anemia_df.isnull().sum())

The output shows there are **no missing values** in the dataset.
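For readers unfamiliar with this idiom, the sketch below shows what `isnull().sum()` reports when a gap does exist. The frame is made up and contains one deliberately missing `hb` value; it does not reflect the real data, which has no missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing hemoglobin value.
toy_df = pd.DataFrame({
    'hb': [11.2, np.nan, 9.8],
    'rbc': [4.1, 3.9, 4.4],
})

# isnull() marks each cell True/False; sum() counts the True cells per column.
missing_per_column = toy_df.isnull().sum()

# Summing again gives a single total for the whole frame.
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print(total_missing)
```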

#### Handling duplicates
To check for duplicate records, we chain `duplicated()` with `sum()`, which counts the rows that exactly repeat an earlier row.

In [None]:
print(anemia_df.duplicated().sum())

The output showed that there were **28 duplicate records** in the dataset. To further examine these duplicate entries, we display all duplicated rows:

In [None]:
anemia_df[anemia_df.duplicated(keep=False)]

Since duplicate records can affect model performance and analysis, the duplicate entries will be removed from the dataset.

In [None]:
anemia_df = anemia_df.drop_duplicates()

After removing duplicates, the dataset was rechecked to confirm that no duplicate records remained:

In [None]:
print(anemia_df.duplicated().sum())

The output confirmed that duplicate records were successfully removed.

In [None]:
anemia_df.info()

The dataset now contains **972 unique patient records**. 
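The figure follows from the earlier counts: 1,000 rows minus 28 duplicates leaves 972. Note that `duplicated()` with its default `keep='first'` counts only the extra copies, while `keep=False` flags every row involved in a duplicate group, so the two calls usually give different numbers. The sketch below illustrates this on a made-up frame. Resetting the index afterwards is an optional step assumed here for tidiness, not something the analysis above performs.

```python
import pandas as pd

# Hypothetical frame: the row (5, 10.1) appears three times.
toy_df = pd.DataFrame({
    'age': [5, 5, 7, 5],
    'hb': [10.1, 10.1, 12.3, 10.1],
})

# Default keep='first': only the repeats after the first occurrence count.
extra_copies = int(toy_df.duplicated().sum())

# keep=False: every member of a duplicate group is flagged.
rows_involved = int(toy_df.duplicated(keep=False).sum())

# drop_duplicates() keeps the first occurrence of each distinct row.
deduped = toy_df.drop_duplicates().reset_index(drop=True)

print(extra_copies, rows_involved, len(deduped))
```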

With the data cleaning process completed, we proceed to the exploratory data analysis (EDA) phase to examine data distributions and relationships between variables.

## Section IV. Exploratory Data Analysis (EDA)

In [None]:
gender = anemia_df['gender']
age = anemia_df['age']
hb = anemia_df['hb']
rbc = anemia_df['rbc']
pcv = anemia_df['pcv']
mcv = anemia_df['mcv']
mch = anemia_df['mch']
mchc = anemia_df['mchc']
decision = anemia_df['decision']