# Machine Learning Neural Network (ECG prediction)

## Introduction
In this project, I explore the application of machine learning to analyze electrocardiogram (ECG) data. ECGs, crucial for diagnosing heart conditions, present an opportunity for enhanced analysis through machine learning to improve accuracy and efficiency.

I will use ECG recordings and expert annotations stored in CSV and TXT files. Each CSV file contains the modified limb lead II (MLII) and V5 signals; each TXT file contains the time stamps of the R-peaks and the type of each heartbeat (normal or abnormal). These resources are essential for training and evaluating a machine learning model that identifies R-peaks and classifies the normality of heartbeats, both key indicators of heart health.

My objectives:
- <b>Investigate Machine Learning Applications:</b> Apply machine learning models, specifically using Dense and Dropout layers, to ECG data.
- <b>Detect R-Peaks:</b> Use machine learning for accurate R-peak detection in ECG signals.
- <b>Classify Heartbeat Normality:</b> Develop a model to classify heartbeat normality, identifying potential cardiac abnormalities.

This project aims to explore the potential of machine learning in ECG analysis.

Libraries imported for the project:

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Data Collection and Preprocessing

### Data Collection
The data is sourced from the [MIT-BIH Arrhythmia Database](https://www.physionet.org/content/mitdb/1.0.0/).

### Data Loading

In [2]:
# Record numbers of the files to load (each record appears once)

file_paths = [
    '100', '101', '102', '103', '104',
    '105', '106', '107', '108', '109',
    '111', '112', '113', '114', '115',
    '116', '117', '118', '119', '121',
    '122', '123', '124', '200', '201',
    '202', '203', '205', '207', '208',
    '209', '210', '212', '213', '214',
    '215', '217', '219', '220', '221',
    '222', '223', '228', '230', '231'
]

csv_list = []
txt_list = []

for file_path in file_paths:
    # Signal file: sample number plus the two ECG leads
    csv_df = pd.read_csv(f"data/{file_path}.csv")
    # Annotation file: tab-separated text
    txt_df = pd.read_table(f"data/{file_path}annotations.txt", delimiter='\t')

    csv_df.columns = ['samp_num', 'MLII', 'V5']
    # '6' holds the raw annotation string; RA stands for rhythm annotation
    txt_df.columns = ['6', 'RA']

    csv_list.append(csv_df)
    txt_list.append(txt_df)

# Check that both lists have the same length
print(f"csv: {len(csv_list)}\ntxt: {len(txt_list)}")

csv: 45
txt: 45


## Exploratory Data Analysis (EDA) 
Through visualizations and statistical summaries, EDA helps uncover underlying structures, informs the choice of appropriate models, and guides hypothesis formulation, ensuring a deeper insight into the data's nature. This process is invaluable in projects involving complex datasets, such as ECG data, where identifying key features like R-peaks or heartbeat normality can significantly impact model performance and accuracy.

In [3]:
txt_list[0].head()

|   | 6                   | RA |
|---|---------------------|----|
| 0 | 0:00.050 18 + 0 0 0 | (N |
| 1 | 0:00.214 77 N 0 0 0 |    |
| 2 | 0:01.028 370 N 0 0 0 |   |
| 3 | 0:01.839 662 N 0 0 0 |   |
| 4 | 0:02.628 946 N 0 0 0 |   |


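The output above is the raw data from the TXT file. Each row packs the time stamp, the sample number, and the annotation symbol into a single string.

According to the database [website](https://www.physionet.org/files/mitdb/1.0.0/mitdbdir/intro.htm), each symbol denotes a beat type or event: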

| Symbol | Meaning                                      |
|--------|----------------------------------------------|
| · or N | Normal beat                                  |
| L      | Left bundle branch block beat                |
| R      | Right bundle branch block beat               |
| A      | Atrial premature beat                        |
| a      | Aberrated atrial premature beat              |
| J      | Nodal (junctional) premature beat            |
| S      | Supraventricular premature beat              |
| V      | Premature ventricular contraction            |
| F      | Fusion of ventricular and normal beat        |
| [      | Start of ventricular flutter/fibrillation    |
| !      | Ventricular flutter wave                     |
| ]      | End of ventricular flutter/fibrillation      |
| e      | Atrial escape beat                           |
| j      | Nodal (junctional) escape beat               |
| E      | Ventricular escape beat                      |
| /      | Paced beat                                   |
| f      | Fusion of paced and normal beat              |
| x      | Non-conducted P-wave (blocked APB)           |
| Q      | Unclassifiable beat                          |
| \|     | Isolated QRS-like artifact                   |
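
As a minimal sketch of how one raw annotation row can be decoded, the snippet below splits a string in the format shown by `txt_list[0].head()` into its time stamp, sample number, and symbol, and then collapses the symbol into a normal/abnormal flag. The names `Annotation`, `parse_annotation`, and `is_normal`, and the choice to treat only `N` as normal, are illustrative assumptions and not part of the original notebook.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    time: str      # time stamp, e.g. "0:00.214"
    sample: int    # sample number the annotation points at
    symbol: str    # annotation symbol, e.g. "N" or "V"


def parse_annotation(raw: str) -> Annotation:
    """Split one raw annotation string into its first three fields."""
    fields = raw.split()
    return Annotation(time=fields[0], sample=int(fields[1]), symbol=fields[2])


def is_normal(symbol: str) -> bool:
    """Assumption: only 'N' counts as normal; every other symbol is abnormal."""
    return symbol == 'N'


rows = ["0:00.214 77 N 0 0 0", "0:01.028 370 N 0 0 0"]
parsed = [parse_annotation(row) for row in rows]

for annotation in parsed:
    print(annotation.sample, annotation.symbol, is_normal(annotation.symbol))
```

Keeping the parsing in one small helper makes it easier to check the field positions once, instead of repeating `split()` indices throughout the notebook.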

The function below merges each annotation dataframe into its signal dataframe, so that the result can be used to train the neural network:

In [4]:
# Annotation symbols that are kept as R-peak labels.
# Assumption: the original notebook never defines `symbols` (hence the
# NameError it raised); this list follows the symbol table above and may
# need adjusting.
symbols = [
    'N', 'L', 'R', 'A', 'a', 'J', 'S', 'V', 'F', '[',
    '!', ']', 'e', 'j', 'E', '/', 'f', 'x', 'Q', '|',
]


def add_rpeak_normality_columns(df, ref_df):
    ref_rpeak = []
    ref_normal = []

    # Each annotation string looks like "0:00.214 77 N 0 0 0":
    # field 1 is the sample number, field 2 is the beat symbol.
    for entry in ref_df['6']:
        fields = entry.split()
        if len(fields) > 2 and fields[2] in symbols:
            ref_rpeak.append(int(fields[1]))
            ref_normal.append(fields[2])

    # Set 'rpeak' to 1 where samp_num is an annotated sample, 0 elsewhere
    df['rpeak'] = 0
    df.loc[df['samp_num'].isin(ref_rpeak), 'rpeak'] = 1

    # Map each symbol to its position in `symbols`.
    # Note: 'N' maps to 0, the same value used for rows without an annotation.
    symbol_to_int = {symbol: idx for idx, symbol in enumerate(symbols)}
    peak_to_code = {
        rpeak: symbol_to_int[symbol]
        for rpeak, symbol in zip(ref_rpeak, ref_normal)
    }

    # Look up the code for every sample; samples without an annotation get 0
    df['normality'] = df['samp_num'].map(peak_to_code).fillna(0).astype(int)

    return df

Merging the dataframes:

In [5]:
# Pair each signal dataframe with its annotation dataframe by position
df_list = [
    add_rpeak_normality_columns(csv_df, txt_df)
    for csv_df, txt_df in zip(csv_list, txt_list)
]

print('completed')

Merged dataframe:

In [None]:
df_list[0].head()
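
The merged dataframe carries one `rpeak` label per sample, while Dense layers expect fixed-length inputs. One possible way to bridge the two is to slice the signal into fixed-size windows and label each window by whether it contains a peak. The sketch below shows that idea in plain Python on a toy signal; the function name `make_windows`, the window size, the step, and the labelling rule are assumptions for illustration, not the approach used in this project.

```python
def make_windows(signal, rpeak_flags, size, step):
    """Slice a signal into fixed-length windows.

    Returns (windows, labels), where a label is 1 if any sample in the
    window is flagged as an R-peak and 0 otherwise.
    """
    windows, labels = [], []
    for start in range(0, len(signal) - size + 1, step):
        windows.append(signal[start:start + size])
        labels.append(int(any(rpeak_flags[start:start + size])))
    return windows, labels


# Toy signal with a single flagged peak at index 3
signal = [0.0, 0.1, 0.4, 1.2, 0.3, 0.0, -0.1, 0.0]
flags = [0, 0, 0, 1, 0, 0, 0, 0]

windows, labels = make_windows(signal, flags, size=4, step=2)
print(len(windows), labels)
```

With real data, `signal` and `rpeak_flags` would correspond to the `MLII` and `rpeak` columns converted to lists.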

### Basic Information
The first merged dataframe will be used as an example for EDA.

Number of merged dataframes:

In [None]:
print(len(df_list))

In [None]:
df_list[0].head()

In [None]:
df_list[0].info()

In [None]:
df_list[0].describe()

### Check for Missing Values

In [None]:
for df in df_list:
    print(df.isnull().sum())

### Distribution of Key Features

In [None]:
df = df_list[0]
# MLII amplitude at the samples annotated as R-peaks
rpeak_amplitudes = df.loc[df['rpeak'] == 1, 'MLII']

plt.figure(figsize=(10, 6))
sns.histplot(rpeak_amplitudes, bins=30, kde=True)
plt.title('Distribution of R-peak amplitude in DataFrame 1')
plt.xlabel('R-peak amplitude')
plt.ylabel('Frequency')
plt.show()

### Identifying Outliers

In [None]:
# Box plot of the MLII amplitude at annotated R-peaks
sns.boxplot(x=df.loc[df['rpeak'] == 1, 'MLII'])
plt.title('Box Plot of R-peak Values')
plt.show()
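
A box plot conventionally draws its whiskers at 1.5 times the interquartile range (IQR) and shows points beyond them as outliers. The same rule can be applied numerically, as in the minimal sketch below. The helper name `iqr_outliers` and the sample amplitudes are made up for illustration, and the quartile method of `statistics.quantiles` may differ slightly from the one used by the plotting library.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical R-peak amplitudes with one unusually large value
amplitudes = [1.10, 1.15, 1.20, 1.18, 1.12, 1.22, 1.17, 2.90]
print(iqr_outliers(amplitudes))
```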

In [None]:
# Correlation matrix heatmap
plt.figure(figsize=(10, 8))
corr_matrix = df.corr()
sns.heatmap(corr_matrix, annot=True, fmt=".2f")
plt.title('Correlation Matrix of Features')
plt.show()

## Model Building (Markdown + Code)

- Model Architecture (Markdown): Discuss the architecture of your model, including why you chose certain layers and their configurations.
- Model Definition (Code): Code cells defining your model using Sequential API with Dense and Dropout layers.
- Model Compilation (Code): Include code for compiling your model, specifying the optimizer and loss function.
- Training (Markdown + Code): Discuss your training approach.
- Training Process (Code): Show the code that trains the model, including any callbacks or special training conditions.
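
Since the model is planned around Dense and Dropout layers, the following sketch illustrates what those two layer types compute, written in plain Python rather than with a deep learning framework. A dense layer is a weighted sum plus bias followed by an activation (ReLU here), and dropout randomly zeroes units during training while rescaling the survivors. The weights, biases, input window, and function names are arbitrary assumptions chosen only to make the example runnable; this is not the project's model.

```python
import random


def dense(inputs, weights, biases):
    """Fully connected layer with ReLU: one output per row of weights."""
    outputs = []
    for row, bias in zip(weights, biases):
        total = sum(w * x for w, x in zip(row, inputs)) + bias
        outputs.append(max(0.0, total))
    return outputs


def dropout(values, rate, training, rng):
    """Inverted dropout: while training, zero each unit with probability
    `rate` and rescale the survivors; otherwise pass values through."""
    if not training or rate == 0.0:
        return list(values)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


window = [0.1, 0.4, 1.2, 0.3]                      # toy input window
weights = [[0.5, -0.2, 0.8, 0.1], [-0.3, 0.6, -0.5, 0.2]]
biases = [0.0, 0.1]

hidden = dense(window, weights, biases)
rng = random.Random(0)
print(dropout(hidden, rate=0.5, training=True, rng=rng))
print(dropout(hidden, rate=0.5, training=False, rng=rng))
```

At inference time dropout is an identity operation, which is why the second call returns the dense output unchanged.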

## Model Evaluation (Markdown + Code)

- Evaluation Metrics (Markdown): Describe the metrics you will use to evaluate your model.
- Evaluation (Code): Code cells to evaluate the model and print out metrics.
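
Because R-peaks occupy only a small fraction of the samples, plain accuracy can look high even for a model that misses many peaks, so precision, recall, and F1 are common choices for this kind of task. The sketch below computes them for binary labels in plain Python. The function name `precision_recall_f1` and the toy label lists are assumptions for illustration and do not represent results from this project.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for binary labels (1 = R-peak)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy labels: three true peaks, one missed and one false alarm
y_true = [0, 0, 1, 0, 0, 1, 0, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 0, 0, 1, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```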

## Results and Discussion (Markdown + Code)

- Results (Markdown + Code): Present key findings, supported by code cells that output results, visualizations, etc.
- Discussion (Markdown): Interpret the results, discuss any limitations, and how the model could be improved.

## Conclusion and Future Work (Markdown)

Summarize your findings, the implications of your work, and any potential future directions for this research.

## References (Markdown)

List all references, including articles, books, and online resources, that were cited or used for code in the project.