# Preterm Birth Prediction Using EHG

This project uses topological data analysis (TDA) of electrohysterogram (EHG) recordings to predict preterm births. The goal is to determine whether TDA can improve the classification of EHG signals and the subsequent prediction of preterm birth. 

Previous literature: use of persistent homology to classify electromyography signals (hand movements) https://www.proquest.com/openview/8312b24e2d9eb9f154c172d4705d4409/1?pq-origsite=gscholar&cbl=5444811

## Data Access  

https://physionet.org/content/tpehgdb/1.0.1/tpehgdb/

Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.

Gašper Fele-Žorž, Gorazd Kavšek, Živa Novak-Antolič and Franc Jager. A comparison of various linear and non-linear signal processing techniques to separate uterine EMG records of term and pre-term delivery groups. Medical & Biological Engineering & Computing, 46(9):911-922 (2008).

## Data Description

"The Term-Preterm EHG Database, a collection of electrohysterogram (EHG: uterine EMG) recordings obtained at the University Medical Centre Ljubljana from 300 pregnant women, has been contributed to PhysioNet. The collection includes recordings from 262 women who had full-term pregnancies and 38 whose pregnancies ended prematurely; 162 of the recordings were made before the 26th week of gestation, and 138 later."
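
The counts in this description imply a strong class imbalance, which is worth keeping in mind when judging any classifier built on these records. The short calculation below is only an illustration using the numbers quoted above; the variable names are made up for this sketch.

```python
# Counts quoted in the database description above
n_term = 262
n_preterm = 38
n_total = n_term + n_preterm

# Share of records that belong to the preterm class
preterm_fraction = n_preterm / n_total

# Accuracy of a trivial classifier that always predicts "term"
majority_baseline_accuracy = n_term / n_total

print(f"Preterm fraction: {preterm_fraction:.1%}")
print(f"Majority-class baseline accuracy: {majority_baseline_accuracy:.1%}")
```

Any model would need to beat the majority-class baseline, and metrics that account for imbalance (for example recall on the preterm class) are likely to be more informative than plain accuracy.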

Each record is composed of three channels, recorded from 4 electrodes (E1, E2, E3, E4).  
The differences in the electrical potentials of the electrodes were recorded, producing 3 channels:  
 - S1 = E2–E1 (first channel)
 - S2 = E2–E3 (second channel)  
 - S3 = E4–E3 (third channel)  
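
As a minimal sketch of the channel definitions above, the snippet below derives S1, S2 and S3 from electrode potentials. The electrode values are invented for illustration; the database itself already stores the three difference channels.

```python
import numpy as np

# Hypothetical electrode potentials in millivolts, one row per sample,
# columns ordered E1, E2, E3, E4 (values are made up for illustration).
electrodes = np.array([
    [0.10, 0.30, 0.05, 0.20],
    [0.12, 0.28, 0.07, 0.25],
])
e1, e2, e3, e4 = electrodes.T

# Bipolar channels as defined in the list above
s1 = e2 - e1
s2 = e2 - e3
s3 = e4 - e3

# One row per sample, one column per channel
channels = np.column_stack([s1, s2, s3])
print(channels)
```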

The individual records are 30 minutes in duration. Each signal has been digitized at 20 samples per second per channel with 16-bit resolution over a range of ±2.5 millivolts.  
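
The acquisition figures above translate into a few useful numbers. This is a back-of-the-envelope sketch that assumes a nominal 30-minute record; actual record lengths should be read from each header.

```python
sampling_rate_hz = 20        # samples per second per channel
duration_minutes = 30        # nominal record length
adc_bits = 16                # resolution of the digitizer
adc_range_mv = 2.5 - (-2.5)  # full span of the +/-2.5 mV input range

# Expected number of samples in one channel of a nominal record
samples_per_channel = sampling_rate_hz * duration_minutes * 60

# Highest frequency that can be represented at this sampling rate
nyquist_hz = sampling_rate_hz / 2

# Size of one quantization step, converted from millivolts to microvolts
step_microvolts = adc_range_mv / (2 ** adc_bits) * 1000

print(samples_per_channel)
print(nyquist_hz)
print(round(step_microvolts, 4))
```

With roughly 36,000 samples per channel, a full pairwise distance matrix for one record is large, which is one reason the analysis later in this notebook works on a subsample.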

Each signal was digitally filtered using 3 different 4-pole digital Butterworth filters with a double-pass filtering scheme. The band-pass cut-off frequencies were:  

 - 0.08Hz to 4Hz
 - 0.3Hz to 3Hz
 - 0.3Hz to 4Hz

The records in the database contain both the original and filtered signals. The records are in WFDB format. Each record consists of two files, a header file (.hea) containing information regarding the record and the data file (.dat) containing signal data.

An accompanying file (tpehgdb.smr) summarizes clinical information of each record, describing whether the corresponding pregnancy ended on term (> 37 weeks) or prematurely (≤ 37 weeks), and whether the record was obtained before the 26th week of gestation or during or after the 26th week of gestation.  
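
The two groupings described above can be written as a small helper. This is a sketch only: the function name and argument names are hypothetical, and the thresholds are taken directly from the description (term if delivery is after 37 weeks, early if recorded before the 26th week).

```python
def label_record(gestation_weeks, recording_week):
    """Return (delivery_group, recording_group) using the thresholds described above."""
    # Delivery after 37 weeks counts as term, otherwise preterm
    delivery_group = "term" if gestation_weeks > 37 else "preterm"
    # Recordings before the 26th week count as early, otherwise late
    recording_group = "early" if recording_week < 26 else "late"
    return delivery_group, recording_group

print(label_record(39.0, 22.5))
print(label_record(35.0, 31.3))
```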

Visualize the waveform: https://physionet.org/lightwave/?db=tpehgdb/1.0.1  

## Virtual Environment 
My virtual environment is named ehg_preterm.  
Best practice is to activate the virtual environment from the command line (Git Bash example below).  
Navigate to the project directory, then run:  
 source ./ehg_preterm/Scripts/activate  
 deactivate  

## Requirements
jupyter  
pandas  
numpy  
scikit-learn  
matplotlib  
seaborn  
wfdb #for waveform data  
tensorflow  
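glob (part of the Python standard library, no installation needed)  
ripser  
persim #for plotting persistence diagrams  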

### ripser requirements for Windows
- MinGW-w64 for Windows https://winlibs.com/ (also make sure to update system Path)
- Microsoft C++ Build Tools "Desktop development with C++" workload https://visualstudio.microsoft.com/visual-cpp-build-tools/
- install cython before installing ripser


### wfdb readthedocs
https://wfdb.readthedocs.io/en/latest/wfdb.html

### smr filetype
This file type is proprietary to Spike2 by Cambridge Electronic Design. I had no luck parsing it with Python, so I opened it with https://filext.com/file-extension/SMR, pasted the contents into a text file, and converted that to CSV. This manual approach would not scale to larger files.  
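
Once the contents are available as pipe-delimited text, they can be parsed with the standard library alone. The sketch below uses invented rows and assumes a padded, pipe-separated layout; the real summary file may differ in columns and values.

```python
import csv
from io import StringIO

# Made-up rows in a pipe-delimited layout; the real summary file may differ.
sample_text = """Record    | Gestation | Rec. time | Group | Premature | Early
record_a  | 39.0      | 22.5      | A     | f         | t
record_b  | 35.0      | 31.3      | B     | t         | f
"""

# Split on '|' and strip the padding around every cell
reader = csv.reader(StringIO(sample_text), delimiter="|")
rows = [[cell.strip() for cell in row] for row in reader if row]

# First row is the header, the rest are records
header, records = rows[0], rows[1:]
table = [dict(zip(header, record)) for record in records]

for entry in table:
    print(entry)
```

Stripping the padding matters because later steps merge tables on the record name, and a trailing space would prevent a match.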

## Import packages

In [None]:
# Standard library
import os
import glob
from io import StringIO

# Third-party packages
import numpy as np
import pandas as pd
import sklearn
import seaborn
import matplotlib.pyplot as plt
import wfdb
from ripser import ripser, Rips  # ripser() function and the Rips class
from persim import plot_diagrams

In [None]:
print("testing the kernel")

# Inspect and Import Data 
.dat and .hea files from database tpehgdb

## Visualize one patient record

Thank you to ProtoBioengineering for the tutorial: https://medium.com/@protobioengineering/how-to-get-heart-data-from-the-mit-bih-arrhythmia-database-e452d4bf7215 

In [None]:
patient_record = wfdb.rdrecord("tpehgdb/tpehg1007")
wfdb.plot_wfdb(patient_record) # Visualization of the 12 signals (3 channels, each unfiltered plus 3 filtered versions)
print(patient_record.__dict__) # Dictionary of metadata

In [None]:
# Per wfdb dot notation, metadata can be accessed via patient_record.XYZ
patient_number = patient_record.record_name
leads = patient_record.sig_name # Names of leads and filtered leads 
comments = patient_record.comments
sig_len = patient_record.sig_len # Number of samples

In [None]:
print(patient_number)
print(leads)
print(comments)
print(sig_len)

## Extract data to csv
Loop over the records and write each patient's signals to its own csv.  
Thank you to Abhishek Patil: https://github.com/abhilampard/Physionet-CSV-Conversion/blob/master/script.py

In [None]:
# Check current working directory
os.getcwd()

#Change working directory if necessary
os.chdir('/home/katie_grillaert/ehg_preterm/tpehgdb')

In [None]:
# Convert all .dat files (EHG signals) in the current folder to CSV format
# Adapted from a script by Abhishek Patil
# Each CSV has 12 columns, one per signal, and no header row.

output_directory = "/home/katie_grillaert/ehg_preterm"

dat_files = sorted(glob.glob('*.dat'))  # List of all .dat files in the current folder

# Keep a list of the .dat files for reference
pd.DataFrame(data=dat_files).to_csv(os.path.join(output_directory, "files_list.csv"), index=False, header=None)

for i, dat_file in enumerate(dat_files, start=1):
    # The record name is the file name without the .dat extension
    recordname = os.path.splitext(dat_file)[0]

    try:
        # rdsamp() returns a tuple: (signals as a numpy array, metadata dictionary)
        signals, fields = wfdb.rdsamp(recordname)
    except FileNotFoundError:
        # Skip records whose header or data file is missing
        print(f"File not found: {recordname}")
        continue

    # Write one CSV per record, next to the .dat file
    np.savetxt(recordname + ".csv", signals, delimiter=",")
    print(f"Files done: {i}/{len(dat_files)}")

print("\nAll files done!")



### Test: Look at one csv example

In [None]:
df = pd.read_csv("/home/katie_grillaert/ehg_preterm/tpehgdb/tpehg1021.csv", header=None)

In [None]:
df

# Analyse persistent homology

## Filtration information

maxdim (int, optional, default 1) – Maximum homology dimension computed. Will compute all dimensions lower than and equal to this value. For 1, H_0 and H_1 will be computed.  
thresh (float, default infinity) – Maximum distances considered when constructing filtration. If infinity, compute the entire filtration.  
coeff (int prime, default 2) – Compute homology with coefficients in the prime field Z/pZ for p=coeff.  
do_cocycles (bool) – Indicator of whether to compute cocycles, if so, we compute and store cocycles in the cocycles_ dictionary Rips member variable  
n_perm (int) – The number of points to subsample in a “greedy permutation,” or a furthest point sampling of the points. These points will be used in lieu of the full point cloud for a faster computation, at the expense of some accuracy, which can be bounded as a maximum bottleneck distance to all diagrams on the original point set  
verbose (boolean) – Whether to print out information about this object as it is constructed  

https://ripser.scikit-tda.org/en/latest/reference/stubs/ripser.Rips.html   
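
To make the n_perm description more concrete, here is a minimal sketch of greedy furthest point sampling in plain NumPy. It is not ripser's internal implementation; the function name, the starting index and the synthetic point cloud are assumptions made for illustration.

```python
import numpy as np

def furthest_point_sample(points, n_samples, start_index=0):
    """Greedy furthest-point sampling: repeatedly pick the point furthest from those already chosen."""
    points = np.asarray(points, dtype=float)
    chosen = [start_index]
    # Distance from every point to its nearest chosen point so far
    min_dist = np.linalg.norm(points - points[start_index], axis=1)
    for _ in range(n_samples - 1):
        # The next landmark is the point that is worst covered by the current selection
        next_index = int(np.argmax(min_dist))
        chosen.append(next_index)
        dist_to_new = np.linalg.norm(points - points[next_index], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
    return np.array(chosen)

rng = np.random.default_rng(0)
cloud = rng.normal(size=(500, 3))  # synthetic stand-in for a 3-channel record
idx = furthest_point_sample(cloud, 50)
subsample = cloud[idx]
print(subsample.shape)
```

Compared with a uniform random sample, this kind of selection spreads the chosen points across the cloud, which is why it may be worth comparing against the 1% random sample used below.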


## Explore one observation

### Visualize persistence diagram 

In [None]:
# So much data... will need to research the proper way to filter and create a sparse matrix, and probably sample on top of that. For now, a simple random sample.
# Note: this sample uses all 12 columns; restricting to the unfiltered sensors still needs to be done here.

sample_fraction = 0.01
df_sample = df.sample(frac=sample_fraction, random_state=42) 
print("sample prepared")

# ripser expects a numeric array, so pass the sampled values rather than the DataFrame
result = ripser(df_sample.to_numpy(), maxdim=2, coeff=2)
print("result calculated")

# Plot the persistent homology diagram
plt.figure(figsize=(10, 5))
diagrams = result['dgms']
plot_diagrams(diagrams, show=True)

In [None]:
df.columns = df.columns.astype(str)

new_column_names = {'0': 's1_unfilt', 
                    '1': 's1_filter_a', 
                    '2': 's1_filter_b',
                    '3': 's1_filter_c',
                    '4': 's2_unfilt', 
                    '5': 's2_filter_a', 
                    '6': 's2_filter_b',
                    '7': 's2_filter_c',
                    '8': 's3_unfilt', 
                    '9': 's3_filter_a', 
                    '10': 's3_filter_b',
                    '11': 's3_filter_c'}

df.rename(columns=new_column_names, inplace=True)


df

### Examine dictionary of results

In [None]:
# Print keys
print("Keys:", list(result.keys()))

# Print values
#print("Values:", list(result.values()))

# Print key-value pairs (items)
#print("Items:", list(result.items()))

#dgms_value = result['dgms']
#print("Contents of 'dgms':", dgms_value)


### Extract Betti numbers

In [None]:
if 'dgms' in result:
    persistence_diagrams = result['dgms']

    # Extract Betti-0 numbers
    betti_0_numbers = sum(np.isfinite(persistence_diagrams[0][:, 1]))
    print("Betti-0 Numbers:", betti_0_numbers)

    # Extract Betti-1 numbers
    betti_1_numbers = sum(np.isfinite(persistence_diagrams[1][:, 1]))
    print("Betti-1 Numbers:", betti_1_numbers)

    # Extract Betti-2 numbers
    betti_2_numbers = sum(np.isfinite(persistence_diagrams[2][:, 1]))
    print("Betti-2 Numbers:", betti_2_numbers)

else:
    print("The 'dgms' key is not present in the dictionary.")
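
The counts above tally the features with a finite death time in each dimension. A Betti number in the usual sense is tied to a particular filtration scale: it is the number of features alive at that scale. The sketch below shows that alternative count on a made-up H_0 diagram; the function name and the toy values are assumptions for illustration, not results from the EHG data.

```python
import numpy as np

def betti_at_scale(diagram, scale):
    """Count intervals [birth, death) in a persistence diagram that contain `scale`."""
    diagram = np.asarray(diagram, dtype=float).reshape(-1, 2)
    births, deaths = diagram[:, 0], diagram[:, 1]
    # A feature is alive at `scale` if it was born at or before it and has not died yet
    return int(np.sum((births <= scale) & (scale < deaths)))

# Toy H_0 diagram (made-up values): three components, one of which never dies
toy_h0 = np.array([[0.0, 0.4], [0.0, 1.2], [0.0, np.inf]])

print(betti_at_scale(toy_h0, 0.2))   # all three components are still separate
print(betti_at_scale(toy_h0, 1.0))   # one component has merged
print(betti_at_scale(toy_h0, 2.0))   # only the infinite component remains

# For comparison, the finite-death count used in this notebook
print(int(np.sum(np.isfinite(toy_h0[:, 1]))))
```

Either summary can serve as a feature, but the two answer different questions, so it helps to name the columns accordingly.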


### View patient record

In [None]:
# View result dictionary for patient record
print(result)

In [None]:
file_path = r'tpehg914'

# Read the WFDB record
record_inspect = wfdb.rdrecord(file_path)

# Print the record object
print(record_inspect)

# Access the raw signal data
raw_data = record_inspect.p_signal
print(raw_data)

## Compile filtration results for all observations - unfiltered sensors

In [None]:
# Loop to compile results list for multiple records

# Set the folder path where CSV files are located
folder_path = r'/home/katie_grillaert/ehg_preterm/tpehgdb'

# Set the sample fraction
sample_fraction = 0.01

# Initialize a counter variable
counter = 0

# Get a list of all CSV files in the folder
csv_files = glob.glob(folder_path + '/*.csv')

# Dictionary to store results with filenames as keys
all_results_unfilt = {}

# Loop through each CSV file
for csv_file in csv_files:

    # Increment the counter
    counter += 1

    # Print the file number and total number of files
    print(f"Processing file {counter} of {len(csv_files)}")
    
    # Extract the filename without the path
    filename = os.path.basename(csv_file)
   
    # Extract the desired part of the filename, e.g., without the extension
    key = filename.replace('.csv', '')

    # Read CSV file into a DataFrame
    df_filename = pd.read_csv(csv_file, header=None)
    
    # Select only numeric columns
    numeric_columns = df_filename.select_dtypes(include=['number'])

    # Check if there is at least one numeric column
    if not numeric_columns.empty:

        # Rename cols
        numeric_columns.columns = numeric_columns.columns.astype(str)
                
        new_column_names = {'0': 's1_unfilt', 
                         '1': 's1_filter_a', 
                         '2': 's1_filter_b',
                         '3': 's1_filter_c',
                         '4': 's2_unfilt', 
                         '5': 's2_filter_a', 
                         '6': 's2_filter_b',
                         '7': 's2_filter_c',
                         '8': 's3_unfilt', 
                         '9': 's3_filter_a', 
                         '10': 's3_filter_b',
                         '11': 's3_filter_c'}

        numeric_columns.rename(columns=new_column_names, inplace=True)

        # Select only columns for sensor groups
        unfilt_group = ['s1_unfilt', 's2_unfilt', 's3_unfilt']
        #filt_a_group = ['s1_filter_a', 's2_filter_a', 's3_filter_a']
        #filt_b_group = ['s1_filter_b', 's2_filter_b', 's3_filter_b']
        #filt_c_group = ['s1_filter_c', 's2_filter_c', 's3_filter_c']

        # Create df subsets
        unfilt_df = numeric_columns[unfilt_group]
        #filt_a_df = numeric_columns[filt_a_group]
        #filt_b_df = numeric_columns[filt_b_group]
        #filt_c_df = numeric_columns[filt_c_group]
 
        # Take a sample of the DataFrame
        df_filename_sample = unfilt_df.sample(frac=sample_fraction, random_state=42)

        # Print debugging information
        #print(f"Sampled DataFrame: {df_filename_sample}")

        # Perform persistent homology analysis
        result_filename = ripser(df_filename_sample, maxdim=2, coeff=2, metric='euclidean')

        # Add the 'filename' key to the result dictionary
        result_filename['filename'] = key

        # Store the result in the dictionary with the modified filename as the key
        all_results_unfilt[key] = result_filename
        
    else:
        print("No numeric columns found in the DataFrame.")


# Now, all_results_unfilt contains the results for each patient with filenames as keys
print(all_results_unfilt)


### Inspect dictionary from all files

In [None]:
for key in all_results_unfilt:
    print(key)

#print("Keys:", list(all_results.keys()))
#dgms_for_specific_file = all_results['tpehg1484']['dgms']
#dgms_for_specific_file

## Extract all Betti numbers and filenames to csv

In [None]:
# Create lists to store betti numbers and filenames
betti_numbers_list = []
filenames_list = []

for filename, result in all_results_unfilt.items():
    # Check if 'dgms' key is present
    if 'dgms' in result:
        persistence_diagrams = result['dgms']

        # Extract Betti-0 numbers
        betti_0_numbers = sum(np.isfinite(persistence_diagrams[0][:, 1]))

        # Extract Betti-1 numbers
        betti_1_numbers = sum(np.isfinite(persistence_diagrams[1][:, 1]))

        # Extract Betti-2 numbers
        betti_2_numbers = sum(np.isfinite(persistence_diagrams[2][:, 1]))

        # Extract the filename without the extension
        filename = result.get('filename', 'unknown')

        # Append betti numbers and filename to the lists
        betti_numbers_list.append([betti_0_numbers, betti_1_numbers, betti_2_numbers])
        filenames_list.append(filename)

    else:
        print("The 'dgms' key is not present in the dictionary.")
        
# Create a DataFrame from the lists
df_betti_numbers = pd.DataFrame(betti_numbers_list, columns=['Betti-0', 'Betti-1', 'Betti-2'])

# Add the 'filename' column to the DataFrame
df_betti_numbers['filename'] = filenames_list

# Save the DataFrame to a CSV file
df_betti_numbers.to_csv('/home/katie_grillaert/ehg_preterm/betti_numbers_unfilt.csv', index=False)

In [None]:
betti_numbers_df = pd.read_csv('/home/katie_grillaert/ehg_preterm/betti_numbers_unfilt.csv')
betti_numbers_df

In [None]:
# Repeat for filtered sensor groups
# Loop to compile results list for multiple records

#subset = [unfilt_df, filt_a_df, filt_b_df, filt_c_df]

for i in range(4):
    
    subset = f"Subset_{i}"

    # Set the folder path where CSV files are located
    folder_path = r'/home/katie_grillaert/ehg_preterm/tpehgdb'

    # Set the sample fraction
    sample_fraction = 0.01

    # Initialize a counter variable
    counter = 0
    
    # Dictionary to store results with filenames as keys
    all_results = {}
    
    # Get a list of all CSV files in the folder
    csv_files = glob.glob(folder_path + '/*.csv')

    # Loop through each CSV file
    for csv_file in csv_files:

        # Increment the counter
        counter += 1
    
        # Extract the filename without the path
        filename = os.path.basename(csv_file)
   
        # Extract the desired part of the filename, e.g., without the extension
        key = filename.replace('.csv', '')

        # Read CSV file into a DataFrame
        df_filename = pd.read_csv(csv_file, header=None)

        # Print the file number and total number of files
        print(f"Processing file {counter} of {len(csv_files)}", filename)    
    
        # Select only numeric columns
        numeric_columns = df_filename.select_dtypes(include=['number'])

        # Check if there is at least one numeric column
        if not numeric_columns.empty:

            # Rename cols
            numeric_columns.columns = numeric_columns.columns.astype(str)
                
            new_column_names = {'0': 's1_unfilt', 
                             '1': 's1_filter_a', 
                             '2': 's1_filter_b',
                             '3': 's1_filter_c',
                             '4': 's2_unfilt', 
                             '5': 's2_filter_a', 
                             '6': 's2_filter_b',
                             '7': 's2_filter_c',
                             '8': 's3_unfilt', 
                             '9': 's3_filter_a', 
                             '10': 's3_filter_b',
                             '11': 's3_filter_c'}

            numeric_columns.rename(columns=new_column_names, inplace=True)

            # Select only columns for sensor groups
            unfilt_group = ['s1_unfilt', 's2_unfilt', 's3_unfilt']
            filt_a_group = ['s1_filter_a', 's2_filter_a', 's3_filter_a']
            filt_b_group = ['s1_filter_b', 's2_filter_b', 's3_filter_b']
            filt_c_group = ['s1_filter_c', 's2_filter_c', 's3_filter_c']

            # Create df subsets
            unfilt_df = numeric_columns[unfilt_group]
            filt_a_df = numeric_columns[filt_a_group]
            filt_b_df = numeric_columns[filt_b_group]
            filt_c_df = numeric_columns[filt_c_group]

            # Choose the appropriate subset dataframe based on the current iteration
            if subset == "Subset_0":
                subset_df = unfilt_df
            elif subset == "Subset_1":
                subset_df = filt_a_df
            elif subset == "Subset_2":
                subset_df = filt_b_df
            elif subset == "Subset_3":
                subset_df = filt_c_df
            else:
                raise ValueError(f"Unknown subset: {subset}")
                
            # Take a sample of the DataFrame
            df_filename_sample = subset_df.sample(frac=sample_fraction, random_state=42)

            # Print debugging information
            #print(f"Sampled DataFrame: {df_filename_sample}")

            # Perform persistent homology analysis
            result_filename = ripser(df_filename_sample, maxdim=2, coeff=2, metric='euclidean')
    
            # Add the 'filename' key to the result dictionary
            result_filename['filename'] = key

            # Store the result in the dictionary with the modified filename as the key
            all_results[key] = result_filename
        
        else:
            print("No numeric columns found in the DataFrame.")
    
    # Now, all_results contains the results for each patient with filenames as keys
    print("Subset", i, "completed")

    # Create lists to store betti numbers and filenames
    betti_numbers_list = []
    filenames_list = []

    for filename, result in all_results.items():
        # Check if 'dgms' key is present
        if 'dgms' in result:
            persistence_diagrams = result['dgms']

            # Extract Betti-0 numbers
            betti_0_numbers = sum(np.isfinite(persistence_diagrams[0][:, 1]))

            # Extract Betti-1 numbers
            betti_1_numbers = sum(np.isfinite(persistence_diagrams[1][:, 1]))

            # Extract Betti-2 numbers
            betti_2_numbers = sum(np.isfinite(persistence_diagrams[2][:, 1]))

            # Extract the filename without the extension
            filename = result.get('filename', 'unknown')

            # Append betti numbers and filename to the lists
            betti_numbers_list.append([betti_0_numbers, betti_1_numbers, betti_2_numbers])
            filenames_list.append(filename)

        else:
            print("The 'dgms' key is not present in the dictionary.")
        
    # Create a DataFrame from the lists
    # Use one naming pattern for all three columns
    df_betti_numbers = pd.DataFrame(betti_numbers_list, columns=[f'Betti_0_{subset}', f'Betti_1_{subset}', f'Betti_2_{subset}'])


    # Add the 'filename' column to the DataFrame
    df_betti_numbers['filename'] = filenames_list

    # Generate the output CSV filename dynamically based on the current subset
    output_csv_filename = f'/home/katie_grillaert/ehg_preterm/betti_numbers_{subset}.csv'

    # Save the DataFrame to the CSV file
    df_betti_numbers.to_csv(output_csv_filename, index=False)

    print(f"Results saved to: {output_csv_filename}")

## Examine Header Files

In [None]:
hea_file_path = '/home/katie_grillaert/ehg_preterm/tpehgdb/tpehg1484'

# Load the HEA file
hea_record = wfdb.rdheader(hea_file_path)

# Display general information about the record
print(f"Record Name: {hea_record.record_name}")
print(f"Number of Channels: {hea_record.n_sig}")
print(f"Sampling Frequency: {hea_record.fs} Hz")
print(f"Signal Labels: {hea_record.sig_name}")
print(f"Comments: {hea_record.comments}")

# Display detailed information about each signal/channel
#for i, signal in enumerate(hea_record.sig_name):
 #   print(f"\nSignal {i + 1} - {signal}")
  #  print(f"Label: {hea_record.sig_name[i]}")
   # print(f"Units: {hea_record.units[i]}")
   # print(f"ADC Gain: {hea_record.adc_gain[i]}")
   # print(f"ADC Resolution: {hea_record.adc_res[i]}")
   # print(f"Baseline: {hea_record.baseline[i]}")



In [None]:
# Load the HEA file
hea_file_path = '/home/katie_grillaert/ehg_preterm/tpehgdb/tpehg1484'
hea_record = wfdb.rdheader(hea_file_path)

# Extract comments from the record
comments = hea_record.comments

# Create a DataFrame with comments as columns
df = pd.DataFrame({'Comment{}'.format(i + 1): [comment] for i, comment in enumerate(comments)})

# Specify the desired CSV file path
csv_file_path = '/home/katie_grillaert/ehg_preterm/tpehg1484_comments.csv'

# Save DataFrame to CSV
df.to_csv(csv_file_path, index=False)

print(f"Comments have been saved to {csv_file_path}")



In [None]:
df_header = pd.read_csv("/home/katie_grillaert/ehg_preterm/tpehg1484_comments.csv")
df_header

In [None]:
import os
import glob
import pandas as pd
import wfdb

# Directory path containing HEA files
hea_directory = '/home/katie_grillaert/ehg_preterm/tpehgdb/'

# Use glob to get all HEA files in the directory
hea_files = glob.glob(os.path.join(hea_directory, '*.hea'))

# Initialize an empty list to store rows
rows = []

# Process each HEA file
for hea_file_path in hea_files:
    # Record path without the .hea extension, as expected by wfdb.rdheader
    record_path = os.path.splitext(hea_file_path)[0]
    file_name = os.path.basename(record_path)
    print(file_name)

    # Load the header using the full path so this does not depend on the working directory
    hea_record = wfdb.rdheader(record_path)

    # Extract comments from the record
    comments = hea_record.comments

    # Create a dictionary representing a row
    row_dict = {'filename': file_name}
    for idx, comment in enumerate(comments):
        row_dict[f'Comment{idx + 1}'] = comment

    # Append the row dictionary to the list
    rows.append(row_dict)

# Create a DataFrame from the list of rows
df_all_comments = pd.DataFrame(rows)

# Specify the desired CSV file path
csv_file_path = '/home/katie_grillaert/ehg_preterm/all_comments.csv'

# Save the main DataFrame to CSV
df_all_comments.to_csv(csv_file_path, index=False)

print(f"Comments for all HEA files have been saved to {csv_file_path}")


In [None]:
all_comments_df = pd.read_csv("/home/katie_grillaert/ehg_preterm/all_comments.csv")
all_comments_df

## Read in metadata from smr file

In [None]:
file_path = '/home/katie_grillaert/ehg_preterm/smr.txt'

# Read the pipe-delimited text export into a DataFrame
smr_df = pd.read_csv(file_path, sep='|', skipinitialspace=True, skipfooter=1, engine='python', header=0)

# Strip padding from column names and string values so that merges on 'filename' match
smr_df.columns = smr_df.columns.str.strip()
smr_df = smr_df.apply(lambda col: col.map(lambda x: x.strip() if isinstance(x, str) else x))

# After stripping, only these two names differ from the names used later
smr_df.rename(columns={'Record': 'filename', 'Rec. time': 'Rec_time'}, inplace=True)

smr_df.to_csv('/home/katie_grillaert/ehg_preterm/smr.csv', index=False)
smr_df

In [None]:
#smr_df.columns

# Remove leading and trailing spaces from all entries in the DataFrame
#smr_df = smr_df.map(lambda x: x.strip() if isinstance(x, str) else x)

# Display the updated DataFrame
#smr_df
#smr_df.to_csv('/home/katie_grillaert/ehg_preterm/smr.csv', index=False)


## Merge Betti numbers, smr metadata, and Comments on filename 

In [None]:
betti_0 = pd.read_csv("/home/katie_grillaert/ehg_preterm/betti_numbers_Subset_0.csv")
betti_1 = pd.read_csv("/home/katie_grillaert/ehg_preterm/betti_numbers_Subset_1.csv")
betti_2 = pd.read_csv("/home/katie_grillaert/ehg_preterm/betti_numbers_Subset_2.csv")
#betti_3 = pd.read_csv("/home/katie_grillaert/ehg_preterm/betti_numbers_Subset_3.csv")
smr = pd.read_csv("/home/katie_grillaert/ehg_preterm/smr.csv")
comments = pd.read_csv("/home/katie_grillaert/ehg_preterm/all_comments.csv")

# Merge betti DataFrames one by one
merged_df = pd.merge(betti_0, betti_1, on='filename', how='inner')
merged_df = pd.merge(merged_df, betti_2, on='filename', how='inner')
#merged_df = pd.merge(merged_df, betti_3, on='filename', how='inner')

# Merge with smr DataFrame
merged_df = pd.merge(merged_df, smr, on='filename', how='inner')

# Merge with comments DataFrame
merged_df = pd.merge(merged_df, comments, on='filename', how='inner')

ehgdf = merged_df.copy()
ehgdf

In [None]:
ehgdf.to_csv('ehgdf.csv', index=False)

## Rename columns

In [None]:
# Dictionary to map old column names to new column names
#cols_to_drop=["Comment1", "Comment2"]
#ehgdf.drop(columns=cols_to_drop, inplace=True)

#column_order = ["filename"] + [col for col in ehgdf.columns if col != target_column]
#ehgdf = ehgdf[column_order]


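# Gestation and Rec_time already exist from the smr file, so the header-derived
# copies get a suffix to avoid duplicate column names.
column_mapping = {
    'Comment3': 'Gestation_hea',
    'Comment4': 'Rec_time_hea',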
    'Comment5': 'Age',
    'Comment6': 'Parity',
    'Comment7': 'Abortions',
    'Comment8': 'Weight',
    'Comment9': 'Hypertension',
    'Comment10': 'Diabetes',
    'Comment11': 'Placental_pos',
    'Comment12': 'Bleeding_first_tri',
    'Comment13': 'Bleeding_second_tri',
    'Comment14': 'Funneling',
    'Comment15': 'Smoker',
}

# Rename columns using the rename method
ehgdf.rename(columns=column_mapping, inplace=True)

# Display the DataFrame with renamed columns
ehgdf

## Check column datatypes

In [None]:
for column in ehgdf.columns:
    data_type = ehgdf[column].dtype
    print(f"Column '{column}' has datatype: {data_type}")

In [None]:
columns = ehgdf.columns

for column in columns:
    unique_entries = ehgdf[column].unique()
    print("Unique entries in", column)
    print(unique_entries)

    value_counts = ehgdf[column].value_counts()
    print("Count of each unique entry in", column)
    print(value_counts)
    print()

## Change datatypes

In [None]:
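# Floats
# The header comments appear to be stored as "Label value" strings (e.g. "Age 30"),
# so keep the last whitespace-separated token before converting. This format is an
# assumption; check it against the unique values printed above.
float_cols = ['Gestation', 'Rec_time', 'Age', 'Parity', 'Abortions', 'Weight']
for col in float_cols:
    values = ehgdf[col].astype(str).str.split().str[-1]
    # Entries that are not numbers (such as "None") become NaN
    ehgdf[col] = pd.to_numeric(values, errors='coerce')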

#Booleans
#ehgdf['Smoker'] = ehgdf['Smoker'].map({'Yes': True, 'No': False}).astype(bool)


## Visualize data

In [None]:
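# ProfileReport is assumed to come from the ydata-profiling package (formerly pandas-profiling)
from ydata_profiling import ProfileReport

# Generate a profile report
profile = ProfileReport(ehgdf, title='EHG Data Profiling Report', explorative=True)

# Save the report to an HTML file
profile.to_file("ehg_output_report.html")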

# Next Steps

Run Loop on Filtrations

Data cleaning and visualization
- assign datatypes
- clean up cells

ML Pipeline
- train/test split
- select models to train on data
- cross-validation
- hyperparameter tuning
- test

To Research
- use filtered data?
- how to calc sparse matrix
- is it ok to sample 1%?
- what coeff?
  