# **Data Cleaning**

## Objectives

*   Evaluate missing data
*   Clean data

## Inputs

* outputs/datasets/collection/FertilityTreatmentData.csv.gz

## Outputs

* Generate cleaned Train and Test sets, both saved under outputs/datasets/cleaned

## Conclusions


####  Data Cleaning Pipeline

* Filter data to keep only entries with "Treatment - IVF" using:
    - filter_ivf

* Drop rows with 'Live birth occurrence' value 1 and 'Embryos transferred' 0 using:
    - drop_erroneous

* Drop Columns using:
    - drop_columns
  ```
  ['Total number of previous DI cycles',
  'Main reason for producing embroys storing eggs',
  'Type of treatment - IVF or DI',
  'Donated embryo',
  'Eggs thawed (0/1)',
  'Year of treatment',
  'Number of live births',
  'Embryos stored for use by patient',
  'Fresh eggs stored (0/1)',
  'Heart three birth congenital abnormalities',
  'Heart two birth congenital abnormalities',
  'Heart three delivery date',
  'Heart three sex',
  'Heart three birth weight',
  'Heart three weeks gestation',
  'Heart three birth outcome',
  'Heart one birth congenital abnormalities',
  'Heart two birth weight',
  'Heart two delivery date',
  'Heart two sex',
  'Heart two weeks gestation',
  'Heart two birth outcome',
  'Heart one birth weight',
  'Heart one weeks gestation',
  'Heart one delivery date',
  'Heart one sex',
  'Heart one birth outcome',
  'Number of foetal sacs with fetal pulsation',
  'Early outcome',
  'Partner ethnicity',
  'Partner Type']
  ```

* Standardize the data type of "Total number of previous pregnancies - IVF and DI" and "Total number of previous live births - IVF or DI" and impute 0 for missing values using the following transformers:
    - convert_to_numeric
    - zeros
    - convert_to_int

* Impute "Sperm source" using:
    - fill_sperm_source

* Process and clean data on "Date of embryo transfer" using the following imputers:
    - dot_to_int_999
    - replace_missing_values
    - append_cycle_type

* Clean column 'Embryos transferred from eggs micro-injected' using:
    - micro_injected

* Impute donor age using:
    - donor_age

* Append 'e' to '1' in the 'Embryos transferred' column when a single embryo transfer was elective using:
    - e_flagging

* Annotate the value 0 in relevant columns using:
    - type_of_cycle

* Convert columns with data type float to data type integer using:
    - float_to_int

* Drop rows with placeholder values '999' using:
    - drop_999

* Drop remaining rows containing missing values using:
    - drop_missing_values
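
One way the steps above might be chained is with a scikit-learn `Pipeline`. The sketch below is a minimal illustration, not the notebook's final pipeline: it wires only the first two steps (`filter_ivf` and `drop_erroneous`), re-declares simplified versions of their transformers so the block runs on its own, and uses a small made-up DataFrame (`toy`, hypothetical values) instead of the real dataset.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

REASON_COL = 'Main reason for producing embroys storing eggs'


class FilterIVFTreatments(BaseEstimator, TransformerMixin):
    """Keep only rows recorded as 'Treatment - IVF'."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[X[REASON_COL] == 'Treatment - IVF']


class DropErroneousEntries(BaseEstimator, TransformerMixin):
    """Drop rows with a live birth but zero embryos transferred."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        erroneous = (X['Live birth occurrence'] == 1) & (X['Embryos transferred'] == 0)
        return X[~erroneous]


# Made-up rows, for illustration only
toy = pd.DataFrame({
    REASON_COL: ['Treatment - IVF', 'Treatment - IVF', 'Treatment - DI', 'Treatment - IVF'],
    'Live birth occurrence': [1, 1, 0, 0],
    'Embryos transferred': [2, 0, 1, 0],
})

# Each step is a (name, transformer) pair, applied in order
cleaning_pipeline = Pipeline([
    ('filter_ivf', FilterIVFTreatments()),
    ('drop_erroneous', DropErroneousEntries()),
])

toy_cleaned = cleaning_pipeline.fit_transform(toy)
print(toy_cleaned)
```

The remaining steps would be appended to the same list of `(name, transformer)` pairs, in the order listed above.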

---

# Change working directory

Change the working directory from its current folder to its parent folder
* Access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

To make the parent of the current directory the new current directory:
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("A new current directory has been set")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir

---

# Load data

In [None]:
import pandas as pd
# Read the DataFrame from the compressed CSV file
df = pd.read_csv('outputs/datasets/collection/FertilityTreatmentData.csv.gz')
df.head(3)

# Data Exploration

Explore dataset

In [None]:
print("Number of empty entries, followed by the unique values and data type of each column:\n")

for column in df.columns:
    # Count the empty fields in the column
    empty_fields_count = df[column].isnull().sum()
    # Collect the unique values in the column
    unique_values = df[column].unique()
    # Get the data type of the column
    data_type = df[column].dtype

    print(f"- {column}: {empty_fields_count}, {unique_values}, {data_type}\n")

In [None]:
from ydata_profiling import ProfileReport
pandas_report = ProfileReport(df=df, minimal=True)
pandas_report.to_notebook_iframe()

Check the distribution and shape of a variable with missing data.

In [None]:
vars_with_missing_data = df.columns[df.isna().sum() > 0].to_list()
vars_with_missing_data

In [None]:
from ydata_profiling import ProfileReport
if vars_with_missing_data:
    profile = ProfileReport(df=df[vars_with_missing_data], minimal=True)
    profile.to_notebook_iframe()
else:
    print("There are no variables with missing data")


# Data Cleaning

## Assessing Missing Data Levels

* Custom function to display missing data levels in a DataFrame. For each variable with missing data, it shows the absolute level, the relative level and the data type.

In [None]:
def EvaluateMissingData(df):
    missing_data_absolute = df.isnull().sum()
    missing_data_percentage = round(missing_data_absolute/len(df)*100, 2)
    df_missing_data = (pd.DataFrame(
                            data={"RowsWithMissingData": missing_data_absolute,
                                   "PercentageOfDataset": missing_data_percentage,
                                   "DataType": df.dtypes}
                                    )
                          .sort_values(by=['PercentageOfDataset'], ascending=False)
                          .query("PercentageOfDataset > 0")
                          )

    return df_missing_data


Check missing data levels for the collected dataset.

In [None]:
EvaluateMissingData(df)

## Dealing with Missing Data

### Split Train and Test Set

In [None]:
from sklearn.model_selection import train_test_split

TrainSet, TestSet, _, __ = train_test_split(
                                        df,
                                        df['Live birth occurrence'],
                                        test_size=0.2,
                                        random_state=0)

print(f"TrainSet shape: {TrainSet.shape} \nTestSet shape: {TestSet.shape}")

In [None]:
df_missing_data = EvaluateMissingData(TrainSet)
print(f"* There are {df_missing_data.shape[0]} variables with missing data \n")
df_missing_data

## Data Cleaning

### Data Cleaning Summary

* Filter the dataset to include only IVF treatments by keeping rows where 'Main reason for producing embroys storing eggs' is "Treatment - IVF".

* Drop likely erroneous entries. Remove rows where 'Live birth occurrence' has value 1 and 'Embryos transferred' has value 0.

* Drop columns that have missing data and don't add relevant information for the analysis:
    - 'Total number of previous DI cycles',
    - 'Main reason for producing embroys storing eggs' (after filtering the df for 'Treatment - IVF')
    - 'Type of treatment - IVF or DI' (will have only IVF values after filtering the df)
    - 'Donated embryo',
    - 'Eggs thawed (0/1)',
    - 'Year of treatment',
    - 'Number of live births',
    - 'Embryos stored for use by patient',
    - 'Fresh eggs stored (0/1)',
    - 'Heart three birth congenital abnormalities',
    - 'Heart two birth congenital abnormalities',
    - 'Heart three delivery date',
    - 'Heart three sex',
    - 'Heart three birth weight',
    - 'Heart three weeks gestation',
    - 'Heart three birth outcome',
    - 'Heart one birth congenital abnormalities',
    - 'Heart two birth weight',
    - 'Heart two delivery date',
    - 'Heart two sex',
    - 'Heart two weeks gestation',
    - 'Heart two birth outcome',
    - 'Heart one birth weight',
    - 'Heart one weeks gestation',
    - 'Heart one delivery date',
    - 'Heart one sex',
    - 'Heart one birth outcome',
    - 'Number of foetal sacs with fetal pulsation',
    - 'Early outcome',
    - 'Partner ethnicity'
    - 'Partner Type'

* The "Total number of previous pregnancies - IVF and DI" and "Total number of previous live births - IVF or DI" columns need their data type standardized, and missing values should be imputed with "0".

* 'Sperm source' missing entries should be filled up with 'Donor' if there is a 'Sperm donor age at registration', otherwise, fill up with 'Partner'.

* Process and clean the data in the "Date of embryo transfer" column: convert float values to integers and handle NaNs. Replace the value 999 with 0, as these entries represent frozen cycles. Replace missing values with "NT" ("No transfer") if "Embryos transferred" is 0. Append strings based on the "Fresh cycle" and "Frozen cycle" values.

* Clean column 'Embryos transferred from eggs micro-injected': if the specific treatment type includes 'ICSI', fill missing values with the value from column 'Embryos transferred'; otherwise, fill missing values with 0.

* Impute missing values in the "Egg donor age at registration" and "Sperm donor age at registration" columns based on their respective source columns ("Egg source" and "Sperm source"). If the source is "Patient" or "Partner," fill the missing values using the "Patient age at treatment" and "Partner age" columns, respectively. After imputation, rename the columns to "Patient/Egg provider age" and "Partner/Sperm provider age" accordingly.

* Append 'e' to '1' in the 'Embryos transferred' column when a single embryo transfer was elective to enhance clarity and analysis.

* Annotate the value 0 in relevant columns to indicate whether it pertains to a frozen or fresh cycle, giving the value contextual meaning.

* Convert columns with data type float to data type integer.

* Drop rows with placeholder values '999'.

* Drop all remaining rows with missing data.

#### Filter the Dataset to include only IVF treatments

Since the customer is interested in predicting the chance of success using IVF treatment, the first step is to filter the data and keep only entries where "Main reason for producing embroys storing eggs" has the value "Treatment - IVF".
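
As a small illustration of this filter, the sketch below applies the same condition to a made-up three-row DataFrame (`toy`, hypothetical values; the 'Treatment - DI' label is assumed for the example). Because the column name contains spaces, `DataFrame.query` needs it wrapped in backticks; an ordinary boolean mask selects the same rows.

```python
import pandas as pd

reason_col = 'Main reason for producing embroys storing eggs'

# Made-up rows, for illustration only
toy = pd.DataFrame({
    reason_col: ['Treatment - IVF', 'Treatment - DI', 'Treatment - IVF'],
    'Live birth occurrence': [0, 1, 1],
})

# Backticks let query() reference a column name that contains spaces
ivf_via_query = toy.query(f"`{reason_col}` == 'Treatment - IVF'")

# Equivalent boolean-mask form
ivf_via_mask = toy[toy[reason_col] == 'Treatment - IVF']

print(ivf_via_query.equals(ivf_via_mask))
```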

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin

class FilterIVFTreatments(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X.query("`Main reason for producing embroys storing eggs` == 'Treatment - IVF'")

Create a new dataframe and apply filter_ivf to the TrainSet.

In [None]:
filter_ivf = FilterIVFTreatments()
df_filter_ivf = filter_ivf.fit_transform(TrainSet)

Check cleaning effect

In [None]:
df_filter_ivf['Main reason for producing embroys storing eggs'].unique()

Apply the transformation to Train and TestSet

In [None]:
filter_ivf = FilterIVFTreatments()
TrainSet_cleaned, TestSet_cleaned = filter_ivf.transform(TrainSet), filter_ivf.transform(TestSet)

#### Drop rows with 'Live birth occurrence' value 1 and 'Embryos transferred' 0

Since a treatment cannot result in a live birth without any embryos having been transferred, these entries are likely erroneous and cannot be considered.
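
Before dropping anything, it can help to see how many rows the rule affects. The sketch below builds the same condition as a boolean mask on a made-up DataFrame (`toy`, hypothetical values), counts the flagged rows, and keeps the rest.

```python
import pandas as pd

# Made-up rows, for illustration only
toy = pd.DataFrame({
    'Live birth occurrence': [1, 1, 0, 0],
    'Embryos transferred': [2, 0, 0, 1],
})

# True where a live birth is recorded although no embryo was transferred
erroneous = (toy['Live birth occurrence'] == 1) & (toy['Embryos transferred'] == 0)
print(f"Rows flagged as erroneous: {erroneous.sum()}")

# Keep every row that is not flagged
toy_valid = toy[~erroneous]
print(toy_valid)
```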

In [None]:
class DropErroneousEntries(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X.drop(X[(X['Live birth occurrence'] == 1) & (X['Embryos transferred'] == 0)].index)

Create a new dataframe and apply drop_erroneous to the selected variables in the TrainSet.

In [None]:
drop_erroneous = DropErroneousEntries()
df_drop_erroneous= drop_erroneous.fit_transform(TrainSet_cleaned)

Check cleaning effect

In [None]:
print("\nDataFrame after applying DropErroneousEntries:")
print(df_drop_erroneous)

Apply the transformation to Train and TestSet

In [None]:
drop_erroneous = DropErroneousEntries()
TrainSet_cleaned, TestSet_cleaned = drop_erroneous.transform(TrainSet_cleaned), drop_erroneous.transform(TestSet_cleaned)

#### Drop columns that have missing data and/or don't add relevant information for the analysis

In [None]:
columns_to_drop = [
    'Total number of previous DI cycles',
    'Main reason for producing embroys storing eggs',
    'Type of treatment - IVF or DI',
    'Donated embryo',
    'Eggs thawed (0/1)',
    'Year of treatment',
    'Number of live births',
    'Embryos stored for use by patient',
    'Fresh eggs stored (0/1)',
    'Heart three birth congenital abnormalities',
    'Heart two birth congenital abnormalities',
    'Heart three delivery date',
    'Heart three sex',
    'Heart three birth weight',
    'Heart three weeks gestation',
    'Heart three birth outcome',
    'Heart one birth congenital abnormalities',
    'Heart two birth weight',
    'Heart two delivery date',
    'Heart two sex',
    'Heart two weeks gestation',
    'Heart two birth outcome',
    'Heart one birth weight',
    'Heart one weeks gestation',
    'Heart one delivery date',
    'Heart one sex',
    'Heart one birth outcome',
    'Number of foetal sacs with fetal pulsation',
    'Early outcome',
    'Partner ethnicity',
    'Partner Type'
    ]

print(f"* {len(columns_to_drop)} variables to drop \n\n"
    f"{columns_to_drop}")


Create a new dataframe and apply drop_columns to the TrainSet.

In [None]:
from feature_engine.selection import DropFeatures
print(TrainSet_cleaned.columns)
drop_columns = DropFeatures(features_to_drop=columns_to_drop)
df_dropped_columns = drop_columns.fit_transform(TrainSet_cleaned)

Check cleaning effect

In [None]:
df_dropped_columns.head(3)

Apply the transformation to the Train and TestSet

In [None]:
from feature_engine.selection import DropFeatures

drop_columns = DropFeatures(features_to_drop=columns_to_drop)
drop_columns.fit(TrainSet_cleaned)

TrainSet_cleaned, TestSet_cleaned = drop_columns.transform(TrainSet_cleaned), drop_columns.transform(TestSet_cleaned)

#### Handling "Total number of previous pregnancies - IVF and DI" and "Total number of previous live births - IVF or DI" columns

* Replace '>3' with 4.
* Convert values to numeric.
* Impute missing values with 0.
* Standardize the columns by converting the data type to integer.
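
The effect of these steps on a single column can be sketched on a made-up Series. The example assumes the raw values are stored as strings such as '0', '1' and '>3', with NaN for missing entries; both that assumption and the sample values are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up raw values, for illustration only
raw = pd.Series(['0', '1', '>3', np.nan, '2'],
                name='Total number of previous pregnancies - IVF and DI')

# Map the open-ended category to 4, then parse everything as numbers
as_numeric = pd.to_numeric(raw.replace('>3', '4'))

# Treat a missing count as zero
imputed = as_numeric.fillna(0)

# With no NaN left, the column can be stored as integers
as_int = imputed.astype(int)

print(as_int.tolist())  # [0, 1, 4, 0, 2]
```

The integer conversion comes last because a column that still contains NaN cannot be cast to a plain `int` dtype.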

Turn values to numeric

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin

class ConvertToNumeric(BaseEstimator, TransformerMixin):
    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        for col in self.columns:
            # Replace '>3' with 4
            X[col] = X[col].replace('>3', 4)
            # Convert to numeric
            X[col] = pd.to_numeric(X[col])
        return X

Create a new dataframe and apply convert_to_numeric to the selected variables in the TrainSet.

In [None]:
convert_to_numeric = ConvertToNumeric(columns=['Total number of previous pregnancies - IVF and DI', 'Total number of previous live births - IVF or DI'])
df_prev_preg_births_to_numeric = convert_to_numeric.fit_transform(TrainSet_cleaned)

Check cleaning effect

In [None]:
print(df_prev_preg_births_to_numeric['Total number of previous pregnancies - IVF and DI'].dtype)
print(df_prev_preg_births_to_numeric['Total number of previous live births - IVF or DI'].dtype)

Apply the transformation to Train and TestSet

In [None]:
convert_to_numeric = ConvertToNumeric(columns=['Total number of previous pregnancies - IVF and DI', 'Total number of previous live births - IVF or DI'])

TrainSet_cleaned, TestSet_cleaned = convert_to_numeric.transform(TrainSet_cleaned), convert_to_numeric.transform(TestSet_cleaned)

##### Fill missing values with 0

In [None]:
from feature_engine.imputation import ArbitraryNumberImputer

# Fill missing values with 0 for specified columns
zeros_imputer = ArbitraryNumberImputer(arbitrary_number=0, variables=[
    'Total number of previous pregnancies - IVF and DI',
    'Total number of previous live births - IVF or DI'
])

Create a new dataframe and apply zeros_imputer to the selected variables in the TrainSet.

In [None]:

df_prev_preg_births_zero_imputed = zeros_imputer.fit_transform(TrainSet_cleaned)

Check cleaning effect

In [None]:
# Function to compare columns before and after transformation
def compare_columns(df_original, df_cleaned, columns):
    comparison_dict = {}
    summary_list = []

    for column in columns:
        comparison_dict[f'{column}_Before_Cleaning'] = df_original[column]
        comparison_dict[f'{column}_After_Cleaning'] = df_cleaned[column]
            
        before_unique = df_original[column].unique()
        after_unique = df_cleaned[column].unique()
        before_missing = df_original[column].isna().sum()
        after_missing = df_cleaned[column].isna().sum()
        
        summary_list.append({
            'Column': column,
            'Before Unique Values': before_unique,
            'After Unique Values': after_unique,
            'Before Missing Entries': before_missing,
            'After Missing Entries': after_missing
        })
        
    comparison_df = pd.DataFrame(comparison_dict)
    summary_df = pd.DataFrame(summary_list)
    
    return comparison_df, summary_df


In [None]:
# Columns to compare
columns_to_compare = ['Total number of previous pregnancies - IVF and DI', 'Total number of previous live births - IVF or DI']

# Compare the columns before and after cleaning
comparison_prev_preg_births_zero_imputed = compare_columns(TrainSet_cleaned, df_prev_preg_births_zero_imputed, columns_to_compare)
print(comparison_prev_preg_births_zero_imputed)

Apply the transformation to Train and TestSet

In [None]:
zeros_imputer = ArbitraryNumberImputer(arbitrary_number=0, variables=[
    'Total number of previous pregnancies - IVF and DI',
    'Total number of previous live births - IVF or DI'
]).fit(TrainSet_cleaned)

TrainSet_cleaned, TestSet_cleaned = zeros_imputer.transform(TrainSet_cleaned), zeros_imputer.transform(TestSet_cleaned)

##### Standardize values on column by converting the data type to integers and replace('>3', 4)

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin

class ConvertToIntegers(BaseEstimator, TransformerMixin):
    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        for col in self.columns:
            # Replace '>3' with 4 and convert to int
            X[col] = X[col].replace('>3', 4).astype(float).astype(int)
        return X

Create a new dataframe and apply convert_to_int to the selected variables in the TrainSet.

In [None]:
convert_to_int = ConvertToIntegers(['Total number of previous pregnancies - IVF and DI', 'Total number of previous live births - IVF or DI'])
df_prev_preg_births_to_int = convert_to_int.fit_transform(TrainSet_cleaned)

Check cleaning effect

In [None]:
# Columns to compare
columns_to_compare = ['Total number of previous pregnancies - IVF and DI', 'Total number of previous live births - IVF or DI']

# Compare the columns before and after cleaning
comparison_prev_preg_births_int = compare_columns(TrainSet_cleaned, df_prev_preg_births_to_int, columns_to_compare)
print(comparison_prev_preg_births_int)

Apply the transformation to Train and TestSet

In [None]:
convert_to_int = ConvertToIntegers(['Total number of previous pregnancies - IVF and DI', 'Total number of previous live births - IVF or DI'])
convert_to_int.fit(TrainSet_cleaned)

TrainSet_cleaned, TestSet_cleaned = convert_to_int.transform(TrainSet_cleaned), convert_to_int.transform(TestSet_cleaned)

#### Clean 'Sperm source' missing entries

If there is a 'Sperm donor age at registration', impute 'Donor'; otherwise, impute 'Partner'.
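
The same rule can also be written without a row-wise `apply`. The sketch below is a vectorized variant on made-up data (`toy`, hypothetical values), not the transformer used in this notebook: `numpy.where` picks the fallback label and `Series.where` keeps any source that is already present.

```python
import numpy as np
import pandas as pd

# Made-up rows, for illustration only
toy = pd.DataFrame({
    'Sperm source': ['Partner', np.nan, np.nan, 'Donor'],
    'Sperm donor age at registration': [np.nan, 'Between 21 and 25', np.nan, '>45'],
})

# Label to use when the source is missing: 'Donor' if a donor age exists
fallback = pd.Series(
    np.where(toy['Sperm donor age at registration'].notna(), 'Donor', 'Partner'),
    index=toy.index,
)

# Keep existing sources; use the fallback only where the source is missing
filled = toy['Sperm source'].where(toy['Sperm source'].notna(), fallback)
print(filled.tolist())
```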


In [None]:
from sklearn.base import BaseEstimator, TransformerMixin

class FillSpermSource(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        X['Sperm source'] = X.apply(self._fill_sperm_source, axis=1)
        return X

    def _fill_sperm_source(self, row):
        if pd.isna(row['Sperm source']):
            if not pd.isna(row['Sperm donor age at registration']):
                return 'Donor'
            else:
                return 'Partner'
        return row['Sperm source']

Create a new dataframe and apply fill_sperm_source to the selected variables in the TrainSet.

In [None]:
# Create an instance of the transformer
fill_sperm_source = FillSpermSource()
df_filled_sperm_source = fill_sperm_source.fit_transform(TrainSet_cleaned)


Check cleaning effect

In [None]:
# Columns to compare
columns_to_compare = ['Sperm source', 'Sperm donor age at registration']

# Compare the columns before and after cleaning
comparison_sperm_source = compare_columns(TrainSet_cleaned, df_filled_sperm_source, columns_to_compare)
print(comparison_sperm_source)

Apply the transformation to Train and TestSet

In [None]:
fill_sperm_source = FillSpermSource()
fill_sperm_source.fit(TrainSet_cleaned)

TrainSet_cleaned, TestSet_cleaned = fill_sperm_source.transform(TrainSet_cleaned), fill_sperm_source.transform(TestSet_cleaned)

#### Handling "Date of embryo transfer" column

This column has to be handled using several custom transformers to:
* Convert float values to integers and handle NaN values.
* Replace the value 999 with 0. All these "999" entries are from frozen cycles, and transfers from frozen cycles mostly happen on the same day the embryos are thawed.
* Replace missing values based on the "Embryos transferred" column. If the value is 0, the missing entries need to be replaced by "NT" for "No transfer", meaning that the treatment didn't work.
* Append strings based on the "Fresh cycle" and "Frozen cycle" values.
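
The combined effect of these steps can be sketched on a few made-up rows. For brevity, the example below folds the logic into one hypothetical helper, `label_transfer_day`, whereas the notebook implements it as three separate transformers; the sample values are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up rows, for illustration only
toy = pd.DataFrame({
    'Date of embryo transfer': [5.0, 999.0, np.nan, np.nan, 3.0],
    'Embryos transferred': [1, 2, 0, 1, 1],
    'Fresh cycle': [1, 0, 1, 0, 0],
    'Frozen cycle': [0, 1, 0, 1, 0],
})


def label_transfer_day(row):
    day = row['Date of embryo transfer']
    if pd.isna(day):
        # No recorded day: 'NT' when nothing was transferred, else 'Missing'
        return 'NT' if row['Embryos transferred'] == 0 else 'Missing'
    day = int(day)
    if day == 999:
        day = 0  # placeholder value is treated as day 0
    if row['Fresh cycle'] == 1:
        return f"{day} - fresh"
    if row['Frozen cycle'] == 1:
        return f"{day} - frozen"
    return f"{day} - Mixed fresh/frozen"


labels = toy.apply(label_transfer_day, axis=1)
print(labels.tolist())
# ['5 - fresh', '0 - frozen', 'NT', 'Missing', '3 - Mixed fresh/frozen']
```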

##### Create a Custom Transformer to convert float values to integers, handle NaN values and replace the value 999 with 0.


In [None]:
from sklearn.base import BaseEstimator, TransformerMixin

# Convert float values to integers and handle NaN values
class ConvertToIntAndReplace999(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        # Fill NaN with -1 and convert to int
        X['Date of embryo transfer'] = X['Date of embryo transfer'].fillna(-1).astype(int)
        # Replace 999 with 0
        X['Date of embryo transfer'] = X['Date of embryo transfer'].replace(999, 0)
        return X


Apply the dot_to_int_999 to the selected variables in the TrainSet to verify the cumulative cleaning results.

In [None]:
# Create an instance of the transformer
dot_to_int_999 = ConvertToIntAndReplace999()
df_dot_to_int_999 = dot_to_int_999.fit_transform(TrainSet_cleaned)

Check cleaning effect

In [None]:
# Columns to compare
columns_to_compare = ['Date of embryo transfer']

# Compare the columns before and after cleaning
comparison_date_transf_to_int_999 = compare_columns(TrainSet_cleaned, df_dot_to_int_999, columns_to_compare)
print(comparison_date_transf_to_int_999)

Apply the transformation to Train and TestSet

In [None]:
dot_to_int_999 = ConvertToIntAndReplace999()
dot_to_int_999.fit(TrainSet_cleaned)

TrainSet_cleaned, TestSet_cleaned = dot_to_int_999.transform(TrainSet_cleaned), dot_to_int_999.transform(TestSet_cleaned)

##### Replace missing values based on the "Embryos transferred" column.

If the value is 0, the missing entries need to be replaced by "NT" for "No transfer", meaning that the treatment didn't work.

In [None]:
# Replace missing values based on the "Embryos transferred" column
class ReplaceMissingValues(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        X['Date of embryo transfer'] = X.apply(self._replace_missing, axis=1)
        return X

    def _replace_missing(self, row):
        value = row['Date of embryo transfer']
        if value == -1 and row['Embryos transferred'] == 0:
            return 'NT'
        elif value == -1:
            return 'Missing'
        return value

Apply the replace_missing_values to the selected variables in the TrainSet.

In [None]:
# Create an instance of the transformer
replace_missing_values = ReplaceMissingValues()
df_replace_missing_values = replace_missing_values.fit_transform(TrainSet_cleaned)

Check cleaning effect

In [None]:
# Columns to compare
columns_to_compare = ['Date of embryo transfer', 'Embryos transferred']

# Compare the columns before and after cleaning
comparison_date_transf_replace_missing_values = compare_columns(TrainSet_cleaned, df_replace_missing_values, columns_to_compare)
print(comparison_date_transf_replace_missing_values)

Apply the transformation to Train and TestSet

In [None]:
replace_missing_values = ReplaceMissingValues()
replace_missing_values.fit(TrainSet_cleaned)

TrainSet_cleaned, TestSet_cleaned = replace_missing_values.transform(TrainSet_cleaned), replace_missing_values.transform(TestSet_cleaned)

##### Append strings based on the "Fresh cycle" and "Frozen cycle" values

In [None]:
# Append strings based on the "Fresh cycle" and "Frozen cycle" values
class AppendCycleType(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        X['Date of embryo transfer'] = X.apply(self._append_cycle_type, axis=1)
        return X

    def _append_cycle_type(self, row):
        value = row['Date of embryo transfer']
        if value not in ['NT', 'Missing']:
            if row['Fresh cycle'] == 1:
                value = f"{value} - fresh"
            elif row['Frozen cycle'] == 1:
                value = f"{value} - frozen"
            else:
                value = f"{value} - Mixed fresh/frozen"
        return value


Apply the append_cycle_type to the selected variables in the TrainSet.

In [None]:
append_cycle_type = AppendCycleType()
df_appended_cycle_type = append_cycle_type.fit_transform(TrainSet_cleaned)

Check cleaning effect

In [None]:
# Columns to compare
columns_to_compare = ['Date of embryo transfer', 'Embryos transferred', 'Fresh cycle', 'Frozen cycle']

# Compare the columns before and after cleaning
comparison_date_trans_cycle_type = compare_columns(TrainSet_cleaned, df_appended_cycle_type, columns_to_compare)
print(comparison_date_trans_cycle_type)


Apply the transformation to Train and TestSet

In [None]:
append_cycle_type = AppendCycleType()
append_cycle_type.fit(TrainSet_cleaned)

TrainSet_cleaned, TestSet_cleaned = append_cycle_type.transform(TrainSet_cleaned), append_cycle_type.transform(TestSet_cleaned)

#### Clean column 'Embryos transferred from eggs micro-injected'

If the specific treatment type includes 'ICSI', fill missing values with the value from column 'Embryos transferred'; otherwise, fill missing values with 0.
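
To make the rule concrete, the sketch below applies it to four made-up rows (`toy`, hypothetical values, including the 'ICSI / BLASTOCYST' label). It passes `na=False` to `str.contains` so that a missing treatment type would count as non-ICSI; that choice is an assumption of this example.

```python
import numpy as np
import pandas as pd

target_col = 'Embryos transferred from eggs micro-injected'

# Made-up rows, for illustration only
toy = pd.DataFrame({
    'Specific treatment type': ['ICSI', 'IVF', 'ICSI / BLASTOCYST', 'IVF'],
    'Embryos transferred': [2, 1, 1, 2],
    target_col: [np.nan, np.nan, 1.0, 0.0],
})

is_missing = toy[target_col].isna()
is_icsi = toy['Specific treatment type'].str.contains('ICSI', na=False)

# ICSI rows: copy the number of embryos transferred
toy.loc[is_missing & is_icsi, target_col] = toy.loc[is_missing & is_icsi, 'Embryos transferred']
# Non-ICSI rows: no micro-injected embryos
toy.loc[is_missing & ~is_icsi, target_col] = 0

print(toy[target_col].tolist())  # [2.0, 0.0, 1.0, 0.0]
```

Only the first two rows change; values that were already present are left untouched.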

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin

class MicroInjectedEmbryos(BaseEstimator, TransformerMixin):
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        X = X.copy()
        
        # Embryos transferred from eggs micro-injected imputation
        missing_micro_injected = (X['Embryos transferred from eggs micro-injected'].isna())
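        # na=False treats rows with a missing treatment type as non-ICSI,
        # so the boolean masks below never contain NaN
        ICSI = X['Specific treatment type'].str.contains('ICSI', na=False)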
        # Only replace missing values
        X.loc[missing_micro_injected & ICSI, 'Embryos transferred from eggs micro-injected'] = X.loc[missing_micro_injected & ICSI, 'Embryos transferred']
        X.loc[missing_micro_injected & ~ICSI, 'Embryos transferred from eggs micro-injected'] = 0

        return X

Apply the micro_injected to the selected variables in the TrainSet.

In [None]:
# Create an instance of the transformer
micro_injected = MicroInjectedEmbryos()
df_micro_injected = micro_injected.fit_transform(TrainSet_cleaned)

Check cleaning effect

In [None]:
# Columns to compare
columns_to_compare = ['Specific treatment type', 'Embryos transferred', 'Embryos transferred from eggs micro-injected']

# Compare the columns before and after cleaning
comparison_micro_injected_embryos = compare_columns(TrainSet_cleaned, df_micro_injected, columns_to_compare)
print(comparison_micro_injected_embryos)


Apply the transformation to Train and TestSet

In [None]:
micro_injected = MicroInjectedEmbryos()
micro_injected.fit(TrainSet_cleaned)

TrainSet_cleaned, TestSet_cleaned = micro_injected.transform(TrainSet_cleaned), micro_injected.transform(TestSet_cleaned)

#### Handling 'Egg donor age at registration' and 'Sperm donor age at registration'

Both of these columns have more than 90% missing data, but the missing data can be managed by checking the respective source columns ('Egg source' and 'Sperm source') to determine if the source is "Donor" or "Patient/Partner".

**Egg donor age at registration:**
- For missing fields in the "Egg donor age at registration" column, if the value in the 'Egg source' column is "Patient", then the field can be filled with the patient's age from the "Patient age at treatment" column.
- After that, the column "Egg donor age at registration" is renamed to "Patient/Egg provider age".
- The age in this dataset is represented as ranges, which are different between "Patient age at treatment" ('18-34', '35-37', '38-39', '40-42', '43-44', '45-50') and the original "Egg donor age at registration" ('<= 20', 'Between 21 and 25', 'Between 26 and 30', 'Between 31 and 35', '>35'). Therefore, the ranges need to be standardized.
- Since the majority of values will come from the "Patient age at treatment", this column's ranges are used as the reference to adjust the "Patient/Egg provider age".

**Sperm donor age at registration:**
- For missing fields in the "Sperm donor age at registration" column, if the value in the 'Sperm source' column is "Partner", then the field can be filled with the partner's age from the "Partner age" column.
- After that, the column "Sperm donor age at registration" is renamed to "Partner/Sperm provider age".
- The age in this dataset is represented as ranges, which are different between "Partner age" ('18-34', '35-37', '38-39', '40-42', '43-44', '45-50', '51-55', '56-60', '>60') and the original "Sperm donor age at registration" ('<= 20', 'Between 21 and 25', 'Between 26 and 30', 'Between 31 and 35', 'Between 36 and 40', 'Between 41 and 45', '>45'). Therefore, the ranges need to be standardized.
- Since the majority of values will come from the "Partner age", this column's ranges are used as the reference to adjust the "Partner/Sperm provider age".
- The column "Partner age" can then be dropped because the useful information will already be saved on the "Partner/Sperm provider age".
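
The egg-provider half of this logic can be sketched on a few made-up rows. The example reuses the donor-to-patient range mapping defined in the transformer below (including its choice of mapping '>35' to '38-39'); the DataFrame `toy` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Donor age ranges expressed in the patient age ranges
egg_age_map = {
    '<= 20': '18-34',
    'Between 21 and 25': '18-34',
    'Between 26 and 30': '18-34',
    'Between 31 and 35': '18-34',
    '>35': '38-39',
}

# Made-up rows, for illustration only
toy = pd.DataFrame({
    'Egg source': ['Patient', 'Donor', 'Donor', 'Donor'],
    'Egg donor age at registration': [np.nan, 'Between 26 and 30', '>35', np.nan],
    'Patient age at treatment': ['35-37', '40-42', '18-34', '38-39'],
})

# Step 1: translate donor ranges; missing donor ages stay NaN
mapped_age = toy['Egg donor age at registration'].map(egg_age_map)

# Step 2: where the patient provided the eggs, use the patient's own age range
use_patient_age = mapped_age.isna() & (toy['Egg source'] == 'Patient')
toy['Patient/Egg provider age'] = mapped_age.mask(use_patient_age, toy['Patient age at treatment'])

print(toy['Patient/Egg provider age'].tolist())
```

The last row stays missing because the eggs came from a donor whose age was not recorded; such rows are left for the later step that drops remaining missing values.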

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin

class DonorAgeImputer(BaseEstimator, TransformerMixin):
    def __init__(self):
        # Mapping from donor age ranges to patient/partner age ranges
        self.egg_age_map = {
            'Between 21 and 25': '18-34',
            'Between 26 and 30': '18-34',
            'Between 31 and 35': '18-34',
            '>35': '38-39',
            '<= 20': '18-34'
        }
        self.sperm_age_map = {
            'Between 21 and 25': '18-34',
            'Between 26 and 30': '18-34',
            'Between 31 and 35': '18-34',
            'Between 36 and 40': '38-39',
            'Between 41 and 45': '43-44',
            '>45': '45-50',
            '<= 20': '18-34'
        }
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        X = X.copy()
        
        # Egg donor age imputation
        X['Egg donor age at registration'] = X['Egg donor age at registration'].map(self.egg_age_map)
        missing_egg_age = (X['Egg donor age at registration'].isna()) & (X['Egg source'] == 'Patient')
        X.loc[missing_egg_age, 'Egg donor age at registration'] = X.loc[missing_egg_age, 'Patient age at treatment']
        X.rename(columns={'Egg donor age at registration': 'Patient/Egg provider age'}, inplace=True)
        
        # Sperm donor age imputation
        X['Sperm donor age at registration'] = X['Sperm donor age at registration'].map(self.sperm_age_map)
        missing_sperm_age = (X['Sperm donor age at registration'].isna()) & (X['Sperm source'] == 'Partner')
        X.loc[missing_sperm_age, 'Sperm donor age at registration'] = X.loc[missing_sperm_age, 'Partner age']
        X.rename(columns={'Sperm donor age at registration': 'Partner/Sperm provider age'}, inplace=True)

        # Drop the "Partner age" column
        X.drop(columns=['Partner age'], inplace=True)
        
        # Ensure no duplicate columns
        if X.columns.duplicated().any():
            raise ValueError("Duplicate column names found after transformation")
        
        return X



Apply the donor_age to the selected variables in the TrainSet.

In [None]:
donor_age = DonorAgeImputer()
df_donor_age = donor_age.fit_transform(TrainSet_cleaned)

Evaluate the data cleaning of the egg and sperm donor age at registration

In [None]:

def CompareDataCleaning(df_original, df_cleaned, variable_map):
    missing_values = {}
    unique_values = {}

    for original_var, cleaned_var in variable_map.items():
        # Missing values
        original_missing_count = df_original[original_var].isna().sum()
        cleaned_missing_count = df_cleaned[cleaned_var].isna().sum()
        original_missing_percent = (original_missing_count / len(df_original)) * 100
        cleaned_missing_percent = (cleaned_missing_count / len(df_cleaned)) * 100

        missing_values[original_var] = pd.DataFrame({
            'Original Missing Count': [original_missing_count],
            'Original Missing Percent': [original_missing_percent],
            'Cleaned Missing Count': [cleaned_missing_count],
            'Cleaned Missing Percent': [cleaned_missing_percent]
        })

        # Unique value counts
        original_unique = df_original[original_var].value_counts(dropna=False)
        cleaned_unique = df_cleaned[cleaned_var].value_counts(dropna=False)
        unique_values[original_var] = pd.DataFrame({
            'Original': original_unique,
            'Cleaned': cleaned_unique
        })

    # Display results
    for original_var, cleaned_var in variable_map.items():
        print("\n=====================================================================================")
        print(f"Missing Values for {original_var} -> {cleaned_var}:\n")
        print(missing_values[original_var])
        print(f"\nUnique Values for {original_var} -> {cleaned_var}:\n")
        print(unique_values[original_var])

In [None]:
# Mapping of original to cleaned variables
variable_map_donor = {
    'Egg donor age at registration': 'Patient/Egg provider age',
    'Sperm donor age at registration': 'Partner/Sperm provider age',
}

CompareDataCleaning(df_original=TrainSet_cleaned, df_cleaned=df_donor_age, variable_map=variable_map_donor)

Apply the transformation to Train and TestSet

In [None]:
donor_age = DonorAgeImputer()
donor_age.fit(TrainSet_cleaned)

TrainSet_cleaned, TestSet_cleaned = donor_age.transform(TrainSet_cleaned), donor_age.transform(TestSet_cleaned)

### Convert all float data type variables to data type integer

In [None]:
class FloatToIntTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.float_vars = None
    
    def fit(self, X, y=None):
        # Identify float columns
        self.float_vars = X.select_dtypes(include='float').columns.tolist()
        return self
    
    def transform(self, X):
        X = X.copy()
        for var in self.float_vars:
            X[var] = X[var].astype(int)
        return X

Apply the FloatToIntTransformer to the selected variables in the TrainSet.

In [None]:
float_to_int = FloatToIntTransformer()
df_float_to_int = float_to_int.fit_transform(TrainSet_cleaned)

Check cleaning effect

In [None]:
# Check data types before transformation
print("Data types before transformation:")
print(TrainSet_cleaned.dtypes)
print("\n")
print("========================================")
print("\n")
# Check data types after transformation
print("Data types after transformation:")
print(df_float_to_int.dtypes)

Apply the transformation to Train and TestSet

In [None]:
float_to_int = FloatToIntTransformer()
float_to_int.fit(TrainSet_cleaned)

TrainSet_cleaned, TestSet_cleaned = float_to_int.transform(TrainSet_cleaned), float_to_int.transform(TestSet_cleaned)
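
One caveat with a plain integer cast is that it cannot represent missing values. The sketch below uses an invented toy column to show the failure and one possible alternative, the pandas nullable `Int64` dtype. Whether any float column in the real data still contains NaN at this point is not shown here, so treat this as an assumption to check rather than a required change.

```python
import numpy as np
import pandas as pd

# Toy column with one missing value (values invented for illustration).
toy = pd.DataFrame({"Total embryos created": [3.0, 0.0, np.nan]})

# A plain integer cast cannot represent NaN, so it raises.
try:
    toy["Total embryos created"].astype(int)
    cast_failed = False
except (ValueError, TypeError) as err:
    cast_failed = True
    print(f"astype(int) failed: {err}")

# The nullable "Int64" extension dtype keeps the gap as <NA> instead.
as_nullable = toy["Total embryos created"].astype("Int64")
print(as_nullable.dtype)
```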

#### Explicitly mark the transferred embryos that were electively selected

To enhance the clarity of the 'Embryos transferred' column, an "e" will be appended to the 1 in 'Embryos transferred' when both 'Embryos transferred' and 'Elective single embryo transfer' columns have a value of 1. This will change the value to "1e".

This adjustment is intended to indicate cases where a single embryo transfer was elective, thereby distinguishing it from situations where only one embryo was available for transfer.

Explicitly marking elective single embryo transfers improves the analysis of the outcomes.
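
The flagging rule can be previewed on a small invented frame. This sketch uses a boolean mask instead of a row-wise `apply`; it is an illustrative equivalent under that assumption, not the transformer used in this notebook.

```python
import pandas as pd

# Toy rows (invented): only the first one is an elective single embryo transfer.
toy = pd.DataFrame({
    "Embryos transferred": [1, 1, 2, 0],
    "Elective single embryo transfer": [1, 0, 0, 0],
})

# Rows where exactly one embryo was transferred and the transfer was elective.
elective_single = (
    (toy["Embryos transferred"] == 1)
    & (toy["Elective single embryo transfer"] == 1)
)

# Cast to object first so the column can hold both integers and the "1e" label.
flagged = toy["Embryos transferred"].astype(object)
flagged.loc[elective_single] = "1e"

print(flagged.tolist())
```

Note that the resulting column mixes integers and a string label, so downstream steps should treat it as categorical.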

In [None]:
class EFlaggingTransformer(BaseEstimator, TransformerMixin):
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        X = X.copy()
        
        X['Embryos transferred'] = X.apply(self.append_e, axis=1)
        return X
    
    def append_e(self, row):
        if row['Embryos transferred'] == 1 and row['Elective single embryo transfer'] == 1:
            return '1e'
        else:
            return row['Embryos transferred']
    


Apply the e_flagging to the selected variables in the TrainSet.

In [None]:
e_flagging = EFlaggingTransformer()
df_e_flagged = e_flagging.fit_transform(TrainSet_cleaned)

Check cleaning effect

In [None]:
# Columns to compare
columns_to_compare = ['Embryos transferred', 'Elective single embryo transfer']

# Compare the columns before and after cleaning
comparison_e_flagging = compare_columns(TrainSet_cleaned, df_e_flagged, columns_to_compare)
print(comparison_e_flagging)


Apply the transformation to Train and TestSet

In [None]:
e_flagging = EFlaggingTransformer()
e_flagging.fit(TrainSet_cleaned)

TrainSet_cleaned, TestSet_cleaned = e_flagging.transform(TrainSet_cleaned), e_flagging.transform(TestSet_cleaned)

#### Annotate the value 0 in relevant columns to indicate whether it pertains to a frozen or fresh cycle

For columns related to fresh cycles ('Fresh eggs collected', 'Total eggs mixed', and 'Total embryos created'), if the value is 0 and it is a frozen cycle, the transformer marks it as "0 - frozen cycle". Similarly, for the 'Total embryos thawed' column, if the value is 0 and it is a fresh cycle, it is marked as "0 - fresh cycle". This distinction makes the value 0 meaningful, because it shows whether the zero is simply not applicable to that cycle type. For instance, "0 - frozen cycle" in the 'Fresh eggs collected' column means something different from a plain 0 in a fresh cycle, where eggs were expected to be collected.
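
As a rough illustration of the rule, the sketch below annotates zeros on an invented three-row frame. It covers only one of the fresh-cycle columns and is a simplified stand-in for the transformer, with made-up values.

```python
import pandas as pd

# Toy rows (invented): fresh, frozen, fresh.
toy = pd.DataFrame({
    "Fresh cycle": [1, 0, 1],
    "Frozen cycle": [0, 1, 0],
    "Fresh eggs collected": [0, 0, 8],
    "Total embryos thawed": [0, 2, 0],
})

# Work on string copies so numeric values and labels can share a column.
annotated = toy.astype({"Fresh eggs collected": str, "Total embryos thawed": str})

# A zero in a fresh-cycle column is not applicable when the cycle is frozen.
frozen_zero = (annotated["Frozen cycle"] == 1) & (annotated["Fresh eggs collected"] == "0")
annotated.loc[frozen_zero, "Fresh eggs collected"] = "0 - frozen cycle"

# A zero in the thawed column is not applicable when the cycle is fresh.
fresh_zero = (annotated["Fresh cycle"] == 1) & (annotated["Total embryos thawed"] == "0")
annotated.loc[fresh_zero, "Total embryos thawed"] = "0 - fresh cycle"

print(annotated[["Fresh eggs collected", "Total embryos thawed"]])
```

The first row keeps a plain "0" for 'Fresh eggs collected' because it is a fresh cycle in which no eggs were collected, which is exactly the case the annotation is meant to keep distinguishable.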

In [None]:
class TypeOfCycleAppender(BaseEstimator, TransformerMixin):
    def __init__(self, columns_to_update):
        self.columns_to_update = columns_to_update

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        
        # Ensure columns have the correct data type to avoid issues
        for column in self.columns_to_update:
            X[column] = X[column].astype(str)
        
        # Apply transformation for frozen cycle
        for column in self.columns_to_update:
            X.loc[(X['Frozen cycle'] == 1) & (X[column] == '0'), column] = '0 - frozen cycle'
        
        # Apply transformation for fresh cycle
        X['Total embryos thawed'] = X['Total embryos thawed'].astype(str)
        X.loc[(X['Fresh cycle'] == 1) & (X['Total embryos thawed'] == '0'), 'Total embryos thawed'] = '0 - fresh cycle'
        
        return X

Apply the type_of_cycle to the selected variables in the TrainSet.

In [None]:
columns_to_update = ['Fresh eggs collected', 'Total eggs mixed', 'Total embryos created']

type_of_cycle = TypeOfCycleAppender(columns_to_update=columns_to_update)
df_type_of_cycle_appended = type_of_cycle.fit_transform(TrainSet_cleaned)

Check cleaning effect

In [None]:
# Columns to compare
columns_to_compare = ['Fresh eggs collected', 'Total eggs mixed', 'Total embryos created', 'Fresh cycle', 'Frozen cycle', 'Total embryos thawed']

# Compare the columns before and after cleaning
comparison_type_of_cycle_appended = compare_columns(TrainSet_cleaned, df_type_of_cycle_appended, columns_to_compare)
print(comparison_type_of_cycle_appended)


Apply the transformation to Train and TestSet

In [None]:
columns_to_update = ['Fresh eggs collected', 'Total eggs mixed', 'Total embryos created']
type_of_cycle = TypeOfCycleAppender(columns_to_update=columns_to_update)
type_of_cycle.fit(TrainSet_cleaned)

TrainSet_cleaned, TestSet_cleaned = type_of_cycle.transform(TrainSet_cleaned), type_of_cycle.transform(TestSet_cleaned)

### Check cleaning results

In [None]:
print (f"Number of empty entries followed by the unique values and data type at each column:\n")

for column in TrainSet_cleaned.columns:
    # Check how many empty fields there are in each column
    empty_fields_count = TrainSet_cleaned[column].isnull().sum()
    # Check unique values there are in each column
    unique_values = TrainSet_cleaned[column].unique()
    # Check data type of each column
    data_type = TrainSet_cleaned[column].dtype
    
    print (f"- {column}: {empty_fields_count}, {unique_values}, {data_type}\n")

#### Drop all rows with placeholder values of 999

In [None]:
class DropRowsWith999(BaseEstimator, TransformerMixin):
    """
    Custom transformer to drop rows with the value "999" in any column.
    """
    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        
        # Drop rows where any column has the value "999"
        X_filtered = X[(X != "999").all(axis=1)]
        
        return X_filtered
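
The comparison above matches only the string "999". The sketch below, on invented data, shows how an integer 999 would slip through and one possible way to catch both forms by comparing string representations. Whether integer placeholders still exist at this stage is not shown in this section, so the second column is purely hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "Date of embryo transfer": ["999", "5", "3"],  # placeholder stored as text
    "Total embryos created": [4, 999, 2],          # hypothetical integer placeholder
})

# Same test as the transformer: only the textual "999" is detected.
string_only = toy[(toy != "999").all(axis=1)]

# Comparing string representations catches both "999" and 999.
is_placeholder = toy.astype(str).eq("999")
both_types = toy[~is_placeholder.any(axis=1)]

print(len(string_only), len(both_types))
```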

Apply the drop_999 to the selected variables in the TrainSet and check cleaning effect

In [None]:
print(f"Original row count: {TrainSet_cleaned.shape[0]}")
occurrences_before = TrainSet_cleaned.isin(['999']).sum().sum()
print(f"Total occurrences of '999' before cleaning: {occurrences_before}")

drop_999 = DropRowsWith999()
df_999_dropped = drop_999.fit_transform(TrainSet_cleaned)

print(f"Cleaned row count: {df_999_dropped.shape[0]}")
occurrences_after = df_999_dropped.isin(['999']).sum().sum()
print(f"Total occurrences of '999' after cleaning: {occurrences_after}")

df_999_dropped


Apply the transformation to Train and TestSet

In [None]:

drop_999 = DropRowsWith999()

drop_999.fit(TrainSet_cleaned)

TrainSet_cleaned, TestSet_cleaned = drop_999.transform(TrainSet_cleaned), drop_999.transform(TestSet_cleaned)

In [None]:
print (f"Number of empty entries followed by the unique values and data type at each column:\n")

for column in TrainSet_cleaned.columns:
    # Check how many empty fields there are in each column
    empty_fields_count = TrainSet_cleaned[column].isnull().sum()
    # Check unique values there are in each column
    unique_values = TrainSet_cleaned[column].unique()
    # Check data type of each column
    data_type = TrainSet_cleaned[column].dtype
    
    print (f"- {column}: {empty_fields_count}, {unique_values}, {data_type}\n")

#### Check variables with missing data.

In [None]:
EvaluateMissingData(TrainSet_cleaned)

#### Drop all remaining rows with missing data

In [None]:
from feature_engine.imputation import DropMissingData

drop_missing_data = DropMissingData()
df_missing_data_dropped = drop_missing_data.fit_transform(TrainSet_cleaned)
df_missing_data_dropped

Apply the transformation to the Train and TestSet

In [None]:

drop_missing_data = DropMissingData()

drop_missing_data.fit(TrainSet_cleaned)

TrainSet_cleaned, TestSet_cleaned = drop_missing_data.transform(TrainSet_cleaned), drop_missing_data.transform(TestSet_cleaned)
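
A quick sanity check after this kind of step is to confirm that each set was cleaned from its own rows and that neither contains missing values. The sketch below uses plain `DataFrame.dropna` as a stand-in for the imputer and invented toy frames, so it illustrates the check rather than the notebook's exact call.

```python
import numpy as np
import pandas as pd

# Toy train and test sets with different sizes (values invented).
train = pd.DataFrame({"a": [1.0, np.nan, 3.0, 4.0], "b": ["x", "y", None, "z"]})
test = pd.DataFrame({"a": [5.0, np.nan], "b": ["u", "v"]})

# pandas stand-in for the drop step: remove rows with any missing value.
train_clean = train.dropna()
test_clean = test.dropna()

# Each set should be free of missing values and keep its own row count.
train_missing = int(train_clean.isna().sum().sum())
test_missing = int(test_clean.isna().sum().sum())
print(train_clean.shape, test_clean.shape)
```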

Check that no variables have missing data anymore.

In [None]:
EvaluateMissingData(TrainSet_cleaned)

## Concatenate the cleaned TrainSet and TestSet to create df_cleaned

In [None]:
df_cleaned = pd.concat([TrainSet_cleaned, TestSet_cleaned])
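
By default `pd.concat` keeps the original row labels of both frames. The sketch below, on invented frames, contrasts that with `ignore_index=True`, which renumbers the rows. Since the CSV export here uses `index=False`, the difference only matters if `df_cleaned` is used in memory before saving.

```python
import pandas as pd

# Toy frames whose indices interleave, as after a train/test split.
train = pd.DataFrame({"x": [1, 2]}, index=[0, 2])
test = pd.DataFrame({"x": [3]}, index=[1])

# Default: original labels are preserved in concatenation order.
kept = pd.concat([train, test])

# ignore_index=True assigns a fresh 0..n-1 index.
renumbered = pd.concat([train, test], ignore_index=True)

print(kept.index.tolist(), renumbered.index.tolist())
```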

# Push cleaned data to Repo

In [None]:
import os

# create outputs/datasets/cleaned folder
try:
  os.makedirs(name='outputs/datasets/cleaned',  exist_ok=True)
except Exception as e:
  print(e)


## Train Set

In [None]:
TrainSet_cleaned.to_csv("outputs/datasets/cleaned/TrainSetCleaned.csv", index=False)

## Test Set

In [None]:
TestSet_cleaned.to_csv("outputs/datasets/cleaned/TestSetCleaned.csv", index=False)

## Cleaned df

In [None]:
df_cleaned.to_csv("outputs/datasets/cleaned/FertilityTreatmentDataCleaned.csv", index=False)

---