# **IVF Case Study Notebook**

## Objectives

*   Answer business requirement 1: 
    - The client is interested in understanding the factors that impact IVF treatment success and identifying the most relevant variables correlated with a successful outcome.

## Inputs

* outputs/datasets/cleaned/FertilityTreatmentDataCleaned.csv

## Outputs

* Generate code that answers business requirement 1 and can be used to build the Streamlit app

---

# Change working directory

Change the working directory from its current folder to its parent folder
* Access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

To make the parent of the current directory the new current directory:
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("A new current directory has been set")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir

---

# Load data

In [None]:
import pandas as pd
# Read the cleaned dataset from the CSV file into a DataFrame
df = pd.read_csv('outputs/datasets/cleaned/FertilityTreatmentDataCleaned.csv')
df.head(3)

Investigate data

In [None]:
from ydata_profiling import ProfileReport

profile = ProfileReport(df=df, minimal=True)
profile.to_notebook_iframe()

## Correlation Study

In [None]:
df.info()

In [None]:
print("Number of empty entries, followed by the unique values and data type of each column:\n")

for column in df.columns:
    # Count the empty fields in each column
    empty_fields_count = df[column].isnull().sum()
    # List the unique values in each column
    unique_values = df[column].unique()
    # Get the data type of each column
    data_type = df[column].dtype

    print(f"- {column}: {empty_fields_count}, {unique_values}, {data_type}\n")


In [None]:
from feature_engine.encoding import OneHotEncoder
encoder = OneHotEncoder(variables=df.columns[df.dtypes=='object'].to_list(), drop_last=False)
df_ohe = encoder.fit_transform(df)
print(df_ohe.shape)
df_ohe.head(3)

`.corr()` for `spearman` and `pearson` methods was used, and the top 20 correlations were investigated.

* As this command returns a pandas series and the first item is the correlation between 'Live birth occurrence' and 'Live birth occurrence', which happens to be 1, it was excluded by applying `[1:]`
  
* Values were sorted considering the absolute value, by setting `key=abs`
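
The two points above can be sketched on toy data. This is a minimal illustration, not the notebook's actual result: the `feature_*` labels and their values are placeholders. It also shows a hedged alternative to the positional `[1:]` slice, dropping the target by label, which does not depend on the self-correlation sorting first.

```python
import pandas as pd

# Toy correlation values; every label except the target is a placeholder.
corr = pd.Series(
    {
        "Live birth occurrence": 1.0,
        "feature_a": -0.23,
        "feature_b": 0.16,
        "feature_c": 0.05,
    }
)

# Sort by magnitude so strong negative values rank alongside positive ones.
ranked = corr.sort_values(key=abs, ascending=False)

# Remove the target's self-correlation by label rather than by position.
top = ranked.drop("Live birth occurrence").head(2)
print(top)
```

With `key=abs`, `feature_a` ranks above `feature_b` even though its value is negative.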

In [None]:
corr_spearman = df_ohe.corr(method='spearman')['Live birth occurrence'].sort_values(key=abs, ascending=False)[1:].head(20)
corr_spearman

The same for the `pearson` method:

In [None]:
corr_pearson = df_ohe.corr(method='pearson')['Live birth occurrence'].sort_values(key=abs, ascending=False)[1:].head(20)
corr_pearson

## Correlation analysis results:

For both correlation methods, we notice **very weak levels of correlation** between 'Live birth occurrence' and any given variable.
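
One likely reason the two methods agree so closely: when both columns hold only 0/1 values, as a one-hot-encoded category and a binary target do, Spearman's rank correlation equals Pearson's, because ranking two-valued data only rescales it. The sketch below assumes the target is coded 0/1 and uses a made-up toy frame with hypothetical column names.

```python
import pandas as pd

# Toy binary columns; the names and values are illustrative only.
toy = pd.DataFrame(
    {
        "outcome": [1, 0, 0, 1, 0, 1, 0, 0],
        "flag": [1, 0, 1, 1, 0, 0, 0, 0],
    }
)

pearson = toy.corr(method="pearson").loc["flag", "outcome"]
spearman = toy.corr(method="spearman").loc["flag", "outcome"]

# For a pair of 0/1 columns the two coefficients coincide.
print(round(pearson, 4), round(spearman, 4))
```

Any numeric columns in the encoded frame that are not binary can still yield different Pearson and Spearman values.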

The strongest negative correlation is -0.23 for "Embryos transferred_0", meaning no embryos were transferred. The strongest positive correlation is 0.16 for "Date of embryo transfer_5 - fresh", meaning the embryo was transferred on the 5th day from the beginning of the procedure in a fresh cycle (as opposed to a frozen cycle, where the embryo is collected and frozen prior to the transfer procedure).

Since 'Date of embryo transfer_NT', 'Embryos transferred_0', 'Total embryos created_0', 'Total eggs mixed_0' and 'Fresh eggs collected_0' represent treatments that failed prior to embryo transfer, these variables will be excluded from the analysis: their negative association with treatment success is trivial.
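
A minimal sketch of how these labels could be removed from a correlation series before ranking. The series is toy data: the values -0.23 and 0.16 come from the text above, and the other two values are illustrative placeholders.

```python
import pandas as pd

# Labels named in the text as treatments that ended before embryo transfer.
pre_transfer_failures = [
    "Date of embryo transfer_NT",
    "Embryos transferred_0",
    "Total embryos created_0",
    "Total eggs mixed_0",
    "Fresh eggs collected_0",
]

# Toy correlation series; values not quoted in the text are placeholders.
corr = pd.Series(
    {
        "Embryos transferred_0": -0.23,
        "Date of embryo transfer_5 - fresh": 0.16,
        "Date of embryo transfer_NT": -0.20,
        "Patient/Egg provider age_18-34": 0.10,
    }
)

# errors="ignore" skips labels that are absent from the series.
filtered = corr.drop(labels=pre_transfer_failures, errors="ignore")
print(filtered)
```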

**Predictors that might offer valuable insights into treatment success:**

- Date of embryo transfer_5 - fresh:
    - This suggests that embryo transfers on day 5 of fresh cycles have some association with higher success rates.

- Embryos transferred_1e:
    - This suggests that transferring a single, electively selected embryo has some association with higher success rates.

- Elective single embryo transfer:
    - Using elective single embryo transfer shows some positive association with success rates.

- Patient/Egg provider (different age ranges):
    - Age 18-34 positively correlates with success.
    - Age 40-42 and Age 43-44 negatively correlate, reflecting decreased success rates with increasing age.

- Total embryos created_6-10:
    - This positive correlation suggests that creating more embryos within this range might be associated with higher success rates.

- Fresh eggs collected_1-5 and Total eggs mixed_1-5:
    - These variables show a slight negative correlation, indicating that collecting or mixing fewer eggs might have a marginal impact on success.

- Partner/Sperm provider age_18-34:
    - As with the Patient/Egg provider age, a Partner/Sperm provider age in the range 18-34 seems to have a somewhat positive association with treatment success.

The variables Patient age at treatment and Partner age have similar effects to Patient/Egg provider age and Partner/Sperm provider age. This is likely because, in the large majority of treatments in this dataset, the patient is the egg source and the partner is the sperm source. Therefore only the Patient/Egg provider and Partner/Sperm provider ages will be considered in the analysis.
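
The overlap between two such age columns can be checked directly. The sketch below is a hedged illustration on toy data: the column name "Patient age at treatment" and the age bands shown are assumptions about how the dataset labels them.

```python
import pandas as pd

# Toy rows; column names and age bands are assumed, not read from the dataset.
toy = pd.DataFrame(
    {
        "Patient age at treatment": ["18-34", "35-37", "40-42", "18-34", "38-39"],
        "Patient/Egg provider age": ["18-34", "35-37", "18-34", "18-34", "38-39"],
    }
)

# Share of rows where the patient and the egg provider fall in the same band.
match_rate = (
    toy["Patient age at treatment"] == toy["Patient/Egg provider age"]
).mean()
print(f"Rows where both age bands agree: {match_rate:.0%}")
```

A match rate close to 100% on the real data would support keeping only one column of each pair.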



Based on the correlation analysis results and the hypotheses to be validated, the investigation will focus on whether successful IVF treatment outcomes are typically associated with:

- Embryo transfer occurring on day 5.
- Elective single embryo transfer (eSET).
- Patient/Egg provider being aged 34 or younger.
- Collection of more than 5 fresh eggs from the patient/egg donor.
- Mixing of more than 5 eggs with sperm.
- Creation of 6-10 embryos.
- Partner/Sperm provider being aged 34 or younger.
- Absence of endometriosis in the patient.
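
One simple numeric check for hypotheses of this kind is to compare the live birth rate across the categories of a variable. The sketch below assumes the target is coded 0/1 and uses made-up rows, so the rates it prints are not findings.

```python
import pandas as pd

# Toy rows for illustration only; the target is assumed to be coded 0/1.
toy = pd.DataFrame(
    {
        "Patient/Egg provider age": [
            "18-34", "18-34", "40-42", "40-42", "18-34", "43-44",
        ],
        "Live birth occurrence": [1, 0, 0, 0, 1, 0],
    }
)

# With a 0/1 target, the group mean is the share of treatments
# in that category that ended in a live birth.
rate_by_group = toy.groupby("Patient/Egg provider age")[
    "Live birth occurrence"
].mean()
print(rate_by_group)
```

The same `groupby(...).mean()` pattern applies to any of the categorical variables listed above.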

In [None]:
vars_to_study = ['Date of embryo transfer', 'Elective single embryo transfer', 'Embryos transferred', 'Fresh eggs collected', 'Total eggs mixed', 'Total embryos created', 'Patient/Egg provider age', 'Partner/Sperm provider age', 'Causes of infertility - endometriosis']
vars_to_study

## Exploratory Data Analysis (EDA) on selected variables

In [None]:
df_eda = df.filter(vars_to_study + ['Live birth occurrence'])
df_eda.head(3)

### Variables Distribution by Live birth occurrence

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import seaborn as sns
import numpy as np

sns.set_style("whitegrid")


def plot_count_distribution(df, col, target_var):
    plt.figure(figsize=(12, 5))

    # Define custom ordering for specific columns with ranges
    if col == "Date of embryo transfer":
        order = [
            "0 - fresh",
            "1 - fresh",
            "2 - fresh",
            "3 - fresh",
            "4 - fresh",
            "5 - fresh",
            "6 - fresh",
            "7 - fresh",
            "0 - frozen",
            "1 - frozen",
            "2 - frozen",
            "3 - frozen",
            "4 - frozen",
            "5 - frozen",
            "6 - frozen",
            "7 - frozen",
            "2 - Mixed fresh/frozen",
            "3 - Mixed fresh/frozen",
            "5 - Mixed fresh/frozen",
            "6 - Mixed fresh/frozen",
            "Missing",
            "NT",
        ]
    elif col in ["Fresh eggs collected", "Total eggs mixed"]:
        order = [
            "0",
            "1-5",
            "6-10",
            "11-15",
            "16-20",
            "21-25",
            "26-30",
            "31-35",
            "36-40",
            ">40",
            "0 - frozen cycle",
        ]
    elif col == "Total embryos created":
        order = [
            "0",
            "1-5",
            "6-10",
            "11-15",
            "16-20",
            "21-25",
            "26-30",
            ">30",
            "0 - frozen cycle",
        ]
    else:
        order = sorted(
            # Sort other categorical columns in ascending order as strings
            df[col].unique(), key=str
        )  

    sns.countplot(data=df, x=col, hue=target_var, order=order)
    plt.xticks(rotation=90)
    plt.title(f"{col}", fontsize=20, y=1.05)
    plt.show()


def convert_to_string(df, columns):
    for col in columns:
        # Convert to string to avoid mixed type issues
        df[col] = df[col].astype(str)  


# Convert data before plotting to avoid mixed type issues
columns_to_convert = ["Total embryos created"]
convert_to_string(df_eda, columns_to_convert)


# Function to plot proportion distributions
def plot_proportion_distribution(df, col, target_var):
    plt.figure(figsize=(15, 6))

    # Define the order specifically for the variable of interest
    if col == "Date of embryo transfer":
        order = [
            "0 - fresh",
            "1 - fresh",
            "2 - fresh",
            "3 - fresh",
            "4 - fresh",
            "5 - fresh",
            "6 - fresh",
            "7 - fresh",
            "0 - frozen",
            "1 - frozen",
            "2 - frozen",
            "3 - frozen",
            "4 - frozen",
            "5 - frozen",
            "6 - frozen",
            "7 - frozen",
            "2 - Mixed fresh/frozen",
            "3 - Mixed fresh/frozen",
            "5 - Mixed fresh/frozen",
            "6 - Mixed fresh/frozen",
            "Missing",
            "NT",
        ]
    elif col in ["Fresh eggs collected", "Total eggs mixed"]:
        order = [
            "0",
            "1-5",
            "6-10",
            "11-15",
            "16-20",
            "21-25",
            "26-30",
            "31-35",
            "36-40",
            ">40",
            "0 - frozen cycle",
        ]
    elif col == "Total embryos created":
        order = [
            "0",
            "1-5",
            "6-10",
            "11-15",
            "16-20",
            "21-25",
            "26-30",
            ">30",
            "0 - frozen cycle",
        ]
    else:
        order = sorted(
            df[col].unique(), key=str
        )  # Sort other categorical columns in ascending order as strings

    # Filter order to match only existing categories
    unique_values = df[col].unique()
    order = [x for x in order if x in unique_values]

    # Calculate proportions
    df_prop = df.groupby([col, target_var]).size().reset_index(name="count")
    df_prop["proportion"] = df_prop.groupby(col)["count"].transform(
        lambda x: x / x.sum()
    )

    # Pivot the data to have proportions for each target variable as separate columns
    df_pivot = df_prop.pivot(index=col, columns=target_var, values="proportion").fillna(
        0
    )
    df_pivot = df_pivot.reindex(
        order, fill_value=0
    )  # Reorder according to the predefined order

    # Plot using Matplotlib to stack bars
    if not df_pivot.empty:
        plt.bar(
            df_pivot.index,
            df_pivot[0],
            label="Live birth occurrence 0",
            color="#3274a1",
        )
        if 1 in df_pivot.columns:
            plt.bar(
                df_pivot.index,
                df_pivot[1],
                bottom=df_pivot[0],
                label="Live birth occurrence 1",
                color="#e1812c",
            )

    # Format y-axis as percentages
    plt.gca().yaxis.set_major_formatter(mtick.PercentFormatter(1.0))
    plt.xticks(rotation=90)
    plt.title(f"Proportion Distribution of {col} by {target_var}", fontsize=20, y=1.05)
    plt.ylabel("Proportion")
    plt.xlabel(col)
    plt.legend(title=target_var, loc="upper right")

    # Add labels to the stacked bars
    for i in range(len(df_pivot)):
        if (
            0 in df_pivot.columns
            and np.isfinite(df_pivot.iloc[i, 0])
            and df_pivot.iloc[i, 0] > 0
        ):
            plt.text(
                i,
                df_pivot.iloc[i, 0] / 2,
                f"{df_pivot.iloc[i, 0]:.1%}",
                ha="center",
                va="center",
                color="white",
                fontsize=9,
            )
        if (
            1 in df_pivot.columns
            and np.isfinite(df_pivot.iloc[i, 1])
            and df_pivot.iloc[i, 1] > 0
        ):
            plt.text(
                i,
                df_pivot.iloc[i, 0] + df_pivot.iloc[i, 1] / 2,
                f"{df_pivot.iloc[i, 1]:.1%}",
                ha="center",
                va="center",
                color="black",
                fontsize=9,
            )

    plt.show()


# Choose color palette
palette = sns.color_palette("colorblind")


# Function to plot pie charts
def plot_pie_chart(df, col, target_var):
    # Filter the data to include only successful cases
    df_successful = df[df[target_var] == 1]

    # Aggregate counts within each category for successful cases
    df_pie = df_successful.groupby([col]).size().reset_index(name="count")

    # Calculate total count for percentage calculations
    total_count = df_pie["count"].sum()

    # Calculate the percentage for each slice
    df_pie["percentage"] = df_pie["count"] / total_count

    # Define threshold for displaying labels directly on the pie chart
    threshold = 0.05  # 5%

    # Creating labels for the legend with counts and percentages
    legend_labels = [
        f"{label}: {count} ({count/total_count:.1%})"
        for label, count in zip(df_pie[col], df_pie["count"])
    ]

    # Plot pie chart for successful cases
    plt.figure(figsize=(8, 8))
    wedges, texts, autotexts = plt.pie(
        df_pie["count"],
        startangle=90,
        colors=palette,
        labels=[
            # Display category name if above threshold
            label if pct > threshold else ""
            for label, pct in zip(df_pie[col], df_pie["percentage"])
        ],  
        autopct=lambda p: (
            # Show % only if > threshold
            f"{p:.1f}%" if p / 100 > threshold else ""
        ),  
    )

    # Adjust the display of labels on the pie chart
    for text in autotexts:
        text.set_color("black")

    # Adding the legend
    plt.legend(
        wedges, legend_labels, title=col, loc="center left", bbox_to_anchor=(1, 0.5)
    )
    plt.title(f"Distribution of Successful Cases for {col}")
    plt.show()


# Plotting the graphs
target_var = "Live birth occurrence"

for col in vars_to_study:
    print(f"Plotting count distribution for: {col}")
    plot_count_distribution(df_eda, col, target_var)
    print("\n\n")

    print(f"Plotting proportion distribution for: {col}")
    plot_proportion_distribution(df_eda, col, target_var)
    print("\n\n")

    print(f"Plotting pie charts for: {col}")
    plot_pie_chart(df_eda, col, target_var)
    print("\n\n")

---

## Parallel Plot

In [None]:
import plotly.express as px

# Convert the categorical column to a numeric type
df_eda['Live birth occurrence'] = df_eda['Live birth occurrence'].astype('category').cat.codes

# Create the parallel categories plot
fig = px.parallel_categories(df_eda, color="Live birth occurrence")

# Update layout to adjust size, font size and margins
fig.update_layout(
    font=dict(size=8),
    margin=dict(l=50, r=50, t=50, b=50),
    width=1000, height=600
)

fig.show(renderer='jupyterlab')


---

## Conclusions

The correlation analysis and the interpretation of the plots converge.

The findings from the correlation analysis and data visualization suggest key factors that are commonly associated with successful IVF treatment outcomes:

- Embryo transfers performed on day 5 of a fresh cycle or on day 0 of a frozen cycle (the day they were thawed) were more likely to result in success.

- Success was often observed when a single embryo was electively transferred or when two embryos were transferred without elective selection.

- Collecting more than 5 fresh eggs from the patient or egg donor, or utilizing eggs from a frozen cycle, was linked to higher success rates.

- Mixing more than 5 eggs with sperm was a common factor in successful outcomes.

- Successful treatments typically involved the creation of 6 to 10 embryos.

- Higher success rates were noted when the Patient/Egg provider was aged 34 or younger.

- Outcomes were more favorable when the Partner/Sperm provider was also aged 34 or younger.

---