# **IVF Case Study Notebook**

## Objectives

*   Answer business requirement 1: 
    - The client is interested in understanding the factors that impact IVF treatment success and identifying the most relevant variables correlated with a successful outcome.

## Inputs

* outputs/datasets/cleaned/FertilityTreatmentDataCleaned.csv

## Outputs

* Generate code that answers business requirement 1 and can be used to build the Streamlit app

---

# Change working directory

Change the working directory from its current folder to its parent folder
* Access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

To make the parent of the current directory the new current directory:
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("A new current directory has been set")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
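Re-running the `os.chdir(os.path.dirname(...))` cell moves one folder further up each time. As a minimal sketch of a more repeatable alternative, the hypothetical helper below (`move_to_project_root` is not part of the original notebook) walks upwards until it finds a folder containing a marker directory. It assumes the project root is the folder that holds `outputs`.

```python
import os
from pathlib import Path


def move_to_project_root(marker="outputs", start=None):
    """Hypothetical helper: chdir to the nearest ancestor folder containing `marker`."""
    here = Path(start or os.getcwd()).resolve()
    # Check the starting folder first, then each parent in turn
    for candidate in [here, *here.parents]:
        if (candidate / marker).exists():
            os.chdir(candidate)
            return candidate
    raise FileNotFoundError(f"No folder at or above {here} contains '{marker}'")


# Example call (assumes an 'outputs' folder exists at the project root):
# move_to_project_root()
```

Calling the helper twice leaves the working directory unchanged the second time, because the root already contains the marker.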

---

# Load data

In [None]:
import pandas as pd
# Read the cleaned dataset from the CSV file into a DataFrame
df = pd.read_csv('outputs/datasets/cleaned/FertilityTreatmentDataCleaned.csv')
df.head(3)

Investigate data

In [None]:
from ydata_profiling import ProfileReport

profile = ProfileReport(df=df, minimal=True)
profile.to_notebook_iframe()

## Correlation Study

In [None]:
df.info()

In [None]:
print("Number of empty entries, followed by the unique values and data type of each column:\n")

for column in df.columns:
    # Count the empty (null) fields in the column
    empty_fields_count = df[column].isnull().sum()
    # Collect the unique values in the column
    unique_values = df[column].unique()
    # Get the data type of the column
    data_type = df[column].dtype

    print(f"- {column}: {empty_fields_count}, {unique_values}, {data_type}\n")


In [None]:
from feature_engine.encoding import OneHotEncoder
encoder = OneHotEncoder(variables=df.columns[df.dtypes=='object'].to_list(), drop_last=False)
df_ohe = encoder.fit_transform(df)
print(df_ohe.shape)
df_ohe.head(3)
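The encoded column names used later in this notebook, such as 'Embryos transferred_1e', follow the pattern `<variable>_<category>`. The sketch below illustrates that naming on a toy frame. It swaps in `pd.get_dummies` in place of feature_engine's `OneHotEncoder` to keep the example dependency-light, and the toy values are made up for illustration.

```python
import pandas as pd

# Toy data, not taken from the dataset
toy = pd.DataFrame({
    "Embryos transferred": ["0", "1e", "2", "1e"],
    "Live birth occurrence": [0, 1, 0, 1],
})

# One 0/1 column per category, named '<variable>_<category>'
toy_ohe = pd.get_dummies(toy, columns=["Embryos transferred"], dtype=int)
print(toy_ohe.columns.tolist())
print(toy_ohe)
```

Each row has exactly one of the new columns set to 1, which is why a single category can be correlated with the target on its own.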

`.corr()` was run with the `spearman` and `pearson` methods, and the top 20 correlations for each were investigated.

* The command returns a pandas Series. After sorting, its first item is the correlation of 'Live birth occurrence' with itself, which is 1, so it was excluded by applying `[1:]`.
  
* Values were sorted by absolute value, by setting `key=abs`.

In [None]:
corr_spearman = df_ohe.corr(method='spearman')['Live birth occurrence'].sort_values(key=abs, ascending=False)[1:].head(20)
corr_spearman

The same for the `pearson` method

In [None]:
corr_pearson = df_ohe.corr(method='pearson')['Live birth occurrence'].sort_values(key=abs, ascending=False)[1:].head(20)
corr_pearson
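To make the ranking step concrete, here is a small sketch on made-up data (the column names `target`, `helps`, `hurts` and `noise` are hypothetical). It shows that sorting with `key=abs` keeps strong negative correlations at the top while preserving their sign. It also drops the target by label instead of using `[1:]`, a possible alternative that does not rely on the target sorting first.

```python
import pandas as pd

# Toy binary data, not taken from the dataset
toy = pd.DataFrame({
    "target": [1, 1, 1, 1, 0, 0, 0, 0],
    "helps":  [1, 1, 0, 0, 1, 0, 0, 0],
    "hurts":  [0, 0, 0, 0, 1, 1, 1, 0],
    "noise":  [1, 0, 1, 0, 1, 0, 1, 0],
})

corr = (
    toy.corr(method="pearson")["target"]
    .drop("target")                          # remove the self-correlation by label
    .sort_values(key=abs, ascending=False)   # rank by strength, keep the sign
)
print(corr)
```

The negatively correlated column ranks first because its absolute value is largest, yet its printed value stays negative.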

## Correlation analysis results:

For both correlation methods, we notice weak or very weak levels of correlation between 'Live birth occurrence' and any given variable.

Since 'Date of embryo transfer_NT', 'Embryos transferred_0', 'Total embryos created_0', 'Total eggs mixed_0' and 'Fresh eggs collected_0' represent treatments that failed before embryo transfer, these variables are excluded from the analysis: their negative correlation with success is an obvious consequence of the treatment not reaching transfer.
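A minimal sketch of how the exclusion described above could be applied to a correlation Series. The helper name `drop_pre_transfer_failures` is hypothetical, and the correlation values in the toy Series are illustrative only, not results from the dataset.

```python
import pandas as pd

# Encoded columns named in the text as marking treatments that failed before transfer
PRE_TRANSFER_FAILURE_COLS = [
    "Date of embryo transfer_NT",
    "Embryos transferred_0",
    "Total embryos created_0",
    "Total eggs mixed_0",
    "Fresh eggs collected_0",
]


def drop_pre_transfer_failures(corr_series):
    """Hypothetical helper: remove the pre-transfer failure columns from a correlation Series."""
    # errors="ignore" skips labels that are not present in the Series
    return corr_series.drop(labels=PRE_TRANSFER_FAILURE_COLS, errors="ignore")


# Illustrative values only, not results from the dataset
toy_corr = pd.Series({
    "Embryos transferred_0": -0.30,
    "Date of embryo transfer_5 - fresh": 0.12,
    "Total eggs mixed_0": -0.25,
})
filtered = drop_pre_transfer_failures(toy_corr)
print(filtered)
```

In the notebook, the same call could be applied to `corr_spearman` or `corr_pearson` before reading off the remaining predictors.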

**Predictors that might offer valuable insights into treatment success:**

- Date of embryo transfer_5 - fresh:

    This suggests that embryo transfers on day 5 of fresh cycles have some association with higher success rates.

- Embryos transferred_1e:

    This suggests that transferring a single, electively selected embryo has some association with higher success rates.

- Elective single embryo transfer:

    Using elective single embryo transfer has some positive association with success rates.

- Patient/Egg provider (different age ranges):

    Age 18-34 positively correlates with success.
    Age 40-42 and Age 43-44 negatively correlate, reflecting decreased success rates with increasing age.

- Total embryos created_6-10:

    This positive correlation suggests that creating more embryos within this range might be associated with higher success rates.

- Fresh eggs collected_1-5 and Total eggs mixed_1-5:

    These variables show a slight negative correlation, indicating that collecting or mixing fewer eggs might be marginally associated with lower success rates.

- Partner/Sperm provider age_18-34:

    As with the Patient/Egg provider age, a Partner/Sperm provider age in the 18-34 range seems to have a somewhat positive impact on treatment success.

The variables 'Patient age at treatment' and 'Partner age' show effects similar to 'Patient/Egg provider age' and 'Partner/Sperm provider age'. This is likely because, in the large majority of treatments in this dataset, the patient is the egg source and the partner is the sperm source. Therefore, only the Patient/Egg provider and Partner/Sperm provider ages will be considered in the analysis.
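One way to check the overlap described above is to compare the two age columns directly. The sketch below uses made-up rows and assumes both columns hold the same age-band labels. In the notebook the same two lines could be pointed at `df` instead of `toy`.

```python
import pandas as pd

# Toy rows, not taken from the dataset
toy = pd.DataFrame({
    "Patient age at treatment": ["18-34", "40-42", "43-44", "18-34", "43-44"],
    "Patient/Egg provider age": ["18-34", "40-42", "43-44", "18-34", "18-34"],
})

# Share of treatments where both columns report the same age band
agreement = (toy["Patient age at treatment"] == toy["Patient/Egg provider age"]).mean()

# Cross-tabulation: off-diagonal counts are treatments where the bands differ
table = pd.crosstab(toy["Patient age at treatment"], toy["Patient/Egg provider age"])

print(f"Share of matching rows: {agreement:.2f}")
print(table)
```

A share close to 1 would support keeping only one column of each pair.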



Based on the correlation results, the following sections investigate whether successful IVF treatment outcomes typically:

* had the embryo transferred on day 5.
* had only one electively selected embryo transferred.
* happened when the Patient/Egg provider was in the 18-34 age range.
* had more than 5 fresh eggs collected from the patient/egg donor.
* had more than 5 eggs mixed with sperm.
* had a range of 6-10 embryos created.
* happened when the Partner/Sperm provider was in the 18-34 age range.

In [None]:
vars_to_study = ['Date of embryo transfer', 'Elective single embryo transfer', 'Embryos transferred', 'Fresh eggs collected', 'Total eggs mixed', 'Total embryos created', 'Patient/Egg provider age', 'Partner/Sperm provider age']
vars_to_study
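The hypotheses listed earlier in this section can also be checked numerically, by computing the live birth rate within each category of a variable. The sketch below is one possible way to do that. The helper `success_rate_by_category` is hypothetical, it assumes 'Live birth occurrence' is coded as 0/1, and the toy rows are made up.

```python
import pandas as pd


def success_rate_by_category(df, col, target="Live birth occurrence"):
    """Hypothetical helper: live birth rate and treatment count per category of `col`."""
    return (
        df.groupby(col)[target]
        .agg(rate="mean", treatments="size")  # the mean of a 0/1 target is the success rate
        .sort_values("rate", ascending=False)
    )


# Toy rows, not taken from the dataset
toy = pd.DataFrame({
    "Patient/Egg provider age": ["18-34", "18-34", "18-34", "40-42", "40-42", "43-44"],
    "Live birth occurrence":    [1, 1, 0, 1, 0, 0],
})
rates = success_rate_by_category(toy, "Patient/Egg provider age")
print(rates)
```

Reporting the treatment count next to the rate helps avoid over-reading a high rate that comes from only a handful of treatments.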

## EDA on selected variables

In [None]:
df_eda = df.filter(vars_to_study + ['Live birth occurrence'])
df_eda.head(3)

### Variables Distribution by Live birth occurrence

Plot the distribution (numerical and categorical) coloured by Live birth occurrence

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')


def plot_categorical(df, col, target_var):

    plt.figure(figsize=(12, 5))
    sns.countplot(data=df, x=col, hue=target_var, order=df[col].value_counts().index)
    plt.xticks(rotation=90)
    plt.title(f"{col}", fontsize=20, y=1.05)
    plt.show()


def plot_numerical(df, col, target_var):
    plt.figure(figsize=(8, 5))
    sns.histplot(data=df, x=col, hue=target_var, kde=True, element="step")
    plt.xticks(rotation=90)
    plt.title(f"{col}", fontsize=20, y=1.05)
    plt.show()


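target_var = 'Live birth occurrence'
for col in vars_to_study:
    # Text columns read from a CSV are not of dtype 'category', so test for
    # numeric dtypes and treat everything else as categorical
    if pd.api.types.is_numeric_dtype(df_eda[col]):
        plot_numerical(df_eda, col, target_var)
    else:
        plot_categorical(df_eda, col, target_var)
    print("\n\n")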


---

## Parallel Plot

In [None]:
import plotly.express as px

# Convert the categorical column to a numeric type
df_eda['Live birth occurrence'] = df_eda['Live birth occurrence'].astype('category').cat.codes

# Create the parallel categories plot
fig = px.parallel_categories(df_eda, color="Live birth occurrence")

# Update layout to adjust size, font size and margins
fig.update_layout(
    font=dict(size=8),
    margin=dict(l=50, r=50, t=50, b=50),
    width=1000, height=600
)

fig.show(renderer='jupyterlab')


---

## Conclusions

The correlation results and the interpretation of the plots converge.

Successful IVF treatment outcomes typically:

* had either the embryo transferred on day 5 of a fresh cycle, or came from a frozen cycle with the embryo transferred on the day it was thawed (day 0).

* had either one electively selected embryo transferred, or 2 embryos transferred without elective selection.

* had more than 5 fresh eggs collected from the patient/egg donor, or came from a frozen cycle.

* had more than 5 eggs mixed with sperm.

* had a range of 6-10 embryos created.

* happened when the Patient/Egg provider was in the 18-34 age range.

* happened when the Partner/Sperm provider was in the 18-34 age range.

---