# **Data Collection Notebook**

## Objectives

* Download the available anonymized data (2017-2018) from the Human Fertilisation and Embryology Authority (HFEA) website: [HFEA Data & Research](https://www.hfea.gov.uk/about-us/data-research/).
* Save it as raw data under outputs/datasets/collection.

## Inputs

* Data file (.xlsx)

## Outputs

* Generate compressed Dataset:
  
   - outputs/datasets/collection/FertilityTreatmentData.csv.gz

## Additional Comments

* The HFEA website does not provide an API for data retrieval. Therefore, the files were fetched from the download URL on the website.
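
Fetching a file from a direct download link can be sketched with the standard library alone. This is a minimal illustration, not the method the notebook uses further down (which passes the link straight to `pd.read_excel`). `download_file` is a hypothetical helper, and the demonstration uses a local `file://` URL as a stand-in so it runs offline; whether the real site accepts requests from `urllib` is an assumption that would need checking.

```python
import shutil
import tempfile
import urllib.request
from pathlib import Path


def download_file(url, destination):
    """Stream the resource at url to destination and return the destination path."""
    destination = Path(destination)
    # Create the target folder if it is missing.
    destination.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response, open(destination, "wb") as out_file:
        shutil.copyfileobj(response, out_file)
    return destination


# Offline demonstration: a file:// URL stands in for the real download link.
with tempfile.TemporaryDirectory() as tmp_dir:
    source = Path(tmp_dir) / "source.bin"
    source.write_bytes(b"example bytes")
    saved = download_file(source.as_uri(), Path(tmp_dir) / "raw" / "copy.bin")
    copied = saved.read_bytes()

print(copied)
```

Keeping an untouched copy of the downloaded file alongside the processed CSV makes it possible to re-run later steps without contacting the website again.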

---

# Change working directory

Change the working directory from its current folder to its parent folder.
* Access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

To make the parent of the current directory the new current directory:
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("A new current directory has been set")

Confirm the new current directory.

In [None]:
current_dir = os.getcwd()
current_dir
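
Re-running the `os.chdir` cell moves up one more level each time, which is easy to do by accident in a notebook. A minimal sketch of a guard follows, assuming the notebook lives in a folder named `jupyter_notebooks` (a hypothetical name; substitute the actual folder). `resolve_project_root` is an illustrative helper, not part of the notebook.

```python
import os


def resolve_project_root(current_dir, notebook_folder="jupyter_notebooks"):
    """Return the parent of current_dir only when current_dir is the notebook folder.

    notebook_folder is an assumed name used for illustration.
    """
    normalized = os.path.normpath(current_dir)
    if os.path.basename(normalized) == notebook_folder:
        return os.path.dirname(normalized)
    # Already at (or outside) the project root: leave the path unchanged.
    return current_dir


# Example with illustrative paths (nothing is changed on disk here).
project_root = os.path.join(os.sep, "workspace", "project")
print(resolve_project_root(os.path.join(project_root, "jupyter_notebooks")))
print(resolve_project_root(project_root))
```

Passing the result to `os.chdir()` makes the cell safe to run more than once.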

---

# Fetch data

## Import Libraries

Import `pandas` for data manipulation.

In [None]:
import pandas as pd

---

## Read xlsx File

In [None]:
def read_fertility_data():
    try:
        # Attempt to read from the website
        df = pd.read_excel('https://www.hfea.gov.uk/media/3469/ar-2017-2018.xlsx', sheet_name='Anonymised register')
        print("Data successfully read from the website.")
    except Exception as e:
        # If there is an error, read from the local file
        print(f"Failed to read from the website. Error: {e}")
        df = pd.read_csv('outputs/datasets/collection/FertilityTreatmentData.csv.gz', compression='gzip')
        print("Data successfully read from the local file.")
    
    return df


df = read_fertility_data()

In [None]:
df.info()
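
`read_fertility_data()` tries the remote workbook first and falls back to the cached CSV if anything goes wrong. The same pattern can be sketched independently of pandas. `load_with_fallback` and the two loader functions below are hypothetical names for illustration, not part of the notebook.

```python
def load_with_fallback(primary, fallback):
    """Call primary(); if it raises, report the error and call fallback().

    Both arguments are zero-argument callables (illustrative names).
    Returns a (result, source) tuple so the caller knows which path ran.
    """
    try:
        return primary(), "primary"
    except Exception as error:  # broad on purpose: network, HTTP, and parse errors
        print(f"Primary source failed: {error}")
        return fallback(), "fallback"


def failing_remote():
    # Stand-in for a remote read that cannot be reached.
    raise ConnectionError("remote source unavailable")


def local_cache():
    # Stand-in for reading the cached local copy.
    return ["cached", "rows"]


data, source = load_with_fallback(failing_remote, local_cache)
print(source, data)
```

In the notebook, `primary` would wrap the `pd.read_excel` call and `fallback` the `pd.read_csv` call. Returning the source label makes it explicit when the analysis is running on the cached copy rather than a fresh download.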

---

# Push file to Repo

In [None]:
import os

# Create output directory if it doesn't exist
os.makedirs(name='outputs/datasets/collection', exist_ok=True)

# Save the DataFrame as a compressed CSV file using gzip compression
df.to_csv("outputs/datasets/collection/FertilityTreatmentData.csv.gz", index=False, compression='gzip')
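
A quick way to check that a compressed file was written correctly is to read back its header row. The sketch below uses only the standard library and a small temporary file with made-up column names, so it does not touch the real dataset. `read_header` is a hypothetical helper.

```python
import csv
import gzip
import os
import tempfile


def read_header(path):
    """Return the first row (column names) of a gzip-compressed CSV file."""
    with gzip.open(path, mode="rt", newline="", encoding="utf-8") as handle:
        return next(csv.reader(handle))


# Demonstrate on a throwaway file with illustrative columns.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.csv.gz")
    with gzip.open(sample_path, mode="wt", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["column_a", "column_b"])
        writer.writerow(["1", "2"])
    header = read_header(sample_path)

print(header)
```

For the notebook's output, the equivalent call would be `read_header("outputs/datasets/collection/FertilityTreatmentData.csv.gz")`, which reads only the first line rather than loading the full file.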