# **Data Collection Notebook**

## Objectives

* Download the anonymized 2017-2018 register data from the Human Fertilisation and Embryology Authority (HFEA) website: [HFEA Data & Research](https://www.hfea.gov.uk/about-us/data-research/).
* Save it as raw data under outputs/datasets/collection.

## Inputs

* Data file in `.xlsx` format

## Outputs

* Generate compressed Dataset:
  
   - outputs/datasets/collection/FertilityTreatmentData.csv.gz

## Additional Comments

* The HFEA website does not provide an API for data retrieval. Therefore, the file is read directly from its download URL on the website.

---

# Change working directory

Change the working directory from the notebook's folder to its parent folder, so that relative paths resolve from the project root.
* Access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/Users/patriciahalley/Documents/Code_institute/git/ivf-success-predictor/jupyter_notebooks'

To make the parent of the current directory the new current directory:
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("A new current directory has been set")

A new current directory has been set


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/Users/patriciahalley/Documents/Code_institute/git/ivf-success-predictor'
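Running the `os.chdir(os.path.dirname(...))` cell a second time moves one level further up, out of the project folder. A minimal sketch of a guarded variant follows. It assumes the notebooks live in a folder named `jupyter_notebooks`, as in the path shown above; the helper name `move_to_project_root` is hypothetical.

```python
import os
from pathlib import Path

# Folder name taken from the path printed earlier in this notebook (an assumption
# about the project layout; adjust if the notebooks live elsewhere).
NOTEBOOK_DIR_NAME = "jupyter_notebooks"


def move_to_project_root(notebook_dir_name: str = NOTEBOOK_DIR_NAME) -> Path:
    """Move to the parent directory only when currently inside the notebooks folder."""
    cwd = Path.cwd()
    if cwd.name == notebook_dir_name:
        # Same effect as os.chdir(os.path.dirname(os.getcwd()))
        os.chdir(cwd.parent)
    # Return the directory that is now current, whether or not it changed
    return Path.cwd()
```

Calling `move_to_project_root()` repeatedly leaves the working directory at the project root after the first call, so re-running the cell is harmless.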

---

# Fetch data

## Import Libraries

Import `pandas` for data manipulation.

In [4]:
import pandas as pd

---

## Read xlsx File

In [5]:
df = pd.read_excel('https://www.hfea.gov.uk/media/3469/ar-2017-2018.xlsx', sheet_name='Anonymised register')
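Reading straight from the URL downloads the workbook again on every run. As a hedged alternative, the sketch below downloads the file once and reuses the local copy afterwards. The helper `download_if_missing` and the `inputs/` path are hypothetical and not part of the original notebook; `urllib.request.urlretrieve` is from the Python standard library.

```python
import os
import urllib.request


def download_if_missing(url: str, dest_path: str) -> str:
    """Download url to dest_path unless a file already exists there."""
    if not os.path.exists(dest_path):
        # Create the target folder first; "." covers a bare filename
        os.makedirs(os.path.dirname(dest_path) or ".", exist_ok=True)
        urllib.request.urlretrieve(url, dest_path)
    return dest_path


# Example usage (not executed here; the local path is a hypothetical choice):
# local_file = download_if_missing(
#     "https://www.hfea.gov.uk/media/3469/ar-2017-2018.xlsx",
#     "inputs/ar-2017-2018.xlsx",
# )
# df = pd.read_excel(local_file, sheet_name="Anonymised register")
```

The trade-off is that a cached copy will not pick up changes if the file behind the URL is ever replaced; deleting the local file forces a fresh download.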

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 169616 entries, 0 to 169615
Data columns (total 61 columns):
 #   Column                                             Non-Null Count   Dtype  
---  ------                                             --------------   -----  
 0   Patient age at treatment                           169616 non-null  object 
 1   Total number of previous IVF cycles                169616 non-null  object 
 2   Total number of previous DI cycles                 169616 non-null  object 
 3   Total number of previous pregnancies - IVF and DI  42688 non-null   float64
 4   Total number of previous live births - IVF or DI   105087 non-null  object 
 5   Causes of infertility - tubal disease              169616 non-null  int64  
 6   Causes of infertility - ovulatory disorder         169616 non-null  int64  
 7   Causes of infertility - male factor                169616 non-null  int64  
 8   Causes of infertility - patient unexplained        169616 non-null  int64 
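The `df.info()` output shows two things worth checking early: several count-like columns are stored as `object`, and some columns are mostly empty (only 42,688 of 169,616 rows, roughly a quarter, have a value for previous pregnancies). A minimal sketch of a per-column summary is shown below. It runs on a hypothetical toy frame whose column names mirror the register but whose values are illustrative only; `summarise_columns` is a hypothetical helper.

```python
import pandas as pd

# Hypothetical toy rows; values are illustrative, not taken from the HFEA register
toy = pd.DataFrame({
    "Patient age at treatment": ["18-34", "35-37", "38-39"],
    "Total number of previous pregnancies - IVF and DI": [1.0, None, None],
    "Causes of infertility - tubal disease": [0, 1, 0],
})


def summarise_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return dtype, missing count and missing percentage for each column."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "missing": frame.isna().sum(),
        "missing_pct": (frame.isna().mean() * 100).round(1),
    })


summary = summarise_columns(toy)
print(summary)
```

Applied to the real data as `summarise_columns(df)`, the same table would list every one of the 61 columns in a form that can be sorted or filtered, which `df.info()` does not allow.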

---

# Push file to Repo

In [7]:
import os

# Create output directory if it doesn't exist
os.makedirs(name='outputs/datasets/collection', exist_ok=True)

# Save the DataFrame as a compressed CSV file using gzip compression
df.to_csv("outputs/datasets/collection/FertilityTreatmentData.csv.gz", index=False, compression='gzip')
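
A quick way to confirm the compressed file was written correctly is to read it back and compare it with the frame in memory. The sketch below demonstrates the round trip on a hypothetical toy frame inside a temporary directory, so it does not touch the real output; the column names `cycle_id` and `live_birth` are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy data standing in for the real DataFrame
toy = pd.DataFrame({"cycle_id": [1, 2, 3], "live_birth": [0, 1, 0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Mirror the notebook's output layout inside a throwaway folder
    out_dir = os.path.join(tmp_dir, "outputs", "datasets", "collection")
    os.makedirs(out_dir, exist_ok=True)
    out_path = os.path.join(out_dir, "FertilityTreatmentData.csv.gz")

    toy.to_csv(out_path, index=False, compression="gzip")

    # read_csv infers gzip compression from the .gz extension by default
    restored = pd.read_csv(out_path)

# True when shape, column names, dtypes and values all match
round_trip_ok = restored.equals(toy)
print(round_trip_ok)
```

On the real data an exact `equals` check can fail even when the file is fine, because CSV does not store dtypes and mixed `object` columns may be re-parsed differently. Comparing `restored.shape` and the column names against `df` is a more forgiving first check.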