# Download the MAUDE Data

Goals: 
1. _**(COMPLETE)**_ Retrieve and list all the [MAUDE zip files on fda.gov](https://www.fda.gov/medical-devices/mandatory-reporting-requirements-manufacturers-importers-and-device-user-facilities/about-manufacturer-and-user-facility-device-experience-maude).

The following Python code completes these steps:
1. Read the entire MAUDE webpage from fda.gov
2. Take the single HTML table that the read returns
3. Use the Pandas library to convert the HTML table into a Pandas dataframe
4. Drop the first row of the dataframe which is only used for formatting on the web page
5. Rename the columns of the dataframe to include 'Description' and remove tab characters
6. Convert total records to integer to allow math operations on record counts

In [7]:
! pip install --requirement requirements.txt > /dev/null

In [2]:
import pandas as pd
from unicodedata import normalize

# Read the entire webpage from fda.gov
tables = pd.read_html('https://www.fda.gov/medical-devices/mandatory-reporting-requirements-manufacturers-importers-and-device-user-facilities/about-manufacturer-and-user-facility-device-experience-maude')

# The read should return one table; use that as the dataframe
df = tables[0]

# Drop the first row which is only used for formatting on the web page
df = df.drop(index=df.index[0])

# Rename the columns of the table to include 'Description' and remove tabs
df.columns = [
    'File Name',
    'Compressed Size in Bytes',
    'Uncompressed Size in Bytes',
    'Total Records',
    'Description'
]

# Convert total records to integer
df = df.astype({'Total Records':'int'})
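
The cell above renames the columns by position, so it would silently mislabel the data if the FDA page ever changed its table layout. A minimal sketch of a guard, assuming a hypothetical helper named `tidy_maude_table` (not part of the notebook) that wraps the same drop, rename, and convert steps and raises an error when the column count is unexpected:

```python
import pandas as pd

EXPECTED_COLUMNS = [
    'File Name',
    'Compressed Size in Bytes',
    'Uncompressed Size in Bytes',
    'Total Records',
    'Description',
]

def tidy_maude_table(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: the clean-up steps above plus a layout check."""
    if raw.shape[1] != len(EXPECTED_COLUMNS):
        raise ValueError(
            f"Expected {len(EXPECTED_COLUMNS)} columns, found {raw.shape[1]}"
        )
    # Drop the formatting-only first row without mutating the input
    tidy = raw.iloc[1:].copy()
    tidy.columns = EXPECTED_COLUMNS
    # Convert record counts to integers so they can be summed
    tidy['Total Records'] = tidy['Total Records'].astype(int)
    return tidy
```

With the list returned by `pd.read_html`, this could be called as `df = tidy_maude_table(tables[0])`.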


Goals:
1. _**(COMPLETE)**_ Print a summary of the MAUDE data

The following Python code completes these steps:
1. Print the dataframe from the previous steps

In [3]:
df

| | File Name | Compressed Size in Bytes | Uncompressed Size in Bytes | Total Records | Description |
|---|---|---|---|---|---|
| 1 | mdrfoi.zip | 6167KB | 87864KB | 263604 | MAUDE Base records received to date for 2022 |
| 2 | mdrfoithru2021.zip | 460013KB | 4253175KB | 12830703 | Master Record through 2021 |
| 3 | mdrfoiadd.zip | 6276KB | 90017KB | 269188 | New MAUDE Base records for the current month. |
| 4 | mdrfoichange.zip | 11457KB | 137162KB | 421553 | MAUDE Base data updates: changes to existing B... |
| 5 | patient.zip | 669KB | 7249KB | 269189 | MAUDE Patient records received to date for 2022 |
| ... | ... | ... | ... | ... | ... |
| 66 | foitext2020.zip | 193121KB | 1134242KB | 3039449 | Narrative Data for 2020 |
| 67 | foitext2021.zip | 211070KB | 1255788KB | 3625862 | Narrative Data for 2021 |
| 68 | foitext.zip | 18407KB | 124772KB | 441898 | Narrative Data received to date for 2022 |
| 69 | foitextadd.zip | 8583KB | 56463KB | 200966 | New MAUDE Narrative data for the current month. |
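
Note that the size columns above are labeled "in Bytes" but hold strings with a `KB` suffix, so they cannot be summed as they are. A minimal sketch of a conversion, assuming every size value is a whole number followed by `KB` as in the rows shown; `parse_kb` is a hypothetical helper name:

```python
import re

def parse_kb(size_text: str) -> int:
    """Return the numeric part of a size string such as '6167KB'."""
    match = re.fullmatch(r'\s*(\d+)\s*KB\s*', size_text, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"Unrecognized size value: {size_text!r}")
    return int(match.group(1))

# Sizes copied from the first three rows of the summary above
compressed_sizes = ['6167KB', '460013KB', '6276KB']
total_kb = sum(parse_kb(size) for size in compressed_sizes)
print(f"Compressed size of the first three files: {total_kb} KB")
```

With the dataframe from the earlier cell, the same helper could be applied as `df['Compressed Size in Bytes'].map(parse_kb)`.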


Goals:
1. _**(COMPLETE)**_ Download all of the MAUDE files to local storage

The following Python code completes these steps:
1. Iterate over all rows using DataFrame.iterrows()
2. For each row, get the file name
3. Use the file name to create a path on the local system
4. Check whether the file already exists; if it does, skip the download
5. Use the file name to create the URL for the file
6. Download the file to the local system



In [4]:
from os.path import exists

import os
import urllib.request

data_directory = './data'

# Create the data directory if needed
try:
    os.makedirs(data_directory, exist_ok = True)
except OSError as error:
    print(f"Error creating {data_directory}: {error}")

# Iterate all rows using DataFrame.iterrows()
for index, row in df.iterrows():
    file_name = row["File Name"]
    file_path = f"{data_directory}/{file_name}"
    if exists(file_path):
        print(f"Already downloaded {file_path}; Skipping!")
    else:
        print(f"Downloading {file_name}")
        url = f"https://www.accessdata.fda.gov/MAUDE/ftparea/{file_name}"
        urllib.request.urlretrieve(url, file_path)
 

Downloading mdrfoi.zip
Downloading mdrfoithru2021.zip
Downloading mdrfoiadd.zip
Downloading mdrfoichange.zip
Downloading patient.zip
Downloading patientthru2021.zip
Downloading patientadd.zip
Downloading patientchange.zip
Downloading patientproblemcode.zip
Downloading patientproblemdata.zip
Downloading foidevthru1997.zip
Downloading foidev1998.zip
Downloading foidev1999.zip
Downloading device2000.zip
Downloading device2001.zip
Downloading device2002.zip
Downloading device2003.zip
Downloading device2004.zip
Downloading device2005.zip
Downloading device2006.zip
Downloading device2007.zip
Downloading device2008.zip
Downloading device2009.zip
Downloading device2010.zip
Downloading device2011.zip
Downloading device2012.zip
Downloading device2013.zip
Downloading device2014.zip
Downloading device2015.zip
Downloading device2016.zip
Downloading device2017.zip
Downloading device2018.zip
Downloading device2019.zip
Downloading device2020.zip
Downloading device2021.zip
Downloading device.zip
Downlo
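
One caveat with the loop above: if a download is interrupted, a partial file can be left in `./data`, and the `exists` check will skip it on the next run. A minimal sketch of one way around that, assuming a hypothetical helper `download_if_missing` that writes to a temporary `.part` file and renames it only after the transfer finishes. The `fetch` parameter is also an assumption, added so the helper can be exercised without network access:

```python
import os
import urllib.request

def download_if_missing(url, file_path, fetch=urllib.request.urlretrieve):
    """Download url to file_path unless file_path already exists.

    Returns True if a download happened, False if it was skipped.
    """
    if os.path.exists(file_path):
        return False
    partial_path = file_path + '.part'
    try:
        # Write to a temporary name so an interrupted run never leaves
        # a truncated file under the final name
        fetch(url, partial_path)
        os.replace(partial_path, file_path)
    finally:
        # Clean up the temporary file if the download or rename failed
        if os.path.exists(partial_path):
            os.remove(partial_path)
    return True
```

Inside the loop, the `if`/`else` body could then be reduced to a single call such as `download_if_missing(url, file_path)`.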

Goal:
1. List the files on the local system

Steps:
1. Use the 'ls' shell command to list the files on the local system

In [5]:
! ls ./data

device.zip             devicechange.zip       foitext2013.zip
device2000.zip         deviceproblemcodes.zip foitext2014.zip
device2001.zip         foidev1998.zip         foitext2015.zip
device2002.zip         foidev1999.zip         foitext2016.zip
device2003.zip         foidevproblem.zip      foitext2017.zip
device2004.zip         foidevthru1997.zip     foitext2018.zip
device2005.zip         foitext.zip            foitext2019.zip
device2006.zip         foitext1996.zip        foitext2020.zip
device2007.zip         foitext1997.zip        foitext2021.zip
device2008.zip         foitext1998.zip        foitextadd.zip
device2009.zip         foitext1999.zip        foitextchange.zip
device2010.zip         foitext2000.zip        foitextthru1995.zip
device2011.zip         foitext2001.zip        mdrfoi.zip
device2012.zip         foitext2002.zip        mdrfoiadd.zip
device2013.zip         foitext2003.zip        mdrfoichange.zip
device2014.zip         foitext2004.zip        mdrfoithru2021.zip
device
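
The `ls` output confirms that the files exist but not that they are intact. A minimal sketch of a Python alternative, assuming a hypothetical helper `summarize_zip_files` that lists each `.zip` file with its size and whether the standard library's `zipfile.is_zipfile` recognizes it:

```python
import os
import zipfile

def summarize_zip_files(directory):
    """Return (file name, size in bytes, looks like a zip) for each .zip file."""
    summary = []
    for name in sorted(os.listdir(directory)):
        if not name.endswith('.zip'):
            continue
        path = os.path.join(directory, name)
        summary.append((name, os.path.getsize(path), zipfile.is_zipfile(path)))
    return summary

# Only report when the data directory from the download step is present
if os.path.isdir('./data'):
    for name, size, is_zip in summarize_zip_files('./data'):
        status = 'ok' if is_zip else 'NOT A VALID ZIP'
        print(f"{name:<28}{size:>14,d}  {status}")
```

`is_zipfile` only checks that the file has a zip structure. A full CRC check is available through `zipfile.ZipFile.testzip()`, at the cost of reading every archive in full.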

# Observations and Research
There are four different types of files with MAUDE data:
1. Master event data: `mdrfoi...` (for example, `mdrfoithru2021.zip`)
2. Patient data: `patient...`
3. Device data: `device...`
4. Text data: `foitext...`

All reports are linked via the `MDR Report Key`.
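
As a minimal sketch of that linkage, the example below joins two tiny, made-up tables on the report key with `pandas.merge`. The column names (`MDR_REPORT_KEY`, `EVENT_TYPE`, `BRAND_NAME`) and all values are assumptions for illustration; check the header row of the real files before relying on them.

```python
import pandas as pd

# Made-up stand-ins for rows from a master event file and a device file
events = pd.DataFrame({
    'MDR_REPORT_KEY': [1001, 1002, 1003],
    'EVENT_TYPE': ['M', 'IN', 'D'],
})
devices = pd.DataFrame({
    'MDR_REPORT_KEY': [1001, 1001, 1003],
    'BRAND_NAME': ['Pump A', 'Cable B', 'Valve C'],
})

# A left join keeps every event, even those without a device row;
# events with several devices appear once per device
linked = events.merge(devices, on='MDR_REPORT_KEY', how='left')
print(linked)
```

A left join is used here so that reports with no matching device row are not dropped; an inner join would keep only reports present in both files.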

This paper provides some additional insight on massaging the raw MAUDE data into useful formats:
- [A Primer to the Structure, Content and Linkage of the FDA’s Manufacturer and User Facility Device Experience (MAUDE) Files](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5994953/)

The [MAUDE data page](https://www.fda.gov/medical-devices/mandatory-reporting-requirements-manufacturers-importers-and-device-user-facilities/about-manufacturer-and-user-facility-device-experience-maude) also provides information that can be used to develop a schema for importing the data into a modern database that supports queries.
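
A first step toward such a schema is to read just the header row of each file. A minimal sketch, assuming each zip contains a single pipe-delimited text file whose first line holds the column names (an assumption to verify against the FDA documentation). `read_header` is a hypothetical helper name, and `latin-1` is chosen only because it decodes any byte sequence without raising an error:

```python
import csv
import io
import zipfile

def read_header(zip_source, delimiter='|', encoding='latin-1'):
    """Return the column names from the first member of a zip archive."""
    with zipfile.ZipFile(zip_source) as archive:
        # Assumption: the archive holds exactly one data file
        member = archive.namelist()[0]
        with archive.open(member) as raw:
            text = io.TextIOWrapper(raw, encoding=encoding, newline='')
            # Only the first line is parsed; the rest is never read
            return next(csv.reader(text, delimiter=delimiter))
```

For a downloaded file, this could be called as `read_header('./data/mdrfoi.zip')`.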

This video has useful information on running Jupyter notebooks on AWS:
- [Serverless Jupyter on AWS: Fully Managed Notebook Environments](https://www.youtube.com/watch?v=-k53AcgVHTI&ab_channel=AWSPublicSector)

Running the notebook on a dedicated server in AWS should allow faster run times and persistent storage. On shared resources in Google Colab, by contrast, the data must be downloaded and re-ingested for each run, which leads to long run times.




# Next Steps
1. Set up an AWS account to host a dedicated Jupyter notebook server
2. Migrate the 'Download' notebook to the AWS server
3. Run the 'Download' notebook in the AWS environment and begin processing the data files
4. Extract and concatenate the data for 2015 to build the database schema

# Summary
1. A Jupyter notebook has been developed to download MAUDE data from fda.gov
2. The Google Colab environment is suitable for running the notebook and viewing the downloaded files.
3. However, downloads in the shared Colab environment were slow, and the data is not saved permanently.
4. A dedicated environment needs to be created which will allow faster runs of the notebook and permanent storage.  The dedicated server will be created in the AWS cloud.
5. Next steps include processing data for 2015 to develop a database schema.