<a href="https://colab.research.google.com/github/managedkaos/nicoles-research-data/blob/main/Nicoles_Research_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Retrieve and list all the [MAUDE zip files on fda.gov](https://www.fda.gov/medical-devices/mandatory-reporting-requirements-manufacturers-importers-and-device-user-facilities/about-manufacturer-and-user-facility-device-experience-maude).

In [2]:
import pandas as pd

# Define constant values
DATA_URL = 'https://www.fda.gov/medical-devices/mandatory-reporting-requirements-manufacturers-importers-and-device-user-facilities/about-manufacturer-and-user-facility-device-experience-maude'
DOWNLOAD_URL = 'https://www.accessdata.fda.gov/MAUDE/ftparea'

# Read the entire webpage from fda.gov using Pandas
# Pandas will look for tables by default
tables = pd.read_html(DATA_URL)

# The read should return one table; use that as the dataframe
# TODO: check here to confirm one and only one table was returned
df = tables[0]

# Drop the first row which is only used for formatting on the web page
df = df.drop(index=df.index[0])

# Rename the columns of the table to include 'Description' and remove tabs
df.columns = [
    'File Name',
    'Compressed Size in Bytes',
    'Uncompressed Size in Bytes',
    'Total Records',
    'Description'
]

# Convert the 'Total Records' values to integer
df = df.astype({'Total Records':'int'})

# If needed for debugging, print the table as markdown
# print(df.to_markdown())
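
The TODO in the cell above (confirming that exactly one table came back) could be handled with a small guard. This is a minimal sketch, not part of the original notebook; `single_table` is a hypothetical helper name, not a pandas function.

```python
def single_table(tables):
    """Return the only item in `tables`; raise if the count is not exactly one."""
    if len(tables) != 1:
        # A different count suggests the page layout has changed
        raise ValueError(f"expected exactly 1 table, found {len(tables)}")
    return tables[0]
```

With this in place, the assignment could read `df = single_table(tables)`, so a change to the FDA page would fail immediately rather than silently selecting the wrong table.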

In [3]:
df

Unnamed: 0,File Name,Compressed Size in Bytes,Uncompressed Size in Bytes,Total Records,Description
1,mdrfoi.zip,6167KB,87864KB,263604,MAUDE Base records received to date for 2022
2,mdrfoithru2021.zip,460013KB,4253175KB,12830703,Master Record through 2021
3,mdrfoiadd.zip,6276KB,90017KB,269188,New MAUDE Base records for the current month.
4,mdrfoichange.zip,11457KB,137162KB,421553,MAUDE Base data updates: changes to existing B...
5,patient.zip,669KB,7249KB,269189,MAUDE Patient records received to date for 2022
...,...,...,...,...,...
66,foitext2020.zip,193121KB,1134242KB,3039449,Narrative Data for 2020
67,foitext2021.zip,211070KB,1255788KB,3625862,Narrative Data for 2021
68,foitext.zip,18407KB,124772KB,441898,Narrative Data received to date for 2022
69,foitextadd.zip,8583KB,56463KB,200966,New MAUDE Narrative data for the current month.
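
The two size columns are labeled "in Bytes", but the values shown above are strings with a `KB` suffix, so they cannot be summed or sorted numerically as they stand. A minimal sketch of one way to convert them, assuming every value follows the `<digits>KB` pattern seen in the output; `size_kb` is a hypothetical helper name.

```python
import re

def size_kb(text):
    """Convert a size string such as '6167KB' to the integer 6167."""
    match = re.fullmatch(r"\s*([\d,]+)\s*KB\s*", text, flags=re.IGNORECASE)
    if match is None:
        # Fail loudly on anything that does not look like '<digits>KB'
        raise ValueError(f"unrecognized size: {text!r}")
    return int(match.group(1).replace(",", ""))
```

It could then be applied per column, for example `df['Compressed Size in Bytes'].map(size_kb)`.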


Download all of the MAUDE files to local storage

In [4]:
import urllib.request

# Iterate all rows using .iterrows()
for index, row in df.iterrows():

    # Get the file name from this row
    file_name = row["File Name"]

    # Use the file name to define where the file will be stored locally
    file_path = f"/home/{file_name}"

    # Use the file name to define the URL where the file will be downloaded from
    file_url = f"{DOWNLOAD_URL}/{file_name}"

    # Print dots for tracking progress
    print('.', end='')

    # Download the file and store it locally
    urllib.request.urlretrieve(file_url, file_path)
 

......................................................................
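
The loop above downloads every file again each time the cell runs. A minimal sketch of a variant that skips files already present, assuming a non-empty local file means the earlier download completed; the helper name `download_if_missing` and the `.part` suffix are illustrative choices, not part of the original notebook.

```python
import os
import urllib.request

def download_if_missing(url, path, fetch=urllib.request.urlretrieve):
    """Download `url` to `path` unless a non-empty file already exists there.

    Returns True if a download happened, False if it was skipped.
    """
    if os.path.exists(path) and os.path.getsize(path) > 0:
        return False
    # Write to a temporary name first so an interrupted download
    # does not leave a partial file under the final name.
    partial = path + ".part"
    fetch(url, partial)
    os.replace(partial, path)
    return True
```

Inside the loop, `download_if_missing(file_url, file_path)` would take the place of the direct `urlretrieve` call. The `fetch` parameter exists only so the function can be exercised without network access.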

List the files in local storage

In [5]:
! ls /home

device2000.zip	device2018.zip		foitext2002.zip  foitext2020.zip
device2001.zip	device2019.zip		foitext2003.zip  foitext2021.zip
device2002.zip	device2020.zip		foitext2004.zip  foitextadd.zip
device2003.zip	device2021.zip		foitext2005.zip  foitextchange.zip
device2004.zip	deviceadd.zip		foitext2006.zip  foitextthru1995.zip
device2005.zip	devicechange.zip	foitext2007.zip  foitext.zip
device2006.zip	deviceproblemcodes.zip	foitext2008.zip  mdrfoiadd.zip
device2007.zip	device.zip		foitext2009.zip  mdrfoichange.zip
device2008.zip	foidev1998.zip		foitext2010.zip  mdrfoithru2021.zip
device2009.zip	foidev1999.zip		foitext2011.zip  mdrfoi.zip
device2010.zip	foidevproblem.zip	foitext2012.zip  patientadd.zip
device2011.zip	foidevthru1997.zip	foitext2013.zip  patientchange.zip
device2012.zip	foitext1996.zip		foitext2014.zip  patientproblemcode.zip
device2013.zip	foitext1997.zip		foitext2015.zip  patientproblemdata.zip
device2014.zip	foitext1998.zip		foitext2016.zip  patientthru2021.zip
device2015.z

In [6]:
! unzip /home/foitext2012.zip

Archive:  /home/foitext2012.zip
  inflating: foitext2012.txt         


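Unzip the following narrative files (the cell below also extracts the matching `device2012.zip` through `device2021.zip` archives):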
```
foitext2012.zip
foitext2013.zip
foitext2014.zip
foitext2015.zip
foitext2016.zip
foitext2017.zip
foitext2018.zip
foitext2019.zip
foitext2020.zip
foitext2021.zip
```

In [8]:
# Make the input directory
! mkdir -p /home/foitext_files

# Extract the contents of the 2012 through 2021 archives into the input directory.
# The brace expansion below covers both file families:
#   device2012.zip  through device2021.zip
#   foitext2012.zip through foitext2021.zip

! for i in {device,foitext}20{12..21}.zip; do echo -n "."; unzip -qq -d /home/foitext_files -o "/home/${i}"; done

....................

In [7]:
! ls /home/foitext_files

DEVICE2012.txt	DEVICE2017.txt	foitext2012.txt  foitext2017.txt
DEVICE2013.txt	DEVICE2018.txt	foitext2013.txt  foitext2018.txt
DEVICE2014.txt	DEVICE2019.txt	foitext2014.txt  foitext2019.txt
DEVICE2015.txt	DEVICE2020.txt	foitext2015.txt  foitext2020.txt
DEVICE2016.txt	DEVICE2021.txt	foitext2016.txt  foitext2021.txt
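
The shell loop relies on bash brace expansion. The same extraction could be sketched in Python with the standard `zipfile` module, which may be easier to reuse outside a notebook. The function name `extract_years` and its parameters are hypothetical; the defaults mirror the prefixes and years used in the shell command.

```python
import zipfile
from pathlib import Path

def extract_years(src_dir, dest_dir,
                  prefixes=("device", "foitext"),
                  years=range(2012, 2022)):
    """Extract <prefix><year>.zip from src_dir into dest_dir.

    Returns the names of the extracted members, in extraction order.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    for prefix in prefixes:
        for year in years:
            archive = Path(src_dir) / f"{prefix}{year}.zip"
            # A missing archive raises FileNotFoundError here
            with zipfile.ZipFile(archive) as zf:
                zf.extractall(dest)
                extracted.extend(zf.namelist())
    return extracted
```

Calling `extract_years("/home", "/home/foitext_files")` would be expected to produce the same set of files as the listing above, overwriting existing files as `unzip -o` does.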


In [5]:
# Make the output directory
! mkdir -p /home/filtered-foitext_files

In [10]:
import pandas as pd

# Load the 2012 narrative file; fields are pipe-delimited and encoded as ISO-8859-1
df1 = pd.read_csv("/home/foitext_files/foitext2012.txt", encoding="ISO-8859-1", sep='|')
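
Some of the narrative files are large (the listing earlier shows `foitext2020.zip` at more than 1 GB uncompressed), so loading one in a single call may strain memory. A minimal sketch of reading a pipe-delimited, ISO-8859-1 file in chunks with the `chunksize` parameter of `pd.read_csv`. The sample data and its column names (`KEY`, `TEXT`) are made up for illustration and do not reflect the real MAUDE layout.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for a foitext file
sample = (
    "KEY|TEXT\n"
    "1|first narrative\n"
    "2|café visit\n"
    "3|third narrative\n"
).encode("ISO-8859-1")

def count_rows(source, chunksize=2):
    """Count data rows by reading `source` in chunks of `chunksize` rows."""
    total = 0
    for chunk in pd.read_csv(source, encoding="ISO-8859-1", sep="|",
                             chunksize=chunksize):
        # Each chunk is an ordinary DataFrame; filter or aggregate it here
        total += len(chunk)
    return total

row_count = count_rows(io.BytesIO(sample))
```

For the real files, `source` would be a path such as `/home/foitext_files/foitext2012.txt` and `chunksize` would be far larger than 2.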