### Download the Human Drug Adverse Event dataset from the OpenFDA website
The [OpenFDA](https://open.fda.gov/data/downloads/) website makes a variety of datasets available for manual or programmatic download. This sample uses Python code to download all 1400+ files of the Human Drug Adverse Event dataset. The files are downloaded as zip archives and unzipped after the download.

On successful execution of this Notebook you will have two folders in the Lakehouse:
1. One with the zipped files downloaded from the OpenFDA website
2. One with the unzipped files, which are used as the source for creating flattened JSON tables in a subsequent step
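
As an illustration of how the two folders relate, the sketch below derives the zipped and unzipped file names from a partition URL in the same way the code cells further down do. The helper name `lakehouse_file_names` and the sample URL are assumptions made for this example and are not part of the notebook.

```python
from urllib.parse import urlparse

def lakehouse_file_names(file_url):
    """Return (zipped_name, unzipped_name) for a partition URL.

    Hypothetical helper: mirrors the notebook's naming, which prefixes the
    file name with its parent URL segment and drops ".zip" after unzipping.
    """
    parts = urlparse(file_url).path.split("/")
    zipped_name = f"{parts[-2]}-{parts[-1]}"
    unzipped_name = zipped_name.replace(".zip", "")
    return zipped_name, unzipped_name

# Illustrative URL only; the real URLs come from the metadata JSON
sample_url = "https://download.open.fda.gov/drug/event/2004q1/drug-event-0001-of-0005.json.zip"
print(lakehouse_file_names(sample_url))
```

Prefixing the parent segment keeps names unique when files from different quarters share the same base name.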

Executing the Notebook can take 2-3 hours, so it is a good idea to run it through the Data Factory pipeline feature available in Microsoft Fabric.

**Note**: Keep in mind that the raw unzipped JSON files take up 400+ GB.
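
To gauge the download before starting, the per-file sizes listed in the metadata JSON can be added up. This is a rough sketch: it assumes each partition entry carries a `size_mb` field holding the compressed size, and the helper name `summarize_partitions` is made up for this example. The total covers the zipped files only, so it understates the unzipped footprint noted above.

```python
def summarize_partitions(partitions):
    """Return (file_count, total_size_mb) for a list of partition entries.

    Assumption: each entry has a "size_mb" value (number or numeric string)
    giving the compressed size. Entries without it count as 0 MB.
    """
    total_mb = sum(float(p.get("size_mb", 0)) for p in partitions)
    return len(partitions), total_mb

# Made-up entries for illustration; in the notebook the list would be
# download_metadata_json['results']['drug']['event']['partitions']
sample_partitions = [
    {"display_name": "sample part 1", "size_mb": "1.5"},
    {"display_name": "sample part 2", "size_mb": "2.25"},
]
count, total_mb = summarize_partitions(sample_partitions)
print(f"{count} files, {total_mb / 1024:.2f} GB compressed")
```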

In [None]:
import requests

# Retrieve the metadata JSON file, which lists all datasets and the files in each dataset
response = requests.get("https://api.fda.gov/download.json")
download_metadata_json = response.json() if response.status_code == 200 else None

if download_metadata_json:
    print("metadata json available")
else:
    print(f"error: metadata request failed with status {response.status_code}")

In [None]:
# Set up the directory paths where the zipped and unzipped files will reside
download_dir_name = "fda_ds"
download_dir_path = "Files/" + download_dir_name + "/"

unzip_dir_name = download_dir_name + "_unzipped"
unzip_dir_path = "Files/" + unzip_dir_name + "/"

In [None]:
# Create directories in the Lakehouse Files area for the zipped and unzipped files
print(download_dir_path)
mssparkutils.fs.mkdirs(download_dir_path)

print(unzip_dir_path)
mssparkutils.fs.mkdirs(unzip_dir_path)

In [None]:
from urllib.parse import urlparse

counter = 0

# Parse the metadata JSON and download each file to the Lakehouse.
# The dataset has 1400+ files, which the metadata JSON refers to as partitions.
# This sequential download is deliberately basic; a later iteration of this sample could use Spark's distributed processing.
download_path = "/lakehouse/default/" + download_dir_path

for p in download_metadata_json['results']['drug']['event']['partitions']:

    counter = counter + 1
    file_display_name = p['display_name']
    file_url = p['file']

    # Prefix the file name with its parent URL segment (the year and quarter) so names stay unique
    path = urlparse(file_url).path
    file_name = path.split("/")[-1]
    file_year_quarter = path.split("/")[-2]
    print(f"Downloading File# {counter}: {file_year_quarter}-{file_name}")

    r = requests.get(file_url, allow_redirects=True)
    r.raise_for_status()  # stop here rather than write an error response to disk

    with open(download_path + file_year_quarter + "-" + file_name, 'wb') as f:
        f.write(r.content)

In [None]:
import zipfile

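# Unzip an archive into a single output file.
# Assumes the archive holds exactly one member; with several, each would overwrite the previous one.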
def unzip_file(zip_filepath, output_file):
    with zipfile.ZipFile(zip_filepath, 'r') as zip_ref:
        for f in zip_ref.infolist():
            data = zip_ref.read(f)
            with open(output_file, 'wb') as fh:
                fh.write(data)


In [None]:
# Loop through the zipped files and uncompress each one into the unzip folder
data_files = mssparkutils.fs.ls(download_dir_path)

counter = 0
for data_file in data_files:
    counter = counter + 1
    print(f"File# {counter}: {data_file.name}")
    unzip_file("/lakehouse/default/" + download_dir_path + data_file.name, "/lakehouse/default/" + unzip_dir_path + data_file.name.replace(".zip",""))
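
`unzip_file` above reads each archive member fully into memory before writing it. For large files, a streaming variant may be gentler on memory. The following is a minimal sketch, not the notebook's method: `unzip_file_streaming` is a hypothetical name, it keeps the same two arguments, and it assumes each archive holds exactly one file, raising an error otherwise.

```python
import shutil
import zipfile

def unzip_file_streaming(zip_filepath, output_file, chunk_size=1024 * 1024):
    """Stream the single file inside a zip archive to output_file.

    Hypothetical alternative to unzip_file: data is copied in chunks
    instead of being read into memory in one piece.
    """
    with zipfile.ZipFile(zip_filepath, "r") as zip_ref:
        members = [m for m in zip_ref.infolist() if not m.is_dir()]
        if len(members) != 1:
            raise ValueError(
                f"expected exactly one file in {zip_filepath}, found {len(members)}"
            )
        # Copy from the compressed member to disk, chunk_size bytes at a time
        with zip_ref.open(members[0]) as src, open(output_file, "wb") as dst:
            shutil.copyfileobj(src, dst, chunk_size)
```

Because the signature matches, the loop above could call it in place of `unzip_file` without other changes.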
