# Downloading and Loading Datasets

To avoid clicking through and downloading each file by hand, we use BeautifulSoup to scrape the filenames from the website.

Parent directory: https://sdge.sdsc.edu/data/sdge/

**Note:** So far we have only downloaded the files for 2020-08.

In [None]:
from bs4 import BeautifulSoup
import requests

# Fetch the web page
url = "https://sdge.sdsc.edu/data/sdge/ens_gfs_001/2020-08/"
response = requests.get(url)
response.raise_for_status()  # stop early if the request failed
data = response.text

# Parse the HTML content
soup = BeautifulSoup(data, 'html.parser')

In [None]:
# The files/filenames are organized in a table
# Extract them from the table
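fns = []
for tr in soup.find('table').find_all('tr'):
    # Text of every link in this row
    row = [a.text for a in tr.find_all('a')]
    # Skip rows with fewer than two links (they would raise an IndexError)
    if len(row) > 1:
        fns.append(row[1])

# The first 2 elements are not filenames
fns = fns[2:]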
fns
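
The cell above relies on the position of each link within its table row. As a hedged alternative, the sketch below selects links by their `.nc` suffix instead. The HTML string and the `example_*.nc` names are made up for illustration, and the helper `extract_nc_filenames` is hypothetical; it also assumes that each file link's `href` is the bare filename, which we have not verified for the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical directory-listing HTML, standing in for the real page
sample_html = """
<table>
  <tr><th><a href="?C=N;O=D">Name</a></th><th><a href="?C=M;O=A">Last modified</a></th></tr>
  <tr><td><a href="/data/sdge/">Parent Directory</a></td><td>-</td></tr>
  <tr><td><a href="example_a.nc">example_a.nc</a></td><td>2020-08-01</td></tr>
  <tr><td><a href="example_b.nc">example_b.nc</a></td><td>2020-08-02</td></tr>
</table>
"""

def extract_nc_filenames(html):
    """Return the href of every link that ends in .nc, in page order."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        a["href"]
        for a in soup.find_all("a", href=True)
        if a["href"].endswith(".nc")
    ]

sample_fns = extract_nc_filenames(sample_html)
print(sample_fns)
```

Filtering on the suffix removes the need to drop the first two entries by hand, at the cost of assuming that every file of interest ends in `.nc`.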

In [None]:
# Build the download URLs, which are the parent directory + filename
# Files will be downloaded to the local "data" folder
urls = [url + fn for fn in fns]
dest = ['data/' + fn for fn in fns]
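
Plain string concatenation works here because the parent URL ends with a slash. A slightly more defensive sketch, using only the standard library, is shown below. The helper name `build_targets` and the filename `example_a.nc` are hypothetical and only serve as an illustration.

```python
from pathlib import Path
from urllib.parse import urljoin

def build_targets(base_url, filenames, out_dir="data"):
    """Pair each filename with its download URL and local destination path."""
    # urljoin only appends to the last path segment if base_url ends with "/"
    if not base_url.endswith("/"):
        base_url += "/"
    urls = [urljoin(base_url, fn) for fn in filenames]
    dest = [str(Path(out_dir) / fn) for fn in filenames]
    return urls, dest

example_urls, example_dest = build_targets(
    "https://sdge.sdsc.edu/data/sdge/ens_gfs_001/2020-08",  # no trailing slash
    ["example_a.nc"],  # hypothetical filename
)
print(example_urls[0])
print(example_dest[0])
```

Using `pathlib` for the destination keeps the path separator correct on both Windows and POSIX systems.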

**The following two code cells download the files from the website. Do not run them unless needed.**

In [None]:
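import os
import time
from multiprocessing import cpu_count
from multiprocessing.pool import ThreadPool

import requests

# Download a single file
# args is a (url, destination) pair
def download_url(args):
    t0 = time.time()
    url, fn = args
    try:
        r = requests.get(url)
        r.raise_for_status()  # do not save an error page as data
        # Create the destination folder if it does not exist yet
        os.makedirs(os.path.dirname(fn) or '.', exist_ok=True)
        with open(fn, 'wb') as f:
            f.write(r.content)
        return url, time.time() - t0
    except Exception as e:
        print('Exception in download_url():', e)
        return url, None  # None marks a failed download

# Download multiple files in parallel
# args is an iterable of (url, destination) pairs, e.g. zip(urls, dest)
def download_parallel(args):
    # Use at least one worker thread, even on a single-core machine
    workers = max(1, cpu_count() - 1)
    with ThreadPool(workers) as pool:
        # Print each file and the time taken, in the order downloads finish
        for url, elapsed in pool.imap_unordered(download_url, args):
            if elapsed is None:
                print('url:', url, 'download failed')
            else:
                print('url:', url, 'time (s):', elapsed)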

In [None]:
# Run the download code
inputs = zip(urls, dest)
download_parallel(inputs)
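
Since the downloads should not be repeated unless needed, one option is to skip files that are already on disk. The sketch below is a minimal version of that idea; `pending_downloads` is a hypothetical helper, and the demo paths are made up. Note that it only checks whether a file exists, so a partially written file from an interrupted run would also be skipped.

```python
import tempfile
from pathlib import Path

def pending_downloads(urls, dest):
    """Keep only the (url, destination) pairs whose file is not on disk yet."""
    return [(u, d) for u, d in zip(urls, dest) if not Path(d).exists()]

# Demo with a temporary folder: a.nc already exists, b.nc does not
with tempfile.TemporaryDirectory() as tmp:
    have = Path(tmp) / "a.nc"
    have.write_bytes(b"")
    missing = Path(tmp) / "b.nc"
    demo_pending = pending_downloads(
        ["https://example.org/a.nc", "https://example.org/b.nc"],  # made-up URLs
        [str(have), str(missing)],
    )

print([u for u, _ in demo_pending])
```

With this in place, the run cell could call `download_parallel(pending_downloads(urls, dest))` instead of passing the full `zip(urls, dest)`.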

Now that the .nc files are downloaded, we open them with the xarray library.

.nc (NetCDF) files carry metadata, which is retained when they are opened as xarray datasets. Converting a dataset into a pandas DataFrame, however, drops that metadata.
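
To illustrate the point about metadata, here is a small sketch with an in-memory dataset. The variable name, values, and attributes are all made up and are not taken from the downloaded files. It also shows one possible way to keep the attributes around: copy them into a plain dictionary before converting.

```python
import numpy as np
import xarray as xr

# Small in-memory dataset with made-up values and attributes
demo_ds = xr.Dataset(
    {"temperature": (("time",), np.array([20.5, 21.0, 19.8]))},
    coords={"time": [0, 1, 2]},
    attrs={"title": "hypothetical example"},
)
demo_ds["temperature"].attrs["units"] = "degC"

# Save the per-variable attributes before converting
var_attrs = {name: dict(var.attrs) for name, var in demo_ds.data_vars.items()}

# The DataFrame holds the values, indexed by the dataset's dimensions
demo_df = demo_ds.to_dataframe()

print(demo_ds.attrs)                  # dataset-level metadata
print(var_attrs)                      # variable-level metadata, saved separately
print(demo_df.columns.tolist())       # only the data columns remain
```

Keeping `var_attrs` next to the DataFrame is just one option; staying in xarray for as long as possible avoids the problem altogether.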

In [None]:
import xarray as xr
import pandas as pd 

In [None]:
# dest has all the file paths
dest

In [None]:
# test: open one dataset
ds = xr.open_dataset(dest[0])

# Print variable names
print(ds.data_vars)

In [None]:
# converting ds to df
df = ds.to_dataframe()
df

# Data Cleaning

A rough first look at the data, using the single dataset (`ds`) loaded above.

In [None]:
ds.data_vars