## Running Data Pipelines In a Jupyter Notebook

A Jupyter Notebook can be used for many things:
  - Documentation (using Markdown Cells)
  - Code (using Code Cells)
  - Running Code in other files (using the `%run` command)
  - Installing Python Packages (using the `%pip install` command)

Combining these lets you write a documented series of steps showing how code should be run!
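To give a feel for what running code from another file means, here is a minimal sketch using the standard-library `runpy` module. It is only a rough stand-in for `%run` (which is an IPython feature and does more, such as placing the script's variables in the notebook's namespace). The helper `run_script` and the throwaway file `hello_step.py` are hypothetical names used for illustration.

```python
import runpy
import tempfile
from pathlib import Path

def run_script(path):
    """Execute a Python file as the main program and return its global variables."""
    return runpy.run_path(str(path), run_name="__main__")

# Use a throwaway script so this sketch does not depend on any course files.
with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "hello_step.py"  # hypothetical script name
    script.write_text(
        "result = 2 + 3\n"
        "if __name__ == '__main__':\n"
        "    print('Success!')\n"
    )
    script_globals = run_script(script)

# Variables defined by the script are available in the returned dictionary.
print(script_globals["result"])
```

Inside a notebook, `%run` is usually the more convenient choice; the sketch only illustrates the idea of executing a file from top to bottom.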

Our course analysis focuses on a dataset provided by Nicholas Steinmetz and his colleagues from their 2019 publication in Nature.  
In this notebook, there's code for:
  - Downloading the paper
  - Downloading example videos
  - Downloading the data provided by the authors
  - Converting the data into easier-to-analyze formats for this course


Most steps have two parts, both of which need to be run:
  1. **Download the Dependencies**. Certain packages are needed for each step; to download and install them into your Python environment, run the cells with `%pip install` in them.
  2. **Run the Code**.  Some of the cells have the code written directly here, while others run scripts found in other files.  Just run the cell and wait until "Success!" is printed below it.
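If you are unsure whether the dependencies for a step are already installed, a quick check like the sketch below can help. It uses the standard-library `importlib.util.find_spec`; the helper name `missing_packages` is hypothetical, and the list assumes each package's import name matches the name used with `%pip install` in this notebook.

```python
import importlib.util

def missing_packages(import_names):
    """Return the names that cannot be found in the current Python environment."""
    return [name for name in import_names if importlib.util.find_spec(name) is None]

# Packages installed by the steps in this notebook
# (assumption: import names are the same as the pip names).
needed = ["requests", "tqdm", "numpy", "pandas", "xarray", "netCDF4", "pyarrow"]
print("Missing:", missing_packages(needed))
```

An empty list means everything was found; otherwise, run the matching `%pip install` cell first.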

### Exercise: Run the Data Pipeline   

Get a feel for Jupyter notebooks by running each cell below, in order.

---

### Step 1: Download The Nature Paper

Nature Paper: https://www.nature.com/articles/s41586-019-1787-x

In [None]:
import os
import urllib.request

url = 'https://www.nature.com/articles/s41586-019-1787-x.pdf'
filename = 'references/steinmetz2019.pdf'

os.makedirs('references', exist_ok=True)
urllib.request.urlretrieve(url, filename);
print('Success!')
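`urlretrieve` saves whatever the server sends back, so if the site were to respond with a web page instead of the paper (an assumption, for example an access or consent page), the file would still be saved under the `.pdf` name. A minimal sketch for checking the result is to look for the `%PDF-` marker that PDF files start with. The helper `looks_like_pdf` is a hypothetical name, and the demo uses throwaway files rather than the real download.

```python
import tempfile
from pathlib import Path

def looks_like_pdf(path):
    """Return True if the file exists and starts with the PDF marker bytes."""
    path = Path(path)
    if not path.is_file():
        return False
    with path.open("rb") as f:
        return f.read(5) == b"%PDF-"

# Demonstrate on two throwaway files: one PDF-like, one HTML page saved as .pdf.
with tempfile.TemporaryDirectory() as tmp:
    real = Path(tmp) / "real.pdf"
    real.write_bytes(b"%PDF-1.7\n")
    fake = Path(tmp) / "fake.pdf"
    fake.write_bytes(b"<!DOCTYPE html><html></html>")
    real_ok = looks_like_pdf(real)
    fake_ok = looks_like_pdf(fake)

print(real_ok, fake_ok)
```

After running the download cell, `looks_like_pdf('references/steinmetz2019.pdf')` would apply the same check to the saved paper.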

---

### Step 2: Download the Trimmed Videos

  - *Inputs from*: iBOTS-Hosted Sciebo Folder
  - *Outputs to*: `vids/*.avi`

##### Install Dependencies

In [None]:
%pip install requests tqdm

##### Run Code

In [None]:
import os
import requests
from tqdm import tqdm

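def download_from_sciebo(public_url, to_filename, is_file=True):
    """
    Downloads a file or folder from a shared URL on Sciebo
    """
    r = requests.get(public_url + "/download", stream=True)
    r.raise_for_status()  # fail early on a bad link instead of saving an error page
    # Use the reported size for the progress bar only for single files, and only if the server provides it
    size = r.headers.get('Content-Length') if is_file else None
    total = int(size) if size is not None else None
    with open(to_filename, 'wb') as f, tqdm(desc=f"Downloading {to_filename}", unit='B', unit_scale=True, total=total) as progress_bar:
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)
            progress_bar.update(len(chunk))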


os.makedirs("vids", exist_ok=True)
download_from_sciebo("https://uni-bonn.sciebo.de/s/oMoBlis0VvAsblG", "vids/eyetracking_example_steinmetz2019.avi")
download_from_sciebo("https://uni-bonn.sciebo.de/s/fDY3V8JnZEOPnCR", "vids/mouse_wheel_example_steinmetz2019.avi")
print('Success!')
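Re-running the cell above downloads both videos again. If that becomes tedious, one option is to skip files that are already present. The sketch below shows such a check; `needs_download` is a hypothetical helper, and it only tests that the file exists and is not empty, so a partial file left by an interrupted download would not be detected.

```python
import os
import tempfile

def needs_download(to_filename):
    """Return True if the target file is missing or empty."""
    return not os.path.isfile(to_filename) or os.path.getsize(to_filename) == 0

# Demonstrate with a throwaway file instead of the real videos.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "example.avi")
    before = needs_download(target)   # file does not exist yet
    with open(target, "wb") as f:
        f.write(b"\x00" * 16)         # stand-in for downloaded bytes
    after = needs_download(target)    # file now exists and is non-empty

print(before, after)
```

In this notebook, the check could wrap each call, for example `if needs_download(path): download_from_sciebo(url, path)`.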

--- 

### Step 3: Download the Steinmetz et al., 2019 Dataset

From the [Neuromatch Academy Data Archive](https://osf.io/hygbm), hosted by the [Center for Open Science](https://osf.io/)

  - *Inputs from*: The Internet
  - *Outputs to*: `data/raw`

##### Install Dependencies

In [None]:
%pip install requests tqdm

##### Run the Script

In [None]:
%run scripts/1_download_data.py
print('Success!')

---

### Step 4: Process the Data

  - *Inputs from*: `raw/*`
  - *Outputs to*: `processed/*.nc`


##### Install Dependencies

In [None]:
%pip install numpy pandas tqdm xarray netCDF4 pyarrow

##### Run the Script

In [None]:
%run scripts/2_convert_to_netcdf.py
print('Success!')

---

### Step 5: Extract Tables for Today's Analysis

  - *Inputs from*: `processed/*.nc`
  - *Outputs to*: `final/*.csv`

##### Install Dependencies

In [None]:
%pip install pandas xarray

##### Run the Script

In [None]:
%run scripts/3_extract_to_csv.py
print('Success!')
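Once all steps have run, it can be reassuring to confirm that each one produced files. The sketch below counts the files matching each *Outputs to* pattern listed in the steps. The helper `count_outputs` is a hypothetical name, and the base folder is an assumption: the patterns are taken at face value, so they may need a prefix depending on where the scripts actually write.

```python
import glob
import os
import tempfile

def count_outputs(patterns, base_dir="."):
    """Map each glob pattern to the number of matching paths under base_dir."""
    return {p: len(glob.glob(os.path.join(base_dir, p))) for p in patterns}

# Patterns copied from the "Outputs to" bullets; the base folder is an assumption.
expected = ["vids/*.avi", "processed/*.nc", "final/*.csv"]

# Demonstrate on a throwaway folder that contains a single CSV file.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "final"))
    open(os.path.join(tmp, "final", "example.csv"), "w").close()
    demo_counts = count_outputs(expected, base_dir=tmp)

print(demo_counts)
```

Calling `count_outputs(expected)` from the notebook's folder would report the real counts; a zero suggests that the step did not run or wrote its files somewhere else.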