In [1]:
import os
import glob
import urllib.request  # a bare `import urllib` does not guarantee that urllib.request is loaded
import hashlib
import requests
from ftplib import FTP

import wget
import pandas as pd
from matplotlib import pyplot as plt


When loading data from remote locations, we will use standard internet file transfer protocols as well as data platform-specific APIs.

In [2]:
path_datadir = os.path.abspath(os.path.join(os.pardir, 'resources'))

# 3 Remote data files

## 3.1 Generic Uniform Resource Locators
URL

https://en.wikipedia.org/wiki/URL



<img src="https://upload.wikimedia.org/wikipedia/commons/d/d6/URI_syntax_diagram.svg" alt="URI syntax diagram">



```
https://en.wikipedia.org/wiki/URL

http://localhost:8888/notebooks/w2_bhv/Day_1/assignment_part1.ipynb

ftp://ftp.ncbi.nlm.nih.gov/genbank/README.genbank

mailto:thiagosa@oslomet.no
```
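The example URLs above can be split into their components with the standard library. This is a minimal sketch using `urllib.parse.urlsplit`; the variable names `urls` and `parsed` are chosen here for illustration only.

```python
from urllib.parse import urlsplit

urls = [
    'https://en.wikipedia.org/wiki/URL',
    'http://localhost:8888/notebooks/w2_bhv/Day_1/assignment_part1.ipynb',
    'ftp://ftp.ncbi.nlm.nih.gov/genbank/README.genbank',
    'mailto:thiagosa@oslomet.no',
]

# urlsplit returns a named tuple with scheme, netloc, path, query and fragment
parsed = [urlsplit(u) for u in urls]

for parts in parsed:
    # hostname and port are None when the URL has no authority part (e.g. mailto:)
    print(parts.scheme, parts.hostname, parts.port, parts.path)
```

Note that `mailto:` has no `//host` part, so the address ends up in `path` rather than in `hostname`.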

### ftplib

In [1]:
url = 'ftp://ftp.ncbi.nlm.nih.gov/genbank/README.genbank'

### BEGIN SOLUTION

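# Split 'ftp://host/dir/file' into scheme, host and path
urlscheme, urlhost = url.split('//')
urlpath = '/'.join(urlhost.split('/')[1:])  # 'genbank/README.genbank'
urlhost = urlhost.split('/')[0]             # 'ftp.ncbi.nlm.nih.gov'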

ftp = FTP(urlhost)  # connect to host, default port
ftp.login()

ftp.cwd(os.path.dirname(urlpath))

contents = ftp.nlst()

if os.path.basename(urlpath) not in contents:
    raise FileNotFoundError(urlpath)

for i in contents[:10]:
    print(i)

# URL paths always use '/', regardless of the local os.sep
path_readme = os.path.join(path_datadir, *urlpath.split('/'))

os.makedirs(os.path.dirname(path_readme), exist_ok=True)

if os.path.isfile(path_readme):
    print('Skipping ' + path_readme + ': file exists')
else:
    print('Downloading ' + os.path.basename(path_readme))
    with open(path_readme, 'wb') as f:
        ftp.retrbinary('RETR ' + os.path.basename(path_readme), f.write)

ftp.quit()

### END SOLUTION


### urllib

In [4]:
url = 'https://www.oslomet.no/var/oslomet/storage/images/_aliases/xxxlarge/1/0/3/4/144301-1-eng-GB/ail-02_SB_2400x1200.jpg'

In [5]:
### BEGIN SOLUTION
path_jpg_urllib = os.path.join(path_datadir, 'temp', 'guga_and_the_minister_urllib.jpg')
os.makedirs(os.path.dirname(path_jpg_urllib), exist_ok=True)
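# Check the target file itself, not the data directory
if os.path.isfile(path_jpg_urllib):
    print('Skipping - file exists')
else:
    # The context manager closes the connection once the body has been read
    with urllib.request.urlopen(url) as url_obj:
        with open(path_jpg_urllib, 'wb') as f:
            f.write(url_obj.read())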
### END SOLUTION

![using urllib](../resources/temp/guga_and_the_minister_urllib.jpg)

### requests

In [6]:
resp = requests.get(url)
resp.raise_for_status()  # fail early on HTTP errors instead of saving an error page

### BEGIN SOLUTION
path_jpg_resp = os.path.join(path_datadir, 'temp', 'guga_and_the_minister_resp.jpg')
with open(path_jpg_resp, 'wb') as f:
    f.write(resp.content)
### END SOLUTION

![using requests](../resources/temp/guga_and_the_minister_resp.jpg)

### Checking data integrity

A [**checksum**](https://en.wikipedia.org/wiki/Checksum) is a short, fixed-length string computed from data. If the data changes, even by a single bit, a good hash function produces a very different string, so comparing checksums reveals corrupted or incomplete downloads.
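To illustrate the sensitivity to small changes, here is a minimal sketch that flips a single bit in a byte string and compares the MD5 digests. The sample bytes and variable names are made up for this illustration.

```python
import hashlib

original = b'remote data file'
# Flip the lowest bit of the first byte to simulate a one-bit corruption
corrupted = bytes([original[0] ^ 0b00000001]) + original[1:]

digest_original = hashlib.md5(original).hexdigest()
digest_corrupted = hashlib.md5(corrupted).hexdigest()

print(digest_original)
print(digest_corrupted)
print('identical:', digest_original == digest_corrupted)
```

MD5 is convenient for detecting accidental corruption, but it is not considered safe against deliberate tampering; `hashlib.sha256` is a common alternative in that case.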

In [7]:
### BEGIN SOLUTION
# Close the file explicitly after hashing its contents
with open(path_jpg_urllib, 'rb') as f:
    md5_urllib = hashlib.md5(f.read()).hexdigest()
md5_urllib
### END SOLUTION

'b0b798843ea35ec0e82eb0dd40ebfd90'

In [8]:
### BEGIN SOLUTION
# Close the file explicitly after hashing its contents
with open(path_jpg_resp, 'rb') as f:
    md5_resp = hashlib.md5(f.read()).hexdigest()
md5_resp
### END SOLUTION

'b0b798843ea35ec0e82eb0dd40ebfd90'
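The two digests match, so both downloads produced identical files. The cells above read each file into memory in one go, which is fine for a JPEG but not for multi-gigabyte datasets. A possible sketch of a chunked version follows; `file_checksum` is a hypothetical helper name, and the temporary file only serves to make the example self-contained.

```python
import hashlib
import os
import tempfile

def file_checksum(path, algorithm='md5', chunk_size=1 << 20):
    """Return the hex digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.new(algorithm)
    with open(path, 'rb') as f:
        # iter() with a sentinel keeps calling f.read until it returns b''
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()

# Demonstration on a throwaway file: chunked and one-shot hashing agree
data = b'x' * 3_000_000
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(data)
try:
    print(file_checksum(tmp.name) == hashlib.md5(data).hexdigest())
finally:
    os.remove(tmp.name)
```

With the files from this section, the comparison would read `file_checksum(path_jpg_urllib) == file_checksum(path_jpg_resp)`.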

## 3.2 Data platform-specific APIs

### AWS S3
Amazon Web Services Simple Storage Service

`aws s3 ls s3://dandiarchive/ --no-sign-request`

https://docs.aws.amazon.com/cli/latest/reference/s3/index.html
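The `aws s3 ls ... --no-sign-request` command above lists a public bucket anonymously. As a rough sketch, something similar can be done from Python with only the standard library by calling the S3 REST `ListObjectsV2` endpoint and parsing the XML response. The function names below are hypothetical, and the sketch assumes the bucket allows anonymous listing and is reachable through the global `s3.amazonaws.com` endpoint. Unlike `aws s3 ls`, it returns a flat list of object keys rather than grouping them by prefix.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# XML namespace used by S3 listing responses
S3_NS = {'s3': 'http://s3.amazonaws.com/doc/2006-03-01/'}

def s3_list_url(bucket, prefix='', max_keys=10):
    """Build an anonymous ListObjectsV2 URL (virtual-hosted style)."""
    query = urllib.parse.urlencode(
        {'list-type': 2, 'prefix': prefix, 'max-keys': max_keys})
    return f'https://{bucket}.s3.amazonaws.com/?{query}'

def parse_listing(xml_bytes):
    """Extract (key, size in bytes) pairs from a ListObjectsV2 response."""
    root = ET.fromstring(xml_bytes)
    return [
        (item.findtext('s3:Key', namespaces=S3_NS),
         int(item.findtext('s3:Size', namespaces=S3_NS)))
        for item in root.findall('s3:Contents', S3_NS)
    ]

def list_public_bucket(bucket, prefix='', max_keys=10):
    """Fetch and parse one page of a public bucket listing (needs network)."""
    with urllib.request.urlopen(s3_list_url(bucket, prefix, max_keys)) as resp:
        return parse_listing(resp.read())

print(s3_list_url('dandiarchive'))
```

Calling `list_public_bucket('dandiarchive')` would then perform the actual request. For anything beyond a quick look, a dedicated client library is likely the better choice.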

https://github.com/Originate/dbg-pds-tensorflow-demo/blob/master/notebooks/01-data-cleaning-single-stock.ipynb

In [9]:
# S3 BUCKET: BRAZILIAN MUSIC
# https://registry.opendata.aws/covers-br/
# aws s3 ls s3://covers-song-br/ --no-sign-request

### DANDI
- Neuroscience-focused data archive

`dandi download https://api.dandiarchive.org/api/dandisets/000017/versions/draft/assets/3722e6b8-d47f-4feb-a9ae-9c368e41166b/download/ --output-dir /path/to/data/dir`

https://www.dandiarchive.org/handbook/10_using_dandi/#using-the-python-cli

**Suggested reading**

https://www.cerberusftp.com/ftps-vs-https-which-is-the-right-tool-for-secure-file-transfer/