<a href="https://colab.research.google.com/github/olaviinha/SloppyNoto/blob/master/crawlers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Data Crawlers


This utility accompanies [Sloppy Noto](https://github.com/olaviinha/SloppyNoto) and helps you hunt down data files that are candidates for data-to-audio conversion.

<font color="#9d9">**HTTP Crawler**</font> crawls FTP-like web listings (_"Index of /something"_). For policy reasons, some organizations such as ESA and NASA have moved parts of their data archives from FTP to HTTP(S), and this crawler handles those servers. HTTP Crawler descends 3 levels of subdirectories.

<font color="#99d">**FTP Crawler**</font> crawls FTP directories as an anonymous user. It descends **all** levels of subdirectories, so crawling a top-level or other upper-level directory (such as one containing _all data of an entire space mission_) is **not recommended**: such a crawl may take **hours** to complete.

<font color="#9dd">**ZIP Crawler**</font> crawls individual compressed archive files (.zip, .gz, .tar.gz). It descends all levels of subdirectories within the archive. To use data files from inside crawled archives, you must copy them to your Google Drive using the enclosed <font color="#9dd">Drive Stasher</font> cell.

**All of the crawlers** output:
- A **recursive** list of all files over 20 MB in size from whatever you're crawling.
- Files grouped into size classes and sorted by file size, largest first.
- File types of your choosing highlighted with a green arrow (<font color="#9d9">`=>`</font>)
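
The grouping described above can be pictured with a small sketch. The names used here (`SIZE_LIMITS_MB`, `bucket_by_size`, `is_highlighted`) and the default extensions are illustrative assumptions, not necessarily what the crawler cells use internally.

```python
MB = 1_000_000  # decimal megabytes, matching the 20 MB cutoff mentioned above

# Lower bounds of each size class in MB, largest first (assumed values).
SIZE_LIMITS_MB = [1000, 500, 200, 100, 50, 20]

def bucket_by_size(files, limits_mb=SIZE_LIMITS_MB):
    """Group (size_in_bytes, path) pairs into size classes, largest class first."""
    limits = [mb * MB for mb in limits_mb]
    buckets = [[] for _ in limits]
    for size, path in files:
        for i, lower in enumerate(limits):
            if size > lower:
                buckets[i].append((size, path))
                break  # the first (largest) matching class wins
    for bucket in buckets:
        bucket.sort(reverse=True)  # largest files first within each class
    return buckets

def is_highlighted(path, extensions=(".csv", ".tsv")):
    """Case-insensitive check of whether a path ends with a highlighted extension."""
    return path.lower().endswith(tuple(extensions))

files = [(25 * MB, "a.csv"), (1500 * MB, "b.bin"), (10 * MB, "c.csv"), (60 * MB, "d.TSV")]
buckets = bucket_by_size(files)
```

Files at or below the smallest threshold (here `c.csv`) fall into no class and are left out of the results.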

**Howto**
- Enter an FTP, HTTP(S) or archive (.zip, .tar.gz) address in the corresponding crawler cell, then run the cell by clicking the play button on its left side.
- Copy/paste paths from the results to Sloppy Noto's `data_file` field.
- To use data files found inside crawled ZIP files, use the <font color="#9dd">Drive Stasher</font> cell to copy those files from the ZIPs to your Google Drive. Then use the provided Drive path in Sloppy Noto.

**Note**
- Everything this utility does is **guessing** based on file names and sizes, so all results are only **indicative**. There is no guarantee that the files these crawlers find will work in Sloppy Noto.

In [None]:
#@title Setup & Settings

#@markdown <small>Highlight these extensions in all crawl results. Case-insensitive, comma-separated list of file extensions, including period.</small>
highlight_extensions = ".csv, .tsv, .tab, .lst, .log" #@param {type:"string"}



#@markdown <small>Path to a directory in your Google Drive, relative to your Drive root. Everything you choose to save will be saved in this directory.</small>
save_dir = "" #@param {type:"string"}

#@markdown <small>Save resulting file lists from your crawls as well as source information of copied files as .txt files (to the directory set above). May come in handy in the future, not having to crawl the same things time after time.</small>
save_output_txt = False #@param {type:"boolean"}

save_txt = save_output_txt


from google.colab import output
import os

force_setup = False

  
# inhagcutils
if not os.path.isfile('/content/inhagcutils.ipynb') or force_setup:
  pip_packages = 'ftputil'
  %cd /content/
  !pip -q install import-ipynb {pip_packages}
  !curl -s -O https://raw.githubusercontent.com/olaviinha/inhagcutils/master/inhagcutils.ipynb
# Mount Drive
if not os.path.isdir('/content/drive') and force_setup == False:
  from google.colab import drive
  drive.mount('/content/drive')
# Drive symlink
if not os.path.isdir('/content/mydrive') and force_setup == False:
  os.symlink('/content/drive/My Drive', '/content/mydrive')
  drive_root_set = True

import import_ipynb
from inhagcutils import *
drive_root = '/content/mydrive/'
dir_tmp = '/content/tmp/'
if not os.path.isdir(dir_tmp):
  create_dirs([dir_tmp])

import sys
import ftplib
import ftputil
import fnmatch


# Lower bounds of the size classes in MB, largest first.
size_limits = [1000, 500, 200, 100, 50, 20]
b = 1000000  # bytes per (decimal) megabyte
size_limits = [limit*b for limit in size_limits]

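def fix_extensions(input_extensions):
  # Strip whitespace around each entry; otherwise ' .tsv' (with the space
  # left over from ".csv, .tsv") would never match a file name.
  extensions = [ext.strip() for ext in input_extensions.split(',') if ext.strip()]
  gz_extensions = [ext+'.gz' for ext in extensions]
  extensions.extend(gz_extensions)
  extensions = [ext.lower() for ext in extensions]
  extensions.extend([ext.upper() for ext in extensions])
  return tuple(extensions)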

def apnd(content):
  global log_all, txt
  log_all = open(txt, 'a+')
  log_all.write(content)
  log_all.close()

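def print_filelist(files, title, ftp=True):
  # Print one size class of [size, path] pairs, largest first, and log it if requested.
  # The parameter is named `files` to avoid shadowing the built-in `list`.
  global log_all, save_txt
  total = len(files)
  op(c.title, '\n'+title+':\n')
  if save_txt == True:
    apnd('\n'+title+'\n')
  files.sort(key=lambda tup: tup[0], reverse=True)
  i = 0
  for s, f in files: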
    if ftp == True:
      line = str('{:.2f}'.format(round(s/b, 2)))+' MB: ftp://'+ftp_host+f
    else:
      line = str('{:.2f}'.format(round(s/b, 2)))+' MB: '+f
    s_line = line
    if f.endswith(extensions):
      if i < 500:
        op(c.ok, '=>', line)
      s_line = '=> '+line
    else:
      if i < 500:
        print(line)
    if i == 500:
      remaining = total-500
      print('...and', str(remaining), 'more files of this size scale were found.')
      op(c.warn, '\nPrinting was stopped at 500.', 'Full file list is saved in a .txt file if you have save_txt checked.')
    if save_txt == True:
      apnd(s_line+'\n')
    i += 1
      
def retrieve_print_filelist(basedir, ftp=True):
  global size_limits, b
  all_files = []
  files_by_size = [[] for _ in size_limits]

  if ftp == True:
    recursive = host.walk(basedir, topdown=True, onerror=None) # recursive search 
  else:
    recursive = os.walk(basedir, topdown=True)
  print('Recursive file list retrieval in progress...')
  files_found = 0
  for root, dirs, files in recursive:
    for name in files:
      fullpath = os.path.join(root, name)
      if ftp == True:
        size = host.path.getsize(fullpath)
      else:
        size = os.path.getsize(fullpath)
      all_files.append([size, fullpath])

      # Put the file in the first (largest) size class whose lower bound it exceeds.
      # Breaking on the first match also keeps files that sit exactly on a class boundary.
      for i, limit in enumerate(size_limits):
        if size > limit:
          files_by_size[i].append([size, fullpath])
          files_found += 1
          break
  output.clear()
  op(c.ok, '\nResults\n\n')

  if files_found == 0:
    op(c.fail, 'No suitable files found.')
  else:
    for i, filelist in enumerate(files_by_size):
      if len(filelist) > 0:
        if i == 0:
          title = 'Over 1 GB'
        else:
          title = str(round(size_limits[i]/b))+'-'+str(round(size_limits[i-1]/b))+' MB'
        print_filelist(filelist, title, ftp=ftp)
    if save_txt == True:
      log_all.close()

extensions = fix_extensions(highlight_extensions)
save_path = fix_path(drive_root+save_dir)

output.clear()
op(c.ok, 'Setup finished.')


#<font color="#9d9">HTTP Crawler</font>
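
As a rough illustration of how an "Index of" page can be turned into a file list, here is a minimal sketch. It assumes a listing where each link is followed by plain text in the order date, time, size in bytes, and it uses the standard library's `html.parser` rather than BeautifulSoup. The names `IndexParser`, `parse_index` and `SAMPLE`, and the `example.org` URL, are hypothetical.

```python
from html.parser import HTMLParser

SKIP = (".", "..", "./", "../")  # current and parent directory entries

class IndexParser(HTMLParser):
    """Collect [size_in_bytes, url] pairs from an 'Index of' style listing."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url if base_url.endswith("/") else base_url + "/"
        self.entries = []
        self._accepted = False    # True while inside a link we keep
        self._await_meta = False  # True right after such a link closes

    def handle_starttag(self, tag, attrs):
        self._await_meta = False  # any new tag ends the "text after link" window
        if tag != "a":
            return
        href = dict(attrs).get("href")
        self._accepted = bool(href) and href not in SKIP and not href.startswith("?")
        if self._accepted:
            self.entries.append([0, self.base_url + href])

    def handle_endtag(self, tag):
        if tag == "a":
            self._await_meta = self._accepted
            self._accepted = False

    def handle_data(self, data):
        if not self._await_meta:
            return
        self._await_meta = False
        tokens = data.split()  # assumed layout: date, time, size
        if len(tokens) > 2 and tokens[2].isdigit():
            self.entries[-1][0] = int(tokens[2])

def parse_index(html, base_url):
    parser = IndexParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.entries

SAMPLE = """<html><body><h1>Index of /data/</h1><hr><pre><a href="../">../</a>
<a href="sub/">sub/</a>                 01-Jan-2020 12:00       -
<a href="big.csv">big.csv</a>           01-Jan-2020 12:00   25000000
</pre><hr></body></html>"""

entries = parse_index(SAMPLE, "http://example.org/data")
```

Directories keep a size of 0 because their size column holds `-`. Listings that print human-readable sizes such as `1.2M` would need extra parsing that this sketch does not attempt.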

In [None]:
#@markdown HTTP Crawler is meant strictly for http(s) addresses with an FTP-like view (normally titled `Index of /something`).
#@markdown Crawls **3 levels** deep in the file system tree structure.
web_url = "" #@param {type:"string"}

import requests
import urllib
from bs4 import BeautifulSoup

if save_txt == True:
  web_host = slug(web_url.replace('http://', '').replace('https://', '').split('/')[0])
  txt = save_path+'filelist_'+web_host+'_'+path_leaf(web_url)+'_'+rnd_str(4)+'.txt'
  apnd('WEB URL: '+web_url+'\n')
  apnd('WEB host: '+web_host+'\n')

def get_url_paths(url, ext='', params=None):
  # Return [size, full_url] pairs for every link on an "Index of" page.
  response = requests.get(url, params=params)
  response.raise_for_status()
  soup = BeautifulSoup(response.text, 'html.parser')
  # Make sure the base ends with a slash so base + href forms a valid URL.
  base = url if url.endswith('/') else url+'/'
  li = []
  for link in soup.find_all('a'):
    href = link.get('href')
    if not href or href in ('.', '..'):
      continue
    fullhref = base + href
    # The text right after the link is expected to hold "date time size".
    # Anything else (no text, a tag, fewer columns) is recorded as size 0.
    sibling = link.next_sibling
    meta = sibling.split() if isinstance(sibling, str) else []
    size = meta[2] if len(meta) > 2 else 0
    li.append([size, fullhref])
  return li

def crawl_dirs(links):
  # Fetch the listing of every subdirectory link (href ending in '/'),
  # skipping the current and parent directory entries.
  dir_contents = []
  for link in links:
    href = link[1]
    if href.endswith('/'):
      if not href.endswith('../') and not href.endswith('./'):
        dir_contents.append( get_url_paths(href) )
  return dir_contents


links = get_url_paths(web_url, '')
subs = crawl_dirs(links)

subsubs = []
for i, sub in enumerate(subs):
  subsubs.append(crawl_dirs(sub))

def size_filter(links):
  global size_limits, b
  files_by_size = [[] for _ in size_limits]
  files_found = 0
  for link in links:
    size, href = link[0], link[1]
    if href.endswith('/'):
      continue
    # Sizes arrive as strings from the listing; skip entries without a plain byte count.
    try:
      size = int(size)
    except (TypeError, ValueError):
      continue
    # Put the file in the first (largest) size class whose lower bound it exceeds.
    for i, limit in enumerate(size_limits):
      if size > limit:
        files_by_size[i].append([size, href])
        files_found += 1
        break

  output.clear()
  op(c.ok, '\nResults\n\n')
  if files_found == 0:
    op(c.fail, 'No suitable files found.')
  else:
    for i, filelist in enumerate(files_by_size):
      if len(filelist) > 0:
        if i == 0:
          title = 'Over 1 GB'
        else:
          title = str(round(size_limits[i]/b))+'-'+str(round(size_limits[i-1]/b))+' MB'
        print_filelist(filelist, title, ftp=False)


# Flatten the three crawled levels into one list before filtering by size.
all_links = []

# Level 1
all_links.extend(links)

# Level 2
for sub in subs:
  all_links.extend(sub)

# Level 3
for subsub in subsubs:
  for subsubst in subsub:
    all_links.extend(subsubst)

size_filter(all_links)



#<font color="#99d">FTP Crawler</font>

In [None]:
#@markdown FTP Crawler should not be used on upper level directories, or it may take **hours** to complete the listing. 
#@markdown Crawls to infinite depths in the file system tree structure.

ftp_url = "" #@param {type:"string"}

ftp_address = ftp_url.replace('ftp://', '', 1)
# Split at the first slash only, so a host name that reappears in the path is left intact.
ftp_host, _, ftp_path = ftp_address.partition('/')
basedir = '/'+ftp_path

if save_txt == True:
  txt = save_path+'filelist_'+ftp_host+'_'+rnd_str(4)+'.txt'
  apnd('FTP URL: '+ftp_url+'\n')
  apnd('FTP host: '+ftp_host+'\n')
  apnd('FTP dir: '+basedir+'\n\n')
output.clear()
op(c.title, 'Logging in to '+ftp_host+'...')
host = ftputil.FTPHost(ftp_host, 'anonymous', 'anonymous@domain.com')

retrieve_print_filelist(basedir, ftp=True)
host.close()  # release the FTP connection once the listing is done

#<font color="#9dd">ZIP Crawler</font>
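
For plain `.zip` files, the contents can also be inspected without extracting anything, using Python's standard `zipfile` module. This is an alternative sketch rather than what the cell below does (it downloads and unpacks the archive). `list_zip_members` and its `min_bytes` default are hypothetical, and `.tar.gz` archives would need the `tarfile` module instead.

```python
import io
import zipfile

def list_zip_members(zip_source, min_bytes=20_000_000):
    """Return (uncompressed_size, member_name) pairs larger than min_bytes, largest first."""
    with zipfile.ZipFile(zip_source) as archive:
        members = [
            (info.file_size, info.filename)
            for info in archive.infolist()
            if not info.is_dir() and info.file_size > min_bytes
        ]
    return sorted(members, reverse=True)

# Usage with a small in-memory archive and a low threshold, for illustration only.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("dir/a.csv", "x" * 100)
    archive.writestr("b.txt", "y" * 10)
buffer.seek(0)
large_members = list_zip_members(buffer, min_bytes=50)
```

`zip_source` can be a file path or any seekable file-like object, which is how `zipfile.ZipFile` accepts its input.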

In [None]:
#@markdown ZIP Crawler is to be used for compressed files (`.zip`, `.gz`, `.tar.gz`). Crawls to infinite depths within the zip's enclosing tree structure.
zip_url = "" #@param {type:"string"}

if save_txt == True:
  zipfile = basename(zip_url)
  txt = save_path+'filelist_zip_'+zipfile+'_'+rnd_str(4)+'.txt'
  apnd('ZIP URL: '+zip_url+'\n\n')

def is_zipzip(path):
  return path_ext(path).lower() == '.zip'
  
def is_gz(path):
  return path_ext(path).lower() == '.gz'

zip_ext = path_ext(zip_url, True)
#id = rnd_str(6)
wfile = slug(basename(zip_url))
wdir = dir_tmp+wfile+'/'
wext = path_ext(zip_url)
file_path = wdir+wfile+wext
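if not os.path.isdir(wdir):
  !mkdir -p "{wdir}"
  !wget "{zip_url}" -O "{file_path}"
  if is_gz(file_path):
    if '.tar.gz' in path_leaf(zip_url).lower():
      # Extract into the working directory so the file listing below can find the contents.
      !tar xvzf "{file_path}" -C "{wdir}"
    else:
      !gunzip "{file_path}"
  elif is_zipzip(file_path):
    %cd {wdir}
    !unzip "{file_path}"
    %cd /content/
  # -f: the archive may already be gone, since gunzip replaces it with the extracted file.
  !rm -f "{file_path}"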

retrieve_print_filelist(wdir, ftp=False)


##<font color="#9dd">Drive Stasher</font>

In [None]:
#@markdown Use this cell to stash data files from ZIP Crawler to your Drive.<br>
#@markdown <small>Paste file path and run cell.</small>
copy_file = "" #@param {type:"string"}

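!cp "{copy_file}" "{save_path}"
# Report the location of the copy in Drive rather than the temporary source path.
drive_file = os.path.join(save_path, path_leaf(copy_file))
op(c.ok, 'Copied to Drive:', path_leaf(copy_file))
print('You may now use this file in Sloppy Noto. Set data_file:', drive_file)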
if save_txt == True:
  apnd('\nCopied '+path_leaf(copy_file)+' to '+save_path)