# In an attempt to speed up the scraping, I want to try different delay parameters to find the fastest setting that does not trigger HTTP 429 (Too Many Requests) responses from the server

## I'll be using this notebook as a wrapper for the script in `medium_scraper_tag_archive_TIME-TEST.py` and setting the download delay here directly

[`DOWNLOAD_DELAY` parameter info](https://docs.scrapy.org/en/latest/topics/settings.html)

In [19]:
import os
import time
import numpy as np
import pandas as pd
!pwd

/Users/fardila/Documents/GitHub/W3LL/content_recommendation_engine/scraping


In [30]:
def time_script(script_string):
    """time the run"""
    t0 = time.time()
    !{script_string}
    runtime = round(time.time() - t0,2)
    print('{} seconds'.format(runtime))
    return runtime

def check_output(output_file):
    """count the scraped articles in the JSON output file"""
    df = pd.read_json(output_file)
    return len(df)
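
The `!{script_string}` syntax above only works inside IPython/Jupyter. As a rough sketch of how the same timing could be done in a plain Python script, the hypothetical helper `time_command` below (not part of the original notebook) uses `subprocess.run` and takes the command as a list of arguments. The demo call simply runs the Python interpreter itself as a stand-in for the scraper.

```python
import subprocess
import sys
import time


def time_command(args):
    """Run a command given as a list of arguments.

    Returns (runtime_in_seconds, return_code).
    """
    t0 = time.perf_counter()
    completed = subprocess.run(args)
    runtime = round(time.perf_counter() - t0, 2)
    return runtime, completed.returncode


# Stand-in command: the real call would pass the scrapy arguments instead.
runtime, code = time_command([sys.executable, "-c", "print('hello')"])
print('{} seconds (exit code {})'.format(runtime, code))
```

Returning the exit code alongside the runtime makes it possible to tell a fast run from a run that failed early.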

In [39]:
def run_string(delay):
    """run script from here diven a delay"""
    start_date = '20180101'
    end_date = '20180107'
    tag = 'health'

    scrapper_file = 'medium_scraper_tag_archive_TIME-TEST.py'
    log_file = 'logs/'+start_date+tag+end_date+'.log'

    output_dir = 'scraped_data/'
    output_file = output_dir+start_date+tag+end_date+'_{0}.json'.format(delay)
    if os.path.exists(output_file):
        os.remove(output_file)
        
    command = 'scrapy runspider -a tag={tag} -a start_date={start_date} -a end_date={end_date} --logfile {log_file} {scrapper_file} -o {output_file} -s DOWNLOAD_DELAY={delay}'.format(tag=tag,start_date=start_date,end_date=end_date,scrapper_file=scrapper_file,output_file=output_file, log_file = log_file, delay=delay)
    print(command)
    
    return command, output_file
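
As an alternative to one long format string, the same command can be assembled as a list of arguments, which is easier to read and suitable for passing to `subprocess.run`. This is only a sketch: `build_scrapy_args` is a hypothetical helper, and the flags are copied from the command string above rather than added.

```python
import shlex


def build_scrapy_args(tag, start_date, end_date, delay,
                      spider_file, output_file, log_file):
    """Return the scrapy command as a list of arguments."""
    return [
        'scrapy', 'runspider',
        '-a', 'tag={}'.format(tag),
        '-a', 'start_date={}'.format(start_date),
        '-a', 'end_date={}'.format(end_date),
        '--logfile', log_file,
        spider_file,
        '-o', output_file,
        '-s', 'DOWNLOAD_DELAY={}'.format(delay),
    ]


args = build_scrapy_args(
    tag='health',
    start_date='20180101',
    end_date='20180107',
    delay=0.2,
    spider_file='medium_scraper_tag_archive_TIME-TEST.py',
    output_file='scraped_data/20180101health20180107_0.2.json',
    log_file='logs/20180101health20180107.log',
)
# Quote each argument so the printed line can be pasted into a shell.
print(' '.join(shlex.quote(a) for a in args))
```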



## Between 0 and 1

In [44]:
np.linspace(0,1,11)

array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ])

In [45]:
delays = np.linspace(0,1,11)
list_runtimes = []
list_n_articles=[]
for delay in delays:
    command, output_file = run_string(delay)
    runtime = time_script(command)
    n_articles = check_output(output_file)
    
    list_runtimes.append(runtime)
    list_n_articles.append(n_articles)

scrapy runspider -a tag=health -a start_date=20180101 -a end_date=20180107 --logfile logs/20180101health20180107.log medium_scraper_tag_archive_TIME-TEST.py -o scraped_data/20180101health20180107_0.0.json -s DOWNLOAD_DELAY=0.0
['https://medium.com/tag/health/archive/2018/01/01', 'https://medium.com/tag/health/archive/2018/01/02', 'https://medium.com/tag/health/archive/2018/01/03', 'https://medium.com/tag/health/archive/2018/01/04', 'https://medium.com/tag/health/archive/2018/01/05', 'https://medium.com/tag/health/archive/2018/01/06', 'https://medium.com/tag/health/archive/2018/01/07']
***** 76 articles on 20180107, 76 total articles so far ****
***** 103 articles on 20180106, 179 total articles so far ****
***** 92 articles on 20180101, 271 total articles so far ****
***** 141 articles on 20180102, 412 total articles so far ****
***** 163 articles on 20180104, 575 total articles so far ****
***** 164 articles on 20180105, 739 total articles so far ****
***** 159 articles on 20180103, 8

***** 164 articles on 20180105, 719 total articles so far ****
***** 103 articles on 20180106, 822 total articles so far ****
***** 76 articles on 20180107, 898 total articles so far ****
813.56 seconds
scrapy runspider -a tag=health -a start_date=20180101 -a end_date=20180107 --logfile logs/20180101health20180107.log medium_scraper_tag_archive_TIME-TEST.py -o scraped_data/20180101health20180107_0.8.json -s DOWNLOAD_DELAY=0.8
['https://medium.com/tag/health/archive/2018/01/01', 'https://medium.com/tag/health/archive/2018/01/02', 'https://medium.com/tag/health/archive/2018/01/03', 'https://medium.com/tag/health/archive/2018/01/04', 'https://medium.com/tag/health/archive/2018/01/05', 'https://medium.com/tag/health/archive/2018/01/06', 'https://medium.com/tag/health/archive/2018/01/07']
***** 92 articles on 20180101, 92 total articles so far ****
***** 141 articles on 20180102, 233 total articles so far ****
***** 159 articles on 20180103, 392 total articles so far ****
***** 163 articles

In [47]:
list_runtimes

[62.39,
 133.38,
 229.38,
 334.6,
 444.56,
 560.53,
 663.12,
 813.56,
 887.5,
 1007.02,
 1113.05]

In [46]:
list_n_articles

[446, 885, 894, 894, 894, 894, 894, 878, 894, 894, 894]

Seems like by a delay of 0.2 we start getting the right number of articles (894)... though some are missing at 0.7 (878)??
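
To avoid reading the two lists by eye, the delays can be paired with the article counts and the incomplete runs flagged. The sketch below hard-codes the numbers printed above and assumes that the largest count observed (894) is the complete set, which the notebook does not verify independently.

```python
# Values copied from the outputs above (delays from np.linspace(0, 1, 11)).
delays = [round(0.1 * i, 1) for i in range(11)]
runtimes = [62.39, 133.38, 229.38, 334.6, 444.56, 560.53,
            663.12, 813.56, 887.5, 1007.02, 1113.05]
n_articles = [446, 885, 894, 894, 894, 894, 894, 878, 894, 894, 894]

# Assumption: the largest count seen corresponds to the full set of articles.
expected = max(n_articles)

# Keep every (delay, count) pair that fell short of the expected count.
incomplete = [(d, n) for d, n in zip(delays, n_articles) if n < expected]

for d, n in incomplete:
    print('delay {}: {} of {} articles'.format(d, n, expected))
```

With these numbers the flagged delays are 0.0, 0.1, and 0.7.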

## Between 0 and 0.3

In [51]:
np.linspace(0,0.3,11)

array([0.  , 0.03, 0.06, 0.09, 0.12, 0.15, 0.18, 0.21, 0.24, 0.27, 0.3 ])

In [52]:
delays = np.linspace(0,0.3,11)
list_runtimes = []
list_n_articles=[]
for delay in delays:
    command, output_file = run_string(delay)
    runtime = time_script(command)
    n_articles = check_output(output_file)
    
    list_runtimes.append(runtime)
    list_n_articles.append(n_articles)

scrapy runspider -a tag=health -a start_date=20180101 -a end_date=20180107 --logfile logs/20180101health20180107.log medium_scraper_tag_archive_TIME-TEST.py -o scraped_data/20180101health20180107_0.0.json -s DOWNLOAD_DELAY=0.0
['https://medium.com/tag/health/archive/2018/01/01', 'https://medium.com/tag/health/archive/2018/01/02', 'https://medium.com/tag/health/archive/2018/01/03', 'https://medium.com/tag/health/archive/2018/01/04', 'https://medium.com/tag/health/archive/2018/01/05', 'https://medium.com/tag/health/archive/2018/01/06', 'https://medium.com/tag/health/archive/2018/01/07']
***** 76 articles on 20180107, 76 total articles so far ****
***** 92 articles on 20180101, 168 total articles so far ****
***** 159 articles on 20180103, 327 total articles so far ****
***** 103 articles on 20180106, 430 total articles so far ****
***** 163 articles on 20180104, 593 total articles so far ****
***** 164 articles on 20180105, 757 total articles so far ****
***** 141 articles on 20180102, 8

***** 159 articles on 20180103, 734 total articles so far ****
***** 164 articles on 20180105, 898 total articles so far ****
234.38 seconds
scrapy runspider -a tag=health -a start_date=20180101 -a end_date=20180107 --logfile logs/20180101health20180107.log medium_scraper_tag_archive_TIME-TEST.py -o scraped_data/20180101health20180107_0.24.json -s DOWNLOAD_DELAY=0.24
['https://medium.com/tag/health/archive/2018/01/01', 'https://medium.com/tag/health/archive/2018/01/02', 'https://medium.com/tag/health/archive/2018/01/03', 'https://medium.com/tag/health/archive/2018/01/04', 'https://medium.com/tag/health/archive/2018/01/05', 'https://medium.com/tag/health/archive/2018/01/06', 'https://medium.com/tag/health/archive/2018/01/07']
***** 92 articles on 20180101, 92 total articles so far ****
***** 141 articles on 20180102, 233 total articles so far ****
***** 76 articles on 20180107, 309 total articles so far ****
***** 103 articles on 20180106, 412 total articles so far ****
***** 159 articl

In [53]:
list_n_articles

[483, 533, 727, 858, 894, 894, 894, 894, 894, 894, 894]

In [54]:
list_runtimes

[69.89,
 83.55,
 126.82,
 134.54,
 136.18,
 169.59,
 202.62,
 234.38,
 267.74,
 305.52,
 332.74]

Seems like by 0.12 we have the right number of articles (894)

## Between 0.1 and 0.15

In [55]:
np.linspace(0.1,0.15,11)

array([0.1  , 0.105, 0.11 , 0.115, 0.12 , 0.125, 0.13 , 0.135, 0.14 ,
       0.145, 0.15 ])

In [56]:
delays = np.linspace(0.1,0.15,11)
list_runtimes = []
list_n_articles=[]
for delay in delays:
    command, output_file = run_string(delay)
    runtime = time_script(command)
    n_articles = check_output(output_file)
    
    list_runtimes.append(runtime)
    list_n_articles.append(n_articles)

scrapy runspider -a tag=health -a start_date=20180101 -a end_date=20180107 --logfile logs/20180101health20180107.log medium_scraper_tag_archive_TIME-TEST.py -o scraped_data/20180101health20180107_0.1.json -s DOWNLOAD_DELAY=0.1
['https://medium.com/tag/health/archive/2018/01/01', 'https://medium.com/tag/health/archive/2018/01/02', 'https://medium.com/tag/health/archive/2018/01/03', 'https://medium.com/tag/health/archive/2018/01/04', 'https://medium.com/tag/health/archive/2018/01/05', 'https://medium.com/tag/health/archive/2018/01/06', 'https://medium.com/tag/health/archive/2018/01/07']
***** 76 articles on 20180107, 76 total articles so far ****
***** 141 articles on 20180102, 217 total articles so far ****
***** 92 articles on 20180101, 309 total articles so far ****
***** 159 articles on 20180103, 468 total articles so far ****
***** 103 articles on 20180106, 571 total articles so far ****
***** 164 articles on 20180105, 735 total articles so far ****
***** 163 articles on 20180104, 8

***** 159 articles on 20180103, 571 total articles so far ****
***** 164 articles on 20180105, 735 total articles so far ****
***** 163 articles on 20180104, 898 total articles so far ****
154.05 seconds
scrapy runspider -a tag=health -a start_date=20180101 -a end_date=20180107 --logfile logs/20180101health20180107.log medium_scraper_tag_archive_TIME-TEST.py -o scraped_data/20180101health20180107_0.14.json -s DOWNLOAD_DELAY=0.14
['https://medium.com/tag/health/archive/2018/01/01', 'https://medium.com/tag/health/archive/2018/01/02', 'https://medium.com/tag/health/archive/2018/01/03', 'https://medium.com/tag/health/archive/2018/01/04', 'https://medium.com/tag/health/archive/2018/01/05', 'https://medium.com/tag/health/archive/2018/01/06', 'https://medium.com/tag/health/archive/2018/01/07']
***** 92 articles on 20180101, 92 total articles so far ****
***** 76 articles on 20180107, 168 total articles so far ****
***** 141 articles on 20180102, 309 total articles so far ****
***** 103 articl

In [57]:
list_n_articles

[877, 883, 874, 870, 892, 894, 894, 894, 893, 894, 894]

In [58]:
list_runtimes

[145.38,
 136.48,
 158.59,
 184.12,
 141.78,
 145.01,
 150.2,
 154.05,
 168.16,
 165.49,
 169.29]

Seems like by 0.125 we have the right number of articles (894), although the run at 0.14 came back one short (893)

Something to keep in mind is that, with Scrapy's default `RANDOMIZE_DOWNLOAD_DELAY` setting, the true delay is a random amount of time (between 0.5 * `DOWNLOAD_DELAY` and 1.5 * `DOWNLOAD_DELAY`), so for some fraction of requests the delay will still be too short and we will get failed requests. We can use a longer scrape to get more statistics.
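
To get a feel for how large that fraction might be, here is a small sketch. It assumes each wait is drawn uniformly from [0.5 * `DOWNLOAD_DELAY`, 1.5 * `DOWNLOAD_DELAY`] and that the server rejects any request arriving sooner than some fixed `threshold` after the previous one. That threshold model is purely hypothetical (real rate limiting is probably window-based), and the value 0.1 is made up for illustration. The function names are not from Scrapy.

```python
import random


def fraction_below(delay, threshold):
    """Fraction of waits shorter than `threshold`, assuming each wait is
    uniform on [0.5 * delay, 1.5 * delay]."""
    if delay <= 0:
        return 1.0 if threshold > 0 else 0.0
    low = 0.5 * delay
    # The interval has width `delay`, so the fraction is a clipped ratio.
    fraction = (threshold - low) / delay
    return min(1.0, max(0.0, fraction))


def simulated_fraction_below(delay, threshold, n=20000, seed=0):
    """Monte Carlo check of fraction_below using random.uniform."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n)
               if rng.uniform(0.5 * delay, 1.5 * delay) < threshold)
    return hits / n


threshold = 0.1  # hypothetical minimum spacing tolerated by the server
for delay in (0.13, 0.15, 0.17, 0.2):
    print(delay,
          round(fraction_below(delay, threshold), 3),
          round(simulated_fraction_below(delay, threshold), 3))
```

Under this model the fraction only reaches zero once 0.5 * `DOWNLOAD_DELAY` is at least the threshold, which is one reason a short scrape can look clean while a longer one still hits occasional failures.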

## Try a longer scrape: 1 month

In [62]:
def run_string(delay):
    start_date = '20180101'
    end_date = '20180201'
    tag = 'health'

    scrapper_file = 'medium_scraper_tag_archive_TIME-TEST.py'
    log_file = 'logs/'+start_date+tag+end_date+'_{0}.log'.format(delay)

    output_dir = 'scraped_data/'
    output_file = output_dir+start_date+tag+end_date+'_{0}.json'.format(delay)
    if os.path.exists(output_file):
        os.remove(output_file)
        
    command = 'scrapy runspider -a tag={tag} -a start_date={start_date} -a end_date={end_date} --logfile {log_file} {scrapper_file} -o {output_file} -s DOWNLOAD_DELAY={delay}'.format(tag=tag,start_date=start_date,end_date=end_date,scrapper_file=scrapper_file,output_file=output_file, log_file = log_file, delay=delay)
    print(command)
    
    return command, output_file


In [60]:
delay = 0.13

command, output_file = run_string(delay)
runtime = time_script(command)
n_articles = check_output(output_file)

print('{0} seconds; {1} articles'.format(runtime, n_articles))

scrapy runspider -a tag=health -a start_date=20180101 -a end_date=20180201 --logfile logs/20180101health20180201.log medium_scraper_tag_archive_TIME-TEST.py -o scraped_data/20180101health20180201_0.13.json -s DOWNLOAD_DELAY=0.13
['https://medium.com/tag/health/archive/2018/01/01', 'https://medium.com/tag/health/archive/2018/01/02', 'https://medium.com/tag/health/archive/2018/01/03', 'https://medium.com/tag/health/archive/2018/01/04', 'https://medium.com/tag/health/archive/2018/01/05', 'https://medium.com/tag/health/archive/2018/01/06', 'https://medium.com/tag/health/archive/2018/01/07', 'https://medium.com/tag/health/archive/2018/01/08', 'https://medium.com/tag/health/archive/2018/01/09', 'https://medium.com/tag/health/archive/2018/01/10', 'https://medium.com/tag/health/archive/2018/01/11', 'https://medium.com/tag/health/archive/2018/01/12', 'https://medium.com/tag/health/archive/2018/01/13', 'https://medium.com/tag/health/archive/2018/01/14', 'https://medium.com/tag/health/archive/201

0.13 missed a few articles

In [61]:
delay = 0.2

command, output_file = run_string(delay)
runtime = time_script(command)
n_articles = check_output(output_file)

print('{0} seconds; {1} articles'.format(runtime, n_articles))

scrapy runspider -a tag=health -a start_date=20180101 -a end_date=20180201 --logfile logs/20180101health20180201.log medium_scraper_tag_archive_TIME-TEST.py -o scraped_data/20180101health20180201_0.2.json -s DOWNLOAD_DELAY=0.2
['https://medium.com/tag/health/archive/2018/01/01', 'https://medium.com/tag/health/archive/2018/01/02', 'https://medium.com/tag/health/archive/2018/01/03', 'https://medium.com/tag/health/archive/2018/01/04', 'https://medium.com/tag/health/archive/2018/01/05', 'https://medium.com/tag/health/archive/2018/01/06', 'https://medium.com/tag/health/archive/2018/01/07', 'https://medium.com/tag/health/archive/2018/01/08', 'https://medium.com/tag/health/archive/2018/01/09', 'https://medium.com/tag/health/archive/2018/01/10', 'https://medium.com/tag/health/archive/2018/01/11', 'https://medium.com/tag/health/archive/2018/01/12', 'https://medium.com/tag/health/archive/2018/01/13', 'https://medium.com/tag/health/archive/2018/01/14', 'https://medium.com/tag/health/archive/2018/

0.2 caught all articles

In [63]:
delay = 0.15

command, output_file = run_string(delay)
runtime = time_script(command)
n_articles = check_output(output_file)

print('{0} seconds; {1} articles'.format(runtime, n_articles))

scrapy runspider -a tag=health -a start_date=20180101 -a end_date=20180201 --logfile logs/20180101health20180201_0.15.log medium_scraper_tag_archive_TIME-TEST.py -o scraped_data/20180101health20180201_0.15.json -s DOWNLOAD_DELAY=0.15
['https://medium.com/tag/health/archive/2018/01/01', 'https://medium.com/tag/health/archive/2018/01/02', 'https://medium.com/tag/health/archive/2018/01/03', 'https://medium.com/tag/health/archive/2018/01/04', 'https://medium.com/tag/health/archive/2018/01/05', 'https://medium.com/tag/health/archive/2018/01/06', 'https://medium.com/tag/health/archive/2018/01/07', 'https://medium.com/tag/health/archive/2018/01/08', 'https://medium.com/tag/health/archive/2018/01/09', 'https://medium.com/tag/health/archive/2018/01/10', 'https://medium.com/tag/health/archive/2018/01/11', 'https://medium.com/tag/health/archive/2018/01/12', 'https://medium.com/tag/health/archive/2018/01/13', 'https://medium.com/tag/health/archive/2018/01/14', 'https://medium.com/tag/health/archiv

0.15 caught all articles, but some 429 errors were raised and only resolved by the third attempt. In practice it would be best to avoid getting any errors at all, so we will try a higher value.

In [64]:
delay = 0.17

command, output_file = run_string(delay)
runtime = time_script(command)
n_articles = check_output(output_file)

print('{0} seconds; {1} articles'.format(runtime, n_articles))

scrapy runspider -a tag=health -a start_date=20180101 -a end_date=20180201 --logfile logs/20180101health20180201_0.17.log medium_scraper_tag_archive_TIME-TEST.py -o scraped_data/20180101health20180201_0.17.json -s DOWNLOAD_DELAY=0.17
['https://medium.com/tag/health/archive/2018/01/01', 'https://medium.com/tag/health/archive/2018/01/02', 'https://medium.com/tag/health/archive/2018/01/03', 'https://medium.com/tag/health/archive/2018/01/04', 'https://medium.com/tag/health/archive/2018/01/05', 'https://medium.com/tag/health/archive/2018/01/06', 'https://medium.com/tag/health/archive/2018/01/07', 'https://medium.com/tag/health/archive/2018/01/08', 'https://medium.com/tag/health/archive/2018/01/09', 'https://medium.com/tag/health/archive/2018/01/10', 'https://medium.com/tag/health/archive/2018/01/11', 'https://medium.com/tag/health/archive/2018/01/12', 'https://medium.com/tag/health/archive/2018/01/13', 'https://medium.com/tag/health/archive/2018/01/14', 'https://medium.com/tag/health/archiv

In [65]:
delay = 0.16

command, output_file = run_string(delay)
runtime = time_script(command)
n_articles = check_output(output_file)

print('{0} seconds; {1} articles'.format(runtime, n_articles))

scrapy runspider -a tag=health -a start_date=20180101 -a end_date=20180201 --logfile logs/20180101health20180201_0.16.log medium_scraper_tag_archive_TIME-TEST.py -o scraped_data/20180101health20180201_0.16.json -s DOWNLOAD_DELAY=0.16
['https://medium.com/tag/health/archive/2018/01/01', 'https://medium.com/tag/health/archive/2018/01/02', 'https://medium.com/tag/health/archive/2018/01/03', 'https://medium.com/tag/health/archive/2018/01/04', 'https://medium.com/tag/health/archive/2018/01/05', 'https://medium.com/tag/health/archive/2018/01/06', 'https://medium.com/tag/health/archive/2018/01/07', 'https://medium.com/tag/health/archive/2018/01/08', 'https://medium.com/tag/health/archive/2018/01/09', 'https://medium.com/tag/health/archive/2018/01/10', 'https://medium.com/tag/health/archive/2018/01/11', 'https://medium.com/tag/health/archive/2018/01/12', 'https://medium.com/tag/health/archive/2018/01/13', 'https://medium.com/tag/health/archive/2018/01/14', 'https://medium.com/tag/health/archiv

In [66]:
delay = 0.165

command, output_file = run_string(delay)
runtime = time_script(command)
n_articles = check_output(output_file)

print('{0} seconds; {1} articles'.format(runtime, n_articles))

scrapy runspider -a tag=health -a start_date=20180101 -a end_date=20180201 --logfile logs/20180101health20180201_0.165.log medium_scraper_tag_archive_TIME-TEST.py -o scraped_data/20180101health20180201_0.165.json -s DOWNLOAD_DELAY=0.165
['https://medium.com/tag/health/archive/2018/01/01', 'https://medium.com/tag/health/archive/2018/01/02', 'https://medium.com/tag/health/archive/2018/01/03', 'https://medium.com/tag/health/archive/2018/01/04', 'https://medium.com/tag/health/archive/2018/01/05', 'https://medium.com/tag/health/archive/2018/01/06', 'https://medium.com/tag/health/archive/2018/01/07', 'https://medium.com/tag/health/archive/2018/01/08', 'https://medium.com/tag/health/archive/2018/01/09', 'https://medium.com/tag/health/archive/2018/01/10', 'https://medium.com/tag/health/archive/2018/01/11', 'https://medium.com/tag/health/archive/2018/01/12', 'https://medium.com/tag/health/archive/2018/01/13', 'https://medium.com/tag/health/archive/2018/01/14', 'https://medium.com/tag/health/arc

In [67]:
delay = 0.155

command, output_file = run_string(delay)
runtime = time_script(command)
n_articles = check_output(output_file)

print('{0} seconds; {1} articles'.format(runtime, n_articles))

scrapy runspider -a tag=health -a start_date=20180101 -a end_date=20180201 --logfile logs/20180101health20180201_0.155.log medium_scraper_tag_archive_TIME-TEST.py -o scraped_data/20180101health20180201_0.155.json -s DOWNLOAD_DELAY=0.155
['https://medium.com/tag/health/archive/2018/01/01', 'https://medium.com/tag/health/archive/2018/01/02', 'https://medium.com/tag/health/archive/2018/01/03', 'https://medium.com/tag/health/archive/2018/01/04', 'https://medium.com/tag/health/archive/2018/01/05', 'https://medium.com/tag/health/archive/2018/01/06', 'https://medium.com/tag/health/archive/2018/01/07', 'https://medium.com/tag/health/archive/2018/01/08', 'https://medium.com/tag/health/archive/2018/01/09', 'https://medium.com/tag/health/archive/2018/01/10', 'https://medium.com/tag/health/archive/2018/01/11', 'https://medium.com/tag/health/archive/2018/01/12', 'https://medium.com/tag/health/archive/2018/01/13', 'https://medium.com/tag/health/archive/2018/01/14', 'https://medium.com/tag/health/arc

In [68]:
delay = 0.125

command, output_file = run_string(delay)
runtime = time_script(command)
n_articles = check_output(output_file)

print('{0} seconds; {1} articles'.format(runtime, n_articles))

scrapy runspider -a tag=health -a start_date=20180101 -a end_date=20180201 --logfile logs/20180101health20180201_0.125.log medium_scraper_tag_archive_TIME-TEST.py -o scraped_data/20180101health20180201_0.125.json -s DOWNLOAD_DELAY=0.125
['https://medium.com/tag/health/archive/2018/01/01', 'https://medium.com/tag/health/archive/2018/01/02', 'https://medium.com/tag/health/archive/2018/01/03', 'https://medium.com/tag/health/archive/2018/01/04', 'https://medium.com/tag/health/archive/2018/01/05', 'https://medium.com/tag/health/archive/2018/01/06', 'https://medium.com/tag/health/archive/2018/01/07', 'https://medium.com/tag/health/archive/2018/01/08', 'https://medium.com/tag/health/archive/2018/01/09', 'https://medium.com/tag/health/archive/2018/01/10', 'https://medium.com/tag/health/archive/2018/01/11', 'https://medium.com/tag/health/archive/2018/01/12', 'https://medium.com/tag/health/archive/2018/01/13', 'https://medium.com/tag/health/archive/2018/01/14', 'https://medium.com/tag/health/arc

In [69]:
delay = 0.1525

command, output_file = run_string(delay)
runtime = time_script(command)
n_articles = check_output(output_file)

print('{0} seconds; {1} articles'.format(runtime, n_articles))

scrapy runspider -a tag=health -a start_date=20180101 -a end_date=20180201 --logfile logs/20180101health20180201_0.1525.log medium_scraper_tag_archive_TIME-TEST.py -o scraped_data/20180101health20180201_0.1525.json -s DOWNLOAD_DELAY=0.1525
['https://medium.com/tag/health/archive/2018/01/01', 'https://medium.com/tag/health/archive/2018/01/02', 'https://medium.com/tag/health/archive/2018/01/03', 'https://medium.com/tag/health/archive/2018/01/04', 'https://medium.com/tag/health/archive/2018/01/05', 'https://medium.com/tag/health/archive/2018/01/06', 'https://medium.com/tag/health/archive/2018/01/07', 'https://medium.com/tag/health/archive/2018/01/08', 'https://medium.com/tag/health/archive/2018/01/09', 'https://medium.com/tag/health/archive/2018/01/10', 'https://medium.com/tag/health/archive/2018/01/11', 'https://medium.com/tag/health/archive/2018/01/12', 'https://medium.com/tag/health/archive/2018/01/13', 'https://medium.com/tag/health/archive/2018/01/14', 'https://medium.com/tag/health/

In [70]:
delay = 0.1575

command, output_file = run_string(delay)
runtime = time_script(command)
n_articles = check_output(output_file)

print('{0} seconds; {1} articles'.format(runtime, n_articles))

scrapy runspider -a tag=health -a start_date=20180101 -a end_date=20180201 --logfile logs/20180101health20180201_0.1575.log medium_scraper_tag_archive_TIME-TEST.py -o scraped_data/20180101health20180201_0.1575.json -s DOWNLOAD_DELAY=0.1575
['https://medium.com/tag/health/archive/2018/01/01', 'https://medium.com/tag/health/archive/2018/01/02', 'https://medium.com/tag/health/archive/2018/01/03', 'https://medium.com/tag/health/archive/2018/01/04', 'https://medium.com/tag/health/archive/2018/01/05', 'https://medium.com/tag/health/archive/2018/01/06', 'https://medium.com/tag/health/archive/2018/01/07', 'https://medium.com/tag/health/archive/2018/01/08', 'https://medium.com/tag/health/archive/2018/01/09', 'https://medium.com/tag/health/archive/2018/01/10', 'https://medium.com/tag/health/archive/2018/01/11', 'https://medium.com/tag/health/archive/2018/01/12', 'https://medium.com/tag/health/archive/2018/01/13', 'https://medium.com/tag/health/archive/2018/01/14', 'https://medium.com/tag/health/

In [71]:
delay = 0.1625

command, output_file = run_string(delay)
runtime = time_script(command)
n_articles = check_output(output_file)

print('{0} seconds; {1} articles'.format(runtime, n_articles))

scrapy runspider -a tag=health -a start_date=20180101 -a end_date=20180201 --logfile logs/20180101health20180201_0.1625.log medium_scraper_tag_archive_TIME-TEST.py -o scraped_data/20180101health20180201_0.1625.json -s DOWNLOAD_DELAY=0.1625
['https://medium.com/tag/health/archive/2018/01/01', 'https://medium.com/tag/health/archive/2018/01/02', 'https://medium.com/tag/health/archive/2018/01/03', 'https://medium.com/tag/health/archive/2018/01/04', 'https://medium.com/tag/health/archive/2018/01/05', 'https://medium.com/tag/health/archive/2018/01/06', 'https://medium.com/tag/health/archive/2018/01/07', 'https://medium.com/tag/health/archive/2018/01/08', 'https://medium.com/tag/health/archive/2018/01/09', 'https://medium.com/tag/health/archive/2018/01/10', 'https://medium.com/tag/health/archive/2018/01/11', 'https://medium.com/tag/health/archive/2018/01/12', 'https://medium.com/tag/health/archive/2018/01/13', 'https://medium.com/tag/health/archive/2018/01/14', 'https://medium.com/tag/health/

In [72]:
delay = 0.1675

command, output_file = run_string(delay)
runtime = time_script(command)
n_articles = check_output(output_file)

print('{0} seconds; {1} articles'.format(runtime, n_articles))

scrapy runspider -a tag=health -a start_date=20180101 -a end_date=20180201 --logfile logs/20180101health20180201_0.1675.log medium_scraper_tag_archive_TIME-TEST.py -o scraped_data/20180101health20180201_0.1675.json -s DOWNLOAD_DELAY=0.1675
['https://medium.com/tag/health/archive/2018/01/01', 'https://medium.com/tag/health/archive/2018/01/02', 'https://medium.com/tag/health/archive/2018/01/03', 'https://medium.com/tag/health/archive/2018/01/04', 'https://medium.com/tag/health/archive/2018/01/05', 'https://medium.com/tag/health/archive/2018/01/06', 'https://medium.com/tag/health/archive/2018/01/07', 'https://medium.com/tag/health/archive/2018/01/08', 'https://medium.com/tag/health/archive/2018/01/09', 'https://medium.com/tag/health/archive/2018/01/10', 'https://medium.com/tag/health/archive/2018/01/11', 'https://medium.com/tag/health/archive/2018/01/12', 'https://medium.com/tag/health/archive/2018/01/13', 'https://medium.com/tag/health/archive/2018/01/14', 'https://medium.com/tag/health/

# I will use 0.17 as the current optimal delay time. It is the smallest tested value for which no 429 errors were raised in the one-month span.

A useful addition for future tests would be to automatically count the 429 errors recorded in the log file.
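
A minimal sketch of such a counter is below. It assumes that every log line reporting a 429 response contains `429` as a standalone token; the exact wording of Scrapy's log lines should be checked against a real log file, and the pattern adjusted if URLs or timestamps can also contain `429`. The function name `count_429_lines` and the sample lines are made up for illustration and are not taken from the actual logs.

```python
import re


def count_429_lines(log_lines, pattern=r'\b429\b'):
    """Count the lines that match `pattern` (default: a standalone 429)."""
    regex = re.compile(pattern)
    return sum(1 for line in log_lines if regex.search(line))


# Made-up lines standing in for a log file; not real Scrapy output.
sample_log = [
    'request A -> 200',
    'request B -> 429',
    'request B retry -> 429',
    'request B retry -> 200',
    'request C -> 200',
]
print(count_429_lines(sample_log))
```

Because an open file object yields its lines when iterated, a real log could be passed in directly, for example `with open(log_file) as f: count_429_lines(f)`.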