Highlights:

- This notebook shows a "human-in-the-loop" approach.
- The human operator clicks the "show more" button when this notebook asks for it.
- The button only needs to be clicked once; after that, the script takes over and keeps scrolling towards the end of the page.

**NOTE**:

- If your browser window is small, which is likely once the developer tools are open, the NYT site may treat your browser as a mobile one.
- In mobile mode, the "show more" button appears repeatedly, presumably to save traffic. For the following experiments, work on a desktop with a reasonably large screen so that the "human in the loop" step actually works.
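
One way to reduce the risk of getting the mobile layout is to check the window size from the notebook before scraping. The sketch below is only an illustration: `ensure_desktop_window` is a hypothetical helper, and the 1280 x 800 minimum is an assumed value, not a documented NYT breakpoint. It relies only on Selenium's `get_window_size()` and `set_window_size()`.

```python
def ensure_desktop_window(browser, min_width=1280, min_height=800):
    """Enlarge the browser window if it is smaller than the given size.

    min_width and min_height are assumed values, not known NYT breakpoints.
    Returns the (width, height) the window is asked to have.
    """
    size = browser.get_window_size()  # dict with 'width' and 'height'
    width = max(size['width'], min_width)
    height = max(size['height'], min_height)
    if (width, height) != (size['width'], size['height']):
        browser.set_window_size(width, height)
    return width, height
```

Calling `ensure_desktop_window(browser)` right after `webdriver.Chrome()` would be the natural place. Docked developer tools still shrink the page area inside the window, so undocking them may also be necessary.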

## Initial setup

In [2]:
import datetime
print(datetime.datetime.now())

2018-11-08 01:55:49.607645


In [1]:
url = 'https://www.nytimes.com/section/world'

In [3]:
from selenium import webdriver
import time
browser = webdriver.Chrome()
browser.get(url)

In [9]:
browser.find_elements_by_css_selector('h2.headline')

[<selenium.webdriver.remote.webelement.WebElement (session="572679d654becff5225f83de866f76e3", element="0.7355895603345244-1")>,
 <selenium.webdriver.remote.webelement.WebElement (session="572679d654becff5225f83de866f76e3", element="0.7355895603345244-2")>,
 <selenium.webdriver.remote.webelement.WebElement (session="572679d654becff5225f83de866f76e3", element="0.7355895603345244-3")>,
 <selenium.webdriver.remote.webelement.WebElement (session="572679d654becff5225f83de866f76e3", element="0.7355895603345244-4")>,
 <selenium.webdriver.remote.webelement.WebElement (session="572679d654becff5225f83de866f76e3", element="0.7355895603345244-5")>,
 <selenium.webdriver.remote.webelement.WebElement (session="572679d654becff5225f83de866f76e3", element="0.7355895603345244-6")>,
 <selenium.webdriver.remote.webelement.WebElement (session="572679d654becff5225f83de866f76e3", element="0.7355895603345244-7")>,
 <selenium.webdriver.remote.webelement.WebElement (session="572679d654becff5225f83de866f76e3", el

In [17]:
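data = []
# Each story on the section page is wrapped in a div.story-body element.
for s in browser.find_elements_by_css_selector('div.story-body'):
    a = s.find_element_by_css_selector('h2.headline')
    headline = a.text
    # The link and the timestamp may be missing, so fall back to None.
    # `link` avoids overwriting the `url` variable defined earlier.
    try:
        link = a.find_element_by_css_selector('a').get_attribute('href')
    except Exception:
        link = None
    try:
        dt = s.find_element_by_css_selector('time').get_attribute('datetime')
    except Exception:
        dt = None
    data.append({
        'headline': headline,
        'url': link,
        'datetime': dt
    })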
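
Since the same loop is used again after scrolling, it could be wrapped in a function. The sketch below is one possible refactoring, not the notebook's original code: `first_or_none` and `scrape_stories` are hypothetical names. It uses `find_elements` (plural), which returns an empty list when nothing matches, so no `try`/`except` is needed. The locator string `'css selector'` is the value of Selenium's `By.CSS_SELECTOR`; this form also works on newer Selenium releases, which dropped the `find_element_by_*` helpers.

```python
CSS = 'css selector'  # same string as selenium's By.CSS_SELECTOR


def first_or_none(parent, selector):
    """Return the first element matching selector under parent, or None."""
    matches = parent.find_elements(CSS, selector)
    return matches[0] if matches else None


def scrape_stories(browser):
    """Collect headline, url and datetime for every div.story-body."""
    rows = []
    for story in browser.find_elements(CSS, 'div.story-body'):
        headline = first_or_none(story, 'h2.headline')
        link = first_or_none(headline, 'a') if headline is not None else None
        stamp = first_or_none(story, 'time')
        rows.append({
            'headline': headline.text if headline is not None else None,
            'url': link.get_attribute('href') if link is not None else None,
            'datetime': stamp.get_attribute('datetime') if stamp is not None else None,
        })
    return rows
```

Unlike the cell above, this version records `None` instead of raising when a story has no `h2.headline`. With it, both scraping cells would reduce to `data = scrape_stories(browser)`.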

In [14]:
import pandas as pd

In [19]:
df = pd.DataFrame(data)

In [20]:
df

| | datetime | headline | url |
|---|---|---|---|
| 0 | 1541520747 | Erdogan Champions Khashoggi While Trampling Jo... | https://www.nytimes.com/2018/11/06/world/europ... |
| 1 | 1541587767 | Cameroon Students Have Been Released, Official... | https://www.nytimes.com/2018/11/07/world/afric... |
| 2 | 1541584807 | Dry Spell: Canada Runs Low on Legal Marijuana ... | https://www.nytimes.com/2018/11/07/world/canad... |
| 3 | 1541529385 | As Famine Looms in Yemen, Saudi-Led Coalition ... | https://www.nytimes.com/2018/11/06/world/middl... |
| 4 | 1541474477 | The Nauru Experience: Zero-Tolerance Immigrati... | https://www.nytimes.com/2018/11/05/world/austr... |
| 5 | 1541580343 | Philippine Lawyer Who Resisted Duterte’s Drug ... | https://www.nytimes.com/2018/11/07/world/asia/... |
| 6 | 1541536215 | Ex-Guard, 94, at Nazi Camp Is Tried in German ... | https://www.nytimes.com/2018/11/06/world/europ... |
| 7 | 1541521423 | A ‘Legacy of Terror’: ISIS Left More Than 200 ... | https://www.nytimes.com/2018/11/06/world/middl... |
| 8 | 1541517537 | Facebook Admits It Was Used to Incite Violence... | https://www.nytimes.com/2018/11/06/technology/... |
| 9 | 1541513702 | Taliban Pummel Security Forces Across Afghanistan | https://www.nytimes.com/2018/11/06/world/asia/... |
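
The `datetime` values in the table above look like Unix epoch seconds. Assuming that reading is correct (the page itself does not confirm it here), they can be turned into readable timestamps with the standard library. `parse_epoch` is a hypothetical helper name.

```python
import datetime


def parse_epoch(value):
    """Convert a string of Unix epoch seconds to an aware UTC datetime.

    Returns None for missing or non-numeric values.
    """
    if value is None:
        return None
    try:
        seconds = int(value)
    except (TypeError, ValueError):
        return None
    return datetime.datetime.fromtimestamp(seconds, tz=datetime.timezone.utc)


print(parse_epoch('1541520747'))  # 2018-11-06 16:12:27+00:00
```

The result for the first row falls on 2018-11-06, which agrees with the date embedded in that row's URL.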


In [21]:
len(df)

64

## Human in the loop

Of course, I could use Selenium to click the "show more" button. However, this operation is a one-off: once we have clicked "show more" for the first time, we don't have to click it again. We only need to scroll. We can let a human solve the *one-off* problem (show more) and hand the repeated tasks (scroll and scrape) over to the machine.

Note that we only have 64 entries above. I'm going to scrape far more entries than that.
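
The hand-over point can be made explicit in the notebook by pausing until the operator confirms the click. This is a minimal sketch under that assumption; `wait_for_human` is a hypothetical helper, and the `ask` parameter exists only so the function can be exercised without a keyboard.

```python
def wait_for_human(instruction, ask=input):
    """Block until the operator confirms that a manual step is done.

    Returns True when the operator confirms, False when they type "skip".
    """
    reply = ask(instruction + ' Press Enter when done (or type "skip"): ')
    return reply.strip().lower() != 'skip'
```

In the notebook this would be called once, for example as `wait_for_human('Click the "show more" button in the browser window.')`, before the scrolling cell runs.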

## Machine takes over and scrolls down

In [26]:
import time

In [27]:
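# Scroll to the bottom 29 times, pausing one second each time
# so that newly loaded entries have a chance to appear.
for i in range(1, 30):
    browser.execute_script('window.scrollTo(0,document.body.scrollHeight);')
    time.sleep(1)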
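
The loop above scrolls a fixed number of times. A possible variation, sketched here with the hypothetical helper `scroll_until_stable`, stops as soon as the page height no longer grows. The `sleep` parameter is there only for testing, and the one-second pause is an assumption that may be too short on a slow connection, in which case the loop would stop early.

```python
import time


def scroll_until_stable(browser, pause=1.0, max_rounds=30, sleep=time.sleep):
    """Scroll to the bottom until the page height stops growing.

    Returns the number of scrolls performed. max_rounds caps the loop so it
    cannot run forever on an endless feed.
    """
    last_height = browser.execute_script('return document.body.scrollHeight;')
    rounds = 0
    while rounds < max_rounds:
        browser.execute_script('window.scrollTo(0, document.body.scrollHeight);')
        rounds += 1
        sleep(pause)  # give the page time to load the next batch
        height = browser.execute_script('return document.body.scrollHeight;')
        if height == last_height:
            break  # nothing new was loaded
        last_height = height
    return rounds
```
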

The scraping code is the same as before.

In [28]:
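data = []
# Same extraction as before, now over every story loaded by scrolling.
for s in browser.find_elements_by_css_selector('div.story-body'):
    a = s.find_element_by_css_selector('h2.headline')
    headline = a.text
    # The link and the timestamp may be missing, so fall back to None.
    # `link` avoids overwriting the `url` variable defined earlier.
    try:
        link = a.find_element_by_css_selector('a').get_attribute('href')
    except Exception:
        link = None
    try:
        dt = s.find_element_by_css_selector('time').get_attribute('datetime')
    except Exception:
        dt = None
    data.append({
        'headline': headline,
        'url': link,
        'datetime': dt
    })
df = pd.DataFrame(data)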

In [29]:
len(df)

584

In [30]:
df.head()

| | datetime | headline | url |
|---|---|---|---|
| 0 | 1541520747 | Erdogan Champions Khashoggi While Trampling Jo... | https://www.nytimes.com/2018/11/06/world/europ... |
| 1 | 1541587767 | Cameroon Students Have Been Released, Official... | https://www.nytimes.com/2018/11/07/world/afric... |
| 2 | 1541584807 | Dry Spell: Canada Runs Low on Legal Marijuana ... | https://www.nytimes.com/2018/11/07/world/canad... |
| 3 | 1541529385 | As Famine Looms in Yemen, Saudi-Led Coalition ... | https://www.nytimes.com/2018/11/06/world/middl... |
| 4 | 1541474477 | The Nauru Experience: Zero-Tolerance Immigrati... | https://www.nytimes.com/2018/11/05/world/austr... |


In [31]:
df.tail()

| | datetime | headline | url |
|---|---|---|---|
| 579 | | | |
| 580 | | | |
| 581 | | | |
| 582 | | | |
| 583 | | | |


In [32]:
df.to_csv('nyt_world.csv')

## Epilogue

The `datetime` and `url` fields are wrong in the output CSV, while `headline` seems fine. Nevertheless, this notebook shows that "human in the loop" is feasible and that many pages can be scraped by scrolling. To fix the remaining fields, one can analyse the page structure of the elements loaded after scrolling.
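
A first step in that analysis could be to locate where the scraped rows start to go empty. The sketch below works on the `data` list built earlier; `summarize_missing` is a hypothetical helper, and treating empty strings and `None` alike as "missing" is an assumption.

```python
def summarize_missing(rows, fields=('headline', 'url', 'datetime')):
    """Count empty values per field and find the first fully empty row.

    Returns (counts, index), where index is None if no row is fully empty.
    """
    missing = {field: 0 for field in fields}
    first_empty = None
    for index, row in enumerate(rows):
        empty = [field for field in fields if not row.get(field)]
        for field in empty:
            missing[field] += 1
        if first_empty is None and len(empty) == len(fields):
            first_empty = index
    return missing, first_empty
```

With the index returned by `summarize_missing(data)`, the matching `div.story-body` element can be inspected in the browser, for instance through `get_attribute('outerHTML')`, to see how its markup differs from the first 64 entries.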