This is roughly the same as the video walkthrough, but slightly cleaned up.

In [1]:
import requests as req

First, fetch the webpage using the requests library.

In [2]:
res = req.get('https://www.denvergov.org/content/denvergov/en/city-of-denver-home/news.html')

BeautifulSoup is a library for parsing HTML pages. We'll use it here to get each 'div' element on the page that holds a news story.

In [3]:
from bs4 import BeautifulSoup as bs

In [4]:
soup = bs(res.content, 'lxml')

In [5]:
news_items = soup.find_all('div', {'class': 'denver-news-list-item'})

In [6]:
len(news_items)

6

In [7]:
news_items[0]

<div class="denver-news-list-item">
<p class="denver-news-list-date">Dec 2, 2019</p>
<h3><a href="/content/denvergov/en/environmental-health/about-us/news-room/newsroom_2019/fifth-annual-sustainable-denver-summit.html">City to Host Fifth Annual Sustainable Denver Summit</a></h3>
<p>Mayor Michael B. Hancock and Denver’s Office of Sustainability host the fifth annual Sustainable Denver Summit on Thursday, Dec. 5, at the Colorado Convention Center. The Summit is the community’s premier gathering of industry leaders and professionals working on today’s most important sustainability solutions. Their work will help Denver achieve its ambitious 2020 Sustainability Goals and other efforts to improve sustainability and climate action.</p>
<div style="clear:both;"></div>
</div>

In [8]:
# get the date of the first news item
news_items[0].find('p', {'class': 'denver-news-list-date'}).text

'Dec 2, 2019'
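
At this point the date is still a plain string. The loop near the end of this notebook converts it with pd.to_datetime. As a minimal sketch of the same conversion using only the standard library, and assuming every item uses the 'Dec 2, 2019' style shown above (abbreviated English month, day, year), it could look like the following. The helper name parse_news_date is made up for illustration.

```python
from datetime import datetime


def parse_news_date(text):
    """Parse a date string such as 'Dec 2, 2019' into a datetime.

    Assumes an abbreviated English month name, so it relies on an
    English (or C) locale being active.
    """
    return datetime.strptime(text.strip(), '%b %d, %Y')


parsed = parse_news_date('Dec 2, 2019')
print(parsed)
```

If the site ever mixes date formats, strptime raises a ValueError, whereas pd.to_datetime is more forgiving about the input format.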

This gets the text from the link in the first news item.

In [9]:
news_items[0].find('a').text

'City to Host Fifth Annual Sustainable Denver Summit'

In [10]:
link = news_items[0].find('a')

In [11]:
link

<a href="/content/denvergov/en/environmental-health/about-us/news-room/newsroom_2019/fifth-annual-sustainable-denver-summit.html">City to Host Fifth Annual Sustainable Denver Summit</a>

In [12]:
link.get('href')

'/content/denvergov/en/environmental-health/about-us/news-room/newsroom_2019/fifth-annual-sustainable-denver-summit.html'

In [13]:
link.text

'City to Host Fifth Annual Sustainable Denver Summit'
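
The individual lookups above can be gathered into one small function. This is only a sketch: parse_item, SAMPLE_HTML, and the sample_ variable names are invented for illustration, the HTML is a trimmed copy of the first news item with the summary shortened, and it uses Python's built-in 'html.parser' instead of 'lxml' so that it runs without the extra dependency.

```python
from bs4 import BeautifulSoup

# A trimmed copy of the first news item shown above (summary shortened).
SAMPLE_HTML = """
<div class="denver-news-list-item">
<p class="denver-news-list-date">Dec 2, 2019</p>
<h3><a href="/content/denvergov/en/environmental-health/about-us/news-room/newsroom_2019/fifth-annual-sustainable-denver-summit.html">City to Host Fifth Annual Sustainable Denver Summit</a></h3>
<p>Summary text goes here.</p>
</div>
"""


def parse_item(item):
    """Return the date, title, and href of one news-item div as a dict."""
    link = item.find('a')
    return {
        'date': item.find('p', {'class': 'denver-news-list-date'}).text,
        'title': link.text,
        'href': link.get('href'),
    }


# 'html.parser' ships with Python; the notebook itself uses 'lxml'.
sample_soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
sample_items = sample_soup.find_all('div', {'class': 'denver-news-list-item'})
parsed_items = [parse_item(item) for item in sample_items]
print(parsed_items)
```

Working from a saved snippet like this makes it possible to check the parsing logic without hitting the live site on every run.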

By convention, constants in Python are named in all caps like this. We aren't going to change the base URL or the page URL, so we use all caps for those variable names.

In [14]:
BASE_URL = 'https://www.denvergov.org'
PAGE_URL = 'https://www.denvergov.org/content/denvergov/en/city-of-denver-home/news.html?page={}'
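
To make the role of the two constants concrete, here is a short sketch of how they might be used. The {} placeholder in PAGE_URL is filled in with str.format, and the relative href from a news item is turned into a full URL. The notebook simply concatenates BASE_URL + href; urljoin from the standard library is swapped in here as an alternative that gives the same result for hrefs starting with '/'. The names page_urls and full_link are illustrative only.

```python
from urllib.parse import urljoin

BASE_URL = 'https://www.denvergov.org'
PAGE_URL = 'https://www.denvergov.org/content/denvergov/en/city-of-denver-home/news.html?page={}'

# Fill in the page-number placeholder for the first three pages.
page_urls = [PAGE_URL.format(n) for n in range(1, 4)]

# Turn a site-relative href into an absolute URL.
href = '/content/denvergov/en/environmental-health/about-us/news-room/newsroom_2019/fifth-annual-sustainable-denver-summit.html'
full_link = urljoin(BASE_URL, href)

print(page_urls[0])
print(full_link)
```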

This loop fetches each page of news, extracts each news item on the page, and writes it to our MongoDB datastore. It also uses the tqdm library to show progress as it runs.

In [15]:
import pandas as pd  # for conversion of datetimes
from tqdm import tqdm_notebook
from pymongo import MongoClient

client = MongoClient()
db = client['news']
coll = db['denver']

for page_number in tqdm_notebook(range(1, 71)):  # pages 1 through 70
    # fetch and parse one page of the news listing
    res = req.get(PAGE_URL.format(page_number))
    soup = bs(res.content, 'lxml')
    news_items = soup.find_all('div', {'class': 'denver-news-list-item'})
    for item in news_items:
        date = item.find('p', {'class': 'denver-news-list-date'}).text
        link = item.find('a')
        title = link.text
        href = link.get('href')
        # store the date as a datetime and the link as an absolute URL
        coll.insert_one({'date': pd.to_datetime(date),
                         'title': title,
                         'link': BASE_URL + href})

client.close()
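
One thing to be aware of: insert_one always adds a new document, so running the loop a second time would store every news item twice. A possible way around that, sketched here under the assumption that the link uniquely identifies a story, is to upsert on the link instead. save_item is a hypothetical helper and not part of the original walkthrough; update_one with upsert=True is a standard pymongo collection method.

```python
def save_item(coll, date, title, link):
    """Insert the news item, or update it if this link is already stored.

    `coll` is expected to be a pymongo collection, such as
    client['news']['denver'] above.
    """
    doc = {'date': date, 'title': title, 'link': link}
    # With upsert=True, update_one inserts a new document when no existing
    # document matches the filter, and updates the match otherwise.
    return coll.update_one({'link': link}, {'$set': doc}, upsert=True)
```

Inside the loop, the insert_one call would then become save_item(coll, pd.to_datetime(date), title, BASE_URL + href).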


Finally, we check that our data was actually written to MongoDB.

In [16]:
from pprint import pprint

client = MongoClient()
db = client['news']
coll = db['denver']

pprint(coll.find_one())

client.close()

{'_id': ObjectId('5de581023ae5ef8671ee264a'),
 'date': datetime.datetime(2019, 12, 2, 0, 0),
 'link': 'https://www.denvergov.org/content/denvergov/en/environmental-health/about-us/news-room/newsroom_2019/fifth-annual-sustainable-denver-summit.html',
 'title': 'City to Host Fifth Annual Sustainable Denver Summit'}
