## Web Scraping Demo

Hezekiah Branch

Thu Jan 6 2022

**Goal: Scrape the list of articles from the current CNN homepage**


The first two cells are from https://realpython.com/python-web-scraping-practical-introduction/

In [160]:
from bs4 import BeautifulSoup
from urllib.request import urlopen
import json
import time

In [161]:
# Begin wall-timing workflow for web scraping
start = time.time()

Make an HTTP request to the CNN server. urllib manages the connection, so no explicit close is written here.

Then parse the HTML with BeautifulSoup.

In [162]:
url = "https://www.cnn.com"
page = urlopen(url)
html = page.read().decode("utf-8")
soup = BeautifulSoup(html, "html.parser")
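
The request step above can also be written with a context manager, so the response is closed explicitly instead of being left to urllib. This is a minimal sketch, not the notebook's original code: the helper name `fetch_html` and the 10-second timeout are assumptions made for illustration, and the UTF-8 fallback mirrors the hard-coded `decode("utf-8")` above.

```python
from urllib.request import urlopen


def fetch_html(url, timeout=10.0):
    """Fetch a URL and return its body as text (hypothetical helper)."""
    # The with-block closes the response even if read() or decode() raises.
    with urlopen(url, timeout=timeout) as response:
        # Use the charset declared in the Content-Type header, else assume UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)


# Usage (needs network access, so it is left commented out):
# html = fetch_html("https://www.cnn.com")
# soup = BeautifulSoup(html, "html.parser")
```

The timeout keeps the cell from hanging indefinitely if the server does not respond.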

In [163]:
# Print 'get_text()' to show that more "hacking" is
# needed on the developer side: this output is not
# the content we want from the home page
# (i.e. trending articles)
print(soup.get_text())

















CNN - Breaking News, Latest News and Videos


USWorldPoliticsBusinessOpinionHealthEntertainmentStyleTravelSportsVideosEditionU.S.InternationalArabicEspañolSearch CNNOpen MenuSearchEditionU.S.InternationalArabicEspañolUSCrime + JusticeEnergy + EnvironmentExtreme WeatherSpace + ScienceWorldAfricaAmericasAsiaAustraliaChinaEuropeIndiaMiddle EastUnited KingdomPoliticsThe Biden PresidencyFacts FirstUS ElectionsBusinessMarketsTechMediaSuccessPerspectivesVideosOpinionPolitical Op-EdsSocial CommentaryHealthLife, But BetterFitnessFoodSleepMindfulnessRelationshipsEntertainmentStarsScreenBingeCultureMediaTechInnovateGadgetForeseeable FutureMission: AheadUpstartsWork TransformedInnovative CitiesStyleArtsDesignFashionArchitectureLuxuryBeautyVideoTravelDestinationsFood and DrinkStayNewsVideosSportsPro FootballCollege FootballBasketballBaseballSoccerOlympicsHockeyVideosLive TV Digital StudiosCNN FilmsHLNTV ScheduleTV Shows A-ZCNNVRAudioCouponsCNN UnderscoredExploreWellnessGadgetsLifest

We can get a structured string from BeautifulSoup's 'prettify' method, which we then split into a list.

The prettify output puts each HTML tag on its own line, so we can use '\n' as the delimiter to turn the string into a list/array.

In [164]:
pretty_soup = soup.prettify()

In [165]:
pretty_bowl = pretty_soup.split('\n')
pretty_bowl[0:50]

['<!DOCTYPE html>',
 '<html class="no-js">',
 ' <head>',
 '  <meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>',
 '  <meta charset="utf-8"/>',
 '  <meta content="text/html" http-equiv="Content-Type"/>',
 '  <meta content="width=device-width, initial-scale=1.0, minimum-scale=1.0" name="viewport"/>',
 '  <link href="/optimizelyjs/131788053.js" rel="dns-prefetch"/>',
 '  <link href="//tpc.googlesyndication.com" rel="dns-prefetch"/>',
 '  <link href="//pagead2.googlesyndication.com" rel="dns-prefetch"/>',
 '  <link href="//www.googletagservices.com" rel="dns-prefetch"/>',
 '  <link href="//partner.googleadservices.com" rel="dns-prefetch"/>',
 '  <link href="//www.google.com" rel="dns-prefetch"/>',
 '  <link href="//aax.amazon-adsystem.com" rel="dns-prefetch"/>',
 '  <link href="//c.amazon-adsystem.com" rel="dns-prefetch"/>',
 '  <link href="//cdn.krxd.net" rel="dns-prefetch"/>',
 '  <link href="//ads.rubiconproject.com" rel="dns-prefetch"/>',
 '  <link href="//optimized-by.ru

Since we are after article data, we only want the lines that contain it.

We can use a list of keywords that article records are likely to contain (called 'tags' in the code below, although they are field names rather than HTML tags) to narrow down how much of the string we look at. A line is kept only if it contains every keyword. Most importantly, this works without knowing in advance what string formatting the CNN HTML uses. The last line of the cell shows a small preview of the output; it can be dropped unless you need it.

In [166]:
tags = ['subTitle', 'name', 'title', 'headline', 'description', 'src', 'mpId']

In [167]:
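# Keep only the rows that contain every keyword in 'tags'
subset = []

for row in pretty_bowl:
    valid = True
    for tag in tags:
        # A single missing keyword disqualifies the row
        if tag not in row:
            valid = False
    if valid:
        subset.append(row)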
        
subset[0][0:500]

'         window.CNN = window.CNN || {};window.CNNI = window.CNNI || {},window.FAVE = window.FAVE || {};window.WM = window.WM || {};window.document.domain = \'cnn.com\';if (typeof window.console === \'undefined\') {window.console = {debug: function() {return true;},error: function() {return true;},info: function() {return false;},warn: function() {return false;},log: function() {return false;},timeStamp: function() {return false;}}}CNN.isWebview = false;CNN.adTargets = {protocol: "ssl"};CNN.AdsConfig'
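
The filter above can be written more compactly with `all()`. The sketch below is one possible restatement of the same rule (keep a row only if every keyword occurs in it); the function name `rows_with_all_keys` and the sample rows are invented for illustration and are not real CNN markup.

```python
def rows_with_all_keys(rows, keys):
    """Return the rows that contain every keyword in `keys`."""
    return [row for row in rows if all(key in row for key in keys)]


# Toy stand-ins for lines of prettified HTML (invented for illustration).
sample_rows = [
    '<meta content="CNN" name="title"/>',
    '{"headline": "A", "description": "B", "title": "C"}',
    '<p>no keywords here</p>',
]
matches = rows_with_all_keys(sample_rows, ["headline", "description", "title"])
print(len(matches))  # 1
```

Only the second row survives, because the first contains "title" but neither of the other two keywords.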

Having (naively) narrowed the page down to its most relevant line, we can now extract the list of articles more precisely.

In [168]:
len(subset)  # within the match, the list starts after '{"articleList":' and ends at '"layout":""}]'

1

In [169]:
start_text = "{\"articleList\":"
start_index = subset[0].find(start_text) + len(start_text)

end_text = "\"layout\":\"\"}]"
end_index = subset[0].find(end_text, start_index) + len(end_text) # 'find' returns the index where the match starts, NOT where it ends; search only after start_index

print(start_index)
print(end_index)

105806
131848
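
One caveat with this approach: `str.find` returns -1 when the marker is missing, so adding `len(start_text)` to it silently produces a wrong index instead of an error. The sketch below wraps the same slicing in a helper that fails loudly. The name `slice_between` and the sample line are assumptions made for illustration.

```python
def slice_between(text, start_text, end_text):
    """Return the text after start_text, up to and including end_text."""
    start = text.find(start_text)
    if start == -1:
        raise ValueError(f"start marker not found: {start_text!r}")
    start += len(start_text)
    # Search for the end marker only after the start marker.
    end = text.find(end_text, start)
    if end == -1:
        raise ValueError(f"end marker not found: {end_text!r}")
    return text[start:end + len(end_text)]


# Toy script line, invented for illustration.
line = 'x = {"articleList":[{"uri":"/a","layout":""}],"other":1};'
print(slice_between(line, '{"articleList":', '"layout":""}]'))
# [{"uri":"/a","layout":""}]
```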


In [170]:
article_list = subset[0][start_index:end_index]
article_list[0:500]

'[{"uri":"/2022/01/06/politics/joe-biden-january-6-speech-anniversary/index.html","headline":"\\u003cstrong>It\'s likely to be the line repeated most today and in days to come\\u003c/strong>","thumbnail":"//cdn.cnn.com/cnnnext/dam/assets/220106093419-03-biden-jan-6-speech-small-11.jpg","duration":"","description":"","layout":""},{"uri":"/2022/01/06/politics/january-6-anniversary/index.html","headline":"Biden condemns Trump as a threat to democracy","thumbnail":"//cdn.cnn.com/cnnnext/dam/assets/22010'

We then parse this string with 'json.loads' to get a Python list of article dictionaries.

In [171]:
data = json.loads(article_list)
data

[{'uri': '/2022/01/06/politics/joe-biden-january-6-speech-anniversary/index.html',
  'headline': "<strong>It's likely to be the line repeated most today and in days to come</strong>",
  'thumbnail': '//cdn.cnn.com/cnnnext/dam/assets/220106093419-03-biden-jan-6-speech-small-11.jpg',
  'duration': '',
  'description': '',
  'layout': ''},
 {'uri': '/2022/01/06/politics/january-6-anniversary/index.html',
  'headline': 'Biden condemns Trump as a threat to democracy',
  'thumbnail': '//cdn.cnn.com/cnnnext/dam/assets/220106092029-02-biden-jan-6-speech-small-11.jpg',
  'duration': '',
  'description': '<a href="https://www.cnn.com/specials/politics/joe-biden-news" target="_blank">President Joe Biden</a> on Thursday marked the first anniversary of the January 6 insurrection by forcefully calling out former President Donald Trump for attempting to undo American democracy, saying such an insurrection must never happen again.',
  'layout': ''},
 {'uri': '/videos/politics/2022/01/06/stephanie-gris
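
As an alternative to finding an end marker by hand, the standard library's `json.JSONDecoder.raw_decode` can parse a single JSON value that starts at a given index and stop where that value ends. The following is a sketch under that assumption, not the approach used in this notebook; `decode_json_after` and the sample line are hypothetical.

```python
import json


def decode_json_after(text, marker):
    """Parse the first JSON value that follows `marker` in `text`."""
    start = text.find(marker)
    if start == -1:
        raise ValueError(f"marker not found: {marker!r}")
    # raw_decode parses one JSON value starting at the given index and
    # returns it together with the index where that value ended.
    value, _end = json.JSONDecoder().raw_decode(text, start + len(marker))
    return value


# Toy script line, invented for illustration.
line = 'x = {"articleList":[{"uri":"/a","headline":"\\u003cb>Hi\\u003c/b>"}],"n":1};'
articles = decode_json_after(line, '{"articleList":')
print(articles[0]["headline"])  # <b>Hi</b>
```

Note that `raw_decode` does not skip leading whitespace, so the index must point directly at the first character of the JSON value.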

Data science folks may also want to use this data in pandas! It's easy from here to export to CSV.

In [172]:
import pandas as pd

In [173]:
cnn_df = pd.json_normalize(data)
cnn_df # use .to_csv() for CSV exporting

,uri,headline,thumbnail,duration,description,layout,iconType
0,/2022/01/06/politics/joe-biden-january-6-speec...,<strong>It's likely to be the line repeated mo...,//cdn.cnn.com/cnnnext/dam/assets/220106093419-...,,,,
1,/2022/01/06/politics/january-6-anniversary/ind...,Biden condemns Trump as a threat to democracy,//cdn.cnn.com/cnnnext/dam/assets/220106092029-...,,"<a href=""https://www.cnn.com/specials/politics...",,
2,/videos/politics/2022/01/06/stephanie-grisham-...,Ex-Trump aide describes the worry with Trump's...,//cdn.cnn.com/cnnnext/dam/assets/220106091107-...,01:50,"Stephanie Grisham, former communications direc...",,video
3,/2022/01/06/opinions/what-i-saw-on-january-6-c...,<strong>Opinion:</strong> I was there on Janua...,//cdn.cnn.com/cnnnext/dam/assets/220106052132-...,,"On January 6, 2021, I left the Maryland suburb...",,
4,/2022/01/06/politics/stephanie-grisham-trump-o...,Stephanie Grisham says group of ex-Trump offic...,//cdn.cnn.com/cnnnext/dam/assets/200407100829-...,,"Former White House press secretary <a href=""ht...",,
5,/2022/01/06/opinions/president-biden-speech-de...,<strong>Opinion</strong>: Biden assumes role o...,//cdn.cnn.com/cnnnext/dam/assets/220106093511-...,,"President Joe Biden, marking the anniversary o...",,
6,/2022/01/06/politics/trump-tweet-january-6/ind...,<strong>Trump did not want to tweet 'stay peac...,//cdn.cnn.com/cnnnext/dam/assets/220106084508-...,,,,
7,/videos/politics/2022/01/06/senator-joe-neguse...,Senator shares defining moment from January 6,//cdn.cnn.com/cnnnext/dam/assets/210210131744-...,01:54,Sen. Joe Neguse (D-CO) joins Inside Politics t...,,video
8,/2022/01/06/politics/january-6-capitol-riot/in...,<strong>Analysis: </strong>What a reporter tra...,//cdn.cnn.com/cnnnext/dam/assets/211119115714-...,,"<a href=""https://www.cnn.com/specials/politics...",,
9,/videos/politics/2022/01/06/dick-cheney-liz-ch...,Dick Cheney slams GOP over January 6th,//cdn.cnn.com/cnnnext/dam/assets/220106143222-...,01:32,Former Republican Vice President Dick Cheney a...,,video
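
For readers who want a CSV without pandas, a similar export can be sketched with the standard library's `csv` module. This is an illustrative alternative to `.to_csv()`, not part of the original workflow; `records_to_csv` and the sample records are invented. Because some articles carry keys that others lack (such as 'iconType'), the column names are collected from all records first.

```python
import csv
import io


def records_to_csv(records):
    """Serialize a list of dicts to CSV text; missing keys become empty cells."""
    # Collect column names in first-seen order, since records may differ in keys.
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Toy records shaped like the scraped articles (values invented).
sample = [
    {"uri": "/a/index.html", "headline": "First", "duration": ""},
    {"uri": "/videos/b", "headline": "Second", "duration": "01:50", "iconType": "video"},
]
print(records_to_csv(sample))
```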


In [174]:
# End wall-timing of web scraping workflow
end = time.time()
print(f"{end - start:.4f} seconds")

0.2963 seconds
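
The start/end timing pattern used in this notebook can also be packaged as a context manager. The sketch below uses `time.perf_counter`, which is intended for measuring elapsed intervals, in place of `time.time`. The helper name `wall_timer` and the toy workload are assumptions for illustration.

```python
import time
from contextlib import contextmanager


@contextmanager
def wall_timer(label):
    """Print how long the enclosed block took (hypothetical helper)."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the timed block raises, so a duration is always reported.
        elapsed = time.perf_counter() - start
        print(f"{label}: {elapsed:.4f} seconds")


with wall_timer("toy workload"):
    total = sum(range(100_000))
```

In a notebook, this works within a single cell; timing a workflow that spans several cells still needs the separate start and end cells shown above.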
