<a href="https://colab.research.google.com/github/gtoubian/cce/blob/main/3_6_Web_Scraping_Example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Brooklyn Nine Nine Analysis

In this lecture, we will scrape data from the Wikipedia page for the popular show Brooklyn Nine Nine. Once we have scraped the data, we will clean it up and format it into a proper data set that we can analyze. Let's get started by importing the required libraries.

In [None]:
from bs4 import BeautifulSoup as bs
import requests

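From a show's Wikipedia page, we will try to scrape all the information in the sidebar (the infobox) into a dictionary of key:value pairs. The request below fetches the page for Law & Order: Special Victims Unit, so all the output that follows is for that show.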

In [None]:
r = requests.get("https://en.wikipedia.org/wiki/Law_%26_Order:_Special_Victims_Unit")
soup = bs(r.content, "html.parser")  # name the parser explicitly so results do not depend on what is installed
content = soup.prettify()
content

'<!DOCTYPE html>\n<html class="client-nojs" dir="ltr" lang="en">\n <head>\n  <meta charset="utf-8"/>\n  <title>\n   Law &amp; Order: Special Victims Unit - Wikipedia\n  </title>\n  <script>\n   document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"YEhrxQmbK8z-C0@WJZNHrgAAAAA","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Law_\\u0026_Order:_Special_Victims_Unit","wgTitle":"Law \\u0026 Order: Special Victims Unit","wgCurRevisionId":1011315230,"wgRevisionId":1011315230,"wgArticleId":197060,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Webarchive template wayback links","CS1 maint: archived copy as title","CS

Here, we use the find method to locate the HTML element that contains the info we want. To see which element that is, open the page in a browser, right-click the sidebar, and choose "Inspect" to view the HTML source.

In [None]:
info_box = soup.find(class_='infobox vevent')
print(info_box.prettify())

<table class="infobox vevent" style="width:22em">
 <tbody>
  <tr>
   <th class="summary" colspan="2" style="text-align:center;font-size:125%;font-weight:bold;font-style: italic; background: #CCCCFF; padding: 0.25em 1em; line-height: 1.5em; line-height: normal;">
    Law &amp; Order:
    <span class="nowrap">
     Special Victims Unit
    </span>
   </th>
  </tr>
  <tr>
   <td colspan="2" style="text-align:center">
    <a class="image" href="/wiki/File:SVUopening.jpg">
     <img alt="SVUopening.jpg" data-file-height="556" data-file-width="1024" decoding="async" height="136" src="//upload.wikimedia.org/wikipedia/commons/thumb/c/cf/SVUopening.jpg/250px-SVUopening.jpg" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/c/cf/SVUopening.jpg/375px-SVUopening.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/c/cf/SVUopening.jpg/500px-SVUopening.jpg 2x" width="250"/>
    </a>
   </td>
  </tr>
  <tr>
   <th scope="row">
    Also known as
   </th>
   <td>
    <div class="plainlist">
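
As a small offline illustration of how find behaves with a class filter, here is a sketch that runs on a hand-written HTML snippet rather than the live page. The snippet and the variable names (`soup_demo`, `exact`, `by_one_class`, `missing`) are made up for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a page that has an infobox and one other table.
html = """
<table class="infobox vevent">
  <tr><th>Genre</th><td>Drama</td></tr>
</table>
<table class="wikitable"><tr><td>Other</td></tr></table>
"""
soup_demo = BeautifulSoup(html, "html.parser")

# A space-separated string is compared against the class attribute as written.
exact = soup_demo.find(class_="infobox vevent")

# A single class name matches any element that carries that class.
by_one_class = soup_demo.find("table", class_="infobox")

# find returns None when nothing matches.
missing = soup_demo.find(class_="no-such-class")

print(exact["class"])
print(missing)
```

Matching on a single class name is usually the more forgiving choice, since it keeps working if the page lists the classes in a different order.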
  

Once we have the element that contains the info we'd like, we can use the find_all method to collect every table row (tr) inside it and look at how each row is structured.

In [None]:
info_rows = info_box.find_all('tr')
for row in info_rows:
  print(row.prettify())

<tr>
 <th class="summary" colspan="2" style="text-align:center;font-size:125%;font-weight:bold;font-style: italic; background: #CCCCFF; padding: 0.25em 1em; line-height: 1.5em; line-height: normal;">
  Law &amp; Order:
  <span class="nowrap">
   Special Victims Unit
  </span>
 </th>
</tr>

<tr>
 <td colspan="2" style="text-align:center">
  <a class="image" href="/wiki/File:SVUopening.jpg">
   <img alt="SVUopening.jpg" data-file-height="556" data-file-width="1024" decoding="async" height="136" src="//upload.wikimedia.org/wikipedia/commons/thumb/c/cf/SVUopening.jpg/250px-SVUopening.jpg" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/c/cf/SVUopening.jpg/375px-SVUopening.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/c/cf/SVUopening.jpg/500px-SVUopening.jpg 2x" width="250"/>
  </a>
 </td>
</tr>

<tr>
 <th scope="row">
  Also known as
 </th>
 <td>
  <div class="plainlist">
   <ul>
    <li>
     <i>
      Law &amp; Order: SVU
     </i>
    </li>
    <li>
     <i>
      SVU


In [None]:
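def get_content_value(row_data):
  # row_data is the <td> cell of one infobox row; use the argument, not a global `row`
  if row_data.find("li") is not None:
    # A cell that holds a list becomes a Python list, one entry per <li>
    return [li.get_text(" ", strip = True).replace("\xa0", ' ') for li in row_data.find_all('li')]
  else:
    return row_data.get_text(" ", strip = True).replace("\xa0", ' ')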
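
To see what the get_text arguments and the "\xa0" replacement in this helper do, here is a small sketch on two hand-written cells. The HTML strings are invented for illustration and are only modeled on the infobox rows printed above.

```python
from bs4 import BeautifulSoup

# A cell that contains a list: one string per <li>.
cell = BeautifulSoup(
    "<td><ul><li><i>Law &amp; Order: SVU</i></li><li><i>SVU</i></li></ul></td>",
    "html.parser",
).td
items = [li.get_text(" ", strip=True) for li in cell.find_all("li")]

# A plain cell with a non-breaking space and a nested link.
plain = BeautifulSoup(
    "<td>486\xa0(<a>list of episodes</a>)</td>", "html.parser"
).td
# get_text(" ", strip=True) strips each text fragment and joins them with a space;
# the replace call then turns the non-breaking space into a regular one.
text = plain.get_text(" ", strip=True).replace("\xa0", " ")

print(items)
print(text)
```

Because every text fragment is joined with a space, text that sits next to a link ends up with spaces around the parentheses.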

#Web Scraping Legality

In general, you shouldn't run into problems as long as you are scraping for your own personal use. To find out which parts of a site automated clients are allowed to access, check the site's robots.txt file, for example https://en.wikipedia.org/robots.txt.
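
Python's standard library can read robots.txt rules for you. The sketch below parses a made-up rule set from memory so that it runs without a network connection; the rules and the example.org URLs are hypothetical. For a real site you would call `set_url(...)` with the site's robots.txt address and then `read()` instead of `parse(...)`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, not copied from any real site.
rules = """
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# can_fetch(user_agent, url) reports whether the rules allow that request.
allowed = parser.can_fetch("*", "https://example.org/wiki/Some_Page")
blocked = parser.can_fetch("*", "https://example.org/private/data")

print(allowed, blocked)
```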

#Scraping Info from Multiple Shows

Since we had success scraping the information for a single show, let's try to scrape info for other shows on NBC.

In [None]:
r = requests.get("https://en.wikipedia.org/wiki/List_of_programs_broadcast_by_NBC")
soup = bs(r.content, "html.parser")  # name the parser explicitly so results do not depend on what is installed
content = soup.prettify()
content

In [None]:
# "h3 ~ ul i a": links inside italic text, inside any <ul> that follows an <h3> sibling
shows = soup.select("h3 ~ ul i a")
print(shows[0]['href'])
print(shows[0]['title'])

/wiki/Law_%26_Order:_Special_Victims_Unit
Law & Order: Special Victims Unit
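
The selector "h3 ~ ul i a" uses the CSS general sibling combinator (~). Here is a minimal sketch of what it matches, run on a hand-written snippet; the HTML, show names, and variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment shaped like a section of a list page.
html = """
<div>
  <h3>Dramas</h3>
  <ul>
    <li><i><a href="/wiki/Show_A" title="Show A">Show A</a></i> (since 2020)</li>
    <li><a href="/wiki/Not_Italic" title="Not italic">Not italic</a></li>
  </ul>
  <p><i><a href="/wiki/In_Paragraph" title="In paragraph">In paragraph</a></i></p>
</div>
"""
demo = BeautifulSoup(html, "html.parser")

# "h3 ~ ul" picks <ul> elements that come after an <h3> with the same parent;
# "i a" then keeps only links nested inside italic text within those lists.
links = demo.select("h3 ~ ul i a")
hrefs = [a["href"] for a in links]
titles = [a["title"] for a in links]

print(hrefs)
print(titles)
```

Only the italic link inside the list is selected. The plain link in the list and the italic link in the paragraph are both skipped.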


In general, when we want to run the same block of code repeatedly on different inputs, we can wrap it in a function (or class) and call it with one line. Below, we first build the dictionary for a single show, then wrap the same steps in a function.


In [None]:
show_info={}

for index, row in enumerate(info_rows):
  if index==0:
    show_info['title']=row.find('th').get_text()
  elif index == 1:
    continue
  else:
    header = row.find('th')
    cell = row.find('td')
    # Keep only rows that have both a label (th) and a value (td)
    if header is not None and cell is not None:
      key = header.get_text()
      value = get_content_value(cell)
      show_info[key] = value

print(show_info)

{'title': 'Law & Order: Special Victims Unit', 'Also known as': ['Law & Order: SVU', 'SVU'], 'Genre': ['Police procedural', 'Legal drama', 'Thriller', 'Mystery'], 'Created by': 'Dick Wolf', 'Starring': ['Christopher Meloni', 'Mariska Hargitay', 'Richard Belzer', 'Dann Florek', 'Michelle Hurd', 'Stephanie March', 'Ice-T', 'BD Wong', 'Diane Neal', 'Tamara Tunie', 'Adam Beach', 'Michaela McManus', 'Danny Pino', 'Kelli Giddish', 'Raúl Esparza', 'Peter Scanavino', 'Philip Winchester', 'Jamie Gray Hyder', 'Demore Barnes'], 'Opening theme': 'Theme of Law & Order: Special Victims Unit', 'Composer': 'Mike Post', 'Country of origin': 'United States', 'Original language': 'English', 'No. of seasons': '22', 'No. of episodes': '486 ( list of episodes )', 'Executive producers': ['Dick Wolf', 'Ted Kotcheff (seasons 2–13)', 'Peter Jankowski (season 2–present)', 'Michael Smith (season 14)', 'Julie Martin (seasons 14–present)', 'Jonathan Starch (season 15–17)', 'Arthur W. Forney (season 17–)', 'Mariska 

In [None]:
def get_info(url):
  def get_content_value(row_data):
    # row_data is the <td> cell of one infobox row
    if row_data.find("li") is not None:
      return [li.get_text(" ", strip = True).replace("\xa0", ' ') for li in row_data.find_all('li')]
    else:
      return row_data.get_text(" ", strip = True).replace("\xa0", ' ')

  r = requests.get(url)
  soup = bs(r.content, "html.parser")

  # find returns None when the page has no infobox, so find_all below raises AttributeError
  info_box = soup.find(class_='infobox vevent')
  info_rows = info_box.find_all('tr')
  show_info = {}

  for index, row in enumerate(info_rows):
    if index == 0:
      show_info['title'] = row.find('th').get_text()
    elif index == 1:
      continue  # skip the second row (the image on the page inspected above)
    header = row.find('th')
    cell = row.find('td')
    # Keep only rows that have both a label (th) and a value (td)
    if header is not None and cell is not None:
      show_info[header.get_text()] = get_content_value(cell)

  return show_info

In [None]:
base_path = "https://en.wikipedia.org"

show_info_list = []

for index, show in enumerate(shows):
  try:
    relative_path = show['href']
    full_path = base_path + relative_path
    title = show['title']
    show_info_list.append(get_info(full_path))
  
  except Exception as e:
    print(show.get_text())
    print(e)

Ellen's Greatest Night of Giveaways
'NoneType' object has no attribute 'find_all'
Miss America
'NoneType' object has no attribute 'find_all'
The Thing About Pam
'NoneType' object has no attribute 'find_all'
Choose Your Own Adventure
'NoneType' object has no attribute 'find_all'
Fried Green Tomatoes
'NoneType' object has no attribute 'find_all'
Zorro
'NoneType' object has no attribute 'find_all'
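
Each error above comes from find returning None on a page that has no element with the infobox class. One possible way to handle that case explicitly is sketched below. `get_infobox_rows` and both sample HTML strings are hypothetical and are not part of the notebook's code.

```python
from bs4 import BeautifulSoup

def get_infobox_rows(html):
    """Return the infobox <tr> rows, or an empty list if there is no infobox."""
    page = BeautifulSoup(html, "html.parser")
    info_box = page.find(class_="infobox vevent")
    if info_box is None:
        # No infobox on this page: report "no rows" instead of raising.
        return []
    return info_box.find_all("tr")

# Hypothetical pages: one with an infobox, one without.
with_box = """
<table class="infobox vevent">
  <tr><th>Title</th></tr>
  <tr><th>Genre</th><td>Drama</td></tr>
</table>
"""
without_box = "<p>No infobox here</p>"

rows_found = get_infobox_rows(with_box)
rows_missing = get_infobox_rows(without_box)

print(len(rows_found), len(rows_missing))
```

Returning an empty list lets the caller decide whether to skip the page or log it, rather than relying on a broad except clause.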


In [None]:
len(show_info_list)

66
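
The loop above builds each full URL by concatenating base_path with the relative href. As an alternative sketch, the standard library's urljoin does the same job and also copes with hrefs that are already absolute or protocol-relative. The example.org and upload.wikimedia.org inputs below are illustrative only.

```python
from urllib.parse import urljoin

base_path = "https://en.wikipedia.org"

# A site-relative href, like the ones returned by the selector above.
relative_path = "/wiki/Law_%26_Order:_Special_Victims_Unit"
full_path = urljoin(base_path, relative_path)

# An href that is already absolute is returned unchanged.
absolute = urljoin(base_path, "https://example.org/page")

# A protocol-relative href picks up the scheme of the base URL.
protocol_relative = urljoin(base_path, "//upload.wikimedia.org/a.jpg")

print(full_path)
print(absolute)
print(protocol_relative)
```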

In [None]:
import pandas as pd

df = pd.DataFrame(show_info_list)

df.head()

Unnamed: 0,title,Also known as,Genre,Created by,Starring,Opening theme,Composer,Country of origin,Original language,No. of seasons,No. of episodes,Executive producers,Camera setup,Running time,Production companies,Distributor,Original network,Picture format,Original release,Related shows,Producers,Composers,Producer,Production location,Cinematography,Followed by,Developed by,Production locations,Editors,Budget,Based on,Music by,Theme music composer,Editor,Audio format,Narrated by,Directed by,Creative director,Presented by,Judges,Preceded by,Ending theme,No. of series,Awarded for,Country,First awarded,Website,Network,Production company,Written by,Executive producer,Produced by,Edited by,Productioncompany,Distributed by,Release date,Language,Box office
0,Law & Order: Special Victims Unit,"[Law & Order: SVU, SVU]","[Police procedural, Legal drama, Thriller, Mys...",Dick Wolf,"[Christopher Meloni, Mariska Hargitay, Richard...",Theme of Law & Order: Special Victims Unit,Mike Post,United States,English,22,486 ( list of episodes ),"[Dick Wolf, Ted Kotcheff (seasons 2–13), Peter...",Single-camera,40–44 minutes,"[Wolf Films (1999–2019), Wolf Entertainment (2...",NBCUniversal Television Distribution,NBC,"[NTSC ( 480i ) (1999–2002), HDTV 1080i (2002–p...","September 20, 1999 ( 1999-09-20 ) – present",Law & Order franchise,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,Chicago Fire,,"[Action, Drama [1]]","[Derek Haas, Michael Brandt]","[Jesse Spencer, Taylor Kinney, Monica Raymund,...",,Atli Örvarsson,United States,English,9,187 ( list of episodes ),"[Dick Wolf, Matt Olmstead, Danielle Gelber, Mi...",Single-camera,42 minutes,"[Wolf Films (seasons 1–7), Wolf Entertainment ...",NBCUniversal Television Distribution,NBC,HDTV 1080i,"October 10, 2012 ( 2012-10-10 ) – present ( pr...","[Chicago P.D., Chicago Med]","[John L. Roman, Todd Arnow, Tim Deluca, Hilly ...",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,The Blacklist,,"[Crime drama, Action [1], Thriller, Mystery]",Jon Bokenkamp,"[James Spader, Megan Boone, Diego Klattenhoff,...",,,United States,English,8,161 ( list of episodes ),"[Jon Bokenkamp, John Davis, John Eisendrath, J...",Single-camera,40-45 minutes,"[Davis Entertainment, Universal Television, So...","[Sony Pictures Television, NBCUniversal Televi...",NBC,576i SDTV 1080i HDTV 4K UHDTV (2016–),"September 23, 2013 ( 2013-09-23 ) – present ( ...",,,"[Dave Porter, James S. Levine]",Anthony Sparks,New York City,"[Frank Prinzi, Arthur Albert, Michael Caraccio...",The Blacklist: Redemption,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,Chicago P.D.,,"[Police procedural, Crime Drama]","[Dick Wolf, Matt Olmstead]","[Jason Beghe, Jon Seda, Sophia Bush, Jesse Lee...",,Atli Örvarsson,United States,English,8,156 ( list of episodes ),"[Dick Wolf, Matt Olmstead, Danielle Gelber, Mi...",Single-camera,42 minutes,"[Wolf Films (seasons 1–6), Wolf Entertainment ...",NBCUniversal Television Distribution,NBC,HDTV 1080i,"January 8, 2014 ( 2014-01-08 ) – present ( pre...",Chicago Justice,"[Terry Miller, Jamie Pachino, Jeremy Beim, Mic...",,,,,,"[Derek Haas, Michael Brandt]",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,Chicago Med,,Medical drama,"[Dick Wolf, Matt Olmstead]","[Nick Gehlfuss, Yaya DaCosta, Torrey DeVitto, ...",,Atli Örvarsson,United States,English,6,111 ( list of episodes ),"[Dick Wolf, Diane Frolov, Andrew Schneider, Mi...",Single-camera,42 minutes,"[Wolf Films (season 1–4), Wolf Entertainment (...",NBCUniversal Television Distribution,NBC,HDTV 1080i,"November 17, 2015 ( 2015-11-17 ) – present ( p...",,"[Charles S. Carroll, Jeffrey Drayer, David Wei...",,,,,,"[Derek Haas, Michael Brandt]",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
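
Values such as '486 ( list of episodes )' are still strings, so they need cleaning before any numeric analysis. A minimal sketch of one way to pull out the leading number follows; `leading_int` is a hypothetical helper that is not defined elsewhere in this notebook.

```python
import re

def leading_int(value):
    """Return the first integer found in value, or None if there is none."""
    if not isinstance(value, str):
        # Missing cells show up as None or NaN rather than text.
        return None
    match = re.search(r"\d[\d,]*", value)
    if match is None:
        return None
    # Drop thousands separators before converting.
    return int(match.group().replace(",", ""))

episodes = leading_int("486 ( list of episodes )")
seasons = leading_int("22")
missing_value = leading_int(None)

print(episodes, seasons, missing_value)
```

Assuming the DataFrame built above, the helper could be applied to a whole column with something like `df["No. of episodes"].map(leading_int)`.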


In [None]:
get_info("https://en.wikipedia.org/wiki/Law_%26_Order:_Special_Victims_Unit")

{'Also known as': 'Website',
 'Camera setup': 'Website',
 'Composer': 'Website',
 'Country of origin': 'Website',
 'Created by': 'Website',
 'Distributor': 'Website',
 'Executive producers': 'Website',
 'Genre': 'Website',
 'No. of episodes': 'Website',
 'No. of seasons': 'Website',
 'Opening theme': 'Website',
 'Original language': 'Website',
 'Original network': 'Website',
 'Original release': 'Website',
 'Picture format': 'Website',
 'Production companies': 'Website',
 'Related shows': 'Website',
 'Running time': 'Website',
 'Starring': 'Website',
 'title': 'Law & Order: Special Victims Unit'}

Every value above is 'Website', the text of the last infobox row. This is most likely the result of get_content_value reading a leftover `row` variable from an earlier cell instead of its `row_data` argument. When the helper reads its argument, this call returns the same values as the show_info dictionary printed earlier.