# Lab 8: Web Scraping Wikipedia

### Author: <font color='red'> Michelle Moore </font>

In [55]:
# --- Tutorial retrieved October 2020 ---
# How To Web Scrape Wikipedia 
# Using Python, Urllib, Beautiful Soup and Pandas
# by Alan Hylands
# URL: https://alanhylands.com/how-to-web-scrape-wikipedia-python-urllib-beautiful-soup-pandas/

## Follow the instructions (with provided code) from <em>How To Web Scrape Wikipedia</em> to complete PART A

### URL: https://alanhylands.com/how-to-web-scrape-wikipedia-python-urllib-beautiful-soup-pandas/ 

## PART A (10 points)

### Tutorial Steps 3-4

In [56]:
# import the library we use to open URLs
import urllib.request

# import BeautifulSoup4 library so we can parse HTML and XML docs
from bs4 import BeautifulSoup

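# SSL FIX: workaround for local certificate errors
# NOTE: this disables HTTPS certificate verification for urllib in this session
import ssl
ssl._create_default_https_context = ssl._create_unverified_context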

url = "https://en.wikipedia.org/wiki/1999%E2%80%932000_FA_Premier_League"

# open the url using urllib.request and put the HTML into the page variable
page = urllib.request.urlopen(url)

# parse the HTML from our URL into the BeautifulSoup parse tree format
soup = BeautifulSoup(page, "lxml")
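The SSL fix above switches off certificate verification for every HTTPS request made through urllib. As a hedged alternative, the sketch below keeps verification on by passing an explicit context to `urlopen`, and also sets a descriptive User-Agent header. The `build_request` helper and the User-Agent string are hypothetical, and the network call itself is left commented out. If the default context fails on your machine, the local certificate store may need attention.

```python
import ssl
import urllib.request

URL = "https://en.wikipedia.org/wiki/1999%E2%80%932000_FA_Premier_League"


def build_request(url, user_agent="lab8-scraper/0.1"):
    """Return a Request carrying a custom User-Agent (hypothetical helper)."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


# A default context verifies certificates and host names.
context = ssl.create_default_context()
request = build_request(URL)

# Uncomment to fetch the page (requires network access):
# page = urllib.request.urlopen(request, context=context)
```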

<div class="alert alert-block alert-info">
<b>ATTENTION:</b> The next line of code will generate a LARGE amount of HTML. Take some time to look through the HTML, but afterwards, you may want to comment out this line of code to minimize the size/length of the notebook output.</div>

### Tutorial Step 5 

In [57]:
# use prettify() to view the HTML
# print(soup.prettify())

### Tutorial Steps 6-7

In [58]:
# Find the table we want ... you can use CTRL+F to open Find and 
# search for "wikitable sortable"
# we'll extract data from the 'td' tags

In [59]:
# the title attribute returns the <title> tag together with the content
# between its opening and closing tags
soup.title

# refine this a step further by specifying the 'string' element and only bring 
# back the content without the 'title' tags
soup.title.string

'1999–2000 FA Premier League - Wikipedia'
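To see the difference between the title tag and its string without fetching a page, here is a small sketch on made-up HTML. The HTML snippet and variable names are illustrative only, and `html.parser` is used so the example does not depend on lxml.

```python
from bs4 import BeautifulSoup

html = "<html><head><title>Demo page - Wikipedia</title></head><body></body></html>"
soup = BeautifulSoup(html, "html.parser")

title_tag = soup.title          # the whole <title> element, tags included
title_text = soup.title.string  # only the text between the tags

print(title_tag)
print(title_text)
```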

<div class="alert alert-block alert-info">
<b>NOTE:</b> The following Table Image was added so you have a visual image of the table to be parsed without having to go back to the URL or open it in another tab.</div>

In [60]:
# The table we want to parse is the Personnel and Kits table with column headings:
# Team, Manager, Captain, Kit manufacturer and Shirt sponsor

# NOTE: There are 5 columns of data (view the actual webpage if needed)
from IPython.display import Image
Image("WebScrapingWiki-image1.jpg")

<IPython.core.display.Image object>

### Tutorial Step 8 (part 1 of 2)

In [61]:
# CODE FOR STEP 8 (part 1)
# Use the 'find_all' function to bring back all instances of the 'table' tag in the HTML  
# with class_='wikitable sortable'
all_sortable_tables = soup.find_all('table', class_='wikitable sortable')
all_sortable_tables

[<table class="wikitable sortable">
 <tbody><tr>
 <th>Team
 </th>
 <th>Location
 </th>
 <th>Stadium
 </th>
 <th>Capacity
 </th></tr>
 <tr>
 <td><a href="/wiki/Arsenal_F.C." title="Arsenal F.C.">Arsenal</a>
 </td>
 <td><a href="/wiki/London" title="London">London</a> <span style="font-size:85%;">(<a href="/wiki/Highbury" title="Highbury">Highbury</a>)</span>
 </td>
 <td><a href="/wiki/Arsenal_Stadium" title="Arsenal Stadium">Arsenal Stadium</a>
 </td>
 <td align="center">38,419
 </td></tr>
 <tr>
 <td><a href="/wiki/Aston_Villa_F.C." title="Aston Villa F.C.">Aston Villa</a>
 </td>
 <td><a href="/wiki/Birmingham" title="Birmingham">Birmingham</a>
 </td>
 <td><a href="/wiki/Villa_Park" title="Villa Park">Villa Park</a>
 </td>
 <td align="center">42,573
 </td></tr>
 <tr>
 <td><a href="/wiki/Bradford_City_A.F.C." title="Bradford City A.F.C.">Bradford City</a>
 </td>
 <td><a href="/wiki/Bradford" title="Bradford">Bradford</a>
 </td>
 <td><a href="/wiki/Valley_Parade" title="Valley Parade">Val

### Tutorial Step 8 (part 2 of 2)

In [62]:
# CODE FOR STEP 8 (part 2)
# We want the 2nd 'wikitable sortable' table ... so use indexing to get it
right_table = all_sortable_tables[1]
right_table

<table class="wikitable sortable">
<tbody><tr>
<th>Team
</th>
<th>Manager
</th>
<th>Captain
</th>
<th>Kit manufacturer
</th>
<th>Shirt sponsor
</th></tr>
<tr>
<td>Arsenal
</td>
<td><span class="flagicon"><a href="/wiki/France" title="France"><img alt="France" class="thumbborder" data-file-height="600" data-file-width="900" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/en/thumb/c/c3/Flag_of_France.svg/23px-Flag_of_France.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/c/c3/Flag_of_France.svg/35px-Flag_of_France.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/c/c3/Flag_of_France.svg/45px-Flag_of_France.svg.png 2x" width="23"/></a></span> <a href="/wiki/Ars%C3%A8ne_Wenger" title="Arsène Wenger">Arsène Wenger</a>
</td>
<td><span class="flagicon"><a href="/wiki/England" title="England"><img alt="England" class="thumbborder" data-file-height="480" data-file-width="800" decoding="async" height="14" src="//upload.wikimedia.org/wikipedia/en/thumb/b/be/Fla
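One detail worth knowing about Step 8: when `class_` is given a multi-word string, Beautiful Soup compares it with the exact text of the class attribute, so the word order matters. The sketch below, on made-up HTML, contrasts that with a CSS selector, which matches both classes in any order. The three tiny tables are assumptions for illustration, not the real page.

```python
from bs4 import BeautifulSoup

html = """
<table class="wikitable sortable"><tr><td>first</td></tr></table>
<table class="sortable wikitable"><tr><td>second</td></tr></table>
<table class="wikitable"><tr><td>third</td></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# A multi-word class_ string must equal the class attribute's exact text.
exact = soup.find_all("table", class_="wikitable sortable")

# A CSS selector requires both classes but ignores their order.
either_order = soup.select("table.wikitable.sortable")

print(len(exact), len(either_order))
```

Indexing works the same way on either result, for example `either_order[1]` for the second match.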

### Tutorial Steps 9-10

In [63]:
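# one list per table column
A = []  # Team
B = []  # Manager
C = []  # Captain
D = []  # Kit manufacturer
E = []  # Shirt sponsor

for row in right_table.find_all('tr'):
    cells = row.find_all('td')

    # keep only data rows: the header row uses <th>, so it has no <td> cells
    if len(cells) == 5:
        A.append(cells[0].find(string=True).strip())
        B.append(cells[1].find(string=True).strip())
        C.append(cells[2].find(string=True).strip())
        D.append(cells[3].find(string=True).strip())
        E.append(cells[4].find(string=True).strip())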
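The row loop can be tried on a tiny table first. This is a minimal sketch on made-up HTML with two columns instead of five; it shows why the `len(cells)` check skips the header row, which contains only `th` cells.

```python
from bs4 import BeautifulSoup

html = """
<table>
  <tr><th>Team</th><th>Kit manufacturer</th></tr>
  <tr><td> Arsenal </td><td> Nike </td></tr>
  <tr><td> Chelsea </td><td> Umbro </td></tr>
</table>
"""
table = BeautifulSoup(html, "html.parser").table

teams, kits = [], []
for row in table.find_all("tr"):
    cells = row.find_all("td")
    # The header row yields zero <td> cells, so it fails this check.
    if len(cells) == 2:
        teams.append(cells[0].find(string=True).strip())
        kits.append(cells[1].find(string=True).strip())

print(teams, kits)
```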

### Tutorial Step 11

In [75]:
import pandas as pd
df = pd.DataFrame(A, columns=['Team'])
df['Manager'] = B
df['Captain'] = C
df['Kit_manufacturer'] = D
df['Shirt_sponsor'] = E
df

| | Team | Manager | Captain | Kit_manufacturer | Shirt_sponsor |
|---|---|---|---|---|---|
| 0 | Arsenal | Arsène Wenger | Tony Adams | Nike | Dreamcast |
| 1 | Aston Villa | John Gregory | Gareth Southgate | Reebok | LDV Vans |
| 2 | Bradford City | Paul Jewell | Stuart McCall | Asics | JCT600 |
| 3 | Chelsea | Gianluca Vialli | Dennis Wise | Umbro | Autoglass |
| 4 | Coventry City | Gordon Strachan | Gary McAllister | CCFC Garments | Subaru |
| 5 | Derby County | Jim Smith | Darryl Powell | Puma | EDS |
| 6 | Everton | Walter Smith | Dave Watson | Umbro | One2One |
| 7 | Leeds United | David O'Leary | Lucas Radebe | Puma | Packard Bell |
| 8 | Leicester City | Martin O'Neill | Matt Elliott | Fox Leisure | Walkers Crisps |
| 9 | Liverpool | Gérard Houllier | Jamie Redknapp | Reebok | Carlsberg Group |
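As an alternative to adding one column at a time, a DataFrame can be built from a dictionary of lists in a single call. The sketch below uses the first two rows from the table above as sample data; the list names are illustrative stand-ins for A through E.

```python
import pandas as pd

# Sample values taken from the first two rows of the table.
teams = ["Arsenal", "Aston Villa"]
managers = ["Arsène Wenger", "John Gregory"]
captains = ["Tony Adams", "Gareth Southgate"]
kits = ["Nike", "Reebok"]
sponsors = ["Dreamcast", "LDV Vans"]

# Dictionary keys become column names; all lists must be the same length.
df_sample = pd.DataFrame({
    "Team": teams,
    "Manager": managers,
    "Captain": captains,
    "Kit_manufacturer": kits,
    "Shirt_sponsor": sponsors,
})
print(df_sample)
```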


### Tutorial Step 12

<div class="alert alert-block alert-info">
    <b>NOTE:</b> The following Table Image was added so you can again see the added <strong>flags</strong> that are in the Manager and Captain columns.</div>

In [65]:
# Searching for the problem ... 
# look back at the HTML, specifically associated with the column headings of Manager & Captain. 
# These 2 columns are a bit different than the other 3 columns due to the addition of small flag icons.

from IPython.display import Image
Image("WebScrapingWiki-image2.jpg")

<IPython.core.display.Image object>

In [76]:
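A = []
B = []
C = []
D = []
E = []

for row in right_table.find_all('tr'):
    cells = row.find_all('td')

    if len(cells) == 5:
        A.append(cells[0].find(string=True).strip())
        # Manager and Captain cells start with a flag link, so the name
        # is the text of the second <a> tag
        mlnk = cells[1].find_all('a')
        B.append(mlnk[1].contents[0].strip())
        clnk = cells[2].find_all('a')
        C.append(clnk[1].contents[0].strip())
        D.append(cells[3].find(string=True).strip())
        E.append(cells[4].find(string=True).strip())

# rebuild the DataFrame so it uses the corrected Manager and Captain lists
df = pd.DataFrame({'Team': A, 'Manager': B, 'Captain': C,
                   'Kit_manufacturer': D, 'Shirt_sponsor': E})
df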
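The sketch below reproduces the flag problem on a single made-up cell, so the three extraction approaches can be compared side by side. The cell HTML is a simplified assumption modeled on the Manager column; `get_text(strip=True)` is shown as one possible alternative, not as the tutorial's method.

```python
from bs4 import BeautifulSoup

html = (
    '<td><span class="flagicon"><a href="/wiki/France" title="France">'
    '<img alt="France" src="flag.png"/></a></span> '
    '<a href="/wiki/Ars%C3%A8ne_Wenger" title="Arsène Wenger">Arsène Wenger</a>\n</td>'
)
cell = BeautifulSoup(html, "html.parser").td

# The first text node is the space after the flag, so this comes back empty.
first_text = cell.find(string=True).strip()

# The tutorial's fix: take the text of the second link in the cell.
second_link = cell.find_all("a")[1].contents[0].strip()

# Alternative: join all non-empty text in the cell, ignoring the flag image.
all_text = cell.get_text(strip=True)

print(repr(first_text), second_link, all_text)
```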

### End of Tutorial - One More Step!!! (5 points)

In [77]:
# Save the DataFrame (df) to a CSV file named 'wiki_tutorial.csv' using pandas' to_csv() function

df.to_csv('wiki_tutorial.csv')
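By default `to_csv()` also writes the row index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back. The sketch below compares the default with `index=False`, using an in-memory buffer instead of a file; the two-row DataFrame is sample data only.

```python
import io

import pandas as pd

df_demo = pd.DataFrame({
    "Team": ["Arsenal", "Chelsea"],
    "Manager": ["Arsène Wenger", "Gianluca Vialli"],
})

# Default: the index is written as an extra, unnamed column.
with_index = io.StringIO()
df_demo.to_csv(with_index)
with_index.seek(0)

# index=False writes only the data columns.
without_index = io.StringIO()
df_demo.to_csv(without_index, index=False)
without_index.seek(0)

cols_with = list(pd.read_csv(with_index).columns)
cols_without = list(pd.read_csv(without_index).columns)
print(cols_with)
print(cols_without)
```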

# PART B (80 points)

## Use the code from <em>How To Web Scrape Wikipedia</em> as a guide for completing PART B

### Use the wikipedia page for North Carolina's Demographics
### URL: https://en.wikipedia.org/wiki/Demographics_of_North_Carolina

<span style="color:blue">
    Make sure you take a look at the North Carolina Demographics Wikipedia page before you get started.<br>
    You are going to need to collect the data from the <strong>Historical population</strong> table.<br>
    1. Import <strong>urllib</strong> <br>
    2. Set the variable <strong>url</strong> to the correct URL (listed above). <br>
    3. Import <strong>BeautifulSoup</strong> <br>
    4. Use <strong>BeautifulSoup</strong> to parse the HTML from the URL.<br>
    5. Take a look at the underlying HTML using <strong>.prettify()</strong> <br>
    6. Get the page <strong>'title.string'</strong> element and <strong>print</strong> the value without the &lt;title&gt; tags <br>
</span>

In [68]:
import urllib.request
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Demographics_of_North_Carolina"
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, "lxml")
print(soup.prettify())
print(soup.title.string)

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Demographics of North Carolina - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"7d763545-bef9-4dfb-a4cb-11185ec23b09","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Demographics_of_North_Carolina","wgTitle":"Demographics of North Carolina","wgCurRevisionId":1021862398,"wgRevisionId":1021862398,"wgArticleId":16768431,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Webarchive template wayback links","Articles with short description","Short description is differ

In [69]:
# CODE PROVIDED - THIS IS THE TABLE YOU WANT TO PARSE
# The table to parse is the Historical population table with columns: 
#     Census, Pop and %+-
from IPython.display import Image
Image("NC-Demographics-image1.jpg")

<IPython.core.display.Image object>

<div class="alert alert-block alert-info">
    <b>HINT:</b> Browse the output from the <strong>prettify</strong> step or <strong>view the source</strong> in your browser <br>
    where you can use <strong>CTRL+F</strong> to open a find window to search for <strong>table</strong> tags <br><br>
    You will want to focus on the HTML related to the table shown above. <br>
    Look/search for a <strong>&lt;table ... &gt;</strong> tag and the column headings for this table. <br><br>
    <ul>
    <li><strong>&lt;table class="toccolours" ... &gt;</strong></li>
    <li><strong>&lt;th style="text-align:center; border-bottom:1px solid black"&gt;Census&lt;/th&gt;</strong></li>
    </ul>
</div>
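Instead of scrolling through the full prettify output, the table tags can also be listed programmatically. This is a hedged sketch on made-up HTML: it prints each table's position, its class list, and its first header cell, which is usually enough to spot the one you want. The `summary` list is an illustrative name.

```python
from bs4 import BeautifulSoup

html = """
<table class="infobox"><tr><td>x</td></tr></table>
<table class="toccolours" style="float: right"><tr><th>Census</th></tr></table>
<table><tr><td>no class</td></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

summary = []
for index, table in enumerate(soup.find_all("table")):
    classes = table.get("class", [])  # list of class names, empty if none
    first_header = table.find("th")
    label = first_header.get_text(strip=True) if first_header else "(no header)"
    summary.append((index, list(classes), label))
    print(index, classes, label)
```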

<span style="color:blue">
    <strong>NOTE:</strong> You can use the <em>find()</em> command instead of the <em>find_all()</em> command because there is ONLY 1 table tag on this page <br><br>
    7. Use a <strong>find(...)</strong> command to get the 'table' tag with <strong>class="toccolours"</strong> <br>
    8. Display the results 
</span>

In [70]:
demo_table = soup.find('table', class_='toccolours')
print(demo_table.prettify())

<table class="toccolours" style="border-spacing: 1px; float: right; clear: right; margin: 0 0 1em 1em; text-align:right">
 <tbody>
  <tr>
   <th class="navbox-title" colspan="4" style="padding-right:3px; padding-left:3px; font-size:110%; text-align:center">
    Historical population
   </th>
  </tr>
  <tr>
   <th style="text-align:center; border-bottom:1px solid black">
    Census
   </th>
   <th style="text-align:center; border-bottom:1px solid black">
    <abbr title="Population">
     Pop.
    </abbr>
   </th>
   <th style="text-align:center; border-bottom:1px solid black">
   </th>
   <th style="text-align:center; border-bottom:1px solid black">
    <abbr title="Percent change">
     %±
    </abbr>
   </th>
  </tr>
  <tr>
   <td style="text-align:center">
    <b>
     <a href="/wiki/1790_United_States_census" title="1790 United States census">
      1790
     </a>
    </b>
   </td>
   <td style="padding-left:8px; border-right:none; padding-right:0; text-align:right;">
    393,751
 

<span style="color:blue">
     9. Use a <strong>for-loop</strong> to iterate through the values returned by a <strong>find_all('tr')</strong><br>
    - These will be the <em>rows</em><br>
    10. Call <strong>find_all('td')</strong> on each <strong>row</strong> returned by step 9 <br>
    - These will be the <em>columns</em><br>
    11. Print the <strong>number of cells</strong> in each <strong>row</strong> (the length of the list from step 10)<br>
    - I want you to do this because there is a bit of a gotcha here. You need to see how many table columns exist vs. what you see.<br>
    12. Use a <strong>for-loop</strong> to iterate through the <strong>cells</strong> to see the contents of each <strong>column</strong><br> 
    13. Print the value of each <strong>column</strong>
</span>

In [71]:
for row in demo_table.find_all('tr'):
    cells = row.find_all('td')
    # number of <td> cells in this row
    print(len(cells))

    for cell in cells:
        print(cell)
    

0
0
4
<td style="text-align:center"><b><a href="/wiki/1790_United_States_census" title="1790 United States census">1790</a></b></td>
<td style="padding-left:8px; border-right:none; padding-right:0; text-align:right;">393,751</td>
<td style="border-left:none; padding-left:0; text-align:left;"></td>
<td style="padding-left:8px; text-align: right;">—</td>
4
<td style="text-align:center"><b><a href="/wiki/1800_United_States_census" title="1800 United States census">1800</a></b></td>
<td style="padding-left:8px; border-right:none; padding-right:0; text-align:right;">478,103</td>
<td style="border-left:none; padding-left:0; text-align:left;"></td>
<td style="padding-left:8px; text-align: right;">21.4%</td>
4
<td style="text-align:center"><b><a href="/wiki/1810_United_States_census" title="1810 United States census">1810</a></b></td>
<td style="padding-left:8px; border-right:none; padding-right:0; text-align:right;">556,526</td>
<td style="border-left:none; padding-left:0; text-align:left;"><
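The output shows the gotcha: each data row has four `td` cells even though the rendered table appears to have three columns. The sketch below, on a simplified made-up row, pulls out the text of each cell to show that the third one is an empty spacer. It also shows that `find(string=True)` returns `None` for an empty cell, so calling `.strip()` on that result would raise an error.

```python
from bs4 import BeautifulSoup

html = """
<table>
  <tr><th>Census</th><th>Pop.</th><th></th><th>%±</th></tr>
  <tr>
    <td><b><a href="/wiki/1800_United_States_census">1800</a></b></td>
    <td>478,103</td>
    <td></td>
    <td>21.4%</td>
  </tr>
</table>
"""
table = BeautifulSoup(html, "html.parser").table
data_cells = table.find_all("tr")[1].find_all("td")

# get_text() returns an empty string for the spacer cell.
texts = [cell.get_text(strip=True) for cell in data_cells]

# find(string=True) has no text node to return for the spacer cell.
spacer = data_cells[2].find(string=True)

print(len(data_cells), texts, spacer)
```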

<span style="color:blue">
    14. Create 3 lists: <strong> A, B and C </strong> for each column: <strong>Census, Population and Percent</strong>.<br>
    15. Re-execute the <strong>for-loop</strong> to iterate through the values returned by a <strong>find_all('tr')</strong> to get the contents of each <strong>row</strong><br>
    16. Re-execute the <strong>find_all('td')</strong> to get the contents of each <strong>cell</strong><br>
    17. <strong>if len(cells) == 4:</strong>, APPEND the correct <strong>cells[<em>index</em>]</strong> value <strong>text</strong> to the correct list (A, B or C)<br>
        - Refer back to the Tutorial code if you need help with this command <br>
</span>

In [85]:
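A = []  # Census
B = []  # Population
C = []  # Percent

for row in demo_table.find_all('tr'):
    cells = row.find_all('td')

    if len(cells) == 4:
        A.append(cells[0].find(string=True))
        B.append(cells[1].find(string=True))
        # cells[2] is an empty spacer column, so the percent is in cells[3]
        C.append(cells[3].find(string=True))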
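The values collected above are strings. If numeric columns are wanted later, a small cleaning step could look like the sketch below. Both helper functions are hypothetical and not part of the lab requirements. The handling of the Unicode minus sign is an assumption about how negative changes might be written; it is not something the output above demonstrates.

```python
def parse_population(text):
    """Convert a string such as '478,103' to an int (hypothetical helper)."""
    return int(text.strip().replace(",", ""))


def parse_percent(text):
    """Convert '21.4%' to 21.4; return None when the cell holds no number."""
    cleaned = text.strip().rstrip("%")
    # Assumption: negative values may use U+2212 MINUS SIGN instead of '-'.
    cleaned = cleaned.replace("\u2212", "-")
    try:
        return float(cleaned)
    except ValueError:
        # For example the dash placeholder in the 1790 row.
        return None


print(parse_population("478,103"), parse_percent("21.4%"), parse_percent("\u2014"))
```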

<span style="color:blue">
    18. Import <strong>pandas</strong><br>
    19. Convert the lists into a <strong>DataFrame</strong> named <strong>df</strong> with the list vs. column heading correlation:<br>
        <ul>
        <li>A = 'Census'</li>
        <li>B = 'Population'</li>
        <li>C = 'Percent'</li>
        </ul>
    20. Display the DataFrame <strong>df</strong>
</span>

In [88]:
import pandas as pd

df = pd.DataFrame(A, columns=['Census'])
df['Population'] = B
df['Percent'] = C
df

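| | Census | Population | Percent |
|---|---|---|---|
| 0 | 1790 | 393751 | — |
| 1 | 1800 | 478103 | 21.4% |
| 2 | 1810 | 556526 | 16.4% |
| 3 | 1820 | 638829 | 14.8% |
| 4 | 1830 | 737987 | 15.5% |
| 5 | 1840 | 753419 | 2.1% |
| 6 | 1850 | 869039 | 15.3% |
| 7 | 1860 | 992622 | 14.2% |
| 8 | 1870 | 1071361 | 7.9% |
| 9 | 1880 | 1399750 | 30.7% |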


### End of PART B - One More Step!!! (5 points)

In [89]:
# Save the DataFrame (df) to a CSV file named 'NC_demographics.csv' using pandas' to_csv() function

df.to_csv('NC_demographics.csv')