# Web_Scraping

Here I demonstrate how to scrape country and capital information from Wikipedia. I did this work for Northwestern MSiA's Text Analytics class. Link to the Wikipedia page: https://en.wikipedia.org/wiki/List_of_national_capitals_in_alphabetical_order

Useful links: 
- http://web.stanford.edu/~zlotnick/TextAsData/Web_Scraping_with_Beautiful_Soup.html

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc" style="margin-top: 1em;"><ul class="toc-item"><li><span><a href="#Get-HTML-data" data-toc-modified-id="Get-HTML-data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Get HTML data</a></span></li><li><span><a href="#Extracting-relevant-information" data-toc-modified-id="Extracting-relevant-information-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Extracting relevant information</a></span></li><li><span><a href="#Getting-text-from-the-link" data-toc-modified-id="Getting-text-from-the-link-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Getting text from the link</a></span><ul class="toc-item"><li><span><a href="#Simple-method" data-toc-modified-id="Simple-method-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Simple method</a></span></li><li><span><a href="#Method-with-text-cleaning" data-toc-modified-id="Method-with-text-cleaning-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Method with text cleaning</a></span></li><li><span><a href="#Convert-all-links-to-text" data-toc-modified-id="Convert-all-links-to-text-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Convert all links to text</a></span></li></ul></li><li><span><a href="#Putting-it-together" data-toc-modified-id="Putting-it-together-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Putting it together</a></span></li></ul></div>

In [3]:
# import packages 
# essentials 
import pandas as pd 
import numpy as np 

# web scraping 
import requests 
from bs4 import BeautifulSoup

## Get HTML data


In [16]:
# url to scrape 
url = 'https://en.wikipedia.org/wiki/List_of_national_capitals_in_alphabetical_order'
# requests fetches the page 
page = requests.get(url)
# <Response [200]> means the request succeeded 
print(page)

# BeautifulSoup parses the HTML into a navigable tree 
soup = BeautifulSoup(page.content, 'html.parser')
# prettify() shows the html structure in a readable, indented form 
print(soup.prettify()[:1000])

<Response [200]>
<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of national capitals in alphabetical order - Wikipedia
  </title>
  <script>
   document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );
  </script>
  <script>
   (window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"List_of_national_capitals_in_alphabetical_order","wgTitle":"List of national capitals in alphabetical order","wgCurRevisionId":809057985,"wgRevisionId":809057985,"wgArticleId":33728,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles containing Spanish-language text","Lists of countries","Lists of capitals"],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTra

## Extracting relevant information

Looking at the HTML structure, we find that the relevant data sits inside the table whose class is 'wikitable sortable'.


In [None]:
'''
HTML STRUCTURE 
    table 
        wikitable sortable 
            tr 
                td 
                    a 
                        title -> 'names' 
                        href -> 'links'
    p 
        'contents'
    
EXAMPLES 
- <a href="/wiki/Abu_Dhabi" title="Abu Dhabi">
- <a href="/wiki/United_Arab_Emirates" title="United Arab Emirates">
- https://en.wikipedia.org/wiki/Abu_Dhabi
- https://en.wikipedia.org/wiki/United_Arab_Emirates
'''
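
The structure sketched above can be tried offline on a small hand-written fragment before touching the live page. This is a minimal sketch, not the notebook's own code: `SAMPLE_HTML` and `extract_rows` are hypothetical names introduced for illustration, and the CSS selector is one possible way to match the table class.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the page, written by hand for illustration only.
SAMPLE_HTML = """
<table class="wikitable sortable">
  <tr><th>City</th><th>Country</th><th>Notes</th></tr>
  <tr>
    <td><a href="/wiki/Abu_Dhabi" title="Abu Dhabi">Abu Dhabi</a></td>
    <td><b><a href="/wiki/United_Arab_Emirates" title="United Arab Emirates">United Arab Emirates</a></b></td>
    <td></td>
  </tr>
</table>
"""


def extract_rows(html):
    """Return (city, country, city_href, country_href) for each data row."""
    soup = BeautifulSoup(html, 'html.parser')
    # select_one matches both classes regardless of their order in the attribute
    table = soup.select_one('table.wikitable.sortable')
    rows = []
    for tr in table.find_all('tr'):
        cells = tr.find_all('td')
        # the header row has only 'th' cells, so it is skipped here
        if len(cells) < 2:
            continue
        city_a = cells[0].find('a')
        country_a = cells[1].find('a')
        # skip rows where either cell has no link instead of raising an error
        if city_a is None or country_a is None:
            continue
        rows.append((city_a.get('title'), country_a.get('title'),
                     city_a.get('href'), country_a.get('href')))
    return rows


sample_rows = extract_rows(SAMPLE_HTML)
print(sample_rows)
```

Checking for a missing `a` tag is a defensive choice: `find` returns `None` when nothing matches, and calling `.get` on `None` would stop the whole loop.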

In [34]:
# information is stored in the table with class "wikitable sortable" 
table = soup.find('table',"wikitable sortable")
# len() of a tag counts its direct children, not just the rows 
print(len(table))
print(table.prettify()[:1000])

503
<table class="wikitable sortable" style="font-size:95%;">
 <tr>
  <th scope="col" style="width:120px;">
   City
  </th>
  <th scope="col" style="width:200px;">
   Country
  </th>
  <th class="sortable">
   Notes
  </th>
 </tr>
 <tr>
  <td>
   <a href="/wiki/Abu_Dhabi" title="Abu Dhabi">
    Abu Dhabi
   </a>
  </td>
  <td>
   <b>
    <span class="flagicon" style="display:inline-block;width:25px;">
     <img alt="" class="thumbborder" data-file-height="600" data-file-width="1200" height="12" src="//upload.wikimedia.org/wikipedia/commons/thumb/c/cb/Flag_of_the_United_Arab_Emirates.svg/23px-Flag_of_the_United_Arab_Emirates.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/c/cb/Flag_of_the_United_Arab_Emirates.svg/35px-Flag_of_the_United_Arab_Emirates.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/c/cb/Flag_of_the_United_Arab_Emirates.svg/46px-Flag_of_the_United_Arab_Emirates.svg.png 2x" width="23"/>
    </span>
    <a href="/wiki/United_Arab_Emirates" title

In [39]:
# within wikitable, information is inside 'tr' 
table_tr = table.findAll('tr')
print(len(table_tr))
print()

# first row 
print(table_tr[0])
print()

# second row 
print(table_tr[1])

251

<tr>
<th scope="col" style="width:120px;">City</th>
<th scope="col" style="width:200px;">Country</th>
<th class="sortable">Notes</th>
</tr>

<tr>
<td><a href="/wiki/Abu_Dhabi" title="Abu Dhabi">Abu Dhabi</a></td>
<td><b><span class="flagicon" style="display:inline-block;width:25px;"><img alt="" class="thumbborder" data-file-height="600" data-file-width="1200" height="12" src="//upload.wikimedia.org/wikipedia/commons/thumb/c/cb/Flag_of_the_United_Arab_Emirates.svg/23px-Flag_of_the_United_Arab_Emirates.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/c/cb/Flag_of_the_United_Arab_Emirates.svg/35px-Flag_of_the_United_Arab_Emirates.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/c/cb/Flag_of_the_United_Arab_Emirates.svg/46px-Flag_of_the_United_Arab_Emirates.svg.png 2x" width="23"/></span> <a href="/wiki/United_Arab_Emirates" title="United Arab Emirates">United Arab Emirates</a></b></td>
<td></td>
</tr>


In [78]:
# take second row as an example: 
ex = table_tr[1]
# we see that our information is inside the td tags 
td = ex.findAll('td')
for i in range(len(td)):
    print(i)
    print(td[i])

# td[0] has city information and td[1] has country information  

0
<td><a href="/wiki/Abu_Dhabi" title="Abu Dhabi">Abu Dhabi</a></td>
1
<td><b><span class="flagicon" style="display:inline-block;width:25px;"><img alt="" class="thumbborder" data-file-height="600" data-file-width="1200" height="12" src="//upload.wikimedia.org/wikipedia/commons/thumb/c/cb/Flag_of_the_United_Arab_Emirates.svg/23px-Flag_of_the_United_Arab_Emirates.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/c/cb/Flag_of_the_United_Arab_Emirates.svg/35px-Flag_of_the_United_Arab_Emirates.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/c/cb/Flag_of_the_United_Arab_Emirates.svg/46px-Flag_of_the_United_Arab_Emirates.svg.png 2x" width="23"/></span> <a href="/wiki/United_Arab_Emirates" title="United Arab Emirates">United Arab Emirates</a></b></td>
2
<td></td>


In [80]:
# final info 
print(td[0].find('a').get('title'))
print(td[1].find('a').get('title'))

Abu Dhabi
United Arab Emirates


In [82]:
# now apply to every country in the wikipedia page 

# data_wiki: (city_name, country_name, city_link, country_link)
data_wiki = []
for i in table_tr:
    # names and links inside td 
    td = i.findAll('td')
    # header doesn't have td 
    if len(td)>0:
        # final info inside 'a' tag
        city_name = td[0].find('a').get('title')
        country_name = td[1].find('a').get('title')
        city_link = 'https://en.wikipedia.org' + td[0].find('a').get('href')
        country_link = 'https://en.wikipedia.org'+td[1].find('a').get('href')
        data_wiki.append((city_name, country_name,city_link,country_link))

print(len(data_wiki))
print(data_wiki[0])

250
('Abu Dhabi', 'United Arab Emirates', 'https://en.wikipedia.org/wiki/Abu_Dhabi', 'https://en.wikipedia.org/wiki/United_Arab_Emirates')
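
The links above are built by prefixing the site root to each relative `href`. As an alternative sketch, the standard library's `urljoin` does the same job and also copes with hrefs that are already absolute or protocol-relative. `BASE_URL` and `absolute_link` are hypothetical names used only for this illustration.

```python
from urllib.parse import urljoin

# Site root assumed from the URLs used in this notebook.
BASE_URL = 'https://en.wikipedia.org'


def absolute_link(href, base=BASE_URL):
    """Turn an href taken from the table into a full URL."""
    return urljoin(base, href)


print(absolute_link('/wiki/Abu_Dhabi'))
print(absolute_link('//upload.wikimedia.org/example.png'))
```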


## Getting text from the link

### Simple method

In [91]:
def get_text(url): 
    '''
    Given URL, extract the text content by removing html tags 
    '''
    # get content 
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    # content is inside 'p' paragraph tags 
    content = soup.findAll('p')
    # join the text of every paragraph into one string 
    text = ''
    for p in content:  
        text = text + p.get_text() + ' '
    return text

get_text('https://en.wikipedia.org/wiki/Abu_Dhabi')[:1000]

"Abu Dhabi (US /ˈɑːbuː ˈdɑːbi/, UK /ˈæbuː ˈdɑːbi/; Arabic: أبو ظبي\u200e Abū Ẓabī Emirati pronunciation [ɐˈbuˈðˤɑbi])[3] is the capital and the second most populous city of the United Arab Emirates (the most populous being Dubai), and also capital of the Emirate of Abu Dhabi, the largest of the UAE's seven emirates. Abu Dhabi lies on a T-shaped island jutting into the Persian Gulf from the central western coast. The city proper had a population of 1.5 million in 2014.[4] Abu Dhabi houses federal government offices, is the seat of the United Arab Emirates Government, home to the Abu Dhabi Emiri Family and the President of the UAE, who is from this family. Abu Dhabi's rapid development and urbanisation, coupled with the relatively high average income of its population, has transformed the city into a large and advanced metropolis. Today the city is the country's centre of political and industrial activities, and a major cultural and commercial centre, due to its position as the capital. 

### Method with text cleaning

In [89]:
# stopwords from nltk 
stopwords = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', 'couldn', 'didn', 'doesn', 'hadn', 'hasn', 'haven', 'isn', 'ma', 'mightn', 'mustn', 'needn', 'shan', 'shouldn', 'wasn', 'weren', 'won', 'wouldn']
import string 
def get_text(url): 
    '''
    Given URL, extract the text content by removing html tags, 
    then strip punctuation, digits and stop words 
    '''
    text = ''
    # get content 
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    # content is inside 'p' paragraph tags 
    content = soup.findAll('p')
    # combine multiple 'p's into one string 
    for p in content:  
        text = text + p.get_text() + ' '
    # translation table that deletes punctuation and digits 
    translator = str.maketrans('','',(string.punctuation+string.digits))
    # remove punctuation and numbers then tokenize 
    tokens = text.strip().translate(translator).split()
    # remove stop words by building a new list; calling tokens.remove() 
    # inside a loop over tokens skips the element after each removal 
    tokens = [token for token in tokens if token.lower() not in stopwords]
    return ' '.join(tokens)
# example
get_text(url = 'https://en.wikipedia.org/wiki/Abu_Dhabi')[:1000]

'Abu Dhabi US ˈɑːbuː ˈdɑːbi UK ˈæbuː ˈdɑːbi Arabic أبو ظبي\u200e Abū Ẓabī Emirati pronunciation ɐˈbuˈðˤɑbi capital second populous city United Arab Emirates populous Dubai also capital Emirate Abu Dhabi largest UAEs seven emirates Abu Dhabi lies Tshaped island jutting Persian Gulf central western coast city proper population million Abu Dhabi houses federal government offices seat United Arab Emirates Government home Abu Dhabi Emiri Family President UAE family Abu Dhabis rapid development urbanisation coupled relatively high average income population transformed city large advanced metropolis Today city countrys centre political industrial activities major cultural commercial centre due position capital Abu Dhabi accounts twothirds roughly billion United Arab Emirates economy Dhabi Arabic word particular species native gazelle common Arabian region Abu Dhabi means father Dhabi gazelle thought name came abundance Gazelles area folk tale involving Shakhbut bin Dhiyab al Nahyan Abu Dhabi 
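
The cleaning steps can be checked without any network access. The sketch below runs them on one sentence taken from the output above and also shows a pitfall worth knowing when filtering tokens: removing items from a list while looping over it skips the element that follows each removal. `STOPWORDS_SUBSET`, `clean` and `remove_in_place` are hypothetical names for this illustration, and the stop word set is a small subset of the list used in this notebook.

```python
import string

# Small illustrative subset of the stop word list.
STOPWORDS_SUBSET = {'the', 'had', 'a', 'of', 'in', 'is'}


def clean(text, stopwords=STOPWORDS_SUBSET):
    """Delete punctuation and digits, then drop stop words."""
    translator = str.maketrans('', '', string.punctuation + string.digits)
    tokens = text.translate(translator).split()
    # a comprehension builds a new list, so no token is skipped
    return ' '.join(t for t in tokens if t.lower() not in stopwords)


def remove_in_place(tokens, stopwords=STOPWORDS_SUBSET):
    """Pitfall demo: mutating the list being iterated skips elements."""
    tokens = list(tokens)
    for token in tokens:
        if token.lower() in stopwords:
            tokens.remove(token)
    return tokens


sentence = 'The city proper had a population of 1.5 million in 2014.'
print(clean(sentence))
# 'the' directly follows the removed 'is', so the loop never examines it
print(remove_in_place(['is', 'the', 'capital']))
```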

### Convert all links to text

In [None]:
data_wiki_full = []
# using the links from data_wiki, get the cleaned text with the get_text function 
# note: this sends two requests per row, about 500 in total 
for city_name, country_name, city_link, country_link in data_wiki:
    city_text = get_text(city_link)
    country_text = get_text(country_link)
    data_wiki_full.append((city_name, country_name, city_text, country_text))
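
Because this loop fetches two pages per row, it can help to pause between rows and to skip a row when a download fails rather than lose everything collected so far. The following is a minimal sketch under those assumptions, not part of the original notebook: `fetch_all` is a hypothetical helper, and it takes the download function as an argument so it can be tried here with a stand-in (`fake_fetch`) instead of real requests.

```python
import time


def fetch_all(rows, fetch, delay=1.0):
    """Apply fetch to both links of each row, pausing between rows.

    rows:  iterable of (city_name, country_name, city_link, country_link)
    fetch: function that maps a URL to text, e.g. get_text
    delay: seconds to wait after each successful row
    """
    results = []
    for city, country, city_link, country_link in rows:
        try:
            city_text = fetch(city_link)
            country_text = fetch(country_link)
        except Exception as err:
            # report the failed row and keep what was collected so far
            print(f'skipping {city}: {err}')
            continue
        results.append((city, country, city_text, country_text))
        time.sleep(delay)
    return results


def fake_fetch(url):
    """Stand-in for get_text so the sketch runs offline."""
    if url == 'bad-link':
        raise ValueError('could not download')
    return 'text of ' + url


demo_rows = [
    ('City A', 'Country A', 'link-1', 'link-2'),
    ('City B', 'Country B', 'bad-link', 'link-3'),
]
demo_result = fetch_all(demo_rows, fake_fetch, delay=0)
print(demo_result)
```

With the real data the call would look like `fetch_all(data_wiki, get_text)`.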

## Putting it together

In [93]:
# import packages 
# essentials 
import pandas as pd 
import numpy as np 

# web scraping 
import requests 
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/List_of_national_capitals_in_alphabetical_order'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

# information is stored in "wikitable sortable" table 
table = soup.find('table',"wikitable sortable")
print(len(table))

# within wikitable, information is inside 'tr' 
table_tr = table.findAll('tr')
print(len(table_tr))

# data_wiki: (city_name, country_name, city_link, country_link)
data_wiki = []
for i in table_tr:
    # names and links inside td 
    td = i.findAll('td')
    # header doesn't have td 
    if len(td)>0:
        # final info inside 'a' tag
        city_name = td[0].find('a').get('title')
        country_name = td[1].find('a').get('title')
        city_link = 'https://en.wikipedia.org' + td[0].find('a').get('href')
        country_link = 'https://en.wikipedia.org'+td[1].find('a').get('href')
        data_wiki.append((city_name, country_name,city_link,country_link))
        
def get_text(url): 
    '''
    Given URL, extract the text content by removing html tags 
    '''
    # get content 
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    # content is inside 'p' paragraph tags 
    content = soup.findAll('p')
    # join the text of every paragraph into one string 
    text = ''
    for p in content:  
        text = text + p.get_text() + ' '
    return text

data_wiki_full = []
# using the links from data_wiki, get full text using get_text function 
for city_name, country_name, city_link, country_link in data_wiki:
    city_text = get_text(city_link)
    country_text = get_text(country_link)
    data_wiki_full.append((city_name, country_name, city_text, country_text))
    
print(data_wiki_full[0])

503
251
('Abu Dhabi', 'United Arab Emirates', 'Abu Dhabi (US /ˈɑːbuː ˈdɑːbi/, UK /ˈæbuː ˈdɑːbi/; Arabic: أبو ظبي\u200e Abū Ẓabī Emirati pronunciation [ɐˈbuˈðˤɑbi])[3] is the capital and the second most populous city of the United Arab Emirates (the most populous being Dubai), and also capital of the Emirate of Abu Dhabi, the largest of the UAE\'s seven emirates. Abu Dhabi lies on a T-shaped island jutting into the Persian Gulf from the central western coast. The city proper had a population of 1.5 million in 2014.[4] Abu Dhabi houses federal government offices, is the seat of the United Arab Emirates Government, home to the Abu Dhabi Emiri Family and the President of the UAE, who is from this family. Abu Dhabi\'s rapid development and urbanisation, coupled with the relatively high average income of its population, has transformed the city into a large and advanced metropolis. Today the city is the country\'s centre of political and industrial activities, and a major cultural and commer