# Web scraping


## What Is Web Scraping?

In [1]:
import pandas as pd

<img src="https://cdn-images-1.medium.com/max/1600/1*GOyqaID2x1N5lD_rhTDKVQ.png">

### Why Web Scraping for Data Science?

## Network complexity

## HTTP

## HTTP in Python: The Requests Library

[Requests: HTTP for Humans](https://2.python-requests.org/en/master/)

In [2]:
import requests

In [4]:
url = 'http://example.com/'

In [5]:
response = requests.get(url)

In [6]:
response

<Response [200]>

In [7]:
type(response)

requests.models.Response

In [8]:
response.status_code

200

In [9]:
response.text

'<!doctype html>\n<html>\n<head>\n    <title>Example Domain</title>\n\n    <meta charset="utf-8" />\n    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />\n    <meta name="viewport" content="width=device-width, initial-scale=1" />\n    <style type="text/css">\n    body {\n        background-color: #f0f0f2;\n        margin: 0;\n        padding: 0;\n        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;\n        \n    }\n    div {\n        width: 600px;\n        margin: 5em auto;\n        padding: 2em;\n        background-color: #fdfdff;\n        border-radius: 0.5em;\n        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);\n    }\n    a:link, a:visited {\n        color: #38488f;\n        text-decoration: none;\n    }\n    @media (max-width: 700px) {\n        div {\n            margin: 0 auto;\n            width: auto;\n        }\n    }\n    </style>    \n</head>\n\n<body>\n<div>\n    <

In [11]:
print(response.text)

<!doctype html>
<html>
<head>
    <title>Example Domain</title>

    <meta charset="utf-8" />
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
        
    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 2em;
        background-color: #fdfdff;
        border-radius: 0.5em;
        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        div {
            margin: 0 auto;
            width: auto;
        }
    }
    </style>    
</head>

<body>
<div>
    <h1>Example Domain</h1>
    <p>This domai

In [12]:
response.headers

{'Content-Encoding': 'gzip', 'Accept-Ranges': 'bytes', 'Age': '507728', 'Cache-Control': 'max-age=604800', 'Content-Type': 'text/html; charset=UTF-8', 'Date': 'Mon, 18 May 2020 14:48:22 GMT', 'Etag': '"3147526947"', 'Expires': 'Mon, 25 May 2020 14:48:22 GMT', 'Last-Modified': 'Thu, 17 Oct 2019 07:18:26 GMT', 'Server': 'ECS (nyb/1D11)', 'Vary': 'Accept-Encoding', 'X-Cache': 'HIT', 'Content-Length': '648'}
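The cells above always get a `200` back, but a real scraper should also handle slow or failing servers. The sketch below is one possible way to do that: `fetch_html` is a hypothetical helper name (not part of Requests) that adds a timeout and turns HTTP error codes into exceptions. The second half shows that `response.headers` is a case-insensitive mapping; it is built by hand from two of the headers printed above so that the example runs without a network connection.

```python
import requests
from requests.structures import CaseInsensitiveDict


def fetch_html(url, timeout=10):
    """Hypothetical helper: download a page and return its text."""
    response = requests.get(url, timeout=timeout)  # give up after `timeout` seconds
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx codes
    return response.text


# response.headers behaves like this case-insensitive dictionary.
# Values are copied from the headers output above.
headers = CaseInsensitiveDict({
    'Content-Type': 'text/html; charset=UTF-8',
    'Content-Length': '648',
})

content_type = headers['content-type']          # lookup ignores letter case
media_type = content_type.split(';')[0].strip()  # drop the charset part
content_length = int(headers['CONTENT-LENGTH'])

print(media_type)
print(content_length)
```

Calling `fetch_html('http://example.com/')` should return the same HTML as `response.text` above, assuming the site is reachable.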

## HTML and CSS

<img src="https://cdn-images-1.medium.com/max/1600/1*x9mxFBXnLU05iPy19dGj7g.png">

### Hypertext Markup Language: HTML

Page link: https://en.wikipedia.org/wiki/List_of_Game_of_Thrones_episodes

In [16]:
url_got = 'https://en.wikipedia.org/wiki/List_of_Game_of_Thrones_episodes'
r = requests.get(url_got)

In [18]:
html_contents = r.text

Some of the most common HTML tags:

- `<p>...</p>` to enclose a paragraph;
- `<br>` to set a line break;
- `<table>...</table>` to start a table block; inside it, `<tr>...</tr>` is used for the rows and `<td>...</td>` for the cells;
- `<img>` for images;
- `<h1>...</h1>` to `<h6>...</h6>` for headers;
- `<div>...</div>` to indicate a “division” in an HTML document, basically used to group a set of elements;
- `<a>...</a>` for hyperlinks;
- `<ul>...</ul>` and `<ol>...</ol>` for unordered and ordered lists respectively; inside of these, `<li>...</li>` is used for each list item.
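To see how these tags nest inside each other, here is a minimal sketch that uses only Python's standard-library `html.parser` module. The `SAMPLE_HTML` string and the `TagCollector` class are made up for illustration; the sketch prints each opening tag indented by its depth in the document tree.

```python
from html.parser import HTMLParser

# A tiny hand-written page (hypothetical content) using tags from the list above.
SAMPLE_HTML = """
<div>
  <h1>Episodes</h1>
  <p>Season overview<br>with a line break.</p>
  <ul>
    <li><a href="/wiki/Winter_Is_Coming">Winter Is Coming</a></li>
    <li><a href="/wiki/The_Kingsroad">The Kingsroad</a></li>
  </ul>
</div>
"""


class TagCollector(HTMLParser):
    """Record every opening tag together with its nesting depth."""

    VOID_TAGS = {'br', 'img'}  # tags without a closing counterpart

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.tags = []  # list of (depth, tag_name) tuples

    def handle_starttag(self, tag, attrs):
        self.tags.append((self.depth, tag))
        if tag not in self.VOID_TAGS:
            self.depth += 1  # children of this tag sit one level deeper

    def handle_endtag(self, tag):
        if tag not in self.VOID_TAGS:
            self.depth -= 1


collector = TagCollector()
collector.feed(SAMPLE_HTML)

for depth, tag in collector.tags:
    print('  ' * depth + f'<{tag}>')
```

The printed outline shows `<div>` at the top level, with `<h1>`, `<p>` and `<ul>` inside it, and each `<a>` nested inside its `<li>`.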

## Using Your Browser as a Development Tool

## The Beautiful Soup Library

> **[beautifulsoup4](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)**: Beautiful Soup tries to organize complexity: it helps to parse, structure and organize the oftentimes very messy web by fixing bad HTML and presenting us with an easy-to-work-with Python structure.

In [19]:
from bs4 import BeautifulSoup

In [20]:
html_soup = BeautifulSoup(html_contents, 'html.parser')

In [21]:
type(html_soup)

bs4.BeautifulSoup

In [22]:
first_h1 = html_soup.find('h1')
first_h1

<h1 class="firstHeading" id="firstHeading" lang="en">List of <i>Game of Thrones</i> episodes</h1>

In [23]:
str(first_h1)

'<h1 class="firstHeading" id="firstHeading" lang="en">List of <i>Game of Thrones</i> episodes</h1>'

In [24]:
first_h1.name

'h1'

In [25]:
first_h1.contents

['List of ', <i>Game of Thrones</i>, ' episodes']

In [26]:
first_h1.text

'List of Game of Thrones episodes'

In [27]:
first_h1.get_text()

'List of Game of Thrones episodes'

In [28]:
first_h1.attrs

{'id': 'firstHeading', 'class': ['firstHeading'], 'lang': 'en'}

In [29]:
first_h1.attrs['id']

'firstHeading'

In [30]:
first_h1['id']

'firstHeading'

In [31]:
html_soup.find('', attrs={'id' : 'p-logo'})

<div id="p-logo" role="banner">
<a class="mw-wiki-logo" href="/wiki/Main_Page" title="Visit the main page"></a>
</div>

In [32]:
html_soup.find('h2')

<h2 id="mw-toc-heading">Contents</h2>

In [33]:
html_soup.find_all('h2')

[<h2 id="mw-toc-heading">Contents</h2>,
 <h2><span class="mw-headline" id="Series_overview">Series overview</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=List_of_Game_of_Thrones_episodes&amp;action=edit&amp;section=1" title="Edit section: Series overview">edit</a><span class="mw-editsection-bracket">]</span></span></h2>,
 <h2><span class="mw-headline" id="Episodes">Episodes</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=List_of_Game_of_Thrones_episodes&amp;action=edit&amp;section=2" title="Edit section: Episodes">edit</a><span class="mw-editsection-bracket">]</span></span></h2>,
 <h2><span class="mw-headline" id="Specials">Specials</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=List_of_Game_of_Thrones_episodes&amp;action=edit&amp;section=11" title="Edit section: Specials">edit</a><span class="mw-editsecti

In [34]:
for tag in html_soup.find_all('h2'):
    print(tag.text)
    print()

Contents

Series overview[edit]

Episodes[edit]

Specials[edit]

Home media release[edit]

Ratings[edit]

References[edit]

External links[edit]

Navigation menu
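The `[edit]` suffix in the output above comes from the edit-link `<span>` that sits inside each heading. A possible way to keep only the title is to read the `mw-headline` span when it exists. The sketch below runs on a shortened copy of two headings from the `find_all('h2')` output, so it reflects the page markup as it was when this notebook was run; the live page may use different class names.

```python
from bs4 import BeautifulSoup

# Two headings copied (and shortened) from the find_all('h2') output above.
snippet = '''
<h2 id="mw-toc-heading">Contents</h2>
<h2><span class="mw-headline" id="Episodes">Episodes</span><span class="mw-editsection">[<a href="#">edit</a>]</span></h2>
'''

soup = BeautifulSoup(snippet, 'html.parser')

titles = []
for tag in soup.find_all('h2'):
    headline = tag.find('span', class_='mw-headline')
    # Fall back to the whole heading when there is no headline span.
    source = headline if headline is not None else tag
    titles.append(source.get_text(strip=True))

print(titles)
```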



In [37]:
cites = html_soup.find_all('cite', class_='citation', limit=4)

In [46]:
for citation in cites:
    print(citation.get_text(), end='\n\n')
    link = citation.find('a')
    print(link.get('href'))
    print('-----------------')

Fowler, Matt (April 8, 2011). "Game of Thrones: "Winter is Coming" Review". IGN. Archived from the original on August 17, 2012. Retrieved September 22, 2016.

http://tv.ign.com/articles/116/1160215p1.html
-----------------
Fleming, Michael (January 16, 2007). "HBO turns Fire into fantasy series". Variety. Archived from the original on May 16, 2012. Retrieved September 3, 2016.

https://www.variety.com/article/VR1117957532.html?categoryid=14&cs=1
-----------------
"Game of Thrones". Emmys.com. Retrieved September 17, 2016.

http://www.emmys.com/shows/game-thrones
-----------------
Roberts, Josh (April 1, 2012). "Where HBO's hit 'Game of Thrones' was filmed". USA Today. Archived from the original on April 1, 2012. Retrieved March 8, 2013.

https://web.archive.org/web/20120401123724/http://travel.usatoday.com/destinations/story/2012-04-01/Where-the-HBO-hit-Game-of-Thrones-was-filmed/53876876/1
-----------------
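The loop above assumes that every citation contains a link. `find('a')` returns `None` when it does not, and `link.get('href')` would then raise an `AttributeError`. Links can also be relative, like the `/wiki/Main_Page` logo link shown earlier. The sketch below handles both cases; `first_link` is a hypothetical helper and the HTML fragment is made up for illustration (only the two URLs are taken from the outputs above).

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

PAGE_URL = 'https://en.wikipedia.org/wiki/List_of_Game_of_Thrones_episodes'

# Hypothetical fragment: one citation with a link, one without,
# plus the relative logo link seen in an earlier output.
snippet = '''
<cite class="citation"><a href="http://www.emmys.com/shows/game-thrones">"Game of Thrones"</a>. Emmys.com.</cite>
<cite class="citation">A citation without any link.</cite>
<a class="mw-wiki-logo" href="/wiki/Main_Page"></a>
'''
soup = BeautifulSoup(snippet, 'html.parser')


def first_link(tag, base_url):
    """Return the absolute URL of the first <a href=...> inside tag, or None."""
    link = tag.find('a', href=True)
    if link is None:
        return None
    # urljoin leaves absolute URLs alone and resolves relative ones.
    return urljoin(base_url, link['href'])


citation_links = [first_link(c, PAGE_URL)
                  for c in soup.find_all('cite', class_='citation')]
logo_link = urljoin(PAGE_URL, soup.find('a', class_='mw-wiki-logo')['href'])

print(citation_links)
print(logo_link)
```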


In [47]:
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/List_of_Game_of_Thrones_episodes'
r = requests.get(url)
html_contents = r.text

html_soup = BeautifulSoup(html_contents, 'html.parser')

In [48]:
episodes = []

ep_tables = html_soup.find_all('table', class_='wikiepisodetable', limit=8)

In [49]:
len(ep_tables)

8

In [52]:
for table in ep_tables:
    rows = table.find_all('tr')

    # The first row of each table holds the column headers.
    headers = [header.text for header in rows[0].find_all('th')]

    # Every following row is one episode; its cells are <th> or <td>.
    for row in rows[1:]:
        values = [col.text for col in row.find_all(['th', 'td'])]

        if values:
            # zip() pairs each header with its value and stops at the shorter list.
            episodes.append(dict(zip(headers, values)))

In [53]:
for episode in episodes[:3]:
    print(episode)

{'No.overall': '1', 'No. inseason': '1', 'Title': '"Winter Is Coming"', 'Directed by': 'Tim Van Patten', 'Written by': 'David Benioff & D. B. Weiss', 'Original air date\u200a[20]': 'April\xa017,\xa02011\xa0(2011-04-17)', 'U.S. viewers(millions)': '2.22[21]'}
{'No.overall': '2', 'No. inseason': '2', 'Title': '"The Kingsroad"', 'Directed by': 'Tim Van Patten', 'Written by': 'David Benioff & D. B. Weiss', 'Original air date\u200a[20]': 'April\xa024,\xa02011\xa0(2011-04-24)', 'U.S. viewers(millions)': '2.20[22]'}
{'No.overall': '3', 'No. inseason': '3', 'Title': '"Lord Snow"', 'Directed by': 'Brian Kirk', 'Written by': 'David Benioff & D. B. Weiss', 'Original air date\u200a[20]': 'May\xa01,\xa02011\xa0(2011-05-01)', 'U.S. viewers(millions)': '2.44[23]'}


In [54]:
got_df = pd.DataFrame(episodes)

In [56]:
got_df.head()

| | No.overall | No. inseason | Title | Directed by | Written by | Original air date [20] | U.S. viewers(millions) |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | "Winter Is Coming" | Tim Van Patten | David Benioff & D. B. Weiss | April 17, 2011 (2011-04-17) | 2.22[21] |
| 1 | 2 | 2 | "The Kingsroad" | Tim Van Patten | David Benioff & D. B. Weiss | April 24, 2011 (2011-04-24) | 2.20[22] |
| 2 | 3 | 3 | "Lord Snow" | Brian Kirk | David Benioff & D. B. Weiss | May 1, 2011 (2011-05-01) | 2.44[23] |
| 3 | 4 | 4 | "Cripples, Bastards, and Broken Things" | Brian Kirk | Bryan Cogman | May 8, 2011 (2011-05-08) | 2.45[24] |
| 4 | 5 | 5 | "The Wolf and the Lion" | Brian Kirk | David Benioff & D. B. Weiss | May 15, 2011 (2011-05-15) | 2.58[25] |
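The scraped values are still raw strings: titles carry quotation marks, dates contain non-breaking spaces (`\xa0`), and viewer numbers end in a footnote marker such as `[21]`. One possible cleaning step is sketched below on two rows copied from the output above. The shortened column name `Original air date` and the new column names are assumptions chosen for readability.

```python
import pandas as pd

# Two rows copied from the scraped output above (column name shortened).
raw = pd.DataFrame([
    {'Title': '"Winter Is Coming"',
     'Original air date': 'April\xa017,\xa02011\xa0(2011-04-17)',
     'U.S. viewers(millions)': '2.22[21]'},
    {'Title': '"The Kingsroad"',
     'Original air date': 'April\xa024,\xa02011\xa0(2011-04-24)',
     'U.S. viewers(millions)': '2.20[22]'},
])

# Pull the ISO date out of the parentheses, e.g. '(2011-04-17)'.
iso_dates = raw['Original air date'].str.extract(
    r'\((\d{4}-\d{2}-\d{2})\)', expand=False)

# Remove footnote markers such as '[21]' before converting to numbers.
viewers = raw['U.S. viewers(millions)'].str.replace(
    r'\[\d+\]', '', regex=True)

clean = pd.DataFrame({
    'title': raw['Title'].str.strip('"'),
    'air_date': pd.to_datetime(iso_dates),
    'viewers_millions': viewers.astype(float),
})

print(clean)
```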


## Web APIs

### Example of using an API

https://github.com/HackerNews/API

In [74]:
url = 'https://hacker-news.firebaseio.com/v0'

In [75]:
top_stories_res = requests.get(f'{url}/topstories.json?print=pretty')

In [76]:
top_stories = top_stories_res.json()
print(top_stories[:10])

[23223219, 23223335, 23223147, 23220081, 23222019, 23222815, 23221447, 23219782, 23221255, 23223681]


In [78]:
articles = []

In [79]:
for story_id in top_stories[:10]:
    story_url = f'{url}/item/{story_id}.json?print=pretty'
    print(f'Downloading: {story_url}')
    r = requests.get(story_url)
    story_dict = r.json()
    articles.append(story_dict)

Downloading: https://hacker-news.firebaseio.com/v0/item/23223219.json?print=pretty
Downloading: https://hacker-news.firebaseio.com/v0/item/23223335.json?print=pretty
Downloading: https://hacker-news.firebaseio.com/v0/item/23223147.json?print=pretty
Downloading: https://hacker-news.firebaseio.com/v0/item/23220081.json?print=pretty
Downloading: https://hacker-news.firebaseio.com/v0/item/23222019.json?print=pretty
Downloading: https://hacker-news.firebaseio.com/v0/item/23222815.json?print=pretty
Downloading: https://hacker-news.firebaseio.com/v0/item/23221447.json?print=pretty
Downloading: https://hacker-news.firebaseio.com/v0/item/23219782.json?print=pretty
Downloading: https://hacker-news.firebaseio.com/v0/item/23221255.json?print=pretty
Downloading: https://hacker-news.firebaseio.com/v0/item/23223681.json?print=pretty


In [80]:
articles[1]['title']

'Uber Cuts 3000 More Jobs, Closes 45 Offices'
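The two steps above (fetch the list of IDs, then fetch each item) can be wrapped into one function. The following is only a sketch: `fetch_top_stories` is a hypothetical helper name, and the timeout value and the use of `requests.Session` to reuse one connection are choices made here, not requirements of the Hacker News API. The function is defined but not called, so the block itself makes no network requests.

```python
import requests

BASE_URL = 'https://hacker-news.firebaseio.com/v0'


def fetch_top_stories(limit=10, session=None, timeout=10):
    """Hypothetical helper: return the first `limit` top stories as dicts."""
    # A Session reuses the underlying connection across requests.
    session = session if session is not None else requests.Session()

    ids_res = session.get(f'{BASE_URL}/topstories.json', timeout=timeout)
    ids_res.raise_for_status()  # stop early on an HTTP error

    stories = []
    for story_id in ids_res.json()[:limit]:
        item_res = session.get(f'{BASE_URL}/item/{story_id}.json',
                               timeout=timeout)
        item_res.raise_for_status()
        stories.append(item_res.json())
    return stories
```

With a network connection, `pd.DataFrame(fetch_top_stories(10))` should give a table with one row per story, similar to the `articles` list built above.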

### Importing data from the web with pandas

##### [Odprti podatki Slovenije (OPSI, Slovenia's open data portal)](https://podatki.gov.si/)


On the OPSI portal you will find everything from data and tools to useful resources that you can use to develop web and mobile applications, design your own infographics, and more.

Example: https://support.spatialkey.com/spatialkey-sample-csv-data/

In [81]:
data = pd.read_csv('http://samplecsvs.s3.amazonaws.com/Sacramentorealestatetransactions.csv')

In [83]:
data.head()

| | street | city | zip | state | beds | baths | sq__ft | type | sale_date | price | latitude | longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3526 HIGH ST | SACRAMENTO | 95838 | CA | 2 | 1 | 836 | Residential | Wed May 21 00:00:00 EDT 2008 | 59222 | 38.631913 | -121.434879 |
| 1 | 51 OMAHA CT | SACRAMENTO | 95823 | CA | 3 | 1 | 1167 | Residential | Wed May 21 00:00:00 EDT 2008 | 68212 | 38.478902 | -121.431028 |
| 2 | 2796 BRANCH ST | SACRAMENTO | 95815 | CA | 2 | 1 | 796 | Residential | Wed May 21 00:00:00 EDT 2008 | 68880 | 38.618305 | -121.443839 |
| 3 | 2805 JANETTE WAY | SACRAMENTO | 95815 | CA | 2 | 1 | 852 | Residential | Wed May 21 00:00:00 EDT 2008 | 69307 | 38.616835 | -121.439146 |
| 4 | 6001 MCMAHON DR | SACRAMENTO | 95824 | CA | 2 | 1 | 797 | Residential | Wed May 21 00:00:00 EDT 2008 | 81900 | 38.51947 | -121.435768 |


## Web Scraping using pandas

> Web page: https://www.fdic.gov/bank/individual/failed/banklist.html

`pandas.read_html`: reads HTML tables into a list of `DataFrame` objects. See the [documentation](https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.read_html.html).



In [84]:
my_data = pd.read_html('https://www.fdic.gov/bank/individual/failed/banklist.html')

In [85]:
failed_banks = my_data[0]

In [86]:
failed_banks.head()

| | Bank Name | City | ST | CERT | Acquiring Institution | Closing Date |
|---|---|---|---|---|---|---|
| 0 | The First State Bank | Barboursville | WV | 14361 | MVB Bank, Inc. | April 3, 2020 |
| 1 | Ericson State Bank | Ericson | NE | 18265 | Farmers and Merchants Bank | February 14, 2020 |
| 2 | City National Bank of New Jersey | Newark | NJ | 21111 | Industrial Bank | November 1, 2019 |
| 3 | Resolute Bank | Maumee | OH | 58317 | Buckeye State Bank | October 25, 2019 |
| 4 | Louisa Community Bank | Louisa | KY | 58112 | Kentucky Farmers Bank Corporation | October 25, 2019 |


In [87]:
failed_banks.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 561 entries, 0 to 560
Data columns (total 6 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   Bank Name              561 non-null    object
 1   City                   561 non-null    object
 2   ST                     561 non-null    object
 3   CERT                   561 non-null    int64 
 4   Acquiring Institution  561 non-null    object
 5   Closing Date           561 non-null    object
dtypes: int64(1), object(5)
memory usage: 26.4+ KB
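The `info()` output shows that `Closing Date` is stored as `object`, meaning plain strings. Converting it to a real datetime column makes it possible to sort and group by date. The sketch below uses a small hand-made frame with three rows copied from the `head()` output above, so it runs offline; the variable names are assumptions.

```python
import pandas as pd

# Three rows copied from the failed_banks.head() output above.
failed_banks_sample = pd.DataFrame({
    'Bank Name': ['The First State Bank', 'Ericson State Bank', 'Resolute Bank'],
    'ST': ['WV', 'NE', 'OH'],
    'Closing Date': ['April 3, 2020', 'February 14, 2020', 'October 25, 2019'],
})

# Parse strings such as 'April 3, 2020' into datetime64 values.
failed_banks_sample['Closing Date'] = pd.to_datetime(
    failed_banks_sample['Closing Date'], format='%B %d, %Y')

# Count closures per calendar year.
closures_per_year = (failed_banks_sample['Closing Date']
                     .dt.year
                     .value_counts()
                     .sort_index())

print(failed_banks_sample.dtypes)
print(closures_per_year)
```

The same `pd.to_datetime` call should work on the full `failed_banks` frame, assuming every row uses this date format.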


## Examples

### Scraping and Visualizing IMDB Ratings

Page: http://www.imdb.com/title/tt0944947/episodes

In [88]:
import requests
from bs4 import BeautifulSoup
url = 'http://www.imdb.com/title/tt0944947/episodes'

In [89]:
episodes = []
ratings = []

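for season in range(1, 9):
    r = requests.get(url, params={'season': season})
    print(r.status_code)  # 200 means the season page was fetched successfully
    soup = BeautifulSoup(r.text, 'html.parser')
    # Container that holds the episode list for this season
    listing = soup.find('div', class_='eplist')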
    
    


200
200
200
200
200
200
200
200
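The `params` argument used in the loop above is turned into a query string that Requests appends to the URL. To inspect the final URLs without sending anything, a request can be prepared but not sent, as in this small sketch (`season_urls` is just an illustrative variable name):

```python
import requests

url = 'http://www.imdb.com/title/tt0944947/episodes'

# Build (but do not send) one GET request per season and read the final URL.
season_urls = [
    requests.Request('GET', url, params={'season': season}).prepare().url
    for season in range(1, 9)
]

print(season_urls[0])
print(len(season_urls))
```

After a real request, the same information is available as `r.url`.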


In [None]:
print(episodes[:20])

In [None]:
print(ratings[:20])

In [None]:
import matplotlib.pyplot as plt

plt.figure()

### Scraping Fast Track data

Page: https://www.fasttrack.co.uk/league-tables/tech-track-100/league-table/

In [165]:
# import libraries
from bs4 import BeautifulSoup
import requests
import csv

In [166]:
# specify the url
urlpage =  'http://www.fasttrack.co.uk/league-tables/tech-track-100/league-table/'

In [188]:
# write columns to variables
# NOTE: `data` is assumed to be the list of <td> cells of one table row;
# the cell that defines it is not shown here.
rank = data[0].getText()
company = data[1].getText()
location = data[2].getText()
yearend = data[3].getText()
salesrise = data[4].getText()
sales = data[5].getText()
staff = data[6].getText()
comments = data[7].getText()
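The cell above reads eight columns out of one table row. A possible way to do this for every row and write the result with the `csv` module is sketched below. The HTML table and its values are made up for illustration (the real page is not fetched here), and the output goes to an in-memory buffer instead of a file so the example has no side effects.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical table with the eight columns named above; the values are made up.
html = '''
<table>
  <tr><th>Rank</th><th>Company</th><th>Location</th><th>Year end</th>
      <th>Sales rise</th><th>Sales</th><th>Staff</th><th>Comments</th></tr>
  <tr><td>1</td><td>Example Ltd</td><td>London</td><td>Mar 18</td>
      <td>100%</td><td>10</td><td>50</td><td>Made-up row</td></tr>
</table>
'''
COLUMNS = ['rank', 'company', 'location', 'yearend',
           'salesrise', 'sales', 'staff', 'comments']

soup = BeautifulSoup(html, 'html.parser')

rows = []
for tr in soup.find_all('tr'):
    data = tr.find_all('td')
    if len(data) != len(COLUMNS):
        continue  # skip the header row and any malformed rows
    rows.append([cell.get_text(strip=True) for cell in data])

# Swap the buffer for open('techtrack100.csv', 'w', newline='') to write a file.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(COLUMNS)
writer.writerows(rows)

print(buffer.getvalue())
```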

#### The complete program