In [1]:
import pandas as pd

# Web scraping - pridobivanje podatkov s spleta


## What Is Web Scraping?

<img class="progressiveMedia-image js-progressiveMedia-image" data-src="https://cdn-images-1.medium.com/max/1600/1*GOyqaID2x1N5lD_rhTDKVQ.png" src="https://cdn-images-1.medium.com/max/1600/1*GOyqaID2x1N5lD_rhTDKVQ.png">

### Why Web Scraping for Data Science?

## Network complexity

## HTTP

## HTTP in Python: The Requests Library

[Requests: HTTP for Humans](https://2.python-requests.org/en/master/)

In [2]:
import requests

In [3]:
url = 'http://example.com/'

In [4]:
r = requests.get(url)

In [5]:
r

<Response [200]>

In [6]:
type(r)

requests.models.Response

In [7]:
r.text

'<!doctype html>\n<html>\n<head>\n    <title>Example Domain</title>\n\n    <meta charset="utf-8" />\n    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />\n    <meta name="viewport" content="width=device-width, initial-scale=1" />\n    <style type="text/css">\n    body {\n        background-color: #f0f0f2;\n        margin: 0;\n        padding: 0;\n        font-family: "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;\n        \n    }\n    div {\n        width: 600px;\n        margin: 5em auto;\n        padding: 50px;\n        background-color: #fff;\n        border-radius: 1em;\n    }\n    a:link, a:visited {\n        color: #38488f;\n        text-decoration: none;\n    }\n    @media (max-width: 700px) {\n        body {\n            background-color: #fff;\n        }\n        div {\n            width: auto;\n            margin: 0 auto;\n            border-radius: 0;\n            padding: 1em;\n        }\n    }\n    </style>    \n</head>\n\n<body>\n<div>\n

In [8]:
r.status_code

200

In [9]:
r.reason

'OK'

In [10]:
r.headers

{'Content-Encoding': 'gzip', 'Accept-Ranges': 'bytes', 'Cache-Control': 'max-age=604800', 'Content-Type': 'text/html; charset=UTF-8', 'Date': 'Wed, 22 May 2019 17:40:58 GMT', 'Etag': '"1541025663"', 'Expires': 'Wed, 29 May 2019 17:40:58 GMT', 'Last-Modified': 'Fri, 09 Aug 2013 23:54:35 GMT', 'Server': 'ECS (dcb/7F16)', 'Vary': 'Accept-Encoding', 'X-Cache': 'HIT', 'Content-Length': '606'}

In [12]:
r.request

<PreparedRequest [GET]>

In [14]:
r.request.headers

{'User-Agent': 'python-requests/2.21.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}

## HTML and CSS

<img class="progressiveMedia-image js-progressiveMedia-image" data-src="https://cdn-images-1.medium.com/max/1600/1*x9mxFBXnLU05iPy19dGj7g.png" src="https://cdn-images-1.medium.com/max/1600/1*x9mxFBXnLU05iPy19dGj7g.png">

### Hypertext Markup Language: HTML

Link strani: https://en.wikipedia.org/w/index.php?title=List_of_Game_of_Thrones_episodes&oldid=802553687

In [8]:
import requests

In [9]:
url_got = 'https://en.wikipedia.org/w/index.php?title=List_of_Game_of_Thrones_episodes&oldid=802553687'

In [11]:

r = requests.get(url_got)

In [12]:
r.text[:1000]

'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>List of Game of Thrones episodes - Wikipedia</title>\n<script>document.documentElement.className=document.documentElement.className.replace(/(^|\\s)client-nojs(\\s|$)/,"$1client-js$2");RLCONF={"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_Game_of_Thrones_episodes","wgTitle":"List of Game of Thrones episodes","wgCurRevisionId":898999050,"wgRevisionId":802553687,"wgArticleId":31120069,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles containing potentially dated statements from August 2017","All articles containing potentially dated statements","Official website not in Wikidata","Featured lists","Game of Thrones episodes","Lists of American drama television series episodes","Lists of fantasy television series episodes"],"wgBreakFrames":!1,"wgPageContentLanguage"

- `<p>...</p>` to enclose a paragraph;
- `<br>` to set a line break;
- `<table>...</table>` to start a table block, inside; `<tr>...<tr/>` is used for the rows; and `<td>...</td>` cells;
- `<img>` for images;
- `<h1>...</h1> to <h6>...</h6>` for headers;
- `<div>...</div>` to indicate a “division” in an HTML document, basically used to group a set of elements;
- `<a>...</a>` for hyperlinks;
- `<ul>...</ul>, <ol>...</ol>` for unordered and ordered lists respectively; inside of these, `<li>...</li>` is used for each list item.

## Using Your Browser as a Development Tool

## The Beautiful Soup Library

In [13]:
html_content = r.text #html kodo imamo pod to spremenljivko

> **[beautifulsoup4](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)**: Beautiful Soup tries to organize complexity: it helps to parse, structure and organize the oftentimes very messy web by fixing bad HTML and presenting us with an easy-to-work-with Python structure.

In [14]:
from bs4 import BeautifulSoup

In [15]:
#naredimo objekt
html_soup = BeautifulSoup(html_content, 'html.parser') # povemo s katerim parserjem rdeče

In Python, multiple parsers exist to do so:
- `html.parser`: a built-in Python parser that is decent (especially when using recent versions of Python 3) and requires no extra installation.
- `lxml`: which is very fast but requires an extra installation.
- `html5lib`: which aims to parse web page in exactly the same way as a web browser does, but is a bit slower.

- `find(name, attrs, recursive, string, **keywords)`
- `find_all(name, attrs, recursive, string, limit, **keywords)`

In [16]:
html_soup.find('h1') # pogledaš na source code in isto vrže ven

<h1 class="firstHeading" id="firstHeading" lang="en">List of <i>Game of Thrones</i> episodes</h1>

In [21]:
html_soup.find('',{'id':'firstHeading'}) #atributi zaviti oklepaji - pari, classi

<h1 class="firstHeading" id="firstHeading" lang="en">List of <i>Game of Thrones</i> episodes</h1>

In [25]:
all_h2 = html_soup.find_all('h2', limit=3)

In [26]:
len(all_h2)

3

In [27]:
all_h2

[<h2>Contents</h2>,
 <h2><span class="mw-headline" id="Series_overview">Series overview</span></h2>,
 <h2><span class="mw-headline" id="Episodes">Episodes</span></h2>]

In [24]:
for found in html_soup.find_all('h2'):
    print(found)

<h2>Contents</h2>
<h2><span class="mw-headline" id="Series_overview">Series overview</span></h2>
<h2><span class="mw-headline" id="Episodes">Episodes</span></h2>
<h2><span class="mw-headline" id="Home_media_releases">Home media releases</span></h2>
<h2><span class="mw-headline" id="Ratings">Ratings</span></h2>
<h2><span class="mw-headline" id="References">References</span></h2>
<h2><span class="mw-headline" id="External_links">External links</span></h2>
<h2>Navigation menu</h2>


In [28]:
first_h1 = html_soup.find('h1')

In [29]:
first_h1

<h1 class="firstHeading" id="firstHeading" lang="en">List of <i>Game of Thrones</i> episodes</h1>

In [30]:
first_h1.name

'h1'

In [31]:
first_h1.contents

['List of ', <i>Game of Thrones</i>, ' episodes']

In [32]:
str(first_h1)

'<h1 class="firstHeading" id="firstHeading" lang="en">List of <i>Game of Thrones</i> episodes</h1>'

In [33]:
first_h1.text

'List of Game of Thrones episodes'

In [37]:
first_h1.get_text()#('--', strip = ) #enako kot zgoraj .text, doda 

'List of Game of Thrones episodes'

In [38]:
first_h1.attrs['id']

'firstHeading'

In [39]:
#isto
first_h1['id']

'firstHeading'

In [40]:
first_h1.get('id') #enako

'firstHeading'

In [41]:
cites = html_soup.find_all('cite', class_='citation',limit=4)

In [42]:
len(cites)

4

In [43]:
cites

[<cite class="citation web">Fowler, Matt (April 8, 2011). <a class="external text" href="http://tv.ign.com/articles/116/1160215p1.html" rel="nofollow">"Game of Thrones: "Winter is Coming" Review"</a>. <a href="/wiki/IGN" title="IGN">IGN</a>. <a class="external text" href="https://web.archive.org/web/20120817073932/http://tv.ign.com/articles/116/1160215p1.html" rel="nofollow">Archived</a> from the original on August 17, 2012<span class="reference-accessdate">. Retrieved <span class="nowrap">September 22,</span> 2016</span>.</cite>,
 <cite class="citation news">Fleming, Michael (January 16, 2007). <a class="external text" href="http://www.variety.com/article/VR1117957532.html?categoryid=14&amp;cs=1" rel="nofollow">"HBO turns <i>Fire</i> into fantasy series"</a>. <i><a href="/wiki/Variety_(magazine)" title="Variety (magazine)">Variety</a></i>. <a class="external text" href="https://web.archive.org/web/20120516224747/http://www.variety.com/article/VR1117957532?refCatId=14" rel="nofollow">A

In [44]:
cites[0].get_text()

'Fowler, Matt (April 8, 2011). "Game of Thrones: "Winter is Coming" Review". IGN. Archived from the original on August 17, 2012. Retrieved September 22, 2016.'

In [46]:
cites[0].find('a').get('href')

'http://tv.ign.com/articles/116/1160215p1.html'

In [48]:
for citation in cites:   #besedilo in link spodaj
    print('--->', citation.get_text())
    print(citation.find('a').get('href'))
    print()

---> Fowler, Matt (April 8, 2011). "Game of Thrones: "Winter is Coming" Review". IGN. Archived from the original on August 17, 2012. Retrieved September 22, 2016.
http://tv.ign.com/articles/116/1160215p1.html

---> Fleming, Michael (January 16, 2007). "HBO turns Fire into fantasy series". Variety. Archived from the original on May 16, 2012. Retrieved September 3, 2016.
http://www.variety.com/article/VR1117957532.html?categoryid=14&cs=1

---> "Game of Thrones". Emmys.com. Retrieved September 17, 2016.
http://www.emmys.com/shows/game-thrones

---> Roberts, Josh (April 1, 2012). "Where HBO's hit 'Game of Thrones' was filmed". USA Today. Archived from the original on April 1, 2012. Retrieved March 8, 2013.
https://web.archive.org/web/20120401123724/http://travel.usatoday.com/destinations/story/2012-04-01/Where-the-HBO-hit-Game-of-Thrones-was-filmed/53876876/1



#želimo dataframe - tabelo
#Parsanje tabel

In [49]:
html_soup.text[:1000]

'\n\n\n\nList of Game of Thrones episodes - Wikipedia\ndocument.documentElement.className=document.documentElement.className.replace(/(^|\\s)client-nojs(\\s|$)/,"$1client-js$2");RLCONF={"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_Game_of_Thrones_episodes","wgTitle":"List of Game of Thrones episodes","wgCurRevisionId":898999050,"wgRevisionId":802553687,"wgArticleId":31120069,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles containing potentially dated statements from August 2017","All articles containing potentially dated statements","Official website not in Wikidata","Featured lists","Game of Thrones episodes","Lists of American drama television series episodes","Lists of fantasy television series episodes"],"wgBreakFrames":!1,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefa

In [50]:
episodes = []

In [51]:
ep_tables = html_soup.find_all('table', class_ = 'wikiepisodetable')

In [52]:
len(ep_tables)

7

In [62]:
for table in ep_tables[0]:
    headers = []
    rows = table.find_all('tr')
    for header in table.find('tr').find_all('th'):
        headers.append(header.text)#skranimo
    for row in rows[1:]:
        values = []
        for col in row.find_all(['th', 'td']):
            values.append(col.text)
        if values:
            episode_dict = {headers[i]:values[i] in range(len(values))}
            episodes.append(episode_dict)          

NameError: name 'i' is not defined

In [59]:
episode[0]

NameError: name 'episode' is not defined

In [None]:
#for ??? poglej

In [64]:
pd.DataFrame(episodes).head(5)

## Web APIs

### Primer uporabe APIja

https://github.com/HackerNews/API

In [65]:
articles = []

In [66]:
url = 'https://hacker-news.firebaseio.com/v0'

In [67]:
top_stories = requests.get(url + '/topstories.json')

In [68]:
top_stories.text[:40]

'[20019647,20019874,20019206,20019877,200'

In [69]:
top_stories = top_stories.json()

In [72]:
for story_id in top_stories[:5]:
    story_url = url + f'/item/{story_id}/.json'
    print('Prenos: ', story_url)
    r = requests.get(story_url)
    story_dict = r.json()
    articles.append(story_dict)

Prenos:  https://hacker-news.firebaseio.com/v0/item/20019647/.json
Prenos:  https://hacker-news.firebaseio.com/v0/item/20019874/.json
Prenos:  https://hacker-news.firebaseio.com/v0/item/20019206/.json
Prenos:  https://hacker-news.firebaseio.com/v0/item/20019877/.json
Prenos:  https://hacker-news.firebaseio.com/v0/item/20021568/.json


In [73]:
articles[0]

{'by': 'ChuckMcM',
 'descendants': 107,
 'id': 20019647,
 'kids': [20019656,
  20021798,
  20019813,
  20021427,
  20020785,
  20020158,
  20020261,
  20020167,
  20019727,
  20021141,
  20019699],
 'score': 226,
 'time': 1558934592,
 'title': 'Arm announces its new premium CPU and GPU designs',
 'type': 'story',
 'url': 'https://techcrunch.com/2019/05/26/arm-announces-its-new-premium-cpu-and-gpu-designs/'}

In [80]:
for articles in articles:
    print(articles['title'])

Arm announces its new premium CPU and GPU designs
On SQS
AMD Ryzen 3000 announced
Show HN: Interactively select the quality and format for youtube-dl
Chinese developers fear the tech war will cost them access to GitHub


### Import data from web - pandas

##### [Odprti podatki Slovenije](https://podatki.gov.si/)


Na portalu OPSI boste našli vse od podatkov, orodij, do koristnih virov, s katerimi boste lahko razvijali spletne in mobilne aplikacije, oblikovali lastne infografike in drugo

Primer: https://support.spatialkey.com/spatialkey-sample-csv-data/

In [81]:
data = pd.read_csv('http://samplecsvs.s3.amazonaws.com/Sacramentorealestatetransactions.csv')

In [82]:
data.head()

Unnamed: 0,street,city,zip,state,beds,baths,sq__ft,type,sale_date,price,latitude,longitude
0,3526 HIGH ST,SACRAMENTO,95838,CA,2,1,836,Residential,Wed May 21 00:00:00 EDT 2008,59222,38.631913,-121.434879
1,51 OMAHA CT,SACRAMENTO,95823,CA,3,1,1167,Residential,Wed May 21 00:00:00 EDT 2008,68212,38.478902,-121.431028
2,2796 BRANCH ST,SACRAMENTO,95815,CA,2,1,796,Residential,Wed May 21 00:00:00 EDT 2008,68880,38.618305,-121.443839
3,2805 JANETTE WAY,SACRAMENTO,95815,CA,2,1,852,Residential,Wed May 21 00:00:00 EDT 2008,69307,38.616835,-121.439146
4,6001 MCMAHON DR,SACRAMENTO,95824,CA,2,1,797,Residential,Wed May 21 00:00:00 EDT 2008,81900,38.51947,-121.435768


## Web Scraping using pandas

> Spletna stran: https://www.fdic.gov/bank/individual/failed/banklist.html

`pandas.read_html: ` Read HTML tables into a list of DataFrame objects. -> [Dokumentacija](https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.read_html.html)



In [83]:
tables = pd.read_html('https://www.fdic.gov/bank/individual/failed/banklist.html')

In [84]:
len(tables)

1

In [86]:
banks = tables[0]

In [87]:
banks.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 555 entries, 0 to 554
Data columns (total 7 columns):
Bank Name                555 non-null object
City                     555 non-null object
ST                       555 non-null object
CERT                     555 non-null int64
Acquiring Institution    555 non-null object
Closing Date             555 non-null object
Updated Date             555 non-null object
dtypes: int64(1), object(6)
memory usage: 30.4+ KB


In [88]:
banks.head()

Unnamed: 0,Bank Name,City,ST,CERT,Acquiring Institution,Closing Date,Updated Date
0,Washington Federal Bank for Savings,Chicago,IL,30570,Royal Savings Bank,"December 15, 2017","February 1, 2019"
1,The Farmers and Merchants State Bank of Argonia,Argonia,KS,17719,Conway Bank,"October 13, 2017","February 21, 2018"
2,Fayette County Bank,Saint Elmo,IL,1802,"United Fidelity Bank, fsb","May 26, 2017","January 29, 2019"
3,"Guaranty Bank, (d/b/a BestBank in Georgia & Mi...",Milwaukee,WI,30003,First-Citizens Bank & Trust Company,"May 5, 2017","March 22, 2018"
4,First NBC Bank,New Orleans,LA,58302,Whitney Bank,"April 28, 2017","January 29, 2019"


In [89]:
close_timestamps = pd.to_datetime(banks['Closing Date'])

In [90]:
close_timestamps.dt.year.value_counts()

2010    157
2009    140
2011     92
2012     51
2008     25
2013     24
2014     18
2002     11
2017      8
2015      8
2016      5
2004      4
2001      4
2007      3
2003      3
2000      2
Name: Closing Date, dtype: int64

## Primeri

### Scraping and Visualizing IMDB Ratings

Stran: http://www.imdb.com/title/tt0944947/episodes

In [91]:
import requests
from bs4 import BeautifulSoup
url = 'http://www.imdb.com/title/tt0944947/episodes'

In [92]:
episodes = []
rankings = []

In [93]:
for season in range(1,9):
    r = requests.get(url, params={'season': season})
    #if r.status_code ==200:
    soup = BeautifulSoup(r.text, 'html.parser')
    listing = soup.find('div', class_='eplist')
    for epnr, div in enumerate(listing.find_all('div', recursive=False)):
        episode = f'{season}.{epnr + 1}'
        rating_el = div.find(class_='ipl-rating-star__rating')
        print(episode, rating_el)
        print('-------------------')
        rating = float(rating_el.get_text(strip=True))
        episodes.append(episode)
        rankings.append(rating)

1.1 <span class="ipl-rating-star__rating">9.1</span>
-------------------
1.2 <span class="ipl-rating-star__rating">8.8</span>
-------------------
1.3 <span class="ipl-rating-star__rating">8.7</span>
-------------------
1.4 <span class="ipl-rating-star__rating">8.8</span>
-------------------
1.5 <span class="ipl-rating-star__rating">9.1</span>
-------------------
1.6 <span class="ipl-rating-star__rating">9.2</span>
-------------------
1.7 <span class="ipl-rating-star__rating">9.3</span>
-------------------
1.8 <span class="ipl-rating-star__rating">9.1</span>
-------------------
1.9 <span class="ipl-rating-star__rating">9.6</span>
-------------------
1.10 <span class="ipl-rating-star__rating">9.5</span>
-------------------
2.1 <span class="ipl-rating-star__rating">8.9</span>
-------------------
2.2 <span class="ipl-rating-star__rating">8.6</span>
-------------------
2.3 <span class="ipl-rating-star__rating">8.9</span>
-------------------
2.4 <span class="ipl-rating-star__rating">8.9</spa

In [94]:
rankings[:20]

[9.1,
 8.8,
 8.7,
 8.8,
 9.1,
 9.2,
 9.3,
 9.1,
 9.6,
 9.5,
 8.9,
 8.6,
 8.9,
 8.9,
 8.9,
 9.1,
 9.0,
 8.9,
 9.7,
 9.5]

In [95]:
episodes[:20]

['1.1',
 '1.2',
 '1.3',
 '1.4',
 '1.5',
 '1.6',
 '1.7',
 '1.8',
 '1.9',
 '1.10',
 '2.1',
 '2.2',
 '2.3',
 '2.4',
 '2.5',
 '2.6',
 '2.7',
 '2.8',
 '2.9',
 '2.10']

In [100]:
import matplotlib.pylab as plt

plt.figure()

positions = [a for a in range(len(rankings))]
plt.bar(position, rankings, align='center')

NameError: name 'position' is not defined

<Figure size 432x288 with 0 Axes>

### Scraping Fast Track data

Stran: https://www.fasttrack.co.uk/league-tables/tech-track-100/league-table/

In [101]:
# import libraries
from bs4 import BeautifulSoup
import requests
import csv

In [102]:
# specify the url
urlpage =  'http://www.fasttrack.co.uk/league-tables/tech-track-100/league-table/'

In [103]:
page = requests.get(urlpage)

In [104]:
soup = BeautifulSoup(page.text, 'html.parser')

In [105]:
table = soup.find('table', class_='tableSorter')
results = table.find_all('tr')
print('Number of rows:', len(results))

In [108]:
table

<table class="tableSorter">
<tbody>
<tr>
<th>Rank</th>
<th>Company</th>
<th class="">Location</th>
<th class="no-word-wrap">Year end</th>
<th class="" style="text-align:right;">Annual sales rise over 3 years</th>
<th class="" style="text-align:right;">Latest sales £000s</th>
<th class="" style="text-align:right;">Staff</th>
<th class="">Comment</th>
<!--				<th>FYE</th>-->
</tr>
<tr>
<td>1</td>
<td><a href="https://www.fasttrack.co.uk/company_profile/plan-com/"><span class="company-name">Plan.com</span></a>Communications provider</td>
<td>Isle of Man</td>
<td>Sep-17</td>
<td style="text-align:right;">364.38%</td>
<td style="text-align:right;">*35,418</td>
<td style="text-align:right;">90</td>
<td>About 650 partners use its telecoms platform to support more than 100,000 UK business customers</td>
<!--						<td>Sep-17</td>-->
</tr>
<tr>
<td>2</td>
<td><a href="https://www.fasttrack.co.uk/company_profile/psioxus-2/"><span class="company-name">PsiOxus</span></a>Biotechnology developer</td>

In [109]:
results = table.find_all('tr')

In [110]:
print('Number of rows:', len(results))

Number of rows: 101


In [111]:
results[0]

<tr>
<th>Rank</th>
<th>Company</th>
<th class="">Location</th>
<th class="no-word-wrap">Year end</th>
<th class="" style="text-align:right;">Annual sales rise over 3 years</th>
<th class="" style="text-align:right;">Latest sales £000s</th>
<th class="" style="text-align:right;">Staff</th>
<th class="">Comment</th>
<!--				<th>FYE</th>-->
</tr>

In [112]:
results[1]

<tr>
<td>1</td>
<td><a href="https://www.fasttrack.co.uk/company_profile/plan-com/"><span class="company-name">Plan.com</span></a>Communications provider</td>
<td>Isle of Man</td>
<td>Sep-17</td>
<td style="text-align:right;">364.38%</td>
<td style="text-align:right;">*35,418</td>
<td style="text-align:right;">90</td>
<td>About 650 partners use its telecoms platform to support more than 100,000 UK business customers</td>
<!--						<td>Sep-17</td>-->
</tr>

In [113]:
rows = []

for row in results[0].find_all('th'):
    rows.append(row.contents[0])

In [114]:
rows

['Rank',
 'Company',
 'Location',
 'Year end',
 'Annual sales rise over 3 years',
 'Latest sales £000s',
 'Staff',
 'Comment']

In [115]:
rows = []
rows.append(['Rank', 'Company Name', 'Webpage', 'Description', 'Location', 'Year end', 'Annual sales rise over 3 years',
            'Sales £000s', 'Staff', 'Comments'])

In [116]:
rows

[['Rank',
  'Company Name',
  'Webpage',
  'Description',
  'Location',
  'Year end',
  'Annual sales rise over 3 years',
  'Sales £000s',
  'Staff',
  'Comments']]

In [117]:
for result in results:
    data = result.find_all('td')
    if len(data) == 0:
        continue

In [118]:
data

[<td>100</td>,
 <td><a href="https://www.fasttrack.co.uk/company_profile/brompton-technology/"><span class="company-name">Brompton Technology</span></a>Video technology provider</td>,
 <td>West London</td>,
 <td>Aug-17</td>,
 <td style="text-align:right;">50.17%</td>,
 <td style="text-align:right;">*5,250</td>,
 <td style="text-align:right;">27</td>,
 <td>Its technology is used in high-profile events such as the Oscars</td>]

In [119]:
# write columns to variables
rank = data[0].getText()
company = data[1].getText()
location = data[2].getText()
yearend = data[3].getText()
salesrise = data[4].getText()
sales = data[5].getText()
staff = data[6].getText()
comments = data[7].getText()

In [120]:
rank

'100'

In [121]:
company

'Brompton TechnologyVideo technology provider'

In [122]:
sales

'*5,250'

In [123]:
data

[<td>100</td>,
 <td><a href="https://www.fasttrack.co.uk/company_profile/brompton-technology/"><span class="company-name">Brompton Technology</span></a>Video technology provider</td>,
 <td>West London</td>,
 <td>Aug-17</td>,
 <td style="text-align:right;">50.17%</td>,
 <td style="text-align:right;">*5,250</td>,
 <td style="text-align:right;">27</td>,
 <td>Its technology is used in high-profile events such as the Oscars</td>]

In [124]:
companyname = data[1]

In [129]:
companyname = data[1].find('span', class_='company-name').getText()

In [130]:
companyname

'Brompton Technology'

In [131]:
company

'Brompton TechnologyVideo technology provider'

In [132]:
description = company.replace(companyname, '')

In [133]:
description

'Video technology provider'

In [134]:
sales

'*5,250'

In [139]:
sales.strip('*').strip('+').replace(',','') #odstranimo *, en znak odstrani - za to smo zihr

'5250'

In [141]:
data[1]

<td><a href="https://www.fasttrack.co.uk/company_profile/brompton-technology/"><span class="company-name">Brompton Technology</span></a>Video technology provider</td>

In [142]:
url = data[1].find('a').get('href')

In [143]:
url

'https://www.fasttrack.co.uk/company_profile/brompton-technology/'

In [145]:
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')

In [146]:
try:
    tableRow = soup.find('table').find_all('tr')[-1]
    webpage = tableRow.find('a').get('href')
except:
    webpage = None

In [147]:
webpage

'http://www.bromptontech.com'

#### Celotni program skupaj

In [151]:
from bs4 import BeautifulSoup
import requests
import csv

urlpage =  'http://www.fasttrack.co.uk/league-tables/tech-track-100/league-table/'

page = requests.get(urlpage)
soup = BeautifulSoup(page.text, 'html.parser')

table = soup.find('table', class_='tableSorter')
results = table.find_all('tr')
print('Number of rows:', len(results))


Number of rows: 101


In [152]:
rows = []
rows.append(['Rank', 'Company Name', 'Webpage', 'Description', 'Location', 'Year end', 'Annual sales rise over 3 years',
            'Sales £000s', 'Staff', 'Comments'])

In [160]:
for num, result in enumerate(results):
    data = result.find_all('td')
    if len(data) == 0:
        continue
    
    
    # write columns to variables
    rank = data[0].getText()
    company = data[1].getText()
    location = data[2].getText()
    yearend = data[3].getText()
    salesrise = data[4].getText()
    sales = data[5].getText()
    staff = data[6].getText()
    comments = data[7].getText()

    companyname = data[1].find('span', class_='company-name').getText()
    description = company.replace(companyname, '')
    print(num, '- Company is', companyname)
    
    sales.strip('*').strip('+').replace(',','')
    
    url = data[1].find('a').get('href')
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    
    try:
        tableRow = soup.find('table').find_all('tr')[-1]
        webpage = tableRow.find('a').get('href')
    
    
    except:
        webpage = None
    webpage
    
    rows.append([rank, companyname, webpage, description, location, yearend, salesrise, sales, staff, comments])

1 - Company is Plan.com
2 - Company is PsiOxus
3 - Company is CensorNet
4 - Company is thoughtonomy
5 - Company is Perkbox
6 - Company is Ogury
7 - Company is Verve
8 - Company is goHenry
9 - Company is Darktrace
10 - Company is Bizuma
11 - Company is Depop
12 - Company is Laser Wire Solutions
13 - Company is Bought By Many
14 - Company is Optal
15 - Company is Infinox
16 - Company is Oakbrook
17 - Company is Carwow
18 - Company is Receipt Bank
19 - Company is dB Broadcast
20 - Company is The Car Buying Group
21 - Company is Festicket
22 - Company is Planixs
23 - Company is Gigaclear
24 - Company is TransferWise
25 - Company is PatSnap
26 - Company is Hyperoptic
27 - Company is GoCardless
28 - Company is Purple
29 - Company is Trustpay Global
30 - Company is iwoca
31 - Company is LADBible Group
32 - Company is Threads Styling
33 - Company is Prodigy Finance
34 - Company is Azimo
35 - Company is Chameleon
36 - Company is SuperAwesome
37 - Company is Gousto
38 - Company is Vizolution
39 

In [161]:
with open('OUT_companies.csv', 'w', newline='') as f_output:
    csv_output = csv.writer(f_output)
    csv_output.writerows(rows)

In [162]:
df = pd.read_csv('OUT_companies.csv')

In [163]:
df.head()

Unnamed: 0,Rank,Company Name,Webpage,Description,Location,Year end,Annual sales rise over 3 years,Sales £000s,Staff,Comments
0,1,Plan.com,http://www.plan.com,Communications provider,Isle of Man,Sep-17,364.38%,"*35,418",90,About 650 partners use its telecoms platform t...
1,2,PsiOxus,http://www.psioxus.com,Biotechnology developer,Oxfordshire,Dec-17,311.67%,53136,54,Received a $15m milestone payment from its dev...
2,3,CensorNet,http://www.censornet.com,Cloud security software developer,Basingstoke,Dec-17,210.17%,"*7,535",77,"Has more than 4,000 customers, including McDon..."
3,4,thoughtonomy,http://www.thoughtonomy.com,Automation software developer,East London,May-18,205.20%,"*16,916",100,It sells to 28 countries and 50% of revenue is...
4,5,Perkbox,http://www.perkbox.com,Employee engagement services,Central London,Dec-17,204.12%,"*34,700",200,Acquired software platform Loyalty Bay for an ...
