We will start scraping the webpage by reading the content of the file and then parsing it with Beautiful Soup.

In [1]:
from bs4 import BeautifulSoup
with open('avanza.html', 'r', encoding='utf-8') as f:  # the page declares charset utf-8
    html = f.read()
soup = BeautifulSoup(html, 'html.parser')

In [2]:
soup

<html lang="sv"><head><style type="text/css">@charset "UTF-8";[ng\:cloak],[ng-cloak],[data-ng-cloak],[x-ng-cloak],.ng-cloak,.x-ng-cloak,.ng-hide:not(.ng-hide-animate){display:none !important;}ng\:form{display:block;}.ng-animate-shim{visibility:hidden;}.ng-anchor{position:absolute;}</style>
<meta charset="utf-8"/>
<meta content="IE=EDGE" http-equiv="X-UA-Compatible"/>
<meta content="#FDFDFD" media="(prefers-color-scheme: light)" name="theme-color"/>
<meta content="#031D20" media="(prefers-color-scheme: dark)" name="theme-color"/>
<meta content="" property="og:title"/>
<meta content="" property="og:description"/>
<meta content="https://www.avanza.se/avanzabank/ikoner/open-graph-default-logo.png" property="og:image"/>
<script>
    var gtmDataLayer = window.gtmDataLayer || [];

    


	
    gtmDataLayer.push({
        'pagePath': window.location.pathname
    });
    


    


</script>
<meta content="telephone=no" name="format-detection"/>
<meta content="Hitta och köp aktier från aktielist

Comparing against the visible content of the webpage, let us search the source for a known name, for example AstraZeneca. We find that the names are contained in elements that look like this:


```
                                                <td class="orderbookName">
                                                        <a class="ellipsis" href="/aktier/om-aktien.html/5431/astrazeneca">
                                                                <span class="flag small SE"></span>AstraZeneca
                                                        </a>
                                                </td>
```

Unfortunately, the names and the numerical data are split into different tables, so we cannot simply process one table row by row. Instead, we first collect the names and then the numerical data.

We will collect all `td` elements that have the class `orderbookName`. The name of the stock is the text content of the element, which in this case we can read directly from the `.text` attribute. In a more complicated case, we might have to use, e.g., `cell.find_all(string=True)` (the older spelling `text=True` is deprecated in recent Beautiful Soup versions).

The text contains a lot of whitespace around it, so it makes sense to `strip` that whitespace away.
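To make the options for reading the text concrete, here is a small standalone sketch. The `snippet` string is a hypothetical miniature of a single name cell, modelled on the element shown above; it is not read from the real file.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of one name cell, modelled on the element shown above.
snippet = """
<td class="orderbookName">
    <a class="ellipsis" href="/aktier/om-aktien.html/5431/astrazeneca">
        <span class="flag small SE"></span>AstraZeneca
    </a>
</td>
"""
cell = BeautifulSoup(snippet, 'html.parser').find('td', class_='orderbookName')

raw_text = cell.text                      # all text nodes joined, whitespace included
stripped = cell.text.strip()              # the approach used in this section
via_get_text = cell.get_text(strip=True)  # strips every text node before joining
# Collect the individual text nodes and keep only the non-empty ones.
pieces = [s.strip() for s in cell.find_all(string=True) if s.strip()]

print(repr(raw_text))
print(stripped, via_get_text, pieces)
```

For a cell with a single piece of text, all three approaches give the same name; they only start to differ when an element contains several separate text nodes.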

In [3]:
names = list()
for cell in soup.find_all('td',class_='orderbookName'):
    names.append(cell.text.strip())
names

['AAK',
 'ABB Ltd',
 'AddLife B',
 'Addnode Group B',
 'Addtech B',
 'AFRY',
 'Alfa Laval',
 'Alleima',
 'Arion Banki SDB',
 'Arjo B',
 'ASSA ABLOY B',
 'AstraZeneca',
 'Atlas Copco A',
 'Atlas Copco B',
 'Atrium Ljungberg B',
 'Autoliv SDB',
 'Avanza Bank Holding',
 'Axfood',
 'Beijer Ref B',
 'Betsson B',
 'Better Collective',
 'Bilia A',
 'Billerud',
 'BioArctic B',
 'Biotage',
 'Boliden',
 'Bravida Holding',
 'Bure Equity',
 'Camurus',
 'Castellum',
 'Catena',
 'Corem Property Group A',
 'Corem Property Group B',
 'Corem Property Group D',
 'Corem Property Group Pref',
 'Creades A',
 'Diös Fastigheter',
 'Dometic Group',
 'Electrolux A',
 'Electrolux B',
 'Electrolux Professional B',
 'Elekta B',
 'Embracer Group B',
 'Epiroc A',
 'Epiroc B',
 'EQT',
 'Ericsson A',
 'Ericsson B',
 'Essity A',
 'Essity B',
 'Evolution',
 'Fabege',
 'Fast. Balder B',
 'Fastpartner A',
 'Fastpartner D',
 'Fenix Outdoor International B',
 'Fortnox',
 'Getinge B',
 'Handelsbanken A',
 'Handelsbanken B',

In [4]:
len(names)

157

In the snapshot that we have, AstraZeneca, for example, has the following numbers:

|Senast |+/-%|1 år %|Börsvärde MSEK|P/E-tal|Direktavk. %|Ägare |Lista              |
|-------|----|------|--------------|-------|------------|------|-------------------|
|1423,00|0   |-3,98 |2 202 059     |34,05  |2,13        |62 653|Large Cap Stockholm|

Here *Senast* is the latest quote of the share price, *+/-%* the daily change in percent, *1 år %* the percentage change over the past year, *Börsvärde MSEK* the market value of the company in millions of kronor, *P/E-tal* the P/E ratio of the company, *Direktavk. %* (short for *direktavkastning*) the dividend yield in percent, *Ägare* the number of owners, and *Lista* the list under which the stock is listed on the Stockholm stock exchange.

We can find these in the following element (A substantial amount of whitespace has been removed):
```
                            <tr class="row rowId11" id="11">
                                    <td class="">
                                                <span class="pushBox" data-aza-push="vm.pushData.latest['5431'].lastPrice" data-aza-push-fractions="2">
1423,00 
                                                </span>
                                    </td>
                                    <td class="neutral">
                                                <span data-ng-class="{'neutral': vm.commonService.isNeutral(vm.pushData.latest['5431'].changePercent), 'negative': vm.commonService.isNegative(vm.pushData.latest['5431'].changePercent), 'positive': vm.commonService.isPositive(vm.pushData.latest['5431'].changePercent)}" data-aza-push="vm.pushData.latest['5431'].changePercent" data-aza-push-fractions="2" class="neutral" style="">
0 
                                                </span>
                                    </td>
                                    <td class="negative">
                                                <span>
-3,98 </span>
                                    </td>
                                    <td class="">
                                                <span>
2&nbsp;202&nbsp;059 </span>
                                    </td>
                                    <td class="">
                                                <span>
34,05 </span>
                                    </td>
                                    <td class="">
                                                <span>
2,13 </span>
                                    </td>
                                    <td class="">
                                                <span>
62&nbsp;653 </span>
                                    </td>
                                
                                    <td class="">
                                                <span>
Large Cap Stockholm </span>
                                    </td>
                            </tr>
```

So we will need to find `tr` elements that have class `row`. 

In [5]:
soup.find('tr',class_='row')

<tr class="row rowId0" id="0">
<td class="buySellButtons">
<ul class="u-cleanList u-floatList buySellButtons">
<li>
<a class="button buyBtn smallBtn" href="/handla/order.html/kop/26268" title="Köp">Köp</a>
</li>
</ul>
</td>
<td class="orderbookName">
<a class="ellipsis" href="/aktier/om-aktien.html/26268/aak">
<span class="flag small SE"></span>AAK
							</a>
</td>
</tr>

Unfortunately, this also matches the name rows we found above. So let us instead find the particular element whose value we know, and then find its parent `table` element (all `tr` elements belong to a `table`). Since the text node also contains surrounding whitespace, an exact string comparison would fail, so we match it with a regular expression.

In [6]:
import re
# Beautiful Soup applies the pattern with search(), so no leading or trailing .* is needed
element = soup.find(string=re.compile(r'1423,00'))

Now we can navigate to its parent.

In [7]:
table = element.parent.parent.parent.parent.parent
table.name

'table'
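Counting `.parent` hops works here, but it depends on the exact nesting depth. As an alternative sketch, `find_parent` walks upwards until it meets the requested tag. The `snippet` below is a hypothetical, heavily reduced stand-in for the real value table, and its `id="values"` attribute is invented for the example.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical, heavily reduced stand-in for the value table.
snippet = """
<table id="values">
  <tbody>
    <tr class="row rowId11" id="11">
      <td class=""><span class="pushBox">
1423,00
      </span></td>
    </tr>
  </tbody>
</table>
"""
mini_soup = BeautifulSoup(snippet, 'html.parser')
element = mini_soup.find(string=re.compile(r'1423,00'))

# Walk upwards until a <table> is found, however deeply the string is nested.
table = element.find_parent('table')
print(table.name, table['id'])
```

In this reduced example the result is the same element that five `.parent` steps reach (span, td, tr, tbody, table), but the lookup would keep working if a wrapper element were added or removed.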

Yay, this is the table we are looking for. Now let's look at the rows underneath it.

In [8]:
table.find('tr',class_='row')

<tr class="row rowId0" id="0">
<td class="">
<span class="pushBox" data-aza-push="vm.pushData.latest['26268'].lastPrice" data-aza-push-fractions="2">
                                                    











	


	

229,60 
                                                </span>
</td>
<td class="positive">
<span class="positive" data-aza-push="vm.pushData.latest['26268'].changePercent" data-aza-push-fractions="2" data-ng-class="{'neutral': vm.commonService.isNeutral(vm.pushData.latest['26268'].changePercent), 'negative': vm.commonService.isNegative(vm.pushData.latest['26268'].changePercent), 'positive': vm.commonService.isPositive(vm.pushData.latest['26268'].changePercent)}" style="">
                                                    











	


	

1,32 
                                                </span>
</td>
<td class="positive">
<span>











	


	

24,04 </span>
</td>
<td class="">
<span>











	


	

58 817 </span>
</td>
<td class="">
<span>











	



This is the data we want. So let's extract the text of each `td` element in the row.

In [42]:
for cell in table.find('tr',class_='row').select('td'):
    print(cell.text.strip())

229,60
1,32
24,04
58 817
22,61
1,21
9 526
Large Cap Stockholm


We will need to do this for all such rows, so we will use `find_all` and then construct the values.

In [58]:
values = list()
for row in table.find_all('tr',class_='row'):
    values.append([cell.text.strip() for cell in row.select('td')])
values

[['229,60',
  '1,32',
  '24,04',
  '58\xa0817',
  '22,61',
  '1,21',
  '9\xa0526',
  'Large Cap Stockholm'],
 ['441,30',
  '1,12',
  '26,81',
  '803\xa0634',
  '18,74',
  '2,19',
  '49\xa0168',
  'Large Cap Stockholm'],
 ['112,90',
  '4,25',
  '2,92',
  '13\xa0776',
  '44,74',
  '1,11',
  '7\xa0048',
  'Large Cap Stockholm'],
 ['84,75',
  '3,8',
  '-18,59',
  '11\xa0368',
  '39,44',
  '1,22',
  '3\xa0846',
  'Large Cap Stockholm'],
 ['216,40',
  '2,27',
  '32,44',
  '57\xa0724',
  '33,81',
  '1,18',
  '7\xa0876',
  'Large Cap Stockholm'],
 ['139,90',
  '3,25',
  '-25,47',
  '15\xa0810',
  '12,95',
  '4,06',
  '10\xa0675',
  'Large Cap Stockholm'],
 ['379,20',
  '-0,86',
  '15,4',
  '158\xa0996',
  '26,04',
  '1,57',
  '23\xa0443',
  'Large Cap Stockholm'],
 ['76,26',
  '2,03',
  '84,25',
  '19\xa0117',
  '11,83',
  '1,87',
  '37\xa0970',
  'Large Cap Stockholm'],
 ['11,80',
  '0,85',
  '5,55',
  '17\xa0085',
  '8,78',
  '5,4',
  '4\xa0430',
  'Large Cap Stockholm'],
 ['43,12',
  '1,36'

In [59]:
len(values)

157

An important thing to notice here is that some values contain weird-looking `\xa0` characters. What are those? In the HTML source, the string `'58\xa0817'` above is written as `58&nbsp;817`. The entity `&nbsp;` is a *non-breaking space*: a character that looks like a space *but is not an ordinary space*. It is used, for example, to separate digit groups or to tie units to numbers; it differs from a regular space in that the web browser will not break the line at it. These will cause trouble when we convert the strings to numbers, as will the use of a comma as the decimal separator, so let's handle both next.
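To see concretely why such strings cannot be converted as they are, here is a small standalone check that uses only the standard library. The helper `parse_number` is hypothetical and only illustrates the idea of cleaning a single string; it decides between `int` and `float` by looking for a decimal point, which is a simplification.

```python
import unicodedata

raw = '58\xa0817'   # what Beautiful Soup returns for 58&nbsp;817

print(unicodedata.name('\xa0'))   # official Unicode name of the character
print(raw == '58 817')            # comparison against an ordinary space

# A direct conversion fails because of the character in the middle.
try:
    int(raw)
    converted_directly = True
except ValueError:
    converted_directly = False

def parse_number(text):
    """Hypothetical helper: drop non-breaking spaces, turn the decimal comma into a point."""
    cleaned = text.replace('\xa0', '').replace(',', '.')
    return float(cleaned) if '.' in cleaned else int(cleaned)

market_value = parse_number(raw)
latest = parse_number('1423,00')
print(converted_directly, market_value, latest)
```

Note that a value such as `'0'` would come out of this helper as an integer, so choosing the type by column position gives more predictable column types.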

In [60]:
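# Drop non-breaking spaces and replace the decimal comma with a decimal point.
values = [[s.replace('\xa0', '').replace(',', '.') for s in row] for row in values]
# Columns 0-2 and 4-5 are floats, columns 3 and 6 are integers, column 7 is the list name.
values = [list(map(float, row[:3])) + [int(row[3])] +
          list(map(float, row[4:6])) + [int(row[6])] + [row[7]] for row in values]
values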

[[229.6, 1.32, 24.04, 58817, 22.61, 1.21, 9526, 'Large Cap Stockholm'],
 [441.3, 1.12, 26.81, 803634, 18.74, 2.19, 49168, 'Large Cap Stockholm'],
 [112.9, 4.25, 2.92, 13776, 44.74, 1.11, 7048, 'Large Cap Stockholm'],
 [84.75, 3.8, -18.59, 11368, 39.44, 1.22, 3846, 'Large Cap Stockholm'],
 [216.4, 2.27, 32.44, 57724, 33.81, 1.18, 7876, 'Large Cap Stockholm'],
 [139.9, 3.25, -25.47, 15810, 12.95, 4.06, 10675, 'Large Cap Stockholm'],
 [379.2, -0.86, 15.4, 158996, 26.04, 1.57, 23443, 'Large Cap Stockholm'],
 [76.26, 2.03, 84.25, 19117, 11.83, 1.87, 37970, 'Large Cap Stockholm'],
 [11.8, 0.85, 5.55, 17085, 8.78, 5.4, 4430, 'Large Cap Stockholm'],
 [43.12, 1.36, 6.21, 11631, 29.86, 2.0, 12408, 'Large Cap Stockholm'],
 [285.6, 1.28, 18.8, 318086, 23.39, 1.7, 27552, 'Large Cap Stockholm'],
 [1423.0, 0.0, -3.98, 2202059, 34.05, 2.13, 62653, 'Large Cap Stockholm'],
 [166.25, 1.46, 24.29, 769521, 29.22, 1.4, 15412, 'Large Cap Stockholm'],
 [142.4, 1.32, 21.71, 769521, 25.06, 1.64, 45376, 'Large C

Now we can construct a dataframe.

In [64]:
import pandas as pd
data = list()
for (name,val) in zip(names,values):
    row = { 'Name' : name,
            'Latest' : val[0],
            'Change %' : val[1],
            '1 year %' : val[2],
            'Market value MSEK' : val[3],
            'P/E' : val[4],
            'Dividend yield %' : val[5],
            'Owners' : val[6],
            'List' : val[7]
          }
    data.append(row)
data = pd.DataFrame(data)
data

|   |Name           |Latest|Change %|1 year %|Market value MSEK|P/E   |Dividend yield %|Owners|List               |
|---|---------------|------|--------|--------|-----------------|------|----------------|------|-------------------|
|0  |AAK            |229.60|1.32    |24.04   |58817            |22.61 |1.21            |9526  |Large Cap Stockholm|
|1  |ABB Ltd        |441.30|1.12    |26.81   |803634           |18.74 |2.19            |49168 |Large Cap Stockholm|
|2  |AddLife B      |112.90|4.25    |2.92    |13776            |44.74 |1.11            |7048  |Large Cap Stockholm|
|3  |Addnode Group B|84.75 |3.80    |-18.59  |11368            |39.44 |1.22            |3846  |Large Cap Stockholm|
|4  |Addtech B      |216.40|2.27    |32.44   |57724            |33.81 |1.18            |7876  |Large Cap Stockholm|
|...|...            |...   |...     |...     |...              |...   |...             |...   |...                |
|152|Volvo A        |255.60|1.27    |22.18   |510042           |11.52 |2.77            |15231 |Large Cap Stockholm|
|153|Volvo B        |250.15|1.25    |25.78   |510042           |11.28 |2.83            |160402|Large Cap Stockholm|
|154|Volvo Car B    |28.28 |-3.68   |-44.59  |83606            |6.44  |0.00            |106613|Large Cap Stockholm|
|155|Wallenstam B   |52.50 |3.35    |6.28    |33528            |-19.02|1.18            |11238 |Large Cap Stockholm|
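The same kind of dataframe can also be built without an explicit loop. This is a minimal sketch, assuming `names` and `values` have the shapes used in this section; here they are replaced by two sample rows copied from the output so that the block runs on its own, and the `columns` list is introduced just for this example.

```python
import pandas as pd

# Two sample rows copied from the output above; in the notebook these
# would be the full names and values lists.
names = ['AAK', 'ABB Ltd']
values = [[229.6, 1.32, 24.04, 58817, 22.61, 1.21, 9526, 'Large Cap Stockholm'],
          [441.3, 1.12, 26.81, 803634, 18.74, 2.19, 49168, 'Large Cap Stockholm']]

columns = ['Latest', 'Change %', '1 year %', 'Market value MSEK',
           'P/E', 'Dividend yield %', 'Owners', 'List']

# Pairing names and values by position only makes sense if both lists are equally long.
assert len(names) == len(values)

data = pd.DataFrame(values, columns=columns)
data.insert(0, 'Name', names)   # put the names into the first column
print(data)
```

The explicit length check matters because `zip` silently stops at the shorter of its inputs, so a mismatch between the two tables would otherwise go unnoticed.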


This can then be exported into a CSV file.

In [65]:
data.to_csv('avanza.csv', index=False)  # do not write the row index as an extra column
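Whether the row index is written makes a visible difference when the file is read back. The following sketch illustrates this with a small hypothetical frame and in-memory buffers instead of real files, so nothing is written to disk.

```python
import io
import pandas as pd

# Small hypothetical frame standing in for the scraped data.
frame = pd.DataFrame({'Name': ['AAK', 'ABB Ltd'], 'Latest': [229.6, 441.3]})

with_index = io.StringIO()
frame.to_csv(with_index)                  # default: the index becomes an unnamed first column

without_index = io.StringIO()
frame.to_csv(without_index, index=False)  # only the actual data columns are written

# Rewind both buffers before reading them back.
with_index.seek(0)
without_index.seek(0)

cols_with = list(pd.read_csv(with_index).columns)
cols_without = list(pd.read_csv(without_index).columns)
print(cols_with)
print(cols_without)
```

When the index is written, `read_csv` labels the header-less first column `Unnamed: 0`, which is usually not wanted in a file meant for further processing.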