# List of computers

Wikipedia is full of lists. One of the lists you might find yourself reading late at night is the list of home computers.

[the list](https://en.wikipedia.org/wiki/List_of_home_computers)

(If there are other lists you would prefer, we're not sure we want to know.) And as interesting as this list is on Wikipedia, how cool would it be inside an actual Python list?

Let's start by importing the libraries for regex and downloading webpages to our computer.

Note that we're doing this as an example of regex, not as an example of web scraping. When you find information on the internet that you want locally, you should always look for an API first. If none is offered, try web scraping using Beautiful Soup or Selenium (the next topic in this course).

In [1]:
! pip install requests

import re
import requests

Next up is getting the website locally. We have the URI (https://en.wikipedia.org/wiki/List_of_home_computers) and we will use it to download the page.

In [2]:
uri = "https://en.wikipedia.org/wiki/List_of_home_computers"

r = requests.get(uri)
print(r.url)
print(r.status_code)

https://en.wikipedia.org/wiki/List_of_home_computers
200


Now we have a variable `r`, which is a Response object. We can investigate it further to find the actual list. A website is composed of many parts, most of which we are not interested in, so we need to strip the header and footer. This is where your knowledge of web design will prove invaluable.

First thing we need is the HTML-code. This is stored in the text-property. You could print it, but you won't be able to read it all.

In [4]:
print(r.text[:300])

<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>List of home computers - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wg


When confronted with this problem, the Chrome browser is a great help. It allows you to "inspect" a webpage, showing the code you want to find next to the actual webpage:

![](images/2022-03-03-19-49-17.png)

This way we know that we are looking for `<table class="wikitable sortable jquery-tablesorter">`. Let's start with some regex that will give us way too much information and refine that.

In [5]:
found = re.search(r'<table class="wikitable sortable jquery-tablesorter">', r.text)
if found:
    print(found)
else:
    print("found nothing...")

found nothing...


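We found nothing? The inspector shows the page after the browser has run its scripts, and the `jquery-tablesorter` class is most likely added by JavaScript, so it is not in the HTML we downloaded. Let's go less refined.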

In [6]:
found = re.search(r"<table.*>", r.text)
if found:
    print(found)
else:
    print("found nothing...")

<re.Match object; span=(18967, 19001), match='<table class="wikitable sortable">'>


We found a table! Let's refine again.

In [7]:
found = re.search(r'<table class="wikitable sortable">', r.text)
if found:
    print(found)
else:
    print("found nothing...")

<re.Match object; span=(18967, 19001), match='<table class="wikitable sortable">'>
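
The printed `re.Match` object carries more than its representation suggests. As a minimal sketch (the `sample_html` string is a made-up stand-in for `r.text`, not the real page), this is how the span and the matched text can be read from it:

```python
import re

# Hypothetical sample standing in for the downloaded page (r.text).
sample_html = '<body><table class="wikitable sortable"><tr></tr></table></body>'

match = re.search(r'<table class="wikitable sortable">', sample_html)

if match:
    start, end = match.span()     # the two numbers shown as span=(...)
    matched_text = match.group()  # the text shown as match='...'
    print(start, end)
    print(matched_text)
    # Slicing the source with the span gives back the matched text.
    print(sample_html[start:end] == matched_text)
```

Knowing the span is handy when you want to cut the page at the position where the table starts.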


Now we're going to get the content.

In [8]:
found = re.search(r'<table class="wikitable sortable">.*</table>', r.text)
if found:
    print(found)
else:
    print("found nothing...")

found nothing...


Nothing? We know the beginning is there and the end is there, so what is going on?

The solution: `.` matches everything _except_ newlines. We need the `re.DOTALL` flag to make it match newlines as well.

In [12]:
found = re.search(r'<table class="wikitable sortable">.*</table>', r.text, flags=re.DOTALL)
if found:
    print(found)
    the_table = found.group()
else:
    print("found nothing...")

print(the_table[:100])
print(the_table[-100:])

<re.Match object; span=(18967, 102516), match='<table class="wikitable sortable">\n<caption>Home>
<table class="wikitable sortable">
<caption>Home computer models and manufacturers
</caption>
<tbody
ity
</th>
<th class="unsortable">Remarks
</th>
<th class="unsortable">Ref
</th></tr></tbody></table>
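
To see the effect of the flag in isolation, here is a small sketch on a made-up three-line string (`sample` is an assumption for illustration, not the real page). It also shows the inline form `(?s)`, which switches on the same behaviour from inside the pattern:

```python
import re

# Hypothetical miniature table spread over several lines.
sample = "<table>\n<tr><td>UK</td></tr>\n</table>"

# Without a flag, "." stops at the first newline, so nothing matches.
without_flag = re.search(r"<table>.*</table>", sample)

# With re.DOTALL, "." also matches newlines.
with_flag = re.search(r"<table>.*</table>", sample, flags=re.DOTALL)

# (?s) at the start of the pattern is the inline spelling of re.DOTALL.
inline_flag = re.search(r"(?s)<table>.*</table>", sample)

print(without_flag)
print(with_flag.group() == sample)
print(inline_flag.group() == sample)
```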


Alright, so now we have a variable containing the table we want to investigate. (We stored it in a variable so we don't need to worry about the rest of the page anymore.) We want as much information as we can find. For this we need all the text between "tr"-tags. Within those tags we'll be looking for the second, third and fourth cell.

![](images/2022-03-03-20-15-45.png)

In [16]:
found = re.findall(r"<tr>.*</tr>", the_table, flags=re.DOTALL)

if not found:
    raise Exception("Nothing found. Shouldn't be happening.")

the_list = found

print(the_list[0])
print(the_list[1])

<tr>
<th>Origin
</th>
<th>Manufacturer
</th>
<th>Model
</th>
<th>Processor
</th>
<th>Year
</th>
<th>Video type
</th>
<th>Mass storage
</th>
<th>Video chip<br />(see <a href="/wiki/List_of_home_computers_by_video_hardware" title="List of home computers by video hardware">list</a>)
</th>
<th>Compatibility
</th>
<th class="unsortable">Remarks
</th>
<th class="unsortable">Ref
</th></tr>
<tr>
<td>UK
</td>
<td><a href="/wiki/Acorn_Computers" title="Acorn Computers">Acorn Computers</a> Ltd</td>
<td><a href="/wiki/Acorn_Atom" title="Acorn Atom">Acorn Atom</a></td>
<td><a href="/wiki/MOS_Technology_6502" title="MOS Technology 6502">6502</a></td>
<td>1980</td>
<td>TV</td>
<td>Cassette</td>
<td>6847</td>
<td></td>
<td></td>
<td><sup id="cite_ref-CN8409_1-0" class="reference"><a href="#cite_note-CN8409-1">&#91;1&#93;</a></sup>
</td></tr>
<tr>
<td>UK
</td>
<td>Acorn Computers Ltd</td>
<td><a href="/wiki/BBC_Micro" title="BBC Micro">BBC Micro</a></td>
<td><a href="/wiki/MOS_Technology_6502" title="M

IndexError: list index out of range

We found the first item in the list well enough, but there doesn't seem to be a second item, which is odd. But remember: `*` is greedy, so every row of the table ends up in that first item. That is not what we want. We want [non-greedy](https://docs.python.org/3/howto/regex.html#greedy-versus-non-greedy).
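
The difference is easiest to see on a tiny input. A minimal sketch, using a made-up two-row string (`sample_rows` is hypothetical) instead of the real table:

```python
import re

# Hypothetical two-row fragment.
sample_rows = "<tr><td>UK</td></tr><tr><td>US</td></tr>"

# Greedy: .* runs to the LAST </tr>, so both rows land in one match.
greedy = re.findall(r"<tr>.*</tr>", sample_rows, flags=re.DOTALL)

# Non-greedy: .*? stops at the FIRST </tr> it can, giving one match per row.
lazy = re.findall(r"<tr>.*?</tr>", sample_rows, flags=re.DOTALL)

print(len(greedy))
print(len(lazy))
print(lazy)
```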

In [1]:
found = re.findall(r"<tr>.*?</tr>", the_table, flags=re.DOTALL)

if not found:
    raise Exception("Nothing found. Shouldn't be happening.")

the_list = found

print(the_list[0])
print(the_list[1])


With the list of computers stored in the_list (and you should really use better variable names, we're totally providing this as an example of laziness and not actual laziness) we can go over this list and regex the different fields out of it. We'll start out by getting the column headers (in the_list[0]); next we'll store all the separate fields in a list of dictionaries.

Do note: the headers are slightly different, as Chrome Inspect tells us:

![](images/2022-03-03-21-02-57.png)

(Also: did you note how we went from "if found..." to "if not found..."? Because we expect stuff to be found this is a way more concise method of writing code. In future codeblocks we'll drop even that and simply deal with the errors if nothing is found.)

In [36]:
headers = the_list[0]
found = re.findall("<th.*?>(.*?)</th>", headers, flags=re.DOTALL)

for f in found:
    print("_", f, "_", sep="")

_Origin
_
_Manufacturer
_
_Model
_
_Processor
_
_Year
_
_Video type
_
_Mass storage
_
_Video chip<br />(see <a href="/wiki/List_of_home_computers_by_video_hardware" title="List of home computers by video hardware">list</a>)
_
_Compatibility
_
_Remarks
_
_Ref
_


Good, but we can do better. There's a newline after every line that we don't want and using the non-greedy operator to get rid of the text within the opening tag is also pretty lazy. Both can be fixed by writing better regexes.

There's also the line with "\<br /\>...". Can we get rid of that text as well?

In [63]:
# (?:br|/th) means "br" OR "/th"; square brackets would only match a single character
found = re.findall(r"<th[^>]*>(.*?)\n?<(?:br|/th)", headers, flags=re.DOTALL)

for f in found:
    print("_", f, "_", sep="")

table_titles = found

_Origin_
_Manufacturer_
_Model_
_Processor_
_Year_
_Video type_
_Mass storage_
_Video chip_
_Compatibility_
_Remarks_
_Ref_
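
A side note on matching "either `<br` or `</th`": square brackets define a set of single characters, so `<[br|/th]` means `<` followed by any one of `b`, `r`, `|`, `/`, `t` or `h`. A non-capturing group `(?:br|/th)` expresses the actual alternative. The sketch below compares both on a few made-up tags; the variable names are only for illustration:

```python
import re

char_class = r"<[br|/th]"     # "<" followed by ONE of: b r | / t h
alternation = r"<(?:br|/th)"  # "<" followed by "br" OR "/th"

samples = ["<br />", "</th>", "<b>", "<tr>", "</td>"]

# For every sample: (matches the character class?, matches the alternation?)
results = {
    text: (bool(re.match(char_class, text)), bool(re.match(alternation, text)))
    for text in samples
}

for text, outcome in results.items():
    print(text, outcome)
```

The character class happens to work on the header row because nothing else follows the cell text there, but it would also stop at tags such as `<b>` or `</td>`.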


In [61]:
found = re.findall(r"<td>(.*?)\n?</td>", the_list[1], flags=re.DOTALL)

for f in found:
    print("_", f, "_", sep="")


_UK_
_<a href="/wiki/Acorn_Computers" title="Acorn Computers">Acorn Computers</a> Ltd_
_<a href="/wiki/Acorn_Atom" title="Acorn Atom">Acorn Atom</a>_
_<a href="/wiki/MOS_Technology_6502" title="MOS Technology 6502">6502</a>_
_1980_
_TV_
_Cassette_
_6847_
__
__
_<sup id="cite_ref-CN8409_1-0" class="reference"><a href="#cite_note-CN8409-1">&#91;1&#93;</a></sup>_


Done. You could figure out a way of cutting all the links, but the links are actually pretty interesting. If you want to go that way, try [regex101](https://regex101.com), set it to "Python" and paste some raw text below. That way you'll have a very easy way of tinkering with the regex.

![](images/2022-03-03-21-29-35.png)
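
If you do want plain text, one possible approach is to remove everything that looks like a tag with `re.sub`. This is a rough sketch, not a robust HTML cleaner: `strip_tags` is a hypothetical helper name, and the pattern assumes no `>` appears inside attribute values.

```python
import re
from html import unescape

# Two cells copied from the first data row shown above.
cell = '<a href="/wiki/Acorn_Computers" title="Acorn Computers">Acorn Computers</a> Ltd'
ref_cell = ('<sup id="cite_ref-CN8409_1-0" class="reference">'
            '<a href="#cite_note-CN8409-1">&#91;1&#93;</a></sup>')


def strip_tags(fragment):
    """Remove anything shaped like <...> and decode HTML entities."""
    text = re.sub(r"<[^>]+>", "", fragment)
    return unescape(text).strip()


print(strip_tags(cell))
print(strip_tags(ref_cell))
```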

But keeping the links we only have to go over the entire list and create our list of dictionaries!

In [70]:
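output_list = []

for item in the_list[1:]:
    # Extract the cells of the current row (item), not of the_list[1] every time.
    found = re.findall(r"<td>(.*?)\n?</td>", item, flags=re.DOTALL)

    row = {}  # named row so the built-in dict is not shadowed
    for index, text in enumerate(found):
        row[table_titles[index]] = text
    output_list.append(row)

# The first data row is the Acorn Atom.
print(output_list[0]["Model"])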

<a href="/wiki/Acorn_Atom" title="Acorn Atom">Acorn Atom</a>
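
The row-to-dictionary step can also be written with `zip`, which pairs each title with its cell and simply stops at the shorter of the two. The sketch below runs the whole idea on a made-up three-column table (`sample_table` and the other names are assumptions for illustration) and skips rows that contain no `<td>` cells:

```python
import re

# Hypothetical miniature table in the same shape as the Wikipedia one.
sample_table = """<tr>
<th>Origin
</th>
<th>Manufacturer
</th>
<th>Model
</th></tr>
<tr>
<td>UK
</td>
<td>Acorn Computers Ltd</td>
<td>Acorn Atom</td></tr>
<tr>
<td>UK
</td>
<td>Acorn Computers Ltd</td>
<td>BBC Micro</td></tr>"""

rows = re.findall(r"<tr>.*?</tr>", sample_table, flags=re.DOTALL)
titles = re.findall(r"<th[^>]*>(.*?)\n?</th>", rows[0], flags=re.DOTALL)

computers = []
for row in rows[1:]:
    cells = re.findall(r"<td>(.*?)\n?</td>", row, flags=re.DOTALL)
    if cells:  # skip rows without data cells, e.g. a repeated header row
        computers.append(dict(zip(titles, cells)))

print(computers[1]["Model"])
```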


So that was a simple example. Simple because we used Wikipedia: Wikipedia prides itself on creating nice, structured pages. Suppose you wanted to know all the records of "the heaviest list", the list of 666 heavy metal records that StuBru broadcasts every year around Easter. Then you'd have a much bigger problem. Check the following screenshot:

![](images/2022-03-02-15-47-09.png)

Notice the "loading" in the center? This means the list is generated by using JavaScript after the page has loaded. Therefore the request we made in the beginning won't contain the list. You can solve this using better screenscraping methods which, incidentally, are the topic of the next couple of notebooks.