# 08 Getting Data

Part of ["Introduction to Data Science" course](https://github.com/kupav/data-sc-intro) by Pavel Kuptsov, [kupav@mail.ru](mailto:kupav@mail.ru)

Recommended reading for this section:

1. Grus, J. (2019). Data Science From Scratch: First Principles with Python (2nd ed.). Sebastopol, CA: O’Reilly Media.

The following Python modules will be required. The first three belong to the standard library and ship with Python; make sure that the remaining ones are installed.
- os
- sys
- csv
- beautifulsoup4 (bs4)
- requests 
- html5lib

## Lesson 1

### File manipulations without leaving Jupyter

Below we will need separate Python scripts, i.e., programs that are run as standalone executables rather than as Jupyter cells.

However, we still want to use Jupyter to inspect their content before running them.

For this purpose we will use so-called magic commands.

Cell magics are prefixed with a double percent sign `%%`, must be placed on the first line of a cell, and apply to the whole cell.

The magic command `%%writefile <filename>` writes the contents of a cell to a file instead of executing it.

Consider an example. 

The following simple piece of code sums numbers from 1 to 5 and prints the result. 

Let us first see how it works:

In [None]:
print("This is test script")
total = 0  # do not call it `sum`: that would shadow the built-in function in this notebook
for i in range(1, 6):
    total += i
print(f"sum={total}")

Now we are going to write it to a separate file.

First we specify the file name as Python variable.

In [None]:
# This is the name for our file
filename = 'sum5.py'

The cell below will be saved to a file. 

Observe that we pass the file name to `%%writefile` through the variable `filename`: the prefix `$` tells Jupyter to substitute the value of the variable for its name.

In [None]:
%%writefile $filename
print("This is test script")
total = 0
for i in range(1, 6):
    total += i
print(f"sum={total}")

To check that everything went right, we can open a file manager and make sure that the file has been created.

Besides the file manager, there is also a terminal (also called a console) where we can type text commands to manipulate files.

The terminal is run as a program, e.g., from the Start menu. On Windows this program is called `cmd.exe`, and Linux users can choose among many terminal programs.

We can also open a terminal right from Jupyter. (Unfortunately, if this notebook is opened in Google Colab, the terminal is not available by default; some adjustments are required to get it.)

In any case, we can type terminal commands directly in Jupyter code cells.

To tell Jupyter that a line is a terminal command rather than Python code, we put the symbol `!` in front of it.

For example, to list the content of the current directory we type `ls` on Linux and `dir` on Windows.

In [None]:
#!dir
!ls

The Linux command `cat` prints the content of a file to the screen. On Windows the corresponding command is `type`.

In [None]:
#!type $filename
!cat $filename

If we want to run a Python program we have to type as follows (both in Linux and in Windows):

In [None]:
!python $filename
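
What the cells above do with `%%writefile` and `!python` can also be reproduced in plain Python. The sketch below is only an illustration: the helper name `write_and_run` and the use of a temporary directory are assumptions made here, not part of the course material.

```python
import os
import subprocess
import sys
import tempfile

# Source text of the script, the same program as in the cells above
source = '''print("This is test script")
total = 0
for i in range(1, 6):
    total += i
print(f"sum={total}")
'''

def write_and_run(code):
    """Write `code` to a temporary .py file, run it, and return what it printed."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sum5.py")
        with open(path, "w") as f:
            f.write(code)
        # sys.executable is the Python interpreter that runs this notebook
        result = subprocess.run([sys.executable, path],
                                capture_output=True, text=True, check=True)
    return result.stdout

print(write_and_run(source), end="")
```

The temporary directory is deleted automatically, so nothing has to be cleaned up afterwards.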

When the file is not needed we can remove it. 

There are many ways to do it. We will use the one provided by the Python module `os`

In [None]:
import os

def remove_file(fn):
    # If file exists, delete it 
    if os.path.isfile(fn):
        os.remove(fn)
        print(f"File {fn} has been removed")
    # If not, print a message
    else: 
        print(f"File {fn} not found")

In [None]:
remove_file(filename)

Let us now check the content of the directory:

In [None]:
#!dir
!ls

### Command-line parameters

Python, like any other computer program, reads the parameters passed to it on the command line.

This is a typical way of controlling how a program works.

When a program is started from a graphical interface, the command line still exists but is hidden from the user. If needed, it can be accessed and the parameters can be specified.

Inside a Python program, the command-line parameters are accessible through the module `sys`.

They are stored in the list `sys.argv`. All its elements are strings.

- `argv[0]` is the name of the program itself
- `argv[1]` the first parameter
- `argv[2]` the second one and so on.

Here is an example: the program below prints its list of command-line parameters and then treats `argv[1]` as an operation, `+` or `-`. The two other command-line parameters are treated as integer operands.

In [None]:
filename = 'plus_minus.py'

In [None]:
%%writefile $filename
import sys

print(f'sys.argv={sys.argv}')
op = sys.argv[1]
x1 = int(sys.argv[2])
x2 = int(sys.argv[3])

if op == '+':
    y = x1 + x2
elif op == '-':
    y = x1 - x2
else:
    y = None

print(f'{x1} {op} {x2} = {y}')

In [None]:
!python $filename + 1 2
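
The script above assumes that exactly three parameters are given. A minimal sketch of the same logic with basic checks is shown below; the function name `calculate` and the usage message are assumptions introduced here for illustration. The function takes the list explicitly, so it can be tried without a command line.

```python
def calculate(argv):
    """Interpret argv as [program, op, x1, x2] and return the result."""
    if len(argv) != 4:
        raise SystemExit(f"usage: {argv[0]} (+|-) <int> <int>")
    # All command-line parameters arrive as strings, so convert the operands
    op, x1, x2 = argv[1], int(argv[2]), int(argv[3])
    if op == '+':
        return x1 + x2
    if op == '-':
        return x1 - x2
    raise SystemExit(f"unknown operation: {op}")

# The list imitates what sys.argv holds for `python plus_minus.py + 1 2`
print(calculate(['plus_minus.py', '+', '1', '2']))
```

In a real script one would call `calculate(sys.argv)`.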

In [None]:
# We do not need this file anymore
remove_file(filename)

### Standard input and output streams

A Python program interacts with the external world via files: it reads data from input files and writes results to output files.

Another way of interaction is, of course, a GUI (Graphical User Interface). But a GUI is usually used to control the operation mode, while data flows into and out of a program through files.

Files inside a Python program are accessed by their names.

There are also standard, i.e., default, files for the input and output streams of data.

They are represented by the objects `stdin` (input) and `stdout` (output) provided by the module `sys`.

The standard input and output streams can be used, for example, to create data filters.

Consider a program that accepts one command-line parameter: the keyword.

Then it reads `stdin` line by line, and whenever a line contains the keyword, it writes that line to `stdout`.

By default `stdin` is attached to the keyboard and `stdout` corresponds to the console screen. 

In [None]:
progfile = 'kw_filter.py'

In [None]:
%%writefile $progfile
import sys
keyword = sys.argv[1]
for line in sys.stdin:
    if keyword in line:
        sys.stdout.write(f"Keyword '{keyword}' is found in stdin, put the line to stdout: \n{line}")

Since Jupyter cannot run programs in interactive mode, we have to test this program in a real console.

Thus we either need to open the Jupyter built-in terminal or start a terminal as a separate program and navigate to the working directory where this notebook is located. Recall that on Windows the terminal is the program `cmd.exe`.

In Google Colab, however, we can run this program right from the notebook since Colab supports interactive mode.

Execution stops when EOF (End Of File) is received.

To send it from the keyboard, press Ctrl-D on Linux or Ctrl-Z followed by Enter on Windows. In Colab, execution is stopped by interrupting the cell.

If you use Google Colab, uncomment and run the cell below; otherwise go to a real console and run the program from there.

When the program is running, type some lines with and without the keyword 'big'.

In [None]:
#! python $progfile big

In [None]:
remove_file(progfile)

### Redirecting streams and pipes between programs

The above example of using `stdin` and `stdout` is not very interesting.

The practically important case is when the default streams are redirected using so-called pipes.

A pipe is a mechanism for transferring data between programs:

the output stream of one program is attached to the input stream of another.

Using pipes we can organize data filtering as follows:

Assume we have a text file.

We send the content of this file through a pipe to our filtering program, which drops all lines except those containing the keyword. The output of the filter can be sent either to the screen or to another file.

Let us first modify our program by removing the additional remarks from its output.

In [None]:
progfile = 'kw_filter.py'

In [None]:
%%writefile $progfile
import sys
keyword = sys.argv[1]
for line in sys.stdin:
    if keyword in line:
        sys.stdout.write(line)  # No comments printed anymore. Only write an appropriate line

Now we create an initial datafile: 

In [None]:
datafile = 'for_filter.txt'

In [None]:
%%writefile $datafile
one line
two line
big line
very big one

The pipe is denoted by the vertical bar symbol `|`.

To create a pipe between commands we type them one after another using `|` as a separator.

The command line below works as follows:

First we output the content of a file using `cat` (on Windows the command `type` should be used instead).

But instead of the screen, its output goes to our filter program. In other words, we create a pipe that transfers the output of `cat` to the input of our program.

Finally our program outputs its result to the screen.

In [None]:
!cat $datafile | python $progfile one
# Use this if using Windows
# !type $datafile | python $progfile one

Here we use another keyword.

In [None]:
!cat $datafile | python $progfile big
# !type $datafile | python $progfile big

We can also send the resulting output to a file. 

But now we do not need a pipe: a pipe is for communication between programs.

If we merely want to redirect `stdout` from the screen to a file, we use the symbol `>` after the command:

In [None]:
outfile = 'filtered.txt'

In [None]:
!cat $datafile | python $progfile big > $outfile
# !type $datafile | python $progfile big > $outfile

Let us see the result:

In [None]:
!cat $outfile
#!type $outfile

### More examples of using pipes

Assume that we do not need to see the filtered lines, but want to count them.

We need a program that reads lines from `stdin` and writes their count to `stdout`.

In [None]:
countfile = 'count.py'

In [None]:
%%writefile $countfile
import sys
count = 0
for line in sys.stdin:
    count += 1
sys.stdout.write(str(count))

Notice that we iterate over the lines in `stdin` and count them one by one.

We cannot simply use `len(sys.stdin)` instead because a stream does not know its length in advance.

The input stream ends when EOF is received. But in general the moment when this happens is unpredictable: imagine that `stdin` is attached to the keyboard. The program cannot predict when the user will type EOF.
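
This can be tried without a console by using `io.StringIO`, an in-memory text stream that behaves like `stdin` for iteration. The sketch below is an illustration only; the name `count_lines` is an assumption introduced here.

```python
import io

def count_lines(stream):
    """Count lines by consuming the stream one line at a time."""
    return sum(1 for _ in stream)

fake_stdin = io.StringIO("one line\ntwo line\nbig line\n")
print(count_lines(fake_stdin))

# A stream has no length, so len() is rejected
try:
    len(io.StringIO("some text"))
except TypeError as e:
    print("len() is not supported:", e)
```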

Let us count the lines in the initial file:

In [None]:
!cat $datafile | python $countfile
# !type $datafile | python $countfile

And here we first filter the input file and then count its lines:

In [None]:
!cat $datafile | python $progfile big | python $countfile
# !type $datafile | python $progfile big | python $countfile

Before we continue, let us explore a useful feature. The standard Python module `string` contains a variable `punctuation` that holds all ASCII punctuation characters:

In [None]:
import string
print(string.punctuation)

Now we will create a more complicated analyzer. We are going to count the words in an input text and print the counts of the most common words, i.e., those that occur most often in the text.

In [None]:
textfile = 'story.txt'

In [None]:
%%writefile $textfile
Next morning, Monday, after disposing of the embalmed head to a
barber, for a block, I settled my own and comrade’s bill; using,
however, my comrade’s money. The grinning landlord, as well as the
boarders, seemed amazingly tickled at the sudden friendship which had
sprung up between me and Queequeg—especially as Peter Coffin’s cock
and bull stories about him had previously so much alarmed me
concerning the very person whom I now companied with.

We borrowed a wheelbarrow, and embarking our things, including my own
poor carpet-bag, and Queequeg’s canvas sack and hammock, away we went
down to “the Moss,” the little Nantucket packet schooner moored at the
wharf. As we were going along the people stared; not at Queequeg so
much—for they were used to seeing cannibals like him in their
streets,—but at seeing him and me upon such confidential terms. But we
heeded them not, going along wheeling the barrow by turns, and
Queequeg now and then stopping to adjust the sheath on his harpoon
barbs. I asked him why he carried such a troublesome thing with him
ashore, and whether all whaling ships did not find their own
harpoons. To this, in substance, he replied, that though what I hinted
was true enough, yet he had a particular affection for his own
harpoon, because it was of assured stuff, well tried in many a mortal
combat, and deeply intimate with the hearts of whales. In short, like
many inland reapers and mowers, who go into the farmers’ meadows armed
with their own scythes—though in no wise obliged to furnish them—even
so, Queequeg, for his own private reasons, preferred his own harpoon.

In [None]:
wcntfile = 'word_count.py'

In [None]:
%%writefile $wcntfile
import sys
import string
from collections import Counter

# Number of words to show
num = int(sys.argv[1])

# Collect words here
txt = []
# Read stdin line by line
for line in sys.stdin:
    # Split line into list of words
    for w in line.split():
        # Remove punctuation characters
        for p in string.punctuation:
            w = w.replace(p, '')
        # Store word in txt-list
        txt.append(w)

cnt = Counter(txt)

# Write the most common words and their counts to stdout
for w, c in cnt.most_common(num):
    sys.stdout.write(f"{w} : {c}\n")

In [None]:
!cat $textfile | python $wcntfile 10
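
The script counts `The` and `the` as different words, and `string.punctuation` covers only ASCII characters, so typographic marks such as `’` or `—` stay attached to the words. Below is a sketch of one possible variation that lowercases the words and skips tokens that become empty once punctuation is removed. The function name `count_words` and the sample lines are assumptions made for illustration; the handling of non-ASCII punctuation is left as it is.

```python
import string
from collections import Counter

# Translation table that deletes every ASCII punctuation character
DROP_PUNCT = str.maketrans('', '', string.punctuation)

def count_words(lines, num):
    """Return the `num` most common lowercase words found in `lines`."""
    counter = Counter()
    for line in lines:
        for w in line.split():
            w = w.translate(DROP_PUNCT).lower()
            if w:  # skip tokens that consisted of punctuation only
                counter[w] += 1
    return counter.most_common(num)

sample = ["The cat, the dog.", "the end - really"]
print(count_words(sample, 1))
```

Since the function accepts any iterable of lines, `count_words(sys.stdin, 10)` would work inside a pipe as well.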

In [None]:
remove_file(progfile)
remove_file(datafile)
remove_file(outfile)
remove_file(countfile)
remove_file(textfile)
remove_file(wcntfile)

### Reading and writing plain files

We have already discussed this topic.

Here is a reminder.

First, create a file:

In [None]:
textfile = 'TomSawyer1.txt'

In [None]:
%%writefile $textfile
Tom did play hookey, and he had a very good time. He got back home
barely in season to help Jim, the small colored boy, saw next–day's
wood and split the kindlings before supper—at least he was there in
time to tell his adventures to Jim while Jim did three–fourths of the
work. Tom's younger brother (or rather half–brother) Sid was already
through with his part of the work (picking up chips), for he was a
quiet boy, and had no adventurous, troublesome ways.

The following code reads the file line by line and prints the lines.

Pay attention to the second parameter `'r'` of `open`. It means that the file is opened for reading.

If the second parameter is omitted, reading is assumed by default.

Observe that each line is returned with its end-of-line symbol `\n`.

To avoid double line breaks, we suppress the end-of-line symbol that `print` adds by default.

In [None]:
file = open(textfile, 'r')
for line in file:
    print(line, end="")
file.close()

Observe that the file must be closed after use.

To guarantee that it is closed, we can wrap the file operations in a `with` context.

It closes the file automatically when execution leaves the context.

In [None]:
text = []
with open(textfile, 'r') as file:
    for line in file:
        text.append(line.strip())

print(text)        

Observe that now we apply the method `strip` to each line before storing it in the list `text`.

This method removes spaces, newlines, and other whitespace characters from both ends of the line.
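
A short illustration of the difference between `strip`, which trims both ends, and `rstrip("\n")`, which removes only the trailing newline. The sample string is made up for this example.

```python
line = "  indented text  \n"

# repr() makes the remaining spaces visible
print(repr(line.strip()))        # whitespace removed from both ends
print(repr(line.rstrip("\n")))   # only the trailing newline removed
```

The second form is preferable when leading spaces carry meaning, for example in indented text.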

When a file is opened with the parameter `'w'`, it is opened for writing.

If the file already exists it will be overwritten.

In [None]:
text = """The old lady whirled round, and snatched her skirts 
out of danger. The lad fled on the instant, scrambled 
up the high board–fence, and disappeared over it."""

with open(textfile, 'w') as file:
    file.write(text)

Let us read the file again to ensure that the content is new:

In [None]:
with open(textfile, 'r') as file:
    for line in file:
        print(line, end="")

If a file is opened with the parameter `'a'` the new content will be appended to the existing one:

In [None]:
text = """\n\nHis aunt Polly stood surprised a moment, 
and then broke into a gentle laugh."""

with open(textfile, 'a') as file:
    file.write(text)

Let us see the result: 

In [None]:
with open(textfile, 'r') as file:
    for line in file:
        print(line, end="")
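
The difference between the modes `'w'` and `'a'` can be summarized in one small sketch. It works in a temporary directory so that no files of the notebook are touched; the function name `demo_modes` and the file name `demo.txt` are assumptions made for illustration.

```python
import os
import tempfile

def demo_modes():
    """Write a file three times and return its final content."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.txt")
        with open(path, "w") as f:
            f.write("first\n")
        with open(path, "w") as f:   # 'w' discards the previous content
            f.write("second\n")
        with open(path, "a") as f:   # 'a' keeps it and adds to the end
            f.write("third\n")
        with open(path) as f:        # mode 'r' is the default
            return f.read()

print(demo_modes(), end="")
```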

In [None]:
remove_file(textfile)

### CSV files

The simplest and very common way of storing tabular data is a CSV file (Comma-Separated Values).

Technically, a CSV file is plain text whose content represents a table.

Each line of the text corresponds to a table row. Table fields, i.e., the values, are separated within lines by commas `,` or, in other variants of the format, by colons `:`, semicolons `;`, or tab characters `'\t'`.

We already dealt with CSV files previously: we opened them as ordinary files and parsed them manually using the `split` method.

But for serious work there is no need to do this: there are special tools for it.

In fact, manual parsing of CSV files is not recommended at all.

There are cases when the table fields themselves contain commas or newline symbols. Special readers are able to process these situations correctly.

We consider the Python standard module `csv`.
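
To see why manual splitting is fragile, compare it with `csv.reader` on a single row whose quoted field contains commas. The row is a made-up sample in the style of a stock-price table, and `io.StringIO` is used only to avoid creating a file.

```python
import csv
import io

line = '02/05/2021,"$316.56","302,036,023"'

# Manual splitting cuts the quoted number into pieces and keeps the quotes
naive = line.split(',')
print(len(naive), naive)

# csv.reader understands the quotes and returns three clean fields
parsed = next(csv.reader(io.StringIO(line)))
print(len(parsed), parsed)
```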

Let us first write an example file using Jupyter magic command `%%writefile`.

The content that will be written to the file is in the section below.

These are the stock prices of the GameStop Corp. 

Since this is a plain text we can easily inspect its structure.

The first line as usual contains column headers. 

Other lines contain data. 

In our example the separators are commas.

In [None]:
csvread = 'GameStop.csv'

In [None]:
%%writefile $csvread
Date,Open,High,Low,Close,Volume
02/05/2021,"$316.56","$322.00","$51.09","$63.77","302,036,023"
01/29/2021,"$96.73","$483.00","$61.13","$325.00","559,240,540"
01/22/2021,"$41.55","$76.76","$36.06","$65.01","362,431,371"
01/15/2021,"$19.41","$43.06","$19.01","$35.50","307,073,743"
01/08/2021,"$19.00","$19.45","$17.08","$17.69","33,651,411"

The following code reads the created CSV file and converts text lines to table rows. Each row is a native Python list.

Observe that the values in rows are stored as strings.

It is our responsibility to convert these strings into more appropriate types, if needed.

In [None]:
import csv

with open(csvread, 'r') as file:
    reader = csv.reader(file, delimiter=',')
    for row in reader:
        print(row)
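
Converting the strings could look as follows. This is only a sketch under stated assumptions: the column order is the one from the header of the file above, dates are taken to be in the month/day/year format, and the function name `parse_row` is introduced here for illustration.

```python
from datetime import datetime

def parse_row(row):
    """Convert one data row (a list of strings) into typed values."""
    date, open_, high, low, close, volume = row
    return {
        'Date': datetime.strptime(date, '%m/%d/%Y').date(),
        'Open': float(open_.lstrip('$')),      # drop the currency sign
        'High': float(high.lstrip('$')),
        'Low': float(low.lstrip('$')),
        'Close': float(close.lstrip('$')),
        'Volume': int(volume.replace(',', '')),  # drop thousands separators
    }

row = ['02/05/2021', '$316.56', '$322.00', '$51.09', '$63.77', '302,036,023']
print(parse_row(row))
```

The header line must be skipped before applying such a function, since its fields are not numbers.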

If we want to skip the header line, we can do it with an explicit empty iteration using the built-in function `next`:

In [None]:
import csv

with open(csvread) as file:
    reader = csv.reader(file, delimiter=',')
    next(reader)  # this is to skip the first line
    for row in reader:
        print(row)

Another option for CSV-files with the header is to read them as a dictionary:

In [None]:
import csv

with open(csvread) as file:
    reader = csv.DictReader(file, delimiter=',')
    for row in reader:
        print(row)

Each row is now loaded as a dictionary so that the field values can be accessed by names:

In [None]:
import csv

with open(csvread) as file:
    reader = csv.DictReader(file, delimiter=',')
    for row in reader:
        print(f"Date: {row['Date']}, Volume: {row['Volume']}")

To create a CSV file we just need to write its content row by row, where each row is a list of values.

The header is written as the first row.

In [None]:
csvwrite = 'cars.csv'

data = [
    ['Car','Horsepower','Weight','Origin'],
    ['AMC Ambassador DPL',190,3850,'US'],
    ['Buick Estate Wagon (sw)',225,3086,'US'],
    ['Toyota Corolla Mark ii',95,2372,'Japan'],
    ['Datsun PL510',88,2130,'Japan'],
    ['Volkswagen 1131 Deluxe Sedan',46,1835,'Europe'],
    ['Peugeot 504',87,2672,'Europe']
]    

In [None]:
import csv

with open(csvwrite, 'w', newline='') as file:  # newline='' leaves line endings to the csv module
    writer = csv.writer(file, delimiter=',')
    for d in data:  # iterate over sublists of data
        writer.writerow(d)  # and write each one as a table row

Let us read the obtained file. 

Notice that all numeric values are read as strings.

In [None]:
with open(csvwrite) as file:
    reader = csv.reader(file, delimiter=',')
    for row in reader:
        print(row)

Data can also be written to a CSV file as dictionaries.

We need to prepare the data as follows: a separate list of field names and the data as a list of dictionaries.

In [None]:
fieldnames = ['Car','Horsepower','Weight','Origin']
datadics = [
    {'Car': 'AMC Ambassador DPL', 'Horsepower': '190', 'Weight': '3850', 'Origin': 'US'},
    {'Car': 'Buick Estate Wagon (sw)', 'Horsepower': '225', 'Weight': '3086', 'Origin': 'US'},
    {'Car': 'Toyota Corolla Mark ii', 'Horsepower': '95', 'Weight': '2372', 'Origin': 'Japan'},
    {'Car': 'Datsun PL510', 'Horsepower': '88', 'Weight': '2130', 'Origin': 'Japan'},
    {'Car': 'Volkswagen 1131 Deluxe Sedan', 'Horsepower': '46', 'Weight': '1835', 'Origin': 'Europe'},
    {'Car': 'Peugeot 504', 'Horsepower': '87', 'Weight': '2672', 'Origin': 'Europe'}]

In [None]:
with open(csvwrite, 'w', newline='') as file:  # newline='' leaves line endings to the csv module
    writer = csv.DictWriter(file, fieldnames=fieldnames)
    writer.writeheader()
    for d in datadics:  # iterate over dictionaries of data
        writer.writerow(d)  # and write each one as a table row

Let us read the file again. The result is the same as above.

In [None]:
with open(csvwrite) as file:
    reader = csv.reader(file, delimiter=',')
    for row in reader:
        print(row)

Although using dictionaries is more complicated, the chance of error is lower since each value goes to the file together with its field name, and these names are checked by the writer.

Let us intentionally try to corrupt the data.

First try the plain writer.

In [None]:
csvwrite = 'cars.csv'

data = [
    ['Car','Horsepower','Weight','Origin'],
    ['AMC Ambassador DPL',190,3850,'US'],  # normal row
    ['Buick Estate Wagon (sw)',225,3086,'US', 'UK'],  # 'UK' is superfluous here
    ['Toyota Corolla Mark ii',95],  # two fields are absent
]    

In [None]:
import csv

with open(csvwrite, 'w', newline='') as file:  # newline='' leaves line endings to the csv module
    writer = csv.writer(file, delimiter=',')
    for d in data:  # iterate over sublists of data
        writer.writerow(d)  # and write each one as a table row

Everything has been written without complaint.

When we read the file, we see exactly what we have written.

Observe that the rows with extra or missing fields are written as they are. This means that the file is no longer a table in the strict sense, since all rows of a table must have the same length.

In [None]:
with open(csvwrite) as file:
    reader = csv.reader(file, delimiter=',')
    for row in reader:
        print(row)

Now try the dictionary writer.

In [None]:
fieldnames = ['Car','Horsepower','Weight','Origin']
datadics = [
    {'Car': 'AMC Ambassador DPL', 'Horsepower': '190', 'Weight': '3850', 'Origin': 'US'},  # normal row
    {'Car': 'Buick Estate Wagon (sw)', 'Horsepower': '225', 'Weight': '3086', 'Origin': 'US', 'Origin': 'China'},  # duplicated field
    {'Car': 'Toyota Corolla Mark ii', 'Horsepower': '95', 'Weight': '2372', 'Origin': 'Japan', 'Origin1': 'Europe'},  # superfluous field
    {'Car': 'Datsun PL510', 'Horsepower': '88'}]  # two fields are missing

If we try to write the above data, an error occurs: the writer notices the field name `Origin1`, which is not registered as a header name in the list `fieldnames`.

In [None]:
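with open(csvwrite, 'w', newline='') as file:  # newline='' leaves line endings to the csv module
    writer = csv.DictWriter(file, fieldnames=fieldnames)
    writer.writeheader()
    for d in datadics:
        writer.writerow(d)  # raises ValueError on the row with the unknown field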

We can either fix the error in our data or handle the error during writing.

We wrap the writing command in a `try`/`except` block and print the error messages. We also print each dictionary before sending it to the writer.

In [None]:
with open(csvwrite, 'w', newline='') as file:  # newline='' leaves line endings to the csv module
    writer = csv.DictWriter(file, fieldnames=fieldnames)
    writer.writeheader()
    for d in datadics:  # iterate over dictionaries of data
        try:
            print(d)
            writer.writerow(d)  # and write each one as a table row
        except ValueError as e:
            print(e)

Now everything works, and we see that only the superfluous field is treated by the writer as an error; the offending row is not written.

Our other intentional corruptions pass through.

First of all, notice that the item `'Origin': 'China'` has overwritten the first one, `'Origin': 'US'`, even before writing.

This is the standard behavior of dictionaries: a repeated key keeps only its last value.

Let us read the file:

In [None]:
with open(csvwrite) as file:
    reader = csv.reader(file, delimiter=',')
    for row in reader:
        print(row)

We see that the missing fields have been written to the file as empty strings, unlike the plain writer, which produced rows of different lengths.
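
Both behaviors of `csv.DictWriter` can be adjusted through its parameters: `restval` sets the value written for missing fields, and `extrasaction='ignore'` silently drops unknown fields instead of raising `ValueError`. The sketch below writes to an in-memory buffer; the shortened field list and the placeholder `'N/A'` are assumptions chosen for illustration.

```python
import csv
import io

fieldnames = ['Car', 'Horsepower', 'Origin']
rows = [
    # 'Origin1' is not a registered field name
    {'Car': 'Toyota Corolla Mark ii', 'Horsepower': '95', 'Origin': 'Japan', 'Origin1': 'Europe'},
    # two fields are missing
    {'Car': 'Datsun PL510'},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames,
                        restval='N/A', extrasaction='ignore')
writer.writeheader()
writer.writerows(rows)

lines = buffer.getvalue().splitlines()
for line in lines:
    print(line)
```

Whether silently ignoring extra fields is a good idea depends on the task: the default `extrasaction='raise'` is safer when the extra field may indicate a typo in a field name.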

In [None]:
remove_file(csvread)
remove_file(csvwrite)

### Exercises

1\. Create the program that writes to a file the multiplication table for pairs of integers from 2 to 9. The content of the file should be like<br>
`2 x 2 = 4`<br>
`2 x 3 = 6`<br>
`...`<br>
`9 x 8 = 72`<br>
`9 x 9 = 81`<br>

2\. Create a program that accepts a word as a command-line parameter and prints the number of consonants and vowels in this word.
 
3\. In this exercise you will create two programs. The first one writes to the standard output stream all integers from a range specified via its command-line parameters. The second program reads numbers from the standard input stream and sends a number on to the standard output stream if its decimal representation contains at least one digit `3`. Create a pipe for these two programs: the output of the first one is piped to the input of the second, and the output of the second one is redirected to a file.

4\. Create a program that writes to a CSV-file a table with two columns. The first one is the month name and the second one is the number of its days.

## Lesson 2

### HTML in a Nutshell

Webpages that we see browsing the Internet are created using HTML. 

This abbreviation stands for "HyperText Markup Language".

Technically, it is plain text.

But unlike ordinary text, the content of an HTML document is structured with tags.

Tags are keywords enclosed in angle brackets `<` and `>`.

Some tags mark the beginning and the end of a part of the document. Accordingly, there are opening and closing tags.

For example all text paragraphs are marked with tags `<p>` and `</p>`.

```html
<p>All ready, Miss Welse, though I'm sorry we can't spare one of the steamer's boats.</p>
```

Headers are marked with `<h1></h1>`, `<h2></h2>`, `<h3></h3>` and so on. The number indicates the level in the header hierarchy.

```html
<h1>Oliver Twist</h1>
<h2>Chapter I</h2>
```

The HTML page begins with the document type declaration `<!doctype html>`. 

Then comes the HTML content enclosed in `<html></html>`. The closing `</html>` marks the end of the whole document.

The HTML content includes the head part `<head></head>` and the body `<body></body>`.

The head part describes document properties: its title, character set, required additional resources and so on.

The body part contains the information shown to a user.

```html
<!doctype html>
<html lang="en-US">
<head>
    <title>Page title</title>
    <meta charset="utf-8">
</head>
<body>
  <h1>Main header</h1>
  <p>Text paragraph</p>
  <p>Text paragraph</p>
</body>
</html>
```    

Some more tags:

Boldface
```html
<b>This is very important</b>
```

Italic
```html
<i>Also pay attention to this point</i>
```

Hyperlink

```html
<a href="https://www.wikipedia.org/">Click here to visit WikipediA</a>
```
    
Image (notice that this tag does not have a closing one)
```html
<img src="https://images.com/animals/dog.jpg">
```

Block of tags, for example paragraphs with some common property of formatting

```html
<div>
    <p>Author: John Doe</p>
    <p>Year: 2021</p>
</div>
```

Each tag can have attributes. For example the tag `<img>` has attribute `src` and the tag `<a>` has attribute `href`.

Also, each tag can have the attributes `id` and `class`.

`id` identifies a particular tag, which is useful, for example, for creating interactive webpages.

`class` is used for applying formatting.

```html
<p id='author'>Jane Austen</p>
```
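
To see tags and attributes from the Python side, here is a small sketch that uses `HTMLParser` from the standard library to list every opening tag together with its attributes. It is only meant to illustrate the structure described above; the class name `TagLister` and the sample HTML fragment are assumptions made for this example.

```python
from html.parser import HTMLParser

class TagLister(HTMLParser):
    """Collect (tag name, attributes) for every opening tag."""
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.tags.append((tag, dict(attrs)))

html = """<div>
    <p id='author'>Jane Austen</p>
    <a href="https://www.wikipedia.org/">Click here</a>
    <img src="https://images.com/animals/dog.jpg">
</div>"""

lister = TagLister()
lister.feed(html)
for tag, attrs in lister.tags:
    print(tag, attrs)
```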

### Getting data from Webpages

The main problem with webpages is that they typically have a very complicated structure.

Moreover, web designers often do not follow strict logical rules when building the HTML structure, so automatic analysis of webpages can be very complicated.

Thus we start with a specially prepared simple webpage.

Let us first open it to see what it looks like.

[Charles Dickens](https://kupav.github.io/data-sc-intro/dickens.html)

To extract information from a webpage we first need to download it.

This is done with the module `requests`. We used it previously to get data files from the course repository. 

In [None]:
# This module downloads webpages
import requests

# This is the URL of a webpage
url = "https://kupav.github.io/data-sc-intro/dickens.html"

# Here we download the file
raw_data = requests.get(url)

# Check that downloading was successful
assert raw_data.status_code == 200

Now `raw_data.text` contains the whole content of the webpage. 

We need to parse it. 

Parsing means revealing the structure and mapping it to the hierarchy of Python objects.

The parsing is done using the analyzer `BeautifulSoup` whose corresponding Python module name is `bs4`. 

The analyzer requires an additional parsing engine that performs the actual recognition of tags in the downloaded content.

There are many such engines. We will use `html5lib`.

In [None]:
# This module converts raw data downloaded by requests into a readable structure
from bs4 import BeautifulSoup

# Now we parse the text and convert it to a readable structure. HTML parsing is done via the library 'html5lib'
soup = BeautifulSoup(raw_data.text, 'html5lib')

# Printing the whole object just shows the full document
print(soup)

Let us check what the `soup` can do.

If we need a particular tag, we can get it as follows:

In [None]:
print(soup.find('p'))

Or like this:

In [None]:
print(soup.p)

Both ways return a Python object representing the corresponding tag.

Properties of the found tag can be accessed via object attributes.

For example the content of the tag can be accessed via the attribute `text`:

In [None]:
print(soup.find('p').text)
print(soup.p.text)

But this way gives us only the first such tag in the document.

To get all of them, we have to use `find_all`.

Let us find second level headers:

In [None]:
all_h2 = soup.find_all('h2')
print(type(all_h2))
print(all_h2)

We observe that the result is a list-like container that stores all found tags.

Each element of the container is a tag object.

The `soup` object also allows calling `find_all` implicitly by using the object itself as a function:

In [None]:
print(soup('h2'))

Let us extract all text parts from the found tags:

In [None]:
print([x.text for x in soup('h2')])

Tags can be nested. 

And each tag object has its own method `.find_all` to find the nested tags. 

Let us see what words are highlighted with the tags `<i>` or `<b>` inside the main text and print their paragraphs.

We will call `.find_all` implicitly via functional form.

In [None]:
all_p = soup('p')
for tag in all_p:
    # Print the paragraph once if it contains at least one <i> or <b> tag
    if tag('i') or tag('b'):
        print(tag)

If we merely want all tags `<i>` or `<b>`, there is no need to iterate over the tags `<p>` that enclose them.

We can find all tags we need directly. Also notice that we again call `find_all` implicitly.

In [None]:
print(soup('i'))
print(soup('b'))

Tag attributes are extracted by treating a tag object like a dictionary.

Here we iterate over all paragraphs and print those that have the attribute `id`.

In [None]:
for tag in soup('p'):
    try:
        attrib = tag['id']
        print(attrib, tag)
    except KeyError:
        pass

In the above example we read the tag attribute; if it is absent, a `KeyError` is raised, and we just ignore it.

Instead of catching errors caused by absent attributes, we can check for the presence of an attribute using `has_attr`.

In [None]:
for tag in soup('p'):
    if tag.has_attr('id'):
        print(tag['id'], tag)

We can also find tags with specific attribute values by passing a dictionary as the second argument:

In [None]:
print(soup('p', {'id': 'author'}))

Some more examples.

List all images:

In [None]:
print(soup('img'))

List all hyperlinks:

In [None]:
print(soup('a'))

Finally, we can extract all the text of a webpage without any tags.

For this we use the `text` attribute of the `soup` object:

In [None]:
print(soup.text)

### Example: parsing Wikipedia

We are going to do the following:

- open the main page of Wikipedia
- find the link to the English pages
- in the main English page find links to Portals
- open the first one 
- print list of image descriptions

Load the main page

In [None]:
from bs4 import BeautifulSoup
import requests
url = "https://www.wikipedia.org/"
raw_data = requests.get(url)
assert raw_data.status_code == 200
soup = BeautifulSoup(raw_data.text, 'html5lib')

Visual inspection of the links. We need to understand how to find the link to the English page.

Notice that links have attribute `title`. 

In [None]:
print(soup('a'))

List all links and their titles

In [None]:
for tag in soup('a'):
    if tag.has_attr('title'):
        s = tag['title']
        print(s, '\t', tag['href'])

List all links again and find `English` in `title`. Save corresponding url.

In [None]:
for tag in soup('a'):
    if tag.has_attr('title'):
        s = tag['title']
        if s.find('English') >= 0:
            url1 = tag['href']
            break
print(url1)

Load the page we have found

In [None]:
from urllib.parse import urljoin
url_en = urljoin(url, url1)  # also handles protocol-relative links that start with '//'
print(url_en)
raw_data = requests.get(url_en)
assert raw_data.status_code == 200
soup = BeautifulSoup(raw_data.text, 'html5lib')

In [None]:
soup('a')

Look for links whose `title` starts with `Portal` and collect them in a list.

In [None]:
portals = []
for tag in soup('a'):
    if tag.has_attr('title'):
        s = tag['title']
        if s.find('Portal') == 0:
            portals.append((s, tag['href']))

portals

Open the first portal.

In [None]:
from urllib.parse import urljoin
url_portal = urljoin(url_en, portals[0][1])  # avoids a doubled '/' between host and path
print(url_portal)
raw_data = requests.get(url_portal)
assert raw_data.status_code == 200
soup = BeautifulSoup(raw_data.text, 'html5lib')
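
Links found in a page are often not complete URLs: they may omit the protocol (`//host/...`) or the host (`/path/...`). The standard function `urllib.parse.urljoin` resolves such links against the URL of the page where they were found. The `href` values below are hypothetical examples, not values read from the live pages.

```python
from urllib.parse import urljoin

main_page = "https://www.wikipedia.org/"
english_page = "https://en.wikipedia.org/"

# Protocol-relative link: the scheme is taken from the base URL
print(urljoin(main_page, "//en.wikipedia.org/"))

# Root-relative link: the scheme and host are taken from the base URL
print(urljoin(english_page, "/wiki/Portal:Arts"))

# A complete URL is returned unchanged
print(urljoin(english_page, "https://example.org/page"))
```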

Iterate over the tags `img` and print their descriptions. The attribute that holds the description is called `alt`.

In [None]:
for tag in soup('img'):
    if tag.has_attr('alt') and len(tag['alt']) > 0:
        print(tag['alt'])

### Exercises

5\. Write a program that loads the webpage https://kupav.github.io/data-sc-intro/dickens.html and finds there a link to a Wikipedia article about Charles Dickens.

6\. Find some news portal you like. Write a program that loads its main page and saves its full textual content to a file.

7\. Select the website of any university. Write a program that finds its contacts and saves them to a file.