# Part 1: TCP/IP in Python

## Connecting to a socket

As always, there is a Python package for what we need:

In [1]:
import socket

First, we create an endpoint (a socket) on our computer that is ready to connect to another socket (e.g., that of a network server).

In [None]:
my_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

Next, we make the actual connection (using the `connect` method from the socket object). This is like dialling the phone, but we are not yet making conversation.

In [None]:
my_socket.connect( ('docs.python.org', 80) )

Note that `connect` takes a single tuple: its first element is the "Host", its second element is the "Port".

Note: when we are done, we should always close our socket.

In [None]:
my_socket.close()
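Since it is easy to forget the `close()` call, one option is to let a `with` block take care of it. The following is a minimal sketch (no connection is made, and the two flag names are only there for illustration) showing that the socket is closed automatically when the block ends:

```python
import socket

# A socket object is a context manager: leaving the "with" block closes it.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as my_socket:
    open_inside = my_socket.fileno() != -1   # a valid file descriptor while open

closed_after = my_socket.fileno() == -1      # fileno() returns -1 once closed
print(open_inside, closed_after)
```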

## Making a GET request

If we were to deal with plain HTTP, we could simply run the code below (try it out!). Note that the argument 512 means that at most 512 bytes are received per call.

In [None]:
my_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
my_socket.connect( ('docs.python.org', 80) )
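# An HTTP/1.1 request needs a Host header and must end with a blank line (\r\n\r\n)
cmd = 'GET /3/installing/index.html HTTP/1.1\r\nHost: docs.python.org\r\nConnection: close\r\n\r\n'.encode()
my_socket.sendall(cmd)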
print(my_socket.recv(512).decode())
my_socket.close()
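A single `recv(512)` call returns at most 512 bytes, so longer responses need a loop. The sketch below reads until the other side closes the connection. The helper name `recv_all` is hypothetical, and a local `socket.socketpair()` stands in for a real server so that the example runs offline:

```python
import socket

def recv_all(sock, chunk_size=512):
    """Read from sock until the peer closes the connection."""
    chunks = []
    while True:
        chunk = sock.recv(chunk_size)   # returns at most chunk_size bytes
        if not chunk:                   # empty bytes object: peer closed
            break
        chunks.append(chunk)
    return b''.join(chunks)

# Two connected sockets on this machine, used here instead of a web server
sender, receiver = socket.socketpair()
sender.sendall(b'x' * 2000)
sender.close()

data = recv_all(receiver)
receiver.close()
print(len(data))
```

With a real HTTP server, such a loop only terminates once the server closes the connection, for example after a request that carries a `Connection: close` header.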

However, since we want to use HTTPS, we need to do a bit more. But, as always, we don't want to reinvent the wheel, and in Python, a lot of wheels have already been invented. So, instead of dealing with all the details of an HTTPS request, we make use of the very practical `requests` package (which works just as well for an HTTP request). The beauty of `requests` is that we don't even have to worry about sockets in the first place!

In [None]:
import requests

url = "https://docs.python.org/3/installing/index.html"
resp = requests.get(url)

print(resp.status_code)
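Status codes are three-digit numbers, each with a standard reason phrase. As a small offline sketch, the standard library's `http.HTTPStatus` (not part of `requests`) can be used to look up the phrase for a few common codes; the selection of codes is just an illustration:

```python
from http import HTTPStatus

# Look up the standard reason phrase for a few common status codes
for code in (200, 301, 404, 500):
    status = HTTPStatus(code)
    print(code, status.phrase)

phrases = {code: HTTPStatus(code).phrase for code in (200, 301, 404, 500)}
```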

The status code 200 indicates that everything is fine. We can, of course, also look at the header of the response, just as in the inspection module:

In [None]:
print(resp.headers)

Let's take a look at the actual response in raw form:

In [None]:
print(resp.content)

This is a raw `bytes` object, which is hard to read. Luckily, `requests` knows how to decode it into a string:

In [None]:
print(resp.text)
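The decoding step behind `resp.text` can be sketched offline with a hand-made byte string. The sample text is made up for illustration, and UTF-8 is assumed as the encoding:

```python
raw = 'Zürich café'.encode('utf-8')   # bytes, as they would travel over the network
print(raw)                            # non-ASCII characters show up as escape sequences

text = raw.decode('utf-8')            # str, readable again
print(text)
```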

# Part 2: Retrieving and parsing web pages

## Regular expressions

Let's start with a very simple example. Say we have a bunch of strings (e.g., from analyzing an email), and we want to find the one corresponding to the message sender. We can simply match on the keyword "From" (note that you could do much the same with the string method `startswith()`).

In [None]:
import re

email = ['From: Philippe','To: Simone','Subject: MSc in BA','Content: Great students this year!']
for line in email:
    if re.search("From",line):
        print(line)
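One caveat: `re.search("From", line)` finds "From" anywhere in the line, whereas `startswith()` only checks the beginning. A small sketch with an extra made-up line (not part of the email above) shows the difference, and how `^` makes the regular expression behave like `startswith()`:

```python
import re

lines = ['From: Philippe', 'Content: Greetings From London']

found_anywhere = [l for l in lines if re.search('From', l)]   # both lines
found_at_start = [l for l in lines if l.startswith('From')]   # first line only
anchored = [l for l in lines if re.search('^From', l)]        # first line only

print(found_anywhere)
print(found_at_start)
print(anchored)
```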

Sometimes this is not good enough, however, and we need more flexibility in our search. This is where regular expressions kick in, for example when spellings differ:

In [None]:
email = ['FRAM: Philippe','To: Simone','Subject: MSc in BA','Content: Great students in 2017!']
for line in email:
    if re.search('F.+:',line):
        print(line)

Another example is searching for a number. If we know its exact value, a literal match is enough:

In [None]:
email = ['From: Philippe','To: Simone','Subject: MSc in BA','Content: Great students in 2021!']
for line in email:
    if re.search('2021',line):
        print(line)

But what if we don't know the exact year, we only know that we want to read out the line if a year is mentioned?

In [None]:
email = ['From: Philippe','To: Simone','Subject: MSc in BA','Content: Great students in 2017!']
for line in email:
    if re.search('[0-9]+',line):
        print(line)
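Strictly speaking, `[0-9]+` matches any run of digits, not only years. If we assume that a year is written as exactly four digits, a tighter pattern is possible. A sketch under that assumption (the sample lines are made up):

```python
import re

lines = ['Room 7 is booked', 'Great students in 2017!', 'Call 0736192400']

# \b marks a word boundary, {4} asks for exactly four digits
year_pattern = r'\b[0-9]{4}\b'
with_year = [line for line in lines if re.search(year_pattern, line)]
print(with_year)
```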

`re.search()` tells us whether a string matches the given expression. `re.findall()` returns all occurrences that match our search pattern:

In [None]:
x = "My 2 favorite numbers are 19 and 42"
y = re.findall('[0-9]+',x)
print(y)

Note: this is a list of strings still!
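Converting them to numbers takes one extra step. A minimal sketch:

```python
import re

x = "My 2 favorite numbers are 19 and 42"
numbers = [int(s) for s in re.findall('[0-9]+', x)]  # convert each match to int
print(numbers)
print(sum(numbers))
```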

In [None]:
y = re.findall('[AEIOU]+',x)
print(y)

This returned an empty list, as there are no upper-case vowels. But there is one upper-case letter:

In [None]:
y = re.findall('[A-Z]+',x)
print(y)

Let's now take a look at what part of the text we are returning:

In [None]:
y = re.findall('[0-9]+ and',x)
print(y)

We might not want to return the "and" itself, just use it as a marker. Parentheses tell `findall()` which part of the match to return:

In [None]:
y = re.findall('([0-9]+) and',x)
print(y)

Note that "+" asks for at least one matching character. In contrast, "\*" asks for zero or more, so it can also match the empty string.

In [None]:
y = re.findall('[0-9]*',x)
print(y)

An important thing to note about "+" and "\*" is that they match in a "greedy" manner: they consume as many characters as they can, so we get the longest possible match. What do you expect in the next example?

In [None]:
x = 'From: Using the : character'
y = re.findall('^F.+:',x)
print(y)

In the above, *^* anchors the match at the start of the string, so the first character must be an F. The last character needs to be *:*. There are two ways to satisfy that, and due to greedy matching we get the longest possible string! We can "fix" this behavior by appending *?* to the quantifier, which makes it match as few characters as possible:

In [None]:
x = 'From: Using the : character'
y = re.findall('^F.+?:',x)
print(y)

## Regular expressions to parse a webpage

A first attempt to find the links within all a-tags:

In [None]:
import requests
import re

url = "https://docs.python.org/3/installing/index.html"
resp = requests.get(url)

x = resp.text
y = re.findall('<a .+ href="(.+)".*>',x)
for s in y:
    print(s)

Here we fell for the trap of the greedy matching process. Let's try to avoid this. Also, we don't want to collect links that only refer to sections within the same page (but we are fine with those that refer to sections of other pages).

In [None]:
y = re.findall('<a .+ href="(.+?)".*>',x)
for s in y:
    if not re.search("^#",s):
        print(s)
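To see the effect of greedy matching without fetching a page, here is a sketch on a small hand-written HTML snippet (the snippet and its links are made up for illustration). Note that the second pattern makes every quantifier non-greedy, which is a slightly stricter variant of the pattern above; that extra care matters here because both links sit on one line, and `.` does not match line breaks by default:

```python
import re

html = '<p><a class="x" href="page1.html">One</a> and <a class="y" href="#top">Top</a></p>'

# Greedy: ".+" runs as far as it can, so only the last href is captured
greedy = re.findall('<a .+ href="(.+)".*>', html)

# Non-greedy: each quantifier stops as early as possible
lazy = re.findall('<a .+? href="(.+?)".*?>', html)

# Drop links that only point to a section of the same page
external = [s for s in lazy if not re.search('^#', s)]

print(greedy)
print(lazy)
print(external)
```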

## BeautifulSoup

You know the drill!

In [None]:
from bs4 import BeautifulSoup

BeautifulSoup automatically parses the input. We need to specify that we want the input parsed as HTML, as other parsers are available, too.

In [None]:
soup = BeautifulSoup(x, 'html.parser')
print(soup)

Now, let's repeat the exercise from before, just that we are using BeautifulSoup:

In [None]:
tags = soup('a')
for tag in tags:
    print(tag.get('href'))

Nevertheless, regular expressions still have their value, as we can see when eliminating links to sections on the same page:

In [None]:
tags = soup('a')
for tag in tags:
    ref = tag.get('href')
    if ref and not re.search("^#",ref):  # also skip a-tags without an href (ref is None)
        print(ref)
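Many of the collected links are relative to the page they came from. If absolute URLs are needed, one possible approach is `urljoin` from the standard library's `urllib.parse`. The sketch below uses hypothetical `href` values rather than the ones actually found on the page:

```python
from urllib.parse import urljoin

base = "https://docs.python.org/3/installing/index.html"
# Hypothetical href values, similar in form to what tag.get('href') returns
hrefs = ["../library/index.html", "#top", "https://www.python.org/"]

# Resolve each link against the URL of the page it was found on
absolute = [urljoin(base, h) for h in hrefs]
for link in absolute:
    print(link)
```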

# Part 3: Data representation on the web

## XML

There is, to no one's surprise, an `xml` package. We only need one part of it, though:

In [None]:
import xml.etree.ElementTree as ET

Let's take the example from the slides. We will use the `find()` method of an `Element` object (which is what `ET.fromstring()` returns) to get different tags.

As a side note: `'''` allows multi-line strings in Python, which keep their line breaks.

In [None]:
data = '''<person>
            <name>Philippe</name>
            <phone type="intl">
              +44 736 1924
            </phone>
            <email hide="yes"/>
          </person>'''

tree = ET.fromstring(data)
print('Name:', tree.find('name').text)
print('Attr:', tree.find('email').get('hide'))
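The phone number can be read in the same way. Because the text of the `phone` tag spans several lines, it comes with surrounding whitespace. A minimal sketch that strips it (the variable names are chosen for illustration):

```python
import xml.etree.ElementTree as ET

data = '''<person>
            <name>Philippe</name>
            <phone type="intl">
              +44 736 1924
            </phone>
            <email hide="yes"/>
          </person>'''

tree = ET.fromstring(data)
phone = tree.find('phone')
number = phone.text.strip()      # remove the surrounding line breaks and spaces
phone_type = phone.get('type')   # read the attribute of the phone tag
print('Phone:', number, '(' + phone_type + ')')
```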

As we move down the tree, the expressions tend to get longer:

In [None]:
data = '''<teachers>
            <teacher class="DV">
              <id>002</id>
              <name>Simone</name>
            </teacher>
            <teacher class="DTVC">
              <id>005</id>
              <name>Philippe</name>
            </teacher>
          </teachers>'''

tree = ET.fromstring(data)
teacher_list = tree.findall('teacher')
print('Teacher count:', len(teacher_list))

In [None]:
for teacher in teacher_list:
    print('Name:', teacher.find('name').text)
    print('ID:', teacher.find('id').text)
    print('Attr:', teacher.get('hide'))

Note here that `get('hide')` gave back `None`, as there is no such attribute.
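If `None` is not a convenient result, `get()` also accepts a default value as its second argument. The sketch below combines this with the `class` attribute, which does exist, and with `findtext()` as a shortcut for `find(...).text`. The fallback string `'not set'` is an arbitrary choice:

```python
import xml.etree.ElementTree as ET

data = '''<teachers>
            <teacher class="DV"><id>002</id><name>Simone</name></teacher>
            <teacher class="DTVC"><id>005</id><name>Philippe</name></teacher>
          </teachers>'''

tree = ET.fromstring(data)
rows = []
for teacher in tree.findall('teacher'):
    rows.append((teacher.findtext('name'),          # text of the name tag
                 teacher.get('class'),              # existing attribute
                 teacher.get('hide', 'not set')))   # fallback for a missing attribute
print(rows)
```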

## JSON

By now, you should really expect there to be a `json` package:

In [None]:
import json

In [None]:
data = '''
{
  "name" : "Philippe",
  "phone" : {
    "type" : "intl",
    "number" : "+44 736 1924"
   },
   "email" : {
     "hide" : "yes"
   }
}'''

info = json.loads(data)

Note that it looks like a Python dictionary, including the curly brackets!
In fact, what we get back really is a Python dictionary (of strings, other dictionaries, and lists).

In [None]:
print('Name:', info["name"])
print('Hide:', info["email"]["hide"])
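Because the result is an ordinary dictionary, the usual dictionary tools apply. A short sketch on a reduced version of the data (the `"fax"` key is a made-up example of a missing entry): `get()` supplies a default for a missing key, and `json.dumps()` converts the dictionary back to JSON text.

```python
import json

info = json.loads('{"name": "Philippe", "email": {"hide": "yes"}}')

print(type(info).__name__)             # the parsed result is a plain dict
fax = info.get('fax', 'n/a')           # key absent: fall back to a default
as_text = json.dumps(info, indent=2)   # back from dictionary to JSON text
print(fax)
print(as_text)
```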

In the case of a bigger JSON file, we also see the list aspect:

In [None]:
teachers = '''
[
  { "class" : "DV",
    "person" : {
        "id" : "002",
        "name" : "Simone"
    }
  } ,
  { "class" : "DTVC",
    "person" : {
        "id" : "005",
        "name" : "Philippe"
    }
  }
]'''

teacher_list = json.loads(teachers)
print('Teacher count:', len(teacher_list))

In [None]:
for teacher in teacher_list:
    print('Name:', teacher['person']['name'])
    print('ID:', teacher['person']['id'])

Note that we get the list of teachers directly; we don't have to generate it with `findall()` as in the case of XML.
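Since the result is a plain list of dictionaries, list and dictionary comprehensions work on it right away. A minimal sketch (the variable names are chosen for illustration; note that the ids are strings in the JSON and need `int()` to become numbers):

```python
import json

teachers = '''[
  {"class": "DV",   "person": {"id": "002", "name": "Simone"}},
  {"class": "DTVC", "person": {"id": "005", "name": "Philippe"}}
]'''

teacher_list = json.loads(teachers)
names = [t['person']['name'] for t in teacher_list]                 # list of names
by_class = {t['class']: t['person']['name'] for t in teacher_list}  # class -> name
ids = [int(t['person']['id']) for t in teacher_list]                # "002" -> 2

print(names)
print(by_class)
print(ids)
```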