# Networking with Python
## Transport Control Protocol (TCP)
### Sockets
In computer networking, an Internet socket or network socket is an endpoint of a bidirectional inter-process communication flow across an Internet Protocol-based computer network, such as the Internet.
### Port Numbers
A port is an application-specific or process-specific software communications endpoint. It allows multiple networked applications to coexist on the same server.

There is a list of well-known TCP port numbers:

![ports](images/tcp_ports.png)
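
As a rough illustration of how ports let several applications share one host, the sketch below opens two listening sockets on the loopback address and lets the operating system pick a free port for each. The helper name open_listener and the small WELL_KNOWN_PORTS table are assumptions made for this example; the table lists only a handful of standard assignments.

```python
import socket

# A few well-known TCP ports (standard assignments, listed here for illustration).
WELL_KNOWN_PORTS = {"ftp": 21, "ssh": 22, "smtp": 25, "http": 80, "https": 443}

def open_listener():
    """Open a listening TCP socket on the loopback address (hypothetical helper).

    Port 0 asks the operating system to pick any free port.
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen()
    return listener

# Two independent "applications" on the same host.
app_a = open_listener()
app_b = open_listener()
port_a = app_a.getsockname()[1]
port_b = app_b.getsockname()[1]

print("application A listens on port", port_a)
print("application B listens on port", port_b)
print("HTTP servers conventionally listen on port", WELL_KNOWN_PORTS["http"])

app_a.close()
app_b.close()
```

Both listeners share the address 127.0.0.1, and only the port number tells them apart.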

### Sockets in Python
Python has built-in support for TCP sockets.

In [1]:
# library
import socket
# Socket object
mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# connect (host, port)
mysock.connect( ('data.pr4e.org', 80) )
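
To try the same calls without an Internet connection, a small local server can stand in for the remote host. This is only a sketch: the function serve_once and the greeting text are made up for the example, and the server runs in a background thread on the loopback address.

```python
import socket
import threading

def serve_once(listener):
    """Accept one connection, send a greeting, and close it (hypothetical helper)."""
    conn, _addr = listener.accept()
    with conn:
        conn.sendall(b"hello from the server\n")

# Server side: listen on the loopback address; port 0 lets the OS choose a free port.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen()
host, port = listener.getsockname()

worker = threading.Thread(target=serve_once, args=(listener,))
worker.start()

# Client side: create a socket and connect to (host, port), here the local server.
mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect((host, port))

chunks = []
while True:
    data = mysock.recv(512)
    if len(data) < 1:      # empty bytes: the server closed the connection
        break
    chunks.append(data)
mysock.close()

worker.join()
listener.close()

greeting = b"".join(chunks).decode()
print(greeting, end="")
```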

## Application Protocols

### HTTP - Hypertext Transfer Protocol
* Dominant application layer protocol on the Internet
* Invented for the web - to retrieve HTML, images, documents, etc.
* Extended to carry data in addition to documents - RSS, web services, etc.
* Basic concept: make a connection, request a document, retrieve the document, close the connection

HTTP is the set of rules that allows browsers to retrieve web documents from servers over the Internet.

### Protocol
A protocol is a set of rules that all parties follow so that we can predict each other's behavior and not bump into each other. For example: drive on the right-hand side of the road.

For example, the URL http://www.dr-chuck.com/page1.htm has three parts:

* protocol: http://
* host: www.dr-chuck.com
* document: /page1.htm
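
The same split can be reproduced with the standard library's urllib.parse.urlparse. This is a small sketch; the variable name parts is arbitrary.

```python
from urllib.parse import urlparse

parts = urlparse("http://www.dr-chuck.com/page1.htm")

print("protocol:", parts.scheme)
print("host:    ", parts.netloc)
print("document:", parts.path)
```

Note that urlparse reports the protocol as the scheme name alone, without the trailing "://".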

### Getting data from the server

Each time a user clicks to switch to a new page, the browser makes a connection to the web server and issues a "GET" request to get the content of the page at the specified URL.

The server returns the HTML document to the browser, which formats and displays the document to the user.

### Making an HTTP request
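
A minimal sketch of what a GET request looks like on the wire, assuming HTTP/1.0 and the host and document used elsewhere in these notes. build_get_request is a hypothetical helper, not part of any library, and the Host header is an addition made for illustration.

```python
def build_get_request(host, path):
    """Return the bytes of a minimal HTTP/1.0 GET request (hypothetical helper)."""
    lines = [
        f"GET {path} HTTP/1.0",
        f"Host: {host}",
        "",   # the empty line marks the end of the headers
        "",
    ]
    # HTTP ends every line with carriage return + line feed.
    return "\r\n".join(lines).encode()

request = build_get_request("data.pr4e.org", "/romeo.txt")
print(request)
```

The request is plain text: a request line, optional headers, and an empty line. It is encoded to bytes because that is what a socket sends.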

## Write a Web Browser

The two cells below differ only in how the request ends. The first ends the request with '\n\n' and the server answers 400 Bad Request. The second ends each line with '\r\n', as HTTP requires, and the server answers 200 OK.


In [9]:
mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('data.pr4e.org', 80))
# GET request
cmd = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\n\n'.encode()
mysock.send(cmd)

while True:
    data = mysock.recv(512)
    if (len(data) < 1):
        break
    print(data.decode())
mysock.close()

HTTP/1.1 400 Bad Request
Date: Sun, 12 Jun 2022 05:27:49 GMT
Server: Apache/2.4.18 (Ubuntu)
Content-Length: 308
Connection: close
Content-Type: text/html; charset=iso-8859-1

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>400 Bad Request</title>
</head><body>
<h1>Bad Request</h1>
<p>Your browser sent a request that this server could not understand.<br />
</p>
<hr>
<address>Apache/2.4.18 (Ubuntu) Server at do1.dr-chuck.com Port 80</address>
</body></html>



In [8]:
# This is a simple web browser

import socket

mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('data.pr4e.org', 80))
cmd = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n'.encode()
mysock.send(cmd)

while True:
    data = mysock.recv(512)
    if len(data) < 1:
        break
    print(data.decode(),end='')
mysock.close()

HTTP/1.1 200 OK
Date: Sun, 12 Jun 2022 05:27:35 GMT
Server: Apache/2.4.18 (Ubuntu)
Last-Modified: Sat, 13 May 2017 11:22:22 GMT
ETag: "a7-54f6609245537"
Accept-Ranges: bytes
Content-Length: 167
Cache-Control: max-age=0, no-cache, no-store, must-revalidate
Pragma: no-cache
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Connection: close
Content-Type: text/plain

But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief
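
The response above has two parts: the headers and, after the first empty line, the document itself. A sketch of how a program might separate them, using a shortened copy of that response as raw bytes (the variable names are arbitrary, and a real client would need to handle more cases):

```python
# A shortened copy of the response shown above, as raw bytes.
raw = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Length: 167\r\n"
    b"Content-Type: text/plain\r\n"
    b"\r\n"
    b"But soft what light through yonder window breaks\n"
)

# The first empty line (CRLF CRLF) separates the headers from the body.
head, _sep, body = raw.partition(b"\r\n\r\n")

status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
version, status_code, reason = status_line.split(" ", 2)

# Header names are case-insensitive, so store them in lower case.
headers = {}
for line in header_lines:
    name, _colon, value = line.partition(":")
    headers[name.strip().lower()] = value.strip()

print(status_code, reason)
print(headers["content-type"])
print(body.decode(), end="")
```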


## Text Processing
### About characters and strings
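The most common character encoding in the West is ASCII. It defines 128 characters using 7 bits each, and each character is usually stored in one 8-bit byte.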

* The ord() function tells us the numeric value of a simple ASCII character

In [10]:
print(ord('H'))
print(ord('e'))
print(ord('\n'))

72
101
10


But ASCII does not cover all characters. That is the reason for other standards such as Unicode. Its most widespread encoding is:
* UTF-8, which uses 1-4 bytes per character; it is the recommended practice for encoding data to be exchanged between systems
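
A quick way to see the variable length is to encode a few characters and count the bytes. The four sample characters are chosen for this sketch and are written as escapes so the code does not depend on the file's own encoding.

```python
# Sample characters: Latin A, e with acute accent, euro sign, grinning face emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

for ch in samples:
    encoded = ch.encode("utf-8")
    print(repr(ch), "code point", ord(ch), "->", len(encoded), "byte(s)")

utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]
```

Plain ASCII characters keep their one-byte form in UTF-8, which is one reason the encoding spread so widely.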

In Python 3 all strings are Unicode

* When we talk to an external resource like a network socket we send bytes, so we need to encode Python 3 strings into a given character encoding
* When we read data from an external resource, we must decode it based on the character set so it is properly represented in Python 3 as a string
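
The two rules can be tried locally with socket.socketpair(), which returns two sockets that are already connected to each other. This is a sketch under that assumption; the message text is made up for the example.

```python
import socket

# socketpair() returns two already-connected sockets, so no network is needed.
left, right = socket.socketpair()

message = "caf\u00e9 costs 3 \u20ac"       # a Python 3 string (Unicode)
left.sendall(message.encode("utf-8"))       # str -> bytes before sending
left.close()

chunks = []
while True:
    data = right.recv(512)
    if len(data) < 1:                       # empty bytes: the sender closed
        break
    chunks.append(data)
right.close()

raw = b"".join(chunks)
received = raw.decode("utf-8")              # bytes -> str after receiving
print(len(message), "characters,", len(raw), "bytes")
print(received)
```

The sketch joins all chunks before decoding because a multi-byte character could be split across two recv() calls, and decoding half a character raises an error.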

![socket](images/socket.png)

## Making HTTP Easier with urllib
### Using urllib in Python
urllib is a library that does all the socket work for us and makes web pages look like a file

In [11]:
import urllib.request, urllib.parse, urllib.error

In [12]:
fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')
for line in fhand:
    print(line.decode().strip())

But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief
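
Requests can fail, and urllib.error provides the exceptions for that. The sketch below is one possible way to handle them, not the only one. To stay runnable offline it starts a throwaway local server; DemoHandler and fetch_lines are hypothetical names, and the empty ProxyHandler is there only so the loopback request ignores any system proxy.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical local server: serves one text file, 404 for anything else."""

    def do_GET(self):
        if self.path == "/romeo.txt":
            body = b"But soft what light through yonder window breaks\n"
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet

# An opener with an empty ProxyHandler ignores any system proxy settings.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))

def fetch_lines(url):
    """Return the decoded lines at url, or None if the request fails."""
    try:
        with opener.open(url, timeout=5) as fhand:
            return [line.decode().strip() for line in fhand]
    except urllib.error.HTTPError as err:    # the server answered with an error status
        print("server answered with status", err.code)
    except urllib.error.URLError as err:     # the server could not be reached
        print("could not reach the server:", err.reason)
    return None

server = HTTPServer(("127.0.0.1", 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_port}"

good = fetch_lines(base + "/romeo.txt")
missing = fetch_lines(base + "/nope.txt")
print(good)
print(missing)

server.shutdown()
server.server_close()
```

HTTPError is caught before URLError because it is a subclass of URLError.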


In [14]:
fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')
counts = dict()
for line in fhand:
    words = line.decode().split()
    for word in words:
        counts[word] = counts.get(word, 0) + 1
print(counts)

{'But': 1, 'soft': 1, 'what': 1, 'light': 1, 'through': 1, 'yonder': 1, 'window': 1, 'breaks': 1, 'It': 1, 'is': 3, 'the': 3, 'east': 1, 'and': 3, 'Juliet': 1, 'sun': 2, 'Arise': 1, 'fair': 1, 'kill': 1, 'envious': 1, 'moon': 1, 'Who': 1, 'already': 1, 'sick': 1, 'pale': 1, 'with': 1, 'grief': 1}
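
The same tally could also be written with collections.Counter from the standard library. In this sketch the four lines of the file are given as an in-memory list of bytes, standing in for what urlopen() returns, so it runs without a network connection.

```python
from collections import Counter

# Lines as they would arrive from urlopen(): bytes, one per line.
lines = [
    b"But soft what light through yonder window breaks\n",
    b"It is the east and Juliet is the sun\n",
    b"Arise fair sun and kill the envious moon\n",
    b"Who is already sick and pale with grief\n",
]

counts = Counter()
for line in lines:
    # Decode bytes to str, split into words, and add them to the tally.
    counts.update(line.decode().split())

# Print the three most frequent words.
for word, count in counts.most_common(3):
    print(word, count)
```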


### Reading Web Pages

In [15]:
fhand = urllib.request.urlopen('http://www.dr-chuck.com/page1.htm')
for line in fhand:
    print(line.decode().strip())

<h1>The First Page</h1>
<p>
If you like, you can switch to the
<a href="http://www.dr-chuck.com/page2.htm">
Second Page</a>.
</p>


## Parsing HTML - Web Scraping
* Web scraping is when a program or script pretends to be a browser: it retrieves web pages, looks at them, extracts information, and then looks at more web pages
* Search engines scrape web pages - we call this "spidering the web" or "web crawling"

### Why Scrape?
* Pull data - particularly social data - who links to whom?
* Get your own data back out of some system that has no "export capability"
* Monitor a site for new information
* Spider the web to make a database for a search engine

### Scraping web pages
* There is some controversy about web page scraping, and some sites are a bit snippy about it
* Republishing copyrighted information is not allowed
* Violating terms of service is not allowed

### The Easy Way - Beautiful Soup
* You could do string searches the hard way or use the library Beautiful Soup

In [16]:
from bs4 import BeautifulSoup

In [17]:
# Example input: http://www.dr-chuck.com/page1.htm
url = input('Enter -')
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')

# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    print(tag.get('href', None))

http://www.dr-chuck.com/page2.htm
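
For comparison, the same links can be pulled out with only the standard library. The sketch below uses html.parser.HTMLParser instead of Beautiful Soup; LinkCollector is a hypothetical class written for this example, and the HTML is the page shown in the "Reading Web Pages" section, given as a string so no network is needed.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href attribute of every anchor tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened.
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

html = """<h1>The First Page</h1>
<p>
If you like, you can switch to the
<a href="http://www.dr-chuck.com/page2.htm">
Second Page</a>.
</p>"""

collector = LinkCollector()
collector.feed(html)
print(collector.links)
```

Beautiful Soup needs less code for the same task and is more forgiving of malformed HTML, which is why the notes call it the easy way.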


## Summary
* TCP/IP gives us pipes / sockets between applications
* We designed application protocols to make use of these pipes
* HTTP is a simple yet powerful protocol
* Python has good support for sockets, HTTP and HTML parsing