# **Transmission Control Protocol (TCP)**

- Built on top of IP (Internet Protocol).
- Assumes IP might lose data, so it stores data and retransmits it if it seems to be lost.
- Handles "flow control" using a transmit window.
- Provides a nice, reliable "pipe".

# **TCP Connections/Sockets**

- An internet/network socket is an endpoint of bidirectional inter-process communication flow across an internet protocol-based computer network such as the internet.

PROCESS <-> INTERNET <-> PROCESS

Processes are like applications.
The sockets are the two ends of the line connecting processes/applications together.
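
To make the PROCESS <-> INTERNET <-> PROCESS picture concrete, here is a minimal sketch of two socket endpoints talking over TCP on the local machine. The helper name `echo_once`, the loopback address, and the uppercase reply are illustrative assumptions, not something from the notes above; a thread stands in for the second process.

```python
import socket
import threading

def echo_once(server_sock):
    """Accept one connection and send back whatever arrives, uppercased."""
    conn, _addr = server_sock.accept()
    with conn:
        data = conn.recv(1024)
        conn.sendall(data.upper())

# Server endpoint: bind to the loopback address; port 0 lets the OS pick a free port.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(('127.0.0.1', 0))
server.listen(1)
port = server.getsockname()[1]

# Run the server side in a thread so both endpoints live in one script.
worker = threading.Thread(target=echo_once, args=(server,))
worker.start()

# Client endpoint: connect, send, then read until the server closes the connection.
client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(('127.0.0.1', port))
client.sendall(b'hello')
chunks = []
while True:
    chunk = client.recv(1024)
    if not chunk:
        break
    chunks.append(chunk)
reply = b''.join(chunks)
client.close()

worker.join()
server.close()
print(reply.decode())  # HELLO
```

Both directions of the conversation travel over the same connection, which is what "bidirectional" means here.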

# **TCP Port Numbers**

- A port is an application-specific or process-specific software communications endpoint.
- Allows multiple networked applications to coexist on the same server.
- There is a list of well-known TCP port numbers.
- They are like extensions on a phone.
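
As a rough illustration of how ports let several applications share one host, the sketch below opens two listening sockets on the same address. The helper `listen_on_free_port` is a made-up name, and binding to port 0 (which asks the operating system for any free port) is an assumption chosen so the example does not need a privileged well-known port.

```python
import socket

def listen_on_free_port():
    """Return a TCP socket listening on a free port of the loopback address."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind(('127.0.0.1', 0))  # port 0 = let the OS choose
    sock.listen(1)
    return sock

# Two "applications" on the same host, told apart only by port number.
app_one = listen_on_free_port()
app_two = listen_on_free_port()
port_one = app_one.getsockname()[1]
port_two = app_two.getsockname()[1]
print('application one listens on port', port_one)
print('application two listens on port', port_two)

app_one.close()
app_two.close()
```

The host address is identical for both sockets; only the port, like a phone extension, decides which application receives a connection.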

# **Common TCP Ports**

- Telnet (23)           // Login
- SSH (22)              // Secure Login
- HTTP (80)             // web server
- HTTPS (443)           // Secure web server
- SMTP (25)             // Mail
- IMAP (143/220/993)    // Mail Retrieval
- POP (109/110)         // Mail Retrieval
- DNS (53)              // Domain Name
- FTP (21)              // File Transfer

# **Sockets in Python**

- Python has built-in support for TCP sockets.

In [None]:
import socket

# Create a socket.
# socket.AF_INET = IPv4 internet address family
# socket.SOCK_STREAM = TCP: a continuous stream of bytes (not separate datagrams)
mySocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Connect to the host on port 80 (HTTP), then send the request.
# sendall() keeps sending until the whole request has gone out.
mySocket.connect(('data.pr4e.org', 80))
cmd = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n'.encode()
mySocket.sendall(cmd)

# Read the response in chunks of up to 512 bytes until the server closes the connection.
while True:
    data = mySocket.recv(512)
    if len(data) < 1:
        break
    # end='' avoids adding an extra newline at every chunk boundary
    print(data.decode(), end='')
mySocket.close()

# **Application Protocol**

- An application protocol defines what each side sends and what it can expect to receive from the application at the other end of the socket.

# **HTTP - Hypertext Transfer Protocol**

- An application protocol.
- The dominant Application Layer Protocol on the Internet.
- Invented for the Web to retrieve HTML, images, documents, etc.
- Extended to carry data in addition to documents.
    - RSS, Web Services, etc.
- Basic Concept:
    - Make a connection
    - Request a document
    - Retrieve the document
    - Close connection

**HTTP**

The HyperText Transfer Protocol is the set of rules to allow browsers to retrieve web docs from servers over the internet.

# **What is a Protocol?**

- A set of rules that all parties follow so we can predict each other's behavior.
- And not bump into each other.

Example:

http://www.dr-chuck.com/page1.htm

Protocol    = http://
Host        = www.dr-chuck.com
document    = page1.htm
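
The same breakdown can be sketched with the standard library's `urllib.parse.urlparse`. Note that its attribute names differ slightly from the labels above: the protocol is called `scheme`, the host is `netloc`, and the document is `path` (with a leading slash).

```python
from urllib.parse import urlparse

parts = urlparse('http://www.dr-chuck.com/page1.htm')
print(parts.scheme)  # http
print(parts.netloc)  # www.dr-chuck.com
print(parts.path)    # /page1.htm
```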

**Getting Data From The Server**

- Whenever the user clicks an anchor tag with an "href=" value to switch to a new page, the browser makes a connection to the web server (port 80 by default for http://) and issues a "GET" request for the content of the page at the specified URL.
- The server returns the HTML document to the browser, which formats and displays it.
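
The request/response exchange described above can be sketched without touching the network. The two helpers `build_get_request` and `split_response` are hypothetical names, and the `sample` response is made up for illustration; a real server would send more headers.

```python
def build_get_request(host, path):
    """Return the bytes of a minimal HTTP/1.0 GET request."""
    return f'GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n'.encode()

def split_response(raw):
    """Split raw response bytes into (status line, headers dict, body bytes)."""
    # A blank line (\r\n\r\n) separates the headers from the body.
    head, _sep, body = raw.partition(b'\r\n\r\n')
    lines = head.decode('iso-8859-1').split('\r\n')
    status_line = lines[0]
    headers = {}
    for line in lines[1:]:
        name, _colon, value = line.partition(':')
        headers[name.strip().lower()] = value.strip()
    return status_line, headers, body

request = build_get_request('www.dr-chuck.com', '/page1.htm')
print(request)

# Made-up response, standing in for what a server might send back.
sample = (b'HTTP/1.1 200 OK\r\n'
          b'Content-Type: text/html\r\n'
          b'\r\n'
          b'<h1>The First Page</h1>')
status, headers, body = split_response(sample)
print(status)                   # HTTP/1.1 200 OK
print(headers['content-type'])  # text/html
print(body.decode())            # <h1>The First Page</h1>
```

The browser does the same split: it reads the status line and headers first, then hands the body to the HTML renderer.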

# **Internet Standards**

- The standards for all internet protocols (inner workings) are developed by an organization.
- Internet Engineering Task Force (IETF)
- www.ietf.org
- Standards are called "RFCs" - Request for Comments

# **ASCII**

American Standard Code for Information Interchange
- Holds 128 values for simple characters (upper/lower case letters, digits, symbols).
- Does not include accented or non-English characters.

# **Representing Simple Strings**

- Each character is represented by a number between 0 and 255 stored in 8 bits of memory.
- 8 bits = 1 byte
- ord() function returns the numeric value of a simple ASCII character

In [None]:
print(ord('H'))
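
As a small extension of the cell above, this sketch shows the number behind each character of a short string, its 8-bit pattern, and the reverse lookup with `chr()`. The string `'Hi!'` is just an arbitrary example.

```python
codes = []
for ch in 'Hi!':
    code = ord(ch)
    codes.append(code)
    # format(code, '08b') shows the value as 8 binary digits (one byte)
    print(ch, code, format(code, '08b'))

# chr() is the inverse of ord(): number back to character
restored = ''.join(chr(code) for code in codes)
print(restored)  # Hi!
```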

# **Multi-Byte Characters**

- ASCII could not cover enough characters, so Unicode was created.
- To represent a wide range of characters, computers must handle characters that take more than one byte.
    - UTF-16    // Variable length - 2 or 4 bytes
    - UTF-32    // Fixed length - 4 bytes
    - UTF-8     // Variable length - 1-4 bytes
        - Backward compatible with ASCII: ASCII characters take 1 byte with the same values.
        - UTF-8 is recommended practice for encoding data to be exchanged between systems.
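
A quick way to see these sizes is to encode a few characters and count the bytes. The four sample characters are arbitrary choices, and the `-le` codec variants are used on the assumption that we want the size of the character alone, without the byte-order mark that plain `'utf-16'` and `'utf-32'` prepend.

```python
# 'A', e-acute, euro sign, and an emoji, written as escapes to avoid ambiguity
samples = ['A', '\u00e9', '\u20ac', '\U0001F600']

utf8_sizes = [len(ch.encode('utf-8')) for ch in samples]
utf16_sizes = [len(ch.encode('utf-16-le')) for ch in samples]
utf32_sizes = [len(ch.encode('utf-32-le')) for ch in samples]

print(utf8_sizes)   # [1, 2, 3, 4]
print(utf16_sizes)  # [2, 2, 2, 4]
print(utf32_sizes)  # [4, 4, 4, 4]

# ASCII text produces identical bytes in ASCII and UTF-8
print('Hello'.encode('ascii') == 'Hello'.encode('utf-8'))  # True
```

The emoji lies outside the first 65,536 code points, so UTF-16 needs two 2-byte units (a surrogate pair) for it.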

# **Python3 and Unicode**

- All strings are Unicode internally.
- When sending data out from Python, we need to encode() the Unicode string into bytes (usually UTF-8); when data comes back, we need to decode() the bytes back into a Unicode string.
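
A minimal sketch of that round trip, assuming UTF-8 as the encoding on the wire (the variable names are illustrative):

```python
text = 'caf\u00e9'             # a str: 4 Unicode characters
wire = text.encode('utf-8')    # str -> bytes, ready to send over a socket
back = wire.decode('utf-8')    # bytes -> str, after receiving

print(type(text).__name__, len(text))  # str 4
print(type(wire).__name__, len(wire))  # bytes 5
print(back == text)                    # True
```

The byte count is larger than the character count because the accented letter takes two bytes in UTF-8.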

# **urllib**

**Using urllib**

Since HTTP is so common, Python has a library, urllib, that does all the socket work for us.

In [None]:
import urllib.request,urllib.parse,urllib.error

# Opens the URL like a file, then reads and prints it line by line.
# The loop yields only the body; the headers are not part of the lines read.
fileHandler = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')
for line in fileHandler:
    print(line.decode().strip())

In [None]:
import urllib.request,urllib.parse,urllib.error

# Opens the URL, reads each line, and counts word occurrences.

fileHandler = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')

wordCount = dict()

for line in fileHandler:
    words = line.decode().split()
    for word in words:
        wordCount[word] = wordCount.get(word,0)+1
print(wordCount)

# **What is web scraping?**

A program that looks at web pages, extracts information, and repeats.
- Search engines do this as web crawling/spidering.
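
The "extract" step might look like the sketch below, which pulls the `href` values out of anchor tags using the standard library's `html.parser`. The class name `LinkCollector` and the `page` string are made up for illustration; in a real scraper the HTML would come from a request such as `urllib.request.urlopen(...)`, and the collected links would feed the "repeat" step.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

# Made-up HTML standing in for a downloaded page.
page = ('<p><a href="http://www.dr-chuck.com/page2.htm">Second Page</a> '
        '<a href="page3.htm">Third Page</a></p>')

collector = LinkCollector()
collector.feed(page)
print(collector.links)
```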

**Why Scrape?**
- pull data
- make backups of data
- monitor site for changes
- making database for search engine
- etc