## Chapter 1

### Why should you learn to write programs?

Programming is a creative and rewarding skill. People learn it for many reasons—earning a living, solving complex problems, helping others, or simply for fun. This book assumes that everyone should know how to program, and once you do, you’ll find your own meaningful use for it.

Today, we’re surrounded by computers—from laptops to smartphones—that act as personal assistants, ready to help. The hardware is designed to constantly ask, “What would you like me to do next?”

By adding an operating system and software, programmers turn this hardware into a powerful Personal Digital Assistant capable of handling a wide range of tasks. These machines are fast and have vast memory, and with the right programming language, we can instruct them to do repetitive tasks on our behalf. Interestingly, computers are best at the kinds of tasks that humans usually find boring or monotonous.

Let’s say you want to find the most frequently used word in the first three paragraphs of this chapter and how often it appears. While reading and understanding text is easy for humans, counting words manually is tedious and not something our brains are naturally wired for. For computers, it’s the opposite—understanding language is difficult, but counting words is simple.

For example, if we run the program:
- python words.py
- Enter file: words.txt

The program might respond:
- to 16

This shows that our “personal information analysis assistant”—the computer—quickly determined that the word “to” appeared 16 times in those paragraphs.
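
The words.py program itself is not reproduced in these notes. As a rough sketch of the idea only, assuming the text is already held in a string and that words are separated by whitespace, the counting could be done with `collections.Counter`. The function name `most_common_word` and the sample sentence are made up for this illustration.

```python
from collections import Counter

def most_common_word(text):
    """Return the most frequent word in text and how often it appears."""
    words = text.lower().split()          # split on whitespace
    counts = Counter(words)               # maps each word to its count
    word, count = counts.most_common(1)[0]
    return word, count

sample = "the cat and the dog and the bird"
print(most_common_word(sample))           # prints ('the', 3)
```

A real version would also need to read the file and strip punctuation, which this sketch leaves out.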

This contrast highlights why learning to “speak computer” is so valuable. Computers excel at repetitive, structured tasks that humans find dull. Once you learn the language of programming, you can delegate such tasks to your computer partner, freeing yourself to focus on creative and intuitive work—the things humans do best.

#### Values and Types

A value is a basic thing that a program uses, like a number or some text.
Examples of values are: 1, 2, and "Hello, World!".
Each value has a type. For example, 2 is an integer (a whole number), and "Hello, World!" is a string (which means a group of letters or characters).
You can tell something is a string because it’s written inside quotation marks.
The print function can display both numbers and strings.
To start using Python, we type the python command to open the interpreter.

In [1]:
print(4)

4


In [29]:
type(4)

int

In [30]:
type("Hello World")

str

In [31]:
type(1.3)

float

In [32]:
type("17")

str

In [33]:
type("3.2")

str

In [34]:
# Well, that’s not what we expected at all! Python interprets 1,000,000 as a comma-separated
# sequence of integers, which it prints with spaces between.
# This is the first example we have seen of a semantic error: the code runs without
# producing an error message, but it doesn’t do the “right” thing.

# semantic error
print(1,000,000)

1 0 0
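
If the intent was the number one million, Python 3.6 and later accept underscores as digit separators inside a numeric literal. A minimal sketch of the difference (the variable name `million` is just for illustration):

```python
million = 1_000_000      # underscores are only for readability
print(million)           # prints 1000000
print(1, 000, 000)       # three separate arguments: prints 1 0 0
```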


#### Variables

In [35]:
# A variable is a name that refers to a value
# An assignment statement creates new variables and gives them values: for example

message = 'And now for something completely different'
n = 18
pi = 3.1415

# This example makes three assignments. The first assigns a string to a new variable
# named message; the second assigns the integer 18 to n; the third assigns the
# (approximate) value of π to pi.

In [36]:
# To display the value of a variable, you can use a print statement:

print(message)
type(message)

And now for something completely different


str

In [37]:
print(n)

# The type of a variable is the type of the value it refers to.
type(n)

18


int

In [38]:
print(pi)
type(pi)

3.1415


float

#### Variable names and keywords

In [39]:
# If you give a variable an illegal name, you get a syntax error:
76trombones = 'big parade'


# 76trombones is illegal because it begins with a number

SyntaxError: invalid decimal literal (2199210916.py, line 2)

In [None]:
more@ = 10000

# more@ is illegal because it contains an illegal character, @.

In [None]:
class = "advanced Theory"

# It turns out that class is one of Python’s keywords. The interpreter uses keywords to recognize the structure of the program, and they cannot be used as variable
# names.

#### Python reserves 35 keywords:

In [None]:
# False await else import pass
# None break except in raise
# True class finally is return
# and continue for lambda try
# as def from nonlocal while
# assert del global not with
# async elif if or yield
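
The naming rules above can also be checked from code. The following is a small sketch, not part of the original notes, that combines the string method `isidentifier` with the standard `keyword` module; the helper name `is_legal_name` is hypothetical.

```python
import keyword

def is_legal_name(name):
    """True if name has identifier syntax and is not a reserved keyword."""
    return name.isidentifier() and not keyword.iskeyword(name)

for name in ["message", "76trombones", "more@", "class"]:
    print(name, is_legal_name(name))
```

Only `message` passes: the next two break the identifier syntax, and `class` is a keyword.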

#### Statements

In [None]:
# A statement is a unit of code that the Python interpreter can execute.
# We have seen two kinds so far: print (an expression statement) and assignment.

# A script usually contains a sequence of statements. If there is more than one statement,
# the results appear one at a time as the statements execute. For example:

print(1)
a = 3
print(9)

# the assignment statement produces no output

#### Operators and operands

In [None]:
# Operators are special symbols that represent computations like addition and multiplication.
# The values the operator is applied to are called operands.

hour = 3
minute = 24

a = 20 + 32 
b = hour - 1
c = hour * 60 + minute
# The division operator changed between Python 2 and Python 3. In Python 3, / returns a
# floating-point result (here 0.4); use floor division (//) to get the Python 2 integer result (0).
d = minute / 60
e = 5 ** 2
f = (5 + 9) * (15 - 7)

print(a, b, c, d, e, f)
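
To make the two division operators concrete, here is a short sketch using the same `minute` value as above; the names `true_quotient` and `floor_quotient` are chosen for this illustration.

```python
minute = 24
true_quotient = minute / 60      # true division: always a float in Python 3
floor_quotient = minute // 60    # floor division: rounds down to an integer
print(true_quotient, floor_quotient)   # prints 0.4 0
```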

#### Expressions

In [None]:
# An expression is a combination of values, variables, and operators.
# If you type an expression in interactive mode, the interpreter evaluates it and displays the result:
# But in a script, an expression all by itself doesn’t do anything! This is a common source of confusion for beginners.


17
x = 6
x + 17

#### Order of Operations

In [None]:
# When more than one operator appears in an expression, the order of evaluation depends on the rules of precedence.

# The acronym PEMDAS is a useful way to remember the rules
# Parentheses have the highest precedence
# Exponentiation has the next highest precedence
# Multiplication and Division have the same precedence, which is higher than Addition and Subtraction, which also have the same precedence.
# Operators with the same precedence are evaluated from left to right. So the expression 5-3-1 is 1, not 3, because the 5-3 happens first and then 1 is subtracted from 2.


# **** When in doubt, always put parentheses in your expressions to make sure the computations are performed in the order you intend.
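
A few quick checks of these rules, written as a sketch with made-up variable names. One detail not covered above: exponentiation is the exception to the left-to-right rule, since `**` groups from right to left.

```python
no_parens = 1 + 2 * 3        # multiplication first: 7
with_parens = (1 + 2) * 3    # parentheses first: 9
left_to_right = 5 - 3 - 1    # (5 - 3) - 1 = 1
right_to_left = 2 ** 3 ** 2  # 2 ** (3 ** 2) = 512
print(no_parens, with_parens, left_to_right, right_to_left)   # prints 7 9 1 512
```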

#### Modulus Operator

In [None]:
# The modulus operator works on integers and yields the remainder when the first operand is divided by the second.
# In Python, the modulus operator is the percent sign (%).

quotient = 7 // 3
print(quotient)

remainder = 7 % 3
print(remainder)

In [None]:
# So 7 divided by 3 is 2 with 1 left over.
# The modulus operator turns out to be surprisingly useful. For example, you can
# check whether one number is divisible by another: if x % y is zero, then x is
# divisible by y.
# You can also extract the right-most digit or digits from a number. For example,
# x % 10 yields the right-most digit of x (in base 10). Similarly, x % 100 yields the
# last two digits.
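
The uses described above can be tried directly. This is a minimal sketch with an arbitrary example value of `x`:

```python
x = 1234
is_even = (x % 2 == 0)    # divisibility test: remainder is zero
last_digit = x % 10       # right-most digit
last_two = x % 100        # last two digits
print(is_even, last_digit, last_two)   # prints True 4 34
```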

#### String Operations

In [None]:
# The '+' operator works with strings, but it is not addition in the mathematical sense.
# Instead it performs concatenation, which means joining the strings by linking them end to end.

first = 10
second = 20
print(first + second)

In [None]:
# With strings, + concatenates instead of adding, so this prints 100200.
first = '100'
second = '200'
print(first + second)

In [None]:
# The '*' operator also works with strings; it repeats the content of a string an integer number of times.

first = 'Test'
second = 4
print(first * second)
print((first + ' ') * second)

#### Asking the user for input

In [40]:
# Python provides a built-in function called input that gets input from the keyboard

user_input = input()
print(user_input)

hello
hello


In [41]:
# Before getting input from the user, it is a good idea to print a prompt telling the user what to input.

name = input("What is your name?\n")
print(name)

# The sequence '\n' at the end of the prompt represents a newline, which is a special character that causes a line break.

What is your name?
pyladies
pyladies


In [42]:
# If you expect the user to type an integer, you can try to convert the return value to int using the int() function:

prompt = "what... is the speed of light?\n"
speed = int(input(prompt)) 
speed_x = speed + 5
speed_x

what... is the speed of light?
6000


6005

In [43]:
# But if the user types something other than a string of digits, int() raises a ValueError.
# (In the run recorded below, the input was 6000 again, so no error occurred.)

prompt = "what... is the speed of light?\n"
speed = int(input(prompt)) 
speed_x = speed + 5
speed_x

what... is the speed of light?
6000


6005

#### We will see how to handle this kind of error later.
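
As a preview only, here is a sketch of what that handling might look like, using the try and except statements covered in Chapter 3. The helper name `to_int_or_none` is hypothetical, and the text is passed in directly instead of being read with input so the example runs without a keyboard.

```python
def to_int_or_none(text):
    """Convert text to an int, or return None if it is not a whole number."""
    try:
        return int(text)
    except ValueError:        # raised by int() for input such as "fast"
        return None

print(to_int_or_none("6000"))   # prints 6000
print(to_int_or_none("fast"))   # prints None
```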

#### Comments

In [44]:
# As programs get bigger and more complicated, they get more difficult to read.
# It is a good idea to add notes (comments) to your programs to explain in natural language what the program is doing.

#### Choosing mnemonic variable names

In [45]:
# For example, the following three programs are identical in terms of what they accomplish, but very different when you read them and try to understand them.


a = 35.0
b = 12.50
c = a * b
print(c)

hours = 35.0
rate = 12.50
pay = hours * rate
print(pay)

x1q3z9ahd = 35.0
x1q3z9afd = 12.50
x1q3p9afd = x1q3z9ahd * x1q3z9afd
print(x1q3p9afd)


# The Python interpreter sees all three of these programs as exactly the same but humans see and understand these programs quite differently.

# We call these wisely chosen variable names “mnemonic variable names”. The word mnemonic means “memory aid”. We choose mnemonic variable names to help us
# remember why we created the variable in the first place.

437.5
437.5
437.5


#### Exercises

In [46]:
# Exercise 2: Write a program that uses input to prompt a user for their name and then welcomes them.

name = input('What is your name?')
print('Hello ' + name)

What is your name?pyladies
Hello pyladies


In [None]:
# Exercise 3: Write a program to prompt the user for hours and rate per hour to compute gross pay.

hours = float(input('How many hours?'))
rate = float(input('What is the rate per hour?'))

pay = round((hours * rate), 2)
print(pay)

In [None]:
# Exercise 4: Assume that we execute the following assignment statements:

width = 17
height = 12.0

# For each of the following expressions, write the value of the expression and the type (of the value of the expression).

a = width//2
print(a)

b = width/2.0
print(b)

c = height/3
print(c)

d = 1 + 2 * 5
print(d)

In [None]:
# Exercise 5: Write a program which prompts the user for a Celsius temperature, converts the temperature to Fahrenheit, and prints out the converted temperature.

# (0°C × 9/5) + 32 = 32°F

temp = input('What is the temperature in °C? ')
temp_F = float(temp) * 9 / 5 + 32   # keep the fractional part instead of truncating with int()
print("The Temperature is " + str(temp_F) + "°F")

## Chapter 3

### Conditional execution

#### Boolean expression

In [None]:
# A boolean expression is an expression that is either true or false.

5 == 5

In [None]:
5 == 6

In [None]:
type(True)

In [None]:
# The == operator is one of the comparison operators; the others are:
# x != y # x is not equal to y
# x > y # x is greater than y
# x < y # x is less than y
# x >= y # x is greater than or equal to y
# x <= y # x is less than or equal to y
# x is y # x is the same as y
# x is not y # x is not the same as y

# Remember that = is an assignment operator and == is a comparison operator. There is no such thing as =< or =>.
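
The difference between `==` and `is` is easy to miss. A small sketch, using lists (introduced later in the book) because two separate lists can hold equal contents:

```python
a = [1, 2, 3]
b = [1, 2, 3]     # a different list object with the same contents
c = a             # a second name for the same object as a

print(a == b)     # prints True: the values are equal
print(a is b)     # prints False: they are two distinct objects
print(a is c)     # prints True: both names refer to one object
```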

#### Logical Operators

In [None]:
# There are three logical operators: and, or, and not.

x = 2
x > 0 and x < 10   # true only if x is greater than 0 and less than 10

In [None]:
n = 3
n%2 == 0 or n%3 == 0   #is true if either of the conditions is true, that is, if the number is divisible by 2 or 3.

In [None]:
# Finally, the not operator negates a boolean expression, so not (x > y) is true if x > y is false.

x = 1
y = 2
x > y

In [None]:
not(x > y)

In [None]:
# Strictly speaking, the operands of the logical operators should be boolean expressions,
# but Python is not very strict. Any nonzero number is interpreted as “true.”

17 and True
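
A side effect of this leniency is that `and` and `or` return one of their operands, not necessarily True or False. The sketch below uses made-up variable names to show a few cases:

```python
both_truthy = 17 and True    # first operand is truthy, so the second is returned
first_falsy = 0 and True     # first operand is falsy, so it is returned as is
either = 17 or 0             # first truthy operand is returned

print(both_truthy, first_falsy, either)   # prints True 0 17
print(bool(17), bool(0))                  # prints True False
```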

#### Conditional execution

In [None]:
# In order to write useful programs, we almost always need the ability to check conditions and change the behavior of the program accordingly. Conditional statements
# give us this ability. The simplest form is the if statement:

x = 2
if x > 0:
    print('x is positive')


#### Alternative execution

In [None]:
# A second form of the if statement is alternative execution, in which there are two possibilities and the condition determines which one gets executed.

x = 2
if x % 2 == 0:
    print('x is even')
else:
    print('x is odd')

#### Chained conditionals

In [None]:
# Sometimes there are more than two possibilities and we need more than two
# branches. One way to express a computation like that is a chained conditional:

x = 5
y = 5
if x < y:
    print('x is less than y')
elif x > y:
    print('x is greater than y')
else:
    print('x and y are equal')

#### Nested conditionals

In [None]:
# One conditional can also be nested within another. We could have written the three-branch example like this:


x = 10 
y = 20
if x == y:
    print('x and y are equal')
else:
    if x < y:
        print('x is less than y')
    else:
        print('x is greater than y')


#### Catching exceptions using try and except

In [None]:
# There is a conditional execution structure built into Python to handle these types
# of expected and unexpected errors called “try / except”. The purpose of try and
# except is that you know that some sequence of instruction(s) may have a problem
# and you want to add some statements to be executed if an error occurs. These
# extra statements (the except block) are ignored if there is no error.

In [None]:
# Here is a sample program to convert a Fahrenheit temperature to a Celsius temperature:

inp = input('Enter Fahrenheit Temperature: ')
fahr = float(inp)
cel = (fahr - 32.0) * 5.0 / 9.0
print(cel)

If we run this code and enter something that is not a number, float() fails with an error message. The try and except feature in Python lets us handle this case.

In [None]:
# we can rewrite our code as:

inp = input('Enter Fahrenheit Temperature: ')
try:
    fahr = float(inp)
    cel = (fahr - 32.0) * 5/9
    print(cel)
except:
    print('Please enter a number')
    

Python starts by executing the sequence of statements in the try block. If all goes well, it skips the except block and proceeds. If an exception occurs in the try block, Python jumps out of the try block and executes the sequence of statements in the except block. Handling an exception with a try statement is called catching an exception.

In general, catching an exception gives you a chance to fix the problem, or try again, or at least end the program gracefully.
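
One possible variation, offered as a sketch and not as the book's version: wrap the conversion in a function and catch only `ValueError`, the exception that float() raises for non-numeric text. The function name `fahr_to_celsius` is hypothetical, and the text is passed as an argument so the example runs without keyboard input.

```python
def fahr_to_celsius(text):
    """Convert a Fahrenheit reading given as text; return None if not numeric."""
    try:
        fahr = float(text)
    except ValueError:            # only the conversion can fail here
        return None
    return (fahr - 32.0) * 5.0 / 9.0

print(fahr_to_celsius("212"))     # prints 100.0
print(fahr_to_celsius("fred"))    # prints None
```

Naming the exception keeps unrelated problems, such as a typo inside the try block, from being silently reported as bad input.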

#### Short-circuit evaluation of logical expressions

When Python detects that there is nothing to be gained by evaluating the rest of a logical expression, it stops the evaluation and does not do the computations in the rest of the expression.

When the evaluation of a logical expression stops because the overall value is already known, it is called short-circuiting the evaluation. 

While this may seem like a fine point, the short-circuit behavior leads to a clever technique called the guardian pattern. For example:

In [None]:
x = 12
y = 2
x >= 2 and (x/y) > 2

In [None]:
x = 1
y = 0
x >= 2 and (x/y) > 2

In [None]:
x = 90
y = 0
x >= 4 and (x/y) > 4

The third calculation failed because Python evaluated (x/y) while y was zero, which causes a runtime error. The first and second examples did not fail: in the first, y was nonzero, and in the second, the first part of the expression, x >= 2, evaluated to False, so (x/y) was never executed because of the short-circuit rule and there was no error.

We can construct the logical expression to strategically place a guard evaluation just before the evaluation that might cause an error as follows:

In [None]:
x = 1
y = 0
x >= 2 and y != 0 and (x/y) > 2

In [None]:
x = 6
y = 0
x >= 2 and y != 0 and (x/y) > 2

In [None]:
x >= 2 and (x/y) > 2 and y != 0

In the first logical expression, x >= 2 is False, so the evaluation stops at the first and. In the second logical expression, x >= 2 is True but y != 0 is False, so we never reach (x/y). In the third logical expression, y != 0 comes after the (x/y) calculation, so the expression fails with an error. In the second expression, we say that y != 0 acts as a guard to ensure that we only execute (x/y) if y is non-zero.
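
The guarded expression can be packaged as a function so the same values can be tried side by side. This is a sketch only; the name `ratio_exceeds` and its `limit` parameter are made up for the illustration.

```python
def ratio_exceeds(x, y, limit=2):
    """Guarded test: the division runs only when y is nonzero."""
    return x >= limit and y != 0 and (x / y) > limit

print(ratio_exceeds(12, 2))   # prints True: 12 / 2 is 6, which is above 2
print(ratio_exceeds(1, 0))    # prints False: stops at x >= limit
print(ratio_exceeds(6, 0))    # prints False: the guard y != 0 stops it
```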

#### Exercises

Exercise 1: 

Rewrite your pay computation to give the employee 1.5 times the hourly rate for hours worked above 40 hours.

Enter Hours: 45

Enter Rate: 10

Pay: 475.0

In [None]:
hour = float(input("Enter the total hours\n"))
rate = float(input("What is the rate per hour?\n"))

if hour > 40:
    overtime = hour - 40                 # hours worked beyond the first 40
    pay = 40 * rate + overtime * rate * 1.5
    print(f"Your total pay is ${pay}")
else:
    pay = hour * rate
    print(f"Your total pay is ${pay}")

In [None]:
# Alternative Method

# Function to compute pay
def compute_pay(hour, rate):
    if hour > 40:
        overtime = hour - 40             # hours worked beyond the first 40
        pay = 40 * rate + overtime * rate * 1.5
    else:
        pay = hour * rate
    return pay

# Input hours and rate
hour = float(input("Enter hours\n"))
rate = float(input("Enter rate\n"))

# Calculate pay
pay = compute_pay(hour, rate)

# Print the pay
print("The Pay is : " + '$' + str(pay))
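
One detail worth checking in any solution is how the overtime hours are computed. Subtracting 40 stays correct for 80 hours or more, whereas a remainder-based shortcut such as hours % 40 does not. The sketch below is one way to write and check the calculation without keyboard input; the name `gross_pay` and its `threshold` and `multiplier` parameters are assumptions for this example.

```python
def gross_pay(hours, rate, threshold=40, multiplier=1.5):
    """Pay at the normal rate up to threshold hours, then at rate * multiplier."""
    overtime = max(hours - threshold, 0)   # never negative
    regular = hours - overtime
    return regular * rate + overtime * rate * multiplier

print(gross_pay(45, 10))   # prints 475.0, matching the exercise
print(gross_pay(85, 10))   # prints 1075.0: 40 regular hours plus 45 overtime hours
```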

Exercise 2: 

Rewrite your pay program using try and except so that your program handles non-numeric input gracefully by printing a message and exiting the program. The following shows two executions of the program:

Enter Hours: 20

Enter Rate: nine

Error, please enter numeric input

Enter Hours: forty

Error, please enter numeric input


In [None]:
try:
    hour = float(input("Enter the total hours\n"))
    rate = float(input("What is the rate per hour?\n"))

    if hour > 40:
        overtime = hour - 40             # hours worked beyond the first 40
        pay = 40 * rate + overtime * rate * 1.5
    else:
        pay = hour * rate
    print(f"Your total pay is ${pay}")

except ValueError:
    # float() raises ValueError for non-numeric input such as "nine"
    print("Error, please enter numeric input")

Exercise 3: Write a program to prompt for a score between 0.0 and 1.0. If the score is out of range, print an error message. If the score is between 0.0 and 1.0, print a grade using the following table:

Score     Grade
if >= 0.9   A
if >= 0.8   B
if >= 0.7   C
if >= 0.6   D
if < 0.6    F

Enter score: 0.95, then: 
A

Enter score: perfect, then: 
Bad score

Enter score: 10.0, then: 
Bad score

Enter score: 0.75, then: 
C

Enter score: 0.5, then: 
F

Run the program repeatedly as shown above to test the various different values for input.

In [None]:
def score_board(s):
    if s >= 0.9:
        grade = 'A'
    elif s >= 0.8:
        grade = 'B'
    elif s >= 0.7:
        grade = 'C'
    elif s >= 0.6:
        grade = 'D'
    elif s < 0.6:
        grade = 'F'
    else:
        grade = 'Bad Score'
    return grade

# Input score
def main():
    s_i = input("Enter score: ")
    
    try:
        score = float(s_i)
        if score < 0.0 or score > 1.0:
            print("Bad score")
        else:
            result = score_board(score)
            print(result)
    except ValueError:
        print("Bad score")
        
        
# Run the main function
if __name__ == "__main__":
    main()

## Chapter 4

### Functions

#### 4.1 Function Calls

In the context of programming, a function is a named sequence of statements that performs a computation. When you define a function, you specify the name and the sequence of statements. Later, you can “call” the function by name. We have already seen one example of a function call:

In [None]:
type(32)

#### 4.2 Built-in Functions

Python provides a number of important built-in functions that we can use without needing to provide the function definition.

The max function tells us the “largest character” in the string (which turns out to be the letter “r”)

In [None]:
max('Hello World')

The min function shows us the smallest character (which turns out to be a space).

In [None]:
min('Hello World')

Another very common built-in function is the len function which tells us how many items are in its argument. If the argument to len is a string, it returns the number of characters in the string.

These functions are not limited to looking at strings. They operate on any set of values.

You should treat the names of built-in functions as reserved words (i.e., avoid using “max” as a variable name).

In [None]:
len('Hello World')
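
To illustrate that these functions are not limited to strings, here is a short sketch that applies them to a list of numbers (lists are covered later in the book; the values are arbitrary):

```python
values = [3, 41, 12, 9, 74, 15]
largest = max(values)
smallest = min(values)
count = len(values)
print(largest, smallest, count)   # prints 74 3 6
```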

#### 4.3 Type conversion functions

Python also provides built-in functions that convert values from one type to another.


In [None]:
int('32')

In [None]:
int('Hello')

int() can convert floating-point values to integers, but it doesn’t round off; it chops off the fraction part:


In [None]:
int(3.999)

In [None]:
int(-2.3)

float() converts integers and strings to floating-point numbers

In [None]:
float(32)

In [None]:
float('3.146')

str() converts its argument to a string:

In [None]:
str(32)

In [None]:
str(3.14652)
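
Because int() chops off the fraction, it can differ from rounding. A brief sketch comparing it with the built-in round function, which these notes do not otherwise cover:

```python
chopped = int(3.999)         # fraction discarded: 3
chopped_negative = int(-2.3) # truncates toward zero: -2
rounded = round(3.999)       # nearest integer: 4
print(chopped, chopped_negative, rounded)   # prints 3 -2 4
```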

#### 4.4 Math functions

Python has a math module that provides most of the familiar mathematical functions.
Before we can use the module, we have to import it:

In [None]:
import math 

# This statement creates a module object named math.

print(math)

The module object contains the functions and variables defined in the module. To access one of the functions, you have to specify the name of the module and the name of the function, separated by a dot (also known as a period). This format is called dot notation.


The first example computes the logarithm base 10 of the signal-to-noise ratio. The math module also provides a function called log that computes logarithms base e.


In [None]:
signal_power = 50
noise_power = 100
ratio = signal_power / noise_power
decibels = 10 * math.log10(ratio)
print(decibels)

The second example finds the sine of radians. The name of the variable is a hint that sin and the other trigonometric functions (cos, tan, etc.) take arguments in radians.


In [None]:
radians = 0.7
height = math.sin(radians)
print(height)

To convert from degrees to radians, divide by 360 and multiply by 2π:

The expression math.pi gets the variable pi from the math module. The value of this variable is an approximation of pi, accurate to about 15 digits.

In [None]:
degrees = 45
radians = degrees / 360 * 2 * math.pi
math.sin(radians)
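
The math module also has a ready-made conversion, `math.radians`. As a quick sketch, the manual formula and the built-in function can be compared like this (the names `by_hand` and `built_in` are just for illustration):

```python
import math

degrees = 45
by_hand = degrees / 360 * 2 * math.pi    # the formula from the text
built_in = math.radians(degrees)         # standard-library equivalent

print(math.isclose(by_hand, built_in))   # prints True
print(math.sin(built_in))                # roughly 0.7071, the sine of 45 degrees
```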

#### 4.5 Random numbers

Given the same inputs, most computer programs generate the same outputs every time, so they are said to be deterministic. Determinism is usually a good thing, since we expect the same calculation to yield the same result. For some applications, though, we want the computer to be unpredictable. Games are an obvious example, but there are more.

Making a program truly nondeterministic turns out to be not so easy, but there are ways to make it at least seem nondeterministic. One of them is to use algorithms that generate pseudorandom numbers.
Pseudorandom numbers are not truly random because they are generated by a deterministic computation, but just by looking at the numbers it is all but impossible to distinguish them from random.

The random module provides functions that generate pseudorandom numbers. The function random returns a random float between 0.0 and 1.0 (including 0.0 but not 1.0).

For example, this program produces a list of 10 random numbers between 0.0 and up to but not including 1.0.


In [None]:
import random

for i in range(10):
    x = random.random()
    print(x)
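
The deterministic nature of pseudorandom numbers can be seen by fixing the generator's starting point with `random.seed`. In this sketch the seed value 42 is arbitrary:

```python
import random

random.seed(42)              # fix the starting point of the generator
first = random.random()

random.seed(42)              # same seed, so the sequence starts over
second = random.random()

print(first == second)       # prints True
```

Without a call to seed, each run of the program normally starts from a different point and produces different numbers.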

The random function is only one of many functions that handle random numbers. The function randint takes the parameters low and high, and returns an integer between low and high (including both).

In [None]:
random.randint(68, 910)

To choose an element from a sequence at random, you can use choice:

In [None]:
t = [1,22,333,4444,55555]
random.choice(t)

The random module also provides functions to generate random values from continuous distributions including Gaussian, exponential, gamma, and a few more.

#### 4.6 Adding new functions

A function definition specifies the name of a new function and the sequence of statements that execute when the function is called.
Once we define a function, we can reuse the function over and over throughout our program.

Here is an example:

In [None]:
def print_lyrics():
    print("I'm a lumberjack, and I'm okay.")
    print('I sleep all night and I work all day.')

def is a keyword that indicates that this is a function definition. 

The name of the function is print_lyrics. 

The rules for function names are the same as for variable names: letters, numbers and some punctuation marks are legal, but the first character can’t be a number. You can’t use a keyword as the name of a function, and you should avoid having a variable and a function with the same name.

The empty parentheses after the name indicate that this function doesn’t take any arguments. Later we will build functions that take arguments as their inputs.

The first line of the function definition is called the header; the rest is called the body. The header has to end with a colon and the body has to be indented.

By convention, the indentation is always four spaces. The body can contain any number of statements.

If you type a function definition in interactive mode, the interpreter prints ellipses (...) to let you know that the definition isn’t complete:


In [None]:
def print_lyrics():
    print("I'm a lumberjack, and I'm okay.")
    print('I sleep all night and I work all day.')

In [None]:
print(print_lyrics)

In [None]:
print(type(print_lyrics))

# The value of print_lyrics is a function object, which has type “function”.

The syntax for calling the new function is the same as for built-in functions:

In [None]:
print_lyrics()

Once you have defined a function, you can use it inside another function. For example,

In [None]:
def repeat_lyrics():
    print_lyrics()
    print_lyrics()
    
    
repeat_lyrics()

#### 4.7 Definitions and uses

This program has two function definitions: print_lyrics and repeat_lyrics.

Function definitions get executed just like other statements, but the effect is to create function objects. The statements inside the function do not get executed until the function is called, and the function definition generates no output. 

As you might expect, you have to create a function before you can execute it. In other words, the function definition has to be executed before the first time it is called.
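
A small sketch of what happens when this rule is broken. The function name `say_hello` is made up, and the try and except statements from Chapter 3 are used so the example keeps running after the error:

```python
try:
    say_hello()                  # called before the definition has run
except NameError as err:
    error_text = str(err)
    print(error_text)            # name 'say_hello' is not defined

def say_hello():
    return "hello"

print(say_hello())               # prints hello: the definition has now run
```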

Exercise 2: 

Move the last line of this program to the top, so the function call appears before the definitions. Run the program and see what error message you get.

In [None]:
repeat_lyrics()

def repeat_lyrics():
    print_lyrics()
    print_lyrics()


Exercise 3: 

Move the function call back to the bottom and move the definition of print_lyrics after the definition of repeat_lyrics. What happens when you run this program?

In [None]:
def repeat_lyrics():
    print_lyrics()
    print_lyrics()
    
def print_lyrics():
    print("I'm a lumberjack, and I'm okay.")
    print('I sleep all night and I work all day.')
    
repeat_lyrics()

#### 4.8 Flow of execution

In order to ensure that a function is defined before its first use, you have to know the order in which statements are executed, which is called the flow of execution.

Execution always begins at the first statement of the program. Statements are executed one at a time, in order from top to bottom.

Function definitions do not alter the flow of execution of the program, but remember that statements inside the function are not executed until the function is called.

A function call is like a detour in the flow of execution. Instead of going to the next statement, the flow jumps to the body of the function, executes all the statements there, and then comes back to pick up where it left off.

That sounds simple enough, until you remember that one function can call another. While in the middle of one function, the program might have to execute the statements in another function. But while executing that new function, the program might have to execute yet another function!

Fortunately, Python is good at keeping track of where it is, so each time a function completes, the program picks up where it left off in the function that called it. When it gets to the end of the program, it terminates.

What’s the moral of this sordid tale? When you read a program, you don’t always want to read from top to bottom. Sometimes it makes more sense if you follow the flow of execution.
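
One way to follow the flow of execution is to record each step as it happens. The sketch below uses hypothetical functions `outer` and `inner` and a list named `trace` to show the detour that a nested call makes:

```python
trace = []

def inner():
    trace.append("inner")

def outer():
    trace.append("outer start")
    inner()                      # detour into inner, then come back
    trace.append("outer end")

trace.append("before call")
outer()
trace.append("after call")
print(trace)
# prints ['before call', 'outer start', 'inner', 'outer end', 'after call']
```

The two definitions at the top add nothing to the list; only the calls do.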

#### 4.9 Parameters and arguments

Some of the built-in functions we have seen require arguments. For example, when you call math.sin you pass a number as an argument. Some functions take more than one argument: math.pow takes two, the base and the exponent.

Inside the function, the arguments are assigned to variables called parameters.

Here is an example of a user-defined function that takes an argument:
This function assigns the argument to a parameter named bruce. When the function is called, it prints the value of the parameter twice.

In [None]:
def print_twice(bruce):
    print(bruce)
    print(bruce)
    
print_twice('Hey there!')

The same rules of composition that apply to built-in functions also apply to user-defined functions, so we can use any kind of expression as an argument for print_twice:

In [None]:
print_twice('Hello ' * 2)

The argument is evaluated before the function is called, so in this example the expression 'Hello ' * 2 is evaluated only once.

You can also use a variable as an argument. The name of the variable we pass as an argument (hey) has nothing to do with the name of the parameter (bruce). It doesn’t matter what the value was called back home (in the caller); here in print_twice, we call everybody bruce.

In [None]:
hey = 'How are you doing today?'
print_twice(hey)

#### 4.10 Fruitful functions and void functions

Some of the functions we are using, such as the math functions, yield results; for lack of a better name, we can call them fruitful functions. Other functions, like print_twice, perform an action but don’t return a value. They are called void functions.

When you call a fruitful function, you almost always want to do something with the result; for example, you might assign it to a variable or use it as part of an expression:


In [None]:
import math

radians = 12
x = math.cos(radians)
golden = (math.sqrt(5) + 1) / 2

# When you call a function in interactive mode, Python displays the result:
math.sqrt(5)

But in a script, if you call a fruitful function and do not store the result of the function in a variable, the return value vanishes into the mist!

Void functions might display something on the screen or have some other effect, but they don’t have a return value. If you try to assign the result to a variable, you get a special value called None.

In [None]:
def print_twice(bruce):
    print(bruce)
    print(bruce)
    
result = print_twice('Bing')

In [None]:
print(result)

The value None is not the same as the string “None”. It is a special value that has its own type:

In [None]:
print(type(None))

To return a result from a function, we use the return statement in our function.
For example, we could make a very simple function called addtwo that adds two numbers together and returns a result.

In [None]:
def addtwo(a, b):
    added = a + b
    return added

x = addtwo(3, 5)      # x, a variable in the calling code, receives the returned value
print(x)

When this script executes, the print statement will print out “8” because the addtwo function was called with 3 and 5 as arguments. Within the function, the parameters a and b were 3 and 5 respectively. The function computed the sum of the two numbers and placed it in the local function variable named added. Then it used the return statement to send the computed value back to the calling code as the function result, which was assigned to the variable x and printed out.

#### 4.11 Why functions?

It may not be clear why it is worth the trouble to divide a program into functions. There are several reasons:
• Creating a new function gives you an opportunity to name a group of statements, which makes your program easier to read, understand, and debug.

• Functions can make a program smaller by eliminating repetitive code. Later, if you make a change, you only have to make it in one place.

• Dividing a long program into functions allows you to debug the parts one at a time and then assemble them into a working whole.

• Well-designed functions are often useful for many programs. Once you write and debug one, you can reuse it. 

Throughout the rest of the book, we will often use a function definition to explain a concept. Part of the skill of creating and using functions is to have a function properly capture an idea such as “find the smallest value in a list of values”. Later we will show you code that finds the smallest in a list of values and we will present it to you as a function named min which takes a list of values as its argument and returns the smallest value in the list.

#### 4.12 Exercises

Exercise 4:

What is the purpose of the “def” keyword in Python?

a) It is slang that means “the following code is really cool”

b) It indicates the start of a function

c) It indicates that the following indented section of code is to be stored for later

d) b and c are both true

e) None of the above

Exercise 5: 

What will the following Python program print out?

def fred():

    print("Zap")
    
def jane():

    print("ABC")
    
jane()

fred()

jane()


a) Zap ABC jane fred jane

b) Zap ABC Zap

c) ABC Zap jane

d) ABC Zap ABC

e) Zap Zap Zap

In [None]:
# Exercise 5: What will the following Python program print out?

def fred():
    print("Zap")
def jane():
    print("ABC")
jane()
fred()
jane()

Exercise 6: 

Rewrite your pay computation with time-and-a-half for overtime and create a function called computepay which takes two parameters (hours and rate).

Enter Hours: 45

Enter Rate: 10

Pay: 475.0

In [None]:
def computepay(hours, rate):
    # Hours beyond 40 are paid at 1.5 times the regular rate.
    if hours > 40:
        overtime = hours - 40
        pay = 40 * rate + overtime * rate * 1.5
    else:
        pay = hours * rate
    return pay

# Read the inputs outside the function so computepay works with any values.
hours = float(input('Enter Hours: '))
rate = float(input('Enter Rate: '))
print('Pay:', computepay(hours, rate))

## Chapter 5

### Iteration

#### 5.1 Updating variables

A common pattern in assignment statements is an assignment statement that updates a variable, where the new value of the variable depends on the old.

In [None]:
x = x + 1

This means “get the current value of x, add 1, and then update x with the new value.”

If you try to update a variable that doesn’t exist, you get an error (a NameError), because Python evaluates the right side before it assigns a value to x.

Before you can update a variable, you have to initialize it, usually with a simple assignment. Updating a variable by adding 1 is called an increment; subtracting 1 is called a decrement. The cell below initializes x and then increments it:

In [None]:
x = 0
x = x + 1
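
The error described above can be reproduced safely. The following is a minimal sketch, assuming a hypothetical variable name undefined_counter that has not been assigned anywhere; it wraps the update in try and except so the NameError is reported instead of stopping the program.

```python
# undefined_counter is a hypothetical name that has not been initialized.
caught_error = False
try:
    # Python evaluates the right side first, finds no value, and raises NameError.
    undefined_counter = undefined_counter + 1
except NameError as err:
    caught_error = True
    print('Update failed:', err)

# Initializing first makes the same update legal.
undefined_counter = 0
undefined_counter = undefined_counter + 1
print(undefined_counter)
```

Once the variable has a starting value, the same update statement works as an ordinary increment.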

#### 5.2 The while statement

Computers are often used to automate repetitive tasks. Repeating identical or similar tasks without making errors is something that computers do well and people do poorly. Because iteration is so common, Python provides several language features to make it easier.

One form of iteration in Python is the while statement. Here is a simple program that counts down from five and then says “Blastoff!”.

In [None]:
n = 5
while n > 0:
    print(n)
    n = n - 1
print('Blastoff!')

It means, “While n is greater than 0, display the value of n and then reduce the value of n by 1. When you get to 0, exit the while statement and display the word Blastoff!”

More formally, here is the flow of execution for a while statement:

1. Evaluate the condition, yielding True or False.

2. If the condition is false, exit the while statement and continue execution at the next statement.

3. If the condition is true, execute the body and then go back to step 1.

This type of flow is called a *loop* because the third step loops back around to the top. We call each time we execute the body of the loop an *iteration*. For the above loop, we would say, “It had five iterations”, which means that the body of the loop was executed five times.

The body of the loop should change the value of one or more variables so that eventually the condition becomes false and the loop terminates. We call the variable that changes each time the loop executes and controls when the loop finishes the *iteration variable*. If there is no iteration variable, the loop will repeat forever, resulting in an *infinite loop*.
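
To make the idea of an iteration concrete, here is a small sketch of the countdown loop with an extra counter. The name iterations is an assumption added for illustration; it simply records how many times the body runs.

```python
n = 5
iterations = 0            # counts how many times the body runs
while n > 0:              # evaluate the condition; leave the loop when it is false
    iterations = iterations + 1
    n = n - 1             # the iteration variable moves toward the exit condition
print('Iterations:', iterations)
print('Final n:', n)
```

Because n shrinks by 1 on every pass, the condition n > 0 eventually becomes false and the loop stops.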

#### 5.3 Infinite loops

An endless source of amusement for programmers is the observation that the directions on shampoo, “Lather, rinse, repeat,” are an infinite loop because there is no iteration variable telling you how many times to execute the loop.

In the case of countdown, we can prove that the loop terminates because we know that the value of n is finite, and we can see that the value of n gets smaller each time through the loop, so eventually we have to get to 0. Other times a loop is obviously infinite because it has no iteration variable at all.

Sometimes you don’t know it’s time to end a loop until you get half way through the body. In that case you can write an infinite loop on purpose and then use the break statement to jump out of the loop.
This loop is obviously an infinite loop because the logical expression on the while statement is simply the logical constant True:

In [None]:
n = 10
while True:
    print(n, end=" ")
    n = n - 1
print('Done!')

If you make the mistake and run this code, you will learn quickly how to stop a runaway Python process on your system or find where the power-off button is on your computer. This program will run forever or until your battery runs out because the logical expression at the top of the loop is always true by virtue of the fact that the expression is the constant value True.

While this is a dysfunctional infinite loop, we can still use this pattern to build useful loops as long as we carefully add code to the body of the loop to explicitly exit the loop using break when we have reached the exit condition.

For example, suppose you want to take input from the user until they type done. You could write:

In [None]:
while True:
    line = input('> ')
    if line == 'done':
        break
    print(line)
print('Done!')

The loop condition is True, which is always true, so the loop runs repeatedly until it hits the break statement.
Each time through, it prompts the user with an angle bracket. If the user types done, the break statement exits the loop. Otherwise the program echoes whatever the user types and goes back to the top of the loop.

This way of writing while loops is common because you can check the condition anywhere in the loop (not just at the top) and you can express the stop condition affirmatively (“stop when this happens”) rather than negatively (“keep going until that happens.”).
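
The same pattern can be tried without typing anything. This is only a sketch: the list typed_lines is hypothetical data standing in for what a user might enter, and position tracks which entry is read next.

```python
# typed_lines stands in for user input (hypothetical data).
typed_lines = ['hello', 'there', 'done', 'never reached']
position = 0
echoed = []
while True:
    line = typed_lines[position]
    position = position + 1
    if line == 'done':
        break              # exit from the middle of the body
    echoed.append(line)
print(echoed)
```

The entry after 'done' is never read, which shows that break leaves the loop immediately.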

#### 5.4 Finishing iterations with continue

Sometimes you are in an iteration of a loop and want to finish the current iteration and immediately jump to the next iteration. In that case you can use the continue statement to skip to the next iteration without finishing the body of the loop for the current iteration.

Here is an example of a loop that copies its input until the user types “done”, but treats lines that start with the hash character as lines not to be printed (kind of like Python comments).

In [None]:
while True:
    line = input('>')
    if line.startswith('#'):  # unlike line[0], this does not fail on an empty line
        continue
    if line == 'done':
        break
    print(line)
print('Done!')

All the lines are printed except the one that starts with the hash sign because when the continue is executed, it ends the current iteration and jumps back to the while statement to start the next iteration, thus skipping the print statement.
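
As a rough illustration of how continue and break interact, the sketch below replays a fixed set of lines instead of reading from the keyboard. The list typed_lines and the counter skipped are assumptions made for this example.

```python
# typed_lines stands in for user input (hypothetical data), including an empty line.
typed_lines = ['# a comment', 'print this', '', '# another comment', 'and this', 'done']
position = 0
skipped = 0
printed = []
while True:
    line = typed_lines[position]
    position = position + 1
    if line.startswith('#'):
        skipped = skipped + 1
        continue           # jump back to the top; the append below is skipped
    if line == 'done':
        break
    printed.append(line)
print('Printed:', printed)
print('Skipped:', skipped)
```

The empty line is kept rather than causing an error, because startswith works on a string of any length.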

#### 5.5 Definite loops using for

Sometimes we want to loop through a set of things such as a list of words, the lines in a file, or a list of numbers. *When we have a list of things to loop through, we can construct a definite loop using a for statement.* We call the while statement an indefinite loop because it simply loops until some condition becomes False, whereas the for loop is looping through a known set of items so it runs through as many iterations as there are items in the set.

The syntax of a for loop is similar to the while loop in that there is a for statement and a loop body:

In [None]:
friends = ['Garry', 'Jerry', 'Ferry']
for friend in friends:
    print('Hey! Nice to meet you', friend)
print('Done!')

In Python terms, the variable friends is a list of three strings, and the for loop goes through the list and executes the body once for each of the three strings in the list.

Translating this for loop to English is not as direct as the while, but if you think of friends as a set, it goes like this: “Run the statements in the body of the for loop once for each friend in the set named friends.”

In particular, friend is the iteration variable for the for loop. The variable friend changes for each iteration of the loop and controls when the for loop completes. The iteration variable steps successively through the three strings stored in the friends variable.

#### 5.6 Loop patterns

Often we use a for or while loop to go through a list of items or the contents of a file and we are looking for something such as the largest or smallest value of the data we scan through.

These loops are generally constructed by:

• Initializing one or more variables before the loop starts

• Performing some computation on each item in the loop body, possibly changing the variables in the body of the loop

• Looking at the resulting variables when the loop completes

##### 5.6.1 Counting and summing loops

For example, to count the number of items in a list, we would write the following for loop:

In [None]:
count = 0
for itervar in [3, 41, 12, 9, 74, 15]:
    count = count + 1
print('Count: ', count)

We set the variable count to zero before the loop starts, then we write a for loop to run through the list of numbers. Our iteration variable is named itervar and while we do not use itervar in the loop, it does control the loop and cause the loop body to be executed once for each of the values in the list.

In the body of the loop, we add 1 to the current value of count for each of the values in the list. While the loop is executing, the value of count is the number of values we have seen “so far”.

Once the loop completes, the value of count is the total number of items. The total number “falls in our lap” at the end of the loop. We construct the loop so that we have what we want when the loop finishes.
Another similar loop that computes the total of a set of numbers is as follows:

In [None]:
total_sum = 0
for itervar in [3, 41, 12, 9, 74, 15]:
    total_sum = total_sum + itervar
print('The sum is', total_sum)

In this loop we do use the iteration variable. Instead of simply adding one to the count as in the previous loop, we add the actual number (3, 41, 12, etc.) to the running total during each loop iteration. If you think about the variable total_sum, it contains the “running total of the values so far”. So before the loop starts total_sum is zero because we have not yet seen any values, during the loop total_sum is the running total, and at the end of the loop total_sum is the overall total of all the values in the list.

As the loop executes, total_sum accumulates the sum of the elements; a variable used this way is sometimes called an accumulator.

Neither the counting loop nor the summing loop is particularly useful in practice because the built-in functions len() and sum() compute the number of items in a list and the total of the items in the list, respectively.
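
As a quick check, the sketch below runs both hand-written loops in a single pass and compares the results with len() and sum(). The variable name values is an assumption introduced so the same list can be reused.

```python
values = [3, 41, 12, 9, 74, 15]

count = 0
total_sum = 0
for itervar in values:
    count = count + 1                 # counting pattern
    total_sum = total_sum + itervar   # summing pattern

print('Loop results:', count, total_sum)
print('Built-ins:   ', len(values), sum(values))
```

Both lines should show the same pair of numbers, which is why the built-ins are usually preferred once the pattern is understood.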

##### 5.6.2 Maximum and minimum loops

To find the largest value in a list or sequence, we construct the following loop:

In [None]:
# Maximum number: start with None to mean "no value seen yet"
largest = None
for itervar in [3, 41, 12, 9, 74, 15]:
    if largest is None or itervar > largest:
        largest = itervar
    print('Loop:', itervar, largest)
print('Largest:', largest)

The variable largest is best thought of as the “largest value we have seen so far”. Before the loop, we set largest to the constant None. None is a special constant value which we can store in a variable to mark the variable as “empty”.

Before the loop starts, the largest value we have seen so far is None since we have not yet seen any values. While the loop is executing, if largest is None then we take the first value we see as the largest so far. You can see in the first iteration when the value of itervar is 3, since largest is None, we immediately set largest to be 3.

After the first iteration, largest is no longer None, so the second part of the compound logical expression that checks itervar > largest triggers only when we see a value that is larger than the “largest so far”. When we see a new “even larger” value we take that new value for largest. You can see in the program output that largest progresses from 3 to 41 to 74.

At the end of the loop, we have scanned all of the values and the variable largest now does contain the largest value in the list.

To compute the smallest number, the code is very similar with one small change. The maximum loop is repeated below for comparison, followed by the minimum loop:

In [None]:
# Maximum Number:
largest = None
for itervar in [3, 41, 12, 9, 74, 15]:
    if largest is None or itervar > largest:
        largest = itervar
    print('loop:', itervar, largest)
print('largest is:', largest)

In [None]:
# Minimum Number
smallest = None
print('Before:', smallest)
for itervar in [3, 41, 12, 9, 74, 15]:
    if smallest is None or itervar < smallest:
        smallest = itervar
    print('Loop', itervar, smallest)
print('small:', smallest)

Again, smallest is the “smallest so far” before, during, and after the loop executes.

When the loop has completed, smallest contains the minimum value in the list.

Again as in counting and summing, the built-in functions max() and min() make writing these exact loops unnecessary.
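
For comparison, here is a short sketch of the built-ins applied to the same list. The variable name values and the flag empty_ok are assumptions for this example. One difference worth noting: the loops above leave None when the list is empty, whereas the built-in max() raises a ValueError.

```python
values = [3, 41, 12, 9, 74, 15]
largest = max(values)
smallest = min(values)
print('Largest:', largest)
print('Smallest:', smallest)

# With nothing to compare, the built-in raises ValueError instead of returning None.
try:
    max([])
    empty_ok = True
except ValueError:
    empty_ok = False
print('max([]) succeeded:', empty_ok)
```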

The following is a simple version of the Python built-in min() function:

In [None]:
def min(values):
    small = None
    for i in values:
        if small is None or i < small:
            small = i
    return small

a = [3, 41, 12, 9, 74, 15]
smallest = min(a)
print('Smallest is', smallest)

In [None]:
def max(values):
    big = None
    for i in values:
        if big is None or i > big:
            big = i
    return big

b = [3, 41, 12, 9, 74, 15]
bigger = max(b)
print('Largest is: ', bigger)

#### 5.9 Exercises

Exercise 1: 

Write a program which repeatedly reads integers until the user enters “done”. Once “done” is entered, print out the total, count, and average of the integers. If the user enters anything other than an integer, detect their mistake using try and except, print an error message, and skip to the next integer.

In [None]:
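numbers = []

while True:
    user_num = input('> ')
    if user_num == 'done':
        break
    try:
        numbers.append(int(user_num))
    except ValueError:
        # int() raises ValueError when the text is not an integer
        print('Invalid input, enter a valid number')

total = sum(numbers)
count = len(numbers)
# Avoid dividing by zero when no numbers were entered
average = total / count if count > 0 else 0
print('Total:', total, 'Count:', count, 'Average:', average)
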

Exercise 2: 

Write another program that prompts for a list of numbers as above
and at the end prints out both the maximum and minimum of the numbers instead of the average.

In [None]:
n = int(input('Enter the number of elements in your list:'))

numbers = []

for i in range(n):
    user_input = int(input('Enter the number:'))
    numbers.append(user_input)

# Print the list once, after all the numbers have been read
print('Your list is:', numbers)
    
def find_max_num(values):
    largest = None
    for itervar in values:
        if largest is None or itervar > largest:
            largest = itervar
    return largest

def find_min_num(values):
    smallest = None
    for itervari in values:
        if smallest is None or itervari < smallest:
            smallest = itervari
    return smallest

min_num = find_min_num(numbers)
max_num = find_max_num(numbers)
print('The smallest number is:', min_num)
print('The largest number is:', max_num)
        

## Chapter 6

### Strings

#### 6.1 A string is a sequence

A string is a sequence of characters. You can access the characters one at a time with the bracket operator:

In [None]:
fruit = 'banana'
letter = fruit[1]
letter

The second statement extracts the character at index position 1 from the fruit variable and assigns it to the letter variable.

The expression in brackets is called an index. The index indicates which character in the sequence you want (hence the name).

For most people, the first letter of “banana” is “b”, not “a”. But in Python, the index is an offset from the beginning of the string, and the offset of the first letter is zero.

In [None]:
letter = fruit[0]
letter

So “b” is the 0th letter (“zero-th”) of “banana”, “a” is the 1th letter (“one-th”), and “n” is the 2th (“two-th”) letter.
You can use any expression, including variables and operators, as an index, but the value of the index has to be an integer. 

Otherwise you get:

In [None]:
letter = fruit[1.5]
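
Both halves of this rule can be seen in one short sketch. The names i, bad_index, and index_error_seen are assumptions for the example; try and except are used only so that the error is printed rather than stopping the program.

```python
fruit = 'banana'
i = 1
letter = fruit[i + 1]        # the expression i + 1 evaluates to the integer 2
print(letter)

bad_index = 1.5              # a float is not a valid index
try:
    fruit[bad_index]
    index_error_seen = False
except TypeError as err:
    index_error_seen = True
    print('TypeError:', err)
```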

Here are the string indexes:

b a n a n a

0 1 2 3 4 5

#### 6.2 Getting the length of a string using len

len is a built-in function that returns the number of characters in a string:

In [None]:
fruit = 'banana'
len(fruit)

To get the last letter of a string, you might be tempted to try something like this:

In [None]:
length = len(fruit)
last = fruit[length]

The reason for the IndexError is that there is no letter in “banana” with the index 6. Since we started counting at zero, the six letters are numbered 0 to 5. To get the last character, you have to subtract 1 from length:

In [None]:
last = fruit[length - 1]
last

Alternatively, you can use negative indices, which count backward from the end of the string. The expression fruit[-1] yields the last letter, fruit[-2] yields the second to last, and so on, as shown below:

In [None]:
last = fruit[-1]
last
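
The relationship between the two approaches can be checked directly. In this sketch, which uses the hypothetical helper names offset and matches, each negative index is compared with the equivalent position counted from the start.

```python
fruit = 'banana'
length = len(fruit)
matches = []
for offset in [1, 2, 3]:
    from_end = fruit[-offset]            # negative index counts back from the end
    from_start = fruit[length - offset]  # same character, counted from the start
    matches.append(from_end == from_start)
    print(-offset, from_end, from_start)
print(matches)
```

In other words, fruit[-k] selects the same character as fruit[len(fruit) - k].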

#### 6.3 Traversal through a string with a loop

A lot of computations involve processing a string one character at a time. Often they start at the beginning, select each character in turn, do something to it, and continue until the end. This pattern of processing is called a traversal. One way to write a traversal is with a while loop:

In [None]:
index = 0
while index < len(fruit):
    letter = fruit[index]
    print(letter)
    index = index + 1


This loop traverses the string and displays each letter on a line by itself. The loop condition is index < len(fruit), so when index is equal to the length of the string, the condition is false, and the body of the loop is not executed. The last character accessed is the one with the index len(fruit)-1, which is the last character in the string.

Exercise 1: 

Write a while loop that starts at the last character in the string and
works its way backwards to the first character in the string, printing each letter on a separate line, except backwards.

In [None]:
index = len(fruit) - 1

while index >= 0:
    letter = fruit[index]
    print(letter)
    index = index - 1

Another way to write a traversal is with a for loop. Each time through the loop, the next character in the string is assigned to the variable char. The loop continues until no characters are left:

In [None]:
for char in fruit:
    print(char)

#### 6.4 String slices

A segment of a string is called a slice. Selecting a slice is similar to selecting a character:

In [None]:
s = 'Awesome Python'
print(s[0:5])

In [None]:
print(s[6:12])

The operator [n:m] returns the part of the string from the “n-th” character to the “m-th” character, including the first but excluding the last.

If you omit the first index (before the colon), the slice starts at the beginning of the string. If you omit the second index, the slice goes to the end of the string:

In [None]:
s[:3]

In [None]:
s[3:]

If the first index is greater than or equal to the second, the result is an empty string, represented by two quotation marks. An empty string contains no characters and has length 0, but other than that, it is the same as any other string:

In [None]:
s[5:5]
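
The slicing rules from this section can be collected in one sketch. The index values below were chosen for illustration so that each slice picks out a whole word; the names first_word and second_word are assumptions.

```python
s = 'Awesome Python'
first_word = s[0:7]      # characters at indexes 0 through 6
second_word = s[8:14]    # characters at indexes 8 through 13
print(first_word)
print(second_word)

# Omitting an index extends the slice to that end of the string.
print(s[:7] + s[7:] == s)

# A start index at or past the end index gives an empty string.
print(len(s[5:5]))
print(s[9:3] == '')
```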

Exercise 2: 

Given that fruit is a string, what does fruit[:] mean?

In [None]:
s[:]

#### 6.5 Strings are immutable

It is tempting to use the [] operator on the left side of an assignment, with the intention of changing a character in a string. For example:

In [None]:
greeting = 'Hello there!'
greeting[1] = 'a'

Running this code raises TypeError: 'str' object does not support item assignment. The “object” in this case is the string and the “item” is the character you tried to assign. For now, an object is the same thing as a value, but we will refine that definition later. An item is one of the values in a sequence.

The reason for the error is that strings are immutable, which means you can’t change an existing string. The best you can do is create a new string that is a variation on the original. The example below concatenates a new first letter onto a slice of greeting. It has no effect on the original string.

In [None]:
greeting = 'Hello there!'
fine_greeting = 'A' + greeting[1:]
print(fine_greeting)

#### 6.6 Looping and counting

The following program counts the number of times the letter “a” appears in a string:

In [None]:
word = 'banana'
count = 0
for letter in word:
    if letter == 'a':
        count = count + 1
print(count)

This program demonstrates another pattern of computation called a counter. The variable count is initialized to 0 and then incremented each time an “a” is found. When the loop exits, count contains the result: the total number of a’s.

Exercise 3: 

Encapsulate this code in a function named count, and generalize it
so that it accepts the string and the letter as arguments.

In [None]:
def count(word, letter):
    # Count how many times letter appears in word
    total = 0
    for char in word:
        if char == letter:
            total = total + 1
    return total

user_word = input('Enter a word:\n')
user_letter = input('Enter a letter to count:\n')
print('The result is:', count(user_word, user_letter))

#### 6.7 The in operator

The word in is a boolean operator that takes two strings and returns True if the first appears as a substring in the second:

In [None]:
'a' in 'banana'

In [None]:
'you are' in 'you are amazing, just the way you are!'
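
Because in produces a boolean, it fits naturally into an if statement. The sketch below uses hypothetical variable names (has_amazing, has_boring, reversed_check) and also shows that the order of the two strings matters.

```python
sentence = 'you are amazing, just the way you are!'
has_amazing = 'amazing' in sentence
has_boring = 'boring' in sentence

if has_amazing and not has_boring:
    print('Found "amazing" and no "boring".')

# The first string must appear inside the second, not the other way around.
reversed_check = 'banana' in 'a'
print(reversed_check)
```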

#### 6.8 String comparison

The comparison operators work on strings. To see if two strings are equal:

In [None]:
if word == 'banana':
    print('Perfect, bananas.')

Other comparison operations are useful for putting words in alphabetical order:

In [None]:
if word < 'banana':
    print('Your word, ' + word + ', comes before banana.')
elif word > 'banana':
    print('Your word, ' + word + ', comes after banana.')
else:
    print('Perfect')

Python does not handle uppercase and lowercase letters the same way that people do. All the uppercase letters come before all the lowercase letters, so:

Your word, Pineapple, comes before banana.


A common way to address this problem is to convert strings to a standard format, such as all lowercase, before performing the comparison. Keep that in mind in case you have to defend yourself against a man armed with a Pineapple.
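
A small sketch of that fix, assuming the word is stored in a variable named word and using the lower method to normalize it before comparing:

```python
word = 'Pineapple'

# Raw comparison: uppercase 'P' sorts before lowercase 'b'.
raw_before = word < 'banana'

# Normalized comparison: 'pineapple' is compared with 'banana' letter by letter.
normalized_before = word.lower() < 'banana'

print(raw_before)
print(normalized_before)
```

After converting to lowercase, the comparison matches ordinary alphabetical order.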

#### 6.9 String methods

Strings are an example of Python objects. An object contains both data (the actual string itself) and methods, which are effectively functions that are built into the object and are available to any instance of the object.

Python has a function called dir which lists the methods available for an object. The type function shows the type of an object and the dir function shows the available methods.

In [None]:
stuff = 'Hello there!'
type(stuff)

In [None]:
dir(stuff)

In [None]:
help(str.capitalize)

While the dir function lists the methods, and you can use help to get some simple documentation on a method, a better source of documentation for string methods would be

https://docs.python.org/library/stdtypes.html#string-methods.

Calling a method is similar to calling a function (it takes arguments and returns a value) but the syntax is different. We call a method by appending the method name to the variable name using the period as a delimiter.

For example, the method upper takes a string and returns a new string with all uppercase letters. Instead of the function syntax upper(word), it uses the method syntax word.upper():

In [None]:
word = 'banana'
new_word = word.upper()
new_word

This form of dot notation specifies the name of the method, upper, and the name of the string to apply the method to, word. The empty parentheses indicate that this method takes no argument.


A method call is called an invocation; in this case, we would say that we are invoking upper on the word.

For example, there is a string method named find that searches for the position of one string within another:

In [None]:
word = 'banana'
index = word.find('a')
index

In this example, we invoke find on word and pass the letter we are looking for as a parameter.
The find method can find substrings as well as characters:

In [None]:
word.find('na')

It can take as a second argument the index where it should start:

In [None]:
word.find('na', 3)
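
The following sketch gathers the find calls in one place and adds one case not shown above: when the substring does not occur, find returns -1 rather than raising an error. The variable names first, later, and missing are assumptions for the example.

```python
word = 'banana'
first = word.find('na')       # search from the beginning
later = word.find('na', 3)    # search starting at index 3
missing = word.find('z')      # not present, so find returns -1
print(first, later, missing)
```

Checking for -1 is a simple way to test whether a search succeeded before using the index in a slice.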

One common task is to remove white space (spaces, tabs, or newlines) from the beginning and end of a string using the strip method:

In [None]:
line = '     There you go   '
line.strip()

Some methods such as startswith return boolean values.

In [None]:
line = 'Have a nice day, dear!'
line.startswith('Have')

In [None]:
line.startswith('have')

You will note that startswith requires case to match, so sometimes we take a line and map it all to lowercase before we do any checking using the lower method.

In [None]:
line = 'Have a nice day, dear!'
line.startswith('h')

In [None]:
line.lower()

In [None]:
line.lower().startswith('h')

In the last example, the method lower is called and then we use startswith to see if the resulting lowercase string starts with the letter “h”. As long as we are careful with the order, we can make multiple method calls in a single expression.

#### 6.10 Parsing strings

Often, we want to look into a string and find a substring. For example if we were presented a series of lines formatted as follows:

From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008 

and we wanted to pull out only the second half of the address (i.e., uct.ac.za) from each line, we can do this by using the find method and string slicing.

First, we will find the position of the at-sign in the string. Then we will find the position of the first space after the at-sign. And then we will use string slicing to extract the portion of the string which we are looking for.

In [None]:
data = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'
atpos = data.find('@')
atpos

In [None]:
sppos = data.find(' ', atpos)
sppos

In [None]:
host = data[atpos+1:sppos]
host

We use a version of the find method which allows us to specify a position in the string where we want find to start looking. When we slice, we extract the characters from “one beyond the at-sign through up to but not including the space character”.
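
The three steps can be wrapped in a function so they can be applied to many lines. This is only a sketch: the name extract_host and its handling of lines without an at-sign or a trailing space are assumptions, not part of the original example.

```python
def extract_host(line):
    # Step 1: find the at-sign; find returns -1 when it is absent.
    atpos = line.find('@')
    if atpos == -1:
        return None
    # Step 2: find the first space after the at-sign.
    sppos = line.find(' ', atpos)
    if sppos == -1:
        sppos = len(line)      # no space: the host runs to the end of the line
    # Step 3: slice from one past the at-sign up to the space.
    return line[atpos + 1:sppos]

data = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'
host = extract_host(data)
print(host)
```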

#### 6.11 Formatted String Literals

A formatted string literal (often referred to simply as an f-string) allows Python expressions to be used within string literals. This is accomplished by prepending an f to the string literal and enclosing expressions in curly braces {}.

For example, wrapping a variable name in curly braces inside an f-string will cause it to be replaced by its value:

In [None]:
camels = 42
f'{camels}'

The result is the string ‘42’, which is not to be confused with the integer value 42.

An expression can appear anywhere in the string, so you can embed a value in a sentence:

In [None]:
camels = 42
f'I have spotted {camels} camels.'

Several expressions can be included within a single string literal in order to create more complex strings.

In [None]:
years = 3
count = 1
species = 'camels'
f'In {years} years I have spotted {count} {species}.'
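
The braces can hold more than a variable name. As a minimal sketch, the example below places an arithmetic expression inside the braces; the variable legs_per_camel is an assumption added for illustration.

```python
camels = 42
legs_per_camel = 4   # hypothetical value for the example

# The expression inside the braces is evaluated, then converted to text.
message = f'{camels} camels have {camels * legs_per_camel} legs in total.'
print(message)

# An f-string always produces a string, even when it holds only a number.
print(f'{camels}' == str(camels))
```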

#### 6.14 Exercises

Exercise 5: 

Slicing strings

Take the following Python code that stores a string:

str = 'X-DSPAM-Confidence: 0.8475'

Use find and string slicing to extract the portion of the string after the colon character and then use the float function to convert the extracted string into a floating point number.

In [None]:
data = 'X-DSPAM-Confidence: 0.8475'
d_find = data.find(':')
d_find

In [None]:
sliced_data = float(data[d_find+1:])
print(sliced_data)
type(sliced_data)

## Chapter 7

### Files

#### 7.1 Persistence

So far, we have learned how to write programs and communicate our intentions to the Central Processing Unit using conditional execution, functions, and iterations. We have learned how to create and use data structures in the Main Memory. The CPU and memory are where our software works and runs. It is where all of the “thinking” happens.

But if you recall from our hardware architecture discussions, once the power is turned off, anything stored in either the CPU or main memory is erased. So up to now, our programs have just been transient fun exercises to learn Python.

In this chapter, we start to work with Secondary Memory (or files). Secondary memory is not erased when the power is turned off. Or in the case of a USB flash drive, the data we write from our programs can be removed from the system and transported to another system.

We will primarily focus on reading and writing text files such as those we create in a text editor. Later we will see how to work with database files which are binary files, specifically designed to be read and written through database software.

#### 7.2 Opening files

When we want to read or write a file (say on your hard drive), we first must open the file. Opening the file communicates with your operating system, which knows where the data for each file is stored. When you open a file, you are asking the operating system to find the file by name and make sure the file exists.

In this example, we open the file mbox.txt, which should be stored in the same folder that you are in when you start Python. You can download this file from www.py4e.com/code3/mbox.txt

In [None]:
fhand = open('mbox.txt')
fhand

If the open is successful, the operating system returns us a file handle. The file handle is not the actual data contained in the file, but instead it is a “handle” that we can use to read the data. You are given a handle if the requested file exists and you have the proper permissions to read the file.

If the file does not exist, open will fail with a traceback and you will not get a handle to access the contents of the file:

In [None]:
fhand = open('stuff.txt')
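
If you would rather handle the failure than see a traceback, one possible approach is to wrap the call in try and except. This is a sketch under assumptions: the file name below is hypothetical and is expected not to exist, and the flag opened is introduced only to record the outcome.

```python
missing_name = 'no_such_file_for_this_sketch.txt'   # hypothetical file name
try:
    fhand = open(missing_name)
    opened = True
    fhand.close()
except FileNotFoundError as err:
    # open raises FileNotFoundError when the file cannot be found
    opened = False
    print('Could not open file:', err)
```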

#### 7.3 Text files and lines

A text file can be thought of as a sequence of lines, much like a Python string can be thought of as a sequence of characters. For example, this is a sample of a text file which records mail activity from various individuals in an open source project development team:

From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008

Return-Path: <postmaster@collab.sakaiproject.org>

Date: Sat, 5 Jan 2008 09:12:18 -0500

To: source@collab.sakaiproject.org

From: stephen.marquard@uct.ac.za

Subject: [sakai] svn commit: r39772 - content/branches/

Details: http://source.sakaiproject.org/viewsvn/?view=rev&rev=39772

...

The entire file of mail interactions is available from

www.py4e.com/code3/mbox.txt

and a shortened version of the file is available from

www.py4e.com/code3/mbox-short.txt

These files are in a standard format for a file containing multiple mail messages. The lines which start with “From” separate the messages and the lines which start with “From:” are part of the messages. For more information about the mbox format, see https://en.wikipedia.org/wiki/Mbox.

To break the file into lines, there is a special character that represents the “end of the line” called the newline character.

In Python, we represent the newline character as a backslash-n in string constants. Even though this looks like two characters, it is actually a single character. When we look at the variable by entering “stuff” in the interpreter, it shows us the \n in the string, but when we use print to show the string, we see the string broken into two lines by the newline character.

In [None]:
stuff = 'Hello\nWorld!'
stuff

In [None]:
print(stuff)

In [None]:
stuff = 'X\nY'
print(stuff)

In [None]:
len(stuff)

You can also see that the length of the string X\nY is three characters because the newline character is a single character.

So when we look at the lines in a file, we need to imagine that there is a special invisible character called the newline at the end of each line that marks the end of the line.

So the newline character separates the characters in the file into lines.
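As a small illustration, the sketch below treats a string as if it were the contents of a two-line file (the text is made up) and uses the string method splitlines with keepends=True to show where each line ends.

```python
# Made-up file contents: two lines, each ending with a newline character
text = 'first line\nsecond line\n'

# keepends=True keeps the newline at the end of each piece
lines = text.splitlines(keepends=True)
print(lines)
print(text.count('\n'))   # number of newline characters in the text
```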

#### 7.4 Reading files

While the file handle does not contain the data for the file, it is quite easy to construct a for loop to read through and count each of the lines in a file:

In [None]:
fhand = open('mbox-short.txt')
count = 0
for line in fhand:
    count = count + 1
print('Line count:', count)

We can use the file handle as the sequence in our for loop. Our for loop simply counts the number of lines in the file, and the program then prints the total. The rough translation of the for loop into English is, “for each line in the file represented by the file handle, add one to the count variable.”

The reason that the open function does not read the entire file is that the file might be quite large with many gigabytes of data. The open call takes the same amount of time regardless of the size of the file. The for loop actually causes the data to be read from the file.

When the file is read using a for loop in this manner, Python takes care of splitting the data in the file into separate lines using the newline character. Python reads each line through the newline and includes the newline as the last character in the line variable for each iteration of the for loop.

Because the for loop reads the data one line at a time, it can efficiently read and count the lines in very large files without running out of main memory to store the data. The above program can count the lines in any size file using very little memory since each line is read, counted, and then discarded.
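The following sketch shows the same pattern on a tiny file that the example builds for itself. The file name and its three lines are invented for the illustration; the point is that only one line is held in the variable line at any moment, and that each line still carries its trailing newline.

```python
import os
import tempfile

# Build a small sample file (contents are invented for this example)
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
fout = open(path, 'w')
fout.write('From: a@example.com\nSubject: hello\nFrom: b@example.com\n')
fout.close()

count = 0
longest = 0
fhand = open(path)
for line in fhand:
    # Only the current line is in memory here
    count = count + 1
    if len(line) > longest:
        longest = len(line)   # the length includes the trailing newline
fhand.close()

print('Line count:', count)
print('Longest line:', longest)
```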

If you know the file is relatively small compared to the size of your main memory, you can read the whole file into one string using the read method on the file handle.

In [None]:
fhand = open('mbox-short.txt')
inp = fhand.read()
print(len(inp))

In [None]:
print(inp[:20])

In this example, the entire contents (all 94,626 characters) of the file mbox-short.txt are read directly into the variable inp. We use string slicing to print out the first 20 characters of the string data stored in inp.

When the file is read in this manner, all the characters including all of the lines and newline characters are one big string in the variable inp. It is a good idea to store the output of read as a variable because each call to read exhausts the resource:

In [None]:
fhand = open('mbox-short.txt')
print(len(fhand.read()))

In [None]:
print(len(fhand.read()))

Remember that reading a whole file with a single call to read should only be done if the file data will fit comfortably in the main memory of your computer. If the file is too large to fit in main memory, you should write your program to read the file in chunks using a for or while loop.
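One way to read in chunks is to pass a size to read inside a while loop. This is only a sketch: the chunk size of 1000 characters and the 2500-character stand-in file are assumptions chosen to keep the example small.

```python
import os
import tempfile

# A stand-in for a file that is too large to read in one go
path = os.path.join(tempfile.mkdtemp(), 'big.txt')
fout = open(path, 'w')
fout.write('x' * 2500)
fout.close()

chunk_size = 1000     # assumed chunk size, in characters
total = 0
chunks = 0
fhand = open(path)
while True:
    chunk = fhand.read(chunk_size)   # read at most chunk_size characters
    if chunk == '':                  # an empty string means end of file
        break
    total = total + len(chunk)
    chunks = chunks + 1
fhand.close()

print(chunks, total)
```

Each pass through the loop replaces the previous chunk, so memory use stays near chunk_size no matter how large the file is.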

#### 7.5 Searching through a file

When you are searching through data in a file, it is a very common pattern to read through a file, ignoring most of the lines and only processing lines which meet a particular condition. We can combine the pattern for reading a file with string methods to build simple search mechanisms.

For example, if we wanted to read a file and only print out lines which started with the prefix “From:”, we could use the string method startswith to select only those lines with the desired prefix:

In [None]:
fhand = open('mbox-short.txt')
for line in fhand:
    if line.startswith('From:'):
        print(line)

The output looks great since the only lines we are seeing are those which start with “From:”, but why are we seeing the extra blank lines? This is due to that invisible newline character. Each of the lines ends with a newline, so print outputs the string in the variable line, which already includes a newline, and then adds another newline of its own, resulting in the double spacing effect we see.

We could use slicing to print all but the last character of each line, but a simpler approach is to use the rstrip method, which strips whitespace from the right side of a string, as follows:

In [None]:
fhand = open('mbox-short.txt')
for line in fhand:
    line = line.rstrip()
    if line.startswith('From:'):
        print(line)
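To compare the two approaches side by side, here is a short sketch using two made-up lines. The second line imitates the last line of a file that does not end with a newline, which is where slicing and rstrip give different answers.

```python
line = 'From: stephen.marquard@uct.ac.za\n'

print(repr(line[:-1]))       # slicing drops the last character, whatever it is
print(repr(line.rstrip()))   # rstrip drops trailing whitespace, including the newline

# Hypothetical final line of a file, with no newline at the end
last = 'From: someone@example.com'
print(repr(last[:-1]))       # slicing loses the final 'm'
print(repr(last.rstrip()))   # rstrip leaves the line unchanged
```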

As your file processing programs get more complicated, you may want to structure your search loops using continue. The basic idea of the search loop is that you are looking for “interesting” lines and effectively skipping “uninteresting” lines. And then when we find an interesting line, we do something with that line.

We can structure the loop to follow the pattern of skipping uninteresting lines as follows:

In [None]:
fhand = open('mbox-short.txt')
for line in fhand:
    line = line.rstrip()
    # skip 'uninteresting' lines
    if not line.startswith('From:'):
        continue
    # process our 'interesting' line
    print(line)

The output of the program is the same. In English, the uninteresting lines are those which do not start with “From:”, which we skip using continue. For the “interesting” lines (i.e., those that start with “From:”) we perform the processing.

We can use the find string method to simulate a text editor search that finds lines where the search string is anywhere in the line. Since find looks for an occurrence of a string within another string and either returns the position of the string or -1 if the string was not found, we can write the following loop to show lines which contain the string “@uct.ac.za” (i.e., they come from the University of Cape Town in South Africa):

In [None]:
fhand = open('mbox-short.txt')
for line in fhand:
    line = line.rstrip()
    if line.find('@uct.ac.za') == -1:
        continue
    print(line)

#### 7.6 Letting the user choose the file name

We really do not want to have to edit our Python code every time we want to process a different file. It would be more usable to ask the user to enter the file name string each time the program runs so they can use our program on different files without changing the Python code.

This is quite simple to do by reading the file name from the user using input as follows:

In [None]:
fname = input('Enter the file name:')
fhand = open(fname)
count = 0
for line in fhand:
    if line.startswith('Subject:'):
        count = count + 1
print(f'There were {count} subject lines in {fname}')

#### 7.7 Using try, except, and open

What if our user types something that is not a file name?

Users will eventually do every possible thing they can do to break your programs, either mistakenly or with malicious intent. As a matter of fact, an important part of any software development team is a person or group called Quality Assurance (or QA for short) whose very job it is to do the craziest things possible in an attempt to break the software that the programmer has created.

The QA team is responsible for finding the flaws in programs before we have delivered the program to the end users who may be purchasing the software or paying our salary to write the software. So the QA team is the programmer’s best friend.

So now that we see the flaw in the program, we can elegantly fix it using the try/except structure. We need to assume that the open call might fail and add recovery code when the open fails as follows:

In [None]:
fname = input('Enter the file name:')
try:
    fhand = open(fname)
except:
    print('File cannot be opened:', fname)
    exit()
count = 0
for line in fhand:
    if line.startswith('Subject:'):
        count = count + 1
print(f'There were {count} subject lines in {fname}')

The exit function terminates the program. It is a function that we call that never returns. Now when our user (or QA team) types in silliness or bad file names, we “catch” them and recover gracefully.

Protecting the open call is a good example of the proper use of try and except in a Python program. We use the term “Pythonic” when we are doing something the “Python way”. We might say that the above example is the Pythonic way to open a file.

Once you become more skilled in Python, you can engage in repartee with other Python programmers to decide which of two equivalent solutions to a problem is “more Pythonic”. The goal to be “more Pythonic” captures the notion that programming is part engineering and part art. We are not always interested in just making something work, we also want our solution to be elegant and to be appreciated as elegant by our peers.
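As one possible variation, the sketch below wraps the same logic in a function and catches OSError rather than every exception. The function name count_subject_lines, the choice of OSError, and the sample file are assumptions for this illustration, not the only reasonable design; OSError is the family of errors open raises for missing files and permission problems.

```python
import os
import tempfile

def count_subject_lines(fname):
    """Return the number of 'Subject:' lines, or None if the file cannot be opened."""
    try:
        fhand = open(fname)
    except OSError:
        # Covers a missing file, a folder given by mistake, or missing permissions
        return None
    count = 0
    for line in fhand:
        if line.startswith('Subject:'):
            count = count + 1
    fhand.close()
    return count

# Build a small sample file so the example runs anywhere
folder = tempfile.mkdtemp()
good = os.path.join(folder, 'mail.txt')
fout = open(good, 'w')
fout.write('Subject: one\nFrom: x\nSubject: two\n')
fout.close()
missing = os.path.join(folder, 'missing.txt')

print(count_subject_lines(good))
print(count_subject_lines(missing))
```

Returning None instead of calling exit leaves the decision about what to do next with the caller, which can be convenient once a program processes more than one file.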

#### 7.8 Writing files

To write a file, you have to open it with mode “w” as a second parameter:

In [None]:
fout = open('output.txt', 'w')
print(fout)

If the file already exists, opening it in write mode clears out the old data and starts fresh, so be careful! If the file doesn’t exist, a new one is created.

The write method of the file handle object puts data into the file, returning the number of characters written. The default write mode is text for writing (and reading) strings.

In [None]:
line1 = 'This is a beautiful day, my dear!\n'
fout.write(line1)

Again, the file object keeps track of where it is, so if you call write again, it adds the new data to the end.

We must make sure to manage the ends of lines as we write to the file by explicitly inserting the newline character when we want to end a line. The print function automatically appends a newline, but the write method does not add the newline automatically.

In [None]:
line2 = 'Do you agree, my dear!\n'
fout.write(line2)

When you are done writing, you have to close the file to make sure that the last bit of data is physically written to the disk so it will not be lost if the power goes off.

In [None]:
fout.close()

We could close the files which we open for read as well, but we can be a little sloppy if we are only opening a few files since Python makes sure that all open files are closed when the program ends. When we are writing files, we want to explicitly close the files so as to leave nothing to chance.
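If you would rather not have to remember to call close, Python's with statement closes the file for you when the indented block ends. The with statement is not used elsewhere in this chapter, so treat the following as an optional sketch; the temporary folder is an assumption that keeps the example from overwriting a real output.txt.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'output.txt')

with open(path, 'w') as fout:
    fout.write('This is a beautiful day, my dear!\n')
    fout.write('Do you agree, my dear!\n')

# Leaving the with block has already closed the file
print(fout.closed)

with open(path) as fhand:
    contents = fhand.read()
print(contents)
```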


#### 7.9 Debugging

When you are reading and writing files, you might run into problems with whitespace. These errors can be hard to debug because spaces, tabs, and newlines are normally invisible:

In [None]:
s = '1 2\t 3\n 4'
print(s)

The built-in function repr can help. It takes any object as an argument and returns a string representation of the object. For strings, it represents whitespace characters with backslash sequences:

In [None]:
print(repr(s))

This can be helpful for debugging.

One other problem you might run into is that different systems use different characters to indicate the end of a line. Some systems use a newline, represented \n. Others use a return character, represented \r. Some use both. If you move files between different systems, these inconsistencies might cause problems.
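A quick sketch of the difference, using two made-up strings that stand for the same line as it might look on two systems. repr makes the extra character visible, and rstrip removes either ending because both \r and \n count as whitespace.

```python
unix_line = 'Hello\n'        # line ending with a newline only
windows_line = 'Hello\r\n'   # line ending with a return followed by a newline

print(repr(unix_line), repr(windows_line))
same_raw = unix_line == windows_line
same_stripped = unix_line.rstrip() == windows_line.rstrip()
print(same_raw)        # the raw strings differ
print(same_stripped)   # after rstrip they match
```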

#### 7.10 Exercises

Exercise 1: 

Write a program to read through a file and print the contents of the
file (line by line) all in upper case. Executing the program will look as follows:

python shout.py

Enter a file name: mbox-short.txt

FROM STEPHEN.MARQUARD@UCT.AC.ZA SAT JAN 5 09:14:16 2008

RETURN-PATH: <POSTMASTER@COLLAB.SAKAIPROJECT.ORG>

RECEIVED: FROM MURDER (MAIL.UMICH.EDU [141.211.14.90])

    BY FRANKENSTEIN.MAIL.UMICH.EDU (CYRUS V2.3.8) WITH LMTPA;
    SAT, 05 JAN 2008 09:14:16 -0500

You can download the file from www.py4e.com/code3/mbox-short.txt

In [None]:
fname = input('Enter a file name:')

try:
    fhand = open(fname)
except:
    print('File cannot be opened:', fname)
    exit()
for line in fhand:
    print(line.rstrip().upper())

Exercise 2: 

Write a program to prompt for a file name, and then read through
the file and look for lines of the form:

X-DSPAM-Confidence: 0.8475

When you encounter a line that starts with “X-DSPAM-Confidence:” pull apart the line to extract the floating-point number on the line. Count these lines and then compute the total of the spam confidence values from these lines. When you reach the end of the file, print out the average spam confidence.

Enter the file name: mbox.txt

Average spam confidence: 0.894128046745


Enter the file name: mbox-short.txt

Average spam confidence: 0.750718518519


Test your program on the mbox.txt and mbox-short.txt files.

In [None]:
fname = input('Enter the file name: ')

try:
    fhand = open(fname)
except:
    print('File cannot be opened:', fname)
    exit()

line_count = 0
total_confidence = 0.0
for line in fhand:
    if line.startswith('X-DSPAM-Confidence:'):
        # The number follows the colon; strip the surrounding whitespace
        confidence = float(line.split(':')[1].strip())
        total_confidence = total_confidence + confidence
        line_count = line_count + 1

# Guard against dividing by zero when no matching lines were found
if line_count > 0:
    average_confidence = total_confidence / line_count
    print(f'Average spam confidence: {average_confidence}')

Exercise 3:

Sometimes when programmers get bored or want to have a bit of fun, they add a harmless Easter Egg to their program. Modify the program that prompts the user for the file name so that it prints a funny message when the user types in the exact file name “na na boo boo”. The program should behave normally for all other files which exist and don’t exist. Here is a sample execution of the program:

python egg.py

Enter the file name: mbox.txt

There were 1797 subject lines in mbox.txt

python egg.py

Enter the file name: missing.tyxt

File cannot be opened: missing.tyxt

python egg.py

Enter the file name: na na boo boo

NA NA BOO BOO TO YOU - You have been punk'd!

In [None]:
fname = input('Enter the file name: ')

# Easter egg: check for the special name before trying to open anything
if fname == 'na na boo boo':
    print("NA NA BOO BOO TO YOU - You have been punk'd!")
    exit()

try:
    fhand = open(fname)
except:
    print('File cannot be opened:', fname)
    exit()

count = 0
for line in fhand:
    if line.startswith('Subject:'):
        count = count + 1

print(f'There were {count} subject lines in {fname}')
    

## Chapter 8

### Lists

#### 8.1 A list is a sequence

Like a string, a list is a sequence of values. In a string, the values are characters; in a list, they can be any type. The values in lists are called elements or sometimes items.

There are several ways to create a new list; the simplest is to enclose the elements in square brackets (“[” and “]”):

In [None]:
[10, 20, 30, 40]                 # a list of four integers
['AAA', 'BBB', 'CCC', 'DDD']     # a list of four strings

The first example is a list of four integers. The second is a list of four strings. The elements of a list don’t have to be the same type. The following list contains a string, a float, an integer, and (lo!) another list:

In [None]:
['AAA', 3.0, 7, [10, 20]]

A list within another list is nested.

A list that contains no elements is called an empty list; you can create one with empty brackets, [].

As you might expect, you can assign list values to variables:

In [None]:
cheese = ['cheddar', 'mozzarella']
num = [17, 123]
blank = []

print(cheese, num, blank)

#### 8.2 Lists are mutable

The syntax for accessing the elements of a list is the same as for accessing the characters of a string: the bracket operator. The expression inside the brackets specifies the index. Remember that the indices start at 0:

In [None]:
print(cheese[0])

Unlike strings, lists are mutable because you can change the order of items in a list or reassign an item in a list. When the bracket operator appears on the left side of an assignment, it identifies the element of the list that will be assigned.

In [None]:
num = [17, 123]
num[1] = 5
print(num)

The one-th element of num, which used to be 123, is now 5.

You can think of a list as a relationship between indices and elements. This relationship is called a mapping; each index “maps to” one of the elements.

List indices work the same way as string indices:

• Any integer expression can be used as an index.

• If you try to read or write an element that does not exist, you get an IndexError.

• If an index has a negative value, it counts backward from the end of the list.

The in operator also works on lists.

In [None]:
cheese = ['cheddar', 'mozzarella']
'cheddar' in cheese
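The three rules about indices listed above can be tried out in a few lines. This is a minimal sketch; the third cheese is added only so that the list is long enough to make the example interesting.

```python
cheese = ['cheddar', 'mozzarella', 'gouda']   # 'gouda' is added for this example

i = 1
print(cheese[i + 1])    # any integer expression can be used as an index
print(cheese[-1])       # a negative index counts backward from the end

try:
    cheese[3]           # there is no element at index 3
    failed = False
except IndexError:
    failed = True
    print('IndexError: no element at index 3')
```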

#### 8.3 Traversing a list

The most common way to traverse the elements of a list is with a for loop. The syntax is the same as for strings:

In [None]:
for cheez in cheese:
    print(cheez)

This works well if you only need to read the elements of the list. But if you want to write or update the elements, you need the indices. A common way to do that is to combine the functions range and len:

In [None]:
for i in range(len(num)):
    num[i] = num[i] * 2

This loop traverses the list and updates each element. len returns the number of elements in the list. range produces the indices from 0 to n − 1, where n is the length of the list. Each time through the loop, i gets the index of the next element. The assignment statement in the body uses i to read the old value of the element and to assign the new value.

A for loop over an empty list never executes the body:

In [None]:
for x in blank:
    print('this never happens')

Although a list can contain another list, the nested list still counts as a single element. The length of this list is four:

In [None]:
['spam', 1, ['Brie', 'Roquefort', 'Pol le Veq'], [1, 2, 3]]
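To check this, the sketch below assigns the same list to a variable (the name t is chosen only for this example) and asks for its length, then looks inside one of the nested lists.

```python
t = ['spam', 1, ['Brie', 'Roquefort', 'Pol le Veq'], [1, 2, 3]]

print(len(t))       # the nested lists count as one element each
print(len(t[2]))    # the length of the first nested list
print(t[2][0])      # the first element of the first nested list
```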

#### 8.4 List Operations

The + operator concatenates lists:

In [None]:
a = [1, 2, 3]
b = [4, 5, 6]
c = a + b
print(c)

Similarly, the * operator repeats a list a given number of times:

In [None]:
[0] * 4          # [0, 0, 0, 0]
[1, 2, 3] * 3    # [1, 2, 3, 1, 2, 3, 1, 2, 3]

The first example repeats [0] four times. The second example repeats the list [1, 2, 3] three times.

#### 8.5 List Slices

The slice operator also works on lists:

In [None]:
x = ['a', 'b', 'c', 'd', 'e', 'f']
x[1:3]

If you omit the first index, the slice starts at the beginning. If you omit the second, the slice goes to the end. So if you omit both, the slice is a copy of the whole list.

In [None]:
x[:4]

In [None]:
x[3:]

In [None]:
x[:]

Since lists are mutable, it is often useful to make a copy before performing operations that fold, spindle, or mutilate lists.

A slice operator on the left side of an assignment can update multiple elements:

In [None]:
x = ['a', 'b', 'c', 'd', 'e', 'f']
x[1:3] = ['p', 'q']
print(x)

#### 8.6 List Methods

Python provides methods that operate on lists. For example, append adds a new element to the end of a list:

In [None]:
x = ['a', 'b', 'c']
x.append('d')
print(x)

extend takes a list as an argument and appends all of the elements:

In [None]:
x1 = ['a', 'b', 'c']
x2 = ['d', 'e']
x1.extend(x2)
print(x1)

This example leaves x2 unmodified.

sort arranges the elements of the list from low to high:

In [None]:
x = ['d', 'c', 'e', 'b', 'a']
x.sort()
print(x)

Most list methods are void; they modify the list and return None. If you accidentally write t = t.sort(), you will be disappointed with the result.

#### 8.7 Deleting elements

There are several ways to delete elements from a list. If you know the index of the element you want, you can use pop:

In [None]:
x = ['a', 'b', 'c']
y = x.pop(1)
print(x)
print(y)

pop modifies the list and returns the element that was removed. If you don’t provide an index, it deletes and returns the last element.

If you don’t need the removed value, you can use the del statement:

In [None]:
x = ['a', 'b', 'c']
del x[1]
print(x)

If you know the element you want to remove (but not the index), you can use remove:

In [None]:
x = ['a', 'b', 'c']
x.remove('b')
print(x)    # The return value from remove is None.

To remove more than one element, you can use del with a slice index:

In [None]:
x = ['a', 'b', 'c', 'd', 'e', 'f']
del x[1:5]
print(x)

As usual, the slice selects all the elements up to, but not including, the second index.

#### 8.8 Lists and Functions

There are a number of built-in functions that can be used on lists that allow you to quickly look through a list without writing your own loops:

In [None]:
x = [3, 41, 12, 9, 74, 15]
print(len(x))

In [None]:
print(max(x))

In [None]:
print(min(x))

In [None]:
print(sum(x))

The sum() function only works when the list elements are numbers. The other functions (max(), len(), etc.) work with lists of strings and other types that can be compared.
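A short sketch with a list of strings shows the difference; the words are made up for the example. max and min compare the strings alphabetically, while sum fails because it tries to add the strings to a number.

```python
words = ['pear', 'apple', 'fig']

print(len(words))
print(max(words))    # the last word in alphabetical order
print(min(words))    # the first word in alphabetical order

try:
    sum(words)
    sum_worked = True
except TypeError:
    sum_worked = False
    print('sum does not work on a list of strings')
```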

Using a list, we could rewrite an earlier program that computed the average of the numbers entered by the user.

First, the program to compute an average without a list:

In [None]:
total = 0
count = 0

while True:
    inp = input('Enter the number:')
    if inp == 'done':
        break
    try:
        value = float(inp)
    except:
        print('Enter a valid number.')
        continue
    total = total + value
    count = count + 1

print(f'The sum is: \n{total}')
print(f'Number of elements: \n{count}')
# Avoid dividing by zero when no numbers were entered
if count > 0:
    average = round((total / count), 2)
    print(f'Average is: {average}')

In this program, we have count and total variables to keep the number and running total of the user’s numbers as we repeatedly prompt the user for a number.

We could simply remember each number as the user entered it and use built-in functions to compute the sum and count at the end.

In [None]:
numlist = list()

while True:
    inp = input('Enter the number:')
    if inp == 'done':
        break
    try:
        value = float(inp)
    except:
        print('Enter a valid number.')
        continue
    numlist.append(value)

# Avoid dividing by zero when no numbers were entered
if len(numlist) > 0:
    average = sum(numlist) / len(numlist)
    print(f'Average is: {average}')

We make an empty list before the loop starts, and then each time we have a number, we append it to the list. At the end of the program, we simply compute the sum of the numbers in the list and divide it by the count of the numbers in the list to come up with the average.

#### 8.9 Lists and Strings

A string is a sequence of characters and a list is a sequence of values, but a list of characters is not the same as a string. To convert from a string to a list of characters, you can use list:

In [None]:
s = 'hello'
x = list(s)
print(x)

Because list is the name of a built-in function, you should avoid using it as a variable name. 

The list function breaks a string into individual letters. If you want to break a string into words, you can use the split method:

In [None]:
s = 'This is a beautiful day!'
x = s.split()
print(x)
print(x[1])

Once you have used split to break the string into a list of words, you can use the index operator (square bracket) to look at a particular word in the list.

You can call split with an optional argument called a delimiter that specifies which characters to use as word boundaries. The following example uses an exclamation mark as a delimiter:

In [None]:
s = 'hey!hey!hey!'
delimiter = '!'
x = s.split(delimiter)
print(x)
print(x[1])

join is the inverse of split. It takes a list of strings and concatenates the elements. join is a string method, so you have to invoke it on the delimiter and pass the list as a parameter:

In [None]:
a = 'Hey! nice to meet you'
x = a.split()
delimiter = ' '
print(x)
y = delimiter.join(x)
print(y)

In this case the delimiter is a space character, so join puts a space between words.

To concatenate strings without spaces, you can use the empty string, "", as a delimiter.
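For instance, this small sketch breaks a word into letters with list and puts it back together with join, first with the empty string and then with a hyphen. The variable names are chosen only for the example.

```python
letters = list('hello')
print(letters)

word = ''.join(letters)      # the empty string puts nothing between the elements
dashed = '-'.join(letters)   # a hyphen is placed between each pair of elements
print(word)
print(dashed)
```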

#### 8.10 Parsing lines

Usually when we are reading a file we want to do something to the lines other than just printing the whole line. Often we want to find the “interesting lines” and then parse the line to find some interesting part of the line. What if we wanted to print out the day of the week from those lines that start with “From”?

From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008

The split method is very effective when faced with this kind of problem. We can write a small program that looks for lines that start with “From ”, splits those lines, and then prints out the third word in the line:

In [None]:
fhand = open('mbox-short.txt')
for line in fhand:
    line = line.rstrip()
    if not line.startswith('From '):
        continue
    words = line.split()
    print(words[2])

#### 8.11 Objects and values

If we execute these assignment statements:

In [None]:
a = 'banana'
b = 'banana'
print(a)
print(b)

we know that a and b both refer to a string, but we don’t know whether they refer to the same string. There are two possible states:

In one case, a and b refer to two different objects that have the same value. In the second case, they refer to the same object.

To check whether two variables refer to the same object, you can use the is operator.

In [None]:
a is b

In this example, Python only created one string object, and both a and b refer to it.

But when you create two lists, you get two objects:
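A minimal sketch of that case follows. The particular list values are chosen only for illustration, and the == comparison is included to contrast it with is.

```python
a = [1, 2, 3]
b = [1, 2, 3]

print(a is b)   # two separate list objects
print(a == b)   # with the same elements
```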

In this case we would say that the two lists are equivalent, because they have the same elements, but not identical, because they are not the same object. If two objects are identical, they are also equivalent, but if they are equivalent, they are not necessarily identical.

Until now, we have been using “object” and “value” interchangeably, but it is more precise to say that an object has a value. If you execute a = [1,2,3], a refers to a list object whose value is a particular sequence of elements. If another list has the same elements, we would say it has the same value.

#### 8.12 Aliasing

If a refers to an object and you assign b = a, then both variables refer to the same object:

In [None]:
a = [1, 2, 3]
b = a
b is a

The association of a variable with an object is called a reference. In this example, there are two references to the same object.

An object with more than one reference has more than one name, so we say that the object is aliased.

If the aliased object is mutable, changes made with one alias affect the other:

In [None]:
b[0] = 17
print(a)

Although this behavior can be useful, it is error-prone. In general, it is safer to avoid aliasing when you are working with mutable objects.
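One way to avoid the alias is to copy the list with a slice before changing it. This is a sketch of that idea rather than the only option; after the copy, a change made through b no longer shows up in a.

```python
a = [1, 2, 3]
b = a[:]        # a slice of the whole list makes a new list object

b[0] = 17
print(a)        # the original list is untouched
print(b)
print(b is a)   # b is a separate object, not an alias
```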

For immutable objects like strings, aliasing is not as much of a problem. In this example:

In [None]:
a = 'banana'
b = 'banana'

it almost never makes a difference whether a and b refer to the same string or not.

#### 8.13 List arguments

When you pass a list to a function, the function gets a reference to the list. If the function modifies a list parameter, the caller sees the change. For example, delete_head removes the first element from a list:

In [None]:
def delete_head(t):
    del t[0]
    
letters = ['a', 'b', 'c']
delete_head(letters)
print(letters)

The parameter t and the variable letters are aliases for the same object.

It is important to distinguish between operations that modify lists and operations that create new lists. For example, the append method modifies a list, but the + operator creates a new list:

In [None]:
t1 = [1, 2]
t2 = t1.append(3)
print(t1)
print(t2)

In [None]:
t3 = t1 + [3]
print(t3)
t1 is t3

This difference is important when you write functions that are supposed to modify lists. For example, this function does not delete the head of a list:

In [None]:
def bad_delete_head(t):
    t = t[1:]

The slice operator creates a new list and the assignment makes t refer to it, but none of that has any effect on the list that was passed as an argument.
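Calling the function makes this visible. In the sketch below the function is repeated so that the example runs on its own, and the list letters is the same three-element list used earlier.

```python
def bad_delete_head(t):
    t = t[1:]    # rebinds the local name t; the caller's list is not touched

letters = ['a', 'b', 'c']
result = bad_delete_head(letters)

print(letters)   # still has all three elements
print(result)    # the function returns nothing useful
```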

An alternative is to write a function that creates and returns a new list. For example, tail returns all but the first element of a list:

In [None]:
def tail(t):
    return t[1:]

This function leaves the original list unmodified. Here’s how it is used:

In [None]:
letters = ['a', 'b', 'c']
rest = tail(letters)
print(rest)

Exercise 1: 

Write a function called chop that takes a list and modifies it, removing the first and last elements, and returns None. Then write a function called middle that takes a list and returns a new list that contains all but the first and last elements.

In [None]:
def chop(x):
    if len(x) > 1:
        del x[0]
        del x[-1]
    elif len(x) == 1:
        x.clear()
    return None



In [None]:
def middle(y):
    y = y[1:-1]
    return y 

In [None]:
z = ['a', 'b', 'c', 'd']
b = chop(z)
print(b)

In [None]:
z = ['a', 'b', 'c', 'd']
c = middle(z)
print(c)

#### 8.14 Debugging

Careless use of lists (and other mutable objects) can lead to long hours of debugging.

Here are some common pitfalls and ways to avoid them:

1. Don’t forget that most list methods modify the argument and return None. This is the opposite of the string methods, which return a new string and leave the original alone.

If you are used to writing string code like this:

In [None]:
word = 'hello'
word = word.strip()

It is tempting to write list code like this:

In [None]:
t = [1, 2, 3, 4]
t = t.sort()        # This is wrong!

Because sort returns None, the next operation you perform with t is likely to fail.

2. Pick an idiom and stick with it.

Part of the problem with lists is that there are too many ways to do things. For example, to remove an element from a list, you can use pop, remove, del, or even a slice assignment.

To add an element, you can use the append method or the + operator. But don’t forget that these are right:

In [None]:
x = 'd'
t = ['a', 'b', 'c']

t.append(x)     # modifies t in place
t = t + [x]     # creates a new list and assigns it to t
print(t)

Note the parentheses: append is a method, so it is called as t.append(x). Writing t.append[x] with square brackets fails with a TypeError, because a method cannot be indexed.

And these are wrong:

t.append([x])   # WRONG!

t = t.append(x) # WRONG!

t + [x]         # WRONG!

t = t + x       # WRONG!

Try out each of these examples in interactive mode to make sure you understand what they do. Notice that only the last one causes a runtime error; the other three are legal, but they do the wrong thing.
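If you would like to see all four side by side, here is a sketch that runs each wrong version on a fresh list. It assumes x is a single string rather than a list, and it uses separate variable names (t1 to t4 and result) so that one experiment does not disturb the next.

```python
x = 'd'

t1 = ['a', 'b', 'c']
t1.append([x])           # legal, but appends the list ['d'] as a single element
print(t1)

t2 = ['a', 'b', 'c']
result = t2.append(x)    # legal, but append returns None
print(result)

t3 = ['a', 'b', 'c']
t3 + [x]                 # legal, but the new list is thrown away
print(t3)

t4 = ['a', 'b', 'c']
try:
    t4 = t4 + x          # runtime error: a list can only be added to another list
except TypeError as err:
    print('TypeError:', err)
print(t4)
```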

3. Make copies to avoid aliasing.

If you want to use a method like sort that modifies the argument, but you need to keep the original list as well, you can make a copy.

In [None]:
orig = t[:]
t.sort()

In this example you could also use the built-in function sorted, which returns a new, sorted list and leaves the original alone. But in that case you should avoid using sorted as a variable name!

4. Lists, split, and files

When we read and parse files, there are many opportunities to encounter input that can crash our program, so it is a good idea to revisit the guardian pattern when writing programs that read through a file and look for a “needle in the haystack”.

Let’s revisit our program that looks for the day of the week on the From lines of our file:

From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008

Since we are breaking this line into words, we could dispense with the use of startswith and simply look at the first word of the line to determine if we are interested in the line at all. We can use continue to skip lines that don’t have “From” as the first word as follows:

In [None]:
fhand = open('mbox-short.txt')
for line in fhand:
    words = line.split()
    if words[0] != 'From':
        continue
    print(words[2])

It kind of works and we see the day from the first line (Sat), but then the program fails with a traceback error. What went wrong? What messed-up data caused our elegant, clever, and very Pythonic program to fail?

You could stare at it for a long time and puzzle through it or ask someone for help, but the quicker and smarter approach is to add a print statement. The best place to add the print statement is right before the line where the program failed and print out the data that seems to be causing the failure.

Now this approach may generate a lot of lines of output, but at least you will immediately have some clue as to the problem at hand. So we add a print of the variable words right before the if statement where the program failed. We even add a prefix “Debug:” to the line so we can keep our regular output separate from our debug output.

In [None]:
fhand = open('mbox-short.txt')
for line in fhand:
    words = line.split()
    print('Debug:', words)
    if words[0] != 'From':
        continue
    print(words[2])

When we run the program, a lot of output scrolls off the screen but at the end, we see our debug output and the traceback so we know what happened just before the traceback.

Each debug line is printing the list of words which we get when we split the line into words. When the program fails, the list of words is empty [].

The error occurs when our program encounters a blank line! Of course there are “zero words” on a blank line. Why didn’t we think of that when we were writing the code? When the code looks for the first word (words[0]) to check to see if it matches “From”, we get an “index out of range” error.

This of course is the perfect place to add some guardian code to avoid checking the first word if the first word is not there. There are many ways to protect this code; we will choose to check the number of words we have before we look at the first word:

In [None]:
fhand = open('mbox-short.txt')
for line in fhand:
    words = line.split()
#     print('Debug:', words)
    if len(words) == 0:
        continue
    elif words[0] != 'From':
        continue
    print(words[2])

First we commented out the debug print statement instead of removing it, in case our modification fails and we need to debug again. Then we added a guardian statement that checks to see if we have zero words, and if so, we use continue to skip to the next line in the file.

We can think of the two continue statements as helping us refine the set of lines which are “interesting” to us and which we want to process some more. A line which has no words is “uninteresting” to us so we skip to the next line. A line which does not have “From” as its first word is uninteresting to us so we skip it.

The program as modified runs successfully, so perhaps it is correct. Our guardian statement does make sure that the words[0] will never fail, but perhaps it is not enough. When we are programming, we must always be thinking, “What might go wrong?”
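
One thing that might still go wrong is a From line with fewer than three words, in which case words[2] would fail. The sketch below is one possible stronger guardian; it uses a short list of made-up sample lines (including a deliberately malformed one) instead of opening mbox-short.txt, and collects the days in a list named days that is not part of the original program.

```python
# Hypothetical sample lines standing in for the contents of a mail file.
lines = [
    'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008\n',
    '\n',
    'From: stephen.marquard@uct.ac.za\n',
    'From\n',                      # malformed: "From" with nothing after it
]

days = []
for line in lines:
    words = line.split()
    # Guardian: we need at least three words before touching words[2].
    if len(words) < 3 or words[0] != 'From':
        continue
    days.append(words[2])

print(days)   # ['Sat']
```

Because the length check comes first in the or expression, words[0] is never evaluated for an empty list.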

#### 8.16 Exercises

Exercise 4: Find all unique words in a file

Shakespeare used over 20,000 words in his works. But how would you determine that? How would you produce the list of all the words that Shakespeare used? Would you download all his work, read it and track all unique words by hand?

Let’s use Python to achieve that instead. List all unique words, sorted in alphabetical order, that are stored in a file romeo.txt containing a subset of Shakespeare’s work.

To get started, download a copy of the file www.py4e.com/code3/romeo.txt. Create a list of unique words, which will contain the final result. 

1. Write a program to open the file romeo.txt and read it line by line. 

2. For each line, split the line into a list of words using the split function. 

3. For each word, check to see if the word is already in the list of unique words. If the word is not in the list of unique words, add it to the list. 

4. When the program completes, sort and print the list of unique words in alphabetical order.

Enter file: romeo.txt

['Arise', 'But', 'It', 'Juliet', 'Who', 'already',
'and', 'breaks', 'east', 'envious', 'fair', 'grief',
'is', 'kill', 'light', 'moon', 'pale', 'sick', 'soft',
'sun', 'the', 'through', 'what', 'window',
'with', 'yonder']

In [None]:
fhand = open('romeo.txt')
unique_word = []

for line in fhand:
    line = line.strip()
    words = line.split()
    for word in words:
        # Add the word only if we have not seen it before
        if word not in unique_word:
            unique_word.append(word)

sorted_unique_word = sorted(unique_word)
    
print(f'The list of unique words is:\n {sorted_unique_word}')

Exercise 5: Minimalist Email Client.

MBOX (mail box) is a popular file format for storing and sharing a collection of emails. It was used by early email servers and desktop apps. Without getting into too many details, an MBOX file is a text file that stores emails consecutively. Emails are separated by a special line that starts with From (notice the space). Importantly, lines starting with From: (notice the colon) describe the email itself and do not act as separators. Imagine you wrote a minimalist email app that lists the email addresses of the senders in the user’s Inbox and counts the number of emails.

Write a program to read through the mail box data, and when you find a line that starts with “From”, split the line into words using the split function. We are interested in who sent the message, which is the second word on the From line.

From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008

You will parse the From line and print out the second word for each From line, then you will also count the number of From (not From:) lines and print out a count at the end. This is a good sample output with a few lines removed:

python fromcount.py

Enter a file name: mbox-short.txt

stephen.marquard@uct.ac.za

louis@media.berkeley.edu

zqian@umich.edu

[...some output removed...]

ray@media.berkeley.edu

cwen@iupui.edu

cwen@iupui.edu

cwen@iupui.edu

There were 27 lines in the file with From as the first word

In [None]:
file_name = input('Enter the file name: ')

try:
    fhand = open(file_name)  # Try to open the file
except FileNotFoundError:
    print(f'File cannot be opened: {file_name}')
    exit()  # Exit if the file is not found

count = 0  # Initialize the count variable

for line in fhand:
    line = line.rstrip()  # Remove trailing whitespace
    if line.startswith('From '):  # Match lines that start with 'From ' only
        words = line.split()  # Split the line into words
        count = count + 1  # Increment the count
        emails = words[1]  # Extract the email (2nd word)
        print(emails)  # Print the email address

print(f'There were {count} lines in the file with From as the first word')

Exercise 6:

Rewrite the program that prompts the user for a list of numbers and prints out the maximum and minimum of the numbers at the end when the user enters “done”. Write the program to store the numbers the user enters in a list and use the max() and min() functions to compute the maximum and minimum numbers after the loop completes.

Enter a number: 6

Enter a number: 2

Enter a number: 9

Enter a number: 3

Enter a number: 5

Enter a number: done

Maximum: 9.0

Minimum: 2.0

In [None]:
user_list = []  # Initialize an empty list
count = 0       # Initialize a counter

while True:
    inp = input('Enter a number (or type "done" to finish): ')
    
    # Break the loop if the user types 'done'
    if inp == 'done':
        break
    
    try:
        # Convert input to a number and append to the list
        num = float(inp)   # Use float to handle decimal numbers
        user_list.append(num)
        count = count + 1
    except ValueError:  # float() could not parse the input
        print('invalid input! Please enter a valid number.')
        
# Print the results
print(f'Your list of numbers is {user_list} and you have entered a total of {count} elements.')
if user_list:  # max() and min() fail on an empty list
    max_num = max(user_list)
    min_num = min(user_list)
    print(f'The maximum number in your list is {max_num}')
    print(f'The minimum number in your list is {min_num}')

## Chapter 9

### Dictionaries

A *dictionary* is like a list, but more general. In a list, the index positions have to be integers; in a dictionary, the indices can be (almost) any type.

You can think of a dictionary as a mapping between a set of indices (which are called keys) and a set of values. Each key maps to a value. The association of a key and a value is called a key-value pair or sometimes an item.

As an example, we’ll build a dictionary that maps from English to Spanish words, so the keys and the values are all strings.

The function dict creates a new dictionary with no items. Because dict is the name of a built-in function, you should avoid using it as a variable name.

In [None]:
eng2sp = dict()
print(eng2sp)

The curly brackets, {}, represent an empty dictionary. To add items to the dictionary, you can use square brackets:

In [None]:
eng2sp['one'] = 'uno'

This line creates an item that maps from the key 'one' to the value “uno”. If we print the dictionary again, we see a key-value pair with a colon between the key and value:

In [None]:
print(eng2sp)

This output format is also an input format. For example, you can create a new dictionary with three items.

In [None]:
eng2sp = {'one':'uno', 'two':'dos', 'three':'tres'}
print(eng2sp)

Since Python 3.7, a dictionary keeps its key-value pairs in the order they were inserted, i.e. dictionaries are now ordered structures.

But that doesn’t really matter because the elements of a dictionary are never indexed with integer indices. Instead, you use the keys to look up the corresponding values:

In [None]:
print(eng2sp['two'])

The key 'two' always maps to the value “dos” so the order of the items doesn’t matter.

If the key isn’t in the dictionary, you get an exception:

In [None]:
print(eng2sp['four'])

The len function works on dictionaries; it returns the number of key-value pairs:

In [None]:
len(eng2sp)

The in operator works on dictionaries; it tells you whether something appears as a key in the dictionary (appearing as a value is not good enough).

In [None]:
'one' in eng2sp

In [None]:
'uno' in eng2sp

To see whether something appears as a value in a dictionary, you can use the method values, which returns the values as a type that can be converted to a list, and then use the in operator:

In [None]:
vals = list(eng2sp.values())
'uno' in vals

The in operator uses different algorithms for lists and dictionaries. For lists, it uses a linear search algorithm. As the list gets longer, the search time gets longer in direct proportion to the length of the list. For dictionaries, Python uses a data structure called a hash table that has a remarkable property: the in operator takes about the same amount of time no matter how many items there are in a dictionary.
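
To get a feel for the difference, the sketch below runs the same membership test against a list and a dictionary holding the same items, using the standard timeit module. The size n and the names as_list, as_dict, and target are arbitrary choices for illustration, and the timings depend on the machine, so no expected numbers are shown.

```python
import timeit

n = 100_000
as_list = list(range(n))
as_dict = dict.fromkeys(as_list)   # same items as keys; values are all None

target = n - 1   # worst case for the list: the last element

# Both containers give the same answer...
print(target in as_list, target in as_dict)

# ...but the work involved differs. Timings vary by machine.
list_time = timeit.timeit(lambda: target in as_list, number=100)
dict_time = timeit.timeit(lambda: target in as_dict, number=100)
print(f'list: {list_time:.6f} s, dict: {dict_time:.6f} s')
```

Trying a larger n is a simple way to see the list time grow while the dictionary time stays roughly flat.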

Exercise 1: Download a copy of the file

www.py4e.com/code3/words.txt

Write a program that reads the words in words.txt and stores them as keys in a dictionary. It doesn’t matter what the values are. Then you can use the in operator as a fast way to check whether a string is in the dictionary.

In [None]:
fhand = open('words.txt')

words_dict = {}

for line in fhand:
    words = line.split()
    for word in words:
        words_dict[word] = None

        
user_word = input('Enter a word to check:\n')
if user_word in words_dict:
    print(f'The word "{user_word}" is in the dictionary.')
else:
    print(f'The word "{user_word}" is NOT in the dictionary.')

#### 9.1 Dictionary as a set of counters

Suppose you are given a string and you want to count how many times each letter appears. There are several ways you could do it:

1. You could create 26 variables, one for each letter of the alphabet. Then you could traverse the string and, for each character, increment the corresponding counter, probably using a chained conditional.

2. You could create a list with 26 elements. Then you could convert each character to a number (using the built-in function ord), use the number as an index into the list, and increment the appropriate counter.

3. You could create a dictionary with characters as keys and counters as the corresponding values. The first time you see a character, you would add an item to the dictionary. After that you would increment the value of an existing item.

Each of these options performs the same computation, but each of them implements that computation in a different way.

An implementation is a way of performing a computation; some implementations are better than others. For example, an advantage of the dictionary implementation is that we don’t have to know ahead of time which letters appear in the string and we only have to make room for the letters that do appear.

Here is what the code might look like:

In [None]:
word = 'brontosaurus'
d = dict()
for c in word:
    if c not in d:
        d[c] = 1
    else:
        d[c] = d[c] + 1
#     print(d)
print(d)

We are effectively computing a histogram, which is a statistical term for a set of counters (or frequencies).

The for loop traverses the string. Each time through the loop, if the character c is not in the dictionary, we create a new item with key c and the initial value 1 (since we have seen this letter once). If c is already in the dictionary we increment d[c].

Here’s the output of the program:

{'b': 1, 'r': 2, 'o': 2, 'n': 1, 't': 1, 's': 2, 'a': 1, 'u': 2}

The histogram indicates that the letters “a” and “b” appear once; “o” appears twice, and so on.

Dictionaries have a method called get that takes a key and a default value. If the key appears in the dictionary, get returns the corresponding value; otherwise it returns the default value. For example:

In [None]:
counts = {'chuck':1, 'annie':42, 'jan':100}
print(counts.get('jan', 0))

In [None]:
print(counts.get('tim', 0))

We can use get to write our histogram loop more concisely. Because the get method automatically handles the case where a key is not in a dictionary, we can reduce four lines down to one and eliminate the if statement.

In [None]:
word = 'brontosaurus'
d = dict()
for c in word:
    d[c] = d.get(c, 0) + 1
#     print(d)
print(d)

The use of the get method to simplify this counting loop ends up being a very commonly used “idiom” in Python and we will use it many times in the rest of the book. So you should take a moment and compare the loop using the if statement and in operator with the loop using the get method. They do exactly the same thing, but one is more succinct.

#### 9.2 Dictionaries and files

One of the common uses of a dictionary is to count the occurrence of words in a file with some written text. Let’s start with a very simple file of words taken from the text of Romeo and Juliet.   
For the first set of examples, we will use a shortened and simplified version of the text with no punctuation. Later we will work with the text of the scene with punctuation included.

But soft what light through yonder window breaks  
It is the east and Juliet is the sun  
Arise fair sun and kill the envious moon  
Who is already sick and pale with grief  

We will write a Python program to read through the lines of the file, break each line into a list of words, and then loop through each of the words in the line and count each word using a dictionary.  
You will see that we have two for loops. The outer loop is reading the lines of the file and the inner loop is iterating through each of the words on that particular line. This is an example of a pattern called nested loops because one of the loops is the outer loop and the other loop is the inner loop.  
Because the inner loop executes all of its iterations each time the outer loop makes a single iteration, we think of the inner loop as iterating “more quickly” and the outer loop as iterating more slowly.  
The combination of the two nested loops ensures that we will count every word on every line of the input file.

In [None]:
fname = input('Enter the file name: ')

try:
    fhand = open(fname)
except:
    print('File cannot be opened:', fname)
    exit()

count = dict()
for line in fhand:
    words = line.split()
    for word in words:
        if word not in count:
            count[word] = 1
        else:
            count[word] += 1
print(count)

In our else statement, we use the more compact alternative for incrementing a variable. count[word] += 1 is equivalent to count[word] = count[word] + 1. Either form can be used to change the value of a variable by any desired amount. Similar alternatives exist for -=, *=, and /=.  
When we run the program, we see a raw dump of all of the counts in the order the words were first encountered, not sorted by word or by count. (The romeo.txt file is available at www.py4e.com/code3/romeo.txt.)

#### 9.3 Looping and dictionaries

If you use a dictionary as the sequence in a for statement, it traverses the keys of the dictionary. This loop prints each key and the corresponding value:

In [None]:
counts = {'chuck':1, 'annie':42, 'jan':100}
for key in counts:
    print(key, counts[key])

Again, the keys appear in the order in which they were inserted.  
We can use this pattern to implement the various loop idioms that we have described earlier. For example if we wanted to find all the entries in a dictionary with a value above ten, we could write the following code:

In [None]:
counts = {'chuck':1, 'annie':42, 'jan':100}
for key in counts:
    if counts[key] > 10:
        print(key, counts[key])

The for loop iterates through the keys of the dictionary, so we must use the index operator to retrieve the corresponding value for each key.  
If you want to print the keys in alphabetical order, you first make a list of the keys in the dictionary using the keys method available in dictionary objects, and then sort that list and loop through the sorted list, looking up each key and printing out key-value pairs in sorted order as follows:

In [None]:
counts = {'chuck':1, 'annie':42, 'jan':100}
lst = list(counts.keys())
print(lst)
lst.sort()
print(lst)
for key in lst:
    print(key, counts[key])

First you see the list of keys in non-alphabetical order that we get from the keys method. Then we see the key-value pairs in alphabetical order from the for loop.
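
A shorter variant of the same idea, offered as an alternative sketch rather than the approach used above, relies on the built-in sorted function. Iterating over a dictionary yields its keys, so sorted can be applied to the dictionary directly.

```python
counts = {'chuck': 1, 'annie': 42, 'jan': 100}

# sorted(counts) returns a new, alphabetically sorted list of the keys
# and leaves the dictionary itself unchanged.
for key in sorted(counts):
    print(key, counts[key])
```

This prints annie, chuck, and jan in that order, each followed by its count.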

#### 9.4 Advanced text parsing

In the above example using the file romeo.txt, we made the file as simple as possible by removing all punctuation by hand. The actual text has lots of punctuation, as shown below.

But, soft! what light through yonder window breaks?  
It is the east, and Juliet is the sun.  
Arise, fair sun, and kill the envious moon,  
Who is already sick and pale with grief, 

Since the Python split function looks for spaces and treats words as tokens separated by spaces, we would treat the words “soft!” and “soft” as different words and create a separate dictionary entry for each word.

Also since the file has capitalization, we would treat “who” and “Who” as different words with different counts.  

We can solve both these problems by using the string methods lower and translate together with the constant string.punctuation. The translate method is the most subtle of these. Here is the documentation for translate:  

line.translate(str.maketrans(fromstr, tostr, deletestr))

*Replace the characters in fromstr with the character in the same position in tostr and delete all characters that are in deletestr. The fromstr and tostr can be empty strings and the deletestr parameter can be omitted.*

We will not specify the tostr but we will use the deletestr parameter to delete all of the punctuation. We will even let Python tell us the list of characters that it considers “punctuation”:

In [None]:
import string
string.punctuation
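
Before changing the whole program, it may help to try the technique on a single line. This is a small sketch using the first line of the punctuated text; the names table and cleaned are introduced here for illustration.

```python
import string

line = 'But, soft! what light through yonder window breaks?'

# Two empty strings mean no character is replaced; the third argument
# lists the characters to delete (all punctuation).
table = str.maketrans('', '', string.punctuation)

cleaned = line.translate(table).lower()
print(cleaned)           # but soft what light through yonder window breaks
print(cleaned.split())
```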

We make the following modifications to our program:

In [None]:
import string

fname = input('Enter the file name: ')

try:
    fhand = open(fname)
except:
    print('File cannot be opened:', fname)
    exit()

count = dict()

for line in fhand:
    line = line.rstrip()
    # Build a translation table with empty fromstr and tostr, so nothing
    # is replaced and every punctuation character is deleted from the line
    line = line.translate(str.maketrans("", "", string.punctuation))
    # Convert the line to lowercase for case-insensitive word counting
    line = line.lower()
    words = line.split()
    for word in words:
        if word not in count:
            count[word] = 1
        else:
            count[word] += 1
print(count)

#### 9.5 Exercises

Exercise 2:

Write a program that categorizes each mail message by which day of the week the commit was done. To do this, look for lines that start with “From”, then look for the third word and keep a running count of each of the days of the week. At the end of the program, print out the contents of your dictionary (order does not matter).

Sample Line:  
From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008

Sample Execution:  
python dow.py  
Enter a file name: mbox-short.txt  
{'Fri': 20, 'Thu': 6, 'Sat': 1}  

In [None]:
fname = input('Enter the file name: ')

try:
    fhand = open(fname)
except:
    print('File cannot be opened:', fname)
    exit()

count = dict()
for line in fhand:
    line = line.rstrip()
    if line.startswith('From '):
        words = line.split()
        day = words[2]
        if day not in count:
            count[day] = 1
        else:
            count[day] += 1

print(count)  

Exercise 4: 

Add code to the above program to figure out who has the most messages in the file. After all the data has been read and the dictionary has been created, look through the dictionary using a maximum loop (see Chapter 5: Maximum and minimum loops) to find who has the most messages and print how many messages the person has.

Enter a file name: mbox-short.txt  
cwen@iupui.edu 5

Enter a file name: mbox.txt  
zqian@umich.edu 195 

In [None]:
fname = input('Enter the file name: ')

try:
    fhand = open(fname)
except:
    print('File cannot be opened:', fname)
    exit()

email_count = dict()
for line in fhand:
    line = line.rstrip()
    if line.startswith('From '):
        words = line.split()
        email = words[1]
        if email not in email_count:
            email_count[email] = 1
        else:
            email_count[email] += 1


def max_count(email_counts):
    # Return the (email, count) pair with the largest count
    max_email = None
    largest = None
    for email, count in email_counts.items():
        if largest is None or count > largest:
            max_email = email
            largest = count
    return max_email, largest

def min_count(email_counts):
    # Return the (email, count) pair with the smallest count
    min_email = None
    smallest = None
    for email, count in email_counts.items():
        if smallest is None or count < smallest:
            min_email = email
            smallest = count
    return min_email, smallest

max_email, max_messages = max_count(email_count)
min_email, min_messages = min_count(email_count)
print(max_email, max_messages)
print(min_email, min_messages)

Exercise 5: 

Write a program that records the domain name each message was sent from, instead of who the mail came from (i.e., the whole email address). At the end of the program, print out the contents of your dictionary.

python schoolcount.py  
Enter a file name: mbox-short.txt  
{'media.berkeley.edu': 4, 'uct.ac.za': 6, 'umich.edu': 7,  
'gmail.com': 1, 'caret.cam.ac.uk': 1, 'iupui.edu': 8}   

In [None]:
fname = input('Enter the file name: ')

try:
    fhand = open(fname)
except:
    print('File cannot be opened:', fname)
    exit()

domain_counts = dict()

for line in fhand:
    line = line.rstrip()
    if line.startswith('From '):
        words = line.split()
        domain_name = words[1].split("@")[1]  # Text after the @ sign
        if domain_name not in domain_counts:
            domain_counts[domain_name] = 1
        else:
            domain_counts[domain_name] += 1
        
        ##Alternative Way##
#         domain_counts[domain_name] = domain_counts.get(domain_name, 0) + 1
print(domain_counts)

## Chapter 10

### Tuples

#### 10.1 Tuples are immutable

A tuple is a sequence of values much like a list. The values stored in a tuple can be any type, and they are indexed by integers. The important difference is that tuples are immutable. Tuples are also comparable and hashable so we can sort lists of them and use tuples as key values in Python dictionaries.

Syntactically, a tuple is a comma-separated list of values:

In [None]:
t = 'a', 'b', 'c', 'd', 'e'

Although it is not necessary, it is common to enclose tuples in parentheses to help us quickly identify tuples when we look at Python code:

In [None]:
t = ('a', 'b', 'c', 'd', 'e')

To create a tuple with a single element, you have to include the final comma:

In [None]:
t1 = ('a',)
type(t1)

Without the comma Python treats ('a') as an expression with a string in parentheses that evaluates to a string:

In [None]:
t2 = ('a')
type(t2)

Another way to construct a tuple is the built-in function tuple. With no argument, it creates an empty tuple:

In [None]:
t = tuple()
print(t)

If the argument is a sequence (string, list, or tuple), the result of the call to tuple is a tuple with the elements of the sequence:

In [None]:
t = tuple('lupins')
print(t)

Because tuple is the name of a constructor, you should avoid using it as a variable name.

Most list operators also work on tuples. The bracket operator indexes an element:

In [None]:
t = ('a', 'b', 'c', 'd', 'e')
print(t[0])

And the slice operator selects a range of elements.

In [None]:
print(t[1:3])

But if you try to modify one of the elements of the tuple, you get an error:

In [None]:
t[0] = 'A'

You can’t modify the elements of a tuple, but you can replace one tuple with another:

In [None]:
t = ('A',) + t[1:]
print(t)

#### 10.2 Comparing tuples

The comparison operators work with tuples and other sequences. Python starts by comparing the first element from each sequence. If they are equal, it goes on to the next element, and so on, until it finds elements that differ. Subsequent elements are not considered (even if they are really big).

In [None]:
(0, 1, 2) < (0, 3, 4)

In [None]:
(0, 1, 200000) < (0, 3, 4) 

The sort function works the same way. It sorts primarily by first element, but in the case of a tie, it sorts by second element, and so on.

This feature lends itself to a pattern called DSU for

*Decorate* a sequence by building a list of tuples with one or more sort keys preceding the elements from the sequence,  
*Sort* the list of tuples using the Python built-in sort, and  
*Undecorate* by extracting the sorted elements of the sequence.

For example, suppose you have a list of words and you want to sort them from longest to shortest:

In [None]:
txt = 'but soft what light in yonder window breaks'
words = txt.split()
t = list()
for word in words:
    t.append((len(word), word))
    
t.sort(reverse=True)

res = list()
for length, word in t:
    res.append(word)
    
print(res)

The first loop builds a list of tuples, where each tuple is a word preceded by its length.

sort compares the first element, length, first, and only considers the second element to break ties. The keyword argument reverse=True tells sort to go in decreasing order.

The second loop traverses the list of tuples and builds a list of words in descending order of length. The four-character words are sorted in reverse alphabetical order, so “what” appears before “soft” in the following list.

The output of the program is as follows:  
['yonder', 'window', 'breaks', 'light', 'what',
'soft', 'but', 'in']
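
For comparison, the same ordering can be obtained without building the decorated list by hand, by passing a key function to the built-in sorted. This is an alternative sketch, not the DSU pattern itself; the lambda returns the same (length, word) tuple that the decorate step builds, so ties are broken the same way.

```python
txt = 'but soft what light in yonder window breaks'
words = txt.split()

# Sort by (length, word), largest first, matching the DSU result above.
res = sorted(words, key=lambda word: (len(word), word), reverse=True)
print(res)
# ['yonder', 'window', 'breaks', 'light', 'what', 'soft', 'but', 'in']
```

Using key=len alone would also sort by length, but words of equal length would then keep their original relative order instead of being sorted in reverse alphabetical order.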

#### 10.3 Tuple assignment

One of the unique syntactic features of the Python language is the ability to have a tuple on the left side and a sequence on the right side of an assignment statement. This allows you to assign more than one variable at a time to the given sequence.

In this example we have a two-element tuple and assign the first and second elements of the tuple to the variables x and y in a single statement.

In [None]:
m = ('have', 'fun')
x, y = m
x

In [None]:
y

This is more general than tuple-to-tuple assignment. Both tuples and lists are sequences, so this syntax works with a two element list as well.

In [None]:
m = ['have', 'fun']
x, y = m
x

In [None]:
y

It is not magic; Python roughly translates the tuple assignment syntax to be the following:

In [None]:
m = ('have', 'fun')
x = m[0]
y = m[1]
x

In [None]:
y

Stylistically when we use a tuple on the left side of the assignment statement, we omit the parentheses, but the following is an equally valid syntax:

In [None]:
m = ('have', 'fun')
(x, y) = m
x

In [None]:
y

A particularly clever application of tuple assignment allows us to swap the values of two variables in a single statement:

In [None]:
a, b = 1, 2
a, b = b, a
print(a, b)

Both sides of this statement are tuples, but the left side is a tuple of variables; the right side is a tuple of expressions. Each value on the right side is assigned to its respective variable on the left side. All the expressions on the right side are evaluated before any of the assignments.
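
The fact that the whole right side is evaluated first is what makes updates like the following work. It is a small illustrative sketch (stepping through Fibonacci numbers) and is not taken from the text above.

```python
a, b = 0, 1
for _ in range(5):
    # The tuple (b, a + b) is built from the old values of a and b
    # before either name is rebound.
    a, b = b, a + b
print(a, b)   # 5 8
```

Written as two separate assignments, a = b followed by b = a + b, the second line would see the new value of a and give a different result.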

The number of variables on the left and the number of values on the right must be the same:

In [None]:
a, b = 1, 2, 3

More generally, the right side can be any kind of sequence (string, list, or tuple). For example, to split an email address into a user name and a domain, you could write:

In [None]:
addr = 'monty@python.org'
uname, domain = addr.split('@')
uname, domain

The return value from split is a list with two elements; the first element is assigned to uname, the second to domain.

#### 10.4 Dictionaries and tuples

Dictionaries have a method called items that returns a view of the key-value pairs; converting it with list gives a list of tuples, where each tuple is a key-value pair:

In [None]:
d = {'b':1, 'a':10, 'c':22}
t = list(d.items())
print(t)

The items appear in the order they were inserted, which here is not alphabetical.

However, since the list of tuples is a list, and tuples are comparable, we can now sort the list of tuples. Converting a dictionary to a list of tuples is a way for us to output the contents of a dictionary sorted by key:

In [None]:
d = {'b':1, 'a':10, 'c':22}
t = list(d.items())
t

In [None]:
t.sort()
t

The new list is sorted in ascending alphabetical order by the key value.

#### 10.5 Multiple assignment with dictionaries

Combining items, tuple assignment, and for, you can see a nice code pattern for traversing the keys and values of a dictionary in a single loop:

In [None]:
d = {'a':10, 'b':1, 'c':22}
for key, val in d.items():
    print(val, key)

This loop has two iteration variables because items yields a sequence of tuples and key, val is a tuple assignment that successively iterates through each of the key-value pairs in the dictionary.

For each iteration through the loop, both key and val are advanced to the next key-value pair in the dictionary (still in insertion order).

The output of this loop is:

10 a  
1 b  
22 c  

Again, the pairs appear in insertion order, not sorted by key or by value.

If we combine these two techniques, we can print out the contents of a dictionary sorted by the value stored in each key-value pair.

To do this, we first make a list of tuples where each tuple is (value, key). The items method would give us a list of (key, value) tuples, but this time we want to sort by value, not key. Once we have constructed the list with the value-key tuples, it is a simple matter to sort the list in reverse order and print out the new, sorted list.

In [None]:
d = {'a':10, 'b':1, 'c':22}
l = list()
for key, val in d.items():
    l.append((val, key))

print(l)
l.sort(reverse=True)
print(f'sorted list: {l}')

By carefully constructing the list of tuples to have the value as the first element of each tuple, we can sort the list of tuples and get our dictionary contents sorted by value.

#### 10.6 The most common words

Coming back to our running example of the text from Romeo and Juliet Act 2, Scene 2, we can augment our program to use this technique to print the ten most common words in the text as follows:

In [None]:
import string

fhand = open('romeo.txt')

count = dict()
for line in fhand:
    line = line.translate(str.maketrans('', '', string.punctuation))
    line = line.lower()
    words = line.split()
    for word in words:
        if word not in count:
            count[word] = 1
        else:
            count[word] += 1
            
# sorting the dictionary by value
lst = []
for key, val in list(count.items()):
    lst.append((val, key))
    
lst.sort(reverse=True)

for val, key in lst[:10]:
    print(val, key)

The first part of the program which reads the file and computes the dictionary that maps each word to the count of words in the document is unchanged. But instead of simply printing out counts and ending the program, we construct a list of (val, key) tuples and then sort the list in reverse order.

Since the value is first, it will be used for the comparisons. If there is more than one tuple with the same value, the sort looks at the second element (the key), so tuples with the same value are further ordered by key. Because the sort is reversed, those ties come out in reverse alphabetical order.
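
The tie-breaking rule can be seen on a small list. This is a minimal sketch with made-up (count, word) pairs, not data from romeo.txt:

```python
# Hypothetical (count, word) tuples; two of them share the count 2
pairs = [(2, 'b'), (3, 'c'), (2, 'a')]

# Tuples compare element by element: counts first, then words on a tie
pairs.sort(reverse=True)
print(pairs)
```

The two tuples with count 2 are ordered by their second element, and since the sort is reversed, 'b' comes before 'a'.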

At the end we write a nice for loop which does a multiple assignment iteration and prints out the ten most common words by iterating through a slice of the list (lst[:10]).

#### 10.7 Using tuples as keys in dictionaries

Because tuples are hashable and lists are not, if we want to create a composite key to use in a dictionary we must use a tuple as the key.

We would encounter a composite key if we wanted to create a telephone directory that maps from last-name, first-name pairs to telephone numbers. Assuming that we have defined the variables last, first, and number, we could write a dictionary assignment statement as follows:

directory[last,first] = number

The expression in brackets is a tuple. We could use tuple assignment in a for loop to traverse this dictionary.

In [None]:
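directory = dict()
last, first, number = 'Cleese', 'John', '555-0100'  # sample values for illustration
directory[last, first] = number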
for last, first in directory:
    print(first, last, directory[last, first])

This loop traverses the keys in directory, which are tuples. It assigns the elements of each tuple to last and first, then prints the name and corresponding telephone number.
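
To see why the composite key has to be a tuple, the following sketch tries both a tuple and a list as a key. The names and the number are sample values assumed for illustration.

```python
directory = dict()

# A tuple is hashable, so it can be used as a dictionary key
directory[('Cleese', 'John')] = '555-0100'

# A list is not hashable, so using it as a key raises TypeError
error_message = None
try:
    directory[['Cleese', 'John']] = '555-0100'
except TypeError as err:
    error_message = str(err)
    print('TypeError:', error_message)
```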

#### 10.8 Sequences: Strings, lists and tuples - Oh My!

I have focused on lists of tuples, but almost all of the examples in this chapter also work with lists of lists, tuples of tuples, and tuples of lists. To avoid enumerating the possible combinations, it is sometimes easier to talk about sequences of sequences.

In many contexts, the different kinds of sequences (strings, lists, and tuples) can be used interchangeably. So how and why do you choose one over the others?

To start with the obvious, strings are more limited than other sequences because the elements have to be characters. They are also immutable. If you need the ability to change the characters in a string (as opposed to creating a new string), you might want to use a list of characters instead.

Lists are more common than tuples, mostly because they are mutable. But there are a few cases where you might prefer tuples:  
1. In some contexts, like a return statement, it is syntactically simpler to create a tuple than a list. In other contexts, you might prefer a list.  
2. If you want to use a sequence as a dictionary key, you have to use an immutable type like a tuple or string.
3. If you are passing a sequence as an argument to a function, using tuples reduces the potential for unexpected behavior due to aliasing.

Because tuples are immutable, they don’t provide methods like sort and reverse, which modify existing lists. However, Python provides the built-in functions sorted and reversed, which take any sequence as a parameter and leave it unchanged: sorted returns a new list with the elements in sorted order, and reversed returns an iterator over the elements in reverse order.
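
A short sketch of both functions applied to a tuple (the variable names are chosen here for illustration):

```python
t = (3, 1, 2)

# sorted() returns a new list and leaves the tuple untouched
ascending = sorted(t)

# reversed() returns an iterator; tuple() collects its items
backwards = tuple(reversed(t))

print(ascending, backwards, t)
```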

#### 10.9 List comprehension

Sometimes you want to create a sequence by using data from another sequence. You can achieve this by writing a for loop and appending one item at a time. For example, if you wanted to convert a list of strings – each string storing digits – into numbers that you can sum up, you would write:

In [None]:
list_of_ints_in_strings = ['42', '65', '12']
list_of_ints = []

for x in list_of_ints_in_strings:
    list_of_ints.append(int(x))

print(sum(list_of_ints))

With list comprehension, the above code can be written in a more compact manner:

In [None]:
list_of_ints_in_strings = ['42', '65', '12']
list_of_ints = [ int(x) for x in list_of_ints_in_strings ]
print(sum(list_of_ints))

- A list comprehension is used here to iterate through each element (x) in list_of_ints_in_strings.
- int(x) converts each string x into an integer.
- The result, list_of_ints, is a new list: [42, 65, 12].

#### 10.10 Debugging

Lists, dictionaries and tuples are known generically as data structures; in this chapter we are starting to see compound data structures, like lists of tuples, and dictionaries that contain tuples as keys and lists as values. Compound data structures are useful, but they are prone to what I call shape errors; that is, errors caused when a data structure has the wrong type, size, or composition, or perhaps you write some code and forget the shape of your data and introduce an error. For example, if you are expecting a list with one integer and I give you a plain old integer (not in a list), it won’t work.
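
The example in the last sentence can be reproduced in a few lines. This is a sketch only; total_of is a hypothetical helper that expects a list of integers.

```python
def total_of(values):
    # Expects a list of integers and returns their sum
    return sum(values)

print(total_of([5]))      # the expected shape: a list with one integer

caught = None
try:
    total_of(5)           # wrong shape: a plain integer, not a list
except TypeError as err:
    caught = err
    print('Shape error:', err)
```

Printing the type or length of a value just before the failing line is often the quickest way to spot this kind of mismatch.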

#### 10.11 Exercises

Exercise 1: 

Revise a previous program as follows: Read and parse the “From” lines and pull out the addresses from the line. Count the number of messages from each person using a dictionary.

After all the data has been read, print the person with the most commits by creating a list of (count, email) tuples from the dictionary. Then sort the list in reverse order and print out the person who has the most commits.

Sample Line:

From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008 

Enter a file name: mbox-short.txt  
cwen@iupui.edu 5  

Enter a file name: mbox.txt  
zqian@umich.edu 195

In [None]:
# Prompt the user for a file name
file_name = input("Enter a file name: ")

try:
    fhand = open(file_name)
except FileNotFoundError:
    print("File not found:", file_name)
    exit()

# Dictionary to count the number of messages from each person
email_counts = {}

# Parse the file line by line
for line in fhand:
    line = line.strip()
    if line.startswith("From "):  # Look for lines starting with 'From '
        words = line.split()
        email = words[1]  # Extract the email address
        email_counts[email] = email_counts.get(email, 0) + 1

# Convert the dictionary to a list of (count, email) tuples
email_list = [(count, email) for email, count in email_counts.items()]

# Sort the list in reverse order by count
email_list.sort(reverse=True)
# print(email_list)

# Print the person with the most commits
if email_list:
    most_commits = email_list[0]  # The first item in the sorted list
    print(f"{most_commits[1]} {most_commits[0]}")
else:
    print("No 'From' lines found in the file.")

Exercise 2: 

This program counts the distribution of the hour of the day for each of the messages. You can pull the hour from the “From” line by finding the time string and then splitting that string into parts using the colon character. Once you have accumulated the counts for each hour, print out the counts, one per line, sorted by hour as shown below.

python timeofday.py  
Enter a file name: mbox-short.txt  
04 3  
06 1  
07 1  
09 2  
10 3  
11 6  
14 1  
15 2  
16 4  
17 2  
18 1   
19 1

In [None]:
fname = input('Enter the file name: ')

try:
    fhand = open(fname)
except:
    print('File cannot be opened:', fname)
    exit()
    
hour_counts = dict()

for line in fhand:
    line = line.strip()
    if line.startswith('From '):
        words = line.split()
        time = words[5]
        hour = time.split(":")[0]
        hour_counts[hour] = hour_counts.get(hour, 0) + 1
        
# Sort by hour (the dictionary key) and print one count per line
for hour, count in sorted(hour_counts.items()):
    print(hour, count)

Exercise 3: 

Write a program that reads a file and prints the letters in decreasing order of frequency. Your program should convert all the input to lower case and only count the letters a-z. Your program should not count spaces, digits, punctuation, or anything other than the letters a-z. Find text samples from several different languages and see how letter frequency varies between languages. Compare your results with the tables at https://wikipedia.org/wiki/Letter_frequencies.

In [None]:
fname = input('Enter the file name: ')

try:
    fhand = open(fname)
except:
    print('File cannot be opened:', fname)
    exit()
    
letter_count = dict()

for line in fhand:
    line = line.lower()
    for char in line:
        # count only the letters a-z; skip spaces, digits and punctuation
        if 'a' <= char <= 'z':
            letter_count[char] = letter_count.get(char, 0) + 1
        
fhand.close()

sorted_letters = sorted(letter_count.items(), key=lambda x: x[1], reverse=True)

print("letter frequencies:")
for letter, count in sorted_letters:
    print(f"{letter}: {count}")

## Chapter 11

### Regular expressions

So far we have been reading through files, looking for patterns and extracting various bits of lines that we find interesting. We have been using string methods like split and find and using lists and string slicing to extract portions of the lines.

This task of searching and extracting is so common that Python has a very powerful module for regular expressions that handles many of these tasks quite elegantly. We have not introduced regular expressions earlier in the book because, while they are very powerful, they are a little complicated and their syntax takes some getting used to.

Regular expressions are almost their own little programming language for searching and parsing strings. As a matter of fact, entire books have been written on the topic of regular expressions. In this chapter, we will only cover the basics of regular expressions. For more detail on regular expressions, see:  
https://en.wikipedia.org/wiki/Regular_expression  
https://docs.python.org/library/re.html  

The regular expression module re must be imported into your program before you can use it. The simplest use of the regular expression module is the search() function. The following program demonstrates a trivial use of the search function.

In [None]:
# search for lines that contain 'From:'
import re
fhand = open('mbox-short.txt')
for line in fhand:
    line = line.rstrip()
    if re.search('From:', line):
        print(line)

We open the file, loop through each line, and use the regular expression search() to only print out lines that contain the string “From:”. This program does not use the real power of regular expressions, since we could have just as easily used line.find() to accomplish the same result.

The power of regular expressions comes when we add special characters to the search string that allow us to more precisely control which lines match the string. Adding these special characters to our regular expression allows us to do sophisticated matching and extraction while writing very little code.

For example, the caret character is used in regular expressions to match “the beginning” of a line. We could change our program to only match lines where “From:” was at the beginning of the line as follows:

In [None]:
# search for lines that start with 'From:'
import re
fhand = open('mbox-short.txt')
for line in fhand:
    line = line.rstrip()
    if re.search('^From:', line):
        print(line)

Now we will only match lines that start with the string “From:”. This is still a very simple example that we could have done equivalently with the startswith() string method. But it serves to introduce the notion that regular expressions contain special action characters that give us more control as to what will match the regular expression.

#### 11.1 Character matching in regular expressions

There are a number of other special characters that let us build even more powerful regular expressions. The most commonly used special character is the period or full stop, which matches any character.

In the following example, the regular expression F..m: would match any of the strings “From:”, “Fxxm:”, “F12m:”, or “F!@m:” since the period characters in the regular expression match any character.

In [None]:
# Search for lines that start with 'F', followed by
# 2 characters, followed by 'm:'
import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    if re.search('^F..m:', line):
        print(line)

This is particularly powerful when combined with the ability to indicate that a character can be repeated any number of times using the * or + characters in your regular expression. These special characters mean that instead of matching a single character in the search string, they match zero-or-more characters (in the case of the asterisk) or one-or-more of the characters (in the case of the plus sign). We can further narrow down the lines that we match using a repeated wild card character in the following example:

In [None]:
# Search for lines that start with From and have an at sign
import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    if re.search('^From:.+@', line):
        print(line)

The search string ^From:.+@ will successfully match lines that start with “From:”, followed by one or more characters (.+), followed by an at-sign. So this will match the following line:

From: stephen.marquard@uct.ac.za

You can think of the .+ wildcard as expanding to match all the characters between the colon character and the at-sign.

From:.+@

It is good to think of the plus and asterisk characters as “pushy”. For example, the following string would match the last at-sign in the string as the .+ pushes outwards, as shown below:

From: stephen.marquard@uct.ac.za, csev@umich.edu, and cwen@iupui.edu

It is possible to tell an asterisk or plus sign not to be so “greedy” by adding another character. See the detailed documentation for information on turning off the greedy behavior.
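
As a brief sketch of the difference, the following compares the greedy .+ with the non-greedy .+? on the sample line above. The variable names greedy and lazy are chosen here for illustration.

```python
import re

line = 'From: stephen.marquard@uct.ac.za, csev@umich.edu, and cwen@iupui.edu'

# Greedy: .+ pushes outward as far as it can, so the match ends at the LAST at-sign
greedy = re.findall(r'^From:.+@', line)

# Non-greedy: .+? stops as soon as it can, so the match ends at the FIRST at-sign
lazy = re.findall(r'^From:.+?@', line)

print(greedy)
print(lazy)
```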

#### 11.2 Extracting data using regular expression

If we want to extract data from a string in Python we can use the findall() method to extract all of the substrings which match a regular expression. Let’s use the example of wanting to extract anything that looks like an email address from any line regardless of format. For example, we want to pull the email addresses from each of the following lines:

From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008  
Return-Path: <postmaster@collab.sakaiproject.org>  
for <source@collab.sakaiproject.org>;  
Received: (from apache@localhost)  
Author: stephen.marquard@uct.ac.za  

We don’t want to write code for each of the types of lines, splitting and slicing differently for each line. The following program uses findall() to find the lines with email addresses in them and extract one or more addresses from each of those lines.

In [None]:
import re
s = 'A message from csev@umich.edu to cwen@iupui.edu about meeting @2PM'
lst = re.findall(r'\S+@\S+', s)
print(lst)

The findall() method searches the string in the second argument and returns a list of all of the strings that look like email addresses. We are using a two-character sequence that matches a non-whitespace character (\S).

The output of the program would be:  
['csev@umich.edu', 'cwen@iupui.edu']


Translating the regular expression, we are looking for substrings that have at least one non-whitespace character, followed by an at-sign, followed by at least one more non-whitespace character. The \S+ matches as many non-whitespace characters as possible.

The regular expression would match twice (csev@umich.edu and cwen@iupui.edu), but it would not match the string “@2PM” because there are no non-blank characters before the at-sign. We can use this regular expression in a program to read all the lines in a file and print out anything that looks like an email address as follows:

In [None]:
# search for lines that have an at sign between characters

import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.strip()
    x = re.findall(r'\S+@\S+', line)
    if len(x) > 0:
        print(x)

We read each line and then extract all the substrings that match our regular expression. Since findall() returns a list, we simply check if the number of elements in our returned list is more than zero to print only lines where we found at least one substring that looks like an email address.

Some of our email addresses have incorrect characters like “<” or “;” at the beginning or end. Let’s declare that we are only interested in the portion of the string that starts and ends with a letter or a number.

To do this, we use another feature of regular expressions. Square brackets are used to indicate a set of multiple acceptable characters we are willing to consider matching. In a sense, the \S is asking to match the set of “non-whitespace characters”. Now we will be a little more explicit in terms of the characters we will match.

Here is our new regular expression:  
[a-zA-Z0-9]\S*@\S*[a-zA-Z]

This is getting a little complicated and you can begin to see why regular expressions are their own little language unto themselves. Translating this regular expression, we are looking for substrings that start with a single lowercase letter, uppercase letter, or number “[a-zA-Z0-9]”, followed by zero or more non-blank characters (\S*), followed by an at-sign, followed by zero or more non-blank characters (\S*), followed by an uppercase or lowercase letter. Note that we switched from + to * to indicate zero or more non-blank characters since [a-zA-Z0-9] is already one non-blank character. Remember that the * or + applies to the single character immediately to the left of the plus or asterisk.

If we use this expression in our program, our data is much cleaner:

In [None]:
# search for lines that have an at sign between characters
# the characters must be a letter or number

import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.strip()
    x = re.findall(r'[a-zA-Z0-9]\S*@\S*[a-zA-Z]', line)
    if len(x) > 0:
        print(x)

Notice that on the source@collab.sakaiproject.org lines, our regular expression eliminated two characters at the end of the string (“>;”). This is because when we append [a-zA-Z] to the end of our regular expression, we are demanding that whatever string the regular expression parser finds must end with a letter. So when it sees the “>” at the end of “sakaiproject.org>;” it simply stops at the last “matching” letter it found (i.e., the “g” was the last good match).

Also note that the output of the program is a Python list that has a string as the single element in the list.
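
The effect can be checked without the file by running both expressions on two of the sample lines shown earlier. This is a small sketch; the names rough and clean are chosen here for illustration.

```python
import re

lines = ['for <source@collab.sakaiproject.org>;',
         'Received: (from apache@localhost)']

rough = []
clean = []
for line in lines:
    # The first expression keeps stray characters such as '<', '>', ';' and ')'
    rough.extend(re.findall(r'\S+@\S+', line))
    # The second one must start with a letter or digit and end with a letter
    clean.extend(re.findall(r'[a-zA-Z0-9]\S*@\S*[a-zA-Z]', line))

print(rough)
print(clean)
```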

#### 11.3 Combining searching and extracting

If we want to find numbers on lines that start with the string “X-” such as:

X-DSPAM-Confidence: 0.8475  
X-DSPAM-Probability: 0.0000  

we don’t just want any floating-point numbers from any lines. We only want to extract numbers from lines that have the above syntax.

We can construct the following regular expression to select the lines:
^X-.*: [0-9.]+  

Translating this, we are saying, we want lines that start with X-, followed by zero or more characters (.*), followed by a colon (:) and then a space. After the space we are looking for one or more characters that are either a digit (0-9) or a period [0-9.]+. Note that inside the square brackets, the period matches an actual period (i.e., it is not a wildcard between the square brackets).

This is a very tight expression that will pretty much match only the lines we are interested in as follows:

In [None]:
# search for lines that start with 'X' followed by any non-
# whitespace characters and ':' followed by a space and any number.
# The number can include a decimal.

import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    if re.search(r'^X\S*: [0-9.]+', line):
        print(line)

When we run the program, we see the data nicely filtered to show only the lines we are looking for.

X-DSPAM-Confidence: 0.8475  
X-DSPAM-Probability: 0.0000  
X-DSPAM-Confidence: 0.6178  
X-DSPAM-Probability: 0.0000  
...  

But now we have to solve the problem of extracting the numbers. While it would be simple enough to use split, we can use another feature of regular expressions to both search and parse the line at the same time.

Parentheses are another special character in regular expressions. When you add parentheses to a regular expression, they are ignored when matching the string. But when you are using findall(), parentheses indicate that while you want the whole expression to match, you are only interested in extracting a portion of the substring that matches the regular expression.

So we make the following change to our program:

In [None]:
# Search for lines that start with 'X' followed by any
# non whitespace characters and ':' followed by a space
# and any number. The number can include a decimal.
# Then print the extracted number if one is found.

import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    x = re.findall(r'^X\S*: ([0-9.]+)', line)
    if len(x) > 0:
        print(x)

Instead of calling search(), we add parentheses around the part of the regular expression that represents the floating-point number to indicate we only want findall() to give us back the floating-point number portion of the matching string.

The numbers are still in a list and need to be converted from strings to floating point, but we have used the power of regular expressions to both search and extract the information we found interesting.
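
The conversion step might look like the following sketch. It works on a small in-memory list instead of the file; the first two lines are taken from the sample above and the third is a made-up line that should not match.

```python
import re

lines = ['X-DSPAM-Confidence: 0.8475',
         'X-DSPAM-Probability: 0.0000',
         'Subject: version 1.5 released']   # hypothetical non-matching line

numbers = []
for line in lines:
    x = re.findall(r'^X\S*: ([0-9.]+)', line)
    if len(x) > 0:
        # findall() returned a list of strings; convert the first one to a float
        numbers.append(float(x[0]))

print(numbers)
```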

As another example of this technique, if you look at the file there are a number of lines of the form:
Details: http://source.sakaiproject.org/viewsvn/?view=rev&rev=39772

If we wanted to extract all of the revision numbers (the integer number at the end of these lines) using the same technique as above, we could write the following program:

In [None]:
# Search for lines that start with 'Details:' and contain 'rev='
# followed by numbers
# Then print the number if one is found

import re 
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    x = re.findall('^Details:.*rev=([0-9]+)', line)
    if len(x) > 0:
        print(x)

Translating our regular expression, we are looking for lines that start with Details:, followed by any number of characters (.*), followed by rev=, and then by one or more digits. We want to find lines that match the entire expression but we only want to extract the integer number at the end of the line, so we surround [0-9]+ with parentheses.

Remember that the [0-9]+ is “greedy” and it tries to make as large a string of digits as possible before extracting those digits. This “greedy” behavior is why we get all five digits for each number. The regular expression module expands in both directions until it encounters a non-digit, or the beginning or the end of a line.

Now we can use regular expressions to redo an exercise from earlier in the book where we were interested in the time of day of each mail message. We looked for lines of the form:

From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008

and wanted to extract the hour of the day for each line. Previously we did this with two calls to split. First the line was split into words and then we pulled out the sixth word (index 5) and split it again on the colon character to pull out the two characters we were interested in.

While this worked, it actually results in pretty brittle code that assumes the lines are nicely formatted. If you were to add enough error checking (or a big try/except block) to ensure that your program never failed when presented with incorrectly formatted lines, the code would balloon to 10-15 lines of code that was pretty hard to read.

We can do this in a far simpler way with the following regular expression:

^From .* [0-9][0-9]:

The translation of this regular expression is that we are looking for lines that start with From (note the space), followed by any number of characters (.*), followed by a space, followed by two digits [0-9][0-9], followed by a colon character. This is the definition of the kinds of lines we are looking for.

In order to pull out only the hour using findall(), we add parentheses around the two digits as follows:

^From .* ([0-9][0-9]):

This results in the following program:

In [None]:
# Search for lines that start with 'From ' and contain a space
# followed by a two digit number between 00 and 99 followed by ':'
# Then print the number if one is found

import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    x = re.findall('^From .* ([0-9][0-9]):', line)
    if len(x) > 0:
        print(x)
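
To try the expression without the file, here is a small sketch that applies it to the sample line from above and to a deliberately malformed line (the second string is a made-up example):

```python
import re

good = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'
bad = 'From someone@example.com'    # hypothetical line with no time field

# The parentheses select only the two digits before the first colon of the time
hour = re.findall(r'^From .* ([0-9][0-9]):', good)

# A badly formatted line simply produces an empty list instead of an error
nothing = re.findall(r'^From .* ([0-9][0-9]):', bad)

print(hour, nothing)
```

Unlike the version with two calls to split, a line that lacks the time field does not raise an IndexError here.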

#### 11.4 Escape character

Since we use special characters in regular expressions to match the beginning or end of a line or specify wild cards, we need a way to indicate that these characters are “normal” and we want to match the actual character such as a dollar sign or caret.

We can indicate that we want to simply match a character by prefixing that character with a backslash. For example, we can find money amounts with the following regular expression.

In [None]:
import re
x = 'We just received $10.00 for cookies.'
y = re.findall(r'\$[0-9.]+', x)
if len(y) > 0:
    print(y)

Since we prefix the dollar sign with a backslash, it actually matches the dollar sign in the input string instead of matching the “end of line”, and the rest of the regular expression matches one or more digits or the period character. Note: Inside square brackets, characters are not “special”. So when we say [0-9.], it really means digits or a period. Outside of square brackets, a period is the “wildcard” character and matches any character. Inside square brackets, the period is a period.
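
A quick way to see what the backslash changes is to run the expression with and without it. This is a sketch on the same sample string; the names escaped and unescaped are chosen here for illustration.

```python
import re

x = 'We just received $10.00 for cookies.'

# With the backslash, \$ matches a literal dollar sign
escaped = re.findall(r'\$[0-9.]+', x)

# Without it, $ means "end of line", and nothing can follow the end,
# so the expression finds no matches
unescaped = re.findall(r'$[0-9.]+', x)

print(escaped, unescaped)
```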

#### 11.5 Summary

While this only scratched the surface of regular expressions, we have learned a bit about the language of regular expressions. They are search strings with special characters in them that communicate your wishes to the regular expression system as to what defines “matching” and what is extracted from the matched strings.
Here are some of those special characters and character sequences:

1. ^ Matches the beginning of the line.  
2. $ Matches the end of the line.  
3. . Matches any character (a wildcard).  
4. \s Matches a whitespace character.  
5. \S Matches a non-whitespace character (opposite of \s).    
6. * Applies to the immediately preceding character(s) and indicates to match zero or more times.  
7. *? Applies to the immediately preceding character(s) and indicates to match zero or more times in “non-greedy mode”.  
8. + Applies to the immediately preceding character(s) and indicates to match one or more times.  
9. +? Applies to the immediately preceding character(s) and indicates to match one or more times in “non-greedy mode”.  
10. ? Applies to the immediately preceding character(s) and indicates to match zero or one time.  
11. ?? Applies to the immediately preceding character(s) and indicates to match zero or one time in “non-greedy mode”.  
12. [aeiou] Matches a single character as long as that character is in the specified set. In this example, it would match “a”, “e”, “i”, “o”, or “u”, but no other characters.   
13. [a-z0-9] You can specify ranges of characters using the minus sign. This example is a single character that must be a lowercase letter or a digit.  
14. [^A-Za-z] When the first character in the set notation is a caret, it inverts the logic. This example matches a single character that is anything other than an uppercase or lowercase letter.  
15. ( ) When parentheses are added to a regular expression, they are ignored for the purpose of matching, but allow you to extract a particular subset of the matched string rather than the whole string when using findall().  
16. \b Matches the empty string, but only at the start or end of a word.  
17. \B Matches the empty string, but not at the start or end of a word.  
18. \d Matches any decimal digit; equivalent to the set [0-9].  
19. \D Matches any non-digit character; equivalent to the set [^0-9].
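
A few of the entries in this list can be tried out on a single string. The sentence below is made up for illustration, and the variable names are arbitrary.

```python
import re

s = 'Room 101: color or colour? Call 555-0100.'

digits = re.findall(r'\d+', s)               # runs of decimal digits
spellings = re.findall(r'colou?r', s)        # ? makes the 'u' optional
whole_word = re.findall(r'\bor\b', s)        # 'or' only as a separate word
others = re.findall(r'[^A-Za-z0-9 ]', s)     # anything but letters, digits, spaces

print(digits)
print(spellings)
print(whole_word)
print(others)
```

Note that \bor\b skips the "or" at the end of "color", because there is no word boundary in front of it.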

#### 11.6 Exercises


Exercise 1: Write a simple program to simulate the operation of the grep command on Unix. Ask the user to enter a regular expression and count the number of lines that matched the regular expression:

$ python grep.py  
Enter a regular expression: ^Author  
mbox.txt had 1798 lines that matched ^Author  

$ python grep.py  
Enter a regular expression: ^X-  
mbox.txt had 14368 lines that matched ^X-  

$ python grep.py  
Enter a regular expression: java$  
mbox.txt had 4175 lines that matched java$  

In [None]:
import re

fhand = open('mbox.txt')

regex = input('Enter a regular expression: ')

pattern = re.compile(regex)

match_count = 0
for line in fhand:
    if pattern.search(line):
        match_count += 1
fhand.close()

print(f"{fhand.name} had {match_count} lines that matched {regex}")

Exercise 2: 

Write a program to look for lines of the form:

New Revision: 39772

Extract the number from each of the lines using a regular expression and the findall() method. Compute the average of the numbers and print out the average as an integer.

Enter file:mbox.txt  
38549  
Enter file:mbox-short.txt  
39756  

In [None]:
import re

fname = input('Enter the file name: ')

fhand = open(fname)

total = 0
count = 0

for line in fhand:
    match = re.findall(r"New Revision: (\d+)", line)
    if match:
        total = total + int(match[0])
        count = count + 1
#         print(match[0])
fhand.close()

if count > 0:
    average = total // count
    print(f"Avg is {average}")
else:
    print('No matching lines found.')

## Chapter 12

### Networked programs

While many of the examples in this book have focused on reading files and looking for data in those files, there are many different sources of information when one considers the Internet.

In this chapter we will pretend to be a web browser and retrieve web pages using the Hypertext Transfer Protocol (HTTP). Then we will read through the web page data and parse it.

#### 12.1 Hypertext Transfer Protocol - HTTP

The network protocol that powers the web is actually quite simple and there is built-in support in Python called socket which makes it very easy to make network connections and retrieve data over those sockets in a Python program.

A socket is much like a file, except that a single socket provides a two-way connection between two programs. You can both read from and write to the same socket. If you write something to a socket, it is sent to the application at the other end of the socket. If you read from the socket, you are given the data which the other application has sent.

But if you try to read a socket when the program on the other end of the socket has not sent any data, you just sit and wait. If the programs on both ends of the socket simply wait for some data without sending anything, they will wait for a very long time, so an important part of programs that communicate over the Internet is to have some sort of protocol.

A protocol is a set of precise rules that determine who is to go first, what they are to do, and then what the responses are to that message, and who sends next, and so on. In a sense the two applications at either end of the socket are doing a dance and making sure not to step on each other’s toes. There are many documents that describe these network protocols. The Hypertext Transfer Protocol is described in the following document:

https://www.w3.org/Protocols/rfc2616/rfc2616.txt

This is a long and complex 176-page document with a lot of detail. If you find it interesting, feel free to read it all. But if you take a look around page 36 of RFC2616 you will find the syntax for the GET request. To request a document from a web server, we make a connection, e.g., to the data.pr4e.org server on port 80, and then send a line of the form

GET http://data.pr4e.org/romeo.txt HTTP/1.0

where the second parameter is the web page we are requesting, and then we also send a blank line. The web server will respond with some header information about the document and a blank line followed by the document content.
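The request described above can be assembled without touching the network. The following is a minimal sketch; the helper name build_get_request is hypothetical and not part of any library.

```python
def build_get_request(url):
    """Return the bytes of an HTTP/1.0 GET request for url.

    build_get_request is a hypothetical helper used only for illustration.
    """
    request_line = 'GET ' + url + ' HTTP/1.0'
    # The request line ends with \r\n; the second \r\n is the blank line
    # that tells the server the request is complete.
    return (request_line + '\r\n\r\n').encode()


request = build_get_request('http://data.pr4e.org/romeo.txt')
print(request)
```

Printing the bytes object shows the two \r\n sequences explicitly, which makes it easy to check that the blank line is really there.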

#### 12.2 The world's simplest web browser

Perhaps the easiest way to show how the HTTP protocol works is to write a very simple Python program that makes a connection to a web server and follows the rules of the HTTP protocol to request a document and display what the server sends back.

In [None]:
import socket

mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('data.pr4e.org', 80))
cmd = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n'.encode()
mysock.send(cmd)

while True:
    data = mysock.recv(512)
    if len(data) < 1:
        break
    print(data.decode(), end='')
    
mysock.close()

First the program makes a connection to port 80 on the server data.pr4e.org. Since our program is playing the role of the “web browser”, the HTTP protocol says we must send the GET command followed by a blank line. \r\n signifies an EOL (end of line), so \r\n\r\n signifies nothing between two EOL sequences. That is the equivalent of a blank line.

Once we send that blank line, we write a loop that receives data from the socket in chunks of up to 512 bytes and prints the data out until there is no more data to read (i.e., recv() returns an empty bytes object).

The program produces the following output:

HTTP/1.1 200 OK  
Date: Wed, 11 Apr 2018 18:52:55 GMT  
Server: Apache/2.4.7 (Ubuntu)  
Last-Modified: Sat, 13 May 2017 11:22:22 GMT  
ETag: "a7-54f6609245537"  
Accept-Ranges: bytes  
Content-Length: 167  
Cache-Control: max-age=0, no-cache, no-store, must-revalidate  
Pragma: no-cache  
Expires: Wed, 11 Jan 1984 05:00:00 GMT  
Connection: close  
Content-Type: text/plain  

But soft what light through yonder window breaks  
It is the east and Juliet is the sun  
Arise fair sun and kill the envious moon  
Who is already sick and pale with grief  

The output starts with headers which the web server sends to describe the document. For example, the Content-Type header indicates that the document is a plain text document (text/plain).

After the server sends us the headers, it adds a blank line to indicate the end of the headers, and then sends the actual data of the file romeo.txt.

This example shows how to make a low-level network connection with sockets. Sockets can be used to communicate with a web server or with a mail server or many other kinds of servers. All that is needed is to find the document which describes the protocol and write the code to send and receive the data according to the protocol.
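The layout of the response (headers, a blank line, then the document) can be explored without a network connection. The following is a small sketch that splits a shortened, hard-coded response at the first blank line and collects the headers in a dictionary; the variable names are our own choices for illustration.

```python
# A shortened, hard-coded response in the shape shown above (not live data).
raw = (b'HTTP/1.1 200 OK\r\n'
       b'Content-Length: 167\r\n'
       b'Content-Type: text/plain\r\n'
       b'\r\n'
       b'But soft what light through yonder window breaks\n')

# The first blank line (\r\n\r\n) separates the headers from the document.
header_bytes, body = raw.split(b'\r\n\r\n', 1)
header_lines = header_bytes.decode().split('\r\n')

# The first line is the status line; the rest are "Name: value" pairs.
status_line = header_lines[0]
headers = {}
for line in header_lines[1:]:
    name, value = line.split(':', 1)
    headers[name.strip()] = value.strip()

print(status_line)
print(headers['Content-Type'])
print(body.decode(), end='')
```

Splitting with a maximum of one split matters here: the document itself may contain blank lines, and only the first one marks the end of the headers.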

However, since the protocol that we use most commonly is the HTTP web protocol, Python has a special library specifically designed to support the HTTP protocol for the retrieval of documents and data over the web.

One of the requirements for using the HTTP protocol is the need to send and receive data as bytes objects, instead of strings. In the preceding example, the encode() and decode() methods convert strings into bytes objects and back again.

The next example uses b'' notation to write a bytes object directly as a literal. For ASCII text such as this, encode() and b'' produce the same result.

In [None]:
b'Hello world'

In [None]:
'Hello world'.encode()
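A quick way to convince ourselves that the two forms agree is to compare them directly. This short sketch also shows the round trip back to a string with decode().

```python
text = 'Hello world'
as_bytes = text.encode()            # str -> bytes (UTF-8 by default)

print(as_bytes == b'Hello world')   # same bytes as the b'' literal
print(type(text).__name__, type(as_bytes).__name__)
print(as_bytes.decode() == text)    # bytes -> str round trip
```

Note that a b'' literal may only contain ASCII characters, while encode() works for any text.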

#### 12.3 Retrieving an image over HTTP

In the above example, we retrieved a plain text file which had newlines in the file and we simply copied the data to the screen as the program ran. We can use a similar program to retrieve an image using HTTP. Instead of copying the data to the screen as the program runs, we accumulate the data in a bytes object, trim off the headers, and then save the image data to a file as follows:

In [None]:
import socket
import time

HOST = 'data.pr4e.org'
PORT = 80
mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect((HOST, PORT))
mysock.sendall(b'GET http://data.pr4e.org/cover3.jpg HTTP/1.0\r\n\r\n')
count = 0
picture = b""

while True:
    data = mysock.recv(5120)
    if len(data) < 1:
        break
    # time.sleep(0.25)
    count = count + len(data)
    print(len(data), count)
    picture = picture + data
    
mysock.close()

# Look for the end of the header 
pos = picture.find(b"\r\n\r\n")
print('Header length', pos)
print(picture[:pos].decode())

# Skip past the header and save the picture data
picture = picture[pos+4:]
fhand = open("stuff.jpg", "wb")
fhand.write(picture)
fhand.close()

You can see that for this URL, the Content-Type header indicates that the body of the document is an image (image/jpeg). Once the program completes, you can view the image data by opening the file stuff.jpg in an image viewer.

As the program runs, you can see that we don’t get 5120 characters each time we call the recv() method. We get as many characters as have been transferred across the network to us by the web server at the moment we call recv(). In this example, we get as few as 3200 characters each time we request up to 5120 characters of data.

Your results may be different depending on your network speed. Also note that on the last call to recv() we get 3167 bytes, which is the end of the stream, and in the next call to recv() we get a zero-length string that tells us that the server has called close() on its end of the socket and there is no more data forthcoming.

We can slow down our successive recv() calls by uncommenting the call to time.sleep(). This way, we wait a quarter of a second after each call so that the server can “get ahead” of us and send more data to us before we call recv() again. With the delay in place, the program executes as follows:

$ python urljpeg.py  
5120 5120  
5120 10240  
5120 15360  
...  
5120 225280
5120 230400
207 230607  
Header length 393  
HTTP/1.1 200 OK  
Date: Wed, 11 Apr 2018 21:42:08 GMT  
Server: Apache/2.4.7 (Ubuntu)  
Last-Modified: Mon, 15 May 2017 12:27:40 GMT  
ETag: "38342-54f8f2e5b6277"  
Accept-Ranges: bytes  
Content-Length: 230210  
Vary: Accept-Encoding   
Cache-Control: max-age=0, no-cache, no-store, must-revalidate  
Pragma: no-cache  
Expires: Wed, 11 Jan 1984 05:00:00 GMT  
Connection: close  
Content-Type: image/jpeg  

Now, other than the last call to recv(), we get 5120 characters each time we ask for new data.

There is a buffer between the server making send() requests and our application making recv() requests. When we run the program with the delay in place, at some point the server might fill up the buffer in the socket and be forced to pause until our program starts to empty the buffer. The pausing of either the sending application or the receiving application is called “flow control.”
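The receive loop can be tried out locally. The following is a minimal sketch that uses socket.socketpair() to create two connected sockets standing in for the server and our program; the payload and the 1024-byte request size are made up for illustration.

```python
import socket

# socketpair() returns two already-connected sockets, standing in for
# the server and our program without using the network.
server_end, client_end = socket.socketpair()

payload = b'x' * 3000              # made-up data, small enough to fit the buffer
server_end.sendall(payload)
server_end.close()                 # closing signals "no more data"

received = b''
chunk_sizes = []
while True:
    data = client_end.recv(1024)   # ask for up to 1024 bytes
    if len(data) < 1:              # empty bytes object: the other end closed
        break
    chunk_sizes.append(len(data))
    received = received + data
client_end.close()

print(chunk_sizes, len(received))
```

The individual chunk sizes may vary from run to run and from system to system, but each is at most the size we asked for and together they always add up to everything that was sent.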

#### 12.4 Retrieving web pages with urllib

While we can manually send and receive data over HTTP using the socket library, there is a much simpler way to perform this common task in Python by using the urllib library.

Using urllib, you can treat a web page much like a file. You simply indicate which web page you would like to retrieve and urllib handles all of the HTTP protocol and header details.

The equivalent code to read the romeo.txt file from the web using urllib is as follows:

In [None]:
import urllib.request

fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')
for line in fhand:
    print(line.decode().strip())

Once the web page has been opened with urllib.request.urlopen, we can treat it like a file and read through it using a for loop.

When the program runs, we only see the output of the contents of the file. The headers are still sent, but the urllib code consumes the headers and only returns the data to us.

As an example, we can write a program to retrieve the data for romeo.txt and compute the frequency of each word in the file as follows:

In [None]:
import urllib.request, urllib.parse, urllib.error

fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')

counts = dict()
for line in fhand:
    line = line.strip()
    words = line.decode().split()
    for word in words:
        counts[word] = counts.get(word, 0) + 1
print(counts)
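Once we have the counts, finding the most common word takes one more step. The sketch below assumes an io.BytesIO object as a stand-in for the response returned by urlopen(), so that it runs without a network connection; it holds the first three lines of romeo.txt.

```python
import io

# Stand-in for the object returned by urlopen(): iterating over it
# yields one bytes line at a time, just like the real response.
fhand = io.BytesIO(b'But soft what light through yonder window breaks\n'
                   b'It is the east and Juliet is the sun\n'
                   b'Arise fair sun and kill the envious moon\n')

counts = dict()
for line in fhand:
    for word in line.decode().split():
        counts[word] = counts.get(word, 0) + 1

# Pick the word with the highest count.
top_word = max(counts, key=counts.get)
print(top_word, counts[top_word])   # the 3
```

Replacing the io.BytesIO object with the result of urllib.request.urlopen() should give the same loop real data to work on.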

#### 12.5 Reading binary files using urllib

Sometimes you want to retrieve a non-text (or binary) file such as an image or video file. The data in these files is generally not useful to print out, but you can easily make a copy of a URL to a local file on your hard disk using urllib.

The pattern is to open the URL and use read to download the entire contents of the document into a bytes variable (img), then write that information to a local file as follows:

In [None]:
import urllib.request, urllib.parse, urllib.error
img = urllib.request.urlopen('http://data.pr4e.org/cover3.jpg').read()
fhand = open('cover3.jpg', 'wb')
fhand.write(img)
fhand.close()

This program reads all of the data in at once across the network and stores it in the variable img in the main memory of your computer, then opens the file cover3.jpg and writes the data out to your disk. The wb argument for open() opens a binary file for writing only. This program will work if the size of the file is less than the size of the memory of your computer.

However, if this is a large audio or video file, this program may crash or at least run extremely slowly when your computer runs out of memory. In order to avoid running out of memory, we retrieve the data in blocks (or buffers) and then write each block to your disk before retrieving the next block. This way the program can read any size file without using up all of the memory you have in your computer.

In [None]:
import urllib.request, urllib.parse, urllib.error

img = urllib.request.urlopen('http://data.pr4e.org/cover3.jpg')
fhand = open('cover3.jpg', 'wb')
size = 0
while True:
    info = img.read(100000)
    if len(info) < 1:
        break
    size = size + len(info)
    fhand.write(info)
    
print(size, 'characters copied.')
fhand.close()

In this example, we read only 100,000 characters at a time and then write those characters to the cover3.jpg file before retrieving the next 100,000 characters of data from the web.
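The block-by-block pattern can be wrapped in a function and tried on in-memory data. In this sketch, copy_in_blocks is a hypothetical helper name, and io.BytesIO objects stand in for the urlopen() response and the output file.

```python
import io

def copy_in_blocks(source, destination, block_size=100000):
    """Copy source to destination one block at a time; return bytes copied.

    copy_in_blocks is a hypothetical helper, not part of urllib.
    """
    size = 0
    while True:
        block = source.read(block_size)
        if len(block) < 1:         # empty read means end of data
            break
        destination.write(block)
        size = size + len(block)
    return size


# In-memory stand-ins for the urlopen() response and the output file.
source = io.BytesIO(b'\xff\xd8' + b'\x00' * 250000)
destination = io.BytesIO()
copied = copy_in_blocks(source, destination)
print(copied, 'bytes copied.')
```

At no point does the function hold more than one block in memory. The standard library offers the same pattern ready-made as shutil.copyfileobj().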

#### 12.6 Parsing HTML and scraping the web

One of the common uses of the urllib capability in Python is to scrape the web. Web scraping is when we write a program that pretends to be a web browser and retrieves pages, then examines the data in those pages looking for patterns.

As an example, a search engine such as Google will look at the source of one web page and extract the links to other pages and retrieve those pages, extracting links, and so on. Using this technique, Google spiders its way through nearly all of the pages on the web.

Google also uses the frequency of links from pages it finds to a particular page as one measure of how “important” a page is and how high the page should appear in its search results.

#### 12.7 Parsing HTML using regular expressions

One simple way to parse HTML is to use regular expressions to repeatedly search for and extract substrings that match a particular pattern.

Here is a simple web page:

In [None]:
<h1>The First Page</h1>
<p>
If you like, you can switch to the 
<a href="http://www.dr-chuck.com/page2.htm">
Second Page</a>.
</p>

We can construct a well-formed regular expression to match and extract the link values from the above text as follows:

href="http[s]?://.+?"

Our regular expression looks for strings that start with “href="http://” or “href="https://”, followed by one or more characters (.+?), followed by another double quote. The question mark after [s] tells the expression to match “http” followed by zero or one “s”.

The question mark added to the .+? indicates that the match is to be done in a “non-greedy” fashion instead of a “greedy” fashion. A non-greedy match tries to find the smallest possible matching string and a greedy match tries to find the largest possible matching string.
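The difference between the two kinds of match is easiest to see when two links share a line. The following sketch uses a made-up line of HTML with two anchor tags.

```python
import re

# Two links on one line make the difference visible (made-up HTML).
html = ('<a href="http://www.dr-chuck.com/page1.htm">One</a> '
        '<a href="http://www.dr-chuck.com/page2.htm">Two</a>')

greedy = re.findall('href="http[s]?://.+"', html)
non_greedy = re.findall('href="http[s]?://.+?"', html)

print(len(greedy))       # 1
print(len(non_greedy))   # 2
print(non_greedy[0])
```

The greedy version runs on to the last double quote on the line and swallows both links in a single match, while the non-greedy version stops at the first closing quote and finds each link separately.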

We add parentheses to our regular expression to indicate which part of our matched string we would like to extract, and produce the following program:

In [None]:
# Search for link values within URL input
import urllib.request, urllib.parse, urllib.error
import re
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')
html = urllib.request.urlopen(url, context=ctx).read()
links = re.findall(b'href="(http[s]?://.*?)"', html)
for link in links:
    print(link.decode())

The ssl library allows this program to access web sites that strictly enforce HTTPS. The read method returns HTML source code as a bytes object instead of returning an HTTPResponse object. The findall regular expression method will give us a list of all of the strings that match our regular expression, returning only the link text between the double quotes.

When we run the program and input a URL, it prints each link it finds, one per line.

Regular expressions work very nicely when your HTML is well formatted and predictable. But since there are a lot of “broken” HTML pages out there, a solution only using regular expressions might either miss some valid links or end up with bad data.

#### 12.8 Parsing HTML using BeautifulSoup

Even though HTML looks like XML and some pages are carefully constructed to be XML, most HTML is generally broken in ways that cause an XML parser to reject the entire page of HTML as improperly formed.

There are a number of Python libraries which can help you parse HTML and extract data from the pages. Each of the libraries has its strengths and weaknesses and you can pick one based on your needs.

As an example, we will simply parse some HTML input and extract links using the BeautifulSoup library. BeautifulSoup tolerates highly flawed HTML and still lets you easily extract the data you need. You can download and install the BeautifulSoup code from:  
https://pypi.python.org/pypi/beautifulsoup4

Information on installing BeautifulSoup with the Python Package Index tool pip is available at:  
https://packaging.python.org/tutorials/installing-packages/

We will use urllib to read the page and then use BeautifulSoup to extract the href attributes from the anchor (a) tags.

In [None]:
# To run this, download the BeautifulSoup zip file
# http://www.py4e.com/code3/bs4.zip
# and unzip it in the same directory as this file

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')

# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    print(tag.get('href', None))

The program prompts for a web address, then opens the web page, reads the data and passes the data to the BeautifulSoup parser, and then retrieves all of the anchor tags and prints out the href attribute for each tag.

When the program runs, it prints the href value of each anchor tag, one per line.

This list is much longer than the one from the regular-expression program because some href values are relative paths (e.g., tutorial/index.html) or in-page references (e.g., ‘#’) that do not include “http://” or “https://”, which was a requirement in our regular expression.
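If we want to follow relative links, they first have to be combined with the address of the page they came from. One way to do this is urljoin from urllib.parse. The sketch below uses a hypothetical page address and made-up href values.

```python
from urllib.parse import urljoin

# Hypothetical page address and href values, chosen for illustration.
base = 'http://www.dr-chuck.com/page1.htm'
hrefs = ['http://www.dr-chuck.com/page2.htm', 'tutorial/index.html', '#top']

# urljoin leaves absolute URLs alone and resolves the others against base.
resolved = [urljoin(base, href) for href in hrefs]
for link in resolved:
    print(link)
```

The absolute URL comes back unchanged, the relative path is resolved against the directory of the base page, and the in-page reference is attached to the base page itself.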

You can also use BeautifulSoup to pull out various parts of each tag:

In [None]:
# To run this, download the BeautifulSoup zip file
# http://www.py4e.com/code3/bs4.zip
# and unzip it in the same directory as this file

from urllib.request import urlopen
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')
html = urlopen(url, context = ctx).read()
soup = BeautifulSoup(html, "html.parser")

# retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    # Look at the parts of a tag
    print('TAG:', tag)
    print('URL:', tag.get('href', None))
    print('Contents:', tag.contents[0])
    print('Attrs:', tag.attrs)

html.parser is the HTML parser included in the standard Python 3 library. Information on other HTML parsers is available at:

http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser

These examples only begin to show the power of BeautifulSoup when it comes to parsing HTML.

#### 12.9 Bonus section for Unix/Linux Users

If you have a Linux, Unix, or Macintosh computer, you probably have commands built in to your operating system that retrieve both plain text and binary files using the HTTP or File Transfer (FTP) protocols. One of these commands is curl:

$ curl -O http://www.py4e.com/cover.jpg

The command curl is short for “copy URL” and so the two examples listed earlier to retrieve binary files with urllib are cleverly named curl1.py and curl2.py on www.py4e.com/code3 as they implement similar functionality to the curl command. There is also a curl3.py sample program that does this task a little more effectively, in case you actually want to use this pattern in a program you are writing.

A second command that functions very similarly is wget:

$ wget http://www.py4e.com/cover.jpg

Both of these commands make retrieving webpages and remote files a simple task.

#### 12.10 Exercises

Exercise 1: 

Change the socket program socket1.py to prompt the user for the URL so it can read any web page.

You can use split('/') to break the URL into its component parts so you can extract the host name for the socket connect call. Add error checking using try and except to handle the condition where the user enters an improperly formatted or non-existent URL.

In [None]:
import socket

while True:
    try:
        # Prompt the user for the URL
        url = input('Enter - ').strip()

        # Check that the URL is properly formatted
        if not url.startswith("http://") and not url.startswith("https://"):
            raise ValueError("URL must start with 'http://' or 'https://'")

        # Split the URL to extract the host name
        parts = url.split('/')
        host = parts[2]     # the host name is the third part (after "http://")

        # Create a socket connection
        mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        mysock.connect((host, 80))

        # Form the GET request command
        cmd = f'GET {url} HTTP/1.0\r\n\r\n'.encode()
        mysock.send(cmd)

        # Receive and print data from the server
        while True:
            data = mysock.recv(512)
            if len(data) < 1:
                break
            print(data.decode(), end='')

        # Close the socket
        mysock.close()
        break   # Exit the loop after successful execution

    except IndexError:
        print('Error: The URL seems improperly formatted. Please try again.')
    except socket.gaierror:
        print('Error: Unable to resolve host. Please check the URL and try again.')
    except Exception as e:
        print(f"An unexpected error occurred: {e}")

Exercise 2: 

Change your socket program so that it counts the number of characters it has received and stops displaying any text after it has shown 3000 characters. The program should retrieve the entire document and count the total number of characters and display the count of the number of characters at the end of the document.

In [None]:
import socket

while True:
    try:
        # Prompt the user for the URL
        url = input('Enter - ').strip()

        # Check that the URL is properly formatted
        if not url.startswith("http://") and not url.startswith("https://"):
            raise ValueError("URL must start with 'http://' or 'https://'")

        # Split the URL to extract the host name
        parts = url.split('/')
        host = parts[2]     # the host name is the third part (after "http://")

        # Create a socket connection
        mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        mysock.connect((host, 80))

        # Form the GET request command
        cmd = f'GET {url} HTTP/1.0\r\n\r\n'.encode()
        mysock.send(cmd)

        # Initialize the counters and variables
        total_chars = 0
        displayed_chars = 0
        max_display = 3000

        # Receive and print data from the server
        while True:
            data = mysock.recv(512)
            if len(data) < 1:
                break

            # Count total characters received
            total_chars += len(data)

            # Display up to 3000 characters
            if displayed_chars < max_display:
                to_show = min(len(data), max_display - displayed_chars)
                print(data[:to_show].decode(), end='')
                displayed_chars += to_show

        # Close the socket
        mysock.close()

        print("\n\nTotal characters received:", total_chars)
        break   # Exit the loop after successful execution

    except IndexError:
        print('Error: The URL seems improperly formatted. Please try again.')
    except socket.gaierror:
        print('Error: Unable to resolve host. Please check the URL and try again.')
    except Exception as e:
        print(f"An unexpected error occurred: {e}")

Exercise 3: 

Use urllib to replicate the previous exercise of (1) retrieving the document from a URL, (2) displaying up to 3000 characters, and (3) counting the overall number of characters in the document. Don’t worry about the headers for this exercise, simply show the first 3000 characters of the document contents.

In [None]:
import urllib.request

url = input('Enter a URL: ')

try:
    response = urllib.request.urlopen(url)
    total_chars = 0
    displayed_chars = 0
    max_display = 3000
    
    while True:
        data = response.read(512)
        if len(data) < 1:
            break
        
        total_chars += len(data)
        
        if displayed_chars < max_display:
            to_show = min(len(data), max_display - displayed_chars)
            print(data[:to_show].decode(), end="")
            displayed_chars += to_show
            
    print("\nTotal characters in the document:", total_chars)
        
except Exception as e:
    print("Error:", e)

Exercise 4: 

Change the urllinks.py program to extract and count paragraph (p) tags from the retrieved HTML document and display the count of the paragraphs as the output of your program. Do not display the paragraph text, only count them. Test your program on several small web pages as well as some larger web pages.

In [None]:
import urllib.request
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')

try:
    # Retrieve the HTML document
    html = urllib.request.urlopen(url, context=ctx).read()
    soup = BeautifulSoup(html, 'html.parser')
    
    # Find all the paragraph tags
    tags = soup('p')
    print('Number of paragraph tags:', len(tags))
    
except Exception as e:
    print('Error:', e)

Exercise 5: 

(Advanced) Change the socket program so that it only shows data after the headers and a blank line have been received. Remember that recv receives characters (newlines and all), not lines.

In [None]:
import socket

url = input('Enter URL: ')

try:
    parts = url.split('/')
    host = parts[2]
    path = '/' + '/'.join(parts[3:])
    
    mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    mysock.connect((host, 80))
    
    cmd = f'GET {path} HTTP/1.0\r\nHOST: {host}\r\n\r\n'.encode()
    mysock.send(cmd)
    
    response = b""
    while True:
        data = mysock.recv(512)
        if len(data) < 1:
            break
        response += data
        
    mysock.close()
    
    response = response.decode()
    header_end = response.find("\r\n\r\n")
    
    if header_end != -1:
        body = response[header_end + 4:]
        print(body)
    else:
        print('No headers found.')
except Exception as e:
    print('Error:', e)

## Chapter 13

### Using Web Services

Once it became easy to retrieve documents and parse documents over HTTP using programs, it did not take long to develop an approach where we started producing documents that were specifically designed to be consumed by other programs (i.e., not HTML to be displayed in a browser).

There are two common formats that we use when exchanging data across the web. eXtensible Markup Language (XML) has been in use for a very long time and is best suited for exchanging document-style data. When programs just want to exchange dictionaries, lists, or other internal information with each other, they use JavaScript Object Notation (JSON) (see www.json.org). We will look at both formats.

#### 13.1 eXtensible Markup Language - XML

XML looks very similar to HTML, but XML is more structured than HTML. Here is a sample of an XML document:

In [None]:
'''
    <person>
        <name>Chuck</name>
        <phone type="intl">
            +1 734 303 4456
        </phone>
        <email hide="yes" />
    </person>
'''

Each pair of opening (e.g., <person>) and closing tags (e.g., </person>) represents an element or node with the same name as the tag (e.g., person). Each element can have some text, some attributes (e.g., hide), and other nested elements. If an XML element is empty (i.e., has no content), then it may be depicted by a self-closing tag (e.g., <email />).

Often it is helpful to think of an XML document as a tree structure where there is a top element (here: person), and other tags (e.g., phone) are drawn as children of their parent elements.
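The tree view of the sample document can be made visible with a few lines of code. The sketch below uses the standard library's xml.etree.ElementTree module, which the next section introduces; show is a hypothetical helper written for this illustration.

```python
import xml.etree.ElementTree as ET

data = '''<person>
    <name>Chuck</name>
    <phone type="intl">+1 734 303 4456</phone>
    <email hide="yes" />
</person>'''

def show(element, depth=0):
    """Print an element and its children, indented by depth (hypothetical helper)."""
    print('  ' * depth + element.tag, element.attrib)
    for child in element:          # iterating an element yields its children
        show(child, depth + 1)

root = ET.fromstring(data)
show(root)
```

The top element person is printed first, and name, phone, and email appear one level below it, each with its attributes.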

#### 13.2 Parsing the XML

Here is a simple application that parses some XML and extracts some data elements from the XML:

In [None]:
import xml.etree.ElementTree as ET

data = '''
<person>
    <name>Chuck</name>
    <phone type="intl">
        +1 734 303 4456
    </phone>
    <email hide="yes" />
</person>'''

tree = ET.fromstring(data)
print('Name:', tree.find('name').text)
print('Attr:', tree.find('email').get('hide'))

The triple single quote ('''), as well as the triple double quote ("""), allow for the creation of strings that span multiple lines.

Calling fromstring converts the string representation of the XML into a “tree” of XML elements. When the XML is in a tree, we have a series of methods we can call to extract portions of data from the XML string. The find function searches through the XML tree and retrieves the element that matches the specified tag.

Name: Chuck  
Attr: yes  

Using an XML parser such as ElementTree has the advantage that while the XML in this example is quite simple, it turns out there are many rules regarding valid XML, and using ElementTree allows us to extract data from XML without worrying about the rules of XML syntax.
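The parser also tells us when a document breaks those rules. As a small sketch, the made-up string below never closes its name tag, and ElementTree reports the problem by raising ParseError.

```python
import xml.etree.ElementTree as ET

# The <name> tag is never closed, so this is not valid XML.
broken = '<person><name>Chuck</person>'

try:
    ET.fromstring(broken)
    parsed = True
except ET.ParseError as err:
    parsed = False
    print('Could not parse:', err)
```

Catching the exception lets a program report bad input instead of stopping with a traceback.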

#### 13.3 Looping through nodes

Often the XML has multiple nodes and we need to write a loop to process all of the nodes. In the following program, we loop through all of the user nodes:

In [None]:
import xml.etree.ElementTree as ET

data = '''
<stuff>
    <users>
        <user x="2">
            <id>001</id>
            <name>Chuck</name>
        </user>
        <user x="7">
            <id>009</id>
            <name>Brent</name>
        </user>
    </users>
</stuff>'''

stuff = ET.fromstring(data)
lst = stuff.findall('users/user')
print('User count:', len(lst))

for item in lst:
    print('Name:', item.find('name').text)
    print('Id:', item.find('id').text)
    print('Attribute', item.get('x'))

The findall method retrieves a Python list of subtrees that represent the user structures in the XML tree. Then we can write a for loop that looks at each of the user nodes, and prints the name and id text elements as well as the x attribute from the user node.

It is important to include all parent level elements in the findall statement except for the top level element (e.g., users/user). Otherwise, Python will not find any desired nodes.

In [None]:
import xml.etree.ElementTree as ET

data = '''
<stuff>
    <users>
        <user x="2">
            <id>001</id>
            <name>Chuck</name>
        </user>
        <user x="7">
            <id>009</id>
            <name>Brent</name>
        </user>
    </users>
</stuff>'''

stuff = ET.fromstring(data)

lst = stuff.findall('users/user')
print('User count:', len(lst))

lst2 = stuff.findall('user')
print('User count:', len(lst2))

lst stores all user elements that are nested within their users parent. lst2 looks for user elements that are direct children of the top-level stuff element, and there are none.

User count: 2  
User count: 0
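When we do not want to spell out every parent element, ElementTree also accepts a path that starts with .// and matches elements at any depth. The following sketch uses a condensed version of the same data.

```python
import xml.etree.ElementTree as ET

data = '''<stuff>
    <users>
        <user x="2"><name>Chuck</name></user>
        <user x="7"><name>Brent</name></user>
    </users>
</stuff>'''

stuff = ET.fromstring(data)

# './/user' matches user elements at any depth below the top element.
anywhere = stuff.findall('.//user')
names = [user.find('name').text for user in anywhere]
print('User count:', len(anywhere))
print(names)
```

This finds both users even though the path never mentions the users element.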

#### 13.4 JavaScript Object Notation - JSON

The JSON format was inspired by the object and array format used in the JavaScript language. But since Python was invented before JavaScript, Python’s syntax for dictionaries and lists influenced the syntax of JSON. So the format of JSON is nearly identical to a combination of Python lists and dictionaries.

Here is a JSON encoding that is roughly equivalent to the simple XML from above:

In [None]:
{
    "name" : "Chuck",
    "phone" : {
        "type" : "intl",
        "number" : "+1 734 303 4456"
    },
    "email" : {
        "hide" : "yes"
    }
}

You will notice some differences. First, in XML, we can add attributes like “intl” to the “phone” tag. In JSON, we simply have key-value pairs. Also the XML “person” tag is gone, replaced by a set of outer curly braces.

In general, JSON structures are simpler than XML because JSON has fewer capabilities than XML. But JSON has the advantage that it maps directly to some combination of dictionaries and lists. And since nearly all programming languages have something equivalent to Python’s dictionaries and lists, JSON is a very natural format to have two cooperating programs exchange data.

JSON is quickly becoming the format of choice for nearly all data exchange between applications because of its relative simplicity compared to XML.

#### 13.5 Parsing JSON

We construct our JSON by nesting dictionaries and lists as needed. In this example, we represent a list of users where each user is a set of key-value pairs (i.e., a dictionary). So we have a list of dictionaries.

In the following program, we use the built-in json library to parse the JSON and read through the data. Compare this closely to the equivalent XML data and code above. The JSON has less detail, so we must know in advance that we are getting a list and that the list is of users and each user is a set of key-value pairs. The JSON is more succinct (an advantage) but also is less self-describing (a disadvantage).

In [None]:
import json

data = '''
[
{ "id" : "001",
    "x" : "2",
    "name" : "Chuck"
} ,
{ "id" : "009",
    "x" : "7",
    "name" : "Brent"
}
]'''

info = json.loads(data)
print('User count:', len(info))

for item in info:
    print('Name', item['name'])
    print('Id', item['id'])
    print('Attribute', item['x'])

If you compare the code to extract data from the parsed JSON and XML you will see that what we get from json.loads() is a Python list which we traverse with a for loop, and each item within that list is a Python dictionary. Once the JSON has been parsed, we can use the Python index operator to extract the various bits of data for each user. We don’t have to use the JSON library to dig through the parsed JSON, since the returned data is simply native Python structures.

The output of this program is essentially the same as that of the XML version above; only the punctuation of the labels differs.

User count: 2  
Name Chuck  
Id 001  
Attribute 2  
Name Brent  
Id 009  
Attribute 7  
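
Because the parsed result is ordinary Python data, the usual list and dictionary tools apply to it directly. The following sketch reuses the same two users in compact form and assumes we want the names and the total of the "x" values. Note that "x" is stored as a string in this data, so we convert it with int().

```python
import json

data = '''
[
  { "id" : "001", "x" : "2", "name" : "Chuck" },
  { "id" : "009", "x" : "7", "name" : "Brent" }
]'''

info = json.loads(data)

# A list comprehension collects one field from every dictionary in the list.
names = [item['name'] for item in info]

# The "x" values are strings, so convert each one before adding.
total = sum(int(item['x']) for item in info)

print(names)   # ['Chuck', 'Brent']
print(total)   # 9
```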

In general, there is an industry trend away from XML and towards JSON for web services. Because JSON is simpler and maps more directly to native data structures we already have in programming languages, the parsing and data extraction code is usually simpler and more direct when using JSON. But XML is more self-descriptive than JSON, and so there are some applications where XML retains an advantage. For example, most word processors store documents internally using XML rather than JSON.

#### 13.6 Application Programming Interfaces

We now have the ability to exchange data between applications using Hypertext Transfer Protocol (HTTP) and a way to represent complex data that we are sending back and forth between these applications using eXtensible Markup Language (XML) or JavaScript Object Notation (JSON).

The next step is to begin to define and document “contracts” between applications using these techniques. The general name for these application-to-application contracts is Application Programming Interfaces (APIs). When we use an API, generally one program makes a set of services available for use by other applications and publishes the APIs (i.e., the “rules”) that must be followed to access the services provided by the program.

When we begin to build our programs where the functionality of our program includes access to services provided by other programs, we call the approach a Service-oriented architecture (SOA). An SOA approach is one where our overall application makes use of the services of other applications. A non-SOA approach is where the application is a single standalone application which contains all of the code necessary to implement the application.

#### 13.7 Security and API usage

It is quite common that you need an API key to make use of a vendor’s API. The general idea is that they want to know who is using their services and how much each user is using. Perhaps they have free and pay tiers of their services or have a policy that limits the number of requests that a single individual can make during a particular time period.

Sometimes once you get your API key, you simply include the key as part of POST data or perhaps as a parameter on the URL when calling the API.
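
As a rough sketch of the URL-parameter approach, the code below builds a request URL that carries an API key. The endpoint, the parameter name `key`, and the key value are all hypothetical. Each vendor documents its own parameter names, so treat this only as an illustration of how urlencode() assembles the query string.

```python
import urllib.parse

# Hypothetical endpoint and key, used only to show how the URL is assembled.
service_url = 'https://api.example.com/lookup?'
api_key = 'YOUR-API-KEY'

params = {'q': 'Ann Arbor, MI', 'key': api_key}

# urlencode() escapes spaces and punctuation so the values are safe in a URL.
url = service_url + urllib.parse.urlencode(params)
print(url)
```

Keeping the key in a variable, rather than typing it into the URL by hand, makes it easier to avoid publishing the key by accident when sharing code.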

Other times, the vendor wants increased assurance of the source of the requests and so they expect you to send cryptographically signed messages using shared keys and secrets. A very common technology that is used to sign requests over the Internet is called OAuth. You can read more about the OAuth protocol at www.oauth.net.

Thankfully there are a number of convenient and free OAuth libraries so you can avoid writing an OAuth implementation from scratch by reading the specification. These libraries are of varying complexity and have varying degrees of richness. The OAuth web site has information about various OAuth libraries.

#### Google geocoding web service

Google has an excellent web service that allows us to make use of their large database of geographic information. We can submit a geographical search string like “Ann Arbor, MI” to their geocoding API and have Google return its best guess as to where on a map we might find our search string and tell us about the landmarks nearby.

The geocoding service is free but rate limited so you cannot make unlimited use of the API in a commercial application. But if you have some survey data where an end user has entered a location in a free-format input box, you can use this API to clean up your data quite nicely.

*When you are using a free API, you need to be respectful in your use of these resources. If too many people abuse the service, Google might drop or significantly curtail its free service.*

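The following is a simple application that prompts the user for a search string, calls a geocoding API, and extracts information from the returned JSON. Note that the code sends its requests to the py4e-data.dr-chuck.net/opengeo endpoint rather than to Google directly.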

In [None]:
import urllib.request, urllib.parse, urllib.error
import json
import ssl

serviceurl = 'https://py4e-data.dr-chuck.net/opengeo?'

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

while True:
    address = input('Enter location: ')
    if len(address) < 1:
        break
    
    address = address.strip()
    parms = dict()
    parms['q'] = address
    
    url = serviceurl + urllib.parse.urlencode(parms)
    
    print('Retrieving', url)
    uh = urllib.request.urlopen(url, context=ctx)
    data = uh.read().decode()
    print('Retrieved', len(data), 'characters', data[:20].replace('\n', ' '))
    
    # Only a JSON parsing failure should be treated as bad data
    try:
        js = json.loads(data)
    except json.JSONDecodeError:
        js = None
        
    if not js or 'features' not in js:
        print('=== Download Error ===')
        print(data)
        break
        
    if len(js['features']) == 0:
        print('=== Object not found ===')
        print(data)
        break
        
    print(json.dumps(js, indent=4))
    
    lat = js['features'][0]['properties']['lat']
    lon = js['features'][0]['properties']['lon']
    print('lat', lat, 'lon', lon)
    location = js['features'][0]['properties']['formatted']
    print(location)
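
The extraction step can be tried without a network connection. The sketch below assumes a response shaped the way the program above expects (a "features" list whose items carry a "properties" dictionary). The sample text is made up for illustration, and a real response would contain many more fields. The helper name `first_location` is also hypothetical.

```python
import json

# Made-up sample shaped like the response the program above expects.
sample = '''
{
  "features": [
    {"properties": {"lat": 42.28, "lon": -83.74,
                    "formatted": "Ann Arbor, MI, United States of America"}}
  ]
}'''

def first_location(text):
    """Return (lat, lon, formatted) for the first feature, or None."""
    try:
        js = json.loads(text)
    except json.JSONDecodeError:
        return None                      # the text was not valid JSON
    if not isinstance(js, dict):
        return None                      # valid JSON, but not an object
    features = js.get('features')
    if not features:
        return None                      # key missing or list empty
    props = features[0]['properties']
    return props['lat'], props['lon'], props['formatted']

print(first_location(sample))
print(first_location('not json'))          # None
print(first_location('{"features": []}'))  # None
```

Putting the checks in a function lets the calling code treat every failure case the same way.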

## Chapter 14

### Object-oriented programming

#### 14.1 Managing larger programs

At the beginning of this book, we came up with four basic programming patterns which we use to construct programs:

• Sequential code  
• Conditional code (if statements)  
• Repetitive code (loops)  
• Store and reuse (functions)  

In later chapters, we explored simple variables as well as collection data structures like lists, tuples, and dictionaries.

As we build programs, we design data structures and write code to manipulate those data structures. There are many ways to write programs and by now, you probably have written some programs that are “not so elegant” and other programs that are “more elegant”. Even though your programs may be small, you are starting to see how there is a bit of art and aesthetic to writing code.

As programs get to be millions of lines long, it becomes increasingly important to write code that is easy to understand. If you are working on a million-line program, you can never keep the entire program in your mind at the same time. We need ways to break large programs into multiple smaller pieces so that we have less to look at when solving a problem, fixing a bug, or adding a new feature.

In a way, object-oriented programming is a way to arrange your code so that you can zoom into 50 lines of the code and understand it while ignoring the other 999,950 lines of code for the moment.

#### 14.2 Getting started

Like many aspects of programming, it is necessary to learn the concepts of object-oriented programming before you can use them effectively. You should approach this chapter as a way to learn some terms and concepts and work through a few simple examples to lay a foundation for future learning.

The key outcome of this chapter is to have a basic understanding of how objects are constructed and how they function and most importantly how we make use of the capabilities of objects that are provided to us by Python and Python libraries.

#### 14.3 Using Objects

As it turns out, we have been using objects all along in this book. Python provides us with many built-in objects. Here is some simple code where the first few lines should feel very simple and natural to you.

stuff = list()  
stuff.append('python')  
stuff.append('chuck')  
stuff.sort()   
print (stuff[0])  
print (stuff.__getitem__(0))  
print (list.__getitem__(stuff,0))  

Instead of focusing on what these lines accomplish, let’s look at what is really happening from the point of view of object-oriented programming. Don’t worry if the following paragraphs don’t make any sense the first time you read them because we have not yet defined all of these terms.

The first line constructs an object of type list, the second and third lines call the append() method, the fourth line calls the sort() method, and the fifth line retrieves the item at position 0.

The sixth line calls the __getitem__() method in the stuff list with a parameter of zero.

print (stuff.__getitem__(0))

The seventh line is an even more verbose way of retrieving the 0th item in the list.

print (list.__getitem__(stuff,0))

In this code, we call the __getitem__ method in the list class and pass the list and the item we want retrieved from the list as parameters. The last three lines of the program are equivalent, but it is more convenient to simply use the square bracket syntax to look up an item at a particular position in a list.

We can take a look at the capabilities of an object by looking at the output of the dir() function:

>>> stuff = list()  
>>> dir(stuff)  
['__add__', '__class__', '__contains__', '__delattr__',  
'__delitem__', '__dir__', '__doc__', '__eq__',  
'__format__', '__ge__', '__getattribute__', '__getitem__',  
'__gt__', '__hash__', '__iadd__', '__imul__', '__init__',  
'__iter__', '__le__', '__len__', '__lt__', '__mul__',  
'__ne__', '__new__', '__reduce__', '__reduce_ex__',  
'__repr__', '__reversed__', '__rmul__', '__setattr__',  
'__setitem__', '__sizeof__', '__str__', '__subclasshook__',  
'append', 'clear', 'copy', 'count', 'extend', 'index',  
'insert', 'pop', 'remove', 'reverse', 'sort']  
>>> 
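
The dir() output is long because it includes the specially named methods that begin and end with double underscores. As a small sketch, we can filter those out to see only the methods we normally call by name. The variable name `public` is just an illustrative choice.

```python
stuff = list()

# Keep only the names that do not start with an underscore.
public = [name for name in dir(stuff) if not name.startswith('_')]
print(public)

# Each remaining name is a method we can call with dot notation.
print(callable(stuff.append))
```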

The rest of this chapter will define all of the above terms so make sure to come back after you finish the chapter and re-read the above paragraphs to check your understanding.

#### 14.4 Starting with programs

A program in its most basic form takes some input, does some processing, and produces some output. Our elevator conversion program demonstrates a very short but complete program showing all three of these steps.

In [None]:
usf = input('Enter the US floor Number: ')
wf = int(usf) - 1
print('Non-US Floor Number is: ', wf)

If we think a bit more about this program, there is the “outside world” and the program. The input and output aspects are where the program interacts with the outside world. Within the program we have code and data to accomplish the task the program is designed to solve.

One way to think about object-oriented programming is that it separates our program into multiple “zones.” Each zone contains some code and data (like a program) and has well defined interactions with the outside world and the other zones within the program.

If we look back at the link extraction application where we used the BeautifulSoup library, we can see a program that is constructed by connecting different objects together to accomplish a task:

In [None]:
# To run this, download the BeautifulSoup zip file
# http://www.py4e.com/code3/bs4.zip
# and unzip it in the same directory as this file

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

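# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')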

tags = soup('a')
for tag in tags:
    print(tag.get('href', None))

We read the URL into a string and then pass that into urllib to retrieve the data from the web. The urllib library uses the socket library to make the actual network connection to retrieve the data. We take the data that urllib returns and hand it to BeautifulSoup for parsing. BeautifulSoup makes use of the html.parser object and returns an object. We call the returned object with 'a' (soup('a')), which returns a list of tag objects. We loop through the tags and call the get() method for each tag to print out the href attribute.

We can draw a picture of this program and how the objects work together.

The key here is not to understand perfectly how this program works but to see how we build a network of interacting objects and orchestrate the movement of information between the objects to create a program. It is also important to note that when you looked at that program several chapters back, you could fully understand what was going on in the program without even realizing that the program was “orchestrating the movement of data between objects.” It was just lines of code that got the job done.

#### 14.5 Subdividing a problem

One of the advantages of the object-oriented approach is that it can hide complexity. For example, while we need to know how to use the urllib and BeautifulSoup code, we do not need to know how those libraries work internally. This allows us to focus on the part of the problem we need to solve and ignore the other parts of the program.

This ability to focus exclusively on the part of a program that we care about and ignore the rest is also helpful to the developers of the objects that we use. For example, the programmers developing BeautifulSoup do not need to know or care about how we retrieve our HTML page, what parts we want to read, or what we plan to do with the data we extract from the web page.

#### 14.6 Our first Python object

At a basic level, an object is simply some code plus data structures that are smaller than a whole program. Defining a function allows us to store a bit of code and give it a name and then later invoke that code using the name of the function.

An object can contain a number of functions (which we call methods) as well as data that is used by those functions. We call data items that are part of the object attributes.

We use the class keyword to define the data and code that will make up each of the objects. The class keyword includes the name of the class and begins an indented block of code where we include the attributes (data) and methods (code).

In [None]:
class PartyAnimal:
    def __init__(self):
        self.x = 0
    
    def party(self):
        self.x = self.x + 1
        print('So far', self.x)
        
an = PartyAnimal()
an.party()
an.party()
an.party()
PartyAnimal.party(an)

Each method looks like a function, starting with the def keyword and consisting of an indented block of code.

The first method is a specially named method called __init__(). This method is called to do any initial setup of the data we want to store in the object. In this class we allocate the x attribute using dot notation and initialize it to zero.

self.x = 0

The other method is named party. The methods all have a special first parameter that we name by convention self. The first parameter gives us access to the object instance so we can set attributes and call methods using dot notation.

Just as the def keyword does not cause function code to be executed, the class keyword does not create an object. Instead, the class keyword defines a template indicating what data and code will be contained in each object of type PartyAnimal. The class is like a cookie cutter and the objects created using the class are the cookies. You don’t put frosting on the cookie cutter; you put frosting on the cookies, and you can put different frosting on each cookie.

If we continue through this sample program, we see the first executable line of code:

an = PartyAnimal()

This is where we instruct Python to construct (i.e., create) an object or instance of the class PartyAnimal. It looks like a function call to the class itself. Python constructs the object with the right data and methods and returns the object which is then assigned to the variable an. In a way this is quite similar to the following line which we have been using all along:

counts = dict()

Here we instruct Python to construct an object using the dict template (already present in Python), return the instance of dictionary, and assign it to the variable counts.

When the PartyAnimal class is used to construct an object, the variable an is used to point to that object. We use an to access the code and data for that particular instance of the PartyAnimal class.

Each PartyAnimal object/instance contains within it a variable x and a method/function named party. We call the party method in this line:

an.party()

When the party method is called, the first parameter (which we call by convention self) points to the particular instance of the PartyAnimal object that party is called from. Within the party method, we see the line:

self.x = self.x + 1

This syntax using the dot operator is saying ‘the x within self.’ Each time party() is called, the internal x value is incremented by 1 and the value is printed out.

The following line is another way to call the party method within the an object:

PartyAnimal.party(an)

In this variation, we access the code from within the class and explicitly pass the object pointer an as the first parameter (i.e., self within the method). You can think of an.party() as shorthand for the above line.

When the program executes, it produces the following output:

So far 1  
So far 2  
So far 3  
So far 4  

The object is constructed, and the party method is called four times, both incrementing and printing the value for x within the an object.
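
The equivalence between calling a method through the instance and calling it through the class can be checked on a small example. This sketch uses a hypothetical class named `Counter`, which is not part of the program above, and returns the count instead of printing it so the two calls are easy to compare.

```python
class Counter:
    def __init__(self):
        self.count = 0

    def bump(self):
        self.count = self.count + 1
        return self.count

c = Counter()

first = c.bump()           # usual form: Python passes c as self for us
second = Counter.bump(c)   # explicit form: we pass c as self ourselves

print(first, second)       # 1 2
```

Both calls update the same object, so the second call continues counting where the first left off.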

#### 14.7 Classes as types

As we have seen, in Python all variables have a type. We can use the built-in dir function to examine the capabilities of a variable. We can also use type and dir with the classes that we create.

In [None]:
class PartyAnimal:
    
    def __init__(self):
        self.x = 0
            
    def party(self):
        self.x = self.x + 1
        print("So far", self.x)
        
an = PartyAnimal()
print("Type", type(an))
print("Dir", dir(an))
print("Type", type(an.x))
print("Type", type(an.party))

You can see that using the class keyword, we have created a new type. From the dir output, you can see both the x integer attribute and the party method are available in the object.
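
As a short sketch of how to pick out the parts we defined ourselves, the code below filters the dir output and uses the built-in vars and callable functions. The variable name `own_names` is an illustrative choice.

```python
class PartyAnimal:

    def __init__(self):
        self.x = 0

    def party(self):
        self.x = self.x + 1
        print("So far", self.x)

an = PartyAnimal()

# Leave out the double-underscore names that every object has.
own_names = [name for name in dir(an) if not name.startswith('__')]
print(own_names)             # ['party', 'x']

print(vars(an))              # {'x': 0}  (data stored in this instance)
print(callable(an.party))    # True      (party is code, not data)
print(isinstance(an, PartyAnimal))
```

The vars function shows only the data stored in the instance, while dir also lists the methods that come from the class.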

#### 14.8 Object lifecycle

In the previous examples, we define a class (template), use that class to create an instance of that class (object), and then use the instance. When the program finishes, all of the variables are discarded. Usually, we don’t think much about the creation and destruction of variables, but often as our objects become more complex, we need to take some action within the object to set things up as the object is constructed and possibly clean things up as the object is discarded.

If we want our object to be aware of these moments of construction and destruction, we add specially named methods to our object:

In [None]:
class PartyAnimal:
    
    def __init__(self):
        self.x = 0
        print('I am constructed')
        
    def party(self):
        self.x = self.x + 1
        print('So far', self.x)
        
    def __del__(self):
        print('I am destructed', self.x)
        
an = PartyAnimal()
an.party()
an.party()
an = 42
print('an contains', an)

When this program executes, it produces the following output:

I am constructed  
So far 1  
So far 2  
I am destructed 2  
an contains 42  

As Python constructs our object, it calls our __init__ method to give us a chance to set up some default or initial values for the object. When Python encounters the line:

an = 42

It actually “throws our object away” so it can reuse the an variable to store the value 42. Just at the moment when our an object is being “destroyed” our destructor code (__del__) is called. We cannot stop our variable from being destroyed, but we can do any necessary cleanup right before our object no longer exists.

When developing objects, it is quite common to add a constructor to an object to set up initial values for the object. It is relatively rare to need a destructor for an object.

#### 14.9 Multiple instances

So far, we have defined a class, constructed a single object, used that object, and then thrown the object away. However, the real power in object-oriented programming happens when we construct multiple instances of our class.

When we construct multiple objects from our class, we might want to set up different initial values for each of the objects. We can pass data to the constructors to give each object a different initial value:

In [None]:
class PartyAnimal():
    
    def __init__(self, nam):
        self.x = 0
        self.name = nam
        print(self.name, 'constructed')
        
    def party(self):
        self.x = self.x + 1
        print(self.name, 'party count', self.x)
        
s = PartyAnimal('Sally')
s.party()
j = PartyAnimal('Jim')

j.party()
s.party()

The constructor has both a self parameter that points to the object instance and additional parameters that are passed into the constructor as the object is constructed:

s = PartyAnimal('Sally')

Within the constructor, the second line copies the parameter (nam) that is passed into the name attribute within the object instance.

self.name = nam

The output of the program shows that each of the objects (s and j) contains its own independent copies of x and name:

Sally constructed  
Sally party count 1  
Jim constructed  
Jim party count 1  
Sally party count 2  

#### 14.10 Inheritance

Another powerful feature of object-oriented programming is the ability to create a new class by extending an existing class. When extending a class, we call the original class the parent class and the new class the child class.

For this example, we move our PartyAnimal class into its own file. Then, we can ‘import’ the PartyAnimal class in a new file and extend it, as follows:

In [None]:
from party import PartyAnimal

class CricketFan(PartyAnimal):
    
    def __init__(self, nam):
        super().__init__(nam)
        self.points = 0
        
    def six(self):
        self.points = self.points + 6
        self.party()
        print(self.name, 'points', self.points)
        

s = PartyAnimal('Sally')
s.party()
j = CricketFan('Jim')
j.party()
j.six()
print(dir(j))

When we define the CricketFan class, we indicate that we are extending the PartyAnimal class. This means that all of the variables (x) and methods (party) from the PartyAnimal class are inherited by the CricketFan class. For example, within the six method in the CricketFan class, we call the party method from the PartyAnimal class.

We use a special syntax in the __init__() method in the CricketFan class to ensure that we call the __init__() method in the PartyAnimal class so that whatever setup PartyAnimal needs is done in addition to the setup needed for the CricketFan extensions.

def __init__(self, nam) :  
    super().__init__(nam)  
    self.points = 0  

The super() syntax is telling Python to call the __init__ method in the class that we are extending. PartyAnimal is the super (or parent) class and CricketFan is the sub (or child) class.

As the program executes, we create s and j as independent instances of PartyAnimal and CricketFan. The j object has additional capabilities beyond the s object.

Sally constructed  
Sally party count 1  
Jim constructed  
Jim party count 1  
Jim party count 2  
Jim points 6  
['__class__', '__delattr__', ... '__weakref__',
'name', 'party', 'points', 'six', 'x']  

In the dir output for the j object (instance of the CricketFan class), we see that it has the attributes and methods of the parent class, as well as the attributes and methods that were added when the class was extended to create the CricketFan class.
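
The parent and child relationship can also be checked with the built-in isinstance and issubclass functions. The sketch below uses simplified copies of the two classes, placed in one file with the print statements removed, so it runs on its own without the party module.

```python
class PartyAnimal:
    def __init__(self, nam):
        self.x = 0
        self.name = nam

    def party(self):
        self.x = self.x + 1
        return self.x

class CricketFan(PartyAnimal):
    def __init__(self, nam):
        super().__init__(nam)   # let the parent class set up x and name
        self.points = 0

    def six(self):
        self.points = self.points + 6
        self.party()            # inherited from PartyAnimal
        return self.points

j = CricketFan('Jim')
j.six()

print(isinstance(j, CricketFan))             # True
print(isinstance(j, PartyAnimal))            # True: a child instance is also a parent instance
print(issubclass(CricketFan, PartyAnimal))   # True
print(j.x, j.points)                         # 1 6
```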

#### 14.11 Summary

This is a very quick introduction to object-oriented programming that focuses mainly on terminology and the syntax of defining and using objects. Let’s quickly review the code that we looked at in the beginning of the chapter. At this point you should fully understand what is going on.

In [None]:
stuff = list()
stuff.append('python')
stuff.append('chuck')
stuff.sort()
print(stuff[0])
print(stuff.__getitem__(0))
print(list.__getitem__(stuff, 0))

The first line constructs a list object. When Python creates the list object, it calls the constructor method (named __init__) to set up the internal data attributes that will be used to store the list data. We have not passed any parameters to the constructor. When the constructor returns, we use the variable stuff to point to the returned instance of the list class.

The second and third lines call the append method with one parameter to add a new item at the end of the list by updating the attributes within stuff. Then in the fourth line, we call the sort method with no parameters to sort the data within the stuff object.

We then print out the first item in the list using the square brackets which are a shortcut to calling the __getitem__ method within the stuff. This is equivalent to calling the __getitem__ method in the list class and passing the stuff object as the first parameter and the position we are looking for as the second parameter.

At the end of the program, the stuff object is discarded but not before calling the destructor (named __del__) so that the object can clean up any loose ends as necessary.

Those are the basics of object-oriented programming. There are many additional details as to how to best use object-oriented approaches when developing large applications and libraries that are beyond the scope of this chapter.

## Chapter 15

### Using Databases and SQL

#### 15.1 What is a database?

A database is a file structured to store and manage data efficiently. Like a dictionary, it maps keys to values, but unlike a dictionary, it is stored on permanent storage, such as a disk, allowing it to persist even after the program ends. This also enables databases to hold much larger volumes of data compared to dictionaries, which are limited by the computer’s memory.

Similar to dictionaries, databases are optimized for fast data insertion and retrieval, even with large datasets. They achieve this efficiency by creating indexes as data is added, allowing quick access to specific entries.

There are various database systems tailored for different purposes, such as Oracle, MySQL, Microsoft SQL Server, PostgreSQL, and SQLite. This text focuses on SQLite because it is widely used, integrated into Python, and designed to be embedded in applications. For instance, the Firefox browser and many other products use SQLite internally.

SQLite is particularly suitable for addressing data manipulation challenges encountered in Informatics. For more details, visit SQLite’s website.

#### 15.2 Database concepts

When you first look at a database it looks like a spreadsheet with multiple sheets. The primary data structures in a database are: tables, rows, and columns.

In technical descriptions of relational databases the concepts of table, row, and column are more formally referred to as relation, tuple, and attribute, respectively. We will use the less formal terms in this chapter.

#### 15.3 Database Browser for SQLite

While this chapter will focus on using Python to work with data in SQLite database files, many operations can be done more conveniently using software called the Database Browser for SQLite which is freely available from:

http://sqlitebrowser.org/

Using the browser you can easily create tables, insert data, edit data, or run simple SQL queries on the data in the database.

In a sense, the database browser is similar to a text editor when working with text files. When you want to do one or very few operations on a text file, you can just open it in a text editor and make the changes you want. When you have many changes that you need to do to a text file, often you will write a simple Python program. You will find the same pattern when working with databases. You will do simple operations in the database browser and more complex operations will be most conveniently done in Python.

#### 15.4 Creating a database table

Databases require more defined structure than Python lists or dictionaries.

When we create a database table we must tell the database in advance the names of each of the columns in the table and the type of data which we are planning to store in each column. When the database software knows the type of data in each column, it can choose the most efficient way to store and look up the data based on the type of data.

You can look at the various data types supported by SQLite at the following url:  
http://www.sqlite.org/datatypes.html

Defining structure for your data up front may seem inconvenient at the beginning, but the payoff is fast access to your data even when the database contains a large amount of data.

The code to create a database file and a table named Track with two columns in the database is as follows:

In [None]:
import sqlite3

conn = sqlite3.connect('music.sqlite')
cur = conn.cursor()

cur.execute('DROP TABLE IF EXISTS Track')
cur.execute('CREATE TABLE Track (title TEXT, plays INTEGER)')

conn.close()

The connect operation makes a “connection” to the database stored in the file music.sqlite in the current directory. If the file does not exist, it will be created. The reason this is called a “connection” is that sometimes the database is stored on a separate “database server” from the server on which we are running our application. In our simple examples the database will just be a local file in the same directory as the Python code we are running.

A cursor is like a file handle that we can use to perform operations on the data stored in the database. Calling cursor() is very similar conceptually to calling open() when dealing with text files.

Once we have the cursor, we can begin to execute commands on the contents of the database using the execute() method.

Database commands are expressed in a special language that has been standardized across many different database vendors to allow us to learn a single database language. The database language is called Structured Query Language or SQL for short.

http://en.wikipedia.org/wiki/SQL

In our example, we are executing two SQL commands in our database. As a convention, we will show the SQL keywords in uppercase and the parts of the command that we are adding (such as the table and column names) will be shown in lowercase.

The first SQL command removes the Track table from the database if it exists. This pattern is simply to allow us to run the same program to create the Track table over and over again without causing an error. Note that the DROP TABLE command deletes the table and all of its contents from the database (i.e., there is no “undo”).

cur.execute('DROP TABLE IF EXISTS Track')

The second command creates a table named Track with a text column named title and an integer column named plays.

cur.execute('CREATE TABLE Track (title TEXT, plays INTEGER)')
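
To confirm that a table was created, we can ask SQLite to list its tables. The sketch below connects to the special name ':memory:', which gives a temporary database held in memory, so it does not touch music.sqlite. It then reads the table names from sqlite_master, SQLite's built-in catalog.

```python
import sqlite3

# ':memory:' asks SQLite for a temporary database that is never written to disk.
conn = sqlite3.connect(':memory:')
cur = conn.cursor()

cur.execute('DROP TABLE IF EXISTS Track')
cur.execute('CREATE TABLE Track (title TEXT, plays INTEGER)')

# sqlite_master holds one row for each table and index in the database.
cur.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
tables = [row[0] for row in cur]
print(tables)   # ['Track']

conn.close()
```

An in-memory database is convenient for experiments because it disappears when the connection is closed.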

Now that we have created a table named Track, we can put some data into that table using the SQL INSERT operation. Again, we begin by making a connection to the database and obtaining the cursor. We can then execute SQL commands using the cursor.

The SQL INSERT command indicates which table we are using and then defines a new row by listing the fields we want to include (title, plays) followed by the VALUES we want placed in the new row. We specify the values as question marks (?, ?) to indicate that the actual values are passed in as a tuple ( 'My Way', 15 ) as the second parameter to the execute() call.

In [None]:
import sqlite3

conn = sqlite3.connect('music.sqlite')
cur = conn.cursor()

cur.execute('INSERT INTO Track (title, plays) VALUES (?, ?)', 
            ('Thunderstruck', 20))
cur.execute('INSERT INTO Track (title, plays) VALUES (?, ?)',
           ('My Way', 15))
conn.commit()

print('Track:')
cur.execute('SELECT title, plays FROM Track')
for row in cur:
    print(row)
    
cur.execute('DELETE FROM Track WHERE plays < 100')
conn.commit()

cur.close()

First we INSERT two rows into our table and use commit() to force the data to be written to the database file.

Then we use the SELECT command to retrieve the rows we just inserted from the table. On the SELECT command, we indicate which columns we would like (title, plays) and indicate which table we want to retrieve the data from. After we execute the SELECT statement, the cursor is something we can loop through in a for statement. For efficiency, the cursor does not read all of the data from the database when we execute the SELECT statement. Instead, the data is read on demand as we loop through the rows in the for statement.

The output of the program is as follows:  
Track:   
('Thunderstruck', 20)  
('My Way', 15)

Our for loop finds two rows, and each row is a Python tuple with the first value as the title and the second value as the number of plays.

At the very end of the program, we execute an SQL command to DELETE the rows we have just created so we can run the program over and over. The DELETE command shows the use of a WHERE clause that allows us to express a selection criterion so that we can ask the database to apply the command to only the rows that match the criterion. In this example the criterion happens to apply to all the rows so we empty the table out so we can run the program repeatedly. After the DELETE is performed, we also call commit() to force the data to be removed from the database.

#### 15.5 Structured Query Language Summary

So far, we have been using the Structured Query Language in our Python examples and have covered many of the basics of the SQL commands. In this section, we look at the SQL language in particular and give an overview of SQL syntax.

Since there are so many different database vendors, the Structured Query Language (SQL) was standardized so we could communicate in a portable manner to database systems from multiple vendors.

A relational database is made up of tables, rows, and columns. The columns generally have a type such as text, numeric, or date data. When we create a table, we indicate the names and types of the columns:

CREATE TABLE Track (title TEXT, plays INTEGER)

To insert a row into a table, we use the SQL INSERT command:

INSERT INTO Track (title, plays) VALUES ('My Way', 15)

The INSERT statement specifies the table name, then a list of the fields/columns that you would like to set in the new row, and then the keyword VALUES and a list of corresponding values for each of the fields.

The SQL SELECT command is used to retrieve rows and columns from a database. The SELECT statement lets you specify which columns you would like to retrieve as well as a WHERE clause to select which rows you would like to see. It also allows an optional ORDER BY clause to control the sorting of the returned rows.

SELECT * FROM Track WHERE title = 'My Way'

Using * indicates that you want the database to return all of the columns for each row that matches the WHERE clause.

Note, unlike in Python, in a SQL WHERE clause we use a single equal sign to indicate a test for equality rather than a double equal sign. Other logical operations allowed in a WHERE clause include <, >, <=, >=, !=, as well as AND and OR and parentheses to build your logical expressions.

You can request that the returned rows be sorted by one of the fields as follows:

SELECT title,plays FROM Track ORDER BY title
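As a rough illustration of WHERE and ORDER BY working together, the sketch below builds a throwaway in-memory database and runs a filtered, sorted query. The `:memory:` database and the particular combination of conditions are assumptions made for this example; they are not part of the music.sqlite program above.

```python
import sqlite3

# An in-memory database keeps this sketch from touching any file on disk.
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE Track (title TEXT, plays INTEGER)')

# Sample rows reusing titles that appear in this chapter.
sample = [('Thunderstruck', 20), ('My Way', 15), ('New York', 25)]
cur.executemany('INSERT INTO Track (title, plays) VALUES (?, ?)', sample)
conn.commit()

# A single = tests equality; AND, OR, and parentheses combine conditions.
cur.execute('''SELECT title, plays FROM Track
               WHERE (plays >= 20 AND plays != 25) OR title = 'My Way'
               ORDER BY title''')
filtered = cur.fetchall()
for row in filtered:
    print(row)

conn.close()
```

Only 'My Way' and 'Thunderstruck' satisfy the WHERE clause, and ORDER BY returns them alphabetically by title.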

It is possible to UPDATE a column or columns within one or more rows in a table using the SQL UPDATE statement as follows:

UPDATE Track SET plays = 16 WHERE title = 'My Way'

The UPDATE statement specifies a table and then a list of fields and values to change after the SET keyword and then an optional WHERE clause to select the rows that are to be updated. A single UPDATE statement will change all of the rows that match the WHERE clause. If a WHERE clause is not specified, it performs the UPDATE on all of the rows in the table.

To remove a row, you need a WHERE clause on an SQL DELETE statement. The WHERE clause determines which rows are to be deleted:

DELETE FROM Track WHERE title = 'My Way'

These four basic SQL commands (INSERT, SELECT, UPDATE, and DELETE) allow the four basic operations needed to create and maintain data. We use “CRUD” (Create, Read, Update, and Delete) to capture all these concepts in a single term.
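The four operations can be walked through end to end from Python. The following is a minimal sketch under stated assumptions: it uses an in-memory database rather than music.sqlite, and the variable names are chosen only for this illustration.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE Track (title TEXT, plays INTEGER)')

# Create: add one row.
cur.execute('INSERT INTO Track (title, plays) VALUES (?, ?)', ('My Way', 15))

# Read: fetch the value we just stored.
cur.execute('SELECT plays FROM Track WHERE title = ?', ('My Way',))
plays_before = cur.fetchone()[0]

# Update: rowcount reports how many rows the statement changed.
cur.execute('UPDATE Track SET plays = ? WHERE title = ?', (16, 'My Way'))
updated = cur.rowcount
cur.execute('SELECT plays FROM Track WHERE title = ?', ('My Way',))
plays_after = cur.fetchone()[0]

# Delete: remove the row again.
cur.execute('DELETE FROM Track WHERE title = ?', ('My Way',))
deleted = cur.rowcount
cur.execute('SELECT COUNT(*) FROM Track')
remaining = cur.fetchone()[0]

conn.commit()
conn.close()
print(plays_before, plays_after, updated, deleted, remaining)
```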

#### 15.6 Multiple tables and basic data modeling

The real power of a relational database is when we create multiple tables and make links between those tables. The act of deciding how to break up your application data into multiple tables and establishing the relationships between the tables is called data modeling. The design document that shows the tables and their relationships is called a data model.

Data modeling is a relatively sophisticated skill and we will only introduce the most basic concepts of relational data modeling in this section. For more detail on data modeling you can start with:

http://en.wikipedia.org/wiki/Relational_model

Let’s say that for our tracks database we wanted to record the name of the artist for each track in addition to the title and number of plays. A simple approach might be to add another column called artist to the table and put the name of the artist in that column as follows:

DROP TABLE IF EXISTS Track;  
CREATE TABLE Track (title TEXT, plays INTEGER, artist TEXT);  

Then we could insert a few tracks into our table.

INSERT INTO Track (title, plays, artist)  
VALUES ('My Way', 15, 'Frank Sinatra');  
INSERT INTO Track (title, plays, artist)  
VALUES ('New York', 25, 'Frank Sinatra');  

If we were to look at our data with a SELECT * FROM Track statement, it looks like we have done a fine job.

sqlite> SELECT * FROM Track;  
My Way|15|Frank Sinatra  
New York|25|Frank Sinatra  
sqlite>  

We have made a very bad error in our data modeling. We have violated the rules of database normalization.  
https://en.wikipedia.org/wiki/Database_normalization

While database normalization seems very complex on the surface and contains a lot of mathematical justifications, for now we can reduce it all into one simple rule that we will follow.

We should never put the same string data in a column more than once. If we need the data more than once, we create a numeric key for the data and reference the actual data using this key, especially if the multiple entries refer to the same object.

To demonstrate the slippery slope we are going down by adding string columns to our database model, think about how we would change the data model if we wanted to keep track of the eye color of our artists. Would we do this?

DROP TABLE IF EXISTS Track;  
CREATE TABLE Track (title TEXT, plays INTEGER,  
artist TEXT, eyes TEXT);  
INSERT INTO Track (title, plays, artist, eyes)  
VALUES ('My Way', 15, 'Frank Sinatra', 'Blue');  
INSERT INTO Track (title, plays, artist, eyes)  
VALUES ('New York', 25, 'Frank Sinatra', 'Blue');  

Since Frank Sinatra recorded over 1200 songs, are we really going to put the string ‘Blue’ in 1200 rows in our Track table? And what would happen if we decided his eye color was ‘Light Blue’? Something just does not feel right.

The correct solution is to create an Artist table and store all the data about each artist in that table. Then we need to make a connection between a row in the Track table and a row in the Artist table. Perhaps we could call this “link” between two “tables” a “relationship” between two tables. And that is exactly what database experts decided to call these links.

Let’s make an Artist table as follows:

DROP TABLE IF EXISTS Artist;  
CREATE TABLE Artist (name TEXT, eyes TEXT);  
INSERT INTO Artist (name, eyes)  
VALUES ('Frank Sinatra', 'blue');  

Now we have two tables but we need a way to link rows in the two tables. To do this, we need what we call ‘keys’. These keys will just be integer numbers that we can use to look up a row in a different table. If we are going to make links to rows inside of a table, we need to add a primary key to the rows in the table. By convention we usually name the primary key column ‘id’. So our Artist table looks as follows:

DROP TABLE IF EXISTS Artist;  
CREATE TABLE Artist (id INTEGER, name TEXT, eyes TEXT);  
INSERT INTO Artist (id, name, eyes)  
VALUES (42, 'Frank Sinatra', 'blue');  

Now we have a row in the table for ‘Frank Sinatra’ (and his eye color) and a primary key of ‘42’ to use to link our tracks to him. So we alter our Track table as follows:

DROP TABLE IF EXISTS Track;  
CREATE TABLE Track (title TEXT, plays INTEGER,  
artist_id INTEGER);  
INSERT INTO Track (title, plays, artist_id)  
VALUES ('My Way', 15, 42);  
INSERT INTO Track (title, plays, artist_id)  
VALUES ('New York', 25, 42);  

The artist_id column is an integer, and by naming convention is a foreign key pointing at a primary key in the Artist table. We call it a foreign key because it is pointing to a row in a different table.

Now we are following the rules of database normalization, but when we want to get data out of our database, we don’t want to see the 42, we want to see the name and eye color of the artist. To do this we use the JOIN keyword in our SELECT statement.

SELECT title, plays, name, eyes  
FROM Track JOIN Artist  
ON Track.artist_id = Artist.id;  

The JOIN clause includes an ON condition that defines how the rows are to be connected. For each row in Track, add the data from the row in Artist where the artist_id from the Track table matches the id from the Artist table.

The output would be:

My Way|15|Frank Sinatra|blue  
New York|25|Frank Sinatra|blue  

While it might seem a little clunky and your instincts might tell you that it would be faster just to keep the data in one table, it turns out that the limit on database performance is how much data needs to be scanned when running a query. While the details are very complex, integers are a lot smaller than strings (especially Unicode) and far quicker to move and compare.
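To see the normalized tables and the JOIN working together from Python, here is a small sketch that assumes an in-memory database and reuses the SQL statements shown above. The final UPDATE is an addition for illustration: it shows that a fact stored once (the eye color) can be corrected in one place for every track.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()

# executescript() runs several semicolon-separated statements in one call.
cur.executescript('''
CREATE TABLE Artist (id INTEGER, name TEXT, eyes TEXT);
CREATE TABLE Track (title TEXT, plays INTEGER, artist_id INTEGER);
INSERT INTO Artist (id, name, eyes) VALUES (42, 'Frank Sinatra', 'blue');
INSERT INTO Track (title, plays, artist_id) VALUES ('My Way', 15, 42);
INSERT INTO Track (title, plays, artist_id) VALUES ('New York', 25, 42);
''')

cur.execute('''SELECT title, plays, name, eyes
               FROM Track JOIN Artist
               ON Track.artist_id = Artist.id
               ORDER BY title''')
joined = cur.fetchall()
for row in joined:
    print(row)

# The eye color lives in exactly one row, so one UPDATE fixes every track.
cur.execute("UPDATE Artist SET eyes = 'light blue' WHERE id = 42")
cur.execute('''SELECT DISTINCT eyes
               FROM Track JOIN Artist
               ON Track.artist_id = Artist.id''')
eyes_now = cur.fetchall()

conn.close()
```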

#### 15.7 Data model diagrams

While our Track and Artist database design is simple with just two tables and a single one-to-many relationship, these data models can get complicated quickly and are easier to understand if we can make a graphical representation of our data model.

While there are many graphical representations of data models, we will use one of the “classic” approaches, called “Crow’s Foot Diagrams” as shown in Figure 15.4. Each table is shown as a box with the name of the table and its columns. Then where there is a relationship between two tables a line is drawn connecting the tables with a notation added to the end of each line indicating the nature of the relationship.

[Figure 15.4: Crow’s Foot diagram of the Track and Artist tables]

https://en.wikipedia.org/wiki/Entity-relationship_model  
In this case, “many” tracks can be associated with each artist. So the track end is shown with the crow’s foot spread out, indicating that it is the “many” end. The artist end is shown with a vertical line that indicates “one”. There will be “many” artists in general, but the important aspect is that for each artist there will be many tracks.

You will note that the column that holds a foreign key, like artist_id, is on the “many” end and the primary key is at the “one” end.

Since the pattern of foreign and primary key placement is so consistent and follows the “many” and “one” ends of the lines, we never include either the primary or foreign key columns in our diagram of the data model, as shown in the second diagram in Figure 15.5. The columns are thought of as “implementation detail” that captures the nature of the relationship and not an essential part of the data being modeled.

#### 15.8 Automatically creating primary keys

In the above example, we arbitrarily assigned Frank the primary key of 42. However, when we are inserting millions of rows, it is nice to have the database automatically generate the values for the id column. We do this by declaring the id column as a PRIMARY KEY and leaving out the id value when inserting the row:

DROP TABLE IF EXISTS Artist;  
CREATE TABLE Artist (id INTEGER PRIMARY KEY,  
name TEXT, eyes TEXT);  
INSERT INTO Artist (name, eyes)  
VALUES ('Frank Sinatra', 'blue');  

Now we have instructed the database to auto-assign a unique value to the Frank Sinatra row. But we then need a way to have the database tell us the id value for the recently inserted row. One way is to use a SELECT statement to retrieve data from an SQLite built-in function called last_insert_rowid().

sqlite> DROP TABLE IF EXISTS Artist;  
sqlite> CREATE TABLE Artist (id INTEGER PRIMARY KEY,   
...> name TEXT, eyes TEXT);  
sqlite> INSERT INTO Artist (name, eyes)  
...> VALUES ('Frank Sinatra', 'blue');  
sqlite> select last_insert_rowid();  
1  
sqlite> SELECT * FROM Artist;  
1|Frank Sinatra|blue  
sqlite>  

Once we know the id of our ‘Frank Sinatra’ row, we can use it when we INSERT the tracks into the Track table. As a general strategy, we add these id columns to any table we create:
  
sqlite> DROP TABLE IF EXISTS Track;  
sqlite> CREATE TABLE Track (id INTEGER PRIMARY KEY,  
...> title TEXT, plays INTEGER, artist_id INTEGER);  

Note that the artist_id value is the newly auto-assigned id of the row in the Artist table and that, while we added an INTEGER PRIMARY KEY to the Track table, we did not include id in the list of fields on the INSERT statements into the Track table. Again, this tells the database to choose a unique value for us for the id column.

sqlite> INSERT INTO Track (title, plays, artist_id)  
...> VALUES ('My Way', 15, 1);  
sqlite> select last_insert_rowid();  
1  
sqlite> INSERT INTO Track (title, plays, artist_id)  
...> VALUES ('New York', 25, 1);  
sqlite> select last_insert_rowid();  
2  
sqlite>  

You can call SELECT last_insert_rowid(); after each of the inserts to retrieve the value that the database assigned to the id of each newly created row. Later when we are coding in Python, we can ask for the id value in our code and store it in a variable for later use.
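In Python, one way to ask for the new id is the `lastrowid` attribute that the sqlite3 cursor sets after an INSERT. The sketch below assumes an in-memory database and mirrors the sqlite shell session above.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE Artist (id INTEGER PRIMARY KEY, name TEXT, eyes TEXT)')
cur.execute('''CREATE TABLE Track (id INTEGER PRIMARY KEY,
               title TEXT, plays INTEGER, artist_id INTEGER)''')

cur.execute('INSERT INTO Artist (name, eyes) VALUES (?, ?)',
            ('Frank Sinatra', 'blue'))
artist_id = cur.lastrowid   # the id the database chose for the new row

track_ids = []
for title, plays in [('My Way', 15), ('New York', 25)]:
    cur.execute('INSERT INTO Track (title, plays, artist_id) VALUES (?, ?, ?)',
                (title, plays, artist_id))
    track_ids.append(cur.lastrowid)

conn.commit()
print(artist_id, track_ids)
conn.close()
```

Storing the value in a variable right after the INSERT avoids a second round trip to look the row up by name.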

#### 15.9 Logical keys for fast lookup

Suppose we had a table full of artists and a table full of tracks, each track with a foreign key link to a row in the Artist table, and we wanted to list all the tracks that were sung by ‘Frank Sinatra’. We could do so as follows:

SELECT title, plays, name, eyes  
FROM Track JOIN Artist  
ON Track.artist_id = Artist.id   
WHERE Artist.name = 'Frank Sinatra';  

Since we have two tables and a foreign key between the two tables, our data is well-modeled, but if we are going to have millions of records in the Artist table and going to do a lot of lookups by artist name, we would benefit if we gave the database a hint about our intended use of the name column.

We do this by adding an “index” to a text column that we intend to use in WHERE clauses:

CREATE INDEX artist_name ON Artist(name);

When the database has been told that an index is needed on a column in a table, it stores extra information to make it possible to look up a row more quickly using the indexed field (name in this example). Once you request that an index be created, there is nothing special that is needed in the SQL to access the table. The database keeps the index up to date as data is inserted, deleted, and updated, and uses it automatically if it will increase the performance of a database query. 

These text columns that are used to find rows based on some information in the “real world”, like the name of an artist, are called logical keys.
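One way to check whether the database actually uses an index is SQLite's EXPLAIN QUERY PLAN statement. The sketch below is illustrative only: the 1000 generated artist names and the `plan` helper are assumptions for this example, and the exact wording of the plan text differs between SQLite versions, so it is printed rather than quoted here.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE Artist (id INTEGER PRIMARY KEY, name TEXT, eyes TEXT)')

# Made-up rows, just so the table has something to search through.
cur.executemany('INSERT INTO Artist (name, eyes) VALUES (?, ?)',
                [('Artist %d' % i, 'blue') for i in range(1000)])

def plan(cursor):
    """Return the human-readable steps SQLite plans to use for the lookup."""
    cursor.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM Artist WHERE name = 'Artist 500'")
    # The last column of each result row holds the description.
    return [row[-1] for row in cursor.fetchall()]

plan_before = plan(cur)
cur.execute('CREATE INDEX artist_name ON Artist(name)')
plan_after = plan(cur)

print('Without index:', plan_before)
print('With index:   ', plan_after)
conn.close()
```

The SELECT statement itself is identical in both calls; only the plan chosen by the database changes once the index exists.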

#### 15.10 Adding constraints to the database

We can also use an index to enforce a constraint (i.e., a rule) on our database operations. The most common constraint is a uniqueness constraint, which insists that all of the values in a column are unique. We can add the optional UNIQUE keyword to the CREATE INDEX statement to tell the database that we would like it to enforce the constraint on our SQL. We can drop and re-create the artist_name index with a UNIQUE constraint as follows.

DROP INDEX artist_name;  
CREATE UNIQUE INDEX artist_name ON Artist(name);  

If we try to insert ‘Frank Sinatra’ a second time, it will fail with an error.

sqlite> SELECT * FROM Artist;  
1|Frank Sinatra|blue  
sqlite> INSERT INTO Artist (name, eyes)  
...> VALUES ('Frank Sinatra', 'blue');  
Runtime error: UNIQUE constraint failed: Artist.name (19)  
sqlite>  

We can tell the database to ignore any duplicate key errors by adding the OR IGNORE keywords to the INSERT statement as follows:

sqlite> INSERT OR IGNORE INTO Artist (name, eyes)  
...> VALUES ('Frank Sinatra', 'blue');  
sqlite> SELECT id FROM Artist WHERE name='Frank Sinatra';  
1  
sqlite> 

By combining an INSERT OR IGNORE and a SELECT, we can insert a new record if the name is not already there and, whether or not the record was already there, retrieve the primary key of the record.

sqlite> INSERT OR IGNORE INTO Artist (name, eyes)  
...> VALUES ('Elvis', 'blue');  
sqlite> SELECT id FROM Artist WHERE name='Elvis';  
2  
sqlite> SELECT * FROM Artist;  
1|Frank Sinatra|blue  
2|Elvis|blue  
sqlite>  

Since we have not added a uniqueness constraint to the eye color column, there is no problem having multiple ‘blue’ values in the eyes column.
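When the same statements run from Python, a violated uniqueness constraint surfaces as an exception. The sketch below assumes an in-memory database and shows that a plain INSERT of a duplicate raises `sqlite3.IntegrityError`, while INSERT OR IGNORE leaves the table unchanged without raising.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE Artist (id INTEGER PRIMARY KEY, name TEXT, eyes TEXT)')
cur.execute('CREATE UNIQUE INDEX artist_name ON Artist(name)')
cur.execute('INSERT INTO Artist (name, eyes) VALUES (?, ?)',
            ('Frank Sinatra', 'blue'))

# A plain INSERT of a duplicate name raises an exception in Python.
try:
    cur.execute('INSERT INTO Artist (name, eyes) VALUES (?, ?)',
                ('Frank Sinatra', 'blue'))
    duplicate_rejected = False
except sqlite3.IntegrityError as err:
    duplicate_rejected = True
    print('Rejected:', err)

# INSERT OR IGNORE skips the duplicate quietly instead of raising.
cur.execute('INSERT OR IGNORE INTO Artist (name, eyes) VALUES (?, ?)',
            ('Frank Sinatra', 'blue'))

cur.execute('SELECT COUNT(*) FROM Artist')
artist_count = cur.fetchone()[0]
print('Rows in Artist:', artist_count)
conn.close()
```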

#### 15.11 Sample multi-table application

A sample application called tracks_csv.py shows how these ideas can be combined to parse textual data and load it into several tables using a proper data model with relational connections between the tables.

This application reads and parses a comma-separated file tracks.csv based on an export from Dr. Chuck’s iTunes library.

Another One Bites The Dust,Queen,Greatest Hits,55,100,217103  
Asche Zu Asche,Rammstein,Herzeleid,79,100,231810  
Beauty School Dropout,Various,Grease,48,100,239960  
Black Dog,Led Zeppelin,IV,109,100,296620  
...  

The columns in this file are: title, artist, album, number of plays, rating (0-100) and length in milliseconds.

Our data model is shown in Figure 15.6 and described in SQL as follows:

DROP TABLE IF EXISTS Artist;  
DROP TABLE IF EXISTS Album;  
DROP TABLE IF EXISTS Track;  

CREATE TABLE Artist (  
    id INTEGER PRIMARY KEY,  
    name TEXT UNIQUE  
);  

CREATE TABLE Album (  
    id INTEGER PRIMARY KEY,  
    artist_id INTEGER,  
    title TEXT UNIQUE  
);

CREATE TABLE Track (    
    id INTEGER PRIMARY KEY,  
    title TEXT UNIQUE,  
    album_id INTEGER,  
    len INTEGER, rating INTEGER, count INTEGER  
);  

We are adding the UNIQUE keyword to the TEXT columns where we would like a uniqueness constraint that we will rely on in INSERT OR IGNORE statements. This is more succinct than separate CREATE INDEX statements but has the same effect.

With these tables in place, we write the following code tracks_csv.py to parse the data and insert it into the tables:

In [None]:
import sqlite3

conn = sqlite3.connect('trackdb.sqlite')
cur = conn.cursor()

handle = open('tracks.csv')

for line in handle:
    line = line.strip()
    pieces = line.split(',')
    if len(pieces) != 6:
        continue
        
    name = pieces[0]
    artist = pieces[1]
    album = pieces[2]
    count = pieces[3]
    rating = pieces[4]
    length = pieces[5]
    
    print(name, artist, album, count, rating, length)
    
    cur.execute('''INSERT OR IGNORE INTO Artist (name) VALUES (?)''',
               (artist, ))
    cur.execute('SELECT id FROM Artist WHERE name = ? ', (artist, ))
    artist_id = cur.fetchone()[0]
    cur.execute('''INSERT OR IGNORE INTO Album (title, artist_id) VALUES (?, ?)''',
               (album, artist_id))
    cur.execute('SELECT id FROM Album WHERE title = ?', (album, ))
    album_id = cur.fetchone()[0]
    
    cur.execute('''INSERT OR REPLACE INTO Track (title, album_id, len, rating, count) VALUES (?, ?, ?, ?, ?)''',
               (name, album_id, length, rating, count))
    conn.commit()

You can see that we are repeating the pattern of INSERT OR IGNORE followed by a SELECT to get the appropriate artist_id and album_id for use in later INSERT statements. We start from Artist because we need artist_id to insert the Album and need the album_id to insert the Track.
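Because the INSERT OR IGNORE plus SELECT pattern repeats, it can be factored into small helper functions. The names `get_or_create_artist` and `get_or_create_album` below are hypothetical, introduced only for this sketch; tracks_csv.py itself writes the statements inline.

```python
import sqlite3

def get_or_create_artist(cur, name):
    """Return the id for this artist, inserting a row only if it is new."""
    cur.execute('INSERT OR IGNORE INTO Artist (name) VALUES (?)', (name,))
    cur.execute('SELECT id FROM Artist WHERE name = ?', (name,))
    return cur.fetchone()[0]

def get_or_create_album(cur, title, artist_id):
    """Return the id for this album, inserting a row only if it is new."""
    cur.execute('INSERT OR IGNORE INTO Album (title, artist_id) VALUES (?, ?)',
                (title, artist_id))
    cur.execute('SELECT id FROM Album WHERE title = ?', (title,))
    return cur.fetchone()[0]

# In-memory tables with the same columns as the data model above.
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE Artist (id INTEGER PRIMARY KEY, name TEXT UNIQUE)')
cur.execute('''CREATE TABLE Album (id INTEGER PRIMARY KEY,
               artist_id INTEGER, title TEXT UNIQUE)''')

queen = get_or_create_artist(cur, 'Queen')
queen_again = get_or_create_artist(cur, 'Queen')      # no second row created
zeppelin = get_or_create_artist(cur, 'Led Zeppelin')
hits = get_or_create_album(cur, 'Greatest Hits', queen)
conn.commit()
print(queen, queen_again, zeppelin, hits)
```

Asking for the same artist twice returns the same id, which is what lets the loader process a file where an artist appears on many lines.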

If we look at the Album table, we can see that the entries were added and assigned a primary key as necessary as the data was parsed. We can also see the foreign key pointing to a row in the Artist table for each Album row.

sqlite> .mode column  
sqlite> SELECT * FROM Album LIMIT 5; 

id artist_id title  
1  1           Greatest Hits  
2  2           Herzeleid  
3  3           Grease  
4  4           IV  
5  5           The Wall [Disc 2]  

We can reconstruct all of the Track data, following all the relations using JOIN / ON clauses. You can see both ends of each of the (2) relational connections in each row in the output below:
  
sqlite> .mode line  
sqlite> SELECT * FROM Track  
...> JOIN Album ON Track.album_id = Album.id  
...> JOIN Artist ON Album.artist_id = Artist.id  
...> LIMIT 2;  
       id = 1  
    title = Another One Bites The Dust  
 album_id = 1  
      len = 217103  
   rating = 100  
    count = 55  
       id = 1  
artist_id = 1  
    title = Greatest Hits  
       id = 1  
     name = Queen  
       id = 2  
    title = Asche Zu Asche  
 album_id = 2  
      len = 231810  
   rating = 100  
    count = 79  
       id = 2  
artist_id = 2  
    title = Herzeleid  
       id = 2  
     name = Rammstein  
        
This example shows three tables and two one-to-many relationships between the tables. It also shows how to use indexes and uniqueness constraints to programmatically construct the tables and their relationships.
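A SELECT * across joined tables returns several columns that share a name (three id columns and two title columns in the output above). A possible way to keep them apart in Python is to name the wanted columns with AS and read rows by name through `sqlite3.Row`. The aliases `track`, `album`, `artist`, and `plays` are assumptions chosen for this sketch, and the single row of data comes from the sample file shown earlier.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.row_factory = sqlite3.Row   # lets us index each row by column name
cur = conn.cursor()
cur.executescript('''
CREATE TABLE Artist (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE Album (id INTEGER PRIMARY KEY, artist_id INTEGER,
                    title TEXT UNIQUE);
CREATE TABLE Track (id INTEGER PRIMARY KEY, title TEXT UNIQUE,
                    album_id INTEGER, len INTEGER, rating INTEGER,
                    count INTEGER);
INSERT INTO Artist (id, name) VALUES (1, 'Queen');
INSERT INTO Album (id, artist_id, title) VALUES (1, 1, 'Greatest Hits');
INSERT INTO Track (id, title, album_id, len, rating, count)
    VALUES (1, 'Another One Bites The Dust', 1, 217103, 100, 55);
''')

# AS gives each title column its own name, so nothing is ambiguous.
cur.execute('''SELECT Track.title AS track, Album.title AS album,
                      Artist.name AS artist, Track.count AS plays
               FROM Track
               JOIN Album ON Track.album_id = Album.id
               JOIN Artist ON Album.artist_id = Artist.id''')
row = cur.fetchone()
summary = (row['track'], row['album'], row['artist'], row['plays'])
print(summary)
conn.close()
```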

https://en.wikipedia.org/wiki/One-to-many_(data_model)

Up next we will look at the many-to-many relationships in data models.

#### 15.12 Many to many relationship in databases

Some data relationships cannot be modeled by a simple one-to-many relationship. For example, let’s say we are going to build a data model for a course management system. There will be courses, users, and rosters. A user can be on the roster for many courses and a course will have many users on its roster.

It is pretty simple to draw a many-to-many relationship as shown in Figure 15.7. We simply draw two tables and connect them with a line that has the “many” indicator on both ends of the lines. The problem is how to implement the relationship using primary keys and foreign keys.

Before we explore how we implement many-to-many relationships, let’s see if we could hack something up by extending a one-to-many relationship.

If SQL supported the notion of arrays, we might try to define this:

CREATE TABLE Course(  
    id INTEGER PRIMARY KEY,  
    title TEXT UNIQUE,  
    student_ids ARRAY OF INTEGER  
);

Sadly, while this is a tempting idea, SQLite does not support arrays.  
Or we could just make a long string and concatenate all the User primary keys into it, separated by commas.

CREATE TABLE Course (  
    id INTEGER PRIMARY KEY,  
    title TEXT UNIQUE,  
    student_ids TEXT  
);

INSERT INTO Course (title, student_ids)  
VALUES( 'si311', '1,3,4,5,6,9,14');  

This would be very inefficient because as the course roster grows in size and the number of courses increases it becomes quite expensive to figure out which courses have student 14 on their roster.
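The cost of the comma-separated approach can be seen in a short sketch. The second course row ('si999' with ids 2, 114, and 140) is made up for contrast; only the 'si311' row comes from the text. A substring search with LIKE gives wrong answers, and the correct answer requires reading every row and splitting every string.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('''CREATE TABLE Course (id INTEGER PRIMARY KEY,
               title TEXT UNIQUE, student_ids TEXT)''')
cur.executemany('INSERT INTO Course (title, student_ids) VALUES (?, ?)',
                [('si311', '1,3,4,5,6,9,14'),   # row from the text
                 ('si999', '2,114,140')])       # made-up row for contrast

# Tempting but wrong: '114' and '140' also contain the characters '14'.
cur.execute('''SELECT title FROM Course
               WHERE student_ids LIKE '%14%' ORDER BY title''')
like_matches = [row[0] for row in cur.fetchall()]

# Correct, but every row must be fetched and every string split in Python.
cur.execute('SELECT title, student_ids FROM Course ORDER BY title')
exact_matches = [title for title, ids in cur.fetchall()
                 if '14' in ids.split(',')]

print(like_matches)
print(exact_matches)
conn.close()
```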

Instead of either of these approaches, we model a many-to-many relationship using an additional table that we call a “junction table”, “through table”, “connector table”, or “join table” as shown in Figure 15.8. The purpose of this table is to capture the connection between a course and a student.

In a sense the table sits between the Course and User table and has a one-to-many relationship to both tables. By using an intermediate table we break a many-to-many relationship into two one-to-many relationships. Databases are very good at modeling and processing one-to-many relationships.

An example Member table would be as follows:

CREATE TABLE User (  
    id INTEGER PRIMARY KEY,  
    name TEXT UNIQUE  
);

CREATE TABLE Course (   
    id INTEGER PRIMARY KEY,  
    title TEXT UNIQUE  
);

CREATE TABLE Member (  
    user_id INTEGER,  
    course_id INTEGER,  
    PRIMARY KEY (user_id, course_id)  
);

Following our naming convention, Member.user_id and Member.course_id are foreign keys pointing at the corresponding rows in the User and Course tables. Each entry in the Member table links a row in the User table to a row in the Course table.

We indicate that the combination of course_id and user_id is the PRIMARY KEY for the Member table, which also creates a uniqueness constraint for each course_id / user_id combination.

Now let’s say we need to insert a number of students into the rosters of a number of courses. Let’s assume the data comes to us in a JSON-formatted file with records like this:

[  
[ "Charley", "si110"],  
[ "Mea", "si110"],  
[ "Hattie", "si110"],  
[ "Keziah", "si110"],  
[ "Rosa", "si106"],  
[ "Mea", "si106"],  
[ "Mairin", "si106"],  
[ "Zendel", "si106"],  
[ "Honie", "si106"],  
[ "Rosa", "si106"],  
...  
]  

We could write code as follows to read the JSON file and insert the members of each course roster into the database:

In [None]:
import json
import sqlite3

conn = sqlite3.connect('rosterdb.sqlite')
cur = conn.cursor()

str_data = open('roster_data_sample.json').read()
json_data = json.loads(str_data)

for entry in json_data:
    name = entry[0]
    title = entry[1]
    
    print((name, title))
    
    cur.execute('''INSERT OR IGNORE INTO User (name)
    VALUES (?)''', (name, ))
    cur.execute('SELECT id FROM User WHERE name = ?', (name, ))
    user_id = cur.fetchone()[0]
    
    cur.execute('''INSERT OR IGNORE INTO Course (title)
    VALUES (?)''', (title, ))
    cur.execute('SELECT id FROM Course WHERE title = ?', (title, ))
    course_id = cur.fetchone()[0]
    
    cur.execute('''INSERT OR REPLACE INTO Member
    (user_id, course_id) VALUES (?, ?)''',
               (user_id, course_id))
    
    conn.commit()

Like in a previous example, we first make sure that we have an entry in the User table and know the primary key of the entry as well as an entry in the Course table and know its primary key. We use the ‘INSERT OR IGNORE’ and ‘SELECT’ pattern so our code works regardless of whether the record is in the table or not.

Our insert into the Member table simply inserts the two integers. Because of the PRIMARY KEY constraint, INSERT OR REPLACE either adds a new row or replaces the existing one, so we never end up with duplicate entries in the Member table for a particular user_id / course_id combination.

To reconstruct our data across all three tables, we again use JOIN / ON to construct a SELECT query:

sqlite> SELECT * FROM Course  
...> JOIN Member ON Course.id = Member.course_id  
...> JOIN User ON Member.user_id = User.id;  
+----+-------+---------+-----------+----+---------+  
| id | title | user_id | course_id | id | name |  
+----+-------+---------+-----------+----+---------+  
| 1 | si110 | 1 | 1 | 1 | Charley |  
| 1 | si110 | 2 | 1 | 2 | Mea |  
| 1 | si110 | 3 | 1 | 3 | Hattie |  
| 1 | si110 | 4 | 1 | 4 | Lyena |  
| 1 | si110 | 5 | 1 | 5 | Keziah |  
| 1 | si110 | 6 | 1 | 6 | Ellyce |  
| 1 | si110 | 7 | 1 | 7 | Thalia |  
| 1 | si110 | 8 | 1 | 8 | Meabh |  
| 2 | si106 | 2 | 2 | 2 | Mea |  
| 2 | si106 | 10 | 2 | 10 | Mairin |  
| 2 | si106 | 11 | 2 | 11 | Zendel |  
| 2 | si106 | 12 | 2 | 12 | Honie |  
| 2 | si106 | 9 | 2 | 9 | Rosa |  
+----+-------+---------+-----------+----+---------+  
sqlite>  

You can see the three tables from left to right - Course, Member, and User and you can see the connections between the primary keys and foreign keys in each row of output.
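The junction table can be traversed in either direction. The sketch below assumes an in-memory database and a four-entry subset of the roster data shown earlier; it asks which courses one user belongs to and which users one course has.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.executescript('''
CREATE TABLE User (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE Course (id INTEGER PRIMARY KEY, title TEXT UNIQUE);
CREATE TABLE Member (user_id INTEGER, course_id INTEGER,
                     PRIMARY KEY (user_id, course_id));
''')

# A small subset of the roster entries, enough to show both directions.
roster = [('Charley', 'si110'), ('Mea', 'si110'),
          ('Rosa', 'si106'), ('Mea', 'si106')]
for name, title in roster:
    cur.execute('INSERT OR IGNORE INTO User (name) VALUES (?)', (name,))
    cur.execute('SELECT id FROM User WHERE name = ?', (name,))
    user_id = cur.fetchone()[0]
    cur.execute('INSERT OR IGNORE INTO Course (title) VALUES (?)', (title,))
    cur.execute('SELECT id FROM Course WHERE title = ?', (title,))
    course_id = cur.fetchone()[0]
    cur.execute('''INSERT OR REPLACE INTO Member (user_id, course_id)
                   VALUES (?, ?)''', (user_id, course_id))
conn.commit()

# From a user to that user's courses.
cur.execute('''SELECT Course.title FROM User
               JOIN Member ON Member.user_id = User.id
               JOIN Course ON Member.course_id = Course.id
               WHERE User.name = ? ORDER BY Course.title''', ('Mea',))
mea_courses = [row[0] for row in cur.fetchall()]

# From a course to that course's users.
cur.execute('''SELECT User.name FROM Course
               JOIN Member ON Member.course_id = Course.id
               JOIN User ON Member.user_id = User.id
               WHERE Course.title = ? ORDER BY User.name''', ('si110',))
si110_users = [row[0] for row in cur.fetchall()]

print(mea_courses, si110_users)
conn.close()
```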

#### 15.13 Modeling data at the many-to-many connection

While we have presented the “join table” as having two foreign keys making a connection between rows in two tables, this is the simplest form of a join table. It is quite common to want to add some data to the connection itself.

Continuing with our example of users, courses, and rosters to model a simple learning management system, we will also need to understand the role that each user is assigned in each course.

If we first try to solve this by adding an “instructor” flag to the User table, we will find that this does not work because a user can be an instructor in one course and a student in another course. If we add an instructor_id to the Course table it will not work because a course can have multiple instructors. And there is no one-to-many hack that can deal with the fact that the number of roles will expand into roles like Teaching Assistant or Parent.

But if we simply add a role column to the Member table - we can represent a wide range of roles, role combinations, etc.

Let’s change our Member table as follows:

DROP TABLE Member;  
CREATE TABLE Member (  
user_id INTEGER,  
course_id INTEGER,  
role INTEGER,  
PRIMARY KEY (user_id, course_id)  
);  

For simplicity, we will decide that zero in the role column means “student” and one means “instructor”. Let’s assume our JSON data is augmented with the role as follows:

[  
[ "Charley", "si110", 1],  
[ "Mea", "si110", 0],  
[ "Hattie", "si110", 0],  
[ "Keziah", "si110", 0],  
[ "Rosa", "si106", 0], 
[ "Mea", "si106", 1],   
[ "Mairin", "si106", 0],  
[ "Zendel", "si106", 0],  
[ "Honie", "si106", 0],  
[ "Rosa", "si106", 0],  
...  
]

We could alter the roster.py program above to incorporate role as follows:

for entry in json_data:  
    name = entry[0]  
    title = entry[1]  
    role = entry[2]  
    ...

    cur.execute('''INSERT OR REPLACE INTO Member  
        (user_id, course_id, role) VALUES ( ?, ?, ? )''',  
        ( user_id, course_id, role ) )  

In a real system, we would probably build a Role table and make the role column in Member a foreign key into the Role table as follows:

DROP TABLE Member;  

CREATE TABLE Member (  
user_id INTEGER,   
course_id INTEGER, 
role_id INTEGER,   
PRIMARY KEY (user_id, course_id, role_id)  
);  

CREATE TABLE Role (  
id INTEGER PRIMARY KEY,  
name TEXT UNIQUE  
);  

INSERT INTO Role (id, name) VALUES (0, 'Student');  
INSERT INTO Role (id, name) VALUES (1, 'Instructor');  

Notice that because we declared the id column in the Role table as a PRIMARY KEY, we could omit it in the INSERT statement. But we can also choose the id value ourselves, as long as the value is not already in the id column and so does not violate the implied UNIQUE constraint on primary keys.
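With a Role table in place, one more JOIN turns the role_id back into a readable name. This is a sketch assuming an in-memory database; the explicit id values for User and Course and the three memberships are a hand-entered subset of the augmented JSON data above.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.executescript('''
CREATE TABLE User (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE Course (id INTEGER PRIMARY KEY, title TEXT UNIQUE);
CREATE TABLE Role (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE Member (user_id INTEGER, course_id INTEGER, role_id INTEGER,
                     PRIMARY KEY (user_id, course_id, role_id));

INSERT INTO Role (id, name) VALUES (0, 'Student');
INSERT INTO Role (id, name) VALUES (1, 'Instructor');
INSERT INTO User (id, name) VALUES (1, 'Charley');
INSERT INTO User (id, name) VALUES (2, 'Mea');
INSERT INTO Course (id, title) VALUES (1, 'si110');
INSERT INTO Course (id, title) VALUES (2, 'si106');

-- Charley teaches si110; Mea studies in si110 and teaches si106.
INSERT INTO Member (user_id, course_id, role_id) VALUES (1, 1, 1);
INSERT INTO Member (user_id, course_id, role_id) VALUES (2, 1, 0);
INSERT INTO Member (user_id, course_id, role_id) VALUES (2, 2, 1);
''')

cur.execute('''SELECT User.name, Course.title, Role.name
               FROM Member
               JOIN User ON Member.user_id = User.id
               JOIN Course ON Member.course_id = Course.id
               JOIN Role ON Member.role_id = Role.id
               ORDER BY Course.title, User.name''')
memberships = cur.fetchall()
for row in memberships:
    print(row)
conn.close()
```

The same user appears with different roles in different courses, which is the case the flag-on-User approach could not represent.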

#### 15.15 Debugging

One common pattern when you are developing a Python program to connect to an SQLite database will be to run a Python program and check the results using the Database Browser for SQLite. The browser allows you to quickly check to see if your program is working properly.

You must be careful because SQLite takes care to keep two programs from changing the same data at the same time. For example, if you open a database in the browser and make a change to the database and have not yet pressed the “save” button in the browser, the browser “locks” the database file and keeps any other program from accessing the file. In particular, your Python program will not be able to access the file if it is locked.

So a solution is to make sure to either close the database browser or use the File menu to close the database in the browser before you attempt to access the database from Python to avoid the problem of your Python code failing because the database is locked.
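The locking behavior can be reproduced without the browser. The sketch below is a simulation under stated assumptions: a second connection named `holder` stands in for the browser with an unsaved change, the file name locked_demo.sqlite is made up, and a short `timeout` is passed to `sqlite3.connect()` so the failure shows up quickly instead of after the default wait.

```python
import os
import shutil
import sqlite3
import tempfile

# A temporary file stands in for a database that another program has open.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, 'locked_demo.sqlite')

# isolation_level=None lets us issue BEGIN and COMMIT ourselves.
holder = sqlite3.connect(path, isolation_level=None)
holder.execute('CREATE TABLE Track (title TEXT, plays INTEGER)')
holder.execute('BEGIN EXCLUSIVE')   # simulate an unsaved change elsewhere

# timeout is the number of seconds to wait for a lock before giving up.
writer = sqlite3.connect(path, timeout=0.2)
try:
    writer.execute('INSERT INTO Track (title, plays) VALUES (?, ?)',
                   ('My Way', 15))
    writer.commit()
    was_locked = False
except sqlite3.OperationalError as err:
    was_locked = True
    print('Could not write:', err)
    writer.rollback()

holder.execute('COMMIT')            # the other program "saves" its change

# With the lock released, the same INSERT now succeeds.
writer.execute('INSERT INTO Track (title, plays) VALUES (?, ?)',
               ('My Way', 15))
writer.commit()
rows_after = writer.execute('SELECT COUNT(*) FROM Track').fetchone()[0]
print('Rows after retry:', rows_after)

holder.close()
writer.close()
shutil.rmtree(tmpdir, ignore_errors=True)
```

Catching `sqlite3.OperationalError` at least lets a program report the problem clearly rather than stopping with a traceback.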

## Chapter 16

### Visualizing data

So far we have been learning the Python language and then learning how to use Python, the network, and databases to manipulate data.

In this chapter, we take a look at three complete applications that bring all of these things together to manage and visualize data. You might use these applications as sample code to help get you started in solving a real-world problem.

Each of the applications is a ZIP file that you can download and extract onto your computer and execute.

#### 16.1 Building an OpenStreetMap from geocoded data

In this project, we are using the OpenStreetMap geocoding API to clean up some user-entered geographic locations of university names and then placing the data on an actual OpenStreetMap.

To get started, download the application from:  
www.py4e.com/code3/opengeo.zip

The first problem to solve is that these geocoding APIs are rate-limited to a certain number of requests per day. If you have a lot of data, you might need to stop and restart the lookup process several times. So we break the problem into two phases.

In the first phase we take our input “survey” data in the file where.data and read it one line at a time, and retrieve the geocoded information from the geocoding API and store it in a database geodata.sqlite. Before we use the geocoding API for each user-entered location, we simply check to see if we already have the data for that particular line of input. The database is functioning as a local “cache” of our geocoding data to make sure we never ask the API for the same data twice.

You can restart the process at any time by removing the file opengeo.sqlite.

Run the geoload.py program. This program will read the input lines in where.data and for each line check to see if it is already in the database. If we don’t have the data for the location, it will call the geocoding API to retrieve the data and store it in the database.

Here is a sample run after there is already some data in the database:

Found in database AGH University of Science and Technology  
Found in database Academy of Fine Arts Warsaw Poland  
Found in database American University in Cairo  
Found in database Arizona State University  
Found in database Athens Information Technology  
Retrieving https://py4e-data.dr-chuck.net/  
opengeo?q=BITS+Pilani  
Retrieved 794 characters {"type":"FeatureColl  
Retrieving https://py4e-data.dr-chuck.net/  
opengeo?q=Babcock+University  
Retrieved 760 characters {"type":"FeatureColl  
Retrieving https://py4e-data.dr-chuck.net/  
opengeo?q=Banaras+Hindu+University   
Retrieved 866 characters {"type":"FeatureColl  
...  

The first five locations are already in the database and so they are skipped. The program scans to the point where it finds new locations and starts retrieving them.

The geoload.py program can be stopped at any time, and there is a counter that you can use to limit the number of calls to the geocoding API for each run. Given that where.data only has a few hundred data items, you should not run into the daily rate limit, but if you had more data it might take several runs over several days to load all of the geocoded data for your input into the database.

Once you have some data loaded into opengeo.sqlite, you can visualize the data using the geodump.py program. This program reads the database and writes the file where.js with the location, latitude, and longitude in the form of executable JavaScript code.

A run of the geodump.py program is as follows:

AGH University of Science and Technology, Czarnowiejska,  
Czarna Wies, Krowodrza, Krakow, Lesser Poland  
Voivodeship, 31-126, Poland 50.0657 19.91895  

Academy of Fine Arts, Krakowskie Przedmiescie,  
Northern Srodmiescie, Srodmiescie, Warsaw, Masovian  
Voivodeship, 00-046, Poland 52.239 21.0155  
...  
260 lines were written to where.js  
Open the where.html file in a web browser to view the data.  

The file where.html consists of HTML and JavaScript to display an OpenStreetMap map. It reads the most recent data in where.js to get the data to be visualized. Here is the format of the where.js file:

myData = [  
[50.0657,19.91895,  
'AGH University of Science and Technology, Czarnowiejska,  
Czarna Wies, Krowodrza, Krakow, Lesser Poland  
Voivodeship, 31-126, Poland '],  
[52.239,21.0155,  
'Academy of Fine Arts, Krakowskie Przedmiescie,  
Srodmiescie Polnocne, Srodmiescie, Warsaw,  
Masovian Voivodeship, 00-046, Poland'],  
...  
];  

This is a JavaScript variable that contains a list of lists. The syntax for JavaScript list constants is very similar to Python, so the syntax should be familiar to you.
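To make the step from database to where.js concrete, here is a minimal sketch of how such a file could be produced. It is not the geodump.py source. It assumes the stored response is a GeoJSON `FeatureCollection` whose first feature carries the coordinates under `geometry` and a readable name under `properties.display_name`; the sample record, the in-memory database, and the variable names are made up for illustration.

```python
import json
import sqlite3

# In-memory stand-in for the database filled by geoload.py
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE Locations (address TEXT, geodata TEXT)')

# Hypothetical response, shaped like a GeoJSON FeatureCollection (assumed layout)
sample = {
    'type': 'FeatureCollection',
    'features': [{
        'properties': {'display_name': 'Example University, Example City'},
        'geometry': {'type': 'Point', 'coordinates': [19.91895, 50.0657]},
    }],
}
cur.execute('INSERT INTO Locations VALUES (?, ?)',
            ('Example University', json.dumps(sample)))
# A location the API could not find has an empty feature list
cur.execute('INSERT INTO Locations VALUES (?, ?)',
            ('Nowhere', json.dumps({'type': 'FeatureCollection', 'features': []})))

entries = []
for address, geodata in cur.execute('SELECT address, geodata FROM Locations'):
    js = json.loads(geodata)
    if not js.get('features'):
        continue  # nothing to plot for this input line
    feature = js['features'][0]
    lon, lat = feature['geometry']['coordinates'][:2]  # GeoJSON order is lon, lat
    # Drop single quotes so the name cannot end the JavaScript string early
    where = feature['properties']['display_name'].replace("'", '')
    entries.append("[%s,%s, '%s']" % (lat, lon, where))

# Assemble the JavaScript source as one string
where_js = 'myData = [\n' + ',\n'.join(entries) + '\n];\n'
print(where_js)
conn.close()
```

In a real run the string would be written to where.js rather than printed. Note that the latitude comes first in each output row even though GeoJSON stores the longitude first.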

Simply open where.html in a browser to see the locations. You can hover over each map pin to find the location that the geocoding API returned for the user-entered input. If you cannot see any data when you open the where.html file, you might want to check the JavaScript or developer console for your browser.

The geoload.py program from this application is shown here:

import urllib.request, urllib.parse, urllib.error
import http
import sqlite3
import json
import time
import ssl
import sys

# https://py4e-data.dr-chuck.net/opengeo?q=Ann+Arbor%2C+MI
serviceurl = 'https://py4e-data.dr-chuck.net/opengeo?'

# Additional detail for urllib
# http.client.HTTPConnection.debuglevel = 1

conn = sqlite3.connect('opengeo.sqlite')
cur = conn.cursor()

cur.execute('''CREATE TABLE IF NOT EXISTS Locations (address TEXT, geodata TEXT)''')

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

fh = open("where.data")
count = 0
nofound = 0
for line in fh:
    # Stop after 100 API calls in this run
    if count >= 100:
        print('Retrieved 100 locations, restart to retrieve more')
        break
    
    address = line.strip()
    print('')
    cur.execute("SELECT geodata FROM Locations WHERE address = ?",
               (memoryview(address.encode()), ))
    
    # Skip this line if the cache already has an answer for it
    row = cur.fetchone()
    if row is not None:
        print("Found in database", address)
        continue
    
    parms = dict()
    parms['q'] = address
    
    url = serviceurl + urllib.parse.urlencode(parms)
    
    print('Retrieving', url)
    uh = urllib.request.urlopen(url, context=ctx)
    data = uh.read().decode()
    print('Retrieved', len(data), 'characters', data[:20].replace('\n', ' '))
    count = count + 1
    
    try:
        js = json.loads(data)
    except json.JSONDecodeError:
        # Show the unparseable response and move on to the next line
        print(data)
        continue
        
    if not js or 'features' not in js:
        print('==== Download error ====')
        print(data)
        break
        
    if len(js['features']) == 0:
        print('==== Object not found ====')
        nofound = nofound + 1
        
    cur.execute('''INSERT INTO Locations (address, geodata) VALUES (?, ?)''',
               (memoryview(address.encode()), memoryview(data.encode())))
    
    conn.commit()
    
    if count % 10 == 0:
        print('Pausing for a bit...')
        time.sleep(5)
        
if nofound > 0:
    print('Number of features for which the location could not be found:', nofound)
    
print("Run geodump.py to read the data from the database so you can visualize it on a map.")

#### 16.2 Visualizing networks and interconnections

In this application, we will perform some of the functions of a search engine. We will first spider a small subset of the web and run a simplified version of the Google page rank algorithm to determine which pages are most highly connected, and then visualize the page rank and connectivity of our small corner of the web. We will use the D3 JavaScript visualization library http://d3js.org/ to produce the visualization output.

You can download and extract this application from:  
www.py4e.com/code3/pagerank.zip

The first program (spider.py) crawls a web site and pulls a series of pages into the database (spider.sqlite), recording the links between pages. You can restart the process at any time by removing the spider.sqlite file and rerunning spider.py.

Enter web url or enter: http://www.dr-chuck.com/  
['http://www.dr-chuck.com']    
How many pages:2   
1 http://www.dr-chuck.com/ 12    
2 http://www.dr-chuck.com/csev-blog/ 57     
How many pages:    

In this sample run, we told it to crawl a website and retrieve two pages. If you restart the program and tell it to crawl more pages, it will not re-crawl any pages already in the database. Upon restart it goes to a random non-crawled page and starts there. So each successive run of spider.py is additive.

Enter web url or enter: http://www.dr-chuck.com/  
['http://www.dr-chuck.com']   
How many pages:3   
3 http://www.dr-chuck.com/csev-blog 57  
4 http://www.dr-chuck.com/dr-chuck/resume/speaking.htm 1  
5 http://www.dr-chuck.com/dr-chuck/resume/index.htm 13  
How many pages:  

You can have multiple starting points in the same database; within the program, these are called “webs”. The spider chooses randomly amongst all non-visited links across all the webs as the next page to spider.

If you want to dump the contents of the spider.sqlite file, you can run spdump.py as follows:

(5, None, 1.0, 3, 'http://www.dr-chuck.com/csev-blog')  
(3, None, 1.0, 4, 'http://www.dr-chuck.com/dr-chuck/resume/speaking.htm')  
(1, None, 1.0, 2, 'http://www.dr-chuck.com/csev-blog/')  
(1, None, 1.0, 5, 'http://www.dr-chuck.com/dr-chuck/resume/index.htm')  
4 rows.  

This shows the number of incoming links, the old page rank, the new page rank, the id of the page, and the url of the page. The spdump.py program only shows pages that have at least one incoming link to them.
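A dump like this can be produced with a single query that joins the two tables and counts the links pointing at each page. The sketch below is an illustration under assumptions, not the spdump.py source: it reuses the `Pages` and `Links` table layout from spider.py, runs against an in-memory database, and fills it with made-up pages.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('''CREATE TABLE Pages
    (id INTEGER PRIMARY KEY, url TEXT UNIQUE, html TEXT,
     error INTEGER, old_rank REAL, new_rank REAL)''')
cur.execute('''CREATE TABLE Links
    (from_id INTEGER, to_id INTEGER, UNIQUE(from_id, to_id))''')

# Made-up pages and links for illustration
cur.executemany('INSERT INTO Pages (id, url, new_rank) VALUES (?, ?, 1.0)',
                [(1, 'http://example.com'),
                 (2, 'http://example.com/a'),
                 (3, 'http://example.com/b')])
cur.executemany('INSERT INTO Links (from_id, to_id) VALUES (?, ?)',
                [(1, 2), (1, 3), (2, 3)])

# Count incoming links per page; the join drops pages nobody links to
cur.execute('''SELECT COUNT(from_id) AS inbound, old_rank, new_rank, id, url
    FROM Pages JOIN Links ON Pages.id = Links.to_id
    GROUP BY id ORDER BY inbound DESC''')
rows = cur.fetchall()
for row in rows:
    print(row)
print(len(rows), 'rows.')
conn.close()
```

Because the join only keeps pages that appear as a link target, page 1 in this example is left out, which matches the behavior described above for pages without incoming links.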

Once you have a few pages in the database, you can run page rank on the pages using the sprank.py program. You simply tell it how many page rank iterations to run.

How many iterations:2  
1 0.546848992536  
2 0.226714939664  
[(1, 0.559), (2, 0.659), (3, 0.985), (4, 2.135), (5, 0.659)]  

You can dump the database again to see that page rank has been updated:

(5, 1.0, 0.985, 3, 'http://www.dr-chuck.com/csev-blog')  
(3, 1.0, 2.135, 4, 'http://www.dr-chuck.com/dr-chuck/resume/speaking.htm')  
(1, 1.0, 0.659, 2, 'http://www.dr-chuck.com/csev-blog/')  
(1, 1.0, 0.659, 5, 'http://www.dr-chuck.com/dr-chuck/resume/index.htm')  
4 rows.  

You can run sprank.py as many times as you like and it will simply refine the page rank each time you run it. You can even run sprank.py a few times and then go spider a few more pages with spider.py and then run sprank.py to reconverge the page rank values. A search engine usually runs both the crawling and ranking programs all the time.

If you want to restart the page rank calculations without respidering the web pages, you can use spreset.py and then restart sprank.py.

How many iterations:50  
1 0.546848992536  
2 0.226714939664  
3 0.0659516187242  
4 0.0244199333  
5 0.0102096489546   
6 0.00610244329379  
...  
42 0.000109076928206  
43 9.91987599002e-05  
44 9.02151706798e-05  
45 8.20451504471e-05  
46 7.46150183837e-05  
47 6.7857770908e-05  
48 6.17124694224e-05  
49 5.61236959327e-05  
50 5.10410499467e-05  
[(512, 0.0296), (1, 12.79), (2, 28.93), (3, 6.808), (4, 13.46)]

For each iteration of the page rank algorithm it prints the average change in page rank per page. The network initially is quite unbalanced and so the individual page rank values change wildly between iterations. But in a few short iterations, the page rank converges. You should run sprank.py long enough that the page rank values converge.
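The idea of iterating until the values settle can be sketched in a few lines. This is a simplified illustration and not the code in sprank.py: the four-page link graph is invented, the function name `rank_iteration` is hypothetical, and spreading the rank of dead-end pages evenly is just one simple way to keep the total constant.

```python
def rank_iteration(links, ranks):
    """One pass: each page splits its rank equally among the pages it links to."""
    new_ranks = {page: 0.0 for page in ranks}
    for page, targets in links.items():
        if not targets:
            continue
        share = ranks[page] / len(targets)
        for target in targets:
            new_ranks[target] += share
    # Rank held by pages with no outgoing links would vanish; spread it evenly
    lost = (sum(ranks.values()) - sum(new_ranks.values())) / len(ranks)
    for page in new_ranks:
        new_ranks[page] += lost
    return new_ranks

# Invented graph: page -> pages it links to
links = {1: [2, 3], 2: [3], 3: [1], 4: [3]}
ranks = {page: 1.0 for page in links}

for i in range(1, 51):
    new_ranks = rank_iteration(links, ranks)
    # Average change per page, the convergence measure printed each iteration
    avg_change = sum(abs(new_ranks[p] - ranks[p]) for p in ranks) / len(ranks)
    ranks = new_ranks
    print(i, avg_change)

print(sorted(ranks.items()))
```

The printed average change shrinks from one iteration to the next, which is the signal that the ranks are converging. Page 4 ends up with essentially no rank because nothing links to it, while the total rank across all pages stays the same.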

If you want to visualize the current top pages in terms of page rank, run spjson.py to read the database and write the data for the most highly linked pages in JSON format to be viewed in a web browser.

Creating JSON output on spider.json...  
How many nodes? 30  
Open force.html in a browser to view the visualization  

You can view this data by opening the file force.html in your web browser. This shows an automatic layout of the nodes and links. You can click and drag any node and you can also double-click on a node to find the URL that is represented by the node.

If you rerun the other utilities, rerun spjson.py and press refresh in the browser to get the new data from spider.json.
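The kind of data a node-and-link visualization needs can be sketched as follows. This is an assumption-laden illustration rather than the spjson.py source: the key names (`nodes`, `links`, `source`, `target`, `id`, `url`, `rank`) and the sample pages are made up, and the file that force.html actually reads may be laid out differently.

```python
import json

# Made-up (id, url, rank) rows and (from_id, to_id) links
pages = [(1, 'http://example.com', 2.1),
         (2, 'http://example.com/a', 0.6),
         (3, 'http://example.com/b', 0.3)]
links = [(1, 2), (2, 1), (3, 1), (3, 9)]   # page 9 is not among the pages

how_many = 2   # plays the role of the "How many nodes?" answer

# Keep only the highest-ranked pages
top = sorted(pages, key=lambda p: p[2], reverse=True)[:how_many]

# Map database ids to positions in the node list
index = {page_id: i for i, (page_id, _, _) in enumerate(top)}

graph = {
    'nodes': [{'id': pid, 'url': url, 'rank': rank} for pid, url, rank in top],
    # Drop links whose endpoints were not selected as nodes
    'links': [{'source': index[a], 'target': index[b]}
              for a, b in links if a in index and b in index],
}

text = json.dumps(graph, indent=2)
print(text)
```

Links are expressed as positions in the node list rather than database ids, so any link that touches a page outside the selected set has to be filtered out first.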

The spider.py program from this application is shown here:

import sqlite3
import urllib.error
import ssl
from urllib.parse import urljoin
from urllib.parse import urlparse
from urllib.request import urlopen
from bs4 import BeautifulSoup


# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

conn = sqlite3.connect('spider.sqlite')
cur = conn.cursor()

cur.execute('''CREATE TABLE IF NOT EXISTS Pages
    (id INTEGER PRIMARY KEY, url TEXT UNIQUE, html TEXT,
     error INTEGER, old_rank REAL, new_rank REAL)''')

cur.execute('''CREATE TABLE IF NOT EXISTS Links
    (from_id INTEGER, to_id INTEGER, UNIQUE(from_id, to_id))''')

cur.execute('''CREATE TABLE IF NOT EXISTS Webs (url TEXT UNIQUE)''')

# Check to see if we are already in progress...
cur.execute('SELECT id,url FROM Pages WHERE html is NULL and error is NULL ORDER BY RANDOM() LIMIT 1')
row = cur.fetchone()
if row is not None:
    print("Restarting existing crawl.  Remove spider.sqlite to start a fresh crawl.")
else :
    starturl = input('Enter web url or enter: ')
    if ( len(starturl) < 1 ) : starturl = 'http://www.dr-chuck.com/'
    if ( starturl.endswith('/') ) : starturl = starturl[:-1]
    web = starturl
    if ( starturl.endswith('.htm') or starturl.endswith('.html') ) :
        pos = starturl.rfind('/')
        web = starturl[:pos]

    if ( len(web) > 1 ) :
        cur.execute('INSERT OR IGNORE INTO Webs (url) VALUES ( ? )', ( web, ) )
        cur.execute('INSERT OR IGNORE INTO Pages (url, html, new_rank) VALUES ( ?, NULL, 1.0 )', ( starturl, ) )
        conn.commit()

# Get the current webs
cur.execute('''SELECT url FROM Webs''')
webs = list()
for row in cur:
    webs.append(str(row[0]))

print(webs)

many = 0
while True:
    if ( many < 1 ) :
        sval = input('How many pages:')
        if ( len(sval) < 1 ) : break
        many = int(sval)
    many = many - 1

    cur.execute('SELECT id,url FROM Pages WHERE html is NULL and error is NULL ORDER BY RANDOM() LIMIT 1')
    row = cur.fetchone()
    if row is None:
        print('No unretrieved HTML pages found')
        many = 0
        break
    fromid = row[0]
    url = row[1]

    print(fromid, url, end=' ')

    # If we are retrieving this page, there should be no links from it
    cur.execute('DELETE from Links WHERE from_id=?', (fromid, ) )
    try:
        document = urlopen(url, context=ctx)

        html = document.read()
        if document.getcode() != 200 :
            print("Error on page: ",document.getcode())
            cur.execute('UPDATE Pages SET error=? WHERE url=?', (document.getcode(), url) )

        if 'text/html' != document.info().get_content_type() :
            print("Ignore non text/html page")
            cur.execute('DELETE FROM Pages WHERE url=?', ( url, ) )
            conn.commit()
            continue

        print('('+str(len(html))+')', end=' ')

        soup = BeautifulSoup(html, "html.parser")
    except KeyboardInterrupt:
        print('')
        print('Program interrupted by user...')
        break
    except:
        print("Unable to retrieve or parse page")
        cur.execute('UPDATE Pages SET error=-1 WHERE url=?', (url, ) )
        conn.commit()
        continue

    cur.execute('INSERT OR IGNORE INTO Pages (url, html, new_rank) VALUES ( ?, NULL, 1.0 )', ( url, ) )
    cur.execute('UPDATE Pages SET html=? WHERE url=?', (memoryview(html), url ) )
    conn.commit()

    # Retrieve all of the anchor tags
    tags = soup('a')
    count = 0
    for tag in tags:
        href = tag.get('href', None)
        if ( href is None ) : continue
        # Resolve relative references like href="/contact"
        up = urlparse(href)
        if ( len(up.scheme) < 1 ) :
            href = urljoin(url, href)
        ipos = href.find('#')
        if ( ipos > 1 ) : href = href[:ipos]
        if ( href.endswith('.png') or href.endswith('.jpg') or href.endswith('.gif') ) : continue
        if ( href.endswith('/') ) : href = href[:-1]
        # print href
        if ( len(href) < 1 ) : continue

        # Check if the URL is in any of the webs
        found = False
        for web in webs:
            if ( href.startswith(web) ) :
                found = True
                break
        if not found : continue

        cur.execute('INSERT OR IGNORE INTO Pages (url, html, new_rank) VALUES ( ?, NULL, 1.0 )', ( href, ) )
        count = count + 1
        conn.commit()

        cur.execute('SELECT id FROM Pages WHERE url=? LIMIT 1', ( href, ))
        row = cur.fetchone()
        if row is None:
            print('Could not retrieve id')
            continue
        toid = row[0]
        # print fromid, toid
        cur.execute('INSERT OR IGNORE INTO Links (from_id, to_id) VALUES ( ?, ? )', ( fromid, toid ) )


    print(count)

# Save any links inserted since the last commit, then clean up
conn.commit()
cur.close()
conn.close()
