
## CIS189 Module \#14
---
Author: James D. Triveri


<br>

### **Python Standard Library Quick Hits**

In this class, we've worked a bit with the Python standard library, but there are many modules we haven't had a chance to explore. This week, we'll discuss a few that may be useful if you keep using Python beyond this class, which you definitely should!

The full list of standard library packages is available here: 

- [Python Module Index](https://docs.python.org/3/py-modindex.html)


We will be discussing four modules:

- **pickle**: Serialize Python objects
- **hashlib**: Generate hashes 
- **difflib**: Quantify differences between sequences
- **re**: Regular expressions




<br>

### **[1. pickle](https://docs.python.org/3/library/pickle.html#module-pickle)**

Python's pickle module is used for serializing and deserializing objects. Serialization, often referred to as "pickling," involves converting a Python object into a byte stream, and deserialization, known as "unpickling," is the inverse operation, converting a byte stream back into a Python object.

This allows us to save most Python objects (dicts, lists, tuples, sets) to a file, which can later be loaded into a separate Python session, either by us or by someone else.

Reasons to serialize objects:

- Saving complicated data types to a file, so that programs can resume where they left off (persistence).
- Sending Python data over a TCP/IP connection between processes or machines (data interchange).
- Caching or saving state between program executions.
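
Serialization does not have to involve a file. As a minimal sketch of the round trip described above, `pickle.dumps` and `pickle.loads` convert an object to a `bytes` value and back entirely in memory. The `record` dictionary is a made-up example, not course data.

```python
import pickle

# Hypothetical nested object, used only for illustration.
record = {"artist": "Neil Young", "year": 1979, "tracks": ["Thrasher", "Powderfinger"]}

# Serialize to a bytes object in memory (no file involved).
payload = pickle.dumps(record, protocol=pickle.HIGHEST_PROTOCOL)

# Deserialize the bytes back into a new object with the same contents.
restored = pickle.loads(payload)

print(type(payload))        # <class 'bytes'>
print(restored == record)   # True: same contents
print(restored is record)   # False: a separate copy
```

Only unpickle data you trust: the pickle documentation warns that loading a maliciously crafted pickle can execute arbitrary code.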

<br>

In the next cell, we create a list and write it to a pickle file (which conventionally has a .pkl extension):



In [9]:

import pickle

# Create an arbitrary list.
vals = ["hey", "hey", "my", "my"]


# Store vals list in a .pkl file in current working directory. 
with open("vals-list.pkl", "wb") as fpkl:
    pickle.dump(vals, fpkl, pickle.HIGHEST_PROTOCOL)




<br>

Assume we want to load the `vals` list into a new Python session, or allow someone else to access it. Here's how to load a serialized object back into Python using pickle:


In [10]:
del vals


In [11]:

with open("vals-list.pkl", "rb") as fpkl:
    vals = pickle.load(fpkl)

vals


['hey', 'hey', 'my', 'my']

In [12]:

# Add more elements to vals, and re-serialize.
vals = vals + ["rock n roll", "can", "never", "die"]

# Serialize updated vals list.
with open("vals-list.pkl", "wb") as fpkl:
    pickle.dump(vals, fpkl, pickle.HIGHEST_PROTOCOL)



<br>

### **Checkpoint \#1**

The `daily` list contains the date, low temperature, high temperature, time of sunrise and time of sunset for each day of a given week in Des Moines. Create a dictionary `dtemps`, keyed by date, where each key points to a dictionary of the remaining items for that day. For example, the first two entries of `dtemps` should look like:

```python
{
    "2024-07-08": {"low": 63, "high": 82, "sunrise": "05:48", "sunset": "20:50"},
    "2024-07-09": {"low": 67, "high": 85, "sunrise": "05:49", "sunset": "20:49"},
    .
    .
    .
}
```


In [14]:

daily = [
    ["2024-07-08", 63, 82, "05:48", "20:50"], # Monday
    ["2024-07-09", 67, 85, "05:49", "20:49"], # Tuesday
    ["2024-07-10", 66, 82, "05:50", "20:48"], # Wednesday
    ["2024-07-11", 65, 83, "05:50", "20:48"], # Thursday
    ["2024-07-12", 67, 86, "05:51", "20:48"], # Friday
    ]

##### YOUR CODE HERE #####


In [15]:

dtemps = {}

for ll in daily:
    dtemps[ll[0]] = {
        "low": ll[1], "high": ll[2], "sunrise": ll[3], "sunset": ll[4]
        }

dtemps


{'2024-07-08': {'low': 63, 'high': 82, 'sunrise': '05:48', 'sunset': '20:50'},
 '2024-07-09': {'low': 67, 'high': 85, 'sunrise': '05:49', 'sunset': '20:49'},
 '2024-07-10': {'low': 66, 'high': 82, 'sunrise': '05:50', 'sunset': '20:48'},
 '2024-07-11': {'low': 65, 'high': 83, 'sunrise': '05:50', 'sunset': '20:48'},
 '2024-07-12': {'low': 67, 'high': 86, 'sunrise': '05:51', 'sunset': '20:48'}}


<br>

Save `dtemps` to your current working directory as `dtemps.pkl`.


In [None]:

##### YOUR CODE HERE #####


In [16]:

with open("dtemps.pkl", "wb") as fpkl:
    pickle.dump(dtemps, fpkl, pickle.HIGHEST_PROTOCOL)
    

<br>

Two additional days of data have come in. Update your serialized dictionary to include data from 2024-07-13 and 2024-07-14, and write the updated dictionary back to file. Specifically:

1. Load in *dtemps.pkl* from file. 
2. Add the additional data the same way as before.
3. Rewrite the updated dictionary back to *dtemps.pkl*. 

In [None]:

weekend = [
    ["2024-07-13", 74, 91, "05:52", "20:47"], # Saturday
    ["2024-07-14", 79, 92, "05:53", "20:47"], # Sunday
    ]

##### YOUR CODE HERE #####


In [None]:

with open("dtemps.pkl", "rb") as fpkl:
    dtemps = pickle.load(fpkl)

# Add Saturday and Sunday.
for ll in weekend:
    dtemps[ll[0]] = {
        "low": ll[1], "high": ll[2], "sunrise": ll[3], "sunset": ll[4]
        }

# Export dtemps to dtemps.pkl.
with open("dtemps.pkl", "wb") as fpkl:
    pickle.dump(dtemps, fpkl, pickle.HIGHEST_PROTOCOL)



<br>

### **[2. hashlib](https://docs.python.org/3/library/hashlib.html#module-hashlib)**

Cryptographic hash functions are one-way functions that take input data of any size and produce a fixed-size sequence of bytes, typically referred to as a hash code, hash digest, or hash.


**Hash function characteristics:**

- The same input will always produce the same output.

- A tiny change to the input data (like changing one bit) should change the hash so extensively that the new hash appears uncorrelated with the old hash.

- It should be difficult to find two different inputs that produce the same output hash (collision resistance).
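
The first two characteristics can be checked directly. The sketch below is an illustration with arbitrary inputs chosen for this example: it hashes a 1-byte message, a 1,000,000-byte message, and a message that differs from the first by a couple of bits, then compares the digests.

```python
import hashlib

short_msg = b"a"
long_msg = b"a" * 1_000_000
flipped = b"b"  # b"a" is 0x61 and b"b" is 0x62: they differ in only two bits

d_short = hashlib.sha256(short_msg).hexdigest()
d_long = hashlib.sha256(long_msg).hexdigest()
d_flip = hashlib.sha256(flipped).hexdigest()

# Same input, same output.
print(d_short == hashlib.sha256(short_msg).hexdigest())

# Fixed-size output: SHA-256 always yields 256 bits (64 hex characters).
print(len(d_short), len(d_long))

# Count the hex positions at which the two digests disagree.
n_diff = sum(c1 != c2 for c1, c2 in zip(d_short, d_flip))
print(f"{n_diff} of {len(d_short)} hex characters differ")
```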

<br>

**Common hash functions:**

- **MD5**: Once widely used, now considered insecure due to vulnerabilities allowing for collision attacks.
- **SHA-1**: Also no longer considered secure against well-funded attackers.
- **SHA-256** and **SHA-3**: SHA-256 belongs to the SHA-2 family and SHA-3 is a separate family; both are currently recommended because of their resistance to known attack methods.


**Uses:**

- Quick comparison of arbitrary objects (image comparison)
- Data integrity
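- Storing passwords (in practice with a per-user salt and a deliberately slow key-derivation function, rather than a single bare hash)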

 
 
**Quick Introduction:**

- [Password Hashing, Salts, Peppers](https://www.youtube.com/watch?v=--tnZMuoK3E&list=PLr4vqfPMMQf6ExJe8kPTVARhNwGs197Sq&index=8)
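
To make the password use case concrete, here is a minimal sketch using `hashlib.pbkdf2_hmac`, which combines a salt with many hash iterations. The helper names `hash_password` and `verify_password`, the 16-byte salt, and the iteration count are assumptions chosen for illustration, not recommended production settings.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # illustrative value only; real systems tune this


def hash_password(password, salt=None):
    """Return (salt, derived_key) for password using PBKDF2-HMAC-SHA256."""
    if salt is None:
        salt = os.urandom(16)  # fresh random salt for each password
    key = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, key


def verify_password(password, salt, expected_key):
    """Re-derive the key with the stored salt and compare in constant time."""
    _, key = hash_password(password, salt)
    return hmac.compare_digest(key, expected_key)


salt, key = hash_password("correct horse battery staple")
print(verify_password("correct horse battery staple", salt, key))  # True
print(verify_password("wrong guess", salt, key))                   # False
```

Only the salt and the derived key would be stored; the password itself is never saved.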


<br>

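The set of available hashing algorithms depends on your platform and on the OpenSSL library your Python build is linked against (`hashlib.algorithms_guaranteed` lists the ones available everywhere). To list the algorithms available in your interpreter: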

In [17]:

import hashlib

# List available algorithms on your computer.
hashlib.algorithms_available


{'blake2b',
 'blake2s',
 'md5',
 'md5-sha1',
 'ripemd160',
 'sha1',
 'sha224',
 'sha256',
 'sha384',
 'sha3_224',
 'sha3_256',
 'sha3_384',
 'sha3_512',
 'sha512',
 'sha512_224',
 'sha512_256',
 'shake_128',
 'shake_256',
 'sm3'}


<br>

Example creating a hash from an arbitrary string:

In [18]:

# String to hash.
s = "Sittin' in the shade and shoveling with a spade"

# Create digest.
digest1 = hashlib.sha256(s.encode("utf-8")).hexdigest()

digest1

'd0b2c52076e174766898f76a3e15bc47f329f15aef3a376c2c74f53e8425706f'


<br>

How does the digest change if I add a trailing space?

In [19]:

s2 = "Sittin' in the shade and shoveling with a spade "

digest2 = hashlib.sha256(s2.encode("utf-8")).hexdigest()

print(f"digest1: {digest1}")
print(f"digest2: {digest2}")


digest1: d0b2c52076e174766898f76a3e15bc47f329f15aef3a376c2c74f53e8425706f
digest2: f253cadd1fbf55889e1836b7b18f4d4e14d37be3b4a306cd4142bced9c726a54



<br>

Digests can be created from virtually any type of file. Here is an alternate cover of the Allman Brothers classic *At Fillmore East* album:

![](https://popspotsnyc.com/ALLMAN_BROTHERS_FILLMORE_EAST/allman_bros_fillmore_east.jpg)



<br>

To create a hash from *fillmore-east.jpg* (available under Module 14 In-Class Materials in Canvas), run the following:

In [20]:

# Update your path to download directory.
img_path = "misc/fillmore-east.jpg"

with open(img_path, "rb") as f:
    abb_digest = hashlib.sha256(f.read()).hexdigest()

# Or as a one-liner:  
# abb_digest = hashlib.sha256(open(img_path, "rb").read()).hexdigest()

abb_digest


'86edc04e53b51e193563aa83b9d52557e963c0f4a9ddb2c054c16b11e3b295e4'


<br>

Hashes are commonly used for data integrity checks. If even a single byte of a file changes, the resulting digest is completely different.
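
Reading a whole file with `f.read()` is fine for an image, but it loads everything into memory at once. For large files, a common alternative is to feed the hash object in chunks with `update()`. The sketch below assumes a hypothetical helper name, `sha256_of_file`, and an arbitrary 64 KiB chunk size, and checks the result against a one-shot digest of the same bytes using a temporary file.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, reading it chunk by chunk."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read() until it returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


# Demonstrate on a throwaway file so the example is self-contained.
data = b"x" * 200_000
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(data)
    tmp_path = tmp.name

print(sha256_of_file(tmp_path) == hashlib.sha256(data).hexdigest())  # True
os.remove(tmp_path)
```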

<br>




### **Checkpoint \#2: Data Integrity Check**

On Python.org, if we navigate to the [Python 3.12.4 installer page](https://www.python.org/downloads/release/python-3124/), we see that each file has an associated MD5 checksum. Perform the following:

1. Download the *Windows installer (64-bit)*, or the macOS installer if you are not on Windows. 

2. Copy the provided MD5 checksum from Python.org, and save it to a variable, something like `h1`. 

3. Generate the appropriate hash from the downloaded Python installer (this should be no different from the example above, where we created a hash from *fillmore-east.jpg*). Assign this hash to another variable, `h2`. 

4. Verify that `h1` and `h2` are identical.


In [None]:

##### YOUR CODE HERE #####


In [21]:

# MD5 hash taken from Python.org.
h1 = "f3df1be26cc7cbd8252ab5632b62d740" 

# Path to downloaded installer.
p = "C:\\Users\\jtriv\\Downloads\\python-3.12.4-amd64.exe"

# Create hash from downloaded installer.
with open(p, "rb") as f:
    h2 = hashlib.md5(f.read()).hexdigest()

print(f"h1: {h1}")
print(f"h2: {h2}")
print(f"h1==h2?: {h1 == h2}")



h1: f3df1be26cc7cbd8252ab5632b62d740
h2: f3df1be26cc7cbd8252ab5632b62d740
h1==h2?: True



<br>


### **Checkpoint \#3: Pickling your hashes (not a euphemism)**


1. Identify the hashing algorithms available on your computer (these should be the same for most Windows machines, but may differ on Linux/macOS). Choose 3, and create a separate digest of the Gettysburg Address with each. 

2. Create a dictionary whose keys are the names of the algorithms you chose (as strings) and whose values are the corresponding digests. For example, if I picked sha1, sha224 and sha384, my dictionary would look like the following:

```
d = {
    "sha1": "whatever the sha1 digest turns out to be",
    "sha224": "whatever the sha224 digest turns out to be",
    "sha384": "whatever the sha384 digest turns out to be"
}
```

3. Serialize the dictionary to a .pkl file named *my-digests.pkl*.



In [None]:

gettysburg = """
Four score and seven years ago our fathers brought forth on this continent, a 
new nation, conceived in Liberty, and dedicated to the proposition that all men 
are created equal. Now we are engaged in a great civil war, testing whether that 
nation, or any nation so conceived and so dedicated, can long endure. We are met 
on a great battle-field of that war. We have come to dedicate a portion of that 
field, as a final resting place for those who here gave their lives that that 
nation might live. It is altogether fitting and proper that we should do this. 
But, in a larger sense, we can not dedicate -- we can not consecrate we can not 
hallow this ground. The brave men, living and dead, who struggled here, have 
consecrated it, far above our poor power to add or detract. The world will 
little note, nor long remember what we say here, but it can never forget what 
they did here. It is for us the living, rather, to be dedicated here to the 
unfinished work which they who fought here have thus far so nobly advanced. 
It is rather for us to be here dedicated to the great task remaining before us 
that from these honored dead we take increased devotion to that cause for which 
they gave the last full measure of devotion -- that we here highly resolve that 
these dead shall not have died in vain -- that this nation, under God, shall 
have a new birth of freedom -- and that government of the people, by the people, 
for the people, shall not perish from the earth.
"""

##### YOUR CODE HERE #####


In [None]:

d = {
    "md5": hashlib.md5(gettysburg.encode("utf-8")).hexdigest(),
    "sha384": hashlib.sha384(gettysburg.encode("utf-8")).hexdigest(),
    "sha512": hashlib.sha512(gettysburg.encode("utf-8")).hexdigest(),
}


with open("my-digests.pkl", "wb") as fpkl:
    pickle.dump(d, fpkl, pickle.HIGHEST_PROTOCOL)


<br>


### **[3. difflib](https://docs.python.org/3/library/difflib.html#module-difflib)**


difflib provides tools for computing and working with differences between sequences. It includes functions and classes for comparing sequences, finding differences, and generating human-readable differences or patches.


Imagine we have two strings and want to quantify their similarity. We can leverage `difflib.SequenceMatcher`:


In [22]:

import difflib

s1 = "either renew oneself or perish"
s2 = "either renew thyself or perish"

# Compute similarity score of s1 and s2. 
sim_score = difflib.SequenceMatcher(None, s1, s2).ratio()

print(f"Similarity score of s1 and s2: {sim_score:.2%}")


Similarity score of s1 and s2: 90.00%
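
Beyond a single similarity score, difflib can also produce the human-readable differences mentioned above. As a small sketch, `difflib.unified_diff` compares two lists of lines and yields patch-style output. The file labels `old.txt` and `new.txt` are placeholders; no files are read.

```python
import difflib

old_lines = ["either renew oneself", "or perish"]
new_lines = ["either renew thyself", "or perish"]

# lineterm="" because our lines carry no trailing newline characters.
diff_lines = list(
    difflib.unified_diff(
        old_lines, new_lines, fromfile="old.txt", tofile="new.txt", lineterm=""
    )
)

print("\n".join(diff_lines))
```

Lines prefixed with `-` appear only in the first sequence, lines prefixed with `+` appear only in the second, and lines prefixed with a space are common to both.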


<br>

### **Checkpoint \#4: Finding Similar Addresses**

For the list of addresses below, compute the similarity score of each address with `target_addr`. What are the 3 highest similarity scores and addresses when compared with `target_addr`? 


In [24]:

target_addr = "123 Main, Springfield IL 62701"

addrs = [
    "456 Elm Avenue, Pleasantville, NY 10570",
    "789 Oak Lane, Boulder, CO 80302",
    "101 Maple Street, Portland OR 97201", 
    "234 Pine Bluff Road, San Francisco, CA 94103",
    "123 Main Street, Springfield, IL 62701",
    "567 Cedar Drive, Austin, TX 78701",
    "890 Birch Court, Seattle, WA 98101", 
    "123 Main St., Springfield, IL 62701-7543",
    "111 Spruce Street, Boston, MA 02108",
    "222 Cherry Lane, Miami, FL 33101",
    "333 Walnut Avenue, Denver, CO 80202",
    "444 Ash Street, Nashville, TN 37201",
    "555 Willow Way, Chicago, IL 60601",
    "666 Magnolia Boulevard, Los Angeles, CA 90001",
    "777 Cedar Lane, Atlanta, GA 30301",
    ]


##### YOUR CODE HERE #####


In [25]:

sim_scores = []

for addr in addrs:
    sim_score = difflib.SequenceMatcher(None, addr, target_addr).ratio()
    sim_scores.append((addr, sim_score))

# Sort list in decreasing order of score.
sim_scores = sorted(sim_scores, key=lambda v: v[1], reverse=True)

sim_scores[:3]

[('123 Main Street, Springfield, IL 62701', 0.8823529411764706),
 ('123 Main St., Springfield, IL 62701-7543', 0.8571428571428571),
 ('101 Maple Street, Portland OR 97201', 0.46153846153846156)]
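
For this kind of "find the closest strings" task, difflib also offers `get_close_matches`, which scores each candidate with `SequenceMatcher` and returns the best ones in decreasing order of similarity. The sketch below uses a shortened list of three addresses taken from `addrs`; `n=3` and `cutoff=0.6` (the default) are the maximum number of results and the minimum score to keep.

```python
import difflib

target = "123 Main, Springfield IL 62701"

# A shortened candidate list, taken from addrs above.
candidates = [
    "789 Oak Lane, Boulder, CO 80302",
    "123 Main Street, Springfield, IL 62701",
    "123 Main St., Springfield, IL 62701-7543",
]

# Keep at most 3 candidates whose similarity score is at least 0.6.
close = difflib.get_close_matches(target, candidates, n=3, cutoff=0.6)

print(close)
```

Unlike the loop above, `get_close_matches` returns only the matching strings, not their scores, so the manual approach is still the one to use when the scores themselves are needed.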


<br>

### **[4. re](https://docs.python.org/3/library/re.html#module-re)**

#### Regular Expressions in Python

- [Regular Expressions Video](https://www.youtube.com/watch?v=r6I-Ahc0HB4&list=PL4cUxeGkcC9g6m_6Sld9Q4jzqdqHd2HiD)
- [Python Regular Expressions HOWTO](https://docs.python.org/3/howto/regex.html)


Regular expressions (regex) are sequences of characters that define search patterns. They enable sophisticated string matching and manipulation, and are used for tasks such as search and replace, input validation, and data extraction. Refer to the [Python Regular Expression HOWTO](https://docs.python.org/3/howto/regex.html) for more information.


- Python Regular Expression Evaluator: https://pythex.org/

<br>


Many of the string methods we've used (`startswith`, `endswith`, `isalpha`) perform simple pattern checks, but they are relatively limited. Regular expressions can find much more complex patterns in text.

The cheatsheet at pythex.org contains the full list of regular expression symbols, but a few are listed below:

- `?`: Optional (0 or 1 repetitions).
- `*`: Zero or more repetitions.
- `+`: One or more repetitions.
- `{m,n}`: Between m and n repetitions.
- `^`: Matches at the start of the string.
- `$`: Matches at the end of the string.
- `.`: Matches any character except a newline.
- `\d`: Matches any decimal digit.
- `[A-Z]`: Matches any one uppercase letter.
- `[a-z]`: Matches any one lowercase letter.
- `[A-Za-z0-9.-]`: Matches any one uppercase letter, lowercase letter, digit, literal `.`, or `-`.
- `\b`: Matches the empty string at a word boundary.
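
A few of these symbols in action. This is a small sketch with made-up test strings, using `re.findall` to collect matches and `re.search` to test whether a pattern matches at all.

```python
import re

# ? makes the preceding character optional.
optional_u = re.findall(r"colou?r", "color colour")
print(optional_u)       # ['color', 'colour']

# \b restricts the match to whole words.
whole_words = re.findall(r"\bcat\b", "cat concat cats")
print(whole_words)      # ['cat']

# ^ and $ anchor the pattern; {3,5} allows 3 to 5 digits.
five_digits = bool(re.search(r"^\d{3,5}$", "50309"))
six_digits = bool(re.search(r"^\d{3,5}$", "503091"))
print(five_digits, six_digits)  # True False

# A character class followed by + matches runs of allowed characters.
tokens = re.findall(r"[A-Za-z0-9.-]+", "ames.ia-2024 & dsm")
print(tokens)           # ['ames.ia-2024', 'dsm']
```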



<br>

Let's say I'm interested in extracting phone numbers from `stream`:

In [6]:

stream = '''
    312-845-5555
    apples
    In 1492 columbus sailed the ocean blue
    2018-09-07
    (312)-845-5555
    17-9866758-900987757567
    '''



We need to identify three digits followed by a dash, then three more digits followed by a dash, then four digits. A first pass might be:

```
\d{3}-\d{3}-\d{4}
```

<br>

When we check this in Pythex, only the first phone number is captured. We need to account for parentheses surrounding the area code, but only in some cases (remember that `?` matches 0 or 1 instances, and that literal parentheses must be escaped as `\(` and `\)`):

```
\(?\d{3}\)?-\d{3}-\d{4}
```

<br>

Testing this in Pythex matches both phone numbers.



We've created our regular expression `"\(?\d{3}\)?-\d{3}-\d{4}"`. Here is how to extract the matches using the re library:


In [26]:

import re

matches = re.findall(r"\(?\d{3}\)?-\d{3}-\d{4}", stream)

matches


['312-845-5555', '(312)-845-5555']
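
The two matches are the same number written two ways. One possible extension, going slightly beyond the symbols listed earlier, is to wrap parts of the pattern in capturing groups `( )` so that each piece can be pulled out and the numbers rewritten in one format. The sketch below uses a shortened copy of `stream`, and the name `normalized` is chosen for illustration.

```python
import re

# Shortened copy of the stream used above.
stream = """
    312-845-5555
    apples
    (312)-845-5555
    """

# Same pattern as before, with each run of digits wrapped in a capturing group.
pattern = re.compile(r"\(?(\d{3})\)?-(\d{3})-(\d{4})")

# m.groups() returns the three captured pieces; rejoin them with dashes.
normalized = ["-".join(m.groups()) for m in pattern.finditer(stream)]

print(normalized)   # ['312-845-5555', '312-845-5555']
```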

<br>

### **Checkpoint \#5: Parsing Web Logs**

The *weblog.txt* file (downloadable from Module 14 In-Class Materials page in Canvas) contains over 7,000 website requests for a site in January of 2019. Each request is prefixed by an IP address. Here is a sample of the first few rows:

```
5.211.97.39 - - [22/Jan/2019:04:10:40 +0330] "GET /m/filter/ ..."
5.211.97.39 - - [22/Jan/2019:04:10:40 +0330] "GET /settings/logo HTTP/1.1 ..."
5.211.97.39 - - [22/Jan/2019:04:10:41 +0330] "GET /image/58131/productModel/200x200 ..."
5.211.97.39 - - [22/Jan/2019:04:10:41 +0330] "HEAD /amp_preconnect_polyfill_404 ..."
```


<br>

Using regular expressions, extract all IP addresses from *weblog.txt*, and determine the number of unique addresses.

IPv4 addresses are 32-bit numbers, typically written in dotted-decimal notation as four 8-bit fields separated by periods (e.g., 192.168.1.1). Each field can range from 0 to 255 (i.e., IP addresses can range from 0.0.0.0 through 255.255.255.255).



In [None]:

##### YOUR CODE HERE #####


In [27]:

weblog_path = "misc/weblog.txt"

with open(weblog_path, "r") as f:
    logs = f.read()

pattern = r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}"

matches = re.findall(pattern, logs)

uniq_ips = list(set(matches))

print(f"Number of unique IP addresses: {len(uniq_ips):,.0f}")

Number of unique IP addresses: 470
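
The pattern above matches any four dot-separated groups of digits anywhere in the file, including values above 255 or version-like strings inside a request path. A stricter variant, sketched below, anchors the match to the start of each line with `^` and `re.MULTILINE`, then keeps only candidates whose four fields fall in 0 to 255. The first two sample lines come from the excerpt above; the last two are invented to exercise the edge cases, and `is_valid_ipv4` is a hypothetical helper.

```python
import re

sample_log = "\n".join([
    '5.211.97.39 - - [22/Jan/2019:04:10:40 +0330] "GET /m/filter/ ..."',
    '5.211.97.39 - - [22/Jan/2019:04:10:41 +0330] "HEAD /amp_preconnect_polyfill_404 ..."',
    # Invented lines: an IP-like string inside the path, and an out-of-range field.
    '66.249.66.194 - - [22/Jan/2019:04:10:42 +0330] "GET /app/1.2.3.4/main.js ..."',
    '999.1.1.1 - - [22/Jan/2019:04:10:43 +0330] "GET / ..."',
])

# With re.MULTILINE, ^ matches at the start of every line, not just the string.
ip_at_line_start = re.compile(r"^(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\b", re.MULTILINE)


def is_valid_ipv4(text):
    """Return True if every dot-separated field is between 0 and 255."""
    return all(0 <= int(part) <= 255 for part in text.split("."))


candidates = ip_at_line_start.findall(sample_log)
unique_ips = {ip for ip in candidates if is_valid_ipv4(ip)}

print(sorted(unique_ips))   # ['5.211.97.39', '66.249.66.194']
```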
