
## Spring 2024 - CIS189 Module \#14 (2024-04-17)
---

**This Evening's Agenda**:
- Prior module grade distributions
- [The Big Book of Small Python Projects](https://inventwithpython.com/bigbookpython/)
- Quick discussion of final projects
- Python Standard Library Quick Hits


<br>

### Python Standard Library Quick Hits

In this class, we've worked with the standard library extensively, but there are many modules we haven't had a chance to cover. This week, we'll discuss a few that might be useful if you use Python beyond this class (which you definitely should!).

The full list of standard library packages is available here: 

- https://docs.python.org/3/py-modindex.html


<br>

### [1. pickle](https://docs.python.org/3/library/pickle.html#module-pickle)

Python's pickle module is used for serializing and deserializing objects. Serialization, often referred to as "pickling," involves converting a Python object into a byte stream, and deserialization, known as "unpickling," is the inverse operation, converting a byte stream back into a Python object.

This allows us to save virtually any Python object (dicts, lists, tuples, sets and more) to a file, which can be read back in at a later date or by other users.

Reasons to serialize objects:

- Saving complicated data types to a file, so that programs can resume where they left off (persistence).
- Sending Python data over a TCP/IP connection between processes or machines (data interchange).
- Caching or saving state between program executions.

<br>

In the next cell, we create a list, then write it to a pickle file (which usually has a .pkl extension):



In [None]:

import pickle

# Create an arbitrary list.
vals = ["hey", "hey", "my", "my"]


# Store vals list in a .pkl file in current working directory. 
with open("vals-list.pkl", "wb") as fpkl:
    pickle.dump(vals, fpkl, pickle.HIGHEST_PROTOCOL)



<br>

Assume we want to load the `vals` list into a new Python session, or allow someone else to access it. Here's how to load a serialized object back into Python using pickle:


In [None]:

with open("vals-list.pkl", "rb") as fpkl:
    vals = pickle.load(fpkl)

vals


In [None]:

# Add more elements to vals, and re-serialize.
vals = vals + ["rock n roll", "can", "never", "die"]

print(vals)

# Serialize updated vals list.
with open("vals-list.pkl", "wb") as fpkl:
    pickle.dump(vals, fpkl, pickle.HIGHEST_PROTOCOL)
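
The same round trip also works in memory, which is what the "data interchange" use case relies on. The sketch below uses `pickle.dumps` and `pickle.loads` on a made-up nested record (the `record` contents are purely illustrative). One caveat from the pickle documentation is worth keeping in mind: unpickling can execute arbitrary code, so only load pickle data from sources you trust.

```python
import pickle

# Hypothetical nested record, used only for illustration.
record = {
    "words": ["hey", "hey", "my", "my"],
    "counts": (2, 2),
    "tags": {"rock", "roll"},
}

# dumps() returns the serialized bytes instead of writing them to a file.
payload = pickle.dumps(record, protocol=pickle.HIGHEST_PROTOCOL)
print(type(payload), len(payload))

# loads() rebuilds an equal, but distinct, object from those bytes.
restored = pickle.loads(payload)
print(restored == record)  # same contents
print(restored is record)  # not the same object in memory
```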



<br>

### [2. hashlib](https://docs.python.org/3/library/hashlib.html#module-hashlib)

Cryptographic hash functions are one-way functions that take input data of any size and produce a fixed-size string of bytes, typically referred to as a hash code, hash digest or hash.


**Hash function characteristics:**

- The same input will always produce the same output.

- A tiny change to the input data (like changing one bit) should change the hash so extensively that the new hash appears uncorrelated with the old hash.

- It should be difficult to find two different inputs that produce the same output hash (collision resistance).
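
As a quick check of the first characteristic, and of the fixed-size output, here is a minimal sketch using SHA-256. The inputs are arbitrary; the point is that a one-byte input and a one-million-byte input both yield a 256-bit digest (64 hex characters), and that hashing the same input twice gives the same result.

```python
import hashlib

short_digest = hashlib.sha256(b"a").hexdigest()
long_digest = hashlib.sha256(b"a" * 1_000_000).hexdigest()
repeat_digest = hashlib.sha256(b"a").hexdigest()

# Digest length does not depend on input length.
print(len(short_digest), len(long_digest))

# The same input always produces the same digest.
print(short_digest == repeat_digest)

# Different inputs produce different digests.
print(short_digest == long_digest)
```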

<br>

**Common hash functions:**

- MD5: Once widely used, now considered insecure due to vulnerabilities allowing for collision attacks.
- SHA-1: Also no longer considered secure against well-funded attackers.
- SHA-256 and SHA-3: Part of the SHA-2 and SHA-3 families, these are currently recommended for security due to their resistance to known attack methods.


**Uses:**

- Quick comparison of arbitrary objects
- Data integrity
- Storing passwords
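
A note on the password use case: in practice, passwords are usually not stored as a single fast hash such as plain SHA-256. A common approach is a random per-user salt plus a deliberately slow key-derivation function. The sketch below uses `hashlib.pbkdf2_hmac` from the standard library; the helper names `hash_password` and `verify_password` and the iteration count are assumptions chosen for illustration, not a vetted security recommendation.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # illustrative value; an assumption, not a recommendation


def hash_password(password, salt=None):
    """Return (salt, derived_key) for the given password string."""
    if salt is None:
        salt = os.urandom(16)  # random salt, stored alongside the key
    key = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, key


def verify_password(password, salt, expected_key):
    """Re-derive the key with the stored salt and compare in constant time."""
    _, key = hash_password(password, salt)
    return hmac.compare_digest(key, expected_key)


salt, key = hash_password("woof")
print(verify_password("woof", salt, key))
print(verify_password("meow", salt, key))
```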

<br>

List available hashing algorithms:

In [None]:

import hashlib

# List available algorithms on your computer.
hashlib.algorithms_available



<br>

Example creating a hash from an arbitrary string:

In [None]:

# String to hash.
s = "Sittin' in the shade and shoveling with a spade"

# Create digest.
digest = hashlib.sha256(s.encode("utf-8")).hexdigest()

digest


<br>

How does the digest change if I add a trailing space?

In [None]:

s2 = "Sittin' in the shade and shoveling with a spade "

digest2 = hashlib.sha256(s2.encode("utf-8")).hexdigest()

print(f"digest : {digest}")
print(f"digest2: {digest2}")
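
To put a number on how different the two digests are, we can count the bits that differ between them. This is a small sketch using the same string as above; for a well-behaved hash function we would expect roughly half of the 256 bits to change, but the exact count depends on the input.

```python
import hashlib

s = "Sittin' in the shade and shoveling with a spade"

# digest() returns raw bytes, which are easier to compare bit by bit.
d1 = hashlib.sha256(s.encode("utf-8")).digest()
d2 = hashlib.sha256((s + " ").encode("utf-8")).digest()

# XOR each pair of bytes and count the 1 bits in the result.
differing_bits = sum(bin(a ^ b).count("1") for a, b in zip(d1, d2))

print(f"{differing_bits} of {len(d1) * 8} bits differ")
```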




<br>

Hashes are commonly used for data integrity checks. Changing even a single byte of a file produces a completely different digest.

- Anaconda distribution: https://www.anaconda.com/download/success
- Anaconda installer file hashes: https://docs.anaconda.com/free/anaconda/hashes/index.html



For the version of Anaconda I downloaded, the SHA-256 digest should be:

- `c536ddb7b4ba738bddbd4e581b29308cb332fa12ae3fa2cd66814bd735dff231`

Do we get the same digest, or has a malicious adversary altered the artifact?

In [None]:

expected_digest = "c536ddb7b4ba738bddbd4e581b29308cb332fa12ae3fa2cd66814bd735dff231"

pp = "Anaconda3-2024.02-1-Linux-x86_64.sh"

# Use a context manager so the file handle is closed after reading.
with open(pp, "rb") as f:
    actual_digest = hashlib.sha256(f.read()).hexdigest()

print(f"Expected digest: {expected_digest}")
print(f"Actual digest  : {actual_digest}")
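
Installer files can be large, and reading one into memory in a single call is not always desirable. A hash object can instead be fed incrementally with `update()`. The sketch below assumes a helper named `sha256_of_file` (our own name, not part of hashlib) and demonstrates it on a throwaway temporary file so that it runs without the installer present.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it chunk by chunk."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # iter(callable, sentinel) keeps calling f.read until it returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


# Stand-in data written to a temporary file (not the real installer).
data = b"not the real installer" * 100_000
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(data)

chunked = sha256_of_file(tmp.name, chunk_size=4096)
whole = hashlib.sha256(data).hexdigest()
os.remove(tmp.name)

# Chunked hashing gives the same digest as hashing everything at once.
print(chunked == whole)
```

For the installer above, the call would be `sha256_of_file(pp)`. Python 3.11 and later also provide `hashlib.file_digest`, which does similar chunked reading for you.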


<br>

Digests can be created for images, which allows for quick comparison. Take the album cover for the Allman Brothers 1971 classic *At the Fillmore East* (800 x 800 pixel image, 1,920,000 color points):



![](https://popspotsnyc.com/ALLMAN_BROTHERS_FILLMORE_EAST/allman_bros_fillmore_east.jpg)


In [None]:

# Specify path to locally downloaded image.
img_path = "../Misc/Images/fillmore.jpg"

# Create digest.
with open(img_path, "rb") as f:
    allman_digest = hashlib.sha256(f.read()).hexdigest()

# Print digest.
print(f"allman_digest: {allman_digest}")



<br>


### Exercise 1: Pickling your hashes (not a euphemism)


1. Identify the hashing algorithms available on your computer (they should be the same on most Windows machines, but may differ on macOS). Choose 3, and create a separate digest of the Gettysburg Address with each.

2. Create a dictionary whose keys are the names of the algorithms you chose (as strings) and whose values are the corresponding digests. For example, if I chose sha1, sha224 and sha384, my dictionary would look like the following:

```
d = {
    "sha1": "whatever the sha1 digest turns out to be",
    "sha224": "whatever the sha224 digest turns out to be",
    "sha384": "whatever the sha384 digest turns out to be"
}
```

3. Serialize the dictionary to a .pkl file named *my-digests.pkl*.



In [None]:

gettysburg = """
Four score and seven years ago our fathers brought forth on this continent, a 
new nation, conceived in Liberty, and dedicated to the proposition that all men 
are created equal. Now we are engaged in a great civil war, testing whether that 
nation, or any nation so conceived and so dedicated, can long endure. We are met 
on a great battle-field of that war. We have come to dedicate a portion of that 
field, as a final resting place for those who here gave their lives that that 
nation might live. It is altogether fitting and proper that we should do this. 
But, in a larger sense, we can not dedicate -- we can not consecrate -- we can not 
hallow this ground. The brave men, living and dead, who struggled here, have 
consecrated it, far above our poor power to add or detract. The world will 
little note, nor long remember what we say here, but it can never forget what 
they did here. It is for us the living, rather, to be dedicated here to the 
unfinished work which they who fought here have thus far so nobly advanced. 
It is rather for us to be here dedicated to the great task remaining before us 
that from these honored dead we take increased devotion to that cause for which 
they gave the last full measure of devotion -- that we here highly resolve that 
these dead shall not have died in vain -- that this nation, under God, shall 
have a new birth of freedom -- and that government of the people, by the people, 
for the people, shall not perish from the earth.
"""

##### YOUR CODE HERE #####




<br>


### [3. difflib](https://docs.python.org/3/library/difflib.html#module-difflib)


difflib provides tools for computing and working with differences between sequences. It includes functions and classes for comparing sequences, finding differences, and generating human-readable differences or patches.


Imagine we have two strings and want to quantify their similarity. We can leverage `difflib.SequenceMatcher`:


In [None]:

import difflib

s1 = "either renew oneself or perish"
s2 = "either renew thyself or perish"


# Compute similarity score of s1 and s2. 
sim_score = difflib.SequenceMatcher(None, s1, s2).ratio()


print(f"Similarity score of s1 and s2: {sim_score:.2%}")
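
Beyond a single similarity score, difflib can also generate the human-readable differences mentioned above. Here is a minimal sketch using `difflib.unified_diff`, which compares two lists of lines; the two-line inputs and the `old`/`new` labels are made up for illustration.

```python
import difflib

old_lines = ["either renew oneself", "or perish"]
new_lines = ["either renew thyself", "or perish"]

# lineterm="" because our input lines carry no trailing newlines.
diff = list(
    difflib.unified_diff(
        old_lines, new_lines, fromfile="old", tofile="new", lineterm=""
    )
)

# Lines starting with "-" were removed, lines starting with "+" were added.
print("\n".join(diff))
```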


<br>

### Exercise 2: Finding Similar Addresses

For the list of addresses below, compute the similarity score of each address with the target address. How many entries in the addresses list are highly similar to target_address?


In [None]:

target_address = "123 Main, Springfield IL 62701"


addresses = [
    "456 Elm Avenue, Pleasantville, NY 10570",
    "789 Oak Lane, Boulder, CO 80302",
    "101 Maple Street, Portland OR 97201", 
    "234 Pine Bluff Road, San Francisco, CA 94103",
    "123 Main Street, Springfield, IL 62701",
    "567 Cedar Drive, Austin, TX 78701",
    "890 Birch Court, Seattle, WA 98101", 
    "123 Main St., Springfield, IL 62701-7543",
    "111 Spruce Street, Boston, MA 02108",
    "222 Cherry Lane, Miami, FL 33101",
    "333 Walnut Avenue, Denver, CO 80202",
    "444 Ash Street, Nashville, TN 37201",
    "555 Willow Way, Chicago, IL 60601",
    "666 Magnolia Boulevard, Los Angeles, CA 90001",
    "777 Cedar Lane, Atlanta, GA 30301",
    ]


##### YOUR CODE HERE #####




<br>

### [4. re](https://docs.python.org/3/library/re.html#module-re)

#### Regular Expressions in Python

Regular expressions (regex) are sequences of characters used to define search patterns. They enable sophisticated string matching and manipulation in text processing tasks. Regex patterns specify sequences of characters that must be found or matched in a string. They're employed in tasks such as text search and replace, input validation, and data extraction. Refer to the [Python Regular Expression HOWTO](https://docs.python.org/3/howto/regex.html) for more information.


- Python Regular Expression Evaluator: https://pythex.org/

<br>


Many of the string methods we've used are similar to regular expressions (`startswith`, `endswith`, `isalpha`), but these are relatively limited. Regular expressions can be leveraged to find complex patterns in text streams.

The cheatsheet at pythex.org contains the full list of regular expression symbols, but a few are listed below:

- `?` : Zero or one time (optional).
- `*` : Zero or more times.
- `+` : One or more times.
- `{m,n}` : Between m and n times.
- `^` : Match at start of string.
- `$` : Match at end of string.
- `.` : Matches any character except newline.
- `[A-Z]`: Matches uppercase letters.
- `[a-z]`: Matches lowercase letters.
- `[A-Za-z0-9.-]`: Matches uppercase, lowercase, digits, literal `.` and `-`.
- `\b`: Matches empty string at word boundary. 
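
A few of these symbols in action, as a small sketch on one line of sample text (the patterns are just illustrations of each symbol, not part of the exercises):

```python
import re

text = "In 1492 columbus sailed the ocean blue"

# + : one or more digits.
digits = re.findall(r"\d+", text)

# \b, [a-z] and {m,n} : whole lowercase words of 5 or 6 letters.
mid_words = re.findall(r"\b[a-z]{5,6}\b", text)

# ^ and $ : anchor to the start and end of the string.
starts_with_in = bool(re.search(r"^In", text))
ends_with_blue = bool(re.search(r"blue$", text))

# ? : the "u" is optional.
spellings = re.findall(r"colou?r", "color colour")

print(digits)
print(mid_words)
print(starts_with_in, ends_with_blue)
print(spellings)
```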


In [None]:

stream = '''
    312-845-5555
    apples
    In 1492 columbus sailed the ocean blue
    2018-09-07
    (312)-845-5555
    17-9866758-900987757567
    '''



<br>

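We've created our regular expression `\(?\b\d{3}\)?-\d{3}-\d{4}\b`: an optional opening parenthesis, three digits, an optional closing parenthesis, then a hyphen, three digits, a hyphen and four digits. The `\(?` sits before the first `\b` so that the opening parenthesis is kept in the match. Here is how to extract the matches using the re library:


In [None]:

import re

# Place \(? before \b: there is no word boundary between a space and "(",
# so a leading \b would drop the opening parenthesis from "(312)-845-5555".
matches = re.findall(r"\(?\b\d{3}\)?-\d{3}-\d{4}\b", stream)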

matches
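
`re.findall` returns plain strings. When we also want the pieces of each match, named groups together with `re.finditer` are handy. The sketch below pulls the date out of the same sample text; the group names `year`, `month` and `day` are our own choice.

```python
import re

stream = """
    312-845-5555
    apples
    In 1492 columbus sailed the ocean blue
    2018-09-07
    (312)-845-5555
    17-9866758-900987757567
    """

# (?P<name>...) labels a group so it can be retrieved by name.
date_pattern = re.compile(r"\b(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})\b")

found = [m.groupdict() for m in date_pattern.finditer(stream)]

print(found)
```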



<br>

### Exercise 3: Contacting Puppy Boy

The following text is a transcript from a customer service call. The representative urgently needs to send an email to Puppy Boy regarding his account, but the addresses can only be obtained from the transcript. Use the power of regular expressions to extract Puppy Boy's email addresses so his information can be delivered. 

In [None]:

transcript = """
REPRESENTATIVE: So, Puppy Boy that's your real name
CUSTOMER: Yep. P-U-P-P-Y B-O-Y I'm part puppy, part boy
REPRESENTATIVE: OK (long pause) The email address we have on file is puppy.boy420@gmail.com is that correct
CUSTOMER: Woof. You can also contact me at beggin4bones@yahoo.com or not.your.scooby@outlook.com 
REPRESENTATIVE: Thank you Puppy Boy we will send your receipt for 10000 pounds of Purina to each address
CUSTOMER: Thank you madam may the wind always be at your back and the sun upon your face
"""

##### YOUR CODE HERE #####
