<a href="https://colab.research.google.com/github/kbehrman/Ca-legislature-explorations/blob/master/Chapter-15%3AOther-Topics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this chapter we will cover some functionality and packages that are powerful tools for data science: sorting data, reading and writing files, datetime objects and regular expressions. I will try to give you enough familiarity with these topics that you will recognize when you need them.



## Sorting

Some Python data structures, such as lists, NumPy arrays and Pandas DataFrames, have built-in sorting capabilities. These can be used out of the box, or customized with your own sorting functions.

### Lists
Python lists have a built-in sort() method. This method sorts the list in place. For example, if we define a list of whale types:

In [None]:
whales = [ 'Blue', 'Killer', 'Sperm', 'Humpback', 'Beluga', 'Bowhead' ]

And use its sort() method, we see that the list is now sorted:

In [None]:
whales.sort()
whales

['Beluga', 'Blue', 'Bowhead', 'Humpback', 'Killer', 'Sperm']

This method does not return a copy of the list. If we capture the return value, we see that it is None:

In [None]:
return_value = whales.sort()
print(return_value)

None


Python has a built-in function, sorted(), which takes any iterable as an argument, and returns a sorted list:

In [None]:
sorted(whales)

['Beluga', 'Blue', 'Bowhead', 'Humpback', 'Killer', 'Sperm']

We can use sorted() on any iterable. If we call it on a string, it returns a sorted list of the string's characters:

In [None]:
sorted("Moby Dick")

[' ', 'D', 'M', 'b', 'c', 'i', 'k', 'o', 'y']

Both the list.sort() method and the sorted() function take an optional reverse parameter, which defaults to False:

In [None]:
sorted(whales, reverse=True)

['Sperm', 'Killer', 'Humpback', 'Bowhead', 'Blue', 'Beluga']

Both also take an optional key argument: a function that is called on each item to produce the value used for comparison. To sort our whale types by the length of the strings, we can define a lambda which returns string length, and pass it as the key:

In [None]:
sorted(whales, key=lambda x: len(x))

['Blue', 'Sperm', 'Beluga', 'Killer', 'Bowhead', 'Humpback']

We can also define more complex key functions. Below we define a function which returns the length of a string, unless that string is 'Beluga', in which case it returns 1. The result of using this as our key function is to sort the list by string length, except for 'Beluga', which is placed first:

In [None]:
def beluga_first(item):
    if item == 'Beluga':
        return 1
    return len(item)

sorted(whales, key=beluga_first)

['Beluga', 'Blue', 'Sperm', 'Killer', 'Bowhead', 'Humpback']
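
A key function can also return a tuple, in which case items are compared element by element. The sketch below is an illustration that goes slightly beyond the examples above: it sorts the same whale list by length and breaks ties alphabetically. The variable name `by_length_then_name` is made up for this example.

```python
whales = ['Blue', 'Killer', 'Sperm', 'Humpback', 'Beluga', 'Bowhead']

# The key returns a (length, name) tuple: length is compared first,
# and the name itself is only consulted when two lengths are equal.
by_length_then_name = sorted(whales, key=lambda w: (len(w), w))

print(by_length_then_name)
# ['Blue', 'Sperm', 'Beluga', 'Killer', 'Bowhead', 'Humpback']
```

Here 'Beluga' and 'Killer' both have six characters, so the second tuple element decides their relative order.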

We can also use sorted() with classes that we define. In listing 16.1 we define a class Food, instantiate four instances of it, and then sort them using the attribute rating as a sort key.

In [None]:
class Food():
    def __init__(self, rating, name):
        self.rating = rating
        self.name = name

    def __repr__(self):
        return f'Food({self.rating}, {self.name})'

foods = [Food(3, 'Banana'),
         Food(9, 'Orange'),
         Food(2, 'Tomato'),
         Food(1, 'Olive')]

foods

[Food(3, Banana), Food(9, Orange), Food(2, Tomato), Food(1, Olive)]

In [None]:
sorted(foods, key=lambda x: x.rating)

[Food(1, Olive), Food(2, Tomato), Food(3, Banana), Food(9, Orange)]
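
As an alternative to writing a lambda, the standard library's `operator.attrgetter` builds a key function that reads a named attribute. The sketch below assumes a small class like Food; the food names and the variables `by_rating` and `best_first` are invented for illustration.

```python
from operator import attrgetter


class Food:
    def __init__(self, rating, name):
        self.rating = rating
        self.name = name

    def __repr__(self):
        return f'Food({self.rating}, {self.name})'


# Hypothetical sample data for this sketch.
foods = [Food(3, 'Apple'), Food(9, 'Pear'), Food(1, 'Fig')]

# attrgetter('rating') behaves like: lambda x: x.rating
by_rating = sorted(foods, key=attrgetter('rating'))
best_first = sorted(foods, key=attrgetter('rating'), reverse=True)

print(by_rating)   # [Food(1, Fig), Food(3, Apple), Food(9, Pear)]
print(best_first)  # [Food(9, Pear), Food(3, Apple), Food(1, Fig)]
```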

If you call sorted() on a dictionary, it returns a sorted list of the dictionary's keys. As of Python 3.7, dictionaries preserve the order in which their keys were inserted (https://docs.python.org/3/whatsnew/3.7.html). In listing 16.2 we create a dictionary of whale weights based on data from whalefacts.org (https://www.whalefacts.org/how-big-are-whales/). We print the dictionary keys to demonstrate that they retain the order in which they were inserted. We then use sorted() to get a list of key names sorted alphabetically, and use this list to print each whale name and weight in order.

In [None]:
wieghts = {'Blue': 300000, 
           'Killer': 12000,
           'Sperm': 100000,
           'Humpback': 78000,
           'Beluga':  3500,
           'Bowhead': 200000 }

In [None]:
for key in wieghts:
    print(key)

Blue
Killer
Sperm
Humpback
Beluga
Bowhead


In [None]:
sorted(wieghts)

['Beluga', 'Blue', 'Bowhead', 'Humpback', 'Killer', 'Sperm']

In [None]:
for key in sorted(wieghts):
    print(f'{key} {wieghts[key]}')

Beluga 3500
Blue 300000
Bowhead 200000
Humpback 78000
Killer 12000
Sperm 100000
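
Sorting a dictionary by its values rather than its keys follows the same pattern, using the key argument. The sketch below is one possible approach, not the only one; it redefines the whale data under the name `weights`, and `heaviest_first` and `by_weight` are names chosen for this example.

```python
weights = {'Blue': 300000,
           'Killer': 12000,
           'Sperm': 100000,
           'Humpback': 78000,
           'Beluga': 3500,
           'Bowhead': 200000}

# weights.get maps each key to its value, so keys are ordered by weight.
heaviest_first = sorted(weights, key=weights.get, reverse=True)
print(heaviest_first)
# ['Blue', 'Bowhead', 'Sperm', 'Humpback', 'Killer', 'Beluga']

# Sorting the (key, value) pairs keeps names and weights together.
by_weight = sorted(weights.items(), key=lambda pair: pair[1])
print(by_weight[0])  # ('Beluga', 3500)
```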


Pandas DataFrames have a sorting method, .sort_values(), which takes a list of column names by which to sort. We demonstrate this in listing 16.3.

In [None]:
import pandas as pd
data = {'first': ['Dan', 'Bob', 'Bob'],
        'last': ['Huerando', 'Pousin', 'Smith'],
        'score': [0, 143, 99]}

df = pd.DataFrame(data)
df

|  | first | last | score |
|---|---|---|---|
| 0 | Dan | Huerando | 0 |
| 1 | Bob | Pousin | 143 |
| 2 | Bob | Smith | 99 |


In [None]:
df.sort_values(by=['first','last'])

|  | first | last | score |
|---|---|---|---|
| 1 | Bob | Pousin | 143 |
| 2 | Bob | Smith | 99 |
| 0 | Dan | Huerando | 0 |


## Reading and Writing Files

We have already seen that Pandas can read various file formats directly into a DataFrame. There will be times you wish to read and write file data without using DataFrames. Python has a built-in function, open(), which, given a path, will return an open file object:

In [None]:
read_me = open('/content/sample_data/README.md')
read_me

<_io.TextIOWrapper name='/content/sample_data/README.md' mode='r' encoding='UTF-8'>

You can read a single line from the file object using the .readline() method:

In [None]:
read_me.readline()

'This directory includes a few sample datasets to get you started.\n'

The file object keeps track of your place in the file. With each subsequent call to .readline(), the next line is returned as a string:

In [None]:
read_me.readline()

'\n'

In [None]:
read_me.readline()

'*   `california_housing_data*.csv` is California housing data from the 1990 US\n'

It is important to close the file when you are done with it, or the open handle may interfere with opening the file again. This is done with the file object's close() method:

In [None]:
read_me.close()

A context manager is a way to close files automatically. It is written as a compound statement that starts with the keyword with, and it closes the file when execution leaves the indented block. Below we open a file using a context manager, then read it using the readlines() method. Once the context is exited, the file contents have been read into the variable data, and the file object is closed:

In [None]:
with open('/content/sample_data/README.md') as open_file:
    data = open_file.readlines()

In [None]:
data[0]

'This directory includes a few sample datasets to get you started.\n'

By default, a file is opened for reading as text. You can specify other modes, such as read binary, 'rb', write, 'w', and write binary, 'wb'. Below we use the 'w' argument to write a new file:

In [None]:
text = 'My intriguing story'

with open('/content/my_new_file.txt', 'w') as open_file:
    open_file.write(text)


We can check that the file is indeed created:

In [None]:
!ls /content/

my_new_file.txt  sample_data
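
The modes can be combined into a small write-then-read round trip. The following is a minimal sketch under a few assumptions: it writes to a temporary directory instead of /content/, and it uses the append mode, 'a', which is a standard mode of open() that is not covered above.

```python
import os
import tempfile

# Hypothetical location: a fresh temporary directory instead of /content/.
path = os.path.join(tempfile.mkdtemp(), 'my_new_file.txt')

# 'w' creates the file (or truncates an existing one) before writing.
with open(path, 'w') as open_file:
    open_file.write('My intriguing story\n')

# 'a' (append) adds to the end of the file without erasing it.
with open(path, 'a') as open_file:
    open_file.write('continues here\n')

# Iterating over a file object yields one line at a time.
with open(path) as open_file:
    lines = [line.rstrip('\n') for line in open_file]

print(lines)             # ['My intriguing story', 'continues here']
print(open_file.closed)  # True
```

Iterating over the file object, rather than calling readlines(), avoids holding the whole file in memory before processing it.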


JSON is a common format for transmitting and storing data. The Python standard library includes a module, json, for translating between JSON strings and Python types. Below we open a JSON file:

In [None]:
import json

with open('/content/sample_data/anscombe.json') as open_file:
    data = json.load(open_file)

data

[{'Series': 'I', 'X': 10.0, 'Y': 8.04},
 {'Series': 'I', 'X': 8.0, 'Y': 6.95},
 {'Series': 'I', 'X': 13.0, 'Y': 7.58},
 {'Series': 'I', 'X': 9.0, 'Y': 8.81},
 {'Series': 'I', 'X': 11.0, 'Y': 8.33},
 {'Series': 'I', 'X': 14.0, 'Y': 9.96},
 {'Series': 'I', 'X': 6.0, 'Y': 7.24},
 {'Series': 'I', 'X': 4.0, 'Y': 4.26},
 {'Series': 'I', 'X': 12.0, 'Y': 10.84},
 {'Series': 'I', 'X': 7.0, 'Y': 4.81},
 {'Series': 'I', 'X': 5.0, 'Y': 5.68},
 {'Series': 'II', 'X': 10.0, 'Y': 9.14},
 {'Series': 'II', 'X': 8.0, 'Y': 8.14},
 {'Series': 'II', 'X': 13.0, 'Y': 8.74},
 {'Series': 'II', 'X': 9.0, 'Y': 8.77},
 {'Series': 'II', 'X': 11.0, 'Y': 9.26},
 {'Series': 'II', 'X': 14.0, 'Y': 8.1},
 {'Series': 'II', 'X': 6.0, 'Y': 6.13},
 {'Series': 'II', 'X': 4.0, 'Y': 3.1},
 {'Series': 'II', 'X': 12.0, 'Y': 9.13},
 {'Series': 'II', 'X': 7.0, 'Y': 7.26},
 {'Series': 'II', 'X': 5.0, 'Y': 4.74},
 {'Series': 'III', 'X': 10.0, 'Y': 7.46},
 {'Series': 'III', 'X': 8.0, 'Y': 6.77},
 {'Series': 'III', 'X': 13.0, 'Y': 12.7
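
The json module also works directly on strings. As a small sketch, `json.dumps()` turns a Python object into a JSON string and `json.loads()` parses it back; the `record` dictionary here is just a sample row in the shape of the data above.

```python
import json

# A sample record shaped like the rows in the file above.
record = {'Series': 'I', 'X': 10.0, 'Y': 8.04}

# Python object -> JSON string
as_text = json.dumps(record)
print(as_text)  # {"Series": "I", "X": 10.0, "Y": 8.04}

# JSON string -> Python object
restored = json.loads(as_text)
print(restored == record)  # True
```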

## Datetimes

Data that models values over time, called time series data, is commonly used in data science. To work with this kind of data, we need a way to represent time. One common way is to use strings. If we need more functionality, such as easily adding and subtracting, or easily pulling out values for year, month and day, we need something more sophisticated. The datetime module offers various ways to model time, along with useful functionality for manipulating time values. The datetime.datetime class represents a moment in time down to the microsecond. In listing 16.... we demonstrate creating a datetime object and accessing some of its values.

In [None]:
from datetime import datetime

dt = datetime(2022, 10, 1, 13, 59, 33, 10000)
dt

datetime.datetime(2022, 10, 1, 13, 59, 33, 10000)

In [None]:
dt.year

2022

In [None]:
dt.month

10

In [None]:
dt.day

1

In [None]:
dt.hour

13

In [None]:
dt.minute

59

In [None]:
dt.second

33

In [None]:
dt.microsecond

10000

We can get an object for the current time using the datetime.now() method:

In [None]:
dt = datetime.now()
dt

datetime.datetime(2022, 4, 2, 19, 35, 54, 256685)

We can translate strings to datetime objects and datetime objects to strings using the datetime.strptime() and datetime.strftime() methods. Both rely on format codes, which define how the string should be processed. These format codes are defined here:
https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior
In listing 16... we use the format code %Y for a four-digit year, %m for a two-digit month and %d for a two-digit day to create a datetime from a string. We then use %y, which represents a two-digit year, to create a new string version.

In [None]:
dt = datetime.strptime('1968-06-20', '%Y-%m-%d')

dt

datetime.datetime(1968, 6, 20, 0, 0)

In [None]:
dt.strftime('%m/%d/%y')

'06/20/68'

The datetime module also includes the timedelta class, which represents a span of time. Adding a timedelta to a datetime, or subtracting one from it, gives a new datetime:

In [None]:
from datetime import timedelta

delta = timedelta(days=3)

dt - delta

datetime.datetime(1968, 6, 17, 0, 0)
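
Subtracting one datetime from another goes the other way and produces a timedelta. The sketch below uses two arbitrary example moments (`start` and `end` are made-up values) to show this.

```python
from datetime import datetime, timedelta

# Two arbitrary example moments.
start = datetime(2022, 10, 1, 13, 59, 33)
end = datetime(2022, 10, 4, 15, 59, 33)

# The difference between two datetimes is a timedelta.
elapsed = end - start
print(elapsed)                  # 3 days, 2:00:00
print(elapsed.days)             # 3
print(elapsed.total_seconds())  # 266400.0

# timedelta accepts other units, such as hours.
print(start + timedelta(hours=12))  # 2022-10-02 01:59:33
```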

Python 3.9 introduced a new module, zoneinfo, for working with time zones. With this module it is easy to set the time zone of a datetime:

In [None]:
from zoneinfo import ZoneInfo

dt = datetime(2032, 10, 14, 23, tzinfo=ZoneInfo("America/Jujuy"))
dt.tzname()
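
Once a datetime carries time zone information, it can be converted to another zone with the astimezone() method. The sketch below swaps in a different technique from the one above: instead of zoneinfo, which depends on a time zone database being available, it uses the fixed-offset `timezone` class from the datetime module. The offset of three hours behind UTC is an assumption made for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumed fixed offset: three hours behind UTC.
minus_three = timezone(timedelta(hours=-3))

local = datetime(2032, 10, 14, 23, tzinfo=minus_three)

# Convert the same moment to UTC.
as_utc = local.astimezone(timezone.utc)
print(as_utc)           # 2032-10-15 02:00:00+00:00

# Aware datetimes compare by the moment they represent.
print(local == as_utc)  # True
```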

The datetime library also includes a datetime.date class. This class is similar to a datetime.datetime except that it does not track the time of day:

In [None]:
from datetime import date

date.today()

datetime.date(2022, 4, 2)

## Regular Expressions

The last package we will cover in this chapter is the regular expression library, re. Regular expressions are a sophisticated language for searching within text. We define a search pattern as a string, and then use it to search our target text. At the simplest level, our search pattern can be the exact text we wish to match. Below we define a text containing ship captains and their associated emails. We then search this text using the re.match() function, which returns a match object:

In [None]:
captains = '''Ahab: ahab@pequod.com
Peleg: peleg@pequod.com
Ishmael: ishmael@pequod.com
Herman: herman@acushnet.io
Pollard: pollard@essex.me
'''

import re
re.match("Ahab:", captains )

<re.Match object; span=(0, 5), match='Ahab:'>

We can use the result of this match with an if statement, whose code block will only execute if the text is matched. 

In [None]:
if re.match("Ahab:", captains ):
  print("We found Ahab")

We found Ahab


The re.match() function only matches at the beginning of the string. If we try to match a substring that appears later in the source string, it will not match:

In [None]:
if re.match("Peleg", captains):
  print("We found Peleg")
else:
  print("No Peleg found!")

No Peleg found!


If you wish to match any substring contained within a text, use the re.search() function instead:

In [None]:
re.search("Peleg", captains)

<re.Match object; span=(22, 27), match='Peleg'>

### Character sets

Character sets are a syntax for defining more generalized matches. A character set is a group of characters enclosed in square brackets, and it matches any one of those characters. So to search for the first occurrence of either zero or one, we could use the character set `"[01]"`. Or to search for the first occurrence of a vowel followed by a punctuation mark: `"[aeiou][!,?.;]"`. We can indicate a range of characters in a character set by using a dash. For any digit we would use `"[0-9]"`, for any capital letter, `"[A-Z]"`, or for any lowercase letter, `"[a-z]"`. If we follow a character set by a +, it will match one or more instances. If we follow the character set by a number in curly brackets, it will match exactly that number of occurrences in a row. We demonstrate these in listing 16....


In [None]:
re.search("[A-Z][a-z]", captains)

<re.Match object; span=(0, 2), match='Ah'>

In [None]:
re.search("[A-Za-z]+", captains)

<re.Match object; span=(0, 4), match='Ahab'>

In [None]:
re.search("[A-Za-z]{7}", captains)

<re.Match object; span=(46, 53), match='Ishmael'>

In [None]:
re.search(r"[a-z]+\@[a-z]+\.[a-z]+", captains)

<re.Match object; span=(6, 21), match='ahab@pequod.com'>
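
The digit range and the curly-bracket count can be combined in the same way. As a small sketch, the pattern below looks for a date written as four digits, two digits and two digits separated by dashes; the `log_line` text is invented for this example.

```python
import re

# Invented sample text for this sketch.
log_line = "Sighting logged 1851-10-18 near Nantucket"

# [0-9]{4} matches exactly four digits in a row, [0-9]{2} exactly two.
m = re.search(r"[0-9]{4}-[0-9]{2}-[0-9]{2}", log_line)

print(m.group())  # 1851-10-18
```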

### Character classes

Character classes are predefined groups of characters supplied for easier matching. You can see the whole list of character classes in the re documentation, but some common ones are `\d` for digit characters, `\s` for whitespace characters, and `\w` for word characters. Word characters are generally those commonly used in words, as well as numeric digits and underscores.

To search for the first occurrence of a digit surrounded by word characters, we could use `"\w\d\w"`:

In [None]:
re.search(r"\w\d\w", "His panic over Y2K was overwhelming.")

<re.Match object; span=(15, 18), match='Y2K'>

We can use the plus sign and curly brackets to indicate multiple consecutive occurrences of a character class in the same way we did with character sets:

In [None]:
re.search(r"\w+\@\w+\.\w+", captains)

<re.Match object; span=(6, 21), match='ahab@pequod.com'>

### Groups

If we enclose parts of our regular expression pattern in parentheses, they become groups. We can access groups on our match object using the group() method. Each group is numbered, with group 0 being the whole match:

In [None]:
m = re.search(r"(\w+)\@(\w+)\.(\w+)", captains)
print(f'''
Group 0 is {m.group(0)}
Group 1 is {m.group(1)}
Group 2 is {m.group(2)}
Group 3 is {m.group(3)}
''')


Group 0 is ahab@pequod.com
Group 1 is ahab
Group 2 is pequod
Group 3 is com



### Named Groups

It is often useful to refer to groups by names rather than numbers. The syntax for defining a named group is:

`(?P<GROUP_NAME>PATTERN)`

We can then get groups using the group names instead of numbers:

In [None]:
m = re.search(r"(?P<name>\w+)\@(?P<SLD>\w+)\.(?P<TLD>\w+)", captains)

print(f'''
Email address: {m.group()}
Name:  {m.group("name")}
Secondary level domain: {m.group("SLD")}
Top level Domain: {m.group("TLD")}
''')


Email address: ahab@pequod.com
Name:  ahab
Secondary level domain: pequod
Top level Domain: com
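
Match objects can also return all of their groups at once. The sketch below, using a single line of sample text, shows the `groupdict()` method, which returns the named groups as a dictionary, and `groups()`, which returns every group as a tuple.

```python
import re

m = re.search(r"(?P<name>\w+)@(?P<SLD>\w+)\.(?P<TLD>\w+)",
              "Ahab: ahab@pequod.com")

# Named groups as a dictionary keyed by group name.
print(m.groupdict())  # {'name': 'ahab', 'SLD': 'pequod', 'TLD': 'com'}

# All groups, in order, as a tuple.
print(m.groups())     # ('ahab', 'pequod', 'com')
```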



### Find all

Until now we have only found the first occurrence of a match. We can use the re.findall() function to match all occurrences. This function returns each match as a string:

In [None]:
m = re.findall(r"\w+\@\w+\.\w+", captains)
m

['ahab@pequod.com',
 'peleg@pequod.com',
 'ishmael@pequod.com',
 'herman@acushnet.io',
 'pollard@essex.me']

If we have defined groups, it will return each match as a tuple of strings, with each string being the match for a group:

In [None]:
re.findall(r"(?P<name>\w+)\@(?P<SLD>\w+)\.(?P<TLD>\w+)", captains)


[('ahab', 'pequod', 'com'),
 ('peleg', 'pequod', 'com'),
 ('ishmael', 'pequod', 'com'),
 ('herman', 'acushnet', 'io'),
 ('pollard', 'essex', 'me')]

### Find Iterator
If we are searching for all matches in a large text, we can use re.finditer(). This function returns an iterator, which yields the next match object with each iteration:

In [None]:
iterator = re.finditer(r"\w+\@\w+\.\w+", captains)

print(f"An {type(iterator)} object is returned by finditer" )

An <class 'callable_iterator'> object is returned by finditer


In [None]:
m = next(iterator)
f"The first match, {m.group()} is processed without processing the rest of the text"

'The first match, ahab@pequod.com is processed without processing the rest of the text'
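
In practice the iterator is usually consumed with a for loop. Because each item is a full match object, we have access to positions as well as text. The sketch below uses a shortened, two-line version of the captains text and collects results in a list named `found`, which is a name chosen for this example.

```python
import re

# A shortened version of the captains text.
captains = '''Ahab: ahab@pequod.com
Peleg: peleg@pequod.com
'''

found = []
for m in re.finditer(r"\w+@\w+\.\w+", captains):
    # Each m is a match object, so start(), end() and group() are available.
    found.append((m.start(), m.end(), m.group()))

print(found)
# [(6, 21, 'ahab@pequod.com'), (29, 45, 'peleg@pequod.com')]
```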

### Substitution

Regular expressions can be used for substitution as well as matching. The re.sub() function takes a match pattern, a replacement string, and a source text:

In [None]:
re.sub(r"\d", "#", "Your secret pin is 12345")

'Your secret pin is #####'

### Substitution using named groups

We can refer to named groups in the replacement string using the syntax:

`\g<GROUP_NAME>`


To reverse our email addresses in our captains text:

In [None]:
new_text = re.sub(r"(?P<name>\w+)\@(?P<SLD>\w+)\.(?P<TLD>\w+)", r"\g<TLD>.\g<SLD>.\g<name>", captains)

print(new_text)

Ahab: com.pequod.ahab
Peleg: com.pequod.peleg
Ishmael: com.pequod.ishmael
Herman: io.acushnet.herman
Pollard: me.essex.pollard
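
The replacement passed to re.sub() can also be a function, which receives each match object and returns the replacement text. This is useful when the replacement needs more logic than a template string allows. In the sketch below, the function `shout_name` and the group name `domain` are invented for illustration.

```python
import re


def shout_name(match):
    # Upper-case the name part and leave the domain untouched.
    return match.group("name").upper() + "@" + match.group("domain")


text = "Ahab: ahab@pequod.com"
result = re.sub(r"(?P<name>\w+)@(?P<domain>\w+\.\w+)", shout_name, text)

print(result)  # Ahab: AHAB@pequod.com
```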



### Compiling regexes

There is some cost to compiling a regular expression pattern. If you are using the same regular expression many times, it is more efficient to compile it once. This is done using the re.compile() function, which returns a compiled regular expression object based on a match pattern:

In [None]:
regex = re.compile(r"\w+: (?P<name>\w+)\@(?P<SLD>\w+)\.(?P<TLD>\w+)")
regex

re.compile(r'\w+: (?P<name>\w+)\@(?P<SLD>\w+)\.(?P<TLD>\w+)', re.UNICODE)

This object has methods corresponding to many of the re functions, such as match(), search(), findall(), finditer() and sub():

In [None]:
regex.match(captains)

<re.Match object; span=(0, 21), match='Ahab: ahab@pequod.com'>

In [None]:
regex.search(captains)

<re.Match object; span=(0, 21), match='Ahab: ahab@pequod.com'>

In [None]:
regex.findall(captains)

[('ahab', 'pequod', 'com'),
 ('peleg', 'pequod', 'com'),
 ('ishmael', 'pequod', 'com'),
 ('herman', 'acushnet', 'io'),
 ('pollard', 'essex', 'me')]

In [None]:
new_text = regex.sub(r"Ahoy \g<name>!", captains)
print(new_text)

Ahoy ahab!
Ahoy peleg!
Ahoy ishmael!
Ahoy herman!
Ahoy pollard!
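
A typical use of a compiled pattern is to define it once and apply it to many separate strings. The sketch below assumes a short list of sample lines; the names `EMAIL`, `lines` and `domains` are chosen for this example.

```python
import re

# Compile once, outside the loop.
EMAIL = re.compile(r"(?P<name>\w+)@(?P<SLD>\w+)\.(?P<TLD>\w+)")

# Hypothetical input lines; the second contains no email address.
lines = ["Ahab: ahab@pequod.com",
         "no address here",
         "Herman: herman@acushnet.io"]

domains = []
for line in lines:
    m = EMAIL.search(line)
    if m:  # search() returns None when nothing matches
        domains.append(f"{m.group('SLD')}.{m.group('TLD')}")

print(domains)  # ['pequod.com', 'acushnet.io']
```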



## Summary

In this chapter we've introduced sorting data, file objects, the datetime library, and the re library. Having at least a passing knowledge of these is important for any Python developer. Sorting is done with either the sorted() function or an object's sort() method, such as the one attached to list objects. Files can be opened using the open() function. While open, they can be read from or written to. The datetime library models time and is particularly useful when dealing with time series data. And finally, the re library can be used to define complicated searches of texts.