# Extra Practice for Python Camp

## Agenda & Instructions

This notebook provides a series of exercises designed as extra practice for the material covered in the daily lessons and the homework. You are by no means required to work through these exercises, but doing so may prove helpful in a couple of situations:

- If you want to review a particular concept in order to solidify your understanding.
- If you want to build more "muscle memory" by writing code beyond the opportunities provided by the group activities and homework.

The exercises below are arranged by concept, largely following the order in which the concepts were presented in the course materials. The code in each section is self-contained, so feel free to jump around and/or pick only certain sections to do. 

Answers to the challenges in each section are provided in a collapsed (Hint) cell at the bottom of the section.

````{admonition} How to Use this Notebook
:class: how-to

1. Read the documentation above each cell containing code and run the cell (`Ctrl+Enter` or `Cmd+Return`) to view the output.


2. Follow the prompts labeled `Try it out!` that ask you to write your own code in the provided blank cells.


3. (Hidden) solutions to these exercises follow the blank cells; click the toggle bar to expand the solution to compare with your approach.


4. Some prompts include alternative exercises (Parsons Problems) that will be linked from the prompt. These alternatives may help clarify concepts (especially if you find yourself struggling to keep up with all the syntax).


5. Optional annotations (labeled `For the curious...`) provide additional explanation and/or context for those who want them. Feel free to skip these sections if you like. As a beginner, it's important to maintain a balanced cognitive load: taking in too much information all at once can impede your progress toward understanding. This balance looks different for everyone, but we have tried to keep the main content focused on a few key concepts, tools, and techniques, while providing that additional context for those who might benefit from it.

````

## Integers, floats & strings





Li X. is a graduate student working in a bioinformatics lab, and he is learning how to work with protein and DNA sequences using Python. His first task is to compute the lengths of various sequences, three examples of which are given below. Write some code to compute the length of each of these sequences.

In [5]:
seq1 = "LYLIFGAWAGMVGTALSLLIRAELGQPGTLLGDDQIYNVIVTAHAFVMIFFMVMPIMIGGFGNWLVPLMIGAPDMAFPRMNNMSFWLLPPSFLLLLASSTVEAGAGTGWTVYPPLAGNLAHAGASVDLAIFSLHLAGVSSILGAINFITTAINMKPPTLSQYQTPLFVWSVLITAVLLLLSLPVLAAGITMLLTDRNLNTTFFDPAGGGDPVLYQHLFWFFGHPEVYILIL"
seq2 = "VGTALXLLIRAELXQPGALLGDDQIYNVVVTAHAFVMIFFMVMPIMIGGFGNWLVPLMIGAPDMAFPRMNNMSFWLLPPSFLLLMASSTVEAGAGTGWTVYPPLAGNLAHAGASVDLAIFSLHLAGISSILGAINFITTAINMKPPALSQYQTPLFVWSVLITAVLLLLSLPVLAAGITMLLTDRNLNTTFFDPAGGGDPVLYQHLFWFFGHPEVYILIL"
seq3 = "WAGMVGTALSLLIRAELGQPGALLGDDQIYNVVXTAHAFVMIFFMVMPIMIGGFGNWLVPLMIGAPDMAFPRMNNMSFWLLPPSFLLLMASSTVEAGVGTGWTVYPPLAGNLAHAGASVDLAIFSLHLAGISSILGAINFITTAINMKPPALSQYQTPLFVWSVLITAVLLLLSLPVLAAGITMLLTDRNLNTTFFDPAGGGDPVLYQHLFWFFGHPEVYILIL"

In [6]:
#Your code here

Now Li needs to find the average length of a collection of sequences, `seq1` through `seq3` above, in addition to the following two sequences. Write some code to find the average length of all five sequences.

In [7]:
seq4 = "VGTALSLLIRAELGQPGTLLGDDQIYNVIVTAHAFVMIFFMVMPVMIGGFGNWLVPLMIGAPDMAFPRMNNMSFWLLPPSFLLLLASSTVEAGAGTGWTVYPPLAGNLAHAGASVDLAIFSLHLAGVSSILGAINFITTAINMKPPALSQYQTPLFVWSVLITAVLLLLSLPVLAAGITMLLTDRNLNTTFFDPAGGGDPVLYQHLFWFFGHPEVYILIL"
seq5 = "LYLIFGAWAGMVGTALSLLIRAELGQPGALLGDDQVYNVVVTAHAFVMIFFMVMPIMIGGFGNWLVPLMIGAPDMAFPRMNNMSFWLLPPSFLLLLASSTVEAGVGTGWTVYPPLAGNLAHAGASVDLAIFSLHLAGISSILGAINFITTAINMKPPALSQYQTPLFVWSVLITAVLLLLSLPVLAAGITMLLTDRNLNTTFFDPAGGGDPVLYQHLFWFFGHPEVYILIX"

In [8]:
#Your code here

The lab director is pleased with Li's progress and gives him another task. The lab is trying out some new software, and this software requires that each sequence used as input be prefixed with a number representing the length of the sequence, followed by a space. So for `seq1` above, the input would look as follows (the sequence has been abbreviated for display purposes):
```
231 LYLIFGAWAGMVGTALSLL....
```
Write some code to add the sequence length (and a space) to the beginning of each of the sequences given above.

In [12]:
#Your code here

So far Li really likes working with Python; he can see how it will make his work in the lab a lot more efficient. One thing that confuses him, however, is that whole numbers in Python are represented sometimes as one-place decimals, e.g., `30.0`, and sometimes without the decimal part, e.g., `30`. Surely Python isn't being arbitrary! Comparing the output of the following two lines of code, can you tell why the decimal part sometimes appears when dealing with whole numbers?

In [None]:
len(seq1)

In [None]:
len(seq2) / 10

````{hint} Solutions
:class: dropdown

**Finding the length of strings**

Li's sequences are represented as Python {term}`string`s -- we can tell because each sequence is surrounded by quotation marks (`""`). In Python, a string has a length, which we can find by using the built-in `len()` {term}`function`. To find the length of the string represented by the variable `seq1`, we write
```
len(seq1)
```

To find the average length of all five sequences, we add up the lengths of the sequences and divide by the total number of sequences (5). Note the use of parentheses in the code below, which ensures that the addition happens before the division. (Python's order of operations is similar to what you'd expect from a calculator.)
```
avg_length = (len(seq1) + len(seq2) + len(seq3) + len(seq4) + len(seq5)) / 5
```
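
The same calculation can also be sketched with a list and the built-in `sum()` function, which avoids typing out every variable name. This is only an illustration: the short sequences below are made-up placeholders (not Li's data), and lists may not have come up yet at this point in the course.

```python
# Made-up placeholder sequences, much shorter than Li's real data
sequences = ["LYLIF", "GAWAGMV", "VGTAL", "WAG", "MVGTALSLLI"]

# Add up the length of every sequence in the list
total_length = sum(len(seq) for seq in sequences)

# Divide by the number of sequences; len() also works on lists
avg_length = total_length / len(sequences)

print(total_length)  # 30
print(avg_length)    # 6.0
```

Because the divisor is `len(sequences)` rather than a hard-coded `5`, this version keeps working if sequences are added to or removed from the list.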

**Concatenating strings**

To prefix the length of the sequence to the sequence itself, we need to make a new string. In Python, we can join two strings together -- called concatenation -- by using the `+` operator. Note that in the context of strings, the `+` operator does not perform numeric addition; it performs concatenation. In other words, `1 + 2` produces `3` (addition), but `"1" + "2"` produces `"12"` (concatenation), because the quotation marks around the digits in the second example tell Python to treat them as strings, not numbers.

As a result, the following code will produce a `TypeError`:

```
new_seq1 = len(seq1) + seq1
```
Here we asked Python to "add" an integer (the result of `len(seq1)`) to a string (the value of `seq1`), which is an undefined operation in Python. To remedy the error, we need to make sure that the values on both sides of the plus sign are strings. We can use the built-in `str()` function to convert the integer result of `len()` to a string. While we're at it, we also concatenate a single space (between quotation marks) as required in the instructions:
```
new_seq1 = str(len(seq1)) + " " + seq1
```
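
To see both behaviors side by side, here is a small sketch that uses a made-up ten-character sequence in place of Li's data. The `try`/`except` statement and the f-string at the end go beyond what this section requires; they are included only to show the error without stopping the program, and to show one alternative way of building the same string.

```python
seq = "LYLIFGAWAG"  # made-up 10-character placeholder sequence

# Adding an integer to a string raises a TypeError
try:
    broken = len(seq) + seq
except TypeError as error:
    print("TypeError:", error)

# Converting the length to a string first makes concatenation work
prefixed = str(len(seq)) + " " + seq
print(prefixed)  # 10 LYLIFGAWAG

# An f-string (formatted string literal) builds the same result
prefixed_f = f"{len(seq)} {seq}"
print(prefixed_f)  # 10 LYLIFGAWAG
```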

**Floats vs. integers**

- `len(seq1)` returns `231` (no decimal part) because the `len()` function is defined always to return a value of type {term}`integer`. That makes sense; nothing in Python that has a length (e.g., a string, a list) is ever going to have a fractional length. 
- `len(seq2) / 10` returns `22.0`, a whole number shown with one decimal place, because division in Python (the `/` operator) is defined always to return a value of type {term}`float`. It returns a float even in cases where the result _could be_ represented as an integer (as in this case). Python behaves this way in order to provide consistency: as a programmer, you can rely on the division operator always to return the same data type. Generally speaking, integers and floats in Python are interoperable, meaning that you can mix both freely in most types of calculation.
- If for some reason you need to convert a float to an integer, you can use the `int()` function: e.g., `int(len(seq2) / 10)` returns `22` (no decimal part). Note that `int()` discards any fractional part rather than rounding, so `int(2.9)` returns `2`.
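
The points above can be checked with a short experiment. The sequence here is a made-up 24-character placeholder, and the floor-division operator `//` on the last line is an extra not discussed in this section, shown only for comparison.

```python
seq = "LYLIFGAWAGMVGTALSLLIRAEL"  # made-up 24-character placeholder

n = len(seq)
print(n, type(n))          # 24 <class 'int'>: len() returns an int

tenth = n / 10
print(tenth, type(tenth))  # 2.4 <class 'float'>: / returns a float

even = 20 / 10
print(even)                # 2.0: still a float, even though it divides evenly

print(int(tenth))          # 2: int() drops the fractional part
print(n // 10)             # 2: floor division of two ints returns an int
```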

````

## Working with strings (1)

Alia M. is a social scientist working with data on residential housing patterns. She has a large datafile of residential properties in the United States, consisting of street addresses and some information about the type of housing at each address. For each row in the datafile, the first part consists of a {term}`string` that represents the property's street address, including city, state abbreviation, and zip code. The address is preceded by a seven-character unique identifier. One such string is given below.

In [1]:
address1 = "AGF5670 2123 N. 3rd St., St. Cloud, MN 56301"

Alia's first task is to extract the unique identifier (`AGF5670` in the example above) from each string. How can she use a string {term}`slice` to accomplish this task? Write some code that extracts the identifier from the string stored in the `address1` variable.

In [2]:
#Your code here

Next Alia would like to extract the five-digit zip code from the address string. (The zip code is `56301` in the example provided.) What code can she use to do that? (Assume that all zip codes in the datafile consist of five digits.) 

In [3]:
#Your code here

Great work! This code is really going to make Alia's life easier -- no more manual data entry for her! The next challenge, as you might guess, is to extract the US postal abbreviation for the state. Assume that the addresses in this datafile all have a two-letter abbreviation (`MN`, for Minnesota, in the example above).

In [4]:
#Your code here

Finally, Alia needs to extract the street address itself, together with the city name: e.g., `2123 N. 3rd St., St. Cloud`. The length of this portion of the address varies throughout the datafile, so ideally, your code would work for any string fitting the above pattern, regardless of length.

In [5]:
#Your code here

To make sure your code works properly on addresses of different lengths, try extracting the different parts of `address2` below.

In [2]:
address2 = "WGA9753 412 Mockingbird Lane, Crowley, LA 70506"

In [None]:
#Your code here

````{hint} Solutions
:class: dropdown

There are other ways to accomplish Alia's tasks, but all of the above can be done using slicing. How can we determine that? Looking at the example given, and assuming that the rest of the datafile is consistent, we have the following pattern:

$$
\underbrace{\text{AGF5670}}_\text{7}\text{ 2123 N. 3rd St., St. Cloud, }\underbrace{\text{MN}}_\text{2}\text{ }\underbrace{\text{56301}}_\text{5}
$$

Alia wants to subdivide the string into four elements, three of which have a fixed length: the identifier, the state abbreviation, and the zip code. Only the street address and city are variable, and if we treat the latter as a single element, we can construct slices for all four elements using what we know about the lengths of the first three, together with negative indexing.

The table below shows the lengths of each element along with the indices for slicing (taking into account the white space between identifier, address, state, and zip code):

|Element|Length|Slice From|Slice To|Slice Value|
|-|-|-|-|-|
|identifier|7 characters|0|7|`"AGF5670"`|
|street address & city|varies|8|-9|`"2123 N. 3rd St., St. Cloud,"`|
|state abbreviation|2 characters|-8|-6|`"MN"`|
|zip code|5 characters|-5|(omitted)|`"56301"`|


1. To extract the **identifier**, we write: `address1[0:7]` or `address1[:7]`. (These expressions are equivalent.)
2. To extract the **zip code**, we count back five characters from the end, starting with `-1`, which yields `address1[-5:]`. Note that we leave off the second number in the slice (after the colon) because we want to take a slice _including_ the last character.
3. To extract the **state abbreviation**, we count back again from the end to the `-8` position (the `M`) and add 2 to get our slice: `address1[-8:-6]`.
4. To extract the remaining part of the address, we start from the second position _after_ the end of the 7-character identifier (to account for the white space, which we don't want in our slice), and we take _up to_ (but _not_ including) the 9th character from the end (the space before the state code): `address1[8:-9]`.

We could use slightly different code for step 4 in order to exclude the comma after the city name (`"St. Cloud"`), i.e., `address1[8:-10]`.
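
As a check that these slices do not depend on the length of the address, here is a sketch that applies the same expressions to `address2` from the exercise. The variable names on the left are illustrative choices, not names used elsewhere in the notebook.

```python
address2 = "WGA9753 412 Mockingbird Lane, Crowley, LA 70506"

identifier = address2[:7]          # first seven characters
street_and_city = address2[8:-10]  # stops before the comma after the city
state = address2[-8:-6]            # two characters before the final space
zip_code = address2[-5:]           # last five characters

print(identifier)       # WGA9753
print(street_and_city)  # 412 Mockingbird Lane, Crowley
print(state)            # LA
print(zip_code)         # 70506
```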

**N.b.** We could have also accomplished these tasks using Python's `str.split()` method instead of slicing. See the next section for details.  

````

## Working with strings (and lists)

Social scientist Alia has some additional data points in her file that she wishes to extract. These data consist of two alphanumeric codes and a four-digit year. The first element indicates the type of dwelling, the second indicates whether the property is a rental or occupant-owned, and the third indicates the year in which the property was built. The following example shows the string for a single-family rental property built in 1986.

In [19]:
address_info1 = "SF,R,1986"

Alia is feeling pretty confident about Python string slicing, so her first approach is to write three {term}`slice` expressions to extract the information between the commas in the string above. Can you write those expressions below, assigning each part of the string (without the commas) to a new variable?

In [20]:
#Your code here

Alia is pleased with herself until she realizes that not all of the codes in the datafile are of the same length. The code in the first position can be either "SF" (single-family), "MF" (multi-family), or "Mixed" (for properties with a mix of residential and commercial spaces). Likewise, the second code can be either "R" (rental), "O" (owned), or "n/a" (a null value, used when the information was not available). 

Since two of the three elements in this string are of variable lengths, it's not easy to write slice expressions that will extract all three elements. Fortunately, Alia has learned about the [str.split()](https://docs.python.org/3.3/library/stdtypes.html#str.split) method. Write some code that will split the `address_info1` string into three parts, and assign each part to a new variable.

In [21]:
#Your code here

Now verify that your approach works for strings with elements of different lengths: `address_info2` and `address_info3` below.

In [22]:
address_info2 =  "MF,n/a,1996"

In [23]:
#Your code here

In [24]:
address_info3 = "Mixed,R,2007"

In [25]:
#Your code here

"That split method is pretty cool!" Alia thinks. She wonders whether it would be a good idea to split the address strings as well (see the introduction to [Working with Strings (1)](#working-with-strings-1)), instead of slicing them.

Given the `address1` variable below, what happens if you split it on the {term}`white space`? Do you think this would be a good approach for separating out the elements of the string: the seven-character identifier, the street address and city name (of variable length), the two-letter state abbreviation, and the five-digit zip code? Why or why not?

In [27]:
address1 = "AGF5670 2123 N. 3rd St., St. Cloud, MN 56301"

In [29]:
#Your code here

Using the result of splitting `address1`, can you extract the identifier, state abbreviation, and zip code?

In [30]:
#Your code here

**Bonus** Now Alia is having trouble working with the street address and city name in the result of splitting `address1`. A colleague mentioned the [str.join()](https://docs.python.org/3.3/library/stdtypes.html#str.join) method, which does the inverse operation of `str.split()`. 

Can you use list slicing and the `join` method to produce a single string corresponding to the street address and city name in `address1`, starting from the result of splitting the latter?

````{hint} Solutions
:class: dropdown

Since the three elements in the `address_info` strings are separated by commas, we can use the `str.split()` method with its optional first argument, `sep`. Note that the Python [documentation](https://docs.python.org/3.3/library/stdtypes.html#str.split) tells us the following:

> If `sep` is not specified or is `None`, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace.

In other words, if we write `address_info1.split()` with nothing between parentheses, the `sep` argument will be `None` by default, which means that the string will be split on {term}`white space`, with any number of consecutive spaces being treated as a single instance (a single separator). That's the default behavior, but we can modify this by providing a single argument to `split` for the separator, either with or without the argument name:

```
address_info1.split(",")
```
and 
```
address_info1.split(sep=",")
```
work equally well, giving the {term}`list` `["SF", "R", "1986"]` as the result.
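
The difference between the default behavior and an explicit separator is easiest to see on a string with several spaces in a row. The string below is a made-up example, not a row from Alia's datafile.

```python
spaced = "MN   56301"  # made-up example with three spaces in a row

# Default: a run of white space counts as a single separator
default_parts = spaced.split()
print(default_parts)   # ['MN', '56301']

# Explicit " ": every single space is a separator, leaving empty strings
explicit_parts = spaced.split(" ")
print(explicit_parts)  # ['MN', '', '', '56301']

# Explicit ",": splits the comma-separated codes from the exercise
print("SF,R,1986".split(","))  # ['SF', 'R', '1986']
```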

The same approach works also for `address_info2` and `address_info3`. In each case, we can access the individual elements by {term}`index`ing into the list returned by `split`:
```
result1 = address_info1.split(",")
property_type1 = result1[0]
rent_or_owned1 = result1[1]
year_built1 = result1[2]
```
Assuming that the strings with this information always have _two and only two_ commas, the `split` method will always return a three-element list. 
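
When the list is known to have exactly three elements, the indexing step can be shortened by assigning the result of `split` to three names at once (sometimes called unpacking). The sketch below assumes well-formed strings with exactly two commas; the `for` loop and the `records` list are illustrative additions, not part of the exercise.

```python
records = ["SF,R,1986", "MF,n/a,1996"]

for record in records:
    # split(",") returns a three-element list, unpacked here into three names
    property_type, rent_or_owned, year_built = record.split(",")
    # The year is still a string after splitting; int() converts it to a number
    print(property_type, rent_or_owned, int(year_built))
```

If a string had more or fewer than two commas, the unpacking line would raise a `ValueError`, which can be a useful signal that a row in the datafile is malformed.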

We could take the same approach with the `address1` string (`"AGF5670 2123 N. 3rd St., St. Cloud, MN 56301"`), splitting this time on the whitespace (the default):
```
address_parts1 = address1.split()
```
Now `address_parts1` will be the following list: `['AGF5670', '2123', 'N.', '3rd', 'St.,', 'St.', 'Cloud,', 'MN', '56301']`

Assuming the strings in Alia's datafile are consistently formed, we could extract the identifier, state abbreviation, and zip code as follows:
```
identifier1 = address_parts1[0]
state_abbreviation1 = address_parts1[-2]
zip_code1 = address_parts1[-1]
```
Extracting the street address and city proves more difficult, however. In our example, each part of the street address (`"2123 N. 3rd St.,"`) is separated by white space, as are the two parts of the city name (`"St. Cloud,"`). This situation poses two problems for using the `str.split()` method:
  - It's not going to be terribly useful -- at least, not for Alia's purposes -- to treat the parts of the street address or city name as different elements. For instance, the "N." and the "St." in "N. 3rd St." don't really mean anything on their own, and the same goes for "St." and "Cloud" in "St. Cloud." 
  - Different addresses and cities will involve different amounts of white space. Consider "3150 Elm St." or "Roanoke."

To extract the address/city name element from our original address string using split, we could take the following approach:

1. Split the original string on white space, producing a list.
```
address_parts1 = address1.split()
```
2. Take a slice _of the resulting list_ (not the original string) in order to get the elements between the first (the identifier) and the second-to-last (the state abbreviation).
```
street_address1 = address_parts1[1:-2]
```
3. Note that `street_address1` is also a list. (Slicing a string returns a string; slicing a list returns a list.) We can use this list with the `str.join` method, which takes as its argument a list (or any other iterable) of strings and returns a string that consists of each element in the list separated by whatever string the method is called on. The following code creates a new string out of the street address/city name elements by gluing them together (so to speak) with white space:
```
street_and_city1 = " ".join(street_address1)
```
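
Putting the three steps together, here is a sketch that runs them on `address2` from the previous section. The final `rstrip(",")` call, which removes the trailing comma after the city name, is an optional extra and assumes that the comma is not wanted in the result.

```python
address2 = "WGA9753 412 Mockingbird Lane, Crowley, LA 70506"

# Step 1: split on white space to get a list of words
address_parts2 = address2.split()

# Step 2: slice the list, dropping the identifier, state, and zip code
street_address2 = address_parts2[1:-2]

# Step 3: glue the remaining words back together with single spaces
street_and_city2 = " ".join(street_address2)
print(street_and_city2)  # 412 Mockingbird Lane, Crowley,

# Optional: remove the trailing comma left over from the original string
trimmed2 = street_and_city2.rstrip(",")
print(trimmed2)          # 412 Mockingbird Lane, Crowley
```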

Here, splitting isn't necessarily a better approach to Alia's first problem than slicing. But the larger point is that there is almost always more than one way to solve a problem with Python, and which approach you choose might come down to your preferences as a programmer.

````