# From Code to Data

## Objectives

- To review the basic data types and syntactic patterns of Python.
- To build more complex structures out of these basic data types.
- To explore tools for cleaning and transforming data.
- To use control structures to clean and transform data in an automated way.

## Agenda & Instructions

This notebook is intended for you to work through independently, in order to review and clarify the concepts introduced on Python Camp Day 1 and to lay the groundwork for the activities on Python Camp Day 2. However, feel free to collaborate with others in working through it. It is also intended to serve as a resource you can return to and review as necessary.

In this homework, you will cover most of the fundamental tools of the Python programming language that you need to work with data. You've already encountered these tools in today's team activities: _Choreographing Code_ and _From Data to Code_. But these homework exercises will unpack their syntax and explain how to use them. Along the way, you'll cover the following concepts:

- Working with numbers in Python
- Working with text (strings) in Python
- Slicing and splitting strings to extract textual data
- Working with lists of numbers or strings
- Using loops to process data in lists

If you have little or no prior programming experience, you may find this homework challenging. But please try to work through all the exercises, even if you lose the thread or can't make sense of the provided sample solutions. Just make a note of where you got stuck: we will review the homework and address your questions tomorrow in class.

#### How to Use this Notebook

1. Read the documentation above each cell containing code and run the cell (`Ctrl+Enter` or `Cmd+Return`) to view the output.


2. Follow the prompts labeled `Try it out!` that ask you to write your own code in the provided blank cells.


3. (Hidden) solutions to these exercises follow the blank cells; click the toggle bar to expand the solution to compare with your approach.


4. Some prompts include alternative exercises (Parsons Problems) that will be linked from the prompt. These alternatives may help clarify concepts (especially if you find yourself struggling to keep up with all the syntax).


5. Optional annotations (labeled `For the curious...`) provide additional explanation and/or context for those who want them. Feel free to skip these sections if you like. As a beginner, it's important to maintain a balanced cognitive load: taking in too much information all at once can impede your progress toward understanding. This balance looks different for everyone, but we have tried to keep the main content focused on a few key concepts, tools, and techniques, while providing that additional context for those who might benefit from it.



## I. Numbers and text

All computer programs, and all data consumed by computers, translate into sequences of electronic pulses that can be represented in binary code (1's and 0's). 

But for our purposes -- and the purposes of much programming -- the most basic elements of data we deal with are **numbers** and **text**. That means that the data we're actually interested in -- the prices of textbooks, for instance, or the names of courses in the GW course catalog -- will be represented in one of these two forms.

### I.1  Working with numbers

We can represent a single numeric value in Python -- like the price of a textbook, or the enrollment in a course -- by typing the number directly, _without_ using [quotation marks](https://gwu-libraries.github.io/python-camp/glossary.html#term-quotation-marks). 

Here we create two variables to store two different numeric values.

In [9]:
book_price = 99.95
num_students = 55

#### Try it out!

Because the values are numeric, we can use these variables in calculations. In the cell below, create a new variable called `total_cost` that represents `book_price` multiplied by `num_students`. (Hint: in Python, we use the asterisk (`*`) to do multiplication.)  Use the `print()` function to display the result.



In [2]:
# Your code here

Now check your answer by expanding the hidden solution cell below.

In [4]:
total_cost = book_price * num_students
print(total_cost)

### I.2 Flavors of numbers

In Python, numbers come in two main flavors: **integers** and **floats**. We can use Python's `type()` function to expose the [type](https://gwu-libraries.github.io/python-camp/glossary.html#term-type) of any value or [variable](https://gwu-libraries.github.io/python-camp/glossary.html#term-variable).

In [5]:
type(book_price)

In [6]:
type(num_students)

A **float** is any number that contains a decimal point. An **integer** (`int`) is what we call a whole number. 

Unlike some programming languages, Python handles the conversion between these two types automatically when a calculation involves both. 

#### Question

Can you guess what type `total_cost` will be? Run the following cell to find out.



In [None]:
type(total_cost)
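
As a further illustration of automatic conversion, here is a minimal sketch using made-up values. The variable names below are illustrative only and are not used elsewhere in this homework.

```python
# Illustrative values only; these names are not part of the homework
whole = 3          # an integer
fraction = 2.5     # a float

# Mixing an integer and a float in a calculation produces a float
mixed_result = whole * fraction
print(type(mixed_result))

# Multiplying two integers produces another integer
int_result = whole * 4
print(type(int_result))

# Division with / produces a float, even when the answer is a whole number
div_result = 6 / 3
print(div_result, type(div_result))
```

The general pattern: as soon as a float is involved in a calculation, the result is a float.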

#### For the curious

Why two different numeric types? It has to do with how numbers are stored in the computer's memory. Decimal fractions are handled by Python (and many other languages) using floating-point representation (hence the word "float"). In many cases, the technical nuances of floating-point math won't matter, and you can use a float just as you would any number with a decimal point. But in certain situations, floats can exhibit some weird behavior. 

The following is a classic example. Run this code in a code cell; the output may surprise you!

```
# Evaluates whether adding 0.1 and 0.2 equals 0.3
0.1 + 0.2 == 0.3  
```
For more on this topic, David Amos' [blog post](https://davidamos.dev/the-right-way-to-compare-floats-in-python/) has tips on working with floats accurately in Python.
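
One common way to handle this kind of comparison is `math.isclose()` from Python's standard library, which checks whether two numbers are equal within a small tolerance. The following is a minimal sketch, not the only approach; the variable name `total` is just for illustration.

```python
import math

total = 0.1 + 0.2

print(total)                      # reveals a tiny rounding error
print(total == 0.3)               # False: the exact comparison fails
print(math.isclose(total, 0.3))   # True: equal within a small tolerance
```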



### I.3 Don't be scared of quotes

Above we saw that Python allows us to perform calculations involving both integers and floats. Things are not quite so seamless with non-numeric types. Run the following code: instead of valid output, you should see an error message.  

In [None]:
book_price = '$99.95'
tax = book_price * .1
print(book_price + tax)

The `TypeError` informs us that we can't perform multiplication in this instance because `book_price` is not of the right [type](https://gwu-libraries.github.io/python-camp/glossary.html#term-type). 

#### Notes

We saw prices represented this way -- `"$99.95"` -- in the bookstore dataset. The quotation marks indicate that this value is a **string**. As far as Python is concerned, *anything* between quotation marks is a string. 

You can think of the quotation marks as a container: whatever you put into the container will be treated as an instance of the `str` type. 

You can use either **single** (`''`) or **double** (`""`) quotation marks: Python doesn't care. But in any given string, the quotation marks must match: a string that starts with a double quote and ends with a single quote will produce an error.



#### Try it out!

Create two variables to hold the title and author of a textbook. The title and author can be anything you like; the variable names should be `book_title` and `book_author`, respectively.



In [1]:
# Your code here

Now check your answer by expanding the hidden solution cell below.

In [None]:
book_title = 'Organic Chemistry'
book_author = 'John P. Bunsen'

### I.4 Strings vs. variable names

Strings in Python can consist of any sequence of [Unicode](https://gwu-libraries.github.io/python-camp/glossary.html#term-Unicode) characters between quotation marks. That includes letters, numbers, characters from non-Roman scripts (like Arabic or Chinese), even emojis. If you can type it on your keyboard, you can probably represent it in a Python string.

Python **variable names**, by contrast, are much more restrictive. 
1. They _must_ begin with an alphabetic character (which can include characters from non-Roman alphabets) or an underscore. 
2. They _cannot_ contain spaces.
3. They _may_ contain numerals (but not at the beginning of the name).
4. The only permitted punctuation in a variable name is the **underscore** (`_`). 

The following are some examples of valid and invalid Python variable names.

| Name | Valid? | Reason |
| -----|--------|--------- |
|`my_name` | Yes |         |
|`book_title_2` | Yes |    |
|`$price` | No | Uses punctuation not allowed |
| `2nd_book` | No | Begins with a number |
| `course year` | No | Spaces not allowed |
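
If you are unsure whether a name is valid, one way to check is the string method `isidentifier()`, which reports whether a string follows Python's naming rules. This sketch tests the names from the table by writing each one as a string:

```python
# Each candidate name is written as a string so that it can be tested
print('my_name'.isidentifier())        # True
print('book_title_2'.isidentifier())   # True
print('$price'.isidentifier())         # False
print('2nd_book'.isidentifier())       # False
print('course year'.isidentifier())    # False
```

Note that `isidentifier()` checks only the characters in the name. It does not flag words that Python reserves for its own syntax, such as `for`.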



### I.5 How to do things with strings: Slicing

When working with textual data, keep in mind that Python _doesn't know anything_ about the meaning of what's inside the quotes. It has no concept of words, punctuation, etc. -- all the stuff that we as humans rely on to communicate effectively (elements of so-called _natural languages_). 

A Python string is just a **collection** of **characters**. Imagine spelling out words in Scrabble, or with wooden alphabet blocks. 

That said, Python strings are also surprisingly flexible. Python provides a lot of tools to make working with them easy, starting with the fact that each character in the string has a well-defined position, which we call the **index**. We can use the indices of characters to extract information from parts of a string.

The following code defines two string variables that hold information about a course and a term.

In [14]:
course = 'CHEM 1002 10'
term = 'Summer 2023'

What if we want to extract the department code, the course number, and the section number from the `course` variable? 

By counting characters, we can see the following:
- The department code occupies the first four (4) index positions. With strings, the first position is labeled `0`, not `1`, so the first 4 characters would fall in positions `0`, `1`, `2`, and `3`.

|0|1|2|3|4|5|6|7|8|9|10|11|
|-|-|-|-|-|-|-|-|-|-|-|-|
|C|H|E|M| |1|0|0|2| |1|0|

- The course number occupies four more positions, but we also have to account for the intervening space: `4` (the space), then `5`, `6`, `7`, `8`.

We can use this information to **slice** our `course` variable as follows:

In [24]:
dept_code = course[0:4]         # Positions 0 through 3
course_num = course[5:9]        # Positions 5 through 8
print(dept_code)
print(course_num)

#### Notes

Note that in slicing a string, we provide -- inside [square brackets](https://gwu-libraries.github.io/python-camp/glossary.html#term-square-brackets) -- the first index (counting from 0, not 1), followed by a [colon](https://gwu-libraries.github.io/python-camp/glossary.html#term-colon), followed by the _last index plus one_. The colon means _up to BUT NOT including_. 
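
A few quick checks can help make the counting concrete. This sketch redefines `course` with the same value as above so that it runs on its own; `first_char` is an illustrative name.

```python
course = 'CHEM 1002 10'

# A single index (no colon) returns exactly one character
first_char = course[0]
print(first_char)

# The character at index 4 is the space after the department code
print(course[4] == ' ')

# The length of a slice is the second number minus the first: 9 - 5 = 4
course_num = course[5:9]
print(len(course_num))
```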



#### Try it out!

Create a variable called `section_num` to hold the two-digit section number at the end of the `course` string, and `print` the variable.



In [None]:
# Your code here

Now check your answer by expanding the hidden solution cell below.

In [55]:
section_num = course[10:12]
print(section_num)

Because it's common to want to slice off the first part of a string and take the rest of it up to the end, we could actually write the above as follows, leaving off the number after the colon:

```
section_num = course[10:]
```

The `[10:]` means, _start at index 10 (the 11th character) and extract everything up to AND including the last character of the string_.

What if we don't know the exact position of the characters we want to extract? 

Look at the `term` variable as defined above. 

```
term = 'Summer 2023'
```

Let's say that we expect the term data to consist of the name for a particular semester -- `Summer`, `Spring`, or `Fall` -- plus a 4-digit year. 

Do you see the problem here? The name of the term can be either 4 characters long (`Fall`) or 6 (`Summer`, `Spring`). 

Fortunately, Python has us covered, because it lets us _count backwards_ as well as forwards when slicing a string. We just use negative numbers! Since there's no such number as `-0`, the last character in a string has the position `-1`. 

|-11|-10|-9|-8|-7|-6|-5|-4|-3|-2|-1|
|-|-|-|-|-|-|-|-|-|-|-|
|S|u|m|m|e|r| |2|0|2|3|



To extract the last 4 characters (the year) from the `term` variable, we can use this slice:

In [34]:
term_year = term[-4:]
print(term_year)

The colon with no number after it means "slice up to and _including_ the end of the string." It's useful in cases where we don't know (in advance) how long the string will actually be. (In this case, with our three different terms, the string could be either 9 or 11 characters long.) 

We can also mix negative and positive positions in slicing. To start from the first character and slice **up to** (but not including) the fifth character from the end, we could use the following slice:

In [35]:
term_name = term[0:-5]
print(term_name)
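
To confirm that the same slices work for a shorter term name, here is a sketch using a hypothetical `fall_term` value:

```python
fall_term = 'Fall 2023'       # 9 characters instead of 11

fall_year = fall_term[-4:]    # the last four characters
fall_name = fall_term[0:-5]   # everything before the space and the year

print(fall_year)   # 2023
print(fall_name)   # Fall
```

Because both slices count from the end of the string, they do not depend on how long the term name is.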

### I.6 How to do things with strings: Splitting

Strings in Python come with a lot of built-in functionality. One of the most useful is a [method](https://gwu-libraries.github.io/python-camp/glossary.html#term-method) called `split()`. 

Take a look at what happens when we call the `split()` method on our `course` variable.

In [41]:
course.split()

Let's pause here and note a few things:

1. There is a **period** (`.`) between `course` and `split()`. This indicates that the `split()` method belongs to our `course` variable. We get access to this method whenever we use a string value. 

  And we can call it on string values directly: `'CHEM 1002 10'.split()` produces the same output.
  

2. The **parentheses** after the word `split` are required. We've seen parentheses in calling the `print()` and `type()` functions, too. Here the parens are empty because we're not providing any [arguments](https://gwu-libraries.github.io/python-camp/glossary.html#term-arguments) to `split()`. (Python knows what we want to split because of the period attaching the method to the `course` variable.)


3. The output from `course.split()` is a **list** of three separate strings. A Python list is enclosed with [square brackets](https://gwu-libraries.github.io/python-camp/glossary.html#term-square-brackets) and contains items separated by [comma](https://gwu-libraries.github.io/python-camp/glossary.html#term-comma)s.

#### Try it out!

Use `split()` on the `term` variable (defined above) and compare the output with that of `course.split()`. 

Can you tell how `split()` works? How does it know where to separate the string?



In [None]:
# Your code here

Now check your answer by expanding the hidden solution cell below.

If you ran the code `term.split()`, you should have gotten output like this:

`['Summer', '2023']`

As you might have guessed, `split()` separates a string on the **white space** (which can be a regular space, a tab, or a line break). 

If you imagine a string as a six-pack of some canned beverage, the spaces are like the plastic rings holding the six-pack together. When we call `split()` on the string, that's like pulling each can out of the six-pack and discarding the plastic rings: you are left with six individual cans (a list of strings without spaces).

You can use `split()` with other characters, too. We'll see some examples later.
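
As a brief preview, here is a sketch of `split()` with a separator placed inside the parentheses. The values below are made up for illustration and are not taken from the bookstore dataset.

```python
# Hypothetical example values
date = '2023-06-15'
date_parts = date.split('-')      # split on the hyphen
print(date_parts)                 # ['2023', '06', '15']

author = 'Bunsen, John P.'
name_parts = author.split(', ')   # split on a comma followed by a space
print(name_parts)                 # ['Bunsen', 'John P.']

# With no argument, tabs (\t) and line breaks (\n) count as white space too
messy = 'CHEM\t1002\n10'
messy_parts = messy.split()
print(messy_parts)                # ['CHEM', '1002', '10']
```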


## II. Loops and lists

We have seen that when we call the `split()` method on a string, Python turns the string into a list of strings. 

With this conversion, we get some added structure. A string can hold any sequence of valid Unicode characters; a list can hold any sequence of _valid Python values_. Even other lists!

In the team activity with the `bkst_data` dataset, you encountered some ways to work with lists. The following reviews those and adds a few more.

### II.1 Accessing items

We can use an [integer](https://gwu-libraries.github.io/python-camp/glossary.html#term-integer) inside square brackets to access the item at a single position ([index](https://gwu-libraries.github.io/python-camp/glossary.html#term-index)) within a list.

In [48]:
course_info = course.split()
print(course_info[0])

Just as with strings, negative indexing works, too. The `-1` index gives us the last item in the list.

In [49]:
print(course_info[-1])
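
Because a list can hold any Python value, including another list, indexes can also be chained. The following sketch uses an illustrative list called `course_record`, which is not part of the homework data.

```python
# A list can mix types, and can even contain another list
course_record = ['CHEM', 1002, 99.95, ['10', '11']]   # illustrative values

print(len(course_record))     # 4: the inner list counts as a single item
print(course_record[-1])      # the last item is itself a list
print(course_record[-1][0])   # first item of that inner list
```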

#### Try it out!


Slicing works, too.

Use the slicing syntax you learned above to extract the first two items from the `course_info` variable.



In [None]:
# Your code here

Now check your answer by expanding the hidden solution cell below.

In [50]:
print(course_info[0:2])

#### Notes

If you didn't get two items, remember that in slicing, the number on the left side of the colon represents the [index](https://gwu-libraries.github.io/python-camp/glossary.html#term-index) we want to start with, and the number on the right side represents the index _one after_ the index we want to end with. 



### II.2 Doing things with strings & lists

The real power of lists becomes apparent when we use them to automate repetitive tasks.

In "Choreographing Code," we used what's called a `for` loop to adjust a list of prices with sales tax. Here we'll work up to the same task, step by step, but we'll add a couple of enhancements. 

For this example, we'll start with a list of strings representing book prices.

In [51]:
book_prices = ['$55.99', '$119.95', '$13.95', '$250.67', '$99.99']

First, let's just print each price from the list on its own line.

In [52]:
for price in book_prices:
    print(price)

#### Notes

1. Our loop begins with the Python [keyword](https://gwu-libraries.github.io/python-camp/glossary.html#term-keyword) `for`. Unlike `len()` or `print()`, `for` is not a function but part of the Python syntax itself (like the quotation marks around strings or the square brackets around lists). 


2. `for` always goes with `in`; they make a pair. 


3. The variable `price`, immediately following `for`, is being created here. (It was not previously defined in our code.) Its role is to hold -- in sequence -- each item in the [collection](https://gwu-libraries.github.io/python-camp/glossary.html#term-collection) following `in`.


4. `book_prices` _was_ defined before the loop. The variable following `in` should always be some sort of collection type -- a string, a list, a dictionary, etc. -- or else a [function](https://gwu-libraries.github.io/python-camp/glossary.html#term-function) that returns a collection. We cannot write `for x in 10`, for instance, because the integer `10` is not a collection; it holds no items within itself. (For the same reason, we can't take a slice of `10`).


5. The first line of the `for` loop ends in a [colon](https://gwu-libraries.github.io/python-camp/glossary.html#term-colon), and the line or lines underneath it are **indented** (separated from the left margin by the same number of tabs or spaces). We call these indented lines in Python a [block](https://gwu-libraries.github.io/python-camp/glossary.html#term-block).
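
To illustrate the point about collections, this sketch loops over a string and then shows what happens with an integer. The names `dept` and `num_letters` are illustrative, and the `try`/`except` wrapper (not covered in this homework) is there only to keep the error from stopping the cell.

```python
# A string is a collection of characters, so we can loop over it
dept = 'CHEM'
num_letters = 0
for letter in dept:
    print(letter)
    num_letters = num_letters + 1   # count one more character
print(num_letters)

# An integer is not a collection, so looping over it raises a TypeError
try:
    for x in 10:
        print(x)
except TypeError as error:
    print('TypeError:', error)
```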



#### Try it out!


All of the prices follow the same format, beginning with the dollar sign. To add sales tax, we need to multiply each price by a fixed percentage -- let's say 110%, or 1.1, to reflect a sales tax of 10% on the dollar.

But as we saw above, we can't perform math with strings, and `book_prices` is a list of strings. To convert our strings to numbers, we first need to remove the dollar sign from each price. 

Modify the `for` loop above so that it prints each price _without_ the dollar sign. For a hint, consult the example above where we created the variable `section_num`.



In [3]:
# Your code here

Now check your answer by expanding the hidden solution cell below.

In [None]:
for price in book_prices:
    print(price[1:])

#### Try it out!

Now that we have extracted the numeric part of the string (the part after the dollar sign), we can **convert** this string to a float in order to do math with it. 

Modify the `for` loop again to do the following:

1. Assign the slice of the price (without the dollar sign) to a new variable called `price_num`.

2. Use the `float()` function to convert `price_num` to a float, and multiply the result by `1.1`. 

   For instance, to convert the string `'1.5'` (notice the quotation marks!) to its float representation, we would write `float('1.5')`.

3. Assign the result of this calculation to `price_num` and print it.

If the foregoing feels intimidating, try this [Parsons Problem](https://gwu-libraries.github.io/python-camp/parsons-problems/html/homework-1-1.html) first. It allows you to focus on the logical order of the actions to be performed, rather than on the syntax of the commands.



In [None]:
# Your code here

Now check your answer by expanding the hidden solution cell below.

In [56]:
for price in book_prices:
    price_num = price[1:]
    price_num = float(price_num) * 1.1
    print(price_num)

### II.3 From one list to another

If you didn't get the intended answer, don't worry! You'll get the hang of it. The most important thing for now is to understand that we used a `for` loop because we wanted to repeat a certain set of actions for all the items in a list. 

For now, let's add one enhancement to our loop: instead of just printing the new price (with the added sales tax), we'll store it in a separate list. 

To do this, we need to use a [method](https://gwu-libraries.github.io/python-camp/glossary.html#term-method) of Python lists called `append()`. This method adds a new item to the end of a list.

Much like the `split()` method we used on strings, the `append()` method can be called from any Python list by writing the name of the list plus `.append(item)`, where `item` is the value we want to add to the list. 

See the code below and the explanation that follows.

In [57]:
prices_with_tax = []
for price in book_prices:
    price_num = price[1:]
    price_num = float(price_num) * 1.1
    prices_with_tax.append(price_num)
print(prices_with_tax)

#### Notes

This loop achieves the same thing as the previous one, with this difference: the adjusted prices are now stored in the `prices_with_tax` variable, which is another list. This is a fairly common Python pattern; when using this pattern, here are a few rules of thumb to keep in mind:


1. We need to create the new list _outside_ of the `for` loop. The line `prices_with_tax = []` creates a new variable that consists of an **empty list**. 


2. As before, we use the [loop variable](https://gwu-libraries.github.io/python-camp/glossary.html#term-loop-variable) `price` to work with each item in the `book_prices` list.


3. The variable `price_num` is temporary, in the sense that, like the `price` loop variable, it will change (be reassigned to a new value) with every iteration of the loop.


4. The line `prices_with_tax.append(price_num)` stores the value in `price_num` to our list. **Without this step, we could not "save" the results of our calculations.** It's like performing a series of steps on a calculator: if you don't write down your answer before moving onto the next problem, you will lose all your hard work!


5. Steps 2, 3, and 4 are all indented underneath the line that kicks off the `for` loop. Visually, this tells you that all of these steps happen on each iteration of the loop (once for every item in the `book_prices` list).


6. The line `print(prices_with_tax)` happens _outside_ the `for` loop, so it's not indented. That's because we only want to print our new list when it's finished, not every time we add a new item to it. 


7. Note that our `book_prices` list still holds the original strings: our code made a new, second list. 



In [58]:
print(book_prices)
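
Because the adjusted prices are floats, they may print with more decimal places than a price needs. One possible refinement, sketched below, is to round each value to two decimal places with Python's built-in `round()` function before appending it. The list name `rounded_prices` is an assumption made for this example.

```python
book_prices = ['$55.99', '$119.95', '$13.95', '$250.67', '$99.99']

rounded_prices = []
for price in book_prices:
    price_num = float(price[1:]) * 1.1
    # The second argument to round() is the number of decimal places to keep
    rounded_prices.append(round(price_num, 2))
print(rounded_prices)
```

Keep in mind that rounding floats is subject to the same floating-point quirks described earlier, so a value that falls exactly on a half-cent may round up or down.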

#### Try it out!

We have a list of strings that represent courses, where each string (as in the example above) consists of a department code, a course number, and a section number:

```
courses = ['CHEM 1001 10', 'CHEM 1001 11' ...]
```

Using a `for` loop, transform this list into three separate lists: one to hold the department codes, one to hold the course numbers, and one to hold the section numbers.

The cells below will get you started. For more help, check out this [Parsons Problem](https://gwu-libraries.github.io/python-camp/parsons-problems/html/homework-1-2.html).



In [60]:
courses = ['CHEM 1001 10', 'CHEM 1001 11', 'BISC 1111 10', 'BISC 2207 10', 'PSC 1001 10',
           'PSC 1001 11', 'PSC 1001 12', 'ANTH 3808 10', 'AMST 2071 80']

In [61]:
depts = []
course_nums = []
sections = []

In [None]:
# Your code here

Now check your answer by expanding the hidden solution cell below.

In [None]:
for course in courses:
    course_info = course.split()
    depts.append(course_info[0])
    course_nums.append(course_info[1])
    sections.append(course_info[2])
print(depts)
print(course_nums)
print(sections)
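
Why use `split()` here rather than slicing by position? The sketch below compares the two approaches on a course with a three-letter department code. The names `short_course` and `short_info` are illustrative.

```python
short_course = 'PSC 1001 10'

# Fixed positions assume a four-letter department code
print(short_course[0:4])    # 'PSC ' (includes the trailing space)
print(short_course[5:9])    # '001 ' (cuts off the first digit)

# split() adapts to the length of each part
short_info = short_course.split()
print(short_info[0])        # PSC
print(short_info[1])        # 1001
```

Slicing works well when every string has the same layout; splitting is usually the safer choice when the lengths of the parts can vary.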

## Wrapping up

Congratulations! This homework covered _a lot_ of material. (Subsequent homework notebooks will not be as extensive.) 

1. We learned about various Python **types**, which allow us to represent (computationally) raw data in various ways.

2. We saw that different types have different uses -- or we might say, different behaviors: we can do addition and multiplication with integers and floats, we can **split** and **slice** strings, we can **append** items to lists.

3. With **integers**, **floats**, **strings**, and **lists**, we can capture complex data using the tools of the Python language. In the team exercise _From Data to Code_, we saw how a dataset in the [JSON](https://gwu-libraries.github.io/python-camp/glossary.html#term-JSON) format translates into these Python types (along with one additional type, the **dictionary**, which you will meet in tomorrow's activity).

4. And with **for loops**, we can _automate_ actions that would otherwise be tedious, like calculating the sales tax of a (long) list of prices. And that's one of the main reasons for learning to program: as the title of one popular book on Python puts it, ["to automate the boring stuff"](https://wrlc-gwu.primo.exlibrisgroup.com/discovery/fulldisplay?docid=alma99185917215304107&context=L&vid=01WRLC_GWA:live&lang=en&search_scope=WRLC_P_MyInst_All&adaptor=Local%20Search%20Engine&tab=WRLC&query=any,contains,automate%20the%20boring%20stuff). 