# From Data to Code

## Objectives

- To explore a "real-world" dataset for use with Python.
- To use Python to interact with data in a common, widely used format (JSON).
- To explore the data structures and syntactical patterns of the Python language.

<div class="lesson-plan" style="background: rgba(144, 238, 144, .1)">


## Lesson Plan

### Setup and goals

1. Explain that we will encounter a lot of unfamiliar syntax in this lesson, but that as in the previous lesson, the goal is to see code in action **before** we spend time learning why the code is written as it is (syntax). 
   - Define **syntax** as the way code is written in a particular language.
   - Emphasize that the syntax will be covered in depth in the homework, and that there will be ample time to clarify and solidify understanding over the course of Python Camp.

2. Introduce the idea that we will be working with a single dataset throughout the four days of Python Camp.
   - Explain the dataset and its relevance.
   - Note that in this exercise, teams will be getting acquainted with the dataset by using some Python commands.
   - The data is _represented_ by Python **types**. To start to understand how types work, teams should attend both to the _content_ of the data and how it is _represented_ (the syntax).
   - Teams should designate a note taker to jot down questions and observations that arise in team discussion.
   - Remind teams to rotate the note taker's role.

------------------
### The dataset
    
Introduce the dataset and JSON format, and ensure everyone can open the JSON file in a notebook.
    
Note that the path given is relative to the notebook in the `/textbook/_build/html/_sources/notebooks/lessons` folder.

------------------

### Navigating lists 

Have each team work through the `Navigating lists` section. Pause so that groups can share what they notice.

--------------------

### Navigating nested data / Questions for discussion

One goal for this discussion is to highlight how data formats like JSON allow us to *represent relationships*. Our dataset represents a model of something out in the world (textbooks/course materials used by courses at GW), and we're using Python in order to _interact with that model_ in various ways.

Instructors should document the modeling that emerges from this discussion as a tool for use in subsequent team activities.

1. Ask for volunteers (note takers or others) to share the data diagramming done by their teams. The cleaned data has the following hierarchy:

```

bkst_data[list]
---a course[dict]
------department[str] [...course, section, instructor, etc.]
------texts[list]
---------a text[dict]
------------title[str] [...author, isbn, etc.]

```

2. Note the presence of the different data types, but emphasize that the homework for Day 1 will cover the differences in detail. For now, focus on how the data elements relate to one another. Language like the following can help:
   - A course (in this dataset) _has_ a department, instructor, etc.
   - A course _can have_ one or more texts.
   - A text _has_ a title, price, etc.
  
  
3. Help teams surface any questions they have about the data or points of confusion.
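
If it helps during the discussion, the hierarchy can be walked one level at a time on a miniature stand-in for the dataset. This is only a sketch: `sample_data` and all of its values are invented for illustration, and only the key names (`department`, `course`, `texts`, `title`, `author`) follow the structure described above.

```python
# Hypothetical stand-in for bkst_data: one course with one text.
# All values are invented for illustration.
sample_data = [
    {
        'department': 'CHEM',
        'course': '1001',
        'texts': [
            {'title': 'Example Chemistry Text', 'author': 'A. Author'}
        ]
    }
]

a_course = sample_data[0]        # list -> dict (a course)
its_texts = a_course['texts']    # dict -> list (the course's texts)
a_text = its_texts[0]            # list -> dict (a text)
print(a_text['title'])           # dict -> str (the text's title)
```

Each assignment moves one level down the diagram, which can make the "has a" relationships easier to point to on screen.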

--------------------------

The second goal of this part of the discussion is to start to identify the syntactic elements of Python that allow us to model the world in certain ways. 

Have each team/note taker share their inventory of syntactic elements.

It's not necessary at this point to go through all of these elements in detail. Depending on the mood of the group, however, it may be useful to highlight the following (as preparation for the homework):

  - **Quotation marks** `''`
    - surround strings of characters (letters, numbers, other punctuation)
    - occurring on the left side of a colon (e.g., `title`, `isbn`), they function as field names, in Python called **keys**
    - occurring on the right side of a colon (e.g., `CHEM`, `Organic Chemistry Model Set`), they function as field **values**
    - also used inside square brackets (see below) to retrieve values from their fields:  `text['title']`
    - can be single or double: makes no difference to Python
  - **Square brackets** `[]`
    - enclose a collection of data elements called a **list**
    - elements of a list can be any other type of Python data, or even multiple types
    - also used for accessing data in two ways:
      - by position (from a list): `bkst_data[0]`
      - by field (from a dictionary): `my_course['department']`
  - **Curly braces** `{}`
    - enclose a collection of data elements called a **dictionary**
    - dictionaries have more structure than lists: each element comprises a field name (**key**) associated with a **value**
    - square brackets are used to access dictionary values by key (see above)
  - **Commas**
    - used to separate elements in a dictionary or list
  - **Parentheses**
    - used when calling **functions**: `print("Hello, world!")`, `len(bkst_data)`
    - enclose values or variables passed to the function for processing (**arguments**)

</div>

## Instructions

The following notebook has cells containing Python code for interacting with an external data file. 

1. Before you begin, each team should designate a note taker to record answers to the questions for discussion (see below).
2. Read the section below, `Introducing the dataset`, and as a team, record any questions you have about it.
3. Run each cell of code in this notebook (by pressing `Control` + `Enter` on a PC,  `Command` + `Return` on a Mac). The output of running the code will appear below the cell.
4. Each cell is accompanied by annotations, and in some cases, questions for discussion. Discuss your responses to these questions with your team. The note taker should briefly document the conversation (including any further questions or points of confusion that arise).
5. Blank cells labeled `Try it out!` invite you to write your own code based on the provided examples. Run your code and discuss the output with your team.
6. Once everyone has worked through the notebook, we'll review the questions and your responses in the larger group.


## The dataset

Over the next few days, we'll be working together on a dataset containing information about textbooks assigned by courses at GW for the Summer 2023 semester. 

This dataset might lend itself to uses like the following:
 - Identifying courses with particularly high-cost textbooks
 - Calculating the average cost of textbooks by instructor, department, etc.
 - Identifying the publishers that supply the majority of textbooks used at GW
 - Identifying instructors who assign textbooks they themselves have authored

As you work through this exercise, make a note of any additional uses you can imagine for this data. These notes will come in handy later!

### Source

This dataset was obtained by scraping the [GW Bookstore](https://www.bkstr.com/georgewashingtonstore/shop/textbooks-and-course-materials) website. Because [web scraping](https://learning.oreilly.com/library/view/web-scraping-with/9781098145347/) is a rather advanced topic, we won't be covering that process in Python Camp.

The dataset was also pre-processed to simplify it somewhat (removing extraneous and redundant elements, etc.).

### Format

The dataset is in {term}`JSON` format. JSON (usually pronounced *jay-sawn*) is a common format for sharing data on the web. It's not as concise or human-readable as some formats (e.g., CSV, which is often used for sharing tabular data). But it has a few advantages that make it popular with programmers:
 - JSON data can be deeply nested, reflecting hierarchical relationships between data elements.
 - The JSON format comprises structures that map well onto the most common data structures used by modern programming languages. 
 - The JSON syntax has a lot in common with languages like Python.
 
We'll explore these three aspects of JSON today.
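
As a small preview of how JSON structures map onto Python data structures, the sketch below parses a tiny, made-up JSON document. The string `json_text` and its values are invented for illustration; `json.loads` is the standard-library function that reads JSON from a string rather than from a file.

```python
import json

# A made-up JSON document, written as a Python string.
json_text = '{"department": "CHEM", "texts": [{"title": "Example Text", "price": 12.5}]}'

record = json.loads(json_text)      # parse the JSON text into Python values

print(type(record))                 # the JSON object became a Python dict
print(type(record['texts']))        # the JSON array became a Python list
print(type(record['department']))   # the JSON string became a Python str
print(record['texts'][0]['price'])  # nested values stay nested
```

Notice how closely the JSON text resembles the Python code you would write to build the same data by hand.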

### Loading the data

The code below will fetch data from a URL and save it locally as a file before loading the data in your notebook.

1. We {term}`import` a Python function called `urlretrieve` and a Python module called `json`. A module is just some external Python code that provides a particular functionality.
   - The `urlretrieve` function allows us to fetch data from a remote source and save a local copy.
   - The `json` module allows us to convert data in JSON format to Python types. (More on types below.)
2. We call `urlretrieve` with the URL of the dataset and a filename, `bookstore-data.json`, which downloads the data and saves it as a local file with that name.
3. We use the `open` function to open that file. The `.json` extension indicates that this is a JSON-formatted file. (The file extension is part of the filename, like `.docx` for Word documents or `.xlsx` for Excel spreadsheets.)
4. The file is assigned to the temporary variable `f`.
5. We use the `json.load` {term}`method` to read the file (`f`). This method is specifically designed for JSON files; it won't work if the file does not contain data in valid JSON format. 
6. The contents of the file, as processed by `json.load`, are assigned to a new variable, `bkst_data`.

In [1]:
from urllib.request import urlretrieve
import json
urlretrieve('https://go.gwu.edu/pythoncampdata', 'bookstore-data.json')
with open('bookstore-data.json') as f:
    bkst_data = json.load(f)

## Navigating lists

The `type` {term}`function` provides as output information about the value associated with the variable `bkst_data`. It tells us the name of the Python data structure that characterizes this value.

In [2]:
type(bkst_data)

list

````{admonition} Question
:class: question

You have encountered a Python {term}`list` before. What was the name of the variable that held a list in the "Choreographing Code" exercise?
````

````{admonition} Notes
:class: notes

- Going forward, when referring to functions and their output, we will say that the function **returns** something.
- Every Python value has a defined {term}`type`.
- When we use the word {term}`variable`, we're generally referring to the combination of a name and a value. The name points to the value, which is located somewhere in memory. When we say, "the variable `bkst_data` is a list," we mean that the value to which the name `bkst_data` points is represented in Python as a list. 
````
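
To make these notes concrete, here is a short sketch using invented values (the names `courses` and `same_courses` are hypothetical). It shows `type` returning a different type for different kinds of values, and two names pointing to one and the same list.

```python
# Invented values, for illustration only.
print(type('CHEM'))    # a string
print(type(6203))      # an integer
print(type([]))        # an empty list
print(type({}))        # an empty dictionary

courses = ['ACA 6203', 'CHEM 1001']
same_courses = courses            # a second name for the same value
print(same_courses is courses)    # True: both names point to one list in memory
```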

In [2]:
bkst_data[0]

{'department': 'ACA',
 'course': '6203',
 'section': '10',
 'instructor': 'Alexander Wild',
 'term_name': 'Summer 2023',
 'texts': []}

The code above uses {term}`index`ing to access the first element in the `bkst_data` list. When you run it, you should see data enclosed in curly braces (`{}`). 
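
Python counts positions in a list starting from zero, which is why `bkst_data[0]` gives the first element. The sketch below shows the same idea on a short, made-up list (`letters` is a hypothetical name, not part of the dataset).

```python
# A made-up list standing in for bkst_data.
letters = ['a', 'b', 'c', 'd']

print(letters[0])     # first element (index 0)
print(letters[1])     # second element (index 1)
print(letters[3])     # fourth element (index 3)
print(len(letters))   # four elements, so the last valid index is 3
```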

````{admonition} Try it out!
:class: try-it-out

In the cell below, use indexing to look at other elements in the list. (Change the number inside the square brackets in the expression `bkst_data[0]`). 

````

In [None]:
# Your code here

````{admonition} Questions
:class: question

- What data elements do you see that might prove useful to our project (identifying courses with high- and low-cost textbooks)?
- What data elements do you have questions about?

````

````{admonition} Try it out!
:class: try-it-out

For an example of a course with assigned textbooks, look at the element in the 100th position in the list (at index 99).

Note any additional data elements that might be useful, as well as any that you have questions about.

````

In [None]:
# Your code here

## Navigating nested data

The `len()` function returns the length of a list.

In [4]:
len(bkst_data)

1250

There are 1,250 top-level elements in `bkst_data`. But each element contains other elements nested within it. 

Here we assign one of those elements -- the element in the 100th position -- to a new variable, `my_course`.

In [None]:
my_course = bkst_data[99]
print(my_course)

````{admonition} Question
:class: question

Look back at how we initially defined this dataset. In terms of that definition, what does each top-level element in the `bkst_data` list represent? 

````

In [None]:
type(my_course)

A single course is represented as a Python {term}`dictionary` (`dict`) within the `bkst_data` list. 

As we saw in the previous activity, dictionaries allow us to store data in fields, similar to a database. The elements on the left-hand side of the colons are called **keys**, and the elements on the right-hand side of the colons are called **values**.

Here the keys are strings; anything enclosed in quotation marks in Python is a {term}`string`.

We can use the keys to retrieve the values from the dictionary.

In [7]:
my_dept = my_course['department']
print(my_dept)

CHEM


````{admonition} Try it out!
:class: try-it-out

Practice accessing other elements within `my_course` using keys. 

````

In [None]:
# Your code here

````{admonition} Questions
:class: question

Which pieces of information in `my_course` can you NOT access this way?

````

The textbooks associated with this course/section are stored within a **nested** element: a list associated with the key `'texts'`. 

In [13]:
type(my_course['texts'])

list

This particular course/section has 7 entries in the list stored under the `'texts'` key.

In [14]:
len(my_course['texts'])

7
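
Key access and indexing can also be combined in a single expression, reading from left to right. The sketch below uses a made-up course (`sample_course` and its values are invented for illustration; the real data has more keys).

```python
# A made-up course, for illustration only.
sample_course = {
    'department': 'CHEM',
    'texts': [
        {'title': 'First Example Text'},
        {'title': 'Second Example Text'}
    ]
}

print(len(sample_course['texts']))         # how many texts this course has
print(sample_course['texts'][1])           # the second text (a dictionary)
print(sample_course['texts'][1]['title'])  # that text's title (a string)
```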

````{admonition} Question
:class: question

At this point, we have drilled down a few levels into our dataset, and things might be starting to look rather complicated. Take a moment with your team to sketch on paper the structure of our dataset so far. Draw it in whatever way feels most intuitive to you (without worrying about Python data types for the moment).

````

````{admonition} Try it out!
:class: try-it-out

Use indexing to look at each of the items in the `my_books` variable below. With your team, revise your sketch/data diagram to reflect how course materials are represented in this dataset.

````

In [None]:
my_books = my_course['texts']
my_books[0]

In [None]:
# Your code here

## Questions for Discussion

1. We're ultimately interested in the cost of course materials as reported by the bookstore. Where do the relevant data elements reside within the structure we have been exploring?


2. Up to now, we've been looking at our data in terms of content and organization: thinking about what different elements represent, and how those elements relate to each other. Now take a few moments and look at the data in terms of **syntax**.
   - With your team, make an inventory of the different kinds of punctuation marks you see in the parts of the dataset you have examined, and discuss any patterns you notice in how punctuation is used.