# From Data to Code

### Objectives

- To explore a "real-world" dataset for use with Python.
- To use Python to interact with data in a common, widely used format (JSON).
- To explore the data structures and syntactical patterns of the Python language.

### Instructions

The following notebook has cells containing Python code for interacting with an external data file. 

1. Before you begin, each team should designate a note taker to record answers to the questions for discussion (see below).
2. Read the section below, `Introducing the dataset`, and as a team, record any questions you have about it.
3. Run each cell of code in this notebook (by pressing `Control` + `Enter` on a PC, or `Command` + `Return` on a Mac). The output of running the code will appear below the cell.
4. Each cell is accompanied by annotations, and in some cases, questions for discussion. Discuss your responses to these questions with your team. The note taker should briefly document the conversation (including any further questions or points of confusion that arise).
5. Blank cells labeled `Try it out!` invite you to write your own code based on the provided examples. Run your code and discuss the output with your team.
6. Once everyone has worked through the notebook, we'll review the questions and your responses in the larger group.


### The dataset

Over the next few days, we'll be working together on a dataset containing information about textbooks assigned by courses at GW for the **[add term]** semester. 

#### Source

This dataset was obtained by scraping the [GW Bookstore](https://www.bkstr.com/georgewashingtonstore/shop/textbooks-and-course-materials) website. Because web scraping is a rather advanced topic, we won't be covering that process in Python Camp. (The [code](https://github.com/gwu-libraries/python-camp/blob/main/course-utils/gw_bookstore_scraping.ipynb) is available in the Python Camp GitHub repository, for those who are interested.)

The dataset was also pre-processed to simplify it somewhat (removing extraneous and redundant elements, etc.).

#### Format

The dataset is in **JSON** format. JSON (usually pronounced *jay-sawn*) is a common format for sharing data on the web. It's not as concise or human-readable as some formats (e.g., CSV, which is often used for sharing tabular data). But it has a few advantages that make it popular with programmers:
 - JSON data can be deeply nested, reflecting hierarchical relationships between data elements.
 - The JSON format comprises structures that map well onto the most common data structures used by modern programming languages. 
 - The JSON syntax has a lot in common with languages like Python.
 
We'll explore these three aspects of JSON today.
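
As a rough illustration of these points, the sketch below parses a small, made-up JSON document (not taken from the bookstore dataset) with `json.loads`, which reads JSON from a Python string. The field names `course`, `materials`, `title`, and `price` are invented for this example.

```python
import json

# A small, made-up JSON document (NOT from the bookstore dataset).
json_text = '''
[
  {
    "course": "Intro to Widgets",
    "materials": [
      {"title": "Widgets 101", "price": 25.5},
      {"title": "Advanced Widgets", "price": 40.0}
    ]
  }
]
'''

# json.loads parses JSON text held in a Python string.
records = json.loads(json_text)

print(type(records))     # a JSON array becomes a Python list
print(type(records[0]))  # a JSON object becomes a Python dictionary

# Nested data: list -> dictionary -> list -> dictionary -> string
print(records[0]["materials"][1]["title"])
```

Notice how closely the JSON text resembles the way Python itself writes lists and dictionaries.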

### Loading the data

The code below will load data from an external file located in a directory called `data`. The path to this directory is written relative to the current directory (in which this notebook lives).

1. We **import** a Python library called `json`. A library is just some external Python code that provides a particular functionality. The `json` library allows us to convert data in JSON format to Python data types. (More on data types below).
2. We use the `open` function to open a file in a directory called `data`. The file is called `bookstore-data-cleaned.json`, where the `.json` extension indicates that this is a JSON-formatted file. (The file extension is part of the filename, like `.docx` for Word documents or `.xlsx` for Excel spreadsheets.)
3. The file is assigned to the temporary name `f`.
4. We use the `json.load` **function** to read the file (`f`). This function is specifically designed for JSON files; it won't work if the file does not contain data in valid JSON format.
5. The contents of the file, as processed by `json.load`, are assigned to a new name, `bkst_data`.

In [None]:
import json
with open('../../data/bookstore-data-cleaned.json') as f:
    bkst_data = json.load(f)
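
To see what "it won't work" looks like in practice, here is a minimal sketch that needs no file on disk. It assumes `io.StringIO` as a stand-in for an open file, and the JSON text is made up for illustration. The `try`/`except` pattern is not covered in this notebook; it simply lets us catch the error instead of stopping the program.

```python
import io
import json

# io.StringIO behaves like an open text file, so no real file is needed here.
valid_file = io.StringIO('[{"department": "CHEM"}]')
data = json.load(valid_file)
print(data)

# The same text with the closing brackets missing is not valid JSON.
broken_file = io.StringIO('[{"department": "CHEM"')
error_message = None
try:
    json.load(broken_file)
except json.JSONDecodeError as err:
    # json.load raises JSONDecodeError when the text cannot be parsed.
    error_message = str(err)
    print("Not valid JSON:", error_message)
```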

The `type` **function** provides as output information about the value associated with the name `bkst_data`. It tells us the name of the Python data structure that characterizes this value.

In [None]:
type(bkst_data)

##### Question

You have encountered a Python list before. What was the name of the variable that held a list in the "Choreographing Code" exercise?

##### Notes

- Going forward, when referring to functions and their output, we will say that the function **returns** something.
- Every Python value has a defined **type**.
- When we use the word **variable**, we're generally referring to the combination of a name and a value. Thus, we can say, "the variable `bkst_data` is a list." 
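
The short sketch below applies `type` to a few made-up values to show that every value has a type. The values themselves are invented for illustration.

```python
# A few made-up values, one for each of several common Python types.
examples = ["GW", 42, 19.99, [1, 2, 3], {"department": "CHEM"}]

for value in examples:
    # type(value) returns the type; .__name__ is its short name as a string.
    print(value, "is a", type(value).__name__)
```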

### Navigating lists

In [None]:
bkst_data[0]

The code above uses **indexing** to access the first element in the `bkst_data` list. When you run it, you should see a lot of data enclosed in curly braces (`{}`). 
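
Python counts list positions starting from 0, so index `0` is the first element and index `1` is the second. Here is a small sketch using a made-up list called `letters` (not part of the dataset):

```python
# A made-up list for practicing indexing.
letters = ["a", "b", "c", "d"]

first = letters[0]   # index 0 is the first element
second = letters[1]  # index 1 is the second element
last = letters[-1]   # a negative index counts from the end of the list

print(first, second, last)  # prints: a b d
```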

##### Try it out!

In the cell below, use indexing to look at other elements in the list. (Change the number inside the square brackets in the expression `bkst_data[0]`). 

In [None]:
# Your code here

##### Questions

- What data elements do you see that might prove useful to our project (identifying courses with high- and low-cost textbooks)?
- What data elements do you have questions about?

##### Try it out!

For an example of a course with assigned textbooks, look at the element in the 100th position in the list (at index 99).

Note any additional data elements that might be useful, as well as any that you have questions about.

In [None]:
# Your code here

### Navigating nested data

In [None]:
len(bkst_data)

The `len()` function returns the length of a list. Here we can see that there are 1,250 top-level elements in `bkst_data`. But each element contains many other elements below it. 
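
`len()` only counts the elements at the level you give it, not anything nested further down. The sketch below uses a made-up list named `courses` (with invented keys `title` and `materials`) to illustrate:

```python
# Made-up data: two course records, each with a nested list of materials.
courses = [
    {"title": "Course A", "materials": ["Book 1", "Book 2", "Book 3"]},
    {"title": "Course B", "materials": []},
]

top_level_count = len(courses)                   # elements in the outer list
first_materials = len(courses[0]["materials"])   # elements in a nested list
second_materials = len(courses[1]["materials"])  # an empty list has length 0
key_count = len(courses[0])                      # for a dictionary, len counts keys

print(top_level_count, first_materials, second_materials, key_count)  # prints: 2 3 0 2
```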

##### Question

What does each top-level element in the `bkst_data` list represent?

In [None]:
my_course = bkst_data[99]
type(my_course['courseSection'])

A single course is represented as a Python dictionary within the `bkst_data` list. Unfortunately, almost all of the useful information -- the information unique to that particular course -- is nested *within* that dictionary as *another dictionary*. 

We can access that inner dictionary using the **key** `'courseSection'`, which we enclose in quotation marks because it's a string.

In [None]:
my_section = my_course['courseSection']
my_section['department']

We're using a new variable, `my_section`, to hold this inner dictionary. We can access its elements by their keys, as in the example above. 

##### Note

We could have also written the above as `my_course['courseSection']['department']`, without using the intermediate variable `my_section`. From Python's point of view, there's not much difference. It's really a matter of programmer preference, and what makes your code more readable for you and your collaborators.
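
The two styles can be compared side by side in this minimal sketch. The keys `courseSection`, `department`, and `section` follow the dataset, but the values and the variable names `course` and `section_info` are made up for illustration.

```python
# A made-up course record shaped like the ones in the dataset.
course = {"courseSection": {"department": "CHEM", "section": "10"}}

# Style 1: use an intermediate variable for the inner dictionary.
section_info = course["courseSection"]
dept_two_steps = section_info["department"]

# Style 2: chain the keys in a single expression.
dept_one_step = course["courseSection"]["department"]

print(dept_two_steps == dept_one_step)  # prints: True
```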

##### Try it out!

Practice accessing other elements within `my_section` using keys. 

In [None]:
# Your code here

##### Questions

Which elements can you NOT access this way?

In [None]:
type(my_section['courseMaterials'])

In [None]:
len(my_section['courseMaterials'])

The course materials associated with this course/section are stored within yet another nested element: a list associated with the key `'courseMaterials'`. 

This particular course/section has 4 entries under the `courseMaterials` heading.

##### Question

At this point, we have drilled down a few levels into our dataset, and things might be starting to look rather complicated. Take a moment with your team to sketch on paper the structure of our dataset so far. Draw it in whatever way feels most intuitive to you (without worrying about Python data types for the moment).

In [None]:
my_books = my_section['courseMaterials']
my_books[0]

For this course section, the `courseMaterials` list has four items, representing a mix of physical and electronic course materials associated with this course.

##### Try it out!

Use indexing to look at each of the items. With your team, revise your sketch/data diagram to reflect how course materials are represented in this dataset.

In [None]:
# Your code here

##### Questions

1. We're ultimately interested in the cost of course materials as reported by the bookstore. Where do the relevant data elements reside within the structure we have been exploring?


2. Print and digital course materials are organized in different ways within each course. Can you identify the differences?


3. Up to now, we've been looking at our data in terms of content and organization: thinking about what different elements represent, and how those elements relate to each other. Now take a few moments and look at the data in terms of **syntax**. With your team, make an inventory of the different kinds of punctuation marks you see in the parts of the dataset you have examined, and discuss any patterns you notice in how punctuation is used.

### Wrap up

#### Part 1: Data semantics

The goal in this part of the discussion is to highlight how data formats like JSON allow us to *represent relationships*. Our dataset represents a model of something out in the world (textbooks/course materials used by courses at GW), and we're using Python in order to _interact with that model_ in various ways.

Instructors should document the modeling that emerges from this discussion as a tool for use in subsequent team activities.

1. Ask for volunteers (note takers or others) to share the data diagramming done by their teams. The cleaned data has the following hierarchy:

```

bkst_data[list]
---a course record[dict]
------courseSection[dict]
---------course[str] [...section, instructor, etc.]
---------courseMaterials[list][optional]
------------course material[dict]
---------------title[str] [...author, isbn, etc.]
---------------printItems[dict][optional]
------------------BUY_NEW|BUY_USED|RENTAL_NEW|RENTAL_USED[dict]
---------------------priceNumeric[float] [...]
---------------digitalItems[list][optional]
------------------a digital item[dict]
---------------------typeCondition[str] [...priceNumeric, etc.]

```

2. Note the presence of the different data types, but emphasize that the homework for Day 1 will cover the differences in detail. For now, focus on how the data elements relate to one another. Language like the following can help:
  - A course record (in this dataset) _has_ a single course section.
  - A course section _can have_ one or more course materials.
  - Each course material has _either_ print items, digital items, _or_ both.
  - Each item has a condition and a price, etc.
  
  
3. Help teams surface any questions they have about the data or points of confusion.
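
If it helps the discussion, the hierarchy in step 1 can be sketched as a single hypothetical record. The key names follow the hierarchy above, but every value here (title, prices, the `typeCondition` string) is invented for illustration and does not come from the dataset. The use of `.get` with a default is one possible way to handle the optional keys, not necessarily how later activities handle them.

```python
# A hypothetical course record; key names follow the hierarchy, values are made up.
course_record = {
    "courseSection": {
        "course": "1111",
        "courseMaterials": [
            {
                "title": "Example Textbook",
                "printItems": {
                    "BUY_NEW": {"priceNumeric": 100.0},
                    "BUY_USED": {"priceNumeric": 75.0},
                },
                "digitalItems": [
                    {"typeCondition": "EXAMPLE_DIGITAL", "priceNumeric": 50.0},
                ],
            }
        ],
    }
}

# Drill down: course record -> course section -> first course material.
material = course_record["courseSection"]["courseMaterials"][0]

# Print items are a dictionary, so we reach a price by key.
new_price = material["printItems"]["BUY_NEW"]["priceNumeric"]

# Digital items are a list, so we reach a price by position, then by key.
# .get returns the default (an empty list) if the optional key is absent.
digital_items = material.get("digitalItems", [])
digital_price = digital_items[0]["priceNumeric"]

print(new_price, digital_price)  # prints: 100.0 50.0
```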


#### Part 2: Syntax

The goal of this part of the discussion is to start to identify the syntactic elements of Python that allow us to model the world in certain ways. 

Have each team/note taker share their inventory of syntactic elements.

It's not necessary at this point to go through all of these elements in detail. Depending on the mood of the group, however, it may be useful to highlight the following (as preparation for the homework):

  - **Quotation marks** `''`
    - surround strings of characters (letters, numbers, other punctuation)
    - occurring on the left side of a colon (e.g., `title`, `courseSection`), they function as field names, in Python called **keys**
    - occurring on the right side of a colon (e.g., `CHEM`, `Organic Chemistry Model Set`), they function as field **values**
    - also used inside square brackets (see below) to retrieve values from their fields:  `my_section['courseMaterials']`
    - can be single or double: makes no difference to Python
  - **Square brackets** `[]`
    - enclose a collection of data elements called a **list**
    - elements of a list can be any other type of Python data, or even multiple types
    - also used for accessing data in two ways:
      - by position (from a list): `bkst_data[0]`
      - by field (from a dictionary): `my_section['department']`
  - **Curly braces** `{}`
    - enclose a collection of data elements called a **dictionary**
    - dictionaries have more structure than lists: each element comprises a field name (**key**) associated with a **value**
    - square brackets are used to access dictionary values by key (see above)
  - **Commas**
    - used to separate elements in a dictionary or list
  - **Parentheses**
    - used when calling **functions**: `print("Hello, world!")`, `len(bkst_data)`
    - enclose values or variables passed to the function for processing (**arguments**)
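
For groups that want to see all of these elements together, the following minimal sketch uses each of them once. The variable names `book` and `books` and the second entry's values are made up for illustration.

```python
# Curly braces enclose a dictionary; colons pair each key with its value;
# commas separate the elements.
book = {"title": "Organic Chemistry Model Set", "department": "CHEM"}

# Square brackets enclose a list; single and double quotes are interchangeable.
books = [book, {"title": 'Another Title', "department": 'HIST'}]

# Parentheses call a function and enclose its arguments.
book_count = len(books)

# Square brackets also access data: by position (list), then by key (dictionary).
first_title = books[0]["title"]

# Single-quoted and double-quoted strings with the same characters are equal.
same_string = 'CHEM' == "CHEM"

print(book_count, first_title, same_string)
```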