# From Data to Code

## Objectives

- To explore a "real-world" dataset for use with Python.
- To use Python to interact with data in a common, widely used format (JSON).
- To explore the data structures and syntactical patterns of the Python language.

## Instructions

The following notebook has cells containing Python code for interacting with an external data file. 

1. Before you begin, each team should designate a note taker to record answers to the questions for discussion (see below).
2. Read the section below, `Introducing the dataset`, and as a team, record any questions you have about it.
3. Run each cell of code in this notebook (by pressing `Control` + `Enter` on a PC, or `Command` + `Return` on a Mac). The output of running the code will appear below the cell.
4. Each cell is accompanied by annotations, and in some cases, questions for discussion. Discuss your responses to these questions with your team. The note taker should briefly document the conversation (including any further questions or points of confusion that arise).
5. Blank cells labeled `Try it out!` invite you to write your own code based on the provided examples. Run your code and discuss the output with your team.
6. Once everyone has worked through the notebook, we'll review the questions and your responses in the larger group.


## The dataset

Over the next few days, we'll be working together on a dataset containing information about textbooks assigned by courses at GW for the Summer 2023 semester. 

### Source

This dataset was obtained by scraping the [GW Bookstore](https://www.bkstr.com/georgewashingtonstore/shop/textbooks-and-course-materials) website. Because [web scraping](https://library.gwu.edu/events/web-scraping-what-you-need-know-2023-10-03) is a rather advanced topic, we won't be covering that process in Python Camp.

The dataset was also pre-processed to simplify it somewhat (removing extraneous and redundant elements, etc.).

### Format

The dataset is in [JSON](https://gwu-libraries.github.io/python-camp/glossary.html#term-JSON) format. JSON (usually pronounced *jay-sawn*) is a common format for sharing data on the web. It's not as concise or human-readable as some formats (e.g., CSV, which is often used for sharing tabular data). But it has a few advantages that make it popular with programmers:
 - JSON data can be deeply nested, reflecting hierarchical relationships between data elements.
 - The JSON format comprises structures that map well onto the most common data structures used by modern programming languages. 
 - The JSON syntax has a lot in common with languages like Python.
 
We'll explore these three aspects of JSON today.
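
To make these three points a little more concrete, here is a minimal sketch that uses a small, made-up JSON string rather than the bookstore file. The values, and the `title` and `required` keys, are invented for illustration. It uses `json.loads`, which parses JSON from a string instead of from a file.

```python
import json

# A small, made-up JSON document (not taken from the bookstore dataset).
# The "title" and "required" keys are hypothetical.
sample_json = '''
[
  {
    "department": "HIST",
    "texts": [
      {"title": "An Invented Textbook", "required": true}
    ]
  }
]
'''

# json.loads parses JSON from a string and returns Python values.
sample_data = json.loads(sample_json)

print(type(sample_data))                       # JSON array -> Python list
print(type(sample_data[0]))                    # JSON object -> Python dict
print(type(sample_data[0]["texts"]))           # nested JSON array -> Python list
print(sample_data[0]["texts"][0]["required"])  # JSON true -> Python True
```

Notice how closely the text inside the string resembles the Python values it becomes: square brackets, curly braces, colons, and commas play similar roles in both.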

### Loading the data

The code below will load data from an external file located in a directory called `data`. The path to that directory is given relative to the current directory (the one in which this notebook lives).

1. We [import](https://gwu-libraries.github.io/python-camp/glossary.html#term-import) a Python module called `json`. A module is just some external Python code that provides a particular functionality. The `json` module allows us to convert data in JSON format to Python types. (More on types below.)
2. We use the `open` function to open a file in a directory called `data`. The file is called `bookstore-data-summer-2023.json`, where the `.json` extension indicates that this is a JSON-formatted file. (The file extension is part of the filename, like `.docx` for Word documents or `.xlsx` for Excel spreadsheets.)
3. The file is assigned to the temporary variable `f`.
4. We use the `json.load` [function](https://gwu-libraries.github.io/python-camp/glossary.html#term-function) to read the file (`f`). This function is specifically designed for JSON files; it won't work if the file does not contain data in valid JSON format.
5. The contents of the file, as processed by `json.load`, are assigned to a new variable, `bkst_data`.

In [1]:
import json
with open('../../../data/bookstore-data-summer-2023.json') as f:
    bkst_data = json.load(f)
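
As a rough illustration of step 4, the sketch below shows what happens when `json.load` is given content that is not valid JSON. So that it runs without any external file, it uses `io.StringIO` to stand in for an opened file; that substitution, and the sample content, are assumptions made for this example only.

```python
import io
import json

# io.StringIO behaves like an open text file; it stands in for `f` here.
valid_file = io.StringIO('[{"department": "HIST"}]')
print(json.load(valid_file))

# The closing square bracket is missing, so this is not valid JSON.
invalid_file = io.StringIO('[{"department": "HIST"}')
try:
    json.load(invalid_file)
    load_failed = False
except json.JSONDecodeError as err:
    load_failed = True
    print("Could not parse the file:", err)
```

If you ever see a `JSONDecodeError` when loading a file, the message usually points to the line and column where the parser gave up.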

## Navigating lists

The `type` [function](https://gwu-libraries.github.io/python-camp/glossary.html#term-function) provides as output information about the value associated with the variable `bkst_data`. It tells us the name of the Python data structure that characterizes this value.

In [2]:
type(bkst_data)

#### Question

You have encountered a Python [list](https://gwu-libraries.github.io/python-camp/glossary.html#term-list) before. What was the name of the variable that held a list in the "Choreographing Code" exercise?


#### Notes

- Going forward, when referring to functions and their output, we will say that the function **returns** something.
- Every Python value has a defined [type](https://gwu-libraries.github.io/python-camp/glossary.html#term-type).
- When we use the word [variable](https://gwu-libraries.github.io/python-camp/glossary.html#term-variable), we're generally referring to the combination of a name and a value. The name points to the value, which is located somewhere in memory. When we say, "the variable `bkst_data` is a list," we mean that the value to which the name `bkst_data` points is represented in Python as a list. 
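
The notes above can be sketched with a few small values. The course codes below are made up; the point is only to show `type` returning different types and two names pointing to the same value.

```python
# type() returns the type of whatever value is passed to it.
print(type(42))          # <class 'int'>
print(type("Python"))    # <class 'str'>
print(type([1, 2, 3]))   # <class 'list'>

# A variable is a name pointing to a value.
courses = ["HIST 1011", "CHEM 1111"]   # made-up course codes
same_courses = courses                  # a second name for the same list
print(same_courses is courses)          # True: both names point to one value
```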


In [2]:
bkst_data[0]

The code above uses [index](https://gwu-libraries.github.io/python-camp/glossary.html#term-index)ing to access the first element in the `bkst_data` list. When you run it, you should see data enclosed in curly braces (`{}`). 
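
Indexing works the same way on any list. Here is a minimal sketch with a short, made-up list of department codes (not drawn from the dataset):

```python
# A made-up list of department codes.
departments = ["ANTH", "BISC", "CHEM", "HIST"]

first = departments[0]   # indexing starts at 0, so this is the first element
third = departments[2]   # index 2 is the third element
print(first)   # ANTH
print(third)   # CHEM
```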

#### Try it out!

In the cell below, use indexing to look at other elements in the list. (Change the number inside the square brackets in the expression `bkst_data[0]`). 



In [None]:
# Your code here

#### Questions

- What data elements do you see that might prove useful to our project (identifying courses with high- and low-cost textbooks)?
- What data elements do you have questions about?



#### Try it out!

For an example of a course with assigned textbooks, look at the element in the 100th position in the list (at index 99).

Note any additional data elements that might be useful, as well as any that you have questions about.



In [None]:
# Your code here

## Navigating nested data

The `len()` function returns the length of a list.

In [4]:
len(bkst_data)

There are 1,250 top-level elements in `bkst_data`. But each element contains other elements nested within it. 

Here we assign one of those elements -- the element in the 100th position -- to a new variable, `my_course`.

In [None]:
my_course = bkst_data[99]
print(my_course)

#### Question

Look back at how we initially defined this dataset. In terms of that definition, what does each top-level element in the `bkst_data` list represent? 



In [None]:
type(my_course)

A single course is represented as a Python [dictionary](https://gwu-libraries.github.io/python-camp/glossary.html#term-dictionary) (`dict`) within the `bkst_data` list. 

As we saw in the previous activity, dictionaries allow us to store data in fields, similar to a database. The elements on the left-hand side of the colons are called **keys**, and the elements on the right-hand side of the colons are called **values**.

Here the keys are strings; anything enclosed in quotation marks in Python is a [string](https://gwu-libraries.github.io/python-camp/glossary.html#term-string).

We can use the keys to retrieve the values from the dictionary.

In [7]:
my_dept = my_course['department']
print(my_dept)
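
The same pattern applies to any dictionary. The sketch below uses a made-up course record: `department` is a key we have seen in the dataset, while `course` and `section` are hypothetical keys chosen for illustration. It also shows one possible way to list a dictionary's keys, which may help when you are not sure which keys are available.

```python
# A made-up course record; 'course' and 'section' are hypothetical keys.
sample_course = {"department": "HIST", "course": "1011", "section": "10"}

print(sample_course["department"])   # HIST
print(list(sample_course.keys()))    # ['department', 'course', 'section']
```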

#### Try it out!

Practice accessing other elements within `my_course` using keys. 



In [None]:
# Your code here

#### Question

Which pieces of information in `my_course` can you NOT access this way?



The textbooks associated with this course/section are stored within a **nested** element: a list associated with the key `'texts'`. 

In [13]:
type(my_course['texts'])

This particular course/section has 7 entries under the `texts` heading.

In [14]:
len(my_course['texts'])
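
When a list is nested inside a dictionary, keys and indexes can be combined, one step at a time. The following is a sketch on invented data: the `title` key and the book titles are assumptions, and the real dataset may name things differently.

```python
# A made-up record with a nested list; 'title' is a hypothetical key.
toy_course = {
    "department": "HIST",
    "texts": [
        {"title": "An Invented Textbook"},
        {"title": "Another Invented Textbook"},
    ],
}

toy_texts = toy_course["texts"]   # a key retrieves the nested list
print(len(toy_texts))             # 2
second_text = toy_texts[1]        # an index retrieves one item from that list
print(second_text["title"])       # Another Invented Textbook
```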

#### Question

At this point, we have drilled down a few levels into our dataset, and things might be starting to look rather complicated. Take a moment with your team to sketch on paper the structure of our dataset so far. Draw it in whatever way feels most intuitive to you (without worrying about Python data types for the moment).



#### Try it out!

Use indexing to look at each of the items in the `my_books` variable below. With your team, revise your sketch/data diagram to reflect how course materials are represented in this dataset.



In [None]:
my_books = my_course['texts']
my_books[0]

In [None]:
# Your code here

## Questions for Discussion

1. We're ultimately interested in the cost of course materials as reported by the bookstore. Where do the relevant data elements reside within the structure we have been exploring?


2. Up to now, we've been looking at our data in terms of content and organization: thinking about what different elements represent, and how those elements relate to each other. Now take a few moments and look at the data in terms of **syntax**.
   - With your team, make an inventory of the different kinds of punctuation marks you see in the parts of the dataset you have examined, and discuss any patterns you notice in how punctuation is used.