(updated March 2022)

# Day 3: Working with structured data
Instructor: Dolsy

Now we're covering some content that's not in the Open edX course. We think it's useful for learning Python fundamentals, especially for working with data on the web.

Let's start a new notebook.

Pre-work: you should have watched and followed along with the video about Python dictionaries. 

A Python dictionary holds pairs of objects: keys and values. 
- Keys are most often strings or integers (other immutable values, such as floats or tuples of strings, also work).
- A value may be any type, whether an integer, string, boolean, list, etc. It can even be another dictionary! 
- In other programming languages, these structures might be called maps, hashmaps, or associative arrays.
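
As a quick illustration of these points, here is a small sketch. The `course` dictionary and its contents are made up for this example and are not part of the lesson data.

```python
# A made-up dictionary for illustration only
course = {
    "title": "Python Camp",                  # string value
    "day": 3,                                # integer value
    "in_person": False,                      # boolean value
    "topics": ["lists", "dictionaries"],     # list value
    "instructor": {"name": "Dolsy"},         # a dictionary nested inside a dictionary
    2022: "year last updated",               # an integer used as a key
}

# Chain square brackets to reach into nested structures
print(course["instructor"]["name"])
print(course["topics"][0])
print(course[2022])
```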

The last exercise in the video was to create a dictionary that holds data about a city. We'll walk through creating one now. If you did the exercise, I'd also invite you to share your dictionary in the Slack chat.


In [None]:
city = {"name": "District of Columbia", "population": 692683, 
        "country": "USA", 
        "landmarks": ["Lincoln Memorial", "U.S. Capitol", "GWU"]}

In today's lesson, we're going to work with lists and dictionaries as the building blocks for data structures that we can transform in Python in various ways. Transforming data from one structure to another is a very common task in programming, and one for which people often turn to a language like Python.

We'll also look at a Python library called pandas, which provides powerful tools for working with large datasets efficiently.

### Extracting values from a dictionary

In the pre-work for today, you saw how to extract a single value from a dictionary by key. Let's say we want to print the name and population of the city in our `city` dictionary.

In [None]:
name = city['name']
population = city['population']
print('Name: ', name)
print('Population: ', population)

What if we had to do this for a number of cities? As we learned in Python Camp Day 2, we can use **functions** to encapsulate code that we need to run repeatedly. 

#### Exercise:

Can you write a function that prints the name, population, and the number of landmarks from a `city` dictionary (as defined above)? 

##### Optional
Using the `get` method on dictionaries, account for cases where a city may have no landmarks associated with it.

##### Answer
```
def print_city_info(city):
   name = city['name']
   population = city['population']
   landmarks = city.get('landmarks', [])
   print('Name: ', name)
   print('Population: ', population) 
   print('Number of landmarks: ', len(landmarks))
```

In [None]:
def print_city_info(city):
    name = city['name']
    population = city['population']
    landmarks = city.get('landmarks', [])
    print('Name: ', name)
    print('Population: ', population) 
    print('Number of landmarks: ', len(landmarks))

In [None]:
print_city_info(city)

As you saw in the pre-work video, there are two ways of accessing the value associated with a key in a dictionary.
1. We can simply put the key in square brackets after the name of the dictionary. If the key doesn't exist, we'll get a `KeyError`. 
2. We can use the `get` method, which returns `None` -- or another default value that we can supply -- if the key does not exist.

In [None]:
city['longitude']

In [None]:
city.get('longitude')

In [None]:
city.get('longitude', 'Not defined')

### Getting data from the web


Dictionaries are frequently used in Python when bringing in data from the web.

An **API** (application programming interface), in this context, is an endpoint or URL on the web that returns structured data in response to a request from another computer, application, or browser.

Many APIs return data in **JSON format** (JavaScript Object Notation, pronounced Jay-son).

Let's look at an example in the browser. This is from Geonames.org, a crowd-sourced database of geographical information.

http://api.geonames.org/countryInfoJSON?formatted=true&lang=it&username=dolsysmith&style=full

Note that the structure resembles a Python dictionary. Can you describe this structure, using the Python terms we've learned so far?
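
To make the comparison concrete, here is a minimal sketch using Python's built-in `json` module. The JSON text below is a shortened, hand-written sample shaped like the Geonames response; the population values are placeholders, not actual API output.

```python
import json

# Hand-written sample for illustration only; real responses contain many more fields
sample_json = '''
{
  "geonames": [
    {"countryName": "Andorra", "population": "1000"},
    {"countryName": "Italia", "population": "2000"}
  ]
}
'''

parsed = json.loads(sample_json)

print(type(parsed))                # the outer JSON object becomes a dict
print(type(parsed["geonames"]))    # the JSON array becomes a list
print(parsed["geonames"][1]["countryName"])
```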

### Using an API

The Geonames [country info](https://www.geonames.org/export/web-services.html#countryInfo) API returns structured data about each country in the world. 

#### Parts of an API
1. A base URL or endpoint
2. A set of optional or required parameters
3. A method of authentication

Looking at a URL, the endpoint or base is the part before the question mark.

In [None]:
country_url = 'http://api.geonames.org/countryInfoJSON'

Parameters are included after the question mark in the URL, separated by ampersands (`&`). 

In [None]:
# We won't include the country parameter, so that we can get all countries by default
lang = 'en'
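
To see how a URL breaks down into these parts, here is a small sketch using Python's standard `urllib.parse` module (which this lesson does not otherwise use). It splits the example Geonames URL shown earlier, then builds a query string from a dictionary. `your_username` is a placeholder.

```python
from urllib.parse import urlsplit, parse_qs, urlencode

example_url = ('http://api.geonames.org/countryInfoJSON'
               '?formatted=true&lang=it&username=dolsysmith&style=full')

parts = urlsplit(example_url)
# Everything before the question mark: scheme, host, and path
endpoint = f'{parts.scheme}://{parts.netloc}{parts.path}'
# Everything after the question mark, parsed into a dictionary of lists
query = parse_qs(parts.query)

print(endpoint)
print(query)

# Going the other way: building a query string from a dictionary
rebuilt = endpoint + '?' + urlencode({'lang': 'en', 'username': 'your_username'})
print(rebuilt)
```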

Web APIs use a variety of methods of authentication. The Geonames API requires that you include a `username` parameter in your request URL.

If you followed the instructions in the Pre-work for today, you should have created a Geonames account. If you didn't, let's take a moment to do that now. 

In [None]:
# Insert your username between the quotation marks
username = 'dolsysmith'

#### Making API requests in Python

The [requests](https://docs.python-requests.org/en/latest/) library for Python simplifies the process of interacting with web APIs. Using `requests`, we can just supply the different parts defined above, and the library will take care of the rest.

But before we can use `requests`, we have to `import` it. 

Importing a Python library brings a set of additional functionality into the current Python session.

In [None]:
import requests

The `requests` library (also called a **module**) provides a number of functions. We can access these with dot notation (similar to how we call `append` on a list).

The particular function we need is called `get`. (Note that this is **NOT** the same as the `get` method we used with Python dictionaries. It just happens to have the same name.)

`requests.get` takes a few possible arguments. 
- A URL as a string (required).
- Optionally, a set of parameters as a Python dictionary.

In [None]:
params = {'lang': lang,
         'username': username}

In [None]:
response = requests.get(country_url, params)

We assigned the result of `requests.get` to a variable called `response`. 

The first thing we can do is check the **status code** of our request. 
- `200` means that the request was successful.
- Any other number usually means that an error was encountered.
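
For reference, Python's standard `http` module knows the names of the standard status codes. The sketch below uses a hypothetical helper, `describe_status`, written only for this example; it groups codes by their first digit, which is the usual convention.

```python
from http import HTTPStatus

def describe_status(code):
    '''
    Returns a short description of an HTTP status code.
    (A hypothetical helper for illustration; not part of requests.)
    '''
    phrase = HTTPStatus(code).phrase
    if 200 <= code < 300:
        kind = 'success'
    elif 400 <= code < 500:
        kind = 'client error'
    elif code >= 500:
        kind = 'server error'
    else:
        kind = 'other'
    return f'{code} {phrase} ({kind})'

for code in (200, 401, 404, 500):
    print(describe_status(code))
```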

In [None]:
response.status_code

If you got an error, look at `response.text` to see if there's an error message associated with it.

If you got `200`, you can access the data via the `json` method.

In [None]:
data = response.json()

`requests` has loaded the JSON into a Python dictionary, which allows us to use the keys to access the data. The top-level key is `geonames`.

In [None]:
data['geonames']

### Analyzing/transforming data with lists & dictionaries

`data['geonames']` is a list of Python dictionaries. Let's compare it to the same data presented as a [spreadsheet](https://docs.google.com/spreadsheets/d/197QM-eq03pZ7Fva_7k9S1oNxHsCIPrY85UnEsTBFYSI/edit?usp=sharing).

What do you notice about this comparison? 

Spreadsheets have various functions that we can use to analyze our data. For instance, we can sort the sheet by population to find the most or least populous countries.

How could we do this with Python?

#### Exercise

Take a minute to make a plan for some code to find the most populous country in our dataset. Don't write any Python code yet. Just focus on the logic and the elements of the data structure. 
- What is the data element we want to measure?
- What is/are the data element(s) we want to return? 
- What operations do we need to perform on `data['geonames']` in order to accomplish this?

If you're following along in the notebook code for this lesson, try not to look ahead before doing this exercise.

#### A first pass

In [None]:
def most_populous(countries):
    '''
    Returns the most populous country from a list of countries.
    :param countries: a list of dictionaries; should contain countryName and population keys
    '''
    # Why are we setting result to be an empty dictionary rather than a string?
    result = {}
    for country in countries:
        if country['population'] > result['population']:
            result = country
    return result

In [None]:
most_populous(data['geonames'])

You should have gotten a `KeyError` when running the above code. Try stepping through the code with the first element in `data['geonames']` and see if you can identify the problem.

#### A second pass

In [None]:
def most_populous(countries):
    '''
    Returns the most populous country from a list of countries.
    :param countries: a list of dictionaries; should contain countryName and population keys
    '''
    # Why are we setting result to be an empty dictionary rather than a string?
    result = {}
    for country in countries:
        # Why do we need result.get('population', 0) instead of result['population']?
        if country['population'] > result.get('population', 0):
            result = country
    return result

In [None]:
most_populous(data['geonames'])

What's up with this `TypeError`? Try stepping through the code again. 

Hint: Running `type(data['geonames'][0]['population'])` might help.

#### A third pass

In [None]:
def most_populous(countries):
    '''
    Returns the most populous country from a list of countries.
    :param countries: a list of dictionaries; should contain countryName and population keys
    '''
    # Why are we setting result to be an empty dictionary rather than a string?
    result = {}
    for country in countries:
        # Converting the population value to an integer
        pop = int(country['population'])
        # Why do we need result.get('population', 0) instead of result['population']?
        if pop > result.get('population', 0):
            result = country
    return result

In [None]:
most_populous(data['geonames'])

Yikes! I thought we fixed it! Try stepping through the code again, this time using the first **two** elements of `data['geonames']`.

#### A fourth pass

Maybe the problem lies in trying to capture the whole dictionary of the country each time. Another approach: leverage the fact that `countries` is a list. We can simply keep track of the **index** of the most populous country each time, and then at the end of the loop, whatever that index is, we return the element at that position from our `countries` list.

In [None]:
def most_populous(countries):
    '''
    Returns the most populous country from a list of countries.
    :param countries: a list of dictionaries; should contain countryName and population keys
    '''
    # We set it to 0 so that it will be initialized the first time we use it below
    most_pop_index = 0
    # We'll also keep track of the largest population value we've seen
    most_pop = 0
    # Remember that len() gives you the length of a list
    num_countries = len(countries)
    # Here we use the range() function to loop over the numbers from 0 up to (but not including) num_countries
    for i in range(num_countries):
        # Converting the population value to an integer
        # Note that now we need to access the country by index explicitly
        pop = int(countries[i]['population'])
        if pop > most_pop:
            # Here we need to update both the index of the most populous country and the actual population value
            most_pop_index = i
            most_pop = pop
    return countries[most_pop_index]

In [None]:
most_populous(data['geonames'])

Whew! We finally got the result we wanted. The advantage of this approach is that we made all the steps explicit. But is there a more concise way to achieve the same result? Or what if we wanted to find the top 5 most populous countries?

As it happens, there are a few ways. Let's look at one way that involves **transforming** our data structure (instead of keeping track of the largest value by hand).
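
As an aside, another of those ways is Python's built-in `max` function with a `key` argument. This sketch uses a small hand-made list so that it runs on its own; the country names and population figures are placeholders, not Geonames data, and `population_as_int` is a name chosen for this example.

```python
# Placeholder records shaped like the entries in data['geonames']
sample_countries = [
    {'countryName': 'Country A', 'population': '1200'},
    {'countryName': 'Country B', 'population': '5400'},
    {'countryName': 'Country C', 'population': '300'},
]

def population_as_int(country):
    # The population is stored as a string, so convert it before comparing
    return int(country['population'])

# max() calls population_as_int on each element and returns the element with the largest result
largest = max(sample_countries, key=population_as_int)
print(largest['countryName'])
```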

#### Lists vs dictionaries

But first, a detour.

`data['geonames']` has the disadvantage that we can't easily look up a country by name. For that, we need to loop over the whole list.

If needed: review indexing by position vs. indexing by key.
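
A brief sketch of that review, using placeholder data (the names and numbers are invented for this example):

```python
# Placeholder data for illustration
countries_list = [
    {'countryName': 'Country A', 'population': '1200'},
    {'countryName': 'Country B', 'population': '5400'},
]
countries_by_name = {'Country A': 1200, 'Country B': 5400}

# A list is indexed by position: we need to know where the item sits
print(countries_list[1]['population'])

# A dictionary is indexed by key: we can ask for the item by name
print(countries_by_name['Country B'])

# Finding a country by name in the list means checking each element in turn
found = None
for country in countries_list:
    if country['countryName'] == 'Country B':
        found = country
print(found['population'])
```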

Let's transform our list of dictionaries into a **dictionary** mapping each country's name to its population.

In [None]:
def create_pop_map(countries):
    '''
    Returns a dictionary mapping country names to their populations.
    :param countries: a list of dictionaries ...
    '''
    pop_map = {}
    for country in countries:
        name = country['countryName']
        pop = int(country['population'])
        pop_map[name] = pop
    # Don't forget to return something!
    return pop_map

In [None]:
population_map = create_pop_map(data['geonames'])

In [None]:
population_map['United States']

Now we've transformed our list of dictionaries into a flatter structure: a dictionary mapping strings to integers.
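
The same transformation can also be written as a dictionary comprehension, a compact form of the loop above. Here is a sketch with placeholder data; `create_pop_map_short` is a name chosen for this example.

```python
def create_pop_map_short(countries):
    '''
    Returns a dictionary mapping country names to integer populations,
    built with a dictionary comprehension instead of an explicit loop.
    '''
    return {country['countryName']: int(country['population']) for country in countries}

# Placeholder records shaped like the entries in data['geonames']
sample_countries = [
    {'countryName': 'Country A', 'population': '1200'},
    {'countryName': 'Country B', 'population': '5400'},
]

print(create_pop_map_short(sample_countries))
```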

Python dictionaries don't have a built-in method for sorting the values, but there's a special kind of Python collection called a `Counter` that we can use to do that.

In [None]:
from collections import Counter

Here we create a new `Counter` object. 

In [None]:
pop_counter = Counter(population_map)

`pop_counter` still behaves like a Python dictionary.

In [None]:
pop_counter['Argentina']

But it has some added features, like a `most_common` method that accepts an integer (N) and returns the N elements with the highest values.

In [None]:
pop_counter.most_common(5)
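
A `Counter` isn't the only option. The built-in `sorted` function can produce the same ranking from a plain dictionary. This sketch uses placeholder values, and `top_n_by_value` is a hypothetical helper written for this example.

```python
from collections import Counter

# Placeholder values for illustration
sample_pop_map = {'Country A': 1200, 'Country B': 5400, 'Country C': 300}

def top_n_by_value(mapping, n):
    # items() yields (key, value) pairs; sort on the value, largest first
    pairs = sorted(mapping.items(), key=lambda pair: pair[1], reverse=True)
    # Slice off the first n pairs
    return pairs[:n]

print(top_n_by_value(sample_pop_map, 2))
print(Counter(sample_pop_map).most_common(2))
```

Both calls return a list of `(name, value)` tuples, so either approach can feed the same downstream code.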

#### Exercise

Write a function that returns the 5 largest countries by area.

```
def largest_areas(countries, top_n):
    '''
    Returns the top N countries by area.
    :param countries: a list of dictionaries ...
    :param top_n: an integer
    '''
    area_map = {}
    for country in countries:
        name = country['countryName']
        area = float(country['areaInSqKm'])
        area_map[name] = area
    # Don't forget to return something!
    return Counter(area_map).most_common(top_n)
```

In [None]:
# largest_areas(data['geonames'], 5)

#### Exercise

Write a function that returns a dictionary mapping each country name to the number of official languages recorded by Geonames for that country. 

```
def create_lang_map(countries):
    '''
    Returns a dictionary mapping country names to the number of official languages.
    :param countries: a list of dictionaries ...
    '''
    lang_map = {}
    for country in countries:
        name = country['countryName']
        # the languages value is a comma-separated string;
        # skip empty pieces so that an empty string counts as zero languages
        languages = [code for code in country['languages'].split(',') if code]
        lang_map[name] = len(languages)
    # Don't forget to return something!
    return lang_map
```

In [None]:
# create_lang_map(data['geonames'])

### Exporting data

One more thing we might want to do with our data is export it as a comma-separated values (CSV) file. 

Like JSON, CSV is a common format for working with and exchanging data. Unlike JSON, CSV is a **flat** format, meaning that the data is structured as a table, with rows and columns. Nested JSON doesn't always translate well to CSV. Our `data['geonames']` object, however -- a list of dictionaries -- lends itself quite well to transformation to CSV.

A CSV file has a header consisting of the names of the columns. In this case, we can derive the column names from the **keys** of the dictionaries representing the different countries. 

Assuming every dictionary in the list `data['geonames']` has the same keys, we could define the header as a comma-separated string consisting of the keys in the first dictionary in the list.

In [None]:
header = data['geonames'][0].keys()
print(','.join(header))
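
The assumption that every dictionary has the same keys can be checked before relying on it. Here is a minimal sketch with placeholder records, where the second record is deliberately missing a key:

```python
# Placeholder records; the second one is deliberately missing a key
records = [
    {'countryName': 'Country A', 'population': '1200', 'areaInSqKm': '50.0'},
    {'countryName': 'Country B', 'population': '5400'},
]

# Use the first record's keys as the reference
expected_keys = set(records[0].keys())

# Collect the positions of any records whose keys differ from the reference
mismatched = []
for position, record in enumerate(records):
    if set(record.keys()) != expected_keys:
        mismatched.append(position)

print(mismatched)
```

An empty `mismatched` list would mean the header derived from the first dictionary is safe to use for every row.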

What about the rows beneath the header? One way to think of them is as the set of strings produced by joining the **values** of each dictionary with commas.

In [None]:
# Here's what the first row might look like
row = data['geonames'][0]
values = []
for value in row.values():
    # We have to convert any non-string values to strings, or else the .join method won't work
    values.append(str(value))
print(','.join(values))

In practice, it's a little more complicated than that. For one, we need to account for cases where the elements in a row might contain commas themselves, which is usually done by using special quoting characters.

Fortunately, there are tools out there to help us!
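
One such tool ships with Python itself: the `csv` module, which handles quoting automatically. The sketch below writes two placeholder rows (invented for this example), one of which contains a comma inside a value.

```python
import csv
import io

# Placeholder rows; the first value for 'capital' contains a comma
rows = [
    {'countryName': 'Country A', 'capital': 'Sample City, North'},
    {'countryName': 'Country B', 'capital': 'Plainville'},
]

# io.StringIO stands in for a file so that we can inspect the result as a string
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
writer.writeheader()
for row in rows:
    writer.writerow(row)

csv_text = buffer.getvalue()
print(csv_text)
```

In the output, the value containing a comma is wrapped in double quotes, so a CSV reader can still tell where each column ends.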

#### Introducing pandas

With the `pandas` library, we can easily convert a list of Python dictionaries into a CSV.

First we import the `pandas` library.

In [None]:
# pd is an alias -- just an abbreviation so that we don't have to retype pandas a lot
import pandas as pd

Then we create what's called a **DataFrame** from our list of dictionaries.

In [None]:
df = pd.DataFrame.from_records(data['geonames'])

It even displays in our notebook as a nice, formatted table!

In [None]:
df

Now we can save to CSV in one line. We just need to pass as an argument a string that indicates the filename and the path to it in our environment. The `./` means that we want to save it in the same directory as our notebook.

We also use the optional **keyword argument** `index=False` so that the CSV won't include the row index (the numbers in the leftmost column of the table above) as a separate column.

In [None]:
df.to_csv('./countries-from-geonames.csv', index=False)