# `004-data-manip-jsonlines`

Task: use list and dict comprehensions to work with data stored as newline-delimited JSON.

## Setup

In [1]:
import json
import requests
from collections import Counter

## Task

As you discovered in Homework 1, preparing data is often a key (and tedious) part of model training. Here we'll practice a few basic data-prep tasks. Large datasets are often stored in streaming-friendly formats such as newline-delimited JSON (ndjson, also called JSON Lines), in which each line is a complete JSON object, so we'll practice with a small dataset in that format.
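To make the format concrete, here is a minimal sketch that parses newline-delimited JSON from an in-memory string. The two records and the name `ndjson_text` are made up for illustration and are not taken from the dataset used below.

```python
import json

# Hypothetical JSON-lines text: one complete JSON object per line.
ndjson_text = (
    '{"title": "A", "resources": [{"url": "http://example.com/a"}]}\n'
    '{"title": "B", "resources": []}\n'
)

# Each non-empty line is parsed on its own; there is no enclosing list,
# which is what lets a reader process the file one record at a time.
records = [json.loads(line) for line in ndjson_text.splitlines() if line.strip()]

print(len(records))         # 2
print(records[0]["title"])  # A
```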

In [2]:
url = 'https://raw.githubusercontent.com/jsonlines/guide/master/datagov100.json'

1. Load the data. Remove `tags` and `extras` as you read it in, since these are large data structures that we don't need. (You can use `del dct[key]` to remove a key from a dictionary.)
2. What is the most common `license_title` for these datasets? (Use `Counter`, imported above from the `collections` module, with a list comprehension.) *Ignoring datasets with no license listed, you should get 'U.S. Government Work'.*
3. What is the average number of `resources` per dataset? (Use `len(dataset['resources'])` in a list comprehension.) *You should get 1.36.*
4. Create a dictionary mapping the title of each dataset to the `url` of its first listed resource. (Use a dict comprehension.) Skip datasets with no resources. Use this dict to find the URL of `'Geologic map of Arkansas (NGMDB)'`.

## Solution

In [3]:
# Your code here
# You may find it helpful to look at `data[0]` and `data[0].keys()`.

In [4]:
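response = requests.get(url, stream=True)
response.raise_for_status()  # fail early on an HTTP error
data = []
for line in response.iter_lines():
    if not line:  # skip blank keep-alive lines
        continue
    item = json.loads(line)
    # Drop the large fields we don't need.
    del item['tags']
    del item['extras']
    data.append(item)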

In [5]:
Counter(dataset['license_title'] for dataset in data)

Counter({'Creative Commons CCZero': 8,
         'Other License Specified': 8,
         'U.S. Government Work': 15,
         None: 69})
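In the output above, `None` (no license listed) is the most frequent value, so the expected answer only appears once missing values are set aside. One possible way to do that is sketched below on a small made-up list; `sample` is a hypothetical stand-in for `data`, not the real dataset.

```python
from collections import Counter

# Hypothetical stand-in for `data`; only the key used here is included.
sample = [
    {"license_title": None},
    {"license_title": None},
    {"license_title": None},
    {"license_title": "U.S. Government Work"},
    {"license_title": "U.S. Government Work"},
    {"license_title": "Creative Commons CCZero"},
]

# Count only the entries that actually list a license.
license_counts = Counter(
    ds["license_title"] for ds in sample if ds["license_title"] is not None
)

# most_common(1) returns a one-element list of (value, count) pairs.
top_license, top_count = license_counts.most_common(1)[0]
print(top_license, top_count)  # U.S. Government Work 2
```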

In [6]:
sum(len(dataset['resources']) for dataset in data) / len(data)

1.36

In [7]:
Counter(len(dataset['resources']) for dataset in data)

Counter({1: 71, 2: 18, 4: 2, 3: 7, 0: 2})
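As a sanity check, the average can also be recovered from this histogram alone. The sketch below copies the counts shown above into a new `Counter` (the variable names are illustrative) and recomputes the mean as total resources divided by total datasets.

```python
from collections import Counter

# Histogram copied from the output above: {number of resources: number of datasets}.
resource_counts = Counter({1: 71, 2: 18, 4: 2, 3: 7, 0: 2})

total_datasets = sum(resource_counts.values())
total_resources = sum(n * count for n, count in resource_counts.items())

print(total_datasets)                    # 100
print(total_resources)                   # 136
print(total_resources / total_datasets)  # 1.36
```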

In [10]:
title_to_url = {ds['title']: ds['resources'][0]['url'] for ds in data if ds['resources']}

In [11]:
title_to_url['Geologic map of Arkansas (NGMDB)']

'http://ngmdb.usgs.gov/Prodesc/proddesc_16308.htm'
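Because datasets without resources are skipped, their titles are absent from the dict, and indexing with such a title raises `KeyError`. The toy example below (made-up records, not the real data) shows the same filtering pattern and uses `dict.get` for a lookup that may miss.

```python
# Hypothetical records with the same shape as the fields used above.
sample = [
    {"title": "Has one resource", "resources": [{"url": "http://example.com/one"}]},
    {"title": "Has no resources", "resources": []},
]

# An empty list is falsy, so the condition drops datasets with no resources
# before `resources[0]` is ever evaluated.
toy_title_to_url = {
    ds["title"]: ds["resources"][0]["url"] for ds in sample if ds["resources"]
}

print(toy_title_to_url["Has one resource"])      # http://example.com/one
print(toy_title_to_url.get("Has no resources"))  # None
```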