The data set we'll use is in a format called **JavaScript Object Notation (JSON)**. As the name indicates, JSON originated from the JavaScript language, but it has since become a language-independent format.

From a Python perspective, **JSON** can be thought of as a collection of Python objects nested inside each other.

In [4]:
"""[{"team_1": "France",
         "team_2": "Croatia",
        "game_type": "Final",
        "score" : [4, 2]
        },
    {
        "team_1": "Belgium",
        "team_2": "England",
        "game_type": "3rd/4th Playoff",
        "score" : [2, 0]
    }]"""

'[{"team_1": "France",\n         "team_2": "Croatia",\n        "game_type": "Final",\n        "score" : [4, 2]\n        },\n    {\n        "team_1": "Belgium",\n        "team_2": "England",\n        "game_type": "3rd/4th Playoff",\n        "score" : [2, 0]\n    }]'

The `JSON` above is a list, where each element in the list is a dictionary. Each of the dictionaries has the same keys, and one of the values in each dictionary is itself a list.


The Python `json` [module](https://docs.python.org/3.7/library/json.html#module-json) contains a number of functions that make working with JSON objects easier. We can use the `json.loads()` function to convert JSON data contained in a string to the equivalent set of Python objects:

In [3]:
json_string = """[
{ "team_1": "France", "team_2": "Croatia", "game_type": "Final","score" : [4, 2]},
{"team_1": "Belgium","team_2": "England","game_type": "3rd/4th Playoff","score" : [2, 0]}
]"""

print(json_string)

[
{ "team_1": "France", "team_2": "Croatia", "game_type": "Final","score" : [4, 2]},
{"team_1": "Belgium","team_2": "England","game_type": "3rd/4th Playoff","score" : [2, 0]}
]


In [34]:
import json

python_obj = json.loads(json_string)
python_obj 

[{'team_1': 'France',
  'team_2': 'Croatia',
  'game_type': 'Final',
  'score': [4, 2]},
 {'team_1': 'Belgium',
  'team_2': 'England',
  'game_type': '3rd/4th Playoff',
  'score': [2, 0]}]

We can observe a few things:

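* The formatting from our original string is gone. This is because the default display of Python lists and dictionaries follows its own simple layout, regardless of how the source string was written.
* The order of the keys in each dictionary may differ from the original string, depending on the Python version. Before Python 3.7, dictionaries did not guarantee a fixed order (CPython 3.6 preserved insertion order only as an implementation detail). In the output above, the original order is preserved.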
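As a quick check of how JSON values map onto Python types, the sketch below parses a small made-up string and inspects the result. The `sample` string and its keys are illustrative assumptions, not part of the data set.

```python
import json

# Illustrative JSON string; not part of the Hacker News data set.
sample = (
    '{"name": "France", "goals": 4, "rating": 7.5, '
    '"champion": true, "coach": null, "scores": [4, 2]}'
)

parsed = json.loads(sample)

# JSON objects become dicts, arrays become lists, and
# true/false/null become True/False/None.
for key, value in parsed.items():
    print(key, type(value).__name__)
```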

One of the places where the JSON format is commonly used is in the results returned by an **application programming interface (API)**. **APIs** are interfaces that can be used to send and receive data between different computer systems.

The data set **hn_2014.json** was downloaded from the **Hacker News API**. It's a different set of data from the CSV and it contains data about stories from Hacker News in 2014.

To read a file from JSON format, we use the `json.load()` function. Note that the function is `json.load()` without an `"s"` at the end. The `json.loads()` function is used for loading JSON data from a string (`"loads"` is short for `"load string"`), whereas the `json.load()` function is used to load from a file object. 
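The difference between the two functions can be sketched without touching the disk. Here `io.StringIO` stands in for a file opened with `open()`, and `json_text` is a made-up string used only for illustration.

```python
import io
import json

# Illustrative JSON text; not part of the Hacker News data set.
json_text = '[{"team_1": "France", "score": [4, 2]}]'

# json.loads() parses a string directly.
from_string = json.loads(json_text)

# json.load() reads from a file-like object; io.StringIO plays the
# role of a file handle here.
fake_file = io.StringIO(json_text)
from_file = json.load(fake_file)

# Both routes produce the same Python objects.
print(from_string == from_file)
```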

In [35]:
with open("hn_2014.json") as f:
    hn = json.load(f)

In [36]:
print(hn[:2])

[{'author': 'dragongraphics', 'numComments': 0, 'points': 2, 'url': 'http://ashleynolan.co.uk/blog/are-we-getting-too-sassy', 'storyText': '', 'createdAt': '2014-05-29T08:07:50Z', 'tags': ['story', 'author_dragongraphics', 'story_7815238'], 'createdAtI': 1401350870, 'title': 'Are we getting too Sassy? Weighing up micro-optimisation vs. maintainability', 'objectId': '7815238'}, {'author': 'jcr', 'numComments': 0, 'points': 1, 'url': 'http://spectrum.ieee.org/automaton/robotics/home-robots/telemba-telepresence-robot', 'storyText': '', 'createdAt': '2014-05-29T08:05:58Z', 'tags': ['story', 'author_jcr', 'story_7815234'], 'createdAtI': 1401350758, 'title': 'Telemba Turns Your Old Roomba and Tablet Into a Telepresence Robot', 'objectId': '7815234'}]


In [37]:
print(len(hn))
print(type(hn))
print(type(hn[0]))

35806
<class 'list'>
<class 'dict'>


Our `hn` variable is a list, and the first object in it is a dictionary. In JSON data, the type of the first object will almost always be the type of every object in the list.

Here is a summary of the keys and the data that they contain:

* `author`: The username of the person who submitted the story.
* `createdAt`: The date and time at which the story was created.
* `createdAtI`: An integer value representing the date and time at which the story was created.
* `numComments`: The number of comments that were made on the story.
* `objectId`: The unique identifier from Hacker News for the story.
* `points`: The number of points the story acquired, calculated as the total number of upvotes minus the total number of downvotes.
* `storyText`: The text of the story (if the story contains text).
* `tags`: A list of tags associated with the story.
* `title`: The title of the story.
* `url`: The URL that the story links to (if the story links to a URL).

In [38]:
# Create a function that deletes a particular key from a dictionary

def del_key(dict_, key):
    # create a copy so we don't
    # modify the original dict
    modified_dict = dict_.copy()
    del modified_dict[key]
    return modified_dict
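
To see that the helper leaves its input untouched, here is a minimal check on a made-up record (the `story` dictionary is an assumption for illustration). The function is repeated so the snippet runs on its own.

```python
def del_key(dict_, key):
    # Same helper as above, repeated so this snippet is self-contained.
    modified_dict = dict_.copy()
    del modified_dict[key]
    return modified_dict

# Illustrative record; not taken from the data set.
story = {"title": "Example", "points": 10, "createdAtI": 1401350870}
trimmed = del_key(story, "createdAtI")

print(trimmed)  # the copy, without the deleted key
print(story)    # the original, still holding all three keys
```

Note that `dict.copy()` makes a shallow copy: nested values such as the `tags` list are shared between the original and the copy, which is fine here because only a top-level key is removed.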

In [39]:
clean_hn = []
for d in hn:
    clean_hn.append(del_key(d,'createdAtI'))   

**List Comprehensions**. A list comprehension provides a concise way of creating lists in a single line of code.

To transform the loop above into a list comprehension, we do the following within brackets:

* Start with the code that transforms each item.
* Continue with our for statement (without a colon).

We can then assign the list comprehension to a variable name.

The "transformation" step of our list comprehension can be anything, including a function or method.

In [40]:
hn_clean = [del_key(d,'createdAtI') for d in hn]
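
The equivalence between the loop and the comprehension can be checked on a small list. The `points` values below are made up for illustration.

```python
# Illustrative data; not taken from the data set.
points = [2, 1, 1, 889]

# Loop version: start with an empty list and append each transformed item.
doubled_loop = []
for p in points:
    doubled_loop.append(p * 2)

# Comprehension version: the transformation comes first,
# followed by the for clause (without a colon).
doubled_comp = [p * 2 for p in points]

print(doubled_loop == doubled_comp)
```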


List comprehensions can be used for many different things. Three common applications are:

1. Transforming a list
2. Creating a new list
3. Reducing a list
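
One possible reading of each application, sketched on made-up data (the `stories` list and the threshold of 40 are assumptions for illustration):

```python
# Illustrative data; not taken from the data set.
stories = [
    {"title": "Alpha", "points": 5},
    {"title": "Beta", "points": 1200},
    {"title": "Gamma", "points": 42},
]

# 1. Transforming a list: same length, each item changed.
trimmed = [{"title": s["title"]} for s in stories]

# 2. Creating a new list: pull one value out of each item.
titles = [s["title"] for s in stories]

# 3. Reducing a list: keep only the items that pass a condition.
popular = [s for s in stories if s["points"] > 40]

print(titles)
print(len(popular))
```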

In [41]:
# list comprehension to extract the url value from each dictionary in hn_clean
urls = [d["url"] for d in hn_clean]
urls[:2]


['http://ashleynolan.co.uk/blog/are-we-getting-too-sassy',
 'http://spectrum.ieee.org/automaton/robotics/home-robots/telemba-telepresence-robot']

To include an if statement in a list comprehension, we include it at the very end, before the closing bracket:

In [42]:
thousand_points = [d for d in hn_clean if d['points']>1000]
num_thousand_points = len(thousand_points)

We can tell functions like `min()`, `max()`, and `sorted()` how to compare complex objects such as dictionaries and lists of lists. We do this by using the optional `key` argument, which takes a function that is called on each item to produce the value used for comparison.

The parentheses are what tell Python to execute the function, so if we omit the parentheses we can treat a function like a variable.
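
A small sketch of the difference between calling a function and referring to it. The `greet` function and the `words` list are made up for illustration.

```python
def greet():
    return "hello"

# With parentheses, Python calls the function and we get its return value.
result = greet()

# Without parentheses, we get the function object itself, which can be
# assigned to another name or passed as an argument.
alias = greet
print(result, alias())

# Passing a function (no parentheses) as the key argument:
# max() calls len on each word and compares the results.
words = ["pear", "fig", "banana"]
longest = max(words, key=len)
print(longest)
```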

In [43]:
# Creating a "key function" that accepts a single dictionary and returns the value from the numComments key.

def get_num_comments(dictionary):
    return dictionary['numComments']
    
# Use the max() function to find the value from the hn_clean list with the most comments:
most_comments = max(hn_clean, key=get_num_comments)  # max(), min(), and sorted() accept any iterable
most_comments

{'author': 'platz',
 'numComments': 1208,
 'points': 889,
 'url': 'https://blog.mozilla.org/blog/2014/04/03/brendan-eich-steps-down-as-mozilla-ceo/',
 'storyText': None,
 'createdAt': '2014-04-03T19:02:53Z',
 'tags': ['story', 'author_platz', 'story_7525198'],
 'title': 'Brendan Eich Steps Down as Mozilla CEO',
 'objectId': '7525198'}

Usually, we create functions when we want to perform the same task many times.

Python provides a special syntax to create `temporary` functions to use just once. These functions are called **lambda functions**. Lambda functions can be defined in a single line, which allows us to define a function we want to pass as an argument at the time we need it.

To create a lambda function, we:

* Use the `lambda` keyword, followed by
* The parameter and a colon, and then
* The transformation we wish to perform on our argument

If a function is particularly complex, it may be a better choice to define a regular function rather than create a lambda, even if it will only be used once. 

In [44]:
def multiply(a, b):
    return a * b

multiply = lambda a,b: a*b

Assigning a `lambda` to a variable so it can be called by name is an uncommon pattern. The primary use of lambda functions is to define a function in place, such as when we provide a function as an argument.
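
To illustrate, the sketch below sorts a made-up list twice, once with a named key function and once with an equivalent lambda defined in place. The `games` data and the `get_goals` name are assumptions for illustration.

```python
# Illustrative data; not taken from the data set.
games = [
    {"team": "Croatia", "goals": 2},
    {"team": "France", "goals": 4},
    {"team": "England", "goals": 0},
]

# Named key function, defined ahead of time.
def get_goals(game):
    return game["goals"]

by_named = sorted(games, key=get_goals, reverse=True)

# The same key function written in place as a lambda.
by_lambda = sorted(games, key=lambda game: game["goals"], reverse=True)

print(by_named == by_lambda)
print([g["team"] for g in by_lambda])
```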

In [45]:
hn_sorted_points = sorted(hn_clean, key=lambda d: d['points'], reverse=True)  # sort highest to lowest
hn_sorted_points[:2]

[{'author': 'frederfred',
  'numComments': 398,
  'points': 2732,
  'url': 'http://gabrielecirulli.github.io/2048/',
  'storyText': '',
  'createdAt': '2014-03-10T15:44:42Z',
  'tags': ['story', 'author_frederfred', 'story_7373566'],
  'title': '2048',
  'objectId': '7373566'},
 {'author': 'brokenparser',
  'numComments': 260,
  'points': 1958,
  'url': 'https://thedaywefightback.org/',
  'storyText': '',
  'createdAt': '2014-02-11T08:12:28Z',
  'tags': ['story', 'author_brokenparser', 'story_7216471'],
  'title': 'Today is The Day We Fight Back',
  'objectId': '7216471'}]

In [46]:
# list comprehension to return a list of the five post titles (dictionary key title) that have the most points 

top_5_titles = [d['title'] for d in sorted(hn_clean, key = lambda d:d['points'], reverse = True)[:5]]
top_5_titles

['2048',
 'Today is The Day We Fight Back',
 'Wozniak: “Actually, the movie was largely a lie about me”',
 'Microsoft Open Sources C# Compiler',
 'Elon Musk: To the People of New Jersey']

So far, we've worked with our `JSON data` using `pure Python`. One other option available to us is to convert the `JSON` to a `pandas dataframe` and then use pandas methods to manipulate it.

Pandas has the `pandas.read_json()` function, which is designed to read **JSON** from either a `file` or a `JSON string`. 

In our case, our `JSON` exists as `Python objects` already, so we don't need to use this function.

Because the structure of `JSON objects` can vary a lot, sometimes we will need to prepare our data in order to be able to convert it to a tabular form. In our case, our data is a list of dictionaries, which pandas is easily able to convert to a dataframe.

We can use the `pandas.DataFrame()` constructor and pass the list of dictionaries directly to it to convert the JSON to a dataframe:

In [47]:
import pandas as pd

hn_df = pd.DataFrame(hn_clean)
hn_df.head()

| | author | createdAt | numComments | objectId | points | storyText | tags | title | url |
|---|---|---|---|---|---|---|---|---|---|
| 0 | dragongraphics | 2014-05-29T08:07:50Z | 0 | 7815238 | 2 | | [story, author_dragongraphics, story_7815238] | Are we getting too Sassy? Weighing up micro-op... | http://ashleynolan.co.uk/blog/are-we-getting-t... |
| 1 | jcr | 2014-05-29T08:05:58Z | 0 | 7815234 | 1 | | [story, author_jcr, story_7815234] | Telemba Turns Your Old Roomba and Tablet Into ... | http://spectrum.ieee.org/automaton/robotics/ho... |
| 2 | callum85 | 2014-05-29T08:05:06Z | 0 | 7815230 | 1 | | [story, author_callum85, story_7815230] | Apple Agrees to Buy Beats for $3 Billion | http://online.wsj.com/articles/apple-to-buy-be... |
| 3 | d3v3r0 | 2014-05-29T08:00:08Z | 0 | 7815222 | 1 | | [story, author_d3v3r0, story_7815222] | Don’t wait for inspiration | http://alexsblog.org/2014/05/29/dont-wait-for-... |
| 4 | timmipetit | 2014-05-29T07:46:19Z | 0 | 7815191 | 1 | | [story, author_timmipetit, story_7815191] | HackerOne Get $9M In Series A Funding To Build... | http://techcrunch.com/2014/05/28/hackerone-get... |


In [48]:
# create a boolean mask based on whether each item in tags has a length of 4
tags = hn_df['tags']
four_tags = tags[tags.apply(len)==4]
four_tags.head()

43     [story, author_alamgir_mand, story_7813869, sh...
86       [story, author_cweagans, story_7812404, ask_hn]
104    [story, author_nightstrike789, story_7812099, ...
107    [story, author_ISeemToBeAVerb, story_7812048, ...
109       [story, author_Swizec, story_7812018, show_hn]
Name: tags, dtype: object

Python has a special version of an `if statement` known as a **ternary operator** (formally, a conditional expression). We can use the **ternary operator** whenever we need to return one of two values depending on a boolean expression. The syntax is as follows:

`[on_true] if [expression] else [on_false]`
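
A minimal sketch of the syntax on plain lists. The `last_tag_or_none` name and the sample tag lists are assumptions for illustration.

```python
def last_tag_or_none(tags):
    # Return the last item when the list has exactly four items,
    # otherwise return None.
    return tags[-1] if len(tags) == 4 else None

# Illustrative tag lists modeled on the shape of the data.
three_tags = ["story", "author_jcr", "story_7815234"]
four_tags_example = ["story", "author_cweagans", "story_7812404", "ask_hn"]

print(last_tag_or_none(three_tags))
print(last_tag_or_none(four_tags_example))
```

The one-line form is equivalent to a regular `if`/`else` block that returns `tags[-1]` in one branch and `None` in the other.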

In [49]:
# def extract_tag(l):
#     return l[-1] if len(l) == 4 else None



cleaned_tags = hn_df['tags'].apply(lambda l:l[-1] if len(l) == 4 else None) # Where the item is a list with length four, return the last item.


In [51]:
hn_df['tags'] = cleaned_tags
