### The JSON Format

The Python json module contains a number of functions that make working with JSON data easier. We can use the json.loads() function to convert JSON data contained in a string to the equivalent set of Python objects:

In [7]:
json_string = """
[
  {
    "name": "Sabine",
    "age": 36,
    "favorite_foods": ["Pumpkin", "Oatmeal"]
  },
  {
    "name": "Zoe",
    "age": 40,
    "favorite_foods": ["Chicken", "Pizza", "Chocolate"]
  },
  {
    "name": "Heidi",
    "age": 40,
    "favorite_foods": ["Caesar Salad"]
  }
]
"""
import json
json_obj = json.loads(json_string)
print(type(json_obj))

<class 'list'>


json.loads() converted the JSON array into a Python list, so json_obj is a list of dictionaries:

In [8]:
print(json_obj)

[{'name': 'Sabine', 'age': 36, 'favorite_foods': ['Pumpkin', 'Oatmeal']}, {'name': 'Zoe', 'age': 40, 'favorite_foods': ['Chicken', 'Pizza', 'Chocolate']}, {'name': 'Heidi', 'age': 40, 'favorite_foods': ['Caesar Salad']}]
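
As a small sketch of what "equivalent Python objects" means, the snippet below parses a made-up JSON string and prints the Python type of each value. The `sample` string and its keys are illustrative assumptions, not part of the data sets used in this notebook.

```python
import json

# Hypothetical sample chosen to cover each basic JSON type.
sample = '{"name": "Sabine", "age": 36, "height": 1.7, "vegan": true, "pet": null, "foods": ["Pumpkin"]}'

parsed = json.loads(sample)

# JSON objects become dicts, arrays become lists, numbers become int or float,
# true/false become True/False, and null becomes None.
for key, value in parsed.items():
    print(key, type(value).__name__)
```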


### Instructions

We have created a JSON string, world_cup_str, which contains data about games from the 2018 Football World Cup.

1. Import the json module.
2. Use json.loads() to convert world_cup_str to a Python object. Assign the result to world_cup_obj.


In [9]:
world_cup_str = """
[
    {
        "team_1": "France",
        "team_2": "Croatia",
        "game_type": "Final",
        "score" : [4, 2]
    },
    {
        "team_1": "Belgium",
        "team_2": "England",
        "game_type": "3rd/4th Playoff",
        "score" : [2, 0]
    }
]
"""

In [10]:
world_cup_obj = json.loads(world_cup_str)
print(type(world_cup_obj))

<class 'list'>


### Reading JSON File 

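One of the places where the JSON format is commonly used is in the results returned by an application programming interface (API). APIs are interfaces that can be used to send and receive data between different computer systems. JSON data is also often stored in files. Here we read the file hn_2014.json with json.load(), which takes an open file object rather than a string.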

In [11]:
import json
with open('hn_2014.json') as file:
    hn = json.load(file)
print(type(hn))

<class 'list'>
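
The difference between json.load() and json.loads() can be sketched with a temporary file. The `records` data and the temporary file are assumptions made for illustration only, and the sketch does not touch hn_2014.json.

```python
import json
import os
import tempfile

# Hypothetical data, written to a temporary JSON file.
records = [{"title": "Example story", "points": 2}]

with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    json.dump(records, tmp)
    path = tmp.name

# json.load() reads from an open file object...
with open(path) as f:
    from_file = json.load(f)

# ...while json.loads() parses a string that is already in memory.
with open(path) as f:
    from_string = json.loads(f.read())

os.remove(path)
print(from_file == from_string)
```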


In [12]:
print(len(hn))
print(type(hn[0]))

35806
<class 'dict'>


In [13]:
print(hn[0].keys())

dict_keys(['author', 'numComments', 'points', 'url', 'storyText', 'createdAt', 'tags', 'createdAtI', 'title', 'objectId'])


### Deleting Dictionary Keys 

To make JSON data easier to read, we'll create a helper function called jprint(). It uses the json.dumps() function ("dump string"), which does the opposite of json.loads(): it takes a Python object and returns a JSON-formatted string. The json.dumps() function accepts arguments that control the formatting of the string, which we'll use to make the output easier to read:

In [14]:
def jprint(obj):
    text = json.dumps(obj, sort_keys = True,indent = 4)
    print(text)

In [15]:
first_story = hn[0]
jprint(first_story)

{
    "author": "dragongraphics",
    "createdAt": "2014-05-29T08:07:50Z",
    "createdAtI": 1401350870,
    "numComments": 0,
    "objectId": "7815238",
    "points": 2,
    "storyText": "",
    "tags": [
        "story",
        "author_dragongraphics",
        "story_7815238"
    ],
    "title": "Are we getting too Sassy? Weighing up micro-optimisation vs. maintainability",
    "url": "http://ashleynolan.co.uk/blog/are-we-getting-too-sassy"
}
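
To see what the formatting arguments change, the sketch below dumps one small dictionary twice, once with default settings and once with sort_keys and indent. The `story` dictionary is a shortened, illustrative stand-in for a real record.

```python
import json

# Shortened stand-in for a story record (illustrative only).
story = {"points": 2, "author": "dragongraphics", "tags": ["story"]}

compact = json.dumps(story)                           # one line, insertion order
pretty = json.dumps(story, sort_keys=True, indent=4)  # sorted keys, 4-space indent

print(compact)
print(pretty)

# Formatting only changes the text: parsing either string gives back equal data.
print(json.loads(compact) == json.loads(pretty))
```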


In [16]:
d = {'a': 1, 'b': 2, 'c': 3}
del d['a']
print(d)

{'b': 2, 'c': 3}


In [17]:
def del_key(dict_,key):
    modified_dict = dict_.copy()
    del modified_dict[key]
    return modified_dict

In [18]:
first_story = del_key(first_story, 'createdAtI')
jprint(first_story)

{
    "author": "dragongraphics",
    "createdAt": "2014-05-29T08:07:50Z",
    "numComments": 0,
    "objectId": "7815238",
    "points": 2,
    "storyText": "",
    "tags": [
        "story",
        "author_dragongraphics",
        "story_7815238"
    ],
    "title": "Are we getting too Sassy? Weighing up micro-optimisation vs. maintainability",
    "url": "http://ashleynolan.co.uk/blog/are-we-getting-too-sassy"
}
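
Because del_key() deletes the key from a copy, the dictionary passed in is left untouched. The sketch below repeats the function on a small made-up dictionary to show this. Note that dict.copy() makes a shallow copy, so nested lists such as tags would still be shared between the two dictionaries.

```python
def del_key(dict_, key):
    # Work on a shallow copy so the caller's dictionary is not changed.
    modified_dict = dict_.copy()
    del modified_dict[key]
    return modified_dict

original = {"a": 1, "b": 2, "c": 3}  # illustrative data
trimmed = del_key(original, "a")

print(original)  # still contains 'a'
print(trimmed)   # 'a' has been removed
```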


### Instructions

We have provided the code for the del_key() function.

1. Create an empty list, hn_clean, to store the cleaned data set.
2. Loop over the dictionaries in the hn list. In each iteration:
   1. Use the del_key() function to delete the createdAtI key from the dictionary.
   2. Append the cleaned dictionary to hn_clean.


In [19]:
hn_clean = []
for story in hn:
    cleaned = del_key(story, 'createdAtI')
    hn_clean.append(cleaned)

In [20]:
hn_clean[1]

{'author': 'jcr',
 'numComments': 0,
 'points': 1,
 'url': 'http://spectrum.ieee.org/automaton/robotics/home-robots/telemba-telepresence-robot',
 'storyText': '',
 'createdAt': '2014-05-29T08:05:58Z',
 'tags': ['story', 'author_jcr', 'story_7815234'],
 'title': 'Telemba Turns Your Old Roomba and Tablet Into a Telepresence Robot',
 'objectId': '7815234'}

### Writing List Comprehension

A list comprehension provides a concise way of creating lists in a single line of code.

In [21]:
ints = [1,2,3,4]
plus_one = []
for i in ints:
    plus_one.append(i+1)

In [22]:
print(plus_one)

[2, 3, 4, 5]


In [23]:
plus_one = [i+1 for i in ints]
print(plus_one)

[2, 3, 4, 5]


In [24]:
times_ten = []
for i in ints:
    times_ten.append(i*10)

In [25]:
print(times_ten)

[10, 20, 30, 40]


In [26]:
times_ten = [i*10 for i in ints]
print(times_ten)

[10, 20, 30, 40]


In [27]:
floats = [2.1,8.7,4.2,8.9]
rounded = []
for f in floats:
    rounded.append(round(f))
print(rounded)

[2, 9, 4, 9]


In [28]:
rounded = [round(f) for f in floats]
print(rounded)

[2, 9, 4, 9]


In [29]:
letters = ['a','b','c','d']
caps = []
for l in letters:
    caps.append(l.upper())
print(caps)

['A', 'B', 'C', 'D']


In [30]:
caps = [l.upper() for l in letters]
print(caps)

['A', 'B', 'C', 'D']


In [31]:
hn_clean = []
for d in hn:
    new_d = del_key(d,'createdAtI')
    hn_clean.append(new_d)

In [32]:
hn_clean = [del_key(d,'createdAtI') for d in hn]

In [33]:
hn_clean[0]

{'author': 'dragongraphics',
 'numComments': 0,
 'points': 2,
 'url': 'http://ashleynolan.co.uk/blog/are-we-getting-too-sassy',
 'storyText': '',
 'createdAt': '2014-05-29T08:07:50Z',
 'tags': ['story', 'author_dragongraphics', 'story_7815238'],
 'title': 'Are we getting too Sassy? Weighing up micro-optimisation vs. maintainability',
 'objectId': '7815238'}

### Using List Comprehension to Transform and Create Lists

List comprehensions can be used for many different things. Three common applications are:

1. Transforming a list
2. Creating a new list
3. Reducing a list


In [34]:
squares = [1,4,9,16,25,36]

In [35]:
sqroots = []
for sq in squares:
    sqroots.append(sq**(0.5))
print(sqroots)

[1.0, 2.0, 3.0, 4.0, 5.0, 6.0]


In [36]:
sqroots = [sq**0.5 for sq in squares]
print(sqroots)

[1.0, 2.0, 3.0, 4.0, 5.0, 6.0]


As an example, let's create a list of generic column names that we could use to create a dataframe, using the range() function and the str.format() method to combine numbers and text:

In [38]:
cols = []
for i in range(1,5):
    # '{}' is the placeholder that str.format() fills in
    cols.append('col{}'.format(i))
print(cols)

['col1', 'col2', 'col3', 'col4']


In [39]:
cols = ['col{}'.format(i) for i in range(1,5)]
print(cols)

['col1', 'col2', 'col3', 'col4']


In [40]:
import numpy as np
import pandas as pd
data = np.zeros((4,4))
df = pd.DataFrame(data, columns = cols)
print(df)

   col1  col2  col3  col4
0   0.0   0.0   0.0   0.0
1   0.0   0.0   0.0   0.0
2   0.0   0.0   0.0   0.0
3   0.0   0.0   0.0   0.0


In [41]:
urls = [i['url'] for i in hn_clean]

In [42]:
urls = []
for i in hn_clean:
    urls.append(i['url'])

In [43]:
hn_clean[0]['url']

'http://ashleynolan.co.uk/blog/are-we-getting-too-sassy'

In [44]:
tb = {}
ta = ['o','o','v','v']
for i in ta:
    if i in tb:
        tb[i] +=1
    else:
        tb[i] = 1

In [95]:
tb

{'o': 2, 'v': 2}

### Using List Comprehension to Reduce a List

In [45]:
ints = [25,14,13,84,43,6,77,56]
big_ints = []
for i in ints:
    if i >= 50:
        big_ints.append(i)
print(big_ints)

[84, 77, 56]


In [46]:
big_ints = [i for i in ints if i>=50]
print(big_ints)

[84, 77, 56]


In [49]:
has_comments = []
for d in hn_clean:
    if d['numComments'] > 0:
        has_comments.append(d)
num_comments = len(has_comments)
print(num_comments)

9279


In [50]:
has_comments = [d for d in hn_clean if d['numComments']>0]
num_comments = len(has_comments)
print(num_comments)

9279


In [54]:
thousand_points = []
for p in hn_clean:
    if p['points'] > 1000:
        thousand_points.append(p)
num_thousand_points = len(thousand_points)
print(num_thousand_points)

8


In [55]:
thousand_points = [p for p in hn_clean if p['points']>1000]
num_thousand_points = len(thousand_points)
print(num_thousand_points)

8


### Passing Functions as Arguments

In [56]:
jprint(json_obj)

[
    {
        "age": 36,
        "favorite_foods": [
            "Pumpkin",
            "Oatmeal"
        ],
        "name": "Sabine"
    },
    {
        "age": 40,
        "favorite_foods": [
            "Chicken",
            "Pizza",
            "Chocolate"
        ],
        "name": "Zoe"
    },
    {
        "age": 40,
        "favorite_foods": [
            "Caesar Salad"
        ],
        "name": "Heidi"
    }
]


In [57]:
min(json_obj)

TypeError: '<' not supported between instances of 'dict' and 'dict'

#### How to pass a function as an argument

In [70]:
def greet():
    return 'hello'
greet()

'hello'

In [64]:
t = type(greet())
print(t)

<class 'str'>


In [65]:
t = type(greet)
print(t)

<class 'function'>


In [66]:
greet_2 = greet
greet_2()

'hello'

In [71]:
def run_func(func):
    print('RUNNING FUNCTION: {}'.format(func))
    # Call the function that was passed in and return its result
    return func()

In [74]:
run_func(greet)

RUNNING FUNCTION: <function greet at 0x000000AC93EB9BF8>


'hello'

In [77]:
def get_age(json_dict):
    return json_dict['age']
youngest = min(json_obj, key = get_age)
jprint(youngest)

{
    "age": 36,
    "favorite_foods": [
        "Pumpkin",
        "Oatmeal"
    ],
    "name": "Sabine"
}


In [93]:
def num(num_comments):
    return num_comments['numComments']
most_comments = max(hn_clean, key = num)

In [94]:
print(most_comments)

{'author': 'platz', 'numComments': 1208, 'points': 889, 'url': 'https://blog.mozilla.org/blog/2014/04/03/brendan-eich-steps-down-as-mozilla-ceo/', 'storyText': None, 'createdAt': '2014-04-03T19:02:53Z', 'tags': ['story', 'author_platz', 'story_7525198'], 'title': 'Brendan Eich Steps Down as Mozilla CEO', 'objectId': '7525198'}


In [88]:
hn_clean[9]['numComments']

0

### Lambda Functions

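Functions like get_age() above are often needed only once, as the key argument to another function. Python provides a shorter syntax for such cases, called lambda functions. A lambda function can be defined in a single line, which lets you define a function at the moment you pass it as an argument.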

In [100]:
def unchanged(x):
    return x

In this function, unchanged is the function name, x (in the parentheses) is the parameter, and x (after return) is the transformation.

To create a lambda function equivalent of this function, we:

1. Use the lambda keyword, followed by
2. The parameter and a colon, and then
3. The transformation we wish to perform on our argument
4. We can then assign that to the function name:


In [101]:
unchanged = lambda x:x

In [102]:
def plus_one(x):
    return x+1

In [103]:
plus_one = lambda x:x+1

In [104]:
def add(x,y):
    return x+y

In [105]:
add = lambda x,y:x+y

If a function is particularly complex, it may be a better choice to define a regular function rather than create a lambda, even if it will only be used once. For instance, the function below extracts digits from a string and then adds one to the resulting integer. The lambda version that follows it does the same thing but is much harder to read:

In [106]:
import re

def extract_and_increment(string):
    digits = re.search(r"\d+", string).group()
    incremented = int(digits) + 1
    return incremented

In [107]:
extract_and_increment = lambda string:int((re.search(r'\d+',string).group()))+1
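
As a quick check that the two forms behave the same, the sketch below calls both on one sample string. The name `extract_and_increment_lambda` and the `sample` value are introduced here for illustration so that both versions can exist side by side.

```python
import re

def extract_and_increment(string):
    digits = re.search(r"\d+", string).group()
    return int(digits) + 1

# Same logic as a lambda (hypothetical name, so it does not overwrite the function).
extract_and_increment_lambda = lambda string: int(re.search(r"\d+", string).group()) + 1

sample = "story_7815238"
print(extract_and_increment(sample))
print(extract_and_increment_lambda(sample))
```

Both versions raise an AttributeError if the string contains no digits, because re.search() returns None in that case.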

In [108]:
def multiply(a,b):
    return a*b

In [109]:
multiply = lambda a,b:a*b

### Using Lambda Functions to Analyze JSON Data

In [110]:
jprint(json_obj)

[
    {
        "age": 36,
        "favorite_foods": [
            "Pumpkin",
            "Oatmeal"
        ],
        "name": "Sabine"
    },
    {
        "age": 40,
        "favorite_foods": [
            "Chicken",
            "Pizza",
            "Chocolate"
        ],
        "name": "Zoe"
    },
    {
        "age": 40,
        "favorite_foods": [
            "Caesar Salad"
        ],
        "name": "Heidi"
    }
]


In [111]:
sorted(json_obj, key = lambda d:d['name'])

[{'name': 'Heidi', 'age': 40, 'favorite_foods': ['Caesar Salad']},
 {'name': 'Sabine', 'age': 36, 'favorite_foods': ['Pumpkin', 'Oatmeal']},
 {'name': 'Zoe',
  'age': 40,
  'favorite_foods': ['Chicken', 'Pizza', 'Chocolate']}]

In [113]:
def get_age(json_dict):
    return json_dict['age']
youngest = min(json_obj, key = get_age)
jprint(youngest)

{
    "age": 36,
    "favorite_foods": [
        "Pumpkin",
        "Oatmeal"
    ],
    "name": "Sabine"
}


In [112]:
min(json_obj, key = lambda json_dict:json_dict['age'])

{'name': 'Sabine', 'age': 36, 'favorite_foods': ['Pumpkin', 'Oatmeal']}

In [114]:
min(json_obj, key = lambda d:d['age'])

{'name': 'Sabine', 'age': 36, 'favorite_foods': ['Pumpkin', 'Oatmeal']}

In [116]:
def get_num_foods(json_dict):
    return len(json_dict['favorite_foods'])
maximum = max(json_obj, key = get_num_foods)
jprint(maximum)

{
    "age": 40,
    "favorite_foods": [
        "Chicken",
        "Pizza",
        "Chocolate"
    ],
    "name": "Zoe"
}


In [119]:
max(json_obj, key = lambda d:len(d['favorite_foods']))

{'name': 'Zoe', 'age': 40, 'favorite_foods': ['Chicken', 'Pizza', 'Chocolate']}

### Instructions

1. Using sorted() and a lambda function, sort the hn_clean JSON list by the number of points (dictionary key points) from highest to lowest.
   1. Check the documentation for sorted() to see how to reverse the order to highest to lowest.
   2. Assign the result to hn_sorted_points.
2. Use a list comprehension to return a list of the five post titles (dictionary key title) that have the most points in our data set.
   1. Assign the result to top_5_titles.


In [120]:
hn_sorted_points = sorted(hn_clean, key = lambda d:d['points'], reverse = True)

In [127]:
top_5_titles = [d['title'] for d in hn_sorted_points[:5]]

In [130]:
top_5_titles

['2048',
 'Today is The Day We Fight Back',
 'Wozniak: “Actually, the movie was largely a lie about me”',
 'Microsoft Open Sources C# Compiler',
 'Elon Musk: To the People of New Jersey']
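
The same sort-then-slice pattern can be sketched on a smaller list. The `people` list below is a trimmed copy of the earlier sample data, and the variable names are illustrative.

```python
people = [
    {"name": "Sabine", "age": 36},
    {"name": "Zoe", "age": 40},
    {"name": "Heidi", "age": 40},
]

# reverse=True sorts from highest to lowest. The sort stays stable,
# so Zoe and Heidi (both 40) keep their original relative order.
by_age_desc = sorted(people, key=lambda d: d["age"], reverse=True)

# Slice before extracting so the comprehension only visits the top two.
oldest_two_names = [d["name"] for d in by_age_desc[:2]]
print(oldest_two_names)
```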

### Reading JSON files into pandas

In [131]:
jprint(json_obj)

[
    {
        "age": 36,
        "favorite_foods": [
            "Pumpkin",
            "Oatmeal"
        ],
        "name": "Sabine"
    },
    {
        "age": 40,
        "favorite_foods": [
            "Chicken",
            "Pizza",
            "Chocolate"
        ],
        "name": "Zoe"
    },
    {
        "age": 40,
        "favorite_foods": [
            "Caesar Salad"
        ],
        "name": "Heidi"
    }
]


In [132]:
json_df = pd.DataFrame(json_obj)
print(json_df)

   age               favorite_foods    name
0   36           [Pumpkin, Oatmeal]  Sabine
1   40  [Chicken, Pizza, Chocolate]     Zoe
2   40               [Caesar Salad]   Heidi


In [135]:
hn_df = pd.DataFrame(hn_clean)
print(hn_df.head(2))

           author             createdAt  numComments objectId  points  \
0  dragongraphics  2014-05-29T08:07:50Z            0  7815238       2   
1             jcr  2014-05-29T08:05:58Z            0  7815234       1   

  storyText                                           tags  \
0            [story, author_dragongraphics, story_7815238]   
1                       [story, author_jcr, story_7815234]   

                                               title  \
0  Are we getting too Sassy? Weighing up micro-op...   
1  Telemba Turns Your Old Roomba and Tablet Into ...   

                                                 url  
0  http://ashleynolan.co.uk/blog/are-we-getting-t...  
1  http://spectrum.ieee.org/automaton/robotics/ho...  


### Exploring Tags Using the Apply Function

In [136]:
hn_df.head()

|   | author | createdAt | numComments | objectId | points | storyText | tags | title | url |
|---|---|---|---|---|---|---|---|---|---|
| 0 | dragongraphics | 2014-05-29T08:07:50Z | 0 | 7815238 | 2 |  | [story, author_dragongraphics, story_7815238] | Are we getting too Sassy? Weighing up micro-op... | http://ashleynolan.co.uk/blog/are-we-getting-t... |
| 1 | jcr | 2014-05-29T08:05:58Z | 0 | 7815234 | 1 |  | [story, author_jcr, story_7815234] | Telemba Turns Your Old Roomba and Tablet Into ... | http://spectrum.ieee.org/automaton/robotics/ho... |
| 2 | callum85 | 2014-05-29T08:05:06Z | 0 | 7815230 | 1 |  | [story, author_callum85, story_7815230] | Apple Agrees to Buy Beats for $3 Billion | http://online.wsj.com/articles/apple-to-buy-be... |
| 3 | d3v3r0 | 2014-05-29T08:00:08Z | 0 | 7815222 | 1 |  | [story, author_d3v3r0, story_7815222] | Don’t wait for inspiration | http://alexsblog.org/2014/05/29/dont-wait-for-... |
| 4 | timmipetit | 2014-05-29T07:46:19Z | 0 | 7815191 | 1 |  | [story, author_timmipetit, story_7815191] | HackerOne Get $9M In Series A Funding To Build... | http://techcrunch.com/2014/05/28/hackerone-get... |


In [137]:
tags = hn_df['tags']
print(tags.dtype)

object


The tags column is stored as an object type. Whenever pandas uses the object type, each item in the series uses a Python object to store the data. Most commonly we see this type used for string data.

We previously learned that we could use the Series.apply() method to apply a function to every item in a series. Let's look at what we get when we pass the type() function as an argument to Series.apply() on this column:


In [138]:
tags_types = tags.apply(type)
type_counts = tags_types.value_counts(dropna = False)
print(type_counts)

<class 'list'>    35806
Name: tags, dtype: int64


In [139]:
tags_types = tags.apply(len)
type_counts = tags_types.value_counts(dropna = False)
print(type_counts)

3    33459
4     2347
Name: tags, dtype: int64


In [150]:
tags_types = tags.apply(len)
four_tags = tags[(tags_types == 4)]

In [148]:
four_tags = tags[tags.apply(len) == 4]

In [152]:
four_tags.head()

43     [story, author_alamgir_mand, story_7813869, sh...
86       [story, author_cweagans, story_7812404, ask_hn]
104    [story, author_nightstrike789, story_7812099, ...
107    [story, author_ISeemToBeAVerb, story_7812048, ...
109       [story, author_Swizec, story_7812018, show_hn]
Name: tags, dtype: object

### Extracting Tags Using Apply with a Lambda Function

In [153]:
def extract_tag(l):
    if len(l) == 4:
        return l[-1]
    else:
        return None

In [154]:
tags_type = tags.apply(extract_tag)

In [155]:
tags_type.head()

0    None
1    None
2    None
3    None
4    None
Name: tags, dtype: object

Let's look at how we can complete this operation in a single line.

To achieve this, we'll have to use a special version of an if statement known as a ternary operator. You can use the ternary operator whenever you need to return one of two values depending on a boolean expression. The syntax is as follows:


[on_true] if [expression] else [on_false]

In [160]:
l[-1] if len(l)==4 else None
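
The expression above needs a list named l to run. As a minimal sketch, the loop below evaluates it for two tag lists copied from the earlier output and compares each result with the equivalent multi-line if/else. The variable names are illustrative.

```python
# Two sample tag lists taken from the output shown earlier.
short_tags = ['story', 'author_jcr', 'story_7815234']
long_tags = ['story', 'author_cweagans', 'story_7812404', 'ask_hn']

results = []
for l in (short_tags, long_tags):
    # Ternary form: [on_true] if [expression] else [on_false]
    ternary_result = l[-1] if len(l) == 4 else None

    # Equivalent multi-line if/else statement.
    if len(l) == 4:
        long_result = l[-1]
    else:
        long_result = None

    results.append((ternary_result, long_result))

print(results)
```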

In [168]:
def extract_tag(l):
    return l[-1] if len(l) == 4 else None

In [171]:
cleaned_tags = tags.apply(extract_tag)

In [173]:
hn_df['tags'] = cleaned_tags
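
Since the ternary version of extract_tag() fits on one line, it can also be written as a lambda. The sketch below uses plain Python lists copied from the earlier output, so it runs without pandas. The names `sample_tags` and `extract_tag_lambda` are assumptions made for illustration.

```python
sample_tags = [
    ['story', 'author_jcr', 'story_7815234'],
    ['story', 'author_cweagans', 'story_7812404', 'ask_hn'],
    ['story', 'author_Swizec', 'story_7812018', 'show_hn'],
]

# Lambda equivalent of extract_tag().
extract_tag_lambda = lambda l: l[-1] if len(l) == 4 else None

# With a pandas Series, the same lambda could be passed directly to apply():
#     cleaned_tags = tags.apply(lambda l: l[-1] if len(l) == 4 else None)
cleaned_sample = [extract_tag_lambda(l) for l in sample_tags]
print(cleaned_sample)
```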