# List Comprehensions and Lambda Functions
So far, we've learned how to use regular expressions to make cleaning and analyzing text data easier.

In this mission, we'll learn some tips and syntax shortcuts we can use on top of everything we've learned, including:

- Creating list comprehensions to replace loops with a single line of code.
- Creating single-use functions called lambda functions.

The data set we'll use in this mission is in a format called JavaScript Object Notation (JSON). As the name indicates, JSON originated from the JavaScript language, but has now become a language-independent format.

From a Python perspective, JSON can be thought of as a collection of Python objects nested inside each other.

The JSON below is a list, where each element in the list is a dictionary. Each of the dictionaries has the same keys, and one of the values in each dictionary is itself a list.

The Python json module contains a number of functions to make working with JSON objects easier. We can use the json.loads() function to convert JSON data contained in a string to the equivalent set of Python objects:
```python
json_string = """
[
  {
    "name": "Sabine",
    "age": 36,
    "favorite_foods": ["Pumpkin", "Oatmeal"]
  },
  {
    "name": "Zoe",
    "age": 40,
    "favorite_foods": ["Chicken", "Pizza", "Chocolate"]
  },
  {
    "name": "Heidi",
    "age": 40,
    "favorite_foods": ["Caesar Salad"]
  }
]
"""

import json
json_obj = json.loads(json_string)
print(type(json_obj))
<class 'list'>
```
We can see that json_string has turned into a list. Let's take a look at the values in the list:

```python
print(json_obj)
[{'age': 36, 'favorite_foods': ['Pumpkin', 'Oatmeal'], 'name': 'Sabine'},
 {'age': 40, 'favorite_foods': ['Chicken', 'Pizza', 'Chocolate'], 'name': 'Zoe'},
 {'age': 40, 'favorite_foods': ['Caesar Salad'], 'name': 'Heidi'}]
```
We can observe a few things:

- The formatting from our original string is gone. This is because Python prints lists and dictionaries using its own simple formatting.
- The order of the keys in each dictionary has changed. This is because, in Python versions before 3.6, dictionaries did not preserve the order in which keys were inserted (insertion order has been guaranteed since Python 3.7).

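
To make the conversion more concrete, here is a minimal sketch of how json.loads() maps JSON types to Python types. The sample string and its keys are made up for illustration and are not part of the mission's data:

```python
import json

# Hypothetical sample string, made up for this sketch
sample = '{"name": "Sabine", "age": 36, "vegetarian": true, "nickname": null, "favorite_foods": ["Pumpkin", "Oatmeal"]}'

parsed = json.loads(sample)

# JSON object -> dict, array -> list, true -> True, null -> None
print(type(parsed))                    # <class 'dict'>
print(type(parsed["favorite_foods"]))  # <class 'list'>
print(parsed["vegetarian"])            # True
print(parsed["nickname"])              # None
```
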
Let's practice using json.loads() to convert JSON data from a string to Python objects!

1. Import the json module.
2. Use json.loads() to convert world_cup_str to a Python object. Assign the result to world_cup_obj.

In [14]:
import pandas as pd
import numpy as np

In [2]:
import json
world_cup_str = """
[
    {
        "team_1": "France",
        "team_2": "Croatia",
        "game_type": "Final",
        "score" : [4, 2]
    },
    {
        "team_1": "Belgium",
        "team_2": "England",
        "game_type": "3rd/4th Playoff",
        "score" : [2, 0]
    }
]
"""
world_cup_obj = json.loads(world_cup_str)
world_cup_obj

[{'team_1': 'France',
  'team_2': 'Croatia',
  'game_type': 'Final',
  'score': [4, 2]},
 {'team_1': 'Belgium',
  'team_2': 'England',
  'game_type': '3rd/4th Playoff',
  'score': [2, 0]}]

One of the places where the JSON format is commonly used is in the results returned by an application programming interface (API). APIs are interfaces that can be used to send and receive data between different computer systems. We'll learn about how to work with APIs in a later course.

The data set from this mission — hn_2014.json — was downloaded from the Hacker News API. It's a different set of data from the CSV we've been using in the previous two missions, and it contains data about stories from Hacker News in 2014.

To read a file in JSON format, we use the json.load() function. Note that the function is json.load() without an "s" at the end. The json.loads() function is used for loading JSON data from a string ("loads" is short for "load string"), whereas the json.load() function is used to load from a file object. Let's look at how we would read in our data:
```python
import json
file = open("hn_2014.json")
hn = json.load(file)

print(type(hn))
<class 'list'>
```
Our hn variable is a list. Let's find out how many objects are in the list, and the type of the first object (which will almost always be the type of every object in the list in JSON data):

```python
print(len(hn))
print(type(hn[0]))
35806
<class 'dict'>
```
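
As a small self-contained sketch of the difference between json.load() and json.loads(), the example below writes made-up data to a temporary file and reads it back. The file name example.json and the stories list are assumptions for illustration only. It also uses a with block, which closes the file automatically:

```python
import json
import os
import tempfile

# Hypothetical data, written to a temporary file so the sketch is self-contained
stories = [{"title": "Example story", "points": 2}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.json")

    # json.dump() (no "s") writes to a file object
    with open(path, "w") as f:
        json.dump(stories, f)

    # json.load() (no "s") reads from a file object; the with block closes the file
    with open(path) as f:
        loaded = json.load(f)

# json.loads() (with "s") parses a string instead
from_string = json.loads('[{"title": "Example story", "points": 2}]')

print(loaded == from_string)  # True
```
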
Our data set contains 35,806 dictionary objects, each representing a Hacker News story. In order to understand the format of our data set, we'll print the keys of the first dictionary:

```python
print(hn[0].keys())
dict_keys(['createdAtI', 'numComments', 'objectId', 'createdAt',
           'tags', 'title', 'points', 'author', 'storyText', 'url'])
```

If we recall the data set we used in the previous two missions, we can see some similarities. There are keys representing the title, URL, points, number of comments, and date, as well as some others that are less familiar to us. Here is a summary of the keys and the data that they contain:
<pre>
author: The username of the person who submitted the story.
createdAt: The date and time at which the story was created.
createdAtI: An integer value representing the date and time at which the story was created.
numComments: The number of comments that were made on the story.
objectId: The unique identifier from Hacker News for the story.
points: The number of points the story acquired, calculated as the total number of upvotes minus the total number of downvotes.
storyText: The text of the story (if the story contains text).
tags: A list of tags associated with the story.
title: The title of the story.
url: The URL that the story links to (if the story links to a URL).
</pre>
Let's start by reading our Hacker News JSON file:

*Instructions*

1. Use the open() function to open the hn_2014.json file as a file object.
2. Use the json.load() function to parse the file object and assign the result to hn.

In [4]:
file = open('data/hn_2014.json')
hn = json.load(file)
hn[0]

{'author': 'dragongraphics',
 'numComments': 0,
 'points': 2,
 'url': 'http://ashleynolan.co.uk/blog/are-we-getting-too-sassy',
 'storyText': '',
 'createdAt': '2014-05-29T08:07:50Z',
 'tags': ['story', 'author_dragongraphics', 'story_7815238'],
 'createdAtI': 1401350870,
 'title': 'Are we getting too Sassy? Weighing up micro-optimisation vs. maintainability',
 'objectId': '7815238'}

Let's look at the first dictionary in full. We're going to create a function that prints a JSON object with formatting that makes it easier to read.

The function will use the json.dumps() function ("dump string"), which does the opposite of the json.loads() function: it takes a Python object and returns a JSON-formatted string version of it. The json.dumps() function accepts arguments that can specify formatting for the string, which we'll use to make things easier to read:
```python
def jprint(obj):
    # create a formatted string of the Python JSON object
    text = json.dumps(obj, sort_keys=True, indent=4)
    print(text)

first_story = hn[0]
jprint(first_story)
{
    "author": "dragongraphics",
    "createdAt": "2014-05-29T08:07:50Z",
    "createdAtI": 1401350870,
    "numComments": 0,
    "objectId": "7815238",
    "points": 2,
    "storyText": "",
    "tags": [
        "story",
        "author_dragongraphics",
        "story_7815238"
    ],
    "title": "Are we getting too Sassy? Weighing up micro-optimisation vs. maintainability",
    "url": "http://ashleynolan.co.uk/blog/are-we-getting-too-sassy"
}
```
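
To see what the sort_keys and indent arguments change, here is a minimal sketch that compares the default output of json.dumps() with the formatted one. The record dictionary is made up for illustration:

```python
import json

# Hypothetical record, made up for this sketch
record = {"title": "Example", "tags": ["story", "author_example"], "points": 2}

compact = json.dumps(record)                           # single line, keys in insertion order
pretty = json.dumps(record, sort_keys=True, indent=4)  # sorted keys, one value per line

print(compact)
print(pretty)

# json.loads() reverses json.dumps(), whichever formatting was used
print(json.loads(pretty) == record)  # True
```
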
You may notice that the createdAt and createdAtI keys both contain the date and time, in two different formats. Because the format of createdAt is much easier to understand, let's do some data cleaning by deleting the createdAtI key from every dictionary.

To delete a key from a dictionary, we can use the del statement. Let's learn the syntax by looking at a simple example:
```python
d = {'a': 1, 'b': 2, 'c': 3}
del d['a']
print(d)
{'b': 2, 'c': 3}
```
We can create a function using del that will return a copy of our dictionary with the key removed:
```python
def del_key(dict_, key):
    # create a copy so we don't
    # modify the original dict
    modified_dict = dict_.copy()
    del modified_dict[key]
    return modified_dict
```
Let's use this function to delete the createdAtI key from first_story:

```python
first_story = del_key(first_story, 'createdAtI')
jprint(first_story)
{
    "author": "dragongraphics",
    "createdAt": "2014-05-29T08:07:50Z",
    "numComments": 0,
    "objectId": "7815238",
    "points": 2,
    "storyText": "",
    "tags": [
        "story",
        "author_dragongraphics",
        "story_7815238"
    ],
    "title": "Are we getting too Sassy? Weighing up micro-optimisation vs. maintainability",
    "url": "http://ashleynolan.co.uk/blog/are-we-getting-too-sassy"
}
```
The dictionary returned by the function no longer includes the createdAtI key.
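
As a quick check of why the function copies the dictionary first, the sketch below confirms that the original is left untouched. The story dictionary is made up for illustration. Note that dict.copy() makes a shallow copy, so nested objects are still shared between the two dictionaries:

```python
def del_key(dict_, key):
    # Copy first so the caller's dictionary is left untouched
    modified_dict = dict_.copy()
    del modified_dict[key]
    return modified_dict

# Hypothetical story, made up for this sketch
story = {"title": "Example", "createdAtI": 1401350870, "tags": ["story"]}
cleaned = del_key(story, "createdAtI")

print("createdAtI" in story)    # True: the original still has the key
print("createdAtI" in cleaned)  # False: only the copy lost it

# dict.copy() is shallow, so nested objects such as the tags list are shared
print(cleaned["tags"] is story["tags"])  # True
```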

Let's use a loop and the del_key() function to remove the createdAtI key from every story in our Hacker News data set:

*Instructions*

We have provided the code for the del_key() function.

1. Create an empty list, hn_clean, to store the cleaned data set.
2. Loop over the dictionaries in the hn list. In each iteration:
    - Use the del_key() function to delete the createdAtI key from the dictionary.
    - Append the cleaned dictionary to hn_clean.

In [11]:
def del_key(dict_, key):
    # create a copy so we don't
    # modify the original dict
    modified_dict = dict_.copy()
    del modified_dict[key]
    return modified_dict
hn_clean = []

for dic in hn:
    hn_clean.append(del_key(dic,'createdAtI'))

hn_clean[0]

{'author': 'dragongraphics',
 'numComments': 0,
 'points': 2,
 'url': 'http://ashleynolan.co.uk/blog/are-we-getting-too-sassy',
 'storyText': '',
 'createdAt': '2014-05-29T08:07:50Z',
 'tags': ['story', 'author_dragongraphics', 'story_7815238'],
 'title': 'Are we getting too Sassy? Weighing up micro-optimisation vs. maintainability',
 'objectId': '7815238'}

The task we performed is an extremely common one. Specifically, we:

1. Iterated over values in a list.
2. Performed a transformation on those values.
3. Assigned the result to a new list.

Python includes a special syntax shortcut for tasks that meet these criteria: list comprehensions. A list comprehension provides a concise way of creating lists in a single line of code.

List comprehensions can look complex at first, but we are simply reordering the elements of our for loop code. To keep things simple, we'll start with a basic example, where we want to add 1 to each item in a list of integers.
```python
ints = [1, 2, 3, 4]

plus_one = []
for i in ints:
    plus_one.append(i + 1)

print(plus_one)
[2, 3, 4, 5]
```
Let's start by labeling the three main parts of our loop:

<img src='images/loop_components.svg' />

To transform this structure into a list comprehension, we do the following within brackets:

- Start with the code that transforms each item.
- Continue with our for statement (without a colon).

We can then assign the list comprehension to a variable name. The animation below shows how we convert the manual loop version to a list comprehension.

<img src='images/list_comp_anim.svg' />
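
In code form, a minimal sketch of this conversion might look like the following. The name plus_one_lc is our own, chosen so that both versions can be compared side by side:

```python
ints = [1, 2, 3, 4]

# Loop version, as in the text
plus_one = []
for i in ints:
    plus_one.append(i + 1)

# Comprehension version: transformation first, then the for clause
plus_one_lc = [i + 1 for i in ints]

print(plus_one_lc)  # [2, 3, 4, 5]
```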

Let's look at a second example, where we want to multiply each item in the list by 10:
```python
times_ten = []
for i in ints:
    times_ten.append(i * 10)

print(times_ten)
[10, 20, 30, 40]
```
To convert this to a list comprehension, we follow the same pattern:

<img src='images/loop_vs_lc_2.svg' />
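
Written out as code, a sketch of this second conversion could be:

```python
ints = [1, 2, 3, 4]

# Same pattern as before: transformation (i * 10), then the for clause
times_ten = [i * 10 for i in ints]

print(times_ten)  # [10, 20, 30, 40]
```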

The "transformation" step of our list comprehension can be anything, including a function or method. In the example below, we are applying a function to a list of floats to round them to integers:
```python
floats = [2.1, 8.7, 4.2, 8.9]

rounded = []
for f in floats:
    rounded.append(round(f))

print(rounded)
[2, 9, 4, 9]
```
To convert to a list comprehension, we simply rearrange the components:

<img src='images/loop_vs_lc_3.svg' />
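
As a sketch, the rearranged version with a function call as the transformation could be written like this:

```python
floats = [2.1, 8.7, 4.2, 8.9]

# The transformation is a call to the built-in round() function
rounded = [round(f) for f in floats]

print(rounded)  # [2, 9, 4, 9]
```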

Just like in a normal loop, we can use any name for our iterator variable. Here, we have used f.

For the last example, we'll apply a method to each string in a list to capitalize it. We won't color the different components, so we can get used to how that looks.
```python
letters = ['a', 'b', 'c', 'd']

caps = []
for l in letters:
    caps.append(l.upper())
```
Even though we've used a different kind of transformation, the ordering of the list comprehension remains the same:

```python
caps = [l.upper() for l in letters]
print(caps)
['A', 'B', 'C', 'D']
```
Let's recap what we have learned so far. A list comprehension can be used where we:

- Iterate over values in a list.
- Perform a transformation on those values.
- Assign the result to a new list.

To transform a loop to a list comprehension, in brackets we:

1. Start with the code that transforms each item.
2. Continue with our for statement (without a colon).

We are going to write a list comprehension version of the code from the previous screen. To help, we've provided a copy of the code with the components labeled.

*Instructions*

1. Create a list comprehension representation of the loop from the previous screen:
    - Call the del_key() function to remove the createdAtI value from each dictionary in the hn list.
    - Assign the results to a new list, hn_clean.

In [12]:
hn_clean = [del_key(d, 'createdAtI') for d in hn]

In [15]:
# List comprehensions are also handy for building labels, e.g. column names for a zero-filled dataframe:
cols = ["col_{}".format(i) for i in range(1,5)]
data = np.zeros((4,4))
df = pd.DataFrame(data, columns=cols)
print(df)

   col_1  col_2  col_3  col_4
0    0.0    0.0    0.0    0.0
1    0.0    0.0    0.0    0.0
2    0.0    0.0    0.0    0.0
3    0.0    0.0    0.0    0.0


In [20]:
#Use a list comprehension to extract the url value from each dictionary in hn_clean. Assign the result to urls.
urls = [d['url'] for d in hn_clean]
urls[0:5]

['http://ashleynolan.co.uk/blog/are-we-getting-too-sassy',
 'http://spectrum.ieee.org/automaton/robotics/home-robots/telemba-telepresence-robot',
 'http://online.wsj.com/articles/apple-to-buy-beats-1401308971',
 'http://alexsblog.org/2014/05/29/dont-wait-for-inspiration/',
 'http://techcrunch.com/2014/05/28/hackerone-get-9m-in-series-a-funding-to-build-bug-tracking-bounty-programs/']

The last common application of list comprehensions is reducing a list. Let's say we had a list of integers and we wanted to remove any integers that were 50 or smaller. We could do this by adding an if clause to the end of our list comprehension:

In [22]:
ints = [1, 3, 50, 78, 80]
big_ints = [i for i in ints if i > 50]
print(big_ints)

[78, 80]
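
For comparison, here is a sketch of the manual loop that this comprehension replaces. The name big_ints_loop is our own, used only to keep the two results apart:

```python
ints = [1, 3, 50, 78, 80]

# Loop version: only append the values that pass the test
big_ints_loop = []
for i in ints:
    if i > 50:
        big_ints_loop.append(i)

# Comprehension version: the if clause goes after the for clause
big_ints = [i for i in ints if i > 50]

print(big_ints_loop == big_ints)  # True
print(big_ints)                   # [78, 80]
```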


Let's use a list comprehension to count how many stories have more than 1,000 points.

*Instructions*

1. Use a list comprehension to create a new list, thousand_points:
    - The list should contain values from hn_clean where the points key has a value greater than 1000.
2. Count the number of values in thousand_points and assign the result to num_thousand_points.

In [25]:
thousand_points = [d for d in hn_clean if d['points'] > 1000]
num_thousand_points = len(thousand_points)
num_thousand_points

8

Let's define a very simple function as an example:
```python
def greet():
    return "hello"

greet()
'hello'
```
If we try to examine the type of our function by calling it, we get the type of the value it returns instead:
```python
t = type(greet())
print(t)
<class 'str'>
```
We need to find a way to look at the function itself, rather than the result of the function. The key to this is the parentheses: ().

The parentheses are what tell Python to execute the function, so if we omit the parentheses we can treat a function like a variable, rather than working with the output of the function:
```python
t = type(greet)
print(t)
<class 'function'>
```

There are other variable-like behaviors we can also use when we omit the parentheses from a function. For instance, we can assign a function to a new variable name:
```python
greet_2 = greet

greet_2()
'hello'
```
Now that we understand how to treat a function as a variable, let's look at how we can run a function inside another function by passing it as an argument:
```python
def run_func(func):
    print("RUNNING FUNCTION: {}".format(func))
    return func()

run_func(greet)
RUNNING FUNCTION: <function greet at 0x12a64c400>
'hello'
```
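[Figure: passing a function as an argument]

Now that we have some intuition on how to pass functions as arguments, let's see how we can use a function to control the behavior of built-in functions like sorted(), min(), and max(). Each of these accepts a key argument: a function that is called on every item to produce the value used for comparison.

[Figure: sorting a list of lists using a key function]

Let's look at the same thing in code form, using min() to find the youngest person in json_obj: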
```python
def get_age(json_dict):
    return json_dict['age']

youngest = min(json_obj, key=get_age)
jprint(youngest)
{
    "age": 36,
    "favorite_foods": [
        "Pumpkin",
        "Oatmeal"
    ],
    "name": "Sabine"
}
```
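
The same key function also works with sorted(). Below is a minimal sketch using a trimmed-down copy of the sample data. The people list and the by_age_desc name are our own for this illustration:

```python
# Trimmed-down copy of the sample data used earlier in the mission
people = [
    {"name": "Sabine", "age": 36},
    {"name": "Zoe", "age": 40},
    {"name": "Heidi", "age": 40},
]

def get_age(person):
    return person["age"]

# sorted() calls get_age on every item and orders the items by the returned values
by_age_desc = sorted(people, key=get_age, reverse=True)
names = [p["name"] for p in by_age_desc]
print(names)  # ['Zoe', 'Heidi', 'Sabine']
```

Because sorted() is stable, Zoe and Heidi (both 40) keep their original relative order.
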
Let's use this technique to find the story that has the greatest number of comments.

*Instructions*

1. Create a "key function" that accepts a single dictionary and returns the value from the numComments key.
2. Use the max() function with the "key function" you just created to find the value from the hn_clean list with the most comments:
    - Assign the result to the variable most_comments.

In [27]:
def key_function(d):
    return d['numComments']
most_comments = max(hn_clean, key=key_function)
most_comments

{'author': 'platz',
 'numComments': 1208,
 'points': 889,
 'url': 'https://blog.mozilla.org/blog/2014/04/03/brendan-eich-steps-down-as-mozilla-ceo/',
 'storyText': None,
 'createdAt': '2014-04-03T19:02:53Z',
 'tags': ['story', 'author_platz', 'story_7525198'],
 'title': 'Brendan Eich Steps Down as Mozilla CEO',
 'objectId': '7525198'}

### Lambda Functions
<img src='images/lambda_2_comparison.svg' />
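
As a minimal sketch of the idea: a lambda function is a function written as a single expression, using the form `lambda parameters: expression`, where the value of the expression is returned automatically. The function names and the words list below are made up for illustration:

```python
# Regular function definition
def add_one(x):
    return x + 1

# Equivalent lambda: parameters before the colon, returned expression after it
add_one_lambda = lambda x: x + 1

print(add_one(5))         # 6
print(add_one_lambda(5))  # 6

# Lambdas are most useful inline, where the function is needed only once
words = ["banana", "fig", "cherry"]
print(sorted(words, key=lambda w: len(w)))  # ['fig', 'banana', 'cherry']
```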


If a function is particularly complex, it may be a better choice to define a regular function rather than create a lambda, even if it will only be used once. For instance, consider the function below, which extracts digits from a string and then adds one to the resulting integer:

```python
import re

def extract_and_increment(string):
    digits = re.search(r"\d+", string).group()
    incremented = int(digits) + 1
    return incremented

# It becomes tough to understand in its lambda form:

extract_and_increment = lambda string: int(re.search(r"\d+", string).group()) + 1
```
Being mindful of this will ensure our code remains easy to read and understand.

Let's practice creating a lambda function version of a simple function:

*Instructions*

1. In the display code, we have defined (in comments) a function multiply() using traditional syntax.
2. Create a lambda function that performs the same operation. Assign it to the variable name multiply.

In [30]:
# def multiply(a, b):
#     return a * b
multiply = lambda a, b: a * b
multiply(2, 3)

6

In [33]:
#Let's look at how this works in common usage with min(), max(), and sorted().
sorted(hn_clean[0:2], key=lambda d: d['points'], reverse=True)

[{'author': 'dragongraphics',
  'numComments': 0,
  'points': 2,
  'url': 'http://ashleynolan.co.uk/blog/are-we-getting-too-sassy',
  'storyText': '',
  'createdAt': '2014-05-29T08:07:50Z',
  'tags': ['story', 'author_dragongraphics', 'story_7815238'],
  'title': 'Are we getting too Sassy? Weighing up micro-optimisation vs. maintainability',
  'objectId': '7815238'},
 {'author': 'jcr',
  'numComments': 0,
  'points': 1,
  'url': 'http://spectrum.ieee.org/automaton/robotics/home-robots/telemba-telepresence-robot',
  'storyText': '',
  'createdAt': '2014-05-29T08:05:58Z',
  'tags': ['story', 'author_jcr', 'story_7815234'],
  'title': 'Telemba Turns Your Old Roomba and Tablet Into a Telepresence Robot',
  'objectId': '7815234'}]

Over the past three screens, we have:

- Learned that functions can be passed as arguments.
- Created functions and used them to calculate the minimum, maximum, and to sort lists of lists.
- Learned about lambda functions and how to create them.
- Learned how to use a lambda function to pass a function as an argument in place when calculating minimums, maximums, and sorting lists of lists.
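
To tie these points together, here is a short sketch that passes lambda functions as the key argument to min(), max(), and sorted(). The stories list is made up for illustration and is not taken from the Hacker News data:

```python
# Hypothetical stories, made up for this sketch
stories = [
    {"title": "A", "points": 12, "numComments": 3},
    {"title": "B", "points": 40, "numComments": 0},
    {"title": "C", "points": 7, "numComments": 9},
]

fewest_points = min(stories, key=lambda s: s["points"])
most_comments = max(stories, key=lambda s: s["numComments"])
by_points = sorted(stories, key=lambda s: s["points"], reverse=True)

print(fewest_points["title"])           # C
print(most_comments["title"])           # C
print([s["title"] for s in by_points])  # ['B', 'A', 'C']
```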

We can now apply all of this new knowledge to our Hacker News data to find the posts that had the most points in 2014!

1. Using sorted() and a lambda function, sort the hn_clean JSON list by the number of points (dictionary key points) from highest to lowest:
    - Check the documentation for sorted() to see how to reverse the order to highest to lowest.
    - Assign the result to hn_sorted_points.
2. Use a list comprehension to return a list of the five post titles (dictionary key title) that have the most points in our data set:
    - Assign the result to top_5_titles.

In [35]:
hn_sorted_points = sorted(hn_clean, key=lambda d: d['points'], reverse=True)
top_5_titles = [d['title'] for d in hn_sorted_points[0:5]]
top_5_titles

['2048',
 'Today is The Day We Fight Back',
 'Wozniak: “Actually, the movie was largely a lie about me”',
 'Microsoft Open Sources C# Compiler',
 'Elon Musk: To the People of New Jersey']

We can use the pandas.DataFrame() constructor and pass the list of dictionaries directly to it to convert the JSON to a dataframe:
```python
json_df = pd.DataFrame(json_obj)
print(json_df)
   age                 favorite_foods    name
0   36             [Pumpkin, Oatmeal]  Sabine
1   40    [Chicken, Pizza, Chocolate]     Zoe
2   40                 [Caesar Salad]   Heidi
```
In this case, the favorite_foods column contains the list from the JSON. We'll see a similar thing with the tags column for our Hacker News data. We'll learn how to correct that on the next screen, but for now, let's convert our data to a pandas dataframe.

1. Import the pandas library.
2. Use the pandas.DataFrame() constructor to create a dataframe version of the hn_clean JSON list. Assign the result to hn_df.

In [36]:
hn_df = pd.DataFrame(hn_clean)
hn_df.head(1)

| | author | numComments | points | url | storyText | createdAt | tags | title | objectId |
|---|---|---|---|---|---|---|---|---|---|
| 0 | dragongraphics | 0 | 2 | http://ashleynolan.co.uk/blog/are-we-getting-t... | | 2014-05-29T08:07:50Z | [story, author_dragongraphics, story_7815238] | Are we getting too Sassy? Weighing up micro-op... | 7815238 |


At first glance, it looks like each value in the tags column contains three items:

- The string story
- The name of the author
- The story ID

If that's the case, then the column doesn't contain any unique data, and we can remove it. We're going to analyze this column to make sure that's the case.

Let's start by exploring how pandas is storing that data. First, we'll extract the column as a series, and check its type:
```python
tags = hn_df['tags']
print(tags.dtype)
object
```

The tags column is stored as an object type. Whenever pandas uses the object type, each item in the series uses a Python object to store the data. Most commonly we see this type used for string data.

We previously learned that we could use the Series.apply() method to apply a function to every item in a series. Let's look at what we get when we pass the type() function as an argument to Series.apply():

```python
tags_types = tags.apply(type)
type_counts = tags_types.value_counts(dropna=False)
print(type_counts)
<class 'list'>    35806
Name: tags, dtype: int64
```
All 35,806 items in the column are a Python list type.

Next, let's use Series.apply() to check the length of each of those lists. If our hypothesis from earlier is correct, every row will have a list containing three items:

```python
tags_lengths = tags.apply(len)
length_counts = tags_lengths.value_counts(dropna=False)
print(length_counts)
3    33459
4     2347
Name: tags, dtype: int64
```
While most of the items have three values in the list, 2,347 items contain four values. Let's use a boolean mask to look at the items where the list has four items:

1. Use Series.apply() and len() to create a boolean mask based on whether each item in tags has a length of 4.
2. Use the boolean mask to filter tags. Assign the result to four_tags.

In [41]:
tags = hn_df.tags
four_tags = tags[tags.apply(len) == 4]
four_tags.head(5)

43     [story, author_alamgir_mand, story_7813869, sh...
86       [story, author_cweagans, story_7812404, ask_hn]
104    [story, author_nightstrike789, story_7812099, ...
107    [story, author_ISeemToBeAVerb, story_7812048, ...
109       [story, author_Swizec, story_7812018, show_hn]
Name: tags, dtype: object

It looks like whenever there are four tags, the extra tag is the last of the four. In this final exercise of the mission, we're going to use a lambda function to extract this fourth value in cases where there is one. To do this for any single list, we'll need to:

- Check the length of the list.
- If the length of the list is equal to four, return the last value.
- If the length of the list isn't equal to four, return a null value.

This is how we could create this as a standard function:
```python
def extract_tag(l):
    if len(l) == 4:
        return l[-1]
    else:
        return None
```
We could use Series.apply() to apply this function as is, but to practice working with lambda functions, let's look at how we can complete this operation in a single line.

To achieve this, we'll have to use a special version of an if statement known as a ternary operator (the Python documentation calls it a conditional expression). You can use the ternary operator whenever you need to return one of two values depending on a boolean expression. The syntax is as follows:
<pre>
[on_true] if [expression] else [on_false]
</pre>

The diagram below shows our function using an if statement and its ternary operator equivalent:

<img src='images/ternary_operator_eg.svg' /> 
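
Written out as code, the two versions could look like the sketch below. The name extract_tag_ternary and the sample tag lists are our own, made up for illustration:

```python
# Standard function with an if statement, as in the text
def extract_tag(l):
    if len(l) == 4:
        return l[-1]
    else:
        return None

# Same logic as a conditional expression: [on_true] if [expression] else [on_false]
def extract_tag_ternary(l):
    return l[-1] if len(l) == 4 else None

# Hypothetical tag lists, made up for this sketch
three = ["story", "author_example", "story_1"]
four = ["story", "author_example", "story_2", "show_hn"]

print(extract_tag_ternary(three))  # None
print(extract_tag_ternary(four))   # show_hn
```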

Let's finish by creating a lambda function version of this function and using apply to extract the tags.

We have provided a function that uses a ternary operator to provide the logic to extract the tags.

- Use Series.apply() and a lambda function to extract the tag data from tags:
    - Where the item is a list with length four, return the last item.
    - In all other cases, return None.
    - Assign the result to cleaned_tags.
    - Assign the cleaned_tags series to the tags column of the hn_df dataframe.

In [42]:
# def extract_tag(l):
#     return l[-1] if len(l) == 4 else None
cleaned_tags = tags.apply(lambda l: l[-1] if len(l) == 4 else None)
hn_df['tags'] = cleaned_tags
cleaned_tags

0        None
1        None
2        None
3        None
4        None
         ... 
35801    None
35802    None
35803    None
35804    None
35805    None
Name: tags, Length: 35806, dtype: object