# Coercing and Reducing Data Lab

### Introduction

In this lesson, we'll see how to use loops and list comprehensions to both reduce the amount of information in our data and coerce that data into a more useful form.

### Loading the Data

In [1]:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/eng-6-22/mod-1-a-data-structures/master/3-coercing-filtering-data/imdb_movies.csv')
movies = df.to_dict('records')

In [2]:
movies[:2]

[{'title': 'Avatar',
  'genre': 'Action',
  'budget': 237000000,
  'runtime': 162.0,
  'year': 2009,
  'month': 12,
  'revenue': 2787965087},
 {'title': "Pirates of the Caribbean: At World's End",
  'genre': 'Adventure',
  'budget': 300000000,
  'runtime': 169.0,
  'year': 2007,
  'month': 5,
  'revenue': 961000000}]

### Exploring Data

1. Start by selecting the first movie from the list of movies so that we can get a sense of our data.

In [3]:
first_movie = movies[0]
first_movie
# {'title': 'Avatar',
#  'genre': 'Action',
#  'budget': 237000000,
#  'runtime': 162.0,
#  'year': 2009,
#  'month': 12,
#  'revenue': 2787965087}

{'title': 'Avatar',
 'genre': 'Action',
 'budget': 237000000,
 'runtime': 162.0,
 'year': 2009,
 'month': 12,
 'revenue': 2787965087}

Now let's return just the keys from the dictionary.

In [4]:
movie_keys = first_movie.keys()
movie_keys

# dict_keys(['title', 'genre', 'budget', 'runtime', 'year', 'month', 'revenue'])

dict_keys(['title', 'genre', 'budget', 'runtime', 'year', 'month', 'revenue'])

And then calculate the number of keys.

In [6]:
num_keys = len(movie_keys)

num_keys
# 7

7
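
As a small aside, `keys()` returns a view object rather than a list. The sketch below uses a hand-typed copy of the first record (the names `sample_movie` and `key_list` are ours, for illustration only) to show one way of turning that view into a list and counting it.

```python
# Hand-typed copy of the first record, used only for illustration.
sample_movie = {'title': 'Avatar', 'genre': 'Action', 'budget': 237000000,
                'runtime': 162.0, 'year': 2009, 'month': 12,
                'revenue': 2787965087}

# keys() returns a view; list() turns it into an ordinary list we can index.
key_list = list(sample_movie.keys())

print(key_list[0])    # title
print(len(key_list))  # 7
```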

2. Now let's see how many movies we have.

In [7]:
movies_count = len(movies)
movies_count
# 2000

2000

### 2. Using Loops to Reduce Data

Next, let's plot the amount of money that each movie makes along with its title.  

To do so, loop through the list to select only the `title` from each movie.
> Let's practice this first using a `for` loop, and then using a list comprehension.

1. Use a `for` loop and the `append` method to add each title to a list.

In [12]:
titles = []
for movie in movies:
  titles.append(movie['title'])



In [13]:
titles[:3]
# ['Avatar', "Pirates of the Caribbean: At World's End", 'Spectre']


['Avatar', "Pirates of the Caribbean: At World's End", 'Spectre']

2. Now do the same thing using a list comprehension.

In [14]:
titles = [movie['title'] for movie in movies]

In [15]:
titles[:3]
# ['Avatar',
# "Pirates of the Caribbean: At World's End",
# 'Spectre']

['Avatar', "Pirates of the Caribbean: At World's End", 'Spectre']
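
To convince ourselves that the two approaches really build the same list, here is a minimal sketch on a three-movie sample. The list `sample_movies` is a hand-typed stand-in for `movies`, and the variable names are ours.

```python
# Small hand-typed stand-in for the full `movies` list.
sample_movies = [
    {'title': 'Avatar', 'year': 2009},
    {'title': "Pirates of the Caribbean: At World's End", 'year': 2007},
    {'title': 'Spectre', 'year': 2015},
]

# Approach 1: a for loop that appends one title at a time.
loop_titles = []
for movie in sample_movies:
    loop_titles.append(movie['title'])

# Approach 2: a list comprehension that does the same in one expression.
comprehension_titles = [movie['title'] for movie in sample_movies]

print(loop_titles == comprehension_titles)  # True
```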

3. Select the revenue from each movie using a list comprehension.

In [16]:
revenues = [movie['revenue'] for movie in movies]


In [17]:
revenues[:3]
# [2787965087, 961000000, 880674609]

[2787965087, 961000000, 880674609]

4. Now select the year from each movie.

In [18]:
years = [movie['year'] for movie in movies]

years[:3]

# [2009, 2007, 2015]

[2009, 2007, 2015]

> Press `shift + return` on the cell below to plot the movie information.

In [19]:
import plotly.graph_objects as go
scatter = go.Scatter(x = years,
                     y = revenues,
                     hovertext=titles,
                     mode = 'markers')
fig = go.Figure(scatter)
fig

> Answer: <img src="https://github.com/eng-6-22/mod-1-a-data-structures/blob/master/3-coercing-filtering-data/movie-revenue.png?raw=1" width="60%">

### 3. Coercing Data

Let's take another look at the list of years that we created above.

In [20]:
years[:2]
# [2009, 2007]

[2009, 2007]

Many movies share the same release year, so the list contains a lot of repeats.  Let's find the unique collection of years by turning our list into a `set`.

In [21]:
unique_years = set(years)

In [22]:
len(unique_years)
# 48

48
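
A set drops duplicates but does not promise any particular order. If we want the unique years in order, one option is to pass the set to `sorted`. The sketch below uses a short made-up list, `sample_years`, rather than the real data.

```python
# Made-up list of years with repeats, for illustration only.
sample_years = [2009, 2007, 2015, 2012, 2012, 2007]

# set() keeps one copy of each value.
unique_sample_years = set(sample_years)
print(len(unique_sample_years))     # 4

# sorted() returns a new list in ascending order.
print(sorted(unique_sample_years))  # [2007, 2009, 2012, 2015]
```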

2. Also find the unique list of genres in our collection of movies above.

> First collect all of the genres.

In [24]:
genres = [movie['genre'] for movie in movies]

genres[:3]

['Action', 'Adventure', 'Action']

Then assign a unique list of genres.

In [28]:
unique_genres = list(set(genres))

In [29]:
len(unique_genres)
# 12

12

In [30]:
unique_genres[:5]
# [nan, 'Science Fiction', 'Action', 'Romance', 'Horror']
# (the order may differ on your machine, because sets are unordered)

['Science Fiction', 'Horror', 'Adventure', 'Action', 'Crime']
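
Notice that the unique genres can include `nan`, which is how a missing genre shows up in this data. If we only want the named genres in a repeatable order, one possible approach is to keep the string values and sort them. This is a sketch on a made-up list (`sample_genres`), not a required step of the lab.

```python
# Made-up genre list with one missing value, for illustration only.
sample_genres = ['Action', 'Adventure', 'Action', float('nan'), 'Crime']

# Keep only string values (this drops nan), de-duplicate with a set
# comprehension, then sort so the order is the same on every run.
named_genres = sorted({genre for genre in sample_genres if isinstance(genre, str)})

print(named_genres)  # ['Action', 'Adventure', 'Crime']
```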

3. Calculate the net revenue for each movie.
* net revenue = revenue - budget

In [31]:
net_revenues = [(movie['revenue'] - movie['budget']) for movie in movies]

In [32]:
net_revenues[:5]
# [2550965087, 661000000, 635674609, 834939099, 24139100]

[2550965087, 661000000, 635674609, 834939099, 24139100]
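
We can check the arithmetic by hand on the first two movies, whose budgets and revenues appear at the top of this lab. The names `sample_movies` and `sample_net_revenues` are ours.

```python
# Budget and revenue for the first two movies, copied from the data above.
sample_movies = [
    {'title': 'Avatar', 'budget': 237000000, 'revenue': 2787965087},
    {'title': "Pirates of the Caribbean: At World's End",
     'budget': 300000000, 'revenue': 961000000},
]

# net revenue = revenue - budget, computed once per movie.
sample_net_revenues = [movie['revenue'] - movie['budget'] for movie in sample_movies]

print(sample_net_revenues)  # [2550965087, 661000000]
```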

Now it would be great to plot the net revenue along with the genre.  To do so, we need to replace each occurrence of a movie genre with a corresponding color.  For example, look at the genres of the first five movies:

In [33]:
genres[:5]
# ['Action', 'Adventure', 'Action', 'Action', 'Action']

['Action', 'Adventure', 'Action', 'Action', 'Action']

We could replace the first five elements above with:

`['Red', 'Blue', 'Red', 'Red', 'Red']`.

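So below, we'll create a dictionary that maps each genre to a corresponding color.  Because `zip` pairs the genres and colors by position, the mapping depends on the order of `unique_genres`, and that order can differ between runs since the list was built from a `set`.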

In [None]:
unique_genres

[nan,
 'Thriller',
 'Adventure',
 'Science Fiction',
 'Animation',
 'Comedy',
 'Fantasy',
 'Drama',
 'Action',
 'Horror',
 'Crime',
 'Romance']

In [34]:
colors = ['black', 'yellow', 'yellow', 'red',
          'purple', 'gold', 'gold', 'yellow',
          'orange', 'green', 'yellow', 'blue']
color_map = dict(zip(unique_genres, colors))

In [35]:
colors = [color_map[genre] for genre in genres]

colors[:3]

['red', 'yellow', 'red']
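
If we want the colors to be the same on every run, one alternative is to write the mapping out by hand, and to use `get` with a fallback color for any genre we did not list. The sketch below is only an illustration: `sample_color_map`, the color choices, and the `'gray'` fallback are all our own assumptions.

```python
# Hand-written mapping, so it does not depend on set ordering.
sample_color_map = {'Action': 'orange', 'Adventure': 'yellow', 'Horror': 'green'}

# 'Western' is deliberately missing from the mapping.
sample_genres = ['Action', 'Adventure', 'Action', 'Western']

# dict.get returns the fallback when the key is absent,
# where sample_color_map[genre] would raise a KeyError.
sample_colors = [sample_color_map.get(genre, 'gray') for genre in sample_genres]

print(sample_colors)  # ['orange', 'yellow', 'orange', 'gray']
```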

In [36]:
import plotly.graph_objects as go
scatter = go.Scatter(x = years,
                     y = net_revenues,
                     hovertext=titles,
                     marker = dict(size=8, color=colors),
                     mode = 'markers')
fig = go.Figure(scatter)
fig

<img src="https://github.com/eng-6-22/mod-1-a-data-structures/blob/master/3-coercing-filtering-data/revenue-genres.png?raw=1" width="60%">

> No one said it was gonna be pretty.

### Creating new dictionaries

Now let's create new dictionaries of movie titles and net revenues.  

We'll do so in steps.  First, let's pair together the lists of `titles` and `net_revenues`.  The paired data should be a list where each element contains this data for one movie.

In [77]:
titles_revenues = list(zip(titles, net_revenues))

In [79]:
titles_revenues[:3]
# [('Avatar', 2550965087),
#  ("Pirates of the Caribbean: At World's End", 661000000),
#  ('Spectre', 635674609)]

[('Avatar', 2550965087),
 ("Pirates of the Caribbean: At World's End", 661000000),
 ('Spectre', 635674609)]
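
One thing worth knowing about `zip` is that it stops as soon as the shortest input runs out, without raising an error. The sketch below shows this with two short lists of unequal length; the lists are hand-typed for illustration.

```python
# Three titles but only two net revenues: one value is missing on purpose.
sample_titles = ['Avatar', 'Spectre', 'Tangled']
sample_net_revenues = [2550965087, 635674609]

# zip pairs items by position and stops at the shorter list,
# so 'Tangled' is silently dropped.
pairs = list(zip(sample_titles, sample_net_revenues))

print(pairs)  # [('Avatar', 2550965087), ('Spectre', 635674609)]
```

For that reason, it can be worth checking that the lists have the same length before pairing them.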

Now from here, create a list of dictionaries, where there is a new dictionary for every movie.  The keys are defined below.

In [80]:
keys = ['title', 'net_revenue']

Use the paired data above to create the new dictionaries.

In [82]:
movie_summaries = [dict(zip(keys, movie)) for movie in titles_revenues]
movie_summaries

[{'title': 'Avatar', 'net_revenue': 2550965087},
 {'title': "Pirates of the Caribbean: At World's End",
  'net_revenue': 661000000},
 {'title': 'Spectre', 'net_revenue': 635674609},
 {'title': 'The Dark Knight Rises', 'net_revenue': 834939099},
 {'title': 'John Carter', 'net_revenue': 24139100},
 {'title': 'Spider-Man 3', 'net_revenue': 632871626},
 {'title': 'Tangled', 'net_revenue': 331794936},
 {'title': 'Avengers: Age of Ultron', 'net_revenue': 1125403694},
 {'title': 'Harry Potter and the Half-Blood Prince', 'net_revenue': 683959197},
 {'title': 'Batman v Superman: Dawn of Justice', 'net_revenue': 623260194},
 {'title': 'Superman Returns', 'net_revenue': 121081192},
 {'title': 'Quantum of Solace', 'net_revenue': 386090727},
 {'title': "Pirates of the Caribbean: Dead Man's Chest",
  'net_revenue': 865659812},
 {'title': 'The Lone Ranger', 'net_revenue': -165710090},
 {'title': 'Man of Steel', 'net_revenue': 437845518},
 ...

In [83]:
movie_summaries[:3]
# [{'title': 'Avatar', 'net_revenue': 2550965087},
#  {'title': "Pirates of the Caribbean: At World's End",
#   'net_revenue': 661000000},
#  {'title': 'Spectre', 'net_revenue': 635674609}]

[{'title': 'Avatar', 'net_revenue': 2550965087},
 {'title': "Pirates of the Caribbean: At World's End",
  'net_revenue': 661000000},
 {'title': 'Spectre', 'net_revenue': 635674609}]
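
To see what `dict(zip(keys, movie))` does for a single movie, here is a minimal sketch on one pair, followed by an equivalent way of writing the same thing by unpacking each tuple. The names `pair`, `summary`, `sample_pairs` and `summaries` are ours.

```python
keys = ['title', 'net_revenue']
pair = ('Avatar', 2550965087)

# zip lines up each key with the value in the same position,
# and dict turns those (key, value) pairs into a dictionary.
summary = dict(zip(keys, pair))
print(summary)  # {'title': 'Avatar', 'net_revenue': 2550965087}

# An alternative spelling: unpack each tuple and name the keys directly.
sample_pairs = [('Avatar', 2550965087), ('Spectre', 635674609)]
summaries = [{'title': title, 'net_revenue': net} for title, net in sample_pairs]
print(summaries[1])  # {'title': 'Spectre', 'net_revenue': 635674609}
```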

Then sort the movies by the amount of net revenue they earned, from most to least.

In [84]:
sorted_movies = sorted(movie_summaries, key=lambda x: x['net_revenue'], reverse=True)

In [None]:
sorted_movies[:5]

[{'title': 'Avatar', 'net_revenue': 2550965087},
 {'title': 'Titanic', 'net_revenue': 1645034188},
 {'title': 'Jurassic World', 'net_revenue': 1363528810},
 {'title': 'Furious 7', 'net_revenue': 1316249360},
 {'title': 'The Avengers', 'net_revenue': 1299557910}]
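
By default, `sorted` orders from smallest to largest, and passing `reverse=True` flips that order. The sketch below compares the two on three summaries taken from the output above; the name `sample_summaries` is ours.

```python
# Three summaries copied from the output above, in no particular order.
sample_summaries = [
    {'title': 'Spectre', 'net_revenue': 635674609},
    {'title': 'The Lone Ranger', 'net_revenue': -165710090},
    {'title': 'Avatar', 'net_revenue': 2550965087},
]

# The key function tells sorted which value to compare.
ascending = sorted(sample_summaries, key=lambda movie: movie['net_revenue'])
descending = sorted(sample_summaries, key=lambda movie: movie['net_revenue'],
                    reverse=True)

print([movie['title'] for movie in ascending])
# ['The Lone Ranger', 'Spectre', 'Avatar']
print([movie['title'] for movie in descending])
# ['Avatar', 'Spectre', 'The Lone Ranger']
```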

### Summary

In this lesson we practiced using loops and list comprehensions to reduce and coerce our data.  We saw that both loops and list comprehensions perform the same operation, but that a list comprehension is shorter syntactically.

```python
# ordinary loops
titles = []
for movie in movies:
    titles.append(movie['title'])
    
# list comprehensions
titles = [movie['title'] for movie in movies]
```

We also practiced pairing up two lists of data with the `zip` function.