# Optimizing Code: Common Books
Let's go through an example scenario where we optimize some code to be more efficient. Say we are managing books for a store, and we want to find all the books published within the last two years about code. We have a file that lists all the ids of books published in the last two years, `books_published_last_two_years.txt`, as well as a file for all coding books, `all_coding_books.txt`.

Here's what the first few lines of each file look like.
#### `books_published_last_two_years.txt`
```txt
1262771
9011996
2007022
9389522
8181760
...
```
#### `all_coding_books.txt`
```txt
382944
483549
103957
590274
045832
...
```

Since we want to find all the coding books published within the last two years, we need the book ids that appear in both of these files. Your coworker came up with one approach and showed you this code to find the books in both files.

In [1]:
import time
import pandas as pd
import numpy as np

In [2]:
with open('books_published_last_two_years.txt') as f:
    recent_books = f.read().split('\n')

with open('all_coding_books.txt') as f:
    coding_books = f.read().split('\n')

In [3]:
start = time.time()
recent_coding_books = []

for book in recent_books:
    if book in coding_books:
        recent_coding_books.append(book)

print(len(recent_coding_books))
print('Duration: {} seconds'.format(time.time() - start))

96
Duration: 16.510533571243286 seconds


Their strategy is to loop through each book in the first file, check if it's contained in the second file, and if so, add it to the final list. This makes sense and is an intuitive first approach. However, there are several things we can do to make this more efficient. Here are some tips.

### Tip #1: Use vector operations over loops when possible
Numpy and pandas are your best friends for this. In many cases you can replace a loop with a Numpy or pandas function that uses vector operations, which makes your computations a lot faster. Sometimes there is a method that does exactly what you need. Other times, you need to be a little creative. This example in particular has a useful method you can use.

Let me show you how I would approach this. No joke, I google "how to find common elements in two Numpy arrays", and the results point to Numpy's `intersect1d`.

In the Jupyter notebook quiz on the next page, use Numpy's `intersect1d` method to get the intersection of the `recent_books` and `coding_books` arrays. I'll give you this same notebook, and I'll put a cell right here with code to record the time it takes to run. Write your line of code in between these time start and time end lines.

```python
start = time.time()
recent_coding_books = # TODO: compute intersection of lists
print(len(recent_coding_books))
print('Duration: {} seconds'.format(time.time() - start))
```

In [4]:
# put your code here
start = time.time()

recent_coding_books = np.intersect1d(recent_books, coding_books)

print(len(recent_coding_books))
print('Duration: {} seconds'.format(time.time() - start))

96
Duration: 0.0413815975189209 seconds
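
To see what `intersect1d` returns, here is a minimal sketch on a few made-up ids (the values below are hypothetical and not taken from the book files). Note that the result comes back sorted and with duplicates removed, so it can be shorter than what the loop version would produce if the same id appeared twice in the first list.

```python
import numpy as np

# Hypothetical toy ids, stored as strings like the lines read from the files
recent = np.array(['7', '3', '9', '3', '1'])
coding = np.array(['3', '1', '4', '1'])

# intersect1d returns the sorted, unique values found in both arrays
common = np.intersect1d(recent, coding)

print(common)       # ['1' '3']
print(len(common))  # 2
```

In the notebook run above, both approaches report 96 books, which suggests the overlap between the two files contains no repeated ids.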


### Tip #2: Know your data structures and which methods are faster
In addition to relying on Numpy and pandas, it's often good to double check whether there's a data structure or method in Python you can use to accomplish your task more effectively. For example, in this case, do you recall a data structure in Python that stores a group of unique elements and can quickly compute intersections and unions of different groups? You can read more about why sets are more efficient than lists for this task in the link at the bottom of this page.

Also, remember how I said I googled everything? Last time, I was googling how to find common elements specifically in Numpy arrays. But you can go more general and google something like "how to find common elements in two lists python", and you'll see posts like [this](https://stackoverflow.com/questions/2864842/common-elements-comparison-between-2-lists) that share and compare different answers. You can see sets being introduced in the answers there.

This post has a lot of great explanations and votes, but ultimately we should try different methods and compare their efficiency for our example, because different methods perform differently in different contexts. So it's smart to always test for yourself. In the next cell of the Jupyter notebook, find out how long it takes to compute the common elements of `recent_books` and `coding_books` using Python's `set.intersection` method. Here again is some code to measure how long this takes.

```python
start = time.time()
recent_coding_books = # TODO: compute intersection of lists
print(len(recent_coding_books))
print('Duration: {} seconds'.format(time.time() - start))
```

In [7]:
# put your code here
start = time.time()
recent_coding_books = set(recent_books).intersection(set(coding_books))

print(len(recent_coding_books))
print('Duration: {} seconds'.format(time.time() - start))

96
Duration: 0.028631925582885742 seconds
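
The speedup comes from how membership checks work: `book in coding_books` on a list scans the elements one by one, while a set uses hashing to find an element in roughly constant time. The sketch below is a rough illustration on hypothetical stand-in data (10,000 generated ids, not the real files), using the standard library's `timeit` module; exact timings will vary by machine.

```python
import timeit

# Hypothetical stand-in data: 10,000 string ids
ids_list = [str(i) for i in range(10000)]
ids_set = set(ids_list)

# The last element is the worst case for a list scan
target = '9999'

# Time 1,000 membership checks against each data structure
list_time = timeit.timeit(lambda: target in ids_list, number=1000)
set_time = timeit.timeit(lambda: target in ids_set, number=1000)

print('list lookup: {:.6f} seconds'.format(list_time))
print('set lookup:  {:.6f} seconds'.format(set_time))
```

The original loop repeats that list scan once for every book in `recent_books`, which is why it took seconds while the set version finishes in a fraction of one.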


# Optimizing Code: Holiday Gifts
In the last example, you learned that using vectorized operations and more efficient data structures can optimize your code. Let's use these tips for one more example.

Say your online gift store has one million users that each listed a gift on a wish list. You have the prices for each of these gifts stored in `gift_costs.txt`. For the holidays, you're going to give each customer their wish list gift for free if it is under 25 dollars. Now, you want to calculate the total cost of all gifts under 25 dollars to see how much you'd spend on free gifts. Here's one way you could've done it.

In [8]:
import time
import numpy as np

In [17]:
with open('gift_costs.txt') as f:
    gift_costs = f.read().split('\n')

gift_costs = np.array(gift_costs).astype(int)  # convert string to int

In [19]:
gift_costs.shape

(3604469,)

In [10]:
start = time.time()

total_price = 0
for cost in gift_costs:
    if cost < 25:
        total_price += cost * 1.08  # add cost after tax

print(total_price)
print('Duration: {} seconds'.format(time.time() - start))

11807496.359999606
Duration: 3.134730577468872 seconds


Here you iterate through each cost in the list, and check if it's less than 25. If so, you add the cost to the total price after tax. This works, but there is a much faster way to do this. Can you refactor this to run under half a second?

## Refactor Code
**Hint:** Using Numpy makes it very easy to select all the elements in an array that meet a certain condition, and then perform operations on them together all at once. You can then find the sum of what those values end up being.

In [None]:
start = time.time()

total_price =  # TODO: compute the total price

print(total_price)
print('Duration: {} seconds'.format(time.time() - start))

In [12]:
start = time.time()

# Use boolean indexing to find costs less than 25 and sum them after tax
total_price = np.sum(gift_costs[gift_costs < 25] * 1.08)

print(total_price)
print('Duration: {} seconds'.format(time.time() - start))

11807496.360000003
Duration: 0.03827834129333496 seconds
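
To make the one-liner easier to follow, here is the same idea broken into steps on a tiny made-up array (the prices are hypothetical, chosen only for illustration). The comparison builds a boolean mask, indexing with the mask keeps the matching elements, and the multiplication and sum then run over that whole selection at once.

```python
import numpy as np

# Hypothetical toy prices, in whole dollars
costs = np.array([10, 30, 24, 25, 5])

# Step 1: the comparison yields one True/False value per element
mask = costs < 25

# Step 2: indexing with the mask keeps only the elements where it is True
cheap = costs[mask]

# Step 3: apply the tax to every selected element, then add them up
total = np.sum(cheap * 1.08)

print(mask)             # [ True False  True False  True]
print(cheap)            # [10 24  5]
print(round(total, 2))  # 42.12
```

Note that 25 itself is excluded, because the condition is strictly "under 25 dollars".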
