## `Example 1` Optimizing Code: Common Books
Here's the code your coworker wrote to find the common book ids in `books_published_last_two_years.txt` and `all_coding_books.txt` to obtain a list of recent coding books. 

In [6]:
import time
import numpy as np
import pandas as pd

In [3]:
DATA_DIRECTORY = '../dataset/swe/'

with open(DATA_DIRECTORY + 'books_published_last_two_years.txt', 'r') as f:
    recent_books = f.read().split('\n')
    
with open(DATA_DIRECTORY + 'all_coding_books.txt', 'r') as f:
    coding_books = f.read().split('\n')

In [5]:
start_time = time.time()
recent_coding_books = []

for book in recent_books:
    if book in coding_books:
        recent_coding_books.append(book)

print(len(recent_coding_books))
print(f'Duration: {time.time() - start_time} seconds')

96
Duration: 12.633978128433228 seconds


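That took more than 12 seconds! The loop is slow because every `book in coding_books` test scans the `coding_books` list from the start. Is there a faster way? Search for one!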

### Tip #1: Use vector operations over loops when possible

search term: how to find common elements in two numpy arrays

In [7]:
start_time = time.time()

recent_coding_books = np.intersect1d(recent_books, coding_books)

print(len(recent_coding_books))
print(f'Duration: {time.time() - start_time} seconds')

96
Duration: 0.03740382194519043 seconds


Wow, that took only about 0.04 seconds!

There are many cases in which you can replace loops with NumPy and pandas vector operations to make your computations much faster. Sometimes there is a method that does exactly what you need. Other times, you need to be a little creative. In this example, `np.intersect1d` does exactly what you need.
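To make the comparison concrete, here is a minimal sketch on a handful of made-up book ids (the ids below are hypothetical stand-ins, not values from the text files). It runs the loop-style filter and `np.intersect1d` side by side and checks that they agree. Note that `np.intersect1d` returns the sorted, unique common values, so the two approaches could differ in order or length if the input contained duplicates.

```python
import numpy as np

# Hypothetical stand-ins for the id lists read from the two text files.
recent_books = ["101", "205", "309", "412", "518"]
coding_books = ["205", "412", "777", "901"]

# Loop-style version: one scan of coding_books per recent book.
loop_result = [book for book in recent_books if book in coding_books]

# Vectorized version: sorted, unique values present in both inputs.
vector_result = np.intersect1d(recent_books, coding_books)

print(loop_result)
print(vector_result.tolist())
print(sorted(loop_result) == vector_result.tolist())
```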

### Tip #2: Know your data structures and which methods are faster
search term: how to find common elements in two lists python

In [9]:
start_time = time.time()

recent_coding_books = set(recent_books).intersection(set(coding_books))

print(len(recent_coding_books))
print(f'Duration: {time.time() - start_time} seconds')

96
Duration: 0.009005069732666016 seconds


This is even faster: about 0.009 seconds!


In addition to relying on NumPy and pandas, it's often good to double-check whether there's a data structure or method in Python you can use to accomplish your task more effectively. For example, in this case do you recall a data structure in Python that stores a group of unique elements and can quickly compute intersections and unions of different groups? You can read more about why sets are more efficient than lists for this task in the link at the bottom of this page.
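As a rough illustration of why the set version wins, the sketch below times membership tests against a list and against a set built from the same ids. The ids are synthetic (generated with `range`, purely an assumption for the demo), and the timings will vary by machine; the point is only that a list test scans elements one by one while a set test is a hash lookup.

```python
import time

# Synthetic ids, assumed for illustration only (not the real book files).
coding_books = [str(i) for i in range(0, 10000, 2)]   # even numbers
recent_books = [str(i) for i in range(0, 3000, 3)]    # multiples of 3

coding_set = set(coding_books)

# Membership against a list: each test scans the list until a match is found.
start = time.perf_counter()
via_list = [book for book in recent_books if book in coding_books]
list_seconds = time.perf_counter() - start

# Membership against a set: each test is a hash lookup.
start = time.perf_counter()
via_set = [book for book in recent_books if book in coding_set]
set_seconds = time.perf_counter() - start

print(len(via_list), len(via_set))
print(f'list: {list_seconds} seconds, set: {set_seconds} seconds')
```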

---
## `Example 2` Optimizing Code: Holiday Gifts
In the last example, you learned that using vectorized operations and more efficient data structures can optimize your code. Let's use these tips for one more example.

Say your online gift store has one million users that each listed a gift on a wish list. You have the prices for each of these gifts stored in `gift_costs.txt`. For the holidays, you're going to give each customer their wish list gift for free if it is under 25 dollars. Now, you want to calculate the total cost of all gifts under 25 dollars to see how much you'd spend on free gifts. Here's one way you could've done it.

In [13]:
with open(DATA_DIRECTORY + 'gift_costs.txt') as f:
    gift_costs = f.read().split('\n')  # returns a list of strings

# Convert the strings to ints
gift_costs = np.array(gift_costs).astype(int)

In [16]:
start_time = time.time()

total_price = 0
for cost in gift_costs:
    if cost < 25:
        total_price += cost * 1.08  # cost after 8% tax
        
print(total_price)
print(f'Duration: {time.time() - start_time} seconds')

32765421.23999867
Duration: 7.614516019821167 seconds


Here you iterate through each cost in the array and check if it's less than 25. If so, you add the cost to the total price after tax. This works, but there is a much faster way to do this. Can you refactor this to run in under half a second?

### Refactor Code
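Remember that NumPy arrays support vectorized operations: a comparison such as `gift_costs < 25` produces a boolean mask, and arithmetic applies to every element at once.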

In [17]:
start_time = time.time()

total_price = ( gift_costs[gift_costs < 25] * 1.08 ).sum()
        
print(total_price)
print(f'Duration: {time.time() - start_time} seconds')

32765421.240000006
Duration: 0.09139776229858398 seconds
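The one-line expression above combines three steps. As a minimal sketch, here they are separated out on a tiny made-up array (the five costs are hypothetical, not taken from `gift_costs.txt`; the names `mask` and `cheap` are introduced only for this illustration).

```python
import numpy as np

# Hypothetical small sample of gift costs, for illustration only.
gift_costs = np.array([10, 30, 24, 25, 5])

mask = gift_costs < 25       # boolean array: True where the cost is under 25
cheap = gift_costs[mask]     # keeps only the elements where mask is True
total_price = (cheap * 1.08).sum()  # apply 8% tax to each element, then add up

print(mask.tolist())
print(cheap.tolist())
print(total_price)
```

A cost of exactly 25 is excluded, since the comparison is strictly less than 25.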


In [18]:
start_time = time.time()

total_price = ( gift_costs[gift_costs < 25].sum() ) * 1.08
        
print(total_price)
print(f'Duration: {time.time() - start_time} seconds')

32765421.240000002
Duration: 0.06149101257324219 seconds


Yes, it is a bit faster to sum first and then multiply the scalar total by 1.08 once, instead of multiplying every element before summing.
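The two refactored cells print totals that differ only in the last digits. That is expected: multiplication distributes over the sum mathematically, but floating-point rounding happens at different points in each version. A small sketch on hypothetical costs (again, not the real data) compares the two orderings with `np.isclose` rather than exact equality.

```python
import numpy as np

# Hypothetical small sample of gift costs, for illustration only.
gift_costs = np.array([10, 30, 24, 25, 5])
cheap = gift_costs[gift_costs < 25]

multiply_then_sum = (cheap * 1.08).sum()  # one multiplication per element
sum_then_multiply = cheap.sum() * 1.08    # integer sum, then one multiplication

print(multiply_then_sum, sum_then_multiply)
print(np.isclose(multiply_then_sum, sum_then_multiply))
```

Summing first also keeps the sum in integer arithmetic here, so only the final multiplication introduces any rounding.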