# Writing Efficient Python Code
Run the hidden code cell below to import the data used in this course.

## Course Description

As a Data Scientist, the majority of your time should be spent gleaning actionable insights from data -- not waiting for your code to finish running. Writing efficient Python code can help reduce runtime and save computational resources, ultimately freeing you up to do the things you love as a Data Scientist. In this course, you'll learn how to use Python's built-in data structures, functions, and modules to write cleaner, faster, and more efficient code. We'll explore how to time and profile code in order to find bottlenecks. Then, you'll practice eliminating these bottlenecks, and other bad design patterns, using Python's Standard Library, NumPy, and pandas. After completing this course, you'll have the necessary tools to start writing efficient Python code.

## Chapter 1 - Foundations for efficiencies

In this chapter, you'll learn what it means to write efficient Python code. You'll explore Python's Standard Library, learn about NumPy arrays, and practice using some of Python's built-in tools. This chapter builds a foundation for the concepts covered ahead.

In [3]:
# Importing pandas
import pandas as pd

# Reading in the data
baseball = pd.read_csv("datasets/baseball.csv")

In [4]:
## Defining Pythonic
# Non-Pythonic
squared_numbers = []

for num in range(5):
    squared_numbers.append(num**2)
print(squared_numbers)

# Pythonic
print([num**2 for num in range(5)])

[0, 1, 4, 9, 16]
[0, 1, 4, 9, 16]


In [5]:
import this

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!


### Built-in practice: range()

In [6]:
nums = range(0,6)
print(type(nums))

nums_list = list(nums)
print(nums_list)

nums_list_2 = [*range(1,12,2)]
print(nums_list_2)

<class 'range'>
[0, 1, 2, 3, 4, 5]
[1, 3, 5, 7, 9, 11]


### Built-in practice: enumerate()

In [7]:
names = ["Vijay","Pavan","Kalyan","Reddy","Lavanya","Reddy"]

# rewrite for loop to use enumerate
indexed_names = []
for i, name in enumerate(names):
    index_name = (i,name)
    indexed_names.append(index_name)
print(indexed_names)

# rewrite the above loop using list comprehension
indexed_names_comp = [(i,name) for i,name in enumerate(names)]
print(indexed_names_comp)

# Unpack an enumerate object with a starting index of one
indexed_names_unpack = [*enumerate(names,1)]
print(indexed_names_unpack)

[(0, 'Vijay'), (1, 'Pavan'), (2, 'Kalyan'), (3, 'Reddy'), (4, 'Lavanya'), (5, 'Reddy')]
[(0, 'Vijay'), (1, 'Pavan'), (2, 'Kalyan'), (3, 'Reddy'), (4, 'Lavanya'), (5, 'Reddy')]
[(1, 'Vijay'), (2, 'Pavan'), (3, 'Kalyan'), (4, 'Reddy'), (5, 'Lavanya'), (6, 'Reddy')]


### Built-in practice: map()

In [8]:
# Use map to apply str.upper to each element in names
names_map = map(str.upper, names)

# Print the type of names_map
print(type(names_map))

# Unpack names_map into a list
names_uppercase = [*names_map]

# Print the list created above
print(names_uppercase)

<class 'map'>
['VIJAY', 'PAVAN', 'KALYAN', 'REDDY', 'LAVANYA', 'REDDY']


### Practice with NumPy arrays

In [9]:
import numpy as np
nums = np.array([[ 1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10]])

# Print second row of nums
print(nums[1,:])

# Print all elements of nums that are greater than six
print(nums[nums > 6])

# Double every element of nums
nums_dbl = nums * 2
print(nums_dbl)

# Add one to every value in the third column of nums
nums[:,2] = nums[:,2] + 1

print(nums)

[ 6  7  8  9 10]
[ 7  8  9 10]
[[ 2  4  6  8 10]
 [12 14 16 18 20]]
[[ 1  2  4  4  5]
 [ 6  7  9  9 10]]


A NumPy array holds elements of a single data type, which reduces memory consumption, and lets you apply an operation to every element at once through broadcasting.
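
As a rough illustration of both points, the sketch below uses made-up values rather than course data. A list that mixes integers and a float is coerced to one dtype, and a one-dimensional array is broadcast across each row of a two-dimensional one.

```python
import numpy as np

# Mixing ints and a float: NumPy upcasts every element to a single dtype.
mixed = np.array([1, 2, 3.5])
print(mixed.dtype)  # float64

# Broadcasting: the 1-D array is added to each row of the 2-D array.
grid = np.array([[1, 2, 3],
                 [4, 5, 6]])
offsets = np.array([10, 20, 30])
shifted = grid + offsets
print(shifted)
```

The names `mixed`, `grid`, `offsets`, and `shifted` are illustrative only and do not appear elsewhere in these notes.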

### Bringing it all together: Festivus!

In [10]:
arrival_times = [*range(10,51,10)]
print(arrival_times)

[10, 20, 30, 40, 50]


In [11]:
# Convert arrival_times to an array and update the times
arrival_times_np = np.array(arrival_times)
new_times = arrival_times_np - 3

In [12]:
# Use list comprehension and enumerate to pair guests to new times
guest_arrivals = [(names[i],time) for i,time in enumerate(new_times)]
print(guest_arrivals)

[('Vijay', 7), ('Pavan', 17), ('Kalyan', 27), ('Reddy', 37), ('Lavanya', 47)]


In [13]:
# welcome_guest is assumed to be defined elsewhere; it is not part of this
# notebook, so the cell is left commented out.
# Map the welcome_guest function to each (guest,time) pair
# welcome_map = map(welcome_guest, guest_arrivals)
#
# guest_welcomes = [*welcome_map]
# print(*guest_welcomes, sep='\n')

## Chapter 2 - Timing and profiling code

In this chapter, you will learn how to gather and compare runtimes between different coding approaches. You'll practice using the line_profiler and memory_profiler packages to profile your code base and spot bottlenecks. Then, you'll put what you've learned into practice by replacing these bottlenecks with efficient Python code.
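
The `%timeit` magic used below only works inside IPython or Jupyter. As a minimal sketch of the same idea in plain Python, the standard library's `timeit` module can compare two approaches; the function names and the input size here are arbitrary choices for illustration.

```python
import timeit

values = list(range(1000))

def square_with_loop():
    result = []
    for v in values:
        result.append(v ** 2)
    return result

def square_with_comprehension():
    return [v ** 2 for v in values]

# Both approaches must agree before their runtimes are worth comparing.
assert square_with_loop() == square_with_comprehension()

# repeat() returns one total time per run; the minimum is the least noisy.
loop_time = min(timeit.repeat(square_with_loop, number=200, repeat=3))
comp_time = min(timeit.repeat(square_with_comprehension, number=200, repeat=3))
print(f"loop: {loop_time:.4f} s, comprehension: {comp_time:.4f} s")
```

Actual timings depend on the machine and Python version, so no expected numbers are shown.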

### Using %timeit: your turn!

In [14]:
wts = [441.0,
 65.0,
 90.0,
 441.0,
 122.0,
 88.0,
 61.0,
 81.0,
 104.0,
 108.0,
 90.0,
 90.0,
 72.0,
 169.0,
 173.0,
 101.0,
 68.0,
 57.0,
 54.0,
 83.0,
 90.0,
 122.0,
 86.0,
 358.0,
 135.0,
 106.0,
 146.0,
 63.0,
 68.0,
 57.0,
 98.0,
 270.0,
 59.0,
 50.0,
 101.0,
 68.0,
 54.0,
 81.0,
 63.0,
 67.0,
 180.0,
 77.0,
 54.0,
 57.0,
 52.0,
 61.0,
 95.0,
 79.0,
 133.0,
 63.0,
 181.0,
 68.0,
 216.0,
 135.0]

In [15]:
import numpy as np

In [16]:
%%timeit
hero_wts_lbs = []
for wt in wts:
    hero_wts_lbs.append(wt * 2.20462)

In [17]:
%%timeit
wts_np = np.array(wts)
hero_wts_lbs_np = wts_np * 2.20462

### Using %lprun: spot bottlenecks

In [18]:
# pip install line_profiler


In [19]:
import numpy as np
heroes = ['Batman', 'Superman', 'Wonder Woman']
hts = np.array([188.0, 191.0, 183.0])
wts = np.array([ 95.0, 101.0,  74.0])


In [20]:
def convert_units_broadcast(heroes, heights, weights):

    # Array broadcasting instead of list comprehension
    new_hts = heights * 0.39370
    new_wts = weights * 2.20462

    hero_data = {}

    for i,hero in enumerate(heroes):
        hero_data[hero] = (new_hts[i], new_wts[i])

    return hero_data

convert_units_broadcast(heroes,hts,wts)


{'Batman': (74.01559999999999, 209.4389),
 'Superman': (75.19669999999999, 222.66661999999997),
 'Wonder Woman': (72.0471, 163.14188)}

### Code profiling: line_profiler

In [21]:
%load_ext line_profiler
%lprun -f convert_units_broadcast convert_units_broadcast(heroes,hts,wts)

Timer unit: 1e-09 s

Total time: 1.66e-05 s
File: /tmp/ipykernel_163/2767885347.py
Function: convert_units_broadcast at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
     1                                           def convert_units_broadcast(heroes, heights, weights):
     2                                           
     3                                               # Array broadcasting instead of list comprehension
     4         1       9830.0   9830.0     59.2      new_hts = heights * 0.39370
     5         1       2310.0   2310.0     13.9      new_wts = weights * 2.20462
     6                                           
     7         1        260.0    260.0      1.6      hero_data = {}
     8                                           
     9         3       1670.0    556.7     10.1      for i,hero in enumerate(heroes):
    10         3       2390.0    796.7     14.4          hero_data[hero] = (new_hts[i], new_wts[i])
    11                             

### Code profiling for memory usage

#### Quick and dirty approach

In [22]:
import sys

In [23]:
nums_np = np.array(range(1000))
sys.getsizeof(nums_np)

8112
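
One caveat with this quick approach: for a Python list, `sys.getsizeof` counts only the list object and its references, not the objects it points to. The sketch below, which assumes CPython and uses illustrative names, adds the element sizes for a rough estimate. Small integers are shared by the interpreter, so the larger figure is an upper estimate rather than an exact measurement.

```python
import sys

nums_list = list(range(1000))

# getsizeof on a list counts the list object and its pointers only.
shallow = sys.getsizeof(nums_list)

# Adding the size of each int object gives a rough upper estimate.
deep = shallow + sum(sys.getsizeof(n) for n in nums_list)

print(shallow, deep)
```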

### Code profiling: memory

In [24]:
# pip install memory_profiler

In [25]:
# hero_funcs is assumed to be a local module; it is not included in this
# notebook, so the cell is left commented out.
# from hero_funcs import convert_units
# %load_ext memory_profiler
# %mprun -f convert_units convert_units(heroes, hts, wts)

## Chapter 3 - Gaining efficiencies

This chapter covers more complex efficiency tips and tricks. You'll learn a few useful built-in modules for writing efficient code and practice using set theory. You'll then learn about looping patterns in Python and how to make them more efficient.
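
The cells in this chapter focus on sets and loops. As a small sketch of the kind of built-in modules the description refers to, the example below assumes `collections.Counter` and `itertools.combinations` are representative choices. The sample list `poke_types` is hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sample data for illustration.
poke_types = ['Grass', 'Fire', 'Water', 'Fire', 'Water', 'Water']

# Counter tallies occurrences without an explicit loop.
type_counts = Counter(poke_types)
print(type_counts.most_common(1))  # [('Water', 3)]

# combinations yields every unordered pair exactly once.
pairs = list(combinations(['Bulbasaur', 'Charmander', 'Squirtle'], 2))
print(pairs)
```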

### Set Theory

In [26]:
list_a = ['Bulbasaur', 'Charmander', 'Squirtle']
list_b = ['Caterpie', 'Pidgey', 'Squirtle']


In [27]:
in_common = []
for pokemon_a in list_a:
    for pokemon_b in list_b:
        if pokemon_a == pokemon_b:
            in_common.append(pokemon_a)
print(in_common)

['Squirtle']


In [28]:
set_a = set(list_a)
set_b =  set(list_b)

set_a.intersection(set_b)

{'Squirtle'}

### Set method: difference

In [29]:
set_a.difference(set_b)

{'Bulbasaur', 'Charmander'}

In [30]:
set_b.difference(set_a)

{'Caterpie', 'Pidgey'}

### Set method: symmetric difference

In [31]:
set_a.symmetric_difference(set_b)

{'Bulbasaur', 'Caterpie', 'Charmander', 'Pidgey'}

### Set method: union

In [32]:
set_a.union(set_b)

{'Bulbasaur', 'Caterpie', 'Charmander', 'Pidgey', 'Squirtle'}

### Uniques with sets

In [33]:
unique_types = []
for prim_type in list_a:
    if prim_type not in unique_types:       
        unique_types.append(prim_type)
print(unique_types)

['Bulbasaur', 'Charmander', 'Squirtle']


In [34]:
primary_names = ["PK","Vijay","Pavan","PK","Kalyan","Vijay"]

unique_names = set(primary_names)

print(unique_names)

{'Vijay', 'Pavan', 'Kalyan', 'PK'}
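
Note that the set output above does not keep the original order of the names. If first-seen order matters, one option is `dict.fromkeys`, sketched below; this relies on dictionaries preserving insertion order, which is guaranteed from Python 3.7 onward.

```python
primary_names = ["PK", "Vijay", "Pavan", "PK", "Kalyan", "Vijay"]

# dict keys are unique and keep insertion order (Python 3.7+).
unique_in_order = list(dict.fromkeys(primary_names))
print(unique_in_order)  # ['PK', 'Vijay', 'Pavan', 'Kalyan']
```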


### Eliminating loops

### Eliminating loops with built-ins

In [35]:
# List of HP, Attack, Defense, Speed
poke_stats = [[90,  92, 75, 60],
              [25,  20, 15, 90],
              [65, 130, 60, 75]]
# For loop approach
totals = []
for row in poke_stats:    
    totals.append(sum(row))
print(totals)

# list comprehension
totals_comp = [sum(row) for row in poke_stats]
print(totals_comp)

# Built-in map() function
totals_map = [*map(sum,poke_stats)]
print(totals_map)


[317, 150, 330]
[317, 150, 330]
[317, 150, 330]


In [36]:
%%timeit
totals = []
for row in poke_stats:    
    totals.append(sum(row))

605 ns ± 1.89 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)


In [37]:
%timeit totals_comp = [sum(row) for row in poke_stats]

607 ns ± 3.07 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)


In [38]:
%timeit totals_map = [*map(sum, poke_stats)]

544 ns ± 6.13 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)


### Eliminate loops with NumPy

In [39]:
import numpy as np
poke_stats = np.array(poke_stats)
%timeit avgs = poke_stats.mean(axis=1)

5.08 µs ± 22.9 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


### Writing Better Loops

#### Moving calculations above a loop

In [40]:
import numpy as np
names = ['Absol', 'Aron', 'Jynx', 'Natu', 'Onix']
attacks = np.array([130, 70, 50, 50, 45])
for pokemon,attack in zip(names, attacks):    
    total_attack_avg = attacks.mean()
    if attack > total_attack_avg:
        print("{}'s attack: {} > average: {}!".format(pokemon, attack, total_attack_avg))

Absol's attack: 130 > average: 69.0!
Aron's attack: 70 > average: 69.0!


In [41]:
import numpy as np
names = ['Absol', 'Aron', 'Jynx', 'Natu', 'Onix']
attacks = np.array([130, 70, 50, 50, 45])
# Calculate total average once (outside the loop)
total_attack_avg = attacks.mean()
for pokemon,attack in zip(names, attacks):
    if attack > total_attack_avg:
        print("{}'s attack: {} > average: {}!".format(pokemon, attack, total_attack_avg))

Absol's attack: 130 > average: 69.0!
Aron's attack: 70 > average: 69.0!


#### Using holistic conversions

In [42]:
names = ['Pikachu', 'Squirtle', 'Articuno']
legend_status = [False, False, True]
generations = [1, 1, 1,]
poke_data = []
for poke_tuple in zip(names, legend_status, generations):   
    poke_list = list(poke_tuple)    
    poke_data.append(poke_list)
print(poke_data)

[['Pikachu', False, 1], ['Squirtle', False, 1], ['Articuno', True, 1]]


In [43]:
names = ['Pikachu', 'Squirtle', 'Articuno']
legend_status = [False, False, True]
generations = [1, 1, 1]
poke_data_tuples = []
for poke_tuple in zip(names, legend_status, generations):   
    poke_data_tuples.append(poke_tuple)
poke_data = [*map(list, poke_data_tuples)]
print(poke_data)

[['Pikachu', False, 1], ['Squirtle', False, 1], ['Articuno', True, 1]]


## Chapter 4 - Basic Pandas Optimizations

This chapter offers a brief introduction to working efficiently with pandas DataFrames. You'll learn the options available for iterating over a DataFrame. Then, you'll learn how to efficiently apply functions to data stored in a DataFrame.
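
As a compact preview of those options, the sketch below computes a run differential three ways on a hypothetical three-row frame. The frame `teams` is made up for illustration; it reuses the `Team`, `RS`, and `RA` column names and the first three rows of values from the baseball data shown later.

```python
import pandas as pd

# Hypothetical three-row frame using the same column names as baseball.csv.
teams = pd.DataFrame({
    "Team": ["ARI", "ATL", "BAL"],
    "RS": [734, 700, 712],
    "RA": [688, 600, 705],
})

# Option 1: iterate row by row with itertuples (one namedtuple per row).
diffs_loop = [row.RS - row.RA for row in teams.itertuples()]

# Option 2: apply a function to each row.
diffs_apply = teams.apply(lambda row: row["RS"] - row["RA"], axis=1)

# Option 3: operate on whole columns at once.
diffs_vec = teams["RS"] - teams["RA"]

print(diffs_loop)  # [46, 100, 7]
```

All three produce the same values; the timings later in this chapter show how much their costs differ on the full dataset.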

### Intro to pandas DataFrame iteration

#### Baseball stats

In [44]:
baseball.head()

Unnamed: 0,Team,League,Year,RS,RA,W,OBP,SLG,BA,Playoffs,RankSeason,RankPlayoffs,G,OOBP,OSLG
0,ARI,NL,2012,734,688,81,0.328,0.418,0.259,0,,,162,0.317,0.415
1,ATL,NL,2012,700,600,94,0.32,0.389,0.247,1,4.0,5.0,162,0.306,0.378
2,BAL,AL,2012,712,705,93,0.311,0.417,0.247,1,5.0,4.0,162,0.315,0.403
3,BOS,AL,2012,734,806,69,0.315,0.415,0.26,0,,,162,0.331,0.428
4,CHC,NL,2012,613,759,61,0.302,0.378,0.24,0,,,162,0.335,0.424


### Calculating win percentage

In [45]:
import numpy as np
def calc_win_perc(wins,games_played):
    wins_perc = wins/games_played
    return np.round(wins_perc,2)

In [46]:
win_perc = calc_win_perc(50,100)
print(win_perc)

0.5


### Adding win percentage to DataFrame

In [47]:
wins_perc_list = []
for i in range(len(baseball)):
    row = baseball.iloc[i]
    wins = row["W"]
    games_played = row["G"]
    win_perc = calc_win_perc(wins,games_played)
    wins_perc_list.append(win_perc)
baseball["WP"] = wins_perc_list

In [48]:
baseball.head()

Unnamed: 0,Team,League,Year,RS,RA,W,OBP,SLG,BA,Playoffs,RankSeason,RankPlayoffs,G,OOBP,OSLG,WP
0,ARI,NL,2012,734,688,81,0.328,0.418,0.259,0,,,162,0.317,0.415,0.5
1,ATL,NL,2012,700,600,94,0.32,0.389,0.247,1,4.0,5.0,162,0.306,0.378,0.58
2,BAL,AL,2012,712,705,93,0.311,0.417,0.247,1,5.0,4.0,162,0.315,0.403,0.57
3,BOS,AL,2012,734,806,69,0.315,0.415,0.26,0,,,162,0.331,0.428,0.43
4,CHC,NL,2012,613,759,61,0.302,0.378,0.24,0,,,162,0.335,0.424,0.38


### Iterating with .iloc

In [49]:
%%timeit
win_perc_list = []
for i in range(len(baseball)):    
    row = baseball.iloc[i]    
    wins = row['W']    
    games_played = row['G']    
    win_perc = calc_win_perc(wins, games_played)   
    win_perc_list.append(win_perc)
baseball['WP'] = win_perc_list

124 ms ± 1.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


### Iterating with .iterrows()

In [50]:
wins_perc_list = []
for i,row in baseball.iterrows():
    wins = row["W"]
    games_played = row["G"]
    wins_perc = calc_win_perc(wins,games_played)
    wins_perc_list.append(wins_perc)
baseball["WPN"] = wins_perc_list

In [51]:
baseball.head()

Unnamed: 0,Team,League,Year,RS,RA,W,OBP,SLG,BA,Playoffs,RankSeason,RankPlayoffs,G,OOBP,OSLG,WP,WPN
0,ARI,NL,2012,734,688,81,0.328,0.418,0.259,0,,,162,0.317,0.415,0.5,0.5
1,ATL,NL,2012,700,600,94,0.32,0.389,0.247,1,4.0,5.0,162,0.306,0.378,0.58,0.58
2,BAL,AL,2012,712,705,93,0.311,0.417,0.247,1,5.0,4.0,162,0.315,0.403,0.57,0.57
3,BOS,AL,2012,734,806,69,0.315,0.415,0.26,0,,,162,0.331,0.428,0.43,0.43
4,CHC,NL,2012,613,759,61,0.302,0.378,0.24,0,,,162,0.335,0.424,0.38,0.38


In [52]:
%%timeit
wins_perc_list = []
for i,row in baseball.iterrows():
    wins = row["W"]
    games_played = row["G"]
    wins_perc = calc_win_perc(wins,games_played)
    wins_perc_list.append(wins_perc)
baseball["WPN"] = wins_perc_list

69.5 ms ± 594 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


### Another iterator method: .itertuples()

In [53]:
# .iterrows() yields (index, Series) tuples:
# for row_tuple in baseball.iterrows():
#     print(row_tuple)
#     print(type(row_tuple[1]))

### Iterating with .itertuples()

In [54]:
# for row_namedtuple in baseball.itertuples():
#     print(row_namedtuple)

In [55]:
# print(row_namedtuple.Index)

In [56]:
# print(row_namedtuple.Team)

In [57]:
# for row_namedtuple in baseball.itertuples():
#     print(row_namedtuple.Team)

### pandas alternative to looping

#### Run differentials with a loop

In [59]:
def calc_run_diff(runs_scored, runs_allowed):   
    run_diff = runs_scored - runs_allowed
    return run_diff

In [61]:
run_diffs_iterrows = []
for i,row in baseball.iterrows():
    run_diff = calc_run_diff(row['RS'], row['RA'])
    run_diffs_iterrows.append(run_diff)
# Assign the column once, after the loop has collected every value
baseball['RD'] = run_diffs_iterrows
print(baseball)

### Run differentials with .apply()

In [64]:
run_diffs_apply = baseball.apply(lambda row: calc_run_diff(row['RS'], row['RA']), axis=1)
baseball['RD'] = run_diffs_apply
print(baseball)

     Team League  Year   RS   RA    W  ...    G   OOBP   OSLG    WP   WPN   RD
0     ARI     NL  2012  734  688   81  ...  162  0.317  0.415  0.50  0.50   46
1     ATL     NL  2012  700  600   94  ...  162  0.306  0.378  0.58  0.58  100
2     BAL     AL  2012  712  705   93  ...  162  0.315  0.403  0.57  0.57    7
3     BOS     AL  2012  734  806   69  ...  162  0.331  0.428  0.43  0.43  -72
4     CHC     NL  2012  613  759   61  ...  162  0.335  0.424  0.38  0.38 -146
...   ...    ...   ...  ...  ...  ...  ...  ...    ...    ...   ...   ...  ...
1227  PHI     NL  1962  705  759   81  ...  161    NaN    NaN  0.50  0.50  -54
1228  PIT     NL  1962  706  626   93  ...  161    NaN    NaN  0.58  0.58   80
1229  SFG     NL  1962  878  690  103  ...  165    NaN    NaN  0.62  0.62  188
1230  STL     NL  1962  774  664   84  ...  163    NaN    NaN  0.52  0.52  110
1231  WSA     AL  1962  599  716   60  ...  162    NaN    NaN  0.37  0.37 -117

[1232 rows x 18 columns]


### Optimal pandas iterating

In [65]:
wins_np = baseball['W'].values
print(type(wins_np))

<class 'numpy.ndarray'>


In [66]:
print(wins_np)

[ 81  94  93 ... 103  84  60]


### Power of vectorization

In [67]:
baseball['RS'].values - baseball['RA'].values

array([  46,  100,    7, ...,  188,  110, -117])

### Run differentials with arrays

In [71]:
run_diffs_np = baseball['RS'].values - baseball['RA'].values
baseball['RD'] = run_diffs_np
print(baseball)

     Team League  Year   RS   RA    W  ...    G   OOBP   OSLG    WP   WPN   RD
0     ARI     NL  2012  734  688   81  ...  162  0.317  0.415  0.50  0.50   46
1     ATL     NL  2012  700  600   94  ...  162  0.306  0.378  0.58  0.58  100
2     BAL     AL  2012  712  705   93  ...  162  0.315  0.403  0.57  0.57    7
3     BOS     AL  2012  734  806   69  ...  162  0.331  0.428  0.43  0.43  -72
4     CHC     NL  2012  613  759   61  ...  162  0.335  0.424  0.38  0.38 -146
...   ...    ...   ...  ...  ...  ...  ...  ...    ...    ...   ...   ...  ...
1227  PHI     NL  1962  705  759   81  ...  161    NaN    NaN  0.50  0.50  -54
1228  PIT     NL  1962  706  626   93  ...  161    NaN    NaN  0.58  0.58   80
1229  SFG     NL  1962  878  690  103  ...  165    NaN    NaN  0.62  0.62  188
1230  STL     NL  1962  774  664   84  ...  163    NaN    NaN  0.52  0.52  110
1231  WSA     AL  1962  599  716   60  ...  162    NaN    NaN  0.37  0.37 -117

[1232 rows x 18 columns]


### Comparing approaches

In [74]:
%%timeit
run_diffs_np = baseball['RS'].values - baseball['RA'].values
baseball['RD'] = run_diffs_np

84 µs ± 1.29 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

