# Writing Efficient Code with pandas

In [1]:
import time
import numpy as np
import pandas as pd

## Measuring time

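The `time.time()` function from the `time` module returns the current time in seconds since the epoch. Recording it before and after an operation and taking the difference gives the elapsed wall-clock time, which is how every comparison in this notebook is timed.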
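
A minimal sketch of this timing pattern is shown below. The helper name `time_call` is hypothetical and introduced only for illustration; it is not part of the original notebook.

```python
import time

def time_call(func, *args, **kwargs):
    """Run func once and return (result, elapsed seconds)."""
    start = time.time()              # wall-clock seconds since the epoch
    result = func(*args, **kwargs)
    elapsed = time.time() - start    # difference = elapsed time
    return result, elapsed

total, elapsed = time_call(sum, range(1_000_000))
print("Time using sum(): {} sec".format(elapsed))
```

A single measurement like this is noisy, so small differences between two runs should be read with care. `time.perf_counter()`, from the same module, is a higher-resolution clock intended for measuring short durations.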

## Row selection: `loc[]` vs `iloc[]`
A big part of working with DataFrames is to locate specific entries in the dataset. You can locate rows in two ways:

- By a specific value of a column (feature).
- By the index of the rows (index).

In this exercise, we will focus on the second way.
    
If you have previous experience with pandas, you should be familiar with the `.loc[]` and `.iloc[]` indexers, which stand for 'location' and 'integer location' respectively: `.loc[]` selects by index label, while `.iloc[]` selects by integer position. With the default index, the labels are the same as the positions of the rows in the DataFrame (e.g. the row with index 13 will be the 14th entry).

While we can use both indexers to perform the same task, we are interested in which is more efficient in terms of speed.
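
The two indexers only coincide when index labels match row positions. The sketch below uses a toy DataFrame (not the poker data) with a shuffled index to make the difference visible.

```python
import pandas as pd

# Toy data with a non-default index, so labels and positions differ
df = pd.DataFrame({"value": [10, 20, 30]}, index=[2, 0, 1])

by_label = df.loc[0, "value"]    # row whose index label is 0
by_position = df.iloc[0, 0]      # first row, first column, whatever the label

print(by_label, by_position)

# After resetting to the default index, label 0 and position 0 agree again
df_default = df.reset_index(drop=True)
same = df_default.loc[0, "value"] == df_default.iloc[0, 0]
```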

In [7]:
poker_hands = pd.read_csv('../data/24. Pandas Avanzado/poker.csv')
poker_hands.head()

Unnamed: 0,S1,R1,S2,R2,S3,R3,S4,R4,S5,R5,Class
0,1,10,1,11,1,13,1,12,1,1,9
1,2,11,2,13,2,10,2,12,2,1,9
2,3,12,3,11,3,13,3,10,3,1,9
3,4,10,4,11,4,1,4,13,4,12,9
4,4,1,4,13,4,12,4,11,4,10,9


<img src= '../images/poker.jpg'>

In [8]:
# Define the range of rows to select: row_nums
row_nums = range(0, 1000)

# Select the rows using .loc[] and row_nums and record the time before and after
loc_start_time = time.time()
rows = poker_hands.loc[row_nums]
loc_end_time = time.time()

# Print the time it took to select the rows using .loc[]
print("Time using .loc[]: {} sec".format(loc_end_time - loc_start_time))

Time using .loc[]: 0.006997823715209961 sec


In [9]:
# Select the rows using .iloc[] and row_nums and record the time before and after
iloc_start_time = time.time()
rows = poker_hands.iloc[row_nums]
iloc_end_time = time.time()

# Print the time it took to select the rows using .iloc
print("Time using .iloc[]: {} sec".format(iloc_end_time-iloc_start_time))

Time using .iloc[]: 0.0050008296966552734 sec


If you need to select specific rows of a DataFrame, which indexer is more efficient in terms of speed? In the run above, `.iloc[]` was slightly faster (about 0.005 sec versus 0.007 sec for `.loc[]`).

## Column selection: `.iloc[]` vs by name

Another important task is to find the faster function to select the targeted features (columns) of a DataFrame. In this exercise, we will compare the following:

- using the index locator `.iloc[]`
- using the names of the columns

While we can use both approaches to perform the same task, we are interested in which is more efficient in terms of speed.

In [12]:
# Use .iloc to select the first 6 columns and record the times before and after
iloc_start_time = time.time()
cols = poker_hands.iloc[:,0:6]
iloc_end_time = time.time()

# Print the time it took
print("Time using .iloc[] : {} sec".format(iloc_end_time - iloc_start_time))

Time using .iloc[] : 0.0009999275207519531 sec


In [13]:
# Use simple column selection to select the first 6 columns 
names_start_time = time.time()
cols = poker_hands[['S1', 'R1', 'S2', 'R2', 'S3', 'R3']]
names_end_time = time.time()

# Print the time it took
print("Time using selection by name : {} sec".format(names_end_time-names_start_time))

Time using selection by name : 0.004999876022338867 sec


## Random row selection

We will compare the two methods described for selecting random rows (entries) with replacement in a pandas DataFrame:

- The built-in pandas function `.sample()`
- The NumPy random integer number generator `np.random.randint()`

In statistics and machine learning, a common convention when training an algorithm is to train on 75% of the available data and then test the performance on the remaining 25%.

For this exercise, we will randomly sample 75% of all the played poker hands available, using each of the above methods, and check which method is more efficient in terms of speed.

In [14]:
# Extract number of rows in dataset
N=poker_hands.shape[0]

# Select and time the selection of the 75% of the dataset's rows
rand_start_time = time.time()
poker_hands.iloc[np.random.randint(low=0, high=N, size=int(0.75 * N))]
print("Time using Numpy: {} sec".format(time.time() - rand_start_time))

Time using Numpy: 0.011528730392456055 sec


In [15]:
# Select and time the selection of the 75% of the dataset's rows using sample()
samp_start_time = time.time()
poker_hands.sample(int(0.75 * N), axis=0, replace = True)
print("Time using .sample: {} sec".format(time.time() - samp_start_time))

Time using .sample: 0.006000518798828125 sec
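
The two sampling approaches timed above can be reproduced on a toy DataFrame as sketched below. The seeds and the use of `np.random.default_rng` (swapped in for `np.random.randint`) are assumptions made here for reproducibility, not part of the original cells.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": range(100)})   # toy stand-in for poker_hands
n_rows = df.shape[0]
n_sample = int(0.75 * n_rows)

# NumPy: draw random integer positions (with replacement), then index
rng = np.random.default_rng(seed=0)
positions = rng.integers(low=0, high=n_rows, size=n_sample)
sample_np = df.iloc[positions]

# pandas: let .sample() draw the rows directly
sample_pd = df.sample(n=n_sample, replace=True, random_state=0)

print(sample_np.shape, sample_pd.shape)
```

Both calls return 75 rows drawn with replacement, so the same row may appear more than once in either result.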


## Random column selection

We can use the same functions to randomly select columns in a pandas DataFrame.

In [16]:
# Extract number of columns in dataset
D=poker_hands.shape[1]

# Select and time the selection of 4 of the dataset's columns using NumPy
np_start_time = time.time()
poker_hands.iloc[:,np.random.randint(low=0, high=D, size=4)]
print("Time using NumPy's random.randint(): {} sec".format(time.time() - np_start_time))

Time using NumPy's random.randint(): 0.0029723644256591797 sec


In [17]:
# Select and time the selection of 4 of the dataset's columns using pandas
pd_start_time = time.time()
poker_hands.sample(4, axis=1)
print("Time using pandas' .sample(): {} sec".format(time.time() - pd_start_time))

Time using pandas' .sample(): 0.003998756408691406 sec


## Replacing scalar values

In this exercise, we will replace a single value in a column of our dataset with another value, first using the `.replace()` method and then using `.loc[]` with a boolean mask.

In [19]:
names = pd.read_csv('../data/24. Pandas Avanzado/baby_names.csv')
names.head()

Unnamed: 0,Year of Birth,Gender,Ethnicity,Child's First Name,Count,Rank
0,2011,FEMALE,ASIAN AND PACIFIC ISLANDER,SOPHIA,119,1
1,2011,FEMALE,ASIAN AND PACIFIC ISLANDER,CHLOE,106,2
2,2011,FEMALE,ASIAN AND PACIFIC ISLANDER,EMILY,93,3
3,2011,FEMALE,ASIAN AND PACIFIC ISLANDER,OLIVIA,89,4
4,2011,FEMALE,ASIAN AND PACIFIC ISLANDER,EMMA,75,5


In [20]:
start_time = time.time()

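# Replace all the entries that have 'FEMALE' as a gender with 'GIRL'
# (assign the result back; chained inplace=True on a selected column is deprecated in recent pandas)
names['Gender'] = names['Gender'].replace('FEMALE', 'GIRL')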

print("Time using .replace(): {} sec".format(time.time() - start_time))

Time using .replace(): 0.0050008296966552734 sec


In [22]:
start_time = time.time()

# Replace all the entries that have 'FEMALE' as a gender with 'GIRL'
names.loc[names['Gender']=='FEMALE', 'Gender'] = 'GIRL'

print("Time using .loc[]: {} sec".format(time.time() - start_time))

Time using .loc[]: 0.011001348495483398 sec


## Replace multiple values

You will apply the `.replace()` function for the task of replacing multiple values with one or more values. 

In [27]:
start_time = time.time()

# Replace all non-Hispanic ethnicities with 'NON HISPANIC'
# Index rows and column in a single .loc[] call to avoid chained assignment
names.loc[(names['Ethnicity'] == 'BLACK NON HISP') |
          (names['Ethnicity'] == 'BLACK NON HISPANIC') |
          (names['Ethnicity'] == 'WHITE NON HISP') |
          (names['Ethnicity'] == 'WHITE NON HISPANIC'), 'Ethnicity'] = 'NON HISPANIC'

print("Time using .loc[]: {} sec".format(time.time() - start_time))

Time using .loc[]: 0.04699873924255371 sec


In [28]:
start_time = time.time()

# Replace all non-Hispanic ethnicities with 'NON HISPANIC'
names['Ethnicity'] = names['Ethnicity'].replace(['BLACK NON HISP', 'BLACK NON HISPANIC', 'WHITE NON HISP', 'WHITE NON HISPANIC'], 'NON HISPANIC')

print("Time using .replace(): {} sec".format(time.time() - start_time))

Time using .replace(): 0.0060007572174072266 sec


Instead of using the `.replace()` function multiple times to replace multiple values, you can use lists to map the elements you want to replace one to one with those you want to replace them with.

In [29]:
start_time = time.time()

# Replace ethnicities as instructed
names['Ethnicity'] = names['Ethnicity'].replace(['ASIAN AND PACI', 'BLACK NON HISP', 'WHITE NON HISP'], ['ASIAN AND PACIFIC ISLANDER', 'BLACK NON HISPANIC', 'WHITE NON HISPANIC'])

print("Time using .replace(): {} sec".format(time.time() - start_time))

Time using .replace(): 0.0049991607666015625 sec


Dictionaries offer another way to specify replacements: each key is a value to replace and the corresponding entry is its replacement.

In [32]:
# Replace 'FEMALE' with 'GIRL' across the whole DataFrame using a dictionary
start_time = time.time()

names.replace({'FEMALE':'GIRL'}, inplace=True)

print("Time using .replace() and dicts: {} sec".format(time.time() - start_time))

Time using .replace() and dicts: 0.01199960708618164 sec


You can also use a dictionary to replace several values in one column, each with its own replacement.

In [33]:
# Replace the number rank by a string
start_time = time.time()

names['Rank'] = names['Rank'].replace({1:'FIRST', 2:'SECOND', 3:'THIRD'})

print("Time using .replace() and dicts: {} sec".format(time.time() - start_time))

Time using .replace() and dicts: 0.006005287170410156 sec


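With a nested dictionary, you can restrict the replacement to a specific column: the outer key names the column and the inner dictionary maps old values to new ones. Note that ranks 1 to 3 were already converted to strings in the previous cell, so only the second call below changes anything.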

In [34]:
# Replace the rank of the first three ranked names to 'MEDAL'
names.replace({'Rank': {1:'MEDAL', 2:'MEDAL', 3:'MEDAL'}}, inplace=True)

# Replace the rank of the 4th and 5th ranked names to 'ALMOST MEDAL'
names.replace({'Rank': {4:'ALMOST MEDAL', 5:'ALMOST MEDAL'}}, inplace=True)

In [35]:
names.head()

Unnamed: 0,Year of Birth,Gender,Ethnicity,Child's First Name,Count,Rank
0,2011,GIRL,ASIAN AND PACIFIC ISLANDER,SOPHIA,119,FIRST
1,2011,GIRL,ASIAN AND PACIFIC ISLANDER,CHLOE,106,SECOND
2,2011,GIRL,ASIAN AND PACIFIC ISLANDER,EMILY,93,THIRD
3,2011,GIRL,ASIAN AND PACIFIC ISLANDER,OLIVIA,89,ALMOST MEDAL
4,2011,GIRL,ASIAN AND PACIFIC ISLANDER,EMMA,75,ALMOST MEDAL
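
The forms of `.replace()` used in this section can be compared side by side on a toy DataFrame. The data below is made up for illustration; each call returns a new object and leaves `df` unchanged.

```python
import pandas as pd

df = pd.DataFrame({
    "Gender": ["FEMALE", "MALE", "FEMALE"],
    "Rank": [1, 2, 4],
})

# Scalar: one value for another, in one column
gender = df["Gender"].replace("FEMALE", "GIRL")

# Lists: values are matched one to one by position
rank_list = df["Rank"].replace([1, 2], ["FIRST", "SECOND"])

# Dictionary: {old: new}
rank_dict = df["Rank"].replace({1: "FIRST", 2: "SECOND"})

# Nested dictionary: {column: {old: new}} limits the change to that column
medal = df.replace({"Rank": {1: "MEDAL", 2: "MEDAL"}})

print(medal)
```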


# Create a generator for a pandas DataFrame

You can easily create a generator out of a pandas DataFrame. Each time you iterate through it, it will yield two elements:

- the index of the respective row
- a pandas Series with all the elements of that row

In [36]:
# Create a generator over the rows
generator = poker_hands.iterrows()

# Access the elements of the first row
first_element = next(generator)
first_element

(0, S1        1
 R1       10
 S2        1
 R2       11
 S3        1
 R3       13
 S4        1
 R4       12
 S5        1
 R5        1
 Class     9
 Name: 0, dtype: int64)

In [38]:
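# Advance the generator again. The output below shows index 2, which suggests
# next() had already been called once more before this cell was executed.
second_element = next(generator)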
print(second_element)

(2, S1        3
R1       12
S2        3
R2       11
S3        3
R3       13
S4        3
R4       10
S5        3
R5        1
Class     9
Name: 2, dtype: int64)


## The `.iterrows()` function for looping

You just saw how to create a generator out of a pandas DataFrame. You will now use this generator and see how to take advantage of that method of looping through a pandas DataFrame.

In [47]:
data_generator = poker_hands.iterrows()

for index, values in data_generator:
    # Check if index is odd
    if index % 2 != 0:
        # Sum the ranks of all the cards (selected by label rather than by position)
        hand_sum = values[['R1', 'R2', 'R3', 'R4', 'R5']].sum()
        if hand_sum > 57:
            print(hand_sum)
            break

58
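
For comparison, the same search can be written without an explicit loop. The sketch below uses a small made-up DataFrame in place of `poker_hands` and checks that the loop and the mask-based version agree.

```python
import pandas as pd

# Toy stand-in for poker_hands: ranks only, default integer index
hands = pd.DataFrame({
    "R1": [10, 13, 12, 13],
    "R2": [11, 12, 11, 12],
    "R3": [13, 13, 13, 11],
    "R4": [12, 11, 10, 12],
    "R5": [1, 12, 1, 13],
})

# Loop version: stop at the first odd-indexed hand whose ranks sum above 57
loop_result = None
for index, row in hands.iterrows():
    if index % 2 != 0 and row.sum() > 57:
        loop_result = row.sum()
        break

# Mask version: compute all sums at once, then filter
sums = hands.sum(axis=1)
mask = (sums > 57) & (hands.index % 2 != 0)
vector_result = sums[mask].iloc[0] if mask.any() else None

print(loop_result, vector_result)
```

The loop can stop early, while the mask version always processes every row; which one is faster depends on how soon a match appears.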


## `.apply()` function in every cell

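You can use `.apply()` to transform every cell of the DataFrame. `DataFrame.apply()` passes each column (or each row, with `axis=1`) to the function as a Series; because `x**2` operates element-wise on a Series, every cell ends up squared.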

In [48]:
# Define the lambda transformation
get_square = lambda x: x**2

# Apply the transformation
data_sum = poker_hands.apply(get_square)
data_sum.head()

Unnamed: 0,S1,R1,S2,R2,S3,R3,S4,R4,S5,R5,Class
0,1,100,1,121,1,169,1,144,1,1,81
1,4,121,4,169,4,100,4,144,4,1,81
2,9,144,9,121,9,169,9,100,9,1,81
3,16,100,16,121,16,1,16,169,16,144,81
4,16,1,16,169,16,144,16,121,16,100,81
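
To see what `.apply()` actually hands to the function, the sketch below records the type of each argument on a toy DataFrame. The names `square` and `seen_types` are hypothetical and used only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# DataFrame.apply calls the function per column by default (axis=0);
# record the type of each argument to make that visible
seen_types = []

def square(column):
    seen_types.append(type(column).__name__)
    return column ** 2

squared = df.apply(square)

# The same result without .apply(): arithmetic on a DataFrame is element-wise
print(squared.equals(df ** 2))
print(seen_types)
```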


### `.apply()` for rows iteration
`.apply()` is very useful for iterating through the rows of a DataFrame and applying a specific function.

In [60]:
# Define the lambda transformation
get_mean = lambda x: np.mean(x)

row_start_time = time.time()

# Apply the transformation row-based
data_tr = poker_hands[['R1', 'R2', 'R3', 'R4', 'R5']].apply(get_mean, axis=1)
print("Time using pandas apply for rows: {} sec".format(time.time() - row_start_time))

Time using pandas apply for rows: 9.046329975128174 sec


In [61]:
col_start_time = time.time()

# Apply the transformation column-based
data_tr = poker_hands[['R1', 'R2', 'R3', 'R4', 'R5']].apply(get_mean, axis=0)
print("Time using pandas apply for columns: {} sec".format(time.time() - col_start_time))

Time using pandas apply for columns: 0.010001420974731445 sec


# Vectorization in pandas

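The row-wise `.apply()` above took about 9 seconds. The same row means can be computed in a few milliseconds with pandas' built-in vectorized methods such as `.mean()`, which operate on whole rows or columns at once.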


In [56]:
# Calculate the mean rank in each hand
row_start_time = time.time()

mean_r = poker_hands[['R1', 'R2', 'R3', 'R4', 'R5']].mean(axis=1)

print("Time using pandas vectorization for rows: {} sec".format(time.time() - row_start_time))

Time using pandas vectorization for rows: 0.00699925422668457 sec


In [57]:
# Calculate the mean rank of each of the 5 cards in all hands
col_start_time = time.time()

mean_c = poker_hands[['R1', 'R2', 'R3', 'R4', 'R5']].mean(axis=0)

print("Time using pandas vectorization for columns: {} sec".format(time.time() - col_start_time))

Time using pandas vectorization for columns: 0.005003690719604492 sec


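### Why is vectorization in pandas so fast?
Vectorized operations run in optimized, compiled code over whole arrays at once, so far fewer Python-level operations are required than when a function is called once per row.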
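
The difference can be observed on synthetic data. The sketch below generates random ranks (not real poker hands; the size and seed are arbitrary assumptions) and times a row-wise `.apply()` against the vectorized `.mean()`.

```python
import time
import numpy as np
import pandas as pd

# Synthetic stand-in for the five rank columns (values are random, not real hands)
rng = np.random.default_rng(seed=0)
ranks = pd.DataFrame(rng.integers(1, 14, size=(20_000, 5)),
                     columns=["R1", "R2", "R3", "R4", "R5"])

start = time.time()
loop_means = ranks.apply(lambda row: row.mean(), axis=1)  # one Python call per row
loop_time = time.time() - start

start = time.time()
vector_means = ranks.mean(axis=1)                         # one call for the whole block
vector_time = time.time() - start

print("Row-wise apply: {} sec".format(loop_time))
print("Vectorized:     {} sec".format(vector_time))
print("Same result:", np.allclose(loop_means, vector_means))
```

Exact timings depend on the machine and the pandas version, so no specific numbers are claimed here.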

### Best method of vectorization
So far, you have encountered two vectorization methods:

- Vectorization over pandas Series
- Vectorization over NumPy ndarrays

## Vectorization methods for looping a DataFrame

Now that you're familiar with vectorization in pandas and NumPy, you're going to compare their respective performances yourself.

In [67]:
# Calculate the variance in each hand
start_time = time.time()

poker_var = poker_hands[['R1', 'R2', 'R3', 'R4', 'R5']].var(axis=1)

print("Time using pandas vectorization: {} sec".format(time.time() - start_time))

Time using pandas vectorization: 0.007000446319580078 sec


In [68]:
# Calculate the variance in each hand
start_time = time.time()

poker_var = poker_hands[['R1', 'R2', 'R3', 'R4', 'R5']].values.var(axis=1, ddof=1)

print("Time using NumPy vectorization: {} sec".format(time.time() - start_time))

Time using NumPy vectorization: 0.008000612258911133 sec
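
The NumPy call above passes `ddof=1` because the two libraries use different defaults: pandas computes the sample variance (`ddof=1`), while NumPy computes the population variance (`ddof=0`). A small sketch on made-up values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"R1": [1, 10], "R2": [2, 11], "R3": [3, 12]})

pandas_var = df.var(axis=1)                    # pandas default: ddof=1
numpy_default = df.values.var(axis=1)          # NumPy default: ddof=0
numpy_matched = df.values.var(axis=1, ddof=1)  # matches pandas once ddof is set

print(np.allclose(pandas_var, numpy_matched))
print(np.allclose(pandas_var, numpy_default))
```

Without matching `ddof`, the two timings would be comparing computations that return different numbers.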


# Data transformation using `.groupby().transform`

### The min-max normalization using `.transform()`

A very common operation is the **min-max normalization**. It consists of rescaling the values of interest by subtracting the minimum value and dividing the result by the difference between the maximum and the minimum value.

For example, to rescale students' weight data spanning from 160 pounds to 200 pounds, you subtract 160 from each student's weight and divide the result by 40 (200 - 160).
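
The weight example can be worked through in a few lines. Only the 160 and 200 endpoints come from the text; the intermediate weights are illustrative assumptions.

```python
import pandas as pd

# Illustrative weights in pounds; only the endpoints come from the text
weights = pd.Series([160, 170, 180, 200])

scaled = (weights - weights.min()) / (weights.max() - weights.min())
print(scaled.tolist())  # [0.0, 0.25, 0.5, 1.0]
```

The minimum always maps to 0 and the maximum to 1, with every other value falling in between.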

You're going to define and apply the **min-max normalization** to all the numerical variables in the restaurant data.

In [69]:
restaurant_data = pd.read_csv('../data/24. Pandas Avanzado/restaurant_data.csv')
restaurant_data.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [74]:
# Define the min-max transformation
min_max_tr = lambda x: (x - x.min()) / (x.max() - x.min())

# Group the data by time and apply the transformation to the numeric columns
restaurant_min_max_group = restaurant_data.groupby('time')[['total_bill', 'tip', 'size']].transform(min_max_tr)

restaurant_min_max_group.head()

Unnamed: 0,total_bill,tip,size
0,0.291579,0.001111,0.2
1,0.152283,0.073333,0.4
2,0.375786,0.277778,0.4
3,0.431713,0.256667,0.2
4,0.450775,0.29,0.6


### Validation of normalization

For this exercise, we will perform a z-score normalization and verify that it was performed correctly.

A distinct characteristic of z-score normalized values is that they have a mean equal to zero and a standard deviation equal to one.

After you apply the normalization transformation, you can group again on the same variable, and then check the mean and the standard deviation of each group. The check below instead pools all groups together, which likely explains why the reported standard deviation is slightly below one (0.99982) rather than exactly one.

In [77]:
zscore = lambda x: (x - x.mean()) / x.std()

# Apply the transformation
poker_trans = poker_hands.groupby('Class').transform(zscore)

poker_trans.head()

Unnamed: 0,S1,R1,S2,R2,S3,R3,S4,R4,S5,R5
0,-1.380537,0.270364,-1.380537,-0.730297,-1.380537,0.631224,-1.380537,0.350823,-1.380537,-0.724286
1,-0.613572,0.495666,-0.613572,1.095445,-0.613572,0.039451,-0.613572,0.350823,-0.613572,-0.724286
2,0.153393,0.720969,0.153393,-0.730297,0.153393,0.631224,0.153393,-1.403293,0.153393,-0.724286
3,0.920358,0.270364,0.920358,-0.730297,0.920358,-1.735866,0.920358,1.227881,0.920358,1.2675
4,0.920358,-1.757363,0.920358,1.095445,0.920358,0.433966,0.920358,-0.526235,0.920358,0.905357


In [80]:
print(np.round(poker_trans.mean(), 3))
print('\n')
print(poker_trans.std())

S1    0.0
R1   -0.0
S2   -0.0
R2   -0.0
S3   -0.0
R3    0.0
S4   -0.0
R4   -0.0
S5   -0.0
R5   -0.0
dtype: float64


S1    0.99982
R1    0.99982
S2    0.99982
R2    0.99982
S3    0.99982
R3    0.99982
S4    0.99982
R4    0.99982
S5    0.99982
R5    0.99982
dtype: float64
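
A per-group check, as opposed to the pooled check above, can be sketched on toy data as follows. The column names `group`, `value` and `z` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data: two groups with different centers and spreads
df = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b"],
    "value": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

zscore = lambda x: (x - x.mean()) / x.std()
df["z"] = df.groupby("group")["value"].transform(zscore)

# Group again on the same variable and inspect each group's mean and std
check = df.groupby("group")["z"].agg(["mean", "std"])
print(check.round(3))

# Pooling all groups gives a standard deviation below one
overall_std = df["z"].std()
print(round(overall_std, 3))
```

Within each group the mean is zero and the standard deviation is one, while the pooled standard deviation is smaller because each group was standardized separately.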


# Data filtration using `.filter()`

You may need to filter your data for various reasons.

In this exercise, you will use filtering to select a specific part of our DataFrame:

- by the number of entries recorded on each day of the week
- by the mean amount of money the customers paid to the restaurant each day of the week

In [87]:
restaurant_data.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [112]:
# Filter the days where the count of total_bill is greater than 70
total_bill_70 = restaurant_data.groupby('day').filter(lambda x: x['total_bill'].count() > 70)

print('Number of tables where total_bill count by day is greater than 70:', total_bill_70.shape[0])

Number of tables where total_bill count by day is greater than 70: 163


In [113]:
restaurant_data.groupby('day')['total_bill'].count()

day
Fri     19
Sat     87
Sun     76
Thur    62
Name: total_bill, dtype: int64

In [114]:
total_bill_70.day.unique()

array(['Sun', 'Sat'], dtype=object)
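
Both kinds of filter, by group size and by group mean, can be sketched on a toy DataFrame. The values and thresholds below are made up for illustration and are not taken from the restaurant data.

```python
import pandas as pd

# Toy stand-in for restaurant_data (values are made up for illustration)
df = pd.DataFrame({
    "day": ["Fri", "Sat", "Sat", "Sat", "Sun", "Sun"],
    "total_bill": [25.0, 20.0, 30.0, 40.0, 5.0, 15.0],
})

# Keep days with more than 2 recorded bills
by_count = df.groupby("day").filter(lambda g: g["total_bill"].count() > 2)

# Keep days whose mean bill exceeds 12 (threshold chosen arbitrarily)
by_mean = df.groupby("day").filter(lambda g: g["total_bill"].mean() > 12)

print(by_count["day"].unique(), by_mean["day"].unique())
```

`.filter()` keeps or drops whole groups: the function returns a single boolean per group, and every row of a kept group is returned unchanged.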

In [115]:
total_bill_70

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.50,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4
...,...,...,...,...,...,...,...
238,35.83,4.67,Female,No,Sat,Dinner,3
239,29.03,5.92,Male,No,Sat,Dinner,3
240,27.18,2.00,Female,Yes,Sat,Dinner,2
241,22.67,2.00,Male,Yes,Sat,Dinner,2
