# Analytics in Python

We will compare the performance of data operations when they run as a series of sequential steps and when they run in parallel.

0. We need more cores to compare serial and parallel processing effectively. Click on the instance type `2vCpu + 8 GiB` above the notebook. This opens a popup for selecting your instance type. Uncheck `Fast launch only`, then choose the `ml.m5.xlarge` option. Note that it will take a few minutes for the new instance type to connect.

Also, you will need to install a pinned version of `s3fs`, which fixes pulling data into Pandas from S3.

In [None]:
!pip install s3fs==2021.11.1 --force-reinstall

1. Read the `StateNames.csv` file into a Pandas dataframe named `df_pd`, based on the code from `Accessing S3.ipynb`, and print the head of the dataset using the `.head()` method.

In [1]:
import os
import pandas as pd

In [26]:
# Simulate a larger dataset by concatenating 8 copies of the original dataframe.
df_large = pd.concat([df_pd] * 8).reset_index(drop=True)

We are increasing the size of the dataframe because parallelizing work on a dataset has a cost of its own. Smaller datasets are sometimes transformed more efficiently using serial processing.
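To see what the concatenation above does, here is a minimal sketch on a toy dataframe. The frame `df_small` and its column names are assumptions for illustration, not the lab's data.

```python
import pandas as pd

# Toy stand-in for df_pd; the column names are assumed for illustration.
df_small = pd.DataFrame({"Name": ["Emma", "Liam", "Olivia"], "Count": [20, 15, 12]})

# Repeat the frame 8 times, then rebuild the index so row labels are unique.
df_repeated = pd.concat([df_small] * 8).reset_index(drop=True)

print(df_repeated.shape)            # (24, 2)
print(df_repeated.index.is_unique)  # True
```

Without `reset_index(drop=True)`, each copy would keep its original row labels, so the labels 0 to 2 would repeat 8 times.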

2. Install Swifter, a package designed to speed up Pandas operations when a single function is "applied" to every row of the dataset. Swifter decides whether to use parallel processing, vectorization, or regular step-by-step "looping".

In [5]:
!pip install swifter

Collecting swifter
  Using cached swifter-1.0.9-py3-none-any.whl (14 kB)
Installing collected packages: swifter
Successfully installed swifter-1.0.9


In [5]:
import swifter

3. Run speed tests using a list comprehension, pandas apply, and swifter apply on the following data operations:
    - squaring the value of the Count column
    - counting the vowels of the Name column
    - a logic function that checks whether a row meets ALL of the following (LAB TO DO):
        - names with at least 5 letters or 2 vowels
        - count of at least 30 when year is after 1990 or count of at least 10 when year is after 1960, any count otherwise
        - not from a state ending in the letter 'A'
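The cells below use the IPython `%time` magic, which only works inside a notebook or IPython session. As a rough sketch of the same idea in plain Python, wall time can be measured with `time.perf_counter`. The helper name `time_call` is hypothetical, and unlike `%time` it does not report CPU time.

```python
import time

def time_call(func, *args, **kwargs):
    """Run func once and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Time a list comprehension that squares 1,000 values.
squares, seconds = time_call(lambda values: [v ** 2 for v in values], range(1_000))
print(f"{seconds:.6f} s for {len(squares)} values")
```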


#### Squaring a value

In [6]:
# list comprehension
%time test = [x**2 for x in df_large['Count']]

CPU times: user 16 s, sys: 339 ms, total: 16.3 s
Wall time: 16.3 s


In [7]:
# pandas apply
%time test = df_large['Count'].apply(lambda x: x**2)

CPU times: user 23.9 s, sys: 1.3 s, total: 25.2 s
Wall time: 25.2 s


In [8]:
# swifter pandas apply
%time test = df_large['Count'].swifter.apply(lambda x: x**2)

CPU times: user 42 ms, sys: 87.2 ms, total: 129 ms
Wall time: 129 ms


In this case, swifter has not actually done parallel processing. Instead, it detected that the function can be "vectorized", which plain pandas `apply` does not attempt. Vectorization means the computation is applied to the whole column as a batch rather than by calling the Python function on each row independently.
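As a small illustration of the difference, the sketch below squares a toy column both ways. The series `counts` is made up for the example; on the real data the vectorized form would be `df_large['Count'] ** 2`.

```python
import numpy as np
import pandas as pd

# Toy column standing in for df_large['Count'].
counts = pd.Series(np.arange(1, 6, dtype="int64"), name="Count")

# Row by row: the lambda is called once per element in Python.
row_by_row = counts.apply(lambda x: x ** 2)

# Vectorized: one operation on the whole column, executed in compiled code.
vectorized = counts ** 2

print(row_by_row.tolist())  # [1, 4, 9, 16, 25]
print(vectorized.tolist())  # [1, 4, 9, 16, 25]
```

Both forms give the same values; the vectorized form avoids the per-row Python function call.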

#### Counting vowels

In [9]:
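def count_vowels(x):
    """ Function that returns number of vowels (a, e, i, o, u, y) in a string, ignoring case """
    count = 0
    # lowercase first so a leading capital vowel (e.g. the "E" in "Emma") is counted
    for i in x.lower():
        if i in 'aeiouy':
            count += 1
    return count

# add the vowel count as a new column on the original dataframe
df_pd['vowel_count'] = df_pd['Name'].apply(count_vowels)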

In [10]:
%time test = [count_vowels(x) for x in df_large['Name']]

CPU times: user 19.1 s, sys: 149 ms, total: 19.3 s
Wall time: 19.3 s


In [11]:
%time test = df_large['Name'].apply(count_vowels)

CPU times: user 24.3 s, sys: 717 ms, total: 25 s
Wall time: 25 s


It turns out swifter has some issues effectively parallelizing the count vowels task. Instead, let's use the standard multiprocessing library to accomplish the same result.

#### Multiprocessing to "map" the count_vowels function

In [24]:
import multiprocessing as mp
# open a pool with one worker per CPU core on the machine
with mp.Pool(mp.cpu_count()) as pool:
    # send the data into the pool mapping operation and run the count_vowels function on each row
    %time df_large['vowel_count'] = pool.map(count_vowels, df_large['Name'])

CPU times: user 19.2 s, sys: 925 ms, total: 20.1 s
Wall time: 21.5 s


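In this run, multiprocessing finished in 21.5 s of wall time, which is faster than pandas apply (25 s) but slower than the list comprehension (19.3 s). The gain is modest, likely because sending each name to a worker process and collecting the results adds overhead of its own.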

Confirm that the parallel processing is using all cores: click on the terminal symbol at the top of the notebook, install htop using the command `apt-get install htop -y`, then open it with the command `htop` and watch the per-core usage while the multiprocessing cell runs.
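As a quick complement to htop, the following sketch prints how many cores Python sees, which is the number of workers `mp.Pool(mp.cpu_count())` would start. The variable names are illustrative, and `os.sched_getaffinity` is only available on some platforms (Linux, for example), so the code falls back to `mp.cpu_count()` elsewhere.

```python
import multiprocessing as mp
import os

# Number of logical cores reported by the machine.
logical_cores = mp.cpu_count()

# Cores this process is actually allowed to run on, where the platform exposes it.
if hasattr(os, "sched_getaffinity"):
    usable_cores = len(os.sched_getaffinity(0))
else:
    usable_cores = logical_cores

print(f"logical cores: {logical_cores}, usable by this process: {usable_cores}")
```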

#### Custom filtering function - **LAB TO DO**

- Write a function that checks whether a row meets ALL of the following conditions:
    - names with at least 5 letters or 2 vowels
    - count of at least 30 when year is after 1990 or count of at least 10 when year is after 1960, any count otherwise
    - not from a state ending in the letter 'A'
    
1. Write a function that takes a row of data as input, runs the logic above, and outputs True or False based on that logic.
2. Speed test the function using pandas apply on the first 100,000 rows of the dataframe.
3. Use the supplied code to speed test using multiprocessing map on both the regular `df_pd` object and the `df_large` object.
4. Write a few sentences summarizing your results.
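The sketch below shows the shape of a row-wise function that works with both pandas apply and the `to_dict('records')` form used in STEP 3. The check itself, `is_common_recent_name`, is a hypothetical placeholder and deliberately not the lab's logic; the toy dataframe and its column names (`Name`, `Year`, `State`, `Count`) are assumptions about `StateNames.csv`.

```python
import pandas as pd

# Toy rows; column names are assumed to match StateNames.csv.
df_toy = pd.DataFrame({
    "Name": ["Emma", "Al", "Olivia"],
    "Year": [1995, 1950, 2001],
    "State": ["CA", "NY", "TX"],
    "Count": [40, 7, 12],
})

def is_common_recent_name(row):
    """Hypothetical check (not the lab's logic): count >= 10 and year after 1990."""
    return row["Count"] >= 10 and row["Year"] > 1990

# to_dict('records') turns each row into a plain dict keyed by column name.
records = df_toy.to_dict("records")
print(records[0])

# The same function works on a row Series (apply) and on a row dict (map).
by_apply = df_toy.apply(is_common_recent_name, axis=1).tolist()
by_map = [is_common_recent_name(r) for r in records]
print(by_apply == by_map)
```

Indexing with `row["Column"]` is what lets one function serve both cases, since a row Series and a row dict both support it.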

In [None]:
# STEP 1

In [None]:
# STEP 2

In [None]:
# STEP 3
with mp.Pool(mp.cpu_count()) as pool:
    %time test = pool.map(custom_check, df_pd.to_dict('records'))

In [None]:
# STEP 4