In below,

Parallel processing

1. multiprocessing

In [None]:
from multiprocessing import Pool, cpu_count

import pandas as pd

def main_multiprocessing_func(func, list_of_things_to_be_processed, num_processes=None):
    
    # If num_processes is not specified, default to min(#things_to_be_processed, #machine-cores),
    # with a floor of 1 because Pool(0) raises ValueError
    if num_processes is None:
        num_processes = max(1, min(len(list_of_things_to_be_processed), cpu_count()))
    
    # On exit, the 'with' block calls pool.terminate(); that is safe here because
    # pool.map blocks until every result is ready
    with Pool(num_processes) as pool:
        
        # pool.map applies func to each item and returns the results as a list, in input order
        results_list = pool.map(func, list_of_things_to_be_processed)
        
    # concatenate the processed pieces into one pandas object (pd.concat stacks along axis=0 by default)
    return pd.concat(results_list)

2. joblib

    See examples here: https://joblib.readthedocs.io/en/latest/generated/joblib.Parallel.html

In [None]:
from math import sqrt
from joblib import Parallel, delayed
Parallel(n_jobs=1)(delayed(sqrt)(i**2) for i in range(10))

3. Dask

    For manipulating large datasets, when those datasets don’t fit in memory
    
    For accelerating long computations by using many cores
    
    See examples here: https://docs.dask.org/en/latest/index.html and https://docs.dask.org/en/latest/dataframe.html


<br>
<br>

In below,

If a dataframe's column names are `["s_apple", "s_banana", "s_orange"]` and you want to rename them to `["apple", "banana", "orange"]`, strip the two-character prefix from each name:

In [None]:
df.rename(columns=lambda x: x[2:], inplace=True)
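Slicing with `x[2:]` drops the first two characters of every column name, whether or not the name actually starts with `s_`. A minimal sketch of a safer helper follows; `strip_prefix` and the sample names are hypothetical and only illustrate the idea.

```python
def strip_prefix(name, prefix="s_"):
    """Return name without prefix if it starts with prefix, otherwise unchanged."""
    if name.startswith(prefix):
        return name[len(prefix):]
    return name

# Hypothetical column names; "id" has no prefix and must survive untouched
old_names = ["s_apple", "s_banana", "s_orange", "id"]
new_names = [strip_prefix(n) for n in old_names]
print(new_names)  # ['apple', 'banana', 'orange', 'id']
```

Because `DataFrame.rename` accepts a function for `columns`, the helper can be passed directly: `df.rename(columns=strip_prefix)`.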

<br>
<br>

In below,

SQL sampling

In [None]:
...order by random() limit 10000
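To try the `order by random() limit N` pattern without a database server, here is a small sketch that uses Python's built-in `sqlite3` module. The `customers` table and its contents are made up for the demonstration, and other SQL dialects may spell the random function differently.

```python
import sqlite3

# In-memory database with a hypothetical table of 1000 customers
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY)")
conn.executemany(
    "INSERT INTO customers (customer_id) VALUES (?)",
    [(i,) for i in range(1000)],
)

# Shuffle all rows with random(), then keep the first 10
rows = conn.execute(
    "SELECT customer_id FROM customers ORDER BY random() LIMIT 10"
).fetchall()
sample_ids = [row[0] for row in rows]
print(len(sample_ids))  # 10
conn.close()
```

Each run returns a different sample, and the database has to sort the whole table to produce it.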

In the SQL below,

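1. `md5` computes the MD5 hash of its argument, so the same seed and `customer_id` always give the same pseudo-random value
2. `||` is the string concatenation operator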

In [None]:
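seed = 42
# Python f-string: {seed} is substituted into the SQL expression before the query is run
sql_expr = f"md5({seed} || customer_id) as random_customer_id"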
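The same idea can be mirrored in plain Python to see why a seeded hash gives a reproducible sample. This is only a sketch: `hashed_id`, `reproducible_sample`, and the ID range are hypothetical, and it assumes the seed and ID are concatenated as text, as `||` does in the SQL.

```python
import hashlib

def hashed_id(seed, customer_id):
    """MD5 hex digest of the seed concatenated with the customer ID."""
    text = f"{seed}{customer_id}"
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def reproducible_sample(customer_ids, seed, n):
    """Order IDs by their seeded hash and keep the first n."""
    return sorted(customer_ids, key=lambda cid: hashed_id(seed, cid))[:n]

customer_ids = list(range(1, 101))  # made-up IDs
sample_a = reproducible_sample(customer_ids, seed=42, n=5)
sample_b = reproducible_sample(customer_ids, seed=42, n=5)
print(sample_a == sample_b)  # True
```

Unlike `random()`, the ordering depends only on the seed and the IDs, so rerunning the query selects the same customers; changing the seed reshuffles them.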

Some useful SQL constructs:

`COALESCE` returns the first non-null value in a list.

`NULLIF` returns NULL if two expressions are equal, otherwise it returns the first expression.

`group by 1, 2, 3` groups by the first, second, and third columns of the `SELECT` list (referenced by position instead of by name).
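These three constructs can be tried quickly with Python's built-in `sqlite3` module. The sketch below assumes SQLite's dialect, and the `orders` table and its rows are invented for illustration. `NULLIF(units, 0)` is used here as a guard against division by zero.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (fruit TEXT, nickname TEXT, amount INTEGER, units INTEGER)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [("apple", None, 10, 2), ("apple", "red one", 20, 0), ("banana", None, 30, 3)],
)

# COALESCE falls back to fruit when nickname is NULL;
# NULLIF turns units = 0 into NULL, so the division yields NULL instead of failing
rows = conn.execute(
    """
    SELECT fruit,
           COALESCE(nickname, fruit) AS label,
           amount / NULLIF(units, 0) AS amount_per_unit
    FROM orders
    ORDER BY amount
    """
).fetchall()
print(rows)  # [('apple', 'apple', 5), ('apple', 'red one', None), ('banana', 'banana', 10)]

# GROUP BY 1 groups by the first output column (fruit)
counts = conn.execute(
    "SELECT fruit, COUNT(*) FROM orders GROUP BY 1 ORDER BY 1"
).fetchall()
print(counts)  # [('apple', 2), ('banana', 1)]
conn.close()
```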

<br>
<br>

In below,

Turn the values in each row into ranks (0 = smallest)

In [10]:
import numpy as np
import pandas as pd

col_names = ["apple", "banana", "pear", "orange"]
data = pd.DataFrame(np.random.rand(6, 4), columns=col_names)
data

      apple    banana      pear    orange
0  0.301489   0.98308  0.437496  0.911782
1  0.663863  0.278423  0.194561  0.809135
2  0.003051  0.654187   0.01232  0.080039
3  0.962843  0.251235  0.021869  0.892796
4   0.01476   0.08786  0.113315  0.207766
5  0.585164   0.62074  0.466001  0.954973


In [12]:
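rank = np.empty((len(data), len(col_names)))
# o[i] lists the column indices of row i, ordered from smallest to largest value
o = data[col_names].values.argsort(1)
# scatter: in row i, the column at sorted position k receives rank k
rank[np.arange(len(data))[:, None], o] = np.arange(len(col_names))[None]
rank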

array([[0., 3., 1., 2.],
       [2., 1., 0., 3.],
       [0., 3., 1., 2.],
       [3., 1., 0., 2.],
       [0., 1., 2., 3.],
       [1., 2., 0., 3.]])

<br>
<br>

In below,

Use a double `argsort()` to convert numbers into ranks

In [1]:
import numpy as np
x = np.array([0.3, 0.1, 0.2])

# Default is axis=-1 (the last axis)
x.argsort(axis=-1)

array([1, 2, 0])

In [2]:
x.argsort(axis=-1).argsort(axis=-1)

array([2, 0, 1])

Applying `argsort` once more inverts the ranks back into the original sort order

In [3]:
x.argsort(axis=-1).argsort(axis=-1).argsort(axis=-1)

array([1, 2, 0])
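Why this works: `argsort` returns a permutation, and the `argsort` of a permutation is its inverse. The first call answers "which index holds the k-th smallest value?", and the second answers "what position does each index occupy in that order?", which is the rank. A plain-Python sketch of the same logic follows; the helper names are hypothetical and no NumPy is required.

```python
def argsort(values):
    """Indices that would sort values in ascending order."""
    return sorted(range(len(values)), key=lambda i: values[i])

def ranks_by_scatter(values):
    """Rank each element by scattering positions back through the sort order."""
    order = argsort(values)
    ranks = [0] * len(values)
    for rank, index in enumerate(order):
        ranks[index] = rank  # the element at `index` is the rank-th smallest
    return ranks

x = [0.3, 0.1, 0.2]
print(argsort(x))                    # [1, 2, 0]
print(argsort(argsort(x)))           # [2, 0, 1]
print(ranks_by_scatter(x))           # [2, 0, 1]
print(argsort(argsort(argsort(x))))  # [1, 2, 0]
```

The scatter version performs a single sort, whereas the double `argsort` sorts twice. Tied values receive distinct ranks in both versions.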

In [4]:
x = np.array([
    [
        [0.3, 0.1, 0.2],
        [0.2, 0.7, 0.3]
    ],
    [
        [0.5, 0.7, 0.2],
        [0.2, 0.8, 0.3]
    ],
    [
        [0.6, 0.5, 0.5],
        [0.8, 0.4, 0.6]
    ],
    [
        [0.4, 0.5, 0.2],
        [0.1, 0.2, 0.6]
    ]
])

x.argsort(axis=-1).argsort(axis=-1)

array([[[2, 0, 1],
        [0, 2, 1]],

       [[1, 2, 0],
        [0, 2, 1]],

       [[2, 0, 1],
        [2, 0, 1]],

       [[1, 2, 0],
        [0, 1, 2]]])

Ranking along a different axis (`axis=1` compares the two rows inside each block, column by column)

In [5]:
x.argsort(axis=1).argsort(axis=1)

array([[[1, 0, 0],
        [0, 1, 1]],

       [[1, 0, 0],
        [0, 1, 1]],

       [[0, 1, 0],
        [1, 0, 1]],

       [[1, 1, 0],
        [0, 0, 1]]])

<br>
<br>

In below,

a good way to insert a break in a for-loop or a big function; press Enter to resume

In [None]:
input()
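As a sketch of how the pause might sit inside a loop: the function below stops every few iterations and waits for Enter. The name `process_items`, the `pause_every` parameter, and the doubling step are all hypothetical. The `wait` parameter defaults to `input` and is only there so the pause can be swapped out when running non-interactively.

```python
def process_items(items, pause_every=2, wait=input):
    """Double each item, pausing after every `pause_every` items."""
    processed = []
    for i, item in enumerate(items, start=1):
        processed.append(item * 2)
        if i % pause_every == 0:
            # With the default wait=input, execution blocks here until Enter is pressed
            wait(f"Processed {i} items; press Enter to continue...")
    return processed

# Non-interactive demonstration: record the prompts instead of blocking
prompts = []
result = process_items([1, 2, 3, 4], wait=prompts.append)
print(result)        # [2, 4, 6, 8]
print(len(prompts))  # 2
```

While the loop is paused, variables can be inspected in another cell or a debugger; Python's built-in `breakpoint()` is an alternative when an interactive debugger is wanted.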

<br>
<br>

In below,

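An empty dictionary or list is a dangerous default value in Python: the default object is created once, when the function is defined, and is then shared by every call that relies on it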

In [1]:
def f(value, key, hash={}):
    hash[value] = key
    return hash

print(f('a', 1))
print(f('b', 2))

{'a': 1}
{'a': 1, 'b': 2}
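The usual way around this is to default to `None` and create the container inside the function. A minimal sketch, keeping the shape of `f` above but renaming the parameter to `mapping` (an illustrative choice that also avoids shadowing the built-in `hash`):

```python
def f(value, key, mapping=None):
    # A fresh dict is created on every call that does not pass one in
    if mapping is None:
        mapping = {}
    mapping[value] = key
    return mapping

first = f('a', 1)
second = f('b', 2)
print(first)   # {'a': 1}
print(second)  # {'b': 2}
```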


<br>
<br>

In below,

`locals()` inside of a function will return a dict of local variables

In [1]:
def localsNotPresent():
    return locals()

def localsPresent():
    present = True
    return locals()

print('localsNotPresent:', localsNotPresent())
print('localsPresent:', localsPresent())

localsNotPresent: {}
localsPresent: {'present': True}
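One practical use, shown as a sketch: calling `locals()` on the first line of a function captures exactly the arguments it was called with, which is handy for logging or string formatting. The function `make_report` and its parameters are hypothetical.

```python
def make_report(name, count=0):
    # At this point the only local variables are the arguments
    arguments = dict(locals())
    message = "{name}: {count}".format(**arguments)
    return arguments, message

arguments, message = make_report("apple", 3)
print(arguments)  # {'name': 'apple', 'count': 3}
print(message)    # apple: 3
```

Copying with `dict(...)` keeps the snapshot independent of any variables defined later in the function.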
