# Tutorial 8: Running NumPy In Your Data Warehouse

Pandas and NumPy are the two most popular Python data science libraries, used by more than 60% of Python developers. NumPy provides n-dimensional arrays and linear algebra operations in Python, which makes it a fundamental building block of machine learning.

Ponder lets you run NumPy commands directly in your warehouse. This means you can work with the NumPy API to build data and ML pipelines, and let DuckDB take care of scaling and security for you.

Here, we'll show a few examples of Ponder in action with NumPy.

In [1]:
import ponder; ponder.init()
import modin.pandas as pd
import duckdb
duckdb_con = duckdb.connect("../ponder.db")
ponder.configure(default_connection=duckdb_con)



In [2]:
df = pd.read_sql("PONDER_CUSTOMER", duckdb_con)

<div class="alert alert-block alert-info"> <b>Note: </b> <span>NumPy support is currently part of Modin's experimental API. Please drop us a note at <a href="mailto:support@ponder.io">support@ponder.io</a> if you run into any issues. Feedback is welcome!</span></div>

In [3]:
import modin.config as cfg
cfg.ExperimentalNumPyAPI.put(True)
import modin.numpy as np

In [4]:
arr = df.select_dtypes("number").astype("float").to_numpy()



The cell above converts the numeric columns of the dataframe into a Modin NumPy array:

In [5]:
type(arr)

modin.numpy.arr.array

In [6]:
arr

array([[ 6.00010e+04,  1.40000e+01,  9.95756e+03],
       [ 6.00020e+04,  1.50000e+01,  7.42460e+02],
       [ 6.00030e+04,  1.60000e+01,  2.52692e+03],
       [ 6.00040e+04,  1.00000e+01,  7.97522e+03],
       [ 6.00050e+04,  1.20000e+01,  2.50474e+03],
       [ 6.00060e+04,  2.20000e+01,  9.05140e+03],
       [ 6.00070e+04,  1.20000e+01,  6.01717e+03],
       [ 6.00080e+04,  2.00000e+00,  5.62144e+03],
       [ 6.00090e+04,  9.00000e+00,  9.54801e+03],
       [ 6.00100e+04,  2.10000e+01,  3.49791e+03],
       [ 6.00110e+04,  8.00000e+00,  3.42248e+03],
       [ 6.00120e+04,  2.30000e+01,  6.71600e+02],
       [ 6.00130e+04,  0.00000e+00, -4.85690e+02],
       [ 6.00140e+04,  1.60000e+01,  7.93215e+03],
       [ 6.00150e+04,  9.00000e+00,  4.62239e+03],
       [ 6.00160e+04,  2.00000e+00,  4.48097e+03],
       [ 6.00170e+04,  1.00000e+00,  3.65317e+03],
       [ 6.00180e+04,  0.00000e+00,  5.75983e+03],
       [ 6.00190e+04,  1.00000e+00,  3.44477e+03],
       [ 6.00200e+04,  1.50000e

We can perform reduction operations such as `np.sum` and `np.mean` across the entire matrix:

In [7]:
np.sum(arr)

6448157.578125
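
As a small sanity check of what a whole-array reduction computes, here is a sketch using stock NumPy as a stand-in for `modin.numpy` (an assumption made so the snippet runs without a warehouse connection). The `sample` array holds the first three rows of the output shown earlier; the variable names are illustrative only.

```python
import numpy as np  # stock NumPy stand-in for modin.numpy (assumption)

# First three rows of the array printed earlier in this tutorial.
sample = np.array([
    [60001.0, 14.0, 9957.56],
    [60002.0, 15.0, 742.46],
    [60003.0, 16.0, 2526.92],
])

# With no axis argument, the reduction collapses every element into one scalar.
total = np.sum(sample)
average = np.mean(sample)  # same as total divided by the element count

print(total, average)
```

With no `axis` argument, both functions treat the matrix as one flat collection of values, so the mean equals the sum divided by `sample.size`.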

We can also perform the reduction along a specific axis:

In [12]:
# mean of every row; keepdims=True keeps the result two-dimensional (one column per row mean)
np.mean(arr, axis=-1, keepdims=True)



array([[23324.188],
       [20253.154],
       [20848.64 ],
       [22663.072],
       [20840.58 ],
       [23026.467],
       [22012.057],
       [21877.146],
       [23188.67 ],
       [21176.303],
       [21147.16 ],
       [20235.533],
       [19842.436],
       [22654.049],
       [21548.797],
       [21499.656],
       [21223.725],
       [21925.943],
       [21154.924],
       [20407.434],
       [23315.17 ],
       [19754.088],
       [20010.344],
       [20006.453],
       [22447.717],
       [22518.377],
       [22055.406],
       [21176.33 ],
       [23108.049],
       [22149.5  ],
       [22268.248],
       [20981.62 ],
       [19854.47 ],
       [22426.229],
       [22412.5  ],
       [22648.633],
       [20357.807],
       [20938.727],
       [19761.193],
       [21163.594],
       [20664.273],
       [22913.654],
       [20260.01 ],
       [19745.03 ],
       [22811.37 ],
       [20362.857],
       [20771.254],
       [20618.043],
       [20728.943],
       [21749.33 ],
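
To make the effect of `axis` and `keepdims` concrete, the following sketch compares result shapes on a three-row sample. It uses stock NumPy as a stand-in for `modin.numpy` (an assumption), and the variable names are hypothetical.

```python
import numpy as np  # stock NumPy stand-in for modin.numpy (assumption)

# First three rows of the array printed earlier in this tutorial.
sample = np.array([
    [60001.0, 14.0, 9957.56],
    [60002.0, 15.0, 742.46],
    [60003.0, 16.0, 2526.92],
])

# axis=-1 reduces along the last axis, giving one mean per row.
row_means = np.mean(sample, axis=-1)                    # shape (3,)

# keepdims=True keeps the reduced axis with length 1.
row_means_2d = np.mean(sample, axis=-1, keepdims=True)  # shape (3, 1)

# axis=0 reduces along the first axis, giving one mean per column.
col_means = np.mean(sample, axis=0)                     # shape (3,)

print(row_means.shape, row_means_2d.shape, col_means.shape)
```

The `(3, 1)` shape is what lets the row means line up against the original rows in later arithmetic.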


We can also do element-wise matrix operations such as addition of two matrices:

In [13]:
# add the array to a view of itself with the column order reversed
arr + arr[:,::-1]

array([[6.9958562e+04, 2.8000000e+01, 6.9958562e+04],
       [6.0744461e+04, 3.0000000e+01, 6.0744461e+04],
       [6.2529922e+04, 3.2000000e+01, 6.2529922e+04],
       [6.7979219e+04, 2.0000000e+01, 6.7979219e+04],
       [6.2509738e+04, 2.4000000e+01, 6.2509738e+04],
       [6.9057398e+04, 4.4000000e+01, 6.9057398e+04],
       [6.6024172e+04, 2.4000000e+01, 6.6024172e+04],
       [6.5629438e+04, 4.0000000e+00, 6.5629438e+04],
       [6.9557008e+04, 1.8000000e+01, 6.9557008e+04],
       [6.3507910e+04, 4.2000000e+01, 6.3507910e+04],
       [6.3433480e+04, 1.6000000e+01, 6.3433480e+04],
       [6.0683602e+04, 4.6000000e+01, 6.0683602e+04],
       [5.9527309e+04, 0.0000000e+00, 5.9527309e+04],
       [6.7946148e+04, 3.2000000e+01, 6.7946148e+04],
       [6.4637391e+04, 1.8000000e+01, 6.4637391e+04],
       [6.4496969e+04, 4.0000000e+00, 6.4496969e+04],
       [6.3670172e+04, 2.0000000e+00, 6.3670172e+04],
       [6.5777828e+04, 0.0000000e+00, 6.5777828e+04],
       [6.3463770e+04, 2.000
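
The symmetry in the output above (first and last columns equal, middle column doubled) follows from how the slice `[:, ::-1]` reverses column order. Here is a minimal sketch on the first two rows, using stock NumPy as a stand-in for `modin.numpy` (an assumption):

```python
import numpy as np  # stock NumPy stand-in for modin.numpy (assumption)

# First two rows of the array printed earlier in this tutorial.
sample = np.array([
    [60001.0, 14.0, 9957.56],
    [60002.0, 15.0, 742.46],
])

# [:, ::-1] keeps every row and walks the columns from last to first.
reversed_cols = sample[:, ::-1]

# Element-wise addition pairs column 0 with column 2, and column 1 with itself.
result = sample + reversed_cols

print(result)
```

Because column 0 is paired with column 2 in both directions, the outer columns of the result are identical, and the middle column is simply twice its original value.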

Putting it all together, we can combine a reduction with an element-wise operation:

In [14]:
# subtract each row's average from every element in that row
arr - np.mean(arr, axis=-1, keepdims=True)

array([[ 36676.812 , -23310.188 , -13366.628 ],
       [ 39748.844 , -20238.154 , -19510.693 ],
       [ 39154.36  , -20832.64  , -18321.72  ],
       [ 37340.93  , -22653.072 , -14687.852 ],
       [ 39164.42  , -20828.58  , -18335.84  ],
       [ 36979.53  , -23004.467 , -13975.066 ],
       [ 37994.945 , -22000.057 , -15994.887 ],
       [ 38130.85  , -21875.146 , -16255.707 ],
       [ 36820.33  , -23179.67  , -13640.66  ],
       [ 38833.695 , -21155.303 , -17678.393 ],
       [ 38863.84  , -21139.16  , -17724.68  ],
       [ 39776.47  , -20212.533 , -19563.934 ],
       [ 40170.562 , -19842.436 , -20328.125 ],
       [ 37359.953 , -22638.049 , -14721.898 ],
       [ 38466.203 , -21539.797 , -16926.406 ],
       [ 38516.344 , -21497.656 , -17018.686 ],
       [ 38793.273 , -21222.725 , -17570.555 ],
       [ 38092.055 , -21925.943 , -16166.113 ],
       [ 38864.08  , -21153.924 , -17710.154 ],
       [ 39612.566 , -20392.434 , -19220.133 ],
       [ 36705.83  , -23310.17  , -13395
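
The subtraction above relies on broadcasting: the `(n, 1)` column of row means is stretched across the columns of the `(n, 3)` array. The sketch below illustrates this on two rows, again with stock NumPy standing in for `modin.numpy` (an assumption), and shows why `keepdims=True` matters for a non-square array. The names `centered` and `mismatch_raised` are illustrative.

```python
import numpy as np  # stock NumPy stand-in for modin.numpy (assumption)

# First two rows of the array printed earlier in this tutorial.
sample = np.array([
    [60001.0, 14.0, 9957.56],
    [60002.0, 15.0, 742.46],
])

# Shape (2, 1): one mean per row, kept as a column so it broadcasts over columns.
row_means = np.mean(sample, axis=-1, keepdims=True)
centered = sample - row_means

# After centering, every row sums to (approximately) zero.
print(centered.sum(axis=-1))

# Without keepdims the means have shape (2,), which cannot broadcast against (2, 3).
mismatch_raised = False
try:
    sample - np.mean(sample, axis=-1)
except ValueError:
    mismatch_raised = True
print(mismatch_raised)
```

Note that for a square array the version without `keepdims` would not raise; it would silently subtract the means along the wrong axis, which is another reason to keep the reduced dimension.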

Some additional NumPy operations Ponder currently supports include:

- Element-wise matrix operations such as addition, subtraction, multiplication, division, power
- Axis-collapsing or reducing operations such as min, max, sum, product, mean
- Multi-array operations such as maximum or minimum
- And many others, such as where, ravel, and transpose
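
For reference, here is what several of the listed operations compute, sketched with stock NumPy on two tiny arrays. Whether `modin.numpy` accepts exactly the same arguments for each call is an assumption; the list above only states that the operations are supported. The arrays `a` and `b` are made up for illustration.

```python
import numpy as np  # stock NumPy stand-in for modin.numpy (assumption)

a = np.array([[1.0, -2.0], [3.0, -4.0]])
b = np.array([[0.0, 5.0], [2.0, -6.0]])

larger = np.maximum(a, b)              # element-wise larger value of a and b
non_negative = np.where(a < 0, 0.0, a) # replace negative entries with 0
flat = np.ravel(a)                     # flatten to one dimension, row by row
swapped = np.transpose(a)              # swap rows and columns
squared = np.power(a, 2)               # raise every element to the power 2

print(larger, non_negative, flat, swapped, squared, sep="\n")
```

Note that `np.maximum` compares two arrays element by element, whereas `np.max` reduces a single array along an axis.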

In [15]:
duckdb_con.close()