# Tutorial 8: Running NumPy In Your Data Warehouse

<div class="alert alert-block alert-info"> <b>Before we get started: </b> 
    <ul style="list-style-type: none;margin: 0;padding: 0;">
        <li>✍️ To run this notebook, you need to have Ponder installed and set up on your machine. If you have not done so already, please refer to our <a href="https://docs.ponder.io/getting_started/quickstart.html">Quickstart guide</a> to get started.</li>
        <li>📖 Otherwise, if you're just interested in browsing through the tutorial, keep reading below!</li>
    </ul>
</div>

Pandas and NumPy are two of the most popular Python data science libraries, used by more than 60% of Python developers. NumPy provides n-dimensional arrays and linear algebra operations in Python, which makes it a fundamental building block of machine learning.

Ponder lets you run NumPy commands directly in your warehouse. This means you can work with the NumPy API to build data and ML pipelines, and let BigQuery take care of scaling and security for you.

Here, we'll show a few examples of Ponder in action with NumPy.

In [1]:
import ponder; ponder.init()
import modin.pandas as pd
from google.cloud import bigquery
from google.cloud.bigquery import dbapi
from google.oauth2 import service_account
import json
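
# Load the service-account key and open a DB-API connection to BigQuery.
with open("../credential.json") as f:
    credentials = service_account.Credentials.from_service_account_info(
        json.load(f), scopes=["https://www.googleapis.com/auth/bigquery"]
    )
bigquery_con = dbapi.Connection(bigquery.Client(credentials=credentials))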

2023-05-12 10:01:29 - Creating session xSpjy6Ay-h5m_J842U4KQtaq5XpS3szSf8m1quDMkC


In [2]:
df = pd.read_sql("TEST.PONDER_CUSTOMER", bigquery_con)

<div class="alert alert-block alert-info"> <b>Note: </b> <span>NumPy support is currently part of Modin's experimental API. Please drop us a note at <a href="mailto:support@ponder.io">support@ponder.io</a> if you run into any issues. Feedback is welcome!</span></div>

In [3]:
import modin.config as cfg
cfg.ExperimentalNumPyAPI.put(True)
import modin.numpy as np

In [4]:
arr = df.select_dtypes("number").to_numpy()



This converts the numeric columns of the dataframe into a Modin NumPy array.

In [5]:
type(arr)

modin.numpy.arr.array

In [6]:
arr

array([[ 6.00820e+04,  0.00000e+00,  3.64547e+03],
       [ 6.00800e+04,  0.00000e+00,  6.89240e+02],
       [ 6.00180e+04,  0.00000e+00,  5.75983e+03],
       [ 6.00620e+04,  0.00000e+00,  6.21099e+03],
       [ 6.00220e+04,  0.00000e+00, -7.59740e+02],
       [ 6.00530e+04,  0.00000e+00,  6.51551e+03],
       [ 6.00430e+04,  0.00000e+00,  7.37030e+02],
       [ 6.00130e+04,  0.00000e+00, -4.85690e+02],
       [ 6.00980e+04,  0.00000e+00,  1.44968e+03],
       [ 6.00710e+04,  1.00000e+00,  7.05068e+03],
       [ 6.00740e+04,  1.00000e+00,  8.36434e+03],
       [ 6.00190e+04,  1.00000e+00,  3.44477e+03],
       [ 6.00170e+04,  1.00000e+00,  3.65317e+03],
       [ 6.00450e+04,  1.00000e+00,  8.38811e+03],
       [ 6.00160e+04,  2.00000e+00,  4.48097e+03],
       [ 6.00470e+04,  2.00000e+00,  2.26476e+03],
       [ 6.00240e+04,  2.00000e+00, -6.64000e+00],
       [ 6.00570e+04,  2.00000e+00,  9.84870e+02],
       [ 6.00080e+04,  2.00000e+00,  5.62144e+03],
       [ 6.00560e+04,  3.00000e
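
The conversion step above can be sketched locally with plain pandas and NumPy. This is only an illustration under stated assumptions: the column names and values are hypothetical stand-ins for the customer table, and it uses stock pandas rather than Ponder, so nothing here runs in the warehouse.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the customer table:
# two numeric columns and one text column.
df = pd.DataFrame(
    {
        "customer_id": [60082, 60080, 60018],
        "name": ["a", "b", "c"],
        "balance": [3645.47, 689.24, 5759.83],
    }
)

# Keep only the numeric columns, then convert them to a 2-D array.
# The text column is dropped; mixed int/float columns share one float dtype.
arr = df.select_dtypes("number").to_numpy()

print(arr.shape)  # (3, 2)
print(arr.dtype)  # float64
```

Selecting numeric columns first matters because a text column would otherwise force the whole array to a generic object dtype, which numeric reductions cannot work with.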

We can perform reduction operations such as `np.sum` and `np.mean` across the entire matrix:

In [7]:
np.sum(arr)

6448157.58

In [8]:
np.mean(arr)

Or we can perform the reduction along a specific axis:

In [9]:
# mean of each row; keepdims=True keeps the result two-dimensional (one column)
np.mean(arr, axis=-1, keepdims=True)



array([[21242.49      ],
       [20256.41333333],
       [21925.94333333],
       [22090.99666667],
       [19754.08666667],
       [22189.50333333],
       [20260.01      ],
       [19842.43666667],
       [20515.89333333],
       [22374.22666667],
       [22813.11333333],
       [21154.92333333],
       [21223.72333333],
       [22811.37      ],
       [21499.65666667],
       [20771.25333333],
       [20006.45333333],
       [20347.95666667],
       [21877.14666667],
       [22979.76      ],
       [22268.89      ],
       [23027.20333333],
       [19745.03      ],
       [20648.39666667],
       [20303.        ],
       [21163.59333333],
       [20357.80666667],
       [21975.31666667],
       [23315.17      ],
       [21576.70333333],
       [21176.33      ],
       [20900.69      ],
       [20938.72666667],
       [21463.5       ],
       [21147.16      ],
       [20664.27333333],
       [22268.24666667],
       [20618.04333333],
       [20728.94333333],
       [20849.08      ],


We can also do element-wise matrix operations such as addition of two matrices:

In [10]:
# add the array to a copy of itself with the column order reversed
arr + arr[:,::-1]

array([[6.372747e+04, 0.000000e+00, 6.372747e+04],
       [6.076924e+04, 0.000000e+00, 6.076924e+04],
       [6.577783e+04, 0.000000e+00, 6.577783e+04],
       [6.627299e+04, 0.000000e+00, 6.627299e+04],
       [5.926226e+04, 0.000000e+00, 5.926226e+04],
       [6.656851e+04, 0.000000e+00, 6.656851e+04],
       [6.078003e+04, 0.000000e+00, 6.078003e+04],
       [5.952731e+04, 0.000000e+00, 5.952731e+04],
       [6.154768e+04, 0.000000e+00, 6.154768e+04],
       [6.712168e+04, 2.000000e+00, 6.712168e+04],
       [6.843834e+04, 2.000000e+00, 6.843834e+04],
       [6.346377e+04, 2.000000e+00, 6.346377e+04],
       [6.367017e+04, 2.000000e+00, 6.367017e+04],
       [6.843311e+04, 2.000000e+00, 6.843311e+04],
       [6.449697e+04, 4.000000e+00, 6.449697e+04],
       [6.231176e+04, 4.000000e+00, 6.231176e+04],
       [6.001736e+04, 4.000000e+00, 6.001736e+04],
       [6.104187e+04, 4.000000e+00, 6.104187e+04],
       [6.562944e+04, 4.000000e+00, 6.562944e+04],
       [6.893628e+04, 6.000000e
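
The reversed-column addition can be checked by hand on a tiny made-up array. The sketch below uses plain NumPy, assuming `modin.numpy` slices and adds the same way.

```python
import numpy as np

# Made-up 2x3 array standing in for `arr`.
a = np.array([[1.0, 0.0, 10.0],
              [2.0, 1.0, 20.0]])

# a[:, ::-1] keeps every row but walks the columns right to left.
reversed_cols = a[:, ::-1]

# Element-wise addition pairs column 0 with column 2 and column 1 with itself.
result = a + reversed_cols

print(result)
# [[11.  0. 11.]
#  [22.  2. 22.]]
```

This is why the first and last columns of the output above are identical and the middle column is simply doubled.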

Putting everything together, we can combine a reduction with an element-wise operation:

In [11]:
# subtract the mean of each row from every element in that row
arr - np.mean(arr, axis=-1, keepdims=True)

array([[ 38839.51      , -21242.49      , -17597.02      ],
       [ 39823.58666667, -20256.41333333, -19567.17333333],
       [ 38092.05666667, -21925.94333333, -16166.11333333],
       [ 37971.00333333, -22090.99666667, -15880.00666667],
       [ 40267.91333333, -19754.08666667, -20513.82666667],
       [ 37863.49666667, -22189.50333333, -15673.99333333],
       [ 39782.99      , -20260.01      , -19522.98      ],
       [ 40170.56333333, -19842.43666667, -20328.12666667],
       [ 39582.10666667, -20515.89333333, -19066.21333333],
       [ 37696.77333333, -22373.22666667, -15323.54666667],
       [ 37260.88666667, -22812.11333333, -14448.77333333],
       [ 38864.07666667, -21153.92333333, -17710.15333333],
       [ 38793.27666667, -21222.72333333, -17570.55333333],
       [ 37233.63      , -22810.37      , -14423.26      ],
       [ 38516.34333333, -21497.65666667, -17018.68666667],
       [ 39275.74666667, -20769.25333333, -18506.49333333],
       [ 40017.54666667, -20004.45333333
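
The subtraction above relies on broadcasting: the `(n, 1)` column of row means is stretched across all columns of the `(n, 3)` array. Here is a minimal sketch on made-up values, using plain NumPy and assuming `modin.numpy` broadcasts the same way.

```python
import numpy as np

# Made-up array standing in for `arr`.
a = np.array([[1.0, 2.0, 3.0],
              [10.0, 20.0, 60.0]])

# Shape (2, 1): one mean per row, kept as a column.
row_means = np.mean(a, axis=-1, keepdims=True)

# Broadcasting subtracts each row's mean from every element in that row.
centered = a - row_means

print(centered)
# [[ -1.   0.   1.]
#  [-20. -10.  30.]]

# After centering, every row averages to zero.
print(np.allclose(np.mean(centered, axis=-1), 0.0))  # True
```

Without `keepdims=True` the means would have shape `(2,)`, which does not broadcast against a `(2, 3)` array, so the subtraction would raise an error in plain NumPy.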

Some additional NumPy operations Ponder currently supports include:

- Element-wise matrix operations such as addition, subtraction, multiplication, division, power
- Axis-collapsing or reducing operations such as min, max, sum, product, mean
- Multi-array operations such as maximum or minimum
- And many others, such as where, ravel, and transpose
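
As a rough illustration of the operations listed above, the following sketch runs a few of them on made-up arrays. It uses plain NumPy. Whether every call has the identical signature in `modin.numpy` is an assumption, since that API is experimental.

```python
import numpy as np

# Two made-up arrays of the same shape.
a = np.array([[1.0, -2.0],
              [3.0, -4.0]])
b = np.array([[0.0, 5.0],
              [2.0, -1.0]])

squared = a ** 2                     # element-wise power
col_max = np.max(a, axis=0)          # reduction: largest value in each column
larger = np.maximum(a, b)            # multi-array: element-wise larger of a and b
clipped = np.where(a < 0, 0.0, a)    # replace negative entries with 0
flat = np.ravel(a)                   # flatten to one dimension, row by row
transposed = np.transpose(a)         # swap rows and columns

print(larger)
# [[ 1.  5.]
#  [ 3. -1.]]
print(flat)  # [ 1. -2.  3. -4.]
```

Note the difference between `np.max`, which collapses an axis of a single array, and `np.maximum`, which compares two arrays element by element.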

# Summary

In this tutorial, we demonstrated how you can use Ponder to run NumPy operations directly in your data warehouse. Congrats! This wraps up our 8-part tutorial series on how you can use Ponder to start accelerating your data science workflow. We're so excited to see what you can do with Ponder! If you have any questions or issues, or want to show off something cool you've built with Ponder:
- [Join us on Slack](https://modin.org/slack.html) and share your thoughts on #ponder-support, or
- Drop us a note at support@ponder.io.