### Tutorial 7: Beyond Pandas: Third-Party Library Integration

To support end-to-end data science workflows, Ponder integrates with other commonly used Python data science libraries. This tutorial shows examples of how you can work with visualization and machine learning libraries alongside Ponder.

In [None]:
! pip install matplotlib scikit-learn xgboost

<div class="alert alert-block alert-info"> <b>Note: </b> While Ponder runs pandas operations on the data warehouse, it does not yet run other libraries' computations on the warehouse directly. Instead, Ponder pulls the data out of the warehouse and operates on it in memory. If the table exceeds 10k rows, Ponder extracts a sample of 10k rows to pull into memory. The focus of this tutorial is to demonstrate how Ponder interoperates with these libraries.</div>
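
To get a feel for what working on a 10k-row sample means for downstream plots and models, here is a minimal sketch in plain pandas. The helper `to_memory`, its `seed` parameter, and the synthetic `TOTAL_AMOUNT` values are hypothetical and only illustrate the idea; this is not Ponder's actual sampling code.

```python
import numpy as np
import pandas as pd

SAMPLE_LIMIT = 10_000  # row threshold quoted in the note above


def to_memory(table: pd.DataFrame, limit: int = SAMPLE_LIMIT, seed: int = 0) -> pd.DataFrame:
    """Hypothetical helper: keep small tables whole, randomly sample large ones."""
    if len(table) <= limit:
        return table
    return table.sample(n=limit, random_state=seed)


# Synthetic stand-in for a large warehouse table (values are made up).
rng = np.random.default_rng(0)
big = pd.DataFrame({"TOTAL_AMOUNT": rng.exponential(scale=15.0, size=50_000)})

sampled = to_memory(big)
print(len(big), len(sampled))
# The sample mean approximates, but does not exactly equal, the full-table mean.
print(big["TOTAL_AMOUNT"].mean(), sampled["TOTAL_AMOUNT"].mean())
```

Statistics and plots computed on the sample are estimates of the full table, so small differences between runs or against warehouse-side aggregates are expected.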

## Visualization

Visualization is a critical part of any exploratory data analysis workflow for identifying patterns and trends in your data. Ponder works out of the box with popular plotting tools in the PyData ecosystem, including matplotlib and the pandas plotting API built on top of it.

In [None]:
import os; os.chdir("..")
import credential
import ponder.bigquery
import modin.pandas as pd
bigquery_con = ponder.bigquery.connect(
    user=credential.params["user"], password=credential.params["password"],
    account=credential.params["account"], role=credential.params["role"],
    database=credential.params["database"], schema=credential.params["schema"],
    warehouse=credential.params["warehouse"],
)
ponder.bigquery.init(bigquery_con, enable_ssl=True)

Here is an example of how to plot a histogram with Ponder:

In [None]:
df = pd.read_sql("PONDER_TAXI",con=bigquery_con)

In [None]:
df["TOTAL_AMOUNT"].plot.hist(bins=100)

Here is an example of how to plot a scatterplot with Ponder:

In [None]:
df = pd.read_csv("https://raw.githubusercontent.com/lux-org/lux-datasets/master/data/car.csv")
df.plot(x="MilesPerGal",y="Horsepower",kind="scatter")

You can also make more elaborate plots by using matplotlib's `plt.plot` functionality directly.

In [None]:
import matplotlib.pyplot as plt

In [None]:
plt.xlabel("MilesPerGal")
plt.ylabel("Horsepower")
for country in df.Origin.unique(): 
    cdf = df[df["Origin"]==country]
    plt.plot(cdf["MilesPerGal"],cdf["Horsepower"],'o',label=country)
plt.legend()
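
The same per-category scatter can also be written with `groupby`, which avoids filtering the DataFrame once per category. This is a minimal sketch under stated assumptions: it uses plain pandas and a handful of made-up rows rather than the car dataset, and the `Agg` backend line is only there so it runs without a display (drop it in a notebook).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows with the same column names as the car dataset used above.
cars = pd.DataFrame({
    "MilesPerGal": [18.0, 15.0, 26.0, 24.0, 27.0, 31.0],
    "Horsepower": [130, 165, 46, 95, 88, 65],
    "Origin": ["USA", "USA", "Europe", "Europe", "Japan", "Japan"],
})

fig, ax = plt.subplots()
# groupby yields one (key, sub-DataFrame) pair per distinct Origin value.
for origin, group in cars.groupby("Origin"):
    ax.plot(group["MilesPerGal"], group["Horsepower"], "o", label=origin)
ax.set_xlabel("MilesPerGal")
ax.set_ylabel("Horsepower")
ax.legend()
```

Working with an explicit `ax` object instead of the global `plt` state makes it easier to compose several plots in one figure.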

### Machine Learning

Ponder integrates with popular machine learning libraries, including scikit-learn, XGBoost, Hugging Face, TensorFlow, and more. Ponder also provides the ability to run NumPy on your data warehouse directly, which is the foundation of machine learning training in Python.


In [None]:
df = pd.read_csv("https://github.com/ponder-org/ponder-datasets/blob/main/USA_Housing.csv?raw=True")

In [None]:
X = df.drop(columns=['Price', 'Address'])
y = df[['Price']]
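
Note that double brackets select a one-column DataFrame (2-D), whereas single brackets would return a Series (1-D). A minimal sketch of the difference in plain pandas, using a tiny made-up table (the `Rooms` column and all values are hypothetical):

```python
import pandas as pd

# Tiny made-up table; only the Price and Address column names mirror the tutorial.
housing = pd.DataFrame({
    "Rooms": [3, 4, 5],
    "Price": [100.0, 150.0, 210.0],
    "Address": ["a", "b", "c"],
})

X_small = housing.drop(columns=["Price", "Address"])  # feature columns only
y_frame = housing[["Price"]]   # DataFrame, shape (3, 1)
y_series = housing["Price"]    # Series, shape (3,)

print(X_small.shape, y_frame.shape, y_series.shape)
```

Either form can be used as a regression target, but the shape of the target generally determines the shape of the predictions you get back, so it helps to pick one and stay consistent.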

#### scikit-learn

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.25)

my_imputer = SimpleImputer()
train_X = my_imputer.fit_transform(train_X)
test_X = my_imputer.transform(test_X)
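
With its default settings, `SimpleImputer` replaces missing values with the column mean learned from the training data, and it returns NumPy arrays rather than DataFrames. A minimal sketch on made-up arrays (the numbers are illustrative only):

```python
import numpy as np
from sklearn.impute import SimpleImputer

train = np.array([[1.0, np.nan], [3.0, 4.0], [5.0, 6.0]])
test = np.array([[np.nan, 10.0]])

imputer = SimpleImputer()  # default strategy is "mean"
train_filled = imputer.fit_transform(train)  # learns column means from train only
test_filled = imputer.transform(test)        # reuses the training means

print(train_filled)  # the NaN in column 1 becomes mean(4, 6) = 5.0
print(test_filled)   # the NaN in column 0 becomes mean(1, 3, 5) = 3.0
```

Fitting on the training split and only transforming the test split keeps information from the test data out of the preprocessing step.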

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

In [None]:
lr = LinearRegression()
lr.fit(train_X, train_y)
pred_y = lr.predict(test_X)

In [None]:
# scikit-learn metrics take (y_true, y_pred); R^2 is not symmetric in its arguments
print(r2_score(test_y, pred_y) * 100)
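
Argument order matters for `r2_score`: scikit-learn's signature is `r2_score(y_true, y_pred)`, and swapping the two changes the result because the score is normalized by the variance of the first argument. A minimal sketch with made-up numbers (not taken from the housing data):

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([2.0, 2.0, 3.0, 3.0])

# Correct order: residual sum of squares 2, total sum of squares 5 -> 1 - 2/5
r2_correct = r2_score(y_true, y_pred)
# Swapped order: the "true" values now have total sum of squares 1 -> 1 - 2/1
r2_swapped = r2_score(y_pred, y_true)

print(r2_correct, r2_swapped)  # 0.6 and -1.0
```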

#### XGBoost

In [None]:
from xgboost import XGBRegressor

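# XGBoost 1.6+ takes early_stopping_rounds in the constructor; newer releases no longer accept it in fit()
my_model = XGBRegressor(n_estimators=1000, learning_rate=0.05, early_stopping_rounds=5)
my_model.fit(train_X, train_y,
             eval_set=[(test_X, test_y)], verbose=False)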

In [None]:
predictions = my_model.predict(test_X)

from sklearn.metrics import mean_absolute_error
# scikit-learn metrics take (y_true, y_pred)
print("Mean Absolute Error : " + str(mean_absolute_error(test_y, predictions)))
print(r2_score(test_y, predictions) * 100)