# Explaining how Buckaroo fits into the PyData ecosystem for Friends and Family

I want to explain to my non-technical friends what I have been building, and how it fits into the larger Python data ecosystem.

## How programming languages do Math

Before we dive into exciting visuals and interactive programs, let’s lay some groundwork. In commonly used tools like Python, JavaScript, and Excel, mathematical operations can be slow. When evaluating an expression like `c = a + b`, the computer must first check the types of `a` and `b`, then figure out how to add them. Languages like `C` and `Fortran` fix the types in advance, so they run much faster, but they are less user-friendly.
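To make that type-checking concrete, here is a small sketch with made-up values. It shows the same `+` doing different work depending on what it is handed, which is why Python has to look at the types before every single addition.

```python
# The same "+" means different things depending on what a and b are,
# so Python inspects their types every time it evaluates a + b.
examples = [
    (1, 2),        # two whole numbers: arithmetic
    (1.5, 2),      # a decimal and a whole number: arithmetic after conversion
    ("1", "2"),    # two pieces of text: glued together
    ([1], [2]),    # two lists: joined end to end
]

results = []
for a, b in examples:
    c = a + b
    results.append(c)
    print(type(a).__name__, "+", type(b).__name__, "->", repr(c))
```

One check like this is cheap. Repeated a million times, it adds up.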

## NumPy: Accelerating Matrix Math

Quickly performing operations on matrices is essential for linear regression, image recognition, and AI (like ChatGPT). 

In 2006, Travis Oliphant created NumPy, a library specifically for arrays and matrices. In a language like Python, adding two lists of numbers involves type-checking every element. NumPy determines the type of an entire matrix once, then efficiently adds the elements together.
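Here is a minimal sketch of that difference, using tiny made-up arrays. A Python list lets every element be a different type, while a NumPy array records a single type for the whole array in its `dtype` attribute.

```python
import numpy as np

# A Python list can hold anything, so every element carries its own type.
py_list = [1, 2.5, "three"]
element_types = [type(x).__name__ for x in py_list]
print(element_types)

# A NumPy array stores one type for the whole array, recorded once in .dtype.
np_a = np.array([1, 2, 3], dtype=np.int64)
np_b = np.array([10, 20, 30], dtype=np.int64)
print(np_a.dtype, np_b.dtype)

# With the types known up front, the addition runs as one bulk operation.
np_c = np_a + np_b
print(np_c)
```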

Let's see how much faster this is when adding two lists of 1,000,000 numbers, as many entries as a 1,000 x 1,000 matrix!

In [1]:
ELEMENTS = 1_000_000
py_a = [x for x in range(ELEMENTS)]
py_b = [x for x in range(ELEMENTS*10, 0, -10)]
%timeit -n 10 [py_a[i] + py_b[i] for i in range(ELEMENTS)]

37.8 ms ± 650 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [2]:
import numpy as np
np_a = np.arange(ELEMENTS)
np_b = np.arange(ELEMENTS * 10, 0, -10)
%timeit -n 100 np_a + np_b

712 µs ± 295 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


## How much faster
37.8 milliseconds vs 712 microseconds, or about 50 times faster. (A millisecond is a thousandth of a second; a microsecond is a millionth.)
With NumPy, C/Fortran speed became accessible from the friendly language of Python, in short, easy-to-understand snippets.
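For anyone who wants to check the arithmetic, here is a quick sketch using the two timings printed by `%timeit` above. Your own numbers will differ from run to run and machine to machine.

```python
# Timings copied from the two %timeit outputs above.
python_list_ms = 37.8   # milliseconds per loop (plain Python lists)
numpy_us = 712          # microseconds per loop (NumPy arrays)

# Convert both to the same unit: 1 millisecond = 1,000 microseconds.
python_list_us = python_list_ms * 1_000

speedup = python_list_us / numpy_us
print(f"NumPy was about {speedup:.0f} times faster")
```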

# The PyData ecosystem emerges

NumPy was revolutionary because it made high-performance numerical computing accessible to academics who needed computation to power their analyses. Computational biologists, physicists, electrical engineers, and astrophysicists all used NumPy and contributed back to this open source library. They also wrote their own libraries:

* [SciPy](https://scipy.org/) (2001, Travis Oliphant and many more) for linear regressions, differential equations, and much more.
* [Matplotlib](https://matplotlib.org/) (2003) for static plots.

Here are these two tools used together to build a linear regression chart: https://python-graph-gallery.com/scatterplot-with-regression-fit-in-matplotlib/

# Explaining Buckaroo to my non-technical friends

This notebook explains my side project Buckaroo, and how it fits into the data science ecosystem.

## Background on Python and how it's used in data science

Python is a popular programming language; you might also have heard of Java, C, and JavaScript. Data scientists have come to rely on Python because it balances speed of execution (C is faster) with ease of use and learning. Buckaroo also builds on several other open source libraries.


### NumPy

NumPy was written by Travis Oliphant in 2006. Matrix math is at the heart of linear regression, image recognition, and AI like ChatGPT. In many cases NumPy makes matrix operations 25-100x faster than raw Python.

Here are NumPy and Matplotlib working together to fit and chart a linear regression:

In [None]:
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
rng = np.random.default_rng(1234)
# Generate data
x = rng.uniform(0, 10, size=100)
y = x + rng.normal(size=100)

# Initialize layout
fig, ax = plt.subplots(figsize = (9, 9))

# Add scatterplot
ax.scatter(x, y, s=60, alpha=0.7, edgecolors="k")

# Fit linear regression via least squares with numpy.polyfit
# It returns a slope (b) and an intercept (a)
# deg=1 means linear fit (i.e. polynomial of degree 1)
b, a = np.polyfit(x, y, deg=1)

# Create a sequence of 100 numbers from 0 to 10
xseq = np.linspace(0, 10, num=100)

# Plot regression line
ax.plot(xseq, a + b * xseq, color="k", lw=2.5);

## Pandas

Pandas was started by quant researcher Wes McKinney in 2008. It uses NumPy to deal with heterogeneous data (integers, floats, and strings) and makes manipulation easier, allowing Excel-like operations (though not the Excel UI).
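As a rough sketch of what "heterogeneous data" and "Excel-like operations" mean here, this example builds a tiny table by hand. The station names and numbers are invented for illustration. It then filters, sorts, and totals the table the way you might in a spreadsheet.

```python
import pandas as pd

# Made-up data: each column has its own type, like columns in a spreadsheet.
stations = pd.DataFrame({
    "station": ["Broadway", "Central Park", "Broadway", "Wall St"],  # text
    "trips": [120, 340, 95, 60],                                     # whole numbers
    "avg_minutes": [12.5, 18.0, 9.75, 7.25],                         # decimals
})
print(stations.dtypes)

# Spreadsheet-style operations: filter rows, sort them, and total a column.
busy = stations[stations["trips"] > 90].sort_values("trips", ascending=False)
total_trips = stations["trips"].sum()
print(busy)
print(total_trips)
```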


In [None]:
import pandas as pd
df = pd.read_csv("/Users/paddy/code/example-notebooks/citibike-trips.csv") 
df

# The Jupyter notebook

I have been demonstrating the PyData ecosystem inside a Jupyter notebook, an interactive analysis and documentation environment built around Python. In 2001, while a grad student, Fernando Perez wanted a better [interactive environment](https://en.wikipedia.org/wiki/IPython) for playing with data, and started IPython. In 2011 the IPython project first released the notebook interface you see here, which later became the [Jupyter notebook](https://en.wikipedia.org/wiki/Project_Jupyter).

Combining small snippets of analysis code with charts and narrative text allowed academics to write and share research in ways that were cumbersome before.
(Maybe show emacs/vscode traditional method of writing code.) This is particularly important for data-intensive analysis, where you need to look at the data and play with it iteratively. This interface works very well for the problems that data scientists and academics deal with every day.

# Pandas

In 2008, financial quant researcher Wes McKinney began building pandas (open-sourced the following year), which made analysis of real-world data easier, time series data in particular. Pandas was built on top of NumPy, and allowed computations to be run on mixed datasets (you could have a set of temperature observations ordered by time of day, with a string column for location).
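The temperature example can be sketched in a few lines. The cities, times, and readings below are invented; the point is only that numbers, timestamps, and text sit side by side in one table.

```python
import pandas as pd

# Made-up temperature readings, indexed by the time they were taken,
# with a text column saying where each reading came from.
readings = pd.DataFrame(
    {
        "location": ["Boston", "Boston", "Chicago", "Chicago"],
        "temp_f": [61.0, 65.0, 55.0, 59.0],
    },
    index=pd.to_datetime([
        "2024-05-01 08:00", "2024-05-01 14:00",
        "2024-05-01 08:00", "2024-05-01 14:00",
    ]),
)

# Average temperature per location (numbers grouped by a text column).
mean_by_location = readings.groupby("location")["temp_f"].mean()
print(mean_by_location)

# Select only the morning readings using the time index.
morning = readings.between_time("06:00", "12:00")
print(morning)
```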

Pandas, like each of the previous tools, took what was technically possible and made it more usable, so a broader audience could start doing their work in the PyData ecosystem.

The following code reads a CSV file with 300,000 rows of Citi Bike trips:

In [None]:
import pandas as pd
df = pd.read_csv("/Users/paddy/code/example-notebooks/citibike-trips.csv") 
df

In [None]:
# Once you have a dataset like this there are a lot of operations you might want to perform
df['tripduration'].mean()

In [None]:
df['start station name'].value_counts()

In [None]:
# Average trip duration for each start station
df.groupby('start station name')['tripduration'].mean()

In [None]:
df['tripduration'].hist()

In [None]:
df['tripduration'].quantile(.01)

In [None]:
df['tripduration'].quantile(.99)

In [None]:
# Drop the shortest 1% and longest 1% of trips so outliers don't squash the histogram
df[(df['tripduration'] > df['tripduration'].quantile(.01)) & (df['tripduration'] < df['tripduration'].quantile(.99))]['tripduration'].hist()
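The CSV above lives on my laptop, so here is a self-contained sketch of the same outlier-trimming idea on a handful of hypothetical trip durations. It skips the histogram and just compares averages, to show how much one runaway value can distort the picture.

```python
import pandas as pd

# Hypothetical trip durations in seconds; the last value is an outlier
# (think of a bike that was never docked properly).
trips = pd.DataFrame(
    {"tripduration": [300, 420, 480, 540, 600, 660, 720, 900, 1200, 86_400]}
)

low = trips["tripduration"].quantile(0.01)
high = trips["tripduration"].quantile(0.99)

# Keep only trips strictly between the 1st and 99th percentiles.
trimmed = trips[(trips["tripduration"] > low) & (trips["tripduration"] < high)]

full_mean = trips["tripduration"].mean()       # pulled far upward by the outlier
trimmed_mean = trimmed["tripduration"].mean()  # closer to a typical trip
print(full_mean, trimmed_mean)
```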

# Why I wrote Buckaroo

Thank you for bearing with me this far. You have now seen the PyData ecosystem and a small sample of how it is used.
These are all powerful tools, but they are a bit cumbersome to use. I look at several different datasets a day, and I want to understand them quickly. I don't want to type a bunch of commands to get the overview I'm looking for, and I want to be able to look at the raw data. Here is Buckaroo:

In [None]:
# After this import, displaying a DataFrame in the notebook uses Buckaroo
import buckaroo
df = pd.read_csv("/Users/paddy/code/example-notebooks/citibike-trips.csv") 
df