<br><br><br><br><br>

# Numpy and Pandas

In [None]:
# Start with a plausible problem: analyze a dataset of daily Newark temperatures since 1883.

import glob

for filename in glob.glob("data/newark-temperature-*.txt"):
    print("-------------------------------")
    print(filename)
    with open(filename) as file:
        print(file.read()[:1000])

In [None]:
# Read the averages, minima, and maxima into Python lists (one float per line of each file).

avrg = []
with open("data/newark-temperature-avrg.txt") as file:
    for line in file.readlines():
        avrg.append(float(line))

mini = []
with open("data/newark-temperature-mini.txt") as file:
    for line in file.readlines():
        mini.append(float(line))

maxi = []
with open("data/newark-temperature-maxi.txt") as file:
    for line in file.readlines():
        maxi.append(float(line))

print("how many?", len(avrg), len(mini), len(maxi))    # having the same length is essential!
print("starts with", avrg[:3], mini[:3], maxi[:3])
print("ends with  ", avrg[-3:], mini[-3:], maxi[-3:])

In [None]:
# The minima and maxima are more complete than the averages.

import math

print("fraction good in avrg:", sum(0 if math.isnan(x) else 1 for x in avrg) / len(avrg))
print("fraction good in mini:", sum(0 if math.isnan(x) else 1 for x in mini) / len(mini))
print("fraction good in maxi:", sum(0 if math.isnan(x) else 1 for x in maxi) / len(maxi))

In [None]:
%%timeit

# So let's "impute" averages: the measured average is ideal, but take (mini + maxi)/2 if unavailable.

imputed = []
for average, minimum, maximum in zip(avrg, mini, maxi):
    if math.isnan(average):
        imputed.append(0.5*(minimum + maximum))
    else:
        imputed.append(average)

In [None]:
# Same thing in Numpy: load the data and impute missing averages.

import numpy

np_avrg = numpy.array(avrg)
np_mini = numpy.array(mini)
np_maxi = numpy.array(maxi)

print("how many?", len(np_avrg), len(np_mini), len(np_maxi))
print("starts with", np_avrg[:3], np_mini[:3], np_maxi[:3])
print("ends with  ", np_avrg[-3:], np_mini[-3:], np_maxi[-3:])

print()
# numpy.isnan marks the BAD values, so invert the mask with ~ to count the good ones.
print("fraction good in avrg:", (~numpy.isnan(np_avrg)).sum() / len(np_avrg))
print("fraction good in mini:", (~numpy.isnan(np_mini)).sum() / len(np_mini))
print("fraction good in maxi:", (~numpy.isnan(np_maxi)).sum() / len(np_maxi))

In [None]:
%%timeit

#                        condition             if true                  if false
np_imputed = numpy.where(numpy.isnan(np_avrg), 0.5*(np_mini + np_maxi), np_avrg)

<br><br><br><br><br>

Your mileage may vary, but <tt>6.73 ms</tt> versus <tt>79.1 µs</tt> is a factor of 85!

Speedups of a factor of 100‒1000 in going from pure Python to Numpy are common, and for a long-running job they make the difference between 5 minutes (bathroom break) and 8 hours (overnight).
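
To check the speedup outside of a notebook, the comparison can be sketched with the standard `timeit` module. This is only a rough sketch: the arrays below are synthetic stand-ins for the Newark files (their size, value ranges, and 30% missing fraction are assumptions), and the measured factor will depend on the machine.

```python
import math
import timeit

import numpy

# Synthetic stand-in for the temperature data (assumed values, not the real files).
rng = numpy.random.default_rng(12345)
n = 50000
np_mini = rng.uniform(-10, 10, n)
np_maxi = np_mini + rng.uniform(0, 15, n)
np_avrg = 0.5 * (np_mini + np_maxi) + rng.normal(0, 1, n)
np_avrg[rng.random(n) < 0.3] = numpy.nan    # knock out roughly 30% of the averages

# Plain Python lists for the step-by-step version.
avrg, mini, maxi = np_avrg.tolist(), np_mini.tolist(), np_maxi.tolist()

def impute_python():
    imputed = []
    for average, minimum, maximum in zip(avrg, mini, maxi):
        if math.isnan(average):
            imputed.append(0.5 * (minimum + maximum))
        else:
            imputed.append(average)
    return imputed

def impute_numpy():
    return numpy.where(numpy.isnan(np_avrg), 0.5 * (np_mini + np_maxi), np_avrg)

# Best of 3 repeats, 10 calls each, converted to seconds per call.
time_python = min(timeit.repeat(impute_python, number=10, repeat=3)) / 10
time_numpy = min(timeit.repeat(impute_numpy, number=10, repeat=3)) / 10

print("pure Python: %.3g s per call" % time_python)
print("Numpy:       %.3g s per call" % time_numpy)
print("speedup factor: %.0f" % (time_python / time_numpy))
```

Taking the minimum over repeats is a common way to reduce noise from other processes; `%%timeit` in the notebook does something similar automatically.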

<br><br><br><br><br>

### Fundamentally different code order

Also notice that we had to change the code from

```python
imputed = []
for average, minimum, maximum in zip(avrg, mini, maxi):
    if math.isnan(average):
        imputed.append(0.5*(minimum + maximum))
    else:
        imputed.append(average)
```

to

```python
np_imputed = numpy.where(numpy.isnan(np_avrg), 0.5*(np_mini + np_maxi), np_avrg)
```

Pure Python is step-by-step, which can be good or bad. Numpy is all-at-once, which can be good or bad.

**Step-by-step:**

   * is **good** because you can insert breakpoints and watch variables to debug the code; it's like a microscopic view with no abstraction;
   * is **bad** because the bigger picture can be lost when spread out among so many lines. (This is why I use list comprehensions.)

**All-at-once:**

   * is **good** because the composition of functions often reads like an English description of the problem to be solved;
   * is **bad** because many indexes need to align; it's hard to break the process apart to debug it. (I usually get a line of Numpy right on the fifth try. Error messages are your friend.)

**Trade-offs:**

Pure Python is generally easier to _write,_ making it good for prototyping. Numpy is often easier to _read_.

And, of course, Numpy is faster.
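
As an illustration of the list-comprehension remark above, the step-by-step imputation loop can be condensed into a single expression that is still pure Python. This is a minimal sketch on a tiny made-up sample (the numbers are assumptions, not the Newark data):

```python
import math

# Tiny made-up sample: two of the four averages are missing.
avrg = [30.5, float("nan"), 41.0, float("nan")]
mini = [25.0, 28.0, 35.0, 10.0]
maxi = [36.0, 40.0, 47.0, 20.0]

# One expression instead of a five-line loop; still evaluated one datum at a time.
imputed = [
    0.5 * (minimum + maximum) if math.isnan(average) else average
    for average, minimum, maximum in zip(avrg, mini, maxi)
]

print(imputed)   # [30.5, 34.0, 41.0, 15.0]
```

The comprehension keeps the bigger picture on one screen, but it runs at pure Python speed, so it sits between the two styles rather than replacing Numpy.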

In [None]:
# Pure Python code order: acts on one DATUM at a time.
# Numpy code order: acts on one ATTRIBUTE at a time.

a = numpy.random.uniform(5, 10, 10000)
b = numpy.random.uniform(10, 20, 10000)
c = numpy.random.uniform(-0.1, 0.1, 10000)

# Computes one quadratic formula on ai, bi, ci before moving on to the next one.
roots1 = numpy.empty(10000, dtype=a.dtype)
for i in range(10000):
    roots1[i] = (-b[i] + math.sqrt(b[i]**2 - 4*a[i]*c[i])) / (2*a[i])

# Computes one step in the quadratic formula for all 10000 before moving on to the next step.
roots2 = (-b + numpy.sqrt(b**2 - 4*a*c)) / (2*a)

print(roots1[:10])
print(roots2[:10])

In [None]:
# The Numpy expression (-b + numpy.sqrt(b**2 - 4*a*c)) / (2*a) actually computes something like:

tmp1 = numpy.negative(b)            # -b
tmp2 = numpy.square(b)              # b**2
tmp3 = numpy.multiply(4, a)         # 4*a
tmp4 = numpy.multiply(tmp3, c)      # tmp3*c
tmp5 = numpy.subtract(tmp2, tmp4)   # tmp2 - tmp4
tmp6 = numpy.sqrt(tmp5)             # sqrt(tmp5)
tmp7 = numpy.add(tmp1, tmp6)        # tmp1 + tmp6
tmp8 = numpy.multiply(2, a)         # 2*a
roots3 = numpy.divide(tmp7, tmp8)   # tmp7 / tmp8

print(roots1[:10])
print(roots2[:10])
print(roots3[:10])

In [None]:
# Not only are operations like + and numpy.sqrt array-at-a-time, but equality checks are, too.

print(roots1 == roots2)
print(roots2 == roots3)

# To get a single answer to the question, "Are these arrays equal?" we have to reduce it with all().

print()
print((roots1 == roots2).all())
print((roots2 == roots3).all())

In [None]:
# Not all roots1 == roots2? Why not?

# Which ones fail? (Indexes where roots1 != roots2 is nonzero/true.)
failures, = numpy.nonzero(roots1 != roots2)
print(failures[:20])

# Show me the elements at the first index that fails!
print()
print("from roots1:", roots1[failures[0]])
print("from roots2:", roots2[failures[0]])

# How big is the difference?
print()
print("       diff:", roots1[failures[0]] - roots2[failures[0]])

In [None]:
# Oh! Are they all these tiny differences?

print(roots1[failures] - roots2[failures])

<br><br><br>

The difference came from `math.sqrt` versus `numpy.sqrt`: the latter is implemented differently on arrays. (Even using `numpy.sqrt` in both places doesn't help, because it runs different code on a single scalar than it does on arrays.)

<br>

**Moral:** do not expect pure Python math to agree perfectly with Numpy math!
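
In practice, this usually means comparing with a tolerance instead of `==`. The sketch below uses `numpy.allclose`, whose default tolerances are `rtol=1e-05` and `atol=1e-08`; the random inputs mirror the quadratic-formula example but are regenerated here with an assumed seed, so the individual values differ from the cells above.

```python
import math

import numpy

rng = numpy.random.default_rng(0)   # assumed seed, for reproducibility only
a = rng.uniform(5, 10, 1000)
b = rng.uniform(10, 20, 1000)
c = rng.uniform(-0.1, 0.1, 1000)

# Pure Python math, one datum at a time.
roots_loop = numpy.array([
    (-bi + math.sqrt(bi**2 - 4*ai*ci)) / (2*ai)
    for ai, bi, ci in zip(a.tolist(), b.tolist(), c.tolist())
])

# Numpy math, one attribute at a time.
roots_array = (-b + numpy.sqrt(b**2 - 4*a*c)) / (2*a)

print("exactly equal?         ", (roots_loop == roots_array).all())
print("equal within tolerance?", numpy.allclose(roots_loop, roots_array))
print("largest difference:    ", numpy.abs(roots_loop - roots_array).max())
```

`numpy.isclose` gives the same comparison element by element, which helps when you want to find which entries disagree.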

<br>

(It wasn't the issue in this case, but in general, [floating point math is not associative](https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html): the order of operations matters at a microscopic scale.)
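
A small pure Python illustration of non-associativity, using the classic 0.1, 0.2, 0.3 example (chosen here for illustration; it is unrelated to the temperature data):

```python
# The same three numbers, added in two different groupings.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)

print(left)            # 0.6000000000000001
print(right)           # 0.6
print(left == right)   # False
```

The two results differ only in the last bit, which is why a tolerance-based comparison is the safer test.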

<br><br><br>

<br><br><br><br><br>

### Pandas: like Numpy, but convenient for analysis

<br><br><br><br><br>

In [None]:
# Our last example was based on pre-formatted data. The data I downloaded from NOAA looked like this:

with open("data/newark-temperature.csv") as file:
    print(file.read()[:10000])

In [None]:
# To use this with Numpy, we'd have to parse the CSV (import csv), handle the header line, etc.

# Instead, we use Pandas.
import pandas

df = pandas.read_csv("data/newark-temperature.csv", parse_dates=["DATE"])
df

In [None]:
# Computations on Pandas columns work like Numpy (and mostly defer to Numpy for the actual work).

(df["TMIN"] + df["TMAX"])/2

In [None]:
# But there are many built-in conveniences. The fillna method replaces values only if they are NaN.

df["imputed"] = df["TAVG"].fillna((df["TMIN"] + df["TMAX"]) / 2)
df
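
To see what `fillna` does on data small enough to check by eye, here is a minimal sketch with a made-up three-row table (the values and the name `df_demo` are assumptions for illustration). It also confirms that the result matches the `numpy.where` approach used earlier.

```python
import numpy
import pandas

# Made-up three-row table: the middle TAVG is missing.
df_demo = pandas.DataFrame({
    "TAVG": [30.5, numpy.nan, 41.0],
    "TMIN": [25.0, 28.0, 35.0],
    "TMAX": [36.0, 40.0, 47.0],
})

# fillna keeps the measured averages and fills only the NaN entries,
# taking the replacement from the row with the same index label.
df_demo["imputed"] = df_demo["TAVG"].fillna((df_demo["TMIN"] + df_demo["TMAX"]) / 2)

# The same computation written with numpy.where, for comparison.
with_where = numpy.where(
    df_demo["TAVG"].isna(),
    (df_demo["TMIN"] + df_demo["TMAX"]) / 2,
    df_demo["TAVG"],
)

print(df_demo)
print(with_where)
```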

In [None]:
%matplotlib inline

# It also has built-in plotting.

df["imputed"].plot()

In [None]:
# These things would be convenient by themselves, but there's also an essential difference:
# Pandas data are INDEXED, meaning that there's a special column indicating what each row means.

print(df.index)
print()

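# set_index promotes the DATE column to be the index. (reindex would instead look up the
# dates among the existing integer labels and return rows full of NaN.)
df2 = df.set_index("DATE")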
df2
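
What a date index makes possible can be sketched on a small made-up table (the values and the names `df_demo` and `by_date` are assumptions for illustration). The sketch uses `set_index` to promote the `DATE` column to the index, after which rows are looked up by date, not by position.

```python
import pandas

# Made-up three-day table with the same column names as the NOAA file.
df_demo = pandas.DataFrame({
    "DATE": pandas.to_datetime(["1883-01-01", "1883-01-02", "1883-01-03"]),
    "TMIN": [20.0, 25.0, 18.0],
    "TMAX": [34.0, 41.0, 30.0],
})

print(df_demo.index)    # default index: the integers 0, 1, 2

# Promote DATE to the index; it is no longer an ordinary column.
by_date = df_demo.set_index("DATE")
print(by_date.index)    # now a DatetimeIndex

# Label-based lookup: the row is selected by its date.
print(by_date.loc[pandas.Timestamp("1883-01-02"), "TMAX"])   # 41.0
```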