# Temperature dataset



In [None]:
import numpy as np
import matplotlib.pyplot as plt
import numexpr
import numba

Let's start with a plausible problem. We have a dataset of all daily temperatures measured at Newark since 1893 and we want to analyze it. First, this is the "pure Python" way to open it:

In [None]:
%%time
with open("data/newark-temperature-avg.txt") as file:
    temperatures = [float(line) for line in file]

temperatures = np.array(temperatures)
print(temperatures, len(temperatures), "elements")

> #### Note:
>
> Don't forget the *double* percent sign for cell magics! A single percent sign is a line magic, which times only that one line (basically nothing, if you meant to time the whole cell).

The cell above reads the file into a Python list and then converts the list to a NumPy array. Let's instead use NumPy directly, which saves memory by skipping the intermediate list:

In [None]:
%%time
temperatures = np.loadtxt("data/newark-temperature-avg.txt")
print(temperatures, len(temperatures), "elements")

Sadly, this does not save time, since `np.loadtxt` is trying to be general and supports several things, such as multiple columns. The gain in clarity and generality *should usually* be worth the extra time.
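To check that the two loading strategies agree, here is a minimal sketch that runs both on a small in-memory file. The sample values and the `sample_text` name are made up for illustration and stand in for the real data file.

```python
import io

import numpy as np

# Hypothetical stand-in for the temperature file: one value per line.
sample_text = "31.5\nnan\n40.0\n28.25\n"

# Pure Python: parse each line with float(), then convert the list.
from_list = np.array([float(line) for line in io.StringIO(sample_text)])

# NumPy: let loadtxt parse the file-like object directly.
from_loadtxt = np.loadtxt(io.StringIO(sample_text))

# assert_array_equal treats NaNs in matching positions as equal.
np.testing.assert_array_equal(from_list, from_loadtxt)
print(from_loadtxt.dtype, from_loadtxt.shape)
```

Both routes end with a one-dimensional `float64` array; only the intermediate Python list differs.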

Let's load the rest of the data:

In [None]:
min_temperatures = np.loadtxt("data/newark-temperature-min.txt")
max_temperatures = np.loadtxt("data/newark-temperature-max.txt")

Now let's check the fraction of NaN values:

In [None]:
fraction_nan = np.sum(np.isnan(temperatures)) / len(temperatures)
print(f"Fraction of values that are NaN: {fraction_nan:.2%}")
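Because `np.isnan` returns a Boolean array and `True` counts as 1, taking the mean of the mask gives the same fraction in one step. A small sketch, using made-up values in place of the real data:

```python
import numpy as np

# Made-up sample; the real array comes from the data file.
sample = np.array([30.0, np.nan, 45.5, np.nan, 50.0])

mask = np.isnan(sample)               # Boolean array, True where missing
by_sum = np.sum(mask) / len(sample)   # count of True divided by length
by_mean = mask.mean()                 # mean of Booleans = fraction of True

print(f"{by_sum:.2%} {by_mean:.2%}")
```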

Let's look at two ways of doing the same thing: computing the missing temperatures from the average of the min and max temperatures.

In [None]:
%%timeit

missing = np.isnan(temperatures)
imputed_temperatures = temperatures.copy()
imputed_temperatures[missing] = 0.5 * (
    min_temperatures[missing] + max_temperatures[missing]
)

In [None]:
%%timeit

imputed_temperatures = np.where(
    np.isnan(temperatures),  # condition
    0.5 * (min_temperatures + max_temperatures),  # if true
    temperatures,  # if false
)

Remember, `%%timeit` does not keep the variables assigned inside the cell, so let's repeat the calculation here. We will use `np.mean` because it is more descriptive, even though it is slower. If we built `minmax_temps = np.stack([min_temperatures, max_temperatures])` ahead of time and took the mean of that, it would be much closer in speed.

In [None]:
imputed_temperatures = np.where(
    np.isnan(temperatures),  # condition
    np.mean([min_temperatures, max_temperatures], axis=0),  # if true
    temperatures,  # if false
)

In [None]:
fraction_nan = np.sum(np.isnan(imputed_temperatures)) / len(imputed_temperatures)
print(f"Fraction of values that are NaN: {fraction_nan:.2%}")
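The imputation variants discussed above can be checked against each other on a tiny example. This is a sketch with made-up values; `avg`, `low`, and `high` are illustrative names standing in for the three Newark arrays.

```python
import numpy as np

# Made-up sample values standing in for the average, min, and max arrays.
avg = np.array([40.0, np.nan, 55.0, np.nan])
low = np.array([30.0, 20.0, 50.0, 41.0])
high = np.array([50.0, 30.0, 60.0, 47.0])

# Approach 1: copy, then overwrite only the missing entries.
missing = np.isnan(avg)
by_mask = avg.copy()
by_mask[missing] = 0.5 * (low[missing] + high[missing])

# Approach 2: compute the midpoint everywhere and select with np.where.
by_where = np.where(missing, 0.5 * (low + high), avg)

# Approach 3: np.mean over a stacked array of shape (2, n).
by_mean = np.where(missing, np.stack([low, high]).mean(axis=0), avg)

print(by_mask)
```

The mask version does arithmetic only on the missing entries, while the `np.where` versions compute the midpoint for every element and then discard most of them.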

Now, let's try a more interesting calculation (we are limited in what we can find interesting to do here until we introduce Pandas, since it's a simple dataset).

> #### Note:
> 
> These are *very* simple calculations, but we can still see performance differences.

In [None]:
%%timeit
c_temps = (imputed_temperatures - 32) * 5 / 9

Predict: Will this be slower, faster, or the same?

In [None]:
%%timeit
c_temps = (imputed_temperatures - 32) * (5 / 9)
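One way to reason about the prediction: `(x - 32) * 5 / 9` is evaluated left to right, so NumPy makes three passes over the array (subtract, multiply, divide), while `(x - 32) * (5 / 9)` computes the scalar first and makes only two. The sketch below uses a made-up array to confirm that the two forms agree up to floating-point rounding, and times them with the standard `timeit` module. Timings depend on the machine, so none are shown; the function names are illustrative.

```python
import timeit

import numpy as np

# Made-up Fahrenheit data; the size is arbitrary.
rng = np.random.default_rng(0)
f_temps = rng.uniform(-10.0, 100.0, size=100_000)


def multiply_then_divide(x):
    # subtract, multiply by 5, divide by 9: three passes over the data
    return (x - 32) * 5 / 9


def single_factor(x):
    # 5 / 9 is a Python float computed once: two passes over the data
    return (x - 32) * (5 / 9)


# Same values up to rounding error.
same = np.allclose(multiply_then_divide(f_temps), single_factor(f_temps))
print("results agree:", same)

for func in (multiply_then_divide, single_factor):
    seconds = timeit.timeit(lambda: func(f_temps), number=50)
    print(f"{func.__name__}: {seconds / 50 * 1e6:.1f} us per call")
```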

On older NumPy, the in-place version below used to be faster. Newer NumPy can reuse the temporary array in such expressions (temporary elision), so on Unix systems it should be about the same:

In [None]:
%%timeit
c_temps = imputed_temperatures - 32
c_temps *= 5 / 9

Sadly, this is too simple to get help from numexpr:

In [None]:
%%timeit
c_temps = numexpr.evaluate("(imputed_temperatures - 32) * (5/9)")

Even in this simple case, a properly compiled function can help out just a little:

In [None]:
@numba.vectorize((numba.float64(numba.float64),), target="parallel")
def convert(degrees):
    return (degrees - 32) * (5 / 9)

In [None]:
%%timeit
c_temps = convert(imputed_temperatures)

## Pandas

Let's try a little more analysis, but we will do it properly, in Pandas!

The datasets above were really part of the `newark-temperature.csv` file, so let's open that in Pandas:

In [None]:
import pandas as pd

In [None]:
df_orig = pd.read_csv(
    "data/newark-temperature.csv",
    index_col="DATE",
    usecols="DATE TAVG TMAX TMIN".split(),
    parse_dates=["DATE"],
)
df_orig

In [None]:
df_orig.info()

Let's fill in the NaN values:

In [None]:
df = df_orig.copy()
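# Assign through .loc; chained indexing (df.TAVG[mask] = ...) may write to a temporary copy
missing = df.TAVG.isnull()
df.loc[missing, "TAVG"] = df.loc[missing, ["TMAX", "TMIN"]].mean(axis=1)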
df

Or, even better:

In [None]:
df = df_orig.copy()
# Assign the result back; an in-place method called on df.TAVG may act on a temporary copy
df["TAVG"] = df.TAVG.where(df.TAVG.notnull(), df[["TMAX", "TMIN"]].mean(axis=1))
df

Better still:

In [None]:
df = df_orig.copy()
# Assign the result back; an in-place method called on df.TAVG may act on a temporary copy
df["TAVG"] = df.TAVG.fillna(df[["TMAX", "TMIN"]].mean(axis=1))
df

We did each of the above calculations on a copy, so we could modify it in place and leave `df_orig` untouched.
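To confirm that the three fill strategies are interchangeable, here is a minimal sketch on a made-up three-row frame. The `toy` frame and its values are hypothetical; only the column names match the Newark data.

```python
import numpy as np
import pandas as pd

# Made-up frame with the same column names as the Newark data.
toy = pd.DataFrame(
    {
        "TAVG": [40.0, np.nan, 55.0],
        "TMAX": [50.0, 30.0, 60.0],
        "TMIN": [30.0, 20.0, 50.0],
    },
    index=pd.to_datetime(["2000-01-01", "2000-01-02", "2000-01-03"]),
)
midpoint = toy[["TMAX", "TMIN"]].mean(axis=1)
missing = toy["TAVG"].isnull()

# 1. Boolean mask with .loc on a copy
via_loc = toy.copy()
via_loc.loc[missing, "TAVG"] = midpoint[missing]

# 2. where: keep values where the condition is True, else take midpoint
via_where = toy["TAVG"].where(toy["TAVG"].notnull(), midpoint)

# 3. fillna: replace only the missing values
via_fillna = toy["TAVG"].fillna(midpoint)

pd.testing.assert_series_equal(via_loc["TAVG"], via_where)
pd.testing.assert_series_equal(via_where, via_fillna)
print(via_fillna)
```

None of the three variants modifies `toy` itself, which mirrors how `df_orig` is preserved above.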

In [None]:
df.TAVG.plot(style=".")

In [None]:
df["1893-01-01":"1910-01-01"].TAVG.plot(style=".")

In [None]:
# Monthly means; pandas 2.2 and later prefer freq="ME" (month end) over "M"
dfm = df.groupby(pd.Grouper(freq="M")).mean()
dfm

In [None]:
dfm["1893-01-01":"1920-01-01"].TAVG.plot(style=".-")
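As an alternative sketch of the monthly grouping, the index can be converted to monthly periods and used as the group key. This is not what the notebook does; it is shown on a made-up four-day frame (`toy` is a hypothetical name) to make the grouping behavior visible.

```python
import pandas as pd

# Made-up daily values spanning two months: Jan 30, Jan 31, Feb 1, Feb 2.
days = pd.date_range("2000-01-30", periods=4, freq="D")
toy = pd.DataFrame({"TAVG": [30.0, 32.0, 40.0, 44.0]}, index=days)

# Group by calendar month using the index converted to monthly periods.
monthly = toy.groupby(toy.index.to_period("M")).mean()
print(monthly)
```

The result here is indexed by period (`2000-01`, `2000-02`) rather than by month-end timestamps, which is the main visible difference from the `pd.Grouper` version.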

Another thing we can do is a rolling mean; let's average over three years:

In [None]:
df.rolling(3 * 365).mean().plot()
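An integer window such as `3 * 365` counts rows, not calendar days, and by default the result is NaN until the window is full. The sketch below shows that behavior on a made-up five-element series, along with the `min_periods` parameter that relaxes it.

```python
import pandas as pd

# Made-up series; a window of 3 rows stands in for 3 * 365.
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

strict = toy.rolling(3).mean()                  # NaN until 3 rows are available
lenient = toy.rolling(3, min_periods=1).mean()  # average whatever is available

print(strict.tolist())
print(lenient.tolist())
```

With the default settings, the first two entries of `strict` are NaN; by the same logic the plot above has no values for roughly the first three years of data.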

### Pandas: speed

Pandas is not necessarily *faster* than raw NumPy. It is more descriptive and more powerful. When you need speed, ***profile*** your code first, then write just the part you need in Numba or something similar.

Here is the underlying array, as a `PandasArray` (newer pandas versions call it `NumpyExtensionArray`):

In [None]:
dfm.TAVG.array

Note that a Series, the 1D array that makes up each column of a DataFrame, actually stores two arrays: the data you see above and an index that labels each entry.

This array supports the NumPy array protocol, so `np.asarray` can wrap it directly:

In [None]:
arr = np.asarray(dfm.TAVG.array)
arr

In [None]:
arr.flags["OWNDATA"]

If this shows `False`, the array is a view of memory owned elsewhere, so no copies are involved. You can now take full advantage of anything you could do with a NumPy array. Note that if you want a NumPy array, you can use the shortcut:

In [None]:
dfm.TAVG.to_numpy()
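A more direct way to check for copies is `np.shares_memory`. The sketch below uses a made-up `float64` series; for such a plain numeric column I would expect both conversion routes to share one buffer, though this may differ for other dtypes.

```python
import numpy as np
import pandas as pd

# Made-up float64 series standing in for dfm.TAVG.
toy = pd.Series([1.0, 2.0, 3.0], name="TAVG")

from_array = np.asarray(toy.array)       # via the extension array
from_method = toy.to_numpy()             # the shortcut
explicit_copy = toy.to_numpy(copy=True)  # forces a separate buffer

print(np.shares_memory(from_array, from_method))
print(np.shares_memory(from_array, explicit_copy))
```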

### Pandas: alternatives

The Pandas DataFrame is wildly popular. So much so that it is being used as an API by projects that do things normal Pandas does not, such as out-of-core (larger-than-memory) DataFrames (Dask).

## See also:

* [CompClass: Structured data](https://nbviewer.jupyter.org/github/henryiii/compclass/blob/master/classes/week7/1_pandas.ipynb)

