# Problem Set 2.1: numpy arrays, Series, DataFrame

[![lite-badge](https://jupyterlite.rtfd.io/en/latest/_static/badge.svg)](https://leifwalsh.github.io/data-analysis-problem-sets/lab/index.html?path=2-pandas-basics/2.1-numpy-arrays-series-dataframe/2.1-numpy-arrays-series-dataframe.ipynb)

Learn about the core data structures in pandas:
- numpy arrays, which pandas builds on
- The 1-dimensional Series
- The 2-dimensional DataFrame

First, our imports:

In [None]:
# People usually import these with aliases:
import numpy as np
import pandas as pd

## numpy arrays

We've seen Python lists already: `[1, 2, 3, 4]` is a list of numbers.

Python lists can have any kind of thing inside, for example, `[1, "two", False,
4.0]` is perfectly valid. Try it:

In [None]:
things = [1, "two", False, 4.0]
things

In [None]:
print(things[1])
print(things[2])

Let's try adding up our `things`:

In [None]:
sum(things)

That raises a `TypeError`: Python can't add `1` and `"two"`.

Most of the time, when we work with lists, they tend to be lists of the same
kind of thing, and that means you can do things like add them all up, or apply
the same change to each thing in the list. But normal Python lists don't force
us to do that, so they can't provide operations that make sense only if that's
true.

`numpy` (short for Numerical Python) is a library that works with something a
bit stricter: arrays of *homogeneous types* - basically lists where we know
they're all numbers, or all Booleans.

numpy also happens to be extremely optimized for performance, but you don't
have to think about that.

Let's see numpy in action:

In [None]:
a1 = np.arange(1, 11)
a1
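To see what "homogeneous types" means in practice, here is a small sketch (the variable names are just for illustration). When you hand numpy a list, it picks one dtype that fits every element, so mixing whole numbers and decimals gives you an array of decimals:

```python
import numpy as np

# All whole numbers: numpy picks an integer dtype.
ints = np.array([1, 2, 3])

# One decimal in the list: every element is stored as a float.
mixed = np.array([1, 2.5, 3])

# All Booleans: numpy picks the bool dtype.
flags = np.array([True, False, True])

print(ints.dtype)
print(mixed.dtype)
print(flags.dtype)
```

The exact integer dtype can depend on your platform and numpy version, but each array always has exactly one dtype for all of its elements.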

A numpy array gives you a lot more than a Python list. Let's see some examples:

In [None]:
a1.sum()

You can multiply an array by a number (or add a number to it). This applies
the operation to each element in the array (this is called "broadcasting"):

In [None]:
a2 = a1 * 2
a2

In [None]:
a2.mean()

Adding arrays pairs up the elements and adds each pair:

In [None]:
a3 = a1 + a2
a3
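It may help to compare this with what plain Python lists do with the same operators. The sketch below uses small made-up values: for lists, `*` repeats and `+` concatenates, while for numpy arrays both work element by element.

```python
import numpy as np

numbers_list = [1, 2, 3]
numbers_array = np.array([1, 2, 3])

# Lists: * repeats the list, + joins two lists end to end.
list_doubled = numbers_list * 2             # [1, 2, 3, 1, 2, 3]
list_added = numbers_list + numbers_list    # [1, 2, 3, 1, 2, 3]

# Arrays: * and + are applied to each element.
array_doubled = numbers_array * 2              # array([2, 4, 6])
array_added = numbers_array + numbers_array    # array([2, 4, 6])

print(list_doubled, list_added)
print(array_doubled, array_added)
```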

Just like Python lists, you can get the length, and get individual elements:

In [None]:
print(len(a3))
print(a3[5])

Each numpy array has a `dtype` (for "data type"). This one knows it's `int64`
(64-bit positive and negative whole numbers).

In [None]:
a3.dtype

We'll also often see `float64`, which can have decimals:

In [None]:
a4 = a3 / a2
print(f"dtype: {a4.dtype!r}")
a4
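As a small illustration of how dtypes change (with made-up values), dividing with `/` gives a `float64` array even when every result is a whole number, while `//` (floor division) keeps an integer dtype. You can also convert explicitly with `astype`:

```python
import numpy as np

whole = np.array([3, 6, 9])

# True division always produces floats, even for exact results.
true_div = whole / 3

# Floor division of integers stays an integer dtype.
floor_div = whole // 3

# astype makes a converted copy with the dtype you ask for.
as_float = whole.astype("float64")

print(true_div.dtype, floor_div.dtype, as_float.dtype)
```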

There's a lot more you can do with numpy, but let's move on to `pandas`.

## pandas Series

A pandas `Series` is pretty much just a numpy array with a name and an
"index", which is a way of accessing the items in it other than by their
position. But we'll ignore the index for a while.

In [None]:
s1 = pd.Series(a1, name="numbers")
s1

pandas prints Series (yes, "series" is the plural of "series") as columns, and
shows you their positions as well (this is the index part). It also shows the
name and dtype. The name will be important when we get to DataFrames, but
mostly you can treat a Series the same as a numpy array:

In [None]:
s2 = s1 * 2
s2

In [None]:
s3 = s1 + s2
s3

In [None]:
s3.sum()

In [None]:
s3.count()

In fact, a Series isn't just "pretty much" a numpy array: it *contains* one.

In [None]:
s3.values
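Here is a short sketch, using a small made-up Series, that checks this directly. It uses `.to_numpy()`, which is the accessor the pandas documentation recommends over `.values`; for a plain numeric Series like this one, both give you a numpy array:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(1, 4), name="numbers")

# Pull out the numpy array that holds the Series' data.
underlying = s.to_numpy()

print(type(underlying))
print(underlying)
print(s.name, s.dtype)
```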

## pandas DataFrames

Finally, we usually don't just have individual lists of numbers floating
around; we have *tables* of them. Here's one:

In [None]:
df = pd.DataFrame({
    "item": ["pizza", "soup", "ice cream"],
    "price": [3.25, 7, 5],
    "quantity": [4, 2, 3],
})
df

A pandas `DataFrame` is a table, like a spreadsheet, where each column is a
Series with a name, and all the columns have the same length (actually the same
"index", but again, we'll get to that later). You can also have columns with
strings, or other non-numeric things.

You can think of a DataFrame like a Python `dict` where the values are Series:

In [None]:
df["price"]

In [None]:
df["item"]
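To make the `dict` comparison concrete, here is a minimal sketch with a small made-up table (the name `menu` is just for illustration). It checks that a column really is a Series, that the column's name matches its key, and that you can ask whether a column exists the same way you would with a dict:

```python
import pandas as pd

menu = pd.DataFrame({
    "item": ["pizza", "soup"],
    "price": [3.25, 7.0],
})

# Looking up a column by name gives back a Series.
price_column = menu["price"]
print(type(price_column).__name__)
print(price_column.name)

# Like a dict, you can list the keys and test for membership.
column_names = list(menu.columns)
print(column_names)
print("price" in menu)

# shape is (number of rows, number of columns).
print(menu.shape)
```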

Since each column is a Series, you can do all the numpy things with columns:

In [None]:
df["price"] * df["quantity"]

You can also assign new columns just like you can set new things in a dict:

In [None]:
df["cost"] = df["price"] * df["quantity"]
df

In [None]:
df["cost"].sum()

## Exercises

First, run this cell to set up the input data (in case you changed `df` above).

If you mess up `df`, you can just run this cell again to reset it.

In [None]:
df = pd.DataFrame({
    "item": ["pizza", "soup", "ice cream"],
    "price": [3.25, 7, 5],
    "quantity": [4, 2, 3],
    "net weight": [0.4, 0.5, 0.1]
})
df

### Exercise 1

Can you compute the cost column again, multiplying the item price by the quantity?

### Exercise 2

Calculate the total cost of all the items together.

### Exercise 3

Calculate the price per pound for each item (assume `net weight` is in pounds).

### Exercise 4

Calculate the total weight of all the items together.