## Tutorial 009

### Working with data
In data science we spend a lot of time – wait for it – working with ***data***! To work with data we need ways to:

* store the data in a systematic way
* access the data
* modify the format of the data for various needs

We have already met the data container we will ultimately be using a lot: the pandas DataFrame. So far, we've imported data directly into a DataFrame, passed that DataFrame to plotting functions, and accessed columns of data using the column names. We've even seen that DataFrames "know" how to plot the data inside them.

But to really understand and use DataFrames, it helps to understand and use what they are built upon, which are numpy arrays. Numpy arrays are grids or tables for holding, accessing, and manipulating data. DataFrames are literally built on top of numpy arrays, providing bells and whistles such as column names. 

Numpy arrays, in turn, are created and accessed in ways very similar to a built-in data container in Python called a "list". 

So what we are going to do is learn to work with lists (lists are handy!), then we'll graduate to numpy arrays and see what they can do. Finally, we'll step back up to DataFrames – with our newfound understanding of what's under the hood – in this and the next tutorial.

### Python lists

Lists in Python are just what they sound like, lists of things. We make them using square brackets.

In [1]:
mylist = [1, 3, 5, 7, 11, 13]

A list is an extension of a regular (or "scalar") `variable`, which can only hold one thing.

In [18]:
notalist = 3.14

In [19]:
notalist

3.14

In [20]:
mylist

[1, 3, 5, 7, 11, 13]

Aside: If we're working with only numbers, then you can think of a regular variable as a "scalar" and a list as a "vector".

List values are accessed using `index` values enclosed in square brackets.

Lists can hold things besides numbers.

In [21]:
mylist2 = ['this', 'is', 'a', 'list', 'of', 'words']

In [22]:
mylist2

['this', 'is', 'a', 'list', 'of', 'words']

Lists can even hold different kinds of things at once.

In [23]:
mylist3 = [1, 'one', [2, 3, 4]]

In [24]:
mylist3

[1, 'one', [2, 3, 4]]

(notice that this list holds another list at index=2)

We can get elements of a list by using their index values in square brackets.

In [25]:
mylist

[1, 3, 5, 7, 11, 13]

In [26]:
mylist[5]

13

In [27]:
mylist[0]

1

Notice that the first value is actually at index=0, not index=1. This is different from some other languages, such as R, which start counting at 1!
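Two more indexing moves are worth trying at this point: negative indexes, which count back from the end of the list, and chained brackets, which reach into a list stored inside another list. Here is a minimal sketch that re-creates `mylist` and `mylist3` so it runs on its own (the names `last`, `second_to_last`, `inner`, and `inner_first` are made up for illustration):

```python
mylist = [1, 3, 5, 7, 11, 13]
mylist3 = [1, 'one', [2, 3, 4]]

last = mylist[-1]            # -1 is the last element: 13
second_to_last = mylist[-2]  # -2 is the one before it: 11

inner = mylist3[2]           # the list stored at index 2: [2, 3, 4]
inner_first = mylist3[2][0]  # first element of that inner list: 2

print(last, second_to_last, inner, inner_first)
```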

We can get more than one value at a time if we want.

In [29]:
mylist[0:3]

[1, 3, 5]

If we go back to a junior high math class, we can read this as "Give me the interval **\[0,3)** - zero inclusive to 3 exclusive."

I know this is weird. At least for indexes a and b (with a no larger than b, and both within the list), the number of elements you get back from `mylist[a:b]` is always equal to b minus a, so I guess that's good!
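The "b minus a" rule is easy to check for yourself. The sketch below assumes `a` and `b` are both between 0 and the length of the list, with `a` no bigger than `b`. Outside that range Python quietly trims the slice instead of raising an error. The names `chunk` and `too_far` are just for illustration.

```python
mylist = [1, 3, 5, 7, 11, 13]

a, b = 1, 4
chunk = mylist[a:b]          # elements at indexes 1, 2, 3
print(chunk, len(chunk))     # [3, 5, 7] 3

# Check the rule for every valid pair of indexes
for a in range(len(mylist) + 1):
    for b in range(a, len(mylist) + 1):
        assert len(mylist[a:b]) == b - a

# A stop index past the end is trimmed rather than causing an error
too_far = mylist[4:100]
print(too_far)               # [11, 13]
```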

We can get any consecutive hunk of elements this way.

In [31]:
mylist[3:5]

[7, 11]

If you omit the indexes, Python will assume you want everything.

In [34]:
mylist[:]

[1, 3, 5, 7, 11, 13]

That doesn't seem very useful! But actually, it will turn out to be later on, when we start using numpy arrays!

If you just use one index, the colon is assumed to mean "from the beginning" or "to the end". Like this:

In [38]:
mylist[:3]

[1, 3, 5]

And this:

In [37]:
mylist[3:]

[7, 11, 13]

In addition to the `list[start:stop]` syntax, you can add a step after a second colon, as in `list[start:stop:step]`

In [48]:
mylist[0:5:2]

[1, 5, 11]
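The start and stop can be left out here too, just as with a plain slice, and the step can even be negative. A short sketch (the variable names are illustrative only):

```python
mylist = [1, 3, 5, 7, 11, 13]

even_positions = mylist[::2]   # indexes 0, 2, 4 -> [1, 5, 11]
odd_positions = mylist[1::2]   # indexes 1, 3, 5 -> [3, 7, 13]
backwards = mylist[::-1]       # a step of -1 walks the list in reverse

print(even_positions, odd_positions, backwards)
```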

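As you've probably figured out, all the slices above have given us lists. So if we assign the output a name, it will be another list. (A negative index counts back from the end, so the `-1` in the next cell means "stop before the last element".)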

In [44]:
everyotherone = mylist[0:-1:2]

In [45]:
everyotherone

[1, 5, 11]

See!

If we want a group of elements that aren't evenly spaced, we'll need to specify the indexes "by hand".

In [46]:
anothernewlist = [mylist[1],mylist[2],mylist[4]]

In [47]:
anothernewlist

[3, 5, 11]
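If the list of positions is long, typing each `mylist[i]` gets tedious. One common alternative (not the only one) is a list comprehension that loops over the indexes you want. The names `wanted` and `picked` are made up for this sketch:

```python
mylist = [1, 3, 5, 7, 11, 13]

wanted = [1, 2, 4]                     # the indexes we want, in order
picked = [mylist[i] for i in wanted]   # look up each index in turn

print(picked)                          # [3, 5, 11]
```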

So those are the basics of lists. They:

* store a list of things (duh)
* start at index zero
* can be accessed using
    - square brackets `[]`
    - integer indexes (including negative "start from the end" indexes)
    - a colon `:` (or two if you want a step value other than 1)
    


What's awesome is that numpy arrays were designed to be lists with superpowers, so everything we just learned applies to numpy arrays as well!

### Numpy Arrays

Not to state the obvious, but to use numpy arrays, we'll need to import numpy.

In [49]:
import numpy as np

Then we can make an array in much the same way we make a list.

In [50]:
myarr = np.array([2, 4, 6, 8, 9, 10])

In [51]:
myarr

array([ 2,  4,  6,  8,  9, 10])

From then on, all the indexing we've learned so far applies directly!

In [52]:
myarr[4]

9

In [55]:
myarr[-3:]

array([ 8,  9, 10])

We can make an array directly out of a list.

In [61]:
myarr2 = np.array(mylist)

In [59]:
myarr2

array([ 1,  3,  5,  7, 11, 13])

And then of course we can index it exactly the same way, so... *Wait, why are we making arrays now? What's the difference?*

One **huge** difference is that it's cumbersome to do maths with lists.

In [60]:
mylist + 5

TypeError: can only concatenate list (not "int") to list

That didn't work, but this does!

In [62]:
myarr + 5

array([ 7,  9, 11, 13, 14, 15])

Now **that** seems like it might be useful!
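For comparison, here is roughly what the same job looks like with a plain list: we have to visit each element ourselves, for example with a list comprehension. Note also that `+` between two lists joins them end to end rather than adding their values. A small sketch (the names `plus_five_list`, `joined`, and `plus_five_arr` are illustrative):

```python
import numpy as np

mylist = [1, 3, 5, 7, 11, 13]

# With a list we add 5 to each element one at a time
plus_five_list = [x + 5 for x in mylist]

# "+" on two lists concatenates them
joined = mylist + [5]

# With an array the addition is applied to every element at once
plus_five_arr = np.array(mylist) + 5

print(plus_five_list)   # [6, 8, 10, 12, 16, 18]
print(joined)           # [1, 3, 5, 7, 11, 13, 5]
print(plus_five_arr)
```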

We can even add two arrays, or subtract them, or whatever!

In [97]:
myarr + myarr2

array([ 3,  7, 11, 15, 20, 23])
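The "or whatever" covers the other arithmetic and comparison operators too. Each one works element by element, pairing up the values that sit at the same index. A sketch with the two arrays from above, re-created here so the block runs on its own:

```python
import numpy as np

myarr = np.array([2, 4, 6, 8, 9, 10])
myarr2 = np.array([1, 3, 5, 7, 11, 13])

diff = myarr - myarr2      # [ 1,  1,  1,  1, -2, -3]
prod = myarr * myarr2      # [ 2, 12, 30, 56, 99, 130]
bigger = myarr > myarr2    # element-by-element comparison gives True/False

print(diff)
print(prod)
print(bigger)
```

This pairing-up only works directly when the two arrays have the same shape, as these two do.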

We can combine our 2 arrays into a single *two dimensional* array.

In [72]:
twoDarr = np.array([myarr, myarr2])

In [73]:
twoDarr

array([[ 2,  4,  6,  8,  9, 10],
       [ 1,  3,  5,  7, 11, 13]])

Simple though this may seem, 2D arrays just like this are the bedrock of data analysis! Arrays of real data are usually larger – sometimes much much larger! – but all the principles are the same.

One thing besides maths that arrays can do is tell us their shape.

In [74]:
twoDarr.shape

(2, 6)

Unlike lists, which are always just lists, arrays can come in any shape. So it's *really* convenient that they can tell us what shape they are straight away.
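The shape is an ordinary Python tuple, so it can be compared, indexed, or unpacked into separate names. The sketch below also shows two related attributes, `ndim` and `size`, which are not used elsewhere in this tutorial. The names `n_rows` and `n_cols` are illustrative.

```python
import numpy as np

myarr = np.array([2, 4, 6, 8, 9, 10])
myarr2 = np.array([1, 3, 5, 7, 11, 13])
twoDarr = np.array([myarr, myarr2])

print(myarr.shape)     # (6,)    a 1D array has a one-element shape
print(twoDarr.shape)   # (2, 6)  rows first, then columns
print(twoDarr.ndim)    # 2       number of dimensions
print(twoDarr.size)    # 12      total number of elements

n_rows, n_cols = twoDarr.shape   # the shape unpacks like any tuple
```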

Indexing into 2D arrays is a straightforward extension of indexing into 1D arrays or lists. Like this.

In [76]:
twoDarr[1,3]

7

The first index refers to the row index, and the second to the column index. In this case, we're asking for the value in the second row and the fourth column, which is indeed 7 (remember the first row and column are index=0!).

We can play all the same games indexing with 2D arrays as we can with 1D arrays, we just have to remember that everything before the comma `,` refers to the *rows* in that it works along the *vertical dimension*, and everything after it refers to the *columns* in that it works along the *horizontal dimension*.

So this:

In [77]:
twoDarr[:,0:3]

array([[2, 4, 6],
       [1, 3, 5]])

means "Give me all the rows (the colon) and then (the comma) give me the first 3 columns (the `0:3`)."

A few more examples:

In [84]:
twoDarr[-1,:] # the last row (regardless of the number of rows)

array([ 1,  3,  5,  7, 11, 13])

In [86]:
twoDarr[:,-2:] # last two columns

array([[ 9, 10],
       [11, 13]])

In [88]:
twoDarr[0,::2] # first row, every other column

array([2, 6, 9])
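Arrays also offer a shortcut for the "not evenly spaced" case that needed indexes typed out by hand with lists: you can put a list of indexes inside the brackets. This is a numpy feature, and plain lists do not support it. A sketch (the names `picked` and `some_cols` are made up):

```python
import numpy as np

myarr2 = np.array([1, 3, 5, 7, 11, 13])
twoDarr = np.array([[2, 4, 6, 8, 9, 10],
                    [1, 3, 5, 7, 11, 13]])

picked = myarr2[[1, 2, 4]]          # elements at indexes 1, 2 and 4
some_cols = twoDarr[:, [0, 2, 5]]   # all rows, columns 0, 2 and 5

print(picked)      # [ 3  5 11]
print(some_cols)
```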

To get good at this, you don't need natural born talent or anything like that. Like so much in life, the key is *practice, practice, practice*!!! So play around! You can't break your computer or anything!

Another neat trick that arrays can do is *transpose* themselves, flipping the rows for columns.

(Hold your right hand in front of your face so that you're looking at your palm with your fingers pointing towards the left. Now flip your hand so that you're looking at the back of your hand with your fingers pointing up. You just *transposed* your hand such that the first row (your pointer finger) became the first column!)

In [89]:
colarr = twoDarr.T

In [90]:
colarr

array([[ 2,  1],
       [ 4,  3],
       [ 6,  5],
       [ 8,  7],
       [ 9, 11],
       [10, 13]])

Why would we want to do that? By convention, *variables* in datasets should correspond to the columns, and *observations* should correspond to the rows. So we have taken data in which this was not so and turned it into an array in which the columns are (loosely speaking) the non-prime numbers and the prime numbers, respectively, and the rows correspond to the instances in order (1st prime, 2nd prime, etc.). Strictly, 2 is prime and 1 is not, but we'll keep these labels for convenience.
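A quick way to convince yourself of what the transpose did is to compare shapes and look up the same value before and after. In this sketch `twoDarr` is typed out directly so the block runs on its own; `first_variable` and `second_variable` are illustrative names.

```python
import numpy as np

twoDarr = np.array([[2, 4, 6, 8, 9, 10],
                    [1, 3, 5, 7, 11, 13]])
colarr = twoDarr.T

print(twoDarr.shape, colarr.shape)   # (2, 6) (6, 2)

# The value at row 1, column 3 moves to row 3, column 1
print(twoDarr[1, 3], colarr[3, 1])   # 7 7

first_variable = colarr[:, 0]    # first column: what used to be the first row
second_variable = colarr[:, 1]   # second column: what used to be the second row
```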

We have just done a little of what is known as **data wrangling**. While not as fun as data visualization, data wrangling is often a big part of any analysis project!

Now that we have the data into shape, we can unleash all the powers of numpy arrays, powers which pandas DataFrames will inherit and build upon!

For example, who's bigger overall, the primes or the non-primes?

In [93]:
colarr.sum(0)

array([39, 40])

The primes win! 
In `colarr.sum(0)`, the 0 means "the first (vertical) dimension", i.e., move down the rows and add up the values within each column, giving one total per column. To sum along the second dimension, giving one total per row, we do:

In [96]:
colarr.sum(1)

array([ 3,  7, 11, 15, 20, 23])
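The number passed to `sum` is the `axis` argument, and it can be written out by name, which some people find easier to read. Other built-in array methods such as `mean` and `max` accept it in the same way. A sketch, with `colarr` rebuilt from scratch so the block is self-contained (the result names are illustrative):

```python
import numpy as np

colarr = np.array([[2, 4, 6, 8, 9, 10],
                   [1, 3, 5, 7, 11, 13]]).T

col_totals = colarr.sum(axis=0)   # same as colarr.sum(0): one total per column
row_totals = colarr.sum(axis=1)   # same as colarr.sum(1): one total per row
col_means = colarr.mean(axis=0)   # average of each column
col_max = colarr.max(axis=0)      # largest value in each column
grand_total = colarr.sum()        # no axis given: add up everything

print(col_totals, row_totals, col_means, col_max, grand_total)
```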

The list of things that numpy arrays can do themselves is pretty impressive.

Check it out [here](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html).

(or paste this into your browser: https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html)

So those are the basics of numpy arrays. They:

* store values in rows and columns
* start each dimension at index zero (like lists)
* can be accessed using
    - square brackets `[]` with row and column indexes separated by a comma
    - integer indexes (including negative "start from the end" indexes)
    - a colon `:` (or two if you want a step value other than 1)
* can have maths done to every element in one go
* can be added, subtracted, etc. from one another
* have superpowers! they can compute stuff along their rows and columns!