## Working with data preliminaries

Data scientists spend a lot of time – wait for it – working with ***data***! To work with **data**, it is critical to organize it in a way that facilitates the analyses we might need to do. So organizing data means guessing what type of work we will want to do with the dataset. And, odd as it may seem, good guessing requires some practice. A good way of organizing data should:

* store the data in a clear and systematic way
* provide methods to access the data that are simple and straightforward
* be flexible enough to allow the format of the data to be modified for various needs

We have already met a data container that is commonly used and that we will ultimately be using a lot: the Pandas DataFrame. So far, we have imported data directly into a DataFrame, passed that DataFrame to plotting functions, and accessed columns of data using the column names. We have even seen that DataFrames "know" how to plot the data inside them. In more technical terms, this means that the DataFrame is an object that has basic methods to plot and explore the data it contains.

Next, to really understand and use DataFrames, it helps a great deal to understand what they are built upon: `numpy arrays`. Numpy arrays are grids or tables for holding, accessing, and manipulating data. DataFrames are literally built on top of numpy arrays, providing many bells and whistles, such as column names, neat formatting, etc.

Numpy arrays are simpler than Pandas DataFrames. They are created and accessed in ways that are very similar to the ways Python `lists`, which are among the simplest built-in data containers, are created and accessed. This is no accident of course; if you can work with one, then it's easy to work with the other.

So what we are going to do is to learn how to work with `lists` (lists are handy!), then we will graduate to `numpy arrays` and see what they can do. Finally, we'll step back up to `DataFrames` – with our newfound understanding of what's under the hood – in this and the next tutorial.

After all this, you will be `Masters of data` or at least know more about Python.

### Python lists

Lists in Python are just what they sound like, lists of things. We make them using `[square brackets]`.

In [None]:
mylist = [1, 3, 5, 7, 11, 13]

A list is an extension of a regular (or "scalar") `variable`, which can only hold one thing at a time.

In [None]:
notalist = 3.14

In [None]:
notalist

In [None]:
mylist

Aside: If we're working with only numbers, then you can think of a regular variable as a "scalar" and a list as a "vector".

Lists can, however, hold things besides numbers. For example, they can hold 'text'.

In [None]:
mylist2 = ['this', 'is', 'a', 'list', 'of', 'words']

In [None]:
mylist2

(Some people, even us, might casually call this a vector but that's technically not true.)

In reality, lists can hold all sort of things, say numbers (scalars), 'text' and even other lists, and all at once.

In [None]:
mylist3 = [1, 'one', [2, 3, 4]]

In [None]:
mylist3

Note that this last list holds another list at index=2.

We can get elements of a list by using `index` values in square brackets.

In [None]:
mylist

In [None]:
mylist[5]

In [None]:
mylist3[2]
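
Since the element at index=2 of `mylist3` is itself a list, we can index into it again. Here is a small sketch of that idea (the name `inner` is just for illustration):

```python
mylist3 = [1, 'one', [2, 3, 4]]

inner = mylist3[2]       # the nested list [2, 3, 4]
print(inner[1])          # 3
print(mylist3[2][1])     # 3, the same thing done in one step
```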

**Remember that in Python the first value in a list is actually at index=0, not index=1. This is different than many other languages including R and MATLAB!**

We can address more than one element in a list by using the `:` (colon) operator.

In [None]:
mylist[0:3]

We can read this as "Give me all the elements in the interval from 0 **inclusive** to 3 **exclusive**."

I know this is weird. But at least for any two indexes `a` and `b` (with `a` no bigger than `b`, and both within the list), the number of elements you get back from `mylist[a:b]` is always equal to `b` minus `a`, so I guess that's good!
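
A quick way to convince yourself of this counting rule is to try it. The sketch below uses the same values as `mylist`; the names `a`, `b` and `chunk` are just illustrative choices.

```python
mylist = [1, 3, 5, 7, 11, 13]

a, b = 1, 4            # illustrative start and stop indexes
chunk = mylist[a:b]    # elements at indexes 1, 2 and 3

print(chunk)           # [3, 5, 7]
print(len(chunk))      # 3, which is b - a
```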

We can get any consecutive hunk of elements using `:`.

In [None]:
mylist[2:5]

If you omit the indexes, Python will assume you want everything.

In [None]:
mylist[:]

That doesn't seem very useful... But, actually, it will turn out to be **really** useful later on, when we will start using numpy arrays!

If you just use one index, the `:` is assumed to mean "from the beginning" or "to the end". Like this:

In [None]:
mylist[:3] # from the beginning to 3

And this:

In [None]:
mylist[3:] # from 3 to the end

In addition to the `list[start:stop]` syntax, you can add a step after a second colon, as in `list[start:stop:step]`. This asks for all the elements between `start` and `stop` but in steps of `step`, so not necessarily consecutive elements. For example, every other element:

In [None]:
mylist[0:5:2] # get every other element

As you've probably figured out, all our outputs above have been lists. So if we assign the output a name, it will be another list.

In [None]:
every_other_one = mylist[0:-1:2] # for this list, mylist[0::2] gives the same result

In [None]:
every_other_one

See!
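
One thing worth checking for yourself: a stop of `-1` means "up to, but not including, the last element", while leaving the stop out means "all the way to the end". For our six-element list the two happen to agree, but they can differ. A small sketch, using a made-up five-element list (`odd_length`) for comparison:

```python
even_length = [1, 3, 5, 7, 11, 13]   # same values as mylist
odd_length = [10, 20, 30, 40, 50]    # made-up list with 5 elements

print(even_length[0:-1:2])   # [1, 5, 11]
print(even_length[0::2])     # [1, 5, 11]

print(odd_length[0:-1:2])    # [10, 30]
print(odd_length[0::2])      # [10, 30, 50]
```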

If we want a group of elements that aren't evenly spaced, we'll need to specify the indexes "by hand".

In [None]:
anothernewlist = [mylist[1],mylist[2],mylist[4]]

In [None]:
anothernewlist
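
Typing each index "by hand" gets tedious quickly. One possible shortcut, sketched here with a list comprehension, is to keep the indexes we want in their own list (the names `wanted` and `picked` are just illustrative):

```python
mylist = [1, 3, 5, 7, 11, 13]
wanted = [1, 2, 4]    # the indexes we want

# go through the wanted indexes one by one and grab each element
picked = [mylist[i] for i in wanted]
print(picked)         # [3, 5, 11]
```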

So those are the basics of lists. They:

* store a list of things (duh)
* start at index zero
* can be accessed using three things together:
    - square brackets `[]`
    - integer indexes (including negative "start from the end" indexes)
    - a colon `:` (or two if you want a step value other than 1)
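
The negative "start from the end" indexes are worth a quick look on their own. A minimal sketch, using the same values as `mylist`:

```python
mylist = [1, 3, 5, 7, 11, 13]

print(mylist[-1])     # 13, the last element
print(mylist[-2])     # 11, the second to last
print(mylist[-3:])    # [7, 11, 13], the last three elements
print(mylist[:-1])    # [1, 3, 5, 7, 11], everything but the last
```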
    


So now that we have explored how to use basic lists in Python we can study Numpy Arrays. 

### Numpy Arrays

Numpy arrays were designed to be lists with superpowers, so everything we learned in the previous section will apply to numpy arrays as well!

Not to state the obvious, but to use numpy arrays, we'll need to `import numpy`.

In [None]:
import numpy as np

Then we can make an array in much the same way we make a list.

In [None]:
myarr = np.array([2, 4, 6, 8, 9, 10])

In [None]:
myarr

From then on, all the indexing we've learned so far applies directly!

In [None]:
myarr[4]

In [None]:
myarr[-3:]

We can make an array directly out of a list.

In [None]:
myarr2 = np.array(mylist)

In [None]:
myarr2

And then of course we can index it exactly the same way, so... *Wait, why are we making arrays now? What's the difference?*

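One **huge** difference is what happens when we try to do math. Basic Python lists can hold any mix of data types, so Python does not treat them as collections of numbers, and arithmetic on a list as a whole will generally not do what you might expect. For example, adding a number to a list fails with an error: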

In [None]:
mylist + 5 # this raises a TypeError

`numpy` arrays, by contrast, hold elements of a single type, usually numbers. This is what makes it possible to do math with arrays. So, whereas the addition above did not work with the `list`, it does work with the `numpy` array, even though both contain the same elements!

In [None]:
myarr2 + 5

Now **that** seems like it might be useful!
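
To see the contrast side by side, here is a small sketch. It catches the error from the list so that the cell keeps running; the names `list_error`, `glued` and `shifted` are just illustrative.

```python
import numpy as np

mylist = [1, 3, 5, 7, 11, 13]

# For lists, + means "glue together", so adding a plain number fails...
try:
    mylist + 5
    list_error = None
except TypeError as err:
    list_error = err
print(type(list_error).__name__)   # TypeError

# ...and adding another list concatenates instead of doing math.
glued = mylist + [5]
print(glued)                        # [1, 3, 5, 7, 11, 13, 5]

# For arrays, + means element-by-element arithmetic.
shifted = np.array(mylist) + 5
print(shifted.tolist())             # [6, 8, 10, 12, 16, 18]
```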

We can even add two arrays, or subtract them, or whatever!

In [None]:
myarr + myarr2
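
Here is what "or whatever" can look like in practice. In this sketch each operation pairs up the elements at the same position in the two arrays, which is why the arrays need to have the same length. The result names are illustrative.

```python
import numpy as np

myarr = np.array([2, 4, 6, 8, 9, 10])
myarr2 = np.array([1, 3, 5, 7, 11, 13])

total = myarr + myarr2
difference = myarr - myarr2
product = myarr * myarr2

print(total.tolist())        # [3, 7, 11, 15, 20, 23]
print(difference.tolist())   # [1, 1, 1, 1, -2, -3]
print(product.tolist())      # [2, 12, 30, 56, 99, 130]
```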

We can combine our 2 arrays into a single ***two dimensional (2D) array***.

In [None]:
twoDarr = np.array([myarr, myarr2])

In [None]:
twoDarr

Simple though this may seem, *2D arrays just like this are the bedrock of data analysis!* Arrays of real data are usually larger – sometimes much much larger! – but all the principles are the same, and all you as a Data Scientist need to remember is the dimensionality of the data arrays. Python will then compute what you ask for.

So one important thing to know about arrays, besides the ability to do maths, is the shape of arrays. Indeed, numpy arrays have built-in functionality to tell us their shape.

In [None]:
twoDarr.shape

Unlike lists, which are always just lists, arrays can come in any shape. So it's *really* convenient that they can tell us what shape they are straight away.
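
As a small sketch of what `.shape` reports, here it is for a 1D and a 2D array built from the same values as above. The `.ndim` and `.size` attributes (number of dimensions and total number of elements) are shown alongside it for comparison.

```python
import numpy as np

myarr = np.array([2, 4, 6, 8, 9, 10])
twoDarr = np.array([[2, 4, 6, 8, 9, 10],
                    [1, 3, 5, 7, 11, 13]])

print(myarr.shape)     # (6,)   one dimension with 6 elements
print(twoDarr.shape)   # (2, 6) 2 rows and 6 columns
print(twoDarr.ndim)    # 2
print(twoDarr.size)    # 12
```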

Indexing into 2D arrays is a straightforward extension of indexing into 1D arrays or lists. We just provide a second index after a `,` (comma). Like this.

In [None]:
twoDarr[1,3]

The first index refers to the **row index**, and the second to the **column index**. In this case, we're asking for the value in the second row and the fourth column, which is indeed 7 (remember *the first row and column are index=0!*).

We can play all the same indexing games with 2D arrays as we can with 1D arrays; we just have to remember that everything before the comma `,` refers to the *rows* in that it specifies locations along the *vertical dimension*, and everything after the comma `,` refers to the *columns* in that it specifies locations along the *horizontal dimension*.

So this:

In [None]:
twoDarr[:,0:3]

means "Give me all the rows (the colon `:`) in the first 3 columns (the `0:3`)."

I told you that the colon all by itself would end up being useful!!! In this case, for example, by using the `:` you do not need to type many indices (one per row), and you do not even need to remember how many rows there are; just use `:` and Python will return all the elements.

A few more examples:

In [None]:
# the last row (regardless of the number of rows;
# again, you do not need to know how many rows exist)
twoDarr[-1,:]

In [None]:
twoDarr[:,-2:] # last two columns

In [None]:
twoDarr[0,::2] # first row, every other column

To get good at this, you don't need natural born talent or anything like that. Like so much in life, the key is *practice, practice, practice*!!! So play around! You can't break your computer or anything!
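
To help with the practice, here is a sketch that collects the indexing examples from this section in one place, with the result of each written next to it. The array holds the same values as `twoDarr`; `.tolist()` is only there to make the printed output compact.

```python
import numpy as np

twoDarr = np.array([[2, 4, 6, 8, 9, 10],
                    [1, 3, 5, 7, 11, 13]])

print(twoDarr[1, 3])              # 7, second row, fourth column
print(twoDarr[:, 0:3].tolist())   # [[2, 4, 6], [1, 3, 5]], all rows, first 3 columns
print(twoDarr[-1, :].tolist())    # [1, 3, 5, 7, 11, 13], last row
print(twoDarr[:, -2:].tolist())   # [[9, 10], [11, 13]], last two columns
print(twoDarr[0, ::2].tolist())   # [2, 6, 9], first row, every other column
```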

Another neat trick that arrays can do is *transpose* themselves, swapping the rows and the columns.

(Hold your right hand in front of your face so that you're looking at your palm with your fingers pointing towards the left. Now flip your hand so that you're looking at the back of your hand with your fingers pointing up. You just *transposed* your hand such that the first row (your pointer finger) became the first column!)

In [None]:
colarr = twoDarr.T

In [None]:
colarr
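
A short sketch of what the transpose does to the shape and to individual elements: the shape flips from (2, 6) to (6, 2), and the value that lived at row 1, column 3 now lives at row 3, column 1.

```python
import numpy as np

twoDarr = np.array([[2, 4, 6, 8, 9, 10],
                    [1, 3, 5, 7, 11, 13]])
colarr = twoDarr.T

print(twoDarr.shape)                    # (2, 6)
print(colarr.shape)                     # (6, 2)
print(colarr[:3].tolist())              # [[2, 1], [4, 3], [6, 5]], first three rows
print(twoDarr[1, 3] == colarr[3, 1])    # True
```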

Why would we want to do that? By convention, *variables* in datasets should correspond to the columns, and *observations* should correspond to the rows. So we have taken data in which this was not so and turned it into an array in which the two columns hold our two sets of numbers, which we will loosely call the "non-primes" and the "primes" (loosely, because 2 is actually prime and 1 is not), and the rows correspond to the instances in order (1st, 2nd, 3rd, ...).

We have just done a little of what is known as **data wrangling**. While not as fun as data visualization, data wrangling is often a big part of any analysis project!

Now that we have the data in shape, we can unleash all the powers of numpy arrays, powers which pandas DataFrames will inherit and build upon!

For example, who's bigger overall, the primes or the non-primes?

In [None]:
colarr.sum(0)

The primes win! 
In `colarr.sum(0)`, the 0 means "the first (vertical) dimension", i.e., sum the values *across the rows* within each column. To sum along the second dimension, we do:

In [None]:
colarr.sum(1)

So any numpy array knows how to add up the numbers in it by row or by column (see what happens if you leave off the dimension, like this: `colarr.sum()`). The list of things that numpy arrays can do themselves is pretty impressive.
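
Here is a sketch of the three ways of summing, with the results written out. The array holds the same values as `colarr`. The dimension can also be given by name as `axis=`, which some people find easier to read, and other methods such as `.max()` accept the same argument.

```python
import numpy as np

colarr = np.array([[2, 1], [4, 3], [6, 5], [8, 7], [9, 11], [10, 13]])

print(colarr.sum(0).tolist())        # [39, 40], one total per column
print(colarr.sum(axis=1).tolist())   # [3, 7, 11, 15, 20, 23], one total per row
print(colarr.sum())                  # 79, everything added together
print(colarr.max(axis=0).tolist())   # [10, 13], largest value in each column
```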

Check it out [here](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html).

(or paste this into your browser: https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html)

Often, we want to create an array that we know we're going to put values in later. For example, we might be planning on doing a computation that will result in 3 sets of 7 values, and we want to be able to store them directly into an array. We can pre-make an array filled with zeros with `np.zeros((r, c))`, where `r` and `c` are the number of rows and columns (note the double parentheses: the shape is passed as a single tuple).

In [None]:
myzeros = np.zeros((7, 3))

In [None]:
myzeros
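
Once the array of zeros exists, we can fill it in using the same indexing we used for reading. The sketch below stands in for a real computation with some made-up values (`values` and `results` are illustrative names). Note that `np.zeros` makes floating point zeros by default, so whole numbers stored in it come back as floats.

```python
import numpy as np

results = np.zeros((7, 3))    # 7 rows, 3 columns, all 0.0
print(results.shape)          # (7, 3)
print(results.dtype)          # float64

# Made-up computation: fill each column with a different set of 7 values
values = np.arange(7)         # 0, 1, ..., 6
results[:, 0] = values
results[:, 1] = values * 2
results[:, 2] = values ** 2

print(results[3].tolist())    # [3.0, 6.0, 9.0], the fourth row
```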

So those are the basics of numpy arrays. They:

* store values in rows and columns
* each dimension starts at index zero (like lists)
* can be accessed using
    - square brackets `[]` with row and column indexes separated by a comma
    - integer indexes (including negative "start from the end" indexes)
    - a colon `:` (or two if you want a step value other than 1)
* can have maths done to every element in one go
* can be added, subtracted, etc. from one another
* have superpowers! they can compute stuff along their rows and columns!

So in this tutorial we have shown how to organize and manipulate data using Python `lists` and `numpy` `arrays`. 
The operations that are available for these two data types will be the base for many things that you might need to do as a Data Scientist.  