# Building up to DataFrames

In this module we will build up to the key data structure in Python data
analysis, the `pandas` DataFrame. The DataFrame is essentially a spreadsheet,
and you can do extremely powerful operations when your data is stored as a
DataFrame.

First, however, we need to look at a handful of other types of objects that
can store multiple values, a family of objects that Python calls "iterables".

## The noble list

To me the list is the general workhorse of doing stuff in Python. It is
mutable, meaning it can be changed after creation, and it can hold objects of
any type. It's like a typeless vector in C++ or an array in JavaScript.

In [84]:
# this is a list
my_list = [1, 2.0, 3, "four", 17]
print(f"This is my list: {my_list}")

This is my list: [1, 2.0, 3, 'four', 17]


We access elements in our list the same way we do in other languages: by
indexing with integer values that correspond to the positions of the list
elements. Like most other programming languages, indexing in python starts at 0.

| index | value |
| ----- | ----- |
| 0 | 1 |
| 1 | 2.0 |
| 2 | 3 |
| 3 | "four" |
| 4 | 17 |


In [85]:
# lets pull out the second element of our list (index = 1)
print(f"\nThis is the second element in my list: {my_list[1]}")


This is the second element in my list: 2.0


We can also grab a **slice** of a list. This is a contigious chunk of elements
in our list. We do this inside the indexing square brackets with the syntax
`[start_index (inclusive):end_index (exclusive)]`. Here by inclusive and
exclusive I mean that the element at index `start_index` will be in your slice
but the element at `end_index` will not. We will see this over and over again.

In [128]:
print(f"This is my slice: {my_list[1:3]}")

This is my slice: [2.0, 3]


We can also leave out the start or the end of a slice to grab everything from
the beginning of the list or everything up to its end.

In [129]:
print(f"Beginning slice: {my_list[:2]}")
print(f"end slice: {my_list[2:]}")

Beginning slice: [1, 2.0]
end slice: [3, 'four', 17]


Earlier I mentioned the mutability of lists. This makes them flexible but also
slower than the fixed-size, single-type numpy arrays we will meet below.

For our purposes here, it means we can add elements to them. The main way is
by a list's `append()` method, which adds the new value onto the end of the list.

In [86]:
my_list.append("six")
print(f"\nThis is my list after adding to it: {my_list}")


This is my list after adding to it: [1, 2.0, 3, 'four', 17, 'six']


Of course we can also remove elements from lists. We can do this either by
index with `pop()` or by value with `remove()`

In [87]:
# remove by index
my_list.pop(1)
print(f"\nThis is my list after removing an element: {my_list}")


This is my list after removing an element: [1, 3, 'four', 17, 'six']


In [88]:
# remove by value
my_list.remove("four")
print(f"\nThis is my list after removing an element: {my_list}")


This is my list after removing an element: [1, 3, 17, 'six']


Finally, in terms of changes, we can also change the value of an element in a
list. We do this by assigning a new value to the old element with indexing.

In [None]:
# we can change the values in the list by reassigning the elements
my_list[1] = 3.0000001
print(f"\nI made my 3 weird: {my_list}")


I made my 3 weird: [1, 3.0000001, 17, 'six']


Python calls objects that hold multiple values iterables because they can be
iterated over. This means we can construct loops to pull out the values of
our list one at a time. We will do this a lot.

There are two main ways to do this. The first is pretty straightforward.

In [94]:
# directly iterate over our list
for elem in my_list:
    print(f"This is an element of my list: {elem}")

This is an element of my list: 1
This is an element of my list: 3.0000001
This is an element of my list: 17
This is an element of my list: six


The second way we can do this is by iterating over the `enumerate()` function
called on our list. This gives us two for loop variables, the index of each
element and the value of each element.

In [95]:
for index, value in enumerate(my_list):
    print(f"At index {index} the value of the list is {value}")

At index 0 the value of the list is 1
At index 1 the value of the list is 3.0000001
At index 2 the value of the list is 17
At index 3 the value of the list is six


You will see both often, along with an older idiom (`for i in range(len(my_list)):`)
that gives you just the index.
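
Here is a minimal sketch of that index-only idiom, using a small stand-in list (`demo_list` is a made-up name, not the notebook's `my_list`). Note that `range()` takes a number, so the list has to go through `len()` first.

```python
# a small stand-in list (hypothetical values for illustration)
demo_list = ["a", "b", "c"]

# index-only loop: range() needs a length, so we wrap the list in len()
seen = []
for i in range(len(demo_list)):
    seen.append((i, demo_list[i]))
    print(f"At index {i} the value of the list is {demo_list[i]}")

# enumerate() produces the same (index, value) pairs with less typing
same_as_enumerate = seen == list(enumerate(demo_list))
print(f"Same pairs as enumerate()? {same_as_enumerate}")
```

Both loops visit the same pairs, which is why `enumerate()` is usually the nicer choice when you need the index and the value together.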

Lists are incredibly important because they are just all over the place. We
will see and use them over and over again. Strictly speaking, we could do data
analysis only with lists but it would be terrible.

## The tuple

Tuples are a lot like lists but they are immutable. They cannot be changed after
creation, which saves on memory. We probably won't be defining too many tuples,
but Python itself will frequently. In fact, the reason that we have two values
when we loop using `enumerate()` is that it's returning a tuple of
`(index, value)`.

In [96]:
# note parentheses rather than square braces
my_tuple = (1, 2.0, 3, "four", 5)
print(f"This is my tuple: {my_tuple}")

This is my tuple: (1, 2.0, 3, 'four', 5)
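
To see the tuples that `enumerate()` hands back, here is a small sketch. The list `demo_list` and its values are made up for illustration.

```python
demo_list = ["x", "y", "z"]  # hypothetical example values

# each item that enumerate() yields is a tuple of (index, value)
pairs = list(enumerate(demo_list))
print(pairs)

first_pair = pairs[0]
print(type(first_pair))

# "for index, value in ..." is just unpacking that tuple into two names
index, value = first_pair
print(index, value)
```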


We index them in the usual way.

In [97]:
# we index tuples the same way as lists
print(f"\nThis is the fourth element of my tuple: {my_tuple[3]}")


This is the fourth element of my tuple: four


We iterate over them in the usual way

In [92]:
# we can iterate over the tuple in the same way
print("")  # just a newline again
for elem in my_tuple:
    print(f"This is an element in my tuple: {elem}")


This is an element in my tuple: 1
This is an element in my tuple: 2.0
This is an element in my tuple: 3
This is an element in my tuple: four
This is an element in my tuple: 5


Really, though, we can't change them.

In [98]:
# if we try to change the tuple however...
my_tuple[0] = 1.0

TypeError: 'tuple' object does not support item assignment

## Dictionaries

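Dictionaries are mutable like lists but are unlike lists in other ways.
Values in a dictionary are not looked up by position. (Since Python 3.7
dictionaries do remember the order in which entries were added, but we don't
index into them with it.) Instead, each value in a dictionary is given a key.
When the key is used to index the dictionary, the value is recovered.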

In [105]:
# the format is {key: value, key: value, ...}
my_dict = {"a": 9, "b": [7.45, 2.34], "c": "whats up this is the 'c' entry", "d": 6}
print(f"This is my dict: {my_dict}")

This is my dict: {'a': 9, 'b': [7.45, 2.34], 'c': "whats up this is the 'c' entry", 'd': 6}


As mentioned, we retrieve the value by indexing with the key.

In [106]:
# we access elements of dictionaries using their keys rather than an index
print(f"This is the 'c' element of my dict: {my_dict['c']}")

This is the 'c' element of my dict: whats up this is the 'c' entry


Like lists, we can add or remove elements from our dictionaries. We probably
won't do much dictionary stuff directly, but just in case, we'll take a quick
look at changing a dictionary.

In [107]:
# we can do some of the same things we can do with lists
my_dict.pop("b")
print(f"This is my dict after removing 'b': {my_dict}")

This is my dict after removing 'b': {'a': 9, 'c': "whats up this is the 'c' entry", 'd': 6}


We change elements like we do with lists except using keys to index of course.
We could also add a value in this same way.

In [109]:
# we can also change a dictionary by directly changing an element
my_dict['c'] = 10  # changing the 'c' value
my_dict['deepmay'] = 25  # a key that doesn't exist yet gets added
print("This is my dict after changing the 'c' element")
print(my_dict)

This is my dict after changing the 'c' element
{'a': 9, 'c': 10, 'd': 6, 'deepmay': 25}


Dicts are iterables so, of course, we can iterate over them. Typically we do
this by iterating over `keys()`, but we can also iterate over key-value pairs
with `items()` or over just the values with `values()`.

In [110]:
# when we iterate over a dictionary we need to specify keys or values
for key in my_dict.keys():
    print(f"This is a key: {key}")
    print(f"This is the value for the key: {my_dict[key]}")

# this is how we iterate over values
# for val in my_dict.values():
#     print(f"This is a value: {val}")

This is a key: a
This is the value for the key: 9
This is a key: c
This is the value for the key: 10
This is a key: d
This is the value for the key: 6
This is a key: deepmay
This is the value for the key: 25
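
The loop above goes through the keys and looks each value up. As a short sketch of the other two options, here is iteration over pairs and over values, using a made-up `demo_dict` rather than the notebook's `my_dict`.

```python
demo_dict = {"a": 9, "c": 10, "d": 6}  # hypothetical example values

# items() yields (key, value) tuples, which we unpack in the loop header
for key, value in demo_dict.items():
    print(f"Key {key} has value {value}")

# values() skips the keys entirely
total = sum(demo_dict.values())
print(f"Sum of the values: {total}")
```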


## NumPy arrays

Now we're getting somewhere. Numpy arrays are like lists but better. We can do
all kinds of things using just numpy arrays and they are usually much, much
faster than working with lists.

There are two caveats. First, the number of elements in an array cannot change
once the array is created. There is no `append()` and no `pop()`. Instead, we
convert lists into numpy arrays once we have the values we want.

Second, numpy arrays can only contain values of a single type. Integers (7) get
converted to floats (7.0) if you try to combine them in a single array.
Everything gets converted to a string if you try to combine them with strings.

In [142]:
import numpy as np

# you can make a numpy array out of a list
my_list = [1, 2, 3.0, 4, 5.134]
my_list.append(666)

my_array = np.array(my_list)
my_string_array = np.array(["1", 2, 3.0, 4, 5.134])
print(f"This is my number array: {my_array}")
print(f"This is my string array: {my_string_array}")

This is my number array: [  1.      2.      3.      4.      5.134 666.   ]
This is my string array: ['1' '2' '3.0' '4' '5.134']
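
One way to check which type an array settled on is its `dtype` attribute. The sketch below uses made-up arrays and prints `dtype.kind`, a one-letter code for the family of the type, because the exact integer width can differ between platforms.

```python
import numpy as np

# hypothetical example arrays, one per case described above
int_array = np.array([1, 2, 3])
mixed_array = np.array([1, 2, 3.5])      # one float promotes everything to float
string_array = np.array(["1", 2, 3.5])   # one string turns everything into strings

print(int_array.dtype.kind)     # 'i' means integer
print(mixed_array.dtype.kind)   # 'f' means float
print(string_array.dtype.kind)  # 'U' means unicode string
```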


As the name suggests, numpy is for working with numbers. The array is generally
how we do it. Numpy comes with a ton of built in math functionality.

In [144]:
# some basic functionality
print(f"The mean of my array is {np.mean(my_array)}")
print(f"The median of my array is {np.median(my_array)}")
print(f"A sample from my array is {np.random.choice(my_string_array, size=3)}")

The mean of my array is 113.52233333333334
The median of my array is 3.5
A sample from my array is ['2' '5.134' '2']


Numpy arrays themselves support "element-wise" operations on their contents.
This lets us avoid loops in many cases, which makes our code shorter and
faster. We can still iterate over arrays like we do lists, and make numerical
changes to the values as we would any other number.

In [116]:
# copying it to refer to later
my_loop_array = my_array.copy()

for i, _ in enumerate(my_loop_array):
    my_loop_array[i] = my_loop_array[i] + 1

print(f"My '+ 1' array after looping: {my_loop_array}")

My '+ 1' array after looping: [  2.      3.      4.      5.      6.134 667.   ]


However, we can simply *add one to the array* with numpy and get the same
result. We can even add two arrays together to add corresponding elements. The
same goes for any binary mathematical operation.

In [118]:
print(f"The add one array without looping: {my_array + 1}")
print(f"Doubling our array: {my_array + my_array}")

The add one array without looping: [  2.      3.      4.      5.      6.134 667.   ]
Doubling our array: [   2.       4.       6.       8.      10.268 1332.   ]
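
As a quick sketch of a few other binary operations, here they are on two small made-up arrays (`a` and `b` are illustration names only).

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])    # hypothetical example values
b = np.array([10.0, 20.0, 30.0])

doubled = a * 2        # a scalar is applied to every element
difference = b - a     # two arrays pair up corresponding elements
product = a * b        # element-wise product, not a dot product
ratio = b / a
squared = a ** 2

for name, result in [("a * 2", doubled), ("b - a", difference),
                     ("a * b", product), ("b / a", ratio), ("a ** 2", squared)]:
    print(f"{name}: {result}")
```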


Numpy also lets us get dangerously close to a kind of data table. Arrays are
not restricted to one dimension and retain the same array functionality.

In [123]:
# I'll just generate some random numbers
n_rows = 6
n_columns = 4
my_fancy_array = np.random.normal(size=(n_rows, n_columns))
print("This is my fancy two-dimensional array:")
print(my_fancy_array)
print(f"The shape of my fancy array is {my_fancy_array.shape}")

This is my fancy two-dimensional array:
[[ 0.87916021  1.15541461 -1.36291418 -0.18567654]
 [-0.4591675   2.10887321  0.14117039 -0.14347012]
 [-0.52038534  2.45968369 -1.29579613  0.7208201 ]
 [ 0.87165973  1.41499007  0.26532429  0.14519923]
 [ 0.72334958  0.83361434 -0.47430634  0.41568891]
 [-1.16140712  0.04834652  1.4725735  -0.01824101]]
The shape of my fancy array is (6, 4)


We can access elements in an array like we would a list, but now we need to
provide two indices to get a particular value out, one for each **axis**.

In [124]:
# we can access elements in multi-dimensional arrays by using multiple indices
print(f"Element (1, 3) of my fancy array is: {my_fancy_array[1, 3]}")

Element (1, 3) of my fancy array is: -0.14347011648260843


If we give an index for only one axis and a `:` for the other, we can slice
out a single "row" or "column" as an array.

In [130]:
my_row = my_fancy_array[2, :]  # the : means "all of em"
my_column = my_fancy_array[:, 1]

print(f"My row: {my_row}")
print(f"My column: {my_column}")

My row: [-0.52038534  2.45968369 -1.29579613  0.7208201 ]
My column: [1.15541461 2.10887321 2.45968369 1.41499007 0.83361434 0.04834652]


Built-in numpy math functions are aware of how sick this is and will perform
operations for-each-row or for-each-column for us via the `axis` argument!
Whether invoked explicitly or not, this is functionality we will be using
constantly.

In [131]:
print(f"Sums across 'row'-axis (per-'column'): {np.sum(my_fancy_array, axis=0)}")
print(f"Means across 'column'-axis (per-'row'): {np.mean(my_fancy_array, axis=1)}")

Sums across 'row'-axis (per-'column'): [ 0.33320956  8.02092244 -1.25394847  0.93432057]
Means across 'column'-axis (per-'row'): [0.12149603 0.41185149 0.34108058 0.67429333 0.37458662 0.08531797]
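
Random numbers make it hard to check the `axis` behaviour by eye, so here is a sketch on a tiny made-up table (`table` is an illustration name) where the arithmetic is easy to follow.

```python
import numpy as np

# a tiny table with 2 rows and 3 columns (hypothetical example values)
table = np.array([[1, 2, 3],
                  [4, 5, 6]])

# axis=0 collapses the rows, leaving one result per column
column_sums = np.sum(table, axis=0)
print(column_sums)

# axis=1 collapses the columns, leaving one result per row
row_means = np.mean(table, axis=1)
print(row_means)

# with no axis given, the whole array is reduced to a single number
grand_total = np.sum(table)
print(grand_total)
```

A handy way to remember it: the axis you name is the one that disappears from the shape of the result.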



In addition to the ability to do element-wise math easily and quickly,
numpy arrays give us the ability to do fancy indexing. This proves very
important in working with real data.

The first kind of fancy indexing, and the most important, is masking. Just
like how numpy arrays let us do element-wise math, they also let us do
element-wise logic.

Let's start combining things by looking for extreme values in our fancy array.
We'll say that this is any value more than two standard deviations away from
the mean.

We're going to do this by creating a **mask** of boolean True/False values that
has the same shape as our array.

In [None]:
# first we'll get the number that is our 'extreme value' threshold:
# two standard deviations
threshold = 2 * np.std(my_fancy_array)

# next we'll measure how far each value is from the mean, as a positive number
my_deviations = np.abs(my_fancy_array - np.mean(my_fancy_array))

# element-wise logic
mask = my_deviations > threshold

print("Extreme value?")
print(mask)

Extreme value?
[[False False False False]
 [False False False False]
 [False  True False False]
 [False False False False]
 [False False False False]
 [False False False False]]


The key step is that we can give this mask to our array as an index to select
only those values where the mask contains the value True. In our case we get
only our extreme values.

In [134]:
print(f"extreme values from our array: {my_fancy_array[mask]}")

extreme values from our array: [2.45968369]


We usually won't define the mask explicitly, and instead would just give a
logical statement in the index brackets.

In [135]:
# values greater than one
print(f"my array elements that are greater than 1: {my_fancy_array[my_fancy_array > 1]}")

my array elements that are greater than 1: [1.15541461 2.10887321 2.45968369 1.41499007 1.4725735 ]
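
Here is a small sketch showing that the explicit mask and the inline condition do the same job, plus two related tricks. The array `values` is made up for illustration, and combining conditions with `&` is an extra that the examples above don't use.

```python
import numpy as np

values = np.array([[0.5, 1.5],
                   [2.5, -3.0]])  # hypothetical example values

# an explicit mask and an inline condition select the same elements
mask = values > 1
selected = values[mask]
same_selection = np.array_equal(selected, values[values > 1])
print(selected, same_selection)

# True counts as 1, so summing a mask counts the matches
n_matches = int(mask.sum())
print(n_matches)

# conditions combine with & (and) and | (or); the parentheses are required
in_range = values[(values > 1) & (values < 2)]
print(in_range)
```

Note that selecting with a mask always returns a flat, one-dimensional array, even when the original array has two dimensions.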


The mask is an array of boolean True and False values. We can also index with
an array of integer indices. This can be useful for sorting or copying values.
It won't come up too often, but sorting in this way is sometimes useful for
visualization.

In [146]:
my_array = np.array([13, 14, 15, 16, 17, 18])
my_indices = np.array([-1, 0, 0, 0])  # negatives allow you to "reverse index"

print(f"Re-indexed array: {my_array[my_indices]}")

Re-indexed array: [18 13 13 13]
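
To make the sorting use case concrete, here is a sketch that uses `np.argsort()` to build the array of indices. The arrays `scores` and `labels` are made up for illustration.

```python
import numpy as np

scores = np.array([17, 13, 18, 14])  # hypothetical example values

# argsort() returns the indices that would put the array in ascending order
order = np.argsort(scores)
print(order)

# indexing with that array of indices gives us the sorted values
sorted_scores = scores[order]
print(sorted_scores)

# the same indices can reorder a second, parallel array to match
labels = np.array(["a", "b", "c", "d"])
sorted_labels = labels[order]
print(sorted_labels)
```

Reordering a second array with the same indices is what makes this handy for plots, where labels need to stay lined up with their values.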


## The DataFrame

In [70]:
import pandas as pd

# dataframes are amazing
my_df = pd.DataFrame({"x": ["a", "b", "c", "d", "e"], "y": np.random.normal(size=5)})
print("This is my DataFrame:")
print(my_df)

This is my DataFrame:
   x         y
0  a  1.088413
1  b  1.283318
2  c -0.070339
3  d  0.611381
4  e -0.222970


In [71]:
# we can kind of think of it a little bit like a dictionary of numpy arrays
# we call each key-value pair a column
print("My x column is:")
print(my_df["x"])

My x column is:
0    a
1    b
2    c
3    d
4    e
Name: x, dtype: object
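
To push on the "dictionary of numpy arrays" picture a little, here is a sketch on a small made-up DataFrame (`demo_df` is an illustration name). It assumes a reasonably recent pandas, where `Series.to_numpy()` is available.

```python
import numpy as np
import pandas as pd

# hypothetical example values, fixed so the results are easy to check
demo_df = pd.DataFrame({"x": ["a", "b", "c"], "y": [1.0, 2.0, 3.0]})

# like a dictionary, a DataFrame hands back a column when indexed by its key
y_column = demo_df["y"]

# to_numpy() exposes the column's values as a plain numpy array
y_values = y_column.to_numpy()
print(type(y_values))

# element-wise math and numpy functions work on the column directly
y_plus_one = y_column + 1
y_mean = np.mean(y_column)
print(y_plus_one.tolist(), y_mean)

# and the column names play the role of the dictionary keys
column_names = list(demo_df.columns)
print(column_names)
```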


In [73]:
# we can use numpy fancy indexing to pull out specific rows
print(my_df[my_df["y"] > 1])

   x         y
0  a  1.088413
1  b  1.283318
