## Week 5: From Nested Lists to Data Frames


NumPy lets us work efficiently with arrays, and it underpins the DataFrames in pandas.
NumPy arrays hold a single, fixed data type, so the library knows exactly what it is dealing with and can operate on the data much faster than on a generic Python list.
Even when we analyze images or text, we're using numbers to do so.
An image, for example, is reduced to a multi-dimensional array whose numbers encode color, brightness, etc.

In [45]:
import numpy as np 
import math
import pandas as pd

In [2]:
# generate 10 random numbers (standard normal) as a one-dimensional array,
# reshape it to have 5 rows (-1 means 'infer the number of columns that fits 5 rows'),
# then round the results to one decimal place.
m1 = np.random.randn(10).reshape(5,-1).round(1)
m1

array([[ 2.8, -0.9],
       [-1. , -1.2],
       [-0.8, -0.3],
       [-0.3,  2.3],
       [-0.6, -2.4]])

In [3]:
m2 = np.random.poisson(1,10).reshape(5,-1)

In [6]:
# input is two matrices in a single nested list
# axis=0 adds them by row, effectively stacking the two.
np.concatenate([m1,m2],axis=0)

# equivalent to above
np.vstack([m1,m2])

array([[ 2.8, -0.9],
       [-1. , -1.2],
       [-0.8, -0.3],
       [-0.3,  2.3],
       [-0.6, -2.4],
       [ 0. ,  0. ],
       [ 1. ,  0. ],
       [ 0. ,  0. ],
       [ 1. ,  0. ],
       [ 0. ,  1. ]])

In [5]:
# axis=1 adds them by column, expands the column space
np.concatenate([m1,m2],axis=1)

# equivalent to above
M = np.hstack([m1,m2])
M

array([[ 2.8, -0.9,  0. ,  0. ],
       [-1. , -1.2,  1. ,  0. ],
       [-0.8, -0.3,  0. ,  0. ],
       [-0.3,  2.3,  1. ,  0. ],
       [-0.6, -2.4,  0. ,  1. ]])
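
To make the effect of `axis` easier to see, here is a minimal sketch that checks only the resulting shapes. The arrays `a` and `b` are illustrative stand-ins, not the `m1` and `m2` used above.

```python
import numpy as np

# Two illustrative 5x2 arrays (stand-ins for m1 and m2)
a = np.zeros((5, 2))
b = np.ones((5, 2))

stacked_rows = np.concatenate([a, b], axis=0)  # rows are appended
stacked_cols = np.concatenate([a, b], axis=1)  # columns are appended

print(stacked_rows.shape)  # (10, 2)
print(stacked_cols.shape)  # (5, 4)

# reshape(5, -1) lets NumPy infer the second dimension from the total size
inferred = np.arange(10).reshape(5, -1)
print(inferred.shape)  # (5, 2)
```

Stacking along `axis=0` requires the arrays to agree on the number of columns, and stacking along `axis=1` requires them to agree on the number of rows.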

### Views & Copies
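Note that when we slice a NumPy array we **_do not copy the array_**; rather, we get a "**view**" of the same data. Plain Python lists behave differently: assigning a list to a new name does not copy it, but slicing a list does, as the first two cells below show.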


In [9]:
x = [1,2,3]
y = x
y[2] = 100
x

[1, 2, 100]

In [10]:
# We can get around this behavior by making copies. 
# One way to make a copy is to slice
y = x[:]
y[2] = -999
x

[1, 2, 100]

In [11]:
P = np.ones((5,5))
P

array([[1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.]])

In [12]:
g = P[:2,:2] 
g

array([[1., 1.],
       [1., 1.]])

In [13]:
g += 100
g

array([[101., 101.],
       [101., 101.]])

In [14]:
P

array([[101., 101.,   1.,   1.,   1.],
       [101., 101.,   1.,   1.,   1.],
       [  1.,   1.,   1.,   1.,   1.],
       [  1.,   1.,   1.,   1.,   1.],
       [  1.,   1.,   1.,   1.,   1.]])

In [15]:
g2 = P[:2,:2].copy()
g2 -= 1000
g2

array([[-899., -899.],
       [-899., -899.]])

In [16]:
P

array([[101., 101.,   1.,   1.,   1.],
       [101., 101.,   1.,   1.,   1.],
       [  1.,   1.,   1.,   1.,   1.],
       [  1.,   1.,   1.,   1.,   1.],
       [  1.,   1.,   1.,   1.,   1.]])
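
One way to check whether two arrays share data is `np.shares_memory`. The sketch below uses illustrative variable names and assumes the same 5x5 array of ones as above.

```python
import numpy as np

P = np.ones((5, 5))

view = P[:2, :2]              # basic slicing returns a view
explicit_copy = P[:2, :2].copy()  # .copy() allocates new memory
fancy = P[[0, 1]]             # indexing with a list of positions returns a copy

print(np.shares_memory(P, view))           # True
print(np.shares_memory(P, explicit_copy))  # False
print(np.shares_memory(P, fancy))          # False

# Writing through the view changes the original array
view[0, 0] = 42
print(P[0, 0])  # 42.0
```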

### Broadcasting

Broadcasting makes it possible for operations to be performed on arrays of mismatched shapes.

Broadcasting describes how numpy treats arrays with different shapes during arithmetic operations. Subject to certain constraints, the smaller array is "broadcast" across the larger array so that they have compatible shapes.

Previously, if we wanted to take a list of numbers and add 5 to each value, we would need to loop through the list, pull out each value, make the addition, and then put it back.

Instead, vectorization allows the addition to be applied across the whole vector at once: the scalar is treated as if it had the same shape as the array, and the computation is then completed element by element. We don't have to create an array of fives in memory to complete the task.

By 'broadcast', we mean that the smaller array is made to match the size of the larger array in order to allow for element-wise manipulations.

### How it works:

- Shapes of the two arrays are compared dimension by dimension.
- Dimensions are considered in reverse order, starting with the trailing dimension and working forward.
- Conceptually, the smaller array is stretched by repeating its elements. `numpy` does not actually duplicate the smaller array; instead, it makes computationally efficient use of existing structures in memory that achieve the same result.

A general **rule of thumb**: two corresponding dimensions are compatible when they are equal or when one of them is 1.

A dimension disagreement occurs when the shapes do not satisfy this rule, in which case the operation cannot be broadcast.

In [17]:
A = np.array([1,2,3,4,5])
A + 5

array([ 6,  7,  8,  9, 10])

Broadcasting in NumPy follows a strict set of rules to determine the interaction between the two arrays (from [reading](https://jakevdp.github.io/PythonDataScienceHandbook/02.05-computation-on-arrays-broadcasting.html)):


### Rule 1
> If the two arrays differ in their number of dimensions, the shape of the one with fewer dimensions is padded with ones on its leading (left) side.

### Rule 2

> If the shape of the two arrays does not match in any dimension, the array with shape equal to 1 in that dimension is stretched to match the other shape.

### Rule 3 

> If in any dimension the sizes disagree and neither is equal to 1, an error is raised.

In [20]:
# np.arange works the same way as range in base Python:
# it starts at 0 and excludes the stop value
np.arange(3) + 5

array([5, 6, 7])

In [21]:
np.ones((3,3)) + np.arange(3)
# this is an element-wise add:
# broadcasting treats np.arange(3), which is [0,1,2], as a single row
# and repeats it across all 3 rows to match the 3x3 matrix

array([[1., 2., 3.],
       [1., 2., 3.],
       [1., 2., 3.]])
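
The three rules can be traced on a small case. This is a sketch with illustrative names (`col`, `row`, `table`); it shows one combination that broadcasts and one that triggers Rule 3.

```python
import numpy as np

col = np.arange(3).reshape(3, 1)  # shape (3, 1)
row = np.arange(3)                # shape (3,), padded to (1, 3) by Rule 1

# Rule 2 stretches both size-1 dimensions, giving a (3, 3) result
table = col + row
print(table.shape)  # (3, 3)
print(table)

# Rule 3: (3, 2) and (3,) line up as (3, 2) vs (1, 3);
# the trailing sizes 2 and 3 disagree and neither is 1
try:
    np.ones((3, 2)) + np.arange(3)
    failed = False
except ValueError as err:
    failed = True
    print("broadcast error:", err)
```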

### Vectorization

Similar to broadcasting, vectorization allows for simultaneous computation along all values in the array. 

In [22]:
# randint(low, high, size): 50 integers from 1 to 9 (the upper bound, 10, is excluded)
X = np.random.randint(1,10,50).reshape(10,5)
X

array([[7, 5, 9, 8, 4],
       [4, 7, 5, 6, 6],
       [5, 4, 9, 8, 2],
       [9, 5, 5, 8, 8],
       [5, 7, 6, 8, 7],
       [9, 1, 4, 4, 7],
       [7, 3, 3, 6, 2],
       [8, 2, 8, 8, 2],
       [7, 8, 5, 2, 2],
       [8, 5, 9, 4, 7]])

In [None]:
np.log(X)

The computations are performed on each element in the array _simultaneously_. 

> "When looping over an array or any data structure in Python, there’s a lot of overhead involved. Vectorized operations in `numpy` delegate the looping internally to highly optimized `C` and `Fortran` functions, making for cleaner and faster Python code." - [RealPython](https://realpython.com/numpy-array-programming/)

In [23]:
X2 = X.tolist()
n_rows = len(X2)
n_cols = len(X2[0])
for i in range(n_rows):
    for j in range(n_cols):
        X2[i][j] = math.log(X2[i][j])
X2

# here i selects a specific row and j selects a value within that row, like individual cars of a train
# however, we can do this more efficiently thanks to vectorization

[[1.9459101490553132,
  1.6094379124341003,
  2.1972245773362196,
  2.0794415416798357,
  1.3862943611198906],
 [1.3862943611198906,
  1.9459101490553132,
  1.6094379124341003,
  1.791759469228055,
  1.791759469228055],
 [1.6094379124341003,
  1.3862943611198906,
  2.1972245773362196,
  2.0794415416798357,
  0.6931471805599453],
 [2.1972245773362196,
  1.6094379124341003,
  1.6094379124341003,
  2.0794415416798357,
  2.0794415416798357],
 [1.6094379124341003,
  1.9459101490553132,
  1.791759469228055,
  2.0794415416798357,
  1.9459101490553132],
 [2.1972245773362196,
  0.0,
  1.3862943611198906,
  1.3862943611198906,
  1.9459101490553132],
 [1.9459101490553132,
  1.0986122886681098,
  1.0986122886681098,
  1.791759469228055,
  0.6931471805599453],
 [2.0794415416798357,
  0.6931471805599453,
  2.0794415416798357,
  2.0794415416798357,
  0.6931471805599453],
 [1.9459101490553132,
  2.0794415416798357,
  1.6094379124341003,
  0.6931471805599453,
  0.6931471805599453],
 [2.0794415416798357

In [24]:
# Instead, we can do something like this that doesn't require all of the looping and iterating

# Locate the absolute value for an array
np.abs([1,2,-6,7,8])

array([1, 2, 6, 7, 8])
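
As a quick check that the loop and the vectorized call compute the same thing, here is a small sketch on an illustrative 2x5 array (not the random `X` above).

```python
import math
import numpy as np

X_small = np.arange(1, 11).reshape(2, 5)  # illustrative values 1..10

# Explicit Python loop over every element
looped = [[math.log(v) for v in row] for row in X_small.tolist()]

# One vectorized call; the looping happens inside NumPy
vectorized = np.log(X_small)

print(np.allclose(looped, vectorized))  # True
```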

### Vectorization across array dimensions

NumPy's aggregation functions and methods (e.g. `mean`, `sum`) accept an `axis` argument that specifies the dimension along which the function is applied.


In [28]:
A = np.random.randint(1,10,100).reshape(20,5)
A


array([[1, 9, 5, 1, 8],
       [2, 3, 7, 3, 8],
       [7, 9, 9, 4, 9],
       [9, 5, 2, 4, 4],
       [1, 4, 4, 5, 7],
       [1, 6, 1, 9, 7],
       [7, 2, 3, 6, 7],
       [1, 6, 6, 9, 5],
       [3, 6, 1, 8, 3],
       [3, 4, 7, 2, 2],
       [3, 6, 8, 7, 4],
       [5, 6, 5, 8, 6],
       [1, 4, 8, 5, 6],
       [8, 9, 6, 4, 5],
       [4, 4, 2, 9, 5],
       [5, 3, 3, 7, 8],
       [2, 9, 3, 4, 3],
       [4, 4, 1, 6, 2],
       [3, 9, 3, 6, 6],
       [7, 5, 4, 8, 5]])

In [32]:
# axis=0 collapses the rows, giving one mean per column
A.mean(axis=0)

array([3.85, 5.65, 4.4 , 5.75, 5.5 ])

In [29]:
A.mean()

5.03

In [30]:
A.mean(axis=1)

array([4.8, 4.6, 7.6, 4.8, 4.2, 4.8, 5. , 5.4, 4.2, 3.6, 5.6, 6. , 4.8,
       6.4, 4.8, 5.2, 4.2, 3.4, 5.4, 5.8])
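
The `axis` argument is easier to follow on an array small enough to verify by hand. The 2x3 array below is illustrative.

```python
import numpy as np

A_small = np.array([[1, 2, 3],
                    [4, 5, 6]])

overall = A_small.mean()            # all six values
per_column = A_small.mean(axis=0)   # collapse rows: one value per column
per_row = A_small.mean(axis=1)      # collapse columns: one value per row

print(overall)     # 3.5
print(per_column)  # [2.5 3.5 4.5]
print(per_row)     # [2. 5.]
```

A useful way to remember it: the axis you name is the one that disappears from the result's shape.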

In [33]:
def bigsmall(a,b):
    if a > b:
        return "A is larger"
    else:
        return "B is larger"

In [34]:
bigsmall(5,6)

'B is larger'

In [35]:
bigsmall(6,5)

'A is larger'

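While a trite example, we can vectorize this function by passing it to `np.vectorize`, which returns a new function that can be applied element-wise to arrays. Note that `np.vectorize` is a convenience: under the hood it still loops in Python, so it does not bring the speed of NumPy's built-in vectorized operations.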

In [36]:
# Create a vectorized version of the function
vec_bigsmall = np.vectorize(bigsmall)
vec_bigsmall 

<numpy.vectorize at 0x7fac84465bb0>

In [37]:
# And now implement on arrays of numbers!
vec_bigsmall([0,2,5,7,0],[4,3,10,2,6])

array(['B is larger', 'B is larger', 'B is larger', 'A is larger',
       'B is larger'], dtype='<U11')
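
For a simple comparison like this one, a built-in alternative is `np.where`, which selects between two values element-wise. This is a sketch of a swapped-in technique, not the approach used above; the names `a`, `b`, and `labels` are illustrative.

```python
import numpy as np

a = np.array([0, 2, 5, 7, 0])
b = np.array([4, 3, 10, 2, 6])

# Where the condition is True take the first value, otherwise the second
labels = np.where(a > b, "A is larger", "B is larger")
print(labels)
```

As with `bigsmall`, ties fall into the "B is larger" branch.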

Typically, a numpy array holds a single data type, which is what lets it process all of these functions efficiently, but we can convert between types at the array level via

x.astype('f') # 'f' converts to a 32-bit float (float32)

In [38]:
A.astype('f') # 'f' converts to a 32-bit float (float32)

array([[1., 9., 5., 1., 8.],
       [2., 3., 7., 3., 8.],
       [7., 9., 9., 4., 9.],
       [9., 5., 2., 4., 4.],
       [1., 4., 4., 5., 7.],
       [1., 6., 1., 9., 7.],
       [7., 2., 3., 6., 7.],
       [1., 6., 6., 9., 5.],
       [3., 6., 1., 8., 3.],
       [3., 4., 7., 2., 2.],
       [3., 6., 8., 7., 4.],
       [5., 6., 5., 8., 6.],
       [1., 4., 8., 5., 6.],
       [8., 9., 6., 4., 5.],
       [4., 4., 2., 9., 5.],
       [5., 3., 3., 7., 8.],
       [2., 9., 3., 4., 3.],
       [4., 4., 1., 6., 2.],
       [3., 9., 3., 6., 6.],
       [7., 5., 4., 8., 5.]], dtype=float32)
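
A short sketch of a few other conversions, using an illustrative array `B`. It shows that `astype` returns a new array and that converting floats to integers truncates toward zero.

```python
import numpy as np

B = np.array([[1, 2], [3, 4]])

as_f32 = B.astype('f')     # 'f' is the code for float32
as_f64 = B.astype(float)   # Python's float maps to float64
print(as_f32.dtype)  # float32
print(as_f64.dtype)  # float64

# astype returns a new array; B itself keeps its integer type
print(B.dtype.kind)  # i

# Converting floats to integers drops the fractional part
truncated = np.array([1.9, -1.9]).astype(int)
print(truncated)  # [ 1 -1]
```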

In [39]:
nested_list = [['a','b','c'],[1,2,3],[.3,.55,1.2]]
nested_list

[['a', 'b', 'c'], [1, 2, 3], [0.3, 0.55, 1.2]]

In [40]:
data = np.array(nested_list).T
data

array([['a', '1', '0.3'],
       ['b', '2', '0.55'],
       ['c', '3', '1.2']], dtype='<U32')

Above, we see that converting a nested list of mixed types to an array coerces every element to the "lowest common denominator", which here is the string data type. Below, we see an example of the formatting required to keep track of a different data type for each field of an array, but this is where we shift to Pandas to use Series and DataFrames instead.

In [43]:
data = np.zeros((3), dtype={'names':('v1', 'v2', 'v3'),
                            'formats':('U5', 'i', 'f')})
data
data['v1'] = ['a','b','c']
data['v2'] = [1,2,3]
data['v3'] = [.3,.55,1.2]
data

array([('a', 1, 0.3 ), ('b', 2, 0.55), ('c', 3, 1.2 )],
      dtype=[('v1', '<U5'), ('v2', '<i4'), ('v3', '<f4')])
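
Once built, a structured array can be queried by field name. The sketch below rebuilds the same array under the illustrative name `records` and shows field access and a boolean filter.

```python
import numpy as np

records = np.zeros(3, dtype={'names': ('v1', 'v2', 'v3'),
                             'formats': ('U5', 'i', 'f')})
records['v1'] = ['a', 'b', 'c']
records['v2'] = [1, 2, 3]
records['v3'] = [.3, .55, 1.2]

# Each field behaves like an ordinary array of its own type
total = records['v2'].sum()
print(total)  # 6

# A boolean mask on one field selects whole records
selected = records[records['v2'] > 1]['v1']
print(selected)  # ['b' 'c']
```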

# Pandas

Pandas builds on the flexibility and efficiency of numpy, but can deal with heterogeneous data types much more effectively.

A `pandas` series is a one-dimensional labeled array capable of holding any data type (e.g. integers, booleans, strings, etc.). The axis labels of a series are called the "index". This is similar to the positions in a list or `numpy` array; however, we can use other data types to serve as an index, which allows for some powerful ways of manipulating the array. At its core, a `Pandas` `Series` is nothing but a column in an Excel sheet or an `R` `data.frame`.

In [46]:
s = pd.Series(list("georgetown"))
s

0    g
1    e
2    o
3    r
4    g
5    e
6    t
7    o
8    w
9    n
dtype: object

In [51]:
s.index
# by default, the index is the integer range 0..9 assigned to the series above
s[2]
# we can control what the index is
s2 = pd.Series([i for i in range(10)], index=list('georgetown'))
s2['g']
# because we set the index to the characters of 'georgetown', we can look up values by letter
# this returns both instances of g; duplicate keys are impossible in a dictionary, but allowed in a Series index

g    0
g    4
dtype: int64
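
A small sketch of how label lookups and position lookups differ on the same kind of series. The name `letters` is illustrative; the data mirrors `s2` above.

```python
import pandas as pd

letters = pd.Series(range(10), index=list("georgetown"))

# A label that appears once returns a single value
print(letters.loc["t"])  # 6

# A label that appears twice returns a Series with both matches
both_o = letters.loc["o"]
print(both_o.tolist())  # [2, 7]

# .iloc ignores the labels and uses the position
print(letters.iloc[2])  # 2

# The index can report whether its labels are unique
print(letters.index.is_unique)  # False
```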

## DataFrame

A pandas DataFrame is a two dimensional, relational data structure with the capacity to handle heterogeneous data types.

- "relational" = each column value contained within a row entry corresponds with the same observation.
- "two dimensional" = a matrix data structure (no N-dimensional arrays). The data construct can be accessed through row/column indices.
- "heterogeneous" = different data types can be contained within each column series. This means, for example, string, integer, and boolean values can coexist in the same data structure and retain the specific properties of their data type class.

Put simply, a DataFrame is a collection of pandas series where each index position corresponds to the same observation. To be explicit, let's peek under the hood and look at the object construction.

### Constructor

To create a pandas DataFrame, we call the `pd.DataFrame()` constructor.

#### Construction using dict()

As input, we need to feed in a dictionary, where the keys are the column names and the values are the relational data input.

Data must be relational. If the dimensions do not align (for example, the lists differ in length), an error will be thrown.

In [52]:
my_dict = {"A":[1,2,3,4,5,6],"B":[2,3,1,.3,4,1],"C":['a','b','c','d','e','f']}
pd.DataFrame(my_dict)

|   | A | B | C |
|---|---|---|---|
| 0 | 1 | 2.0 | a |
| 1 | 2 | 3.0 | b |
| 2 | 3 | 1.0 | c |
| 3 | 4 | 0.3 | d |
| 4 | 5 | 4.0 | e |
| 5 | 6 | 1.0 | f |
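
To illustrate the "collection of series" idea and the alignment requirement, here is a minimal sketch. The names `col_a`, `col_c`, and `frame` are illustrative.

```python
import pandas as pd

# Two series of equal length become two columns
col_a = pd.Series([1, 2, 3])
col_c = pd.Series(["a", "b", "c"])
frame = pd.DataFrame({"A": col_a, "C": col_c})

# Pulling a column back out returns a Series
print(type(frame["A"]))
print(frame.shape)  # (3, 2)

# Lists of different lengths cannot form a rectangular table
try:
    pd.DataFrame({"A": [1, 2, 3], "B": [1, 2]})
    raised = False
except ValueError as err:
    raised = True
    print("construction failed:", err)
```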


### Indexing Dataframes

Setting the index as the unit of observation: we can redefine the index so that it identifies our unit of observation, which keeps the data consistent, clean, and easy to use.

We can't use something like `df[1,:]` because a DataFrame doesn't interpret that as a row/column position. Because it behaves like a sort of dictionary, we have to reference the keys, so `D['Var1']` would return the column instead.

It is important to remember that a DataFrame has two indices, one for rows and one for columns, so we need to keep track of both. This leads us to the built-in `.iloc[]` and `.loc[]` indexers.


To **access** the indices in a `DataFrame`, we need to use two built-in methods:

- `.iloc[]` = use the numerical index position to call to locations in the `DataFrame`. (_The `i` is short for `integer`._)
- `.loc[]` = use the labels to call to the location in the data frame.


A few things to note about `.loc[]`:

- It calls all named index positions. A label slice returns all the requested rows, including the last one (rather than stopping one below the upper bound, as a numerical range does). This is because `.loc[]` treats the index as a labeled feature rather than a numerical one.
- Selecting ranges from labeled indices otherwise works the same as with numerical indices. That is, we can make calls to all variables in between (see below).
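
The difference in how the two indexers treat the end of a range can be seen in a small sketch. The three-row frame `cases` is illustrative and borrows values from the data used below.

```python
import pandas as pd

cases = pd.DataFrame({"Cases": [745, 37737, 212258]},
                     index=["Afghanistan", "Brazil", "China"])

# Position-based: 0:2 covers positions 0 and 1; position 2 is excluded
by_position = cases.iloc[0:2]
print(by_position.index.tolist())  # ['Afghanistan', 'Brazil']

# Label-based: both end labels are included
by_label = cases.loc["Afghanistan":"China"]
print(by_label.index.tolist())  # ['Afghanistan', 'Brazil', 'China']

# A single row label and column label return one value
print(cases.loc["Brazil", "Cases"])  # 37737
```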





In [69]:
col_names = ["Country","Year","Cases","Population"]
list_dat = [["Afghanistan", 1999, 745, 19987071],
            ["Afghanistan", 2000, 2666, 20595360],
            ["Brazil", 1999,  37737,   172006362],
            ["Brazil", 2000,  80488,  174504898],
            ["China",  1999,  212258, 1272915272],
            ["China",  2000,  213766, 1280428583]]
dat = pd.DataFrame(list_dat,columns=col_names)
dat

|   | Country | Year | Cases | Population |
|---|---|---|---|---|
| 0 | Afghanistan | 1999 | 745 | 19987071 |
| 1 | Afghanistan | 2000 | 2666 | 20595360 |
| 2 | Brazil | 1999 | 37737 | 172006362 |
| 3 | Brazil | 2000 | 80488 | 174504898 |
| 4 | China | 1999 | 212258 | 1272915272 |
| 5 | China | 2000 | 213766 | 1280428583 |


In [70]:
# instead of the default 0..n-1 index, we can set the index to a column from our data
dat = dat.set_index('Country')
dat

| Country | Year | Cases | Population |
|---|---|---|---|
| Afghanistan | 1999 | 745 | 19987071 |
| Afghanistan | 2000 | 2666 | 20595360 |
| Brazil | 1999 | 37737 | 172006362 |
| Brazil | 2000 | 80488 | 174504898 |
| China | 1999 | 212258 | 1272915272 |
| China | 2000 | 213766 | 1280428583 |


In [71]:
# returns multiple locations where the index is represented
dat.loc['Brazil',:]

| Country | Year | Cases | Population |
|---|---|---|---|
| Brazil | 1999 | 37737 | 172006362 |
| Brazil | 2000 | 80488 | 174504898 |


In [72]:
# restores the default 0..n-1 index
dat = dat.reset_index()
dat

|   | Country | Year | Cases | Population |
|---|---|---|---|---|
| 0 | Afghanistan | 1999 | 745 | 19987071 |
| 1 | Afghanistan | 2000 | 2666 | 20595360 |
| 2 | Brazil | 1999 | 37737 | 172006362 |
| 3 | Brazil | 2000 | 80488 | 174504898 |
| 4 | China | 1999 | 212258 | 1272915272 |
| 5 | China | 2000 | 213766 | 1280428583 |


In [73]:
# multi-dimensional indexing
dat = dat.set_index(keys=['Country', 'Year'])
dat
# index becomes a tuple, as seen below

| Country | Year | Cases | Population |
|---|---|---|---|
| Afghanistan | 1999 | 745 | 19987071 |
| Afghanistan | 2000 | 2666 | 20595360 |
| Brazil | 1999 | 37737 | 172006362 |
| Brazil | 2000 | 80488 | 174504898 |
| China | 1999 | 212258 | 1272915272 |
| China | 2000 | 213766 | 1280428583 |


In [74]:
dat.index

MultiIndex([('Afghanistan', 1999),
            ('Afghanistan', 2000),
            (     'Brazil', 1999),
            (     'Brazil', 2000),
            (      'China', 1999),
            (      'China', 2000)],
           names=['Country', 'Year'])

In [75]:
# now our index requires a tuple, give us all columns associated
dat.loc[("Afghanistan",2000),:]

Cases             2666
Population    20595360
Name: (Afghanistan, 2000), dtype: int64

In [76]:
dat.loc[("Afghanistan",2000):("Brazil",1999),:]

| Country | Year | Cases | Population |
|---|---|---|---|
| Afghanistan | 2000 | 2666 | 20595360 |
| Brazil | 1999 | 37737 | 172006362 |


In [77]:
dat.loc[[("Afghanistan",2000),("China",1999)],:]

| Country | Year | Cases | Population |
|---|---|---|---|
| Afghanistan | 2000 | 2666 | 20595360 |
| China | 1999 | 212258 | 1272915272 |


In [78]:
dat.xs("Brazil",level="Country")
# .xs() takes a cross-section: when we have multiple index levels,
# the level argument names the level to match, so a single command selects every 'Brazil' row

| Year | Cases | Population |
|---|---|---|
| 1999 | 37737 | 172006362 |
| 2000 | 80488 | 174504898 |


In [63]:
# boolean example
dat.loc[dat.index.get_level_values('Year') == 2000,:]

| Country | Year | Cases | Population |
|---|---|---|---|
| Afghanistan | 2000 | 2666 | 20595360 |
| Brazil | 2000 | 80488 | 174504898 |
| China | 2000 | 213766 | 1280428583 |


In [65]:
# sorting 
dat.sort_index(ascending=False)

| Country | Year | Cases | Population |
|---|---|---|---|
| China | 2000 | 213766 | 1280428583 |
| China | 1999 | 212258 | 1272915272 |
| Brazil | 2000 | 80488 | 174504898 |
| Brazil | 1999 | 37737 | 172006362 |
| Afghanistan | 2000 | 2666 | 20595360 |
| Afghanistan | 1999 | 745 | 19987071 |


In [66]:
dat.reset_index(inplace=True)

In [79]:
dat.columns = dat.columns.str.upper()
dat.columns

# can't change these labels individually: a pandas Index is immutable, so we replace all the column names at once

Index(['CASES', 'POPULATION'], dtype='object')

In [80]:
#instead of something like dat.columns[dat.columns == "POPULATION"] = "POP"

dat.rename(columns={"POPULATION":"POP"},
             inplace=True) # Makes the change in-place rather than making a copy
dat

| Country | Year | CASES | POP |
|---|---|---|---|
| Afghanistan | 1999 | 745 | 19987071 |
| Afghanistan | 2000 | 2666 | 20595360 |
| Brazil | 1999 | 37737 | 172006362 |
| Brazil | 2000 | 80488 | 174504898 |
| China | 1999 | 212258 | 1272915272 |
| China | 2000 | 213766 | 1280428583 |


### Stacking/Unstacking

The ease with which we can assign various index values offers a lot of flexibility regarding how we choose to shape the data. In fact, we can reshape the data in many ways by `.stack()`ing and `.unstack()`ing it.

When the data has multiple index levels, we need to tell these methods which level to operate on, passed through as the `level` argument.
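
A minimal sketch of the round trip, assuming a reduced version of the country/year data from earlier. The names `small`, `wide`, and `long_again` are illustrative.

```python
import pandas as pd

small = pd.DataFrame(
    {"Country": ["Afghanistan", "Afghanistan", "Brazil", "Brazil"],
     "Year": [1999, 2000, 1999, 2000],
     "Cases": [745, 2666, 37737, 80488]}
).set_index(["Country", "Year"])

# unstack moves the Year level from the row index into the columns
wide = small.unstack(level="Year")
print(wide)

# stack moves it back, returning to one row per (Country, Year)
long_again = wide.stack(level="Year")
print(long_again)
```

After `unstack`, the columns form a two-level index of `("Cases", year)` pairs, so a single cell is addressed as `wide.loc["Brazil", ("Cases", 2000)]`.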

