# Applying Python to data analysis

Everything we have done so far is a foundation for applying Python to data analysis. 
For this task, we need: 
* The basic Python types (`list`, `set`, `dict`, `tuple`):
 
 * How to use those types. 
 * How to construct new ones. 
* The data storage types (`ndarray`, `DataFrame`): 
 
 * How to make one. 
 * How to manipulate one. 
  * filtering
  * constructing new columns. 
  * transforming between types
  
# Now we move on to the final step of the journey. 
* Use this knowledge to do actual data analysis. 
* Learn to use the pre-packaged Python libraries that are constructed to help. 

# Some important caveats
* `numpy` predates `pandas`
 
 * Most data analysis libraries support the `numpy` format `ndarray`.
 * Some data analysis libraries don't support the `pandas` format `DataFrame`.  
* Libraries provide general-purpose methods and usually omit special-purpose ones. 

 * If many other people need to do something, chances are that a library helps. 
 * If, on the other hand, your needs are unique, a library is unlikely to exist. 

* Libraries support the common patterns of data abstraction in Python, so an approach that seems reasonable usually works. 
 * However, some things may have unexpected results. 
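
A minimal sketch of two of the caveats above, using hypothetical frames `frame` and `mixed` that are not part of this notebook. It assumes a library that only understands `ndarray`; `DataFrame.to_numpy()` is one way to hand it the data. It also shows one place where the result may be unexpected: an array holds a single type, so mixed columns are forced into a common one.

```python
import numpy as np
import pandas as pd

# Hypothetical data, for illustration only.
frame = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

# Hand the underlying values to code that expects an ndarray.
values = frame.to_numpy()
print(type(values))
print(values.dtype)   # the integer column is upcast so both columns share one type

# Mixing text and numbers: the array falls back to generic Python objects.
mixed = pd.DataFrame({"name": ["x", "y"], "n": [1, 2]})
print(mixed.to_numpy().dtype)
```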

# Some ubiquitous patterns
### 1. If you want to construct something, and have something else, try the constructor. 
Some examples: 

In [1]:
import numpy as np
import pandas as pd
nd = np.array([[1,2,3], [4,5,6]])
nd

array([[1, 2, 3],
       [4, 5, 6]])

In [2]:
# What if we want a DataFrame? 
df = pd.DataFrame(nd)
df

   0  1  2
0  1  2  3
1  4  5  6


In [3]:
# What if we want an array from a DataFrame? 
v2 = np.array(df)
v2

array([[1, 2, 3],
       [4, 5, 6]])
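
The same constructor pattern appears to extend to `Series` and to the basic Python types. The following is a small sketch with made-up values, not part of the original notebook:

```python
import numpy as np
import pandas as pd

s = pd.Series([10, 20, 30])    # list -> Series
as_list = list(s)              # Series -> list
as_array = np.array(s)         # Series -> ndarray
as_frame = pd.DataFrame(s)     # Series -> one-column DataFrame
as_set = set(as_list)          # list -> set

print(as_list)
print(as_array)
print(as_frame.shape)
```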

### 2. Modify behaviors with extra optional arguments.

In [4]:
# What if we want a DataFrame with row and column labels?
d2 = pd.DataFrame(nd, index=['d','e'], columns=['a', 'b', 'c'])
d2

   a  b  c
d  1  2  3
e  4  5  6


# Aside: how do optional arguments work? 
Consider the following example: 

In [5]:
def foo(number, multiplier=2):
    return number*multiplier

print(foo(2))
print(foo(2,7))
print(foo(3, multiplier=20))

4
14
60


* `multiplier=2` in the definition declares an optional argument whose default value is 2. 
* The default is used when the call does not supply a value. 
* You may pass the argument by position (`foo(2, 7)`) or by name (`foo(3, multiplier=20)`) when calling the function. 
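
Naming an argument matters most when a function has several optional arguments, because a name lets you set a later one while leaving earlier ones at their defaults. A minimal sketch, where `scale` and its `offset` parameter are hypothetical and exist only for this illustration:

```python
def scale(number, multiplier=2, offset=0):
    """Hypothetical helper: multiply, then add an offset."""
    return number * multiplier + offset

a = scale(3)             # both defaults used: 3*2 + 0
b = scale(3, 5)          # multiplier given by position: 3*5 + 0
c = scale(3, offset=1)   # multiplier left at its default, offset given by name: 3*2 + 1
print(a, b, c)
```

This is the same mechanism that lets `pd.DataFrame(nd, columns=[...])` set `columns` without also specifying `index`.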

### 3. Arguments that are sequences can be specified in many valid ways. 
Anything that is an `iterable` usually works.

Compare, e.g., 

In [6]:
pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['x', 'y', 'z'])

   x  y  z
0  1  2  3
1  4  5  6


In [7]:
pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=('x', 'y', 'z'))

   x  y  z
0  1  2  3
1  4  5  6


or even (to be totally perverse about it): 

In [8]:
pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns={'x': 42, 'y': 20, 'z': 10})

   x  y  z
0  1  2  3
1  4  5  6


# Why did that work? 
* The columns parameter takes any `iterable`. 
* Iterating over a dictionary returns its keys. 
* The values stored under those keys are ignored. 
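
A quick way to check this without `pandas` is to turn the dictionary into a list, which iterates over it in the same way. A short sketch:

```python
labels = {'x': 42, 'y': 20, 'z': 10}

keys = list(labels)               # iterating a dict yields its keys, in insertion order
values = list(labels.values())    # the values must be requested explicitly

print(keys)
print(values)
```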

As an expansion of the general principle, consider: 

In [9]:
pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=range(20, 23))

   20  21  22
0   1   2   3
1   4   5   6


# Why did that work? 
* `range` returns an iterable. 
* That's all that's needed. 
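
One consequence worth noting: the labels produced by `range` are integers, not strings, so a column is selected with `d[20]` rather than `d['20']`. A small sketch, where the name `d` is chosen only for this illustration:

```python
import pandas as pd

labels = list(range(20, 23))    # what iterating over the range produces
d = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=range(20, 23))

col = d[20]                     # select by the integer label 20
print(labels)
print(col.tolist())
```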

# Let's make sure we can do some basic things.
It's often important to convert between the basic types `array`, `DataFrame`, and `Series` to get things done. Here are some examples.

In [2]:
# Don't change this cell; just run it. 
from client.api.notebook import Notebook
ok = Notebook('03-07-py-np-pd-wrapup.ok')
#ok.auth(inline=True)

Assignment: Applying python
OK, version v1.14.15



Consider the `DataFrame`:

In [11]:
df = pd.DataFrame({'name': ['xavier', 'mark', 'ben'],
                   'species': ['cat', 'dog', 'dog'],
                   'fleas': [20, 100, 30],
                   'ticks': [2, 4, 2]})
df

     name  species  fleas  ticks
0  xavier      cat     20      2
1    mark      dog    100      4
2     ben      dog     30      2


1. Create a `numpy` `ndarray` `nf` from `df` that contains only the numeric columns of `df`. Although this particular `df` is small, your recipe should work even if `df` has thousands of rows. 

In [12]:
# Your answer:
nf = np.array(df.loc[:,'fleas':'ticks'])
nf

array([[ 20,   2],
       [100,   4],
       [ 30,   2]], dtype=int64)
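
The slice `'fleas':'ticks'` relies on the numeric columns being adjacent. As an alternative sketch, and not the answer the grader necessarily expects, `DataFrame.select_dtypes` picks columns by type instead of by position. The names `pets`, `numeric`, and `nf_alt` are chosen only for this illustration:

```python
import numpy as np
import pandas as pd

# Same data as the exercise, rebuilt here so the block runs on its own.
pets = pd.DataFrame({'name': ['xavier', 'mark', 'ben'],
                     'species': ['cat', 'dog', 'dog'],
                     'fleas': [20, 100, 30],
                     'ticks': [2, 4, 2]})

numeric = pets.select_dtypes(include='number')   # keep only numeric columns
nf_alt = numeric.to_numpy()
print(list(numeric.columns))
print(nf_alt)
```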

In [13]:
_ = ok.grade('q01')

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed



2. What is wrong with just using `nf = np.array(df)`? Why? Hint: consider what happens when trying to take a statistic of a non-numeric column.

___Your answer:___ 

Now consider the `array`: 

In [14]:
column_labels = ['x', 'y', 'z']
row_labels = ['a', 'b', 'c']
n3 = np.array([[1,2,3],[4,5,6],[7,8,9]])
n3

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

3. Create a `DataFrame` `d3` from this that has the column and row labels specified. Make the `index` the row labels. 

In [15]:
# Your answer: 
d3 = pd.DataFrame(n3, index=row_labels, columns=column_labels)
print(d3)

   x  y  z
a  1  2  3
b  4  5  6
c  7  8  9
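
Once the labels are attached, values can be looked up by label rather than by position. A brief sketch that rebuilds the same frame under the hypothetical name `grid`:

```python
import numpy as np
import pandas as pd

grid = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                    index=['a', 'b', 'c'], columns=['x', 'y', 'z'])

cell = grid.loc['b', 'y']   # row label 'b', column label 'y'
row = grid.loc['c']         # a whole row, as a Series
col = grid['z']             # a whole column, as a Series
print(cell, row.tolist(), col.tolist())
```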


In [16]:
_ = ok.grade('q03')

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed



Now consider the following data: 

In [17]:
%more e4.csv

4. Read in this file and convert it to an `array` `n4`. Omit non-numeric columns. Hint: read it as a `DataFrame`, and read up on how to avoid treating the first line as a header. 

In [18]:
# Your answer: 
d4 = pd.read_csv('e4.csv', header=None)
n4 = np.array(d4.loc[:, 1:2])
n4

array([[210, 400],
       [500, 422],
       [ 40,  50]], dtype=int64)
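
To see what `header=None` changes, here is a sketch that reads a made-up three-line CSV from memory. The text below is a stand-in chosen for illustration and is not the contents of `e4.csv`:

```python
import io
import numpy as np
import pandas as pd

text = "a,1,2\nb,3,4\nc,5,6\n"   # made-up stand-in, NOT the contents of e4.csv

with_header = pd.read_csv(io.StringIO(text))              # first line is consumed as column names
no_header = pd.read_csv(io.StringIO(text), header=None)   # columns are labeled 0, 1, 2

# .loc slices by label and includes both endpoints, so 1:2 means columns 1 and 2.
numbers = np.array(no_header.loc[:, 1:2])
print(with_header.shape, no_header.shape)
print(numbers)
```

Without `header=None`, the first data row is lost to the column names, which is why the default read comes back one row short.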

In [19]:
_ = ok.grade('q04')

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed



Consider these structures: 

In [20]:
n5a = np.array([["jack", 20], ["moe", 30], ['manny', 0]])
n5b = np.array([['moe', 340], ['jack', 40], ['manny', 40]])

5. Do an inner join on `n5a` and `n5b` on the keys that are the first elements of each row, and make the result an array `n5` consisting of only the numbers. Hint: convert to `DataFrame`, do the join, convert back to `array`. Make the result an array of ints if needed. 

In [21]:
# Your answer: 
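# Convert each array to a DataFrame; the columns are labeled 0 and 1.
d5a = pd.DataFrame(n5a)
d5b = pd.DataFrame(n5b)
# Inner join on the name column (label 0); the clashing column 1 becomes '1_x' and '1_y'.
d5 = pd.merge(d5a, d5b, on=0, how='inner')
# Keep only the number columns and convert their string contents to ints.
n5 = np.array(d5.loc[:, '1_x':'1_y'], dtype='int')
n5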

array([[ 20,  40],
       [ 30, 340],
       [  0,  40]])
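
Two details of this answer are easy to miss, and the sketch below illustrates them on a smaller, made-up pair of arrays. First, an `array` built from mixed text and numbers stores everything as text, which is why the conversion to ints is needed. Second, the `_x` and `_y` endings come from the default `suffixes` of `pd.merge`; the suffixes `_a` and `_b` are chosen here only for illustration:

```python
import numpy as np
import pandas as pd

left = np.array([["jack", 20], ["moe", 30]])
right = np.array([["moe", 340], ["jack", 40]])
print(left.dtype.kind)   # 'U': the numbers were stored as text

merged = pd.merge(pd.DataFrame(left), pd.DataFrame(right),
                  on=0, how='inner', suffixes=('_a', '_b'))
print(list(merged.columns))

# Convert the text columns to ints before building the array.
result = merged[['1_a', '1_b']].astype(int).to_numpy()
print(result)
```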

In [23]:
_ = ok.grade('q05')

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed

