# <span style="color:blue">Course Plan 11/6/2019</span>
## <span style="color:blue">(Updated 11/8/2019)</span>

## Updated schedule through the rest of the semester

|  Wk   |  M    |  W     | Topic   | Notebooks | Due |
| :---: | :---: | :----: | :------ | :----- | :---: |
|  8  |  10/21  | 23  | **Numpy:** Data Abstraction, **Numpy:** Multi-dimensional arrays | Midterm, 03-01, 03-02 | 10/30 |
|  9  |  28  | 30  | **Numpy:** Reading into multi-dimensional arrays, **Pandas:** DataFrames and reading into them; Merging and matching DataFrames | 03-03, 03-04, 03-05 | 10/30 |
|  10  |  11/4  | 6  | **Pandas:** Series and Views; Wrap Up Unit 3 | 03-06, 03-07 | 11/10 |
|  11 |  &mdash; | 13   | Classification and Clustering, **Case Study:** Iris Data Set | 04-02, 04-03  | 11/17 |
|   |    |    | Notebooks under development&dagger;  | <del>04-04, 04-06, 04-07</del>  |  |
|  12 |  18  | 20  | **Case Study:** [World Happiness Report](https://worldhappiness.report/ed/2019/)  | 04-04, 05-01 | 11/24 |
|  13 |  25   | &mdash;  | [Geopandas](http://geopandas.org/), **Case Study:** World Happiness Map | 05-03 | 12/01 |
|  14 |  12/2 | 4 |  **Case Study:** Twitter Sentiment Analysis | 05-04 | 12/08 |
|  16 |  | 12/13 | **(Take Home) Final Exam**  |  |  |

&dagger; We will not be covering these notebooks this semester. Feel free to peruse them if interested.

<hr/>


# Applying Python to data analysis

Everything we have done so far lays the foundation for applying Python to data analysis.
For that task, we need:
* The basic Python types (`list`, `set`, `dict`, `tuple`):
  * How to use those types.
  * How to construct new ones.
* The data storage types (`ndarray`, `DataFrame`):
  * How to make one.
  * How to manipulate one:
    * filtering
    * constructing new columns
    * transforming between types
  
# Now we move on to the final step of the journey. 
* Use this knowledge to do actual data analysis. 
* Learn to use the pre-packaged Python libraries that are constructed to help. 

# Some important caveats
* `numpy` predates `pandas`
 
 * Most data analysis libraries support the `numpy` format `ndarray`.
 * Some data analysis libraries don't support the `pandas` format `DataFrame`.  
* Libraries provide general-purpose methods and usually leave out special-purpose ones.

 * If there is a common need, chances are that there's a library that helps. 
 * If -- on the other hand -- your needs are unique, the likelihood of a library existing is small. 

* Libraries support the common patterns of data abstraction in python, and things that seem reasonable usually are. 
 * However, some things may have unexpected results. 
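
As a rough illustration of the first caveat: when a library only accepts an `ndarray`, one option is to convert the `DataFrame` first. The small table below is made up for this sketch; `DataFrame.to_numpy()` is the pandas method that returns the values as a plain array, without the row and column labels.

```python
import numpy as np
import pandas as pd

# Hypothetical data, invented for this sketch (not from the course notebooks).
table = pd.DataFrame({'height': [1.0, 2.0, 3.0],
                      'width': [4.0, 5.0, 6.0]})

# Convert to a plain ndarray; the labels are dropped along the way.
values = table.to_numpy()

# Any ndarray-based routine can now be applied, e.g. per-column means.
col_means = values.mean(axis=0)
print(type(values), col_means)
```

The trade-off is that the labels are gone after the conversion, so keep the original `DataFrame` around if you need them later.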

# Some ubiquitous patterns

### 1. If you have one thing and want to construct another, try the constructor. 

In the following cell, we want the result to be
```
    array([[1, 2, 3],
           [4, 5, 6]])

```

In [None]:
import numpy as np

nd = np.array(...)
nd

In [None]:
# what if we want a DataFrame? 

import pandas as pd
df = pd.DataFrame(...)
df
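
One possible way to fill in the two cells above, offered as a sketch rather than the only answer: hand a nested list to the `np.array` constructor, then hand the resulting array to the `pd.DataFrame` constructor. The names `nd_sketch` and `df_sketch` are chosen here so they don't clash with `nd` and `df`.

```python
import numpy as np
import pandas as pd

# A list of two rows becomes a 2x3 ndarray.
nd_sketch = np.array([[1, 2, 3], [4, 5, 6]])

# The DataFrame constructor accepts the ndarray directly.
# With no labels given, rows and columns are numbered from 0.
df_sketch = pd.DataFrame(nd_sketch)

print(nd_sketch)
print(df_sketch)
```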

### 2. Modify behaviors with extra optional arguments.


What if we want a DataFrame with row and column labels, like this...?


|        |   `a`   |   `b`   |   `c`   |
| -----: | :-----: | :-----: | :-----: |
|   `d`  |   `1`   |   `2`   |   `3`   |
|   `e`  |   `4`   |   `5`   |   `6`   |


In [None]:
# What if we want a DataFrame with row and column labels?
d2 = pd.DataFrame(nd, ...)
d2

In [None]:
# What if we want to get the array back out of the DataFrame? 
v2 = np.array(df)
v2
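
A minimal sketch of this pattern, assuming the 2x3 array from pattern 1: the optional `index` and `columns` arguments of `pd.DataFrame` supply the row and column labels, and `np.array` recovers the unlabeled values. The names `arr`, `labeled`, and `back` are placeholders for this example.

```python
import numpy as np
import pandas as pd

arr = np.array([[1, 2, 3], [4, 5, 6]])

# index= labels the rows, columns= labels the columns.
labeled = pd.DataFrame(arr, index=['d', 'e'], columns=['a', 'b', 'c'])
print(labeled)

# Going the other way drops the labels and keeps only the values.
back = np.array(labeled)
print(back)
```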

# Aside: how do optional arguments work? 
Consider the following example: 

In [None]:
def foo(number, multiplier=2):
    return number*multiplier

print(foo(2))
print(foo(2,7))
print(foo(3, multiplier=20))

* `multiplier=2` declares an optional argument with a default value of `2`. 
* The default is used whenever the call does not supply a value. 
* You may pass the argument by position (`foo(2, 7)`) or by name (`foo(3, multiplier=20)`). 
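
Naming arguments becomes more useful once a function has several optional ones. The function `scale` below is a hypothetical extension of `foo`, written only for this sketch: naming an argument lets you skip the optional ones before it, and named arguments may appear in any order.

```python
def scale(number, multiplier=2, offset=0):
    """Hypothetical helper with two optional arguments."""
    return number * multiplier + offset

a = scale(5)                       # both defaults used: 5*2 + 0
b = scale(5, offset=1)             # skip multiplier, set offset: 5*2 + 1
c = scale(multiplier=3, number=5)  # named arguments in any order: 5*3 + 0

print(a, b, c)
```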

### 3. Arguments that are sequences can be specified in many valid ways. 

Anything that is an `iterable` usually works. _Psst. what's an iterable?_


In [None]:
pd.DataFrame(..., columns=['x', 'y', 'z'])

In [None]:
pd.DataFrame(..., columns=('x', 'y', 'z'))

or even (to be totally perverse about it): 

In [None]:
pd.DataFrame(..., columns={'x': 42, 'y': 20, 'z': 10})

# Why did that work? 
* The columns parameter takes any `iterable`. 
* Iterating over a dictionary yields its keys. 
* The values attached to those keys are ignored. 
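
To see the key-only behavior in isolation, here is a small sketch. The explicit `list(...)` call spells out what iterating over the dictionary amounts to; the data values are made up for the example.

```python
import pandas as pd

labels = {'x': 42, 'y': 20, 'z': 10}

# Iterating over a dict visits its keys, in insertion order.
keys = list(labels)
print(keys)

# Using those keys as column labels; the values 42, 20, 10 play no role.
frame = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=list(labels))
print(frame)
```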

As an expansion of the general principle, consider: 

In [None]:
pd.DataFrame(..., columns=range(20, 23))

# Why did that work? 
* `range` returns an iterable. 
* That's _all_ that's needed. 
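
A quick way to check whether something is iterable is to loop over it or convert it to a list. The sketch below does that for a `range` and then uses it as column labels; the data values are invented for the example.

```python
import pandas as pd

cols = range(20, 23)

# A range produces its values one at a time when iterated.
as_list = list(cols)
print(as_list)

# The same range serves as column labels.
frame = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=cols)
print(frame)
```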

# Let's make sure we can do some basic things.
It's often important to convert between the basic types `array`, `DataFrame`, and `Series` to get things done. Here are some examples.

In [None]:
# Don't change this cell; just run it. 
from client.api.notebook import Notebook
ok = Notebook('03-07-py-np-pd-wrapup.ok')

#### 1. Consider the `DataFrame`:

In [None]:
df = pd.DataFrame({'name': ['xavier', 'mark', 'ben'],
                   'species': ['cat', 'dog', 'dog'],
                   'fleas': [20, 100, 30],
                   'ticks': [2, 4, 2]})
df

#### 2. Create a `numpy` `ndarray` `nf` from `df` that contains only the numeric columns of `df`.  

While the specific value of our `df` is simple, your recipe should work even if `df` has thousands of rows. 

In [None]:
# Your answer:
nf = np.array(df.loc[:,'fleas':'ticks'])
nf

In [None]:
_ = ok.grade('q01')

#### 2a. What is wrong with just using `nf = np.array(df)`? 

Hint: consider what happens when you try to compute a statistic on a non-numeric column.

In [None]:
nf = np.array(df)
nf
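
A sketch of what to look for, using a copy of the same data under the hypothetical name `pets`: converting mixed columns gives an array of dtype `object`, on which numeric statistics fail. `select_dtypes(include='number')` is one pandas way to keep only the numeric columns without naming them; it is an alternative to the slice used in the answer above, not a replacement for it.

```python
import numpy as np
import pandas as pd

pets = pd.DataFrame({'name': ['xavier', 'mark', 'ben'],
                     'species': ['cat', 'dog', 'dog'],
                     'fleas': [20, 100, 30],
                     'ticks': [2, 4, 2]})

# Strings and integers together force a generic object array.
everything = np.array(pets)
print(everything.dtype)   # object

# Taking a column-wise mean of that array is expected to fail,
# because the string columns cannot be averaged.
mean_failed = False
try:
    everything.mean(axis=0)
except TypeError:
    mean_failed = True
print(mean_failed)

# Keeping only numeric columns gives an integer array that supports statistics.
numeric_only = np.array(pets.select_dtypes(include='number'))
print(numeric_only.dtype, numeric_only.mean(axis=0))
```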

Now consider the `array`: 

In [None]:
column_labels = ['x', 'y', 'z']
row_labels = ['a', 'b', 'c']
n3 = np.array([[1,2,3],[4,5,6],[7,8,9]])
n3

#### 3. Create a `DataFrame` `d3` from this that...

... has the column and row labels specified. Make the `index` the row labels. 

In [None]:
# Your answer: 
d3 = pd.DataFrame(n3, index=row_labels, columns=column_labels)
print(d3)

In [None]:
_ = ok.grade('q03')

Now consider the following data: 

In [None]:
%more e4.csv

#### 4. Read in this file and convert it to an `array` `n4`. 

Omit non-numeric columns. Hint: read the file as a `DataFrame`, and look up how to keep `read_csv` from treating the first line as a header. 

In [None]:
# Your answer: 
d4 = pd.read_csv('e4.csv', header=None)
n4 = np.array(d4.loc[:, 1:2])
n4
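
To see what `header=None` does without depending on the file, here is a sketch that reads from an in-memory string. The text below is a made-up stand-in, not the contents of `e4.csv`; `io.StringIO` simply lets `read_csv` treat a string as a file.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical headerless CSV text; NOT the actual contents of e4.csv.
text = "a,1,2\nb,3,4\nc,5,6\n"

# header=None keeps the first line as data and numbers the columns 0, 1, 2.
frame = pd.read_csv(io.StringIO(text), header=None)
print(frame)

# .loc slices by label and includes both ends, so 1:2 selects columns 1 and 2.
numbers = np.array(frame.loc[:, 1:2])
print(numbers)
```

Without `header=None`, the first row (`a,1,2`) would be consumed as column names and lost from the data.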

In [None]:
_ = ok.grade('q04')