# <span style="color:blue">Course Updates 10/30/2019</span>

## Updated schedule through the rest of the semester

|  Wk   |  M    |  W     | Topic   | Notebooks |
| :---: | :---: | :----: | :------ | :----- |
|  8  |  10/21  | 23  | **Numpy:** Data Abstraction, **Numpy:** Multi-dimensional arrays | Midterm, 03-01, 03-02 |
|  9  |  28  | 30  | **Numpy:** Reading into multi-dimensional arrays, **Pandas:** Dataframes and reading into them;  Merging and matching Dataframes| 03-03, 03-04, 03-05 |
|  10  |  11/4  | 6  | **Pandas:** Series and Views; Wrap Up Unit 3| 03-06, 03-07 |
|  11 |  11  | 13   | Classification and Clustering, **Case Study:** Iris Data Set | 04-02, 04-03  |
|   |    |    | Notebooks under development&dagger;  | <del>04-04, 04-06, 04-07</del>  |
|  12 |  18  | 20  | **Case Study:** [World Happiness Report](https://worldhappiness.report/ed/2019/)  | 04-04, 05-01 |
|  13 |  25   | &mdash;  | [Geopandas](http://geopandas.org/), **Case Study:** World Happiness Map | 05-03 |
|  14 |  12/2 | 4 |  **Case Study:** Twitter Sentiment Analysis | 05-04 |
|  16 |  | 12/13 | **(Take Home) Final Exam**  |  |

&dagger; We will not be covering these notebooks this semester. Feel free to peruse them if interested.

<hr/>

## JupyterHub Merging Behavior

According to the [documentation](https://jupyterhub.github.io/nbgitpuller/topic/automatic-merging.html#topic-automatic-merging), it shouldn't be necessary for the instructor to create new versions of notebooks.

> If the student has deleted a file locally, but the file is still present in the remote repo, the file from the remote repo is pulled into the student’s directory. This enables the use case where a student wants to ‘start over’ a file after having made many changes to it. They can simply delete the file, click the nbgitpuller link again, and get a fresh copy.

You just need to delete your local version in order to get the latest. Let us try it with the `03-04-dataframes-in-pandas` notebook. Load the notebook by clicking on the Canvas link. It may bring up an old copy of the notebook (if you had worked on it before) or a fresh copy (if you hadn't).

1. Make an innocuous change to the notebook (put your name into a **`code`** cell, for example) and Save and Checkpoint it. 
2. To avoid losing your work, use **File >> Make a Copy...** to create a backup. 
3. Delete the notebook by going up a directory level. In other words, if your URL is `.../a/b/c.ipynb`, go to `.../a/b/` in a different tab, select the `.../a/b/c.ipynb` file and delete it.
4. Click on the Canvas link again. _The changes you had made in step 1 should have disappeared, proving that deleting the file and coming back in through Canvas loads a fresh copy of the notebook!_
5. Delete the copy you made in step 2.

# <span style="color:blue">Dataframes in Pandas</span>


# Fundamentals of Numpy

* Numpy requires every element of an `ndarray` to have the same type. That is both its strength and its weakness. <img align="right" style="padding-left:10px; height: 65%; width: 65%" src="Addenda/figures/numpy-arrays.png" >
    * Numpy's ability to deal with multidimensional data is a major advantage when doing scientific calculations using vectors, matrices and tensors.
    * The exercise in 03-03-data-ingest revealed this limitation of Numpy when we couldn't read a rectangular structure with dissimilar data types into an `ndarray`. See the Numpy documentation on [Structured Arrays and Structured Datatypes](https://docs.scipy.org/doc/numpy/user/basics.rec.html) for more details.
    * Metadata (what each row or column represents) must be maintained outside of the `ndarray`.
* Columns in a typical spreadsheet can each be of a different type. 
* Thus, a plain Numpy array is a poor fit for spreadsheet-like data. 
* The `DataFrame` concept completely embodies the concept of a spreadsheet with multiple column types. 
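
A minimal sketch of the difference, using made-up rows (the data and the column names `quantity`, `weight` and `label` are hypothetical, not from the course files):

```python
import numpy as np
import pandas as pd

# Hypothetical rows: an integer, a float and a string in each row.
rows = [(1, 2.5, "alpha"), (3, 4.5, "beta")]

# numpy must pick one dtype for the whole array, so every value becomes a string.
arr = np.array(rows)
print(arr.dtype)   # a Unicode string dtype

# A DataFrame keeps a separate dtype for each column.
df = pd.DataFrame(rows, columns=["quantity", "weight", "label"])
print(df.dtypes)
```

Note that the numbers in `arr` can no longer be added or averaged without converting them back, while `df["quantity"]` and `df["weight"]` remain numeric.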

These workbooks are based upon [pandas in 10 minutes](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html) and subsequent detailed tutorials.

# Pandas and numpy
* Pandas is, in fact, built upon `numpy` and `numpy.ndarray` concepts. 
* The convenience of Pandas comes from being higher-level. 
* In a detailed analysis, `numpy.ndarray` is often still required, e.g., for TensorFlow and related tools. 
* Thus, one has to understand "both levels of abstraction."
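
Moving between the two levels might look like the sketch below. The frame and its column names `x` and `y` are invented for illustration; `to_numpy()` is the pandas method for getting at the underlying array.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric data.
df = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [4.0, 5.0, 6.0]})

col = df["x"].to_numpy()             # one column -> 1-D ndarray
matrix = df[["x", "y"]].to_numpy()   # several columns -> 2-D ndarray

print(type(col), col.shape)
print(matrix.shape)
print(matrix.mean(axis=0))           # numpy operations now apply directly
```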

# Pandas vs. Numpy

| numpy | pandas |
|-------|--------|
| Strength is multidimensional arrays and tensors | Strength is spreadsheets and time series |
| Assumes integer axes | Allows axes based upon timestamp and other indexes |
| Assumes homogeneous types | Allows heterogeneous column types |
| Row queries are complex | Row queries are simple |
| Rudimentary csv handling | Advanced csv handling includes all special cases | 
 
# Why most people use Pandas
* Most data is in spreadsheets. 
* Unlike numpy, pandas makes reading spreadsheets trivial. 

# My advice 
* If your data is CSV, then *read it into pandas and then reformat it into the appropriate numpy objects as needed.* 

Consider: 

In [None]:
%more data1.csv

In [1]:
import pandas as pd
data1 = pd.read_csv('data1.csv')
data1

Unnamed: 0,heading1,heading2,comments
0,1,2,"This is, of course, problematic in numpy"
1,3,4,"This is, too"


# Observations
* `pd.read_csv` understands the quoting conventions used by spreadsheet exports such as MS Excel. 
* You can thus stop worrying about commas in strings. 
* Printouts are formatted as tables automatically. 
* The row number (index) is listed on the left. 
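
To see the comma handling without the file on disk, here is a small sketch that reads equivalent text from memory. The CSV text is an assumed reconstruction of `data1.csv` based on the output above; the real file may differ.

```python
import io
import pandas as pd

# Assumed contents of data1.csv; the first header cell is empty.
csv_text = (
    ',heading1,heading2,comments\n'
    '0,1,2,"This is, of course, problematic in numpy"\n'
    '1,3,4,"This is, too"\n'
)

data = pd.read_csv(io.StringIO(csv_text))
print(data.shape)                # two rows, four columns
print(data.loc[0, "comments"])   # the quoted commas stay inside one field
```

The unnamed first column is what shows up as `Unnamed: 0` in the printout.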

# What happened to `numpy`? 
* `data1` is a `DataFrame`. 
* Each of its columns is backed by a `numpy` array. 

Consider:

In [2]:
data1['heading1']

0    1
1    3
Name: heading1, dtype: int64

In [3]:
data1.heading1

0    1
1    3
Name: heading1, dtype: int64

In [4]:
type(data1)

pandas.core.frame.DataFrame

In [5]:
data1['heading1'].values

array([1, 3])

In [7]:
type(data1.heading1)

pandas.core.series.Series

*Numpy is still underneath!*



# The reality of higher-level abstractions
* Do one thing well. 
* Are relatively poor at doing other things. 

# Pandas is
* excellent at parsing spreadsheets. 
* relatively poor at numerical operations, at which `numpy` excels. 
* absolutely horrible at tensors. My advice is simply *don't!* Use `numpy` instead. 

# Pandas joys
Compared to `numpy`
* Columns are accessed by name. 
* Everything prints prettily. 
* Row selection is intuitive.

Consider: 


In [None]:
data1[data1.heading1 > 1 ]

In [None]:
data1['sum'] = data1.heading1 + data1.heading2
data1

In [None]:
data1['approved'] = True
data1

# Observations
* Can access columns as if the `DataFrame` were a dictionary. 
* Can access columns as if they were attributes of the `DataFrame`. 
* Can do column arithmetic. Result is a new column. 
* Can set a new column to the same value for all rows. 
* Can add columns to the `DataFrame` dynamically by assigning to a new key for each new column. 

In [None]:
data1.loc[data1.heading1 > 2, 'heading2']

In [None]:
data1.loc[data1.heading1 > 2, 'heading2'] = 200
data1[data1.heading1 > 2]

# Whoa there! 

* What just happened? 
* You might recall that one of the major pains in `numpy` is that row data is immutable. 
* Here we managed to set values in particular rows and columns, chosen by a condition evaluated over all rows. 
* `data1.loc[<row selector>, <column selector>] = <value>`
* This is difficult to understand, but really powerful. 
* It's also not all-powerful, and what it can't do is important. 
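
The row and column selectors can take several forms. The sketch below tries a few of them on a made-up frame (the data and the `note` column are hypothetical):

```python
import pandas as pd

# Hypothetical frame, loosely modeled on data1.
df = pd.DataFrame({
    "heading1": [1, 3, 5],
    "heading2": [2, 4, 6],
    "note": ["a", "b", "c"],
})

# Boolean mask for rows, a single column name for columns.
df.loc[df["heading1"] > 2, "heading2"] = 200

# A single row label and a single column name.
df.loc[0, "note"] = "first"

# Boolean mask for rows, a list of column names for columns.
df.loc[df["heading1"] > 2, ["heading1", "heading2"]] = 0

print(df)
```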

# A tale of Lvalues and Rvalues 
* At its core, Pandas very heavily uses the hacks available in Python classes. 
* These allow it to control which expressions are Lvalues and which are Rvalues. 
* An *Lvalue* is anything that can be on the left of the = sign in an assignment. 
* An *Rvalue* can be on the right hand side of the = sign in an assignment. 
* For the most part, *Lvalues are Rvalues*, but not vice-versa. 
* But, in Pandas, there is an active tension between 
    
    * Selecting data and 
    * Setting data. 

* That plays out by defining *different syntaxes for setting and selection.*

# What this means in practice is that:
* One should be wary of placing expressions like `df[...]` on the left-hand side of the `=`, 
* because *there is a distinct syntax, `df.loc[...]`, that is designed for that!*
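
Here is a sketch of the pitfall on a made-up frame (the columns `a` and `b` are hypothetical). In the pandas versions I am aware of, the chained form assigns into a temporary copy and issues a warning, which is silenced here only to keep the output short.

```python
import warnings
import pandas as pd

df = pd.DataFrame({"a": [1, 3], "b": [2, 4]})

# Chained form: df[...] builds a new frame, and ['b'] = ... modifies that copy.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")   # pandas warns about this pattern
    df[df["a"] > 2]["b"] = 200

after_chained = df["b"].tolist()
print(after_chained)   # [2, 4] -- the original is unchanged

# .loc form: one indexing operation on the original frame.
df.loc[df["a"] > 2, "b"] = 200
after_loc = df["b"].tolist()
print(after_loc)       # [2, 200]
```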

# An aside: how we control Lvalues and Rvalues in Python
* Python classes allow one to define separate methods for indexing an object on the left-hand side of the = sign (`__setitem__`) and on the right-hand side (`__getitem__`). 
* Here is a simple demo:

In [None]:
class Foo:
    def __init__(self):
        # Per-instance storage; a class-level list would be shared by every Foo.
        self.items = []

    def __getitem__(self, index):
        print("I'm getting the value at index {}".format(index))
        return self.items[index]

    def __setitem__(self, index, value):
        print("I'm setting item at index {} to {}".format(index, value))
        # Pad with None until the requested index exists.
        while len(self.items) < index + 1:
            self.items.append(None)
        self.items[index] = value

f = Foo()
f[4] = 'yo'  # f.__setitem__(4, 'yo')
print(f.items)
f[4]  # f.__getitem__(4)

# This insane little class
* implements a *self-extending list*. 
* values that are not defined are set to `None`.
* Without this intervention, if `f` were a regular list, this code would result in a runtime error. 
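
For comparison, a short sketch of what a plain list does with the same assignment:

```python
items = []          # a regular, empty list
message = None
try:
    items[4] = 'yo'  # there is no index 4 yet
except IndexError as err:
    message = str(err)
    print("IndexError:", message)
```

A regular list never grows on assignment, so the only way to get that behavior is to intercept `__setitem__` as `Foo` does.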

# The Lvalue/Rvalue minefield
* When learning Pandas and specifically `DataFrame`s, it's really difficult to keep straight what can be on the left-hand-side of the = sign in an assignment. 
* This can be a coding minefield, where assignment statements can "blow up" when you least expect them to do so. 
* Some things that look like Lvalues actually are.
    * Assigning a list of values to a whole column. 
* Some things that look like Lvalues are not, e.g., 
    * Assigning a value to part of a column. 

# Let's put this into practice.  

First, let's load some interesting data into a DataFrame: 

In [None]:
towns = pd.read_csv('2010_Population_By_Town.csv')
print("First 5 rows are:")
towns.head()  # First 5 rows. Remove qualifier to see all 

(source: US census, state of Conn, data.gov) 

1. Write an expression that selects the rows for towns with population above 100000.

In [None]:
# your answer
...

2. Write an expression for all towns whose name starts with 'C'. Hint: use >= 'C', < 'D' to select. 

In [None]:
# your answer: 
...

3. Write code to create a new column `Cool` and mark `Clinton` and `Wolcott` as `Cool` by setting their `Cool` columns to `True` and everyone else's to `False`.


In [None]:
# Your answer: 
...

In [None]:
# use this to test your answer
towns[towns.Cool == True]

4. List all towns with at least 14 characters in their names. Hint: you can apply `.str.len()` to a column to get its length as a string.

In [None]:
# Your answer: 
...

Consider the following additional table: 

In [None]:
tax = pd.read_csv('2012_Retail_Sales_By_Town_ALL_NAICS.csv', engine='python', skipfooter=8)
print("First 5 rows of tax are:")
tax.head()  # first 5 rows; you can remove the qualifier to see the whole table. 

5. Compute "Tax per capita" as a new column by dividing "Total Tax Due" by "Number of Taxpayers". There is whitespace at the end of the label for "Total Tax Due"! 

In [None]:
# Note the following before beginning: 
list(tax)

In [None]:
# Your answer: 
...

In [None]:
# Use this to test your answer: 
tax.head()  # first 5 rows of solution

6. List the towns in which the tax per capita is larger than $80,000 per entity (Gasp! This is sales tax! Entities can be businesses, though).

In [None]:
# Your answer:
...