# Pandas and dataframes

* Numpy's basic limitation is that every element of an `ndarray` must have the same type. 
    * The exercise in 03-03-data-ingest revealed this limitation of Numpy when we couldn't read a rectangular structure with dissimilar data types into an `ndarray`. See SciPy [Structured Arrays and Structured Datatypes](https://docs.scipy.org/doc/numpy/user/basics.rec.html) for more details.
* Columns in a typical spreadsheet can each be of a different type. 
* Thus, Numpy does not correctly express spreadsheets. 
* The `DataFrame` concept completely embodies the concept of a spreadsheet with multiple column types. 
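
To make the type limitation concrete, here is a minimal sketch. The column names and values are made up for illustration and are not part of the course data.

```python
import numpy as np
import pandas as pd

# numpy picks one dtype for the whole array; mixing numbers and text
# coerces every element to a string.
mixed = np.array([1, 2.5, "three"])
print(mixed.dtype)

# A DataFrame keeps a separate dtype for each column.
# (Hypothetical column names, chosen only for this sketch.)
frame = pd.DataFrame({
    "count": [1, 2],
    "ratio": [0.5, 2.5],
    "label": ["a", "b"],
})
print(frame.dtypes)
```

The array ends up with a single string dtype, while the `DataFrame` reports an integer, a floating-point, and a text column side by side.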

These workbooks are based upon [pandas in 10 minutes](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html) and subsequent detailed tutorials.

# Pandas and numpy
* Pandas is -- in fact -- based upon `numpy` and `numpy.ndarray` concepts. 
* The convenience of Pandas comes from being higher-level. 
* In a detailed analysis, `numpy.ndarray` is often still required, e.g., for TensorFlow and related tools. 
* Thus, one has to understand "both levels of abstraction."
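
Moving between the two levels might look like the sketch below. The frame and its column names are hypothetical; the point is only that the numeric columns can be handed to array-based tools as a plain `ndarray`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: two numeric columns plus a text column.
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0],
    "y": [4.0, 5.0, 6.0],
    "note": ["a", "b", "c"],
})

# Select only the numeric columns, then drop down to a plain ndarray.
features = df[["x", "y"]].to_numpy(dtype=float)

print(type(features))
print(features.shape)
```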

# Pandas vs. Numpy

| numpy | pandas |
|-------|--------|
| Strength is multidimensional arrays and tensors | Strength is spreadsheets and time series |
| Assumes integer axes | Allows axes based upon timestamp and other indexes |
| Assumes homogeneous types | Allows heterogeneous column types |
| Row queries are complex | Row queries are simple |
| Rudimentary csv handling | Advanced csv handling includes all special cases | 
 
# Why most people use Pandas
* Most data is in spreadsheets. 
* Unlike numpy, pandas makes reading spreadsheets trivial. 

# My advice 
* If your data is CSV, then *read it into pandas and then reformat it into the appropriate numpy objects as needed.* 
Consider: 

In [1]:
%more data1.csv

In [2]:
import pandas as pd
data1 = pd.read_csv('data1.csv')
data1

Unnamed: 0,heading1,heading2,comments
0,1,2,"This is, of course, problematic in numpy"
1,3,4,"This is, too"


# Observations
* Pandas is tuned to read MS Excel spreadsheets. 
* You can thus stop worrying about commas in strings. 
* Printouts are made pretty via correct pretty-printing calls. 
* Row number/index is listed to left. 
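
The point about commas in strings can be checked without a file on disk. This is a small sketch that assumes CSV text shaped like `data1.csv` (the text here is typed in by hand and read through `io.StringIO`):

```python
import io
import pandas as pd

# Hand-typed CSV text resembling data1.csv; the comments contain commas.
csv_text = (
    "heading1,heading2,comments\n"
    '1,2,"This is, of course, problematic in numpy"\n'
    '3,4,"This is, too"\n'
)

# Splitting a line on every comma breaks the quoted comment apart.
naive_fields = csv_text.splitlines()[2].split(",")
print(len(naive_fields))

# read_csv respects the quotes and keeps each comment in one field.
data = pd.read_csv(io.StringIO(csv_text))
print(data.shape)
print(data.loc[1, "comments"])
```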

# What happened to `numpy`? 
* `data1` is a `DataFrame`. 
* Each of the columns is represented in `numpy`.

Consider:

In [3]:
data1['heading1']

0    1
1    3
Name: heading1, dtype: int64

In [4]:
data1.heading1

0    1
1    3
Name: heading1, dtype: int64

In [5]:
type(data1)

pandas.core.frame.DataFrame

In [6]:
data1['heading1'].values

array([1, 3])

*Numpy is still underneath!*

# The reality of higher-level abstractions
* Do one thing well. 
* Are relatively poor at doing other things. 

# Pandas is
* excellent at parsing spreadsheets. 
* relatively poor at numerical operations, at which `numpy` excels. 
* Absolutely horrible at tensors. My advice is simply *don't!* Use `numpy` instead. 

# Pandas joys
Compared to `numpy`
* Columns are accessed by name. 
* Everything prints prettily. 
* Row selection is intuitive.

Consider: 


In [7]:
data1[data1.heading1 > 1 ]

Unnamed: 0,heading1,heading2,comments
1,3,4,"This is, too"


In [8]:
data1['sum'] = data1.heading1 + data1.heading2
data1

Unnamed: 0,heading1,heading2,comments,sum
0,1,2,"This is, of course, problematic in numpy",3
1,3,4,"This is, too",7


In [9]:
data1['approved'] = True
data1

Unnamed: 0,heading1,heading2,comments,sum,approved
0,1,2,"This is, of course, problematic in numpy",3,True
1,3,4,"This is, too",7,True


# Observations
* Can access columns as if the `DataFrame` were a dictionary. 
* Can access columns as if they were attributes of the `DataFrame`. 
* Can do column arithmetic. Result is a new column. 
* Can set a new column to the same value for all rows. 
* Can add columns to the `DataFrame` dynamically by assigning to a new key for each new column. 
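
One caveat about the two access styles, sketched on a made-up frame: attribute-style access only works for names that are valid Python identifiers, and new columns should be created with brackets.

```python
import pandas as pd

# Made-up frame; one column name contains a space.
df = pd.DataFrame({"heading1": [1, 3], "2010 Population": [10, 20]})

# Dictionary-style and attribute-style access return the same column.
same = df["heading1"].equals(df.heading1)

# A name with a space is not a valid identifier, so only brackets work here.
population = df["2010 Population"]

# Create new columns with brackets; assigning to a new attribute name
# (df.double = ...) does not add a column.
df["double"] = df.heading1 * 2
print(df)
```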

In [10]:
data1.loc[data1.heading1 > 2, 'heading2']

1    4
Name: heading2, dtype: int64

In [11]:
data1.loc[data1.heading1 > 2, 'heading2'] = 200
data1[data1.heading1 > 2]

Unnamed: 0,heading1,heading2,comments,sum,approved
1,3,200,"This is, too",7,True


# Whoa there! 

* What just happened? 
* You might recall that one of the major pains in `numpy` is that row data is immutable. 
* Here we managed to set a value in exactly the rows and columns picked out by conditions evaluated over the whole `DataFrame`. 
* `data1.loc[<row selector>, <column selector>] = <value>`
* This is difficult to understand, but really powerful. 
* It's also not all-powerful, and what it can't do is important. 

# A tale of Lvalues and Rvalues 
* At its core, Pandas very heavily uses the hacks available in Python classes. 
* These allow it to control which expressions are Lvalues and which are Rvalues. 
* An *Lvalue* is anything that can be on the left of the = sign in an assignment. 
* An *Rvalue* can be on the right hand side of the = sign in an assignment. 
* For the most part, *Lvalues are Rvalues*, but not vice-versa. 
* But, in Pandas, there is an active tension between 
    
    * Selecting data and 
    * Setting data. 

* That plays out by defining *different syntaxes for setting and selection.*

# What this means in practice is that:
* One should be wary of placing expressions like `df[...]` on the left-hand side of the `=`, 
* because *there is a distinct syntax `df.loc[...]` that is designed for that!*
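
Here is a small sketch of why this matters, on a made-up two-row frame. Selecting rows with a boolean mask produces a copy, so assigning into that selection leaves the original untouched, while `df.loc[...]` writes into the original. Pandas normally warns about the first form; the warning is silenced here only to keep the output short.

```python
import warnings
import pandas as pd

df = pd.DataFrame({"heading1": [1, 3], "heading2": [2, 4]})

# Chained form: df[mask] is a temporary copy, so this assignment is lost.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    df[df.heading1 > 2]["heading2"] = 200
after_chained = df["heading2"].tolist()

# .loc form: the row and column selectors address the original frame.
df.loc[df.heading1 > 2, "heading2"] = 200
after_loc = df["heading2"].tolist()

print(after_chained)
print(after_loc)
```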

# An aside: how we control Lvalues and Rvalues in Python
* Python classes can define separate methods for when an indexed object appears on the left-hand side of the = sign (`__setitem__`) and when it appears on the right-hand side (`__getitem__`). 
* Here is a simple demo:

In [15]:
class Foo:
    def __init__(self):
        # Per-instance storage; a class-level list would be shared by all instances.
        self.items = []
    def __getitem__(self, index):
        print("I'm getting the value at index {}".format(index))
        return self.items[index]
    def __setitem__(self, index, value):
        print("I'm setting item at index {} to {}".format(index, value))
        # Pad with None until the list is long enough to hold this index.
        while len(self.items) < index + 1:
            self.items.append(None)
        self.items[index] = value

f = Foo()
f[4] = 'yo'  # f.__setitem__(4, 'yo')
print(f.items)
f[4]  # f.__getitem__(4)

I'm setting item at index 4 to yo
[None, None, None, None, 'yo']
I'm getting the value at index 4


'yo'

# This insane little class
* implements a *self-extending list*. 
* values that are not defined are set to `None`.
* Without this intervention, if `f` were a regular empty list, `f[4] = 'yo'` would raise an `IndexError`. 
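
For comparison, this is what a plain list does with the same assignment (a minimal check, independent of the `Foo` class):

```python
plain = []
message = None
try:
    plain[4] = 'yo'   # a plain list does not extend itself
except IndexError as err:
    message = str(err)

print(message)
```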

# The Lvalue/Rvalue minefield
* When learning Pandas and specifically `DataFrame`s, it's really difficult to keep straight what can be on the left-hand-side of the = sign in an assignment. 
* This can be a coding minefield, where assignment statements can "blow up" when you least expect them to do so. 
* Some things that look like Lvalues actually are, e.g., 
    * Assigning a list of values to a whole column. 
* Some things that look like Lvalues are not, e.g., 
    * Assigning a value to part of a column. 
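
The "whole column" case can be sketched as follows, again on a made-up frame. Assigning a list to a column name works as long as the list has one value per row; a list of the wrong length is rejected.

```python
import pandas as pd

df = pd.DataFrame({"heading1": [1, 3], "heading2": [2, 4]})

# Replacing a whole column with a list of matching length works.
df["heading2"] = [20, 40]

# A list with the wrong number of values raises a ValueError.
error_message = None
try:
    df["heading2"] = [1, 2, 3]
except ValueError as err:
    error_message = str(err)

print(df["heading2"].tolist())
print(error_message)
```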

# Let's put this into practice.  

First, let's load some interesting data into a DataFrame: 

In [19]:
towns = pd.read_csv('2010_Population_By_Town.csv')
print("First 5 rows are:")
towns.head()  # First 5 rows. Remove .head() to see all rows.

First 5 rows are:


Unnamed: 0,Town Num,TOWN,2010 Population
0,1,Andover,3303
1,2,Ansonia,19249
2,3,Ashford,4317
3,4,Avon,18098
4,5,Barkhamsted,3799


(Source: US Census, State of Connecticut, data.gov) 

1. Write an expression that selects the rows for towns with a population above 100000.

In [33]:
# your answer
towns.loc[towns['2010 Population'] > 100000]

Unnamed: 0,Town Num,TOWN,2010 Population
14,15,Bridgeport,144229
63,64,Hartford,124775
92,93,New Haven,129779
134,135,Stamford,122643
150,151,Waterbury,110366


2. Write an expression for all towns whose name starts with 'C'. Hint: use >= 'C', < 'D' to select. 

In [40]:
# your answer: 
towns.loc[(towns['TOWN']>='C') & (towns['TOWN'] < 'D')]

Unnamed: 0,Town Num,TOWN,2010 Population
20,21,Canaan,1234
21,22,Canterbury,5132
22,23,Canton,10292
23,24,Chaplin,2305
24,25,Cheshire,29261
25,26,Chester,3994
26,27,Clinton,13260
27,28,Colchester,16068
28,29,Colebrook,1485
29,30,Columbia,5485


3. Write code to create a new column `Cool` and mark `Clinton` and `Wolcott` as `Cool` by setting their `Cool` columns to `True` and everyone else's to `False`.


In [41]:
# Your answer: 
towns['Cool'] = (towns['TOWN'] == 'Clinton') | (towns['TOWN'] == 'Wolcott')

In [42]:
# use this to test your answer
towns[towns.Cool == True]

Unnamed: 0,Town Num,TOWN,2010 Population,Cool
26,27,Clinton,13260,True
165,166,Wolcott,16680,True


4. List all towns with at least 14 characters in their names. Hint: you can apply `.str.len()` to a column to get the length of each string in it.

In [43]:
# Your answer: 
towns[towns['TOWN'].str.len() >= 14]

Unnamed: 0,Town Num,TOWN,2010 Population,Cool
98,99,North Branford,14407,False
101,102,North Stonington,5297,False


Consider the following additional table: 

In [44]:
tax = pd.read_csv('2012_Retail_Sales_By_Town_ALL_NAICS.csv', engine='python', skipfooter=8)
print("First 5 rows of tax are:")
tax.head()  # first 5 rows; you can remove the qualifier to see the whole table. 

First 5 rows of tax are:


Unnamed: 0,Municipality,Number of Taxpayers,Total Retail Sales of Goods,Total Tax Due **(Excluding Tax at 9.35% Rate),Tax Due At 6.35%,Tax Due at 7%
0,ANDOVER (001),127,5529457.0,428026.0,427872.0,59.0
1,ANSONIA (002),402,89228175.0,3817640.0,3813289.0,2042.0
2,ASHFORD (003),166,12598684.0,659952.0,654175.0,5653.0
3,AVON (004),739,205383841.0,8690891.0,8575388.0,113441.0
4,BARKHAMSTED (005),174,30403443.0,695373.0,694846.0,0.0


5. Compute "Tax per capita" as a new column by dividing "Total Tax Due" by "Number of Taxpayers". There is whitespace at the end of the label for "Total Tax Due"! 

In [45]:
# Note the following before beginning: 
list(tax)

['Municipality',
 'Number of Taxpayers',
 'Total Retail Sales of Goods',
 'Total Tax Due **(Excluding Tax at 9.35% Rate) ',
 'Tax Due At 6.35%',
 'Tax Due at 7%']

In [46]:
# Your answer: 
tax['Tax per capita'] = tax[list(tax)[3]] / tax[list(tax)[1]]

In [47]:
# Use this to test your answer: 
tax.head()  # first 5 rows of solution

Unnamed: 0,Municipality,Number of Taxpayers,Total Retail Sales of Goods,Total Tax Due **(Excluding Tax at 9.35% Rate),Tax Due At 6.35%,Tax Due at 7%,Tax per capita
0,ANDOVER (001),127,5529457.0,428026.0,427872.0,59.0,3370.283465
1,ANSONIA (002),402,89228175.0,3817640.0,3813289.0,2042.0,9496.616915
2,ASHFORD (003),166,12598684.0,659952.0,654175.0,5653.0,3975.614458
3,AVON (004),739,205383841.0,8690891.0,8575388.0,113441.0,11760.339648
4,BARKHAMSTED (005),174,30403443.0,695373.0,694846.0,0.0,3996.396552


6. List the towns in which the tax per capita is larger than $80,000 per entity (Gasp! This is sales tax! Entities can be businesses, though).

In [51]:
# Your answer:
tax.loc[tax['Tax per capita'] > 80000, ['Municipality', 'Tax per capita']]

Unnamed: 0,Municipality,Tax per capita
76,MANCHESTER (077),95146.873341
100,NORTH HAVEN (101),80101.566667
106,ORANGE (107),82842.027864
151,WATERFORD (152),99412.651163


# When you are done with this workbook, 
* Save and checkpoint.
* Change `ready` to `True` in the cell below.
* Run the cell below.

In [None]:
# Don't change this cell; just run it. 
from client.api.notebook import Notebook
ok = Notebook('03-04-dataframes-in-pandas.ok')
ok.auth(inline=True)

In [None]:
ready = False  # change to True when ready to submit
if not ready:
    raise Exception("change ready to True when ready to submit")
_ = ok.submit()