### Pandas continued

See Chapter 5 of Wes McKinney's *Python for Data Analysis*:

https://wesmckinney.com/book/pandas-basics.html#pandas_construction


### DataFrame

A DataFrame is like an Excel spreadsheet, i.e. a 2D object with labeled rows and columns.

- column names are stored in an attribute called `columns`
- row names (row labels) are stored in an attribute called `index` (not `rows`)
- if you construct a DataFrame from a dictionary, the **keys** of the dictionary become **column** names, not row names
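To illustrate the last point, here is a minimal sketch. The names `data`, `df_from_dict`, `x`, `y` and the row labels are made up for illustration and do not appear elsewhere in these notes.

```python
import pandas as pd

# Hypothetical toy data: each dictionary key becomes a column name
data = {"x": [1, 2, 3], "y": [10, 20, 30]}

# The row labels are supplied separately through `index`
df_from_dict = pd.DataFrame(data, index=["r1", "r2", "r3"])

print(list(df_from_dict.columns))  # column labels come from the keys
print(list(df_from_dict.index))    # row labels come from `index`
```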


In [1]:
import pandas as pd
import numpy as np

matrix=np.arange(12).reshape(3,4)
print(matrix)

[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]


In [2]:
df=pd.DataFrame(matrix, columns=['A', 'B', 'C', 'D'], index=['row1', 'row2', 'row3'])

In [3]:
df

      A  B   C   D
row1  0  1   2   3
row2  4  5   6   7
row3  8  9  10  11


In [4]:
df['A']

row1    0
row2    4
row3    8
Name: A, dtype: int64

In [5]:
df['row1']  # gives an error: df[...] looks up columns, so use loc or iloc for rows!

KeyError: 'row1'

In [6]:
df.loc['row1']

A    0
B    1
C    2
D    3
Name: row1, dtype: int64

In [7]:
df.iloc[0]  # iloc selects by integer position; Python counting starts at 0

A    0
B    1
C    2
D    3
Name: row1, dtype: int64
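Both `loc` and `iloc` also accept lists, slices, and a (row, column) pair. A small sketch, rebuilding the same 3x4 frame under the made-up name `df_demo` so that it runs on its own:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(np.arange(12).reshape(3, 4),
                       columns=["A", "B", "C", "D"],
                       index=["row1", "row2", "row3"])

two_rows = df_demo.loc[["row1", "row3"]]  # several rows by label
one_cell = df_demo.loc["row2", "C"]       # row label first, then column label
same_cell = df_demo.iloc[1, 2]            # the same cell by integer position
first_two = df_demo.iloc[:2]              # positional slice, stop is excluded
```

Note that a label slice such as `df_demo.loc["row1":"row2"]` includes both endpoints, unlike the positional slice.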

### Deleting a row or column

Use `.drop(columns=[...])` for dropping columns

or `.drop(index=[...])` for dropping rows (index means rows)

In [8]:
df

      A  B   C   D
row1  0  1   2   3
row2  4  5   6   7
row3  8  9  10  11


In [9]:
df.drop(columns=['A'])

      B   C   D
row1  1   2   3
row2  5   6   7
row3  9  10  11


In [10]:
df

      A  B   C   D
row1  0  1   2   3
row2  4  5   6   7
row3  8  9  10  11


Notice above that column A has not been deleted from `df`: `drop` returns a new DataFrame with the column removed and leaves `df` itself unchanged. If you want to delete the column from your original DataFrame, use `inplace=True` as below:

In [11]:
df.drop(columns=['A'], inplace=True)

In [12]:
df

      B   C   D
row1  1   2   3
row2  5   6   7
row3  9  10  11


In [13]:
# delete row2
df.drop(index=['row2'])

      B   C   D
row1  1   2   3
row3  9  10  11
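An alternative to `inplace=True` is to assign the result of `drop` back to a name. The sketch below uses a separate frame with the made-up name `df_demo`, so it does not touch the `df` used in the cells of these notes:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(np.arange(12).reshape(3, 4),
                       columns=["A", "B", "C", "D"],
                       index=["row1", "row2", "row3"])

# drop returns a new DataFrame; rebinding the name keeps the result
df_demo = df_demo.drop(columns=["A"])
df_demo = df_demo.drop(index=["row2"])

# rows and columns can also be dropped in a single call
smaller = df_demo.drop(index=["row1"], columns=["B"])
```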


### Replacing a column

In [14]:
df

      B   C   D
row1  1   2   3
row2  5   6   7
row3  9  10  11


In [15]:
df['B']=[2,3, 4]

In [16]:
df

      B   C   D
row1  2   2   3
row2  3   6   7
row3  4  10  11


###  Replacing a row

In [17]:
df.loc['row2']=[20,20,20]

In [18]:
df

       B   C   D
row1   2   2   3
row2  20  20  20
row3   4  10  11


### Filtering based on the value of a column

Suppose we want all rows where the value in column C is greater than 5.

First create a true-false 'mask':

In [23]:
df['C']>5

row1    False
row2     True
row3     True
Name: C, dtype: bool

In [26]:
mask=df['C']>5
df[mask]

       B   C   D
row2  20  20  20
row3   4  10  11


In [27]:
mask2=df['B']<=4

In [29]:
mask2

row1     True
row2    False
row3     True
Name: B, dtype: bool

In [31]:
df[mask2]

      B   C   D
row1  2   2   3
row3  4  10  11
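Masks can be combined with `&` (and), `|` (or) and `~` (not). Each comparison needs its own parentheses, because these operators bind more tightly than `>` and `<=`. A minimal sketch on a stand-alone copy of the frame (the name `df_demo` is made up here):

```python
import pandas as pd

df_demo = pd.DataFrame({"B": [2, 20, 4], "C": [2, 20, 10], "D": [3, 20, 11]},
                       index=["row1", "row2", "row3"])

c_big = df_demo["C"] > 5
b_small = df_demo["B"] <= 4

both = df_demo[c_big & b_small]        # rows where both conditions hold
either = df_demo[c_big | b_small]      # rows where at least one holds
neither = df_demo[~(c_big | b_small)]  # rows where neither holds

# written inline, the parentheses around each comparison are required
both_inline = df_demo[(df_demo["C"] > 5) & (df_demo["B"] <= 4)]
```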


Adding two DataFrames whose columns or indexes only partly overlap is possible: the result contains `NaN` wherever a row or column label is missing from either one. This example is copied from McKinney section 5.2:

In [33]:
df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list("bcd"),
                   index=["Ohio", "Texas", "Colorado"])

df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list("bde"),
                   index=["Utah", "Ohio", "Texas", "Oregon"])


In [35]:
df1, df2

(            b    c    d
 Ohio      0.0  1.0  2.0
 Texas     3.0  4.0  5.0
 Colorado  6.0  7.0  8.0,
           b     d     e
 Utah    0.0   1.0   2.0
 Ohio    3.0   4.0   5.0
 Texas   6.0   7.0   8.0
 Oregon  9.0  10.0  11.0)

In [34]:
df1 + df2

            b   c     d   e
Colorado  NaN NaN   NaN NaN
Ohio      3.0 NaN   6.0 NaN
Oregon    NaN NaN   NaN NaN
Texas     9.0 NaN  12.0 NaN
Utah      NaN NaN   NaN NaN
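If you would rather treat a missing entry as 0 than get `NaN`, the `add` method takes a `fill_value` argument. A sketch using the same two frames (the name `total` is made up); entries missing from *both* frames still come out as `NaN`:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list("bcd"),
                   index=["Ohio", "Texas", "Colorado"])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list("bde"),
                   index=["Utah", "Ohio", "Texas", "Oregon"])

# a value missing from one frame is replaced by 0 before adding
total = df1.add(df2, fill_value=0)

print(total)
```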


### Applying functions to entries of a data frame

In [37]:
df1

            b    c    d
Ohio      0.0  1.0  2.0
Texas     3.0  4.0  5.0
Colorado  6.0  7.0  8.0


In [38]:
np.square(df1)  #vectorized numpy operations accept dataframes as input

             b     c     d
Ohio       0.0   1.0   4.0
Texas      9.0  16.0  25.0
Colorado  36.0  49.0  64.0


### Apply a function to rows or columns

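Syntax:

`<dataframe>.apply(function, axis=...)`

With the default `axis=0` the function receives each column in turn; with `axis=1` it receives each row.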

In [40]:
def my_function(x):
    return x**2

df.apply(my_function)

        B    C    D
row1    4    4    9
row2  400  400  400
row3   16  100  121


In [41]:
df

       B   C   D
row1   2   2   3
row2  20  20  20
row3   4  10  11


In [42]:
# replace just column B with the squares of its entries
df['B']=df['B'].apply(my_function)

In [43]:
df

        B   C   D
row1    4   2   3
row2  400  20  20
row3   16  10  11


In [44]:
df.loc['row1']=df.loc['row1'].apply(my_function)

In [45]:
df

        B   C   D
row1   16   4   9
row2  400  20  20
row3   16  10  11
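The `axis` argument matters when the function summarizes a whole row or column rather than acting entry by entry. A small sketch, where `spread` and `df_demo` are made-up names for illustration:

```python
import pandas as pd

df_demo = pd.DataFrame({"B": [2, 20, 4], "C": [2, 20, 10], "D": [3, 20, 11]},
                       index=["row1", "row2", "row3"])

def spread(x):
    # x is an entire column or row (a Series); return max minus min
    return x.max() - x.min()

per_column = df_demo.apply(spread)              # default axis=0: one value per column
per_row = df_demo.apply(spread, axis="columns") # one value per row

print(per_column)
print(per_row)
```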


### Sorting

You can sort by the index (`sort_index`), but I think it is more common to sort by the values:

In [46]:
df

        B   C   D
row1   16   4   9
row2  400  20  20
row3   16  10  11


In [48]:
df.sort_values(by=['D','B'])

        B   C   D
row1   16   4   9
row3   16  10  11
row2  400  20  20
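Both sorting methods accept `ascending=False`, and `sort_index` can also sort the column labels. A sketch on a stand-alone copy of the frame (the name `df_demo` is made up):

```python
import pandas as pd

df_demo = pd.DataFrame({"B": [16, 400, 16], "C": [4, 20, 10], "D": [9, 20, 11]},
                       index=["row1", "row2", "row3"])

by_d_desc = df_demo.sort_values(by="D", ascending=False)  # largest D first
by_index_desc = df_demo.sort_index(ascending=False)       # row labels, reversed
by_col_labels = df_demo.sort_index(axis="columns", ascending=False)  # column labels, reversed
```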


### Axis, again

Yesterday I said (correctly) that **`axis=0` means varying the rows while keeping the column fixed**.

I wrote "`axis=0` means columns", but I should have written more precisely "`axis=0` means the operation is applied to each column".

While that is correct, the usage in pandas/McKinney is that the axis names the direction along which the operation moves.

`axis=0` is synonymous with `axis="rows"` (or `axis="index"`), because if you sum in the direction of the rows axis, you get one sum per column.

And `axis=1` is synonymous with `axis="columns"`, since summing along the direction of the columns axis sums each row.

Very confusing, I'm sorry.


In [49]:

df.sum(axis=0)

B    432
C     34
D     40
dtype: int64

In [51]:
# the following command sums across each ROW (one total per row)
df.sum(axis='columns')

row1     29
row2    440
row3     37
dtype: int64
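One way to check which is which is to use a non-square frame and look at the length of the result. A sketch with a made-up 2x3 frame (`df_demo`, `col_totals`, `row_totals` are illustrative names):

```python
import numpy as np
import pandas as pd

# 2 rows, 3 columns, so the two kinds of sums have different lengths
df_demo = pd.DataFrame(np.arange(6).reshape(2, 3),
                       columns=["x", "y", "z"], index=["r1", "r2"])

col_totals = df_demo.sum(axis=0)  # moves down the rows: one total per column
row_totals = df_demo.sum(axis=1)  # moves across the columns: one total per row

print(len(col_totals), len(row_totals))
```

The string names give the same results: `axis="index"` matches `axis=0` and `axis="columns"` matches `axis=1`.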

###  concat


### groupby
### split-apply-combine

https://wesmckinney.com/book/data-aggregation.html?q=groupby#groupby_fundamentals

In [52]:
# example from McKinney Chapter 10

df = pd.DataFrame({"key1": ["a", "a", None, "b", "b", "a", None],
                   "key2": pd.Series([1, 2, 1, 2, 1, None, 1],
                                     dtype="Int64"),
                   "data1": np.random.standard_normal(7),
                   "data2": np.random.standard_normal(7)})



In [53]:
df

   key1  key2     data1     data2
0     a     1  1.977258 -1.168861
1     a     2 -0.354764  0.396667
2  None     1 -1.020026  1.326360
3     b     2 -0.315253 -1.029003
4     b     1 -0.568018 -0.776543
5     a  <NA>  0.547041  0.748713
6  None     1  0.317716  0.028884


In [55]:
grouped = df["data1"].groupby(df["key1"])
grouped


<pandas.core.groupby.generic.SeriesGroupBy object at 0x1242ba610>

In [57]:
grouped.sum()

key1
a    2.169535
b   -0.883270
Name: data1, dtype: float64
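Notice that the rows whose `key1` is missing do not appear in the result: by default `groupby` drops missing keys. A sketch with fixed numbers instead of random ones, so the results are reproducible (`df_demo` and its `data1` values are made up for illustration):

```python
import pandas as pd

df_demo = pd.DataFrame({"key1": ["a", "a", None, "b", "b", "a", None],
                        "key2": pd.Series([1, 2, 1, 2, 1, None, 1],
                                          dtype="Int64"),
                        "data1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]})

# split by key1, apply sum to data1, combine into one Series
sums = df_demo.groupby("key1")["data1"].sum()

# dropna=False keeps the rows with a missing key as their own group
sizes_with_missing = df_demo.groupby("key1", dropna=False).size()

# grouping by two keys gives one result per (key1, key2) pair
means_two_keys = df_demo.groupby(["key1", "key2"])["data1"].mean()

print(sums)
print(sizes_with_missing)
print(means_two_keys)
```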

# Week 3 video
HW: Go through one of the examples in Chapter 13
and make a 5-10 min instructional video

https://wesmckinney.com/book/data-analysis-examples