## Some notes about the importance of working with data

* Working with data is an absolutely critical aspect of machine learning.
* Applying the algorithms is only one **small** aspect of the overall work.
* Evaluating, cleaning, and preprocessing the data usually takes up more of the work.
  * AND it's more important.
  * Garbage in, garbage out, as they say.


* **What's the difference between cleaning and preprocessing data?**
* We're going to be largely working with data that has already been very well cleaned.
  * There is a lot that can go into collecting and maintaining good data.
  * Plus doing EDA to evaluate the data's quality.
  * But we already have a lot to cover, so we're working with well-curated datasets in order to focus on the algorithms. Just don't forget to apply your statistical intuition and your rational skepticism to the model's outputs!
  

* Describe training/validation/test data.

# What is Pandas, anyway?

**Pan**el **da**ta

- Uses NumPy for efficient data access & manipulation
- Adds labels and indexes for ease of use
- `Series` and `DataFrame` are the two core classes in Pandas
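To make the two core classes concrete, here is a minimal sketch that builds each one by hand. The labels (`mon`, `tue`, `wed`) and the values are made up for illustration and are not part of the course data.

```python
import pandas as pd

# A Series is a one-dimensional array with a label attached to every value.
prices = pd.Series([12.09, 12.14, 12.20], index=["mon", "tue", "wed"], name="F")

# A DataFrame is a table whose columns are Series that share one index.
frame = pd.DataFrame(
    {"F": [12.09, 12.14, 12.20], "TSLA": [150.10, 149.56, 147.00]},
    index=["mon", "tue", "wed"],
)

print(prices["tue"])              # look up a value by its label
print(frame.loc["tue", "TSLA"])   # look up a cell by row label and column label
print(frame.to_numpy().shape)     # the values as a plain NumPy array: (3, 2)
```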

# Let's jump in head-first

First, we need to `import pandas`:

In [1]:
import pandas as pd

Now, let's load some data:

In [2]:
closing_prices = pd.read_csv('../data/closing-prices.csv')
closing_prices

,Unnamed: 0,F,TSLA,GOOG,IBM,AAPL
0,2014-01-02,12.0890,150.10,,157.6001,72.7741
1,2014-01-03,12.1438,149.56,,158.5430,71.1756
2,2014-01-06,12.1986,147.00,,157.9993,71.5637
3,2014-01-07,12.0420,149.36,,161.1508,71.0516
4,2014-01-08,12.1673,151.28,,159.6728,71.5019
...,...,...,...,...,...,...
1002,2017-12-22,11.9489,325.20,1060.12,147.7588,173.0230
1003,2017-12-26,11.9679,317.29,1056.74,148.0786,168.6334
1004,2017-12-27,11.8729,311.64,1049.37,148.3693,168.6630
1005,2017-12-28,11.9489,315.36,1048.14,149.2510,169.1376


`closing_prices` is a `DataFrame`, and it's the main Pandas data structure we'll be using.

We can see the first few rows of a `DataFrame` by calling its `head()` method:

In [3]:
closing_prices.head()

,Unnamed: 0,F,TSLA,GOOG,IBM,AAPL
0,2014-01-02,12.089,150.1,,157.6001,72.7741
1,2014-01-03,12.1438,149.56,,158.543,71.1756
2,2014-01-06,12.1986,147.0,,157.9993,71.5637
3,2014-01-07,12.042,149.36,,161.1508,71.0516
4,2014-01-08,12.1673,151.28,,159.6728,71.5019


## Getting basic info about a `DataFrame`

We can get basic info about a `DataFrame` (columns, counts, data types, memory usage) by using the `.info()` method:

In [4]:
closing_prices.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1007 entries, 0 to 1006
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  1007 non-null   object 
 1   F           1007 non-null   float64
 2   TSLA        1007 non-null   float64
 3   GOOG        949 non-null    float64
 4   IBM         1007 non-null   float64
 5   AAPL        1007 non-null   float64
dtypes: float64(5), object(1)
memory usage: 105.3 KB


Notice that the first column ("Unnamed: 0") has dtype `object`, which means Pandas is storing its values as generic Python objects (here, strings) rather than parsing them as dates. We'll fix the name and the type momentarily.
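As a small illustration of what an `object` column holds, the sketch below builds a two-element `Series` of date-like strings. The `dtype=object` argument is an assumption added here so the example behaves the same across Pandas versions; it is not something the notebook itself passes.

```python
import pandas as pd

# Force the generic object dtype so each cell is stored as a Python object.
raw = pd.Series(["2014-01-02", "2014-01-03"], dtype=object)

print(raw.dtype)           # object
print(type(raw.iloc[0]))   # each cell is an ordinary Python str, not a date
```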

## Accessing the columns of a `DataFrame`

We can look at a single column of a `DataFrame` either by using the `[]` operator:

In [5]:
closing_prices['Unnamed: 0']

0       2014-01-02
1       2014-01-03
2       2014-01-06
3       2014-01-07
4       2014-01-08
           ...    
1002    2017-12-22
1003    2017-12-26
1004    2017-12-27
1005    2017-12-28
1006    2017-12-29
Name: Unnamed: 0, Length: 1007, dtype: object

... or (sometimes) the `.` notation:

In [6]:
closing_prices.TSLA

0       150.10
1       149.56
2       147.00
3       149.36
4       151.28
         ...  
1002    325.20
1003    317.29
1004    311.64
1005    315.36
1006    311.35
Name: TSLA, Length: 1007, dtype: float64
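The "sometimes" matters: dot notation only works when the column name is a valid Python identifier and does not clash with an existing `DataFrame` attribute. A minimal sketch, using a made-up frame with a column deliberately named `count`:

```python
import pandas as pd

df = pd.DataFrame(
    {"TSLA": [150.10, 149.56], "count": [3, 4], "Unnamed: 0": ["a", "b"]}
)

print(df.TSLA.tolist())            # works: valid identifier, no name clash
print(callable(df.count))          # True: this is the DataFrame.count method
print(df["count"].tolist())        # [] always reaches the column
print(df["Unnamed: 0"].tolist())   # names containing spaces need [] as well
```

When in doubt, `[]` is the safer choice.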

A single column of data from a Pandas `DataFrame` is a `Series`. 

We can get a list of the columns of a `DataFrame` by using its `.columns` property:

In [7]:
closing_prices.columns

Index(['Unnamed: 0', 'F', 'TSLA', 'GOOG', 'IBM', 'AAPL'], dtype='object')

We can also give a `list` of columns we're interested in, and Pandas will give us a new `DataFrame` with just those columns:

In [8]:
columns = ['F', 'GOOG', 'IBM']
closing_prices[columns].head()

,F,GOOG,IBM
0,12.089,,157.6001
1,12.1438,,158.543
2,12.1986,,157.9993
3,12.042,,161.1508
4,12.1673,,159.6728


We'll often shorten this:

In [9]:
closing_prices[['F', 'GOOG', 'IBM']].head()

,F,GOOG,IBM
0,12.089,,157.6001
1,12.1438,,158.543
2,12.1986,,157.9993
3,12.042,,161.1508
4,12.1673,,159.6728


We can even create a single-column `DataFrame` by just passing a list of **one** column name:

In [10]:
closing_prices[['F']].head()

,F
0,12.089
1,12.1438
2,12.1986
3,12.042
4,12.1673


Note the difference between the above and accessing a single column (which gives us a `Series`):

In [11]:
closing_prices['F'].head()

0    12.0890
1    12.1438
2    12.1986
3    12.0420
4    12.1673
Name: F, dtype: float64
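A quick way to check which kind of object you got back is to look at its type and shape. This sketch uses a tiny made-up frame rather than the course data:

```python
import pandas as pd

df = pd.DataFrame({"F": [12.089, 12.1438], "IBM": [157.6001, 158.543]})

as_series = df["F"]     # a string key gives a Series
as_frame = df[["F"]]    # a list of keys gives a DataFrame

print(type(as_series).__name__, as_series.shape)   # Series (2,)
print(type(as_frame).__name__, as_frame.shape)     # DataFrame (2, 1)
```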

## Accessing the individual elements of a `Series`

We can access individual cells within a `Series` by passing their index label to the `[]` operator:

In [12]:
tsla = closing_prices.TSLA
tsla[0]

150.1

We can also select a range of cells by giving a `slice` (this will actually give us another `Series`):

In [13]:
tsla[10:15]

10    170.97
11    170.01
12    176.68
13    178.56
14    181.50
Name: TSLA, dtype: float64
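With the default integer index, labels and positions happen to coincide, which hides the fact that an integer inside `[]` is treated as a label here. The sketch below (made-up values) shows the two drifting apart after a filter:

```python
import pandas as pd

s = pd.Series([150.10, 149.56, 147.00, 149.36])
low = s[s < 149.5]          # keeps the rows labeled 2 and 3

print(low.iloc[0])          # 147.0: first row by position
print(low.loc[2])           # 147.0: same row, by its label

# There is no label 0 any more, so a plain [0] lookup fails.
try:
    low[0]
    label_zero_exists = True
except KeyError:
    label_zero_exists = False
print(label_zero_exists)    # False
```

Using `.iloc` for positions and `.loc` for labels avoids this ambiguity.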

## Dealing with dates

We can parse the dates from the "Unnamed: 0" column and create a new Series with the parsed datetimes by using the `pd.to_datetime()` function:

In [14]:
dates = pd.to_datetime(closing_prices['Unnamed: 0'])
dates.head()

0   2014-01-02
1   2014-01-03
2   2014-01-06
3   2014-01-07
4   2014-01-08
Name: Unnamed: 0, dtype: datetime64[ns]
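Real-world date columns are not always this clean. As a hedged sketch, `pd.to_datetime()` accepts an explicit `format` and an `errors="coerce"` option that turns unparseable values into `NaT` instead of raising. The bad value below is invented for the example:

```python
import pandas as pd

raw = pd.Series(["2014-01-02", "2014-01-03", "not a date"])

# Unparseable entries become NaT rather than stopping the whole conversion.
parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")

print(parsed.isna().tolist())   # [False, False, True]
```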

## Replacing columns

Now that we've "fixed" the dates, we can get rid of the old "Unnamed: 0" column by `.pop()`ping it off:

In [15]:
closing_prices.pop('Unnamed: 0')

0       2014-01-02
1       2014-01-03
2       2014-01-06
3       2014-01-07
4       2014-01-08
           ...    
1002    2017-12-22
1003    2017-12-26
1004    2017-12-27
1005    2017-12-28
1006    2017-12-29
Name: Unnamed: 0, Length: 1007, dtype: object

... and we can replace it with our new "date" column:

In [16]:
closing_prices['Date'] = dates
closing_prices.head()

,F,TSLA,GOOG,IBM,AAPL,Date
0,12.089,150.1,,157.6001,72.7741,2014-01-02
1,12.1438,149.56,,158.543,71.1756,2014-01-03
2,12.1986,147.0,,157.9993,71.5637,2014-01-06
3,12.042,149.36,,161.1508,71.0516,2014-01-07
4,12.1673,151.28,,159.6728,71.5019,2014-01-08


In [17]:
closing_prices.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1007 entries, 0 to 1006
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   F       1007 non-null   float64       
 1   TSLA    1007 non-null   float64       
 2   GOOG    949 non-null    float64       
 3   IBM     1007 non-null   float64       
 4   AAPL    1007 non-null   float64       
 5   Date    1007 non-null   datetime64[ns]
dtypes: datetime64[ns](1), float64(5)
memory usage: 47.3 KB


(and we cut our memory usage by more than half, from 105.3 KB to 47.3 KB, as an extra bonus!)
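Popping and re-adding is one way to do this. An alternative sketch, shown on a small stand-in frame, renames the column and converts it without modifying the original object; it also keeps the date column in its original position:

```python
import pandas as pd

df = pd.DataFrame(
    {"Unnamed: 0": ["2014-01-02", "2014-01-03"], "F": [12.089, 12.1438]}
)

# rename() and assign() both return new frames, so df itself is left untouched.
fixed = df.rename(columns={"Unnamed: 0": "Date"})
fixed = fixed.assign(Date=pd.to_datetime(fixed["Date"]))

print(list(fixed.columns))   # ['Date', 'F']
print(list(df.columns))      # ['Unnamed: 0', 'F']
```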

# The Index

Many datasets have a column that serves as a nice index to the dataset. In our case, the date fills this role.

The `.set_index()` method on a `DataFrame` returns a new `DataFrame` based on the old one, but with the chosen column used as its index:

In [18]:
closing_prices.set_index('Date')

Date,F,TSLA,GOOG,IBM,AAPL
2014-01-02,12.0890,150.10,,157.6001,72.7741
2014-01-03,12.1438,149.56,,158.5430,71.1756
2014-01-06,12.1986,147.00,,157.9993,71.5637
2014-01-07,12.0420,149.36,,161.1508,71.0516
2014-01-08,12.1673,151.28,,159.6728,71.5019
...,...,...,...,...,...
2017-12-22,11.9489,325.20,1060.12,147.7588,173.0230
2017-12-26,11.9679,317.29,1056.74,148.0786,168.6334
2017-12-27,11.8729,311.64,1049.37,148.3693,168.6630
2017-12-28,11.9489,315.36,1048.14,149.2510,169.1376


Many of the Pandas methods have an optional `inplace` keyword argument that lets you modify the objects in-place instead of creating new objects. Let's do that here:

In [19]:
closing_prices.set_index('Date', inplace=True)
closing_prices.head()

Date,F,TSLA,GOOG,IBM,AAPL
2014-01-02,12.089,150.1,,157.6001,72.7741
2014-01-03,12.1438,149.56,,158.543,71.1756
2014-01-06,12.1986,147.0,,157.9993,71.5637
2014-01-07,12.042,149.36,,161.1508,71.0516
2014-01-08,12.1673,151.28,,159.6728,71.5019


In [20]:
closing_prices.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1007 entries, 2014-01-02 to 2017-12-29
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   F       1007 non-null   float64
 1   TSLA    1007 non-null   float64
 2   GOOG    949 non-null    float64
 3   IBM     1007 non-null   float64
 4   AAPL    1007 non-null   float64
dtypes: float64(5)
memory usage: 47.2 KB
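If you would rather not use `inplace`, reassigning the result has the same effect. A minimal sketch on a stand-in frame, showing that `.set_index()` leaves the original alone until you reassign:

```python
import pandas as pd

df = pd.DataFrame(
    {"Date": pd.to_datetime(["2014-01-02", "2014-01-03"]), "F": [12.089, 12.1438]}
)

indexed = df.set_index("Date")   # new object; df is unchanged
print("Date" in df.columns)      # True: still an ordinary column in df
print(indexed.index.name)        # Date

df = df.set_index("Date")        # reassign instead of inplace=True
print(list(df.columns))          # ['F']
```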


## Accessing rows in a `DataFrame`

To access the rows in a `DataFrame`, use the `.loc` and `.iloc` *accessors*.

`.loc` accesses rows by their **index value** and returns a `Series`:

In [21]:
closing_prices.loc['2014-01-08']

F        12.1673
TSLA    151.2800
GOOG         NaN
IBM     159.6728
AAPL     71.5019
Name: 2014-01-08 00:00:00, dtype: float64

`.iloc` accesses rows by their 0-based offset (their "integer position") and returns a `Series`:

In [22]:
closing_prices.iloc[4]

F        12.1673
TSLA    151.2800
GOOG         NaN
IBM     159.6728
AAPL     71.5019
Name: 2014-01-08 00:00:00, dtype: float64

We can give a second argument to both `.loc` and `.iloc` to specify a column, which gives us just that single value:

In [23]:
closing_prices.loc['2014-01-08', 'TSLA']

151.28

In [24]:
closing_prices.iloc[4, 1]

151.28

`.loc` and `.iloc` also work with slices. 

In the case of `.loc`, a slice includes both the first **and the last** elements of a slice (unlike regular Python, which includes the first but not the last element of a slice):

In [25]:
closing_prices.loc['2014-01-02':'2014-01-07']

Date,F,TSLA,GOOG,IBM,AAPL
2014-01-02,12.089,150.1,,157.6001,72.7741
2014-01-03,12.1438,149.56,,158.543,71.1756
2014-01-06,12.1986,147.0,,157.9993,71.5637
2014-01-07,12.042,149.36,,161.1508,71.0516


In the case of `.iloc`, the last element of the slice is **not** included (matching the behavior of regular Python):

In [26]:
closing_prices.iloc[0:4]

Date,F,TSLA,GOOG,IBM,AAPL
2014-01-02,12.089,150.1,,157.6001,72.7741
2014-01-03,12.1438,149.56,,158.543,71.1756
2014-01-06,12.1986,147.0,,157.9993,71.5637
2014-01-07,12.042,149.36,,161.1508,71.0516


We can also use `.loc` and `.iloc` with `Series` objects:

In [27]:
closing_prices.F.loc['2014-01-03':'1/6/14']

Date
2014-01-03    12.1438
2014-01-06    12.1986
Name: F, dtype: float64

In [28]:
closing_prices.F.iloc[1:3]

Date
2014-01-03    12.1438
2014-01-06    12.1986
Name: F, dtype: float64
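The inclusive versus exclusive endpoint is easy to forget, so here is a compact sketch that puts the two side by side. It rebuilds the first five rows of the `F` column by hand so that it runs without the CSV file:

```python
import pandas as pd

idx = pd.to_datetime(
    ["2014-01-02", "2014-01-03", "2014-01-06", "2014-01-07", "2014-01-08"]
)
df = pd.DataFrame({"F": [12.089, 12.1438, 12.1986, 12.042, 12.1673]}, index=idx)

by_label = df.loc["2014-01-02":"2014-01-07"]   # end label IS included
by_position = df.iloc[0:3]                     # end position is NOT included

print(len(by_label))      # 4
print(len(by_position))   # 3
```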

## Reading the CSV data (redux)

Many of the `DataFrame` manipulations we used above can be done at initial data import time by using optional keyword arguments to `pd.read_csv()`:

In [29]:
closing_prices = pd.read_csv(
    '../data/closing-prices.csv',
    index_col=0,
    parse_dates=[0],
)
closing_prices.head()

,F,TSLA,GOOG,IBM,AAPL
2014-01-02,12.089,150.1,,157.6001,72.7741
2014-01-03,12.1438,149.56,,158.543,71.1756
2014-01-06,12.1986,147.0,,157.9993,71.5637
2014-01-07,12.042,149.36,,161.1508,71.0516
2014-01-08,12.1673,151.28,,159.6728,71.5019


In [30]:
closing_prices.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1007 entries, 2014-01-02 to 2017-12-29
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   F       1007 non-null   float64
 1   TSLA    1007 non-null   float64
 2   GOOG    949 non-null    float64
 3   IBM     1007 non-null   float64
 4   AAPL    1007 non-null   float64
dtypes: float64(5)
memory usage: 47.2 KB
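To try these keyword arguments without the data file, you can feed `pd.read_csv()` an in-memory string. The CSV text below is a two-row stand-in that mimics the layout of `closing-prices.csv` (including the unnamed first column), not the real file:

```python
import io

import pandas as pd

csv_text = """,F,TSLA
2014-01-02,12.089,150.1
2014-01-03,12.1438,149.56
"""

# index_col=0 makes the first column the index; parse_dates=[0] parses it as dates.
df = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=[0])

print(type(df.index).__name__)   # DatetimeIndex
print(list(df.columns))          # ['F', 'TSLA']
```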


# A Word About NaN

Pandas uses the special floating-point value `nan` in floating-point columns to flag missing or invalid data. 

`nan` has the special property that it is not *comparable* with any other value (not even itself), and it *propagates* when combined with other data:

In [31]:
nan = float('nan')   # or numpy.nan

In [32]:
nan == nan

False

In [33]:
nan < nan, nan > nan

(False, False)

In [34]:
nan + 5

nan

In [35]:
nan * 0

nan

Often we'll want to treat `nan`, `None`, and `pd.NaT` (a special Pandas "not-a-time" value) similarly.

Pandas lets us calculate whether a value is one of these using the `pd.isna()` function:

In [36]:
pd.isna(nan), pd.isna(None), pd.isna(pd.NaT)

(True, True, True)
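The same check works element-wise on a `Series` through its `.isna()` method. One thing worth knowing, sketched below with made-up numbers, is that Pandas aggregations skip `nan` by default even though plain arithmetic propagates it:

```python
import pandas as pd

s = pd.Series([1.0, float("nan"), 3.0])

print(s.isna().tolist())      # [False, True, False]
print(s.count())              # 2: only non-missing values are counted
print(s.sum())                # 4.0: nan is skipped by default
print(s.mean())               # 2.0
print(s.sum(skipna=False))    # nan: opt back in to propagation
```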

# Math with `Series`

Often, we may want to perform some calculations on a `Series` object.

If we perform an operation between a `Series` and a scalar value, the scalar value is *broadcast* to the `Series` as we would expect:

In [37]:
(closing_prices.F * 2).head()

2014-01-02    24.1780
2014-01-03    24.2876
2014-01-06    24.3972
2014-01-07    24.0840
2014-01-08    24.3346
Name: F, dtype: float64

We can also perform any binary operation between two series, which applies the operation *element-by-element* (matrix multiplication or dot/cross-product are *not* done this way):

In [38]:
(closing_prices.TSLA + closing_prices.F).head()

2014-01-02    162.1890
2014-01-03    161.7038
2014-01-06    159.1986
2014-01-07    161.4020
2014-01-08    163.4473
dtype: float64
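"Element-by-element" here means matched by index label, not by position. In the example above both columns share one index, so the distinction is invisible. A small sketch with two deliberately mismatched, made-up indexes shows what happens otherwise:

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=["a", "b", "c"])
b = pd.Series([10, 20, 30], index=["b", "c", "d"])

total = a + b   # values are paired up by label

print(list(total.index))       # ['a', 'b', 'c', 'd']
print(total["b"], total["c"])  # 12.0 23.0
print(total.isna().tolist())   # [True, False, False, True]
```

Labels present in only one operand produce `nan` in the result.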

You could combine the above for a 'portfolio value', for instance:

In [39]:
portfolio = (10 * closing_prices.TSLA + 200 * closing_prices.F)
portfolio.head()

2014-01-02    3918.80
2014-01-03    3924.36
2014-01-06    3909.72
2014-01-07    3902.00
2014-01-08    3946.26
dtype: float64

We can also perform aggregations:

In [40]:
portfolio.min(), portfolio.max(), portfolio.mean(), portfolio.median()

(3408.1000000000004, 6032.219999999999, 4799.839365441906, 4744.42)

If you want a quick set of summary statistics, you can `.describe()` a `Series` or a `DataFrame`:

In [41]:
portfolio.describe()

count    1007.000000
mean     4799.839365
std       481.910807
min      3408.100000
25%      4479.800000
50%      4744.420000
75%      5080.120000
max      6032.220000
dtype: float64

In [42]:
closing_prices.describe()

,F,TSLA,GOOG,IBM,AAPL
count,1007.0,1007.0,949.0,1007.0,1007.0
mean,11.784478,244.294371,714.496848,145.480157,111.896291
std,0.999545,50.686052,154.754186,13.132729,25.481085
min,9.7092,139.34,492.55,106.9694,65.7553
25%,11.009,207.8,561.68,137.819,93.87305
50%,11.7321,230.48,716.65,144.9101,107.9158
75%,12.4712,262.015,806.36,155.92045,122.17295
max,14.1901,385.0,1077.14,171.0528,174.417


We can also perform comparisons and generate **Boolean** columns:

In [43]:
(closing_prices.IBM > 160).head()

2014-01-02    False
2014-01-03    False
2014-01-06    False
2014-01-07     True
2014-01-08    False
Name: IBM, dtype: bool

For the purposes of aggregation, `True = 1` and `False = 0`, so we can see how many days IBM was above 160:

In [44]:
(closing_prices.IBM > 160).sum()

177

... or the fraction of days it was:

In [45]:
(closing_prices.IBM > 160).mean()

0.17576961271102284
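The counting trick is easy to verify on a handful of values. A minimal sketch with four made-up prices:

```python
import pandas as pd

ibm = pd.Series([157.6, 158.5, 158.0, 161.2])
above = ibm > 160        # boolean Series

print(above.sum())       # 1: number of True values
print(above.mean())      # 0.25: fraction of True values
```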

# Indexing using a boolean `Series`, aka "filtering"

If we pass a boolean `Series` into the `[]` operator, Pandas will show us a new `DataFrame` containing only rows for which the boolean `Series` evaluated to `True`:

In [46]:
high_ibm = closing_prices.IBM > 170
closing_prices[high_ibm]

,F,TSLA,GOOG,IBM,AAPL
2017-02-15,11.5434,279.76,818.98,170.799,132.4227
2017-02-16,11.4612,268.95,824.16,170.564,132.2615
2017-02-22,11.58,273.51,830.76,170.3007,133.9862
2017-02-23,11.4795,255.99,831.33,170.7708,133.4195
2017-02-24,11.3972,257.0,828.64,170.4888,133.5465
2017-03-01,11.5983,250.02,835.24,171.0528,136.6052


We normally just combine those two lines for the elegant syntax:

In [47]:
closing_prices[closing_prices.IBM > 170]

,F,TSLA,GOOG,IBM,AAPL
2017-02-15,11.5434,279.76,818.98,170.799,132.4227
2017-02-16,11.4612,268.95,824.16,170.564,132.2615
2017-02-22,11.58,273.51,830.76,170.3007,133.9862
2017-02-23,11.4795,255.99,831.33,170.7708,133.4195
2017-02-24,11.3972,257.0,828.64,170.4888,133.5465
2017-03-01,11.5983,250.02,835.24,171.0528,136.6052
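Filters can be combined with `&` (and), `|` (or), and `~` (not). Each condition needs its own parentheses, because these operators bind more tightly than comparisons. A sketch on a small made-up frame:

```python
import pandas as pd

df = pd.DataFrame(
    {"IBM": [157.6, 161.2, 170.8, 171.1], "AAPL": [72.8, 71.1, 132.4, 136.6]}
)

both = df[(df.IBM > 160) & (df.AAPL > 100)]    # rows where both conditions hold
either = df[(df.IBM > 171) | (df.AAPL < 72)]   # rows where at least one holds
neither = df[~(df.IBM > 160)]                  # rows where the condition fails

print(both.index.tolist())      # [2, 3]
print(either.index.tolist())    # [1, 3]
print(neither.index.tolist())   # [0]
```

Python's `and`, `or`, and `not` keywords do not work element-wise on a `Series`, which is why the symbolic operators are used.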


## Dropping data

We may want to just drop rows containing invalid data. To do this, we would use the `dropna` method. 

In our dataset, GOOG data is missing until 2014-03-27, so we can drop all the rows with missing data like this (returning a new `DataFrame`):

In [48]:
closing_prices.dropna()

,F,TSLA,GOOG,IBM,AAPL
2014-03-27,12.0359,207.32,558.46,162.1367,71.1357
2014-03-28,12.1938,212.37,559.99,162.6663,71.0563
2014-03-31,12.3121,208.45,556.97,164.4087,71.0404
2014-04-01,12.8804,216.97,567.16,166.1254,71.6903
2014-04-02,12.9909,230.29,567.00,165.3140,71.8094
...,...,...,...,...,...
2017-12-22,11.9489,325.20,1060.12,147.7588,173.0230
2017-12-26,11.9679,317.29,1056.74,148.0786,168.6334
2017-12-27,11.8729,311.64,1049.37,148.3693,168.6630
2017-12-28,11.9489,315.36,1048.14,149.2510,169.1376
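By default `dropna` removes a row if any column is missing, which can discard more than intended. The `subset` and `how` arguments narrow that down. A sketch on a small made-up frame:

```python
import pandas as pd

nan = float("nan")
df = pd.DataFrame(
    {"F": [12.0, nan, 12.2, nan], "GOOG": [nan, nan, 558.5, 560.0]}
)

complete = df.dropna()                # keep rows with no missing values at all
has_f = df.dropna(subset=["F"])       # only require F to be present
not_empty = df.dropna(how="all")      # drop rows only if every column is missing

print(complete.index.tolist())    # [2]
print(has_f.index.tolist())       # [0, 2]
print(not_empty.index.tolist())   # [0, 2, 3]
```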


# Time Series Data

Pandas handles time series data quite well. We've already seen how you can use strings to index into datetime indexes. 

If you have a datetime column (not an index), then many date manipulation methods are available via the `.dt` accessor:

In [49]:
sales = pd.read_csv('../data/sales.csv', parse_dates=['date'])
sales.head()

,order_num,line_num,date,sku,qty
0,0,0,2011-01-01,sku4333,6
1,0,1,2011-01-01,sku76536,7
2,1,0,2011-01-02,sku75108,3
3,1,1,2011-01-02,sku78838,9
4,1,2,2011-01-02,sku77480,9


In [50]:
sales['weekday'] = sales.date.dt.weekday
sales.head()

,order_num,line_num,date,sku,qty,weekday
0,0,0,2011-01-01,sku4333,6,5
1,0,1,2011-01-01,sku76536,7,5
2,1,0,2011-01-02,sku75108,3,6
3,1,1,2011-01-02,sku78838,9,6
4,1,2,2011-01-02,sku77480,9,6


In [51]:
sales['weekend'] = sales.weekday > 4
sales.head()

,order_num,line_num,date,sku,qty,weekday,weekend
0,0,0,2011-01-01,sku4333,6,5,True
1,0,1,2011-01-01,sku76536,7,5,True
2,1,0,2011-01-02,sku75108,3,6,True
3,1,1,2011-01-02,sku78838,9,6,True
4,1,2,2011-01-02,sku77480,9,6,True


In [52]:
sales.date.dt.day_name()

0        Saturday
1        Saturday
2          Sunday
3          Sunday
4          Sunday
          ...    
2445    Wednesday
2446    Wednesday
2447    Wednesday
2448    Wednesday
2449     Thursday
Name: date, Length: 2450, dtype: object
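When the dates live in the index rather than in a column, the same attributes are available directly on the `DatetimeIndex`, with no `.dt` in between. A minimal sketch using three hand-written dates:

```python
import pandas as pd

idx = pd.to_datetime(["2011-01-01", "2011-01-02", "2011-01-03"])
column = pd.Series(idx)   # the same dates held as a column

print(column.dt.weekday.tolist())   # [5, 6, 0]: a column goes through .dt
print(idx.weekday.tolist())         # [5, 6, 0]: an index exposes it directly
print(idx.day_name().tolist())      # ['Saturday', 'Sunday', 'Monday']
```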

# Exporting data

Similarly, use `to_csv` to export data. See the docs:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html
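As a rough sketch of a round trip, the example below writes a small made-up frame to an in-memory buffer with `to_csv` and reads it back with `pd.read_csv()`. In practice you would pass a file path instead of the buffer.

```python
import io

import pandas as pd

df = pd.DataFrame(
    {"F": [12.089, 12.1438]},
    index=pd.to_datetime(["2014-01-02", "2014-01-03"]),
)
df.index.name = "Date"   # a named index gets a proper header in the CSV

buffer = io.StringIO()
df.to_csv(buffer)        # the index is written as the first column by default
buffer.seek(0)

restored = pd.read_csv(buffer, index_col="Date", parse_dates=["Date"])
print(restored["F"].tolist())   # [12.089, 12.1438]
```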