# Introduction to the Data Science pipeline
### Analyzing the tennis data set

#### Structure of the data file

Open the file `tennis-data.txt` in an editor and peruse its contents. 

You will see an initial set of lines providing meta-information about the data, then 14 lines of actual data (observations), and finally a few additional lines of text information.  The structure of the file is:

```
header
data
footer
```

The columns in the data are separated by white space.  Pandas provides functions for reading data in a very wide range of formats, skipping header and footer lines, converting values (e.g., string dates of the form 2017-01-25 to actual date types), and so on.
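
As a rough illustration of these reader options, here is a minimal sketch that parses a small in-memory "file".  The contents of `raw`, the column names, and the name `demo` are all made up for this example; they are not the contents of `tennis-data.txt`.

```python
import io
import pandas as pd

# Hypothetical file: 2 header lines, a column-name line, 2 data lines, 1 footer line.
raw = "\n".join([
    "# made-up header line 1",
    "# made-up header line 2",
    "day   date        outlook",
    "d1    2017-01-25  sunny",
    "d2    2017-01-26  rain",
    "-- end --",
])

demo = pd.read_csv(
    io.StringIO(raw),
    sep=r"\s+",            # columns separated by white space
    engine="python",       # skipfooter needs the python engine
    skiprows=2,            # skip the two header lines
    skipfooter=1,          # skip the single footer line
    parse_dates=["date"],  # convert the date strings to datetime values
)
print(demo.dtypes)
```

Only the two observation rows remain, and the `date` column holds datetime values rather than strings.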

#### Possible Questions

The data science pipeline is of the form:

```
Questions -> Wrangling -> Exploration -> Modeling -> Communication
```
Pandas can help us with the wrangling, exploration, and communication stages.  It also helps to set up the data structures needed for the modeling phase.

Given the simplicity of the tennis data set, there are not many descriptive statistical questions we can ask.  But some questions could be:

- What is the prior probability of playing tennis?
- What is the frequency count of the various categories in each feature?
- When the outlook is sunny is the temperature always hot?

Think of a few more ...

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

Read the data set with `read_table` and use the notebook's support for interactive exploration to examine its named (keyword) parameters.  Use the methods `.head()` and `.tail()` to ensure that only the data lines (observations) are read and that the header/footer lines are skipped.

You will find it convenient to use `skiprows` in conjunction with `.head()` and `skipfooter` in conjunction with `.tail()`.  I have given the needed values of 5 and 10 below.  I recommend you set these to 0 and then experiment with `.head()` and `.tail()`.

#### Reading the data

In [None]:
df = pd.read_table('tennis-data.txt', sep=r'\s+', engine='python',
                   skiprows=5, skipfooter=10)
#df.head()
#df.tail()

You can check to see if all of the observations have been read by determining the length of the data frame, which should be 14.

In [None]:
len(df)

#### .dtypes, .info, .describe

We can get high level information about a dataframe with a few methods. `.dtypes` gives us the datatypes of each column.

In [None]:
df.dtypes

`.info()` gives us a few more details (number of non-null objects, etc.)

In [None]:
df.info()

`.describe()` gives us more stats.  The information returned by `describe` varies based on whether the column has numerical data or not.  The tennis data set only has non-numerical data (strings):

In [None]:
df.describe()

#### labels, index, columns, values

The rows of a data frame are identified by _labels_.  Unlike the primary keys of a database table, labels need not be unique.  The columns of a data frame are also identified by labels.  The sequence of labels used to identify all the rows is known as the **index** and the sequence of labels used to identify the columns is known as **columns** :-).  The labels of the index and the columns will be shown in bold in the notebook.
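
To make the point about non-unique labels concrete, here is a small sketch on a made-up data frame (the name `demo` and its values are hypothetical).  It uses the `.loc` indexer, which is covered under "Retrieving rows of a DataFrame" in this notebook.

```python
import pandas as pd

# Hypothetical frame whose row labels repeat: 'a' appears twice.
demo = pd.DataFrame({"score": [1, 2, 3]}, index=["a", "b", "a"])

print(demo.index.is_unique)   # the index reports whether its labels are unique

rows_a = demo.loc["a"]        # a repeated label selects every matching row
print(rows_a)
```

Because the label `'a'` occurs twice, selecting it yields a data frame with two rows instead of a single row.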

By default, pandas sets the row labels (index) to be a sequence of integers starting from 0 and takes the column labels from the first row that is read.  Hence we get:

In [None]:
df.index

In [None]:
df.columns

We usually want to give more meaningful names to row labels.  We can change the index of a data frame with the method `set_index`.  Keep in mind that most operations on data frames do not change the data frame.  Rather, a new one is created.  Sometimes we re-assign the new data frame to the existing variable (as below) and sometimes we assign it to a new variable so that we have access to both data frames.

In [None]:
df = df.set_index('day')
df.head()

If we now ask for the index we get:

In [None]:
df.index

We can change the index to any column we want with `set_index`, but in doing so we lose the existing index.

In [None]:
df.set_index('playtennis').head()

But, as is the case most of the time, `set_index` also creates a new data frame, i.e., the original data frame `df` is not changed.

In [None]:
df.head()

If we want to go back to the default index of a sequence of numbers, we use `reset_index`.

In [None]:
d2 = df.head()
d2.head()

In [None]:
d2.reset_index().head()
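
By default `reset_index` keeps the old labels as an ordinary column.  The sketch below, on a made-up frame (`demo`, `kept`, and `dropped` are hypothetical names), contrasts this with `drop=True`, which discards the old labels.

```python
import pandas as pd

# Hypothetical frame indexed by a column named 'day'.
demo = pd.DataFrame(
    {"outlook": ["sunny", "rain"]},
    index=pd.Index(["d1", "d2"], name="day"),
)

kept = demo.reset_index()              # 'day' becomes a regular column again
dropped = demo.reset_index(drop=True)  # the old labels are discarded

print(kept)
print(dropped)
```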

#### .values

We can also extract the values of a data frame:

In [None]:
df.values

But this is **rarely** done in practice; we usually work with the values while they are IN the data frame.

#### Retrieving columns of a DataFrame

We can ask for a single column by specifying the column label

In [None]:
df['playtennis']

A single column of a data frame is a data structure known as a **Series** object.

In [None]:
s = df['playtennis']
type(s)

A series object can be created on its own.  For now, we will form series objects from columns of a data frame.  The index of a series is the same as the index of the data frame it is part of.

In [None]:
s.index

We can extract all the values of a series into a NumPy array.

In [None]:
s.values

We can extract more than one column from a data frame to create a "sub data frame"

In [None]:
cols = ['outlook', 'playtennis']
df[cols]

Often, we DON'T store the column names in a variable and use that to extract.  Rather, we directly specify the list of column names.  The resultant double `[[...]]` may initially appear unusual, but one gets used to it.

In [None]:
df[['outlook','playtennis']]

Note that with this double bracket notation, if we specify only 1 column name we get a data frame with one column, NOT a series.  Ensure you understand the difference between `df['playtennis']` and `df[['playtennis']]`.

In [None]:
df[['playtennis']]
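
A quick way to see the difference is to compare the types and shapes of the two results.  This is a minimal sketch on a made-up two-row frame (`demo`, `single`, and `double` are hypothetical names):

```python
import pandas as pd

demo = pd.DataFrame({"outlook": ["sunny", "rain"], "playtennis": ["no", "yes"]})

single = demo["playtennis"]      # single brackets: a Series
double = demo[["playtennis"]]    # double brackets: a one-column DataFrame

print(type(single).__name__, single.shape)
print(type(double).__name__, double.shape)
```

The series has a one-dimensional shape, while the one-column data frame is still two-dimensional.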

#### Retrieving rows of a DataFrame

We retrieve rows by using one of two methods `.loc[]` and `.iloc[]`.  We use `.loc` in conjunction with row labels.  Here are the first 5 rows of the data frame again: 

In [None]:
df.head()

When we extract a single row, we get a series.  `loc` (and `iloc`) are indexers where we use square brackets and NOT parentheses (which we would use if they were functions).

In [None]:
df.loc['d4']

When we extract multiple rows, we get a data frame.  Note that again we specify the labels in a list.

In [None]:
df.loc[['d2','d8','d11']]

We can do the usual range indexing, except that now the end label is included

In [None]:
df.loc['d3':'d9']

`iloc` does indexing by position as opposed to labels.  Positions are 0 based

In [None]:
df.iloc[0]

We can do range indexing.  But this time the end value is NOT included:

In [None]:
df.iloc[0:4]  # will give us the 0th, 1st, 2nd, and 3rd rows
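
The different treatment of the end point is easy to overlook, so here is a side-by-side sketch on a made-up four-row frame (the name `demo` and its values are hypothetical):

```python
import pandas as pd

demo = pd.DataFrame(
    {"outlook": ["sunny", "sunny", "overcast", "rain"]},
    index=["d1", "d2", "d3", "d4"],
)

by_label = demo.loc["d1":"d3"]   # label slice: the end label 'd3' is included
by_position = demo.iloc[0:2]     # position slice: position 2 is excluded

print(list(by_label.index))
print(list(by_position.index))
```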

#### value_counts

Now that we know how to extract columns and rows of a data frame, let's return to the series data type.  We often want to do a frequency count of the values of a series.  Suppose we want to count the number of `yes` and `no` values in the playtennis column; we could do:

In [None]:
s=df['playtennis']
s.value_counts()

Note that the result of `value_counts` is itself a series!

In [None]:
s2 = s.value_counts()
type(s2)

We can compute the relative frequency of `yes` in the whole data set by counting the total number of `yes` and dividing by the total number of entries in the data frame

In [None]:
s.value_counts().loc['yes']/len(s)

The above expression could be broken into pieces as below, but the above is more idiomatic

In [None]:
cnts = s.value_counts()
cnts.loc['yes']/len(s)

We can also compute the relative frequency of both `yes` and `no` with the below.  Recall the notion of **broadcasting**: when a series is operated on by a scalar, all values in the series are operated on (there is an implicit loop).

In [None]:
s.value_counts()/len(s)
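
`value_counts` also accepts a `normalize=True` keyword that returns relative frequencies directly.  The sketch below compares it with the manual division on a made-up series (`s_demo`, `manual`, and `built_in` are hypothetical names):

```python
import pandas as pd

s_demo = pd.Series(["yes", "no", "yes", "yes"])

manual = s_demo.value_counts() / len(s_demo)     # counts divided by a scalar
built_in = s_demo.value_counts(normalize=True)   # relative frequencies directly

print(manual)
print(built_in)
```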

#### Boolean Indexing

Consider the following boolean expression.  Due to broadcasting we get a series of `True` / `False` values.

In [None]:
df['outlook'] == 'sunny'

We can store this series of `True`/`False` values and use it to index a data frame.  We then get those rows for which the boolean mask is `True`

In [None]:
boolean_mask = df['outlook'] == 'sunny'
df[boolean_mask]

The pandorable way of doing this is NOT to explicitly store the boolean mask but rather to use it inline.

In [None]:
d2 = df[df['outlook'] == 'sunny']
d2
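
Boolean masks can be combined with `&` (and) and `|` (or); each comparison needs its own parentheses.  The sketch below uses a made-up frame in which the column name `temp` and all values are assumptions, not the actual tennis data.  It also shows one possible way to approach a question such as "when the outlook is sunny, is the temperature always hot?".

```python
import pandas as pd

# Hypothetical data; the 'temp' column name is an assumption for this example.
demo = pd.DataFrame({
    "outlook": ["sunny", "sunny", "rain", "overcast"],
    "temp":    ["hot",   "mild",  "mild", "hot"],
})

# Rows that satisfy both conditions.
sunny_and_hot = demo[(demo["outlook"] == "sunny") & (demo["temp"] == "hot")]

# Among the sunny rows only, is every temperature 'hot'?
always_hot = (demo.loc[demo["outlook"] == "sunny", "temp"] == "hot").all()

print(sunny_and_hot)
print(always_hot)
```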

#### drop

Like other operations `drop` is not a destructive operation.  It returns a new data frame.

In [None]:
df.drop('d2')
df.head()

In [None]:
df.drop('d2').head()

We can drop multiple rows by specifying the labels in a list

In [None]:
d1 = df.drop(['d10', 'd12', 'd14'])
d1.tail(5)

If we want to drop a column, we need to specify an **axis**.  You can think of a data frame with the (0,0) coordinate in the upper left.  axis=0 moves downwards and axis=1 moves to the right.

In [None]:
df.drop('wind', axis=1).head()
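
As an alternative to `axis=1`, `drop` accepts a `columns=` keyword, which some readers find easier to remember.  A minimal sketch on a made-up frame (`demo`, `by_axis`, and `by_keyword` are hypothetical names):

```python
import pandas as pd

demo = pd.DataFrame(
    {"outlook": ["sunny", "rain"], "wind": ["weak", "strong"]},
    index=["d1", "d2"],
)

by_axis = demo.drop("wind", axis=1)      # drop a column via the axis argument
by_keyword = demo.drop(columns="wind")   # same result via the columns keyword

print(by_axis.equals(by_keyword))
print(list(demo.columns))                # the original frame is unchanged
```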

In [None]:
df.count()

#### apply

This is in the spirit of a list comprehension:  when we want to apply a function to a single series or all the columns of a data frame we use `apply`

First, let's see how it works on a series

In [None]:
s = df['playtennis']
s.head()

Let us upper-case all the values

In [None]:
s.apply(lambda v: v.upper())

When we apply a function to a data frame the argument to that function is a series.  Hence it doesn't make sense to do

```
df.apply(lambda x: x.upper())
```
because at this point `x` is a series.

Rather we need to `apply` a function that can be applied to series like `count`

In [None]:
df.apply(lambda ser: ser.count())
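
If the goal really is to upper-case every string in a data frame, one option is to apply a function that works on a whole series, such as the vectorized `.str.upper()` accessor.  A minimal sketch on a made-up frame (`demo` and `upper` are hypothetical names):

```python
import pandas as pd

demo = pd.DataFrame({"outlook": ["sunny", "rain"], "playtennis": ["no", "yes"]})

# Each column arrives as a Series, so we use the Series string accessor.
upper = demo.apply(lambda ser: ser.str.upper())

print(upper)
```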

When `describe` is applied to a series, we get another series

In [None]:
s.describe()

When we apply `describe` to a data frame, it is applied to each column of the data frame (each of which is a series object).  The resultant collection of series objects is then assembled back into a data frame

In [None]:
df.apply(lambda ser: ser.describe())

#### Combining Series

Let's hand-create a couple of series with different but overlapping indexes

In [None]:
s1 = pd.Series([80, 70, 90], index='abe bob cathy'.split())
s1
s2 = pd.Series([10, 20, 30], index='bob don abe'.split())

print(s1)
print()
print(s2)


Due to broadcasting, when we perform an operation on a series as a whole, all of the values in the series are operated on

In [None]:
s1+5

In [None]:
(s1+5)*10

We can join two series together with `pd.concat` (the older `Series.append` method was removed in pandas 2.0)

In [None]:
pd.concat([s1, s2])

Something interesting happens when we do an entry by entry operation on a series

In [None]:
s1+s2

Pandas automatically aligns row labels and performs the operation only on labels that are present in both series.  The rest are deemed "Not a Number" (`NaN`).
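
If `NaN` is not what we want for the unmatched labels, the `add` method takes a `fill_value` that stands in for the missing entry.  The sketch below rebuilds the two series under the hypothetical names `a` and `b` so that it is self-contained:

```python
import pandas as pd

a = pd.Series([80, 70, 90], index=["abe", "bob", "cathy"])
b = pd.Series([10, 20, 30], index=["bob", "don", "abe"])

plain = a + b                     # unmatched labels ('cathy', 'don') become NaN
filled = a.add(b, fill_value=0)   # a missing entry is treated as 0 instead

print(plain)
print(filled)
```

With `fill_value=0`, `cathy` keeps her value from `a` and `don` keeps his value from `b`.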

### Grouping

Similar to the `GROUP BY` clause of SQL, Pandas supports the ability to group rows in a number of ways

In [None]:
grps = df.groupby('playtennis')

`grps` has a data type of its own.  Its constituent parts are data frames

In [None]:
type(grps)

We can get information on the groups and their constituent rows

In [None]:
grps.groups

The size of each group is available as a series object

In [None]:
grps.size()

We can get the individual data frames in a group with `.get_group`

In [None]:
grps.get_group('no')

We can also iterate across all the groups in a groupby object

In [None]:
for k,g in grps:
    print(k)
    print(g)
    print()

Separate them with a list comprehension

In [None]:
lst=[(k,g) for (k,g) in grps]
lst

In [None]:
type(lst[0][1])

In [None]:
lst[0][1]

#### .apply on groupby objects

We can also apply a function to a groupby object.  In this instance the function that is applied takes a data frame as the argument:

In [None]:
grps.apply(lambda d: len(d))

Let's spend some time dissecting the below

In [None]:
df.apply(lambda s: s.value_counts())
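
One way to dissect this expression is to look at the per-column results first and then at the assembled data frame.  The sketch below does so on a made-up three-row frame (`demo`, `per_column`, and `combined` are hypothetical names):

```python
import pandas as pd

demo = pd.DataFrame({
    "outlook":    ["sunny", "sunny", "rain"],
    "playtennis": ["no",    "yes",   "yes"],
})

# Step 1: each column on its own gives a Series indexed by its distinct values.
per_column = {col: demo[col].value_counts() for col in demo.columns}
for col, counts in per_column.items():
    print(col)
    print(counts)
    print()

# Step 2: apply assembles those Series into one frame, aligning on the labels.
combined = demo.apply(lambda ser: ser.value_counts())
print(combined)
```

The row labels of `combined` are the union of the distinct values of all columns, so a value that never occurs in a given column shows up as `NaN` there.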

Blend individual data frames

#### .agg: aggregating values

`agg` is similar to `apply` but differs in the following crucial ways.

   [1] `apply` can be used with a series, a data frame, or a group.  The function being applied takes the components of the data type to which it is applied:
   
   ```
   Something.apply(lambda x: ______ )
   ```
   
   If `Something` is 
       - a series then x is a value
       - a data frame then x is a column (series)
       - a group then x is a data frame
       
       
   [2] In this notebook `agg` is applied to a group (pandas also provides `agg` on series and data frames).  The function is applied to each column of each data frame in the group, i.e., x is a series.  Also, multiple functions can be passed in a list during aggregation.

In [None]:
grps.get_group('no').count()

In [None]:
len(grps.get_group('no'))+10

In [None]:
grps.agg(['count', lambda s: len(s)+10])

In [None]:
def f(s):
    return len(s)+100

grps.agg(['count', f])
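
Functions that pandas already knows can also be passed to `agg` by name as strings.  The sketch below is a self-contained example on made-up data (the values and the names `demo`, `demo_grps`, and `summary` are assumptions); it computes the count and the number of distinct values per column within each group.

```python
import pandas as pd

# Hypothetical observations, not the actual tennis data.
demo = pd.DataFrame({
    "outlook":    ["sunny", "rain",   "rain", "overcast"],
    "wind":       ["weak",  "strong", "weak", "weak"],
    "playtennis": ["no",    "no",     "yes",  "yes"],
})

demo_grps = demo.groupby("playtennis")

# Each named function is applied to every column of every group.
summary = demo_grps.agg(["count", "nunique"])
print(summary)
```

The result has one row per group and a two-level column label: the original column name on top and the aggregation name beneath it.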