In [None]:
###### Set Up #####
# verify our folder with the data and module assets is installed
# if it is installed make sure it is the latest
!test -e ds-assets && cd ds-assets && git pull && cd ..
# if it is not installed clone it 
!test ! -e ds-assets && git clone https://github.com/lutzhamel/ds-assets.git
# point to the folder with the assets
home = "ds-assets/assets/" 
import sys
sys.path.append(home)      # add home folder to module search path

Cloning into 'ds-assets'...
remote: Enumerating objects: 168, done.
remote: Counting objects: 100% (4/4), done.
remote: Compressing objects: 100% (4/4), done.
remote: Total 168 (delta 0), reused 2 (delta 0), pack-reused 164
Receiving objects: 100% (168/168), 7.40 MiB | 24.29 MiB/s, done.
Resolving deltas: 100% (60/60), done.


# Data Manipulation with Pandas

Pandas provides 1-D (Series) and 2-D (DataFrame) data structures; the 3-D Panel structure has been removed from current versions of Pandas.  Here we cover DataFrames because they most closely resemble the data tables that data scientists typically work with.

The advantage of Pandas is that it stores the data together with its *metadata*.

The most commonly used metadata in Pandas are the **column names** and the **index**.
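To make the idea of data plus metadata concrete, here is a minimal sketch. The table `toy` is built by hand for illustration (it is not read from mammals.csv), and the names in it are assumptions.

```python
import pandas

# hypothetical toy table, built by hand for illustration
toy = pandas.DataFrame(
    {"Legs": [4, 2], "Wings": ["no", "yes"]},
    index=["Dog", "Duck"],
)

print(list(toy.columns))  # column names (metadata)
print(list(toy.index))    # row labels, i.e. the index (metadata)
print(toy.shape)          # (rows, columns) of the underlying data
```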


In [None]:
import pandas
import numpy # for random number generation

In [None]:
df = pandas.read_csv(home+"mammals.csv")

In [None]:
df

Unnamed: 0,Legs,Wings,Fur,Feathers,Mammal
0,4,no,yes,no,True
1,2,yes,no,yes,False
2,4,no,no,no,False
3,4,yes,yes,no,True
4,3,no,no,no,False


# DataFrame Parts

A dataframe is composed of different parts that work together to give a coherent view of the data:

In [None]:
df.columns

Index(['Legs', 'Wings', 'Fur', 'Feathers', 'Mammal'], dtype='object')

In [None]:
df.index

RangeIndex(start=0, stop=5, step=1)

In [None]:
df.values

array([[4, 'no', 'yes', 'no', True],
       [2, 'yes', 'no', 'yes', False],
       [4, 'no', 'no', 'no', False],
       [4, 'yes', 'yes', 'no', True],
       [3, 'no', 'no', 'no', False]], dtype=object)

We can change the parts of the data.  For example, we can create a new index for our dataframe:

In [None]:
df.index = ['Dog', 'Duck', 'Frog', 'Bat', 'Bar Stool']

In [None]:
df

Unnamed: 0,Legs,Wings,Fur,Feathers,Mammal
Dog,4,no,yes,no,True
Duck,2,yes,no,yes,False
Frog,4,no,no,no,False
Bat,4,yes,yes,no,True
Bar Stool,3,no,no,no,False
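The column names can be changed in a similar way. The sketch below uses a hand-built table `toy` and a made-up new column name `NumLegs`, both assumptions for illustration only.

```python
import pandas

toy = pandas.DataFrame({"Legs": [4, 2], "Wings": ["no", "yes"]})

# replace the whole index at once (length must match the number of rows)
toy.index = ["Dog", "Duck"]

# rename a single column; rename returns a new dataframe by default
renamed = toy.rename(columns={"Legs": "NumLegs"})

print(list(renamed.columns))
print(list(toy.columns))  # the original dataframe is unchanged
```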


# Indexing and Slicing

For array-style indexing, Pandas uses the **loc** and **iloc** indexers. (The older **ix** indexer has been removed from current versions of Pandas.)

Using the **iloc** indexer, we can index the underlying array as if it is a simple array using row and column integer values (hence the i in iloc). The DataFrame index and column labels are maintained in the result:

In [None]:
df

Unnamed: 0,Legs,Wings,Fur,Feathers,Mammal
Dog,4,no,yes,no,True
Duck,2,yes,no,yes,False
Frog,4,no,no,no,False
Bat,4,yes,yes,no,True
Bar Stool,3,no,no,no,False


In [None]:
df.iloc[:2,1:4]

Unnamed: 0,Wings,Fur,Feathers
Dog,no,yes,no
Duck,yes,no,yes


Using the **loc** indexer we can index the underlying data in an array-like style but using the explicit index and column names:

In [None]:
df.loc[:'Duck','Wings':'Feathers']

Unnamed: 0,Wings,Fur,Feathers
Dog,no,yes,no
Duck,yes,no,yes


Notice that when slicing with an explicit index of labels (e.g., data.loc['a':'c']), the final label is included in the slice, while when slicing with an implicit index of integer positions (e.g., data.iloc[0:2]), the final position is excluded from the slice.
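The difference between the two slicing rules is easy to check on a small example. The Series `s` below is a hand-built assumption used only for illustration.

```python
import pandas

s = pandas.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

by_label = s.loc["a":"c"]    # label slice: the end label 'c' is included
by_position = s.iloc[0:2]    # position slice: position 2 is excluded

print(list(by_label))     # [10, 20, 30]
print(list(by_position))  # [10, 20]
```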


# Converting Categorical Data to Numerical Data

The machine learning algorithms in sklearn only operate on numerical data.  That means any data that is categorical has to be converted to numerical data.  **This is only true for the independent variables**.  The target variable can be categorical or numeric.



For categorical variables that have only **two labels** the conversion is easy: replace the labels with 0 and 1.  Consider our Mammals data set.

In [None]:
df

Unnamed: 0,Legs,Wings,Fur,Feathers,Mammal
Dog,4,no,yes,no,True
Duck,2,yes,no,yes,False
Frog,4,no,no,no,False
Bat,4,yes,yes,no,True
Bar Stool,3,no,no,no,False


In [None]:
# define a function that turns 'yes' into 1 and 'no' into 0
def f(x):
  if x == 'yes':
    return 1
  elif x == 'no':
    return 0
  else:
    # something strange happened...
    return x

# make a copy
df_numeric = df.copy()
# replace the categorical variables with numeric ones
df_numeric['Wings'] = df_numeric['Wings'].apply(f)
df_numeric['Fur'] = df_numeric['Fur'].apply(f)
df_numeric['Feathers'] = df_numeric['Feathers'].apply(f)
df_numeric

Unnamed: 0,Legs,Wings,Fur,Feathers,Mammal
Dog,4,0,1,0,True
Duck,2,1,0,1,False
Frog,4,0,0,0,False
Bat,4,1,1,0,True
Bar Stool,3,0,0,0,False
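As an alternative to writing a conversion function, the same replacement can be sketched with a dictionary and `Series.map`. The table `toy` and the dictionary name `yes_no` are assumptions for illustration. Note one difference from the function above: `map` turns labels that are not in the dictionary into NaN instead of passing them through.

```python
import pandas

toy = pandas.DataFrame({"Wings": ["no", "yes"], "Fur": ["yes", "no"]})

yes_no = {"yes": 1, "no": 0}
toy_numeric = toy.copy()
for col in ["Wings", "Fur"]:
    # map looks each value up in the dictionary; unknown labels become NaN
    toy_numeric[col] = toy_numeric[col].map(yes_no)

print(toy_numeric)
```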


We cannot use this approach with categorical variables that have more than two labels. Suppose we had a variable called 'Colors' with three labels: 'Yellow', 'Red', and 'Blue'.  If we replaced the labels with numerical values such as 'Yellow'=1, 'Red'=2, and 'Blue'=3, we would inadvertently introduce an ordering on the colors, namely,

'Yellow' < 'Red' < 'Blue'

which is of course not true. To get around this we introduce what is called **dummy encoding** where each label of a categorical variable is represented as its own **dummy variable**.
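For the hypothetical 'Colors' variable, dummy encoding could look like the sketch below. The data is made up for illustration; `dtype=int` is passed so that the dummy columns hold 0/1 rather than booleans.

```python
import pandas

colors = pandas.DataFrame({"Colors": ["Yellow", "Red", "Blue", "Red"]})

# one 0/1 column per label; no ordering between the labels is implied
dummies = pandas.get_dummies(colors, columns=["Colors"], dtype=int)

print(dummies)
```

Each row has exactly one 1, in the column that corresponds to its original label.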

To see this let's go back to our tennis data set.



In [None]:
df_tennis = pandas.read_csv(home+"tennis.csv")
df_tennis

Unnamed: 0,outlook,temp,humidity,windy,play
0,sunny,hot,high,False,no
1,sunny,hot,high,True,no
2,overcast,hot,high,False,yes
3,rainy,mild,high,False,yes
4,rainy,cool,normal,False,yes
5,rainy,cool,normal,True,no
6,overcast,cool,normal,True,yes
7,sunny,mild,high,False,no
8,sunny,cool,normal,False,yes
9,rainy,mild,normal,False,yes


In [None]:
# find the number of labels in each column
for v in list(df_tennis.columns):
  print("# of labels in {}: {}".format(v,df_tennis[v].value_counts().shape[0]))

# of labels in outlook: 3
# of labels in temp: 3
# of labels in humidity: 2
# of labels in windy: 2
# of labels in play: 2
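The same label counts can also be obtained without a loop. This is a minimal sketch on a hand-built table `toy` (an assumption, not the tennis data itself) using `DataFrame.nunique`.

```python
import pandas

toy = pandas.DataFrame({
    "outlook": ["sunny", "overcast", "rainy", "sunny"],
    "windy": [False, True, False, True],
})

# nunique counts the distinct (non-missing) values in each column
label_counts = toy.nunique()
print(label_counts)

# columns with more than two labels are the ones that need dummy encoding
multi_label = list(label_counts[label_counts > 2].index)
print(multi_label)  # ['outlook']
```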


This means we have to do dummy encoding for the variables outlook and temp, and we can perform a straightforward label replacement for the variables humidity and windy.  We will leave the target variable alone.

In [None]:
def f_humidity(x):
  if x == 'normal':
    return 0
  elif x == 'high':
    return 1
  else:
    return x

def f_windy(x):
  if x == False:
    return 0
  elif x == True:
    return 1
  else:
    return x

df_tennis_numeric = df_tennis.copy()
# replace binary variables
df_tennis_numeric['humidity'] = df_tennis_numeric['humidity'].apply(f_humidity)
df_tennis_numeric['windy'] = df_tennis_numeric['windy'].apply(f_windy)
# replace multi-label variables
df_tennis_numeric = pandas.get_dummies(df_tennis_numeric,columns=['outlook','temp'])
df_tennis_numeric

Unnamed: 0,humidity,windy,play,outlook_overcast,outlook_rainy,outlook_sunny,temp_cool,temp_hot,temp_mild
0,1,0,no,0,0,1,0,1,0
1,1,1,no,0,0,1,0,1,0
2,1,0,yes,1,0,0,0,1,0
3,1,0,yes,0,1,0,0,0,1
4,0,0,yes,0,1,0,1,0,0
5,0,1,no,0,1,0,1,0,0
6,0,1,yes,1,0,0,1,0,0
7,1,0,no,0,0,1,0,0,1
8,0,0,yes,0,0,1,1,0,0
9,0,0,yes,0,1,0,0,0,1


Let's try to build a decision tree on this now that it is in numeric shape suitable for sklearn.

In [None]:
from sklearn import tree
from treeviz import tree_print

features_df = df_tennis_numeric.drop(['play'],axis=1)
target_df = pandas.DataFrame(df_tennis_numeric['play'])

dtree = tree.DecisionTreeClassifier(criterion='entropy')
dtree.fit(features_df,target_df)
tree_print(dtree,features_df)

if outlook_overcast =< 0.5: 
  |then if humidity =< 0.5: 
  |  |then if windy =< 0.5: 
  |  |  |then yes
  |  |  |else if outlook_sunny =< 0.5: 
  |  |  |  |then no
  |  |  |  |else yes
  |  |else if outlook_rainy =< 0.5: 
  |  |  |then no
  |  |  |else if windy =< 0.5: 
  |  |  |  |then yes
  |  |  |  |else no
  |else yes
<---------->
Tree Depth:  4


The tree looks a bit different because we are splitting on 0/1.  But we can see that the outlook variable is still the most predictive variable. Note: something =< 0.5 means something == 0 since the values are only 1 and 0.

# Data Access Patterns

We can use relational and boolean expressions when selecting data from a dataframe.

To see how this works, first note that there is another simple way to select columns of a dataframe:

In [None]:
df[['Wings', 'Mammal']] # using a list of column names to access columns

Unnamed: 0,Wings,Mammal
Dog,no,True
Duck,yes,False
Frog,no,False
Bat,yes,True
Bar Stool,no,False


Relational Operators:

In [None]:
df[df.Wings == 'yes'] # accessing rows for which an equality holds

Unnamed: 0,Legs,Wings,Fur,Feathers,Mammal
Duck,2,yes,no,yes,False
Bat,4,yes,yes,no,True


In [None]:
df[df.Wings == 'yes'].Mammal # accessing attribute values for rows for which the equality holds

Duck    False
Bat      True
Name: Mammal, dtype: bool

In [None]:
df[(df.Wings == 'yes') & (df.Fur == 'yes')] # boolean operations

Unnamed: 0,Legs,Wings,Fur,Feathers,Mammal
Bat,4,yes,yes,no,True
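Besides `&` (and), conditions can be combined with `|` (or) and negated with `~` (not). The sketch below uses a hand-built table `toy`, an assumption for illustration. Each condition needs its own parentheses because these operators bind more tightly than `==`.

```python
import pandas

toy = pandas.DataFrame(
    {"Wings": ["no", "yes", "no", "yes"], "Fur": ["yes", "no", "no", "yes"]},
    index=["Dog", "Duck", "Frog", "Bat"],
)

either = toy[(toy.Wings == "yes") | (toy.Fur == "yes")]   # or
no_wings = toy[~(toy.Wings == "yes")]                     # not
picked = toy[toy.index.isin(["Dog", "Bat"])]              # membership test

print(list(either.index))    # ['Dog', 'Duck', 'Bat']
print(list(no_wings.index))  # ['Dog', 'Frog']
print(list(picked.index))    # ['Dog', 'Bat']
```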


# Missing or Duplicated Data
* Pandas flags missing values with NaN (not a number).
* Most aggregate computations applied to a dataframe (such as sum or mean) skip NaNs by default.
* However, it is still a good idea to clean up the dataframe.
* In general we have two options for dealing with missing data:
 * Either drop the rows or columns that have NaNs
 * Or try to substitute a reasonable value for the NaN
 

In [None]:
df_missing = pandas.read_csv(home+"mammals-missing.csv")
df_missing

Unnamed: 0,Legs,Wings,Fur,Feathers,Mammal
0,4,no,yes,no,True
1,2,yes,no,yes,False
2,4,no,no,,False
3,4,,yes,no,True
4,3,no,no,no,False


In [None]:
# look at the values of the isnull dataframe
df_missing.isnull().values

array([[False, False, False, False, False],
       [False, False, False, False, False],
       [False, False, False,  True, False],
       [False,  True, False, False, False],
       [False, False, False, False, False]])

In [None]:
# find out how many values are missing
# NOTE: sum treats 'True' as 1 and 'False' as 0 
df_missing.isnull().values.sum()

2
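It is often more useful to know where the values are missing than just how many there are. A minimal sketch, assuming a hand-built table `toy` with a few NaNs:

```python
import numpy
import pandas

toy = pandas.DataFrame({
    "Legs": [4, 2, 4],
    "Wings": ["no", numpy.nan, "no"],
    "Feathers": [numpy.nan, "yes", numpy.nan],
})

per_column = toy.isnull().sum()      # missing values in each column
per_row = toy.isnull().sum(axis=1)   # missing values in each row

print(per_column)
print(per_row)
```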

In [None]:
# drop rows that have NaNs
df_missing.dropna(how='any',axis=0)

Unnamed: 0,Legs,Wings,Fur,Feathers,Mammal
0,4,no,yes,no,True
1,2,yes,no,yes,False
4,3,no,no,no,False


In [None]:
# dropping columns that have NaNs
# NOTE: this is NOT always a good idea -- if every column has a NaN we get an empty dataframe!
df_missing.dropna(how='any',axis=1)

Unnamed: 0,Legs,Fur,Mammal
0,4,yes,True
1,2,no,False
2,4,no,False
3,4,yes,True
4,3,no,False


# Replacing Missing Data

We can also try to estimate the missing data - **impute** it.

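We replace the missing values with the mean (for numeric columns) or the mode (for categorical columns) of each column. Here we use the mode, which is defined for both kinds of columns.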

In [None]:
df_missing

Unnamed: 0,Legs,Wings,Fur,Feathers,Mammal
0,4,no,yes,no,True
1,2,yes,no,yes,False
2,4,no,no,,False
3,4,,yes,no,True
4,3,no,no,no,False


In [None]:
# compute the mode of each column
df_missing.mode()

Unnamed: 0,Legs,Wings,Fur,Feathers,Mammal
0,4,no,no,no,False


In [None]:
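# fill the missing values in each column with that column's mode
df_new = df_missing.copy()
for c in df_new.columns:
    # assign the result back; fillna(..., inplace=True) on a selected
    # column is not guaranteed to update df_new itself
    df_new[c] = df_new[c].fillna(df_missing[c].mode()[0])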

df_new


Unnamed: 0,Legs,Wings,Fur,Feathers,Mammal
0,4,no,yes,no,True
1,2,yes,no,yes,False
2,4,no,no,no,False
3,4,no,yes,no,True
4,3,no,no,no,False
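The loop above fills every column with its mode. A variant that uses the mean for numeric columns and the mode for all others might look like this sketch; the table `toy` is a hand-built assumption.

```python
import numpy
import pandas

toy = pandas.DataFrame({
    "Legs": [4.0, 2.0, numpy.nan, 4.0],
    "Wings": ["no", "yes", "no", numpy.nan],
})

filled = toy.copy()
for c in filled.columns:
    if pandas.api.types.is_numeric_dtype(filled[c]):
        fill_value = filled[c].mean()      # mean skips NaNs by default
    else:
        fill_value = filled[c].mode()[0]   # most frequent label
    filled[c] = filled[c].fillna(fill_value)

print(filled)
```

Whether the mean is a sensible fill value depends on the variable; for a count such as the number of legs, the mode or median may be the more natural choice.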


# Broadcasting

Binary arithmetic operators are applied element by element to dataframes, with elements paired up by their index and column labels.

Broadcasting refers to the fact that Pandas will reuse a scalar for every element in order to complete the binary operation.


In [None]:
df = pandas.DataFrame([[1,2],[3,4]])
df

Unnamed: 0,0,1
0,1,2
1,3,4


In [None]:
# element by element operation
df + df

Unnamed: 0,0,1
0,2,4
1,6,8


In [None]:
# broadcasting a scalar
# NOTE: the scalar is applied to ALL elements
#       of the dataframe
df + 10

Unnamed: 0,0,1
0,11,12
1,13,14


In [None]:
# we can now say things like this
df + df == 2*df

Unnamed: 0,0,1
0,True,True
1,True,True
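Broadcasting is not limited to scalars. As a sketch, a Series can be reused across every row or down every column of a dataframe; the names `toy`, `row_offsets`, and `col_offsets` are assumptions for illustration.

```python
import pandas

toy = pandas.DataFrame([[1, 2], [3, 4]])

# a Series whose index matches the column labels is reused for every row
row_offsets = pandas.Series([10, 100])
across_rows = toy + row_offsets
print(across_rows)

# to reuse a Series down the columns instead, match on the row index
col_offsets = pandas.Series([10, 100])
down_columns = toy.add(col_offsets, axis=0)
print(down_columns)
```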


# Reading

* 3.1 [Pandas](https://jakevdp.github.io/PythonDataScienceHandbook/03.01-introducing-pandas-objects.html)
* 3.2 [Data Indexing and Selection](https://jakevdp.github.io/PythonDataScienceHandbook/03.02-data-indexing-and-selection.html)
* 3.3 [Operating on Data in Pandas](https://jakevdp.github.io/PythonDataScienceHandbook/03.03-operations-in-pandas.html)
* 3.4 [Handling Missing Data](https://jakevdp.github.io/PythonDataScienceHandbook/03.04-missing-values.html)


# Lab Exercise

See BrightSpace Assignment #2