In [28]:
###### Set Up #####
# verify our folder with the data and module assets is installed
# if it is installed make sure it is the latest
!test -e ds-assets && cd ds-assets && git pull && cd ..
# if it is not installed clone it
!test ! -e ds-assets && git clone https://github.com/lutzhamel/ds-assets.git
# point to the folder with the assets
home = "ds-assets/assets/"
import sys
sys.path.append(home)      # add home folder to module search path

Already up to date.


# Data Manipulation with Pandas

Pandas provides two main data structures: the 1-D Series and the 2-D DataFrame. (The 3-D Panel found in older releases has been removed from current versions of Pandas.) Here we cover DataFrames because they most closely resemble the data tables that data scientists typically work with.

The advantage of Pandas is that it stores the data together with its *metadata*.

The most commonly used metadata in Pandas are the **column names** and the **index**.
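As a minimal sketch of how data and metadata travel together, the following builds a small Series and DataFrame by hand. The `legs` and `animals` names and their values are made up for illustration and are not part of the course assets.

```python
import pandas as pd

# A Series is a 1-D array of values paired with an index (row labels).
legs = pd.Series([4, 2, 4], index=['Dog', 'Duck', 'Frog'], name='Legs')

# A DataFrame is a 2-D table: every column is a Series sharing one index.
animals = pd.DataFrame(
    {'Legs': [4, 2, 4], 'Wings': ['no', 'yes', 'no']},
    index=['Dog', 'Duck', 'Frog'],
)

print(list(animals.columns))   # the column names
print(list(animals.index))     # the row labels
print(animals['Legs'])         # selecting one column gives back a Series
```

Selecting a single column returns a Series that carries the same index as the DataFrame it came from.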


In [29]:
import pandas as pd

In [30]:
df = pd.read_csv(home+"mammals.csv")

In [31]:
df

Unnamed: 0,Legs,Wings,Fur,Feathers,Mammal
0,4,no,yes,no,True
1,2,yes,no,yes,False
2,4,no,no,no,False
3,4,yes,yes,no,True
4,3,no,no,no,False


# DataFrame Parts

A dataframe is composed of different parts that work together to give a coherent view of the data:

In [32]:
df.columns

Index(['Legs', 'Wings', 'Fur', 'Feathers', 'Mammal'], dtype='object')

In [33]:
df.index

RangeIndex(start=0, stop=5, step=1)

In [34]:
df.values

array([[4, 'no', 'yes', 'no', True],
       [2, 'yes', 'no', 'yes', False],
       [4, 'no', 'no', 'no', False],
       [4, 'yes', 'yes', 'no', True],
       [3, 'no', 'no', 'no', False]], dtype=object)

We can change the parts of the data.  For example, we can create a new index for our dataframe:

In [35]:
df.index = ['Dog', 'Duck', 'Frog', 'Bat', 'Bar Stool']

In [36]:
df

Unnamed: 0,Legs,Wings,Fur,Feathers,Mammal
Dog,4,no,yes,no,True
Duck,2,yes,no,yes,False
Frog,4,no,no,no,False
Bat,4,yes,yes,no,True
Bar Stool,3,no,no,no,False
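Once the index holds meaningful labels, rows can be looked up by name. The sketch below uses a small hand-built table (an assumption, not the course data) and the `loc` indexer, which selects by label rather than by position. It also shows `rename`, one way to change column names without touching the data.

```python
import pandas as pd

animals = pd.DataFrame({'Legs': [4, 2, 4], 'Fur': ['yes', 'no', 'no']})

# Replace the default 0..n-1 index; the list must have one label per row.
animals.index = ['Dog', 'Duck', 'Frog']

# Look up a single cell by row label and column label.
duck_legs = animals.loc['Duck', 'Legs']

# rename returns a new DataFrame and leaves the original unchanged.
renamed = animals.rename(columns={'Fur': 'HasFur'})

print(duck_legs)
print(list(renamed.columns))
```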


# Indexing and Slicing

Array-style indexing in Pandas uses the **iloc** indexer (there are other indexers, such as **loc**; see the docs).

Using the **iloc** indexer, we can index the underlying array as if it were a plain array, using integer row and column positions (hence the i in iloc). The DataFrame index and column labels are maintained in the result:

In [37]:
df

Unnamed: 0,Legs,Wings,Fur,Feathers,Mammal
Dog,4,no,yes,no,True
Duck,2,yes,no,yes,False
Frog,4,no,no,no,False
Bat,4,yes,yes,no,True
Bar Stool,3,no,no,no,False


The `iloc` indexer works like Python slicing with `[start:stop:step]`, where `stop` itself is excluded, but it can index in multiple dimensions.


In [38]:
df.iloc[:2,1:4]

Unnamed: 0,Wings,Fur,Feathers
Dog,no,yes,no
Duck,yes,no,yes
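The slicing rules are easiest to see side by side. This sketch, on a hand-built table that is an assumption for illustration, contrasts `iloc` (positions, end excluded) with `loc` (labels, both endpoints included).

```python
import pandas as pd

animals = pd.DataFrame(
    {
        'Legs': [4, 2, 4, 4],
        'Wings': ['no', 'yes', 'no', 'yes'],
        'Fur': ['yes', 'no', 'no', 'yes'],
    },
    index=['Dog', 'Duck', 'Frog', 'Bat'],
)

by_position = animals.iloc[0:2, 1:3]                 # rows 0-1 and columns 1-2
by_label = animals.loc['Dog':'Duck', 'Wings':'Fur']  # label slices include both ends
every_other = animals.iloc[::2]                      # step of 2 selects Dog and Frog
last_row = animals.iloc[-1]                          # negative positions count from the end

print(by_position)
print(by_position.equals(by_label))
```

Both selections pick the same four cells; only the way they are addressed differs.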


# Converting Categorical Data to Numerical Data

**The machine learning algorithms in sklearn only operate on numerical data**.  That means any data that is categorical has to be converted to numerical data.  **This is only true for the independent variables**.  The target variable can be categorical or numeric.



We accomplish the conversion via **dummy variables** or, more formally, **indicator variables**.

Pandas supports the **get_dummies** function that converts categorical variables into dummy/indicator variables.

Each variable is converted into as many 0/1 variables as it has distinct values. Columns in the output are each named after a value; if the input is a DataFrame, the name of the original variable is prepended to the value.

**IMPORTANT**: Simply converting labels into integers only works for ordinal categorical values, where the labels have a natural order. Doing this simple conversion for nominal categorical values will **introduce unwanted/implicit biases** into the data, because the integers suggest an order and a distance between categories that do not exist.
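A minimal sketch of the two kinds of categorical data, using a hypothetical `shirts` table that is not part of the course assets. An ordered variable can be mapped to integers that respect its order, while an unordered variable gets one indicator column per value. The `dtype=int` argument is passed so that the indicators print as 0/1 regardless of the Pandas version.

```python
import pandas as pd

shirts = pd.DataFrame({
    'size':  ['small', 'large', 'medium'],   # ordinal: small < medium < large
    'color': ['red', 'blue', 'green'],       # nominal: no natural order
})

# Ordinal: map each label to an integer that respects the order.
size_order = {'small': 0, 'medium': 1, 'large': 2}
shirts['size'] = shirts['size'].map(size_order)

# Nominal: one 0/1 column per value, so no order is implied.
encoded = pd.get_dummies(shirts, columns=['color'], dtype=int)

print(encoded)
```

Every row has exactly one `color_*` column set to 1, so no color is treated as larger or smaller than another.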

Let's try it using our mammal dataset.

In [39]:
df

Unnamed: 0,Legs,Wings,Fur,Feathers,Mammal
Dog,4,no,yes,no,True
Duck,2,yes,no,yes,False
Frog,4,no,no,no,False
Bat,4,yes,yes,no,True
Bar Stool,3,no,no,no,False


In [40]:
df_dummies1 = pd.get_dummies(df)
df_dummies1

Unnamed: 0,Legs,Mammal,Wings_no,Wings_yes,Fur_no,Fur_yes,Feathers_no,Feathers_yes
Dog,4,True,1,0,0,1,1,0
Duck,2,False,0,1,1,0,0,1
Frog,4,False,1,0,1,0,1,0
Bat,4,True,0,1,0,1,1,0
Bar Stool,3,False,1,0,1,0,1,0


By default, boolean values are not converted into dummy variables.
If we need to convert these to numerical values as well, we can force Pandas to do so by naming the columns explicitly.




In [41]:
df_dummies2 = pd.get_dummies(df_dummies1,columns=['Mammal'])
df_dummies2

Unnamed: 0,Legs,Wings_no,Wings_yes,Fur_no,Fur_yes,Feathers_no,Feathers_yes,Mammal_False,Mammal_True
Dog,4,1,0,0,1,1,0,0,1
Duck,2,0,1,1,0,0,1,1,0
Frog,4,1,0,1,0,1,0,1,0
Bat,4,0,1,0,1,1,0,0,1
Bar Stool,3,1,0,1,0,1,0,1,0
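For a boolean column there is also a lighter alternative to dummy variables: casting it to integers, which yields a single 0/1 column instead of two. The sketch below compares both on a tiny hand-built table (an assumption for illustration).

```python
import pandas as pd

flags = pd.DataFrame({'Legs': [4, 2], 'Mammal': [True, False]})

# Cast True/False to 1/0: one numeric column.
as_int = flags.assign(Mammal=flags['Mammal'].astype(int))

# Dummy variables: one 0/1 column per boolean value.
as_dummies = pd.get_dummies(flags, columns=['Mammal'], dtype=int)

print(as_int)
print(as_dummies)
```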


Let's try this on our **tennis dataset** and see if we can modify the data in such a way that we can build a decision tree.



In [42]:
tennis_df = pd.read_csv(home+"tennis.csv")
tennis_df.head()

Unnamed: 0,outlook,temp,humidity,windy,play
0,sunny,hot,high,False,no
1,sunny,hot,high,True,no
2,overcast,hot,high,False,yes
3,rainy,mild,high,False,yes
4,rainy,cool,normal,False,yes


Let's try to build a decision tree on this.

In [43]:
from sklearn import tree
from treeviz import tree_print

features_df = tennis_df.drop(columns=['play'])
target_df = pd.DataFrame(tennis_df[['play']])

dtree = tree.DecisionTreeClassifier(criterion='entropy')
try:
  dtree.fit(features_df,target_df)
except Exception as e:
  print(e)


could not convert string to float: 'sunny'


Notice that the tree algorithm complains that it cannot convert the categorical label 'sunny' into a number for training purposes.
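One way to find the offending columns before fitting is to ask Pandas which features are neither numeric nor boolean. This is only a sketch on a made-up `weather` table; the column names, including `temp_f`, are assumptions and do not match the tennis data exactly.

```python
import pandas as pd

weather = pd.DataFrame({
    'outlook': ['sunny', 'rainy'],
    'temp_f': [85, 65],
    'windy': [False, True],
    'play': ['no', 'yes'],
})

features = weather.drop(columns=['play'])

# Columns that are neither numeric nor boolean still need to be encoded.
needs_encoding = features.select_dtypes(exclude=['number', 'bool']).columns.tolist()

print(needs_encoding)
```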

→ We need to introduce dummy variables. But we don't want to convert our target variable 'play', so we explicitly state which columns to convert.

In [44]:
tennis_dummies_df = pd.get_dummies(tennis_df, columns=['outlook','temp','humidity','windy'])
tennis_dummies_df.head()

Unnamed: 0,play,outlook_overcast,outlook_rainy,outlook_sunny,temp_cool,temp_hot,temp_mild,humidity_high,humidity_normal,windy_False,windy_True
0,no,0,0,1,0,1,0,1,0,1,0
1,no,0,0,1,0,1,0,1,0,0,1
2,yes,1,0,0,0,1,0,1,0,1,0
3,yes,0,1,0,0,0,1,1,0,1,0
4,yes,0,1,0,1,0,0,0,1,1,0


Let's try to build a decision tree on this now that it is in numeric shape suitable for sklearn.

In [45]:
from sklearn import tree
from treeviz import tree_print

features_df = tennis_dummies_df.drop(columns=['play'])
target_df = pd.DataFrame(tennis_dummies_df[['play']])

dtree = tree.DecisionTreeClassifier(criterion='entropy')
dtree.fit(features_df,target_df)
tree_print(dtree,features_df)

if outlook_overcast =< 0.5: 
  |then if humidity_high =< 0.5: 
  |  |then if windy_True =< 0.5: 
  |  |  |then yes
  |  |  |else if temp_mild =< 0.5: 
  |  |  |  |then no
  |  |  |  |else yes
  |  |else if outlook_rainy =< 0.5: 
  |  |  |then no
  |  |  |else if windy_True =< 0.5: 
  |  |  |  |then yes
  |  |  |  |else no
  |else yes
<---------->
Tree Depth:  4


The tree looks a bit different because we are splitting on 0/1.  But we can see that the outlook variable is still the most predictive variable. Note: `something =< 0.5` (the tree printer's notation for "less than or equal to") means `something == 0`, since the only values are 0 and 1.
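The equivalence between a threshold test on a dummy variable and a test on the original label can be checked directly. A small sketch on a hand-made `outlook` column (an assumption, not the full tennis data); note that Python writes the comparison as `<=`.

```python
import pandas as pd

outlook = pd.Series(['sunny', 'overcast', 'rainy', 'overcast'], name='outlook')
dummies = pd.get_dummies(outlook, prefix='outlook', dtype=int)

left_branch = dummies['outlook_overcast'] <= 0.5   # the test the tree applies
not_overcast = outlook != 'overcast'               # the same test on the raw labels

print(left_branch.tolist())
print(not_overcast.tolist())
```

Both lists agree row by row, so the root split of the tree is simply asking whether the outlook is overcast.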

# Data Access Patterns

We can use lists of column names, relational expressions, and boolean expressions when selecting data from a dataframe.


In [46]:
df = pd.read_csv(home+"mammals.csv")
df

Unnamed: 0,Legs,Wings,Fur,Feathers,Mammal
0,4,no,yes,no,True
1,2,yes,no,yes,False
2,4,no,no,no,False
3,4,yes,yes,no,True
4,3,no,no,no,False


In [47]:
# using a list of column names to access columns
# (named cols so that we do not shadow Python's built-in list)
cols = ['Wings','Mammal']
df[cols]

Unnamed: 0,Wings,Mammal
0,no,True
1,yes,False
2,no,False
3,yes,True
4,no,False


Using a relational expression to access data.

In [48]:
# accessing rows for which an equality holds
df[df.Wings == 'yes']

Unnamed: 0,Legs,Wings,Fur,Feathers,Mammal
1,2,yes,no,yes,False
3,4,yes,yes,no,True


In [49]:
# combining relational operations with boolean operators
df[(df.Wings == 'yes') & (df.Fur == 'yes')]

Unnamed: 0,Legs,Wings,Fur,Feathers,Mammal
3,4,yes,yes,no,True
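The same pattern extends to the other boolean operators. The sketch below, on a hand-built copy of the mammals columns (an assumption for self-containment), shows `|` for or, `~` for not, `isin` for membership, and `loc` for combining a row mask with a column selection. Each comparison needs its own parentheses because `&`, `|`, and `~` bind more tightly than `==`.

```python
import pandas as pd

animals = pd.DataFrame({
    'Legs': [4, 2, 4, 4, 3],
    'Wings': ['no', 'yes', 'no', 'yes', 'no'],
    'Fur': ['yes', 'no', 'no', 'yes', 'no'],
})

either = animals[(animals.Wings == 'yes') | (animals.Fur == 'yes')]  # or
no_wings = animals[~(animals.Wings == 'yes')]                         # not
few_legs = animals[animals.Legs.isin([2, 3])]                         # membership test
four_legged_fur = animals.loc[animals.Legs == 4, 'Fur']               # row mask plus one column

print(either)
print(four_legged_fur)
```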


# Missing or Duplicated Data
* Pandas flags missing values with NaN ("not a number").
* By default, most computations on a dataframe (such as `sum` and `mean`) skip NaNs.
* However, it is still a good idea to clean up the dataframe.
* There are sophisticated procedures for dealing with missing data; here we limit ourselves to **dropping the rows or columns that contain NaNs**.
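A short sketch of how NaNs behave in computations, using a made-up `scores` Series. By default the reductions skip missing values; passing `skipna=False` makes the NaN propagate instead.

```python
import numpy as np
import pandas as pd

scores = pd.Series([1.0, np.nan, 3.0])

mean_default = scores.mean()              # NaN is skipped: (1 + 3) / 2
mean_strict = scores.mean(skipna=False)   # NaN propagates into the result
n_present = scores.count()                # counts only the non-missing values

print(mean_default, mean_strict, n_present)
```

Because the default silently skips missing values, it is worth checking how many values actually went into a statistic.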


In [50]:
df_missing = pd.read_csv(home+"mammals-missing.csv")
df_missing

Unnamed: 0,Legs,Wings,Fur,Feathers,Mammal
0,4,no,yes,no,True
1,2,yes,no,yes,False
2,4,no,no,,False
3,4,,yes,no,True
4,3,no,no,no,False


In [51]:
# count the missing values in each column
df_missing.isnull().sum()

Legs        0
Wings       1
Fur         0
Feathers    1
Mammal      0
dtype: int64

In [52]:
# readable names for the axis argument used by dropna below
COLUMNS = 1
INDEX = 0

In [53]:
# drop rows that have NaNs
df_missing.dropna(how='any',axis=INDEX)

Unnamed: 0,Legs,Wings,Fur,Feathers,Mammal
0,4,no,yes,no,True
1,2,yes,no,yes,False
4,3,no,no,no,False


In [54]:
# dropping columns that have NaNs
df_missing.dropna(how='any',axis=COLUMNS)

Unnamed: 0,Legs,Fur,Mammal
0,4,yes,True
1,2,no,False
2,4,no,False
3,4,yes,True
4,3,no,False


**NOTE**: In most data sets we have more rows than columns, so **in most cases you want to delete rows rather than columns** in order to eliminate missing data.
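To make the trade-off concrete, the sketch below counts how many cells survive each strategy on a small hand-built table (an assumption, not the course file). It uses the string forms `axis='index'` and `axis='columns'`, which Pandas accepts in place of 0 and 1, and the `subset` parameter, which restricts the check to the columns that matter.

```python
import numpy as np
import pandas as pd

animals = pd.DataFrame({
    'Legs': [4, 2, 4, 4, 3],
    'Wings': ['no', 'yes', 'no', np.nan, 'no'],
    'Feathers': ['no', 'yes', np.nan, 'no', 'no'],
})

rows_kept = animals.dropna(axis='index')        # drop rows with any NaN (same as axis=0)
cols_kept = animals.dropna(axis='columns')      # drop columns with any NaN (same as axis=1)
wings_known = animals.dropna(subset=['Wings'])  # only require 'Wings' to be present

# Number of cells that survive each strategy.
print(rows_kept.size, cols_kept.size)
```

Here dropping rows keeps more of the data than dropping columns, and `subset` keeps even more rows when only some columns are needed.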

# Reading

* 3.1 [Pandas](https://jakevdp.github.io/PythonDataScienceHandbook/03.01-introducing-pandas-objects.html)
* 3.2 [Data Indexing and Selection](https://jakevdp.github.io/PythonDataScienceHandbook/03.02-data-indexing-and-selection.html)
* 3.3 [Operating on Data in Pandas](https://jakevdp.github.io/PythonDataScienceHandbook/03.03-operations-in-pd.html)
* 3.4 [Handling Missing Data](https://jakevdp.github.io/PythonDataScienceHandbook/03.04-missing-values.html)


# Project

See BrightSpace Assignment #2