In [1]:
###### Set Up #####
# verify our folder with the data and module assets is installed
# if it is installed make sure it is the latest
!test -e ds-assets && cd ds-assets && git pull && cd ..
# if it is not installed clone it
!test ! -e ds-assets && git clone https://github.com/lutzhamel/ds-assets.git
# point to the folder with the assets
home = "ds-assets/assets/"
import sys
sys.path.append(home)      # add home folder to module search path

Already up to date.


# Data Manipulation with Pandas

Pandas provides 1-D (Series) and 2-D (DataFrame) data structures; the 3-D Panel structure found in older versions has since been removed.  Here we cover DataFrames because they most closely resemble the data tables that data scientists typically work with.

The advantage of Pandas is that it stores the data together with its *metadata*.

The most often used metadata in Pandas are the **column names** and the **index**.
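As a small illustration of data stored together with its metadata, the sketch below builds a tiny dataframe by hand and prints its column names and index. The animal names and values are made up for this example and are not read from the course data.

```python
import pandas as pd

# hypothetical toy data, not taken from mammals.csv
toy_df = pd.DataFrame(
    {"Legs": [4, 2], "Fur": ["yes", "no"]},
    index=["Cat", "Hen"],
)

print(list(toy_df.columns))  # column names
print(list(toy_df.index))    # row labels
```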


In [2]:
import pandas as pd

In [3]:
mammal_df = pd.read_csv(home+"mammals.csv")

In [4]:
mammal_df

Unnamed: 0,Legs,Wings,Fur,Feathers,Mammal
0,4,no,yes,no,True
1,2,yes,no,yes,False
2,4,no,no,no,False
3,4,yes,yes,no,True
4,3,no,no,no,False


# DataFrame Parts

A dataframe is composed of different parts that work together to give a coherent view of the data: **columns** (variables), **index** (rows), and the **values**.

In [5]:
mammal_df.columns


Index(['Legs', 'Wings', 'Fur', 'Feathers', 'Mammal'], dtype='object')

In [6]:
mammal_df.index

RangeIndex(start=0, stop=5, step=1)

In [7]:
mammal_df.values

array([[4, 'no', 'yes', 'no', True],
       [2, 'yes', 'no', 'yes', False],
       [4, 'no', 'no', 'no', False],
       [4, 'yes', 'yes', 'no', True],
       [3, 'no', 'no', 'no', False]], dtype=object)

**Observation**: We say that the **columns** and the **index** constitute the **metadata** of the dataframe.  Only the **values** property of the dataframe holds the actual data.

We can change any and all parts of the dataframe (although you should **never** change the data of a dataframe via the **values** property).  For example, we can create a new index for our dataframe.

In [8]:
mammal_df.index = ['Dog', 'Duck', 'Frog', 'Bat', 'Bar Stool']

In [9]:
mammal_df

Unnamed: 0,Legs,Wings,Fur,Feathers,Mammal
Dog,4,no,yes,no,True
Duck,2,yes,no,yes,False
Frog,4,no,no,no,False
Bat,4,yes,yes,no,True
Bar Stool,3,no,no,no,False
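If an individual value needs to change, one option is a label-based indexer such as `.loc` rather than writing into `values`. The sketch below works on a small hand-made stand-in for the table (an assumption for illustration, so the course dataframe is left untouched).

```python
import pandas as pd

# hand-made stand-in for part of the mammal table
demo_df = pd.DataFrame(
    {"Legs": [4, 3], "Fur": ["yes", "no"]},
    index=["Dog", "Bar Stool"],
)

# change one cell by row label and column name
demo_df.loc["Bar Stool", "Legs"] = 4

print(demo_df.loc["Bar Stool", "Legs"])
```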


# Indexing and Slicing


We use a list of column names to select columns from a dataframe.


In [10]:
col_list = ['Wings','Mammal']
mammal_df[col_list]

Unnamed: 0,Wings,Mammal
Dog,no,True
Duck,yes,False
Frog,no,False
Bat,yes,True
Bar Stool,no,False



Pandas dataframes support array-style indexing through the **iloc** indexer (there are other indexers, see the docs).

Using the **iloc** indexer, we can index the underlying array as if it were a simple array, using integer row and column positions (hence the i in iloc). When both positions are slices, the indexer returns a dataframe.

In [11]:
mammal_df

Unnamed: 0,Legs,Wings,Fur,Feathers,Mammal
Dog,4,no,yes,no,True
Duck,2,yes,no,yes,False
Frog,4,no,no,no,False
Bat,4,yes,yes,no,True
Bar Stool,3,no,no,no,False


The `iloc` indexer works like Python slicing with `[start:stop:step]`, where the element at position `stop` is excluded, but it can index in multiple dimensions.


In [12]:
mammal_df.iloc[:2,1:4]

Unnamed: 0,Wings,Fur,Feathers
Dog,no,yes,no
Duck,yes,no,yes
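To make the slicing rule concrete, the sketch below rebuilds the table by hand (values copied from the output above) and shows that the `stop` position is excluded, and that a single integer on one axis yields a Series or a scalar rather than a dataframe.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Legs": [4, 2, 4, 4, 3],
        "Wings": ["no", "yes", "no", "yes", "no"],
        "Fur": ["yes", "no", "no", "yes", "no"],
        "Feathers": ["no", "yes", "no", "no", "no"],
        "Mammal": [True, False, False, True, False],
    },
    index=["Dog", "Duck", "Frog", "Bat", "Bar Stool"],
)

block = df.iloc[:2, 1:4]       # rows 0-1, columns 1-3 (stop is excluded)
every_other = df.iloc[::2, :]  # rows 0, 2, 4 using a step of 2
one_row = df.iloc[0]           # a single row comes back as a Series
one_cell = df.iloc[0, 0]       # a single cell comes back as a scalar

print(block.shape)
print(list(every_other.index))
print(type(one_row).__name__, one_cell)
```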


# Data-based Data Selection

We can use relational and boolean expressions when selecting data from a dataframe.


Using a relational expression to access data.

In [13]:
# accessing rows for which an equality holds
mammal_df[mammal_df.Wings == 'yes']

Unnamed: 0,Legs,Wings,Fur,Feathers,Mammal
Duck,2,yes,no,yes,False
Bat,4,yes,yes,no,True


In [14]:
# combining relational operations with boolean operators
mammal_df[(mammal_df.Wings == 'yes') & (mammal_df.Fur == 'yes')]

Unnamed: 0,Legs,Wings,Fur,Feathers,Mammal
Bat,4,yes,yes,no,True
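The expression inside the brackets is itself a boolean Series, one value per row, and Pandas keeps the rows where it is True. The sketch below, using a hand-made subset of the table, prints that mask and also shows the `|` (or) and `~` (not) operators. The parentheses around each comparison are needed because `&` and `|` bind more tightly than `==`.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Wings": ["no", "yes", "no", "yes", "no"],
        "Fur": ["yes", "no", "no", "yes", "no"],
    },
    index=["Dog", "Duck", "Frog", "Bat", "Bar Stool"],
)

mask = df.Wings == "yes"  # boolean Series, one entry per row
print(mask.tolist())

either = df[(df.Wings == "yes") | (df.Fur == "yes")]      # wings OR fur
neither = df[~((df.Wings == "yes") | (df.Fur == "yes"))]  # negation of the above

print(list(either.index))
print(list(neither.index))
```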


# Missing Data


In [15]:
# for the following we need the definitions
COLUMNS = 1
INDEX = 0



* Pandas flags missing values with NaN (not a number).
* In most cases, computations applied to a dataframe with NaNs will ignore the NaNs.
* However, it is still a good idea to clean up the dataframe.
* In general, there exist sophisticated procedures to deal with missing data; here we limit ourselves to **dropping the rows or columns that have NaNs**.
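As a small check of the second point, the sketch below builds a numeric Series with one missing entry (made-up numbers) and shows that `sum` and `mean` skip the NaN by default, while `skipna=False` lets it propagate.

```python
import numpy as np
import pandas as pd

legs = pd.Series([4, 2, np.nan, 4])

print(legs.sum())              # NaN is skipped
print(legs.mean())             # mean over the three present values
print(legs.sum(skipna=False))  # NaN propagates when skipping is turned off
```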


In [16]:
df_missing = pd.read_csv(home+"mammals-missing.csv")
df_missing.index = ['Dog', 'Duck', 'Frog', 'Bat', 'Bar Stool']
df_missing

Unnamed: 0,Legs,Wings,Fur,Feathers,Mammal
Dog,4,no,yes,no,True
Duck,2,yes,no,yes,False
Frog,4,no,no,,False
Bat,4,,yes,no,True
Bar Stool,3,no,no,no,False


**Observation**: Notice the NaN values in the dataframe indicating missing values.

We can use the **isnull** function to detect missing values in the dataframe.

In [17]:
df_missing.isnull()

Unnamed: 0,Legs,Wings,Fur,Feathers,Mammal
Dog,False,False,False,False,False
Duck,False,False,False,False,False
Frog,False,False,False,True,False
Bat,False,True,False,False,False
Bar Stool,False,False,False,False,False


**Observation**: For each missing value we find a **True** in the returned dataframe.

Rather than printing out the dataframe and then searching for the True values, we can use the **sum** function and the fact that Python treats True as 1 to quickly detect missing values.

In [18]:
df_missing.isnull().sum(axis=INDEX)

Legs        0
Wings       1
Fur         0
Feathers    1
Mammal      0
dtype: int64

In [19]:
df_missing.isnull().sum(axis=COLUMNS)

Dog          0
Duck         0
Frog         1
Bat          1
Bar Stool    0
dtype: int64

If we don't care where the missing values are and just want to find out whether there are any, we can first sum over the dataframe (which defaults to an INDEX sum) and then sum over the resulting vector (series).

In [20]:
df_missing.isnull().sum().sum()

2
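When only a yes/no answer is needed, chaining `any` twice is a possible alternative to the double sum. The sketch below compares both on a small made-up dataframe.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"Wings": ["no", np.nan, "yes"], "Fur": ["yes", "no", "no"]}
)

n_missing = df.isnull().sum().sum()    # total count of missing cells
has_missing = df.isnull().any().any()  # True if at least one cell is missing

print(n_missing, has_missing)
```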

In [21]:
# drop rows that have NaNs
df_missing.dropna(how='any',axis=INDEX)

Unnamed: 0,Legs,Wings,Fur,Feathers,Mammal
Dog,4,no,yes,no,True
Duck,2,yes,no,yes,False
Bar Stool,3,no,no,no,False


In [22]:
# dropping columns that have NaNs
df_missing.dropna(how='any',axis=COLUMNS)

Unnamed: 0,Legs,Fur,Mammal
Dog,4,yes,True
Duck,2,no,False
Frog,4,no,False
Bat,4,yes,True
Bar Stool,3,no,False


**NOTE**: In most data sets we have more rows than columns, so **in most cases you want to delete rows rather than columns** in order to eliminate missing data.
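One detail worth keeping in mind: `dropna` returns a new dataframe and leaves the original unchanged, so the result has to be assigned to a name to keep it. A minimal sketch with made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"Legs": [4, 2, 4], "Wings": ["no", "yes", np.nan]},
    index=["Dog", "Duck", "Bat"],
)

clean_df = df.dropna(how="any", axis=0)  # drop rows with any NaN

print(df.shape)        # original still has all rows
print(clean_df.shape)  # cleaned copy has one row fewer
```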

# Converting Categorical Data to Numerical Data




We accomplish the conversion via **dummy variables** or, more formally, **indicator variables**.

Pandas supports the **get_dummies** function that converts categorical variables in a dataframe into dummy/indicator variables.

Each variable is converted into as many 0/1 dummy/indicator variables as there are different values and the original variable is deleted from the dataset. Columns in the resulting dataframe are each named after a value. The resulting names consist of the original variable name and the value name.  Consider the variable **Fur** in the mammals dataset which has two values: **yes** and **no**.  The resulting indicator variable names are: **Fur_yes** and **Fur_no**.

**IMPORTANT**: Just converting labels into numerical values does not work unless we are dealing with ordinal categorical values. Doing this simple conversion for nominal categorical values will **introduce unwanted/implicit biases** into the data.
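To see the problem, consider a nominal variable such as a weather outlook. The sketch below (with a hypothetical three-row table) contrasts a naive integer coding, which imposes an order and distances between categories, with indicator variables, which do not. The particular integer mapping is arbitrary and only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"outlook": ["sunny", "overcast", "rainy"]})

# naive coding: implies overcast < rainy < sunny and that sunny is
# twice as far from overcast as rainy is
naive = df["outlook"].map({"overcast": 0, "rainy": 1, "sunny": 2})

# indicator coding: one 0/1 column per value, no implied order
dummies = pd.get_dummies(df, columns=["outlook"], dtype=int)

print(naive.tolist())
print(list(dummies.columns))
```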

Let's try it using our mammal dataset.

In [23]:
mammal_df

Unnamed: 0,Legs,Wings,Fur,Feathers,Mammal
Dog,4,no,yes,no,True
Duck,2,yes,no,yes,False
Frog,4,no,no,no,False
Bat,4,yes,yes,no,True
Bar Stool,3,no,no,no,False


In [24]:
df_dummies1 = pd.get_dummies(mammal_df)
df_dummies1

Unnamed: 0,Legs,Mammal,Wings_no,Wings_yes,Fur_no,Fur_yes,Feathers_no,Feathers_yes
Dog,4,True,1,0,0,1,1,0
Duck,2,False,0,1,1,0,0,1
Frog,4,False,1,0,1,0,1,0
Bat,4,True,0,1,0,1,1,0
Bar Stool,3,False,1,0,1,0,1,0


**Observation**:  Notice that the **Fur** variable has been converted to **Fur_yes** and **Fur_no**.

By default, boolean values are not converted into dummy variables.
If we really have to convert these to numerical values as well, we can force Pandas to do so.




In [25]:
df_dummies2 = pd.get_dummies(df_dummies1,columns=['Mammal'])
df_dummies2

Unnamed: 0,Legs,Wings_no,Wings_yes,Fur_no,Fur_yes,Feathers_no,Feathers_yes,Mammal_False,Mammal_True
Dog,4,1,0,0,1,1,0,0,1
Duck,2,0,1,1,0,0,1,1,0
Frog,4,1,0,1,0,1,0,1,0
Bat,4,0,1,0,1,1,0,0,1
Bar Stool,3,1,0,1,0,1,0,1,0
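Depending on the installed Pandas version, `get_dummies` may display the indicator columns as True/False rather than 1/0 (as far as we are aware, the default changed to boolean in Pandas 2.0). Passing `dtype=int` is one way to request integer indicators explicitly; a minimal sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame({"Fur": ["yes", "no", "yes"]})

dummies = pd.get_dummies(df, dtype=int)  # force 0/1 integers

print(dummies)
```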


# Sklearn Needs Numerical Data

**The machine learning algorithms in sklearn only operate on numerical data**.  That means any data that is categorical has to be converted to numerical data.  **This is only true for the independent variables**.  The target variable can be categorical or numeric.
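The last point can be checked with a small sketch: a decision tree classifier fitted on numeric features with a string-valued target. The four rows below are hypothetical and chosen only so that the target is perfectly separable; `random_state=0` is an added assumption to make the run repeatable.

```python
import pandas as pd
from sklearn import tree

# hypothetical toy data: numeric features, string-valued target
X = pd.DataFrame({"windy_strong": [0, 1, 0, 1], "humidity_high": [1, 1, 0, 0]})
y = pd.Series(["no", "no", "yes", "yes"], name="play")

model = tree.DecisionTreeClassifier(criterion="entropy", random_state=0)
model.fit(X, y)  # the string labels in y are accepted as-is

print(model.classes_)
print(model.predict(X))
```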


Let's try this on our **tennis dataset** and see if we can modify the data in such a way that we can build a decision tree.



In [26]:
tennis_df = pd.read_csv(home+"tennis.csv")
tennis_df.head()

Unnamed: 0,outlook,temp,humidity,windy,play
0,sunny,hot,high,weak,no
1,sunny,hot,high,strong,no
2,overcast,hot,high,weak,yes
3,rainy,mild,high,weak,yes
4,rainy,cool,normal,weak,yes


Let's try to build a decision tree on this.

In [27]:
from sklearn import tree
from treeviz import tree_print

features_df = tennis_df.drop(columns=['play'])
target_df = pd.DataFrame(tennis_df[['play']])

dtree = tree.DecisionTreeClassifier(criterion='entropy')
try:
  dtree.fit(features_df,target_df)
except Exception as e:
  print(e)


could not convert string to float: 'sunny'


Notice that the tree algorithm complains that it cannot convert the categorical label 'sunny' into a number for training purposes.

We need to introduce dummy variables, and we need to be explicit about which columns to convert.  For example, we don't want to convert the **play** column because that is our target variable.

In [28]:
tennis_dummies_df = pd.get_dummies(tennis_df, columns=['outlook','temp','humidity','windy'])
tennis_dummies_df.head()

Unnamed: 0,play,outlook_overcast,outlook_rainy,outlook_sunny,temp_cool,temp_hot,temp_mild,humidity_high,humidity_normal,windy_strong,windy_weak
0,no,0,0,1,0,1,0,1,0,0,1
1,no,0,0,1,0,1,0,1,0,1,0
2,yes,1,0,0,0,1,0,1,0,0,1
3,yes,0,1,0,0,0,1,1,0,0,1
4,yes,0,1,0,1,0,0,0,1,0,1


Let's try to build a decision tree on this now that it is in numeric shape suitable for sklearn.

In [29]:
from sklearn import tree
from treeviz import tree_print

features_df = tennis_dummies_df.drop(columns=['play'])
target_df = tennis_dummies_df[['play']]

dtree = tree.DecisionTreeClassifier(criterion='entropy')
dtree.fit(features_df,target_df)
tree_print(dtree,features_df)

if outlook_overcast =< 0.5: 
  |then if humidity_high =< 0.5: 
  |  |then if windy_strong =< 0.5: 
  |  |  |then yes
  |  |  |else if temp_cool =< 0.5: 
  |  |  |  |then yes
  |  |  |  |else no
  |  |else if outlook_rainy =< 0.5: 
  |  |  |then no
  |  |  |else if windy_weak =< 0.5: 
  |  |  |  |then no
  |  |  |  |else yes
  |else yes
<---------->
Tree Depth:  4


The tree looks a bit different because we are splitting on 0/1.  But we can see that the outlook variable is still the most predictive variable.

**Note**: (something =< 0.5) means (something == 0) since the values are only 1 and 0.
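This equivalence can be verified directly on any 0/1 column. The sketch below uses a made-up indicator Series and checks that both tests select the same rows.

```python
import pandas as pd

indicator = pd.Series([0, 1, 1, 0], name="outlook_overcast")

# for 0/1 data the two tests select exactly the same rows
same = ((indicator <= 0.5) == (indicator == 0)).all()
print(same)
```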

Let's see if this tree behaves as well as the tree built on the original categorical data.

In [30]:
from sklearn.metrics import accuracy_score
predict_df = pd.DataFrame(dtree.predict(features_df), columns=['play'])
print("The accuracy of our model is: {}%".format(accuracy_score(target_df, predict_df)*100))

The accuracy of our model is: 100.0%


**Observation**: Yup, still predicts all the rows correctly, just like the original tree.

# Reading

* 3.1 [Pandas](https://jakevdp.github.io/PythonDataScienceHandbook/03.01-introducing-pandas-objects.html)
* 3.2 [Data Indexing and Selection](https://jakevdp.github.io/PythonDataScienceHandbook/03.02-data-indexing-and-selection.html)
* 3.3 [Operating on Data in Pandas](https://jakevdp.github.io/PythonDataScienceHandbook/03.03-operations-in-pd.html)
* 3.4 [Handling Missing Data](https://jakevdp.github.io/PythonDataScienceHandbook/03.04-missing-values.html)


# Project

See BrightSpace Assignment #2