{
"metadata": {
"name": "",
"signature": "sha256:94a35a1cb8a7080e711ad0ee699aa3a6a4667dff06964e81f0a47944363bfa8b"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Python for Data Science"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[Joe McCarthy](http://interrelativity.com/joe), \n",
"*Director, Analytics & Data Science*, [Atigeo, LLC](http://atigeo.com)"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from IPython.display import display, Image, HTML"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 1
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Navigation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notebooks in this primer:\n",
"\n",
"* [1. Introduction](1_Introduction.ipynb)\n",
"* [2. Data Science: Basic Concepts](2_Data_Science_Basic_Concepts.ipynb)\n",
"* [3. Python: Basic Concepts](3_Python_Basic_Concepts.ipynb)\n",
"* **4. Using Python to Build and Use a Simple Decision Tree Classifier** (*you are here*)\n",
"* [5. Next Steps](5_Next_Steps.ipynb)"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# reconstitute relevent elements from the IPython environment active in previous notebook session\n",
"from collections import defaultdict\n",
"import simple_ml\n",
"clean_instances = simple_ml.load_instances('agaricus-lepiota.data', filter_missing_values=True)\n",
"attribute_names = simple_ml.load_attribute_names('agaricus-lepiota.attributes')\n",
"attribute_names_and_values = simple_ml.load_attribute_names_and_values('agaricus-lepiota.attributes')"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 2
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"4. Using Python to Build and Use a Simple Decision Tree Classifier"
]
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Decision Trees"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Wikipedia offers the following description of a [decision tree](https://en.wikipedia.org/wiki/Decision_tree) (with italics added to emphasize terms that will be elaborated below):\n",
"\n",
"> A decision tree is a flowchart-like structure in which each *internal node* represents a *test* of an *attribute*, each branch represents an *outcome* of that test and each *leaf node* represents *class label* (a decision taken after testing all attributes in the path from the root to the leaf). Each path from the root to a leaf can also be represented as a classification rule.\n",
"\n",
"The image below depicts a decision tree created from the UCI mushroom dataset that appears on [Andy G's blog post about Decision Tree Learning](http://gieseanw.wordpress.com/2012/03/03/decision-tree-learning/), where \n",
"\n",
"* a white box represents an *internal node* (and the label represents the *attribute* being tested)\n",
"* a blue box represents an attribute value (an *outcome* of the *test* of that attribute)\n",
"* a green box represents a *leaf node* with a *class label* of *edible*\n",
"* a red box represents a *leaf node* with a *class label* of *poisonous*\n",
"\n",
"<img src=\"http://gieseanw.files.wordpress.com/2012/03/mushroomdecisiontree.png\" style=\"width: 800px;\" />\n",
"\n",
"It is important to note that the UCI mushroom dataset consists entirely of [categorical variables](https://en.wikipedia.org/wiki/Categorical_variable), i.e., every variable (or *attribute*) has an enumerated set of possible values. Many datasets include numeric variables that can take on int or float values. Tests for such variables typically use comparison operators, e.g., $age < 65$ or $36,250 < adjusted\\_gross\\_income <= 87,850$. *[Aside: Python supports boolean expressions containing multiple comparison operators, such as the expression comparing adjusted_gross_income in the preceding example.]*\n",
"\n",
"Our simple decision tree will only accommodate categorical variables. We will closely follow a version of the [decision tree learning algorithm implementation](http://www.onlamp.com/pub/a/python/2006/02/09/ai_decision_trees.html?page=3) offered by Chris Roach.\n",
"\n",
"Our goal in the following sections is to use Python to\n",
"\n",
"* *create* a simple decision tree based on a set of training instances\n",
"* *classify* (predict class labels for) for an instance using a simple decision tree\n",
"* *evaluate* the performance of the simple decision tree on classifying a set of test instances\n",
"\n",
"First, we will explore some concepts and algorithms used in building and using decision trees."
]
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Entropy"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When building a supervised classification model, the frequency distribution of attribute values is a potentially important factor in determining the relative importance of each attribute at various stages in the model building process.\n",
"\n",
"In data modeling, we can use frequency distributions to compute ***entropy***, a measure of disorder (impurity) in a set.\n",
"\n",
"We compute the entropy of multiplying the proportion of instances with each class label by the log of that proportion, and then taking the negative sum of those terms.\n",
"\n",
"More precisely, for a 2-class (binary) classification task:\n",
"\n",
"$entropy(S) = - p_1 log_2 (p_1) - p_2 log_2 (p_2)$\n",
"\n",
"where $p_i$ is proportion (relative frequency) of class *i* within the set *S*.\n",
"\n",
"From the output above, we know that the proportion of `clean_instances` that are labeled `'e'` (class `edible`) in the UCI dataset is $3488 \\div 5644 = 0.618$, and the proportion labeled `'p'` (class `poisonous`) is $2156 \\div 5644 = 0.382$.\n",
"\n",
"After importing the Python [`math`](http://docs.python.org/2/library/math.html) module, we can use the [`math.log(x[, base])`](http://docs.python.org/2/library/math.html#math.log) function in computing the entropy of the `clean_instances` of the UCI mushroom data set as follows:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import math\n",
"entropy = - (3488 / 5644.0) * math.log(3488 / 5644.0, 2) - (2156 / 5644.0) * math.log(2156 / 5644.0, 2)\n",
"print entropy"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"0.959441337353\n"
]
}
],
"prompt_number": 3
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Exercise 6: define entropy()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Define a function, `entropy(instances)`, that computes the entropy of `instances`. You may assume the class label is in position 0; we will later see how to specify default parameter values in function definitions.\n",
"\n",
"[Note: the class label in many data files is the *last* rather than the *first* item on each line.]"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# your function definition here\n",
"\n",
"# delete 'simple_ml.' below to test your function\n",
"print simple_ml.entropy(clean_instances)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"0.959441337353\n"
]
}
],
"prompt_number": 4
},
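{
"cell_type": "markdown",
"metadata": {},
"source": [
"One possible implementation of `entropy(instances)` is sketched below (illustrative only; the `simple_ml` version may differ in its details). It counts the class labels with a `defaultdict` and then takes the negative sum of the $p_i log_2 (p_i)$ terms:\n",
"\n",
"    import math\n",
"    from collections import defaultdict\n",
"\n",
"    def entropy(instances):\n",
"        '''Returns the entropy of the distribution of class labels (position 0) in instances'''\n",
"        label_counts = defaultdict(int)\n",
"        for instance in instances:\n",
"            label_counts[instance[0]] += 1\n",
"        total = float(len(instances))\n",
"        return -sum((count / total) * math.log(count / total, 2)\n",
"                    for count in label_counts.values())"
]
},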
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Information Gain"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Informally, a decision tree is constructed using a recursive algorithm that \n",
"\n",
"* selects the *best* attribute \n",
"* splits the set into subsets based on the values of that attribute (each subset is composed of instances from the original set that have the same value for that attribute)\n",
"* repeats the process on each of these subsets until a stopping condition is met (e.g., a subset has no instances or has instances which all have the same class label)\n",
"\n",
"Entropy is a metric that can be used in selecting the best attribute for each split: the best attribute is the one resulting in the *largest decrease in entropy* for a set of instances. [Note: other metrics can be used for determining the best attribute]\n",
"\n",
"*Information gain* measures the decrease in entropy that results from splitting a set of instances based on an attribute.\n",
"\n",
"$IG(S, a) = entropy(S) - [p(s_1) \u00d7 entropy(s_1) + p(s_2) \u00d7 entropy(s_2) ... + p(s_n) \u00d7 entropy(s_n)]$\n",
"\n",
"Where $n$ is the number of distinct values of attribute $a$, and $s_i$ is the subset of $S$ where all instances have the $i$th value of $a$."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print 'Information gain for different attributes:\\n'\n",
"for i in range(1, len(attribute_names)):\n",
" print '{:5.3f} {:2} {}'.format(simple_ml.information_gain(clean_instances, i), i, attribute_names[i])"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Information gain for different attributes:\n",
"\n",
"0.017 1 cap-shape\n",
"0.005 2 cap-surface\n",
"0.195 3 cap-color\n",
"0.140 4 bruises?\n",
"0.860 5 odor\n",
"0.004 6 gill-attachment\n",
"0.058 7 gill-spacing\n",
"0.032 8 gill-size\n",
"0.213 9 gill-color\n",
"0.275 10 stalk-shape\n",
"0.097 11 stalk-root\n",
"0.425 12 stalk-surface-above-ring\n",
"0.409 13 stalk-surface-below-ring\n",
"0.306 14 stalk-color-above-ring\n",
"0.279 15 stalk-color-below-ring\n",
"0.000 16 veil-type"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n",
"0.002 17 veil-color\n",
"0.012 18 ring-number\n",
"0.463 19 ring-type\n",
"0.583 20 spore-print-color\n",
"0.110 21 population\n",
"0.101 22 habitat\n"
]
}
],
"prompt_number": 5
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can sort the attributes based in decreasing order of information gain."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print 'Information gain for different attributes:\\n'\n",
"sorted_information_gain_indexes = sorted([(simple_ml.information_gain(clean_instances, i), i) for i in range(1, len(attribute_names))], \n",
" reverse=True)\n",
"print sorted_information_gain_indexes, '\\n'\n",
"\n",
"for gain, i in sorted_information_gain_indexes:\n",
" print '{:5.3f} {:2} {}'.format(gain, i, attribute_names[i])"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Information gain for different attributes:\n",
"\n",
"[(0.8596704358849709, 5), (0.5828694793608379, 20), (0.46290566555455265, 19), (0.42456477093655975, 12), (0.40865780788318695, 13), (0.3062989793570199, 14), (0.27891994708759504, 15), (0.2750355212178639, 10), (0.2127971869976022, 9), (0.19495343617580085, 3), (0.1400386042032834, 4), (0.1097880400299237, 21), (0.10067585994181227, 22), (0.09733858997769329, 11), (0.05836192763098613, 7), (0.03242975884332899, 8), (0.01740692300090696, 1), (0.01205967443646827, 18), (0.004572013423856602, 2), (0.0044397141315495325, 6), (0.0019702590992403124, 17), (0.0, 16)]"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
" \n",
"\n",
"0.860 5 odor\n",
"0.583 20 spore-print-color\n",
"0.463 19 ring-type\n",
"0.425 12 stalk-surface-above-ring\n",
"0.409 13 stalk-surface-below-ring\n",
"0.306 14 stalk-color-above-ring\n",
"0.279 15 stalk-color-below-ring\n",
"0.275 10 stalk-shape\n",
"0.213 9 gill-color\n",
"0.195 3 cap-color\n",
"0.140 4 bruises?\n",
"0.110 21 population\n",
"0.101 22 habitat\n",
"0.097 11 stalk-root\n",
"0.058 7 gill-spacing\n",
"0.032 8 gill-size\n",
"0.017 1 cap-shape\n",
"0.012 18 ring-number\n",
"0.005 2 cap-surface\n",
"0.004 6 gill-attachment\n",
"0.002 17 veil-color\n",
"0.000 16 veil-type\n"
]
}
],
"prompt_number": 6
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following variation does not use a list comprehension:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print 'Information gain for different attributes:\\n'\n",
"\n",
"information_gain_values = []\n",
"for i in range(1, len(attribute_names)):\n",
" information_gain_values.append((simple_ml.information_gain(clean_instances, i), i))\n",
" \n",
"sorted_information_gain_indexes = sorted(information_gain_values, \n",
" reverse=True)\n",
"print sorted_information_gain_indexes, '\\n'\n",
"\n",
"for gain, i in sorted_information_gain_indexes:\n",
" print '{:5.3f} {:2} {}'.format(gain, i, attribute_names[i])"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Information gain for different attributes:\n",
"\n",
"[(0.8596704358849709, 5), (0.5828694793608379, 20), (0.46290566555455265, 19), (0.42456477093655975, 12), (0.40865780788318695, 13), (0.3062989793570199, 14), (0.27891994708759504, 15), (0.2750355212178639, 10), (0.2127971869976022, 9), (0.19495343617580085, 3), (0.1400386042032834, 4), (0.1097880400299237, 21), (0.10067585994181227, 22), (0.09733858997769329, 11), (0.05836192763098613, 7), (0.03242975884332899, 8), (0.01740692300090696, 1), (0.01205967443646827, 18), (0.004572013423856602, 2), (0.0044397141315495325, 6), (0.0019702590992403124, 17), (0.0, 16)]"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
" \n",
"\n",
"0.860 5 odor\n",
"0.583 20 spore-print-color\n",
"0.463 19 ring-type\n",
"0.425 12 stalk-surface-above-ring\n",
"0.409 13 stalk-surface-below-ring\n",
"0.306 14 stalk-color-above-ring\n",
"0.279 15 stalk-color-below-ring\n",
"0.275 10 stalk-shape\n",
"0.213 9 gill-color\n",
"0.195 3 cap-color\n",
"0.140 4 bruises?\n",
"0.110 21 population\n",
"0.101 22 habitat\n",
"0.097 11 stalk-root\n",
"0.058 7 gill-spacing\n",
"0.032 8 gill-size\n",
"0.017 1 cap-shape\n",
"0.012 18 ring-number\n",
"0.005 2 cap-surface\n",
"0.004 6 gill-attachment\n",
"0.002 17 veil-color\n",
"0.000 16 veil-type\n"
]
}
],
"prompt_number": 7
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Exercise 7: define information_gain()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Define a function, `information_gain(instances, i)`, that returns the information gain achieved by selecting the `i`th attribute to split `instances`. It should exhibit the same behavior as the `simple_ml` version of the function."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# your definition of information_gain(instances, i) here\n",
"\n",
"# delete 'simple_ml.' below to test your function\n",
"sorted_information_gain_indexes = sorted([(simple_ml.information_gain(clean_instances, i), i) for i in range(1, len(attribute_names))], \n",
" reverse=True)\n",
"\n",
"print 'Information gain for different attributes:\\n'\n",
"for gain, i in sorted_information_gain_indexes:\n",
" print '{:5.3f} {:2} {}'.format(gain, i, attribute_names[i])"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Information gain for different attributes:\n",
"\n",
"0.860 5 odor\n",
"0.583 20 spore-print-color\n",
"0.463 19 ring-type\n",
"0.425 12 stalk-surface-above-ring\n",
"0.409 13 stalk-surface-below-ring\n",
"0.306 14 stalk-color-above-ring\n",
"0.279 15 stalk-color-below-ring\n",
"0.275 10 stalk-shape\n",
"0.213 9 gill-color\n",
"0.195 3 cap-color\n",
"0.140 4 bruises?\n",
"0.110 21 population\n",
"0.101 22 habitat\n",
"0.097 11 stalk-root\n",
"0.058 7 gill-spacing\n",
"0.032 8 gill-size\n",
"0.017 1 cap-shape\n",
"0.012 18 ring-number\n",
"0.005 2 cap-surface\n",
"0.004 6 gill-attachment\n",
"0.002 17 veil-color\n",
"0.000 16 veil-type\n"
]
}
],
"prompt_number": 8
},
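{
"cell_type": "markdown",
"metadata": {},
"source": [
"One possible implementation of `information_gain(instances, i)` is sketched below (illustrative only, and assuming the `entropy()` function from Exercise 6; the `simple_ml` version may differ in its details). It groups the instances by their values of attribute `i`, then subtracts the weighted entropies of the subsets from the entropy of the full set:\n",
"\n",
"    from collections import defaultdict\n",
"\n",
"    def information_gain(instances, i):\n",
"        '''Returns the information gain achieved by splitting instances on the attribute at index i'''\n",
"        subsets = defaultdict(list)\n",
"        for instance in instances:\n",
"            subsets[instance[i]].append(instance)\n",
"        total = float(len(instances))\n",
"        remainder = sum((len(subset) / total) * entropy(subset)\n",
"                        for subset in subsets.values())\n",
"        return entropy(instances) - remainder"
]
},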
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Building a Simple Decision Tree"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will implement a modified version of the [ID3](https://en.wikipedia.org/wiki/ID3_algorithm) algorithm for building a simple decision tree.\n",
"\n",
" ID3 (Examples, Target_Attribute, Attributes)\n",
" Create a root node for the tree\n",
" If all examples are positive, Return the single-node tree Root, with label = +.\n",
" If all examples are negative, Return the single-node tree Root, with label = -.\n",
" If number of predicting attributes is empty, then Return the single node tree Root,\n",
" with label = most common value of the target attribute in the examples.\n",
" Otherwise Begin\n",
" A \u2190 The Attribute that best classifies examples.\n",
" Decision Tree attribute for Root = A.\n",
" For each possible value, v_i, of A,\n",
" Add a new tree branch below Root, corresponding to the test A = v_i.\n",
" Let Examples(v_i) be the subset of examples that have the value v_i for A\n",
" If Examples(v_i) is empty\n",
" Then below this new branch add a leaf node with label = most common target value in the examples\n",
" Else below this new branch add the subtree ID3 (Examples(v_i), Target_Attribute, Attributes \u2013 {A})\n",
" End\n",
" Return Root"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In building a decision tree, we will need to split the instances based on the index of the *best* attribute, i.e., the attribute that offers the *highest information gain*. We will use separate utility functions to handle these subtasks. To simplify the functions, we will rely exclusively on attribute indexes rather than attribute names.\n",
"\n",
"***Note:*** the algorithm above is *recursive*, i.e., the there is a recursive call to `ID3` within the definition of `ID3`. Covering recursion is beyond the scope of this primer, but there are a number of other resources on [using recursion in Python](https://www.google.com/search?q=python+recursion). Familiarity with recursion will be important for understanding both the tree construction and classification functions below.\n",
"\n",
"First, we will define a function to split a set of instances based on any attribute. This function will return a dictionary where the *key* of each dictionary is a distinct value of the specified `attribute_index`, and the *value* of each dictionary is a list representing the subset of `instances` that have that attribute value."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def split_instances(instances, attribute_index):\n",
" '''Returns a list of dictionaries, splitting a list of instances according to their values of a specified attribute''\n",
" \n",
" The key of each dictionary is a distinct value of attribute_index,\n",
" and the value of each dictionary is a list representing the subset of instances that have that value for the attribute'''\n",
" partitions = defaultdict(list)\n",
" for instance in instances:\n",
" partitions[instance[attribute_index]].append(instance)\n",
" return partitions\n",
"\n",
"partitions = split_instances(clean_instances, 5)\n",
"print [(partition, len(partitions[partition])) for partition in partitions]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"[('a', 400), ('c', 192), ('f', 1584), ('m', 36), ('l', 400), ('n', 2776), ('p', 256)]\n"
]
}
],
"prompt_number": 9
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we can split instances based on a particular attribute, we would like to be able to choose the *best* attribute with which to split the instances, where *best* is defined as the attribute that provides the greatest information gain if instances were split based on that attribute. We will want to restrict the candidate attributes so that we don't bother trying to split on an attribute that was used higher up in the decision tree (or use the target attribute as a candidate)."
]
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Exercise 8: define choose_best_attribute_index()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Define a function, `choose_best_attribute_index(instances, candidate_attribute_indexes)`, that returns the index in the list of `candidate_attribute_indexes` that provides the highest information gain if `instances` are split based on that attribute index."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# your function here\n",
"\n",
"# delete 'simple_ml.' below to test your function:\n",
"print 'Best attribute index:', simple_ml.choose_best_attribute_index(clean_instances, range(1, len(attribute_names)))"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Best attribute index: "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"5\n"
]
}
],
"prompt_number": 10
},
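{
"cell_type": "markdown",
"metadata": {},
"source": [
"One compact way to define the function is sketched below (illustrative only, and assuming the `information_gain()` function from Exercise 7; note that the `simple_ml` version also accepts an optional `class_index` parameter). The built-in `max()` function accepts a `key` parameter specifying the function used to compare candidates:\n",
"\n",
"    def choose_best_attribute_index(instances, candidate_attribute_indexes):\n",
"        '''Returns the index in candidate_attribute_indexes providing the highest information gain'''\n",
"        return max(candidate_attribute_indexes,\n",
"                   key=lambda i: information_gain(instances, i))"
]
},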
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A leaf node in a decision tree represents the most frequently occurring - or majority - class value for that path through the tree. We will need a function that determines the majority value for the class index among a set of instances.\n",
"\n",
"We earlier saw how the [`defaultdict`](http://docs.python.org/2/library/collections.html#collections.defaultdict) container in the [`collections`](http://docs.python.org/2/library/collections.html) module can be used to simplify the construction of a dictionary containing the counts of all attribute values for all attributes, by automatically setting the count for any attribute value to zero when the attribute value is first added to the dictionary.\n",
"\n",
"The `collections` module has another useful container, a [`Counter`](http://docs.python.org/2/library/collections.html#collections.Counter) class, that can further simplify the construction of a specialized dictionary of counts. When a `Counter` object is instantiated with a list of items, it returns a dictionary-like container in which the *keys* are the unique items in the list, and the *values* are the counts of each unique item in that list. \n",
"\n",
"This container has an additional method, [`most_common([n])`](http://docs.python.org/2/library/collections.html#collections.Counter.most_common), which returns a list of 2-element tuples representing the values and their associated counts for the most common `n` values; if `n` is omitted, the method returns all tuples.\n",
"\n",
"The following is an example of how we can use a `Counter` to represent the frequency of different class labels, and how we can identify the most frequent value and its count."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from collections import Counter\n",
"\n",
"class_counts = Counter([instance[0] for instance in clean_instances])\n",
"print 'class_counts: {}; most_common(1): {}, most_common(1)[0][0]: {}'.format(\n",
" class_counts, # the Counter object\n",
" class_counts.most_common(1), # returns a list in which the 1st element is a tuple with the most common value and its count\n",
" class_counts.most_common(1)[0][0]) # the most common value (1st element in that tuple)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"class_counts: Counter({'e': 3488, 'p': 2156}); most_common(1): [('e', 3488)], most_common(1)[0][0]: e\n"
]
}
],
"prompt_number": 11
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following variation does not use a list comprehension:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"class_values = []\n",
"for instance in clean_instances:\n",
" class_values.append(instance[0])\n",
" \n",
"class_counts = Counter(class_values)\n",
"print 'class_counts: {}; most_common(1): {}, most_common(1)[0][0]: {}'.format(\n",
" class_counts, # the Counter object\n",
" class_counts.most_common(1), # returns a list in which the 1st element is a tuple with the most common value and its count\n",
" class_counts.most_common(1)[0][0]) # the most common value (1st element in that tuple)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"class_counts: Counter({'e': 3488, 'p': 2156}); most_common(1): [('e', 3488)], most_common(1)[0][0]: e\n"
]
}
],
"prompt_number": 12
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before putting all this together to define a decision tree construction function, it may be helpful to cover a few additional aspects of Python the function will utilize."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Python offers a very flexible mechanism for the [testing of truth values](http://python.net/~goodger/projects/pycon/2007/idiomatic/handout.html#testing-for-truth-values): in an **if** condition, any null object, zero-valued numerical expression or empty container (string, list, dictionary or tuple) is interpreted as *False* (i.e., *not True*):"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"for x in [False, None, 0, 0.0, \"\", [], {}, ()]:\n",
" print '\"{}\" is'.format(x),\n",
" if x:\n",
" print True\n",
" else:\n",
" print False"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\"False\" is False\n",
"\"None\" is False\n",
"\"0\" is False\n",
"\"0.0\" is False\n",
"\"\" is False\n",
"\"[]\" is False\n",
"\"{}\" is False\n",
"\"()\" is False\n"
]
}
],
"prompt_number": 13
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Python also offers a [conditional expression (ternary operator)](http://docs.python.org/2/reference/expressions.html#conditional-expressions) that allows the functionality of an if/else statement that returns a value to be implemented as an expression. For example, the if/else statement in the code above could be implemented as a conditional expression as follows:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"for x in [False, None, 0, 0.0, \"\", [], {}, ()]:\n",
" print '\"{}\" is {}'.format(x, True if x else False) # using conditional expression as second argument to format()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\"False\" is False\n",
"\"None\" is False\n",
"\"0\" is False\n",
"\"0.0\" is False\n",
"\"\" is False\n",
"\"[]\" is False\n",
"\"{}\" is False\n",
"\"()\" is False\n"
]
}
],
"prompt_number": 14
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Python function definitions can specify [default parameter values](http://docs.python.org/2/tutorial/controlflow.html#default-argument-values) indicating the value those parameters will have if no argument is explicitly provided when the function is called. Arguments can also be passed using [keyword parameters](http://docs.python.org/2/tutorial/controlflow.html#keyword-arguments) indicting which parameter will be assigned a specific argument value (which may or may not correspond to the order in which the parameters are defined).\n",
"\n",
"The [Python Tutorial page on default parameters](http://docs.python.org/2/tutorial/controlflow.html#default-argument-values) includes the following warning:\n",
"\n",
"> Important warning: The default value is evaluated only once. This makes a difference when the default is a mutable object such as a list, dictionary, or instances of most classes. \n",
"\n",
"Thus it is generally better to use the Python null object, `None`, rather than an empty `list` (`[]`), `dict` (`{}`) or other mutable data structure when specifying default parameter values for any of those data types."
]
},
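{
"cell_type": "markdown",
"metadata": {},
"source": [
"The hypothetical `append_to()` function below illustrates the warning above: the default list is created once, when the function is defined, so state accumulates across calls. Using `None` as the default avoids the problem:\n",
"\n",
"    def append_to(item, target=[]): # anti-pattern: the default list is shared across calls\n",
"        target.append(item)\n",
"        return target\n",
"\n",
"    print append_to(1) # [1]\n",
"    print append_to(2) # [1, 2], as the same default list has accumulated state\n",
"\n",
"    def append_to_safe(item, target=None):\n",
"        if target is None:\n",
"            target = [] # a new list is created on every call\n",
"        target.append(item)\n",
"        return target\n",
"\n",
"    print append_to_safe(1) # [1]\n",
"    print append_to_safe(2) # [2]"
]
},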
{
"cell_type": "code",
"collapsed": false,
"input": [
"def parameter_test(parameter1=None, parameter2=None):\n",
" '''Prints the values of parameter1 and parameter2'''\n",
" print 'parameter1: {}; parameter2: {}'.format(parameter1, parameter2)\n",
" \n",
"parameter_test() # no args are required\n",
"parameter_test(1) # if any args are provided, 1st arg gets assigned to parameter1\n",
"parameter_test(1, 2) # 2nd arg gets assigned to parameter2\n",
"parameter_test(2) # remember: if only 1 arg, 1st arg gets assigned to arg1\n",
"parameter_test(parameter2=2) # can use keyword to [only] provide an explicit value for parameter2\n",
"parameter_test(parameter2=2, parameter1=1) # can use keywords for either arg, in either order"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"parameter1: None; parameter2: None\n",
"parameter1: 1; parameter2: None\n",
"parameter1: 1; parameter2: 2\n",
"parameter1: 2; parameter2: None\n",
"parameter1: None; parameter2: 2\n",
"parameter1: 1; parameter2: 2\n"
]
}
],
"prompt_number": 15
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Exercise 9: define majority_value()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Define a function, `majority_value(instances, class_index)`, that returns the most frequently occurring value of `class_index` in `instances`. The `class_index` parameter should be optional, and have a default value of `0` (zero)."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# your definition of majority_value(instances) here\n",
"\n",
"# delete 'simple_ml.' below to test your function:\n",
"print 'Majority value of index {}: {}'.format(0, simple_ml.majority_value(clean_instances)) # note: relying on default parameter here\n",
"# although there is only one class_index for the dataset, we'll test it by providing non-default values\n",
"print 'Majority value of index {}: {}'.format(1, simple_ml.majority_value(clean_instances, 1)) # using an optional 2nd argument\n",
"print 'Majority value of index {}: {}'.format(2, simple_ml.majority_value(clean_instances, class_index=2)) # using a keyword"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Majority value of index 0: e\n",
"Majority value of index 1: x\n",
"Majority value of index 2: y\n"
]
}
],
"prompt_number": 16
},
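{
"cell_type": "markdown",
"metadata": {},
"source": [
"A compact definition is possible using the `Counter` class introduced above (a sketch; the `simple_ml` version may differ in its details):\n",
"\n",
"    def majority_value(instances, class_index=0):\n",
"        '''Returns the most frequently occurring value at class_index among instances'''\n",
"        class_counts = Counter([instance[class_index] for instance in instances])\n",
"        return class_counts.most_common(1)[0][0]"
]
},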
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The recursive `create_decision_tree()` function below uses an optional parameter, `class_index`, which defaults to `0`. This is to accommodate other datasets in which the class label is the last element on each line (which would be most easily specified by using a `-1` value). Most data files in the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets.html) have the class labels as either the first element or the last element.\n",
"\n",
"To show how the decision tree is being built, an optional `trace` parameter, when non-zero, will generate some trace information as the tree is constructed. The indentation level is incremented with each recursive call via the use of the conditional expression (ternary operator), `trace + 1 if trace else 0`."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def create_decision_tree(instances, candidate_attribute_indexes=None, class_index=0, default_class=None, trace=0):\n",
" '''Returns a new decision tree trained on a list of instances.\n",
" \n",
" The tree is constructed by recursively selecting and splitting instances based on \n",
" the highest information_gain of the candidate_attribute_indexes.\n",
" \n",
" The class label is found in position class_index.\n",
" \n",
" The default_class is the majority value for the current node's parent in the tree.\n",
" A positive (int) trace value will generate trace information with increasing levels of indentation.\n",
" \n",
" Derived from the simplified ID3 algorithm presented in Building Decision Trees in Python by Christopher Roach,\n",
" http://www.onlamp.com/pub/a/python/2006/02/09/ai_decision_trees.html?page=3'''\n",
" \n",
" # if no candidate_attribute_indexes are provided, assume that we will use all but the target_attribute_index\n",
" if candidate_attribute_indexes is None:\n",
" candidate_attribute_indexes = range(len(instances[0]))\n",
" candidate_attribute_indexes.remove(class_index)\n",
" \n",
" class_labels_and_counts = Counter([instance[class_index] for instance in instances])\n",
"\n",
" # If the dataset is empty or the candidate attributes list is empty, return the default value\n",
" if not instances or not candidate_attribute_indexes:\n",
" if trace:\n",
" print '{}Using default class {}'.format('< ' * trace, default_class)\n",
" return default_class\n",
" \n",
" # If all the instances have the same class label, return that class label\n",
" elif len(class_labels_and_counts) == 1:\n",
" class_label = class_labels_and_counts.most_common(1)[0][0]\n",
" if trace:\n",
" print '{}All {} instances have label {}'.format('< ' * trace, len(instances), class_label)\n",
" return class_label\n",
" else:\n",
" default_class = simple_ml.majority_value(instances, class_index)\n",
"\n",
" # Choose the next best attribute index to best classify the instances\n",
" best_index = simple_ml.choose_best_attribute_index(instances, candidate_attribute_indexes, class_index) \n",
" if trace:\n",
" print '{}Creating tree node for attribute index {}'.format('> ' * trace, best_index)\n",
"\n",
" # Create a new decision tree node with the best attribute index and an empty dictionary object (for now)\n",
" tree = {best_index:{}}\n",
"\n",
" # Create a new decision tree sub-node (branch) for each of the values in the best attribute field\n",
" partitions = simple_ml.split_instances(instances, best_index)\n",
"\n",
" # Remove that attribute from the set of candidates for further splits\n",
" remaining_candidate_attribute_indexes = [i for i in candidate_attribute_indexes if i != best_index]\n",
" for attribute_value in partitions:\n",
" if trace:\n",
" print '{}Creating subtree for value {} ({}, {}, {}, {})'.format(\n",
" '> ' * trace,\n",
" attribute_value, \n",
" len(partitions[attribute_value]), \n",
" len(remaining_candidate_attribute_indexes), \n",
" class_index, \n",
" default_class)\n",
" \n",
" # Create a subtree for each value of the the best attribute\n",
" subtree = create_decision_tree(\n",
" partitions[attribute_value],\n",
" remaining_candidate_attribute_indexes,\n",
" class_index,\n",
" default_class,\n",
" trace + 1 if trace else 0)\n",
"\n",
" # Add the new subtree to the empty dictionary object in the new tree/node we just created\n",
" tree[best_index][attribute_value] = subtree\n",
"\n",
" return tree\n",
"\n",
"# split instances into separate training and testing sets\n",
"training_instances = clean_instances[:-20]\n",
"testing_instances = clean_instances[-20:]\n",
"tree = create_decision_tree(training_instances, trace=1) # remove trace=1 to turn off tracing\n",
"print tree"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"> Creating tree node for attribute index 5\n",
"> Creating subtree for value a (400, 21, 0, e)\n",
"< < All 400 instances have label e\n",
"> Creating subtree for value c (192, 21, 0, e)\n",
"< < All 192 instances have label p\n",
"> Creating subtree for value f (1584, 21, 0, e)\n",
"< < All 1584 instances have label p\n",
"> Creating subtree for value m (28, 21, 0, e)\n",
"< < All 28 instances have label p\n",
"> Creating subtree for value l (400, 21, 0, e)\n",
"< < All 400 instances have label e\n",
"> Creating subtree for value n (2764, 21, 0, e)\n",
"> > Creating tree node for attribute index 20\n",
"> > Creating subtree for value k (1296, 20, 0, e)\n",
"< < < All 1296 instances have label e\n",
"> > Creating subtree for value r (72, 20, 0, e)\n",
"< < < All 72 instances have label p\n",
"> > Creating subtree for value w (100, 20, 0, e)\n",
"> > > Creating tree node for attribute index 21\n",
"> > > Creating subtree for value y (24, 19, 0, e)\n",
"< < < < All 24 instances have label e\n",
"> > > Creating subtree for value c (16, 19, 0, e)\n",
"< < < < All 16 instances have label p\n",
"> > > Creating subtree for value v (60, 19, 0, e)\n",
"< < < < All 60 instances have label e\n",
"> > Creating subtree for value n (1296, 20, 0, e)\n",
"< < < All 1296 instances have label e"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n",
"> Creating subtree for value p (256, 21, 0, e)\n",
"< < All 256 instances have label p\n",
"{5: {'a': 'e', 'c': 'p', 'f': 'p', 'm': 'p', 'l': 'e', 'n': {20: {'k': 'e', 'r': 'p', 'w': {21: {'y': 'e', 'c': 'p', 'v': 'e'}}, 'n': 'e'}}, 'p': 'p'}}\n"
]
}
],
"prompt_number": 17
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The structure of the tree shown above is rather difficult to discern from the normal printed representation of a dictionary.\n",
"\n",
"The Python [`pprint`](http://docs.python.org/2/library/pprint.html) module has a number of useful methods for pretty-printing or formatting objects in a more human readable way.\n",
"\n",
"The [`pprint.pprint(object, stream=None, indent=1, width=80, depth=None)`](http://docs.python.org/2/library/pprint.html#pprint.pprint) method will print `object` to a `stream` (a default value of `None` will dictate the use of [sys.stdout](http://docs.python.org/2/library/sys.html#sys.stdout), the same destination as `print` statement output), using `indent` spaces to differentiate nesting levels, using up to a maximum `width` columns and up to to a maximum nesting level `depth` (`None` indicating no maximum).\n",
"\n",
"We will use the a variation on the import statement that imports one or more functions into the current namespace:\n",
"\n",
" from pprint import pprint\n",
" \n",
"This will to enable us to use `pprint()` rather than having to use dotted notation, i.e., `pprint.pprint()`. \n",
"\n",
"Note that if we wanted to define our own `pprint()` function, we would be best only using\n",
"\n",
" import pprint\n",
" \n",
"so that we can still access the `pprint()` function in the `pprint` module (since defining `pprint()` in the current namespace would otherwise override the imported definition of the function)."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from pprint import pprint\n",
"pprint(tree)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"{5: {'a': 'e',\n",
" 'c': 'p',\n",
" 'f': 'p',\n",
" 'l': 'e',\n",
" 'm': 'p',\n",
" 'n': {20: {'k': 'e',\n",
" 'n': 'e',\n",
" 'r': 'p',\n",
" 'w': {21: {'c': 'p', 'v': 'e', 'y': 'e'}}}},\n",
" 'p': 'p'}}\n"
]
}
],
"prompt_number": 18
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Classifying Instances with a Simple Decision Tree"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Usually, when we construct a decision tree based on a set of *training* instances, we do so with the intent of using that tree to classify a set of one or more *testing* instances.\n",
"\n",
"We will define a function, `classify(tree, instance, default_class=None)`, to use a decision `tree` to classify a single `instance`, where an optional `default_class` can be specified as the return value if the instance represents a set of attribute values that don't have a representation in the decision tree.\n",
"\n",
"We will use a design pattern in which we will use a series of `if` statements, each of which returns a value if the condition is true, rather than a nested series of `if`, `elif` and/or `else` clauses, as it helps constrain the levels of indentation in the function."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def classify(tree, instance, default_class=None):\n",
" '''Returns a classification label for instance, given a decision tree'''\n",
" if not tree:\n",
" return default_class\n",
" if not isinstance(tree, dict): \n",
" return tree\n",
" attribute_index = tree.keys()[0]\n",
" attribute_values = tree.values()[0]\n",
" instance_attribute_value = instance[attribute_index]\n",
" if instance_attribute_value not in attribute_values:\n",
" return default_class\n",
" return classify(attribute_values[instance_attribute_value], instance, default_class)\n",
"\n",
"for instance in testing_instances:\n",
" predicted_label = classify(tree, instance)\n",
" actual_label = instance[0]\n",
" print 'predicted: {}; actual: {}'.format(predicted_label, actual_label)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"predicted: p; actual: p\n",
"predicted: p; actual: p\n",
"predicted: p; actual: p\n",
"predicted: e; actual: e\n",
"predicted: e; actual: e\n",
"predicted: p; actual: p\n",
"predicted: e; actual: e\n",
"predicted: e; actual: e\n",
"predicted: e; actual: e\n",
"predicted: p; actual: p\n",
"predicted: e; actual: e\n",
"predicted: e; actual: e\n",
"predicted: e; actual: e\n",
"predicted: p; actual: p\n",
"predicted: e; actual: e\n",
"predicted: e; actual: e\n",
"predicted: e; actual: e\n",
"predicted: e; actual: e\n",
"predicted: p; actual: p\n",
"predicted: p; actual: p\n"
]
}
],
"prompt_number": 19
},
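{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see the role of the `default_class` parameter, we can construct a hypothetical test instance with an attribute value that has no branch in the tree. For example, the odor value `'s'` (spicy) is part of the UCI attribute specification but does not occur among the training instances used above:\n",
"\n",
"    # a hypothetical instance whose odor value ('s') has no branch in the tree\n",
"    unseen_instance = list(testing_instances[0])\n",
"    unseen_instance[5] = 's'\n",
"    print classify(tree, unseen_instance) # None (no default_class provided)\n",
"    print classify(tree, unseen_instance, default_class='p') # 'p'"
]
},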
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Evaluating the Accuracy of a Simple Decision Tree"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It is often helpful to evaluate the performance of a model using a dataset not used in the training of that model. In the simple example shown above, we used all but the last 20 instances to train a simple decision tree, then classified those last 20 instances using the tree.\n",
"\n",
"The advantage of this training/testing split is that visual inspection of the classifications (sometimes called *predictions*) is relatively straightforward, revealing that all 20 instances were correctly classified.\n",
"\n",
"There are a variety of metrics that can be used to evaluate the performance of a model. [Scikit Learn's Model Evaluation](http://scikit-learn.org/stable/modules/model_evaluation.html) library provides an overview and implementation of several possible metrics. For now, we'll simply measure the *accuracy* of a model, i.e., the percentage of testing instances that are correctly classified (*true positives* and *true negatives*).\n",
"\n",
"The accuracy of the model above, given the set of 20 testing instances, is 100% (20/20).\n",
"\n",
"The function below calculates the classification accuracy of a `tree` over a set of `testing_instances` (with an optional `class_index` parameter indicating the position of the class label in each instance)."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def classification_accuracy(tree, testing_instances, class_index=0, default_class=None):\n",
" '''Returns the accuracy of classifying testing_instances with tree, where the class label is in position class_index'''\n",
" num_correct = 0\n",
" for i in xrange(len(testing_instances)):\n",
" prediction = classify(tree, testing_instances[i], default_class)\n",
" actual_value = testing_instances[i][class_index]\n",
" if prediction == actual_value:\n",
" num_correct += 1\n",
" return float(num_correct) / len(testing_instances)\n",
"\n",
"print classification_accuracy(tree, testing_instances)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"1.0\n"
]
}
],
"prompt_number": 20
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The [`zip([iterable, ...])`](http://docs.python.org/2.7/library/functions.html#zip) function combines 2 or more sequences or iterables; the function returns a list of tuples, where the *i*th tuple contains the *i*th element from each of the argument sequences or iterables."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"zip([0, 1, 2], ['a', 'b', 'c'])"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 21,
"text": [
"[(0, 'a'), (1, 'b'), (2, 'c')]"
]
}
],
"prompt_number": 21
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can use [list comprehensions](http://docs.python.org/2/tutorial/datastructures.html#list-comprehensions), the `Counter` class and the `zip()` function to modify `classification_accuracy()` so that it returns a packed tuple with \n",
"\n",
"* the number of correctly classified instances\n",
"* the number of incorrectly classified instances\n",
"* the percentage of instances correctly classified"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def classification_accuracy(tree, instances, class_index=0, default_class=None):\n",
" '''Returns the accuracy of classifying testing_instances with tree, where the class label is in position class_index'''\n",
" predicted_labels = [classify(tree, instance, default_class) for instance in instances]\n",
" actual_labels = [x[class_index] for x in instances]\n",
" counts = Counter([x == y for x, y in zip(predicted_labels, actual_labels)])\n",
" return counts[True], counts[False], float(counts[True]) / len(instances)\n",
"\n",
"print classification_accuracy(tree, testing_instances)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"(20, 0, 1.0)\n"
]
}
],
"prompt_number": 22
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We sometimes want to partition the instances into subsets of equal sizes to measure performance. One metric this partitioning allows us to compute is a [learning curve](https://en.wikipedia.org/wiki/Learning_curve), i.e., assess how well the model performs based on the size of its training set. Another use of these partitions (aka *folds*) would be to conduct an [*n-fold cross validation*](https://en.wikipedia.org/wiki/Cross-validation_(statistics)) evaluation.\n",
"\n",
"The following function, `partition_instances(instances, num_partitions)`, partitions a set of `instances` into `num_partitions` relatively equally sized subsets.\n",
"\n",
"We'll use this as yet another opportunity to demonstrate the power of using list comprehensions, this time, to condense the use of nested `for` loops."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def partition_instances(instances, num_partitions):\n",
" '''Returns a list of relatively equally sized disjoint sublists (partitions) of the list of instances'''\n",
" return [[instances[j] for j in xrange(i, len(instances), num_partitions)] for i in xrange(num_partitions)]"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 23
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before testing this function on the 5644 `clean_instances` from the UCI mushroom dataset, let's create a small number of simplified instances to verify that the function has the desired behavior."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"instance_length = 3\n",
"num_instances = 5\n",
"\n",
"simplified_instances = [[j for j in xrange(i, instance_length + i)] for i in xrange(num_instances)]\n",
"\n",
"print 'Instances:', simplified_instances\n",
"partitions = partition_instances(simplified_instances, 2)\n",
"print 'Partitions:', partitions"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Instances: [[0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]]\n",
"Partitions: [[[0, 1, 2], [2, 3, 4], [4, 5, 6]], [[1, 2, 3], [3, 4, 5]]]\n"
]
}
],
"prompt_number": 24
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following variations do not use list comprehensions."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def partition_instances(instances, num_partitions):\n",
" '''Returns a list of relatively equally sized disjoint sublists (partitions) of the list of instances'''\n",
" partitions = []\n",
" for i in xrange(num_partitions):\n",
" partition = []\n",
" # iterate over instances starting at position i in increments of num_paritions\n",
" for j in xrange(i, len(instances), num_partitions): \n",
" partition.append(instances[j])\n",
" partitions.append(partition)\n",
" return partitions\n",
"\n",
"simplified_instances = []\n",
"for i in xrange(num_instances):\n",
" new_instance = []\n",
" for j in xrange(i, instance_length + i):\n",
" new_instance.append(j)\n",
" simplified_instances.append(new_instance)\n",
"\n",
"print 'Instances:', simplified_instances\n",
"partitions = partition_instances(simplified_instances, 2)\n",
"print 'Partitions:', partitions"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Instances: [[0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]]\n",
"Partitions: [[[0, 1, 2], [2, 3, 4], [4, 5, 6]], [[1, 2, 3], [3, 4, 5]]]\n"
]
}
],
"prompt_number": 25
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The [`enumerate(sequence, start=0)`](http://docs.python.org/2.7/library/functions.html#enumerate) function creates an iterator that successively returns the index and value of each element in a `sequence`, beginning at the `start` index."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"for i, x in enumerate(['a', 'b', 'c']):\n",
" print i, x"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"0 a\n",
"1 b\n",
"2 c\n"
]
}
],
"prompt_number": 26
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can use `enumerate()` to facilitate slightly more rigorous testing of our `partition_instances` function on our `simplified_instances`."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"for i in xrange(5):\n",
" print '\\n# partitions:', i\n",
" for j, partition in enumerate(partition_instances(simplified_instances, i)):\n",
" print 'partition {}: {}'.format(j, partition)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n",
"# partitions: 0\n",
"\n",
"# partitions: 1\n",
"partition 0: [[0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]]\n",
"\n",
"# partitions: 2\n",
"partition 0: [[0, 1, 2], [2, 3, 4], [4, 5, 6]]\n",
"partition 1: [[1, 2, 3], [3, 4, 5]]\n",
"\n",
"# partitions: 3\n",
"partition 0: [[0, 1, 2], [3, 4, 5]]\n",
"partition 1: [[1, 2, 3], [4, 5, 6]]\n",
"partition 2: [[2, 3, 4]]\n",
"\n",
"# partitions: 4\n",
"partition 0: [[0, 1, 2], [4, 5, 6]]\n",
"partition 1: [[1, 2, 3]]\n",
"partition 2: [[2, 3, 4]]\n",
"partition 3: [[3, 4, 5]]\n"
]
}
],
"prompt_number": 27
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Returning our attention to the UCI mushroom dataset, the following will partition our `clean_instances` into 10 relatively equally sized disjoint subsets. We will use a list comprehension to print out the length of each partition"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"partitions = partition_instances(clean_instances, 10)\n",
"print [len(partition) for partition in partitions]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"[565, 565, 565, 565, 564, 564, 564, 564, 564, 564]\n"
]
}
],
"prompt_number": 28
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following variation does not use a list comprehension."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"for partition in partitions:\n",
" print len(partition), # note the comma at the end\n",
"print"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"565 565 565 565 564 564 564 564 564 564\n"
]
}
],
"prompt_number": 29
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following shows the different trees that are constructed based on partition 0 (first 10th) of `clean_instances`, partitions 0 and 1 (first 2/10ths) of `clean_instances` and all `clean_instances`."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"tree0 = create_decision_tree(partitions[0])\n",
"print 'Tree trained with {} instances:'.format(len(partitions[0]))\n",
"pprint(tree0)\n",
"\n",
"tree1 = create_decision_tree(partitions[0] + partitions[1])\n",
"print '\\nTree trained with {} instances:'.format(len(partitions[0] + partitions[1]))\n",
"pprint(tree1)\n",
"\n",
"tree = create_decision_tree(clean_instances)\n",
"print '\\nTree trained with {} instances:'.format(len(clean_instances))\n",
"pprint(tree)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Tree trained with 565 instances:\n",
"{5: {'a': 'e',\n",
" 'c': 'p',\n",
" 'f': 'p',\n",
" 'l': 'e',\n",
" 'm': 'p',\n",
" 'n': {20: {'k': 'e', 'n': 'e', 'r': 'p', 'w': 'e'}},\n",
" 'p': 'p'}}\n",
"\n",
"Tree trained with 1130 instances:\n",
"{5: {'a': 'e',\n",
" 'c': 'p',\n",
" 'f': 'p',\n",
" 'l': 'e',\n",
" 'm': 'p',\n",
" 'n': {20: {'k': 'e',\n",
" 'n': 'e',\n",
" 'r': 'p',\n",
" 'w': {21: {'c': 'p', 'v': 'e', 'y': 'e'}}}},\n",
" 'p': 'p'}}\n",
"\n",
"Tree trained with 5644 instances:"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n",
"{5: {'a': 'e',\n",
" 'c': 'p',\n",
" 'f': 'p',\n",
" 'l': 'e',\n",
" 'm': 'p',\n",
" 'n': {20: {'k': 'e',\n",
" 'n': 'e',\n",
" 'r': 'p',\n",
" 'w': {21: {'c': 'p', 'v': 'e', 'y': 'e'}}}},\n",
" 'p': 'p'}}\n"
]
}
],
"prompt_number": 30
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The only difference between the first two trees - *tree0* and *tree1* - is that in the first tree, instances with no `odor` (attribute index `5` is `'n'`) and a `spore-print-color` of white (attribute `20` = `'w'`) are classified as `edible` (`'e'`). With additional training data in the 2nd partition, an additional distinction is made such that instances with no `odor`, a white `spore-print-color` and a clustered `population` (attribute `21` = `'c'`) are classified as `poisonous` (`'p'`), while all other instances with no `odor` and a white `spore-print-color` (and any other value for the `population` attribute) are classified as `edible` (`'e'`).\n",
"\n",
"Note that there is no difference between `tree1` and `tree` (the tree trained with all instances). This early convergence on an optimal model is uncommon on most datasets (outside the UCI repository)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we can partition our instances into subsets, we can use these subsets to construct different-sized training sets in the process of computing a learning curve.\n",
"\n",
"We will start off with an initial training set consisting only of the first partition, and then progressively extend that training set by adding a new partition during each iteration of computing the learning curve.\n",
"\n",
"The [`list.extend(L)`](http://docs.python.org/2/tutorial/datastructures.html#more-on-lists) method enables us to extend `list` by appending all the items in another list, `L`, to the end of `list`."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"x = [1, 2, 3]\n",
"x.extend([4, 5])\n",
"print x"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"[1, 2, 3, 4, 5]\n"
]
}
],
"prompt_number": 31
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can now define the function, `compute_learning_curve(instances, num_partitions=10)`, that will take a list of `instances`, partition it into `num_partitions` relatively equally sized disjoint partitions, and then iteratively evaluate the accuracy of models trained with an incrementally increasing combination of instances in the first `num_partitions - 1` partitions then tested with instances in the last partition. That is, a model trained with the first partition will be constructed (and tested), then a model trained with the first 2 partitions will be constructed (and tested), and so on. \n",
"\n",
"The function will return a list of `num_partitions - 1` tuples representing the size of the training set and the accuracy of a tree trained with that set (and tested on the `num_partitions - 1` set). This will provide some indication of the relative impact of the size of the training set on model performance."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def compute_learning_curve(instances, num_partitions=10):\n",
" '''Returns a list of training sizes and scores for incrementally increasing partitions.\n",
"\n",
" The list contains 2-element tuples, each representing a training size and score.\n",
" The i-th training size is the number of instances in partitions 0 through num_partitions - 2.\n",
" The i-th score is the accuracy of a tree trained with instances \n",
" from partitions 0 through num_partitions - 2\n",
" and tested on instances from num_partitions - 1 (the last partition).'''\n",
" \n",
" partitions = partition_instances(instances, num_partitions)\n",
" testing_instances = partitions[-1][:]\n",
" training_instances = []\n",
" accuracy_list = []\n",
" for i in xrange(0, num_partitions - 1):\n",
" # for each iteration, the training set is composed of partitions 0 through i - 1\n",
" training_instances.extend(partitions[i][:])\n",
" tree = create_decision_tree(training_instances)\n",
" partition_accuracy = classification_accuracy(tree, testing_instances)\n",
" accuracy_list.append((len(training_instances), partition_accuracy))\n",
" return accuracy_list\n",
"\n",
"accuracy_list = compute_learning_curve(clean_instances)\n",
"print accuracy_list"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"[(565, (562, 2, 0.9964539007092199)), (1130, (564, 0, 1.0)), (1695, (564, 0, 1.0)), (2260, (564, 0, 1.0)), (2824, (564, 0, 1.0)), (3388, (564, 0, 1.0)), (3952, (564, 0, 1.0)), (4516, (564, 0, 1.0)), (5080, (564, 0, 1.0))]\n"
]
}
],
"prompt_number": 32
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Due to the quick convergence on an optimal decision tree for classifying the UCI mushroom dataset, we can use a larger number of smaller partitions to see a little more variation in acccuracy performance."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"accuracy_list = compute_learning_curve(clean_instances, 100)\n",
"print accuracy_list[:10]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"[(57, (55, 1, 0.9821428571428571)), (114, (56, 0, 1.0)), (171, (55, 1, 0.9821428571428571)), (228, (56, 0, 1.0)), (285, (56, 0, 1.0)), (342, (56, 0, 1.0)), (399, (56, 0, 1.0)), (456, (56, 0, 1.0)), (513, (56, 0, 1.0)), (570, (56, 0, 1.0))]\n"
]
}
],
"prompt_number": 33
},
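{
"cell_type": "markdown",
"metadata": {},
"source": [
"A learning curve is easier to interpret as a plot than as a list of tuples. The following cell is a minimal sketch of one way to visualize the accuracy values computed above; it assumes the [matplotlib](http://matplotlib.org/) library is installed (it is not part of the code used above)."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import matplotlib.pyplot as plt\n",
"\n",
"# each element of accuracy_list is (training set size, (num_correct, num_incorrect, accuracy))\n",
"training_set_sizes = [size for size, accuracy in accuracy_list]\n",
"accuracy_values = [accuracy[2] for size, accuracy in accuracy_list]\n",
"\n",
"plt.plot(training_set_sizes, accuracy_values, marker='o')\n",
"plt.xlabel('Number of training instances')\n",
"plt.ylabel('Classification accuracy')\n",
"plt.title('Learning curve for the UCI mushroom dataset')\n",
"plt.show()"
],
"language": "python",
"metadata": {},
"outputs": []
},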
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Object-Oriented Programming: Defining a Python Class to Encapsulate a Simple Decision Tree"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The simple decision tree defined above uses a Python dictionary for its representation. One can imagine using other data structures, and/or extending the decision tree to support confidence estimates, numeric features and other capabilities that are often included in more fully functional implementations. To support future extensibility, and hide the details of the representation from the user, it would be helpful to have a user-defined class for simple decision trees.\n",
"\n",
"Python is an [object-oriented programming](https://en.wikipedia.org/wiki/Object-oriented_programming) language, offering simple syntax and semantics for defining classes and instantiating objects of those classes. *[It is assumed that the reader is already familiar with the concepts of object-oriented programming]*\n",
"\n",
"A Python [class](http://docs.python.org/2/tutorial/classes.html) starts with the keyword **`class`** followed by a class name (identifier), a colon ('`:`'), and then any number of statements, which typically take the form of assignment statements for class or instance variables and/or function definitions for class methods. All statements are indented to reflect their inclusion in the class definition.\n",
"\n",
"The members - methods, class variables and instance variables - of a class are accessed by prepending `self.` to each reference. Class methods always include `self` as the first parameter. \n",
"\n",
"All class members in Python are *public* (accessible outside the class). There is no mechanism for *private* class members, but identifiers with leading double underscores (*\\_\\_member_identifier*) are 'mangled' (translated into *\\_class_name\\__member_identifier*), and thus not directly accessible outside their class, and can be used to approximate private members by Python programmers. \n",
"\n",
"There is also no mechanism for *protected* identifiers - accessible only within a defining class and its subclasses - in the Python language, and so Python programmers have adopted the convention of using a single underscore (*\\_identifier*) at the start of any identifier that is intended to be protected (i.e., not to be accessed outside the class or its subclasses). \n",
"\n",
"Some Python programmers only use the single underscore prefixes and avoid double underscore prefixes due to unintended consequences that can arise when names are mangled. The following warning about single and double underscore prefixes is issued in [Code Like a Pythonista](http://python.net/~goodger/projects/pycon/2007/idiomatic/handout.html#naming):\n",
"\n",
"> try to avoid the __private form. I never use it. Trust me. If you use it, you WILL regret it later\n",
"\n",
"We will follow this advice and avoid using the double underscore prefix in user-defined member variables and methods.\n",
"\n",
"Python has a number of pre-defined [special method names](http://docs.python.org/2/reference/datamodel.html#special-method-names), all of which are denoted by leading and trailing double underscores. For example, the [`object.__init__(self[, ...])`](http://docs.python.org/2/reference/datamodel.html#object.__init__) method is used to specify instructions that should be executed whenever a new object of a class is instantiated. \n",
"\n",
"The code below defines a class, `SimpleDecisionTree`, with a single pseudo-protected member variable `_tree` and a pseudo-protected tree construction method `_create()`, two public methods - `classify()` and `pprint()` - and an initialization method that takes an optional list of training `instances` and a `target_attribute_index`. \n",
"\n",
"The `_create()` method is identical to the `create_decision_tree()` function above, with the inclusion of the `self` parameter (as it is now a class method). The `classify()` method is a similarly modified version of the `classify()` and `classification_accuracy()` functions above, with references to `tree` converted to `self._tree`. The `pprint()` method prints the tree in a human-readable format.\n",
"\n",
"Note that other machine learning libraries may use different terminology for the methods we've defined here. For example, in the [`sklearn.tree.DecisionTreeClassifier`](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) class (and in most `sklearn` classifier classes), the method for constructing a classifier is named [`fit()`](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier.fit) - since it \"fits\" the data to a model - and the method for classifying instances is named [`predict()`](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier.predict) - since it is predicting the class label for an instance.\n",
"\n",
"Most comments and the use of the trace parameter have been removed to make the code more compact, but are included in the version found in `SimpleDecisionTree.py`."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"class SimpleDecisionTree:\n",
"\n",
" _tree = {} # this instance variable becomes accessible to class methods via self._tree\n",
"\n",
" def __init__(self, instances=None, target_attribute_index=0): # note the use of self as the first parameter\n",
" if instances:\n",
" self._tree = self._create(instances, range(1, len(instances[0])), target_attribute_index)\n",
" \n",
" def _create(self, instances, candidate_attribute_indexes, target_attribute_index=0, default_class=None):\n",
" class_labels_and_counts = Counter([instance[target_attribute_index] for instance in instances])\n",
" if not instances or not candidate_attribute_indexes:\n",
" return default_class\n",
" elif len(class_labels_and_counts) == 1:\n",
" class_label = class_labels_and_counts.most_common(1)[0][0]\n",
" return class_label\n",
" else:\n",
" default_class = simple_ml.majority_value(instances, target_attribute_index)\n",
" best_index = simple_ml.choose_best_attribute_index(instances, candidate_attribute_indexes, target_attribute_index)\n",
" tree = {best_index:{}}\n",
" partitions = simple_ml.split_instances(instances, best_index)\n",
" remaining_candidate_attribute_indexes = [i for i in candidate_attribute_indexes if i != best_index]\n",
" for attribute_value in partitions:\n",
" subtree = self._create(\n",
" partitions[attribute_value],\n",
" remaining_candidate_attribute_indexes,\n",
" target_attribute_index,\n",
" default_class)\n",
" tree[best_index][attribute_value] = subtree\n",
" return tree\n",
" \n",
" # calls the internal \"protected\" method to classify the instance given the _tree\n",
" def classify(self, instance, default_class=None):\n",
" return self._classify(self._tree, instance, default_class)\n",
" \n",
" # a method intended to be \"protected\" that can implement the recursive algorithm to classify an instance given a tree\n",
" def _classify(self, tree, instance, default_class=None):\n",
" if not tree:\n",
" return default_class\n",
" if not isinstance(tree, dict):\n",
" return tree\n",
" attribute_index = tree.keys()[0]\n",
" attribute_values = tree.values()[0]\n",
" instance_attribute_value = instance[attribute_index]\n",
" if instance_attribute_value not in attribute_values:\n",
" return default_class\n",
" return self._classify(attribute_values[instance_attribute_value], instance, default_class)\n",
" \n",
" def classification_accuracy(self, instances, default_class=None):\n",
" predicted_labels = [self.classify(instance, default_class) for instance in instances]\n",
" actual_labels = [x[0] for x in instances]\n",
" counts = Counter([x == y for x, y in zip(predicted_labels, actual_labels)])\n",
" return counts[True], counts[False], float(counts[True]) / len(instances)\n",
" \n",
" def pprint(self):\n",
" pprint(self._tree)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 34
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following statements instantiate a `SimpleDecisionTree`, using all but the last 20 `clean_instances`, prints out the tree using its `pprint()` method, and then uses the `classify()` method to print the classification of the last 20 `clean_instances`."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"simple_decision_tree = SimpleDecisionTree(training_instances)\n",
"simple_decision_tree.pprint()\n",
"print\n",
"for instance in testing_instances:\n",
" predicted_label = simple_decision_tree.classify(instance)\n",
" actual_label = instance[0]\n",
" print 'Model: {}; truth: {}'.format(predicted_label, actual_label)\n",
"print\n",
"print 'Classification accuracy:', simple_decision_tree.classification_accuracy(testing_instances)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"{5: {'a': 'e',\n",
" 'c': 'p',\n",
" 'f': 'p',\n",
" 'l': 'e',\n",
" 'm': 'p',\n",
" 'n': {20: {'k': 'e',\n",
" 'n': 'e',\n",
" 'r': 'p',\n",
" 'w': {21: {'c': 'p', 'v': 'e', 'y': 'e'}}}},\n",
" 'p': 'p'}}\n",
"\n",
"Model: p; truth: p\n",
"Model: p; truth: p\n",
"Model: p; truth: p\n",
"Model: e; truth: e\n",
"Model: e; truth: e\n",
"Model: p; truth: p\n",
"Model: e; truth: e\n",
"Model: e; truth: e\n",
"Model: e; truth: e\n",
"Model: p; truth: p\n",
"Model: e; truth: e\n",
"Model: e; truth: e\n",
"Model: e; truth: e\n",
"Model: p; truth: p\n",
"Model: e; truth: e\n",
"Model: e; truth: e\n",
"Model: e; truth: e\n",
"Model: e; truth: e\n",
"Model: p; truth: p\n",
"Model: p; truth: p\n",
"\n",
"Classification accuracy: (20, 0, 1.0)\n"
]
}
],
"prompt_number": 35
},
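{
"cell_type": "markdown",
"metadata": {},
"source": [
"To relate this to the `fit()` / `predict()` terminology mentioned above, the following cell is a rough, hypothetical sketch of how the same task might be approached with [scikit-learn](http://scikit-learn.org/stable/), assuming the `sklearn` package is installed. `sklearn` decision trees require numeric features, so the sketch uses a [`DictVectorizer`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html) to one-hot encode the categorical attribute values; the helper function `instance_to_dict()` is introduced here purely for illustration."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from sklearn.feature_extraction import DictVectorizer\n",
"from sklearn.tree import DecisionTreeClassifier\n",
"\n",
"def instance_to_dict(instance):\n",
"    # represent an instance as a dict mapping attribute positions to attribute values,\n",
"    # skipping the class label at position 0\n",
"    return {str(i): instance[i] for i in range(1, len(instance))}\n",
"\n",
"# one-hot encode the categorical attribute values as numeric feature vectors\n",
"vectorizer = DictVectorizer()\n",
"X_train = vectorizer.fit_transform([instance_to_dict(x) for x in training_instances]).toarray()\n",
"y_train = [instance[0] for instance in training_instances]\n",
"X_test = vectorizer.transform([instance_to_dict(x) for x in testing_instances]).toarray()\n",
"\n",
"classifier = DecisionTreeClassifier()\n",
"classifier.fit(X_train, y_train)  # analogous to SimpleDecisionTree(training_instances)\n",
"predicted_labels = classifier.predict(X_test)  # analogous to classify()\n",
"print list(predicted_labels)"
],
"language": "python",
"metadata": {},
"outputs": []
},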
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Navigation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notebooks in this primer:\n",
"\n",
"* [1. Introduction](1_Introduction.ipynb)\n",
"* [2. Data Science: Basic Concepts](2_Data_Science_Basic_Concepts.ipynb)\n",
"* [3. Python: Basic Concepts](3_Python_Basic_Concepts.ipynb)\n",
"* **4. Using Python to Build and Use a Simple Decision Tree Classifier** (*you are here*)\n",
"* [5. Next Steps](5_Next_Steps.ipynb)"
]
}
],
"metadata": {}
}
]
}