{
"metadata": {
"name": "",
"signature": "sha256:94a35a1cb8a7080e711ad0ee699aa3a6a4667dff06964e81f0a47944363bfa8b"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Python for Data Science"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[Joe McCarthy](http://interrelativity.com/joe), \n",
"*Director, Analytics & Data Science*, [Atigeo, LLC](http://atigeo.com)"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from IPython.display import display, Image, HTML"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 1
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Navigation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notebooks in this primer:\n",
"\n",
"* [1. Introduction](1_Introduction.ipynb)\n",
"* [2. Data Science: Basic Concepts](2_Data_Science_Basic_Concepts.ipynb)\n",
"* [3. Python: Basic Concepts](3_Python_Basic_Concepts.ipynb)\n",
"* **4. Using Python to Build and Use a Simple Decision Tree Classifier** (*you are here*)\n",
"* [5. Next Steps](5_Next_Steps.ipynb)"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# reconstitute relevent elements from the IPython environment active in previous notebook session\n",
"from collections import defaultdict\n",
"import simple_ml\n",
"clean_instances = simple_ml.load_instances('agaricus-lepiota.data', filter_missing_values=True)\n",
"attribute_names = simple_ml.load_attribute_names('agaricus-lepiota.attributes')\n",
"attribute_names_and_values = simple_ml.load_attribute_names_and_values('agaricus-lepiota.attributes')"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 2
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"4. Using Python to Build and Use a Simple Decision Tree Classifier"
]
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Decision Trees"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Wikipedia offers the following description of a [decision tree](https://en.wikipedia.org/wiki/Decision_tree) (with italics added to emphasize terms that will be elaborated below):\n",
"\n",
"> A decision tree is a flowchart-like structure in which each *internal node* represents a *test* of an *attribute*, each branch represents an *outcome* of that test and each *leaf node* represents *class label* (a decision taken after testing all attributes in the path from the root to the leaf). Each path from the root to a leaf can also be represented as a classification rule.\n",
"\n",
"The image below depicts a decision tree created from the UCI mushroom dataset that appears on [Andy G's blog post about Decision Tree Learning](http://gieseanw.wordpress.com/2012/03/03/decision-tree-learning/), where \n",
"\n",
"* a white box represents an *internal node* (and the label represents the *attribute* being tested)\n",
"* a blue box represents an attribute value (an *outcome* of the *test* of that attribute)\n",
"* a green box represents a *leaf node* with a *class label* of *edible*\n",
"* a red box represents a *leaf node* with a *class label* of *poisonous*\n",
"\n",
"<img src=\"http://gieseanw.files.wordpress.com/2012/03/mushroomdecisiontree.png\" style=\"width: 800px;\" />\n",
"\n",
"It is important to note that the UCI mushroom dataset consists entirely of [categorical variables](https://en.wikipedia.org/wiki/Categorical_variable), i.e., every variable (or *attribute*) has an enumerated set of possible values. Many datasets include numeric variables that can take on int or float values. Tests for such variables typically use comparison operators, e.g., $age < 65$ or $36,250 < adjusted\\_gross\\_income <= 87,850$. *[Aside: Python supports boolean expressions containing multiple comparison operators, such as the expression comparing adjusted_gross_income in the preceding example.]*\n",
"\n",
"Our simple decision tree will only accommodate categorical variables. We will closely follow a version of the [decision tree learning algorithm implementation](http://www.onlamp.com/pub/a/python/2006/02/09/ai_decision_trees.html?page=3) offered by Chris Roach.\n",
"\n",
"Our goal in the following sections is to use Python to\n",
"\n",
"* *create* a simple decision tree based on a set of training instances\n",
"* *classify* (predict class labels for) for an instance using a simple decision tree\n",
"* *evaluate* the performance of the simple decision tree on classifying a set of test instances\n",
"\n",
"First, we will explore some concepts and algorithms used in building and using decision trees."
]
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Entropy"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When building a supervised classification model, the frequency distribution of attribute values is a potentially important factor in determining the relative importance of each attribute at various stages in the model building process.\n",
"\n",
"In data modeling, we can use frequency distributions to compute ***entropy***, a measure of disorder (impurity) in a set.\n",
"\n",
"We compute the entropy of multiplying the proportion of instances with each class label by the log of that proportion, and then taking the negative sum of those terms.\n",
"\n",
"More precisely, for a 2-class (binary) classification task:\n",
"\n",
"$entropy(S) = - p_1 log_2 (p_1) - p_2 log_2 (p_2)$\n",
"\n",
"where $p_i$ is proportion (relative frequency) of class *i* within the set *S*.\n",
"\n",
"From the output above, we know that the proportion of `clean_instances` that are labeled `'e'` (class `edible`) in the UCI dataset is $3488 \\div 5644 = 0.618$, and the proportion labeled `'p'` (class `poisonous`) is $2156 \\div 5644 = 0.382$.\n",
"\n",
"After importing the Python [`math`](http://docs.python.org/2/library/math.html) module, we can use the [`math.log(x[, base])`](http://docs.python.org/2/library/math.html#math.log) function in computing the entropy of the `clean_instances` of the UCI mushroom data set as follows:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import math\n",
"entropy = - (3488 / 5644.0) * math.log(3488 / 5644.0, 2) - (2156 / 5644.0) * math.log(2156 / 5644.0, 2)\n",
"print entropy"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"0.959441337353\n"
]
}
],
"prompt_number": 3
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Exercise 6: define entropy()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Define a function, `entropy(instances)`, that computes the entropy of `instances`. You may assume the class label is in position 0; we will later see how to specify default parameter values in function definitions.\n",
"\n",
"[Note: the class label in many data files is the *last* rather than the *first* item on each line.]"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# your function definition here\n",
"\n",
"# delete 'simple_ml.' below to test your function\n",
"print simple_ml.entropy(clean_instances)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"0.959441337353\n"
]
}
],
"prompt_number": 4
},
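{
"cell_type": "markdown",
"metadata": {},
"source": [
"One possible implementation of `entropy(instances)` is sketched below (illustrative only; the `simple_ml` version may differ in its details). It counts the class labels with a `defaultdict` and then takes the negative sum of the $p_i log_2 (p_i)$ terms:\n",
"\n",
"    import math\n",
"    from collections import defaultdict\n",
"\n",
"    def entropy(instances):\n",
"        '''Returns the entropy of the distribution of class labels (position 0) in instances'''\n",
"        label_counts = defaultdict(int)\n",
"        for instance in instances:\n",
"            label_counts[instance[0]] += 1\n",
"        total = float(len(instances))\n",
"        return -sum((count / total) * math.log(count / total, 2)\n",
"                    for count in label_counts.values())"
]
},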
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Information Gain"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Informally, a decision tree is constructed using a recursive algorithm that \n",
"\n",
"* selects the *best* attribute \n",
"* splits the set into subsets based on the values of that attribute (each subset is composed of instances from the original set that have the same value for that attribute)\n",
"* repeats the process on each of these subsets until a stopping condition is met (e.g., a subset has no instances or has instances which all have the same class label)\n",
"\n",
"Entropy is a metric that can be used in selecting the best attribute for each split: the best attribute is the one resulting in the *largest decrease in entropy* for a set of instances. [Note: other metrics can be used for determining the best attribute]\n",
"\n",
"*Information gain* measures the decrease in entropy that results from splitting a set of instances based on an attribute.\n",
"\n",
"$IG(S, a) = entropy(S) - [p(s_1) \u00d7 entropy(s_1) + p(s_2) \u00d7 entropy(s_2) ... + p(s_n) \u00d7 entropy(s_n)]$\n",
"\n",
"Where $n$ is the number of distinct values of attribute $a$, and $s_i$ is the subset of $S$ where all instances have the $i$th value of $a$."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print 'Information gain for different attributes:\\n'\n",
"for i in range(1, len(attribute_names)):\n",
" print '{:5.3f} {:2} {}'.format(simple_ml.information_gain(clean_instances, i), i, attribute_names[i])"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Information gain for different attributes:\n",
"\n",
"0.017 1 cap-shape\n",
"0.005 2 cap-surface\n",
"0.195 3 cap-color\n",
"0.140 4 bruises?\n",
"0.860 5 odor\n",
"0.004 6 gill-attachment\n",
"0.058 7 gill-spacing\n",
"0.032 8 gill-size\n",
"0.213 9 gill-color\n",
"0.275 10 stalk-shape\n",
"0.097 11 stalk-root\n",
"0.425 12 stalk-surface-above-ring\n",
"0.409 13 stalk-surface-below-ring\n",
"0.306 14 stalk-color-above-ring\n",
"0.279 15 stalk-color-below-ring\n",
"0.000 16 veil-type"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n",
"0.002 17 veil-color\n",
"0.012 18 ring-number\n",
"0.463 19 ring-type\n",
"0.583 20 spore-print-color\n",
"0.110 21 population\n",
"0.101 22 habitat\n"
]
}
],
"prompt_number": 5
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can sort the attributes based in decreasing order of information gain."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print 'Information gain for different attributes:\\n'\n",
"sorted_information_gain_indexes = sorted([(simple_ml.information_gain(clean_instances, i), i) for i in range(1, len(attribute_names))], \n",
" reverse=True)\n",
"print sorted_information_gain_indexes, '\\n'\n",
"\n",
"for gain, i in sorted_information_gain_indexes:\n",
" print '{:5.3f} {:2} {}'.format(gain, i, attribute_names[i])"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Information gain for different attributes:\n",
"\n",
"[(0.8596704358849709, 5), (0.5828694793608379, 20), (0.46290566555455265, 19), (0.42456477093655975, 12), (0.40865780788318695, 13), (0.3062989793570199, 14), (0.27891994708759504, 15), (0.2750355212178639, 10), (0.2127971869976022, 9), (0.19495343617580085, 3), (0.1400386042032834, 4), (0.1097880400299237, 21), (0.10067585994181227, 22), (0.09733858997769329, 11), (0.05836192763098613, 7), (0.03242975884332899, 8), (0.01740692300090696, 1), (0.01205967443646827, 18), (0.004572013423856602, 2), (0.0044397141315495325, 6), (0.0019702590992403124, 17), (0.0, 16)]"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
" \n",
"\n",
"0.860 5 odor\n",
"0.583 20 spore-print-color\n",
"0.463 19 ring-type\n",
"0.425 12 stalk-surface-above-ring\n",
"0.409 13 stalk-surface-below-ring\n",
"0.306 14 stalk-color-above-ring\n",
"0.279 15 stalk-color-below-ring\n",
"0.275 10 stalk-shape\n",
"0.213 9 gill-color\n",
"0.195 3 cap-color\n",
"0.140 4 bruises?\n",
"0.110 21 population\n",
"0.101 22 habitat\n",
"0.097 11 stalk-root\n",
"0.058 7 gill-spacing\n",
"0.032 8 gill-size\n",
"0.017 1 cap-shape\n",
"0.012 18 ring-number\n",
"0.005 2 cap-surface\n",
"0.004 6 gill-attachment\n",
"0.002 17 veil-color\n",
"0.000 16 veil-type\n"
]
}
],
"prompt_number": 6
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following variation does not use a list comprehension:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print 'Information gain for different attributes:\\n'\n",
"\n",
"information_gain_values = []\n",
"for i in range(1, len(attribute_names)):\n",
" information_gain_values.append((simple_ml.information_gain(clean_instances, i), i))\n",
" \n",
"sorted_information_gain_indexes = sorted(information_gain_values, \n",
" reverse=True)\n",
"print sorted_information_gain_indexes, '\\n'\n",
"\n",
"for gain, i in sorted_information_gain_indexes:\n",
" print '{:5.3f} {:2} {}'.format(gain, i, attribute_names[i])"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Information gain for different attributes:\n",
"\n",
"[(0.8596704358849709, 5), (0.5828694793608379, 20), (0.46290566555455265, 19), (0.42456477093655975, 12), (0.40865780788318695, 13), (0.3062989793570199, 14), (0.27891994708759504, 15), (0.2750355212178639, 10), (0.2127971869976022, 9), (0.19495343617580085, 3), (0.1400386042032834, 4), (0.1097880400299237, 21), (0.10067585994181227, 22), (0.09733858997769329, 11), (0.05836192763098613, 7), (0.03242975884332899, 8), (0.01740692300090696, 1), (0.01205967443646827, 18), (0.004572013423856602, 2), (0.0044397141315495325, 6), (0.0019702590992403124, 17), (0.0, 16)]"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
" \n",
"\n",
"0.860 5 odor\n",
"0.583 20 spore-print-color\n",
"0.463 19 ring-type\n",
"0.425 12 stalk-surface-above-ring\n",
"0.409 13 stalk-surface-below-ring\n",
"0.306 14 stalk-color-above-ring\n",
"0.279 15 stalk-color-below-ring\n",
"0.275 10 stalk-shape\n",
"0.213 9 gill-color\n",
"0.195 3 cap-color\n",
"0.140 4 bruises?\n",
"0.110 21 population\n",
"0.101 22 habitat\n",
"0.097 11 stalk-root\n",
"0.058 7 gill-spacing\n",
"0.032 8 gill-size\n",
"0.017 1 cap-shape\n",
"0.012 18 ring-number\n",
"0.005 2 cap-surface\n",
"0.004 6 gill-attachment\n",
"0.002 17 veil-color\n",
"0.000 16 veil-type\n"
]
}
],
"prompt_number": 7
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Exercise 7: define information_gain()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Define a function, `information_gain(instances, i)`, that returns the information gain achieved by selecting the `i`th attribute to split `instances`. It should exhibit the same behavior as the `simple_ml` version of the function."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# your definition of information_gain(instances, i) here\n",
"\n",
"# delete 'simple_ml.' below to test your function\n",
"sorted_information_gain_indexes = sorted([(simple_ml.information_gain(clean_instances, i), i) for i in range(1, len(attribute_names))], \n",
" reverse=True)\n",
"\n",
"print 'Information gain for different attributes:\\n'\n",
"for gain, i in sorted_information_gain_indexes:\n",
" print '{:5.3f} {:2} {}'.format(gain, i, attribute_names[i])"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Information gain for different attributes:\n",
"\n",
"0.860 5 odor\n",
"0.583 20 spore-print-color\n",
"0.463 19 ring-type\n",
"0.425 12 stalk-surface-above-ring\n",
"0.409 13 stalk-surface-below-ring\n",
"0.306 14 stalk-color-above-ring\n",
"0.279 15 stalk-color-below-ring\n",
"0.275 10 stalk-shape\n",
"0.213 9 gill-color\n",
"0.195 3 cap-color\n",
"0.140 4 bruises?\n",
"0.110 21 population\n",
"0.101 22 habitat\n",
"0.097 11 stalk-root\n",
"0.058 7 gill-spacing\n",
"0.032 8 gill-size\n",
"0.017 1 cap-shape\n",
"0.012 18 ring-number\n",
"0.005 2 cap-surface\n",
"0.004 6 gill-attachment\n",
"0.002 17 veil-color\n",
"0.000 16 veil-type\n"
]
}
],
"prompt_number": 8
},
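{
"cell_type": "markdown",
"metadata": {},
"source": [
"One possible implementation of `information_gain(instances, i)` is sketched below (illustrative only, and assuming the `entropy()` function from Exercise 6; the `simple_ml` version may differ in its details). It groups the instances by their values of attribute `i`, then subtracts the weighted entropies of the subsets from the entropy of the full set:\n",
"\n",
"    from collections import defaultdict\n",
"\n",
"    def information_gain(instances, i):\n",
"        '''Returns the information gain achieved by splitting instances on the attribute at index i'''\n",
"        subsets = defaultdict(list)\n",
"        for instance in instances:\n",
"            subsets[instance[i]].append(instance)\n",
"        total = float(len(instances))\n",
"        remainder = sum((len(subset) / total) * entropy(subset)\n",
"                        for subset in subsets.values())\n",
"        return entropy(instances) - remainder"
]
},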
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Building a Simple Decision Tree"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will implement a modified version of the [ID3](https://en.wikipedia.org/wiki/ID3_algorithm) algorithm for building a simple decision tree.\n",
"\n",
" ID3 (Examples, Target_Attribute, Attributes)\n",
" Create a root node for the tree\n",
" If all examples are positive, Return the single-node tree Root, with label = +.\n",
" If all examples are negative, Return the single-node tree Root, with label = -.\n",
" If number of predicting attributes is empty, then Return the single node tree Root,\n",
" with label = most common value of the target attribute in the examples.\n",
" Otherwise Begin\n",
" A \u2190 The Attribute that best classifies examples.\n",
" Decision Tree attribute for Root = A.\n",
" For each possible value, v_i, of A,\n",
" Add a new tree branch below Root, corresponding to the test A = v_i.\n",
" Let Examples(v_i) be the subset of examples that have the value v_i for A\n",
" If Examples(v_i) is empty\n",
" Then below this new branch add a leaf node with label = most common target value in the examples\n",
" Else below this new branch add the subtree ID3 (Examples(v_i), Target_Attribute, Attributes \u2013 {A})\n",
" End\n",
" Return Root"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In building a decision tree, we will need to split the instances based on the index of the *best* attribute, i.e., the attribute that offers the *highest information gain*. We will use separate utility functions to handle these subtasks. To simplify the functions, we will rely exclusively on attribute indexes rather than attribute names.\n",
"\n",
"***Note:*** the algorithm above is *recursive*, i.e., the there is a recursive call to `ID3` within the definition of `ID3`. Covering recursion is beyond the scope of this primer, but there are a number of other resources on [using recursion in Python](https://www.google.com/search?q=python+recursion). Familiarity with recursion will be important for understanding both the tree construction and classification functions below.\n",
"\n",
"First, we will define a function to split a set of instances based on any attribute. This function will return a dictionary where the *key* of each dictionary is a distinct value of the specified `attribute_index`, and the *value* of each dictionary is a list representing the subset of `instances` that have that attribute value."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def split_instances(instances, attribute_index):\n",
" '''Returns a list of dictionaries, splitting a list of instances according to their values of a specified attribute''\n",
" \n",
" The key of each dictionary is a distinct value of attribute_index,\n",
" and the value of each dictionary is a list representing the subset of instances that have that value for the attribute'''\n",
" partitions = defaultdict(list)\n",
" for instance in instances:\n",
" partitions[instance[attribute_index]].append(instance)\n",
" return partitions\n",
"\n",
"partitions = split_instances(clean_instances, 5)\n",
"print [(partition, len(partitions[partition])) for partition in partitions]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"[('a', 400), ('c', 192), ('f', 1584), ('m', 36), ('l', 400), ('n', 2776), ('p', 256)]\n"
]
}
],
"prompt_number": 9
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we can split instances based on a particular attribute, we would like to be able to choose the *best* attribute with which to split the instances, where *best* is defined as the attribute that provides the greatest information gain if instances were split based on that attribute. We will want to restrict the candidate attributes so that we don't bother trying to split on an attribute that was used higher up in the decision tree (or use the target attribute as a candidate)."
]
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Exercise 8: define choose_best_attribute_index()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Define a function, `choose_best_attribute_index(instances, candidate_attribute_indexes)`, that returns the index in the list of `candidate_attribute_indexes` that provides the highest information gain if `instances` are split based on that attribute index."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# your function here\n",
"\n",
"# delete 'simple_ml.' below to test your function:\n",
"print 'Best attribute index:', simple_ml.choose_best_attribute_index(clean_instances, range(1, len(attribute_names)))"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Best attribute index: "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"5\n"
]
}
],
"prompt_number": 10
},
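{
"cell_type": "markdown",
"metadata": {},
"source": [
"One compact way to define the function is sketched below (illustrative only, and assuming the `information_gain()` function from Exercise 7; note that the `simple_ml` version also accepts an optional `class_index` parameter). The built-in `max()` function accepts a `key` parameter specifying the function used to compare candidates:\n",
"\n",
"    def choose_best_attribute_index(instances, candidate_attribute_indexes):\n",
"        '''Returns the index in candidate_attribute_indexes providing the highest information gain'''\n",
"        return max(candidate_attribute_indexes,\n",
"                   key=lambda i: information_gain(instances, i))"
]
},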
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A leaf node in a decision tree represents the most frequently occurring - or majority - class value for that path through the tree. We will need a function that determines the majority value for the class index among a set of instances.\n",
"\n",
"We earlier saw how the [`defaultdict`](http://docs.python.org/2/library/collections.html#collections.defaultdict) container in the [`collections`](http://docs.python.org/2/library/collections.html) module can be used to simplify the construction of a dictionary containing the counts of all attribute values for all attributes, by automatically setting the count for any attribute value to zero when the attribute value is first added to the dictionary.\n",
"\n",
"The `collections` module has another useful container, a [`Counter`](http://docs.python.org/2/library/collections.html#collections.Counter) class, that can further simplify the construction of a specialized dictionary of counts. When a `Counter` object is instantiated with a list of items, it returns a dictionary-like container in which the *keys* are the unique items in the list, and the *values* are the counts of each unique item in that list. \n",
"\n",
"This container has an additional method, [`most_common([n])`](http://docs.python.org/2/library/collections.html#collections.Counter.most_common), which returns a list of 2-element tuples representing the values and their associated counts for the most common `n` values; if `n` is omitted, the method returns all tuples.\n",
"\n",
"The following is an example of how we can use a `Counter` to represent the frequency of different class labels, and how we can identify the most frequent value and its count."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from collections import Counter\n",
"\n",
"class_counts = Counter([instance[0] for instance in clean_instances])\n",
"print 'class_counts: {}; most_common(1): {}, most_common(1)[0][0]: {}'.format(\n",
" class_counts, # the Counter object\n",
" class_counts.most_common(1), # returns a list in which the 1st element is a tuple with the most common value and its count\n",
" class_counts.most_common(1)[0][0]) # the most common value (1st element in that tuple)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"class_counts: Counter({'e': 3488, 'p': 2156}); most_common(1): [('e', 3488)], most_common(1)[0][0]: e\n"
]
}
],
"prompt_number": 11
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following variation does not use a list comprehension:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"class_values = []\n",
"for instance in clean_instances:\n",
" class_values.append(instance[0])\n",
" \n",
"class_counts = Counter(class_values)\n",
"print 'class_counts: {}; most_common(1): {}, most_common(1)[0][0]: {}'.format(\n",
" class_counts, # the Counter object\n",
" class_counts.most_common(1), # returns a list in which the 1st element is a tuple with the most common value and its count\n",
" class_counts.most_common(1)[0][0]) # the most common value (1st element in that tuple)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"class_counts: Counter({'e': 3488, 'p': 2156}); most_common(1): [('e', 3488)], most_common(1)[0][0]: e\n"
]
}
],
"prompt_number": 12
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before putting all this together to define a decision tree construction function, it may be helpful to cover a few additional aspects of Python the function will utilize."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Python offers a very flexible mechanism for the [testing of truth values](http://python.net/~goodger/projects/pycon/2007/idiomatic/handout.html#testing-for-truth-values): in an **if** condition, any null object, zero-valued numerical expression or empty container (string, list, dictionary or tuple) is interpreted as *False* (i.e., *not True*):"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"for x in [False, None, 0, 0.0, \"\", [], {}, ()]:\n",
" print '\"{}\" is'.format(x),\n",
" if x:\n",
" print True\n",
" else:\n",
" print False"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\"False\" is False\n",
"\"None\" is False\n",
"\"0\" is False\n",
"\"0.0\" is False\n",
"\"\" is False\n",
"\"[]\" is False\n",
"\"{}\" is False\n",
"\"()\" is False\n"
]
}
],
"prompt_number": 13
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Python also offers a [conditional expression (ternary operator)](http://docs.python.org/2/reference/expressions.html#conditional-expressions) that allows the functionality of an if/else statement that returns a value to be implemented as an expression. For example, the if/else statement in the code above could be implemented as a conditional expression as follows:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"for x in [False, None, 0, 0.0, \"\", [], {}, ()]:\n",
" print '\"{}\" is {}'.format(x, True if x else False) # using conditional expression as second argument to format()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\"False\" is False\n",
"\"None\" is False\n",
"\"0\" is False\n",
"\"0.0\" is False\n",
"\"\" is False\n",
"\"[]\" is False\n",
"\"{}\" is False\n",
"\"()\" is False\n"
]
}
],
"prompt_number": 14
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Python function definitions can specify [default parameter values](http://docs.python.org/2/tutorial/controlflow.html#default-argument-values) indicating the value those parameters will have if no argument is explicitly provided when the function is called. Arguments can also be passed using [keyword parameters](http://docs.python.org/2/tutorial/controlflow.html#keyword-arguments) indicting which parameter will be assigned a specific argument value (which may or may not correspond to the order in which the parameters are defined).\n",
"\n",
"The [Python Tutorial page on default parameters](http://docs.python.org/2/tutorial/controlflow.html#default-argument-values) includes the following warning:\n",
"\n",
"> Important warning: The default value is evaluated only once. This makes a difference when the default is a mutable object such as a list, dictionary, or instances of most classes. \n",
"\n",
"Thus it is generally better to use the Python null object, `None`, rather than an empty `list` (`[]`), `dict` (`{}`) or other mutable data structure when specifying default parameter values for any of those data types."
]
},
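{
"cell_type": "markdown",
"metadata": {},
"source": [
"The hypothetical `append_to()` function below illustrates the warning above: the default list is created once, when the function is defined, so state accumulates across calls. Using `None` as the default avoids the problem:\n",
"\n",
"    def append_to(item, target=[]): # anti-pattern: the default list is shared across calls\n",
"        target.append(item)\n",
"        return target\n",
"\n",
"    print append_to(1) # [1]\n",
"    print append_to(2) # [1, 2], as the same default list has accumulated state\n",
"\n",
"    def append_to_safe(item, target=None):\n",
"        if target is None:\n",
"            target = [] # a new list is created on every call\n",
"        target.append(item)\n",
"        return target\n",
"\n",
"    print append_to_safe(1) # [1]\n",
"    print append_to_safe(2) # [2]"
]
},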
{
"cell_type": "code",
"collapsed": false,
"input": [
"def parameter_test(parameter1=None, parameter2=None):\n",
" '''Prints the values of parameter1 and parameter2'''\n",
" print 'parameter1: {}; parameter2: {}'.format(parameter1, parameter2)\n",
" \n",
"parameter_test() # no args are required\n",
"parameter_test(1) # if any args are provided, 1st arg gets assigned to parameter1\n",
"parameter_test(1, 2) # 2nd arg gets assigned to parameter2\n",
"parameter_test(2) # remember: if only 1 arg, 1st arg gets assigned to arg1\n",
"parameter_test(parameter2=2) # can use keyword to [only] provide an explicit value for parameter2\n",
"parameter_test(parameter2=2, parameter1=1) # can use keywords for either arg, in either order"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"parameter1: None; parameter2: None\n",
"parameter1: 1; parameter2: None\n",
"parameter1: 1; parameter2: 2\n",
"parameter1: 2; parameter2: None\n",
"parameter1: None; parameter2: 2\n",
"parameter1: 1; parameter2: 2\n"
]
}
],
"prompt_number": 15
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Exercise 9: define majority_value()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Define a function, `majority_value(instances, class_index)`, that returns the most frequently occurring value of `class_index` in `instances`. The `class_index` parameter should be optional, and have a default value of `0` (zero)."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# your definition of majority_value(instances) here\n",
"\n",
"# delete 'simple_ml.' below to test your function:\n",
"print 'Majority value of index {}: {}'.format(0, simple_ml.majority_value(clean_instances)) # note: relying on default parameter here\n",
"# although there is only one class_index for the dataset, we'll test it by providing non-default values\n",
"print 'Majority value of index {}: {}'.format(1, simple_ml.majority_value(clean_instances, 1)) # using an optional 2nd argument\n",
"print 'Majority value of index {}: {}'.format(2, simple_ml.majority_value(clean_instances, class_index=2)) # using a keyword"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Majority value of index 0: e\n",
"Majority value of index 1: x\n",
"Majority value of index 2: y\n"
]
}
],
"prompt_number": 16
},
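{
"cell_type": "markdown",
"metadata": {},
"source": [
"A compact definition is possible using the `Counter` class introduced above (a sketch; the `simple_ml` version may differ in its details):\n",
"\n",
"    def majority_value(instances, class_index=0):\n",
"        '''Returns the most frequently occurring value at class_index among instances'''\n",
"        class_counts = Counter([instance[class_index] for instance in instances])\n",
"        return class_counts.most_common(1)[0][0]"
]
},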
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The recursive `create_decision_tree()` function below uses an optional parameter, `class_index`, which defaults to `0`. This is to accommodate other datasets in which the class label is the last element on each line (which would be most easily specified by using a `-1` value). Most data files in the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets.html) have the class labels as either the first element or the last element.\n",
"\n",
"To show how the decision tree is being built, an optional `trace` parameter, when non-zero, will generate some trace information as the tree is constructed. The indentation level is incremented with each recursive call via the use of the conditional expression (ternary operator), `trace + 1 if trace else 0`."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def create_decision_tree(instances, candidate_attribute_indexes=None, class_index=0, default_class=None, trace=0):\n",
" '''Returns a new decision tree trained on a list of instances.\n",
" \n",
" The tree is constructed by recursively selecting and splitting instances based on \n",
" the highest information_gain of the candidate_attribute_indexes.\n",
" \n",
" The class label is found in position class_index.\n",
" \n",
" The default_class is the majority value for the current node's parent in the tree.\n",
" A positive (int) trace value will generate trace information with increasing levels of indentation.\n",
" \n",
" Derived from the simplified ID3 algorithm presented in Building Decision Trees in Python by Christopher Roach,\n",
" http://www.onlamp.com/pub/a/python/2006/02/09/ai_decision_trees.html?page=3'''\n",
" \n",
" # if no candidate_attribute_indexes are provided, assume that we will use all but the target_attribute_index\n",
" if candidate_attribute_indexes is None:\n",
" candidate_attribute_indexes = range(len(instances[0]))\n",
" candidate_attribute_indexes.remove(class_index)\n",
" \n",
" class_labels_and_counts = Counter([instance[class_index] for instance in instances])\n",
"\n",
" # If the dataset is empty or the candidate attributes list is empty, return the default value\n",
" if not instances or not candidate_attribute_indexes:\n",
" if trace:\n",
" print '{}Using default class {}'.format('< ' * trace, default_class)\n",
" return default_class\n",
" \n",
" # If all the instances have the same class label, return that class label\n",
" elif len(class_labels_and_counts) == 1:\n",
" class_label = class_labels_and_counts.most_common(1)[0][0]\n",
" if trace:\n",
" print '{}All {} instances have label {}'.format('< ' * trace, len(instances), class_label)\n",
" return class_label\n",
" else:\n",
" default_class = simple_ml.majority_value(instances, class_index)\n",
"\n",
" # Choose the next best attribute index to best classify the instances\n",
" best_index = simple_ml.choose_best_attribute_index(instances, candidate_attribute_indexes, class_index) \n",
" if trace:\n",
" print '{}Creating tree node for attribute index {}'.format('> ' * trace, best_index)\n",
"\n",
" # Create a new decision tree node with the best attribute index and an empty dictionary object (for now)\n",
" tree = {best_index:{}}\n",
"\n",
" # Create a new decision tree sub-node (branch) for each of the values in the best attribute field\n",
" partitions = simple_ml.split_instances(instances, best_index)\n",
"\n",
" # Remove that attribute from the set of candidates for further splits\n",
" remaining_candidate_attribute_indexes = [i for i in candidate_attribute_indexes if i != best_index]\n",
" for attribute_value in partitions:\n",
" if trace:\n",
" print '{}Creating subtree for value {} ({}, {}, {}, {})'.format(\n",
" '> ' * trace,\n",
" attribute_value, \n",
" len(partitions[attribute_value]), \n",
" len(remaining_candidate_attribute_indexes), \n",
" class_index, \n",
" default_class)\n",
" \n",
" # Create a subtree for each value of the the best attribute\n",
" subtree = create_decision_tree(\n",
" partitions[attribute_value],\n",
" remaining_candidate_attribute_indexes,\n",
" class_index,\n",
" default_class,\n",
" trace + 1 if trace else 0)\n",
"\n",
" # Add the new subtree to the empty dictionary object in the new tree/node we just created\n",
" tree[best_index][attribute_value] = subtree\n",
"\n",
" return tree\n",
"\n",
"# split instances into separate training and testing sets\n",
"training_instances = clean_instances[:-20]\n",
"testing_instances = clean_instances[-20:]\n",
"tree = create_decision_tree(training_instances, trace=1) # remove trace=1 to turn off tracing\n",
"print tree"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"> Creating tree node for attribute index 5\n",
"> Creating subtree for value a (400, 21, 0, e)\n",
"< < All 400 instances have label e\n",
"> Creating subtree for value c (192, 21, 0, e)\n",
"< < All 192 instances have label p\n",
"> Creating subtree for value f (1584, 21, 0, e)\n",
"< < All 1584 instances have label p\n",
"> Creating subtree for value m (28, 21, 0, e)\n",
"< < All 28 instances have label p\n",
"> Creating subtree for value l (400, 21, 0, e)\n",
"< < All 400 instances have label e\n",
"> Creating subtree for value n (2764, 21, 0, e)\n",
"> > Creating tree node for attribute index 20\n",
"> > Creating subtree for value k (1296, 20, 0, e)\n",
"< < < All 1296 instances have label e\n",
"> > Creating subtree for value r (72, 20, 0, e)\n",
"< < < All 72 instances have label p\n",
"> > Creating subtree for value w (100, 20, 0, e)\n",
"> > > Creating tree node for attribute index 21\n",
"> > > Creating subtree for value y (24, 19, 0, e)\n",
"< < < < All 24 instances have label e\n",
"> > > Creating subtree for value c (16, 19, 0, e)\n",
"< < < < All 16 instances have label p\n",
"> > > Creating subtree for value v (60, 19, 0, e)\n",
"< < < < All 60 instances have label e\n",
"> > Creating subtree for value n (1296, 20, 0, e)\n",
"< < < All 1296 instances have label e"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n",
"> Creating subtree for value p (256, 21, 0, e)\n",
"< < All 256 instances have label p\n",
"{5: {'a': 'e', 'c': 'p', 'f': 'p', 'm': 'p', 'l': 'e', 'n': {20: {'k': 'e', 'r': 'p', 'w': {21: {'y': 'e', 'c': 'p', 'v': 'e'}}, 'n': 'e'}}, 'p': 'p'}}\n"
]
}
],
"prompt_number": 17
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The structure of the tree shown above is rather difficult to discern from the normal printed representation of a dictionary.\n",
"\n",
"The Python [`pprint`](http://docs.python.org/2/library/pprint.html) module has a number of useful methods for pretty-printing or formatting objects in a more human readable way.\n",
"\n",
"The [`pprint.pprint(object, stream=None, indent=1, width=80, depth=None)`](http://docs.python.org/2/library/pprint.html#pprint.pprint) method will print `object` to a `stream` (a default value of `None` will dictate the use of [sys.stdout](http://docs.python.org/2/library/sys.html#sys.stdout), the same destination as `print` statement output), using `indent` spaces to differentiate nesting levels, using up to a maximum `width` columns and up to to a maximum nesting level `depth` (`None` indicating no maximum).\n",
"\n",
"We will use the a variation on the import statement that imports one or more functions into the current namespace:\n",
"\n",
" from pprint import pprint\n",
" \n",
"This will to enable us to use `pprint()` rather than having to use dotted notation, i.e., `pprint.pprint()`. \n",
"\n",
"Note that if we wanted to define our own `pprint()` function, we would be best only using\n",
"\n",
" import pprint\n",
" \n",
"so that we can still access the `pprint()` function in the `pprint` module (since defining `pprint()` in the current namespace would otherwise override the imported definition of the function)."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from pprint import pprint\n",
"pprint(tree)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"{5: {'a': 'e',\n",
" 'c': 'p',\n",
" 'f': 'p',\n",
" 'l': 'e',\n",
" 'm': 'p',\n",
" 'n': {20: {'k': 'e',\n",
" 'n': 'e',\n",
" 'r': 'p',\n",
" 'w': {21: {'c': 'p', 'v': 'e', 'y': 'e'}}}},\n",
" 'p': 'p'}}\n"
]
}
],
"prompt_number": 18
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Classifying Instances with a Simple Decision Tree"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Usually, when we construct a decision tree based on a set of *training* instances, we do so with the intent of using that tree to classify a set of one or more *testing* instances.\n",
"\n",
"We will define a function, `classify(tree, instance, default_class=None)`, to use a decision `tree` to classify a single `instance`, where an optional `default_class` can be specified as the return value if the instance represents a set of attribute values that don't have a representation in the decision tree.\n",
"\n",
"We will use a design pattern in which we will use a series of `if` statements, each of which returns a value if the condition is true, rather than a nested series of `if`, `elif` and/or `else` clauses, as it helps constrain the levels of indentation in the function."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def classify(tree, instance, default_class=None):\n",
" '''Returns a classification label for instance, given a decision tree'''\n",
" if not tree:\n",
" return default_class\n",
" if not isinstance(tree, dict): \n",
" return tree\n",
" attribute_index = tree.keys()[0]\n",
" attribute_values = tree.values()[0]\n",
" instance_attribute_value = instance[attribute_index]\n",
" if instance_attribute_value not in attribute_values:\n",
" return default_class\n",
" return classify(attribute_values[instance_attribute_value], instance, default_class)\n",
"\n",
"for instance in testing_instances:\n",
" predicted_label = classify(tree, instance)\n",
" actual_label = instance[0]\n",
" print 'predicted: {}; actual: {}'.format(predicted_label, actual_label)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"predicted: p; actual: p\n",
"predicted: p; actual: p\n",
"predicted: p; actual: p\n",
"predicted: e; actual: e\n",
"predicted: e; actual: e\n",
"predicted: p; actual: p\n",
"predicted: e; actual: e\n",
"predicted: e; actual: e\n",
"predicted: e; actual: e\n",
"predicted: p; actual: p\n",
"predicted: e; actual: e\n",
"predicted: e; actual: e\n",
"predicted: e; actual: e\n",
"predicted: p; actual: p\n",
"predicted: e; actual: e\n",
"predicted: e; actual: e\n",
"predicted: e; actual: e\n",
"predicted: e; actual: e\n",
"predicted: p; actual: p\n",
"predicted: p; actual: p\n"
]
}
],
"prompt_number": 19
},
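{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see the role of the `default_class` parameter, we can construct a hypothetical test instance with an attribute value that has no branch in the tree. For example, the odor value `'s'` (spicy) is part of the UCI attribute specification but does not occur among the training instances used above:\n",
"\n",
"    # a hypothetical instance whose odor value ('s') has no branch in the tree\n",
"    unseen_instance = list(testing_instances[0])\n",
"    unseen_instance[5] = 's'\n",
"    print classify(tree, unseen_instance) # None (no default_class provided)\n",
"    print classify(tree, unseen_instance, default_class='p') # 'p'"
]
},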
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Evaluating the Accuracy of a Simple Decision Tree"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It is often helpful to evaluate the performance of a model using a dataset not used in the training of that model. In the simple example shown above, we used all but the last 20 instances to train a simple decision tree, then classified those last 20 instances using the tree.\n",
"\n",
"The advantage of this training/testing split is that visual inspection of the classifications (sometimes called *predictions*) is relatively straightforward, revealing that all 20 instances were correctly classified.\n",
"\n",
"There are a variety of metrics that can be used to evaluate the performance of a model. [Scikit Learn's Model Evaluation](http://scikit-learn.org/stable/modules/model_evaluation.html) library provides an overview and implementation of several possible metrics. For now, we'll simply measure the *accuracy* of a model, i.e., the percentage of testing instances that are correctly classified (*true positives* and *true negatives*).\n",
"\n",
"The accuracy of the model above, given the set of 20 testing instances, is 100% (20/20).\n",
"\n",
"The function below calculates the classification accuracy of a `tree` over a set of `testing_instances` (with an optional `class_index` parameter indicating the position of the class label in each instance)."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def classification_accuracy(tree, testing_instances, class_index=0, default_class=None):\n",
" '''Returns the accuracy of classifying testing_instances with tree, where the class label is in position class_index'''\n",
" num_correct = 0\n",
" for i in xrange(len(testing_instances)):\n",
" prediction = classify(tree, testing_instances[i], default_class)\n",
" actual_value = testing_instances[i][class_index]\n",
" if prediction == actual_value:\n",
" num_correct += 1\n",
" return float(num_correct) / len(testing_instances)\n",
"\n",
"print classification_accuracy(tree, testing_instances)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"1.0\n"
]
}
],
"prompt_number": 20
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The [`zip([iterable, ...])`](http://docs.python.org/2.7/library/functions.html#zip) function combines 2 or more sequences or iterables; the function returns a list of tuples, where the *i*th tuple contains the *i*th element from each of the argument sequences or iterables."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"zip([0, 1, 2], ['a', 'b', 'c'])"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 21,
"text": [
"[(0, 'a'), (1, 'b'), (2, 'c')]"
]
}
],
"prompt_number": 21
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can use [list comprehensions](http://docs.python.org/2/tutorial/datastructures.html#list-comprehensions), the `Counter` class and the `zip()` function to modify `classification_accuracy()` so that it returns a packed tuple with \n",
"\n",
"* the number of correctly classified instances\n",
"* the number of incorrectly classified instances\n",
"* the percentage of instances correctly classified"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def classification_accuracy(tree, instances, class_index=0, default_class=None):\n",
" '''Returns the accuracy of classifying testing_instances with tree, where the class label is in position class_index'''\n",
" predicted_labels = [classify(tree, instance, default_class) for instance in instances]\n",
" actual_labels = [x[class_index] for x in instances]\n",
" counts = Counter([x == y for x, y in zip(predicted_labels, actual_labels)])\n",
" return counts[True], counts[False], float(counts[True]) / len(instances)\n",
"\n",
"print classification_accuracy(tree, testing_instances)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"(20, 0, 1.0)\n"
]
}
],
"prompt_number": 22
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We sometimes want to partition the instances into subsets of equal sizes to measure performance. One metric this partitioning allows us to compute is a [learning curve](https://en.wikipedia.org/wiki/Learning_curve), i.e., assess how well the model performs based on the size of its training set. Another use of these partitions (aka *folds*) would be to conduct an [*n-fold cross validation*](https://en.wikipedia.org/wiki/Cross-validation_(statistics)) evaluation.\n",
"\n",
"The following function, `partition_instances(instances, num_partitions)`, partitions a set of `instances` into `num_partitions` relatively equally sized subsets.\n",
"\n",
"We'll use this as yet another opportunity to demonstrate the power of using list comprehensions, this time, to condense the use of nested `for` loops."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def partition_instances(instances, num_partitions):\n",
" '''Returns a list of relatively equally sized disjoint sublists (partitions) of the list of instances'''\n",
" return [[instances[j] for j in xrange(i, len(instances), num_partitions)] for i in xrange(num_partitions)]"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 23
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before testing this function on the 5644 `clean_instances` from the UCI mushroom dataset, let's create a small number of simplified instances to verify that the function has the desired behavior."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"instance_length = 3\n",
"num_instances = 5\n",
"\n",
"simplified_instances = [[j for j in xrange(i, instance_length + i)] for i in xrange(num_instances)]\n",
"\n",
"print 'Instances:', simplified_instances\n",
"partitions = partition_instances(simplified_instances, 2)\n",
"print 'Partitions:', partitions"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Instances: [[0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]]\n",
"Partitions: [[[0, 1, 2], [2, 3, 4], [4, 5, 6]], [[1, 2, 3], [3, 4, 5]]]\n"
]
}
],
"prompt_number": 24
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following variations do not use list comprehensions."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def partition_instances(instances, num_partitions):\n",
" '''Returns a list of relatively equally sized disjoint sublists (partitions) of the list of instances'''\n",
" partitions = []\n",
" for i in xrange(num_partitions):\n",
" partition = []\n",
" # iterate over instances starting at position i in increments of num_paritions\n",
" for j in xrange(i, len(instances), num_partitions): \n",
" partition.append(instances[j])\n",
" partitions.append(partition)\n",
" return partitions\n",
"\n",
"simplified_instances = []\n",
"for i in xrange(num_instances):\n",
" new_instance = []\n",
" for j in xrange(i, instance_length + i):\n",
" new_instance.append(j)\n",
" simplified_instances.append(new_instance)\n",
"\n",
"print 'Instances:', simplified_instances\n",
"partitions = partition_instances(simplified_instances, 2)\n",
"print 'Partitions:', partitions"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Instances: [[0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]]\n",
"Partitions: [[[0, 1, 2], [2, 3, 4], [4, 5, 6]], [[1, 2, 3], [3, 4, 5]]]\n"
]
}
],
"prompt_number": 25
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The [`enumerate(sequence, start=0)`](http://docs.python.org/2.7/library/functions.html#enumerate) function creates an iterator that successively returns the index and value of each element in a `sequence`, beginning at the `start` index."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"for i, x in enumerate(['a', 'b', 'c']):\n",
" print i, x"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"0 a\n",
"1 b\n",
"2 c\n"
]
}
],
"prompt_number": 26
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can use `enumerate()` to facilitate slightly more rigorous testing of our `partition_instances` function on our `simplified_instances`."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"for i in xrange(5):\n",
" print '\\n# partitions:', i\n",
" for j, partition in enumerate(partition_instances(simplified_instances, i)):\n",
" print 'partition {}: {}'.format(j, partition)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n",
"# partitions: 0\n",
"\n",
"# partitions: 1\n",
"partition 0: [[0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]]\n",
"\n",
"# partitions: 2\n",
"partition 0: [[0, 1, 2], [2, 3, 4], [4, 5, 6]]\n",
"partition 1: [[1, 2, 3], [3, 4, 5]]\n",
"\n",
"# partitions: 3\n",
"partition 0: [[0, 1, 2], [3, 4, 5]]\n",
"partition 1: [[1, 2, 3], [4, 5, 6]]\n",
"partition 2: [[2, 3, 4]]\n",
"\n",
"# partitions: 4\n",
"partition 0: [[0, 1, 2], [4, 5, 6]]\n",
"partition 1: [[1, 2, 3]]\n",
"partition 2: [[2, 3, 4]]\n",
"partition 3: [[3, 4, 5]]\n"
]
}
],
"prompt_number": 27
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Returning our attention to the UCI mushroom dataset, the following will partition our `clean_instances` into 10 relatively equally sized disjoint subsets. We will use a list comprehension to print out the length of each partition"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"partitions = partition_instances(clean_instances, 10)\n",
"print [len(partition) for partition in partitions]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"[565, 565, 565, 565, 564, 564, 564, 564, 564, 564]\n"
]
}
],
"prompt_number": 28
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following variation does not use a list comprehension."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"for partition in partitions:\n",
" print len(partition), # note the comma at the end\n",
"print"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"565 565 565 565 564 564 564 564 564 564\n"
]
}
],
"prompt_number": 29
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following shows the different trees that are constructed based on partition 0 (first 10th) of `clean_instances`, partitions 0 and 1 (first 2/10ths) of `clean_instances` and all `clean_instances`."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"tree0 = create_decision_tree(partitions[0])\n",
"print 'Tree trained with {} instances:'.format(len(partitions[0]))\n",
"pprint(tree0)\n",
"\n",
"tree1 = create_decision_tree(partitions[0] + partitions[1])\n",
"print '\\nTree trained with {} instances:'.format(len(partitions[0] + partitions[1]))\n",
"pprint(tree1)\n",
"\n",
"tree = create_decision_tree(clean_instances)\n",
"print '\\nTree trained with {} instances:'.format(len(clean_instances))\n",
"pprint(tree)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Tree trained with 565 instances:\n",
"{5: {'a': 'e',\n",
" 'c': 'p',\n",
" 'f': 'p',\n",
" 'l': 'e',\n",
" 'm': 'p',\n",
" 'n': {20: {'k': 'e', 'n': 'e', 'r': 'p', 'w': 'e'}},\n",
" 'p': 'p'}}\n",
"\n",
"Tree trained with 1130 instances:\n",
"{5: {'a': 'e',\n",
" 'c': 'p',\n",
" 'f': 'p',\n",
" 'l': 'e',\n",
" 'm': 'p',\n",
" 'n': {20: {'k': 'e',\n",
" 'n': 'e',\n",
" 'r': 'p',\n",
" 'w': {21: {'c': 'p', 'v': 'e', 'y': 'e'}}}},\n",
" 'p': 'p'}}\n",
"\n",
"Tree trained with 5644 instances:"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n",
"{5: {'a': 'e',\n",
" 'c': 'p',\n",
" 'f': 'p',\n",
" 'l': 'e',\n",
" 'm': 'p',\n",
" 'n': {20: {'k': 'e',\n",
" 'n': 'e',\n",
" 'r': 'p',\n",
" 'w': {21: {'c': 'p', 'v': 'e', 'y': 'e'}}}},\n",
" 'p': 'p'}}\n"
]
}
],
"prompt_number": 30
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The only difference between the first two trees - *tree0* and *tree1* - is that in the first tree, instances with no `odor` (attribute index `5` is `'n'`) and a `spore-print-color` of white (attribute `20` = `'w'`) are classified as `edible` (`'e'`). With additional training data in the 2nd partition, an additional distinction is made such that instances with no `odor`, a white `spore-print-color` and a clustered `population` (attribute `21` = `'c'`) are classified as `poisonous` (`'p'`), while all other instances with no `odor` and a white `spore-print-color` (and any other value for the `population` attribute) are classified as `edible` (`'e'`).\n",
"\n",
"Note that there is no difference between `tree1` and `tree` (the tree trained with all instances). This early convergence on an optimal model is uncommon on most datasets (outside the UCI repository)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we can partition our instances into subsets, we can use these subsets to construct different-sized training sets in the process of computing a learning curve.\n",
"\n",
"We will start off with an initial training set consisting only of the first partition, and then progressively extend that training set by adding a new partition during each iteration of computing the learning curve.\n",
"\n",
"The [`list.extend(L)`](http://docs.python.org/2/tutorial/datastructures.html#more-on-lists) method enables us to extend `list` by appending all the items in another list, `L`, to the end of `list`."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"x = [1, 2, 3]\n",
"x.extend([4, 5])\n",
"print x"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"[1, 2, 3, 4, 5]\n"
]
}
],
"prompt_number": 31
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can now define the function, `compute_learning_curve(instances, num_partitions=10)`, that will take a list of `instances`, partition it into `num_partitions` relatively equally sized disjoint partitions, and then iteratively evaluate the accuracy of models trained with an incrementally increasing combination of instances in the first `num_partitions - 1` partitions then tested with instances in the last partition. That is, a model trained with the first partition will be constructed (and tested), then a model trained with the first 2 partitions will be constructed (and tested), and so on. \n",
"\n",
"The function will return a list of `num_partitions - 1` tuples representing the size of the training set and the accuracy of a tree trained with that set (and tested on the `num_partitions - 1` set). This will provide some indication of the relative impact of the size of the training set on model performance."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def compute_learning_curve(instances, num_partitions=10):\n",
" '''Returns a list of training sizes and scores for incrementally increasing partitions.\n",
"\n",
" The list contains 2-element tuples, each representing a training size and score.\n",
" The i-th training size is the number of instances in partitions 0 through num_partitions - 2.\n",
" The i-th score is the accuracy of a tree trained with instances \n",
" from partitions 0 through num_partitions - 2\n",
" and tested on instances from num_partitions - 1 (the last partition).'''\n",
" \n",
" partitions = partition_instances(instances, num_partitions)\n",
" testing_instances = partitions[-1][:]\n",
" training_instances = []\n",
" accuracy_list = []\n",
" for i in xrange(0, num_partitions - 1):\n",
" # for each iteration, the training set is composed of partitions 0 through i - 1\n",
" training_instances.extend(partitions[i][:])\n",
" tree = create_decision_tree(training_instances)\n",
" partition_accuracy = classification_accuracy(tree, testing_instances)\n",
" accuracy_list.append((len(training_instances), partition_accuracy))\n",
" return accuracy_list\n",
"\n",
"accuracy_list = compute_learning_curve(clean_instances)\n",
"print accuracy_list"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"[(565, (562, 2, 0.9964539007092199)), (1130, (564, 0, 1.0)), (1695, (564, 0, 1.0)), (2260, (564, 0, 1.0)), (2824, (564, 0, 1.0)), (3388, (564, 0, 1.0)), (3952, (564, 0, 1.0)), (4516, (564, 0, 1.0)), (5080, (564, 0, 1.0))]\n"
]
}
],
"prompt_number": 32
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Due to the quick convergence on an optimal decision tree for classifying the UCI mushroom dataset, we can use a larger number of smaller partitions to see a little more variation in acccuracy performance."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"accuracy_list = compute_learning_curve(clean_instances, 100)\n",
"print accuracy_list[:10]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"[(57, (55, 1, 0.9821428571428571)), (114, (56, 0, 1.0)), (171, (55, 1, 0.9821428571428571)), (228, (56, 0, 1.0)), (285, (56, 0, 1.0)), (342, (56, 0, 1.0)), (399, (56, 0, 1.0)), (456, (56, 0, 1.0)), (513, (56, 0, 1.0)), (570, (56, 0, 1.0))]\n"
]
}
],
"prompt_number": 33
},
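{
"cell_type": "markdown",
"metadata": {},
"source": [
"A learning curve is easier to interpret as a plot than as a list of tuples. The following cell is a minimal sketch of one way to visualize the accuracy values computed above; it assumes the [matplotlib](http://matplotlib.org/) library is installed (it is not part of the code used above)."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import matplotlib.pyplot as plt\n",
"\n",
"# each element of accuracy_list is (training set size, (num_correct, num_incorrect, accuracy))\n",
"training_set_sizes = [size for size, accuracy in accuracy_list]\n",
"accuracy_values = [accuracy[2] for size, accuracy in accuracy_list]\n",
"\n",
"plt.plot(training_set_sizes, accuracy_values, marker='o')\n",
"plt.xlabel('Number of training instances')\n",
"plt.ylabel('Classification accuracy')\n",
"plt.title('Learning curve for the UCI mushroom dataset')\n",
"plt.show()"
],
"language": "python",
"metadata": {},
"outputs": []
},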
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Object-Oriented Programming: Defining a Python Class to Encapsulate a Simple Decision Tree"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The simple decision tree defined above uses a Python dictionary for its representation. One can imagine using other data structures, and/or extending the decision tree to support confidence estimates, numeric features and other capabilities that are often included in more fully functional implementations. To support future extensibility, and hide the details of the representation from the user, it would be helpful to have a user-defined class for simple decision trees.\n",
"\n",
"Python is an [object-oriented programming](https://en.wikipedia.org/wiki/Object-oriented_programming) language, offering simple syntax and semantics for defining classes and instantiating objects of those classes. *[It is assumed that the reader is already familiar with the concepts of object-oriented programming]*\n",
"\n",
"A Python [class](http://docs.python.org/2/tutorial/classes.html) starts with the keyword **`class`** followed by a class name (identifier), a colon ('`:`'), and then any number of statements, which typically take the form of assignment statements for class or instance variables and/or function definitions for class methods. All statements are indented to reflect their inclusion in the class definition.\n",
"\n",
"The members - methods, class variables and instance variables - of a class are accessed by prepending `self.` to each reference. Class methods always include `self` as the first parameter. \n",
"\n",
"All class members in Python are *public* (accessible outside the class). There is no mechanism for *private* class members, but identifiers with leading double underscores (*\\_\\_member_identifier*) are 'mangled' (translated into *\\_class_name\\__member_identifier*), and thus not directly accessible outside their class, and can be used to approximate private members by Python programmers. \n",
"\n",
"There is also no mechanism for *protected* identifiers - accessible only within a defining class and its subclasses - in the Python language, and so Python programmers have adopted the convention of using a single underscore (*\\_identifier*) at the start of any identifier that is intended to be protected (i.e., not to be accessed outside the class or its subclasses). \n",
"\n",
"Some Python programmers only use the single underscore prefixes and avoid double underscore prefixes due to unintended consequences that can arise when names are mangled. The following warning about single and double underscore prefixes is issued in [Code Like a Pythonista](http://python.net/~goodger/projects/pycon/2007/idiomatic/handout.html#naming):\n",
"\n",
"> try to avoid the __private form. I never use it. Trust me. If you use it, you WILL regret it later\n",
"\n",
"We will follow this advice and avoid using the double underscore prefix in user-defined member variables and methods.\n",
"\n",
"Python has a number of pre-defined [special method names](http://docs.python.org/2/reference/datamodel.html#special-method-names), all of which are denoted by leading and trailing double underscores. For example, the [`object.__init__(self[, ...])`](http://docs.python.org/2/reference/datamodel.html#object.__init__) method is used to specify instructions that should be executed whenever a new object of a class is instantiated. \n",
"\n",
"The code below defines a class, `SimpleDecisionTree`, with a single pseudo-protected member variable `_tree` and a pseudo-protected tree construction method `_create()`, two public methods - `classify()` and `pprint()` - and an initialization method that takes an optional list of training `instances` and a `target_attribute_index`. \n",
"\n",
"The `_create()` method is identical to the `create_decision_tree()` function above, with the inclusion of the `self` parameter (as it is now a class method). The `classify()` method is a similarly modified version of the `classify()` and `classification_accuracy()` functions above, with references to `tree` converted to `self._tree`. The `pprint()` method prints the tree in a human-readable format.\n",
"\n",
"Note that other machine learning libraries may use different terminology for the methods we've defined here. For example, in the [`sklearn.tree.DecisionTreeClassifier`](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) class (and in most `sklearn` classifier classes), the method for constructing a classifier is named [`fit()`](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier.fit) - since it \"fits\" the data to a model - and the method for classifying instances is named [`predict()`](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier.predict) - since it is predicting the class label for an instance.\n",
"\n",
"Most comments and the use of the trace parameter have been removed to make the code more compact, but are included in the version found in `SimpleDecisionTree.py`."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"class SimpleDecisionTree:\n",
"\n",
" _tree = {} # this instance variable becomes accessible to class methods via self._tree\n",
"\n",
" def __init__(self, instances=None, target_attribute_index=0): # note the use of self as the first parameter\n",
" if instances:\n",
" self._tree = self._create(instances, range(1, len(instances[0])), target_attribute_index)\n",
" \n",
" def _create(self, instances, candidate_attribute_indexes, target_attribute_index=0, default_class=None):\n",
" class_labels_and_counts = Counter([instance[target_attribute_index] for instance in instances])\n",
" if not instances or not candidate_attribute_indexes:\n",
" return default_class\n",
" elif len(class_labels_and_counts) == 1:\n",
" class_label = class_labels_and_counts.most_common(1)[0][0]\n",
" return class_label\n",
" else:\n",
" default_class = simple_ml.majority_value(instances, target_attribute_index)\n",
" best_index = simple_ml.choose_best_attribute_index(instances, candidate_attribute_indexes, target_attribute_index)\n",
" tree = {best_index:{}}\n",
" partitions = simple_ml.split_instances(instances, best_index)\n",
" remaining_candidate_attribute_indexes = [i for i in candidate_attribute_indexes if i != best_index]\n",
" for attribute_value in partitions:\n",
" subtree = self._create(\n",
" partitions[attribute_value],\n",
" remaining_candidate_attribute_indexes,\n",
" target_attribute_index,\n",
" default_class)\n",
" tree[best_index][attribute_value] = subtree\n",
" return tree\n",
" \n",
" # calls the internal \"protected\" method to classify the instance given the _tree\n",
" def classify(self, instance, default_class=None):\n",
" return self._classify(self._tree, instance, default_class)\n",
" \n",
" # a method intended to be \"protected\" that can implement the recursive algorithm to classify an instance given a tree\n",
" def _classify(self, tree, instance, default_class=None):\n",
" if not tree:\n",
" return default_class\n",
" if not isinstance(tree, dict):\n",
" return tree\n",
" attribute_index = tree.keys()[0]\n",
" attribute_values = tree.values()[0]\n",
" instance_attribute_value = instance[attribute_index]\n",
" if instance_attribute_value not in attribute_values:\n",
" return default_class\n",
" return self._classify(attribute_values[instance_attribute_value], instance, default_class)\n",
" \n",
" def classification_accuracy(self, instances, default_class=None):\n",
" predicted_labels = [self.classify(instance, default_class) for instance in instances]\n",
" actual_labels = [x[0] for x in instances]\n",
" counts = Counter([x == y for x, y in zip(predicted_labels, actual_labels)])\n",
" return counts[True], counts[False], float(counts[True]) / len(instances)\n",
" \n",
" def pprint(self):\n",
" pprint(self._tree)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 34
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following statements instantiate a `SimpleDecisionTree`, using all but the last 20 `clean_instances`, prints out the tree using its `pprint()` method, and then uses the `classify()` method to print the classification of the last 20 `clean_instances`."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"simple_decision_tree = SimpleDecisionTree(training_instances)\n",
"simple_decision_tree.pprint()\n",
"print\n",
"for instance in testing_instances:\n",
" predicted_label = simple_decision_tree.classify(instance)\n",
" actual_label = instance[0]\n",
" print 'Model: {}; truth: {}'.format(predicted_label, actual_label)\n",
"print\n",
"print 'Classification accuracy:', simple_decision_tree.classification_accuracy(testing_instances)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"{5: {'a': 'e',\n",
" 'c': 'p',\n",
" 'f': 'p',\n",
" 'l': 'e',\n",
" 'm': 'p',\n",
" 'n': {20: {'k': 'e',\n",
" 'n': 'e',\n",
" 'r': 'p',\n",
" 'w': {21: {'c': 'p', 'v': 'e', 'y': 'e'}}}},\n",
" 'p': 'p'}}\n",
"\n",
"Model: p; truth: p\n",
"Model: p; truth: p\n",
"Model: p; truth: p\n",
"Model: e; truth: e\n",
"Model: e; truth: e\n",
"Model: p; truth: p\n",
"Model: e; truth: e\n",
"Model: e; truth: e\n",
"Model: e; truth: e\n",
"Model: p; truth: p\n",
"Model: e; truth: e\n",
"Model: e; truth: e\n",
"Model: e; truth: e\n",
"Model: p; truth: p\n",
"Model: e; truth: e\n",
"Model: e; truth: e\n",
"Model: e; truth: e\n",
"Model: e; truth: e\n",
"Model: p; truth: p\n",
"Model: p; truth: p\n",
"\n",
"Classification accuracy: (20, 0, 1.0)\n"
]
}
],
"prompt_number": 35
},
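{
"cell_type": "markdown",
"metadata": {},
"source": [
"To relate this to the `fit()` / `predict()` terminology mentioned above, the following cell is a rough, hypothetical sketch of how the same task might be approached with [scikit-learn](http://scikit-learn.org/stable/), assuming the `sklearn` package is installed. `sklearn` decision trees require numeric features, so the sketch uses a [`DictVectorizer`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html) to one-hot encode the categorical attribute values; the helper function `instance_to_dict()` is introduced here purely for illustration."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from sklearn.feature_extraction import DictVectorizer\n",
"from sklearn.tree import DecisionTreeClassifier\n",
"\n",
"def instance_to_dict(instance):\n",
"    # represent an instance as a dict mapping attribute positions to attribute values,\n",
"    # skipping the class label at position 0\n",
"    return {str(i): instance[i] for i in range(1, len(instance))}\n",
"\n",
"# one-hot encode the categorical attribute values as numeric feature vectors\n",
"vectorizer = DictVectorizer()\n",
"X_train = vectorizer.fit_transform([instance_to_dict(x) for x in training_instances]).toarray()\n",
"y_train = [instance[0] for instance in training_instances]\n",
"X_test = vectorizer.transform([instance_to_dict(x) for x in testing_instances]).toarray()\n",
"\n",
"classifier = DecisionTreeClassifier()\n",
"classifier.fit(X_train, y_train)  # analogous to SimpleDecisionTree(training_instances)\n",
"predicted_labels = classifier.predict(X_test)  # analogous to classify()\n",
"print list(predicted_labels)"
],
"language": "python",
"metadata": {},
"outputs": []
},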
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Navigation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notebooks in this primer:\n",
"\n",
"* [1. Introduction](1_Introduction.ipynb)\n",
"* [2. Data Science: Basic Concepts](2_Data_Science_Basic_Concepts.ipynb)\n",
"* [3. Python: Basic Concepts](3_Python_Basic_Concepts.ipynb)\n",
"* **4. Using Python to Build and Use a Simple Decision Tree Classifier** (*you are here*)\n",
"* [5. Next Steps](5_Next_Steps.ipynb)"
]
}
],
"metadata": {}
}
]
}