
# **Vector-based normal-finding decision tree**

**Algorithm Brief**

*Concept*

For the creation of a new algorithm, I considered the fundamentals of decision tree classifiers.

Decision tree classifiers consist of a series of internal nodes that split the data into progressively smaller sets. Typically, each decision node compares one attribute against a target value: everything below the value falls into one partition, and everything equal to or above it falls into the other.

However, it is possible to use more than one attribute in a single step. Consider a linear node, which weights all of the attributes and compares the weighted sum to a fixed value. Splitting the data set this way reduces the bias that comes from depending on orthogonal (axis-aligned) lines to represent the data space. It also significantly increases both the complexity and the variance of the algorithm.

Linear splitting does, however, open up unique opportunities. Consider the following example:

![alt text](https://i.postimg.cc/GtZy5Hh1/justification.png)

In the above diagram, the depicted space is a form of underlying truth. The red space definitively belongs to one class, the yellow space belongs to a different class, and the orange space is disputed. An ideal solution would put lines cleanly through the middle of the orange space.

![alt text](https://i.postimg.cc/c1jwH575/justification-1.png)

If an orthogonal decision tree algorithm were used, the added line represents a plausible first step. Note that the additional red on the left side of the diagram is likely to pull the line further towards the bottom of the space than an ideal solution would place it.

![alt text](https://i.postimg.cc/DyXsbdBf/justification-2.png)

This second line completes a good solution. Any additional lines beyond this point would be overfitting the data.

![alt text](https://i.postimg.cc/q7jCdmMw/justification-3.png)

This diagram shows two possible first steps produced by a tree using linear nodes. Used directly, these lines appear to produce worse results than the orthogonal solution. However, the use of lines enables additional calculation. These two lines can be considered an approximation of a curve, and their intersection a point of interest.

![alt text](https://i.postimg.cc/GpcsV0r9/justification-4.png)

The line that has been added represents the tangent to the proposed approximation of a curve.

![alt text](https://i.postimg.cc/FHXSbdBy/justification-5.png)

The line that has been added represents the normal to the proposed approximation of a curve.

![alt text](https://i.postimg.cc/vZ7V0YJp/justification-6.png)

If the normal is used to partition the data, instead of either earlier linear node, this will split the data into regions that can be judged more fairly.

![alt text](https://i.postimg.cc/Tw152rK6/justification-7.png)

This possible result indicates how the use of the normal may limit the influence of unrelated curves upon the final tree, and may limit variance. Note that the depicted scenario has been deliberately chosen to favour orthogonal lines, and yet the linear normal method displays a significant advantage.

*Inputs*

This algorithm depends on specific gradients, so for best results each data point should be linearly normalised and stored as an array of length n+1: the n attribute values (indices 0 to n-1) scaled to lie between 0.0 and 1.0, followed by whatever attribute best represents the class at index n.

If, after the training set has been used to find values, the testing set includes results that fall outside the boundaries, the classifier should be able to handle these inputs normally.
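
The normalisation described above could be done in many ways; the following is only a minimal sketch of one of them. The helper names `fit_min_max` and `apply_min_max` are hypothetical and do not appear elsewhere in this notebook, and mapping a constant attribute to 0.0 is an assumption. The ranges are fitted on the training set only, so a test point outside those ranges simply lands outside 0.0 to 1.0.

```python
def fit_min_max(rows):
    """Return per-attribute (minimum, span) pairs; the last column is the label."""
    attribute_count = len(rows[0]) - 1
    ranges = []
    for i in range(attribute_count):
        column = [row[i] for row in rows]
        ranges.append((min(column), max(column) - min(column)))
    return ranges


def apply_min_max(rows, ranges):
    """Scale every attribute with the fitted ranges, keeping the label as-is."""
    scaled = []
    for row in rows:
        new_row = []
        # zip() stops after the n attributes, so the label is not scaled.
        for value, (minimum, span) in zip(row, ranges):
            # A constant attribute has zero span; map it to 0.0 (an assumption).
            new_row.append((value - minimum) / span if span else 0.0)
        new_row.append(row[-1])
        scaled.append(new_row)
    return scaled


training = [[2.0, 10.0, "a"], [4.0, 30.0, "b"], [3.0, 20.0, "a"]]
ranges = fit_min_max(training)
print(apply_min_max(training, ranges))
# A test point outside the training range is scaled with the same ranges.
print(apply_min_max([[5.0, 0.0, "b"]], ranges))
```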

*Outputs*

After training data is input, the output should be in the form of a decision tree. An efficient method of storing this in Python is a list with a length of n+2. The first n values are multipliers, one for each attribute of a new data point. If the sum of each attribute multiplied by its multiplier is less than 1.0, the point is sent to the nested tree at index n. If this sum is equal to or greater than 1.0, it is instead sent to the nested tree at index n+1. This procedure continues until, in the place where a nested tree would normally be, there is an object representing the decided class.

The output represents the decided class. It is a single value of whatever type the class attributes provided in the training data were. This should not be a list.
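
The traversal described above can be sketched as follows. The test code later in this notebook calls a function named `classify_vector` whose definition is not shown in this section; `classify_vector_sketch` below is a hypothetical stand-in written from the description, not necessarily the original implementation. It relies on the stated rule that a class label is never a list.

```python
def classify_vector_sketch(vector, tree):
    """Walk a nested-list tree until a non-list leaf (the class) is reached."""
    node = tree
    while isinstance(node, list):
        n = len(node) - 2  # the first n entries are the multipliers
        weighted_sum = sum(vector[i] * node[i] for i in range(n))
        # Below 1.0 goes to the subtree at index n, otherwise to index n + 1.
        node = node[n] if weighted_sum < 1.0 else node[n + 1]
    return node


# A hand-made two-level tree for two attributes (illustrative values only).
example_tree = [2.0, 0.0, "left", [0.0, 2.0, "low", "high"]]
print(classify_vector_sketch([0.25, 0.9], example_tree))
print(classify_vector_sketch([0.75, 0.25], example_tree))
print(classify_vector_sketch([0.75, 0.75], example_tree))
```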

*Intermediate Data Structure*

The classifier is best resolved as a recursive function, which must accept its current set as a parameter.

The classifier needs to store vectors, and can do so in the format of an n-dimensional array.

The classifier must be capable of identifying the proportions of all labels present in the set it has been passed. This information can be stored as 2 parallel 1-dimensional arrays.

The classifier must compare gradients. The metric I am using to compare them is derived from entropy, but I have labelled it "certainty" to match its new meaning. A certainty of 1.0 is the best-case scenario, and anything less indicates that the attached gradient cannot divide the data into two completely pure partitions.
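
Read off the `grade_list_by_certainty` and `list_entropy` functions later in this notebook, the certainty of cutting a sorted set $S$ into a left part $L$ and a right part $R$ can be written as below. The symbols $S$, $L$, $R$, $H$ and $p_c$ are introduced here for illustration only.

$$H(X) = -\sum_{c} p_c \log_2 p_c$$

$$\text{certainty} = 1 - \left(\frac{|L|}{|S|} H(L) + \frac{|R|}{|S|} H(R)\right)$$

Here $p_c$ is the proportion of class $c$ within $X$. With two classes, $H$ is at most 1, so the certainty lies between 0.0 and 1.0, and it reaches 1.0 only when both parts contain a single class.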

The calculation of the "normal" in n-dimensions requires an abstract representation of the circle that the two gradients lie between. My solution uses the following formulae as its basis:

$$x = a\sin(\alpha)$$

$$y = a\sin(\alpha + \theta)$$

where: x represents an attribute of the first vector; y represents the same attribute of the second vector; a represents the amplitude of the sine wave, that is, the distance between the maximum value of the attribute and 0; α represents the angle on the sine wave at which the first vector is situated; and θ represents the angle between the first and second vectors. It is possible for α to be 180° out of phase and for a to be negative, but using Python 3's math.atan2() function effectively negates any error due to this.
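
Expanding the second formula gives y = x cos(θ) + a cos(α) sin(θ), so a cos(α) = (y - x cos(θ)) / sin(θ), while a sin(α) = x. The following is a minimal sketch of how those two quantities yield the normal, assuming both inputs are unit vectors that are not parallel (so sin(θ) is not zero). The name `normal_between` is hypothetical, and the amplitude is taken with `math.hypot` rather than by the division used in the main function, which is a deliberate substitution to avoid dividing by a near-zero sine.

```python
import math


def normal_between(first, second):
    """Return the vector 90 degrees past the bisector of two unit vectors."""
    dot = sum(x * y for x, y in zip(first, second))
    theta = math.acos(max(-1.0, min(1.0, dot)))  # clamp against rounding
    result = []
    for x, y in zip(first, second):
        sine_part = x                                               # a * sin(alpha)
        cosine_part = (y - x * math.cos(theta)) / math.sin(theta)   # a * cos(alpha)
        amplitude = math.hypot(sine_part, cosine_part)
        alpha = math.atan2(sine_part, cosine_part)
        # theta / 2 is the bisector; adding pi / 2 turns it into the normal.
        result.append(amplitude * math.sin(alpha + theta / 2.0 + math.pi / 2.0))
    return result


first, second = [1.0, 0.0], [0.0, 1.0]
normal = normal_between(first, second)
bisector = [(x + y) / math.sqrt(2.0) for x, y in zip(first, second)]
print(normal)
print(sum(n * b for n, b in zip(normal, bisector)))  # close to zero
```

For the two axis vectors used here, the bisector points along (1, 1) and the result points along (-1, 1), at right angles to it.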

**Implementation**

The skeleton of this classifier, within its main function, indicates the other requirements.

In [0]:
import math

import numpy


def next_classifier(training_set):
    label_index = len(training_set[0]) - 1
    labels, label_frequencies = list_labels(training_set, label_index)
    if len(labels) == 1:
        return labels[0]
    elif len(training_set) < MINIMUM_LENGTH:
        max_index = 0
        for i in range(1, len(labels)):
            if label_frequencies[max_index] < label_frequencies[i]:
                max_index = i
        return labels[max_index]
    for i in range(len(labels)):
        if label_frequencies[i] > len(training_set) * MAXIMUM_PROPORTION:
            return labels[i]
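    # Work on copies so that normalising or extending a gradient below
    # does not modify the shared global gradient_set.
    gradients = [list(gradient) for gradient in gradient_set]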
    most_certain_gradients = maximum_certainty(training_set, gradients, label_index)
    best_gradient = gradients[most_certain_gradients[0][0]]
    second_gradient = gradients[most_certain_gradients[1][0]]
    best_gradient_certainty = most_certain_gradients[0][1]
    second_gradient_certainty = most_certain_gradients[1][1]
    if best_gradient_certainty < 1.0 and NORMAL_FACTOR * second_gradient_certainty >= best_gradient_certainty:
        modulo_one = vector_modulo(best_gradient)
        for i in range(len(best_gradient)):
            best_gradient[i] /= modulo_one
        modulo_two = vector_modulo(second_gradient)
        for i in range(len(second_gradient)):
            second_gradient[i] /= modulo_two
        theta = math.acos(numpy.dot(best_gradient, second_gradient))
        alpha = []
        for i in range(len(best_gradient)):
            alpha.append(
                math.atan2(best_gradient[i] * math.sin(theta),
                           second_gradient[i] - best_gradient[i] * math.cos(theta)))
        a = []
        for i in range(len(best_gradient)):
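            # atan2 can return pi here, where sin() gives a tiny non-zero
            # float, so compare against a tolerance instead of exactly 0.
            if abs(math.sin(alpha[i])) < 1e-12: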
                a.append(second_gradient[i] / math.sin(alpha[i] + theta))
                continue
            a.append(best_gradient[i] / math.sin(alpha[i]))
        new_gradient = []
        for i in range(len(best_gradient)):
            new_gradient.append(a[i] * math.sin(alpha[i] + theta / 2.0 + math.pi / 2.0))
        sorted_set = sort_list_by_proportion(training_set, new_gradient)
        best_certainty = 0
        best_index = 0
        for i in range(2, len(sorted_set) - 1):
            split_set = sort_list_by_proportion(sorted_set[:i], best_gradient)
            leftover_set = sort_list_by_proportion(sorted_set[i:], second_gradient)
            best = max(grade_list_by_certainty(split_set, label_index)) + \
                   max(grade_list_by_certainty(leftover_set, label_index))
            if best_certainty < best:
                best_certainty = best
                best_index = i
        for i in range(2, len(sorted_set) - 1):
            split_set = sort_list_by_proportion(sorted_set[:i], second_gradient)
            leftover_set = sort_list_by_proportion(sorted_set[i:], best_gradient)
            best = max(grade_list_by_certainty(split_set, label_index)) + \
                   max(grade_list_by_certainty(leftover_set, label_index))
            if best_certainty < best:
                best_certainty = best
                best_index = i
        sum_value = 0
        for i in range(label_index):
            sum_value += (sorted_set[best_index][i] + sorted_set[best_index + 1][i]) / 2.0 * new_gradient[i]
        if sum_value < 0:
            sum_value *= -1
        for i in range(label_index):
            new_gradient[i] /= sum_value
        result = new_gradient
        result.append(next_classifier(sorted_set[:best_index]))
        result.append(next_classifier(sorted_set[best_index:]))
        return result
    elif best_gradient_certainty == 1.0:
        sorted_set = sort_list_by_proportion(training_set, best_gradient)
        first_index = 0
        second_index = len(sorted_set) - 1
        first_class = sorted_set[first_index][label_index]
        second_class = sorted_set[second_index][label_index]
        while first_index + 1 < second_index:
            third_index = (first_index + second_index) // 2
            third_class = sorted_set[third_index][label_index]
            if third_class == first_class:
                first_index = third_index
            else:
                second_index = third_index
        barrier_sum = 0.0
        for i in range(label_index):
            barrier_sum += (sorted_set[first_index][i] + sorted_set[second_index][i]) / 2.0 * best_gradient[i]
        if barrier_sum < 0:
            barrier_sum *= -1
        result = []
        for i in range(label_index):
            result.append(best_gradient[i] / barrier_sum)
        result.append(first_class)
        result.append(second_class)
        return result
    sorted_set = sort_list_by_proportion(training_set, best_gradient)
    best_certainty = 0
    best_index = 0
    for i in range(2, len(sorted_set) - 1):
        split_set = sorted_set[:i]
        leftover_set = sorted_set[i:]
        best = max(grade_list_by_certainty(split_set, label_index)) + \
               max(grade_list_by_certainty(leftover_set, label_index))
        if best_certainty < best:
            best_certainty = best
            best_index = i
    sum_value = 0
    for i in range(label_index):
        sum_value += (sorted_set[best_index][i] + sorted_set[best_index + 1][i]) / 2.0 * best_gradient[i]
    if sum_value < 0:
        sum_value *= -1
    for i in range(label_index):
        best_gradient[i] /= sum_value
    result = best_gradient
    result.append(next_classifier(sorted_set[:best_index]))
    result.append(next_classifier(sorted_set[best_index:]))
    return result

The classifier requires the use of a set of gradients. My early attempts to produce this set from the data were slow and ineffective, so the algorithm now uses a fixed list, regenerated each time a new data set is loaded.

In [0]:
gradient_set = []


def initialize_gradient_set(n):
    lesser_set = []
    if n > 1:
        lesser_set = initialize_gradient_set(n - 1)
    else:
        return [[1]]
    result = []
    for i in range(len(lesser_set)):
        for j in range(n):
            result.append(lesser_set[i][:j] + [0] + lesser_set[i][j:])
        maximum_value = 0
        for j in range(n - 1):
            if lesser_set[i][j] > maximum_value:
                maximum_value = lesser_set[i][j]
        print("max_value:", maximum_value)
        max_index = int(math.log2(maximum_value))
        print("max_index:", max_index)
        for j in range(n):
            for k in range(max_index + 2):
                new_set = lesser_set[i][:j] + [int(math.pow(2, k))] + lesser_set[i][j:]
                breaking = False
                for old_set in result:
                    if new_set == old_set:
                        breaking = True
                        break
                if breaking:
                    continue
                result.append(new_set)
            for k in range(max_index + 2):
                new_set = lesser_set[i][:j] + [-int(math.pow(2, k))] + lesser_set[i][j:]
                if new_set[0] < 0:
                    for l in range(len(new_set)):
                        new_set[l] *= -1
                breaking = False
                for old_set in result:
                    if new_set == old_set:
                        breaking = True
                        break
                if breaking:
                    continue
                result.append(new_set)
    return result

The classifier needs a function to sort a list by any vector array.

In [0]:
def sort_list_by_proportion(data, proportions):
    new_list = data.copy()
    editable_length = len(new_list)
    while editable_length > 1:
        max_index = 0
        maximum_value = value_vector_by_proportions(new_list[0], proportions)
        for i in range(1, editable_length):
            current_value = value_vector_by_proportions(new_list[i], proportions)
            if current_value > maximum_value:
                max_index = i
                maximum_value = current_value
        editable_length -= 1
        new_list[editable_length], new_list[max_index] = new_list[max_index], new_list[editable_length]
    return new_list
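
The sort above calls `value_vector_by_proportions`, whose definition is not shown in this section. The following is a minimal sketch that assumes it is the weighted sum of the attributes, as described in the Outputs section. The names `weighted_value` and `sort_by_weighted_value` are hypothetical, and Python's built-in `sorted` is swapped in for the hand-written selection sort, so the order of tied items may differ.

```python
def weighted_value(vector, proportions):
    """Sum of each attribute multiplied by its weight.

    zip() stops at the shorter sequence, so a trailing class label in
    `vector` is ignored when `proportions` only has n entries.
    """
    return sum(value * weight for value, weight in zip(vector, proportions))


def sort_by_weighted_value(data, proportions):
    """Return a new list ordered from the lowest weighted value to the highest."""
    return sorted(data, key=lambda vector: weighted_value(vector, proportions))


points = [[0.9, 0.1, "a"], [0.2, 0.3, "b"], [0.5, 0.5, "a"]]
print(sort_by_weighted_value(points, [1, -1]))
```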

Entropy calculations all use a basic formula to combine the bits of information with the size of the known information.

The sum of the entropy of the proportions of a set can be used to determine how uniform the set is.

In [0]:
def entropy(probability):
    return probability * math.log2(probability)


def list_entropy(data, label_index):
    proportions = class_proportions(data, label_index)
    result = 0.0
    for proportion in proportions:
        result -= entropy(proportion)
    return result

Effective use of entropy is dependent on proper formulae to deal with lists.


In [0]:
def class_proportions(data, label_index):
    labels = []
    label_frequencies = []
    for item in data:
        found_label = False
        for i in range(len(labels)):
            if item[label_index] == labels[i]:
                found_label = True
                label_frequencies[i] += 1
        if not found_label:
            labels.append(item[label_index])
            label_frequencies.append(1)
    result = []
    for frequency in label_frequencies:
        result.append(float(frequency) / float(len(data)))
    return result


def grade_list_by_certainty(data, label_index):
    new_list = []
    for i in range(1, len(data)):
        proportion_before_i = float(i) / len(data)
        proportion_after_i = 1.0 - proportion_before_i
        new_list.append(1.0 - ((list_entropy(data[:i], label_index) * proportion_before_i)
                               + (list_entropy(data[i:], label_index) * proportion_after_i)))
    return new_list
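
As a worked example of the certainty measure, the following sketch scores two different cut positions in a short, already-sorted list of labels. It is a simplified restatement that operates on bare labels rather than full data points, and the names `label_entropy` and `split_certainty` are hypothetical.

```python
import math
from collections import Counter


def label_entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())


def split_certainty(labels, i):
    """Certainty of cutting an already-sorted label list before index i."""
    left, right = labels[:i], labels[i:]
    weighted = (label_entropy(left) * len(left)
                + label_entropy(right) * len(right)) / len(labels)
    return 1.0 - weighted


sorted_labels = [0, 0, 1, 1]
print(split_certainty(sorted_labels, 2))  # both sides pure
print(split_certainty(sorted_labels, 1))  # right side still mixed
```

Cutting in the middle leaves both sides pure and scores 1.0, while cutting after the first item leaves a mixed right-hand side and scores roughly 0.31.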

The ability to evaluate the certainty of a list is useful in creating a fully informed comparison of gradients, and in finding which can produce a complete result.

In [0]:
def maximum_certainty(data, gradients, label_index):
    print("Flag MC-1")
    print("gradients:", gradients)
    certainties = [[0, -1.0]]
    for h in range(len(gradients)):
        sorted_data = sort_list_by_proportion(data, gradients[h])
        marks = grade_list_by_certainty(sorted_data, label_index)
        for i in range(len(marks)):
            breaking = False
            for certainty_index in range(len(certainties)):
                if certainties[certainty_index][1] < marks[i]:
                    if certainties[certainty_index][0] != h:
                        certainties.insert(certainty_index, [h, marks[i]])
                    else:
                        certainties[certainty_index][0] = h
                        certainties[certainty_index][1] = marks[i]
                    breaking = True
                    break
                if certainties[certainty_index][0] == h:
                    breaking = True
                    break
            if not breaking:
                certainties.append([h, marks[i]])
    return certainties

Calculating the magnitude of a vector early on is necessary for the creation of a unit circle, which is in turn needed to calculate the normal. Luckily, this is a very simple formula.

In [0]:
def vector_modulo(vector):
    return pow(sum(pow(i, 2) for i in vector), 0.5)

Several of the features of the classifier are dependent on knowing the proportions of the classes present in the set. In fact, if the proportion of one label within the set passes over a certain threshold, the classifier will be considered complete.

In [0]:
def list_labels(data, label_index):
    labels = []
    label_frequencies = []
    for item in data:
        found_label = False
        for i in range(len(labels)):
            if item[label_index] == labels[i]:
                found_label = True
                label_frequencies[i] += 1
        if not found_label:
            labels.append(item[label_index])
            label_frequencies.append(1)
    return [labels, label_frequencies]
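
For comparison, the same counting can be sketched with the standard library's `collections.Counter`. This is an alternative, not the version used by the classifier; the name `count_labels` is hypothetical, and unlike the loop above it assumes the class labels are hashable.

```python
from collections import Counter


def count_labels(data, label_index):
    """Return parallel lists of distinct labels and how often each occurs."""
    counts = Counter(item[label_index] for item in data)
    # Counter keeps labels in the order they were first seen.
    return list(counts.keys()), list(counts.values())


sample = [[0.1, "a"], [0.2, "b"], [0.3, "a"]]
print(count_labels(sample, 1))
```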

Some constants are used in the above formulae. Here's what they represent:

The normal factor controls how close the certainty values of the two best candidate vectors must be for finding a normal to be considered a better course of action than directly using the better vector.

The minimum length is the minimum number of data points that the algorithm is willing to split, instead of just assigning the best available value. The maximum proportion is the point at which a value is considered effectively representative of the whole set. Both of these limits exist to prevent overfitting, but they may need to be altered depending on the specific nature of the data.

In [0]:
NORMAL_FACTOR = 1.2
MINIMUM_LENGTH = 4
MAXIMUM_PROPORTION = 0.95
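
The way the two limits stop the recursion can be sketched in isolation as follows. This restates the checks at the top of the main function on a bare list of labels; the name `leaf_label` and the use of `None` to mean "keep splitting" are assumptions made for this illustration.

```python
from collections import Counter


def leaf_label(labels, minimum_length=4, maximum_proportion=0.95):
    """Return the label to stop on, or None if the set should be split further."""
    counts = Counter(labels)
    majority, majority_count = counts.most_common(1)[0]
    if len(counts) == 1:
        return majority  # only one class is left
    if len(labels) < minimum_length:
        return majority  # too few points to be worth splitting
    if majority_count > len(labels) * maximum_proportion:
        return majority  # one class effectively represents the whole set
    return None


print(leaf_label(["a"] * 20))
print(leaf_label(["a", "b", "b"]))
print(leaf_label(["a"] * 39 + ["b"]))
print(leaf_label(["a"] * 5 + ["b"] * 5))
```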

**Test**

The following code tests the accuracy of the above algorithm on a data set randomly sampled from a known underlying truth.

In [0]:
def diamond_test():
    global gradient_set
    gradient_set = initialize_gradient_set(2)
    training_data = []
    import random
    view = []
    for i in range(21):
        view.append("                     ")
    for i in range(300):
        point = [random.random(), random.random()]
        # Grid cell (row from the first attribute, column from the second)
        # used to mark this point in the 21x21 text preview.
        row = math.floor((point[0] + 0.025) / 0.05)
        column = math.floor((point[1] + 0.025) / 0.05)
        if point[0] * 2 + point[1] > 1.5 and point[0] - point[1] * 2 < -0.5:
            label = 1
        else:
            label = 0
        point.append(label)
        view[row] = view[row][:column] + str(label) + view[row][column + 1:]
        training_data.append(point)
    classifier = next_classifier(training_data)
    for i in range(21):
        print(view[i])
    print(classifier)
    for i in range(21):
        line = ""
        for j in range(21):
            point = [i / 20.0, j / 20.0]
            line += str(classify_vector(point, classifier))
        print(line)

Executing this test repeatedly in a development environment, I produced the following results:

The following is the underlying truth, the ideal result for all other results to be compared to:

000000000000000000000

000000000000000000000

000000000000000000000

000000000000000000000

000000000000000000000

000000000000000000000

000000000000000000011

000000000000000001111

000000000000000111111

000000000000011111111

000000000001111111111

000000000001111111111

000000000001111111111

000000000000111111111

000000000000011111111

000000000000011111111

000000000000001111111

000000000000001111111

000000000000000111111

000000000000000111111

000000000000000011111

The following depict the approximations produced by the algorithm:

[2.54432421402722, -1.8338451360721781, [8.828184911424573, -12.248452278886026, [-5.797405599084514, 8.043470604001524, [0.0, 1.7599002934694448, 0, 1], [1.333766989884266, 0.666883494942133, 0, 1]], 0], 0]

000000000000000000000

000000000000000000000

000000000000000000000

000000000000000000000

000000000000000000000

000000000000000000001

000000000000000000111

000000000000000011111

000000000000001111111

000000000000111111111

000000000011111111111

000000000001111111111

000000000000111111111

000000000000111111111

000000000000111111111

000000000000111111111

000000000000111111111

000000000000011111111

000000000000000111111

000000000000000011111

000000000000000001111

[2.2296978512073298, -1.607075284197018, [5.610344594580424, -7.7839373239561755, [1.3347194210343847, 0.6673597105171923, 0, 1], 0], 0]

000000000000000000000

000000000000000000000

000000000000000000000

000000000000000000000

000000000000000000000

000000000000000000001

000000000000000000111

000000000000000011111

000000000000001111111

000000000000111111111

000000000011111111111

000000001111111111111

000000011111111111111

000000011111111111111

000000001111111111111

000000000111111111111

000000000011111111111

000000000000111111111

000000000000011111111

000000000000001111111

000000000000000011111

[3.330385031667594, -2.4004057179113363, [-6.309592089865182, 8.754091400140329, 0, [1.3342375400368536, 0.6671187700184268, 0, 1]], [1.8663649701339, -3.7327299402678, 1, 0]]

000000000000000000000

000000000000000000000

000000000000000000000

000000000000000000000

000000000000000000000

000000000000000000001

000000000000000000111

110000000000000011111

111000000000001111111

111110000000111111111

111111000011111111111

011111100001111111111

011111111001111111111

001111111100111111111

001111111111011111111

000111111111101111111

000111111111111111111

000011111111111111111

000011111111111111111

000001111111111111111

000001111111111111111

[-9.765671333647328, 2.3053622806618006, [0.36891361632398023, -1.562743156573656, [0.0, 1.288140391703908, 0, 1], [0.42444317709396595, -1.797970150756021, [0.0, 1.4426725051496716, 0, 1], [-0.461127511154061, 1.9533674835438966, 0, [2.416857374116822, 0.0, 0, 1]]]], 0]

000000000000000000000

000000000000000000000

000000000000000011000

000000000000000011111

000000000000000011111

000000000000000011111

000000000000000011111

000000000000000011111

000000000000000011111

000000000000000011111

000000000000000011111

000000000000000011111

000000000000000011111

000000000000000011111

000000000000000011111

000000000000000011111

000000000000000011111

000000000000000011111

000000000000000011111

000000000000000011111

000000000000000011111

As this test demonstrates, the algorithm is still prone to flaws and needs further refinement, but it can produce some very accurate approximations.

However, the execution time is a serious weakness. This test, on a set of 200 2-dimensional data points, took several minutes to produce a single classifier. The running time may grow polynomially with the number of data points and exponentially with the number of dimensions.

This raises concerning issues, as discussed in the Conclusion.