In [740]:
import random as rand
from statistics import mean, stdev
from math import floor

yes = 1
no = 0

# A Decision

Let's say we wanted to build something that would decide for us whether to go to the movies. We could base our decision on two variables: how much money we have and how many of our friends are going. These two variables are independent because how many of our friends are going to the movies does not affect how much money we have (if only it did).

However, we need to know both variables to make our decision. For instance, if we have enough money to go but none of our friends are going, then we probably won't want to go. On the other hand, if a lot of our friends are going but we don't have the money, we won't go either. Only the right combination of the variables leads to a yes.

Because the answer is binary (yes or no), we can call this task a binary classification. We are trying to make something that can take in an amount of money and a number of friends and output one of two classes, yes or no.

We can think of this task as a function that takes the amount of money and the number of friends as inputs and returns yes or no.

If we simply created this function with no other variables, there would be no machine learning involved. There would be no state that we could update to affect what decision is made. So we know we need some other values in order to affect the final decision. We also need a way to combine the extra values with the inputs so that our function actually returns a 1 or a 0 for a yes or a no. One way to do this is with a threshold. We can say that if our combined inputs exceed some value then our answer is yes, and otherwise it is no.


# Importance Values
By changing the importance values below you will change what "decision" the decide_on_movies function makes. These values affect how "important" each input value is in making the decision. In other words, they are weights for each input. In concrete terms, if you are a movie buff and you don't mind going to the movies by yourself, then the importance value for number of friends may be 0. This would mean that you don't care at all about how many of your friends are going, you just care about whether or not you have enough money.
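
To make the movie buff case concrete, here is a small sketch of the weighted sum with the friend weight set to 0. The function `weighted_decision` and the specific numbers (a money weight of 1 and a threshold of 10) are assumptions made up for this illustration, not part of the notebook's own code.

```python
def weighted_decision(number_friends, amount_money,
                      friend_importance, money_importance, threshold):
    """Return 1 (yes) if the weighted sum of the inputs exceeds the threshold, else 0 (no)."""
    combined = (money_importance * amount_money) + (friend_importance * number_friends)
    return 1 if combined > threshold else 0

# Movie buff (assumed values): friends have weight 0, money has weight 1, threshold is 10
alone_with_cash = weighted_decision(0, 15, friend_importance=0, money_importance=1, threshold=10)
crowd_but_broke = weighted_decision(5, 5, friend_importance=0, money_importance=1, threshold=10)

print(alone_with_cash, crowd_but_broke)
```

With the friend weight at 0, going alone with $15 gives a yes, while having five friends along but only $5 gives a no. The number of friends no longer changes the outcome at all.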

In [768]:
# =========== MODIFY THESE VALUES ===========

money_importance = .1
friend_importance = .1
threshold = 5

# =========== MODIFY THESE VALUES ===========

In [769]:
# actually "make the decision"
def decide_on_movies(number_friends, amount_money):
    combine_inputs = (money_importance * amount_money) + (friend_importance * number_friends)

    if combine_inputs > threshold:
        return yes
    else:
        return no

# A Dataset

Now these importance values are like knobs or sliders that we can tune to affect the final outcome of our decision maker. But how do we know what changes to make so that our decision maker is actually good? We need some way to measure how good our decisions are, so that when we change the importance values we can see whether the results are better or worse than they were before. What if we compare the number of correct answers with the number of incorrect answers? In order to do that we need a number of existing answers. This is where our dataset comes in. We need some examples of inputs and their corresponding correct outputs. Because the data is so simple, we can generate the examples ourselves. We can create 100 of them, enough to get a meaningful percentage.

In [761]:
def generate_example():
    money = rand.randint(0, 20)
    friends = rand.randint(0, 5)
    decision = no
    
    # if we have more than $10 and at least
    # 1 of our friends is going then we will go
    if money > 10 and friends > 0:
        decision = yes
        
    return (friends, money, decision)
   

In [762]:
def evaluate_model(num_examples):
    # create a dataset
    examples = [generate_example() for i in range(num_examples)]

    # use our decision maker to "decide" whether or not we should go to the movies
    predictions = [decide_on_movies(friends, money) for (friends, money, decision) in examples]

    # pull out all the actual answers from our dataset
    correct_answers = [decision for (f, m, decision) in examples]

    num_correct = 0

    # see how many answers our decision maker got right
    for (prediction, correct_answer) in zip(predictions, correct_answers):
        if prediction == correct_answer:
            num_correct += 1
    
    # the ratio of correct answers to the total number of examples
    return (num_correct/num_examples)


In [767]:
# evaluate the model multiple times to get an accurate representation of how it performs
num_examples = 100
num_evaluations = 100
evaluations = [evaluate_model(num_examples) for i in range(num_evaluations)]

avg = mean(evaluations)
# stdev returns the standard deviation (not the variance) of the accuracy scores
spread = stdev(evaluations)
print('{}% correct, {}% spread'.format(floor(avg * 100), floor(spread * 100)))


60% correct, 4% spread


If you run the preceding block you will see that the decision maker gets about 60% of the answers correct on whether or not we should go to the movies. I chose the importance values at random and they don't give us very good results! So how can we update them to improve our decision maker?
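
Where does the 60% come from? Here is a small check, assuming the importance values and threshold shown earlier (0.1, 0.1 and 5) and the same input ranges and rule as generate_example. It lists every possible input pair, finds the largest combined input, and counts how often the correct answer is no. The variable names are made up for this sketch.

```python
from fractions import Fraction

# values assumed to match the "MODIFY THESE VALUES" cell
money_importance = 0.1
friend_importance = 0.1
threshold = 5

# every (money, friends) pair that generate_example can produce, all equally likely
pairs = [(money, friends) for money in range(0, 21) for friends in range(0, 6)]

# the biggest weighted sum the decision maker can ever see
largest_combined = max(money_importance * m + friend_importance * f for m, f in pairs)

# how many pairs have "no" as the correct answer under the dataset rule
num_no = sum(1 for m, f in pairs if not (m > 10 and f > 0))
always_no_accuracy = Fraction(num_no, len(pairs))

print(largest_combined > threshold)
print(num_no, len(pairs), float(always_no_accuracy))
```

The largest combined input is 2.5, which never exceeds the threshold of 5, so the decision maker always answers no. The correct answer is no for 76 of the 126 possible pairs, or roughly 60%, which matches the result above.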

# Training

This is how we are going to make our decision maker better. When we run our evaluator, we see how well our model performs. We could then change the importance values and see if the model gets better or worse. If we decreased the values and the decision maker got better, then we could continue decreasing them until our results no longer improve. Try modifying the importance values above to see if you can improve the performance.
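
The trial and error described above can also be automated. What follows is only a minimal sketch of one way to do it, not the method this notebook goes on to use: nudge one value at a time by a small random step and keep the change if the accuracy does not get worse. The functions `make_examples`, `accuracy` and `train`, the step size, and the number of steps are all assumptions for illustration. The example rule is copied from generate_example so that the block runs on its own.

```python
import random

def make_examples(n, rng):
    """Build n (friends, money, decision) examples using the same rule as generate_example."""
    examples = []
    for _ in range(n):
        money = rng.randint(0, 20)
        friends = rng.randint(0, 5)
        decision = 1 if (money > 10 and friends > 0) else 0
        examples.append((friends, money, decision))
    return examples

def accuracy(weights, examples):
    """Fraction of examples where the thresholded weighted sum matches the stored decision."""
    money_w, friend_w, threshold = weights
    correct = 0
    for friends, money, decision in examples:
        prediction = 1 if (money_w * money + friend_w * friends) > threshold else 0
        if prediction == decision:
            correct += 1
    return correct / len(examples)

def train(examples, start, steps, step_size, rng):
    """Random nudges: change one value per step, keep it only if accuracy does not drop."""
    best = list(start)
    best_score = accuracy(best, examples)
    for _ in range(steps):
        candidate = list(best)
        index = rng.randrange(len(candidate))
        candidate[index] += rng.choice([-step_size, step_size])
        score = accuracy(candidate, examples)
        if score >= best_score:
            best, best_score = candidate, score
    return best, best_score

rng = random.Random(0)  # fixed seed so repeated runs give the same result
examples = make_examples(500, rng)

start = [0.1, 0.1, 5]  # money_importance, friend_importance, threshold
start_score = accuracy(start, examples)
trained, trained_score = train(examples, start, steps=2000, step_size=0.1, rng=rng)

print('before: {:.2f}, after: {:.2f}'.format(start_score, trained_score))
print('money_importance={:.1f}, friend_importance={:.1f}, threshold={:.1f}'.format(*trained))
```

Because a change is only kept when the score stays the same or improves, the final accuracy can never be lower than the starting one. Note that this sketch scores the values on the same examples it used to tune them, so the number it reports may be a little optimistic compared with a fresh set of examples.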