# The University of Melbourne, School of Computing and Information Systems
# COMP90049 Introduction to Machine Learning, 2020 Semester 1
-----
## Project 1: Understanding Student Success with Naive Bayes
-----
###### Student Name(s): John Philip Harkness
###### Python version: 3
###### Submission deadline: 11am, Wed 22 Apr 2020

This iPython notebook is a template which you will use for your Project 1 submission. 

Marking will be applied on the five functions that are defined in this notebook, and to your responses to the questions at the end of this notebook.

You may change the prototypes of these functions, and you may write other functions, according to your requirements. We would appreciate it if the required functions were prominent/easy to find. 

1. a) From Bayes' rule, we have the following formula:
            
      P(y|x) = P(x|y)P(y) / P(x)
      
      Our objective is to find:    ŷ  =  argmax_y P(x|y)P(y),  where x consists of many features, x1, x2, ..., xm
      (P(x) is the same for every class y, so it can be dropped from the argmax.)
      Therefore, we need to find ŷ  =  argmax_y P(x1, x2, ..., xm|y)P(y)

      The "naive" assumption that we make is that, conditioned on the class y, all the features x1, x2, ..., xm 
      are assumed to be independent. In the context of the student dataset, this means: given a student
      has grade A+, all of the features (famsize, daily alcohol consumption, internet access, etc.) are independent
      of one another. This is unrealistic, but it is what makes the probability computable from the data.
      
      With the given dataset, there is no way we can reliably estimate, for example, 
      P(x1 = "MS", x2 = "M", ..., x29 = "none"|y = "A")P(y = "A"),
      because most combinations of 29 feature values never occur in the training data.
      That is why it is necessary to assume conditional independence, so that we can compute
      P(x1 = "MS"|y = "A")P(x2 = "M"|y = "A")...P(x29 = "none"|y = "A")P(y = "A"), because these probabilities
      are attainable from the dataset.
      
      It is problematic because the features aren't independent. For example, traveltime is dependent on address:
      if a student lives in a rural area, they are far more likely to have to travel further to get 
      to school. Furthermore, studytime plausibly depends on traveltime. In summary, the algorithm multiplies the
      per-feature probabilities as though the features were conditionally independent, which the chain rule of
      probability only permits when that independence actually holds. The resulting scores therefore cannot be
      interpreted as true probabilities, and hence this algorithm is "naive".
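
As a minimal sketch of the factorisation described above (not the implementation used in this notebook), the snippet below estimates P(y) and each P(xi|y) by counting, then scores a class by summing logs. The names `train_nb`, `log_score` and `predict`, the toy rows, and the decision to skip smoothing are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train_nb(rows):
    """Count class labels and (feature index, value) pairs per class.

    Each row is a list of categorical values with the class label last.
    """
    class_counts = Counter()
    feature_counts = defaultdict(Counter)  # label -> Counter of (index, value)
    for row in rows:
        *features, label = row
        class_counts[label] += 1
        for i, value in enumerate(features):
            feature_counts[label][(i, value)] += 1
    return class_counts, feature_counts


def log_score(features, label, class_counts, feature_counts):
    """Return log P(y) + sum_i log P(x_i | y), or -inf for an unseen pair."""
    n = sum(class_counts.values())
    score = math.log(class_counts[label] / n)
    for i, value in enumerate(features):
        count = feature_counts[label][(i, value)]
        if count == 0:
            return float("-inf")  # no smoothing in this sketch
        score += math.log(count / class_counts[label])
    return score


def predict(features, class_counts, feature_counts):
    """Pick the class with the highest naive Bayes log score."""
    return max(
        class_counts,
        key=lambda y: log_score(features, y, class_counts, feature_counts),
    )


# Hypothetical toy data: two features (school, address) and a grade label
toy_rows = [
    ["GP", "U", "A"],
    ["GP", "U", "A"],
    ["MS", "R", "A"],
    ["MS", "R", "F"],
    ["MS", "U", "F"],
]
class_counts, feature_counts = train_nb(toy_rows)
prediction = predict(["GP", "U"], class_counts, feature_counts)
```

Working in log space avoids multiplying 29 small probabilities together, which could underflow; the ranking of the classes is unchanged because the logarithm is monotonic.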
   
   
   
   
  

In [1]:
# RUN THIS FIRST
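# This function should open a data file in csv, and transform it into a usable format
import csv

def load_data():
    # With ';' as the delimiter, each comma-separated row is kept as a single string;
    # split_data() later splits that string on ','
    with open('student.csv', 'r') as csv_file:
        csv_reader = csv.reader(csv_file, delimiter=';')
        return list(csv_reader)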
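
For comparison, here is a small sketch of how `csv.reader` treats comma-separated text under the two delimiter settings. With the default delimiter each row already arrives as a list of fields, so no later `split(',')` would be needed. The sample text and its column names are made up for illustration, and `io.StringIO` stands in for the real file, whose exact layout is an assumption here.

```python
import csv
import io

# Hypothetical three-column sample standing in for student.csv
sample = "school,sex,Grade\nGP,F,A\nMS,M,B\n"

# Default delimiter (','): each row is a list of separate fields
rows_default = list(csv.reader(io.StringIO(sample)))

# delimiter=';': no ';' in the text, so each row is a one-element list
rows_semicolon = list(csv.reader(io.StringIO(sample), delimiter=';'))
```

Splitting `rows_semicolon[1][0]` on `','` gives the same fields as `rows_default[1]` for simple data like this, but the default reader would also cope with quoted fields that contain commas.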


In [2]:
# RUN THIS SECOND
# This function should split a data set into a training set and hold-out test set

# Import libraries
import csv
import random
import math

# This is where the partition size is chosen
def get_test_size():
    test_size = 65
    return test_size


# Split data method
def split_data():
    # Set "method" to "holdout" for a random hold-out split, or to "cross_validation" for the fold-based split
    method = "holdout"
    data = load_data()
    line_count = 0
    test_size = get_test_size()
    test_data = []
    training_data = []
    index = [1, 65, 130, 195, 260, 325, 390, 455, 520, 585, 650]

    
    
    # PARTITION CONTROL HERE
    if method == "cross_validation":
        partition = 0
        
        # Iterate through the data and separate the instances into their respective lists
        for line in data:
            # Skip first row
            if line_count == 0:
                pass
            elif line_count in range(index[partition], index[partition + 1]):
                x = line[0].split(',')
                test_data.append(x)
            else:
                x = line[0].split(',')
                training_data.append(x)
            line_count = line_count + 1
    
    elif method == "holdout":
        # Draw test_size distinct random row numbers (header row 0 excluded) for the test set
        rands = set(random.sample(range(1, 650), test_size))
        for line in data:
            # Skip first row
            if line_count == 0:
                pass
            elif line_count in rands:
                x = line[0].split(',')
                test_data.append(x)
            else:
                x = line[0].split(',')
                training_data.append(x)
            line_count = line_count + 1
    else:
        raise ValueError("Not a valid input for METHOD in split_data()")

    return training_data, test_data
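
The two partitioning strategies above can also be sketched as small standalone helpers. This is an illustrative alternative rather than the code used in this notebook: `holdout_split` and `kfold_split` are hypothetical names, the rows are assumed to be already parsed with the header removed, and the optional `seed` argument is added only to make the example repeatable.

```python
import random


def holdout_split(rows, test_size, seed=None):
    """Randomly move test_size rows into a test set; the rest is training data."""
    rng = random.Random(seed)
    test_idx = set(rng.sample(range(len(rows)), test_size))
    train = [r for i, r in enumerate(rows) if i not in test_idx]
    test = [r for i, r in enumerate(rows) if i in test_idx]
    return train, test


def kfold_split(rows, k, fold):
    """Return (train, test) where test is the fold-th of k contiguous chunks."""
    # Integer boundaries so the k chunks cover every row exactly once
    bounds = [i * len(rows) // k for i in range(k + 1)]
    start, stop = bounds[fold], bounds[fold + 1]
    return rows[:start] + rows[stop:], rows[start:stop]


# Stand-in for 649 parsed instances (integers instead of real rows)
rows = list(range(649))
train, test = holdout_split(rows, 65, seed=0)
cv_train, cv_test = kfold_split(rows, 10, 0)
```

Computing the fold boundaries from `len(rows)` avoids hard-coding row numbers, so the same helper keeps working if the dataset size changes.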



In [3]:
# Call the split_data() method
data = split_data()
training_data = data[0]


# Function to initialize array of zeros for each countlist
def zerolistmaker(n):
    listofzeros = [0] * n
    return listofzeros

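# count() iterates through the training data and counts the occurrences of each attribute value
# for each of the six possible grades. Each list holds one block of counters per grade,
# e.g. school[0:2] is (GP, MS) for A+, school[2:4] is (GP, MS) for A, and so on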
def count():
    
    school = zerolistmaker(12)
    sex = zerolistmaker(12)
    address = zerolistmaker(12)
    famsize = zerolistmaker(12)
    pstatus = zerolistmaker(12)
    medu = zerolistmaker(24)
    fedu = zerolistmaker(24)
    mjob = zerolistmaker(30)
    fjob = zerolistmaker(30)
    reason = zerolistmaker(24)
    guardian = zerolistmaker(18)
    traveltime = zerolistmaker(30)
    studytime = zerolistmaker(30)
    failures = zerolistmaker(30)
    schoolsup = zerolistmaker(12)
    famsup = zerolistmaker(12)
    paid = zerolistmaker(12)
    activities = zerolistmaker(12)
    nursery = zerolistmaker(12)
    higher = zerolistmaker(12)
    internet = zerolistmaker(12)
    romantic = zerolistmaker(12)
    famrel = zerolistmaker(30)
    freetime = zerolistmaker(30)
    goout = zerolistmaker(30)
    dalc = zerolistmaker(30)
    walc = zerolistmaker(30)
    health = zerolistmaker(30)
    absences = zerolistmaker(30)
    
    for line in training_data:
        # A+ instances
        if line[29] == "A+":
            # count school instances
            if line[0] == "GP":            
                school[0] += 1
            elif line[0] == "MS":
                school[1] += 1
            else:
                print("Error in count() 1")
            
            # count sex instances
            if line[1] == "F":
                sex[0] += 1
            elif line[1] == "M":
                sex[1] += 1
            else:
                print("Error in count() 2")
                
            # count address instances
            if line[2] == "U":
                address[0] += 1
            elif line[2] == "R":
                address[1] += 1
            else:
                print("Error in count() 3")
            
            # count famsize instances
            if line[3] == "LE3":
                famsize[0] += 1
            elif line[3] == "GT3":
                famsize[1] += 1
            else:
                print("Error in count() 4")

            # count pstatus instances
            if line[4] == "T":
                pstatus[0] += 1
            elif line[4] == "A":
                pstatus[1] += 1
            else:
                print("Error in count() 5")
            
            # count medu instances
            if line[5] == "low":
                medu[0] += 1
            elif line[5] == "none":
                medu[1] += 1  
            elif line[5] == "mid":
                medu[2] += 1
            elif line[5] == "high":
                medu[3] += 1 
            else:
                print("Error in count() 6")
            
            # count fedu instances
            if line[6] == "low":
                fedu[0] += 1
            elif line[6] == "none":
                fedu[1] += 1  
            elif line[6] == "mid":
                fedu[2] += 1
            elif line[6] == "high":
                fedu[3] += 1 
            else:
                print("Error in count() 7")
            
            # count mjob instances
            if line[7] == "teacher":
                mjob[0] += 1
            elif line[7] == "health":
                mjob[1] += 1  
            elif line[7] == "services":
                mjob[2] += 1
            elif line[7] == "at_home":
                mjob[3] += 1 
            elif line[7] == "other":
                mjob[4] += 1
            else:
                print("Error in count() 8")
                
            # count fjob instances
            if line[8] == "teacher":
                fjob[0] += 1
            elif line[8] == "health":
                fjob[1] += 1  
            elif line[8] == "services":
                fjob[2] += 1
            elif line[8] == "at_home":
                fjob[3] += 1 
            elif line[8] == "other":
                fjob[4] += 1
            else:
                print("Error in count() 9")
                
            # count reason instances
            if line[9] == "home":
                reason[0] += 1
            elif line[9] == "reputation":
                reason[1] += 1  
            elif line[9] == "course":
                reason[2] += 1
            elif line[9] == "other":
                reason[3] += 1
            else:
                print("Error in count() 10")
                
            # count guardian instances
            if line[10] == "mother":
                guardian[0] += 1
            elif line[10] == "father":
                guardian[1] += 1  
            elif line[10] == "other":
                guardian[2] += 1
            else:
                print("Error in count() 11")
            
            # count traveltime instances
            if line[11] == "none":
                traveltime[0] += 1
            elif line[11] == "low":
                traveltime[1] += 1  
            elif line[11] == "medium":
                traveltime[2] += 1
            elif line[11] == "high":
                traveltime[3] += 1 
            elif line[11] == "very_high":
                traveltime[4] += 1
            else:
                print("Error in count() 12")
           
            # count studytime instances
            if line[12] == "none":
                studytime[0] += 1
            elif line[12] == "low":
                studytime[1] += 1  
            elif line[12] == "medium":
                studytime[2] += 1
            elif line[12] == "high":
                studytime[3] += 1 
            elif line[12] == "very_high":
                studytime[4] += 1
            else:
                print("Error in count() 13")

            # count failures instances
            if line[13] == "none":
                failures[0] += 1
            elif line[13] == "low":
                failures[1] += 1  
            elif line[13] == "medium":
                failures[2] += 1
            elif line[13] == "high":
                failures[3] += 1 
            elif line[13] == "very_high":
                failures[4] += 1
            else:
                print("Error in count() 14")
            
            # count schoolsup instances
            if line[14] == "yes":
                schoolsup[0] += 1
            elif line[14] == "no":
                schoolsup[1] += 1
            else:
                print("Error in count() 15")

            # count famsup instances
            if line[15] == "yes":
                famsup[0] += 1
            elif line[15] == "no":
                famsup[1] += 1
            else:
                print("Error in count() 16")

            # count paid instances
            if line[16] == "yes":
                paid[0] += 1
            elif line[16] == "no":
                paid[1] += 1
            else:
                print("Error in count() 17")
                
            # count activities instances
            if line[17] == "yes":
                activities[0] += 1
            elif line[17] == "no":
                activities[1] += 1
            else:
                print("Error in count() 18")
                
            # count nursery instances
            if line[18] == "yes":
                nursery[0] += 1
            elif line[18] == "no":
                nursery[1] += 1
            else:
                print("Error in count() 19")
                
            # count higher instances
            if line[19] == "yes":
                higher[0] += 1
            elif line[19] == "no":
                higher[1] += 1
            else:
                print("Error in count() 20")
                
            # count internet instances
            if line[20] == "yes":
                internet[0] += 1
            elif line[20] == "no":
                internet[1] += 1
            else:
                print("Error in count() 21")
                
            # count romantic instances
            if line[21] == "yes":
                romantic[0] += 1
            elif line[21] == "no":
                romantic[1] += 1
            else:
                print("Error in count() 22")
             
            # count famrel instances
            if line[22] == "1":
                famrel[0] += 1
            elif line[22] == "2":
                famrel[1] += 1  
            elif line[22] == "3":
                famrel[2] += 1
            elif line[22] == "4":
                famrel[3] += 1 
            elif line[22] == "5":
                famrel[4] += 1
            else:
                print("Error in count() 23")
 
            # count freetime instances
            if line[23] == "1":
                freetime[0] += 1
            elif line[23] == "2":
                freetime[1] += 1  
            elif line[23] == "3":
                freetime[2] += 1
            elif line[23] == "4":
                freetime[3] += 1 
            elif line[23] == "5":
                freetime[4] += 1
            else:
                print("Error in count() 24")

            # count goout instances
            if line[24] == "1":
                goout[0] += 1
            elif line[24] == "2":
                goout[1] += 1  
            elif line[24] == "3":
                goout[2] += 1
            elif line[24] == "4":
                goout[3] += 1 
            elif line[24] == "5":
                goout[4] += 1
            else:
                print("Error in count() 25")
            
            # count dalc instances
            if line[25] == "1":
                dalc[0] += 1
            elif line[25] == "2":
                dalc[1] += 1  
            elif line[25] == "3":
                dalc[2] += 1
            elif line[25] == "4":
                dalc[3] += 1 
            elif line[25] == "5":
                dalc[4] += 1 
            else:
                print("Error in count() 26")
            
            # count walc instances
            if line[26] == "1":
                walc[0] += 1
            elif line[26] == "2":
                walc[1] += 1  
            elif line[26] == "3":
                walc[2] += 1
            elif line[26] == "4":
                walc[3] += 1 
            elif line[26] == "5":
                walc[4] += 1
            else:
                print("Error in count() 27")
            
            # count health instances
            if line[27] == "1":
                health[0] += 1
            elif line[27] == "2":
                health[1] += 1  
            elif line[27] == "3":
                health[2] += 1
            elif line[27] == "4":
                health[3] += 1 
            elif line[27] == "5":
                health[4] += 1
            else:
                print("Error in count() 28")
            
            # count absences instances
            if line[28] == "none":
                absences[0] += 1
            elif line[28] == "one_to_three":
                absences[1] += 1  
            elif line[28] == "four_to_six":
                absences[2] += 1
            elif line[28] == "seven_to_ten":
                absences[3] += 1 
            elif line[28] == "more_than_ten":
                absences[4] += 1
            else:
                print("Error in count() 29")
        
        
        # A instances
        elif line[29] == "A":
            # count school instances
            if line[0] == "GP":
                school[2] += 1
            elif line[0] == "MS":
                school[3] += 1
            else:
                print("Error in count() 30")
            
            # count sex instances
            if line[1] == "F":
                sex[2] += 1
            elif line[1] == "M":
                sex[3] += 1
            else:
                print("Error in count() 31")
            
            # count address instances
            if line[2] == "U":
                address[2] += 1
            elif line[2] == "R":
                address[3] += 1
            
            # count famsize instances
            if line[3] == "LE3":
                famsize[2] += 1
            elif line[3] == "GT3":
                famsize[3] += 1

            # count pstatus instances
            if line[4] == "T":
                pstatus[2] += 1
            elif line[4] == "A":
                pstatus[3] += 1
            
            # count medu instances
            if line[5] == "low":
                medu[4] += 1
            elif line[5] == "none":
                medu[5] += 1  
            elif line[5] == "mid":
                medu[6] += 1
            elif line[5] == "high":
                medu[7] += 1              
            
            # count fedu instances
            if line[6] == "low":
                fedu[4] += 1
            elif line[6] == "none":
                fedu[5] += 1  
            elif line[6] == "mid":
                fedu[6] += 1
            elif line[6] == "high":
                fedu[7] += 1              
            
            # count mjob instances
            if line[7] == "teacher":
                mjob[5] += 1
            elif line[7] == "health":
                mjob[6] += 1  
            elif line[7] == "services":
                mjob[7] += 1
            elif line[7] == "at_home":
                mjob[8] += 1 
            elif line[7] == "other":
                mjob[9] += 1     
                
            # count fjob instances
            if line[8] == "teacher":
                fjob[5] += 1
            elif line[8] == "health":
                fjob[6] += 1  
            elif line[8] == "services":
                fjob[7] += 1
            elif line[8] == "at_home":
                fjob[8] += 1 
            elif line[8] == "other":
                fjob[9] += 1     
                
            # count reason instances
            if line[9] == "home":
                reason[4] += 1
            elif line[9] == "reputation":
                reason[5] += 1  
            elif line[9] == "course":
                reason[6] += 1
            elif line[9] == "other":
                reason[7] += 1 
                
            # count guardian instances
            if line[10] == "mother":
                guardian[3] += 1
            elif line[10] == "father":
                guardian[4] += 1  
            elif line[10] == "other":
                guardian[5] += 1      
            
            # count traveltime instances
            if line[11] == "none":
                traveltime[5] += 1
            elif line[11] == "low":
                traveltime[6] += 1  
            elif line[11] == "medium":
                traveltime[7] += 1
            elif line[11] == "high":
                traveltime[8] += 1 
            elif line[11] == "very_high":
                traveltime[9] += 1     
           
            # count studytime instances
            if line[12] == "none":
                studytime[5] += 1
            elif line[12] == "low":
                studytime[6] += 1  
            elif line[12] == "medium":
                studytime[7] += 1
            elif line[12] == "high":
                studytime[8] += 1 
            elif line[12] == "very_high":
                studytime[9] += 1     

            # count failures instances
            if line[13] == "none":
                failures[5] += 1
            elif line[13] == "low":
                failures[6] += 1  
            elif line[13] == "medium":
                failures[7] += 1
            elif line[13] == "high":
                failures[8] += 1 
            elif line[13] == "very_high":
                failures[9] += 1     
            
            # count schoolsup instances
            if line[14] == "yes":
                schoolsup[2] += 1
            elif line[14] == "no":
                schoolsup[3] += 1

            # count famsup instances
            if line[15] == "yes":
                famsup[2] += 1
            elif line[15] == "no":
                famsup[3] += 1

            # count paid instances
            if line[16] == "yes":
                paid[2] += 1
            elif line[16] == "no":
                paid[3] += 1
                
            # count activities instances
            if line[17] == "yes":
                activities[2] += 1
            elif line[17] == "no":
                activities[3] += 1
                
            # count nursery instances
            if line[18] == "yes":
                nursery[2] += 1
            elif line[18] == "no":
                nursery[3] += 1
                
            # count higher instances
            if line[19] == "yes":
                higher[2] += 1
            elif line[19] == "no":
                higher[3] += 1
                
            # count internet instances
            if line[20] == "yes":
                internet[2] += 1
            elif line[20] == "no":
                internet[3] += 1
                
            # count romantic instances
            if line[21] == "yes":
                romantic[2] += 1
            elif line[21] == "no":
                romantic[3] += 1
             
            # count famrel instances
            if line[22] == "1":
                famrel[5] += 1
            elif line[22] == "2":
                famrel[6] += 1  
            elif line[22] == "3":
                famrel[7] += 1
            elif line[22] == "4":
                famrel[8] += 1 
            elif line[22] == "5":
                famrel[9] += 1     
 
            # count freetime instances
            if line[23] == "1":
                freetime[5] += 1
            elif line[23] == "2":
                freetime[6] += 1  
            elif line[23] == "3":
                freetime[7] += 1
            elif line[23] == "4":
                freetime[8] += 1 
            elif line[23] == "5":
                freetime[9] += 1     

            # count goout instances
            if line[24] == "1":
                goout[5] += 1
            elif line[24] == "2":
                goout[6] += 1  
            elif line[24] == "3":
                goout[7] += 1
            elif line[24] == "4":
                goout[8] += 1 
            elif line[24] == "5":
                goout[9] += 1     
            
            # count dalc instances
            if line[25] == "1":
                dalc[5] += 1
            elif line[25] == "2":
                dalc[6] += 1  
            elif line[25] == "3":
                dalc[7] += 1
            elif line[25] == "4":
                dalc[8] += 1 
            elif line[25] == "5":
                dalc[9] += 1     
            
            # count walc instances
            if line[26] == "1":
                walc[5] += 1
            elif line[26] == "2":
                walc[6] += 1  
            elif line[26] == "3":
                walc[7] += 1
            elif line[26] == "4":
                walc[8] += 1 
            elif line[26] == "5":
                walc[9] += 1     
            
            # count health instances
            if line[27] == "1":
                health[5] += 1
            elif line[27] == "2":
                health[6] += 1  
            elif line[27] == "3":
                health[7] += 1
            elif line[27] == "4":
                health[8] += 1 
            elif line[27] == "5":
                health[9] += 1     
            
            # count absences instances
            if line[28] == "none":
                absences[5] += 1
            elif line[28] == "one_to_three":
                absences[6] += 1  
            elif line[28] == "four_to_six":
                absences[7] += 1
            elif line[28] == "seven_to_ten":
                absences[8] += 1 
            elif line[28] == "more_than_ten":
                absences[9] += 1     
                
        # B instances
        elif line[29] == "B":
            # count school instances
            if line[0] == "GP":
                school[4] += 1
            elif line[0] == "MS":
                school[5] += 1
            
            # count sex instances
            if line[1] == "F":
                sex[4] += 1
            elif line[1] == "M":
                sex[5] += 1
            
            # count address instances
            if line[2] == "U":
                address[4] += 1
            elif line[2] == "R":
                address[5] += 1
            
            # count famsize instances
            if line[3] == "LE3":
                famsize[4] += 1
            elif line[3] == "GT3":
                famsize[5] += 1

            # count pstatus instances
            if line[4] == "T":
                pstatus[4] += 1
            elif line[4] == "A":
                pstatus[5] += 1
            
            # count medu instances
            if line[5] == "low":
                medu[8] += 1
            elif line[5] == "none":
                medu[9] += 1  
            elif line[5] == "mid":
                medu[10] += 1
            elif line[5] == "high":
                medu[11] += 1              
            
            # count fedu instances
            if line[6] == "low":
                fedu[8] += 1
            elif line[6] == "none":
                fedu[9] += 1  
            elif line[6] == "mid":
                fedu[10] += 1
            elif line[6] == "high":
                fedu[11] += 1              
            
            # count mjob instances
            if line[7] == "teacher":
                mjob[10] += 1
            elif line[7] == "health":
                mjob[11] += 1  
            elif line[7] == "services":
                mjob[12] += 1
            elif line[7] == "at_home":
                mjob[13] += 1 
            elif line[7] == "other":
                mjob[14] += 1     
                
            # count fjob instances
            if line[8] == "teacher":
                fjob[10] += 1
            elif line[8] == "health":
                fjob[11] += 1  
            elif line[8] == "services":
                fjob[12] += 1
            elif line[8] == "at_home":
                fjob[13] += 1 
            elif line[8] == "other":
                fjob[14] += 1     
                
            # count reason instances
            if line[9] == "home":
                reason[8] += 1
            elif line[9] == "reputation":
                reason[9] += 1  
            elif line[9] == "course":
                reason[10] += 1
            elif line[9] == "other":
                reason[11] += 1 
                
            # count guardian instances
            if line[10] == "mother":
                guardian[6] += 1
            elif line[10] == "father":
                guardian[7] += 1  
            elif line[10] == "other":
                guardian[8] += 1      
            
            # count traveltime instances
            if line[11] == "none":
                traveltime[10] += 1
            elif line[11] == "low":
                traveltime[11] += 1  
            elif line[11] == "medium":
                traveltime[12] += 1
            elif line[11] == "high":
                traveltime[13] += 1 
            elif line[11] == "very_high":
                traveltime[14] += 1     
           
            # count studytime instances
            if line[12] == "none":
                studytime[10] += 1
            elif line[12] == "low":
                studytime[11] += 1  
            elif line[12] == "medium":
                studytime[12] += 1
            elif line[12] == "high":
                studytime[13] += 1 
            elif line[12] == "very_high":
                studytime[14] += 1     

            # count failures instances
            if line[13] == "none":
                failures[10] += 1
            elif line[13] == "low":
                failures[11] += 1  
            elif line[13] == "medium":
                failures[12] += 1
            elif line[13] == "high":
                failures[13] += 1 
            elif line[13] == "very_high":
                failures[14] += 1     
            
            # count schoolsup instances
            if line[14] == "yes":
                schoolsup[4] += 1
            elif line[14] == "no":
                schoolsup[5] += 1

            # count famsup instances
            if line[15] == "yes":
                famsup[4] += 1
            elif line[15] == "no":
                famsup[5] += 1

            # count paid instances
            if line[16] == "yes":
                paid[4] += 1
            elif line[16] == "no":
                paid[5] += 1
                
            # count activities instances
            if line[17] == "yes":
                activities[4] += 1
            elif line[17] == "no":
                activities[5] += 1
                
            # count nursery instances
            if line[18] == "yes":
                nursery[4] += 1
            elif line[18] == "no":
                nursery[5] += 1
                
            # count higher instances
            if line[19] == "yes":
                higher[4] += 1
            elif line[19] == "no":
                higher[5] += 1
                
            # count internet instances
            if line[20] == "yes":
                internet[4] += 1
            elif line[20] == "no":
                internet[5] += 1
                
            # count romantic instances
            if line[21] == "yes":
                romantic[4] += 1
            elif line[21] == "no":
                romantic[5] += 1
             
            # count famrel instances
            if line[22] == "1":
                famrel[10] += 1
            elif line[22] == "2":
                famrel[11] += 1  
            elif line[22] == "3":
                famrel[12] += 1
            elif line[22] == "4":
                famrel[13] += 1 
            elif line[22] == "5":
                famrel[14] += 1     
 
            # count freetime instances
            if line[23] == "1":
                freetime[10] += 1
            elif line[23] == "2":
                freetime[11] += 1  
            elif line[23] == "3":
                freetime[12] += 1
            elif line[23] == "4":
                freetime[13] += 1 
            elif line[23] == "5":
                freetime[14] += 1     

            # count goout instances
            if line[24] == "1":
                goout[10] += 1
            elif line[24] == "2":
                goout[11] += 1  
            elif line[24] == "3":
                goout[12] += 1
            elif line[24] == "4":
                goout[13] += 1 
            elif line[24] == "5":
                goout[14] += 1     
            
            # count dalc instances
            if line[25] == "1":
                dalc[10] += 1
            elif line[25] == "2":
                dalc[11] += 1  
            elif line[25] == "3":
                dalc[12] += 1
            elif line[25] == "4":
                dalc[13] += 1 
            elif line[25] == "5":
                dalc[14] += 1     
            
            # count walc instances
            if line[26] == "1":
                walc[10] += 1
            elif line[26] == "2":
                walc[11] += 1  
            elif line[26] == "3":
                walc[12] += 1
            elif line[26] == "4":
                walc[13] += 1 
            elif line[26] == "5":
                walc[14] += 1     
            
            # count health instances
            if line[27] == "1":
                health[10] += 1
            elif line[27] == "2":
                health[11] += 1  
            elif line[27] == "3":
                health[12] += 1
            elif line[27] == "4":
                health[13] += 1 
            elif line[27] == "5":
                health[14] += 1     
            
            # count absences instances
            if line[28] == "none":
                absences[10] += 1
            elif line[28] == "one_to_three":
                absences[11] += 1  
            elif line[28] == "four_to_six":
                absences[12] += 1
            elif line[28] == "seven_to_ten":
                absences[13] += 1 
            elif line[28] == "more_than_ten":
                absences[14] += 1     
                
        # C instances
        elif line[29] == "C":
            # count school instances
            if line[0] == "GP":
                school[6] += 1
            elif line[0] == "MS":
                school[7] += 1
            
            # count sex instances
            if line[1] == "F":
                sex[6] += 1
            elif line[1] == "M":
                sex[7] += 1
            
            # count address instances
            if line[2] == "U":
                address[6] += 1
            elif line[2] == "R":
                address[7] += 1
            
            # count famsize instances
            if line[3] == "LE3":
                famsize[6] += 1
            elif line[3] == "GT3":
                famsize[7] += 1

            # count pstatus instances
            if line[4] == "T":
                pstatus[6] += 1
            elif line[4] == "A":
                pstatus[7] += 1
            
            # count medu instances
            if line[5] == "low":
                medu[12] += 1
            elif line[5] == "none":
                medu[13] += 1  
            elif line[5] == "mid":
                medu[14] += 1
            elif line[5] == "high":
                medu[15] += 1              
            
            # count fedu instances
            if line[6] == "low":
                fedu[12] += 1
            elif line[6] == "none":
                fedu[13] += 1  
            elif line[6] == "mid":
                fedu[14] += 1
            elif line[6] == "high":
                fedu[15] += 1              
            
            # count mjob instances
            if line[7] == "teacher":
                mjob[15] += 1
            elif line[7] == "health":
                mjob[16] += 1  
            elif line[7] == "services":
                mjob[17] += 1
            elif line[7] == "at_home":
                mjob[18] += 1 
            elif line[7] == "other":
                mjob[19] += 1     
                
            # count fjob instances
            if line[8] == "teacher":
                fjob[15] += 1
            elif line[8] == "health":
                fjob[16] += 1  
            elif line[8] == "services":
                fjob[17] += 1
            elif line[8] == "at_home":
                fjob[18] += 1 
            elif line[8] == "other":
                fjob[19] += 1     
                
            # count reason instances
            if line[9] == "home":
                reason[12] += 1
            elif line[9] == "reputation":
                reason[13] += 1  
            elif line[9] == "course":
                reason[14] += 1
            elif line[9] == "other":
                reason[15] += 1 
                
            # count guardian instances
            if line[10] == "mother":
                guardian[9] += 1
            elif line[10] == "father":
                guardian[10] += 1  
            elif line[10] == "other":
                guardian[11] += 1      
            
            # count traveltime instances
            if line[11] == "none":
                traveltime[15] += 1
            elif line[11] == "low":
                traveltime[16] += 1  
            elif line[11] == "medium":
                traveltime[17] += 1
            elif line[11] == "high":
                traveltime[18] += 1 
            elif line[11] == "very_high":
                traveltime[19] += 1     
           
            # count studytime instances
            if line[12] == "none":
                studytime[15] += 1
            elif line[12] == "low":
                studytime[16] += 1  
            elif line[12] == "medium":
                studytime[17] += 1
            elif line[12] == "high":
                studytime[18] += 1 
            elif line[12] == "very_high":
                studytime[19] += 1     

            # count failures instances
            if line[13] == "none":
                failures[15] += 1
            elif line[13] == "low":
                failures[16] += 1  
            elif line[13] == "medium":
                failures[17] += 1
            elif line[13] == "high":
                failures[18] += 1 
            elif line[13] == "very_high":
                failures[19] += 1     
            
            # count schoolsup instances
            if line[14] == "yes":
                schoolsup[6] += 1
            elif line[14] == "no":
                schoolsup[7] += 1

            # count famsup instances
            if line[15] == "yes":
                famsup[6] += 1
            elif line[15] == "no":
                famsup[7] += 1

            # count paid instances
            if line[16] == "yes":
                paid[6] += 1
            elif line[16] == "no":
                paid[7] += 1
                
            # count activities instances
            if line[17] == "yes":
                activities[6] += 1
            elif line[17] == "no":
                activities[7] += 1
                
            # count nursery instances
            if line[18] == "yes":
                nursery[6] += 1
            elif line[18] == "no":
                nursery[7] += 1
                
            # count higher instances
            if line[19] == "yes":
                higher[6] += 1
            elif line[19] == "no":
                higher[7] += 1
                
            # count internet instances
            if line[20] == "yes":
                internet[6] += 1
            elif line[20] == "no":
                internet[7] += 1
                
            # count romantic instances
            if line[21] == "yes":
                romantic[6] += 1
            elif line[21] == "no":
                romantic[7] += 1
             
            # count famrel instances
            if line[22] == "1":
                famrel[15] += 1
            elif line[22] == "2":
                famrel[16] += 1  
            elif line[22] == "3":
                famrel[17] += 1
            elif line[22] == "4":
                famrel[18] += 1 
            elif line[22] == "5":
                famrel[19] += 1     
 
            # count freetime instances
            if line[23] == "1":
                freetime[15] += 1
            elif line[23] == "2":
                freetime[16] += 1  
            elif line[23] == "3":
                freetime[17] += 1
            elif line[23] == "4":
                freetime[18] += 1 
            elif line[23] == "5":
                freetime[19] += 1     

            # count goout instances
            if line[24] == "1":
                goout[15] += 1
            elif line[24] == "2":
                goout[16] += 1  
            elif line[24] == "3":
                goout[17] += 1
            elif line[24] == "4":
                goout[18] += 1 
            elif line[24] == "5":
                goout[19] += 1     
            
            # count dalc instances
            if line[25] == "1":
                dalc[15] += 1
            elif line[25] == "2":
                dalc[16] += 1  
            elif line[25] == "3":
                dalc[17] += 1
            elif line[25] == "4":
                dalc[18] += 1 
            elif line[25] == "5":
                dalc[19] += 1     
            
            # count walc instances
            if line[26] == "1":
                walc[15] += 1
            elif line[26] == "2":
                walc[16] += 1  
            elif line[26] == "3":
                walc[17] += 1
            elif line[26] == "4":
                walc[18] += 1 
            elif line[26] == "5":
                walc[19] += 1     
            
            # count health instances
            if line[27] == "1":
                health[15] += 1
            elif line[27] == "2":
                health[16] += 1  
            elif line[27] == "3":
                health[17] += 1
            elif line[27] == "4":
                health[18] += 1 
            elif line[27] == "5":
                health[19] += 1     
            
            # count absences instances
            if line[28] == "none":
                absences[15] += 1
            elif line[28] == "one_to_three":
                absences[16] += 1  
            elif line[28] == "four_to_six":
                absences[17] += 1
            elif line[28] == "seven_to_ten":
                absences[18] += 1 
            elif line[28] == "more_than_ten":
                absences[19] += 1     
                
     
    
    
        
        
        
        # D instances
        elif line[29] == "D":
            # count school instances
            if line[0] == "GP":
                school[8] += 1
            elif line[0] == "MS":
                school[9] += 1
            
            # count sex instances
            if line[1] == "F":
                sex[8] += 1
            elif line[1] == "M":
                sex[9] += 1
            
            # count address instances
            if line[2] == "U":
                address[8] += 1
            elif line[2] == "R":
                address[9] += 1
            
            # count famsize instances
            if line[3] == "LE3":
                famsize[8] += 1
            elif line[3] == "GT3":
                famsize[9] += 1

            # count pstatus instances
            if line[4] == "T":
                pstatus[8] += 1
            elif line[4] == "A":
                pstatus[9] += 1
            
            # count medu instances
            if line[5] == "low":
                medu[16] += 1
            elif line[5] == "none":
                medu[17] += 1  
            elif line[5] == "mid":
                medu[18] += 1
            elif line[5] == "high":
                medu[19] += 1              
            
            # count fedu instances
            if line[6] == "low":
                fedu[16] += 1
            elif line[6] == "none":
                fedu[17] += 1  
            elif line[6] == "mid":
                fedu[18] += 1
            elif line[6] == "high":
                fedu[19] += 1              
            
            # count mjob instances
            if line[7] == "teacher":
                mjob[20] += 1
            elif line[7] == "health":
                mjob[21] += 1  
            elif line[7] == "services":
                mjob[22] += 1
            elif line[7] == "at_home":
                mjob[23] += 1 
            elif line[7] == "other":
                mjob[24] += 1     
                
            # count fjob instances
            if line[8] == "teacher":
                fjob[20] += 1
            elif line[8] == "health":
                fjob[21] += 1  
            elif line[8] == "services":
                fjob[22] += 1
            elif line[8] == "at_home":
                fjob[23] += 1 
            elif line[8] == "other":
                fjob[24] += 1     
                
            # count reason instances
            if line[9] == "home":
                reason[16] += 1
            elif line[9] == "reputation":
                reason[17] += 1  
            elif line[9] == "course":
                reason[18] += 1
            elif line[9] == "other":
                reason[19] += 1 
                
            # count guardian instances
            if line[10] == "mother":
                guardian[12] += 1
            elif line[10] == "father":
                guardian[13] += 1  
            elif line[10] == "other":
                guardian[14] += 1      
            
            # count traveltime instances
            if line[11] == "none":
                traveltime[20] += 1
            elif line[11] == "low":
                traveltime[21] += 1  
            elif line[11] == "medium":
                traveltime[22] += 1
            elif line[11] == "high":
                traveltime[23] += 1 
            elif line[11] == "very_high":
                traveltime[24] += 1     
           
            # count studytime instances
            if line[12] == "none":
                studytime[20] += 1
            elif line[12] == "low":
                studytime[21] += 1  
            elif line[12] == "medium":
                studytime[22] += 1
            elif line[12] == "high":
                studytime[23] += 1 
            elif line[12] == "very_high":
                studytime[24] += 1     

            # count failures instances
            if line[13] == "none":
                failures[20] += 1
            elif line[13] == "low":
                failures[21] += 1  
            elif line[13] == "medium":
                failures[22] += 1
            elif line[13] == "high":
                failures[23] += 1 
            elif line[13] == "very_high":
                failures[24] += 1     
            
            # count schoolsup instances
            if line[14] == "yes":
                schoolsup[8] += 1
            elif line[14] == "no":
                schoolsup[9] += 1

            # count famsup instances
            if line[15] == "yes":
                famsup[8] += 1
            elif line[15] == "no":
                famsup[9] += 1

            # count paid instances
            if line[16] == "yes":
                paid[8] += 1
            elif line[16] == "no":
                paid[9] += 1
                
            # count activities instances
            if line[17] == "yes":
                activities[8] += 1
            elif line[17] == "no":
                activities[9] += 1
                
            # count nursery instances
            if line[18] == "yes":
                nursery[8] += 1
            elif line[18] == "no":
                nursery[9] += 1
                
            # count higher instances
            if line[19] == "yes":
                higher[8] += 1
            elif line[19] == "no":
                higher[9] += 1
                
            # count internet instances
            if line[20] == "yes":
                internet[8] += 1
            elif line[20] == "no":
                internet[9] += 1
                
            # count romantic instances
            if line[21] == "yes":
                romantic[8] += 1
            elif line[21] == "no":
                romantic[9] += 1
             
            # count famrel instances
            if line[22] == "1":
                famrel[20] += 1
            elif line[22] == "2":
                famrel[21] += 1  
            elif line[22] == "3":
                famrel[22] += 1
            elif line[22] == "4":
                famrel[23] += 1 
            elif line[22] == "5":
                famrel[24] += 1     
 
            # count freetime instances
            if line[23] == "1":
                freetime[20] += 1
            elif line[23] == "2":
                freetime[21] += 1  
            elif line[23] == "3":
                freetime[22] += 1
            elif line[23] == "4":
                freetime[23] += 1 
            elif line[23] == "5":
                freetime[24] += 1     

            # count goout instances
            if line[24] == "1":
                goout[20] += 1
            elif line[24] == "2":
                goout[21] += 1  
            elif line[24] == "3":
                goout[22] += 1
            elif line[24] == "4":
                goout[23] += 1 
            elif line[24] == "5":
                goout[24] += 1     
            
            # count dalc instances
            if line[25] == "1":
                dalc[20] += 1
            elif line[25] == "2":
                dalc[21] += 1  
            elif line[25] == "3":
                dalc[22] += 1
            elif line[25] == "4":
                dalc[23] += 1 
            elif line[25] == "5":
                dalc[24] += 1     
            
            # count walc instances
            if line[26] == "1":
                walc[20] += 1
            elif line[26] == "2":
                walc[21] += 1  
            elif line[26] == "3":
                walc[22] += 1
            elif line[26] == "4":
                walc[23] += 1 
            elif line[26] == "5":
                walc[24] += 1     
            
            # count health instances
            if line[27] == "1":
                health[20] += 1
            elif line[27] == "2":
                health[21] += 1  
            elif line[27] == "3":
                health[22] += 1
            elif line[27] == "4":
                health[23] += 1 
            elif line[27] == "5":
                health[24] += 1     
            
            # count absences instances
            if line[28] == "none":
                absences[20] += 1
            elif line[28] == "one_to_three":
                absences[21] += 1  
            elif line[28] == "four_to_six":
                absences[22] += 1
            elif line[28] == "seven_to_ten":
                absences[23] += 1 
            elif line[28] == "more_than_ten":
                absences[24] += 1     
                
                
                
                
        # F instances
        elif line[29] == "F":
            # count school instances
            if line[0] == "GP":
                school[10] += 1
            elif line[0] == "MS":
                school[11] += 1
            
            # count sex instances
            if line[1] == "F":
                sex[10] += 1
            elif line[1] == "M":
                sex[11] += 1
            
            # count address instances
            if line[2] == "U":
                address[10] += 1
            elif line[2] == "R":
                address[11] += 1
            
            # count famsize instances
            if line[3] == "LE3":
                famsize[10] += 1
            elif line[3] == "GT3":
                famsize[11] += 1
            else:
                print("Error in count(): unexpected famsize value", line[3])

            # count pstatus instances
            if line[4] == "T":
                pstatus[10] += 1
            elif line[4] == "A":
                pstatus[11] += 1
            
            # count medu instances
            if line[5] == "low":
                medu[20] += 1
            elif line[5] == "none":
                medu[21] += 1  
            elif line[5] == "mid":
                medu[22] += 1
            elif line[5] == "high":
                medu[23] += 1              
            
            # count fedu instances
            if line[6] == "low":
                fedu[20] += 1
            elif line[6] == "none":
                fedu[21] += 1  
            elif line[6] == "mid":
                fedu[22] += 1
            elif line[6] == "high":
                fedu[23] += 1              
            
            # count mjob instances
            if line[7] == "teacher":
                mjob[25] += 1
            elif line[7] == "health":
                mjob[26] += 1  
            elif line[7] == "services":
                mjob[27] += 1
            elif line[7] == "at_home":
                mjob[28] += 1 
            elif line[7] == "other":
                mjob[29] += 1     
                
            # count fjob instances
            if line[8] == "teacher":
                fjob[25] += 1
            elif line[8] == "health":
                fjob[26] += 1  
            elif line[8] == "services":
                fjob[27] += 1
            elif line[8] == "at_home":
                fjob[28] += 1 
            elif line[8] == "other":
                fjob[29] += 1     
                
            # count reason instances
            if line[9] == "home":
                reason[20] += 1
            elif line[9] == "reputation":
                reason[21] += 1  
            elif line[9] == "course":
                reason[22] += 1
            elif line[9] == "other":
                reason[23] += 1 
                
            # count guardian instances
            if line[10] == "mother":
                guardian[15] += 1
            elif line[10] == "father":
                guardian[16] += 1  
            elif line[10] == "other":
                guardian[17] += 1      
            
            # count traveltime instances
            if line[11] == "none":
                traveltime[25] += 1
            elif line[11] == "low":
                traveltime[26] += 1  
            elif line[11] == "medium":
                traveltime[27] += 1
            elif line[11] == "high":
                traveltime[28] += 1 
            elif line[11] == "very_high":
                traveltime[29] += 1     
           
            # count studytime instances
            if line[12] == "none":
                studytime[25] += 1
            elif line[12] == "low":
                studytime[26] += 1  
            elif line[12] == "medium":
                studytime[27] += 1
            elif line[12] == "high":
                studytime[28] += 1 
            elif line[12] == "very_high":
                studytime[29] += 1     

            # count failures instances
            if line[13] == "none":
                failures[25] += 1
            elif line[13] == "low":
                failures[26] += 1  
            elif line[13] == "medium":
                failures[27] += 1
            elif line[13] == "high":
                failures[28] += 1 
            elif line[13] == "very_high":
                failures[29] += 1     
            
            # count schoolsup instances
            if line[14] == "yes":
                schoolsup[10] += 1
            elif line[14] == "no":
                schoolsup[11] += 1

            # count famsup instances
            if line[15] == "yes":
                famsup[10] += 1
            elif line[15] == "no":
                famsup[11] += 1

            # count paid instances
            if line[16] == "yes":
                paid[10] += 1
            elif line[16] == "no":
                paid[11] += 1
                
            # count activities instances
            if line[17] == "yes":
                activities[10] += 1
            elif line[17] == "no":
                activities[11] += 1
                
            # count nursery instances
            if line[18] == "yes":
                nursery[10] += 1
            elif line[18] == "no":
                nursery[11] += 1
                
            # count higher instances
            if line[19] == "yes":
                higher[10] += 1
            elif line[19] == "no":
                higher[11] += 1
                
            # count internet instances
            if line[20] == "yes":
                internet[10] += 1
            elif line[20] == "no":
                internet[11] += 1
            else:
                print("Error in count(): unexpected internet value", line[20])
                
            # count romantic instances
            if line[21] == "yes":
                romantic[10] += 1
            elif line[21] == "no":
                romantic[11] += 1
             
            # count famrel instances
            if line[22] == "1":
                famrel[25] += 1
            elif line[22] == "2":
                famrel[26] += 1  
            elif line[22] == "3":
                famrel[27] += 1
            elif line[22] == "4":
                famrel[28] += 1 
            elif line[22] == "5":
                famrel[29] += 1     
 
            # count freetime instances
            if line[23] == "1":
                freetime[25] += 1
            elif line[23] == "2":
                freetime[26] += 1  
            elif line[23] == "3":
                freetime[27] += 1
            elif line[23] == "4":
                freetime[28] += 1 
            elif line[23] == "5":
                freetime[29] += 1     

            # count goout instances
            if line[24] == "1":
                goout[25] += 1
            elif line[24] == "2":
                goout[26] += 1  
            elif line[24] == "3":
                goout[27] += 1
            elif line[24] == "4":
                goout[28] += 1 
            elif line[24] == "5":
                goout[29] += 1     
            
            # count dalc instances
            if line[25] == "1":
                dalc[25] += 1
            elif line[25] == "2":
                dalc[26] += 1  
            elif line[25] == "3":
                dalc[27] += 1
            elif line[25] == "4":
                dalc[28] += 1 
            elif line[25] == "5":
                dalc[29] += 1     
            
            # count walc instances
            if line[26] == "1":
                walc[25] += 1
            elif line[26] == "2":
                walc[26] += 1  
            elif line[26] == "3":
                walc[27] += 1
            elif line[26] == "4":
                walc[28] += 1 
            elif line[26] == "5":
                walc[29] += 1     
            
            # count health instances
            if line[27] == "1":
                health[25] += 1
            elif line[27] == "2":
                health[26] += 1  
            elif line[27] == "3":
                health[27] += 1
            elif line[27] == "4":
                health[28] += 1 
            elif line[27] == "5":
                health[29] += 1  
            else:
                print("Error in count(): unexpected health value", line[27])
            
            # count absences instances
            if line[28] == "none":
                absences[25] += 1
            elif line[28] == "one_to_three":
                absences[26] += 1  
            elif line[28] == "four_to_six":
                absences[27] += 1
            elif line[28] == "seven_to_ten":
                absences[28] += 1 
            elif line[28] == "more_than_ten":
                absences[29] += 1  
            else:
                print("Error in count(): unexpected absences value", line[28])
            

    return school, sex, address, famsize, pstatus, medu, fedu, mjob, fjob, reason, guardian, traveltime, studytime, failures, schoolsup, famsup, paid, activities, nursery, higher, internet, romantic, famrel, freetime, goout, dalc, walc, health, absences   
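The if/elif chains in count() follow one regular pattern: each attribute has an ordered list of values, and the counter for a (grade, value) pair sits at position n_values * grade_index + value_index. A minimal sketch of the same tally driven by a lookup table is shown below. The names GRADES, VALUE_ORDER and count_attribute are hypothetical and not part of this notebook, only three attributes are listed for brevity, and ignoring unknown grades or values is an assumption.

```python
# Hypothetical table-driven version of the tally in count(); names are illustrative.
GRADES = ["A+", "A", "B", "C", "D", "F"]

# Value order per attribute, copied from the if/elif chains (subset only).
VALUE_ORDER = {
    "school": ["GP", "MS"],
    "medu": ["low", "none", "mid", "high"],
    "guardian": ["mother", "father", "other"],
}

def count_attribute(rows, attr, column, grade_column=29):
    """Return a flat list of counts for one attribute, laid out grade by grade."""
    values = VALUE_ORDER[attr]
    counts = [0] * (len(GRADES) * len(values))
    for row in rows:
        grade, value = row[grade_column], row[column]
        # Assumption: rows with an unknown grade or value are ignored
        if grade in GRADES and value in values:
            counts[len(values) * GRADES.index(grade) + values.index(value)] += 1
    return counts
```

With this layout, a "C" student from school "MS" increments position 2*3 + 1 = 7, which matches school[7] in the chain above.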
        
        
    
             
    
        
# This function should build a supervised NB model
        
def train():
    count_lists = count()
    probabilities = []

    
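    # Flatten the count lists into one list, replacing every zero count with 1
    # as a crude form of smoothing. The division by the respective grade counts
    # that turns these counts into probabilities happens later, in predict_initial().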
    
    for each in count_lists:
        for attr in each:
            if attr == 0:
                attr = 1
            probabilities.append(attr)  
    return probabilities
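train() replaces each zero count with 1 and leaves the nonzero counts unchanged. For comparison, a sketch of add-one (Laplace) smoothing is shown below: it adds a constant to every count and enlarges the denominator by the number of values the attribute can take, so the smoothed probabilities still sum to 1. The function name laplace_probabilities and its parameters are hypothetical and not part of this notebook.

```python
def laplace_probabilities(value_counts, alpha=1):
    """Turn the counts of one attribute's values within one grade into smoothed probabilities."""
    # Denominator: observed instances plus alpha pseudo-counts per possible value
    total = sum(value_counts) + alpha * len(value_counts)
    return [(count + alpha) / total for count in value_counts]

# Example: counts of school = GP / MS among the students of one grade
smoothed = laplace_probabilities([9, 0])
```

Here the unseen value "MS" receives probability 1/11 rather than 0, and the seen value "GP" receives 10/11.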


In [4]:
data = split_data()
training_data = data[0]
test_data = data[1]
probabilities = train()

# Count the number of A+s, As, Bs etc. in the training data
A_plus_count = 0
A_count = 0
B_count = 0
C_count = 0
D_count = 0
F_count = 0
    
for instance in training_data:
    if instance[29] == "A+":
        A_plus_count += 1
    elif instance[29] == "A":
        A_count += 1
    elif instance[29] == "B":
        B_count += 1
    elif instance[29] == "C":
        C_count += 1
    elif instance[29] == "D":
        D_count += 1
    elif instance[29] == "F":
        F_count += 1
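The six counters above can also be gathered in a single pass with collections.Counter from the standard library. The sketch below is one possible alternative; the function name grade_counts is hypothetical, and the grade is assumed to sit in column 29 as in the loop above.

```python
from collections import Counter

def grade_counts(instances, grade_column=29):
    """Count how many instances carry each grade label."""
    return Counter(instance[grade_column] for instance in instances)
```

A Counter returns 0 for a grade that never occurs, so a lookup such as grade_counts(training_data)["A+"] is safe even when the training split contains no A+ student.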


attributes = ["school","sex","address","famsize","pstatus","medu","fedu","mjob","fjob","reason","guardian","traveltime","studytime","failures","schoolsup","famsup","paid","activities","nursery","higher","internet","romantic","famrel","freetime","goout","dalc","walc","health","absences"]

# CREATE THE INDEX RETRIEVAL
# This function takes a grade, attribute and value and returns the index of the
# matching count in the flat probabilities list, so the probabilities can be computed
def get_prob(grade, attr, value):
    attr_index = 0
    grade_index = 0
    attr_count = 0
    if grade == "A+":
        grade_index = 0
    elif grade == "A":
        grade_index = 1
    elif grade == "B":
        grade_index = 2
    elif grade == "C":
        grade_index = 3
    elif grade == "D":
        grade_index = 4
    elif grade == "F":
        grade_index = 5
    else:
        raise ValueError("get_prob(): unknown grade " + repr(grade))
        
    if attr == "school":
        attr_index = 0
        attr_count = 2
    elif attr =="sex":
        attr_index = 12
        attr_count = 2
    elif attr == "address":
        attr_index = 24
        attr_count = 2
    elif attr == "famsize":
        attr_index = 36
        attr_count = 2
    elif attr == "pstatus":
        attr_index = 48
        attr_count = 2
    elif attr == "medu":
        attr_index = 60
        attr_count = 4
    elif attr == "fedu":
        attr_index = 84
        attr_count = 4
    elif attr == "mjob":
        attr_index = 108
        attr_count = 5
    elif attr == "fjob":
        attr_index = 138
        attr_count = 5
    elif attr == "reason":
        attr_index = 168
        attr_count = 4
    elif attr == "guardian":
        attr_index = 192
        attr_count = 3
    elif attr == "traveltime":
        attr_index = 210
        attr_count = 5
    elif attr == "studytime":
        attr_index = 240
        attr_count = 5
    elif attr == "failures":
        attr_index = 270
        attr_count = 5
    elif attr == "schoolsup":
        attr_index = 300
        attr_count = 2
    elif attr == "famsup":
        attr_index = 312
        attr_count = 2
    elif attr == "paid":
        attr_index = 324
        attr_count = 2
    elif attr == "activities":
        attr_index = 336
        attr_count = 2
    elif attr == "nursery":
        attr_index = 348
        attr_count = 2
    elif attr == "higher":
        attr_index = 360
        attr_count = 2
    elif attr == "internet":
        attr_index = 372
        attr_count = 2
    elif attr == "romantic":
        attr_index = 384
        attr_count = 2
    elif attr == "famrel":
        attr_index = 396
        attr_count = 5
    elif attr == "freetime":
        attr_index = 426
        attr_count = 5
    elif attr == "goout":
        attr_index = 456
        attr_count = 5
    elif attr == "dalc":
        attr_index = 486
        attr_count = 5
    elif attr == "walc":
        attr_index = 516
        attr_count = 5
    elif attr == "health":
        attr_index = 546
        attr_count = 5
    elif attr == "absences":
        attr_index = 576
        attr_count = 5
    else:
        raise ValueError("get_prob(): unknown attribute " + repr(attr))

    # Map the value to its position within the attribute's block of counts
    if value in ("yes", "none", "1", "F", "U", "LE3", "T", "GP", "teacher", "home", "mother", "very_bad"):
        value_index = 0
    elif value in ("no", "2", "low", "M", "MS", "R", "GT3", "A", "health", "reputation", "father", "one_to_three"):
        value_index = 1
    elif value in ("mid", "3", "services", "course", "medium", "four_to_six"):
        value_index = 2
    elif value in ("high", "at_home", "seven_to_ten", "4"):
        value_index = 3
    elif value in ("very_high", "other", "more_than_ten", "5"):
        value_index = 4
    else:
        # Without this, value_index would be undefined at the return statement
        raise ValueError("get_prob(): unknown value " + repr(value))
      
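    # Attribute-specific overrides: for these attributes count() stores the value
    # at a different position than the generic mapping above gives
    if (attr == "reason") and (value == "other"):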
        value_index = 3
    if (attr == "guardian") and (value == "other"):
        value_index = 2
    if ((attr == "medu") or (attr == "fedu")) and (value == "none"):
        value_index = 1
    if ((attr == "medu") or (attr == "fedu")) and (value == "low"):
        value_index = 0

    return attr_index + (attr_count*grade_index) + value_index


# This function computes the "naive" conditional probability for a given instance
def predict_initial(instance):
    A_plus = 1
    A = 1
    B = 1
    C = 1
    D = 1
    F = 1
    
    # The following 13 lines of code compute the probability that instance is an A+, A, B.. 
    for each in range(29):
        A_plus = A_plus*(probabilities[get_prob("A+", attributes[each], instance[each])] / A_plus_count)
        A = A*(probabilities[get_prob("A", attributes[each], instance[each])] / A_count)
        B = B*(probabilities[get_prob("B", attributes[each], instance[each])] / B_count)
        C = C*(probabilities[get_prob("C", attributes[each], instance[each])] / C_count)
        D = D*(probabilities[get_prob("D", attributes[each], instance[each])] / D_count)
        F = F*(probabilities[get_prob("F", attributes[each], instance[each])] / F_count)

    A_plus = A_plus*(A_plus_count/(649 - get_test_size()))
    A = A*(A_count/(649 - get_test_size()))
    B = B*(B_count/(649 - get_test_size()))
    C = C*(C_count/(649 - get_test_size()))
    D = D*(D_count/(649 - get_test_size()))
    F = F*(F_count/(649 - get_test_size()))
    
    # Return the grade with the highest probability, hence classifying the instance
    if(max(A_plus, A, B, C, D, F) == A_plus):
        return "A+"
    elif(max(A_plus, A, B, C, D, F) == A):
        return "A"
    elif(max(A_plus, A, B, C, D, F) == B):
        return "B"
    elif(max(A_plus, A, B, C, D, F) == C):
        return "C"
    elif(max(A_plus, A, B, C, D, F) == D):
        return "D"
    elif(max(A_plus, A, B, C, D, F) == F):
        return "F"
        


# This function handles the input (a single instance or a list of instances)
def predict(instance):
    output = None
    if type(instance) == list:
        if type(instance[0]) == str:
            # A single instance: return one predicted grade
            output = predict_initial(instance)
        elif type(instance[0]) == list:
            # A list of instances: return one predicted grade per instance
            output = [predict_initial(each) for each in instance]
    else:
        print("Error in input type. Input must be either an instance or a list of instances")
            
    return output

In [5]:
# This function should evaluate a set of predictions in terms of accuracy
def evaluate():
    count = 0
    results = []
    print(len(test_data))
    print("COUNT  --  ACTUAL  --   PREDICTED")
    for i in range(len(test_data)):        
        results.append(predict(test_data[i]))
        if test_data[i][29] == results[i]:
            count += 1
        print(i, "           ",test_data[i][29],"           ",results[i])
    return count/len(test_data)

evaluate()


65
COUNT  --  ACTUAL  --   PREDICTED
0             D             F
1             D             C
2             B             D
3             C             C
4             F             D
5             F             D
6             B             D
7             B             A
8             B             A
9             D             D
10             A             D
11             D             F
12             B             B
13             D             D
14             B             A
15             F             F
16             F             A
17             D             D
18             D             D
19             B             C
20             B             B
21             B             A+
22             D             C
23             B             B
24             F             F
25             D             D
26             D             D
27             D             D
28             F             F
29             A+             A
30             C             C
31        

0.38461538461538464

1. c)   Using the holdout strategy with a 90 : 10 ratio (65 test instances):
    
        Accuracy: 43.08%
    
    
    Below are some observations:
    
    Twelve of the 65 predictions were incorrect by more than one grade. This means that about 81% of the
    predictions (53 of 65) were either correct or only one grade away.
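
The 81% figure above follows from (65 - 12) / 65. A minimal sketch of how such a "within one grade" score could be computed is given below. The helper name `within_one_grade_accuracy`, the `GRADE_ORDER` list and the toy labels are hypothetical and are not part of the submitted functions.

```python
# Grades ordered from best to worst, matching the labels used in this notebook.
GRADE_ORDER = ["A+", "A", "B", "C", "D", "F"]


def within_one_grade_accuracy(actual, predicted):
    """Fraction of predictions that are exact or off by a single grade step."""
    rank = {grade: position for position, grade in enumerate(GRADE_ORDER)}
    close = sum(1 for a, p in zip(actual, predicted)
                if abs(rank[a] - rank[p]) <= 1)
    return close / len(actual)


# Toy labels (hypothetical, not taken from the test partition):
# three predictions are at most one grade away, one (B predicted as D) is not.
toy_actual = ["D", "B", "F", "B"]
toy_predicted = ["F", "A", "F", "D"]
toy_score = within_one_grade_accuracy(toy_actual, toy_predicted)
```

With the real test partition, `actual` would be the grade column of `test_data` and `predicted` the output of `predict` for each instance.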
    
    
    Observations on instances predicted correctly:
    
    
       In all except one of the test instances that were predicted correctly to be an F or D, medu and fedu were 
       either none, low or mid. Studytime was either low or medium. 50% of them had four_to_six absences (compared 
       with 24% of all the instances in the dataset). This suggests these attributes are strong predictors of poor 
       academic performance. 
       
    Observations on instances predicted incorrectly:
    
       Of all the instances that were incorrectly predicted to be A+, A or B, 66% had either high or very high 
       studytime (vs 20% of the 649 instances in the dataset), and 66% had very low weekend alcohol 
       consumption (compared with 38% of all the instances). All but one had a parent that didn't have high or mid 
       education level. This suggests these attributes have an overly large influence on the prediction.
       
       
    
         

2. a) Accuracy: simply quantifies how frequently the classifier is correct. This is particularly useful for
   comparing classifiers (provided the classifiers are trained / tested over the same instances).
           
           Accuracy = (TP + TN) / (TP + FP + FN + TN)
           where TP = True Positive, TN = True Negative, FP = False Positive, FN = False Negative
           
   For context, we may want to know about the classifier's performance predicting the A+ grade in the student 
   dataset.

   Precision: of all the instances that the classifier predicted to be A+, what proportion of those were correct?
   Important for understanding how common false positives are for a given class.
   
           Precision = TP / (TP + FP)
   
   Recall: of all the truly A+ instances in the test data, what proportion of those did the classifier correctly 
   predict? Useful measure for understanding the presence of false negatives in a certain class.
   
           Recall = TP / (TP + FN)
   
   F-1: Precision and Recall generally have an inverse relationship although a good classifier will have high 
   precision and high recall. The F-1 score is a metric that evaluates the extent to which there is high
   precision AND high recall. The F-1 score is another useful tool for understanding the performance of the 
   classifier.
   
           F-1 = (2*P*R) / (P + R)
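
As a worked check of the F-1 formula, the sketch below recomputes the grade B row of the BY CLASS output (P = 0.666667, R = 0.222222). The helper name `f1_score` is hypothetical. It returns None when precision and recall are both zero, which is an assumption that mirrors the "N/A" entries in the output table.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; None when both are zero."""
    if precision + recall == 0:
        return None
    return 2 * precision * recall / (precision + recall)


# Grade B row from the BY CLASS output: the result is close to the
# 0.333333 reported there, and well below the arithmetic mean of P and R.
f1_grade_b = f1_score(0.666667, 0.222222)
arithmetic_mean_b = (0.666667 + 0.222222) / 2
```

Because the harmonic mean is pulled towards the smaller of the two values, a class only gets a high F-1 when precision and recall are both high.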
           
    Macro Averaging: This simply takes the unweighted mean of the per-class precision (or recall) scores
    
           MacroPrecision = (sum(Pi) (over 1 to c)) / c       where Pi is the Precision for class i,  
                                                              c is the number of classes
           
           MacroRecall = (sum(Ri) (over 1 to c)) / c       where Ri is the Recall for class i
           
           
   
    Micro Averaging: This pools the TPs and FPs (for Precision) or the TPs and FNs (for Recall) of all the 
    classes into a single set of counts, and then computes one Precision / Recall from the pooled counts
    
           MicroPrecision = (sum(TPi) (over 1 to c)) / (sum(TPi + FPi) (over 1 to c))
          
           
           MicroRecall = (sum(TPi) (over 1 to c)) / (sum(TPi + FNi) (over 1 to c))
           
               where TPi is the count of True Positives for class i
                     FPi ................False Positives ........
                     FNi ................False Negatives ........
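
The difference between the two averages can be sketched with per-class counts. The function name `macro_micro_precision` and the toy counts are hypothetical. The sketch also assumes every class has at least one prediction, so it does not reproduce how the notebook's own `precision()` handles "N/A" classes.

```python
def macro_micro_precision(tp, fp):
    """Return (macro, micro) precision from per-class TP and FP counts.

    Assumes every class has at least one prediction (tp + fp > 0).
    """
    per_class = [t / (t + f) for t, f in zip(tp, fp)]
    macro = sum(per_class) / len(per_class)        # every class weighs the same
    micro = sum(tp) / (sum(tp) + sum(fp))          # every prediction weighs the same
    return macro, micro


# Toy counts (hypothetical): a rare class with 2 predictions and a common
# class with 10 predictions.
toy_tp = [1, 9]
toy_fp = [1, 1]
toy_macro, toy_micro = macro_micro_precision(toy_tp, toy_fp)
# Per-class precision is 0.5 and 0.9, so macro is 0.7, while micro is 10/12.
```

In the student dataset, where C, D and F make up most of the instances, the micro average would be dominated by those grades, whereas the macro average gives A+ the same weight as D.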
        
       

In [6]:
# Code for Q2.b)

# This function makes the results accessible for this code block
def getResults():
    results = []
    for i in range(len(test_data)):        
        results.append(predict(test_data[i]))
    return results


# Compute the precision
def precision():
    results = getResults()
    sumation = 0
    TP = [0,0,0,0,0,0]
    FP = [0,0,0,0,0,0]
    precision_array = []
    for i in range(len(test_data)):
        # Count the True Positives
        if results[i] == test_data[i][29]:
            if results[i] == "A+":
                TP[0] += 1
            elif results[i] == "A":
                TP[1] += 1
            elif results[i] == "B":
                TP[2] += 1
            elif results[i] == "C":
                TP[3] += 1
            elif results[i] == "D":
                TP[4] += 1
            elif results[i] == "F":
                TP[5] += 1
        
        # Count the False Positives
        else:
            if results[i] == "A+":
                FP[0] += 1
            elif results[i] == "A":
                FP[1] += 1
            elif results[i] == "B":
                FP[2] += 1
            elif results[i] == "C":
                FP[3] += 1
            elif results[i] == "D":
                FP[4] += 1
            elif results[i] == "F":
                FP[5] += 1
                
    # Handle the case where the count was zero
    for i in range(6):
        if TP[i] == 0 and FP[i] == 0:
            precision_array.append("N/A")
        else:
            precision_array.append(TP[i] / (TP[i] + FP[i]))
    
    for i in range(6):
        if type(precision_array[i]) != str:
            sumation = sumation + precision_array[i]
        
    return precision_array, sumation / 6

# Compute the recall
def recall():
    results = getResults()
    TP = [0,0,0,0,0,0]
    count = [0,0,0,0,0,0]
    recall_array = []
    sumation = 0
    for i in range(len(test_data)):
        # Count total occurrences of each grade so that False Negatives can be determined
        if test_data[i][29] == "A+":
            count[0] += 1
        elif test_data[i][29] == "A":
            count[1] += 1
        elif test_data[i][29] == "B":
            count[2] += 1
        elif test_data[i][29] == "C":
            count[3] += 1
        elif test_data[i][29] == "D":
            count[4] += 1
        elif test_data[i][29] == "F":
            count[5] += 1
        
        # Count the True Positives
        if results[i] == test_data[i][29]:
            if results[i] == "A+":
                TP[0] += 1
            elif results[i] == "A":
                TP[1] += 1
            elif results[i] == "B":
                TP[2] += 1
            elif results[i] == "C":
                TP[3] += 1
            elif results[i] == "D":
                TP[4] += 1
            elif results[i] == "F":
                TP[5] += 1     
    
    for i in range(6):
        if count[i] != 0:
            recall_array.append(TP[i] / count[i])
        else:
            recall_array.append("N/A")
    for i in range(6):
        if type(recall_array[i]) != str:
            sumation = sumation + recall_array[i]
        
    return recall_array, sumation / 6

# Compute F-1 metric
def f_one():
    
    get_precision = precision()
    get_recall = recall()
    
    precision_array = get_precision[0]
    precision_avg = get_precision[1]
    recall_array = get_recall[0]
    recall_avg = get_recall[1]
    f_one_array = []
    
    f_one_avg = (2*precision_avg*recall_avg) / (precision_avg + recall_avg)
    for i in range(6):
        if type(precision_array[i]) == str or type(recall_array[i]) == str:
            f_one_array.append("N/A")
        elif precision_array[i] == recall_array[i] == 0:
            f_one_array.append("N/A")
        else: 
            f_one_array.append((2*precision_array[i]*recall_array[i]) / (precision_array[i] + recall_array[i]))
        
    
    return f_one_array, f_one_avg

# Output the results
def compare():
    grades = ["A+", "A ", "B ", "C ", "D ", "F "]
    column_one = precision()
    column_two = recall()
    column_three = f_one()
    
    precision_array = column_one[0]
    precision_avg = column_one[1]
    
    recall_array = column_two[0]
    recall_avg = column_two[1]
    
    f_one_array = column_three[0]
    f_one_avg = column_three[1]
    
    print("~~~~~ BY CLASS ~~~~~~")
    print("\nGRADE ~ PRECISION ~ RECALL  ~~~~  F-1")
    for i in range(6):
        if(type(precision_array[i]) == type(recall_array[i]) == type(f_one_array[i]) == str):
            print(grades[i], "   ","N/A    ", "   ", "N/A    ", "    ", "N/A    ")
        elif(type(precision_array[i]) == type(f_one_array[i]) == str):
            print(grades[i], "   ","N/A    ", "   ", format(float(recall_array[i]), ".6f"), "    ", "N/A    ")
        elif(type(recall_array[i]) == type(f_one_array[i]) == str):
            print(grades[i], "   ",format(float(precision_array[i]), ".6f"), "   ", "N/A    ", "    ", "N/A    ")
        elif(type(precision_array[i]) == type(recall_array[i]) == str):
            print(grades[i], "   ","N/A    ", "   ", "N/A    ", "    ", format(float(f_one_array[i]), ".6f"))
        elif(type(precision_array[i]) == str):
            print(grades[i], "   ","N/A    ", "   ", format(float(recall_array[i]), ".6f"), "    ", format(float(f_one_array[i]), ".6f"))
        elif(type(recall_array[i]) == str):
            print(grades[i], "   ",format(float(precision_array[i]), ".6f"), "   ", "N/A    ", "    ", format(float(f_one_array[i]), ".6f"))
        elif(type(f_one_array[i]) == str):
            print(grades[i], "   ",format(float(precision_array[i]), ".6f"), "   ", format(float(recall_array[i]), ".6f"), "    ", "N/A    ")
        else:
            print(grades[i], "   ",format(float(precision_array[i]), ".6f"), "   ", format(float(recall_array[i]), ".6f"), "    ", format(float(f_one_array[i]), ".6f"))
    print("\n\n\n~~~~ MACRO AVERAGING ~~~~")
    print("\nPRECISION ~~ RECALL  ~~~~  F-1")
    print(format(float(precision_avg), ".6f"), "   ", format(float(recall_avg), ".6f"), "    ", format(float(f_one_avg), ".6f"))
    
compare()

~~~~~ BY CLASS ~~~~~~

GRADE ~ PRECISION ~ RECALL  ~~~~  F-1
A+     0.000000     0.000000      N/A    
A      0.100000     0.200000      0.133333
B      0.666667     0.222222      0.333333
C      0.357143     0.500000      0.416667
D      0.458333     0.523810      0.488889
F      0.400000     0.400000      0.400000



~~~~ MACRO AVERAGING ~~~~

PRECISION ~~ RECALL  ~~~~  F-1
0.330357     0.307672      0.318611


2. b) (Written response)
     
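     Note: in some cases, there were no instances of a particular grade predicted or present in the partition, 
     and so the F-1 value was N/A (and in some cases the precision / recall too).
     
     Over the ten partitions of the dataset, the single-number macro averages for Precision / Recall / F-1 
     loosely tracked the accuracy for that partition. Accuracy was higher than all three macro averages in 
     eight of the ten partitions; the exceptions were P8 (where Precision and F-1 exceeded accuracy) and P10 
     (where Precision was marginally higher). This is to be expected, because macro averaging gives the rare, 
     poorly predicted grades (such as A+) the same weight as the common ones.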
     
          ACCURACY  PRECISION  RECALL    F-1
     
     P1   0.28125   0.235165   0.230533  0.232826
     P2   0.43076   0.305979   0.299138  0.30252
     P3   0.43076   0.387393   0.355449  0.370734
     P4   0.32307   0.289286   0.294871  0.292052
     P5   0.38461   0.301901   0.298039  0.299958
     P6   0.30769   0.2395     0.229345  0.234313
     P7   0.26154   0.191313   0.189894  0.190601
     P8   0.4       0.479167   0.381657  0.424889
     P9   0.36923   0.284416   0.314881  0.298874
     P10  0.2923    0.294935   0.278605  0.286537
    
    
    Precision was higher than recall in eight of the ten partitions. Following from Q2A, this suggests that, 
    averaged over the classes, false positives make up a smaller share of the classifier's predictions than 
    false negatives do of the true instances. In context, when the classifier predicts a given grade it is 
    right slightly more often than it manages to find the students who truly have that grade.
    
    Looking at the BY CLASS evaluations, there is a clear trend that the classifier is better at predicting 
    lower grades. Ignoring A+ (too few A+ instances to offer reliable insight) we have the following average
    precision, recall and f-1 for the ten iterations:
    
             PRECISION   RECALL      F-1
  
     A       0.2083951   0.30703457  0.23975028
     B       0.2718406   0.2332884   0.2363511
     C       0.3719635   0.38085875  0.369275
     D       0.4244813   0.4631103   0.4369471
     F       0.4579366   0.3745931   0.3988734
     
     (NOTE: N/A or 0 entries were omitted from the averages)
      
      
    The F-1 measure is the harmonic mean of precision and recall, so it is essentially a conservative average 
    of the two. The F-1 measures for grades C, D and F are noticeably higher than those for A and B, indicating 
    that the classifier is better at predicting these grades. Given that 70.1% of the entire dataset is made up 
    of C, D or F grades, we would expect to see this, because the classifier has more training examples of 
    these grades.
    
    

 3. a)  I chose to use cross validation with a ratio of 10 : 90 (test : train). The hypothesis is that with 65 
        instances per partition (10 folds), the test data will be representative of the overall dataset. 
        
        i) The cross validation method works by dividing the dataset into n equal (as close as possible) 
           partitions, and running the classifier n times, where each of the partitions is used once as the 
           test data (and all the instances that are not the test data for that iteration are the training 
           data). The accuracy for each iteration is then averaged. The more partitions the less variance.
           In principle the best choice is to make every single instance its own partition (649 partitions 
           in this case); however, that would be too slow.
           
        ii) In the Holdout Strategy there is a possibility that the randomly chosen test data is not 
           representative of the overall dataset, for example the test data could be mostly D or F, which the 
           classifier is better at predicting. This would likely give a higher performance than what the 
           classifier would achieve on representative test data. Cross Validation is preferable over the 
           Holdout method because every instance is used for both training and testing (in different 
           iterations). Taken together, the test partitions cover the whole dataset, so there is zero 
           possibility that the test data is not representative. More partitions results in less variance
           but more computations, so I chose 10 partitions (most common). 
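
The partitioning described in i) can be sketched as follows. This is a minimal illustration only: the function name `k_fold_indices` is hypothetical, the folds are contiguous and unshuffled by assumption, and it is not the partitioning code used for the reported results.

```python
def k_fold_indices(n_instances, k):
    """Yield (train_indices, test_indices) for each of k near-equal folds."""
    # The first (n_instances % k) folds receive one extra instance.
    fold_sizes = [n_instances // k + (1 if i < n_instances % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_instances))
        yield train, test
        start += size


# 649 instances in 10 folds: nine folds of 65 instances and one of 64.
folds = list(k_fold_indices(649, 10))
fold_test_sizes = [len(test) for _, test in folds]
```

Each index appears in exactly one test fold, which is what lets every instance be used once for testing and n - 1 times for training.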
    

     b) Cross Validation Accuracy: 34.8%
        Holdout Accuracy: 43.08%
        
        This shows how unreliable the Holdout strategy can be. The event where the test data is 
        "non representative" was evident here, where there was a disproportionately high number of 
        Cs, Ds and Fs in the test data (75%, compared with 70% in the full dataset). The Cross Validation 
        accuracy is the one we should take to be more "true" given the theory explained in ii). 
        
        To illustrate the issue of variance that we face with the Holdout strategy, three more trials 
        were performed and their accuracies are listed below:
        
        T1  0.32307
        
        T2  0.33846
        
        T3  0.4
        
        
        On the other hand, the Cross Validation strategy is more reproducible. For the same partitions 
        (that is, as long as the data is not re-shuffled before partitioning), it will achieve the same 
        accuracy every time.
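
The 34.8% figure appears consistent with the mean of the ten per-partition accuracies tabulated in 2. b). The short check below uses those rounded values as printed there. The variable names are hypothetical.

```python
# Per-partition accuracies P1..P10, copied from the table in 2. b).
fold_accuracies = [0.28125, 0.43076, 0.43076, 0.32307, 0.38461,
                   0.30769, 0.26154, 0.4, 0.36923, 0.2923]

# Cross-validation accuracy is the mean over the folds.
mean_accuracy = sum(fold_accuracies) / len(fold_accuracies)

# Spread between the best and worst fold, for comparison with the
# variation seen across the Holdout trials.
fold_spread = max(fold_accuracies) - min(fold_accuracies)
```

Rounded to three decimal places the mean is 0.348, while individual folds range from about 0.26 to about 0.43, which is why a single Holdout split can land well above or below the averaged figure.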

## Questions (you may respond in a cell or cells below):

You should respond to Question 1 and two additional questions of your choice. A response to a question should take about 100–250 words, and make reference to the data wherever possible.

### Question 1: Naive Bayes Concepts and Implementation

- a Explain the ‘naive’ assumption underlying Naive Bayes. (1) Why is it necessary? (2) Why can it be problematic? Link your discussion to the features of the students data set. [no programming required]
- b Implement the required functions to load the student dataset, and estimate a Naive Bayes model. Evaluate the resulting classifier using the hold-out strategy, and measure its performance using accuracy.
- c What accuracy does your classifier achieve? Manually inspect a few instances for which your classifier made correct predictions, and some for which it predicted incorrectly, and discuss any patterns you can find.

### Question 2: A Closer Look at Evaluation

- a You learnt in the lectures that precision, recall and f-1 measure can provide a more holistic and realistic picture of the classifier performance. (i) Explain the intuition behind accuracy, precision, recall, and F1-measure, (ii) contrast their utility, and (iii) discuss the difference between micro and macro averaging in the context of the data set. [no programming required]
- b Compute precision, recall and f-1 measure of your model’s predictions on the test data set (1) separately for each class, and (2) as a single number using macro-averaging. Compare the results against your accuracy scores from Question 1. In the context of the student dataset, and your response to question 2a analyze the additional knowledge you gained about your classifier performance.

### Question 3: Training Strategies 

There are other evaluation strategies, which tend to be preferred over the hold-out strategy you implemented in Question 1.
- a Select one such strategy, (i) describe how it works, and (ii) explain why it is preferable over hold-out evaluation. [no programming required]
- b Implement your chosen strategy from Question 3a, and report the accuracy score(s) of your classifier under this strategy. Compare your outcomes against your accuracy score in Question 1, and explain your observations in the context of your response to question 3a.

### Question 4: Model Comparison

In order to understand whether a machine learning model is performing satisfactorily we typically compare its performance against alternative models. 
- a Choose one (simple) comparison model, explain (i) the workings of your chosen model, and (ii) why you chose this particular model. 
- b Implement your model of choice. How does the performance of the Naive Bayes classifier compare against your additional model? Explain your observations.

### Question 5: Bias and Fairness in Student Success Prediction

As machine learning practitioners, we should be aware of possible ethical considerations around the
applications we develop. The classifier you developed in this assignment could for example be used
to classify college applicants into admitted vs not-admitted – depending on their predicted
grade.
- a Discuss ethical problems which might arise in this application and lead to unfair treatment of the applicants. Link your discussion to the set of features provided in the students data set. [no programming required]
- b Select ethically problematic features from the data set and remove them from the data set. Use your own judgment (there is no right or wrong), and document your decisions. Train your Naive Bayes classifier on the resulting data set containing only ‘unproblematic’ features. How does the performance change in comparison to the full classifier?
- c The approach to fairness we have adopted is called “fairness through unawareness” – we simply deleted any questionable features from our data. Removing all problematic features does not guarantee a fair classifier. Can you think of reasons why removing problematic features is not enough? [no programming required]
