## Homework 2 - Classification
***
**Name**: $<$insert name here$>$ 
***

Remember that you are encouraged to discuss the problems with your instructors and classmates, but **you must write all code and solutions on your own**.

The rules to be followed for the assignment are:

- Do **NOT** load additional packages beyond what we've shared in the cells below.
- Some problems with code may be autograded.  If we provide a function or class API **do not** change it.
- Do not change the location of the data or data directory.  Use only relative paths to access the data.

In [1]:
import argparse
import pandas as pd
import numpy as np
import pickle
from pathlib import Path
from collections import defaultdict

### [10 points] Problem 1 - Building a Decision Tree
***

A sample dataset is provided at './data/dataset.csv'. Its attributes and their possible values are listed below. Use this dataset to test your functions.

- Age - ["<=30", "31-40", ">40"]
- Income - ["low", "medium", "high"]
- Student - ["no", "yes"]
- Credit Rating - ["fair", "excellent"]
- Loan - ["no", "yes"]

Note:
- The sample dataset for testing your code is located at "data/dataset.csv". Leave it in place, because the grader reads it from that location.
- Do not change the variable names of the returned values.
- After calculating each value, assign it to the corresponding variable that is returned.
- The "Loan" attribute should be used as the target variable while making calculations for your decision tree.
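
For reference, the standard definitions behind these calculations are sketched below. The symbols $D$ (the dataset), $D_v$ (the rows where attribute $A$ takes value $v$), $p_c$ (the fraction of rows with Loan $= c$), $H$ and $\mathrm{Gain}$ are notation introduced here for illustration and do not appear in the starter code. The entropy $H(D)$ of the target is the quantity this notebook stores in `ig_loan`.

$$H(D) = -\sum_{c \in \{\text{yes},\, \text{no}\}} p_c \log_2 p_c, \qquad \mathrm{Gain}(D, A) = H(D) - \sum_{v \in \mathrm{values}(A)} \frac{|D_v|}{|D|}\, H(D_v)$$

By convention a term with $p_c = 0$ contributes $0$, so a subset containing only one class has zero entropy.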

In [2]:
import math
import pandas as pd

def information_gain_target(dataset_file): 
    
#        Input: dataset_file - A string holding the path to the dataset file.
#        Output: ig_loan - A float holding the entropy (in bits) of the target variable.
#        
#        NOTE: 
#        1. Return the entropy of the target variable; this assignment stores it under the name ig_loan.
#        2. The Loan attribute is the target variable.
#        3. The pandas dataframe has the following attributes: Age, Income, Student, Credit Rating, Loan
#        4. Perform your calculation and assign the result to the variable ig_loan.


    df = pd.read_csv(dataset_file)
    ig_loan = 0
    
    # your code here
    
    total_count = len(df)
    yes_count = len(df[df['Loan'] == 'yes'])
    no_count = len(df[df['Loan'] == 'no'])
    
    if yes_count == 0 or no_count == 0:
        ig_loan = 0
    else:
        p_yes = yes_count / total_count
        p_no = no_count / total_count
        ig_loan = - (p_yes * math.log2(p_yes) + p_no * math.log2(p_no))
        
    return ig_loan

information_gain_target('data/dataset.csv')


0.9798687566511528

In [3]:

# This cell has hidden test cases that will run after you submit your assignment. 


In [4]:
attribute_values = {
    "Age": ["<=30", "31-40", ">40"],
    "Income": ["low", "medium", "high"],
    "Student": ["yes", "no"],
    "Credit Rating": ["fair", "excellent"]
}

attributes = ["Age", "Income", "Student", "Credit Rating"]

In [5]:
def information_gain(p_count_yes, p_count_no):
    
#   A helper function that returns the entropy (in bits) of a yes/no split, given the counts of yes and no values.
#   Please complete this function before you proceed to the information_gain_attributes function below.
    
    # your code here
    total_count = p_count_yes + p_count_no

    # An empty or pure split carries no uncertainty. Returning early also
    # avoids dividing by zero when an attribute value has no matching rows.
    if total_count == 0 or p_count_yes == 0 or p_count_no == 0:
        return 0

    p_yes = p_count_yes / total_count
    p_no = p_count_no / total_count

    ig = -(p_yes * math.log2(p_yes)) - (p_no * math.log2(p_no))
    
    return ig

In [6]:
import operator

def information_gain_attributes(dataset_file, ig_loan, attributes, attribute_values):
    
#        Input: 
#            1. dataset_file - A string variable which references the path to the dataset file.
#            2. ig_loan - A float holding the entropy of the target variable "Loan", as returned by information_gain_target.
#            3. attributes - A python list which has all the attributes of the dataset
#            4. attribute_values - A python dictionary representing the values each attribute can hold.
#            
#        Output: results - A python dictionary representing the information gain associated with each variable.
#            1. ig_attributes - A sub dictionary representing the information gain for each attribute.
#            2. best_attribute - Returns the attribute which has the highest information gain.
#        
#        NOTE: 
#        1. The Loan attribute is the target variable
#        2. The pandas dataframe has the following attributes: Age, Income, Student, Credit Rating, Loan
    
    results = {
        "ig_attributes": {
            "Age": 0,
            "Income": 0,
            "Student": 0,
            "Credit Rating": 0
        },
        "best_attribute": ""
    }
    
            
    df = pd.read_csv(dataset_file)
    d_range = len(df)

    for attribute in attributes:
        attribute_values_list = attribute_values[attribute]
        attribute_entropy = 0

        for value in attribute_values_list:
            # Rows in which this attribute takes the current value
            subset = df[df[attribute] == value]
            value_count = len(subset)
            value_loan_counts = subset['Loan'].value_counts()

            p_yes = value_loan_counts.get('yes', 0)
            p_no = value_loan_counts.get('no', 0)

            # Weight the subset's entropy by its share of the whole dataset
            attribute_entropy += (value_count / d_range) * information_gain(p_yes, p_no)
        
        results["ig_attributes"][attribute] = ig_loan - attribute_entropy
        
    
    results["best_attribute"] = max(results["ig_attributes"].items(), key=operator.itemgetter(1))[0]
    return results

In [7]:
dataset_file = 'data/dataset.csv'
ig_loan = information_gain_target(dataset_file)
attributes = ["Age", "Income", "Student", "Credit Rating"]
attribute_values = {
    "Age": ["<=30", "31-40", ">40"],
    "Income": ["low", "medium", "high"],
    "Student": ["yes", "no"],
    "Credit Rating": ["fair", "excellent"]
}

results = information_gain_attributes(dataset_file, ig_loan, attributes, attribute_values)
print(results)

{'ig_attributes': {'Age': 0.2419726756283742, 'Income': 0.012398717114751934, 'Student': 0.19570962879973097, 'Credit Rating': 0.07181901063117269}, 'best_attribute': 'Age'}
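
The calculation above can also be sketched without pandas, which makes each step easy to check by hand. The block below is a minimal illustration, not the graded solution: the helper names `entropy` and `gain` and the four toy rows are assumptions made up for this example and are not part of the assignment's dataset or API.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy in bits of a list of class labels (any number of classes)."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())

def gain(rows, attribute, target="Loan"):
    """Entropy of the target minus the weighted entropy after splitting on attribute."""
    # Group the target labels by the value the attribute takes in each row
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    weighted = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - weighted

# Made-up rows for illustration only; this is NOT the assignment's dataset.
toy_rows = [
    {"Student": "yes", "Income": "low", "Loan": "yes"},
    {"Student": "yes", "Income": "high", "Loan": "yes"},
    {"Student": "no", "Income": "low", "Loan": "no"},
    {"Student": "no", "Income": "high", "Loan": "no"},
]

print(gain(toy_rows, "Student"))  # 1.0: Student separates the classes perfectly
print(gain(toy_rows, "Income"))   # 0.0: Income carries no information about Loan
```

In this toy data every student has a loan and every non-student does not, so splitting on Student removes all uncertainty, while each Income group is still an even yes/no mix.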


In [8]:

# This cell has hidden test cases that will run after you submit your assignment. 


### [10 points] Problem 2 - Building a Naive Bayes Classifier
***

A sample dataset is provided at './data/dataset.csv'. Its attributes and their possible values are listed below. Use this dataset to test your functions.

- Age - ["<=30", "31-40", ">40"]
- Income - ["low", "medium", "high"]
- Student - ["no", "yes"]
- Credit Rating - ["fair", "excellent"]
- Loan - ["no", "yes"]

Note:
- The sample dataset for testing your code is located at "data/dataset.csv". Leave it in place, because the grader reads it from that location.
- Do not change the variable names of the returned values.
- After calculating each value, assign it to the corresponding variable that is returned.
- The "Loan" attribute should be used as the target variable while making calculations for your Naive Bayes classifier.
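
As a reminder of how these probabilities are typically combined, a Naive Bayes classifier scores a record $x = (x_1, \dots, x_n)$ under the assumption that the attributes are conditionally independent given the class. The notation below is introduced here for illustration and is not taken from the starter code.

$$P(\text{Loan} = c \mid x) \;\propto\; P(\text{Loan} = c) \prod_{i=1}^{n} P(x_i \mid \text{Loan} = c), \qquad c \in \{\text{yes}, \text{no}\}$$

The function in this problem estimates the prior $P(\text{Loan} = c)$ and each conditional $P(x_i \mid \text{Loan} = c)$ from counts in the dataset.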

In [9]:
from collections import defaultdict

def naive_bayes(dataset_file, attributes, attribute_values):

#   Input:
#       1. dataset_file - A string variable which references the path to the dataset file.
#       2. attributes - A python list which has all the attributes of the dataset
#       3. attribute_values - A python dictionary representing the values each attribute can hold.
#        
#   Output: probabilities - A dictionary holding the prior P(Loan = yes/no) and, for every attribute value,
#       the conditional probabilities P(attribute = value | Loan = yes) and P(attribute = value | Loan = no).
#                
#   Hint: Starter code has been provided to you to calculate the probabilities.

    probabilities = {
        "Age": { "<=30": {"yes": 0, "no": 0}, "31-40": {"yes": 0, "no": 0}, ">40": {"yes": 0, "no": 0} },
        "Income": { "low": {"yes": 0, "no": 0}, "medium": {"yes": 0, "no": 0}, "high": {"yes": 0, "no": 0}},
        "Student": { "yes": {"yes": 0, "no": 0}, "no": {"yes": 0, "no": 0} },
        "Credit Rating": { "fair": {"yes": 0, "no": 0}, "excellent": {"yes": 0, "no": 0} },
        "Loan": {"yes": 0, "no": 0}
    }
    
    df = pd.read_csv(dataset_file)
    d_range = len(df)
    
    vcount = df["Loan"].value_counts()
    vcount_loan_yes = vcount["yes"]
    vcount_loan_no = vcount["no"]
    
    probabilities["Loan"]["yes"] = vcount_loan_yes/d_range
    probabilities["Loan"]["no"] = vcount_loan_no/d_range
    
    # Split the rows once by class; these subsets do not depend on the attribute.
    filtered_df_yes = df[df["Loan"] == "yes"]
    filtered_df_no = df[df["Loan"] == "no"]
    filtered_range_yes = len(filtered_df_yes)
    filtered_range_no = len(filtered_df_no)

    for attribute in attributes:
        for att_value in attribute_values[attribute]:
            
            # your code here

            # Count the rows of each class in which the attribute takes this value
            num_yes = len(filtered_df_yes[filtered_df_yes[attribute] == att_value])
            num_no = len(filtered_df_no[filtered_df_no[attribute] == att_value])
            
            # Conditional probabilities P(attribute = value | Loan = yes / no)
            probabilities[attribute][att_value]["yes"] = num_yes/filtered_range_yes
            probabilities[attribute][att_value]["no"] = num_no/filtered_range_no
    
    return probabilities
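
To show how a dictionary of this shape might be used, here is a minimal sketch that classifies a single record by multiplying the prior with the matching conditional probabilities. The function name `predict_loan` and the numbers in `toy_probabilities` are hypothetical and chosen only for illustration; they are not outputs of `naive_bayes` on the assignment's dataset, and the assignment does not ask for this step.

```python
def predict_loan(probabilities, record):
    """Return the more likely Loan label and the normalized class scores.

    probabilities: dict shaped like the one returned by naive_bayes
    record: dict mapping attribute name -> observed value
    """
    scores = {}
    for label in ("yes", "no"):
        # Start from the prior P(Loan = label)
        score = probabilities["Loan"][label]
        # Multiply by P(attribute = value | Loan = label) for each attribute
        for attribute, value in record.items():
            score *= probabilities[attribute][value][label]
        scores[label] = score

    # Normalize so the two scores sum to 1 (when at least one is non-zero)
    total = sum(scores.values())
    posteriors = {
        label: (score / total if total > 0 else 0.0)
        for label, score in scores.items()
    }
    return max(scores, key=scores.get), posteriors

# Hypothetical probabilities for illustration only.
toy_probabilities = {
    "Student": {
        "yes": {"yes": 0.75, "no": 0.25},
        "no": {"yes": 0.25, "no": 0.75},
    },
    "Loan": {"yes": 0.5, "no": 0.5},
}

label, posteriors = predict_loan(toy_probabilities, {"Student": "yes"})
print(label)       # yes
print(posteriors)  # {'yes': 0.75, 'no': 0.25}
```

Because no smoothing is applied, a single conditional probability of zero drives the whole score for that class to zero.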

In [10]:

# This cell has hidden test cases that will run after you submit your assignment. 
