# Introduction
For our project, we chose a data file from an adoption shelter in Austin, TX. The shelter's animal records were saved in a .csv file that we had to clean and then classify using a few different methods. We wanted to classify how long an incoming dog would spend at the shelter based on its breed, age, and a few other attributes. Being able to classify how long an animal will stay would be useful for the people running the shelter: they could plan for each animal's stay, make sure they have everything needed to make the animal's time in the shelter easier, and work to get all of the animals adopted.

## Cleaning the data
To clean the data, we first removed all of the instances that were not dogs (cats, birds, etc.). We then saved this data in a new file, dogs_data.csv. In order to classify the dogs and predict the time they would spend at the shelter, we had to remove quite a few attributes that were repeated or challenging to use. We kept most of the discretized and categorical data but removed the location where each dog was found, the names, the intake time, the adoption time, and any repetitive attributes. We saved this file as clean_data.csv, and it is the file we referred back to in order to get the data needed for classification.

In [2]:
import copy

import utils  # project helper module that provides parse_csv and write_csv

def preprocess():

    attr, table = utils.parse_csv("adoption_data.csv")

    # Preserve animal entries for dogs and classifying attribute entry 
    animal_index = attr.index('animal_type_intake')
    class_index = attr.index('time_bucket')
    table = [row for row in table if row[animal_index] == 'Dog' and row[class_index] != '']

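    # Remove all duplicate entries, keeping the first row seen for each animal ID.
    # A new list is built because removing rows from a list while iterating
    # over it skips the row that follows each removal.
    animal_ids = set()
    animal_id_index = attr.index('animal_id')
    unique_rows = []
    for row in table:
        if row[animal_id_index] not in animal_ids:
            animal_ids.add(row[animal_id_index])
            unique_rows.append(row)
    table = unique_rows
    dogs_data = copy.deepcopy(table)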
    utils.write_csv('dogs_data.csv', attr, dogs_data)

    # Remove attributes not to be trained on from instances in the dataset 
    remove_attr = ['animal_id', 'name_intake', 'date_time_intake', 'found_location', 'intake_condition', 
                    'animal_type_intake', 'month_year_intake', 'intake_sex', 'breed_intake', 'color_intake', 
                    'name_outcome', 'date_time_outcome', 'month_year_outcome','outcome_subtype', 'outcome_sex', 
                    'outcome_age', 'gender_outcome', 'fixed_intake', 'fixed_changed', 'date_time_length']

    # Remove each attribute from all rows 
    for col in remove_attr: 
        index = attr.index(col)
        attr.pop(index)
        for row in table: 
            row.pop(index)    

    utils.write_csv('clean_data.csv', attr, table)

## Naive Bayes Classifier
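
The evaluation in this section uses stratified 10-fold cross-validation. The helper `utils.stratify_data` is not shown in this report, so the following is only a minimal sketch of one common way to build stratified folds, assuming each row is a list and the class label sits at `class_index`. The function name `stratify` and the sample rows are hypothetical.

```python
from collections import defaultdict

def stratify(rows, class_index, k):
    """Split rows into k folds with roughly equal class proportions."""
    # Group the rows by class label
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[class_index]].append(row)

    folds = [[] for _ in range(k)]
    position = 0
    # Deal rows out round-robin, one class at a time, so that every
    # class is spread as evenly as possible across the folds
    for label in by_class:
        for row in by_class[label]:
            folds[position % k].append(row)
            position += 1
    return folds

# Hypothetical data: 6 "short" stays and 3 "long" stays
rows = [["a", "short"]] * 6 + [["b", "long"]] * 3
folds = stratify(rows, class_index=1, k=3)
```

With this toy data, each of the three folds receives two "short" rows and one "long" row, which mirrors the class balance of the full table.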

In [None]:
def naive_bayes(table, attr, attr_indexes, class_index): 
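    '''
    Evaluates the Naive Bayes classifier with stratified 10-fold cross
    validation and returns the fraction of test instances classified correctly.
    '''
    # Stratify data across 10 folds
    stratified_data = utils.stratify_data(table, class_index, 10)

    tp_tn = 0   # number of correct predictions across all folds
    total = 0   # number of test instances across all folds
    for fold_index, test_set in enumerate(stratified_data):
        # Train on every fold except the one held out for testing
        train_set = []
        for i, fold in enumerate(stratified_data):
            if i != fold_index:
                train_set.extend(fold)

        # Calculate probabilities of training set 
        classes, conditions, priors, posts = utils.prior_post_probabilities(train_set, attr, class_index, attr_indexes)

        # Iterate through test set 
        for inst in test_set:
            # Classify predicted and actual classes
            pred_class = utils.naive_bayes(train_set, classes, conditions, attr, priors, posts, inst, class_index)
            actual_class = inst[class_index]
            if pred_class == actual_class:
                tp_tn += 1
            total += 1

    # Accuracy over all folds (0.0 if there were no test instances)
    return tp_tn / total if total else 0.0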
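
The helpers `utils.prior_post_probabilities` and `utils.naive_bayes` are not shown in this report. As a rough illustration of what a categorical Naive Bayes classifier computes, here is a minimal sketch that assumes every attribute is categorical and uses no smoothing. The function names and the sample rows are hypothetical and are not the project's actual helpers.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, attr_indexes, class_index):
    """Count class frequencies and per-class attribute value frequencies."""
    class_counts = Counter(row[class_index] for row in rows)
    # value_counts[(attribute index, class label)][value] -> count
    value_counts = defaultdict(Counter)
    for row in rows:
        for a in attr_indexes:
            value_counts[(a, row[class_index])][row[a]] += 1
    return class_counts, value_counts

def predict_naive_bayes(class_counts, value_counts, attr_indexes, inst):
    """Return the class with the largest prior times product of conditionals."""
    total = sum(class_counts.values())
    best_label, best_score = None, -1.0
    for label, count in class_counts.items():
        score = count / total  # prior P(class)
        for a in attr_indexes:
            # P(value | class); unseen values get probability 0 (no smoothing)
            score *= value_counts[(a, label)][inst[a]] / count
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical rows: [age group, size, time bucket]
rows = [
    ["puppy", "small", "short"],
    ["puppy", "large", "short"],
    ["adult", "large", "long"],
    ["adult", "small", "long"],
    ["puppy", "small", "short"],
]
class_counts, value_counts = train_naive_bayes(rows, [0, 1], 2)
puppy_pred = predict_naive_bayes(class_counts, value_counts, [0, 1], ["puppy", "small"])
adult_pred = predict_naive_bayes(class_counts, value_counts, [0, 1], ["adult", "large"])
```

In a data set with thousands of rows and many attribute values, a zero count for a single attribute value wipes out the whole product, so Laplace smoothing may be worth considering.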

## Decision Tree Classifier

Using the decision tree classifier seemed like an easy step that would also make it simple to build an ensemble classifier, but it proved to be quite difficult. With the amount of data in the set, the trees grew so large that classifying an instance took too long when every attribute was available to split on. Instead, we took a bootstrapping approach to build one decision tree and set the maximum depth of the tree to four attributes so the tree would be more practical to use. This was a trade-off between accuracy and computational cost.

In [None]:
import random

def decision_tree_classifier(table, original_table, attr_indexes, attr_domains, class_index, header, instance_to_classify):
    '''
    Builds a decision tree for the data, classifies one randomly chosen
    instance with it, and then runs the random forest classifier.
    Returns the classification of the randomly chosen instance.
    Note: instance_to_classify is currently unused.
    '''
    # Pick a random instance from the table to classify
    rand_index = random.randint(0, len(table) - 1)
    instance = table[rand_index]
    print("Classifying instance: ", instance)
    # Pass a copy as the working list because tdidt modifies it in place
    tree = tdidt(table, list(attr_indexes), attr_indexes, attr_domains, class_index, header, [])
    utils.pretty_print(tree)
    classification = decision_tree.classify_instance(header, instance, tree)
    print(original_table[rand_index])
    print("Classification: ", classification)

    class_values = utils.get_attr_domains(table, header, [len(header) - 1])
    print(class_values, class_index)
    decision_tree.forest_classifier(table, attr_indexes, attr_domains, class_index, header, class_values, 10, 5)
    return classification
    

def tdidt(instances, att_indexes, all_att_indexes, att_domains, class_index, header, tree):
    '''
    Uses the tdidt algorithm to build a decision tree based on a given set of data
    '''
    #print("Current Tree: ", tree)
    #print("att_indexes = ", att_indexes)
    if att_indexes == []:
        return
    att_index = entropy(instances, header, att_domains, att_indexes)
    att_indexes.remove(att_index)
    partition = partition_instances(instances, att_index, att_domains[header[att_index]])
    partition_keys = partition.keys()
    
    tree.append("Attribute")
    tree.append(header[att_index])
    count = 0
    for i in range(len(att_domains[header[att_index]])):
        #print(i)
        tree.append(["Value", att_domains[header[att_index]][count]])
        col = utils.get_column(partition.get(att_domains[header[att_index]][i]), len(header)-1)
        items_in_col = []
        for item in col:
            if item not in items_in_col:
                items_in_col.append(item)
        if len(items_in_col) == 1:
            tree[2+count].append(["Leaves", has_same_class_label(instances, header, att_index, class_index, col, items_in_col[0])])
        elif len(att_indexes) == 0 and len(col) > 0:
            majority_class = compute_partition_stats(col)
            tree[2+count].append(["Leaves", has_same_class_label(instances, header, att_index, class_index, col, majority_class)])
        elif col == []:
            del tree[2+count]
            return []
        else:
            tree[2+count].append([])
            new_branch = [tdidt(partition.get(att_domains[header[att_index]][i]), att_indexes, all_att_indexes, att_domains, class_index, header, tree[2+count][2])]
            if new_branch == [[]]:
                majority_class = compute_partition_stats(col)
                # Attach the result to the current value's branch, not always the first one
                tree[2+count][2] = ["Leaves", has_same_class_label(instances, header, att_index, class_index, col, majority_class)]
            else:
                tree[2+count][2] = new_branch
        count += 1
    # Restore the attribute once, after every value has been processed
    att_indexes.append(att_index)

    return tree
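
The `entropy` helper that `tdidt` uses to pick the split attribute is not shown in this report. The sketch below illustrates the usual idea under stated assumptions: for each candidate attribute, partition the rows by value, compute the weighted average entropy of the class labels in each partition, and choose the attribute with the lowest result. The function names and sample rows are hypothetical.

```python
import math
from collections import Counter, defaultdict

def class_entropy(rows, class_index):
    """Shannon entropy (in bits) of the class labels in rows."""
    counts = Counter(row[class_index] for row in rows)
    total = len(rows)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def select_attribute(rows, attr_indexes, class_index):
    """Return the attribute index whose partitions have the lowest weighted entropy."""
    best_index, best_entropy = None, float("inf")
    for a in attr_indexes:
        # Partition the rows by the value of attribute a
        partitions = defaultdict(list)
        for row in rows:
            partitions[row[a]].append(row)
        # Weight each partition's entropy by its share of the rows
        weighted = sum(
            len(part) / len(rows) * class_entropy(part, class_index)
            for part in partitions.values()
        )
        if weighted < best_entropy:
            best_index, best_entropy = a, weighted
    return best_index

# Hypothetical rows: [age group, color, time bucket]
rows = [
    ["puppy", "brown", "short"],
    ["puppy", "black", "short"],
    ["adult", "brown", "long"],
    ["adult", "black", "long"],
]
best = select_attribute(rows, [0, 1], 2)
```

In this toy table the age group separates the classes perfectly while color tells us nothing, so the age group (index 0) is selected.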

## Ensemble Classifier
### Random Forest Method

For our ensemble classifier, we used the random forest approach and built on the decision tree classifier. In the bootstrapping method, we chose to split the trees on 4 attributes and keep the best 10 of the 20 trees created. This keeps computational costs to a minimum, since it takes some time to go through all of the data, build multiple trees, test the trees on the validation set, and then use the forest to classify the instances in the training set.
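
The project's `decision_tree.forest_classifier` is not shown in this report. The sketch below outlines the general procedure under stated assumptions: draw a bootstrap sample for each tree, score each tree on the rows left out of its sample, keep the best trees, and classify by majority vote. The tree builder and classifier are passed in as callables, and the stand-in "tree" used here is a hypothetical stub that simply predicts the majority class. The random selection of attributes at each split is omitted.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Sample len(rows) rows with replacement; rows never picked form the validation set."""
    n = len(rows)
    picked = [rng.randrange(n) for _ in range(n)]
    picked_set = set(picked)
    train = [rows[i] for i in picked]
    validation = [rows[i] for i in range(n) if i not in picked_set]
    return train, validation

def build_forest(rows, n_trees, m_best, build_tree, classify, class_index, rng):
    """Build n_trees on bootstrap samples and keep the m_best by validation accuracy."""
    scored = []
    for _ in range(n_trees):
        train, validation = bootstrap_sample(rows, rng)
        tree = build_tree(train)
        correct = sum(1 for row in validation if classify(tree, row) == row[class_index])
        accuracy = correct / len(validation) if validation else 0.0
        scored.append((accuracy, tree))
    # Sort by accuracy only, so the trees themselves are never compared
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [tree for _, tree in scored[:m_best]]

def forest_predict(forest, classify, inst):
    """Majority vote over the predictions of every tree in the forest."""
    votes = Counter(classify(tree, inst) for tree in forest)
    return votes.most_common(1)[0][0]

# Hypothetical stand-in "tree": remembers the majority class of its sample
def build_stub(train):
    return Counter(row[-1] for row in train).most_common(1)[0][0]

def classify_stub(tree, inst):
    return tree

rng = random.Random(0)
rows = [["a", "short"]] * 8 + [["b", "long"]] * 2
forest = build_forest(rows, n_trees=20, m_best=10, build_tree=build_stub,
                      classify=classify_stub, class_index=1, rng=rng)
prediction = forest_predict(forest, classify_stub, ["a", None])
```

Passing `n_trees=20` and `m_best=10` matches the numbers described above. Scoring each tree only on rows outside its own bootstrap sample keeps the selection step from rewarding trees that memorized their training rows.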

## K-Means Clustering Classifier

# Conclusions

After working with the different classifiers, it is easy to see why data mining is a tedious task with such large data sets. Working with thousands of instances made it hard to keep a high accuracy. With the data we chose, we have also concluded that it is hard to predict something that is very case by case. When people adopt a dog, it comes down to their preference in breed and gender, but also to which dog they bond with most. It also depends on which dogs are at the shelter at the time they choose to adopt. There are so many uncontrolled variables in data such as this that it is nearly impossible to get highly accurate classifiers. However, by using our multiple classifiers we were able to come up with solutions that classify how long a dog will spend at the shelter accurately enough that they could help the people who run the shelter.