# MLE - Exercise 3 - Kaggle Competition
## Andreas Kocman (se19m024)


This exercise is in the form of a Kaggle competition. A few quick details on Kaggle & the competition format:

## Kaggle
* Kaggle (https://en.wikipedia.org/wiki/Kaggle) is a platform that hosts competitions on a given dataset. Participants submit their predictions for a test set, receive an automated score, and are placed on the leaderboard.
* From Kaggle, you can obtain a labelled training set and an unlabelled test set.
* You can submit multiple entries to Kaggle. For each entry, you need to describe how you achieved the results: which software (including its version), which operating system, which algorithms, and which parameter settings for these algorithms, as well as any processing applied to the data before training/predicting. There is a specific "description" field when submitting; fill in this information there, and also include this description and the actual submission file in your final submission to Moodle.
* To submit to Kaggle, you need to create a submission file that contains the predictions you obtain on the test set. Kaggle computes the aggregated evaluation criterion automatically.
* The format of your submission is rather simple: it is a comma-separated file, where the first column is the identifier of the item that you are predicting, and the second column is the class you are predicting for that item. The first line should contain a header that uses the column names provided in the training set. An example is below:
```
ID,class
911366,B
852781,B
89524,B
857438,B
905686,B
```
* There is a limit of 7 submissions per day. At the end, you also need to select your top 7 submissions to be counted in the competition.
* Before you submit, you should evaluate the classifiers "locally" on your training set, i.e. by splitting it again into a training and a test set (or by using cross-validation), to select a number of suitable algorithms and parameters. Then re-train your best models on the full local training set and generate the predictions for the test set.
* Evaluation in Kaggle is split into two leaderboards, the public and the private one. The test data is split 50% / 50%, and as soon as you upload, you will see your results on one of these splits.
* The final results will only be visible once the competition closes. As they are computed on a different split, they might differ slightly from what you see initially (this is similar to a training/test/validation split).
* As it is a competition, there will be bonus points for the top 3 submissions.
* As reproducible science is great, there will be additional bonus points for submissions that use a notebook within the Kaggle competition. Note: this was, and partially still is, called a "kernel" inside Kaggle. That term was confusing, as it simply refers to code executed in Kaggle's own environment (e.g. a Jupyter notebook, or a Python or R script), and it has since been renamed. See https://www.kaggle.com/notebooks or https://www.kaggle.com/getting-started/44939. You can first work locally and then port your code to the notebook version. In Kaggle, your notebook will initially be private; please share it at least with me (mayer@ifs.tuwien.ac.at). You can also make it public at the end of the competition, to show off :-)
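
Because only 7 uploads are allowed per day, it can help to check a submission file locally before uploading. The following is a minimal sketch and not part of the exercise itself: `check_submission` is a hypothetical helper, and the expected header (`ID,class`) is taken from the example above and has to be adapted to each competition.

```python
import csv
import io

def check_submission(csv_text, expected_header=("ID", "class"), allowed_labels=None):
    """Return a list of problems found in a submission file (empty if none)."""
    problems = []
    rows = list(csv.reader(io.StringIO(csv_text)))
    # The first line must be the header with the expected column names.
    if not rows or tuple(rows[0]) != tuple(expected_header):
        problems.append("first line must be the header " + ",".join(expected_header))
        return problems
    seen_ids = set()
    for line_number, row in enumerate(rows[1:], start=2):
        if len(row) != 2:
            problems.append(f"line {line_number}: expected 2 columns, got {len(row)}")
            continue
        item_id, label = row
        # Each item may be predicted only once.
        if item_id in seen_ids:
            problems.append(f"line {line_number}: duplicate ID {item_id}")
        seen_ids.add(item_id)
        # Optionally restrict the predicted classes to a known set.
        if allowed_labels is not None and label not in allowed_labels:
            problems.append(f"line {line_number}: unexpected class {label!r}")
    return problems

example = "ID,class\n911366,B\n852781,B\n89524,B\n"
print(check_submission(example))
```

An empty list means no problem was detected; the check says nothing about the quality of the predictions.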

## Datasets
We will use the following datasets:
* Congressional Voting: a small dataset, a good entry point for your experiments (435 instances, 16 features)
  * Kaggle page: https://www.kaggle.com/t/c04c953c596e48099d857129f53fcbdb
* Amazon reviews: a dataset with many features (10k, extracted from text), but not that many instances (~800)
  * Kaggle page: https://www.kaggle.com/t/0bd2ac297dc242478b5979d5ee772136

## Submission
The Kaggle competition will close on the day displayed in Kaggle. After that, you still have time to submit to Moodle. Your submission to Moodle shall contain:

* A brief report, containing
  * A description of the datasets, including a short analysis of the features.
  * Details on the software you used for creating your solution
  * The algorithms and parameters you tried
  * The results you obtained on the locally split training/test set
    * And a comparison to the results that you received on Kaggle - how large was the difference, did the rank of the classifiers change (i.e. the first on your training set, was it still the best on the test set on Kaggle?)
* All the code needed to obtain your results
* The solution files that you uploaded to Kaggle

# Solution

## Helper Functions for Solution and Data Analysis

In [1]:
# global Imports
import pandas as pd
import numpy as np

#sk learn imports
from sklearn.model_selection import train_test_split
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn import metrics
from sklearn.model_selection import cross_validate
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import make_scorer

#Data reporting
from IPython.display import display

# Global definitions:
overall_results_vote = []
overall_results_amazon = []
averaging_approach = 'macro'
zero_division_approach = 0
number_of_folds = 2
scoring = {'Accuracy': make_scorer(accuracy_score),
            'Precision': make_scorer(precision_score, average=averaging_approach, zero_division=zero_division_approach),
            'Recall': make_scorer(recall_score, average=averaging_approach, zero_division=zero_division_approach)}

# Helper functions
def parse_k_fold_results(results):
    return "m: " + str(np.average(results)) + " std: " + str(np.std(results))

def parse_argument_tuple_as_string(argumentsTuple):
    return "max Depth: " + str(argumentsTuple[0])  + \
           ", min Samples: " + str(argumentsTuple[1])

def calculate_results_holdout(classifier_used, X_train, X_test, y_train, y_test):
    classifier_used.fit(X_train, y_train)

    # predict the test set on our trained classifier
    y_test_predicted = classifier_used.predict(X_test)

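    acc = metrics.accuracy_score(y_test, y_test_predicted)
    # Use the same averaging as the cross-validation scorers; the default
    # average='binary' expects pos_label=1, which is not among the labels 2/3.
    recall = metrics.recall_score(y_test, y_test_predicted,
                                  average=averaging_approach,
                                  zero_division=zero_division_approach)
    precision = metrics.precision_score(y_test, y_test_predicted,
                                        average=averaging_approach,
                                        zero_division=zero_division_approach)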

    return pd.Series({
            'classifier': str(classifier_used),
            'arguments': "",
            'accuracy':acc,
            'precision':precision,
            'recall':recall
        })

def calculate_results_cross_validate(classifier_used, description_used, data, target):
   scores = cross_validate(classifier_used, data, target,
                                scoring = scoring,
                                cv = number_of_folds,
                                error_score = 0)

   return pd.Series({
            'classifier': str(classifier_used),
            'arguments': description_used,
            'mean_accuracy': np.average(scores.get('test_Accuracy')),
            'mean_precision': np.average(scores.get('test_Precision')),
            'mean_recall': np.average(scores.get('test_Recall')),
            'accuracy': parse_k_fold_results(scores.get('test_Accuracy')),
            'precision': parse_k_fold_results(scores.get('test_Precision')),
            'recall':parse_k_fold_results(scores.get('test_Recall'))
        })

def print_results(array, column_for_max, ascending=False):
    df = pd.DataFrame(array)
    df = df.sort_values(by=[column_for_max], ascending=ascending)
    display('Results', df)

    best = df.iloc[df[column_for_max].argmax()]
    display(best)
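
The scorers above use `average='macro'` with `zero_division=0`. As a rough illustration of what that means (a plain-Python sketch, not scikit-learn's implementation): precision and recall are computed per class and then averaged without weighting, and a class that is never predicted contributes a precision of 0. `macro_precision_recall` is a hypothetical name.

```python
def macro_precision_recall(y_true, y_pred, zero_division=0.0):
    """Unweighted mean of per-class precision and recall."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls = [], []
    for label in labels:
        true_positive = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        actual = sum(t == label for t in y_true)
        # Fall back to zero_division when a class is never predicted / never occurs.
        precisions.append(true_positive / predicted if predicted else zero_division)
        recalls.append(true_positive / actual if actual else zero_division)
    return sum(precisions) / len(labels), sum(recalls) / len(labels)

# Labels 2 and 3 stand for democrat and republican, as in the recoding used later.
y_true = [2, 2, 2, 3]
y_pred = [2, 2, 3, 3]
precision, recall = macro_precision_recall(y_true, y_pred)
print(precision, recall)
```

Macro averaging gives both classes the same weight regardless of how often they occur, which matters when the classes are imbalanced.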

### Calculation Functions


#### k-NN Calculation

In [2]:
from sklearn import neighbors

def calculate_knn(data, target):
    knn_results = []

    n_neighbors = range(1,10,1)

    for n in n_neighbors:
        knn_classifier = neighbors.KNeighborsClassifier(n)
        description = "N = " + str(n)
        result = calculate_results_cross_validate(knn_classifier,
                                                  description,
                                                  data,
                                                  target)
        knn_results.append(result)
    return knn_results


#### Bayes Calculation

In [3]:
from sklearn import naive_bayes

def calculate_bayes(data, target):
    bayes_results = []

    alphas = np.arange(0.1,5,1)

    for alpha in alphas:
        classifier = naive_bayes.CategoricalNB(alpha = alpha)
        description = "Alpha = " + str(alpha)
        result = calculate_results_cross_validate(classifier,
                                                  description,
                                                  data,
                                                  target)
        bayes_results.append(result)

    return bayes_results

#### Perceptron Calculation

In [4]:
from sklearn import linear_model

def calculate_perceptron(data, target):
    perceptron_results=[]
    classifier = linear_model.Perceptron()
    description = "No additional args."
    result = calculate_results_cross_validate(classifier,
                                              description,
                                              data,
                                              target)
    perceptron_results.append(result)
    return perceptron_results

#### Decision Tree Calculation

In [5]:
from sklearn import tree
import itertools

def calculate_decision_tree(data, target):
    # Parameters for the decision tree
    max_depth_arguments = range(1, 10, 2)
    min_samples_leaf_arguments = [2,20,50,100]
    argumentTuples = list(itertools.product(max_depth_arguments,
                                            min_samples_leaf_arguments))
    decision_tree_results = []

    for argumentTuple in argumentTuples:
        max_depth = argumentTuple[0]
        min_samples_leaf = argumentTuple[1]

        classifier = tree.DecisionTreeClassifier(criterion = 'gini',
                                                 max_depth = max_depth,
                                                 min_samples_leaf = min_samples_leaf,
                                                 splitter = 'best')
        #result = calculate_results_holdout(classifier, X_train, X_test, y_train, y_test)
        result = calculate_results_cross_validate(classifier,
                                                  parse_argument_tuple_as_string(argumentTuple),
                                                  data,
                                                  target)
        decision_tree_results.append(result)
    return decision_tree_results

#### SVM Calculation

In [6]:
from sklearn import svm
import itertools

def calculate_svm(data, target):
    kernels = ["poly", "rbf"]  # other options: "linear", "sigmoid"
    gammas = [0.001, "scale", "auto"]
    cs = [100]
    degrees = range(1, 10, 1)

    argumentTuples = list(itertools.product(kernels,
                                            gammas,
                                            cs,
                                            degrees))
    svm_results = []

    # Unpack each combination directly instead of re-binding the list names
    for kernel, gamma, c, degree in argumentTuples:
        classifier = svm.SVC(kernel = kernel, gamma = gamma, C = c, degree = degree)

        #result = calculate_results_holdout(classifier, X_train, X_test, y_train, y_test)
        result = calculate_results_cross_validate(classifier,
                                                  "Kernel: " + kernel + ", Degree: " + str(degree),
                                                  data,
                                                  target)
        svm_results.append(result)
    return svm_results
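
In scikit-learn's `SVC`, `degree` only affects the `poly` kernel and is ignored by the other kernels, so the grid above evaluates each `rbf` setting nine times with identical results. One possible way to avoid the redundant runs is sketched below; `build_svm_grid` is a hypothetical helper that reuses the parameter values from `calculate_svm`.

```python
import itertools

def build_svm_grid(kernels, gammas, cs, degrees):
    """Yield SVC keyword arguments, varying degree only for the poly kernel."""
    for kernel, gamma, c in itertools.product(kernels, gammas, cs):
        if kernel == "poly":
            for degree in degrees:
                yield {"kernel": kernel, "gamma": gamma, "C": c, "degree": degree}
        else:
            # degree is ignored for non-polynomial kernels, so keep the default
            yield {"kernel": kernel, "gamma": gamma, "C": c}

grid = list(build_svm_grid(["poly", "rbf"], [0.001, "scale", "auto"], [100], range(1, 10)))
print(len(grid))  # 30 combinations instead of 54
```

Each dictionary could then be passed on as `svm.SVC(**params)`.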

## Congressional Voting

In [7]:
#Recode values for predicting variables
def recode_voting_data(dataset):
    dataset = dataset.replace('y', 1)\
        .replace('n', 0)\
        .replace('democrat', 2)\
        .replace('republican', 3)
    dataset.loc[:, dataset.columns != "ID"] = dataset.loc[:, dataset.columns != "ID"].astype('category')
    return pd.DataFrame(dataset, columns=dataset.columns)

#Impute missing values
def input_missing_values(data):
    columns = data.columns
    imp = IterativeImputer(max_iter=10, random_state=0)
    imp.fit(data.loc[:, data.columns != "ID"])
    data.loc[:, data.columns != "ID"] = np.round(imp.transform(data.loc[:, data.columns != "ID"]))
    data.loc[:, data.columns != "ID"] = data.loc[:, data.columns != "ID"].apply(lambda x: x.astype('int'))
    data.loc[:, data.columns != "ID"] = data.loc[:, data.columns != "ID"].astype('category')
    return pd.DataFrame(data, columns=columns)

#Read Data
votingDataLearn = pd.read_csv("data/voting/CongressionalVotingID.shuf.lrn.csv", na_values='unknown')
votingDataSolutionExample = pd.read_csv("data/voting/CongressionalVotingID.shuf.sol.ex.csv", na_values='unknown')
votingDataTest = pd.read_csv("data/voting/CongressionalVotingID.shuf.tes.csv", na_values='unknown')
display("Original Data", votingDataLearn)

#Recode values
votingDataLearn = recode_voting_data(votingDataLearn)
votingDataLearn = input_missing_values(votingDataLearn)

display("Recoded Data", votingDataLearn)

display("Data: ", votingDataLearn[votingDataLearn.columns[2:18]])
display("Target: ", votingDataLearn["class"])

'Original Data'

Unnamed: 0,ID,class,handicapped-infants,water-project-cost-sharing,adoption-of-the-budget-resolution,physician-fee-freeze,el-salvador-aid,religious-groups-in-schools,anti-satellite-test-ban,aid-to-nicaraguan-contras,mx-missile,immigration,synfuels-crporation-cutback,education-spending,superfund-right-to-sue,crime,duty-free-exports,export-administration-act-south-africa
0,213,democrat,n,n,y,n,n,n,y,y,y,n,y,n,n,n,y,y
1,94,democrat,y,n,y,n,n,n,y,n,y,y,y,n,n,n,y,y
2,188,democrat,y,n,y,n,n,n,y,y,y,n,n,n,n,n,y,
3,61,democrat,y,y,y,n,n,,y,y,y,y,n,n,n,n,y,
4,184,democrat,,,,,,,,,y,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
213,250,democrat,y,n,y,n,n,n,y,y,,n,y,n,n,n,y,y
214,26,democrat,y,n,y,n,n,n,y,y,y,y,n,n,n,n,y,y
215,110,democrat,y,,y,n,n,n,y,y,y,n,n,n,n,n,y,
216,34,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,y




'Recoded Data'

Unnamed: 0,ID,class,handicapped-infants,water-project-cost-sharing,adoption-of-the-budget-resolution,physician-fee-freeze,el-salvador-aid,religious-groups-in-schools,anti-satellite-test-ban,aid-to-nicaraguan-contras,mx-missile,immigration,synfuels-crporation-cutback,education-spending,superfund-right-to-sue,crime,duty-free-exports,export-administration-act-south-africa
0,213,2,0,0,1,0,0,0,1,1,1,0,1,0,0,0,1,1
1,94,2,1,0,1,0,0,0,1,0,1,1,1,0,0,0,1,1
2,188,2,1,0,1,0,0,0,1,1,1,0,0,0,0,0,1,1
3,61,2,1,1,1,0,0,0,1,1,1,1,0,0,0,0,1,1
4,184,2,1,0,1,0,0,0,1,1,1,1,0,0,0,0,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
213,250,2,1,0,1,0,0,0,1,1,1,0,1,0,0,0,1,1
214,26,2,1,0,1,0,0,0,1,1,1,1,0,0,0,0,1,1
215,110,2,1,0,1,0,0,0,1,1,1,0,0,0,0,0,1,1
216,34,3,0,1,0,1,1,1,0,0,0,0,0,1,1,1,0,1


'Data: '

Unnamed: 0,handicapped-infants,water-project-cost-sharing,adoption-of-the-budget-resolution,physician-fee-freeze,el-salvador-aid,religious-groups-in-schools,anti-satellite-test-ban,aid-to-nicaraguan-contras,mx-missile,immigration,synfuels-crporation-cutback,education-spending,superfund-right-to-sue,crime,duty-free-exports,export-administration-act-south-africa
0,0,0,1,0,0,0,1,1,1,0,1,0,0,0,1,1
1,1,0,1,0,0,0,1,0,1,1,1,0,0,0,1,1
2,1,0,1,0,0,0,1,1,1,0,0,0,0,0,1,1
3,1,1,1,0,0,0,1,1,1,1,0,0,0,0,1,1
4,1,0,1,0,0,0,1,1,1,1,0,0,0,0,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
213,1,0,1,0,0,0,1,1,1,0,1,0,0,0,1,1
214,1,0,1,0,0,0,1,1,1,1,0,0,0,0,1,1
215,1,0,1,0,0,0,1,1,1,0,0,0,0,0,1,1
216,0,1,0,1,1,1,0,0,0,0,0,1,1,1,0,1


'Target: '

0      2
1      2
2      2
3      2
4      2
      ..
213    2
214    2
215    2
216    3
217    3
Name: class, Length: 218, dtype: category
Categories (2, int64): [2, 3]

### k-NN - Congressional Vote

In [8]:
knn_results_vote = calculate_knn(votingDataLearn[votingDataLearn.columns[2:18]],
                                 votingDataLearn["class"])
overall_results_vote.extend(knn_results_vote)

print_results(knn_results_vote, "mean_accuracy")

'Results'

Unnamed: 0,classifier,arguments,mean_accuracy,mean_precision,mean_recall,accuracy,precision,recall
1,KNeighborsClassifier(n_neighbors=2),N = 2,0.958716,0.955407,0.959755,m: 0.9587155963302753 std: 0.013761467889908285,m: 0.9554073308361875 std: 0.0176916648596519,m: 0.9597547974413646 std: 0.008972992181947359
6,KNeighborsClassifier(n_neighbors=7),N = 7,0.949541,0.943333,0.958955,m: 0.9495412844036697 std: 0.02293577981651379,m: 0.9433333333333334 std: 0.023333333333333373,m: 0.9589552238805971 std: 0.018656716417910446
8,KNeighborsClassifier(n_neighbors=9),N = 9,0.944954,0.939216,0.955224,m: 0.9449541284403671 std: 0.027522935779816515,m: 0.9392156862745098 std: 0.027450980392156876,m: 0.9552238805970149 std: 0.022388059701492602
0,KNeighborsClassifier(n_neighbors=1),N = 1,0.944954,0.940633,0.950782,m: 0.944954128440367 std: 0.01834862385321101,m: 0.9406325515280739 std: 0.020632551528073972,m: 0.9507818052594172 std: 0.010483297796730628
3,KNeighborsClassifier(n_neighbors=4),N = 4,0.944954,0.940076,0.948561,m: 0.944954128440367 std: 0.01834862385321101,m: 0.9400758575390029 std: 0.02118924551714496,m: 0.9485607675906184 std: 0.012704335465529515
2,KNeighborsClassifier(n_neighbors=3),N = 3,0.940367,0.934982,0.94705,m: 0.9403669724770642 std: 0.013761467889908285,m: 0.9349823819591261 std: 0.014982381959126156,m: 0.9470504619758351 std: 0.006751954513148473
4,KNeighborsClassifier(),N = 5,0.940367,0.93565,0.944829,m: 0.9403669724770642 std: 0.022935779816513735,m: 0.935649558330795 std: 0.02561554472535288,m: 0.9448294243070363 std: 0.016435678749111615
5,KNeighborsClassifier(n_neighbors=6),N = 6,0.940367,0.93565,0.944829,m: 0.9403669724770642 std: 0.022935779816513735,m: 0.935649558330795 std: 0.02561554472535288,m: 0.9448294243070363 std: 0.016435678749111615
7,KNeighborsClassifier(n_neighbors=8),N = 8,0.940367,0.934982,0.94705,m: 0.9403669724770642 std: 0.013761467889908285,m: 0.9349823819591261 std: 0.014982381959126156,m: 0.9470504619758351 std: 0.006751954513148473


classifier                    KNeighborsClassifier(n_neighbors=2)
arguments                                                   N = 2
mean_accuracy                                            0.958716
mean_precision                                           0.955407
mean_recall                                              0.959755
accuracy          m: 0.9587155963302753 std: 0.013761467889908285
precision           m: 0.9554073308361875 std: 0.0176916648596519
recall            m: 0.9597547974413646 std: 0.008972992181947359
Name: 1, dtype: object
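
The `m:` and `std:` values in the table are the mean and the population standard deviation (`np.std`) over the folds. With `number_of_folds = 2`, the standard deviation is simply half the gap between the two fold scores. The sketch below reproduces the `N = 2` row; the per-fold counts (106 and 103 correct out of 109) are inferred from the reported mean and std and are an assumption, not values printed by the notebook.

```python
from statistics import mean, pstdev

# Assumed per-fold accuracies: 218 training rows split into two folds of 109.
fold_accuracies = [106 / 109, 103 / 109]

fold_mean = mean(fold_accuracies)
fold_std = pstdev(fold_accuracies)  # population std, like np.std's default

# With exactly two folds, the std equals half the absolute difference.
half_gap = abs(fold_accuracies[0] - fold_accuracies[1]) / 2

print(f"m: {fold_mean:.6f} std: {fold_std:.6f} half gap: {half_gap:.6f}")
```

With only two folds the std is a very coarse estimate of the variability; more folds would give a more informative spread.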

### Bayes - Congressional Vote

In [9]:
bayes_results_vote = calculate_bayes(votingDataLearn[votingDataLearn.columns[2:18]],
                                     votingDataLearn["class"])
overall_results_vote.extend(bayes_results_vote)

print_results(bayes_results_vote, "mean_accuracy")

'Results'

Unnamed: 0,classifier,arguments,mean_accuracy,mean_precision,mean_recall,accuracy,precision,recall
1,CategoricalNB(alpha=1.1),Alpha = 1.1,0.922018,0.917619,0.927683,m: 0.9220183486238532 std: 0.022935779816513735,m: 0.9176188746213874 std: 0.02427877996283434,m: 0.927683013503909 std: 0.014214641080312673
2,CategoricalNB(alpha=2.1),Alpha = 2.1,0.922018,0.91625,0.929904,m: 0.9220183486238532 std: 0.022935779816513735,m: 0.916250466909696 std: 0.022910372251143063,m: 0.9299040511727079 std: 0.01643567874911156
3,CategoricalNB(alpha=3.1),Alpha = 3.1,0.922018,0.91625,0.929904,m: 0.9220183486238532 std: 0.022935779816513735,m: 0.916250466909696 std: 0.022910372251143063,m: 0.9299040511727079 std: 0.01643567874911156
4,CategoricalNB(alpha=4.1),Alpha = 4.1,0.922018,0.91625,0.929904,m: 0.9220183486238532 std: 0.022935779816513735,m: 0.916250466909696 std: 0.022910372251143063,m: 0.9299040511727079 std: 0.01643567874911156
0,CategoricalNB(alpha=0.1),Alpha = 0.1,0.912844,0.908527,0.915778,m: 0.9128440366972477 std: 0.02293577981651379,m: 0.9085268584490431 std: 0.025476010991416054,m: 0.915778251599147 std: 0.014214641080312729


classifier                               CategoricalNB(alpha=1.1)
arguments                                             Alpha = 1.1
mean_accuracy                                            0.922018
mean_precision                                           0.917619
mean_recall                                              0.927683
accuracy          m: 0.9220183486238532 std: 0.022935779816513735
precision          m: 0.9176188746213874 std: 0.02427877996283434
recall             m: 0.927683013503909 std: 0.014214641080312673
Name: 1, dtype: object

### Perceptron - Congressional Vote

In [10]:
perceptron_results_vote = calculate_perceptron(votingDataLearn[votingDataLearn.columns[2:18]],
                                               votingDataLearn["class"])
overall_results_vote.extend(perceptron_results_vote)

print_results(perceptron_results_vote, "mean_accuracy")

'Results'

Unnamed: 0,classifier,arguments,mean_accuracy,mean_precision,mean_recall,accuracy,precision,recall
0,Perceptron(),No additional args.,0.926606,0.928189,0.918088,m: 0.926605504587156 std: 0.009174311926605505,m: 0.9281886273165343 std: 0.01663119250328554,m: 0.9180881307746979 std: 0.003020611229566428


classifier                                           Perceptron()
arguments                                     No additional args.
mean_accuracy                                            0.926606
mean_precision                                           0.928189
mean_recall                                              0.918088
accuracy           m: 0.926605504587156 std: 0.009174311926605505
precision          m: 0.9281886273165343 std: 0.01663119250328554
recall            m: 0.9180881307746979 std: 0.003020611229566428
Name: 0, dtype: object

### Decision Tree - Congressional Vote

In [11]:
decision_tree_results_vote = calculate_decision_tree(votingDataLearn[votingDataLearn.columns[2:18]],
                                                     votingDataLearn["class"])
overall_results_vote.extend(decision_tree_results_vote)

print_results(decision_tree_results_vote, "mean_accuracy")

'Results'

Unnamed: 0,classifier,arguments,mean_accuracy,mean_precision,mean_recall,accuracy,precision,recall
0,"DecisionTreeClassifier(max_depth=1, min_sample...","max Depth: 1, min Samples: 2",0.96789,0.962041,0.973881,m: 0.9678899082568808 std: 0.01376146788990823,m: 0.9620406189555126 std: 0.015232108317214721,m: 0.9738805970149254 std: 0.011194029850746245
17,"DecisionTreeClassifier(max_depth=9, min_sample...","max Depth: 9, min Samples: 20",0.96789,0.962041,0.973881,m: 0.9678899082568808 std: 0.01376146788990823,m: 0.9620406189555126 std: 0.015232108317214721,m: 0.9738805970149254 std: 0.011194029850746245
5,"DecisionTreeClassifier(max_depth=3, min_sample...","max Depth: 3, min Samples: 20",0.96789,0.962041,0.973881,m: 0.9678899082568808 std: 0.01376146788990823,m: 0.9620406189555126 std: 0.015232108317214721,m: 0.9738805970149254 std: 0.011194029850746245
9,"DecisionTreeClassifier(max_depth=5, min_sample...","max Depth: 5, min Samples: 20",0.96789,0.962041,0.973881,m: 0.9678899082568808 std: 0.01376146788990823,m: 0.9620406189555126 std: 0.015232108317214721,m: 0.9738805970149254 std: 0.011194029850746245
1,"DecisionTreeClassifier(max_depth=1, min_sample...","max Depth: 1, min Samples: 20",0.96789,0.962041,0.973881,m: 0.9678899082568808 std: 0.01376146788990823,m: 0.9620406189555126 std: 0.015232108317214721,m: 0.9738805970149254 std: 0.011194029850746245
13,"DecisionTreeClassifier(max_depth=7, min_sample...","max Depth: 7, min Samples: 20",0.96789,0.962041,0.973881,m: 0.9678899082568808 std: 0.01376146788990823,m: 0.9620406189555126 std: 0.015232108317214721,m: 0.9738805970149254 std: 0.011194029850746245
12,"DecisionTreeClassifier(max_depth=7, min_sample...","max Depth: 7, min Samples: 2",0.958716,0.955036,0.959755,m: 0.9587155963302753 std: 0.004587155963302725,m: 0.955036335849292 std: 0.001485403281142772,m: 0.9597547974413647 std: 0.010394456289978649
16,"DecisionTreeClassifier(max_depth=9, min_sample...","max Depth: 9, min Samples: 2",0.958716,0.955036,0.959755,m: 0.9587155963302753 std: 0.004587155963302725,m: 0.955036335849292 std: 0.001485403281142772,m: 0.9597547974413647 std: 0.010394456289978649
8,"DecisionTreeClassifier(max_depth=5, min_sample...","max Depth: 5, min Samples: 2",0.958716,0.955036,0.959755,m: 0.9587155963302753 std: 0.004587155963302725,m: 0.955036335849292 std: 0.001485403281142772,m: 0.9597547974413647 std: 0.010394456289978649
4,"DecisionTreeClassifier(max_depth=3, min_sample...","max Depth: 3, min Samples: 2",0.954128,0.95018,0.956023,m: 0.9541284403669725 std: 0.0,m: 0.9501797216032235 std: 0.0033712109649257083,m: 0.9560234541577826 std: 0.006663113006396548


classifier        DecisionTreeClassifier(max_depth=1, min_sample...
arguments                              max Depth: 1, min Samples: 2
mean_accuracy                                               0.96789
mean_precision                                             0.962041
mean_recall                                                0.973881
accuracy             m: 0.9678899082568808 std: 0.01376146788990823
precision           m: 0.9620406189555126 std: 0.015232108317214721
recall              m: 0.9738805970149254 std: 0.011194029850746245
Name: 0, dtype: object

### SVM - Congressional Vote

In [12]:
svm_results_vote = calculate_svm(votingDataLearn[votingDataLearn.columns[2:18]],
                                 votingDataLearn["class"])
overall_results_vote.extend(svm_results_vote)

print_results(svm_results_vote, "mean_accuracy")

'Results'

Unnamed: 0,classifier,arguments,mean_accuracy,mean_precision,mean_recall,accuracy,precision,recall
0,"SVC(C=100, degree=1, gamma=0.001, kernel='poly')","Kernel: poly, Degree: 1",0.96789,0.962041,0.973881,m: 0.9678899082568808 std: 0.01376146788990823,m: 0.9620406189555126 std: 0.015232108317214721,m: 0.9738805970149254 std: 0.011194029850746245
28,"SVC(C=100, degree=2, gamma=0.001)","Kernel: rbf, Degree: 2",0.96789,0.962041,0.973881,m: 0.9678899082568808 std: 0.01376146788990823,m: 0.9620406189555126 std: 0.015232108317214721,m: 0.9738805970149254 std: 0.011194029850746245
35,"SVC(C=100, degree=9, gamma=0.001)","Kernel: rbf, Degree: 9",0.96789,0.962041,0.973881,m: 0.9678899082568808 std: 0.01376146788990823,m: 0.9620406189555126 std: 0.015232108317214721,m: 0.9738805970149254 std: 0.011194029850746245
34,"SVC(C=100, degree=8, gamma=0.001)","Kernel: rbf, Degree: 8",0.96789,0.962041,0.973881,m: 0.9678899082568808 std: 0.01376146788990823,m: 0.9620406189555126 std: 0.015232108317214721,m: 0.9738805970149254 std: 0.011194029850746245
33,"SVC(C=100, degree=7, gamma=0.001)","Kernel: rbf, Degree: 7",0.96789,0.962041,0.973881,m: 0.9678899082568808 std: 0.01376146788990823,m: 0.9620406189555126 std: 0.015232108317214721,m: 0.9738805970149254 std: 0.011194029850746245
32,"SVC(C=100, degree=6, gamma=0.001)","Kernel: rbf, Degree: 6",0.96789,0.962041,0.973881,m: 0.9678899082568808 std: 0.01376146788990823,m: 0.9620406189555126 std: 0.015232108317214721,m: 0.9738805970149254 std: 0.011194029850746245
31,"SVC(C=100, degree=5, gamma=0.001)","Kernel: rbf, Degree: 5",0.96789,0.962041,0.973881,m: 0.9678899082568808 std: 0.01376146788990823,m: 0.9620406189555126 std: 0.015232108317214721,m: 0.9738805970149254 std: 0.011194029850746245
30,"SVC(C=100, degree=4, gamma=0.001)","Kernel: rbf, Degree: 4",0.96789,0.962041,0.973881,m: 0.9678899082568808 std: 0.01376146788990823,m: 0.9620406189555126 std: 0.015232108317214721,m: 0.9738805970149254 std: 0.011194029850746245
29,"SVC(C=100, gamma=0.001)","Kernel: rbf, Degree: 3",0.96789,0.962041,0.973881,m: 0.9678899082568808 std: 0.01376146788990823,m: 0.9620406189555126 std: 0.015232108317214721,m: 0.9738805970149254 std: 0.011194029850746245
27,"SVC(C=100, degree=1, gamma=0.001)","Kernel: rbf, Degree: 1",0.96789,0.962041,0.973881,m: 0.9678899082568808 std: 0.01376146788990823,m: 0.9620406189555126 std: 0.015232108317214721,m: 0.9738805970149254 std: 0.011194029850746245


classifier        SVC(C=100, degree=1, gamma=0.001, kernel='poly')
arguments                                  Kernel: poly, Degree: 1
mean_accuracy                                              0.96789
mean_precision                                            0.962041
mean_recall                                               0.973881
accuracy            m: 0.9678899082568808 std: 0.01376146788990823
precision          m: 0.9620406189555126 std: 0.015232108317214721
recall             m: 0.9738805970149254 std: 0.011194029850746245
Name: 0, dtype: object

### Overall Results for Congressional Vote

In [13]:
print_results(overall_results_vote, "mean_accuracy")


'Results'

Unnamed: 0,classifier,arguments,mean_accuracy,mean_precision,mean_recall,accuracy,precision,recall
68,"SVC(C=100, degree=7, gamma=0.001)","Kernel: rbf, Degree: 7",0.967890,0.962041,0.973881,m: 0.9678899082568808 std: 0.01376146788990823,m: 0.9620406189555126 std: 0.015232108317214721,m: 0.9738805970149254 std: 0.011194029850746245
63,"SVC(C=100, degree=2, gamma=0.001)","Kernel: rbf, Degree: 2",0.967890,0.962041,0.973881,m: 0.9678899082568808 std: 0.01376146788990823,m: 0.9620406189555126 std: 0.015232108317214721,m: 0.9738805970149254 std: 0.011194029850746245
66,"SVC(C=100, degree=5, gamma=0.001)","Kernel: rbf, Degree: 5",0.967890,0.962041,0.973881,m: 0.9678899082568808 std: 0.01376146788990823,m: 0.9620406189555126 std: 0.015232108317214721,m: 0.9738805970149254 std: 0.011194029850746245
67,"SVC(C=100, degree=6, gamma=0.001)","Kernel: rbf, Degree: 6",0.967890,0.962041,0.973881,m: 0.9678899082568808 std: 0.01376146788990823,m: 0.9620406189555126 std: 0.015232108317214721,m: 0.9738805970149254 std: 0.011194029850746245
16,"DecisionTreeClassifier(max_depth=1, min_sample...","max Depth: 1, min Samples: 20",0.967890,0.962041,0.973881,m: 0.9678899082568808 std: 0.01376146788990823,m: 0.9620406189555126 std: 0.015232108317214721,m: 0.9738805970149254 std: 0.011194029850746245
...,...,...,...,...,...,...,...,...
34,"DecisionTreeClassifier(max_depth=9, min_sample...","max Depth: 9, min Samples: 100",0.614679,0.307339,0.500000,m: 0.6146788990825688 std: 0.0,m: 0.3073394495412844 std: 0.0,m: 0.5 std: 0.0
36,"SVC(C=100, degree=2, gamma=0.001, kernel='poly')","Kernel: poly, Degree: 2",0.614679,0.307339,0.500000,m: 0.6146788990825688 std: 0.0,m: 0.3073394495412844 std: 0.0,m: 0.5 std: 0.0
37,"SVC(C=100, gamma=0.001, kernel='poly')","Kernel: poly, Degree: 3",0.614679,0.307339,0.500000,m: 0.6146788990825688 std: 0.0,m: 0.3073394495412844 std: 0.0,m: 0.5 std: 0.0
38,"SVC(C=100, degree=4, gamma=0.001, kernel='poly')","Kernel: poly, Degree: 4",0.614679,0.307339,0.500000,m: 0.6146788990825688 std: 0.0,m: 0.3073394495412844 std: 0.0,m: 0.5 std: 0.0


classifier                      SVC(C=100, degree=7, gamma=0.001)
arguments                                  Kernel: rbf, Degree: 7
mean_accuracy                                             0.96789
mean_precision                                           0.962041
mean_recall                                              0.973881
accuracy           m: 0.9678899082568808 std: 0.01376146788990823
precision         m: 0.9620406189555126 std: 0.015232108317214721
recall            m: 0.9738805970149254 std: 0.011194029850746245
Name: 68, dtype: object
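
The rows at the bottom of the table (accuracy 0.614679, precision 0.307339, recall 0.5) are what a model that always predicts the majority class would score under macro averaging with `zero_division=0`. The sketch below checks this arithmetic; the class counts per fold (67 and 42 out of 109) are inferred from the reported accuracy and are an assumption, not values printed by the notebook.

```python
majority, minority = 67, 42          # assumed class counts in one test fold
total = majority + minority          # 109 rows per fold

accuracy = majority / total
# Majority class: every prediction is that class, so precision = accuracy and recall = 1.
# Minority class: never predicted, so precision falls back to 0 and recall is 0.
macro_precision = (majority / total + 0.0) / 2
macro_recall = (1.0 + 0.0) / 2

print(accuracy, macro_precision, macro_recall)
```

Any configuration that lands exactly on these values has effectively learned nothing beyond the class distribution.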

### Train submission file

In [14]:
#Required Imports
from sklearn import svm
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
import pandas as pd
import numpy as np

#Recode values for predicting variables
def recode_voting_data(dataset):
    dataset = dataset.replace('y', 1)\
        .replace('n', 0)
    dataset[dataset.columns[1:18]] = dataset[dataset.columns[1:18]].astype('category')
    return pd.DataFrame(dataset, columns=dataset.columns)

# Impute missing values
def input_missing_values(data):
    columns = data.columns
    imp = IterativeImputer(max_iter=10, random_state=0)
    imp.fit(data.loc[:, data.columns != "ID"])
    data.loc[:, data.columns != "ID"] = np.round(imp.transform(data.loc[:, data.columns != "ID"]))
    data.loc[:, data.columns != "ID"] = data.loc[:, data.columns != "ID"].apply(lambda x: x.astype('int'))
    return pd.DataFrame(data, columns=columns)

#Read Data
votingDataLearn = pd.read_csv("data/voting/CongressionalVotingID.shuf.lrn.csv", na_values='unknown')
votingDataTest = pd.read_csv("data/voting/CongressionalVotingID.shuf.tes.csv", na_values='unknown')

# Recode and extract the target variable
votingDataLearn = votingDataLearn.replace('democrat', 2)\
    .replace('republican', 3)

y = votingDataLearn["class"]
X = pd.DataFrame(votingDataLearn.drop(["ID", "class"], axis=1))

X = recode_voting_data(X)
X = input_missing_values(X)
votingDataTest = recode_voting_data(votingDataTest)
votingDataTest = input_missing_values(votingDataTest)

#Calculate Model
classifier = svm.SVC(kernel = "rbf", gamma=0.001, C=100)
classifier.fit(X, y)

#Predict the Test Data
votingDataTest["class"] = classifier.predict(votingDataTest[votingDataTest.columns[1:18]])

# Recode to the required output labels (assign the result instead of using a chained inplace replace)
votingDataTest["class"] = votingDataTest["class"].replace({2: "democrat", 3: "republican"})
display("Finally recoded: ", votingDataTest[["ID", "class"]])
votingDataTest[["ID", "class"]].to_csv("solution_voting.csv", index = False)



'Finally recoded: '

Unnamed: 0,ID,class
0,13,democrat
1,393,republican
2,163,democrat
3,57,republican
4,148,democrat
...,...,...
212,359,democrat
213,128,democrat
214,27,democrat
215,119,democrat
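
The cell above fits a separate `IterativeImputer` on the test set. A common alternative is to fit the imputer on the training predictors only and reuse that fitted imputer for the test set, so both sets are filled in by the same model. The following is a minimal sketch on toy data, assuming 0/1 vote columns; the helper name `impute_with_training_fit`, the toy frames, and the clipping to the 0..1 range are illustrative assumptions and not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer


def impute_with_training_fit(train, test, max_iter=10, random_state=0):
    """Fit the imputer on `train` only, then fill missing values in both frames."""
    imputer = IterativeImputer(max_iter=max_iter, random_state=random_state)
    imputer.fit(train)

    def fill(frame):
        # Round to the nearest vote and clip, since the regression-based
        # estimates are continuous and may fall slightly outside 0..1.
        values = np.clip(np.round(imputer.transform(frame)), 0, 1).astype(int)
        return pd.DataFrame(values, columns=frame.columns, index=frame.index)

    return fill(train), fill(test)


# Hypothetical toy data: three vote columns with some missing entries.
toy_train = pd.DataFrame({
    "vote_a": [1, 0, np.nan, 1, 0, 1],
    "vote_b": [0, 0, 1, np.nan, 1, 0],
    "vote_c": [1, np.nan, 1, 0, 0, 1],
})
toy_test = pd.DataFrame({
    "vote_a": [np.nan, 1],
    "vote_b": [1, np.nan],
    "vote_c": [0, 1],
})

train_filled, test_filled = impute_with_training_fit(toy_train, toy_test)
print(test_filled)
```

Whether this changes the predictions for the voting data would have to be checked by re-running the notebook; the sketch only illustrates the pattern.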


## Amazon

In [15]:
from sklearn import preprocessing
#Read Data
amazonDataLearn = pd.read_csv("data/amazon/amazon_review_ID.shuf.lrn.csv")
amazonDataSolutionExample = pd.read_csv("data/amazon/amazon_review_ID.shuf.sol.ex.csv")
amazonDataTest = pd.read_csv("data/amazon/amazon_review_ID.shuf.tes.csv")
display("Original Data", amazonDataLearn)

#Recode values
#For One Hot Encoding of Class
#amazonDataLearn = pd.concat([amazonDataLearn, pd.get_dummies(amazonDataLearn["Class"], prefix='author_',drop_first=False)], axis=1)
#amazonDataLearn.drop(['Class'],axis=1, inplace=True)
#names_target = amazonDataLearn.loc[:, amazonDataLearn.columns.str.startswith('author_')]
#amazonDataLearn[names_target.columns] = amazonDataLearn[names_target.columns].apply(lambda x: x.astype('category'))

# For Label Encoding
#le = preprocessing.LabelEncoder()
#le.fit(amazonDataLearn['Class'])
#amazonDataLearn['Class'] = le.transform(amazonDataLearn['Class'])
#amazonDataLearn['Class'] = amazonDataLearn['Class'].astype('category')

names_data = amazonDataLearn.loc[:, amazonDataLearn.columns.str.startswith('V')]
#amazonDataLearn[0:10000] = amazonDataLearn[0:10000].apply(lambda x: x.astype('int'))

# Normalize data
def normalize_values(data):
    columns = data.columns
    data = preprocessing.Normalizer().fit_transform(data)
    return pd.DataFrame(data, columns=columns)

display("Recoded Data", amazonDataLearn)

X_amazon = normalize_values(amazonDataLearn[names_data.columns])
y_amazon = amazonDataLearn["Class"]

display("Data: ", X_amazon)
display("Target: ", y_amazon)

'Original Data'

Unnamed: 0,ID,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V9992,V9993,V9994,V9995,V9996,V9997,V9998,V9999,V10000,Class
0,0,9,5,5,9,7,0,8,7,1,...,0,1,0,1,0,0,0,0,2,Power
1,1,11,9,15,15,5,11,10,1,5,...,0,0,0,0,0,0,0,0,0,Goonan
2,2,11,10,13,12,6,5,0,3,1,...,0,0,0,0,0,0,0,1,0,Merritt
3,3,18,9,7,8,8,7,12,6,7,...,0,1,0,0,0,1,0,0,1,Goonan
4,4,11,7,10,11,4,5,1,8,4,...,0,0,0,0,0,1,0,0,3,Corn
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
745,745,5,5,8,2,8,0,5,1,2,...,1,0,0,0,0,0,0,0,0,Chachra
746,746,22,13,8,14,8,11,3,6,7,...,6,0,2,0,0,2,0,0,0,Morrison
747,747,10,3,5,5,7,1,14,2,6,...,0,0,4,1,0,0,2,0,0,Sherwin
748,748,9,13,8,5,11,9,9,3,3,...,0,0,0,1,0,0,0,0,0,Blankenship


'Recoded Data'

Unnamed: 0,ID,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V9992,V9993,V9994,V9995,V9996,V9997,V9998,V9999,V10000,Class
0,0,9,5,5,9,7,0,8,7,1,...,0,1,0,1,0,0,0,0,2,Power
1,1,11,9,15,15,5,11,10,1,5,...,0,0,0,0,0,0,0,0,0,Goonan
2,2,11,10,13,12,6,5,0,3,1,...,0,0,0,0,0,0,0,1,0,Merritt
3,3,18,9,7,8,8,7,12,6,7,...,0,1,0,0,0,1,0,0,1,Goonan
4,4,11,7,10,11,4,5,1,8,4,...,0,0,0,0,0,1,0,0,3,Corn
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
745,745,5,5,8,2,8,0,5,1,2,...,1,0,0,0,0,0,0,0,0,Chachra
746,746,22,13,8,14,8,11,3,6,7,...,6,0,2,0,0,2,0,0,0,Morrison
747,747,10,3,5,5,7,1,14,2,6,...,0,0,4,1,0,0,2,0,0,Sherwin
748,748,9,13,8,5,11,9,9,3,3,...,0,0,0,1,0,0,0,0,0,Blankenship


'Data: '

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V9991,V9992,V9993,V9994,V9995,V9996,V9997,V9998,V9999,V10000
0,0.022849,0.012694,0.012694,0.022849,0.017771,0.000000,0.020310,0.017771,0.002539,0.012694,...,0.000000,0.000000,0.002539,0.000000,0.002539,0.00000,0.000000,0.000000,0.000000,0.005077
1,0.030256,0.024755,0.041258,0.041258,0.013753,0.030256,0.027505,0.002751,0.013753,0.019254,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000,0.000000,0.000000
2,0.028761,0.026146,0.033990,0.031376,0.015688,0.013073,0.000000,0.007844,0.002615,0.002615,...,0.002615,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000,0.002615,0.000000
3,0.041891,0.020946,0.016291,0.018618,0.018618,0.016291,0.027927,0.013964,0.016291,0.002327,...,0.000000,0.000000,0.002327,0.000000,0.000000,0.00000,0.002327,0.000000,0.000000,0.002327
4,0.028918,0.018402,0.026289,0.028918,0.010516,0.013145,0.002629,0.021031,0.010516,0.010516,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,0.002629,0.000000,0.000000,0.007887
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
745,0.026867,0.026867,0.042987,0.010747,0.042987,0.000000,0.026867,0.005373,0.010747,0.016120,...,0.000000,0.005373,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000,0.000000,0.000000
746,0.048706,0.028781,0.017711,0.030995,0.017711,0.024353,0.006642,0.013284,0.015497,0.013284,...,0.000000,0.013284,0.000000,0.004428,0.000000,0.00000,0.004428,0.000000,0.000000,0.000000
747,0.026260,0.007878,0.013130,0.013130,0.018382,0.002626,0.036764,0.005252,0.015756,0.002626,...,0.000000,0.000000,0.000000,0.010504,0.002626,0.00000,0.000000,0.005252,0.000000,0.000000
748,0.025152,0.036330,0.022357,0.013973,0.030741,0.025152,0.025152,0.008384,0.008384,0.016768,...,0.000000,0.000000,0.000000,0.000000,0.002795,0.00000,0.000000,0.000000,0.000000,0.000000


'Target: '

0            Power
1           Goonan
2          Merritt
3           Goonan
4             Corn
          ...     
745        Chachra
746       Morrison
747        Sherwin
748    Blankenship
749       Davisson
Name: Class, Length: 750, dtype: object
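
Note that `preprocessing.Normalizer()` rescales each sample (row) to unit norm, using the L2 norm by default, rather than scaling each feature column. This is why the raw counts in a row shrink to small fractions in the "Data" output. Because the transformation only depends on the row itself, it can be applied to the training and test sets independently. A minimal NumPy sketch of the same row-wise operation, on hypothetical toy counts (the function name `l2_normalize_rows` is illustrative):

```python
import numpy as np


def l2_normalize_rows(counts):
    """Divide every row by its Euclidean (L2) norm; all-zero rows stay zero."""
    counts = np.asarray(counts, dtype=float)
    norms = np.linalg.norm(counts, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0  # avoid division by zero for empty rows
    return counts / norms


# Hypothetical word-count rows (not taken from the Amazon data).
toy_counts = np.array([[3, 4, 0],
                       [1, 2, 2],
                       [0, 0, 0]])
normalized = l2_normalize_rows(toy_counts)
print(normalized)
print(np.linalg.norm(normalized, axis=1))  # non-zero rows have norm 1
```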

### k-NN Calculation - Amazon

In [16]:
knn_results_amazon = calculate_knn(X_amazon,
                                   y_amazon)
overall_results_amazon.extend(knn_results_amazon)

print_results(knn_results_amazon, "mean_accuracy")

'Results'

Unnamed: 0,classifier,arguments,mean_accuracy,mean_precision,mean_recall,accuracy,precision,recall
0,KNeighborsClassifier(n_neighbors=1),N = 1,0.217333,0.269663,0.213175,m: 0.21733333333333332 std: 0.009333333333333332,m: 0.2696625644505507 std: 0.020481226845872155,m: 0.21317460317460316 std: 0.009952380952380935
6,KNeighborsClassifier(n_neighbors=7),N = 7,0.201333,0.300936,0.195698,m: 0.2013333333333333 std: 0.025333333333333333,m: 0.3009359955461926 std: 0.05788954817545304,m: 0.1956984126984127 std: 0.028539682539682545
4,KNeighborsClassifier(),N = 5,0.196,0.291129,0.195845,m: 0.196 std: 0.0040000000000000036,m: 0.2911289372651379 std: 0.004014736670988839,m: 0.19584523809523807 std: 0.004146825396825385
7,KNeighborsClassifier(n_neighbors=8),N = 8,0.192,0.261385,0.185873,m: 0.192 std: 0.026666666666666672,m: 0.2613846367265696 std: 0.03556916600664496,m: 0.18587301587301586 std: 0.029031746031746034
2,KNeighborsClassifier(n_neighbors=3),N = 3,0.186667,0.276623,0.188579,m: 0.18666666666666665 std: 0.0026666666666666644,m: 0.27662335266991456 std: 0.004006194104617361,m: 0.1885793650793651 std: 0.004230158730158723
8,KNeighborsClassifier(n_neighbors=9),N = 9,0.186667,0.239957,0.180512,m: 0.18666666666666665 std: 0.024000000000000007,m: 0.23995687630346438 std: 0.027838393842211456,m: 0.18051190476190476 std: 0.025615079365079355
5,KNeighborsClassifier(n_neighbors=6),N = 6,0.184,0.245926,0.180817,m: 0.184 std: 0.005333333333333329,m: 0.24592593293881482 std: 0.018142273364685588,m: 0.18081746031746032 std: 0.008349206349206342
1,KNeighborsClassifier(n_neighbors=2),N = 2,0.181333,0.237999,0.18252,m: 0.18133333333333335 std: 0.013333333333333322,m: 0.23799941938395774 std: 0.013408418616183526,m: 0.18251984126984128 std: 0.01686904761904763
3,KNeighborsClassifier(n_neighbors=4),N = 4,0.18,0.251769,0.178675,m: 0.18 std: 0.011999999999999997,m: 0.2517694421882424 std: 0.008939613408469882,m: 0.17867460317460318 std: 0.01111904761904764


classifier                     KNeighborsClassifier(n_neighbors=1)
arguments                                                    N = 1
mean_accuracy                                             0.217333
mean_precision                                            0.269663
mean_recall                                               0.213175
accuracy          m: 0.21733333333333332 std: 0.009333333333333332
precision          m: 0.2696625644505507 std: 0.020481226845872155
recall            m: 0.21317460317460316 std: 0.009952380952380935
Name: 0, dtype: object

### Perceptron - Amazon

In [17]:
perceptron_results_amazon = calculate_perceptron(X_amazon,
                                                 y_amazon)
overall_results_amazon.extend(perceptron_results_amazon)

print_results(perceptron_results_amazon, "mean_accuracy")

'Results'

Unnamed: 0,classifier,arguments,mean_accuracy,mean_precision,mean_recall,accuracy,precision,recall
0,Perceptron(),No additional args.,0.113333,0.126572,0.107762,m: 0.11333333333333334 std: 0.006666666666666661,m: 0.12657218917034596 std: 0.003983440323285119,m: 0.10776190476190475 std: 0.00871428571428571


classifier                                            Perceptron()
arguments                                      No additional args.
mean_accuracy                                             0.113333
mean_precision                                            0.126572
mean_recall                                               0.107762
accuracy          m: 0.11333333333333334 std: 0.006666666666666661
precision         m: 0.12657218917034596 std: 0.003983440323285119
recall             m: 0.10776190476190475 std: 0.00871428571428571
Name: 0, dtype: object

### Decision Tree - Amazon

In [18]:
decision_tree_results_amazon = calculate_decision_tree(X_amazon,
                                                       y_amazon)
overall_results_amazon.extend(decision_tree_results_amazon)

print_results(decision_tree_results_amazon, "mean_accuracy")

'Results'

Unnamed: 0,classifier,arguments,mean_accuracy,mean_precision,mean_recall,accuracy,precision,recall
16,"DecisionTreeClassifier(max_depth=9, min_sample...","max Depth: 9, min Samples: 2",0.184,0.166816,0.17048,m: 0.184 std: 0.008000000000000007,m: 0.1668159040067333 std: 0.012119436015563298,m: 0.17048015873015873 std: 0.009932539682539682
12,"DecisionTreeClassifier(max_depth=7, min_sample...","max Depth: 7, min Samples: 2",0.157333,0.138446,0.141512,m: 0.15733333333333333 std: 0.007999999999999993,m: 0.13844625015164747 std: 0.007118013171236559,m: 0.14151190476190476 std: 0.007789682539682541
17,"DecisionTreeClassifier(max_depth=9, min_sample...","max Depth: 9, min Samples: 20",0.149333,0.050988,0.138817,m: 0.14933333333333335 std: 0.0,m: 0.050988080219772955 std: 0.002261685195509698,m: 0.13881746031746034 std: 9.523809523810656e-05
13,"DecisionTreeClassifier(max_depth=7, min_sample...","max Depth: 7, min Samples: 20",0.138667,0.046267,0.128817,m: 0.13866666666666666 std: 0.008000000000000007,m: 0.04626707162461866 std: 0.0007024033448563075,m: 0.12881746031746033 std: 0.007126984126984134
9,"DecisionTreeClassifier(max_depth=5, min_sample...","max Depth: 5, min Samples: 20",0.133333,0.036094,0.122556,m: 0.13333333333333333 std: 0.0026666666666666783,m: 0.036093533583553446 std: 0.000897945951404...,m: 0.12255555555555556 std: 0.0061666666666666745
8,"DecisionTreeClassifier(max_depth=5, min_sample...","max Depth: 5, min Samples: 2",0.130667,0.098178,0.117639,m: 0.13066666666666665 std: 0.0026666666666666644,m: 0.09817781104457751 std: 0.0023611111111111055,m: 0.11763888888888889 std: 0.0001388888888888...
4,"DecisionTreeClassifier(max_depth=3, min_sample...","max Depth: 3, min Samples: 2",0.086667,0.058152,0.077639,m: 0.08666666666666667 std: 0.0013333333333333322,m: 0.058152442002442004 std: 0.00236037851037851,m: 0.0776388888888889 std: 0.00013888888888888978
10,"DecisionTreeClassifier(max_depth=5, min_sample...","max Depth: 5, min Samples: 50",0.085333,0.009337,0.075472,m: 0.08533333333333333 std: 0.013333333333333336,m: 0.009336784884282716 std: 0.000723376317361...,m: 0.07547222222222223 std: 0.014805555555555561
18,"DecisionTreeClassifier(max_depth=9, min_sample...","max Depth: 9, min Samples: 50",0.085333,0.009337,0.075472,m: 0.08533333333333333 std: 0.013333333333333336,m: 0.009336784884282716 std: 0.000723376317361...,m: 0.07547222222222223 std: 0.014805555555555561
14,"DecisionTreeClassifier(max_depth=7, min_sample...","max Depth: 7, min Samples: 50",0.085333,0.009337,0.075472,m: 0.08533333333333333 std: 0.013333333333333336,m: 0.009336784884282716 std: 0.000723376317361...,m: 0.07547222222222223 std: 0.014805555555555561


classifier        DecisionTreeClassifier(max_depth=9, min_sample...
arguments                              max Depth: 9, min Samples: 2
mean_accuracy                                                 0.184
mean_precision                                             0.166816
mean_recall                                                 0.17048
accuracy                         m: 0.184 std: 0.008000000000000007
precision           m: 0.1668159040067333 std: 0.012119436015563298
recall             m: 0.17048015873015873 std: 0.009932539682539682
Name: 16, dtype: object

### SVM - Amazon

In [19]:
svm_results_amazon = calculate_svm(X_amazon,
                                   y_amazon)
overall_results_amazon.extend(svm_results_amazon)

print_results(svm_results_amazon, "mean_accuracy")

'Results'

Unnamed: 0,classifier,arguments,mean_accuracy,mean_precision,mean_recall,accuracy,precision,recall
11,"SVC(C=100, kernel='poly')","Kernel: poly, Degree: 3",0.418667,0.483627,0.404425,m: 0.41866666666666663 std: 0.013333333333333336,m: 0.48362695634707353 std: 0.03974960098829397,m: 0.40442460317460316 std: 0.016067460317460258
10,"SVC(C=100, degree=2, kernel='poly')","Kernel: poly, Degree: 2",0.417333,0.482069,0.40244,m: 0.41733333333333333 std: 0.00933333333333336,m: 0.4820687265950423 std: 0.03974360726992307,m: 0.4024404761904761 std: 0.0118611111111111
39,"SVC(C=100, degree=4)","Kernel: rbf, Degree: 4",0.417333,0.483697,0.402996,m: 0.41733333333333333 std: 0.01200000000000001,m: 0.4836967850690761 std: 0.03981942971029659,m: 0.4029960317460317 std: 0.01463888888888884
42,"SVC(C=100, degree=7)","Kernel: rbf, Degree: 7",0.417333,0.483697,0.402996,m: 0.41733333333333333 std: 0.01200000000000001,m: 0.4836967850690761 std: 0.03981942971029659,m: 0.4029960317460317 std: 0.01463888888888884
41,"SVC(C=100, degree=6)","Kernel: rbf, Degree: 6",0.417333,0.483697,0.402996,m: 0.41733333333333333 std: 0.01200000000000001,m: 0.4836967850690761 std: 0.03981942971029659,m: 0.4029960317460317 std: 0.01463888888888884
43,"SVC(C=100, degree=8)","Kernel: rbf, Degree: 8",0.417333,0.483697,0.402996,m: 0.41733333333333333 std: 0.01200000000000001,m: 0.4836967850690761 std: 0.03981942971029659,m: 0.4029960317460317 std: 0.01463888888888884
40,"SVC(C=100, degree=5)","Kernel: rbf, Degree: 5",0.417333,0.483697,0.402996,m: 0.41733333333333333 std: 0.01200000000000001,m: 0.4836967850690761 std: 0.03981942971029659,m: 0.4029960317460317 std: 0.01463888888888884
44,"SVC(C=100, degree=9)","Kernel: rbf, Degree: 9",0.417333,0.483697,0.402996,m: 0.41733333333333333 std: 0.01200000000000001,m: 0.4836967850690761 std: 0.03981942971029659,m: 0.4029960317460317 std: 0.01463888888888884
38,SVC(C=100),"Kernel: rbf, Degree: 3",0.417333,0.483697,0.402996,m: 0.41733333333333333 std: 0.01200000000000001,m: 0.4836967850690761 std: 0.03981942971029659,m: 0.4029960317460317 std: 0.01463888888888884
37,"SVC(C=100, degree=2)","Kernel: rbf, Degree: 2",0.417333,0.483697,0.402996,m: 0.41733333333333333 std: 0.01200000000000001,m: 0.4836967850690761 std: 0.03981942971029659,m: 0.4029960317460317 std: 0.01463888888888884


classifier                               SVC(C=100, kernel='poly')
arguments                                  Kernel: poly, Degree: 3
mean_accuracy                                             0.418667
mean_precision                                            0.483627
mean_recall                                               0.404425
accuracy          m: 0.41866666666666663 std: 0.013333333333333336
precision          m: 0.48362695634707353 std: 0.03974960098829397
recall            m: 0.40442460317460316 std: 0.016067460317460258
Name: 11, dtype: object
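
The rows labelled "Kernel: rbf" report exactly the same scores for every degree. This is expected: in scikit-learn's `SVC`, the `degree` parameter only applies to the polynomial kernel and is ignored by the others, so the rbf rows are the same model fitted repeatedly. A small sketch on synthetic data (the data and variable names are assumptions made for illustration only):

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic two-class problem, used only to illustrate the point.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 5))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)

# Fit rbf SVMs that differ only in `degree` and compare their predictions.
predictions = [
    SVC(kernel="rbf", C=100, degree=d).fit(X_toy, y_toy).predict(X_toy)
    for d in (2, 5, 9)
]
rbf_identical = all(np.array_equal(predictions[0], p) for p in predictions[1:])
print("rbf predictions identical across degrees:", rbf_identical)
```

For the rbf kernel, `C` and `gamma` are the parameters that would be worth varying in the grid instead.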

### Overall Results for Amazon

In [20]:
print_results(overall_results_amazon, "mean_accuracy")

'Results'

Unnamed: 0,classifier,arguments,mean_accuracy,mean_precision,mean_recall,accuracy,precision,recall
41,"SVC(C=100, kernel='poly')","Kernel: poly, Degree: 3",0.418667,0.483627,0.404425,m: 0.41866666666666663 std: 0.013333333333333336,m: 0.48362695634707353 std: 0.03974960098829397,m: 0.40442460317460316 std: 0.016067460317460258
67,"SVC(C=100, degree=2)","Kernel: rbf, Degree: 2",0.417333,0.483697,0.402996,m: 0.41733333333333333 std: 0.01200000000000001,m: 0.4836967850690761 std: 0.03981942971029659,m: 0.4029960317460317 std: 0.01463888888888884
70,"SVC(C=100, degree=5)","Kernel: rbf, Degree: 5",0.417333,0.483697,0.402996,m: 0.41733333333333333 std: 0.01200000000000001,m: 0.4836967850690761 std: 0.03981942971029659,m: 0.4029960317460317 std: 0.01463888888888884
66,"SVC(C=100, degree=1)","Kernel: rbf, Degree: 1",0.417333,0.483697,0.402996,m: 0.41733333333333333 std: 0.01200000000000001,m: 0.4836967850690761 std: 0.03981942971029659,m: 0.4029960317460317 std: 0.01463888888888884
68,SVC(C=100),"Kernel: rbf, Degree: 3",0.417333,0.483697,0.402996,m: 0.41733333333333333 std: 0.01200000000000001,m: 0.4836967850690761 std: 0.03981942971029659,m: 0.4029960317460317 std: 0.01463888888888884
...,...,...,...,...,...,...,...,...
55,"SVC(C=100, degree=8, gamma='auto', kernel='poly')","Kernel: poly, Degree: 8",0.044000,0.014577,0.034444,m: 0.044 std: 0.0066666666666666645,m: 0.014577076360155063 std: 0.007359445506160574,m: 0.034444444444444444 std: 0.005555555555555557
56,"SVC(C=100, degree=9, gamma='auto', kernel='poly')","Kernel: poly, Degree: 9",0.044000,0.014577,0.034444,m: 0.044 std: 0.0066666666666666645,m: 0.014577076360155063 std: 0.007359445506160574,m: 0.034444444444444444 std: 0.005555555555555557
11,"DecisionTreeClassifier(max_depth=1, min_sample...","max Depth: 1, min Samples: 20",0.040000,0.009083,0.032444,m: 0.04 std: 0.0,m: 0.009082597803947276 std: 0.001413269964647...,m: 0.03244444444444444 std: 0.000888888888888887
13,"DecisionTreeClassifier(max_depth=1, min_sample...","max Depth: 1, min Samples: 100",0.038667,0.001815,0.032444,m: 0.03866666666666667 std: 0.004,m: 0.0018147876372591764 std: 4.93769063312042...,m: 0.03244444444444444 std: 0.0013333333333333322


classifier                               SVC(C=100, kernel='poly')
arguments                                  Kernel: poly, Degree: 3
mean_accuracy                                             0.418667
mean_precision                                            0.483627
mean_recall                                               0.404425
accuracy          m: 0.41866666666666663 std: 0.013333333333333336
precision          m: 0.48362695634707353 std: 0.03974960098829397
recall            m: 0.40442460317460316 std: 0.016067460317460258
Name: 41, dtype: object

### Prepare Submission - Amazon

In [21]:
from sklearn import preprocessing
from sklearn import svm
import pandas as pd
import numpy as np
# Read Data
amazon_data_learn = pd.read_csv("data/amazon/amazon_review_ID.shuf.lrn.csv")
amazon_data_test = pd.read_csv("data/amazon/amazon_review_ID.shuf.tes.csv")

# Target: keep the author names as labels and cast the column to a categorical dtype.
# Integer label encoding is not needed here, since scikit-learn classifiers accept string labels.
amazon_data_learn["Class"] = amazon_data_learn["Class"].astype("category")

names_data = amazon_data_learn.loc[:, amazon_data_learn.columns.str.startswith("V")]

display("Recoded Data", amazon_data_learn)
y = amazon_data_learn["Class"]
X = pd.DataFrame(amazon_data_learn.drop(["ID", "Class"], axis=1))
X = X[names_data.columns]

#Normalize data
def normalize_values(data):
    columns = data.columns
    data = preprocessing.Normalizer().fit_transform(data)
    return pd.DataFrame(data, columns=columns)

X_norm = normalize_values(X)
test_X = normalize_values(amazon_data_test[names_data.columns])

display("Data Normalized: ", X_norm)
display("Target: ", y)
display("Test Dataset: ", amazon_data_test)
display("Test Dataset, Predictors: ", amazon_data_test[names_data.columns])
display("Test Dataset, Normalized: ", test_X)

# Calculate the model
# Note: the cross-validated comparison above used C=100; this submission run uses C=101.
classifier = svm.SVC(C=101, kernel='poly')
classifier.fit(X_norm, y)

#Predict the Test Data
amazon_data_test["Class"] = classifier.predict(test_X)

display("Finally recoded: ", amazon_data_test[["ID", "Class"]])
amazon_data_test[["ID", "Class"]].to_csv("solution_amazon.csv", index = False)

'Recoded Data'

Unnamed: 0,ID,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V9992,V9993,V9994,V9995,V9996,V9997,V9998,V9999,V10000,Class
0,0,9,5,5,9,7,0,8,7,1,...,0,1,0,1,0,0,0,0,2,Power
1,1,11,9,15,15,5,11,10,1,5,...,0,0,0,0,0,0,0,0,0,Goonan
2,2,11,10,13,12,6,5,0,3,1,...,0,0,0,0,0,0,0,1,0,Merritt
3,3,18,9,7,8,8,7,12,6,7,...,0,1,0,0,0,1,0,0,1,Goonan
4,4,11,7,10,11,4,5,1,8,4,...,0,0,0,0,0,1,0,0,3,Corn
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
745,745,5,5,8,2,8,0,5,1,2,...,1,0,0,0,0,0,0,0,0,Chachra
746,746,22,13,8,14,8,11,3,6,7,...,6,0,2,0,0,2,0,0,0,Morrison
747,747,10,3,5,5,7,1,14,2,6,...,0,0,4,1,0,0,2,0,0,Sherwin
748,748,9,13,8,5,11,9,9,3,3,...,0,0,0,1,0,0,0,0,0,Blankenship


'Data Normalized: '

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V9991,V9992,V9993,V9994,V9995,V9996,V9997,V9998,V9999,V10000
0,0.022849,0.012694,0.012694,0.022849,0.017771,0.000000,0.020310,0.017771,0.002539,0.012694,...,0.000000,0.000000,0.002539,0.000000,0.002539,0.00000,0.000000,0.000000,0.000000,0.005077
1,0.030256,0.024755,0.041258,0.041258,0.013753,0.030256,0.027505,0.002751,0.013753,0.019254,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000,0.000000,0.000000
2,0.028761,0.026146,0.033990,0.031376,0.015688,0.013073,0.000000,0.007844,0.002615,0.002615,...,0.002615,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000,0.002615,0.000000
3,0.041891,0.020946,0.016291,0.018618,0.018618,0.016291,0.027927,0.013964,0.016291,0.002327,...,0.000000,0.000000,0.002327,0.000000,0.000000,0.00000,0.002327,0.000000,0.000000,0.002327
4,0.028918,0.018402,0.026289,0.028918,0.010516,0.013145,0.002629,0.021031,0.010516,0.010516,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,0.002629,0.000000,0.000000,0.007887
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
745,0.026867,0.026867,0.042987,0.010747,0.042987,0.000000,0.026867,0.005373,0.010747,0.016120,...,0.000000,0.005373,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000,0.000000,0.000000
746,0.048706,0.028781,0.017711,0.030995,0.017711,0.024353,0.006642,0.013284,0.015497,0.013284,...,0.000000,0.013284,0.000000,0.004428,0.000000,0.00000,0.004428,0.000000,0.000000,0.000000
747,0.026260,0.007878,0.013130,0.013130,0.018382,0.002626,0.036764,0.005252,0.015756,0.002626,...,0.000000,0.000000,0.000000,0.010504,0.002626,0.00000,0.000000,0.005252,0.000000,0.000000
748,0.025152,0.036330,0.022357,0.013973,0.030741,0.025152,0.025152,0.008384,0.008384,0.016768,...,0.000000,0.000000,0.000000,0.000000,0.002795,0.00000,0.000000,0.000000,0.000000,0.000000


'Target: '

0            Power
1           Goonan
2          Merritt
3           Goonan
4             Corn
          ...     
745        Chachra
746       Morrison
747        Sherwin
748    Blankenship
749       Davisson
Name: Class, Length: 750, dtype: category
Categories (50, object): [Agresti, Ashbacher, Auken, Blankenship, ..., Vernon, Vision, Walters, Wilson]

'Test Dataset: '

Unnamed: 0,ID,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V9991,V9992,V9993,V9994,V9995,V9996,V9997,V9998,V9999,V10000
0,750,3,2,5,1,3,4,9,4,9,...,0,0,0,0,0,1,0,0,0,0
1,751,9,4,3,4,6,7,2,1,0,...,0,1,1,1,2,0,0,0,0,0
2,752,18,16,6,13,0,7,0,6,3,...,1,0,0,0,1,0,2,0,0,0
3,753,5,2,6,2,12,7,1,2,3,...,0,0,1,0,0,0,0,0,0,0
4,754,14,9,9,5,5,8,10,2,0,...,0,0,0,0,2,1,0,2,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
745,1495,10,2,2,5,4,2,6,3,4,...,0,1,0,0,0,1,0,0,0,0
746,1496,19,8,6,11,6,4,3,8,2,...,0,0,1,0,0,0,3,0,0,0
747,1497,15,4,8,6,10,6,11,5,9,...,0,0,3,0,0,1,0,0,0,0
748,1498,13,7,11,14,4,3,0,3,0,...,0,0,0,0,0,0,0,0,0,0


'Test Dataset, Predictors: '

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V9991,V9992,V9993,V9994,V9995,V9996,V9997,V9998,V9999,V10000
0,3,2,5,1,3,4,9,4,9,4,...,0,0,0,0,0,1,0,0,0,0
1,9,4,3,4,6,7,2,1,0,1,...,0,1,1,1,2,0,0,0,0,0
2,18,16,6,13,0,7,0,6,3,0,...,1,0,0,0,1,0,2,0,0,0
3,5,2,6,2,12,7,1,2,3,5,...,0,0,1,0,0,0,0,0,0,0
4,14,9,9,5,5,8,10,2,0,5,...,0,0,0,0,2,1,0,2,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
745,10,2,2,5,4,2,6,3,4,1,...,0,1,0,0,0,1,0,0,0,0
746,19,8,6,11,6,4,3,8,2,3,...,0,0,1,0,0,0,3,0,0,0
747,15,4,8,6,10,6,11,5,9,2,...,0,0,3,0,0,1,0,0,0,0
748,13,7,11,14,4,3,0,3,0,0,...,0,0,0,0,0,0,0,0,0,0


'Test Dataset, Normalized: '

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V9991,V9992,V9993,V9994,V9995,V9996,V9997,V9998,V9999,V10000
0,0.009978,0.006652,0.016630,0.003326,0.009978,0.013304,0.029935,0.013304,0.029935,0.013304,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.003326,0.000000,0.000000,0.0,0.0
1,0.026896,0.011954,0.008965,0.011954,0.017930,0.020919,0.005977,0.002988,0.000000,0.002988,...,0.000000,0.002988,0.002988,0.002988,0.005977,0.000000,0.000000,0.000000,0.0,0.0
2,0.046578,0.041402,0.015526,0.033639,0.000000,0.018114,0.000000,0.015526,0.007763,0.000000,...,0.002588,0.000000,0.000000,0.000000,0.002588,0.000000,0.005175,0.000000,0.0,0.0
3,0.014007,0.005603,0.016809,0.005603,0.033617,0.019610,0.002801,0.005603,0.008404,0.014007,...,0.000000,0.000000,0.002801,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.0
4,0.035705,0.022953,0.022953,0.012752,0.012752,0.020403,0.025503,0.005101,0.000000,0.012752,...,0.000000,0.000000,0.000000,0.000000,0.005101,0.002550,0.000000,0.005101,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
745,0.033729,0.006746,0.006746,0.016865,0.013492,0.006746,0.020237,0.010119,0.013492,0.003373,...,0.000000,0.003373,0.000000,0.000000,0.000000,0.003373,0.000000,0.000000,0.0,0.0
746,0.057413,0.024174,0.018131,0.033239,0.018131,0.012087,0.009065,0.024174,0.006044,0.009065,...,0.000000,0.000000,0.003022,0.000000,0.000000,0.000000,0.009065,0.000000,0.0,0.0
747,0.035489,0.009464,0.018928,0.014196,0.023659,0.014196,0.026025,0.011830,0.021293,0.004732,...,0.000000,0.000000,0.007098,0.000000,0.000000,0.002366,0.000000,0.000000,0.0,0.0
748,0.036754,0.019790,0.031099,0.039581,0.011309,0.008482,0.000000,0.008482,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.0


'Finally recoded: '

Unnamed: 0,ID,Class
0,750,Sherwin
1,751,Engineer
2,752,Harp
3,753,Shea
4,754,Agresti
...,...,...
745,1495,Cholette
746,1496,Ashbacher
747,1497,Cholette
748,1498,Kolln
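
Before uploading, it can help to check the written file against the submission format described at the top: a header line that uses the names from the training set, followed by one `ID,class` pair per line. The following is a minimal sketch using only the standard library; the function name `find_submission_problems` and the sample strings are hypothetical, and the checks are not an official Kaggle validation.

```python
import csv
import io


def find_submission_problems(csv_text, expected_header):
    """Return a list of human-readable problems; an empty list means none were found."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ["file is empty"]

    problems = []
    if rows[0] != list(expected_header):
        problems.append(f"header is {rows[0]}, expected {list(expected_header)}")

    body = rows[1:]
    for line_number, row in enumerate(body, start=2):
        # Every data line needs one non-empty value per header column.
        if len(row) != len(expected_header) or any(cell.strip() == "" for cell in row):
            problems.append(f"line {line_number} is malformed: {row}")

    ids = [row[0] for row in body if row]
    if len(ids) != len(set(ids)):
        problems.append("duplicate IDs found")
    return problems


# Hypothetical file contents for illustration.
good_file = "ID,Class\n750,Sherwin\n751,Engineer\n"
bad_file = "ID,class\n750,Sherwin\n750,\n"

good_problems = find_submission_problems(good_file, ("ID", "Class"))
bad_problems = find_submission_problems(bad_file, ("ID", "Class"))
print(good_problems)
print(bad_problems)
```

To check a real file, its contents could be read with `open("solution_amazon.csv").read()` and passed in together with the header expected for that competition.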
