# MLE - Exercise 3 - Kaggle Competition
## Andreas Kocman (se19m024)


This exercise is in the form of a Kaggle competition. A few quick details on Kaggle & the competition format:

## Kaggle
* Kaggle (https://en.wikipedia.org/wiki/Kaggle) is a platform that hosts competitions on specific datasets. Participants submit their predictions for a test set, receive an automated score for them, and are ranked on the leaderboard.
* From Kaggle, you will be able to obtain a labelled training set, and an unlabelled test set.
* You can submit multiple entries to Kaggle. For each entry, you need to provide details on how you achieved the results: which software and which version of it, which operating system, which algorithms, and which parameter settings for these algorithms, as well as any processing applied to the data before training/predicting. There is a specific "description" field when submitting; fill in this information there. You also need to include this description and the actual submission file in your final submission to Moodle.
* To submit to Kaggle, you need to create a specific submission file, which contains the predictions you obtain on the test set. Kaggle computes the aggregated evaluation criterion automatically.
* The format of your submission is rather simple: it is a comma-separated file, where the first column is the identifier of the item that you are predicting, and the second column is the class you are predicting for that item. The first line should be a header, and it should use the names provided in the training set. An example is below:
```
ID,class
911366,B
852781,B
89524,B
857438,B
905686,B
```
* There is a limit of 7 submissions per day. You also need to select your top 7 submissions to be counted in the competition.
* Before you submit, evaluate the classifiers "locally" on your training set, i.e. by splitting it again into a training and a test set (or by using cross-validation), to select a number of suitable algorithms and parameters. Then retrain your best models on the full local training set and generate the predictions for the test set.
* Evaluation in Kaggle is split into two leaderboards, a public and a private one. The test data is split 50% / 50%, and as soon as you upload, you will see your results on one of these splits.
* The final results only become visible once the competition closes. Because they are computed on a different split, they may differ slightly from what you see initially (this is similar to a training/test/validation split).
* As it is a competition, there will be bonus points for the top 3 submissions.
* As reproducible science is great, there will be additional bonus points for submissions that use a notebook within the Kaggle competition. (Note: this was, and partially still is, called a "kernel" inside Kaggle. "Kernel" was a confusing term, as it simply refers to code executed in Kaggle's own environment, e.g. a Jupyter notebook or a Python or R script; Kaggle seems to have realized that and renamed it.) See https://www.kaggle.com/notebooks or https://www.kaggle.com/getting-started/44939. You can first work locally and then port your code to the notebook version. In Kaggle, your notebook will initially be private; please share it at least with me (mayer@ifs.tuwien.ac.at). You can also make it public at the end of the competition, to show off :-)
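
The workflow described in the list above (evaluate locally, retrain the best model on the full training set, predict the test set, write the submission file) could look roughly like the following sketch. The toy data, the two candidate models, and the second class label `M` are assumptions made for illustration only; the header `ID,class` and the IDs are taken from the example above.

```python
import io

import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the labelled training set and the unlabelled test set
# (hypothetical data; the real sets come from the Kaggle competition page).
rng = np.random.default_rng(0)
X_train = rng.integers(0, 2, size=(60, 4))
y_train = np.where(X_train[:, 0] == 1, "B", "M")  # label depends on the first feature
X_test = rng.integers(0, 2, size=(5, 4))
test_ids = [911366, 852781, 89524, 857438, 905686]

# Hypothetical candidates; any scikit-learn classifier could be listed here.
candidates = {
    "knn": KNeighborsClassifier(n_neighbors=3),
    "tree": DecisionTreeClassifier(max_depth=2, random_state=0),
}

# 1) Evaluate every candidate locally with 5-fold cross-validation.
cv_scores = {
    name: cross_val_score(model, X_train, y_train, cv=5).mean()
    for name, model in candidates.items()
}
best_name = max(cv_scores, key=cv_scores.get)

# 2) Retrain the best candidate on the full local training set.
best_model = candidates[best_name].fit(X_train, y_train)

# 3) Predict the test set and write the two-column submission file.
submission = pd.DataFrame({"ID": test_ids, "class": best_model.predict(X_test)})
buffer = io.StringIO()  # replace with a file path to write an actual file
submission.to_csv(buffer, index=False)
submission_csv = buffer.getvalue()
print(submission_csv)
```

Passing `index=False` keeps pandas from writing its row index as an extra first column, which would not match the expected two-column format.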

## Datasets
We will use the following datasets:
* Congressional Voting: a small dataset, a good entry point for your experiments (435 instances, 16 features)
  * Kaggle page: https://www.kaggle.com/t/c04c953c596e48099d857129f53fcbdb
* Amazon reviews: a dataset with many features (10k, extracted from text), but not that many instances (~800)
  * Kaggle page: https://www.kaggle.com/t/0bd2ac297dc242478b5979d5ee772136

## Submission
The Kaggle competition will close on the day displayed in Kaggle. After that, you still have time to submit to Moodle. Your submission to Moodle shall contain:

* A brief report, containing
  * A description of the datasets, including a short analysis of the features.
  * Details on the software you used for creating your solution
  * The algorithms and parameters you tried
  * The results you obtained on the locally split training/test set
    * And a comparison to the results that you received on Kaggle - how large was the difference, did the rank of the classifiers change (i.e. the first on your training set, was it still the best on the test set on Kaggle?)
* All the code needed to obtain your results
* The solution files that you uploaded to Kaggle

# Solution

## Helper Functions for Solution and Data Analysis

In [1]:
# global Imports
import pandas as pd
import numpy as np

#sk learn imports
from sklearn.model_selection import train_test_split
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn import metrics
from sklearn.model_selection import cross_validate
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import make_scorer

#Data reporting
from IPython.display import display

# Global definitions:
overall_results_vote = []
overall_results_amazon = []
averaging_approach = 'macro'
zero_division_approach = 0
number_of_folds = 5
scoring = {'Accuracy': make_scorer(accuracy_score),
            'Precision': make_scorer(precision_score, average=averaging_approach, zero_division=zero_division_approach),
            'Recall': make_scorer(recall_score, average=averaging_approach, zero_division=zero_division_approach)}

# Helper functions
def parse_k_fold_results(results):
    return "m: " + str(np.average(results)) + " std: " + str(np.std(results))

def parse_argument_tuple_as_string(argumentsTuple):
    return "max Depth: " + str(argumentsTuple[0])  + \
           ", min Samples: " + str(argumentsTuple[1])

def calculate_results_holdout(classifier_used, X_train, X_test, y_train, y_test):
    classifier_used.fit(X_train, y_train)

    # predict the test set on our trained classifier
    y_test_predicted = classifier_used.predict(X_test)

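    acc = metrics.accuracy_score(y_test, y_test_predicted)
    # The default binary averaging expects a positive label of 1, which does not
    # match the 2/3 class coding used here, so reuse the global averaging settings.
    recall = metrics.recall_score(y_test, y_test_predicted,
                                  average=averaging_approach, zero_division=zero_division_approach)
    precision = metrics.precision_score(y_test, y_test_predicted,
                                        average=averaging_approach, zero_division=zero_division_approach)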

    return pd.Series({
            'classifier': str(classifier_used),
            'arguments': "",
            'accuracy':acc,
            'precision':precision,
            'recall':recall
        })

def calculate_results_cross_validate(classifier_used, description_used, data, target):
   scores = cross_validate(classifier_used, data, target,
                                scoring = scoring,
                                cv = number_of_folds,
                                error_score = 0)

   return pd.Series({
            'classifier': str(classifier_used),
            'arguments': description_used,
            'mean_accuracy': np.average(scores.get('test_Accuracy')),
            'mean_precision': np.average(scores.get('test_Precision')),
            'mean_recall': np.average(scores.get('test_Recall')),
            'accuracy': parse_k_fold_results(scores.get('test_Accuracy')),
            'precision': parse_k_fold_results(scores.get('test_Precision')),
            'recall':parse_k_fold_results(scores.get('test_Recall'))
        })

def print_results(array, column_for_max, ascending=False):
    df = pd.DataFrame(array)
    # honour the ascending argument instead of always sorting in descending order
    df = df.sort_values(by=[column_for_max], ascending=ascending)
    display('Results', df)

    best = df.iloc[df[column_for_max].argmax()]
    display(best)
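
To illustrate how the `scoring` dictionary and `cross_validate` interact in the helpers above, here is a small self-contained sketch. The synthetic data from `make_classification` and the names `X_demo`, `y_demo`, `demo_scoring`, and `demo_scores` are assumptions for illustration. Each key of the scoring dictionary shows up in the result prefixed with `test_`, with one value per fold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, make_scorer, precision_score, recall_score
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data (hypothetical stand-in for the real sets).
X_demo, y_demo = make_classification(n_samples=100, n_features=8, random_state=0)

# Same structure as the global scoring dictionary: name -> scorer.
demo_scoring = {
    "Accuracy": make_scorer(accuracy_score),
    "Precision": make_scorer(precision_score, average="macro", zero_division=0),
    "Recall": make_scorer(recall_score, average="macro", zero_division=0),
}

demo_scores = cross_validate(KNeighborsClassifier(n_neighbors=5),
                             X_demo, y_demo,
                             scoring=demo_scoring,
                             cv=5)

# One array of length cv per metric, keyed as "test_<name>".
for key in ("test_Accuracy", "test_Precision", "test_Recall"):
    print(key, "mean:", np.average(demo_scores[key]), "std:", np.std(demo_scores[key]))
```

With macro averaging, precision and recall are computed per class and then averaged without weighting, so both classes count equally regardless of their size.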

### Calculation Functions


#### k-NN Calculation

In [2]:
from sklearn import neighbors

def calculate_knn(data, target):
    knn_results = []

    n_neighbors = range(1,10,1)

    for n in n_neighbors:
        knn_classifier = neighbors.KNeighborsClassifier(n)
        description = "N = " + str(n)
        result = calculate_results_cross_validate(knn_classifier,
                                                  description,
                                                  data,
                                                  target)
        knn_results.append(result)
    return knn_results


#### Bayes Calculation

In [3]:
from sklearn import naive_bayes

def calculate_bayes(data, target):
    bayes_results = []

    alphas = np.arange(0.1,5,1)

    for alpha in alphas:
        classifier = naive_bayes.CategoricalNB(alpha = alpha)
        description = "Alpha = " + str(alpha)
        result = calculate_results_cross_validate(classifier,
                                                  description,
                                                  data,
                                                  target)
        bayes_results.append(result)

    return bayes_results

#### Perceptron Calculation

In [4]:
from sklearn import linear_model

def calculate_perceptron(data, target):
    perceptron_results=[]
    classifier = linear_model.Perceptron()
    description = "No additional args."
    result = calculate_results_cross_validate(classifier,
                                              description,
                                              data,
                                              target)
    perceptron_results.append(result)
    return perceptron_results

#### Decision Tree Calculation

In [5]:
from sklearn import tree
import itertools

def calculate_decision_tree(data, target):
    # Parameters for the decision tree
    max_depth_arguments = range(1, 10, 2)
    min_samples_leaf_arguments = [2,20,50,100]
    argumentTuples = list(itertools.product(max_depth_arguments,
                                            min_samples_leaf_arguments))
    decision_tree_results = []

    for argumentTuple in argumentTuples:
        max_depth = argumentTuple[0]
        min_samples_leaf = argumentTuple[1]

        classifier = tree.DecisionTreeClassifier(criterion = 'gini',
                                                 max_depth = max_depth,
                                                 min_samples_leaf = min_samples_leaf,
                                                 splitter = 'best')
        #result = calculate_results_holdout(classifier, X_train, X_test, y_train, y_test)
        result = calculate_results_cross_validate(classifier,
                                                  parse_argument_tuple_as_string(argumentTuple),
                                                  data,
                                                  target)
        decision_tree_results.append(result)
    return decision_tree_results

#### SVM Calculation

In [6]:
from sklearn import svm
import itertools

def calculate_svm(data, target):
    kernels = ["poly", "rbf"]  # other options: "linear", "sigmoid"
    gammas = [0.001, "scale", "auto"]
    cs = [100]
    degrees = range(1, 10, 1)

    argumentTuples = list(itertools.product(kernels,
                                            gammas,
                                            cs,
                                            degrees))
    svm_results = []

    # unpack each combination directly instead of shadowing the parameter lists
    for kernel, gamma, c, degree in argumentTuples:
        classifier = svm.SVC(kernel = kernel, gamma = gamma, C = c, degree = degree)

        #result = calculate_results_holdout(classifier, X_train, X_test, y_train, y_test)
        result = calculate_results_cross_validate(classifier,
                                                  "Kernel: " + kernel + ", Degree: " + str(degree),
                                                  data,
                                                  target)
        svm_results.append(result)
    return svm_results
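
One detail of the grid above: according to the scikit-learn documentation, `degree` only affects the `poly` kernel and is ignored by all other kernels. For `rbf`, varying the degree therefore repeats the same model. The following sketch checks this on synthetic data; the data and the names `rbf_low`, `rbf_high`, and `same_rbf` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic data (hypothetical stand-in for the voting features).
X_demo, y_demo = make_classification(n_samples=80, n_features=6, random_state=0)

# Two rbf models that differ only in the degree parameter.
rbf_low = SVC(kernel="rbf", gamma=0.001, C=100, degree=1).fit(X_demo, y_demo)
rbf_high = SVC(kernel="rbf", gamma=0.001, C=100, degree=9).fit(X_demo, y_demo)

# The decision values agree, since degree plays no role in the rbf kernel.
same_rbf = np.allclose(rbf_low.decision_function(X_demo),
                       rbf_high.decision_function(X_demo))
print("rbf unaffected by degree:", same_rbf)
```

Restricting the degree range to the `poly` kernel would avoid these redundant fits.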

## Congressional Voting

In [7]:
#Recode values for predicting variables
def recode_voting_data(dataset):
    dataset = dataset.replace('y', 1)\
        .replace('n', 0)\
        .replace('democrat', 2)\
        .replace('republican', 3)
    dataset.loc[:, dataset.columns != "ID"] = dataset.loc[:, dataset.columns != "ID"].astype('category')
    return pd.DataFrame(dataset, columns=dataset.columns)

#Impute missing values
def input_missing_values(data):
    columns = data.columns
    imp = IterativeImputer(max_iter=10, random_state=0)
    imp.fit(data.loc[:, data.columns != "ID"])
    data.loc[:, data.columns != "ID"] = np.round(imp.transform(data.loc[:, data.columns != "ID"]))
    data.loc[:, data.columns != "ID"] = data.loc[:, data.columns != "ID"].apply(lambda x: x.astype('int'))
    data.loc[:, data.columns != "ID"] = data.loc[:, data.columns != "ID"].astype('category')
    return pd.DataFrame(data, columns=columns)

#Read Data
votingDataLearn = pd.read_csv("data/voting/CongressionalVotingID.shuf.lrn.csv", na_values='unknown')
votingDataSolutionExample = pd.read_csv("data/voting/CongressionalVotingID.shuf.sol.ex.csv", na_values='unknown')
votingDataTest = pd.read_csv("data/voting/CongressionalVotingID.shuf.tes.csv", na_values='unknown')
display("Original Data", votingDataLearn)

#Recode values
votingDataLearn = recode_voting_data(votingDataLearn)
votingDataLearn = input_missing_values(votingDataLearn)

display("Recoded Data", votingDataLearn)

display("Data: ", votingDataLearn[votingDataLearn.columns[2:18]])
display("Target: ", votingDataLearn["class"])

'Original Data'

Unnamed: 0,ID,class,handicapped-infants,water-project-cost-sharing,adoption-of-the-budget-resolution,physician-fee-freeze,el-salvador-aid,religious-groups-in-schools,anti-satellite-test-ban,aid-to-nicaraguan-contras,mx-missile,immigration,synfuels-crporation-cutback,education-spending,superfund-right-to-sue,crime,duty-free-exports,export-administration-act-south-africa
0,213,democrat,n,n,y,n,n,n,y,y,y,n,y,n,n,n,y,y
1,94,democrat,y,n,y,n,n,n,y,n,y,y,y,n,n,n,y,y
2,188,democrat,y,n,y,n,n,n,y,y,y,n,n,n,n,n,y,
3,61,democrat,y,y,y,n,n,,y,y,y,y,n,n,n,n,y,
4,184,democrat,,,,,,,,,y,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
213,250,democrat,y,n,y,n,n,n,y,y,,n,y,n,n,n,y,y
214,26,democrat,y,n,y,n,n,n,y,y,y,y,n,n,n,n,y,y
215,110,democrat,y,,y,n,n,n,y,y,y,n,n,n,n,n,y,
216,34,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,y




'Recoded Data'

Unnamed: 0,ID,class,handicapped-infants,water-project-cost-sharing,adoption-of-the-budget-resolution,physician-fee-freeze,el-salvador-aid,religious-groups-in-schools,anti-satellite-test-ban,aid-to-nicaraguan-contras,mx-missile,immigration,synfuels-crporation-cutback,education-spending,superfund-right-to-sue,crime,duty-free-exports,export-administration-act-south-africa
0,213,2,0,0,1,0,0,0,1,1,1,0,1,0,0,0,1,1
1,94,2,1,0,1,0,0,0,1,0,1,1,1,0,0,0,1,1
2,188,2,1,0,1,0,0,0,1,1,1,0,0,0,0,0,1,1
3,61,2,1,1,1,0,0,0,1,1,1,1,0,0,0,0,1,1
4,184,2,1,0,1,0,0,0,1,1,1,1,0,0,0,0,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
213,250,2,1,0,1,0,0,0,1,1,1,0,1,0,0,0,1,1
214,26,2,1,0,1,0,0,0,1,1,1,1,0,0,0,0,1,1
215,110,2,1,0,1,0,0,0,1,1,1,0,0,0,0,0,1,1
216,34,3,0,1,0,1,1,1,0,0,0,0,0,1,1,1,0,1


'Data: '

Unnamed: 0,handicapped-infants,water-project-cost-sharing,adoption-of-the-budget-resolution,physician-fee-freeze,el-salvador-aid,religious-groups-in-schools,anti-satellite-test-ban,aid-to-nicaraguan-contras,mx-missile,immigration,synfuels-crporation-cutback,education-spending,superfund-right-to-sue,crime,duty-free-exports,export-administration-act-south-africa
0,0,0,1,0,0,0,1,1,1,0,1,0,0,0,1,1
1,1,0,1,0,0,0,1,0,1,1,1,0,0,0,1,1
2,1,0,1,0,0,0,1,1,1,0,0,0,0,0,1,1
3,1,1,1,0,0,0,1,1,1,1,0,0,0,0,1,1
4,1,0,1,0,0,0,1,1,1,1,0,0,0,0,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
213,1,0,1,0,0,0,1,1,1,0,1,0,0,0,1,1
214,1,0,1,0,0,0,1,1,1,1,0,0,0,0,1,1
215,1,0,1,0,0,0,1,1,1,0,0,0,0,0,1,1
216,0,1,0,1,1,1,0,0,0,0,0,1,1,1,0,1


'Target: '

0      2
1      2
2      2
3      2
4      2
      ..
213    2
214    2
215    2
216    3
217    3
Name: class, Length: 218, dtype: category
Categories (2, int64): [2, 3]
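
The imputation step above estimates each missing vote from the other votes and then rounds the estimate to an integer code. A minimal sketch of that idea on a tiny made-up vote matrix (the array `votes` is hypothetical, not taken from the dataset):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

# Made-up 0/1 votes with a few missing entries (np.nan).
votes = np.array([
    [1, 0, 1],
    [1, 0, np.nan],
    [0, 1, 0],
    [0, np.nan, 0],
    [1, 0, 1],
    [np.nan, 1, 0],
], dtype=float)

imputer = IterativeImputer(max_iter=10, random_state=0)
imputed = imputer.fit_transform(votes)  # continuous estimates for the missing cells

# Round the estimates to integer codes, as done in input_missing_values.
rounded = np.round(imputed).astype(int)
print(rounded)
```

Observed entries are left unchanged; only the missing cells are filled. Rounding alone does not guarantee that the result stays within {0, 1}; the `min_value` and `max_value` parameters of `IterativeImputer` could bound the estimates if that matters.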

### k-NN - Congressional Vote

In [8]:
knn_results_vote = calculate_knn(votingDataLearn[votingDataLearn.columns[2:18]],
                                 votingDataLearn["class"])
overall_results_vote.extend(knn_results_vote)

print_results(knn_results_vote, "mean_accuracy")

'Results'

Unnamed: 0,classifier,arguments,mean_accuracy,mean_precision,mean_recall,accuracy,precision,recall
6,KNeighborsClassifier(n_neighbors=7),N = 7,0.949789,0.946634,0.956796,m: 0.9497885835095138 std: 0.03633965972902396,m: 0.9466340390488999 std: 0.03608620915053702,m: 0.9567957097368863 std: 0.029626605680041794
7,KNeighborsClassifier(n_neighbors=8),N = 8,0.949683,0.94608,0.956796,m: 0.9496828752642706 std: 0.03925378581485785,m: 0.9460797448165869 std: 0.03905530325409666,m: 0.9567957097368863 std: 0.033352054331873056
8,KNeighborsClassifier(n_neighbors=9),N = 9,0.945243,0.94216,0.953092,m: 0.9452431289640592 std: 0.03687778755803422,m: 0.9421603548383735 std: 0.037090417983209244,m: 0.9530920060331824 std: 0.02978123723739161
0,KNeighborsClassifier(n_neighbors=1),N = 1,0.945032,0.942523,0.946331,m: 0.945031712473573 std: 0.034048886698533634,m: 0.9425226260094682 std: 0.03634869147380645,m: 0.9463308614043908 std: 0.031909717382637634
2,KNeighborsClassifier(n_neighbors=3),N = 3,0.945032,0.941734,0.950913,m: 0.945031712473573 std: 0.042524418215617005,m: 0.9417338423530375 std: 0.04229555526924847,m: 0.9509133567957099 std: 0.039255464764560155
4,KNeighborsClassifier(),N = 5,0.945032,0.941734,0.950913,m: 0.945031712473573 std: 0.042524418215617005,m: 0.9417338423530375 std: 0.04229555526924847,m: 0.9509133567957099 std: 0.039255464764560155
1,KNeighborsClassifier(n_neighbors=2),N = 2,0.944926,0.945693,0.941606,m: 0.9449260042283297 std: 0.03699848987727991,m: 0.9456933175569079 std: 0.04001026811690184,m: 0.9416059158706218 std: 0.03451379175922509
3,KNeighborsClassifier(n_neighbors=4),N = 4,0.940381,0.937812,0.942485,m: 0.9403805496828752 std: 0.027634973997320893,m: 0.937811566371938 std: 0.030278525504452966,m: 0.942484707558237 std: 0.02472620156914983
5,KNeighborsClassifier(n_neighbors=6),N = 6,0.935835,0.934132,0.938781,m: 0.9358350951374208 std: 0.033627186829120255,m: 0.9341319126922842 std: 0.034844958450856264,m: 0.9387810038545334 std: 0.028087583916356424


classifier                    KNeighborsClassifier(n_neighbors=7)
arguments                                                   N = 7
mean_accuracy                                            0.949789
mean_precision                                           0.946634
mean_recall                                              0.956796
accuracy           m: 0.9497885835095138 std: 0.03633965972902396
precision          m: 0.9466340390488999 std: 0.03608620915053702
recall            m: 0.9567957097368863 std: 0.029626605680041794
Name: 6, dtype: object

### Bayes - Congressional Vote

In [9]:
bayes_results_vote = calculate_bayes(votingDataLearn[votingDataLearn.columns[2:18]],
                                     votingDataLearn["class"])
overall_results_vote.extend(bayes_results_vote)

print_results(bayes_results_vote, "mean_accuracy")

'Results'

Unnamed: 0,classifier,arguments,mean_accuracy,mean_precision,mean_recall,accuracy,precision,recall
3,CategoricalNB(alpha=3.1),Alpha = 3.1,0.922304,0.918578,0.929706,m: 0.9223044397463003 std: 0.04195165926635185,m: 0.9185782390759508 std: 0.04469462645086833,m: 0.9297060918384448 std: 0.03623512490395484
4,CategoricalNB(alpha=4.1),Alpha = 4.1,0.922304,0.918578,0.929706,m: 0.9223044397463003 std: 0.04195165926635185,m: 0.9185782390759508 std: 0.04469462645086833,m: 0.9297060918384448 std: 0.03623512490395484
2,CategoricalNB(alpha=2.1),Alpha = 2.1,0.917759,0.914147,0.923824,m: 0.9177589852008456 std: 0.03936351367397849,m: 0.9141472899081595 std: 0.04267326106536261,m: 0.9238237388972683 std: 0.03255971857919237
0,CategoricalNB(alpha=0.1),Alpha = 0.1,0.917653,0.916001,0.919099,m: 0.9176532769556026 std: 0.036606812641491196,m: 0.9160010673755272 std: 0.041199924531873135,m: 0.9190987933634993 std: 0.03090170389606902
1,CategoricalNB(alpha=1.1),Alpha = 1.1,0.908562,0.905936,0.911691,m: 0.9085623678646935 std: 0.03759113618117659,m: 0.9059362018431694 std: 0.04263650611873664,m: 0.9116913859560919 std: 0.03226414499176202


classifier                              CategoricalNB(alpha=3.1)
arguments                                            Alpha = 3.1
mean_accuracy                                           0.922304
mean_precision                                          0.918578
mean_recall                                             0.929706
accuracy          m: 0.9223044397463003 std: 0.04195165926635185
precision         m: 0.9185782390759508 std: 0.04469462645086833
recall            m: 0.9297060918384448 std: 0.03623512490395484
Name: 3, dtype: object

### Perceptron - Congressional Vote

In [10]:
perceptron_results_vote = calculate_perceptron(votingDataLearn[votingDataLearn.columns[2:18]],
                                               votingDataLearn["class"])
overall_results_vote.extend(perceptron_results_vote)

print_results(perceptron_results_vote, "mean_accuracy")

'Results'

Unnamed: 0,classifier,arguments,mean_accuracy,mean_precision,mean_recall,accuracy,precision,recall
0,Perceptron(),No additional args.,0.944715,0.948016,0.941238,m: 0.9447145877378436 std: 0.03746846609226435,m: 0.948015873015873 std: 0.035728392275797285,m: 0.9412382688117983 std: 0.04219766815473462


classifier                                          Perceptron()
arguments                                    No additional args.
mean_accuracy                                           0.944715
mean_precision                                          0.948016
mean_recall                                             0.941238
accuracy          m: 0.9447145877378436 std: 0.03746846609226435
precision         m: 0.948015873015873 std: 0.035728392275797285
recall            m: 0.9412382688117983 std: 0.04219766815473462
Name: 0, dtype: object

### Decision Tree - Congressional Vote

In [11]:
decision_tree_results_vote = calculate_decision_tree(votingDataLearn[votingDataLearn.columns[2:18]],
                                                     votingDataLearn["class"])
overall_results_vote.extend(decision_tree_results_vote)

print_results(decision_tree_results_vote, "mean_accuracy")

'Results'

Unnamed: 0,classifier,arguments,mean_accuracy,mean_precision,mean_recall,accuracy,precision,recall
0,"DecisionTreeClassifier(max_depth=1, min_sample...","max Depth: 1, min Samples: 2",0.96797,0.963363,0.973789,m: 0.967970401691332 std: 0.023305252083176783,m: 0.9633625730994153 std: 0.025403888401415272,m: 0.9737891737891738 std: 0.0190606594685786
9,"DecisionTreeClassifier(max_depth=5, min_sample...","max Depth: 5, min Samples: 20",0.96797,0.963363,0.973789,m: 0.967970401691332 std: 0.023305252083176783,m: 0.9633625730994153 std: 0.025403888401415272,m: 0.9737891737891738 std: 0.0190606594685786
18,"DecisionTreeClassifier(max_depth=9, min_sample...","max Depth: 9, min Samples: 50",0.96797,0.963363,0.973789,m: 0.967970401691332 std: 0.023305252083176783,m: 0.9633625730994153 std: 0.025403888401415272,m: 0.9737891737891738 std: 0.0190606594685786
17,"DecisionTreeClassifier(max_depth=9, min_sample...","max Depth: 9, min Samples: 20",0.96797,0.963363,0.973789,m: 0.967970401691332 std: 0.023305252083176783,m: 0.9633625730994153 std: 0.025403888401415272,m: 0.9737891737891738 std: 0.0190606594685786
14,"DecisionTreeClassifier(max_depth=7, min_sample...","max Depth: 7, min Samples: 50",0.96797,0.963363,0.973789,m: 0.967970401691332 std: 0.023305252083176783,m: 0.9633625730994153 std: 0.025403888401415272,m: 0.9737891737891738 std: 0.0190606594685786
13,"DecisionTreeClassifier(max_depth=7, min_sample...","max Depth: 7, min Samples: 20",0.96797,0.963363,0.973789,m: 0.967970401691332 std: 0.023305252083176783,m: 0.9633625730994153 std: 0.025403888401415272,m: 0.9737891737891738 std: 0.0190606594685786
1,"DecisionTreeClassifier(max_depth=1, min_sample...","max Depth: 1, min Samples: 20",0.96797,0.963363,0.973789,m: 0.967970401691332 std: 0.023305252083176783,m: 0.9633625730994153 std: 0.025403888401415272,m: 0.9737891737891738 std: 0.0190606594685786
10,"DecisionTreeClassifier(max_depth=5, min_sample...","max Depth: 5, min Samples: 50",0.96797,0.963363,0.973789,m: 0.967970401691332 std: 0.023305252083176783,m: 0.9633625730994153 std: 0.025403888401415272,m: 0.9737891737891738 std: 0.0190606594685786
6,"DecisionTreeClassifier(max_depth=3, min_sample...","max Depth: 3, min Samples: 50",0.96797,0.963363,0.973789,m: 0.967970401691332 std: 0.023305252083176783,m: 0.9633625730994153 std: 0.025403888401415272,m: 0.9737891737891738 std: 0.0190606594685786
5,"DecisionTreeClassifier(max_depth=3, min_sample...","max Depth: 3, min Samples: 20",0.96797,0.963363,0.973789,m: 0.967970401691332 std: 0.023305252083176783,m: 0.9633625730994153 std: 0.025403888401415272,m: 0.9737891737891738 std: 0.0190606594685786


classifier        DecisionTreeClassifier(max_depth=1, min_sample...
arguments                              max Depth: 1, min Samples: 2
mean_accuracy                                               0.96797
mean_precision                                             0.963363
mean_recall                                                0.973789
accuracy             m: 0.967970401691332 std: 0.023305252083176783
precision           m: 0.9633625730994153 std: 0.025403888401415272
recall                m: 0.9737891737891738 std: 0.0190606594685786
Name: 0, dtype: object

### SVM - Congressional Vote

In [12]:
svm_results_vote = calculate_svm(votingDataLearn[votingDataLearn.columns[2:18]],
                                 votingDataLearn["class"])
overall_results_vote.extend(svm_results_vote)

print_results(svm_results_vote, "mean_accuracy")

'Results'

Unnamed: 0,classifier,arguments,mean_accuracy,mean_precision,mean_recall,accuracy,precision,recall
0,"SVC(C=100, degree=1, gamma=0.001, kernel='poly')","Kernel: poly, Degree: 1",0.96797,0.963363,0.973789,m: 0.967970401691332 std: 0.023305252083176783,m: 0.9633625730994153 std: 0.025403888401415272,m: 0.9737891737891738 std: 0.0190606594685786
28,"SVC(C=100, degree=2, gamma=0.001)","Kernel: rbf, Degree: 2",0.96797,0.963363,0.973789,m: 0.967970401691332 std: 0.023305252083176783,m: 0.9633625730994153 std: 0.025403888401415272,m: 0.9737891737891738 std: 0.0190606594685786
35,"SVC(C=100, degree=9, gamma=0.001)","Kernel: rbf, Degree: 9",0.96797,0.963363,0.973789,m: 0.967970401691332 std: 0.023305252083176783,m: 0.9633625730994153 std: 0.025403888401415272,m: 0.9737891737891738 std: 0.0190606594685786
34,"SVC(C=100, degree=8, gamma=0.001)","Kernel: rbf, Degree: 8",0.96797,0.963363,0.973789,m: 0.967970401691332 std: 0.023305252083176783,m: 0.9633625730994153 std: 0.025403888401415272,m: 0.9737891737891738 std: 0.0190606594685786
33,"SVC(C=100, degree=7, gamma=0.001)","Kernel: rbf, Degree: 7",0.96797,0.963363,0.973789,m: 0.967970401691332 std: 0.023305252083176783,m: 0.9633625730994153 std: 0.025403888401415272,m: 0.9737891737891738 std: 0.0190606594685786
32,"SVC(C=100, degree=6, gamma=0.001)","Kernel: rbf, Degree: 6",0.96797,0.963363,0.973789,m: 0.967970401691332 std: 0.023305252083176783,m: 0.9633625730994153 std: 0.025403888401415272,m: 0.9737891737891738 std: 0.0190606594685786
31,"SVC(C=100, degree=5, gamma=0.001)","Kernel: rbf, Degree: 5",0.96797,0.963363,0.973789,m: 0.967970401691332 std: 0.023305252083176783,m: 0.9633625730994153 std: 0.025403888401415272,m: 0.9737891737891738 std: 0.0190606594685786
30,"SVC(C=100, degree=4, gamma=0.001)","Kernel: rbf, Degree: 4",0.96797,0.963363,0.973789,m: 0.967970401691332 std: 0.023305252083176783,m: 0.9633625730994153 std: 0.025403888401415272,m: 0.9737891737891738 std: 0.0190606594685786
29,"SVC(C=100, gamma=0.001)","Kernel: rbf, Degree: 3",0.96797,0.963363,0.973789,m: 0.967970401691332 std: 0.023305252083176783,m: 0.9633625730994153 std: 0.025403888401415272,m: 0.9737891737891738 std: 0.0190606594685786
27,"SVC(C=100, degree=1, gamma=0.001)","Kernel: rbf, Degree: 1",0.96797,0.963363,0.973789,m: 0.967970401691332 std: 0.023305252083176783,m: 0.9633625730994153 std: 0.025403888401415272,m: 0.9737891737891738 std: 0.0190606594685786


classifier        SVC(C=100, degree=1, gamma=0.001, kernel='poly')
arguments                                  Kernel: poly, Degree: 1
mean_accuracy                                              0.96797
mean_precision                                            0.963363
mean_recall                                               0.973789
accuracy            m: 0.967970401691332 std: 0.023305252083176783
precision          m: 0.9633625730994153 std: 0.025403888401415272
recall               m: 0.9737891737891738 std: 0.0190606594685786
Name: 0, dtype: object

### Overall Results for Congressional Vote

In [13]:
print_results(overall_results_vote, "mean_accuracy")


'Results'

Unnamed: 0,classifier,arguments,mean_accuracy,mean_precision,mean_recall,accuracy,precision,recall
15,"DecisionTreeClassifier(max_depth=1, min_sample...","max Depth: 1, min Samples: 2",0.967970,0.963363,0.973789,m: 0.967970401691332 std: 0.023305252083176783,m: 0.9633625730994153 std: 0.025403888401415272,m: 0.9737891737891738 std: 0.0190606594685786
70,"SVC(C=100, degree=9, gamma=0.001)","Kernel: rbf, Degree: 9",0.967970,0.963363,0.973789,m: 0.967970401691332 std: 0.023305252083176783,m: 0.9633625730994153 std: 0.025403888401415272,m: 0.9737891737891738 std: 0.0190606594685786
28,"DecisionTreeClassifier(max_depth=7, min_sample...","max Depth: 7, min Samples: 20",0.967970,0.963363,0.973789,m: 0.967970401691332 std: 0.023305252083176783,m: 0.9633625730994153 std: 0.025403888401415272,m: 0.9737891737891738 std: 0.0190606594685786
32,"DecisionTreeClassifier(max_depth=9, min_sample...","max Depth: 9, min Samples: 20",0.967970,0.963363,0.973789,m: 0.967970401691332 std: 0.023305252083176783,m: 0.9633625730994153 std: 0.025403888401415272,m: 0.9737891737891738 std: 0.0190606594685786
33,"DecisionTreeClassifier(max_depth=9, min_sample...","max Depth: 9, min Samples: 50",0.967970,0.963363,0.973789,m: 0.967970401691332 std: 0.023305252083176783,m: 0.9633625730994153 std: 0.025403888401415272,m: 0.9737891737891738 std: 0.0190606594685786
...,...,...,...,...,...,...,...,...
30,"DecisionTreeClassifier(max_depth=7, min_sample...","max Depth: 7, min Samples: 100",0.614693,0.307347,0.500000,m: 0.614693446088795 std: 0.007467223261077965,m: 0.3073467230443975 std: 0.0037336116305389825,m: 0.5 std: 0.0
34,"DecisionTreeClassifier(max_depth=9, min_sample...","max Depth: 9, min Samples: 100",0.614693,0.307347,0.500000,m: 0.614693446088795 std: 0.007467223261077965,m: 0.3073467230443975 std: 0.0037336116305389825,m: 0.5 std: 0.0
40,"SVC(C=100, degree=6, gamma=0.001, kernel='poly')","Kernel: poly, Degree: 6",0.614693,0.307347,0.500000,m: 0.614693446088795 std: 0.007467223261077965,m: 0.3073467230443975 std: 0.0037336116305389825,m: 0.5 std: 0.0
37,"SVC(C=100, gamma=0.001, kernel='poly')","Kernel: poly, Degree: 3",0.614693,0.307347,0.500000,m: 0.614693446088795 std: 0.007467223261077965,m: 0.3073467230443975 std: 0.0037336116305389825,m: 0.5 std: 0.0


classifier        DecisionTreeClassifier(max_depth=1, min_sample...
arguments                              max Depth: 1, min Samples: 2
mean_accuracy                                               0.96797
mean_precision                                             0.963363
mean_recall                                                0.973789
accuracy             m: 0.967970401691332 std: 0.023305252083176783
precision           m: 0.9633625730994153 std: 0.025403888401415272
recall                m: 0.9737891737891738 std: 0.0190606594685786
Name: 15, dtype: object
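
A note on the weakest decision tree rows above (accuracy 0.6147, macro recall 0.5): with 218 training rows and 5 folds, each tree is fitted on about 174 rows. A split with `min_samples_leaf = 100` would need at least 200 rows, so no split is possible and the tree predicts the majority class for everything. The sketch below reproduces that effect on made-up data; the random features and the 107/67 class split are assumptions chosen only to resemble one fold.

```python
import numpy as np
from sklearn.metrics import recall_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n_rows = 174  # roughly 4/5 of the 218 training rows
X_fold = rng.integers(0, 2, size=(n_rows, 16))  # made-up binary features
y_fold = np.array([2] * 107 + [3] * 67)          # hypothetical class balance

# Each leaf must hold at least 100 rows, so 174 rows cannot be split at all.
stump = DecisionTreeClassifier(max_depth=9, min_samples_leaf=100).fit(X_fold, y_fold)
predictions = stump.predict(X_fold)

print("leaves:", stump.get_n_leaves())
print("predicted classes:", np.unique(predictions))
print("macro recall:", recall_score(y_fold, predictions, average="macro"))
```

A single-leaf tree has recall 1 for the majority class and 0 for the other, which averages to the macro recall of 0.5 seen in the table.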

### Train submission file

In [24]:
#Required Imports
from sklearn import svm
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from IPython.display import display
import pandas as pd
import numpy as np

#Recode values for predicting variables
def recode_voting_data(dataset):
    dataset = dataset.replace('y', 1)\
        .replace('n', 0)
    dataset[dataset.columns[1:18]] = dataset[dataset.columns[1:18]].astype('category')
    return pd.DataFrame(dataset, columns=dataset.columns)

#Impute missing values using the fitted imputer that is passed in
def input_missing_values(data, iterative_imputer):
    columns = data.columns
    data.loc[:, data.columns != "ID"] = np.round(iterative_imputer.transform(data.loc[:, data.columns != "ID"]))
    data.loc[:, data.columns != "ID"] = data.loc[:, data.columns != "ID"].apply(lambda x: x.astype('int'))
    return pd.DataFrame(data, columns=columns)

#Read Data
votingDataLearn = pd.read_csv("data/voting/CongressionalVotingID.shuf.lrn.csv", na_values='unknown')
votingDataTest = pd.read_csv("data/voting/CongressionalVotingID.shuf.tes.csv", na_values='unknown')

#Extract target variable
votingDataLearn = votingDataLearn.replace('democrat', 2)\
    .replace('republican', 3)

y = votingDataLearn["class"]
X = pd.DataFrame(votingDataLearn.drop(["ID", "class"], axis=1))

#Recode Variables
X = recode_voting_data(X)
votingDataTest = recode_voting_data(votingDataTest)

#Impute missing values
imp = IterativeImputer(max_iter=50, random_state=0)
combined_data = pd.concat([X, votingDataTest])
imp.fit(combined_data.loc[:, combined_data.columns != "ID"])
X = input_missing_values(X, imp)
votingDataTest = input_missing_values(votingDataTest, imp)

#Calculate Model
classifier = svm.SVC(kernel = "rbf", gamma=0.001, C=100)
#classifier = tree.DecisionTreeClassifier(max_depth=1)
classifier.fit(X, y)

#Predict the Test Data
votingDataTest["class"] = classifier.predict(votingDataTest[votingDataTest.columns[1:18]])

#Recode to required output
votingDataTest["class"] = votingDataTest["class"].replace({2: "democrat", 3: "republican"})
display("Finally recoded back: ", votingDataTest[["ID", "class"]])
votingDataTest[["ID", "class"]].to_csv("solution_voting.csv", index = False)

'Finally recoded back: '

Unnamed: 0,ID,class
0,13,democrat
1,393,republican
2,163,democrat
3,57,republican
4,148,democrat
...,...,...
212,359,democrat
213,128,democrat
214,27,democrat
215,119,democrat
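
Before uploading, it can help to check the frame against the submission format described at the top (an identifier column followed by the predicted class). The helper below is a minimal sketch. `check_submission` and its parameters are hypothetical names introduced here, not part of the notebook, and the three demo rows are taken from the output above.

```python
import pandas as pd

def check_submission(submission, test_ids, id_col="ID", class_col="class"):
    """Return a list of format problems; an empty list means none were found."""
    problems = []
    # The file must contain exactly the identifier and the class column, in that order
    if list(submission.columns) != [id_col, class_col]:
        problems.append(f"expected columns [{id_col}, {class_col}], got {list(submission.columns)}")
        return problems
    if submission[class_col].isna().any():
        problems.append("missing predictions")
    if submission[id_col].duplicated().any():
        problems.append("duplicate IDs")
    if set(submission[id_col]) != set(test_ids):
        problems.append("IDs differ from the test set")
    return problems

# Small illustrative frame using IDs from the output above
demo = pd.DataFrame({"ID": [13, 393, 163],
                     "class": ["democrat", "republican", "democrat"]})
print(check_submission(demo, test_ids=[13, 393, 163]))  # prints []
```

For the Amazon data the same check would be called with `class_col="Class"`, since that data set capitalizes the column name.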


## Amazon

In [15]:
from sklearn import preprocessing
#Read Data
amazonDataLearn = pd.read_csv("data/amazon/amazon_review_ID.shuf.lrn.csv")
amazonDataSolutionExample = pd.read_csv("data/amazon/amazon_review_ID.shuf.sol.ex.csv")
amazonDataTest = pd.read_csv("data/amazon/amazon_review_ID.shuf.tes.csv")
display("Original Data", amazonDataLearn)

#Recode values
#For One Hot Encoding of Class
#amazonDataLearn = pd.concat([amazonDataLearn, pd.get_dummies(amazonDataLearn["Class"], prefix='author_',drop_first=False)], axis=1)
#amazonDataLearn.drop(['Class'],axis=1, inplace=True)
#names_target = amazonDataLearn.loc[:, amazonDataLearn.columns.str.startswith('author_')]
#amazonDataLearn[names_target.columns] = amazonDataLearn[names_target.columns].apply(lambda x: x.astype('category'))

# For Label Encoding
#le = preprocessing.LabelEncoder()
#le.fit(amazonDataLearn['Class'])
#amazonDataLearn['Class'] = le.transform(amazonDataLearn['Class'])
#amazonDataLearn['Class'] = amazonDataLearn['Class'].astype('category')

names_data = amazonDataLearn.loc[:, amazonDataLearn.columns.str.startswith('V')]
#amazonDataLearn[0:10000] = amazonDataLearn[0:10000].apply(lambda x: x.astype('int'))

# Normalize data
def normalize_values(data):
    columns = data.columns
    data = preprocessing.Normalizer().fit_transform(data)
    return pd.DataFrame(data, columns=columns)

display("Recoded Data", amazonDataLearn)

X_amazon = normalize_values(amazonDataLearn[names_data.columns])
y_amazon = amazonDataLearn["Class"]

display("Data: ", X_amazon)
display("Target: ", y_amazon)

'Original Data'

Unnamed: 0,ID,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V9992,V9993,V9994,V9995,V9996,V9997,V9998,V9999,V10000,Class
0,0,9,5,5,9,7,0,8,7,1,...,0,1,0,1,0,0,0,0,2,Power
1,1,11,9,15,15,5,11,10,1,5,...,0,0,0,0,0,0,0,0,0,Goonan
2,2,11,10,13,12,6,5,0,3,1,...,0,0,0,0,0,0,0,1,0,Merritt
3,3,18,9,7,8,8,7,12,6,7,...,0,1,0,0,0,1,0,0,1,Goonan
4,4,11,7,10,11,4,5,1,8,4,...,0,0,0,0,0,1,0,0,3,Corn
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
745,745,5,5,8,2,8,0,5,1,2,...,1,0,0,0,0,0,0,0,0,Chachra
746,746,22,13,8,14,8,11,3,6,7,...,6,0,2,0,0,2,0,0,0,Morrison
747,747,10,3,5,5,7,1,14,2,6,...,0,0,4,1,0,0,2,0,0,Sherwin
748,748,9,13,8,5,11,9,9,3,3,...,0,0,0,1,0,0,0,0,0,Blankenship


'Recoded Data'

Unnamed: 0,ID,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V9992,V9993,V9994,V9995,V9996,V9997,V9998,V9999,V10000,Class
0,0,9,5,5,9,7,0,8,7,1,...,0,1,0,1,0,0,0,0,2,Power
1,1,11,9,15,15,5,11,10,1,5,...,0,0,0,0,0,0,0,0,0,Goonan
2,2,11,10,13,12,6,5,0,3,1,...,0,0,0,0,0,0,0,1,0,Merritt
3,3,18,9,7,8,8,7,12,6,7,...,0,1,0,0,0,1,0,0,1,Goonan
4,4,11,7,10,11,4,5,1,8,4,...,0,0,0,0,0,1,0,0,3,Corn
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
745,745,5,5,8,2,8,0,5,1,2,...,1,0,0,0,0,0,0,0,0,Chachra
746,746,22,13,8,14,8,11,3,6,7,...,6,0,2,0,0,2,0,0,0,Morrison
747,747,10,3,5,5,7,1,14,2,6,...,0,0,4,1,0,0,2,0,0,Sherwin
748,748,9,13,8,5,11,9,9,3,3,...,0,0,0,1,0,0,0,0,0,Blankenship


'Data: '

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V9991,V9992,V9993,V9994,V9995,V9996,V9997,V9998,V9999,V10000
0,0.022849,0.012694,0.012694,0.022849,0.017771,0.000000,0.020310,0.017771,0.002539,0.012694,...,0.000000,0.000000,0.002539,0.000000,0.002539,0.00000,0.000000,0.000000,0.000000,0.005077
1,0.030256,0.024755,0.041258,0.041258,0.013753,0.030256,0.027505,0.002751,0.013753,0.019254,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000,0.000000,0.000000
2,0.028761,0.026146,0.033990,0.031376,0.015688,0.013073,0.000000,0.007844,0.002615,0.002615,...,0.002615,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000,0.002615,0.000000
3,0.041891,0.020946,0.016291,0.018618,0.018618,0.016291,0.027927,0.013964,0.016291,0.002327,...,0.000000,0.000000,0.002327,0.000000,0.000000,0.00000,0.002327,0.000000,0.000000,0.002327
4,0.028918,0.018402,0.026289,0.028918,0.010516,0.013145,0.002629,0.021031,0.010516,0.010516,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,0.002629,0.000000,0.000000,0.007887
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
745,0.026867,0.026867,0.042987,0.010747,0.042987,0.000000,0.026867,0.005373,0.010747,0.016120,...,0.000000,0.005373,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000,0.000000,0.000000
746,0.048706,0.028781,0.017711,0.030995,0.017711,0.024353,0.006642,0.013284,0.015497,0.013284,...,0.000000,0.013284,0.000000,0.004428,0.000000,0.00000,0.004428,0.000000,0.000000,0.000000
747,0.026260,0.007878,0.013130,0.013130,0.018382,0.002626,0.036764,0.005252,0.015756,0.002626,...,0.000000,0.000000,0.000000,0.010504,0.002626,0.00000,0.000000,0.005252,0.000000,0.000000
748,0.025152,0.036330,0.022357,0.013973,0.030741,0.025152,0.025152,0.008384,0.008384,0.016768,...,0.000000,0.000000,0.000000,0.000000,0.002795,0.00000,0.000000,0.000000,0.000000,0.000000


'Target: '

0            Power
1           Goonan
2          Merritt
3           Goonan
4             Corn
          ...     
745        Chachra
746       Morrison
747        Sherwin
748    Blankenship
749       Davisson
Name: Class, Length: 750, dtype: object
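
Note that `preprocessing.Normalizer` rescales each row (sample) to unit L2 norm by default; it does not standardize columns. It is also stateless, so applying it separately to training and test data gives the same result as fitting once. The toy matrix below is invented for illustration and shows that two rows with the same proportions map to the same vector.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Two illustrative rows with identical proportions but different magnitudes
counts = np.array([[3.0, 4.0, 0.0],
                   [30.0, 40.0, 0.0]])

scaled = Normalizer().fit_transform(counts)  # default norm="l2", applied per row
row_norms = np.linalg.norm(scaled, axis=1)

print(scaled)
print(row_norms)
```

Both rows become (0.6, 0.8, 0.0), each with norm 1. If per-feature scaling were wanted instead, a column-wise scaler such as `StandardScaler` would be the usual choice, and it would have to be fitted on the training data only.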

### k-NN Calculation - Amazon

In [16]:
knn_results_amazon = calculate_knn(X_amazon,
                                   y_amazon)
overall_results_amazon.extend(knn_results_amazon)

print_results(knn_results_amazon, "mean_accuracy")

'Results'

Unnamed: 0,classifier,arguments,mean_accuracy,mean_precision,mean_recall,accuracy,precision,recall
0,KNeighborsClassifier(n_neighbors=1),N = 1,0.273333,0.279978,0.262333,m: 0.2733333333333333 std: 0.018378731669453627,m: 0.2799775335775336 std: 0.024564836359824562,m: 0.2623333333333333 std: 0.014087031072743617
6,KNeighborsClassifier(n_neighbors=7),N = 7,0.253333,0.280771,0.244,m: 0.25333333333333335 std: 0.013333333333333327,m: 0.2807707117264662 std: 0.02619790566738083,m: 0.244 std: 0.009463379711052279
7,KNeighborsClassifier(n_neighbors=8),N = 8,0.246667,0.266944,0.238333,m: 0.24666666666666665 std: 0.01632993161855452,m: 0.26694380717321897 std: 0.03690138309463883,m: 0.23833333333333334 std: 0.012247448713915901
4,KNeighborsClassifier(),N = 5,0.238667,0.260482,0.232333,m: 0.23866666666666667 std: 0.022070593809662472,m: 0.2604821920108877 std: 0.043216904898388454,m: 0.23233333333333334 std: 0.017907168024751053
5,KNeighborsClassifier(n_neighbors=6),N = 6,0.237333,0.256108,0.226667,m: 0.23733333333333334 std: 0.030579586509812573,m: 0.2561081249668206 std: 0.05422575028681419,m: 0.22666666666666666 std: 0.026729093595639287
8,KNeighborsClassifier(n_neighbors=9),N = 9,0.234667,0.249151,0.227667,m: 0.23466666666666666 std: 0.022469732728470294,m: 0.24915127694981914 std: 0.02777585327246395,m: 0.22766666666666663 std: 0.023036203390894655
2,KNeighborsClassifier(n_neighbors=3),N = 3,0.229333,0.244745,0.227,m: 0.22933333333333333 std: 0.01768866554856214,m: 0.24474459037021967 std: 0.04298428983874409,m: 0.227 std: 0.01833030277982336
3,KNeighborsClassifier(n_neighbors=4),N = 4,0.226667,0.250132,0.221333,m: 0.22666666666666666 std: 0.017384539747207068,m: 0.25013176185624697 std: 0.046394677392848166,m: 0.22133333333333333 std: 0.01442990721460891
1,KNeighborsClassifier(n_neighbors=2),N = 2,0.221333,0.224895,0.218667,m: 0.22133333333333333 std: 0.016546231527987804,m: 0.22489529914529913 std: 0.012649167661970356,m: 0.21866666666666665 std: 0.01671326818229556


classifier                    KNeighborsClassifier(n_neighbors=1)
arguments                                                   N = 1
mean_accuracy                                            0.273333
mean_precision                                           0.279978
mean_recall                                              0.262333
accuracy          m: 0.2733333333333333 std: 0.018378731669453627
precision         m: 0.2799775335775336 std: 0.024564836359824562
recall            m: 0.2623333333333333 std: 0.014087031072743617
Name: 0, dtype: object
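
The `m: … std: …` columns read as a mean and standard deviation over cross-validation folds. The sketch below shows one way such a summary could be computed with `cross_validate`. It is not the notebook's `calculate_knn`. The function name `summarize_knn`, the fold count, the macro-averaged scorers and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption): 200 samples, 3 classes
X_demo, y_demo = make_classification(n_samples=200, n_features=20,
                                     n_informative=5, n_classes=3,
                                     random_state=0)

def summarize_knn(X, y, n_neighbors, folds=5):
    """Cross-validate a k-NN classifier and return (mean, std) per metric."""
    metrics = ["accuracy", "precision_macro", "recall_macro"]
    scores = cross_validate(KNeighborsClassifier(n_neighbors=n_neighbors),
                            X, y, cv=folds, scoring=metrics)
    # cross_validate stores per-fold scores under "test_<metric>"
    return {m: (np.mean(scores[f"test_{m}"]), np.std(scores[f"test_{m}"]))
            for m in metrics}

for name, (mean, std) in summarize_knn(X_demo, y_demo, n_neighbors=1).items():
    print(f"{name}: m: {mean} std: {std}")
```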

### Perceptron - Amazon

In [17]:
perceptron_results_amazon = calculate_perceptron(X_amazon,
                                                 y_amazon)
overall_results_amazon.extend(perceptron_results_amazon)

print_results(perceptron_results_amazon, "mean_accuracy")

'Results'

Unnamed: 0,classifier,arguments,mean_accuracy,mean_precision,mean_recall,accuracy,precision,recall
0,Perceptron(),No additional args.,0.224,0.228859,0.211,m: 0.22400000000000003 std: 0.07584780081774876,m: 0.22885949351343768 std: 0.08540867785826414,m: 0.211 std: 0.0708958547605022


classifier                                           Perceptron()
arguments                                     No additional args.
mean_accuracy                                               0.224
mean_precision                                           0.228859
mean_recall                                                 0.211
accuracy          m: 0.22400000000000003 std: 0.07584780081774876
precision         m: 0.22885949351343768 std: 0.08540867785826414
recall                           m: 0.211 std: 0.0708958547605022
Name: 0, dtype: object

### Decision Tree - Amazon

In [18]:
decision_tree_results_amazon = calculate_decision_tree(X_amazon,
                                                       y_amazon)
overall_results_amazon.extend(decision_tree_results_amazon)

print_results(decision_tree_results_amazon, "mean_accuracy")

'Results'

Unnamed: 0,classifier,arguments,mean_accuracy,mean_precision,mean_recall,accuracy,precision,recall
17,"DecisionTreeClassifier(max_depth=9, min_sample...","max Depth: 9, min Samples: 20",0.221333,0.109078,0.209,m: 0.22133333333333338 std: 0.03166491223210112,m: 0.10907801631217191 std: 0.02207017014108872,m: 0.209 std: 0.030886890422961007
16,"DecisionTreeClassifier(max_depth=9, min_sample...","max Depth: 9, min Samples: 2",0.197333,0.182503,0.191,m: 0.1973333333333333 std: 0.020912516188477497,m: 0.1825029903600236 std: 0.0178525974282946,m: 0.191 std: 0.01830604029032797
13,"DecisionTreeClassifier(max_depth=7, min_sample...","max Depth: 7, min Samples: 20",0.186667,0.080795,0.172667,m: 0.18666666666666668 std: 0.02666666666666667,m: 0.08079470558607138 std: 0.016771590163415853,m: 0.1726666666666667 std: 0.025854292572887103
12,"DecisionTreeClassifier(max_depth=7, min_sample...","max Depth: 7, min Samples: 2",0.161333,0.135135,0.152,m: 0.16133333333333333 std: 0.009797958971132713,m: 0.13513467576779098 std: 0.008851949077256116,m: 0.152 std: 0.009273618495495703
10,"DecisionTreeClassifier(max_depth=5, min_sample...","max Depth: 5, min Samples: 50",0.141333,0.028742,0.132333,m: 0.1413333333333333 std: 0.025785439474418283,m: 0.028741512735630387 std: 0.005716058018951...,m: 0.1323333333333333 std: 0.028256759270030317
18,"DecisionTreeClassifier(max_depth=9, min_sample...","max Depth: 9, min Samples: 50",0.141333,0.028742,0.132333,m: 0.1413333333333333 std: 0.025785439474418283,m: 0.028741512735630387 std: 0.005716058018951...,m: 0.1323333333333333 std: 0.028256759270030317
14,"DecisionTreeClassifier(max_depth=7, min_sample...","max Depth: 7, min Samples: 50",0.141333,0.028742,0.132333,m: 0.1413333333333333 std: 0.025785439474418283,m: 0.028741512735630387 std: 0.005716058018951...,m: 0.1323333333333333 std: 0.028256759270030317
8,"DecisionTreeClassifier(max_depth=5, min_sample...","max Depth: 5, min Samples: 2",0.130667,0.099738,0.117667,m: 0.13066666666666665 std: 0.009977753031397182,m: 0.09973849224963775 std: 0.001567674636739009,m: 0.11766666666666666 std: 0.002905932629027114
9,"DecisionTreeClassifier(max_depth=5, min_sample...","max Depth: 5, min Samples: 20",0.12,0.044873,0.108333,m: 0.12 std: 0.023851391759997755,m: 0.044873463008003094 std: 0.012153757507980322,m: 0.10833333333333332 std: 0.02019350830781462
6,"DecisionTreeClassifier(max_depth=3, min_sample...","max Depth: 3, min Samples: 50",0.093333,0.014105,0.084667,m: 0.09333333333333334 std: 0.008432740427115679,m: 0.014105219748524134 std: 0.001445612402666...,m: 0.08466666666666667 std: 0.007774602526460402


classifier        DecisionTreeClassifier(max_depth=9, min_sample...
arguments                             max Depth: 9, min Samples: 20
mean_accuracy                                              0.221333
mean_precision                                             0.109078
mean_recall                                                   0.209
accuracy            m: 0.22133333333333338 std: 0.03166491223210112
precision           m: 0.10907801631217191 std: 0.02207017014108872
recall                           m: 0.209 std: 0.030886890422961007
Name: 17, dtype: object

### SVM - Amazon

In [None]:
svm_results_amazon = calculate_svm(X_amazon,
                                   y_amazon)
overall_results_amazon.extend(svm_results_amazon)

print_results(svm_results_amazon, "mean_accuracy")

### Overall Results for Amazon

In [None]:
print_results(overall_results_amazon, "mean_accuracy")

### Prepare Submission

In [None]:
from sklearn import preprocessing
from sklearn import svm
import pandas as pd
import numpy as np
# Read Data
amazon_data_learn = pd.read_csv("data/amazon/amazon_review_ID.shuf.lrn.csv")
amazon_data_test = pd.read_csv("data/amazon/amazon_review_ID.shuf.tes.csv")

# Label Encoding (the encoder is fitted, but the string labels are used as-is for training)
le = preprocessing.LabelEncoder()
le.fit(amazon_data_learn["Class"])
#amazon_data_learn["Class"] = le.transform(amazon_data_learn["Class"])
amazon_data_learn["Class"] = amazon_data_learn["Class"].astype("category")

names_data = amazon_data_learn.loc[:, amazon_data_learn.columns.str.startswith("V")]

display("Recoded Data", amazon_data_learn)
y = amazon_data_learn["Class"]
X = pd.DataFrame(amazon_data_learn.drop(["ID", "Class"], axis=1))
X = X[names_data.columns]

#Normalize data
def normalize_values(data):
    columns = data.columns
    data = preprocessing.Normalizer().fit_transform(data)
    return pd.DataFrame(data, columns=columns)

X_norm = normalize_values(X)
test_X = normalize_values(amazon_data_test[names_data.columns])

display("Data Normalized: ", X_norm)
display("Target: ", y)
display("Test Dataset: ", amazon_data_test)
display("Test Dataset, Predictors: ", amazon_data_test[names_data.columns])
display("Test Dataset, Normalized: ", test_X)

#Calculate Model
classifier = svm.SVC(C=101, kernel='poly')
classifier.fit(X_norm, y)

#Predict the Test Data
amazon_data_test["Class"] = classifier.predict(test_X)

display("Finally recoded: ", amazon_data_test[["ID", "Class"]])
amazon_data_test[["ID", "Class"]].to_csv("solution_amazon.csv", index = False)
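
The cell above fits a `LabelEncoder` but leaves the `transform` call commented out, so the classifier is trained on the author names directly, which scikit-learn classifiers accept. If integer labels were wanted instead, the round trip could look like the sketch below. The four names are taken from the training output; `le_demo` is a hypothetical variable name.

```python
from sklearn.preprocessing import LabelEncoder

# A few author names as they appear in the training data
authors = ["Power", "Goonan", "Merritt", "Goonan"]

le_demo = LabelEncoder().fit(authors)         # classes_ are stored in sorted order
encoded = le_demo.transform(authors)          # integer codes for training
decoded = le_demo.inverse_transform(encoded)  # back to names for the submission file

print(list(le_demo.classes_))  # ['Goonan', 'Merritt', 'Power']
print(list(encoded))           # [2, 0, 1, 0]
print(list(decoded))           # ['Power', 'Goonan', 'Merritt', 'Goonan']
```

With this approach, predictions would have to be passed through `inverse_transform` before writing the CSV, since the submission needs the original class names.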