# MLE - Exercise 3 - Kaggle Competition
## Andreas Kocman (se19m024)

## Assignment
This exercise is in the form of a Kaggle competition. A few quick details on Kaggle & the competition format:

### Kaggle
* Kaggle (https://en.wikipedia.org/wiki/Kaggle) is a platform that hosts competitions on specific data sets. Participants submit their predictions on a test set, receive automated scoring on their results, and enter the leaderboard.
* From Kaggle, you will be able to obtain a labelled training set, and an unlabelled test set.
* You can submit multiple entries to Kaggle; for each entry, you need to provide details on how you achieved the results - which software and which version of the software, which operating system, which algorithms, and which parameter settings for these algorithms; further, any processing applied to the data before training/predicting. There is a specific "description" field when submitting, you should fill in this information there, and you also need to include this description and the actual submission file in your final submission to Moodle.
* To submit to Kaggle, you need to create a specific submission file, which contains the predictions you obtain on the test set. Computing an aggregated evaluation criterion is done automatically by Kaggle.
* The format of your submission is rather simple - it is a comma-separated file, where the first column is the identifier of the item that you are predicting, and the second column is the class you are predicting for that item. The first line should include a header, and it should use the names provided in the training set. An example is below:
```
ID,class
911366,B
852781,B
89524,B
857438,B
905686,B
```
* There is a limit of 7 submissions per day; finally, you also need to select your top 7 submissions to be counted in the competition
* Before you submit, you should evaluate the classifiers "locally" on your training set, i.e. by splitting that again in a training & test set (or using cross validation), to select a number of fitting algorithms & parameters. Then re-train your best models on the full local training set, and generate the predictions for the test set.
* Evaluation in Kaggle is split in two types of leaderboards - the private and public one. Here, the data is split into 50% / 50%, and as soon as you upload, you will know your results on one of these splits.
* The final results will only be visible once the competition closes, and as it is computed on a different split, might be slightly different than what you see initially (e.g. this is similar to a training/test/validation split)
* As it is a competition, there will be bonus points for the top 3 submissions.
* As reproducible science is great, there will be additional bonus points for submissions that use a notebook within the Kaggle competition (note: this was / partially still is called a "kernel" inside the Kaggle competition; Kernel obviously was a confusing term here, as it basically refers to code being executed in the environment of Kaggle itself (e.g. a jupyter notebook, or also a python or R script), and they seem to have realized that, and renamed it). see https://www.kaggle.com/notebooks or https://www.kaggle.com/getting-started/44939. You can first work locally, and then port your code to the notebook version. In Kaggle, your notebook will initially be private. Please share it with me (mayer@ifs.tuwien.ac.at), at least, though. You can also make it public at the end of the competition, to show off :-)
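
The submission format described above can be sketched with pandas as follows. This is a minimal illustration, not part of the assignment text: the helper `build_submission` and the identifiers and predictions passed to it are hypothetical, and the column names `ID` and `class` are taken from the example file shown above.

```python
import io
import pandas as pd

def build_submission(ids, predictions, id_column="ID", class_column="class"):
    """Return a two-column frame: item identifier and predicted class."""
    return pd.DataFrame({id_column: list(ids), class_column: list(predictions)})

# Hypothetical identifiers and predictions, for illustration only.
submission = build_submission([911366, 852781, 89524], ["B", "B", "M"])

# index=False keeps pandas from writing its row index as an extra first column.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Writing to a file instead of a buffer only requires passing a path to `to_csv`; the header line then carries the column names expected by Kaggle.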

### Datasets
We will use the following datasets:
* Congressional Voting: a small dataset, a good entry point for your experiments (435 instances, 16 features)
  * Kaggle page: https://www.kaggle.com/t/c04c953c596e48099d857129f53fcbdb
* Amazon reviews: a dataset with many features (10k, extracted from text), but not that many instances (~800)
  * Kaggle page: https://www.kaggle.com/t/0bd2ac297dc242478b5979d5ee772136

### Submission
The Kaggle competition will close on the day displayed in Kaggle. After that, you still have time to submit to Moodle. Your submission to Moodle shall contain:

* A brief report, containing
  * A description of the datasets, including a short analysis of the features.
  * Details on the software you used for creating your solution
  * The algorithms and parameters you tried
  * The results you obtained on the locally split training/test set
    * And a comparison to the results that you received on Kaggle - how large was the difference, did the rank of the classifiers change (i.e. the first on your training set, was it still the best on the test set on Kaggle?)
* All the code needed to obtain your results
* The solution files that you uploaded to Kaggle

# Solution
## Brief Description of Datasets

### Congressional Voting
#### Background
The classification task relates to the prediction of voting behaviour in relation to opinions on certain political questions/topics.

#### Dataset Features
The dataset contains the yes/no responses of 218 people to 16 political topics or issues, such as allowing religious groups in schools.
Features are coded as 'y' and 'n' and are in most cases relatively evenly distributed.

Missing data are labeled as "unknown".

The classes for this classification task are the voting behaviour of the respective persons, coded as "democrat" or "republican".
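
A quick way to check the distribution of answers and the amount of missing data per feature could look like the sketch below. The toy frame and its two feature columns are hypothetical stand-ins that only mimic the coding described above ('y', 'n', "unknown"); they are not the actual competition data.

```python
import pandas as pd

# Hypothetical toy rows in the same coding as the voting data.
toy = pd.DataFrame({
    "class": ["democrat", "republican", "democrat", "republican"],
    "immigration": ["y", "n", "unknown", "y"],
    "crime": ["n", "y", "n", "unknown"],
})

features = toy.drop(columns=["class"])

# Share of each answer per feature; "unknown" is counted as its own category here.
answer_shares = pd.DataFrame(
    {column: features[column].value_counts(normalize=True) for column in features.columns}
).fillna(0.0)

# Number of "unknown" entries per feature.
unknown_counts = (features == "unknown").sum()

print(answer_shares)
print(unknown_counts)
```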

### Amazon
#### Background
The classification task relates to language features of frequent authors of Amazon reviews.

#### Dataset Features
The dataset consists of 10 000 dimensions relating to language features with only 750 rows. The actual meaning of the data that can be used for prediction is unknown to the author.
The fields are integer values, which makes it likely that they represent frequencies or occurrences of specific language features.
In most cases, the distribution of these features follows a slightly skewed normal distribution (right-steep, i.e. left-skewed).

There is no missing data.

The classes for this classification task are the names of the authors of the reviews.
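
The skewness and missing-data statements above could be checked along the following lines. This is only a sketch under assumptions: the two columns `V1` and `V2` and their values are made up for illustration, and `DataFrame.skew` is used as one possible skewness measure (positive values indicate a longer right tail, negative values a longer left tail).

```python
import pandas as pd

# Hypothetical count-like columns; the real data has 10 000 such features.
toy = pd.DataFrame({
    "V1": [0, 0, 0, 1, 1, 2, 5, 9],   # long right tail
    "V2": [1, 2, 3, 4, 5, 6, 7, 8],   # symmetric
})

# Sample skewness per column.
skewness = toy.skew()

# Total number of missing cells across the whole frame.
missing_cells = int(toy.isna().sum().sum())

print(skewness)
print("missing cells:", missing_cells)
```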


## Software used
The solutions were calculated in Python with scikit-learn, using a mixture of local Jupyter notebooks (run in PyCharm) and Kaggle Notebooks.

While the Kaggle Notebooks provided a surprisingly powerful environment for data analysis and calculation of predictions,
frequent disconnections and errors when using the notebooks for longer periods of time made it necessary to also use Jupyter
locally in PyCharm.

## Approaches used
Main approaches (each evaluated with five-fold cross-validation):
* k-NN
* Naive Bayes
* Decision Tree
* Perceptron
* Regression Models (logistic regression)
* SVM

For the specific combinations of parameters tried, please see the following chapters.
The sections below describe the approach in detail, including the selection of classifiers and the preparation of the submission files; see especially the section "Calculation Functions".

## Obtained Results
### Congressional Voting
The best results were found for a Decision Tree, closely followed by an SVM solution:
```
classifier        DecisionTreeClassifier(max_depth=1, min_sample...
arguments                              max Depth: 1, min Samples: 2
mean_accuracy                                               0.96797
mean_precision                                             0.963363
mean_recall                                                0.973789
accuracy             m: 0.967970401691332 std: 0.023305252083176783
precision           m: 0.9633625730994153 std: 0.025403888401415272
recall                m: 0.9737891737891738 std: 0.0190606594685786
Name: 15, dtype: object
```
Solutions for SVM were good irrespective of the kernel used, e.g.:
```
classifier        SVC(C=100, degree=1, gamma=0.001, kernel='poly')
arguments                                  Kernel: poly, Degree: 1
mean_accuracy                                              0.96797
mean_precision                                            0.963363
mean_recall                                               0.973789
accuracy            m: 0.967970401691332 std: 0.023305252083176783
precision          m: 0.9633625730994153 std: 0.025403888401415272
recall               m: 0.9737891737891738 std: 0.0190606594685786
Name: 0, dtype: object
```
The mean accuracy of both the SVM solution and the Decision Tree classifier was close to the accuracy obtained on Kaggle after submission.

One notable finding was that the Decision Tree classifier worked with max_depth=1, which means that the response to a single question was sufficient to correctly assign about 97% of people.
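
To see which question such a depth-1 tree (a decision stump) relies on, the fitted tree can be inspected. The sketch below uses hypothetical toy data and made-up feature names, not the voting data; in it, the second column fully determines the label, so the stump should split on that column.

```python
import numpy as np
from sklearn import tree

# Hypothetical toy data: column 1 fully determines the label, column 0 is noise.
X = np.array([[0, 1], [1, 1], [0, 0], [1, 0], [0, 1], [1, 0]])
y = np.array([3, 3, 2, 2, 3, 2])
feature_names = ["feature_a", "feature_b"]

stump = tree.DecisionTreeClassifier(max_depth=1, random_state=0)
stump.fit(X, y)

# tree_.feature[0] holds the column index used at the root split.
root_feature = feature_names[stump.tree_.feature[0]]
print(root_feature)

# Text rendering of the single split and its two leaves.
print(tree.export_text(stump, feature_names=feature_names))
```

Applied to the recoded voting features, the same two calls would name the single vote the stump uses.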

For detailed results, please see the section "Overall Results for Congressional Vote"

### Amazon

For detailed results, please see the section "Overall Results for Amazon"

### Helper Functions for Solution and Data Analysis

In [1]:
# global Imports
import pandas as pd
import numpy as np

#sk learn imports
from sklearn.model_selection import train_test_split
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn import metrics
from sklearn.model_selection import cross_validate
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import make_scorer

#Data reporting
from IPython.display import display

# Global definitions:
overall_results_vote = []
overall_results_amazon = []
averaging_approach = 'macro'
zero_division_approach = 0
number_of_folds = 5
scoring = {'Accuracy': make_scorer(accuracy_score),
            'Precision': make_scorer(precision_score, average=averaging_approach, zero_division=zero_division_approach),
            'Recall': make_scorer(recall_score, average=averaging_approach, zero_division=zero_division_approach)}

# Helper functions
def parse_k_fold_results(results):
    return "m: " + str(np.average(results)) + " std: " + str(np.std(results))

def parse_argument_tuple_as_string(argumentsTuple):
    return "max Depth: " + str(argumentsTuple[0])  + \
           ", min Samples: " + str(argumentsTuple[1])

def calculate_results_holdout(classifier_used, X_train, X_test, y_train, y_test):
    classifier_used.fit(X_train, y_train)

    # predict the test set on our trained classifier
    y_test_predicted = classifier_used.predict(X_test)

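    acc = metrics.accuracy_score(y_test, y_test_predicted)
    # Use the same averaging as the cross-validation scorers; the default binary
    # setting expects a positive label of 1, which the recoded classes (2, 3) lack
    recall = metrics.recall_score(y_test, y_test_predicted, average=averaging_approach, zero_division=zero_division_approach)
    precision = metrics.precision_score(y_test, y_test_predicted, average=averaging_approach, zero_division=zero_division_approach)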

    return pd.Series({
            'classifier': str(classifier_used),
            'arguments': "",
            'accuracy':acc,
            'precision':precision,
            'recall':recall
        })

def calculate_results_cross_validate(classifier_used, description_used, data, target):
   scores = cross_validate(classifier_used, data, target,
                                scoring = scoring,
                                cv = number_of_folds,
                                error_score = 0)

   return pd.Series({
            'classifier': str(classifier_used),
            'arguments': description_used,
            'mean_accuracy': np.average(scores.get('test_Accuracy')),
            'mean_precision': np.average(scores.get('test_Precision')),
            'mean_recall': np.average(scores.get('test_Recall')),
            'accuracy': parse_k_fold_results(scores.get('test_Accuracy')),
            'precision': parse_k_fold_results(scores.get('test_Precision')),
            'recall':parse_k_fold_results(scores.get('test_Recall'))
        })

def print_results(array, column_for_max, ascending=False):
    df = pd.DataFrame(array)
    # Honour the ascending argument instead of always sorting in descending order
    df = df.sort_values(by=[column_for_max], ascending=ascending)
    display('Results', df)

    best = df.iloc[df[column_for_max].argmax()]
    display(best)

### Calculation Functions


#### k-NN Calculation

In [2]:
from sklearn import neighbors

def calculate_knn(data, target):
    knn_results = []

    n_neighbors = range(1,10,1)

    for n in n_neighbors:
        knn_classifier = neighbors.KNeighborsClassifier(n)
        description = "N = " + str(n)
        result = calculate_results_cross_validate(knn_classifier,
                                                  description,
                                                  data,
                                                  target)
        knn_results.append(result)
    return knn_results


#### Bayes Calculation

In [3]:
from sklearn import naive_bayes

def calculate_bayes(data, target):
    bayes_results = []

    alphas = np.arange(0.1,5,1)

    for alpha in alphas:
        classifier = naive_bayes.CategoricalNB(alpha = alpha)
        description = "Alpha = " + str(alpha)
        result = calculate_results_cross_validate(classifier,
                                                  description,
                                                  data,
                                                  target)
        bayes_results.append(result)

    return bayes_results

#### Perceptron Calculation

In [4]:
from sklearn import linear_model

def calculate_perceptron(data, target):
    perceptron_results=[]
    classifier = linear_model.Perceptron()
    description = "No additional args."
    result = calculate_results_cross_validate(classifier,
                                              description,
                                              data,
                                              target)
    perceptron_results.append(result)
    return perceptron_results

#### Decision Tree Calculation

In [5]:
from sklearn import tree
import itertools

def calculate_decision_tree(data, target):
    # Parameters for the decision tree
    max_depth_arguments = range(1, 10, 2)
    min_samples_leaf_arguments = [2,20,50,100]
    argumentTuples = list(itertools.product(max_depth_arguments,
                                            min_samples_leaf_arguments))
    decision_tree_results = []

    for argumentTuple in argumentTuples:
        max_depth = argumentTuple[0]
        min_samples_leaf = argumentTuple[1]

        classifier = tree.DecisionTreeClassifier(criterion = 'gini',
                                                 max_depth = max_depth,
                                                 min_samples_leaf = min_samples_leaf,
                                                 splitter = 'best')
        #result = calculate_results_holdout(classifier, X_train, X_test, y_train, y_test)
        result = calculate_results_cross_validate(classifier,
                                                  parse_argument_tuple_as_string(argumentTuple),
                                                  data,
                                                  target)
        decision_tree_results.append(result)
    return decision_tree_results

#### Logistic Regression

In [6]:
from sklearn import linear_model

def calculate_logistic_regression(data, target):
    # Note: recent scikit-learn releases expect penalty=None instead of the string "none"
    penalties = ["none", "l2"]#, "l1", "elasticnet"]
    class_weights = ["balanced"]
    solvers = ["newton-cg"]#, "lbfgs", "liblinear"]

    argumentTuples = list(itertools.product(solvers, penalties, class_weights))

    regression_results = []

    # Loop variable renamed so that it no longer shadows the built-in tuple
    for argumentTuple in argumentTuples:
        solver = argumentTuple[0]
        penalty = argumentTuple[1]
        class_weight = argumentTuple[2]
        classifier = linear_model.LogisticRegression(solver = solver, class_weight = class_weight, penalty = penalty)

        result = calculate_results_cross_validate(classifier,
                                                  solver + ", " + penalty,
                                                  data,
                                                  target)
        regression_results.append(result)
    return regression_results


#### SVM Calculation

In [7]:
from sklearn import svm
import itertools

def calculate_svm(data, target):
    kernels = ["poly", "rbf"]  # other options: "linear", "sigmoid"
    gammas = [0.001, "scale", "auto"]
    cs = [100]
    # Must be an iterable for itertools.product; use e.g. range(1, 10) for a wider search
    degrees = [1]

    argumentTuples = list(itertools.product(kernels,
                                            gammas,
                                            cs,
                                            degrees))
    svm_results = []

    for argumentTuple in argumentTuples:
        kernel = argumentTuple[0]
        gamma = argumentTuple[1]
        c = argumentTuple[2]
        degree = argumentTuple[3]

        classifier = svm.SVC(kernel = kernel, gamma = gamma, C = c, degree = degree)

        #result = calculate_results_holdout(classifier, X_train, X_test, y_train, y_test)
        result = calculate_results_cross_validate(classifier,
                                                  "Kernel: " + kernel + ", Degree: " + str(degree),
                                                  data,
                                                  target)
        svm_results.append(result)
    return svm_results

## Congressional Voting

### Preparation of the Dataset

In [17]:
#Recode values for predicting variables
def recode_voting_data(dataset):
    dataset = dataset.replace('y', 1)\
        .replace('n', 0)\
        .replace('democrat', 2)\
        .replace('republican', 3)
    dataset.loc[:, dataset.columns != "ID"] = dataset.loc[:, dataset.columns != "ID"].astype('category')
    return pd.DataFrame(dataset, columns=dataset.columns)

#Impute missing values
def input_missing_values(data):
    columns = data.columns
    imp = IterativeImputer(max_iter=10, random_state=0)
    imp.fit(data.loc[:, data.columns != "ID"])
    data.loc[:, data.columns != "ID"] = np.round(imp.transform(data.loc[:, data.columns != "ID"]))
    data.loc[:, data.columns != "ID"] = data.loc[:, data.columns != "ID"].apply(lambda x: x.astype('int'))
    data.loc[:, data.columns != "ID"] = data.loc[:, data.columns != "ID"].astype('category')
    return pd.DataFrame(data, columns=columns)

#Read Data
votingDataLearnOriginal = pd.read_csv("data/voting/CongressionalVotingID.shuf.lrn.csv", na_values='unknown')
votingDataSolutionExample = pd.read_csv("data/voting/CongressionalVotingID.shuf.sol.ex.csv", na_values='unknown')
votingDataTest = pd.read_csv("data/voting/CongressionalVotingID.shuf.tes.csv", na_values='unknown')
display("Original Data", votingDataLearnOriginal)

#Recode values
votingDataLearn = recode_voting_data(votingDataLearnOriginal)
votingDataLearn = input_missing_values(votingDataLearn)

display("Recoded Data", votingDataLearn)

display("Data: ", votingDataLearn[votingDataLearn.columns[2:18]])
display("Target: ", votingDataLearn["class"])

'Original Data'

Unnamed: 0,ID,class,handicapped-infants,water-project-cost-sharing,adoption-of-the-budget-resolution,physician-fee-freeze,el-salvador-aid,religious-groups-in-schools,anti-satellite-test-ban,aid-to-nicaraguan-contras,mx-missile,immigration,synfuels-crporation-cutback,education-spending,superfund-right-to-sue,crime,duty-free-exports,export-administration-act-south-africa
0,213,democrat,n,n,y,n,n,n,y,y,y,n,y,n,n,n,y,y
1,94,democrat,y,n,y,n,n,n,y,n,y,y,y,n,n,n,y,y
2,188,democrat,y,n,y,n,n,n,y,y,y,n,n,n,n,n,y,
3,61,democrat,y,y,y,n,n,,y,y,y,y,n,n,n,n,y,
4,184,democrat,,,,,,,,,y,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
213,250,democrat,y,n,y,n,n,n,y,y,,n,y,n,n,n,y,y
214,26,democrat,y,n,y,n,n,n,y,y,y,y,n,n,n,n,y,y
215,110,democrat,y,,y,n,n,n,y,y,y,n,n,n,n,n,y,
216,34,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,y




'Recoded Data'

Unnamed: 0,ID,class,handicapped-infants,water-project-cost-sharing,adoption-of-the-budget-resolution,physician-fee-freeze,el-salvador-aid,religious-groups-in-schools,anti-satellite-test-ban,aid-to-nicaraguan-contras,mx-missile,immigration,synfuels-crporation-cutback,education-spending,superfund-right-to-sue,crime,duty-free-exports,export-administration-act-south-africa
0,213,2,0,0,1,0,0,0,1,1,1,0,1,0,0,0,1,1
1,94,2,1,0,1,0,0,0,1,0,1,1,1,0,0,0,1,1
2,188,2,1,0,1,0,0,0,1,1,1,0,0,0,0,0,1,1
3,61,2,1,1,1,0,0,0,1,1,1,1,0,0,0,0,1,1
4,184,2,1,0,1,0,0,0,1,1,1,1,0,0,0,0,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
213,250,2,1,0,1,0,0,0,1,1,1,0,1,0,0,0,1,1
214,26,2,1,0,1,0,0,0,1,1,1,1,0,0,0,0,1,1
215,110,2,1,0,1,0,0,0,1,1,1,0,0,0,0,0,1,1
216,34,3,0,1,0,1,1,1,0,0,0,0,0,1,1,1,0,1


'Data: '

Unnamed: 0,handicapped-infants,water-project-cost-sharing,adoption-of-the-budget-resolution,physician-fee-freeze,el-salvador-aid,religious-groups-in-schools,anti-satellite-test-ban,aid-to-nicaraguan-contras,mx-missile,immigration,synfuels-crporation-cutback,education-spending,superfund-right-to-sue,crime,duty-free-exports,export-administration-act-south-africa
0,0,0,1,0,0,0,1,1,1,0,1,0,0,0,1,1
1,1,0,1,0,0,0,1,0,1,1,1,0,0,0,1,1
2,1,0,1,0,0,0,1,1,1,0,0,0,0,0,1,1
3,1,1,1,0,0,0,1,1,1,1,0,0,0,0,1,1
4,1,0,1,0,0,0,1,1,1,1,0,0,0,0,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
213,1,0,1,0,0,0,1,1,1,0,1,0,0,0,1,1
214,1,0,1,0,0,0,1,1,1,1,0,0,0,0,1,1
215,1,0,1,0,0,0,1,1,1,0,0,0,0,0,1,1
216,0,1,0,1,1,1,0,0,0,0,0,1,1,1,0,1


'Target: '

0      2
1      2
2      2
3      2
4      2
      ..
213    2
214    2
215    2
216    3
217    3
Name: class, Length: 218, dtype: category
Categories (2, int64): [2, 3]

### Data Exploration - Congressional Vote

In [28]:
import matplotlib




### k-NN - Congressional Vote

In [None]:
knn_results_vote = calculate_knn(votingDataLearn[votingDataLearn.columns[2:18]],
                                 votingDataLearn["class"])
overall_results_vote.extend(knn_results_vote)

print_results(knn_results_vote, "mean_accuracy")

### Bayes - Congressional Vote

In [None]:
bayes_results_vote = calculate_bayes(votingDataLearn[votingDataLearn.columns[2:18]],
                                     votingDataLearn["class"])
overall_results_vote.extend(bayes_results_vote)

print_results(bayes_results_vote, "mean_accuracy")

### Perceptron - Congressional Vote

In [None]:
perceptron_results_vote = calculate_perceptron(votingDataLearn[votingDataLearn.columns[2:18]],
                                               votingDataLearn["class"])
overall_results_vote.extend(perceptron_results_vote)

print_results(perceptron_results_vote, "mean_accuracy")

### Decision Tree - Congressional Vote

In [None]:
decision_tree_results_vote = calculate_decision_tree(votingDataLearn[votingDataLearn.columns[2:18]],
                                                     votingDataLearn["class"])
overall_results_vote.extend(decision_tree_results_vote)

print_results(decision_tree_results_vote, "mean_accuracy")

### SVM - Congressional Vote

In [None]:
svm_results_vote = calculate_svm(votingDataLearn[votingDataLearn.columns[2:18]],
                                 votingDataLearn["class"])
overall_results_vote.extend(svm_results_vote)

print_results(svm_results_vote, "mean_accuracy")

### Overall Results for Congressional Vote

In [None]:
print_results(overall_results_vote, "mean_accuracy")


### Train submission file

In [None]:
#Required Imports
from sklearn import svm
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
import pandas as pd
import numpy as np

#Recode values for predicting variables
def recode_voting_data(dataset):
    dataset = dataset.replace('y', 1)\
        .replace('n', 0)
    dataset[dataset.columns[1:18]] = dataset[dataset.columns[1:18]].astype('category')
    return pd.DataFrame(dataset, columns=dataset.columns)

#Impute missing values with an already fitted imputer
def input_missing_values(data, iterative_imputer):
    columns = data.columns
    # Use the imputer passed in as argument rather than the global variable imp
    data.loc[:, data.columns != "ID"] = np.round(iterative_imputer.transform(data.loc[:, data.columns != "ID"]))
    data.loc[:, data.columns != "ID"] = data.loc[:, data.columns != "ID"].apply(lambda x: x.astype('int'))
    return pd.DataFrame(data, columns=columns)

#Read Data
votingDataLearn = pd.read_csv("data/voting/CongressionalVotingID.shuf.lrn.csv", na_values='unknown')
votingDataTest = pd.read_csv("data/voting/CongressionalVotingID.shuf.tes.csv", na_values='unknown')

#Extract target variable
votingDataLearn = votingDataLearn.replace('democrat', 2)\
    .replace('republican', 3)

y = votingDataLearn["class"]
X = pd.DataFrame(votingDataLearn.drop(["ID", "class"], axis=1))

#Recode Variables
X = recode_voting_data(X)
votingDataTest = recode_voting_data(votingDataTest)

#Impute missing values
imp = IterativeImputer(max_iter=50, random_state=0)
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call
combined_data = pd.concat([X, votingDataTest])
imp.fit(combined_data.loc[:, combined_data.columns != "ID"])
X = input_missing_values(X, imp)
votingDataTest = input_missing_values(votingDataTest, imp)

#Calculate Model
classifier = svm.SVC(kernel = "rbf", gamma=0.001, C=100)
#classifier = tree.DecisionTreeClassifier(max_depth=1)
classifier.fit(X, y)

#Predict the Test Data
votingDataTest["class"] = classifier.predict(votingDataTest[votingDataTest.columns[1:18]])

#Recode to required output
# Assign the result instead of using inplace=True on a column selection
votingDataTest["class"] = votingDataTest["class"].replace({2: "democrat", 3: "republican"})
display("Finally recoded back: ", votingDataTest[["ID", "class"]])
votingDataTest[["ID", "class"]].to_csv("solution_voting.csv", index = False)

## Amazon

### Preparation of the Dataset

In [None]:
from sklearn import preprocessing
#Read Data
amazonDataLearn = pd.read_csv("data/amazon/amazon_review_ID.shuf.lrn.csv")
amazonDataSolutionExample = pd.read_csv("data/amazon/amazon_review_ID.shuf.sol.ex.csv")
amazonDataTest = pd.read_csv("data/amazon/amazon_review_ID.shuf.tes.csv")
display("Original Data", amazonDataLearn)

#Recode values
#For One Hot Encoding of Class
#amazonDataLearn = pd.concat([amazonDataLearn, pd.get_dummies(amazonDataLearn["Class"], prefix='author_',drop_first=False)], axis=1)
#amazonDataLearn.drop(['Class'],axis=1, inplace=True)
#names_target = amazonDataLearn.loc[:, amazonDataLearn.columns.str.startswith('author_')]
#amazonDataLearn[names_target.columns] = amazonDataLearn[names_target.columns].apply(lambda x: x.astype('category'))

# For Label Encoding
#le = preprocessing.LabelEncoder()
#le.fit(amazonDataLearn['Class'])
#amazonDataLearn['Class'] = le.transform(amazonDataLearn['Class'])
#amazonDataLearn['Class'] = amazonDataLearn['Class'].astype('category')

names_data = amazonDataLearn.loc[:, amazonDataLearn.columns.str.startswith('V')]
#amazonDataLearn[0:10000] = amazonDataLearn[0:10000].apply(lambda x: x.astype('int'))

# Normalize data
def normalize_values(data):
    columns = data.columns
    data = preprocessing.Normalizer().fit_transform(data)
    return pd.DataFrame(data, columns=columns)

display("Recoded Data", amazonDataLearn)

X_amazon = normalize_values(amazonDataLearn[names_data.columns])
y_amazon = amazonDataLearn["Class"]

display("Data: ", X_amazon)
display("Target: ", y_amazon)

### k-NN Calculation - Amazon

In [None]:
knn_results_amazon = calculate_knn(X_amazon,
                                   y_amazon)
overall_results_amazon.extend(knn_results_amazon)

print_results(knn_results_amazon, "mean_accuracy")

### Perceptron - Amazon

In [None]:
perceptron_results_amazon = calculate_perceptron(X_amazon,
                                                 y_amazon)
overall_results_amazon.extend(perceptron_results_amazon)

print_results(perceptron_results_amazon, "mean_accuracy")

### Decision Tree - Amazon

In [None]:
decision_tree_results_amazon = calculate_decision_tree(X_amazon,
                                                       y_amazon)
overall_results_amazon.extend(decision_tree_results_amazon)

print_results(decision_tree_results_amazon, "mean_accuracy")

### Logistic Regression - Amazon

In [None]:
regression_results_amazon = calculate_logistic_regression(X_amazon,
                                                          y_amazon)
overall_results_amazon.extend(regression_results_amazon)

print_results(regression_results_amazon, "mean_accuracy")

### SVM - Amazon

In [None]:
svm_results_amazon = calculate_svm(X_amazon,
                                   y_amazon)
overall_results_amazon.extend(svm_results_amazon)

print_results(svm_results_amazon, "mean_accuracy")

### Overall Results for Amazon

In [None]:
print_results(overall_results_amazon, "mean_accuracy")

### Prepare Submission

In [None]:
from sklearn import preprocessing
#from sklearn import svm
from sklearn import linear_model
import pandas as pd
import numpy as np
# Read Data
amazon_data_learn = pd.read_csv("data/amazon/amazon_review_ID.shuf.lrn.csv")
amazon_data_test = pd.read_csv("data/amazon/amazon_review_ID.shuf.tes.csv")

# Label Encoding
le = preprocessing.LabelEncoder()
le.fit(amazon_data_learn["Class"])
#amazon_data_learn["Class"] = le.transform(amazon_data_learn["Class"])
amazon_data_learn["Class"] = amazon_data_learn["Class"].astype("category")

names_data = amazon_data_learn.loc[:, amazon_data_learn.columns.str.startswith("V")]

display("Recoded Data", amazon_data_learn)
y = amazon_data_learn["Class"]
X = pd.DataFrame(amazon_data_learn.drop(["ID", "Class"], axis=1))
X = X[names_data.columns]

#Normalize data
def normalize_values(data):
    columns = data.columns
    data = preprocessing.Normalizer().fit_transform(data)
    return pd.DataFrame(data, columns=columns)

X_norm = normalize_values(X)
test_X = normalize_values(amazon_data_test[names_data.columns])

display("Data Normalized: ", X_norm)
display("Target: ", y)
display("Test Dataset: ", amazon_data_test)
display("Test Dataset, Predictors: ", amazon_data_test[names_data.columns])
display("Test Dataset, Normalized: ", test_X)

#Calculate Model

#classifier = svm.SVC(C=101, kernel='poly')
classifier = linear_model.LogisticRegression(solver = "newton-cg",
                                             class_weight = "balanced",
                                             penalty = "none")
classifier.fit(X_norm, y)

#Predict the Test Data
amazon_data_test["Class"] = classifier.predict(test_X)

display("Finally recoded: ", amazon_data_test[["ID", "Class"]])
amazon_data_test[["ID", "Class"]].to_csv("solution_amazon.csv", index = False)