# Assignment


## About the dataset
The data set you will be using is the same set we used last time the UCI Machine Learning DataSet repository.  
http://mlr.cs.umass.edu/ml/datasets/Balance+Scale

It is a Balance scale weight & distance database.  You have a left weight and distance and a right weight and distance.  There are three categories:
* B is balanced
* L is scale tips left
* R is scale tips right

For the purposses of this assignment, we are going to define balance, we are going to define the categories slightly differently.  We are not going to look for perfect balance, instead we are going to look for a "close" balance.  Still compute  left weight*distance and right weight * distance.  If absolute value of the difference is less than 0.5, we will tag it as balanced.  Otherwise if the left product is greater, then we have a tip left, equal is balanced, right greater is tip right.

This data set exhaustively gives examples for all 625 posible combinations of weights of 1, 2, 3, 4, 5 with distances of 1, 2, 3, 4, 5.

If generating new instances, we should restrict the weight and distance to be positive.

## Given: Get the DataSet and Cross Validate
We use a decision tree to give us a base line. We cross validate because the decision tree will overfit the data.

In [14]:
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Get the data setimport pandas as pd
balance_df = pd.read_csv("balance-scale.csv")

# Add new features
balance_df["Left-Product"]=balance_df["Left-Weight"]*balance_df["Left-Distance"]
balance_df["Right-Product"]=balance_df["Right-Weight"]*balance_df["Right-Distance"]

# Use a decision tree classifier and cross validate
from sklearn.tree import DecisionTreeClassifier
X = balance_df[["Left-Weight", 'Left-Distance', "Right-Weight", "Right-Distance"]]
y = balance_df['Class']

from sklearn.model_selection import KFold
validation_accuracy = []
validation_f1 =[]
fold_and_validate = KFold(n_splits=5, shuffle=True, random_state=145)
for train_set_indices, validation_set_indices in fold_and_validate.split(X):
    cv_train_set = X.iloc[train_set_indices]
    cv_train_target = y.iloc[train_set_indices]
    
    cv_decision_tree = DecisionTreeClassifier()
    cv_decision_tree.fit(cv_train_set, cv_train_target)
    
    cv_xvalidation = X.iloc[validation_set_indices]
    cv_y_true = y.iloc[validation_set_indices]
    cv_y_predicted = cv_decision_tree.predict(cv_xvalidation)
    
    cv_accuracy_score = accuracy_score(cv_y_true, cv_y_predicted)
    cv_f1_score = f1_score(cv_y_true, cv_y_predicted,  average="weighted")
    validation_accuracy.append(cv_accuracy_score)
    validation_f1.append(cv_f1_score)
    
print("Cross validation accuracies are: ", validation_accuracy)
print("Cross validation f1 scores  are: ", validation_f1)


# 

Cross validation accuracies are:  [0.768, 0.728, 0.808, 0.776, 0.824]
Cross validation f1 scores  are:  [0.7848807339449542, 0.7592982926012721, 0.8044112149532711, 0.7894983077528532, 0.8310622488437334]


## Given: - Create test sets 
We will create two test sets
```integer_10_test_set```  This set will have integer weights and distances randomly chosen from 1 to 10.  

```float_5_test_set```  This set will have floating point weights and distances randomly chosen from 1 to 5.

The balance function we will use is not exact balance, but will look for a close balance.  The reason for this is that if we use exact balance we will not get anything in the balanced category for the floating point set.  

We print the counts to guarantee that we have enough in the balance category.

In [15]:
import numpy as np
import pandas as pd


# A function that computes balance
def close_balance(left, right):
    if (abs(left-right) < 0.5): return "B"
    if left<right: return "R"
    return "L"

# Generate the same sequence each time
np.random.seed(20)

# create my data values for integer_10_test_set
# create 10000 instances
integer_10_leftW = np.random.randint(1, 10, 10000)     
integer_10_leftD = np.random.randint(1, 10, 10000)    
integer_10_rightW = np.random.randint(1, 10, 10000)    
integer_10_rightD = np.random.randint(1, 10, 10000)  
integer_10_leftP = integer_10_leftW * integer_10_leftD
integer_10_rightP = integer_10_rightW * integer_10_rightD

integer_10_target = [close_balance(left,right) for (left,right) in zip(integer_10_leftP, integer_10_rightP)]

# create a dictionary with each feature
d = {}
d["Class"] = integer_10_target
d["Left-Weight"] = integer_10_leftW
d["Left-Distance"] = integer_10_leftD
d["Right-Weight"] = integer_10_rightW
d["Right-Distance"] = integer_10_rightD
d["Left-Product"] = integer_10_leftP
d["Right-Product"] = integer_10_rightP

# Create the data frame from the dictionary
integer_10_test_set = pd.DataFrame(data=d)

print(integer_10_test_set["Class"].value_counts())


# create my data values for float_5_test_set
# create 10000 instances
float_5_leftW = np.random.uniform(1, 5, 10000)     
float_5_leftD = np.random.uniform(1, 5, 10000)    
float_5_rightW = np.random.uniform(1, 5, 10000)    
float_5_rightD = np.random.uniform(1, 5, 10000)  
float_5_leftP = float_5_leftW * float_5_leftD
float_5_rightP = float_5_rightW * float_5_rightD

float_5_target = [close_balance(left,right) for (left,right) in zip(float_5_leftP, float_5_rightP)]

# create a dictionary with each feature
d = {}
d["Class"] = float_5_target
d["Left-Weight"] = float_5_leftW
d["Left-Distance"] = float_5_leftD
d["Right-Weight"] = float_5_rightW
d["Right-Distance"] = float_5_rightD
d["Left-Product"] = float_5_leftP
d["Right-Product"] = float_5_rightP


# Create the data frame from the dictionary
float_5_test_set = pd.DataFrame(data=d)
print(float_5_test_set["Class"].value_counts())

      
      
      


R    4843
L    4831
B     326
Name: Class, dtype: int64
L    4704
R    4686
B     610
Name: Class, dtype: int64


## Given - Train on Balance Data Set and Test our new sets


In [16]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
# Train the classifier
X = balance_df[["Left-Weight", 'Left-Distance', "Right-Weight", "Right-Distance"]]
y = balance_df['Class']
tree_classifier = DecisionTreeClassifier()
tree_classifier.fit(X,y)

# Test it on our two test sets
print("Evaluating Decision Tree on the integers from 1 to 10 test set")
X_test_int_10 = integer_10_test_set[["Left-Weight", 'Left-Distance', "Right-Weight", "Right-Distance"]]
y_true = integer_10_test_set["Class"]

y_predicted = tree_classifier.predict(X_test_int_10)
matrix = confusion_matrix(y_true, y_predicted)
print(matrix)

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
print ("Accuracy is ", accuracy_score(y_true, y_predicted))

print ("F1 is ", f1_score(y_true, y_predicted, average="weighted"))

print()
print("Evaluating Decision Tree on the floats from 1 to 5 test set")
X_test_float_5 = float_5_test_set[["Left-Weight", 'Left-Distance', "Right-Weight", "Right-Distance"]]
y_true = float_5_test_set["Class"]

y_predicted = tree_classifier.predict(X_test_float_5)
matrix = confusion_matrix(y_true, y_predicted)
print(matrix)

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
print ("Accuracy is ", accuracy_score(y_true, y_predicted))

print ("F1 is ", f1_score(y_true, y_predicted, average="weighted"))



Evaluating Decision Tree on the integers from 1 to 10 test set
[[ 256   31   39]
 [ 691 4017  123]
 [ 702  113 4028]]
Accuracy is  0.8301
F1 is  0.8720007886612162

Evaluating Decision Tree on the floats from 1 to 5 test set
[[ 211  204  195]
 [ 330 4261  113]
 [ 336  135 4215]]
Accuracy is  0.8687
F1 is  0.87713488787475


## Part 1 - Cross validate linear SVM
Do a 5 fold cross validation on an SVC model using the four features Left-Weight, Left-Distance, Right-Weight, and Right-Distance.  Use Class as the target.


In [17]:
from sklearn.svm import SVC

validation_accuracy = []
validation_f1 =[]
fold_and_validate = KFold(n_splits=5, shuffle=True, random_state=145)
for train_set_indices, validation_set_indices in fold_and_validate.split(X):
    cv_train_set = X.iloc[train_set_indices]
    cv_train_target = y.iloc[train_set_indices]
    #print(cv_train_set)
    
    cv_svc = SVC()
    cv_svc.fit(cv_train_set, cv_train_target)
    
    cv_xvalidation = X.iloc[validation_set_indices]
    cv_y_true = y.iloc[validation_set_indices]
    cv_y_predicted = cv_svc.predict(cv_xvalidation)
    
    cv_accuracy_score = accuracy_score(cv_y_true, cv_y_predicted)
    cv_f1_score = f1_score(cv_y_true, cv_y_predicted,  average="weighted")
    validation_accuracy.append(cv_accuracy_score)
    validation_f1.append(cv_f1_score)
    
print("Cross validation accuracies are: ", validation_accuracy)
print("Cross validation f1 scores  are: ", validation_f1)
    


Cross validation accuracies are:  [0.904, 0.888, 0.912, 0.888, 0.896]
Cross validation f1 scores  are:  [0.8704297931034483, 0.8471367884451996, 0.8742356902356901, 0.8510804821963627, 0.8627300884955753]


  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)


## Part 2 - Train and test SVM
Train a linear SVC on the balance data set and then get the performance measures for the two test sets.
  

In [20]:
X = balance_df[["Left-Weight", 'Left-Distance', "Right-Weight", "Right-Distance"]]
y = balance_df['Class']
svm_classifier = SVC(kernel="linear")
svm_classifier.fit(X,y)

intX = integer_10_test_set[["Left-Weight", 'Left-Distance', "Right-Weight", "Right-Distance"]]
y_predicted = svm_classifier.predict(intX)
y = integer_10_test_set["Class"]
matrix = confusion_matrix(y, y_predicted)
print("int 10s: ")
print(matrix)


floatX = float_5_test_set[["Left-Weight", 'Left-Distance', "Right-Weight", "Right-Distance"]]
y_predicted = svm_classifier.predict(floatX)
y= float_5_test_set["Class"]
matrix = confusion_matrix(y, y_predicted)
print("float 5s: ")
print(matrix)

int 10s: 
[[ 243   46   37]
 [ 253 4393  185]
 [ 250  168 4425]]
float 5s: 
[[ 472   71   67]
 [ 611 4071   22]
 [ 646   18 4022]]


### Your comparison with previous results here:

## Part 3 - Cross validate RBF SVM
Do a 5 fold cross validation on an SVC model using the four features Left-Weight, Left-Distance, Right-Weight, and Right-Distance.  Use Class as the target.


In [10]:

validation_accuracy = []
validation_f1 =[]
fold_and_validate = KFold(n_splits=5, shuffle=True, random_state=145)
for train_set_indices, validation_set_indices in fold_and_validate.split(X):
    cv_train_set = X.iloc[train_set_indices]
    cv_train_target = y.iloc[train_set_indices]
    #print(cv_train_set)
    
    cv_svc = SVC(kernel="rbf")
    cv_svc.fit(cv_train_set, cv_train_target)
    
    cv_xvalidation = X.iloc[validation_set_indices]
    cv_y_true = y.iloc[validation_set_indices]
    cv_y_predicted = cv_svc.predict(cv_xvalidation)
    
    cv_accuracy_score = accuracy_score(cv_y_true, cv_y_predicted)
    cv_f1_score = f1_score(cv_y_true, cv_y_predicted,  average="weighted")
    validation_accuracy.append(cv_accuracy_score)
    validation_f1.append(cv_f1_score)
    
print("Cross validation accuracies are: ", validation_accuracy)
print("Cross validation f1 scores  are: ", validation_f1)
    

Cross validation accuracies are:  [0.488, 0.568, 0.536, 0.536, 0.488]
Cross validation f1 scores  are:  [0.47651963109354417, 0.5635715229163916, 0.5295685245901639, 0.5137252173913044, 0.47684266666666664]


  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)


## Part 4 - Train and test SVM
Train a rbf SVC on the balance data set and then get the performance measures for the two test sets.
  

In [5]:
X = balance_df[["Left-Weight", 'Left-Distance', "Right-Weight", "Right-Distance"]]
y = balance_df['Class']
svm_classifier = SVC(kernel="rbf")
svm_classifier.fit(X,y)


### Your comparison with previous results here:

## Part 5 - Cross validate polynomial SVM
Do a 5 fold cross validation on an SVC model using the four features Left-Weight, Left-Distance, Right-Weight, and Right-Distance.  Use Class as the target.


In [11]:

validation_accuracy = []
validation_f1 =[]
fold_and_validate = KFold(n_splits=5, shuffle=True, random_state=145)
for train_set_indices, validation_set_indices in fold_and_validate.split(X):
    cv_train_set = X.iloc[train_set_indices]
    cv_train_target = y.iloc[train_set_indices]
    #print(cv_train_set)
    
    cv_svc = SVC(kernel = "poly")
    cv_svc.fit(cv_train_set, cv_train_target)
    
    cv_xvalidation = X.iloc[validation_set_indices]
    cv_y_true = y.iloc[validation_set_indices]
    cv_y_predicted = cv_svc.predict(cv_xvalidation)
    
    cv_accuracy_score = accuracy_score(cv_y_true, cv_y_predicted)
    cv_f1_score = f1_score(cv_y_true, cv_y_predicted,  average="weighted")
    validation_accuracy.append(cv_accuracy_score)
    validation_f1.append(cv_f1_score)
    
print("Cross validation accuracies are: ", validation_accuracy)
print("Cross validation f1 scores  are: ", validation_f1)
    

  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)


Cross validation accuracies are:  [0.456, 0.504, 0.504, 0.464, 0.44]
Cross validation f1 scores  are:  [0.43287026591458505, 0.49113300752872807, 0.47073109473109476, 0.4379340875553369, 0.40051267762607967]


  'precision', 'predicted', average, warn_for)


## Part 6 - Train and test SVM
Train a polynomial SVC on the balance data set and then get the performance measures for the two test sets.
  

In [5]:
X = balance_df[["Left-Weight", 'Left-Distance', "Right-Weight", "Right-Distance"]]
y = balance_df['Class']
svm_classifier = SVC(kernel="poly")
svm_classifier.fit(X,y)

### Your comparison with previous results here:

## Part 7 - Cross validate sigmoid SVM
Do a 5 fold cross validation on an SVC model using the four features Left-Weigh7, Left-Distance, Right-Weight, and Right-Distance.  Use Class as the target.


In [12]:
validation_accuracy = []
validation_f1 =[]
fold_and_validate = KFold(n_splits=5, shuffle=True, random_state=145)
for train_set_indices, validation_set_indices in fold_and_validate.split(X):
    cv_train_set = X.iloc[train_set_indices]
    cv_train_target = y.iloc[train_set_indices]
    #print(cv_train_set)
    
    cv_svc = SVC(kernel = "sigmoid")
    cv_svc.fit(cv_train_set, cv_train_target)
    
    cv_xvalidation = X.iloc[validation_set_indices]
    cv_y_true = y.iloc[validation_set_indices]
    cv_y_predicted = cv_svc.predict(cv_xvalidation)
    
    cv_accuracy_score = accuracy_score(cv_y_true, cv_y_predicted)
    cv_f1_score = f1_score(cv_y_true, cv_y_predicted,  average="weighted")
    validation_accuracy.append(cv_accuracy_score)
    validation_f1.append(cv_f1_score)
    
print("Cross validation accuracies are: ", validation_accuracy)
print("Cross validation f1 scores  are: ", validation_f1)
    

  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)


Cross validation accuracies are:  [0.472, 0.568, 0.52, 0.48, 0.448]
Cross validation f1 scores  are:  [0.302695652173913, 0.43275454545454545, 0.35578947368421054, 0.31135135135135134, 0.2772154696132597]


## Part 8 - Train and test SVM
Train a sigmoid SVC on the balance data set and then get the performance measures for the two test sets.
  

In [5]:
X = balance_df[["Left-Weight", 'Left-Distance', "Right-Weight", "Right-Distance"]]
y = balance_df['Class']
svm_classifier = SVC(kernel="sigmoid")
svm_classifier.fit(X,y)


### Your comparison with previous results here:

## Propose an explanation for the results that you found for the SVM with the different kernels.

## Bonus
1. Use a Stochastic Gradient Descent classifier and compare the performance.
1. Use a Random Forrest classifier and compare the performance. 

In [None]:
from sklearn.linear_model import SGDClassifier
clf = SGDClassifier(loss="hinge", penalty="l2", max_iter=5)
clf.fit(X, y)  

In [None]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=100, max_depth=2,random_state=0)
rfc.fit(X,y)