## Support Vector Machines

The data is not in an ideal form: missing values and unordered values make it difficult to simply
plug the columns into a label encoder. Since there are also a considerable number of features (~20) and relatively
few examples (700-800), it makes sense to test SVMs with different kernels and inspect their performance.

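Margin maximization, together with the regularization parameter, should help keep variance in check on a dataset this small, while the choice of kernel controls bias implicitly.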

In [1]:
# Load scripts to clean and generate data
# noinspection PyUnresolvedReferences
from auxiliary.data_clean2 import clean_data
import pandas as pd
import numpy as np

data = pd.read_csv('dataset/GSMArena_dataset_2020.csv', index_col=0)

data_features = data[["oem", "launch_announced", "launch_status", "body_dimensions", "display_size", "comms_wlan", "comms_usb",
                "features_sensors", "platform_os", "platform_cpu", "platform_gpu", "memory_internal",
                "main_camera_single", "main_camera_video", "misc_price",
                "selfie_camera_video",
                "selfie_camera_single", "battery"]]

# Clean up the data into a trainable form.
df = clean_data(data_features)


key_index
1        None
2        None
3        46.3
4        43.7
5        81.3
         ... 
10675    36.1
10676    26.1
10677    26.1
10678    26.1
10679    None
Name: scn_bdy_ratio, Length: 10679, dtype: object key_index
1        None
2         3.5
3         3.2
4         2.8
5         6.3
         ... 
10675     2.4
10676     2.0
10677     2.0
10678     2.0
10679    None
Name: screen_size, Length: 10679, dtype: object
Dimensions of imputed df 10679 16
DF has been output to imputed_df.csv
RangeIndex(start=0, stop=1035, step=1)


#### Learning the SVM

Using scikit-learn, it is possible to plug in the data and fit a model.
Most of the kernel functions will be tested with 4-fold cross-validation.
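As a rough illustration of what 4-fold cross-validation over several kernels could look like, the sketch below uses scikit-learn's `cross_val_score`. The data here is synthetic (`X_demo`, `y_demo` from `make_classification`) and only stands in for the cleaned phone dataset; the names and the choice of `gamma="scale"` are assumptions, not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the cleaned feature matrix and 3-class price labels.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=16, n_informative=6, n_classes=3, random_state=0
)

cv_scores = {}
for kernel in ("linear", "poly", "rbf", "sigmoid"):
    # Scaling lives inside the pipeline so it is refit on each training fold.
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel, gamma="scale"))
    cv_scores[kernel] = cross_val_score(model, X_demo, y_demo, cv=4)
    print(kernel, "mean accuracy:", cv_scores[kernel].mean())
```

Keeping the scaler inside the pipeline avoids leaking statistics from the validation fold into training.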

In [8]:
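def y_classify_five(y):
    """Map a price to one of five classes (0 = cheapest, 4 = most expensive)."""
    if y > 1000:
        return 4
    elif y > 700:
        return 3
    elif y > 450:
        return 2
    elif y > 200:
        return 1

    return 0


def y_classify(y):
    """Map a price to one of three classes: below 300, 300 to 700, above 700."""
    if y > 700:
        return 2
    elif y >= 300:
        return 1

    return 0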
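
The five-class thresholds can also be expressed in vectorized form with `pandas.cut`, which may be convenient for checking how many phones fall into each bucket. This is only a sketch on made-up prices (`prices_demo` is hypothetical); it assumes there are no missing prices, since `astype(int)` fails on NaN, whereas `y_classify_five` maps NaN to class 0. The three-class function is not shown because it includes 300 in the middle class, so its intervals are not uniformly right-closed.

```python
import numpy as np
import pandas as pd

# Made-up prices chosen to sit on and around the class boundaries.
prices_demo = pd.Series([150, 200, 201, 450, 700, 700.5, 1000, 1500])

# Right-closed bins: (-inf, 200], (200, 450], (450, 700], (700, 1000], (1000, inf)
bins = [-np.inf, 200, 450, 700, 1000, np.inf]
classes_demo = pd.cut(prices_demo, bins=bins, labels=[0, 1, 2, 3, 4]).astype(int)

print(classes_demo.tolist())
print(classes_demo.value_counts().sort_index())
```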


In [13]:
# Now it's time to split the data
from sklearn.model_selection import train_test_split

y = df["misc_price"]
y3 = y.apply(y_classify)
X = df.drop(["key_index", "misc_price"], axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y3, random_state=120, test_size=.3)

y5 = y.apply(y_classify_five)
X_train5, X_test5, y_train5, y_test5 = train_test_split(X, y5, random_state=120, test_size=.3)

In [14]:
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

svm_clf = make_pipeline(StandardScaler(), SVC(gamma='auto'))
svm_clf.fit(X_train,y_train)

y_pred = svm_clf.predict(X_test)
print("3 class accuracy: ", accuracy_score(y_test,y_pred))

svm_clf.fit(X_train5,y_train5)

y_pred5 = svm_clf.predict(X_test5)
print("5 class accuracy: ", accuracy_score(y_test5,y_pred5))



0.7234726688102894
0.6752411575562701


#### Analyzing the model and results

As seen, we have fitted a preliminary SVM model to the training data.
Using matplotlib, it is possible to visualize the model & preliminary performance.


In [5]:
# matplotlib
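
One possible starting point for this cell is a confusion matrix, which shows which price classes get confused with each other. The sketch below is an assumption about what the visualization could look like, not the author's plot; `y_true_demo` and `y_pred_demo` are made-up labels standing in for `y_test` and `y_pred`.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

# Hypothetical stand-ins for y_test and y_pred from the three-class split.
y_true_demo = np.array([0, 0, 1, 1, 2, 2, 1, 0])
y_pred_demo = np.array([0, 1, 1, 1, 2, 1, 1, 0])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1, 2])

fig, ax = plt.subplots(figsize=(4, 4))
im = ax.imshow(cm, cmap="Blues")
ax.set_xticks([0, 1, 2])
ax.set_yticks([0, 1, 2])
ax.set_xlabel("Predicted class")
ax.set_ylabel("True class")
ax.set_title("Confusion matrix (demo labels)")

# Write the count into each cell.
for i in range(cm.shape[0]):
    for j in range(cm.shape[1]):
        ax.text(j, i, str(cm[i, j]), ha="center", va="center")

fig.colorbar(im, ax=ax)
```

Replacing the demo arrays with `y_test` and `y_pred` (or `y_test5` and `y_pred5` with five labels) would show the real model's errors.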



#### Cross-Validation & Performance Tuning

We now implement our own SVM using the dual Lagrangian formulation with hinge loss. We then test each of the candidate kernel mappings: linear, polynomial, Euclidean (distance-based, e.g. RBF), and sigmoid.
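
For reference, a common way to write this setup is given below. The notation is an assumption introduced here rather than taken from the notebook: training pairs $(x_i, y_i)$ with $y_i \in \{-1, +1\}$, feature map $\phi$ with kernel $K(x, z) = \phi(x)^\top \phi(z)$, and regularization constant $C$. The kernel parameters $\gamma$, $r$ and $d$ follow scikit-learn's naming, and the Gaussian (RBF) form is shown on the assumption that it is the distance-based kernel meant above.

$$
\min_{w,\,b}\ \frac{1}{2}\lVert w \rVert^2 + C \sum_{i} \max\bigl(0,\ 1 - y_i\,(w^\top \phi(x_i) + b)\bigr)
$$

$$
\max_{\alpha}\ \sum_{i} \alpha_i - \frac{1}{2} \sum_{i}\sum_{j} \alpha_i \alpha_j\, y_i y_j\, K(x_i, x_j)
\quad \text{s.t.} \quad 0 \le \alpha_i \le C,\quad \sum_{i} \alpha_i y_i = 0
$$

$$
\begin{aligned}
K_{\text{linear}}(x, z) &= x^\top z \\
K_{\text{poly}}(x, z) &= (\gamma\, x^\top z + r)^d \\
K_{\text{rbf}}(x, z) &= \exp\bigl(-\gamma \lVert x - z \rVert^2\bigr) \\
K_{\text{sigmoid}}(x, z) &= \tanh(\gamma\, x^\top z + r)
\end{aligned}
$$

The first expression is the hinge-loss primal, the second is its dual, and predictions take the sign of $\sum_i \alpha_i y_i K(x_i, x) + b$.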

As one may see, the preliminary performance is a considerable improvement over the LR model.

By tuning some more parameters & using different kernel functions, it may be possible to further increase the training & testing
performance.
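
A minimal sketch of such tuning with scikit-learn's `GridSearchCV` is shown below. The grid values are arbitrary examples and the data is synthetic (`X_tune`, `y_tune` are hypothetical stand-ins for the training split), so the printed numbers say nothing about the phone dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for X_train / y_train.
X_tune, y_tune = make_classification(
    n_samples=200, n_features=16, n_informative=6, n_classes=3, random_state=1
)

# make_pipeline names the SVC step "svc", hence the "svc__" parameter prefix.
param_grid = {
    "svc__C": [0.1, 1, 10],
    "svc__kernel": ["linear", "rbf"],
    "svc__gamma": ["scale", "auto"],
}

search = GridSearchCV(make_pipeline(StandardScaler(), SVC()), param_grid, cv=4)
search.fit(X_tune, y_tune)

print("best parameters:", search.best_params_)
print("best 4-fold accuracy:", search.best_score_)
```

The score reported here is a cross-validated estimate on the training data; the held-out test split should still be used for the final comparison.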


In [16]:
# SVM class & function defs

class HyperSVM:
    """
    A support-vector machine with multiple kernel mappings for high 
    dimensions & hinge loss.
    """
    def __init__(self):
        pass

    def fit(self, X, y):
        pass

    def predict(self, X):
        pass

    def performance(self):
        pass
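

The stub above is left unimplemented. One way it could be filled in is sketched below, under several loudly stated assumptions: `DualSVMSketch` and `rbf_kernel` are hypothetical names, labels are binary in {-1, +1}, the bias is folded into the kernel by adding a constant 1 (which removes the equality constraint from the dual but also regularizes the bias, so it differs slightly from the textbook SVM), and the box-constrained dual is solved with plain projected gradient ascent rather than SMO. The multi-class price labels would additionally need a one-vs-rest wrapper.

```python
import numpy as np


def rbf_kernel(A, B, gamma=0.5):
    """Gaussian (RBF) kernel exp(-gamma * ||a - b||^2) for all row pairs."""
    sq_dists = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


class DualSVMSketch:
    """Binary kernel SVM trained on the box-constrained dual (illustrative only)."""

    def __init__(self, kernel=rbf_kernel, C=1.0, n_iter=2000):
        self.kernel = kernel
        self.C = C
        self.n_iter = n_iter

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)  # expected values: -1 or +1
        K = self.kernel(X, X) + 1.0  # the +1 absorbs the bias term
        Q = (y[:, None] * y[None, :]) * K
        # Step size 1/L, where L is the largest eigenvalue of Q.
        step = 1.0 / np.linalg.eigvalsh(Q)[-1]
        alpha = np.zeros(len(y))
        for _ in range(self.n_iter):
            grad = 1.0 - Q @ alpha  # gradient of the dual objective
            alpha = np.clip(alpha + step * grad, 0.0, self.C)  # project onto [0, C]
        self.alpha_ = alpha
        self.X_ = X
        self.coef_ = alpha * y
        return self

    def decision_function(self, X):
        X = np.asarray(X, dtype=float)
        return (self.kernel(X, self.X_) + 1.0) @ self.coef_

    def predict(self, X):
        return np.where(self.decision_function(X) >= 0.0, 1, -1)


# Toy usage on two well-separated clusters.
X_toy = np.array([[0, 0], [0, 1], [1, 0], [4, 4], [4, 5], [5, 4]], dtype=float)
y_toy = np.array([-1, -1, -1, 1, 1, 1])
clf = DualSVMSketch().fit(X_toy, y_toy)
print(clf.predict(np.array([[0.5, 0.5], [4.5, 4.5]])))
```

Passing a different callable as `kernel` (any function returning the Gram matrix between two sets of rows) is how other kernel mappings could be swapped in.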



In [6]:
# Kernel function [1]

# 4-F Cross-Validation on [1]

# Performance Insights



# Kernel function [2]

# 4-F Cross-Validation on [2]

# Performance Insights



# Kernel function [3]

# 4-F Cross-Validation on [3]

# Performance Insights



#### Plots & Analysis of different Kernel methods

[Write Here]


