# SUPPORT VECTOR MACHINE

#### About Dataset:
1. Title of Database: Abalone data
2. Number of Instances: 4177
3. Number of Attributes: 8
4. Attribute information:

For each attribute, the table gives the name, data type, measurement unit, and a brief description.

| Name | Data Type | Meas. | Description |
| ---- | --------- | ----- | ----------- |
| Sex | nominal | | M, F, and I (infant) |
| Length | continuous | mm | Longest shell measurement |
| Diameter | continuous | mm | perpendicular to length |
| Height | continuous | mm | with meat in shell |
| Whole weight | continuous | grams | whole abalone |
| Shucked weight | continuous | grams | weight of meat |
| Viscera weight | continuous | grams | gut weight (after bleeding) |
| Shell weight | continuous | grams | after being dried |
| Rings | integer | | +1.5 gives the age in years |


**Aim: From the above description of the dataset, predict the Sex of abalone using an SVM classifier.**

## Loading Libraries and dataset

In [2]:
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn import svm
from sklearn.metrics import confusion_matrix,accuracy_score


In [3]:
df = pd.read_csv("abalone.csv")
df.head(10)

Unnamed: 0,Sex,Length,Diameter,Height,Whole weight,Shucked weight,Viscera weight,Shell weight,Rings
0,M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15
1,M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7
2,F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9
3,M,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10
4,I,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7
5,I,0.425,0.3,0.095,0.3515,0.141,0.0775,0.12,8
6,F,0.53,0.415,0.15,0.7775,0.237,0.1415,0.33,20
7,F,0.545,0.425,0.125,0.768,0.294,0.1495,0.26,16
8,M,0.475,0.37,0.125,0.5095,0.2165,0.1125,0.165,9
9,F,0.55,0.44,0.15,0.8945,0.3145,0.151,0.32,19
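The preview shows an `Unnamed: 0` column, which pandas typically creates when a CSV was saved with its row index. In this notebook that column stays in `df` and is later passed to the model as a feature. The sketch below shows one way to read such a file so the first column becomes the index instead. It uses three rows copied from the preview and assumes the first header cell in the file is empty; `csv_text` and `df_demo` are illustrative names, and the results reported in this notebook were produced with the column included.

```python
import io

import pandas as pd

# Three rows copied from the preview above. The empty first header cell is an
# assumption about the file; pandas would name that column "Unnamed: 0".
csv_text = """,Sex,Length,Diameter,Height,Whole weight,Shucked weight,Viscera weight,Shell weight,Rings
0,M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15
1,M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7
2,F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9
"""

# index_col=0 tells pandas to use the first column as the row index,
# so it is not treated as a feature column.
df_demo = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(df_demo.columns.tolist())
```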


In [4]:
y = df["Sex"]
del df["Sex"]

In [32]:
x = df
x.head(10)

Unnamed: 0,Length,Diameter,Height,Whole weight,Shucked weight,Viscera weight,Shell weight,Rings
0,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15
1,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7
2,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9
3,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10
4,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7
5,0.425,0.3,0.095,0.3515,0.141,0.0775,0.12,8
6,0.53,0.415,0.15,0.7775,0.237,0.1415,0.33,20
7,0.545,0.425,0.125,0.768,0.294,0.1495,0.26,16
8,0.475,0.37,0.125,0.5095,0.2165,0.1125,0.165,9
9,0.55,0.44,0.15,0.8945,0.3145,0.151,0.32,19


In [33]:
df.Rings.value_counts()[:5]

9     689
10    634
8     568
11    487
7     391
Name: Rings, dtype: int64

In [34]:
# Since this is a supervised learning algorithm, we split the data into training and test sets
from sklearn.model_selection import train_test_split

In [35]:
xtr,xts,ytr,yts = train_test_split(x,y,test_size = 0.2,random_state=123)
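The split above is purely random. For a three-class target, a stratified split keeps the class proportions the same in the training and test sets. The following is a minimal sketch on synthetic data, not the abalone file; `X_demo`, `y_demo`, and the class counts are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the abalone features and the Sex labels (assumption).
rng = np.random.default_rng(123)
X_demo = rng.normal(size=(300, 4))
y_demo = np.array(["M"] * 120 + ["F"] * 90 + ["I"] * 90)

# stratify=y_demo preserves the 40/30/30 class mix in both subsets.
X_tr, X_ts, y_tr, y_ts = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=123, stratify=y_demo
)

labels, counts = np.unique(y_ts, return_counts=True)
test_counts = dict(zip(labels.tolist(), counts.tolist()))
print(test_counts)
```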

In [36]:
clf= svm.SVC()
model = clf.fit(xtr,ytr)
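The model above is fitted on unscaled features. An RBF-kernel SVM is sensitive to feature scale, and here the columns range from fractions (weights) to integers in the tens (`Rings`). A common remedy is to standardize features inside a pipeline. The sketch below uses synthetic data with one deliberately inflated column; it is an illustration under those assumptions and was not run on the abalone data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic three-class problem standing in for the abalone data (assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=123
)
X_demo[:, 0] *= 1000  # mimic one column on a much larger scale than the rest

X_tr, X_ts, y_tr, y_ts = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=123
)

# The scaler is fitted on the training data only, then applied before the SVC.
scaled_svc = make_pipeline(StandardScaler(), SVC())
scaled_svc.fit(X_tr, y_tr)
scaled_acc = scaled_svc.score(X_ts, y_ts)
print(f"accuracy with scaling: {scaled_acc:.3f}")
```

Putting the scaler in the pipeline also means that any later cross-validation or grid search refits it on each training fold, which avoids leaking test-fold statistics.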


In [37]:
pred1 = clf.predict(xts)

In [38]:
accuracy_score(yts,pred1)

0.56698564593301437

In [39]:
confusion_matrix(yts,pred1)

array([[ 61,  26, 155],
       [ 15, 216,  49],
       [ 64,  53, 197]])
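The confusion matrix says more than the single accuracy number. The sketch below recomputes accuracy and per-class recall from the matrix printed above. It assumes the rows and columns follow scikit-learn's default sorted label order (`F`, `I`, `M`); the `labels` list is that assumption, not something shown in the output.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true, columns: predicted).
cm = np.array([[61, 26, 155],
               [15, 216, 49],
               [64, 53, 197]])
labels = ["F", "I", "M"]  # assumed sorted label order

# Accuracy is the share of samples on the diagonal.
accuracy = np.trace(cm) / cm.sum()

# Recall per class: correct predictions divided by the row total.
recall = cm.diagonal() / cm.sum(axis=1)

for label, r in zip(labels, recall):
    print(f"{label}: recall {r:.3f}")
print(f"accuracy: {accuracy:.4f}")
```

Under that label order, infants are recovered far more often (about 0.77) than females (about 0.25), and most misclassified females are predicted as male. The model mainly separates infants from adults.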

In [40]:
# Now tune the SVC hyperparameters with a grid search
from sklearn.model_selection import GridSearchCV

In [41]:
clf.get_params()

{'C': 1.0,
 'cache_size': 200,
 'class_weight': None,
 'coef0': 0.0,
 'decision_function_shape': None,
 'degree': 3,
 'gamma': 'auto',
 'kernel': 'rbf',
 'max_iter': -1,
 'probability': False,
 'random_state': None,
 'shrinking': True,
 'tol': 0.001,
 'verbose': False}

In [42]:
grid_param1 = {'C': [1.0,100,1000],
              'kernel':['rbf','poly','linear']
             }

In [43]:
grd = GridSearchCV(estimator=clf,param_grid=grid_param1)

In [44]:
grd.fit(xtr,ytr)  # this can take a long time

GridSearchCV(cv=None, error_score='raise',
       estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'C': [1.0, 100, 1000], 'kernel': ['rbf', 'poly', 'linear']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)
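After fitting, a `GridSearchCV` object exposes which combination won and its mean cross-validated score, which the notebook does not print. The following is a minimal sketch on synthetic data with a smaller grid so it runs quickly; `X_demo`, `y_demo`, `search`, and the explicit `cv=3` are choices made for this illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Small synthetic three-class problem (assumption, not the abalone data).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=4, n_classes=3, random_state=123
)

# Reduced grid so the example finishes quickly.
param_grid = {"C": [1.0, 100], "kernel": ["rbf", "linear"]}
search = GridSearchCV(estimator=SVC(), param_grid=param_grid, cv=3)
search.fit(X_demo, y_demo)

# best_params_ is the winning combination; best_score_ is its mean CV accuracy.
print(search.best_params_)
print(round(search.best_score_, 3))
```

Because `refit=True` is the default, calling `predict` on the fitted search object uses the best estimator retrained on all the data passed to `fit`.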

In [45]:
pred2 = grd.predict(xts)

In [46]:
accuracy_score(yts,pred2)

0.57535885167464118

In [47]:
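# Note: this call passes pred1 (the default SVC), so the matrix shown repeats In [39];
# pass pred2 instead to evaluate the grid-searched model.
confusion_matrix(yts,pred1)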

array([[ 61,  26, 155],
       [ 15, 216,  49],
       [ 64,  53, 197]])

In [None]:
# comparing with random forest 

In [26]:
from sklearn.ensemble import RandomForestClassifier

In [49]:
rf = RandomForestClassifier(n_estimators=100)
rf.fit(xtr,ytr)
pred2 = rf.predict(xts)  # note: this overwrites the grid-search predictions stored in pred2

In [50]:
#evaluation
confusion_matrix(yts,pred2)

array([[104,  34, 104],
       [ 29, 213,  38],
       [123,  44, 147]])

In [51]:
print(accuracy_score(yts,pred2))

0.555023923445


**On this single train/test split, SVM (accuracy 0.567 with default parameters, 0.575 after grid search) scored slightly higher than the Random Forest classifier (0.555).**
**More parameters could be tuned in the grid search; I haven't done so due to system constraints.**
**SVM with hyperparameter tuning gave the best result of the models tried here.**
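The comparison above rests on one train/test split, and the accuracies differ by only about two percentage points. `cross_val_score`, imported at the top but not used, gives a mean and spread over several folds, which is a steadier basis for comparing models. The sketch below shows the pattern on synthetic data; the `models` dictionary and `cv=5` are illustrative choices, and no abalone results are implied.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic three-class problem standing in for the abalone data (assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=123
)

models = {
    "SVC": SVC(),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=123),
}

# Store the mean and standard deviation of 5-fold accuracy for each model.
cv_results = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5)
    cv_results[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

If the gap between two means is smaller than the fold-to-fold spread, the ranking of the models should be treated as inconclusive.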