## Problem
Most sectors of the economy are adopting data science in their practices and seeing improved performance and efficiency as a result, and the education sector is no exception. 
Schools in particular generate large amounts of data, and data science techniques can put that data to use: for example, to improve student performance and the learning experience, and to foster data-driven decision making among school managers. 

## Brief on Support Vector Machines
  
Support Vector Machines (SVMs) are a set of supervised machine learning algorithms used to solve classification and regression problems.   

SVMs can be further categorized into two types:  
* SVR (Support Vector Regression): for solving regression problems. 
* SVC (Support Vector Classification): for solving classification problems.  

SVC performs both linear and non-linear classification. 

Linear classification: 

Say you have labelled data whose target has two groups (male and female, spam and not-spam). An SVM looks for a line (more generally, a hyperplane) that separates the two labels from each other. The optimal hyperplane is the one for which the closest points from group A and group B are as far as possible from the hyperplane (the decision boundary), so that the groups are as distinct as possible.   
It's called a decision boundary because it is used to decide whether a point falls in group A or group B.  
This only works when the two groups are linearly separable. 
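
As a minimal sketch of the idea, the snippet below fits a linear SVC on six made-up 2D points (the points and the large `C` value are assumptions chosen for illustration, not part of the student dataset). It prints the support vectors, which are the closest points that define the margin, and the margin width computed from the learned weights.

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable groups in 2D (made-up toy points)
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],   # group A
              [5.0, 5.0], [5.5, 6.0], [6.0, 5.0]])  # group B
y = np.array([0, 0, 0, 1, 1, 1])

# A large C penalizes margin violations heavily (close to a hard margin)
clf = SVC(kernel="linear", C=1000)
clf.fit(X, y)

w = clf.coef_[0]                 # normal vector of the separating hyperplane
b = clf.intercept_[0]            # offset of the hyperplane
margin = 2 / np.linalg.norm(w)   # distance between the two margin lines

preds = clf.predict([[2, 2], [5, 4]])

print("support vectors:\n", clf.support_vectors_)
print("margin width:", margin)
print("predictions for (2, 2) and (5, 4):", preds)
```

Only the support vectors influence where the boundary ends up; moving any of the other points slightly would leave the hyperplane unchanged.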

There are cases where the groupings are not linearly separable. 

Enter Non-linear classification: 
Non-linear classification is performed using a kernel function. 
When the dataset can only be separated by a non-linear boundary, a kernel is used in the SVM to implicitly transform the feature space. 

A kernel is a function that compares data points as if they had been mapped into a higher-dimensional feature space, where the data may become linearly separable.  

Kernel functions:  
* Linear
* Polynomial 
* Gaussian Radial Basis Function  
* Sigmoid  

The radial basis function (RBF) kernel is the one most commonly used for non-linear problems. 
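
To make the difference concrete, here is a small sketch that compares a linear kernel and an RBF kernel on synthetic concentric circles, a dataset that no straight line can separate. The data comes from scikit-learn's `make_circles`; the sample size, noise level and random seeds are assumptions picked for illustration.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data: one class forms an inner circle, the other an outer ring
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit the same classifier with two different kernels and compare test accuracy
scores = {}
for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel).fit(X_train, y_train)
    scores[kernel] = clf.score(X_test, y_test)
    print(kernel, "test accuracy:", scores[kernel])
```

On data like this the RBF kernel should score clearly higher, because the linear kernel can only draw a straight boundary through the rings.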

SVMs are commonly used in text classification problems.  

Pros of SVMs: 
* They can find complex relationships in your data without requiring many manual transformations. 
* They work well when there are more features than samples. 
* They are memory efficient at prediction time, since the decision function uses only a subset of the training points (the support vectors). 

Cons of SVMs: 
* SVMs don't provide probability estimates directly; these have to be computed separately. 
* They work best on small to medium-sized datasets. 
* Training can be slow and memory consuming when processing a huge volume of data. 
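
On the probability point, one possible way to get estimates is to calibrate the SVM's raw scores. The sketch below wraps an SVC in scikit-learn's `CalibratedClassifierCV` with sigmoid (Platt) scaling; the synthetic dataset and the choice of 5 folds are assumptions for illustration only.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic binary classification data (not the student dataset)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# A plain SVC only exposes signed distances to the decision boundary
svc = SVC(kernel="linear").fit(X, y)
distances = svc.decision_function(X[:3])

# Calibration fits a sigmoid on top of those distances using cross-validation
calibrated = CalibratedClassifierCV(SVC(kernel="linear"), method="sigmoid", cv=5)
calibrated.fit(X, y)
proba = calibrated.predict_proba(X[:3])

print("distances to the boundary:", distances)
print("calibrated probabilities:\n", proba)
```

The extra cross-validation is what makes probability estimates comparatively expensive for SVMs.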


What's covered in this article: 
* Data import  
* Data inspection  
* Data munging and preprocessing  
* Data partitioning  
* Modelling  
* Predicting  
* Model evaluation  


In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder,OneHotEncoder,StandardScaler,OrdinalEncoder
from sklearn.compose import make_column_transformer, ColumnTransformer
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.impute import SimpleImputer


In [None]:
# Import the data
svc_data = pd.read_csv("../data/student_performance.csv")

In [None]:
# Data inspection
svc_data.info()

In [None]:
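# Data transformation
## Encode the categorical variables as numbers.
## Note: the result is stored in svc_data1; the cells below keep working on svc_data,
## where the categorical columns are encoded by the preprocessing pipeline.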

svc_data1 = svc_data.replace({'address':{'R':0,'U':1},
                            'famsize':{'LE3':0,'GT3':1},
                            'Pstatus':{'T':0,'A':1},
                            'Mjob':{'teacher':0,'at_home':1,'services':2,'other':3,'health':4},
                            'Fjob':{'teacher':0,'at_home':1,'services':2,'other':3,'health':4},
                            'guardian':{'mother':0,'father':1,'other':2},
                            'schoolsup':{'yes':1,'no':0},
                            'famsup':{'yes':1,'no':0},
                            'paid':{'no':0,'yes':1},
                            'activities':{'no':0,'yes':1},
                            'nursery':{'no':0,'yes':1},
                            'higher':{'yes':1,'no':0},
                            'internet':{'no':0,'yes':1},
                            'romantic':{'no':0,'yes':1},
                            'sex':{'F':0,'M':1}
                            })  # exact-match replacement; regex matching is not needed here

 Since we want to build a classification model, we'll split the final marks into two groups: a mark below 10 is a fail (0) and a mark of 10 or above is a pass (1). 
 To achieve that, we'll use a conditional replace with the help of the NumPy function np.where. 

In [None]:
svc_data['G3'] = np.where(svc_data['G3']>=10,1,0)
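
As a quick illustration of how np.where behaves at the threshold, here is a sketch on a tiny made-up frame (the `grades` frame and the `passed` column are hypothetical names, not part of the dataset):

```python
import numpy as np
import pandas as pd

# Made-up final marks, including the boundary value 10
grades = pd.DataFrame({"G3": [0, 9, 10, 11, 20]})

# np.where(condition, value_if_true, value_if_false), applied element-wise
grades["passed"] = np.where(grades["G3"] >= 10, 1, 0)

print(grades)
# A mark of exactly 10 is labelled 1 (pass) because the comparison is >=
```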

Some columns are not needed, so we'll drop them.

In [None]:
svc_data.drop(['G1','reason','school'], axis=1,inplace=True)

Data partitioning

In [None]:
## response variable
y = svc_data['G3']
## predictor variables
X = svc_data.drop('G3',axis=1)

# Split the data into training and testing sets
x_train,x_test,y_train,y_test = train_test_split(X,y, test_size = 0.2) 
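
The split above is random, so results change from run to run. As an optional sketch, `train_test_split` also accepts `random_state` (for reproducibility) and `stratify` (to keep the pass/fail ratio the same in both sets). The toy arrays and the seed value below are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 samples, 15 passes (1) and 5 fails (0)
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([1] * 15 + [0] * 5)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.2,
    random_state=42,   # fixed seed so the split is the same on every run
    stratify=y_toy,    # preserve the 75/25 class ratio in both sets
)

print("test labels:", y_te)
print("share of passes in the test set:", y_te.mean())
```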

 Separate the categorical columns from the numeric ones. Further separate the categorical columns into those 
that need to be one-hot encoded and those that need to be ordinal encoded.

In [None]:
num_cols = x_train.select_dtypes(exclude=['object']).columns.tolist() # extract numeric columns
cat_cols = x_train.select_dtypes(include=['object']).columns.tolist() # extract categorical columns 
ord_cols = ['Medu','Fedu','traveltime','studytime','famrel','freetime','goout','Dalc','Walc','health'] # categorical columns with order

Next we'll drop ordinal columns from the numeric columns list

In [None]:
# drop ordinal columns from the numeric columns list
num_cols = [value for value in num_cols if value not in ord_cols]
# also drop the unnamed index column ("Unnamed: 0"), if present
num_cols = [value for value in num_cols if value != "Unnamed: 0"]

Make pipeline for numerical columns

In [None]:
num_pipe = make_pipeline(
    SimpleImputer(strategy='median'),
    StandardScaler()
)

Make pipeline for categorical columns

In [None]:
cat_pipe = make_pipeline(
    SimpleImputer(strategy='constant',fill_value='N/A'),
    OneHotEncoder(handle_unknown='ignore',sparse_output=False)  # on scikit-learn < 1.2, use sparse=False instead
)

Make pipeline for ordinal columns

In [None]:
ord_pipe = make_pipeline(
    OrdinalEncoder()
)

Combine the pipelines

In [None]:
full_pipeline = ColumnTransformer([
    ('num',num_pipe,num_cols),
    ('cat',cat_pipe,cat_cols),
    ('ord',ord_pipe,ord_cols)
])
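
To see what a ColumnTransformer like this produces, here is a small self-contained sketch on a made-up frame with one numeric, one categorical and one ordinal column. The `toy` frame is hypothetical, and `sparse_threshold=0` is set only so that the output is always a dense array that is easy to print.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Made-up frame: one numeric, one categorical and one ordinal column
toy = pd.DataFrame({
    "age": [15, 16, 17, 18],
    "sex": ["F", "M", "F", "M"],
    "studytime": [1, 2, 3, 4],
})

pre = ColumnTransformer([
    ("num", StandardScaler(), ["age"]),                         # 1 output column
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["sex"]),   # 2 output columns (F, M)
    ("ord", OrdinalEncoder(), ["studytime"]),                   # 1 output column
], sparse_threshold=0)  # always return a dense array

out = pre.fit_transform(toy)
print(out.shape)  # (4, 4)
print(out)
```

Each transformer only sees the columns listed for it, and the results are concatenated side by side in the order the transformers are declared.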

Make the full pipeline

In [None]:
svc_model = make_pipeline(full_pipeline, SVC(kernel='linear'))

Train the model

In [None]:
svc_model.fit(x_train,y_train)

Make predictions

In [None]:
y_pred = svc_model.predict(x_test)

Use the accuracy_score function from the sklearn.metrics module to calculate the accuracy.

In [None]:
print("Accuracy: ", accuracy_score(y_test,y_pred))
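
Accuracy alone can be misleading when one class dominates, which may well be the case when most students pass. The sketch below uses made-up labels (not results from the model above) to show how a confusion matrix and per-class recall reveal what accuracy hides.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Made-up labels: 8 passes (1) and 2 fails (0)
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
# A useless model that predicts "pass" for everyone
y_hat = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

acc = accuracy_score(y_true, y_hat)
cm = confusion_matrix(y_true, y_hat)                   # rows: true class, columns: predicted class
fail_recall = recall_score(y_true, y_hat, pos_label=0) # share of true fails that were caught

print("accuracy:", acc)             # 0.8
print("confusion matrix:\n", cm)
print("recall for the fail class:", fail_recall)  # 0.0
```

Here the accuracy is 0.8 even though not a single failing student is identified, so it is worth checking the confusion matrix alongside the accuracy score.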