# Student performance and difficulties prediction

In this notebook, we will:

- Predict whether or not a student will pass the final exam based on certain information given
- Compare the three learning algorithms
- Find out what most affects student achievement
- Find the best algorithm with high accuracy

We will be using three learning algorithms:

- Logistic regression
- Support vector machine (SVM)
- K-nearest neighbors (KNN)

# Reading data

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from time import time
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, roc_curve, accuracy_score, f1_score, roc_auc_score
from astropy.table import Table


df = pd.read_csv('student-data.csv')
dfv = pd.read_csv('student-data.csv')

# Data

**Before processing the data, let's describe it briefly:**
- Source: **Paulo Cortez, University of Minho, Guimarães, Portugal**, http://www3.dsi.uminho.pt/pcortez

- This data covers student achievement in secondary education at two Portuguese schools.

- The shape of our data set is **(395 rows × 31 columns)**.

- **No missing** values in the data.

- The data attributes include **demographic**, social and school-related features; the data was collected using school reports and questionnaires.

- **The last column tells us whether a student passed the final exam or not**.

- The dataset is taken from : https://archive.ics.uci.edu/ml/datasets/student+performance

**Now let's explain every column in the dataframe**
- `school` : student's school (binary: "GP" or "MS")
- `sex` : student's sex (binary: "F" - female or "M" - male)
- `age` : student's age (numeric: from 15 to 22)
- `address` : student's home address type (binary: "U" - urban or "R" - rural)
- `famsize` : family size (binary: "LE3" - less or equal to 3 or "GT3" - greater than 3)
- `Pstatus` : parent's cohabitation status (binary: "T" - living together or "A" - apart)
- `Medu` : mother's education (numeric: 0 - none,  1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
- `Fedu` : father's education (numeric: 0 - none,  1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
- `Mjob` : mother's job (nominal: "teacher", "health" care related, civil "services" (e.g. administrative or police), "at_home" or "other")
- `Fjob` : father's job (nominal: "teacher", "health" care related, civil "services" (e.g. administrative or police), "at_home" or "other")
- `reason` : reason to choose this school (nominal: close to "home", school "reputation", "course" preference or "other")
- `guardian` : student's guardian (nominal: "mother", "father" or "other")
- `traveltime` : home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
- `studytime` : weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
- `failures` : number of past class failures (numeric: n if 1<=n<3, else 4)
- `schoolsup` : extra educational support (binary: yes or no)
- `famsup` : family educational support (binary: yes or no)
- `paid` : extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
- `activities` : extra-curricular activities (binary: yes or no)
- `nursery` : attended nursery school (binary: yes or no)
- `higher` : wants to take higher education (binary: yes or no)
- `internet` : Internet access at home (binary: yes or no)
- `romantic` : with a romantic relationship (binary: yes or no)
- `famrel` : quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
- `freetime` : free time after school (numeric: from 1 - very low to 5 - very high)
- `goout` : going out with friends (numeric: from 1 - very low to 5 - very high)
- `Dalc` : workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
- `Walc` : weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
- `health` : current health status (numeric: from 1 - very bad to 5 - very good)
- `absences` : number of school absences (numeric: from 0 to 93)

**The last column:**
- `passed` : did the student pass the final exam (binary: yes or no)

**Displaying the dataset**

In [None]:
df.iloc[0:5]

## Data processing

In [None]:
# mapping strings to numeric values:
def numerical_data():
    df['school'] = df['school'].map({'GP': 0, 'MS': 1})
    df['sex'] = df['sex'].map({'M': 0, 'F': 1})
    df['address'] = df['address'].map({'U': 0, 'R': 1})
    df['famsize'] = df['famsize'].map({'LE3': 0, 'GT3': 1})
    df['Pstatus'] = df['Pstatus'].map({'T': 0, 'A': 1})
    df['Mjob'] = df['Mjob'].map({'teacher': 0, 'health': 1, 'services': 2, 'at_home': 3, 'other': 4})
    df['Fjob'] = df['Fjob'].map({'teacher': 0, 'health': 1, 'services': 2, 'at_home': 3, 'other': 4})
    df['reason'] = df['reason'].map({'home': 0, 'reputation': 1, 'course': 2, 'other': 3})
    df['guardian'] = df['guardian'].map({'mother': 0, 'father': 1, 'other': 2})
    df['schoolsup'] = df['schoolsup'].map({'no': 0, 'yes': 1})
    df['famsup'] = df['famsup'].map({'no': 0, 'yes': 1})
    df['paid'] = df['paid'].map({'no': 0, 'yes': 1})
    df['activities'] = df['activities'].map({'no': 0, 'yes': 1})
    df['nursery'] = df['nursery'].map({'no': 0, 'yes': 1})
    df['higher'] = df['higher'].map({'no': 0, 'yes': 1})
    df['internet'] = df['internet'].map({'no': 0, 'yes': 1})
    df['romantic'] = df['romantic'].map({'no': 0, 'yes' : 1})
    df['passed'] = df['passed'].map({'no': 0, 'yes': 1})
    # reorder dataframe columns :
    col = df['passed']
    del df['passed']
    df['passed'] = col

    
# Feature scaling helps the algorithms converge faster by bringing all columns to a similar scale
def feature_scaling(df):
    for i in df:
        col = df[i]
        # columns with large values (e.g. age, absences): centre on the mean, divide by the max
        if np.max(col) > 6:
            col = (col - np.mean(col)) / np.max(col)
            df[i] = col
        # remaining columns: min-max scaling to the range [0, 1]
        else:
            col = col - np.min(col)
            col = col / np.max(col)
            df[i] = col
        
# This function will transform dataframe to a numpy array and split it
def split(df,test_size):
    data = df.to_numpy()
    n = data.shape[1]
    x = data[:,0:n-1]
    y = data[:,n-1]
    X_train,X_test,y_train,y_test = train_test_split(x,y,test_size=test_size, random_state=0)
    return X_train,X_test,y_train,y_test
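
As a rough illustration of what `feature_scaling` does, the sketch below applies the same two rules to a toy frame. The helper name `scale_column` and the toy values are assumptions made for this example; they are not part of the student data.

```python
import pandas as pd

def scale_column(col):
    """Hypothetical helper mirroring the two rules used in feature_scaling."""
    if col.max() > 6:
        # large-valued column: centre on the mean, divide by the maximum
        return (col - col.mean()) / col.max()
    # small-valued column: min-max scaling to [0, 1]
    shifted = col - col.min()
    return shifted / shifted.max()

# Toy values chosen for illustration only
toy = pd.DataFrame({"age": [15, 18, 21], "studytime": [1, 2, 4]})
scaled = toy.apply(scale_column)
print(scaled)
```

Note that the first rule divides by the maximum rather than by the range (max minus min), so the result is centred on zero but not guaranteed to span a fixed interval.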

**Converting categorical values to numbers**

In [None]:
# All values are numerical after calling the numerical_data() function
numerical_data()
df.iloc[0:5]

## Data visualisation

##### 1) data inspection

In [None]:
df.shape

In [None]:
df.dropna().shape # there are no null values, fortunately :)

In [None]:
df.columns

##### 2) Now let's visualise the data and look deeper into each feature

In [None]:
features=['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu',
       'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime',
       'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery',
       'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc',
       'Walc', 'health', 'absences']

a) boxplot:

In [None]:
sns.boxplot(x="passed", y="goout",  data=dfv)

In [None]:
# it seems that even students with a low goout level failed the exam

b) Distribution of categorical features

- School distribution

In [None]:
dfv["school"].unique()

In [None]:
f,fx = plt.subplots() 
figure = sns.countplot(x = 'school', data=dfv, order=['GP','MS'])
fx = fx.set(ylabel="Count", xlabel="school")
figure.grid(False)
plt.title('School Distribution')

- Gender distribution

In [None]:
f, fx = plt.subplots()
f = sns.countplot(x = 'sex', data=dfv, order=['M','F'])
fx = fx.set(ylabel="Count", xlabel="gender")
f.grid(False)
plt.title('Gender Distribution')

- Address distribution

In [None]:
f, fx = plt.subplots()
figure = sns.countplot(x = 'address', data=dfv, order=['U','R'])
fx = fx.set(ylabel="Count", xlabel="address")
figure.grid(False)
plt.title('Address Distribution')

- Family size distribution

In [None]:
dfv["famsize"].unique()

In [None]:
f, fx = plt.subplots()
figure = sns.countplot(x = 'famsize', data=dfv, order=['GT3','LE3'])
fx = fx.set(ylabel="Count", xlabel="famsize")
figure.grid(False)
plt.title('Family Distribution')

- Parents' cohabitation status distribution

In [None]:
f, fx = plt.subplots()
figure = sns.countplot(x = 'Pstatus', data=dfv, order=['A','T'])
fx = fx.set(ylabel="Count", xlabel="status")
figure.grid(False)
plt.title('Parents status Distribution')   

- School support distribution

In [None]:
f, fx = plt.subplots()
figure = sns.countplot(x = 'schoolsup', data=dfv, order=['yes','no'])
fx = fx.set(ylabel="Count", xlabel="School Support")
figure.grid(False)
plt.title('School Support Distribution')

- Family support distribution

In [None]:
f, fx = plt.subplots()
figure = sns.countplot(x = 'famsup', data=dfv, order=['yes','no'])
fx = fx.set(ylabel="Count", xlabel="Family Support")
figure.grid(False)
plt.title('Family Support Distribution')

- Extra paid classes distribution

In [None]:
f, fx = plt.subplots()
figure = sns.countplot(x = 'paid', data=dfv, order=['yes','no'])
fx = fx.set(ylabel="Count", xlabel="extra paid classes")
figure.grid(False)
plt.title('Extra paid classes Distribution')

- Distribution of students who want to take higher education

In [None]:
f, fx = plt.subplots()
figure = sns.countplot(x = 'higher', data=dfv, order=['yes','no'])
fx = fx.set(ylabel="Count", xlabel="wants to take higher education")
figure.grid(False)
plt.title('Students who want to take higher education Distribution')

- Distribution of Internet access at home

In [None]:
f, fx = plt.subplots()
figure = sns.countplot(x = 'internet', data=dfv, order=['yes','no'])
fx = fx.set(ylabel="Count", xlabel="Internet access at home")
figure.grid(False)
plt.title('Internet access at home Distribution')

- Distribution of students in a romantic relationship

In [None]:
f, fx = plt.subplots()
figure = sns.countplot(x = 'romantic', data=dfv, order=['yes','no'])
fx = fx.set(ylabel="Count", xlabel="With a romantic relationship")
figure.grid(False)
plt.title('Students with a romantic relationship Distribution')

c) Distribution of features with multiple categories

- Parent education distribution

In [None]:
#(numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)

In [None]:
dfv["Medu"].unique()

In [None]:
dfv["Fedu"].unique()

In [None]:
f, fx = plt.subplots()
figure = sns.countplot(x = 'Medu', data=dfv, order=[0,1,2,3,4])
fx = fx.set(ylabel="Count", xlabel="Mother Education")
figure.grid(False)
plt.title('Parent Education Distribution')
   
f, fx = plt.subplots()
figure = sns.countplot(x = 'Fedu', data=dfv, order=[0,1,2,3,4])
fx = fx.set(ylabel="Count", xlabel="Father Education")
figure.grid(False)

In [None]:
# (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')

In [None]:
f, fx = plt.subplots()
figure = sns.countplot(x = 'Mjob', data=dfv, order=['teacher','health','services','at_home','other'])
fx = fx.set(ylabel="Count", xlabel="Mother Job")
figure.grid(False)
plt.title('Parent Job Distribution')
   
f, fx = plt.subplots()
figure = sns.countplot(x = 'Fjob', data=dfv, order=['teacher','health','services','at_home','other'])
fx = fx.set(ylabel="Count", xlabel="Father Job")
figure.grid(False)

- Travel time distribution

In [None]:
# (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour) 

In [None]:
f, ax = plt.subplots()
figure = sns.countplot(x = 'traveltime', data=dfv, order=[1,2,3,4])
ax = ax.set(ylabel="Count", xlabel="travel time")
figure.grid(False)
plt.title('Travel Time Distribution')

- Study time distribution

In [None]:
# (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)

In [None]:
f, ax = plt.subplots()
figure = sns.countplot(x = 'studytime', data=dfv, order=[1,2,3,4])
ax = ax.set(ylabel="Count", xlabel="study time")
figure.grid(False)
plt.title('Study Time Distribution')

- Failures distribution

In [None]:
f, ax = plt.subplots()
figure = sns.countplot(x = 'failures', data=dfv, order=[0,1,2,3])
ax = ax.set(ylabel="Count", xlabel="failures")
figure.grid(False)
plt.title('failures Distribution')

- Family relationship distribution

In [None]:
f, ax = plt.subplots()
figure = sns.countplot(x = 'famrel', data=dfv, order=[1,2,3,4,5])
ax = ax.set(ylabel="Count", xlabel="family relationship")
figure.grid(False)
plt.title('family relationship Distribution')

- Free time distribution

In [None]:
f, ax = plt.subplots()
figure = sns.countplot(x = 'freetime', data=dfv, order=[1,2,3,4,5])
ax = ax.set(ylabel="Count", xlabel="Freetime")
figure.grid(False)
plt.title('Free time Distribution')

- Going out distribution

In [None]:
f, ax = plt.subplots()
figure = sns.countplot(x = 'goout', data=dfv, order=[1,2,3,4,5])
ax = ax.set(ylabel="Count", xlabel="Going Out")
figure.grid(False)
plt.title('Going Out Distribution')

- Alcohol consumption distribution

In [None]:
f, ax = plt.subplots()
figure = sns.countplot(x = 'Dalc', data=dfv, order=[1,2,3,4,5])
ax = ax.set(ylabel="Count", xlabel="Working")
figure.grid(False)
plt.title('Working day alcohol consumption Distribution')

f, ax = plt.subplots()
figure = sns.countplot(x = 'Walc', data=dfv, order=[1,2,3,4,5])
ax = ax.set(ylabel="Count", xlabel="Weekends")
figure.grid(False)
plt.title('Weekend alcohol consumption Distribution')

- Distribution of student status

In [None]:
dfv['passed'].value_counts()

In [None]:
labels = 'Students who passed the final exam', 'Students who failed the final exam'
sizes = [265, 130]  # counts of 'yes' and 'no' from value_counts() above
colors=['lightskyblue','yellow']
fig1, ax1 = plt.subplots()
ax1.pie(sizes,  labels=labels, autopct='%1.1f%%',colors=colors,
        shadow=True, startangle=90)
ax1.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()

d) Now let's look at the most impactful features for student failure

1- Using correlation

In [None]:
# visualise correlation between student status and other features
plt.figure(figsize=(8, 12))
heatmap = sns.heatmap(df.corr()[['passed']].sort_values(by='passed', ascending=False), vmin=-1, vmax=1, annot=True, cmap='BrBG')
heatmap.set_title('Features Correlating with the status of student', fontdict={'fontsize':18}, pad=16);


It seems that the features with the most impact on student status are:

- Negative impact:
    - `failures`
    - `goout`
    - `age`
- Positive impact:
    - `higher`
    - `Medu`
    - `Fedu`
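
The ranking read off the heatmap can also be obtained numerically by sorting the correlation column. The sketch below does this on synthetic data; the toy frame, its column effects and the `random` column are assumptions for illustration and do not reproduce the real correlations.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic features (assumption): failures hurts, higher helps, random is unrelated
failures = rng.integers(0, 4, n)
higher = rng.integers(0, 2, n)
random_col = rng.normal(size=n)
score = -1.0 * failures + 1.0 * higher + rng.normal(size=n)

toy = pd.DataFrame({
    "failures": failures,
    "higher": higher,
    "random": random_col,
    "passed": (score > -1).astype(int),
})

# Correlation of every feature with the target, most negative first
corr = toy.corr()["passed"].drop("passed").sort_values()
print(corr)
```

With the notebook's own frame, the same expression would be applied to `df` once `numerical_data()` has been called.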

**Feature scaling**

In [None]:
feature_scaling(df)
X_train,X_test,y_train,y_test = split(df,0.2)

# Now we are ready for model training
df.iloc[0:3]

# Logistic regression

# k-nearest neighbors

In [None]:
# Written by el nabaoui nouhaila

In this section we will discuss how to implement KNN and how to get the best accuracy using hyperparameter tuning (happy reading :)

In [None]:
# feature_scaling(df) has already been applied above
# define the features X and the target y
y=df.passed
target=["passed"]
X = df.drop(target,axis = 1 )

### A)**Hyperparameter Tuning**



Before applying the KNN algorithm, it is better to choose an optimal value of K. But what is the best method to tune this value?
There is no straightforward method to calculate the value of K in KNN; we have to try different values and keep the one that performs best.
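
One common way to "try different values" is k-fold cross-validation, which is a different technique from the single train/test split used in this notebook. The sketch below is a minimal example under stated assumptions: the data comes from `make_classification` as a stand-in for the student data, and the names `X_toy`, `y_toy` and `candidate_k` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the student data (assumption: 395 rows, 30 features)
X_toy, y_toy = make_classification(
    n_samples=395, n_features=30, n_informative=5, random_state=0
)

candidate_k = list(range(1, 20, 2))  # odd values only
mean_scores = []
for k in candidate_k:
    # mean accuracy over 5 folds for this value of k
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_toy, y_toy, cv=5)
    mean_scores.append(scores.mean())

best_k = candidate_k[int(np.argmax(mean_scores))]
print("best k by 5-fold cross-validation:", best_k)
```

Averaging over several folds makes the choice of K less sensitive to one particular split, at the cost of fitting the model several times per candidate.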

In [None]:
# splitting the data
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=0, stratify=y)
# Set up arrays to store training and test accuracies
neighbors= np.arange(1,20)
train_accuracy = np.empty(len(neighbors))
test_accuracy = np.empty(len(neighbors))

for i,k in enumerate(neighbors):
    #Setup a knn classifier with k neighbors
    knn = KNeighborsClassifier(n_neighbors=k)
    
    #Fit the model
    knn.fit(X_train, y_train)
    
    #Compute accuracy on the training set
    train_accuracy[i] = knn.score(X_train, y_train)
    
    #Compute accuracy on the test set
    test_accuracy[i] = knn.score(X_test, y_test) 
    
# Plotting the curve
plt.title('k-NN Varying number of neighbors')
plt.plot(neighbors, test_accuracy, label='Testing Accuracy')
plt.plot(neighbors, train_accuracy, label='Training accuracy')
plt.legend()
plt.xlabel('Number of neighbors')
plt.ylabel('Accuracy')
plt.show() 


In our case study we have a binary classification problem, so it is better to choose an odd value of K, which avoids tied votes.
Looking at the curve, k=13 appears to be a good choice.


**We are going to search for the best parameters (K, metric) based on time and accuracy on the held-out data**


In [None]:
params = {"n_neighbors": np.arange(1, 30), "metric":["euclidean", "manhattan", "chebyshev"]}
acc = {}
i=0

for m in params["metric"]:
    acc[m] = []
    for k in params["n_neighbors"]:
        print("Model_{} metric: {}, n_neighbors: {}".format(i, m, k))
        i += 1
        t = time()
        knn = KNeighborsClassifier(n_neighbors=k, metric=m)
        knn.fit(X_train,y_train)
        pred = knn.predict(X_test)
        print("Time: ", time() - t)
        acc[m].append(accuracy_score(y_test, pred))
        print("Acc: ", acc[m][-1])

As we can see, the best metric is the Manhattan distance, with optimal k=8. This choice gives the highest accuracy (68%) and takes less time than the other distances (t ≈ 0.008 s).
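
The three metrics compared here differ only in how the coordinate-wise differences between two points are aggregated. A small worked example, with two points chosen purely for illustration:

```python
import numpy as np

# Two illustrative points (assumption: not taken from the student data)
a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

diff = np.abs(a - b)

euclidean = np.sqrt(np.sum(diff ** 2))  # square root of the sum of squares
manhattan = np.sum(diff)                # sum of absolute differences
chebyshev = np.max(diff)                # largest absolute difference

print(euclidean, manhattan, chebyshev)  # 5.0 7.0 4.0
```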

## B)Final models implementation

Following the accuracy curve from the previous section, the final KNN model uses:

- K = 13
- metric = Euclidean distance
    

In [None]:
# final model
knn_f=KNeighborsClassifier(n_neighbors=13,metric='euclidean')
knn_f.fit(X_train,y_train)

## C)Model evaluation


**To evaluate our model we are going to**:

- use a heatmap of the confusion matrix

- use the precision, recall and F1 score for each class

- plot the ROC curve
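
To make the link between these metrics concrete, the sketch below derives precision, recall and F1 for the positive class directly from the four cells of a confusion matrix. The label vectors are toy values chosen for illustration, not predictions from the model.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Toy labels (assumption): 5 positives and 3 negatives
y_true_toy = np.array([1, 1, 1, 1, 0, 0, 0, 1])
y_pred_toy = np.array([1, 1, 0, 1, 0, 1, 0, 1])

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)

print(tn, fp, fn, tp)
print(precision, recall, f1)
# The hand-computed values should match scikit-learn's own functions
print(precision_score(y_true_toy, y_pred_toy),
      recall_score(y_true_toy, y_pred_toy),
      f1_score(y_true_toy, y_pred_toy))
```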

1)**Confusion matrix**

In [None]:
# evaluate the accuracy of the final model (k=13):
y_predict=knn_f.predict(X_test)
ac1 = accuracy_score(y_test,y_predict)
print('Accuracy is: ',ac1)
cm= confusion_matrix(y_test,y_predict)
sns.heatmap(cm,annot=True)


2)**Classification report**


In [None]:
#import classification_report
from sklearn.metrics import classification_report
y_predict = knn_f.predict(X_test)
print(classification_report(y_test,y_predict))

3)**ROC curve**

In [None]:
# plotting the ROC curve from the predicted probability of the positive class
y_score = knn_f.predict_proba(X_test)[:, 1]
fpositif, tpositif, thresholds = roc_curve(y_test, y_score)
plt.plot([0,1],[0,1],'k--')
plt.plot(fpositif,tpositif, label='knn_f')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('Knn_f ROC curve')
plt.show()
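
A ROC curve sweeps a decision threshold over continuous scores, so what is passed to `roc_curve` matters. The sketch below, on toy values chosen for illustration, compares hard 0/1 labels with probability scores: hard labels leave a single operating point between the two corners, while scores trace a fuller curve.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Toy ground truth and scores (assumption: illustrative values only)
y_true_toy = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_prob_toy = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.55])
y_hard_toy = (y_prob_toy >= 0.5).astype(int)  # labels after a 0.5 threshold

fpr_hard, tpr_hard, _ = roc_curve(y_true_toy, y_hard_toy)
fpr_prob, tpr_prob, _ = roc_curve(y_true_toy, y_prob_toy)

auc_hard = roc_auc_score(y_true_toy, y_hard_toy)
auc_prob = roc_auc_score(y_true_toy, y_prob_toy)

print("points from hard labels:", len(fpr_hard))
print("points from probabilities:", len(fpr_prob))
print("AUC from hard labels:", auc_hard)
print("AUC from probabilities:", auc_prob)
```

For a fitted KNN model the scores would typically come from `predict_proba`, and for `SVC` from `decision_function`.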
     

## D)Conclusion:
To conclude, using the KNN algorithm with the Euclidean distance and k=13, we obtained a fairly good accuracy of 70%.

# Support vector machine 

**First of all let's start with creating some useful functions :**

In [None]:
# Mohammed AL JADD
# Helper functions


# ------------------------------------------------------------------------------------------------------------------------------
# Show the results of a model

def showResults(accuracy, trainingTime, y_pred,model):
    
    print('------------------------------------------------Results :',model,'-----------------------------------------------------')
    confusionMatrix = confusion_matrix(y_test, y_pred)
    print('\n The ROC curve is :\n')
    fpr,tpr,thresholds=roc_curve(y_test,y_pred)
    plt.plot([0, 1],[0, 1],'--')
    plt.plot(fpr,tpr,label=model)
    plt.xlabel('false positive rate')
    plt.ylabel('true positive rate')
    plt.legend()
    plt.show()
    print('----------------------------------------------')
    print('The model  accuracy:', accuracy,'%')
    print('----------------------------------------------')
    print('The confusion matrix is :\n',confusionMatrix)
    print('----------------------------------------------')
    print('The training time is: ',trainingTime)
    print('----------------------------------------------')
    print('The f1 score is :',f1_score(y_test, y_pred, average='macro'))
    print('----------------------------------------------')
    print('The roc_auc_score is :',roc_auc_score(y_test, y_pred))  
    print('--------------------------------------------------------------------------------------------------------------------')
    


    
# ------------------------------------------------------------------------------------------------------------------------------
# Hyperparameter Tuning :
# C, degree and gamma are the hyperparameters of the SVM classifier: SVC(C=...), SVC(C=..., degree=...), SVC(C=..., gamma=...)
# The following functions return the values that minimize the error on the (X_val, y_val) set
# So the (X_val, y_val) set is used to get the optimal SVM parameters before evaluating the model on the test set


# Optimal C 
def optimal_C_value():
    Ci = np.array(( 0.0001,0.001,0.01,0.05,0.1,4,10,40,100))
    minError = float('Inf')
    optimal_C = float('Inf')

    for c in Ci:
        clf = SVC(C=c,kernel='linear')
        clf.fit(X_train, y_train)
        predictions = clf.predict(X_val)
        error = np.mean(np.double(predictions != y_val))
        if error < minError:
            minError = error
            optimal_C = c
    return optimal_C


# Optimal C and the degree of the polynomial
def optimal_C_d_values():
    Ci = np.array(( 0.0001,0.001,0.01,0.05,0.1,4,10,40,100))
    Di = np.array(( 2, 5, 10, 15, 20, 25, 30))
    minError = float('Inf')
    optimal_C = float('Inf')
    optimal_d = float('Inf')

    for d in Di:
        for c in Ci:
            clf = SVC(C=c,kernel='poly', degree=d)
            clf.fit(X_train, y_train)
            predictions = clf.predict(X_val)
            error = np.mean(np.double(predictions != y_val))
            if error < minError:
                minError = error
                optimal_C = c
                optimal_d = d
    return optimal_C,optimal_d


# Optimal C and gamma
def optimal_C_gamma_values():
    Ci = np.array(( 0.0001,0.001,0.01,0.05,0.1,4,10,40,100))
    Gi = np.array(( 0.000001,0.00001,0.01,1,2,3,5,20,70,100,500,1000))
    minError = float('Inf')
    optimal_C = float('Inf')
    optimal_g = float('Inf')

    for g in Gi:
        for c in Ci:
            clf = SVC(C=c,kernel='rbf', gamma=g)
            clf.fit(X_train, y_train)
            predictions = clf.predict(X_val)
            error = np.mean(np.double(predictions != y_val))
            if error < minError:
                minError = error
                optimal_C = c
                optimal_g = g
    return optimal_C,optimal_g


# ------------------------------------------------------------------------------------------------------------------------------
# Compare the three kernels


def compare_kernels():
    print('------------------------------------------------ Comparison -----------------------------------------------------')
    print('\n')
    f11 = "{:.2f}".format(f1_score(y_test, y_linear, average='macro'))
    f22 = "{:.2f}".format(f1_score(y_test, y_poly, average='macro'))
    f33 = "{:.2f}".format(f1_score(y_test, y_gauss, average='macro'))
    roc1 = "{:.2f}".format(roc_auc_score(y_test, y_linear))
    roc2 = "{:.2f}".format(roc_auc_score(y_test, y_poly))
    roc3 = "{:.2f}".format(roc_auc_score(y_test, y_gauss))
    a1,a2 = confusion_matrix(y_test, y_linear)[0],confusion_matrix(y_test, y_linear)[1]
    b1,b2 = confusion_matrix(y_test, y_poly)[0],confusion_matrix(y_test, y_poly)[1]
    c1,c2 = confusion_matrix(y_test, y_gauss)[0],confusion_matrix(y_test, y_gauss)[1]
    data_rows = [('training time',time1, time2, time3),
                 ('','','',''),
                  ('accuracy %',linear_accuracy, poly_accuracy, gauss_accuracy),
                 ('','','',''),
                 ('confusion matrix',a1, b1, c1),
                ('',a2,b2,c2),
                 ('','','',''),
                ('f1 score',f11,f22,f33),
                 ('','','',''),
                ('roc_auc_score',roc1,roc2,roc3)]
    t = Table(rows=data_rows, names=('metric','Linear kernel', 'polynomial kernel', 'gaussian kernel'))
    print(t)
    print('\n\n')
    print('The Roc curves :\n')
    y_pred1 = y_linear
    y_pred2 = y_poly
    y_pred3 = y_gauss
    
    fpr,tpr,thresholds=roc_curve(y_test,y_pred1)
    plt.plot([0, 1],[0, 1],'--')
    plt.plot(fpr,tpr,label='Linear kernel')
    plt.xlabel('false positive rate')
    plt.ylabel('true positive rate')
    fpr,tpr,thresholds=roc_curve(y_test,y_pred2)
    plt.plot(fpr,tpr,label='Polynomial kernel')
    fpr,tpr,thresholds=roc_curve(y_test,y_pred3)
    plt.plot(fpr,tpr,label='Gaussian kernel')
    plt.legend()
    plt.show()


# ------------------------------------------------------------------------------------------------------------------------------
# Print results of the chosen kernel

def best_kernel(kernel):
    # pick the stored results of the requested kernel (any other name falls back to the gaussian kernel)
    if kernel == 'linear kernel':
        train_time, accuracy, y = time1, linear_accuracy, y_linear
    elif kernel == 'polynomial kernel':
        train_time, accuracy, y = time2, poly_accuracy, y_poly
    else:
        train_time, accuracy, y = time3, gauss_accuracy, y_gauss
    f1 = "{:.2f}".format(f1_score(y_test, y, average='macro'))
    rc = roc_auc_score(y_test, y)
    print('The chosen kernel :',kernel)
    print('the training time :',train_time)
    print('the accuracy :',round(accuracy),'%')
    print('the f1 score :',f1)
    print('The roc_auc_score is :',rc)
    print('----------------------------------------\nThe ROC curve :')
    fpr,tpr,thresholds=roc_curve(y_test,y)
    plt.plot([0, 1],[0, 1],'--')
    plt.plot(fpr,tpr,label='The best svm kernel : '+kernel)
    plt.xlabel('false positive rate')
    plt.ylabel('true positive rate')
    plt.legend()
    plt.show()
    
    
    
    
    
# ------------------------------------------------------------------------------------------------------------------------------
# Splitting the data for SVM
# Here we split the data into a training set, a validation set (X_val, y_val) and a test set
# The validation set (X_val, y_val) is used to choose the optimal values of the SVM parameters C, degree and gamma
data = df.to_numpy()
n = data.shape[1]    
x = data[:,0:n-1]
y = data[:,n-1]
X_train,X_rest,y_train,y_rest = train_test_split(x,y,test_size=0.4, random_state=0)
X_val,X_test,y_val,y_test = train_test_split(X_rest,y_rest,test_size=0.4, random_state=0) 

# We will use the three different svm classifier kernels
# Linear kernel, polynomial kernel and gaussian kernel and we will choose the most accurate
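
Applying `train_test_split` twice with `test_size=0.4` does not give equal validation and test sets: the first call keeps about 60% for training, and the second divides the remaining 40% into roughly 24% validation and 16% test. The sketch below checks these proportions on a placeholder array with the same number of rows as the dataset; the names `X_toy`, `X_tr`, `X_va` and `X_te` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data (assumption): 395 rows, contents are irrelevant here
X_toy = np.arange(395 * 2).reshape(395, 2)
y_toy = np.arange(395) % 2

# Same two-step split as used for the SVM models
X_tr, X_rest, y_tr, y_rest = train_test_split(X_toy, y_toy, test_size=0.4, random_state=0)
X_va, X_te, y_va, y_te = train_test_split(X_rest, y_rest, test_size=0.4, random_state=0)

for name, part in [("train", X_tr), ("validation", X_va), ("test", X_te)]:
    print(name, len(part), round(len(part) / len(X_toy), 2))
```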

<h3>1) Model evaluation:</h3>

For model evaluation we will calculate :

- <span style='color:red'>**Training time**</span>
- <span style='color:red'>**Accuracy**</span>
- <span style='color:red'>**Confusion matrix**</span>
- <span style='color:red'>**ROC curve**</span>
- <span style='color:red'>**ROC AUC score**</span>
- <span style='color:red'>**f1 score**</span>

<h3>2) Training phase:</h3>

**Linear Kernel :**

In [None]:
###################################################### Linear kernel ###########################################################

# Let's get the optimal C value for the linear kernel
optimal_C = optimal_C_value()


# Now let's use the optimal C value
linear_clf = SVC(C=optimal_C,kernel='linear')

# Let's train the model with the optimal C value and calculate the training time
tic = time()
linear_clf.fit(X_train, y_train)
toc = time()
time1 = str(round(1000*(toc-tic))) + "ms"
y_linear = linear_clf.predict(X_test)
linear_accuracy = accuracy_score(y_test, y_linear)*100



# Let's show the results
showResults(linear_accuracy, time1, y_linear,'SVM linear kernel')

**Polynomial Kernel :**

In [None]:
###################################################### Polynomial kernel ######################################################

# Let's get the optimal C and degree values for the polynomial kernel
optimal_C, optimal_d = optimal_C_d_values()

# Now let's use the optimal C value and the optimal degree value
poly_clf = SVC(C=optimal_C,kernel='poly', degree=optimal_d)

# Let's train the model with the optimal C value and calculate the training time
tic = time()
poly_clf.fit(X_train, y_train)
toc = time()
time2 = str(round(1000*(toc-tic))) + "ms"
y_poly = poly_clf.predict(X_test)
poly_accuracy = (accuracy_score(y_test, y_poly)*100)

# Let's show the results
showResults(poly_accuracy, time2, y_poly,'SVM polynomial kernel')

**Gaussian Kernel :**

In [None]:
###################################################### Gaussian kernel ######################################################

# Let's get the optimal C and gamma values for the gaussian kernel
optimal_C, optimal_gamma = optimal_C_gamma_values()

# Now let's use the optimal C and gamma values
gauss_clf = SVC(C=optimal_C,kernel='rbf',gamma=optimal_gamma)

# Let's train the model with the optimal C value and calculate the training time
tic = time()
gauss_clf.fit(X_train, y_train)
toc = time()
time3 = str(round(1000*(toc-tic))) + "ms"
y_gauss = gauss_clf.predict(X_test)
gauss_accuracy = (accuracy_score(y_test, y_gauss)*100)

# Let's show the results
showResults(gauss_accuracy, time3, y_gauss,'SVM gaussian kernel')

<h3>3) Comparison of the three svm kernels:</h3>

**We will compare all the metrics and plot a single graph containing the ROC curves of the three SVM kernels:**

```python
# we will just call the function :
compare_kernels()

```

In [None]:
compare_kernels()

<h3>4) The most accurate svm kernel is the linear kernel:</h3>

```python
# just call the function:
best_kernel("linear kernel")
# we pass it the parameter "linear kernel" as it is the most accurate.

```

In [None]:
best_kernel('linear kernel')

<h3>5) Factors affecting student performance :</h3>

<h3>6) Conclusion : SVM linear kernel:</h3>

# Comparison of the three algorithms