
### Supervised Learning
### Activity: Building a Student Intervention System

### Question 1 - Classification vs. Regression
*Your goal for this project is to identify students who might need early intervention before they fail to graduate. Which type of supervised learning problem is this, classification or regression? Why?*

Answer: This is a classification problem. The target, `passed`, takes one of two discrete values (`yes` or `no`) rather than a continuous quantity, so the task is to assign each student to a class.

### Question 2
Load the necessary Python libraries and the student data. Note that the last column of this dataset, `'passed'`, is our target label (whether the student graduated or didn't graduate). All other columns are features about each student.

In [23]:
# Import libraries
import pandas as pd
import numpy as np


In [24]:
# Read student data
data = pd.read_csv("student-data.csv")
data.head()


Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,internet,romantic,famrel,freetime,goout,Dalc,Walc,health,absences,passed
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,no,no,4,3,4,1,1,3,6,no
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,yes,no,5,3,3,1,1,3,4,no
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,yes,no,4,3,2,2,3,3,10,yes
3,GP,F,15,U,GT3,T,4,2,health,services,...,yes,yes,3,2,2,1,1,5,2,yes
4,GP,F,16,U,GT3,T,3,3,other,other,...,no,no,4,3,2,1,2,5,4,yes


### Question-3
Let's begin by investigating the dataset to determine how many students we have information on, and learn about the graduation rate among these students. In the code cell below, you will need to compute the following:
- The total number of students, `n_students`.
- The total number of features for each student, `n_features`.
- The number of those students who passed, `n_passed`.
- The number of those students who failed, `n_failed`.
- The graduation rate of the class, `grad_rate`, in percent (%).
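
As a rough sketch of how these five quantities relate, the snippet below computes them on a tiny made-up DataFrame. The name `toy` and its values are hypothetical and stand in for the student data; `value_counts` is used here only as one convenient way to get both class counts at once.

```python
import pandas as pd

# Hypothetical toy data standing in for student-data.csv
toy = pd.DataFrame({
    "age": [18, 17, 15, 16],
    "absences": [6, 4, 10, 2],
    "passed": ["no", "yes", "yes", "yes"],
})

n_students = toy.shape[0]            # one row per student
n_features = toy.shape[1] - 1        # every column except the target
counts = toy["passed"].value_counts()
n_passed = int(counts.get("yes", 0))
n_failed = int(counts.get("no", 0))
grad_rate = 100.0 * n_passed / n_students

print(f"{n_students} students, {n_features} features, "
      f"{n_passed} passed, {n_failed} failed, rate {grad_rate:.2f}%")
# prints: 4 students, 2 features, 3 passed, 1 failed, rate 75.00%
```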


In [25]:
# Calculate number of students
n_students = len(data)
print("No of students:",n_students)

No of students: 395


In [26]:
# Calculate number of features
n_features = len(data.columns[:-1])
print("No of features :",n_features)

No of features : 30


In [27]:
# Calculate passing students
n_passed = len(data[data.passed=="yes"])
print("No of students passed :", n_passed)

No of students passed : 265


In [28]:
# Calculate failing students
n_failed = len(data[data.passed=="no"])
print("No of students failed :", n_failed)

No of students failed : 130


In [29]:
# Calculate graduation rate
grad_rate = n_passed/(n_passed+n_failed)*100
print("Graduation Rate :",grad_rate)

Graduation Rate : 67.08860759493672


In [30]:
# Print the results
print("Total number of students: ",n_students)
print("Number of features:",n_features)
print("Number of students who passed: ",n_passed)
print("Number of students who failed: ",n_failed)
print("Graduation rate of the class:",grad_rate)

Total number of students:  395
Number of features: 30
Number of students who passed:  265
Number of students who failed:  130
Graduation rate of the class: 67.08860759493672


## Preparing the Data
In this section, you will prepare the data for modeling, training, and testing.

### Question-4 Identify feature and target columns


Separate the student data into feature and target columns to see if any features are non-numeric.

In [31]:
# Extract feature columns


In [32]:
f_col = list(data.columns[:-1])

In [33]:
# Extract target column 'passed'

In [34]:
t_col = data.columns[-1] 

In [35]:
# Separate the data into feature data and target data (X and y, respectively)

In [36]:
X = data[f_col]
y = data[t_col]

### Question-5 Preprocess Feature Columns

As you can see, there are several non-numeric columns that need to be converted! Many of them are simply `yes`/`no`, e.g. `internet`. These can be reasonably converted into `1`/`0` (binary) values.

Other columns, like `Mjob` and `Fjob`, have more than two values, and are known as _categorical variables_. The recommended way to handle such a column is to create as many columns as possible values (e.g. `Fjob_teacher`, `Fjob_other`, `Fjob_services`, etc.), and assign a `1` to one of them and `0` to all others.

These generated columns are sometimes called _dummy variables_, and we will use the [`pandas.get_dummies()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html?highlight=get_dummies#pandas.get_dummies) function to perform this transformation. Run the code cell below to perform the preprocessing routine discussed in this section.

In [37]:
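def preprocess_features(X):
    """Convert yes/no columns to 1/0 and expand other categorical columns into dummies."""
    output = pd.DataFrame(index=X.index)
    # DataFrame.iteritems() was removed in pandas 2.0; items() is the equivalent
    for col, col_data in X.items():
        # Binary yes/no columns become 1/0
        if col_data.dtype == object:
            col_data = col_data.replace(['yes', 'no'], [1, 0])
        # Columns that are still non-numeric are categorical: one dummy column per value
        # (dtype=int keeps the 1/0 output shown below on pandas versions that default to bool)
        if col_data.dtype == object:
            col_data = pd.get_dummies(col_data, prefix=col, dtype=int)
        output = output.join(col_data)
    return output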

In [38]:
X = preprocess_features(X)
X.head()

Unnamed: 0,school_GP,school_MS,sex_F,sex_M,age,address_R,address_U,famsize_GT3,famsize_LE3,Pstatus_A,...,higher,internet,romantic,famrel,freetime,goout,Dalc,Walc,health,absences
0,1,0,1,0,18,0,1,1,0,1,...,1,0,0,4,3,4,1,1,3,6
1,1,0,1,0,17,0,1,1,0,0,...,1,1,0,5,3,3,1,1,3,4
2,1,0,1,0,15,0,1,0,1,0,...,1,1,0,4,3,2,2,3,3,10
3,1,0,1,0,15,0,1,1,0,0,...,1,1,1,3,2,2,1,1,5,2
4,1,0,1,0,16,0,1,1,0,0,...,1,0,0,4,3,2,1,2,5,4
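
To make the two encodings described above concrete, here is a minimal sketch on a made-up two-column frame. The name `toy` and its rows are hypothetical; only the column names `Fjob` and `internet` come from the dataset. Passing `dtype=int` is a choice made here so the dummies print as `1`/`0`.

```python
import pandas as pd

# Hypothetical toy rows; only the column names come from the student data
toy = pd.DataFrame({
    "Fjob": ["teacher", "other", "services", "other"],
    "internet": ["no", "yes", "yes", "no"],
})

# Binary yes/no column: map directly to 1/0
internet = toy["internet"].map({"yes": 1, "no": 0})

# Multi-valued column: one indicator column per category
fjob = pd.get_dummies(toy["Fjob"], prefix="Fjob", dtype=int)

encoded = fjob.join(internet)
print(encoded)
```

Each row has exactly one `1` among the `Fjob_*` columns, which is what "assign a `1` to one of them and `0` to all others" means in practice.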


In [39]:
X['Pedu']=X['Medu']+X['Fedu']
X['alc']=X['Dalc']+X['Walc']

In [40]:
X=X.drop(['Medu','Fedu','Dalc','Walc'],axis=1)

In [50]:
X.tail()

Unnamed: 0,school_GP,school_MS,sex_F,sex_M,age,address_R,address_U,famsize_GT3,famsize_LE3,Pstatus_A,...,higher,internet,romantic,famrel,freetime,goout,health,absences,Pedu,alc
390,0,1,0,1,20,0,1,0,1,1,...,1,0,0,5,5,4,4,11,4,9
391,0,1,0,1,17,0,1,0,1,0,...,1,1,0,2,4,5,2,3,4,7
392,0,1,0,1,21,1,0,1,0,0,...,1,0,0,5,5,3,3,3,2,6
393,0,1,0,1,18,1,0,0,1,0,...,1,1,0,4,4,1,5,0,5,7
394,0,1,0,1,19,0,1,0,1,0,...,1,1,0,3,2,3,5,5,2,6


### Question - 6 Implementation: Training and Testing Data Split
So far, we have converted all _categorical_ features into numeric values. For the next step, we split the data (both features and corresponding labels) into training and test sets. You will need to implement the following:
- Randomly shuffle and split the data (`X`, `y`) into training and testing subsets.
  - Use 300 training points (approximately 75%) and 95 testing points (approximately 25%).
  - Set a `random_state` for the function(s) you use, if provided.
  - Store the results in `X_train`, `X_test`, `y_train`, and `y_test`.

In [51]:
# Split the data into training and test sets

from sklearn.model_selection import train_test_split
num_train = 300
num_test = X.shape[0] - num_train
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=num_train, test_size=num_test, random_state=42)


In [56]:
# Show the results of the split
print("No of samples in training set: ",X_train.shape[0])
print("No of samples in testing set: ",X_test.shape[0])


No of samples in training set:  300
No of samples in testing set:  95
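
Because roughly two thirds of the students passed, a purely random split can leave the training and test sets with slightly different class balances. One optional variation, not used in this notebook, is to pass `stratify` to `train_test_split`. The sketch below uses made-up arrays (`X_toy`, `y_toy` are hypothetical) to show the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 samples, 3 features, 14 'yes' and 6 'no'
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(20, 3))
y_toy = np.array(["yes"] * 14 + ["no"] * 6)

# stratify=y_toy keeps the yes/no proportion the same in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=10, stratify=y_toy, random_state=42
)

print(X_tr.shape, X_te.shape)
print("'yes' in train:", int((y_tr == "yes").sum()))
print("'yes' in test: ", int((y_te == "yes").sum()))
```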


### Question - 7  Training and Evaluating Models
In this section, you will choose 3 supervised learning models that are appropriate for this problem and available in `scikit-learn`. You will first discuss the reasoning behind choosing these three models by considering what you know about the data and each model's strengths and weaknesses. You will then fit the model to varying sizes of training data and measure the accuracy score.
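
Fitting on varying training-set sizes can be sketched as a simple loop. The example below is illustrative only: it uses a synthetic dataset from `make_classification` as a stand-in for the student data, and the sizes 100, 200, and 300 are an assumption rather than something this notebook specifies.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the student data (hypothetical, 395 rows)
X_syn, y_syn = make_classification(n_samples=395, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, train_size=300, random_state=42
)

scores = {}
for size in (100, 200, 300):
    model = LogisticRegression(max_iter=1000)
    # Train on the first `size` rows of the already shuffled training set
    model.fit(X_tr[:size], y_tr[:size])
    # Always evaluate on the same held-out test set
    scores[size] = accuracy_score(y_te, model.predict(X_te))

for size, acc in scores.items():
    print(f"train size {size}: test accuracy {acc:.3f}")
```

Keeping the test set fixed while only the training size changes makes the scores comparable across sizes.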


###  Model Application
*List three supervised learning models that are appropriate for this problem. What are the general applications of each model? What are their strengths and weaknesses? Given what you know about the data, why did you choose these models to be applied?*

#### Explanation
#### 1. Logistic Regression
#### 2. Random Forest
#### 3. Decision Tree


In [57]:
# Import the three supervised learning models from sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score,accuracy_score


In [46]:
# predict on the test data 

In [59]:
from sklearn.linear_model import LogisticRegression
logit_model = LogisticRegression(solver='lbfgs', max_iter=1000)
logit_model.fit(X_train,y_train)
y_pred = logit_model.predict(X_test)

In [47]:
# calculate the accuracy score

In [60]:
from sklearn.metrics import f1_score,confusion_matrix,accuracy_score,precision_score,recall_score
print('Accuracy is:',accuracy_score(y_test,y_pred))
print('Precision is:',precision_score(y_test,y_pred,pos_label='yes'))
print('recall is:',recall_score(y_test,y_pred,pos_label='yes'))
print('f1 is:',f1_score(y_test,y_pred,pos_label='yes'))

Accuracy is: 0.7157894736842105
Precision is: 0.7142857142857143
recall is: 0.9166666666666666
f1 is: 0.8029197080291971
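
The four scores above are all functions of the same confusion-matrix counts. The counts in this sketch are not printed by the notebook; they are inferred here as the values consistent with the reported scores on 95 test points, so treat them as an assumption.

```python
# Counts inferred from the reported scores (assumption, not notebook output);
# 'yes' is the positive class, matching pos_label='yes'
tp, fp, fn, tn = 55, 22, 5, 13

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)      # of students predicted 'yes', how many passed
recall = tp / (tp + fn)         # of students who passed, how many were predicted 'yes'
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
# prints: accuracy=0.7158 precision=0.7143 recall=0.9167 f1=0.8029
```

High recall paired with moderate precision suggests the model predicts `yes` for most students, which matters here because the students who need intervention are the `no` cases.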


In [61]:
# Fit model 2 on the training data, predict on the test data, and measure the accuracy

from sklearn.tree import DecisionTreeClassifier
dt_model = DecisionTreeClassifier()
dt_model.fit(X_train,y_train)
y_pred = dt_model.predict(X_test)

In [62]:
print('Accuracy is:',accuracy_score(y_test,y_pred))
print('Precision is:',precision_score(y_test,y_pred,pos_label='yes'))
print('recall is:',recall_score(y_test,y_pred,pos_label='yes'))
print('f1 score is:',f1_score(y_test,y_pred,pos_label='yes'))

Accuracy is: 0.6
Precision is: 0.6666666666666666
recall is: 0.7333333333333333
f1 score is: 0.6984126984126984


In [63]:
# Fit model 3 on the training data, predict on the test data, and measure the accuracy
from sklearn.ensemble import RandomForestClassifier
rf=RandomForestClassifier()
rf.fit(X_train,y_train)
y_pred=rf.predict(X_test)


In [64]:
from sklearn.metrics import f1_score,confusion_matrix
print('Accuracy is:',accuracy_score(y_test,y_pred))
print('Precision is:',precision_score(y_test,y_pred,pos_label='yes'))
print('recall is:',recall_score(y_test,y_pred,pos_label='yes'))
print('f1 score is:',f1_score(y_test,y_pred,pos_label='yes'))

Accuracy is: 0.6526315789473685
Precision is: 0.6551724137931034
recall is: 0.95
f1 score is: 0.7755102040816326
