# Activity: Building a Student Intervention System

# Question 1 - Classification vs. Regression

Your goal for this project is to identify students who might need early intervention before they fail. Which type of supervised learning problem is this, classification or regression? Why?

This is a classification problem. We are developing a model to predict whether a student passes or fails, so the target is a binary label ('yes' or 'no') rather than a continuous value.

# Question-2

Load the necessary Python libraries and the student data. Note that the last column of this dataset, 'passed', will be our target label (whether the student graduated or didn't graduate). All other columns are features about each student.

In [1]:
#import libraries
import numpy as np
import pandas as pd

In [2]:
#Read student data
student_data=pd.read_csv(r'D:\downloads\student-data.csv')
student_data.head()

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,internet,romantic,famrel,freetime,goout,Dalc,Walc,health,absences,passed
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,no,no,4,3,4,1,1,3,6,no
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,yes,no,5,3,3,1,1,3,4,no
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,yes,no,4,3,2,2,3,3,10,yes
3,GP,F,15,U,GT3,T,4,2,health,services,...,yes,yes,3,2,2,1,1,5,2,yes
4,GP,F,16,U,GT3,T,3,3,other,other,...,no,no,4,3,2,1,2,5,4,yes


# Question-3

Let's begin by investigating the dataset to determine how many students we have information on, and learn about the graduation rate among these students. In the code cell below, you will need to compute the following:

The total number of students, n_students.
The total number of features for each student, n_features.
The number of those students who passed, n_passed.
The number of those students who failed, n_failed.
The graduation rate of the class, grad_rate, in percent (%).

In [3]:
n_students=len(student_data.index)
n_features=len(student_data.columns)-1
n_passed=len(student_data[student_data['passed']=='yes'])
n_failed=len(student_data[student_data['passed']=='no'])
grad_rate=(n_passed/n_students)*100

print('The total number of students :',n_students)
print('The total number of features for each student :',n_features)
print('The number of those students who passed :',n_passed)
print('The number of those students who failed :',n_failed)
print('The graduation rate of the class :','%.2f' %grad_rate,'%')

The total number of students : 395
The total number of features for each student : 30
The number of those students who passed : 265
The number of those students who failed : 130
The graduation rate of the class : 67.09 %
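The same counts can also be obtained in one pass with `value_counts()`. The following is a minimal sketch on a small made-up frame (the `toy` data and the `*_toy` names are illustrative assumptions, not the real dataset):

```python
import pandas as pd

# Hypothetical stand-in for the 'passed' column (not the real student data)
toy = pd.DataFrame({"passed": ["yes", "no", "yes", "yes"]})

# value_counts() tallies each label in a single call
counts = toy["passed"].value_counts()
n_passed_toy = int(counts.get("yes", 0))
n_failed_toy = int(counts.get("no", 0))

# Graduation rate as a percentage of all rows
grad_rate_toy = 100 * n_passed_toy / len(toy)

print(n_passed_toy, n_failed_toy, f"{grad_rate_toy:.2f}%")
```

On the real data this would be applied to `student_data['passed']` instead of the toy column.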


# Preparing the Data

In this section, you will prepare the data for modeling, training and testing.

# Question-4 Identify feature and target columns

Separate the student data into feature and target columns to see if any features are non-numeric.

In [4]:
#Extract feature columns

In [5]:
feature_col=student_data.columns[:-1]
feature_col

Index(['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu',
       'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime',
       'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery',
       'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc',
       'Walc', 'health', 'absences'],
      dtype='object')

In [6]:
#Extract target column 'passed'

In [7]:
target_col=student_data.columns[-1]
target_col

'passed'

In [8]:
#Separate the data into feature data and target data (X and y, respectively)

In [9]:
X=student_data[feature_col]
y=student_data[target_col]
X.head()

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,higher,internet,romantic,famrel,freetime,goout,Dalc,Walc,health,absences
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,yes,no,no,4,3,4,1,1,3,6
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,yes,yes,no,5,3,3,1,1,3,4
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,yes,yes,no,4,3,2,2,3,3,10
3,GP,F,15,U,GT3,T,4,2,health,services,...,yes,yes,yes,3,2,2,1,1,5,2
4,GP,F,16,U,GT3,T,3,3,other,other,...,yes,no,no,4,3,2,1,2,5,4
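Rather than spotting non-numeric features by eye in the preview, they can be listed programmatically. This is a small sketch using `select_dtypes` on a made-up frame (the `toy` data is an assumption for illustration only):

```python
import pandas as pd

# Hypothetical three-column stand-in for the feature table
toy = pd.DataFrame({
    "school": ["GP", "MS"],
    "age": [18, 17],
    "internet": ["yes", "no"],
})

# Keep only the columns whose dtype is not numeric
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)
```

Applied to `X`, the same call would list the columns that still need to be encoded.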


# Question-5 Preprocess Feature Columns

As you can see, there are several non-numeric columns that need to be converted! Many of them are simply yes/no, e.g. internet. These can be reasonably converted into 1/0 (binary) values.

Other columns, like Mjob and Fjob, have more than two values, and are known as categorical variables. The recommended way to handle such a column is to create as many columns as possible values (e.g. Fjob_teacher, Fjob_other, Fjob_services, etc.), and assign a 1 to one of them and 0 to all others.

These generated columns are sometimes called dummy variables, and we will use the pandas.get_dummies() function to perform this transformation. Run the code cell below to perform the preprocessing routine discussed in this section.

In [10]:
X=pd.get_dummies(X)
X.head()

Unnamed: 0,age,Medu,Fedu,traveltime,studytime,failures,famrel,freetime,goout,Dalc,...,activities_no,activities_yes,nursery_no,nursery_yes,higher_no,higher_yes,internet_no,internet_yes,romantic_no,romantic_yes
0,18,4,4,2,2,0,4,3,4,1,...,1,0,0,1,0,1,1,0,1,0
1,17,1,1,1,2,0,5,3,3,1,...,1,0,1,0,0,1,0,1,1,0
2,15,1,1,1,2,3,4,3,2,2,...,1,0,0,1,0,1,0,1,1,0
3,15,4,2,1,3,0,3,2,2,1,...,0,1,0,1,0,1,0,1,0,1
4,16,3,3,1,2,0,4,3,2,1,...,1,0,0,1,0,1,1,0,1,0


In [11]:
X.columns

Index(['age', 'Medu', 'Fedu', 'traveltime', 'studytime', 'failures', 'famrel',
       'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences', 'school_GP',
       'school_MS', 'sex_F', 'sex_M', 'address_R', 'address_U', 'famsize_GT3',
       'famsize_LE3', 'Pstatus_A', 'Pstatus_T', 'Mjob_at_home', 'Mjob_health',
       'Mjob_other', 'Mjob_services', 'Mjob_teacher', 'Fjob_at_home',
       'Fjob_health', 'Fjob_other', 'Fjob_services', 'Fjob_teacher',
       'reason_course', 'reason_home', 'reason_other', 'reason_reputation',
       'guardian_father', 'guardian_mother', 'guardian_other', 'schoolsup_no',
       'schoolsup_yes', 'famsup_no', 'famsup_yes', 'paid_no', 'paid_yes',
       'activities_no', 'activities_yes', 'nursery_no', 'nursery_yes',
       'higher_no', 'higher_yes', 'internet_no', 'internet_yes', 'romantic_no',
       'romantic_yes'],
      dtype='object')
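Note that the output above has two columns for every yes/no feature (for example `internet_no` and `internet_yes`), which carry the same information. As a hedged sketch on made-up data, `drop_first=True` keeps one fewer dummy column per categorical feature; the `toy` frame is an illustrative assumption, and `dtype=int` is passed so the result shows 1/0 values:

```python
import pandas as pd

# Hypothetical miniature of the feature table
toy = pd.DataFrame({
    "internet": ["yes", "no", "yes"],
    "Fjob": ["teacher", "other", "services"],
    "age": [18, 17, 15],
})

# One column per category value (the approach used in this notebook)
full = pd.get_dummies(toy, dtype=int)

# Drop the first level of each categorical column to avoid redundant columns
compact = pd.get_dummies(toy, drop_first=True, dtype=int)

print(list(full.columns))
print(list(compact.columns))
```

Whether to drop the redundant columns is a design choice: linear models can be sensitive to perfectly correlated features, while tree-based models are largely unaffected.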

# Question - 6 Implementation: Training and Testing Data Split

So far, we have converted all categorical features into numeric values. For the next step, we split the data (both features and corresponding labels) into training and test sets. You will need to implement the following:



Randomly shuffle and split the data (X, y) into training and testing subsets.

Use 300 training points (approximately 75%) and 95 testing points (approximately 25%).

Set a random_state for the function(s) you use, if provided.

Store the results in X_train, X_test, y_train, and y_test.

In [12]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,random_state=42,test_size=95)

In [13]:
X_train.shape

(300, 56)

In [14]:
X_test.shape

(95, 56)

In [15]:
print('Number of training samples :',X_train.shape[0])
print('Number of testing samples :',X_test.shape[0])

Number of training samples : 300
Number of testing samples : 95
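Because roughly two thirds of the students passed, a purely random split can leave the two subsets with slightly different pass rates. One option, shown here as a sketch on synthetic data with the same size and class counts as reported earlier (the `X_demo`/`y_demo` names and the single random feature are assumptions), is to pass `stratify` to `train_test_split`:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 395 rows, 265 'yes' and 130 'no' labels
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({"feature": rng.normal(size=395)})
y_demo = pd.Series(["yes"] * 265 + ["no"] * 130, name="passed")

# stratify=y_demo keeps the yes/no proportions similar in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=95, random_state=42, stratify=y_demo
)

print(X_tr.shape[0], X_te.shape[0])
print(y_tr.value_counts(normalize=True).round(3).to_dict())
print(y_te.value_counts(normalize=True).round(3).to_dict())
```

The split in this notebook does not use stratification, so the results reported below correspond to the plain random split.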


# Question - 7 Training and Evaluating Models

In this section, you will choose 3 supervised learning models that are appropriate for this problem and available in scikit-learn. You will first discuss the reasoning behind choosing these three models by considering what you know about the data and each model's strengths and weaknesses. You will then fit the model to varying sizes of training data and measure the accuracy score.

# Model Application

List three supervised learning models that are appropriate for this problem. What are the general applications of each model? What are their strengths and weaknesses? Given what you know about the data, why did you choose these models to be applied?

The three supervised learning models that I will be testing are Logistic Regression, Decision Tree Classifier and Support Vector Machine.

Logistic Regression

General Applications:

Logistic regression is used in various fields, including machine learning, most medical fields, and social sciences.

Strengths:

Logistic regression is easy to implement and interpret, and very efficient to train.

Weakness:
    
It can only predict discrete (categorical) outcomes, and it assumes a linear relationship between the features and the log-odds of the outcome.

Decision Tree

General Applications:
    
Marketing, customer retention, and diagnosis of diseases and ailments.

Strengths:
    
Decision trees require less data preparation effort during pre-processing.

Weakness:
    
A small change in the data can cause a large change in the structure of the decision tree causing instability.

Support Vector Machine

General Applications:
    
Face detection, protein classification, cancer classification, and handwriting recognition.

Strengths:
    
SVM is effective in high-dimensional spaces. It is also relatively memory efficient, because the decision function uses only a subset of the training points (the support vectors).

Weakness:
    
The SVM algorithm is not well suited to large data sets, because training time grows quickly with the number of samples.

In [16]:
import warnings
warnings.filterwarnings('ignore')

In [17]:
# Import the three supervised learning models from sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

In [18]:
# Fit model 1 on the training data
# Logistic Regression
logit_model=LogisticRegression()
X_train_100=X_train[:100]
y_train_100=y_train[:100]

X_train_200=X_train[:200]
y_train_200=y_train[:200]

X_train_300=X_train[:300]
y_train_300=y_train[:300]

logit_model.fit(X_train_100,y_train_100)
y_pred=logit_model.predict(X_test)

from sklearn.metrics import accuracy_score
print('accuracy for 100 train size is :',accuracy_score(y_test,y_pred))

accuracy for 100 train size is : 0.6842105263157895


In [19]:
logit_model.fit(X_train_200,y_train_200)
y_pred=logit_model.predict(X_test)
print('accuracy for 200 train size is :',accuracy_score(y_test,y_pred))

accuracy for 200 train size is : 0.6947368421052632


In [20]:
logit_model.fit(X_train_300,y_train_300)
y_pred=logit_model.predict(X_test)
print('accuracy for 300 train size is :',accuracy_score(y_test,y_pred))

accuracy for 300 train size is : 0.7157894736842105


In [21]:
#Decision Tree Classifier
dt_model=DecisionTreeClassifier()
dt_model.fit(X_train_100,y_train_100)
y_pred=dt_model.predict(X_test)
print('accuracy for 100 train size is :',accuracy_score(y_test,y_pred))

accuracy for 100 train size is : 0.5368421052631579


In [22]:
dt_model.fit(X_train_200,y_train_200)
y_pred=dt_model.predict(X_test)
print('accuracy for 200 train size is :',accuracy_score(y_test,y_pred))

accuracy for 200 train size is : 0.6631578947368421


In [23]:
dt_model.fit(X_train_300,y_train_300)
y_pred=dt_model.predict(X_test)
print('accuracy for 300 train size is :',accuracy_score(y_test,y_pred))

accuracy for 300 train size is : 0.5263157894736842


In [24]:
#SVM
svm_linear=SVC(kernel='linear')
svm_linear.fit(X_train_100,y_train_100)
y_pred=svm_linear.predict(X_test)
print('accuracy for 100 train size is :',accuracy_score(y_test,y_pred))

accuracy for 100 train size is : 0.6842105263157895


In [25]:
svm_linear.fit(X_train_200,y_train_200)
y_pred=svm_linear.predict(X_test)
print('accuracy for 200 train size is :',accuracy_score(y_test,y_pred))

accuracy for 200 train size is : 0.6736842105263158


In [26]:
svm_linear.fit(X_train_300,y_train_300)
y_pred=svm_linear.predict(X_test)
print('accuracy for 300 train size is :',accuracy_score(y_test,y_pred))

accuracy for 300 train size is : 0.6947368421052632


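Of the three models, Logistic Regression has the highest test accuracy when trained on all 300 points (about 0.716, versus about 0.695 for the linear SVM and about 0.526 for the decision tree). So I think Logistic Regression is the best-suited model for this dataset.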
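The nine fit-and-score cells above repeat the same pattern, so they could be condensed into a loop. The sketch below is an illustration only: it runs on synthetic data from `make_classification` (an assumption, since the CSV is not available here), so its numbers will not match the ones reported above. It also reports the F1 score next to accuracy, which can be more informative when the classes are imbalanced.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 395 samples, roughly 67% positive
X_syn, y_syn = make_classification(
    n_samples=395, n_features=20, weights=[0.33, 0.67], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=95, random_state=42
)

# Same three model families as above; random_state fixes the tree's tie-breaking
models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=42),
    "SVC (linear)": SVC(kernel="linear"),
}

results = []
for name, model in models.items():
    for size in (100, 200, 300):
        # Refit on the first `size` training rows, then score on the fixed test set
        model.fit(X_tr[:size], y_tr[:size])
        y_hat = model.predict(X_te)
        results.append(
            (name, size, accuracy_score(y_te, y_hat), f1_score(y_te, y_hat))
        )

for name, size, acc, f1 in results:
    print(f"{name:24s} n={size:3d} accuracy={acc:.3f} f1={f1:.3f}")
```

The decision tree accuracies reported above (0.537, 0.663, 0.526) do not rise steadily with training size. One possible contributor is that `DecisionTreeClassifier()` was created without a `random_state`, so repeated runs may give different trees; fixing the seed, as in this sketch, makes the comparison repeatable.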