# PBT57057 - Smart Academic Advisory

We used the SOCS student dataset (Kemanggisan and Alam Sutera campuses) covering 2010 to 2021 (odd terms only). Most features in the dataset are categorical.  
     
**The goal is to build the best possible classification model for predicting which students will be of the NR type. The results are expected to help the SASC team identify NR and non-NR students earlier.**

### Description of the Student Dataset (Dataset Mahasiswa)   
    
`Status` = Student status (Lulus = graduated; Belum Lulus = not yet graduated; Dismissal; Aktif = active; Cuti = on leave)   
`NIM`    = Student ID number (Nomor Induk Mahasiswa)   
`Name`   = Student name   
`IntOrg` = Membership in an internal organization (N = not a member; Y = member)    
`ExtOrg` = Membership in an external organization (N = not a member; Y = member)    
`PartInAcadComp` = Participation in academic competitions (N = never; Y = has participated)  
`Term` = Semester code (1420 --> Binusian 14, even semester; 1410 --> Binusian 14, odd semester)      
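
The term encoding described above can be sketched as a small helper. This is an illustration only: `decode_term` is a hypothetical name, and the `30` suffix for short terms is an assumption taken from the cleansing cells later in this notebook, which refer to `xx30` codes as short-term records.

```python
def decode_term(term):
    """Split a term code such as 1410 into (Binusian year, semester type)."""
    year, suffix = divmod(int(term), 100)
    # 10 = odd, 20 = even (from the description above); 30 = short term (assumption)
    kinds = {10: "odd", 20: "even", 30: "short"}
    if suffix not in kinds:
        raise ValueError(f"unexpected term suffix: {suffix}")
    return year, kinds[suffix]


print(decode_term(1410))  # (14, 'odd')
print(decode_term(1420))  # (14, 'even')
```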
    
       
### Description of the Demographic Data (Data Demografi)   
`Term`  = Semester   
`nofom` = Registration form number   
`NIM`   = Student ID number   
`Name`  = Student name    
`BinusianID` = Binusian ID   
`acad_group` = Academic group (SOCS)    
`Status` = Student status (Undur Diri = withdrawn; Lulus = graduated; Dismissal; Aktif = active; Cuti = on leave)    
`Age`  = Student age (computed from date of birth up to 2022)    
`Gender` = Gender     
`ScholarshipStatus` = Type of scholarship received    
`EnglishScore` = BINUS TOEFL PBT test score    
`RangeSalaryFa` = Father's salary as a range, where jt = juta (million) (4 categories: < 10jt; 10jt-19.99jt; 20jt-29.99jt; >= 30jt)   
`SalaryFa` = Father's salary (6 categories: 3jt, 5jt, 10jt, 15jt, 20jt, 30jt)   
`TuitionLevel` = Tuition fee charged to the student   
`Address` = Student address    
`FaJob` = Father's occupation (Pegawai negeri sipil = civil servant; Pegawai swasta = private-sector employee; Wiraswasta = self-employed; ABRI = armed forces; Tidak bekerja = not working; Pensiun = retired; Guru = teacher; Lain - Lain = other; PTS (Perguruan Tinggi Swasta) = private university; PTN (Perguruan Tinggi Negeri) = state university; Petani = farmer)   
`MoJob` = Mother's occupation (Pegawai negeri sipil = civil servant; Pegawai swasta = private-sector employee; Wiraswasta = self-employed; ABRI = armed forces; Tidak bekerja = not working; Pensiun = retired; Guru = teacher; Lain - Lain = other; PTS (Perguruan Tinggi Swasta) = private university; PTN (Perguruan Tinggi Negeri) = state university; Petani = farmer)   
`EducationFa` = Father's highest education (MASTER; Sarjana = bachelor's degree; DOCTOR; Tamat SLTA = completed senior high school; DIPLOMA (D3); Diploma(D4); Diploma(D2); Diploma(D1); Tamat SMP = completed junior high school; Tamat SD = completed primary school; Specialist 1; Tidak Tamat SD = did not complete primary school; Specialist 2; High School (SMA))   
`EducationMo` = Mother's highest education (MASTER; Sarjana = bachelor's degree; DOCTOR; Tamat SLTA = completed senior high school; DIPLOMA (D3); Diploma(D4); Diploma(D2); Diploma(D1); Tamat SMP = completed junior high school; Tamat SD = completed primary school; Specialist 1; Tidak Tamat SD = did not complete primary school; Specialist 2; High School (SMA))   
`StatusFa` = Father's status (Masih Hidup = alive; Telah meninggal = deceased)    
`StatusMo` = Mother's status (Masih Hidup = alive; Telah Meninggal = deceased)    
`fixed`    = Variable used when extracting the data    
`variable` = Variable used when extracting the data      
   
### Description of the Student Performance Dataset (Dataset Prestasi Mahasiswa)   
   
`Term`   = Semester   
`NIM`     = Student ID number   
`Jurusan/Program` = Study program   
`KategoriJurusan` = Program category (Ganda = double; reguler = regular)   
`LokasiKuliah` = Campus location (Kemanggisan; Alam Sutera)   
`Angkatan`  = Student intake year (2000 - 2020)   
`PeriodeMasuk` = Semester (entry period)   
`Semesterke` = Current semester number   
`SKSLulusSemesterBerjalan` = Total credits (SKS) the student has taken     
`StatusIPK` = GPA (IPK) status (IPK Kurang = below 2.00; OK = >= 2.00)     
`StatusSKS` = Credit (SKS) status (SKS Kurang = cumulative credits below the multiple of 15 credits per semester, or more than 10 semesters; OK)     
`Evaluasi` = Evaluation of the student's academic performance (NR; Middle --> GPA 2.00 - 2.99; High --> GPA 3.00 - 4.00)       

### **LIBRARIES**

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import category_encoders as ce
import pydotplus
import math
import pickle 
import warnings

from matplotlib.cm import get_cmap
from matplotlib.patches import Patch
from sklearn.model_selection import  train_test_split
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score, classification_report, confusion_matrix
from sklearn.preprocessing import OneHotEncoder
import sklearn.linear_model as lm
from sklearn.tree import export_graphviz, export_text
from six import StringIO
from IPython.display import Image  
import statsmodels.api as sm
from scipy import stats
from catboost import Pool
from imblearn.over_sampling import SMOTE

pd.options.mode.chained_assignment = None  # default='warn'
warnings.filterwarnings("ignore")

# A. Data Collection #

### **READ DATA** ###

In [2]:
#Read data mahasiswa
data = pd.read_excel (r'datamahasiswa.xlsx') 

#Read data demografi mahasiswa
data_demo = pd.read_excel (r'DemografiMahasiswa.xlsx') 

#Read data prestasi mahasiswa
data_prestasi = pd.read_excel (r'timsasc.xlsx') 

### **CREATE DATA FRAMES & COPY of THEM** ###

In [3]:
df = pd.DataFrame(data)
df_demo = pd.DataFrame(data_demo)
df_pres = pd.DataFrame(data_prestasi)

#make an independent copy of each dataframe (plain assignment would only create an alias)
df_copy = df.copy()
df_demo_copy = df_demo.copy()
df_pres_copy = df_pres.copy()
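
Whether a backup really protects the data depends on how it is made. The sketch below, on a made-up two-row frame, shows the difference between binding a second name to the same DataFrame and calling `.copy()`; the column values are illustrative only.

```python
import pandas as pd

original = pd.DataFrame({"NIM": [1, 2], "Term": [1410, 1420]})

alias = original          # second name for the same object, nothing is duplicated
backup = original.copy()  # independent copy of the data

original.loc[0, "Term"] = 9999

print(alias.loc[0, "Term"])   # 9999, the alias sees the change
print(backup.loc[0, "Term"])  # 1410, the copy is unaffected
```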

### Student Data Cleansing ###

In [4]:
#remove unimportant features
df = df.drop(['BinusianID', 'Name', 'Status'], axis = 1)

#remove duplicate rows
df = df.drop_duplicates(keep='first').reset_index(drop=True)

# Remove short-term records (term codes ending in 30)
removal_list = [1330,1430,1530,1630,1730,1830,1930,2030]
df = df[~df['Term'].isin(removal_list)]

# drop missing values from term attribute
df = df.dropna(subset=['Term'])

# Convert type of variables for merging purpose
df['Term']=df['Term'].astype('Int64')
df['NIM']=df['NIM'].astype('Int64')
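
As an alternative to a hard-coded list, the short-term filter could be written with modular arithmetic. This is only a sketch on toy data, and it assumes that every short-term code ends in 30, which is consistent with the list used above but is not confirmed for other years.

```python
import pandas as pd

toy = pd.DataFrame({"NIM": [1, 1, 2, 2], "Term": [1410, 1430, 1420, None]})

# Drop rows without a term, then cast to the nullable integer type
toy = toy.dropna(subset=["Term"]).astype({"Term": "Int64"})

# Keep every row whose term code does not end in 30
regular = toy[toy["Term"] % 100 != 30].reset_index(drop=True)
print(regular)
```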

### Demographic Data Cleansing ###

In [5]:
#Remove Unimportant Features
df_demo = df_demo.drop(['nofom', 'acad_Career', 'Name', 'BinusianID','acad_group', 'Age', 'RangeSalaryFa', 'RangeSalaryMo','SalaryMo','SalaryFa','TuitionLevel','Address', 'fixed', 'variable'], axis = 1)

#replace - with nan
df_demo = df_demo.replace('-', np.nan)

#remove short-term records (xx30) and term 2120
df_demo = df_demo.drop(df_demo[df_demo.Term.isin([1330,1430,1530,1630,1730,1830,1930,2030,2120])].index)

#remove duplicate rows
df_demo = df_demo.drop_duplicates(subset = ['NIM','Term'],keep = 'first').reset_index(drop=True)

#remove NA from Term attribute
df_demo = df_demo.dropna(subset=['Term'])

#Convert term type for merging purpose
df_demo['Term']=df_demo['Term'].astype('Int64')

### Performance Data Cleansing ###

In [6]:
#rename values
df_pres.loc[df_pres["JurusanOrProgram"] == "Computer Science", "JurusanOrProgram"] = 'A'
df_pres.loc[df_pres["JurusanOrProgram"] == "Computer Science & Mathematics", "JurusanOrProgram"] = 'B'
df_pres.loc[df_pres["JurusanOrProgram"] == "Computer Science & Statistics", "JurusanOrProgram"] = 'C'
df_pres.loc[df_pres["JurusanOrProgram"] == "Mobile Application & Technology", "JurusanOrProgram"] = 'D'
df_pres.loc[df_pres["JurusanOrProgram"] == "Game Application & Technology", "JurusanOrProgram"] = 'E'
df_pres.loc[df_pres["JurusanOrProgram"] == "Computer Science - Global Class", "JurusanOrProgram"] = 'F'
df_pres.loc[df_pres["JurusanOrProgram"] == "Master of Information Technology - Master Track", "JurusanOrProgram"] = 'H'
df_pres.loc[df_pres["JurusanOrProgram"] == "Cyber Security", "JurusanOrProgram"] = 'G'

### MERGING DATAFRAMES ###

In [7]:
# specify a left join—also known as a left outer join—with the how parameter. 
#Using a left outer join will leave your new merged DataFrame with all rows from the left DataFrame, 
#while discarding rows from the right DataFrame that don’t have a match in the key column of the left DataFrame.

finalDF = pd.merge(df_demo,df, on=["Term", "NIM"], how="left")
finalDF = pd.merge(finalDF,df_pres, on=["Term", "NIM"], how="left")

#make an independent copy of finalDF for backup
finalDF_copy=finalDF.copy()
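
A left join keeps every row of the left frame even when the right frame has no match, so it can be useful to check how many rows went unmatched. The sketch below uses made-up frames and the `indicator=True` option of `pd.merge`; it is not part of the original pipeline.

```python
import pandas as pd

left = pd.DataFrame({"Term": [1410, 1410, 1420], "NIM": [1, 2, 1],
                     "Gender": ["Laki-Laki", "Perempuan", "Laki-Laki"]})
right = pd.DataFrame({"Term": [1410, 1420], "NIM": [1, 1], "IntOrg": ["Y", "N"]})

# indicator=True adds a _merge column: 'both' or 'left_only' for a left join
merged = pd.merge(left, right, on=["Term", "NIM"], how="left", indicator=True)
unmatched = merged[merged["_merge"] == "left_only"]

print(merged["_merge"].value_counts())
print(unmatched[["Term", "NIM"]])
```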

# B. Data Preprocessing #

### 1. Data Cleansing ###

In [8]:
#take only students with records in Semester 1-5 (no filter on 'Angkatan' is applied in this cell)
#use a second reference to finalDF for building the NIM index
finalDFF = finalDF

finalDFF = finalDFF[finalDFF['SemesterKe'].isin([1,2,3,4,5])][['NIM','SemesterKe']]
df1g = finalDFF.groupby(['NIM']).count()
df1g = df1g.reset_index()
#keep students that have exactly five records across semesters 1-5
df1g = df1g[df1g['SemesterKe']==5]
#make a list of NIM
index_NIM = list(df1g['NIM'])
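
Counting rows per student selects students with five records, which equals "all five semesters present" only if no semester is duplicated. A possible stricter check, sketched here on toy data, counts distinct semesters with `nunique`; the frame and names are hypothetical.

```python
import pandas as pd

records = pd.DataFrame({
    "NIM":        [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3],
    "SemesterKe": [1, 2, 3, 4, 5, 1, 1, 2, 3, 4, 1, 2],
})
wanted = [1, 2, 3, 4, 5]

in_range = records[records["SemesterKe"].isin(wanted)]
per_student = in_range.groupby("NIM")["SemesterKe"].agg(["count", "nunique"])

# Student 2 has five rows but only four distinct semesters
complete = per_student.index[per_student["nunique"] == len(wanted)].tolist()
print(per_student)
print(complete)  # [1]
```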

In [9]:
#Select the rows of the indexed students (index_NIM) that fall in semesters 1-5;
#.copy() makes df independent of finalDF so the in-place edits below are safe
df = finalDF.loc[(finalDF['NIM'].isin(index_NIM)) & finalDF['SemesterKe'].isin([1,2,3,4,5])].copy()

In [10]:
#remove NA from label attribute
df.dropna(subset=['Evaluasi'], inplace=True)

#Set label
#Middle and High are set as 0 (non-NR) while Non Reguler is set as 1
df.loc[df["Evaluasi"] == "Non Reguler", "Evaluasi"] = 1
df.loc[df["Evaluasi"] == "Middle", "Evaluasi"] = 0
df.loc[df["Evaluasi"] == "High", "Evaluasi"] = 0

In [11]:
#Remove unimportant features
df = df.drop(['NIM','Term', 'Status_x'], axis = 1)
df.rename(columns = {'Status_y':'Status'}, inplace = True)
df = df.drop(['Angkatan','IPKTerakhir', 'SKSKumulatifTerakhir'], axis = 1)

In [12]:
#remove unknown values from Gender column
df = df[df.Gender != 'Unknown']

In [13]:
#Remove all spaces from the education values
#(the category names matched below, e.g. 'TamatSD', contain no spaces)
df['EducationFa'] = df['EducationFa'].str.replace(" ","")
df['EducationMo'] = df['EducationMo'].str.replace(" ","")

In [14]:
# Remove missing values in Father's & Mother's Education Columns
df = df[df.EducationFa != 'N/A']
df = df[df.EducationMo != 'N/A']

### 2. Data Transformation ###

In [15]:
#Replace the numeric English score with an English level
#(compare as numbers: comparing against strings such as '550' is lexicographic)
score = pd.to_numeric(df['EnglishScore'], errors='coerce')
df['EnglishScore'] = df['EnglishScore'].astype(object)
df.loc[score > 550, 'EnglishScore'] = 'Advance'
df.loc[(score <= 550) & (score >= 467), 'EnglishScore'] = 'Intermediate'
df.loc[score < 467, 'EnglishScore'] = 'Beginner'
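
Score thresholds are safest when compared numerically, because comparing digit strings is lexicographic (character by character). A small sketch with made-up scores illustrates the difference:

```python
import pandas as pd

scores = pd.Series(["97", "467", "550", "600"])

as_text = scores > "550"                 # lexicographic comparison
as_number = pd.to_numeric(scores) > 550  # numeric comparison

print(as_text.tolist())    # [True, False, False, True]
print(as_number.tolist())  # [False, False, False, True]
```

The two-digit score `"97"` is treated as greater than `"550"` in the text comparison because `"9"` sorts after `"5"`.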

In [16]:
#EducationFa attribute
df.loc[(df['EducationFa'] == 'DOCTOR(S3)') | (df['EducationFa'] == 'MASTER(S2)') | (df['EducationFa'] == 'SPECIALIST2(Sp.2)'), 'EducationFa'] = 'level1'

df.loc[(df['EducationFa'] == 'Sarjana(S1)') | (df['EducationFa'] == 'SPECIALIST1(Sp.1)') | (df['EducationFa'] == 'DIPLOMA(D4)')  | (df['EducationFa'] == 'Diploma(D4)')|
       (df['EducationFa'] == 'DIPLOMA(D3)') | (df['EducationFa'] == 'DIPLOMA(D2)')|(df['EducationFa'] == 'Diploma(D2)') | (df['EducationFa'] == 'Diploma(D1)'), 'EducationFa'] = 'level2'

df.loc[(df['EducationFa'] == 'TidaktamatSD') | (df['EducationFa'] == 'TamatSD') | (df['EducationFa'] == 'TamatSMP')|
             (df['EducationFa'] == 'HIGHSCHOOL(SMA)') | (df['EducationFa'] == 'TamatSLTA'), 'EducationFa'] = 'level3'     

In [17]:
#EducationMo attribute
df.loc[(df['EducationMo'] == 'DOCTOR(S3)') | (df['EducationMo'] == 'MASTER(S2)') | (df['EducationMo'] == 'SPECIALIST2(Sp.2)'), 'EducationMo'] = 'level1'

df.loc[(df['EducationMo'] == 'Sarjana(S1)') | (df['EducationMo'] == 'SPECIALIST1(Sp.1)')| (df['EducationMo'] == 'DIPLOMA(D4)') | (df['EducationMo'] == 'Diploma(D4)') | (df['EducationMo'] == 'DIPLOMA(D3)') | (df['EducationMo'] == 'DIPLOMA(D2)')
             |(df['EducationMo'] == 'Diploma(D2)') | (df['EducationMo'] == 'Diploma(D1)'), 'EducationMo'] = 'level2'

df.loc[(df['EducationMo'] == 'TidaktamatSD') | (df['EducationMo'] == 'TamatSD') | (df['EducationMo'] == 'TamatSMP')|
             (df['EducationMo'] == 'HIGHSCHOOL(SMA)') | (df['EducationMo'] == 'TamatSLTA'), 'EducationMo']= 'level3'
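
The same grouping could also be expressed as a lookup table, which keeps the category lists in one place. The sketch below is a hypothetical alternative, not the method used in this notebook, and `EDUCATION_LEVEL` lists only a subset of the categories handled above.

```python
import pandas as pd

# Partial, illustrative lookup table based on the groupings above
EDUCATION_LEVEL = {
    "DOCTOR(S3)": "level1", "MASTER(S2)": "level1", "SPECIALIST2(Sp.2)": "level1",
    "Sarjana(S1)": "level2", "Diploma(D1)": "level2",
    "TamatSD": "level3", "TamatSLTA": "level3",
}

education = pd.Series(["MASTER(S2)", "TamatSD", "Diploma(D1)", "N/A"])

# Map known categories; values missing from the table keep their original label
grouped = education.map(EDUCATION_LEVEL).fillna(education)
print(grouped.tolist())  # ['level1', 'level3', 'level2', 'N/A']
```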

In [18]:
#Scholarship attribute
df['ScholarshipStatus'] = df['ScholarshipStatus'].str.replace(" ","")
df.loc[(df['ScholarshipStatus'] == 'BINUSIAN') | (df['ScholarshipStatus'] == 'BeasiswaKaryawan')
               | (df['ScholarshipStatus'] == 'SIBLINGSCHOLARSHIP')| (df['ScholarshipStatus'] == 'SiblingScholarship')
               | (df['ScholarshipStatus'] == 'BeasiswaAnakKaryawan') | (df['ScholarshipStatus'] == 'BINUSIANCOMMUNITYSCHOLARSHIP')
               | (df['ScholarshipStatus'] == 'BeasiswaBINUSAmbassador')
                | (df['ScholarshipStatus'] == 'School:BinusianCommunityandEarlyBird')
                | (df['ScholarshipStatus'] == 'School:BinusianCommunityorEarlyBird'), 'ScholarshipStatus'] = 'binusian'

df.loc[ (df['ScholarshipStatus'] == 'TalentDevelopmentProgram')|
             (df['ScholarshipStatus'] == 'NationDevelopmentProgram') | (df['ScholarshipStatus'] == 'BeasiswaBINUSINTERNATIONALSCHOOL') |                                                   
             (df['ScholarshipStatus'] == 'DirectAdmissionBINUSINTERNATIONALSCHOOL') | (df['ScholarshipStatus'] == 'KerjasamaASAK') |                                                     
            (df['ScholarshipStatus'] == 'BeasiswaAnakGuru') |(df['ScholarshipStatus'] == 'BeasiswaKerjasamaBINUS-AyoKuliah')
             |(df['ScholarshipStatus'] == 'BEASISWAJURUSAN') |(df['ScholarshipStatus'] == 'BeasiswaJurusan')                                                
             |(df['ScholarshipStatus'] == 'BeasiswaJuaraKompas-BINUS')|(df['ScholarshipStatus'] == 'BeasiswaKhususEducationExpo')
            |(df['ScholarshipStatus'] == 'BeasiswaTalentMapping')|(df['ScholarshipStatus'] == 'Beasiswa')
            |(df['ScholarshipStatus'] == 'BeasiswaUndanganSekolahKhusus')|(df['ScholarshipStatus'] == 'BeasiswaTPKS') | (df['ScholarshipStatus'] == 'BEASISWATPKSKHUSUS')
            | (df['ScholarshipStatus'] == 'BeasiswaTPKSKhusus(NonRefundable)') |(df['ScholarshipStatus'] == 'BeasiswaBIDIKMISI')
            | (df['ScholarshipStatus'] == 'WidiaPartialScholarship')|(df['ScholarshipStatus'] == 'widia')
             | (df['ScholarshipStatus'] == 'WidiaScholarshipforOutstandingAchievers'), 'ScholarshipStatus'] = 'Other'

df.loc[(df['ScholarshipStatus'] == 'PendaftaranBiasa') | (df['ScholarshipStatus'] == 'PendaftaranBiasa(EarlyBatch)')|
        (df['ScholarshipStatus'] == 'School:Regular')|(df['ScholarshipStatus'] == 'Kalbis(TeknikInformatikadanMatematika2018)'), 'ScholarshipStatus'] = 'regular'

In [19]:
#Father's Job attribute
df.loc[(df['FaJob'] == 'ABRI') | (df['FaJob'] == 'Guru')| (df['FaJob'] == 'Lain - Lain')
             | (df['FaJob'] == 'PTN (Perguruan Tinggi Negeri)') | (df['FaJob'] == 'Wiraswasta') 
       | (df['FaJob'] == 'PTS (Perguruan Tinggi Swasta)'), 'FaJob'] = 'Other'
                                                                           
df.loc[(df['FaJob'] == 'Pensiun') | (df['FaJob'] == 'Tidak bekerja'), 'FaJob'] = 'Unemployement'
df.loc[(df['FaJob'] == 'Pegawai negeri sipil') | (df['FaJob'] == 'Pegawai swasta'), 'FaJob'] = 'Employee'

In [20]:
#Mother's Job attribute
df.loc[(df['MoJob'] == 'ABRI') | (df['MoJob'] == 'Guru')| (df['MoJob'] == 'Lain - Lain')| (df['MoJob'] == 'Petani')
             | (df['MoJob'] == 'PTN (Perguruan Tinggi Negeri)') | (df['MoJob'] == 'Wiraswasta') 
       | (df['MoJob'] == 'PTS (Perguruan Tinggi Swasta)'), 'MoJob'] = 'Other'
                                                                           
df.loc[(df['MoJob'] == 'Pensiun') | (df['MoJob'] == 'Tidak bekerja'), 'MoJob'] = 'Unemployement'
df.loc[(df['MoJob'] == 'Pegawai negeri sipil') | (df['MoJob'] == 'Pegawai swasta'), 'MoJob'] = 'Employee'

#### Data Imputation ####

In [21]:
#most frequent value (mode) of every column
mode = df.mode(axis=0, numeric_only = False).iloc[0]

#impute missing categorical values with the column mode
cols = ['ScholarshipStatus','EnglishScore', 'FaJob', 'MoJob', 'EducationFa',
       'EducationMo', 'StatusFa', 'StatusMo', 'IntOrg', 'ExtOrg',
       'PartInAcadComp', 'PartInNonacadCom']
df[cols]=df[cols].fillna(mode[cols])
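
Mode imputation replaces each missing value with the most frequent value of its column. A minimal sketch on a toy frame (the values are made up) shows the effect:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "FaJob": ["Employee", "Employee", np.nan, "Other"],
    "IntOrg": ["N", np.nan, "N", "Y"],
})

# mode() returns one row per tied mode; the first row holds the top value per column
modes = toy.mode().iloc[0]
filled = toy.fillna(modes)
print(filled)
```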

In [22]:
#Export dataframe into a xlsx file
df.to_excel('finaldataset.xlsx', index=False)

### 3. Create Dummy Variables ###

In [23]:
cat_vars=['Gender', 'ScholarshipStatus', 'EnglishScore', 'FaJob', 'MoJob',
       'EducationFa', 'EducationMo', 'StatusFa', 'StatusMo', 'IntOrg',
       'ExtOrg', 'PartInAcadComp', 'PartInNonacadCom', 'JurusanOrProgram',
       'KategoriJurusan', 'LokasiKuliah', 'Status', 'CekIPK','CekSKS']
for var in cat_vars:
    #one-hot encode the column and append the dummy columns to df
    cat_list = pd.get_dummies(df[var], prefix=var)
    df = df.join(cat_list)
    
data_vars=df.columns.values.tolist()
to_keep=[i for i in data_vars if i not in cat_vars]

data_final=df[to_keep].copy()
data_final.rename(columns={'Gender_Laki-Laki': 'Gender_Male', 'Gender_Perempuan': 'Gender_Female', 'StatusMo_Masih Hidup': 'StatusMo_alive', 'StatusMo_Telah Meninggal': 'StatusMo_died', 'LokasiKuliah_Alam Sutera' : 'LokasiKuliah_AlamSutera'}, inplace=True)
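
For reference, `pd.get_dummies` can also encode several columns in one call and drop the originals, which would replace the loop and the `to_keep` filter. This is a sketch on toy data rather than the approach used above:

```python
import pandas as pd

toy = pd.DataFrame({
    "Gender": ["Laki-Laki", "Perempuan", "Laki-Laki"],
    "IntOrg": ["N", "Y", "N"],
    "SemesterKe": [1, 2, 3],
})

# Columns listed in `columns` are replaced by their dummy columns
encoded = pd.get_dummies(toy, columns=["Gender", "IntOrg"], dtype=int)
print(encoded.columns.tolist())
# ['SemesterKe', 'Gender_Laki-Laki', 'Gender_Perempuan', 'IntOrg_N', 'IntOrg_Y']
```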

### 4. Split training and testing data set ###

In [24]:
X = data_final 
train = X.loc[(X['SemesterKe'] <= 4) ]
test = X.loc[X['SemesterKe'] == 5]
train = train.drop(['SemesterKe'], axis = 1)
test = test.drop(['SemesterKe'], axis = 1)
test = test.sample(n = 230)  # no random_state is set, so the sampled rows differ between runs

#Export test dataframe to a xlsx file
test.to_excel('test.xlsx', index=False)
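
If a repeatable test sample is wanted, `DataFrame.sample` accepts a `random_state`. The sketch below uses a made-up frame and an arbitrary seed of 42; neither comes from the original run.

```python
import pandas as pd

pool = pd.DataFrame({"SemesterKe": [5] * 10, "Evaluasi": [0, 1] * 5})

# The same seed yields the same rows on every call
first = pool.sample(n=4, random_state=42)
second = pool.sample(n=4, random_state=42)

print(first.index.equals(second.index))  # True
```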

In [25]:
# Creates list of all column headers for train and test sets
all_columns = list(train) 
train[all_columns] = train[all_columns].astype('int64')

all_columns = list(test)
test[all_columns] = test[all_columns].astype('int64')

In [26]:
#train set
X_train = train.loc[:,train.columns != 'Evaluasi']
y_train = train.loc[:, train.columns == 'Evaluasi']
#test set
X_test = test.loc[:,test.columns != 'Evaluasi']
y_test = test.loc[:, test.columns == 'Evaluasi']

### 5. Data Balancing ###

In [27]:
#Oversample the training and testing sets with SMOTE
#(the test metrics below are therefore computed on resampled test data)
smote = SMOTE(random_state = 42)
X, y = smote.fit_resample(X_train, y_train)
X1, y1 = smote.fit_resample(X_test, y_test)

# C. Data Modelling & Evaluation #

In [28]:
import statsmodels.api as sm
logit_model=sm.Logit(y,X.loc[:, ['Gender_Male', 'Gender_Female', 'ScholarshipStatus_Other',
       'ScholarshipStatus_binusian', 'ScholarshipStatus_regular',
       'EnglishScore_Advance', 'EnglishScore_Beginner',
       'EnglishScore_Intermediate', 'FaJob_Employee', 'FaJob_Other',
       'FaJob_Unemployement', 'MoJob_Employee', 'MoJob_Other',
       'MoJob_Unemployement', 'EducationFa_level1', 'EducationFa_level2',
       'EducationFa_level3', 'EducationMo_level1', 'EducationMo_level2',
       'EducationMo_level3',  'StatusMo_alive','LokasiKuliah_AlamSutera',
       'LokasiKuliah_Kemanggisan']])
result=logit_model.fit()
print(result.summary())

Optimization terminated successfully.
         Current function value: 0.411771
         Iterations 11
                           Logit Regression Results                           
Dep. Variable:               Evaluasi   No. Observations:                 8216
Model:                          Logit   Df Residuals:                     8193
Method:                           MLE   Df Model:                           22
Date:                Wed, 07 Sep 2022   Pseudo R-squ.:                  0.4059
Time:                        11:23:58   Log-Likelihood:                -3383.1
converged:                       True   LL-Null:                       -5694.9
Covariance Type:            nonrobust   LLR p-value:                     0.000
                                 coef    std err          z      P>|z|      [0.025      0.975]
----------------------------------------------------------------------------------------------
Gender_Male                   -2.8426      0.618     -4.600      0.000     

In [29]:
# Logistic Regression Model Fitting
from sklearn.linear_model import LogisticRegression
var = ['Gender_Male', 'Gender_Female', 'ScholarshipStatus_Other',
       'ScholarshipStatus_binusian', 'ScholarshipStatus_regular',
       'EnglishScore_Advance', 'EnglishScore_Beginner',
       'EnglishScore_Intermediate', 'FaJob_Employee', 'FaJob_Other',
       'FaJob_Unemployement', 'MoJob_Employee', 'MoJob_Other',
       'MoJob_Unemployement', 'EducationFa_level1', 'EducationFa_level2',
       'EducationFa_level3', 'EducationMo_level1', 'EducationMo_level2',
       'EducationMo_level3',  'StatusMo_alive','LokasiKuliah_AlamSutera',
       'LokasiKuliah_Kemanggisan']
X = X[var]
X1 = X1[var]
logreg = LogisticRegression()
logreg.fit(X, y)

LogisticRegression()

In [30]:
# Check prediction result
y_pred = logreg.predict(X1)
print(classification_report(y1,y_pred))

              precision    recall  f1-score   support

           0       0.99      0.87      0.93       222
           1       0.89      0.99      0.94       222

    accuracy                           0.93       444
   macro avg       0.94      0.93      0.93       444
weighted avg       0.94      0.93      0.93       444
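
The f1-score column in the report is the harmonic mean of precision and recall. As a worked check, the helper below (a hypothetical name) recomputes it from the rounded precision and recall values printed above, so the results are approximations of what scikit-learn computed from the raw counts.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


print(round(f1_from(0.99, 0.87), 2))  # 0.93 (class 0)
print(round(f1_from(0.89, 0.99), 2))  # 0.94 (class 1)
```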



# D. Model Deployment #

In [31]:
### Create a Pickle file for deployment using serialization  
import pickle
with open("modelLogR.pkl","wb") as pickle_out:
    pickle.dump(logreg, pickle_out)
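
On the deployment side, the pickle file is read back with `pickle.load`. The round trip is sketched below with a plain dictionary standing in for the fitted model and a temporary directory standing in for the deployment path; both are placeholders, not part of the original notebook.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted model; any picklable Python object behaves the same way
model = {"name": "modelLogR", "coef": [0.1, -0.2]}

path = os.path.join(tempfile.mkdtemp(), "modelLogR.pkl")

with open(path, "wb") as fh:
    pickle.dump(model, fh)

with open(path, "rb") as fh:
    restored = pickle.load(fh)

print(restored == model)  # True
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.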

### Results of Data Analysis :  ###
   
1. The raw data contains many missing and duplicated values, especially the student dataset provided by the IT Unit.   
2. Several attributes were considered unimportant and were removed.   
3. Here are the statistical results (mode: the most common value of each feature)   
Status                           Aktif (active)   
Gender                       Laki-Laki (male)   
ScholarshipStatus    Pendaftaran Biasa (regular admission)   
EnglishScore                   Advance (PBT score > 550)   
FaJob                       Wiraswasta (self-employed)   
MoJob                    Tidak bekerja (not working)    
EducationFa                     level3 (did not complete primary school, primary school, junior high school, SMA, or senior high school (SLTA))    
EducationMo                     level3 (did not complete primary school, primary school, junior high school, SMA, or senior high school (SLTA))     
StatusFa                   Masih Hidup (alive)    
StatusMo                   Masih Hidup (alive)      
IntOrg                               N (not a member of any organization inside BINUS)    
ExtOrg                               N (not a member of any organization outside BINUS)    
PartInAcadComp                       N (never participated in an academic competition)     
PartInNonacadCom                     N (never participated in a non-academic competition)    
JurusanOrProgram                   CSP (Computer Science Program)    
KategoriJurusan                Reguler    
LokasiKuliah               Kemanggisan    
StatusKuliah                     Aktif (active)    
StatusIPK                           OK (>= 2.00)    
StatusSKS                           OK (not below the multiple of 15 SKS)     
Evaluasi                             0     

4. Total number of students: 28066    
Number of students who passed: 24474    
Number of students who failed: 3592   
SP3 rate of the class: 12.80%    

5. There are more non-NR observations than NR observations.    
   
6. IPK Status = Kurang (IPK < 2.00) is the most influential factor in the rate of NR students, followed by SKS Status (cumulative credits below the multiple of 15 per term).   
    
7. It is interesting to note that Father Status (deceased/alive) is the third most influential factor in the student NR rate. Campus location (Kemanggisan/Alam Sutera) contributes only a small amount to the NR rate.   
    
8. Overall, over the last 10 years, the Computer Science Program has had the highest number of NR students, followed by the Cyber Security Program. It is interesting to note that the Mobile Application & Technology (MAT) Program and the Game Application & Technology (GAT) Program contributed a similar number of NR students.   
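
The SP3 rate in item 4 is consistent with the share of failed students in the total. The check below assumes that this is how the rate was defined; the variable names are illustrative.

```python
total_students = 28066
passed = 24474
failed = 3592

# The two groups add up to the reported total
assert passed + failed == total_students

sp3_rate = failed / total_students * 100
print(f"{sp3_rate:.2f}%")  # 12.80%
```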