# Final Submission: Solving an Educational Institution's Problem

- Name: Lusi Aulia Jati
- Email: lusiauliajati@gmail.com
- Dicoding ID: lusiaulia

#### Background

Jaya Jaya Institut is a higher-education institution founded in 2000. To date it has produced many graduates with an excellent reputation. However, many students also fail to complete their studies, that is, they drop out.

This high number of dropouts is a serious problem for an educational institution. Jaya Jaya Institut therefore wants to detect students who are likely to drop out as early as possible so that they can be given special guidance.

As an aspiring data scientist, you are asked to help Jaya Jaya Institut solve this problem. They have provided a dataset that you can access. They have also asked for a dashboard so that they can easily understand the data and monitor student performance.

#### Objective
We will build a machine learning model to detect students who are likely to drop out. The modeling process is carried out in this Jupyter notebook.

## Preparation

### Preparing the required libraries

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample
from sklearn.utils import shuffle
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
import joblib
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV


### Preparing the data

In [2]:
student_df = pd.read_csv("https://raw.githubusercontent.com/dicodingacademy/dicoding_dataset/main/students_performance/data.csv",
                         sep=';')
student_df

URLError: <urlopen error [Errno 11001] getaddrinfo failed>
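The `URLError` above means the dataset could not be downloaded in this run, so the later cells have nothing to work on. One possible way to make the notebook usable offline is to keep a local copy and download only when it is missing. This is a minimal sketch, not part of the original notebook: the helper `load_student_data` and the file name `data.csv` are assumptions made for illustration.

```python
from pathlib import Path

import pandas as pd

DATA_URL = (
    "https://raw.githubusercontent.com/dicodingacademy/dicoding_dataset/"
    "main/students_performance/data.csv"
)


def load_student_data(local_path="data.csv", url=DATA_URL):
    """Read the semicolon-separated dataset, preferring a local copy.

    `local_path` is a hypothetical file name, not part of the original notebook.
    """
    path = Path(local_path)
    if path.exists():
        return pd.read_csv(path, sep=";")
    # No local copy yet: download once and cache it for later offline runs.
    df = pd.read_csv(url, sep=";")
    df.to_csv(path, sep=";", index=False)
    return df
```

With this helper, `student_df = load_student_data()` would replace the direct `pd.read_csv` call once a copy of the file has been saved next to the notebook.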

## Data Understanding

In [None]:
student_df.info()

The `Status` column is the target we will predict.

In [None]:
student_df.describe()

More detail about the dataset:

A dataset created from a higher education institution (acquired from several disjoint databases) related to students enrolled in different undergraduate degrees, such as agronomy, design, education, nursing, journalism, management, social service, and technologies. The dataset includes information known at the time of student enrollment (academic path, demographics, and social-economic factors) and the students' academic performance at the end of the first and second semesters. The data is used to build classification models to predict students' dropout and academic success.

| Column name | Description |
| --- | --- |
|Marital status | The marital status of the student. (Categorical) 1 – single 2 – married 3 – widower 4 – divorced 5 – facto union 6 – legally separated |
| Application mode | The method of application used by the student. (Categorical) 1 - 1st phase - general contingent 2 - Ordinance No. 612/93 5 - 1st phase - special contingent (Azores Island) 7 - Holders of other higher courses 10 - Ordinance No. 854-B/99 15 - International student (bachelor) 16 - 1st phase - special contingent (Madeira Island) 17 - 2nd phase - general contingent 18 - 3rd phase - general contingent 26 - Ordinance No. 533-A/99, item b2) (Different Plan) 27 - Ordinance No. 533-A/99, item b3 (Other Institution) 39 - Over 23 years old 42 - Transfer 43 - Change of course 44 - Technological specialization diploma holders 51 - Change of institution/course 53 - Short cycle diploma holders 57 - Change of institution/course (International)|
|Application order | The order in which the student applied. (Numerical) Application order (between 0 - first choice; and 9 last choice) |
|Course | The course taken by the student. (Categorical) 33 - Biofuel Production Technologies 171 - Animation and Multimedia Design 8014 - Social Service (evening attendance) 9003 - Agronomy 9070 - Communication Design 9085 - Veterinary Nursing 9119 - Informatics Engineering 9130 - Equinculture 9147 - Management 9238 - Social Service 9254 - Tourism 9500 - Nursing 9556 - Oral Hygiene 9670 - Advertising and Marketing Management 9773 - Journalism and Communication 9853 - Basic Education 9991 - Management (evening attendance)|
|Daytime/evening attendance | Whether the student attends classes during the day or in the evening. (Categorical) 1 – daytime 0 - evening |
|Previous qualification| The qualification obtained by the student before enrolling in higher education. (Categorical) 1 - Secondary education 2 - Higher education - bachelor's degree 3 - Higher education - degree 4 - Higher education - master's 5 - Higher education - doctorate 6 - Frequency of higher education 9 - 12th year of schooling - not completed 10 - 11th year of schooling - not completed 12 - Other - 11th year of schooling 14 - 10th year of schooling 15 - 10th year of schooling - not completed 19 - Basic education 3rd cycle (9th/10th/11th year) or equiv. 38 - Basic education 2nd cycle (6th/7th/8th year) or equiv. 39 - Technological specialization course 40 - Higher education - degree (1st cycle) 42 - Professional higher technical course 43 - Higher education - master (2nd cycle) |
|Previous qualification (grade) | Grade of previous qualification (between 0 and 200) |
| Nacionality | The nationality of the student. (Categorical) 1 - Portuguese; 2 - German; 6 - Spanish; 11 - Italian; 13 - Dutch; 14 - English; 17 - Lithuanian; 21 - Angolan; 22 - Cape Verdean; 24 - Guinean; 25 - Mozambican; 26 - Santomean; 32 - Turkish; 41 - Brazilian; 62 - Romanian; 100 - Moldova (Republic of); 101 - Mexican; 103 - Ukrainian; 105 - Russian; 108 - Cuban; 109 - Colombian|
|Mother's qualification | The qualification of the student's mother. (Categorical) 1 - Secondary Education - 12th Year of Schooling or Eq. 2 - Higher Education - Bachelor's Degree 3 - Higher Education - Degree 4 - Higher Education - Master's 5 - Higher Education - Doctorate 6 - Frequency of Higher Education 9 - 12th Year of Schooling - Not Completed 10 - 11th Year of Schooling - Not Completed 11 - 7th Year (Old) 12 - Other - 11th Year of Schooling 14 - 10th Year of Schooling 18 - General commerce course 19 - Basic Education 3rd Cycle (9th/10th/11th Year) or Equiv. 22 - Technical-professional course 26 - 7th year of schooling 27 - 2nd cycle of the general high school course 29 - 9th Year of Schooling - Not Completed 30 - 8th year of schooling 34 - Unknown 35 - Can't read or write 36 - Can read without having a 4th year of schooling 37 - Basic education 1st cycle (4th/5th year) or equiv. 38 - Basic Education 2nd Cycle (6th/7th/8th Year) or Equiv. 39 - Technological specialization course 40 - Higher education - degree (1st cycle) 41 - Specialized higher studies course 42 - Professional higher technical course 43 - Higher Education - Master (2nd cycle) 44 - Higher Education - Doctorate (3rd cycle)|
|Father's qualification | The qualification of the student's father. (Categorical) 1 - Secondary Education - 12th Year of Schooling or Eq. 2 - Higher Education - Bachelor's Degree 3 - Higher Education - Degree 4 - Higher Education - Master's 5 - Higher Education - Doctorate 6 - Frequency of Higher Education 9 - 12th Year of Schooling - Not Completed 10 - 11th Year of Schooling - Not Completed 11 - 7th Year (Old) 12 - Other - 11th Year of Schooling 13 - 2nd year complementary high school course 14 - 10th Year of Schooling 18 - General commerce course 19 - Basic Education 3rd Cycle (9th/10th/11th Year) or Equiv. 20 - Complementary High School Course 22 - Technical-professional course 25 - Complementary High School Course - not concluded 26 - 7th year of schooling 27 - 2nd cycle of the general high school course 29 - 9th Year of Schooling - Not Completed 30 - 8th year of schooling 31 - General Course of Administration and Commerce 33 - Supplementary Accounting and Administration 34 - Unknown 35 - Can't read or write 36 - Can read without having a 4th year of schooling 37 - Basic education 1st cycle (4th/5th year) or equiv. 38 - Basic Education 2nd Cycle (6th/7th/8th Year) or Equiv. 39 - Technological specialization course 40 - Higher education - degree (1st cycle) 41 - Specialized higher studies course 42 - Professional higher technical course 43 - Higher Education - Master (2nd cycle) 44 - Higher Education - Doctorate (3rd cycle) |
| Mother's occupation | The occupation of the student's mother. (Categorical) 0 - Student 1 - Representatives of the Legislative Power and Executive Bodies, Directors, Directors and Executive Managers 2 - Specialists in Intellectual and Scientific Activities 3 - Intermediate Level Technicians and Professions 4 - Administrative staff 5 - Personal Services, Security and Safety Workers and Sellers 6 - Farmers and Skilled Workers in Agriculture, Fisheries and Forestry 7 - Skilled Workers in Industry, Construction and Craftsmen 8 - Installation and Machine Operators and Assembly Workers 9 - Unskilled Workers 10 - Armed Forces Professions 90 - Other Situation 99 - (blank) 122 - Health professionals 123 - teachers 125 - Specialists in information and communication technologies (ICT) 131 - Intermediate level science and engineering technicians and professions 132 - Technicians and professionals, of intermediate level of health 134 - Intermediate level technicians from legal, social, sports, cultural and similar services 141 - Office workers, secretaries in general and data processing operators 143 - Data, accounting, statistical, financial services and registry-related operators 144 - Other administrative support staff 151 - personal service workers 152 - sellers 153 - Personal care workers and the like 171 - Skilled construction workers and the like, except electricians 173 - Skilled workers in printing, precision instrument manufacturing, jewelers, artisans and the like 175 - Workers in food processing, woodworking, clothing and other industries and crafts 191 - cleaning workers 192 - Unskilled workers in agriculture, animal production, fisheries and forestry 193 - Unskilled workers in extractive industry, construction, manufacturing and transport 194 - Meal preparation assistants |
| Father's occupation | The occupation of the student's father. (Categorical) 0 - Student 1 - Representatives of the Legislative Power and Executive Bodies, Directors, Directors and Executive Managers 2 - Specialists in Intellectual and Scientific Activities 3 - Intermediate Level Technicians and Professions 4 - Administrative staff 5 - Personal Services, Security and Safety Workers and Sellers 6 - Farmers and Skilled Workers in Agriculture, Fisheries and Forestry 7 - Skilled Workers in Industry, Construction and Craftsmen 8 - Installation and Machine Operators and Assembly Workers 9 - Unskilled Workers 10 - Armed Forces Professions 90 - Other Situation 99 - (blank) 101 - Armed Forces Officers 102 - Armed Forces Sergeants 103 - Other Armed Forces personnel 112 - Directors of administrative and commercial services 114 - Hotel, catering, trade and other services directors 121 - Specialists in the physical sciences, mathematics, engineering and related techniques 122 - Health professionals 123 - teachers 124 - Specialists in finance, accounting, administrative organization, public and commercial relations 131 - Intermediate level science and engineering technicians and professions 132 - Technicians and professionals, of intermediate level of health 134 - Intermediate level technicians from legal, social, sports, cultural and similar services 135 - Information and communication technology technicians 141 - Office workers, secretaries in general and data processing operators 143 - Data, accounting, statistical, financial services and registry-related operators 144 - Other administrative support staff 151 - personal service workers 152 - sellers 153 - Personal care workers and the like 154 - Protection and security services personnel 161 - Market-oriented farmers and skilled agricultural and animal production workers 163 - Farmers, livestock keepers, fishermen, hunters and gatherers, subsistence 171 - Skilled construction workers and the like, except electricians 172 - Skilled workers in metallurgy, metalworking and similar 174 - Skilled workers in electricity and electronics 175 - Workers in food processing, woodworking, clothing and other industries and crafts 181 - Fixed plant and machine operators 182 - assembly workers 183 - Vehicle drivers and mobile equipment operators 192 - Unskilled workers in agriculture, animal production, fisheries and forestry 193 - Unskilled workers in extractive industry, construction, manufacturing and transport 194 - Meal preparation assistants 195 - Street vendors (except food) and street service providers |
| Admission grade | Admission grade (between 0 and 200) |
| Displaced | Whether the student is a displaced person. (Categorical) 	1 – yes 0 – no |
| Educational special needs | Whether the student has any special educational needs. (Categorical) 1 – yes 0 – no |
|Debtor | Whether the student is a debtor. (Categorical) 1 – yes 0 – no|
|Tuition fees up to date | Whether the student's tuition fees are up to date. (Categorical) 1 – yes 0 – no|
|Gender | The gender of the student. (Categorical) 1 – male 0 – female |
|Scholarship holder | Whether the student is a scholarship holder. (Categorical) 1 – yes 0 – no |
|Age at enrollment | The age of the student at the time of enrollment. (Numerical)|
|International | Whether the student is an international student. (Categorical) 1 – yes 0 – no|
|Curricular units 1st sem (credited) | The number of curricular units credited by the student in the first semester. (Numerical) |
| Curricular units 1st sem (enrolled) | The number of curricular units enrolled by the student in the first semester. (Numerical) |
| Curricular units 1st sem (evaluations) | The number of curricular units evaluated by the student in the first semester. (Numerical) |
| Curricular units 1st sem (approved) | The number of curricular units approved by the student in the first semester. (Numerical) |

## Data Preparation / Preprocessing

In [None]:
# df = student_df.copy()
# df

In [None]:
# def transform_status(value):
#     if value == "Dropout":
#         return 0
#     elif value == "Graduate" :
#         return 1
#     else:
#         return 2

In [None]:
# df["Status"] = df["Status"].map(transform_status)
# df

In [None]:
student_df.isna().sum()

In [None]:
student_df

There are no missing values in the data. Next, split the data into 90% training data and 10% test data.

In [None]:
train_df, test_df = train_test_split(student_df, test_size=0.1, random_state=42, shuffle=True)
train_df.reset_index(drop=True, inplace=True)
test_df.reset_index(drop=True, inplace=True)
 
print(train_df.shape)
print(test_df.shape)
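The split above shuffles the rows but does not stratify on `Status`, so the class proportions in the 10% test set can drift from those of the full dataset. A possible variation (an assumption, not what this notebook does) is to pass `stratify=` to `train_test_split`. The toy frame below stands in for `student_df`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data standing in for student_df: 70 "Graduate" rows and 30 "Dropout" rows.
toy_df = pd.DataFrame({
    "feature": range(100),
    "Status": ["Graduate"] * 70 + ["Dropout"] * 30,
})

# stratify= asks for the same class proportions in both splits.
toy_train, toy_test = train_test_split(
    toy_df, test_size=0.1, random_state=42, stratify=toy_df["Status"]
)

print(toy_test["Status"].value_counts().to_dict())
```

On this toy data the 10-row test set keeps the 70/30 ratio of the full frame.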

In [None]:
print(student_df['Status'].value_counts())

In [None]:
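print(len(student_df.Status))
# Share of students whose Status is "Dropout", computed instead of hard-coding the count (1421)
print((student_df.Status == "Dropout").mean() * 100)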

Of 4,424 students in total, 32.12% dropped out, which is a very high rate. Next, look at the training data.

In [None]:
print(train_df['Status'].value_counts())

In the training data, the 'Status' classes are not well balanced, so we balance them by undersampling.

In [None]:
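df_majority_1 = train_df[(train_df.Status == "Graduate")]
df_majority_2 = train_df[(train_df.Status == "Dropout")]
df_minority = train_df[(train_df.Status == "Enrolled")]

# Undersample each larger class down to the size of the smallest class.
# replace=False draws without replacement, so no row is duplicated.
n_minority = len(df_minority)
df_majority_1_undersampled = resample(df_majority_1, replace=False, n_samples=n_minority, random_state=42)
df_majority_2_undersampled = resample(df_majority_2, replace=False, n_samples=n_minority, random_state=42)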

print(df_majority_1_undersampled.shape)
print(df_majority_2_undersampled.shape)

In [None]:
undersampled_train_df = pd.concat([df_minority, df_majority_1_undersampled]).reset_index(drop=True)
undersampled_train_df = pd.concat([undersampled_train_df, df_majority_2_undersampled]).reset_index(drop=True)
undersampled_train_df = shuffle(undersampled_train_df, random_state=42)
undersampled_train_df.reset_index(drop=True, inplace=True)
undersampled_train_df.sample(5)

The status classes are now balanced. Next, we split the data into X and y, where X holds the input features and y the corresponding target (output).
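The undersampling idea can be sketched without pandas: for every class, draw as many rows as the smallest class has, without replacement, then shuffle. The function name `undersample_indices` and the toy labels are assumptions made for illustration. The notebook itself uses `sklearn.utils.resample`.

```python
import numpy as np


def undersample_indices(labels, random_state=42):
    """Return shuffled row indices with every class cut down to the smallest class size."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(random_state)
    classes, counts = np.unique(labels, return_counts=True)
    n_keep = counts.min()
    kept = [
        # replace=False: each original row is picked at most once.
        rng.choice(np.flatnonzero(labels == cls), size=n_keep, replace=False)
        for cls in classes
    ]
    return rng.permutation(np.concatenate(kept))


# Toy labels (not the real data): 5 Graduate, 4 Dropout, 2 Enrolled.
toy_labels = ["Graduate"] * 5 + ["Dropout"] * 4 + ["Enrolled"] * 2
idx = undersample_indices(toy_labels)
print(np.unique(np.asarray(toy_labels)[idx], return_counts=True))
```

Every class ends up with two rows here, the size of the smallest class, and no index appears twice.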


In [None]:
X_train = undersampled_train_df.drop(columns="Status", axis=1)
y_train = pd.DataFrame()
y_train["Status"] = undersampled_train_df["Status"]
 
X_test = test_df.drop(columns="Status", axis=1)
y_test = pd.DataFrame()
y_test["Status"] = test_df["Status"]

In [None]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

In [None]:
# student_df.to_csv("student_data.csv")

Next, we define helper functions to scale the numeric features and to convert the Status data into integers.

In [None]:
def scaling(features, df, df_test=None):
    if df_test is not None:
        df = df.copy()
        df_test = df_test.copy()
        for feature in features:
            scaler = MinMaxScaler()
            X = np.asanyarray(df[feature])
            X = X.reshape(-1,1)
            scaler.fit(X)
            df["{}".format(feature)] = scaler.transform(X)
            joblib.dump(scaler, "./scaler_{}.joblib".format(feature))
            
            X_test = np.asanyarray(df_test[feature])
            X_test = X_test.reshape(-1,1)
            df_test["{}".format(feature)] = scaler.transform(X_test)
        return df, df_test
    else:
        df = df.copy()
        for feature in features:
            scaler = MinMaxScaler()
            X = np.asanyarray(df[feature])
            X = X.reshape(-1,1)
            scaler.fit(X)
            df["{}".format(feature)] = scaler.transform(X)
            joblib.dump(scaler, "./scaler_{}.joblib".format(feature))
        return df
    
def encoding(features, df, df_test=None):
    if df_test is not None:
        df = df.copy()
        df_test = df_test.copy()
        for feature in features:
            encoder = LabelEncoder()
            encoder.fit(df[feature])
            df["{}".format(feature)] = encoder.transform(df[feature])
            joblib.dump(encoder, "./encoder_{}.joblib".format(feature))
            
            df_test["{}".format(feature)] = encoder.transform(df_test[feature])
        return df, df_test
    else:
        df = df.copy()
        for feature in features:
            encoder = LabelEncoder()
            encoder.fit(df[feature])
            df["{}".format(feature)] = encoder.transform(df[feature])
            joblib.dump(encoder, "./encoder_{}.joblib".format(feature))
        return df
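The `scaling` helper fits each `MinMaxScaler` on the training column only and then reuses it for the test column. The arithmetic behind that can be sketched as follows, with the hypothetical helpers `fit_min_max` and `apply_min_max` and toy numbers. Note that a test value outside the training range lands outside [0, 1].

```python
import numpy as np


def fit_min_max(train_values):
    """Learn the minimum and range of one feature from the training data."""
    train_values = np.asarray(train_values, dtype=float)
    lo = train_values.min()
    span = train_values.max() - lo
    # Guard against a constant column, where max - min would be 0.
    return lo, (span if span != 0 else 1.0)


def apply_min_max(values, lo, span):
    """Rescale values with statistics learned earlier: (x - min) / (max - min)."""
    return (np.asarray(values, dtype=float) - lo) / span


lo, span = fit_min_max([10, 12, 20])          # toy training values
print(apply_min_max([10, 15, 20], lo, span))  # values inside the training range
print(apply_min_max([25], lo, span))          # test value above the training maximum
```

Fitting on the training data alone keeps information from the test set out of the preprocessing step.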

In [None]:
import joblib
# numerical_columns = ['Application_order', 'Previous_qualification_grade',
#        'Admission_grade',
#        'Age_at_enrollment',
#        'Curricular_units_1st_sem_credited',
#        'Curricular_units_1st_sem_enrolled',
#        'Curricular_units_1st_sem_evaluations',
#        'Curricular_units_1st_sem_approved', 'Curricular_units_1st_sem_grade',
#        'Curricular_units_1st_sem_without_evaluations',
#        'Curricular_units_2nd_sem_credited',
#        'Curricular_units_2nd_sem_enrolled',
#        'Curricular_units_2nd_sem_evaluations',
#        'Curricular_units_2nd_sem_approved', 'Curricular_units_2nd_sem_grade',
#        'Curricular_units_2nd_sem_without_evaluations','Inflation_rate','GDP']

# columns_all =['Marital_status', 'Application_mode', 'Application_order', 'Course',
#        'Daytime_evening_attendance', 'Previous_qualification',
#        'Previous_qualification_grade', 'Nacionality', 'Mothers_qualification',
#        'Fathers_qualification', 'Mothers_occupation', 'Fathers_occupation',
#        'Admission_grade', 'Displaced', 'Educational_special_needs', 'Debtor',
#        'Tuition_fees_up_to_date', 'Gender', 'Scholarship_holder',
#        'Age_at_enrollment', 'International',
#        'Curricular_units_1st_sem_credited',
#        'Curricular_units_1st_sem_enrolled',
#        'Curricular_units_1st_sem_evaluations',
#        'Curricular_units_1st_sem_approved', 'Curricular_units_1st_sem_grade',
#        'Curricular_units_1st_sem_without_evaluations',
#        'Curricular_units_2nd_sem_credited',
#        'Curricular_units_2nd_sem_enrolled',
#        'Curricular_units_2nd_sem_evaluations',
#        'Curricular_units_2nd_sem_approved', 'Curricular_units_2nd_sem_grade',
#        'Curricular_units_2nd_sem_without_evaluations', 'Unemployment_rate',
#        'Inflation_rate', 'GDP']

columns_all =[
       'Tuition_fees_up_to_date', 
       'Curricular_units_1st_sem_credited',
       'Curricular_units_1st_sem_enrolled',
       'Curricular_units_1st_sem_evaluations',
       'Curricular_units_1st_sem_approved', 'Curricular_units_1st_sem_grade',
       'Curricular_units_1st_sem_without_evaluations',
       'Curricular_units_2nd_sem_credited',
       'Curricular_units_2nd_sem_enrolled',
       'Curricular_units_2nd_sem_evaluations',
       'Curricular_units_2nd_sem_approved', 'Curricular_units_2nd_sem_grade',
       'Curricular_units_2nd_sem_without_evaluations']

# categ_columns = ['Marital_status', 'Application_mode', 'Course',
#        'Daytime_evening_attendance', 'Previous_qualification',
#        'Previous_qualification_grade', 'Nacionality', 'Mothers_qualification',
#        'Fathers_qualification', 'Mothers_occupation', 'Fathers_occupation',
#        'Displaced', 'Educational_special_needs', 'Debtor',
#        'Tuition_fees_up_to_date', 'Gender', 'Scholarship_holder',
#        'Unemployment_rate',
#        'Inflation_rate']

categorical_columns = ['Status']
X_train_scaler, X_test_scaler = scaling(columns_all, X_train, X_test)
y_train, y_test = encoding(categorical_columns, y_train, y_test)
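`LabelEncoder` assigns integer codes in the sorted order of the distinct labels, so the three statuses become `Dropout` = 0, `Enrolled` = 1, `Graduate` = 2. This differs from the commented-out `transform_status` function earlier, which maps `Graduate` to 1 and everything else to 2. A small sketch of that ordering, using a hypothetical helper `label_mapping`:

```python
def label_mapping(labels):
    """Mimic the ordering LabelEncoder uses: sorted unique labels -> 0, 1, 2, ..."""
    return {label: code for code, label in enumerate(sorted(set(labels)))}


mapping = label_mapping(["Graduate", "Dropout", "Enrolled", "Dropout"])
print(mapping)  # {'Dropout': 0, 'Enrolled': 1, 'Graduate': 2}
```

Keeping this mapping in mind helps when reading the integer predictions in the evaluation step.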

### PCA

In [None]:
# Status is a string column, so restrict the correlation to numeric columns
corr = student_df.corr(numeric_only=True)

f, ax = plt.subplots(figsize=(35, 10))

mask = np.triu(np.ones_like(corr, dtype=bool))

cmap = sns.diverging_palette(230, 20, as_cmap=True)

sns.heatmap(corr, annot=True, mask = mask, cmap=cmap)

In [None]:
pca_numerical_columns_1 = [
'Curricular_units_1st_sem_credited',
       'Curricular_units_1st_sem_enrolled',
       'Curricular_units_1st_sem_evaluations',
       'Curricular_units_1st_sem_approved', 'Curricular_units_1st_sem_grade',
       'Curricular_units_1st_sem_without_evaluations',
       'Curricular_units_2nd_sem_credited',
       'Curricular_units_2nd_sem_enrolled',
       'Curricular_units_2nd_sem_evaluations',
       'Curricular_units_2nd_sem_approved', 'Curricular_units_2nd_sem_grade',
       'Curricular_units_2nd_sem_without_evaluations'
]

# pca_numerical_columns_2 = ['Mothers_occupation', 'Fathers_occupation']
# pca_numerical_columns_3 = ['International', 'Nacionality']

In [None]:
train_pca_df = X_train_scaler.copy().reset_index(drop=True)
test_pca_df = X_test_scaler.copy().reset_index(drop=True)

In [None]:
from sklearn.decomposition import PCA
 
pca = PCA(n_components=len(pca_numerical_columns_1), random_state=123)
pca.fit(train_pca_df[pca_numerical_columns_1])
princ_comp = pca.transform(train_pca_df[pca_numerical_columns_1])
 
var_exp = pca.explained_variance_ratio_.round(3)
cum_var_exp = np.cumsum(var_exp)
 
plt.bar(range(len(pca_numerical_columns_1)), var_exp, alpha=0.5, align='center', label='individual explained variance')
plt.step(range(len(pca_numerical_columns_1)), cum_var_exp, where='mid', label='cumulative explained variance')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal component index')
plt.legend(loc='best')
plt.show()

The figure above shows the explained variance of each principal component. It shows that taking only 2 principal components already captures more than 80% of the variance. This means we can represent the whole feature group (12 features) with just 2 principal components.
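Reading the number of components off the plot can also be done numerically: take the cumulative sum of the explained variance ratios and find the first position that reaches the chosen threshold. The sketch below uses a hypothetical helper `components_for_threshold` and made-up ratios. In the notebook the input would be `pca.explained_variance_ratio_`.

```python
import numpy as np


def components_for_threshold(explained_variance_ratio, threshold=0.8):
    """Smallest number of leading components whose cumulative variance reaches threshold."""
    cumulative = np.cumsum(explained_variance_ratio)
    # First index where cumulative >= threshold, converted to a 1-based count.
    count = np.searchsorted(cumulative, threshold, side="left") + 1
    # Cap at the number of components in case the threshold is never reached.
    return int(min(count, len(cumulative)))


# Hypothetical ratios for illustration only, not the notebook's actual values.
toy_ratios = [0.55, 0.30, 0.10, 0.05]
print(components_for_threshold(toy_ratios, threshold=0.8))
```

For these toy ratios the first two components already pass the 80% mark.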

In [None]:
pca_1 = PCA(n_components=2, random_state=123)
pca_1.fit(train_pca_df[pca_numerical_columns_1])
joblib.dump(pca_1, "pca_{}.joblib".format(1))
princ_comp_1 = pca_1.transform(train_pca_df[pca_numerical_columns_1])
train_pca_df[["pc1_1", "pc1_2"]] = pd.DataFrame(princ_comp_1, columns=["pc1_1", "pc1_2"])
train_pca_df.drop(columns=pca_numerical_columns_1, axis=1, inplace=True)
train_pca_df.head()

In [None]:
# pca = PCA(n_components=len(pca_numerical_columns_2), random_state=123)
# pca.fit(train_pca_df[pca_numerical_columns_2])
# princ_comp = pca.transform(train_pca_df[pca_numerical_columns_2])
 
# var_exp = pca.explained_variance_ratio_.round(3)
# cum_var_exp = np.cumsum(var_exp)
 
# plt.bar(range(len(pca_numerical_columns_2)), var_exp, alpha=0.5, align='center', label='individual explained variance')
# plt.step(range(len(pca_numerical_columns_2)), cum_var_exp, where='mid', label='cumulative explained variance')
# plt.ylabel('Explained variance ratio')
# plt.xlabel('Principal component index')
# plt.legend(loc='best')
# plt.show()

In [None]:
# pca_2 = PCA(n_components=1, random_state=123)
# pca_2.fit(train_pca_df[pca_numerical_columns_2])
# joblib.dump(pca_2, "pca_{}.joblib".format(2))
# princ_comp_2 = pca_2.transform(train_pca_df[pca_numerical_columns_2])
# train_pca_df[["pc2_1"]] = pd.DataFrame(princ_comp_2, columns=["pc2_1"])
# train_pca_df.drop(columns=pca_numerical_columns_2, axis=1, inplace=True)
# print(train_pca_df.shape)
# train_pca_df.head()

In [None]:
# pca = PCA(n_components=len(pca_numerical_columns_3), random_state=123)
# pca.fit(train_pca_df[pca_numerical_columns_3])
# princ_comp = pca.transform(train_pca_df[pca_numerical_columns_3])
 
# var_exp = pca.explained_variance_ratio_.round(3)
# cum_var_exp = np.cumsum(var_exp)
 
# plt.bar(range(len(pca_numerical_columns_3)), var_exp, alpha=0.5, align='center', label='individual explained variance')
# plt.step(range(len(pca_numerical_columns_3)), cum_var_exp, where='mid', label='cumulative explained variance')
# plt.ylabel('Explained variance ratio')
# plt.xlabel('Principal component index')
# plt.legend(loc='best')
# plt.show()

In [None]:
# pca_3 = PCA(n_components=1, random_state=123)
# pca_3.fit(train_pca_df[pca_numerical_columns_3])
# joblib.dump(pca_3, "pca_{}.joblib".format(2))
# princ_comp_3 = pca_3.transform(train_pca_df[pca_numerical_columns_3])
# train_pca_df[["pc3_1"]] = pd.DataFrame(princ_comp_3, columns=["pc3_1"])
# train_pca_df.drop(columns=pca_numerical_columns_3, axis=1, inplace=True)
# print(train_pca_df.shape)
# train_pca_df.head()

Replacing the 12 curricular-unit features with 2 principal components reduces the 36 original features to 26 (36 - 12 + 2).

In [None]:
test_princ_comp_1 = pca_1.transform(test_pca_df[pca_numerical_columns_1])
test_pca_df[["pc1_1", "pc1_2"]] = pd.DataFrame(test_princ_comp_1, columns=["pc1_1", "pc1_2"])
test_pca_df.drop(columns=pca_numerical_columns_1, axis=1, inplace=True)
 
# test_princ_comp_2 = pca_2.transform(test_pca_df[pca_numerical_columns_2])
# test_pca_df[["pc2_1"]] = pd.DataFrame(test_princ_comp_2, columns=["pc2_1"])
# test_pca_df.drop(columns=pca_numerical_columns_2, axis=1, inplace=True)
# test_princ_comp_3 = pca_3.transform(test_pca_df[pca_numerical_columns_3])

# test_pca_df[["pc3_1"]] = pd.DataFrame(test_princ_comp_3, columns=["pc3_1"])
# test_pca_df.drop(columns=pca_numerical_columns_3, axis=1, inplace=True)
print(test_pca_df.shape)
test_pca_df.head()

## Modeling

We will try three models: random forest, gradient boosting, and logistic regression.

In [None]:
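rdf_model = RandomForestClassifier(random_state=123)
 
param_grid = { 
    'n_estimators': [200, 500],
    # 'auto' was removed in scikit-learn 1.3; for classifiers it was equivalent to 'sqrt'
    'max_features': ['sqrt', 'log2'],
    'max_depth' : [6, 7, 8, 9, 10],
    'criterion' :['gini', 'entropy']
}
 
CV_rdf = GridSearchCV(estimator=rdf_model, param_grid=param_grid, cv=5, n_jobs=-1)
# Pass y as a 1-D Series; a single-column DataFrame triggers a DataConversionWarning
CV_rdf.fit(train_pca_df, y_train["Status"])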
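An exhaustive grid search fits one model per parameter combination and per cross-validation fold, so the cost grows quickly with every added option. The sketch below counts those fits for an illustrative grid. The helper `count_grid_fits` and the grid values are assumptions chosen for the example, so adjust them to the grid actually being searched.

```python
from itertools import product


def count_grid_fits(param_grid, cv):
    """Number of model fits in an exhaustive grid search, excluding the final refit."""
    combos = list(product(*param_grid.values()))
    return len(combos), len(combos) * cv


# Illustrative grid with two max_features options.
example_grid = {
    "n_estimators": [200, 500],
    "max_features": ["sqrt", "log2"],
    "max_depth": [6, 7, 8, 9, 10],
    "criterion": ["gini", "entropy"],
}
n_combos, n_fits = count_grid_fits(example_grid, cv=5)
print(n_combos, n_fits)  # 40 200
```

Each extra value in any list multiplies the total, which is why `n_jobs=-1` is useful for running the fits in parallel.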

In [None]:
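gboost_model = GradientBoostingClassifier(random_state=123)
 
param_grid = {
    'max_depth': [5, 8, 10],
    'n_estimators': [200, 300],
    'learning_rate': [0.001, 0.01, 0.1],
    # 'auto' was removed in scikit-learn 1.3; for classifiers it was equivalent to 'sqrt'
    'max_features': ['sqrt', 'log2']
}
 
CV_gboost = GridSearchCV(estimator=gboost_model, param_grid=param_grid, cv=5, n_jobs=-1)
# Pass y as a 1-D Series; a single-column DataFrame triggers a DataConversionWarning
CV_gboost.fit(train_pca_df, y_train["Status"])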

In [None]:
# Despite the variable name, this is a logistic regression classifier
liniear_model = LogisticRegression()
liniear_model.fit(train_pca_df, y_train["Status"])

In [None]:
print("best parameters: ", CV_rdf.best_params_)

In [None]:
print("best parameters: ", CV_gboost.best_params_)

In [None]:
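rdf_model = RandomForestClassifier(
    random_state=123,
    max_depth=10, 
    # 'sqrt' replaces 'auto', which was removed in scikit-learn 1.3 and meant the same for classifiers
    max_features='sqrt',
    criterion ='entropy',
    n_estimators=500
)
rdf_model.fit(train_pca_df, y_train["Status"])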
joblib.dump(rdf_model, "./rdf_model.joblib")

In [None]:
gboost_model = GradientBoostingClassifier(
    random_state=123,
    learning_rate=0.1, 
    max_depth=8, 
    max_features='sqrt',
    n_estimators=200
)
gboost_model.fit(train_pca_df, y_train["Status"])
joblib.dump(gboost_model, "./gboost_model.joblib")

In [None]:
joblib.dump(liniear_model, "./linear_model.joblib")

In [None]:
train_pca_df.info()

## Evaluation

In [None]:
predictions = rdf_model.predict(test_pca_df)
predictions1 = gboost_model.predict(test_pca_df)
predictions2 = liniear_model.predict(test_pca_df)

In [None]:
# f1_score expects (y_true, y_pred) in that order
result = f1_score(y_test, predictions, average='micro')
result1 = f1_score(y_test, predictions1, average='micro')
result2 = f1_score(y_test, predictions2, average='micro')
print("F1 Score RDF : %.3f%%" % (result*100.0))
print("F1 Score gboost : %.3f%%" % (result1*100.0))
print("F1 Score logistic regression : %.3f%%" % (result2*100.0))
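For single-label multiclass data, micro-averaged F1 equals plain accuracy, so the scores above do not show how well the Dropout class in particular is detected. Since the goal is to find likely dropouts, per-class precision and recall may be more informative. The sketch below computes them in plain Python on toy labels. The helper `per_class_scores` and the toy predictions are assumptions, not results from this notebook.

```python
from collections import Counter


def per_class_scores(y_true, y_pred):
    """Precision, recall and F1 for each class, plus overall accuracy."""
    pairs = list(zip(y_true, y_pred))
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    hits = Counter(t for t, p in pairs if t == p)
    scores = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        precision = hits[cls] / pred_counts[cls] if pred_counts[cls] else 0.0
        recall = hits[cls] / true_counts[cls] if true_counts[cls] else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[cls] = {"precision": precision, "recall": recall, "f1": f1}
    accuracy = sum(hits.values()) / len(pairs)
    return scores, accuracy


# Toy encoded labels (0 = Dropout, 1 = Enrolled, 2 = Graduate), not real predictions.
toy_true = [0, 0, 0, 1, 1, 2, 2, 2]
toy_pred = [0, 0, 2, 1, 2, 2, 2, 2]
scores, accuracy = per_class_scores(toy_true, toy_pred)
print(accuracy)
print(scores[0]["recall"])
```

In scikit-learn, `metrics.classification_report(y_test, predictions)` produces the same kind of per-class breakdown for the real predictions.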