# Complaint Classification of TikTok Comments about Bank BCA

This project combines sentiment analysis with complaint classification on TikTok comments related to Bank BCA. After collecting the comments through web scraping, I used a machine learning approach to analyze the sentiment (positive, negative, neutral) of each comment.

As a follow-up step, I divided the comments into several complaint categories:

*   App (Aplikasi): Comments about technical problems or the user experience in the BCA apps.
*   Service (Layanan): Comments about customer service or the facilities provided by Bank BCA.
*   Non Category (Non Kategori): Comments that do not fit any specific complaint category.
*   Credit Card (Kartu Kredit): Comments about Bank BCA credit card products.

Dividing the comments into these categories gives a more detailed view of the problems users face, which can help Bank BCA focus its service improvements on customer feedback.

# IMPORT LIBRARY

In [None]:
!pip install Sastrawi



In [None]:
import requests
import pandas as pd
import numpy as np
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
from nltk import word_tokenize
from nltk.corpus import stopwords
import re
import warnings
import plotly.figure_factory as ff


warnings.filterwarnings("ignore")

import nltk
nltk.download('punkt')
nltk.download('stopwords')



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

# LABELLING

The labels were assigned manually; the data comes from Complaint.

In [None]:
data = pd.read_excel("topic.xlsx")
data

Unnamed: 0,comments,Category
0,mbanking sekarang jadi makin lemot mau transak...,app
1,Kok ini bca mobile merah lagi y??,app
2,Kok ini bca mobile merah lagi y??,app
3,"ngisi pulsa lewat BCA mobile,pulsa kaga masuk ...",service
4,Sering crash mybca not responding padahal suda...,app
...,...,...
1226,Tidak ada solusi dari CS untuk masalah kartu k...,cc
1227,Kenapa proses kenaikan limit kartu kredit sang...,cc
1228,Sulit mendapatkan cashback dari penggunaan kar...,cc
1229,Kenapa saya tidak pernah mendapat pemberitahua...,cc
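Rows 1 and 2 in the preview above hold the same comment with the same label. If exact duplicates like these are not intentional, they can land on both sides of a later train/test split. A minimal sketch of dropping them with pandas follows; the three-row `sample` frame is a hypothetical stand-in for `topic.xlsx`, and its third comment is invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the labelled data; the real frame is read from topic.xlsx.
sample = pd.DataFrame({
    "comments": [
        "Kok ini bca mobile merah lagi y??",
        "Kok ini bca mobile merah lagi y??",
        "aplikasi error terus",  # invented row for illustration
    ],
    "Category": ["app", "app", "app"],
})

# Keep the first occurrence of each (comment, label) pair.
deduplicated = sample.drop_duplicates(subset=["comments", "Category"]).reset_index(drop=True)
print(len(sample), "->", len(deduplicated))
```

Whether duplicates should be removed depends on how the comments were collected; repeated comments from different users may be worth keeping.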


In [None]:
data.value_counts('Category')

Category,count
app,480
service,479
cc,217
non,55
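The counts above show a clear imbalance: the `non` class holds 55 of 1231 rows. One common way to quantify this, and an alternative to resampling, is the "balanced" class-weight heuristic, `n_samples / (n_classes * class_count)`. The sketch below only does the arithmetic on the counts shown above; it is not part of the original notebook.

```python
# Class counts copied from the value_counts output above.
counts = {"app": 480, "service": 479, "cc": 217, "non": 55}

n_samples = sum(counts.values())
n_classes = len(counts)

# "Balanced" heuristic: rarer classes receive proportionally larger weights.
class_weight = {
    label: n_samples / (n_classes * count) for label, count in counts.items()
}

for label, weight in class_weight.items():
    share = counts[label] / n_samples
    print(f"{label:8s} share={share:.3f} weight={weight:.2f}")
```

Many scikit-learn classifiers accept such a dictionary through their `class_weight` parameter, keyed by the encoded class values.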


# PRE-PROCESSING

In [None]:
def remove_emojis(text):
    emoji_pattern = re.compile(
        "["
        "\U0001F600-\U0001F64F"
        "\U0001F300-\U0001F5FF"
        "\U0001F680-\U0001F6FF"
        "\U0001F700-\U0001F77F"
        "\U0001F780-\U0001F7FF"
        "\U0001F800-\U0001F8FF"
        "\U0001F900-\U0001F9FF"
        "\U0001F1E0-\U0001F1FF"
        "\u2600-\u26FF"
        "\u2700-\u27BF"
        "]+"
    )
    return emoji_pattern.sub(r'', text)

def remove_numbers(text):

    return re.sub(r'\d+', '', text)


# Build the stemmer and the stop-word set once instead of on every call.
stemmer = StemmerFactory().create_stemmer()
stop_words = set(stopwords.words('indonesian'))


def preprocessing(comments):
    # Drop @mentions, emojis, and digits.
    comments = re.sub(r'@\w+', '', comments)
    comment = remove_emojis(text=comments)
    no_number = remove_numbers(comment)

    # Tokenize and lowercase.
    tokens = word_tokenize(no_number)
    lower = [token.lower() for token in tokens]

    # Remove Indonesian stop words, then stem the remaining tokens.
    stopword = [word for word in lower if word not in stop_words]
    stemm = [stemmer.stem(word) for word in stopword]

    return " ".join(stemm)

data['pre'] = data.comments.apply(preprocessing)

In [None]:
dataset = data[['pre' , 'Category']]
dataset

Unnamed: 0,pre,Category
0,mbanking lot transaksi lelet banget lokasi nyala,app
1,bca mobile merah y,app
2,bca mobile merah y,app
3,ngisi pulsa bca mobile pulsa kaga masuk masuk...,service
4,crash mybca not responding ulang kali restart hp,app
...,...,...
1226,solusi cs kartu kredit,cc
1227,proses naik limit kartu kredit,cc
1228,sulit cashback guna kartu kredit,cc
1229,pemberitahuan promo kartu kredit,cc
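To see what the regex-based cleaning steps contribute to the `pre` column on their own, here is a small self-contained sketch. It applies only the mention and digit removal to a hypothetical comment; tokenization with NLTK, stop-word removal, and Sastrawi stemming are left out, so the result is not the full `preprocessing` output.

```python
import re


def strip_mentions(text):
    # Same pattern as in preprocessing(): drop @username handles.
    return re.sub(r'@\w+', '', text)


def remove_numbers(text):
    # Same pattern as remove_numbers() in the notebook: drop runs of digits.
    return re.sub(r'\d+', '', text)


raw = "@halobca transfer 3 kali gagal"  # hypothetical comment
cleaned = remove_numbers(strip_mentions(raw))

# Plain whitespace split stands in for word_tokenize here.
tokens = cleaned.lower().split()
print(tokens)
```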


In [None]:
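# Drop rows left empty by preprocessing; copy so later column assignments do not write to a view.
dataset = dataset[dataset.pre != ""].copy()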
dataset

Unnamed: 0,pre,Category
0,mbanking lot transaksi lelet banget lokasi nyala,app
1,bca mobile merah y,app
2,bca mobile merah y,app
3,ngisi pulsa bca mobile pulsa kaga masuk masuk...,service
4,crash mybca not responding ulang kali restart hp,app
...,...,...
1226,solusi cs kartu kredit,cc
1227,proses naik limit kartu kredit,cc
1228,sulit cashback guna kartu kredit,cc
1229,pemberitahuan promo kartu kredit,cc


In [None]:
dataset.isna().sum()

Unnamed: 0,0
pre,0
Category,0


In [None]:
category_map = {
    'app': 0,
    'service': 1,
    'cc': 2,
    'non': 3,
}

dataset['Category'] = dataset['Category'].map(category_map)


In [None]:
dataset.head()

Unnamed: 0,pre,Category
0,mbanking lot transaksi lelet banget lokasi nyala,0
1,bca mobile merah y,0
2,bca mobile merah y,0
3,ngisi pulsa bca mobile pulsa kaga masuk masuk...,1
4,crash mybca not responding ulang kali restart hp,0
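Once `Category` holds integer codes, model predictions come back as integers too. A minimal sketch of turning them back into labels by inverting `category_map`; the `predicted_codes` list is hypothetical.

```python
category_map = {'app': 0, 'service': 1, 'cc': 2, 'non': 3}

# Invert the mapping so integer predictions can be reported as labels again.
inverse_category_map = {code: label for label, code in category_map.items()}

predicted_codes = [0, 1, 0, 2, 3]  # hypothetical model output
predicted_labels = [inverse_category_map[code] for code in predicted_codes]
print(predicted_labels)
```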


In [None]:
dataset.to_csv("modelling_data_final.csv" , index=False)

# MODELLING

In [None]:
dataset = pd.read_csv("modelling_data_final.csv")
dataset.head()

Unnamed: 0,pre,Category
0,mbanking lot transaksi lelet banget lokasi nyala,0
1,bca mobile merah y,0
2,bca mobile merah y,0
3,ngisi pulsa bca mobile pulsa kaga masuk masuk...,1
4,crash mybca not responding ulang kali restart hp,0


In [None]:
dataset.Category.value_counts()

Category,count
0,480
1,479
2,217
3,54


In [None]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1230 entries, 0 to 1229
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   pre       1230 non-null   object
 1   Category  1230 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 19.3+ KB


In [None]:
!pip install catboost



In [None]:
from sklearn.linear_model import LogisticRegression ,  PassiveAggressiveClassifier, Perceptron, SGDClassifier, RidgeClassifier , RidgeClassifierCV
from sklearn.neighbors import KNeighborsClassifier , NearestCentroid
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB , MultinomialNB , BernoulliNB
from sklearn.svm import SVC
from sklearn.ensemble import AdaBoostClassifier , ExtraTreesClassifier , GradientBoostingClassifier , RandomForestClassifier , BaggingClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from xgboost import XGBClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis,QuadraticDiscriminantAnalysis
from sklearn.semi_supervised import LabelPropagation, LabelSpreading
from sklearn.calibration import CalibratedClassifierCV
from sklearn.dummy import DummyClassifier



from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer , TfidfVectorizer
from sklearn.metrics import f1_score , roc_auc_score , recall_score , precision_score , accuracy_score , balanced_accuracy_score , classification_report , confusion_matrix




In [None]:
MODELS = {
    "Logistic Regression": LogisticRegression(),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(),
    "Gaussian Naive Bayes": GaussianNB(),
    "Support Vector Classifier": SVC(),
    "AdaBoost Classifier": AdaBoostClassifier(),
    "Linear Discriminant Analysis": LinearDiscriminantAnalysis(),
    "Extra Trees Classifier": ExtraTreesClassifier(),
    "Gradient Boosting Classifier": GradientBoostingClassifier(),
    "Random Forest Classifier": RandomForestClassifier(),
    "Bagging Classifier": BaggingClassifier(),
    "LightGBM Classifier": LGBMClassifier(),
    "CatBoost Classifier": CatBoostClassifier(),
    "XGBoost Classifier": XGBClassifier(),
    "Quadratic Discriminant Analysis": QuadraticDiscriminantAnalysis(),
    "Nearest Centroid": NearestCentroid(),
    "Label Propagation": LabelPropagation(),
    "Label Spreading": LabelSpreading(),
    "Passive Aggressive Classifier": PassiveAggressiveClassifier(),
    "Perceptron": Perceptron(),
    "SGD Classifier": SGDClassifier(),
    "Bernoulli Naive Bayes": BernoulliNB(),
    "Calibrated Classifier CV": CalibratedClassifierCV(),
    "Ridge Classifier": RidgeClassifier(),
    "Dummy Classifier": DummyClassifier(),
    "Ridge Classifier CV": RidgeClassifierCV(),
}


In [None]:
def style_dataframe(df):
    df.sort_values(["Accuracy", "F1-Score"], inplace=True, ascending=False)
    df_style = df.style.background_gradient(cmap="Blues", subset=["Accuracy"]) \
                    .background_gradient(cmap="Reds", subset=["F1-Score"])
    return df_style



def classifier_report(X_train, X_test, y_train, y_test, print=False):
    accuracy_score_list = []
    f1_score_list = []
    recall_list = []
    precision_score_list = []
    balance_score_list = []
    Model_name_list = []

    for name_model, model in MODELS.items():
        model_now = model
        model_now.fit(X_train, y_train)
        y_pred = model_now.predict(X_test)

        accuracy = np.round(accuracy_score(y_test, y_pred), 2)
        f1 = np.round(f1_score(y_test, y_pred, average='macro'), 2)
        recall = np.round(recall_score(y_test, y_pred, average='macro'), 2)
        precision = np.round(precision_score(y_test, y_pred, average='macro'), 2)
        balance_score = np.round(balanced_accuracy_score(y_test, y_pred), 2)


        if print:
            # The `print` flag shadows the built-in inside this function,
            # so the built-in is looked up explicitly before use.
            import builtins
            echo = builtins.print
            echo("==============================================")
            echo(f"Name Model     : {name_model}")
            echo(f"Model          : {model}")
            echo(f"Accuracy Score : {accuracy}")
            echo(f"F1 Score       : {f1}")
            echo(f"Recall         : {recall}")
            echo(f"Precision      : {precision}")
            echo("==============================================")
            echo("\n")

        Model_name_list.append(name_model)
        accuracy_score_list.append(accuracy)
        f1_score_list.append(f1)
        recall_list.append(recall)
        precision_score_list.append(precision)
        balance_score_list.append(balance_score)

    report = pd.DataFrame({
        "Model": Model_name_list,
        "Accuracy": accuracy_score_list,
        "F1-Score": f1_score_list,
        "Recall": recall_list,
        "Precision": precision_score_list,
        "Balance": balance_score_list,
    })

    return style_dataframe(report.set_index("Model"))


In [None]:
Count_Data = CountVectorizer(ngram_range=(1,2)).fit(dataset.pre)
X_Count_matrix = Count_Data.transform(dataset.pre)

X_Count = pd.DataFrame(data=X_Count_matrix.toarray() , columns = Count_Data.get_feature_names_out())
X_Count

Unnamed: 0,abai,abai cabang,abai hubung,abis,abis ambil,abis setor,abis si,abissss,abissss nelpon,about,...,yg udh,yg viral,yg xpresi,yh,yh mbca,zaman,zaman gw,zte,zte bca,zuu
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1225,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1226,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1227,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1228,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
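The column names above mix single words and word pairs because of `ngram_range=(1,2)`. The plain-Python sketch below approximates how those features are formed for one preprocessed comment. It is an illustration, not the library's code; it assumes the default token pattern, which keeps only tokens of two or more word characters.

```python
import re


def unigrams_and_bigrams(text):
    # Tokens of two or more word characters, mirroring the default token pattern.
    tokens = re.findall(r"(?u)\b\w\w+\b", text.lower())
    # Bigrams are built from adjacent tokens that survived the filter.
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams


features = unigrams_and_bigrams("bca mobile merah y")
print(features)
```

The single-character token `y` does not become a feature under this pattern, so very short slang tokens are silently ignored unless `token_pattern` is changed.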


In [None]:
Tfid_Data = TfidfVectorizer(ngram_range=(1,2)).fit(dataset.pre)
X_Tfid_Matrix = Tfid_Data.transform(dataset.pre)

X_Tfid = pd.DataFrame(data=X_Tfid_Matrix.toarray() , columns = Tfid_Data.get_feature_names_out())
X_Tfid

Unnamed: 0,abai,abai cabang,abai hubung,abis,abis ambil,abis setor,abis si,abissss,abissss nelpon,about,...,yg udh,yg viral,yg xpresi,yh,yh mbca,zaman,zaman gw,zte,zte bca,zuu
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1225,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1226,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1227,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1228,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
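The TF-IDF matrix replaces raw counts with weights that shrink for terms appearing in many comments. The sketch below works through the arithmetic by hand, assuming the smoothed IDF `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization of each row, which are the `TfidfVectorizer` defaults. The three tokenized documents are hypothetical.

```python
import math

# Hypothetical tokenized comments.
docs = [
    ["bca", "mobile", "merah"],
    ["kartu", "kredit", "limit"],
    ["bca", "kartu", "kredit"],
]

n_docs = len(docs)
vocabulary = sorted({term for doc in docs for term in doc})
doc_freq = {term: sum(term in doc for doc in docs) for term in vocabulary}

# Smoothed inverse document frequency.
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocabulary}


def tfidf_row(doc):
    # Raw term count times IDF, then scale the row to unit L2 length.
    raw = [doc.count(term) * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]


row = tfidf_row(docs[0])
print(dict(zip(vocabulary, (round(value, 3) for value in row))))
```

In the first document, `merah` and `mobile` receive a larger weight than `bca` because `bca` also appears in another document.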


In [None]:
Count_df = pd.concat([X_Count , dataset.Category] , axis=1)
Count_df.head()

Unnamed: 0,abai,abai cabang,abai hubung,abis,abis ambil,abis setor,abis si,abissss,abissss nelpon,about,...,yg viral,yg xpresi,yh,yh mbca,zaman,zaman gw,zte,zte bca,zuu,Category
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
Tfid_df = pd.concat([X_Tfid , dataset.Category] , axis=1)
Tfid_df.head()

Unnamed: 0,abai,abai cabang,abai hubung,abis,abis ambil,abis setor,abis si,abissss,abissss nelpon,about,...,yg viral,yg xpresi,yh,yh mbca,zaman,zaman gw,zte,zte bca,zuu,Category
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
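Both vectorizers above are fitted on all of `dataset.pre` before any train/test split, so the vocabulary and IDF weights also see the rows later used for testing. A minimal sketch of one alternative, assuming scikit-learn's `Pipeline` utilities: split the raw text first (stratified by label) and let the pipeline fit the vectorizer on the training rows only. The toy texts and labels are hypothetical, and `LogisticRegression` is just one of the models from the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical stand-in for the `pre` and `Category` columns.
texts = [
    "bca mobile merah", "mybca crash restart hp",
    "aplikasi lemot transaksi", "mbanking error login",
    "cs lambat respon", "antri teller lama",
    "pulsa kaga masuk", "atm rusak cabang",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Split the raw text first; stratify keeps class proportions in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=33, stratify=labels
)

# The vectorizer is fitted inside the pipeline, on training rows only.
pipeline = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
pipeline.fit(X_train, y_train)

predictions = pipeline.predict(X_test)
vocabulary = pipeline.named_steps["tfidfvectorizer"].vocabulary_
print(len(vocabulary), list(predictions))
```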


## CASE 1 (No Sampling, No Length Feature)

In [None]:
X = Count_df.drop(["Category"] , axis=1)
y = Count_df['Category']

In [None]:
X_train , X_test , y_train , y_test = train_test_split(X,y , test_size=0.25 , random_state = 33)

In [None]:
model = classifier_report(X_train , X_test , y_train , y_test , print=False )
model

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001161 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 171
[LightGBM] [Info] Number of data points in the train set: 922, number of used features: 62
[LightGBM] [Info] Start training from score -0.943223
[LightGBM] [Info] Start training from score -0.915749
[LightGBM] [Info] Start training from score -1.776689
[LightGBM] [Info] Start training from score -3.188959
Learning rate set to 0.0788
0:	learn: 1.2958229	total: 136ms	remaining: 2m 15s
1:	learn: 1.2255517	total: 204ms	remaining: 1m 41s
2:	learn: 1.1661381	total: 309ms	remaining: 1m 42s
3:	learn: 1.1254054	total: 419ms	remaining: 1m 44s
4:	learn: 1.0818141	total: 472ms	remaining: 1m 33s
5:	learn: 1.0509642	total: 549ms	remaining: 1m 30s
6:	learn: 1.0164055	total: 605ms	remaining: 1m 25s
7:	learn: 0.9865509	total: 711ms	remaining: 1m 28

Model,Accuracy,F1-Score,Recall,Precision,Balance
Passive Aggressive Classifier,0.85,0.82,0.79,0.87,0.79
CatBoost Classifier,0.84,0.72,0.69,0.9,0.69
Ridge Classifier,0.83,0.76,0.72,0.89,0.72
Ridge Classifier CV,0.83,0.76,0.72,0.89,0.72
Decision Tree,0.83,0.75,0.71,0.88,0.71
Gradient Boosting Classifier,0.83,0.75,0.71,0.89,0.71
Perceptron,0.83,0.75,0.72,0.88,0.72
Bagging Classifier,0.83,0.74,0.7,0.89,0.7
XGBoost Classifier,0.83,0.74,0.71,0.8,0.71
SGD Classifier,0.83,0.73,0.7,0.88,0.7
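In the table above, accuracy is consistently higher than the macro-averaged F1-score (for example 0.84 against 0.72 for CatBoost). Macro averaging gives every class the same weight, so a rare class that is predicted poorly pulls the score down even when most rows are correct. The plain-Python sketch below shows the effect on hypothetical labels; it is an illustration of the metric, not a re-computation of the notebook's results.

```python
def per_class_f1(y_true, y_pred, label):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        # No correct predictions for this class: F1 is taken as 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: class 3 is rare and never predicted correctly.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 3, 3]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 0, 1]

labels = sorted(set(y_true))
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro_f1 = sum(per_class_f1(y_true, y_pred, label) for label in labels) / len(labels)
print(f"accuracy={accuracy:.2f} macro_f1={macro_f1:.2f}")
```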


In [None]:
X = Tfid_df.drop(["Category"] , axis=1)
y = Tfid_df['Category']

In [None]:
X_train , X_test , y_train , y_test = train_test_split(X,y , test_size=0.25 , random_state = 33)

In [None]:
model = classifier_report(X_train , X_test , y_train , y_test , print=False )
model

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000662 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 875
[LightGBM] [Info] Number of data points in the train set: 922, number of used features: 62
[LightGBM] [Info] Start training from score -0.943223
[LightGBM] [Info] Start training from score -0.915749
[LightGBM] [Info] Start training from score -1.776689
[LightGBM] [Info] Start training from score -3.188959
Learning rate set to 0.0788
0:	learn: 1.3011750	total: 55.1ms	remaining: 55.1s
1:	learn: 1.2279687	total: 105ms	remaining: 52.5s
2:	learn: 1.1714280	total: 153ms	remaining: 50.8s
3:	learn: 1.1246183	total: 202ms	remaining: 50.2s
4:	learn: 1.0770461	total: 246ms	remaining: 48.9s
5:	learn: 1.0391055	total: 288ms	remaining: 47.7s
6:	learn: 1.0069732	total: 329ms	remaining: 46.7s
7:	learn: 0.9801395	total: 370ms	remaining: 45.9s
8:	le

Model,Accuracy,F1-Score,Recall,Precision,Balance
Extra Trees Classifier,0.83,0.71,0.69,0.88,0.69
Label Propagation,0.82,0.75,0.72,0.81,0.72
Label Spreading,0.82,0.75,0.72,0.81,0.72
Passive Aggressive Classifier,0.82,0.71,0.68,0.88,0.68
Calibrated Classifier CV,0.82,0.71,0.68,0.88,0.68
CatBoost Classifier,0.82,0.68,0.67,0.88,0.67
SGD Classifier,0.82,0.68,0.67,0.88,0.67
Gradient Boosting Classifier,0.81,0.73,0.7,0.84,0.7
Bagging Classifier,0.81,0.72,0.69,0.82,0.69
XGBoost Classifier,0.81,0.72,0.7,0.78,0.7


## CASE 2 (Oversampling)

In [None]:
X = Count_df.drop(["Category"] , axis=1)
y = Count_df['Category']

In [None]:
X_train , X_test , y_train , y_test = train_test_split(X,y , test_size=0.25 , random_state = 33)

In [None]:
from imblearn.over_sampling import SMOTE

over = SMOTE()
X_train_over , y_train_over = over.fit_resample(X_train , y_train)
# Series.value_counts takes `normalize` as its first argument; request proportions explicitly.
y_train_over.value_counts(normalize=True)

Category,proportion
1,0.25
0,0.25
2,0.25
3,0.25


In [None]:
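# Caution: this call trains on X_train / y_train, not on the SMOTE output
# (X_train_over / y_train_over), so the table below does not reflect oversampling.
# Pass X_train_over and y_train_over to evaluate the resampled data.
model = classifier_report(X_train , X_test , y_train , y_test , print=False )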
model

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000871 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 171
[LightGBM] [Info] Number of data points in the train set: 922, number of used features: 62
[LightGBM] [Info] Start training from score -0.943223
[LightGBM] [Info] Start training from score -0.915749
[LightGBM] [Info] Start training from score -1.776689
[LightGBM] [Info] Start training from score -3.188959
Learning rate set to 0.0788
0:	learn: 1.2958229	total: 39.7ms	remaining: 39.7s
1:	learn: 1.2255517	total: 80.3ms	remaining: 40.1s
2:	learn: 1.1661381	total: 121ms	remaining: 40.2s
3:	learn: 1.1254054	total: 162ms	remaining: 40.2s
4:	learn: 1.0818141	total: 207ms	remaining: 41.2s
5:	learn: 1.0509642	total: 251ms	remaining: 41.6s
6:	learn: 1.0164055	total: 295ms	remaining: 41.9s
7:	learn: 0.9865509	total: 343ms	remaining: 42.5s
8:	l

Model,Accuracy,F1-Score,Recall,Precision,Balance
SGD Classifier,0.85,0.76,0.73,0.89,0.73
Passive Aggressive Classifier,0.84,0.78,0.75,0.85,0.75
CatBoost Classifier,0.84,0.72,0.69,0.9,0.69
Ridge Classifier,0.83,0.76,0.72,0.89,0.72
Ridge Classifier CV,0.83,0.76,0.72,0.89,0.72
Decision Tree,0.83,0.75,0.72,0.88,0.72
Perceptron,0.83,0.75,0.72,0.88,0.72
XGBoost Classifier,0.83,0.74,0.71,0.8,0.71
Extra Trees Classifier,0.83,0.71,0.69,0.89,0.69
Calibrated Classifier CV,0.83,0.71,0.69,0.88,0.69


In [None]:
X = Tfid_df.drop(["Category"] , axis=1)
y = Tfid_df['Category']

In [None]:
X_train , X_test , y_train , y_test = train_test_split(X,y , test_size=0.25 , random_state = 33)

In [None]:
from imblearn.over_sampling import SMOTE

over = SMOTE()
X_train_over , y_train_over = over.fit_resample(X_train , y_train)
y_train_over.value_counts(normalize=True)

Category,proportion
1,0.25
0,0.25
2,0.25
3,0.25


In [None]:
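# Caution: this call trains on X_train / y_train, not on the SMOTE output
# (X_train_over / y_train_over), so the table below does not reflect oversampling.
# Pass X_train_over and y_train_over to evaluate the resampled data.
model = classifier_report(X_train , X_test , y_train , y_test , print=False )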
model

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000615 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 875
[LightGBM] [Info] Number of data points in the train set: 922, number of used features: 62
[LightGBM] [Info] Start training from score -0.943223
[LightGBM] [Info] Start training from score -0.915749
[LightGBM] [Info] Start training from score -1.776689
[LightGBM] [Info] Start training from score -3.188959
Learning rate set to 0.0788
0:	learn: 1.3011750	total: 65.5ms	remaining: 1m 5s
1:	learn: 1.2279687	total: 107ms	remaining: 53.4s
2:	learn: 1.1714280	total: 148ms	remaining: 49.3s
3:	learn: 1.1246183	total: 193ms	remaining: 48.1s
4:	learn: 1.0770461	total: 244ms	remaining: 48.6s
5:	learn: 1.0391055	total: 294ms	remaining: 48.7s
6:	learn: 1.0069732	total: 336ms	remaining: 47.6s
7:	learn: 0.9801395	total: 377ms	remaining: 46.8s
8:	le

Model,Accuracy,F1-Score,Recall,Precision,Balance
Extra Trees Classifier,0.83,0.71,0.69,0.89,0.69
Label Propagation,0.82,0.75,0.72,0.81,0.72
Label Spreading,0.82,0.75,0.72,0.81,0.72
Calibrated Classifier CV,0.82,0.71,0.68,0.88,0.68
CatBoost Classifier,0.82,0.68,0.67,0.88,0.67
Gradient Boosting Classifier,0.81,0.74,0.7,0.88,0.7
Decision Tree,0.81,0.72,0.69,0.8,0.69
XGBoost Classifier,0.81,0.72,0.7,0.78,0.7
Bagging Classifier,0.81,0.71,0.68,0.82,0.68
Random Forest Classifier,0.81,0.7,0.68,0.87,0.68


In [None]:
X = Tfid_df.drop(["Category"] , axis=1)
y = Tfid_df['Category']

In [None]:
X_train , X_test , y_train , y_test = train_test_split(X,y , test_size=0.25 , random_state = 33)

In [None]:
from imblearn.over_sampling import SMOTE

over = SMOTE()
X_train_over , y_train_over = over.fit_resample(X_train , y_train)
y_train_over.value_counts(normalize=True)

Category,proportion
1,0.25
0,0.25
2,0.25
3,0.25


In [None]:
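# Caution: this call trains on X_train / y_train, not on the SMOTE output
# (X_train_over / y_train_over), so the table below does not reflect oversampling.
# Pass X_train_over and y_train_over to evaluate the resampled data.
model = classifier_report(X_train , X_test , y_train , y_test , print=False )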
model

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000872 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 875
[LightGBM] [Info] Number of data points in the train set: 922, number of used features: 62
[LightGBM] [Info] Start training from score -0.943223
[LightGBM] [Info] Start training from score -0.915749
[LightGBM] [Info] Start training from score -1.776689
[LightGBM] [Info] Start training from score -3.188959
Learning rate set to 0.0788
0:	learn: 1.3011750	total: 55.1ms	remaining: 55s
1:	learn: 1.2279687	total: 96.7ms	remaining: 48.3s
2:	learn: 1.1714280	total: 141ms	remaining: 46.8s
3:	learn: 1.1246183	total: 189ms	remaining: 47s
4:	learn: 1.0770461	total: 232ms	remaining: 46.1s
5:	learn: 1.0391055	total: 273ms	remaining: 45.2s
6:	learn: 1.0069732	total: 315ms	remaining: 44.6s
7:	learn: 0.9801395	total: 357ms	remaining: 44.2s
8:	learn

Model,Accuracy,F1-Score,Recall,Precision,Balance
Passive Aggressive Classifier,0.84,0.77,0.73,0.88,0.73
SGD Classifier,0.83,0.69,0.68,0.88,0.68
Label Propagation,0.82,0.75,0.72,0.81,0.72
Label Spreading,0.82,0.75,0.72,0.81,0.72
Extra Trees Classifier,0.82,0.71,0.68,0.89,0.68
Calibrated Classifier CV,0.82,0.71,0.68,0.88,0.68
CatBoost Classifier,0.82,0.68,0.67,0.88,0.67
Gradient Boosting Classifier,0.81,0.74,0.7,0.88,0.7
XGBoost Classifier,0.81,0.72,0.7,0.78,0.7
Ridge Classifier,0.81,0.7,0.67,0.87,0.67


## CASE 3 (Text Length Feature)

In [None]:
data_len = dataset.copy()

In [None]:
data_len["text_len"] = data_len['pre'].apply(lambda x: len(x.split()))
data_len.head()

Unnamed: 0,pre,Category,text_len
0,mbanking lot transaksi lelet banget lokasi nyala,0,7
1,bca mobile merah y,0,4
2,bca mobile merah y,0,4
3,ngisi pulsa bca mobile pulsa kaga masuk masuk...,1,12
4,crash mybca not responding ulang kali restart hp,0,8
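The `text_len` column is a raw token count, while the count and TF-IDF features it is later combined with are mostly 0 or small values. Scale-sensitive models may be dominated by the larger feature. One option, sketched here under that assumption and not taken from the notebook, is min-max scaling of the length, using the five values from the preview above.

```python
# Token counts copied from the text_len column in the preview above.
text_len = [7, 4, 4, 12, 8]

low, high = min(text_len), max(text_len)

# Min-max scaling maps the shortest comment to 0 and the longest to 1.
scaled = [(value - low) / (high - low) for value in text_len]
print(scaled)
```

In a real split, `low` and `high` would be taken from the training rows only and then reused for the test rows.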


In [None]:
Count_Data = CountVectorizer(ngram_range=(1,2)).fit(data_len.pre)
X_Count_matrix = Count_Data.transform(data_len.pre)

X_Count = pd.DataFrame(data=X_Count_matrix.toarray() , columns = Count_Data.get_feature_names_out())

In [None]:
Count_df = pd.concat([X_Count , data_len.text_len ,data_len.Category] , axis=1)
Count_df.head()

Unnamed: 0,abai,abai cabang,abai hubung,abis,abis ambil,abis setor,abis si,abissss,abissss nelpon,about,...,yg xpresi,yh,yh mbca,zaman,zaman gw,zte,zte bca,zuu,text_len,Category
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,7,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,4,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,4,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,12,1
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,8,0


In [None]:
X = Count_df.drop(["Category"] , axis=1)
y = Count_df['Category']

In [None]:
X_train , X_test , y_train , y_test = train_test_split(X,y , test_size=0.25 , random_state = 33)

In [None]:
model = classifier_report(X_train , X_test , y_train , y_test, print=False )
model

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000765 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 195
[LightGBM] [Info] Number of data points in the train set: 922, number of used features: 63
[LightGBM] [Info] Start training from score -0.943223
[LightGBM] [Info] Start training from score -0.915749
[LightGBM] [Info] Start training from score -1.776689
[LightGBM] [Info] Start training from score -3.188959
Learning rate set to 0.0788
0:	learn: 1.2978927	total: 58.1ms	remaining: 58s
1:	learn: 1.2266254	total: 121ms	remaining: 1m
2:	learn: 1.1683786	total: 197ms	remaining: 1m 5s
3:	learn: 1.1191301	total: 282ms	remaining: 1m 10s
4:	learn: 1.0765804	total: 358ms	remaining: 1m 11s
5:	learn: 1.0406799	total: 438ms	remaining: 1m 12s
6:	learn: 1.0072745	total: 514ms	remaining: 1m 12s
7:	learn: 0.9839920	total: 589ms	remaining: 1m 13s
8:	learn: 0.9588068	total: 683ms	remaining: 1m 15s
9:	learn: 0.936823

Model,Accuracy,F1-Score,Recall,Precision,Balance
Ridge Classifier,0.83,0.76,0.72,0.89,0.72
Ridge Classifier CV,0.83,0.76,0.72,0.89,0.72
Passive Aggressive Classifier,0.83,0.71,0.69,0.88,0.69
Perceptron,0.82,0.79,0.76,0.85,0.76
Decision Tree,0.82,0.75,0.72,0.84,0.72
Calibrated Classifier CV,0.82,0.71,0.68,0.88,0.68
Bagging Classifier,0.82,0.69,0.67,0.81,0.67
Logistic Regression,0.82,0.68,0.67,0.88,0.67
Gradient Boosting Classifier,0.81,0.72,0.69,0.88,0.69
XGBoost Classifier,0.81,0.72,0.7,0.75,0.7


In [None]:
from imblearn.over_sampling import SMOTE

over = SMOTE()
X_train_over , y_train_over = over.fit_resample(X_train , y_train)
display(y_train_over.value_counts(normalize=True))
len(y_train_over)

Category,proportion
1,0.25
0,0.25
2,0.25
3,0.25


1476
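The resampled size of 1476 can be cross-checked against the LightGBM log lines printed earlier for the 922-row training split. The sketch below assumes that each "Start training from score" value is the natural log of a class prior, and that SMOTE with default settings raises every minority class to the majority count. Both are assumptions about library defaults; the numbers themselves are copied from the outputs above.

```python
import math

# Per-class start scores reported by LightGBM for the 922-row training split.
start_scores = [-0.943223, -0.915749, -1.776689, -3.188959]
n_train = 922

# exp(score) is read as the class prior; multiply by the row count to recover class sizes.
class_counts = [round(math.exp(score) * n_train) for score in start_scores]
majority = max(class_counts)

# Every class is raised to the majority count, so the total is majority * n_classes.
resampled_size = majority * len(class_counts)
print(class_counts, resampled_size)
```

Under these assumptions the recovered class sizes add up to 922 and the balanced total matches the `len(y_train_over)` output above.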

In [None]:
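# Caution: this call trains on X_train / y_train, not on the SMOTE output
# (X_train_over / y_train_over), so the table below does not reflect oversampling.
# Pass X_train_over and y_train_over to evaluate the resampled data.
model = classifier_report(X_train , X_test , y_train , y_test, print=False )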
model

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000592 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 195
[LightGBM] [Info] Number of data points in the train set: 922, number of used features: 63
[LightGBM] [Info] Start training from score -0.943223
[LightGBM] [Info] Start training from score -0.915749
[LightGBM] [Info] Start training from score -1.776689
[LightGBM] [Info] Start training from score -3.188959
Learning rate set to 0.0788
0:	learn: 1.2978927	total: 39.6ms	remaining: 39.6s
1:	learn: 1.2266254	total: 80.3ms	remaining: 40.1s
2:	learn: 1.1683786	total: 122ms	remaining: 40.4s
3:	learn: 1.1191301	total: 166ms	remaining: 41.2s
4:	learn: 1.0765804	total: 210ms	remaining: 41.8s
5:	learn: 1.0406799	total: 251ms	remaining: 41.6s
6:	learn: 1.0072745	total: 308ms	remaining: 43.7s
7:	learn: 0.9839920	total: 348ms	remaining: 43.2s
8:	l

Model,Accuracy,F1-Score,Recall,Precision,Balance
Ridge Classifier,0.83,0.76,0.72,0.89,0.72
Ridge Classifier CV,0.83,0.76,0.72,0.89,0.72
Passive Aggressive Classifier,0.83,0.72,0.69,0.88,0.69
Perceptron,0.82,0.79,0.76,0.85,0.76
Bagging Classifier,0.82,0.72,0.69,0.83,0.69
Calibrated Classifier CV,0.82,0.71,0.68,0.88,0.68
Logistic Regression,0.82,0.68,0.67,0.88,0.67
Decision Tree,0.81,0.72,0.7,0.76,0.7
XGBoost Classifier,0.81,0.72,0.7,0.75,0.7
SGD Classifier,0.81,0.71,0.68,0.82,0.68


In [None]:
Tfid_Data = TfidfVectorizer(ngram_range=(1,2)).fit(data_len.pre)
X_Tfid_Matrix = Tfid_Data.transform(data_len.pre)

X_Tfid = pd.DataFrame(data=X_Tfid_Matrix.toarray() , columns = Tfid_Data.get_feature_names_out())

In [None]:
Tfid_df = pd.concat([X_Tfid , data_len.text_len,data_len.Category] , axis=1)
Tfid_df.head()

Unnamed: 0,abai,abai cabang,abai hubung,abis,abis ambil,abis setor,abis si,abissss,abissss nelpon,about,...,yg xpresi,yh,yh mbca,zaman,zaman gw,zte,zte bca,zuu,text_len,Category
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7,0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4,0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4,0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,12,1
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8,0


In [None]:
X = Tfid_df.drop(["Category"] , axis=1)
y = Tfid_df['Category']

In [None]:
X_train , X_test , y_train , y_test = train_test_split(X,y , test_size=0.25 , random_state = 33)

In [None]:
model = classifier_report(X_train , X_test , y_train , y_test, print=False )
model

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002614 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 899
[LightGBM] [Info] Number of data points in the train set: 922, number of used features: 63
[LightGBM] [Info] Start training from score -0.943223
[LightGBM] [Info] Start training from score -0.915749
[LightGBM] [Info] Start training from score -1.776689
[LightGBM] [Info] Start training from score -3.188959
Learning rate set to 0.0788
0:	learn: 1.3002270	total: 209ms	remaining: 3m 28s
1:	learn: 1.2283778	total: 304ms	remaining: 2m 31s
2:	learn: 1.1692242	total: 385ms	remaining: 2m 8s
3:	learn: 1.1208506	total: 441ms	remaining: 1m 49s
4:	learn: 1.0796375	total: 504ms	remaining: 1m 40s
5:	learn: 1.0442617	total: 581ms	remaining: 1m 36s
6:	learn: 1.0115577	total: 653ms	remaining: 1m 32s
7:	learn: 0.9832062	total: 741ms	remaining: 1m 31s

Model,Accuracy,F1-Score,Recall,Precision,Balance
Calibrated Classifier CV,0.83,0.71,0.69,0.89,0.69
XGBoost Classifier,0.82,0.73,0.71,0.81,0.71
Extra Trees Classifier,0.82,0.71,0.68,0.88,0.68
Ridge Classifier CV,0.82,0.71,0.68,0.88,0.68
CatBoost Classifier,0.82,0.68,0.66,0.88,0.66
Gradient Boosting Classifier,0.81,0.74,0.7,0.87,0.7
Bagging Classifier,0.81,0.72,0.69,0.82,0.69
Random Forest Classifier,0.81,0.68,0.66,0.88,0.66
Ridge Classifier,0.81,0.68,0.66,0.87,0.66
Logistic Regression,0.81,0.65,0.64,0.88,0.64


In [None]:
from imblearn.over_sampling import SMOTE

over = SMOTE()
X_train_over , y_train_over = over.fit_resample(X_train , y_train)
# The first positional argument of value_counts is `normalize`; state it explicitly.
display(y_train_over.value_counts(normalize=True))
len(y_train_over)

Unnamed: 0_level_0,proportion
Category,Unnamed: 1_level_1
1,0.25
0,0.25
2,0.25
3,0.25


1476
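
The total of 1476 rows can be cross-checked against the LightGBM log from the run before oversampling. The sketch below assumes that each `Start training from score` value is the natural log of one class's share of the 922 training rows, listed in class-label order. That reading is an assumption about the log, not something the notebook states. Under it, the largest class has 369 rows, and raising all four classes to that size gives 4 × 369 = 1476.

```python
import math

# Start scores copied from the LightGBM log of the run before oversampling.
start_scores = [-0.943223, -0.915749, -1.776689, -3.188959]
n_train = 922  # "Number of data points in the train set"

# Assumption: score = ln(class share), so share = exp(score).
class_counts = [round(n_train * math.exp(s)) for s in start_scores]

# SMOTE's default strategy raises every class to the size of the largest one.
rows_after_smote = max(class_counts) * len(class_counts)

# With four equally sized classes, each start score becomes ln(1/4).
balanced_score = math.log(1 / 4)

print(class_counts)
print(rows_after_smote)
print(round(balanced_score, 6))
```

The same reading explains why every start score in the oversampled runs equals -1.386294.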

In [None]:
model = classifier_report(X_train_over , X_test , y_train_over , y_test, print=False )
model

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002502 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3452
[LightGBM] [Info] Number of data points in the train set: 1476, number of used features: 217
[LightGBM] [Info] Start training from score -1.386294
[LightGBM] [Info] Start training from score -1.386294
[LightGBM] [Info] Start training from score -1.386294
[LightGBM] [Info] Start training from score -1.386294
Learning rate set to 0.080714
0:	learn: 1.3078492	total: 148ms	remaining: 2m 27s
1:	learn: 1.2404512	total: 223ms	remaining: 1m 51s
2:	learn: 1.1826900	total: 293ms	remaining: 1m 37s
3:	learn: 1.1393171	total: 362ms	remaining: 1m 30s
4:	learn: 1.0911481	total: 436ms	remaining: 1m 26s
5:	learn: 1.0448198	total: 505ms	remaining: 1m 23s
6:	learn: 1.0105407	total: 578ms	remaining: 1m 21s
7:	learn: 0.9785325	total: 652ms	remaining: 

Unnamed: 0_level_0,Accuracy,F1-Score,Recall,Precision,Balance
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Decision Tree,0.84,0.79,0.76,0.84,0.76
CatBoost Classifier,0.83,0.71,0.69,0.83,0.69
Calibrated Classifier CV,0.82,0.71,0.68,0.88,0.68
Ridge Classifier CV,0.82,0.71,0.68,0.88,0.68
Bernoulli Naive Bayes,0.81,0.74,0.72,0.84,0.72
Ridge Classifier,0.81,0.74,0.7,0.87,0.7
Bagging Classifier,0.81,0.73,0.7,0.82,0.7
XGBoost Classifier,0.81,0.73,0.7,0.79,0.7
LightGBM Classifier,0.81,0.71,0.69,0.82,0.69
Extra Trees Classifier,0.81,0.7,0.68,0.87,0.68


## CASE 4 ( 1,3 )

In [None]:
Count_Data = CountVectorizer(ngram_range=(1,3)).fit(dataset.pre)
X_Count_matrix = Count_Data.transform(dataset.pre)

X_Count = pd.DataFrame(data=X_Count_matrix.toarray() , columns = Count_Data.get_feature_names_out())
X_Count

Unnamed: 0,abai,abai cabang,abai cabang bank,abai hubung,abai hubung cs,abis,abis ambil,abis ambil uang,abis setor,abis setor uang,...,yg xpresi udh,yh,yh mbca,zaman,zaman gw,zaman gw kuliah,zte,zte bca,zte bca mobile,zuu
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1225,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1226,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1227,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1228,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
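
The column names above (`abai`, `abai cabang`, `abai cabang bank`) show what `ngram_range=(1,3)` produces: every run of one, two, or three consecutive tokens becomes a feature. A minimal pure-Python sketch of that expansion follows. The helper name `word_ngrams` is hypothetical, and it only mirrors the default tokenization (lowercasing, tokens of at least two word characters), not the full vectorizer.

```python
import re

def word_ngrams(text, ngram_range=(1, 3)):
    """Return the word n-grams of `text` for every n in ngram_range."""
    # Same default token pattern as CountVectorizer: words of 2+ characters.
    tokens = re.findall(r"(?u)\b\w\w+\b", text.lower())
    lo, hi = ngram_range
    grams = []
    for n in range(lo, hi + 1):
        # Slide a window of n tokens across the comment.
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

features = word_ngrams("bca mobile merah y")
print(features)
```

The single-character token `y` is dropped by the token pattern, so this comment contributes three unigrams, two bigrams, and one trigram.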


In [None]:
Count_df = pd.concat([X_Count , dataset.Category] , axis=1)
Count_df.head()

Unnamed: 0,abai,abai cabang,abai cabang bank,abai hubung,abai hubung cs,abis,abis ambil,abis ambil uang,abis setor,abis setor uang,...,yh,yh mbca,zaman,zaman gw,zaman gw kuliah,zte,zte bca,zte bca mobile,zuu,Category
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
X = Count_df.drop(["Category"] , axis=1)
y = Count_df['Category']

In [None]:
X_train , X_test , y_train , y_test = train_test_split(X,y , test_size=0.25 , random_state = 33)

In [None]:
model = classifier_report(X_train , X_test , y_train , y_test, print=False )
model

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000592 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 171
[LightGBM] [Info] Number of data points in the train set: 922, number of used features: 62
[LightGBM] [Info] Start training from score -0.943223
[LightGBM] [Info] Start training from score -0.915749
[LightGBM] [Info] Start training from score -1.776689
[LightGBM] [Info] Start training from score -3.188959
Learning rate set to 0.0788
0:	learn: 1.2986428	total: 71.8ms	remaining: 1m 11s
1:	learn: 1.2285957	total: 123ms	remaining: 1m 1s
2:	learn: 1.1813488	total: 178ms	remaining: 59.1s
3:	learn: 1.1423763	total: 229ms	remaining: 57s
4:	learn: 1.0956741	total: 280ms	remaining: 55.7s
5:	learn: 1.0565721	total: 331ms	remaining: 54.8s
6:	learn: 1.0225260	total: 446ms	remaining: 1m 3s
7:	learn: 0.9929250	total: 551ms	remaining: 1m 8s

Unnamed: 0_level_0,Accuracy,F1-Score,Recall,Precision,Balance
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Perceptron,0.84,0.78,0.74,0.89,0.74
Ridge Classifier,0.84,0.75,0.72,0.85,0.72
Ridge Classifier CV,0.84,0.75,0.72,0.85,0.72
SGD Classifier,0.84,0.72,0.69,0.89,0.69
Passive Aggressive Classifier,0.83,0.78,0.77,0.79,0.77
Decision Tree,0.83,0.75,0.72,0.88,0.72
XGBoost Classifier,0.83,0.74,0.71,0.8,0.71
Random Forest Classifier,0.83,0.73,0.7,0.88,0.7
Extra Trees Classifier,0.83,0.72,0.69,0.89,0.69
CatBoost Classifier,0.83,0.72,0.69,0.89,0.69


In [None]:
from imblearn.over_sampling import SMOTE

over = SMOTE()
X_train_over , y_train_over = over.fit_resample(X_train , y_train)
# The first positional argument of value_counts is `normalize`; state it explicitly.
display(y_train_over.value_counts(normalize=True))
len(y_train_over)

Unnamed: 0_level_0,proportion
Category,Unnamed: 1_level_1
1,0.25
0,0.25
2,0.25
3,0.25


1476

In [None]:
model = classifier_report(X_train_over , X_test , y_train_over , y_test, print=False )
model

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000991 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 228
[LightGBM] [Info] Number of data points in the train set: 1476, number of used features: 89
[LightGBM] [Info] Start training from score -1.386294
[LightGBM] [Info] Start training from score -1.386294
[LightGBM] [Info] Start training from score -1.386294
[LightGBM] [Info] Start training from score -1.386294
Learning rate set to 0.080714
0:	learn: 1.3130858	total: 80.6ms	remaining: 1m 20s
1:	learn: 1.2566679	total: 136ms	remaining: 1m 7s
2:	learn: 1.2126560	total: 196ms	remaining: 1m 5s
3:	learn: 1.1834800	total: 248ms	remaining: 1m 1s
4:	learn: 1.1509809	total: 300ms	remaining: 59.8s
5:	learn: 1.1292358	total: 356ms	remaining: 59s
6:	learn: 1.1041701	total: 413ms	remaining: 58.6s
7:	learn: 1.0787377	total: 466ms	remaining: 57.8s

Unnamed: 0_level_0,Accuracy,F1-Score,Recall,Precision,Balance
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Passive Aggressive Classifier,0.82,0.77,0.78,0.76,0.78
SGD Classifier,0.82,0.75,0.74,0.78,0.74
Calibrated Classifier CV,0.82,0.74,0.73,0.76,0.73
Ridge Classifier,0.81,0.77,0.82,0.75,0.82
Logistic Regression,0.81,0.76,0.8,0.75,0.8
XGBoost Classifier,0.81,0.75,0.77,0.75,0.77
CatBoost Classifier,0.81,0.74,0.77,0.73,0.77
Perceptron,0.81,0.73,0.72,0.76,0.72
Gaussian Naive Bayes,0.8,0.73,0.71,0.81,0.71
Extra Trees Classifier,0.8,0.73,0.73,0.73,0.73


In [None]:
Tfid_Data = TfidfVectorizer(ngram_range=(1,3)).fit(dataset.pre)
X_Tfid_Matrix = Tfid_Data.transform(dataset.pre)

X_Tfid = pd.DataFrame(data=X_Tfid_Matrix.toarray() , columns = Tfid_Data.get_feature_names_out())
X_Tfid

Unnamed: 0,abai,abai cabang,abai cabang bank,abai hubung,abai hubung cs,abis,abis ambil,abis ambil uang,abis setor,abis setor uang,...,yg xpresi udh,yh,yh mbca,zaman,zaman gw,zaman gw kuliah,zte,zte bca,zte bca mobile,zuu
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1225,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1226,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1227,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1228,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
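
For comparison with the raw counts used earlier, the weighting behind these values can be sketched in plain Python. The sketch assumes the scikit-learn defaults as I understand them: smoothed idf `ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation of each row. It handles unigrams only and splits on whitespace, and the three toy comments are made up for illustration.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """TF-IDF with smoothed idf and L2 row normalisation (unigrams only)."""
    tokenised = [doc.split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for tokens in tokenised:
        counts = Counter(tokens)
        weights = {t: c * idf[t] for t, c in counts.items()}
        # Scale the row to unit Euclidean length.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

# Toy comments, not taken from the dataset.
docs = ["bca mobile merah", "bca mobile error", "kartu kredit bca"]
rows = tfidf_rows(docs)
print(rows[0])
```

A term that appears in every comment (`bca`) gets the smallest weight in its row, while a term unique to one comment (`merah`) gets the largest.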


In [None]:
Tfid_df = pd.concat([X_Tfid , dataset.Category] , axis=1)
Tfid_df.head()

Unnamed: 0,abai,abai cabang,abai cabang bank,abai hubung,abai hubung cs,abis,abis ambil,abis ambil uang,abis setor,abis setor uang,...,yh,yh mbca,zaman,zaman gw,zaman gw kuliah,zte,zte bca,zte bca mobile,zuu,Category
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0


In [None]:
X = Tfid_df.drop(["Category"] , axis=1)
y = Tfid_df['Category']

In [None]:
X_train , X_test , y_train , y_test = train_test_split(X,y , test_size=0.25 , random_state = 33)

In [None]:
model = classifier_report(X_train , X_test , y_train , y_test, print=False )
model

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000584 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 878
[LightGBM] [Info] Number of data points in the train set: 922, number of used features: 62
[LightGBM] [Info] Start training from score -0.943223
[LightGBM] [Info] Start training from score -0.915749
[LightGBM] [Info] Start training from score -1.776689
[LightGBM] [Info] Start training from score -3.188959
Learning rate set to 0.0788
0:	learn: 1.3002040	total: 80.8ms	remaining: 1m 20s
1:	learn: 1.2330422	total: 142ms	remaining: 1m 11s
2:	learn: 1.1719215	total: 209ms	remaining: 1m 9s
3:	learn: 1.1231677	total: 271ms	remaining: 1m 7s
4:	learn: 1.0867328	total: 348ms	remaining: 1m 9s
5:	learn: 1.0490149	total: 411ms	remaining: 1m 8s
6:	learn: 1.0149604	total: 474ms	remaining: 1m 7s
7:	learn: 0.9906895	total: 537ms	remaining: 1m 6s

Unnamed: 0_level_0,Accuracy,F1-Score,Recall,Precision,Balance
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Extra Trees Classifier,0.84,0.72,0.69,0.89,0.69
SGD Classifier,0.83,0.72,0.69,0.89,0.69
Perceptron,0.83,0.71,0.69,0.88,0.69
Label Propagation,0.82,0.75,0.72,0.81,0.72
Label Spreading,0.82,0.75,0.72,0.81,0.72
Calibrated Classifier CV,0.82,0.71,0.68,0.88,0.68
Ridge Classifier CV,0.82,0.71,0.68,0.88,0.68
Decision Tree,0.81,0.74,0.72,0.81,0.72
XGBoost Classifier,0.81,0.74,0.72,0.8,0.72
Gradient Boosting Classifier,0.81,0.73,0.7,0.84,0.7


In [None]:
from imblearn.over_sampling import SMOTE

over = SMOTE()
X_train_over , y_train_over = over.fit_resample(X_train , y_train)
# The first positional argument of value_counts is `normalize`; state it explicitly.
display(y_train_over.value_counts(normalize=True))
len(y_train_over)

Unnamed: 0_level_0,proportion
Category,Unnamed: 1_level_1
1,0.25
0,0.25
2,0.25
3,0.25


1476

In [None]:
model = classifier_report(X_train_over , X_test , y_train_over , y_test, print=False )
model

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.005520 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 4298
[LightGBM] [Info] Number of data points in the train set: 1476, number of used features: 302
[LightGBM] [Info] Start training from score -1.386294
[LightGBM] [Info] Start training from score -1.386294
[LightGBM] [Info] Start training from score -1.386294
[LightGBM] [Info] Start training from score -1.386294
Learning rate set to 0.080714
0:	learn: 1.2933572	total: 376ms	remaining: 6m 15s
1:	learn: 1.2213051	total: 698ms	remaining: 5m 48s
2:	learn: 1.1666219	total: 972ms	remaining: 5m 23s
3:	learn: 1.1111787	total: 1.24s	remaining: 5m 8s
4:	learn: 1.0648619	total: 1.6s	remaining: 5m 18s
5:	learn: 1.0256818	total: 1.86s	remaining: 5m 8s
6:	learn: 0.9883707	total: 2.18s	remaining: 5m 9s
7:	learn: 0.9573114	total: 2.42s	remaining: 5m

Unnamed: 0_level_0,Accuracy,F1-Score,Recall,Precision,Balance
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Perceptron,0.84,0.72,0.7,0.89,0.7
XGBoost Classifier,0.83,0.75,0.71,0.89,0.71
Passive Aggressive Classifier,0.83,0.71,0.68,0.88,0.68
Extra Trees Classifier,0.82,0.75,0.71,0.88,0.71
Gradient Boosting Classifier,0.82,0.75,0.72,0.82,0.72
CatBoost Classifier,0.82,0.73,0.7,0.83,0.7
Logistic Regression,0.82,0.71,0.68,0.89,0.68
SGD Classifier,0.82,0.71,0.68,0.88,0.68
Calibrated Classifier CV,0.82,0.71,0.68,0.88,0.68
Ridge Classifier CV,0.82,0.71,0.68,0.88,0.68


## CASE 5 ( 1,3 and text length )

In [None]:
data_len.head()

Unnamed: 0,pre,Category,text_len
0,mbanking lot transaksi lelet banget lokasi nyala,0,7
1,bca mobile merah y,0,4
2,bca mobile merah y,0,4
3,ngisi pulsa bca mobile pulsa kaga masuk masuk...,1,12
4,crash mybca not responding ulang kali restart hp,0,8
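
The `text_len` column is built earlier in the notebook. The values shown here are consistent with a simple count of whitespace-separated tokens in `pre`. A minimal sketch under that assumption (the helper name `text_length` is hypothetical):

```python
def text_length(pre):
    """Number of whitespace-separated tokens in a preprocessed comment."""
    return len(pre.split())

# Rows copied from the table above, mapped to the text_len value shown there.
samples = {
    "mbanking lot transaksi lelet banget lokasi nyala": 7,
    "bca mobile merah y": 4,
    "crash mybca not responding ulang kali restart hp": 8,
}
for pre, shown in samples.items():
    print(text_length(pre), shown)
```

Row 3 is truncated in the display, so it is left out of the check.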


In [None]:
Count_Data = CountVectorizer(ngram_range=(1,3)).fit(data_len.pre)
X_Count_matrix = Count_Data.transform(data_len.pre)

X_Count = pd.DataFrame(data=X_Count_matrix.toarray() , columns = Count_Data.get_feature_names_out())

In [None]:
Count_df = pd.concat([X_Count , data_len.text_len ,data_len.Category] , axis=1)
Count_df.head()

Unnamed: 0,abai,abai cabang,abai cabang bank,abai hubung,abai hubung cs,abis,abis ambil,abis ambil uang,abis setor,abis setor uang,...,yh mbca,zaman,zaman gw,zaman gw kuliah,zte,zte bca,zte bca mobile,zuu,text_len,Category
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,7,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,4,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,4,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,12,1
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,8,0


In [None]:
X = Count_df.drop(["Category"] , axis=1)
y = Count_df['Category']

In [None]:
X_train , X_test , y_train , y_test = train_test_split(X,y , test_size=0.25 , random_state = 33)

In [None]:
model = classifier_report(X_train , X_test , y_train , y_test, print=False )
model

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000546 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 195
[LightGBM] [Info] Number of data points in the train set: 922, number of used features: 63
[LightGBM] [Info] Start training from score -0.943223
[LightGBM] [Info] Start training from score -0.915749
[LightGBM] [Info] Start training from score -1.776689
[LightGBM] [Info] Start training from score -3.188959
Learning rate set to 0.0788
0:	learn: 1.2984366	total: 66.6ms	remaining: 1m 6s
1:	learn: 1.2266265	total: 119ms	remaining: 59.3s
2:	learn: 1.1683459	total: 174ms	remaining: 57.7s
3:	learn: 1.1194208	total: 238ms	remaining: 59.2s
4:	learn: 1.0773777	total: 292ms	remaining: 58.2s
5:	learn: 1.0409952	total: 343ms	remaining: 56.9s
6:	learn: 1.0101661	total: 397ms	remaining: 56.3s
7:	learn: 0.9867514	total: 428ms	remaining: 53s

Unnamed: 0_level_0,Accuracy,F1-Score,Recall,Precision,Balance
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Perceptron,0.84,0.78,0.76,0.81,0.76
Ridge Classifier,0.84,0.76,0.72,0.89,0.72
Ridge Classifier CV,0.84,0.76,0.72,0.89,0.72
Bagging Classifier,0.83,0.73,0.7,0.89,0.7
Extra Trees Classifier,0.83,0.72,0.69,0.89,0.69
SGD Classifier,0.83,0.7,0.68,0.88,0.68
Gradient Boosting Classifier,0.82,0.74,0.71,0.88,0.71
Passive Aggressive Classifier,0.82,0.71,0.68,0.82,0.68
Calibrated Classifier CV,0.82,0.71,0.68,0.88,0.68
Random Forest Classifier,0.82,0.69,0.67,0.88,0.67


In [None]:
from imblearn.over_sampling import SMOTE

over = SMOTE()
X_train_over , y_train_over = over.fit_resample(X_train , y_train)
# The first positional argument of value_counts is `normalize`; state it explicitly.
y_train_over.value_counts(normalize=True)

Unnamed: 0_level_0,proportion
Category,Unnamed: 1_level_1
1,0.25
0,0.25
2,0.25
3,0.25


In [None]:
model = classifier_report(X_train_over , X_test , y_train_over , y_test, print=False )
model

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001483 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 244
[LightGBM] [Info] Number of data points in the train set: 1476, number of used features: 84
[LightGBM] [Info] Start training from score -1.386294
[LightGBM] [Info] Start training from score -1.386294
[LightGBM] [Info] Start training from score -1.386294
[LightGBM] [Info] Start training from score -1.386294
Learning rate set to 0.080714
0:	learn: 1.3156983	total: 66.8ms	remaining: 1m 6s
1:	learn: 1.2615889	total: 126ms	remaining: 1m 2s
2:	learn: 1.2184846	total: 181ms	remaining: 1m
3:	learn: 1.1827421	total: 235ms	remaining: 58.5s
4:	learn: 1.1517298	total: 289ms	remaining: 57.5s
5:	learn: 1.1209855	total: 345ms	remaining: 57.2s
6:	learn: 1.0978128	total: 399ms	remaining: 56.7s
7:	learn: 1.0821111	total: 455ms	remaining: 56.4s

Unnamed: 0_level_0,Accuracy,F1-Score,Recall,Precision,Balance
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Logistic Regression,0.83,0.78,0.81,0.76,0.81
Calibrated Classifier CV,0.83,0.76,0.75,0.77,0.75
SGD Classifier,0.82,0.73,0.72,0.76,0.72
CatBoost Classifier,0.81,0.75,0.78,0.74,0.78
Extra Trees Classifier,0.81,0.73,0.74,0.73,0.74
Ridge Classifier,0.8,0.75,0.8,0.73,0.8
Ridge Classifier CV,0.8,0.75,0.8,0.73,0.8
Gaussian Naive Bayes,0.8,0.73,0.71,0.81,0.71
XGBoost Classifier,0.79,0.73,0.74,0.72,0.74
Perceptron,0.78,0.75,0.72,0.82,0.72


In [None]:
Tfid_Data = TfidfVectorizer(ngram_range=(1,3)).fit(data_len.pre)
X_Tfid_Matrix = Tfid_Data.transform(data_len.pre)

X_Tfid = pd.DataFrame(data=X_Tfid_Matrix.toarray() , columns = Tfid_Data.get_feature_names_out())

In [None]:
Tfid_df = pd.concat([X_Tfid , data_len.text_len,data_len.Category] , axis=1)
Tfid_df.head()

Unnamed: 0,abai,abai cabang,abai cabang bank,abai hubung,abai hubung cs,abis,abis ambil,abis ambil uang,abis setor,abis setor uang,...,yh mbca,zaman,zaman gw,zaman gw kuliah,zte,zte bca,zte bca mobile,zuu,text_len,Category
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7,0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4,0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4,0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,12,1
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8,0


In [None]:
X = Tfid_df.drop(["Category"] , axis=1)
y = Tfid_df['Category']

In [None]:
X_train , X_test , y_train , y_test = train_test_split(X,y , test_size=0.25 , random_state = 33)

In [None]:
model = classifier_report(X_train , X_test , y_train , y_test, print=False )
model

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000547 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 902
[LightGBM] [Info] Number of data points in the train set: 922, number of used features: 63
[LightGBM] [Info] Start training from score -0.943223
[LightGBM] [Info] Start training from score -0.915749
[LightGBM] [Info] Start training from score -1.776689
[LightGBM] [Info] Start training from score -3.188959
Learning rate set to 0.0788
0:	learn: 1.2963188	total: 92.2ms	remaining: 1m 32s
1:	learn: 1.2226870	total: 186ms	remaining: 1m 32s
2:	learn: 1.1746106	total: 295ms	remaining: 1m 38s
3:	learn: 1.1234133	total: 416ms	remaining: 1m 43s
4:	learn: 1.0779719	total: 537ms	remaining: 1m 46s
5:	learn: 1.0446690	total: 662ms	remaining: 1m 49s
6:	learn: 1.0098118	total: 782ms	remaining: 1m 50s
7:	learn: 0.9800999	total: 901ms	remaining: 1m 5

Unnamed: 0_level_0,Accuracy,F1-Score,Recall,Precision,Balance
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Extra Trees Classifier,0.82,0.71,0.68,0.88,0.68
Ridge Classifier CV,0.82,0.71,0.68,0.88,0.68
CatBoost Classifier,0.82,0.68,0.66,0.89,0.66
Perceptron,0.82,0.66,0.65,0.88,0.65
Gradient Boosting Classifier,0.81,0.74,0.7,0.88,0.7
Bagging Classifier,0.81,0.73,0.7,0.83,0.7
XGBoost Classifier,0.81,0.73,0.7,0.81,0.7
Calibrated Classifier CV,0.81,0.7,0.67,0.88,0.67
Ridge Classifier,0.81,0.68,0.66,0.88,0.66
Random Forest Classifier,0.81,0.67,0.66,0.88,0.66


In [None]:
from imblearn.over_sampling import SMOTE

over = SMOTE()
X_train_over , y_train_over = over.fit_resample(X_train , y_train)
# The first positional argument of value_counts is `normalize`; state it explicitly.
y_train_over.value_counts(normalize=True)

Unnamed: 0_level_0,proportion
Category,Unnamed: 1_level_1
1,0.25
0,0.25
2,0.25
3,0.25


In [None]:
model = classifier_report(X_train_over , X_test , y_train_over , y_test, print=False )
model

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.003840 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 4224
[LightGBM] [Info] Number of data points in the train set: 1476, number of used features: 307
[LightGBM] [Info] Start training from score -1.386294
[LightGBM] [Info] Start training from score -1.386294
[LightGBM] [Info] Start training from score -1.386294
[LightGBM] [Info] Start training from score -1.386294
Learning rate set to 0.080714
0:	learn: 1.3003615	total: 307ms	remaining: 5m 6s
1:	learn: 1.2307067	total: 588ms	remaining: 4m 53s
2:	learn: 1.1668807	total: 861ms	remaining: 4m 46s
3:	learn: 1.1196143	total: 1.14s	remaining: 4m 44s
4:	learn: 1.0713196	total: 1.4s	remaining: 4m 37s
5:	learn: 1.0345964	total: 1.66s	remaining: 4m 35s
6:	learn: 1.0015438	total: 1.92s	remaining: 4m 32s
7:	learn: 0.9685351	total: 2.21s	remaining: 4m

Unnamed: 0_level_0,Accuracy,F1-Score,Recall,Precision,Balance
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
XGBoost Classifier,0.83,0.73,0.7,0.84,0.7
Gradient Boosting Classifier,0.82,0.72,0.69,0.8,0.69
Bagging Classifier,0.82,0.72,0.69,0.83,0.69
Ridge Classifier CV,0.82,0.71,0.68,0.88,0.68
CatBoost Classifier,0.82,0.7,0.68,0.78,0.68
Perceptron,0.81,0.76,0.74,0.79,0.74
Logistic Regression,0.81,0.72,0.69,0.88,0.69
Decision Tree,0.81,0.71,0.69,0.79,0.69
Extra Trees Classifier,0.81,0.7,0.67,0.88,0.67
Calibrated Classifier CV,0.81,0.7,0.67,0.88,0.67


# CONCLUSION

*   After several modelling experiments, the Passive Aggressive Classifier gave the best performance compared with Decision Tree and Logistic Regression. It produced the highest accuracy, 85%, and an F1 score of 87%, which indicates that it balances precision and recall well.
*   Logistic Regression stands out for recall (81%), which makes it better at catching the comments that truly belong to each class. Its accuracy and F1 score, however, are lower than those of the Passive Aggressive Classifier.
*   Decision Tree performs reasonably well overall, but trails the other two models in accuracy, precision, recall, and F1 score.





# Passive Aggressive Classifier

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
dataset = pd.read_csv("modelling_data_final.csv")
dataset.head()

Unnamed: 0,pre,Category
0,mbanking lot transaksi lelet banget lokasi nyala,0
1,bca mobile merah y,0
2,bca mobile merah y,0
3,ngisi pulsa bca mobile pulsa kaga masuk masuk...,1
4,crash mybca not responding ulang kali restart hp,0


In [None]:
Count_Data = CountVectorizer(ngram_range=(1,2)).fit(dataset.pre)
X_Count_matrix = Count_Data.transform(dataset.pre)

X_Count = pd.DataFrame(data=X_Count_matrix.toarray() , columns = Count_Data.get_feature_names_out())
X_Count

Unnamed: 0,abai,abai cabang,abai hubung,abis,abis ambil,abis setor,abis si,abissss,abissss nelpon,about,...,yg udh,yg viral,yg xpresi,yh,yh mbca,zaman,zaman gw,zte,zte bca,zuu
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1225,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1226,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1227,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1228,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
import pickle

with open('Countvectorizer_cat.pkl', 'wb') as f:
    pickle.dump(Count_Data, f)

In [None]:
Count_df = pd.concat([X_Count , dataset.Category] , axis=1)
Count_df.head()

Unnamed: 0,abai,abai cabang,abai hubung,abis,abis ambil,abis setor,abis si,abissss,abissss nelpon,about,...,yg viral,yg xpresi,yh,yh mbca,zaman,zaman gw,zte,zte bca,zuu,Category
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
X = Count_df.drop(["Category"] , axis=1)
y = Count_df['Category']

In [None]:
X_train , X_test , y_train , y_test = train_test_split(X,y , test_size=0.25 , random_state = 33)

In [None]:
model = PassiveAggressiveClassifier()
model.fit(X_train , y_train)
y_pred = model.predict(X_test)
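# classification_report expects (y_true, y_pred). With the arguments in this order,
# the precision and recall columns are swapped and support counts predicted labels.
print(classification_report(y_pred , y_test))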

              precision    recall  f1-score   support

           0       0.88      0.94      0.91       113
           1       0.85      0.71      0.78       132
           2       0.80      0.86      0.83        57
           3       0.31      0.83      0.45         6

    accuracy                           0.82       308
   macro avg       0.71      0.84      0.74       308
weighted avg       0.84      0.82      0.83       308



In [None]:
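from sklearn.metrics import confusion_matrix

def con_mat(y_true, y_pred):
  # Build the matrix from the labels passed in (rows = actual, columns = predicted)
  # instead of relying on the global y_test.
  cm = confusion_matrix(y_true, y_pred)

  # Annotated heatmap with one cell per (actual, predicted) pair.
  fig = ff.create_annotated_heatmap(
      z=cm,
      x=['Predicted 0', 'Predicted 1', 'Predicted 2', 'Predicted 3'],
      y=['Actual 0', 'Actual 1', 'Actual 2', 'Actual 3'],
      colorscale='Blues',
      showscale=True
  )

  fig.update_layout(
      title='Confusion Matrix',
      xaxis_title='Predicted Labels',
      yaxis_title='Actual Labels'
  )

  fig.show()

con_mat(y_test , y_pred)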
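
The heatmap follows the convention used by `confusion_matrix`: each row is an actual class and each column a predicted class, so correct predictions sit on the diagonal. A small pure-Python sketch of how those counts are tallied is shown below. The helper name `confusion_counts` and the demo labels are made up for illustration and are not taken from the notebook's data.

```python
def confusion_counts(y_true, y_pred, n_classes):
    """Rows index the actual class, columns the predicted class."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix

# Made-up labels for illustration only.
y_true_demo = [0, 0, 1, 1, 2, 3, 3]
y_pred_demo = [0, 1, 1, 1, 2, 1, 3]
cm_demo = confusion_counts(y_true_demo, y_pred_demo, n_classes=4)
for row in cm_demo:
    print(row)

# Accuracy is the share of samples on the diagonal.
accuracy_demo = sum(cm_demo[i][i] for i in range(4)) / len(y_true_demo)
print(accuracy_demo)
```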

In [None]:
from sklearn.model_selection import GridSearchCV, KFold

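# validation_fraction is only used when early_stopping=True, so with the default
# early_stopping=False it does not change the fitted model.
param_grid = {
    'C': [0.001, 0.01, 0.1, 1.0, 10],
    'max_iter': [1000, 2000, 5000],
    'loss': ['hinge', 'squared_hinge'],
    'validation_fraction': [0.1, 0.2],
    'fit_intercept': [True, False],
}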


kf = KFold(n_splits=5, shuffle=True, random_state=42)

grid_search = GridSearchCV(estimator=model, param_grid=param_grid, scoring='accuracy', cv=kf)


grid_search.fit(X_train, y_train)

best_params = grid_search.best_params_
best_score = grid_search.best_score_

print("Best Parameters:", best_params)
print("Best Cross-Validation Score:", best_score)


Best Parameters: {'C': 0.001, 'fit_intercept': True, 'loss': 'squared_hinge', 'max_iter': 2000, 'validation_fraction': 0.2}
Best Cross-Validation Score: 0.8687544065804935
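
To get a feel for the cost of this search, the grid can be expanded by hand. The sketch below enumerates the same `param_grid` with `itertools.product` and counts the model fits implied by 5-fold cross-validation. It only counts candidates and does not train anything.

```python
from itertools import product

param_grid = {
    'C': [0.001, 0.01, 0.1, 1.0, 10],
    'max_iter': [1000, 2000, 5000],
    'loss': ['hinge', 'squared_hinge'],
    'validation_fraction': [0.1, 0.2],
    'fit_intercept': [True, False],
}
n_splits = 5

# Every combination of one value per parameter is one candidate.
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]

# Each candidate is fitted once per fold (the final refit on all data is extra).
n_fits = len(candidates) * n_splits
print(len(candidates), n_fits)
```

The reported best parameters are one of these candidates, selected by mean accuracy across the folds.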


In [None]:
params = {'C': 0.001, 'fit_intercept': True, 'loss': 'squared_hinge', 'max_iter': 2000, 'validation_fraction': 0.2}

PassiveAggressive = PassiveAggressiveClassifier(**params)
PassiveAggressive.fit(X_train , y_train)
y_pred = PassiveAggressive.predict(X_test)
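# classification_report expects (y_true, y_pred). With the arguments in this order,
# the precision and recall columns are swapped and support counts predicted labels.
print(classification_report(y_pred , y_test))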

              precision    recall  f1-score   support

           0       0.90      0.92      0.91       118
           1       0.85      0.73      0.79       128
           2       0.80      0.88      0.84        56
           3       0.38      1.00      0.55         6

    accuracy                           0.84       308
   macro avg       0.73      0.88      0.77       308
weighted avg       0.85      0.84      0.84       308
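
The two summary rows of the report average the per-class rows in different ways: `macro avg` treats the four classes equally, while `weighted avg` weights each class by its support. The sketch below recomputes both from the per-class figures printed above. Because those inputs are already rounded to two decimals, the results only agree with the report to that precision.

```python
# Per-class rows copied from the report above: (precision, recall, f1, support).
rows = {
    0: (0.90, 0.92, 0.91, 118),
    1: (0.85, 0.73, 0.79, 128),
    2: (0.80, 0.88, 0.84, 56),
    3: (0.38, 1.00, 0.55, 6),
}
total = sum(r[3] for r in rows.values())

def macro(i):
    """Unweighted mean over classes."""
    return sum(r[i] for r in rows.values()) / len(rows)

def weighted(i):
    """Mean over classes, weighted by support."""
    return sum(r[i] * r[3] for r in rows.values()) / total

for name, i in [("precision", 0), ("recall", 1), ("f1-score", 2)]:
    print(f"{name:10s} macro={macro(i):.2f} weighted={weighted(i):.2f}")
```

Class 3 has only 6 rows of support, so its low precision pulls the macro average down much more than the weighted one.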



In [None]:
con_mat(y_test , y_pred)

# Logistic Regression

In [None]:
data_len.head()

Unnamed: 0,pre,Category,text_len
0,mbanking lot transaksi lelet banget lokasi nyala,0,7
1,bca mobile merah y,0,4
2,bca mobile merah y,0,4
3,ngisi pulsa bca mobile pulsa kaga masuk masuk...,1,12
4,crash mybca not responding ulang kali restart hp,0,8


In [None]:
Count_Data = CountVectorizer(ngram_range=(1,3)).fit(data_len.pre)
X_Count_matrix = Count_Data.transform(data_len.pre)

X_Count = pd.DataFrame(data=X_Count_matrix.toarray() , columns = Count_Data.get_feature_names_out())

In [None]:
Count_df = pd.concat([X_Count , data_len.text_len ,data_len.Category] , axis=1)
Count_df.head()

Unnamed: 0,abai,abai cabang,abai cabang bank,abai hubung,abai hubung cs,abis,abis ambil,abis ambil uang,abis setor,abis setor uang,...,yh mbca,zaman,zaman gw,zaman gw kuliah,zte,zte bca,zte bca mobile,zuu,text_len,Category
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,7,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,4,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,4,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,12,1
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,8,0


In [None]:
X = Count_df.drop(["Category"] , axis=1)
y = Count_df['Category']

In [None]:
X_train , X_test , y_train , y_test = train_test_split(X,y , test_size=0.25 , random_state = 33)

In [None]:
from imblearn.over_sampling import SMOTE

over = SMOTE()
X_train_over , y_train_over = over.fit_resample(X_train , y_train)
# The first positional argument of value_counts is `normalize`; state it explicitly.
y_train_over.value_counts(normalize=True)

Unnamed: 0_level_0,proportion
Category,Unnamed: 1_level_1
1,0.25
0,0.25
2,0.25
3,0.25


In [None]:
model = LogisticRegression()
model.fit(X_train_over , y_train_over)
y_pred = model.predict(X_test)
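# classification_report expects (y_true, y_pred). With the arguments in this order,
# the precision and recall columns are swapped and support counts predicted labels.
print(classification_report(y_pred , y_test))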

              precision    recall  f1-score   support

           0       0.84      0.95      0.89       107
           1       0.82      0.77      0.79       117
           2       0.79      0.83      0.81        58
           3       0.69      0.42      0.52        26

    accuracy                           0.81       308
   macro avg       0.78      0.74      0.75       308
weighted avg       0.81      0.81      0.81       308



In [None]:
con_mat(y_test , y_pred)

In [None]:
from sklearn.model_selection import GridSearchCV, KFold

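# Newer scikit-learn releases expect None rather than the string 'none' for an
# unpenalised model; the string form matches the version this notebook was run with.
param_grid = {
    'penalty': ['l1', 'l2', 'none'],
    'solver': ['newton-cg', 'lbfgs', 'saga'],
    'max_iter': [100, 200, 500],
    'fit_intercept': [True, False],
}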



kf = KFold(n_splits=3, shuffle=True, random_state=42)

grid_search = GridSearchCV(estimator=model, param_grid=param_grid, scoring='accuracy', cv=kf)


grid_search.fit(X_train, y_train)

best_params = grid_search.best_params_
best_score = grid_search.best_score_

print("Best Parameters:", best_params)
print("Best Cross-Validation Score:", best_score)


Best Parameters: {'fit_intercept': True, 'max_iter': 200, 'penalty': 'l2', 'solver': 'saga'}
Best Cross-Validation Score: 0.8513826021969345
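
To give a sense of how much work the search above does, the sketch below enumerates the same grid with `itertools.product` and multiplies by the three folds. It also counts the combinations that pair the `l1` penalty with `newton-cg` or `lbfgs`, which those solvers do not support. This is only a counting aid written for this example, not part of the original pipeline.

```python
from itertools import product

param_grid = {
    'penalty': ['l1', 'l2', 'none'],
    'solver': ['newton-cg', 'lbfgs', 'saga'],
    'max_iter': [100, 200, 500],
    'fit_intercept': [True, False],
}

keys = list(param_grid)
combos = [dict(zip(keys, values)) for values in product(*param_grid.values())]

n_splits = 3  # matches KFold(n_splits=3)
total_cv_fits = len(combos) * n_splits

# Combinations whose solver cannot handle the l1 penalty
unsupported = [c for c in combos
               if c['penalty'] == 'l1' and c['solver'] in ('newton-cg', 'lbfgs')]

print(len(combos), total_cv_fits, len(unsupported))
```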


In [None]:
import requests

In [None]:
all_komen = []

ids  = ["7136853352922483995", "7140110819378892059", "7212939772577271066", "7215971978136538395",
           "7219156257196854554", "7221843236908829957", "7228501515684646171", "7236673528748100869",
           "7239183745239272709", "7242237857287458053", "7248969456167374085", "7251899962991201541",
           "7255973219251932422", "7257132968525794566", "7259230226473929989", "7262190893908757766",
           "7264812610544045317", "7142048535997943066", "7150588258286685466", "7151977884905721115",
           "7156572629766278426", "7165477268222004506", "7176951125139033371", "7177544466641866011",
           "7184752732337638682", "7185498105759124763", "7189922406566235418", "7199641143213051163",
           "7202863558491786522", "7202988186618957083", "7203270742304115995", "7203746612991102234",
           "7210399148850367770", "7211126688749505819", "7420297360879783175", "7419565921208896773",
           "7419174703371160837", "7419173769429765381", "7418950195855035653", "7418580150289190150",
           "7418554133550009605", "7418493406600629512", "7423339331244297478", "7422172620092755205",
           "7421923351725133061", "7421836312925293829", "7421560752324005126", "7416344622751911174",
           "7416297348613131525", "7415097630629301510", "7413979586863615238", "7413633762652605702",
           "7413594849003195653", "7413278950958419206", "7413226464918654214", "7412851659107273989",
           "7412549894965038342", "7412192874025995525", "7412182720895339781", "7411142956079484165",
           "7410739643328597253", "7410265971358829830", "7409987530377972997", "7409938658683342086",
           "7408527007274781957" ,  "7382580219275660550","7377749296155790597","7369147026434444550",
           "7364352224585059589","7362093408573246726","7352456894847487238","7347193593427152134" ,
           '7421355952508603654' , '7310089069231377669' , '6939727710835002626' , '6943814495261199618' ,
           '6951233236236389634' , '6958987258330434818' ,'6964334359897853185' , '6969110075617070337' ,
           '6976137769823292674' , '6985488284411497755' , '6990240874131115291' , '6995783610963971355' ,
           '6995784184300047642' ,'7003721580878531841' , '7016215952052342042' , '7021817106186620187' ,
           '7025560307623841051' , '7030775791713766683' , '7032545995766418715' , '7035099305090239770' ,
           '7035961716718570778' , '7038150744767302939' , '7041778620871888155' , '7041809761523879195' ,
           '7059279523522546971' , '7070059271232179483' , '7070463811475590426' , '7071157452439407899' ,
           '7076260081930669339' ,  '7077887233239436570' , '7078513708628053275' , '7083770482297802011' ,
           '7084103594693168411' , '7084223525095017755' , '7085327628365876506' , '7087809423418084634' ,
           '7088575631159463195' ,'7088664413753888026' , '7088983982561561883' , '7089045686213299483' ,
           '7105986886711184666']

for id in ids :

    for cursor in range(0 , 100) :

        url = "https://www.tiktok.com/api/comment/list/"

        querystring = {"aweme_id":id,"count":"50","cursor":cursor  * 50,"WebIdLastTime":"1639506389","aid":"1988","app_language":"ja-JP","app_name":"tiktok_web","browser_language":"en-US","browser_name":"Mozilla","browser_online":"true","browser_platform":"Win32","browser_version":"5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36 Edg/128.0.0.0","channel":"tiktok_web","cookie_enabled":"true","current_region":"JP","data_collection_enabled":"true","device_id":"7041626242181744130","device_platform":"web_pc","enter_from":"tiktok_web","focus_state":"false","fromWeb":"1","from_page":"video","history_len":"3","is_fullscreen":"false","is_non_personalized":"false","is_page_visible":"true","odinId":"7264238560303416325","os":"windows","priority_region":"ID","referer":"https://www.bing.com/","region":"ID","root_referer":"https://www.bing.com/","screen_height":"720","screen_width":"1280","tz_name":"Asia/Jakarta","user_is_login":"true","webcast_language":"en","msToken":"ZdSMovTf545nrCnxOpKLhswH6CVJC7QdryvqAay0Pzev8e_hb_z9fb8YlCwttKa3-OXadIMCB7W1Z4L4x84zzhugxzR6lPw06TGgTRpZK0iynwSZxL5aj7y7ATSNTQ11Z-pab0h_dtIKfw==","X-Bogus":"DFSzswVLnjsANVA/tX4vyMSscjVx","_signature":"_02B4Z6wo00001luC7awAAIDC0dbjZCJe3q5bgukAAPAUdd"}

        payload = ""
        headers = {
            "cookie": "odin_tt=8f91cad8a23599ac058b25d8f0db9d99a38ee56bdd5c508ea949cc9a50b8a5a7e7303b90b14b77805ba7406ebbb091b15ac4fa3eb45eb3b472bb295fb4257760bcaac0a9f939a4f53503da521b297d61; msToken=omdGdprVrikrsLNV_9KuSeHutZ_LPcqpbo0u52OSgUkJzi5QLdO9lB73QE2RwYIQD_SrMExn492y3-u1ZOhOtaX027LPKtbrqowqO3JRi7OW1DTVpzD0ffKSMMsBLCGoBn0fOT3S1grUoA%3D%3D",
            "User-Agent": "insomnia/10.0.0"
        }


        response = requests.request("GET", url, data=payload, headers=headers, params=querystring)

        j = response.json()

        # Stop paging this video once the response has no comments left
        try:
            if len(j["comments"]) <= 0:
                break
        except (KeyError, TypeError):
            break

        try:
            for comment in j['comments']:
                all_komen.append(comment['text'])
        except (KeyError, TypeError):
            print("NO MORE COMMENTS")


        print(f"iteration {cursor}")
        print(all_komen[-1])

        print("=================================")
        print(f" number of comments : {len(all_komen)}")
        print("=================================")
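
The loop above pages through each video's comments by advancing the cursor in steps of 50 until a page comes back empty. The same pattern can be sketched without any network access, which makes it easier to test. Everything here is an assumption for illustration: `collect_comments` and `fake_fetch_page` are hypothetical names, and the fake page function simply slices an in-memory list instead of calling TikTok.

```python
def collect_comments(fetch_page, page_size=50, max_pages=100):
    """Collect comment texts page by page until an empty page is returned."""
    texts = []
    for page in range(max_pages):
        payload = fetch_page(cursor=page * page_size, count=page_size)
        comments = payload.get("comments") or []
        if not comments:
            break  # no more comments for this video
        texts.extend(c["text"] for c in comments if "text" in c)
    return texts

# Fake data source standing in for the HTTP request (illustration only)
_FAKE_COMMENTS = [{"text": f"comment {i}"} for i in range(120)]

def fake_fetch_page(cursor, count):
    return {"comments": _FAKE_COMMENTS[cursor:cursor + count]}

texts = collect_comments(fake_fetch_page)
print(len(texts))
```

Separating the paging logic from the request itself means the stop condition can be checked on fake data before any real endpoint is involved.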



In [None]:
params = best_params

# Use a separate name so the LogisticRegression class is not overwritten
lr_model = LogisticRegression(**params)
lr_model.fit(X_train_over , y_train_over)
y_pred = lr_model.predict(X_test)
print(classification_report(y_pred , y_test))

              precision    recall  f1-score   support

           0       0.83      0.95      0.88       105
           1       0.78      0.75      0.76       115
           2       0.79      0.83      0.81        58
           3       0.75      0.40      0.52        30

    accuracy                           0.80       308
   macro avg       0.79      0.73      0.74       308
weighted avg       0.79      0.80      0.79       308



In [None]:
con_mat(y_test , y_pred)

# Decision Tree

In [None]:
data_len.head()

Unnamed: 0,pre,Category,text_len
0,mbanking lot transaksi lelet banget lokasi nyala,0,7
1,bca mobile merah y,0,4
2,bca mobile merah y,0,4
3,ngisi pulsa bca mobile pulsa kaga masuk masuk...,1,12
4,crash mybca not responding ulang kali restart hp,0,8


In [None]:
Tfid_Data = TfidfVectorizer(ngram_range=(1,2)).fit(data_len.pre)
X_Tfid_Matrix = Tfid_Data.transform(data_len.pre)

X_Tfid = pd.DataFrame(data=X_Tfid_Matrix.toarray() , columns = Tfid_Data.get_feature_names_out())

In [None]:
Count_df = pd.concat([X_Count , data_len.text_len ,data_len.Category] , axis=1)
Count_df.head()

Unnamed: 0,abai,abai cabang,abai cabang bank,abai hubung,abai hubung cs,abis,abis ambil,abis ambil uang,abis setor,abis setor uang,...,yh mbca,zaman,zaman gw,zaman gw kuliah,zte,zte bca,zte bca mobile,zuu,text_len,Category
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,7,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,4,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,4,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,12,1
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,8,0


In [None]:
Tfid_df = pd.concat([X_Tfid , data_len.text_len,data_len.Category] , axis=1)
Tfid_df.head()

Unnamed: 0,abai,abai cabang,abai hubung,abis,abis ambil,abis setor,abis si,abissss,abissss nelpon,about,...,yg xpresi,yh,yh mbca,zaman,zaman gw,zte,zte bca,zuu,text_len,Category
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7,0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4,0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4,0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,12,1
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8,0


In [None]:
X_train , X_test , y_train , y_test = train_test_split(X,y , test_size=0.25 , random_state = 33)

In [None]:
from imblearn.over_sampling import SMOTE

over = SMOTE()
X_train_over , y_train_over = over.fit_resample(X_train , y_train)
display(y_train_over.value_counts(normalize=True))
len(y_train_over)

Unnamed: 0_level_0,proportion
Category,Unnamed: 1_level_1
1,0.25
0,0.25
2,0.25
3,0.25


1476

In [None]:
model = DecisionTreeClassifier()
model.fit(X_train_over , y_train_over)
y_pred = model.predict(X_test)
print(classification_report(y_pred , y_test))

              precision    recall  f1-score   support

           0       0.87      0.93      0.90       113
           1       0.67      0.78      0.72        95
           2       0.79      0.76      0.77        63
           3       0.75      0.32      0.45        37

    accuracy                           0.78       308
   macro avg       0.77      0.70      0.71       308
weighted avg       0.78      0.78      0.76       308



In [None]:
from sklearn.model_selection import GridSearchCV, KFold

param_grid = {
    'criterion': ['gini', 'entropy', 'log_loss'],
    'splitter': ['best', 'random'],
    'max_depth': [None, 10, 20, 30, 50],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    # 'auto' dropped: for classification trees it was an alias of 'sqrt' and is gone in newer scikit-learn
    'max_features': [None, 'sqrt', 'log2'],
}



kf = KFold(n_splits=3, shuffle=True, random_state=42)

grid_search = GridSearchCV(estimator=model, param_grid=param_grid, scoring='accuracy', cv=kf)


grid_search.fit(X_train, y_train)

best_params = grid_search.best_params_
best_score = grid_search.best_score_

print("Best Parameters:", best_params)
print("Best Cross-Validation Score:", best_score)


Best Parameters: {'criterion': 'gini', 'max_depth': 50, 'max_features': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'splitter': 'best'}
Best Cross-Validation Score: 0.8362310870454194


In [None]:
params = best_params

# Use a separate name so the DecisionTreeClassifier class is not overwritten
dt_model = DecisionTreeClassifier(**params)
dt_model.fit(X_train_over , y_train_over)
y_pred = dt_model.predict(X_test)
print(classification_report(y_pred , y_test))

              precision    recall  f1-score   support

           0       0.86      0.93      0.89       112
           1       0.66      0.77      0.71        95
           2       0.79      0.79      0.79        61
           3       0.62      0.25      0.36        40

    accuracy                           0.76       308
   macro avg       0.73      0.68      0.69       308
weighted avg       0.75      0.76      0.75       308



In [None]:
con_mat(y_test , y_pred)

# ENSEMBLE MODEL

## Voting

In [None]:
from sklearn.ensemble import BaggingClassifier , StackingClassifier
from sklearn.ensemble import VotingClassifier

In [None]:
Count_Data = CountVectorizer(ngram_range=(1,2)).fit(dataset.pre)
X_Count_matrix = Count_Data.transform(dataset.pre)

X_Count = pd.DataFrame(data=X_Count_matrix.toarray() , columns = Count_Data.get_feature_names_out())
X_Count

Unnamed: 0,abai,abai cabang,abai hubung,abis,abis ambil,abis setor,abis si,abissss,abissss nelpon,about,...,yg udh,yg viral,yg xpresi,yh,yh mbca,zaman,zaman gw,zte,zte bca,zuu
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1225,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1226,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1227,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1228,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
Count_df = pd.concat([X_Count , dataset.Category] , axis=1)
Count_df.head()

Unnamed: 0,abai,abai cabang,abai hubung,abis,abis ambil,abis setor,abis si,abissss,abissss nelpon,about,...,yg viral,yg xpresi,yh,yh mbca,zaman,zaman gw,zte,zte bca,zuu,Category
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
X_train , X_test , y_train , y_test = train_test_split(X,y , test_size=0.25 , random_state = 33)

In [None]:
from imblearn.over_sampling import SMOTE

over = SMOTE()
X_train_over , y_train_over = over.fit_resample(X_train , y_train)
y_train_over.value_counts(normalize=True)

Unnamed: 0_level_0,proportion
Category,Unnamed: 1_level_1
1,0.25
0,0.25
2,0.25
3,0.25


In [None]:
models_vote = [('DT',DecisionTreeClassifier()),('LR',LogisticRegression(random_state=69)),('XT',ExtraTreesClassifier(random_state=60))]
voting_clf = VotingClassifier(estimators=models_vote,  voting='soft')
voting_clf.fit(X_train_over, y_train_over)

In [None]:
y_pred = voting_clf.predict(X_test)
print(classification_report(y_test , y_pred))

              precision    recall  f1-score   support

           0       0.94      0.85      0.90       121
           1       0.76      0.67      0.71       110
           2       0.79      0.79      0.79        61
           3       0.30      0.75      0.43        16

    accuracy                           0.77       308
   macro avg       0.70      0.77      0.71       308
weighted avg       0.81      0.77      0.78       308
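
With `voting='soft'`, the ensemble averages the class probabilities from each base model and picks the class with the highest average, rather than counting one vote per model. The sketch below illustrates the difference with made-up probabilities for a single comment; the numbers and the helper names `soft_vote` and `hard_vote` are assumptions for this example. An unpruned decision tree often outputs probabilities of exactly 0 or 1, which is why a single confident model can outweigh two less certain ones.

```python
from collections import Counter

def soft_vote(probabilities):
    """Average per-class probabilities across models; return (label, averages)."""
    n_models = len(probabilities)
    n_classes = len(probabilities[0])
    averages = [sum(p[k] for p in probabilities) / n_models for k in range(n_classes)]
    label = max(range(n_classes), key=lambda k: averages[k])
    return label, averages

def hard_vote(probabilities):
    """Each model votes for its own most likely class; the majority wins."""
    votes = [max(range(len(p)), key=lambda k: p[k]) for p in probabilities]
    return Counter(votes).most_common(1)[0][0]

# Hypothetical per-class probabilities for one comment (classes 0..3)
dt = [1.0, 0.0, 0.0, 0.0]
lr = [0.2, 0.5, 0.2, 0.1]
xt = [0.1, 0.6, 0.2, 0.1]

soft_label, averages = soft_vote([dt, lr, xt])
hard_label = hard_vote([dt, lr, xt])
print(soft_label, hard_label)
```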



In [None]:
con_mat(y_test , y_pred)

## Stack

In [None]:
# Base estimators for the first stacking level
stack0 = [
    ('DT', DecisionTreeClassifier()),
    ('EG', ExtraTreesClassifier()),
    ('LG', LogisticRegression()),
    ('RC', RidgeClassifier()),
    ('SGD', SGDClassifier()),
]


stack1 = PassiveAggressiveClassifier()

model_stack = StackingClassifier(estimators=stack0, final_estimator=stack1, cv=5)

model_stack.fit(X_train_over, y_train_over)

In [None]:
y_pred = model_stack.predict(X_test)
print(classification_report(y_test , y_pred))

              precision    recall  f1-score   support

           0       0.97      0.84      0.90       121
           1       0.67      0.94      0.78       110
           2       1.00      0.56      0.72        61
           3       0.60      0.56      0.58        16

    accuracy                           0.81       308
   macro avg       0.81      0.72      0.74       308
weighted avg       0.85      0.81      0.81       308



In [None]:
con_mat(y_test , y_pred)

# Conclusion

The Passive Aggressive Classifier gives better results than the ensemble models and the decision tree.

In [None]:
params = {'C': 0.001, 'fit_intercept': True, 'loss': 'squared_hinge', 'max_iter': 2000, 'validation_fraction': 0.2}

PassiveAggressive = PassiveAggressiveClassifier(**params)
PassiveAggressive.fit(X_train , y_train)
y_pred = PassiveAggressive.predict(X_test)
print(classification_report(y_test , y_pred))

In [None]:
import pickle

model = PassiveAggressive

# Save the model to a file
with open('PassiveAggClassifier.pkl', 'wb') as file:
    pickle.dump(model, file)
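
Loading the saved file back is the mirror image of the cell above. The sketch below shows the round trip with a plain dictionary standing in for the fitted classifier, so that it runs without scikit-learn; the stand-in object and the temporary directory are assumptions made for this example.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted classifier (assumption: any picklable object round-trips the same way)
stand_in_model = {
    "name": "PassiveAggressiveClassifier",
    "params": {"C": 0.001, "max_iter": 2000},
}

path = os.path.join(tempfile.mkdtemp(), "PassiveAggClassifier.pkl")

# Save the object to a file
with open(path, "wb") as file:
    pickle.dump(stand_in_model, file)

# Load it back
with open(path, "rb") as file:
    restored = pickle.load(file)

print(restored == stand_in_model)
```

For the real model, the fitted `CountVectorizer` would likely need to be saved alongside the classifier, since new comments must be transformed with the same vocabulary before `predict` is called. A pickled scikit-learn model is also best loaded with the same library version that produced it.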
