<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 100px">

# Capstone Project: Classifying Logistics Research Papers
## Part 4: Gridsearch Classification

---

 [Part 1: Get Text](01.Get_Text.ipynb) | [Part 2: Add Label](02.Add_Label.ipynb) | [Part 3: EDA](03.EDA.ipynb) | **Part 4: Gridsearch Classification** | [Part 5: Neural Network Classification](05.NeuralNet_Classification.ipynb) | [Part 6: Model Evaluation](06.Model_Evaluation.ipynb) | [Part 7: Final Model](07.Final_Model.ipynb) 

---

### Introduction
This notebook focuses on the model tuning and optimization process for a text classification task. Specifically, we aim to identify the best-performing model by applying GridSearchCV to systematically explore hyperparameter combinations for the following machine learning algorithms:

1. **Support Vector Machine (SVM)**
2. **Naive Bayes (NB)**
3. **Gradient Boosting (GB)**

To convert the text into numerical features, we use **TfidfVectorizer**, which computes a Term Frequency-Inverse Document Frequency (TF-IDF) score for each term in each document. The vectorizer is tuned as part of the same GridSearch process through its `max_features` and `max_df` hyperparameters, which control the size and quality of the text representation.
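
As a rough illustration of what a TF-IDF score is, the sketch below computes it by hand for a toy corpus. It assumes the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization, which is the documented default of scikit-learn's TfidfVectorizer. The corpus and the helper name `tfidf` are made up for illustration and are not part of this project.

```python
import math
from collections import Counter

# Toy corpus (hypothetical): each document is already tokenized and space-joined.
docs = [
    "import beer import manual",
    "supplier evaluation manual",
    "sales forecast sales",
]

def tfidf(docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalized)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}

    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        raw = {t: count * idf[t] for t, count in tf.items()}
        # Scale each document vector to unit length.
        norm = math.sqrt(sum(w * w for w in raw.values()))
        rows.append({t: w / norm for t, w in raw.items()})
    return rows

weights = tfidf(docs)
for row in weights:
    print({t: round(w, 3) for t, w in row.items()})
```

In the first document, "import" outweighs "manual" because it occurs twice and appears in only one document, while "manual" is shared with the second document and is therefore less distinctive.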

### Import Libraries

In [20]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

import numpy as np
import pandas as pd

from pythainlp.tokenize import word_tokenize
from pythainlp.corpus.common import thai_words, thai_stopwords
from pythainlp.util import dict_trie

from transformers import AutoTokenizer, AutoModelForMaskedLM

In [21]:
df = pd.read_csv('../data/cleaned_text.csv')
df.head()

Unnamed: 0,project,abstract,content,category,multi_category,keywords,category_id,abstract_length,content_length,content_word_count
0,การจัดทำคู่มือขั้นตอนการดำเนินการการนำเข้าคราฟ...,ผู้วิจัยได้ตระหนักถึงความยุ่งยากของขั้นตอนการน...,การจัดทำคู่มือขั้นตอนการดำเนินการการนำเข้าคราฟ...,Import-Export and International Trade,"{1: 'Import-Export and International Trade', 2...","คู่มือการนำเข้าคราฟท์เบียร์, การดำเนินงานตามมา...",5,859,11582,2402
1,การเสนอแนวทางในการพัฒนาและสร้างความสัมพันธ์กับ...,งานวิจัยครั้งนี้มีวัตถุประสงค์เพื่อเสนอแนวทางใ...,การเสนอแนวทางในการพัฒนาและสร้างความสัมพันธ์กับ...,Procurement,"{1: 'Procurement', 2: 'Manufacturing/Productio...","การประเมินการปฏิบัติงาน, ผู้ส่งมอบ, แบ่งเกรด, ...",0,1172,15230,3601
2,การพัฒนามาตรฐานรถขนส่งวัตถุอันตรายที่เข้ามาในค...,ดำเนินธุรกิจเป็นผู้นำเข้า และจัดจำหน่ายสินค้าก...,การพัฒนามาตรฐานรถขนส่งวัตถุอันตรายที่เข้ามาในค...,Logistics and Distribution,"{1: 'Logistics and Distribution', 2: 'Inventor...","Chemical Solvent, การควบคุมความปลอดภัย, รถขนส่...",3,1964,13587,2883
3,แนวทางการปรับปรุงกระบวนการการส่งเอกสารใบกำกับภ...,การวิจัยครั้งนี้มีวัตถุประสงค์ เพื่อศึกษาขั้นต...,แนวทางการปรับปรุงกระบวนการการส่งเอกสารใบกำกับภ...,Procurement,"{1: 'Procurement', 2: 'Demand Planning and For...","การวิจัย, กระบวนการจัดส่งใบกำกับภาษี, แผนกบัญช...",0,1252,13124,3283
4,การศึกษาเทคนิคการพยากรณ์ยอดขายสายไฟที่เหมาะสม,จากสถานการณ์การแพร่ระบาดของเชื้อไวรัสโคโรนา 20...,การศึกษาเทคนิคการพยากรณ์ยอดขายสายไฟที่เหมาะสม ...,Demand Planning and Forecasting,"{1: 'Demand Planning and Forecasting', 2: 'Inv...","โควิด-19, ยานยนต์, การพยากรณ์ยอดขาย, Simple mo...",4,1924,25247,6114


### Text Preprocessing with Thai-Specific Methods


This section applies Thai-specific text preprocessing methods for Natural Language Processing (NLP) tasks. It includes:

- `Traditional Preprocessing` using stopword removal and tokenization tailored for Thai text.
- `WangchanBERTa` Tokenization, leveraging a pre-trained Thai language model for deep learning applications.

In [22]:
# Custom words to keep as single tokens, collected from 5 sample abstracts
added_words = ['การนำเข้า', 'ฐานนิยม', 'คราฟท์', 'แนวทาง', 'ผู้ส่งมอบ', 'โซ่อุปทาน', 'ปัจจัยรอง', 
               'การส่งมอบ', 'รถขนส่ง', 'นำไปใช้งาน', 'อย่างถูกต้อง', 'การขับรถ', 'ที่เกี่ยวข้อง', 
               'ในการปฏิบัติงาน', 'พนักงานขับรถ', 'สิ่งสำคัญ', 'ขั้นตอน', 'ที่ชัดเจน', 'การไหล', 'ยอดขาย', 
              'การจัดทำ', 'คราฟท์เบียร์', 'ฝึกสหกิจ', 'อย่างก้าวกระโดด','การจัดซื้อจัดหา','กระบวนการ',
               'แบบประเมิน','เก็บข้อมูล','อย่างชัดเจน','การดำเนินการ','การส่งเสริม','ถังดับเพลิง','แนวทาง']

# Merge custom words with Thai dictionary words
custom_words = set(thai_words()).union(added_words)
custom_trie = dict_trie(custom_words)  # Create a trie from the custom dictionary
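
To see why adding words to the dictionary matters, here is a deliberately simplified greedy longest-match tokenizer. It is not the `newmm` algorithm used by PyThaiNLP, only a sketch of the general idea behind dictionary-based segmentation. The function `longest_match_tokenize` and the two tiny dictionaries are hypothetical.

```python
def longest_match_tokenize(text, dictionary):
    """Greedy left-to-right segmentation: at each position take the longest dictionary word."""
    max_len = max(len(word) for word in dictionary)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a dictionary word matches.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                tokens.append(candidate)
                i += length
                break
        else:
            # No dictionary word starts here: emit a single character and move on.
            tokens.append(text[i])
            i += 1
    return tokens

# Hypothetical base dictionary, and the same dictionary extended with two compounds.
base_dict = {"รถ", "ขนส่ง", "พนักงาน", "ขับ"}
custom_dict = base_dict | {"รถขนส่ง", "พนักงานขับรถ"}

text = "พนักงานขับรถขนส่ง"
print(longest_match_tokenize(text, base_dict))    # compounds split into short words
print(longest_match_tokenize(text, custom_dict))  # compound kept as one token
```

With the extended dictionary the compound survives as a single token, so it can later become a single TF-IDF feature instead of being scattered across several generic ones.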

In [23]:
def thai_preprocess(text):
    """Tokenize Thai text with the custom dictionary and drop Thai stopwords."""
    stopwords = thai_stopwords()
    tokens = word_tokenize(text, custom_dict=custom_trie, engine="newmm")
    return " ".join([token for token in tokens if token not in stopwords])


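# Load the WangchanBERTa tokenizer once instead of on every call.
# Only the tokenizer is needed here; the model weights are never used.
wangchan_tokenizer = AutoTokenizer.from_pretrained("airesearch/wangchanberta-base-att-spm-uncased")


def wangchan(text):
    """Split text into WangchanBERTa subword tokens, joined by spaces."""
    tokens = wangchan_tokenizer.tokenize(text)
    return " ".join(tokens)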

### GridSearch
**Setting Vectorizer Parameters** : We tune the TfidfVectorizer by optimizing the following two hyperparameters:

1. `max_features` : Set to 5000, 7000, or None.
2. `max_df` : Set to 0.8 or 0.9.

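Given that our dataset contains approximately 10,000 features per file, capping the vocabulary may improve model performance by discarding rare, noisy terms: `max_features` keeps only the terms with the highest frequency across the corpus, while `max_df` drops terms that appear in more than the given fraction of documents (near-universal words that carry little class information). Both settings trade model complexity against computational cost.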
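
The effect of `max_df` and `max_features` can be sketched in plain Python. The snippet assumes scikit-learn's documented behaviour: a float `max_df` drops terms whose document frequency exceeds that fraction of the documents, and `max_features` then keeps the top terms ordered by total term frequency across the corpus. The helper `limit_vocabulary` and the toy documents are hypothetical, and ties are broken alphabetically here purely for determinism.

```python
from collections import Counter

def limit_vocabulary(docs, max_df=1.0, max_features=None):
    """Mimic the vocabulary pruning controlled by max_df and max_features."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(t for tokens in tokenized for t in set(tokens))
    term_freq = Counter(t for tokens in tokenized for t in tokens)

    # max_df: drop terms that occur in more than max_df * n_docs documents.
    kept = [t for t in doc_freq if doc_freq[t] <= max_df * n_docs]
    # max_features: keep only the most frequent remaining terms.
    kept.sort(key=lambda t: (-term_freq[t], t))
    if max_features is not None:
        kept = kept[:max_features]
    return sorted(kept)

docs = [
    "study logistics import import beer",
    "study logistics supplier supplier grade",
    "study warehouse sales sales sales",
    "study logistics warehouse truck safety",
    "study logistics forecast sales demand",
]

print(limit_vocabulary(docs))                              # full vocabulary
print(limit_vocabulary(docs, max_df=0.8))                  # drops "study" (in 5 of 5 documents)
print(limit_vocabulary(docs, max_df=0.8, max_features=3))  # 3 most frequent remaining terms
```

Note that `max_features` ranks by raw frequency, not by how well a term separates the classes, which is one reason to treat it as a tunable setting instead of fixing it in advance.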

**Models in GridSearch** : We apply GridSearchCV to three models:

1. `Support Vector Machine (SVM)` : SVM is effective in high-dimensional spaces and works well when there's a clear margin of separation between classes.

2. `Naive Bayes` : It's a simple, probabilistic model that works well with high-dimensional data, offering fast and efficient performance for text classification.

3. `Gradient Boosting` : It builds strong models through an ensemble of decision trees, capturing complex patterns and non-linear relationships in the data.
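
For the Naive Bayes model, the hyperparameter tuned later is the smoothing strength `alpha`. A minimal sketch of additive smoothing, assuming the standard estimate P(term | class) = (count + alpha) / (total + alpha * V), where V is the vocabulary size; the counts and the helper name `smoothed_likelihood` are hypothetical.

```python
def smoothed_likelihood(term_counts, vocabulary, alpha):
    """Estimate P(term | class) with additive smoothing."""
    total = sum(term_counts.get(t, 0) for t in vocabulary)
    denom = total + alpha * len(vocabulary)
    return {t: (term_counts.get(t, 0) + alpha) / denom for t in vocabulary}

vocabulary = ["import", "supplier", "forecast"]
counts_in_class = {"import": 8, "supplier": 2}  # "forecast" was never seen in this class

for alpha in (0.01, 0.1, 1.0):
    probs = smoothed_likelihood(counts_in_class, vocabulary, alpha)
    print(alpha, {t: round(p, 4) for t, p in probs.items()})
```

A small `alpha` trusts the observed counts almost completely, while a larger `alpha` shifts more probability toward unseen terms; which works better depends on the data, so it is left to the grid search.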

In [24]:
def grid_search(X_train, X_test, y_train, y_test, tokenizer):
    # Define pipelines for each classifier
    pipelines = {
        'svm': Pipeline([
            ('tfidf', TfidfVectorizer()),
            ('classifier', SVC())
        ]),
        'naive_bayes': Pipeline([
            ('tfidf', TfidfVectorizer()),
            ('classifier', MultinomialNB())
        ]),
        'gradient_boosting': Pipeline([
            ('tfidf', TfidfVectorizer()),
            ('classifier', GradientBoostingClassifier())
        ])
    }
    
    # Define parameter grids for each classifier
    param_grids = {
        'svm': {
            'tfidf__max_features': [5000, 7000, None],
            'tfidf__max_df': [0.8, 0.9],
            'classifier__C': [1, 10, 100],  # Regularization parameter
            'classifier__kernel': ['linear', 'rbf']  # Kernel
        },
        'naive_bayes': {
            'tfidf__max_features': [5000, 7000, None],
            'tfidf__max_df': [0.8, 0.9],
            'classifier__alpha': [0.01, 0.1, 1.0]  # Smoothing parameter
        },
        'gradient_boosting': {
            'tfidf__max_features': [5000, 7000, None],
            'tfidf__max_df': [0.8, 0.9],
            'classifier__n_estimators': [100, 200],  # Number of trees
            'classifier__learning_rate': [0.01, 0.1, 0.2],  # Learning rate
            'classifier__max_depth': [3, 5, 7]  # Max depth of trees
        }
    }
    
    # Store results for all classifiers
    all_results = []
    
    for model_name, pipeline in pipelines.items():
        print(f"Running GridSearch for {model_name}...")
        # Named `search` so it does not shadow the enclosing grid_search() function
        search = GridSearchCV(pipeline, param_grids[model_name], cv=3, scoring='accuracy', verbose=2, n_jobs=-1)
        search.fit(X_train, y_train)
        
        print(f"Best Parameters for {model_name}:", search.best_params_)
        print(f"Best Cross-Validation Accuracy for {model_name}:", search.best_score_)
        
        # Predict on test data (predictions are not stored in the returned results)
        best_model = search.best_estimator_
        y_pred = best_model.predict(X_test)
        
        # Save cross-validation results, tagged with classifier and tokenizer names
        result = pd.DataFrame(search.cv_results_)
        result['classifier'] = model_name
        result['tokenizer'] = tokenizer.__name__
        all_results.append(result)
    
    # Combine results from all classifiers
    combined_results = pd.concat(all_results, ignore_index=True)
    return combined_results
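
The number of fits that GridSearchCV reports follows directly from the grids above: every combination of values is one candidate, and each candidate is fitted once per fold. A quick check in plain Python, with the grids copied from `param_grids` and `cv=3` as in the function (the helper name `count_candidates` is hypothetical):

```python
from itertools import product

# Same grids as in grid_search(), repeated here so the snippet is self-contained.
param_grids = {
    'svm': {
        'tfidf__max_features': [5000, 7000, None],
        'tfidf__max_df': [0.8, 0.9],
        'classifier__C': [1, 10, 100],
        'classifier__kernel': ['linear', 'rbf'],
    },
    'naive_bayes': {
        'tfidf__max_features': [5000, 7000, None],
        'tfidf__max_df': [0.8, 0.9],
        'classifier__alpha': [0.01, 0.1, 1.0],
    },
    'gradient_boosting': {
        'tfidf__max_features': [5000, 7000, None],
        'tfidf__max_df': [0.8, 0.9],
        'classifier__n_estimators': [100, 200],
        'classifier__learning_rate': [0.01, 0.1, 0.2],
        'classifier__max_depth': [3, 5, 7],
    },
}

def count_candidates(grid):
    """Number of hyperparameter combinations in one grid."""
    return sum(1 for _ in product(*grid.values()))

cv = 3
for name, grid in param_grids.items():
    n = count_candidates(grid)
    print(f"{name}: {n} candidates x {cv} folds = {n * cv} fits")
```

These totals (108, 54 and 324 fits) match the `Fitting 3 folds for each of ... candidates` lines in the run log. With the default `refit=True`, GridSearchCV additionally refits the best candidate once on the full training set.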

### Tuning the Models and Collecting Results into a DataFrame

In [25]:
tokenizers = [thai_preprocess, wangchan]
# Create a dictionary to store the results
results_dict = {}

for tokenizer in tokenizers:
    X = df['abstract'].apply(tokenizer)
    y = df['category_id']
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
    
    model_name = tokenizer.__name__
    results_dict[model_name] = grid_search(X_train, X_test, y_train, y_test, tokenizer)

# Combine the results into one DataFrame
combined_results = pd.concat(results_dict.values(), ignore_index=True)

Running GridSearch for svm...
Fitting 3 folds for each of 36 candidates, totalling 108 fits
Best Parameters for svm: {'classifier__C': 10, 'classifier__kernel': 'linear', 'tfidf__max_df': 0.8, 'tfidf__max_features': 5000}
Best Cross-Validation Accuracy for svm: 0.5632383966244725
Running GridSearch for naive_bayes...
Fitting 3 folds for each of 18 candidates, totalling 54 fits
Best Parameters for naive_bayes: {'classifier__alpha': 0.01, 'tfidf__max_df': 0.8, 'tfidf__max_features': 5000}
Best Cross-Validation Accuracy for naive_bayes: 0.5377637130801688
Running GridSearch for gradient_boosting...
Fitting 3 folds for each of 108 candidates, totalling 324 fits
Best Parameters for gradient_boosting: {'classifier__learning_rate': 0.1, 'classifier__max_depth': 5, 'classifier__n_estimators': 200, 'tfidf__max_df': 0.9, 'tfidf__max_features': 7000}
Best Cross-Validation Accuracy for gradient_boosting: 0.5504219409282701
Running GridSearch for svm...
Fitting 3 folds for each of 36 candidates, to

In [26]:
combined_results.head()

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_classifier__C,param_classifier__kernel,param_tfidf__max_df,param_tfidf__max_features,params,split0_test_score,...,split2_test_score,mean_test_score,std_test_score,rank_test_score,classifier,tokenizer,param_classifier__alpha,param_classifier__learning_rate,param_classifier__max_depth,param_classifier__n_estimators
0,0.236482,0.017402,0.078355,0.005809,1.0,linear,0.8,5000.0,"{'classifier__C': 1, 'classifier__kernel': 'li...",0.525,...,0.531646,0.525211,0.00517,13,svm,thai_preprocess,,,,
1,0.228827,0.016699,0.066779,0.002901,1.0,linear,0.8,7000.0,"{'classifier__C': 1, 'classifier__kernel': 'li...",0.525,...,0.531646,0.525211,0.00517,13,svm,thai_preprocess,,,,
2,0.202313,0.014212,0.067912,0.002721,1.0,linear,0.8,,"{'classifier__C': 1, 'classifier__kernel': 'li...",0.525,...,0.531646,0.525211,0.00517,13,svm,thai_preprocess,,,,
3,0.225016,0.008092,0.071555,0.004083,1.0,linear,0.9,5000.0,"{'classifier__C': 1, 'classifier__kernel': 'li...",0.525,...,0.531646,0.525211,0.00517,13,svm,thai_preprocess,,,,
4,0.204003,0.002104,0.079748,0.004877,1.0,linear,0.9,7000.0,"{'classifier__C': 1, 'classifier__kernel': 'li...",0.525,...,0.531646,0.525211,0.00517,13,svm,thai_preprocess,,,,


In [31]:
# Replace missing max_features values (i.e. max_features=None) with the label 'No Max Features'
combined_results['param_tfidf__max_features'] = np.where(combined_results['param_tfidf__max_features'].isnull(), 'No Max Features', combined_results['param_tfidf__max_features'])

In [None]:
combined_results.to_csv('../data/gridsearch_score.csv', index = False)