<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 100px">

# Capstone Project: Classifying Logistics Research Papers
## Part 5 : Neural Network Classification 

---

 [Part 1: Get Text](01.Get_Text.ipynb) | [Part 2: Add Label](02.Add_Label.ipynb) | [Part 3: EDA](03.EDA.ipynb) | [Part 4: Gridsearch Classification](04.Gridsearch_Classification.ipynb) | **Part 5: Neural Network Classification** | [Part 6: Model Evaluation](06.Model_Evaluation.ipynb) | [Part 7: Final Model](07.Final_Model.ipynb) 

---

### Introduction
This notebook tunes a Neural Network model for the text classification task and tests whether random under-sampling helps with the imbalanced classes.

To preprocess the text data, we use **TfidfVectorizer**, which converts text into numerical features by calculating a Term Frequency-Inverse Document Frequency (TF-IDF) score for each token. The TfidfVectorizer is itself tuned as part of a grid search over its key hyperparameters, `max_features` and `max_df`, to improve the quality and efficiency of the text representation.
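To make the roles of `max_df` and `max_features` concrete, here is a minimal pure-Python sketch of TF-IDF weighting. It is not the scikit-learn implementation: the function name `tfidf_sketch` and the English toy documents are assumptions for illustration, and tie-breaking among equally frequent terms may differ from `TfidfVectorizer`. It follows the smoothed IDF formula and L2 row normalization that `TfidfVectorizer` applies by default.

```python
import math
from collections import Counter

def tfidf_sketch(docs, max_df=1.0, max_features=None):
    """Simplified TF-IDF over pre-tokenized, space-separated documents."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Total number of occurrences of each term across the corpus
    total_count = Counter(term for tokens in tokenized for term in tokens)

    # max_df: drop terms that appear in more than this fraction of documents
    vocab = [t for t in doc_freq if doc_freq[t] / n_docs <= max_df]

    # max_features: keep only the most frequent remaining terms
    if max_features is not None:
        vocab = sorted(vocab, key=lambda t: (-total_count[t], t))[:max_features]
    vocab = sorted(vocab)

    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        row = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0  # L2-normalize each row
        rows.append([v / norm for v in row])
    return vocab, rows

# Hypothetical toy corpus (already tokenized and joined with spaces)
docs = [
    "supplier delivery process",
    "supplier forecast process",
    "import export process",
]
vocab, rows = tfidf_sketch(docs, max_df=0.9, max_features=3)
print(vocab)
```

In this toy corpus, "process" occurs in every document, so `max_df=0.9` removes it, and `max_features=3` then keeps only the three most frequent of the remaining terms.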

### Import Libraries

In [1]:
import pandas as pd
import numpy as np
import time

from pythainlp.tokenize import word_tokenize
from pythainlp.corpus.common import thai_words, thai_stopwords
from pythainlp.util import dict_trie

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score  # Used in nn_model to score predictions
from imblearn.under_sampling import RandomUnderSampler

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, BatchNormalization, Input
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping  # For early stopping implementation
from transformers import AutoTokenizer, AutoModelForMaskedLM

In [2]:
df = pd.read_csv('../data/cleaned_text.csv')
df.head()

Unnamed: 0,project,abstract,content,category,multi_category,keywords,category_id,abstract_length,content_length,content_word_count
0,การจัดทำคู่มือขั้นตอนการดำเนินการการนำเข้าคราฟ...,ผู้วิจัยได้ตระหนักถึงความยุ่งยากของขั้นตอนการน...,การจัดทำคู่มือขั้นตอนการดำเนินการการนำเข้าคราฟ...,Import-Export and International Trade,"{1: 'Import-Export and International Trade', 2...","คู่มือการนำเข้าคราฟท์เบียร์, การดำเนินงานตามมา...",5,859,11582,2402
1,การเสนอแนวทางในการพัฒนาและสร้างความสัมพันธ์กับ...,งานวิจัยครั้งนี้มีวัตถุประสงค์เพื่อเสนอแนวทางใ...,การเสนอแนวทางในการพัฒนาและสร้างความสัมพันธ์กับ...,Procurement,"{1: 'Procurement', 2: 'Manufacturing/Productio...","การประเมินการปฏิบัติงาน, ผู้ส่งมอบ, แบ่งเกรด, ...",0,1172,15230,3601
2,การพัฒนามาตรฐานรถขนส่งวัตถุอันตรายที่เข้ามาในค...,ดำเนินธุรกิจเป็นผู้นำเข้า และจัดจำหน่ายสินค้าก...,การพัฒนามาตรฐานรถขนส่งวัตถุอันตรายที่เข้ามาในค...,Logistics and Distribution,"{1: 'Logistics and Distribution', 2: 'Inventor...","Chemical Solvent, การควบคุมความปลอดภัย, รถขนส่...",3,1964,13587,2883
3,แนวทางการปรับปรุงกระบวนการการส่งเอกสารใบกำกับภ...,การวิจัยครั้งนี้มีวัตถุประสงค์ เพื่อศึกษาขั้นต...,แนวทางการปรับปรุงกระบวนการการส่งเอกสารใบกำกับภ...,Procurement,"{1: 'Procurement', 2: 'Demand Planning and For...","การวิจัย, กระบวนการจัดส่งใบกำกับภาษี, แผนกบัญช...",0,1252,13124,3283
4,การศึกษาเทคนิคการพยากรณ์ยอดขายสายไฟที่เหมาะสม,จากสถานการณ์การแพร่ระบาดของเชื้อไวรัสโคโรนา 20...,การศึกษาเทคนิคการพยากรณ์ยอดขายสายไฟที่เหมาะสม ...,Demand Planning and Forecasting,"{1: 'Demand Planning and Forecasting', 2: 'Inv...","โควิด-19, ยานยนต์, การพยากรณ์ยอดขาย, Simple mo...",4,1924,25247,6114


### Text Preprocessing with Thai-Specific Methods


This section applies two Thai-specific text preprocessing methods for Natural Language Processing (NLP) tasks:

- `Traditional Preprocessing` using stopword removal and tokenization tailored for Thai text.
- `WangchanBERTa` Tokenization, leveraging a pre-trained Thai language model for deep learning applications.

In [4]:
# Custom words to keep as single tokens, taken from 5 sample abstracts
added_words = ['การนำเข้า', 'ฐานนิยม', 'คราฟท์', 'แนวทาง', 'ผู้ส่งมอบ', 'โซ่อุปทาน', 'ปัจจัยรอง', 
               'การส่งมอบ', 'รถขนส่ง', 'นำไปใช้งาน', 'อย่างถูกต้อง', 'การขับรถ', 'ที่เกี่ยวข้อง', 
               'ในการปฏิบัติงาน', 'พนักงานขับรถ', 'สิ่งสำคัญ', 'ขั้นตอน', 'ที่ชัดเจน', 'การไหล', 'ยอดขาย', 
              'การจัดทำ', 'คราฟท์เบียร์', 'ฝึกสหกิจ', 'อย่างก้าวกระโดด','การจัดซื้อจัดหา','กระบวนการ',
               'แบบประเมิน','เก็บข้อมูล','อย่างชัดเจน','การดำเนินการ','การส่งเสริม','ถังดับเพลิง','แนวทาง']

# Merge custom words with Thai dictionary words
custom_words = set(thai_words()).union(added_words)
custom_trie = dict_trie(custom_words)  # Create a trie from the custom dictionary

In [5]:
stopwords = thai_stopwords()

def thai_preprocess(text):
    tokens = word_tokenize(text, custom_dict= custom_trie, engine="newmm")
    return " ".join([token for token in tokens if token not in stopwords])

# Load the WangchanBERTa tokenizer once; reloading it for every document is slow.
# The masked-LM model is not loaded because only tokenization is used here.
wangchan_tokenizer = AutoTokenizer.from_pretrained("airesearch/wangchanberta-base-att-spm-uncased")

def wangchan(text):
    tokens = wangchan_tokenizer.tokenize(text)
    return " ".join(tokens)

### Neural Networks Model with Tokenizer and Vectorizer Parameter Tuning
This step optimizes a Neural Network model for text classification by systematically testing combinations of tokenization methods and TF-IDF vectorizer hyperparameters.

**Setting Vectorizer Parameters** : We tune the TfidfVectorizer by optimizing the following two hyperparameters:

1. `max_features` : Set to 5000, 7000, or None
2. `max_df` : Set to 0.8 or 0.9

Given that our dataset contains approximately 10,000 features per file, reducing the number of features may improve the model's performance by focusing on the most informative terms and reducing noise. This step is crucial for balancing model complexity and computational efficiency.
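The search space described above (two tokenizers, three `max_features` values, two `max_df` values) can be enumerated in one place, which makes the number of configurations explicit. This is only a sketch: tokenizer names are used as plain strings here, and the variable names are assumptions rather than the ones used in the cells below.

```python
from itertools import product

# Names stand in for the tokenizer functions defined earlier in the notebook
tokenizer_names = ["thai_preprocess", "wangchan"]
max_features_options = [5000, 7000, None]
max_df_options = [0.8, 0.9]

# Cartesian product of all options: one tuple per configuration to train
grid = list(product(tokenizer_names, max_features_options, max_df_options))

print(len(grid))  # 2 * 3 * 2 = 12 configurations
for name, max_f, max_d in grid[:3]:
    print(name, max_f, max_d)
```

Iterating over a single `grid` list is equivalent to three nested `for` loops and keeps the total count easy to check before starting a long run.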

**Handling Class Imbalance** : Since class imbalance can significantly affect the performance of classification models, we address this issue using `random undersampling`. By reducing the instances of the majority class, we create a more balanced dataset, which allows the model to better evaluate and differentiate between classes. This approach will be compared against the base model (without undersampling) during the evaluation process to determine whether it results in improved performance. The goal is to identify the best combination of tokenizer, vectorizer parameters, and data handling techniques for the classification task.
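As a rough illustration of what random undersampling does, the sketch below reduces every class to the size of the smallest one. It is a simplified pure-Python stand-in, not the `imblearn` implementation; the function name `random_undersample` and the toy labels are assumptions for illustration.

```python
import random
from collections import Counter, defaultdict

def random_undersample(X, y, seed=42):
    """Randomly drop samples so every class matches the smallest class size."""
    rng = random.Random(seed)

    # Group sample indices by class label
    by_class = defaultdict(list)
    for index, label in enumerate(y):
        by_class[label].append(index)

    # Every class is cut down to the size of the rarest class
    target = min(len(indices) for indices in by_class.values())
    kept = []
    for label in sorted(by_class):
        kept.extend(rng.sample(by_class[label], target))
    kept.sort()
    return [X[i] for i in kept], [y[i] for i in kept]

# Hypothetical imbalanced toy data: 6, 2, and 2 samples per class
X = [f"doc_{i}" for i in range(10)]
y = [0] * 6 + [1] * 2 + [2] * 2

X_res, y_res = random_undersample(X, y)
print(Counter(y))
print(Counter(y_res))
```

The trade-off is visible even in this small case: balancing the classes discards four of the ten samples, so the model sees less training data overall.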



In [7]:
# Set the lists of parameters to tune
tokenizer = [thai_preprocess, wangchan]
max_features = [5000,7000, None]
max_df = [0.8, 0.9]
total = len(tokenizer)*len(max_features)*len(max_df)

In [8]:
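# Create function to fit the Neural Network model and return its test accuracy
# Note: relies on the globals `df`, `label_encoder`, and `y_test` set in the tuning loop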
def nn_model(X_train_tfidf, X_test_tfidf, y_train_encoded, y_test_encoded,shape):
    
    # Early stop
    es = EarlyStopping(
        monitor = 'val_loss'
        , patience = 5
        , restore_best_weights = True 
    )
    
    nn_model = Sequential()
    nn_model.add(Input(shape=(shape,)))
    nn_model.add(BatchNormalization())
    
    nn_model.add(Dense(512, activation='relu')) # 512 neurons with ReLU activation
    nn_model.add(Dropout(0.2)) #  Dropout layers (with a rate of 0.2)
    
    nn_model.add(Dense(256, activation='relu')) # 256 neurons with ReLU activation
    nn_model.add(Dropout(0.2)) #  Dropout layers (with a rate of 0.2)
    
    nn_model.add(Dense(128, activation='relu')) # 128 neurons with ReLU activation
    nn_model.add(Dropout(0.2)) #  Dropout layers (with a rate of 0.2)
    
    nn_model.add(Dense(64, activation='relu')) # 64 neurons with ReLU activation
    nn_model.add(Dense(32, activation='relu')) # 32 neurons with ReLU activation
    
    nn_model.add(Dense(len(df['category'].unique()), activation='softmax'))  # Softmax for multi-class classification / 8 classes
    
    # model compile
    nn_model.compile(optimizer=Adam(), loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    
    # train model
    nn_model.fit(X_train_tfidf, y_train_encoded, epochs=50, batch_size=4, validation_data=(X_test_tfidf, y_test_encoded), callbacks = [es], verbose =0)
    
    # predict
    y_pred_nn = nn_model.predict(X_test_tfidf, verbose =0)
    y_pred_nn = y_pred_nn.argmax(axis=1)  # transform predict value to class
    
    # transform y back to value
    y_pred_nn = label_encoder.inverse_transform(y_pred_nn)
    
    return accuracy_score(y_test, y_pred_nn)

### Tuning Model with Different Parameters
Since the score varies between training runs, I run the model 3 times for each configuration and use the average accuracy to evaluate its performance. I then repeat the experiment with random undersampling to check whether addressing class imbalance improves the results.
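The averaging step can be sketched on its own as follows. The helper `repeated_score` and the stand-in `fake_run` are hypothetical: `fake_run` returns synthetic numbers in place of a real training run, so the printed values are not results from this project. Reporting the standard deviation next to the mean is an optional addition that shows how much the runs disagree.

```python
import random
from statistics import mean, stdev

def repeated_score(train_and_score, n_runs=3):
    """Call a scoring function several times; return mean, spread, and raw scores."""
    scores = [train_and_score() for _ in range(n_runs)]
    return mean(scores), stdev(scores), scores

# Synthetic stand-in for one training run (NOT real model output)
rng = random.Random(0)

def fake_run():
    return 0.70 + rng.uniform(-0.03, 0.03)

avg, spread, scores = repeated_score(fake_run)
print(f"mean accuracy = {avg:.3f}, std = {spread:.3f}")
```

With only 3 runs the mean is still noisy, so small differences between configurations should be read with caution.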

In [40]:
# Base model
start_time = time.time()
nn_score_list = []

for tok in tokenizer:
    for max_f in max_features:
        for max_d in max_df:  
            # Preprocess data

            X = df['content'].apply(tok)
            y = df['category_id']

            # Split data
            X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

            # Apply TF-IDF
            tfidf = TfidfVectorizer(max_features=max_f, max_df=max_d)
            X_train_tfidf = tfidf.fit_transform(X_train)
            X_test_tfidf = tfidf.transform(X_test)

            # transform y_train and y_test to numeric
            label_encoder = LabelEncoder()
            y_train_encoded = label_encoder.fit_transform(y_train)
            y_test_encoded = label_encoder.transform(y_test)
    
            # Get input shape for the NN model
            shape = X_train_tfidf.shape[1]

            # Run nn_model 3 times and store accuracies
            accuracies = []
            for _ in range(3):
                acc = nn_model(X_train_tfidf, X_test_tfidf, y_train_encoded, y_test_encoded, shape)
                accuracies.append(acc)
            
            # Compute mean accuracy
            mean_acc = sum(accuracies) / len(accuracies)
            
            # Append results
            nn_score_list.append({
                'classifier': 'neural network',
                'tokenizer': tok.__name__,
                'max_features': max_f,
                'max_df': max_d,
                'accuracy': mean_acc
            })

runtime = (time.time() - start_time)/60

print(f'All model tuning complete! Total runtime = {runtime:.0f} minutes.')
# Convert to DataFrame
nn_score_df = pd.DataFrame(nn_score_list)

All model tuning complete! Total runtime = 67 minutes.


In [30]:
nn_score_df = nn_score_df.sort_values(by = 'accuracy', ascending = False)

# Replace NaN with No Max Features
nn_score_df['max_features'] = np.where(nn_score_df['max_features'].isnull(), 'No Max Features', nn_score_df['max_features'])

nn_score_df.to_csv('../data/nn_score.csv', index = False)
nn_score_df

Unnamed: 0,classifier,tokenizer,max_features,max_df,accuracy
0,neural network,thai_preprocess,5000,0.8,0.735294
1,neural network,thai_preprocess,5000,0.9,0.72549
2,neural network,thai_preprocess,7000,0.9,0.715686
3,neural network,thai_preprocess,No Max Features,0.9,0.696078
4,neural network,wangchan,7000,0.8,0.696078
5,neural network,wangchan,No Max Features,0.9,0.696078
6,neural network,wangchan,5000,0.8,0.676471
7,neural network,wangchan,No Max Features,0.8,0.676471
8,neural network,wangchan,7000,0.9,0.666667
9,neural network,thai_preprocess,No Max Features,0.8,0.647059


In [86]:
# Mean score 
mean_score = float(nn_score_df['accuracy'].mean())
print('Mean score:',mean_score)
# Check performance by parameters
print('\nMean score by',nn_score_df.groupby('tokenizer')['accuracy'].mean())
print('\nMean score by',nn_score_df.groupby('max_features')['accuracy'].mean())
print('\nMean score by',nn_score_df.groupby('max_df')['accuracy'].mean())

Mean score: 0.6838235294117648

Mean score by tokenizer
thai_preprocess    0.691176
wangchan           0.676471
Name: accuracy, dtype: float64

Mean score by max_features
5000               0.696078
7000               0.676471
No Max Features    0.678922
Name: accuracy, dtype: float64

Mean score by max_df
0.8    0.676471
0.9    0.691176
Name: accuracy, dtype: float64


### Try Handling Imbalanced Classes with Random Undersampling

In [88]:
start_time = time.time()
nn_uamp_score_list = []

for tok in tokenizer:
    for max_f in max_features:
        for max_d in max_df:  
            
            # Preprocess data
            X = df['content'].apply(tok)
            y = df['category_id']

            # Split data
            X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

            # Apply TF-IDF
            tfidf = TfidfVectorizer(max_features=max_f, max_df=max_d)
            X_train_tfidf = tfidf.fit_transform(X_train)
            X_test_tfidf = tfidf.transform(X_test)

            # transform y_train and y_test to numeric
            label_encoder = LabelEncoder()
            y_train_encoded = label_encoder.fit_transform(y_train)
            y_test_encoded = label_encoder.transform(y_test)
    
            # Get input shape for the NN model
            shape = X_train_tfidf.shape[1]

            # Apply RandomUnderSampler
            rus = RandomUnderSampler(random_state=42)
            X_train_resampled, y_train_resampled = rus.fit_resample(X_train_tfidf, y_train_encoded)
            
            # Convert sparse matrices to dense arrays
            X_train_resampled = X_train_resampled.toarray()
            X_test_tfidf = X_test_tfidf.toarray()

            # Run nn_model 3 times and store accuracies
            accuracies = []
            for _ in range(3):
                acc = nn_model(X_train_resampled, X_test_tfidf, y_train_resampled, y_test_encoded, shape)
                accuracies.append(acc)
            
            # Compute mean accuracy
            mean_acc = sum(accuracies) / len(accuracies)
            
            # Append results
            nn_uamp_score_list.append({
                'classifier': 'neural network',
                'tokenizer': tok.__name__,
                'max_features': max_f,
                'max_df': max_d,
                'accuracy': mean_acc
            })

runtime = (time.time() - start_time)/60

print(f'All model tuning complete! Total runtime = {runtime:.0f} minutes.')

# Convert to DataFrame
nn_usamp_score_df = pd.DataFrame(nn_uamp_score_list)

All model tuning complete! Total runtime = 72 minutes.


In [22]:
nn_usamp_score_df = nn_usamp_score_df.sort_values(by = 'accuracy', ascending = False)

# Replace NaN with No Max Features
nn_usamp_score_df['max_features'] = np.where(nn_usamp_score_df['max_features'].isnull(), 'No Max Features', nn_usamp_score_df['max_features'])

nn_usamp_score_df.to_csv('../data/nn_usamp_score.csv', index = False)
nn_usamp_score_df

Unnamed: 0,classifier,tokenization,max_features,max_df,accuracy
1,neural network,thai_preprocess,5000.0,0.9,0.588235
9,neural network,wangchan,7000.0,0.9,0.578431
6,neural network,wangchan,5000.0,0.8,0.571895
11,neural network,wangchan,No Max Features,0.9,0.571895
5,neural network,thai_preprocess,No Max Features,0.9,0.565359
3,neural network,thai_preprocess,7000.0,0.9,0.555556
8,neural network,wangchan,7000.0,0.8,0.542484
0,neural network,thai_preprocess,5000.0,0.8,0.539216
7,neural network,wangchan,5000.0,0.9,0.539216
4,neural network,thai_preprocess,No Max Features,0.8,0.535948


In [110]:
# Mean score 
usamp_mean_score = float(nn_usamp_score_df['accuracy'].mean())
print('Mean score:',usamp_mean_score)
# Check performance by parameters
print('\nMean score by',nn_usamp_score_df.groupby('tokenizer')['accuracy'].mean())
print('\nMean score by',nn_usamp_score_df.groupby('max_features')['accuracy'].mean())
print('\nMean score by',nn_usamp_score_df.groupby('max_df')['accuracy'].mean())

Mean score: 0.5490196078431372

Mean score by tokenizer
thai_preprocess    0.544662
wangchan           0.553377
Name: accuracy, dtype: float64

Mean score by max_features
5000.0             0.559641
7000.0             0.540033
No Max Features    0.547386
Name: accuracy, dtype: float64

Mean score by max_df
0.8    0.531590
0.9    0.566449
Name: accuracy, dtype: float64


### Best Parameters for Base Neural Network and Model with Random Undersampling:
- **Tokenization**: `thai_preprocess` performs slightly better than `wangchan` in the base model (mean accuracy 0.691 vs. 0.676); with undersampling, `wangchan` is marginally ahead (0.553 vs. 0.545).
- **Max_features**: Setting `max_features = 5000` yields the best mean accuracy in both experiments.
- **Max_df**: Setting `max_df = 0.9` gives the best mean accuracy in both experiments.

The base model achieved a mean accuracy score of 0.68, while the model with undersampling resulted in a lower mean accuracy score of 0.55.

**Summary: The base neural network model performs better than the model with random undersampling.**