# ColorSkim Machine Learning AI

Currently, the `item_description` of an article is written in the form `nama_artikel + warna` (article name + color), and the separator between `nama_artikel` and `warna` varies between brands: some use a space, others a dash, a slash, and so on.

This project applies an artificial neural network to learn how article names and colors are written together, so that the color alone can be extracted from the article description.

Several **Natural Language Processing** modelling scenarios will be tried for this *sequence to sequence* problem. In essence, each sentence (`item_description`) is split word by word, and every word is classified into one of two categories, `warna` (color) or `bukan_warna` (not a color), which makes this a binary classification task.
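
The word-level split described above can be sketched in a few lines. This is a minimal illustration, assuming that spaces, dashes, and slashes are the only separators; `pisah_kata` is a hypothetical helper name and is not part of the notebook.

```python
import re

def pisah_kata(item_description):
    """Split an item description into words on spaces, dashes, and slashes.

    Hypothetical helper for illustration; the separator set is an assumption.
    """
    # A run of one or more separator characters counts as a single split point
    return [kata for kata in re.split(r"[\s\-/]+", item_description) if kata]

print(pisah_kata("ADISSAGE-BLACK/BLACK/RUNWHT"))
print(pisah_kata("3 STRIPE D 29.5-BASKETBALL NATURAL"))
```

Each resulting word then becomes one row to classify, which matches the `kata` column of the dataset loaded below.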



In [44]:
# Import modules
import tensorflow as tf
from tensorflow.python.client import device_lib
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import wandb as wb
from rahasia import API_KEY_WANDB
tf.config.run_functions_eagerly(True)
tf.data.experimental.enable_debug_mode()

In [7]:
# Check GPU availability for modelling
# GeForce MX250 - office
# GeForce GTX 1060 - home
device_lib.list_local_devices()[1]

name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 1406005863
locality {
  bus_id: 1
  links {
  }
}
incarnation: 909604535619376993
physical_device_desc: "device: 0, name: NVIDIA GeForce MX250, pci bus id: 0000:02:00.0, compute capability: 6.1"
xla_global_id: 416903419

In [9]:
# Log in to wandb
wb.login(key=API_KEY_WANDB)

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: W&B API key is configured. Use `wandb login --relogin` to force relogin
wandb: Appending key for api.wandb.ai to your netrc file: C:\Users\jPao/.netrc


True

## Reading the data

In [10]:
# Read the data into a pandas DataFrame
data = pd.read_csv('data/setengah_dataset_artikel.csv')
data[:10]

Unnamed: 0,nama_artikel,kata,label,urut_kata,total_kata
0,ADISSAGE-BLACK/BLACK/RUNWHT,ADISSAGE,bukan_warna,1,4
1,ADISSAGE-BLACK/BLACK/RUNWHT,BLACK,warna,2,4
2,ADISSAGE-BLACK/BLACK/RUNWHT,BLACK,warna,3,4
3,ADISSAGE-BLACK/BLACK/RUNWHT,RUNWHT,warna,4,4
4,ADISSAGE-N.NAVY/N.NAVY/RUNWHT,ADISSAGE,bukan_warna,1,4
5,ADISSAGE-N.NAVY/N.NAVY/RUNWHT,N.NAVY,warna,2,4
6,ADISSAGE-N.NAVY/N.NAVY/RUNWHT,N.NAVY,warna,3,4
7,ADISSAGE-N.NAVY/N.NAVY/RUNWHT,RUNWHT,warna,4,4
8,3 STRIPE D 29.5-BASKETBALL NATURAL,3,bukan_warna,1,6
9,3 STRIPE D 29.5-BASKETBALL NATURAL,STRIPE,bukan_warna,2,6


## Exploring the data

In [11]:
# Distribution of labels in the data
data['label'].value_counts()

bukan_warna    34174
warna          22577
Name: label, dtype: int64
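
For reference, the class balance implied by these counts can be computed directly. A classifier that always predicted the majority class would reach roughly that share as accuracy, which gives a floor for judging the models that follow. A small sketch using the counts printed above:

```python
# Label counts copied from the value_counts() output above
jumlah_label = {"bukan_warna": 34174, "warna": 22577}

total = sum(jumlah_label.values())
proporsi = {label: jumlah / total for label, jumlah in jumlah_label.items()}

for label, p in proporsi.items():
    print(f"{label}: {p:.1%}")
```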

## Splitting the data into train and test sets

In [13]:
from sklearn.model_selection import train_test_split
train_kata, test_kata, train_label, test_label = train_test_split(data['kata'].to_numpy(), data['label'].to_numpy(), test_size=0.2, random_state=42)
train_kata[:5], test_kata[:5], train_label[:5], test_label[:5]

(array(['INVIS', 'SOLRED', 'JR', 'REACT', 'WHITE'], dtype=object),
 array(['6', 'GA', 'NIKE', 'BLUE', 'BLACK'], dtype=object),
 array(['bukan_warna', 'warna', 'bukan_warna', 'bukan_warna', 'warna'],
       dtype=object),
 array(['bukan_warna', 'bukan_warna', 'bukan_warna', 'warna', 'warna'],
       dtype=object))

## Converting labels to numeric form

In [14]:
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
train_label_encode = label_encoder.fit_transform(train_label)
test_label_encode = label_encoder.transform(test_label)
train_label_encode[:5], test_label_encode[:5]

(array([0, 1, 0, 0, 1]), array([0, 0, 0, 1, 1]))
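
`LabelEncoder` assigns integers to the sorted unique labels, so here `bukan_warna` maps to 0 and `warna` maps to 1, consistent with the output above. The following plain-Python sketch reproduces that mapping without scikit-learn; the variable names are illustrative only.

```python
# First five training labels, copied from the split output above
train_label_contoh = ["bukan_warna", "warna", "bukan_warna", "bukan_warna", "warna"]

# The sorted unique labels play the role of LabelEncoder.classes_
kelas = sorted(set(train_label_contoh))
label_ke_indeks = {label: i for i, label in enumerate(kelas)}

terenkode = [label_ke_indeks[label] for label in train_label_contoh]
print(kelas)
print(terenkode)
```

Keeping this ordering in mind matters later, when predicted integers are turned back into class names.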

## Model 0: baseline model

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Build a pipeline that converts words to tf-idf features and fits a Naive Bayes classifier
model_0 = Pipeline([
    ("tf-idf", TfidfVectorizer()),
    ("clf", MultinomialNB())
])

# Fit the pipeline on the training data
model_0.fit(X=train_kata, y=train_label_encode)
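
One detail worth keeping in mind for this baseline: with default settings, `TfidfVectorizer` lowercases the text and extracts tokens with the pattern `(?u)\b\w\w+\b`, which keeps only runs of two or more word characters. Single-character words therefore produce an empty feature vector, and punctuation inside a word splits it. The sketch below applies that pattern with `re` alone to show the effect on a few words from this dataset; it does not call scikit-learn itself.

```python
import re

# Default token_pattern of scikit-learn's TfidfVectorizer
POLA_TOKEN = r"(?u)\b\w\w+\b"

for kata in ["BLACK", "N.NAVY", "29.5", "6", "D"]:
    # The vectorizer lowercases by default before tokenizing
    token = re.findall(POLA_TOKEN, kata.lower())
    print(f"{kata!r} -> {token}")
```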

In [16]:
# Evaluate model_0 on the test data
model_0.score(X=test_kata, y=test_label_encode)

0.9935688485595983

In [17]:
# Make predictions on the test data
pred_model_0 = model_0.predict(test_kata)
pred_model_0

array([0, 0, 0, ..., 1, 1, 0])

In [18]:
# Helper function to compute accuracy, precision, recall and f1-score
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
def hitung_metrik(target, prediksi):
    """
    Compute accuracy, precision, recall and f1-score of a binary classification model
    
    Args:
        target: true labels as a 1D array
        prediksi: predicted labels as a 1D array
        
    Returns:
        accuracy, precision, recall and f1-score as a dictionary
    """
    # Compute the model accuracy
    model_akurasi = accuracy_score(target, prediksi)
    # Compute precision, recall and f1-score weighted by class support (support itself is discarded)
    model_presisi, model_recall, model_f1, _ = precision_recall_fscore_support(target, prediksi, average='weighted')
    
    hasil_model = {'akurasi': model_akurasi,
                   'presisi': model_presisi,
                   'recall': model_recall,
                   'f1-score': model_f1}
    
    return hasil_model

In [19]:
# Compute the metrics of model_0
model_0_metrik = hitung_metrik(target=test_label_encode, 
                               prediksi=pred_model_0)
model_0_metrik

{'akurasi': 0.9935688485595983,
 'presisi': 0.9935690437085363,
 'recall': 0.9935688485595983,
 'f1-score': 0.9935671438217326}
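
With `average='weighted'`, each per-class score is weighted by that class's support (its number of true samples). One consequence is that weighted recall always equals accuracy, which is why those two values are identical above. The dependency-free sketch below computes the same quantities by hand on a toy example; the labels are made up for illustration and `metrik_weighted` is a hypothetical helper.

```python
def metrik_weighted(target, prediksi):
    """Support-weighted precision, recall and F1, computed by hand."""
    n = len(target)
    presisi = recall = f1 = 0.0
    for kelas in sorted(set(target)):
        pasangan = list(zip(target, prediksi))
        tp = sum(t == kelas and p == kelas for t, p in pasangan)
        fp = sum(t != kelas and p == kelas for t, p in pasangan)
        fn = sum(t == kelas and p != kelas for t, p in pasangan)
        p_k = tp / (tp + fp) if tp + fp else 0.0
        r_k = tp / (tp + fn) if tp + fn else 0.0
        f_k = 2 * p_k * r_k / (p_k + r_k) if p_k + r_k else 0.0
        bobot = (tp + fn) / n  # support of this class as a fraction of all samples
        presisi += bobot * p_k
        recall += bobot * r_k
        f1 += bobot * f_k
    return presisi, recall, f1

target = [0, 0, 0, 1, 1]
prediksi = [0, 0, 1, 1, 1]
akurasi = sum(t == p for t, p in zip(target, prediksi)) / len(target)
print(metrik_weighted(target, prediksi), akurasi)
```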

## Preparing the data (text) for deep sequence models

### Text Vectorizer Layer

In [20]:
# Number of samples (words) in train_kata
len(train_kata)

45400

In [21]:
# Number of unique samples (unique words) in train_kata
jumlah_kata_train = len(np.unique(train_kata))
jumlah_kata_train

2940

In [22]:
# Create the text vectorizer
from tensorflow.keras.layers import TextVectorization
vectorizer_kata = TextVectorization(max_tokens=jumlah_kata_train,
                                    output_sequence_length=1,
                                    standardize='lower')

In [23]:
# Adapt the text vectorizer to train_kata
vectorizer_kata.adapt(train_kata)

In [24]:
# Test the word vectorizer
import random
target_kata = random.choice(train_kata)
print(f'Word:\n{target_kata}\n')
print(f'Word after vectorization:\n{vectorizer_kata([target_kata])}')

Word:
1PP

Word after vectorization:
[[364]]


In [25]:
vectorizer_kata.get_config()

{'name': 'text_vectorization',
 'trainable': True,
 'batch_input_shape': (None,),
 'dtype': 'string',
 'max_tokens': 2940,
 'standardize': 'lower',
 'split': 'whitespace',
 'ngrams': None,
 'output_mode': 'int',
 'output_sequence_length': 1,
 'pad_to_max_tokens': False,
 'sparse': False,
 'ragged': False,
 'vocabulary': None,
 'idf_weights': None}

In [26]:
# Size of the vocabulary in vectorizer_kata
jumlah_vocab = vectorizer_kata.get_vocabulary()
len(jumlah_vocab)

2938
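
In `int` output mode, `TextVectorization` reserves index 0 for padding and index 1 for out-of-vocabulary tokens (`[UNK]`), then assigns the remaining indices by descending token frequency. Because `standardize='lower'` merges words that differ only in case, the vocabulary size can differ from the number of case-sensitive unique words counted earlier. The sketch below imitates that lookup in plain Python on a made-up word list; it illustrates the idea and is not the layer's actual implementation.

```python
from collections import Counter

kata_contoh = ["BLACK", "black", "WHITE", "Black", "NIKE", "white"]

# standardize='lower' only lowercases; punctuation is left untouched
frekuensi = Counter(kata.lower() for kata in kata_contoh)

# Index 0 = padding, index 1 = out-of-vocabulary, then most frequent first
vocab = ["", "[UNK]"] + [kata for kata, _ in frekuensi.most_common()]
kata_ke_indeks = {kata: i for i, kata in enumerate(vocab)}

def vektorisasi(kata):
    """Map a single word to its integer index (1 if the word is unseen)."""
    return kata_ke_indeks.get(kata.lower(), 1)

print(vocab)
print(vektorisasi("BLACK"), vektorisasi("PUMA"))
```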

### Creating the Text Embedding

In [27]:
# Create the text embedding layer
from tensorflow.keras.layers import Embedding
kata_embed = Embedding(input_dim=len(jumlah_vocab),
                       output_dim=64,
                       mask_zero=True,
                       name='layer_token_embedding')

In [28]:
# Example of vectorization followed by embedding
print(f'Word before vectorization:\n{target_kata}\n')
kata_tervektor = vectorizer_kata([target_kata])
print(f'\nWord after vectorization (before embedding):\n{kata_tervektor}\n')
kata_terembed = kata_embed(kata_tervektor)
print(f'\nWord after embedding:\n{kata_terembed}\n')
print(f'Shape of the word after embedding:\n{kata_terembed.shape}')

Word before vectorization:
1PP


Word after vectorization (before embedding):
[[364]]


Word after embedding:
[[[-4.68876362e-02  1.39226429e-02  1.53589956e-02  4.58656624e-03
    3.38894464e-02 -6.04242086e-06  2.18849666e-02  1.71338394e-03
   -4.49796915e-02 -8.10725614e-03 -5.08914143e-03 -4.80378158e-02
   -2.03994270e-02  4.37932089e-03  1.24381408e-02 -1.91058517e-02
    1.82107799e-02  4.98666205e-02 -3.25337872e-02  1.53612345e-04
   -3.75577807e-02  1.64883621e-02 -4.04143445e-02 -3.25289257e-02
    4.32353057e-02  2.53272094e-02  2.05454491e-02  2.43189670e-02
    4.46835868e-02  4.18224074e-02 -1.92648061e-02 -6.11367077e-03
    2.24282406e-02  3.76079567e-02 -3.73811647e-03 -4.47501801e-02
   -3.07079442e-02  2.99662985e-02  3.80241610e-02 -2.02864297e-02
   -2.23221537e-02  1.91515349e-02  2.69093402e-02  1.99612118e-02
    3.74258496e-02  1.38394535e-05  1.75894536e-02 -3.01381350e-02
    3.93041410e-02 -2.25336794e-02  3.17663290e-02 -3.27903517e-02
    2.77244337e-0

### Creating a TensorFlow Dataset

In [29]:
# Create TensorFlow datasets
train_dataset = tf.data.Dataset.from_tensor_slices((train_kata, train_label_encode))
test_dataset = tf.data.Dataset.from_tensor_slices((test_kata, test_label_encode))

train_dataset

<TensorSliceDataset element_spec=(TensorSpec(shape=(), dtype=tf.string, name=None), TensorSpec(shape=(), dtype=tf.int32, name=None))>

In [39]:
# Batch and prefetch the TensorSliceDataset
train_dataset = train_dataset.batch(32).prefetch(tf.data.AUTOTUNE)
test_dataset = test_dataset.batch(32).prefetch(tf.data.AUTOTUNE)

train_dataset

<PrefetchDataset element_spec=(TensorSpec(shape=(None,), dtype=tf.string, name=None), TensorSpec(shape=(None,), dtype=tf.int32, name=None))>
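
With 45,400 training words and a batch size of 32, one epoch consists of 1,419 batches, the last of which is only partially filled. A quick check of that arithmetic, using the numbers shown earlier in the notebook:

```python
import math

jumlah_data = 45400   # len(train_kata), from the output earlier
ukuran_batch = 32     # value passed to .batch()

jumlah_batch = math.ceil(jumlah_data / ukuran_batch)
sisa = jumlah_data % ukuran_batch  # size of the final, partial batch

print(jumlah_batch, sisa)
```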

## Model 1: Conv1D with embedding

In [30]:
# Build model_1 with a Conv1D layer on top of the vectorized and embedded words
from tensorflow.keras import layers

inputs = layers.Input(shape=(1,), dtype=tf.string, name='layer_input')
layer_vektor = vectorizer_kata(inputs)
layer_embed = kata_embed(layer_vektor)
x = layers.Conv1D(filters=64, kernel_size=5, padding='same', activation='relu')(layer_embed)
x = layers.GlobalMaxPooling1D(name='layer_max_pool')(x)
outputs = layers.Dense(units=1, activation='sigmoid', name='layer_output')(x)
model_1 = tf.keras.Model(inputs=inputs, outputs=outputs, name='model_1_Conv1D_embed')

# Compile
model_1.compile(loss=tf.keras.losses.BinaryCrossentropy(),
                optimizer=tf.keras.optimizers.Adam(),
                metrics=['accuracy'])
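
The parameter counts that `model_1.summary()` reports for the embedding and convolution layers can be verified by hand. The sketch assumes the vocabulary size of 2,938 and the embedding width of 64 shown earlier; the variable names are illustrative.

```python
ukuran_vocab = 2938   # len(vectorizer_kata.get_vocabulary())
dim_embed = 64        # output_dim of the Embedding layer
filters = 64          # number of Conv1D filters
kernel_size = 5       # Conv1D kernel size

# Embedding: one dim_embed-sized vector per vocabulary entry
param_embedding = ukuran_vocab * dim_embed

# Conv1D: kernel_size * input_channels weights per filter, plus one bias per filter
param_conv1d = kernel_size * dim_embed * filters + filters

# Dense(1) on the 64 pooled features: 64 weights plus one bias
param_dense = filters * 1 + 1

print(param_embedding, param_conv1d, param_dense)
```

Almost all trainable parameters sit in the embedding table, so the vocabulary size is the main driver of model size here.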

In [31]:
# Summary of model_1
model_1.summary()

Model: "model_1_Conv1D_embed"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 layer_input (InputLayer)    [(None, 1)]               0         
                                                                 
 text_vectorization (TextVec  (None, 1)                0         
 torization)                                                     
                                                                 
 layer_token_embedding (Embe  (None, 1, 64)            188032    
 dding)                                                          
                                                                 
 conv1d (Conv1D)             (None, 1, 64)             20544     
                                                                 
 layer_max_pool (GlobalMaxPo  (None, 64)               0         
 oling1D)                                                        
                                              

In [37]:
# Plot model_1
from tensorflow.keras.utils import plot_model
plot_model(model_1, show_shapes=True)

You must install pydot (`pip install pydot`) and install graphviz (see instructions at https://graphviz.gitlab.io/download/) for plot_model/model_to_dot to work.


In [42]:
# import WandbCallback
from wandb.keras import WandbCallback

# Set up wandb init and config
wb.init(project='ColorSkim',
        entity='jpao',
        name='model_1_Conv1D_embed',
        config={'epochs': 5,
                'n_layers': len(model_1.layers)})

# Fit model_1
hist_model_1 = model_1.fit(train_dataset,
                           epochs=wb.config.epochs,
                           validation_data=test_dataset,
                           callbacks=[WandbCallback()])

0,1
GFLOPS,0.0


Epoch 1/5

wandb: ERROR Can't save model in the h5py format. The model will be saved as W&B Artifacts in the SavedModel format.

INFO:tensorflow:Assets written to: d:\ColorSkim\wandb\run-20220706_175519-1w82lbhy\files\model-best\assets
wandb: Adding directory to artifact (d:\ColorSkim\wandb\run-20220706_175519-1w82lbhy\files\model-best)... Done. 0.2s

Epoch 2/5

INFO:tensorflow:Assets written to: d:\ColorSkim\wandb\run-20220706_175519-1w82lbhy\files\model-best\assets
wandb: Adding directory to artifact (d:\ColorSkim\wandb\run-20220706_175519-1w82lbhy\files\model-best)... Done. 0.1s

Epoch 3/5

INFO:tensorflow:Assets written to: d:\ColorSkim\wandb\run-20220706_175519-1w82lbhy\files\model-best\assets
wandb: Adding directory to artifact (d:\ColorSkim\wandb\run-20220706_175519-1w82lbhy\files\model-best)... Done. 0.0s

Epoch 4/5

INFO:tensorflow:Assets written to: d:\ColorSkim\wandb\run-20220706_175519-1w82lbhy\files\model-best\assets
wandb: Adding directory to artifact (d:\ColorSkim\wandb\run-20220706_175519-1w82lbhy\files\model-best)... Done. 0.2s

Epoch 5/5


In [45]:
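# Test predictions with model_1 (model_1_Conv1D_embed)
# LabelEncoder sorts classes alphabetically, so index 0 = 'bukan_warna' and 1 = 'warna'
class_list = list(label_encoder.classes_)
article = 'PUMA XTG WOVEN PANTS PUMA BLACK-PUMA WHITE'
article_list = article.replace("-", " ").split()
# predict() returns sigmoid probabilities of shape (n_words, 1); using them directly as
# list indices raises "TypeError: only integer scalar arrays can be converted to a scalar index"
pred_prob = model_1.predict(tf.constant(article_list))
# Round each probability to 0 or 1 and flatten so it can index class_list
model_test = np.round(pred_prob).astype(int).ravel()
for i in range(len(article_list)):
    print(f'Word: {article_list[i]}\nPrediction: {class_list[model_test[i]]}\n\n')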