# NLP

---
# Model Inference
## Analyzing Car Customer Complaints with NLP and an Artificial Neural Network to Classify the Problematic Component

---

### 1. Group Identity

- **Members:** Hafiz Alfariz, Fhad Saleh, Bagus, Rivaldi Revin
- **Batch:** RMT 045
- **Group:** 01

### 2. Dataset

- **Source:** Kaggle - NHTSA Complaints
- **Description:** The dataset contains customer complaints across many car brands and models. The main feature is the `summary` column (complaint text) and the target is `components` (the car component the complaint is about).
- **File:** `complaints.csv`
---

# 1. Import Library

In [1]:
# --- 1. Import Library ---
import tensorflow as tf
import pandas as pd
import numpy as np
import pickle
import json
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer


# 2. Load Model, Tokenizer, Label Mapping

In [2]:
# 2. Load Model, Tokenizer, Label Mapping
try:
    model = tf.keras.models.load_model('gru_glove_improved_model.keras')
    with open('tokenizer.pkl', 'rb') as f:
        tokenizer = pickle.load(f)
    with open('label_mapping.json', 'r') as f:
        label_mapping = json.load(f)
    y_train_aug = np.load('y_train_aug.npy')
except Exception as e:
    print("Error loading model/tokenizer/mapping:", e)
    raise

index_to_label = {v: k for k, v in label_mapping.items()}
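The inversion above assumes that `label_mapping.json` stores `{label: index}` pairs and that every model output index has a label. A minimal sketch of a check for that assumption follows. The helper names `invert_label_mapping` and `missing_indices` are hypothetical, and the three-entry sample mapping only reuses index/label pairs that appear in the printed predictions later in this notebook.

```python
import json


def invert_label_mapping(label_mapping):
    """Turn {label: index} into {index: label}, rejecting duplicate indices."""
    index_to_label = {}
    for label, idx in label_mapping.items():
        idx = int(idx)  # tolerate indices stored as numeric strings
        if idx in index_to_label:
            raise ValueError(
                f"index {idx} is used by both {index_to_label[idx]!r} and {label!r}"
            )
        index_to_label[idx] = label
    return index_to_label


def missing_indices(index_to_label, n_outputs):
    """List model output indices in range(n_outputs) that have no label."""
    return [i for i in range(n_outputs) if i not in index_to_label]


# Sample mapping (assumption): a subset, not the full label_mapping.json.
sample = json.loads('{"ENGINE": 3, "POWER TRAIN": 7, "STEERING": 9}')
index_to_label_demo = invert_label_mapping(sample)
print(index_to_label_demo[3])                    # ENGINE
print(missing_indices(index_to_label_demo, 4))   # [0, 1, 2]
```

With the real objects, `missing_indices(index_to_label, model.output_shape[-1])` should return an empty list if the mapping covers every output unit.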

# 3. Preprocessing Function (Same as Training)

In [3]:
# --- TEXT CLEANING, SAME AS TRAINING (for inference) ---
# 1. AUTO_KEEP (copied from training)
AUTO_KEEP = {
    "engine","motor","coolant","radiator","thermostat","overheat","overheating","misfire","stall","stalled","idling","idle",
    "oil","filter","spark","sparkplug","plug","injector","fuel","pump","nozzle","tank","maf","map","o2","lambda","tps","egr","evap",
    "battery","alternator","starter","voltage","wiring","harness","short","ground","relay","fuse","ecu","pcm","can","bus","sensor","actuator","module",
    "transmission","gearbox","gear","cvt","dct","clutch","torque","converter","drivetrain","powertrain","differential","axle","driveshaft","shifter","shift","engage","slip","jerk",
    "brake","brakes","braking","abs","pad","pads","rotor","caliper","booster","hydraulic","fluid","epb","spongy","soft","fade",
    "steer","steering","rack","pinion","alignment","wander","pull",
    "wheel","wheels","rim","hub","lug","bearing","tire","tyre","tires","tyres","pressure","psi","tpms",
    "suspension","strut","shock","bushing","linkage","controlarm","stabilizer","swaybar",
    "airbag","airbags","srs","pretensioner","seatbelt","seatbelts","restraint","deploy","deployment",
    "headlight","headlamp","taillight","foglight","lamp","bulb","beam","wiper","windshield","windscreen","washer","mirror","flicker","flickering","dim",
    "esc","esp","stability","traction","lane","ldw","fcw","adas","radar","lidar","camera","cruise","adaptive",
    "hvac","heater","ac","a/c","compressor","evaporator","expansion","condenser","blower","cabin","filter",
    "exhaust","muffler","catalyst","catalytic","converter","o2sensor","dpf","scr",
    "speed","mph","kmh","kph","rpm","throttle","pedal","warning","malfunction","limp","limpmode",
    "hybrid","hev","phev","ev","inverter","charger","charging","soc","soh","bms","dc","dc/dc","highvoltage","hv","12v","48v","400v","800v",
    "door","hood","trunk","tailgate","bumper","fender","pillar","chassis","frame","corrosion","rust","paint",
}
AUTO_KEEP |= {w+"s" for w in list(AUTO_KEEP) if not w.endswith("s")}
AUTO_KEEP |= {w.replace(" ", "") for w in list(AUTO_KEEP)}
AUTO_KEEP |= {w.replace("-", "") for w in list(AUTO_KEEP)}

# 2. FINAL_STOP (copied from training, minimal example)
FINAL_STOP = {
    "a","an","the","and","or","but","if","then","than","so","because","while","of","to","for","from","by",
    "on","in","at","as","with","about","into","through","over","under","between","within",
    "is","am","are","was","were","be","been","being","do","does","did","done","have","has","had",
    "can","could","shall","should","will","would","may","might","must",
    "i","im","i'm","ive","id","we","our","you","your","he","she","it","they","them","their",
    "this","that","these","those","there","here","where","when","why","how",
    "very","really","just","also","still","yet","only","even","ever","always","often","sometimes","usually",
    "more","most","less","least","lot","lots","many","much","few","several","some","any","none",
    "up","down","again","back","away","out","off","around","across","along","together",
    "contact","contacts","consumer","owner","dealer","manufacturer","manuf","vehicle","vehicles"
    # add noise terms mined during training here, if any
}

# 3. Regexes and helpers
RE_DTC      = re.compile(r"^[PBUC]\d{4}$", re.I)
RE_ACRONYM  = re.compile(r"^[A-Z]{2,6}$")
RE_ALNUM    = re.compile(r"^[A-Za-z0-9]+(?:[-/\.][A-Za-z0-9]+)*$")
RE_YEAR     = re.compile(r"^(19[5-9]\d|20[0-3]\d)$")
RE_VOLT     = re.compile(r"^\d{2,4}v$", re.I)
RE_UNITPSI  = re.compile(r"^\d{1,3}psi$", re.I)
RE_SPEED1   = re.compile(r"^\d{1,3}mph$", re.I)
RE_SPEED2   = re.compile(r"^\d{1,3}(kph|kmh)$", re.I)
RE_TEMP     = re.compile(r"^\d{1,3}[CF]$", re.I)
RE_VIN      = re.compile(r"\b[0-9A-HJ-NPR-Z]{11,17}\b")
RE_PHONE    = re.compile(r"\b(?:\+?\d{1,3}[-.\s]?)?(?:\(?\d{3}\)?[-.\s]?){2}\d{4}\b")
RE_EMAIL    = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")
RE_URL      = re.compile(r"https?://\S+")
RE_TOKEN    = re.compile(r"[A-Za-z0-9]+(?:[-/\.][A-Za-z0-9]+)*")
RE_REPEAT3  = re.compile(r"(.)\1{2,}")
VOWELS      = set("aeiouAEIOU")
NEGATIONS = {"no","not","without","never","doesnt","doesn't","cant","can't","wont","won't"}

PHRASE_PATTERNS = [
    r"\bTL\*?\b",
    r"\bTHE\s+CONTACT(?:\s+OWNER)?\b",
    r"\bTHE\s+CONSUMER\b",
    r"\bTHE\s+OWNER\b",
    r"\bTHE\s+DEALER\b",
    r"\bTHE\s+MANUFACTURER\b",
    r"\bCONTACT(?:ED)?\s+THE\s+DEALER\b",
    r"\bCONTACT(?:ED)?\s+THE\s+MANUFACTURER\b",
    r"\bSTATED\s+THAT\b",
    r"\bWAS\s+ADVISED\b",
    r"\bWAS\s+INFORMED\b",
    r"\bWAS\s+TOLD\b",
    r"\bWAS\s+UNABLE\b",
    r"\bWERE\s+UNABLE\b",
    r"\bTHE\s+FAILURE\s+RECURRED\b",
    r"\bTHE\s+FAILURE\b",
]
PHRASE_REGEX = [re.compile(pat, flags=re.I) for pat in PHRASE_PATTERNS]

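def _is_whitelisted(tok: str) -> bool:
    """Return True for tokens that bypass lowercasing and stop-word removal.

    Note: RE_ALNUM uses the same pattern as RE_TOKEN, so every token produced
    by RE_TOKEN.findall matches it. Kept as-is to stay consistent with training.
    """
    if len(tok) >= 50:
        return False
    if (RE_DTC.match(tok) or RE_ACRONYM.match(tok) or RE_YEAR.match(tok) or RE_VOLT.match(tok) or
        RE_UNITPSI.match(tok) or RE_SPEED1.match(tok) or RE_SPEED2.match(tok) or RE_TEMP.match(tok) or
        RE_ALNUM.match(tok)):
        return True
    if tok.lower() in AUTO_KEEP:
        return True
    return False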

def _normalize_special(tok: str) -> str:
    if RE_YEAR.match(tok): return "<YEAR>"
    if RE_VOLT.match(tok): return "<VOLT>"
    if RE_UNITPSI.match(tok): return "<PSI>"
    if RE_SPEED1.match(tok) or RE_SPEED2.match(tok): return "<SPEED>"
    if RE_TEMP.match(tok): return "<TEMP>"
    return tok

def scrub_boilerplate(s: str) -> str:
    s = RE_URL.sub(" <URL> ", s)
    s = RE_EMAIL.sub(" <EMAIL> ", s)
    s = RE_PHONE.sub(" <PHONE> ", s)
    s = RE_VIN.sub(" <VIN> ", s)
    for rgx in PHRASE_REGEX:
        s = rgx.sub(" ", s)
    s = re.sub(r"[^A-Za-z0-9\-\./\s]", " ", s)
    s = re.sub(r"\s+", " ", s).strip()
    return s

def clean_text_domain_aware(s: str) -> str:
    s = scrub_boilerplate(str(s))
    toks = RE_TOKEN.findall(s)
    out = []
    for tok in toks:
        tok = RE_REPEAT3.sub(r"\1\1", tok)
        low = tok.lower()
        if len(tok) >= 50:
            continue
        if any(x in low for x in ['www', 'http', 'https', '.com', '.net', '.org', '.gov', '.edu']):
            continue
        if low in NEGATIONS:
            out.append(low)
            continue
        if _is_whitelisted(tok) or (low in AUTO_KEEP):
            out.append(_normalize_special(tok))
            continue
        if low in FINAL_STOP:
            continue
        out.append(low)
    cleaned = " ".join(t for t in out if t)
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    return cleaned
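To see how the special-token rules behave in isolation, here is a small sketch that copies a few of the regexes above into a standalone function. `normalize_special_demo` and `tokens_reaching_stop_filter` are hypothetical names, not part of the training code, and `RE_SPEED` merges `RE_SPEED1` and `RE_SPEED2` for brevity. The second helper checks one property that may be worth verifying against the training notebook: `RE_ALNUM` uses the same pattern as `RE_TOKEN`, so every token appears to pass `_is_whitelisted` and never reaches the `FINAL_STOP` filter.

```python
import re

# Patterns copied from the cell above.
RE_TOKEN = re.compile(r"[A-Za-z0-9]+(?:[-/\.][A-Za-z0-9]+)*")
RE_ALNUM = re.compile(r"^[A-Za-z0-9]+(?:[-/\.][A-Za-z0-9]+)*$")
RE_YEAR = re.compile(r"^(19[5-9]\d|20[0-3]\d)$")
RE_VOLT = re.compile(r"^\d{2,4}v$", re.I)
RE_UNITPSI = re.compile(r"^\d{1,3}psi$", re.I)
RE_SPEED = re.compile(r"^\d{1,3}(mph|kph|kmh)$", re.I)  # RE_SPEED1 + RE_SPEED2


def normalize_special_demo(tok):
    """Map years, voltages, pressures and speeds to placeholder tokens."""
    if RE_YEAR.match(tok):
        return "<YEAR>"
    if RE_VOLT.match(tok):
        return "<VOLT>"
    if RE_UNITPSI.match(tok):
        return "<PSI>"
    if RE_SPEED.match(tok):
        return "<SPEED>"
    return tok


def tokens_reaching_stop_filter(text):
    """Tokens NOT matched by RE_ALNUM, i.e. those that could reach FINAL_STOP."""
    return [t for t in RE_TOKEN.findall(text) if not RE_ALNUM.match(t)]


tokens = RE_TOKEN.findall("In 2015 the 12v battery died at 60mph, code P0300")
print([normalize_special_demo(t) for t in tokens])
# ['In', '<YEAR>', 'the', '<VOLT>', 'battery', 'died', 'at', '<SPEED>', 'code', 'P0300']
print(tokens_reaching_stop_filter("Seat belt does not retract."))  # []
```

The empty list is consistent with the `Preprocessed:` lines printed in section 8, where words such as `does` survive cleaning with their original casing.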

# 4. Complaint Input

In [4]:
# 4. Complaint input (inference examples; may be real or made up)
data_infer = [
    "Engine warning light illuminated and lost power.",
    "Door handle broken, cannot open.",
    "Transmission slips when shifting.",
    "Seat belt does not retract.",
    "Steering wheel feels loose.",
    "Parking brake failed.",
    "Airbag warning light stays on.",
    "Suspension noisy over bumps.",
    "Tire tread worn out.",
    "Visibility reduced due to foggy windshield.",
    "Battery drained quickly after starting.",
    "ABS sensor malfunctioned during rain.",
    "Coolant leak detected under the car.",
    "Mirror adjustment not working.",
    "Panel vibration at high speed.",
    "Axle cracked after hitting pothole.",
    "Fuel pump stopped suddenly.",
    "Window stuck and won't close.",
    "Brake pedal feels soft.",
    "Other minor issue not listed."
]
df_infer = pd.DataFrame({'summary': data_infer})

# 5. Preprocessing

In [5]:
# df_infer['summary_clean'] = df_infer['summary'].apply(preprocess_text)
df_infer['summary_clean'] = df_infer['summary'].apply(clean_text_domain_aware)

# 6. Vectorization

In [6]:
# 6. Vectorization (maxlen must match training)
maxlen = 200
X_infer_seq = tokenizer.texts_to_sequences(df_infer['summary_clean'])
X_infer_pad = tf.keras.preprocessing.sequence.pad_sequences(X_infer_seq, maxlen=maxlen, padding='post')
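As a rough illustration of what `pad_sequences(..., maxlen=maxlen, padding='post')` produces, the pure-Python sketch below right-pads each sequence with zeros and, for sequences longer than `maxlen`, keeps the last `maxlen` ids (the Keras default `truncating='pre'`). `pad_post` is a hypothetical helper for illustration only, not a replacement for the Keras call.

```python
def pad_post(sequences, maxlen, value=0):
    """Right-pad with `value`; truncate from the front when a sequence is too long."""
    padded = []
    for seq in sequences:
        kept = list(seq)[-maxlen:]  # drop the earliest ids if len(seq) > maxlen
        padded.append(kept + [value] * (maxlen - len(kept)))
    return padded


# Sequence taken from the "Transmission slips when shifting." output in section 8.
print(pad_post([[82, 2408, 29, 494]], maxlen=6))  # [[82, 2408, 29, 494, 0, 0]]
# A sequence longer than maxlen keeps only its tail.
print(pad_post([[1, 2, 3, 4, 5]], maxlen=3))      # [[3, 4, 5]]
```

Both `maxlen` and the padding side need to match what was used in training, because the GRU sees the padded positions as part of the input.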

# 7. Predict

In [7]:
# 7. Predict
y_pred_prob = model.predict(X_infer_pad)
y_pred = np.argmax(y_pred_prob, axis=1)
df_infer['predicted_label'] = [index_to_label[int(idx)] for idx in y_pred]
df_infer['predicted_prob'] = [np.max(prob) for prob in y_pred_prob]


1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 314ms/step


# 8. Prediction Output

In [8]:
# 8. Prediction output
for i, row in df_infer.iterrows():
    print(f"Complaint: {row['summary']}")
    print(f"Preprocessed: {row['summary_clean']}")
    print(f"Predicted index: {y_pred[i]}, Label: {row['predicted_label']} (Prob: {row['predicted_prob']:.3f})")
    print(f"Sequence: {X_infer_seq[i]}")
    print("-" * 40)

# Summary of the predicted label distribution
print("Predicted label distribution:")
print(df_infer['predicted_label'].value_counts())


# Label distribution of the (augmented) training set, for comparison
print("\nTrain label distribution (augmented):")
print(pd.Series(y_train_aug).value_counts())

Predicted index: 3, Label: ENGINE (Prob: 0.789)
Sequence: [25, 42, 43, 115, 4, 288, 85]
----------------------------------------
Complaint: Door handle broken, cannot open.
Preprocessed: Door handle broken cannot open
Predicted index: 10, Label: STRUCTURE/BODY (Prob: 0.625)
Sequence: [129, 1195, 668, 358, 222]
----------------------------------------
Complaint: Transmission slips when shifting.
Preprocessed: Transmission slips when shifting
Predicted index: 7, Label: POWER TRAIN (Prob: 0.910)
Sequence: [82, 2408, 29, 494]
----------------------------------------
Complaint: Seat belt does not retract.
Preprocessed: Seat belt does not retract
Predicted index: 8, Label: RESTRAINTS & AIRBAGS (Prob: 0.997)
Sequence: [181, 441, 187, 11, 2332]
----------------------------------------
Complaint: Steering wheel feels loose.
Preprocessed: Steering wheel feels loose
Predicted index: 9, Label: STEERING (Prob: 0.995)
Sequence: [66, 141, 727, 642]
----------------------------------------
Complaint: Parking brake f
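Raw `value_counts()` are hard to compare when the two sets differ in size (20 inference examples versus the augmented training set). One possible way to compare them, sketched below, is to normalize both to proportions and compute the total variation distance. This metric is an addition for illustration, not something the original analysis used, and the label lists are made-up placeholders rather than the notebook's data.

```python
from collections import Counter


def label_proportions(labels):
    """Convert a list of labels into {label: share of total}."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}


def total_variation_distance(p, q):
    """0.0 means identical distributions, 1.0 means no overlap at all."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Placeholder labels (assumption), NOT the real train or prediction labels.
train_labels = ["ENGINE"] * 5 + ["STEERING"] * 3 + ["POWER TRAIN"] * 2
pred_labels = ["ENGINE"] * 2 + ["STEERING"] * 2

p_train = label_proportions(train_labels)
p_pred = label_proportions(pred_labels)
print(round(total_variation_distance(p_train, p_pred), 3))  # 0.2
```

Before applying this to the notebook's objects, both inputs would need the same key type: `y_train_aug` appears to hold class indices, while `predicted_label` holds label names.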

# 9. Long Complaint Prediction

In [9]:
# 9. Long complaint prediction

long_complaints = [
    "While driving at highway speed, the engine suddenly lost power and the check engine light illuminated. The vehicle struggled to accelerate and had to be towed.",
    "Steering wheel feels loose and vibrates at higher speeds. Alignment and balancing have been performed but issue continues.",
    "Parking brake failed to engage on a slope, causing the vehicle to roll slightly before stopping.",
    "Airbag warning light stays on after starting the vehicle. Diagnostic scan shows a fault in the airbag module.",
    "Tire tread worn unevenly despite regular rotation and alignment. Dealer recommends replacing all tires.",
    "Visibility is reduced due to foggy windshield that does not clear even with defroster on.",
    "ABS sensor malfunctioned during heavy rain, causing the ABS light to illuminate and braking performance to decrease."
]

# Preprocessing
df_long = pd.DataFrame({'summary': long_complaints})
df_long['summary_clean'] = df_long['summary'].apply(clean_text_domain_aware)

# df_infer['summary_clean'] = df_infer['summary'].apply(clean_text_domain_aware)

# Vectorization
X_long_seq = tokenizer.texts_to_sequences(df_long['summary_clean'])
X_long_pad = tf.keras.preprocessing.sequence.pad_sequences(X_long_seq, maxlen=maxlen, padding='post')

# Predict
y_long_prob = model.predict(X_long_pad)
y_long_pred = np.argmax(y_long_prob, axis=1)
df_long['predicted_label'] = [index_to_label[int(idx)] for idx in y_long_pred]
df_long['predicted_prob'] = [np.max(prob) for prob in y_long_prob]

# Print predictions for the long complaints
for i, row in df_long.iterrows():
    print(f"Complaint: {row['summary']}")
    print(f"Preprocessed: {row['summary_clean']}")
    print(f"Predicted index: {y_long_pred[i]}, Label: {row['predicted_label']} (Prob: {row['predicted_prob']:.3f})")
    print("Top 3 probabilities:")
    top3 = np.argsort(y_long_prob[i])[::-1][:3]
    for idx in top3:
        print(f"  {index_to_label[int(idx)]}: {y_long_prob[i][idx]:.3f}")
    print(f"Sequence: {X_long_seq[i]}")
    print("-" * 40)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 296ms/step
Complaint: While driving at highway speed, the engine suddenly lost power and the check engine light illuminated. The vehicle struggled to accelerate and had to be towed.
Preprocessed: While driving at highway speed the engine suddenly lost power and the check engine light illuminated The vehicle struggled to accelerate and had to be towed
Predicted index: 3, Label: ENGINE (Prob: 0.829)
Top 3 probabilities:
  ENGINE: 0.829
  FUEL & PROPULSION: 0.078
  POWER TRAIN: 0.030
Sequence: [27, 28, 21, 114, 94, 2, 25, 314, 288, 85, 4, 2, 144, 25, 43, 115, 2, 8, 5294, 3, 329, 4, 24, 3, 30, 156]
----------------------------------------
Complaint: Steering wheel feels loose and vibrates at higher speeds. Alignment and balancing have been performed but issue continues.
Preprocessed: Steering wheel feels loose and vibrates at higher speeds Alignment and balancing have been performed but issue continues
Predicted index: 9, Label: STEERI
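The top-3 loop above relies on `np.argsort`. The same idea in dependency-free form is sketched below, together with a simple low-confidence flag that could route uncertain complaints to manual review. The helper names `top_k` and `is_low_confidence`, the `min_prob` threshold of 0.5, the index-to-label pairs, and the probability row are all assumptions made for illustration, not values from the model.

```python
def top_k(probs, index_to_label, k=3):
    """Return the k most probable (label, probability) pairs, best first."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(index_to_label[i], probs[i]) for i in order[:k]]


def is_low_confidence(probs, min_prob=0.5):
    """True when even the best class falls below the chosen threshold."""
    return max(probs) < min_prob


# Hypothetical 4-class example; indices and probabilities are made up.
labels = {0: "ENGINE", 1: "STEERING", 2: "POWER TRAIN", 3: "TIRES"}
row = [0.70, 0.06, 0.20, 0.04]

for label, p in top_k(row, labels):
    print(f"  {label}: {p:.3f}")
print(is_low_confidence(row))  # False
```

A prediction such as the 0.625 for STRUCTURE/BODY in section 8 would pass a 0.5 threshold but fail a stricter one, so the cut-off is a judgment call to tune on validation data.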

# 10. Model Inference Conclusion

At the inference stage, the trained and saved **GRU + GloVe** model predicts the problematic component from car customer complaints across brands and models, for both short and long complaints. Inference reuses the same preprocessing, vectorization, and label mapping as training, so the predictions can be interpreted consistently with the training results.

The inference results show that the model classifies complaints into main component categories such as ENGINE, STRUCTURE/BODY, and POWER TRAIN, with high probability in most of the examples shown (for example 0.997 for RESTRAINTS & AIRBAGS and 0.995 for STEERING). The model can also recognize complaints from minority classes, although performance on classes with very little data still needs further attention. The label distribution of the predictions on new data is in line with the label distribution of the training data, which suggests the model is not biased toward the majority classes only.

Inference runs stably and is compatible with the library versions used, so the model is ready to be deployed in a real application such as Streamlit. With a systematic and reproducible pipeline, the model can automate complaint classification efficiently and with minimal bias.
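For deployment in an app, the steps in this notebook can be bundled into one function. The sketch below is one possible shape, not the project's actual Streamlit code: `predict_complaint` is a hypothetical name, and the cleaning, vectorizing, and scoring steps are passed in as callables so the function can be tried without loading TensorFlow. The `fake_*` callables and the two-label mapping are stand-ins, not the real model.

```python
def predict_complaint(text, clean, vectorize, score, index_to_label):
    """Run clean -> vectorize -> score and return (label, probability)."""
    cleaned = clean(text)
    features = vectorize([cleaned])      # batch containing one complaint
    probs = list(score(features)[0])     # class probabilities for that row
    best = max(range(len(probs)), key=lambda i: probs[i])
    return index_to_label[best], float(probs[best])


# Stand-in callables (assumptions) so the sketch runs on its own.
def fake_clean(text):
    return text.lower().strip()


def fake_vectorize(texts):
    return [[len(t)] for t in texts]


def fake_score(batch):
    return [[0.1, 0.9] for _ in batch]


label, prob = predict_complaint(
    "Steering wheel feels loose.",
    fake_clean, fake_vectorize, fake_score,
    {0: "ENGINE", 1: "STEERING"},
)
print(label, prob)  # STEERING 0.9
```

In this notebook's terms, `clean` would be `clean_text_domain_aware`, `vectorize` would wrap `tokenizer.texts_to_sequences` plus `pad_sequences`, and `score` would be `model.predict`.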

**Conclusion:**  
Model inference works as expected: it predicts the problematic component from car customer complaints automatically and consistently with the training results. The model is well suited for customer service applications and can help handle complaints faster, more accurately, and in a data-driven way.