# **Data Mining Case Study: Text Mining**

### **Group 7 Contribution Table:**

| Name                                | NPM        | Contribution                                                        | % Contribution |
|-------------------------------------|------------|---------------------------------------------------------------------|----------------|
| Golda Aurelia Silalahi              | 2206826173 | Actively took part in discussions and contributed to every section. | 100%           |
| Yiesha Reyhani                      | 2206828115 | Actively took part in discussions and contributed to every section. | 100%           |
| Jason Justin Andryana               | 2206029670 | Actively took part in discussions and contributed to every section. | 100%           |
| Aditya Raja Fadlurahman Kusuma      | 2206051626 | Actively took part in discussions and contributed to every section. | 100%           |

<br>


### **Natural Language Processing with Disaster Tweets**

**Data Source**: [Kaggle - Natural Language Processing with Disaster Tweets](https://www.kaggle.com/competitions/nlp-getting-started/data)

This dataset contains about 10,000 tweets that were manually labeled as referring to a real disaster or not. It can be used to train a machine learning model that separates tweets about real disasters from unrelated ones. The columns are described below:

| Variable Name  | Description                                                                                              |
|----------------|----------------------------------------------------------------------------------------------------------|
| id             | Unique identifier for each tweet                                                                         |
| text           | Text of the tweet                                                                                        |
| location       | Location the tweet was sent from (may be blank)                                                          |
| keyword        | A particular keyword from the tweet (may be blank)                                                       |
| target         | Only in *train.csv*; indicates whether the tweet is about a real disaster (1) or not (0)                 |

 <br>



## **Import Libraries and Load Data**

In [None]:
# Import Library
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

import re
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score

import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from scipy.sparse import hstack
!pip install contractions
import contractions
import warnings
from wordcloud import STOPWORDS, WordCloud
from collections import Counter

import nltk
nltk.download('wordnet')
nltk.download('punkt_tab')
nltk.download('stopwords')
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Classifiers from scikit-learn, XGBoost, and LightGBM
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

# Set seed
SEED = 825

# Warnings and display settings
warnings.filterwarnings('ignore')

In [None]:
# Load Data
url = 'https://raw.githubusercontent.com/goldasilalahi/data-mining-and-business-intelligence/refs/heads/main/'

train_df = pd.read_csv(url + 'train3.csv', delimiter=',') # train
test_df = pd.read_csv(url + 'test3.csv', delimiter=',') # test
ss_df = pd.read_csv(url + 'sample_submission3.csv', delimiter=',') # submission

In [None]:
# Preview the training data
train_df.head()

## EDA

In [None]:
# Check dataset shapes
print("Train Dataset Shape:", train_df.shape)
print("Test Dataset Shape:", test_df.shape)
print("Submission Dataset Shape:", ss_df.shape)

Next, we check for missing values in the training and test data.

In [None]:
# Count missing values per column
train_df.isnull().sum().rename("Missing value count (Train data)")

In [None]:
(train_df.isnull().sum()/train_df.shape[0]*100).rename("Missing values percentage per column (Train data)")

In [None]:
test_df.isnull().sum().rename("Missing value count (Test data)")

In [None]:
(test_df.isnull().sum()/test_df.shape[0]*100).rename("Missing values percentage per column (Test data)")

The keyword and location features contain missing values, with location having the most. We do not use these features to build the model, so the missing values are left as they are.

In [None]:
# Check for fully duplicated rows
duplicate_rows_train = train_df[train_df.duplicated()]
print(f"Number of duplicate rows in train_df: {len(duplicate_rows_train)}")

There are no fully duplicated rows, but we still check whether the text column contains duplicates.


In [None]:
# Check for duplicated text in the training data
print("\nNumber of duplicated values in the text column (Train data):")
print(train_df["text"].duplicated().sum())
# Check for duplicated text in the test data
print("\nNumber of duplicated values in the text column (Test data):")
print(test_df["text"].duplicated().sum())

Next, we list the training rows whose text occurs more than once, together with their target labels.


In [None]:
# List training rows whose text is duplicated
duplicates = pd.concat(x for _, x in train_df.groupby(["text"]) if len(x) > 1)
with pd.option_context("display.max_rows", None, "max_colwidth", 240):
    display(duplicates[["id", "target", "text"]])

The training data contains duplicated text. Some duplicates share the same target class, while others carry different target classes. For tweets whose duplicates share the same class, the duplicate rows are dropped.

In [None]:
# Drop duplicates that share the same target class
train_df.drop(
    [
        6449, 7034, 3589, 3591, 3597, 3600, 3603,
        3604, 3610, 3613, 3614, 119, 106, 115,
        2666, 2679, 1356, 7609, 3382, 1335, 2655,
        2674, 1343, 4291, 4303, 1345, 48, 3374,
        7600, 164, 5292, 2352, 4308, 4306, 4310,
        1332, 1156, 7610, 2441, 2449, 2454, 2477,
        2452, 2456, 3390, 7611, 6656, 1360, 5771,
        4351, 5073, 4601, 5665, 7135, 5720, 5723,
        5734, 1623, 7533, 7537, 7026, 4834, 4631,
        3461, 6366, 6373, 6377, 6378, 6392, 2828,
        2841, 1725, 3795, 1251, 7607
    ], inplace=True
)
duplicates = pd.concat(x for _, x in train_df.groupby(["text"]) if len(x) > 1)
with pd.option_context("display.max_rows", None, "max_colwidth", 240):
    display(duplicates[["id", "target", "text"]])

We do not know how the dataset creators assigned the target labels, so it is hard to decide which label is correct when duplicated text carries different target classes. We therefore drop these conflicting duplicates. This reduces the size of the data, but it avoids introducing unintended bias.

In [None]:
# Drop duplicates with conflicting target classes
train_df.drop(
    [
        4290, 4299, 4312, 4221, 4239, 4244, 2830,
        2831, 2832, 2833, 4597, 4605, 4618, 4232,
        4235, 3240, 3243, 3248, 3251, 3261, 3266,
        4285, 4305, 4313, 1214, 1365, 6614, 6616,
        1197, 1331, 4379, 4381, 4284, 4286, 4292,
        4304, 4309, 4318, 610, 624, 630, 634, 3985,
        4013, 4019, 1221, 1349, 6091, 6094,
        6103, 6123, 5620, 5641
    ], inplace=True
)
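
The two drop steps above rely on hard-coded row indices. As a rough illustration of the rule they encode, the plain-Python sketch below keeps one copy when all duplicates of a text share a label and drops the whole group when the labels conflict. `resolve_duplicates` and the sample rows are hypothetical, and both "keep the first copy" and "drop every conflicting copy" are assumptions about the intent rather than something the notebook states.

```python
from collections import defaultdict

def resolve_duplicates(rows):
    """Keep one row per text when labels agree; drop the group when labels conflict."""
    groups = defaultdict(list)
    for row in rows:                      # row = (id, text, target)
        groups[row[1]].append(row)

    kept = []
    for group in groups.values():
        labels = {target for _, _, target in group}
        if len(labels) == 1:
            kept.append(group[0])         # same label everywhere: keep the first copy
        # conflicting labels: keep nothing, because the true class is unknown
    return kept

rows = [
    (1, "forest fire near town", 1),
    (2, "forest fire near town", 1),   # duplicate with the same label
    (3, "this mixtape is fire", 0),
    (4, "this mixtape is fire", 1),    # duplicate with a conflicting label
    (5, "flood warning issued", 1),
]
print(resolve_duplicates(rows))
```

Deriving the rows to drop from the data, instead of listing indices by hand, keeps the step reproducible if the input file changes.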

In [None]:
print("Train Dataset Shape:", train_df.shape)

In [None]:
train_df

In [None]:
# Target distribution
plt.figure(figsize=(12, 6))

# Pie Chart
plt.subplot(1, 2, 1)
# sort_index() puts class 0 first so the counts line up with the hard-coded labels
target_counts = train_df['target'].value_counts().sort_index()
plt.pie(target_counts, labels=['Not Disaster', 'Disaster'], autopct='%1.1f%%', startangle=90, colors=sns.color_palette('Dark2'))
plt.title('Distribution of Target Variable')
plt.legend(['Not Disaster', 'Disaster'], loc='upper right')  # Add legend

# Bar Chart
plt.subplot(1, 2, 2)
ax = sns.countplot(x='target', data=train_df, palette='Dark2')
plt.title('Distribution of Target Variable')
plt.xlabel('Target')
plt.ylabel('Count')

# Annotate bar chart
for p in ax.patches:
    ax.annotate(f'{int(p.get_height())}',
                (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='bottom')

# Update xtick labels
ax.set_xticklabels(['Not Disaster', 'Disaster'])  # Set labels for x ticks

plt.tight_layout()
plt.show()

The target distribution plots show that the two classes are reasonably balanced (no severe class imbalance).

In [None]:
# Plot the text length distribution by target class
def plot_text_length_distribution(df, text_column='text', target_column='target'):
    # Drop rows with missing values in the text or target columns;
    # copy so the caller's DataFrame is not modified
    df = df.dropna(subset=[text_column, target_column]).copy()

    # Ensure the target column is treated as categorical for hue
    df[target_column] = df[target_column].astype(str)

    # Calculate text lengths
    df['text_length'] = df[text_column].apply(lambda x: len(str(x)))

    # Plot the distribution of text lengths
    plt.figure(figsize=(12, 6))
    sns.histplot(
        data=df,
        x='text_length',
        hue=target_column,
        kde=True,
        bins=30,
        palette='viridis',
        multiple='stack'
    )
    # seaborn adds the hue legend itself; overriding it by hand can mislabel the classes
    plt.title('Distribution of Text Lengths by Target Class', fontsize=16)
    plt.xlabel('Text Length', fontsize=14)
    plt.ylabel('Frequency', fontsize=14)
    plt.tight_layout()
    plt.show()

    # Print summary statistics for text lengths by target
    print("Text Length Statistics by Target Class:")
    print(df.groupby(target_column)['text_length'].describe())
plot_text_length_distribution(train_df, text_column='text', target_column='target')

In [None]:
# Plot the text length distribution
def analyze_text_length(df, text_column='text'):
    # Calculate text lengths
    df['text_length'] = df[text_column].dropna().apply(len)

    # Plot the distribution of text lengths
    plt.figure(figsize=(10, 6))
    sns.histplot(df['text_length'], kde=True, bins=30, color='blue')
    plt.title('Distribution of Text Lengths', fontsize=16)
    plt.xlabel('Text Length', fontsize=14)
    plt.ylabel('Frequency', fontsize=14)
    plt.show()

    # Summary statistics for text lengths
    print(df['text_length'].describe())
# Analyze Text Length Distribution
analyze_text_length(test_df, text_column='text')

In [None]:
# Function to analyze noise in the data
def analyze_noisy_data(df, text_column='text'):
    # Patterns to detect noise
    patterns = {
        "URLs": r'https?://\S+|www\.\S+',
        "HTML Tags": r'<.*?>',
        # Matches any non-ASCII character, so this is a proxy for emojis, not an exact count
        "Emojis": r'[^\x00-\x7F]',
        "Mentions (@)": r'@\w+',
        "Special Characters": r'[^a-zA-Z0-9\s]'
    }

    # Count occurrences of each type of noise
    noise_counts = {}
    for noise_type, pattern in patterns.items():
        noise_counts[noise_type] = df[text_column].dropna().apply(lambda x: len(re.findall(pattern, x))).sum()

    # Display noise statistics
    print("Noise Statistics:")
    for noise_type, count in noise_counts.items():
        print(f"{noise_type}: {count}")

    # Visualize noise counts
    plt.figure(figsize=(10, 6))
    sns.barplot(x=list(noise_counts.keys()), y=list(noise_counts.values()),palette='viridis')
    plt.title('Presence of Noisy Data', fontsize=16)
    plt.ylabel('Count', fontsize=14)
    plt.xlabel('Noise Type', fontsize=14)
    plt.xticks(rotation=45)
    plt.show()

In [None]:
# Analyze Noisy Data
analyze_noisy_data(train_df, text_column='text')

In [None]:
# Analyze Noisy Data
analyze_noisy_data(test_df, text_column='text')

In [None]:
# Plot the most common words
def analyze_common_words(df, text_column='text', n=10, custom_stopwords=None):
    # Combine all text into one string
    all_text = " ".join(df[text_column].dropna().tolist())

    # Tokenize words
    words = all_text.split()

    # Define stopwords: use the default STOPWORDS and add any custom stopwords if provided
    stopwords = set(STOPWORDS)
    if custom_stopwords:
        stopwords.update(custom_stopwords)

    # Remove stopwords from the word list
    filtered_words = [word for word in words if word.lower() not in stopwords]

    # Count word frequencies
    word_counts = Counter(filtered_words)

    # Get the most common words
    common_words = word_counts.most_common(n)

    # Print the most common words
    print("Most Common Words (Excluding Stopwords):")
    for word, count in common_words:
        print(f"{word}: {count}")

    # Visualize
    words, counts = zip(*common_words)
    plt.figure(figsize=(10, 6))
    sns.barplot(x=list(counts), y=list(words), palette="viridis")
    plt.title("Most Common Words (Excluding Stopwords)", fontsize=16)
    plt.xlabel("Count", fontsize=14)
    plt.ylabel("Words", fontsize=14)
    plt.show()

# Example usage
analyze_common_words(train_df, text_column='text', n=10, custom_stopwords={'-', '...','&amp;',"|","2","??","????"})


In [None]:
# Wordcloud
def generate_wordcloud_from_dataframe(df, text_column, custom_stopwords=None):
    # Combine all text data into one string
    text = ' '.join(df[text_column].dropna().astype(str))

    # Preprocess the text (remove special characters, URLs, etc.)
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # Remove URLs
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    text = text.lower()  # Convert to lowercase

    # Define stopwords
    stopwords = set(STOPWORDS)

    # Add custom stopwords if provided
    if custom_stopwords:
        stopwords.update(custom_stopwords)

    # Generate the word cloud
    wordcloud = WordCloud(
        width=800,
        height=400,
        background_color='white',
        stopwords=stopwords,
        colormap='viridis',
        max_words=200
    ).generate(text)

    # Plot the word cloud
    plt.figure(figsize=(12, 6))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.title("Word Cloud (Without Stopwords)", fontsize=16)
    plt.show()


# Generate the word cloud
generate_wordcloud_from_dataframe(train_df, text_column='text', custom_stopwords={'-', '...','&amp;',"|","2","??","????","amp","im","u","will"})

In [None]:
#Plot N-grams
def analyze_ngrams(df, text_column='text', ngram_range=(2, 3), top_n=10):
    # Vectorize to extract n-grams
    vectorizer = CountVectorizer(ngram_range=ngram_range, stop_words='english')
    ngrams = vectorizer.fit_transform(df[text_column].dropna())

    # Count n-grams
    ngram_counts = Counter(dict(zip(vectorizer.get_feature_names_out(), ngrams.sum(axis=0).A1)))

    # Get the most common n-grams
    common_ngrams = ngram_counts.most_common(top_n)

    # Print the most common n-grams
    print(f"Most Common {ngram_range[0]}-grams:")
    for ngram, count in common_ngrams:
        print(f"{ngram}: {count}")

    # Visualize
    ngrams, counts = zip(*common_ngrams)
    plt.figure(figsize=(10, 6))
    sns.barplot(x=list(counts), y=list(ngrams), palette="magma")
    plt.title(f"Most Common {ngram_range[0]}-grams", fontsize=16)
    plt.xlabel("Count", fontsize=14)
    plt.ylabel(f"{ngram_range[0]}-grams", fontsize=14)
    plt.show()
analyze_ngrams(train_df, text_column='text', ngram_range=(2, 2), top_n=10)

In [None]:
# Bar chart Trigrams
analyze_ngrams(train_df, text_column='text', ngram_range=(3, 3), top_n=10)
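
To make the n-gram counts above more concrete, here is a minimal plain-Python sketch of bigram counting. `top_ngrams` and the sample tweets are hypothetical; the sketch splits on whitespace and keeps stop words, so its counts will differ from `CountVectorizer(stop_words='english')`, which filters stop words before forming n-grams (so a bigram there can join two words that were separated by a removed stop word).

```python
from collections import Counter

def top_ngrams(texts, n=2, top_k=3):
    """Count word n-grams per document, never across document boundaries."""
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()
        # zipping n shifted copies of the token list yields consecutive n-tuples
        counts.update(" ".join(gram) for gram in zip(*(tokens[i:] for i in range(n))))
    return counts.most_common(top_k)

tweets = [
    "forest fire near la ronge",
    "forest fire spreading fast",
    "evacuation ordered near la ronge",
]
print(top_ngrams(tweets, n=2))
```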

In [None]:
def identify_stopwords_abbreviations(df, text_column='text'):
    # Get all text
    all_text = " ".join(df[text_column].dropna().tolist())

    # Tokenize words
    words = all_text.split()

    # Load NLTK stopwords
    stop_words = set(stopwords.words('english'))

    # Identify stopwords
    detected_stopwords = [word for word in words if word.lower() in stop_words]
    print(f"Detected Stopwords (Top 10): {Counter(detected_stopwords).most_common(10)}")

    # Identify potential abbreviations (all uppercase words)
    abbreviations = [word for word in words if word.isupper() and len(word) > 1]
    print(f"Detected Abbreviations: {Counter(abbreviations).most_common(10)}")

    # (Optional) Identify domain-specific terms (requires external domain knowledge)
    print("For domain-specific terms, define a dictionary or list of terms specific to your dataset's context.")


In [None]:
# Identify Stopwords, Abbreviations, and Domain-Specific Terms
identify_stopwords_abbreviations(train_df, text_column='text')

## **Preprocessing**

In [None]:
# Initialize the WordNet lemmatizer and the Snowball stemmer once
lemmer = WordNetLemmatizer()
sb = SnowballStemmer('english')

# Stemming
def stemming(tweet):
    return ' '.join(sb.stem(word) for word in tweet.split())

# Stop-word removal
stopwords_set = set(stopwords.words('english'))  # requires nltk.download('stopwords')
def remove_stopwords(tweet):
    sentence = ' '.join(e.lower() for e in tweet.split() if e.lower() not in stopwords_set)
    return sentence

# Text cleaning
def clean_text(text):
    text = contractions.fix(text)
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # Remove URLs
    text = re.sub(r'<.*?>', '', text)  # Remove HTML tags
    text = re.sub(r'@\w+', '', text)  # Remove mentions (@)
    # Remove the retweet marker; case-insensitive because the text is not lowercased yet
    text = re.sub(r'\brt\b', '', text, flags=re.IGNORECASE)
    text = re.sub(r"[^a-zA-Z\s]", "", text)  # Keep only ASCII letters and whitespace (drops digits, punctuation, emojis)
    text = re.sub(r'\s+', ' ', text)  # Collapse repeated whitespace
    text = text.lower()  # Convert to lowercase
    return text.strip()

# Lemmatization with WordNetLemmatizer
def preprocess_wordnet(text):
    # Tokenize the sentence into words
    words = word_tokenize(text)
    # Lemmatize each token
    lemmatized_text = ' '.join([lemmer.lemmatize(word) for word in words])
    return lemmatized_text

# Full preprocessing pipeline
def preprocessing(train_df, test_df):
    # 1. Clean text
    train_df['clean_text'] = train_df['text'].apply(clean_text)
    test_df['clean_text'] = test_df['text'].apply(clean_text)

    # 2. Lemmatize using WordNet
    train_df['clean_text'] = train_df['clean_text'].apply(preprocess_wordnet)
    test_df['clean_text'] = test_df['clean_text'].apply(preprocess_wordnet)

    # 3. Remove stopwords
    train_df['clean_text'] = train_df['clean_text'].apply(remove_stopwords)
    test_df['clean_text'] = test_df['clean_text'].apply(remove_stopwords)

    # 4. Stemming
    train_df['clean_text'] = train_df['clean_text'].apply(stemming)
    test_df['clean_text'] = test_df['clean_text'].apply(stemming)

    return train_df, test_df
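
To see what the regex part of the cleaning does to a single tweet, the stand-alone sketch below applies the same kinds of substitutions using only the standard library. `clean_demo` and the sample tweet are hypothetical; the sketch leaves out `contractions.fix`, lemmatization, stop-word removal, and stemming, and it matches the retweet marker case-insensitively, which is an assumption about the intent since tweets usually carry it as uppercase `RT`.

```python
import re

def clean_demo(text):
    text = re.sub(r'https?://\S+|www\.\S+', '', text)        # URLs
    text = re.sub(r'<.*?>', '', text)                         # HTML tags
    text = re.sub(r'@\w+', '', text)                          # @mentions
    text = re.sub(r'\brt\b', '', text, flags=re.IGNORECASE)   # retweet marker
    text = re.sub(r'[^a-zA-Z\s]', '', text)                   # keep letters and whitespace only
    text = re.sub(r'\s+', ' ', text)                          # collapse whitespace
    return text.lower().strip()

sample = "RT @user: Forest fire near La Ronge!! 13,000 evacuated http://t.co/abc #wildfire"
print(clean_demo(sample))
# forest fire near la ronge evacuated wildfire
```

Note that the hashtag word survives while the number of evacuees is lost, which is a trade-off of removing all digits.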

In [None]:
# Preprocess the data
train_df, test_df = preprocessing(train_df, test_df)

# TF-IDF Vectorization
tfidf = TfidfVectorizer(
    max_features=10000,  # Keep at most the 10,000 most frequent terms
    min_df=1,  # Keep terms even if they occur in only one document
    ngram_range=(1, 2),  # Use unigrams and bigrams
    sublinear_tf=True  # Replace tf with 1 + log(tf)
)
X_train = tfidf.fit_transform(train_df['clean_text'])
X_test = tfidf.transform(test_df['clean_text'])

# Prepare Target Variable
y_train = train_df['target']

# Split Data (Train-Validation Split)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.1, random_state=SEED)
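
As a rough illustration of what the TF-IDF weights mean, the sketch below reimplements the weighting that scikit-learn documents for `TfidfVectorizer` with `sublinear_tf=True` and its default `smooth_idf=True` and `norm='l2'`: term frequency becomes `1 + ln(tf)`, inverse document frequency is `ln((1 + n) / (1 + df)) + 1`, and each row is scaled to unit length. `tfidf_rows` and the toy documents are hypothetical, and the sketch uses whitespace tokens and unigrams only, so it is not a drop-in replacement for the vectorizer above.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """TF-IDF with sublinear tf, smoothed idf, and L2 normalisation."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {
            term: (1 + math.log(count)) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({term: w / norm for term, w in weights.items()})
    return rows

docs = ["fire fire forest", "fire flood", "flood warning"]
for row in tfidf_rows(docs):
    print({term: round(w, 3) for term, w in row.items()})
```

In the last document, the rarer term `warning` receives a larger weight than `flood`, which appears in two of the three documents.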

In [None]:
pd.set_option('display.max_colwidth', None)

# Inspect the preprocessing results
train_df[['text', 'target', 'clean_text']].head(20)

## **Modelling**

In [None]:
# Logistic Regression Model
logreg_model = LogisticRegression(random_state=SEED)
logreg_model.fit(X_train, y_train)
logreg_y_val_pred = logreg_model.predict(X_val)
print("Logistic Regression Validation Accuracy:", accuracy_score(y_val, logreg_y_val_pred))
print("\nClassification Report for Logistic Regression:\n", classification_report(y_val, logreg_y_val_pred))

# Random Forest Model
rf_model = RandomForestClassifier(random_state=SEED)
rf_model.fit(X_train, y_train)
rf_y_val_pred = rf_model.predict(X_val)
print("Random Forest Validation Accuracy:", accuracy_score(y_val, rf_y_val_pred))
print("\nClassification Report for Random Forest:\n", classification_report(y_val, rf_y_val_pred))

# Naive Bayes Model
nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)
nb_y_val_pred = nb_model.predict(X_val)
print("Naive Bayes Validation Accuracy:", accuracy_score(y_val, nb_y_val_pred))
print("\nClassification Report for Naive Bayes:\n", classification_report(y_val, nb_y_val_pred))

# SVM Model
svm_model = SVC(kernel='linear', random_state=SEED)
svm_model.fit(X_train, y_train)
svm_y_val_pred = svm_model.predict(X_val)
print("SVM Validation Accuracy:", accuracy_score(y_val, svm_y_val_pred))
print("\nClassification Report for SVM:\n", classification_report(y_val, svm_y_val_pred))

# Gradient Boosting Model
gb_model = GradientBoostingClassifier(random_state=SEED)
gb_model.fit(X_train, y_train)
gb_y_val_pred = gb_model.predict(X_val)
print("Gradient Boosting Validation Accuracy:", accuracy_score(y_val, gb_y_val_pred))
print("\nClassification Report for Gradient Boosting:\n", classification_report(y_val, gb_y_val_pred))

# XGBoost Model
xgb_model = XGBClassifier(random_state=SEED, use_label_encoder=False, eval_metric='logloss')
xgb_model.fit(X_train, y_train)
xgb_y_val_pred = xgb_model.predict(X_val)
print("XGBoost Validation Accuracy:", accuracy_score(y_val, xgb_y_val_pred))
print("\nClassification Report for XGBoost:\n", classification_report(y_val, xgb_y_val_pred))

# LightGBM Model
lgbm_model = LGBMClassifier(random_state=SEED)
lgbm_model.fit(X_train, y_train)
lgbm_y_val_pred = lgbm_model.predict(X_val)
print("LightGBM Validation Accuracy:", accuracy_score(y_val, lgbm_y_val_pred))
print("\nClassification Report for LightGBM:\n", classification_report(y_val, lgbm_y_val_pred))

# Voting Classifier (Ensemble of Models)
voting_model = VotingClassifier(estimators=[
    ('logreg', logreg_model),
    ('rf', rf_model),
    #('xgb', xgb_model),
    ('svm', svm_model)
], voting='hard')

voting_model.fit(X_train, y_train)
voting_y_val_pred = voting_model.predict(X_val)
print("Voting Classifier Validation Accuracy:", accuracy_score(y_val, voting_y_val_pred))
print("\nClassification Report for Voting Classifier:\n", classification_report(y_val, voting_y_val_pred))

Interpretation:

1. Logistic Regression:
   - Accuracy: 81.17%
   - Precision and recall for the Disaster class (1) are lower than for Not Disaster (0) but still reasonably good (precision 85%, recall 71%).
   - F1-score: fairly balanced, at 84% for Not Disaster and 77% for Disaster.

2. Random Forest:
   - Accuracy: 79.70%
   - Precision and recall are reasonably good for Not Disaster (0) but lower for Disaster (1) (precision 84%, recall 69%).
   - The model is stronger on the Not Disaster class, with F1-scores of 83% and 75% for classes 0 and 1.

3. Naive Bayes:
   - Accuracy: 80.64%
   - Not Disaster (0) has precision 77% and recall 91%. Disaster (1) has higher precision (87%) but noticeably lower recall (68%).
   - The F1-score is higher for Not Disaster (84%) than for Disaster (76%).

4. SVM:
   - Accuracy: 80.91%
   - Precision and recall are fairly balanced between Not Disaster (precision 80%, recall 88%) and Disaster (precision 83%, recall 73%).
   - The F1-score is higher for Not Disaster (83%) than for Disaster (78%).

5. Gradient Boosting:
   - Accuracy: 74.63%
   - The model has the highest precision for Disaster (1) (89%) but very low recall (51%).
   - The F1-score is low for Disaster (64%) and better for Not Disaster (80%).

6. XGBoost:
   - Accuracy: 79.17%
   - Precision for Disaster is 84% and recall is 67%, which is on par with Random Forest (84%, 69%) and Naive Bayes (87%, 68%) rather than better at detecting the Disaster class.
   - F1-scores are 82% for Not Disaster and 75% for Disaster.

7. LightGBM:
   - Accuracy: 79.71%
   - The model behaves much like Random Forest, with better precision and recall for Not Disaster and slightly lower values for Disaster.
   - The F1-score is higher for Not Disaster (83%) than for Disaster (76%).

8. Voting Classifier:
   - Accuracy: 80.91%
   - Not Disaster has precision 79% and recall 89%; Disaster has precision 84% and recall 71%.
   - F1-scores of 84% for Not Disaster and 77% for Disaster indicate reasonably good performance on both classes.
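
The F1-scores in the list above follow from the quoted precision and recall, since F1 is their harmonic mean. A minimal check, assuming the rounded percentages from the list as inputs (the helper name `f1` is only for illustration):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Disaster-class figures quoted in the list above
print(round(f1(0.85, 0.71), 2))  # Logistic Regression -> 0.77
print(round(f1(0.87, 0.68), 2))  # Naive Bayes -> 0.76
```

Because the harmonic mean is pulled toward the smaller value, a model with high precision but low recall on the Disaster class still ends up with a modest F1-score.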

## **Model Evaluation: F1 Score**

In [None]:
# Weighted F1-score for each model
val_predictions = {
    'Logistic Regression': logreg_y_val_pred,
    'Random Forest': rf_y_val_pred,
    'Naive Bayes': nb_y_val_pred,
    'SVM': svm_y_val_pred,
    'Gradient Boosting': gb_y_val_pred,
    'XGBoost': xgb_y_val_pred,
    'LightGBM': lgbm_y_val_pred,
    'Voting Classifier': voting_y_val_pred
}
f1_scores = {
    name: f1_score(y_val, pred, average='weighted')
    for name, pred in val_predictions.items()
}

# Sort models by F1-score (ascending, so the best model ends up at the top of the chart)
sorted_models, sorted_f1_scores = zip(*sorted(f1_scores.items(), key=lambda x: x[1]))

# Bar Chart
plt.figure(figsize=(12,6))  # Use a wider figure
bars = plt.barh(sorted_models, sorted_f1_scores, color='skyblue')

# Add the value at the end of each bar
for bar in bars:
    plt.text(bar.get_width(), bar.get_y() + bar.get_height() / 2,
             f'{bar.get_width():.4f}', va='center', ha='left', color='black')

plt.xlabel('F1 Score')
plt.title('F1 Score Comparison')

# Tighten the layout so the labels are not cut off
plt.tight_layout()
plt.show()

In [None]:
print("F1 Scores for All Models:")
for model, f1 in zip(sorted_models, sorted_f1_scores):
    print(f"{model}: {f1:.5f}")

Interpretation:  
Logistic Regression, SVM, and the Voting Classifier give reasonably good F1-scores. For the next step, we decided to use the SVM model.
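
The comparison above uses the weighted-average F1, where each class's F1-score is weighted by its support (the number of true samples of that class). The plain-Python sketch below shows that calculation on made-up labels; `per_class_f1`, `weighted_f1`, and the label lists are hypothetical and are not the notebook's validation data.

```python
def per_class_f1(y_true, y_pred, label):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def weighted_f1(y_true, y_pred):
    total = len(y_true)
    # each class contributes in proportion to its support
    return sum(
        per_class_f1(y_true, y_pred, c) * y_true.count(c) / total
        for c in sorted(set(y_true))
    )

y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]
print(round(weighted_f1(y_true, y_pred), 4))
```

With a 60/40 class split, the majority class dominates the average, so a weak Disaster-class F1 is partly hidden in this single number.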

## **Hyperparameter Tuning**

### SVM

In [None]:
# Parameter grid
param_grid_svm = {
    'C': [0.1, 1, 10],  # Regularization parameter (smaller C means stronger regularization)
    'kernel': ['linear', 'rbf'],
    'gamma': ['scale', 'auto']  # Kernel coefficient; ignored by the linear kernel
}

# Initialize GridSearchCV (default scoring for a classifier is accuracy)
grid_svm = GridSearchCV(svm_model, param_grid_svm, cv=5, n_jobs=-1)

# Fit the grid search
grid_svm.fit(X_train, y_train)

# Show the results
print("Best parameters found: ", grid_svm.best_params_)
print("Best cross-validation score: ", grid_svm.best_score_)

Hyperparameter tuning selected the following best parameters:  
{'C': 1, 'gamma': 'scale', 'kernel': 'rbf'}
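
To give a sense of the search cost, the sketch below enumerates the same grid with `itertools.product`, which mirrors how an exhaustive grid search expands a parameter dictionary. The variable names are only for illustration. It also counts how many settings are effectively distinct, given that `gamma` has no effect on a linear kernel.

```python
from itertools import product

param_grid = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf'],
    'gamma': ['scale', 'auto'],
}
cv_folds = 5

# Every combination of values is one candidate
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
print(len(candidates), "candidates,", len(candidates) * cv_folds, "cross-validation fits")

# gamma is ignored by the linear kernel, so those candidates differ only in C
distinct = {
    (c['C'], c['kernel'], c['gamma'] if c['kernel'] != 'linear' else None)
    for c in candidates
}
print(len(distinct), "effectively distinct settings")
```

Splitting the grid into one dictionary per kernel would avoid the redundant linear-kernel fits.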

In [None]:
best_params = grid_svm.best_params_
svm_best_model = SVC(**best_params, random_state=SEED)

# Train the model
svm_best_model.fit(X_train, y_train)

# Predict on the validation data
svm_y_val_pred = svm_best_model.predict(X_val)

# Evaluate the model (accuracy_score and classification_report are imported at the top)
# Accuracy
print("SVM Validation Accuracy:", accuracy_score(y_val, svm_y_val_pred))

# Classification Report
print("\nClassification Report for SVM:\n", classification_report(y_val, svm_y_val_pred))

### Voting model

In [None]:
# Define parameter grids for each individual model
logreg_param_grid = {
    'C': [0.1, 1, 10],  # Regularization strength for Logistic Regression
    'solver': ['liblinear', 'saga']
}

rf_param_grid = {
    'n_estimators': [50, 100, 200],  # Number of trees in Random Forest
    'max_depth': [10, 20, None]       # Maximum depth of the trees
}

svm_param_grid = {
    'C': [0.1, 1, 10],  # Regularization parameter for SVM
    'kernel': ['linear', 'rbf']
}

# Perform grid search for each model
logreg_grid_search = GridSearchCV(LogisticRegression(random_state=SEED), logreg_param_grid, cv=5)
rf_grid_search = GridSearchCV(RandomForestClassifier(random_state=SEED), rf_param_grid, cv=5)
svm_grid_search = GridSearchCV(SVC(random_state=SEED), svm_param_grid, cv=5)

# Fit the grid search models
logreg_grid_search.fit(X_train, y_train)
rf_grid_search.fit(X_train, y_train)
svm_grid_search.fit(X_train, y_train)

# Print best parameters for each model
print("Best parameters for Logistic Regression:", logreg_grid_search.best_params_)
print("Best parameters for Random Forest:", rf_grid_search.best_params_)
print("Best parameters for SVM:", svm_grid_search.best_params_)

# Use the best models from grid search to create the VotingClassifier
voting_best_model = VotingClassifier(estimators=[
    ('logreg', logreg_grid_search.best_estimator_),
    ('rf', rf_grid_search.best_estimator_),
    ('svm', svm_grid_search.best_estimator_)
], voting='hard')

# Fit and evaluate the Voting Classifier
voting_best_model.fit(X_train, y_train)
voting_y_val_pred = voting_best_model.predict(X_val)
print("Voting Classifier Validation Accuracy:", accuracy_score(y_val, voting_y_val_pred))
print("\nClassification Report for Voting Classifier:\n", classification_report(y_val, voting_y_val_pred))
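
With `voting='hard'`, the ensemble predicts the label that most of its member models predict for each sample. A minimal sketch of that majority vote, using made-up prediction lists rather than the notebook's outputs (`hard_vote` is a hypothetical helper):

```python
from collections import Counter

def hard_vote(*model_predictions):
    """Majority vote per sample across the models' predicted labels."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*model_predictions)]

logreg_pred = [1, 0, 1, 0]
rf_pred     = [1, 0, 0, 0]
svm_pred    = [0, 1, 1, 0]
print(hard_vote(logreg_pred, rf_pred, svm_pred))  # [1, 0, 1, 0]
```

Using three models on a binary task means a tie cannot occur, which is one reason an odd number of estimators is convenient for hard voting.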

### Confusion Matrix

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# Confusion matrix for SVM model
svm_conf_matrix = confusion_matrix(y_val, svm_y_val_pred)
print("Confusion Matrix for SVM:\n", svm_conf_matrix)

# Display confusion matrix for SVM
svm_disp = ConfusionMatrixDisplay(confusion_matrix=svm_conf_matrix, display_labels=svm_best_model.classes_)
svm_disp.plot()
plt.title("Confusion Matrix for SVM")
plt.show()

# Confusion matrix for Voting Classifier
voting_conf_matrix = confusion_matrix(y_val, voting_y_val_pred)
print("Confusion Matrix for Voting Classifier:\n", voting_conf_matrix)

# Display confusion matrix for Voting Classifier
voting_disp = ConfusionMatrixDisplay(confusion_matrix=voting_conf_matrix, display_labels=voting_best_model.classes_)
voting_disp.plot()
plt.title("Confusion Matrix for Voting Classifier")
plt.show()
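
For reading the matrices above: scikit-learn's `confusion_matrix` puts true labels on the rows and predicted labels on the columns, so a binary matrix is laid out as `[[TN, FP], [FN, TP]]`. The sketch below derives accuracy, precision, and recall for the positive class from that layout. `summarize` is a hypothetical helper and the counts are illustrative only, not the notebook's results.

```python
def summarize(conf_matrix):
    """Unpack a binary confusion matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = conf_matrix
    return {
        'accuracy': (tp + tn) / (tp + tn + fp + fn),
        'precision': tp / (tp + fp),
        'recall': tp / (tp + fn),
    }

# Illustrative counts only
print(summarize([[50, 10], [20, 40]]))
```

For this task the bottom-left cell (false negatives) deserves attention, because it counts real disaster tweets that the model missed.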

## **Submission**

In [None]:
# Predict on Test Data
best_pred = svm_best_model.predict(X_test)
# best_pred = voting_best_model.predict(X_test)

# Save Predictions
submission = pd.DataFrame({
    'id': test_df['id'],  # ids of the rows that were actually predicted
    'target': best_pred
})
submission.to_csv('submission_final.csv', index=False)
print("Submission saved successfully.")

# **Pre-Trained Model**

In [None]:
from transformers import RobertaTokenizer, RobertaForSequenceClassification, Trainer, TrainingArguments
import torch
from sklearn.model_selection import train_test_split
import pandas as pd
import os

# Disable Weights and Biases logging
os.environ["WANDB_DISABLED"] = "true"

# Load Dataset
# train_df must already contain the 'text' and 'target' columns
train_texts = train_df['text'].tolist()
train_labels = train_df['target'].tolist()

# Split Data (Train-Test Split)
train_texts, val_texts, train_labels, val_labels = train_test_split(
    train_texts, train_labels, test_size=0.2, random_state=42
)

# Load Pre-trained Tokenizer
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')

# Tokenize Data and attach labels
# padding='max_length' pads every tweet to 128 tokens; truncation=True cuts longer ones
def prepare_encodings(texts, labels):
    encodings = tokenizer(texts, padding='max_length', truncation=True, max_length=128)
    encodings['labels'] = labels
    return encodings

train_encodings = prepare_encodings(train_texts, train_labels)
val_encodings = prepare_encodings(val_texts, val_labels)

# Convert to Torch Dataset
class CustomDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __len__(self):
        return len(self.encodings['labels'])

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        return item

train_dataset = CustomDataset(train_encodings)
val_dataset = CustomDataset(val_encodings)

# Load Pre-trained Model
model = RobertaForSequenceClassification.from_pretrained(
    'roberta-base', num_labels=2
)

# Define Training Arguments
training_args = TrainingArguments(
    output_dir='./results',           # Directory to save results
    evaluation_strategy='epoch',     # Evaluate every epoch (named eval_strategy in newer transformers releases)
    save_strategy='epoch',           # Save every epoch
    learning_rate=2e-5,              # Learning rate
    per_device_train_batch_size=16,  # Batch size
    per_device_eval_batch_size=16,   # Evaluation batch size
    num_train_epochs=3,              # Number of epochs
    weight_decay=0.01,               # Weight decay
    logging_dir='./logs',            # Directory for logs
    logging_steps=10,                # Log every 10 steps
    load_best_model_at_end=True,     # Load the best model
)

# Define Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset
)

# Train the Model
trainer.train()

# Save the Model
trainer.save_model('./roberta-finetuned')
tokenizer.save_pretrained('./roberta-finetuned')

# Evaluate the Model
results = trainer.evaluate()
print("Evaluation Results:", results)
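
The `Trainer` above is created without a `compute_metrics` callback, so `trainer.evaluate()` reports loss and timing but no accuracy or F1. A minimal sketch of such a callback follows, written in plain Python so it can be checked without a GPU. It assumes the callback receives something that unpacks into a `(logits, labels)` pair; the toy logits are made up for illustration.

```python
def compute_metrics(eval_pred):
    """Return accuracy and F1 for binary logits.

    `eval_pred` is expected to unpack into (logits, labels), where each row of
    `logits` holds one score per class. Works on lists or NumPy arrays.
    """
    logits, labels = eval_pred
    # Predicted class = index of the largest logit in each row
    preds = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    labels = [int(label) for label in labels]

    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    correct = sum(1 for p, y in zip(preds, labels) if p == y)

    accuracy = correct / len(labels) if labels else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return {"accuracy": accuracy, "f1": f1}


# Toy logits (made up) for four tweets with true labels 1, 0, 1, 0
toy_logits = [[0.2, 1.5], [2.0, -1.0], [0.9, 0.1], [-0.3, 0.4]]
toy_labels = [1, 0, 1, 0]
toy_metrics = compute_metrics((toy_logits, toy_labels))
print(toy_metrics)
```

Passing `compute_metrics=compute_metrics` when constructing the `Trainer` should add these values to each epoch's evaluation output, with the keys prefixed as `eval_accuracy` and `eval_f1`.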

In [None]:
# Predict on Test Data
test_texts = test_df['text'].tolist()
# No return_tensors here: TestDataset builds the tensors item by item
test_encodings = tokenizer(test_texts, padding='max_length', truncation=True, max_length=128)

# Convert test data to Dataset (without labels)
class TestDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __len__(self):
        return len(self.encodings['input_ids'])

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        return item

test_dataset = TestDataset(test_encodings)

# Generate predictions
predictions = trainer.predict(test_dataset)
predicted_labels = torch.argmax(torch.tensor(predictions.predictions), axis=1).tolist()

# Save predictions
submission = pd.DataFrame({
    'id': test_df['id'],
    'target': predicted_labels
})
submission.to_csv('submission_roberta.csv', index=False)
print("Submission saved successfully.")

**Pretrained Modelling: Fine-Tuning RoBERTa**

Objective: Use RoBERTa, a pretrained transformer-based language model, to improve accuracy when classifying whether a tweet describes a real disaster.

**Process Steps**
1. Tokenization:

  The text is converted to tokens with RoBERTa's own tokenizer.
  The maximum sequence length is set to 128 tokens, which preserves context while keeping computation efficient.

2. Fine-Tuning:

  The roberta-base model is fine-tuned on the dataset with two labels (binary classification). Main parameters:
  - Learning rate: 2e-5
  - Batch size: 16
  - Number of epochs: 3
  - Weight decay: 0.01

  The model is evaluated on the validation set after every epoch.

3. Training Results:

  Training loss is fairly stable during the first epoch but shows an increase by epoch 3.
  Validation loss also increases, which may indicate overfitting toward the end of training.

4. Model Evaluation:

  The model is evaluated with the default metrics of the Hugging Face Trainer:
  - Validation loss: 0.396559 (epoch 1), rising to 0.476117 (epoch 3).
  - Evaluation runtime: 9.6 seconds, averaging 158 samples per second.

  The results suggest that the model classifies the data reasonably well but that overfitting needs to be controlled. The Trainer as configured here reports only loss, so accuracy and F1 are not measured directly.
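
The rising validation loss reported above can be turned into a simple checkpoint-selection rule: keep the epoch with the lowest validation loss. The sketch below uses only the two losses stated in the text (epoch 2 is not reported, so it is left out), and the helper name `best_epoch` is hypothetical. With `load_best_model_at_end=True` and no `metric_for_best_model` set, the Trainer is expected to apply the same rule using the evaluation loss.

```python
def best_epoch(val_losses):
    """Return (epoch, loss) for the epoch with the lowest validation loss.

    `val_losses` maps epoch number -> validation loss.
    """
    if not val_losses:
        raise ValueError("val_losses must not be empty")
    epoch = min(val_losses, key=val_losses.get)
    return epoch, val_losses[epoch]


# Validation losses reported in the evaluation summary; epoch 2 is not given
reported = {1: 0.396559, 3: 0.476117}
epoch, loss = best_epoch(reported)
print(f"Best epoch: {epoch} (validation loss {loss:.6f})")

# Relative increase in validation loss from epoch 1 to epoch 3
increase = (reported[3] - reported[1]) / reported[1]
print(f"Validation loss rose by {increase:.1%}")
```

For this run the rule selects epoch 1. The roughly 20% rise in validation loss by epoch 3 suggests that training for fewer epochs may be sufficient.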