To build a Support Vector Machine (SVM) model that identifies whether a tweet informs about a disaster, we need to follow these steps:

Data collection: We need to collect a dataset of tweets that inform about disasters and tweets that do not inform about disasters.

Data pre-processing: We need to clean and preprocess the data by removing URLs, mentions, special characters, and stop words, converting the text to lowercase, and stemming each word.

Feature extraction: We need to extract features from the preprocessed data. One way to do this is to use the Bag of Words model, which creates a vocabulary of unique words from the text and then counts the number of occurrences of each word in each tweet.

Train the SVM model: We will train the SVM model on the features extracted from the preprocessed data.

Evaluate the model: We will evaluate the performance of the SVM model using metrics such as accuracy, precision, recall, and F1 score.
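
As a rough illustration of the Bag of Words step described above, the sketch below builds the vocabulary and the count vectors by hand with the standard library only. The three tweets are made-up, already-preprocessed examples, and `bag_of_words` is a hypothetical helper. The notebook itself relies on scikit-learn's `CountVectorizer`, which applies its own tokenization rules, so this is only meant to show the idea.

```python
from collections import Counter

# Hypothetical, already-preprocessed tweets (not taken from the dataset)
tweets = [
    "fire near forest",
    "love sunny day",
    "forest fire evacu order",
]

# Vocabulary: the sorted set of unique words across all tweets
vocab = sorted({word for tweet in tweets for word in tweet.split()})

def bag_of_words(tweet):
    """Return one count per vocabulary word for a single tweet."""
    counts = Counter(tweet.split())
    return [counts[word] for word in vocab]

# One row per tweet, one column per vocabulary word
matrix = [bag_of_words(tweet) for tweet in tweets]

print(vocab)
for row in matrix:
    print(row)
```

Each row has the same length as the vocabulary, which is what lets the SVM treat every tweet as a fixed-size numeric vector.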




In [None]:
# Step 1: Data collection
import pandas as pd
from google.colab import drive
drive.mount('/content/drive')

# Load the dataset of tweets
file_path = "/content/drive/MyDrive/CS298/tweets.csv"
df = pd.read_csv(file_path)

Mounted at /content/drive


In [None]:
# Step 2: Data pre-processing
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Download the stop word list and the Punkt tokenizer data from NLTK
nltk.download('stopwords')
nltk.download('punkt')

stemmer = PorterStemmer()
# Build the stop word set once; calling stopwords.words() for every token is slow
stop_words = set(stopwords.words('english'))

# Function to clean and preprocess the text
def preprocess(text):
    # Remove URLs, mentions, and special characters
    text = re.sub(r'http\S+', '', text)
    text = re.sub(r'@[^\s]+', '', text)
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    # Convert text to lowercase
    text = text.lower()
    # Tokenize the text
    tokens = nltk.word_tokenize(text)
    # Remove stop words and stem the remaining words
    words = [stemmer.stem(word) for word in tokens if word not in stop_words]
    # Join the words back into a string
    return ' '.join(words)

# Apply the preprocess function to the text column
df['text'] = df['text'].apply(preprocess)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
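
To see what the regular-expression steps do on their own, here is a minimal sketch that applies the same three substitutions and the lowercasing, without the NLTK tokenization, stop word removal, or stemming. The function name `strip_noise` and the sample tweet are hypothetical and only used for illustration.

```python
import re

def strip_noise(text):
    """Apply only the regex cleaning and lowercasing from the cell above."""
    text = re.sub(r'http\S+', '', text)          # drop URLs
    text = re.sub(r'@[^\s]+', '', text)          # drop @mentions
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)   # drop punctuation and symbols
    # Lowercase and collapse the whitespace left behind by the removals
    return ' '.join(text.lower().split())

# Hypothetical tweet, not taken from the dataset
sample = "Forest fire near La Ronge! http://t.co/abc @user #wildfire"
cleaned = strip_noise(sample)
print(cleaned)
```

Note that the hashtag symbol is removed but the hashtag word itself is kept, so `#wildfire` still contributes the token `wildfire` to the features.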


In [None]:
# Step 3: Feature extraction
from sklearn.feature_extraction.text import CountVectorizer

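# Create a bag of words model: one row per tweet, one column per vocabulary word
# Note: the vocabulary is fitted on all tweets, before the train/test split below
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df['text'])
# Labels come from the dataset's 'target' column
y = df['target']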

# Step 4: Train the SVM model
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the SVM model
svm = SVC(kernel='linear', C=1)
svm.fit(X_train, y_train)

# Step 5: Evaluate the model
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Predict the target values for the test set
y_pred = svm.predict(X_test)

# Calculate the evaluation metrics (precision, recall, and F1 use
# scikit-learn's default positive class, label 1)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print('Accuracy:', accuracy)
print('Precision:', precision)
print('Recall:', recall)
print('F1 score:', f1)

Accuracy: 0.7708470124753776
Precision: 0.7551020408163265
Recall: 0.6841294298921418
F1 score: 0.7178658043654003
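
To make the four metrics concrete, the sketch below computes them by hand for a binary problem and then checks that the F1 score printed above is the harmonic mean of the printed precision and recall. The helper `binary_metrics` and the toy labels are hypothetical; the notebook itself uses the scikit-learn functions. The sketch assumes at least one predicted positive and one actual positive, and does not handle division by zero.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical toy labels, not taken from the dataset
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
toy = binary_metrics(y_true, y_pred)
print(toy)

# Consistency check on the values printed by the notebook
reported_precision = 0.7551020408163265
reported_recall = 0.6841294298921418
reported_f1 = (2 * reported_precision * reported_recall
               / (reported_precision + reported_recall))
print(reported_f1)
```

In the reported results, recall is lower than precision, so the model misses more disaster tweets than it raises false alarms, and the F1 score sits between the two values.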
