# Disaster or Not?

Twitter has become an important communication channel during emergencies. Because smartphones are everywhere, people can report an emergency in real time as they observe it. As a result, a growing number of organizations, including disaster relief agencies and news outlets, are interested in monitoring Twitter systematically.
However, it is often hard to tell whether a given tweet actually reports a disaster. This notebook trains several classifiers to label each tweet as a disaster (1) or not (0).

In [1]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
# Load the datasets
train_df = pd.read_csv('train.csv')
train_df

Unnamed: 0,id,keyword,place,tweet,disaster
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1
...,...,...,...,...,...
7608,10869,,,Two giant cranes holding a bridge collapse int...,1
7609,10870,,,@aria_ahrary @TheTawniest The out of control w...,1
7610,10871,,,M1.94 [01:04 UTC]?5km S of Volcano Hawaii. htt...,1
7611,10872,,,Police investigating after an e-bike collided ...,1


In [2]:
test_df = pd.read_csv('test.csv')
test_df

Unnamed: 0,id,keyword,place,tweet
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, s..."
2,3,,,"there is a forest fire at spot pond, geese are..."
3,9,,,Apocalypse lighting. #Spokane #wildfires
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan
...,...,...,...,...
3258,10861,,,EARTHQUAKE SAFETY LOS ANGELES ÛÒ SAFETY FASTE...
3259,10865,,,Storm in RI worse than last hurricane. My city...
3260,10868,,,Green Line derailment in Chicago http://t.co/U...
3261,10874,,,MEG issues Hazardous Weather Outlook (HWO) htt...


In [3]:
# Fill missing keyword values with 'no_keyword' to ensure no missing values in this important feature

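# Assign the result back; inplace=True on a column selection may not update the frame in recent pandas
train_df['keyword'] = train_df['keyword'].fillna('no_keyword')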

In [4]:
test_df['keyword'] = test_df['keyword'].fillna('no_keyword')

In [5]:
# Drop the 'place' column due to a high number of missing values
train_df.drop(columns=['place'], inplace=True)

In [6]:
test_df.drop(columns=['place'], inplace=True)
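
Before dropping a column for missing values, it can help to measure how much is actually missing. The following is a minimal sketch on a toy frame; `toy_df` and its values are illustrative assumptions, not the competition data.

```python
import pandas as pd

# Toy frame standing in for train_df; the values are illustrative only.
toy_df = pd.DataFrame({
    "keyword": ["fire", None, "flood", "quake"],
    "place": [None, None, "Texas", None],
    "tweet": ["a", "b", "c", "d"],
})

# Fraction of missing values per column, highest first.
missing_share = toy_df.isna().mean().sort_values(ascending=False)
print(missing_share)

# Fill the sparse-but-useful column by assignment rather than inplace=True.
toy_df["keyword"] = toy_df["keyword"].fillna("no_keyword")
```

In this toy frame `place` is missing in three of four rows, which is the kind of ratio that would justify dropping it, while `keyword` is cheap to fill.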

In [7]:
# Display the first few rows of the datasets to understand their structure
train_df.head()

Unnamed: 0,id,keyword,tweet,disaster
0,1,no_keyword,Our Deeds are the Reason of this #earthquake M...,1
1,4,no_keyword,Forest fire near La Ronge Sask. Canada,1
2,5,no_keyword,All residents asked to 'shelter in place' are ...,1
3,6,no_keyword,"13,000 people receive #wildfires evacuation or...",1
4,7,no_keyword,Just got sent this photo from Ruby #Alaska as ...,1


In [8]:
test_df.head()

Unnamed: 0,id,keyword,tweet
0,0,no_keyword,Just happened a terrible car crash
1,2,no_keyword,"Heard about #earthquake is different cities, s..."
2,3,no_keyword,"there is a forest fire at spot pond, geese are..."
3,9,no_keyword,Apocalypse lighting. #Spokane #wildfires
4,11,no_keyword,Typhoon Soudelor kills 28 in China and Taiwan


In [9]:
import re
from bs4 import BeautifulSoup
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import nltk

In [10]:
# Download NLTK data required for text preprocessing
# (recent NLTK releases may also need nltk.download('punkt_tab') for word_tokenize)
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Priyanka\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Priyanka\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Priyanka\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [11]:
# Function to preprocess text data
def preprocess_text(text):
    # Ensure input is string
    if not isinstance(text, str):
        return ""
    
    # Convert text to lowercase to standardize the text
    text = text.lower()
    # Remove URLs to avoid noise
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)
    # Remove HTML tags to clean the text
    text = BeautifulSoup(text, "html.parser").get_text()
    # Remove punctuation to simplify the text
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Remove numbers to focus on textual content
    text = re.sub(r'\d+', '', text)
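    # Expand common contractions to their full forms.
    # NOTE: punctuation (including apostrophes) was already stripped above, so these
    # keys no longer match; move this block above the punctuation step to make it effective.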
    contractions = {
        "can't": "cannot", "won't": "will not", "n't": " not", "'re": " are",
        "'s": " is", "'d": " would", "'ll": " will", "'t": " not",
        "'ve": " have", "'m": " am"
    }
    for contraction, full_form in contractions.items():
        text = text.replace(contraction, full_form)
    # Tokenize the text to break it down into individual words
    tokens = word_tokenize(text)
    # Remove stop words that do not contribute much to the meaning
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]
    # Lemmatize tokens to reduce words to their base form
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    return ' '.join(tokens)

# Apply text preprocessing to the 'tweet' column
train_df['clean_tweet'] = train_df['tweet'].apply(preprocess_text)
test_df['clean_tweet'] = test_df['tweet'].apply(preprocess_text)
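
The order of the cleaning steps matters: contraction keys contain apostrophes, so they can only match before punctuation is stripped. A small standard-library sketch of the two orderings follows; the helper names and the shortened `CONTRACTIONS` table are assumptions made for illustration.

```python
import string

# Shortened contraction table, for illustration only.
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "n't": " not", "'re": " are"}

def expand_contractions(text):
    # Replace each contraction key with its expanded form.
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    return text

def strip_punctuation(text):
    # Delete every ASCII punctuation character, apostrophes included.
    return text.translate(str.maketrans("", "", string.punctuation))

sample = "we can't evacuate, they're stuck"

# Expanding first lets the apostrophe-based keys match.
expanded_first = strip_punctuation(expand_contractions(sample))
# Stripping first removes the apostrophes, so nothing is expanded afterwards.
stripped_first = expand_contractions(strip_punctuation(sample))

print(expanded_first)
print(stripped_first)
```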


In [12]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

In [13]:
# Remove rows with empty 'clean_tweet'
train_df = train_df.loc[train_df.clean_tweet != "", :]

In [14]:
# Hold out 20% of the labeled training data as a validation set
from sklearn.model_selection import train_test_split
email_train, email_test = train_test_split(train_df, test_size=0.2, random_state=42)
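
One optional variation is a stratified split, which keeps the class ratio the same in both parts. The sketch below uses made-up labels and sample names; it illustrates the `stratify` argument of `train_test_split` and is not the split used for the results in this notebook.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Made-up data: six non-disaster (0) and four disaster (1) samples.
labels = [0] * 6 + [1] * 4
samples = [f"tweet {i}" for i in range(10)]

# stratify=labels asks for the same 60/40 class ratio in both halves.
s_train, s_val, y_train, y_val = train_test_split(
    samples, labels, test_size=0.5, random_state=42, stratify=labels
)

print(Counter(y_train), Counter(y_val))
```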

In [15]:
# Function to split text into words on single spaces
def split_into_words(text):
    return text.split(" ")

In [16]:
# Create a bag-of-words model (the vocabulary is learned from all labeled tweets, including the held-out split)
emails_bow = CountVectorizer(analyzer=split_into_words).fit(train_df.clean_tweet)
all_emails_matrix = emails_bow.transform(train_df.clean_tweet)

In [17]:
# Bag-of-words counts for the training split
train_emails_matrix = emails_bow.transform(email_train.clean_tweet)

In [18]:
# Bag-of-words counts for the validation split
test_emails_matrix = emails_bow.transform(email_test.clean_tweet)

In [19]:
# Learn term weighting and normalization (IDF is fitted on all labeled tweets, including the validation split)
tfidf_transformer = TfidfTransformer().fit(all_emails_matrix)
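
For intuition about what the transformer learns, the sketch below recomputes TF-IDF by hand on a toy count matrix, assuming scikit-learn's default settings (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row). The counts are invented for illustration.

```python
import math

# Toy term counts: rows are documents, columns are terms (illustrative values).
counts = [
    [1, 1, 0],  # doc 0
    [2, 0, 1],  # doc 1
    [1, 0, 0],  # doc 2
]
n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: number of documents that contain each term.
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed inverse document frequency.
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in doc_freq]

def tfidf_row(row):
    # Weight raw counts by IDF, then scale the row to unit Euclidean length.
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    return [v / norm if norm else 0.0 for v in weighted]

tfidf = [tfidf_row(row) for row in counts]
for row in tfidf:
    print([round(v, 3) for v in row])
```

A term that appears in every document (the first column) gets the minimum IDF of 1, so rarer terms end up with relatively more weight within a row.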

In [20]:
# Compute TF-IDF features for the training tweets
train_tfidf = tfidf_transformer.transform(train_emails_matrix)

In [21]:
# Compute TF-IDF features for the validation tweets
test_tfidf = tfidf_transformer.transform(test_emails_matrix)
print(test_tfidf.shape)

(1523, 15768)
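
An alternative arrangement, sketched here under assumptions, wraps the vectorizer, the TF-IDF step, and the classifier in a scikit-learn `Pipeline` so that vocabulary and IDF weights are learned from the training split only. The toy texts and labels are invented, and this is not how the numbers in this notebook were produced.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Invented stand-ins for the clean_tweet and disaster columns.
texts = [
    "forest fire spreading fast", "flood water rising downtown",
    "earthquake shakes city centre", "storm damage across county",
    "love this sunny day", "new album is fire",
    "traffic was terrible today", "my exam was a disaster",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

X_train, X_val, y_train, y_val = train_test_split(
    texts, labels, test_size=0.25, random_state=42
)

pipeline = Pipeline([
    ("bow", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])

# fit() only ever sees the training split, so nothing leaks from validation.
pipeline.fit(X_train, y_train)
val_pred = pipeline.predict(X_val)

learned_vocab = set(pipeline.named_steps["bow"].vocabulary_)
print(len(learned_vocab), list(val_pred))
```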


In [22]:
from sklearn.metrics import accuracy_score, classification_report

In [23]:
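def evaluate_model(model, X_train, y_train, X_test, y_test):
    """Fit `model` on the training data and score it on the held-out data.

    Returns the accuracy and the text classification report.
    """
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred)
    return accuracy, report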

In [24]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier

# Initialize an empty dictionary to store the results
results = {}

In [25]:
# Naive Bayes
nb_model = MultinomialNB()
nb_accuracy, nb_report = evaluate_model(nb_model, train_tfidf, email_train['disaster'], test_tfidf, email_test['disaster'])
results['Naive Bayes'] = (nb_accuracy, nb_report)
print('Naive Bayes Accuracy:', nb_accuracy)
print('Naive Bayes Classification Report:\n', nb_report)

Naive Bayes Accuracy: 0.8003939592908733
Naive Bayes Classification Report:
               precision    recall  f1-score   support

           0       0.79      0.90      0.84       874
           1       0.83      0.67      0.74       649

    accuracy                           0.80      1523
   macro avg       0.81      0.78      0.79      1523
weighted avg       0.80      0.80      0.80      1523
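
As a reading aid for these reports, the sketch below computes accuracy, precision, recall, and F1 for the positive class by hand. The two label lists are made up for illustration and are unrelated to the scores above.

```python
# Made-up ground truth and predictions (1 = disaster, 0 = not a disaster).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # disasters correctly flagged
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # disasters missed
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # correctly ignored

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)   # of the tweets flagged, how many were real disasters
recall = tp / (tp + fn)      # of the real disasters, how many were flagged
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

For a disaster detector, recall on class 1 is the share of real disasters that get caught, which is why it is worth reading alongside accuracy.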



In [26]:
# Logistic Regression
lr_model = LogisticRegression()
lr_accuracy, lr_report = evaluate_model(lr_model, train_tfidf, email_train['disaster'], test_tfidf, email_test['disaster'])
results['Logistic Regression'] = (lr_accuracy, lr_report)
print('Logistic Regression Accuracy:', lr_accuracy)
print('Logistic Regression Classification Report:\n', lr_report)

Logistic Regression Accuracy: 0.7997373604727511
Logistic Regression Classification Report:
               precision    recall  f1-score   support

           0       0.78      0.91      0.84       874
           1       0.85      0.65      0.73       649

    accuracy                           0.80      1523
   macro avg       0.81      0.78      0.79      1523
weighted avg       0.81      0.80      0.79      1523



In [27]:
# Random Forest
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_accuracy, rf_report = evaluate_model(rf_model, train_tfidf, email_train['disaster'], test_tfidf, email_test['disaster'])
results['Random Forest'] = (rf_accuracy, rf_report)
print('Random Forest Accuracy:', rf_accuracy)
print('Random Forest Classification Report:\n', rf_report)

Random Forest Accuracy: 0.778069599474721
Random Forest Classification Report:
               precision    recall  f1-score   support

           0       0.76      0.89      0.82       874
           1       0.81      0.63      0.71       649

    accuracy                           0.78      1523
   macro avg       0.79      0.76      0.76      1523
weighted avg       0.78      0.78      0.77      1523



In [28]:
# Support Vector Machine (SVM)
svm_model = SVC(kernel='linear')
svm_accuracy, svm_report = evaluate_model(svm_model, train_tfidf, email_train['disaster'], test_tfidf, email_test['disaster'])
results['SVM'] = (svm_accuracy, svm_report)
print('SVM Accuracy:', svm_accuracy)
print('SVM Classification Report:\n', svm_report)

SVM Accuracy: 0.7957977675640184
SVM Classification Report:
               precision    recall  f1-score   support

           0       0.79      0.87      0.83       874
           1       0.80      0.69      0.74       649

    accuracy                           0.80      1523
   macro avg       0.80      0.78      0.79      1523
weighted avg       0.80      0.80      0.79      1523



In [29]:
# Decision Tree
dt_model = DecisionTreeClassifier(random_state=42)
dt_accuracy, dt_report = evaluate_model(dt_model, train_tfidf, email_train['disaster'], test_tfidf, email_test['disaster'])
results['Decision Tree'] = (dt_accuracy, dt_report)
print('Decision Tree Accuracy:', dt_accuracy)
print('Decision Tree Classification Report:\n', dt_report)

Decision Tree Accuracy: 0.7340774786605384
Decision Tree Classification Report:
               precision    recall  f1-score   support

           0       0.77      0.77      0.77       874
           1       0.69      0.68      0.69       649

    accuracy                           0.73      1523
   macro avg       0.73      0.73      0.73      1523
weighted avg       0.73      0.73      0.73      1523



In [30]:
# Gradient Boosting
gb_model = GradientBoostingClassifier(random_state=42)
gb_accuracy, gb_report = evaluate_model(gb_model, train_tfidf, email_train['disaster'], test_tfidf, email_test['disaster'])
results['Gradient Boosting'] = (gb_accuracy, gb_report)
print('Gradient Boosting Accuracy:', gb_accuracy)
print('Gradient Boosting Classification Report:\n', gb_report)

Gradient Boosting Accuracy: 0.747209455022981
Gradient Boosting Classification Report:
               precision    recall  f1-score   support

           0       0.72      0.93      0.81       874
           1       0.84      0.50      0.63       649

    accuracy                           0.75      1523
   macro avg       0.78      0.72      0.72      1523
weighted avg       0.77      0.75      0.73      1523



In [31]:
# K-Nearest Neighbors (KNN)
knn_model = KNeighborsClassifier()
knn_accuracy, knn_report = evaluate_model(knn_model, train_tfidf, email_train['disaster'], test_tfidf, email_test['disaster'])
results['KNN'] = (knn_accuracy, knn_report)
print('KNN Accuracy:', knn_accuracy)
print('KNN Classification Report:\n', knn_report)

KNN Accuracy: 0.7741300065659882
KNN Classification Report:
               precision    recall  f1-score   support

           0       0.78      0.85      0.81       874
           1       0.77      0.67      0.72       649

    accuracy                           0.77      1523
   macro avg       0.77      0.76      0.76      1523
weighted avg       0.77      0.77      0.77      1523



In [32]:
# Find the best model based on accuracy
best_model_name = max(results, key=lambda k: results[k][0])
best_accuracy, best_report = results[best_model_name]
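
Accuracy separates the top models only narrowly. As a rough check, this sketch converts the accuracies printed above back into counts of correctly classified validation tweets (1523 in total, per the reports). The names `reported_accuracy` and `correct` are introduced here for illustration.

```python
# Accuracies copied from the printed outputs above.
reported_accuracy = {
    "Naive Bayes": 0.8003939592908733,
    "Logistic Regression": 0.7997373604727511,
    "Random Forest": 0.778069599474721,
    "SVM": 0.7957977675640184,
    "Decision Tree": 0.7340774786605384,
    "Gradient Boosting": 0.747209455022981,
    "KNN": 0.7741300065659882,
}
n_val = 1523  # size of the validation split, from the support column

# Number of correctly classified tweets implied by each accuracy.
correct = {name: round(acc * n_val) for name, acc in reported_accuracy.items()}

ranking = sorted(correct.items(), key=lambda item: item[1], reverse=True)
for name, n_correct in ranking:
    print(f"{name:20s} {n_correct} / {n_val}")

gap = correct["Naive Bayes"] - correct["Logistic Regression"]
print("Gap between the top two models:", gap, "tweet(s)")
```

Under these numbers the top two models differ by a single tweet, so the choice of "best" is sensitive to the particular split. Comparing class-1 recall from the reports (0.67 for Naive Bayes, 0.65 for Logistic Regression, 0.69 for SVM) may be a more informative tie-breaker for this task.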


In [33]:
print(f'Best Model: {best_model_name}')

Best Model: Naive Bayes


In [34]:
print(f'Best Model Accuracy: {best_accuracy}')


Best Model Accuracy: 0.8003939592908733


In [35]:
print(f'Best Model Classification Report:\n {best_report}')

Best Model Classification Report:
               precision    recall  f1-score   support

           0       0.79      0.90      0.84       874
           1       0.83      0.67      0.74       649

    accuracy                           0.80      1523
   macro avg       0.81      0.78      0.79      1523
weighted avg       0.80      0.80      0.80      1523



In [36]:
# Retrieve the fitted best model by name
models = {
    "Naive Bayes": nb_model,
    "Logistic Regression": lr_model,
    "Random Forest": rf_model,
    "SVM": svm_model,
    "Decision Tree": dt_model,
    "Gradient Boosting": gb_model,
    "KNN": knn_model,
}
if best_model_name not in models:
    raise ValueError(f"Model '{best_model_name}' not found")
best_model = models[best_model_name]

In [37]:
# Transform the unlabeled test data (this reuses the name test_tfidf, replacing the validation features)
X_test = emails_bow.transform(test_df['clean_tweet'])
test_tfidf = tfidf_transformer.transform(X_test)

In [38]:
# Predict the labels
test_pred = best_model.predict(test_tfidf)

In [39]:
# Add the predictions to the test dataframe
test_df['disaster'] = test_pred

In [40]:
# Save the id/prediction pairs; a relative path keeps the notebook portable across machines
test_df[['id', 'disaster']].to_csv("test_predictions.csv", index=False)
