# NLP Disaster Tweets Mini-Project

## Introduction

In this project, I use the Natural Language Toolkit (NLTK) to preprocess the text of tweets and then train an XGBoost model on the result. My approach relies on WordNet, a lexical database of English words that ships with NLTK, to reduce the words in each tweet to their base forms (lemmas) so that the keywords in the text can be counted consistently.

## Data Source

The data I used for this project come from the "[Natural Language Processing with Disaster Tweets](https://www.kaggle.com/c/nlp-getting-started/data)" competition hosted on Kaggle. The training set contains 7,613 labeled tweets.

## Setting Up the Model

### Importing Libraries and Setting Parameters

In [21]:
import pandas as pd
import re

import nltk 
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize 
from nltk.stem import WordNetLemmatizer

from sklearn import model_selection, metrics
from sklearn.feature_extraction.text import CountVectorizer

from xgboost import XGBClassifier

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Sukuna\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Sukuna\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Sukuna\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [22]:
file_path = "train.csv"
train_data = pd.read_csv(file_path)
print("Data points count: ", train_data['id'].count())
train_data.head()

Data points count:  7613


Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [23]:
test_data = pd.read_csv('test.csv')
test_data

Unnamed: 0,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, s..."
2,3,,,"there is a forest fire at spot pond, geese are..."
3,9,,,Apocalypse lighting. #Spokane #wildfires
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan
...,...,...,...,...
3258,10861,,,EARTHQUAKE SAFETY LOS ANGELES ÛÒ SAFETY FASTE...
3259,10865,,,Storm in RI worse than last hurricane. My city...
3260,10868,,,Green Line derailment in Chicago http://t.co/U...
3261,10874,,,MEG issues Hazardous Weather Outlook (HWO) htt...


In [24]:
train_data["target"].value_counts()

target
0    4342
1    3271
Name: count, dtype: int64
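
The class counts above can be turned into shares to see how balanced the labels are. This is a minimal sketch that uses only the two counts printed above; the variable names are illustrative and not part of the original notebook.

```python
# Class counts copied from the value_counts() output above.
counts = {0: 4342, 1: 3271}

total = sum(counts.values())  # 7613 rows in train.csv
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f"target={label}: {share:.1%}")
```

Roughly 57% of the tweets are labeled 0 and 43% are labeled 1, so the imbalance is mild, and a model that always predicts 0 would already be right about 57% of the time on the full training set.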

In [25]:
train_data['text']

0       Our Deeds are the Reason of this #earthquake M...
1                  Forest fire near La Ronge Sask. Canada
2       All residents asked to 'shelter in place' are ...
3       13,000 people receive #wildfires evacuation or...
4       Just got sent this photo from Ruby #Alaska as ...
                              ...                        
7608    Two giant cranes holding a bridge collapse int...
7609    @aria_ahrary @TheTawniest The out of control w...
7610    M1.94 [01:04 UTC]?5km S of Volcano Hawaii. htt...
7611    Police investigating after an e-bike collided ...
7612    The Latest: More Homes Razed by Northern Calif...
Name: text, Length: 7613, dtype: object

In [26]:
def process_text(text1):
    """Clean one string: strip URLs and digits, tokenize, drop punctuation and stop words, lemmatize."""

    # remove URL elements from text
    text1 = re.sub(r"http\S+", "", text1)

    # remove numbers from text
    text1 = re.sub(r'\d+', '', text1)

    # tokenize each text
    tokens = word_tokenize(text1)

    # remove special characters, keeping only alphanumeric characters in each token
    tokens = ["".join(ch for ch in word if ch.isalnum()) for word in tokens]

    # lowercase and remove stop words
    stop_words = set(stopwords.words('english'))
    processed_text = [x.lower() for x in tokens if x.lower() not in stop_words]

    # lemmatize all words
    wnl = WordNetLemmatizer()
    lemmatized_text = [wnl.lemmatize(x) for x in processed_text]

    # join the tokens and collapse the extra spaces left by tokens that became empty
    return " ".join(" ".join(lemmatized_text).split())
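
To make the first stages of `process_text` easier to follow, here is a rough stdlib-only sketch of the URL removal, digit removal, and special-character filtering. It is not the notebook's function: `basic_clean` is a hypothetical helper, it splits on whitespace instead of calling `word_tokenize`, it skips stop-word removal and lemmatization (both need NLTK data), and the sample tweet is made up for illustration.

```python
import re

def basic_clean(text):
    """Stdlib-only stand-in for the first stages of process_text (hypothetical helper)."""
    text = re.sub(r"http\S+", "", text)   # drop URLs
    text = re.sub(r"\d+", "", text)       # drop runs of digits
    tokens = text.split()                 # whitespace split, NOT word_tokenize
    # keep only alphanumeric characters inside each token
    tokens = ["".join(ch for ch in tok if ch.isalnum()) for tok in tokens]
    # lowercase and discard tokens that became empty
    return " ".join(tok.lower() for tok in tokens if tok)

# Hypothetical tweet, made up for illustration.
sample = "Forest fire near La Ronge Sask. Canada http://t.co/example #wildfires 2015"
print(basic_clean(sample))
# → forest fire near la ronge sask canada wildfires
```

Because this sketch does not lemmatize, "wildfires" stays plural here, whereas the processed tables in this notebook show the lemmatized form "wildfire".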


In [27]:
# Clean the tweet text
train_data['processed_text'] = train_data['text'].apply(process_text)
# Replace missing keywords with a placeholder, then clean them the same way
train_data['keyword'] = train_data['keyword'].fillna("none")
train_data['processed_keyword'] = train_data['keyword'].apply(process_text)

In [28]:
train_data

Unnamed: 0,id,keyword,location,text,target,processed_text,processed_keyword
0,1,none,,Our Deeds are the Reason of this #earthquake M...,1,deed reason earthquake may allah forgive u,none
1,4,none,,Forest fire near La Ronge Sask. Canada,1,forest fire near la ronge sask canada,none
2,5,none,,All residents asked to 'shelter in place' are ...,1,resident asked shelter place notified officer ...,none
3,6,none,,"13,000 people receive #wildfires evacuation or...",1,people receive wildfire evacuation order calif...,none
4,7,none,,Just got sent this photo from Ruby #Alaska as ...,1,got sent photo ruby alaska smoke wildfire pour...,none
...,...,...,...,...,...,...,...
7608,10869,none,,Two giant cranes holding a bridge collapse int...,1,two giant crane holding bridge collapse nearby...,none
7609,10870,none,,@aria_ahrary @TheTawniest The out of control w...,1,ariaahrary thetawniest control wild fire calif...,none
7610,10871,none,,M1.94 [01:04 UTC]?5km S of Volcano Hawaii. htt...,1,utc km volcano hawaii,none
7611,10872,none,,Police investigating after an e-bike collided ...,1,police investigating ebike collided car little...,none


In [29]:
# Merge contents of 'processed_keyword' and 'processed_text' into one column
train_data['keyword_text'] = train_data['processed_keyword'] + " " + train_data["processed_text"]
train_data

Unnamed: 0,id,keyword,location,text,target,processed_text,processed_keyword,keyword_text
0,1,none,,Our Deeds are the Reason of this #earthquake M...,1,deed reason earthquake may allah forgive u,none,none deed reason earthquake may allah forgive u
1,4,none,,Forest fire near La Ronge Sask. Canada,1,forest fire near la ronge sask canada,none,none forest fire near la ronge sask canada
2,5,none,,All residents asked to 'shelter in place' are ...,1,resident asked shelter place notified officer ...,none,none resident asked shelter place notified off...
3,6,none,,"13,000 people receive #wildfires evacuation or...",1,people receive wildfire evacuation order calif...,none,none people receive wildfire evacuation order ...
4,7,none,,Just got sent this photo from Ruby #Alaska as ...,1,got sent photo ruby alaska smoke wildfire pour...,none,none got sent photo ruby alaska smoke wildfire...
...,...,...,...,...,...,...,...,...
7608,10869,none,,Two giant cranes holding a bridge collapse int...,1,two giant crane holding bridge collapse nearby...,none,none two giant crane holding bridge collapse n...
7609,10870,none,,@aria_ahrary @TheTawniest The out of control w...,1,ariaahrary thetawniest control wild fire calif...,none,none ariaahrary thetawniest control wild fire ...
7610,10871,none,,M1.94 [01:04 UTC]?5km S of Volcano Hawaii. htt...,1,utc km volcano hawaii,none,none utc km volcano hawaii
7611,10872,none,,Police investigating after an e-bike collided ...,1,police investigating ebike collided car little...,none,none police investigating ebike collided car l...


In [30]:
# Apply the same preprocessing to the test set
test_data['processed_text'] = test_data['text'].apply(process_text)
test_data['keyword'] = test_data['keyword'].fillna("none")
test_data['processed_keyword'] = test_data['keyword'].apply(process_text)

In [31]:
test_data

Unnamed: 0,id,keyword,location,text,processed_text,processed_keyword
0,0,none,,Just happened a terrible car crash,happened terrible car crash,none
1,2,none,,"Heard about #earthquake is different cities, s...",heard earthquake different city stay safe ever...,none
2,3,none,,"there is a forest fire at spot pond, geese are...",forest fire spot pond goose fleeing across str...,none
3,9,none,,Apocalypse lighting. #Spokane #wildfires,apocalypse lighting spokane wildfire,none
4,11,none,,Typhoon Soudelor kills 28 in China and Taiwan,typhoon soudelor kill china taiwan,none
...,...,...,...,...,...,...
3258,10861,none,,EARTHQUAKE SAFETY LOS ANGELES ÛÒ SAFETY FASTE...,earthquake safety los angeles ûò safety fasten...,none
3259,10865,none,,Storm in RI worse than last hurricane. My city...,storm ri worse last hurricane city amp others ...,none
3260,10868,none,,Green Line derailment in Chicago http://t.co/U...,green line derailment chicago,none
3261,10874,none,,MEG issues Hazardous Weather Outlook (HWO) htt...,meg issue hazardous weather outlook hwo,none


In [32]:
test_data['keyword_text'] = test_data['processed_keyword'] + " " + test_data["processed_text"]
test_data

Unnamed: 0,id,keyword,location,text,processed_text,processed_keyword,keyword_text
0,0,none,,Just happened a terrible car crash,happened terrible car crash,none,none happened terrible car crash
1,2,none,,"Heard about #earthquake is different cities, s...",heard earthquake different city stay safe ever...,none,none heard earthquake different city stay safe...
2,3,none,,"there is a forest fire at spot pond, geese are...",forest fire spot pond goose fleeing across str...,none,none forest fire spot pond goose fleeing acros...
3,9,none,,Apocalypse lighting. #Spokane #wildfires,apocalypse lighting spokane wildfire,none,none apocalypse lighting spokane wildfire
4,11,none,,Typhoon Soudelor kills 28 in China and Taiwan,typhoon soudelor kill china taiwan,none,none typhoon soudelor kill china taiwan
...,...,...,...,...,...,...,...
3258,10861,none,,EARTHQUAKE SAFETY LOS ANGELES ÛÒ SAFETY FASTE...,earthquake safety los angeles ûò safety fasten...,none,none earthquake safety los angeles ûò safety f...
3259,10865,none,,Storm in RI worse than last hurricane. My city...,storm ri worse last hurricane city amp others ...,none,none storm ri worse last hurricane city amp ot...
3260,10868,none,,Green Line derailment in Chicago http://t.co/U...,green line derailment chicago,none,none green line derailment chicago
3261,10874,none,,MEG issues Hazardous Weather Outlook (HWO) htt...,meg issue hazardous weather outlook hwo,none,none meg issue hazardous weather outlook hwo


In [33]:
feature = "keyword_text"
label = "target"

# Hold out 30% of the labeled training data for evaluation
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    train_data[feature],
    train_data[label],
    test_size=0.3,
    random_state=0,
    shuffle=True,
)
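
The size of the two splits follows from simple arithmetic. The sketch below assumes that a fractional `test_size` is rounded up to a whole number of rows; the variable names are illustrative.

```python
import math

n_rows = 7613       # rows in train.csv, from the count printed earlier
test_size = 0.3     # fraction passed to train_test_split

# Assumption: a fractional test_size is rounded up to a whole number of rows.
n_holdout = math.ceil(test_size * n_rows)
n_fit = n_rows - n_holdout

print(n_holdout, n_fit)
```

Under that assumption the held-out split has 2284 rows and the fitting split has 5329, which agrees with the total support of 2284 in the classification report later in this notebook.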

In [34]:
test_text = test_data["keyword_text"]

In [35]:
# Vectorize text: fit the vocabulary on the training split only, then reuse it for the other sets
vectorizer = CountVectorizer()
X_train_XGB = vectorizer.fit_transform(X_train)
X_test_XGB = vectorizer.transform(X_test)
test_text_XGB = vectorizer.transform(test_text)
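
The fit/transform pattern above can be illustrated with a plain-Python bag-of-words sketch. This is not how `CountVectorizer` is implemented: `fit_vocabulary` and `transform` are hypothetical helpers, tokens are split on whitespace, and the result is a dense list rather than a sparse matrix. The two fitting strings are taken from the `keyword_text` column shown earlier; the unseen document is made up.

```python
from collections import Counter

def fit_vocabulary(docs):
    """Map each distinct token in the fitting documents to a column index."""
    tokens = sorted({tok for doc in docs for tok in doc.split()})
    return {tok: idx for idx, tok in enumerate(tokens)}

def transform(docs, vocab):
    """Turn each document into a count vector; tokens outside vocab are ignored."""
    rows = []
    for doc in docs:
        counts = Counter(tok for tok in doc.split() if tok in vocab)
        rows.append([counts.get(tok, 0) for tok in vocab])
    return rows

# Two processed strings from the training table above.
fit_docs = ["none forest fire near la ronge sask canada",
            "none utc km volcano hawaii"]
vocab = fit_vocabulary(fit_docs)

# Made-up unseen document: "tsunami" is not in the vocabulary and is dropped.
print(transform(["none forest fire fire tsunami"], vocab))
# → [[0, 2, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0]]
```

The point of fitting on the training split only is visible here: words that appear only in the held-out or test text never get a column, so the model is evaluated on exactly the features it was trained with. One difference worth knowing is that `CountVectorizer`'s default token pattern keeps only tokens of two or more alphanumeric characters, so a one-letter token such as the "u" in the first processed tweet would be dropped.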

In [36]:

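# Note: `eta` is XGBoost's alias for `learning_rate`, and `seed` appears to duplicate
# `random_state`, so each pair below sets the same thing twice with different values.
# Keeping only one of each would make the configuration unambiguous.
# `eval_metric` matters only when an evaluation set is passed to fit().
model1 = XGBClassifier(
    random_state=42,
    seed=7,
    learning_rate=0.1,
    max_depth=6,
    n_estimators=7000,
    eta=0.001,
    eval_metric='pre',
)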

In [37]:
model1.fit(X_train_XGB, y_train)

In [38]:
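# Evaluate the model on the held-out split
predicted_prob = model1.predict_proba(X_test_XGB)[:, 1]  # class-1 probabilities (not used below)
predicted = model1.predict(X_test_XGB)

# scikit-learn metrics take the true labels first, then the predictions
accuracy = metrics.accuracy_score(y_test, predicted)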
print("Test accuracy: ", accuracy)
print(metrics.classification_report(y_test, predicted, target_names=["0", "1"]))
print("Test F-score: ", metrics.f1_score(y_test, predicted))

Test accuracy:  0.7911558669001751
              precision    recall  f1-score   support

           0       0.82      0.83      0.82      1338
           1       0.76      0.73      0.74       946

    accuracy                           0.79      2284
   macro avg       0.79      0.78      0.78      2284
weighted avg       0.79      0.79      0.79      2284

Test F-score:  0.7442359249329759
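
A few quick checks help put these numbers in context. The sketch below uses only figures copied from the output above; the variable names are illustrative, and the F1 value is recomputed from the rounded precision and recall, so it matches the reported score only to two decimals.

```python
# Figures copied from the evaluation output above.
accuracy = 0.7911558669001751
support = {0: 1338, 1: 946}
precision_1, recall_1 = 0.76, 0.73   # rounded values for class 1

n_holdout = sum(support.values())               # 2284 held-out tweets
n_correct = round(accuracy * n_holdout)         # number of correct predictions
baseline = max(support.values()) / n_holdout    # always predict the majority class

# F1 is the harmonic mean of precision and recall.
f1_from_rounded = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(n_correct, round(baseline, 3), round(f1_from_rounded, 2))
```

This gives 1807 correct predictions out of 2284, a majority-class baseline of about 0.586, and an F1 of about 0.74 for class 1. The model therefore improves on the baseline by roughly 20 percentage points of accuracy.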


In [39]:
predictions = model1.predict(test_text_XGB)

In [40]:
submission_df = pd.DataFrame()
submission_df['id'] = test_data['id']
submission_df['target'] = predictions
submission_df

Unnamed: 0,id,target
0,0,1
1,2,0
2,3,1
3,9,1
4,11,1
...,...,...
3258,10861,1
3259,10865,1
3260,10868,1
3261,10874,1
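
Before saving a submission, it can be worth checking the shape of the file. The following is a small stdlib-only sketch that writes the first five rows from the table above to an in-memory buffer; the buffer stands in for a real `submission.csv`, and the checks shown are suggestions rather than part of the original notebook.

```python
import csv
import io

# First five (id, target) pairs copied from the table above.
rows = [(0, 1), (2, 0), (3, 1), (9, 1), (11, 1)]

# Basic sanity checks before writing anything.
assert all(target in (0, 1) for _, target in rows)            # binary labels only
assert len({tweet_id for tweet_id, _ in rows}) == len(rows)   # ids are unique

buffer = io.StringIO()  # in-memory stand-in for submission.csv
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["id", "target"])
writer.writerows(rows)
print(buffer.getvalue())
```

The output starts with the header `id,target` followed by one line per tweet, which is the same two-column layout as `submission_df`.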


In [41]:
#submission_df.to_csv('submission.csv',index=False)