# Facciola NLP Disaster Tweet Model

- In this competition we build an NLP model that predicts whether a tweet is about a real disaster.

In [1]:
import warnings
import os
import pandas as pd
import numpy as np


warnings.filterwarnings('ignore')
DATA_DIR = "/kaggle/input/nlp-getting-started/"

## Import the training data

In [2]:
train_df = pd.read_csv(os.path.join(DATA_DIR, 'train.csv'))
train_df.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


## Import the test data

In [3]:
test_df = pd.read_csv(os.path.join(DATA_DIR, 'test.csv'))
test_df.head()

Unnamed: 0,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, s..."
2,3,,,"there is a forest fire at spot pond, geese are..."
3,9,,,Apocalypse lighting. #Spokane #wildfires
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan


## EDA
- Examine the structure of the data.

In [4]:
print("Train set info")
print(train_df.info())
print()
print("Test set info")
print(test_df.info())

Train set info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7613 entries, 0 to 7612
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        7613 non-null   int64 
 1   keyword   7552 non-null   object
 2   location  5080 non-null   object
 3   text      7613 non-null   object
 4   target    7613 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 297.5+ KB
None

Test set info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3263 entries, 0 to 3262
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        3263 non-null   int64 
 1   keyword   3237 non-null   object
 2   location  2158 non-null   object
 3   text      3263 non-null   object
dtypes: int64(1), object(3)
memory usage: 102.1+ KB
None


## Data Cleaning
- Here we clean the text data by removing URLs, HTML tags, and other unnecessary characters, lowercasing the text, and dropping stopwords.

In [5]:
import re
import nltk
from nltk.corpus import stopwords

# Environment-specific proxy setting; remove or adjust it when running on another network
nltk.set_proxy('http://proxy-dmz.intel.com:911/')
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
#print(stop_words)

def clean_text(text):
   #remove URLS
   text = re.sub(r'http\S+', '', text)
   #remove HTML tags
   text = re.sub(r'<.*?>', '', text)
   # Remove non-alphanumeric characters except hashtags and mentions
   text = re.sub(r'[^a-zA-Z0-9\s#@]', '', text)
   # Convert to lowercase
   text = text.lower()
   # Remove stopwords
   text = ' '.join([word for word in text.split() if word not in stop_words])
   return text

train_df['clean_text'] = train_df['text'].apply(clean_text)
test_df['clean_text'] = test_df['text'].apply(clean_text)

train_df.head()

[nltk_data] Error loading stopwords: <urlopen error [Errno -2] Name or
[nltk_data]     service not known>


Unnamed: 0,id,keyword,location,text,target,clean_text
0,1,,,Our Deeds are the Reason of this #earthquake M...,1,deeds reason #earthquake may allah forgive us
1,4,,,Forest fire near La Ronge Sask. Canada,1,forest fire near la ronge sask canada
2,5,,,All residents asked to 'shelter in place' are ...,1,residents asked shelter place notified officer...
3,6,,,"13,000 people receive #wildfires evacuation or...",1,13000 people receive #wildfires evacuation ord...
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1,got sent photo ruby #alaska smoke #wildfires p...
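
To make the cleaning steps concrete, here is a minimal, self-contained sketch that applies the same regular expressions to one made-up tweet. The small `STOP_WORDS` set is an assumption standing in for NLTK's English stopword list, and the sample tweet and the name `clean_text_demo` are hypothetical.

```python
import re

# Assumption: a tiny stand-in for NLTK's English stopword list
STOP_WORDS = {"the", "is", "a", "of", "this", "are", "in"}

def clean_text_demo(text):
    text = re.sub(r'http\S+', '', text)            # drop URLs
    text = re.sub(r'<.*?>', '', text)              # drop HTML tags
    text = re.sub(r'[^a-zA-Z0-9\s#@]', '', text)   # keep letters, digits, whitespace, # and @
    text = text.lower()                            # normalize case
    # Drop stopwords and collapse repeated whitespace
    return ' '.join(w for w in text.split() if w not in STOP_WORDS)

sample = "The <b>fire</b> is spreading in #Alaska! Details: http://t.co/abc @user"
print(clean_text_demo(sample))  # fire spreading #alaska details @user
```

Hashtags and mentions survive the cleaning because `#` and `@` are excluded from the character filter.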


## Feature Engineering
- **text length**: Calculate the length of each tweet. This can help capture information about tweet complexity or verbosity.
- **word count**: Count the number of words in each tweet, which may provide insight into tweet structure.
- **hashtag count**: Count the number of hashtags in each tweet, as this can be indicative of topic relevance or trending discussions.
- **mention count**: Count the number of user mentions, which can indicate the tweet's engagement level.
- **hasUrl**: Create a binary feature indicating whether the tweet contains a URL.
- **sentiment score**: Use a pre-trained sentiment analyzer to get a sentiment score for each tweet.
- **pos tags**: Count the occurrence of different parts of speech in each tweet.
- **profanity count**: Count the number of profane words in each tweet using a predefined list of profane words.
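
The simpler count features in the list above can be sketched in one helper. This is only an illustration: `basic_features` is a hypothetical name, the sample tweet is made up, and the URL pattern is a simplified assumption rather than the longer expression used in the cells below.

```python
import re

def basic_features(text):
    """Return the simple count features for one tweet as a dict."""
    words = text.split()
    return {
        "text_length": len(text),                                  # characters
        "word_count": len(words),                                  # whitespace-separated tokens
        "hashtag_count": sum(w.startswith("#") for w in words),
        "mention_count": sum(w.startswith("@") for w in words),
        # Simplified URL check (assumption): any http:// or https:// link
        "has_url": int(re.search(r"https?://\S+", text) is not None),
    }

tweet = "@user Forest fire near #LaRonge https://t.co/xyz"
print(basic_features(tweet))
```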

In [6]:
train_df['text_length'] = train_df['text'].apply(len)
train_df.sample(5)

Unnamed: 0,id,keyword,location,text,target,clean_text,text_length
3905,5555,flattened,Pomfret/Providence,'the fallacy is it is up to the steam roller. ...,0,fallacy steam roller object whether flattened ...,139
7204,10320,weapon,www.twitch.tv/PKSparkxx,Slosher is GOAT. Freaking love that weapon. Ca...,0,slosher goat freaking love weapon cant wait ep...,130
5634,8035,refugees,,...//..// whao.. 12000 Nigerian refugees repat...,1,whao 12000 nigerian refugees repatriated cameroon,89
6736,9653,thunderstorm,Killafornia made me,9:35 pm. Thunderstorm. No rain. 90 degrees. Th...,1,935 pm thunderstorm rain 90 degrees weather weird,63
4899,6974,massacre,Ecuador,Don't mess with my Daddy I can be a massacre. ...,0,dont mess daddy massacre #becarefulharry,61


In [7]:
train_df['word_count'] = train_df['text'].apply(lambda x: len(x.split()))
train_df.sample(5)

Unnamed: 0,id,keyword,location,text,target,clean_text,text_length,word_count
5823,8314,rubble,London,#360WiseNews : China's Stock Market Crash: Are...,1,#360wisenews chinas stock market crash gems ru...,95,13
7212,10331,weapon,"New York, NY",03/08/11: Police stop a 41-year-old in the Bro...,1,030811 police stop 41yearold bronx citing casi...,106,18
4297,6103,hellfire,,@HellFire_eV @JackPERU1 then I do this to one ...,0,@hellfireev @jackperu1 one,58,11
140,201,airplane%20accident,,@AlexAllTimeLow awwww they're on an airplane a...,1,@alexalltimelow awwww theyre airplane accident...,104,17
1913,2753,curfew,,@TheComedyQuote @50ShadezOfGrey the thirst has...,0,@thecomedyquote @50shadezofgrey thirst curfew ...,79,9


In [8]:
train_df['hashtag_count'] = train_df['text'].apply(lambda x: len([w for w in x.split() if w.startswith('#')]))
train_df.sample(5)

Unnamed: 0,id,keyword,location,text,target,clean_text,text_length,word_count,hashtag_count
6695,9592,thunder,,That was the l9udest thunder I've ever heard,0,l9udest thunder ive ever heard,45,8,0
2276,3264,demolish,,Ugh So hungry I'm going to demolish this food!,0,ugh hungry im going demolish food,46,9,0
3599,5139,fatal,,@spookyfob @feelslikefob I am okay thank you y...,0,@spookyfob @feelslikefob okay thank yes kindne...,118,19,0
2425,3485,derailed,SEC Country,@BobbyofHomewood @JOXRoundtable as in dropping...,0,@bobbyofhomewood @joxroundtable dropping nospo...,115,17,0
7502,10731,wreck,Canada BC,@raineishida lol...Im just a nervous wreck :P,0,@raineishida lolim nervous wreck p,45,7,0


In [9]:
train_df['mention_count'] = train_df['text'].apply(lambda x: len([w for w in x.split() if w.startswith('@')]))
train_df.sample(5)

Unnamed: 0,id,keyword,location,text,target,clean_text,text_length,word_count,hashtag_count,mention_count
3109,4463,electrocuted,USA,South Side factory where worker electrocuted p...,1,south side factory worker electrocuted pays 17...,99,12,1,0
7607,10867,,,#stormchase Violent Record Breaking EF-5 El Re...,1,#stormchase violent record breaking ef5 el ren...,134,16,1,0
6986,10018,twister,,Twister was fun https://t.co/qCT6fb8wOn,0,twister fun,39,4,0,0
1640,2368,collapsed,Paris,... The pain of those seconds must have been a...,1,pain seconds must awful heart burst lungs coll...,121,24,0,0
3080,4421,electrocute,Mass,@Mmchale13 *tries to electrocute self with pho...,0,@mmchale13 tries electrocute self phone cord,54,8,0,1


In [10]:
train_df['has_url'] = train_df['text'].apply(lambda x: 1 if re.search("http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+", x) else 0)
train_df.sample(5)

Unnamed: 0,id,keyword,location,text,target,clean_text,text_length,word_count,hashtag_count,mention_count,has_url
1972,2838,cyclone,,'I'm a cyclone passion overblown' https://t.co...,0,im cyclone passion overblown,57,6,0,0,1
865,1250,blood,,Bruh white people buy the ugliest shoes and th...,0,bruh white people buy ugliest shoes super tigh...,94,18,0,0,0
7308,10458,wild%20fires,nap central,Wild fires freak me the fuck out. Like hell no,1,wild fires freak fuck like hell,46,10,0,0,0
3109,4463,electrocuted,USA,South Side factory where worker electrocuted p...,1,south side factory worker electrocuted pays 17...,99,12,1,0,1
5028,7169,mudslide,the burrow,DORETTE THATS THE NAME OF THE MUDSLIDE CAKE MAKER,0,dorette thats name mudslide cake maker,49,9,0,0,0


In [11]:
from textblob import TextBlob
train_df['sentiment_score'] = train_df['text'].apply(lambda x: TextBlob(x).sentiment.polarity)
train_df.sample(5)

Unnamed: 0,id,keyword,location,text,target,clean_text,text_length,word_count,hashtag_count,mention_count,has_url,sentiment_score
5285,7552,outbreak,Indonesia,More than 40 families affected by the fatal ou...,1,40 families affected fatal outbreak legionnair...,136,20,0,0,1,0.5
1837,2642,crashed,too far,He was only .4 of a second faster than me and ...,0,4 second faster overtook twice crashed tru luv...,101,21,0,0,0,0.0
6278,8969,storm,New Delhi,@johngreen storm and silence by @RobThier_EN,0,@johngreen storm silence @robthieren,44,6,0,2,0,0.0
2265,3245,deluged,"Los Angeles, CA",@valdes1978 forgive me if I was a bit testy. H...,0,@valdes1978 forgive bit testy deluged hatred a...,100,18,0,1,0,0.0
4409,6268,hijacking,,@USAgov Koreans are performing hijacking of th...,1,@usagov koreans performing hijacking tokyo oly...,91,10,0,1,1,0.0


In [12]:
import spacy
nlp = spacy.load('en_core_web_sm')
train_df['noun_count'] = train_df['text'].apply(lambda x: len([token.pos_ for token in nlp(x) if token.pos_ == 'NOUN' or token.pos_ == 'PROPN']))
train_df['verb_count'] = train_df['text'].apply(lambda x: len([token.pos_ for token in nlp(x) if token.pos_ == 'VERB']))
train_df['adverb_count'] = train_df['text'].apply(lambda x: len([token.pos_ for token in nlp(x) if token.pos_ == 'ADV']))
train_df['adjective_count'] = train_df['text'].apply(lambda x: len([token.pos_ for token in nlp(x) if token.pos_ == 'ADJ']))
train_df.sample(5)

Unnamed: 0,id,keyword,location,text,target,clean_text,text_length,word_count,hashtag_count,mention_count,has_url,sentiment_score,noun_count,verb_count,adverb_count,adjective_count
1480,2133,catastrophe,"Denver, CO",#Denver CO #Insurance #Job: Claims Property Fi...,0,#denver co #insurance #job claims property fie...,136,17,3,0,1,0.0,17,0,0,0
2843,4088,displaced,,PennLive - Two families displaced by Mechanics...,1,pennlive two families displaced mechanicsburg ...,113,17,0,0,1,0.0,7,2,0,0
7319,10478,wild%20fires,Indiana,'Your love will surely come find us\nLike blaz...,0,love surely come find us like blazing wild fir...,78,14,0,0,0,0.366667,3,4,1,1
3619,5166,fatalities,,Las Vegas in top 5 cities for red-light runnin...,0,las vegas top 5 cities redlight running fatali...,91,13,0,0,1,0.5,7,1,0,2
3038,4359,earthquake,Earth,1.9 earthquake occurred 15km E of Anchorage Al...,1,19 earthquake occurred 15km e anchorage alaska...,110,14,2,0,1,0.0,8,1,0,0
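
The cell above parses every tweet once per part-of-speech column. One possible alternative is to tag each tweet once and derive all four counts from that single list of tags. The sketch below assumes the tags are already available as strings (for example from `[token.pos_ for token in nlp(text)]`); the tag list shown is hand-written for illustration and is not real spaCy output, and `pos_counts` is a hypothetical helper.

```python
from collections import Counter

def pos_counts(tags):
    """Count the POS groups used in this notebook from one list of tags."""
    c = Counter(tags)
    return {
        "noun_count": c["NOUN"] + c["PROPN"],  # common and proper nouns together
        "verb_count": c["VERB"],
        "adverb_count": c["ADV"],
        "adjective_count": c["ADJ"],
    }

# Hypothetical tags for "Forest fire spreads quickly near Canada"
tags = ["NOUN", "NOUN", "VERB", "ADV", "ADP", "PROPN"]
print(pos_counts(tags))
```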


In [13]:
!pip install better_profanity

from better_profanity import profanity

# Count whole words (not single characters) that appear in the library's word list
train_df['profanity_count'] = train_df['text'].apply(lambda x: len([w for w in x.lower().split() if w in profanity.CENSOR_WORDSET]))
train_df.sample(5)

Collecting better_profanity
  Downloading better_profanity-0.7.0-py3-none-any.whl.metadata (7.1 kB)
Downloading better_profanity-0.7.0-py3-none-any.whl (46 kB)
Installing collected packages: better_profanity
Successfully installed better_profanity-0.7.0


Unnamed: 0,id,keyword,location,text,target,clean_text,text_length,word_count,hashtag_count,mention_count,has_url,sentiment_score,noun_count,verb_count,adverb_count,adjective_count,profanity_count
7489,10711,wreck,Primum non nocere,@GeorgeFoster72 and The Wreck of the Edmund Fi...,1,@georgefoster72 wreck edmund fitzgerald,54,8,0,1,0,0.0,3,0,0,0,0
562,812,battle,Earth,Check out this item I just got! [Phantasmal Cu...,0,check item got phantasmal cummerbund #warcraft,88,11,1,0,1,0.0,5,2,1,0,0
3912,5563,flood,United States,JKL cancels Flash Flood Warning for Bell Harla...,1,jkl cancels flash flood warning bell harlan kn...,85,12,1,0,1,0.0,8,2,0,0,0
1792,2571,crash,"Melbourne, Australia",#INCIDENT\nCrash in Pascoe Vale South outbound...,1,#incident crash pascoe vale south outbound tul...,136,21,1,0,0,0.1,15,0,0,0,0
5741,8195,riot,Belgrade,To All The Meat-Loving Feminists Of The World ...,0,meatloving feminists world riot grill arrived ...,135,21,1,0,1,0.0,10,3,0,0,0
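
As a rough illustration of word-level matching for this feature, the sketch below lowercases a tweet, splits it into word tokens, and counts how many tokens appear in a word list. The two-word `WORD_LIST` is a hypothetical stand-in for the library's list, and `count_listed_words` is not part of `better_profanity`.

```python
import re

# Assumption: a tiny hypothetical word list standing in for the library's list
WORD_LIST = {"damn", "hell"}

def count_listed_words(text, word_list=WORD_LIST):
    # Lowercase, then extract word tokens so punctuation does not block a match
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return sum(token in word_list for token in tokens)

print(count_listed_words("Hell no, damn that storm!"))  # 2
print(count_listed_words("hello there"))                # 0
```

Matching whole tokens means a listed word inside a longer word (such as "hello") is not counted.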


### Test set Feature engineering
- Now apply the same feature engineering steps to the test set.

In [14]:
test_df['text_length'] = test_df['text'].apply(len)
test_df['word_count'] = test_df['text'].apply(lambda x: len(x.split()))
test_df['hashtag_count'] = test_df['text'].apply(lambda x: len([w for w in x.split() if w.startswith('#')]))
test_df['mention_count'] = test_df['text'].apply(lambda x: len([w for w in x.split() if w.startswith('@')]))
test_df['has_url'] = test_df['text'].apply(lambda x: 1 if re.search("http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+", x) else 0)
test_df['sentiment_score'] = test_df['text'].apply(lambda x: TextBlob(x).sentiment.polarity)
test_df['noun_count'] = test_df['text'].apply(lambda x: len([token.pos_ for token in nlp(x) if token.pos_ == 'NOUN' or token.pos_ == 'PROPN']))
test_df['verb_count'] = test_df['text'].apply(lambda x: len([token.pos_ for token in nlp(x) if token.pos_ == 'VERB']))
test_df['adverb_count'] = test_df['text'].apply(lambda x: len([token.pos_ for token in nlp(x) if token.pos_ == 'ADV']))
test_df['adjective_count'] = test_df['text'].apply(lambda x: len([token.pos_ for token in nlp(x) if token.pos_ == 'ADJ']))
# Count whole words (not single characters) that appear in the library's word list
test_df['profanity_count'] = test_df['text'].apply(lambda x: len([w for w in x.lower().split() if w in profanity.CENSOR_WORDSET]))

In [15]:
test_df.sample(5)

Unnamed: 0,id,keyword,location,text,clean_text,text_length,word_count,hashtag_count,mention_count,has_url,sentiment_score,noun_count,verb_count,adverb_count,adjective_count,profanity_count
3160,10491,wildfire,Colorado,11:57am Wildfire by The Mynabirds from Lovers ...,1157am wildfire mynabirds lovers know,50,8,0,0,0,0.0,4,1,0,0,0
1945,6562,injury,,nflweek1picks: Michael Floyd's hand injury sho...,nflweek1picks michael floyds hand injury shoul...,121,16,0,0,0,0.0,9,4,0,0,0
532,1742,buildings%20burning,,kou is like [CASH REGISTER] [BUILDINGS BURNING],kou like cash register buildings burning,47,7,0,0,0,0.0,2,2,0,0,0
1382,4558,emergency%20plan,"Vancouver, Canada",Calgary takes another beating from summer stor...,calgary takes another beating summer storms ci...,141,19,0,0,1,0.0,11,3,0,0,0
1512,5031,eyewitness,,How 'Little Boy' Affected the People In Hirosh...,little boy affected people hiroshima eyewitnes...,102,13,0,0,1,-0.234375,6,1,0,1,0


## TF-IDF Vectorization
- Convert the cleaned text data into numerical features using TF-IDF

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TF-IDF Vectorizer
tfidf = TfidfVectorizer(max_features=10000)

# Fit and transform the training data
X_train_tfidf = tfidf.fit_transform(train_df['clean_text'])

# Transform the test data
X_test_tfidf = tfidf.transform(test_df['clean_text'])
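
For intuition about what the vectorizer computes, here is a small pure-Python sketch of TF-IDF using the smoothed formula that scikit-learn applies by default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. It assumes plain whitespace tokenization, which is simpler than the vectorizer's own tokenizer, and the three documents and the name `tfidf_rows` are made up for illustration.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    n = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(t for toks in tokenized for t in set(toks))
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)                       # raw term counts
        raw = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])     # L2-normalized row
    return vocab, rows

docs = ["forest fire near town", "fire crews fight fire", "nice sunny day"]
vocab, rows = tfidf_rows(docs)
first = dict(zip(vocab, rows[0]))
print(round(first["fire"], 3), round(first["forest"], 3))
```

In the first document, "fire" gets a lower weight than "forest" because it also occurs in a second document.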

## BERT Embeddings
- Generate BERT embeddings for the text data

In [17]:
from transformers import BertModel, BertTokenizer
import torch

#load tokenizer and BERT model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertModel.from_pretrained('bert-base-uncased')

# Tokenize and encode the text
def get_bert_embeddings(text_list):
   inputs = tokenizer(text_list, return_tensors='pt', padding=True, truncation=True, max_length=512)
   with torch.no_grad():
      outputs = bert_model(**inputs)
   return outputs.last_hidden_state[:, 0, :].numpy()

# Get BERT embeddings for train and test data
X_train_bert = get_bert_embeddings(train_df['clean_text'].tolist())
X_test_bert = get_bert_embeddings(test_df['clean_text'].tolist())
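
The calls above send every tweet through the model in a single batch, which can need a lot of memory. One possible way to reduce that is to embed the texts in smaller chunks and concatenate the results. The sketch below shows only the batching logic; `embed_in_batches`, `batched`, the batch size, and the stand-in `fake_embed` function are all hypothetical.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_in_batches(texts, embed_fn, batch_size=32):
    """Apply `embed_fn` to each batch and collect one vector per text."""
    vectors = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_fn(batch))
    return vectors

# Hypothetical stand-in for an embedding function: one 1-d "vector" per text
def fake_embed(batch):
    return [[float(len(text))] for text in batch]

print(embed_in_batches(["fire", "flood", "storm"], fake_embed, batch_size=2))
```

With an embedding function that returns NumPy arrays, the per-batch results could instead be stacked with `np.vstack`.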

In [18]:
X_train_combined = np.hstack((X_train_tfidf.toarray(), X_train_bert, 
                              train_df[['text_length', 'word_count', 'hashtag_count', 'mention_count', 'has_url', 
                                       'sentiment_score', 'noun_count', 'verb_count', 'adverb_count', 'adjective_count', 'profanity_count']].values))

X_test_combined = np.hstack((X_test_tfidf.toarray(), X_test_bert, 
                              test_df[['text_length', 'word_count', 'hashtag_count', 'mention_count', 'has_url', 
                                       'sentiment_score', 'noun_count', 'verb_count', 'adverb_count', 'adjective_count', 'profanity_count']].values))

## Model Selection
- Here we evaluate a model on a held-out validation split and use the classification report to decide what to fine-tune.

In [19]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample
from sklearn.svm import SVC

# Split and scale the data
X_train_split, X_val_split, y_train_split, y_val_split = train_test_split(X_train_combined, train_df['target'], test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_split_scaled = scaler.fit_transform(X_train_split)
X_val_split_scaled = scaler.transform(X_val_split)

# Define the SVM model
model = SVC(kernel='rbf')


model.fit(X_train_split_scaled, y_train_split)
y_pred = model.predict(X_val_split_scaled)

print(f"Model: SVM")
print(classification_report(y_val_split, y_pred))
print('-' * 60)

Model: SVM
              precision    recall  f1-score   support

           0       0.78      0.93      0.85       874
           1       0.87      0.64      0.74       649

    accuracy                           0.81      1523
   macro avg       0.82      0.78      0.79      1523
weighted avg       0.82      0.81      0.80      1523

------------------------------------------------------------
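
The f1-score column in the report is the harmonic mean of precision and recall, F1 = 2PR / (P + R). As a quick check, the sketch below recomputes it from the rounded precision and recall values printed above; because the inputs are rounded, the last digit could in principle differ from the report. The function name `f1_from` is hypothetical.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall, support) per class, copied from the report above
report = {0: (0.78, 0.93, 874), 1: (0.87, 0.64, 649)}

f1 = {label: f1_from(p, r) for label, (p, r, _) in report.items()}
total = sum(support for _, _, support in report.values())
# Weighted average: each class F1 weighted by its share of the samples
weighted = sum(f1[label] * support for label, (_, _, support) in report.items()) / total

print(round(f1[0], 2), round(f1[1], 2), round(weighted, 2))  # 0.85 0.74 0.8
```

The gap between recall for class 0 (0.93) and class 1 (0.64) is what pulls the disaster-class F1 down.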


## Hyperparameter Tuning

In [20]:
hyperparameter_grid = {'C': [0.1, 1, 10, 100], 'kernel': ['linear', 'rbf'], 'gamma': ['scale', 'auto']}

grid_search = GridSearchCV(model, hyperparameter_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train_split_scaled, y_train_split)
print("Best Parameters for SVM:", grid_search.best_params_)

Best Parameters for SVM: {'C': 1, 'gamma': 'auto', 'kernel': 'rbf'}
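
To get a feel for the cost of this search, the grid can be expanded by hand. The sketch below enumerates every combination in the same grid and multiplies by the number of folds; it only counts fits and does not train anything.

```python
from itertools import product

grid = {'C': [0.1, 1, 10, 100], 'kernel': ['linear', 'rbf'], 'gamma': ['scale', 'auto']}

keys = sorted(grid)
# One dict per candidate setting, e.g. {'C': 0.1, 'gamma': 'scale', 'kernel': 'linear'}
combos = [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]

cv_folds = 5
print(len(combos), len(combos) * cv_folds)  # 16 80
```

That is 16 candidates and 80 cross-validation fits, plus one final refit on the full training split with the default `refit=True`. Since `gamma` has no effect on the linear kernel, half of the linear candidates repeat the same model.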


## Re-evaluate with best Hyperparameters

In [21]:
model = SVC(C=1, gamma='auto', kernel='rbf')

model.fit(X_train_split_scaled, y_train_split)
y_pred = model.predict(X_val_split_scaled)

print(f"Model: SVM")
print(classification_report(y_val_split, y_pred))
print('-' * 60)

Model: SVM
              precision    recall  f1-score   support

           0       0.78      0.92      0.85       874
           1       0.86      0.65      0.74       649

    accuracy                           0.81      1523
   macro avg       0.82      0.79      0.79      1523
weighted avg       0.81      0.81      0.80      1523

------------------------------------------------------------


## Final Submission

In [23]:
X_train = X_train_combined
X_test = X_test_combined
y_train = train_df['target']

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = SVC(C=1, gamma='auto', kernel='rbf')
# Fit and predict on the scaled features, matching the validation setup
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)

submission = pd.DataFrame({
    'id' : test_df['id'],
    'target' : y_pred
})

submission.to_csv('svm_submission.csv', index=False)

print('Submission created successfully')

Submission created successfully
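
The scaling step used before the SVM can be sketched by hand for a single feature column: the mean and standard deviation are learned from the training values only, and the same two numbers are then reused for the test values. This is a simplified illustration of standardization with the population standard deviation; the helper names and the sample values are hypothetical.

```python
from statistics import fmean, pstdev

def fit_scaler(column):
    """Learn the mean and population standard deviation from training values."""
    mean = fmean(column)
    std = pstdev(column) or 1.0  # avoid dividing by zero for constant columns
    return mean, std

def transform(column, mean, std):
    """Standardize values with statistics learned elsewhere."""
    return [(v - mean) / std for v in column]

train_lengths = [50, 100, 150]   # made-up text_length values
test_lengths = [100, 200]

mean, std = fit_scaler(train_lengths)
print(transform(train_lengths, mean, std))
print(transform(test_lengths, mean, std))
```

A model trained on standardized features expects test features on the same scale, which is why the test set is transformed with the training statistics rather than refit.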
