# NLP with Disaster Tweets Competition (Kaggle)

This Natural Language Processing (NLP) project was created to participate in the "Natural Language Processing with Disaster Tweets" competition on Kaggle. It covers text preprocessing, data analysis, and model creation to detect tweets that report real emergencies.

### Competition Description
Twitter has become an important communication channel in times of emergency.
The ubiquitousness of smartphones enables people to announce an emergency they’re observing in real-time. Because of this, more agencies are interested in programmatically monitoring Twitter (i.e. disaster relief organizations and news agencies).

Link to Competition: https://www.kaggle.com/competitions/nlp-getting-started

# Summary

### Chapter 1: Exploratory and Statistical Analysis
- 1.1: Importing Data and First Look
- 1.2: Checking Data Types and Statistics
- 1.3: Checking Null Values

### Chapter 2: Data Processing
- 2.1: Handling Null Values 
- 2.2: Creating Tags Column and Handling Undesired Content
- 2.3: Final Processing

### Chapter 3: Tags Processing
- 3.1: Stop Words
- 3.2: Stemming
- 3.3: Train Test Split and Count Vectorizer

### Chapter 4: Creating and Testing ML Models
- 4.1: Naive Bayes
- 4.2: Random Forest
- 4.3: Ensemble Model (Naive Bayes + Random Forest)

# Chapter 1 - Exploratory and Statistical Analysis

In this section, we take a close look at our data to understand what it can tell us. 

This step is crucial for getting to know our data better before we dive into more advanced techniques.

### 1.1: Importing Data and First Look

In [282]:
import pandas as pd
import re 
import warnings
warnings.filterwarnings("ignore")

In [283]:
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
sample_df = pd.read_csv('sample_submission.csv')

# Display train_df (pandas shows the first and last rows of a long DataFrame)
train_df

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1
...,...,...,...,...,...
7608,10869,,,Two giant cranes holding a bridge collapse int...,1
7609,10870,,,@aria_ahrary @TheTawniest The out of control w...,1
7610,10871,,,M1.94 [01:04 UTC]?5km S of Volcano Hawaii. htt...,1
7611,10872,,,Police investigating after an e-bike collided ...,1


In [284]:
# Let's see the first 5 rows of test_df
test_df.head()

Unnamed: 0,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, s..."
2,3,,,"there is a forest fire at spot pond, geese are..."
3,9,,,Apocalypse lighting. #Spokane #wildfires
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan


In [285]:
# Let's see the first 5 rows of sample_df
sample_df.head()

Unnamed: 0,id,target
0,0,0
1,2,0
2,3,0
3,9,0
4,11,0


### 1.2: Checking Data Types and Statistics

In [286]:
train_df.dtypes

id           int64
keyword     object
location    object
text        object
target       int64
dtype: object

In [287]:
train_df.describe()

Unnamed: 0,id,target
count,7613.0,7613.0
mean,5441.934848,0.42966
std,3137.11609,0.49506
min,1.0,0.0
25%,2734.0,0.0
50%,5408.0,0.0
75%,8146.0,1.0
max,10873.0,1.0


### 1.3: Checking Null Values

In [288]:
train_df.isnull().sum()

id             0
keyword       61
location    2533
text           0
target         0
dtype: int64
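
To put these counts in perspective, the short sketch below converts them into shares of the 7,613 training rows reported by `describe()`. It only uses numbers already shown above; the variable names are illustrative.

```python
# Null counts and row count copied from the outputs above.
n_rows = 7613
null_counts = {"keyword": 61, "location": 2533}

# Share of missing values per column, as a percentage rounded to 2 decimals.
null_share = {col: round(100 * n / n_rows, 2) for col, n in null_counts.items()}

for col, pct in null_share.items():
    print(f"{col}: {pct}% missing")
```

Roughly a third of the 'location' values are missing, while 'keyword' is almost complete.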

Let's take a look at the first row of the 'text' column.

In [289]:
train_df['text'][0]

'Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all'

In [290]:
test_df.isnull().sum()

id             0
keyword       26
location    1105
text           0
dtype: int64

In [291]:
test_df['text'][0]

'Just happened a terrible car crash'

# Chapter 2 - Data Processing
In this chapter, we will handle null and undesired values, as well as create the Tags column, which is crucial for our project.

### 2.1: Handling Null Values

We will replace the NaN values with 'Unknown' to represent blank values. This way, every row with a missing 'keyword' or 'location' shares the same placeholder token.

In [292]:
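# train_df
# Assign the result back instead of using inplace=True on a column selection,
# which newer pandas versions warn about and may not apply to the DataFrame.
train_df['keyword'] = train_df['keyword'].fillna('Unknown')
train_df['location'] = train_df['location'].fillna('Unknown')
train_df.head()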

Unnamed: 0,id,keyword,location,text,target
0,1,Unknown,Unknown,Our Deeds are the Reason of this #earthquake M...,1
1,4,Unknown,Unknown,Forest fire near La Ronge Sask. Canada,1
2,5,Unknown,Unknown,All residents asked to 'shelter in place' are ...,1
3,6,Unknown,Unknown,"13,000 people receive #wildfires evacuation or...",1
4,7,Unknown,Unknown,Just got sent this photo from Ruby #Alaska as ...,1


In [293]:
# test_df
# Assign the result back instead of using inplace=True on a column selection.
test_df['keyword'] = test_df['keyword'].fillna('Unknown')
test_df['location'] = test_df['location'].fillna('Unknown')
test_df.head()

Unnamed: 0,id,keyword,location,text
0,0,Unknown,Unknown,Just happened a terrible car crash
1,2,Unknown,Unknown,"Heard about #earthquake is different cities, s..."
2,3,Unknown,Unknown,"there is a forest fire at spot pond, geese are..."
3,9,Unknown,Unknown,Apocalypse lighting. #Spokane #wildfires
4,11,Unknown,Unknown,Typhoon Soudelor kills 28 in China and Taiwan


### 2.2: Creating Tags Column and Handling Undesired Content

We create the 'tags' column by concatenating 'keyword', 'location', and 'text' into a single string per tweet.

In [294]:
train_df['tags'] = train_df['keyword'] + " " + train_df['location'] + " " + train_df['text']
train_df['tags'][0]

'Unknown Unknown Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all'

In [295]:
test_df['tags'] = test_df['keyword'] + " " + test_df['location'] + " " + test_df['text']
test_df['tags'][21]

'ablaze Washington State Burning Man Ablaze! by Turban Diva http://t.co/hodWosAmWS via @Etsy'

We will remove the tokens that represent web links. These tokens start with 'http' and do not provide relevant information for our analysis.

In [296]:
train_df['tags'] = train_df['tags'].apply(lambda x: re.sub(r'\bhttp\S+', '', str(x)))
train_df['tags'][0]

'Unknown Unknown Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all'

In [297]:
test_df['tags'] = test_df['tags'].apply(lambda x: re.sub(r'\bhttp\S+', '', str(x)))
test_df['tags'][0]

'Unknown Unknown Just happened a terrible car crash'
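
The rows shown above contain no link, so the effect of the pattern is not visible there. As a minimal sketch, the snippet below applies the same regular expression to part of the `test_df['tags'][21]` string printed earlier; the names `URL_PATTERN`, `sample`, and `without_url` are illustrative.

```python
import re

# Same pattern as in the notebook: a word boundary, "http",
# then any run of non-whitespace characters (this also covers "https").
URL_PATTERN = re.compile(r"\bhttp\S+")

sample = "Burning Man Ablaze! by Turban Diva http://t.co/hodWosAmWS via @Etsy"
without_url = URL_PATTERN.sub("", sample)

# repr() makes the leftover double space visible
print(repr(without_url))
```

The link disappears but leaves two consecutive spaces behind. That is harmless here because a later step collapses repeated whitespace.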

### 2.3: Final Processing

This function, process_tag, performs several text processing steps on a given tag. It removes single quotes, replaces non-alphabetic characters with spaces, eliminates multiple spaces, and converts the text to lowercase. This preprocessing is commonly used to clean and standardize text data for further analysis or modeling tasks.

In [298]:
def process_tag(tag):
    # Remove single quotes
    tag = re.sub(r"'", "", tag)
    # Replace every non-alphabetic character with a space
    tag = re.sub(r"[^a-zA-Z]", " ", tag)
    # Collapse runs of whitespace into a single space
    tag = re.sub(r'\s+', ' ', tag)
    # Convert the text to lowercase
    tag = tag.lower()
    return tag

In [299]:
# Apply process_tag to every row of the 'tags' column and overwrite the original values
train_df['tags'] = train_df['tags'].apply(process_tag)
train_df['tags'][48]

'ablaze live on webcam check these out nsfw'

In [300]:
# Apply process_tag to every row of the 'tags' column and overwrite the original values
test_df['tags'] = test_df['tags'].apply(process_tag)
test_df['tags'][48]

'aftershock california when the aftershock happened nepal we were the last intl team still there in a way we were st responders chief collins laco fd'
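
One side effect is worth seeing on a small input: digits count as non-alphabetic characters, so numbers are dropped entirely, which is consistent with the 'st responders' fragment in the output above. The sketch below repeats the cleaning function so that it runs on its own; the example strings are made up for illustration.

```python
import re

def process_tag(tag):
    tag = re.sub(r"'", "", tag)           # drop single quotes
    tag = re.sub(r"[^a-zA-Z]", " ", tag)  # non-letters (digits, #, @, punctuation) become spaces
    tag = re.sub(r"\s+", " ", tag)        # collapse repeated whitespace
    return tag.lower()

examples = ["It's 13,000 people & #wildfires!", "1st responders"]
cleaned = [process_tag(text) for text in examples]

for original, result in zip(examples, cleaned):
    # repr() shows leading and trailing spaces that the function leaves in place
    print(repr(original), "->", repr(result))
```

The function does not strip leading or trailing spaces. Later steps split on whitespace, so this does not change the resulting tokens, but adding `.strip()` would be a reasonable refinement.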

# Chapter 3 - Tags Processing

### 3.1: Stop Words

We'll download and load the English stop words list using the Natural Language Toolkit (NLTK). The stop words are then used to create a function called 'remover_stop_words' that removes stop words from the 'tags' column in the DataFrame.

This step helps in preprocessing the text data by removing common words that do not contribute significantly to the meaning of the text.

In [301]:
import nltk
from nltk.corpus import stopwords

# Download the English stop words list
nltk.download('stopwords')

# Load the English stop words into a set for fast lookups
stop_words = set(stopwords.words('english'))

# Function that removes stop words from a string of tags
def remover_stop_words(tags):
    if isinstance(tags, str):
        return ' '.join([word for word in tags.split() if word.lower() not in stop_words])
    else:
        # Leave non-string values (e.g. NaN) untouched
        return tags

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\yamas\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [302]:
# Apply the function to the 'tags' column
train_df['tags'] = train_df['tags'].apply(remover_stop_words)
train_df['tags'][48]

'ablaze live webcam check nsfw'

In [303]:
# Apply the function to the 'tags' column
test_df['tags'] = test_df['tags'].apply(remover_stop_words)
test_df['tags'][48]

'aftershock california aftershock happened nepal last intl team still way st responders chief collins laco fd'
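
The filtering itself is a plain set-membership test. The following sketch reproduces it without NLTK on the training example shown above. `DEMO_STOP_WORDS` is a tiny hand-picked set used only for illustration (an assumption); the notebook relies on NLTK's full English list.

```python
# Hand-picked stop words for the demo only; NLTK's English list is much longer.
DEMO_STOP_WORDS = {"on", "these", "out", "the", "we", "were", "a", "in"}

def remove_stop_words(text, stop_words=DEMO_STOP_WORDS):
    """Keep only the words that are not in the stop-word set."""
    return " ".join(word for word in text.split() if word.lower() not in stop_words)

before = "ablaze live on webcam check these out nsfw"
after = remove_stop_words(before)
print(after)  # ablaze live webcam check nsfw
```

Storing the stop words in a `set` keeps each lookup constant-time, which matters when the function runs over thousands of tweets.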

### 3.2: Stemming

We are going to perform the stemming process on our tags.

- Stemming is the process of reducing words to their root or base form, even if the result is not a valid word. This helps to group together words with similar meanings. For example, the Porter stemmer used here maps "running" and "runs" to "run", and turns "terrible" into "terribl".

In [304]:
# Stemming

from nltk.stem import PorterStemmer
ps = PorterStemmer()

def stems(text):
    # Stem each whitespace-separated token and join the stems back into one string
    return " ".join(ps.stem(word) for word in text.split())

In [305]:
# Apply the function to non-null (string) values only
train_df['tags'] = train_df['tags'].apply(lambda x: stems(x) if isinstance(x, str) else x)
train_df['tags'][48]

'ablaz live webcam check nsfw'

In [306]:
# Apply the function to non-null (string) values only
test_df['tags'] = test_df['tags'].apply(lambda x: stems(x) if isinstance(x, str) else x)
test_df['tags'][0]

'unknown unknown happen terribl car crash'

In [307]:
test_df.head()

Unnamed: 0,id,keyword,location,text,tags
0,0,Unknown,Unknown,Just happened a terrible car crash,unknown unknown happen terribl car crash
1,2,Unknown,Unknown,"Heard about #earthquake is different cities, s...",unknown unknown heard earthquak differ citi st...
2,3,Unknown,Unknown,"there is a forest fire at spot pond, geese are...",unknown unknown forest fire spot pond gees fle...
3,9,Unknown,Unknown,Apocalypse lighting. #Spokane #wildfires,unknown unknown apocalyps light spokan wildfir
4,11,Unknown,Unknown,Typhoon Soudelor kills 28 in China and Taiwan,unknown unknown typhoon soudelor kill china ta...
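
To see why stems such as 'terribl' or 'wildfir' are not real words, it helps to look at a deliberately crude suffix stripper. The sketch below is not the Porter algorithm; `crude_stem` and its suffix list are assumptions made up for illustration, and the real stemmer applies many more rules.

```python
# Suffixes are tried in this order; the first match is removed.
SUFFIXES = ("ing", "ed", "es", "s", "e")

def crude_stem(word, min_stem_length=3):
    """Strip one common suffix, keeping at least `min_stem_length` characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: -len(suffix)]
    return word

words = ["happened", "terrible", "wildfires", "kills", "running"]
stemmed = [crude_stem(word) for word in words]
print(stemmed)  # ['happen', 'terribl', 'wildfir', 'kill', 'runn']
```

The first four results agree with the stems visible in the table above. The last one shows the limit of the shortcut: the Porter stemmer also removes the doubled consonant and returns 'run'.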


### 3.3: Train Test Split and Count Vectorizer

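We're about to split the labeled data from train.csv into a training set and a held-out test set, and apply CountVectorizer to our tags, converting them into numerical representations for analysis. The unlabeled test.csv tweets are vectorized with the same fitted vocabulary.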

- CountVectorizer is a method used for converting a collection of text documents into a matrix of token counts. It essentially converts text data into numerical data that can be used by machine learning algorithms.

In [308]:
from sklearn.model_selection import train_test_split

# 2. Split the data into training and test sets
X = train_df['tags']  # features (processed tags)
y = train_df['target']  # target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [309]:
from sklearn.feature_extraction.text import CountVectorizer

# 3. Feature extraction (here we use word counts)
vectorizer = CountVectorizer()
X_train_vect = vectorizer.fit_transform(X_train)
X_test_vect = vectorizer.transform(X_test)


In [310]:
X_test_real_vect = vectorizer.transform(test_df['tags'])

In [311]:
X_test_real_vect[0]

<1x12830 sparse matrix of type '<class 'numpy.int64'>'
	with 5 stored elements in Compressed Sparse Row format>
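
The sparse row above has 12,830 columns, one per word in the training vocabulary, and 5 stored elements, which is consistent with the five distinct stems in the first test tweet. As a library-free sketch of the same idea, the snippet below learns a vocabulary from training documents and counts words against it. The function names and example sentences are hypothetical, and the real CountVectorizer applies additional tokenization rules.

```python
from collections import Counter

def fit_vocabulary(docs):
    """Map each distinct training token to a column index, in alphabetical order."""
    tokens = sorted({token for doc in docs for token in doc.split()})
    return {token: index for index, token in enumerate(tokens)}

def transform(docs, vocabulary):
    """Turn each document into a list of counts; tokens outside the vocabulary are ignored."""
    rows = []
    for doc in docs:
        counts = Counter(token for token in doc.split() if token in vocabulary)
        rows.append([counts[token] for token in vocabulary])
    return rows

train_docs = ["forest fire near town", "fire crew fight forest fire"]
vocabulary = fit_vocabulary(train_docs)   # columns: crew, fight, fire, forest, near, town
train_matrix = transform(train_docs, vocabulary)

# "wildfir" never appeared in training, so it contributes nothing
new_matrix = transform(["wildfir near forest"], vocabulary)
print(train_matrix)
print(new_matrix)
```

This is why the notebook calls `fit_transform` only on `X_train` and plain `transform` on everything else: the columns must mean the same thing for every dataset the model sees.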

# Chapter 4 - Creating and Testing ML Models

In this stage, we will build and test our ML models to see the results we can get.

In [312]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix,classification_report

### 4.1: Naive Bayes


Naive Bayes is a probabilistic classification algorithm that relies on Bayes' theorem to predict the class of a sample. It assumes conditional independence between features, simplifying the probability calculations. 
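
To make the independence assumption concrete, here is a minimal multinomial Naive Bayes written with the standard library only. It is a sketch under simple assumptions (whitespace tokens, add-one smoothing, made-up training sentences) and not a replacement for scikit-learn's implementation; `train_nb` and `predict_nb` are hypothetical names.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class word log-likelihoods."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for doc, label in zip(docs, labels):
        tokens = doc.split()
        word_counts[label].update(tokens)
        vocab.update(tokens)

    model = {}
    for label, n_docs in class_counts.items():
        # Denominator of the smoothed estimate: class word total + alpha per vocabulary word
        denom = sum(word_counts[label].values()) + alpha * len(vocab)
        model[label] = {
            "log_prior": math.log(n_docs / len(docs)),
            "log_likelihood": {
                word: math.log((word_counts[label][word] + alpha) / denom)
                for word in vocab
            },
        }
    return model

def predict_nb(model, doc):
    """Sum the log prior and one log-likelihood per known word, then pick the best class."""
    scores = {}
    for label, params in model.items():
        score = params["log_prior"]
        for token in doc.split():
            if token in params["log_likelihood"]:  # words unseen in training are skipped
                score += params["log_likelihood"][token]
        scores[label] = score
    return max(scores, key=scores.get)

docs = ["forest fire evacu", "earthquak kill peopl", "love new song", "great movi night"]
labels = [1, 1, 0, 0]
model = train_nb(docs, labels)

print(predict_nb(model, "fire kill"))  # 1
print(predict_nb(model, "new movi"))   # 0
```

Each word adds its own term to the score regardless of the other words in the tweet, which is exactly the conditional independence assumption described above.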

In [313]:
from sklearn.naive_bayes import MultinomialNB

# 4. Train the model (Naive Bayes)
nb_model = MultinomialNB()
nb_model.fit(X_train_vect, y_train)

# 5. Evaluate the model
y_pred_nb = nb_model.predict(X_test_vect)
accuracy = accuracy_score(y_test, y_pred_nb)
# Compute F1 for the Naive Bayes model
f1_nb = f1_score(y_test, y_pred_nb)

print(confusion_matrix(y_test,y_pred_nb))
print(classification_report(y_test,y_pred_nb))

[[723 151]
 [157 492]]
              precision    recall  f1-score   support

           0       0.82      0.83      0.82       874
           1       0.77      0.76      0.76       649

    accuracy                           0.80      1523
   macro avg       0.79      0.79      0.79      1523
weighted avg       0.80      0.80      0.80      1523
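
The figures in the report can be recomputed by hand from the confusion matrix, whose rows are the true classes and whose columns are the predicted classes. The sketch below does this for class 1 (disaster tweets), using only the four counts printed above.

```python
# Confusion matrix from the output above: [[TN, FP], [FN, TP]]
tn, fp = 723, 151
fn, tp = 157, 492

precision = tp / (tp + fp)   # share of predicted disasters that are real
recall = tp / (tp + fn)      # share of real disasters that were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
# precision=0.77 recall=0.76 f1=0.76 accuracy=0.80
```

These values match the row for class 1 and the accuracy line of the classification report.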



In [314]:
# # Make predictions with the trained model
# y_pred_nb_real = nb_model.predict(X_test_real_vect)

# # Build a DataFrame with the predictions
# submission_df = pd.DataFrame({'id': test_df['id'], 'target': y_pred_nb_real})

# submission_df.head()

### 4.2: Random Forest

Random Forest is an ensemble algorithm that builds many decision trees during training and outputs the mode of the classes (classification) or the mean prediction (regression) of the individual trees. Because each tree is trained on a random sample of rows and considers random subsets of features, the combined model can capture feature interactions and is usually less prone to overfitting than a single decision tree.
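
The "mode of the classes" step can be sketched in a few lines. The per-tree predictions below are made-up values for illustration, and `majority_vote` is a hypothetical helper. Note that scikit-learn's `RandomForestClassifier` averages the class probabilities of its trees rather than counting hard votes, so this shows the idea rather than the library's exact procedure.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the most common class among the per-tree predictions for one sample."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Hypothetical predictions from five trees for three tweets
votes_per_tweet = [
    [1, 1, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [1, 0, 1, 1, 1],
]
forest_predictions = [majority_vote(votes) for votes in votes_per_tweet]
print(forest_predictions)  # [1, 0, 1]
```

Individual trees can be wrong on a given tweet, but the vote is only wrong when most of them err at the same time.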

In [315]:
from sklearn.ensemble import RandomForestClassifier

# 4. Train the model (Random Forest)
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train_vect, y_train)

# 5. Evaluate the model
y_pred_rf = rf_model.predict(X_test_vect)
accuracy_rf = accuracy_score(y_test, y_pred_rf)
# Compute F1 for the Random Forest model
f1_rf = f1_score(y_test, y_pred_rf)

print(confusion_matrix(y_test,y_pred_rf))
print(classification_report(y_test,y_pred_rf))

[[756 118]
 [199 450]]
              precision    recall  f1-score   support

           0       0.79      0.86      0.83       874
           1       0.79      0.69      0.74       649

    accuracy                           0.79      1523
   macro avg       0.79      0.78      0.78      1523
weighted avg       0.79      0.79      0.79      1523



### 4.3: Ensemble Model

An ensemble model combining Naive Bayes and Random Forest leverages the strengths of both algorithms to enhance overall predictive performance. Naive Bayes can capture probabilistic relationships in the data efficiently, while Random Forest excels in capturing complex interactions and handling high-dimensional datasets. By combining these models, the ensemble can potentially achieve better generalization and robustness.

In [316]:
from sklearn.ensemble import VotingClassifier

# Build a list of (name, model) tuples
models = [('Naive Bayes', nb_model), ('Random Forest', rf_model)]

# Create the ensemble
ensemble = VotingClassifier(estimators=models, voting='hard')

# Train the ensemble
ensemble.fit(X_train_vect, y_train)

# Evaluate the ensemble
y_pred_ensemble = ensemble.predict(X_test_vect)
accuracy_ensemble = accuracy_score(y_test, y_pred_ensemble)
# Compute F1 for the ensemble
f1_ensemble = f1_score(y_test, y_pred_ensemble)

# print("Ensemble accuracy:", accuracy_ensemble)
# print("Ensemble F1:", f1_ensemble)

print(confusion_matrix(y_test,y_pred_ensemble))
print(classification_report(y_test,y_pred_ensemble))

[[803  71]
 [221 428]]
              precision    recall  f1-score   support

           0       0.78      0.92      0.85       874
           1       0.86      0.66      0.75       649

    accuracy                           0.81      1523
   macro avg       0.82      0.79      0.80      1523
weighted avg       0.82      0.81      0.80      1523
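
The jump in class 1 precision (0.86) together with the drop in recall (0.66) has a likely explanation in how hard voting handles two voters. When the models disagree, the vote is tied, and scikit-learn documents that ties are resolved in ascending order of class label, so class 0 wins. The sketch below imitates that rule with made-up predictions; `hard_vote` is a hypothetical helper, not the library function.

```python
def hard_vote(predictions_per_model):
    """Majority vote per sample; ties go to the smallest class label."""
    combined = []
    for votes in zip(*predictions_per_model):
        counts = {label: votes.count(label) for label in set(votes)}
        best = max(counts.values())
        # Among the labels tied for the most votes, pick the lowest one
        combined.append(min(label for label, n in counts.items() if n == best))
    return combined

# Made-up predictions covering all four agreement patterns
nb_preds = [1, 1, 0, 0]
rf_preds = [1, 0, 1, 0]
ensemble_preds = hard_vote([nb_preds, rf_preds])
print(ensemble_preds)  # [1, 0, 0, 0]
```

With two binary models, the ensemble predicts a disaster only when both agree, which makes positive predictions more reliable but misses more real disasters. Adding a third model or switching to soft voting would avoid the ties.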



### Creating the Submission File

In [317]:
# Make predictions with the ensemble model
y_pred_ensemble = ensemble.predict(X_test_real_vect)

# Build a DataFrame with the predictions
submission_df = pd.DataFrame({'id': test_df['id'], 'target': y_pred_ensemble})

# Save the DataFrame as a CSV file for submission
submission_df.to_csv('submission_ensemble.csv', index=False)

In [318]:
submission_df.head()

Unnamed: 0,id,target
0,0,0
1,2,1
2,3,1
3,9,1
4,11,1
