# Natural Language Processing with Disaster Tweets

## Problem Description

This is a text classification problem in Natural Language Processing (NLP). The task is to determine whether a given tweet is about a real disaster or not.

It is a binary classification problem: the model must use the context and meaning of the text to decide between the two classes.

## Project Objectives

Build an RNN (Recurrent Neural Network) model capable of:
- Understanding the context and meaning of tweet text
- Accurately distinguishing between tweets about real disasters and unrelated tweets
- Achieving a good F1-score on the test dataset
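
As a rough illustration of these objectives, the sketch below builds a small bidirectional LSTM classifier with Keras. The hyperparameters (`VOCAB_SIZE`, `MAX_LEN`, `EMBED_DIM`, layer sizes, dropout rate) and the helper name `build_model` are assumptions chosen for the example, not values taken from this project.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, Bidirectional

# Assumed hyperparameters, chosen only for illustration
VOCAB_SIZE = 10_000   # number of distinct token ids the embedding can look up
MAX_LEN = 50          # padded sequence length fed to the model
EMBED_DIM = 64        # size of each learned word vector


def build_model() -> keras.Model:
    """Hypothetical helper: a small BiLSTM binary classifier."""
    model = keras.Sequential([
        keras.Input(shape=(MAX_LEN,), dtype="int32"),
        # Maps each token id to a dense vector (the word embedding step)
        Embedding(input_dim=VOCAB_SIZE, output_dim=EMBED_DIM),
        # Reads the sequence in both directions and keeps the final states
        Bidirectional(LSTM(32)),
        Dropout(0.3),
        # One sigmoid unit: estimated probability of class 1 (real disaster)
        Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model


model = build_model()

# A dummy batch of two all-padding sequences, just to check the output shape
dummy_batch = np.zeros((2, MAX_LEN), dtype="int32")
probs = model.predict(dummy_batch, verbose=0)
print(probs.shape)
```

The model outputs one probability per tweet. Thresholding it (for example at 0.5) would give the 0/1 label.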

## Technologies Used

- **Deep Learning**: RNN (LSTM/GRU) for processing sequential data
- **NLP Libraries**: NLTK, spaCy, or other text processing libraries
- **Deep Learning Frameworks**: TensorFlow/Keras or PyTorch
- **Evaluation Metric**: F1-score (commonly used in imbalanced classification problems)
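
To make the evaluation metric concrete, here is a minimal sketch that computes the F1-score from true positives, false positives, and false negatives. The function name `f1_score_from_counts` and the label lists are hypothetical and made up for illustration.

```python
def f1_score_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 is the harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels, made up for illustration (1 = real disaster)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # correctly flagged disasters
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false alarms
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # missed disasters

f1 = f1_score_from_counts(tp, fp, fn)
print(tp, fp, fn, f1)
```

Unlike plain accuracy, F1 ignores true negatives, so a model cannot score well simply by predicting the majority class.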





In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Deep Learning and NLP libraries
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, Bidirectional
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

# NLP libraries
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Set random seed for reproducibility
np.random.seed(42)
tf.random.set_seed(42)

# Load training and test data
train_df = pd.read_csv('../data/train.csv')
test_df = pd.read_csv('../data/test.csv')




## Data Description

### Data Structure

The data used here comes in two files:

1. **train.csv**: Training data
   - **Size**: 7,613 samples
   - **Columns**:
     - `id`: Unique identifier for each tweet
     - `keyword`: Disaster-related keyword (may have null values)
     - `location`: Geographic location (may have null values)
     - `text`: Tweet content (text to be classified)
     - `target`: Classification label (0 = not a disaster, 1 = real disaster)

2. **test.csv**: Test data
   - **Size**: 3,263 samples
   - **Columns**: Same as train.csv but without the `target` column
   - **Purpose**: Predict labels for tweets in the test set
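
A quick way to confirm this structure is to inspect the shape and null counts of the loaded frame. The sketch below uses a tiny made-up DataFrame with the same columns as train.csv, so it runs without the data files. The `"unknown"` placeholder is just one possible way to handle nulls, not necessarily the approach used later.

```python
import pandas as pd

# Tiny stand-in for train.csv with the same columns (rows are made up)
sample = pd.DataFrame({
    "id": [1, 2, 3],
    "keyword": [None, "flood", "fire"],
    "location": [None, None, "Texas"],
    "text": ["sample tweet a", "sample tweet b", "sample tweet c"],
    "target": [1, 0, 1],
})

# Number of rows and columns
print(sample.shape)

# Null count per column; `keyword` and `location` are the ones to watch
null_counts = sample.isnull().sum()
print(null_counts)

# One possible treatment: replace nulls with an explicit placeholder string
filled = sample.fillna({"keyword": "unknown", "location": "unknown"})
print(filled[["keyword", "location"]])
```

The same calls should work on `train_df` and `test_df` once the CSV files are loaded.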

### Data Characteristics

- **Data Type**: Short text (tweets), limited in length (Twitter allows at most 280 characters)
- **Language**: English
- **Format**: CSV (Comma-Separated Values)

### Label Distribution

The training data contains two classes:
- **Class 0**: Tweet not about a real disaster
- **Class 1**: Tweet about a real disaster

## EDA

Whether the two classes are balanced or imbalanced is an important factor to consider when building the model, since it affects both training and the choice of evaluation metric.
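
Class balance can be checked with `value_counts`. The sketch below uses a hypothetical label series in place of `train_df["target"]`, so the counts shown are not the real distribution.

```python
import pandas as pd

# Hypothetical labels standing in for train_df["target"]
target = pd.Series([0, 0, 0, 1, 1, 0, 1, 0, 0, 1], name="target")

# Absolute count per class
counts = target.value_counts().sort_index()

# Share of each class (sums to 1.0)
shares = target.value_counts(normalize=True).sort_index()

print(counts)
print(shares)
```

If one class holds a much larger share than the other, accuracy alone becomes misleading and a metric such as F1 is more informative.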

### Data Processing Challenges

1. **Missing Data**: Need to handle null values in `keyword` and `location`
2. **Text Preprocessing**: Cleaning hashtags, URLs, mentions, special characters
3. **Tokenization**: Breaking down text into appropriate tokens
4. **Sequence Padding**: Normalizing sequence lengths to fit RNN input requirements
5. **Word Embedding**: Converting words into numerical vectors
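
Steps 2 to 4 can be sketched in plain Python as follows. The helper names (`clean_tweet`, `build_vocab`, `encode_and_pad`), the cleaning rules, and the example tweets are assumptions for illustration. In practice the `Tokenizer` and `pad_sequences` utilities imported earlier would normally play the role of the last two helpers.

```python
import re


def clean_tweet(text: str) -> str:
    """Lowercase the tweet and strip URLs, mentions, '#' signs, and punctuation."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs
    text = re.sub(r"@\w+", " ", text)                    # mentions
    text = text.replace("#", "")                         # keep the hashtag word, drop '#'
    text = re.sub(r"[^a-z0-9\s]", " ", text)             # remaining special characters
    return re.sub(r"\s+", " ", text).strip()


def build_vocab(texts):
    """Map each word to an integer id; 0 is reserved for padding, 1 for unknown words."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for t in texts:
        for word in t.split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode_and_pad(text, vocab, max_len):
    """Convert words to ids, truncate to max_len, and pad with zeros at the end."""
    ids = [vocab.get(word, 1) for word in text.split()][:max_len]
    return ids + [0] * (max_len - len(ids))


# Made-up example tweets
tweets = [
    "Forest fire near the town! http://example.com #wildfire",
    "@friend this concert was fire",
]

cleaned = [clean_tweet(t) for t in tweets]
vocab = build_vocab(cleaned)
padded = [encode_and_pad(t, vocab, max_len=8) for t in cleaned]

print(cleaned)
print(padded)
```

Note that this sketch pads at the end of each sequence, whereas Keras `pad_sequences` pads at the front by default. Either convention works as long as training and test data are treated the same way.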
