# Real or Not? NLP with Disaster Tweets
## Predict which Tweets are about real disasters and which ones are not
https://www.kaggle.com/c/nlp-getting-started
#### Lisa Hwang
#### Posted to GitHub on 3/24/21

### Competition Description from Kaggle
>"Twitter has become an important communication channel in times of emergency.
The ubiquitousness of smartphones enables people to announce an emergency they’re observing in real-time. Because of this, more agencies are interested in programatically monitoring Twitter (i.e. disaster relief organizations and news agencies).

>In this competition, you’re challenged to build a machine learning model that predicts which Tweets are about real disasters and which one’s aren’t. You’ll have access to a dataset of 10,000 tweets that were hand classified. If this is your first time working on an NLP problem, we've created a quick tutorial to get you up and running."

Despite never having Tweeted myself, I decided to try my hand at this competition in order to practice NLP and ML modeling. 

In [1]:
# Importing libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
# Importing the train and test data
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

In [3]:
# Taking a look at the train dataframe
train.head()

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 | NaN | NaN | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | NaN | NaN | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | NaN | NaN | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 | NaN | NaN | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 | NaN | NaN | Just got sent this photo from Ruby #Alaska as ... | 1 |


The train dataframe consists of 5 columns. The ones I'll focus on are ```text``` (the actual Tweet) and ```target``` (1 = Tweet is about a real disaster, 0 = Tweet is not about a real disaster).

In [4]:
# Peeking at the test dataframe
test.head()

| | id | keyword | location | text |
|---|---|---|---|---|
| 0 | 0 | NaN | NaN | Just happened a terrible car crash |
| 1 | 2 | NaN | NaN | Heard about #earthquake is different cities, s... |
| 2 | 3 | NaN | NaN | there is a forest fire at spot pond, geese are... |
| 3 | 9 | NaN | NaN | Apocalypse lighting. #Spokane #wildfires |
| 4 | 11 | NaN | NaN | Typhoon Soudelor kills 28 in China and Taiwan |


The test dataframe has the same columns as the train dataframe, except that the ```target``` column is missing.

Next, I'll continue to review the data.

In [5]:
train.describe()

| | id | target |
|---|---|---|
| count | 7613.0 | 7613.0 |
| mean | 5441.934848 | 0.42966 |
| std | 3137.11609 | 0.49506 |
| min | 1.0 | 0.0 |
| 25% | 2734.0 | 0.0 |
| 50% | 5408.0 | 0.0 |
| 75% | 8146.0 | 1.0 |
| max | 10873.0 | 1.0 |


There are 7,613 rows or Tweets. Are there any nulls in the data that I should be concerned about?

In [6]:
# Checking for nulls in every column of the train dataframe
train.isnull().sum()

id             0
keyword       61
location    2533
text           0
target         0
dtype: int64

There are no nulls in the ```text``` and ```target``` columns. There are some in the ```keyword``` and ```location``` columns, but I can ignore them since I'm not using those columns.

In [7]:
# How many 0s and 1s are there in the dataset?
train['target'].value_counts(normalize = True)

0    0.57034
1    0.42966
Name: target, dtype: float64

It looks like about 57% of the Tweets are not disaster-related while 43% of them are. 
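
This class balance also gives a floor for accuracy: a model that always predicts the majority class would be right about 57% of the time. Here is a minimal sketch of that baseline using scikit-learn's ```DummyClassifier```. The labels are made up to mirror the 57/43 split; they are not the competition data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels with roughly the same 57/43 split as the training data
y_toy = np.array([0] * 57 + [1] * 43)
X_toy = np.zeros((len(y_toy), 1))  # the dummy model ignores the features

# Always predict the most frequent class
baseline = DummyClassifier(strategy = 'most_frequent')
baseline.fit(X_toy, y_toy)

baseline_accuracy = baseline.score(X_toy, y_toy)
print(baseline_accuracy)
```

Any real model should beat this number comfortably to be worth keeping.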

Now I can start working with the data.

### Modeling

In [8]:
# Defining our X and y variables
X = train['text']
y = train['target']

In [9]:
# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = y, random_state = 115)
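
As a rough illustration of what ```stratify = y``` does, the sketch below splits a toy dataset (placeholder strings and made-up labels, not the competition data) and compares the share of 1s in each split. With stratification, both splits should keep the original class proportions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 100 placeholder "tweets", 60 labelled 0 and 40 labelled 1
X_toy = pd.Series([f'tweet {i}' for i in range(100)])
y_toy = pd.Series([0] * 60 + [1] * 40)

# Same call pattern as above; the default test size is 25% of the rows
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, stratify = y_toy, random_state = 115)

# Share of positive labels in each split
train_share = y_tr.mean()
test_share = y_te.mean()
print(len(y_tr), len(y_te), train_share, test_share)
```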

I'll keep this pretty simple at first and use ```CountVectorizer``` on the corpus but not adjust any parameters right now.

In [10]:
# Instantiating CountVectorizer
cvec = CountVectorizer()

In [11]:
# Fitting the vectorizer on the corpus
cvec.fit(X_train)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [12]:
# Transforming the corpus
X_train_cv = cvec.transform(X_train)
X_test_cv = cvec.transform(X_test)
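
To make the fit and transform steps more concrete, here is a small sketch on made-up sentences (the ```toy_``` names are hypothetical and not part of the notebook). The vectorizer learns its vocabulary only from the data it is fitted on, so words that first appear at transform time are silently dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up sentences standing in for the training and testing corpora
toy_train = ['Forest fire near the lake', 'The fire is out']
toy_new = ['Huge fire and flood downtown']

toy_cvec = CountVectorizer()
toy_cvec.fit(toy_train)  # the vocabulary comes from toy_train only

# Learned words, lowercased; column order in the matrix is alphabetical
vocab = sorted(toy_cvec.vocabulary_)
print(vocab)

# Only 'fire' is in the vocabulary, so it is the only word counted
new_counts = toy_cvec.transform(toy_new).toarray()
print(new_counts)
```

This is why the vectorizer is fitted on ```X_train``` alone and then reused for ```X_test```: the testing set stays unseen during fitting.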

I'm a fan of the classic logistic regression, so I'll use it for my model.

In [13]:
# Instantiating, fitting, and scoring the model
lr = LogisticRegression()
lr.fit(X_train_cv, y_train)
lr.score(X_train_cv, y_train), lr.score(X_test_cv, y_test)



(0.9702224557715887, 0.8088235294117647)

I'm getting a training accuracy of ```0.9702```, which is pretty respectable, but a much lower testing accuracy of ```0.8088```. The gap of roughly 16 percentage points indicates that the model is overfitting. I'll go ahead and submit it to Kaggle to see how I do.
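
One common way to address overfitting in logistic regression is to strengthen the regularization, which in scikit-learn means lowering ```C```. The sketch below is not something I ran on the competition data; it uses a tiny made-up corpus and arbitrarily chosen ```C``` values just to show the pattern of comparing settings with cross-validation.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus standing in for X_train and y_train
texts = [
    'forest fire spreading near the town',
    'earthquake damages homes downtown',
    'flood waters rising evacuate now',
    'wildfire evacuation ordered for residents',
    'tornado destroys several buildings',
    'storm floods streets and homes',
    'this new album is fire',
    'my phone battery died again',
    'traffic was a disaster this morning',
    'that movie blew my mind',
    'cooking dinner for friends tonight',
    'flooded with emails at work',
]
labels = [1] * 6 + [0] * 6

# Smaller C means stronger regularization
mean_scores = {}
for C in [0.01, 0.1, 1.0]:
    # The pipeline refits the vectorizer inside each fold, avoiding leakage
    pipe = make_pipeline(CountVectorizer(), LogisticRegression(C = C, max_iter = 1000))
    scores = cross_val_score(pipe, texts, labels, cv = 3)
    mean_scores[C] = scores.mean()

print(mean_scores)
```

On the real data, the same loop would take ```X_train``` and ```y_train``` in place of the toy lists, and the best ```C``` would be the one with the highest mean cross-validated score.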

### Generating a CSV to submit to the Kaggle challenge

I used https://www.kaggle.com/catris25/logistic-regression-with-countvectorizer as a reference.

In [14]:
# Transforming the test corpus
test_cv = cvec.transform(test['text'])

In [15]:
# Using features to generate predictions
preds = lr.predict(test_cv)
preds

array([1, 1, 1, ..., 1, 1, 0])

In [16]:
# Creating a dataframe with id and target
preds_df = pd.DataFrame({
    'id': test['id'],
    'target': preds
})

In [17]:
# Checking that the dataframe was created correctly
preds_df.head()

| | id | target |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 2 | 1 |
| 2 | 3 | 1 |
| 3 | 9 | 1 |
| 4 | 11 | 1 |


In [18]:
# Reviewing the dataframe with .describe()
preds_df.describe()

| | id | target |
|---|---|---|
| count | 3263.0 | 3263.0 |
| mean | 5427.152927 | 0.360711 |
| std | 3146.427221 | 0.48028 |
| min | 0.0 | 0.0 |
| 25% | 2683.0 | 0.0 |
| 50% | 5500.0 | 0.0 |
| 75% | 8176.0 | 1.0 |
| max | 10875.0 | 1.0 |


In [19]:
# How many 0s and 1s are there in the predictions?
preds_df['target'].value_counts(normalize = True)

0    0.639289
1    0.360711
Name: target, dtype: float64

The model predicts about 64% of the test Tweets as non-disaster and 36% as disaster.

In [20]:
# Generating a CSV to upload to Kaggle
preds_df.to_csv('preds.csv', index = False)

This submission was originally uploaded to the competition on 12/3/20 and received a score of ```0.79190```. With this entry, I placed 797th out of 1,338 on the leaderboard.
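
One caveat when comparing this score with the accuracy numbers above: I believe the leaderboard for this competition is scored with F1 rather than accuracy (worth confirming on the competition's Evaluation page), so the two are not directly comparable. The sketch below uses made-up labels and predictions to show how the two metrics can diverge when a model misses positives.

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels and predictions: the model misses two of the three disasters
y_true = [1, 0, 1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0]

acc = accuracy_score(y_true, y_pred)  # 6 of 8 correct
f1 = f1_score(y_true, y_pred)         # precision 1.0, recall 1/3

print(acc, f1)
```

On the held-out split, calling ```f1_score(y_test, lr.predict(X_test_cv))``` would give a local estimate on the same scale as the leaderboard, assuming the metric is indeed F1.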