In [25]:
# Imports
import pandas as pd
import numpy as np
import seaborn as sns
import nltk
import string
import re

Read in the data.

In [4]:
train_df = pd.read_csv("../data/raw/train.csv")

In [6]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7613 entries, 0 to 7612
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        7613 non-null   int64 
 1   keyword   7552 non-null   object
 2   location  5080 non-null   object
 3   text      7613 non-null   object
 4   target    7613 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 297.5+ KB


First, I will explore the data by checking the proportion of missing values in each column and looking at an example tweet from each class (target == 0 and target == 1).

In [9]:
# Percentage of missing values, shown only for columns that have any
(100 * train_df.isnull().mean()).loc[lambda pct: pct > 0]

keyword      0.801261
location    33.272035
dtype: float64

Location is missing for about a third of the rows (33%), so it probably won't be of much use to me. Keyword is missing for under 1% of the rows, so it could be useful for the analysis.

In [13]:
train_df[train_df['target'] == 1]['text'].iloc[0]

'Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all'

In [14]:
train_df[train_df['target'] == 0]['text'].iloc[0]

"What's up man?"

These two tweets show a few differences that might help our predictions. The word "earthquake" is an obvious possible indicator of the topic of the tweet, though it will not always be reliable, as "earthquake" is not used exclusively to describe disasters.
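One way to check how strongly a single word such as "earthquake" tracks the label is to compare how often it appears in each class. The sketch below uses a small made-up frame (`toy_df` and its rows are invented for illustration and are not taken from train.csv); the same expression could be pointed at `train_df`.

```python
import pandas as pd

# Hypothetical toy data standing in for train_df; the rows are invented.
toy_df = pd.DataFrame({
    "text": [
        "Massive earthquake hits the coast",
        "That concert was an earthquake of sound",
        "Earthquake damage reported downtown",
        "What's up man?",
    ],
    "target": [1, 0, 1, 0],
})

# Flag tweets that mention the word, ignoring case.
mentions = toy_df["text"].str.contains("earthquake", case=False)

# Share of tweets in each class that mention the word.
share_by_target = mentions.groupby(toy_df["target"]).mean()
print(share_by_target)
```

In this toy frame the word appears in every disaster tweet but also in half of the non-disaster ones, which is the kind of overlap described above.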

My initial thoughts for cleaning the data:

- Lowercasing. So that words that are the same can be identified regardless of case.
- Removing punctuation. Again, so that identical strings can be matched, though it might be a good idea to extract the hashtags first, as these will give an indication of topic in some cases.
- Normalise similar words. As this is social media data, I expect there to be slang and elongated spellings (gud or gooood instead of good), so it will be important to map these words to their canonical form.
- Stopword removal. Removing words that aren't important to determining topic will help to reduce the dimensions of the processed data.
- Stemming. Reducing words to a common stem, so that different forms of the same word are treated as one token, could be important for determining similarities between tweets.
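A rough sketch of two of these steps, normalising elongated words and removing stopwords, using only the standard library. The names `STOPWORDS`, `squeeze_repeats` and `clean_tokens` are hypothetical, and the stopword list is deliberately tiny; a real list would be much longer.

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"the", "a", "is", "of", "this", "are", "so"}

def squeeze_repeats(word):
    """Collapse runs of three or more identical characters down to two."""
    return re.sub(r"(.)\1{2,}", r"\1\1", word)

def clean_tokens(text):
    """Lowercase, split on whitespace, squeeze repeats and drop stopwords."""
    tokens = [squeeze_repeats(tok) for tok in text.lower().split()]
    return [tok for tok in tokens if tok not in STOPWORDS]

print(clean_tokens("This is sooooo gooood"))
```

Squeezing repeats down to two characters turns "gooood" into "good" but leaves "sooooo" as "soo", so this step alone does not reach the canonical form in every case; a dictionary lookup or a slang mapping would still be needed for words like "gud".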

In [17]:
def lowercase_text(col):
    # Vectorised string method; lowercases every value in the column
    return col.str.lower()

In [19]:
train_df['text'] = lowercase_text(train_df['text'])

As this is just meant to be a short introduction to NLP, I'm going to ignore the importance of the hashtags and strip the # symbol along with the other punctuation.

In [39]:
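def remove_punctuation(col):
    # Translation table that deletes every character in string.punctuation
    # (ASCII punctuation only, so characters such as curly quotes are kept)
    table = str.maketrans('', '', string.punctuation)

    return col.apply(lambda s: s.translate(table))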

In [40]:
train_df['text'] = remove_punctuation(train_df['text'])
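Should the hashtags turn out to matter later, they would have to be pulled out before the punctuation is stripped, since the # is lost at that point. A minimal sketch of how that could look; `HASHTAG_PATTERN` and `extract_hashtags` are hypothetical names, not part of this notebook.

```python
import re

# Matches a '#' followed by one or more word characters
# (letters, digits or underscore) and captures the word.
HASHTAG_PATTERN = re.compile(r"#(\w+)")

def extract_hashtags(text):
    """Return the hashtag words in a tweet, without the leading '#'."""
    return HASHTAG_PATTERN.findall(text)

sample = "our deeds are the reason of this #earthquake may allah forgive us all"
print(extract_hashtags(sample))
```

On the real data this could be applied to the text column with `apply` and stored in a separate column before the punctuation is removed.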