# An NLP workshop - Categorizing tweets into relevant or non-relevant
#### adapted from https://github.com/hundredblocks/concrete_NLP_tutorial.git

## 1. EDA - Exploratory Data Analysis

In this notebook, we will load and explore the dataset to get a better feel for it.

### Our Dataset: Disasters on social media
Contributors looked at over 10,000 tweets retrieved with a variety of searches like “ablaze”, “quarantine”, and “pandemonium”, then noted whether the tweet referred to a disaster event (as opposed to a joke with the word or a movie review or something non-disastrous). Thank you [Crowdflower](https://www.crowdflower.com/data-for-everyone/).

First, let's import all the libraries we will need up front.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import nltk

The following line ensures graphs are rendered inline in the notebook.

In [None]:
%matplotlib inline

### Let's inspect the data

In [None]:
# Read the CSV using the ISO-8859-1 encoding and give the columns short names
questions = pd.read_csv("socialmedia_relevant_cols.csv", encoding='ISO-8859-1')
questions.columns=['text', 'choose_one', 'class_label']
questions.head()

In [None]:
questions.tail()

In [None]:
questions.describe()

In [None]:
questions.info()

### Data Overview
Let's look at our class balance.

In [None]:
questions.groupby("choose_one").count()

We can see our classes are pretty balanced, with a slight oversampling of the "Irrelevant" class. There are a few samples that the labellers weren't sure about - maybe we should drop those?
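
If we do decide to drop the unsure samples, a minimal sketch could look like the following. It runs on a small toy DataFrame rather than on `questions`, and the label strings (`"Relevant"`, `"Irrelevant"`, `"Unsure"`) are placeholders chosen for illustration; check the output of the `groupby` cell above for the actual values in `choose_one`.

```python
import pandas as pd

# Toy stand-in for the `questions` DataFrame.
# The label strings below are assumptions, not the dataset's real values.
toy = pd.DataFrame({
    "text": [
        "Forest fire near the lake",
        "This mixtape is fire",
        "Flood warning issued for the valley",
        "Hard to tell what this one is about",
    ],
    "choose_one": ["Relevant", "Irrelevant", "Relevant", "Unsure"],
})

# Fraction of rows per label, which is easier to compare than raw counts
class_share = toy["choose_one"].value_counts(normalize=True)
print(class_share)

# Keep only rows where the labellers reached a definite decision
UNSURE_LABEL = "Unsure"  # hypothetical label name
decided = toy[toy["choose_one"] != UNSURE_LABEL].reset_index(drop=True)
print(f"Kept {len(decided)} of {len(toy)} rows")
```

Whether dropping is the right call depends on how many rows are affected. If it is only a handful, removing them is unlikely to change the class balance much.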

Let's see what the length of each tweet is

In [None]:
questions['tweet_len'] = questions.text.apply(len)

In [None]:
questions.tweet_len.hist()

So there are quite a lot of 140-character tweets - not unexpected, given Twitter's 140-character limit at the time!

### Tokenizing

Usually in NLP we work with words, not characters, so let's get a sense of the words by breaking down each tweet into a list of tokens.

In [None]:
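from nltk.tokenize import RegexpTokenizer

# \w+ matches runs of letters, digits and underscores; everything else is dropped
tokenizer = RegexpTokenizer(r'\w+')

# Store each tweet's list of tokens in a new column
questions["tokens"] = questions["text"].apply(tokenizer.tokenize)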
questions.head()

Let's see how many words our tweets contain

In [None]:
questions['token_len'] = questions.tokens.apply(len)
questions.token_len.hist()

How many unique words do we have? The set of unique words is our *vocabulary*.

In [None]:
all_words = {word for tokens in questions.tokens for word in tokens}
print(f"Total number of unique tokens {len(all_words)}")
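
The size of the vocabulary tells us nothing about how often each token occurs. As a rough sketch, `collections.Counter` can give both at once. The toy tweets below are made up for illustration, and `re.findall(r"\w+", ...)` is used as a standard-library stand-in for the `RegexpTokenizer` above.

```python
import re
from collections import Counter

# Made-up tweets, used only to keep the example self-contained
toy_tweets = [
    "Fire in the hills, fire everywhere",
    "The flood is rising",
    "the Fire is out",
]

# Same pattern as the tokenizer above: runs of word characters
token_lists = [re.findall(r"\w+", tweet) for tweet in toy_tweets]

# Count every token across all tweets
token_counts = Counter(tok for tokens in token_lists for tok in tokens)

# The keys of the counter are the vocabulary
vocabulary = set(token_counts)
print(f"Total number of unique tokens {len(vocabulary)}")

# Most frequent tokens first
print(token_counts.most_common(3))
```

On the real data, the same idea would be `Counter(word for tokens in questions.tokens for word in tokens)`. Note that "Fire" and "fire" are counted separately here, which is one reason to think about case.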

Let's take a closer look at some of the tokens ...

In [None]:
from pprint import pprint
pprint(all_words)

#### Some observations:

- There's a mixture of all lower-case words and words starting with an upper-case letter. Does case matter for our purposes? Should we just convert everything to lower-case?
- There are a lot of words that are random strings of letters. Where are they coming from? URLs?
- There are a lot of numbers. Do they matter for our purposes? What should we do with them?
- The RegexpTokenizer seems to have removed almost all punctuation. Is that a good thing?
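
To make these questions concrete, here is a small sketch of one possible answer: lower-case every token and collapse purely numeric tokens into a single placeholder. The function name `normalize_token` and the placeholder `<num>` are hypothetical choices for this example, not part of the dataset or of NLTK.

```python
import re


def normalize_token(token: str) -> str:
    """Lower-case a token; replace all-digit tokens with a placeholder."""
    token = token.lower()
    if token.isdigit():
        return "<num>"  # hypothetical placeholder for any number
    return token


raw_tokens = ["Fire", "fire", "FIRE", "2015", "911", "a1b2"]
normalized = [normalize_token(tok) for tok in raw_tokens]
print(normalized)

# Compare vocabulary size before and after normalizing
vocab_before = set(raw_tokens)
vocab_after = set(normalized)
print(len(vocab_before), "->", len(vocab_after))

# \w+ drops punctuation and also splits contractions at the apostrophe
punct_tokens = re.findall(r"\w+", "Help!!! It's on fire...")
print(punct_tokens)
```

Normalizing shrinks the vocabulary, but it also throws information away: "FIRE" in capitals or a run of exclamation marks might carry a signal about urgency. Whether that trade-off is worth it is something to test rather than assume.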

Let's look again at a random sample of some full tweets

In [None]:
pd.set_option('display.max_colwidth', 100)

In [None]:
questions.text.sample(10)

The output will vary because the sample is random. You can run the above cell multiple times to see different tweets.

#### Further observations

- What should we do with @ mentions?
- URLs are typically shortened to `http://t.co/<some random chars>`
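
One option for both points is to replace every URL and every mention with a fixed placeholder before tokenizing, so the random characters never reach the vocabulary. The sketch below is only one way to do it; the function name `standardize_tweet`, the placeholders `URL` and `MENTION`, and the example tweet are all made up for illustration.

```python
import re

# http:// or https:// followed by everything up to the next whitespace
URL_PATTERN = re.compile(r"https?://\S+")
# An @ sign followed by word characters (letters, digits, underscore)
MENTION_PATTERN = re.compile(r"@\w+")


def standardize_tweet(text: str) -> str:
    """Replace URLs and @ mentions with fixed placeholder words."""
    text = URL_PATTERN.sub("URL", text)
    text = MENTION_PATTERN.sub("MENTION", text)
    return text


example = "@jane_doe huge fire downtown http://t.co/abc123XYZ stay safe"
cleaned = standardize_tweet(example)
print(cleaned)
```

Applied to the DataFrame, this would be `questions["text"].apply(standardize_tweet)`. Keeping a placeholder rather than deleting the match preserves the fact that a tweet contained a link or a mention, which may itself be informative.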