# An NLP workshop - Categorizing tweets into relevant or non-relevant
#### adapted from https://github.com/hundredblocks/concrete_NLP_tutorial.git

## 2. Preprocessing

Let's clean up the data based on what we observed during our exploratory data analysis (EDA).

In [None]:
import pandas as pd
import nltk
import re
import itertools

In [None]:
%matplotlib inline

### Let's load the data again

In [None]:
questions = pd.read_csv("socialmedia_relevant_cols.csv", encoding='ISO-8859-1')
questions.columns=['text', 'choose_one', 'class_label']
questions.head()

## Data Cleansing

Let's create a function to clean up our data, and save it back to disk for future use.

For now, we only convert everything to lowercase and remove URLs. After examining the data, you might want to add code that removes punctuation marks or other irrelevant characters or words.

In [None]:
def standardize_text(df, text_field):
    # Note: this modifies df in place and also returns it
    df[text_field] = df[text_field].str.lower()  # convert everything to lowercase
    df[text_field] = df[text_field].apply(lambda elem: re.sub(r"http\S+", "", elem))  # get rid of URLs
    # Add additional cleanup code here
    return df

In [None]:
questions = standardize_text(questions, "text")

Let's take a look at the effect of the cleanup.

In [None]:
pd.set_option('display.max_colwidth', 100)

In [None]:
questions.text.sample(10)

In [None]:
questions.text.loc[6880]

Go back to `standardize_text` and add any other cleanup you think would be useful.
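As a starting point, here is a minimal sketch of what extra cleanup could look like. The helper name `clean_tweet` and the specific rules (dropping @mentions, replacing anything other than letters, digits, whitespace and apostrophes with a space) are assumptions chosen for illustration, not part of the original tutorial; adjust them to what you actually see in the data.

```python
import re

def clean_tweet(text):
    """Hypothetical helper: extra cleanup for a single tweet string."""
    text = text.lower()
    text = re.sub(r"http\S+", "", text)        # drop URLs (same pattern as standardize_text)
    text = re.sub(r"@\S+", "", text)           # drop @mentions (assumed to be irrelevant here)
    text = re.sub(r"[^a-z0-9\s']", " ", text)  # replace punctuation and other symbols with a space
    text = re.sub(r"\s+", " ", text).strip()   # collapse the whitespace left behind
    return text

example = clean_tweet("Forest fire near La Ronge Sask. Canada @user http://t.co/abc #wildfire")
print(example)
```

Inside `standardize_text`, a helper like this could be applied with `df[text_field] = df[text_field].apply(clean_tweet)`. Note that this sketch also strips the `#` from hashtags while keeping the word itself, which may or may not be what you want.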

Let's check the impact on the vocabulary

In [None]:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
questions["tokens"] = questions["text"].apply(tokenizer.tokenize)

In [None]:
all_words = {word for tokens in questions.tokens for word in tokens}
print(f"Total number of unique tokens: {len(all_words)}")

In [None]:
from pprint import pprint
pprint(all_words)
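Printing the whole vocabulary can be hard to scan. One possible complement is to count how often each token occurs, so that frequent noise tokens stand out. The sketch below uses a small made-up `token_lists` variable as a stand-in for `questions["tokens"]`; the sample data is invented for illustration only.

```python
from collections import Counter

# Toy stand-in for questions["tokens"]; in the notebook, iterate over that column instead.
token_lists = [
    ["forest", "fire", "near", "la", "ronge"],
    ["fire", "crews", "on", "scene"],
    ["this", "concert", "is", "fire"],
]

# Count every token across all tweets
word_counts = Counter(word for tokens in token_lists for word in tokens)
vocabulary = set(word_counts)

print(f"Total number of unique tokens: {len(vocabulary)}")
print(word_counts.most_common(3))  # the most frequent tokens and their counts
```

Re-running a count like this before and after a change to `standardize_text` gives a quick sense of how much the vocabulary shrinks.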

Once we're happy with the amount of cleanup, let's write the clean data back to disk.

In [None]:
questions.to_csv("clean_data.csv")
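One thing to keep in mind when reloading the file later: `to_csv` stores list-valued columns such as `tokens` as their string representation, so a plain `read_csv` returns strings rather than lists. The sketch below shows one possible way to round-trip such a column, using an in-memory buffer and a toy DataFrame instead of the real file. Passing `index=False` and parsing with `ast.literal_eval` are choices made for this illustration, not something the original notebook does.

```python
import ast
import io

import pandas as pd

# Toy stand-in for the cleaned DataFrame
df = pd.DataFrame({"text": ["forest fire"], "tokens": [["forest", "fire"]]})

buffer = io.StringIO()          # in-memory file, used here instead of clean_data.csv
df.to_csv(buffer, index=False)  # index=False avoids writing the row index as an extra column
buffer.seek(0)

# Parse the stringified lists back into Python lists while reading
restored = pd.read_csv(buffer, converters={"tokens": ast.literal_eval})
print(restored["tokens"][0])
```

Alternatively, the tokens can simply be recomputed from the `text` column after loading.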