# Introduction
***
One of the most common applications of NLP is sentiment analysis. From running opinion polls to building entire marketing strategies, sentiment analysis has reshaped the way businesses work.

Thousands of text documents can be processed for sentiment (and other features such as named entities, topics, themes, etc.) in seconds.

In this article, we solve the Twitter Sentiment Analysis Practice Problem. We follow a sequence of steps needed to solve a general sentiment analysis problem. We start with preprocessing and cleaning the raw text of the tweets.

Then, we'll explore the cleaned text and try to get some intuition about the context of the tweets. 

After that, we will extract numerical features from the data and use them to train models and identify the sentiments of the tweets.
# Understand the Problem Statement
***
The objective of this task is to **identify hate speech in tweets**. For the sake of simplicity, we say a tweet **contains hate speech if it has a racist or sexist sentiment associated with it.** In other words, the task is to **classify each tweet as racist or sexist, or as neither**.
# Tweet Preprocessing and Cleaning
***
Cleaning the data is necessary to remove noise and make the text more consistent. We remove elements that contribute little to determining the sentiment of a tweet, such as punctuation, special characters, numbers, and terms that carry little explanatory weight.
# The Data
***
You can download the data from [here](https://datahack.analyticsvidhya.com/contest/practice-problem-twitter-sentiment-analysis/).

The collection of tweets is split 65:35 between training and testing. Of the testing set, 30% is public and the rest is private.

The training set, **train.csv**, contains 31,962 labelled tweets. Each line contains the tweet id, its label, and the tweet text.

The testing set, **test_tweets.csv**, contains only the tweet ids and the tweet text. The loading code in this article assumes the two files have been saved locally as tweet_train_set.csv and tweet_test_set.csv.

Here are the imports we will need:

In [1]:
import re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import string
import nltk
import warnings
import os
warnings.filterwarnings("ignore",category=DeprecationWarning)

%matplotlib inline

Set the directory that holds the data:

In [2]:
DATA_DIR = "data/hate_speech_tweet_data/"

Create a helper function for loading the data.

In [3]:
def load_data(data_dir):
    """Read the training and testing CSV files in data_dir into DataFrames."""
    train_file_name = "tweet_train_set.csv"
    test_file_name = "tweet_test_set.csv"
    train_set = pd.read_csv(os.path.join(data_dir, train_file_name))
    test_set = pd.read_csv(os.path.join(data_dir, test_file_name))
    return train_set, test_set

Here is what one tweet looks like:

In [4]:
x_train, x_test = load_data( DATA_DIR )
x_train.iloc[0]['tweet']

' @user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction.   #run'

Note that we have the tweet id, the label and tweet text. The label here is 0 for innocent, or 1 for hate speech.

Here are the data cleaning steps we will perform:

* the Twitter handles are masked as @user for anonymity, so they carry no information and we remove them
* we replace punctuation, numbers and special characters with spaces
* we remove short words with little meaning, such as "pdx", "his" and "all"
* then we split the tweets into tokens
* then we perform stemming, so that, for example, 'loves' and 'loving' both reduce to 'love'; this reduces the total number of unique words in the dataset without losing much information

## Removing Anonymized Twitter Handles (@user)
***
We define a helper function that removes an unwanted pattern from an input string and returns the string with every match of that pattern removed:

In [5]:
def remove_pattern(input_txt, pattern):
    """
    Args:
        input_txt (str) - the input text to work on
        pattern (str) - the regex pattern to remove from the text
    Returns:
        (str) - the input text with every match of the pattern removed
    """
    # Substitute with the pattern itself. Reusing each matched string as a
    # new regex would break on matches that contain regex metacharacters.
    return re.sub(pattern, "", input_txt)

Now we can create a new column containing the "tidied" tweet. 

Our pattern matches an '@' followed by any number of word characters '[\w]*', that is, letters, digits or underscores.

In [6]:
# Raw string so that the backslash in \w reaches the regex engine untouched
anonymized_twitter_handle_pattern = r"@[\w]*"

x_train["tidy_tweet"] = x_train["tweet"].apply(
    remove_pattern, args=(anonymized_twitter_handle_pattern,))
x_test["tidy_tweet"] = x_test["tweet"].apply(
    remove_pattern, args=(anonymized_twitter_handle_pattern,))

Now let's take a look at our "tidied" tweets:

In [7]:
print("Original tweet")
print("--------------")
print(x_train['tweet'].iloc[0])
print("Tidied tweet")
print("--------------")
print(x_train['tidy_tweet'].iloc[0])

Original tweet
--------------
 @user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction.   #run
Tidied tweet
--------------
  when a father is dysfunctional and is so selfish he drags his kids into his dysfunction.   #run
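
To see what the handle pattern matches in isolation, here is a minimal sketch that uses only the standard library `re` module. The sample tweet is made up for illustration and is not taken from the dataset.

```python
import re

# Same pattern as above: an '@' followed by zero or more word characters
# (letters, digits, or underscores).
handle_pattern = r"@[\w]*"

# Hypothetical tweet, made up for illustration.
sample = "@user thanks @some_one99 for the ride #lyft"

handles = re.findall(handle_pattern, sample)
cleaned = re.sub(handle_pattern, "", sample)

print(handles)        # ['@user', '@some_one99']
print(repr(cleaned))  # ' thanks  for the ride #lyft'
```

The extra whitespace left behind is harmless here, because later steps split the text on whitespace and join it back with single spaces.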


## Removing Punctuations, Numbers and Special Characters
***
We can use the "str" accessor of a pandas Series of dtype object to apply the built-in pandas string methods. Here we replace every character that is not a letter or a '#' with a space:

In [8]:
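anything_not_characters_nor_pound = "[^a-zA-Z#]"
# regex=True is required in recent pandas versions, where str.replace
# treats the pattern as a literal string by default
x_train["tidy_tweet"] = x_train["tidy_tweet"].str.replace(
    anything_not_characters_nor_pound, " ", regex=True)
x_test["tidy_tweet"] = x_test["tidy_tweet"].str.replace(
    anything_not_characters_nor_pound, " ", regex=True)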
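
As a rough illustration of what this character class does to a single string, the sketch below applies the same pattern with the standard library `re` module instead of pandas. The sample text is hypothetical.

```python
import re

# Anything that is not an ASCII letter or '#' becomes a space.
not_letter_nor_pound = r"[^a-zA-Z#]"

# Hypothetical text, made up for illustration.
sample = "it's 100% fun!! #run2day"

spaced = re.sub(not_letter_nor_pound, " ", sample)
words = spaced.split()

print(words)  # ['it', 's', 'fun', '#run', 'day']
```

Note the side effects: the apostrophe splits "it's" into two pieces, and the digit inside the hashtag splits it as well. Whether that matters depends on the data.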

## Removing Short Words
***
We can remove all words of length 3 or less. Alternatively, this step could be swapped out for stop-word removal using an established package.

In [9]:
def remove_words_less_or_equal_three_characters( text ):
    return " ".join([word for word in text.split() if len(word) > 3])

In [10]:
x_train["tidy_tweet"] = x_train["tidy_tweet"].apply(remove_words_less_or_equal_three_characters)
x_test["tidy_tweet"] = x_test["tidy_tweet"].apply(remove_words_less_or_equal_three_characters)

Now we can look at the tidied tweets versus the original tweets:

In [11]:
print("Original tweet")
print("--------------")
print(x_train['tweet'].iloc[0])
print("Tidied tweet")
print("--------------")
print(x_train['tidy_tweet'].iloc[0])

Original tweet
--------------
 @user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction.   #run
Tidied tweet
--------------
when father dysfunctional selfish drags kids into dysfunction #run
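
For comparison, the stop-word alternative mentioned above could look roughly like the sketch below. The `STOP_WORDS` set and the `remove_stop_words` helper are hypothetical and hand-written for illustration; an established package would supply a much longer list.

```python
# Tiny hand-picked stop-word set, for illustration only.
STOP_WORDS = {"a", "and", "he", "his", "into", "is", "so", "when"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Drop every whitespace-separated word that appears in stop_words."""
    return " ".join(
        word for word in text.split() if word.lower() not in stop_words
    )

sample = (
    "when a father is dysfunctional and is so selfish "
    "he drags his kids into his dysfunction #run"
)
filtered = remove_stop_words(sample)

print(filtered)
# father dysfunctional selfish drags kids dysfunction #run
```

Unlike the length-based filter, this drops longer function words such as "when" and "into", and it would keep short but meaningful words that are not in the list.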


## Tokenization
***
Now we tokenize the cleaned tweets in our dataset. Tokenization splits a string of text into individual tokens; here we simply split on whitespace.

In [12]:
x_train["tokenized_tidy_tweet"] = x_train["tidy_tweet"].apply(lambda tweet: tweet.split())
x_test["tokenized_tidy_tweet"] = x_test["tidy_tweet"].apply(lambda tweet: tweet.split())

print(x_train["tokenized_tidy_tweet"].iloc[0])

['when', 'father', 'dysfunctional', 'selfish', 'drags', 'kids', 'into', 'dysfunction', '#run']


## Stemming
***
Stemming is a rule-based process of stripping the suffixes ("ing", "ly", "es", "s", etc.) from a word. This way, "play", "played", "plays", and "playing" all reduce to "play".

In [13]:
stemmer = nltk.stem.porter.PorterStemmer()

x_train["tokenized_tidy_tweet"] = x_train["tokenized_tidy_tweet"].\
    apply(lambda tweet: [stemmer.stem(token) for token in tweet])
x_test["tokenized_tidy_tweet"] = x_test["tokenized_tidy_tweet"].\
    apply(lambda tweet: [stemmer.stem(token) for token in tweet])

print(x_train["tokenized_tidy_tweet"].iloc[0])

['when', 'father', 'dysfunct', 'selfish', 'drag', 'kid', 'into', 'dysfunct', '#run']
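
To make the idea of rule-based suffix stripping concrete, here is a deliberately naive sketch. It is not the Porter algorithm used above, which applies many more rules and conditions; the `SUFFIXES` tuple and the `naive_stem` helper are hypothetical.

```python
# Suffixes are tried in this order; the first one that fits is stripped.
SUFFIXES = ("ing", "ed", "ly", "es", "s")

def naive_stem(word, min_stem_len=3):
    """Strip one known suffix, provided at least min_stem_len characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

stems = [naive_stem(w) for w in ["play", "played", "plays", "playing"]]
print(stems)  # ['play', 'play', 'play', 'play']

print(naive_stem("drags"), naive_stem("kids"))  # drag kid
```

A toy like this makes obvious mistakes on many words, which is why an established stemmer is the better choice in practice.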


## Bringing it All Together
***
Now that we've done a bunch of preprocessing, we can put all of our lists of tokens back into contiguous strings. This way we can perform our vectorization of the strings as we want.

In [None]:
# Join the stemmed token lists (not the already-joined strings) back into text
x_train["tidy_tweet"] = x_train["tokenized_tidy_tweet"].apply(lambda token_list: " ".join(token_list))
x_test["tidy_tweet"] = x_test["tokenized_tidy_tweet"].apply(lambda token_list: " ".join(token_list))
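
As a recap, the cleaning steps before stemming can be chained into a single function. This is a minimal sketch using only the standard library; `clean_tweet` is a hypothetical helper name, and stemming is left out so the example has no dependencies.

```python
import re

def clean_tweet(tweet):
    """Apply handle removal, character filtering and short-word removal."""
    # 1. Remove anonymized handles such as '@user'.
    tweet = re.sub(r"@[\w]*", "", tweet)
    # 2. Replace everything except letters and '#' with a space.
    tweet = re.sub(r"[^a-zA-Z#]", " ", tweet)
    # 3. Tokenize on whitespace, keep words longer than 3 characters, rejoin.
    return " ".join(word for word in tweet.split() if len(word) > 3)

raw = (
    " @user when a father is dysfunctional and is so selfish "
    "he drags his kids into his dysfunction.   #run"
)
tidy = clean_tweet(raw)

print(tidy)
# when father dysfunctional selfish drags kids into dysfunction #run
```

Wrapping the steps in one function makes it easy to apply the same cleaning to both the training and the testing set with a single `apply` call.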