# Twitter Hate Speech Sentiment Analysis
This notebook was written to solve the <a href="https://datahack.analyticsvidhya.com/contest/practice-problem-twitter-sentiment-analysis/">Twitter Sentiment Analysis practice problem</a> on Analytics Vidhya DataHack.<br>


In [1]:
#import the necessary libraries for dataset preparation, feature engineering, model training
from sklearn import model_selection, preprocessing, metrics, svm, ensemble
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import pandas as pd, numpy, string
from nltk.tokenize import WordPunctTokenizer
from nltk.stem import PorterStemmer
from bs4 import BeautifulSoup
# Regular expressions, used to remove special characters
import re

## 1. Dataset preparation and Cleaning

For this exercise, I removed special characters, converted all words to lower case, and made sure there were no null rows. I also created a data frame holding the cleaned tweets and their labels to mirror the input CSV file.

In [4]:
porter=PorterStemmer()
#Import Training and Testing Data
train = pd.read_csv('train.csv')
print("Training Set:", train.shape, len(train))
test = pd.read_csv('test_tweets.csv')
print("Test Set:", test.shape, len(test))
#Tokenize words in order to clean and stem
tok = WordPunctTokenizer()
# Patterns to remove @mentions and URLs
pat1 = r'@[A-Za-z0-9]+'
pat2 = r'https?://[A-Za-z0-9./]+'
combined_pat = r'|'.join((pat1, pat2))

def tweet_cleaner(text):
    soup = BeautifulSoup(text, 'lxml')
    souped = soup.get_text()
    stripped = re.sub(combined_pat, '', souped)
    # Python 3 strings are already decoded, so only swap out the Unicode
    # replacement character left behind by earlier decoding errors
    clean = stripped.replace(u"\ufffd", "?")
    letters_only = re.sub("[^a-zA-Z]", " ", clean)
    lower_case = letters_only.lower()
    # The letters_only substitution above leaves unnecessary white space,
    # so tokenize and re-join the words to collapse it
    words = tok.tokenize(lower_case)
    # Stem each word, then rejoin the stems to create the cleaned tweet
    words = " ".join(porter.stem(word) for word in words)
    return words

Training Set: (31962, 3) 31962
Test Set: (17197, 2) 17197
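
To make the regex steps of `tweet_cleaner` concrete, here is a minimal standard-library sketch that applies the same mention/URL patterns and letters-only filter to a single tweet. The sample tweet and the `simple_clean` helper are illustrative assumptions, and the sketch deliberately leaves out the BeautifulSoup and Porter stemming steps.

```python
import re

# Same patterns as in the cell above
pat1 = r'@[A-Za-z0-9]+'
pat2 = r'https?://[A-Za-z0-9./]+'
combined_pat = r'|'.join((pat1, pat2))

def simple_clean(text):
    """Hypothetical, stemming-free approximation of tweet_cleaner."""
    # Drop @mentions and URLs
    stripped = re.sub(combined_pat, '', text)
    # Replace every non-letter with a space, then lower-case
    letters_only = re.sub("[^a-zA-Z]", " ", stripped).lower()
    # split()/join collapses the runs of spaces created above
    return " ".join(letters_only.split())

sample = "@user Loving the #sunshine today!! https://t.co/abc123 :)"
cleaned = simple_clean(sample)
print(cleaned)
```

One detail worth knowing: `pat1` does not include underscores, so a handle such as `@user_name` is only partly removed and leaves `name` in the cleaned text.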


In [5]:
# Clean every tweet in the training and test sets
clean_tweet_texts = [tweet_cleaner(tweet) for tweet in train['tweet']]
test_tweet_texts = [tweet_cleaner(tweet) for tweet in test['tweet']]

In [6]:
train_clean = pd.DataFrame(clean_tweet_texts,columns=['tweet'])
train_clean['label'] = train.label
train_clean['id'] = train.id
test_clean = pd.DataFrame(test_tweet_texts,columns=['tweet'])
test_clean['id'] = test.id

In [7]:
# split the dataset into training and validation datasets 
train_x, valid_x, train_y, valid_y = model_selection.train_test_split(train_clean['tweet'],train_clean['label'])
# label encode the target variable 
encoder = preprocessing.LabelEncoder()
train_y = encoder.fit_transform(train_y)
valid_y = encoder.transform(valid_y)  # reuse the mapping learned on train_y

A TF-IDF score represents the relative importance of a term within a document and across the entire corpus. It is the product of two terms: the normalized Term Frequency (TF), and the Inverse Document Frequency (IDF), computed as the logarithm of the number of documents in the corpus divided by the number of documents in which the term appears.<br>
<u>Word Level TF-IDF</u>: Matrix representing the TF-IDF scores of every term in the different documents
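
As a worked illustration of the definition above, the following sketch computes TF-IDF by hand on a made-up three-document corpus. The documents and the `tfidf` helper are assumptions for illustration. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from this plain textbook formula.

```python
import math
from collections import Counter

# Toy corpus of already-cleaned, stemmed tweets (made up for illustration)
docs = ["love thi sunni day", "hate thi rain", "love love love"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for doc in tokenized for term in set(doc))

def tfidf(term, doc):
    """Plain TF-IDF: normalized term frequency times log(N / df)."""
    tf = doc.count(term) / len(doc)
    idf = math.log(n_docs / doc_freq[term])
    return tf * idf

first = tokenized[0]
for term in first:
    print(term, round(tfidf(term, first), 3))
```

In the first document, `sunni` and `day` appear in only one document of the corpus, so they score higher than `love` and `thi`, which each appear in two.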

In [8]:
# word level tf-idf
tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', stop_words='english', max_features=100000)
tfidf_vect.fit(train_clean['tweet'])
xtrain_tfidf =  tfidf_vect.transform(train_x)
xvalid_tfidf =  tfidf_vect.transform(valid_x)

In [9]:
# Return the F1 score on the validation set
def train_model(classifier, feature_vector_train, label, feature_vector_valid):
    # fit the training dataset on the classifier
    classifier.fit(feature_vector_train, label)
    
    # predict the labels on validation dataset
    predictions = classifier.predict(feature_vector_valid)    

    return metrics.f1_score(valid_y,predictions)
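
For reference, the F1 score is the harmonic mean of precision and recall for the positive class. The sketch below computes it by hand on made-up labels and compares it with plain accuracy; the `f1_binary` helper and the label lists are illustrative assumptions, not part of the original notebook.

```python
def f1_binary(y_true, y_pred, positive=1):
    """Hand-rolled binary F1 for the given positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up labels: 3 positive tweets out of 8
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 0, 0, 1, 0, 0, 1]

accuracy_score = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1_score = f1_binary(y_true, y_pred)
print("accuracy:", accuracy_score, "F1:", f1_score)
```

Because F1 ignores true negatives, it penalizes a classifier that misses most of the positive class even when its accuracy looks acceptable.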

# SKLEARN Ensemble Model

Extremely Randomized Trees (Extra-Trees) are an ensemble model from the tree-based family, closely related to bagging models such as random forests. Their node splits are heavily randomized: candidate features are sampled at random and, for each one, the split value is drawn at random instead of being optimized.
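
The randomized split can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not scikit-learn's implementation: the `gini` and `random_split` helpers, the toy data, and the use of Gini impurity to rank the random candidates are all choices made for this example.

```python
import random

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def random_split(rows, labels, n_candidates, rng):
    """Pick random features, draw one random threshold per feature,
    and keep the candidate with the lowest weighted impurity."""
    best = None
    for feature in rng.sample(range(len(rows[0])), n_candidates):
        values = [row[feature] for row in rows]
        lo, hi = min(values), max(values)
        if lo == hi:
            continue  # constant feature: nothing to split on
        threshold = rng.uniform(lo, hi)  # drawn at random, not optimized
        left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
        right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or score < best[0]:
            best = (score, feature, threshold)
    return best

# Toy data: feature 0 separates the classes, feature 1 is constant
rows = [[0.1, 5.0], [0.2, 5.0], [0.8, 5.0], [0.9, 5.0]]
labels = [0, 0, 1, 1]
best = random_split(rows, labels, n_candidates=2, rng=random.Random(0))
print(best)
```

A single random split is usually worse than an optimized one, which is why the method relies on averaging many such trees (300 in the cell below).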

In [12]:
# Extremely Randomized Trees Classifier on Word Level TF IDF Vectors
# train_model returns the F1 score on the validation set, not accuracy
f1 = train_model(ensemble.ExtraTreesClassifier(n_estimators=300), xtrain_tfidf, train_y, xvalid_tfidf)
print("Extremely Randomized Trees, WordLevel TF-IDF: ", f1)

Extremely Randomized Trees, WordLevel TF-IDF:  0.6965517241379311


In [13]:
# Return the predicted labels
def train_model_Pred(classifier, feature_vector_train, label, feature_vector_valid):
    # fit the training dataset on the classifier
    classifier.fit(feature_vector_train, label)
    
    # predict the labels on validation dataset
    predictions = classifier.predict(feature_vector_valid)
    
    return predictions

In [14]:
# Now work with the real challenge data: the full training set and the unlabeled test set
train_x=train_clean['tweet']
valid_x=test_clean['tweet']
train_y=train_clean['label']
# label encode the target variable 
encoder = preprocessing.LabelEncoder()
train_y = encoder.fit_transform(train_y)
tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', stop_words='english', max_features=100000)
# Fit on the training tweets only: a second fit on the test tweets would
# discard the vocabulary learned here instead of extending it
tfidf_vect.fit(train_clean['tweet'])
xtrain_tfidf =  tfidf_vect.transform(train_x)
xvalid_tfidf =  tfidf_vect.transform(valid_x)
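
A point that is easy to miss with vectorizers: each call to `fit` rebuilds the vocabulary from scratch, and `transform` silently ignores terms that are not in it. The sketch below shows both effects with a hypothetical `TinyVectorizer` written for this illustration (a plain bag-of-words counter, not scikit-learn's `TfidfVectorizer`) and made-up documents.

```python
class TinyVectorizer:
    """Hypothetical bag-of-words vectorizer, for illustration only."""

    def fit(self, docs):
        # The vocabulary is rebuilt on every call, never extended
        terms = sorted({term for doc in docs for term in doc.split()})
        self.vocabulary_ = {term: i for i, term in enumerate(terms)}
        return self

    def transform(self, docs):
        rows = []
        for doc in docs:
            row = [0] * len(self.vocabulary_)
            for term in doc.split():
                idx = self.vocabulary_.get(term)
                if idx is not None:  # unseen terms are dropped
                    row[idx] += 1
            rows.append(row)
        return rows

train_docs = ["love thi day", "hate thi rain"]
test_docs = ["love thi song"]

vec = TinyVectorizer().fit(train_docs)
train_vocab = set(vec.vocabulary_)
test_row = vec.transform(test_docs)[0]  # "song" has no column
print(sorted(train_vocab), test_row)

vec.fit(test_docs)  # a second fit replaces the training vocabulary
refit_vocab = set(vec.vocabulary_)
print(sorted(refit_vocab))
```

After the second `fit`, training-only terms such as `hate` no longer have a column, so a model trained on those features would lose them.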

In [15]:
# Extremely Randomized Trees on Word Level TF IDF Vectors
# train_model_Pred returns the predicted labels for the test tweets
predictions = train_model_Pred(ensemble.ExtraTreesClassifier(n_estimators=300), xtrain_tfidf, train_y, xvalid_tfidf)
print("Extremely Randomized Trees, WordLevel TF-IDF: ", predictions)

Extremely Randomized Trees, WordLevel TF-IDF:  [0 1 0 ... 0 0 0]


Convert the predictions to a DataFrame and export them to CSV.

In [16]:
d={'id':test['id'],'Tweet':valid_x,'label':predictions}

In [17]:
df=pd.DataFrame(data=d)

In [18]:
df.to_csv("test_predictions.csv", index=False)