# Introduction
The goal of this project is to use twitter streaming API to collect text data, and perform Natural Language Processing (NLP) for sentiment analysis, and do a statistical analysis to see if a tweet in reply to different gender / affiliation /  shows statistically meaningful difference in terms of aggression /insult. The degree of "aggression/ insult" in a text are modeled based on https://arxiv.org/ftp/arxiv/papers/1604/1604.06648.pdf
https://arxiv.org/pdf/1604.06650.pdf
https://arxiv.org/pdf/1702.06877.pdf
and references therein.
As a pilot survey, we only include 50 significant figures on twitter according to wikipedia (whose gender is known). Accounts for groups / organizations are hand-picked and removed. We only collect tweets that have replies. 

# Data
Uses of twitter APIata collecting and preprocessing step reference:
https://marcobonzanini.com/2015/03/09/mining-twitter-data-with-python-part-2/

## Training data
For training purposes, we started by collecting aggressive users on twitter and their tweets, provided by Despoina Chatzakou (Mean Birds): https://arxiv.org/pdf/1702.06877.pdf
However many of the tweets are unaccessible due to user suspension / authorization issues.
http://www.yichang-cs.com/yahoo/WWW16_Abusivedetection.pdf and dataset provided therein (e.g. Kaggle challenge) provides insulting comments with verification set.
The list of Google-banned bad words are obtained via https://www.freewebheaders.com/full-list-of-bad-words-banned-by-google/
We also apply GloVe and see how the result appears.
## Test data
Using Twitter streaming API, we collect tweets in reply to top 50 most followed users on twitter according to Wikipedia. The size of the dataset is 10k tweets to start with. 

# Processing tweets
We perform standard pre-processing of the tweeter text data, which involves:
tokenize, removing stop words, twitter-specific features (e.g. RT, @, ...).

# Analysis
We test various sentiment analysis here. We use Vader as a starter, to assess the performance of a typical and easy-to-use sentiment analyzer on our training / verification data. We then test various widely used word embedding and algorithm to assess the performance of themover Vader. Finally we apply the algorithm to collected tweets, visualize and understand the result.

## Word embedding
### TfidfVectorizer
CountVectorizer (simple token count)-> TfidfTransformer. Probably more suitable for a large corpus with consistent context.
### GloVe
pre-trained unsupervised word clustering / vetorization of words provided by Stanford group.

## Classification
### Vader 
provides pre-trained positive / negtaive sentiment analyzer. Tweets can be classifies and the intensity of the sentiment is returned. 
### Logistic Regression
### NaiveBayse
### RNN

In [None]:
# TIP: Install a pip package in the current Jupyter kernel
#import sys
#!{sys.executable} -m pip install numpy

In [1]:
try:
    import json
except ImportError:
    import simplejson as json

# Import the necessary methods from "twitter" library
# from twitter import Twitter, OAuth, TwitterHTTPError, TwitterStream

In [2]:
import pandas as pd
# Obtain list of 50 most followed people worldwide from wikipedia
#import wikipedia as wp
 
#Get the html source
#html = wp.page("List of most-followed Twitter accounts").html().encode("UTF-8")
#df = pd.read_html(html)[0]
#df.to_csv('twitter_list_of_influencers.csv',header=0,index=False)
#user_list = list(df[2])[1:]
# remove '@' infront of the screen names and save the list as user_list
#user_list = [a[1:] for a in user_list]
#print(user_list)

['katyperry', 'justinbieber', 'BarackObama', 'rihanna', 'taylorswift13', 'ladygaga', 'TheEllenShow', 'Cristiano', 'YouTube', 'jtimberlake', 'twitter', 'KimKardashian', 'britneyspears', 'ArianaGrande', 'ddlovato', 'selenagomez', 'cnnbrk', 'realDonaldTrump', 'shakira', 'jimmyfallon', 'BillGates', 'JLo', 'narendramodi', 'BrunoMars', 'Oprah', 'nytimes', 'KingJames', 'MileyCyrus', 'CNN', 'NiallOfficial', 'neymarjr', 'instagram', 'BBCBreaking', 'Drake', 'iamsrk', 'SportsCenter', 'KevinHart4real', 'SrBachchan', 'LilTunechi', 'espn', 'wizkhalifa', 'BeingSalmanKhan', 'Louis_Tomlinson', 'Pink', 'LiamPayne', 'Harry_Styles', 'onedirection', 'aliciakeys', 'realmadrid', 'KAKA']


In [3]:
# Variables that contains the user credentials to access Twitter API 
#ACCESS_TOKEN = '187549975-3LD41YaLCw3XnOvRUJVkqXrjt6gMsuT1HrEUlqDi'
#ACCESS_SECRET = 'ZykwgJpFYQaZdP6vEFHtbqBfkQfgBo9mV0LM3MkmEp5Oj'
#CONSUMER_KEY = 'qQTzJ4OceyUFMLcWkn7ZZ5Wrp'
#CONSUMER_SECRET = 'z6aV6zjg2yac1TiZw5ERUvZ2XLXAGBQJ1OD0yUCFtCczWNKjik'

#oauth = OAuth(ACCESS_TOKEN, ACCESS_SECRET, CONSUMER_KEY, CONSUMER_SECRET)

#t = Twitter(auth=oauth)
#id_list = t.users.lookup(screen_name=','.join(user_list))
# user_list is the list of screen names: we need to convert this to user id using twitter API for filtering purposes
# filter out non-person accounts, non-english accounts and obtain tweets that have replies by hand
# maybe I need to pick out overly controvertial accounts as well..........

In [4]:
#user_id = [id_list[x]['id_str'] for x in range(50)]
#print(user_id)

['21447363', '27260086', '813286', '79293791', '17919972', '14230524', '15846407', '155659213', '10228272', '26565946', '783214', '25365536', '16409683', '34507480', '21111883', '23375688', '428333', '25073877', '44409004', '15485441', '50393960', '85603854', '18839785', '100220864', '19397785', '807095', '23083404', '268414482', '759251', '105119490', '158487331', '180505807', '5402612', '27195114', '101311381', '26257166', '23151437', '145125358', '116362700', '2557521', '20322929', '132385468', '84279963', '28706024', '158314798', '181561712', '209708391', '35094637', '14872237', '60865434']


In [5]:
# Initiate the connection to Twitter Streaming API
#twitter_stream = TwitterStream(auth=oauth)
#iterator = twitter_stream.statuses.filter(lang='en',follow=','.join(user_id))

# Get a sample of the public data following through Twitter
# iterator = twitter_stream.statuses.sample()
# As a pilot survey we set it to stop after getting 1000 tweets. 
# You don't have to set it to stop, but can continue running 
# the Twitter API to collect data for days or even longer. 

# Collecting data can take a long time, so I separately implemented a notebook for stream twitter.
# This code reads file generated from Collect_tweets_50mostpop_users.ipynb
tweets_filename = 'twitter_savereplies_nsamp10000.json'
#tweets_file = open(tweets_filename, "r")
import os.path
# find out a way to filter replies only:
# save to json file
tweet_count = 100
tweet_cnt = tweet_count
try:
    os.path.isfile(tweets_filename)
    print("file exists")
    with open(tweets_filename, 'r') as f:
        tweets = json.load(f) # readline only the first tweet/line
    pass
except:
    print("file does not exist, stream twitter for collecting tweets")
    tweets = []
    for tweet in iterator:
    # select only "replies" to top 50 followed users
        if str(tweet['in_reply_to_user_id']) in user_id:
        #print(tweet['in_reply_to_user_id'])
            tweet_count -= 1
            tweets.append(tweet)
    # Twitter Python Tool wraps the data returned by Twitter 
    # as a TwitterDictResponse object.
    # We convert it back to the JSON format to print/score
    # loads converts json format to python dictionary
    # dumps converts python dictionary to json format
    # The command below will do pretty printing for JSON data, try it out
    # print json.dumps(tweet, indent=4)
            if tweet_count <= 0:
                break     
    with open(tweets_filename, 'w') as outfile:
        json.dump(tweets,outfile,indent=4)
text = []
for i0 in range(tweet_cnt):
    text.append(tweets[i0]['text'])

file exists


In [7]:
print(tweets[1]['in_reply_to_user_id'])

25073877


In [8]:
print(tweets[50]['text'])

@realDonaldTrump when are you going to pay the troops in harms way a visit. Cannot think of a single POTUS who hasn… https://t.co/3nWb4cyJlB


# Processing the text
Using nltk, we pre-process the text here. Tokenized words will be feed into multiple classifier to determine the degree of aggression in the text. We compare their performances and decide which algorithm to use.

In [235]:
import nltk
from nltk.tokenize import word_tokenize
#print(word_tokenize(text[0]))
from nltk.tokenize import TweetTokenizer
tweet_tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=False)

In [10]:
import re
 
emoticons_str = r"""
    (?:
        [:=;] # Eyes
        [oO\-]? # Nose (optional)
        [D\)\]\(\]/\\OpP] # Mouth
    )"""
 
regex_str = [
    emoticons_str,
    r'<[^>]+>', # HTML tags
    r'(?:@[\w_]+)', # @-mentions
    r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)", # hash-tags
    r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&amp;+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+', # URLs
 
    r'(?:(?:\d+,?)+(?:\.?\d+)?)', # numbers
    r"(?:[a-z][a-z'\-_]+[a-z])", # words with - and '
    r'(?:[\w_]+)', # other words
    r'(?:\S)' # anything else
]
    
tokens_re = re.compile(r'('+'|'.join(regex_str)+')', re.VERBOSE | re.IGNORECASE)
emoticon_re = re.compile(r'^'+emoticons_str+'$', re.VERBOSE | re.IGNORECASE)

In [11]:
import string
import re
 
from nltk.corpus import stopwords 
stopwords_english = stopwords.words('english')
 
from nltk.stem import PorterStemmer, WordNetLemmatizer
stemmer = PorterStemmer()
wnl = WordNetLemmatizer()

from nltk.tokenize import TweetTokenizer
 
# Happy Emoticons
emoticons_happy = set([
    ':-)', ':)', ';)', ':o)', ':]', ':3', ':c)', ':>', '=]', '8)', '=)', ':}',
    ':^)', ':-D', ':D', '8-D', '8D', 'x-D', 'xD', 'X-D', 'XD', '=-D', '=D',
    '=-3', '=3', ':-))', ":'-)", ":')", ':*', ':^*', '>:P', ':-P', ':P', 'X-P',
    'x-p', 'xp', 'XP', ':-p', ':p', '=p', ':-b', ':b', '>:)', '>;)', '>:-)',
    '<3'
    ])
 
# Sad Emoticons
emoticons_sad = set([
    ':L', ':-/', '>:/', ':S', '>:[', ':@', ':-(', ':[', ':-||', '=L', ':<',
    ':-[', ':-<', '=\\', '=/', '>:(', ':(', '>.<', ":'-(", ":'(", ':\\', ':-c',
    ':c', ':{', '>:\\', ';('
    ])
 
# all emoticons (happy + sad)
emoticons = emoticons_happy.union(emoticons_sad)
 
def clean_tweets(tweet):
    # remove stock market tickers like $GE
    tweet = re.sub(r'\$\w*', '', tweet)
    # remove old style retweet text "RT"
    tweet = re.sub(r'^RT[\s]+', '', tweet)
    # remove hyperlinks
    tweet = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)
    # remove hashtags
    # only removing the hash # sign from the word
    tweet = re.sub(r'#', '', tweet)    
    # remove abbreviation marks?
    tweet = re.sub(r'/[.]{2,}/g','',tweet)
    # tokenize tweets
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
    tweet_tokens = tokenizer.tokenize(tweet)
 
    tweets_clean = []    
    for word in tweet_tokens:
        if (word not in stopwords_english and # remove stopwords
              word not in emoticons and # remove emoticons
                word not in string.punctuation): # remove punctuation
            #tweets_clean.append(word)
            #stem_word = stemmer.stem(word) 
            # stemming word : tend to remove 'e' from ending of some words. replaced by WordNetLemmatizer
            wnl_word = wnl.lemmatize(word)
            #tweets_clean.append(stem_word)
            tweets_clean.append(wnl_word)
 
    return tweets_clean

In [13]:
def preprocess(s, lowercase=False):
    tokens = tokenize(s)
    if lowercase:
        tokens = [token if emoticon_re.search(token) else token.lower() for token in tokens]
    return tokens

In [14]:
print(preprocess(text[50]))
print(clean_tweets(text[50]))

['@realDonaldTrump', 'when', 'are', 'you', 'going', 'to', 'pay', 'the', 'troops', 'in', 'harms', 'way', 'a', 'visit', '.', 'Cannot', 'think', 'of', 'a', 'single', 'POTUS', 'who', 'hasn', '…', 'https://t.co/3nWb4cyJlB']
['going', 'pay', 'troop', 'harm', 'way', 'visit', 'cannot', 'think', 'single', 'potus', '…']


In [246]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.text import text_to_word_sequence
from keras.preprocessing.sequence import pad_sequences

from keras.models import Model
from keras.models import Sequential

from keras.layers import Input, Dense, Embedding, Conv1D, Conv2D, MaxPooling1D, MaxPool2D
from keras.layers import Reshape, Flatten, Dropout, Concatenate
from keras.layers import SpatialDropout1D, concatenate
from keras.layers import GRU, Bidirectional, GlobalAveragePooling1D, GlobalMaxPooling1D

from keras.callbacks import Callback
from keras.optimizers import Adam

from keras.callbacks import ModelCheckpoint, EarlyStopping
from keras.models import load_model
from keras.utils.vis_utils import plot_model

In [212]:
def get_coefs(word, *arr):
    try:
        return word, np.asarray(arr, dtype='float32')
    except:
        return None, None
    
embeddings_index = dict(get_coefs(*o.strip().split()) for o in open('./GloVe/glove.twitter.27B.200d.txt'))

embed_size=200
for k in list(embeddings_index.keys()):
    v = embeddings_index[k]
    try:
        if v.shape != (embed_size, ):
            embeddings_index.pop(k)
    except:
        pass
            
#embeddings_index.pop(None)  

In [213]:
values = list(embeddings_index.values())
all_embs = np.stack(values)

emb_mean, emb_std = all_embs.mean(), all_embs.std()

In [247]:
#def tokenize(s):
#    return tokens_re.findall(s)
def tokenize(tweet):
    tweet = re.sub(r'http\S+', '', tweet)
    tweet = re.sub(r"#(\w+)", '', tweet)
    tweet = re.sub(r"@(\w+)", '', tweet)
    tweet = re.sub(r'[^\w\s]', '', tweet)
    tweet = tweet.strip().lower()
    tokens = word_tokenize(tweet)
    return tokens

In [250]:
MAX_NB_WORDS = 80000
tokenizer = Tokenizer(num_words=MAX_NB_WORDS,char_level=False)


insult['tokens'] = insult.Comment.map(tokenize)
insult['cleaned_text'] = insult['tokens'].map(lambda tokens: ' '.join(tokens))
print(insult['cleaned_text'][15])
tokenizer.fit_on_texts(insult['cleaned_text'])
print(tokenizer.texts_to_sequences([insult['cleaned_text'][15]]))
#print(insult['cleaned_text'][15])

for some reason u sound retarded lol damn where u been negro
[[17, 82, 419, 56, 138, 195, 141, 305, 162, 56, 154, 823]]


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


In [251]:
train_sequences = tokenizer.texts_to_sequences(train_x)
test_sequences = tokenizer.texts_to_sequences(ver_x)

In [252]:
MAX_LENGTH = 35

padded_train_sequences = pad_sequences(train_sequences, maxlen=MAX_LENGTH)
padded_test_sequences = pad_sequences(test_sequences, maxlen=MAX_LENGTH)

In [253]:

word_index = tokenizer.word_index
nb_words = MAX_NB_WORDS
embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))

oov = 0
for word, i in word_index.items():
    if i >= MAX_NB_WORDS: continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
    else:
        oov += 1

print(oov)

embedding_dim = 200
def get_rnn_model_with_glove_embeddings():
    inp = Input(shape=(MAX_LENGTH, ))
    x = Embedding(MAX_NB_WORDS, embedding_dim, weights=[embedding_matrix], input_length=MAX_LENGTH, trainable=True)(inp)
    x = SpatialDropout1D(0.3)(x)
    x = Bidirectional(GRU(100, return_sequences=True))(x)
    avg_pool = GlobalAveragePooling1D()(x)
    max_pool = GlobalMaxPooling1D()(x)
    conc = concatenate([avg_pool, max_pool])
    outp = Dense(1, activation="sigmoid")(conc)
    
    model = Model(inputs=inp, outputs=outp)
    model.compile(loss='binary_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
    return model

rnn_model_with_embeddings = get_rnn_model_with_glove_embeddings()

filepath="./models/rnn_with_embeddings/weights-improvement-{epoch:02d}-{val_acc:.4f}-%03d.hdf5"%(embedding_dim)
checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=True, mode='max')

batch_size = 512
epochs = 20


history = rnn_model_with_embeddings.fit(x=padded_train_sequences, 
                    y=train_y, 
                    validation_data=(padded_test_sequences, ver_y), 
                    batch_size=batch_size, 
                    callbacks=[checkpoint], 
                    epochs=epochs, 
                    verbose=1)


1089
Train on 3947 samples, validate on 2235 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [254]:
best_rnn_model_with_glove_embeddings = load_model('./models/rnn_with_embeddings/weights-improvement-12-0.7289-200.hdf5')

y_pred_rnn_with_glove_embeddings = best_rnn_model_with_glove_embeddings.predict(
    padded_test_sequences, verbose=1, batch_size=2048)

y_pred_rnn_with_glove_embeddings = pd.DataFrame(y_pred_rnn_with_glove_embeddings, columns=['prediction'])
y_pred_rnn_with_glove_embeddings['prediction'] = y_pred_rnn_with_glove_embeddings['prediction'].map(lambda p: 
                                                                                                    1 if p >= 0.5 else 0)
y_pred_rnn_with_glove_embeddings.to_csv('./predictions/y_pred_rnn_with_glove_embeddings.csv', index=False)



In [255]:
y_pred_rnn_with_glove_embeddings = pd.read_csv('./predictions/y_pred_rnn_with_glove_embeddings.csv')
print(accuracy_score(ver_y, y_pred_rnn_with_glove_embeddings))

0.728859060403


In [165]:
##### Initiate the connection to Twitter REST API
#twitter = Twitter(auth=oauth)
            
# Search for latest tweets about "#nlproc"
#twitter.search.tweets(q='#WorldCup') 
#a = twitter.search.tweets(q='#WorldCup',geocode='30.357245,-97.7611217,1000km',result_type='recent')  #Austin
#print('Austin searches: '+str(len(a['statuses'][:])))

In [17]:
#a = twitter.search.tweets(q='instagram',geocode='30.357245,-97.7611217,100km',count=100,since='2016-06-01')  #Austin
#print('Austin searches: '+str(len(a['statuses'][:])))
#print(a['statuses'][-1]['created_at'])

In [18]:
#import wget

In [19]:
#wget.download(a['statuses'][3]['entities']['urls'][0]['expanded_url'],'/home/ijee/2018_fall/twitter_region_language_timestamp')  

In [20]:
#twitter = Twitter(auth=oauth)
#WhitePower = twitter.search.tweets(q='#WhitePower',count=100,lang='en')

In [21]:
#print(WhitePower['statuses'][4]['text'])

In [22]:
import numpy as np
training_file = './dataset/data'
training_data = pd.read_csv(training_file,delim_whitespace=True,names=['user','category','tweet_id'])
ID = []
for i in range(len(training_data)):
    ID.append(training_data['tweet_id'][i].split(","))
#print(training_data)
# category is divided into four: aggressor, bully, normal and spammer
# each up to (43, 101, 883, 1303) id indices and 5-10 tweets.
# It can take long time to collect these tweets, so I save them as json files with their indices provided in the dataset.

In [23]:
# saved aggressive / bullying tweets for later runs
agg_filename = 'agg_tweets.json'
bull_filename= 'bull_tweets.json'
norm_filename = 'norm_tweets.json' 

In [24]:
#t.statuses.oembed(_id=672282716436467714)
try:
    os.path.isfile(agg_filename)
    with open(agg_filename, 'r') as f:
        agg_tweets = json.load(f) # readline only the first tweet/line
    print("agg_file exist: read and pass")
    pass
except:
    print("agg_file does not exist: collect tweets from twitter database")
    agg_tweets = []
    agg_index = []
    ID_idx = [len(x) for x in ID]
    for i in range(0,43):
        for j in range(ID_idx[i]):
            try:
                agg_tweets.append(t.statuses.show(_id=ID[i][j]))
                agg_index.append(i)
                print(i,j)
            except:
                pass
    with open(agg_filename, 'w') as outfile:
        json.dump(agg_tweets,outfile,indent=4)
    with open('agg_tweets_idx.txt','w') as outfile:
        json.dump(agg_index,outfile)
    print("aggressive tweets collected and saved")

agg_file exist: read and pass


In [25]:
#t.statuses.oembed(_id=672282716436467714)
try:
    os.path.isfile(bull_filename)
    with open(bull_filename, 'r') as f:
        bull_tweets = json.load(f) # readline only the first tweet/line
    print("bull_file exist: read and pass")
    pass
except:
    print("bull_file does not exist: collect tweets from twitter database")
    ID_idx = [len(x) for x in ID]
    bull_tweets = []
    bull_index = []
    for i in range(43,101):
        for j in range(ID_idx[i]):
            try:
                bull_tweets.append(t.statuses.show(_id=ID[i][j]))
                bull_index.append(i)
                print(i,j)
            except:
                pass
    with open(bull_filename, 'w') as outfile:
        json.dump(bull_tweets,outfile,indent=4)
    with open('bull_tweets_idx.txt','w') as outfile:
        json.dump(bull_index,outfile)
    print("bully tweets collected and saved")

bull_file exist: read and pass


In [28]:
#t.statuses.oembed(_id=672282716436467714)
try:
    os.path.isfile(norm_filename)
    with open(norm_filename, 'r') as f:
        norm_tweets = json.load(f) # readline only the first tweet/line
    print("norm_file exist: read and pass")
    pass
except:
    print("norm_file does not exist: collect tweets from twitter database")
    norm_tweets = []
    norm_index = []
    ID_idx = [len(x) for x in ID]
    for i in range(101,883):
        for j in range(ID_idx[i]):
            try:
                norm_tweets.append(t.statuses.show(_id=ID[i][j]))
                norm_index.append(i)
                print(i,j)
            except:
                pass
    with open(norm_filename, 'w') as outfile:
        json.dump(norm_tweets,outfile,indent=4)
    with open('norm_tweets_idx.txt','w') as outfile:
        json.dump(norm_index,outfile)
    print("normal tweets collected and saved")

norm_file exist: read and pass


In [29]:
print((bull_tweets[10]['in_reply_to_user_id']))

None


In [35]:
norm_tweets_text = [norm_tweets[i]['text'] for i in range(len(norm_tweets))]
# tweets from aggressive + bullying users currently available are about 100... :x

In [31]:
# tweeter seems to have worked well and suspended most of the accounts the authors classified as aggressors / bullies,
# which is unfortunate for our purpose but we will proceed with the currently available data...?

# External dataset
Due to the lack of twitter data, we use external data to train the network. One is from Kaggle challenge, and the other is from banned words by Google (not a official list, obtained from https://www.freewebheaders.com/full-list-of-bad-words-banned-by-google/). Kaggle data also provide verification set with labels separatly provided, so we will proceed and test our model on the data. 

In [32]:
banwords_file = './Google_BanWords/full-list-of-bad-words-text-file_2018_03_26.txt'
insult_corpus_file = './kaggle_insult/train.csv'
insult_train = pd.read_csv(insult_corpus_file)
banwords = pd.read_csv(banwords_file,sep='\r\n',engine='python',names = ['banword'])

In [98]:
#insult_train.head(4)
#banwords['word'][462]
insult = insult_train.loc[insult_train['Insult']==1]
normal = insult_train.loc[insult_train['Insult']==0]
print((insult['Comment'][0]))
print(len(normal))
print(clean_tweets(insult['Comment'][8]))

"You fuck your dad."
2898
['either', 'fake', 'extremely', 'stupid', '...', 'maybe', '...']


In [73]:
# let's import verification data from the Kaggle challenge
ver_file = './kaggle_insult/impermium_verification_labels.csv'
ver_data = pd.read_csv(ver_file)
print(ver_data.head(5))

insult = ver_data.loc[ver_data['Insult']==1]
normal = ver_data.loc[ver_data['Insult']==0]

insult_ver_set = []
for tweet in insult['Comment']:
    insult_ver_set.append((bag_of_words(tweet), 'ins'))
    
norm_ver_set = []
for tweet in normal['Comment']:
    norm_ver_set.append((bag_of_words(tweet), 'norm'))

   id  Insult             Date  \
0   1       0  20120603163526Z   
1   2       1  20120531215447Z   
2   3       1  20120823164228Z   
3   4       1  20120826010752Z   
4   5       1  20120602223825Z   

                                             Comment        Usage  
0                 "like this if you are a tribe fan"  PrivateTest  
1              "you're idiot......................."  PrivateTest  
2  "I am a woman Babs, and the only "war on women...  PrivateTest  
3  "WOW & YOU BENEFITTED SO MANY WINS THIS YEAR F...  PrivateTest  
4  "haha green me red you now loser whos winning ...  PrivateTest  


In [148]:
# we try uni-/bigram (word / character) approach here following 
# https://ahmedbesbes.com/overview-and-benchmark-of-traditional-and-deep-learning-models-in-text-classification.html
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, auc, roc_auc_score
from sklearn.externals import joblib

import scipy
from scipy.sparse import hstack

In [153]:
# Try TFIDF 
train_x = insult_train['Comment'][:]
train_y = insult_train['Insult'][:]
ver_x = ver_data['Comment'][:]
ver_y = ver_data['Insult'][:]
vectorizer_word = TfidfVectorizer(max_features=40000,
                             min_df=5, 
                             max_df=0.5, 
                             analyzer='word', 
                             stop_words='english', 
                             ngram_range=(1, 2))

vectorizer_word.fit(train_x)

tfidf_matrix_word_train = vectorizer_word.transform(train_x)
tfidf_matrix_word_test = vectorizer_word.transform(ver_x)

In [154]:
lr_word = LogisticRegression(solver='sag', verbose=2)
lr_word.fit(tfidf_matrix_word_train, train_y)

convergence after 19 epochs took 0 seconds


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s finished


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='sag', tol=0.0001,
          verbose=2, warm_start=False)

In [156]:
joblib.dump(lr_word, 'lr_word_ngram.pkl')

y_pred_word = lr_word.predict(tfidf_matrix_word_test)
pd.DataFrame(y_pred_word, columns=['y_pred']).to_csv('lr_word_ngram.csv', index=False)

In [158]:
y_pred_word = pd.read_csv('lr_word_ngram.csv')
print(accuracy_score(ver_y, y_pred_word))

0.656823266219


In [163]:
vectorizer_char = TfidfVectorizer(max_features=40000,
                             min_df=5, 
                             max_df=0.5, 
                             analyzer='char', 
                             ngram_range=(1, 4))

vectorizer_char.fit(train_x);

tfidf_matrix_char_train = vectorizer_char.transform(train_x)
tfidf_matrix_char_test = vectorizer_char.transform(ver_x)

lr_char = LogisticRegression(solver='sag', verbose=2)
lr_char.fit(tfidf_matrix_char_train, train_y)

y_pred_char = lr_char.predict(tfidf_matrix_char_test)
joblib.dump(lr_char, 'lr_char_ngram.pkl')

pd.DataFrame(y_pred_char, columns=['y_pred']).to_csv('lr_char_ngram.csv', index=False)

convergence after 17 epochs took 0 seconds


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.3s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.3s finished


In [164]:
y_pred_char = pd.read_csv('lr_char_ngram.csv')
print(accuracy_score(ver_y, y_pred_char))

0.663982102908


In [37]:
# feature extractor function
def bag_of_words(tweet):
    words = clean_tweets(tweet)
    words_dictionary = dict([word, True] for word in words)    
    return words_dictionary
 
custom_tweet = "RT @Twitter @chapagain Hello There! Have a great day. :) #good #morning http://chapagain.com.np"
print (bag_of_words(custom_tweet))
'''
Output:
 
{'great': True, 'good': True, 'morning': True, 'hello': True, 'day': True}
'''
# insult tweets feature set
insult_tweets_set = []
for tweet in insult['Comment']:
    insult_tweets_set.append((bag_of_words(tweet), 'ins'))    
for words in banwords['banword']:
    insult_tweets_set.append((bag_of_words(words),'ins'))
# normal tweets feature set
normal_tweets_set = []
for tweet in normal['Comment']:
    normal_tweets_set.append((bag_of_words(tweet), 'norm'))
for tweet in norm_tweets_text:
    normal_tweets_set.append((bag_of_words(tweet), 'norm'))
print(len(insult_tweets_set), len(normal_tweets_set))

{'morning': True, 'hello': True, 'good': True, 'day': True, 'great': True}
2298 5793


In [None]:
print(clean_tweets(insult['Comment'][200]))

In [50]:
print(normal_tweets_set[1])

({'tell': True, 'emabiggestfans': True, 'mbf': True, 'u': True, 'notifs': True, 'vamp': True, 'dm': True, 'rt': True, '1d': True, 'turn': True, 'want': True, 'solo': True}, 'norm')


In [74]:
# try n-fold (minibatch) later
from random import shuffle, seed
seed(10)
shuffle(insult_tweets_set)
shuffle(normal_tweets_set)
 
#test_set = insult_tweets_set[:1000] + normal_tweets_set[:1000]
#train_set = insult_tweets_set[1000:] + normal_tweets_set[1000:]
train_set = insult_tweets_set + normal_tweets_set
test_set = insult_ver_set + norm_ver_set
print(len(test_set),  len(train_set)) # Output: (2000, 8000)

2235 8091


# Word embedding: efficient vectorization of words
By vectorizing words, we can reduce the sparcity and dimensionality of the feature space. 
The representation is normally ~ 100-1000 dimension. The basic concept behind it is that words used in
similar context have similar meanings.
## Embedding layer
Using NN for language modeling or document classification. 
## Word2vec
probably not easy with twitter texts as individual tweets are very short and thus predicting word from the surrounding (both for CBoW and Cont. skip-gram) would be hard? : not much freedom in choosing the sliding window size...
## GloVe
Word2Vec with additional information about the context from the whole corpus. Provides "Unsupervised" clustering of vocab, purely based on a large corpus of data. We use pre-trained GloVe vocabulary vector library collected from Twitter data from https://github.com/stanfordnlp/GloVe (2B tweets, 27B tokens, 1.2M vocab, uncased, 25d, 50d, 100d, & 200d vectors, 1.42 GB).

Reference : https://machinelearningmastery.com/what-are-word-embeddings/

# Vader sentiment analysis (pos,neg)- how useful is it?

In [81]:
#nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [110]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyser = SentimentIntensityAnalyzer()
snt = analyser.polarity_scores(insult['Comment'][0])
print(snt)
#AttributeError: 'SentimentIntensityAnalyzer' object has no attribute 'polarity_score'

{'neu': 0.462, 'compound': -0.5423, 'pos': 0.0, 'neg': 0.538}


In [147]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
analyser = SentimentIntensityAnalyzer()
cnt = 0
insult_cmt = insult.reset_index()
for i in range(len(insult_cmt)):
    snt = analyser.polarity_scores(insult_cmt['Comment'][i])
    if snt.get('neg') > snt.get('pos'):
        cnt += 1
#        print(i,snt)
print(cnt/len(insult))
print(insult_cmt['Comment'][10])
# only about 11% of the ban words are categorized as having negative connotation according to Vader.
# only 9% of the insult comments from Kaggle competition are negative if neg > .5 is counted.
# (when negative > positive: 63% : this is similar to our NBC result).

0 {'neu': 0.462, 'compound': -0.5423, 'pos': 0.0, 'neg': 0.538}
1 {'neu': 0.577, 'compound': -0.7003, 'pos': 0.119, 'neg': 0.304}
2 {'neu': 0.693, 'compound': -0.4767, 'pos': 0.0, 'neg': 0.307}
3 {'neu': 0.769, 'compound': -0.5106, 'pos': 0.0, 'neg': 0.231}
4 {'neu': 0.432, 'compound': -0.5574, 'pos': 0.173, 'neg': 0.395}
7 {'neu': 0.702, 'compound': -0.5267, 'pos': 0.0, 'neg': 0.298}
8 {'neu': 0.421, 'compound': -0.6705, 'pos': 0.0, 'neg': 0.579}
11 {'neu': 0.761, 'compound': -0.8767, 'pos': 0.0, 'neg': 0.239}
13 {'neu': 0.81, 'compound': -0.7506, 'pos': 0.04, 'neg': 0.15}
14 {'neu': 0.619, 'compound': -0.5719, 'pos': 0.0, 'neg': 0.381}
16 {'neu': 0.481, 'compound': -0.8238, 'pos': 0.0, 'neg': 0.519}
18 {'neu': 0.948, 'compound': -0.1027, 'pos': 0.024, 'neg': 0.028}
20 {'neu': 0.701, 'compound': -0.7906, 'pos': 0.068, 'neg': 0.231}
21 {'neu': 0.637, 'compound': -0.8395, 'pos': 0.0, 'neg': 0.363}
23 {'neu': 0.645, 'compound': -0.8775, 'pos': 0.094, 'neg': 0.261}
26 {'neu': 0.364, 'comp

In [75]:
from nltk import classify
import nltk.classify.util, nltk.metrics
from nltk.classify import NaiveBayesClassifier, MaxentClassifier, SklearnClassifier, DecisionTreeClassifier
from sklearn import cross_validation
from sklearn.svm import LinearSVC, SVC
import itertools
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

print("!!!NaiveBayesClassifier!!!")
classifier = NaiveBayesClassifier.train(train_set)
accuracy = classify.accuracy(classifier, test_set)
print(accuracy) # Output: 0.765
print (classifier.show_most_informative_features(10)) 

print("!!!MaxEntropyClassifier!!!")
classifier = MaxentClassifier.train(train_set, 'GIS', trace=0, encoding=None, labels=None, gaussian_prior_sigma=0, max_iter = 1)
accuracy = nltk.classify.util.accuracy(classifier, test_set)
print(accuracy)
print(classifier.show_most_informative_features(10))

print("!!!SVM!!!")
svmclassifier = SklearnClassifier(SVC(kernel='linear',probability=True), sparse=False)
svmclassifier.train(train_set)
accuracy = nltk.classify.util.accuracy(svmclassifier, test_set)
print(accuracy)
#print(classifier.show_most_informative_features(10))

# Decision Tree takes forever to run... need to find an alternative!
#print("!!!DecisionTree!!!")
#dtclassifier = DecisionTreeClassifier.train(train_set)
#accuracy = nltk.classify.util.accuracy(dtclassifier, test_set)
#print(accuracy)
#print(dtclassifier.show_most_informative_features(10))
#pos_precision = nltk.metrics.precision(refsets['pos'], testsets['pos'])
#pos_recall = nltk.metrics.recall(refsets['pos'], testsets['pos'])
#pos_fmeasure = nltk.metrics.f_measure(refsets['pos'], testsets['pos'])
#neg_precision = nltk.metrics.precision(refsets['neg'], testsets['neg'])
#neg_recall = nltk.metrics.recall(refsets['neg'], testsets['neg'])
#neg_fmeasure =  nltk.metrics.f_measure(refsets['neg'], testsets['neg'])

!!!NaiveBayesClassifier!!!
0.6335570469798658
Most Informative Features
                       � = True              ins : norm   =     48.6 : 1.0
               direction = True             norm : ins    =     30.3 : 1.0
                    cunt = True              ins : norm   =     22.7 : 1.0
                    game = True             norm : ins    =     22.6 : 1.0
                     win = True             norm : ins    =     19.7 : 1.0
                  result = True             norm : ins    =     18.4 : 1.0
                      sa = True             norm : ins    =     18.1 : 1.0
                   crawl = True              ins : norm   =     17.6 : 1.0
                   happy = True             norm : ins    =     16.8 : 1.0
                 hundred = True             norm : ins    =     16.8 : 1.0
None
!!!MaxEntropyClassifier!!!
0.5552572706935123
  -0.000 direction==True and label is 'ins'
  -0.000 game==True and label is 'ins'
  -0.000 win==True and label is 'ins'
  -0.0

In [68]:
custom_tweet = "I hated the film. It was a disaster. Poor direction, bad acting."
custom_tweet_set = bag_of_words(agg_tweets[30]['text'])
print(custom_tweet_set)
print (svmclassifier.classify(custom_tweet_set)) # Output: neg
# Negative tweet correctly classified as negative

# probability result
prob_result = svmclassifier.prob_classify(custom_tweet_set)
print (prob_result) # Output: <ProbDist with 2 samples>
print (prob_result.max()) # Output: neg
print (prob_result.prob("ins")) # Output: 0.941844352481
print (prob_result.prob("norm")) # Output: 0.0581556475194
 
 
custom_tweet = "It was a wonderful and amazing movie. I loved it. Best direction, good acting."
custom_tweet_set = bag_of_words(custom_tweet)
 
print (svmclassifier.classify(custom_tweet_set)) # Output: pos
# Positive tweet correctly classified as positive
 
# probability result
prob_result = svmclassifier.prob_classify(custom_tweet_set)
print (prob_result) # Output: <ProbDist with 2 samples>
print (prob_result.max()) # Output: pos
print (prob_result.prob("ins")) # Output: 0.00131055449755
print (prob_result.prob("norm")) # Output: 0.998689445502

{'parliament': True, 'racist': True, 'university': True, '–': True, 'mp': True, 'racism': True, 'lash': True, 'cape': True, 'academ': True, 'town': True, 'academic': True, '...': True, 'uct': True}
norm
<ProbDist with 2 samples>
norm
0.0463495889777
0.953650411022
norm
<ProbDist with 2 samples>
norm
0.0126036422799
0.98739635772
