<a href="https://colab.research.google.com/github/josbex/HS-detection_in_social_media_posts/blob/master/data_preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import pandas as pd
import csv
import numpy as np

## Dataset

The dataset is loaded from Google Drive, so it must already be stored in your drive. Run the cell below and follow the link to get an authorization code.

In [2]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


Once the drive is mounted, the dataset can be read from it; just specify the path of the dataset you want to read. In this case the OLID training dataset is loaded.

## Training data

In [2]:
df = pd.read_csv("/content/gdrive/My Drive/thesis/dataset/olid-training-v1.0.tsv", sep="\t") 
print(df.head())

      id                                              tweet  ... subtask_b subtask_c
0  86426  @USER She should ask a few native Americans wh...  ...       UNT       NaN
1  90194  @USER @USER Go home you’re drunk!!! @USER #MAG...  ...       TIN       IND
2  16820  Amazon is investigating Chinese employees who ...  ...       NaN       NaN
3  62688  @USER Someone should'veTaken" this piece of sh...  ...       UNT       NaN
4  43605  @USER @USER Obama wanted liberals &amp; illega...  ...       NaN       NaN

[5 rows x 5 columns]


## Data pre-processing 

The data needs some pre-processing. First, tweet-preprocessor (https://pypi.org/project/tweet-preprocessor/) is used to clean the tweets of URLs, user mentions, hashtags and emoticons.

In [4]:
!pip install tweet-preprocessor



In [3]:
import preprocessor as p

## Tokenize tweets

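Here the tweet preprocessor replaces emojis with a placeholder token. The option line that also covers mentions (@) and URLs is commented out, so only emojis are replaced in this step.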

In [4]:
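import re

def tokenize_tweets():
  # Only emojis are tokenized; the URL and mention options are left disabled.
  #p.set_options(p.OPT.URL, p.OPT.MENTION, p.OPT.EMOJI)
  p.set_options(p.OPT.EMOJI)
  # Tokenize the tweet column in one pass instead of calling df.replace per tweet.
  df["tweet"] = df["tweet"].apply(p.tokenize)

def remove_pattern(input_txt, pattern, replace):
    # Replace every substring that matches `pattern` with `replace`.
    r = re.findall(pattern, input_txt)
    for i in r:
        # Escape the match so characters such as '.' or '?' are treated literally.
        input_txt = re.sub(re.escape(i), replace, input_txt)
    return input_txt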
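
For illustration, replacing mentions and URLs with placeholder tokens can be sketched with plain regular expressions. This is not how tweet-preprocessor is implemented: the patterns and the function name `replace_with_placeholders` are assumptions made for this example, and the tokens simply mirror the @USER and URL placeholders already present in the OLID tweets.

```python
import re

# Simplified patterns, chosen for illustration only (assumptions).
MENTION_RE = re.compile(r"@\w+")
URL_RE = re.compile(r"https?://\S+")


def replace_with_placeholders(text):
    """Replace URLs and user mentions with fixed placeholder tokens."""
    text = URL_RE.sub("URL", text)
    text = MENTION_RE.sub("@USER", text)
    return text


example = "@alice check https://example.com and tell @bob_99"
print(replace_with_placeholders(example))
# @USER check URL and tell @USER
```

Running the function on text that already contains @USER leaves it unchanged, so it is safe to apply to tweets that are partly anonymized.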

## Hashtag manipulation

In [5]:
import sys
sys.path.append('/content/gdrive/My Drive/thesis')

In [6]:
import vocab_parser as vp
import hashtag_manipulation as hs

## Vocab

To parse hashtags efficiently a large vocabulary is needed, and for this method the vocabulary must be a list sorted by word length, shortest to longest. This will be updated later so that the vocabulary is saved to a CSV file, which makes it easier to add new words.

For now, a list of the 3000 most common English words was combined with a list of 1300 different slurs and curse words. A better vocabulary list can probably be added later, since this one does not cover different variations of words; for example, it can split #humanright but not #humanrights.
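
The `hashtag_manipulation` module is not shown in this notebook, so the following is only a minimal sketch of how vocabulary-based hashtag splitting can work, not the method used here. The function name `segment_hashtag` and the tiny vocabulary are assumptions. It uses a set lookup with dynamic programming rather than a length-sorted list, and it returns an empty list when the tag cannot be covered completely by vocabulary words, which reproduces the #humanright versus #humanrights limitation described above.

```python
def segment_hashtag(tag, vocab):
    """Split `tag` into vocabulary words, or return [] if no full split exists."""
    words = {w.lower() for w in vocab if w}  # drop empty entries, ignore case
    text = tag.lower()
    # best[i] holds a list of words covering text[:i], or None if unreachable.
    best = [None] * (len(text) + 1)
    best[0] = []
    for end in range(1, len(text) + 1):
        for start in range(end):
            piece = text[start:end]
            if best[start] is not None and piece in words:
                candidate = best[start] + [piece]
                # Prefer the segmentation that uses the fewest words.
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[-1] or []


# Hypothetical toy vocabulary; the real vocab files are much larger.
toy_vocab = ["", "a", "I", "hum", "man", "human", "right"]
print(segment_hashtag("humanright", toy_vocab))   # ['human', 'right']
print(segment_hashtag("humanrights", toy_vocab))  # []
```

Adding inflected forms such as "rights" to the vocabulary is the simplest way to make the second call succeed.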

In [8]:
big_vocab = vp.file_to_list("/content/gdrive/My Drive/thesis/big_vocab.txt")
big_vocab.sort(key=len)

In [9]:
vocab = vp.file_to_list("/content/gdrive/My Drive/thesis/vocab.txt")
vocab.sort(key=len)

In [10]:
print(vocab[0:10])

['', 'a', 'I', 'ad', 'ah', 'AM', 'as', 'at', 'be', 'by']


In [22]:
print(hs.find_hashtags(df.tweet[13239]))

['#Spanishrevenge', '#justice', '#HumanRights', '#FreedomOfExpression', '#Spain', '#fakedemocracy', '#cddr', '#shameonSpain', '#WakeupEurope']


In [7]:
def reformat_hashtags():
  new_tweets = []
  for tweet in df.tweet:
    new_tweet = tweet
    hashtags = hs.find_hashtags(tweet)
    if hashtags is not None:
      for tag in hashtags:
        c_tag = hs.clean_hashtag(str(tag))
        split_hashtag = hs.split_tag(c_tag)
        # Use the split words if the tag could be split, otherwise the cleaned tag.
        if len(split_hashtag) > 0:
          new_tweet = hs.replace_hashtag(new_tweet, tag.strip(), hs.tag_to_string(split_hashtag))
        else:
          new_tweet = hs.replace_hashtag(new_tweet, tag.strip(), c_tag)
    new_tweets.append(new_tweet)
  # Assign the column once instead of calling df.replace for every tweet.
  df["tweet"] = new_tweets

In [20]:
def reformat_hashtags_vocab():
  new_tweets = []
  for tweet in df.tweet:
    new_tweet = tweet
    hashtags = hs.find_hashtags(tweet)
    if hashtags is not None:
      for tag in hashtags:
        c_tag = hs.clean_hashtag(str(tag))
        # Only tags shorter than 7 characters are split against the vocabulary.
        if len(c_tag) < 7:
          split_hashtag = hs.split_tag(c_tag, vocab)
          if len(split_hashtag) > 0:
            new_tweet = hs.replace_hashtag(new_tweet, tag.strip(), hs.tag_to_string(split_hashtag))
          else:
            new_tweet = hs.replace_hashtag(new_tweet, tag.strip(), c_tag)
        else:
          new_tweet = hs.replace_hashtag(new_tweet, tag.strip(), c_tag)
    new_tweets.append(new_tweet)
  # Assign the column once instead of calling df.replace for every tweet.
  df["tweet"] = new_tweets

## Processing the training data

In [8]:
tokenize_tweets()

In [9]:
reformat_hashtags()

In [10]:
print(df.tweet[55])
print(df.tweet[13239])

 gun control advocates must STOP falling all over themselves to assure electorate that they too love the HORRIFIC 2A URL
spanishrevenge vs.  justice  human rights and  freedom of expression spain is a fakedemocracy @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER cddr shameonspain wakeupeurope @USER URL


## Saving the labels and parsed tweets of the training data

The tweets and labels are taken from the dataframe as NumPy arrays and written to a TSV file. The labels are converted to a binary representation in which non-offensive tweets (NOT) are set to 0 and offensive tweets are set to 1.

In [12]:
def write_to_tsv(filename, tweets, labels):
  # newline='' lets the csv module control line endings itself.
  with open('/content/gdrive/My Drive/thesis/dataset/' + filename + '.tsv', 'wt', newline='') as out_file:
    tsv_writer = csv.writer(out_file, delimiter='\t')
    tsv_writer.writerow(['tweet', 'label'])
    for tweet, label in zip(tweets, labels):
      tsv_writer.writerow([tweet, label])
  # The with block closes the file, so no explicit close() is needed.

In [13]:
tweets = df.tweet.values
labels = df.subtask_a.values
labels = np.where(labels == "NOT", 0, 1)
write_to_tsv('training_data', tweets, labels)
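
The label mapping and the TSV layout used above can be checked in isolation. The following sketch uses only the standard library and an in-memory buffer instead of the drive; the sample tweets and labels are made up for the example.

```python
import csv
import io

# Made-up sample data for illustration.
raw_labels = ["OFF", "NOT", "NOT", "OFF"]
sample_tweets = ["tweet one", "tweet two", "tweet\tthree", "tweet four"]

# Same rule as np.where(labels == "NOT", 0, 1): anything that is not "NOT" becomes 1.
binary_labels = [0 if label == "NOT" else 1 for label in raw_labels]

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(["tweet", "label"])
writer.writerows(zip(sample_tweets, binary_labels))

# Read the rows back to confirm the layout survives a round trip,
# including a tweet that itself contains a tab character.
buffer.seek(0)
rows = list(csv.reader(buffer, delimiter="\t"))
print(binary_labels)  # [1, 0, 0, 1]
print(rows[0])        # ['tweet', 'label']
```

Because the csv module quotes fields that contain the delimiter, a tweet with an embedded tab is read back as a single field.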

## Test data

In [14]:
df = pd.read_csv("/content/gdrive/My Drive/thesis/dataset/testset-levela.tsv", sep="\t") 
# Read the labels (second column) and close the file when done.
with open('/content/gdrive/My Drive/thesis/dataset/labels-levela.csv', 'r', newline='') as label_file:
  labels = [x[1] for x in csv.reader(label_file, delimiter=',')]

## Processing the test data

Only the test data for sub_task_a of the OLID dataset is processed.

In [15]:
tokenize_tweets()
reformat_hashtags()
tweets = df.tweet.values
write_to_tsv('test_data', tweets, labels)