<a href="https://colab.research.google.com/github/josbex/HS-detection_in_social_media_posts/blob/master/data_preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import pandas as pd
import csv
import numpy as np

## Dataset

The dataset is loaded from Google Drive, so it must already be stored in your drive. If it is, run the cell below and follow the link to get an authorization code.

In [2]:
from google.colab import drive
drive.mount('/content/gdrive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/gdrive


Once the drive is mounted, the dataset can be read from it; just specify the path of the dataset you want to read. In this case the OLID training dataset is loaded.

## Training data

In [21]:
df = pd.read_csv("/content/gdrive/My Drive/thesis/dataset/olid-training-v1.0.tsv", sep="\t") 
print(df.head())

      id                                              tweet  ... subtask_b subtask_c
0  86426  @USER She should ask a few native Americans wh...  ...       UNT       NaN
1  90194  @USER @USER Go home you’re drunk!!! @USER #MAG...  ...       TIN       IND
2  16820  Amazon is investigating Chinese employees who ...  ...       NaN       NaN
3  62688  @USER Someone should'veTaken" this piece of sh...  ...       UNT       NaN
4  43605  @USER @USER Obama wanted liberals &amp; illega...  ...       NaN       NaN

[5 rows x 5 columns]
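
The loading step can be tried without Drive access. The following is a minimal sketch, assuming an in-memory sample with made-up rows that reuses the column names of the OLID file. It also assumes that missing labels are written as the string `NULL`, which pandas parses as `NaN` by default.

```python
import io

import pandas as pd

# Made-up sample rows; only the column names mirror the OLID training file.
sample_tsv = (
    "id\ttweet\tsubtask_a\tsubtask_b\tsubtask_c\n"
    "1\t@USER example offensive tweet\tOFF\tUNT\tNULL\n"
    "2\texample neutral tweet\tNOT\tNULL\tNULL\n"
)

# sep="\t" tells pandas that the fields are tab-separated.
sample_df = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

print(sample_df.shape)                         # rows x columns
print(sample_df["subtask_a"].value_counts())   # label distribution
print(sample_df["subtask_c"].isna().sum())     # number of missing labels
```

Checking the shape and the label counts right after loading is a cheap way to catch a wrong separator or a truncated file.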


## Data pre-processing 

Several preprocessing steps are needed. First, the text preprocessor from https://github.com/cbaziotis/ekphrasis is used to clean up URLs, user mentions, hashtags and emoticons in the tweets.

In [None]:
!pip install ekphrasis

In [10]:
from ekphrasis.classes.preprocessor import TextPreProcessor
from ekphrasis.classes.tokenizer import SocialTokenizer
from ekphrasis.dicts.emoticons import emoticons

text_processor = TextPreProcessor(
    # terms that will be normalized
    normalize=['email', 'percent', 'money', 'phone',
        'time', 'url', 'date', 'number'],
    fix_html=True,  # fix HTML tokens
    
    # corpus from which the word statistics are going to be used 
    # for word segmentation 
    segmenter="twitter", 
    
    # corpus from which the word statistics are going to be used 
    # for spell correction
    corrector="twitter", 
    
    unpack_hashtags=True,  # perform word segmentation on hashtags
    unpack_contractions=True,  # Unpack contractions (can't -> can not)
    spell_correct_elong=False,  # spell correction for elongated words
    
    # select a tokenizer. You can use SocialTokenizer, or pass your own;
    # the tokenizer should take a string as input and return a list of tokens
    tokenizer=SocialTokenizer(lowercase=True).tokenize,
    
    # list of dictionaries for replacing tokens extracted from the text
    # with other expressions. You can pass more than one dictionary.
    dicts=[emoticons]
)

Reading twitter - 1grams ...
Reading twitter - 2grams ...
Reading twitter - 1grams ...


## Helper functions

In [22]:
def tokens_to_string(tokens):
  # Join the token list back into a single space-separated string.
  return " ".join(tokens)

def process_tweet(tweet):
  # Run the ekphrasis pipeline on one tweet and return the cleaned string.
  return tokens_to_string(text_processor.pre_process_doc(tweet))

def tokenize_tweets():
  # Overwrite the tweet column of the global df with the processed text.
  df["tweet"] = df["tweet"].apply(process_tweet)

def write_to_tsv(filename, tweets, labels):
  # newline='' lets the csv module control the line endings itself.
  path = '/content/gdrive/My Drive/thesis/dataset/' + filename + '.tsv'
  with open(path, 'wt', newline='') as out_file:
    tsv_writer = csv.writer(out_file, delimiter='\t')
    tsv_writer.writerow(['tweet', 'label'])
    for tweet, label in zip(tweets, labels):
      tsv_writer.writerow([tweet, label])
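
The per-tweet processing pattern can be tried without ekphrasis or Drive access. The following is a minimal sketch, assuming a stand-in tokenizer (`fake_pre_process_doc`, a hypothetical name) in place of `text_processor.pre_process_doc`. It only lowercases and splits on whitespace, so it does not reproduce the ekphrasis output.

```python
import pandas as pd

def fake_pre_process_doc(text):
  # Hypothetical stand-in for text_processor.pre_process_doc:
  # lowercases the text and splits it on whitespace only.
  return text.lower().split()

def tokens_to_string(tokens):
  # Join the token list back into a single space-separated string.
  return " ".join(tokens)

demo_df = pd.DataFrame({"tweet": ["Hello  WORLD", "@USER Stop It"]})

# Process the whole column in one pass: tokenize each cell, then re-join it.
demo_df["tweet"] = demo_df["tweet"].apply(
    lambda tweet: tokens_to_string(fake_pre_process_doc(tweet))
)

print(demo_df["tweet"].tolist())
```

Working on a small throwaway frame like this makes it easy to inspect what the string round trip does before running it on the full dataset.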

## Processing the training data

Example tweets before and after tokenization. Hashtag segmentation is done using the Twitter corpus.

In [12]:
print(df.tweet[55])
print(df.tweet[13239])

#GUNCONTROL advocates must STOP falling all over themselves to assure electorate that they too love the HORRIFIC 2A URL
#Spanishrevenge vs. #justice #HumanRights and #FreedomOfExpression #Spain is a  #fakedemocracy @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER #cddr #shameonSpain #WakeupEurope @USER URL


In [23]:
tokenize_tweets()

In [14]:
print(df.tweet[55])
print(df.tweet[13239])

guncontrol advocates must stop falling all over themselves to assure electorate that they too love the horrific 2 a url
spanishrevenge vs . justice human rights and freedom of expression spain is a fake democracy @user @user @user @user @user @user @user @user @user @user @user @user @user @user @user cd dr shameon spain wakeup europe @user url


## Saving the labels and parsed tweets of the training data

Extracts the tweets and labels of the training data as NumPy arrays and writes them to a TSV file. Labels are converted to a binary representation where non-offensive tweets (`NOT`) are set to 0 and offensive tweets to 1.

In [24]:
tweets = df.tweet.values
labels = df.subtask_a.values
labels = np.where(labels == "NOT", 0, 1)
write_to_tsv('training_data', tweets, labels)
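
The label conversion can be checked in isolation. The following is a minimal sketch, assuming a small made-up label array. Note that `np.where` maps every value other than the exact string `"NOT"` to 1, so a differently cased or misspelled label would silently count as offensive.

```python
import numpy as np

# Made-up labels; the last one is deliberately lowercased.
example_labels = np.array(["NOT", "OFF", "OFF", "not"])

# 0 where the label equals "NOT" exactly, 1 everywhere else.
binary_labels = np.where(example_labels == "NOT", 0, 1)

print(binary_labels)
```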

## Saving the labels and parsed tweets of the test data

In [19]:
df = pd.read_csv("/content/gdrive/My Drive/thesis/dataset/testset-levela.tsv", sep="\t")
tokenize_tweets()
# Column 1 of each row in the label file holds the label; the with block closes the file.
with open('/content/gdrive/My Drive/thesis/dataset/labels-levela.csv', 'r') as label_file:
  labels = np.array([row[1] for row in csv.reader(label_file, delimiter=',')])
labels = np.where(labels == "NOT", 0, 1)
tweets = df.tweet.values
write_to_tsv('test_data', tweets, labels)
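
Because the test tweets and labels come from two separate files, it may be worth checking that they line up and that the written file reads back unchanged. The following is a minimal sketch, assuming made-up data and a temporary file instead of the Drive path.

```python
import csv
import os
import tempfile

import pandas as pd

# Made-up data; the second tweet contains quotes to exercise CSV quoting.
demo_tweets = ["hello world", 'say "hi" now']
demo_labels = [0, 1]

# zip() stops at the shorter input, so a length mismatch would silently drop rows.
assert len(demo_tweets) == len(demo_labels)

demo_path = os.path.join(tempfile.mkdtemp(), "demo.tsv")
with open(demo_path, "w", newline="", encoding="utf-8") as out_file:
  writer = csv.writer(out_file, delimiter="\t")
  writer.writerow(["tweet", "label"])
  writer.writerows(zip(demo_tweets, demo_labels))

# Read the file back the same way the datasets are loaded.
restored = pd.read_csv(demo_path, sep="\t")
print(restored)
```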