### Tweet Preprocessing Using ekphrasis
ekphrasis Link: https://github.com/cbaziotis/ekphrasis

ekphrasis offers the following functionality:

1. Social Tokenizer. A text tokenizer geared towards social networks (Facebook, Twitter...), which understands complex emoticons, emojis and other unstructured expressions like dates, times and more.

2. Word Segmentation. You can split a long string into its constituent words. Suitable for hashtag segmentation.

3. Spell Correction. You can replace a misspelled word with the most probable candidate word.

4. Customization. Tailor the word segmentation, spell correction and term identification to suit your needs.

The word segmentation and spell correction mechanisms operate on top of word statistics collected from a given corpus. ekphrasis provides word statistics from two large corpora (Wikipedia and Twitter), but you can also generate word statistics from your own corpus. You may need to do that if you are working with domain-specific texts, such as biomedical documents. For example, a word describing a technique or a chemical compound may be treated as misspelled when the word statistics come from a general-purpose corpus.

ekphrasis tokenizes the text based on a list of regular expressions. You can enable ekphrasis to identify new entities by adding a new entry to the dictionary of regular expressions (ekphrasis/regexes/expressions.txt).

5. Pre-Processing Pipeline. You can combine all the above steps to prepare the text files in your dataset for analysis or machine learning. In addition to the aforementioned actions, you can perform text normalization, word annotation (labeling) and more.
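
The reliance on word statistics described above can be made concrete with a minimal sketch of statistics-driven segmentation. This is not the ekphrasis implementation: the `COUNTS` table holds invented toy numbers standing in for corpus statistics, and `word_logprob` and `segment` are hypothetical helpers that pick the split whose words have the highest combined unigram probability.

```python
from functools import lru_cache
from math import log10

# Toy unigram counts (assumption: invented numbers, standing in for corpus statistics).
COUNTS = {"twin": 50, "peaks": 30, "bad": 200, "movies": 120, "tv": 300, "series": 90}
TOTAL = sum(COUNTS.values())


def word_logprob(word: str) -> float:
    """Log-probability of a word; unseen words get a penalty that grows with length."""
    if word in COUNTS:
        return log10(COUNTS[word] / TOTAL)
    return log10(10.0 / (TOTAL * 10 ** len(word)))


@lru_cache(maxsize=None)
def segment(text: str) -> tuple:
    """Return the most probable split of `text` into words."""
    if not text:
        return ()
    # Try every first word, and segment the remainder recursively (memoized).
    candidates = ((text[:i],) + segment(text[i:]) for i in range(1, len(text) + 1))
    return max(candidates, key=lambda words: sum(word_logprob(w) for w in words))


print(" ".join(segment("twinpeaks")))  # twin peaks
print(" ".join(segment("badmovies")))  # bad movies
```

With a general-purpose table like this one, a domain-specific term that is missing from `COUNTS` would be scored as an unlikely string, which is the situation the paragraph on domain-specific corpora warns about.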


### Install the library

In [1]:
# pip install ekphrasis

### Import Required Libraries

In [2]:
from ekphrasis.classes.preprocessor import TextPreProcessor
from ekphrasis.classes.tokenizer import SocialTokenizer
from ekphrasis.dicts.emoticons import emoticons
import pandas as pd
from sklearn.model_selection import train_test_split



### Define a Text Pre-Processing pipeline

You can define a preprocessing pipeline with the TextPreProcessor class.

In [3]:
text_processor = TextPreProcessor(
    # terms that will be normalized
    normalize=['url', 'email', 'percent', 'money', 'phone', 'user',
        'time', 'date', 'number'],
    # terms that will be annotated
    annotate={"hashtag", "allcaps", "elongated", "repeated",
        'emphasis', 'censored'},
    fix_html=True,  # fix HTML tokens
    
    # corpus from which the word statistics are going to be used 
    # for word segmentation 
    segmenter="twitter",
    
    # corpus from which the word statistics are going to be used 
    # for spell correction
    corrector="twitter", 
    
    unpack_hashtags=True,  # perform word segmentation on hashtags
    unpack_contractions=True,  # Unpack contractions (can't -> can not)
    spell_correct_elong=False,  # spell correction for elongated words
    
    # select a tokenizer. You can use SocialTokenizer or pass your own;
    # the tokenizer should take a string as input and return a list of tokens
    tokenizer=SocialTokenizer(lowercase=True).tokenize,
    
    # list of dictionaries for replacing tokens extracted from the text
    # with other expressions. You can pass more than one dictionary.
    dicts=[emoticons]
)


Reading twitter - 1grams ...
Reading twitter - 2grams ...
Reading twitter - 1grams ...
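
The `tokenizer` argument accepts any callable that maps a string to a list of tokens. As an illustration, the sketch below builds such a callable from a small dictionary of regular expressions joined into one alternation pattern. The patterns and the `regex_tokenize` name are assumptions made for this example, and they are far simpler than the expressions shipped with ekphrasis.

```python
import re

# Hypothetical, simplified patterns; order matters because earlier alternatives win.
PATTERNS = {
    "url": r"https?://\S+",
    "user": r"@\w+",
    "hashtag": r"#\w+",
    "emoticon": r"[:;=][-']?[)(DPp/\\|]+",
    "word": r"\w+(?:'\w+)?",
    "punct": r"[^\w\s]",
}
# Non-capturing groups make findall() return whole matches.
TOKEN_RE = re.compile("|".join(f"(?:{p})" for p in PATTERNS.values()))


def regex_tokenize(text: str) -> list:
    """Split `text` into tokens with the combined pattern, then lowercase them."""
    return [token.lower() for token in TOKEN_RE.findall(text)]


print(regex_tokenize("@SentimentSymp: can't wait for #Sentiment talks :-D http://example.com"))
```

Under these assumptions, such a callable could be passed as `tokenizer=regex_tokenize` in place of `SocialTokenizer(lowercase=True).tokenize`.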



### Test Some tweets

Notes:
1. Elongated words are automatically normalized.
2. Spell correction affects performance (processing speed).
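
For intuition on the first note, here is a simplified sketch of elongation handling: collapse any character repeated three or more times and tag the word. It is an approximation written for this notebook, not ekphrasis's actual logic, and `collapse_elongated` is a hypothetical helper.

```python
import re

# Any character repeated three or more times in a row.
ELONGATED_RE = re.compile(r"(.)\1{2,}")


def collapse_elongated(word: str) -> str:
    """Collapse runs of 3+ identical characters to one and tag the word."""
    if not ELONGATED_RE.search(word):
        return word
    return ELONGATED_RE.sub(r"\1", word) + " <elongated>"


print(collapse_elongated("suuuuucks"))  # sucks <elongated>
print(collapse_elongated("good"))       # good
```

Collapsing to a single character mangles words such as `coool` (it becomes `col`), so a robust normalizer needs a vocabulary to choose between candidate forms.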


In [4]:
sentences = [
    "CANT WAIT for the new season of #TwinPeaks ＼(^o^)／!!! #davidlynch #tvseries :)))",
    "I saw the new #johndoe movie and it suuuuucks!!! WAISTED $10... #badmovies :/",
    "@SentimentSymp:  can't wait for the Nov 9 #Sentiment talks!  YAAAAAAY !!! :-D http://sentimentsymposium.com/."
]

In [5]:
for s in sentences:
    print(" ".join(text_processor.pre_process_doc(s)))
    print()

<allcaps> cant wait </allcaps> for the new season of <hashtag> twin peaks </hashtag> ＼(^o^)／ ! <repeated> <hashtag> david lynch </hashtag> <hashtag> tv series </hashtag> <happy>

i saw the new <hashtag> john doe </hashtag> movie and it sucks <elongated> ! <repeated> <allcaps> waisted </allcaps> <money> . <repeated> <hashtag> bad movies </hashtag> <annoyed>

<user> : can not wait for the <date> <hashtag> sentiment </hashtag> talks ! <allcaps> yay <elongated> </allcaps> ! <repeated> <laugh> <url>



### Clean data for task

In [6]:
# fins = ['EI-oc-En-train\\EI-oc-En-fear-train.txt', '2018-EI-oc-En-dev\\2018-EI-oc-En-fear-dev.txt', '2018-EI-oc-En-test\\2018-EI-oc-En-fear-test.txt']
# fouts = ['EI-oc-En-train\\PrePro_EI-oc-En-fear-train.txt', '2018-EI-oc-En-dev\\PrePro_2018-EI-oc-En-fear-dev.txt', '2018-EI-oc-En-test\\PrePro_2018-EI-oc-En-fear-test.txt']

#BASE = 'D:\\ResearchDataGtx1060\\SentimentData\\Harasment\\Sharing Data\\'
# fins = ['Racial Data.csv', 'Sextual Data.csv', 'Political Data.csv', 'Intelligence Data.csv', 'Appearance Data.csv']
# fouts = ['PrePro_Racial.csv', 'PrePro_Sextua.csv', 'PrePro_Political.csv', 'PrePro_Intelligence.csv', 'PrePro_Appearance.csv']

# BASE = 'D:\\ResearchDataGtx1060\\SentimentData\\Racism\\'
# fins = ['NAACL_SRW_2016_Tweets.csv']
# fouts = ['PrePro_NAACL_SRW_2016_Tweets.csv']

# BASE = 'D:\\ResearchDataGtx1060\\MisInformation\\Thiru\\'
# fins = ['final_tweets_share.csv', 'poynter.csv']
# fouts = ['PrePro_final_tweets_share.csv', 'PrePro_poynter.csv']

BASE = 'D:\\ResearchDataGtx1060\\SentimentData\\Hate\\random-hate\\'
fins = ['train_E6oV3lV.csv', 'test_tweets_anuFYb8.csv']
fouts = ['PrePro_train_E6oV3lV.csv', 'PrePro_test_tweets_anuFYb8.csv']

In [7]:
track = 1  # index into fins/fouts: 0 = train file, 1 = test file
df_one = pd.read_csv(BASE + fins[track], sep=',', encoding='latin1')
print(len(df_one))
df_one.head(20)

17197


| | id | tweet |
|---|---|---|
| 0 | 31963 | #studiolife #aislife #requires #passion #dedic... |
| 1 | 31964 | @user #white #supremacists want everyone to s... |
| 2 | 31965 | safe ways to heal your #acne!! #altwaystohe... |
| 3 | 31966 | is the hp and the cursed child book up for res... |
| 4 | 31967 | 3rd #bihday to my amazing, hilarious #nephew... |
| 5 | 31968 | choose to be :) #momtips |
| 6 | 31969 | something inside me dies Ã°ÂÂÂ¦Ã°ÂÂÂ¿Ã¢ÂÂ... |
| 7 | 31970 | #finished#tattoo#inked#ink#loveitÃ¢ÂÂ¤Ã¯Â¸Â ... |
| 8 | 31971 | @user @user @user i will never understand why... |
| 9 | 31972 | #delicious #food #lovelife #capetown mannaep... |


In [8]:
#df_one['misinfo'] = df_one['misinfo'].str.lower().str.strip()
#df_one.groupby('label').count()

In [9]:
#df_one['facts'] = df_one['facts'].astype(str)

In [10]:
#count = 1
for idx in df_one.index:
    sent = df_one.loc[idx,'tweet']
    sent = sent.replace('‘', '\'').replace('’', '\'').replace('“', '"').replace('”', '"')
    #print(sent)
    sent = ' '.join(text_processor.pre_process_doc(sent))
    #print(sent)
    df_one.loc[idx,'tweet'] = sent
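
The row-by-row loop above can also be expressed as one function applied to the whole column. The sketch below handles the quote normalization with `str.translate`. `clean_tweet` and its `process` parameter are hypothetical names, and `process` defaults to `str.split` as a stand-in for `text_processor.pre_process_doc` so the example runs without ekphrasis.

```python
# Map curly quotes to their ASCII equivalents in a single pass.
QUOTE_TABLE = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})


def clean_tweet(text, process=str.split):
    """Normalize quotes, run `process` (string -> list of tokens), and rejoin.

    str(text) guards against non-string cells such as NaN.
    """
    return " ".join(process(str(text).translate(QUOTE_TABLE)))


print(clean_tweet("\u201cdon\u2019t   stop\u201d"))  # "don't stop"
```

With the notebook's objects this would read `df_one['tweet'] = df_one['tweet'].apply(clean_tweet, process=text_processor.pre_process_doc)`, which avoids the repeated `.loc` lookups.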

In [11]:
df_one.head(20)

| | id | tweet |
|---|---|---|
| 0 | 31963 | <hashtag> studio life </hashtag> <hashtag> a i... |
| 1 | 31964 | <user> <hashtag> white </hashtag> <hashtag> su... |
| 2 | 31965 | safe ways to heal your <hashtag> acne </hashta... |
| 3 | 31966 | is the hp and the cursed child book up for res... |
| 4 | 31967 | 3 rd <hashtag> bih day </hashtag> to my amazin... |
| 5 | 31968 | choose to be <happy> <hashtag> mom tips </hash... |
| 6 | 31969 | something inside me dies ã ° â  â  â ¦ ã ° â... |
| 7 | 31970 | <hashtag> finished </hashtag> <hashtag> tattoo... |
| 8 | 31971 | <user> <user> <user> i will never understand w... |
| 9 | 31972 | <hashtag> delicious </hashtag> <hashtag> food ... |


In [12]:
# df_one = df_one[['facts', 'misinfo']]
# df_one

In [13]:
df_one.to_csv(BASE + fouts[track], index=False)  # do not write the row index