<a href="https://colab.research.google.com/github/priebet/sentiment/blob/master/sentiment3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Twitter Sentiment Analysis Part 3
## Preprocessing Tweets Around the Brexit Vote

In [0]:
## Mount Google Drive for easy and fast read/write access to data folder 
#from google.colab import drive
#drive.mount('/content/drive')
#datapath = "/content/drive/My Drive/Colab Notebooks/data/"

In [0]:
# For demonstration purposes, pull data from webserver instead
datapath = "http://priebe.onl/data/"

In [6]:
import pandas as pd  

df = pd.read_csv(datapath+"brexit.csv",usecols=[0,1,3,11])
df.head()

Unnamed: 0,id,created_at,text,entities_hashtags_text
0,736284933686239233,2016-05-27T19:56:22,@FXdestination @lynn_weiser Those who died to ...,brexit
1,736284905710211073,2016-05-27T19:56:15,THIS! Top swears. If 'bollocks' is good enough...,EUref|strongerin|brexit|brexin
2,736284877084065792,2016-05-27T19:56:08,dont the scots understand that joining the lea...,Brexit|VoteRemain|VoteLeave
3,736284846151106563,2016-05-27T19:56:01,The vision from enslaved #Brexit Shut the door...,Brexit
4,736284840992092160,2016-05-27T19:56:00,It isn't all about GDP and jobs. We will adapt...,Brexit|EuRef


In [0]:
# Some tweets appear to have no text. dropna() removes every row with a missing
# value in any of the four columns, which includes those tweets; then reset the index
df.dropna(inplace=True)
df.reset_index(drop=True,inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 865873 entries, 0 to 865872
Data columns (total 4 columns):
id                        865873 non-null int64
created_at                865873 non-null object
text                      865873 non-null object
entities_hashtags_text    865873 non-null object
dtypes: int64(1), object(3)
memory usage: 26.4+ MB
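
Called without arguments, `dropna()` drops a row when any column is missing, not only `text`. The sketch below uses a made-up three-row frame (the rows and the names `toy`, `any_missing`, and `text_missing` are assumptions for illustration, not part of the dataset) to show how that differs from restricting the check to `text` with `subset`.

```python
import pandas as pd

# Toy frame with made-up rows: one tweet lacks text, another lacks hashtags
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "text": ["leave or remain?", None, "no hashtags here"],
    "entities_hashtags_text": ["brexit", "euref", None],
})

# Same call as in the cell above: a missing value in ANY column drops the row
any_missing = toy.dropna().reset_index(drop=True)

# Alternative: only drop rows whose text is missing
text_missing = toy.dropna(subset=["text"]).reset_index(drop=True)

print(len(any_missing), len(text_missing))  # 1 2
```

Which variant is appropriate depends on whether tweets without hashtags should stay in the analysis; the notebook uses the stricter form.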


In [0]:
import re
from bs4 import BeautifulSoup
from nltk.tokenize import WordPunctTokenizer
tok = WordPunctTokenizer()

# @mentions and URLs carry no sentiment information and are removed
pat1 = r'@[A-Za-z0-9_]+'
pat2 = r'https?://[^ ]+'
combined_pat = r'|'.join((pat1, pat2))
www_pat = r'www\.[^ ]+'  # dot escaped so that it only matches a literal "."
# Expand negating contractions so that "not" survives the letters-only filter below
negations_dic = {"isn't":"is not", "aren't":"are not", "wasn't":"was not", "weren't":"were not",
                "haven't":"have not","hasn't":"has not","hadn't":"had not","won't":"will not",
                "wouldn't":"would not", "don't":"do not", "doesn't":"does not","didn't":"did not",
                "can't":"can not","couldn't":"could not","shouldn't":"should not","mightn't":"might not",
                "mustn't":"must not"}
neg_pattern = re.compile(r'\b(' + '|'.join(negations_dic.keys()) + r')\b')

def tweet_cleaner_updated(text):
    # Strip HTML markup and decode HTML entities such as &amp;
    soup = BeautifulSoup(text, 'lxml')
    souped = soup.get_text()
    # Python 3 strings are already Unicode, so there is nothing to decode:
    # drop a stray byte order mark and the Unicode replacement character directly
    bom_removed = souped.replace("\ufeff", "").replace("\ufffd", "?")
    stripped = re.sub(combined_pat, '', bom_removed)
    stripped = re.sub(www_pat, '', stripped)
    lower_case = stripped.lower()
    neg_handled = neg_pattern.sub(lambda x: negations_dic[x.group()], lower_case)
    letters_only = re.sub("[^a-zA-Z]", " ", neg_handled)
    # Replacing non-letters with spaces leaves unnecessary whitespace, so
    # tokenize and re-join; single-character tokens are dropped as well
    words = [x for x in tok.tokenize(letters_only) if len(x) > 1]
    return " ".join(words)
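
To make the order of the cleaning steps easier to follow, here is a minimal standard-library sketch of the same pipeline applied to one made-up tweet. It is a simplified stand-in, not the function above: `html.unescape` replaces BeautifulSoup, `str.split` replaces `WordPunctTokenizer` (the two should behave the same once only letters and spaces remain), and the names `clean_sketch`, `NEGATIONS`, and `sample` are assumptions introduced for illustration.

```python
import re
from html import unescape

MENTION_OR_URL = re.compile(r'@[A-Za-z0-9_]+|https?://[^ ]+')
WWW = re.compile(r'www\.[^ ]+')
# Small subset of the negation dictionary, enough for the demo
NEGATIONS = {"isn't": "is not", "don't": "do not", "can't": "can not"}
NEG = re.compile(r'\b(' + '|'.join(NEGATIONS) + r')\b')

def clean_sketch(text):
    text = unescape(text)                    # decode HTML entities such as &amp;
    text = MENTION_OR_URL.sub('', text)      # drop @mentions and http(s) links
    text = WWW.sub('', text)                 # drop bare www. links
    text = text.lower()
    text = NEG.sub(lambda m: NEGATIONS[m.group()], text)  # expand negations
    text = re.sub(r'[^a-z]', ' ', text)      # keep letters only
    # Collapse whitespace and drop single-character tokens
    return ' '.join(w for w in text.split() if len(w) > 1)

sample = "@user It isn't all about GDP &amp; jobs! https://t.co/abc123 #Brexit"
cleaned = clean_sketch(sample)
print(cleaned)  # it is not all about gdp jobs brexit
```

The negation expansion has to run before the letters-only filter: otherwise the apostrophe in "isn't" would become a space and the single-character "t" would be dropped, losing the negation.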

In [0]:
%%time
print("Cleaning the tweets...\n")
clean_tweet_texts = []
n_tweets = len(df)
# Iterate over the column values directly instead of looking up df['text'][i] on every pass
for i, text in enumerate(df['text'], start=1):
    if i % 100000 == 0:
        print("%d of %d tweets have been processed" % (i, n_tweets))
    clean_tweet_texts.append(tweet_cleaner_updated(text))

Cleaning the tweets...

100000 of 865873 tweets have been processed
200000 of 865873 tweets have been processed
300000 of 865873 tweets have been processed
400000 of 865873 tweets have been processed
500000 of 865873 tweets have been processed
600000 of 865873 tweets have been processed
700000 of 865873 tweets have been processed
800000 of 865873 tweets have been processed
CPU times: user 3min 46s, sys: 13.3 s, total: 3min 59s
Wall time: 3min 59s


In [0]:
clean_df = pd.DataFrame(clean_tweet_texts,columns=['text'])
clean_df['id'] = df['id']
clean_df['created'] = df['created_at']
clean_df['hashtags'] = df['entities_hashtags_text'].str.casefold()

# Remove rows with empty text (after cleaning) and reset index
clean_df = clean_df.loc[clean_df['text'] != ""]
clean_df.reset_index(drop=True,inplace=True)

clean_df.head()

Unnamed: 0,text,id,created,hashtags
0,those who died to protect democracy will be ro...,736284933686239233,2016-05-27T19:56:22,brexit
1,this top swears if bollocks is good enough for...,736284905710211073,2016-05-27T19:56:15,euref|strongerin|brexit|brexin
2,dont the scots understand that joining the lea...,736284877084065792,2016-05-27T19:56:08,brexit|voteremain|voteleave
3,the vision from enslaved brexit shut the door ...,736284846151106563,2016-05-27T19:56:01,brexit
4,it is not all about gdp and jobs we will adapt...,736284840992092160,2016-05-27T19:56:00,brexit|euref
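
The clean-then-filter steps can also be written with `Series.map`, which keeps the cleaned text aligned with `id` by index instead of going through a separate Python list. This is only a sketch on a made-up frame: `toy_cleaner` is a hypothetical stand-in for `tweet_cleaner_updated`, and the rows are invented for illustration.

```python
import pandas as pd

def toy_cleaner(text):
    # Hypothetical stand-in for tweet_cleaner_updated: lowercase and keep
    # purely alphabetic words longer than one character
    return " ".join(w for w in text.lower().split() if w.isalpha() and len(w) > 1)

# Made-up rows; the second one becomes empty after cleaning
toy = pd.DataFrame({"id": [10, 11, 12],
                    "text": ["Vote LEAVE", "!!! 42", "Remain it is"]})

cleaned = pd.DataFrame({"text": toy["text"].map(toy_cleaner), "id": toy["id"]})

# Drop rows whose text is empty after cleaning, then renumber the index
cleaned = cleaned.loc[cleaned["text"] != ""].reset_index(drop=True)
print(cleaned)
```

Resetting the index after filtering matters for any later step that relies on a gap-free 0..n-1 index.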


In [0]:
#clean_df.to_csv(datapath+'brexit_cleaned.csv',encoding='utf-8',index=False)
clean_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 865549 entries, 0 to 865548
Data columns (total 4 columns):
text        865549 non-null object
id          865549 non-null int64
created     865549 non-null object
hashtags    865549 non-null object
dtypes: int64(1), object(3)
memory usage: 26.4+ MB
