<a href="https://colab.research.google.com/github/priebet/sentiment/blob/master/sentiment1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Twitter Sentiment Analysis, Part 1
## Preparing the Sentiment140 Training Data Set

In [0]:
# Mount Google Drive for easy and fast read/write access to data folder 
#from google.colab import drive
#drive.mount('/content/drive')
#datapath = "/content/drive/My Drive/Colab Notebooks/data/"

In [0]:
# For demonstration purposes, pull data from webserver instead
datapath = "http://priebe.onl/data/"

In [0]:
import pandas as pd  

# Load Sentiment140 training data set, choose only relevant columns and encode 
# positive sentiment as 1 instead of 4
df = pd.read_csv(datapath+"training.csv",header=None,
                 usecols=[0,5],names=['sentiment','text'],encoding='latin1')
df['sentiment'] = df['sentiment'].map({0: 0, 4: 1})
df.head()

Unnamed: 0,sentiment,text
0,0,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,is upset that he can't update his Facebook by ...
2,0,@Kenichan I dived many times for the ball. Man...
3,0,my whole body feels itchy and like its on fire
4,0,"@nationwideclass no, it's not behaving at all...."
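The label re-encoding above relies on `Series.map` with a dictionary. A minimal sketch on a toy frame (the data and the label value `2` are made up for illustration) shows that any label missing from the dictionary silently becomes `NaN`, so a quick count of missing values after mapping can serve as a sanity check:

```python
import pandas as pd

# Toy frame standing in for the raw data; the value 2 is a hypothetical
# unexpected label, included only to show what .map() does with it
toy = pd.DataFrame({"sentiment": [0, 4, 4, 2],
                    "text": ["bad day", "great day", "love it", "meh"]})

mapped = toy["sentiment"].map({0: 0, 4: 1})

# Labels missing from the dictionary become NaN rather than raising an error
n_unmapped = int(mapped.isna().sum())
print(mapped.value_counts(dropna=False))
print("unmapped labels:", n_unmapped)
```

If the count is zero on the real data, every row carried one of the two expected labels.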


In [0]:
# This cleaning code follows the version presented by Ricky Kim; see:
# https://towardsdatascience.com/another-twitter-sentiment-analysis-bb5b01ebad90
import re
from bs4 import BeautifulSoup
from nltk.tokenize import WordPunctTokenizer
tok = WordPunctTokenizer()

pat1 = r'@[A-Za-z0-9_]+'
pat2 = r'https?://[^ ]+'
combined_pat = r'|'.join((pat1, pat2))
www_pat = r'www.[^ ]+'
negations_dic = {"isn't":"is not", "aren't":"are not", "wasn't":"was not", "weren't":"were not",
                "haven't":"have not","hasn't":"has not","hadn't":"had not","won't":"will not",
                "wouldn't":"would not", "don't":"do not", "doesn't":"does not","didn't":"did not",
                "can't":"can not","couldn't":"could not","shouldn't":"should not","mightn't":"might not",
                "mustn't":"must not"}
neg_pattern = re.compile(r'\b(' + '|'.join(negations_dic.keys()) + r')\b')

def tweet_cleaner_updated(text):
    soup = BeautifulSoup(text, 'lxml')
    souped = soup.get_text()
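    try:
        # Python 2 remnant: in Python 3, str has no .decode(), so this
        # raises AttributeError and the text is kept unchanged
        bom_removed = souped.decode("utf-8-sig").replace(u"\ufffd", "?")
    except (AttributeError, UnicodeError):
        bom_removed = souped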
    stripped = re.sub(combined_pat, '', bom_removed)
    stripped = re.sub(www_pat, '', stripped)
    lower_case = stripped.lower()
    neg_handled = neg_pattern.sub(lambda x: negations_dic[x.group()], lower_case)
    letters_only = re.sub("[^a-zA-Z]", " ", neg_handled)
    words = [x for x  in tok.tokenize(letters_only) if len(x) > 1]
    return (" ".join(words)).strip()
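To make the order of the regex steps easier to follow, here is a simplified, standard-library-only sketch of the same idea on a single made-up tweet. It is not the function above: it uses only a three-entry subset of the negation dictionary, skips the BeautifulSoup step, and replaces `WordPunctTokenizer` with `str.split`, which is an assumption that only holds because the text contains nothing but letters and spaces at that point.

```python
import re

# Small subset of the negation dictionary, for illustration only
negations = {"can't": "can not", "isn't": "is not", "don't": "do not"}
neg_pattern = re.compile(r'\b(' + '|'.join(negations.keys()) + r')\b')

# Made-up sample tweet
sample = "@user I can't believe it, this ISN'T working http://example.com/x"

# 1. Remove @mentions and URLs
stripped = re.sub(r'@[A-Za-z0-9_]+|https?://[^ ]+', '', sample)
# 2. Lower-case, then expand negations (the dictionary keys are lower-case)
expanded = neg_pattern.sub(lambda m: negations[m.group()], stripped.lower())
# 3. Replace everything that is not a letter with a space
letters_only = re.sub("[^a-zA-Z]", " ", expanded)
# 4. Drop single-character tokens and re-join
cleaned = " ".join(w for w in letters_only.split() if len(w) > 1)

print(cleaned)  # can not believe it this is not working
```

Expanding the negations before the letters-only step matters: once the apostrophe has been replaced by a space, `can't` would turn into `can` and a stray `t`, and the negation would be lost.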

In [0]:
%%time
print("Cleaning the tweets...\n")
clean_tweet_texts = []
for i in range(len(df)):
    if (i + 1) % 100000 == 0:
        print("%d of %d tweets have been processed" % (i + 1, len(df)))
    clean_tweet_texts.append(tweet_cleaner_updated(df['text'][i]))

Cleaning the tweets...

100000 of 1600000 tweets have been processed
200000 of 1600000 tweets have been processed
300000 of 1600000 tweets have been processed
400000 of 1600000 tweets have been processed
500000 of 1600000 tweets have been processed
600000 of 1600000 tweets have been processed
700000 of 1600000 tweets have been processed
800000 of 1600000 tweets have been processed
900000 of 1600000 tweets have been processed
1000000 of 1600000 tweets have been processed
1100000 of 1600000 tweets have been processed
1200000 of 1600000 tweets have been processed
1300000 of 1600000 tweets have been processed
1400000 of 1600000 tweets have been processed
1500000 of 1600000 tweets have been processed
1600000 of 1600000 tweets have been processed
CPU times: user 7min 44s, sys: 25.1 s, total: 8min 9s
Wall time: 8min 10s
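The loop above looks up each tweet with `df['text'][i]`, which is a label-based lookup and therefore depends on the frame having the default 0..n-1 index. A possible alternative, sketched here with a hypothetical stand-in cleaner and toy data, is to iterate over the column values directly. Whether this noticeably shortens the run time is not measured here; the cleaning itself likely dominates.

```python
import pandas as pd

def toy_cleaner(text):
    # Hypothetical stand-in for the real cleaning function
    return " ".join(text.lower().split())

# Toy frame with a non-default index, for illustration only
toy = pd.DataFrame({"text": ["  Hello   World ", "Second  TWEET"]},
                   index=[10, 11])

# Iterating over the column yields the values in row order, independent of
# the index labels; toy['text'][0] would raise a KeyError here
cleaned = [toy_cleaner(t) for t in toy["text"]]
print(cleaned)
```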


In [0]:
clean_df = pd.DataFrame(clean_tweet_texts,columns=['text'])
clean_df['sentiment'] = df['sentiment']

# Remove rows with empty text (after cleaning) and reset index
clean_df = clean_df.loc[clean_df['text'] != ""]
clean_df.reset_index(drop=True,inplace=True)

clean_df.head()

Unnamed: 0,text,sentiment
0,awww that bummer you shoulda got david carr of...,0
1,is upset that he can not update his facebook b...,0
2,dived many times for the ball managed to save ...,0
3,my whole body feels itchy and like its on fire,0
4,no it not behaving at all mad why am here beca...,0
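The cell above attaches the labels by index alignment and then drops tweets that became empty during cleaning. A small sketch on made-up data illustrates both steps, including why the index is reset afterwards:

```python
import pandas as pd

# Made-up cleaned texts; the second one is empty after cleaning
texts = ["good morning", "", "so tired"]
labels = pd.Series([1, 0, 0])

toy = pd.DataFrame(texts, columns=["text"])
# Assignment aligns on the index; both objects use the default 0..n-1 index
toy["sentiment"] = labels

# Drop empty texts; without reset_index the index would keep the gap (0, 2)
toy = toy.loc[toy["text"] != ""].reset_index(drop=True)
print(toy)
```

The alignment only gives the intended result because the list of cleaned texts is in the same order as the original frame.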


In [0]:
# Store the cleaned training data as a CSV file (not applicable when the
# web server is used as the storage location)
#clean_df.to_csv(datapath+'training_cleaned.csv',encoding='utf-8',index=False)
clean_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1596041 entries, 0 to 1596040
Data columns (total 2 columns):
text         1596041 non-null object
sentiment    1596041 non-null int64
dtypes: int64(1), object(1)
memory usage: 24.4+ MB
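One reason to remove empty texts before writing the CSV: with default settings, `pd.read_csv` reads empty fields back as `NaN`, not as empty strings. The following sketch uses an in-memory buffer and toy data in place of a real file to show the effect:

```python
import io
import pandas as pd

# Toy frame with one empty text, for illustration only
toy = pd.DataFrame({"text": ["good morning", ""], "sentiment": [1, 0]})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

reloaded = pd.read_csv(buffer)
# The empty string comes back as a missing value
n_missing = int(reloaded["text"].isna().sum())
print(reloaded)
print("missing texts after reload:", n_missing)
```

A later step that expects strings in the `text` column would have to handle such missing values, so filtering them out before saving keeps the stored file consistent.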
