### 02. Load train and test data
To evaluate our collected Twitter data we need a trained model, so we use a publicly available, pre-labelled dataset for training and testing. It consists of 1,600,000 labelled tweets and can be downloaded from Kaggle: https://www.kaggle.com/kazanova/sentiment140.

In [1]:
# import libraries
import pandas as pd
import numpy as np

In [24]:
# load words and vectors
words = np.load('words.npy')
words = words.tolist()
vectors = np.load('vectors.npy')

In [3]:
words = np.load('wordsList-Copy1.npy')
words = words.tolist()  # originally loaded as a NumPy array
words = [word.decode('UTF-8') for word in words]  # decode the bytes entries to str
vectors = np.load('wordVectors-Copy1.npy')

#### Data

In [4]:
# load the train and test data
data = pd.read_csv('/Users/olafdeleeuw/Desktop/ODSC/Project/ODSC-London-2018/data/data.csv', delimiter=',', quotechar='"', encoding="ISO-8859-1", header=None)

In [5]:
print(data.shape)
print(data.head())

(1600000, 6)
   0           1                             2         3                4  \
0  0  1467810369  Mon Apr 06 22:19:45 PDT 2009  NO_QUERY  _TheSpecialOne_   
1  0  1467810672  Mon Apr 06 22:19:49 PDT 2009  NO_QUERY    scotthamilton   
2  0  1467810917  Mon Apr 06 22:19:53 PDT 2009  NO_QUERY         mattycus   
3  0  1467811184  Mon Apr 06 22:19:57 PDT 2009  NO_QUERY          ElleCTF   
4  0  1467811193  Mon Apr 06 22:19:57 PDT 2009  NO_QUERY           Karoli   

                                                   5  
0  @switchfoot http://twitpic.com/2y1zl - Awww, t...  
1  is upset that he can't update his Facebook by ...  
2  @Kenichan I dived many times for the ball. Man...  
3    my whole body feels itchy and like its on fire   
4  @nationwideclass no, it's not behaving at all....  


#### Store the labels

To train the model we use the sentiment label (positive or negative). The labels are in the first column of the dataframe (column 0), where 4 is positive and 0 is negative. Each label is stored as a two-element array: [0, 1] for a negative sentiment and [1, 0] for a positive sentiment.

In [6]:
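labels = []
for polarity in data[0].values:
    if polarity == 0:
        label = [0, 1]  # negative
    elif polarity == 4:
        label = [1, 0]  # positive
    else:
        # any other value would otherwise silently reuse the previous label
        raise ValueError("unexpected polarity: {}".format(polarity))
    labels.append(label)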
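
The same mapping can also be written as a dictionary lookup. This is a minimal sketch on a handful of made-up polarity values; `polarities` and `label_map` are illustrative names that do not appear in the notebook, and the sketch assumes the column contains only the values 0 and 4.

```python
# Hypothetical stand-in for data[0].values
polarities = [0, 4, 4, 0]

# 0 -> [0, 1] (negative), 4 -> [1, 0] (positive)
label_map = {0: [0, 1], 4: [1, 0]}

# A KeyError here would signal an unexpected polarity value
demo_labels = [label_map[polarity] for polarity in polarities]
print(demo_labels)
```

A lookup table keeps the encoding in one place, which helps if the label format ever changes.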

In [7]:
data['labels'] = labels

In [8]:
data.head()

| | 0 | 1 | 2 | 3 | 4 | 5 | labels |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | \_TheSpecialOne\_ | @switchfoot http://twitpic.com/2y1zl - Awww, t... | [0, 1] |
| 1 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by ... | [0, 1] |
| 2 | 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | @Kenichan I dived many times for the ball. Man... | [0, 1] |
| 3 | 0 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | my whole body feels itchy and like its on fire | [0, 1] |
| 4 | 0 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | @nationwideclass no, it's not behaving at all.... | [0, 1] |


Split the data into a set of positive tweets and a set of negative tweets. This makes it easier later on to get batches for training the model.

In [11]:
data_pos = data[data[0].values == 4]
data_neg = data[data[0].values == 0]

### Data cleaning

As always, the data needs to be cleaned. Using a regular expression, we remove every character that is not a letter, a digit or a space, and we convert the text to lowercase.

In [12]:
# clean up the text: remove special characters and convert to lowercase
import re
remove_chrs = re.compile("[^A-Za-z0-9 ]+")

def clean_up_text(tweet):
    tweet = tweet.lower().replace("<br />", " ")
    return remove_chrs.sub("", tweet)
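
To see what this cleaning step does to a single tweet, here is a small self-contained sketch. The `raw` string is an illustrative sample modelled on the first tweet in the dataset, not a verbatim copy of it.

```python
import re

# Same pattern as in the notebook: drop everything except letters, digits and spaces
remove_chrs = re.compile("[^A-Za-z0-9 ]+")

def clean_up_text(tweet):
    tweet = tweet.lower().replace("<br />", " ")
    return remove_chrs.sub("", tweet)

raw = "@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer."
cleaned = clean_up_text(raw)
print(cleaned)
# switchfoot httptwitpiccom2y1zl  awww thats a bummer
```

Note that user handles and URLs are not removed. They collapse into single tokens such as `httptwitpiccom2y1zl`, and the double space is what remains of the deleted hyphen.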

In [13]:
data_cleaned_tweets_pos = []
for tweet in data_pos[5]:  # the tweet text is in the column labelled 5
    cleaned_tweet = clean_up_text(tweet)
    data_cleaned_tweets_pos.append(cleaned_tweet)
data_cleaned_tweets_neg = []
for tweet in data_neg[5]:  # the tweet text is in the column labelled 5
    cleaned_tweet = clean_up_text(tweet)
    data_cleaned_tweets_neg.append(cleaned_tweet)

Check out the result of the cleaning on the first 10 tweets of each set.

In [14]:
print(data_cleaned_tweets_pos[0:10])
print(data_cleaned_tweets_neg[0:10])

['i love health4uandpets u guys r the best ', 'im meeting up with one of my besties tonight cant wait   girl talk', 'darealsunisakim thanks for the twitter add sunisa i got to meet you once at a hin show here in the dc area and you were a sweetheart ', 'being sick can be really cheap when it hurts too much to eat real food  plus your friends make you soup', 'lovesbrooklyn2 he has that effect on everyone ', 'productoffear you can tell him that i just burst out laughing really loud because of that  thanks for making me come out of my sulk', 'rkeithhill thans for your response ihad already find this answer ', 'keepinupwkris i am so jealous hope you had a great time in vegas how did you like the acms love your show ', 'tommcfly ah congrats mr fletcher for finally joining twitter ', 'e4voip i responded  stupid cat is helping me type forgive errors ']
['switchfoot httptwitpiccom2y1zl  awww thats a bummer  you shoulda got david carr of third day to do it d', 'is upset that he cant update his 

#### Split the tweets into lists of words

Each cleaned tweet is split into a list of words so that we can look up the index, and with it the vector, of every word.

In [15]:
data_split_cleaned_pos = []
for tweet_cleaned in data_cleaned_tweets_pos:
    tweet_splitted = tweet_cleaned.split()
    data_split_cleaned_pos.append(tweet_splitted)
data_split_cleaned_neg = []
for tweet_cleaned in data_cleaned_tweets_neg:
    tweet_splitted = tweet_cleaned.split()
    data_split_cleaned_neg.append(tweet_splitted)

In [16]:
lengths = []
for sentence in data_split_cleaned_pos:
    length = len(sentence)
    lengths.append(length)

In [17]:
max(lengths)

41
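
The cells above measure only the positive tweets. A minimal sketch of checking both sets before fixing the vector length could look like this; `tokens_pos` and `tokens_neg` are hypothetical stand-ins for `data_split_cleaned_pos` and `data_split_cleaned_neg`.

```python
# Hypothetical stand-ins for the two lists of tokenised tweets
tokens_pos = [["i", "love", "it"], ["great", "day"]]
tokens_neg = [["so", "tired", "of", "this", "rain"], ["bad"]]

# Longest tweet per class, then the overall maximum
max_len_pos = max(len(tokens) for tokens in tokens_pos)
max_len_neg = max(len(tokens) for tokens in tokens_neg)
max_len = max(max_len_pos, max_len_neg)
print(max_len_pos, max_len_neg, max_len)
```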

#### Get the word indices with the function from the previous notebook
The longest positive tweet contains 41 words; another set may contain slightly longer ones. To be on the safe side we create index vectors of length 75, padded with zeros. Keep in mind that the longer the vector, the longer it takes to train a model.

In [18]:
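def turn_sentence_to_indices(sentence):
    indices = np.zeros(75, dtype='int32')
    # tokens beyond position 75 are dropped so the vector keeps a fixed length
    for i, word in enumerate(sentence[:75]):
        try:
            indices[i] = words.index(word)
        except ValueError:
            indices[i] = 0  # word is not in the vocabulary
    return indices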
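
Because `words.index(...)` scans the whole list on every call, converting 1.6 million tweets this way can take a long time. One possible alternative, sketched below under the assumption that a one-off dictionary fits in memory, is to build a word-to-index mapping first. The names `vocab`, `word_to_index` and `sentence_to_indices` are illustrative and not part of the notebook.

```python
import numpy as np

# Hypothetical miniature vocabulary; in the notebook this would be the `words` list
vocab = ["unk", "i", "love", "twitter", "you"]

# Build the lookup once; setdefault keeps the first position of a repeated word,
# which matches what list.index() would return
word_to_index = {}
for position, word in enumerate(vocab):
    word_to_index.setdefault(word, position)

MAX_LEN = 75  # same fixed length as in the notebook

def sentence_to_indices(sentence, lookup, max_len=MAX_LEN):
    indices = np.zeros(max_len, dtype="int32")
    for i, word in enumerate(sentence[:max_len]):
        indices[i] = lookup.get(word, 0)  # unknown words map to 0
    return indices

example = sentence_to_indices(["i", "love", "pizza", "you"], word_to_index)
print(example[:5])
```

A dictionary lookup takes roughly constant time per word, whereas `list.index` grows with the vocabulary size.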

In [19]:
data_split_word_indices_pos = []
for tweet in data_split_cleaned_pos:
    tweet_word_indices = turn_sentence_to_indices(tweet)
    data_split_word_indices_pos.append(tweet_word_indices)
data_split_word_indices_neg = []
for tweet in data_split_cleaned_neg:
    tweet_word_indices = turn_sentence_to_indices(tweet)
    data_split_word_indices_neg.append(tweet_word_indices)

In [20]:
data_split_word_indices_pos = np.asarray(data_split_word_indices_pos)
data_split_word_indices_neg = np.asarray(data_split_word_indices_neg)

In [21]:
# save the indices
np.save("indices_pos_wl2", data_split_word_indices_pos)
np.save("indices_neg_wl2", data_split_word_indices_neg)
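
`np.save` appends the `.npy` extension when the file name does not already have it, so the arrays above end up in `indices_pos_wl2.npy` and `indices_neg_wl2.npy`. The following sketch shows the round trip on a small made-up array in a temporary directory; `demo_indices` and the file name `indices_demo` are hypothetical.

```python
import os
import tempfile

import numpy as np

# Hypothetical small index matrix standing in for data_split_word_indices_pos
demo_indices = np.array([[41, 835, 0], [14663, 286, 60]], dtype="int32")

with tempfile.TemporaryDirectory() as tmp_dir:
    target = os.path.join(tmp_dir, "indices_demo")  # no extension given
    np.save(target, demo_indices)                   # written as indices_demo.npy
    saved_files = os.listdir(tmp_dir)
    restored = np.load(target + ".npy")

print(saved_files)
print(np.array_equal(restored, demo_indices))
```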

In [22]:
# check out the indices
print(data_split_word_indices_pos[0:10])
print(data_split_word_indices_neg[0:10])

[[    41    835      0   6479   2284   1911 201534    254      0      0
       0      0      0      0      0      0      0      0      0      0
       0      0      0      0      0      0      0      0      0      0
       0      0      0      0      0      0      0      0      0      0
       0      0      0      0      0      0      0      0      0      0
       0      0      0      0      0      0      0      0      0      0
       0      0      0      0      0      0      0      0      0      0
       0      0      0      0      0]
 [ 14663    286     60     17     48      3    192      0   4385  52717
    2472   1749   1077      0      0      0      0      0      0      0
       0      0      0      0      0      0      0      0      0      0
       0      0      0      0      0      0      0      0      0      0
       0      0      0      0      0      0      0      0      0      0
       0      0      0      0      0      0      0      0      0      0
       0      0      0    