# Exploring and pre-processing the data



In any machine learning problem, one of the main tasks is to gain sufficient knowledge about the data we are given. This helps us build a better feature representation of a tweet and maximize our "prediction accuracy / computational power" ratio.

In the first part we therefore investigate how the data is generated: how many single words do we have, are there duplicates, is there anything odd about the tweets? We also look at how the tweets are classified (we are given two datasets, one for the positive tweets and one for the negative ones), i.e. what is a typical word that occurs when a :( is present in a tweet?


In the second part, we normalize the tweets. Some tweets may contain words that are not useful, or that can be grouped into categories. This is done by the following pipeline.

1. Import all the tweets from train_pos_full.txt and train_neg_full.txt
2. Collapse every run of more than three identical consecutive letters into a single letter
3. Find the representative of each word
4. 



### Helper functions and files

IOTweets contains the functions :
- build_df
- import_
- export

ProcessTweets contains the functions :
- merging
- filter_single_rep
- powerful_words
- sem_by_repr
- sem_by_repr2
- no_dot
- find_repetition
- no_s
- stem_tweet
- set_min_diff
- contains
- process_tweets_1
- process_tweets_2


In [1]:
import pandas as pd  # Pandas will be our framework for the first part
from nltk.stem import PorterStemmer
import nltk
from nltk.tokenize import TweetTokenizer
from IOTweets import *
from ProcessTweets import *
import csv
#%pylab inline # deprecated, use individual imports instead

ps = PorterStemmer()
tknzr = TweetTokenizer(preserve_case=False)  # lowercase every token

In [2]:
vocab_pos = "vocab_cut_pos.txt"
vocab_neg = "vocab_cut_neg.txt"

vocab_pos_full = "vocab_cut_pos_full.txt"
vocab_neg_full = "vocab_cut_neg_full.txt"
vocab_test = "vocab_test_data.txt"

path_tweets_pos = "train_pos.txt"
path_tweets_neg = "train_neg.txt"

path_tweets_pos_full = "train_pos_full.txt"
path_tweets_neg_full = "train_neg_full.txt"
path_tweets_test = "test_data.txt"

## 1. An insight into the dataset 
We begin by analysing the raw (unprocessed) dataset.

In [3]:
def write_vocab(tweets, cut_threshold, file_name , bitri):
    counter = build_vocab_counter(tweets, cut_threshold, bitri)
    write_vocab_to_file(counter, (file_name + "_cut=" +str(cut_threshold) +"_bitri="+str(bitri)))
    

In [4]:
#import tweets
pos_tweets = import_(path_tweets_pos)
neg_tweets = import_(path_tweets_neg)
pos_full_tweets = import_(path_tweets_pos_full)
neg_full_tweets = import_(path_tweets_neg_full)
test_tweets = import_without_id(path_tweets_test)

#Create vocabs
cut_threshold = 5
bitri = True

write_vocab(pos_tweets, cut_threshold, "raw_vocab_pos" , bitri)
write_vocab(neg_tweets, cut_threshold, "raw_vocab_neg" , bitri)
write_vocab(test_tweets, cut_threshold, "raw_vocab_test" , bitri)
write_vocab(pos_full_tweets, cut_threshold, "raw_vocab_pos_full" , bitri)
write_vocab(neg_full_tweets, cut_threshold, "raw_vocab_neg_full" , bitri)


# build the vocab Dataframe (DF) for full positive and negative tweets
pos_full_df = build_df("raw_vocab_pos_full_cut="+str(cut_threshold)+"_bitri="+str(bitri), bitri)
neg_full_df = build_df("raw_vocab_neg_full_cut="+str(cut_threshold)+"_bitri="+str(bitri), bitri)

# build the vocab Dataframe (DF) for partial positive and negative tweets
pos_df = build_df("raw_vocab_pos_cut="+str(cut_threshold)+"_bitri="+str(bitri), bitri)
neg_df = build_df("raw_vocab_neg_cut="+str(cut_threshold)+"_bitri="+str(bitri), bitri)

#Build the vocab DF for tweets to be decided (i.e. "test" tweets)
test_df = build_df("raw_vocab_test_cut="+str(cut_threshold)+"_bitri="+str(bitri), bitri)

removing items present less than 5 times
Time elpased (hh:mm:ss.ms) 0:00:12.543123
removing items present less than 5 times
Time elpased (hh:mm:ss.ms) 0:00:14.648871
removing items present less than 5 times
Time elpased (hh:mm:ss.ms) 0:00:01.305080
removing items present less than 5 times
Time elpased (hh:mm:ss.ms) 0:02:34.135139
removing items present less than 5 times
Time elpased (hh:mm:ss.ms) 0:03:04.960437


In [5]:
cut_threshold = 5
bitri = True

pos_df = build_df("raw_vocab_pos_cut="+str(cut_threshold)+"_bitri="+str(bitri), bitri)
neg_df = build_df("raw_vocab_neg_cut="+str(cut_threshold)+"_bitri="+str(bitri), bitri)
pos_full_df = build_df("raw_vocab_pos_full_cut="+str(cut_threshold)+"_bitri="+str(bitri), bitri)
neg_full_df = build_df("raw_vocab_neg_full_cut="+str(cut_threshold)+"_bitri="+str(bitri), bitri)

#Build the vocab DF for tweets to be decided (i.e. "test" tweets)
test_df = build_df("raw_vocab_test_cut="+str(cut_threshold)+"_bitri="+str(bitri), bitri)

### 1.1 Study of word pertinence 
The idea is that the greater the ratio of a word, the more pertinent it is to consider that word when trying to label a tweet that contains it.
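As a small worked example, the `ratio` column of the merged table appears to be the absolute difference between the two counts divided by their sum; this matches the rows printed in this notebook, e.g. 5 / 6005 ≈ 0.000833 for "photo". The helper name `pertinence_ratio` is hypothetical and is not part of `ProcessTweets`.

```python
def pertinence_ratio(occurence_neg, occurence_pos):
    """|neg - pos| / (neg + pos): 0 means evenly split, 1 means seen in one class only."""
    somme = occurence_neg + occurence_pos
    if somme == 0:
        return 0.0  # word never seen: no evidence either way
    return abs(occurence_neg - occurence_pos) / somme


# Counts of "(photo,)" taken from the table in this notebook
print(round(pertinence_ratio(3000, 3005), 6))
# A word seen only in negative tweets gets the maximal ratio
print(pertinence_ratio(120, 0))
```

A word with a ratio close to 0 is spread evenly over both classes and says little about the label, whatever its total count.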

In [6]:

neg_full_df

#merge the pos and neg vocabs and study the pertinence of words via their ratio
merged_full = merging(neg_full_df, pos_full_df)
merged_full.sort_values(by = ["ratio","somme"], ascending=[True, True])


| | word | occurence_neg | occurence_pos | difference | somme | ratio | len |
|---|---|---|---|---|---|---|---|
| 1201 | (photo,) | 3000 | 3005 | 5 | 6005 | 0.000833 | 1 |
| 581 | (had, a) | 5724 | 5738 | 14 | 11462 | 0.001221 | 2 |
| 1076 | (all, my) | 3308 | 3299 | 9 | 6607 | 0.001362 | 2 |
| 917 | (yesterday,) | 3799 | 3786 | 13 | 7585 | 0.001714 | 1 |
| 2770 | (", .) | 1394 | 1399 | 5 | 2793 | 0.001790 | 2 |
| 579 | (class,) | 5729 | 5754 | 25 | 11483 | 0.002177 | 1 |
| 1492 | (turn,) | 2432 | 2443 | 11 | 4875 | 0.002256 | 1 |
| 134 | (or,) | 24351 | 24227 | 124 | 48578 | 0.002553 | 1 |
| 1416 | (..., but) | 2566 | 2580 | 14 | 5146 | 0.002721 | 2 |
| 2430 | (next, year) | 1576 | 1567 | 9 | 3143 | 0.002864 | 2 |


##### From here, we realise that some words have a strangely high occurrence in the negative tweets.
By looking at these words in context, we realised that some tweets occur more than once.
We checked whether those words were also present in the test_data that we have to classify, and they were.
We will therefore capture those words (e.g. "1gb" or "cd-rom"), because they are likely to appear in the test_data set, and directly classify the tweets containing them. We will also drop all duplicated tweets from the training data, so that they do not overshadow other words and so that we save computation.

Two examples of such tweets are:

    1) 1.26 - 7x14 custom picture frame / poster frame 1.265 " wide complete cherry wood frame ( 440ch this frame is manufactu ... <url>
    
    2) misc - 50pc diamond burr set - ceramics tile glass lapidary for rotary tools ( misc . assorted shapes and sizes for your ... <url>
    
Another important observation: some tweets are almost identical, with only a few details changing. Example of such a tweet from Amazon [here](https://www.amazon.com/Custom-Picture-Frame-Poster-Complete/dp/B004FNYSBA?SubscriptionId=AKIAJ6364XFIEG2FHXPA&tag=megaebookmall-20&linkCode=sp1&camp=2025&creative=165953&creativeASIN=B004FNYSBA):

1. `3x14 custom picture frame / poster frame 1.265 " wide complete black wood frame ( 440bk this frame is manufactur ... <url>`

2. `24x35 custom picture frame / poster frame 1.265 " wide complete green wood frame ( 440gr this frame is manufactu ... <url>`

3. `22x31 custom picture frame / poster frame 2 " wide complete black executive leather frame ( 74093 this frame is ... <url>`
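One possible way to spot such near-duplicates is to compare the sets of tokens of two tweets. The sketch below uses the Jaccard similarity (size of the intersection divided by size of the union); this is an illustration only and not the method used in `ProcessTweets`, and the function name `jaccard` is hypothetical.

```python
def jaccard(tweet_a, tweet_b):
    """Token-set similarity in [0, 1]; 1 means both tweets use exactly the same tokens."""
    tokens_a, tokens_b = set(tweet_a.split()), set(tweet_b.split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty tweets are considered identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# The first two picture-frame tweets listed above
frame_1 = '3x14 custom picture frame / poster frame 1.265 " wide complete black wood frame ( 440bk this frame is manufactur ... <url>'
frame_2 = '24x35 custom picture frame / poster frame 1.265 " wide complete green wood frame ( 440gr this frame is manufactu ... <url>'

print(jaccard(frame_1, frame_2))             # high: only a few tokens differ
print(jaccard(frame_1, "i love this song"))  # low: almost nothing in common
```

Any threshold used to decide that two tweets are "almost the same" would have to be tuned on the data; none is suggested by this notebook.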

### 1.2 Study of names with same beginning 
Let's try to gain some insight into words that share the same beginning.
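The idea can be sketched on a plain list of words as follows. This is a minimal illustration that assumes the length filter is meant in characters; `words_sharing_prefix` and the toy vocabulary are hypothetical and are not part of `ProcessTweets`.

```python
def words_sharing_prefix(vocab, prefix_len=8):
    """Map each word of exactly `prefix_len` characters to the other words starting with it."""
    roots = [w for w in vocab if len(w) == prefix_len]
    groups = {}
    for root in roots:
        matches = [w for w in vocab if w != root and w.startswith(root)]
        if matches:  # keep only roots that actually have longer variants
            groups[root] = matches
    return groups


# Toy vocabulary, for illustration only
vocab = ["tomorrow", "tomorrows", "tomorrowww", "birthday", "birthdays", "cat"]
print(words_sharing_prefix(vocab))
```

Each key is a candidate representative, and each value lists the words that would be mapped onto it.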

In [7]:
# are words (filtered by length) whose beginning match another word close to this word?
same_begin = list(merged_full[merged_full["len"]==8]["word"])
same_begin = [list(merged_full.loc[(merged_full.word.str.startswith(w))]["word"]) for w in same_begin]

#We visualize which word would be mapped on which one 
filter_single_rep(same_begin)

[]

### 1.3 Expected syntaxic divergences
As expected, words like "haha" can be written in many ways (extended length, typos, ...). We know that they all mean the same thing, so it is useful to spot them in order to later stem them to the same representative.
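A minimal sketch of such a mapping, assuming (as in the cell of this section) that any word starting with "haha" or "ahah" is a variant of "haha". The function name `normalize_laugh` is hypothetical.

```python
def normalize_laugh(word):
    """Map every spelling variant of a laugh onto the single representative "haha"."""
    if word.startswith(("haha", "ahah")):
        return "haha"
    return word  # every other word is left untouched


for w in ["hahahaha", "ahahah", "hahaa", "happy"]:
    print(w, "->", normalize_laugh(w))
```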

In [8]:
#words beginning with "haha" or "ahah" are sure to be "instances" of haha
word_haha = test_df.loc[test_df.word.str.startswith("haha") | test_df.word.str.startswith("ahah")]


#print("Occurence of the words that can be replaced by haha = "+ str(int(word_haha.occurence.sum()) - int(word_haha[word_haha.word == "haha"].occurence)))

# We make the list of all those words that have this same semantic
word_haha_list = list(word_haha.word)
list(word_haha.word)

[]

### 1.4 Study duplicates
We noticed that several tweets are duplicated within the test, pos and neg tweets. We study how many there are, and which tweets (if any) are shared between the test set and the pos/neg sets.
After running the check, we found no tweet that is shared by all of the tweet sets, so a duplicate in the test dataset can be labelled directly by looking for the corresponding duplicate in pos/neg.
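Both questions can be sketched with the standard library: `collections.Counter` counts how often each tweet occurs, and a set intersection finds the tweets shared between two sets. The function names and the toy tweets below are hypothetical.

```python
from collections import Counter


def duplicated_tweets(tweets):
    """Return {tweet: count} for every tweet that occurs more than once."""
    return {tweet: n for tweet, n in Counter(tweets).items() if n > 1}


def shared_tweets(train_tweets, test_tweets):
    """Return the set of tweets present in both collections."""
    return set(train_tweets) & set(test_tweets)


# Toy data, for illustration only
neg_sample = ["misc - 50pc diamond burr set", "so sad today", "misc - 50pc diamond burr set"]
test_sample = ["misc - 50pc diamond burr set", "what a nice day"]

print(duplicated_tweets(neg_sample))
print(shared_tweets(neg_sample, test_sample))
```

Both helpers run in roughly linear time, whereas comparing every test tweet with every training tweet in nested loops would be quadratic.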

In [9]:
#load raw tweets
pos = import_(path_tweets_pos_full)
neg = import_(path_tweets_neg_full)
data = import_without_id(path_tweets_test)

In [10]:
common_pos_test = []
common_neg_test = []

# sets give constant-time membership tests
pos = set(pos_tweets)
neg = set(neg_tweets)
test_tweets = set(test_tweets)
len_test = len(test_tweets)
for i, test_tweet in enumerate(test_tweets):
    if test_tweet in pos:
        common_pos_test.append(test_tweet)
        print("duplicate pos-test")
        print("\ttweet :" + str(test_tweet))
        break  # stop at the first tweet shared with the positive set
    elif test_tweet in neg:
        common_neg_test.append(test_tweet)
        print("duplicate neg-test")
        print("\ttweet :" + str(test_tweet))
        break  # stop at the first tweet shared with the negative set
    # progress, as a percentage of the test tweets scanned so far
    print("{:.1f}".format(100 * i / len_test), "%", end='\r')

print("done")

done% % % % % %%


In [11]:
def occurences(tweet, tweets):
    count = 0
    for t in tweets:
        if t == tweet:
            count += 1
    return count
    

## 2. Data pre-processing
In this part we start from the raw tweets and pre-process them so that the words are written uniformly.


In [12]:
#load raw tweets
pos = import_(path_tweets_pos_full)
neg = import_(path_tweets_neg_full)
data = import_without_id(path_tweets_test)

### 2.1  Data Standardization
We start by collapsing runs of more than 3 identical letters and by replacing dots with spaces. For example:

1) `no_dot` : "Funny.I" becomes "Funny I" 

2) `find_repetition` : "I looooove" becomes "I love"
            
Then we define the representative of each word.
More precisely, all words of length 8 or more are truncated, and words beginning with "hah" or "aha" are stemmed to "haha".

This is done by the function `standardize_tweets`.
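The first two steps can be sketched with a string replacement and a regular expression. This is only an illustration of the behaviour described above, not the actual code of `no_dot` or `find_repetition`; the names `replace_dots` and `collapse_repetitions` are hypothetical, and the truncation step is left out because the target length is not stated here.

```python
import re

# A letter followed by at least three more copies of itself (four or more in total)
_REPEATED_LETTER = re.compile(r"([a-zA-Z])\1{3,}")


def replace_dots(tweet):
    """Replace every dot by a space so that glued words are separated."""
    return tweet.replace(".", " ")


def collapse_repetitions(tweet):
    """Collapse each run of more than three identical letters into one letter."""
    return _REPEATED_LETTER.sub(r"\1", tweet)


print(replace_dots("Funny.I"))             # Funny I
print(collapse_repetitions("I looooove"))  # I love
print(collapse_repetitions("sooo good"))   # unchanged: only three "o"
```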


In [13]:
#standardize tweets
print("Process data pos")
processed_tweets_pos_full = standardize_tweets(pos)
print("Process data neg")
processed_tweets_neg_full = standardize_tweets(neg) 
print("Process data test")
processed_tweets_test_full = standardize_tweets(data) 

print("export processed tweets to files")
export(processed_tweets_pos_full,  "cleaned_pos_full")
export(processed_tweets_neg_full,  "cleaned_neg_full")
export(processed_tweets_test_full, "cleaned_test_full")

Process data pos
Process data neg
Process data test
export processed tweets to files


### 2.2 Build the vocabs
We build vocabs for the positive, negative and test tweets, removing tokens that occur fewer than 5 times.
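A minimal sketch of such a counter, assuming that the `bitri` flag means "also count bigrams and trigrams" (the vocab tables in this notebook contain entries such as `(had, a)`). The function `count_ngrams` is hypothetical and is not the actual `build_vocab_counter`.

```python
from collections import Counter


def count_ngrams(tweets, cut_threshold=5, max_n=3):
    """Count all 1..max_n-grams and drop those seen fewer than cut_threshold times."""
    counter = Counter()
    for tweet in tweets:
        tokens = tweet.split()
        for n in range(1, max_n + 1):
            for start in range(len(tokens) - n + 1):
                counter[tuple(tokens[start:start + n])] += 1
    return Counter({gram: c for gram, c in counter.items() if c >= cut_threshold})


# Toy data, for illustration only
tweets = ["i had a dream"] * 5 + ["i had a cat"] * 2
vocab = count_ngrams(tweets, cut_threshold=5)
print(vocab[("had", "a")])   # seen in all 7 tweets
print(("cat",) in vocab)     # seen only twice, so removed
```

Storing every n-gram as a tuple keeps unigrams, bigrams and trigrams in one structure, and the length of the tuple plays the role of the `len` column.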

In [14]:
#build word counters
cut_threshold = 5
vocab_pos =  build_vocab_counter(processed_tweets_pos_full, cut_threshold, True)
vocab_neg =  build_vocab_counter(processed_tweets_neg_full, cut_threshold, True)
vocab_test = build_vocab_counter(processed_tweets_test_full, cut_threshold, True)

#write vocabs to files
write_vocab_to_file(vocab_pos, "cleaned_vocab_pos_full_bitri=True")
write_vocab_to_file(vocab_neg, "cleaned_vocab_neg_full_bitri=True")
write_vocab_to_file(vocab_test, "cleaned_vocab_test_full_bitri=True")

removing items present less than 5 times
Time elpased (hh:mm:ss.ms) 0:02:32.115393
removing items present less than 5 times
Time elpased (hh:mm:ss.ms) 0:03:02.518599
removing items present less than 5 times
Time elpased (hh:mm:ss.ms) 0:00:01.345433


### 2.3 Create the relevant vocab
We keep the pertinent words in a special vocab called the relevant vocab.
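The selection can be sketched as a filter on the merged counts. This assumes that `pertinence_thres` is a lower bound on the ratio and `min_count` a lower bound on the total number of occurrences; the actual criteria inside `create_relevant_vocab` may differ. The function `relevant_vocab` and the counts marked as hypothetical are for illustration only.

```python
def relevant_vocab(rows, pertinence_thres=0.3, min_count=300):
    """Keep the words that are both frequent enough and skewed towards one class."""
    kept = []
    for word, occ_neg, occ_pos in rows:
        somme = occ_neg + occ_pos
        ratio = abs(occ_neg - occ_pos) / somme if somme else 0.0
        if somme >= min_count and ratio >= pertinence_thres:
            kept.append(word)
    return kept


rows = [
    (("photo",), 3000, 3005),  # counts from this notebook: frequent but evenly split
    (("sad",), 900, 100),      # hypothetical counts: frequent and skewed
    (("rare",), 10, 0),        # hypothetical counts: skewed but too rare
]
print(relevant_vocab(rows))
```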

In [3]:
#Load dataframes before merging them
pos_df = build_df("cleaned_vocab_pos_full_bitri=True", True)
neg_df = build_df("cleaned_vocab_neg_full_bitri=True", True)

#Merge dataframes
merged = merging(neg_df, pos_df, False)

#create relevant vocab
create_relevant_vocab(pertinence_thres=0.3, min_count=300, dataframe=merged)

### 2.4 Find characteristic words
We build a list of words that appear in only one of the neg/pos sets, at least `min_diff` times, so that when we see such a word in a tweet we assume that the tweet is a neg/pos tweet accordingly.
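A minimal sketch of this selection, under the assumption that "only in neg/pos" means a count of zero in the other set. The function `one_sided_words` and all the counts below are hypothetical; the actual logic of `characteristic_words` is not shown in this notebook.

```python
def one_sided_words(rows, min_diff):
    """Split words into those seen only in neg and those seen only in pos tweets."""
    only_neg, only_pos = [], []
    for word, occ_neg, occ_pos in rows:
        if occ_pos == 0 and occ_neg >= min_diff:
            only_neg.append(word)
        elif occ_neg == 0 and occ_pos >= min_diff:
            only_pos.append(word)
    return only_neg, only_pos


# Hypothetical counts, for illustration only
rows = [("cd-rom", 400, 0), ("1gb", 350, 0), ("yay", 0, 500), ("photo", 3000, 3005), ("meh", 3, 0)]
only_neg, only_pos = one_sided_words(rows, min_diff=100)
print(only_neg)
print(only_pos)
```

A tweet containing a word from `only_neg` would then be labelled negative without going through the classifier, and similarly for `only_pos`.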

In [4]:
data_tweets = import_("relevant_vocab_pert=0.3_count=300")

characteristic_words(data_tweets=data_tweets, merged=merged, is_full=True)

40170not sucessful
25493not sucessful
25430not sucessful
...

### 2.5 Recap function
This function performs all the steps presented above. We created it for the sake of clarity.

In [None]:
#clean raw tweets
def setup(path_pos, path_neg, path_test, is_full,  pertinence_thres_relevant, min_count_relevant, cut_threshold=5, bitri = True):
    #path_pos, path_neg, path_test : paths of the raw tweets
    #is_full : True iff we clean the full datasets (file names then carry the "_full" suffix)
    
    #export cleaned tweets and their vocabulary
    cleaned_tweets = clean_tweets(path_pos, path_neg, path_test, is_full, cut_threshold)
    
    if is_full:
        full_string = "_full"
    else :
        full_string = ""
    
    
    #load dataframes    
    pos_df = build_df("cleaned_vocab_pos"+str(full_string)+"_bitri="+str(bitri), bitri)
    neg_df = build_df("cleaned_vocab_neg"+str(full_string)+"_bitri="+str(bitri), bitri)
   
    #Merge dataframes
    merged = merging(neg_df, pos_df, False)
    
    #create relevant vocab
    create_relevant_vocab(pertinence_thres_relevant, min_count_relevant, merged)
    
    data_tweets = import_("cleaned_test_bitri="+str(bitri))
    
    #create characteristic_words
    characteristic_words(data_tweets, merged, is_full)
    
    #build the global vocabulary
    build_global_vocab(is_full, bitri, cut_threshold)
    
    
setup("train_pos.txt", "train_neg.txt" , "test_data.txt", is_full=False, pertinence_thres_relevant=0.3, min_count_relevant=250)