# Exploring and pre-processing the data



In any machine learning problem, one of the main tasks is to gain sufficient knowledge about the data one is given. This helps us build a better feature representation of a tweet and maximize our "computational power / prediction accuracy" ratio.

In the first part we therefore investigate how the data is generated: how many single words do we have, are there duplicates, and is there anything odd about the tweets? We also look at how they are classified (we are given two datasets, one for the positive tweets and one for the negative ones), i.e. what is a typical word that occurs when a :( is present in a tweet?


In the second part, we normalize the tweets. Some tweets may contain words that are not useful or that can be grouped into categories. This is done by the following pipeline.

1. Import all the tweets from `train_pos_full.txt` and `train_neg_full.txt`
2. Collapse any run of more than three identical consecutive letters into a single letter
3. Find the representative of each word
4. 



### Helper functions and files

`IOTweets` contains the functions:
- build_df
- import_
- export

`ProcessTweets` contains the functions:
- merging
- filter_single_rep
- powerful_words
- sem_by_repr
- sem_by_repr2
- no_dot
- find_repetition
- no_s
- stem_tweet
- set_min_diff
- contains
- process_tweets_1
- process_tweets_2


In [1]:
import pandas as pd  # Pandas will be our framework for the first part
from nltk.stem import PorterStemmer
import nltk
from IOTweets import *
from ProcessTweets import *
import csv
#%pylab inline # deprecated, use individual imports instead

ps = PorterStemmer()

In [44]:
vocab_pos = "vocab_cut_pos.txt"
vocab_neg = "vocab_cut_neg.txt"

vocab_pos_full = "vocab_cut_pos_full.txt"
vocab_neg_full = "vocab_cut_neg_full.txt"
vocab_test = "vocab_test_data.txt"

path_tweets_pos = "train_pos.txt"
path_tweets_neg = "train_neg.txt"

path_tweets_pos_full = "train_pos_full.txt"
path_tweets_neg_full = "train_neg_full.txt"
path_tweets_test = "test_data.txt"

## 1. An insight into the dataset 
We begin by analysing the non-processed dataset.

In [3]:
#import tweets
pos_tweets = import_(path_tweets_pos)
neg_tweets = import_(path_tweets_neg)
pos_full_tweets = import_(path_tweets_pos_full)
neg_full_tweets = import_(path_tweets_neg_full)
test_tweets = import_without_id(path_tweets_test)

#Count for vocab
cut_threshold = 5
pos_counter = build_vocab_counter(pos_tweets, cut_threshold, True)
neg_counter = build_vocab_counter(neg_tweets, cut_threshold, True)
test_counter = build_vocab_counter(test_tweets, cut_threshold, True)
pos_full_counter = build_vocab_counter(pos_full_tweets, cut_threshold, True)
neg_full_counter = build_vocab_counter(neg_full_tweets, cut_threshold, True)

#write raw vocabs to file
write_vocab_to_file(pos_counter, ("raw_vocab_pos_cut="+str(cut_threshold)))
write_vocab_to_file(neg_counter, ("raw_vocab_neg_cut="+str(cut_threshold)))
write_vocab_to_file(test_counter, ("raw_vocab_test_cut="+str(cut_threshold)))
write_vocab_to_file(pos_full_counter, ("raw_vocab_pos_full_cut="+str(cut_threshold)))
write_vocab_to_file(neg_full_counter, ("raw_vocab_neg_full_cut="+str(cut_threshold)))

# build the vocab Dataframe (DF) for full positive and negative tweets
pos_full_df = build_df("raw_vocab_pos_full_cut="+str(cut_threshold))
neg_full_df = build_df("raw_vocab_neg_full_cut="+str(cut_threshold))

# build the vocab Dataframe (DF) for partial positive and negative tweets
pos_df = build_df("raw_vocab_pos_cut="+str(cut_threshold))
neg_df = build_df("raw_vocab_neg_cut="+str(cut_threshold))

#Build the vocab DF for tweets to be decided (i.e. "test" tweets)
test_df = build_df("raw_vocab_test_cut="+str(cut_threshold))

removing items present less than 5 times
Time elpased (hh:mm:ss.ms) 0:00:12.594980
removing items present less than 5 times
Time elpased (hh:mm:ss.ms) 0:00:14.823640
removing items present less than 5 times
Time elpased (hh:mm:ss.ms) 0:00:01.366152
removing items present less than 5 times
Time elpased (hh:mm:ss.ms) 0:02:45.660750
removing items present less than 5 times
Time elpased (hh:mm:ss.ms) 0:03:16.936722


### 1.1 Study of word pertinence 
The idea is that the greater a word's ratio is, the more pertinent it is to consider that word when trying to label a tweet that contains it.

In [4]:
#merge the pos and neg vocabs and study pertinence of words via ratio
merged_full = merging(neg_full_df, pos_full_df)
merged_full.sort_values(by = ["ratio","somme"], ascending=[False, False]).head()

| Unnamed: 0 | word | occurence_neg | occurence_pos | difference | somme | ratio | len |
|---|---|---|---|---|---|---|---|
| 67 | `((, paperback)` | 40169 | 0 | 40169 | 40169 | 1.0 | 2 |
| 120 | `(", wide)` | 25488 | 0 | 25488 | 25488 | 1.0 | 2 |
| 121 | `(frame, /)` | 25425 | 0 | 25425 | 25425 | 1.0 | 2 |
| 122 | `(/, poster)` | 25415 | 0 | 25415 | 25415 | 1.0 | 2 |
| 123 | `(poster, frame)` | 25408 | 0 | 25408 | 25408 | 1.0 | 2 |
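
The columns in the output above suggest how the pertinence ratio could be computed. The following is only a sketch in plain Python: it assumes that `difference` is the negative count minus the positive count, that `somme` is their sum, and that `ratio` is the absolute difference divided by the sum. The function name `pertinence_table` and the toy counts are hypothetical; the actual `merging` helper may work differently.

```python
from collections import Counter

def pertinence_table(neg_counts, pos_counts):
    """Return one row per word, sorted by ratio then total count (descending).

    Assumption: ratio = |occurence_neg - occurence_pos| / (occurence_neg + occurence_pos).
    """
    rows = []
    for word in set(neg_counts) | set(pos_counts):
        n = neg_counts.get(word, 0)
        p = pos_counts.get(word, 0)
        somme = n + p
        if somme == 0:
            continue  # skip words that never occur, to avoid dividing by zero
        difference = n - p
        rows.append({
            "word": word,
            "occurence_neg": n,
            "occurence_pos": p,
            "difference": difference,
            "somme": somme,
            "ratio": abs(difference) / somme,
        })
    # Most pertinent first; ties are broken by the total count.
    rows.sort(key=lambda r: (r["ratio"], r["somme"]), reverse=True)
    return rows

# Toy counts, made up for illustration only.
neg_counts = Counter({"frame": 8, "the": 50, "sad": 9})
pos_counts = Counter({"the": 50, "sad": 1, "love": 12})
for row in pertinence_table(neg_counts, pos_counts):
    print(row["word"], row["ratio"])
```

Under this assumed definition, a ratio of 1.0 means the word was seen in only one of the two classes, and a ratio of 0.0 means it was seen equally often in both.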


##### From here, we realise that some words occur surprisingly often in the negative tweets.
By looking at these words in context, we realised that some tweets occur more than once.
We checked whether those words also appear in the test_data that we have to classify. They do.
We will therefore capture those words (e.g. "1gb" or "cd-rom"), because they are likely to be in the test_data set, and directly classify the tweets containing them. We will also drop all duplicated tweets from the training data, so that other words are not left aside; this also saves computation.

Two examples of such tweets are:

    1) 1.26 - 7x14 custom picture frame / poster frame 1.265 " wide complete cherry wood frame ( 440ch this frame is manufactu ... <url>
    
    2) misc - 50pc diamond burr set - ceramics tile glass lapidary for rotary tools ( misc . assorted shapes and sizes for your ... <url>
    
Another important observation: some tweets are almost identical, with only a few details changing. An example of such a tweet comes from Amazon [here](https://www.amazon.com/Custom-Picture-Frame-Poster-Complete/dp/B004FNYSBA?SubscriptionId=AKIAJ6364XFIEG2FHXPA&tag=megaebookmall-20&linkCode=sp1&camp=2025&creative=165953&creativeASIN=B004FNYSBA):

1. `3x14 custom picture frame / poster frame 1.265 " wide complete black wood frame ( 440bk this frame is manufactur ... <url>`

2. `24x35 custom picture frame / poster frame 1.265 " wide complete green wood frame ( 440gr this frame is manufactu ... <url>`

3. `22x31 custom picture frame / poster frame 2 " wide complete black executive leather frame ( 74093 this frame is ... <url>`
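
Dropping exact duplicates from the training tweets, as described in this subsection, could be sketched as follows. The function name `drop_duplicates` and the sample tweets are hypothetical, and this is not necessarily how the project removes duplicates.

```python
def drop_duplicates(tweets):
    """Remove exact duplicate tweets while keeping first-seen order."""
    # dict keys are unique and keep insertion order (Python 3.7+),
    # so they behave like an ordered set here.
    return list(dict.fromkeys(tweets))

# Toy data, made up for illustration only.
tweets = [
    "custom picture frame <url>",
    "i love this song",
    "custom picture frame <url>",
]
unique_tweets = drop_duplicates(tweets)
print(len(tweets) - len(unique_tweets), "duplicate(s) dropped")
```

Note that this only catches tweets that are strictly identical. Near-duplicates such as the three picture-frame tweets listed above would all be kept.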

### 1.2 Study of names with same beginning 
Let's look at words that share the same beginning.

In [11]:
# are words (filtered by length) whose beginning match another word close to this word?
same_begin = list(merged_full[merged_full["len"]==8]["word"])
same_begin = [list(merged_full.loc[(merged_full.word.str.startswith(w))]["word"]) for w in same_begin]

#We visualize which word would be mapped on which one 
filter_single_rep(same_begin)

[]


[]
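
The grouping performed in the cell above can be illustrated on a toy vocabulary without pandas. This is a sketch only: the function name `prefix_groups` is hypothetical, and it assumes that `filter_single_rep` discards groups containing a single word, which the source does not confirm.

```python
def prefix_groups(vocab, length=8):
    """Group each word of the given length with the words that start with it."""
    groups = []
    for prefix in (w for w in vocab if len(w) == length):
        group = [w for w in vocab if w.startswith(prefix)]
        # Assumption: a group with a single member carries no information.
        if len(group) > 1:
            groups.append(group)
    return groups

# Toy vocabulary, made up for illustration only.
vocab = ["birthday", "birthdays", "birthdayy", "tomorrow", "hello"]
print(prefix_groups(vocab))
```

Each resulting group lists the 8-letter word first, followed by the longer words that would be mapped onto it.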

### 1.3 Expected syntactic divergences
As expected, words like "haha" can be written in many ways (extended length, typos...). We know that they mean the same thing, so it is useful to spot them in order to later stem them to the same representative.

In [6]:
#words beginning with "haha" or "ahah" are sure to be "instances" of haha
word_haha = test_df.loc[test_df.word.str.startswith("haha") | test_df.word.str.startswith("ahah")]


#print("Occurence of the words that can be remplaced by haha = "+ str(int(word_haha.occurence.sum()) - int(word_haha[word_haha.word == "haha"].occurence)))

# We make the list of all those words that have this same semantic
word_haha_list = list(word_haha.word)
list(word_haha.word)

[]
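
The rule used in the cell above (a word beginning with "haha" or "ahah" is treated as an instance of "haha") can be written as a small stand-alone function. The name `normalize_laughter` is hypothetical and the word list is made up; this is a sketch of the rule, not the project's implementation.

```python
def normalize_laughter(word, representative="haha"):
    """Map any word starting with "haha" or "ahah" to a single representative."""
    # str.startswith accepts a tuple of prefixes.
    if word.startswith(("haha", "ahah")):
        return representative
    return word

# Toy words, made up for illustration only.
words = ["hahaha", "ahahah", "hahahaa", "hah", "happy"]
print([normalize_laughter(w) for w in words])
```

Shorter variants such as "hah" are left untouched by this rule, since they do not start with either prefix.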

### 1.4 Study duplicates
We noticed that several tweets were duplicated in the test, pos and neg tweets. We study how many there are, and which tweets (if any) are shared between the test set and the pos/neg sets.
After running the check, we found that no tweet is shared by all the tweet sets, so a duplicate in the test dataset can be labelled directly by looking up the corresponding duplicate in pos/neg.

In [8]:
#load raw tweets
pos = import_(path_tweets_pos_full)
neg = import_(path_tweets_neg_full)
data = import_without_id(path_tweets_test)

In [9]:
common_pos_test = []
common_neg_test = []

pos = set(pos_tweets)
neg = set(neg_tweets)
test_tweets = set(test_tweets)
len_test = len(test_tweets)
for i, test_tweet in enumerate(test_tweets):
    if test_tweet in pos:
        # tweet shared between the pos and test sets: record it and stop
        common_pos_test.append(test_tweet)
        print("duplicate pos-test")
        print("\ttweet :" + str(test_tweet))
        break
    elif test_tweet in neg:
        # tweet shared between the neg and test sets: record it and stop
        common_neg_test.append(test_tweet)
        print("duplicate neg-test")
        print("\ttweet :" + str(test_tweet))
        break
    # progress, as a percentage of the test set
    print("{:.1f}".format(100 * i / len_test), "%", end='\r')

print("done")

done


In [10]:
def occurences(tweet, tweets):
    """Count how many times `tweet` appears in `tweets`."""
    count = 0
    for t in tweets:
        if t == tweet:
            count += 1
    return count
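
As an alternative to scanning the whole list once per tweet, all tweets can be counted in a single pass with `collections.Counter`, and the tweets shared between two sets can be found with a set intersection. The function names `duplicate_counts` and `shared_tweets` and the sample data are hypothetical; this is a sketch, not the code used for the results in this notebook.

```python
from collections import Counter

def duplicate_counts(tweets):
    """Return {tweet: count} for every tweet that appears more than once."""
    counts = Counter(tweets)  # one pass over the data
    return {tweet: n for tweet, n in counts.items() if n > 1}

def shared_tweets(test_tweets, labelled_tweets):
    """Return the set of tweets that appear in both collections."""
    return set(test_tweets) & set(labelled_tweets)

# Toy data, made up for illustration only.
pos = ["good day", "good day", "so happy"]
test = ["good day", "unknown tweet"]
print(duplicate_counts(pos))
print(shared_tweets(test, pos))
```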
    

## 2. Data pre-processing
In this part we start from the raw tweets and pre-process them to make the words uniform.


In [12]:
#load raw tweets
pos = import_(path_tweets_pos_full)
neg = import_(path_tweets_neg_full)
data = import_without_id(path_tweets_test)

### 2.1  Data Standardization
We start by collapsing repetitions of more than 3 letters and replacing dots by spaces.
For example:

1) `no_dot` : "Funny.I" becomes "Funny I" 

2) `find_repetitions` : "I looooove" becomes "I love"
            

This is done by the function `standardize_tweets`
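
The two transformations described above can be sketched with the standard `re` module. The function names ending in `_sketch` are hypothetical, and the regular expression is an assumption about what "more than 3 letters" means (four or more identical letters in a row); the project's own `no_dot`, `find_repetitions` and `standardize_tweets` may behave differently.

```python
import re

def no_dot_sketch(tweet):
    """Replace every dot by a space."""
    return tweet.replace(".", " ")

def collapse_repetitions_sketch(tweet):
    """Collapse a letter repeated more than three times in a row to one letter."""
    # ([a-zA-Z]) captures a letter; \1{3,} matches three or more further copies of it.
    return re.sub(r"([a-zA-Z])\1{3,}", r"\1", tweet)

def standardize_sketch(tweet):
    """Apply both steps to a single tweet."""
    return collapse_repetitions_sketch(no_dot_sketch(tweet))

print(standardize_sketch("Funny.I looooove it"))
```

With this threshold, ordinary double letters ("good") and triple letters are left unchanged; only longer runs are collapsed.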


In [13]:
#standardize tweets
print("Process data pos")
processed_tweets_pos_full = standardize_tweets(pos)
print("Process data neg")
processed_tweets_neg_full = standardize_tweets(neg) 
print("Process data test")
processed_tweets_test_full = standardize_tweets(data) 

print("export processed tweets to files")
export(processed_tweets_pos_full,  "preprocessed_pos_full")
export(processed_tweets_neg_full,  "preprocessed_neg_full")
export(processed_tweets_test_full, "preprocessed_test_full")

Process data pos
Process data neg
Process data test
export processed tweets to files


### 2.2 Build the vocabs
We build vocabs for the positive, negative and test tweets, removing tokens that occur fewer than 5 times.

In [14]:
#build word counters
cut_threshold = 5
vocab_pos =  build_vocab_counter(processed_tweets_pos_full, cut_threshold, True)
vocab_neg =  build_vocab_counter(processed_tweets_neg_full, cut_threshold, True)
vocab_test = build_vocab_counter(processed_tweets_test_full, cut_threshold, True)

#write vocabs to files
write_vocab_to_file(vocab_pos, "preprocessed_vocab_pos_full")
write_vocab_to_file(vocab_neg, "preprocessed_vocab_neg_full")
write_vocab_to_file(vocab_test, "preprocessed_vocab_test_full")

removing items present less than 5 times
Time elpased (hh:mm:ss.ms) 0:03:21.198569
removing items present less than 5 times
Time elpased (hh:mm:ss.ms) 0:03:40.127369
removing items present less than 5 times
Time elpased (hh:mm:ss.ms) 0:00:01.459434
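
The idea of a vocab with a cut threshold can be illustrated with `collections.Counter`. This sketch assumes whitespace tokenization and single-word tokens; the name `build_vocab_sketch` is hypothetical, and the real `build_vocab_counter` (including the meaning of its third argument) is not shown in this notebook.

```python
from collections import Counter

def build_vocab_sketch(tweets, cut_threshold):
    """Count whitespace-separated tokens and drop those seen fewer than cut_threshold times."""
    counts = Counter(token for tweet in tweets for token in tweet.split())
    return Counter({tok: n for tok, n in counts.items() if n >= cut_threshold})

# Toy data, made up for illustration only.
tweets = ["i love you", "i love cake", "i am here"]
vocab = build_vocab_sketch(tweets, cut_threshold=2)
print(vocab.most_common())
```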


### 2.3 Build and Merge the dataframes 
We build dataframes containing several pieces of information for each type of tweet, and we combine them all to facilitate the stemming we will do later.

In [15]:
path_pos_full = "preprocessed_vocab_pos_full"
path_neg_full = "preprocessed_vocab_neg_full"
path_test_data = "preprocessed_vocab_test_full"

    
pos_full =  build_df(path_pos_full)
neg_full =  build_df(path_neg_full)
test_data = build_df(path_test_data)
    
    
#Merge Pos and neg vocabs    
merged_full = merging(neg_full, pos_full, True)
merged_full = merging(merged_full, test_data, True)
    

### 2.4 Build semantic equivalences
We define, for each word, which word it is equivalent to (this can take a while).

In [16]:
'''print("Building semantic")
#Build equivalence list
same_begin = list(merged_full[merged_full["len"]==8]["word"])
same_begin = [list(merged_full.loc[(merged_full.word.str.startswith(w))]["word"]) for w in same_begin]
    
print("Building haha semantic")
#build "haha" equivalences
word_haha_test = test_data.loc[test_data.word.str.startswith("haha") | test_data.word.str.startswith("ahah")]
word_haha_pos = pos_full.loc[pos_full.word.str.startswith("haha") | pos_full.word.str.startswith("ahah")]
word_haha_neg = neg_full.loc[neg_full.word.str.startswith("haha") | neg_full.word.str.startswith("ahah")]
    
word_haha_list = list(set(list(word_haha_test.word)+list(word_haha_pos.word)+list(word_haha_neg.word)))
word_haha_list.remove("haha")
word_haha_list.insert(0, "haha")

    

    
print("filter semantic")
#append the two semantic lists and filter them
same_begin.append(list(word_haha_list))
semantic = filter_single_rep(same_begin)'''
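
Once groups of equivalent words are available, each word can be mapped to one representative per group. The sketch below assumes that the first element of each group is the representative, which is suggested by the way "haha" is moved to index 0 in the disabled cell above but is not confirmed by the source. The function names and sample groups are hypothetical.

```python
def build_representative_map(semantic):
    """Map every word of a group to the group's first word."""
    # Assumption: the first element of each group is its representative.
    return {word: group[0] for group in semantic for word in group}

def map_tweet(tweet, representative):
    """Replace each token by its representative, leaving unknown tokens as-is."""
    return " ".join(representative.get(tok, tok) for tok in tweet.split())

# Toy groups, made up for illustration only.
semantic = [["haha", "hahaha", "ahahah"], ["birthday", "birthdays"]]
representative = build_representative_map(semantic)
print(map_tweet("hahaha happy birthdays", representative))
```

Building the dictionary once makes each token lookup constant-time, instead of searching every group for every token.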


### 2.5 Map the equivalent words, stem them all with nltk and export the result

In [17]:
'''#process tweets
print("Process data pos")
processed_tweets_pos_full = stem_tweets( processed_tweets_pos_full, semantic)
print("Process data neg")
processed_tweets_neg_full = stem_tweets( processed_tweets_neg_full, semantic) # as an array of tweets
print("Process data test")
processed_tweets_test_full = stem_tweets( processed_tweets_test_full, semantic) # as an array of tweets
    
    
    
print("export data")
#export tweets
export(processed_tweets_neg_full,  "cleaned_neg_full")
export(processed_tweets_pos_full,  "cleaned_pos_full")
export(processed_tweets_test_full, "cleaned_test_full")'''


### 2.7 Create the simple vocab (with occurences) to build the dataframes
This is done with shell commands or Python code, as presented before.

In [18]:
'''#build word counters
vocab_pos =  build_vocab_counter(processed_tweets_pos_full, cut_threshold, True)
vocab_neg =  build_vocab_counter(processed_tweets_neg_full, cut_threshold, True)
vocab_test = build_vocab_counter(processed_tweets_test_full, cut_threshold, True)

#build vocabs
write_vocab_to_file(vocab_pos, "cleaned_vocab_pos_full")
write_vocab_to_file(vocab_neg, "cleaned_vocab_neg_full")
write_vocab_to_file(vocab_test, "cleaned_vocab_test_full")'''


### 2.8 create relevant vocab 
We keep the pertinent words in a separate vocab called the relevant vocab.

In [42]:
#Load dataframes before we 
neg_df = build_df("raw_vocab_neg_full_cut=5")
pos_df = build_df("raw_vocab_pos_full_cut=5")

#Merge dataframes
merged = merging(neg_df, pos_df, False)

#create relevant vocab
create_relevant_vocab(pertinence_thres=0.3, min_count=300, dataframe=merged)
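
The parameter names `pertinence_thres` and `min_count` suggest a filter on the `ratio` and `somme` columns of the merged dataframe. The sketch below shows that reading on plain dictionaries; it is an assumption about what `create_relevant_vocab` does, and the function name `relevant_words` and the sample rows are hypothetical.

```python
def relevant_words(rows, pertinence_thres, min_count):
    """Keep words whose ratio and total count both reach the given thresholds."""
    # Assumption: "ratio" measures pertinence and "somme" is the total count.
    return [
        r["word"] for r in rows
        if r["ratio"] >= pertinence_thres and r["somme"] >= min_count
    ]

# Toy rows, made up for illustration only.
rows = [
    {"word": "sad", "somme": 900, "ratio": 0.8},
    {"word": "the", "somme": 50000, "ratio": 0.01},
    {"word": "rare", "somme": 12, "ratio": 1.0},
]
print(relevant_words(rows, pertinence_thres=0.3, min_count=300))
```

Combining the two thresholds discards both frequent but uninformative words ("the") and informative but very rare ones ("rare").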

### 2.9 Find characteristic words
We build a list of words that appear only in neg tweets (or only in pos tweets) at least min_diff times. If we see such a word in a tweet, we assume the tweet is neg (respectively pos).

In [27]:
data_tweets = import_("relevant_vocab_pert=0.3_count=300")

characteristic_words(data_tweets, merged)

ValueError: max() arg is an empty sequence
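
The rule described in this subsection can be sketched independently of `characteristic_words`, whose implementation is not shown here. The function name `characteristic_words_sketch`, its signature and the toy counts are all hypothetical.

```python
def characteristic_words_sketch(neg_counts, pos_counts, min_diff):
    """Return (neg_only, pos_only): words exclusive to one class, seen at least min_diff times."""
    neg_only = {w for w, n in neg_counts.items()
                if n >= min_diff and pos_counts.get(w, 0) == 0}
    pos_only = {w for w, n in pos_counts.items()
                if n >= min_diff and neg_counts.get(w, 0) == 0}
    return neg_only, pos_only

# Toy counts, made up for illustration only.
neg_counts = {"paperback": 40, "the": 100, "cd-rom": 3}
pos_counts = {"the": 120, "thanks": 25}
neg_only, pos_only = characteristic_words_sketch(neg_counts, pos_counts, min_diff=10)
print(neg_only, pos_only)
```

A tweet containing a word from `neg_only` would then be labelled negative directly, and likewise for `pos_only`.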

### 2.10 Recapitulative function
This function performs the steps presented above; we created it for clarity.

In [3]:
process_data_no_stem("train_pos.txt", "train_neg.txt", "test_data.txt", False, 100)


Import Data
Process data pos
Process data neg
Process data test
build vocab data pos
removing items present less than 100 times
Time elpased (hh:mm:ss.ms) 0:00:11.113098
build vocab data neg
removing items present less than 100 times
Time elpased (hh:mm:ss.ms) 0:00:14.880160
build vocab test data
removing items present less than 100 times
Time elpased (hh:mm:ss.ms) 0:00:01.622162
export data


In [14]:
df = pd.read_table(filepath_or_buffer = filepath, encoding="utf-8", quoting=csv.QUOTE_NONE,  header=None, names=["word"])
    

NameError: name 'filepath' is not defined

In [4]:
#stemming(False)

Import Data


NameError: name 'import_without_id' is not defined