# Exploration and pre-process the data



In all Machine Learning problem one of the main task is to gain suffitant knowledge about the data one is given. This will ensure to better build the features representation of a tweet and to maximize our "computational power / prediction accuracy" ratio.

In the first part we will therefore investigate how the data is generated, how many simple words do we have, are there some duplicates, is there any odds with the tweets? How are there classified (we are given two datasets, one for the positive tweets one for negatives) i.e. what is a typical word that occurs when a :( is present in a tweet?


In the second part, we normalize the tweets. Some tweets may contain words that are not usefull, or can be categorized. This is done be the following pipeline.

1. Import all the tweets from train_pos_full.txt and train_neg_full.txt
2. Cast the repetition of more than three following similar letter to one letter
3. Find the representative of all words
4. 



### Helper functions and files

IOTweets contains the functions :
- build_df
- import_
- export

ProcessTweets contains the functions :
- merging
- filter_single_rep
- powerful_words
- sem_by_repr
- sem_by_repr2
- no_dot
- find_repetition
- no_s
- stem_tweet
- set_min_diff
- contains
- process_tweets_1
- process_tweets_2


## 1. An insight into the dataset 

In [1]:
import pandas as pd  # Pandas will be our framework for the first part
from nltk.stem import PorterStemmer
import nltk
from IOTweets import *
from ProcessTweets import *
%pylab inline
ps = PorterStemmer()

Populating the interactive namespace from numpy and matplotlib


`%matplotlib` prevents importing * from pylab and numpy
  "\n`%matplotlib` prevents importing * from pylab and numpy"


In [2]:
path_pos = "./vocab_cut_pos.txt"
path_neg = "./vocab_cut_neg.txt"

path_pos_full = "./vocab_cut_pos_full.txt"
path_neg_full = "./vocab_cut_neg_full.txt"
path_test_data = "./vocab_test_data.txt"

path_tweets_pos_full = "./train_pos_full.txt"
path_tweets_neg_full = "./train_neg_full.txt"
path_tweets_test_full = "test_data.txt"

In [3]:
# build the vocab Dataframe (DF) for full positive and negative tweets
pos_full = build_df(path_pos_full)
neg_full = build_df(path_neg_full)

#Build the vocab DF for tweets to be decided (i.e. "test" tweets)
test_data = build_df(path_test_data)

### 1.1 Study of word pertinence 
The idea is that the greater a word ratio i, the more pertinent it will be to cosider it when trying to label a tweet that contains it

In [None]:
#merge the pos and neg vocabs and study penrtinence of words via ratio
merged_full = merging(neg_full, pos_full)
merged_full.sort_values(by = ["ratio","somme"], ascending=[False, False]).head()

##### From here, we realise that some words have strangely strong occurence in the negative tweets.
By seeing the words in context, we realised that some tweets occured more than once.
We checked if those words were also in the test_data that we have to classify. The check was positive.
We will therefore capture those words, (i.e. "1gb" or "cd-rom") because they are luckily to be in the test_data set and classify directly the tweets countaining those words. We will also drop all the duplicated tweets for training in order not to let on the side some other words and this will save us power computationnal efficiency.

Two example of such tweets are:

    1) 1.26 - 7x14 custom picture frame / poster frame 1.265 " wide complete cherry wood frame ( 440ch this frame is manufactu ... <url>
    
    2) misc - 50pc diamond burr set - ceramics tile glass lapidary for rotary tools ( misc . assorted shapes and sizes for your ... <url>
    
Other important notice, some Tweets are almost the same but just fiew things change: Example of a tweet from amazon [here](https://www.amazon.com/Custom-Picture-Frame-Poster-Complete/dp/B004FNYSBA?SubscriptionId=AKIAJ6364XFIEG2FHXPA&tag=megaebookmall-20&linkCode=sp1&camp=2025&creative=165953&creativeASIN=B004FNYSBA) 

1. `3x14 custom picture frame / poster frame 1.265 " wide complete black wood frame ( 440bk this frame is manufactur ... <url>`

2. `24x35 custom picture frame / poster frame 1.265 " wide complete green wood frame ( 440gr this frame is manufactu ... <url>`

3. `22x31 custom picture frame / poster frame 2 " wide complete black executive leather frame ( 74093 this frame is ... <url>`

### 1.2 Study of names with same beginning 
Let's try to have an insight of words that have the same beginning

In [None]:
# are words (filtered by length) whose beginning match another word close to this word?
same_begin = list(merged_full[merged_full["len"]==8]["word"])
same_begin = [list(merged_full.loc[(merged_full.word.str.startswith(w))]["word"]) for w in same_begin]

#We visualize which word would be mapped on which one 
filter_single_rep(same_begin)

### 1.3 Expected syntaxic divergences
As expected, words like "haha" can be written in many ways (extended length, typo..). We know that they mean the same,it is therefore interesting to spot them in order to later stem them to the same representative.

In [None]:
#words beginning with "haha" or "ahah" are sure to be "instances" of haha
word_haha = test_data.loc[test_data.word.str.startswith("haha") | test_data.word.str.startswith("ahah")]


print("Occurence of the words that can be remplaced by haha = "+ str(word_haha.occurence.sum() - word_haha[word_haha.word = "haha"].occurence))

# We make the list of all those words that have this same semantic
word_haha_list = list(word_haha.word)
list(word_haha.word)

### 1.4 Study duplicates
We noticed that several tweets were duplicated in test, pos and neg tweets. We study ho many they are, and which tweets (if any) are shared between the test set and the pos/neg set.
We noticed after running that there are no tweets shared between all tweet sets, thus we can label duplicate of the test dataset by directly looking for corresponding duplicates in pos/neg 

In [17]:
#load raw tweets
pos = import_(path_tweets_pos_full)
neg = import_(path_tweets_neg_full)
data = import_without_comma(path_tweets_test_full)

In [21]:
common_pos_data = []
common_neg_data = []

pos = sorted(list(set(pos)))
neg = sorted(list(set(neg)))
data = sorted(list(set(data)))
len_data = len(data)
for i, data_tweet in enumerate(data) :
    for pos_tweet in pos :
        if data_tweet == pos_tweet : 
            common_pos_data.append(data_tweet)
            print("duplicate pos-data")
            print("\ttweet :" + str(pos_tweet))  
            break
    for neg_tweet in neg:
        if data_tweet == neg_tweet : 
            common_neg_data.append(data_tweet)
            print("duplicate neg-data")
            print("\ttweet :" + str(pos_tweet))
            break
    print("{:.1f}".format(i/len_data*100), "%", end='\r')
            
print ("done")

duplicate pos-data
	tweet :<user> <user> follow me please ?
duplicate neg-data
	tweet :~ ~ ~ recording miniatures <user> today connected to <user> nancarrow festival 4 <user> w / alexandra wood
duplicate neg-data
	tweet :~ ~ ~ recording miniatures <user> today connected to <user> nancarrow festival 4 <user> w / alexandra wood
duplicate neg-data
	tweet :~ ~ ~ recording miniatures <user> today connected to <user> nancarrow festival 4 <user> w / alexandra wood
duplicate pos-data
	tweet :<user> that's what i'm thinking
duplicate pos-data
	tweet :<user> whore
duplicate neg-data
	tweet :~ ~ ~ recording miniatures <user> today connected to <user> nancarrow festival 4 <user> w / alexandra wood
duplicate pos-data
	tweet :on the bus
duplicate neg-data
	tweet :~ ~ ~ recording miniatures <user> today connected to <user> nancarrow festival 4 <user> w / alexandra wood
done0 %


In [None]:
def occurences(tweet, tweets):
    count = 0
    for t in tweets:
        if t == tweet:
            count += 1
    return count
    

## 2 data pre-processing
In this part we start from raw tweets and preprocess them to achieve uniformization of the words


In [7]:
#load raw tweets
pos = import_(path_tweets_pos_full)
neg = import_(path_tweets_neg_full)
data = import_without_comma(path_tweets_test_full)

### 2.1  Data Standardization
We start by removing repetitions of more than 3 letters, and replace dots by spaces
ex : 

1) `no_dot` : "Funny.I" becomes "Funny I" 

2) `find_repetitions` : "I looooove" becomes "I love"

3) `no_s` : "dummies" becomes "dummy" and "carresses" becomes "carresse"
            

This is done by the function `standardize_tweets`


In [7]:
#standardize tweets
print("Process data pos")
processed_tweets_pos_full = standardize_tweets(pos)
print("Process data neg")
processed_tweets_neg_full = standardize_tweets(neg) 
print("Process data test")
processed_tweets_data_full = standardize_tweets(data) 


print("export data")
#export tweets
export(processed_tweets_neg_full,  "processed_tweet_neg_full")
export(processed_tweets_pos_full,  "processed_tweet_pos_full")
export(processed_tweets_data_full, "processed_tweet_data_full")

Process data pos
Process data neg
Process data test
export data


### 2.2 Build the vocabs
Using shell commands `vocab_cut`, we build vocabs for positive, negative and test tweets 

### 2.3 Build and Merge the dataframes 
We build dataframes containing several informations for each type of tweets, and we combine them all in order to facilitate the stemming we will do later

In [5]:
path_pos_full = "vocab_pos_full_to_stem.txt"
path_neg_full = "vocab_neg_full_to_stem.txt"
path_test_data = "vocab_data_full_to_stem.txt"

    
pos_full =  build_df(path_pos_full)
neg_full =  build_df(path_neg_full)
test_data = build_df(path_test_data)
    
    
#Merge Pos and neg vocabs    
merged_full = merging(neg_full, pos_full, True)
merged_full = merging(merged_full, test_data, True)
    

### 2.4 Build semantic equivalences
We define which word is equivalent to each word (this can take a while)

In [6]:
print("Building semantic")
#Build equivalence list
same_begin = list(merged_full[merged_full["len"]==8]["word"])
same_begin = [list(merged_full.loc[(merged_full.word.str.startswith(w))]["word"]) for w in same_begin]
    
print("Building haha semantic")
#build "haha" equivalences
word_haha_test = test_data.loc[test_data.word.str.startswith("haha") | test_data.word.str.startswith("ahah")]
word_haha_pos = pos_full.loc[pos_full.word.str.startswith("haha") | pos_full.word.str.startswith("ahah")]
word_haha_neg = neg_full.loc[neg_full.word.str.startswith("haha") | neg_full.word.str.startswith("ahah")]
    
word_haha_list = list(set(list(word_haha_test.word)+list(word_haha_pos.word)+list(word_haha_neg.word)))
word_haha_list.remove("haha")
word_haha_list.insert(0, "haha")

    

    
print("filter semantic")
#append the two semantic lists and filter them
same_begin.append(list(word_haha_list))
semantic = filter_single_rep(same_begin)

Building semantic


KeyboardInterrupt: 

### 2.5 Map the equivalent words, stem them all thanks to nltk and export the result

In [None]:
#process tweets
print("Process data pos")
processed_tweets_pos_full = stem_tweets( processed_tweets_pos_full, semantic)
print("Process data neg")
processed_tweets_neg_full = stem_tweets( processed_tweets_neg_full, semantic) #Sous form d'un tableau de tweet   
print("Process data test")
processed_tweets_data_full = stem_tweets( processed_tweet_data_full, semantic) #Sous form d'un tableau de tweet    
    
    
    
print("export data")
#export tweets
export(processed_tweets_neg_full,  "preprocessed_tweet_neg_full")
export(processed_tweets_pos_full,  "preprocessed_tweet_pos_full")
export(processed_tweets_data_full, "preprocessed_tweet_data_full")

### 2.6 Create bigrams and trigrams
We consider also combinations of two and three following words And create a new file for each type of tweets

In [None]:
create_bitri_tweets('preprocessed_tweet_neg_full.txt', 'bitri_tweet_data_full.txt', False)
create_bitri_tweets('preprocessed_tweet_pos_full.txt', 'bitri_tweet_data_full.txt', False)
create_bitri_tweets('preprocessed_tweet_data_full.txt', 'bitri_tweet_data_full.txt', False)

### 2.7 Create the vocabs (with occurences) to build the dataframes
This is done by using shell commands, as presented before

### 2.8 Find characteristic words
We set a list of words that appear only in neg/pos at least min_diff times such that if we see such a word in a tweet we assume it is a neg/pos tweet

In [None]:
# load the tweets to be labelled
import_without_comma('bitri_tweet_data_full.txt')

#Build the new dataframes
path_pos_full = "vocab_pos_full_bitri.txt"
path_neg_full = "vocab_neg_full_bitri.txt"
path_test_data = "vocab_data_bitri.txt"


pos_full = build_df(path_pos_full)
neg_full = build_df(path_neg_full)
test_data = build_df(path_test_data)
    
    
#Merge Pos and neg vocabs    
merged_full = merging(neg_full, pos_full, True)
merged_full = merging(merged_full, test_data, True)


#Search minimal difference so that two opposite characteristic words cannot be seen together in data
min_diff = set_min_diff(tweets, merged_full)

#export characteristic words
characteristic_words = characteristic_words(merged_full, min_diff)
characteristic_words.head()

In [None]:
#Takes tweets to be labelled, and a merged instance of pos and neg vocabs
def set_min_diff(data_tweets, merged_full):
    ratio_one = merged_full[(merged_full.ratio == 1)]
    differences = (list(ratio_one.difference)).sort(reverse=True)
    for d in differences:
        words_to_check = list((ratio_one[(ratio_one.ratio == 1) & (ratio_one.difference >= d)]).word)
        if not contains(data_tweets, words_to_check):
            return d
        print(str(d) + "not sucessful")
    return 0
            
        
                            

In [2]:
#return true iff at least one of the tweets contains at least one of the words
#tweets is an array of strings
#words is a list of words
def contains(tweets, words):
    for t in tweets:
        for w1 in t.split():
            for w2 in words:
                if w1 == w2:
                    return True
    return False

def drop_duplicates(tweets):
    tweets = list(set(tweets))
    return tweets
    

In [4]:
process_data_no_stem("train_pos.txt", "train_neg.txt", "test_data.txt", False)

Import Data
Process data pos
Process data neg
Process data test
export data
removing items present less than 1 times
Time elpased (hh:mm:ss.ms) 0:00:04.359159
removing items present less than 1 times
Time elpased (hh:mm:ss.ms) 0:00:03.804771
removing items present less than 1 times
Time elpased (hh:mm:ss.ms) 0:00:00.825586


### 2.9 Recapitulative Functions
These functions do the work presented above, we created them for clarity purposes

In [3]:
def process_data_no_stem(path_pos, path_neg, path_test, is_full):
    ########----------Build DF----------------------------------------
    #### -------------- Partial tweets -------------------------------
   
    print("Import Data")
    #import tweets
    tweets_test = import_without_comma(path_test)     
    tweets_pos =  import_(path_pos)
    tweets_neg =  import_(path_neg)
    
    #remove duplicates
    tweets_pos = drop_duplicates(tweets_pos)
    tweets_neg = drop_duplicates(tweets_neg)
    
    #process tweets
    print("Process data pos")
    preprocessed_pos = standardize_tweets( tweets_pos)
    print("Process data neg")
    preprocessed_neg = standardize_tweets( tweets_neg) #Sous form d'un tableau de tweet   
    print("Process data test")
    preprocessed_test = standardize_tweets( tweets_test) #Sous form d'un tableau de tweet    
    
    print("export data")
    #export tweets
    if(is_full):
        
        #export tweets
        export(preprocessed_neg,  "preprocessed_neg_full")
        export(preprocessed_pos,  "preprocessed_pos_full")
        export(preprocessed_test, "preprocessed_test_full")
        
        #export vocabs
        build_vocab(preprocessed_neg, "preprocessed_vocab_neg_full", 1)
        build_vocab(preprocessed_pos, "preprocessed_vocab_pos_full", 1)
        build_vocab(preprocessed_test, "preprocessed_vocab_test_full", 1)
        
        
    else :
        #export tweets
        export(preprocessed_neg,  "preprocessed_neg")
        export(preprocessed_pos,  "preprocessed_pos")
        export(preprocessed_test, "preprocessed_test")
        
        #export vocabs
        build_vocab(preprocessed_neg, "preprocessed_vocab_neg", 1)
        build_vocab(preprocessed_pos, "preprocessed_vocab_pos", 1)
        build_vocab(preprocessed_test, "preprocessed_vocab_test", 1)
    

def stemming(is_full):
    
    print("Import Data")
    
    if(is_full):
        #import tweets
        preprocessed_neg = import_("preprocessed_neg_full")
        preprocessed_pos = import_("preprocessed_pos_full")    
        preprocessed_test = import_("preprocessed_test_full")
        
        #import vocabs
        pos_df = build_df("preprocessed_vocab_neg_full")
        neg_df = build_df("preprocessed_vocab_pos_full")
        test_df = build_df("preprocessed_vocab_test_full")
   
    else :
        #import tweets
        preprocessed_neg = import_("preprocessed_neg")
        preprocessed_pos = import_("preprocessed_pos")    
        preprocessed_test = import_("preprocessed_data")
        
        #import vocabs
        pos_df = build_df("preprocessed_vocab_neg")
        neg_df = build_df("preprocessed_vocab_pos")
        test_df = build_df("preprocessed_vocab_test")
   
       
    
    #Merge Pos and neg dataframes    
    merged = merging(neg_df, pos_df, True)
    merged = merging(merged, test_df, True)
    
    
    print("Building semantic")
    
    #Build equivalence list
    same_begin = list(merged[merged["len"]==8]["word"])
    same_begin = [list(merged.loc[(merged.word.str.startswith(w))]["word"]) for w in same_begin]
    
    print("Building haha semantic")
    
    #build "haha" equivalences    
    word_haha_list = merged.loc[merged.word.str.startswith("haha") | merged.word.str.startswith("ahah")]
                                
    word_haha_list.remove("haha")
    word_haha_list.insert(0, "haha")
    same_begin.append(list(word_haha_list))
    

    
    print("filter semantic")
    
    #fiter roots alone
    semantic = filter_single_rep(same_begin)
    
    
   #process tweets
    print("Process data pos")
    stemmed_pos = stem_tweets( preprocessed_pos, semantic)
    print("Process data neg")
    stemmed_neg = stem_tweets( preprocessed_neg, semantic) #Sous form d'un tableau de tweet   
    print("Process data test")
    stemmed_test = stem_tweets( preprocessed_test, semantic) #Sous form d'un tableau de tweet    
    
    
    
    print("export data")
    
    #export data    
    if(is_full):
        #export tweets
        export(stemmed_neg,  "stemmed_neg_full")
        export(stemmed_pos,  "stemmed_pos_full")
        export(stemmed_test, "stemmed_test_full") 
        
        #export vocabs
        build_vocab(preprocessed_neg, "cleaned_vocab_neg_full")
        build_vocab(preprocessed_pos, "cleaned_vocab_pos_full")
        build_vocab(preprocessed_test, "cleaned_vocab_test_full")
    
    else :
        #export tweets
        export(stemmed_neg,  "stemmed_neg")
        export(stemmed_pos,  "stemmed_pos")
        export(stemmed_test, "stemmed_test")  
        
        #export vocabs
        build_vocab(preprocessed_neg, "cleaned_vocab_neg")
        build_vocab(preprocessed_pos, "cleaned_vocab_pos")
        build_vocab(preprocessed_test, "cleaned_vocab_test")
        
        
#clean raw tweets
def clean_tweets(path_pos, path_neg, path_test, is_full):
    
    process_data_no_stem(path_pos, path_neg, path_test, is_full)
    stemming(is_full)
        
        
    
    
    
def bitri_tweets():
    print("bitri for neg running...")
    create_bitri_tweets('preprocessed_tweet_neg_full.txt', 'bitri_tweet_data_full.txt', False)
    
    print("bitri for pos running...")    
    create_bitri_tweets('preprocessed_tweet_pos_full.txt', 'bitri_tweet_data_full.txt', False)
    
    print("bitri for test running...")
    create_bitri_tweets('preprocessed_tweet_test_full.txt', 'bitri_tweet_test_full.txt', False)
    
    # load the tweets to be labelled
    import_without_comma('bitri_tweet_data_full.txt')

    #Build the new dataframes
    path_pos_full = "vocab_pos_full_bitri.txt"
    path_neg_full = "vocab_neg_full_bitri.txt"
    path_test_data = "vocab_data_bitri.txt"


    pos_full = build_df(path_pos_full)
    neg_full = build_df(path_neg_full)
    test_data = build_df(path_test_data)


    #Merge Pos and neg vocabs    
    merged_full = merging(neg_full, pos_full, True)
    merged_full = merging(merged_full, test_data, True)


    #Search minimal difference so that two opposite characteristic words cannot be seen together in data
    min_diff = set_min_diff(tweets, merged_full)

    #export characteristic words
    characteristic_words = characteristic_words(merged_full, min_diff)
    characteristic_words.head()

In [None]:
process_data_no_stem()

In [3]:
def create_relevant_vocab(pertinence_thres, min_count, dataframe):
    relevant = dataframe[dataframe["ratio"] >= pertinence_thres]
    relevant = relevant[(relevant["occurence_pos"] + relevant["occurence_neg"]) >= min_count]
    relevant = relevant[["word","ratio"]]
    relevant = relevant.set_index("word")
    dict_relevant = relevant.to_dict()
    relevant.to_pickle("relevant_vocab_pert="+str(pertinence_thres)+"_count="+str(min_count))

In [None]:
def create_relevant_vocab(pertinence_thres, min_count, dataframe):
    relevant = dataframe[dataframe["ratio"] >= pertinence_thres]
    relevant = relevant[(relevant["occurence_pos"] + relevant["occurence_neg"]) >= min_count]
    relevant = relevant[["word","ratio"]]
    relevant = relevant.set_index("word")
    dict_relevant = relevant.to_dict()
    relevant.to_pickle("relevant_vocab_pert="+str(pertinence_thres)+"_count="+str(min_count))
    


def create_data(lower_bound):
    
    #paths of positive and negative vocabs
    path_pos = "./vocab_cut_pos_full.txt"
    path_neg = "./vocab_cut_neg_full.txt"
    
    # pos is mapping of words in happy tweets with their occurences in all happy tweets
    pos = pd.read_table(filepath_or_buffer = path_pos, header=None, names=["word"])
    pos["occurence"] = pos["word"].map(lambda x:  int(x.split()[0]))
    pos["word"] = pos["word"].map(lambda x:  x.encode('utf-8').split()[1])

    # neg is mapping of words in sad tweets with their occurences in all sad tweets
    neg = pd.read_table(filepath_or_buffer = path_neg, header=None, names=["word"])
    neg["occurence"] = neg["word"].map(lambda x:  int(x.split()[0]))
    neg["word"] = neg["word"].map(lambda x:  x.encode('utf-8').split()[1])
    
    # We merge the two dataframe in order to better handle them
    merged = pd.merge(left=neg, right=pos, left_on = "word", right_on = "word", suffixes=('_neg', '_pos'),  how="outer")
    merged = merged.fillna(0)

    #We only consider words whose occurences dfferences between sad and happy tweets is greater or equal than 5 
    merged["difference"] = abs((merged["occurence_neg"]-merged["occurence_pos"]))
    merged = merged[merged["difference"]>=5]

    #We compute the sum of occurences
    merged["somme"] = merged["occurence_neg"]+merged["occurence_pos"]

    #The ratio si how relevant it is to judge happyness/sadness of the tweet using the word : 0 if not relevant, 1 if truly relevant
    merged["ratio"] = 2* abs(0.5 - merged["occurence_pos"]/(merged["occurence_pos"]+merged["occurence_neg"]))
    
    
    def lower_ratio(x) :
        if(x["somme"]<MIN_SOMME):
            return 0
        else:
            return x["ratio"]
    
    
    #We only consider with more than 'lower_bound' occurences
    merged["ratio"] = merged.apply(lower_ratio, axis = 1) 
    
    #sort the array by ratio and then sum
    merged.sort_values(by = ["ratio","somme"], ascending=[False, False])
    
    #store the data
    filename = "relevant_vocab_full_lb="+str(lower_bound)+".txt"
    merged.to_csv(path_or_buf=filename, sep=' ')
        

In [None]:
#for lb in [50, 100, 500, 1000, 2000, 5000, 10000, 20000, 50000]:
create_data(5000)
    

## type of the word investigation.

Here we investigate if a type of word seems to be in a certain category of tweet

In [None]:
def categorizer(w):
    w = nltk.word_tokenize(w)
    c = nltk.pos_tag(w)
    return c[0][1]

In [None]:
categorizer("the")

In [None]:
merged["categorize"] = merged["word"].map(categorizer)

In [None]:
type_ratio = merged.groupby("categorize").mean().sort_values("ratio")

t_r = type_ratio["ratio"]

In [None]:
t_r.plot.bar(figsize = (15,10))