## Feature extraction exercise
```This exercise is purely about feature extraction. We will learn how to do it quickly and efficiently.
We will be working on a Kaggle dataset of Quora questions, where each record is a pair of questions and the goal is to determine whether the two questions have the same meaning (the label "is_duplicate").
We will extract features for each question and for each pair of questions, and will train a simple model (default xgboost) using those features.```

```The purpose of this exercise is to acquire good practices, so please read the instructions carefully and follow them. You are also encouraged to look at the solution after you have finished. In addition, when solving the exercise, try to write code that is as efficient and as clean as you can.```

```Note: We are about to do some Kaggle cheats, that is, we will compute features by mixing the train and the test sets.
Please notice exactly where we do so. In addition, every time you meet a question in the instructions (you can identify a question by '?'), please answer it in a comment block.```

```~Ittai Haran```

In [1]:
# some modules you might find useful

import pandas as pd
import numpy as np
from collections import Counter
from functools import partial
import re
import os

import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
# sentence tokenizer for future use

from nltk import pos_tag
from nltk.tokenize import TweetTokenizer
from sklearn.feature_extraction.text import TfidfVectorizer
twt_tokenizer = TweetTokenizer()

# word2vec model for future use

## can be found in: https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit
from gensim.models import KeyedVectors
word2vec = KeyedVectors.load_word2vec_format('resources/word_2_vec/GoogleNews-vectors-negative300.bin.gz', binary=True)

In [None]:
# from google.colab import files
# uploaded = files.upload()

In [3]:
# can be found in: https://drive.google.com/open?id=1G6rXwTw0bOBbkSaw96bW76BFnxM1x7H5
data = pd.read_csv('resources/data/train.csv')
data = data.iloc[:2000]
# drop pairs where either question is missing (read_csv loads missing text as a float NaN)
data = data.dropna(subset=['question1', 'question2'])

```First we would like to extract features regarding a single question. To do so, first create a dataset containing all the questions (and their ids; why should we remember the id?), without duplicates. Name it 'questions'.```

In [4]:
q1_dataset = data[['id', 'qid1', 'question1']].rename(columns={'qid1': "qid", "question1": "question"}, inplace=False)
q2_dataset = data[['id', 'qid2', 'question2']].rename(columns={'qid2': "qid", "question2": "question"}, inplace=False)

questions = pd.concat([q1_dataset, q2_dataset])
questions = questions.drop_duplicates(keep="first")

print("questions", questions.shape)

questions (4000, 3)
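
Because the pair `id` is kept next to `qid`, `drop_duplicates` over all three columns only removes rows that are identical in every column, so a question that appears in several pairs stays in the table once per pair. Below is a minimal sketch of de-duplicating on `qid` alone. The frame `toy_pairs` and its contents are made up for illustration and are not taken from the Quora data.

```python
import pandas as pd

# hypothetical toy data: question 1 appears in two different pairs
toy_pairs = pd.DataFrame({
    'id': [0, 1],
    'qid1': [1, 1],
    'qid2': [2, 3],
    'question1': ['How do I learn Python?', 'How do I learn Python?'],
    'question2': ['How can I learn Python?', 'What is pandas?'],
})

q1 = toy_pairs[['qid1', 'question1']].rename(columns={'qid1': 'qid', 'question1': 'question'})
q2 = toy_pairs[['qid2', 'question2']].rename(columns={'qid2': 'qid', 'question2': 'question'})

# keep one row per question id, so a later merge on qid cannot repeat pair rows
toy_questions = pd.concat([q1, q2]).drop_duplicates(subset='qid').reset_index(drop=True)
print(toy_questions)
```

With unique `qid` values, a left merge of the pairs onto this table keeps exactly one row per pair.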


``` Add a column containing the questions, tokenized using twt_tokenizer, the TweetTokenizer object we created earlier. Name it 'question_sep'. Make sure that you treat the questions in lower case.```

In [5]:
questions['question_step'] = questions['question'].apply(lambda x: twt_tokenizer.tokenize(x.lower()))

questions.head()

Unnamed: 0,id,qid,question,question_step
0,0,1,What is the step by step guide to invest in sh...,"[what, is, the, step, by, step, guide, to, inv..."
1,1,3,What is the story of Kohinoor (Koh-i-Noor) Dia...,"[what, is, the, story, of, kohinoor, (, koh-i-..."
2,2,5,How can I increase the speed of my internet co...,"[how, can, i, increase, the, speed, of, my, in..."
3,3,7,Why am I mentally very lonely? How can I solve...,"[why, am, i, mentally, very, lonely, ?, how, c..."
4,4,9,"Which one dissolve in water quikly sugar, salt...","[which, one, dissolve, in, water, quikly, suga..."


```Create an empty list called 'question_features_for_future_use'. We are going to befoul the questions dataframe, so we will want to remember which of its columns are important to us and which are just columns helping us to create other columns.```
```Next, I will ask you to create some features. Whenever I use this sign: (*), know that you have to add the feature name to the list.```

In [6]:
question_features_for_future_use = []

```Before we start computing features, write a function that gets a column name and saves a csv file with 2 columns: qid and the column chosen. Name it 'save_feature' and make sure you use it after every feature computed, since it might be very very important for later parts of the exercise and your life.```

```Save the features in resources/features/<col_name>.csv.```

```Use os.path and os.getcwd().```

In [7]:
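def save_feature(df, column_name):
    # write qid and the chosen feature column to resources/features/<column_name>.csv
    features_dir = os.path.join(os.getcwd(), 'resources', 'features')
    os.makedirs(features_dir, exist_ok=True)
    df[['qid', column_name]].to_csv(os.path.join(features_dir, column_name + '.csv'), index = False)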

```Compute the following:```
- ```Counter of the word part of speech (use collections.Counter and pos_tag, which we imported earlier. do it using one line). (*)```
- ```number of different numbers appearing in the question (numbers, not digits; use regex and don't count words like 'one') (one line). (*)```
- ```number of words in a question (one line). (*)```
- ```length of longest word (one line). (*)```
- ```word2vec mean of the question. (*)```
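
One possible reading of the "different numbers" item, sketched with a regular expression. The helper name `count_distinct_numbers` and the choice to treat a decimal such as 3.5 as a single number are assumptions made for illustration.

```python
import re

def count_distinct_numbers(question):
    # \d+(?:\.\d+)? matches an integer or a simple decimal; set() keeps each number once
    return len(set(re.findall(r'\d+(?:\.\d+)?', question)))

print(count_distinct_numbers('Is 23 squared larger than 23 times 24?'))
print(count_distinct_numbers('Which one is first?'))
```

Number words such as 'one' contain no digits, so the pattern never matches them.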

In [8]:
import nltk
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\RONENAH\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [9]:
questions['pos_tag_count'] = questions['question_step'].apply(lambda x_1: Counter(map(lambda x_2: x_2[1], pos_tag(tuple(x_1)))))
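# NOTE: this counts digit characters, whereas the task asks for distinct numbers (a regex such as r'\d+' would be needed)
questions['number_count'] = questions['question'].apply(lambda x: sum(c.isdigit() for c in x))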
questions['words_in_question'] = questions['question_step'].apply(len)
questions['length_of_longest_word'] = questions['question_step'].apply(lambda x: max(map(len, x)))

def word2vec_mean(pd_series):
    # mean of the word2vec vectors of the tokens; out-of-vocabulary tokens contribute 0
    def apply_word2vec(x):
        try:
            return word2vec[x]
        except KeyError:
            return 0
    return np.mean(np.array(list(map(apply_word2vec, pd_series)), dtype=object), axis = 0)

questions['word_2_vec_mean'] = questions['question_step'].apply(word2vec_mean)

question_features_for_future_use += ['pos_tag_count', 'number_count', 'words_in_question', 
                                     'length_of_longest_word', 'word_2_vec_mean']
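
A minimal sketch of a mean embedding that skips unknown tokens instead of counting them as zeros. The dictionary `toy_vectors` is a hypothetical stand-in for the word2vec model, and `mean_vector` is a helper name introduced here. Skipping unknown tokens is a design choice that differs from the cell above.

```python
import numpy as np

# hypothetical 3-dimensional embeddings standing in for the word2vec model
toy_vectors = {
    'share': np.array([1.0, 0.0, 2.0]),
    'market': np.array([3.0, 2.0, 0.0]),
}

def mean_vector(tokens, vectors, dim):
    # keep only the tokens that have an embedding
    found = [vectors[token] for token in tokens if token in vectors]
    if not found:
        # no known token: fall back to a zero vector of the right size
        return np.zeros(dim)
    return np.mean(found, axis=0)

print(mean_vector(['share', 'market', 'quikly'], toy_vectors, dim=3))
```

Returning a fixed-size array in every case keeps all entries of the resulting column the same shape.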

- ```Counter of the question_words (one line). (*)```

In [10]:
question_words = ['why', 'how', 'where', 'who', 'what', 'which', 'when', 'whether']
questions['questions_words'] = questions['question_step'].apply(lambda x: Counter(filter(lambda y: y in question_words, x)))

question_features_for_future_use += ['questions_words']

```We will now use tf-idf grade (if you aren't familiar with the concept, read about it;) ``` https://en.wikipedia.org/wiki/Tf%E2%80%93idf ```).
do the following:```
- ```initialize a TfidfVectorizer object. use norm = None, use English stop words and twt_tokenizer we used before. Name it tfidf.```
- ```create the tf-idf matrix of all the questions (look again at the note in the beginning of the exercise).```
- ```look at tfidf.vocabulary_.```
- ```create a reversed vocabulary (given an index returns a word. do it in one line, using list comprehension).```
- ```create a column, such that every question has a list of its words and the word's tf-idf grades. do it without transferring the tf-idf matrix into a dense matrix (keep it sparse).```
- ```for each question, find the third biggest tf-idf grade. Take the list of the words with bigger grades than the third biggest tf-idf grade, and create a column with the "mean word2vec vector" of these words. (*)```
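
The steps above can be sketched on a toy corpus. This is only an illustration under stated assumptions: the corpus is made up, the default scikit-learn tokenizer is used instead of `twt_tokenizer`, and the helper `top_k_words` keeps the three highest-graded words (including the third one). The matrix is never converted to a dense array; only the `indices` and `data` of each sparse row are read.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# hypothetical toy corpus
corpus = [
    'how do i invest in the share market',
    'how do i learn python',
    'what is the share market',
]

tfidf = TfidfVectorizer(norm=None)
matrix = tfidf.fit_transform(corpus)  # sparse matrix, one row per document

# reversed vocabulary: column index -> word
index_to_word = {index: word for word, index in tfidf.vocabulary_.items()}

def top_k_words(row, k=3):
    # row.data holds the non-zero grades, row.indices the matching column indices
    order = np.argsort(-row.data)[:k]
    return [index_to_word[int(row.indices[position])] for position in order]

top_words = [top_k_words(matrix[i]) for i in range(matrix.shape[0])]
print(top_words)
```

The resulting word lists can then be passed to a mean-embedding function to build the requested column.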

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(norm=None, stop_words="english", tokenizer=twt_tokenizer.tokenize)
tfidf_matrix = tfidf.fit_transform(questions['question'].values)

print(tfidf_matrix.shape)

(4000, 6270)


In [12]:
print(tfidf_matrix)

  (0, 253)	1.0007500937968896
  (0, 2894)	4.36704592370003
  (0, 3497)	6.809392959069235
  (0, 5090)	6.809392959069235
  (0, 2997)	6.896404336058865
  (0, 2556)	8.195687320189126
  (0, 5383)	15.36972339284627
  (1, 1686)	8.195687320189126
  (1, 8)	4.506807866075189
  (1, 3190)	8.195687320189126
  (1, 7)	4.411497686270865
  (1, 3191)	8.195687320189126
  (1, 5406)	6.809392959069235
  (1, 253)	1.0007500937968896
  (2, 6036)	8.60115242829729
  (2, 5934)	5.656713449130849
  (2, 1318)	8.195687320189126
  (2, 2975)	6.461086264801019
  (2, 5293)	6.729350251395698
  (2, 2884)	6.349860629690794
  (2, 253)	1.0007500937968896
  (3, 5244)	7.348389459801922
  (3, 3370)	8.195687320189126
  (3, 3592)	8.195687320189126
  (3, 253)	2.0015001875937792
  :	:
  (3996, 2832)	8.60115242829729
  (3996, 4269)	8.195687320189126
  (3996, 2490)	4.532125674059479
  (3996, 253)	1.0007500937968896
  (3997, 192)	8.60115242829729
  (3997, 2793)	8.60115242829729
  (3997, 2792)	8.60115242829729
  (3997, 213)	8.1956873201

In [13]:
print(tfidf.vocabulary_)

{'step': 5383, 'guide': 2556, 'invest': 2997, 'share': 5090, 'market': 3497, 'india': 2894, '?': 253, 'story': 5406, 'kohinoor': 3191, '(': 7, 'koh-i-noor': 3190, ')': 8, 'diamond': 1686, 'increase': 2884, 'speed': 5293, 'internet': 2975, 'connection': 1318, 'using': 5934, 'vpn': 6036, 'mentally': 3592, 'lonely': 3370, 'solve': 5244, 'dissolve': 1752, 'water': 6074, 'quikly': 4533, 'sugar': 5469, ',': 13, 'salt': 4928, 'methane': 3606, 'carbon': 1002, 'di': 1678, 'oxide': 4043, 'astrology': 600, ':': 245, 'capricorn': 996, 'sun': 5481, 'cap': 991, 'moon': 3717, 'rising': 4834, '...': 20, 'does': 1783, 'say': 4954, 'buy': 946, 'tiago': 5690, 'good': 2490, 'geologist': 2442, 'use': 5927, 'シ': 6268, 'instead': 2949, 'し': 6267, 'motorola': 3735, 'company': 1269, '):': 9, 'hack': 2579, 'charter': 1084, 'motorolla': 3736, 'dcx': 1545, '3400': 142, 'method': 3607, 'separation': 5061, 'slits': 5198, 'fresnel': 2366, 'biprism': 796, 'read': 4607, 'youtube': 6240, 'comments': 1255, 'make': 3452,

In [14]:
tfidf_vocab_reverse = {v:k for k,v in tfidf.vocabulary_.items()}
print(type(tfidf.vocabulary_))
print(tfidf_vocab_reverse)

<class 'dict'>
{5383: 'step', 2556: 'guide', 2997: 'invest', 5090: 'share', 3497: 'market', 2894: 'india', 253: '?', 5406: 'story', 3191: 'kohinoor', 7: '(', 3190: 'koh-i-noor', 8: ')', 1686: 'diamond', 2884: 'increase', 5293: 'speed', 2975: 'internet', 1318: 'connection', 5934: 'using', 6036: 'vpn', 3592: 'mentally', 3370: 'lonely', 5244: 'solve', 1752: 'dissolve', 6074: 'water', 4533: 'quikly', 5469: 'sugar', 13: ',', 4928: 'salt', 3606: 'methane', 1002: 'carbon', 1678: 'di', 4043: 'oxide', 600: 'astrology', 245: ':', 996: 'capricorn', 5481: 'sun', 991: 'cap', 3717: 'moon', 4834: 'rising', 20: '...', 1783: 'does', 4954: 'say', 946: 'buy', 5690: 'tiago', 2490: 'good', 2442: 'geologist', 5927: 'use', 6268: 'シ', 2949: 'instead', 6267: 'し', 3735: 'motorola', 1269: 'company', 9: '):', 2579: 'hack', 1084: 'charter', 3736: 'motorolla', 1545: 'dcx', 142: '3400', 3607: 'method', 5061: 'separation', 5198: 'slits', 2366: 'fresnel', 796: 'biprism', 4607: 'read', 6240: 'youtube', 1255: 'comments'

In [17]:
tfidf_words = [[(tfidf_vocab_reverse[j], i[0,j]) for j in i.nonzero()[1]] for i in tfidf_matrix]

print(type(tfidf_words[0][0]))
print(tfidf_words[0][0])
print(tfidf_words[3][0])
print(tfidf_words[120][0])


<class 'tuple'>
('?', 1.0007500937968896)
('solve', 7.348389459801922)
('imrovement', 8.60115242829729)


In [19]:
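def third_biggest_grade_word(x):
    # x is a list of (word, tf-idf grade) pairs
    sorted_x = sorted(x, key = lambda y: -y[1])[0:3]
    sorted_x = list(map(lambda y: y[0], sorted_x))
    # NOTE: x is returned unchanged, so the top-three list in sorted_x is discarded (see the printed output);
    # returning sorted_x would yield the three highest-graded words
    return x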

In [23]:
three_bigger_words = [third_biggest_grade_word(x) for x in tfidf_words]
print(three_bigger_words[0])

questions['mean_word2vec_vector'] = [word2vec_mean(x) for x in three_bigger_words]
question_features_for_future_use += ['mean_word2vec_vector']

[('?', 1.0007500937968896), ('india', 4.36704592370003), ('market', 6.809392959069235), ('share', 6.809392959069235), ('invest', 6.896404336058865), ('guide', 8.195687320189126), ('step', 15.36972339284627)]


```We now move to features concerning both questions, and not just one of them. But first, run the following cell, known as the evil cell.```

In [None]:
exec(''.join(map(lambda x: chr(ord(x)-1), 'jnqpsu!nbuqmpumjc/qzqmpu!bt!qmu\x0bgspn!nbuqmpumjc/jnbhf!jnqpsu!jnsfbe\x0bjnb'+\
                      'hf!>!jnsfbe)#sftpvsdft0wjtvbmj{bujpo!ifmqfst0T'+\
                      'njmjoh`Efwjm`Fnpkj/qoh#*\x0bqmu/jntipx)jnbhf*\x0bqmu/tipx)*')) +'\n' + ''.join(map(lambda x: chr(ord(x)-1), 'fyju)*'))+'\n'\
     +''.join(map(lambda x: chr(ord(x)-1), 'qsjou)#Ibibibib!J!fyjufe!zpvs!lfsofm/!Dpoujovf!gspn!ifsf'+\
                  '!xjuipvu!sfsvoojoh!uif!qsfwjpvt!dfmmt!)cftjeft!uif!jnqpsu!dfmmt-!boe!mpbe!uif!ebub!bhbjo*#*')))

```Understand how the evil cell works.```

In [18]:
print(''.join(map(lambda x: chr(ord(x)-1), 'jnqpsu!nbuqmpumjc/qzqmpu!bt!qmu\x0bgspn!nbuqmpumjc/jnbhf!jnqpsu!jnsfbe\x0bjnb'+\
                      'hf!>!jnsfbe)#sftpvsdft0wjtvbmj{bujpo!ifmqfst0T'+\
                      'njmjoh`Efwjm`Fnpkj/qoh#*\x0bqmu/jntipx)jnbhf*\x0bqmu/tipx)*')) +'\n' + ''.join(map(lambda x: chr(ord(x)-1), 'fyju)*'))+'\n'\
     +''.join(map(lambda x: chr(ord(x)-1), 'qsjou)#Ibibibib!J!fyjufe!zpvs!lfsofm/!Dpoujovf!gspn!ifsf'+\
                  '!xjuipvu!sfsvoojoh!uif!qsfwjpvt!dfmmt!)cftjeft!uif!jnqpsu!dfmmt-!boe!mpbe!uif!ebub!bhbjo*#*')))

import matplotlib.pyplot as plt
from matplotlib.image import imread
image = imread("resources/visualization helpers/Smiling_Devil_Emoji.png")
plt.imshow(image)
plt.show()
exit()
print("Hahahaha I exited your kernel. Continue from here without rerunning the previous cells (besides the import cells, and load the data again)")
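
The evil cell stores its source with every character shifted up by one code point, shifts it back with `chr(ord(x)-1)`, and passes the result to `exec`. A small sketch of the same round trip follows. The helper name `shift` is introduced here for illustration.

```python
def shift(text, offset):
    # move every character by `offset` Unicode code points
    return ''.join(chr(ord(character) + offset) for character in text)

hidden = shift("print('hello')", 1)   # obfuscate
print(hidden)
print(shift(hidden, -1))              # recover the original source
print(shift('fyju)*', -1))            # one of the strings used by the evil cell
```

Printing the decoded string instead of executing it, as the cell above does, is the safe way to inspect such code.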


```Now we will add the features we computed earlier to the data dataframe. For every feature you created, add two columns to data: the feature for each question in the pair. Use DataFrame.merge and the qid columns you saved every time you saved a feature. Also use os.listdir and DataFrame.rename, and do it in at most 7 lines of code.
Use the following converter (in the pd.read_csv syntax): converters = {feature_name:lambda x: eval(x)}. Why is it needed? Hint: open pos_tag_count.csv. If you aren't familiar with the amazing eval function, read about it:)```
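
A small sketch of why the converter matters: a CSV file stores a Counter as its text representation, so reading it back without a converter yields plain strings. The frame below is made up for illustration, and an in-memory buffer stands in for the feature file.

```python
import io
from collections import Counter

import pandas as pd

# hypothetical feature table with a Counter-valued column
toy_feature = pd.DataFrame({
    'qid': [1, 2],
    'pos_tag_count': [Counter({'NN': 2, 'DT': 1}), Counter({'VB': 1})],
})

buffer = io.StringIO()
toy_feature.to_csv(buffer, index=False)  # each Counter is written as text

buffer.seek(0)
as_text = pd.read_csv(buffer)  # without a converter the column holds strings

buffer.seek(0)
# eval rebuilds the object from its text; Counter must be importable where eval runs
restored = pd.read_csv(buffer, converters={'pos_tag_count': lambda x: eval(x)})

print(type(as_text['pos_tag_count'][0]), type(restored['pos_tag_count'][0]))
```

Since `eval` executes whatever text it is given, it should only be applied to files you wrote yourself.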

In [38]:
for feature_name in questions.columns[4:]:
    print(feature_name)
    feature_dataframe = questions[["qid", feature_name]]
    data = data.merge(feature_dataframe.rename(columns = {'qid':'qid1', feature_name:feature_name+'_1'}), how = 'left', on = 'qid1')
    data = data.merge(feature_dataframe.rename(columns = {'qid':'qid2', feature_name:feature_name+'_2'}), how = 'left', on = 'qid2')
    
data.head()

pos_tag_count
number_count
words_in_question
length_of_longest_word
word_2_vec_mean
questions_words
mean_word2vec_vector


Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate,pos_tag_count_1,pos_tag_count_2,number_count_1,number_count_2,words_in_question_1,words_in_question_2,length_of_longest_word_1,length_of_longest_word_2,word_2_vec_mean_1,word_2_vec_mean_2,questions_words_1,questions_words_2,mean_word2vec_vector_1,mean_word2vec_vector_2
0,0,1,2,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0,"{'WP': 1, 'VBZ': 1, 'DT': 1, 'NN': 5, 'IN': 3,...","{'WP': 1, 'VBZ': 1, 'DT': 1, 'NN': 4, 'IN': 2,...",0,0,15,13,6,6,"[-0.03792114, 0.016402181, 0.040059406, 0.0086...","[-0.031134972, 0.01776123, 0.03864934, -0.0199...",{'what': 1},{'what': 1},0.0,0.0
1,1,3,4,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0,"{'WP': 1, 'VBZ': 1, 'DT': 1, 'NN': 3, 'IN': 1,...","{'WP': 1, 'MD': 1, 'VB': 1, 'IN': 1, 'DT': 2, ...",0,0,11,16,10,10,"[-0.006372625, 0.026306152, 0.031116832, 0.011...","[-0.010375977, 0.04137802, 0.044677734, 0.0760...",{'what': 1},{'what': 1},0.0,0.0
2,2,5,6,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0,"{'WRB': 1, 'MD': 1, 'VB': 2, 'DT': 2, 'NN': 3,...","{'WRB': 1, 'MD': 1, 'VB': 2, 'NN': 2, 'VBN': 1...",0,0,15,11,10,9,"[0.057942707, 0.01593628, 0.04995931, 0.065592...","[0.0041170986, 0.0060258345, 0.002574574, 0.05...",{'how': 1},{'how': 1},0.0,0.0
3,3,7,8,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0,"{'WRB': 2, 'VBP': 1, 'JJ': 1, 'RB': 3, '.': 2,...","{'VB': 1, 'DT': 1, 'NN': 5, 'WRB': 1, 'NNP': 4...",0,8,13,21,8,9,"[-0.006666917, 0.013963993, -0.002742474, 0.10...","[0.038462322, -0.005923317, 0.044375464, 0.039...","{'why': 1, 'how': 1}",{'when': 1},0.0,0.0
4,4,9,10,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0,"{'WDT': 1, 'CD': 1, 'NN': 8, 'IN': 2, ',': 2, ...","{'WDT': 1, 'NN': 3, 'MD': 1, 'VB': 1, 'IN': 1,...",0,0,16,8,8,7,"[-0.054725647, 0.014060974, 0.067352295, 0.021...","[-0.031188965, 0.10928345, 0.04647827, 0.04013...",{'which': 1},{'which': 1},0.0,0.0


```Now we would like to find a way to take the feature of each question in a pair and combine the two into one. Remember our question features are of 3 kinds:```
- ```number```
- ```Counter```
- ```vector```

```For each kind we will write a method taking both features and producing one feature:```

In [49]:
def from_two_features_to_1_number(number_1, number_2):
    try:
        return np.abs(number_1 - number_2)
    except:
        print(number_1, number_2)

from sklearn.metrics.pairwise import cosine_similarity
def from_two_features_to_1_vector(vector_1, vector_2):
    return cosine_similarity(vector_1.reshape(1,-1), vector_2.reshape(1,-1)).reshape(-1)[0]

def from_two_features_to_1_counter(counter_1, counter_2):
    return sum(((counter_1-counter_2)+(counter_2-counter_1)).values())

def from_two_features_to_1(feature_1, feature_2):
    # dispatch on the kind of feature: Counter, vector or plain number
    if isinstance(feature_1, Counter):
        return from_two_features_to_1_counter(feature_1, feature_2)
    elif isinstance(feature_1, np.ndarray):
        return from_two_features_to_1_vector(feature_1, feature_2)
    else:
        return from_two_features_to_1_number(feature_1, feature_2)
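
A worked example of two of the combinations used above, on made-up inputs. The tag counts and the two-dimensional vectors are hypothetical, and the cosine similarity is computed directly with NumPy rather than with scikit-learn.

```python
from collections import Counter

import numpy as np

tags_1 = Counter({'NN': 5, 'IN': 3, 'WP': 1})
tags_2 = Counter({'NN': 4, 'IN': 2, 'WP': 1, 'VB': 1})

# Counter subtraction drops non-positive counts, so each side keeps only its surplus
surplus_1 = tags_1 - tags_2
surplus_2 = tags_2 - tags_1
counter_distance = sum((surplus_1 + surplus_2).values())

vector_1 = np.array([1.0, 0.0])
vector_2 = np.array([1.0, 1.0])
# cosine similarity: dot product divided by the product of the norms
cosine = vector_1 @ vector_2 / (np.linalg.norm(vector_1) * np.linalg.norm(vector_2))

print(counter_distance, cosine)
```

The Counter distance is the total number of tags by which the two questions differ, so identical tag counts give 0.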

```I suspect you know what that's for:```

In [28]:
data_features_for_future_use = []

In [40]:
questions.head()

Unnamed: 0,id,qid,question,question_step,pos_tag_count,number_count,words_in_question,length_of_longest_word,word_2_vec_mean,questions_words,mean_word2vec_vector
0,0,1,What is the step by step guide to invest in sh...,"[what, is, the, step, by, step, guide, to, inv...","{'WP': 1, 'VBZ': 1, 'DT': 1, 'NN': 5, 'IN': 3,...",0,15,6,"[-0.03792114, 0.016402181, 0.040059406, 0.0086...",{'what': 1},0.0
1,1,3,What is the story of Kohinoor (Koh-i-Noor) Dia...,"[what, is, the, story, of, kohinoor, (, koh-i-...","{'WP': 1, 'VBZ': 1, 'DT': 1, 'NN': 3, 'IN': 1,...",0,11,10,"[-0.006372625, 0.026306152, 0.031116832, 0.011...",{'what': 1},0.0
2,2,5,How can I increase the speed of my internet co...,"[how, can, i, increase, the, speed, of, my, in...","{'WRB': 1, 'MD': 1, 'VB': 2, 'DT': 2, 'NN': 3,...",0,15,10,"[0.057942707, 0.01593628, 0.04995931, 0.065592...",{'how': 1},0.0
3,3,7,Why am I mentally very lonely? How can I solve...,"[why, am, i, mentally, very, lonely, ?, how, c...","{'WRB': 2, 'VBP': 1, 'JJ': 1, 'RB': 3, '.': 2,...",0,13,8,"[-0.006666917, 0.013963993, -0.002742474, 0.10...","{'why': 1, 'how': 1}",0.0
4,4,9,"Which one dissolve in water quikly sugar, salt...","[which, one, dissolve, in, water, quikly, suga...","{'WDT': 1, 'CD': 1, 'NN': 8, 'IN': 2, ',': 2, ...",0,16,8,"[-0.054725647, 0.014060974, 0.067352295, 0.021...",{'which': 1},0.0


In [46]:
data.head()

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate,pos_tag_count_1,pos_tag_count_2,number_count_1,number_count_2,...,words_in_question_2,length_of_longest_word_1,length_of_longest_word_2,word_2_vec_mean_1,word_2_vec_mean_2,questions_words_1,questions_words_2,mean_word2vec_vector_1,mean_word2vec_vector_2,pos_tag_count_double
0,0,1,2,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0,"{'WP': 1, 'VBZ': 1, 'DT': 1, 'NN': 5, 'IN': 3,...","{'WP': 1, 'VBZ': 1, 'DT': 1, 'NN': 4, 'IN': 2,...",0,0,...,13,6,6,"[-0.03792114, 0.016402181, 0.040059406, 0.0086...","[-0.031134972, 0.01776123, 0.03864934, -0.0199...",{'what': 1},{'what': 1},0.0,0.0,2
1,1,3,4,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0,"{'WP': 1, 'VBZ': 1, 'DT': 1, 'NN': 3, 'IN': 1,...","{'WP': 1, 'MD': 1, 'VB': 1, 'IN': 1, 'DT': 2, ...",0,0,...,16,10,10,"[-0.006372625, 0.026306152, 0.031116832, 0.011...","[-0.010375977, 0.04137802, 0.044677734, 0.0760...",{'what': 1},{'what': 1},0.0,0.0,7
2,2,5,6,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0,"{'WRB': 1, 'MD': 1, 'VB': 2, 'DT': 2, 'NN': 3,...","{'WRB': 1, 'MD': 1, 'VB': 2, 'NN': 2, 'VBN': 1...",0,0,...,11,10,9,"[0.057942707, 0.01593628, 0.04995931, 0.065592...","[0.0041170986, 0.0060258345, 0.002574574, 0.05...",{'how': 1},{'how': 1},0.0,0.0,6
3,3,7,8,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0,"{'WRB': 2, 'VBP': 1, 'JJ': 1, 'RB': 3, '.': 2,...","{'VB': 1, 'DT': 1, 'NN': 5, 'WRB': 1, 'NNP': 4...",0,8,...,21,8,9,"[-0.006666917, 0.013963993, -0.002742474, 0.10...","[0.038462322, -0.005923317, 0.044375464, 0.039...","{'why': 1, 'how': 1}",{'when': 1},0.0,0.0,28
4,4,9,10,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0,"{'WDT': 1, 'CD': 1, 'NN': 8, 'IN': 2, ',': 2, ...","{'WDT': 1, 'NN': 3, 'MD': 1, 'VB': 1, 'IN': 1,...",0,0,...,8,8,7,"[-0.054725647, 0.014060974, 0.067352295, 0.021...","[-0.031188965, 0.10928345, 0.04647827, 0.04013...",{'which': 1},{'which': 1},0.0,0.0,12


```Use the methods you wrote to get one feature from every pair of features you have, while running over the features in question_features_for_future_use. Give them meaningful names.```

In [54]:
for i in range(4, len(questions.columns)):
    features_i = questions.columns[i]
    print(features_i)
    data[features_i + "_double"] = data.apply(lambda x: from_two_features_to_1(x[features_i + "_1"], 
                                                                        x[features_i + "_2"]), axis = 1)

pos_tag_count
number_count
words_in_question
length_of_longest_word
word_2_vec_mean
questions_words
mean_word2vec_vector


```Add the following features:```
- ```number of common words between the two questions. (one line) (*)```
- ```number of common words between the two questions, not including stop words. (one line) (*)```

```You might have to use twt_tokenizer again. Note that we could have saved the tokenized questions.```
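
A minimal sketch of the two common-word counts. It makes several assumptions for illustration: questions are lower-cased and split on whitespace (a stand-in for `twt_tokenizer.tokenize`), `toy_stop_words` is a tiny made-up list where `nltk.corpus.stopwords` would be the fuller choice, and `common_words` is a helper name introduced here.

```python
# hypothetical, tiny stop-word list
toy_stop_words = {'the', 'is', 'in', 'a', 'what', 'which', 'how'}

def common_words(question_1, question_2, ignored=frozenset()):
    # lower-case and split on whitespace, then compare the two word sets
    words_1 = set(question_1.lower().split())
    words_2 = set(question_2.lower().split())
    return len((words_1 & words_2) - set(ignored))

q1 = 'What is the capital of France'
q2 = 'Which city is the capital of France'

print(common_words(q1, q2))                  # all shared words
print(common_words(q1, q2, toy_stop_words))  # shared words that are not stop words
```

Lower-casing before the comparison matters, because 'What' and 'what' would otherwise count as different words.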

In [55]:
data.head()

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate,pos_tag_count_1,pos_tag_count_2,number_count_1,number_count_2,...,questions_words_2,mean_word2vec_vector_1,mean_word2vec_vector_2,pos_tag_count_double,number_count_double,words_in_question_double,length_of_longest_word_double,word_2_vec_mean_double,questions_words_double,mean_word2vec_vector_double
0,0,1,2,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0,"{'WP': 1, 'VBZ': 1, 'DT': 1, 'NN': 5, 'IN': 3,...","{'WP': 1, 'VBZ': 1, 'DT': 1, 'NN': 4, 'IN': 2,...",0,0,...,{'what': 1},0.0,0.0,2,0,2,0,0.954794,0,0.0
1,1,3,4,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0,"{'WP': 1, 'VBZ': 1, 'DT': 1, 'NN': 3, 'IN': 1,...","{'WP': 1, 'MD': 1, 'VB': 1, 'IN': 1, 'DT': 2, ...",0,0,...,{'what': 1},0.0,0.0,7,0,5,0,0.675877,0,0.0
2,2,5,6,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0,"{'WRB': 1, 'MD': 1, 'VB': 2, 'DT': 2, 'NN': 3,...","{'WRB': 1, 'MD': 1, 'VB': 2, 'NN': 2, 'VBN': 1...",0,0,...,{'how': 1},0.0,0.0,6,0,4,1,0.798543,0,0.0
3,3,7,8,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0,"{'WRB': 2, 'VBP': 1, 'JJ': 1, 'RB': 3, '.': 2,...","{'VB': 1, 'DT': 1, 'NN': 5, 'WRB': 1, 'NNP': 4...",0,8,...,{'when': 1},0.0,0.0,28,8,8,1,0.508553,3,0.0
4,4,9,10,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0,"{'WDT': 1, 'CD': 1, 'NN': 8, 'IN': 2, ',': 2, ...","{'WDT': 1, 'NN': 3, 'MD': 1, 'VB': 1, 'IN': 1,...",0,0,...,{'which': 1},0.0,0.0,12,0,8,1,0.661036,0,0.0


In [64]:
question_words = ['why', 'how', 'where', 'who', 'what', 'which', 'when', 'wheather']
# for i in range(len(question_words)):
#     question_words[i] = twt_tokenizer.tokenize(question_words[i])
question_words = set(question_words)
print(question_words)

data['common_words'] = data.apply(lambda x: len(set(twt_tokenizer.tokenize(x['question1'])).intersection(set(twt_tokenizer.tokenize(x['question2'])))) , axis = 1)

def common_words_2(x):
    tokenized_q1 = set(twt_tokenizer.tokenize(x['question1'])) - question_words
    intersection_q1_q2 = tokenized_q1.intersection(set(twt_tokenizer.tokenize(x['question2'])))
    return len(intersection_q1_q2)
                                
data['common_words_2'] = data.apply(common_words_2 , axis = 1)


data.head()

{'when', 'which', 'what', 'where', 'wheather', 'how', 'why', 'who'}


Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate,pos_tag_count_1,pos_tag_count_2,number_count_1,number_count_2,...,mean_word2vec_vector_2,pos_tag_count_double,number_count_double,words_in_question_double,length_of_longest_word_double,word_2_vec_mean_double,questions_words_double,mean_word2vec_vector_double,common_words,common_words_2
0,0,1,2,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0,"{'WP': 1, 'VBZ': 1, 'DT': 1, 'NN': 5, 'IN': 3,...","{'WP': 1, 'VBZ': 1, 'DT': 1, 'NN': 4, 'IN': 2,...",0,0,...,0.0,2,0,2,0,0.954794,0,0.0,12,12
1,1,3,4,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0,"{'WP': 1, 'VBZ': 1, 'DT': 1, 'NN': 3, 'IN': 1,...","{'WP': 1, 'MD': 1, 'VB': 1, 'IN': 1, 'DT': 2, ...",0,0,...,0.0,7,0,5,0,0.675877,0,0.0,7,7
2,2,5,6,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0,"{'WRB': 1, 'MD': 1, 'VB': 2, 'DT': 2, 'NN': 3,...","{'WRB': 1, 'MD': 1, 'VB': 2, 'NN': 2, 'VBN': 1...",0,0,...,0.0,6,0,4,1,0.798543,0,0.0,4,4
3,3,7,8,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0,"{'WRB': 2, 'VBP': 1, 'JJ': 1, 'RB': 3, '.': 2,...","{'VB': 1, 'DT': 1, 'NN': 5, 'WRB': 1, 'NNP': 4...",0,8,...,0.0,28,8,8,1,0.508553,3,0.0,1,1
4,4,9,10,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0,"{'WDT': 1, 'CD': 1, 'NN': 8, 'IN': 2, ',': 2, ...","{'WDT': 1, 'NN': 3, 'MD': 1, 'VB': 1, 'IN': 1,...",0,0,...,0.0,12,0,8,1,0.661036,0,0.0,5,5


In [None]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

```Now think of a feature of your own and implement it.```

In [96]:
# summary statistics (median, max, min) of each question's mean word2vec vector
for side in ('1', '2'):
    vectors = data['word_2_vec_mean_' + side].values
    data['word_2_vec_mean_' + side + '_median'] = [np.median(v) for v in vectors]
    data['word_2_vec_mean_' + side + '_max'] = [np.max(v) for v in vectors]
    data['word_2_vec_mean_' + side + '_min'] = [np.min(v) for v in vectors]

# number of distinct question words in each question
data['questions_words_1_count'] = data['questions_words_1'].apply(len)
data['questions_words_2_count'] = data['questions_words_2'].apply(len)

```I'm not going to use the evil cell again, but I'll remind you to save your features.```

In [87]:
print(data.columns)

Index(['id', 'qid1', 'qid2', 'question1', 'question2', 'is_duplicate',
       'pos_tag_count_1', 'pos_tag_count_2', 'number_count_1',
       'number_count_2', 'words_in_question_1', 'words_in_question_2',
       'length_of_longest_word_1', 'length_of_longest_word_2',
       'word_2_vec_mean_1', 'word_2_vec_mean_2', 'questions_words_1',
       'questions_words_2', 'mean_word2vec_vector_1', 'mean_word2vec_vector_2',
       'pos_tag_count_double', 'number_count_double',
       'words_in_question_double', 'length_of_longest_word_double',
       'word_2_vec_mean_double', 'questions_words_double',
       'mean_word2vec_vector_double', 'common_words', 'common_words_2',
       'word_2_vec_mean_1_median', 'word_2_vec_mean_1_max',
       'word_2_vec_mean_1_min', 'word_2_vec_mean_2_median',
       'word_2_vec_mean_2_max', 'word_2_vec_mean_2_min'],
      dtype='object')


```That's it! Take your features and train a RandomForestRegressor using them. Don't forget to split the data into train and test sets. What score did you get?```

In [66]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier

In [91]:
for feature in data.columns:
    print(feature, data[feature].dtype)

id int64
qid1 int64
qid2 int64
question1 object
question2 object
is_duplicate int64
pos_tag_count_1 object
pos_tag_count_2 object
number_count_1 int64
number_count_2 int64
words_in_question_1 int64
words_in_question_2 int64
length_of_longest_word_1 int64
length_of_longest_word_2 int64
word_2_vec_mean_1 object
word_2_vec_mean_2 object
questions_words_1 object
questions_words_2 object
mean_word2vec_vector_1 float64
mean_word2vec_vector_2 float64
pos_tag_count_double int64
number_count_double int64
words_in_question_double int64
length_of_longest_word_double int64
word_2_vec_mean_double float64
questions_words_double int64
mean_word2vec_vector_double float64
common_words int64
common_words_2 int64
word_2_vec_mean_1_median float64
word_2_vec_mean_1_max float64
word_2_vec_mean_1_min float64
word_2_vec_mean_2_median float64
word_2_vec_mean_2_max float64
word_2_vec_mean_2_min float64


In [92]:
print(data['questions_words_1'])

0                {'what': 1}
1                {'what': 1}
2                 {'how': 1}
3       {'why': 1, 'how': 1}
4               {'which': 1}
                ...         
7583              {'how': 1}
7584             {'what': 1}
7585            {'which': 1}
7586             {'what': 1}
7587             {'what': 1}
Name: questions_words_1, Length: 7588, dtype: object


In [97]:
features = ['number_count_1','number_count_2', 'words_in_question_1','words_in_question_2','length_of_longest_word_1',
            'length_of_longest_word_2','questions_words_1_count', 'questions_words_2_count', 'mean_word2vec_vector_1',
            'mean_word2vec_vector_2','pos_tag_count_double', 'number_count_double', 'words_in_question_double',
            'length_of_longest_word_double','word_2_vec_mean_double', 'questions_words_double', 'mean_word2vec_vector_double',
            'common_words', 'common_words_2',
            'word_2_vec_mean_1_median', 'word_2_vec_mean_1_max', 'word_2_vec_mean_1_min', 'word_2_vec_mean_2_median',
            'word_2_vec_mean_2_max', 'word_2_vec_mean_2_min'
           ]

data_train = data[features]
label = data['is_duplicate'].values
X_train, X_test, Y_train, Y_test = train_test_split(data_train, label, train_size = 0.8, test_size = 0.2)

from sklearn.metrics import make_scorer
from sklearn import metrics
from sklearn.model_selection import RandomizedSearchCV

rf_model = RandomForestClassifier()

max_features = np.arange(0.1,1.1,0.1)
min_samples_split = [5, 10, 15]
min_samples_leaf = [1, 2, 4, 5]
bootstrap = [True, False]
criterion = ['gini', 'entropy']
n_estimators = [10,40,80,120,180,200,250,300]
max_depth = np.linspace(1, 20, 20, dtype=int)

random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap,
               'criterion': criterion,
               }

scoring = make_scorer(metrics.accuracy_score)

random_model = RandomizedSearchCV(estimator=rf_model, param_distributions=random_grid, n_iter=200, verbose=2, n_jobs=-1, 
                                  return_train_score=True, cv=3, scoring=scoring)
random_model.fit(X_train, Y_train)

y_pred = random_model.predict(X_test)

print("Final accuracy", accuracy_score(Y_test, y_pred))
print("Best params", random_model.best_params_)

Fitting 3 folds for each of 200 candidates, totalling 600 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:   17.2s
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed:   50.9s
[Parallel(n_jobs=-1)]: Done 357 tasks      | elapsed:  2.1min
[Parallel(n_jobs=-1)]: Done 600 out of 600 | elapsed:  3.6min finished


Final accuracy 0.9347826086956522
Best params {'n_estimators': 120, 'min_samples_split': 15, 'min_samples_leaf': 5, 'max_features': 0.1, 'max_depth': 17, 'criterion': 'entropy', 'bootstrap': False}
