This notebook combines all extracted feature sets with their respective data sets so that the result can be used in experiments.

# Load Libraries

In [1]:
import pickle
import pandas as pd

# Load Data Sets And Preprocess

The following block of code loads the two data sets, SWC and SQS, and keeps only the columns needed for our experiments.

In [2]:
SWC = pd.read_pickle("../Data/DataSets/SWC/SWC.p")
SQS = pd.read_pickle("../Data/DataSets/SQS/SQS.p")

SWC = SWC[['sID', 'query', 'type', 'class']]
SQS = SQS[['sID', 'query', 'class']]
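The column selection above can be sketched on a toy data frame as follows. This is a minimal illustration, not part of the original pipeline: `select_columns` is a hypothetical helper, and the `extra` column and the row values are made up. The only point is that a missing column fails early with a clear message.

```python
import pandas as pd


def select_columns(df, columns):
    """Return df restricted to `columns`, failing early if any are missing."""
    missing = [c for c in columns if c not in df.columns]
    if missing:
        raise KeyError(f"missing columns: {missing}")
    return df[columns]


# Toy stand-in for SQS; the rows and the 'extra' column are invented.
toy = pd.DataFrame({
    "sID": [1, 2],
    "query": ["VOIP phones", "What is a fox's favorite kind of food?"],
    "class": [0, 1],
    "extra": ["a", "b"],
})

toy_selected = select_columns(toy, ["sID", "query", "class"])
print(toy_selected)
```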

In [3]:
SQS

,sID,query,class
0,39899,collagen vascular disease lifestyle,0
1,39900,france world cup 1998 reactions,0
2,39901,dooney bourke look alike purses,0
3,39902,VOIP phones,0
4,39903,Travel to the poconos,0
...,...,...,...
296,41399,Who plays the bad guy in Star Wars the Horde a...,1
297,41400,What is a fox's favorite kind of food?,1
298,41401,"Show me the movie called ""The Martian""",1
299,41402,What is the biggest rock found on Mars?,1


# Load Extracted Features 

In the following block of code we load all feature sets, merge the text-based features into one data frame, and then join the result with each data set and its search features.

In [40]:
# Load the search features for each data set.
searchFeatSWC = pd.read_pickle("Pickles/SearchFeatSWC.p")
searchFeatSQS = pd.read_pickle("Pickles/SearchFeatSQS.p")

# Load the text-based feature sets.
vocabFeat = pd.read_pickle("Pickles/VocabFeat.p")
lexFeat = pd.read_pickle("Pickles/LexFeat.p")
synFeat = pd.read_pickle("Pickles/SynFeat.p")
sPFeat = pd.read_pickle("Pickles/SPFeat.p")

# Combine the text-based features; merge() with no `on` joins on all shared columns.
textBasedFeat = sPFeat.merge(vocabFeat)
textBasedFeat = textBasedFeat.merge(lexFeat)
textBasedFeat = textBasedFeat.merge(synFeat)

# SWC: attach text features per query, average the 'Q' rows per sID, then add search features by sID.
SWCAll = SWC.merge(textBasedFeat, how='inner', on='query')
# numeric_only=True skips the string columns, which recent pandas versions refuse to average.
SWCAll = SWCAll[SWCAll['type'] == 'Q'].groupby('sID').mean(numeric_only=True)
SWCAll = SWCAll.join(searchFeatSWC)

# SQS: attach text features per query, then merge search features on the shared columns.
SQSAll = SQS.merge(textBasedFeat, how='inner', on='query')
SQSAll = SQSAll.set_index('sID')
SQSAll = SQSAll.merge(searchFeatSQS)
SQSAll = SQSAll.drop(columns=['query'])
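The SWC steps above (merge on `query`, keep the `'Q'` rows, average per `sID`, join the search features by index) can be sketched on toy data. All frames and values below are invented for illustration, and the `'C'` row type is an assumption used only to show the filter at work.

```python
import pandas as pd

# Toy stand-ins; every value here is made up.
toy_swc = pd.DataFrame({
    "sID": [1, 1, 2, 2],
    "query": ["cats", "cat videos", "dogs", "dogs.example"],
    "type": ["Q", "Q", "Q", "C"],  # 'C' is a hypothetical non-query row type
    "class": [1, 1, 0, 0],
})
toy_text_feat = pd.DataFrame({
    "query": ["cats", "cat videos", "dogs", "dogs.example"],
    "punct": [0, 1, 0, 1],
})
toy_search_feat = pd.DataFrame(
    {"numQueries": [2, 1]},
    index=pd.Index([1, 2], name="sID"),
)

# Attach the text features to each row by its query string.
merged = toy_swc.merge(toy_text_feat, how="inner", on="query")

# Keep query rows only and average the numeric columns per sID.
per_sid = merged[merged["type"] == "Q"].groupby("sID").mean(numeric_only=True)

# Add the search features, aligned on the sID index.
toy_all = per_sid.join(toy_search_feat)
print(toy_all)
```

Because `groupby('sID')` moves `sID` into the index, the final `join` aligns the two frames on `sID` without naming a key column.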

In [43]:
SQSQ = SQSAll['query'].tolist()

In [45]:
SQSAll = SQS.merge(textBasedFeat, how='inner', on='query')
SQSAll = SQSAll.set_index('sID')
SQSAll = SQSAll.merge(searchFeatSQS)
# 'query' is deliberately kept here (not dropped) so the merged rows can be inspected.
SQSAll[SQSAll['query'].isin(SQSQ)]

,query,class,numSpellingErrors,offByOne,kidsError,punct,casing,coreVocab,nonCoreVocab,minAoA,...,queryDistance,timeQueries,uniqueQueries,allSameQueries,repeatQueries,uniqueClicks,allSameClicks,repeatClicks,timeClicks,clickDistance
0,collagen vascular disease lifestyle,0,0,0,0,0,0,0.000000,1.000000,7.55,...,-1,-1,1,0,-1,-1,-1,-1,-1,-1
1,france world cup 1998 reactions,0,0,0,0,0,0,0.400000,0.600000,0.00,...,-1,-1,1,0,-1,-1,-1,-1,-1,-1
2,dooney bourke look alike purses,0,2,2,0,0,0,0.400000,0.600000,0.00,...,-1,-1,1,0,-1,-1,-1,-1,-1,-1
3,VOIP phones,0,1,1,0,0,1,0.500000,0.500000,0.00,...,-1,-1,1,0,-1,-1,-1,-1,-1,-1
4,Travel to the poconos,0,0,0,0,0,1,0.500000,0.500000,0.00,...,-1,-1,1,0,-1,-1,-1,-1,-1,-1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1522,Who plays the bad guy in Star Wars the Horde a...,1,1,1,0,1,1,0.545455,0.454545,0.00,...,-1,-1,1,0,-1,-1,-1,-1,-1,-1
1523,What is a fox's favorite kind of food?,1,1,1,0,1,1,0.750000,0.250000,0.00,...,-1,-1,1,0,-1,-1,-1,-1,-1,-1
1524,"Show me the movie called ""The Martian""",1,0,0,0,0,1,0.428571,0.571429,0.00,...,-1,-1,1,0,-1,-1,-1,-1,-1,-1
1525,What is the biggest rock found on Mars?,1,0,0,0,1,1,0.750000,0.250000,0.00,...,-1,-1,1,0,-1,-1,-1,-1,-1,-1


In [15]:
searchFeatSQS

sID,query,class,numQueries,numClicks,numClicksPerQuery,meanClickPosition,queryDistance,timeQueries,uniqueQueries,allSameQueries,repeatQueries,uniqueClicks,allSameClicks,repeatClicks,timeClicks,clickDistance
39899,collagen vascular disease lifestyle,0,1,-1,-1,-1,-1,-1,1,0,-1,-1,-1,-1,-1,-1
39900,france world cup 1998 reactions,0,1,-1,-1,-1,-1,-1,1,0,-1,-1,-1,-1,-1,-1
39901,dooney bourke look alike purses,0,1,-1,-1,-1,-1,-1,1,0,-1,-1,-1,-1,-1,-1
39902,VOIP phones,0,1,-1,-1,-1,-1,-1,1,0,-1,-1,-1,-1,-1,-1
39903,Travel to the poconos,0,1,-1,-1,-1,-1,-1,1,0,-1,-1,-1,-1,-1,-1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41399,Who plays the bad guy in Star Wars the Horde a...,1,1,-1,-1,-1,-1,-1,1,0,-1,-1,-1,-1,-1,-1
41400,What is a fox's favorite kind of food?,1,1,-1,-1,-1,-1,-1,1,0,-1,-1,-1,-1,-1,-1
41401,"Show me the movie called ""The Martian""",1,1,-1,-1,-1,-1,-1,1,0,-1,-1,-1,-1,-1,-1
41402,What is the biggest rock found on Mars?,1,1,-1,-1,-1,-1,-1,1,0,-1,-1,-1,-1,-1,-1
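The merged SQS frame shown earlier has an integer index, while `searchFeatSQS` is indexed by `sID` and shares the `query` and `class` columns with the data set. That is consistent with `merge()` joining on the shared columns rather than on the index. The toy sketch below shows how such a merge can multiply rows when a key repeats, and how an index-based `join` or the `validate` argument of `merge` behaves instead. It uses invented data and is one possible way to check the merge, not a statement about what the real data contain.

```python
import pandas as pd

# Toy frames, both indexed by sID; the query 'a' appears twice on purpose.
left = pd.DataFrame(
    {"query": ["a", "a", "b"], "class": [0, 0, 1]},
    index=pd.Index([10, 11, 12], name="sID"),
)
right = pd.DataFrame(
    {"query": ["a", "a", "b"], "class": [0, 0, 1], "numQueries": [1, 1, 1]},
    index=pd.Index([10, 11, 12], name="sID"),
)

# No `on`: pandas joins on the shared columns ('query', 'class') and ignores
# the index, so each repeated key is paired with every match on the other side.
by_columns = left.merge(right)

# Index-based alternative: drop the duplicated columns and align on sID.
by_index = left.join(right.drop(columns=["query", "class"]))

# validate="one_to_one" makes merge raise if the keys are not unique.
try:
    left.merge(right, validate="one_to_one")
    duplicated_keys = False
except pd.errors.MergeError:
    duplicated_keys = True

print(len(by_columns), len(by_index), duplicated_keys)
```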


# Return Aggregated Extracted Features

The following block of code saves each data set, joined with its extracted features, to disk as a pickle file.

In [4]:
# Use context managers so each file is flushed and closed after writing.
with open("DataSets/SWCFeatures/SWCFeat.p", "wb") as f:
    pickle.dump(SWCAll, f)
with open("DataSets/SQSFeatures/SQSFeat.p", "wb") as f:
    pickle.dump(SQSAll, f)
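A quick way to confirm that a saved feature file can be read back unchanged is a round-trip check. The sketch below uses a toy data frame and a temporary directory. The file name `ToyFeat.p` and the values are placeholders, not the notebook's real outputs.

```python
import os
import pickle
import tempfile

import pandas as pd

# Toy stand-in for an aggregated feature frame; values are invented.
toy = pd.DataFrame(
    {"class": [0, 1], "punct": [0.5, 0.0]},
    index=pd.Index([1, 2], name="sID"),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "ToyFeat.p")  # hypothetical file name
    with open(path, "wb") as f:
        pickle.dump(toy, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# equals() compares values, dtypes, and index labels.
round_trip_ok = toy.equals(restored)
print(round_trip_ok)
```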