In [1]:
import pandas
import numpy as np
from bs4 import BeautifulSoup

In [2]:
import os

In [3]:
os.listdir('.')

['test.csv',
 '.ipynb_checkpoints',
 'diy.csv',
 'cooking.csv',
 'biology.csv',
 'sample_submission.csv',
 'crypto.csv',
 'zips',
 'robotics.csv',
 'travel.csv',
 'Untitled.ipynb']

In [4]:
data_list = {key:pandas.read_csv(key) for key in ['diy.csv',
                                                'cooking.csv',
                                                'biology.csv',
                                                'crypto.csv',
                                                'robotics.csv',
                                                'travel.csv']
           }

In [5]:
#get column names
list(data_list['diy.csv'])

['id', 'title', 'content', 'tags']

In [8]:
for key in data_list:
    data_list[key]['source'] = key

In [9]:
data = pandas.concat(data_list.values())

In [10]:
data.shape

(87000, 5)

In [12]:
data.iloc[0]

id                                                         1
title      How do I install a new, non load bearing wall ...
content    <p>I'm looking to finish my basement and simpl...
tags                           remodeling basement carpentry
source                                               diy.csv
Name: 0, dtype: object

In [13]:
# sample 20 random rows and look at their tags
random_indices = np.random.randint(0, len(data), 20)
for index in random_indices:
    print('index: %d' % index)
    print(data.iloc[index]['tags'])

index: 1950
fence post gates
index: 85008
integrity steganography
index: 70423
kale
index: 71087
seasoning
index: 68746
pizza
index: 45761
budget europe schengen destinations itineraries
index: 26323
molecular-biology genetics
index: 45817
belgium leuven
index: 45303
air-travel tickets airlines bookings
index: 23154
kitchens
index: 62442
flavor deep-frying
index: 84572
chosen-plaintext-attack
index: 76913
encryption public-key one-way-function
index: 66508
frying oil deep-frying
index: 84255
aes cbc initialization-vector
index: 43086
food-and-drink cuba dietary-restrictions
index: 71419
honey
index: 21747
wood stairs handrail
index: 26587
human-biology eyes reflexes
index: 73751
flavor


It looks like some tags are hyphenated multi-word terms. I should look more closely at the context in which these appear, since I may need to use n-grams and/or network-based models to extract similar keywords.
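One way to connect hyphenated tags to n-grams is to generate every word n-gram in a piece of text and join the words with hyphens, so the candidates have the same shape as the tags. This is only a rough sketch: the function name `hyphenated_ngrams`, the `max_n=3` cutoff, and the simple lowercase tokenizer are my own assumptions, not something taken from the data set.

```python
import re

def hyphenated_ngrams(text, max_n=3):
    """Return the set of word n-grams (n = 1..max_n) in text, joined with hyphens."""
    # Assumed tokenizer: lowercase runs of letters and digits.
    words = re.findall(r'[a-z0-9]+', text.lower())
    grams = set()
    for n in range(1, max_n + 1):
        for start in range(len(words) - n + 1):
            grams.add('-'.join(words[start:start + n]))
    return grams
```

Intersecting `hyphenated_ngrams(title)` with the set of known tags would then give the tag candidates that literally occur in a title.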

In [15]:
hyphenated_indexes = [45303, 62442, 84572, 76913, 66508, 84255, 43086, 26587]
for index in hyphenated_indexes:
    # print tags, title and full body for each hyphenated example
    print('index: %d' % index)
    print('tags: %s' % data.iloc[index]['tags'])
    print('title: %s' % data.iloc[index]['title'])
    print('BODY:')
    print(data.iloc[index]['content'])

index: 45303
tags: air-travel tickets airlines bookings
title: How far in advance is it recommended to book flight tickets?
BODY:
<p>Ok so I know that rule of thumb for booking flight tickets is 'earlier the better' but I am not sure how far in advance should the action be taken.
I have seen many cases where the min. prices at say 4 months before travel dates are higher than 3 months before, this I believe is primarily due to a new flight route being added or some complex airline thingy beyond me but this almost always happens. Also let me know if this is common or I am mistaken here.</p>

<p>So I am asking what is the time line that I should draw untill I look for better deals and book the flight eventually. I have to travel in last week June and return around 1 week Aug, I am open to changes in travel dates by 1 week in both direction and both ways.</p>

index: 62442
tags: flavor deep-frying
title: Deep frying - taste difference in saturated vs. unsaturated oil
BODY:
<p>In <a href="h

Notes on each selection above:

1. The first sample does not use "air travel"/"air-travel", "bookings", or "airlines". The title does use the word "book", though, in the sense of "booking".

2. "Deep frying" (non-hyphenated) is in the title. "Taste" is present several times, but "flavor" is only used at the very end, in a parenthetical statement.

3. The only occurrence of a keyword ("chosen plaintext attack") is its acronym. Its definition would have to be found elsewhere.

4. Only one keyword ("public key") is present in the content, and "one-way-function" is sort of present in the last paragraph as "one-way-ish functions". (Hyphenation is actually used here, but the "-ish" part of the English language is kind of funny.)

5. "frying" and "oil" and "deep frying" are all present in this body. Interestingly "french fries" are not the main focus here.

6. AES and CBC acronyms are both present in the title, and AES is also present in body. IV is present in both title and body, but its definition (in parentheses) is present in the body as well. Its definition is the relevant tag. Perhaps the frequency of usage of acronyms vs. their full names should be analyzed to determine which one to use.

7. Apart from the mistake of visiting a poor country where everything is cooked in lard as a vegetarian who doesn't know the language..."food-and-drink" is a general category, which might be derived from context; "Cuba" can be inferred from the title, as it is not a very common subject and important to the topic at hand; "dietary restrictions" is probably inferable from the constant use of "vegetarian", and possibly from the consistent uses of first-person verbs.

8. This would be a hard example. There are references to people, which might allow "human-biology" to come into play via context; and "looking at (X)" might relate to "eyes", and "trigger" for "reflexes", and perhaps the indication of "reflexes" might relate "looking at (X)" to "eyes" more.
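The manual checks above could be partly automated by classifying how each tag shows up in a post: as written, with hyphens replaced by spaces, or as an acronym built from the initials of its parts. The sketch below is hypothetical; `tag_acronym` and `match_kind` are names I made up, and treating an uppercase run of initials as the acronym is an assumption that will misfire on short acronyms such as "IV".

```python
import re

def tag_acronym(tag):
    """Initials of a hyphenated tag, e.g. 'initialization-vector' -> 'IV'."""
    parts = [part for part in tag.split('-') if part]
    if len(parts) < 2:
        return None  # single-word tags have no acronym under this assumption
    return ''.join(part[0] for part in parts).upper()

def match_kind(tag, text):
    """Return 'literal', 'spaced', 'acronym', or None for how tag appears in text."""
    lowered = text.lower()
    if re.search(r'\b' + re.escape(tag) + r'\b', lowered):
        return 'literal'
    spaced = tag.replace('-', ' ')
    if spaced != tag and re.search(r'\b' + re.escape(spaced) + r'\b', lowered):
        return 'spaced'
    acronym = tag_acronym(tag)
    # Acronyms are matched case-sensitively against the original text.
    if acronym and re.search(r'\b' + re.escape(acronym) + r'\b', text):
        return 'acronym'
    return None
```

Tags for which `match_kind` returns `None` on both title and body (like "flavor" in a post that only says "taste") are the ones that would need context rather than string matching.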

At this point, it seems like the following is the case:

1. Kaggle scores below 0.3 could be genuine, since that appears to be roughly what a single correct guess and two incorrect guesses per post would earn on a lot of these, and the ones above 0.9 are likely hand-labeled (cheaters).

2. While I could probably achieve a minimal level of performance by simply extracting words, using TF-IDF and a bit of conjugation handling, better performance requires mining the data set to determine how keywords are formed from the entire corpus, and figuring out where they are relevant.

3. First, I think some summary statistics should be extracted, segmented by source, to get a better idea of what we're working with:

    a) How many times each key is used across all posts
    
    b) How many times each key appears as a substring in the post it is attached to, (i) with and (ii) without considering conjugated forms as equal
    
    c) How many times each key appears in all posts (regardless of their being a key(word)) (i) with and (ii) without considering conjugated forms as equal

4. Second-order statistics and visualization may also prove useful to see if there's any structure that can be taken advantage of:

    a) Build a co-occurrence network of keywords. Clustering may demonstrate generative behavior. Visualization in Gephi is possible.
    
    b) Identify some key words/phrases from each entry using standard methodology, and then build a dictionary for each keyword. Data reduction might have to be used if this is too memory-intensive.
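A rough sketch of statistics 3a, 3b(ii) and the edge weights for 4a follows. It works on plain iterables of strings so it does not depend on the DataFrame layout. The function names are hypothetical, tags are assumed to be space-separated strings with no missing values, and "appears as a substring" is taken literally, with hyphens read as spaces and no stemming.

```python
from collections import Counter
from itertools import combinations

def tag_counts(tag_strings):
    """3a: number of posts each tag is attached to."""
    counts = Counter()
    for tags in tag_strings:
        counts.update(set(tags.split()))
    return counts

def self_mention_rate(tag_strings, texts):
    """3b(ii): fraction of (post, tag) pairs whose tag occurs verbatim in that post."""
    hits = 0
    total = 0
    for tags, text in zip(tag_strings, texts):
        normalized = text.lower().replace('-', ' ')
        for tag in set(tags.split()):
            total += 1
            if tag.replace('-', ' ') in normalized:
                hits += 1
    return hits / float(total) if total else 0.0

def cooccurrence_counts(tag_strings):
    """4a: how often each unordered pair of tags shares a post."""
    edges = Counter()
    for tags in tag_strings:
        for first, second in combinations(sorted(set(tags.split())), 2):
            edges[(first, second)] += 1
    return edges
```

With the combined frame, something like `tag_counts(data['tags'])` or `self_mention_rate(data['tags'], data['title'] + ' ' + data['content'])` should apply directly, and running the same calls per group of `data.groupby('source')` would give the per-source breakdown.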


Finally, this problem can be split into two parts:

A. How do you extract keywords from a given corpus?

B. How do you relate the title and content of a question to the keywords?

Also, are these two problems identical across different (general) categories, or do some categories differ substantially from others?
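For part A, the minimal baseline mentioned earlier (plain word extraction scored with TF-IDF) might look roughly like the sketch below. It uses only the standard library. The tokenizer, the `top_k=3` default, and the unsmoothed `log(N / df)` weighting are my own assumptions, and there is no stop-word removal or conjugation handling.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Assumed tokenizer: lowercase alphabetic runs only."""
    return re.findall(r'[a-z]+', text.lower())

def tfidf_keywords(documents, top_k=3):
    """Return the top_k highest TF-IDF words for each document."""
    tokenized = [tokenize(doc) for doc in documents]
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    n_docs = len(tokenized)
    results = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        scores = {}
        for word, count in term_freq.items():
            tf = count / float(len(tokens))
            idf = math.log(n_docs / float(doc_freq[word]))
            scores[word] = tf * idf
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda word: (-scores[word], word))
        results.append(ranked[:top_k])
    return results
```

Comparing these per-post keywords against the true tags, per source, would be one way to see whether the categories behave differently.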

# Preliminary method ideas

(incomplete)

1. Learn word encodings via neural network and try to extract a "keywordiness" feature from it via MLP logistic learning. 

2. Consider a modified version of the "Skip-Thought" algorithm, which encodes sentences. Normally the algorithm requires a large body of contiguous text, as opposed to many smaller ones.

3. Prepend the question to the body text for more uniform treatment of text (at least initially).
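Idea 3 could be sketched as below: strip the HTML from the body and put the title in front of it. This assumes Python 3 and uses the standard library's `HTMLParser` to stay self-contained; `strip_html` and `title_plus_body` are hypothetical names. `BeautifulSoup(content, 'html.parser').get_text()` would be an alternative, since `bs4` is already imported at the top.

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment and ignores the tags."""

    def __init__(self):
        HTMLParser.__init__(self, convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def strip_html(markup):
    """Return the visible text of markup with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    # Joining with a space keeps adjacent paragraphs apart, at the cost of
    # splitting a word that has inline markup in the middle of it.
    return ' '.join(' '.join(parser.chunks).split())

def title_plus_body(title, content):
    """Prepend the question title to the cleaned body text."""
    return title.strip() + '\n' + strip_html(content)
```

On the combined frame this could be applied as `[title_plus_body(t, c) for t, c in zip(data['title'], data['content'])]`.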

In [16]:
data.to_csv('combined_data.csv', index=False)