Transfer Learning on Stack Exchange Tags
----------------------------------------

 *Predict tags from models trained on unrelated topics* 

*Qualitative Description of Task*
What does physics have in common with biology, cooking, cryptography, diy, robotics, and travel? If you answered "all pursuits are governed by the immutable laws of physics" we'll begrudgingly give you partial credit. If you answered "all were chosen randomly by a scheming Kaggle employee for a twisted transfer learning competition", congratulations, we accept your answer and mark the question as solved.

In this competition, we provide the titles, text, and tags of Stack Exchange questions from six different sites. We then ask for tag predictions on unseen physics questions. Solving this problem via a standard machine learning approach might involve training an algorithm on a corpus of related text. Here, you are challenged to train on material from outside the field. Can an algorithm learn appropriate physics tags from "extreme-tourism Antarctica"? Let's find out.

Kaggle is hosting this competition for the data science community to use for fun and education. This dataset originates from the Stack Exchange data dump.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load in

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

# Any results you write to the current directory are saved as output.

In [None]:
"""
To Form a Quantitative Description of the Problem 
Let's See What Output is Expected
"""
print("sample Submission \n")
submission_format = {"sample_submission": pd.read_csv("../input/sample_submission.csv")}
print(submission_format["sample_submission"].iloc[2])

print("\n Test Format \n")
test_format = pd.read_csv("../input/test.csv")
test_format.head(6)

Our Working Description:
(source: http://machinelearningmastery.com/how-to-define-your-machine-learning-problem/)

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

Task (T): Classify unseen physics questions with tags learned from E.
Experience (E): A corpus of tagged Stack Exchange questions from 6 distinct sites.
Performance (P): Classification accuracy, i.e. the percentage of tags predicted correctly across all questions considered.
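
As a rough illustration of P, the sketch below scores a few toy predictions. It takes one possible reading of the definition: the share of true tags that the predictions recover, pooled over all questions (effectively micro-averaged recall, so extra wrong tags are not penalized). The helper name `tag_accuracy`, the toy tags, and the assumption that tags are stored as space-separated strings are all illustrative, not taken from the competition.

```python
def tag_accuracy(true_tags, predicted_tags):
    """Percentage of true tags recovered, pooled over all questions.

    Assumption: each element is a space-separated string of tags.
    """
    correct = 0
    total = 0
    for truth, guess in zip(true_tags, predicted_tags):
        truth_set = set(truth.split())
        guess_set = set(guess.split())
        correct += len(truth_set & guess_set)  # tags we got right
        total += len(truth_set)                # tags we should have found
    return 100.0 * correct / total if total else 0.0


# Made-up examples, not real competition data
toy_true = ["quantum-mechanics spin", "optics", "thermodynamics entropy"]
toy_pred = ["quantum-mechanics", "optics lasers", "entropy thermodynamics"]

score = tag_accuracy(toy_true, toy_pred)
print(score)
```

Note that the second toy prediction includes a wrong extra tag (`lasers`) at no cost, which is one reason a measure balancing precision and recall may be worth considering later.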


In [None]:
"""
Load data and test the structure of our input data
"""

dataframes = {
    "biology": pd.read_csv("../input/biology.csv"),
    "cooking": pd.read_csv("../input/cooking.csv"),
    "crypto": pd.read_csv("../input/crypto.csv"),
    "diy": pd.read_csv("../input/diy.csv"),
    "robotics": pd.read_csv("../input/robotics.csv"),
    "travel": pd.read_csv("../input/travel.csv"), 
}

from random import randint, choice

# Peek at one of the first 21 rows of a randomly chosen site
site = choice(list(dataframes))
print(dataframes[site].iloc[randint(0, 20)])



Key Assumptions Check
=====================
*assume = to make an 'ass' out of 'u' and 'me'*

Seeing as Transfer Learning is in the title, Kaggle presumably wants us to solve this by training a highly accurate classifier on the given sites and then transferring that model to the physics domain.

1. I'll probably end up using a Softmax Regression as our transfer model. ![Example Softmax Regression][1]
   1.1. This is already a problem: softmax assumes the output classes are mutually exclusive, and tags clearly are not. Maybe there is an amalgam of tags that is mutually exclusive (much like super-pixels are used for classification). Ideally, the softmax serves as an "activation" or "link" function, shaping the output of our linear function into whatever form we want (e.g. a probability distribution).
2. A set of binary classifiers, one per tag, will probably be a good baseline for our final function.
3. Our base model should probably be extremely feature-rich.

  [1]: https://www.harrisgeospatial.com/docs/html/images/Classification/SoftmaxDiagram.gif
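
To make the mutual-exclusivity concern in point 1.1 concrete, here is a minimal NumPy sketch contrasting softmax with independent per-tag sigmoids (the "multiple binary classifier" idea in point 2). The tag names and scores are made up for illustration, and the 0.5 threshold is an arbitrary assumption rather than a tuned value.

```python
import numpy as np


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1 (one winner)."""
    shifted = scores - np.max(scores)  # subtract max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()


def sigmoid(scores):
    """Score each tag independently, so several can be 'on' at once."""
    return 1.0 / (1.0 + np.exp(-scores))


tag_names = ["energy", "waves", "safety"]  # hypothetical tags
scores = np.array([2.0, 1.5, -1.0])        # made-up linear model outputs

softmax_probs = softmax(scores)
sigmoid_probs = sigmoid(scores)

# Softmax forces tags to compete: we pick a single best tag.
single_label = tag_names[int(np.argmax(softmax_probs))]

# Independent sigmoids let every tag above the threshold through.
multi_label = [name for name, p in zip(tag_names, sigmoid_probs) if p > 0.5]

print(softmax_probs.sum(), single_label, multi_label)
```

With softmax, the two strong tags split the probability mass and only one survives the argmax; with per-tag sigmoids both pass the threshold, which is closer to how questions are actually tagged.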

In [None]:
stats = pd.concat(dataframes)
stats.head()
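
When `pd.concat` is given a dict, the keys become the outer level of a MultiIndex, so every row stays labeled with the site it came from. The sketch below shows this on two tiny made-up frames; `toy_frames` and its contents are hypothetical stand-ins for the real CSVs.

```python
import pandas as pd

# Made-up stand-ins for the per-site dataframes
toy_frames = {
    "cooking": pd.DataFrame(
        {"title": ["How long to boil an egg?"], "tags": ["eggs boiling"]}
    ),
    "travel": pd.DataFrame(
        {
            "title": ["Visa for a layover?", "Cheapest rail pass?"],
            "tags": ["visas transit", "trains budget"],
        }
    ),
}

# Dict keys become the outer index level: (site, original row number)
toy_stats = pd.concat(toy_frames)

# Count rows per site using the outer index level
rows_per_site = toy_stats.groupby(level=0).size()

# Select all rows that came from one site
travel_rows = toy_stats.loc["travel"]

print(rows_per_site)
```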