# Transfer Learning on Stack Exchange Tags
## Kaggle competition
https://www.kaggle.com/c/transfer-learning-on-stack-exchange-tags

In [1]:
import pandas as pd
from statistics import mode
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
import re
from sklearn.ensemble import RandomForestClassifier

In [2]:
data_list = ['biology', 'cooking', 'crypto', 'diy', 'robotics', 'travel']
pd.set_option('display.max_colwidth', 800)

# Load each topic file and record which site it came from.
frames = []
for theme in data_list:
    path = "dados/" + theme + ".csv"
    x = pd.read_csv(path)
    x['subject'] = theme
    frames.append(x)

# DataFrame.append was removed in pandas 2.0; pd.concat builds the same frame in one step.
dataframe_1 = pd.concat(frames)

dataframe_1.head(10)

Unnamed: 0,id,title,content,tags,subject
0,1,What is the criticality of the ribosome binding site relative to the start codon in prokaryotic translation?,"<p>In prokaryotic translation, how critical for efficient translation is the location of the ribosome binding site, relative to the start codon?</p>\n\n<p>Ideally, it is supposed to be -7b away from the start. How about if it is -9 bases away or even more? Will this have an observable effect on translation?</p>\n",ribosome binding-sites translation synthetic-biology,biology
1,2,How is RNAse contamination in RNA based experiments prevented?,<p>Does anyone have any suggestions to prevent RNAse contamination when working with RNA?</p>\n\n<p>I tend to have issues with degradation regardless of whether I use DEPC treated / RNAse free water and filtered pipette tips.</p>\n,rna biochemistry,biology
2,3,Are lymphocyte sizes clustered in two groups?,"<p>Tortora writes in <em>Principles of Anatomy and Physiology</em>:</p>\n\n<blockquote>\n <p>Lymphocytes may be as small as 6–9 μm in diameter or as large as 10–14 μm in diameter.</p>\n</blockquote>\n\n<p>Those ranges are quite close to each others. Should the above be taken to mean that lymphocytes sizes are clustered in two groups, or is it just a way of saying that lymphocytes are 6-14 μm?</p>\n",immunology cell-biology hematology,biology
3,4,How long does antibiotic-dosed LB maintain good selection?,"<p>Various people in our lab will prepare a liter or so of LB, add kanamycin to 25-37 mg/L for selection, and store it at 4 °C for minipreps or other small cultures (where dosing straight LB with a 1000X stock is troublesome). Some think using it after more than a week is dubious, but we routinely use kan plates that are 1-2 months old with no ill effect.</p>\n\n<p>How long can LB with antibiotic such as kanamycin, chloramphenicol, or ampicillin be stored at 4 °C and maintain selection?</p>\n",cell-culture,biology
4,5,Is exon order always preserved in splicing?,"<p>Are there any cases in which the splicing machinery constructs an mRNA in which the exons are not in the 5' -> 3' genomic order? I'm interested any such cases, whether they involve constitutive or alternative splicing.</p>\n",splicing mrna spliceosome introns exons,biology
5,6,How can I avoid digesting protein-bound DNA?,"<p>I'm interested in sequencing and analyzing the bound DNA, and minimizing the amount of unbound DNA that gets sequenced through digestion.</p>\n\n<p>When digesting protein-bound DNA, is <em>all</em> of the unbound DNA digested? Is there a way to maximize the amount of unbound DNA that is digested?</p>\n",dna biochemistry molecular-biology,biology
6,8,Under what conditions do dendritic spines form?,"<p>I'm looking for resources or any information about the formation of dendritic spines and synaptogenesis, especially in relation to how new connections are formed on a daily basis.</p>\n\n<p>Does the electrotonic signalling along the axons and through the spines cause new connections to be made based on some kind of spatial condition (maybe an electrical or chemical attraction), or is there some larger heuristic here?</p>\n",neuroscience synapses,biology
7,9,How should I ship plasmids?,"<p>I shipped 10 µL of my vector miniprep to a collaborator in a 1.5 mL eppendorf parafilmed shut and stuffed into a 50 mL conical with some paper-towel padding. However, something happened on the way and there was nothing (no liquid) in the tube when it arrived. They didn't make any comments about the microcentrifuge tube popping open or broken parafilm, so nothing crazy happened but something did.</p>\n\n<p>What's the most reliable way to ship plasmids?</p>\n",plasmids,biology
8,10,What is the reason behind choosing the reporter gene when experimenting on your gene of interest?,"<p>I noticed within example experiments in class that different reporter genes are chosen to be inserted near your gene of interest to prove whether or not the gene is being expressed. For example, you may insert the gene for fluorescence next to your gene of interest so you know if it is transcribed or not by whether the organism's cells are fluorescent and to what degree they are fluorescing at.</p>\n\n<p>I have noticed in some experiments that have multiple versions that in one case they use the fluorescent gene and in the next a different gene (for example lactose). Both portions of the experiment use almost the exact same steps so why would they not choose the same reporter gene?</p>\n",molecular-genetics gene-expression experimental-design,biology
9,11,How many times did endosymbiosis occur?,"<p>According to the endosymbiont theory, mitochondria and chloroplasts originated as bacteria which were engulfed by larger cells. How many times is it estimated that this occurred in the past? Are there any examples of this process being observed directly?</p>\n",evolution mitochondria chloroplasts,biology


In [3]:
# Build the stop-word set once instead of on every call.
STOP_WORDS = set(stopwords.words('english'))

def words_cleaning(text):
    # Remove HTML
    x = BeautifulSoup(text, 'html5lib').get_text()
    # Replace every non-letter character with a space
    x_2 = re.sub("[^a-zA-Z]", " ", x)
    # Lower-case and split into words
    x_3 = x_2.lower().split()
    # Remove stop words
    x_4 = [w for w in x_3 if w not in STOP_WORDS]
    return " ".join(x_4)
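
The four steps in `words_cleaning` can be traced on a single string. The block below is a minimal standard-library sketch, not the function above: it swaps BeautifulSoup for `html.parser` and uses a tiny hypothetical stop-word set (`DEMO_STOP_WORDS`) in place of the NLTK list, so it runs without extra downloads. The names `demo_words_cleaning` and `_TextExtractor` are assumptions made for illustration.

```python
import re
from html.parser import HTMLParser

# Hypothetical stand-in for the NLTK English stop-word list (tiny subset).
DEMO_STOP_WORDS = {"is", "the", "of", "in", "to", "how", "a"}


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment and drops the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def demo_words_cleaning(text, stop_words=DEMO_STOP_WORDS):
    # 1. Strip HTML tags, keeping only the text.
    extractor = _TextExtractor()
    extractor.feed(text)
    extractor.close()
    plain = "".join(extractor.parts)
    # 2. Replace every non-letter character with a space.
    letters_only = re.sub("[^a-zA-Z]", " ", plain)
    # 3. Lower-case and split on whitespace.
    tokens = letters_only.lower().split()
    # 4. Drop stop words and join the rest back into one string.
    return " ".join(w for w in tokens if w not in stop_words)


cleaned = demo_words_cleaning(
    "<p>How is the -7b site <em>critical</em> to translation?</p>"
)
print(cleaned)  # b site critical translation
```

Note that digits and punctuation are removed before splitting, so `-7b` survives only as the single letter `b`.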

In [4]:
dataframe_1['title'] = dataframe_1['title'].apply(words_cleaning)

In [5]:
dataframe_1['content'] = dataframe_1['content'].apply(words_cleaning)

In [6]:
dataframe_1.head()

Unnamed: 0,id,title,content,tags,subject
0,1,criticality ribosome binding site relative start codon prokaryotic translation,prokaryotic translation critical efficient translation location ribosome binding site relative start codon ideally supposed b away start bases away even observable effect translation,ribosome binding-sites translation synthetic-biology,biology
1,2,rnase contamination rna based experiments prevented,anyone suggestions prevent rnase contamination working rna tend issues degradation regardless whether use depc treated rnase free water filtered pipette tips,rna biochemistry,biology
2,3,lymphocyte sizes clustered two groups,tortora writes principles anatomy physiology lymphocytes may small diameter large diameter ranges quite close others taken mean lymphocytes sizes clustered two groups way saying lymphocytes,immunology cell-biology hematology,biology
3,4,long antibiotic dosed lb maintain good selection,various people lab prepare liter lb add kanamycin mg l selection store c minipreps small cultures dosing straight lb x stock troublesome think using week dubious routinely use kan plates months old ill effect long lb antibiotic kanamycin chloramphenicol ampicillin stored c maintain selection,cell-culture,biology
4,5,exon order always preserved splicing,cases splicing machinery constructs mrna exons genomic order interested cases whether involve constitutive alternative splicing,splicing mrna spliceosome introns exons,biology


In [7]:
# Bag-of-words counts, keeping only the 5000 most frequent words.
vectorizer = CountVectorizer(analyzer="word",
                             tokenizer=None,
                             preprocessor=None,
                             stop_words=None,
                             max_features=5000)

In [8]:
# Join title and content with a space so the last title word does not fuse with the first content word.
train_words = list(dataframe_1['title'] + " " + dataframe_1['content'])
train_words[:5]

['criticality ribosome binding site relative start codon prokaryotic translation prokaryotic translation critical efficient translation location ribosome binding site relative start codon ideally supposed b away start bases away even observable effect translation',
 'rnase contamination rna based experiments prevented anyone suggestions prevent rnase contamination working rna tend issues degradation regardless whether use depc treated rnase free water filtered pipette tips',
 'lymphocyte sizes clustered two groups tortora writes principles anatomy physiology lymphocytes may small diameter large diameter ranges quite close others taken mean lymphocytes sizes clustered two groups way saying lymphocytes',
 'long antibiotic dosed lb maintain good selection various people lab prepare liter lb add kanamycin mg l selection store c minipreps small cultures dosing straight lb x stock troublesome think using week dubious routinely use kan plates months old ill effect long lb antibiotic kanamycin chl

In [9]:
train_data_features = vectorizer.fit_transform(train_words)
train_data_features = train_data_features.toarray()

In [10]:
train_data_features.shape

(87000, 5000)
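
The shape reads as one row per question and one column per vocabulary word. As a rough illustration of what such a count matrix holds, here is a pure-Python sketch. It is not scikit-learn's implementation: it splits on whitespace only (the library's default tokenizer also drops single-character tokens), and `demo_bag_of_words` and the three toy documents are assumptions made for this example.

```python
from collections import Counter


def demo_bag_of_words(docs, max_features):
    # Count every word over the whole corpus.
    corpus_counts = Counter(word for doc in docs for word in doc.split())
    # Keep the max_features most frequent words, then sort them alphabetically.
    top_words = [word for word, _ in corpus_counts.most_common(max_features)]
    vocab = sorted(top_words)
    # One row per document, one column per vocabulary word.
    matrix = []
    for doc in docs:
        counts = Counter(doc.split())
        matrix.append([counts[word] for word in vocab])
    return vocab, matrix


# Toy documents (hypothetical, chosen so the frequency cutoff has no ties).
docs = ["rna rna biochemistry", "dna biochemistry", "rna dna plasmids"]
vocab, matrix = demo_bag_of_words(docs, max_features=3)

print(vocab)   # ['biochemistry', 'dna', 'rna']
print(matrix)  # [[1, 0, 2], [1, 1, 0], [0, 1, 1]]
```

The word `plasmids` occurs only once, so it falls outside the three-word vocabulary and contributes nothing to the third row.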

In [11]:
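# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
vocab = vectorizer.get_feature_names_out().tolist()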
print(vocab[:100])

['aa', 'ab', 'abc', 'ability', 'able', 'abroad', 'absence', 'absolute', 'absolutely', 'absorb', 'absorbed', 'ac', 'acc', 'acceleration', 'accelerometer', 'accept', 'acceptable', 'accepted', 'access', 'accessible', 'accidentally', 'accommodate', 'accommodation', 'accomplish', 'according', 'account', 'accounts', 'accuracy', 'accurate', 'accurately', 'achieve', 'achieved', 'acid', 'acidic', 'acids', 'across', 'acrylic', 'act', 'acting', 'action', 'activate', 'activated', 'activation', 'active', 'activities', 'activity', 'acts', 'actual', 'actually', 'ad', 'adapter', 'adaptive', 'add', 'added', 'adding', 'addition', 'additional', 'additionally', 'additive', 'address', 'adds', 'adequate', 'adhesive', 'adjacent', 'adjust', 'adjustable', 'adjusted', 'adult', 'adults', 'advance', 'advanced', 'advantage', 'advantages', 'adversary', 'advice', 'advise', 'advised', 'ae', 'aes', 'afci', 'affect', 'affected', 'affects', 'afford', 'affordable', 'afraid', 'africa', 'african', 'afternoon', 'afterwards'

In [40]:
forest = RandomForestClassifier(n_estimators = 10, max_depth=5) 
forest = forest.fit( train_data_features, dataframe_1['tags'] )
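
Because the target is the raw `tags` column, each distinct space-separated tag string becomes its own class, and the forest can only ever predict a tag combination it has seen verbatim. The sketch below makes that concrete using the tag strings of the first six training rows shown earlier; the variable names are assumptions for illustration.

```python
from collections import Counter

# Tag strings copied from the first six rows of the training data shown above.
tag_strings = [
    "ribosome binding-sites translation synthetic-biology",
    "rna biochemistry",
    "immunology cell-biology hematology",
    "cell-culture",
    "splicing mrna spliceosome introns exons",
    "dna biochemistry molecular-biology",
]

# What the classifier sees: one class per distinct full string.
label_classes = set(tag_strings)

# What the questions actually carry: individual tags.
tag_counts = Counter(tag for s in tag_strings for tag in s.split())

print(len(label_classes))          # 6
print(len(tag_counts))             # 17
print(tag_counts["biochemistry"])  # 2
```

Here `biochemistry` appears in two questions, but as part of two unrelated classes, so the model gets no shared signal for that tag.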

In [13]:
dataframe_test = pd.read_csv('dados/test.csv')
dataframe_test.head()

Unnamed: 0,id,title,content
0,1,What is spin as it relates to subatomic particles?,"<p>I often hear about subatomic particles having a property called ""spin"" but also that it doesn't actually relate to spinning about an axis like you would think. Which particles have spin? What does spin mean if not an actual spinning motion?</p>\n"
1,2,What is your simplest explanation of the string theory?,<p>How would you explain string theory to non physicists such as myself? I'm specially interested on how plausible is it and what is needed to successfully prove it?</p>\n
2,3,"Lie theory, Representations and particle physics","<p>This is a question that has been posted at many different forums, I thought maybe someone here would have a better or more conceptual answer than I have seen before:</p>\n\n<p>Why do physicists care about representations of Lie groups? For myself, when I think about a representation that means there is some sort of group acting on a vector space, what is the vector space that this Lie group is acting on? </p>\n\n<p>Or is it that certain things have to be invariant under a group action?\nmaybe this is a dumb question, but i thought it might be a good start...</p>\n\n<p>To clarify, I am specifically thinking of the symmetry groups that people think about in relation to the standard model. I do not care why it might be a certain group, but more how we see the group acting, what is it a..."
3,7,Will Determinism be ever possible?,<p>What are the main problems that we need to solve to prove Laplace's determinism correct and overcome the Uncertainty principle?</p>\n
4,9,Hamilton's Principle,"<p>Hamilton's principle states that a dynamic system always follows a path such that its action integral is stationary (that is, maximum or minimum).</p>\n\n<p>Why should the action integral be stationary? On what basis did Hamilton state this principle?</p>\n"


In [14]:
dataframe_test['title'] = dataframe_test['title'].apply(words_cleaning)
dataframe_test['content'] = dataframe_test['content'].apply(words_cleaning)
dataframe_test.head()

Unnamed: 0,id,title,content
0,1,spin relates subatomic particles,often hear subatomic particles property called spin also actually relate spinning axis like would think particles spin spin mean actual spinning motion
1,2,simplest explanation string theory,would explain string theory non physicists specially interested plausible needed successfully prove
2,3,lie theory representations particle physics,question posted many different forums thought maybe someone would better conceptual answer seen physicists care representations lie groups think representation means sort group acting vector space vector space lie group acting certain things invariant group action maybe dumb question thought might good start clarify specifically thinking symmetry groups people think relation standard model care might certain group see group acting acting etc
3,7,determinism ever possible,main problems need solve prove laplace determinism correct overcome uncertainty principle
4,9,hamilton principle,hamilton principle states dynamic system always follows path action integral stationary maximum minimum action integral stationary basis hamilton state principle


In [15]:
# Join title and content with a space so the last title word does not fuse with the first content word.
test_words = list(dataframe_test['title'] + " " + dataframe_test['content'])
test_words[:5]

['spin relates subatomic particles often hear subatomic particles property called spin also actually relate spinning axis like would think particles spin spin mean actual spinning motion',
 'simplest explanation string theory would explain string theory non physicists specially interested plausible needed successfully prove',
 'lie theory representations particle physics question posted many different forums thought maybe someone would better conceptual answer seen physicists care representations lie groups think representation means sort group acting vector space vector space lie group acting certain things invariant group action maybe dumb question thought might good start clarify specifically thinking symmetry groups people think relation standard model care might certain group see group acting acting etc',
 'determinism ever possible main problems need solve prove laplace determinism correct overcome uncertainty principle',
 'hamilton principle hamilton principle states dynamic system a

In [16]:
test_data_features = vectorizer.transform(test_words)
test_data_features = test_data_features.toarray()

In [24]:
test_data_features.shape

(81926, 5000)

In [38]:
output = pd.DataFrame( columns=['id', 'tags'] )
output.to_csv( "submit_3.csv", index=False, quoting=1)

In [34]:
test_data_features[1].reshape(1,-1)

array([[0, 0, 0, ..., 0, 0, 0]])

In [39]:
with open('submit_3.csv', 'a') as f:
    for i in range(test_data_features.shape[0]):
        result = forest.predict(test_data_features[i].reshape(1,-1))  
        output = pd.DataFrame( data={"id":dataframe_test["id"][i], "tags":result} )
        output.to_csv(f, index=False, quoting=1, header=False)

In [42]:
forest = RandomForestClassifier(n_estimators = 50, max_depth=10)
forest = forest.fit( train_data_features, dataframe_1['tags'] )

In [43]:
output = pd.DataFrame( columns=['id', 'tags'] )
output.to_csv( "submit_4.csv", index=False, quoting=1)
with open('submit_4.csv', 'a') as f:
    for i in range(test_data_features.shape[0]):
        result = forest.predict(test_data_features[i].reshape(1,-1))  
        output = pd.DataFrame( data={"id":dataframe_test["id"][i], "tags":result} )
        output.to_csv(f, index=False, quoting=1, header=False)

In [44]:
forest = RandomForestClassifier(n_estimators = 25, max_depth=15) 
forest = forest.fit( train_data_features, dataframe_1['tags'] )
output = pd.DataFrame( columns=['id', 'tags'] )
output.to_csv( "submit_5.csv", index=False, quoting=1)
with open('submit_5.csv', 'a') as f:
    for i in range(test_data_features.shape[0]):
        result = forest.predict(test_data_features[i].reshape(1,-1))  
        output = pd.DataFrame( data={"id":dataframe_test["id"][i], "tags":result} )
        output.to_csv(f, index=False, quoting=1, header=False)
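
The three submission cells repeat the same predict-and-append loop with different hyperparameters. One possible way to factor it out is sketched below, assuming only that the model exposes a `predict` method. It predicts in chunks rather than one row at a time and writes with `csv.QUOTE_ALL`, which is the constant behind `quoting=1`. The helper `write_submission`, its `chunk_size` parameter, and the `ConstantModel` stand-in are hypothetical names, not part of the notebook.

```python
import csv
import io


def write_submission(model, features, ids, handle, chunk_size=1000):
    """Write id/tags rows to an open text handle, predicting chunk by chunk."""
    writer = csv.writer(handle, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow(["id", "tags"])
    for start in range(0, len(features), chunk_size):
        stop = start + chunk_size
        # One predict call per chunk keeps memory bounded.
        predictions = model.predict(features[start:stop])
        for row_id, tags in zip(ids[start:stop], predictions):
            writer.writerow([row_id, tags])


class ConstantModel:
    """Hypothetical stand-in for a fitted classifier; always predicts 'demo-tag'."""

    def predict(self, rows):
        return ["demo-tag"] * len(rows)


# Demo on three fake feature rows, written to an in-memory buffer.
buffer = io.StringIO()
write_submission(ConstantModel(), [[0, 1], [1, 0], [0, 0]], [1, 2, 3],
                 buffer, chunk_size=2)
print(buffer.getvalue())
```

With the notebook's objects, the call would take the fitted `forest`, `test_data_features`, the `id` column converted to a list, and a file opened for writing; the chunk size is a tuning knob rather than a recommended value.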