# Transfer Learning on Stack Exchange Tags
## Kaggle competition
https://www.kaggle.com/c/transfer-learning-on-stack-exchange-tags

In [48]:
import pandas as pd
from statistics import mode
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
import csv

In [49]:
data_list = ['biology', 'cooking', 'crypto', 'diy', 'robotics', 'travel']
pd.set_option('display.max_colwidth', 800)

# DataFrame.append was removed in pandas 2.0, so read every theme file
# first and concatenate them in one call
frames = []
for theme in data_list:
    path = "dados/" + theme + ".csv"
    frames.append(pd.read_csv(path))
dataframe_1 = pd.concat(frames)

dataframe_1.head(10)

Unnamed: 0,id,title,content,tags
0,1,What is the criticality of the ribosome binding site relative to the start codon in prokaryotic translation?,"<p>In prokaryotic translation, how critical for efficient translation is the location of the ribosome binding site, relative to the start codon?</p>\n\n<p>Ideally, it is supposed to be -7b away from the start. How about if it is -9 bases away or even more? Will this have an observable effect on translation?</p>\n",ribosome binding-sites translation synthetic-biology
1,2,How is RNAse contamination in RNA based experiments prevented?,<p>Does anyone have any suggestions to prevent RNAse contamination when working with RNA?</p>\n\n<p>I tend to have issues with degradation regardless of whether I use DEPC treated / RNAse free water and filtered pipette tips.</p>\n,rna biochemistry
2,3,Are lymphocyte sizes clustered in two groups?,"<p>Tortora writes in <em>Principles of Anatomy and Physiology</em>:</p>\n\n<blockquote>\n <p>Lymphocytes may be as small as 6–9 μm in diameter or as large as 10–14 μm in diameter.</p>\n</blockquote>\n\n<p>Those ranges are quite close to each others. Should the above be taken to mean that lymphocytes sizes are clustered in two groups, or is it just a way of saying that lymphocytes are 6-14 μm?</p>\n",immunology cell-biology hematology
3,4,How long does antibiotic-dosed LB maintain good selection?,"<p>Various people in our lab will prepare a liter or so of LB, add kanamycin to 25-37 mg/L for selection, and store it at 4 °C for minipreps or other small cultures (where dosing straight LB with a 1000X stock is troublesome). Some think using it after more than a week is dubious, but we routinely use kan plates that are 1-2 months old with no ill effect.</p>\n\n<p>How long can LB with antibiotic such as kanamycin, chloramphenicol, or ampicillin be stored at 4 °C and maintain selection?</p>\n",cell-culture
4,5,Is exon order always preserved in splicing?,"<p>Are there any cases in which the splicing machinery constructs an mRNA in which the exons are not in the 5' -> 3' genomic order? I'm interested any such cases, whether they involve constitutive or alternative splicing.</p>\n",splicing mrna spliceosome introns exons
5,6,How can I avoid digesting protein-bound DNA?,"<p>I'm interested in sequencing and analyzing the bound DNA, and minimizing the amount of unbound DNA that gets sequenced through digestion.</p>\n\n<p>When digesting protein-bound DNA, is <em>all</em> of the unbound DNA digested? Is there a way to maximize the amount of unbound DNA that is digested?</p>\n",dna biochemistry molecular-biology
6,8,Under what conditions do dendritic spines form?,"<p>I'm looking for resources or any information about the formation of dendritic spines and synaptogenesis, especially in relation to how new connections are formed on a daily basis.</p>\n\n<p>Does the electrotonic signalling along the axons and through the spines cause new connections to be made based on some kind of spatial condition (maybe an electrical or chemical attraction), or is there some larger heuristic here?</p>\n",neuroscience synapses
7,9,How should I ship plasmids?,"<p>I shipped 10 µL of my vector miniprep to a collaborator in a 1.5 mL eppendorf parafilmed shut and stuffed into a 50 mL conical with some paper-towel padding. However, something happened on the way and there was nothing (no liquid) in the tube when it arrived. They didn't make any comments about the microcentrifuge tube popping open or broken parafilm, so nothing crazy happened but something did.</p>\n\n<p>What's the most reliable way to ship plasmids?</p>\n",plasmids
8,10,What is the reason behind choosing the reporter gene when experimenting on your gene of interest?,"<p>I noticed within example experiments in class that different reporter genes are chosen to be inserted near your gene of interest to prove whether or not the gene is being expressed. For example, you may insert the gene for fluorescence next to your gene of interest so you know if it is transcribed or not by whether the organism's cells are fluorescent and to what degree they are fluorescing at.</p>\n\n<p>I have noticed in some experiments that have multiple versions that in one case they use the fluorescent gene and in the next a different gene (for example lactose). Both portions of the experiment use almost the exact same steps so why would they not choose the same reporter gene?</p>\n",molecular-genetics gene-expression experimental-design
9,11,How many times did endosymbiosis occur?,"<p>According to the endosymbiont theory, mitochondria and chloroplasts originated as bacteria which were engulfed by larger cells. How many times is it estimated that this occurred in the past? Are there any examples of this process being observed directly?</p>\n",evolution mitochondria chloroplasts


### Removing HTML tags, newlines, and punctuation from the 'content' column

In [47]:
comments = list(dataframe_1['content'])
comments_clean = []

for comment in comments:
    # strip HTML tags
    x = BeautifulSoup(comment, 'html5lib').get_text()
    # replace newlines and punctuation marks with spaces
    for rep in ['\n','.','?','!',',',';',':',"'"]:
        x = x.replace(rep, ' ')
    comments_clean.append(x)
    
comments_clean[:5]

KeyboardInterrupt: 
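
The cleaning step can also be wrapped in a single function. The sketch below is only an illustration: `clean_text`, `TAG_RE`, and `PUNCT_TABLE` are hypothetical names, and the regex-based tag stripping is a stand-in for BeautifulSoup, not what the notebook does. It uses only the standard library and is less robust on malformed HTML.

```python
import html
import re

# Hypothetical helper names; the regex is a stand-in for BeautifulSoup.
TAG_RE = re.compile(r"<[^>]+>")
# Same characters the cell above replaces with spaces.
PUNCT_TABLE = str.maketrans({ch: " " for ch in "\n.?!,;:'"})


def clean_text(raw):
    """Strip HTML tags, decode entities, and blank out punctuation."""
    no_tags = TAG_RE.sub(" ", raw)      # each tag becomes a space
    decoded = html.unescape(no_tags)    # e.g. &amp; becomes &
    return decoded.translate(PUNCT_TABLE)


sample = "<p>Does anyone have any suggestions?</p>\n\n<p>I tend to have issues.</p>\n"
print(clean_text(sample).split())
```

Unlike `get_text()`, this version replaces each tag with a space, so words separated only by a tag are never glued together.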

### Removing stop words

In [None]:
stop = set(stopwords.words('english'))
split_comments = []
for comment in comments_clean:
    x = comment.split()
    split_comments.append(x)

split_comments_2 = []

for comment in split_comments:
    word_list = []
    for word in comment:
        if word.lower() not in stop:
            word_list.append(word)
    split_comments_2.append(word_list)


In [None]:
print(split_comments_2[:5])

In [None]:
print(len(split_comments_2))

In [None]:
dataframe_1.shape
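
The two loops above (split, then filter) can be folded into one pass. This is a minimal sketch, assuming a tiny hand-written stop-word set (`STOP_SAMPLE`) so that it runs without the NLTK corpus; the notebook itself uses `stopwords.words('english')`. The function name `remove_stop_words` is hypothetical.

```python
# Tiny stand-in stop-word set so the sketch runs without NLTK data;
# the notebook uses stopwords.words('english') instead.
STOP_SAMPLE = {"how", "is", "in", "the", "of", "to", "what"}


def remove_stop_words(texts, stop_words):
    """Split each text on whitespace and drop stop words, case-insensitively."""
    return [
        [word for word in text.split() if word.lower() not in stop_words]
        for text in texts
    ]


titles_sample = ["How is RNAse contamination in RNA based experiments prevented "]
print(remove_stop_words(titles_sample, STOP_SAMPLE))
```

The same function could be reused for both the 'content' and 'title' columns, which would avoid repeating the loop in later cells.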

### Working with the 'title' column

In [8]:
titles = list(dataframe_1['title'])
titles_clean = []

for title in titles:
    # replace newlines and punctuation marks with spaces
    for rep in ['\n','.','?','!',',',';',':',"'"]:
        title = title.replace(rep, ' ')
    titles_clean.append(title)
    
titles_clean[:10]

['What is the criticality of the ribosome binding site relative to the start codon in prokaryotic translation ',
 'How is RNAse contamination in RNA based experiments prevented ',
 'Are lymphocyte sizes clustered in two groups ',
 'How long does antibiotic-dosed LB maintain good selection ',
 'Is exon order always preserved in splicing ',
 'How can I avoid digesting protein-bound DNA ',
 'Under what conditions do dendritic spines form ',
 'How should I ship plasmids ',
 'What is the reason behind choosing the reporter gene when experimenting on your gene of interest ',
 'How many times did endosymbiosis occur ']

In [9]:
split_titles = []
for title in titles_clean:
    x = title.split()
    split_titles.append(x)

split_titles_2 = []

for title in split_titles:
    word_list = []
    for word in title:
        if word.lower() not in stop:
            word_list.append(word)
    split_titles_2.append(word_list)
    
print(split_titles_2[:6])

[['criticality', 'ribosome', 'binding', 'site', 'relative', 'start', 'codon', 'prokaryotic', 'translation'], ['RNAse', 'contamination', 'RNA', 'based', 'experiments', 'prevented'], ['lymphocyte', 'sizes', 'clustered', 'two', 'groups'], ['long', 'antibiotic-dosed', 'LB', 'maintain', 'good', 'selection'], ['exon', 'order', 'always', 'preserved', 'splicing'], ['avoid', 'digesting', 'protein-bound', 'DNA']]


In [10]:
len(split_titles_2)

87000

### First tags
#### 1 - Words that appear in both title and content

In [11]:
tags_list = []
for i in range(len(split_titles_2)):
    
    title_tags = []
    for ii in range(len(split_titles_2[i])):
        if split_titles_2[i][ii] in split_comments_2[i]:
            title_tags.append(split_titles_2[i][ii])
    tags_list.append(list(set(title_tags)))

tags_list[:10]

[['ribosome',
  'prokaryotic',
  'site',
  'binding',
  'start',
  'relative',
  'translation',
  'codon'],
 ['RNAse', 'contamination', 'RNA'],
 ['sizes', 'two', 'groups', 'clustered'],
 ['LB', 'maintain', 'long', 'selection'],
 ['order', 'splicing'],
 ['DNA', 'protein-bound', 'digesting'],
 ['dendritic', 'spines'],
 ['ship', 'plasmids'],
 ['reporter', 'gene', 'interest'],
 ['times', 'many']]
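
The overlap test in the cell above is case-sensitive and, because it goes through `set`, returns the words in arbitrary order. As a sketch only, the hypothetical helper `shared_words` below does the same title/content overlap but keeps the title order and uses a set for the membership test.

```python
def shared_words(title_words, content_words):
    """Return title words that also occur in the content, in title order."""
    content_set = set(content_words)  # constant-time membership tests
    seen = set()
    shared = []
    for word in title_words:
        # keep each matching word once, the first time it appears
        if word in content_set and word not in seen:
            seen.add(word)
            shared.append(word)
    return shared


title = ['exon', 'order', 'always', 'preserved', 'splicing']
content = ['cases', 'splicing', 'machinery', 'constructs', 'mRNA', 'exons', 'order']
print(shared_words(title, content))
```

Keeping a stable order makes the generated tag strings reproducible between runs, which is convenient when comparing submissions.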

# First submission

In [65]:
dataframe_test = pd.read_csv('dados/test.csv')
dataframe_test.head(10)

Unnamed: 0,id,title,content
0,1,What is spin as it relates to subatomic particles?,"<p>I often hear about subatomic particles having a property called ""spin"" but also that it doesn't actually relate to spinning about an axis like you would think. Which particles have spin? What does spin mean if not an actual spinning motion?</p>\n"
1,2,What is your simplest explanation of the string theory?,<p>How would you explain string theory to non physicists such as myself? I'm specially interested on how plausible is it and what is needed to successfully prove it?</p>\n
2,3,"Lie theory, Representations and particle physics","<p>This is a question that has been posted at many different forums, I thought maybe someone here would have a better or more conceptual answer than I have seen before:</p>\n\n<p>Why do physicists care about representations of Lie groups? For myself, when I think about a representation that means there is some sort of group acting on a vector space, what is the vector space that this Lie group is acting on? </p>\n\n<p>Or is it that certain things have to be invariant under a group action?\nmaybe this is a dumb question, but i thought it might be a good start...</p>\n\n<p>To clarify, I am specifically thinking of the symmetry groups that people think about in relation to the standard model. I do not care why it might be a certain group, but more how we see the group acting, what is it a..."
3,7,Will Determinism be ever possible?,<p>What are the main problems that we need to solve to prove Laplace's determinism correct and overcome the Uncertainty principle?</p>\n
4,9,Hamilton's Principle,"<p>Hamilton's principle states that a dynamic system always follows a path such that its action integral is stationary (that is, maximum or minimum).</p>\n\n<p>Why should the action integral be stationary? On what basis did Hamilton state this principle?</p>\n"
5,13,What is sound and how is it produced?,"<p>I've been using the term ""sound"" all my life, but I really have no clue as to what sound exactly is or how it is created. What is sound? How is it produced? Can it be measured?</p>\n"
6,15,What experiment would disprove string theory?,"<p>I know that there's big controversy between two groups of physicists:</p>\n\n<ol>\n<li>those who support string theory (most of them, I think)</li>\n<li>and those who oppose it.</li>\n</ol>\n\n<p>One of the arguments of the second group is that there's no way to disprove the correctness of the string theory.</p>\n\n<p>So my question is if there's any defined experiment that would disprove string theory? </p>\n"
7,17,"Why does the sky change color? Why the sky is blue during the day, red during sunrise/set and black during the night?","<p>Why does the sky change color? Why the sky is blue during the day, red during sunrise/set and black during the night?</p>\n"
8,19,How's the energy of particle collisions calculated?,"<p>Physicists often refer to the energy of collisions between different particles. My question is: how is that energy calculated? Is that kinetic energy?</p>\n\n<p>Also, related to this question, I know that the aim is to have higher and higher energy collisions (e.g to test for Higgs Boson). My understanding is that to have higher energy you can either accelerate them more, or use particles with higher mass. Is this correct?</p>\n"
9,21,Monte Carlo use,<p>Where is the Monte Carlo method used in physics?</p>\n


In [66]:
comments = list(dataframe_test['content'])
comments_clean = []

for comment in comments:
    # strip HTML tags
    x = BeautifulSoup(comment, 'html5lib').get_text()
    # replace newlines and punctuation marks with spaces
    for rep in ['\n','.','?','!',',',';',':',"'",'"',"(",")","/"]:
        x = x.replace(rep, ' ')
    comments_clean.append(x)
    
comments_clean[:5]

['I often hear about subatomic particles having a property called  spin  but also that it doesn t actually relate to spinning about an axis like you would think  Which particles have spin  What does spin mean if not an actual spinning motion  ',
 'How would you explain string theory to non physicists such as myself  I m specially interested on how plausible is it and what is needed to successfully prove it  ',
 'This is a question that has been posted at many different forums  I thought maybe someone here would have a better or more conceptual answer than I have seen before   Why do physicists care about representations of Lie groups  For myself  when I think about a representation that means there is some sort of group acting on a vector space  what is the vector space that this Lie group is acting on    Or is it that certain things have to be invariant under a group action  maybe this is a dumb question  but i thought it might be a good start     To clarify  I am specifically thinkin

In [67]:
stop = set(stopwords.words('english'))
split_comments = []
for comment in comments_clean:
    x = comment.split()
    split_comments.append(x)

split_comments_2 = []

for comment in split_comments:
    word_list = []
    for word in comment:
        if word.lower() not in stop:
            word_list.append(word)
    split_comments_2.append(word_list)

In [68]:
print(split_comments_2[:5])

[['often', 'hear', 'subatomic', 'particles', 'property', 'called', 'spin', 'also', 'actually', 'relate', 'spinning', 'axis', 'like', 'would', 'think', 'particles', 'spin', 'spin', 'mean', 'actual', 'spinning', 'motion'], ['would', 'explain', 'string', 'theory', 'non', 'physicists', 'specially', 'interested', 'plausible', 'needed', 'successfully', 'prove'], ['question', 'posted', 'many', 'different', 'forums', 'thought', 'maybe', 'someone', 'would', 'better', 'conceptual', 'answer', 'seen', 'physicists', 'care', 'representations', 'Lie', 'groups', 'think', 'representation', 'means', 'sort', 'group', 'acting', 'vector', 'space', 'vector', 'space', 'Lie', 'group', 'acting', 'certain', 'things', 'invariant', 'group', 'action', 'maybe', 'dumb', 'question', 'thought', 'might', 'good', 'start', 'clarify', 'specifically', 'thinking', 'symmetry', 'groups', 'people', 'think', 'relation', 'standard', 'model', 'care', 'might', 'certain', 'group', 'see', 'group', 'acting', 'acting', 'etc'], ['main'

In [69]:
titles = list(dataframe_test['title'])
titles_clean = []

for title in titles:
    # replace newlines and punctuation marks with spaces
    for rep in ['\n','.','?','!',',',';',':',"'",'"',"(",")","/"]:
        title = title.replace(rep, ' ')
    titles_clean.append(title)
    
titles_clean[:10]

['What is spin as it relates to subatomic particles ',
 'What is your simplest explanation of the string theory ',
 'Lie theory  Representations and particle physics',
 'Will Determinism be ever possible ',
 'Hamilton s Principle',
 'What is sound and how is it produced ',
 'What experiment would disprove string theory ',
 'Why does the sky change color  Why the sky is blue during the day  red during sunrise set and black during the night ',
 'How s the energy of particle collisions calculated ',
 'Monte Carlo use']

In [70]:
split_titles = []
for title in titles_clean:
    x = title.split()
    split_titles.append(x)

split_titles_2 = []

for title in split_titles:
    word_list = []
    for word in title:
        if word.lower() not in stop:
            word_list.append(word)
    split_titles_2.append(word_list)
    
print(split_titles_2[:6])

[['spin', 'relates', 'subatomic', 'particles'], ['simplest', 'explanation', 'string', 'theory'], ['Lie', 'theory', 'Representations', 'particle', 'physics'], ['Determinism', 'ever', 'possible'], ['Hamilton', 'Principle'], ['sound', 'produced']]


In [71]:
tags_list = []
for i in range(len(split_titles_2)):
    
    title_tags = []
    for ii in range(len(split_titles_2[i])):
        if split_titles_2[i][ii] in split_comments_2[i]:
            title_tags.append(split_titles_2[i][ii].lower())
    tags_list.append(list(set(title_tags)))

tags_list[:10]

[['subatomic', 'spin', 'particles'],
 ['string', 'theory'],
 ['lie'],
 [],
 ['hamilton'],
 ['produced', 'sound'],
 ['theory', 'string', 'would', 'experiment', 'disprove'],
 ['change',
  'sunrise',
  'red',
  'set',
  'night',
  'color',
  'black',
  'sky',
  'day',
  'blue'],
 ['collisions', 'calculated', 'energy'],
 ['monte', 'carlo']]

In [72]:
submit = []
for i in range(len(tags_list)):
    submit.append([' '.join(tags_list[i])])
print(submit[:10])
len(submit)

[['subatomic spin particles'], ['string theory'], ['lie'], [''], ['hamilton'], ['produced sound'], ['theory string would experiment disprove'], ['change sunrise red set night color black sky day blue'], ['collisions calculated energy'], ['monte carlo']]


81926

In [27]:
# Write the submission CSV file
df_tags = pd.DataFrame(submit, columns=['tags'])
submit_df = pd.DataFrame(columns=['id','tags'])
submit_df['id'] = dataframe_test['id']
submit_df['tags'] = df_tags['tags']
# csv.QUOTE_ALL (value 1) wraps every field in quotes
submit_df.to_csv('submit_1.csv', index = False, quoting=csv.QUOTE_ALL)


Unnamed: 0,tags
0,subatomic spin particles
1,string theory
2,lie
3,
4,hamilton
5,produced sound
6,theory string would experiment disprove
7,change red night color black sky day blue sunr...
8,collisions calculated energy
9,monte carlo


### Adding the most frequent word in title and content

In [38]:
ds_titles = pd.Series(split_titles_2)
ds_contents = pd.Series(split_comments_2)
df_treated = pd.DataFrame()
df_treated['title'] = ds_titles
df_treated['content'] = ds_contents
#df_treated.mode()

TypeError: wrapper2() missing 1 required positional argument: 'pat'
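
The cell above ends in an error, so the most-frequent-word idea is never carried out. One possible way to do it, which may not be what was originally intended, is to count words per question with `collections.Counter`. The helper name `most_frequent_word` is hypothetical.

```python
from collections import Counter


def most_frequent_word(title_words, content_words):
    """Return the most common lowercased word across title and content, or None."""
    counts = Counter(word.lower() for word in title_words + content_words)
    if not counts:
        return None  # both lists were empty
    return counts.most_common(1)[0][0]


title = ['spin', 'relates', 'subatomic', 'particles']
content = ['often', 'hear', 'subatomic', 'particles', 'spin', 'spin', 'spinning']
print(most_frequent_word(title, content))
```

The result could then be appended to each row's tag list before the submission strings are built.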

### Searching for known training tags in the test.csv titles

In [60]:
test_tags_list = []
for row in list(dataframe_1['tags']):
    for word in row.split():
        test_tags_list.append(word)
test_tags_list = list(set(test_tags_list))
test_tags_list[:10]

['air-travel',
 'junction',
 'species-identification',
 'spinach',
 'swiss',
 'filipino-citizens',
 'grand-teton',
 'uae',
 'kitchen-safety',
 'surge-suppression']

In [91]:
# A set lookup is much faster than scanning the whole tag list for every word
known_tags = set(test_tags_list)
tags_list_2 = []
for title in split_titles_2:
    word_list_2 = []
    for word in title:
        # a single title word that is itself a known tag
        if word.lower() in known_tags:
            word_list_2.append(word.lower())
        # a hyphenated pair of title words, e.g. 'tap' + 'water' -> 'tap-water'
        for word_2 in title:
            pair = word.lower() + '-' + word_2.lower()
            if pair in known_tags:
                word_list_2.append(pair)
    tags_list_2.append(list(set(word_list_2)))

tags_list_2[:100]

[[],
 ['theory'],
 ['theory'],
 [],
 [],
 [],
 ['theory', 'experiment'],
 ['color'],
 ['energy'],
 [],
 [],
 [],
 ['measurement'],
 [],
 ['theory'],
 ['sink', 'bathtub'],
 ['energy'],
 [],
 [],
 [],
 [],
 [],
 ['materials', 'stress'],
 [],
 ['cancer', 'treatment'],
 [],
 ['materials'],
 [],
 [],
 ['light'],
 [],
 [],
 [],
 [],
 ['laser', 'beam'],
 ['current'],
 ['stability'],
 ['energy'],
 [],
 ['force'],
 [],
 [],
 [],
 [],
 ['tools'],
 [],
 [],
 [],
 ['data', 'reference-request'],
 ['water', 'boiling', 'salt'],
 ['water', 'tap-water', 'temperature'],
 [],
 [],
 ['force'],
 [],
 ['energy'],
 [],
 [],
 [],
 [],
 ['torque'],
 ['mechanism'],
 ['books'],
 ['information'],
 [],
 ['current'],
 ['transformation'],
 [],
 [],
 ['spaghetti'],
 [],
 ['software'],
 ['human', 'gyroscope'],
 ['study'],
 [],
 [],
 [],
 [],
 ['chain'],
 ['angle'],
 ['water', 'vegetation'],
 ['clothing'],
 [],
 [],
 [],
 [],
 ['energy'],
 ['work'],
 ['water'],
 [],
 ['radiation'],
 [],
 ['measurement'],
 ['force'],
 [

In [94]:
print(len(tags_list))
print(len(tags_list_2))

81926
81926
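
The matching rule used for `tags_list_2` can be isolated and tried on a small example. This is a sketch under assumptions: `match_known_tags` is a hypothetical helper, and `known` is a hand-picked subset of tags rather than the full list built from the training data.

```python
def match_known_tags(title_words, known_tags):
    """Return known tags matched by single title words or hyphenated word pairs."""
    words = [word.lower() for word in title_words]
    matches = set()
    for first in words:
        if first in known_tags:
            matches.add(first)
        # try every ordered pair of title words joined with a hyphen
        for second in words:
            candidate = first + "-" + second
            if candidate in known_tags:
                matches.add(candidate)
    return sorted(matches)


# Hand-picked subset of tags, for illustration only.
known = {"water", "tap-water", "temperature", "air-travel"}
print(match_known_tags(["Tap", "water", "temperature"], known))
```

Passing `known_tags` as a set keeps each lookup constant-time, which matters when the check runs for every word pair in roughly 80,000 titles.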


In [117]:
tags_list_3 = []
for i in range(len(tags_list)):
    tags_list_3.append(list(set(tags_list[i] + tags_list_2[i])))

print(len(tags_list_3))
print(tags_list_3[:20])

81926
[['subatomic', 'spin', 'particles'], ['string', 'theory'], ['lie', 'theory'], [], ['hamilton'], ['produced', 'sound'], ['disprove', 'theory', 'would', 'experiment', 'string'], ['sky', 'change', 'sunrise', 'red', 'set', 'night', 'color', 'black', 'day', 'blue'], ['collisions', 'calculated', 'energy'], ['monte', 'carlo'], ['bicycle', 'cause', 'leaning'], ['field', 'electromagnetic'], ['measurement', 'interaction', 'mechanics'], ['calculate', 'average', 'speed'], ['theory', 'relativity'], ['sink', 'bathtub', 'show'], ['magnets', 'repel', 'energy'], ['real', 'check', 'world'], ['field', 'mathematics', 'theories'], ['capacitive']]


In [118]:
submit = []
for i in range(len(tags_list_3)):
    submit.append([' '.join(tags_list_3[i])])
print(submit[:10])
len(submit)

[['subatomic spin particles'], ['string theory'], ['lie theory'], [''], ['hamilton'], ['produced sound'], ['disprove theory would experiment string'], ['sky change sunrise red set night color black day blue'], ['collisions calculated energy'], ['monte carlo']]


81926

In [119]:
# Write the submission CSV file
df_tags = pd.DataFrame(submit, columns=['tags'])
submit_df = pd.DataFrame(columns=['id','tags'])
submit_df['id'] = dataframe_test['id']
submit_df['tags'] = df_tags['tags']
# csv.QUOTE_ALL (value 1) wraps every field in quotes
submit_df.to_csv('submit_2.csv', index = False, quoting=csv.QUOTE_ALL)