# Transfer Learning on Stack Exchange Tags
## Kaggle competition
https://www.kaggle.com/c/transfer-learning-on-stack-exchange-tags

In [1]:
import pandas as pd
from bs4 import BeautifulSoup
from nltk.corpus import stopwords

In [49]:
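data_list = ['biology', 'cooking', 'crypto', 'diy', 'robotics', 'travel']
pd.set_option('display.max_colwidth', 800)

# Read one CSV per Stack Exchange site and stack them into a single frame.
# DataFrame.append was removed in pandas 2.0, so collect the frames and concatenate once.
frames = []
for theme in data_list:
    path = "dados/" + theme + ".csv"
    frames.append(pd.read_csv(path))
dataframe_1 = pd.concat(frames)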

dataframe_1.head(10)

Unnamed: 0,id,title,content,tags
0,1,What is the criticality of the ribosome binding site relative to the start codon in prokaryotic translation?,"<p>In prokaryotic translation, how critical for efficient translation is the location of the ribosome binding site, relative to the start codon?</p>\n\n<p>Ideally, it is supposed to be -7b away from the start. How about if it is -9 bases away or even more? Will this have an observable effect on translation?</p>\n",ribosome binding-sites translation synthetic-biology
1,2,How is RNAse contamination in RNA based experiments prevented?,<p>Does anyone have any suggestions to prevent RNAse contamination when working with RNA?</p>\n\n<p>I tend to have issues with degradation regardless of whether I use DEPC treated / RNAse free water and filtered pipette tips.</p>\n,rna biochemistry
2,3,Are lymphocyte sizes clustered in two groups?,"<p>Tortora writes in <em>Principles of Anatomy and Physiology</em>:</p>\n\n<blockquote>\n <p>Lymphocytes may be as small as 6–9 μm in diameter or as large as 10–14 μm in diameter.</p>\n</blockquote>\n\n<p>Those ranges are quite close to each others. Should the above be taken to mean that lymphocytes sizes are clustered in two groups, or is it just a way of saying that lymphocytes are 6-14 μm?</p>\n",immunology cell-biology hematology
3,4,How long does antibiotic-dosed LB maintain good selection?,"<p>Various people in our lab will prepare a liter or so of LB, add kanamycin to 25-37 mg/L for selection, and store it at 4 °C for minipreps or other small cultures (where dosing straight LB with a 1000X stock is troublesome). Some think using it after more than a week is dubious, but we routinely use kan plates that are 1-2 months old with no ill effect.</p>\n\n<p>How long can LB with antibiotic such as kanamycin, chloramphenicol, or ampicillin be stored at 4 °C and maintain selection?</p>\n",cell-culture
4,5,Is exon order always preserved in splicing?,"<p>Are there any cases in which the splicing machinery constructs an mRNA in which the exons are not in the 5' -> 3' genomic order? I'm interested any such cases, whether they involve constitutive or alternative splicing.</p>\n",splicing mrna spliceosome introns exons
5,6,How can I avoid digesting protein-bound DNA?,"<p>I'm interested in sequencing and analyzing the bound DNA, and minimizing the amount of unbound DNA that gets sequenced through digestion.</p>\n\n<p>When digesting protein-bound DNA, is <em>all</em> of the unbound DNA digested? Is there a way to maximize the amount of unbound DNA that is digested?</p>\n",dna biochemistry molecular-biology
6,8,Under what conditions do dendritic spines form?,"<p>I'm looking for resources or any information about the formation of dendritic spines and synaptogenesis, especially in relation to how new connections are formed on a daily basis.</p>\n\n<p>Does the electrotonic signalling along the axons and through the spines cause new connections to be made based on some kind of spatial condition (maybe an electrical or chemical attraction), or is there some larger heuristic here?</p>\n",neuroscience synapses
7,9,How should I ship plasmids?,"<p>I shipped 10 µL of my vector miniprep to a collaborator in a 1.5 mL eppendorf parafilmed shut and stuffed into a 50 mL conical with some paper-towel padding. However, something happened on the way and there was nothing (no liquid) in the tube when it arrived. They didn't make any comments about the microcentrifuge tube popping open or broken parafilm, so nothing crazy happened but something did.</p>\n\n<p>What's the most reliable way to ship plasmids?</p>\n",plasmids
8,10,What is the reason behind choosing the reporter gene when experimenting on your gene of interest?,"<p>I noticed within example experiments in class that different reporter genes are chosen to be inserted near your gene of interest to prove whether or not the gene is being expressed. For example, you may insert the gene for fluorescence next to your gene of interest so you know if it is transcribed or not by whether the organism's cells are fluorescent and to what degree they are fluorescing at.</p>\n\n<p>I have noticed in some experiments that have multiple versions that in one case they use the fluorescent gene and in the next a different gene (for example lactose). Both portions of the experiment use almost the exact same steps so why would they not choose the same reporter gene?</p>\n",molecular-genetics gene-expression experimental-design
9,11,How many times did endosymbiosis occur?,"<p>According to the endosymbiont theory, mitochondria and chloroplasts originated as bacteria which were engulfed by larger cells. How many times is it estimated that this occurred in the past? Are there any examples of this process being observed directly?</p>\n",evolution mitochondria chloroplasts


### Removing HTML tags and '\n' from the 'content' column

In [9]:
comments = list(dataframe_1['content'])
comments_clean = []

for comment in comments:
    # strip HTML tags
    x = BeautifulSoup(comment, 'html5lib').get_text()
    # replace newlines and punctuation marks with spaces
    for rep in ['\n','.','?','!',',',';',':',"'"]:
        x = x.replace(rep, ' ')
    comments_clean.append(x)
    
comments_clean[:5]

['In prokaryotic translation  how critical for efficient translation is the location of the ribosome binding site  relative to the start codon   Ideally  it is supposed to be -7b away from the start  How about if it is -9 bases away or even more  Will this have an observable effect on translation  ',
 'Does anyone have any suggestions to prevent RNAse contamination when working with RNA   I tend to have issues with degradation regardless of whether I use DEPC treated / RNAse free water and filtered pipette tips  ',
 'Tortora writes in Principles of Anatomy and Physiology      Lymphocytes may be as small as 6–9 μm in diameter or as large as 10–14 μm in diameter    Those ranges are quite close to each others  Should the above be taken to mean that lymphocytes sizes are clustered in two groups  or is it just a way of saying that lymphocytes are 6-14 μm  ',
 'Various people in our lab will prepare a liter or so of LB  add kanamycin to 25-37 mg/L for selection  and store it at 4 °C for mini
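
The character-by-character replacement loop above can also be expressed as a single translation table. This is a minimal sketch using only the standard library; the names `PUNCTUATION`, `TABLE`, and `replace_punctuation` are hypothetical and do not appear in the notebook. It assumes the HTML tags have already been stripped.

```python
# Characters the notebook replaces with a space (same set as the loop above).
PUNCTUATION = "\n.?!,;:'"

# Map every listed character to a single space in one pass.
TABLE = str.maketrans({ch: " " for ch in PUNCTUATION})


def replace_punctuation(text):
    """Replace each character in PUNCTUATION with a space."""
    return text.translate(TABLE)


sample = "Is exon order preserved?\nI'm interested, really."
print(replace_punctuation(sample))
```

Like the original loop, this splits contractions such as `I'm` into `I m`, so the stray single letters still need to be handled by the stop-word step.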

### Removing stop words

In [10]:
stop = set(stopwords.words('english'))
split_comments = []
for comment in comments_clean:
    x = comment.split()
    split_comments.append(x)

split_comments_2 = []

for comment in split_comments:
    word_list = []
    for word in comment:
        if word.lower() not in stop:
            word_list.append(word)
    split_comments_2.append(word_list)


In [11]:
print(split_comments_2[:5])

[['prokaryotic', 'translation', 'critical', 'efficient', 'translation', 'location', 'ribosome', 'binding', 'site', 'relative', 'start', 'codon', 'Ideally', 'supposed', '-7b', 'away', 'start', '-9', 'bases', 'away', 'even', 'observable', 'effect', 'translation'], ['anyone', 'suggestions', 'prevent', 'RNAse', 'contamination', 'working', 'RNA', 'tend', 'issues', 'degradation', 'regardless', 'whether', 'use', 'DEPC', 'treated', '/', 'RNAse', 'free', 'water', 'filtered', 'pipette', 'tips'], ['Tortora', 'writes', 'Principles', 'Anatomy', 'Physiology', 'Lymphocytes', 'may', 'small', '6–9', 'μm', 'diameter', 'large', '10–14', 'μm', 'diameter', 'ranges', 'quite', 'close', 'others', 'taken', 'mean', 'lymphocytes', 'sizes', 'clustered', 'two', 'groups', 'way', 'saying', 'lymphocytes', '6-14', 'μm'], ['Various', 'people', 'lab', 'prepare', 'liter', 'LB', 'add', 'kanamycin', '25-37', 'mg/L', 'selection', 'store', '4', '°C', 'minipreps', 'small', 'cultures', '(where', 'dosing', 'straight', 'LB', '10

In [12]:
print(len(split_comments_2))

87000


In [13]:
dataframe_1.shape

(87000, 4)
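
The split-and-filter loops in this section can be condensed into one helper. The sketch below is an illustration only: `STOP` is a small hypothetical stand-in for `stopwords.words('english')`, and `remove_stop_words` is not a name used in the notebook.

```python
# Hypothetical stand-in for the NLTK English stop-word list used in the notebook.
STOP = {"is", "the", "of", "in", "to", "how", "what", "i", "a"}


def remove_stop_words(text, stop=STOP):
    """Split on whitespace and keep words whose lowercase form is not a stop word."""
    return [word for word in text.split() if word.lower() not in stop]


tokens = remove_stop_words(
    "How is RNAse contamination in RNA based experiments prevented "
)
print(tokens)
```

The lookup is done on the lowercased word, but the original casing is kept in the result, which matches how `RNAse` and `RNA` survive in the output above.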

### Working with the 'title' column

In [30]:
titles = list(dataframe_1['title'])
titles_clean = []

for title in titles:
    # replace newlines and punctuation marks with spaces
    for rep in ['\n','.','?','!',',',';',':',"'"]:
        title = title.replace(rep, ' ')
    titles_clean.append(title)
    
titles_clean[:10]

['What is the criticality of the ribosome binding site relative to the start codon in prokaryotic translation ',
 'How is RNAse contamination in RNA based experiments prevented ',
 'Are lymphocyte sizes clustered in two groups ',
 'How long does antibiotic-dosed LB maintain good selection ',
 'Is exon order always preserved in splicing ',
 'How can I avoid digesting protein-bound DNA ',
 'Under what conditions do dendritic spines form ',
 'How should I ship plasmids ',
 'What is the reason behind choosing the reporter gene when experimenting on your gene of interest ',
 'How many times did endosymbiosis occur ']

In [33]:
split_titles = []
for title in titles_clean:
    x = title.split()
    split_titles.append(x)

split_titles_2 = []

for title in split_titles:
    word_list = []
    for word in title:
        if word.lower() not in stop:
            word_list.append(word)
    split_titles_2.append(word_list)
    
print(split_titles_2[:6])

[['criticality', 'ribosome', 'binding', 'site', 'relative', 'start', 'codon', 'prokaryotic', 'translation'], ['RNAse', 'contamination', 'RNA', 'based', 'experiments', 'prevented'], ['lymphocyte', 'sizes', 'clustered', 'two', 'groups'], ['long', 'antibiotic-dosed', 'LB', 'maintain', 'good', 'selection'], ['exon', 'order', 'always', 'preserved', 'splicing'], ['avoid', 'digesting', 'protein-bound', 'DNA']]


In [32]:
len(split_titles_2)

87000

### First tags
#### 1 - Words that appear in both title and content

In [53]:
tags_list = []
for title_words, content_words in zip(split_titles_2, split_comments_2):
    # keep the title words that also appear in the question body (case-sensitive)
    title_tags = [word for word in title_words if word in content_words]
    # set() drops repeated words; the resulting order is arbitrary
    tags_list.append(list(set(title_tags)))

tags_list[:10]

[['binding',
  'relative',
  'ribosome',
  'site',
  'codon',
  'prokaryotic',
  'translation',
  'start'],
 ['RNA', 'contamination', 'RNAse'],
 ['clustered', 'sizes', 'groups', 'two'],
 ['long', 'maintain', 'selection', 'LB'],
 ['order', 'splicing'],
 ['digesting', 'DNA', 'protein-bound'],
 ['dendritic', 'spines'],
 ['ship', 'plasmids'],
 ['gene', 'interest', 'reporter'],
 ['times', 'many']]
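
The comparison above is case-sensitive and returns the words in arbitrary order because of `set()`. One possible variant, sketched here under the assumption that case should be ignored and the title's word order kept, is shown below. The function name `shared_words` and the sample word lists are hypothetical.

```python
def shared_words(title_words, content_words):
    """Return title words that also occur in the content, ignoring case.

    The title's word order is kept and repeated words are returned once.
    """
    content_lower = {word.lower() for word in content_words}
    seen = set()
    shared = []
    for word in title_words:
        key = word.lower()
        if key in content_lower and key not in seen:
            seen.add(key)
            shared.append(word)
    return shared


# Hypothetical word lists, not taken from the dataset.
example = shared_words(["exon", "order", "Splicing", "order"],
                       ["splicing", "order", "exons"])
print(example)
```

Building a set from the content words also makes each membership test constant time, which may matter across 87,000 rows. Note that `exon` still does not match `exons`; handling plurals would need stemming or lemmatization, which this sketch does not attempt.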