# Transfer Learning on Stack Exchange Tags
## Kaggle competition
https://www.kaggle.com/c/transfer-learning-on-stack-exchange-tags

In [150]:
import pandas as pd
from bs4 import BeautifulSoup
from nltk.corpus import stopwords

In [140]:
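data_list = ['biology', 'cooking', 'crypto', 'diy', 'robotics', 'travel']
pd.set_option('display.max_colwidth', 800)

# Read one CSV per Stack Exchange site and stack them into a single frame.
# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead.
frames = []
for theme in data_list:
    path = "dados/" + theme + ".csv"
    frames.append(pd.read_csv(path))
dataframe_1 = pd.concat(frames)

dataframe_1.head()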

Unnamed: 0,id,title,content,tags
0,1,What is the criticality of the ribosome binding site relative to the start codon in prokaryotic translation?,"<p>In prokaryotic translation, how critical for efficient translation is the location of the ribosome binding site, relative to the start codon?</p>\n\n<p>Ideally, it is supposed to be -7b away from the start. How about if it is -9 bases away or even more? Will this have an observable effect on translation?</p>\n",ribosome binding-sites translation synthetic-biology
1,2,How is RNAse contamination in RNA based experiments prevented?,<p>Does anyone have any suggestions to prevent RNAse contamination when working with RNA?</p>\n\n<p>I tend to have issues with degradation regardless of whether I use DEPC treated / RNAse free water and filtered pipette tips.</p>\n,rna biochemistry
2,3,Are lymphocyte sizes clustered in two groups?,"<p>Tortora writes in <em>Principles of Anatomy and Physiology</em>:</p>\n\n<blockquote>\n <p>Lymphocytes may be as small as 6–9 μm in diameter or as large as 10–14 μm in diameter.</p>\n</blockquote>\n\n<p>Those ranges are quite close to each others. Should the above be taken to mean that lymphocytes sizes are clustered in two groups, or is it just a way of saying that lymphocytes are 6-14 μm?</p>\n",immunology cell-biology hematology
3,4,How long does antibiotic-dosed LB maintain good selection?,"<p>Various people in our lab will prepare a liter or so of LB, add kanamycin to 25-37 mg/L for selection, and store it at 4 °C for minipreps or other small cultures (where dosing straight LB with a 1000X stock is troublesome). Some think using it after more than a week is dubious, but we routinely use kan plates that are 1-2 months old with no ill effect.</p>\n\n<p>How long can LB with antibiotic such as kanamycin, chloramphenicol, or ampicillin be stored at 4 °C and maintain selection?</p>\n",cell-culture
4,5,Is exon order always preserved in splicing?,"<p>Are there any cases in which the splicing machinery constructs an mRNA in which the exons are not in the 5' -> 3' genomic order? I'm interested any such cases, whether they involve constitutive or alternative splicing.</p>\n",splicing mrna spliceosome introns exons


### Removing HTML tags, newlines, and basic punctuation from the 'content' column

In [218]:
comments = list(dataframe_1['content'])
comments_clean = []

for comment in comments:
    # Strip HTML tags, keeping only the text content
    x = BeautifulSoup(comment, 'html5lib').get_text()
    # Replace newlines and basic punctuation with spaces
    for rep in ['\n','.','?','!',',']:
        x = x.replace(rep, ' ')
    comments_clean.append(x)

comments_clean[:5]

['In prokaryotic translation  how critical for efficient translation is the location of the ribosome binding site  relative to the start codon   Ideally  it is supposed to be -7b away from the start  How about if it is -9 bases away or even more  Will this have an observable effect on translation  ',
 'Does anyone have any suggestions to prevent RNAse contamination when working with RNA   I tend to have issues with degradation regardless of whether I use DEPC treated / RNAse free water and filtered pipette tips  ',
 'Tortora writes in Principles of Anatomy and Physiology:     Lymphocytes may be as small as 6–9 μm in diameter or as large as 10–14 μm in diameter    Those ranges are quite close to each others  Should the above be taken to mean that lymphocytes sizes are clustered in two groups  or is it just a way of saying that lymphocytes are 6-14 μm  ',
 'Various people in our lab will prepare a liter or so of LB  add kanamycin to 25-37 mg/L for selection  and store it at 4 °C for mini
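
The chained `replace` calls in the cleaning loop can also be written as a single translation table. The following is a minimal sketch, assuming the same five characters should become spaces; `PUNCT_TABLE` and `replace_punctuation` are hypothetical names that do not appear in the original notebook.

```python
# Hypothetical helper: map newline and basic punctuation to a space in one pass.
# The character set mirrors the list used in the cleaning loop: '\n', '.', '?', '!', ','.
PUNCT_TABLE = str.maketrans({ch: ' ' for ch in '\n.?!,'})

def replace_punctuation(text):
    """Return text with each listed character replaced by a single space."""
    return text.translate(PUNCT_TABLE)

sample = "Does anyone have any suggestions?\n\nI tend to have issues, regardless."
print(replace_punctuation(sample))
```

Like the loop, this leaves every other character untouched, so symbols such as `/`, `(`, and `:` remain in the text.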

### Removing stop words

In [221]:
# Requires the NLTK stop-word corpus: nltk.download('stopwords')
stop = set(stopwords.words('english'))

# Tokenize each cleaned comment on whitespace
split_comments = [comment.split() for comment in comments_clean]

# Drop stop words (case-insensitive match), keeping the original casing
split_comments_2 = [
    [word for word in comment if word.lower() not in stop]
    for comment in split_comments
]


In [222]:
print(split_comments_2[:5])

[['prokaryotic', 'translation', 'critical', 'efficient', 'translation', 'location', 'ribosome', 'binding', 'site', 'relative', 'start', 'codon', 'Ideally', 'supposed', '-7b', 'away', 'start', '-9', 'bases', 'away', 'even', 'observable', 'effect', 'translation'], ['anyone', 'suggestions', 'prevent', 'RNAse', 'contamination', 'working', 'RNA', 'tend', 'issues', 'degradation', 'regardless', 'whether', 'use', 'DEPC', 'treated', '/', 'RNAse', 'free', 'water', 'filtered', 'pipette', 'tips'], ['Tortora', 'writes', 'Principles', 'Anatomy', 'Physiology:', 'Lymphocytes', 'may', 'small', '6–9', 'μm', 'diameter', 'large', '10–14', 'μm', 'diameter', 'ranges', 'quite', 'close', 'others', 'taken', 'mean', 'lymphocytes', 'sizes', 'clustered', 'two', 'groups', 'way', 'saying', 'lymphocytes', '6-14', 'μm'], ['Various', 'people', 'lab', 'prepare', 'liter', 'LB', 'add', 'kanamycin', '25-37', 'mg/L', 'selection', 'store', '4', '°C', 'minipreps', 'small', 'cultures', '(where', 'dosing', 'straight', 'LB', '1
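
The stop-word step can be checked in isolation on a single string. This is a minimal sketch, assuming a tiny hand-written set (`SAMPLE_STOP`) in place of the NLTK English list so that it runs without downloading the corpus; `remove_stop_words` is a hypothetical helper name.

```python
# Hypothetical stand-in for set(stopwords.words('english')); the real list is much longer.
SAMPLE_STOP = {"is", "the", "of", "in", "to", "how", "a"}

def remove_stop_words(text, stop_words):
    """Split text on whitespace and drop tokens whose lowercase form is a stop word."""
    return [word for word in text.split() if word.lower() not in stop_words]

tokens = remove_stop_words("How is RNAse contamination prevented in the lab", SAMPLE_STOP)
print(tokens)
```

Because matching is done on the lowercased token, capitalized stop words such as "How" are removed while the remaining tokens keep their original casing. Tokens such as `'/'` and `'(where'` in the output above survive because the cleaning step only replaces five characters.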

In [223]:
print(len(split_comments_2))

87000


In [224]:
dataframe_1.shape

(87000, 4)