We will download the Academia Stack Exchange data dump from the [archive.org site](https://archive.org/details/stackexchange).

In [1]:
!wget https://archive.org/download/stackexchange/academia.stackexchange.com.7z

--2018-07-28 21:02:45--  https://archive.org/download/stackexchange/academia.stackexchange.com.7z
Resolving archive.org (archive.org)... 207.241.224.2
Connecting to archive.org (archive.org)|207.241.224.2|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://ia600107.us.archive.org/27/items/stackexchange/academia.stackexchange.com.7z [following]
--2018-07-28 21:02:46--  https://ia600107.us.archive.org/27/items/stackexchange/academia.stackexchange.com.7z
Resolving ia600107.us.archive.org (ia600107.us.archive.org)... 207.241.227.247
Connecting to ia600107.us.archive.org (ia600107.us.archive.org)|207.241.227.247|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 91709455 (87M) [application/x-7z-compressed]
Saving to: 'academia.stackexchange.com.7z'


2018-07-28 21:05:36 (533 KB/s) - 'academia.stackexchange.com.7z' saved [91709455/91709455]



The download is a 7z archive, so you need 7z installed (the output below comes from p7zip) to extract it. The archive unpacks into several XML files; we are only interested in Posts.xml.

In [2]:
!7z e academia.stackexchange.com.7z


7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=utf8,Utf16=on,HugeFiles=on,64 bits,8 CPUs x64)

Scanning the drive for archives:
  0M Sca        1 file, 91709455 bytes (88 MiB)

Extracting archive: academia.stackexchange.com.7z
--
Path = academia.stackexchange.com.7z
Type = 7z
Physical Size = 91709455
Headers Size = 338
Method = BZip2
Solid = +
Blocks = 5

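Since only Posts.xml is needed, the extraction can be limited to that one member. The sketch below builds the command from Python; the helper names `build_extract_command` and `extract_posts` are hypothetical, and it assumes a `7z` binary on the PATH.

```python
import shutil
import subprocess


def build_extract_command(archive, member="Posts.xml"):
    """Build a 7z command that extracts a single member from the archive."""
    # `e` extracts without recreating directories; naming a member after the
    # archive limits extraction to it; `-y` answers yes to overwrite prompts.
    return ["7z", "e", archive, member, "-y"]


def extract_posts(archive="academia.stackexchange.com.7z"):
    """Run the extraction if 7z is available; return True on success."""
    if shutil.which("7z") is None:
        print("7z not found on PATH; install p7zip first")
        return False
    result = subprocess.run(build_extract_command(archive), check=False)
    return result.returncode == 0
```

Calling `extract_posts()` in the notebook directory should leave just Posts.xml next to the archive.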

We are now going to parse the XML tree and extract the relevant text.

In [3]:
import xml.etree.ElementTree as ET
import re

In [4]:
!head Posts.xml

﻿<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="1" PostTypeId="1" AcceptedAnswerId="180" CreationDate="2012-02-14T20:23:40.127" Score="16" ViewCount="415" Body="&lt;p&gt;As from title. What kind of visa class do I have to apply for, in order to work as an academic in Japan ? &lt;/p&gt;&#xA;" OwnerUserId="5" LastEditorUserId="2700" LastEditDate="2013-10-30T09:14:11.633" LastActivityDate="2013-10-30T09:14:11.633" Title="What kind of Visa is required to work in Academia in Japan?" Tags="&lt;job-search&gt;&lt;visa&gt;&lt;japan&gt;" AnswerCount="1" CommentCount="1" FavoriteCount="1" />
  <row Id="2" PostTypeId="1" AcceptedAnswerId="246" CreationDate="2012-02-14T20:26:22.683" Score="11" ViewCount="725" Body="&lt;p&gt;Which online resources are available for job search at the Ph.D. level in the computational chemistry field?&lt;/p&gt;&#xA;" OwnerUserId="5" LastEditorUserId="15723" LastEditDate="2014-09-18T13:02:01.180" LastActivityDate="2014-09-18T13:02:01.180" Title="As a co

In [5]:
tree = ET.parse('Posts.xml')
root = tree.getroot()

In [6]:
root.tag

'posts'
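`ET.parse` loads the whole file into memory, which is fine for this dump. For larger Stack Exchange sites, a streaming parse may be preferable. The following is a minimal sketch using `ET.iterparse`; the helper name `iter_rows` and the tiny in-memory sample are assumptions made for illustration.

```python
import io
import xml.etree.ElementTree as ET


def iter_rows(source):
    """Yield the attribute dict of each <row> element, one at a time."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "row":
            yield dict(elem.attrib)  # copy the attributes before clearing
            elem.clear()             # drop the element's contents once read


# A two-row stand-in for Posts.xml: a question and an answer.
sample = io.BytesIO(
    b'<posts>'
    b'<row Id="1" Title="First" Tags="&lt;a&gt;&lt;b&gt;" />'
    b'<row Id="2" Body="an answer has no Title or Tags" />'
    b'</posts>'
)
rows = list(iter_rows(sample))
print(rows[0]["Tags"])
```

The parser unescapes `&lt;` and `&gt;`, so the `Tags` attribute arrives in the `<a><b>` form that the regular expression in the next cell expects.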

In [7]:
x = "<publications><journals><open-access>"
re.findall(r'<(.+?)>', x)

['publications', 'journals', 'open-access']

In [8]:
def get_label_text(root):
    """Yield one fastText training line per question: labels first, then the title."""
    for child in root:
        try:
            labels = child.attrib['Tags']
            labels = re.findall(r'<(.+?)>', labels)
            labels = ["__label__" + l for l in labels]  # fastText recognizes labels by the "__label__" prefix
            yield " ".join(labels + [child.attrib['Title']])
        except KeyError:
            # Rows without 'Tags' or 'Title' (answers, for example) are skipped.
            pass

In [9]:
dataset_list = list(get_label_text(root))

In [10]:
dataset_list[:3]

['__label__job-search __label__visa __label__japan What kind of Visa is required to work in Academia in Japan?',
 '__label__phd __label__job-search __label__online-resource __label__chemistry As a computational chemist, which online resources are available for Ph.D. level jobs?',
 '__label__journals __label__bibliometrics __label__impact-factor Where can I find the Impact Factor for a given journal?']
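fastText splits text on whitespace, so `Japan?` and `japan` would count as different tokens in the lines above. A common optional preprocessing step is to lowercase the title and separate punctuation. The sketch below is one way to do that; `normalize_title` is a hypothetical helper and the punctuation set is an arbitrary choice, not part of the original notebook.

```python
import re


def normalize_title(line):
    """Lowercase the text part and put spaces around common punctuation."""
    tokens = line.split()
    labels = [t for t in tokens if t.startswith("__label__")]
    words = " ".join(t for t in tokens if not t.startswith("__label__")).lower()
    # Surround each punctuation mark with spaces so it becomes its own token.
    words = re.sub(r"([.!?,'/()])", r" \1 ", words)
    return " ".join(labels + words.split())


print(normalize_title("__label__visa What kind of Visa in Japan?"))
```

Labels are left untouched, because the model must see them exactly as they will be predicted.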

We want a training set and a validation set. Before splitting, we shuffle the list.

In [11]:
# shuffle the list before splitting
from random import shuffle
shuffle(dataset_list)  # shuffles in place and returns None

In [12]:
dataset_list[:3]

['__label__writing Numbering Introduction and Conclusion?',
 '__label__thesis __label__mathematics Has there ever been a pure mathematics thesis longer than 909 pages?',
 '__label__research __label__education I feel like I forgot my simple mathematical knowledge']
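Because `shuffle` draws from the global random state, the order (and therefore the split) differs on every run. If reproducible results matter, a seeded generator can be used instead. This is a minimal sketch; the helper name `shuffled_copy` and the seed value are arbitrary assumptions.

```python
import random


def shuffled_copy(items, seed=42):
    """Return a shuffled copy of items; the same seed gives the same order."""
    rng = random.Random(seed)  # private generator, independent of global state
    copy = list(items)
    rng.shuffle(copy)
    return copy


first = shuffled_copy(range(10))
second = shuffled_copy(range(10))
print(first == second)
```

Unlike `shuffle(dataset_list)`, this leaves the original list untouched.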

In [13]:
dataset_len = len(dataset_list)
train_len = int(0.8 * dataset_len)
training_data = dataset_list[:train_len]
validation_data = dataset_list[train_len:]

In [14]:
len(training_data)

19099
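The split is a plain slice at 80% of the list. As a check on the arithmetic, the sketch below wraps it in a hypothetical helper, `split_dataset`, and applies it to a stand-in list of 23874 items, the size implied by the 19099 training lines here and the 4775 validation examples reported by the evaluation later on.

```python
def split_dataset(items, train_fraction=0.8):
    """Split items into (train, validation) at the given fraction."""
    cut = int(train_fraction * len(items))  # int() truncates toward zero
    return items[:cut], items[cut:]


train, val = split_dataset(list(range(23874)))
print(len(train), len(val))
```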

Save the lists to a training file and a validation file so that fasttext can process them.

In [15]:
with open("posts.train", 'w', encoding='utf-8') as fw:
    fw.writelines(t + '\n' for t in training_data)
with open("posts.val", 'w', encoding='utf-8') as fw:
    fw.writelines(t + '\n' for t in validation_data)
print("done")

done
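Before training, it can be worth checking that every line in the written files has at least one `__label__` token and some text. The sketch below does that for a small temporary demo file; `find_malformed_lines` is a hypothetical helper, not part of fastText.

```python
import os
import tempfile


def find_malformed_lines(path):
    """Return (line_number, line) pairs that lack a label or any text."""
    bad = []
    with open(path, encoding="utf-8") as fh:
        for number, line in enumerate(fh, start=1):
            tokens = line.split()
            has_label = any(t.startswith("__label__") for t in tokens)
            has_text = any(not t.startswith("__label__") for t in tokens)
            if not (has_label and has_text):
                bad.append((number, line.rstrip("\n")))
    return bad


# Demo on a throwaway file: the second line has no label.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.train")
    with open(demo_path, "w", encoding="utf-8") as fh:
        fh.write("__label__visa What visa do I need?\n")
        fh.write("A title with no label\n")
    problems = find_malformed_lines(demo_path)
print(problems)
```

In the notebook, the same function could be pointed at `posts.train` and `posts.val`.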


Now we are ready to do the actual training.

In [16]:
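# Newer releases of the Python package use the lowercase module name:
# from fasttext import train_supervised
from fastText import train_supervised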

In [17]:
train_data = 'posts.train'
valid_data = 'posts.val'

In [18]:
model = train_supervised(input=train_data, epoch=25, lr=1.0, wordNgrams=2, verbose=2, minCount=1)
print('done')

done
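The `wordNgrams=2` argument asks fastText to use word bigrams as features in addition to single words. fastText builds and hashes these internally; the sketch below only illustrates which word pairs a title contributes, using a hypothetical helper `word_ngrams`.

```python
def word_ngrams(text, n=2):
    """Return the contiguous word n-grams of a whitespace-tokenized text."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


print(word_ngrams("impact factor for a journal"))
```

Bigrams such as "impact factor" carry meaning that the two words lose when counted separately, which is a plausible reason to enable them for short titles.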


In [19]:
def print_results(N, p, r, k=1):
    # model.test() evaluates at k=1 unless a different k is passed to it
    print("N\t" + str(N))
    print("P@{}\t{:.3f}".format(k, p))
    print("R@{}\t{:.3f}".format(k, r))

In [20]:
print_results(*model.test(valid_data))

N	4775
P@1	0.571
R@1	0.216
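To make the two metrics concrete: precision at k is the share of the top-k predicted labels that are correct, and recall at k is the share of the true labels that appear among the top-k predictions, both pooled over all examples. The sketch below computes them by hand on made-up data; `precision_recall_at_k` is a hypothetical helper written for illustration, not fastText's own code.

```python
def precision_recall_at_k(true_labels, ranked_predictions, k=1):
    """Compute pooled precision@k and recall@k over all examples."""
    n_correct = n_predicted = n_true = 0
    for truth, ranked in zip(true_labels, ranked_predictions):
        top_k = ranked[:k]
        n_correct += len(set(top_k) & set(truth))
        n_predicted += len(top_k)
        n_true += len(truth)
    return n_correct / n_predicted, n_correct / n_true


# Two made-up posts: one with two tags, one with a single tag.
truth = [{"phd", "job-search"}, {"visa"}]
preds = [["phd", "research"], ["japan", "visa"]]
print(precision_recall_at_k(truth, preds, k=1))
print(precision_recall_at_k(truth, preds, k=2))
```

Posts usually carry several tags, so a model that returns few labels per post cannot reach high recall even when most of its predictions are right. That is consistent with recall being well below precision in the results above.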


In [21]:
model.predict("I love datascience.", k=3)

(('__label__phd', '__label__academic-life', '__label__research'),
 array([0.34132218, 0.1827462 , 0.1314908 ]))
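`predict` returns a pair: a tuple of labels and an array of matching probabilities. For display, it can help to strip the `__label__` prefix and zip the two together. The following is a small sketch with a hypothetical helper, `format_predictions`, applied to the values shown above (a plain list stands in for the NumPy array).

```python
def format_predictions(prediction, prefix="__label__"):
    """Turn (labels, probabilities) into a list of (tag, probability) pairs."""
    labels, probabilities = prediction
    return [
        (label[len(prefix):] if label.startswith(prefix) else label, float(p))
        for label, p in zip(labels, probabilities)
    ]


example = (
    ("__label__phd", "__label__academic-life", "__label__research"),
    [0.34132218, 0.1827462, 0.1314908],
)
for tag, prob in format_predictions(example):
    print("{}\t{:.3f}".format(tag, prob))
```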