## Making Dataset - RNN Topics

The text comes from PDF documents, mostly conference and journal papers.

I use several tools to extract the text from the PDFs. One of them is the **textract** package.

Some manual intervention (editing and deletion) is involved, as the extraction results are not ideal.

In [1]:
import os
import keras
from keras.preprocessing import text
import textract

Using TensorFlow backend.


Below is the folder location where the PDF papers belonging to one category are stored.

For this notebook, the category is 'recurrent neural network'.

In [2]:
#path = 'C:/Users/k/Documents/qiqqa/324534E5-7D3E-4DAC-B4E9-057A96B7AC62'
path = 'D:/Documents/qiqqa/02CE4074-0A28-4C5C-84CE-5505482C9E7A'
pdf = []
for root, dr, fs in os.walk(path):
    for f in fs:
        # keep only files whose name ends in .pdf (case-insensitive)
        if not f.lower().endswith('.pdf'):
            continue
        pdf.append(root + '/' + f)
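
As a side note, the same folder walk can be sketched with `pathlib`. This is only a minimal illustration, not the code used for the dataset: `find_pdfs` is a hypothetical helper, and the demo runs on a throwaway temporary folder rather than on the qiqqa library.

```python
import tempfile
from pathlib import Path


def find_pdfs(folder):
    """Return sorted paths of all files under `folder` whose suffix is .pdf."""
    return sorted(
        str(p) for p in Path(folder).rglob('*')
        if p.is_file() and p.suffix.lower() == '.pdf'
    )


# Demonstrate on a throwaway folder so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as demo_dir:
    root_dir = Path(demo_dir)
    (root_dir / 'sub').mkdir()
    (root_dir / 'a.pdf').write_bytes(b'')
    (root_dir / 'sub' / 'b.PDF').write_bytes(b'')
    (root_dir / 'notes.pdf.txt').write_bytes(b'')  # not a PDF despite '.pdf' in its name
    found_names = [Path(p).name for p in find_pdfs(root_dir)]

print(found_names)
```

Matching on the suffix avoids picking up files that merely contain `.pdf` somewhere in their name.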

Next, we set up a combination of **nltk** stopwords and a RegexpTokenizer to help remove unwanted characters. Keras text preprocessing is also used to split the text into word sequences.

In [3]:
# the snippet code from ref[1] -- see next notebook
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer, sent_tokenize

# Load stop-words
stop_words = set(stopwords.words('english'))

# Initialize tokenizer
# It's also possible to try with a stemmer or to mix a stemmer and a lemmatizer
tokenizer = RegexpTokenizer('[\'a-zA-Z]+')
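
To show what this pattern keeps and drops, here is a small stand-in that uses only the standard library. It is a sketch under assumptions: `re.findall` with the same character class takes the place of `RegexpTokenizer`, and `DEMO_STOP_WORDS` is a made-up mini list, not the nltk English stop-word list.

```python
import re

# Hypothetical mini stop list; the notebook uses stopwords.words('english').
DEMO_STOP_WORDS = {'the', 'is', 'in', 'a', 'of'}

# Same character class as the tokenizer pattern: letters and apostrophes only.
TOKEN_PATTERN = re.compile(r"['a-zA-Z]+")


def tokenize_and_filter(line, stop_words=DEMO_STOP_WORDS):
    """Split a line into letter/apostrophe runs, then drop stop words."""
    tokens = TOKEN_PATTERN.findall(line)
    return [t for t in tokens if t.lower() not in stop_words]


sample = "The network's output is fed back in 2007."
filtered = tokenize_and_filter(sample)
print(filtered)
```

Digits and punctuation never match the pattern, so the year and the final period disappear, while the apostrophe in "network's" is kept.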

In [5]:
min_num_words = 5
label = 'recurrent neural network'
docs = []

In [6]:
# check if there is duplicate
dup = set()

In [7]:
for doc in pdf:
    try:
        p1 = textract.process(doc)
    except Exception:
        # skip this file; otherwise p1 is undefined or left over from the previous PDF
        print('exception..')
        continue
    p2 = str(p1)
    p3 = p2.split('\\r\\n')
    # join all lines into one string, used as the duplicate key
    test = ''.join(p3)
    if test in dup:
        continue
    dup.add(test)
    print(p3[:5])
    print()
    i = 0
    start = False
    txt = []
    for p in p3:
        # w = [t for t in tokenizer.tokenize(p) if t.lower() not in stop_words]
        w = keras.preprocessing.text.text_to_word_sequence(p)
        if 'abstract' not in w and not start:
            continue
        if 'abstract' in w:
            start = True
            continue
        if 'references' in w:
            break
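        # drop tokens starting with 'x': presumably byte-escape debris such as 'xe2'
        # left over from str(bytes); real words starting with 'x' are dropped too
        w1 = [v for v in w if not v.lower().startswith('x')]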
        if len(w1)<min_num_words:
            continue
        for word in w1:
            txt.append(word)
#         print(w1)
        i = i +1
        if i>1000:
            break
    if txt != []:
        docs.append((label,txt))

["b'Simulation Modelling Practice and Theory 15 (2007) 1016\\xe2\\x80\\x931028", 'www.elsevier.com/locate/simpat', '', 'A greenhouse control with feed-forward and recurrent', 'neural networks']

["b'LETTER", '', 'Communicated by Michael Cohen', '', 'Dynamical Behaviors of Delayed Neural Network Systems']

["b'Joint Language and Translation Modeling with Recurrent Neural Networks", 'Michael Auli, Michel Galley, Chris Quirk, Geoffrey Zweig', 'Microsoft Research', 'Redmond, WA, USA', '{michael.auli,mgalley,chrisq,gzweig}@microsoft.com']

["b'University of Massachusetts - Amherst", 'From the SelectedWorks of R. Manmatha', '', '2011', '']

["b'See\\tdiscussions,\\tstats,\\tand\\tauthor\\tprofiles\\tfor\\tthis\\tpublication\\tat:\\thttps://www.researchgate.net/publication/221139120", '', 'Contextual\\tBehaviors\\tand\\tInternal', 'Representations\\tAcquired\\tby\\tReinforcement', 'Learning\\twith\\ta\\tRecurrent\\tNeural\\tNetwork\\tin']

["b'Investigation of Recurrent-Neural-Network Archite

["b'Psychological Review", '2006, Vol. 113, No. 2, 201\\xe2\\x80\\x93233', '', 'Copyright 2006 by the American Psychological Association', '0033-295X/06/$12.00 DOI: 10.1037/0033-295X.113.2.201']

["b'See\\tdiscussions,\\tstats,\\tand\\tauthor\\tprofiles\\tfor\\tthis\\tpublication\\tat:\\thttps://www.researchgate.net/publication/236035223", '', 'Recurrent\\tneural\\tnetwork-based\\tcontrol', 'strategy\\tfor\\tbattery\\tenergy\\tstorage\\tin', 'generation\\tsystems\\twith\\tintermittent']

["b'Proc. of the 15th Int. Conference on Digital Audio Effects (DAFx-12), York, UK , September 17-21, 2012", '', 'ONLINE REAL-TIME ONSET DETECTION WITH RECURRENT NEURAL NETWORKS', 'Sebastian B\\xc3\\xb6ck, Andreas Arzt, Florian Krebs, Markus Schedl', 'Department of Computational Perception']

["b'INTERSPEECH 2014", '', 'Long Short-Term Memory Recurrent Neural Network Architectures', 'for Large Scale Acoustic Modeling', 'Has\\xcc\\xa7im Sak, Andrew Senior, Franc\\xcc\\xa7oise Beaufays']

["b'Not All Con

["b'WCCI 2010 IEEE World Congress on Computational Intelligence", 'July, 18-23, 2010 - CCIB, Barcelona, Spain', '', 'IJCNN', '']

["b'A Multiplicative Model for Learning Distributed", 'Text-Based Attribute Representations', '', 'Ryan Kiros, Richard S. Zemel, Ruslan Salakhutdinov', 'University of Toronto']

["b'Sequence to Sequence Learning", 'with Neural Networks', 'Ilya Sutskever', 'Google', 'ilyasu@google.com']

["b'INTERSPEECH 2011", '', 'Recurrent Neural Network based Language Modeling in Meeting Recognition', 'Stefan Kombrink, Toma\\xcc\\x81s\\xcc\\x8c Mikolov, Martin Karafia\\xcc\\x81t, Luka\\xcc\\x81s\\xcc\\x8c Burget', 'Speech@FIT, Brno University of Technology, Brno, Czech Republic']

["b'Offline Handwriting Recognition with", 'Multidimensional Recurrent Neural Networks', 'Alex Graves', 'TU Munich, Germany', 'graves@in.tum.de']

["b'1092", '', 'IEEE TRANSACTIONS ON FUZZY SYSTEMS, VOL. 17, NO. 5, OCTOBER 2009', '', 'A Recurrent Self-Evolving Interval Type-2 Fuzzy']

["b'IEEE TR

In [8]:
print(len(docs))

104
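
A note on the lines printed by the extraction loop: the leading `b'` and the `\\xe2\\x80\\x93`-style sequences appear because `str()` is applied to a bytes object. The sketch below shows an alternative, assuming the extracted bytes are UTF-8 encoded (that is an assumption, not something verified for every PDF here). The sample bytes are modeled on the printed output, not taken from a real extraction run.

```python
# Bytes shaped like the extraction output; modeled on the printed lines above.
raw = b'Vol. 15 (2007) 1016\xe2\x80\x931028\r\nSebastian B\xc3\xb6ck\r\n'

# str() on bytes returns the repr: a leading b' and escaped non-ASCII bytes.
as_repr = str(raw)

# Decoding keeps the real characters (UTF-8 assumed; bad bytes are replaced).
as_text = raw.decode('utf-8', errors='replace')
lines = as_text.splitlines()

print(as_repr[:10])
print(lines)
```

With decoded text, the en dash and the accented letters survive as ordinary characters, so a filter for escape debris would no longer be needed.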


Save the texts and their category label as a **pkl** file.

In [9]:
import pickle

In [10]:
# use a context manager so the file is flushed and closed
with open('rnn.pkl', 'wb') as fh:
    pickle.dump(docs, fh)
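
To check that a saved file can be read back, a round trip like the following can help. It is a minimal sketch: `demo_docs` is toy data in the same `(label, word_list)` shape as `docs`, and `demo.pkl` is a hypothetical file name written to a temporary folder so that `rnn.pkl` is left untouched.

```python
import os
import pickle
import tempfile

# Toy stand-in for `docs`: a list of (label, word_list) tuples.
demo_docs = [
    ('recurrent neural network', ['hidden', 'state', 'carries', 'context']),
    ('recurrent neural network', ['gradients', 'vanish', 'over', 'time']),
]

with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, 'demo.pkl')  # hypothetical file name
    with open(pkl_path, 'wb') as fh:
        pickle.dump(demo_docs, fh)
    with open(pkl_path, 'rb') as fh:
        restored = pickle.load(fh)

# Collect the distinct category labels found in the restored data.
labels = {label for label, _ in restored}
print(len(restored), labels)
```

The restored object compares equal to the original list, so a later notebook can load it and unpack each entry as a label plus its word list.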