# USING FRUITBOWL Part 1:  Data collection, corpora and training
### First, let's collect some data:
The cherry framework is designed to collect publication metadata from the web. The file `scrape_me2.json` contains URLs of webpages that contain DOIs. Run `cherrymaster.sh` to scrape some data from which to build a corpus:

In [1]:
%%sh
cd fruitbowl/cherry/
sh cherrymaster.sh -p ../../scrape_me2.json
cd ../..

initialised
Scraping Run Analysed.


INFO: Enabled extensions: CoreStats, TelnetConsole, LogStats, CloseSpider, SpiderState
INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
INFO: Enabled item pipelines: CherryPipeline
INFO: Spider opened
INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
DEBUG: Telnet console listening on 127.0.0.1:6023
DEBUG: Crawled (200) <GET http://www.lifesci.dundee.ac.uk/people/robert-ryan> (referer: None)
DEBUG: Crawled (200) <GET http://www.lifesci.dundee.ac.uk/people/carol-mackintosh> (referer: None)
DEBUG: Crawled (200) <GET http://www.lifesci.dundee.ac.uk/> (referer: None)
DEBUG: Crawled (200) <GET http://www.lifesci.

# Cherry has found us raw documents
### Now let's make a corpus from those documents

* Find the `complete.json` file from the scraping run
* Move it to the current working directory
* Continue:

In [4]:
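from fruitbowl.orange import corpus, docIterators

# Stream the scraped documents from disk. No sanitiser is passed yet, so
# documents come back exactly as scraped (assumes `sanit` is optional).
doc_iterator = docIterators.JsonDiskIter('complete.json')
corp = corpus.Corpus('Example_Corpus', doc_iterator)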

Building Dictionary




### The corpus has been built

* The corpus is built on gensim
* Get a random sample of documents via the `get_sample` method

### The corpus has been tokenized - each word has been mapped to an integer token:

* Access the dictionary (token to word) via `corp.dictionary[token]`
* Access the inverse dictionary (word to token) via `corp.inv_dict[word]`
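
The two lookups are mirror images of each other. As a rough illustration of the idea only, here is a plain-Python sketch of a token dictionary and its inverse. `build_dictionary` is a hypothetical helper written for this example, not part of fruitbowl, and the real corpus will assign different token numbers.

```python
# Illustrative only: a plain-Python stand-in for a token dictionary.
# `build_dictionary` is a hypothetical helper, not part of fruitbowl.

def build_dictionary(documents):
    """Assign an integer token to each distinct word, in first-seen order."""
    dictionary = {}   # token -> word
    inv_dict = {}     # word -> token
    for sentence in documents:
        for word in sentence:
            if word not in inv_dict:
                token = len(inv_dict)
                inv_dict[word] = token
                dictionary[token] = word
    return dictionary, inv_dict


docs = [["low", "oxygen", "tensions"], ["oxygen", "is", "essential"]]
dictionary, inv_dict = build_dictionary(docs)
print(dictionary[1])        # word stored under token 1
print(inv_dict["oxygen"])   # token assigned to the word "oxygen"
```

Looking a word up in `inv_dict` and then looking the result up in `dictionary` returns the original word, which is the property the two corpus attributes above rely on.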
 

We can save the corpus to disk so that documents can be accessed later.

In [5]:
sample = corp.get_sample(1)
sample_document = sample[0]
print(sample_document['doi'])
print('')
print(sample_document['doc'])
print('\n'*2)
print(corp.dictionary[200])
print(corp.inv_dict['Topographic'])


corp.export2jsonfile('_SAVED.json')

Sampling 1 random documents


10.1371/journal.pgen.1001285

[[u'Evolutionary', u'Conserved', u'Regulation', u'of', u'HIF-1\u03b2', u'by', u'NF-\u03baB'], [u'Hypoxia', u'Inducible', u'Factor-1', u'(HIF-1)', u'is', u'essential', u'for', u'mammalian', u'development', u'and', u'is', u'the', u'principal', u'transcription', u'factor', u'activated', u'by', u'low', u'oxygen', u'tensions'], [u'HIF-\u03b1', u'subunit', u'quantities', u'and', u'their', u'associated', u'activity', u'are', u'regulated', u'in', u'a', u'post-translational', u'manner,', u'through', u'the', u'concerted', u'action', u'of', u'a', u'class', u'of', u'enzymes', u'called', u'Prolyl', u'Hydroxylases', u'(PHDs)', u'and', u'Factor', u'Inhibiting', u'HIF', u'(FIH)', u'respectively'], [u'However,', u'alternative', u'modes', u'of', u'HIF-\u03b1', u'regulation', u'such', u'as', u'translation', u'or', u'transcription', u'are', u'under-investigated,', u'and', u'their', u'importance', u'has', u'not', u'been', u'firmly', u'established'

# With a corpus built, let's make some models
### First, let's represent a document as a bag of words:

In [6]:
for sentence in sample_document['doc']:
    print(' '.join(sentence))
    print(corp.get_cbow_doc(sentence))
    print('')

Evolutionary Conserved Regulation of HIF-1β by NF-κB
[(39, 1), (40, 1), (1297, 1), (1936, 1), (1939, 1), (1960, 1), (1982, 1)]

Hypoxia Inducible Factor-1 (HIF-1) is essential for mammalian development and is the principal transcription factor activated by low oxygen tensions
[(24, 1), (39, 1), (59, 2), (75, 1), (101, 1), (323, 1), (542, 1), (544, 1), (548, 1), (553, 1), (732, 1), (809, 1), (1593, 1), (1937, 1), (1948, 1), (1951, 1), (1953, 1), (1969, 1), (1981, 1)]

HIF-α subunit quantities and their associated activity are regulated in a post-translational manner, through the concerted action of a class of enzymes called Prolyl Hydroxylases (PHDs) and Factor Inhibiting HIF (FIH) respectively
[(17, 1), (22, 1), (40, 2), (58, 1), (75, 2), (81, 1), (95, 2), (101, 1), (525, 1), (547, 1), (560, 1), (579, 1), (1427, 1), (1528, 1), (1758, 1), (1941, 1), (1944, 1), (1947, 1), (1956, 1), (1957, 1), (1958, 1), (1961, 1), (1966, 1), (1968, 1), (1973, 1), (1976, 1), (1977, 1), (1984, 1), (1985, 
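
Each pair in the output above is `(token, count)`, sorted by token, so `(59, 2)` in the second sentence means the word with token 59 occurs twice there. A minimal plain-Python sketch of the same representation follows. `bag_of_words` and the toy `inv_dict` are assumptions made for this example; they are not how `get_cbow_doc` is implemented.

```python
from collections import Counter

# Illustrative only: `bag_of_words` is a hypothetical helper, not fruitbowl code.

def bag_of_words(sentence, inv_dict):
    """Return sorted (token, count) pairs; words missing from inv_dict are skipped."""
    counts = Counter(inv_dict[w] for w in sentence if w in inv_dict)
    return sorted(counts.items())


inv_dict = {"is": 0, "the": 1, "oxygen": 2, "low": 3}   # toy word -> token mapping
sentence = ["oxygen", "is", "low", "and", "is", "the", "oxygen"]
bow = bag_of_words(sentence, inv_dict)
print(bow)
```

Because only counts are kept, two sentences containing the same words in a different order produce the same bag of words.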

# Next, let's get the TF-IDF representation:
Note that word order is *not* preserved.

In [7]:
#create the tfidf model
corp.get_tfidf_model()
#get the tfidf representation:
for sentence in sample_document['doc']:
    print(' '.join(sentence))
    for k,v in corp.get_tfidf_doc(sentence):
        print(corp.dictionary[k],v)
    print('')

Creating TF-IDF Model
Evolutionary Conserved Regulation of HIF-1β by NF-κB
(u'by', 0.05997220351073987)
(u'of', 0.009991079309451001)
(u'Evolutionary', 0.4438891566014532)
(u'NF-\u03baB', 0.4438891566014532)
(u'Regulation', 0.36933851475796803)
(u'HIF-1\u03b2', 0.5184397984449385)
(u'Conserved', 0.4438891566014532)

Hypoxia Inducible Factor-1 (HIF-1) is essential for mammalian development and is the principal transcription factor activated by low oxygen tensions
(u'for', 0.02347092805874262)
(u'by', 0.040837993317608665)
(u'is', 0.0736478244483421)
(u'and', 0.010117943893314192)
(u'the', 0.010117943893314192)
(u'oxygen', 0.23515784536060694)
(u'transcription', 0.165177479792776)
(u'Hypoxia', 0.2725700229735426)
(u'factor', 0.17103969941681435)
(u'Inducible', 0.25150057717856444)
(u'essential', 0.17103969941681435)
(u'development', 0.13362752180387866)
(u'low', 0.1774122949539822)
(u'activated', 0.3022657389569286)
(u'principal', 0.3530309007352927)
(u'(HIF-1)', 0.3530309007352927)
(u't
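
In the output above, common words such as `of` and `by` receive small weights, while words that are rare in the corpus, such as `HIF-1β`, receive large ones. The sketch below shows one common TF-IDF variant that produces this behaviour: raw count times log2 of the inverse document frequency, followed by scaling to unit length. Treat the formula as an assumption about what `get_tfidf_model` computes. The title-sentence weights printed above do have squares that sum to roughly 1, which is consistent with unit-length scaling. `tfidf` is a hypothetical helper.

```python
import math
from collections import Counter

# Illustrative only: one common TF-IDF weighting, assumed rather than
# taken from the fruitbowl source.

def tfidf(doc, corpus):
    """Weight each word by count * log2(N / document frequency), then L2-normalise."""
    n_docs = len(corpus)
    # Number of documents in which each word appears at least once.
    doc_freq = Counter(word for d in corpus for word in set(d))
    counts = Counter(doc)
    raw = {
        w: c * math.log2(n_docs / doc_freq[w])
        for w, c in counts.items()
        if doc_freq[w]  # ignore words never seen in the corpus
    }
    norm = math.sqrt(sum(v * v for v in raw.values()))
    if norm == 0:
        return {w: 0.0 for w in raw}
    return {w: v / norm for w, v in raw.items()}


corpus = [
    ["regulation", "of", "hypoxia"],
    ["role", "of", "oxygen"],
    ["regulation", "of", "oxygen", "sensing"],
]
weights = tfidf(corpus[0], corpus)
for word in corpus[0]:
    print(word + ": " + str(round(weights[word], 3)))
```

In this toy corpus `of` appears in every document, so its weight is zero, and `hypoxia`, which appears in only one document, outweighs `regulation`, which appears in two.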

# Moving on to sanitising:
### Before training more powerful algorithms, it is useful to clean the data

So far, the document iterator has simply returned each document as scraped from the page.

Sanitiser objects are designed to improve the data quality of the documents:

In [8]:
from fruitbowl.strawberry import sanitisers

#get sources:
punctuation = 'fruitbowl/strawberry/ancillaries/punctuation.json'
stopwords = 'fruitbowl/strawberry/ancillaries/full_stopwords.json'

#get sentence:
sample_sentence=' '.join(sample_document['doc'][0])
print('sample sentence:')
print(sample_sentence+'\n')
#define sanitisers:
min_san = sanitisers.MinimalSanitiser(punctuation)
stop_san = sanitisers.StopWordSanitiser(punct_file=punctuation,stopwords_file=stopwords)
stem_san = sanitisers.StemmingSanitiser(punct_file=punctuation,stopwords_file=stopwords,stem_type='SNOWBALL')
#show sanitising:
print('minimal_sanitising:')
print(min_san.sanitise(sample_sentence)+'\n')
print('stopword_sanitising:')
print(stop_san.sanitise(sample_sentence)+'\n')
print('stopword and stemming sanitising:')
print(stem_san.sanitise(sample_sentence)+'\n')

sample sentence:
Evolutionary Conserved Regulation of HIF-1β by NF-κB

minimal_sanitising:
evolutionary conserved regulation of hif 1β by nf κb

stopword_sanitising:
evolutionary conserved regulation hif 1β nf κb

stopword and stemming sanitising:
evolutionari conserv regul hif 1β nf κb
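
The first two stages can be approximated in a few lines of plain Python, which may help when reading the output above. This is a sketch under assumptions: `PUNCTUATION` and `STOPWORDS` are small hypothetical stand-ins for the JSON ancillary files, and the functions are not the fruitbowl implementations. The stemming stage is left out because it needs a stemmer implementation such as Snowball.

```python
import string

# Hypothetical stand-ins; the real lists live in the ancillary JSON files.
PUNCTUATION = string.punctuation
STOPWORDS = {"of", "by", "the", "and", "is", "for"}

# Translation table mapping every punctuation character to a space.
_PUNCT_TO_SPACE = str.maketrans(PUNCTUATION, " " * len(PUNCTUATION))


def minimal_sanitise(text):
    """Lower-case the text and replace punctuation with single spaces."""
    return " ".join(text.translate(_PUNCT_TO_SPACE).lower().split())


def stopword_sanitise(text):
    """Apply minimal sanitising, then drop stop words."""
    words = minimal_sanitise(text).split()
    return " ".join(w for w in words if w not in STOPWORDS)


sentence = "Evolutionary Conserved Regulation of HIF-1β by NF-κB"
print(minimal_sanitise(sentence))
print(stopword_sanitise(sentence))
```

Replacing punctuation with a space rather than deleting it is what splits `HIF-1β` into `hif` and `1β`, matching the behaviour shown in the output above.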



# Let's use the stemming sanitiser to stream documents into the corpus

* Give the corpus's document iterator the stemming sanitiser to use in streaming
* Rebuild the corpus's dictionaries
* Rebuild the corpus's TFIDF model
* Save the new corpus to file, including TFIDF weights


In [9]:
corp.doc_iter.sanitiser=stem_san
corp.rebuild_dicts()
corp.get_tfidf_model()

corp.export2jsonfile('_SAVED2.json',tfidf=True)



Creating TF-IDF Model


EXPORT COMPLETE


# With properly sanitised input, we can now train word2vec and doc2vec models

In [10]:
from fruitbowl.strawberry import model_training
print('Training Word2Vec model:')
w2vmodel = model_training.train_word2vec(corp.doc_iter)
print('Training Doc2Vec model:')
d2vmodel = model_training.train_doc2vec(corp.doc_iter)

Training Word2Vec model:




Model Trained
Training Doc2Vec model:


training epoch: 0


training epoch: 1


training epoch: 2


training epoch: 3


training epoch: 4


training epoch: 5


training epoch: 6


training epoch: 7


training epoch: 8


training epoch: 9


training epoch: 10


training epoch: 11


training epoch: 12


training epoch: 13


training epoch: 14


training epoch: 15


training epoch: 16


training epoch: 17


training epoch: 18


training epoch: 19


training epoch: 20


training epoch: 21


training epoch: 22


training epoch: 23


Model Trained


# Finally we can generate vectors for documents:
### Three classes are implemented to facilitate this:
* `SentBySentGenerator` generates document vectors by averaging word vectors into sentences, then averaging sentence vectors into document vectors
* `WordByWordGenerator` generates document vectors by averaging word vectors directly into documents
* `Doc2VecGenerator` generates document vectors using trained doc2vec models
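
The two averaging strategies give different results whenever sentences differ in length. The plain-Python sketch below illustrates this with toy two-dimensional word vectors. The function names and data are hypothetical and are not taken from `vect_generators`. Words missing from `word_vectors` would raise a `KeyError` here.

```python
# Illustrative only: toy versions of the two averaging strategies.

def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def word_by_word(doc, word_vectors):
    """Average every word vector in the document directly."""
    return mean_vector([word_vectors[w] for sentence in doc for w in sentence])


def sent_by_sent(doc, word_vectors):
    """Average words into sentence vectors, then sentences into a document vector."""
    sentence_vectors = [
        mean_vector([word_vectors[w] for w in sentence]) for sentence in doc
    ]
    return mean_vector(sentence_vectors)


word_vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0]}   # toy 2-d embeddings
doc = [["a"], ["b", "b", "b"]]                       # one short, one long sentence
print(word_by_word(doc, word_vectors))
print(sent_by_sent(doc, word_vectors))
```

Word-by-word averaging weights every word equally, so the long sentence dominates. Sentence-by-sentence averaging weights every sentence equally, regardless of its length.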

In [18]:
from fruitbowl.strawberry import vect_generators
sbs = vect_generators.SentBySentGenerator(w2vmodel)
wbw = vect_generators.WordByWordGenerator(w2vmodel)
d2v = vect_generators.Doc2VecGenerator(d2vmodel)

In [19]:
sbs_vector = sbs.get_vector(sample_document['doc'])
wbw_vector = wbw.get_vector(sample_document['doc'])
d2v_vector = d2v.get_vector(sample_document['doi'])

# We have gone from raw scraping to document representation vectors in 15 lines of code

Minimal working example below:


In [None]:
%%sh
cd fruitbowl/cherry/
sh cherrymaster.sh -p ../../scrape_me2.json
mv complete.json ../../  

In [None]:
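from fruitbowl.orange import corpus, docIterators
from fruitbowl.strawberry import sanitisers, model_training, vect_generators

# ancillary word lists used by the sanitiser
punctuation = 'fruitbowl/strawberry/ancillaries/punctuation.json'
stopwords = 'fruitbowl/strawberry/ancillaries/full_stopwords.json'

stem_san = sanitisers.StemmingSanitiser(punct_file=punctuation, stopwords_file=stopwords, stem_type='SNOWBALL')
doc_iterator = docIterators.JsonDiskIter('complete.json', sanit=stem_san)
corp = corpus.Corpus('Example_Corpus', doc_iterator)
corp.get_tfidf_model()
corp.export2jsonfile('_SAVED2.json', tfidf=True)
# train the doc2vec model before building a generator from it
d2vmodel = model_training.train_doc2vec(corp.doc_iter)
d2v = vect_generators.Doc2VecGenerator(d2vmodel)
# get_sample returns a list, so take its first element
sample_document = corp.get_sample(1)[0]
d2v_vector = d2v.get_vector(sample_document['doi'])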