Let's grab abstracts and metadata from PubMed via the Entrez API.
The Biopython library should do the trick:
http://biopython.org/DIST/docs/api/Bio.Entrez-module.html

In [16]:
from Bio import Medline, Entrez

pmids = ['18680603', '18665331', '18661158', '18627489', '18627452', '18612381']
Entrez.email = 'your_email@your_address.com'
handle = Entrez.efetch(db="pubmed", id=pmids, rettype="medline", retmode="text")
records = Medline.parse(handle)
# Medline.parse returns a generator that yields one dict per record
# E.g. to get journal titles back
for record in records:
    print(record['JT'])

BMC pregnancy and childbirth
Journal of natural medicines
Mycorrhiza
The New phytologist
Molecular ecology
PloS one


Great, that works. So how about getting the text of the abstracts?

In [7]:
handle = Entrez.efetch(db="pubmed", id=pmids, rettype="medline", retmode="text")
records = Medline.parse(handle)
for record in records:
    print(record['AB'])   # journal titles would be record['JT']
 

BACKGROUND: Evidence-based practice (EBP) can provide appropriate care for women and their babies; however implementation of EBP requires health professionals to have access to knowledge, the ability to interpret health care information and then strategies to apply care. The aim of this survey was to assess current knowledge of evidence-based practice, information seeking practices, perceptions and potential enablers and barriers to clinical practice change among maternal and infant health practitioners in South East Asia. METHODS: Questionnaires about IT access for health information and evidence-based practice were administered during August to December 2005 to health care professionals working at the nine hospitals participating in the South East Asia Optimising Reproductive and Child Health in Developing countries (SEA-ORCHID) project in Indonesia, Malaysia, Thailand and The Philippines. RESULTS: The survey was completed by 660 staff from six health professional groups. Overall, ea

So to get more abstracts, we need (many) more PMIDs:
http://www.ncbi.nlm.nih.gov/pmc/tools/ftp/#Obtaining_DOIs

In [24]:
import pandas

pmid_data = pandas.read_csv("/Users/adam/Code/SCIgen2/pubmed/PMC-ids.csv")
print(len(pmid_data))
pmid_data.head()

3679075


Unnamed: 0,Journal Title,ISSN,eISSN,Year,Volume,Issue,Page,DOI,PMCID,PMID,Manuscript Id,Release Date
0,Breast Cancer Res,1465-5411,1465-542X,2000,3,1,55,,PMC13900,11250746,,live
1,Breast Cancer Res,1465-5411,1465-542X,2000,3,1,61,,PMC13901,11250747,,live
2,Breast Cancer Res,1465-5411,1465-542X,2000,3,1,66,,PMC13902,11250748,,live
3,Breast Cancer Res,1465-5411,1465-542X,1999,2,1,59,,PMC13911,11056684,,live
4,Breast Cancer Res,1465-5411,1465-542X,1999,2,1,64,,PMC13912,11400682,,live


3.68 million papers! OK now let's limit ourselves to papers published since the year 2010...

In [25]:
pmid_data = pmid_data[pmid_data.Year >= 2010]
print(len(pmid_data))

1495598


My goodness. About 40% of the papers are from 2010 or later. Well, 1.5 million papers it is then.

In [26]:
pmid_data.PMID.dtype

dtype('float64')

That's what I thought... pandas converted the PMID column to floats because of some NaN entries.
So let's drop the NaN records from PMID, cast to int to get rid of the decimals, cast to str to get proper ID codes, and finally convert this pandas Series to a Python list...

In [27]:
pmids = pmid_data.PMID.dropna().apply(int).apply(str).tolist()
print(len(pmids))
print(pmids[:10])

1427980
['21464888', '19663750', '16584150', '17233537', '16584152', '16584155', '16796349', '16796357', '16796363', '17285742']
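
To make the three steps concrete, here is a rough plain-Python equivalent of the pandas chain above, run on two PMIDs from the output plus one NaN. The helper name `floats_to_pmids` is made up for this illustration; the notebook itself keeps using `dropna().apply(int).apply(str)`.

```python
import math

def floats_to_pmids(values):
    """Mimic dropna -> int -> str on a list of floats (illustrative helper)."""
    pmids = []
    for value in values:
        if math.isnan(value):
            # dropna(): int(float('nan')) would raise ValueError
            continue
        # int() strips the trailing '.0', str() turns the number into an ID string
        pmids.append(str(int(value)))
    return pmids

raw = [21464888.0, float("nan"), 19663750.0]
print(str(raw[0]))           # 21464888.0 (the '.0' would break a PMID lookup)
print(floats_to_pmids(raw))  # ['21464888', '19663750']
```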


Now we've got our big list of pmids to feed to Entrez.
Let's test out the first 10 of them...

In [28]:
sample = pmids[:10]
Entrez.email = 'your_email@your_address.com'
handle = Entrez.efetch(db="pubmed", id=sample, rettype="medline", retmode="text")
records = Medline.parse(handle)
for record in records:
    print(record['AB'])

This paper reports findings from a clinical trial of a probation case management (PCM) intervention for drug-involved women offenders. Participants were randomly assigned to either PCM (n=92) or standard probation (n=91), and followed for 12 months using measures of substance abuse, psychiatric symptoms, social support and service utilization. Arrest data were collected from administrative datasets. The sample (N=183) included mostly African American (57%) and White (20%) women, with a mean age of 34.7 (SD = 9.2) and mean education of 11.6 years (SD = 2.1). Cocaine and heroin were the most frequently reported drugs of abuse, 86% reported prior history of incarceration, and 74% had children. Women assigned to both PCM and standard probation showed change over time in the direction of clinical improvement on 7 of 10 outcomes measured. However, changes observed for the PCM group were no different than those observed for the standard probation group. Higher levels of case management, drug 

Sweet. Now it's just a matter of scaling up and saving the results.
Let's make sure to skip any papers that don't have abstracts (for whatever reason).
And we'll have to do it in chunks of 10,000 IDs per request, since that is the most records EFetch returns in one call:
http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.EFetch
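
The chunk arithmetic can be tried out without touching the network. The helper below is hypothetical (it is not part of Biopython); it uses ceiling division so that a final partial chunk is counted and an empty chunk is never produced.

```python
def iter_chunks(ids, size=10000):
    """Yield (index, sublist) pairs covering ids in order (illustrative helper)."""
    num_chunks = (len(ids) + size - 1) // size  # ceiling division
    for i in range(num_chunks):
        yield i, ids[i * size:(i + 1) * size]

# Small demo: 25 fake IDs in chunks of 10
demo_ids = [str(n) for n in range(25)]
for index, chunk in iter_chunks(demo_ids, size=10):
    print(index, len(chunk))
```

With 1,427,980 IDs and a chunk size of 10,000 this works out to 143 chunks, numbered 0 to 142.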

In [42]:
import sys, os

chunk_dir = "/Users/adam/Code/SCIgen2/char-rnn/data/pubmed_abstracts/chunks"
skipped = 0
num_chunks = len(pmids) // 10000  # full chunks; the loop adds one more for the remainder
print("%d total chunks needed" % num_chunks)
# check for chunks from previous runs, since this will take a while!
done = [i for i in os.listdir(chunk_dir) if i.endswith(".txt")]
Entrez.email = 'your_email@your_address.com'
for i in range(num_chunks + 1):
    if "%s.txt" % i in done:
        continue
    # text mode, so the str abstracts can be written directly
    with open(os.path.join(chunk_dir, "%s.txt" % i), "w") as f:
        if i % 20 == 0:
            print("current chunk: %d" % i)
            sys.stdout.flush()
        chunk = pmids[i * 10000: (i + 1) * 10000]
        handle = Entrez.efetch(db="pubmed", id=chunk, rettype="medline", retmode="text")
        records = Medline.parse(handle)
        for record in records:
            try:
                f.write(record['AB'] + '\n')
            except KeyError:
                # this record has no abstract field
                skipped += 1
        handle.close()

print("%d abstracts skipped" % skipped)

142
Chunk 140
13639
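
One caveat with the resume check: the loop opens each chunk file before fetching, so a run interrupted mid-chunk leaves a partial (or empty) file that the next run would treat as done. A possible workaround is to write to a temporary name and rename it only on success. The sketch below uses a stand-in `fetch_abstracts` callable instead of a real Entrez call; `write_chunk` and `fetch_abstracts` are hypothetical names, and `os.replace` assumes Python 3.3 or later.

```python
import os
import tempfile

def write_chunk(path, fetch_abstracts):
    """Write abstracts to path atomically; fetch_abstracts is any callable returning strings."""
    tmp_path = path + ".part"  # does not end in 'txt', so the resume check ignores it
    with open(tmp_path, "w") as f:
        for abstract in fetch_abstracts():
            f.write(abstract + "\n")
    os.replace(tmp_path, path)  # rename only after every record was written

# Demo in a throwaway directory, with a fake fetch function
out_dir = tempfile.mkdtemp()
target = os.path.join(out_dir, "0.txt")
write_chunk(target, lambda: ["first abstract", "second abstract"])
print(sorted(os.listdir(out_dir)))
```

If the fetch raises partway through, only the `.part` file is left behind, and the chunk is retried on the next run.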


Nice. So only 13,639 out of 1.4 million papers (just under 1%) didn't have abstracts.
Now to combine those chunks into a single file called input.txt for char-rnn...

In [46]:
chunk_dir = "/Users/adam/Code/SCIgen2/char-rnn/data/pubmed_abstracts/chunks"
done = [i for i in os.listdir(chunk_dir) if i.endswith(".txt")]
with open("/Users/adam/Code/SCIgen2/char-rnn/data/pubmed_abstracts/input.txt", "wb") as f:
    for i in done:
        # copy each chunk file into the combined corpus
        with open(os.path.join(chunk_dir, i), "rb") as j:
            f.write(j.read())
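
A note on the cell above: `os.listdir` returns names in arbitrary order, and `j.read()` loads each chunk fully into memory. If a reproducible corpus matters, one option is to sort the chunk files numerically and stream them with `shutil.copyfileobj`. The sketch below runs on a throwaway temp directory; `combine_chunks` is a hypothetical helper and assumes the files are named `<number>.txt`.

```python
import os
import shutil
import tempfile

def combine_chunks(chunk_dir, out_path):
    """Concatenate N.txt files from chunk_dir into out_path in numeric order."""
    names = [n for n in os.listdir(chunk_dir) if n.endswith(".txt")]
    names.sort(key=lambda n: int(n[:-4]))  # numeric sort, so '10.txt' comes after '2.txt'
    with open(out_path, "wb") as out:
        for name in names:
            with open(os.path.join(chunk_dir, name), "rb") as src:
                shutil.copyfileobj(src, out)  # streams in blocks instead of one big read()

# Demo with three tiny fake chunk files
chunk_dir = tempfile.mkdtemp()
for i in (10, 2, 0):
    with open(os.path.join(chunk_dir, "%d.txt" % i), "w") as f:
        f.write("chunk %d\n" % i)
out_path = os.path.join(tempfile.mkdtemp(), "input.txt")
combine_chunks(chunk_dir, out_path)
with open(out_path) as f:
    print(f.read())
```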

And now we have a nice big corpus of scientific abstracts!