
Memory error while working with The Signal Media One-Million News Articles Dataset (approx. 2.7 GB) #3

Open
iabhi7 opened this issue Mar 21, 2017 · 2 comments

Comments


iabhi7 commented Mar 21, 2017

I tried creating the vocabulary embeddings with The Signal Media One-Million News Articles Dataset (which is approximately 2.7 GB in size), but it gave me an error on a g2.8xlarge instance. I'm not sure what I'm doing wrong here.
vocabulary-embedding.py runs as expected, but training the model raises a memory error.
I also tried distributing the model on the 4 GPUs that are available.

Is there any workaround, code snippet, or alternate dataset that could help me solve this problem?


jmsfcb commented Jun 14, 2017

In case it helps: I had issues too, so I replaced block 6 in vocabulary-embedding.ipynb with the following:

import json

fndata = 'data/signalmedia-1m.jsonl'
heads = []
desc = []
keywords = []
counter = 0
with open(fndata) as f:
    for line in f:
        if counter < 20000:
            jdata = json.loads(line)
            heads.append(jdata["title"].lower())
            desc.append(jdata["content"].lower())
            keywords.append(None)
            #counter += 1

Creating a separate pickle file was just causing me grief, so I read the data directly from the source. You can uncomment the counter increment in the last line if you only want to grab the first 20,000 articles.

@KevinDanikowski

@jmsfcb, you should also add an else: break so the loop doesn't have to run through all 1 million articles once the limit is reached.
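A minimal sketch of the loop with the counter increment enabled and the suggested else: break. It uses a small in-memory sample (with a limit of 2) standing in for data/signalmedia-1m.jsonl, since the real file is 2.7 GB; the field names "title" and "content" match the snippet above.

import io
import json

# Hypothetical sample standing in for data/signalmedia-1m.jsonl
sample = io.StringIO(
    '{"title": "First Article", "content": "Body one."}\n'
    '{"title": "Second Article", "content": "Body two."}\n'
    '{"title": "Third Article", "content": "Body three."}\n'
)

LIMIT = 2  # stands in for the 20,000-article cap in the snippet above
heads, desc, keywords = [], [], []
counter = 0
for line in sample:
    if counter < LIMIT:
        jdata = json.loads(line)
        heads.append(jdata["title"].lower())
        desc.append(jdata["content"].lower())
        keywords.append(None)
        counter += 1
    else:
        break  # stop scanning instead of reading the remaining lines

print(heads)  # ['first article', 'second article']

Without the break, the loop would still iterate over every remaining line even though the if-body never runs again, which is exactly the wasted time KevinDanikowski is pointing out.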

3 participants