Cloud Deployment Specs #104

Open
michaelsharpe opened this issue Dec 10, 2019 · 6 comments

@michaelsharpe

Hey,

Awesome project here! I have been wanting to get better at ML, and this is right up my alley.

I am curious what the specs for deploying this in the cloud (e.g. on AWS) would be. Obviously, instances without a GPU are considerably cheaper.

Is there work being done on hosted cloud infrastructure for this, and if so, what is being used?

Thanks again!

@dyc3 (Contributor) commented Dec 10, 2019

I'm assuming that you mean running the model as a service in the cloud.

Due to the sheer magnitude of the GPT-2 model and the lack of multi-GPU support, I don't think it's feasible to do a cloud deployment like that. Hell, we had to start serving the model over torrents instead of a CDN download because it was too expensive. (See #41)

The GPT-2 Large model requires 12 GB of VRAM to run on a GPU, so a cloud deployment would probably involve multiple instances of the application, each with its own GPU with at least 12 GB of VRAM.
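
If you do try a deployment like that, a quick sanity check that an instance's GPU actually has enough memory (assuming `nvidia-smi` is available on the box; this is just a sketch, not part of this repo) could look like:

```python
# Check that each visible GPU has at least ~12 GB of VRAM, the figure mentioned
# above for GPT-2 Large. Relies only on nvidia-smi being installed.
import subprocess

REQUIRED_MIB = 12 * 1024  # ~12 GB

out = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
    text=True,
)
for i, line in enumerate(out.strip().splitlines()):
    total_mib = int(line.strip())
    status = "OK" if total_mib >= REQUIRED_MIB else "too small"
    print(f"GPU {i}: {total_mib} MiB total -- {status}")
```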

@michaelsharpe (Author)

@dyc3 Thanks. Yeah, running it as a service: having an API endpoint, or an AWS Lambda, that can send a command to the model and receive a response.
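
To be concrete, something roughly like this is what I have in mind (Flask is just for illustration, and `generate()` below is a placeholder since I don't know the exact interface this project exposes):

```python
# Minimal sketch of a "model as a service" wrapper: one HTTP endpoint that takes
# a prompt and returns the model's continuation. generate() is a stub standing in
# for the real GPT-2 call.
from flask import Flask, jsonify, request

app = Flask(__name__)

def generate(prompt: str) -> str:
    # Placeholder: load GPT-2 once at startup and run inference here.
    return f"(model output for: {prompt!r})"

@app.route("/generate", methods=["POST"])
def generate_endpoint():
    prompt = request.get_json(force=True).get("prompt", "")
    return jsonify({"result": generate(prompt)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```

A client would then just POST a JSON body like `{"prompt": "You enter the cave"}` to `/generate`.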

I have been following some of the discussion about the size of the model and the issues with downloading it, and was curious what running it requires. 12 GB of VRAM is definitely cost-prohibitive for running this on a server as an API. Running even one instance of that size 24/7 would run into the thousands of dollars a month.

Does it run decently on a multicore CPU? I saw someone mention that it is not yet optimized for CPU. Multicore CPU instances are definitely more affordable than instances with dedicated VRAM.

Any idea what is currently being considered for deploying and scaling this?

Thanks!

@dyc3 (Contributor) commented Dec 11, 2019

I've run it locally on a Ryzen 7 2700X. It's significantly slower than running it on a GPU.
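
If anyone wants to reproduce that, one generic way to force CPU-only execution in TensorFlow (not specific to this repo) is to hide the GPUs before the model loads:

```python
# Generic TensorFlow trick: hide all GPUs so the graph is placed on the CPU.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"  # must be set before TensorFlow touches CUDA

import tensorflow as tf

# Any graph/session created from here on will run on CPU devices only.
print(tf.test.is_gpu_available())  # TF 1.x-style check; should print False
```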

@ethanspitz

I'm running this on a SkySilk VPS and wrote a Discord bot wrapper (so people in the server can interactively play AI Dungeon 2 together), and it works fine on a 2 vCPU server with 8 GB of RAM. It's definitely not fast, though. Each response generally takes somewhere between 1 and 2 minutes, but it gets the job done for what I'm using it for.
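
Roughly, the idea is just a bot that forwards commands to the generator and posts the reply back to the channel. A stripped-down sketch (using discord.py, with `generate_reply()` as a stub standing in for the actual model call):

```python
# Sketch of a Discord bot wrapper around the generator. generate_reply() is a
# placeholder; wire it up to the project's actual text generation code.
import asyncio
import discord

def generate_reply(action: str) -> str:
    # Placeholder for the real (slow, CPU-bound) model call.
    return f"(model output for: {action})"

intents = discord.Intents.default()
intents.message_content = True
client = discord.Client(intents=intents)

@client.event
async def on_message(message):
    if message.author == client.user or not message.content.startswith("!do "):
        return
    action = message.content[len("!do "):]
    # Generation takes on the order of minutes here, so keep it off the event loop.
    reply = await asyncio.get_running_loop().run_in_executor(None, generate_reply, action)
    await message.channel.send(reply)

client.run("YOUR_DISCORD_BOT_TOKEN")  # placeholder token
```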

@fumbleforce commented Dec 11, 2019

I have a colleague running it locally on a 24-thread AMD CPU, and he gets a result in a few seconds, which is definitely playable. Running it on my 8700K, I get a reply in about 40-50 seconds. It uses about 8-10 GB of RAM depending on the length of the story.

@michaelsharpe (Author)

@ethanspitz That is good to know. I have been told that they are working on a multicore CPU solution, which should open up more scalable deployment options.

@JorgeER That is good to hear. For now, I would be happy with a response rate like that, even just for my development process before it hits production. I wonder about the scalability, though. Does each individual story get stored in RAM, i.e. does each user require 8-10 GB of RAM? That would be a little crazy.

Thanks again for the responses! Very helpful for me.
