Cloud Deployment Specs #104
Comments
I'm assuming that you mean running the model as a service in the cloud. Due to the sheer magnitude of the GPT-2 model and the lack of multi-GPU support, I don't think it's feasible to do a cloud deployment like that. Hell, we had to start serving the model over torrents instead of a CDN download because it was too expensive. (See #41) The GPT-2 Large model requires 12 GB of VRAM in order to run on a GPU, so a cloud deployment would probably involve multiple instances of the application, each with its own GPU with at least 12 GB of VRAM.
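For anyone sizing instances, a rough sketch of a preflight check might look like the following. It assumes the `pynvml` package (NVIDIA's management library bindings), which is not part of this repo, and the 12 GB figure is just the number quoted above:

```python
# Sketch only: check free GPU memory before attempting to load the large model.
# Assumes the pynvml package is installed; 12 GB is the figure cited above.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)   # first GPU on the instance
info = pynvml.nvmlDeviceGetMemoryInfo(handle)
free_gb = info.free / 1024 ** 3
if free_gb < 12:
    print(f"Only {free_gb:.1f} GB of VRAM free; GPT-2 Large needs roughly 12 GB")
```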
@dyc3 Thanks. Yeah, running it as a service. Having an API endpoint, or an AWS Lambda, that can send a command to the model and receive a response. I have been following some of the discussion about the size of the model and the issues with downloading it. Was curious what running requires. 12 GB of VRAM is definitely cost prohibitive for running this on a server as an API. Running even one instance of that size 24/7 would run into the thousands of dollars a month. Does it run decently on a multicore CPU? I saw someone mentioning it not yet being optimized for CPU. Multicore CPU instances are definitely more affordable than instances with dedicated VRAM. Any idea what is currently being considered for deploying and scaling this? Thanks!
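To make the "API endpoint" idea concrete, here is a minimal sketch of what such a service could look like. The `generate()` call is a hypothetical stand-in for whatever entry point the model exposes, and Flask is just one option; nothing here is the project's actual serving code:

```python
# Hypothetical sketch of an HTTP wrapper around the model; generate() is a
# placeholder for the real model call and is not part of this repo.
from flask import Flask, request, jsonify

app = Flask(__name__)

# In a real deployment the model would be loaded once at startup, not per
# request, since loading GPT-2 Large costs both time and ~12 GB of (V)RAM.
# model = load_model(...)

@app.route("/generate", methods=["POST"])
def generate_endpoint():
    prompt = request.get_json().get("prompt", "")
    # reply = model.generate(prompt)   # the actual inference call would go here
    reply = "..."                      # placeholder continuation
    return jsonify({"reply": reply})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```

Given the model size, an always-on instance like this is a better fit than AWS Lambda, which can't hold multi-gigabyte weights resident between invocations.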
I've run it locally on a Ryzen 7 2700X. It's significantly slower than running it on a GPU.
I'm running this on a SkySilk VPS and wrote a Discord bot wrapper (so people in the server can interactively play AI Dungeon 2 together), and it works fine on a 2 vCPU server with 8 GB of RAM. It's definitely not fast, though. Each response generally takes somewhere between 1-2 minutes, but it gets the job done for what I'm using it for.
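For anyone curious, a wrapper along those lines could be sketched roughly like this (discord.py 1.x style; `generate_story()` is a hypothetical stand-in for the actual model call, not code from this repo). The main point is pushing the slow, blocking inference off the event loop so the bot stays responsive during the 1-2 minute wait:

```python
# Rough sketch of a Discord bot wrapper; generate_story() is hypothetical.
import asyncio
from discord.ext import commands

bot = commands.Bot(command_prefix="!")

def generate_story(action: str) -> str:
    # Placeholder: call into the model here (takes 1-2 minutes on a small VPS).
    return "The dungeon responds..."

@bot.command(name="do")
async def do_action(ctx, *, action: str):
    await ctx.send("Thinking...")
    # Run the blocking model call in a thread so the bot keeps handling events.
    reply = await asyncio.get_event_loop().run_in_executor(None, generate_story, action)
    await ctx.send(reply)

bot.run("DISCORD_BOT_TOKEN")  # placeholder token
```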
I have a colleague running it locally with a 24-thread AMD CPU, and he gets a result in a few seconds, definitely playable at that point. Running it on my 8700K, I get a reply in about 40-50 seconds. It uses about 8-10 GB of RAM depending on the length of the story.
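If the model is running on the TensorFlow 1.x GPT-2 code, CPU thread usage can at least be tuned through the session config. This is only a sketch, and the thread counts below are illustrative rather than recommendations:

```python
# Sketch: tuning CPU parallelism for a TF 1.x session; values are illustrative.
import tensorflow as tf

config = tf.ConfigProto(
    intra_op_parallelism_threads=24,  # threads used within a single op (matmuls etc.)
    inter_op_parallelism_threads=2,   # ops allowed to run concurrently
)
sess = tf.Session(config=config)
```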
@ethanspitz That is good to know. I have been told that they are working on a multicore CPU solution, which should open up scalable deployment solutions. @JorgeER That is good to hear. For now, I would be happy with a response rate like that, even just for my development process before it would hit production. I wonder about the scalability though. Does each individual story get stored in RAM, i.e., does each user require 8-10 GB of RAM to run? That would be a little crazy. Thanks again for the responses! Very helpful for me.
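My understanding (happy to be corrected) is that the 8-10 GB is the model weights plus activations, not the story itself; the story is just text, so a single loaded model could in principle serve many users as long as requests are queued. A sketch of that idea, with `StoryStore` and the `generate()` call both hypothetical:

```python
# Sketch of "one shared model, per-user story context"; names are hypothetical.
class StoryStore:
    def __init__(self, model):
        self.model = model      # single shared model instance (~8-10 GB once)
        self.stories = {}       # user_id -> list of story turns (text only, a few KB)

    def act(self, user_id, action):
        history = self.stories.setdefault(user_id, [])
        context = "\n".join(history + [action])
        result = self.model.generate(context)   # hypothetical inference call
        history.extend([action, result])
        return result
```

The trade-off is that requests from different users would be serialized (or need multiple model replicas), since each generation already takes tens of seconds to minutes on CPU.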
Hey,
Awesome project here! I have been wanting to get better with my ML, and this is right up my alley.
I am curious what the specs for deploying this on the cloud would be, such as AWS. Obviously instances without GPU are considerably cheaper.
Is there work being done on hosted cloud infrastructure for this, and if so, what is being used?
Thanks again!