perry-C/Distributed-CNN

Distributed-CNN

Architecture diagram

[Image: architecture_diagram (drawio)]

Instructions for deployment

If using AWS Academy:

  • Update the ~/.aws/credentials file with the up-to-date credentials for the current session.

  • The main implementation is a PyTorch-based distributed convolutional neural network that runs on the number of EC2 instances defined by the environment variable WORLD_SIZE (the only user-facing parameter provided for tuning).

    # Set up the environment
    
    # Extremely important: defines the number of EC2 instances to create, i.e. the number of containers the model is trained on
    
    export WORLD_SIZE=(int) # replace (int) with an integer; 1-5 is ideal, as setting it too high would reach the CPU request limit
    
    chmod -R u+x scripts
    
    # Inside "fabfile.env", change these variables to:
    ssh_key_path = "PATHTOKEY/KEYNAME.pem"
    key_name = "KEYNAME"
    
    
    # Set up the swarm clusters
    fab start # May take a while to pull the docker file
    fab setupmaster 
    
    # Note: This step can be skipped if WORLD_SIZE is 1
    fab setupworkers
    
    # Finally
    fab distributejobs
  • To see the performance/output of the training process (after all the commands above have been run):

    # rank0 is the name of the master node, where the swarm manager resides
    fab ec2ssh rank0
    docker attach rank0
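
The deployment order above depends on WORLD_SIZE, so it can help to validate the variable before running any fab command. The following is a minimal sketch, not part of this repository: the helper names `read_world_size`, `deployment_steps`, and `exceeds_recommendation`, and the constant `MAX_RECOMMENDED_WORLD_SIZE`, are hypothetical. It assumes only what the instructions state: WORLD_SIZE is an integer, 1-5 is the suggested range, and `fab setupworkers` can be skipped when the world size is 1.

```python
import os

# Assumption: taken from the README's "1-5 is ideal" advice.
# The project itself is not known to enforce this bound.
MAX_RECOMMENDED_WORLD_SIZE = 5


def read_world_size(env=None):
    """Return WORLD_SIZE as a positive int, or raise ValueError."""
    env = os.environ if env is None else env
    raw = env.get("WORLD_SIZE")
    if raw is None:
        raise ValueError("WORLD_SIZE is not set")
    try:
        world_size = int(raw)
    except ValueError:
        raise ValueError(f"WORLD_SIZE must be an integer, got {raw!r}") from None
    if world_size < 1:
        raise ValueError("WORLD_SIZE must be at least 1")
    return world_size


def exceeds_recommendation(world_size):
    """True if the value is above the range the README suggests."""
    return world_size > MAX_RECOMMENDED_WORLD_SIZE


def deployment_steps(world_size):
    """List the fab commands from the instructions, in order."""
    steps = ["fab start", "fab setupmaster"]
    # Workers are only needed when there is more than one node.
    if world_size > 1:
        steps.append("fab setupworkers")
    steps.append("fab distributejobs")
    return steps
```

For example, `deployment_steps(read_world_size())` returns the commands to run for the current shell session, and `exceeds_recommendation` can be used to print a warning before any instances are created.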
