# Runhouse

[Runhouse](https://github.com/run-house/runhouse) lets you use remote compute and data across environments and users. See the [Runhouse docs](https://www.run.house/docs).

This example shows how to use LangChain and [Runhouse](https://github.com/run-house/runhouse) to interact with models hosted on your own GPU, or on on-demand GPUs from AWS, GCP, or Lambda.

**Note**: The code uses the name `SelfHosted` rather than `Runhouse`.

In [4]:
%pip install --upgrade --quiet runhouse
%cd /Users/sashabelousovrh/PycharmProjects/LangchainIntegration/langchain/libs/langchain
%pip install -e . 
%cd /Users/sashabelousovrh/PycharmProjects/LangchainIntegration/langchain/docs/docs/integrations/llms


In [1]:
import runhouse as rh
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain_community.llms import SelfHostedHuggingFaceLLM, SelfHostedPipeline
from langchain_community.llms.self_hosted_hugging_face import _generate_text, _load_transformer



In [2]:
# For an on-demand A10G with the cheapest provider (default)
gpu = rh.cluster(name="rh-a10x", instance_type="g5.2xlarge", use_spot=False)

# For an on-demand A10G with AWS
# gpu = rh.ondemand_cluster(name='rh-a10x', instance_type='g5.2xlarge', provider='aws')

# For an existing cluster
# gpu = rh.cluster(ips=['<ip of the cluster>'],
#                  ssh_creds={'ssh_user': '...', 'ssh_private_key':'<path_to_key>'},
#                  name='rh-a10x')


INFO | 2024-03-07 13:59:21.835063 | Saving config for rh-a10x-ssh-secret to Den
INFO | 2024-03-07 13:59:21.995021 | Saving secrets for rh-a10x-ssh-secret to Vault


In [3]:
model_env = rh.env(reqs=["transformers", "torch"])

In [4]:
template = """Question: {question}

Answer: Let's think step by step."""

prompt = PromptTemplate.from_template(template)
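To see what the model will actually receive, it can help to render the template by hand. The sketch below is a stand-in that uses plain `str.format` rather than LangChain itself; the `render` helper is hypothetical and only approximates how a simple single-placeholder template is filled.

```python
# Stand-in for filling the prompt template, using only the standard library.
template = """Question: {question}

Answer: Let's think step by step."""


def render(template: str, **values: str) -> str:
    """Fill named placeholders; raises KeyError if a placeholder has no value."""
    return template.format(**values)


rendered = render(template, question="What is the capital of Germany?")
print(rendered)
```

The rendered string is the full text sent to the model, including the fixed "Let's think step by step." suffix.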

In [5]:
load_transformer_remote = rh.function(fn=_load_transformer).to(gpu, env=model_env)

INFO | 2024-03-07 13:59:41.887909 | Connected (version 2.0, client OpenSSH_8.2p1)
INFO | 2024-03-07 13:59:42.735377 | Authentication (publickey) successful!
INFO | 2024-03-07 13:59:43.452130 | Server rh-a10x is up.
INFO | 2024-03-07 13:59:43.464212 | Copying package from file:///Users/sashabelousovrh/PycharmProjects/LangchainIntegration/langchain to: rh-a10x
INFO | 2024-03-07 13:59:47.810518 | Calling base_env.install
INFO | 2024-03-07 13:59:49.114660 | Time to call base_env.install: 1.3 seconds



INFO | 2024-03-07 13:59:52.457303 | Sending module _load_transformer to rh-a10x



In [6]:
generate_text_remote = rh.function(_generate_text).to(gpu, env=model_env)

INFO | 2024-03-07 14:00:01.271664 | Copying package from file:///Users/sashabelousovrh/PycharmProjects/LangchainIntegration/langchain to: rh-a10x
INFO | 2024-03-07 14:00:03.362922 | Calling base_env.install
INFO | 2024-03-07 14:00:04.655512 | Time to call base_env.install: 1.29 seconds



INFO | 2024-03-07 14:00:08.103214 | Sending module _generate_text to rh-a10x



In [8]:
llm = SelfHostedHuggingFaceLLM(
    name="gemma-2b-it",
    model_id="gemma-2b-it",
    model_load_fn=load_transformer_remote,
    inference_fn=generate_text_remote,
    # Opt in to pickle-based deserialization. Without this flag the constructor
    # raises a ValueError. Only enable it for models and data you trust.
    allow_dangerous_deserialization=True,
).to(gpu, env=model_env)

**Note**: `SelfHostedPipeline` relies on the `pickle` module, so you must set `allow_dangerous_deserialization=True` to opt in. Without it, construction fails with a `ValueError`. A malicious actor can craft pickled data that executes arbitrary code when it is deserialized, so only load data you trust.
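The `allow_dangerous_deserialization` opt-in exists because unpickling can call functions chosen by whoever produced the bytes. The following is a deliberately harmless illustration using only the standard library; the `Payload` class is hypothetical and calls `sorted` where a real attack would call something destructive.

```python
import pickle


class Payload:
    """Object whose pickled form tells the loader to call a function."""

    def __reduce__(self):
        # On load, pickle calls sorted([3, 1, 2]) instead of rebuilding a Payload.
        return (sorted, ([3, 1, 2],))


data = pickle.dumps(Payload())
result = pickle.loads(data)  # runs the callable named inside the pickle
print(type(result).__name__, result)
```

The loaded object is whatever the embedded callable returned, not a `Payload`, which is why pickled data should only be loaded from sources you control.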

In [None]:
llm_chain = LLMChain(prompt=prompt, llm=llm)

In [None]:
question = "What NFL team won the Super Bowl in the year Justin Bieber was born?"

llm_chain.run(question)

You can also load more custom models through the `SelfHostedHuggingFaceLLM` interface:

In [None]:
llm = SelfHostedHuggingFaceLLM(
    model_id="google/flan-t5-small",
    task="text2text-generation",
    hardware=gpu,
)

In [None]:
llm("What is the capital of Germany?")

Using a custom load function, we can load a custom pipeline directly on the remote hardware:

In [None]:
def load_pipeline():
    from transformers import (
        AutoModelForCausalLM,
        AutoTokenizer,
        pipeline,
    )

    model_id = "gpt2"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    pipe = pipeline(
        "text-generation", model=model, tokenizer=tokenizer, max_new_tokens=10
    )
    return pipe


def inference_fn(pipeline, prompt, stop=None):
    return pipeline(prompt)[0]["generated_text"][len(prompt) :]

In [None]:
llm = SelfHostedHuggingFaceLLM(
    model_load_fn=load_pipeline, hardware=gpu, inference_fn=inference_fn
)

In [None]:
llm("Who is the current US president?")
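The slicing in `inference_fn` works because a text-generation pipeline typically returns the prompt followed by its continuation. The sketch below exercises that logic locally against a hypothetical `fake_pipeline` stub, so no model or GPU is needed. The `stop` handling is an assumed extension; the original function accepts `stop` but ignores it.

```python
from typing import List, Optional


def fake_pipeline(prompt: str) -> List[dict]:
    """Hypothetical stub: echoes the prompt and appends a fixed continuation."""
    return [{"generated_text": prompt + " It depends on the year. Next question?"}]


def inference_fn(pipeline, prompt: str, stop: Optional[List[str]] = None) -> str:
    # Drop the echoed prompt so only the newly generated text remains.
    text = pipeline(prompt)[0]["generated_text"][len(prompt):]
    # Assumed extension: truncate at the earliest stop sequence that occurs.
    if stop:
        cut = min((text.find(s) for s in stop if s in text), default=len(text))
        text = text[:cut]
    return text


prompt = "Who is the current US president?"
full = inference_fn(fake_pipeline, prompt)
trimmed = inference_fn(fake_pipeline, prompt, stop=["."])
print(repr(full))
print(repr(trimmed))
```

Testing the function against a stub like this is a cheap way to catch slicing mistakes before sending it to remote hardware.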

You can send your pipeline directly over the wire to the remote hardware, but this only works for small models (<2 GB) and is fairly slow:

In [None]:
pipeline = load_pipeline()
llm = SelfHostedPipeline.from_pipeline(
    pipeline=pipeline, hardware=gpu, model_reqs=["pip:./", "transformers", "torch"]
)
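Before sending a pipeline over the wire, it may be worth checking how large it is once serialized. The sketch below uses only the standard library. The helper names and the reading of the size guidance as 2 GiB are assumptions, and a small dictionary stands in for a real pipeline.

```python
import pickle

# Assumption: treat the "<2 GB" guidance as 2 GiB.
SIZE_LIMIT_BYTES = 2 * 1024**3


def serialized_size(obj) -> int:
    """Return the number of bytes pickle needs for obj."""
    return len(pickle.dumps(obj))


def fits_over_the_wire(obj, limit: int = SIZE_LIMIT_BYTES) -> bool:
    """True if the pickled object is smaller than the limit."""
    return serialized_size(obj) < limit


toy_model = {"weights": [0.0] * 1000}  # stand-in for a real pipeline
size = serialized_size(toy_model)
print(size, fits_over_the_wire(toy_model))
```

Note that `pickle.dumps` builds the whole byte string in memory, so for very large models this check is itself expensive.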

Alternatively, we can send it to the hardware's filesystem, which is much faster:

In [None]:
import pickle

rh.blob(pickle.dumps(pipeline), path="models/pipeline.pkl").save().to(
    gpu, path="models"
)

llm = SelfHostedPipeline.from_pipeline(pipeline="models/pipeline.pkl", hardware=gpu)