# Runhouse

[Runhouse](https://www.run.house/) lets you work with remote compute and data across environments and users. See the [Runhouse docs](https://www.run.house/docs).

This example shows how to use LangChain and [Runhouse](https://github.com/run-house/runhouse) to interact with models hosted on your own GPU, or on on-demand GPUs on AWS, GCP, Azure, or Lambda.

**Note**: The code uses the `SelfHosted` name instead of `Runhouse`.

In [None]:
%pip install --upgrade --quiet "runhouse[sky]"

In [2]:
import runhouse as rh
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain_community.llms import SelfHostedHuggingFaceLLM, SelfHostedPipeline

In [3]:
# On-demand cluster. "g5.4xlarge" is an AWS instance type with an A10G GPU
# (for an A100 on GCP, Azure, or Lambda, pick a matching instance type).
gpu = rh.cluster(name="langchain-rh-a10x", instance_type="g5.4xlarge")
gpu.up_if_not()
# For an on-demand A10G with the AWS provider set explicitly (no single A100s on AWS)
# gpu = rh.cluster(name='rh-a10x', instance_type='g5.2xlarge', provider='aws')

# For an existing cluster
# gpu = rh.cluster(ips=['<ip of the cluster>'],
#                  ssh_creds={'ssh_user': '...', 'ssh_private_key':'<path_to_key>'},
#                  name='rh-a10x')

<runhouse.resources.hardware.on_demand_cluster.OnDemandCluster at 0x107919c00>

In [4]:
model_env = rh.env(
    name="model_env",
    reqs=["transformers", "torch", "accelerate", "huggingface-hub"],
    secrets=["huggingface"],  # needed to download google/gemma-2b-it
).to(system=gpu)

INFO | 2024-03-24 13:57:50.749173 | Copying package from file:///Users/sashabelousovrh/PycharmProjects/LangchainIntegration/langchain to: langchain-rh-a10x
INFO | 2024-03-24 13:57:53.583524 | Port 32300 is already in use. Trying next port.
INFO | 2024-03-24 13:57:53.593203 | Port 32301 is already in use. Trying next port.
INFO | 2024-03-24 13:57:53.597796 | Forwarding port 32302 to port 32300 on localhost.
INFO | 2024-03-24 13:57:56.970754 | Server langchain-rh-a10x is up.
INFO | 2024-03-24 13:57:57.158541 | Calling huggingface._write_to_file


-----------------
langchain-rh-a10x
-----------------
Secrets already exist in ~/.cache/huggingface/token.

INFO | 2024-03-24 13:57:58.339523 | Time to call huggingface._write_to_file: 1.18 seconds
INFO | 2024-03-24 13:57:58.703685 | Calling model_env.install
INFO | 2024-03-24 13:57:59.877462 | Time to call model_env.install: 1.17 seconds


In [5]:
gpu.run(commands=["pip install langchain"])



[(0,
  '')]

In [6]:
llm = SelfHostedHuggingFaceLLM(
    model_id="google/gemma-2b-it",
    hardware=gpu,
    env=model_env,
)

INFO | 2024-03-24 13:58:03.435729 | Calling file_20240324_155758.exists_in_system
INFO | 2024-03-24 13:58:04.655172 | Time to call file_20240324_155758.exists_in_system: 1.22 seconds
INFO | 2024-03-24 13:58:04.658518 | Calling file_20240324_155758.resolved_state
INFO | 2024-03-24 13:58:05.831685 | Time to call file_20240324_155758.resolved_state: 1.17 seconds
INFO | 2024-03-24 13:58:05.842903 | Calling huggingface._write_to_file


Secrets already exist in .cache/huggingface/token.

INFO | 2024-03-24 13:58:07.015366 | Time to call huggingface._write_to_file: 1.17 seconds
INFO | 2024-03-24 13:58:07.380929 | Calling model_env.install
INFO | 2024-03-24 13:58:08.553067 | Time to call model_env.install: 1.17 seconds
INFO | 2024-03-24 13:58:08.566441 | Sending module LangchainLLMModelPipeline to langchain-rh-a10x
INFO | 2024-03-24 13:58:09.138016 | Calling LangchainLLMModelPipeline._remote_init
INFO | 2024-03-24 13:58:10.361149 | Time to call LangchainLLMModelPipeline._remote_init: 1.22 seconds
INFO | 2024-03-24 13:58:10.368231 | Calling file_20240324_155807.exists_in_system
INFO | 2024-03-24 13:58:11.542731 | Time to call file_20240324_155807.exists_in_system: 1.17 seconds
INFO | 2024-03-24 13:58:11.547294 | Calling file_20240324_155807.resolved_state
INFO | 2024-03-24 13:58:12.745853 | Time to call file_20240324_155807.resolved_state: 1.2 seconds
INFO | 2024-03-24 13:58:12.752305 | Calling LangchainLLMModelPipeline.load_model


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards:  50%|█████     | 1/2 [00:03<00:03,  3.14s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:03<00:00,  1.39s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:03<00:00,  1.66s/it]

INFO | 2024-03-24 13:58:17.981402 | Time to call LangchainLLMModelPipeline.load_model: 5.23 seconds


In [7]:
template = """Question: {question}

Answer: Let's think step by step."""

In [8]:
prompt = PromptTemplate.from_template(template)
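
The `{question}` placeholder in the template is filled in when the chain runs. As a rough illustration only, the following plain-Python sketch approximates that substitution with `str.format`; it assumes the default f-string-style template format, and `render_prompt` is a hypothetical helper, not part of LangChain.

```python
# Plain-Python approximation of how the {question} placeholder gets filled.
# Assumption: the template uses f-string-style placeholders, so str.format
# produces the same text. render_prompt is a hypothetical helper.
template = """Question: {question}

Answer: Let's think step by step."""


def render_prompt(question: str) -> str:
    """Substitute the question into the template string."""
    return template.format(question=question)


rendered = render_prompt("What is the capital of Germany?")
print(rendered)
```

The rendered string is the text that ends up being sent to the remote model, which is why the generated answer continues from "Let's think step by step."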

In [9]:
llm_chain = LLMChain(prompt=prompt, llm=llm)

In [10]:
question = "What is the capital of Germany?"

llm_chain.invoke(question)

INFO | 2024-03-24 13:58:18.040352 | Calling LangchainLLMModelPipeline.interface_fn
INFO | 2024-03-24 14:00:22.353148 | Time to call LangchainLLMModelPipeline.interface_fn: 124.31 seconds


{'question': 'What is the capital of Germany?',
 'text': '\n\nThe word "Germany" refers to a country in Western Europe that is located between the River Rhine and the River Danube.\n\nThe capital city of Germany is Berlin.\n\nTherefore, the capital of Germany is Berlin.'}
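
The chain returns a dictionary holding the input under `question` and the generated answer under `text`. A minimal sketch of pulling out the final sentence of the answer, using the result shown above as a literal; `final_answer` is a hypothetical helper introduced here for illustration, and it assumes the model separates its reasoning steps with blank lines, as it did in this run.

```python
# Result dictionary copied from the chain output above.
result = {
    "question": "What is the capital of Germany?",
    "text": (
        '\n\nThe word "Germany" refers to a country in Western Europe that is '
        "located between the River Rhine and the River Danube.\n\n"
        "The capital city of Germany is Berlin.\n\n"
        "Therefore, the capital of Germany is Berlin."
    ),
}


def final_answer(chain_result: dict) -> str:
    """Return the last non-empty paragraph of the generated text."""
    paragraphs = [p.strip() for p in chain_result["text"].split("\n\n") if p.strip()]
    return paragraphs[-1]


answer = final_answer(result)
print(answer)  # Therefore, the capital of Germany is Berlin.
```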

You can also call the model directly, without going through a chain:


In [11]:
llm.invoke("Write me a short poem about Super Bowl")

INFO | 2024-03-24 14:00:22.377121 | Calling LangchainLLMModelPipeline.interface_fn
INFO | 2024-03-24 14:04:15.849123 | Time to call LangchainLLMModelPipeline.interface_fn: 233.47 seconds


' Sunday.\n\nBright lights paint the stadium floor,\nA symphony of cheers and roars.\nFamilies gather, hand in hand,\nTo watch their heroes on the grandest stage.\n\nThe crowd roars loud, a thunderous beat,\nAs the game unfolds, a thrilling feat.\nThe halftime show ignites the sky,\nA spectacle that leaves a joyous cry.\n\nSuper Bowl Sunday, a spectacle to behold,\nA moment to cherish, a story to be told.\nThe anticipation hangs in air,\nAs the excitement reaches its peak.'
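
The log lines printed during the two calls report how long each remote invocation took (124.31 and 233.47 seconds in this run). If you want to track those latencies, one option is to parse them out of captured log text. The sketch below assumes the `Time to call <name>: <seconds> seconds` line format seen in the logs above; `call_durations` is a hypothetical helper, not a Runhouse API.

```python
import re

# Log lines copied from the two model calls above.
log_lines = [
    "INFO | 2024-03-24 13:58:18.040352 | Calling LangchainLLMModelPipeline.interface_fn",
    "INFO | 2024-03-24 14:00:22.353148 | Time to call LangchainLLMModelPipeline.interface_fn: 124.31 seconds",
    "INFO | 2024-03-24 14:00:22.377121 | Calling LangchainLLMModelPipeline.interface_fn",
    "INFO | 2024-03-24 14:04:15.849123 | Time to call LangchainLLMModelPipeline.interface_fn: 233.47 seconds",
]

# Assumed line format: "Time to call <dotted.name>: <float> seconds"
TIMING = re.compile(r"Time to call (?P<name>[\w.]+): (?P<seconds>[\d.]+) seconds")


def call_durations(lines):
    """Return (function name, seconds) pairs for every timing line found."""
    durations = []
    for line in lines:
        match = TIMING.search(line)
        if match:
            durations.append((match.group("name"), float(match.group("seconds"))))
    return durations


durations = call_durations(log_lines)
for name, seconds in durations:
    print(f"{name}: {seconds:.2f} s")
```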