# Runhouse

[Runhouse](https://github.com/run-house/runhouse) enables remote compute and data sharing across environments and users. See the [Runhouse docs](https://runhouse-docs.readthedocs-hosted.com/en/latest/).

This example shows how to use LangChain and [Runhouse](https://github.com/run-house/runhouse) to interact with models hosted on your own GPU, or on on-demand GPUs on AWS, GCP, Azure, or Lambda.

**Note**: The code uses the `SelfHosted` name instead of `Runhouse`.

In [1]:
%pip install --upgrade --quiet runhouse
%pip install --upgrade --quiet "skypilot[aws]"



In [1]:
import runhouse as rh
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain_community.llms import SelfHostedHuggingFaceLLM, SelfHostedPipeline

In [5]:
# For an on-demand A10G on AWS (here a g5.4xlarge in eu-central-1)
gpu = rh.cluster(name='sasha-rh-a10x', instance_type='g5.4xlarge', provider='aws', region='eu-central-1')
gpu.run(commands=["pip install langchain"])

# For an on-demand A10G with AWS (no single A100s on AWS)
# gpu = rh.cluster(name='rh-a10x', instance_type='g5.2xlarge', provider='aws')

# For an existing cluster
# gpu = rh.cluster(ips=['<ip of the cluster>'],
#                  ssh_creds={'ssh_user': '...', 'ssh_private_key':'<path_to_key>'},
#                  name='rh-a10x')

INFO | 2024-03-20 09:37:14.735394 | Saving config for sasha-rh-a10x-ssh-secret to Den
INFO | 2024-03-20 09:37:14.898213 | Saving secrets for sasha-rh-a10x-ssh-secret to Vault


Collecting langchain
  Downloading langchain-0.1.12-py3-none-any.whl (809 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 809.1/809.1 kB 23.2 MB/s eta 0:00:00
Collecting langsmith<0.2.0,>=0.1.17
  Downloading langsmith-0.1.31-py3-none-any.whl (71 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 71.6/71.6 kB 13.1 MB/s eta 0:00:00
Collecting dataclasses-json<0.7,>=0.5.7
  Downloading dataclasses_json-0.6.4-py3-none-any.whl (28 kB)
Collecting tenacity<9.0.0,>=8.1.0
  Downloading tenacity-8.2.3-py3-none-any.whl (24 kB)
Collecting jsonpatch<2.0,>=1.33
  Downloading jsonpatch-1.33-py2.py3-none-any.whl (12 kB)
Collecting langchain-core<0.2.0,>=0.1.31
  Downloading langchain_core-0.1.32-py3-none-any.whl (260 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 260.9/260.9 kB 30.8 MB/s eta 0:00:00
Collecting langchain-text-splitters<0.1,>=0.0.1
  Downloading langchain_text_splitters-0.0.1-py3-none-any.whl (21 kB)
Collecting langchain-community<0.1,>=0.0.28
  Downloading langchain_community-0.0.28-p

[(0,
  '')]

In [3]:
model_env = rh.env(
    name="model_env15",
    reqs=["transformers", "torch", "accelerate", "huggingface-hub"],
    secrets=["huggingface"]  # needed to download google/gemma-2b-it
)

In [6]:
llm = SelfHostedHuggingFaceLLM(model_id="google/gemma-2b-it", hardware=gpu, env=model_env)

INFO | 2024-03-20 09:37:41.416454 | SSH tunnel on to server's port 32300 via server's ssh port 22 already created with the cluster.
INFO | 2024-03-20 09:37:41.675983 | Server sasha-rh-a10x is up.
INFO | 2024-03-20 09:37:41.681327 | Copying package from file:///Users/sashabelousovrh/PycharmProjects/LangchainIntegration/langchain to: sasha-rh-a10x
INFO | 2024-03-20 09:37:43.965711 | Calling huggingface._write_to_file


Secrets already exist in ~/.cache/huggingface/token.

INFO | 2024-03-20 09:37:45.115689 | Time to call huggingface._write_to_file: 1.15 seconds


INFO | 2024-03-20 09:37:50.178477 | Calling model_env15.install
INFO | 2024-03-20 09:37:51.321045 | Time to call model_env15.install: 1.14 seconds


INFO | 2024-03-20 09:37:56.039099 | Sending module LangchainLLMModelPipeline to sasha-rh-a10x



INFO | 2024-03-20 09:38:05.927212 | Calling LangchainLLMModelPipeline._remote_init
INFO | 2024-03-20 09:38:07.083693 | Time to call LangchainLLMModelPipeline._remote_init: 1.16 seconds
INFO | 2024-03-20 09:38:07.089691 | Calling LangchainLLMModelPipeline.load_model


gemma-2b-it is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo either by logging in with `huggingface-cli login` or by passing `token=<your_token>`
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/utils/_errors.py", line 304, in hf_raise_for_status
    response.raise_for_status()
  File "/opt/conda/lib/python3.10/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/gemma-2b-it/resolve/main/tokenizer_config.json

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/lib

ERROR | 2024-03-20 09:38:18.259120 | Error calling load_model on LangchainLLMModelPipeline on server: gemma-2b-it is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo either by logging in with `huggingface-cli login` or by passing `token=<your_token>`
ERROR | 2024-03-20 09:38:18.262832 | Traceback: Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/utils/_errors.py", line 304, in hf_raise_for_status
    response.raise_for_status()
  File "/opt/conda/lib/python3.10/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/gemma-2b-it/resolve/main/tokenizer_config.json

The above exception was the direct cause of the following exception:

Traceb

OSError: gemma-2b-it is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo either by logging in with `huggingface-cli login` or by passing `token=<your_token>`

In [3]:
template = """Question: {question}

Answer: Let's think step by step."""

In [5]:
prompt = PromptTemplate.from_template(template)
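
To see what the model will actually receive, the template can be rendered by hand. The sketch below uses plain `str.format` as a stand-in for `PromptTemplate.format`, on the assumption that the template contains only simple `{question}` placeholders; the sample question is illustrative.

```python
# Stand-in for PromptTemplate.format: plain str.format on the same template.
template = """Question: {question}

Answer: Let's think step by step."""

# Fill the single {question} placeholder with an illustrative question.
rendered = template.format(question="What is the capital of Germany?")
print(rendered)
```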

In [None]:
llm_chain = LLMChain(prompt=prompt, llm=llm)

In [None]:
question = "What NFL team won the Super Bowl in the year Justin Bieber was born?"

llm_chain.run(question)

You can also load other models through the `SelfHostedHuggingFaceLLM` interface:

In [None]:
llm = SelfHostedHuggingFaceLLM(
    model_id="google/flan-t5-small",
    task="text2text-generation",
    hardware=gpu,
)

In [None]:
llm("What is the capital of Germany?")

Using a custom load function, we can load a custom pipeline directly on the remote hardware:

In [None]:
def load_pipeline():
    from transformers import (
        AutoModelForCausalLM,
        AutoTokenizer,
        pipeline,
    )

    model_id = "gpt2"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    pipe = pipeline(
        "text-generation", model=model, tokenizer=tokenizer, max_new_tokens=10
    )
    return pipe


def inference_fn(pipeline, prompt, stop=None):
    return pipeline(prompt)[0]["generated_text"][len(prompt) :]
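
The slicing in `inference_fn` relies on the text-generation pipeline returning the prompt followed by the continuation. A minimal sketch of that behavior, using a hypothetical `fake_pipeline` stub in place of a real `transformers` pipeline so it runs without a GPU or a model download:

```python
def fake_pipeline(prompt):
    # Hypothetical stub: mimics the output shape of a text-generation
    # pipeline, which echoes the prompt before the generated continuation.
    return [{"generated_text": prompt + " Berlin."}]


def inference_fn(pipeline, prompt, stop=None):
    # Drop the echoed prompt so only the newly generated text is returned.
    return pipeline(prompt)[0]["generated_text"][len(prompt):]


completion = inference_fn(fake_pipeline, "The capital of Germany is")
print(repr(completion))
```

Note that `stop` is accepted but unused in this function; any stop-sequence handling would have to be added inside it.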

In [None]:
llm = SelfHostedHuggingFaceLLM(
    model_load_fn=load_pipeline, hardware=gpu, inference_fn=inference_fn
)

In [None]:
llm("Who is the current US president?")

You can send your pipeline directly over the wire to the remote hardware, but this only works for small models (<2 GB) and is fairly slow:

In [None]:
pipeline = load_pipeline()
llm = SelfHostedPipeline.from_pipeline(
    pipeline=pipeline, hardware=gpu, model_reqs=["pip:./", "transformers", "torch"]
)
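
A rough way to judge whether an object is small enough to send this way is to measure its pickled size first. The sketch below uses a small dictionary as a hypothetical stand-in for a real pipeline object and treats the 2 GB figure mentioned above as the threshold; how Runhouse actually serializes the pipeline is an assumption here.

```python
import pickle

# Hypothetical stand-in for a pipeline object; a real pipeline would be far larger.
stand_in_pipeline = {"task": "text-generation", "max_new_tokens": 10}

SIZE_LIMIT_BYTES = 2 * 1024**3  # 2 GB threshold taken from the text

# Serialize the object and compare the payload size against the threshold.
payload = pickle.dumps(stand_in_pipeline)
fits = len(payload) < SIZE_LIMIT_BYTES
print(f"pickled size: {len(payload)} bytes, under limit: {fits}")
```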

Alternatively, we can save it to the hardware's filesystem, which is much faster:

In [None]:
import pickle

rh.blob(pickle.dumps(pipeline), path="models/pipeline.pkl").save().to(
    gpu, path="models"
)

llm = SelfHostedPipeline.from_pipeline(pipeline="models/pipeline.pkl", hardware=gpu)
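
Conceptually, the remote side then only has to read the file and unpickle it. A local sketch of that round trip, using a temporary directory in place of the cluster's filesystem and a dictionary as a hypothetical stand-in for the pipeline (the actual loading logic inside `SelfHostedPipeline` is not shown in this example and may differ):

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical stand-in for the pipeline object.
stand_in_pipeline = {"task": "text-generation", "max_new_tokens": 10}

with tempfile.TemporaryDirectory() as tmp_dir:
    # Mirror the "models/pipeline.pkl" layout used above, but locally.
    pkl_path = Path(tmp_dir) / "models" / "pipeline.pkl"
    pkl_path.parent.mkdir(parents=True)
    pkl_path.write_bytes(pickle.dumps(stand_in_pipeline))

    # What a loader on the remote machine would conceptually do.
    restored = pickle.loads(pkl_path.read_bytes())

print(restored == stand_in_pipeline)
```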