14 changes: 7 additions & 7 deletions README.md
@@ -8,7 +8,7 @@ Features:
- Remote Inferencing: Perform inference tasks remotely with Llama models hosted on a remote server (or running on localhost).
- Simple Integration: With easy-to-use APIs, a developer can quickly integrate Llama Stack into their Android app. The difference between local and remote inferencing is minimal.

Latest Release Notes: [v0.1.0](https://github.com/meta-llama/llama-stack-client-kotlin/releases/tag/v0.1.0)
Latest Release Notes: [v0.1.2](https://github.com/meta-llama/llama-stack-client-kotlin/releases/tag/v0.1.2)

*Tagged releases are stable versions of the project. While we strive to maintain a stable main branch, it's not guaranteed to be free of bugs or issues.*

@@ -24,7 +24,7 @@ The key files in the app are `ExampleLlamaStackLocalInference.kt`, `ExampleLlama
Add the following dependency in your `build.gradle.kts` file:
```
dependencies {
implementation("com.llama.llamastack:llama-stack-client-kotlin:0.1.0")
implementation("com.llama.llamastack:llama-stack-client-kotlin:0.1.2")
}
```
This downloads the JAR files into your Gradle cache, under a directory like `~/.gradle/caches/modules-2/files-2.1/com.llama.llamastack/`.
@@ -60,7 +60,7 @@ Start a Llama Stack server on localhost. Here is an example of how you can do th
```
conda create -n stack-fireworks python=3.10
conda activate stack-fireworks
pip install llama-stack=0.1.0
pip install llama-stack==0.1.2
llama stack build --template fireworks --image-type conda
export FIREWORKS_API_KEY=<SOME_KEY>
llama stack run /Users/<your_username>/.llama/distributions/llamastack-fireworks/fireworks-run.yaml --port=5050
@@ -99,7 +99,7 @@ client = LlamaStackClientLocalClient
client = LlamaStackClientOkHttpClient
.builder()
.baseUrl(remoteURL)
.headers(mapOf("x-llamastack-client-version" to listOf("0.1.0")))
.headers(mapOf("x-llamastack-client-version" to listOf("0.1.2")))
.build()
```
</td>
@@ -258,7 +258,7 @@ val result = client!!.inference().chatCompletion(
)

// `response` holds the model's reply as a string
var response = result.asChatCompletionResponse().completionMessage().content().string();
var response = result.completionMessage().content().string();
```
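
With this change, `chatCompletion()` returns a `ChatCompletionResponse` directly, so no union unwrapping is needed. A fuller call site might look like the following sketch; the `modelId()` and `messages()` builder methods and the model identifier are assumptions, not confirmed by this diff:
```
// Sketch only: modelId()/messages() builder methods are assumptions.
val params = InferenceChatCompletionParams.builder()
    .modelId("meta-llama/Llama-3.2-3B-Instruct") // assumed model identifier
    .messages(messages)                          // a previously built list of messages
    .build()

// chatCompletion() now returns ChatCompletionResponse directly, so the reply text
// is read straight from completionMessage() with no union unwrapping.
val result = client!!.inference().chatCompletion(params)
val response = result.completionMessage().content().string()
```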

[Remote only] For inference with a streaming response:
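
A minimal sketch of a streaming call under the new `ChatCompletionResponseStreamChunk` type follows; only `chatCompletionStreaming()` and `asSequence()` appear in this PR, while the `event()` and `delta()` accessors are assumptions inferred from the builder calls elsewhere in the diff:
```
// Sketch only: event()/delta() accessor names are assumptions.
val streamResult = client!!.inference().chatCompletionStreaming(params)

streamResult.asSequence().forEach { chunk ->
    // Each chunk is a ChatCompletionResponseStreamChunk; its event's delta carries
    // the next fragment of the generated text.
    println(chunk.event().delta())
}
```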
@@ -286,7 +286,7 @@ The purpose of this section is to share more details with users that would like
### Prerequisite

You must complete the following steps:
1. Clone the repo (`git clone https://github.com/meta-llama/llama-stack-client-kotlin.git -b release/0.1.0`)
1. Clone the repo (`git clone https://github.com/meta-llama/llama-stack-client-kotlin.git -b release/0.1.2`)
2. Port the appropriate ExecuTorch libraries into your Llama Stack Kotlin library environment.
```
cd llama-stack-client-kotlin-client-local
@@ -309,7 +309,7 @@ Copy the .jar files over to the lib directory in your Android app. At the same t
### Additional Options for Local Inferencing
Currently we provide additional-properties support with local inferencing. To get the tokens/sec metric for each inference call, add the following code to your Android app after running your chatCompletion inference function. The Reference app includes this implementation as well:
```
var tps = (result.asChatCompletionResponse()._additionalProperties()["tps"] as JsonNumber).value as Float
var tps = (result._additionalProperties()["tps"] as JsonNumber).value as Float
```
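
If you prefer to avoid an unchecked cast, a slightly more defensive variant of the same read is sketched below; it assumes the `tps` key can be absent (for example, when additional properties are not populated):
```
// Sketch: guard against a missing "tps" key before casting (assumption: the key
// may be absent when additional properties are not populated).
val tps = (result._additionalProperties()["tps"] as? JsonNumber)?.value as? Float
tps?.let { println("tokens/sec: $it") }
```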
We will be adding more properties in the future.

2 changes: 1 addition & 1 deletion build.gradle.kts
@@ -4,5 +4,5 @@ plugins {

allprojects {
group = "com.llama.llamastack"
version = "0.1.0"
version = "0.1.2"
}
@@ -8,11 +8,12 @@ import com.llama.llamastack.client.local.util.buildInferenceChatCompletionRespon
import com.llama.llamastack.client.local.util.buildLastInferenceChatCompletionResponsesFromStream
import com.llama.llamastack.core.RequestOptions
import com.llama.llamastack.core.http.StreamResponse
import com.llama.llamastack.models.ChatCompletionResponse
import com.llama.llamastack.models.ChatCompletionResponseStreamChunk
import com.llama.llamastack.models.CompletionResponse
import com.llama.llamastack.models.EmbeddingsResponse
import com.llama.llamastack.models.InferenceChatCompletionParams
import com.llama.llamastack.models.InferenceChatCompletionResponse
import com.llama.llamastack.models.InferenceCompletionParams
import com.llama.llamastack.models.InferenceCompletionResponse
import com.llama.llamastack.models.InferenceEmbeddingsParams
import com.llama.llamastack.services.blocking.InferenceService
import org.pytorch.executorch.LlamaCallback
@@ -31,7 +32,7 @@ constructor(
private var sequenceLengthKey: String = "seq_len"
private var stopToken: String = ""

private val streamingResponseList = mutableListOf<InferenceChatCompletionResponse>()
private val streamingResponseList = mutableListOf<ChatCompletionResponseStreamChunk>()
private var isStreaming: Boolean = false

private val waitTime: Long = 100
@@ -69,7 +70,7 @@ constructor(
override fun chatCompletion(
params: InferenceChatCompletionParams,
requestOptions: RequestOptions
): InferenceChatCompletionResponse {
): ChatCompletionResponse {
isStreaming = false
clearElements()
val mModule = clientOptions.llamaModule
@@ -99,8 +100,8 @@ constructor(
}

private val streamResponse =
object : StreamResponse<InferenceChatCompletionResponse> {
override fun asSequence(): Sequence<InferenceChatCompletionResponse> {
object : StreamResponse<ChatCompletionResponseStreamChunk> {
override fun asSequence(): Sequence<ChatCompletionResponseStreamChunk> {
return sequence {
while (!onResultComplete || streamingResponseList.isNotEmpty()) {
if (streamingResponseList.isNotEmpty()) {
@@ -132,7 +133,7 @@ constructor(
override fun chatCompletionStreaming(
params: InferenceChatCompletionParams,
requestOptions: RequestOptions
): StreamResponse<InferenceChatCompletionResponse> {
): StreamResponse<ChatCompletionResponseStreamChunk> {
isStreaming = true
streamingResponseList.clear()
resultMessage = ""
@@ -156,14 +157,14 @@ constructor(
override fun completion(
params: InferenceCompletionParams,
requestOptions: RequestOptions
): InferenceCompletionResponse {
): CompletionResponse {
TODO("Not yet implemented")
}

override fun completionStreaming(
params: InferenceCompletionParams,
requestOptions: RequestOptions
): StreamResponse<InferenceCompletionResponse> {
): StreamResponse<CompletionResponse> {
TODO("Not yet implemented")
}

@@ -60,6 +60,10 @@ constructor(
TODO("Not yet implemented")
}

override fun close() {
TODO("Not yet implemented")
}

override fun agents(): AgentService {
TODO("Not yet implemented")
}
@@ -1,17 +1,19 @@
package com.llama.llamastack.client.local.util

import com.llama.llamastack.core.JsonValue
import com.llama.llamastack.models.ChatCompletionResponse
import com.llama.llamastack.models.ChatCompletionResponseStreamChunk
import com.llama.llamastack.models.CompletionMessage
import com.llama.llamastack.models.ContentDelta
import com.llama.llamastack.models.InferenceChatCompletionResponse
import com.llama.llamastack.models.InterleavedContent
import com.llama.llamastack.models.ToolCall
import java.util.UUID

fun buildInferenceChatCompletionResponse(
response: String,
stats: Float,
stopToken: String
): InferenceChatCompletionResponse {
): ChatCompletionResponse {
// check for prefix [ and suffix ] if so then tool call.
// parse for "toolName", "additionalProperties"
var completionMessage =
@@ -30,41 +32,33 @@ fun buildInferenceChatCompletionResponse(
.build()
}

var inferenceChatCompletionResponse =
InferenceChatCompletionResponse.ofChatCompletionResponse(
InferenceChatCompletionResponse.ChatCompletionResponse.builder()
.completionMessage(completionMessage)
.putAdditionalProperty("tps", JsonValue.from(stats))
.build()
)
val inferenceChatCompletionResponse =
ChatCompletionResponse.builder()
.completionMessage(completionMessage)
.putAdditionalProperty("tps", JsonValue.from(stats))
.build()
return inferenceChatCompletionResponse
}

fun buildInferenceChatCompletionResponseFromStream(
response: String,
): InferenceChatCompletionResponse {
return InferenceChatCompletionResponse.ofChatCompletionResponseStreamChunk(
InferenceChatCompletionResponse.ChatCompletionResponseStreamChunk.builder()
.event(
InferenceChatCompletionResponse.ChatCompletionResponseStreamChunk.Event.builder()
.delta(ContentDelta.TextDelta.builder().text(response).build())
.eventType(
InferenceChatCompletionResponse.ChatCompletionResponseStreamChunk.Event
.EventType
.PROGRESS
)
.build()
)
.build()
)
): ChatCompletionResponseStreamChunk {
return ChatCompletionResponseStreamChunk.builder()
.event(
ChatCompletionResponseStreamChunk.Event.builder()
.delta(ContentDelta.TextDelta.builder().text(response).build())
.eventType(ChatCompletionResponseStreamChunk.Event.EventType.PROGRESS)
.build()
)
.build()
}

fun buildLastInferenceChatCompletionResponsesFromStream(
resultMessage: String,
stats: Float,
stopToken: String,
): List<InferenceChatCompletionResponse> {
val listOfResponses: MutableList<InferenceChatCompletionResponse> = mutableListOf()
): List<ChatCompletionResponseStreamChunk> {
val listOfResponses: MutableList<ChatCompletionResponseStreamChunk> = mutableListOf()
if (isResponseAToolCall(resultMessage)) {
val toolCalls = createCustomToolCalls(resultMessage)
for (toolCall in toolCalls) {
@@ -83,73 +77,51 @@ fun buildLastInferenceChatCompletionResponsesFromStream(
}

fun buildInferenceChatCompletionResponseForCustomToolCallStream(
toolCall: CompletionMessage.ToolCall,
toolCall: ToolCall,
stopToken: String,
stats: Float
): InferenceChatCompletionResponse {
): ChatCompletionResponseStreamChunk {
val delta =
ContentDelta.ToolCallDelta.builder()
.parseStatus(ContentDelta.ToolCallDelta.ParseStatus.SUCCEEDED)
.toolCall(
ContentDelta.ToolCallDelta.ToolCall.InnerToolCall.builder()
.toolName(toolCall.toolName().toString())
.arguments(
ContentDelta.ToolCallDelta.ToolCall.InnerToolCall.Arguments.builder()
.additionalProperties(toolCall.arguments()._additionalProperties())
.build()
)
.callId(toolCall.callId())
.build()
)
.toolCall(toolCall)
.build()
return InferenceChatCompletionResponse.ofChatCompletionResponseStreamChunk(
InferenceChatCompletionResponse.ChatCompletionResponseStreamChunk.builder()
.event(
InferenceChatCompletionResponse.ChatCompletionResponseStreamChunk.Event.builder()
.delta(delta)
.stopReason(mapStopTokenToReasonForStream(stopToken))
.eventType(
InferenceChatCompletionResponse.ChatCompletionResponseStreamChunk.Event
.EventType
.PROGRESS
)
.build()
)
.putAdditionalProperty("tps", JsonValue.from(stats))
.build()
)
return ChatCompletionResponseStreamChunk.builder()
.event(
ChatCompletionResponseStreamChunk.Event.builder()
.delta(delta)
.stopReason(mapStopTokenToReasonForStream(stopToken))
.eventType(ChatCompletionResponseStreamChunk.Event.EventType.PROGRESS)
.build()
)
.putAdditionalProperty("tps", JsonValue.from(stats))
.build()
}

fun buildInferenceChatCompletionResponseForStringStream(
str: String,
stopToken: String,
stats: Float
): InferenceChatCompletionResponse {
): ChatCompletionResponseStreamChunk {

return InferenceChatCompletionResponse.ofChatCompletionResponseStreamChunk(
InferenceChatCompletionResponse.ChatCompletionResponseStreamChunk.builder()
.event(
InferenceChatCompletionResponse.ChatCompletionResponseStreamChunk.Event.builder()
.delta(ContentDelta.TextDelta.builder().text(str).build())
.stopReason(mapStopTokenToReasonForStream(stopToken))
.eventType(
InferenceChatCompletionResponse.ChatCompletionResponseStreamChunk.Event
.EventType
.PROGRESS
)
.putAdditionalProperty("tps", JsonValue.from(stats))
.build()
)
.build()
)
return ChatCompletionResponseStreamChunk.builder()
.event(
ChatCompletionResponseStreamChunk.Event.builder()
.delta(ContentDelta.TextDelta.builder().text(str).build())
.stopReason(mapStopTokenToReasonForStream(stopToken))
.eventType(ChatCompletionResponseStreamChunk.Event.EventType.PROGRESS)
.putAdditionalProperty("tps", JsonValue.from(stats))
.build()
)
.build()
}

fun isResponseAToolCall(response: String): Boolean {
return response.startsWith("[") && response.endsWith("]")
}

fun createCustomToolCalls(response: String): List<CompletionMessage.ToolCall> {
val toolCalls: MutableList<CompletionMessage.ToolCall> = mutableListOf()
fun createCustomToolCalls(response: String): List<ToolCall> {
val toolCalls: MutableList<ToolCall> = mutableListOf()

val splitsResponse = response.split("),")
for (split in splitsResponse) {
@@ -170,13 +142,9 @@ fun createCustomToolCalls(response: String): List<CompletionMessage.ToolCall> {
}
}
toolCalls.add(
CompletionMessage.ToolCall.builder()
.toolName(CompletionMessage.ToolCall.ToolName.of(toolName))
.arguments(
CompletionMessage.ToolCall.Arguments.builder()
.additionalProperties(paramsJson)
.build()
)
ToolCall.builder()
.toolName(toolName)
.arguments(ToolCall.Arguments.builder().additionalProperties(paramsJson).build())
.callId(UUID.randomUUID().toString())
.build()
)
@@ -194,15 +162,9 @@ fun mapStopTokenToReason(stopToken: String): CompletionMessage.StopReason =

fun mapStopTokenToReasonForStream(
stopToken: String
): InferenceChatCompletionResponse.ChatCompletionResponseStreamChunk.Event.StopReason =
): ChatCompletionResponseStreamChunk.Event.StopReason =
when (stopToken) {
"<|eot_id|>" ->
InferenceChatCompletionResponse.ChatCompletionResponseStreamChunk.Event.StopReason
.END_OF_TURN
"<|eom_id|>" ->
InferenceChatCompletionResponse.ChatCompletionResponseStreamChunk.Event.StopReason
.END_OF_MESSAGE
else ->
InferenceChatCompletionResponse.ChatCompletionResponseStreamChunk.Event.StopReason
.OUT_OF_TOKENS
"<|eot_id|>" -> ChatCompletionResponseStreamChunk.Event.StopReason.END_OF_TURN
"<|eom_id|>" -> ChatCompletionResponseStreamChunk.Event.StopReason.END_OF_MESSAGE
else -> ChatCompletionResponseStreamChunk.Event.StopReason.OUT_OF_TOKENS
}
@@ -21,7 +21,8 @@ class LlamaStackClientOkHttpClient private constructor() {
fun fromEnv(): LlamaStackClientClient = builder().fromEnv().build()
}

class Builder {
/** A builder for [LlamaStackClientOkHttpClient]. */
class Builder internal constructor() {

private var clientOptions: ClientOptions.Builder = ClientOptions.builder()
private var baseUrl: String = ClientOptions.PRODUCTION_URL
@@ -128,6 +129,8 @@ class LlamaStackClientOkHttpClient private constructor() {
clientOptions.responseValidation(responseValidation)
}

fun apiKey(apiKey: String?) = apply { clientOptions.apiKey(apiKey) }

fun fromEnv() = apply { clientOptions.fromEnv() }

fun build(): LlamaStackClientClient =