✨ [Tasks] JSON Schema spec for Inference types + TS type generation #449

Merged on Jan 26, 2024 (51 commits)
Changes from 28 commits
Commits
7c50482
add JSON schema spec for audio-classification
SBrandeis Jan 19, 2024
fd98112
add JSON schema spec for text-generation
SBrandeis Jan 19, 2024
352e7c5
✨ Add script to generate inference types
SBrandeis Jan 19, 2024
5551f5b
Add generated code
SBrandeis Jan 19, 2024
fad594b
💄format with pnpm
SBrandeis Jan 19, 2024
9a8f327
misc fix
SBrandeis Jan 19, 2024
02ba10c
✨ Add specs for existing tasks
SBrandeis Jan 19, 2024
93c37f5
🩹 Ignore placeholder when generating code
SBrandeis Jan 19, 2024
bbf72ec
🩹 Fix: ensure spec files exist
SBrandeis Jan 19, 2024
16a9beb
✨ Generate inference types for existing tasks
SBrandeis Jan 19, 2024
b27846c
✨ Support cross-file references
SBrandeis Jan 19, 2024
7d9a9f6
regen following header change
SBrandeis Jan 19, 2024
dbd0254
✨ Add text2text-generation task & reference it from summarization/tra…
SBrandeis Jan 19, 2024
6d90348
♻️ Use $id, $defs & title
SBrandeis Jan 19, 2024
d027115
✨ Add sentence similarity task spec
SBrandeis Jan 19, 2024
224c039
fix typo in text2text-generation spec
SBrandeis Jan 22, 2024
b8dae86
regenerate code
SBrandeis Jan 22, 2024
b84825e
Have text-to-speech refer to text-to-audio
SBrandeis Jan 22, 2024
4484e39
regenerate code
SBrandeis Jan 22, 2024
a9c9ae1
Add quicktype-core from fork
SBrandeis Jan 23, 2024
f9fd4f9
regenerate code
SBrandeis Jan 23, 2024
d4ec535
💄format with pnpm
SBrandeis Jan 23, 2024
00501a6
Add canonicalId to TaskData
SBrandeis Jan 23, 2024
29fecc0
Fix naming for bounding boxes types
SBrandeis Jan 23, 2024
d220a9b
♻️ Better names for intermediate types
SBrandeis Jan 23, 2024
49a1d50
✨ Update placeholder
SBrandeis Jan 23, 2024
f4784bf
Changes from code review
SBrandeis Jan 23, 2024
a33987f
mark image & question as required in doc QA
SBrandeis Jan 23, 2024
6558af4
Document QA: rename input element to inputsingle
SBrandeis Jan 24, 2024
0724e26
No batching
SBrandeis Jan 25, 2024
29f5975
rename input -> data
SBrandeis Jan 25, 2024
3a98f58
enable explicit-unions when generating
SBrandeis Jan 25, 2024
e0a4939
tweaks
SBrandeis Jan 25, 2024
2d46399
🩹 Don't use require in rootDirFinder
SBrandeis Jan 26, 2024
c1151c0
Explicit titles
SBrandeis Jan 26, 2024
077a88f
Post-process hack to generate array types
SBrandeis Jan 26, 2024
6b10c4d
regenerate code
SBrandeis Jan 26, 2024
c35fe85
📝 Some comments
SBrandeis Jan 26, 2024
6f1a8b3
💄 Lint
SBrandeis Jan 26, 2024
9d25d28
Add text-to-image pipeline
SBrandeis Jan 26, 2024
499ed5f
Update image-to-image output
SBrandeis Jan 26, 2024
bf48f5e
Update image-to-image inputs
SBrandeis Jan 26, 2024
49a8151
Factorize generate parameters
SBrandeis Jan 26, 2024
e4f3d13
Correclty type ASR output
SBrandeis Jan 26, 2024
826181a
wip: spec generate parameters
SBrandeis Jan 26, 2024
0000f02
♻️ Factorize common classification types
SBrandeis Jan 26, 2024
8dc4d17
fix: await writefile in post process
SBrandeis Jan 26, 2024
9ccb3a4
add scheduler param
SBrandeis Jan 26, 2024
accdeff
rename schema-utls to common-definitions
SBrandeis Jan 26, 2024
3a3d4ba
proper type for table QA
SBrandeis Jan 26, 2024
4742c9e
oops I forgot to commit the new file after rename
SBrandeis Jan 26, 2024
10 changes: 7 additions & 3 deletions packages/tasks/package.json
@@ -24,9 +24,10 @@
"format": "prettier --write .",
"format:check": "prettier --check .",
"prepublishOnly": "pnpm run build",
- "build": "tsup src/index.ts --format cjs,esm --clean --dts",
+ "build": "tsup src/index.ts src/scripts/**.ts --format cjs,esm --clean --dts",
Member

tsup is used for files to be published.

If you just want to run the script, you can take inspiration from doc-internal:

node --experimental-specifier-resolution=node --loader ts-node/esm scripts/inference-codegen.ts

No need to add ts-node, it's already included in the root package

Contributor Author

^ That command fails with a syntax error (cannot use import outside of a module, or something like that).

Is it OK to use tsc to compile the script in pnpm run inference-codegen?

Member

> that command fails with a syntax error

I think we can add type: "module" to the package.json

"prepare": "pnpm run build",
- "check": "tsc"
+ "check": "tsc",
+ "inference-codegen": "pnpm run build && node dist/scripts/inference-codegen.js"
Member

Probably should include this directly in the build command

},
"files": [
"dist",
@@ -40,5 +41,8 @@
],
"author": "Hugging Face",
"license": "MIT",
- "devDependencies": {}
+ "devDependencies": {
+ "@types/node": "^20.11.5",
+ "quicktype-core": "https://github.com/huggingface/quicktype/raw/pack-18.0.15/packages/quicktype-core/quicktype-core-18.0.15.tgz"
+ }
}
209 changes: 209 additions & 0 deletions packages/tasks/pnpm-lock.yaml

Some generated files are not rendered by default.

112 changes: 112 additions & 0 deletions packages/tasks/src/scripts/inference-codegen.ts
Member

Maybe the scripts folder could be moved to the top level (under packages/tasks)

@@ -0,0 +1,112 @@
import type { SerializedRenderResult } from "quicktype-core";
import { quicktype, InputData, JSONSchemaInput, FetchingJSONSchemaStore } from "quicktype-core";
import * as fs from "fs/promises";
import { existsSync as pathExists } from "fs";
import * as path from "path";

const TYPESCRIPT_HEADER_FILE = `
/**
 * Inference code generated from the JSON schema spec in ./spec
 *
 * Using src/scripts/inference-codegen
 */

`;

const rootDirFinder = function (): string {
	const parts = __dirname.split("/");
	let level = parts.length - 1;
	while (level > 0) {
		const currentPath = parts.slice(0, level).join("/");
		console.debug(currentPath);
		try {
			require(`${currentPath}/package.json`);
			return path.normalize(currentPath);
		} catch (err) {
			/// noop
		}
		level--;
	}
	return "";
};

/**
 *
 * @param taskId The ID of the task for which we are generating code
 * @param taskSpecDir The path to the directory where the input.json & output.json files are
 * @param allSpecFiles An array of paths to all the tasks specs. Allows resolving cross-file references ($ref).
 */
async function buildInputData(taskId: string, taskSpecDir: string, allSpecFiles: string[]): Promise<InputData> {
Member

taskId can have a better type probably (something like PipelineType 😢)

	const schema = new JSONSchemaInput(new FetchingJSONSchemaStore(), [], allSpecFiles);
	await schema.addSource({
		name: `${taskId}-input`,
		schema: await fs.readFile(`${taskSpecDir}/input.json`, { encoding: "utf-8" }),
	});
	await schema.addSource({
		name: `${taskId}-output`,
		schema: await fs.readFile(`${taskSpecDir}/output.json`, { encoding: "utf-8" }),
	});
	const inputData = new InputData();
	inputData.addInput(schema);
	return inputData;
}

async function generateTypescript(inputData: InputData): Promise<SerializedRenderResult> {
	return await quicktype({
		inputData,
		lang: "typescript",
		alphabetizeProperties: true,
		rendererOptions: {
			"just-types": true,
			"nice-property-names": true,
			"prefer-unions": true,
			"prefer-const-values": true,
			"prefer-unknown": true,
			// "explicit-unions": true,
		},
	});
}

async function main() {
	const rootDir = rootDirFinder();
	const tasksDir = path.join(rootDir, "src", "tasks");
	const allTasks = await Promise.all(
		(await fs.readdir(tasksDir, { withFileTypes: true }))
			.filter((entry) => entry.isDirectory())
			.filter((entry) => entry.name !== "placeholder")
			.map(async (entry) => ({ task: entry.name, dirPath: path.join(entry.path, entry.name) }))
	);
	const allSpecFiles = allTasks
		.flatMap(({ dirPath }) => [path.join(dirPath, "spec", "input.json"), path.join(dirPath, "spec", "output.json")])
		.filter((filepath) => pathExists(filepath));

	for (const { task, dirPath } of allTasks) {
		const taskSpecDir = path.join(dirPath, "spec");
		if (!(pathExists(path.join(taskSpecDir, "input.json")) && pathExists(path.join(taskSpecDir, "output.json")))) {
			console.debug(`No spec found for task ${task} - skipping`);
			continue;
		}
		console.debug(`✨ Generating types for task`, task);

		console.debug(" 📦 Building input data");
		const inputData = await buildInputData(task, taskSpecDir, allSpecFiles);

		console.debug(" 🏭 Generating typescript code");
		{
			const { lines } = await generateTypescript(inputData);
			await fs.writeFile(`${dirPath}/inference.ts`, [TYPESCRIPT_HEADER_FILE, ...lines].join(`\n`), {
				flag: "w+",
				encoding: "utf-8",
			});
		}
	}
	console.debug("✅ All done!");
}

let exit = 0;
main()
	.catch((err) => {
		console.error("Failure", err);
		exit = 1;
	})
	.finally(() => process.exit(exit));
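
A note on `rootDirFinder` in the script above: it relies on `require` and on splitting `__dirname` by "/", and a later commit in this PR is titled "Don't use require in rootDirFinder". The sketch below is one possible way to do the same lookup without `require`. It is not the code from that commit; the name `findPackageRoot` is hypothetical, and unlike the original it also checks the starting directory itself.

```typescript
import { existsSync } from "fs";
import * as path from "path";

/**
 * Walk up from `startDir` until a directory containing a package.json is found.
 * Returns "" when none is found, mirroring the fallback of the original function.
 * Hypothetical helper, for illustration only.
 */
function findPackageRoot(startDir: string): string {
	let current = path.resolve(startDir);
	for (;;) {
		if (existsSync(path.join(current, "package.json"))) {
			return current;
		}
		const parent = path.dirname(current);
		if (parent === current) {
			// Reached the filesystem root without finding a package.json
			return "";
		}
		current = parent;
	}
}
```

Using `path.dirname` instead of string splitting keeps the walk independent of the platform's path separator.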
48 changes: 48 additions & 0 deletions packages/tasks/src/tasks/audio-classification/inference.ts
@@ -0,0 +1,48 @@
/**
 * Inference code generated from the JSON schema spec in ./spec
 *
 * Using src/scripts/inference-codegen
 */

/**
 * Inputs for Audio Classification inference
 */
export interface AudioClassificationInput {
Contributor

Re-flagging this comment in case it was lost

Contributor (Wauplin, Jan 23, 2024)

Same for images. The jsonschema cannot specify this since sending as raw data and sending as json are 2 different things. So for now it's kind of a blind spot. If we provide an openapi schema for our APIs in the future, then it will be possible to document it. Openapi easily integrates with jsonschema so having them is already a first good step.

(difference between a jsonschema as in this PR and an openapi description is that this PR describes objects with their attributes while the openapi description will include stuff like server routes, accepted headers, etc.)

(^ only my understanding of the specs, anyone feel free to correct me 😄)

Contributor Author

Yes - sorry for the delay in answering

Leaving the image/audio data as unknown was intentional, to give more flexibility to the libraries.
Image & audio data can be passed in several different forms (raw binary data, path to a local or remote file, base64 encoded data...) and I did not want to constrain downstream users of those types into one single representation.

> (difference between a jsonschema as in this PR and an openapi description is that this PR describes objects with their attributes while the openapi description will include stuff like server routes, accepted headers, etc.)

Yes that is correct, there will be some additional work necessary to generate an OpenAPI spec for an inference API (including actually specifying how we expect the binary data to be represented)

	/**
	 * One or several audio files to classify
	 */
	inputs: unknown;
	/**
	 * Additional inference parameters
	 */
	parameters?: AudioClassificationParameters;
	[property: string]: unknown;
}
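
As discussed in the review thread above, `inputs` is deliberately typed as `unknown` so that each library can choose its own representation (raw binary data, a path to a local or remote file, base64-encoded data). The sketch below shows how a downstream library might narrow such a value. The `NormalizedInput` type, the `normalizeInput` function, and the string heuristics are all assumptions made for illustration; none of them are part of this PR.

```typescript
/** Hypothetical normalized forms a client library might accept for audio or image inputs. */
type NormalizedInput =
	| { kind: "binary"; data: Uint8Array }
	| { kind: "url"; url: string }
	| { kind: "base64"; data: string }
	| { kind: "path"; path: string };

const BASE64_MARKER = ";base64,";

/** Narrow an `unknown` input into one of the representations above. */
function normalizeInput(inputs: unknown): NormalizedInput {
	if (inputs instanceof Uint8Array) {
		return { kind: "binary", data: inputs };
	}
	if (inputs instanceof ArrayBuffer) {
		return { kind: "binary", data: new Uint8Array(inputs) };
	}
	if (typeof inputs === "string") {
		if (/^https?:\/\//.test(inputs)) {
			return { kind: "url", url: inputs };
		}
		const markerIndex = inputs.indexOf(BASE64_MARKER);
		if (inputs.startsWith("data:") && markerIndex !== -1) {
			// Keep only the payload of a data URI
			return { kind: "base64", data: inputs.slice(markerIndex + BASE64_MARKER.length) };
		}
		// Assumption: any other string is treated as a local file path
		return { kind: "path", path: inputs };
	}
	throw new TypeError("Unsupported input representation");
}
```

A discriminated union like this keeps the flexibility of `unknown` at the boundary while giving the rest of the library an exhaustive set of cases to handle.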

/**
 * Additional inference parameters
 *
 * Additional inference parameters for Audio Classification
 */
export interface AudioClassificationParameters {
	/**
	 * When specified, limits the output to the top K most probable classes.
	 */
	topK?: number;
	[property: string]: unknown;
}

/**
 * Outputs for Audio Classification inference
 */
export interface AudioClassificationOutput {
	/**
	 * The predicted class label (model specific).
	 */
	label: string;
	/**
	 * The corresponding probability.
	 */
	score: number;
	[property: string]: unknown;
}
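
To illustrate how the generated types fit together, here is a minimal sketch of applying `topK` to a list of classification results. The two interfaces are re-declared so the snippet stands alone. The `applyTopK` helper is hypothetical, and treating the result as an array of `AudioClassificationOutput` sorted by score is an assumption, not something the generated file states.

```typescript
// Re-declared from the generated file above so the example is self-contained
interface AudioClassificationParameters {
	topK?: number;
	[property: string]: unknown;
}

interface AudioClassificationOutput {
	label: string;
	score: number;
	[property: string]: unknown;
}

/** Sort by descending score and keep the top K entries when `topK` is set (hypothetical helper). */
function applyTopK(
	outputs: AudioClassificationOutput[],
	parameters?: AudioClassificationParameters
): AudioClassificationOutput[] {
	// Copy before sorting so the caller's array is not mutated
	const sorted = [...outputs].sort((a, b) => b.score - a.score);
	const topK = parameters?.topK;
	return topK === undefined ? sorted : sorted.slice(0, topK);
}

const results = applyTopK(
	[
		{ label: "dog", score: 0.1 },
		{ label: "cat", score: 0.7 },
		{ label: "bird", score: 0.2 },
	],
	{ topK: 2 }
);
console.log(results.map((r) => r.label)); // [ 'cat', 'bird' ]
```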