Binary file added docSite/assets/imgs/datasetSetting1.png
5 changes: 3 additions & 2 deletions docSite/content/docs/installation/upgrading/46.md
@@ -50,5 +50,6 @@ curl --location --request POST 'https://{{host}}/api/admin/initv46-2' \
1. New - Team workspaces
2. New - Multi-vector mapping (multiple vectors map to one data record)
3. New - TTS voice
4. Online environment new - ReRank vector recall to improve recall precision
5. Optimization - dataset export now triggers a stream download directly, no more waiting on a spinner
4. New - Datasets can now be configured with a text pre-processing model
5. Online environment new - ReRank vector recall to improve recall precision
6. Optimization - dataset export now triggers a stream download directly, no more waiting on a spinner
8 changes: 4 additions & 4 deletions docSite/content/docs/pricing.md
@@ -1,10 +1,10 @@
---
title: 'Pricing'
description: 'FastGPT pricing'
title: 'Online Version Pricing'
description: 'FastGPT online version pricing'
icon: 'currency_yen'
draft: false
toc: true
weight: 10
weight: 11
---

## About Tokens
@@ -15,7 +15,7 @@ weight: 10

## FastGPT Online Billing

Currently, FastGPT online billing is based solely on the number of tokens used. The detailed billing table is below (the latest pricing is subject to the online table, which you can fetch in real time after clicking Recharge):
When using [https://fastgpt.run](https://fastgpt.run) or [https://ai.fastgpt.in](https://ai.fastgpt.in), you are simply charged by the number of tokens used. You can check your usage under Account - Usage Records. The detailed billing table is below (the latest pricing is subject to the online table, which you can fetch in real time after clicking Recharge):

{{< table "table-hover table-striped-columns" >}}
| Billing item | Price: CNY / 1K tokens (context included) |
20 changes: 14 additions & 6 deletions docSite/content/docs/use-cases/datasetEngine.md
@@ -1,6 +1,6 @@
---
title: "知识库结构讲解"
description: "本节会介绍 FastGPT 知识库结构设计,理解其 QA 的存储格式和检索格式,以便更好的构建知识库。这篇介绍主要以使用为主,详细原理不多介绍。"
description: "本节会详细介绍 FastGPT 知识库结构设计,理解其 QA 的存储格式和多向量映射,以便更好的构建知识库。这篇介绍主要以使用为主,详细原理不多介绍。"
icon: "dataset"
draft: false
toc: true
@@ -25,13 +25,21 @@ FastGPT adopts the Embedding approach from RAG to build its datasets; to make good use of Fast

FastGPT uses the `PG Vector` extension of `PostgreSQL` as its vector retriever, with an `HNSW` index. `PostgreSQL` is used only for vector retrieval; `MongoDB` handles the storage of all other data.

In the `PostgreSQL` table, an `index` field stores the vector, a `q` field stores the content the vector corresponds to, and an `a` field stores a retrieval mapping. The `q`/`a` field names are historical and need not be read strictly as a question-answer pair. In practice, the combination of `q` and `a` can further annotate retrieved content and improve the LLM's comprehension (note that this does not directly improve search precision).
In the `PostgreSQL` table, an `index` field stores the vector, and a `data_id` field locates the corresponding record in `MongoDB`. Multiple `index` rows can map to the same `data_id`; in other words, one data record can correspond to multiple vectors. At retrieval time, hits on the same record are merged.
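To make the relationship concrete, here is a minimal sketch (not the actual FastGPT implementation) of merging search hits by their `data_id`, under the assumption that each hit carries a similarity score:

```ts
// One row per vector in the PostgreSQL table; several rows can share a data_id.
type PgSearchHit = {
  id: string; // pg row id (one per vector)
  data_id: string; // _id of the mapped record in MongoDB
  score: number; // similarity score (assumed field, for illustration)
};

// Keep only the best-scoring hit per data record, so that multiple
// vectors pointing at the same data_id are merged into one result.
function mergeByDataId(hits: PgSearchHit[]): PgSearchHit[] {
  const best = new Map<string, PgSearchHit>();
  for (const hit of hits) {
    const current = best.get(hit.data_id);
    if (!current || hit.score > current.score) {
      best.set(hit.data_id, hit);
    }
  }
  return [...best.values()];
}
```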

Currently, there are several main ways to improve vector search precision:
![](/imgs/datasetSetting1.png)

1. Trim the content of `q` to shorten the vectorized text: when `q` is shorter and more accurate, retrieval precision naturally rises. The trade-off is a narrower retrieval range, which suits scenarios with strict answers.
2. Better tokenization and chunking: when a passage is structurally and semantically complete, and about a single topic, precision improves. Many systems therefore tune their splitters to preserve the integrity of each data record as much as possible.
3. Diversify the text: adding descriptive information such as keywords, summaries, and similar questions to a piece of content gives its vectors broader retrieval coverage.
## Purpose and Usage of Multi-Vector Mapping

If we want a data record to be as long as possible while its semantics remain well captured by the vectors, no single vector can represent it. We therefore adopt multi-vector mapping: one data record is mapped to multiple vectors, preserving both the completeness of the data and the expression of its semantics.

You can attach multiple vectors to a longer piece of text; at retrieval time, as long as any one of those vectors is hit, the data record is recalled.
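As a hedged illustration, the shape below mirrors the `indexes` array used by the `initv46-3` migration script later in this diff; the summary and question texts are hypothetical extra indexes, and the per-index `dataId` (normally generated by the system) is omitted:

```ts
import { DatasetDataIndexTypeEnum } from '@fastgpt/global/core/dataset/constant';

const longChunk = 'A long passage of source text that is too rich for a single vector...';

// One data record (q/a) with several index texts; each index text is
// embedded separately, and a hit on any of them recalls the whole record.
const datasetData = {
  q: longChunk,
  a: '',
  indexes: [
    { defaultIndex: true, type: DatasetDataIndexTypeEnum.chunk, text: longChunk },
    { defaultIndex: false, type: DatasetDataIndexTypeEnum.chunk, text: 'A concise summary of the passage' },
    { defaultIndex: false, type: DatasetDataIndexTypeEnum.chunk, text: 'A question a user might ask about the passage' }
  ]
};
```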

## Ways to Improve Vector Search Precision

1. Better tokenization and chunking: when a passage is structurally and semantically complete, and about a single topic, precision improves. Many systems therefore tune their splitters to preserve the integrity of each data record as much as possible.
2. Trim the content of each `index` to shorten the vectorized text: when an `index` is shorter and more accurate, retrieval precision naturally rises. The trade-off is a narrower retrieval range, which suits scenarios with strict answers.
3. Add more `index` entries: the same `chunk` of content can be given multiple groups of `index` entries.
4. Optimize the query: in practice, user questions are often vague or incomplete rather than clear, well-formed questions, so rewriting the user's query (see the sketch after this list) can substantially improve precision.
5. Fine-tune the embedding model: off-the-shelf embedding models are general-purpose and not very precise in specialized domains; fine-tuning one can greatly improve retrieval in a professional field.
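A minimal sketch of method 4, with `chatCompletion` passed in as a stand-in for whichever LLM call you use (it is not a FastGPT API):

```ts
// Rewrite a terse or ambiguous user query into a complete question
// before it is embedded for vector search.
async function optimizeQuery(
  rawQuery: string,
  chatCompletion: (prompt: string) => Promise<string>
): Promise<string> {
  const prompt =
    'Rewrite the following search query as one complete, specific question. ' +
    'Only return the rewritten question.\n' +
    `Query: ${rawQuery}`;
  const rewritten = await chatCompletion(prompt);
  return rewritten.trim() || rawQuery; // fall back to the original query
}
```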

4 changes: 2 additions & 2 deletions packages/global/common/string/textSplitter.ts
@@ -63,8 +63,8 @@ export const splitText2Chunks = (props: { text: string; maxLen: number; overlapL
let chunks: string[] = [];
for (let i = 0; i < splitTexts.length; i++) {
let text = splitTexts[i];
let chunkToken = countPromptTokens(lastChunk, '');
const textToken = countPromptTokens(text, '');
let chunkToken = lastChunk.length;
const textToken = text.length;

// next chunk is too large / new chunk is too large (the current chunk must stay smaller than maxLen)
if (textToken >= maxLen || chunkToken + textToken > maxLen * 1.4) {
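For context, a simplified sketch of the size check after this change; chunk growth is now bounded by raw character length as a cheap stand-in for model tokens (an approximation, not an exact token count):

```ts
// Decide whether `text` still fits into the chunk being built.
function exceedsChunkBudget(lastChunk: string, text: string, maxLen: number): boolean {
  const chunkToken = lastChunk.length; // character count as a token proxy
  const textToken = text.length;
  // next piece alone is too large, or the combined chunk would pass the 1.4x soft limit
  return textToken >= maxLen || chunkToken + textToken > maxLen * 1.4;
}
```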
6 changes: 4 additions & 2 deletions packages/global/core/dataset/type.d.ts
@@ -1,4 +1,4 @@
import type { VectorModelItemType } from '../../core/ai/model.d';
import type { LLMModelItemType, VectorModelItemType } from '../../core/ai/model.d';
import { PermissionTypeEnum } from '../../support/permission/constant';
import { PushDatasetDataChunkProps } from './api';
import {
@@ -19,6 +19,7 @@ export type DatasetSchemaType = {
avatar: string;
name: string;
vectorModel: string;
agentModel: string;
tags: string[];
type: `${DatasetTypeEnum}`;
permission: `${PermissionTypeEnum}`;
@@ -84,8 +85,9 @@ export type CollectionWithDatasetType = Omit<DatasetCollectionSchemaType, 'datas
};

/* ================= dataset ===================== */
export type DatasetItemType = Omit<DatasetSchemaType, 'vectorModel'> & {
export type DatasetItemType = Omit<DatasetSchemaType, 'vectorModel' | 'agentModel'> & {
vectorModel: VectorModelItemType;
agentModel: LLMModelItemType;
isOwner: boolean;
canWrite: boolean;
};
2 changes: 2 additions & 0 deletions packages/global/support/wallet/bill/api.d.ts
@@ -3,6 +3,8 @@ import { BillListItemType } from './type';

export type CreateTrainingBillProps = {
name: string;
vectorModel?: string;
agentModel?: string;
};

export type ConcatBillProps = {
1 change: 0 additions & 1 deletion packages/service/core/app/schema.ts
@@ -61,7 +61,6 @@ const AppSchema = new Schema({

try {
AppSchema.index({ updateTime: -1 });
AppSchema.index({ 'share.collection': -1 });
} catch (error) {
console.log(error);
}
1 change: 0 additions & 1 deletion packages/service/core/dataset/collection/schema.ts
@@ -69,7 +69,6 @@ const DatasetCollectionSchema = new Schema({

try {
DatasetCollectionSchema.index({ datasetId: 1 });
DatasetCollectionSchema.index({ userId: 1 });
DatasetCollectionSchema.index({ updateTime: -1 });
} catch (error) {
console.log(error);
5 changes: 5 additions & 0 deletions packages/service/core/dataset/schema.ts
@@ -48,6 +48,11 @@ const DatasetSchema = new Schema({
required: true,
default: 'text-embedding-ada-002'
},
agentModel: {
type: String,
required: true,
default: 'gpt-3.5-turbo-16k'
},
type: {
type: String,
enum: Object.keys(DatasetTypeMap),
2 changes: 1 addition & 1 deletion packages/service/core/dataset/training/schema.ts
@@ -95,7 +95,7 @@ const TrainingDataSchema = new Schema({

try {
TrainingDataSchema.index({ lockTime: 1 });
TrainingDataSchema.index({ userId: 1 });
TrainingDataSchema.index({ datasetId: 1 });
TrainingDataSchema.index({ collectionId: 1 });
TrainingDataSchema.index({ expireAt: 1 }, { expireAfterSeconds: 7 * 24 * 60 });
} catch (error) {
2 changes: 2 additions & 0 deletions projects/app/public/locales/en/common.json
@@ -250,6 +250,7 @@
}
},
"dataset": {
"Agent Model": "Learning Model",
"Chunk Length": "Chunk Length",
"Confirm move the folder": "Confirm Move",
"Confirm to delete the data": "Confirm to delete the data?",
@@ -259,6 +260,7 @@
"Delete Dataset Error": "Delete dataset failed",
"Edit Folder": "Edit Folder",
"Export": "Export",
"Export Dataset Limit Error": "Export Data Error",
"File Input": "Import File",
"File Size": "File Size",
"Filename": "Filename",
2 changes: 2 additions & 0 deletions projects/app/public/locales/zh/common.json
@@ -250,6 +250,7 @@
}
},
"dataset": {
"Agent Model": "文件处理模型",
"Chunk Length": "数据总量",
"Confirm move the folder": "确认移动到该目录",
"Confirm to delete the data": "确认删除该数据?",
@@ -259,6 +260,7 @@
"Delete Dataset Error": "删除知识库异常",
"Edit Folder": "编辑文件夹",
"Export": "导出",
"Export Dataset Limit Error": "导出数据失败",
"File Input": "文件导入",
"File Size": "文件大小",
"Filename": "文件名",
13 changes: 5 additions & 8 deletions projects/app/src/constants/dataset.ts
@@ -1,3 +1,4 @@
import { defaultQAModels, defaultVectorModels } from '@fastgpt/global/core/ai/model';
import type {
DatasetCollectionItemType,
DatasetItemType
@@ -17,13 +18,8 @@ export const defaultDatasetDetail: DatasetItemType = {
permission: 'private',
isOwner: false,
canWrite: false,
vectorModel: {
model: 'text-embedding-ada-002',
name: 'Embedding-2',
price: 0.2,
defaultToken: 500,
maxToken: 3000
}
vectorModel: defaultVectorModels[0],
agentModel: defaultQAModels[0]
};

export const defaultCollectionDetail: DatasetCollectionItemType = {
@@ -43,7 +39,8 @@ export const defaultCollectionDetail: DatasetCollectionItemType = {
name: '',
tags: [],
permission: 'private',
vectorModel: 'text-embedding-ada-002'
vectorModel: defaultVectorModels[0].model,
agentModel: defaultQAModels[0].model
},
parentId: '',
name: '',
2 changes: 2 additions & 0 deletions projects/app/src/global/core/api/datasetReq.d.ts
@@ -5,6 +5,7 @@ import type { SearchTestItemType } from '@/types/core/dataset';
import { UploadChunkItemType } from '@fastgpt/global/core/dataset/type';
import { DatasetCollectionSchemaType } from '@fastgpt/global/core/dataset/type';
import { PermissionTypeEnum } from '@fastgpt/global/support/permission/constant';
import type { LLMModelItemType } from '@fastgpt/global/core/ai/model.d';

/* ===== dataset ===== */
export type DatasetUpdateParams = {
@@ -14,6 +15,7 @@ export type DatasetUpdateParams = {
name?: string;
avatar?: string;
permission?: `${PermissionTypeEnum}`;
agentModel?: LLMModelItemType;
};

export type SearchTestProps = {
1 change: 1 addition & 0 deletions projects/app/src/global/core/dataset/api.d.ts
@@ -9,6 +9,7 @@ export type CreateDatasetParams = {
tags: string;
avatar: string;
vectorModel?: string;
agentModel?: string;
type: `${DatasetTypeEnum}`;
};
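A hedged example of a creation payload using only the fields visible in this hunk; the model names mirror the schema defaults added elsewhere in this PR, while the avatar path and `type` value are assumptions:

```ts
import type { CreateDatasetParams } from '@/global/core/dataset/api.d';

// Only the fields shown in the hunk above; fields hidden above the fold
// are omitted via Pick rather than guessed at.
type VisibleFields = Pick<CreateDatasetParams, 'tags' | 'avatar' | 'vectorModel' | 'agentModel' | 'type'>;

const example: VisibleFields = {
  tags: '',
  avatar: '/icon/logo.svg', // assumption
  vectorModel: 'text-embedding-ada-002', // default in the dataset schema
  agentModel: 'gpt-3.5-turbo-16k', // default added by this PR
  type: 'dataset' // assumption: one member of DatasetTypeEnum
};
```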

6 changes: 3 additions & 3 deletions projects/app/src/global/core/prompt/agent.ts
@@ -1,8 +1,8 @@
export const Prompt_AgentQA = {
prompt: `I will give you a passage of text, {{theme}}. Study it, organize what you have learned, and follow these requirements:
1. Propose at most 25 questions
2. Give an answer for each question
3. Answers must be detailed and complete; they may contain markdown elements such as plain text, links, code, tables, formulas, and media links
1. Propose questions and give an answer for each
2. Every answer must be detailed and complete and quote the relevant original text; answers may contain markdown elements such as plain text, links, code, tables, formulas, and media links
3. Propose at most 30 questions
4. Return the questions and answers in this format:

Q1: Question.
16 changes: 15 additions & 1 deletion projects/app/src/pages/api/admin/initv46-2.ts
@@ -11,6 +11,8 @@ import {
import { authCert } from '@fastgpt/service/support/permission/auth/common';
import { MongoDatasetData } from '@fastgpt/service/core/dataset/data/schema';
import { getUserDefaultTeam } from '@fastgpt/service/support/user/team/controller';
import { MongoDataset } from '@fastgpt/service/core/dataset/schema';
import { defaultQAModels } from '@fastgpt/global/core/ai/model';

let success = 0;
/* Move the data in pg into mongo dataset.datas and build the mapping */
@@ -41,6 +43,13 @@ export default async function handler(req: NextApiRequest, res: NextApiResponse)

await initPgData();

await MongoDataset.updateMany(
{},
{
agentModel: defaultQAModels[0].model
}
);

jsonRes(res, {
data: await init(limit),
message:
@@ -76,14 +85,19 @@ async function initPgData() {
for (let i = 0; i < limit; i++) {
init(i);
}

async function init(index: number): Promise<any> {
const userId = rows[index]?.user_id;
if (!userId) return;
try {
const tmb = await getUserDefaultTeam({ userId });
console.log(tmb);

// update pg
await PgClient.query(
`Update ${PgDatasetTableName} set team_id = '${tmb.teamId}', tmb_id = '${tmb.tmbId}' where user_id = '${userId}' AND team_id='null';`
`Update ${PgDatasetTableName} set team_id = '${String(tmb.teamId)}', tmb_id = '${String(
tmb.tmbId
)}' where user_id = '${userId}' AND team_id='null';`
);
console.log(++success);
init(index + limit);
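As an aside, a hedged sketch of the same update using parameter placeholders with plain node-postgres, which sidesteps the quoting concerns the `String(...)` wrapping above addresses; the table name constant is a stand-in for `PgDatasetTableName`:

```ts
import { Pool } from 'pg';

const pool = new Pool(); // reads connection settings from PG* env vars
const tableName = 'dataset_data'; // assumption: stands in for PgDatasetTableName

// Placeholders ($1...$3) let the driver handle escaping of the values.
async function updateTeam(teamId: string, tmbId: string, userId: string) {
  await pool.query(
    `UPDATE ${tableName} SET team_id = $1, tmb_id = $2 WHERE user_id = $3 AND team_id = 'null';`,
    [teamId, tmbId, userId]
  );
}
```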
101 changes: 101 additions & 0 deletions projects/app/src/pages/api/admin/initv46-3.ts
@@ -0,0 +1,101 @@
import type { NextApiRequest, NextApiResponse } from 'next';
import { jsonRes } from '@fastgpt/service/common/response';
import { connectToDatabase } from '@/service/mongo';
import { delay } from '@/utils/tools';
import { PgClient } from '@fastgpt/service/common/pg';
import {
DatasetDataIndexTypeEnum,
PgDatasetTableName
} from '@fastgpt/global/core/dataset/constant';

import { authCert } from '@fastgpt/service/support/permission/auth/common';
import { MongoDatasetData } from '@fastgpt/service/core/dataset/data/schema';

let success = 0;
/* Move the data in pg into mongo dataset.datas and build the mapping */
export default async function handler(req: NextApiRequest, res: NextApiResponse) {
try {
const { limit = 50 } = req.body as { limit: number };
await authCert({ req, authRoot: true });
await connectToDatabase();
success = 0;

jsonRes(res, {
data: await init(limit)
});
} catch (error) {
console.log(error);

jsonRes(res, {
code: 500,
error
});
}
}

type PgItemType = {
id: string;
q: string;
a: string;
dataset_id: string;
collection_id: string;
data_id: string;
};

async function init(limit: number): Promise<any> {
const { rows: idList } = await PgClient.query<{ id: string }>(
`SELECT id FROM ${PgDatasetTableName} WHERE inited=1`
);

console.log('totalCount', idList.length);

await delay(2000);

if (idList.length === 0) return;

for (let i = 0; i < limit; i++) {
initData(i);
}

async function initData(index: number): Promise<any> {
const dataId = idList[index]?.id;
if (!dataId) {
console.log('done');
return;
}
// get limit data where data_id is null
const { rows } = await PgClient.query<PgItemType>(
`SELECT id,q,a,dataset_id,collection_id,data_id FROM ${PgDatasetTableName} WHERE id=${dataId};`
);
const data = rows[0];
if (!data) {
console.log('done');
return;
}

try {
// update mongo data and update inited
await MongoDatasetData.findByIdAndUpdate(data.data_id, {
q: data.q,
a: data.a,
indexes: [
{
defaultIndex: !data.a,
type: data.a ? DatasetDataIndexTypeEnum.qa : DatasetDataIndexTypeEnum.chunk,
dataId: data.id,
text: data.q
}
]
});
// update pg data_id
await PgClient.query(`UPDATE ${PgDatasetTableName} SET inited=0 WHERE id=${dataId};`);

return initData(index + limit);
} catch (error) {
console.log(error);
console.log(data);
await delay(500);
return initData(index);
}
}
}
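For completeness, a hypothetical invocation of the new endpoint, mirroring the documented `initv46-2` call at the top of this diff; the `rootkey` header name is an assumption taken from that doc's convention:

```ts
// Trigger the v4.6-3 data migration (root credentials required).
const res = await fetch('https://<your-host>/api/admin/initv46-3', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json', rootkey: '<your-root-key>' },
  body: JSON.stringify({ limit: 50 })
});
console.log(await res.json());
```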