
[Job Launcher] GCV moderation: Batches processing #3081

Merged
portuu3 merged 15 commits into feature/launcher/gcv-moderation from feature/launcher/gcv-moderation-multiple-tasks
Mar 6, 2025

Conversation

@0xVoronov
Contributor

Issue tracking

This PR addresses issues related to job moderation processing and result collection.
Tracking: [Job Launcher] GCV integration #2905.

Context behind the change

This PR introduces batch processing for large datasets, ensuring that datasets with more than 2000 images are split into multiple tasks for better processing efficiency.

Changes made:

Implemented batch processing:

  • If a dataset contains more than 2000 images, it is split into smaller tasks to improve parallel processing.

Added a new migration:

  • Introduced the job-moderation-tasks table to store individual task details.
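
The splitting step described above can be illustrated with a minimal sketch. It is not the code from this PR: the function name `splitIntoBatches`, the constant name `BATCH_SIZE_PER_TASK`, and the example bucket URLs are assumptions made for the illustration. Only the 2000-image threshold comes from the description.

```typescript
// Hypothetical constant name; the 2000-image limit is taken from the PR description.
const BATCH_SIZE_PER_TASK = 2000;

/** Split a list into consecutive batches of at most `batchSize` items each. */
function splitIntoBatches<T>(
  items: readonly T[],
  batchSize: number = BATCH_SIZE_PER_TASK,
): T[][] {
  if (!Number.isInteger(batchSize) || batchSize <= 0) {
    throw new RangeError('batchSize must be a positive integer');
  }
  const batches: T[][] = [];
  for (let start = 0; start < items.length; start += batchSize) {
    batches.push(items.slice(start, start + batchSize));
  }
  return batches;
}

// Usage: a 4500-image dataset becomes three tasks of 2000, 2000 and 500 images.
const imageUrls = Array.from(
  { length: 4500 },
  (_, i) => `gs://example-bucket/image-${i}.jpg`, // made-up URLs
);
const batches = splitIntoBatches(imageUrls);
console.log(batches.map((batch) => batch.length)); // [ 2000, 2000, 500 ]
```

A dataset of exactly 2000 images stays in a single batch, which matches the "more than 2000" wording above.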

How has this been tested?

  • Unit Tests:
    • Tested valid and invalid moderation results.
    • Ensured missing files throw expected errors.
  • Manual Testing:
    • Triggered the moderation flow with various scenarios (successful, partially passed, fully passed).

Release plan

  • Remove all data from cron-jobs table.
  • Run the new migration before deploying to create the job-moderation-tasks table.

@0xVoronov 0xVoronov self-assigned this Feb 7, 2025
@0xVoronov 0xVoronov requested a review from portuu3 February 7, 2025 11:52

Collaborator

@portuu3 portuu3 left a comment

I think job moderation should be in a different module.

Comment thread packages/apps/job-launcher/server/src/common/constants/index.ts Outdated
Comment thread packages/apps/job-launcher/server/src/common/constants/index.ts Outdated
Comment on lines +88 to +89
export const JOB_MODERATION_ASYNC_BATCH_SIZE = 100;
export const JOB_MODERATION_BATCH_SIZE_PER_TASK = 2000;
Collaborator

I would add GCV to the names of these constants, just in case we ever add another external service for moderation.

`);
await queryRunner.query(`
ALTER TABLE "hmt"."job-moderation-tasks"
ALTER COLUMN "job_id" DROP NOT NULL
Collaborator

why?

}

- this.logger.log('Parse job moderation results START');
+ this.logger.log('Process job moderation tasks START');
Collaborator

Suggested change
- this.logger.log('Process job moderation tasks START');
+ this.logger.log('Parse job moderation results START');

this.logger.error(
`Error process job moderation task. Error ID: ${errorId}, Job ID: ${jobModerationTaskEntity.id}, Reason: ${failedReason}, Message: ${err.message}`,
);
jobModerationTaskEntity.status = JobModerationTaskStatus.FAILED;
Collaborator

A notification is missing here.


- @Cron('*/3 * * * *')
+ @Cron('*/2 * * * *')
public async processJobModerationTasksCronJob() {
Collaborator

I think this cron job is unnecessary. As far as I can see we have 4 cron jobs:

  1. Create moderation task entity.
  2. Send the data for moderation.
  3. Parse results.
  4. Send notification.

At least 1 and 2 could be combined.
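
One way steps 1 and 2 might be combined into a single pass is sketched below. This is only an illustration under stated assumptions, not the service code from this PR: `ModerationTask`, `ModerationClient`, `createAndSubmitTasks`, and the status values are all hypothetical names.

```typescript
// Every name here is hypothetical and exists only for this illustration.
type TaskStatus = 'pending' | 'processing' | 'failed';

interface ModerationTask {
  jobId: number;
  imageUrls: string[];
  status: TaskStatus;
}

// Stand-in for whatever client sends a batch to the external moderation service.
interface ModerationClient {
  submit(imageUrls: string[]): Promise<void>;
}

/** Create one task per batch and submit it right away, in a single pass. */
async function createAndSubmitTasks(
  jobId: number,
  batches: string[][],
  client: ModerationClient,
): Promise<ModerationTask[]> {
  const tasks: ModerationTask[] = [];
  for (const imageUrls of batches) {
    const task: ModerationTask = { jobId, imageUrls, status: 'pending' };
    try {
      await client.submit(imageUrls);
      task.status = 'processing';
    } catch {
      // A failed submission is recorded on the task instead of aborting the loop.
      task.status = 'failed';
    }
    tasks.push(task);
  }
  return tasks;
}
```

With this shape, a separate cron job that only creates task entities is no longer needed; parsing results and sending notifications stay as later steps.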

Collaborator

@portuu3 portuu3 left a comment

I think job moderation should have its own module

…update environment variables, and remove deprecated job moderation tasks.
Comment thread packages/apps/job-launcher/server/src/common/enums/content-moderation.ts Outdated
…ssary console logs and ensure consistent output format
… a new gcv.ts file and updating imports accordingly
it('should do nothing if requests already exist', async () => {
const jobEntity = {
id: faker.number.int(),
status: JobStatus.PAID,
Collaborator

The job entity can be defined once inside ensureRequests, or maybe even for the whole test file.
Check this for the other mocked entities as well.
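
A shared fixture along these lines could look roughly like the sketch below. The names `JobEntityStub` and `createJobEntity`, the enum value, and the default field values are assumptions for illustration; the real entity and `JobStatus` enum live in the job launcher code and are not reproduced here.

```typescript
// Stand-ins for the real JobStatus enum and job entity; names and values are assumptions.
enum JobStatus {
  PAID = 'paid',
}

interface JobEntityStub {
  id: number;
  status: JobStatus;
  contentModerationRequests: unknown[];
}

/** Build a default mocked job entity; each test overrides only the fields it cares about. */
function createJobEntity(overrides: Partial<JobEntityStub> = {}): JobEntityStub {
  return {
    id: 1,
    status: JobStatus.PAID,
    contentModerationRequests: [],
    ...overrides,
  };
}

// Usage: the "requests already exist" case overrides a single field.
const jobWithRequests = createJobEntity({ contentModerationRequests: [{ id: 10 }] });
```

A factory function rather than a single shared object keeps tests from mutating each other's fixtures.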

Comment thread packages/apps/job-launcher/server/src/modules/cron-job/cron-job.service.ts Outdated
}

jobEntity.contentModerationRequests = [
...(jobEntity.contentModerationRequests || []),
Collaborator

This will always be null: we already check whether the job has requests on line 92 and return from the function if it does.

Comment thread packages/apps/job-launcher/server/src/common/enums/content-moderation.ts Outdated
(fileName) => `${gcDataUrl}/${fileName.split('/').pop()}`,
);

const fileName = getFileName(
Collaborator

rename to a more descriptive name

const outputUri = constructGcsPath(
this.visionConfigService.moderationResultsBucket,
this.visionConfigService.moderationResultsFilesPath,
hashString(fileName),
Collaborator

it makes no sense to hash the file name

*/
private async collectModerationResults(fileName: string) {
try {
const hash = hashString(fileName);
Collaborator

unnecessary

@portuu3 portuu3 merged commit 6670dee into feature/launcher/gcv-moderation Mar 6, 2025
@portuu3 portuu3 deleted the feature/launcher/gcv-moderation-multiple-tasks branch March 6, 2025 12:45
portuu3 added a commit that referenced this pull request Mar 12, 2025
* Implemented job moderation service

* Implemented storage

* Added updates

* Updated names, updated migration

* Added migration file

* Improved job moderation flow, added new unit tests

* Implemented async batch annotation logic

* Resolved comments, fixed dependencies, updated tests

* Implemented new GCS and Job Moderation utils

* [Job Launcher] GCV moderation: Batches processing (#3081)

* Implemented batch processing logic

* Added entity, repository and migration

* Updated job moderation service

* Implemented unit tests

* Refactor content moderation module: rename and restructure entities, update environment variables, and remove deprecated job moderation tasks.

* Add @faker-js/faker dependency, improve test cases, and clean up job service tests

* Improved content moderation tests

* Faker usage for Content Moderation tests

* Refactor GCVContentModerationService tests to use mockResolvedValueOnce for better control over mock behavior

* Fix GCV content moderation tests

* Refactor GCS URL conversion and validation functions to remove unnecessary console logs and ensure consistent output format

* Refactor content moderation enums by moving ContentModerationLevel to a new gcv.ts file and updating imports accordingly

* Refactor categorize methods and clean some useless code

---------

Co-authored-by: Francisco López <francislopez977@gmail.com>
Co-authored-by: portuu3 <adrian.portugues.mas@gmail.com>

* feat: add node-cache for caching GCS object listings and remove wrong migration

* Update gcv-content-moderation.service.ts

---------

Co-authored-by: Francisco López <francislopez977@gmail.com>
Co-authored-by: portuu3 <adrian.portugues.mas@gmail.com>
Co-authored-by: Francisco López <50665615+flopez7@users.noreply.github.com>