[Job Launcher] GCV moderation: Batches processing #3081
portuu3
left a comment
I think job moderation should be in a different module.
| export const JOB_MODERATION_ASYNC_BATCH_SIZE = 100; | ||
| export const JOB_MODERATION_BATCH_SIZE_PER_TASK = 2000; |
I would add GCV to the names of these constants, in case we ever add another external service for moderation.
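A minimal sketch of what that rename could look like. The exact names are hypothetical (only the `GCV_` prefix is suggested by the comment); the values are copied from the diff above.

```typescript
// Hypothetical names: the GCV_ prefix marks these as specific to Google Cloud Vision,
// so constants for another moderation provider could sit alongside them later.
export const GCV_JOB_MODERATION_ASYNC_BATCH_SIZE = 100;
export const GCV_JOB_MODERATION_BATCH_SIZE_PER_TASK = 2000;
```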
| `); | ||
| await queryRunner.query(` | ||
| ALTER TABLE "hmt"."job-moderation-tasks" | ||
| ALTER COLUMN "job_id" DROP NOT NULL |
| } | ||
|
|
||
| this.logger.log('Parse job moderation results START'); | ||
| this.logger.log('Process job moderation tasks START'); |
| this.logger.log('Process job moderation tasks START'); | |
| this.logger.log('Parse job moderation results START'); |
| this.logger.error( | ||
| `Error process job moderation task. Error ID: ${errorId}, Job ID: ${jobModerationTaskEntity.id}, Reason: ${failedReason}, Message: ${err.message}`, | ||
| ); | ||
| jobModerationTaskEntity.status = JobModerationTaskStatus.FAILED; |
A notification is missing here.
|
|
||
| @Cron('*/3 * * * *') | ||
| @Cron('*/2 * * * *') | ||
| public async processJobModerationTasksCronJob() { |
I think this cron job is unnecessary. As far as I can see we have 4 cron jobs:
- Create moderation task entity.
- Send the data for moderation.
- Parse results.
- Send notification.
At least the first two could be combined.
portuu3
left a comment
I think job moderation should have its own module
…update environment variables, and remove deprecated job moderation tasks.
…gcv-moderation-multiple-tasks
…ce for better control over mock behavior
…ssary console logs and ensure consistent output format
… a new gcv.ts file and updating imports accordingly
| it('should do nothing if requests already exist', async () => { | ||
| const jobEntity = { | ||
| id: faker.number.int(), | ||
| status: JobStatus.PAID, |
The job entity can be defined once inside ensureRequests, or maybe even for the whole test file.
Please check the same for the other mocked entities.
| } | ||
|
|
||
| jobEntity.contentModerationRequests = [ | ||
| ...(jobEntity.contentModerationRequests || []), |
This will always be null: we already check whether the job has requests in line 92 and return from the function if it does.
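A rough sketch of the simplification this implies, assuming the early return looks roughly like the guard below. `JobEntity`, `ContentModerationRequest`, and `attachRequests` are hypothetical stand-ins, not the PR's actual types or method.

```typescript
// Hypothetical minimal shapes, for illustration only.
interface ContentModerationRequest {
  dataUrl: string;
}

interface JobEntity {
  contentModerationRequests?: ContentModerationRequest[] | null;
}

// Returns true when new requests were attached, false when the job already had some.
function attachRequests(
  job: JobEntity,
  newRequests: ContentModerationRequest[],
): boolean {
  // Assumed form of the early return the comment refers to.
  if (job.contentModerationRequests && job.contentModerationRequests.length > 0) {
    return false;
  }

  // Past the guard there is nothing to merge, so spreading the old value is redundant.
  job.contentModerationRequests = newRequests;
  return true;
}
```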
| (fileName) => `${gcDataUrl}/${fileName.split('/').pop()}`, | ||
| ); | ||
|
|
||
| const fileName = getFileName( |
rename to a more descriptive name
| const outputUri = constructGcsPath( | ||
| this.visionConfigService.moderationResultsBucket, | ||
| this.visionConfigService.moderationResultsFilesPath, | ||
| hashString(fileName), |
it makes no sense to hash the file name
| */ | ||
| private async collectModerationResults(fileName: string) { | ||
| try { | ||
| const hash = hashString(fileName); |
* Implemented job moderation service
* Implemented storage
* Added updates
* Updated names, updated migration
* Added migration file
* Improved job moderation flow, added new unit tests
* Implemented async batch annotation logic
* Resolved comments, fixed dependencies, updated tests
* Implemented new GCS and Job Moderation utils
* [Job Launcher] GCV moderation: Batches processing (#3081)
* Implemented batch processing logic
* Added entity, repository and migration
* Updated job moderation service
* Implemented unit tests
* Refactor content moderation module: rename and restructure entities, update environment variables, and remove deprecated job moderation tasks.
* Add @faker-js/faker dependency, improve test cases, and clean up job service tests
* Improved content moderation tests
* Faker usage for Content Moderation tests
* Refactor GCVContentModerationService tests to use mockResolvedValueOnce for better control over mock behavior
* Fix GCV content moderation tests
* Refactor GCS URL conversion and validation functions to remove unnecessary console logs and ensure consistent output format
* Refactor content moderation enums by moving ContentModerationLevel to a new gcv.ts file and updating imports accordingly
* Refactor categorize methods and clean some useless code
---------
Co-authored-by: Francisco López <francislopez977@gmail.com>
Co-authored-by: portuu3 <adrian.portugues.mas@gmail.com>
* feat: add node-cache for caching GCS object listings and remove wrong migration
* Update gcv-content-moderation.service.ts
---------
Co-authored-by: Francisco López <francislopez977@gmail.com>
Co-authored-by: portuu3 <adrian.portugues.mas@gmail.com>
Co-authored-by: Francisco López <50665615+flopez7@users.noreply.github.com>
Issue tracking
This PR addresses issues related to job moderation processing and result collection.
Tracking: [Job Launcher] GCV integration #2905.
Context behind the change
This PR introduces batch processing for large datasets, ensuring that datasets with more than 2000 images are split into multiple tasks for better processing efficiency.
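The splitting described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: `splitIntoTasks` is a hypothetical helper, and the limit of 2000 is taken from `JOB_MODERATION_BATCH_SIZE_PER_TASK` in the reviewed diff.

```typescript
// Value taken from JOB_MODERATION_BATCH_SIZE_PER_TASK in the reviewed diff.
const BATCH_SIZE_PER_TASK = 2000;

// Hypothetical helper: splits a dataset into consecutive chunks of at most `size` items,
// one chunk per moderation task.
function splitIntoTasks<T>(
  items: readonly T[],
  size: number = BATCH_SIZE_PER_TASK,
): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError('size must be a positive integer');
  }
  const tasks: T[][] = [];
  for (let start = 0; start < items.length; start += size) {
    tasks.push(items.slice(start, start + size));
  }
  return tasks;
}

// A dataset of 4500 images ends up in three tasks.
const imageUrls = Array.from({ length: 4500 }, (_, i) => `image-${i}.jpg`);
const moderationTasks = splitIntoTasks(imageUrls);
console.log(moderationTasks.map((task) => task.length)); // [ 2000, 2000, 500 ]
```

A dataset at or below the limit yields a single task, so small jobs behave the same as before the split was introduced.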
Changes made:
Implemented batch processing:
Added a new migration:
How has this been tested?
Release plan
cron-jobs table.