bkb2135 (Collaborator) commented May 6, 2024

Currently, Date_QA context is gathered from a specific Wikipedia section that stores a list of dates. Miners have internalized this database and query it to solve date tasks. As a result, the questions we can ask about dates are limited to what that one Wikipedia database contains. This PR changes that in a few ways.

  • Free-form parsing of Wikipedia articles for dates: Rather than taking our date context from the date section of Wikipedia, we take any Wikipedia article and parse the dates out of it.
  • Token-based inferencing for challenges: A special token replaces the parsed date in the context, and our LLM is then prompted to create a question to which the replaced date is the answer.
  • Caching: To decrease the number of I/O operations to APIs, we cache the articles that we have previously grabbed for summarization and QA.
  • ROUGE scoring for dates: To prepare for the upcoming release of the front end, miners are encouraged to return longer, more chatbot-like answers. The reference now includes text, and responses are scored against it with ROUGE.
  • Increased context saved for Date_QA: Because Date_QA now relies on context objects saved from previous summarization or QA tasks, we could use these contexts to bridge Date_QA with other Wikipedia tasks for multi-turn conversation that follows one topic through different tasks.
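
As an illustration of the first two bullets, here is a minimal sketch of parsing dates out of free text and swapping one of them for a placeholder token. It is not the code in this PR: `DATE_TOKEN`, `DATE_PATTERN`, and `mask_random_date` are hypothetical names, the actual special token is not shown in this thread, and the regex covers only two common date formats.

```python
import random
import re

# Hypothetical placeholder; the real special token used by the PR is not shown here.
DATE_TOKEN = "<date>"

MONTHS = (
    "January|February|March|April|May|June|July|August|"
    "September|October|November|December"
)

# Matches "6 May 2024" or "May 6, 2024". Real articles use many more formats.
DATE_PATTERN = re.compile(
    rf"\b(?:\d{{1,2}} (?:{MONTHS}) \d{{4}}|(?:{MONTHS}) \d{{1,2}}, \d{{4}})\b"
)


def mask_random_date(text, rng=None):
    """Replace one date found in `text` with DATE_TOKEN.

    Returns (masked_text, removed_date), or None when no date is found.
    """
    rng = rng or random
    matches = list(DATE_PATTERN.finditer(text))
    if not matches:
        return None
    match = rng.choice(matches)
    masked = text[: match.start()] + DATE_TOKEN + text[match.end():]
    return masked, match.group(0)
```

The masked text would serve as the context handed to the LLM, and the removed date as the expected answer. An article with no parseable date returns `None`, so the caller can move on to another article.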

bkb2135 (Collaborator, Author) commented May 14, 2024

Currently, testnet runs appear to be successful, and references, while plain, appear to be high quality and contain the date that they are supposed to. Before merging to main, it would be advantageous to have a tracked run on main. The distribution of rewards in Date_QA is guaranteed to change, because we are shifting from a limited set of contexts that are solved and stored in a database to a "Wikipedia-size" dataset.

Currently, however, the following error arises:

Failed to create summarization task. (<class 'requests.exceptions.SSLError'>, SSLError(MaxRetryError("HTTPSConnectionPool(host='en.wikipedia.org', port=443): Max retries exceeded with url: /w/api.php?prop=extracts%7Crevisions&explaintext=&rvprop=ids&titles=Tsvetoslav+Stankulov&format=json&action=query (Caused by SSLError(SSLEOFError(8, '[SSL: UNEXPECTED_EOF_WHILE_READING] EOF occurred in violation of protocol (_ssl.c:1007)')))")), <traceback object at 0x7f020f4c6340>). Skipping to next task

This is unexpected, because a caching system should result in FEWER calls to Wikipedia.
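
To make the expectation about caching concrete, here is a minimal sketch of an in-memory article cache. It is an illustration only, not the implementation in this PR: `ArticleCache` and the injected `fetch` callable are hypothetical names, and the real fetch would be whatever function calls the Wikipedia API.

```python
from typing import Callable, Dict


class ArticleCache:
    """Keeps fetched articles in memory so repeated titles skip the network."""

    def __init__(self, fetch: Callable[[str], str]):
        # `fetch` is a hypothetical stand-in for the function that calls Wikipedia.
        self._fetch = fetch
        self._store: Dict[str, str] = {}
        self.misses = 0  # number of times the underlying fetch was attempted

    def get(self, title: str) -> str:
        if title not in self._store:
            self.misses += 1
            # If fetch raises (e.g. an SSLError), nothing is stored,
            # so a later call for the same title tries the network again.
            self._store[title] = self._fetch(title)
        return self._store[title]
```

With a cache like this, repeated requests for the same title reach Wikipedia once, so the number of network calls is bounded by the number of distinct titles requested. Failed fetches are not cached and are therefore retried on the next request for that title.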

bkb2135 (Collaborator, Author) commented May 15, 2024

Short of a mainnet run to anticipate the effects of releasing this feature, the only observable change that needs to be made is adjusting the cleaner pipelines for the reference.
[Screenshot, 2024-05-15 at 8:17:46 AM]

Passing clean = False on reference generation does not disable the cleaning pipelines for the reference as anticipated.

@bkb2135 bkb2135 requested a review from p-ferreira May 15, 2024 12:45
p-ferreira (Contributor) commented May 16, 2024

@dbobrenko Unfortunately, I cannot add you as a reviewer in this space, but please consider yourself a reviewer of all PRs in this repo (especially the ones tagged 2.3.0).

@p-ferreira p-ferreira mentioned this pull request May 17, 2024
@bkb2135 bkb2135 marked this pull request as ready for review May 17, 2024 18:35
@p-ferreira p-ferreira changed the base branch from pre-staging to staging May 17, 2024 18:46
@p-ferreira p-ferreira merged commit 12cd9d7 into staging May 17, 2024
@Hollyqui Hollyqui deleted the features/free-form-date-parsing branch August 2, 2024 08:17