mfekadu/nimbus-transformer

nimbus-transformer

It's like Nimbus, but it uses a transformer language model.

Written in a Functional Programming style.

Getting Started

Works on macOS, Linux, and Windows.

2. Setup virtual environment

pipenv install

This will create a virtual environment with the required:

3. Open virtual environment

pipenv shell

4. Verify your python version

$ python --version
Python 3.6.8

Usage

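from ntfp.ntfp import get_context, transformer

question = "what is Dr. Foaad Khosmood email?"
# get_context returns three values; only the third (the context) is used here
_, _, context = get_context(question)
# transformer returns the answer plus a second value that this example ignores
answer, _ = transformer(question, context)
print("answer: ", answer)
# prints: answer:  foaad@ calpoly.edu.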

Demo

$ python main.py
To use data.metrics please install scikit-learn. See https://scikit-learn.org/stable/index.html


question: what is Dr. Foaad Khosmood email?
len(context):  911
Converting examples to features: 100%|██| 1/1 [00:00<00:00, 95.61it/s]



answer:  foaad@ calpoly.edu.
appended new row to data.csv

demo.png

How it works

Assumptions

  • "Context" is limited to Cal Poly, so expect non-Cal-Poly "Questions" to fail
  • "Answer" is expected to exist publicly on the web, so that Google can access it.

Pipeline

  1. User asks Question to a web application.
  2. Scrape Google for Context, limited to the first 10 URL results.
  3. Store Context into database.
  4. Transform ( Question, Context ) >> Answer
  5. Reply with Answer
  6. Mark the Answer as good or bad, to learn from later.
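
The steps above can be sketched as a single function. This is a minimal sketch, not the repository's implementation: `answer_question`, `fake_scrape`, `fake_read`, the `store` list, and the `score` and `is_good` fields are all hypothetical names chosen for illustration, and the scraper and reader are passed in as plain functions so the sketch runs without network access or a model.

```python
from typing import Callable, List, Tuple

# Hypothetical type aliases for illustration; not part of the ntfp package.
Scraper = Callable[[str, int], List[str]]         # (question, limit) -> page texts
Reader = Callable[[str, str], Tuple[str, float]]  # (question, context) -> (answer, score)


def answer_question(question: str, scrape: Scraper, read: Reader,
                    store: List[dict], limit: int = 10) -> str:
    """Run steps 2-5 of the pipeline for one question."""
    pages = scrape(question, limit)[:limit]  # step 2: at most `limit` results
    context = "\n".join(pages)
    answer, score = read(question, context)  # step 4: (Question, Context) >> Answer
    # steps 3 and 6: keep the sample; `is_good` stays None until someone marks it
    store.append({"question": question, "context": context,
                  "answer": answer, "score": score, "is_good": None})
    return answer                            # step 5


# Stub implementations (assumptions) standing in for Google scraping and the model.
def fake_scrape(question: str, limit: int) -> List[str]:
    return ["Dr. Khosmood teaches at Cal Poly."]


def fake_read(question: str, context: str) -> Tuple[str, float]:
    return "Cal Poly", 0.5


if __name__ == "__main__":
    samples: List[dict] = []
    print(answer_question("where does Dr. Khosmood teach?",
                          fake_scrape, fake_read, samples))
```

Passing the scraper and reader in as arguments keeps the function free of side effects other than the append to `store`, which fits the functional style the project aims for.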

TODO

  • a simple web UI with an input box and a section for answers
    • if bad-answer then offer user a toggle: isItAnyOf(ans1,ans2..)
    • if user does not choose a toggle then mark as possibly-answerable
    • set up a nice UI for verification team to complete task.
  • database code for
  • test performance
    • avoid generating tests with code, because a test should not depend on the subject under test.
    • measure precision & recall of this system
  • make improvements to assumptions
  • consider git rev-parse HEAD to get latest commit hash to associate with data.
  • consider learning new facts from TrustedUser
    • e.g. Dr. Khosmood is a TrustedUser and can offer the system either:
      • URL
        • e.g. a published google doc containing a professor's syllabus.
        • e.g. a professor's personal website
      • UserContext
        • e.g. the plain-text of a professor's syllabus.
        • either provided through real-time chat client
        • or provided through a simple input box
        • also consider ChatContext
      • (Question, Answer) mappings
      • so, when any User asks a previously mapped question, then the correct answer can be returned
      • or, when the most relevant UserContext is found for the given question, a reasonable answer can still be returned.
  • question/answer data augmentation
    • remember that augmentations need a grammar check by a human
    • try Question-Paraphrasing
    • also try style-transformations
      • "PHRASE REPLACEMENT TRANSFORM" (Khosmood, pg. 118)
        • I wanted to be with you alone
          • => I desired to be with you only.
        • class phraseXform
          • update it to latest technologies: SpaCy! BabelNet?
        • similar to /r/IncreasinglyVerbose
        • I teach at Cal Poly
          • => I teach at a university in California
            • (replace Cal Poly with definition)
          • => I impart skills or knowledge to students at a university in California
            • (replace teach with definition and append students)
          • => I impart skills or knowledge to students at an establishment where a seat of higher learning is housed in California
            • (replace university with definition)
          • => I impart skills or knowledge to students at an establishment where a seat of higher learning is housed in San Luis Obispo, California
            • (apply knowledge of city location of Cal Poly)
        • "Translation-Tours" (Khosmood, pg. 141)
          • "Translation tour with Spanish, French, German" (Khosmood, pg. 141)
            • I teach at Cal Poly
              • => Enseño en Cal Poly (English => Spanish)
              • => J'enseigne à Cal Poly (Spanish => French)
              • => Ich unterrichte an der Cal Poly (French => German)
              • => I teach at Cal Poly (German => English)
            • I teach at Cal Poly.
              • => Doy clases en Cal Poly. (English => Spanish)
              • => Ich unterrichte an der Cal Poly.
              • => I teach at Cal Poly.
          • Alternative Translation Tours
            • I teach at Cal Poly
              • => እኔ በካሊ ፖሊ አስተምራለሁ ፡፡ (English => Amharic)
              • => I teach by Kali Poly. (Amharic => English)
  • chart useful metrics
    • e.g. average confidence score of the transformer over time (or over code changes); requires logging the commit hash
    • e.g. lexical similarity (fuzz ratio) of question to context over time (or over code changes); requires logging the commit hash
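
The two TODO items about `git rev-parse HEAD` and charting metrics per commit could be combined roughly as follows. This is a sketch under assumptions: `current_commit_hash`, `lexical_similarity`, and `metric_row` are hypothetical helpers, and the standard library's `difflib.SequenceMatcher` is used here as a stand-in for a fuzz ratio rather than whichever fuzzy-matching library the project ends up with.

```python
import subprocess
from difflib import SequenceMatcher
from typing import Optional


def current_commit_hash() -> Optional[str]:
    """Return the output of `git rev-parse HEAD`, or None if git fails."""
    try:
        result = subprocess.run(["git", "rev-parse", "HEAD"],
                                stdout=subprocess.PIPE,
                                stderr=subprocess.DEVNULL,
                                universal_newlines=True,
                                check=True)
    except (subprocess.CalledProcessError, FileNotFoundError):
        # Not inside a git repository, or git is not installed.
        return None
    return result.stdout.strip()


def lexical_similarity(question: str, context: str) -> int:
    """Similarity on a 0-100 scale (difflib stand-in for a fuzz ratio)."""
    matcher = SequenceMatcher(None, question.lower(), context.lower())
    return round(100 * matcher.ratio())


def metric_row(question: str, context: str, confidence: float) -> dict:
    """Bundle the metrics with the commit hash so they can be charted per commit."""
    return {
        "commit": current_commit_hash(),
        "similarity": lexical_similarity(question, context),
        "confidence": confidence,
    }


if __name__ == "__main__":
    print(metric_row("I teach at Cal Poly", "I teach at Cal Poly.", 0.9))
```

The `subprocess.run` arguments are chosen to work on Python 3.6, which is the version shown in Getting Started.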

What is data.csv?

data.csv is a temporary "database" for appending question samples with the generated meta-data and final answer of this system.
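
Appending a sample to such a file might look like the sketch below. The column names in `FIELDS` and the `append_row` helper are hypothetical; the actual schema of data.csv is not described here.

```python
import csv
import os
from typing import Dict, List

# Hypothetical column names; the real data.csv schema may differ.
FIELDS: List[str] = ["question", "answer", "context_length"]


def append_row(path: str, row: Dict[str, object]) -> None:
    """Append one sample to a CSV file, writing the header if the file is new or empty."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(row)


if __name__ == "__main__":
    append_row("data.csv", {"question": "what is Dr. Foaad Khosmood email?",
                            "answer": "foaad@ calpoly.edu.",
                            "context_length": 911})
    print("appended new row to data.csv")
```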

Keeping track of this data will help with measuring the model's performance and making improvements based on performance metrics.
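
As one possible way to turn the marked samples into metrics, the sketch below computes precision and recall from (answered, correct) pairs. The definitions are assumptions made for illustration: precision is taken as correct answers over answers given, recall as correct answers over all questions asked (treating every question as answerable), and `precision_recall` is a hypothetical helper.

```python
from typing import Iterable, Tuple


def precision_recall(samples: Iterable[Tuple[bool, bool]]) -> Tuple[float, float]:
    """Compute (precision, recall) from (answered, correct) pairs.

    Assumed definitions:
      precision = correct answers / answers given
      recall    = correct answers / all questions asked
    """
    total = answered = correct = 0
    for was_answered, was_correct in samples:
        total += 1
        if was_answered:
            answered += 1
            if was_correct:
                correct += 1
    precision = correct / answered if answered else 0.0
    recall = correct / total if total else 0.0
    return precision, recall


if __name__ == "__main__":
    marked = [(True, True), (True, False), (False, False), (True, True)]
    print(precision_recall(marked))
```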

data.png

Resources