
I use this project to investigate different ways to implement question-answering functionality using AI/ML


madcato/question-answering


Question answering

Install

  1. Install torch==1.9.0 and torchtext==0.10.0 for
  2. git clone git@gitlab:ai/question-answering.git
  3. cd question-answering
  4. git submodule init
  5. git submodule update
  6. source .venv/bin/activate
  7. cd transformers
  8. pip3 install -e .
  9. pip3 install pandas
  10. pip3 install -U scikit-learn

ToDo

Doc

To solve a question-answering problem such as answering emails, we need a text-generation solution; the task type to use is text2text-generation. For example:

Guide 1 to retrain a GPT-2 model with PyTorch

special_tokens_dict = {'bos_token': '<BOS>', 'eos_token': '<EOS>', 'pad_token': '<PAD>'}
num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)
model.resize_token_embeddings(len(tokenizer))

Grocery and Gourmet Food

Download and prepare data

  1. Download: $ wget http://jmcauley.ucsd.edu/data/amazon/qa/qa_Grocery_and_Gourmet_Food.json.gz
  2. Prepare the train and verify data: $ python3 prepare_GG_data.py
  3. Prepare the train and verify data for seq2seq: $ python3 prepare_seq_data.py
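
The contents of prepare_GG_data.py are not shown here, so the following is only a minimal sketch of what a train/verify split could look like, assuming the records are already loaded as a list of dicts. The function name split_records, the 10% verify fraction and the fixed seed are illustrative assumptions, not the script's actual behavior.

```python
import random

def split_records(records, verify_fraction=0.1, seed=42):
    """Shuffle records deterministically and split them into (train, verify)."""
    shuffled = list(records)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    n_verify = max(1, int(len(shuffled) * verify_fraction))
    return shuffled[n_verify:], shuffled[:n_verify]

# Hypothetical records shaped like the dataset's question/answer fields.
records = [{"question": f"q{i}", "answer": f"a{i}"} for i in range(20)]
train, verify = split_records(records)
print(len(train), len(verify))
```

Fixing the seed keeps the split reproducible between runs, which makes results from the different approaches easier to compare.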

qa_Grocery_and_Gourmet_Food Data format

Each line has a json object with the following properties:

  • questionType
  • asin
  • answerTime
  • unixTime
  • question
  • answer
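
Sample code to load files

import ast
import gzip

import pandas as pd

def parse(path):
  # Each line is parsed as a Python literal. ast.literal_eval only accepts
  # literals, so unlike eval() it cannot execute arbitrary code.
  with gzip.open(path, 'rt', encoding='utf-8') as g:
    for line in g:
      yield ast.literal_eval(line)

def getDF(path):
  # One DataFrame row per record, indexed 0..n-1.
  return pd.DataFrame(list(parse(path)))

df = getDF('qa_Grocery_and_Gourmet_Food.json.gz')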

Models

Text2Text-generations

Investigation ways

I must investigate the following ways to solve this problem:

  1. Train a language model from scratch, to generate text.
  2. Retrain a pre-trained language model, such as GPT-2, to generate text.
  3. Fine-tune a pre-trained language model, to generate text, by doing a "conversational" task.
  4. Fine-tune a seq2seq model, to generate text.
  5. Train from scratch a tiny model to get 100% accuracy
  6. Train a pytorch model based on translation
  7. Use doc2vec to generate the question embedding, store it and find it using cosine similarity.
  8. Try OpenAI to make a "text search" solution.
  9. Try Huggingface BERT to make a "text search" solution.
  10. Try Sentence-Transformers to make a "text search" solution.
  11. Try NER.

1. Train a language model from scratch, to generate text.

First install requirements:

$ pip3 install -r ./transformers/examples/pytorch/language-modeling/requirements.txt

Also, do the step Download and prepare data.

Then run training:

$ ./train_gpt2_from_scratch/train_gpt2_model.sh

After training is done, do inference:

$ python3 ./train_gpt2_from_scratch/inference_gpt2_model.py

Results

The model returns almost the same answer no matter the question; it almost always returns "I don't know".

2. Retrain a pre-trained language model, such as GPT-2, to generate text.

First install requirements:

$ pip3 install -r ./transformers/examples/pytorch/language-modeling/requirements.txt

Also, do the step Download and prepare data.

Then run training:

$ ./train_gpt2_from_pretrained/train_gpt2_model.sh

After training is done, do inference:

$ python3 ./train_gpt2_from_pretrained/inference_gpt2_model.py

Results

Training was fast, but I could not get inference to work, because the script raises an exception that I could not resolve.

3. Fine-tune a pre-trained language model, to generate text, by doing a "conversational" task

Nothing yet.

4. Fine-tune a seq2seq model, to generate text

Nothing yet.

5. Train from scratch a tiny model to get 100% accuracy

First install requirements:

$ pip3 install -r ./transformers/examples/pytorch/language-modeling/requirements.txt

Then run training:

$ ./train_tiny_gpt2_from_scratch/train_gpt2_model.sh

After training is done, do inference:

$ python3 ./train_tiny_gpt2_from_scratch/inference_gpt2_model.py

6. Train a pytorch model based on translation

Adapt a translation model to solve this question-answering task.

https://pytorch.org/tutorials/beginner/translation_transformer.html

ToDo:

  • Find code where I have already implemented Datasets.
  • Find code where I have already used the PyTorch transformers.
  • Implement the code exactly as it appears in the translation_transformer doc.
  • Adapt the code to do question answering.

tokenizer = get_tokenizer('basic_english')

Doc: tokens

  • UNK_IDX -> default index. This index is returned when the token is not found.
  • PAD_IDX -> value used to fill short sequences.
  • BOS_IDX -> beginning of string
  • EOS_IDX -> end of string
  • SEP_IDX -> separator between questions and answers
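
To make the role of each index concrete, here is a minimal sketch of how a question/answer pair could be encoded with these tokens. The numeric values, the special-token strings and the plain whitespace tokenizer are assumptions for illustration; the training script may number the indices differently and uses torchtext's basic_english tokenizer.

```python
# Index values are assumed for illustration only.
UNK_IDX, PAD_IDX, BOS_IDX, EOS_IDX, SEP_IDX = 0, 1, 2, 3, 4
SPECIALS = ["<unk>", "<pad>", "<bos>", "<eos>", "<sep>"]

def build_vocab(sentences):
    """Map each whitespace token to an index, reserving the first slots for the specials."""
    vocab = {tok: idx for idx, tok in enumerate(SPECIALS)}
    for sentence in sentences:
        for tok in sentence.lower().split():
            vocab.setdefault(tok, len(vocab))
    return vocab

def encode_pair(question, answer, vocab, max_len):
    """Encode as BOS question SEP answer EOS, then pad with PAD_IDX up to max_len."""
    q = [vocab.get(t, UNK_IDX) for t in question.lower().split()]
    a = [vocab.get(t, UNK_IDX) for t in answer.lower().split()]
    ids = [BOS_IDX] + q + [SEP_IDX] + a + [EOS_IDX]
    ids = ids[:max_len]  # note: truncation can drop the trailing EOS
    return ids + [PAD_IDX] * (max_len - len(ids))

vocab = build_vocab(["is it vegan", "yes it is"])
# "organic" is not in the vocabulary, so it maps to UNK_IDX.
print(encode_pair("is it organic", "yes", vocab, max_len=10))
# [2, 5, 6, 0, 4, 8, 3, 1, 1, 1]
```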

Run sample file:

$ source .venv/bin/activate
# Create source and target language tokenizer. Make sure to install the dependencies.
# pip3 install -U spacy
# python3 -m spacy download en_core_web_sm
# python3 -m spacy download de_core_news_sm
$ python3 ./train_pytorch_for_translation/language_translation.py

7. Use doc2vec to generate the question embedding, store it and find it using cosine similarity.

cosine-similarity(V,W) = (v1 * w1 + v2 * w2 + v3 * w3) / (sqrt(v1 * v1 + v2 * v2 + v3 * v3) * sqrt(w1 * w1 + w2 * w2 + w3 * w3))

cosine-distance(V,W) = 1 - cosine-similarity(V,W)

Use StackOverflow dataset as corpus.
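
A minimal sketch of the lookup step, using the standard definition of cosine similarity (dot product divided by the product of the vector norms). The hand-written 3-dimensional vectors stand in for doc2vec embeddings, and the names cosine_similarity and most_similar are illustrative, not taken from this repository.

```python
import math

def cosine_similarity(v, w):
    """Dot product of v and w divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(v, w))
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_w = math.sqrt(sum(b * b for b in w))
    if norm_v == 0 or norm_w == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_v * norm_w)

def most_similar(query, stored):
    """Return the (key, score) pair of the stored vector most similar to query."""
    scored = ((key, cosine_similarity(query, vec)) for key, vec in stored.items())
    return max(scored, key=lambda pair: pair[1])

# Placeholder embeddings for two stored questions.
stored = {"q1": [1.0, 0.0, 0.0], "q2": [0.0, 1.0, 1.0]}
print(most_similar([0.9, 0.1, 0.0], stored))
```

A higher score means a closer match, so the stored question with the largest similarity is the one to return.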

8. Try OpenAI to make a "text search" solution.

Installation

First install the latest version of sqlite3, compiled with the SQLITE_MAX_EXPR_DEPTH limit deactivated:

  1. CFLAGS="-DSQLITE_MAX_EXPR_DEPTH=0" ./configure
  2. make
  3. sudo make install

Add /usr/local/bin to $PATH:

$ export PATH="$PATH:/usr/local/bin"

Finally, install the gems:

$ bundle

IMPORTANT: Install the gems last so that the sqlite3 gem uses the sqlite3 build compiled above.

Usage

First generate the embeddings and store them in sqlite.db:

ruby ./openai_embeddings/generate_embeddings.rb

Then search for text:

ruby ./openai_embeddings/search_text.rb
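
The Ruby scripts are not reproduced here. As a rough Python sketch of the same store-then-search idea, the snippet below keeps embeddings as JSON text in an in-memory SQLite table and ranks rows by cosine similarity in Python. The table name, column names and the hand-written vectors are assumptions. It does not call the OpenAI API, and it does not compute the similarity inside SQL as the Ruby version may do.

```python
import json
import math
import sqlite3

def cosine(v, w):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(v, w))
    return dot / (math.sqrt(sum(a * a for a in v)) * math.sqrt(sum(b * b for b in w)))

def store(conn, text, embedding):
    """Save a text together with its embedding, serialized as JSON."""
    conn.execute(
        "INSERT INTO embeddings (text, vector) VALUES (?, ?)",
        (text, json.dumps(embedding)),
    )

def search(conn, query_embedding, top_k=1):
    """Return the top_k stored texts ranked by similarity to the query embedding."""
    rows = conn.execute("SELECT text, vector FROM embeddings").fetchall()
    scored = [(cosine(query_embedding, json.loads(vec)), text) for text, vec in rows]
    scored.sort(reverse=True)
    return [text for _, text in scored[:top_k]]

conn = sqlite3.connect(":memory:")  # hypothetical stand-in for sqlite.db
conn.execute("CREATE TABLE embeddings (text TEXT, vector TEXT)")
store(conn, "How do I reset my password?", [0.9, 0.1, 0.0])
store(conn, "What is the delivery time?", [0.0, 0.2, 0.9])
print(search(conn, [0.8, 0.2, 0.1]))
```

Scanning every row is fine for a small corpus; a larger one would need an index or a dedicated vector store.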

Use train_tiny_list.csv

Train

$ python3 ./train_pytorch_for_translation/question-answering.py

9. Try Huggingface BERT to make a "text search" solution

with sentence embeddings

cd bert_embeddings
source .venv/bin/activate
python3 try_embeddings.py

10. Try Sentence-Transformers to make a "text search" solution

Use cases

ToDo

Install

First install PyTorch with CUDA. Then:

$ pip3 install -U sentence-transformers

Usage

cd sentence_transformers
source .venv/bin/activate
python3 try_sentence.py

Conclusions

For Spanish, this model works quite well:

These other models are also multilingual:

  • distiluse-base-multilingual-cased-v1
  • distiluse-base-multilingual-cased-v2
  • paraphrase-multilingual-MiniLM-L12-v2
  • paraphrase-multilingual-mpnet-base-v2

11. Try NER

source .venv/bin/activate
cd ner
python3 try_spanish_ner.py

ToDo

  • Find out how to save and restore retrained models.
  • From what I see in the custom dataset document, I need to write my own code to load the data.
  • Investigate Question Answering with SQuAD 2.0.
  • I need to find a way to retrain text2text-generation systems -> USE LM: https://github.com/huggingface/transformers/tree/master/examples/pytorch/language-modeling#gpt-2gpt-and-causal-language-modeling
  • Create a simple pipeline-based inference system for text2text-generation.
  • One option to keep in mind: I may not need fine-tuning at all; I could simply train the whole model on my emails, SEPARATING THE QUESTION FROM THE ANSWER WITH A KEYWORD TOKEN SUCH AS '>>>QA>>>'.
  • Maybe I can retrain a Spanish GPT-2. YES, I CAN.
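
The '>>>QA>>>' separator idea from the list above could be sketched as follows. This is only an illustration under assumptions: the helper names to_training_line and extract_answer are hypothetical, and the one-pair-per-line layout is a guess at a convenient format for a causal language model, not something the repository defines.

```python
SEPARATOR = ">>>QA>>>"

def to_training_line(question, answer):
    """Join a question and its answer into one line, collapsing internal whitespace."""
    q = " ".join(question.split())
    a = " ".join(answer.split())
    return f"{q} {SEPARATOR} {a}"

def extract_answer(generated):
    """Return the text after the separator, or an empty string if it is missing."""
    _, sep, tail = generated.partition(SEPARATOR)
    return tail.strip() if sep else ""

line = to_training_line("Is this gluten free?", "Yes,\nit is.")
print(line)                  # Is this gluten free? >>>QA>>> Yes, it is.
print(extract_answer(line))  # Yes, it is.
```

At inference time the prompt would be the question followed by the separator, and extract_answer would pull the answer out of whatever the model generates.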

Remember

Patches

token patches

line run_clm.py:346

tokenizer.add_special_tokens({ "eos_token": "", "bos_token": "", "unk_token": "" })

Marketing
