- Install torch==1.9.0 and torchtext==0.10.0.
git clone git@gitlab:ai/question-answering.git
cd question-answering
git submodule init
git submodule update
source .venv/bin/activate
cd transformers
pip3 install -e .
pip3 install pandas
pip3 install -U scikit-learn
- Read to get up to speed: https://huggingface.co/transformers/v3.2.0/quicktour.html
- Study https://huggingface.co/transformers/v3.2.0/preprocessing.html
- Study https://huggingface.co/transformers/v3.2.0/training.html
For a question-answering problem like answering emails, we need text-generation solutions; the task type to use is text2text-generation. Useful references (a quick pipeline sketch follows the list):
- Text Generation
- Task Summary: Text Generation
- Hugging Face: install from source
- transformers/examples/pytorch/text-generation/
- Hugging Face - Text2Text Generation models
- Huggingface: fine-tuning with custom datasets
- transformers/examples/pytorch/language-modeling/
- train transformer with pytorch
- Fine-tune in native PyTorch
- Preprocessing data
- GPT2Config param doc
- transformers.ConversationalPipeline
- DialoGPT
- StackOverflow dataset
- Text similarity search with vector fields
- Huggingface: Preprocessing data
- Huggingface: Training and fine-tuning (here they explain how to build a custom fine-tuning with PyTorch and Huggingface)
- dslim/bert-base-NER
- Fine-tuning GPT2 for Text Generation Using Pytorch
- This previous guide uses the old huggingface transformer script
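As a quick sanity check of the text2text-generation task type, a minimal sketch; t5-small is only a stand-in checkpoint to verify the wiring, not the model we will fine-tune:

```python
from transformers import pipeline

# Generic text2text-generation pipeline; the prompt format ("question: ... context: ...")
# and the t5-small checkpoint are just assumptions for illustration.
generator = pipeline("text2text-generation", model="t5-small")
print(generator("question: Is this product gluten free? context: Made with rice flour."))
```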
# Register BOS/EOS/PAD markers with the tokenizer, then resize the model's
# embedding matrix so the new token ids have embeddings.
special_tokens_dict = {'bos_token': '<BOS>', 'eos_token': '<EOS>', 'pad_token': '<PAD>'}
num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)
model.resize_token_embeddings(len(tokenizer))
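A minimal sketch of how a question/answer pair could be serialized with these tokens for causal LM training, using the '>>>QA>>>' separator idea from the ToDo notes below (the function name and the sample strings are hypothetical):

```python
# Hypothetical formatting of one training example for a causal LM:
# <BOS> question >>>QA>>> answer <EOS>, padded later with <PAD>.
# The separator itself would also need tokenizer.add_tokens([SEP_TOKEN]).
SEP_TOKEN = ">>>QA>>>"

def format_example(question, answer):
    return f"<BOS> {question} {SEP_TOKEN} {answer} <EOS>"

print(format_example("Is this product gluten free?", "Yes, it is."))
```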
- Download:
$ wget http://jmcauley.ucsd.edu/data/amazon/qa/qa_Grocery_and_Gourmet_Food.json.gz
- Prepare train and verify data
$ python3 prepare_GG_data.py
- Prepare train and verify data for seq2seq
$ python3 prepare_seq_data.py
Each line is a JSON object with the following properties:
- questionType
- asin
- answerTime
- unixTime
- question
- answer
import pandas as pd
import gzip

def parse(path):
    # Each line of the gzipped file is a Python-literal dict (McAuley's format).
    g = gzip.open(path, 'rb')
    for l in g:
        yield eval(l)

def getDF(path):
    # Load every record into a DataFrame, one row per question/answer entry.
    i = 0
    df = {}
    for d in parse(path):
        df[i] = d
        i += 1
    return pd.DataFrame.from_dict(df, orient='index')

df = getDF('qa_Grocery_and_Gourmet_Food.json.gz')
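A minimal sketch of what the prepare step presumably does (this is only an assumption about prepare_GG_data.py; the output file names and the 90/10 split are hypothetical):

```python
# Keep only the question/answer columns and write a simple train/verify split.
qa = df[["question", "answer"]].dropna()
train = qa.sample(frac=0.9, random_state=42)
verify = qa.drop(train.index)
train.to_csv("train.csv", index=False)
verify.to_csv("verify.csv", index=False)
```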
I must investigate several approaches to solve this problem:
- Train a language model from scratch to generate text.
- Use a pre-trained language model, retrained, to generate text (e.g. GPT-2).
- Fine-tune a pre-trained language model to generate text, framing it as a "conversational" task.
- Fine-tune a seq2seq model to generate text.
- Train a tiny model from scratch to reach 100% accuracy.
- Train a PyTorch model based on translation.
- Use doc2vec to generate the question embedding, store it, and retrieve it using cosine similarity.
- Try OpenAI for a "text search" solution.
- Try Huggingface BERT for a "text search" solution.
- Try Sentence-Transformers for a "text search" solution.
- Try NER.
First install requirements:
$ pip3 install -r ./transformers/examples/pytorch/language-modeling/requirements.txt
Also, do the step Download and prepare data.
Then run training:
$ ./train_gpt2_from_scratch/train_gpt2_model.sh
After training is done, do inference:
$ python3 ./train_gpt2_from_scratch/inference_gpt2_model.py
It returns almost the same answer no matter the question: almost always "I don't know".
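For reference, a minimal inference sketch (the output directory, prompt format, and generation settings are assumptions; the actual inference_gpt2_model.py may differ):

```python
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Hypothetical output directory; the real script may load a different path.
model_dir = "./gpt2_from_scratch_output"
tokenizer = GPT2TokenizerFast.from_pretrained(model_dir)
model = GPT2LMHeadModel.from_pretrained(model_dir)

# Hypothetical prompt: question followed by the separator, letting the model complete the answer.
prompt = "<BOS> Is this product gluten free? >>>QA>>>"
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_length=64, pad_token_id=tokenizer.pad_token_id)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```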
First install requirements:
$ pip3 install -r ./transformers/examples/pytorch/language-modeling/requirements.txt
Also, do the step Download and prepare data.
Then run training:
$ ./train_gpt2_from_pretrained/train_gpt2_model.sh
After training is done, do inference:
$ python3 ./train_gpt2_from_pretrained/inference_gpt2_model.py
Training was fast, but I could not get inference working: the script raises an exception that I could not resolve.
Nothing yet.
First install requirements:
$ pip3 install -r ./transformers/examples/pytorch/language-modeling/requirements.txt
Then run training:
$ ./train_tiny_gpt2_from_scratch/train_gpt2_model.sh
After training is done, do inference:
$ python3 ./train_tiny_gpt2_from_scratch/inference_gpt2_model.py
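A minimal sketch of what a "tiny" GPT-2 configuration could look like (the exact sizes used by train_tiny_gpt2_from_scratch are an assumption):

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Hypothetical tiny configuration: far fewer layers/heads/embedding dims than
# the default GPT-2 (12 layers, 12 heads, 768 dims).
tiny_config = GPT2Config(
    vocab_size=5000,
    n_positions=128,
    n_embd=128,
    n_layer=2,
    n_head=2,
)
model = GPT2LMHeadModel(config=tiny_config)
print(sum(p.numel() for p in model.parameters()), "parameters")
```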
Adapt a translation model to solve this question-answering problem.
https://pytorch.org/tutorials/beginner/translation_transformer.html
ToDo:
- Look for code where I have already implemented Datasets.
- Look for code where I have already used the PyTorch transformers.
- Implement the code exactly as it is in the translation_transformer doc.
- Adapt the code to do question-answering.
from torchtext.data.utils import get_tokenizer
tokenizer = get_tokenizer('basic_english')
- https://gitlab/ai/pytorch-word2vec/-/blob/main/word2vec/dataset.py
- https://gitlab/ai/libtorch-lm/-/blob/master/language_translation.py
- https://gitlab/ai/libtorch-lm/-/wikis/home
- https://andrewpeng.dev/transformer-pytorch/
Token documentation:
- UNK_IDX -> default index, returned when a token is not found in the vocabulary.
- PAD_IDX -> value used to pad short sequences.
- BOS_IDX -> beginning of string.
- EOS_IDX -> end of string.
- SEP_IDX -> separator between questions and answers (see the sketch below).
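A minimal sketch of how these indices could be used when encoding text with the basic_english tokenizer (the index values, the stoi dict, and the encode helper are assumptions for illustration; the real vocabulary would be built over the training corpus):

```python
from torchtext.data.utils import get_tokenizer

# Special token indices; SEP_IDX is our own addition to separate question and answer.
UNK_IDX, PAD_IDX, BOS_IDX, EOS_IDX, SEP_IDX = 0, 1, 2, 3, 4

tokenizer = get_tokenizer("basic_english")

def encode(text, stoi):
    # Map tokens to indices, falling back to UNK_IDX for unknown tokens,
    # and wrap the sequence with BOS/EOS.
    ids = [stoi.get(tok, UNK_IDX) for tok in tokenizer(text)]
    return [BOS_IDX] + ids + [EOS_IDX]

# Tiny illustrative vocabulary; the real one would come from the prepared data.
stoi = {"is": 5, "this": 6, "product": 7, "gluten": 8, "free": 9}
print(encode("Is this product gluten free?", stoi))
```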
Run sample file:
$ source .venv/bin/activate
# Create source and target language tokenizer. Make sure to install the dependencies.
# pip3 install -U spacy
# python3 -m spacy download en_core_web_sm
# python3 -m spacy download de_core_news_sm
$ python3 ./train_pytorch_for_translation/language_translation.py
cosine-similarity(V, W) = (v1 * w1 + v2 * w2 + v3 * w3) / (sqrt(v1 * v1 + v2 * v2 + v3 * v3) * sqrt(w1 * w1 + w2 * w2 + w3 * w3))
cosine-distance(V, W) = 1 - cosine-similarity(V, W)
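The same computation in code (a minimal numpy sketch):

```python
import numpy as np

def cosine_similarity(v, w):
    # Dot product of the vectors divided by the product of their magnitudes.
    return float(np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w)))

def cosine_distance(v, w):
    return 1.0 - cosine_similarity(v, w)

print(cosine_similarity(np.array([1.0, 2.0, 3.0]), np.array([1.0, 2.0, 3.5])))
```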
Use StackOverflow dataset as corpus.
- Use the Ada model because it is the lightest (1024 dimensions).
- Use the text-similarity-ada-001 model for clustering, regression, anomaly detection, and visualization (see the sketch below).
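For reference, a minimal sketch of fetching one embedding. The repository scripts are in Ruby; this Python sketch assumes the legacy openai package (< 1.0) and the text-similarity-ada-001 model mentioned above:

```python
import openai  # legacy (< 1.0) client; newer versions use a different interface

openai.api_key = "YOUR_API_KEY"  # placeholder

response = openai.Embedding.create(
    model="text-similarity-ada-001",
    input="Is this product gluten free?",
)
embedding = response["data"][0]["embedding"]
print(len(embedding))  # 1024 dimensions for the Ada similarity model
```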
Build SQLite, deactivating SQLITE_MAX_EXPR_DEPTH:
CFLAGS="-DSQLITE_MAX_EXPR_DEPTH=0" ./configure
make
sudo make install
$ export PATH="$PATH:/usr/local/bin"
$ bundle
IMPORTANT: install the gems last so the sqlite3 gem uses the compiled version of the sqlite3 executable.
First generate the embeddings and store them in sqlite.db:
ruby ./openai_embeddings/generate_embeddings.rb
Then search for text:
ruby ./openai_embeddings/search_text.rb
Use train_tiny_list.csv
$ python3 ./train_pytorch_for_translation/question-answering.py
Try BERT with sentence embeddings:
cd bert_embeddings
source .venv/bin/activate
python3 try_embeddings.py
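A minimal sketch of getting a sentence embedding from a BERT checkpoint by mean-pooling the token embeddings (try_embeddings.py may do it differently; bert-base-multilingual-cased is just an assumed checkpoint):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumed checkpoint; any BERT-style model works the same way.
name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

def embed(sentence):
    # Mean-pool the last hidden states over the non-padding tokens.
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, hidden)
    mask = inputs["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

a, b = embed("¿Este producto no tiene gluten?"), embed("Is this product gluten free?")
print(torch.nn.functional.cosine_similarity(a, b).item())
```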
- See other use cases and usages of Sentence-Transformers
First install PyTorch with CUDA. Then:
$ pip3 install -U sentence-transformers
cd sentence_transformers
source .venv/bin/activate
python3 try_sentence.py
The following model works quite well for Spanish:
These other models are also multilingual (see the sketch below):
- distiluse-base-multilingual-cased-v1
- distiluse-base-multilingual-cased-v2
- paraphrase-multilingual-MiniLM-L12-v2
- paraphrase-multilingual-mpnet-base-v2
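A minimal text-search sketch using one of the multilingual models above (try_sentence.py may differ; the corpus strings are just illustrative):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

corpus = [
    "Yes, this product is gluten free.",
    "Shipping usually takes two weeks.",
]
query = "¿Este producto tiene gluten?"

corpus_emb = model.encode(corpus, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)

# Cosine similarity between the query and each corpus sentence; pick the best match.
scores = util.cos_sim(query_emb, corpus_emb)[0]
best = int(scores.argmax())
print(corpus[best], float(scores[best]))
```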
- Huggingface: token classification
- Davlan/bert-base-multilingual-cased-ner-hrl
source .venv/bin/activate
cd ner
python3 try_spanish_ner.py
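A minimal sketch of what try_spanish_ner.py presumably does with the Davlan model (this is an assumption; it also assumes a transformers version that supports aggregation_strategy):

```python
from transformers import pipeline

# Multilingual NER model listed above; aggregation groups word pieces into whole entities.
ner = pipeline(
    "ner",
    model="Davlan/bert-base-multilingual-cased-ner-hrl",
    aggregation_strategy="simple",
)
print(ner("María vive en Sevilla y trabaja para Telefónica."))
```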
- Find out how to save and restore retrained models.
- From what I can see in the custom dataset document, what I have to do is write my own code that loads the data.
- Investigate Question Answering with SQuAD 2.0.
- I need to find a way to retrain text2text-generation systems -> USE LM https://github.com/huggingface/transformers/tree/master/examples/pytorch/language-modeling#gpt-2gpt-and-causal-language-modeling
- Create a simple inference system with pipelines for text2text-generation.
- One option to keep in mind is that I may not need fine-tuning at all; I could simply train the whole model on my emails, SEPARATING THE QUESTION FROM THE ANSWER WITH A KEY TOKEN WORD SUCH AS '>>>QA>>>'.
- I could also retrain a GPT-2 in Spanish. THAT'S RIGHT.
- Text generation is currently possible with GPT-2, OpenAI-GPT, CTRL, XLNet, Transfo-XL and Reformer in PyTorch.
- A possible solution would be Text Generation.
- Transformers can also be used for Named Entity Recognition (NER).
- This is what I want to do: fine-tune a GPT-2 ---->>>> https://github.com/huggingface/transformers/tree/master/examples/pytorch/language-modeling#gpt-2gpt-and-causal-language-modeling
At run_clm.py line 346:
tokenizer.add_special_tokens({
"eos_token": "",
"bos_token": "",
"unk_token": ""
})