- Install torch==1.9.0 and torchtext==0.10.0.
git clone git@gitlab:ai/question-answering.git
cd question-answering
git submodule init
git submodule update
source .venv/bin/activate
cd transformers
pip3 install -e .
pip3 install pandas
pip3 install -U scikit-learn
- Read to get up to speed: https://huggingface.co/transformers/v3.2.0/quicktour.html
- Study https://huggingface.co/transformers/v3.2.0/preprocessing.html
- Study https://huggingface.co/transformers/v3.2.0/training.html
For a question-answering problem like answering emails, we need text-generation solutions; the task type to use is text2text-generation. Useful references (a quick pipeline sketch follows the list):
- Text Generation
- Task Summary: Text Generation
- Hugging Face: install from source
- transformers/examples/pytorch/text-generation/
- Hugging Face - Text2Text Generation models
- Huggingface: fine-tuning with custom datasets
- transformers/examples/pytorch/language-modeling/
- train transformer with pytorch
- Fine-tune in native PyTorch
- Preprocessing data
- GPT2Config param doc
- transformers.ConversationalPipeline
- DialoGPT
- StackOverflow dataset
- Text similarity search with vector fields
- Huggingface: Preprocessing data
- Huggingface: Training and fine-tuning (here they explain how to build a custom fine-tuning with PyTorch and Huggingface)
- dslim/bert-base-NER
- Fine-tuning GPT2 for Text Generation Using Pytorch
- This previous guide uses the old huggingface transformer script
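As a quick sanity check of the text2text-generation task type, a minimal sketch; t5-small is only a stand-in checkpoint to verify the wiring, not the model we will fine-tune:

```python
from transformers import pipeline

# Generic text2text-generation pipeline; the prompt format ("question: ... context: ...")
# and the t5-small checkpoint are just assumptions for illustration.
generator = pipeline("text2text-generation", model="t5-small")
print(generator("question: Is this product gluten free? context: Made with rice flour."))
```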
# Register BOS/EOS/PAD markers with the tokenizer, then resize the model's
# embedding matrix so the new token ids have embeddings.
special_tokens_dict = {'bos_token': '<BOS>', 'eos_token': '<EOS>', 'pad_token': '<PAD>'}
num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)
model.resize_token_embeddings(len(tokenizer))
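A minimal sketch of how a question/answer pair could be serialized with these tokens for causal LM training, using the '>>>QA>>>' separator idea from the ToDo notes below (the function name and the sample strings are hypothetical):

```python
# Hypothetical formatting of one training example for a causal LM:
# <BOS> question >>>QA>>> answer <EOS>, padded later with <PAD>.
# The separator itself would also need tokenizer.add_tokens([SEP_TOKEN]).
SEP_TOKEN = ">>>QA>>>"

def format_example(question, answer):
    return f"<BOS> {question} {SEP_TOKEN} {answer} <EOS>"

print(format_example("Is this product gluten free?", "Yes, it is."))
```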
- Download:
$ wget http://jmcauley.ucsd.edu/data/amazon/qa/qa_Grocery_and_Gourmet_Food.json.gz
- Prepare train and verify data
$ python3 prepare_GG_data.py
- Prepare train and verify data for seq2seq
$ python3 prepare_seq_data.py
Each line is a JSON object with the following properties:
- questionType
- asin
- answerTime
- unixTime
- question
- answer
import pandas as pd
import gzip

def parse(path):
    # Each line of the gzipped file is a Python-literal dict (McAuley's format).
    g = gzip.open(path, 'rb')
    for l in g:
        yield eval(l)

def getDF(path):
    # Load every record into a DataFrame, one row per question/answer entry.
    i = 0
    df = {}
    for d in parse(path):
        df[i] = d
        i += 1
    return pd.DataFrame.from_dict(df, orient='index')

df = getDF('qa_Grocery_and_Gourmet_Food.json.gz')
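A minimal sketch of what the prepare step presumably does (this is only an assumption about prepare_GG_data.py; the output file names and the 90/10 split are hypothetical):

```python
# Keep only the question/answer columns and write a simple train/verify split.
qa = df[["question", "answer"]].dropna()
train = qa.sample(frac=0.9, random_state=42)
verify = qa.drop(train.index)
train.to_csv("train.csv", index=False)
verify.to_csv("verify.csv", index=False)
```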
I must investigate several approaches to solve this problem:
- Train a language model from scratch to generate text.
- Use a pre-trained language model, retrained, to generate text (e.g. GPT-2).
- Fine-tune a pre-trained language model to generate text, framing it as a "conversational" task.
- Fine-tune a seq2seq model to generate text.
- Train a tiny model from scratch to reach 100% accuracy.
- Train a PyTorch model based on translation.
- Use doc2vec to generate the question embedding, store it, and retrieve it using cosine similarity.
- Try OpenAI for a "text search" solution.
- Try Huggingface BERT for a "text search" solution.
- Try Sentence-Transformers for a "text search" solution.
- Try NER.
First install requirements:
$ pip3 install -r ./transformers/examples/pytorch/language-modeling/requirements.txt
Also, do the step Download and prepare data.
Then run training:
$ ./train_gpt2_from_scratch/train_gpt2_model.sh
After training is done, do inference:
$ python3 ./train_gpt2_from_scratch/inference_gpt2_model.py
It returns almost the same answer no matter the question: almost always "I don't know".
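For reference, a minimal inference sketch (the output directory, prompt format, and generation settings are assumptions; the actual inference_gpt2_model.py may differ):

```python
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Hypothetical output directory; the real script may load a different path.
model_dir = "./gpt2_from_scratch_output"
tokenizer = GPT2TokenizerFast.from_pretrained(model_dir)
model = GPT2LMHeadModel.from_pretrained(model_dir)

# Hypothetical prompt: question followed by the separator, letting the model complete the answer.
prompt = "<BOS> Is this product gluten free? >>>QA>>>"
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_length=64, pad_token_id=tokenizer.pad_token_id)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```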
First install requirements:
$ pip3 install -r ./transformers/examples/pytorch/language-modeling/requirements.txt
Also, do the step Download and prepare data.
Then run training:
$ ./train_gpt2_from_pretrained/train_gpt2_model.sh
After training is done, do inference:
$ python3 ./train_gpt2_from_pretrained/inference_gpt2_model.py
Training was fast, but I could not get inference working: the script raises an exception that I could not resolve.
Nothing yet.
First install requirements:
$ pip3 install -r ./transformers/examples/pytorch/language-modeling/requirements.txt
Then run training:
$ ./train_tiny_gpt2_from_scratch/train_gpt2_model.sh
After training is done, do inference:
$ python3 ./train_tiny_gpt2_from_scratch/inference_gpt2_model.py
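A minimal sketch of what a "tiny" GPT-2 configuration could look like (the exact sizes used by train_tiny_gpt2_from_scratch are an assumption):

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Hypothetical tiny configuration: far fewer layers/heads/embedding dims than
# the default GPT-2 (12 layers, 12 heads, 768 dims).
tiny_config = GPT2Config(
    vocab_size=5000,
    n_positions=128,
    n_embd=128,
    n_layer=2,
    n_head=2,
)
model = GPT2LMHeadModel(config=tiny_config)
print(sum(p.numel() for p in model.parameters()), "parameters")
```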
Adapt a translation model to solve this question-answering problem.
https://pytorch.org/tutorials/beginner/translation_transformer.html
ToDo:
- Look for code where I have already implemented Datasets.
- Look for code where I have already used the PyTorch transformers.
- Implement the code exactly as it is in the translation_transformer doc.
- Adapt the code to do question-answering.
from torchtext.data.utils import get_tokenizer
tokenizer = get_tokenizer('basic_english')
- https://gitlab/ai/pytorch-word2vec/-/blob/main/word2vec/dataset.py
- https://gitlab/ai/libtorch-lm/-/blob/master/language_translation.py
- https://gitlab/ai/libtorch-lm/-/wikis/home
- https://andrewpeng.dev/transformer-pytorch/
Token documentation:
- UNK_IDX -> default index, returned when a token is not found in the vocabulary.
- PAD_IDX -> value used to pad short sequences.
- BOS_IDX -> beginning of string.
- EOS_IDX -> end of string.
- SEP_IDX -> separator between questions and answers (see the sketch below).
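A minimal sketch of how these indices could be used when encoding text with the basic_english tokenizer (the index values, the stoi dict, and the encode helper are assumptions for illustration; the real vocabulary would be built over the training corpus):

```python
from torchtext.data.utils import get_tokenizer

# Special token indices; SEP_IDX is our own addition to separate question and answer.
UNK_IDX, PAD_IDX, BOS_IDX, EOS_IDX, SEP_IDX = 0, 1, 2, 3, 4

tokenizer = get_tokenizer("basic_english")

def encode(text, stoi):
    # Map tokens to indices, falling back to UNK_IDX for unknown tokens,
    # and wrap the sequence with BOS/EOS.
    ids = [stoi.get(tok, UNK_IDX) for tok in tokenizer(text)]
    return [BOS_IDX] + ids + [EOS_IDX]

# Tiny illustrative vocabulary; the real one would come from the prepared data.
stoi = {"is": 5, "this": 6, "product": 7, "gluten": 8, "free": 9}
print(encode("Is this product gluten free?", stoi))
```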
Run sample file:
$ source .venv/bin/activate
# Create source and target language tokenizer. Make sure to install the dependencies.
# pip3 install -U spacy
# python3 -m spacy download en_core_web_sm
# python3 -m spacy download de_core_news_sm
$ python3 ./train_pytorch_for_translation/language_translation.py
cosine-similarity(V, W) = (v1 * w1 + v2 * w2 + v3 * w3) / (sqrt(v1 * v1 + v2 * v2 + v3 * v3) * sqrt(w1 * w1 + w2 * w2 + w3 * w3))
cosine-distance(V, W) = 1 - cosine-similarity(V, W)
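The same computation in code (a minimal numpy sketch):

```python
import numpy as np

def cosine_similarity(v, w):
    # Dot product of the vectors divided by the product of their magnitudes.
    return float(np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w)))

def cosine_distance(v, w):
    return 1.0 - cosine_similarity(v, w)

print(cosine_similarity(np.array([1.0, 2.0, 3.0]), np.array([1.0, 2.0, 3.5])))
```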
Use StackOverflow dataset as corpus.
- Use the Ada model because it is the lightest (1024 dimensions).
- Use the text-similarity-ada-001 model for clustering, regression, anomaly detection, and visualization (see the sketch below).
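For reference, a minimal sketch of fetching one embedding. The repository scripts are in Ruby; this Python sketch assumes the legacy openai package (< 1.0) and the text-similarity-ada-001 model mentioned above:

```python
import openai  # legacy (< 1.0) client; newer versions use a different interface

openai.api_key = "YOUR_API_KEY"  # placeholder

response = openai.Embedding.create(
    model="text-similarity-ada-001",
    input="Is this product gluten free?",
)
embedding = response["data"][0]["embedding"]
print(len(embedding))  # 1024 dimensions for the Ada similarity model
```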
Build SQLite, deactivating SQLITE_MAX_EXPR_DEPTH:
CFLAGS="-DSQLITE_MAX_EXPR_DEPTH=0" ./configure
make
sudo make install
$ export PATH="$PATH:/usr/local/bin"
$ bundle
IMPORTANT: install the gems last so the sqlite3 gem uses the compiled version of the sqlite3 executable.
First generate the embeddings and store them in sqlite.db:
ruby ./openai_embeddings/generate_embeddings.rb
Then search for text:
ruby ./openai_embeddings/search_text.rb
Use train_tiny_list.csv
$ python3 ./train_pytorch_for_translation/question-answering.py
Try BERT with sentence embeddings:
cd bert_embeddings
source .venv/bin/activate
python3 try_embeddings.py
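A minimal sketch of getting a sentence embedding from a BERT checkpoint by mean-pooling the token embeddings (try_embeddings.py may do it differently; bert-base-multilingual-cased is just an assumed checkpoint):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumed checkpoint; any BERT-style model works the same way.
name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

def embed(sentence):
    # Mean-pool the last hidden states over the non-padding tokens.
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, hidden)
    mask = inputs["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

a, b = embed("¿Este producto no tiene gluten?"), embed("Is this product gluten free?")
print(torch.nn.functional.cosine_similarity(a, b).item())
```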
- See other use cases and usages of Sentence-Transformers
First install PyTorch with CUDA. Then:
$ pip3 install -U sentence-transformers
cd sentence_transformers
source .venv/bin/activate
python3 try_sentence.py
The following model works quite well for Spanish:
These other models are also multilingual (see the sketch below):
- distiluse-base-multilingual-cased-v1
- distiluse-base-multilingual-cased-v2
- paraphrase-multilingual-MiniLM-L12-v2
- paraphrase-multilingual-mpnet-base-v2
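A minimal text-search sketch using one of the multilingual models above (try_sentence.py may differ; the corpus strings are just illustrative):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

corpus = [
    "Yes, this product is gluten free.",
    "Shipping usually takes two weeks.",
]
query = "¿Este producto tiene gluten?"

corpus_emb = model.encode(corpus, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)

# Cosine similarity between the query and each corpus sentence; pick the best match.
scores = util.cos_sim(query_emb, corpus_emb)[0]
best = int(scores.argmax())
print(corpus[best], float(scores[best]))
```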
- Huggingface: token classification
- Davlan/bert-base-multilingual-cased-ner-hrl
source .venv/bin/activate
cd ner
python3 try_spanish_ner.py
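A minimal sketch of what try_spanish_ner.py presumably does with the Davlan model (this is an assumption; it also assumes a transformers version that supports aggregation_strategy):

```python
from transformers import pipeline

# Multilingual NER model listed above; aggregation groups word pieces into whole entities.
ner = pipeline(
    "ner",
    model="Davlan/bert-base-multilingual-cased-ner-hrl",
    aggregation_strategy="simple",
)
print(ner("María vive en Sevilla y trabaja para Telefónica."))
```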
- Find out how to save and restore retrained models.
- From what I can see in the custom dataset document, what I have to do is write my own code that loads the data.
- Investigate Question Answering with SQuAD 2.0.
- I need to find a way to retrain text2text-generation systems -> USE LM https://github.com/huggingface/transformers/tree/master/examples/pytorch/language-modeling#gpt-2gpt-and-causal-language-modeling
- Create a simple inference system with pipelines for text2text-generation.
- One option to keep in mind is that I may not need fine-tuning at all; I could simply train the whole model on my emails, SEPARATING THE QUESTION FROM THE ANSWER WITH A KEY TOKEN WORD SUCH AS '>>>QA>>>'.
- I could also retrain a GPT-2 in Spanish. THAT'S RIGHT.
- Text generation is currently possible with GPT-2, OpenAI-GPT, CTRL, XLNet, Transfo-XL and Reformer in PyTorch.
- A possible solution would be Text Generation.
- Transformers can also be used for Named Entity Recognition (NER).
- This is what I want to do: fine-tune a GPT-2 ---->>>> https://github.com/huggingface/transformers/tree/master/examples/pytorch/language-modeling#gpt-2gpt-and-causal-language-modeling
At run_clm.py line 346:
tokenizer.add_special_tokens({
"eos_token": "",
"bos_token": "",
"unk_token": ""
})