In [1]:
import json

doc_dir = "combinedSemanticDataSet.json"

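# Load the list of records; each one is a dict with 'content' and 'meta' keys.
with open(doc_dir, encoding="utf-8") as f:
    json_files = json.load(f)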
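
The later cells read both `content` and `meta` from every record, so a missing key would only surface as a `KeyError` during cleaning. A minimal sketch of an up-front check, assuming each record is a dict; `find_malformed` is a hypothetical helper and the sample records are made up for illustration.

```python
def find_malformed(records, required=("content", "meta")):
    """Return (index, missing_keys) pairs for records lacking a required key."""
    problems = []
    for i, record in enumerate(records):
        missing = [key for key in required if key not in record]
        if missing:
            problems.append((i, missing))
    return problems


# Made-up sample data: the second record has no 'meta' entry.
sample = [
    {"content": "some text", "meta": {"source": "example"}},
    {"content": "no metadata here"},
]
print(find_malformed(sample))
```

An empty result means every record can be passed to the cleaning step safely.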


In [2]:
print(json_files[0]['content'])

Hello, I am looking for best practices as I know that there could be multiple ways to solve the problem. I just like to understand if there is a lightning approach that should be preferred.

In my `LightningModule` I initialize a `CrossEntropyLoss` with specific `weight` to handle imbalanced classes: `torch.nn.CrossEntropyLoss(weight=my_weights)`. The weight for each class is defined as the `1 / number_of_samples_in_the_class`.

In order to do this, I need to supply my `LightningModule` instance with the number of samples per class. However, usually you would load the data (and therefore count the number of samples per class in the `setup` function of the `LightningDataModule` instance. So here's the problem: usually, when you initialize the `LightningModule` you haven't loaded yet the data.

Example:
```
class MyDataModule(LightningDataModule):
    def __init__(self):
        self.number_of_samples_per_class = None

     def setup(self, stage):
         self.number_of_samples_per_clas

In [3]:
def clean_json_text(texts):
    """
    Flatten each document's text onto a single line: delete newlines and
    carriage returns, replace triple-backtick code fences with a space, and
    collapse runs of spaces into one. The 'meta' entry is passed through unchanged.
    """
    cleaned = []

    for text in texts:
        content = text['content']

        # str.replace substitutes every occurrence, so one call per pattern is enough.
        # Note: line breaks are deleted, not replaced by a space, so adjacent lines are joined.
        content = content.replace("\n", "")
        content = content.replace("\r", "")
        content = content.replace("```", " ")

        # Replacing a pair of spaces can leave another pair behind, so repeat until none remain.
        while "  " in content:
            content = content.replace("  ", " ")

        cleaned.append({'content': content, 'meta': text['meta']})

    return cleaned
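
Because `clean_json_text` deletes line breaks rather than replacing them, text on either side of a break is fused (the retrieved passage in the final output reads `console:Batches`). If that is not intended, one option is to turn every whitespace run into a single space. This is a sketch of an alternative, not the notebook's method; `normalize_whitespace` is a hypothetical name, and unlike the original it also collapses tabs and trims the ends.

```python
import re


def normalize_whitespace(content):
    """Replace code fences and any run of whitespace with a single space."""
    content = content.replace("```", " ")
    # \s+ matches newlines, carriage returns, tabs and spaces in one pass.
    return re.sub(r"\s+", " ", content).strip()


example = "I see this in my console:\nBatches: 100%\r\n```\np = Pipeline()\n```"
print(normalize_whitespace(example))
```

Keeping a space at each former line break preserves word boundaries, which matters for tokenization when the passages are embedded.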

In [4]:
json_files = clean_json_text(json_files)

In [5]:
from haystack.document_stores import FAISSDocumentStore
from haystack.nodes import PreProcessor

document_store = FAISSDocumentStore(
    faiss_index_factory_str="Flat", 
    return_embedding=True, 
    duplicate_documents='overwrite'
)

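# Splits text into passages of up to about 50 words without cutting sentences.
# Note: this preprocessor is configured here but never applied in the cells shown.
preprocessor = PreProcessor(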
    clean_empty_lines=True,
    clean_whitespace=True,
    clean_header_footer=False,
    split_by="word",
    split_length=50,
    split_respect_sentence_boundary=True,
)
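
With `split_by="word"`, `split_length=50` and `split_respect_sentence_boundary=True`, the preprocessor aims for passages of at most roughly 50 words that end on sentence boundaries. The idea can be illustrated in plain Python. This is a simplified sketch, not Haystack's implementation: `chunk_by_words` is a hypothetical helper and the regex sentence splitter is an assumption.

```python
import re


def chunk_by_words(text, split_length=50):
    """Greedily pack whole sentences into chunks of at most split_length words."""
    # Naive sentence splitter (assumption): break after '.', '!' or '?'.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        if not sentence:
            continue
        n_words = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and count + n_words > split_length:
            chunks.append(" ".join(current))
            current, count = [], 0
        # A single sentence longer than the budget still becomes its own chunk.
        current.append(sentence)
        count += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


demo = "One two three. Four five six. Seven eight."
print(chunk_by_words(demo, split_length=6))
```

A small `split_length` yields short, focused passages for retrieval at the cost of more documents to embed.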

In [6]:
from haystack.nodes import DensePassageRetriever
from haystack.nodes import RAGenerator

retriever = DensePassageRetriever(
    document_store=document_store,
    query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
    passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
    use_gpu=True,
    embed_title=True,
    )

generator = RAGenerator(
    model_name_or_path="facebook/rag-sequence-nq",
    retriever=retriever,
    top_k=1,
    min_length=2
)

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'DPRQuestionEncoderTokenizer'. 
The class this function is called from is 'DPRContextEncoderTokenizerFast'.
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'RagTokenizer'. 
The class this function is called from is 'DPRQuestionEncoderTokenizer'.
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'RagTokenizer'. 
The class this function is called from is 'DPRQuestionEncoderTokenizerFast'.

In [7]:
from haystack import Pipeline

p = Pipeline()
p.add_node(component=retriever, name="Retriever", inputs=['Query'])
p.add_node(component=generator, name="RAGenerator", inputs=['Retriever'])

In [8]:
document_store.write_documents(json_files)
document_store.update_embeddings(retriever)

Writing Documents:   0%|          | 0/1827 [00:00<?, ?it/s]

Updating Embedding:   0%|          | 0/1826 [00:00<?, ? docs/s]

Create embeddings:   0%|          | 0/1840 [00:00<?, ? Docs/s]

In [9]:
# Saves the document store
document_store.save("my_faiss_index.faiss")

In [None]:
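# Reloads the saved index from disk; the assert checks that the index type survived the round trip.
document_store = FAISSDocumentStore.load("my_faiss_index.faiss")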
assert document_store.faiss_index_factory_str == "Flat"

In [12]:
query = "what comes after a Preprocessor in a pipeline"

In [13]:
p.run(query, params={'Retriever': {'top_k': 5}})

{'query': 'what comes after a Preprocessor in a pipeline',
 'answers': [<Answer {'answer': ' data accessor', 'type': 'generative', 'score': None, 'context': None, 'offsets_in_document': None, 'offsets_in_context': None, 'document_id': None, 'meta': {'doc_ids': ['d31382234c0249763ca77babbf19477d', '681f259a0a88b1dbee6cb4f3963d4468', '4b171c865a0882a9790e559a4c50836d', '4b5bb9d8b4fe8618ebe397bcdca374a8', '11f3a35bdc738eae5c76e54aa2741675'], 'doc_scores': [0.6619975155599239, 0.6615594470192911, 0.6609649402339938, 0.6608263232300359, 0.6602357254224744], 'content': ['Whenever I index a document I see this in my console:Batches: 100%|██████████| 1/1 [00:00<00:00, 42.84it/s]I set `progress_bar=False` on every pipeline component yet I\'m still seeing this: text_converter = TextConverter(valid_languages=[\'en\'], progress_bar=False)preprocessor = PreProcessor(progress_bar=False) p = Pipeline()p.add_node(component=text_converter, name="TextConverter", inputs=["File"])p.add_node(component=prep