# Using WikipediaReader for Accessing Wikipedia Pages
In this notebook, we will demonstrate how to use the `WikipediaReader` class to access and process content from Wikipedia. The `WikipediaReader` class interacts with the Wikipedia API to retrieve page content and metadata, making it valuable for various applications, including research and question-answering systems.

To begin, set up your environment variables and API keys in Python using the `dotenv` library (installed as the `python-dotenv` package). This keeps sensitive information, such as the API key for HuggingFace, out of your codebase. Define `HUGGINGFACE_API_KEY` in the env file that the notebook loads (`api.env` in the import cell later in this notebook) rather than hardcoding it, which improves both security and maintainability.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/osllmai/inDox/blob/master/Demo/wikipedia_mongodb_semantic.ipynb)

In [None]:
!pip install indox
!pip install chromadb
!pip install wikipedia

## Setting Up the Python Environment

If you are running this project in your local IDE, please create a Python environment to ensure all dependencies are correctly managed. You can follow the steps below to set up a virtual environment named `indox`:

### Windows

1. **Create the virtual environment:**
```bash
python -m venv indox
```
2. **Activate the virtual environment:**
```bash
indox\Scripts\activate
```

### macOS/Linux

1. **Create the virtual environment:**
```bash
python3 -m venv indox
```
2. **Activate the virtual environment:**
```bash
source indox/bin/activate
```

### Install Dependencies

Once the virtual environment is activated, install the required dependencies by running:

```bash
pip install -r requirements.txt
```


## Import Essential Libraries

Next, we import the essential components for our Indox question-answering system:

- `IndoxRetrievalAugmentation`: Enhances the retrieval process by improving the relevance and quality of the retrieved documents, leading to better QA performance.
- `HuggingFaceModel`: The QA model wrapper provided by Indox. In this notebook it is bound to the name `mistral_qa` and uses the `mistralai/Mistral-7B-Instruct-v0.2` model through the Hugging Face API.
- `HuggingFaceEmbedding`: Uses Hugging Face embeddings to capture the contextual meaning of the text.
- `SemanticTextSplitter`: Uses a Hugging Face tokenizer to split text into chunks based on a specified maximum number of tokens, while keeping each chunk semantically coherent.

In [1]:
import os
from dotenv import load_dotenv

load_dotenv('api.env')

HUGGINGFACE_API_KEY = os.environ['HUGGINGFACE_API_KEY']
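
Reading the key with `os.environ[...]` raises a bare `KeyError` when the variable is missing. As a minimal sketch, a small helper can fail with a clearer message instead. The name `require_env` is hypothetical and not part of Indox or `dotenv`.

```python
import os


def require_env(name: str) -> str:
    """Return an environment variable's value, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your env file or export it in the shell."
        )
    return value


# Hypothetical usage, after load_dotenv(...) has run:
# HUGGINGFACE_API_KEY = require_env("HUGGINGFACE_API_KEY")
```

An empty or whitespace-only value is treated the same as a missing one, which catches a common copy-paste mistake in env files.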

In [2]:
from indox import IndoxRetrievalAugmentation
indox = IndoxRetrievalAugmentation()

INFO: IndoxRetrievalAugmentation initialized

            ██  ███    ██  ██████   ██████  ██       ██
            ██  ████   ██  ██   ██ ██    ██   ██  ██
            ██  ██ ██  ██  ██   ██ ██    ██     ██
            ██  ██  ██ ██  ██   ██ ██    ██   ██   ██
            ██  ██  █████  ██████   ██████  ██       ██
            


## Building the WikipediaReader System and Initializing Models
Next, we will build our `WikipediaReader` system and initialize the necessary models for processing content from Wikipedia. This setup will enable us to effectively retrieve and handle Wikipedia pages, leveraging these models to support various research and question-answering tasks.

In [3]:
from indox.llms import HuggingFaceModel
from indox.embeddings import HuggingFaceEmbedding
mistral_qa = HuggingFaceModel(api_key=HUGGINGFACE_API_KEY, model="mistralai/Mistral-7B-Instruct-v0.2")
embed = HuggingFaceEmbedding(api_key=HUGGINGFACE_API_KEY, model="multi-qa-mpnet-base-cos-v1")

INFO: Initializing HuggingFaceModel with model: mistralai/Mistral-7B-Instruct-v0.2
INFO: HuggingFaceModel initialized successfully
INFO: Initialized HuggingFaceEmbedding with model: multi-qa-mpnet-base-cos-v1


## Setting Up the WikipediaReader for Retrieving Page Content
To demonstrate the capabilities of our `WikipediaReader` system and its integration with Indox, we will use a sample Wikipedia page. This page will serve as our reference data for testing and evaluating the system.

In [4]:
from indox.data_connectors import WikipediaReader

reader = WikipediaReader()

documents = reader.load_content(pages=["Python (programming language)"])
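
To make the retrieval step more concrete, here is a rough sketch of fetching page text by title. It is not how `WikipediaReader` is implemented, and the return shape of `load_content` is not shown in this notebook. The sketch assumes the third-party `wikipedia` package installed earlier. The names `load_pages` and `fetch_with_wikipedia_package` are hypothetical.

```python
from typing import Callable, Dict, List


def fetch_with_wikipedia_package(title: str) -> str:
    """Fetch plain-text page content with the third-party `wikipedia` package."""
    import wikipedia  # imported lazily so the rest of the sketch runs without it

    # auto_suggest=False asks for the exact title instead of a suggested one.
    return wikipedia.page(title, auto_suggest=False).content


def load_pages(
    titles: List[str],
    fetch: Callable[[str], str] = fetch_with_wikipedia_package,
) -> List[Dict[str, str]]:
    """Return one {"title", "content"} record per page; skip pages that fail."""
    records = []
    for title in titles:
        try:
            records.append({"title": title, "content": fetch(title)})
        except Exception as exc:  # network errors, missing or ambiguous pages
            print(f"Skipping {title!r}: {exc}")
    return records
```

Passing the fetcher as a parameter lets you swap in a stub for offline testing, for example `load_pages(["A"], fetch=lambda t: "stub text")`.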

In [5]:
content = documents

In [6]:
content



## Splitting Content into Manageable Chunks
We use the `SemanticTextSplitter` class from the `indox.splitter` module, initialized here with a chunk size of 400, to divide the retrieved content into smaller, meaningful chunks.

In [7]:
from indox.splitter import SemanticTextSplitter
splitter = SemanticTextSplitter(400)
content_chunks = splitter.split_text(content)
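
The idea of splitting under a token budget can be illustrated with a simplified sketch. This is not the `SemanticTextSplitter` algorithm: it assumes whitespace-separated words as a stand-in for tokenizer tokens and packs whole sentences greedily. The function name `split_by_token_budget` is hypothetical.

```python
import re
from typing import List


def split_by_token_budget(text: str, max_tokens: int) -> List[str]:
    """Greedily pack whole sentences into chunks of up to max_tokens words.

    A single sentence longer than max_tokens is kept whole as its own chunk.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Keeping sentences intact is the simplest way to avoid cutting a thought in half. A real tokenizer would count subword tokens, so actual chunk boundaries will differ.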


## Storing and Indexing Content with MongoDB
We use `MongoDB` to store and index the content chunks. By creating a collection named `sample` and applying an embedding function (`embed`), we convert each chunk into a vector for efficient retrieval. The `add` method then adds these vectors to the database, enabling scalable and effective search for question-answering tasks.

In [9]:
from indox.vector_stores import MongoDB
db = MongoDB(collection_name="sample",embedding_function=embed)
db.add(docs=content_chunks)

INFO: Embedding documents
INFO: Starting to fetch embeddings for texts using model: SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: MPNetModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
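
To show what vector-based retrieval means here, the following minimal sketch ranks stored chunks by cosine similarity to a query vector. It is an in-memory illustration only and does not reflect how the Indox `MongoDB` store indexes or searches vectors. The names `cosine_similarity` and `top_k_chunks` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(
    query_vec: Sequence[float],
    stored: List[Tuple[str, Sequence[float]]],
    k: int,
) -> List[Tuple[str, float]]:
    """Return the k (chunk, score) pairs most similar to the query vector."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in stored]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

This scans every stored vector, which is fine for a handful of chunks. A database-backed store would normally rely on an index to avoid the full scan.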


## Querying Wikipedia Data with Indox
With our `WikipediaReader` system and `Indox` fully set up, we are ready to test it using a sample query. This test will demonstrate how effectively our system can retrieve and process information from Wikipedia pages.

We’ll use the following sample query to evaluate our system:

- Query: "what is python?"

This question will be processed by the `WikipediaReader` and `Indox` system to retrieve relevant content from Wikipedia and generate an accurate response based on the information.

Let’s put our setup to the test with this query.

In [10]:
query = "what is python?"
retriever = indox.QuestionAnswer(vector_database=db, llm=mistral_qa, top_k=5)

The `QuestionAnswer` retriever created above connects the vector database to the QA model and uses the 5 most relevant chunks for each query (`top_k=5`).

Calling its `invoke` method embeds the query, retrieves the most relevant chunks of the Wikipedia page from the vector database, and passes them to the QA model to generate an answer.

We pass the query to `invoke`, store the response in `answer`, and read `retriever.context` so we can inspect which chunks the answer was based on.

In [11]:
answer = retriever.invoke(query)
context = retriever.context

INFO: Retrieving context and scores from the vector database
INFO: Embedding documents
INFO: Starting to fetch embeddings for texts using model: SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: MPNetModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
INFO: Generating answer without document relevancy filter
INFO: Answering question
INFO: Sending request to Hugging Face API
INFO: Received successful response from Hugging Face API
INFO: Query answered successfully


In [12]:
answer

'Python is a high-level, general-purpose programming language known for its code readability due to significant indentation. It is dynamically typed and garbage-collected, supports multiple programming paradigms, and has a comprehensive standard library. Guido van Rossum began working on it in the late 1980s as a successor to the ABC programming language. Python consistently ranks as one of the most popular programming languages, and is widely used in scientific computing and machine learning'
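
The retrieve-then-generate flow behind an answer like this can be sketched as assembling the retrieved chunks and the query into a single prompt for the language model. The prompt template Indox actually uses is not shown in this notebook, so the wording and the name `build_prompt` below are assumptions for illustration only.

```python
from typing import List


def build_prompt(query: str, contexts: List[str]) -> str:
    """Combine retrieved chunks and the user query into one prompt string."""
    # Number each chunk so the model (and the reader) can tell them apart.
    numbered = "\n".join(f"[{i}] {c}" for i, c in enumerate(contexts, start=1))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {query}\nAnswer:"
    )
```

Printing the prompt built from the retrieved context next to the model's answer is a quick way to check whether the answer is grounded in the retrieved chunks.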