# File Search

https://platform.openai.com/docs/assistants/tools/file-search?context=without-streaming

In [1]:
import os
import json

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())

In [2]:
from openai import AzureOpenAI, OpenAI
azure_openai_api_key = os.getenv("AZURE_OPENAI_API_KEY")
azure_openai_api_endpoint = os.getenv("AZURE_OPENAI_API_ENDPOINT")
deployment_name = os.getenv("AZURE_DEPLOYMENT_NAME")
openai_api_key = os.getenv('OPENAI_SIMPLON_API_KEY')
client = OpenAI(
    api_key=openai_api_key    
)

## Quickstart

In this example, we’ll create an assistant that can answer questions about Certif IA 2023, the AI certification delivered by Simplon.

### Step 1: Create a new Assistant with File Search Enabled

Create a new assistant with `file_search` enabled in the `tools` parameter of the Assistant.

In [3]:

assistant = client.beta.assistants.create(
  name="Assistant Certification IA 2023",
  instructions="Tu réponds uniquement aux questions concernant la certification IA 2023 délivrée par Simplon.",
  model="gpt-3.5-turbo",
  tools=[{"type": "file_search"}],
)

Once the `file_search` tool is enabled, the model decides when to retrieve content based on user messages.

### Step 2: Upload files and add them to a Vector Store

To access your files, the `file_search` tool uses the Vector Store object. Upload your files and create a Vector Store to contain them. Once the Vector Store is created, poll its status until all files are out of the `in_progress` state to ensure that all content has finished processing. The SDK provides helpers for uploading and polling in one shot.
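To make the polling idea concrete, here is a minimal sketch of a polling loop. The function name `wait_until_processed` and its parameters are hypothetical (they are not part of the OpenAI SDK), and the API is replaced by a stand-in callable so the snippet runs offline.

```python
import time
from typing import Callable


def wait_until_processed(
    get_in_progress_count: Callable[[], int],
    interval_s: float = 1.0,
    timeout_s: float = 60.0,
) -> int:
    """Poll until no file is in progress; return the number of polls made.

    Hypothetical helper: `get_in_progress_count` is any callable that
    returns how many files are still being processed.
    """
    deadline = time.monotonic() + timeout_s
    polls = 0
    while True:
        polls += 1
        if get_in_progress_count() == 0:
            return polls
        if time.monotonic() >= deadline:
            raise TimeoutError("files are still being processed")
        time.sleep(interval_s)


# Stand-in for the API: reports 2, then 1, then 0 files in progress.
_fake_counts = iter([2, 1, 0])
polls_needed = wait_until_processed(lambda: next(_fake_counts), interval_s=0.0)
print(polls_needed)  # 3
```

With a real client, the callable could plausibly read `client.beta.vector_stores.retrieve(vector_store_id=...).file_counts.in_progress`. In practice the `upload_and_poll` helper used in this notebook already does this for you.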

In [4]:
print(assistant)

Assistant(id='asst_spGGvyg1alU3tvQBU3PWw0Gg', created_at=1721901675, description=None, instructions='Tu réponds uniquement aux questions concernant la certification IA 2023 délivrée par Simplon.', metadata={}, model='gpt-3.5-turbo', name='Assistant Certification IA 2023', object='assistant', tools=[FileSearchTool(type='file_search')], response_format='auto', temperature=1.0, tool_resources=ToolResources(code_interpreter=None, file_search=ToolResourcesFileSearch(vector_store_ids=[])), top_p=1.0)


In [20]:
# Snippet to avoid creating the same vector store over and over
try:
  vector_store = client.beta.vector_stores.retrieve(
    vector_store_id="vs_vUsPdfeq1ymnt0mC3BuU39Bq"
  )
  print(vector_store)
  print("vector store already exists")

except Exception:
  print("vector store not found")

  # Create a vector store called "Full Certif IA 2023"
  vector_store = client.beta.vector_stores.create(name="Full Certif IA 2023")
  
  # Ready the files for upload to OpenAI
  file_paths = ["reglement_specifique_full_dev_ia_2023.pdf"]
  file_streams = [open(path, "rb") for path in file_paths]
  
  # Use the upload and poll SDK helper to upload the files, add them to the vector store,
  # and poll the status of the file batch for completion.
  file_batch = client.beta.vector_stores.file_batches.upload_and_poll(
    vector_store_id=vector_store.id, files=file_streams
  )
  
  # You can print the status and the file counts of the batch to see the result of this operation.
  print(file_batch.model_dump_json(indent=2))
  print(file_batch.status)
  print(file_batch.file_counts)

{
  "id": "vsfb_e587a6bc697f4ef0b6ee72d52fc80f11",
  "created_at": 1721908713,
  "file_counts": {
    "cancelled": 0,
    "completed": 1,
    "failed": 0,
    "in_progress": 0,
    "total": 1
  },
  "object": "vector_store.file_batch",
  "status": "completed",
  "vector_store_id": "vs_vUsPdfeq1ymnt0mC3BuU39Bq"
}
completed
FileCounts(cancelled=0, completed=1, failed=0, in_progress=0, total=1)


### Step 3: Update the assistant to use the new Vector Store

To make the files accessible to your assistant, update the assistant’s `tool_resources` with the new `vector_store` id.



In [6]:
assistant = client.beta.assistants.update(
  assistant_id=assistant.id,
  tool_resources={"file_search": {"vector_store_ids": [vector_store.id]}},
)

In [28]:
try:
  vector_store = client.beta.vector_stores.retrieve(
    vector_store_id="vs_mNwTx0v5cZoZWP2Ze21hQWfY"
  )
  print(vector_store)
  print("vector store already exists")

except Exception as e:
  print("vector store not found")

VectorStore(id='vs_mNwTx0v5cZoZWP2Ze21hQWfY', created_at=1721910340, file_counts=FileCounts(cancelled=0, completed=1, failed=0, in_progress=0, total=1), last_active_at=1721910340, metadata={}, name=None, object='vector_store', status='completed', usage_bytes=3040, expires_after=ExpiresAfter(anchor='last_active_at', days=7), expires_at=1722515140)
vector store already exists


### Step 4: Create a thread

You can also attach files as Message attachments on your thread. Doing so will create another `vector_store` associated with the thread, or, if there is already a vector store attached to this thread, attach the new files to the existing thread vector store. When you create a Run on this thread, the file search tool will query both the `vector_store` from your assistant and the `vector_store` on the thread.

In this example, the user attached a copy of the Certif IA 2023 summary (`resume_certif_ia_2023.pdf`).


In [26]:
# Again, a guard block to avoid uploading the same file over and over
try:
  vector_store = client.beta.vector_stores.retrieve(
    vector_store_id="vs_mNwTx0v5cZoZWP2Ze21hQWfY"
  )
  print(vector_store)
  print("vector store already exists")

except Exception as e:
  print("vector store not found")

  # Upload the user provided file to OpenAI
  message_file = client.files.create(
    file=open("resume_certif_ia_2023.pdf", "rb"), purpose="assistants"
  )

  message_file_id = "file-QEZAdgUjplVNJdIW24ZAv28b" # hard-coded here to avoid re-uploading the file repeatedly

  # Create a thread and attach the file to the message
  thread = client.beta.threads.create(
    messages=[
      {
        "role": "user",
        "content": "Sur quoi porte le bloc de compétences 1 ?",
        # Attach the new file to the message.
        "attachments": [
          { "file_id": message_file_id, "tools": [{"type": "file_search"}]}
        ],
      }
    ]
  )
  
  # The thread now has a vector store with that file in its tool resources.
  print(thread.tool_resources.file_search)

ToolResourcesFileSearch(vector_store_ids=['vs_mNwTx0v5cZoZWP2Ze21hQWfY'])
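As a rough illustration of which stores a run on this thread can search, the sketch below gathers the vector store ids from both the assistant and the thread. The helper `searchable_vector_store_ids` is hypothetical, and `SimpleNamespace` objects stand in for the SDK's `Assistant` and `Thread` objects (only the `tool_resources.file_search.vector_store_ids` attribute path, visible in the outputs above, is assumed).

```python
from types import SimpleNamespace
from typing import List


def searchable_vector_store_ids(assistant, thread) -> List[str]:
    """Return assistant-level, then thread-level vector store ids, without duplicates."""
    ids: List[str] = []
    for owner in (assistant, thread):
        tool_resources = getattr(owner, "tool_resources", None)
        file_search = getattr(tool_resources, "file_search", None)
        for store_id in getattr(file_search, "vector_store_ids", None) or []:
            if store_id not in ids:
                ids.append(store_id)
    return ids


def _stand_in(store_ids):
    # Mimics the attribute layout of an Assistant or Thread object.
    return SimpleNamespace(
        tool_resources=SimpleNamespace(
            file_search=SimpleNamespace(vector_store_ids=store_ids)
        )
    )


fake_assistant = _stand_in(["vs_vUsPdfeq1ymnt0mC3BuU39Bq"])
fake_thread = _stand_in(["vs_mNwTx0v5cZoZWP2Ze21hQWfY"])
store_ids = searchable_vector_store_ids(fake_assistant, fake_thread)
print(store_ids)
```

This only inspects metadata. The actual search across both stores is performed by the `file_search` tool during the run.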


Vector stores created using message attachments have a default expiration policy of 7 days after they were last active (defined as the last time the vector store was part of a run). This default exists to help you manage your vector storage costs. You can override these expiration policies at any time. Learn more [here](https://platform.openai.com/docs/assistants/tools/file-search/managing-costs-with-expiration-policies).
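The 7-day policy can be checked against the `VectorStore` output printed in Step 3, where `last_active_at=1721910340` and `expires_at=1722515140`. The sketch below reproduces that arithmetic; `expected_expiry` is a hypothetical helper, not an SDK function.

```python
from datetime import datetime, timezone

SECONDS_PER_DAY = 86_400


def expected_expiry(last_active_at: int, days: int = 7) -> int:
    """Expiry timestamp for a policy anchored on `last_active_at` (Unix seconds)."""
    return last_active_at + days * SECONDS_PER_DAY


# Values taken from the VectorStore output shown in Step 3.
expires_at = expected_expiry(1721910340, days=7)
print(expires_at)  # 1722515140, matching the `expires_at` field in that output
print(datetime.fromtimestamp(expires_at, tz=timezone.utc).isoformat())
```

To change the policy, the SDK's vector store update call accepts an `expires_after` value of the same shape as the one in the output, for example `{"anchor": "last_active_at", "days": 30}`. The 30-day figure is only an illustration.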

### Step 5: Create a run and check the output

Now, create a Run and observe that the model uses the File Search tool to provide a response to the user’s question.



In [18]:
# Use the create and poll SDK helper to create a run and poll the status of
# the run until it's in a terminal state.

run = client.beta.threads.runs.create_and_poll(
    thread_id=thread.id, assistant_id=assistant.id,
    instructions="""
      Tu réponds uniquement aux questions concernant la certification IA 2023 délivrée par Simplon.
      Si la réponse à la question n'est pas compris dans les fichiers, dis-le.""" 
)
# I override the assistant's instructions here, but this is of course optional

messages = list(client.beta.threads.messages.list(thread_id=thread.id, run_id=run.id))

message_content = messages[0].content[0].text
annotations = message_content.annotations
citations = []
for index, annotation in enumerate(annotations):
    message_content.value = message_content.value.replace(annotation.text, f"[{index}]")
    if file_citation := getattr(annotation, "file_citation", None):
        cited_file = client.files.retrieve(file_citation.file_id)
        citations.append(f"[{index}] {cited_file.filename}")

print(message_content.value)
print("\n".join(citations))

Tu peux accéder à la certification IA 2023 délivrée par Simplon en étant en contrat d'apprentissage, comme le contrat de professionnalisation est l'une des voies d'accès possibles pour obtenir cette certification[0].
[0] reglement_specifique_dev_ia_2023.pdf


Note that the annotations are processed here so that they display correctly. Annotations can be of type `file_path` or `file_citation`. For more information on how to display them correctly, see https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/assistant#message-annotations
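The cell above only handles `file_citation` annotations. A minimal sketch covering both annotation kinds could look like the following. The `Annotation` and `FileRef` dataclasses are simplified stand-ins for the SDK's annotation objects, `render_with_footnotes` is a hypothetical helper, and the sample sentence is made up for the demonstration (only the marker text and file id come from the output in this notebook).

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class FileRef:
    file_id: str


@dataclass
class Annotation:
    # Simplified stand-in: exactly one of the two references is expected to be set.
    text: str
    file_citation: Optional[FileRef] = None
    file_path: Optional[FileRef] = None


def render_with_footnotes(
    value: str, annotations: List[Annotation]
) -> Tuple[str, List[str]]:
    """Replace each annotation marker with [n] and build matching footnotes."""
    footnotes: List[str] = []
    for index, annotation in enumerate(annotations):
        value = value.replace(annotation.text, f"[{index}]")
        if annotation.file_citation is not None:
            footnotes.append(f"[{index}] cited from {annotation.file_citation.file_id}")
        elif annotation.file_path is not None:
            footnotes.append(f"[{index}] generated file {annotation.file_path.file_id}")
    return value, footnotes


text, notes = render_with_footnotes(
    "Apprenticeship is one access route【4:1†source】.",
    [
        Annotation(
            text="【4:1†source】",
            file_citation=FileRef("file-7WDWcUBr05dDTP6MLtWI9Dys"),
        )
    ],
)
print(text)
print("\n".join(notes))
```

Unlike the cell above, this version returns a new string instead of mutating `message_content.value`, which makes it safe to call more than once on the same message.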

In [12]:
print(annotations)

[FileCitationAnnotation(end_index=366, file_citation=FileCitation(file_id='file-7WDWcUBr05dDTP6MLtWI9Dys', quote=None), start_index=354, text='【4:1†source】', type='file_citation')]


## Your turn

- Create a vector store.
- Add the [5-page specific regulations file](https://github.com/louiskuhn/IA-P3-Euskadi/blob/main/Ressources/GenAI/OpenAI_Assistants/reglement_specifique_5_pages_dev_ia_2023.pdf) to it.
- Create an assistant and attach the vector store you just created to it.
- Test the assistant with questions about the skill blocks, then about the composition of the jury (which it should not normally know).
- Delete your vector store and update your assistant so that it now uses the vector store I created, which contains the complete certification regulations. Its vector_store_id is `vs_vUsPdfeq1ymnt0mC3BuU39Bq`.
- Re-run the questions above.
- Delete your assistant.