# 🐍 Python 🔍 Index Text in Pandas for Fast Search 📊

## ❓ Have you ever faced difficulties when searching for specific text in a huge dataset? 🤔

👉 **Solution**: Let’s use Faiss to create fast text indices in Pandas!

## 🔧 How does it work?

Faiss, a similarity search library developed by Facebook AI Research, indexes vectors for efficient retrieval. We convert each text into a vector representation (an embedding), add those vectors to a Faiss index, and map the search results back to the Pandas DataFrame rows by position. This suits large collections of unstructured text, such as customer feedback and blog posts.

## 🔎 Why does it matter?

Keyword search over a large text column scans every row and matches only literal strings, so it is slow and misses paraphrases. Indexing embeddings with Faiss lets us search by meaning, and Faiss also provides approximate indexes for collections that are too large for exhaustive search.

## ✨ Real-life Example

Let’s say you’re working with an e-commerce platform and have thousands of product reviews. By indexing these reviews using Faiss, you can instantly retrieve similar reviews when a user asks a question, improving customer support and satisfaction.

## ⚙️ Business Impact

🔹 Faster data retrieval can enhance customer service efficiency.
🔹 Accurate product recommendations lead to higher customer satisfaction.
🔹 Saves time, allowing teams to focus on strategic tasks.

## 📊 Summary of the code's actions

- Converts text to vectors using a model like SentenceTransformer.
- Indexes vectors with Faiss for efficient search.
- Enables searching for similar texts based on user queries.
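
To make the indexing step less of a black box, here is a minimal NumPy sketch of what an exhaustive ("flat") nearest-neighbour search does conceptually. It is an illustration under stated assumptions, not Faiss's implementation: the helper `top_k_squared_l2` and the 2-D vectors are hypothetical, and real sentence embeddings have hundreds of dimensions.

```python
import numpy as np

def top_k_squared_l2(vectors, query, k):
    """Return (distances, indices) of the k rows of `vectors` closest to `query`."""
    diffs = vectors - query                          # broadcast the query against every row
    sq_dists = np.einsum("ij,ij->i", diffs, diffs)   # squared L2 distance per row
    order = np.argsort(sq_dists)[:k]                 # positions of the k smallest distances
    return sq_dists[order], order

# Hypothetical 2-D "embeddings" standing in for encoded texts.
vectors = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]], dtype=np.float32)
query = np.array([0.9, 0.1], dtype=np.float32)

distances, indices = top_k_squared_l2(vectors, query, k=2)
print(indices)    # row positions of the two nearest vectors
print(distances)  # their squared L2 distances, smallest first
```

The returned positions play the same role as the `indices` array in the notebook code: they are row positions that can be looked up in the DataFrame.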

🔗 [GitHub](https://github.com/jcombari/AI-For-Unstructured-Data/tree/main)

## 💭 Reflection

**How could this be applied in your workflow to solve data search challenges?**  
What problems might you solve by indexing your text data this way?


🔑 **#AI #DataScience #Python #MachineLearning #DeepLearning #Pandas #Faiss #TextMining #GenerativeAI #TechForGood #DataScienceTips #UnstructuredData**

---

# 🐍 Python 🔍 Index Text in Pandas for Fast Search 📊

## ❓ Have you ever struggled to quickly find the relevant text in a massive dataset? 🤔

👉 **Solution**: Let’s use Faiss to build fast text indexes within Pandas!

## 🔧 How does it work?

Using Faiss, an efficient similarity search library developed by Facebook AI Research, we can index your text data, which enables very fast search and retrieval within Pandas. This is ideal for analyzing unstructured text, whether you are handling customer comments, product descriptions, or large documents.

## 🔎 Why does it matter?

When we work with large text datasets, traditional search methods can be slow and inefficient. By converting the texts into vector representations (embeddings) and then indexing them with Faiss, we make search faster and more scalable.

## ✨ Real-world Example

Imagine you are managing a dataset of product reviews for an e-commerce platform. By indexing the reviews, you can quickly find reviews similar to a given query. This improves customer service by quickly matching relevant reviews to customers' questions!

## ⚙️ Business Impact

🔹 Faster data retrieval improves customer support and analysis processes.
🔹 More efficient product recommendations can increase user satisfaction.
🔹 Saves time and resources, allowing teams to focus on other business priorities.

## 📊 Summary of what the code does

- Converts the text into vectors using a pretrained model such as SentenceTransformer.
- Uses Faiss to index the vectors and make them searchable.
- Finds similar text entries based on the user's input.

🔗 [GitHub](https://github.com/jcombari/AI-For-Unstructured-Data/tree/main)

## 💭 Reflection

**How would you implement this in your daily work with large datasets?**  
What data challenges could this solve?

🔑 **#AI #DataScience #Python #MachineLearning #DeepLearning #Pandas #Faiss #TextMining #GenerativeAI #TechForGood #DataScienceTips #UnstructuredData**

---


```python
# First, we import the necessary libraries
import pandas as pd
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer

# Example: let's create a small dataset with 12 sample texts
data = {
    'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
    'text': [
        'I love programming in Python',
        'Python is amazing for data science',
        'Data science is the future of tech',
        'Machine learning makes data useful',
        'Deep learning models are transforming AI applications',
        'I am passionate about artificial intelligence and automation',
        'Big data and cloud computing are the backbone of modern businesses',
        'Data visualization helps in making informed decisions',
        'Neural networks are used for complex pattern recognition tasks',
        'Exploring the future of AI and machine learning technologies',
        'Data cleaning is crucial for accurate model training',
        'Generative AI has opened new possibilities in content creation'
    ]
}

# Create a pandas DataFrame
df = pd.DataFrame(data)

# Load the SentenceTransformer model to convert text into embeddings
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

# Convert the text column into embeddings (one vector per row);
# Faiss expects float32 arrays
embeddings = np.asarray(model.encode(df['text'].tolist()), dtype=np.float32)

# Initialize a Faiss index for exact (exhaustive) search
dimension = embeddings.shape[1]  # The dimension of the vectors
index = faiss.IndexFlatL2(dimension)  # Ranks by squared L2 distance

# Add embeddings to the Faiss index; row i of the index is row i of df
index.add(embeddings)

# Let's create a search query and convert it to an embedding
query = 'How does artificial intelligence change the way we live?'
query_embedding = np.asarray(model.encode([query]), dtype=np.float32)

# Perform the search: find the top 3 most similar texts
k = 3
distances, indices = index.search(query_embedding, k)

# Print the search results: the most similar texts and their distances
for dist, idx in zip(distances[0], indices[0]):
    print(f"Text: {df.iloc[idx]['text']} (Distance: {dist:.2f})")
```



```plaintext
Text: I am passionate about artificial intelligence and automation (Distance: 29.35)
Text: Exploring the future of AI and machine learning technologies (Distance: 32.41)
Text: Deep learning models are transforming AI applications (Distance: 36.66)
```
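
Since the goal is to search text that lives in Pandas, it can help to turn the raw search arrays into a small results table instead of printing them. The sketch below is one possible way to do that, not part of the original notebook: the placeholder DataFrame and the hard-coded `distances` and `indices` arrays are hypothetical stand-ins shaped like the `(1, k)` arrays that `index.search` returns for a single query.

```python
import numpy as np
import pandas as pd

# Placeholder data; in the notebook this would be the reviews DataFrame.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "text": ["alpha", "beta", "gamma", "delta"],
})

# Hypothetical search output for one query and k = 2.
distances = np.array([[0.5, 1.25]], dtype=np.float32)
indices = np.array([[2, 0]], dtype=np.int64)

# Look up the matching rows by position and attach the search metadata.
results = df.iloc[indices[0]].copy()
results["distance"] = distances[0]
results["rank"] = np.arange(1, len(results) + 1)
print(results)
```

This relies on the vectors having been added to the index in the same order as the DataFrame rows. If the index holds fewer than `k` vectors, Faiss pads the result with `-1`, so those entries should be filtered out before calling `iloc`.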


## **Understanding the Output of the Search**

Running the search code above prints the following (exact values can vary with the model and library versions):

```plaintext
Text: I am passionate about artificial intelligence and automation (Distance: 29.35)
Text: Exploring the future of AI and machine learning technologies (Distance: 32.41)
Text: Deep learning models are transforming AI applications (Distance: 36.66)
```

## **What Does This Output Mean?**

The program shows the 3 texts most similar to the query `How does artificial intelligence change the way we live?`, ordered from most to least similar. Each line contains two pieces of information:

- **Text**: A matching text from the dataset.
- **Distance**: A number measuring how far this text's embedding is from the query's embedding.

### **Explanation of the Components**

#### **Text**:

- This is the sentence from your dataset that was retrieved for the search query.
- The program compares the embedding of the query with the embedding of every text stored in the dataset and returns the closest ones.

#### **Distance**:

- The **distance** measures how dissimilar the query and the text are. `faiss.IndexFlatL2` reports the **squared L2 (Euclidean) distance** between the two embedding vectors.
- **Lower distances** indicate that the text is more similar to the query, and **higher distances** indicate less similarity.
- For example, the distance of **29.35** for "I am passionate about artificial intelligence and automation" is the smallest of the three, so that sentence is the closest to the query in meaning. The absolute value depends on the embedding model and has no fixed scale, so compare distances within the same result set rather than reading them as percentages.

### **How is the Distance Calculated?**

- The **SentenceTransformer** model converts each text into a vector, a list of numbers that captures the meaning of the text. The **distance** describes how close two of these vectors (the query and a text) are in that multi-dimensional space.
- The **Faiss search** computes the distance between the query vector and each indexed vector. With `IndexFlatL2` this is an exact, exhaustive comparison against all stored vectors. Smaller values indicate that the two vectors, and therefore the meanings they represent, are close to each other.

### **Interpreting the Results**

- **"Text: I am passionate about artificial intelligence and automation (Distance: 29.35)"**:
  - This is the most similar text to the query. It mentions artificial intelligence explicitly, although it is not an exact match for the question.
- **"Text: Exploring the future of AI and machine learning technologies (Distance: 32.41)"**:
  - This is the second most similar text, with a distance of **32.41**. It is still about AI, but slightly farther from the query than the first text.
- **"Text: Deep learning models are transforming AI applications (Distance: 36.66)"**:
  - This is the third most similar text, with a distance of **36.66**. It is the least similar of the three, but still closer to the query than the remaining nine texts in the dataset.
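
Assuming the index is `faiss.IndexFlatL2`, as in the code above, the reported number is the squared Euclidean distance rather than the Euclidean distance itself. The following sketch reproduces that calculation by hand with NumPy; the 3-D vectors are hypothetical stand-ins for a query embedding and a text embedding.

```python
import numpy as np

# Hypothetical 3-D vectors; real embeddings from the model are much longer.
query_vec = np.array([1.0, 2.0, 3.0], dtype=np.float32)
text_vec = np.array([2.0, 0.0, 3.0], dtype=np.float32)

diff = query_vec - text_vec
squared_l2 = float(np.dot(diff, diff))   # sum of squared differences: 1 + 4 + 0
euclidean = float(np.sqrt(squared_l2))   # ordinary Euclidean distance

print(squared_l2)  # 5.0
print(euclidean)
```

Taking the square root does not change the ranking, because a smaller squared distance always means a smaller Euclidean distance.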

### **Key Takeaways:**

- The **distance value** helps us measure how similar the search results are to the query. A **lower distance value** indicates a **more relevant** text.
- **Faiss** is used to quickly find the closest match to the query based on the vector embeddings, allowing us to retrieve meaningful information efficiently.
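
One optional variation, which the original code does not use, is to scale every embedding to unit length before indexing. For unit vectors the squared L2 distance equals `2 - 2 * cosine_similarity`, so distances fall between 0 and 4 and the ranking matches a cosine-similarity ranking. The sketch below checks this relation with NumPy on random, hypothetical vectors; `l2_normalize` is an illustrative helper, not a library function.

```python
import numpy as np

def l2_normalize(vectors):
    """Scale each row to unit length."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / norms

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(5, 8))   # hypothetical text embeddings
query = rng.normal(size=(1, 8))        # hypothetical query embedding

unit_emb = l2_normalize(embeddings)
unit_query = l2_normalize(query)

cosine = (unit_emb @ unit_query.T).ravel()               # cosine similarity per text
squared_l2 = ((unit_emb - unit_query) ** 2).sum(axis=1)  # squared L2 distance per text

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
print(np.allclose(squared_l2, 2 - 2 * cosine))
# Sorting by ascending distance gives the same order as descending cosine similarity.
print(np.array_equal(np.argsort(squared_l2), np.argsort(-cosine)))
```

Whether normalization helps depends on the embedding model and the task, so treat it as something to test rather than a default.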
