
# Task
Build a Gradio semantic search tool for the "financial_news.csv" dataset. This tool should allow users to input a search query and retrieve the top 5 most semantically similar news articles from the dataset. The tool should also extract and display any URLs present in the news text, while removing them from the text used for embedding.

## Load Data

### Subtask:
Load the 'financial_news.csv' file into a pandas DataFrame.


**Reasoning**:
To load the data, I will first import the pandas library, then read the 'financial_news.csv' file into a DataFrame, display its first 5 rows, and print the column names to verify the loading process.



In [2]:
import pandas as pd

df = pd.read_csv('/content/financial_news.csv')
print("First 5 rows of the DataFrame:")
print(df.head())
print("\nColumn names of the DataFrame:")
print(df.columns)

First 5 rows of the DataFrame:
                                                text  label
0  Here are Thursday's biggest analyst calls: App...      0
1  Buy Las Vegas Sands as travel to Singapore bui...      0
2  Piper Sandler downgrades DocuSign to sell, cit...      0
3  Analysts react to Tesla's latest earnings, bre...      0
4  Netflix and its peers are set for a ‘return to...      0

Column names of the DataFrame:
Index(['text', 'label'], dtype='object')


## Extract and Clean URLs

### Subtask:
Iterate through the 'text' column to identify and extract URLs. Store these URLs in a new 'urls' column and remove them from the original 'text' column.


**Reasoning**:
To extract and remove URLs from the 'text' column, I need to use regular expressions. I will define functions for extraction and removal, and then apply them to the DataFrame.



In [4]:
import re

# Compile the URL pattern once and reuse it in both helpers
# (http/https followed by non-whitespace characters)
URL_PATTERN = re.compile(r'https?://\S+')

# Function to extract URLs
def extract_urls(text):
    # findall already returns an empty list when nothing matches
    return URL_PATTERN.findall(text)

# Function to remove URLs
def remove_urls(text):
    return URL_PATTERN.sub('', text).strip()

# Apply the extract_urls function to create the 'urls' column
df['urls'] = df['text'].apply(extract_urls)

# Apply the remove_urls function to clean the 'text' column
df['text'] = df['text'].apply(remove_urls)

print("First 5 rows of the DataFrame after URL extraction and removal:")
print(df.head())
print("\nFirst 5 entries of the 'urls' column:")
print(df['urls'].head())

First 5 rows of the DataFrame after URL extraction and removal:
                                                text  label urls
0  Here are Thursday's biggest analyst calls: App...      0   []
1  Buy Las Vegas Sands as travel to Singapore bui...      0   []
2  Piper Sandler downgrades DocuSign to sell, cit...      0   []
3  Analysts react to Tesla's latest earnings, bre...      0   []
4  Netflix and its peers are set for a ‘return to...      0   []

First 5 entries of the 'urls' column:
0    []
1    []
2    []
3    []
4    []
Name: urls, dtype: object
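
The pattern `https?://\S+` matches everything up to the next whitespace, so punctuation that directly follows a URL ends up inside the match, and removing a URL can leave doubled spaces behind. The sketch below illustrates one possible refinement using only the standard library. The helper names `extract_clean_urls` and `remove_urls_and_tidy`, the `TRAILING_PUNCT` set, and the sample sentence are assumptions made for illustration and are not part of the notebook; note that stripping trailing punctuation would also alter a URL that legitimately ends in one of those characters.

```python
import re

URL_PATTERN = re.compile(r'https?://\S+')
# Assumed set of characters that are more likely sentence punctuation than URL content
TRAILING_PUNCT = '.,;:!?)"\''

def extract_clean_urls(text):
    """Find URLs and strip trailing punctuation (hypothetical helper)."""
    return [url.rstrip(TRAILING_PUNCT) for url in URL_PATTERN.findall(text)]

def remove_urls_and_tidy(text):
    """Remove URLs, then collapse leftover runs of whitespace (hypothetical helper)."""
    without_urls = URL_PATTERN.sub(' ', text)
    return re.sub(r'\s+', ' ', without_urls).strip()

# Made-up sample sentence, not taken from financial_news.csv
sample = ("Fed holds rates steady https://example.com/fed-decision, "
          "markets rally. Details: http://example.org/a?b=1")

urls = extract_clean_urls(sample)
cleaned = remove_urls_and_tidy(sample)

print(urls)
print(cleaned)
```

In this sample the comma after the first URL is consumed by the `\S+` match, so it disappears from the cleaned text while being stripped from the extracted URL.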


## Generate Sentence Embeddings

### Subtask:
Generate sentence embeddings for the cleaned 'text' column.


**Reasoning**:
To generate sentence embeddings, the `sentence-transformers` library is required. This step ensures it is installed.



In [6]:
%pip install sentence-transformers



**Reasoning**:
To generate sentence embeddings, I will import the necessary library, load a pre-trained model, and then encode the 'text' column, storing the results in a new 'embeddings' column.



In [8]:
from sentence_transformers import SentenceTransformer

# Load a pre-trained sentence transformer model
model = SentenceTransformer('all-MiniLM-L6-v2')

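# Generate sentence embeddings for the 'text' column in one batched call
# (faster than encoding row by row); list() splits the 2D result into one vector per row
df['embeddings'] = list(model.encode(df['text'].tolist()))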

print("First 5 entries of the 'embeddings' column (showing shape):")
print(df['embeddings'].head().apply(lambda x: x.shape))

First 5 entries of the 'embeddings' column (showing shape):
0    (384,)
1    (384,)
2    (384,)
3    (384,)
4    (384,)
Name: embeddings, dtype: object
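
Because these vectors are later compared by cosine similarity, one optional refinement is to scale each embedding to unit length once, after which cosine similarity reduces to a plain dot product. The following is a minimal pure-Python sketch of that idea on toy 3-dimensional vectors; `l2_normalize` and `dot` are hypothetical helpers, and the notebook itself does not normalize its embeddings.

```python
import math

def l2_normalize(vec):
    """Return vec scaled to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Toy 3-dimensional stand-ins for the 384-dimensional embeddings
doc = l2_normalize([3.0, 4.0, 0.0])
query = l2_normalize([6.0, 8.0, 0.0])

# After normalization each vector has length 1 ...
unit_length = math.sqrt(dot(doc, doc))
# ... so the dot product of two normalized vectors is their cosine similarity
similarity = dot(doc, query)

print(unit_length, similarity)
```

The two toy vectors point in the same direction, so their similarity comes out at (approximately) 1 regardless of their original magnitudes.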


## Build Gradio Semantic Search App

### Subtask:
Create a Gradio interface that allows users to input search queries. Implement a function to compute cosine similarity between the query embedding and the pre-computed text embeddings, then return the top 5 closest records from the DataFrame. Display these results in the Gradio app.


**Reasoning**:
To build the Gradio application, the `gradio` library needs to be installed first. This step ensures that the necessary library is available in the environment.



In [10]:
%pip install gradio



**Reasoning**:
To implement the semantic search functionality and display it using Gradio, I need to define the core search logic, including embedding the query, calculating cosine similarity, and retrieving the top results. Then, I will set up the Gradio interface to interact with this function.



In [12]:
import gradio as gr
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Ensure the model is loaded (it was loaded in a previous step, but good to ensure context)
# from sentence_transformers import SentenceTransformer
# model = SentenceTransformer('all-MiniLM-L6-v2')

# Convert embeddings to a 2D numpy array for efficient computation
doc_embeddings = np.array(df['embeddings'].tolist())

def semantic_search(query):
    # Encode the query
    query_embedding = model.encode(query)

    # Calculate cosine similarity between query and document embeddings
    # Reshape query_embedding to (1, -1) for cosine_similarity function if it's 1D
    similarities = cosine_similarity(query_embedding.reshape(1, -1), doc_embeddings)[0]

    # Get the indices of the top 5 most similar articles
    top_indices = np.argsort(similarities)[::-1][:5]

    # Retrieve the corresponding articles and their URLs
    top_rows = df.iloc[top_indices]

    # Build the output table; join each URL list into one string so the
    # cell renders as readable text in gr.Dataframe
    output_df = pd.DataFrame({
        "Text": top_rows['text'].tolist(),
        "Label": top_rows['label'].tolist(),
        "URLs": [", ".join(urls) for urls in top_rows['urls']],
    })

    return output_df

# Create the Gradio interface
iface = gr.Interface(
    fn=semantic_search,
    inputs=gr.Textbox(lines=2, placeholder="Enter your search query here..."),
    outputs=gr.Dataframe(headers=["Text", "Label", "URLs"], row_count=5),
    title="Financial News Semantic Search",
    description="Enter a query to find the top 5 most semantically similar financial news articles. URLs present in the original text are extracted and displayed separately."
)

# Launch the app
iface.launch()

It looks like you are running Gradio on a hosted Jupyter notebook, which requires `share=True`. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://234fac2fc2e46c8370.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
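
The ranking step inside `semantic_search` (score every document against the query by cosine similarity, sort in descending order, keep the first few) can be illustrated without any third-party libraries. This is a small sketch on toy 2-dimensional vectors rather than the notebook's implementation; `cosine` and `top_k` are hypothetical names, and the vectors are invented for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=5):
    """Return (index, score) pairs for the k most similar documents, best first."""
    scored = [(i, cosine(query_vec, d)) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy 2-dimensional vectors standing in for the 384-dimensional embeddings
docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
query = [1.0, 0.1]

ranking = top_k(query, docs, k=2)
best_indices = [i for i, _ in ranking]

print(best_indices)
```

Document 0 points almost exactly along the query and document 2 is the next closest in direction, so those two indices are returned in that order; the document pointing the opposite way scores lowest.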




## Final Task

### Subtask:
Summarize the functionality of the semantic search tool and provide instructions on how to use it.


## Summary:

### Q&A
*   **What is the functionality of the semantic search tool?**
    The semantic search tool allows users to input a text query and retrieves the top 5 most semantically similar financial news articles from the `financial_news.csv` dataset. It utilizes sentence embeddings and cosine similarity for matching. Additionally, it extracts any URLs present in the original news text, removes them from the text used for embedding, and displays these extracted URLs alongside the search results.

*   **How do I use the semantic search tool?**
    1.  Access the launched Gradio application via the provided public URL.
    2.  Enter your search query into the text box and click the Submit button.
    3.  The application will then display a table containing the "Text," "Label," and "URLs" of the top 5 most semantically similar financial news articles.

### Data Analysis Key Findings
*   The `financial_news.csv` dataset was successfully loaded, containing 'text' and 'label' columns.
*   A regular expression (`https?://\S+`) was applied to the 'text' column to extract URLs into a new 'urls' column and to remove them from the text. The first five rows shown contain no URLs, so their 'urls' entries are empty lists.
*   Sentence embeddings for the cleaned 'text' column were generated using the 'all-MiniLM-L6-v2' Sentence Transformer model, resulting in 384-dimensional vectors for each news article.
*   A Gradio application was developed to facilitate semantic search, employing cosine similarity to rank articles based on their semantic relevance to the user's query.
*   The Gradio interface displays the top 5 search results in a `gr.Dataframe` with three columns: the cleaned text, its label, and any extracted URLs.

### Insights or Next Steps
*   Because matching is based on sentence embeddings rather than exact keywords, the tool can surface relevant articles even when the query is phrased differently from the article text.
*   To enhance the tool's capabilities, consider implementing a mechanism for users to provide feedback on search result relevance, which could be used to fine-tune the embedding model or improve the ranking algorithm over time.
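
As a starting point for the feedback idea above, relevance judgments could simply be appended to a CSV log for later analysis. The sketch below uses only the standard library and an in-memory buffer; the `record_feedback` helper, the field names, and the example values are all assumptions for illustration, and connecting it to the Gradio interface is left out.

```python
import csv
import io
from datetime import datetime, timezone

# Hypothetical log schema; adjust to whatever the later analysis needs
FIELDS = ["timestamp", "query", "article_index", "relevant"]

def record_feedback(handle, query, article_index, relevant):
    """Append one relevance judgment as a CSV row to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    # Write the header only when nothing has been written yet
    if handle.tell() == 0:
        writer.writeheader()
    writer.writerow({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "query": query,
        "article_index": article_index,
        "relevant": relevant,
    })

# In-memory buffer standing in for a log file
buffer = io.StringIO()
record_feedback(buffer, "tesla earnings", 3, True)
record_feedback(buffer, "tesla earnings", 7, False)

# Read the log back to check what was stored
rows = list(csv.DictReader(io.StringIO(buffer.getvalue())))
print(len(rows), rows[0]["article_index"], rows[1]["relevant"])
```

A log like this records which (query, article) pairs users judged relevant, which is the kind of data that fine-tuning or re-ranking would need.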
