### OCI Data Science - Useful Tips
<details>
<summary><font size="2">Check for Public Internet Access</font></summary>

```python
import requests
# A timeout keeps the cell from hanging when there is no outbound route
response = requests.get("https://oracle.com", timeout=10)
assert response.status_code==200, "Internet connection failed"
```
</details>
<details>
<summary><font size="2">Helpful Documentation </font></summary>
<ul><li><a href="https://docs.cloud.oracle.com/en-us/iaas/data-science/using/data-science.htm">Data Science Service Documentation</a></li>
<li><a href="https://docs.cloud.oracle.com/iaas/tools/ads-sdk/latest/index.html">ADS documentation</a></li>
</ul>
</details>
<details>
<summary><font size="2">Typical Cell Imports and Settings for ADS</font></summary>

```python
%load_ext autoreload
%autoreload 2
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

import logging
logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.ERROR)

import ads
from ads.dataset.factory import DatasetFactory
from ads.automl.provider import OracleAutoMLProvider
from ads.automl.driver import AutoML
from ads.evaluations.evaluator import ADSEvaluator
from ads.common.data import ADSData
from ads.explanations.explainer import ADSExplainer
from ads.explanations.mlx_global_explainer import MLXGlobalExplainer
from ads.explanations.mlx_local_explainer import MLXLocalExplainer
from ads.catalog.model import ModelCatalog
from ads.common.model_artifact import ModelArtifact
```
</details>
<details>
<summary><font size="2">Useful Environment Variables</font></summary>

```python
import os
print(os.environ["NB_SESSION_COMPARTMENT_OCID"])
print(os.environ["PROJECT_OCID"])
print(os.environ["USER_OCID"])
print(os.environ["TENANCY_OCID"])
print(os.environ["NB_REGION"])
```
</details>

# Basic Search

The list below acts as our database for the search.

In [None]:
# Simulated database of Wikipedia-like entries
articles = [
    {'title': 'Python (programming language)', 'link': 'https://en.wikipedia.org/wiki/Python_(programming_language)'},
    {'title': 'History of Python', 'link': 'https://en.wikipedia.org/wiki/History_of_Python'},
    {'title': 'Monty Python', 'link': 'https://en.wikipedia.org/wiki/Monty_Python'},
    {'title': 'Anaconda (Python distribution)', 'link': 'https://en.wikipedia.org/wiki/Anaconda_(Python_distribution)'},
    {'title': 'Python molurus', 'link': 'https://en.wikipedia.org/wiki/Python_molurus'},
    {'title': 'Association football', 'link': 'https://en.wikipedia.org/wiki/Association_football'},
    {'title': 'FIFA World Cup', 'link': 'https://en.wikipedia.org/wiki/FIFA_World_Cup'},
    {'title': 'History of artificial intelligence', 'link': 'https://en.wikipedia.org/wiki/History_of_artificial_intelligence'},
    {'title': 'Football in England', 'link': 'https://en.wikipedia.org/wiki/Football_in_England'},
    {'title': 'Applications of artificial intelligence', 'link': 'https://en.wikipedia.org/wiki/Applications_of_artificial_intelligence'}
]

This function performs a keyword search on the provided list of articles. It takes two parameters: articles, which is the list of article dictionaries, and keyword, which is the user's search term. It returns every article whose title contains the keyword, ignoring case.

In [None]:
# Function to perform keyword search on the simulated database
def keyword_search(articles, keyword):
    # Convert keyword to lowercase for case-insensitive matching
    keyword = keyword.lower()
    # Search for the keyword in the titles of the articles
    results = [article for article in articles if keyword in article['title'].lower()]
    return results
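
As a quick non-interactive check of the behavior described above, the sketch below runs the same function on a three-entry sample. The name sample_articles is hypothetical and introduced only for illustration; the notebook's full articles list behaves the same way.

```python
# Hypothetical three-entry sample; the notebook's `articles` list works the same way.
sample_articles = [
    {'title': 'Monty Python', 'link': 'https://en.wikipedia.org/wiki/Monty_Python'},
    {'title': 'FIFA World Cup', 'link': 'https://en.wikipedia.org/wiki/FIFA_World_Cup'},
    {'title': 'History of Python', 'link': 'https://en.wikipedia.org/wiki/History_of_Python'},
]

def keyword_search(articles, keyword):
    """Return the articles whose title contains the keyword, ignoring case."""
    keyword = keyword.lower()
    return [article for article in articles if keyword in article['title'].lower()]

# Matching is case-insensitive, so "PYTHON" finds both Python titles.
python_hits = keyword_search(sample_articles, "PYTHON")
print([hit['title'] for hit in python_hits])

# Matching is on substrings, so "or" also hits "World" and "History".
substring_hits = keyword_search(sample_articles, "or")
print([hit['title'] for hit in substring_hits])
```

The second call shows the main limitation of this approach: any title containing the characters of the keyword counts as a match, and the results are not ranked.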

The code prompts the user for a keyword through the input function and uses it to search the database. The loop then iterates over search_results and prints the title and link of each matching article.

In [None]:
# Example usage
keyword = input("Enter a keyword to search: ")
search_results = keyword_search(articles, keyword)

# Display the search results
for result in search_results:
    print(result['title'], result['link'])

What we just saw is a very high-level implementation: it only checks whether the keyword appears somewhere in a title and does not rank the results.

# Search Using the BM25 Algorithm

Keyword search with the BM25 algorithm can be implemented in Python using the rank_bm25 package, which is a lightweight BM25 implementation. The example below runs on "20 Newsgroups", a collection of approximately 20,000 newsgroup documents partitioned across 20 different newsgroups. It is a basic demonstration of preprocessing, scoring, and ranking documents by their relevance to a given query.

In [None]:
# !pip install rank-bm25

In [1]:
from rank_bm25 import BM25Okapi
from sklearn.datasets import fetch_20newsgroups
import string

This function call retrieves the entire "20 Newsgroups" dataset; subset='all' combines the training and test splits into a single list of posts.

In [2]:
# Fetch the dataset
newsgroups = fetch_20newsgroups(subset='all')
documents = newsgroups.data  # A list of documents (newsgroup posts)

The preprocess function converts text to lowercase, removes punctuation, and splits it into words (tokens). This standardization is essential for effective keyword matching.

In [3]:
# Preprocess the documents
def preprocess(text):
    return text.lower().translate(str.maketrans('', '', string.punctuation)).split()

# Tokenize the documents
tokenized_docs = [preprocess(doc) for doc in documents]
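
To make the effect of preprocess concrete, here is a small sketch that applies the same function to a made-up sentence (the sample string is an assumption, not taken from the dataset).

```python
import string

def preprocess(text):
    # Lowercase, delete ASCII punctuation, then split on whitespace
    return text.lower().translate(str.maketrans('', '', string.punctuation)).split()

# Made-up sample sentence for illustration only
tokens = preprocess("What's a GOOD gun-maker? E-mail me!")
print(tokens)
```

Note that punctuation is deleted rather than replaced with a space, so "gun-maker" becomes the single token "gunmaker" and "What's" becomes "whats". Stop words such as "a" are kept.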

This initializes the BM25 model with the preprocessed (tokenized) documents. The model will use this data to compute the relevance of documents to a query.

In [4]:
# Create a BM25 object
bm25 = BM25Okapi(tokenized_docs)

In [5]:
# Example search query
query = "What are some of the good gun manufacturing brands?"
tokenized_query = preprocess(query)

The BM25 model calculates a score for each document based on its relevance to the query. These scores indicate how well each document matches the query.

In [6]:
# Perform search
doc_scores = bm25.get_scores(tokenized_query)
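
For intuition about what a BM25 score is, the following is a minimal pure-Python sketch of the textbook Okapi BM25 formula. It is not the rank_bm25 source: the function name bm25_scores, the toy corpus, and the k1 and b values are assumptions chosen for illustration, and the IDF variant used here (with +1 inside the logarithm) may differ from the one BM25Okapi applies, so the numbers will not match get_scores exactly.

```python
import math
from collections import Counter

def bm25_scores(tokenized_docs, tokenized_query, k1=1.5, b=0.75):
    """Score every document against the query with a textbook BM25 formula."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(doc) for doc in tokenized_docs) / n_docs

    # Document frequency: in how many documents each term appears
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))

    scores = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        # Longer-than-average documents are penalized, shorter ones boosted
        length_norm = 1 - b + b * len(doc) / avg_len
        score = 0.0
        for term in tokenized_query:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue  # a term absent from the document contributes nothing
            n_t = doc_freq[term]
            # Rare terms get a higher weight than common ones
            idf = math.log((n_docs - n_t + 0.5) / (n_t + 0.5) + 1)
            # Term frequency saturates: repeated hits add less and less
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

# Toy corpus, already tokenized (hypothetical data)
toy_docs = [
    ["gun", "brands", "and", "gun", "makers"],
    ["football", "world", "cup"],
    ["history", "of", "gun", "laws"],
]
toy_scores = bm25_scores(toy_docs, ["gun", "brands"])
print(toy_scores)
```

The first document matches both query terms and ranks highest, the third matches only "gun", and the second matches nothing and scores zero.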

In [7]:
# Get top N documents
top_n = 3
top_doc_indices = sorted(range(len(doc_scores)), key=lambda i: doc_scores[i], reverse=True)[:top_n]
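
Sorting every index works, but with roughly 20,000 scores and only three results needed, a partial selection is an option. The sketch below compares the full sort with heapq.nlargest from the standard library on a small made-up score list (toy_scores is hypothetical); both approaches select the same indices.

```python
import heapq

# Hypothetical scores for five documents
toy_scores = [0.2, 3.1, 0.0, 1.7, 3.1]
top_n = 3

# Full sort, as in the cell above
by_sort = sorted(range(len(toy_scores)), key=lambda i: toy_scores[i], reverse=True)[:top_n]

# Partial selection: keeps only the top_n candidates while scanning
by_heap = heapq.nlargest(top_n, range(len(toy_scores)), key=lambda i: toy_scores[i])

print(by_sort, by_heap)
```

For a corpus of this size the difference is negligible, so the choice is mostly one of readability.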

The script prints the file path (document ID), the BM25 score, and the first 600 characters of each of the top 3 documents. This gives you a glimpse of the content of the documents that are most relevant to the query "What are some of the good gun manufacturing brands?".

In [8]:
# Display top N results
for idx in top_doc_indices:
    print(f"Document ID: {newsgroups.filenames[idx]}, Score: {doc_scores[idx]}\nDocument: {documents[idx][:600]}...\n")

Document ID: /home/datascience/scikit_learn_data/20news_home/20news-bydate-train/talk.politics.guns/54187, Score: 24.239441992539824
Document: From: fcrary@ucsu.Colorado.EDU (Frank Crary)
Subject: Re: My Gun is like my American Express Card
Nntp-Posting-Host: ucsu.colorado.edu
Organization: University of Colorado, Boulder
Lines: 85

In article <CMM.0.90.2.735132009.thomasp@surt.ifi.uio.no> Thomas Parsli <thomasp@ifi.uio.no> writes:
>Drivers licence:
>Forgot that USA is THE land of cars.....
>Getting one in Scandinavia (and northern europe) is not easy.
>Average time is about 20 hours of training, and the cost is rather......

Is the license required for driving a car exclusively on private
property, such as a farm? Here in the United...

Document ID: /home/datascience/scikit_learn_data/20news_home/20news-bydate-train/talk.politics.guns/54273, Score: 24.078773414761613
Document: From: cdt@sw.stratus.com (C. D. Tavares)
Subject: Re: My Gun is like my American Express Card
Organization: S