# LOTUS demo

## APIs

In [2]:
import bigframes.pandas as bpd
from bigframes.ml.llm import GeminiTextGenerator, _GEMINI_1P5_FLASH_001_ENDPOINT, _GEMINI_1P5_PRO_001_ENDPOINT

bpd.options.display.progress_bar = None

In [3]:
# First, initialize the DataFrame that the semantic operators will run on.
data = {
    "Course Name": [
        "Probability and Random Processes",
        "Optimization Methods in Engineering",
        "Digital Design and Integrated Circuits",
        "Computer Security",
        "Operating Systems and Systems Programming",
        "Compilers",
        "Computer Networks",
        "Deep Learning",
        "Graphics",
        "Databases",
        "Art History",
    ]
}
df = bpd.DataFrame(data)
 
model = GeminiTextGenerator(model_name=_GEMINI_1P5_FLASH_001_ENDPOINT)



### 1. `sem_filter`

In [3]:
predict_df = df.sem_filter("{Course Name} requires a lot of math", model, logprobs=True)
predict_df



Unnamed: 0,Course Name,confidence_scores
0,Probability and Random Processes,0.95
1,Optimization Methods in Engineering,0.95
2,Digital Design and Integrated Circuits,0.95
5,Compilers,0.8
7,Deep Learning,0.95
8,Graphics,0.8
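
To make the `{Course Name}` placeholder concrete, here is a minimal sketch of how such an instruction could be expanded into one prompt per row. This is an illustration only, not the library's actual implementation; `render_prompt` is a hypothetical helper that does not appear in the notebook.

```python
import re


def render_prompt(template: str, row: dict) -> str:
    """Replace each {column} placeholder with the row's value (hypothetical helper)."""

    def substitute(match: re.Match) -> str:
        column = match.group(1)
        if column not in row:
            raise KeyError(f"Column {column!r} not found in row")
        return str(row[column])

    # A regex is used so that column names containing spaces are handled explicitly.
    return re.sub(r"\{([^{}]+)\}", substitute, template)


rows = [{"Course Name": "Compilers"}, {"Course Name": "Art History"}]
prompts = [render_prompt("{Course Name} requires a lot of math", r) for r in rows]
```

Each rendered prompt would then be sent to the model as a yes/no question, and rows answered "yes" are kept.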


### 2. `sem_join`

In [4]:
skills_df = bpd.DataFrame({"Skill": ["Art", "Cryptography", "Baking"]})

join_df = df.sem_join(skills_df, "Taking {Course Name} will make me better at {Skill}", model=model, logprobs=True)
join_df



Unnamed: 0,Course Name,Skill,confidence_scores
10,Computer Security,Cryptography,0.85
19,Computer Networks,Cryptography,0.8
24,Graphics,Art,0.85
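
The index values in the output above (10, 19, 24) are consistent with the join evaluating every course/skill pair of a cross join in row-major order. That is an inference from the output rather than a documented guarantee. The sketch below reproduces the indexing with plain Python; `pair_index` is a hypothetical helper.

```python
from itertools import product

courses = [
    "Probability and Random Processes",
    "Optimization Methods in Engineering",
    "Digital Design and Integrated Circuits",
    "Computer Security",
    "Operating Systems and Systems Programming",
    "Compilers",
    "Computer Networks",
    "Deep Learning",
    "Graphics",
    "Databases",
    "Art History",
]
skills = ["Art", "Cryptography", "Baking"]

# Every (course, skill) pair, with the course varying slowest.
pairs = list(product(courses, skills))


def pair_index(course: str, skill: str) -> int:
    """Position of a pair in the row-major cross join (hypothetical helper)."""
    return courses.index(course) * len(skills) + skills.index(skill)
```

With 11 courses and 3 skills, the predicate would be evaluated on 33 pairs, so the cost of a semantic join grows with the product of the two table sizes.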


### 3. `sem_map`

In [5]:
map_df = df.sem_map("Generate a short study plan to succeed in {Course Name}", model=model)
map_df



Unnamed: 0,Course Name,_map
0,Probability and Random Processes,## Study Plan for Probability and Random Proce...
1,Optimization Methods in Engineering,## Study Plan for Optimization Methods in Engi...
2,Digital Design and Integrated Circuits,## Study Plan for Digital Design and Integrate...
3,Computer Security,## Computer Security Study Plan **Goal:** Ac...
4,Operating Systems and Systems Programming,## Study Plan for Operating Systems and Syste...
5,Compilers,## Short Study Plan for Compilers: **1. Funda...
6,Computer Networks,## Short Study Plan for Computer Networks **G...
7,Deep Learning,## Short Study Plan for Deep Learning: **Focu...
8,Graphics,## Graphics Course Study Plan: **Goal:** Achi...
9,Databases,## Short Study Plan for Databases: **1. Acti...


In [6]:
map_df.iloc[0, 1]

"## Study Plan for Probability and Random Processes\n\n**Goal:**  Master the core concepts and build strong problem-solving skills in Probability and Random Processes.\n\n**Strategy:** \n\n1. **Understand the Fundamentals:**\n    * **Week 1-2:** Focus on probability basics: events, axioms, probability distributions, conditional probability, Bayes' Theorem.  \n    * **Week 3-4:** Dive deeper into random variables, expected value, variance, common distributions (Bernoulli, Binomial, Poisson, Normal). \n    * **Week 5-6:**  Explore fundamental concepts of random processes: stochastic processes, Markov Chains, Poisson process. \n\n2. **Practice Regularly:**\n    * **Daily:** Solve at least 5-10 problems from the textbook or previous exams. \n    * **Weekly:**  Review class notes, work on challenging problems, and try to explain concepts to yourself or a study partner.\n\n3. **Seek Help and Resources:**\n    * **Office Hours:**  Utilize your professor's office hours to clarify concepts and 

### 4. `sem_agg`

#### No optimizations

In [6]:
agg_df = df.sem_agg("Generate a study plan for all {Course Name}s", model=model, target_level=None)
agg_df

Loop 0: aggregate 11 rows




Loop 1: aggregate 3 rows




0    Answer: 
Here's a comprehensive study plan enc...
Name: _lotus_doc, dtype: string
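
The `Loop 0` / `Loop 1` log lines suggest a hierarchical reduction: rows are summarized in batches, and the summaries are summarized again until one document remains. The sketch below shows that pattern with a stub in place of the model. The batch size of 4 is an assumption chosen because it reproduces the 11 → 3 → 1 progression seen above; `hierarchical_aggregate` and `fake_summarize` are hypothetical names, not part of the library.

```python
from typing import Callable, List, Tuple


def hierarchical_aggregate(
    docs: List[str],
    summarize: Callable[[List[str]], str],
    batch_size: int = 4,  # assumed value, not taken from the library
) -> Tuple[str, List[int]]:
    """Reduce docs to one summary; also return the row count seen in each loop."""
    if not docs:
        raise ValueError("docs must not be empty")
    if batch_size < 2:
        raise ValueError("batch_size must be at least 2 to make progress")

    rows_per_loop = []
    while len(docs) > 1:
        rows_per_loop.append(len(docs))
        # Summarize each consecutive batch into a single document.
        docs = [
            summarize(docs[i : i + batch_size])
            for i in range(0, len(docs), batch_size)
        ]
    return docs[0], rows_per_loop


def fake_summarize(batch: List[str]) -> str:
    """Stand-in for a model call: simply joins the inputs."""
    return " + ".join(batch)


summary, rows_per_loop = hierarchical_aggregate(
    [f"course{i}" for i in range(11)], fake_summarize
)
```

In a real run, each `summarize` call would be a model request, so the number of loops and the batch size together determine the total number of requests.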

In [8]:
agg_df.iloc[0]

"Answer: \nHere's a comprehensive study plan encompassing all the listed courses:\n\n**Fundamentals:**\n\n* **Probability and Random Processes:**  Understand probability concepts, random variables, and stochastic processes.\n* **Optimization Methods in Engineering:**  Explore optimization techniques and algorithms used in engineering applications.\n* **Digital Design and Integrated Circuits:**  Learn about digital circuit design, logic gates, and integrated circuit fabrication.\n* **Computer Security:**  Study security principles, vulnerabilities, and methods for protecting computer systems.\n* **Operating Systems and Systems Programming:** Gain knowledge of operating system concepts, system programming, and resource management.\n\n**Computer Science Core:**\n\n* **Computer Networks:** Understand network protocols, topology, routing, and security.\n* **Compilers:** Learn about programming languages, syntax analysis, semantic analysis, code generation, and optimization.\n* **Databases:*

#### Optimization with given cluster ID

In [1]:
import numpy as np
# TODO: generate cluster IDs with an embedding model.
# Random IDs are used here as a placeholder.
df["_lotus_partition_id"] = np.random.randint(0, 4, size=len(df))
agg_df = df.sem_agg("Generate a study plan for all {Course Name}s", model=model, num_batch=5)
agg_df



Loop 0: aggregate 11 rows




Loop 1: aggregate 5 rows




Loop 2: aggregate 4 rows
Starting aggregation cross groups.




0    Here's a study plan incorporating all the cour...
Name: _lotus_doc, dtype: string

In [2]:
agg_df.iloc[0]

"Here's a study plan incorporating all the courses provided:\n\n**Optimization Methods in Engineering:**\n\n1. **Master foundational math:**  Review calculus, linear algebra, and optimization principles.\n2. **Explore optimization techniques:**  Grasp gradient descent, Lagrange multipliers, and other methods.\n3. **Apply to engineering problems:**  Work through examples related to design, control, and resource allocation.\n4. **Use software tools:**  Familiarize yourself with optimization libraries and software packages.\n\n**Digital Design and Integrated Circuits:**\n\n1. **Understand Boolean Algebra:**  Learn the basics of logic gates and Boolean operations.\n2. **Master digital circuit design:** Design circuits with AND, OR, NOT gates, flip-flops, and memory elements.\n3. **Explore circuit implementation:**  Learn about different types of memory, registers, and digital hardware.\n4. **Practice with hardware description languages:**  Use VHDL or Verilog to design and simulate digital

## Optimizations

### Cascade Models

In [7]:
# Cascade models: the smaller model runs first to save cost; rows it cannot resolve confidently go to the larger model.
large_model = GeminiTextGenerator(model_name=_GEMINI_1P5_PRO_001_ENDPOINT)
small_model = GeminiTextGenerator(model_name=_GEMINI_1P5_FLASH_001_ENDPOINT)


In [8]:
predict_df = df.sem_filter(
    "{Course Name} requires a lot of math", 
    model=large_model, 
    small_model=small_model,
    confidence_threshold=0.9, 
    logprobs=True
)
predict_df



Debug:
5 rows resolved by helper model.
6 rows resolved by large model


Unnamed: 0,Course Name,helper_lm_results,helper_lm_confidence_scores,large_lm_results,large_lm_confidence_scores
0,Probability and Random Processes,True,0.95,,
1,Optimization Methods in Engineering,True,0.99,,
2,Digital Design and Integrated Circuits,True,0.95,,
7,Deep Learning,True,0.95,,
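
The debug output above reports how many rows each model resolved. A minimal sketch of the cascade idea follows, with stubs in place of real models: the small model answers first, and a row is escalated to the large model only when the small model's confidence is below the threshold. All names here (`cascade_filter`, `small_stub`, `large_stub`) and the stub scores are hypothetical and do not reflect the library's internals.

```python
from typing import Callable, List, Tuple

Prediction = Tuple[bool, float]  # (answer, confidence)


def cascade_filter(
    items: List[str],
    small_model: Callable[[str], Prediction],
    large_model: Callable[[str], Prediction],
    confidence_threshold: float = 0.9,
) -> Tuple[List[Tuple[str, bool]], int]:
    """Return (item, answer) pairs and the number of rows the small model resolved."""
    results = []
    resolved_by_small = 0
    for item in items:
        answer, confidence = small_model(item)
        if confidence >= confidence_threshold:
            resolved_by_small += 1
        else:
            # Low confidence: fall back to the more expensive model.
            answer, _ = large_model(item)
        results.append((item, answer))
    return results, resolved_by_small


# Made-up stub scores, for illustration only.
SMALL_SCORES = {
    "Deep Learning": (True, 0.95),
    "Compilers": (True, 0.80),
    "Art History": (False, 0.60),
}
LARGE_SCORES = {"Compilers": (False, 0.97), "Art History": (False, 0.99)}


def small_stub(item: str) -> Prediction:
    return SMALL_SCORES[item]


def large_stub(item: str) -> Prediction:
    return LARGE_SCORES[item]


results, resolved_by_small = cascade_filter(
    ["Deep Learning", "Compilers", "Art History"], small_stub, large_stub
)
```

Raising `confidence_threshold` sends more rows to the large model, trading cost for accuracy; lowering it does the opposite.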


## Apply to the `bigquery-public-data.hacker_news.full` dataset

### 1. Import required packages

In [12]:
import bigframes.pandas as bpd
from bigframes.ml.llm import GeminiTextGenerator, _GEMINI_1P5_FLASH_001_ENDPOINT, _GEMINI_1P5_PRO_001_ENDPOINT

large_model = GeminiTextGenerator(model_name=_GEMINI_1P5_PRO_001_ENDPOINT)
small_model = GeminiTextGenerator(model_name=_GEMINI_1P5_FLASH_001_ENDPOINT)
# Runtime: 6s

### 2. Read table and select columns

- This dataset contains Hacker News stories and comments.
- It has 41M rows in total.

In [13]:
hacker_news = bpd.read_gbq("bigquery-public-data.hacker_news.full")
hacker_news = hacker_news[["title", "text", "by", "score", "time"]].head(10000)
hacker_news
# Runtime: 4s



Unnamed: 0,title,text,by,score,time
0,,"Well, most people aren&#x27;t alcoholics, so I...",slipframe,,1624675076
1,,"No, you don&#x27;t really <i>need</i> a smartp...",vetinari,,1681919794
2,,It&#x27;s for the late Paul Allen RIP. Should&...,lsr_ssri,,1539652075
3,,Yup they are dangerous. Be careful Donald Trump.,Sven7,,1439222754
4,,"Sure, it&#x27;s totally reasonable. Just point...",nicoburns,,1601896851
...,...,...,...,...,...
9995,LinkedIn rival Viadeo acquires French startup ...,,jmfork,2,1358159590
9996,,"I think it&#x27;s true, as the blog author not...",borepop,,1608045566
9997,Nvidia to Android: We're Just Not That Into You,,kungfudoi,1,1245950208
9998,,How do you propose I&#x27;d be caught? This is...,TangoTrotFox,,1541520331


### 3. Filter titles related to "Art"

In [14]:
hacker_news_w_title = hacker_news[hacker_news["title"].notnull()]
hacker_news_w_title
# Runtime: 6s

Unnamed: 0,title,text,by,score,time
6,The Impending NY Tech Apocalypse: Here's What ...,,gaoprea,3,1317163407
8,Eureca beta is live. A place for your business...,,ricardos,1,1350306572
15,Discord vs. IRC Rough Notes,,todsacerdoti,48,1720809592
21,Oh dear: new Yahoo anti-spoofing measures brea...,,joshreads,1,1396963790
22,How Much Warmer Was Your City in 2016?,,smb06,1,1487287594
...,...,...,...,...,...
9977,Ask HN: Sharing a dedicated server,Had an idea that I would like some feedback on...,idiet,2,1401901111
9986,How to Launch on Product Hunt,,debdutmukherjee,3,1554441805
9993,"Show HN: Free, open source JavaScript and mong...",,jdawg77,3,1422419156
9995,LinkedIn rival Viadeo acquires French startup ...,,jmfork,2,1358159590


In [15]:
art_hacker_news = hacker_news_w_title.sem_filter("{title} is related to Art", model=large_model)
art_hacker_news
# TODO: record runtime

### 4. Cascade model performance

In [None]:
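# Same instruction as the previous cell so the two runs are comparable;
# the keyword matches the executed cascade cell above (small_model=...).
art_hacker_news = hacker_news_w_title.sem_filter("{title} is related to Art", model=large_model, small_model=small_model)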
art_hacker_news

### 5. Automatically fill in missing titles

In [None]:
hacker_news_wo_title = hacker_news[hacker_news["title"].isnull()].head(100)
hacker_news_gai_title = hacker_news_wo_title.sem_map("Generate a short title for the given context: {text}", model=large_model)
hacker_news_gai_title = hacker_news_gai_title[hacker_news_gai_title["title"].isnull()]
hacker_news_gai_title