In [1]:
!pip install google-generativeai
import google.generativeai as genai



In [2]:
import glob

paragraphs = []
for path in glob.glob("/content/unsupervised_reformat.txt"):
  with open(path, encoding="utf-8") as f:
    lines = f.readlines()
  if not lines:
    continue  # Nothing to split in an empty file

  par = lines[0]
  for ln in lines[1:]:
    is_bullet = ln.startswith("•")
    is_numbered = ln[:1].isdigit() and ln[1:3] == ". "
    if is_bullet or is_numbered:
      # Part of a list, combine with previous items
      par += ln
    else:
      paragraphs.append(par.strip())  # Remove surrounding whitespace and store
      par = ln                        # Start new paragraph

  paragraphs.append(par.strip())  # Store the final paragraph, stripped like the rest

for par in paragraphs:
  print(par)
  print('---')
print(len(paragraphs))

Oftentimes a dataset can be partitioned into different categories. A doctor may notice that their patients come in cohorts and different cohorts respond to different treatments. A biologist may gain insight by identifying that bats and whales, despite outward appearances, have some underlying similarity, and both should be considered members of the same category, i.e., “mammal”. The problem of automatically identifying meaningful groupings in datasets is called clustering. Once these groupings are found, they can be leveraged toward interpreting the data and making optimal decisions for each group.
---
Mathematically, clustering looks a bit like classification: we wish to find a mapping from datapoints, x, to categories, y. However, rather than the categories being predefined labels, the categories in clustering are automatically discovered partitions of an unlabeled dataset.
---
Because clustering does not learn from labeled examples, it is an example of an unsupervised learning algor
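
The splitting rule in the cell above can also be written as a small function that is easy to test without a file. This is a minimal sketch, not the notebook's own code: the names `is_list_item` and `split_paragraphs` are hypothetical, and unlike the cell above it drops empty paragraphs instead of keeping them. Like the original rule, it only recognises single-digit numbering such as `1. `.

```python
def is_list_item(line: str) -> bool:
    """Return True for lines starting with a bullet or a single-digit 'N. ' marker."""
    return line.startswith("•") or (line[:1].isdigit() and line[1:3] == ". ")


def split_paragraphs(lines: list[str]) -> list[str]:
    """Group lines into paragraphs, folding list items into the preceding paragraph."""
    paragraphs: list[str] = []
    current = ""
    for line in lines:
        if current and is_list_item(line):
            # List item: keep it attached to the paragraph that introduces it
            current += line
        else:
            # Any other line closes the current paragraph and starts a new one
            if current.strip():
                paragraphs.append(current.strip())
            current = line
    if current.strip():
        paragraphs.append(current.strip())
    return paragraphs


sample = [
    "Clustering has several uses:\n",
    "• grouping patients\n",
    "• grouping species\n",
    "1. First step\n",
    "A new paragraph.\n",
]
for paragraph in split_paragraphs(sample):
    print(paragraph)
    print("---")
```

Here the two bullets and the numbered item stay attached to the line that introduces them, and the last line becomes its own paragraph.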

In [3]:
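import os

# Read the API key from the environment instead of hardcoding it in the notebook.
# GEMINI_API_KEY is an assumed variable name; set it before running this cell.
geminiApiKey = os.environ["GEMINI_API_KEY"]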
genai.configure(api_key=geminiApiKey)
cfg = genai.types.GenerationConfig(max_output_tokens=4000)
sys_msg_train = (
'''
The given text is a paragraph from a chapter of an MIT textbook about unsupervised learning. Based on this text, generate at least 50 high-quality Q&A pairs for training a model. Each question must be very specific, not broad or general. Please follow this format:
<s>[INST]@AI_Supervised_Unsupervised.<question>.[/INST]<answer></s>
The answers should be concise, accurate, and relevant to the paragraph's content.
'''
)
print(sys_msg_train)
print()
model_train = genai.GenerativeModel('gemini-1.5-flash', system_instruction=sys_msg_train)
qa_pairs_train = []
for par in paragraphs[:5]:  # Only the first five paragraphs
  if par.strip():  # Check if the paragraph is non-empty
    response = model_train.generate_content(par, generation_config=cfg)
    qa_pairs_train.append(response.text)
    print(qa_pairs_train[-1])
  else:
    print("Skipping empty paragraph.")


The given text is a paragraph from a chapter of an MIT textbook about unsupervised learning. Based on this text, generate at least 50 high-quality Q&A pairs for training a model. Each question must be very specific, not broad or general. Please follow this format:
<s>[INST]@AI_Supervised_Unsupervised.<question>.[/INST]<answer></s>
The answers should be concise, accurate, and relevant to the paragraph's content.


Here are 50 Q&A pairs based on the provided text, following the requested format.  Note that since the text is brief, some questions will necessarily overlap in theme.

<s>[INST]@AI_Supervised_Unsupervised.What is the core problem addressed by clustering in data analysis?.[/INST]Automatically identifying meaningful groupings within datasets.</s>

<s>[INST]@AI_Supervised_Unsupervised.Give an example of how clustering could be used in medicine..[/INST]Identifying patient cohorts that respond differently to various treatments.</s>

<s>[IN
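
The response text mixes an introductory sentence with the formatted pairs, and it can be cut off mid-pair by the token limit. One way to pull out only the complete pairs is a regular expression over the `<s>[INST]...[/INST]...</s>` template from the prompt. This is a minimal sketch under that assumption: `QA_PATTERN` and `parse_qa_pairs` are hypothetical names, not part of the notebook or of any library.

```python
import re

# Matches one complete pair in the format requested by the system prompt.
# The question is taken lazily up to the ".[/INST]" separator.
QA_PATTERN = re.compile(
    r"<s>\[INST\]@AI_Supervised_Unsupervised\."
    r"(?P<question>.*?)\.\[/INST\](?P<answer>.*?)</s>",
    re.DOTALL,
)


def parse_qa_pairs(text: str) -> list[tuple[str, str]]:
    """Return (question, answer) tuples for every complete pair found in text."""
    return [
        (m.group("question").strip(), m.group("answer").strip())
        for m in QA_PATTERN.finditer(text)
    ]


sample_response = (
    "Here are some Q&A pairs based on the provided text.\n\n"
    "<s>[INST]@AI_Supervised_Unsupervised.What is the core problem addressed "
    "by clustering in data analysis?.[/INST]Automatically identifying "
    "meaningful groupings within datasets.</s>\n\n"
    "<s>[IN"  # A pair cut off by the token limit is ignored
)

for question, answer in parse_qa_pairs(sample_response):
    print("Q:", question)
    print("A:", answer)
```

Applied to the notebook's data, this could be called on each entry of `qa_pairs_train` to separate the usable pairs from the surrounding text.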

In [None]:
sys_msg_val = (
'''
The given text is a paragraph from a chapter of an MIT textbook about unsupervised learning. Based on this text, generate at least 20 high-quality multiple-choice Q&A pairs for validation. The questions can be broad, because we want to check whether the model works correctly. Please follow this format:
<s>[INST]@AI_Supervised_Unsupervised.<question>.[/INST]<answer></s>
Each question should offer at least 4 answer choices, and the answers should be concise, accurate, and relevant to the paragraph's content.
'''
)
print(sys_msg_val)
print()
model_val = genai.GenerativeModel('gemini-1.5-flash', system_instruction=sys_msg_val)
qa_pairs_val = []
for par in paragraphs[:5]:  # Use cleaned paragraphs
  if par.strip():  # Check if the paragraph is non-empty
    try:
      response = model_val.generate_content(par, generation_config=cfg)
      qa_pairs_val.append(response.text)
      print(f"Generated Validation Q&A Pair:\n{response.text}")
    except Exception as e:
      print(f"Error processing paragraph: {e}")
  else:
    print("Skipping empty paragraph.")
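
The `try`/`except` in the loop above skips a paragraph after a single failure. If the failures are transient (for example rate limits), retrying with a growing delay may recover more paragraphs. The helper below is a generic sketch, not something provided by `google-generativeai`: `call_with_retries` and its parameters are hypothetical, and it retries on any exception, which may be broader than you want.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    fn: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on any exception with exponentially growing delays."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # Out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # 1x, 2x, 4x, ... the base delay
    raise ValueError("attempts must be at least 1")


# Demonstration with a stand-in function that fails twice before succeeding.
calls = {"count": 0}


def flaky() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"


delays: list[float] = []
result = call_with_retries(flaky, attempts=3, base_delay=1.0, sleep=delays.append)
print(result, delays)
```

In the validation loop, the call could then be wrapped as `call_with_retries(lambda: model_val.generate_content(par, generation_config=cfg))`, keeping the existing `except` block as the final fallback.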