Install all the packages. Be sure to select the TPU accelerator.

In [1]:
!pip install tensorflow-cpu tensorflow-hub tensorflow-text
!pip install pyarrow
!pip install langchain
!pip install --quiet langchain_experimental langchain_openai
!pip install pypdf
# Install Keras 3 last. See https://keras.io/getting_started/ for more details.
!pip install -q -U keras-nlp
# Quote the requirement so the shell does not treat ">=3" as an output redirection.
!pip install -q -U "keras>=3"

!pip install sentence_transformers
!pip install chromadb

Collecting tensorflow-cpu
  Downloading tensorflow_cpu-2.16.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (214.0 MB)
Collecting tensorboard<2.17,>=2.16
  Downloading tensorboard-2.16.2-py3-none-any.whl (5.5 MB)
Collecting ml-dtypes~=0.3.1
  Downloading ml_dtypes-0.3.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.2 MB)
Collecting namex
  Downloading namex-0.0.7-py3-none-any.whl (5.8 kB)
Collecting tensorflow<2.16,>=2.15.0
  Downloading tensorflow-2.15.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (475.2 MB)

## Select a backend
Keras is a high-level, multi-framework deep learning API designed for simplicity and ease of use. Using Keras 3, you can run workflows on one of three backends: TensorFlow, JAX, or PyTorch.

In [2]:
import os

os.environ["KERAS_BACKEND"] = "jax"  # Or "torch" or "tensorflow".
# Avoid memory fragmentation on JAX backend.
os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"]="1.00"

# Import Keras and KerasNLP.
import keras
import keras_nlp

  from .autonotebook import tqdm as notebook_tqdm
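
The backend choice only takes effect if `KERAS_BACKEND` is set before `keras` is imported for the first time. As a minimal sketch, a small guard can catch a mistyped backend name early. `select_backend` and `SUPPORTED_BACKENDS` are hypothetical helper names introduced here for illustration; they are not part of Keras.

```python
import os
import sys

# The three backend identifiers mentioned above.
SUPPORTED_BACKENDS = ("tensorflow", "jax", "torch")


def select_backend(name):
    """Set KERAS_BACKEND and report whether keras was still unimported."""
    if name not in SUPPORTED_BACKENDS:
        raise ValueError(
            f"Unknown backend {name!r}; expected one of {SUPPORTED_BACKENDS}"
        )
    os.environ["KERAS_BACKEND"] = name
    # If keras is already imported, the new value is ignored until the
    # kernel is restarted, so we return False in that case.
    return "keras" not in sys.modules


took_effect = select_backend("jax")
```

If `took_effect` is `False`, restart the kernel and run the backend cell before any cell that imports `keras` or `keras_nlp`.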


## Loading the Model

KerasNLP provides implementations of many popular model architectures. We will create a model using GemmaCausalLM, an end-to-end Gemma model for causal language modeling. A causal language model predicts the next token based on previous tokens.

In [3]:
# Loading Instruct Gemma_2b
import gc
gc.collect()
gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset("gemma_instruct_2b_en")
gemma_lm.summary()

Attaching 'config.json' from model 'keras/gemma/keras/gemma_instruct_2b_en/2' to your Kaggle notebook...
Attaching 'config.json' from model 'keras/gemma/keras/gemma_instruct_2b_en/2' to your Kaggle notebook...
Attaching 'model.weights.h5' from model 'keras/gemma/keras/gemma_instruct_2b_en/2' to your Kaggle notebook...
Attaching 'tokenizer.json' from model 'keras/gemma/keras/gemma_instruct_2b_en/2' to your Kaggle notebook...
Attaching 'assets/tokenizer/vocabulary.spm' from model 'keras/gemma/keras/gemma_instruct_2b_en/2' to your Kaggle notebook...
normalizer.cc(51) LOG(INFO) precompiled_charsmap is empty. use identity normalization.


## Prepare fine-tune data

Alternatively, skip this part and read /kaggle/input/data-science/data_science.txt directly.


In [4]:
RETRAIN = False

In [5]:
if RETRAIN:
    # Quote the requirement so the shell does not treat ">=3.1" as a redirection.
    !pip install "cryptography>=3.1"
    import os
    from langchain.document_loaders import PyPDFLoader

    from langchain.embeddings import HuggingFaceBgeEmbeddings
    from tqdm import tqdm

There are a number of text splitters to try in [LangChain Text Splitters](https://python.langchain.com/docs/modules/data_connection/document_transformers/).
Here we use RecursiveCharacterTextSplitter for efficiency. It tries to keep paragraphs (and then sentences, and then words) together for as long as possible, since those are generally the most strongly semantically related pieces of text.

In [6]:

from langchain_text_splitters import RecursiveCharacterTextSplitter
gc.collect()
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    is_separator_regex=False,
)
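
To make the "paragraphs, then sentences, then words" behaviour concrete, here is a simplified pure-Python sketch of recursive splitting. It is not LangChain's implementation: the name `recursive_split` is hypothetical, the separator list is an assumption, and `chunk_overlap` is ignored entirely.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters,
    trying coarse separators first and finer ones only when needed."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character offsets.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging neighbouring pieces
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= chunk_size:
            current = part
        else:
            # This piece alone is too long: retry with finer separators.
            chunks.extend(recursive_split(part, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


demo_chunks = recursive_split("aaa bbb\n\nccc ddd eee", chunk_size=8)
```

In this example the first paragraph fits in one chunk, while the second is too long and is broken at word boundaries instead.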


In [7]:
if RETRAIN:
    # Prepare fine tune data
    def get_all_pdfs(directory):
        """Get the list of pdf files in the directory."""
        pdf_files = []
        for root, dirs, files in os.walk(directory):
            for file in files:
                if file.endswith(".pdf"):
                    pdf_files.append(os.path.join(root, file))
        return pdf_files

    # load documents
    pdf_files = get_all_pdfs('/kaggle/input/data-science-cheat-sheets')
    loaders = [PyPDFLoader(pdf_file) for pdf_file in pdf_files]
    all_documents = []
    for loader in tqdm(loaders):
        raw_documents = loader.load_and_split()
        # split the documents into smaller chunks
        documents = text_splitter.split_documents(raw_documents)
        all_documents.extend(documents)

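    # Write the chunks to a text file.
    # Note: /kaggle/input is normally read-only on Kaggle; if this fails, write to
    # /kaggle/working instead and attach the file as a dataset.
    with open("/kaggle/input/data-science/data_science.txt", "w") as f:
        for d in all_documents:
            f.write(d.page_content)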
    

## Inference
To help with accuracy, we adopt three strategies here.
1. RAG with Wikipedia data.
2. Fine-tuning on data science cheat sheets.
3. Chain-of-thought prompting.


For #1, we are going to use WikipediaRetriever to retrieve wiki pages from wikipedia.org in the Document format. This additional information is used to provide extra context to the Gemma model when it generates a response.

For #2, we are going to use LoRA to fine-tune the Gemma model with domain-specific knowledge.

For #3, the idea is to first process the query by extracting terms related to data science, then let the LLM explain those terms before answering the question directly.

**The quality of the tuning data is crucial. Low-quality training data does more harm than good!**


## Load dataset

**Set 1:** Load data from data science cheat sheets.
This dataset needs cleaning before being used for fine tuning.

In [8]:

with open("/kaggle/input/data-science/data_science.txt") as f:
    data_science = f.read()
texts = text_splitter.create_documents([data_science])
data = [t.page_content for t in texts]

**Set 2:** Load data from kaggle-docs

In [9]:
import pandas as pd
df = pd.read_csv("/kaggle/input/kaggle-docs/questions_answers/data.csv")

In [10]:
template = "Question:\n{Question}\n\nAnswer:\n{Answer}"
df["prompt"] = df.apply(lambda row: template.format(Question=row.Question,
                                                             Answer=row.Answer), axis=1)
kaggle_data = df.prompt.tolist()
kaggle_data[:3]

['Question:\nWhat are the different types of competitions available on Kaggle?\n\nAnswer:\n# Types of Competitions\n\nKaggle Competitions are designed to provide challenges for competitors at all different stages of their machine learning careers. As a result, they are very diverse, with a range of broad types.\n\n## Featured\n\nFeatured competitions are the types of competitions that Kaggle is probably best known for. These are full-scale machine learning challenges which pose difficult, generally commercially-purposed prediction problems. For example, past featured competitions have included:\n\n- [Allstate Claim Prediction Challenge](https://www.kaggle.com/c/allstate-purchase-prediction-challenge) - Use customers’ shopping history to predict which insurance policy they purchase\n- [Jigsaw Toxic Comment Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge) - Predict the existence and type of toxic comments on Wikipedia\n- [Zillow Prize](htt

**Set 3:** DataScience_QA

In [11]:
qa = pd.read_csv("test12/main_data.csv")
qa["prompt"] = qa.apply(lambda row: template.format(Question=row.question, Answer=row.answer), axis=1)
qa_data = qa.prompt.tolist()
qa_data[:3]

['Question:\nWhat is Data Science?\n\nAnswer:\nData science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data.',
 'Question:\nWhat are the Key Components of Data Science?\n\nAnswer:\nThe key components of data science include data collection, data cleaning and preprocessing, data analysis and modeling, interpretation of results, and communication of findings.',
 'Question:\nWhat is Data Collection?\n\nAnswer:\nData collection involves gathering relevant data from various sources, such as databases, files, sensors, and web scraping, to address specific questions or objectives.']

LLMs are extremely large (parameters on the order of billions). Full fine-tuning, which updates all the parameters in the model, is not required for most applications because typical fine-tuning datasets are much smaller than the pre-training datasets.

Low Rank Adaptation (LoRA) is a fine-tuning technique which greatly reduces the number of trainable parameters for downstream tasks by freezing the weights of the model and inserting a smaller number of new weights into the model. This makes training with LoRA much faster and more memory-efficient, and produces smaller model weights (a few hundred MBs), all while maintaining the quality of the model outputs.
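
A quick back-of-the-envelope calculation shows where the savings come from. For a frozen weight matrix of shape `d_out x d_in`, LoRA trains two low-rank factors of shapes `d_out x rank` and `rank x d_in`. The sketch below counts both; the 2048 x 2048 layer size is an illustrative assumption, not a figure taken from this notebook, and `lora_param_counts` is a hypothetical helper name.

```python
def lora_param_counts(d_out, d_in, rank):
    """Return (frozen, trainable) parameter counts for one LoRA-adapted matrix."""
    frozen = d_out * d_in              # original weight W, kept fixed
    trainable = rank * (d_out + d_in)  # factors B (d_out x rank) and A (rank x d_in)
    return frozen, trainable


# Assumed layer size for illustration, with the rank used in this notebook.
frozen, trainable = lora_param_counts(2048, 2048, rank=4)
print(f"frozen={frozen:,} trainable={trainable:,} ratio={trainable / frozen:.2%}")
# frozen=4,194,304 trainable=16,384 ratio=0.39%
```

A higher rank gives the adapter more capacity at the cost of more trainable parameters, which scale linearly with the rank.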

In [12]:
# finetune with data_science text.
# Enable LoRA for the model and set the LoRA rank to 4.
gemma_lm.backbone.enable_lora(rank=4)
gemma_lm.summary()

In [13]:
FINE_TUNE = True

In [14]:
if FINE_TUNE:
    # Limit the input sequence length to 512 (to control memory usage).
    gemma_lm.preprocessor.sequence_length = 512
    # Use AdamW (a common optimizer for transformer models).
    optimizer = keras.optimizers.AdamW(
        learning_rate=5e-5,
        weight_decay=0.01,
    )
    # Exclude layernorm and bias terms from decay.
    optimizer.exclude_from_weight_decay(var_names=["bias", "scale"])

    gemma_lm.compile(
        loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        optimizer=optimizer,
        weighted_metrics=[keras.metrics.SparseCategoricalAccuracy()],
    )
    #gemma_lm.fit(data, epochs=1, batch_size=1)
    
    # train on data from kaggle doc
    gemma_lm.fit(kaggle_data, epochs=3, batch_size=1)
    
    # It would be better to have a separate LLM just for assistant work, such as query cleansing,
    # but due to memory limits only one LLM instance can be created.
    # assistant_lm = keras.saving.load_model(MODEL_PATH)

Epoch 1/3
60/60 ━━━━━━━━━━━━━━━━━━━━ 34s 153ms/step - loss: 2.3653 - sparse_categorical_accuracy: 0.4726
Epoch 2/3
60/60 ━━━━━━━━━━━━━━━━━━━━ 9s 150ms/step - loss: 2.1124 - sparse_categorical_accuracy: 0.4805
Epoch 3/3
60/60 ━━━━━━━━━━━━━━━━━━━━ 9s 150ms/step - loss: 1.9953 - sparse_categorical_accuracy: 0.4886


In [15]:
!pip install wikipedia

Collecting wikipedia
  Downloading wikipedia-1.4.0.tar.gz (27 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: wikipedia
  Building wheel for wikipedia (setup.py) ... done
  Created wheel for wikipedia: filename=wikipedia-1.4.0-py3-none-any.whl size=11678 sha256=916223d89ae41045e198531965e6b61a5f02662630f7d5a8b74f87bcd86f6169
  Stored in directory: /root/.cache/pip/wheels/5e/b6/c5/93f3dec388ae76edc830cb42901bb0232504dfc0df02fc50de
Successfully built wikipedia
Installing collected packages: wikipedia
Successfully installed wikipedia-1.4.0


In [16]:
from langchain_community.retrievers import WikipediaRetriever
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
retriever = WikipediaRetriever(top_k_results=2)
template = """You are a data scientist that answer questions in data science domain in an easy to understand and eloquent way. Use the given Context as reference if it is relevant to the question.

Context: {context}

Question: {question}

Answer: 
"""

In [17]:
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [18]:
# Tokens to drop from queries: English stop words, punctuation marks, and "does".
TO_CLEAN = set(ENGLISH_STOP_WORDS) | set(string.punctuation) | {"does"}

In [19]:
def clean_query(query):
    q = query.strip()
    new_text = " ".join([w for w in q.split() if w.lower() not in TO_CLEAN])
    return new_text

# few-shot prompting to extract keywords and explain them.
def extract_key_words(query):
    prompt = """ In the following text, identify terms that are relevant to Data Science and explain the terms. Follow the examples for formatting. 
    
    Text: Matplot is a python library that help us to plot data. The easiest and most basic plots are line, scatter and histogram plots.
    Answer: 
     Matplot: Matplot in Matplotlib is a Python library that lets you create all sorts of plots, charts, and graphs. Think of it as your digital art kit for visual storytelling with data.
     histogram plots: A histogram plot is a visualization tool used to understand the distribution of numerical data. 
     
    Text: Pandas Python library
    Answer: 
     Pandas: Pandas is built on top of the Python programming language and designed specifically to make working with structured, tabular data both easy and intuitive.
   
    Text: {text}
    Answer:
    """
    p = prompt.format(
        text=clean_query(query),
    )

    result = gemma_lm.generate(p)
    idx = result.rfind("Answer:")
    t_idx = result.rfind("Text:")
    if t_idx >= idx:
        return ""
    if idx != -1:
        return result[idx+7:] + '\n'
    return ""   
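
One detail of `clean_query` is worth seeing on real input: it filters whitespace-separated tokens, so a punctuation mark is dropped only when it stands alone, not when it is attached to a word. The sketch below reproduces that behaviour with a tiny stand-in stop-word set; `TO_CLEAN_DEMO` and `clean_query_demo` are hypothetical names, and the set is deliberately much smaller than scikit-learn's `ENGLISH_STOP_WORDS`.

```python
import string

# Illustrative stand-in for ENGLISH_STOP_WORDS plus punctuation.
TO_CLEAN_DEMO = {"what", "is", "a", "the", "of", "does"} | set(string.punctuation)


def clean_query_demo(query):
    """Drop tokens that are stop words or lone punctuation marks."""
    tokens = query.strip().split()
    return " ".join(w for w in tokens if w.lower() not in TO_CLEAN_DEMO)


standalone = clean_query_demo("What is the significance of p-value ?")
attached = clean_query_demo("What is a Confusion Matrix?")
print(standalone)  # significance p-value
print(attached)    # Confusion Matrix?
```

If trailing punctuation should also be removed, one option is to strip `string.punctuation` from each token before the membership test.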

In [20]:
def summarize_context(context):
    prompt = """Summarize the given text. Follow the examples for formatting.
    
    Text: Heteroskedasticity is a condition in which the variance of a data series is not constant. This can cause problems when performing statistical analysis, as it can lead to biased and inefficient results.
    Answer: 
     Heteroskedasticity is a condition in which the variance of a data series is not constant.
    
    Text: {text}
    Answer:
    """
    p = prompt.format(
        text=context,
    )
    
    result = gemma_lm.generate(p)
    idx = result.rfind("Answer:")
    t_idx = result.rfind("Text:")
    if t_idx >= idx:
        return ""
    if idx != -1:
        return result[idx+7:] + '\n'
    return ""
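
Both helpers above share the same post-processing step: `generate` returns the prompt followed by the completion, so the code keeps only what follows the last `Answer:` marker and discards the result if the model started a new `Text:` block afterwards. The sketch below isolates that logic so it can be checked without loading the model; `extract_last_answer` is a hypothetical name and the sample strings are made up.

```python
def extract_last_answer(result, answer_tag="Answer:", text_tag="Text:"):
    """Return the text after the last answer_tag, or "" if it is unusable."""
    idx = result.rfind(answer_tag)
    t_idx = result.rfind(text_tag)
    # No answer marker, or the model began another "Text:" example after it.
    if idx == -1 or t_idx >= idx:
        return ""
    return result[idx + len(answer_tag):] + "\n"


good = "Text: A\nAnswer: first\nText: B\nAnswer: second"
runaway = "Text: A\nAnswer: first\nText: something new"

print(repr(extract_last_answer(good)))     # ' second\n'
print(repr(extract_last_answer(runaway)))  # ''
```

Using `len(answer_tag)` instead of a literal `7` keeps the slice correct if the marker text is ever changed.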

In [21]:
def get_answer(query):
    # Input scoping
    scope = "In Data Science domain, "
    docs = retriever.get_relevant_documents(query=scope+query)
    context = ''
    for d in docs:
        summary = summarize_context(d.metadata['summary'])        
        context += summary
        
    keywords = extract_key_words(query)
    context += keywords
    prompt = template.format(
        context=context,
        question= scope + query,
    )
    # output sanitize?
    return gemma_lm.generate(prompt)
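
Because `get_answer` depends on Wikipedia and on the loaded model, it can be handy to check the prompt assembly on its own. The following sketch mirrors the same flow with the retriever, summarizer, and keyword explainer passed in as plain callables. All names here (`build_prompt`, `PROMPT_TEMPLATE`, and the `fake_*` stubs) are hypothetical, and the template is shortened for the example.

```python
PROMPT_TEMPLATE = "Context: {context}\n\nQuestion: {question}\n\nAnswer: \n"


def build_prompt(query, retrieve, summarize, explain_terms,
                 scope="In Data Science domain, "):
    """Assemble the final prompt from retrieved summaries and term explanations."""
    scoped = scope + query
    # Retrieval uses the scoped query; each hit is summarized before use.
    context = "".join(summarize(doc) for doc in retrieve(scoped))
    # Term explanations are appended after the retrieved context.
    context += explain_terms(query)
    return PROMPT_TEMPLATE.format(context=context, question=scoped)


# Stubs standing in for WikipediaRetriever and gemma_lm.generate.
def fake_retrieve(query):
    return ["Long article about data science."]


def fake_summarize(doc):
    return " Summary: " + doc[:12] + "\n"


def fake_explain(query):
    return " p-value: (term explanation)\n"


prompt = build_prompt("What is the significance of p-value?",
                      fake_retrieve, fake_summarize, fake_explain)
print(prompt)
```

Swapping the stubs for the real retriever and model calls recovers the behaviour of `get_answer`, minus the final `generate` call.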
    

Results after fine-tuning on the Kaggle data only.


In [22]:
qlist= ["What is Data Science?",
"Differentiate between Data Analytics and Data Science",
"What are the differences between supervised and unsupervised learning?",
"Explain the steps in making a decision tree.",
"Differentiate between univariate, bivariate, and multivariate analysis.",
"How should you maintain a deployed model?",
"What is a Confusion Matrix?",
"How is logistic regression done?",
"What is the significance of p-value?",
"Mention some techniques used for sampling."]

for q in qlist:
    print(get_answer(q))
    print("\n--------------\n")

You are a data scientist that answer questions in data science domain in an easy to understand and eloquent way. Use the given Context as reference if it is relevant to the question.

Context: 
    A data scientist is a professional who creates programming code and combines it with statistical knowledge to create insights from data.

    Data science is a broad field of study that encompasses many different topics, including statistics, mathematics, computer science, and social science. Data science is a rapidly growing field, and new technologies and tools are being developed all the time.

**Additional Notes:**

* The text also mentions Matplotlib, which is a Python library for creating plots.
* The text also mentions NumPy, which is a Python library for numerical computing.
* The text also mentions Seaborn, which is a Python library for creating data visualizations.


Question: In Data Science domain, What is Data Science?

Answer: 
Data science is a broad field of study that encomp

Results after fine-tuning on the Kaggle data and the QA data.


In [23]:
gemma_lm.fit(qa_data, epochs=3, batch_size=1)

Epoch 1/3
81/81 ━━━━━━━━━━━━━━━━━━━━ 12s 150ms/step - loss: 0.1946 - sparse_categorical_accuracy: 0.6585
Epoch 2/3
81/81 ━━━━━━━━━━━━━━━━━━━━ 12s 150ms/step - loss: 0.1672 - sparse_categorical_accuracy: 0.6872
Epoch 3/3
81/81 ━━━━━━━━━━━━━━━━━━━━ 12s 150ms/step - loss: 0.1451 - sparse_categorical_accuracy: 0.6946


<keras.src.callbacks.history.History at 0x7cbfd20ffcd0>

In [24]:
for q in qlist:
    print(get_answer(q))
    print("\n--------------\n")

You are a data scientist that answer questions in data science domain in an easy to understand and eloquent way. Use the given Context as reference if it is relevant to the question.

Context: 
    Data science is an interdisciplinary academic field that uses statistics, scientific computing, scientific methods, processes, algorithms and systems to extract or extrapolate knowledge and insights from potentially noisy, structured, or unstructured data.

    Machine learning is a subfield of data science that focuses on algorithms that can learn from data. Machine learning algorithms are used to solve problems that are too complex for traditional statistical methods.


Question: In Data Science domain, What is Data Science?

Answer: 
Data science is an interdisciplinary academic field that uses statistics, scientific computing, scientific methods, processes, algorithms and systems to extract or extrapolate knowledge and insights from potentially noisy, structured, or unstructured data.

-