# Using Gemini for Personalized Higher-Level Education Learning

This notebook is my entry for the Kaggle challenge that highlights the capabilities of LLMs with very long context windows, such as the Gemini models: a learning assistant for the Stanford CS229 course. Courses contain a lot of information, and students spend time going back through it to find the content relevant to the topics they want to review. Because Gemini can handle vast amounts of information, students can ask questions that span multiple media (lecture videos via their transcripts, regular slides, annotated slides, midterm reviews, and the course textbook), leveraging Gemini's multimodal capabilities.

**Education courses often overwhelm students with a large volume of content, requiring significant effort to locate and review relevant material. This solution eliminates that bottleneck, making it easier for students to interact with their course material in a meaningful and efficient way.**

-----------------

## **Use of Long Context Window**
Gemini's ability to handle a context of 1 million tokens allows it to synthesize information across the entire semester, eliminating the need for fragmented queries and enabling cohesive responses.

-----------------

## **Key Features & Value Proposition**:
1) **Personalized Assistance**: Students can ask detailed questions about specific concepts, relationships between topics, or exam preparation strategies and receive tailored explanations.

2) **Integrated Course Resources**: Gemini bridges insights across lectures, notes, slides, and textbooks, providing unified answers that save time and improve comprehension.

3) **Flexible and Targeted Learning Support**: Students can catch up on missed lectures, compare annotated slides to regular ones, or focus on critical areas for exams.
Topic-specific caches can be created to provide focused study guides for struggling students or for areas requiring additional review.

---------------------

## **The Impact of Gemini**
By providing access to an entire semester’s worth of material in a single cache, **Gemini can draw connections across different concepts, creating more meaningful and comprehensive answers**. This not only enhances the learning experience but also **demonstrates the cost-effectiveness of caching—students can reuse shared caches, reducing token costs while maintaining high-quality assistance**.

---------------------

## **Broad Applicability**
While this notebook focuses on Stanford's CS229 course, this approach could be extended across K-12 education and other higher education disciplines — English, history, math, etc. As a supplemental learning assistant, Gemini offers an innovative solution for personalized, context-rich education support.

------------------

In [91]:
import datetime
import os
import time
from io import BytesIO

import google.generativeai as genai
import pandas as pd
from dotenv import load_dotenv
from google.generativeai import caching

## Dataset Description

For this competition, I used the 2022 Stanford CS229 (Machine Learning) course syllabus, which links to all of the lecture slides, the textbook, and the course lectures on YouTube.


* A) Lecture Transcripts: Using the YouTube API, I extracted the transcript of each lecture video and saved each as a text file: /kaggle/input/cs229-transcripts
* B) Lecture Notes: I downloaded all of the lecture notes (annotated and unannotated) and the course textbook, which are publicly available, and saved the PDFs: /kaggle/input/cs299-notes
* C) Midterm Review: link from the syllabus
* D) Course Textbook: link from the syllabus


References for sources:
1. YouTube playlist: https://www.youtube.com/playlist?list=PLoROMvodv4rNyWOpJg_Yh4NSqI4Z4vOYy
2. Course syllabus: https://docs.google.com/spreadsheets/d/18pHRegyB0XawIdbZbvkr8-jMfi_2ltHVYPjBEOim-6w/edit?pli=1&gid=0#gid=0

In [92]:
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
google_api = user_secrets.get_secret("GOOGLE_API_KEY")

In [93]:
genai.configure(api_key=google_api)

In [94]:
file_1 = genai.upload_file(path='/kaggle/input/cs229-transcripts/lecture_1_transcript.txt')
file_2 = genai.upload_file(path='/kaggle/input/cs229-transcripts/lecture_2_transcript.txt')
file_3 = genai.upload_file(path='/kaggle/input/cs229-transcripts/lecture_3_transcript.txt')
file_4 = genai.upload_file(path='/kaggle/input/cs229-transcripts/lecture_4_transcript.txt')
file_5 = genai.upload_file(path='/kaggle/input/cs229-transcripts/lecture_5_transcript.txt')
file_6 = genai.upload_file(path='/kaggle/input/cs229-transcripts/lecture_6_transcript.txt')
file_7 = genai.upload_file(path='/kaggle/input/cs229-transcripts/lecture_7_transcript.txt')
file_8 = genai.upload_file(path='/kaggle/input/cs229-transcripts/lecture_8_transcript.txt')
file_9 = genai.upload_file(path='/kaggle/input/cs299-notes/eval_slides.pdf')
file_10 = genai.upload_file(path='/kaggle/input/cs229-transcripts/lecture_9_transcript.txt')
file_11 = genai.upload_file(path='/kaggle/input/cs229-transcripts/lecture_10_transcript.txt')
file_12 = genai.upload_file(path='/kaggle/input/cs299-notes/bias_annotated.pdf')
file_13 = genai.upload_file(path='/kaggle/input/cs299-notes/ridge_annotated.pdf')
file_14 = genai.upload_file(path='/kaggle/input/cs299-notes/lasso_annotated.pdf')
file_15 = genai.upload_file(path='/kaggle/input/cs299-notes/midterm_review.pdf')
file_16 = genai.upload_file(path='/kaggle/input/cs299-notes/boosting.pdf')
file_17 = genai.upload_file(path='/kaggle/input/cs299-notes/decisiontrees_annotated.pdf')
file_18 = genai.upload_file(path='/kaggle/input/cs229-transcripts/lecture_12_transcript.txt')
file_19 = genai.upload_file(path='/kaggle/input/cs229-transcripts/lecture_13_transcript.txt')
file_20 = genai.upload_file(path='/kaggle/input/cs299-notes/em_annotated.pdf')
file_21 = genai.upload_file(path='/kaggle/input/cs299-notes/pca_annotated.pdf')
file_22 = genai.upload_file(path='/kaggle/input/cs229-transcripts/lecture_14_transcript.txt')
file_23 = genai.upload_file(path='/kaggle/input/cs229-transcripts/lecture_15_transcript.txt')
file_24 = genai.upload_file(path='/kaggle/input/cs229-transcripts/lecture_16_transcript.txt')
file_25 = genai.upload_file(path='/kaggle/input/cs299-notes/learning.pdf')
file_26 = genai.upload_file(path='/kaggle/input/cs229-transcripts/lecture_17_transcript.txt')
file_27 = genai.upload_file(path='/kaggle/input/cs229-transcripts/lecture_18_transcript.txt')
file_28 = genai.upload_file(path='/kaggle/input/cs229-transcripts/lecture_19_transcript.txt')
file_29 = genai.upload_file(path='/kaggle/input/cs299-notes/textbook.pdf')
file_30 = genai.upload_file(path='/kaggle/input/cs299-notes/fairness_annotated.pdf')
file_31 = genai.upload_file(path='/kaggle/input/cs299-notes/explainability_annotated.pdf')
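
The upload calls above follow two path patterns, so they could also be driven from a list. The following is a minimal sketch, assuming the same two dataset directories. `build_upload_paths` and `upload_all` are hypothetical helpers that are not part of the original notebook, and the uploader is passed in as an argument (in the notebook it would be `genai.upload_file`) so the path logic can be checked without network access. Only a small subset of files is listed for illustration.

```python
TRANSCRIPT_DIR = "/kaggle/input/cs229-transcripts"
NOTES_DIR = "/kaggle/input/cs299-notes"


def build_upload_paths(lecture_numbers, note_files):
    """Return transcript paths followed by note paths, in the order given."""
    transcripts = [
        f"{TRANSCRIPT_DIR}/lecture_{n}_transcript.txt" for n in lecture_numbers
    ]
    notes = [f"{NOTES_DIR}/{name}" for name in note_files]
    return transcripts + notes


def upload_all(paths, upload_fn):
    """Upload every path with upload_fn and return {path: uploaded handle}."""
    return {path: upload_fn(path) for path in paths}


# Illustrative subset only; the notebook uploads more files than this.
paths = build_upload_paths([1, 2, 3], ["eval_slides.pdf", "textbook.pdf"])
# In the notebook (assumed usage):
# files = upload_all(paths, lambda p: genai.upload_file(path=p))
```

Keeping the handles in a dictionary keyed by path makes it easier to assemble the `contents` list for each cache by name rather than by numbered variable.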

In [95]:
system_prompt = """
You are an expert tutor specializing in machine learning, with comprehensive knowledge of the Stanford CS229 "Introduction to Machine Learning" course. You have access to all relevant materials, including:
- Lecture notes for each session.
- Transcripts of all recorded lectures.
- The complete course textbook.
Your role is to guide the user through the CS229 course material by:
1. **Providing clear, detailed explanations** of key machine learning concepts and algorithms, from foundational topics like linear regression and classification to advanced areas such as support vector machines and unsupervised learning.
2. **Connecting course concepts**, explaining how different topics (e.g., gradient descent, regularization) relate and build upon each other across lectures.
3. **Summarizing lectures and sections**, highlighting major takeaways, essential equations, and conceptual insights.
4. **Supporting exam preparation**, identifying high-impact topics, common pitfalls, and suggesting areas for further review."""

In [96]:
def generate_gemini_response(cache, question):
    model = genai.GenerativeModel.from_cached_content(cached_content=cache)
    response = model.generate_content(question)
    return response.text
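
Requests against a large cache can fail transiently; the 503 errors noted under Future Improvements are one example. The following is a minimal sketch of a retry wrapper with exponential backoff. `call_with_retries` is a hypothetical helper, not part of the Gemini SDK, and for simplicity it retries on any exception, whereas a real notebook would probably catch only the SDK's server-error exception type.

```python
import time


def call_with_retries(fn, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt seconds and retry.

    Re-raises the last exception once max_attempts calls have failed.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Assumed usage with the helper defined in this notebook:
# answer = call_with_retries(
#     lambda: generate_gemini_response(first_half_cache, "What is GDA?")
# )
```

The `sleep` parameter exists only so the waiting behavior can be replaced in tests; the default is the real `time.sleep`.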

**NOTE**: The content across all course material exceeds the 1 million token limit, so I split the material into the first half and the second half of the course.
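
A split like this could also be derived from per-file token counts rather than chosen by hand. The following is a minimal sketch, assuming the counts are already known (for example from the model's `count_tokens` method). `split_by_token_budget` is a hypothetical helper, and the numbers in the example are made-up placeholders, not measurements of the course files.

```python
def split_by_token_budget(token_counts, budget):
    """Greedily pack files, in order, into consecutive groups within budget.

    token_counts: list of (name, tokens) pairs in course order.
    Raises ValueError if a single file is larger than the budget.
    """
    groups, current, used = [], [], 0
    for name, tokens in token_counts:
        if tokens > budget:
            raise ValueError(f"{name} alone exceeds the budget")
        if used + tokens > budget:
            # Close the current group and start a new one with this file.
            groups.append(current)
            current, used = [], 0
        current.append(name)
        used += tokens
    if current:
        groups.append(current)
    return groups


# Placeholder counts for illustration only.
example = [("lecture_1", 400), ("lecture_2", 500), ("lecture_3", 300)]
groups = split_by_token_budget(example, budget=1000)
```

Because this notebook places the textbook in both caches, its token count would need to be subtracted from the budget before packing the remaining files.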

In [97]:
first_half_cache = caching.CachedContent.create(
    model='models/gemini-1.5-flash-001',
    display_name='first half of CS229 content',
    system_instruction=(
    system_prompt),
    contents=[file_1, file_2, file_3, file_4, file_5, file_6, file_7, file_8, file_9, file_10, file_11, file_12, file_13, file_14, file_15, file_29],
    ttl=datetime.timedelta(minutes=15)
)

In [79]:
second_half_cache = caching.CachedContent.create(
    model='models/gemini-1.5-flash-001',
    display_name='second half of CS229 content',
    system_instruction=(
    system_prompt),
    contents=[file_16, file_17, file_18, file_19, file_20, file_21, file_22, file_23, file_24, file_25, file_26, file_27, file_28, file_30, file_31, file_29],
    ttl=datetime.timedelta(minutes=15)
)

## 1) Lecture-Specific Queries

Students usually have lecture-specific questions to reinforce understanding of key topics or catch up on missed lectures.

Purpose: Retrieve explanations, examples, and elaborations on concepts covered in a particular lecture.

In [42]:
response_1 = generate_gemini_response(second_half_cache, 'What are some key concepts covered in the KMeans lecture that are not covered in the notes? Be very specific in the points you generate.')
print(response_1)

The KMeans lecture contains several concepts and explanations not explicitly detailed in the provided notes. Here are some key points:

1. **The Squishiness of Unsupervised Learning:** The lecture emphasizes the inherent ambiguity and lack of a clearly defined "right answer" in unsupervised learning compared to supervised learning.  The notes primarily focus on the algorithm itself, but the lecture highlights the need for stronger assumptions and weaker guarantees in unsupervised settings.  The professor explicitly states that this used to be more disturbing but is now a common characteristic of much of modern AI.

2. **Initialization Strategies (K-means++):** The lecture discusses the importance of initialization in KMeans and introduces the K-means++ algorithm. The notes mention K-means++ as a method to improve the approximation ratio but don't detail its mechanism. The lecture explains that K-means++ uses a density estimation to strategically place initial centroids, reducing the li

In [44]:
response_2 = generate_gemini_response(first_half_cache, "I didn't fully understand the concept of Laplace Smoothing discussed in Lecture 6. Provide more context on this topic, and pull information from the textbook to supplement any explanations provided.")
print(response_2)

Let's delve deeper into Laplace smoothing.  The concept arises in the context of naive Bayes, specifically when dealing with discrete features, as encountered in the spam classification example in Lecture 6. The core issue is that if a particular word (e.g., "neurips") never appears in your training set for a specific class (e.g., spam emails), then the maximum likelihood estimate for the probability of that word appearing in that class will be zero.  This leads to a problem because if you then encounter this word in a test email, the probability of the email belonging to that class becomes zero, regardless of other features.  Laplace smoothing offers a solution to this problem.


**The Problem with Maximum Likelihood Estimates**

The naive Bayes classifier estimates probabilities based on observed frequencies in the training data.  This approach is known as *maximum likelihood estimation*.  The probability of feature *x<sub>j</sub>* given class *y* is estimated as:

P(x<sub>j</sub> | 

## 2) Comparing Lecture Notes to Textbook
Students can identify additional insights shared in the lecture that are not explicitly documented in the textbook.

In [45]:
response_3 = generate_gemini_response(first_half_cache, 'What is an example of something mentioned in the Neural Networks lecture that was not included in the textbook? Provide 2-3 specific examples')
print(response_3)

The CS229 lecture notes and textbook cover similar ground regarding neural networks, but the lectures often include more up-to-date information, practical advice, and discussion of recent trends not present in the textbook.  Here are a few examples of topics from the Neural Networks lecture (Lecture 8) that are not explicitly detailed in the Andrew Ng CS229 textbook:

1. **DALL-E 2 and similar large language models:** The lecture heavily emphasizes the impressive capabilities of large language models like DALL-E 2 for image generation and GPT-3 for text generation. While the textbook touches upon the general idea of neural networks and their applications, it doesn't specifically mention or detail these recent advancements in generative AI.  The lecture uses these examples to ground the discussion in the current state of the art, emphasizing the real-world impact of these models.

2. **Discussion of different variants of SGD (Stochastic Gradient Descent):**  The lecture delves into the 

In [80]:
response_4 = generate_gemini_response(second_half_cache, 'I am looking to further reinforce my understanding of decision trees. What are additional details covered about the different types of decision trees mentioned in the textbook that was not covered in the lecture?')
print(response_4)

The lecture focused on the core concepts of decision trees and their implementation using a greedy approach. It demonstrated the process of selecting features to split on by calculating the classification error for each split. However, the textbook delves deeper into different aspects and variants of decision trees, including:

* **ID3, C4.5, and CART:** The textbook discusses these popular algorithms in more detail, highlighting their differences in split criteria and handling of continuous attributes.  For instance, C4.5 uses the gain ratio to address the bias of ID3 towards features with many values, while CART employs the Gini index as a split criterion.

* **Handling of Missing Values:** The textbook explores strategies for dealing with missing values in decision tree construction. This involves techniques like surrogate splits, where alternative features are used for splitting when the primary feature has missing values, and weighting based on the availability of data.

* **Pruni

## 3) PowerPoint Slide Differences
Students can compare annotated slides to original slides to understand key takeaways and expanded explanations.

In [81]:
response_5 = generate_gemini_response(second_half_cache, 'What are the differences between the PCA original slides and the annotated content? Be specific in the type of differences you generate, do not include generic points like the annotated slides have more information')
print(response_5)

The annotated content in the slides has more information and commentary on several key aspects of PCA, compared to the original slides. Specifically:

**1.  Greater Emphasis on the Geometric Intuition**

* **Projection:** The annotated slides emphasize the geometric interpretation of PCA, explicitly showing how a point in a high-dimensional space is projected onto the subspace defined by the principal components. This is done through diagrams and explanations of how the dot product is used to find the closest point to the subspace. 
* **Variance Maximization:** The annotated slides provide a more visual understanding of how PCA maximizes the variance of the projected data. This is illustrated through diagrams comparing different directions (vectors) and how they capture the variance of the data points.

**2.  More Explicit Connections to Linear Algebra**

* **Eigenvectors:** The annotated content clarifies the relationship between PCA, covariance matrices, and eigenvectors. It explains

## 4) Study Guide and Practice Question Generation
Students can generate custom study questions and answers for exam preparation based on the cached content.

In [98]:
response_6a = generate_gemini_response(first_half_cache, 'Generate 3-5 questions using the concepts covered across lectures. Ask questions that require students to think critically about different concepts and that would be beneficial for them to review prior to the final exam. Provide the question followed by the answer.')
print(response_6a)

Here are a few questions that require students to think critically about different concepts and that would be beneficial for them to review prior to the final exam.

**Question 1:** When would you prefer to use a generative learning algorithm over a discriminative learning algorithm, and what are the potential drawbacks of each approach?

**Answer:** 

You would prefer a generative learning algorithm over a discriminative learning algorithm when you have strong prior knowledge about the distribution of your data, especially if this knowledge can be captured in a relatively simple probabilistic model. Generative algorithms, such as Gaussian Discriminant Analysis (GDA) and Naive Bayes, model the joint distribution of both the inputs (X) and the labels (Y), which allows them to leverage this prior knowledge to make more accurate predictions. 

However, the drawback is that if your assumptions about the data are incorrect, then a generative model can perform worse than a discriminative mod

In [88]:
response_6b = generate_gemini_response(second_half_cache, 'Generate 3-5 questions using the concepts covered across lectures. Ask questions that require students to think critically about different concepts and that would be beneficial for them to review prior to the final exam. Provide the question followed by the answer.')
print(response_6b)

Here are some questions that would be beneficial for students to review prior to the final exam. 

**Question 1:** Describe the difference between a supervised learning problem and an unsupervised learning problem. How do the goals differ? Give an example of each.

**Answer 1:**  Supervised learning involves having labeled data, where we know the target variable (y) for each input (x). The goal is to learn a model that can accurately predict y from x. 
Unsupervised learning, on the other hand, involves data without labels. The goal is to uncover hidden structure or patterns within the data, such as finding clusters, identifying key dimensions of variation, or learning representations of the data. 

**Question 2:** How does the concept of latent variables play a role in the Expectation Maximization (EM) algorithm? Give an example of a latent variable and explain how it is used within the EM algorithm.

**Answer 2:** Latent variables represent unobserved or hidden factors that influence 

In [69]:
response_7 = generate_gemini_response(first_half_cache, "Using the midterm review as a guide, generate 4-5 practice questions that cover concepts across lectures, with a particular focus on SVMs and Decision Trees. The questions should require critical thinking and help prepare for the final exam. Include questions that explore the relationships between these topics and other course concepts. Provide detailed answers for each question. Use the midterm review as a guide to determine the types of questions that are relevant.")
print(response_7)

Here are 4-5 practice questions that cover concepts across lectures, with a particular focus on SVMs and Decision Trees, that you can use to prepare for the final exam:

**Question 1:**

**Explain the concept of a “margin” in Support Vector Machines (SVMs). How does the margin relate to the idea of a “separating hyperplane” and the concept of “overfitting?”**

**Answer:**

In SVMs, the margin refers to the distance between the decision boundary (separating hyperplane) and the closest data points. A larger margin indicates that the classifier is more confident in its predictions, as data points are further away from the decision boundary. This concept is related to the idea of overfitting because a model with a large margin is less likely to overfit to the training data.

Here's why:

* **Overfitting:** Overfitting occurs when a model learns the training data too well, including noise and spurious patterns. This results in poor generalization to unseen data.
* **Large Margin:** A large 

## FUTURE IMPROVEMENTS
------------------------------

**1. Single Cache**: Collating both the first and second half of the semester's material into one cache would make it easier for students to ask questions without needing to know whether content was covered in the first or second half. I was unable to combine both halves into one cache because of 503 errors and the cache token limit.

**2. Include Additional Course Materials**: If the cache could hold more tokens, I would include historical course materials (additional midterm reviews and released midterm and final exams), which students could use to generate additional practice questions and continue reinforcing concepts ahead of exams.
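
Until a single cache is feasible, a small router could choose the cache on the student's behalf. The following is a minimal sketch using keyword matching. The `TOPIC_TO_HALF` mapping is an assumption inferred from the names of the note files assigned to each cache in this notebook, and `pick_halves` is a hypothetical helper; plain substring matching is crude (for example, "bridge" would match "ridge"), so this is an illustration rather than a robust classifier.

```python
# Assumed mapping, inferred from the note file names in each cache.
TOPIC_TO_HALF = {
    "ridge": "first",
    "lasso": "first",
    "boosting": "second",
    "decision tree": "second",
    "pca": "second",
    "fairness": "second",
    "explainability": "second",
}


def pick_halves(question):
    """Return the sorted course halves whose keywords appear in the question.

    Falls back to both halves when no keyword matches.
    """
    text = question.lower()
    halves = {half for topic, half in TOPIC_TO_HALF.items() if topic in text}
    return sorted(halves) if halves else ["first", "second"]


# Assumed usage with the caches created in this notebook:
# caches = {"first": first_half_cache, "second": second_half_cache}
# for half in pick_halves(question):
#     print(generate_gemini_response(caches[half], question))
```

Falling back to both halves keeps the behavior safe for questions the mapping does not recognize, at the cost of a second request.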


