### Overview

For the Kaggle challenge sponsored by Google, which highlights the capabilities of LLMs with very long context windows such as the Gemini models, I created a semester study guide tool. It lets users ask questions about the course material across several mediums: lecture videos, regular slides, annotated slides, the midterm review, and the textbook. Different caches can be created for different combinations of material: content for Midterm 1; content for all the lectures, so that students who miss a class can catch points that were covered in the lecture but may not be in the notes; content curated on certain topics; and so on. The material can be grouped in many ways. If certain students struggle with a set of topics, a cache can be created for just those topics, so the LLM stays topic-focused rather than acting as a broad study guide.

Having access to an entire semester's worth of material allows the LLM to draw connections between different concepts and give more comprehensive answers, which makes it a valuable resource in education. It also showcases the benefits of caching and long context windows: when many students query the same cached content, the shared tokens do not have to be paid for in full on every request.
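
To make the cost argument concrete, here is a rough sketch of the arithmetic. The prices are hypothetical placeholders in arbitrary units (not Gemini's actual rates), storage fees for keeping a cache alive are ignored, and the 500,798-token context size is one of the counts noted in the archive cell at the end of this notebook.

```python
def total_input_cost(context_tokens, question_tokens, n_questions,
                     price_per_token, cached_price_per_token=None):
    """Total input cost of asking n_questions against the same context.

    If cached_price_per_token is None, the full context is re-sent and billed
    at the normal rate on every request. Otherwise the context is billed at
    the (assumed cheaper) cached rate and only the question at the normal rate.
    """
    if cached_price_per_token is None:
        per_question = (context_tokens + question_tokens) * price_per_token
    else:
        per_question = (context_tokens * cached_price_per_token
                        + question_tokens * price_per_token)
    return n_questions * per_question


# Hypothetical prices in arbitrary units per token (assumptions, not real rates).
PRICE = 1.0
CACHED_PRICE = 0.25

without_cache = total_input_cost(500_798, 50, 100, PRICE)
with_cache = total_input_cost(500_798, 50, 100, PRICE, CACHED_PRICE)
print(f"savings factor: {without_cache / with_cache:.2f}x")
```

The more students share one cache, the more requests the reduced rate applies to, so the gap widens with usage.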

In [1]:
import os
import google.generativeai as genai
from dotenv import load_dotenv
from google.generativeai import caching
import datetime
import pandas as pd
import PyPDF2
from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import letter
from io import BytesIO
from helper_functions import *



In [2]:
load_dotenv('/Users/netraranga/Desktop/Projects/.env')
genai.configure(api_key=os.environ['GOOGLE_API_KEY'])

The following datasets are the syllabus and transcripts of the lectures from the 2022 Fall Playlist. Due to copyright restrictions, the raw lecture videos are not available, so I used the YouTube API to pull the lecture transcripts instead. With permission from a professor or university, the raw lecture videos could be used directly.

The functions below are used to write the transcripts to individual text files and to merge the regular lecture slides and annotated slides into one file. 

In [None]:
write_transcripts_to_files(youtube_df)  # youtube_df is loaded in the In [7] cell below

# Collect the topic prefix of every annotated slide deck (e.g. 'pca' from 'pca_annotated.pdf')
combined_annotated_slides = []
for file_path in os.listdir('/Users/netraranga/Desktop/Projects/google_gemini/docs'):
    if 'annotated' in file_path:
        combined_annotated_slides.append(file_path.split('_')[0])

# Merge the regular and annotated slides for each collected topic
output_files = merge_annotated_slides(combined_annotated_slides)
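
`write_transcripts_to_files` and `merge_annotated_slides` come from `helper_functions` and are not shown in this notebook. As a rough illustration only, the transcript step might look like the sketch below. It assumes each row supplies a lecture number and its transcript text, and it mirrors the `lecture_{i}_transcript.txt` naming used by the upload cells. The second function only plans which slide decks to pair, under the hypothetical assumption that decks are named `{topic}_slides.pdf` and `{topic}_annotated.pdf`; the PDF merge itself is left out. Both function names are made up for this sketch.

```python
from pathlib import Path


def write_transcripts(rows, out_dir):
    """Write one text file per lecture and return the paths that were written.

    rows: iterable of (lecture_number, transcript_text) pairs.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for lecture_number, text in rows:
        target = out_path / f"lecture_{lecture_number}_transcript.txt"
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written


def plan_slide_merges(file_names):
    """Map each output file name to the [original, annotated] pair to merge.

    Assumes (hypothetically) that decks are named '<topic>_slides.pdf' and
    '<topic>_annotated.pdf'. Topics missing either file are skipped.
    """
    names = set(file_names)
    plan = {}
    for name in sorted(names):
        if name.endswith("_annotated.pdf"):
            topic = name.split("_")[0]
            original = f"{topic}_slides.pdf"
            if original in names:
                plan[f"combined_{topic}_slides.pdf"] = [original, name]
    return plan
```

With the DataFrame from this notebook, the first function could be called as `write_transcripts(zip(youtube_df['Lecture'], youtube_df['Transcript']), 'docs/transcripts')`, where `Transcript` is a guessed column name.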

In [7]:
youtube_df = pd.read_csv('youtube_playlist_contents.csv') #pull in transcript content 
youtube_df['Lecture'] = youtube_df.index + 1

The function below uploads the transcript file for each requested lecture so that the uploaded files can be placed in a context cache.

In [9]:
def create_context_cache(list_lectures):
    """Upload the transcript of each lecture number in list_lectures and return the uploaded files."""
    list_files = []
    for i in list_lectures:
        lecture_file = f'/Users/netraranga/Desktop/Projects/google_gemini/docs/transcripts/lecture_{i}_transcript.txt'
        # Label each upload with its lecture number
        uploaded_file = genai.upload_file(path=lecture_file, display_name=f'lecture_{i}')
        list_files.append(uploaded_file)
    return list_files

In [10]:
cache_files = create_context_cache([1,2,3,4,5])

In [4]:
lecture_1 = '/Users/netraranga/Desktop/Projects/google_gemini/docs/transcripts/lecture_1_transcript.txt'
file_1 = genai.upload_file(path=lecture_1)

In [3]:
###Order of files - use only lectures and annotated slides
lecture_1 = '/Users/netraranga/Desktop/Projects/google_gemini/docs/transcripts/lecture_1_transcript.txt'
file_1 = genai.upload_file(path=lecture_1)

lecture_2 = '/Users/netraranga/Desktop/Projects/google_gemini/docs/transcripts/lecture_2_transcript.txt'
file_2 = genai.upload_file(path=lecture_2)

# lin_alg_notes = '/Users/netraranga/Desktop/Projects/google_gemini/docs/linalg_notes.pdf'
# file_3 = genai.upload_file(path=lin_alg_notes)

# lin_alg_slides = '/Users/netraranga/Desktop/Projects/google_gemini/docs/linalg_slides.pdf'
# file_3_1 = genai.upload_file(path=lin_alg_slides)

lecture_3 = '/Users/netraranga/Desktop/Projects/google_gemini/docs/transcripts/lecture_3_transcript.txt'
file_4 = genai.upload_file(path=lecture_3)

lecture_4 = '/Users/netraranga/Desktop/Projects/google_gemini/docs/transcripts/lecture_4_transcript.txt'
file_5 = genai.upload_file(path=lecture_4)

# probs_notes = '/Users/netraranga/Desktop/Projects/google_gemini/docs/prob_notes.pdf'
# file_6 = genai.upload_file(path=probs_notes)

# probs_slides = '/Users/netraranga/Desktop/Projects/google_gemini/docs/prob_slides.pdf'
# file_6_1 = genai.upload_file(path=probs_slides)

lecture_5 = '/Users/netraranga/Desktop/Projects/google_gemini/docs/transcripts/lecture_5_transcript.txt'
file_7 = genai.upload_file(path=lecture_5)

lecture_6 = '/Users/netraranga/Desktop/Projects/google_gemini/docs/transcripts/lecture_6_transcript.txt'
file_8 = genai.upload_file(path=lecture_6)

# numpy_slides = '/Users/netraranga/Desktop/Projects/google_gemini/docs/numpy_slides.pdf'
# file_9 = genai.upload_file(path=numpy_slides)

lecture_7 = '/Users/netraranga/Desktop/Projects/google_gemini/docs/transcripts/lecture_7_transcript.txt'
file_10 = genai.upload_file(path=lecture_7)

lecture_8 = '/Users/netraranga/Desktop/Projects/google_gemini/docs/transcripts/lecture_8_transcript.txt'
file_11 = genai.upload_file(path=lecture_8)

eval_slides = '/Users/netraranga/Desktop/Projects/google_gemini/docs/original_pdfs/eval_slides.pdf'
file_12 = genai.upload_file(path=eval_slides)

lecture_9 = '/Users/netraranga/Desktop/Projects/google_gemini/docs/transcripts/lecture_9_transcript.txt'
file_13 = genai.upload_file(path=lecture_9)

lecture_10 = '/Users/netraranga/Desktop/Projects/google_gemini/docs/transcripts/lecture_10_transcript.txt'
file_14 = genai.upload_file(path=lecture_10)

bias_slides = '/Users/netraranga/Desktop/Projects/google_gemini/docs/original_pdfs/bias_annotated.pdf'
file_15 = genai.upload_file(path=bias_slides)

ridge_slides = '/Users/netraranga/Desktop/Projects/google_gemini/docs/original_pdfs/ridge_annotated.pdf'
file_16 = genai.upload_file(path=ridge_slides)

lasso_slides = '/Users/netraranga/Desktop/Projects/google_gemini/docs/original_pdfs/lasso_annotated.pdf'
file_17 = genai.upload_file(path=lasso_slides)

midterm_review = '/Users/netraranga/Desktop/Projects/google_gemini/docs/original_pdfs/midterm_review.pdf'
file_18 = genai.upload_file(path=midterm_review)

lecture_11 = '/Users/netraranga/Desktop/Projects/google_gemini/docs/transcripts/lecture_11_transcript.txt'
file_19 = genai.upload_file(path=lecture_11)

boosting_slides = '/Users/netraranga/Desktop/Projects/google_gemini/docs/original_pdfs/boosting.pdf'
file_20 = genai.upload_file(path=boosting_slides)

decision_trees_slides = '/Users/netraranga/Desktop/Projects/google_gemini/docs/original_pdfs/decisiontrees_annotated.pdf'
file_21 = genai.upload_file(path=decision_trees_slides)

# decision_trees_overfitting = '/Users/netraranga/Desktop/Projects/google_gemini/docs/decisiontrees_overfitting.pdf'
# file_22 = genai.upload_file(path=decision_trees_overfitting)

lecture_12 = '/Users/netraranga/Desktop/Projects/google_gemini/docs/transcripts/lecture_12_transcript.txt'
file_23 = genai.upload_file(path=lecture_12)

lecture_13 = '/Users/netraranga/Desktop/Projects/google_gemini/docs/transcripts/lecture_13_transcript.txt'
file_24 = genai.upload_file(path=lecture_13)

kmeans_slides = '/Users/netraranga/Desktop/Projects/google_gemini/docs/original_pdfs/kmeans_annotated.pdf'
file_25 = genai.upload_file(path=kmeans_slides)

em_slides = '/Users/netraranga/Desktop/Projects/google_gemini/docs/original_pdfs/em_annotated.pdf'
file_26 = genai.upload_file(path=em_slides)

pca_slides = '/Users/netraranga/Desktop/Projects/google_gemini/docs/original_pdfs/pca_annotated.pdf'
file_27 = genai.upload_file(path=pca_slides)

lecture_14 = '/Users/netraranga/Desktop/Projects/google_gemini/docs/transcripts/lecture_14_transcript.txt'
file_28 = genai.upload_file(path=lecture_14)

lecture_15 = '/Users/netraranga/Desktop/Projects/google_gemini/docs/transcripts/lecture_15_transcript.txt'
file_29 = genai.upload_file(path=lecture_15)

# ml_advice = '/Users/netraranga/Desktop/Projects/google_gemini/docs/ml_advice.pdf'
# file_30 = genai.upload_file(path=ml_advice)

lecture_16 = '/Users/netraranga/Desktop/Projects/google_gemini/docs/transcripts/lecture_16_transcript.txt'
file_31 = genai.upload_file(path=lecture_16)

learning_slides = '/Users/netraranga/Desktop/Projects/google_gemini/docs/original_pdfs/learning.pdf'
file_32 = genai.upload_file(path=learning_slides)

lecture_17 = '/Users/netraranga/Desktop/Projects/google_gemini/docs/transcripts/lecture_17_transcript.txt'
file_33 = genai.upload_file(path=lecture_17)

lecture_18 = '/Users/netraranga/Desktop/Projects/google_gemini/docs/transcripts/lecture_18_transcript.txt'
file_34 = genai.upload_file(path=lecture_18)

lecture_19 = '/Users/netraranga/Desktop/Projects/google_gemini/docs/transcripts/lecture_19_transcript.txt'
file_35 = genai.upload_file(path=lecture_19)

fairness_slides = '/Users/netraranga/Desktop/Projects/google_gemini/docs/original_pdfs/fairness_annotated.pdf'
file_36 = genai.upload_file(path=fairness_slides)

privacy_slides = '/Users/netraranga/Desktop/Projects/google_gemini/docs/original_pdfs/privacy_annotated.pdf'
file_37 = genai.upload_file(path=privacy_slides)

explanation_slides = '/Users/netraranga/Desktop/Projects/google_gemini/docs/original_pdfs/explainability_annotated.pdf'
file_38 = genai.upload_file(path=explanation_slides)

textbook = '/Users/netraranga/Desktop/Projects/google_gemini/docs/textbook.pdf'
file_39 = genai.upload_file(path=textbook)
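
The cell above repeats the same two lines for every file. One possible way to keep the ordering explicit while removing the repetition is to describe the materials as an ordered manifest and derive the paths from it. The sketch below only builds the paths; `build_upload_paths` and the manifest format are hypothetical, and the directory layout (`transcripts/` and `original_pdfs/` under the docs folder) is taken from the paths above.

```python
from pathlib import Path


def build_upload_paths(docs_dir, manifest):
    """Turn an ordered manifest into an ordered list of file paths.

    Each manifest entry is either an int (a lecture number, resolved to its
    transcript file) or a str (a PDF file name inside original_pdfs/).
    """
    docs = Path(docs_dir)
    paths = []
    for entry in manifest:
        if isinstance(entry, int):
            paths.append(docs / "transcripts" / f"lecture_{entry}_transcript.txt")
        else:
            paths.append(docs / "original_pdfs" / entry)
    return paths


# Lectures 7 to 10 with the slide decks interleaved in the same order as above.
manifest = [7, 8, "eval_slides.pdf", 9, 10, "bias_annotated.pdf"]
for path in build_upload_paths("docs", manifest):
    print(path)
```

Each resulting path could then be passed to `genai.upload_file(path=...)` in a loop, collecting the returned files in a list instead of numbered variables.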

In [5]:
### System Prompt
system_prompt = """You are an expert tutor specializing in machine learning, with comprehensive knowledge of the Stanford CS229 "Introduction to Machine Learning" course. You have access to all relevant materials, including:
- Annotated and regular lecture notes for each session.
- Transcripts of all recorded lectures.
- The complete course textbook.
Your role is to guide the user through the CS229 course material by:
1. **Providing clear, detailed explanations** of key machine learning concepts and algorithms, from foundational topics like linear regression and classification to advanced areas such as support vector machines and unsupervised learning.
2. **Connecting course concepts**, explaining how different topics (e.g., gradient descent, regularization) relate and build upon each other across lectures.
3. **Summarizing lectures and sections**, highlighting major takeaways, essential equations, and conceptual insights.
4. **Supporting exam preparation**, identifying high-impact topics, common pitfalls, and suggesting areas for further review."""
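
`create_cache` and `gemini_response` are imported from `helper_functions`, so their bodies do not appear in this notebook. The sketch below is one plausible shape for them, not the author's implementation. It relies on `caching.CachedContent.create` and `genai.GenerativeModel.from_cached_content` from the `google-generativeai` SDK; the model name, the one-hour TTL, and the `build_cache_kwargs` helper are assumptions. Since the notebook passes both a single file and a list of files as `contents`, the helper normalizes either into a list.

```python
import datetime


def build_cache_kwargs(name, contents, system_instruction=None,
                       model_name="models/gemini-1.5-flash-001", ttl_hours=1):
    """Assemble the keyword arguments for caching.CachedContent.create.

    model_name and ttl_hours are assumed defaults, not values from the notebook.
    """
    if not isinstance(contents, (list, tuple)):
        contents = [contents]  # accept a single uploaded file as well as a list
    return {
        "model": model_name,
        "display_name": name,
        "system_instruction": system_instruction,
        "contents": list(contents),
        "ttl": datetime.timedelta(hours=ttl_hours),
    }


def create_cache(name, contents, system_instruction=None, **options):
    """Create a context cache from uploaded files (needs a configured API key)."""
    # Imported here so build_cache_kwargs stays usable without the SDK installed.
    from google.generativeai import caching
    return caching.CachedContent.create(
        **build_cache_kwargs(name, contents, system_instruction, **options)
    )


def gemini_response(cache, prompt):
    """Ask a question against a previously created cache."""
    import google.generativeai as genai
    model = genai.GenerativeModel.from_cached_content(cached_content=cache)
    return model.generate_content(prompt)
```

In the notebook's calls, the `system_prompt` defined in the cell above would be the natural value for `system_instruction`.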

In [11]:
textbook_cache = create_cache(name='first_5_lectures', contents=cache_files)
response_1 = gemini_response(textbook_cache, 'Give me an interesting fact from each lecture. Provide the output in the following format: Lecture 1: Fact 1. Lecture 2: Fact 2. Lecture 3: Fact 3. Lecture 4: Fact 4. Lecture 5: Fact 5.')
print(response_1.text)

Lecture 1: The phrase "machine learning" was first introduced in 1959 by Arthur Samuel, who defined it as giving computers the ability to learn without being explicitly programmed.

Lecture 2:  Zillow attempted to use machine learning to predict house prices and flip houses, ultimately losing a significant amount of money, while Blackstone successfully profited from a similar venture.  This highlights the challenges and potential rewards in applying machine learning to real-world problems.

Lecture 3:  Newton's method, while incredibly fast for converging to a solution, is computationally expensive for high-dimensional problems due to the need to compute the Hessian matrix, making it less practical for many modern machine learning applications.

Lecture 4: Many common probability distributions (Bernoulli, Gaussian, Poisson, Gamma, etc.) belong to the exponential family, a fact that simplifies inference and learning through a unified framework.

Lecture 5: Even with simplifying assumpti

### Regular Queries from certain lectures

In [8]:
textbook_cache = create_cache(name='course_overview', contents=file_1)
response_1 = gemini_response(textbook_cache, 'Give me an overview of what this course is about, the main topics that will be covered over the semester, and the homework assignments and exams that will be given.')
print(response_1.text)

This Stanford CS229 course, "Introduction to Machine Learning," provides a foundational understanding of the field.  The instructors emphasize that while introductory, the course is mathematically intensive and requires a solid background in probability, linear algebra, and programming (especially Python and NumPy).  They strongly recommend students have experience with at least two of these three areas.

**Main Topics:**

The course covers a broad range of machine learning topics, organized roughly into these areas:

1. **Supervised Learning:** This forms a large portion of the course.  It focuses on:
    * **Regression:** Predicting continuous values (e.g., house prices).  Linear and polynomial regression are key examples.
    * **Classification:** Predicting categorical values (e.g., classifying images, identifying spam).  The course will cover various classification algorithms.  The concept of features (inputs) and labels (outputs) is central here.  High-dimensional features and fe

In [27]:
textbook = '/Users/netraranga/Desktop/Projects/google_gemini/docs/textbook.pdf'
file_39 = genai.upload_file(path=textbook)
textbook_cache = create_cache(name='textbook', contents=file_39)
response_1 = gemini_response(textbook_cache, 'Give me an overview of L1 and L2 regularization and how they are used in machine learning.')

TypeError: create_cache() got an unexpected keyword argument 'contents'

In [38]:
lecture_cache = create_cache(name='lecture_notes', contents=[file_1, file_2, file_4, file_5, file_7, file_8, file_10, file_11, file_12, file_13, file_14, file_15, file_16, file_17, file_18, file_19, file_20, file_21, file_23, file_24, file_25, file_26, file_27, file_28, file_29, file_31, file_32, file_33, file_34, file_35, file_39])

In [39]:
response_2 = gemini_response(lecture_cache, 'What is an example of something mentioned in the Neural Networks lecture that was not included in the textbook? Provide 2-3 specific examples')

In [41]:
print(response_2.text)

The Neural Networks lectures in CS229 contain several concepts and details not explicitly covered in the textbook. Here are a few examples:

1. **Different Variants of Gradient Descent in Deep Learning:** While the textbook covers gradient descent, the lectures delve into the specifics of stochastic gradient descent (SGD) and mini-batch gradient descent, including practical considerations like the choice of batch size (often determined empirically by the maximum batch size your GPU memory can handle) and the fact that smaller batch sizes often lead to better performance but at the cost of increased noise.  The textbook doesn't give the same level of practical algorithmic detail.

2. **ReLU and Other Activation Functions:** Although the textbook might mention activation functions in general, the lectures focus specifically on the ReLU (Rectified Linear Unit) activation function, its properties (non-linearity), its use in neural networks, and its widespread popularity in deep learning.  

In [42]:
response_3 = gemini_response(lecture_cache, 'What are some key concepts covered in the KMeans lecture that are not covered in the notes? Be very specific in the points you generate.')
print(response_3.text)

The CS229 lecture on KMeans includes several important points and nuanced discussions not explicitly detailed in the accompanying notes.  Here are some key concepts that are discussed in the lecture but receive less attention or are absent from the notes:


1. **The Squishiness of Unsupervised Learning and the Role of Assumptions:** The lecture emphasizes the inherent ambiguity and difficulty in unsupervised learning compared to supervised learning.  It highlights that unsupervised methods necessitate stronger assumptions about the underlying data structure (e.g., the existence of K clusters) and accept weaker guarantees (e.g., convergence to a local rather than a global optimum). This philosophical point about the trade-off between stronger assumptions and weaker guarantees isn't explicitly stated in the notes, which focus more on the algorithmic details.

2. **Initialization Strategies and KMeans++:** The lecture introduces the importance of initialization in KMeans. While the notes 

In [43]:

pca_comparison = genai.upload_file(path='/Users/netraranga/Desktop/Projects/google_gemini/docs/consolidated/combined_pca_slides.pdf')
pca_cache = create_cache(name='pca', contents=pca_comparison)
response_pca = gemini_response(pca_cache, 'What are the differences between the PCA original slides and the annotated slides? Be specific in the type of differences you generate, do not include generic points like the annotated slides have more information')

In [44]:
print(response_pca.text)

The difference between the original and annotated PCA slides lies primarily in the addition of handwritten notes and diagrams on the annotated version.  These additions clarify concepts and calculations presented in the original slides. Let's break down the specific types of annotations:

1. **Elaboration on Mathematical Concepts:** The annotated slides contain additional mathematical steps, derivations, and explanations of equations related to reconstruction error, covariance matrices, and the relationship between PCA and eigenvectors.  For example, the derivation of the reconstruction error is expanded, showing intermediate steps and clarifying how minimizing this error leads to the selection of eigenvectors.

2. **Visual Aids and Interpretations:**  Handwritten diagrams and annotations are added to existing figures to illustrate the projection process, the reconstruction error visually, and the choice of optimal projection vectors.  These additions offer a more intuitive understandi

# Archive

In [37]:
for c in caching.CachedContent.list():
  print(c)

In [36]:
# for c in caching.CachedContent.list():
#     print(c)
# Slides 1 to 15 are 166273 tokens
# Slides 1 to 38 are 500798 tokens

for c in caching.CachedContent.list():
    c.delete()

In [None]:
### TODO
# - Combine all of the slide contents into one file so it passes the cache minimum size limit
# - Determine with ChatGPT what good questions are: study guides on certain lectures and concepts
# - Identify the difference between annotated notes and regular notes
# - Create a study guide that is grounded in the lecture and pulls additional key concepts from the notes
# - Generate some Python questions for certain lectures for the application piece
# - Watch a certain video and see if the LLM can retrieve the specific fact or instance referenced in the video
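
For the first item, one approach is to check token counts before creating a cache and to batch small files together until each batch clears the minimum. The sketch below assumes a 32,768-token minimum (the figure documented for Gemini 1.5 context caching; treat it as an assumption and check the current docs). It takes token counts as plain integers, which could come from `model.count_tokens` in the SDK. `batch_for_cache` is a hypothetical helper and the counts in the example are invented.

```python
MIN_CACHE_TOKENS = 32_768  # assumed minimum; verify against the current Gemini docs


def batch_for_cache(token_counts, min_tokens=MIN_CACHE_TOKENS):
    """Greedily group files, in order, into batches of at least min_tokens.

    token_counts: list of (file_name, n_tokens) pairs.
    Returns (batches, leftover): batches is a list of lists of file names whose
    combined token count reaches min_tokens; leftover holds trailing files
    that did not reach the minimum.
    """
    batches, current, running = [], [], 0
    for name, n_tokens in token_counts:
        current.append(name)
        running += n_tokens
        if running >= min_tokens:
            batches.append(current)
            current, running = [], 0
    return batches, current


# Invented token counts, for illustration only.
counts = [("pca_annotated.pdf", 20_000), ("em_annotated.pdf", 15_000),
          ("kmeans_annotated.pdf", 9_000)]
batches, leftover = batch_for_cache(counts)
print(batches, leftover)
```

Any leftover files could be appended to the last batch or merged into a larger cache such as the full lecture set.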