
# Task 1

For this task, I selected the XSum (Extreme Summarization) dataset, which is a widely-used benchmark for abstractive text summarization. It consists of BBC news articles paired with professionally written single-sentence summaries. I chose this dataset because it focuses on generating highly concise summaries that still capture the core meaning of the article, making it a great challenge for evaluating transformer-based models like BART. The dataset contains over 200,000 document-summary pairs across a wide range of topics including politics, science, and sports. I performed a 90/10 split on the training set to create my own train-test sets for fine-tuning and evaluation.

In [None]:
!pip install transformers datasets evaluate rouge-score nltk



In [None]:
import nltk
nltk.download("punkt")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
from datasets import load_dataset

# Load and preview the dataset
dataset = load_dataset("xsum")
print(dataset)

# Hold out 10% of the training split as a test set (fixed seed for reproducibility)
split_dataset = dataset["train"].train_test_split(test_size=0.1, seed=42)
train_dataset = split_dataset["train"]
test_dataset = split_dataset["test"]



In [None]:
from transformers import AutoTokenizer

# Load BART tokenizer
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large")

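# Tokenize documents (inputs) and summaries (labels)
def preprocess(example):
    model_inputs = tokenizer(
        example["document"], max_length=1024, truncation=True, padding="max_length"
    )
    # text_target replaces the deprecated as_target_tokenizer() context manager
    labels = tokenizer(
        text_target=example["summary"], max_length=128, truncation=True, padding="max_length"
    )
    # Replace padding ids with -100 so padded positions are ignored by the loss
    # (this lowers the share of "easy" pad tokens, so loss values will differ from a run without it)
    model_inputs["labels"] = [
        [tok if tok != tokenizer.pad_token_id else -100 for tok in seq]
        for seq in labels["input_ids"]
    ]
    return model_inputs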

# Apply preprocessing
train_tokenized = train_dataset.map(preprocess, batched=True)
test_tokenized = test_dataset.map(preprocess, batched=True)

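Since this is a homework exercise, I trained on a small subset of the data (200 training and 40 validation examples) to keep the run fast. The model itself is still the full `facebook/bart-large` checkpoint.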

In [None]:
from transformers import BartForConditionalGeneration, Trainer, TrainingArguments
import os
os.environ["WANDB_DISABLED"] = "true"

# Use a small but still meaningful subset of the data
train_small = train_tokenized.select(range(200))  # 200 training examples
test_small = test_tokenized.select(range(40))

# Load the pre-trained BART model
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=5,  # more epochs
    weight_decay=0.01,
    save_total_limit=1,
    logging_steps=5,
    report_to="none",  # disables wandb
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_small,
    eval_dataset=test_small,
)

trainer.train()



Epoch,Training Loss,Validation Loss
1,4.5441,3.659192
2,1.1556,0.844631
3,0.368,0.47527
4,0.24,0.474259
5,0.1657,0.515716




TrainOutput(global_step=250, training_loss=2.201045848608017, metrics={'train_runtime': 620.126, 'train_samples_per_second': 1.613, 'train_steps_per_second': 0.403, 'total_flos': 2167104602112000.0, 'train_loss': 2.201045848608017, 'epoch': 5.0})

In [None]:
import gc
import torch
import evaluate
from nltk.tokenize import sent_tokenize

# Clear memory
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()

# Load evaluation metrics
rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")

# Use fewer test samples to reduce memory
num_samples = 20  # Feel free to increase slightly if RAM allows
test_docs = test_dataset["document"][:num_samples]
test_refs = test_dataset["summary"][:num_samples]

# Tokenize the batch and move it to the model's device
model.eval()
enc = tokenizer(
    test_docs, return_tensors="pt", padding=True, truncation=True, max_length=1024
).to(model.device)

# Generate summaries; pass the attention mask so padded positions are ignored
with torch.no_grad():
    outputs = model.generate(
        enc["input_ids"], attention_mask=enc["attention_mask"], max_new_tokens=128
    )

# Decode predictions
preds = tokenizer.batch_decode(outputs, skip_special_tokens=True)

# Evaluate
rouge_results = rouge.compute(predictions=preds, references=test_refs)
bleu_results = bleu.compute(
    predictions=preds,
    references=[[ref] for ref in test_refs]  # BLEU wants a list of list of strings
)

print("ROUGE:", rouge_results)
print("BLEU:", bleu_results)

ROUGE: {'rouge1': np.float64(0.3615770571146344), 'rouge2': np.float64(0.12904606450038364), 'rougeL': np.float64(0.2927826170846352), 'rougeLsum': np.float64(0.29063065158274426)}
BLEU: {'bleu': 0.08186612003823322, 'precisions': [0.3859223300970874, 0.12244897959183673, 0.053763440860215055, 0.02556818181818182], 'brevity_penalty': 0.9118926449486352, 'length_ratio': 0.9155555555555556, 'translation_length': 412, 'reference_length': 450}


After fine-tuning `facebook/bart-large` on a 200-example subset of XSum for five epochs, I evaluated it on 20 held-out examples using ROUGE and BLEU. The model achieved a ROUGE-1 score of approximately 0.36, a ROUGE-2 score of about 0.13, a ROUGE-L score of about 0.29, and a BLEU score of around 0.08. These results are modest but expected given the limited training data and compute constraints. I noticed that increasing the number of epochs improved the model's ability to produce relevant summaries, especially in terms of ROUGE scores, which suggests it was learning better content overlap with the reference summaries. The validation loss, however, flattened after epoch 3 and rose slightly at epoch 5 (from 0.474 to 0.516), so training longer on this subset would likely overfit. The low BLEU score indicates that the model still struggled with precise phrasing and exact n-gram matches, which is not surprising for XSum because its reference summaries are highly abstractive.

The choice of BART as the underlying large language model had a significant impact. Because it is an encoder-decoder model pretrained with a denoising sequence-to-sequence objective, it adapts well to summarization even with minimal fine-tuning. A decoder-only model such as GPT-2 would likely have needed more customization to reach comparable performance. Overall, BART proved to be a good fit for this task, and hyperparameters such as learning rate and epoch count played an important role in balancing training time and summary quality.
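
To make the ROUGE-1 number more concrete, here is a minimal sketch of what the metric measures: unigram overlap between a prediction and a reference, combined into an F1 score. The function name `rouge1_f1` and the whitespace tokenization are simplifying assumptions for illustration; the `evaluate` ROUGE metric used above applies its own tokenization, so its scores will not match this sketch exactly.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 between one prediction and one reference (simplified)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: a word counts at most as often as it appears in both texts
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)  # share of predicted words that are in the reference
    recall = overlap / len(ref_tokens)      # share of reference words that were predicted
    return 2 * precision * recall / (precision + recall)


# Four of six words match in each direction, so precision = recall = 4/6
score = rouge1_f1("the cat sat on the mat", "a cat sat on a mat")
print(round(score, 3))  # 0.667
```

A score of 0.36 therefore means that, on average, roughly a third of the words overlap between generated and reference summaries once precision and recall are balanced.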

# Task 2

One real-world application that can be effectively formulated as a Markov Decision Process (MDP) is autonomous driving. In this context, a self-driving car must continually make decisions in a dynamic environment while aiming to navigate safely and efficiently to its destination. MDPs are well-suited to model this problem when the state captures everything relevant about the vehicle and its surroundings: the next state then depends only on the current state and the chosen action, not on the full history of past states and decisions, which is exactly the Markov property.

The state space in autonomous driving includes a rich set of observations that describe the current environment around the vehicle. This could include the car's current position, velocity, lane, the positions and velocities of nearby vehicles, traffic signals, road conditions, and even the predicted intentions of other drivers. Essentially, the state encapsulates all relevant information the car needs to make a decision at any given moment.

The action space consists of all possible actions the self-driving car can take. These might include discrete actions like accelerating, braking, turning left or right, changing lanes, or more continuous actions like adjusting the steering angle and speed incrementally. The granularity of the action space can vary depending on the level of control modeled (high-level route planning versus low-level motor control).

The transition model defines how the environment evolves in response to an action taken by the vehicle. For example, if the car chooses to accelerate, the next state will reflect a new position and speed, potentially bringing it closer to other vehicles or traffic signals. This transition is probabilistic because the environment contains other agents (e.g., drivers, pedestrians) whose behavior can’t be perfectly predicted. Still, the model captures the likelihood of various outcomes based on the current state and chosen action.

Finally, the reward function guides the vehicle toward its goal by assigning feedback to its actions. Positive rewards might be given for maintaining a safe speed, staying in the correct lane, and moving toward the destination efficiently. Negative rewards would be associated with undesirable outcomes like collisions, sudden braking, lane departures, or failing to follow traffic rules. The goal of the reinforcement learning agent (the car) is to learn a policy that maximizes cumulative reward, leading to safe and optimal driving behavior.

In summary, autonomous driving naturally fits into the MDP framework, with clear definitions of states, actions, transitions, and rewards. This makes it an ideal candidate for reinforcement learning approaches.
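
The four components above can be sketched as a toy MDP solved with value iteration. This is a deliberately tiny illustration, not a realistic driving model: the two non-terminal states (`safe_gap`, `close_gap`), the two actions, and every probability and reward below are made-up assumptions chosen only to show how states, actions, transitions, and rewards fit together.

```python
GAMMA = 0.95  # discount factor (assumed)
TERMINAL = {"collision", "arrived"}

# TRANSITIONS[state][action] -> list of (probability, next_state, reward)
# All numbers are hypothetical and chosen for illustration only.
TRANSITIONS = {
    "safe_gap": {
        "accelerate": [(0.7, "arrived", 10.0), (0.3, "close_gap", -1.0)],
        "brake": [(1.0, "safe_gap", -1.0)],
    },
    "close_gap": {
        "accelerate": [(0.5, "collision", -100.0), (0.5, "arrived", 10.0)],
        "brake": [(0.9, "safe_gap", -1.0), (0.1, "collision", -100.0)],
    },
}


def q_value(state, action, values):
    """Expected return of taking `action` in `state` under the current value estimates."""
    return sum(p * (r + GAMMA * values[s2]) for p, s2, r in TRANSITIONS[state][action])


def value_iteration(tol=1e-8):
    """Repeat Bellman optimality backups until the values stop changing."""
    values = {s: 0.0 for s in list(TRANSITIONS) + list(TERMINAL)}
    while True:
        delta = 0.0
        for s in TRANSITIONS:  # terminal states keep value 0
            best = max(q_value(s, a, values) for a in TRANSITIONS[s])
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            break
    # Greedy policy with respect to the converged values
    policy = {s: max(TRANSITIONS[s], key=lambda a: q_value(s, a, values)) for s in TRANSITIONS}
    return values, policy


values, policy = value_iteration()
print(policy)  # {'safe_gap': 'accelerate', 'close_gap': 'brake'}
```

With these assumed numbers, the large collision penalty makes braking optimal when the gap is small, while accelerating is preferred when the gap is safe. In a real system the state and action spaces are far larger and the transition model is unknown, which is why learning-based methods are used instead of exact value iteration.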

# Task 3

One domain where reinforcement learning (RL) is showing significant promise is healthcare, particularly in optimizing treatment strategies for chronic diseases. A compelling problem in this area is personalized treatment planning for patients with diabetes. Managing diabetes involves making frequent and personalized decisions about medication dosages, diet, physical activity, and glucose monitoring. The complexity and individual variability in patient response make this a challenging task for traditional rule-based systems or static protocols. Reinforcement learning can be used to model the treatment process as a sequential decision-making problem, where the agent learns to recommend personalized actions based on the patient's evolving condition.

In this context, the patient's state can be defined by various clinical features, including current blood glucose level, insulin dose history, meal timing, physical activity, and other biometric or lifestyle data. The actions would correspond to different treatment recommendations, such as insulin dosage adjustments or meal timing suggestions. The transition model captures how the patient's condition changes in response to these actions — for example, how a certain insulin dose impacts future blood glucose levels. The reward function encourages the agent to maintain the patient’s glucose levels within a healthy range while minimizing side effects like hypoglycemia.
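
As a rough sketch of the reward function described above, the snippet below gives a positive reward while glucose stays inside a target range and a penalty that grows with the distance from that range. The function name, the 70 to 180 mg/dL range used as default, and the heavier weighting of low readings are illustrative assumptions, not a clinically validated design.

```python
def glucose_reward(glucose_mg_dl: float, low: float = 70.0, high: float = 180.0) -> float:
    """Hypothetical reward: +1 inside the target range, a growing penalty outside it."""
    if glucose_mg_dl < low:
        # Assumption: low glucose (hypoglycemia) is penalized twice as steeply as high glucose
        return -2.0 * (low - glucose_mg_dl) / low
    if glucose_mg_dl > high:
        return -(glucose_mg_dl - high) / high
    return 1.0


for reading in (50, 100, 200):
    print(reading, round(glucose_reward(reading), 3))
```

Shaping the penalty asymmetrically is one way to encode the idea that hypoglycemia is the side effect the agent should avoid most strongly.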

An open-source project that addresses this problem is SimGlucose, a Python simulator for Type 1 diabetes that exposes an OpenAI Gym-compatible environment for training RL agents. It is a Python implementation of the FDA-approved UVA/Padova Type 1 diabetes simulator and includes virtual patient profiles (adolescents, adults, and children) with differing physiological parameters, which gives a medically grounded setting for experimentation. Researchers can test and evaluate RL algorithms like Q-learning, DQN, or actor-critic methods in this simulated setting to learn insulin dosing policies.

The SimGlucose environment has become a valuable tool for researchers in both healthcare and AI, as it allows for safe and repeatable experimentation without risking patient safety. It also promotes reproducibility and collaboration by providing a shared benchmark for comparing algorithms. In summary, RL provides a powerful framework for tackling complex, personalized treatment decisions in healthcare, and SimGlucose demonstrates a practical, open-source implementation of this approach for diabetes management.

# Task 4

In [None]:
import random
import numpy as np
from collections import defaultdict

# Tic-Tac-Toe Game Environment
class TicTacToe:
    def __init__(self):
        self.board = [' '] * 9
        self.current_winner = None

    def available_actions(self):
        return [i for i, x in enumerate(self.board) if x == ' ']

    def make_move(self, square, letter):
        if self.board[square] == ' ':
            self.board[square] = letter
            if self.winner(square, letter):
                self.current_winner = letter
            return True
        return False

    def winner(self, square, letter):
        row = square // 3 * 3
        if all(self.board[row + i] == letter for i in range(3)):
            return True
        col = square % 3
        if all(self.board[col + i * 3] == letter for i in range(3)):
            return True
        if square % 2 == 0:
            if all(self.board[i] == letter for i in [0, 4, 8]):
                return True
            if all(self.board[i] == letter for i in [2, 4, 6]):
                return True
        return False

    def is_draw(self):
        return ' ' not in self.board

    def reset(self):
        self.board = [' '] * 9
        self.current_winner = None

    def get_state(self):
        return ''.join(self.board)

# Q-learning Agent
class QAgent:
    def __init__(self, symbol, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.q_table = defaultdict(float)
        self.alpha = alpha
        self.gamma = gamma
        self.epsilon = epsilon
        self.symbol = symbol
        self.opponent = 'O' if symbol == 'X' else 'X'

    def get_action(self, state, actions):
        if random.random() < self.epsilon:
            return random.choice(actions)
        q_values = [self.q_table[(state, a)] for a in actions]
        max_q = max(q_values)
        best_actions = [a for a, q in zip(actions, q_values) if q == max_q]
        return random.choice(best_actions)

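    def update(self, state, action, reward, next_state, done):
        # Only bootstrap from moves that are legal in the next state (empty squares)
        next_actions = [i for i, c in enumerate(next_state) if c == ' ']
        max_q_next = 0 if done or not next_actions else max(
            self.q_table[(next_state, a)] for a in next_actions
        )
        current_q = self.q_table[(state, action)]
        self.q_table[(state, action)] += self.alpha * (reward + self.gamma * max_q_next - current_q)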

# Training Function
def train(agent, episodes=5000):
    env = TicTacToe()
    for ep in range(episodes):
        env.reset()
        state = env.get_state()
        done = False
        while not done:
            action = agent.get_action(state, env.available_actions())
            valid = env.make_move(action, agent.symbol)
            if not valid:
                agent.update(state, action, -10, state, True)
                break

            next_state = env.get_state()
            if env.current_winner == agent.symbol:
                agent.update(state, action, 1, next_state, True)
                done = True
            elif env.is_draw():
                agent.update(state, action, 0.5, next_state, True)
                done = True
            else:
                # Opponent is random
                opp_action = random.choice(env.available_actions())
                env.make_move(opp_action, agent.opponent)
                # The agent's next decision point is the board after the opponent replies
                next_state = env.get_state()
                if env.current_winner == agent.opponent:
                    agent.update(state, action, -1, next_state, True)
                    done = True
                elif env.is_draw():
                    # Same draw reward as above, so draws are valued consistently
                    agent.update(state, action, 0.5, next_state, True)
                    done = True
                else:
                    agent.update(state, action, 0, next_state, False)
                    state = next_state

# Print the board
def print_board(board):
    for i in range(0, 9, 3):
        print(' | '.join(board[i:i+3]))
        if i < 6:
            print('--+---+--')

# Play against the trained agent
def play_game(agent):
    env = TicTacToe()
    env.reset()
    print("You are playing as 'O'. Agent is 'X'. Board positions are 0-8:")
    print_board([str(i) for i in range(9)])

    agent.epsilon = 0.0  # play greedily; random exploration is only useful during training
    state = env.get_state()
    while True:
        # Agent move
        action = agent.get_action(state, env.available_actions())
        env.make_move(action, agent.symbol)
        print("\nAgent move:")
        print_board(env.board)
        if env.current_winner == agent.symbol:
            print("Agent wins!")
            break
        if env.is_draw():
            print("It's a draw!")
            break

        # Human move
        while True:
            try:
                move = int(input("Your move (0-8): "))
                if move in env.available_actions():
                    env.make_move(move, agent.opponent)
                    break
                else:
                    print("Invalid move. Try again.")
            except ValueError:
                print("Enter a number between 0-8.")

        if env.current_winner == agent.opponent:
            print_board(env.board)
            print("You win!")
            break
        if env.is_draw():
            print("It's a draw!")
            break

        state = env.get_state()

# Initialize and train
agent = QAgent(symbol='X')
print("Training agent... Please wait.")
train(agent, episodes=10000)
print("Training complete. Let's play!")
play_game(agent)

Training agent... Please wait.
Training complete. Let's play!
You are playing as 'O'. Agent is 'X'. Board positions are 0-8:
0 | 1 | 2
--+---+--
3 | 4 | 5
--+---+--
6 | 7 | 8

Agent move:
X |   |  
--+---+--
  |   |  
--+---+--
  |   |  
Your move (0-8): 4

Agent move:
X |   |  
--+---+--
  | O |  
--+---+--
  | X |  
Your move (0-8): 6

Agent move:
X |   |  
--+---+--
  | O | X
--+---+--
O | X |  
Your move (0-8): 2
X |   | O
--+---+--
  | O | X
--+---+--
O | X |  
You win!


The code implements a Tic-Tac-Toe game environment and a Q-learning agent that learns to play as 'X' against a random opponent. The environment is defined by the TicTacToe class, which manages the board state, available actions, move execution, and win/draw detection. The QAgent class implements the Q-learning algorithm by maintaining a Q-table that estimates the value of state-action pairs and updating it with the rule Q(s, a) ← Q(s, a) + α [r + γ max Q(s', a') − Q(s, a)], using α = 0.1, γ = 0.9, and an ε-greedy policy with ε = 0.1. The agent is trained over 10,000 episodes using a reward structure that incentivizes winning (+1), penalizes losing (-1), and gives a small reward for drawing (0.5). The intended evaluation criterion is the agent's ability to consistently beat or draw against a human or random opponent after training. No win rate was measured here, though, and in the sample game above the human won in three moves, so the learned policy is clearly still far from optimal. A simple command-line interface allows users to play against the trained agent to observe its learned behavior. This implementation was inspired by open educational resources and tutorials on Q-learning, particularly examples from GeeksforGeeks and the OpenAI Gym framework.
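
Since the evaluation above is only qualitative, one way to quantify it is to play many games against a random opponent and report win, loss, and draw rates. The sketch below is self-contained and does not reuse the classes above: `find_winner`, `random_policy`, and `evaluate` are hypothetical helper names, and a policy is assumed to be any callable that takes the board (a list of 9 characters) and the list of legal moves and returns one move.

```python
import random

WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
             (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
             (0, 4, 8), (2, 4, 6)]              # diagonals


def find_winner(board):
    """Return 'X' or 'O' if either has three in a line, else None."""
    for a, b, c in WIN_LINES:
        if board[a] != ' ' and board[a] == board[b] == board[c]:
            return board[a]
    return None


def random_policy(board, actions):
    """Baseline policy: pick any legal move uniformly at random."""
    return random.choice(actions)


def evaluate(policy_x, policy_o=random_policy, games=1000, seed=0):
    """Play `games` full games (X moves first) and return outcome rates."""
    random.seed(seed)
    results = {"X": 0, "O": 0, "draw": 0}
    for _ in range(games):
        board = [' '] * 9
        player = "X"
        while True:
            actions = [i for i, c in enumerate(board) if c == ' ']
            policy = policy_x if player == "X" else policy_o
            board[policy(board, actions)] = player
            winner = find_winner(board)
            if winner:
                results[winner] += 1
                break
            if ' ' not in board:
                results["draw"] += 1
                break
            player = "O" if player == "X" else "X"
    return {k: v / games for k, v in results.items()}


# Baseline: random vs. random, to show what an untrained player achieves
rates = evaluate(random_policy, games=2000)
print(rates)
```

To score the trained agent, it could be wrapped as `lambda board, actions: agent.get_action(''.join(board), actions)` with `agent.epsilon` set to 0, and its win rate compared with the random baseline printed here.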

# Task 5

In [None]:
!pip install surprise implicit

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from surprise import Dataset, Reader, SVD
from surprise.model_selection import train_test_split as surprise_split
from surprise.accuracy import rmse
from scipy.sparse import csr_matrix
import implicit



In [None]:
# Read u.data directly from the GroupLens server
# (dataset page: https://grouplens.org/datasets/movielens/100k/)
column_names = ['user_id', 'item_id', 'rating', 'timestamp']
df = pd.read_csv('http://files.grouplens.org/datasets/movielens/ml-100k/u.data',
                 sep='\t', names=column_names)

df.drop('timestamp', axis=1, inplace=True)
df.head()

index,user_id,item_id,rating
0,196,242,3
1,186,302,3
2,22,377,1
3,244,51,2
4,166,346,1


In [None]:
n_users = df['user_id'].nunique()
n_items = df['item_id'].nunique()
print(f"Unique users: {n_users}, Unique items: {n_items}")

# User-item matrix
user_item_matrix = df.pivot(index='user_id', columns='item_id', values='rating').fillna(0)
user_item_matrix.head()

Unique users: 943, Unique items: 1682


user_id \ item_id,1,2,3,4,5,6,7,8,9,10,...,1673,1674,1675,1676,1677,1678,1679,1680,1681,1682
1,5.0,3.0,4.0,3.0,3.0,5.0,4.0,1.0,5.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,4.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
# Use Surprise's built-in SVD
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(df[['user_id', 'item_id', 'rating']], reader)
trainset, testset = surprise_split(data, test_size=0.2, random_state=42)

model_svd = SVD()
model_svd.fit(trainset)
predictions_svd = model_svd.test(testset)

# RMSE
rmse_svd = rmse(predictions_svd)

RMSE: 0.9375


In [None]:
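# 1. Build the sparse user-item matrix (rows = users, columns = items)
user_item_sparse = csr_matrix(user_item_matrix.values)

# 2. Train ALS
# Assumes implicit >= 0.5, where fit() takes a user-item matrix
# (earlier releases expected item-user instead).
model_als = implicit.als.AlternatingLeastSquares(factors=20, regularization=0.1, iterations=10)
model_als.fit(user_item_sparse * 15)  # scale ratings into confidence weights

# 3. Safe precision@k
# Note: this scores recommendations against the same matrix the model was fit on,
# so it is an in-sample check rather than a held-out evaluation.
def precision_at_k(model, sparse_matrix, user_ids, k=5):
    precisions = []
    for uid in user_ids:
        try:
            # Do not filter already-rated items: they are the only ground truth here,
            # so filtering them would force the overlap to zero.
            ids, _scores = model.recommend(
                uid, sparse_matrix[uid], N=k, filter_already_liked_items=False
            )
            recommended_items = set(int(i) for i in ids)
            actual_items = set(sparse_matrix[uid].indices)
            if actual_items:
                precision = len(recommended_items & actual_items) / k
                precisions.append(precision)
        except IndexError:
            continue  # skip invalid user
    return np.mean(precisions)

# 4. Select valid users (matrix row indices, not raw MovieLens user ids)
valid_user_ids = sorted(set(user_item_sparse.nonzero()[0]))[:50]
precision_als = precision_at_k(model_als, user_item_sparse, user_ids=valid_user_ids)
print(f"Precision@5 (ALS): {precision_als:.4f}")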

Precision@5 (ALS): 0.0286


In [None]:
print("Model Comparison:")
print(f"- SVD RMSE: {rmse_svd:.4f}")
print(f"- ALS Precision@5: {precision_als:.4f}")

Model Comparison:
- SVD RMSE: 0.9375
- ALS Precision@5: 0.0286


For this task, I used the MovieLens 100k dataset to implement and compare two collaborative filtering recommendation systems: Matrix Factorization using Singular Value Decomposition (SVD) and Alternating Least Squares (ALS). After light preprocessing (dropping the timestamp column and counting unique users and items), I converted the ratings data into a user-item matrix suitable for modeling. The SVD model was implemented using the Surprise library, while ALS was implemented using the implicit library, which is optimized for large-scale implicit feedback datasets. To evaluate the models, I used two standard recommender system metrics: Root Mean Squared Error (RMSE) and Precision@5. RMSE is commonly used to evaluate rating prediction accuracy, while Precision@k assesses the relevance of the top-k recommended items.

The SVD model achieved an RMSE of 0.9375 on a held-out 20% of the ratings, meaning its predictions were off by a little under one point on the 1-5 scale. The ALS model achieved a Precision@5 of 0.0286 in the run shown above, suggesting that only a small portion of the recommended items were actually relevant to the user. This number should be read with caution: it was computed for 50 users against the same interaction matrix used to fit the model rather than against held-out ratings, so it is not a true measure of recommendation quality. Its low value may also reflect the explicit nature of the MovieLens data, since ALS is typically used for implicit feedback, and the limited parameter tuning. Because RMSE and Precision@5 measure different things on different scales, the two results cannot be compared directly; a fair comparison would score both models on the same metric and the same held-out split. This exercise emphasizes the importance of choosing the right model and evaluation method for a given task. My implementation was informed by standard recommender systems literature and tools, including the Surprise documentation (https://surpriselib.com/) and implicit library guidelines (https://github.com/benfred/implicit).
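
As a sketch of how Precision@k is usually computed with held-out data, the snippet below compares each user's top-k recommendations with items that were hidden from the model during training. The helper names and the toy recommendation lists are hypothetical and independent of the ALS model above; in practice the recommendation lists would come from a model fit on the training portion only.

```python
def precision_at_k(recommended, relevant, k=5):
    """Fraction of the top-k recommended items that appear in the held-out relevant set."""
    top_k = recommended[:k]
    return sum(1 for item in top_k if item in relevant) / k


def mean_precision_at_k(recs_by_user, heldout_by_user, k=5):
    """Average Precision@k over users that have at least one held-out item."""
    scores = [
        precision_at_k(recs_by_user[user], heldout_by_user[user], k)
        for user in heldout_by_user
        if heldout_by_user[user]
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Toy data (made up): ranked recommendations and held-out liked items per user
recs = {"u1": [10, 11, 12, 13, 14], "u2": [20, 21, 22, 23, 24]}
heldout = {"u1": {11, 14, 99}, "u2": {30}}

# u1 has 2 hits out of 5 (0.4), u2 has none (0.0), so the mean is 0.2
print(mean_precision_at_k(recs, heldout, k=5))  # 0.2
```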