# %%
from openai import OpenAI
import os
import pdfplumber
import pandas as pd

# %%
pdf_path = "/Users/kishanterdal/Downloads/Descriptive_Task_05/2024SUStats.pdf"

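with pdfplumber.open(pdf_path) as pdf:
    # extract_text() can return None or "" for a page with no text layer, so fall back to "".
    # Joining with newlines keeps the last line of one page from fusing with the next page.
    pdf_text = "\n".join(page.extract_text() or "" for page in pdf.pages)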

pdf_text

# %%
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="XX",
)

### Replacing the api key with XX for security reasons

question_block = """
    Please analyze the following Syracuse Women's Lacrosse 2024 season statistics and answer the following 6 high-impact questions. Be concise, but explain your reasoning when needed.

1. How many total games did Syracuse play in the 2024 season?
2. How many goals did Syracuse score in total during the season?
3. What was Syracuse’s overall win-loss record in 2024?
4. Did Syracuse score any goals in overtime during the 2024 season?
5. What was the average number of goals scored by Syracuse per game?
6. As a coach, if I wanted to win two more games this coming season, should I focus on offense or defense, and which one player should I work with to be a game changer, and why?

    Please format your response as:
    Answer 1: ...
    Answer 2: ...
    Answer 3: ...
    Answer 4: ...
    Answer 5: ...
    Answer 6: ...
    """

completion = client.chat.completions.create(
    model="mistralai/mistral-7b-instruct:free",
    messages=[
        {"role": "system", "content": "You are a strategic sports analyst reviewing Syracuse Women's Lacrosse 2024 season statistics to identify areas of improvement that could help win more games."},
        {"role": "user", "content": f"Here is the full dataset:\n\n{pdf_text}\n\n{question_block}"}
    ],
    temperature=0.5
)

print(completion.choices[0].message.content)

# %% [markdown]
# In my initial experimentation with Mistral 7B Instruct, I observed that the model **struggled with precise, fact-based questions** when applied to structured data (like sports statistics from a PDF).
# 
# Specifically, when asked the 5 straightforward factual questions about the Syracuse Women's Lacrosse 2024 season:
# 
# * Mistral **incorrectly calculated** the number of games played as **24 instead of 22**, despite the "16-6" record clearly indicating 22 games (16 wins + 6 losses).
# * It **hallucinated a total goal count** of **435**, apparently by multiplying the per-game average by a wrong game count, instead of using the explicit total of **335 goals** from the "Goals by Period" table.
# * It **invented a 16-8 win-loss record**, which contradicts the official "16-6" record stated in the data.
# * Mistral consistently built its logic on **fabricated data** (24 games, 435 goals, a fake player).
# 
# Due to this pattern of factual inaccuracies, I decided to evaluate **Google Gemma 3 12B Instruct**, an open-weight, instruction-tuned model, on the same task.
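
# %% [markdown]
# The cross-checks described above are plain arithmetic and can be sketched in a few lines. The helper names (`games_from_record`, `figures_agree`) are hypothetical and not part of the original notebook; the numbers are the ones quoted above.

```python
# %%
def games_from_record(record: str) -> int:
    """Total games implied by a 'wins-losses' record such as '16-6'."""
    wins, losses = (int(part) for part in record.split("-"))
    return wins + losses


def figures_agree(total_goals: int, games: int, avg_per_game: float, tol: float = 0.01) -> bool:
    """True if total_goals / games matches the quoted per-game average within tol."""
    return abs(total_goals / games - avg_per_game) <= tol


games = games_from_record("16-6")             # 16 wins + 6 losses
source_ok = figures_agree(335, games, 15.23)  # figures stated in the PDF
mistral_ok = figures_agree(435, 24, 15.23)    # figures Mistral reported

print(games, source_ok, mistral_ok)  # 22 True False
```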

# %%
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="XX"
)

### Replacing the api key with XX for security reasons

question_block = """
Please analyze the following Syracuse Women's Lacrosse 2024 season statistics and answer the following 6 high-impact questions. Be concise, but explain your reasoning when needed.

1. How many total games did Syracuse play in the 2024 season?
2. How many goals did Syracuse score in total during the season?
3. What was Syracuse’s overall win-loss record in 2024?
4. Did Syracuse score any goals in overtime during the 2024 season?
5. What was the average number of goals scored by Syracuse per game?
6. As a coach, if I wanted to win two more games this coming season, should I focus on offense or defense, and which one player should I work with to be a game changer, and why?

Please format your response as:
Answer 1: ...
Answer 2: ...
Answer 3: ...
Answer 4: ...
Answer 5: ...
Answer 6: ...
"""


# Call Gemma model through OpenRouter
completion = client.chat.completions.create(
    model="google/gemma-3-12b-it:free",
    messages=[
        {"role": "system", "content": "You are a strategic sports analyst reviewing Syracuse Women's Lacrosse 2024 season statistics to identify areas of improvement that could help win more games."},
        {"role": "user", "content": f"Here is the full dataset:\n\n{pdf_text}\n\n{question_block}"}
    ],
    temperature=0.5
)

# Print result
print(completion.choices[0].message.content)


# %% [markdown]
# When tested with the **exact same dataset and questions**, **Gemma correctly answered all 5 factual questions**, including:
# 
# * Correctly identifying the total games as **22**
# * Accurately reporting **335 total goals** (as explicitly listed)
# * Preserving the actual **16-6 win-loss record**
# * Recognizing **zero goals scored in overtime**
# * Using the given **15.23 average goals per game** directly instead of re-deriving it incorrectly
# 
# This level of **grounded accuracy and reliability** makes Google Gemma a more suitable choice for structured analysis tasks, especially when precision matters.
# 
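
# %% [markdown]
# Because the prompt asks for an `Answer N: ...` layout, a reply can be split into per-question answers before checking it. A minimal sketch, assuming the model follows that layout in plain text (bold markers or other decoration would need extra handling). `parse_answers` is a hypothetical helper, and the sample reply is hand-written from the values in the PDF, not real model output.

```python
# %%
import re

# Capture the text after "Answer N:" up to the next "Answer N:" line or the end of the reply.
ANSWER_RE = re.compile(r"Answer\s+(\d+):[ \t]*(.*?)(?=\n\s*Answer\s+\d+:|\Z)", re.DOTALL)


def parse_answers(reply: str) -> dict[int, str]:
    """Map question number -> answer text for a reply in 'Answer N: ...' format."""
    return {int(num): text.strip() for num, text in ANSWER_RE.findall(reply)}


# Hand-written sample (not actual model output).
sample_reply = """Answer 1: 22 games
Answer 2: 335 goals,
taken from the Goals by Period table
Answer 3: 16-6"""

parsed = parse_answers(sample_reply)
print(parsed[1], "|", parsed[3])
```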

# %% [markdown]
# ### Experimentation
# 
# I will experiment with two open-weight language models, **Mistral 7B Instruct** and **Google Gemma 3 12B Instruct**, to evaluate their performance on structured reasoning and factual analysis tasks. The evaluation uses **five distinct tests** designed to measure reliability, reasoning ability, and robustness in real-world scenarios.
# 
# The five tests are:
# 
# 1. **Repetition Consistency Test**
#    To assess whether the model gives consistent answers to semantically equivalent questions.
# 
# 2. **Multi-Hop Reasoning Test**
#    To evaluate the model’s ability to combine multiple pieces of information and perform intermediate calculations.
# 
# 3. **Prompt Robustness Test**
#    To test how well the model performs with open-ended or ambiguous prompts.
# 
# 4. **Math & Logic Problems**
#    To check the model’s accuracy in basic arithmetic and logical deduction.
# 
# 5. **Model Comparison Table**
#    A side-by-side comparison of answers from both models across all test cases, with manual grading of correctness and quality.
# 
# This structured evaluation will help determine which model is better suited for factual Q&A tasks involving structured data, such as sports analytics or performance summaries.

# %%
models = {
    "Mistral": "mistralai/mistral-7b-instruct:free",
    "Gemma": "google/gemma-3-12b-it:free"
}

questions = {
    "Repetition Consistency Test": [
        "How many games did Syracuse play in 2024?",
        "What was the total number of matches Syracuse played in the 2024 season?"
    ],
    "Multi-Hop Reasoning": [
        "If Syracuse scored 335 goals in 22 games, what was their average goals per game?",
        "If 65 of those 335 goals were from free positions, what percentage of Syracuse's goals were free-position goals?"
    ],
    "Prompt Robustness Test": [
        "Tell me something insightful from Syracuse's 2024 lacrosse season stats.",
        "What should a coach know about this team's performance?"
    ],
    "Math & Logic Problems": [
        "If a player scores 2.5 goals in 10 games, how many total goals is that?",
        "A team won 16 out of 22 games. What is their win percentage?"
    ]
}
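
# %% [markdown]
# The computational questions above have answers that can be worked out directly, which gives a fixed reference for grading. A small sketch: the dictionary name and keys are illustrative, and the 2.5-goal question is read here as 2.5 goals *per game*, which is an assumption about its intended meaning.

```python
# %%
# Reference values for the four computational questions (keys are illustrative).
ground_truth = {
    "avg_goals_per_game": round(335 / 22, 2),
    "free_position_pct": round(65 / 335 * 100, 2),
    "total_goals_at_2_5_per_game": 2.5 * 10,  # assumes 2.5 is a per-game rate
    "win_pct": round(16 / 22 * 100, 2),
}

for name, value in ground_truth.items():
    print(f"{name}: {value}")
```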

# %%
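def ask_model(model_name, question, context=None):
    """Send one question to a model via OpenRouter and return the reply text.

    If `context` is given (e.g. the extracted PDF text), it is placed before the
    question so the model can ground its answer in the dataset.
    """
    user_content = f"Here is the full dataset:\n\n{context}\n\n{question}" if context else question
    response = client.chat.completions.create(
        model=model_name,
        messages=[
            {
                "role": "system",
                "content": "You are a helpful assistant answering questions based on structured statistics or logic."
            },
            {
                "role": "user",
                "content": user_content
            }
        ],
        temperature=0.5
    )
    # message.content can be None on an empty completion, so guard before strip()
    return (response.choices[0].message.content or "").strip()

# %%
results = []

for test_name, qs in questions.items():
    for q in qs:
        # Pass the PDF text so questions about the season are answered from the data
        mistral_answer = ask_model(models["Mistral"], q, context=pdf_text)
        gemma_answer = ask_model(models["Gemma"], q, context=pdf_text)

        results.append({
            "Test Case": test_name,
            "Question": q,
            "Mistral Answer": mistral_answer,
            "Gemma Answer": gemma_answer
        })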

# %%
df = pd.DataFrame(results)
df.to_csv("model_comparison_results.csv", index=False)
df
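
# %% [markdown]
# Manual grading of the numeric questions can be assisted by checking whether the expected value appears anywhere in an answer. This is a rough sketch rather than the grading method used here: `numbers_in` and `mentions_value` are hypothetical helpers, the sample answers are invented for illustration, and a match only flags an answer for review, since an answer can mention the right number while concluding something else.

```python
# %%
import re


def numbers_in(text: str) -> list[float]:
    """All non-negative integers and decimals found in the text, in order."""
    return [float(tok) for tok in re.findall(r"\d+(?:\.\d+)?", text)]


def mentions_value(answer: str, expected: float, tol: float = 0.01) -> bool:
    """True if any number in the answer is within tol of the expected value."""
    return any(abs(num - expected) <= tol for num in numbers_in(answer))


# Invented sample answers, for illustration only.
print(mentions_value("In 2024, Syracuse played 22 games.", 22))
print(mentions_value("Syracuse played 24 games.", 22))
print(mentions_value("Roughly 72.7% of games were wins.", 72.73, tol=0.05))
```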

# %% [markdown]
# ## Experiment Overview
# 
# Two open-weight large language models were evaluated:
# 
# * **Mistral 7B Instruct** (`mistralai/mistral-7b-instruct:free`)
# * **Google Gemma 3 12B Instruct** (`google/gemma-3-12b-it:free`)
# 
# ### Evaluation Criteria:
# 
# Models were tested on 5 categories:
# 
# 1. **Repetition Consistency Test**
# 2. **Multi-Hop Reasoning**
# 3. **Prompt Robustness Test**
# 4. **Math & Logic Problems**
# 5. **Comparison Table (Ground Truth Matching)**
# 
# ---
# 
# ## 📊 Results Summary
# 
# | Metric                               | Mistral 7B | Gemma 12B |
# | ------------------------------------ | ---------- | --------- |
# | ✅ Correct Answers (Objective Qs)     | 1          | 3         |
# | ❌ Incorrect Answers                  | 5          | 3         |
# | ⚖️ Subjective Questions (not graded) | 2          | 2         |
# | 📈 Total Evaluated (objective only)  | 6          | 6         |
# 
# ---
# 
# ## 🔍 Detailed Question-by-Question Evaluation
# 
# | Test Case              | Question                                                               | Ground Truth | Mistral Verdict | Gemma Verdict |
# | ---------------------- | ---------------------------------------------------------------------- | ------------ | --------------- | ------------- |
# | Repetition Consistency | How many games did Syracuse play in 2024?                              | 22           | Incorrect       | **Correct**   |
# | Repetition Consistency | What was the total number of matches Syracuse played in 2024?          | 22           | Incorrect       | **Correct**   |
# | Multi-Hop Reasoning    | If Syracuse scored 335 goals in 22 games, what was avg goals per game? | 15.23        | Incorrect       | **Correct**   |
# | Multi-Hop Reasoning    | What % of goals were from free positions? (65 of 335)                  | 19.40%       | Incorrect       | Incorrect     |
# | Prompt Robustness      | Tell me something insightful from the stats                            | —            | Subjective      | Subjective    |
# | Prompt Robustness      | What should a coach know about performance?                            | —            | Subjective      | Subjective    |
# | Math & Logic           | Player scores 2.5 goals in 10 games. Total goals?                      | 25.0         | **Correct**     | Incorrect     |
# | Math & Logic           | Team wins 16 of 22 games. Win percentage?                              | 72.73%       | Incorrect       | Incorrect     |
# 
# ---
# 
# ## ✅ Key Observations
# 
# * **Gemma** was better at:
# 
#   * Understanding structured data context (e.g., total games played)
#   * Single-step derived statistics (average goals per game)
# * **Mistral** struggled with:
# 
#   * Grounded answers from context
#   * Most arithmetic; its only correct objective answer was the 2.5 × 10 multiplication, which Gemma missed
# * **Both models** failed the two percentage calculations (19.40% and 72.73%).
# * Both models handled **subjective prompts** reasonably, but those were excluded from scoring.
# 
# ---
# 
# ## 🏁 Final Verdict
# 
# | Criteria               | Winner                                |
# | ---------------------- | ------------------------------------- |
# | Factual Accuracy       | ✅ **Gemma**                           |
# | Multi-Step Reasoning   | ✅ **Gemma**                           |
# | Basic Math Consistency | ❌ Neither (both failed % calculation) |
# | Instruction Following  | ✅ **Gemma**                           |
# | Overall                | ✅ **Gemma Wins**                      |
# 
# ---
# 
# 
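# In this small test (six objective questions), **Google Gemma 3 12B** was more reliable than **Mistral 7B** for **structured data interpretation** and **sports analytics** questions. Neither model was dependable for percentage arithmetic, so **numerical reasoning** results from either should still be verified.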
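
# %% [markdown]
# The summary counts can be reproduced from the question-by-question verdicts. A short sketch, with the verdicts transcribed by hand from the detailed evaluation table above; the variable names are illustrative.

```python
# %%
from collections import Counter

# Verdicts copied row by row from the detailed evaluation table.
verdicts = {
    "Mistral": ["Incorrect", "Incorrect", "Incorrect", "Incorrect",
                "Subjective", "Subjective", "Correct", "Incorrect"],
    "Gemma": ["Correct", "Correct", "Correct", "Incorrect",
              "Subjective", "Subjective", "Incorrect", "Incorrect"],
}

summary = {model: Counter(rows) for model, rows in verdicts.items()}

for model, counts in summary.items():
    graded = counts["Correct"] + counts["Incorrect"]  # subjective rows are not graded
    print(f"{model}: {counts['Correct']}/{graded} correct, "
          f"{counts['Subjective']} subjective")
```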
