# LLM Comparison Using an LLM as Judge

## Import the Needed Python Packages 

In [1]:
import os
import json
from dotenv import load_dotenv
from openai import OpenAI
from anthropic import Anthropic
from IPython.display import Markdown, display


In [2]:
load_dotenv(override=True)

True

In [3]:
openai_api_key = os.getenv('OPENAI_API_KEY')
anthropic_api_key = os.getenv('ANTHROPIC_API_KEY')
google_api_key = os.getenv('GOOGLE_API_KEY')
deepseek_api_key = os.getenv('DEEPSEEK_API_KEY')
groq_api_key = os.getenv('GROQ_API_KEY')

if openai_api_key:
    print(f"OpenAI API Key exists and begins {openai_api_key[:8]}")
else:
    print("OpenAI API Key not set")
    
if anthropic_api_key:
    print(f"Anthropic API Key exists and begins {anthropic_api_key[:7]}")
else:
    print("Anthropic API Key not set (and this is optional)")

if google_api_key:
    print(f"Google API Key exists and begins {google_api_key[:2]}")
else:
    print("Google API Key not set (and this is optional)")

if deepseek_api_key:
    print(f"DeepSeek API Key exists and begins {deepseek_api_key[:3]}")
else:
    print("DeepSeek API Key not set (and this is optional)")

if groq_api_key:
    print(f"Groq API Key exists and begins {groq_api_key[:4]}")
else:
    print("Groq API Key not set (and this is optional)")

OpenAI API Key exists and begins sk-proj-
Anthropic API Key exists and begins sk-ant-
Google API Key exists and begins AI
DeepSeek API Key exists and begins sk-
Groq API Key exists and begins gsk_
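
The five checks above all have the same shape, so they can be folded into one loop. The following is a minimal sketch, not part of the original notebook: `KEY_SPECS` and `key_status` are hypothetical names, and the prefix lengths are copied from the cell above.

```python
import os

# Provider label, environment variable, and how many leading characters to show.
# The prefix lengths mirror the cell above; the names here are hypothetical.
KEY_SPECS = [
    ("OpenAI", "OPENAI_API_KEY", 8),
    ("Anthropic", "ANTHROPIC_API_KEY", 7),
    ("Google", "GOOGLE_API_KEY", 2),
    ("DeepSeek", "DEEPSEEK_API_KEY", 3),
    ("Groq", "GROQ_API_KEY", 4),
]

def key_status(label, key, prefix_len, optional=True):
    """Return the status line for one API key without exposing the whole key."""
    if key:
        return f"{label} API Key exists and begins {key[:prefix_len]}"
    suffix = " (and this is optional)" if optional else ""
    return f"{label} API Key not set{suffix}"

for label, env_var, prefix_len in KEY_SPECS:
    # Only the OpenAI key is treated as required, matching the original messages.
    print(key_status(label, os.getenv(env_var), prefix_len, optional=(label != "OpenAI")))
```

Returning the string instead of printing inside the helper keeps it easy to check in isolation.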


## LLMs evaluation 

In [4]:
request = "Please come up with a challenging, nuanced question that I can ask a number of LLMs to evaluate their intelligence."
request += " Answer only with the question, no explanation."
messages = [{"role": "user", "content": request}]
print(messages)

[{'role': 'user', 'content': 'Please come up with a challenging, nuanced question that I can ask a number of LLMs to evaluate their intelligence. Answer only with the question, no explanation.'}]


### OpenAI to generate a challenge

In [5]:
model = OpenAI()
response = model.chat.completions.create(
    model="gpt-4.1-mini",
    messages=messages,
)
question = response.choices[0].message.content
print(question)

How would you reconcile the ethical implications of deploying autonomous surveillance systems in public spaces with the need to preserve individual privacy rights, considering diverse cultural attitudes towards privacy and security?


In [6]:
messages = [{"role": "user", "content": question}]
competitors = []
answers = []
models = []
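
Each response cell in this notebook repeats the same steps: call the model, read the reply text, and append to the three parallel lists. The sketch below shows one way to factor that out for clients that expose the OpenAI-style `chat.completions.create` call. The helper names `ask_openai_compatible` and `record_answer` are hypothetical, and the fake client exists only so the example runs without network access.

```python
from types import SimpleNamespace

def ask_openai_compatible(client, model_name, messages):
    """Send messages through an OpenAI-style client and return the reply text."""
    response = client.chat.completions.create(model=model_name, messages=messages)
    return response.choices[0].message.content

def record_answer(model_name, answer, client, competitors, answers, models):
    """Append one competitor's name, answer and client to the three parallel lists."""
    competitors.append(model_name)
    answers.append(answer)
    models.append(client)

# Stand-in client for illustration only; it is NOT a real API client.
def _fake_create(model, messages):
    reply = SimpleNamespace(content=f"{model} saw {len(messages)} message(s)")
    return SimpleNamespace(choices=[SimpleNamespace(message=reply)])

fake_client = SimpleNamespace(
    chat=SimpleNamespace(completions=SimpleNamespace(create=_fake_create))
)

demo_competitors, demo_answers, demo_models = [], [], []
demo_answer = ask_openai_compatible(
    fake_client, "demo-model", [{"role": "user", "content": "Hi"}]
)
record_answer("demo-model", demo_answer, fake_client,
              demo_competitors, demo_answers, demo_models)
print(demo_competitors, demo_answers)
```

The Anthropic client does not fit `ask_openai_compatible`: as the Anthropic cell in this notebook shows, it goes through `messages.create` with `max_tokens` and exposes the text at `response.content[0].text`.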

### Response: OpenAI

In [7]:
model_name = "gpt-4.1-mini"

response = model.chat.completions.create(model=model_name, messages=messages)
answer = response.choices[0].message.content

display(Markdown(answer))
competitors.append(model_name)
answers.append(answer)
models.append(model)

Reconciling the ethical implications of deploying autonomous surveillance systems in public spaces with the preservation of individual privacy rights—especially in light of diverse cultural attitudes toward privacy and security—requires a multifaceted, context-sensitive approach. Here are key strategies and principles to consider:

### 1. **Principle of Proportionality and Necessity**
- **Assess the purpose:** Surveillance should be deployed only when there is a clear, compelling public interest (e.g., preventing serious crimes or ensuring public safety).
- **Limit the scope:** Collect only the data necessary for the stated purpose, avoiding excessive or indiscriminate monitoring.

### 2. **Transparency and Accountability**
- **Inform the public:** Clearly communicate what data is being collected, how it is used, and how long it is retained.
- **Enable oversight:** Establish independent oversight bodies that can audit the systems and processes, ensuring accountability and addressing misuse or abuse.

### 3. **Privacy by Design**
- **Embed privacy protections:** Implement technical measures such as data minimization, anonymization, encryption, and secure storage.
- **Enable control:** Where feasible, allow individuals some degree of control over how their data is collected and used.

### 4. **Respect for Cultural Diversity**
- **Contextualize deployment:** Understand and respect local cultural norms and values regarding privacy and security.
- **Engage communities:** Involve diverse stakeholders—including marginalized and vulnerable groups—in dialogue and decision-making to identify acceptable trade-offs.

### 5. **Legal and Ethical Frameworks**
- **Adhere to law:** Ensure surveillance complies with national laws and international human rights standards.
- **Ethical guidance:** Develop and apply ethical guidelines that emphasize human dignity, freedom, and proportionality.

### 6. **Regular Review and Adaptation**
- **Continuous evaluation:** Regularly assess the impact of surveillance on privacy and security, adjusting policies and technologies as needed.
- **Sunset clauses:** Where appropriate, set expiration dates for surveillance programs to avoid indefinite data collection.

---

### Example Implementation Scenario:
In a culturally privacy-sensitive society, an autonomous surveillance system designed to deter violent crime might operate with strict anonymization by default, only flagging specific behaviors rather than capturing identifiable images. Public consultation informs system design, and an independent privacy board oversees data handling. By contrast, in a society that prioritizes collective security over individual privacy, more detailed monitoring might be accepted, but still under transparent regulatory oversight.

---

### Summary
Balancing ethical concerns requires respecting individual privacy, ensuring transparency and accountability, adapting to cultural values, and limiting surveillance to legitimate needs. This approach helps maintain public trust while leveraging technology to enhance security in a way that honors diverse perspectives on privacy.

### Response: Anthropic

In [8]:
model_name = "claude-3-7-sonnet-latest"

model = Anthropic()
response = model.messages.create(model=model_name, messages=messages, max_tokens=1000)
answer = response.content[0].text

display(Markdown(answer))
competitors.append(model_name)
answers.append(answer)
models.append(model)

# Reconciling Autonomous Surveillance and Privacy Rights

This complex issue requires balancing multiple considerations:

## Key Tensions
- Public safety and security benefits vs. fundamental privacy rights
- Technological capabilities vs. ethical boundaries
- State/corporate interests vs. individual autonomy

## Potential Reconciliation Approaches

**Transparency and Consent Mechanisms**
- Clear signage about surveillance areas and capabilities
- Opt-in features where feasible, especially for identity recognition
- Public education about data collection scope and limitations

**Cultural Contextualization**
- Recognize that privacy expectations vary significantly across cultures
- Adapt implementation based on local values and democratic processes
- Avoid imposing one cultural framework universally

**Technical Safeguards**
- Privacy-by-design principles with data minimization
- Anonymization techniques where identification isn't essential
- Decentralized data storage with robust security

**Governance Frameworks**
- Independent oversight committees with diverse representation
- Regular privacy impact assessments with public reporting
- Clear limitations on data sharing and purpose restrictions

The most ethical approach likely involves finding a middle path that acknowledges both legitimate security concerns and the fundamental right to privacy, while remaining sensitive to different cultural contexts.

### Response: Google

In [9]:
model = OpenAI(api_key=google_api_key, base_url="https://generativelanguage.googleapis.com/v1beta/openai/")
model_name = "gemini-2.0-flash"

response = model.chat.completions.create(model=model_name, messages=messages)
answer = response.choices[0].message.content

display(Markdown(answer))
competitors.append(model_name)
answers.append(answer)
models.append(model)

Reconciling the ethical implications of autonomous surveillance systems (ASS) with individual privacy rights, while navigating diverse cultural attitudes, is a complex challenge requiring a multi-faceted approach. Here's a breakdown of how we might approach this:

**1. Establishing Clear Ethical Guidelines & Legal Frameworks:**

*   **Purpose Limitation:**  ASS deployment must be justified by a specific, demonstrable, and legitimate public interest. This prevents "surveillance creep" and function creep (using the data for purposes beyond the original intent). Examples:  reducing violent crime in known hotspots, improving traffic flow to reduce pollution.
*   **Necessity and Proportionality:** The use of ASS should be a last resort, only deployed when less intrusive methods are insufficient to achieve the legitimate aim. The level of surveillance should be proportional to the threat and the potential benefits.  For instance, using facial recognition to identify jaywalkers would likely be considered disproportionate.
*   **Data Minimization:**  Collect only the minimum amount of data necessary to achieve the stated purpose. Avoid collecting and storing personal data irrelevant to the specific objective. Data retention periods should be clearly defined and limited.
*   **Transparency and Accountability:**  Publicly disclose the deployment of ASS, including its purpose, capabilities, data retention policies, and responsible party.  Establish clear lines of accountability for misuse or breaches of privacy. Independent oversight bodies can provide further scrutiny.
*   **Human Oversight:**  Autonomous systems should not operate without human oversight.  Human operators should be responsible for making critical decisions based on the information gathered by ASS and for responding to potential biases or errors.  This prevents automation bias and ensures ethical considerations are always present.
*   **Auditing and Evaluation:**  Regularly audit the performance of ASS to ensure they are achieving their intended purpose and not disproportionately impacting specific groups or communities. Evaluate the effectiveness of the system and its impact on privacy.

**2. Technical Safeguards for Privacy Protection:**

*   **Privacy-Enhancing Technologies (PETs):**  Implement technologies that minimize the collection and storage of personal data. Examples include:
    *   **Differential Privacy:** Adding noise to data to protect individual identities while allowing for statistical analysis.
    *   **Federated Learning:**  Training AI models on decentralized data sources without directly accessing or sharing raw data.
    *   **Homomorphic Encryption:** Performing computations on encrypted data without decrypting it.
    *   **Anonymization/Pseudonymization:** Removing or masking identifying information.
*   **Edge Computing:** Processing data locally on the device rather than sending it to a central server, reducing the risk of data breaches and improving privacy.
*   **Selective Surveillance:**  Design the system to target specific behaviors or events rather than continuously monitoring entire populations.  For example, detecting unusual crowd movements suggestive of a potential threat.
*   **Blurring/Masking:**  Blurring faces or license plates in video footage unless specifically needed for investigation.

**3. Addressing Diverse Cultural Attitudes:**

*   **Community Engagement:**  Engage with diverse communities to understand their specific concerns and values regarding privacy and security. This involves public consultations, workshops, and forums.  Respect local customs and traditions related to privacy.
*   **Culturally Sensitive Design:** Tailor the design and deployment of ASS to reflect the specific cultural context.  What is considered acceptable surveillance in one culture may be unacceptable in another.
*   **Privacy Impact Assessments (PIA):**  Conduct PIAs that specifically consider the potential impact of ASS on different cultural groups.  These assessments should identify and address any potential biases or discriminatory effects.
*   **Multi-Stakeholder Dialogue:**  Establish platforms for dialogue between government agencies, technology developers, civil society organizations, and the public to discuss the ethical and societal implications of ASS.
*   **Education and Awareness:**  Educate the public about the capabilities of ASS, the potential risks to privacy, and the safeguards that are in place.  Empower individuals to make informed decisions about their privacy.
*   **Variable Consent Mechanisms:**  Explore different consent mechanisms that allow individuals to opt-in or opt-out of certain types of surveillance, depending on their cultural values and preferences.  This might involve different levels of anonymization or data retention.

**4. Emphasizing Benefits and Building Trust:**

*   **Transparency about Benefits:** Clearly communicate the benefits of ASS to the public, focusing on how it can improve public safety, security, or quality of life.  Provide evidence to support these claims.
*   **Demonstrate Accountability:**  Actively address concerns about privacy violations or misuse of data.  Hold those responsible accountable for their actions.
*   **Focus on Prevention, Not Just Detection:**  Design ASS to prevent crime and promote positive social outcomes, rather than simply detecting and punishing offenders.  For example, using ASS to improve traffic flow and reduce accidents.
*   **Empowerment, Not Just Control:**  Consider how ASS can be used to empower individuals and communities, rather than simply controlling their behavior. For example, providing real-time information about public transportation or emergency situations.

**Challenges and Ongoing Considerations:**

*   **Bias in Algorithms:**  Algorithms used in ASS can be biased based on the data they are trained on. This can lead to discriminatory outcomes.  Ongoing efforts are needed to identify and mitigate bias in AI.
*   **Evolving Technology:**  ASS technology is rapidly evolving, making it difficult to keep pace with the ethical and legal implications.  Continuous monitoring and adaptation of regulations are essential.
*   **Global Harmonization:**  International cooperation is needed to develop common standards and best practices for the deployment of ASS.
*   **Balancing Security and Freedom:**  Finding the right balance between security and freedom is an ongoing challenge.  Societies must be willing to have difficult conversations about the trade-offs involved.

**In conclusion, reconciling the ethical implications of ASS with privacy rights requires a multi-layered approach that combines strong legal frameworks, robust technical safeguards, culturally sensitive engagement, and a commitment to transparency and accountability. There is no one-size-fits-all solution, and ongoing dialogue and adaptation are crucial to ensure that ASS are used responsibly and ethically in a diverse and evolving world.**


### Response: DeepSeek

In [10]:
model = OpenAI(api_key=deepseek_api_key, base_url="https://api.deepseek.com/v1")
model_name = "deepseek-chat"

response = model.chat.completions.create(model=model_name, messages=messages)
answer = response.choices[0].message.content

display(Markdown(answer))
competitors.append(model_name)
answers.append(answer)
models.append(model)

The deployment of autonomous surveillance systems in public spaces presents a complex ethical challenge that requires balancing collective security with individual privacy rights, especially given diverse cultural perspectives on these values. Here’s a framework for reconciliation:

### **1. Contextual and Proportional Deployment**
   - **Purpose Limitation**: Surveillance should be narrowly tailored to specific, legitimate purposes (e.g., crime prevention, traffic management) rather than indiscriminate monitoring.
   - **Risk-Based Approach**: Deploy systems only in high-risk areas where benefits demonstrably outweigh privacy intrusions (e.g., transport hubs vs. residential neighborhoods).

### **2. Legal and Regulatory Safeguards**
   - **Transparency**: Publicly disclose the use, scope, and capabilities of surveillance systems, including data retention policies.
   - **Consent & Opt-Outs**: Where culturally appropriate, allow opt-outs for non-essential surveillance (e.g., in low-risk areas or private-public spaces like malls).
   - **Data Minimization**: Collect only necessary data, anonymize where possible, and implement strict deletion timelines.

### **3. Cultural Sensitivity & Public Participation**
   - **Community Engagement**: Involve local stakeholders in decision-making to align systems with cultural norms (e.g., some societies prioritize security over privacy, while others resist state monitoring).
   - **Adaptive Policies**: Allow regional or municipal flexibility in deployment based on local attitudes (e.g., stricter rules in privacy-conscious regions like the EU vs. more permissive approaches in security-focused contexts).

### **4. Technological Mitigations**
   - **Privacy-Enhancing Technologies (PETs)**: Use techniques like edge computing (processing data locally without storage), blurring faces in non-critical analysis, or differential privacy to minimize identifiability.
   - **Bias Audits**: Regularly test systems for discriminatory outcomes (e.g., racial/gender bias in facial recognition) to prevent harm to marginalized groups.

### **5. Oversight & Accountability**
   - **Independent Review Boards**: Establish ethics committees with diverse representation to oversee deployments and investigate misuse.
   - **Redress Mechanisms**: Provide accessible avenues for individuals to challenge surveillance (e.g., requesting data deletion or filing complaints).

### **6. International Standards & Cooperation**
   - **Cross-Border Frameworks**: Align with global guidelines (e.g., GDPR principles, OECD AI recommendations) while respecting local variations.
   - **Human Rights Alignment**: Ensure systems comply with universal rights (e.g., UN guidelines on privacy in digital surveillance).

### **Reconciliation in Practice**
- **Example 1**: In a city with high crime but strong privacy concerns, deploy surveillance with real-time anonymization and sunset clauses requiring renewal based on efficacy reviews.  
- **Example 2**: In a culture valuing communal safety (e.g., Singapore), emphasize transparency about benefits while allowing opt-outs in non-critical zones.  

### **Conclusion**
The ethical balance hinges on **proportionality, accountability, and adaptability**. By embedding privacy protections into design, engaging communities, and enforcing strict oversight, societies can harness surveillance benefits without eroding fundamental rights. Cultural diversity demands solutions that are neither one-size-fits-all nor absolute but instead contextually negotiated.

### Response: Groq

In [11]:
model = OpenAI(api_key=groq_api_key, base_url="https://api.groq.com/openai/v1")
model_name = "llama-3.3-70b-versatile"

response = model.chat.completions.create(model=model_name, messages=messages)
answer = response.choices[0].message.content

display(Markdown(answer))
competitors.append(model_name)
answers.append(answer)
models.append(model)

Reconciling the ethical implications of deploying autonomous surveillance systems in public spaces with the need to preserve individual privacy rights is a complex issue that involves balancing individual rights with societal needs, while also considering diverse cultural attitudes towards privacy and security. Here are some steps to achieve this balance:

1. **Establish clear policies and regulations**: Governments and regulatory bodies should develop and implement clear policies and guidelines for the deployment and use of autonomous surveillance systems in public spaces, taking into account the principles of necessity, proportionality, and transparency.
2. **Conduct public consultations and education**: Engage with diverse stakeholders, including community groups, civil society organizations, and individual citizens, to raise awareness about the benefits and risks of autonomous surveillance systems and gather feedback on their concerns and expectations.
3. **Implement privacy-enhancing technologies**: Incorporate privacy-enhancing technologies, such as data anonymization, encryption, and secure data storage, into the design and operation of autonomous surveillance systems to minimize the collection and retention of personal data.
4. **Define and limit data collection and use**: Clearly define the purposes for which data collected by autonomous surveillance systems can be used, and limit data collection to only what is strictly necessary to achieve those purposes.
5. **Ensure accountability and oversight**: Establish independent oversight bodies to monitor the deployment and use of autonomous surveillance systems, investigate complaints, and ensure that operators comply with regulations and guidelines.
6. **Respect cultural differences and contextualize deployment**: Consider diverse cultural attitudes towards privacy and security when deploying autonomous surveillance systems, and contextualize their deployment to reflect local norms and values.
7. **Foster transparency and explainability**: Provide clear information about the deployment and operation of autonomous surveillance systems, including the data they collect, how it is used, and the algorithms and decision-making processes involved.
8. **Implement robust data protection measures**: Implement robust data protection measures, such as access controls, data encryption, and secure data storage, to prevent unauthorized access, misuse, or exploitation of personal data.
9. **Ensure human oversight and review**: Implement human oversight and review mechanisms to detect and correct errors, biases, or unfair outcomes resulting from the use of autonomous surveillance systems.
10. **Continuously monitor and evaluate**: Continuously monitor and evaluate the impact of autonomous surveillance systems on individual privacy rights and societal security, and make adjustments as needed to ensure that they are used in a way that respects human rights and dignity.

**Diverse cultural attitudes towards privacy and security**:

1. **Collectivist vs. individualist cultures**: In collectivist cultures, the priority may be on group security and social harmony, while in individualist cultures, the emphasis may be on individual privacy and autonomy.
2. **High-context vs. low-context cultures**: In high-context cultures, relationships and social norms may play a larger role in shaping attitudes towards privacy and security, while in low-context cultures, the focus may be on individual rights and freedoms.
3. **Different perceptions of risk and trust**: Cultures may have different perceptions of risk and trust, influencing their attitudes towards autonomous surveillance systems and their willingness to accept trade-offs between security and privacy.

**Preserving individual privacy rights**:

1. **Data minimization**: Collect only the minimum amount of data necessary to achieve the intended purpose.
2. **Anonymization and pseudonymization**: Use techniques to anonymize or pseudonymize data, making it more difficult to identify individuals.
3. **Access controls**: Implement strict access controls to ensure that only authorized personnel can access and process personal data.
4. **Right to opt-out**: Provide individuals with the right to opt-out of being surveilled or to request that their data be deleted.
5. **Right to explanation**: Provide individuals with the right to receive explanations about how their data is being used and what decisions are being made about them.

By following these steps and considering diverse cultural attitudes towards privacy and security, it is possible to reconcile the ethical implications of deploying autonomous surveillance systems in public spaces with the need to preserve individual privacy rights.

### Response: Ollama (local service)

In [12]:
# Run `ollama serve` in the terminal first
#!ollama pull llama3.2:latest
!ollama pull mistral-small:latest

pulling manifest
pulling 102a747c1376: 100% ▕██████████████████▏  14 GB
pulling 6db27cd4e277: 100% ▕██████████████████▏  695 B
pulling 6d7b25ffd247: 100% ▕██████████████████▏  644 B
pulling 43070e2d4e53: 100% ▕██████████████████▏  11 KB
pulling e0daf17ff83e: 100% ▕██████████████████▏   21 B
pulling 41ffc852c4b6: 100% ▕██████████████████▏  562 B
verifying sha256 digest
writing manifest
success


In [13]:
model = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')
model_name = "mistral-small:latest"

response = model.chat.completions.create(model=model_name, messages=messages)
answer = response.choices[0].message.content

display(Markdown(answer))
competitors.append(model_name)
answers.append(answer)
models.append(model)

Reconciling the ethical implications of deploying autonomous surveillance systems in public spaces with the need to preserve individual privacy rights is a complex task that requires a balanced approach. Here are some steps and considerations to address this challenge while taking into account diverse cultural attitudes:

### 1. **Public Consultation and Transparency**
- **Engage Stakeholders**: Involve the public, local communities, and privacy advocacy groups in discussions from the outset. Understand their concerns and expectations.
- **Transparency**: Clearly communicate how surveillance systems will be used, what data will be collected, who will have access to it, and for what purposes.

### 2. **Regulatory Frameworks**
- **Legal Compliance**: Ensure that the deployment of surveillance systems adheres to existing laws and regulations regarding privacy rights.
- **Data Protection**: Implement robust data protection measures, including encryption, anonymization, and secure storage practices. Ensure compliance with international standards like GDPR or CCPA.

### 3. **Cultural Sensitivity**
- **Contextual Understanding**: Recognize that attitudes towards privacy and security vary widely across cultures. For example, some societies may prioritize security over individual privacy.
- **Tailored Approaches**: Design surveillance policies and technologies to reflect the specific cultural norms and legal frameworks of different regions.

### 4. **Ethical Guidelines**
- **Proportionate Measures**: Ensure that the scope and intensity of surveillance are proportionate to the risks they aim to mitigate. Avoid over-surveillance.
- **Minimization Principles**: Collect only the data that is necessary for achieving legitimate purposes, and retain it for as short a period as possible.

### 5. **Technological Safeguards**
- **Data Anonymization**: Use techniques like facial blur and anonymization to protect individual identities in surveillance footage.
- **Audit Trails**: Implement systems that log access to surveillance data, ensuring accountability and deterring misuse.
- **Biometric Data Regulation**: Be cautious with biometrics; obtain explicit consent for the collection of such sensitive data.

### 6. **Oversight Mechanisms**
- **Independent Review**:
**Independent review bodies should periodically assess the impact of surveillance systems on privacy rights.**
   - Ombudsman: Appoint an ombudsman to oversee the ethical use of surveillance technologies.
- **Ethics Committees**: Establish ethics committees comprising diverse stakeholders (e.g., legal experts, ethicists, community representatives) to monitor and guide the deployment and operation of these systems.

### 7. **Educational Outreach**
- **Awareness Campaigns:** Educate the public about their rights and the purpose of surveillance systems to avoid misunderstandings.
- **Privacy Literacy**: Promote privacy literacy so that individuals understand how their data is being used and can make informed decisions.

### 8. **Continuous Evaluation**
- **Impact Assessments**: Conduct regular impact assessments to evaluate the effectiveness, fairness, and ethical implications of surveillance systems. Adjust policies as needed.
- **Feedback Loops**: Create mechanisms for continuous feedback from affected communities to adapt practices on an ongoing basis.

By integrating these considerations, it is possible to deploy autonomous surveillance systems in a manner that respects individual privacy rights while addressing legitimate security concerns across diverse cultural contexts.

## Comparison

In [14]:
print(competitors)
#print(answers)

['gpt-4.1-mini', 'claude-3-7-sonnet-latest', 'gemini-2.0-flash', 'deepseek-chat', 'llama-3.3-70b-versatile', 'mistral-small:latest']


In [15]:
#for competitor, answer in zip(competitors, answers):
#    print(f"Competitor: {competitor}\n\n{answer}")

In [16]:
# Prepare the final input for comparison
together = ""
for index, answer in enumerate(answers):
    together += f"# Response from competitor {index+1}\n\n"
    together += answer + "\n\n"
print(together)

# Response from competitor 1

Reconciling the ethical implications of deploying autonomous surveillance systems in public spaces with the preservation of individual privacy rights—especially in light of diverse cultural attitudes toward privacy and security—requires a multifaceted, context-sensitive approach. Here are key strategies and principles to consider:

### 1. **Principle of Proportionality and Necessity**
- **Assess the purpose:** Surveillance should be deployed only when there is a clear, compelling public interest (e.g., preventing serious crimes or ensuring public safety).
- **Limit the scope:** Collect only the data necessary for the stated purpose, avoiding excessive or indiscriminate monitoring.

### 2. **Transparency and Accountability**
- **Inform the public:** Clearly communicate what data is being collected, how it is used, and how long it is retained.
- **Enable oversight:** Establish independent oversight bodies that can audit the systems and processes, ensuring ac
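
The loop above labels each answer with a 1-based competitor number, so the judge sees numbers and not model names. The same string can be built by a small function, which makes the format easy to verify on toy input. This is a sketch only, and `build_together` is a hypothetical name.

```python
def build_together(answers):
    """Join all answers into one string, numbering competitors from 1."""
    parts = []
    for index, answer in enumerate(answers, start=1):
        parts.append(f"# Response from competitor {index}\n\n{answer}\n\n")
    return "".join(parts)

demo = build_together(["First answer", "Second answer"])
print(demo)
```

Because the numbering starts at 1, any code that maps a judge's ranking back to `competitors` has to subtract 1 before indexing.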

## LLMs as judge

In [17]:
judge = f"""You are judging a competition between {len(competitors)} competitors.
Each model has been given this question:

{question}

Your job is to evaluate each response for clarity and strength of argument, and rank them in order of best to worst.
Respond with JSON, and only JSON, with the following format:
{{"results": ["best competitor number", "second best competitor number", "third best competitor number", ...]}}

Here are the responses from each competitor:

{together}

Now respond with the ranked order of the competitors in a form of a Python list [number of the best competitor, number of second best competitor, ...], nothing else. Do not include markdown formatting, nor code blocks. Return only the list."""


In [75]:
#print(judge)

In [18]:
judge_messages = [{"role": "user", "content": judge}]

#### A single judge evaluation

In [19]:
openai = OpenAI()
response = openai.chat.completions.create(
    model="o3-mini",
    messages=judge_messages,
)
results = response.choices[0].message.content
print(results)

{"results": [3, 1, 4, 6, 2, 5]}


In [22]:
json.loads(results)

{'results': [3, 1, 4, 6, 2, 5]}

In [None]:
parsed = json.loads(results)
ranks = parsed["results"]
for index, result in enumerate(ranks):
    competitor = competitors[int(result)-1]
    print(f"Rank {index+1}: {competitor}")

Rank 1: gemini-2.0-flash
Rank 2: gpt-4.1-mini
Rank 3: deepseek-chat
Rank 4: mistral-small:latest
Rank 5: claude-3-7-sonnet-latest
Rank 6: llama-3.3-70b-versatile


#### Cross, Multi-LLM Evaluation

In [25]:
models

[<openai.OpenAI at 0x7364b4239a90>,
 <anthropic.Anthropic at 0x73649fa882f0>,
 <openai.OpenAI at 0x73649fa42350>,
 <openai.OpenAI at 0x73649fa40f50>,
 <openai.OpenAI at 0x73649fae43e0>,
 <openai.OpenAI at 0x73649fae7820>]

In [None]:
# Query each LLM for its ranking (sequential, blocking calls: one request at a time)

models_scores = []

for index, model in enumerate(models):

    if "claude" in competitors[index].lower():
        response = model.messages.create(model=competitors[index], messages=judge_messages, max_tokens=1000)
        model_scoring = response.content[0].text

    else:
        response = model.chat.completions.create(model=competitors[index], messages=judge_messages)
        model_scoring = response.choices[0].message.content

    models_scores.append(model_scoring)
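
The loop above sends one request at a time and stops at the first provider error. As a minimal sketch, assuming each request is wrapped in a zero-argument callable, the calls could instead run in threads. `ask_all`, `safe_call`, and `fake_calls` are hypothetical names used only for this illustration, and the lambdas stand in for real API calls.

```python
from concurrent.futures import ThreadPoolExecutor


def ask_all(judge_calls, max_workers=6):
    """Run zero-argument callables in threads; results keep the input order."""

    def safe_call(call):
        try:
            return call()
        except Exception as error:
            # Record the failure and keep going so one provider cannot abort the run
            print(f"Judge call failed: {error}")
            return None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the order of judge_calls, not completion order
        return list(pool.map(safe_call, judge_calls))


# Stand-ins for real API calls (assumption: each returns the reply text)
fake_calls = [lambda: "[1, 2]", lambda: '{"results": [2, 1]}']
replies = ask_all(fake_calls)
print(replies)
```

In the notebook, each callable would wrap one of the `create(...)` calls from the loop. Building them with `functools.partial` rather than a lambda inside the loop avoids the late-binding pitfall of closures.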

In [27]:
models_scores

['[3, 1, 4, 6, 2, 5]',
 '{"results": [3, 4, 1, 6, 2, 5]}',
 '["3", "4", "1", "6", "5", "2"]\n',
 '```json\n{"results": ["1", "4", "3", "6", "2", "5"]}\n```',
 '{"results": [1, 3, 2, 4, 6, 5]}',
 '[5, 4, 3, 6, 1, 2]']

In [28]:
# Standardize the scores to a common format
import ast
import json
import re

results = []

for item in models_scores:
    item = item.strip()
    
    # JSON wrapped in markdown
    if item.startswith('```json'):
        clean_json = re.sub(r'^```json\n|```$', '', item).strip()
        parsed = json.loads(clean_json)
        results.append([int(x) for x in parsed['results']])
    
    # Plain JSON with results list
    elif item.startswith('{') and '"results"' in item:
        parsed = json.loads(item)
        results.append([int(x) for x in parsed['results']])
    
    # Python-style list string
    elif item.startswith('[') and item.endswith(']'):
        parsed = ast.literal_eval(item)
        results.append([int(x) for x in parsed])
    
    else:
        raise ValueError(f"Unrecognized format: {item}")

print(results)

[[3, 1, 4, 6, 2, 5], [3, 4, 1, 6, 2, 5], [3, 4, 1, 6, 5, 2], [1, 4, 3, 6, 2, 5], [1, 3, 2, 4, 6, 5], [5, 4, 3, 6, 1, 2]]
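
The replies above arrive in several shapes: bare lists, JSON objects, quoted numbers, and markdown-fenced JSON. A possible alternative to branching on each shape is to pull out the first bracketed list and validate it. This is only a sketch: `parse_ranking` is a hypothetical helper, and it assumes every reply contains exactly one flat list of competitor numbers.

```python
import ast
import re


def parse_ranking(text, n_competitors):
    """Extract a ranking (1-based competitor numbers, best first) from a judge reply."""
    # Take the first flat [...] span; this skips markdown fences and a {"results": ...} wrapper
    match = re.search(r"\[[^\[\]]*\]", text)
    if match is None:
        raise ValueError(f"No list found in: {text!r}")
    # literal_eval accepts both [3, 1, 2] and ["3", "1", "2"]
    ranking = [int(x) for x in ast.literal_eval(match.group(0))]
    # A valid ranking names every competitor exactly once
    if sorted(ranking) != list(range(1, n_competitors + 1)):
        raise ValueError(f"Not a permutation of 1..{n_competitors}: {ranking}")
    return ranking


# Four of the reply shapes seen in models_scores
replies = [
    '[3, 1, 4, 6, 2, 5]',
    '{"results": [3, 4, 1, 6, 2, 5]}',
    '["3", "4", "1", "6", "5", "2"]\n',
    '```json\n{"results": ["1", "4", "3", "6", "2", "5"]}\n```',
]
parsed_rankings = [parse_ranking(reply, 6) for reply in replies]
print(parsed_rankings)
```

The permutation check catches a judge that repeats or omits a competitor, which would otherwise skew the averages computed later.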


In [29]:
print("Model ranking results:")
for index, model in enumerate(models):
    print(f"{competitors[index]}")
    
    for index2, result in enumerate(results[index]):
        competitor = competitors[int(result)-1]
        print(f"   Rank {index2+1}: {competitor}")

Model ranking results:
gpt-4.1-mini
   Rank 1: gemini-2.0-flash
   Rank 2: gpt-4.1-mini
   Rank 3: deepseek-chat
   Rank 4: mistral-small:latest
   Rank 5: claude-3-7-sonnet-latest
   Rank 6: llama-3.3-70b-versatile
claude-3-7-sonnet-latest
   Rank 1: gemini-2.0-flash
   Rank 2: deepseek-chat
   Rank 3: gpt-4.1-mini
   Rank 4: mistral-small:latest
   Rank 5: claude-3-7-sonnet-latest
   Rank 6: llama-3.3-70b-versatile
gemini-2.0-flash
   Rank 1: gemini-2.0-flash
   Rank 2: deepseek-chat
   Rank 3: gpt-4.1-mini
   Rank 4: mistral-small:latest
   Rank 5: llama-3.3-70b-versatile
   Rank 6: claude-3-7-sonnet-latest
deepseek-chat
   Rank 1: gpt-4.1-mini
   Rank 2: deepseek-chat
   Rank 3: gemini-2.0-flash
   Rank 4: mistral-small:latest
   Rank 5: claude-3-7-sonnet-latest
   Rank 6: llama-3.3-70b-versatile
llama-3.3-70b-versatile
   Rank 1: gpt-4.1-mini
   Rank 2: gemini-2.0-flash
   Rank 3: claude-3-7-sonnet-latest
   Rank 4: deepseek-chat
   Rank 5: mistral-small:latest
   Rank 6: llama-3.

In [None]:
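# Average the values found at each rank position across all judges.
# Caution: each value is a competitor number, so this yields the mean competitor
# number per position, not the average rank of each competitor.
# Illustration with four judges:
# Index 0 values → [1, 2, 1, 2]
# Index 1 values → [2, 1, 2, 1]
# Index 2 values → [3, 3, 3, 3]
# Index 3 values → [4, 3, 2, 3]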

from statistics import mean

data = results

# Find the maximum length among the lists (normally all lists should be of the same length)
max_len = max(len(sublist) for sublist in data)

# Compute the average for each index across all lists
averages = []
for i in range(max_len):
    values_at_i = [sublist[i] for sublist in data if i < len(sublist)]
    averages.append(mean(values_at_i))

print(averages)

[2.6666666666666665, 3.3333333333333335, 2.3333333333333335, 5.666666666666667, 3, 4]


In [108]:
# Transpose the data and calculate the average of each column
averages = [sum(col) / len(col) for col in zip(*results)]
print(averages)

[2.6666666666666665, 3.3333333333333335, 2.3333333333333335, 5.666666666666667, 3.0, 4.0]


In [114]:
import statistics

# Transpose and compute medians
medians = [statistics.median(col) for col in zip(*results)]
print(medians)

[3.0, 4.0, 2.5, 6.0, 2.0, 5.0]


In [109]:
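# Rank the positional averages from smallest to largest (equal values share a rank).
# These ranks describe rank positions, not competitors; the per-competitor
# ranking is computed further below.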

values = averages
sorted_values = sorted(set(values))
rank_dict = {v: i + 1 for i, v in enumerate(sorted_values)}
ranks = [rank_dict[v] for v in values]

print(competitors)
print(ranks)

['gpt-4.1-mini', 'claude-3-7-sonnet-latest', 'gemini-2.0-flash', 'deepseek-chat', 'llama-3.3-70b-versatile', 'mistral-small:latest']
[2, 4, 1, 6, 3, 5]


In [120]:
rankings

[[3, 1, 4, 6, 2, 5],
 [3, 4, 1, 6, 2, 5],
 [3, 4, 1, 6, 5, 2],
 [1, 4, 3, 6, 2, 5],
 [1, 3, 2, 4, 6, 5],
 [5, 4, 3, 6, 1, 2]]

In [121]:
from collections import defaultdict

# One ranking per judge, best competitor first (same data as `results` above)
rankings = results

# Gather all positions per competitor
positions = defaultdict(list)

for race in rankings:
    for place, competitor_id in enumerate(race):
        positions[competitor_id].append(place + 1)  # +1 because place 0 = 1st

# Compute average position (lower is better)
average_ranks = {
    competitor: sum(places) / len(places)
    for competitor, places in positions.items()
}

# Sort competitors by average rank
ranked_competitors = sorted(average_ranks.items(), key=lambda x: x[1])

# Output
for i, (competitor, avg_rank) in enumerate(ranked_competitors, start=1):
    print(f"{i}. {competitors[competitor-1]} -- average rank = {avg_rank:.2f}")

1. gemini-2.0-flash -- average rank = 1.83
2. gpt-4.1-mini -- average rank = 2.50
3. deepseek-chat -- average rank = 2.50
4. mistral-small:latest -- average rank = 4.17
5. claude-3-7-sonnet-latest -- average rank = 5.00
6. llama-3.3-70b-versatile -- average rank = 5.00
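
The output above contains two ties (2.50 and 5.00), yet the tied models are printed with different positions. One way to make the ties visible is standard competition ranking ("1, 2, 2, 4"), sketched below. The data is copied from the cell outputs in this notebook. `final` is a name introduced here for illustration, and the choice of tie rule is an assumption rather than part of the original analysis.

```python
from collections import defaultdict

# Rankings copied from the output above: one list per judge, best competitor first
rankings = [
    [3, 1, 4, 6, 2, 5],
    [3, 4, 1, 6, 2, 5],
    [3, 4, 1, 6, 5, 2],
    [1, 4, 3, 6, 2, 5],
    [1, 3, 2, 4, 6, 5],
    [5, 4, 3, 6, 1, 2],
]
competitors = [
    "gpt-4.1-mini", "claude-3-7-sonnet-latest", "gemini-2.0-flash",
    "deepseek-chat", "llama-3.3-70b-versatile", "mistral-small:latest",
]

# Collect the place each competitor received from every judge
positions = defaultdict(list)
for race in rankings:
    for place, competitor_id in enumerate(race, start=1):
        positions[competitor_id].append(place)

average_ranks = {cid: sum(p) / len(p) for cid, p in positions.items()}
ordered = sorted(average_ranks.items(), key=lambda x: x[1])

# Competition ranking: equal averages share a rank, and the next rank is skipped
final = []
for i, (cid, avg) in enumerate(ordered, start=1):
    rank = final[-1][0] if final and avg == final[-1][2] else i
    final.append((rank, competitors[cid - 1], avg))

for rank, name, avg in final:
    print(f"{rank}. {name} -- average rank = {avg:.2f}")
```

Comparing the averages with `==` is safe here only because every judge ranks every competitor, so equal sums are divided by the same count.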
