# GitHub Models

This notebook tests the GitHub Models APIs to identify any limitations before redoing the implementation.

In [48]:
import os
from azure.ai.inference import ChatCompletionsClient
from azure.ai.inference.models import SystemMessage, UserMessage, JsonSchemaFormat
from azure.core.credentials import AzureKeyCredential

client = ChatCompletionsClient(
    endpoint="https://models.inference.ai.azure.com",
    credential=AzureKeyCredential(os.environ["GITHUB_TOKEN"]),
    # Needed for structured output
    api_version="2024-08-01-preview",
)


In [None]:
models_to_try = [
    "Llama-3.3-70B-Instruct",
    "Meta-Llama-3.1-405B-Instruct",
    "gpt-4o",
    "gpt-4o-mini",
    "o1-mini",
    "Phi-3.5-MoE-instruct"
]
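
One way to compare the models in this list is to send each of them the same request and record how the call ends. The sketch below is an assumption about how such a loop might look, not part of the original notebook: `survey_models` and `complete_fn` are hypothetical names, and `complete_fn` stands in for a small wrapper around `client.complete` that takes only the model name.

```python
from typing import Callable


def survey_models(
    complete_fn: Callable[[str], object], models: list[str]
) -> dict[str, str]:
    """Call complete_fn once per model and record the finish reason or the error."""
    results: dict[str, str] = {}
    for model in models:
        try:
            response = complete_fn(model)
            # Only the first choice is inspected, as elsewhere in this notebook.
            results[model] = str(response.choices[0].finish_reason)
        except Exception as exc:  # e.g. unsupported parameters or rate limits
            results[model] = f"error: {type(exc).__name__}: {exc}"
    return results
```

With the client defined earlier, a call could look like `survey_models(lambda m: client.complete(messages=[UserMessage(content="Hello")], model=m), models_to_try)`. Catching the exception per model keeps one unsupported model from aborting the whole comparison.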

In [49]:
response = client.complete(
    messages=[
        SystemMessage(content=""),
        UserMessage(content="Can you explain the basics of machine learning?"),
    ],
    model="gpt-4o-mini",
    temperature=0.2,
    max_tokens=2048,
    top_p=0.1
)

print(response.choices[0].message.content)


Certainly! Machine learning (ML) is a subset of artificial intelligence (AI) that focuses on the development of algorithms and statistical models that enable computers to perform tasks without explicit instructions. Instead, they learn from data and improve their performance over time. Here are the basics:

### Key Concepts

1. **Data**: The foundation of machine learning. Data can be structured (like tables in a database) or unstructured (like images, text, or audio). The quality and quantity of data significantly impact the performance of ML models.

2. **Features**: These are individual measurable properties or characteristics of the data. In a dataset, features are the input variables used to make predictions.

3. **Labels**: In supervised learning, labels are the output or target variable that the model is trying to predict. For example, in a dataset predicting house prices, the price would be the label.

4. **Training and Testing**: The dataset is typically split into two parts:


In [50]:
from main import readme_guidelines, ReadmeRecommendation

def fill_prompt(
    readme: str, pull_request_markdown: str, feedback: str
) -> tuple[SystemMessage, UserMessage]:
    return SystemMessage(
                content=f"""
You'll review a pull request and determine if the README should be updated, then suggest appropriate changes.
The README should be updated if it contains outdated information or if the pull request introduces major new features that are similar to those currently documented in the README.

When updating the README, be sure to:
* Keep the language timeless. Do not reference "recent" or "recently."
* Focus on the current state of the project features and requirements.

{readme_guidelines}

"""), UserMessage(
                content=f"""
# Existing README
{readme}

# Pull request changes
{pull_request_markdown}

# Optional User Feedback about README updates
{feedback}

# Task
Based on the above information, please provide a structured output indicating:
A) should_update: Should the README be updated?
B) reason: Why?
C) updated_readme: The updated README content (if applicable)
"""
)

fill_prompt("# DEFAULT README", "# PR STUFF", "")

({'role': 'system', 'content': '\nYou\'ll review a pull request and determine if the README should be updated, then suggest appropriate changes.\nThe README should be updated if it contains outdated information or if the pull request introduces major new features that are similar to those currently documented in the README.\n\nWhen updating the README, be sure to:\n* Keep the language timeless. Do not reference "recent" or "recently."\n* Focus on the current state of the project features and requirements.\n\n\n# README Guidelines\n\n## Provide a Brief Overview of the Project\nInclude a brief but informative description of your project\'s purpose, functionality, and goals. This helps users quickly grasp the value of your project and determine if it\'s relevant to their needs.\nExample: A user-friendly weather forecasting app that provides real-time data, daily forecasts, and weather alerts for locations worldwide.\n\n## Installation and Setup\nList Prerequisites and System Requirements\

In [51]:
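from github import Github
# Import the classes rather than the modules of the same name so the type hints are accurate
from github.PullRequest import PullRequest
from github.Repository import Repository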

url = "https://github.com/ktrnka/company-detective/pull/44"

g = Github(os.environ["GITHUB_TOKEN"])

def parse_pr_link(github_client: Github, url: str) -> tuple[Repository, PullRequest]:
    # TODO: Improve this code to be more robust
    repo_name = '/'.join(url.split('/')[-4:-2])
    pr_number = int(url.split('/')[-1])

    repo = github_client.get_repo(repo_name)
    pull_request = repo.get_pull(pr_number)
    return repo, pull_request

repo, pr = parse_pr_link(g, url)
repo, pr

(Repository(full_name="ktrnka/company-detective"), PullRequest(title="Make sure Google Play scraping respects the num reviews param if poss…", number=44))
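
The slicing in `parse_pr_link` assumes the URL ends exactly at the pull request number, so a link with a suffix such as `/files` would be parsed incorrectly. As one possible way to address the TODO, the sketch below validates the URL with `urllib.parse` and a regular expression before any API call is made. `split_pr_url` is a hypothetical helper name, and restricting the host to `github.com` is an assumption that would not hold for GitHub Enterprise.

```python
import re
from urllib.parse import urlparse

# /<owner>/<repo>/pull/<number>, optionally followed by a sub-page such as /files
_PR_PATH = re.compile(
    r"^/(?P<owner>[^/]+)/(?P<repo>[^/]+)/pull/(?P<number>\d+)(?:/.*)?$"
)


def split_pr_url(url: str) -> tuple[str, int]:
    """Return ("owner/repo", pr_number) or raise ValueError for non-PR URLs."""
    parsed = urlparse(url)
    if parsed.netloc.lower() not in {"github.com", "www.github.com"}:
        raise ValueError(f"Not a github.com URL: {url}")
    match = _PR_PATH.match(parsed.path)
    if match is None:
        raise ValueError(f"Not a pull request URL: {url}")
    return f"{match['owner']}/{match['repo']}", int(match["number"])
```

The two return values can be passed to `github_client.get_repo(...)` and `repo.get_pull(...)` in place of the manual string slicing.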

In [52]:
def get_readme(repo: Repository, pr: PullRequest, use_base_readme=False) -> str:
    return repo.get_contents(
            "README.md", ref=pr.base.sha if use_base_readme else pr.head.sha
        ).decoded_content.decode()

print(get_readme(*parse_pr_link(g, url)))

# Company Detective

This project summarizes publicly available information about a company. It leverages various APIs to gather and analyze data, providing a comprehensive overview of the target company.

Live site: https://ktrnka.github.io/company-detective

![System diagram](system_diagram.png)

## Features

- Multiple information sources including Crunchbase, news articles, and company websites
- Utilizes AI to analyze and summarize information
- Configured via Airtable
- Google Analytics integration for tracking user interactions and site performance.

## Prerequisites

- Python 3.11 or higher
- uv (Astral's Python package installer and resolver)

## API Keys Required

This project requires API keys for the following services:

- OpenAI
- Reddit
- Google Custom Search Engine
- Scrapfly
- AWS
- Langsmith (Optional)
- Crunchbase (via Scrapfly)
- Airtable

Ensure you have obtained the necessary API keys before proceeding with the setup. The project is designed to handle missing API k

In [53]:
from main import pull_request_to_markdown

print(pull_request_to_markdown(pr))


## [Make sure Google Play scraping respects the num reviews param if poss…])https://github.com/ktrnka/company-detective/pull/44
Oops I meant to include this in the previous PR

### Commit Messages
- Make sure Google Play scraping respects the num reviews param if possible
### src/data_sources/app_stores/google_play.py
@@ -134,8 +134,6 @@ def scrape_app_info(app_id: str) -> GooglePlayAppInfo:
 
 
 def scrape_reviews(app_id: str, num_reviews=100) -> List[GooglePlayReview]:
-    assert num_reviews <= 100, "Google Play scraping is only implemented to fetch 100 reviews at most"
-
     review_cache = CollectionCache(cache, ttl=timedelta(days=14))
 
     cache_key = f"google_play_reviews:{app_id}"
@@ -147,8 +145,8 @@ def scrape_reviews(app_id: str, num_reviews=100) -> List[GooglePlayReview]:
             lang="en",
             country="us",
             sort=google_play_scraper.Sort.NEWEST,
-            # TODO: See if there's any sort of throttling on this
-            count=num_reviews,
+ 

In [54]:
from main import ReadmeRecommendation

In [55]:
ReadmeRecommendation.model_json_schema()

{'description': 'Structured output for the README review task', 'properties': {'should_update': {'description': 'Whether the README should be updated or not', 'title': 'Should Update', 'type': 'boolean'}, 'reason': {'description': 'Reason for the recommendation', 'title': 'Reason', 'type': 'string'}, 'updated_readme': {'anyOf': [{'type': 'string'}, {'type': 'null'}], 'default': None, 'description': 'Updated README content, required if should_update is True, otherwise optional', 'title': 'Updated Readme'}}, 'required': ['should_update', 'reason'], 'title': 'ReadmeRecommendation', 'type': 'object'}
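
The schema above lists only `should_update` and `reason` under `required` and has no `additionalProperties` key. OpenAI's strict mode for structured outputs expects every property to be listed in `required` and `additionalProperties` to be `false`. Whether the GitHub Models endpoint enforces the same rules for each model is an assumption here. The sketch below shows one way to adjust a flat schema like this one. `make_strict` is a hypothetical helper that handles only the top-level object, not nested models.

```python
import copy


def make_strict(schema: dict) -> dict:
    """Return a copy of a flat object schema adjusted for strict mode."""
    strict = copy.deepcopy(schema)
    properties = strict.get("properties", {})
    # Strict mode expects every property to be required; optional values are
    # expressed through a nullable type such as anyOf [string, null].
    strict["required"] = list(properties)
    strict["additionalProperties"] = False
    for prop in properties.values():
        # Assumption: defaults are not accepted in strict mode, so drop them.
        prop.pop("default", None)
    return strict
```

The result of `make_strict(ReadmeRecommendation.model_json_schema())` could then be passed as the `schema` argument of `JsonSchemaFormat`. Working on a deep copy leaves the schema generated by Pydantic untouched.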

In [58]:
import json
from typing import Optional
from azure.ai.inference.models import JsonSchemaFormat
from pydantic import BaseModel, Field

repo, pr = parse_pr_link(g, url)
readme_content = get_readme(repo, pr)
pr_content = pull_request_to_markdown(pr)

class ReadmeRecommendation2(BaseModel):
    """
    Structured output for the README review task
    """

    should_update: bool = Field(
        description="Whether the README should be updated or not"
    )
    reason: str = Field(description="Reason for the recommendation")
    updated_readme: Optional[str] = Field(
        description="Updated README content, required if should_update is True, otherwise optional",
    )


response = client.complete(
    messages=fill_prompt(readme_content, pr_content, ""),
    model="gpt-4o-mini",
    temperature=0.2,
    # The max on my tier
    max_tokens=4000,
    top_p=0.1,
    response_format=JsonSchemaFormat(
        # Use the model defined in this cell; it was otherwise unused
        name=ReadmeRecommendation2.__name__,
        schema=ReadmeRecommendation2.model_json_schema(),
        description="Structured output for recommendations about possible README updates",
        strict=True
    )
)

json_response_message = json.loads(response.choices[0].message.content)

print(response.choices[0].message.content)


NameError: name 'BaseModel' is not defined

In [44]:
response

{'choices': [{'finish_reason': 'length', 'index': 0, 'message': {'content': '{"description": "Structured output for the README review task", "properties": {"should_update": {"description": "Whether the README should be updated or not", "title": "Should Update", "type": "boolean"}, "reason": {"description": "Reason for the recommendation", "title": "Reason", "type": "string"}, "updated_readme": {"anyOf": [{"type": "string"}, {"type": "null"}], "default": null, "description": "Updated README content, required if should_update is True, otherwise optional", "title": "Updated Readme"}}, "required": ["should_update", "reason"], "title": "ReadmeRecommendation", "type": "object"}\n\n\n\n\n\n\n\n\n\n\n\n\n \n\n\n\n\n \n\n\n\n\n \n \n\n\n\n\n \n \n\n\n\n\n\n \n \n\n\n\n\n\n \n \n\n\n\n\n\n \n \n\n\n\n\n\n \n \n\n\n\n\n\n \n \n\n\n\n\n\n \n \n\n\n\n\n\n \n \n\n\n\n\n\n \n \n\n\n\n\n\n \n \n\n\n\n\n\n \n \n\n\n\n\n\n \n \n\n\n\n\n\n \n \n\n\n\n\n\n \n \n\n\n\n\n\n \n \n\n\n\n\n\n \n \n\n\n\n\n\n \

In [22]:
from pprint import pprint

type(response)

<class 'azure.ai.inference.models._patch.ChatCompletions'>

In [45]:
# Things that are useful to check
print(f"Model: {response.model}")
print(f"Usage: {response.usage}")

choice = response.choices[0]
print(f"Finish reason: {choice.finish_reason}")

Model: Llama-3.3-70B-Instruct
Usage: {'completion_tokens': 4000, 'prompt_tokens': 1924, 'total_tokens': 5924}
Finish reason: CompletionsFinishReason.TOKEN_LIMIT_REACHED
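
Because this response stopped at the token limit, its content cannot be trusted to be complete JSON, and calling `json.loads` on it directly may fail or silently accept the wrong thing. The sketch below is one way to guard the parsing step. `parse_json_content` is a hypothetical helper, and it assumes the finish reason compares equal to its raw string value (`"length"` in the response dump).

```python
import json


def parse_json_content(content: str, finish_reason: str) -> dict:
    """Parse a structured-output message, refusing truncated or non-object output."""
    # "length" is the raw value shown in the response dump for a token-limit stop.
    if finish_reason == "length":
        raise ValueError("Response hit the token limit; JSON is likely incomplete")
    try:
        parsed = json.loads(content)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Response is not valid JSON: {exc}") from exc
    if not isinstance(parsed, dict):
        raise ValueError("Expected a JSON object at the top level")
    return parsed
```

A call such as `parse_json_content(choice.message.content, choice.finish_reason)` would raise a `ValueError` for the truncated response above and for any reply that arrives as Markdown instead of JSON.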


In [25]:
response

{'choices': [{'finish_reason': 'stop', 'index': 0, 'message': {'content': '## Task Output\n\nA) **should_update**: No\nB) **reason**: The pull request introduces a minor change to the Google Play scraping functionality, which does not significantly alter the project\'s features or requirements. The existing README already documents the project\'s purpose, functionality, and goals, and the change does not warrant a major update.\nC) **updated_readme**: Not applicable, as no update is recommended.\n\nHowever, it\'s worth noting that the README could be improved in some areas, such as:\n\n* Adding more details about the project\'s goals and motivations\n* Providing a clearer explanation of the project\'s architecture and technical stack\n* Including more information about the project\'s testing and development processes\n* Updating the "Note" section to reflect the project\'s current status and any plans for future development\n\nBut these changes are not directly related to the pull requ

In [None]:
from azure.ai.inference.models import CompletionsFinishReason

if choice.finish_reason == CompletionsFinishReason.STOPPED:
    print(f"Stop reason: {choice.get('stop_reason')}")