# POC: Length calibration

Context: The system tends to increase README length. When I measured, every update it suggested (100%) increased the overall line count. Most PRs should need only minimal README changes, in proportion to the scope of the PR.

The idea is to measure how extensive the code changes are and use that to set expectations for how extensively the README should change. For example:
PR edited 10% of the files of the repo and increased the overall repo size by 5%
Expected README change: Adjust by +/- 10% length
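
A minimal sketch of how the example above could be turned into a length budget. The rule used here (take the larger of the two signals) is an assumption that happens to reproduce the example. It is not a measured calibration, and the names `expected_readme_budget` and `allowed_line_range` are hypothetical.

```python
def expected_readme_budget(files_changed_fraction, repo_growth_fraction, cap=1.0):
    """Return the allowed relative README length change (0.10 means +/- 10%).

    Assumption: the larger of the two signals sets the budget.
    """
    budget = max(abs(files_changed_fraction), abs(repo_growth_fraction))
    return min(budget, cap)


def allowed_line_range(current_readme_lines, budget):
    """Translate a relative budget into a (min, max) README line count."""
    delta = round(current_readme_lines * budget)
    return max(current_readme_lines - delta, 0), current_readme_lines + delta


# The example from the text: 10% of files edited, repo size up by 5%.
budget = expected_readme_budget(0.10, 0.05)
low, high = allowed_line_range(200, budget)  # 200 lines is a made-up README length
print(f"Budget: +/- {budget:.0%}, allowed README length: {low} to {high} lines")
```

With a hypothetical 200-line README, a 10% budget allows anything between 180 and 220 lines.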

In [2]:
from main import github_client

REPO_NAME, PR_NUMBER = ('locustio/locust', 2856)

# Get the repository
repo = github_client.get_repo(REPO_NAME)

# Get the pull request
pull_request = repo.get_pull(PR_NUMBER)

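# Get the latest commit on the PR's base branch
# (this is the current branch head, which may be newer than the commit the PR was opened against)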
base_branch = pull_request.base.ref
base_commit = repo.get_branch(base_branch).commit

# Get the tree for the base commit to count total files in the repository
base_tree = base_commit.commit.tree

def count_files_in_tree(tree):
    """Count the files (blobs) under a top-level, non-recursive Git tree.

    Each top-level directory is expanded with its own recursive API request.
    """
    file_count = 0
    for item in tree.tree:
        if item.type == 'blob':  # A file at the repository root
            file_count += 1
        elif item.type == 'tree':  # A directory: fetch its full subtree and count the files
            subtree = repo.get_git_tree(item.sha, recursive=True)
            file_count += sum(1 for i in subtree.tree if i.type == 'blob')
    return file_count

total_files_in_base = count_files_in_tree(base_tree)

# Get the number of files changed in the pull request
changed_files_in_pr = pull_request.changed_files

# Calculate the fraction of files changed (formatted as a percentage when printed)
percentage_files_changed = changed_files_in_pr / total_files_in_base

print(f"Total files in base branch: {total_files_in_base}")
print(f"Files changed in PR: {changed_files_in_pr}")
print(f"Percentage of files changed: {percentage_files_changed:.1%}")

# Note: This takes 5 sec to run

Total files in base branch: 346
Files changed in PR: 1
Percentage of files changed: 0.3%


In [3]:
def count_pr_diff(pull_request):
    """Return (lines_added, lines_removed) summed over every file in the PR."""
    lines_added = 0
    lines_removed = 0
    # get_files() yields one entry per changed file, each with its own line counts
    for file in pull_request.get_files():
        lines_added += file.additions
        lines_removed += file.deletions
    return lines_added, lines_removed

count_pr_diff(pull_request)

(36, 47)
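
The `(lines_added, lines_removed)` pair could be reduced to the signals the calibration idea needs. The sketch below is one possible way to do that: `summarize_diff` is a hypothetical helper, and `total_repo_lines` is an assumed input that nothing above computes yet.

```python
def summarize_diff(lines_added, lines_removed, total_repo_lines=None):
    """Summarize a PR diff as net change, churn, and optional relative growth."""
    net_change = lines_added - lines_removed  # positive means the repo grew
    churn = lines_added + lines_removed       # total number of lines touched
    growth_fraction = None
    if total_repo_lines:
        # Assumption: repo size is measured in lines; it would have to be computed separately.
        growth_fraction = net_change / total_repo_lines
    return {
        "net_change": net_change,
        "churn": churn,
        "growth_fraction": growth_fraction,
    }


# Values from the cell output above
summary = summarize_diff(36, 47)
print(summary)
```

For this PR the net change is -11 lines with a churn of 83, so the code shrank slightly. On these numbers alone, a README update that adds lines would be hard to justify.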

In [4]:
# get_languages() is fast to run, but it returns byte counts per language rather than file or line counts.
# In theory we could approach the whole problem differently:
# 1. Compute the ideal README length from the get_languages() values
# 2. Compute the actual README length
# 3. Tell the LLM about the constraint (e.g. the README is a bit too short overall, so it can add up to 10% more)

repo.get_languages()

{'Python': 1027990,
 'TypeScript': 214667,
 'HTML': 3201,
 'Dockerfile': 1642,
 'Makefile': 1458}
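
The three steps listed in the cell comments could be sketched roughly as follows. The log-scaled heuristic in `ideal_readme_lines`, its default parameters, and the wording returned by `length_constraint` are all assumptions made for illustration. None of them has been calibrated against real READMEs.

```python
import math

# Byte counts copied from the get_languages() output above
languages = {
    "Python": 1027990,
    "TypeScript": 214667,
    "HTML": 3201,
    "Dockerfile": 1642,
    "Makefile": 1458,
}


def ideal_readme_lines(language_bytes, lines_per_doubling=25, baseline_bytes=10_000):
    """Step 1 (hypothetical heuristic): README length grows with log2 of code size."""
    total_bytes = sum(language_bytes.values())
    if total_bytes <= baseline_bytes:
        return lines_per_doubling
    return round(lines_per_doubling * (1 + math.log2(total_bytes / baseline_bytes)))


def length_constraint(actual_lines, ideal_lines, tolerance=0.10):
    """Steps 2 and 3: compare actual and ideal length and phrase a constraint for the LLM."""
    if actual_lines < ideal_lines * (1 - tolerance):
        return f"The README is shorter than expected; it may grow by up to {tolerance:.0%}."
    if actual_lines > ideal_lines * (1 + tolerance):
        return f"The README is longer than expected; prefer edits that shrink it by up to {tolerance:.0%}."
    return "The README length is about right; keep the overall length roughly unchanged."


ideal = ideal_readme_lines(languages)
# 120 is a made-up README length; the real value would come from the README file itself.
print(length_constraint(actual_lines=120, ideal_lines=ideal))
```

The returned sentence is meant to be appended to the prompt, so the model gets an explicit length target and no longer defaults to adding lines.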