
## **Program Description**

We are building a command-line application that summarizes information from GitHub. The application will create summaries and visualizations describing statistics across all repositories on GitHub, and will also allow the user to request a summary of a specific repository or a specific GitHub user.

## **Requirements**

* The application will be able to pull and summarize data about all repositories on GitHub.

* The application will be able to pull data for any specific repository requested by the user. For this repository, we will look at all of the pull requests that have occurred.

  * For each pull request we will store:
    * Pull request title
    * Pull request number
    * Body
    * State
    * Date of creation (created_at)
    * Closing date (if the state is different than open)
    * User

* The application will also pull data for each author (user) found in the pull requests for this repository. For each user, we will collect the following information from the pull requests and/or their profile page on GitHub:
  * Number of pull requests submitted to this repository
  * Number of repositories the user has contributed to
  * Number of followers
  * Number of users they are following
  * Number of contributions across all repositories in the last year

* You must develop a function called `save_as_csv` that can be reused to convert any object to a CSV entry (row). The function receives the file name and the object to be converted. If the file does not exist, create it (with a header). If the file exists, append a new line with the object to the CSV. To make this possible, each of your classes needs a method with the very same name that returns a string with the data already structured as CSV. Use this function to create/update the files as follows (NO REPEATED ENTRIES):
  * When you collect data from a repository, add it to a CSV called `repositories.csv`.
  * When you collect the pull requests of a repository, store them in a file named after the owner and the name of the repository (`repos/owner-repo.csv`).
  * When you collect data from users, add it to a CSV called `users.csv`.
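
One possible shape for `save_as_csv` is sketched below. This is only a minimal sketch under stated assumptions: the `ExampleUser` class and its `csv_header` helper are hypothetical stand-ins (the requirement only names the `save_as_csv` method), and the duplicate check simply compares whole lines.

```python
import csv
import io
import os
import tempfile


class ExampleUser:
    """Hypothetical stand-in for one of the project's classes."""

    def __init__(self, login, followers, following):
        self.login = login
        self.followers = followers
        self.following = following

    def csv_header(self):
        # Hypothetical helper: column names in the same order as the row
        return "login,followers,following"

    def save_as_csv(self):
        # Let the csv module handle quoting, then drop the trailing newline
        buffer = io.StringIO()
        csv.writer(buffer).writerow([self.login, self.followers, self.following])
        return buffer.getvalue().rstrip("\r\n")


def save_as_csv(file_name, obj):
    """Append obj to file_name as one CSV row.

    Returns True if a row was written and False if it was already present.
    """
    row = obj.save_as_csv()
    file_exists = os.path.exists(file_name)

    if file_exists:
        with open(file_name, newline="") as csv_file:
            if row in csv_file.read().splitlines():
                return False  # no repeated entries

    with open(file_name, "a", newline="") as csv_file:
        if not file_exists:
            # New file: write the header before the first row
            csv_file.write(obj.csv_header() + "\n")
        csv_file.write(row + "\n")
    return True


# Example use in a temporary directory
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "users.csv")
    user = ExampleUser("octocat", 10, 2)
    first_write = save_as_csv(path, user)   # creates the file with a header
    second_write = save_as_csv(path, user)  # same row again, so it is skipped
    with open(path, newline="") as csv_file:
        contents = csv_file.read()

print(first_write, second_write)
print(contents)
```

The line-based duplicate check would not be reliable for fields that contain line breaks (pull request bodies often do), so the real implementation may need to compare on a key such as the pull request number instead.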


### **Menu Options**
You should have a "Main Menu" and various "Sub Menus" built into the application.

**Main Menu**

* Option 1: Summarize info about all repositories on GitHub. This will produce visualizations of the following:
   * A line graph showing the total number of pull requests per day
   * A line graph comparing the number of open and closed pull requests per day
   * A bar plot comparing the number of users per repository


* Option 2: Request data for a specific repository (from GitHub) by providing the owner and repository name.

* Option 3: View a list of repositories you have requested in this session.

* Option 4: Exit the program
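
The main menu could be driven by a small loop that maps each option to a function. The sketch below is one possible layout, not a required design: `run_main_menu`, the `handlers` dictionary, and the `read_input`/`write_output` parameters are all hypothetical names, and the handlers only record that they were called.

```python
def run_main_menu(handlers, read_input=input, write_output=print):
    """Show the main menu until the user picks option 4 (exit)."""
    menu_text = (
        "1: Summarize info about all repositories\n"
        "2: Request data for a specific repository\n"
        "3: View repositories requested in this session\n"
        "4: Exit the program"
    )
    while True:
        write_output(menu_text)
        choice = read_input("Select an option (1-4): ").strip()
        if choice == "4":
            write_output("Goodbye!")
            break
        handler = handlers.get(choice)
        if handler is None:
            write_output(f"'{choice}' is not a valid option.")
        else:
            handler()


# Example use with scripted input instead of a real keyboard
calls = []
handlers = {
    "1": lambda: calls.append("summarize_all"),
    "2": lambda: calls.append("request_repo"),
    "3": lambda: calls.append("list_repos"),
}
scripted_choices = iter(["2", "9", "4"])
messages = []
run_main_menu(
    handlers,
    read_input=lambda prompt: next(scripted_choices),
    write_output=messages.append,
)
print(calls)
```

Passing the input and output functions as parameters keeps the loop easy to unit test, since a test can feed it scripted choices.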

**Submenus**
* For any repository you have collected data on, you should be able to select that repository, then perform any of the following actions:

  * Show all pull requests from a certain repository
  * Show the summary of a repository. Summary must contain:
    * Number of pull requests in `open` state
    * Number of pull requests in `closed` state
    * Number of users
    * Date of the oldest pull request
    * Create and store visual representation data about the repository (via pandas)
      * A boxplot that compares closed vs. open pull requests in terms of number of commits
      * A boxplot that compares closed vs. open pull requests in terms of additions and deletions
      * A boxplot that compares the number of changed files grouped by the author association
      * A scatterplot that shows the relationship between additions and deletions
      * Calculate the correlation between all the numeric data in the pull requests for a repository (and visualize as a matrix?)

* Calculate the correlation between the data fields collected about users
  * following
  * followers
  * number of pull requests
  * number of contributions
  * etc.
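
As a rough illustration of the correlation step, pandas can compute pairwise correlations for every numeric column with `DataFrame.corr()`. The numbers below are made up for the example and the column names are assumptions about how the user fields might be stored.

```python
import pandas as pd

# Made-up values for illustration only; real values would come from users.csv
users = pd.DataFrame({
    "followers": [10, 20, 30, 40],
    "following": [5, 3, 8, 1],
    "pull_requests": [1, 2, 3, 4],
    "contributions": [400, 300, 200, 100],
})

# Pearson correlation between every pair of columns (the default method)
correlation_matrix = users.corr()
print(correlation_matrix)
```

The result is itself a DataFrame with one row and one column per field, so it can be displayed as a table or passed to a plotting function to draw it as a matrix.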

## **Additional Requirements**

* You need to use object-oriented programming (OOP) to structure your code for collecting and analyzing data.

* We will need to write at least 5 unit tests for this project.
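
A unit test for this project might look like the sketch below, which uses Python's built-in `unittest` module. The `PullRequest` class here is a trimmed, hypothetical stand-in so that the example runs on its own; the test names and sample record are also made up.

```python
import unittest


class PullRequest:
    """Trimmed stand-in with only the fields the tests need."""

    def __init__(self):
        self.title = None
        self.number = None
        self.state = None

    def fill_from_json(self, json):
        self.title = json["title"]
        self.number = json["number"]
        self.state = json["state"]


class TestPullRequest(unittest.TestCase):
    def test_fill_from_json_copies_fields(self):
        record = {"title": "Example", "number": 7, "state": "open"}
        pull_request = PullRequest()
        pull_request.fill_from_json(record)
        self.assertEqual(pull_request.title, "Example")
        self.assertEqual(pull_request.number, 7)
        self.assertEqual(pull_request.state, "open")

    def test_missing_field_raises_key_error(self):
        pull_request = PullRequest()
        with self.assertRaises(KeyError):
            pull_request.fill_from_json({"title": "No number or state"})


# Run the tests without exiting the interpreter (useful inside a notebook)
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestPullRequest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Tests like these use hand-written records rather than live API calls, so they run quickly and do not depend on the network.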


## **Object-Oriented Programming**

**See this example to understand how we can use object-oriented programming in the context of this project.**

First, we create a class that will hold data and methods at the level of an individual repository. This class will store properties such as the repository name and owner. These properties can then be used in methods that perform actions like downloading pull requests from GitHub.

In [12]:
# Class Definition
class Repository:
  def __init__(self,owner_name,repo_name):
    self.owner_name = owner_name
    self.repo_name = repo_name


  def get_pulls_as_json(self):
    import requests

    # GitHub API endpoint for pull requests
    url = f"https://api.github.com/repos/{self.owner_name}/{self.repo_name}/pulls"

    # Make a GET request to retrieve pull requests
    response = requests.get(url)

    # Start with an empty list so a failed request still returns a value
    pull_requests_json = []

    # Check if the request was successful (status code 200)
    if response.status_code == 200:
        # Parse the JSON response
        pull_requests_json = response.json()

    else:
        print(f"Failed to retrieve pull requests. Status code: {response.status_code}")
        print(response.text)

    return pull_requests_json



# Example Usage
jabref_repo = Repository(owner_name='jabref', repo_name='jabref')
pull_requests = jabref_repo.get_pulls_as_json()

print('Total number of pull requests:',len(pull_requests))

Total number of pull requests: 28


Using the GitHub REST API, we requested information about pull requests in JSON format. JSON is a text format for structured data. When translated into Python data structures using `response.json()`, we get back nested lists and dictionaries. In this case, each element in the `pull_requests` list is a dictionary containing information about an individual pull request. Note that, by default, this endpoint returns only open pull requests and at most 30 per request, so the count above is not the repository's full pull request history.
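
To make the nesting concrete, here is a small offline sketch that parses a JSON string with the standard `json` module, which performs the same kind of conversion as `response.json()`. The records are made up and far shorter than what the API returns.

```python
import json

# A made-up JSON document shaped like a (much shorter) API response
raw_text = """
[
  {"number": 1, "title": "First example", "user": {"login": "alice"}},
  {"number": 2, "title": "Second example", "user": {"login": "bob"}}
]
"""

records = json.loads(raw_text)

outer_type = type(records).__name__      # the JSON array becomes a list
inner_type = type(records[0]).__name__   # each JSON object becomes a dict
second_login = records[1]["user"]["login"]  # index the list, then nested dicts

print(outer_type, inner_type, second_login)
```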

In [21]:
# View the first pull request (note, this creates a very long output...)
pull_requests[0]

{'url': 'https://api.github.com/repos/JabRef/jabref/pulls/10646',
 'id': 1606405716,
 'node_id': 'PR_kwDOAQ0TF85fv85U',
 'html_url': 'https://github.com/JabRef/jabref/pull/10646',
 'diff_url': 'https://github.com/JabRef/jabref/pull/10646.diff',
 'patch_url': 'https://github.com/JabRef/jabref/pull/10646.patch',
 'issue_url': 'https://api.github.com/repos/JabRef/jabref/issues/10646',
 'number': 10646,
 'state': 'open',
 'locked': False,
 'title': 'Use Prettier Java for automatic code formatting',
 'user': {'login': 'koppor',
  'id': 1366654,
  'node_id': 'MDQ6VXNlcjEzNjY2NTQ=',
  'avatar_url': 'https://avatars.githubusercontent.com/u/1366654?v=4',
  'gravatar_id': '',
  'url': 'https://api.github.com/users/koppor',
  'html_url': 'https://github.com/koppor',
  'followers_url': 'https://api.github.com/users/koppor/followers',
  'following_url': 'https://api.github.com/users/koppor/following{/other_user}',
  'gists_url': 'https://api.github.com/users/koppor/gists{/gist_id}',
  'starred_url'

In [None]:
# Get a property from the first pull request
pull_requests[0]['title']

To help organize all of the information about pull requests, we can create a `PullRequest` class to store that information. For flexibility and convenience, we can set up this class so that we can manually specify each field OR read the fields directly from a JSON record returned by the API. It can also be useful to specify the expected data type stored in each field.

In [46]:
# Class definition
class PullRequest:
  def __init__(self,title:str = None, number:int = None, body:str = None, state:str = None, created_at:str = None, closed_at:str = None):
    self.title = title
    self.number = number
    self.body = body
    self.state = state
    self.created_at = created_at
    self.closed_at = closed_at

  def fill_from_json(self,json):
    self.title = json['title']
    self.number = json['number']
    self.body = json['body']
    self.state = json['state']
    self.created_at = json['created_at']
    self.closed_at = json['closed_at']

  def to_dict(self):
    return {'title':self.title,
            'number':self.number,
            'body':self.body,
            'state':self.state,
            'created_at':self.created_at,
            'closed_at':self.closed_at
            }

  def __str__(self):
    return f'Pull Request #{self.number}: {self.title}'

  def __repr__(self):
    return f'(Class: PullRequest) #{self.number}: {self.title}'

# Example use
pull_request1 = PullRequest()
pull_request1.fill_from_json(pull_requests[1])
print('The title of this pull request was:', pull_request1.title)
print('Full data record:')
print(pull_request1.to_dict())

The title of this pull request was: Issue 10431 relevance star
Full data record:
{'title': 'Issue 10431 relevance star', 'number': 10620, 'body': '<!-- \r\nDescribe the changes you have made here: what, why, ... \r\nLink the issue that will be closed, e.g., "Closes #333".\r\nIf your PR closes a koppor issue, link it using its URL, e.g., "Closes https://github.com/koppor/jabref/issues/47".\r\n"Closes" is a keyword GitHub uses to link PRs with issues; do not change it.\r\nDon\'t reference an issue in the PR title because GitHub does not support auto-linking there.\r\n-->\r\n\r\nA relevance field was modified so it can  no longer be visible upon hovering as it was causing confusion regarding the actual value of selected entry. It has been modified so a new menu appears upon clicking on the "Relevance" field of a selected entry with a selection of 2 options to set it to - Set as Relevant / Set as Irrelevant. Upon choosing one of the mentioned options, the selected entry will be then marked

Now that I've defined a class for pull requests, I can expand the functionality of the `Repository` class. I'm adding a function that compiles the pull requests as a tuple of `PullRequest` objects. I'm using a tuple because I don't want to accidentally change the pull request data after downloading it from GitHub. I'm also adding a function that turns the tuple into a pandas DataFrame.
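
The difference between a list and a tuple can be seen in a short sketch. The titles are taken from the output above, and the variable names are only for illustration.

```python
titles_list = ["Add git support", "Spotbugs"]
titles_tuple = tuple(titles_list)

# A list element can be replaced, whether on purpose or by accident
titles_list[0] = "Changed by accident"

# The same assignment on a tuple raises a TypeError
try:
    titles_tuple[0] = "Changed by accident"
    error_message = None
except TypeError as error:
    error_message = str(error)

print(titles_tuple)
print(error_message)
```

A tuple only prevents replacing, adding, or removing its elements. The objects stored inside it (such as `PullRequest` instances) can still have their attributes changed.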

In [62]:
# Class Definition
class Repository:
  def __init__(self,owner_name,repo_name):
    # Assign properties
    self.owner_name = owner_name
    self.repo_name = repo_name

    # Initialize an empty variable
    self.pull_requests = tuple()

    # Automatically run function to get pull requests
    self.get_pulls()



  def get_pulls_as_json(self):
    import requests

    # GitHub API endpoint for pull requests
    url = f"https://api.github.com/repos/{self.owner_name}/{self.repo_name}/pulls"

    # Make a GET request to retrieve pull requests
    response = requests.get(url)

    # Start with an empty list so a failed request still returns a value
    pull_requests_json = []

    # Check if the request was successful (status code 200)
    if response.status_code == 200:
        # Parse the JSON response
        pull_requests_json = response.json()

    else:
        print(f"Failed to retrieve pull requests. Status code: {response.status_code}")
        print(response.text)

    return pull_requests_json


  def get_pulls(self):
    # Get pull requests from github in json format
    pulls_json = self.get_pulls_as_json()

    # Temporarily create an empty list
    pull_requests_list = list()

    # Convert each pull request in the json to a PullRequest object and add it
    # to the list of pull requests stored in this Repository object
    for json_record in pulls_json:
      pull_request_instance = PullRequest()
      pull_request_instance.fill_from_json(json_record)
      pull_requests_list.append(pull_request_instance)

    # Convert list to tuple so it's safer from accidental changes
    self.pull_requests = tuple(pull_requests_list)

  def pull_requests_to_json(self):
    output_list = list()
    for pull_request in self.pull_requests:
      output_list.append(pull_request.to_dict())

    return output_list


  def pull_requests_to_pandas(self):
    import pandas as pd
    return pd.DataFrame(self.pull_requests_to_json())


# Example Usage
jabref_repo = Repository(owner_name='jabref', repo_name='jabref')
jabref_repo.pull_requests_to_pandas()

Unnamed: 0,title,number,body,state,created_at,closed_at
0,Use Prettier Java for automatic code formatting,10646,## Background\r\n\r\nI am so fed up that\r\n\r...,open,2023-11-17T13:07:30Z,
1,Issue 10431 relevance star,10620,<!-- \r\nDescribe the changes you have made he...,open,2023-11-06T07:44:50Z,
2,Do not show user-specific comment as default,10610,Fixes https://github.com/JabRef/jabref/issues/...,open,2023-10-31T01:11:31Z,
3,Predatory journal checker,10592,Continue to resolve koppor#348\r\n\r\nThe impl...,open,2023-10-27T19:46:48Z,
4,Fix for delete entries should ask user,10591,<!-- \r\nDescribe the changes you have made he...,open,2023-10-27T14:45:22Z,
5,Add git support,10586,<!-- \r\nDescribe the changes you have made he...,open,2023-10-25T22:04:55Z,
6,[WIP] This relativizes the PDF's filepaths aft...,10582,<!-- \r\nDescribe the changes you have made he...,open,2023-10-25T19:31:14Z,
7,[WIP] Jump to entry cli,10578,<!-- \r\nDescribe the changes you have made he...,open,2023-10-25T07:10:34Z,
8,Display files from referenced crossref in entr...,10577,Resolves #7731\r\nImproves code quality of - ...,open,2023-10-25T01:39:07Z,
9,Spotbugs,10565,<!-- \r\nDescribe the changes you have made he...,open,2023-10-24T07:04:41Z,


In [54]:
type(jabref_repo.pull_requests[1])

__main__.PullRequest

In [55]:
test = (1,2,3,4)

In [56]:
for n in test:
  print(n)

1
2
3
4


In [57]:
for pr in jabref_repo.pull_requests:
  print(pr)

Pull Request #10646: Use Prettier Java for automatic code formatting
Pull Request #10620: Issue 10431 relevance star
Pull Request #10610: Do not show user-specific comment as default
Pull Request #10592: Predatory journal checker 
Pull Request #10591: Fix for delete entries should ask user
Pull Request #10586: Add git support
Pull Request #10582: [WIP] This relativizes the PDF's filepaths after importing through "Find Unlinked Files"
Pull Request #10578: [WIP] Jump to entry cli
Pull Request #10577: Display files from referenced crossref in entry table (Toro520 version)
Pull Request #10565: Spotbugs
Pull Request #10541: Add support for LTWA (List of Title Word Abbreviations)
Pull Request #10540: Implemented "Welcome Interface for New Users"
Pull Request #10526: Issue 9798: Relinking after moving
Pull Request #10521: Add auto group colour assignment
Pull Request #10519: Improving Booktitle Integrity Check
Pull Request #10518: Added Fetcher for ISIDORE
Pull Request #10496: Fix duplicate e