# purpose

- as described [here](../readme.md), the github graphql api returns at most 100 records per request
- paginating manually is a lot of work
- so this jupyter notebook automates the extraction of data from github
- each time you run this notebook, the next 100 records are fetched from github and stored in csv and json files in this folder
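
The 100-record limit appears as the `first` argument of a paginated connection. Below is a minimal sketch of what one request body might look like. The helper name `build_issue_payload`, the query text and the selected fields are assumptions for illustration; the queries this notebook actually uses are defined in the notebooks it runs.

```python
import json

# hypothetical query text; the real query is defined in the notebooks this one runs
CLOSED_ISSUES_QUERY = """
query($owner: String!, $name: String!, $count: Int!, $cursor: String) {
  repository(owner: $owner, name: $name) {
    issues(first: $count, after: $cursor, states: CLOSED) {
      pageInfo { endCursor hasNextPage }
      edges { node { number title closedAt } }
    }
  }
}
"""

def build_issue_payload(owner, name, cursor=None, count=100):
    """Return the json body for one page; github caps `first` at 100."""
    if not 1 <= count <= 100:
        raise ValueError("count must be between 1 and 100")
    return {
        "query": CLOSED_ISSUES_QUERY,
        "variables": {"owner": owner, "name": name, "count": count, "cursor": cursor},
    }

# first page: no cursor yet
payload = build_issue_payload("tensorflow", "tensorflow")
print(json.dumps(payload["variables"]))
```

Passing the `endCursor` of the previous response as `cursor` would request the following page.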


# structure

- this script tracks
  - what was the last record that was fetched
  - from whereon should we start fetching the next set of records
- this state is tracked in the file named [counter.json](./counter.json), which holds a json array of objects (one object per run)
- `counter.json` contains the [pageInfo](https://docs.github.com/en/graphql/reference/objects#pageinfo) details returned by each graphql query
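
To make the shape of `counter.json` concrete, here is a small sketch with made-up contents. The cursor strings, the `None` cursor in the seed record and the helper name `next_run_state` are assumptions; the exact `pageInfo` fields stored depend on what the query requests.

```python
import json

# hypothetical contents of counter.json after two runs; cursor strings are made up
sample_counter = [
    {"counter": 1, "endCursor": None},  # seed record kept when the file is reset
    {"endCursor": "cursor-after-100", "hasNextPage": True, "counter": 2},
    {"endCursor": "cursor-after-200", "hasNextPage": True, "counter": 3},
]

def next_run_state(records):
    """Return (counter, cursor) from the most recent record in the list."""
    last = records[-1]
    return last["counter"], last["endCursor"]

# round-trip through json to mimic reading the file from disk
records = json.loads(json.dumps(sample_counter))
counter_value, end_cursor = next_run_state(records)
print(counter_value, end_cursor)
```

Only the last object matters for the next run: its `counter` decides whether this is the first run, and its `endCursor` is where fetching resumes.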

# workflow

- let us suppose you want to fetch details of all closed github issues till date
- you start by emptying `counter.json` barring the first record
- then execute the jupyter notebook
- `if counter_value == 1:` checks whether this is the first run
- if it is, the first 100 issues are fetched from your repository
- a new record with an incremented counter value is appended to `counter.json`
- that record also stores the cursor value marking the last record that was fetched
- the next time this notebook is run, the next 100 records are fetched, starting where the previous run ended
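
The steps above can be sketched as a small simulation that needs no network access. Everything here is a stand-in: `FAKE_ISSUES`, `fetch_page` and `run_once` are hypothetical names, and the cursor is simply a list offset rather than a real github cursor.

```python
# stand-in for the github api: 250 fake issues served in pages of 100
FAKE_ISSUES = list(range(1, 251))

def fetch_page(cursor, page_size=100):
    """Hypothetical fetch; returns the records plus a pageInfo-like dict."""
    start = 0 if cursor is None else int(cursor)
    end = min(start + page_size, len(FAKE_ISSUES))
    page_info = {"endCursor": str(end), "hasNextPage": end < len(FAKE_ISSUES)}
    return FAKE_ISSUES[start:end], page_info

def run_once(counter, aggregated):
    """Simulate one notebook run: read state, fetch a page, record the cursor."""
    last = counter[-1]
    if last["counter"] == 1:
        records, page_info = fetch_page(None)               # first run
    else:
        records, page_info = fetch_page(last["endCursor"])  # resume from cursor
    page_info["counter"] = len(counter) + 1
    counter.append(page_info)
    aggregated.extend(records)
    return page_info["hasNextPage"]

counter = [{"counter": 1, "endCursor": None}]  # seed record only
aggregated = []
runs = 1
while run_once(counter, aggregated):
    runs += 1
print(runs, len(aggregated))
```

In this sketch the loop stops once `hasNextPage` is false; the notebook itself fetches one page per run, so the same check would tell you when there is nothing left to fetch.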


# setup

In [None]:
# fetch github token, python packages, queries and other parameters

# %run is an ipython magic command; the comment is kept on its own line so it is not passed along as an argument
%run ../../100-set_parameters/100-set_parameters.ipynb

# find counter_value


In [None]:
# read counter.json

# load the counter.json
def read_counter(file_path):
    with open(file_path, 'r') as file:
        data = json.load(file)
    return data

# fetch the current value of the counter by accessing the last json object in the list
counter_file_path = 'counter.json'
all_records = read_counter(counter_file_path)
last_record = all_records[-1]
counter_value = last_record["counter"]

# fetch also the current value of endCursor so that the next run can start from here
end_cursor = last_record['endCursor']


In [None]:
# query parameters
# specify the query parameters to be used while fetching issues
# github does not allow fetching more than 100 issues per request

fetch_issue_parameters = {
    "repository_name": "tensorflow",
    "owner_name": "tensorflow",
    "number_of_issues": 100,
    "end_cursor": end_cursor
}

with open('issue_params.json', 'w') as file:
    json.dump(fetch_issue_parameters, file)

# update counter_value


In [None]:
# update counter.json
# the counter.json will be populated using the pageInfo details from the response to the latest query execution

# load the input json that contains the graphql response from the latest run
def load_counter(input_file):
    with open(input_file, 'r') as file:
        input_data = json.load(file)
    return input_data

# append the page_info details to the counter file
def write_json(page_info, output_file):
    with open(output_file, 'r+') as file:

        # load the existing records (a list of dicts)
        file_data = json.load(file)

        # auto-increment the counter based on the number of existing records
        page_info['counter'] = len(file_data) + 1

        # append the new record to the existing ones
        file_data.append(page_info)

        # rewind to the start of the file before overwriting it
        file.seek(0)

        # write the updated list back as json and drop any leftover bytes
        json.dump(file_data, file, indent=4)
        file.truncate()

def write_counter(input_file, output_file):

    # get graphql response from the latest run
    input_data = load_counter(input_file)

    # extract pageInfo node from inside that response
    page_info = input_data['data']['repository']['issues']['pageInfo']

    # update counter.json so that the cursor value can be used for next run
    write_json(page_info,output_file)

# aggregate data

- the functions above paginate through 100 records at a time
- on the first run, the first 100 records are fetched and written to a csv, and the counter is updated
- on each later run, the next 100 records are fetched and added to the same csv
- each new set of 100 records is appended to [366-aggregated_data.csv](366-aggregated_data.csv); the same records are also collected in `366-aggregated_data.json`
- the aggregated csv is the final file used for all visualizations [here](../../400-visualize_data/)
- check this [readme](../readme.md) for details


In [None]:
# append csv

def append_csv_rows(input_data, output_data, include_headers):
    # read rows from the source file
    with open(input_data, 'r') as source:
        reader = csv.reader(source)
        rows = list(reader)  # convert the reader object to a list of rows

    # do not include the column headers if they already exist in destination file
    if not include_headers:
        rows = rows[1:]  # exclude the first row (assumed to be headers)

    # append rows to the destination file
    with open(output_data, 'a', newline='') as destination:
        writer = csv.writer(destination)
        writer.writerows(rows)  # write the rows to the destination file

In [None]:
# append json

def append_json_data(source_file, aggregated_file):
    # check if the aggregated file exists and read its data
    if os.path.exists(aggregated_file):
        with open(aggregated_file, 'r') as f:
            aggregated_data = json.load(f)
    else:
        # initialize an empty structure if the file does not exist
        aggregated_data = {
            "data": {
                "viewer": None,
                "repository": {
                    "issues": {
                        "edges": []
                    }
                }
            }
        }

    # load the new data to append
    with open(source_file, 'r') as f:
        new_data = json.load(f)

    # initialize the viewer if it is not yet set
    if aggregated_data['data']['viewer'] is None:
        aggregated_data['data']['viewer'] = new_data['data']['viewer']

    # append new issues to the aggregated data
    new_issues = new_data['data']['repository']['issues']['edges']
    aggregated_issues = aggregated_data['data']['repository']['issues']['edges']
    aggregated_issues.extend(new_issues)

    # write the updated data back to the aggregated file
    with open(aggregated_file, 'w') as f:
        json.dump(aggregated_data, f, indent=4)

# fetch data


In [None]:
# execute query
# decide to fetch first 100 records or next 100 records based on counter value captured in previous cells

# the code raises an error when fewer than 100 records are left to fetch
# the remaining records are still fetched; the error is expected
# so the error is caught below instead of stopping the notebook

try:
    if counter_value == 1:
        print("getting first 100 records")

        # fetch first 100 records
        %run ./../320-fetch_first_100_closed_issues/320-fetch_first_100_closed_issues.ipynb

        # update counter
        input_file = '320-first_100_closed_issues.json'
        output_file = 'counter.json'
        write_counter(input_file, output_file)

        # append data to CSV
        input_data_csv = '320-first_100_closed_issues.csv'
        output_data_csv = '366-aggregated_data.csv'
        include_headers = True
        append_csv_rows(input_data_csv, output_data_csv, include_headers)

        # append data to JSON
        input_data_json = '320-first_100_closed_issues.json'
        output_data_json = '366-aggregated_data.json'
        append_json_data(input_data_json, output_data_json)

    else:
        print("getting next 100 records")

        # fetch next 100 records
        %run ./../340-fetch_next_100_closed_issues/340-fetch_next_100_closed_issues.ipynb

        # update counter
        input_file = '340-next_100_closed_issues.json'
        output_file = 'counter.json'
        write_counter(input_file, output_file)

        # append data to CSV
        input_data_csv = '340-next_100_closed_issues.csv'
        output_data_csv = '366-aggregated_data.csv'
        include_headers = False
        append_csv_rows(input_data_csv, output_data_csv, include_headers)

        # append data to JSON
        input_data_json = '340-next_100_closed_issues.json'
        output_data_json = '366-aggregated_data.json'
        append_json_data(input_data_json, output_data_json)

except Exception as error:
    # expected when fewer than 100 records remain; report it instead of hiding it
    print(f"fetch finished with an error: {error}")