# Fetch labelled issues using data from GH Archive
The purpose of this notebook is to load the hourly event dumps published by GH Archive and reshape them into pandas DataFrames of issues and their labels.

In [1]:
import pandas as pd
import numpy as np
import requests
from dotenv import load_dotenv
import os
from concurrent.futures import ThreadPoolExecutor

## Download data to local

In [2]:
def download_data(partition):
    """
    Download one GH Archive partition (e.g. "2024-06-02-2") and save it
    to ../data/{partition}.json.gz
    """

    url = f"https://data.gharchive.org/{partition}.json.gz"

    # A timeout keeps a stalled connection from hanging the notebook
    response = requests.get(url, timeout=60)

    if response.status_code == 200:
        with open(f"../data/{partition}.json.gz", "wb") as file:
            file.write(response.content)
    else:
        print(f"Failed to download data for partition {partition} (HTTP {response.status_code})")
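
The notebook imports `ThreadPoolExecutor` but this section only fetches one partition. Below is a minimal sketch of how several hourly partitions might be fetched in parallel. `hourly_partitions` and `download_many` are hypothetical helpers, not part of the notebook, and the download function is passed in as a parameter so the sketch runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor


def hourly_partitions(day, hours=range(24)):
    """Build partition names such as "2024-06-02-2" (the hour is not zero-padded)."""
    return [f"{day}-{hour}" for hour in hours]


def download_many(partitions, download_fn, max_workers=4):
    """Call download_fn once per partition using a small thread pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # list() drains the iterator so exceptions raised in workers surface here
        list(executor.map(download_fn, partitions))


# Stand-in for download_data: record what would have been fetched
fetched = []
download_many(hourly_partitions("2024-06-02", range(3)), fetched.append, max_workers=2)
print(sorted(fetched))
```

In the notebook, `download_data` would take the place of `fetched.append`. A small `max_workers` value is a cautious default, since each file is held fully in memory before it is written.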

In [3]:
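def get_issues_df(partition):
    """
    Load a downloaded partition and return one row per IssuesEvent
    """
    df = pd.read_json(f"../data/{partition}.json.gz", lines=True)

    # Keep only issue events (opened, closed, ...)
    issues_events_df = df.query("type == 'IssuesEvent'")

    # Flatten the nested payload dicts into dotted columns such as "issue.id"
    issues_df = pd.json_normalize(issues_events_df["payload"])

    issues_df = issues_df.filter(items=[
        "issue.id", "issue.url", "action", "issue.user.login",
        "issue.title", "issue.labels", "issue.body",
    ])
    # Reduce each label object to its name
    issues_df["issue.labels"] = issues_df["issue.labels"].apply(lambda x: [label["name"] for label in x])

    return issues_df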
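
To make the flattening step easier to follow, here is a small sketch of `pd.json_normalize` on two hand-written payloads. The payloads are invented for illustration and contain only a few of the fields a real `IssuesEvent` payload carries.

```python
import pandas as pd

# Invented payloads shaped like a trimmed-down IssuesEvent payload
payloads = [
    {
        "action": "opened",
        "issue": {
            "id": 1,
            "title": "Crash on start",
            "labels": [{"name": "bug"}, {"name": "unverified"}],
        },
    },
    {
        "action": "closed",
        "issue": {"id": 2, "title": "Add dark mode", "labels": []},
    },
]

# Nested dicts become dotted column names; lists are kept as single cells
flat = pd.json_normalize(payloads)

# Keep only the name of each label object
flat["issue.labels"] = flat["issue.labels"].apply(
    lambda labels: [label["name"] for label in labels]
)

print(flat.columns.tolist())
print(flat["issue.labels"].tolist())
```

The dotted names are why `get_issues_df` selects columns such as `issue.id` and `issue.labels`.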

In [4]:
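def filter_df_by_label(df, label_name: str):
    # Compare in lower case on both sides so "Bug", "BUG" and "bug" all match
    label_name = label_name.lower()
    return df[df["issue.labels"].apply(lambda labels: label_name in [l.lower() for l in labels])]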
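
As a quick check of the case-insensitive matching idea, the sketch below applies the same kind of filter to a toy frame. `has_label` and `toy_df` are hypothetical names used only for this illustration.

```python
import pandas as pd


def has_label(labels, wanted):
    """Return True if any label equals `wanted`, ignoring case."""
    wanted = wanted.casefold()
    return any(label.casefold() == wanted for label in labels)


# Toy data: mixed-case labels and one issue without labels
toy_df = pd.DataFrame({
    "issue.title": ["Crash on start", "Add dark mode", "Typo in README"],
    "issue.labels": [["Bug", "unverified"], ["enhancement"], []],
})

bugs = toy_df[toy_df["issue.labels"].apply(lambda labels: has_label(labels, "BUG"))]
print(bugs["issue.title"].tolist())
```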

In [5]:
partition = "2024-06-02-2"

# download_data(partition)

issues_df = get_issues_df(partition)

filter_df_by_label(issues_df, "bug")


| | issue.id | issue.url | action | issue.user.login | issue.title | issue.labels | issue.body |
|---|---|---|---|---|---|---|---|
| 57 | 2329426990 | https://api.github.com/repos/MaaAssistantArkni... | opened | KOMEIJIHAJIME | 基建控制中心基建副手干员满信赖时导致基建换班循环卡死 | [bug] | ### 在提问之前...\n\n- [X] 我理解 Issue 是用于反馈和解决问题的，而非... |
| 70 | 2329427114 | https://api.github.com/repos/FunkinCrew/Funkin... | opened | saicronise | Bug Report: [Game Crashes when you try to sele... | [bug] | Well I don't know if anyone has already talked... |
| 106 | 2329427525 | https://api.github.com/repos/geode-sdk/geode/i... | opened | KillerCraftYT | Geode not showing up | [bug, unverified] | ### Geode Issue\n\n- [X] I confirm that this b... |
| 111 | 2326847194 | https://api.github.com/repos/mgmeyers/obsidian... | closed | luke396 | [Bug]: Click `Add a card` show long blank board | [bug] | ### Describe the bug\n\nThis is what appears w... |
| 156 | 2176345616 | https://api.github.com/repos/TownyAdvanced/Map... | closed | Folas1337 | Pl3xMap and MapTowny produce diagonal lines in... | [bug] | ## Describe the Bug\r\nI just recently updated... |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1691 | 2244202345 | https://api.github.com/repos/maxpatiiuk/calend... | closed | maxpatiiuk | bug: Closing the overlay should exit the prefe... | [bug] | 1. Open Calendar Plus\r\n2. Open preferences\r... |
| 1703 | 2329439829 | https://api.github.com/repos/HisAtri/LrcApi/is... | opened | dajiangfu | 为什么我按照文档拉取镜像并启动容器后在音流里面调用接口，无法获取歌手和封面图片呢，是这个ap... | [bug] | ### 提交Issue之前，你应当知道：\n\n- [X] Issue是用于快速定位和解决问... |
| 1714 | 983322043 | https://api.github.com/repos/craftworkgames/Mo... | closed | EnemyArea | [Particle] Dispose causes app-crash | [bug] | If you dispose an particle it crashes without ... |
| 1716 | 2329439910 | https://api.github.com/repos/SrijanShovit/Heal... | opened | ranamanish674zu | diabetesclassification (Using Machine Learning... | [bug, invalid] | ### Describe the bug\n\nThe bug in the diabete... |


### Count issue labels

In [8]:
pd.Series(np.concatenate(issues_df["issue.labels"].values)).value_counts().head(20)

status                    258
bug                       115
enhancement               110
✨ feature                  62
consumer                   62
teams                      62
High priority              62
stale                      54
question                   20
Stale                      18
type/enhancement           16
good first issue           10
unverified                  8
feature                     8
documentation               8
Task                        8
help wanted                 7
daily                       7
linter-failure              6
b2b-marketing-services      6
Name: count, dtype: int64
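
The counts above are case-sensitive, so `stale` and `Stale` are tallied separately. One possible way to merge such variants, sketched here on toy data rather than the real partition, is to explode the label lists and lower-case them before counting. `count_labels` and `toy_df` are hypothetical names.

```python
import pandas as pd


def count_labels(df):
    """Count labels across all rows, treating "Stale" and "stale" as the same label."""
    labels = df["issue.labels"].explode().dropna()  # empty lists explode to NaN
    return labels.str.lower().value_counts()


# Toy data with mixed-case labels and one issue without labels
toy_df = pd.DataFrame({
    "issue.labels": [["bug"], ["Stale", "bug"], ["stale"], []],
})

counts = count_labels(toy_df)
print(counts)
```

Lower-casing only merges case variants; labels such as `✨ feature` and `feature` would still be counted separately.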