# GitHub Stars EDA
## Data Source
The [GitHub Archive](https://www.gharchive.org/) public dataset on BigQuery, using the table for the year 2022 (`githubarchive.year.2022`).

## Data loading

In [5]:
import pandas as pd
from google.cloud import bigquery
from google.oauth2 import service_account
import numpy as np

In [6]:
credentials = service_account.Credentials.from_service_account_file(
    '.bigquery-eda-key.json')
project_id = 'github-eda'
client = bigquery.Client(credentials=credentials, project=project_id)

## Data Exploration

In [50]:
# Stars received in 2022 per repository owner (user or organization).
# Owners who received no stars do not appear in the result.
WATCH_EVENTS_PER_USER_QUERY = """
SELECT SPLIT(repo.name, '/')[OFFSET(0)] AS user, COUNT(*) AS stars
FROM `githubarchive.year.2022`
WHERE type = 'WatchEvent'
GROUP BY user
ORDER BY stars ASC
"""

df_watch_events = client.query(WATCH_EVENTS_PER_USER_QUERY).to_dataframe()['stars']
df_watch_events.head()

0    1
1    1
2    1
3    1
4    1
Name: stars, dtype: Int64
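
To make the aggregation concrete, here is a minimal pandas sketch of what the query above computes, run on a tiny made-up sample. The `events` frame, its `repo_name` column, and the repository names are hypothetical stand-ins for the BigQuery table, not real data.

```python
import pandas as pd

# Hypothetical sample of events; the real table has one row per public event.
events = pd.DataFrame({
    "type": ["WatchEvent", "WatchEvent", "PushEvent", "WatchEvent"],
    "repo_name": ["alice/app", "alice/lib", "bob/tool", "bob/tool"],
})

# Keep only star events, as in: WHERE type = 'WatchEvent'
watch = events[events["type"] == "WatchEvent"]

# Owner is the part of "owner/repo" before the slash, as in SPLIT(...)[OFFSET(0)]
owners = watch["repo_name"].str.split("/").str[0]

# Count star events per owner, sorted ascending like ORDER BY stars ASC
stars_per_owner = owners.value_counts().sort_values()
print(stars_per_owner)
```

Because the grouping key is the owner rather than the full repository name, stars for `alice/app` and `alice/lib` are added together.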

In [51]:
df_watch_events.describe()

count    3.243376e+06
mean     1.742826e+01
std      4.948940e+02
min      1.000000e+00
25%      1.000000e+00
50%      2.000000e+00
75%      4.000000e+00
max      5.683110e+05
Name: stars, dtype: float64

In [53]:
USERS_COUNT_QUERY = """
SELECT count(distinct(split(repo.name, '/')[OFFSET(0)])) as users_count
FROM `githubarchive.year.2022`
"""
# Distinct repository owners with at least one public event of any type in 2022
# (whether or not they received stars)
users_count = client.query(USERS_COUNT_QUERY).to_dataframe()['users_count'][0]
users_count

17690503

In [54]:
users_without_starred_repos = users_count - df_watch_events.count()
users_without_starred_repos

14447127

In [55]:
# Pad the star counts with one zero for every owner who received no stars
stars_series = np.concatenate((
    np.zeros(users_without_starred_repos, dtype=int),
    df_watch_events.to_list(),
))

df_stars = pd.DataFrame(stars_series, columns=['stars'])
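
The effect of the zero padding can be illustrated on toy numbers. This is only a sketch: the star counts and the number of owners without stars below are invented, chosen so the example is easy to check by hand.

```python
import numpy as np

# Invented star counts for owners who received at least one star
starred = np.array([1, 1, 2, 4, 30])

# Invented number of owners who received no stars
n_without = 15

# Same construction as above: zeros first, then the observed counts
padded = np.concatenate((np.zeros(n_without, dtype=int), starred))

median_starred = np.quantile(starred, 0.5)  # median among starred owners only
median_all = np.quantile(padded, 0.5)       # median across every owner
print(median_starred, median_all)
```

With most owners at zero, the quantiles of the padded series describe the whole population rather than only the owners who received stars, which is why they differ from the `describe()` output earlier.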

In [56]:
df_stars.quantile([0.6, 0.9, 0.95, 0.99, 0.999])

       stars
0.600    0.0
0.900    1.0
0.950    4.0
0.990   30.0
0.999  410.0


In [57]:
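# Bins are half-open [a, b), except the last one, which is closed: [10000, 50000].
# Values above 50000 fall outside the bins and are not counted.
count, division = np.histogram(df_stars, bins=[0, 1, 6, 10, 100, 500, 1000, 10000, 50000])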
count

array([14447127,  2554952,   240252,   383093,    50603,     7506,
           6629,      322])
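
When reading these counts it helps to recall how `np.histogram` treats bin edges. The sketch below uses the same bin edges on a few hand-picked values (the values themselves are made up) to show which bin each edge value lands in and what happens above the last edge.

```python
import numpy as np

# Hand-picked values sitting on or next to the bin edges
values = np.array([0, 1, 5, 6, 10, 50000, 50001])
bins = [0, 1, 6, 10, 100, 500, 1000, 10000, 50000]

counts, edges = np.histogram(values, bins=bins)

# Each bin includes its left edge and excludes its right edge,
# except the last bin, which also includes its right edge.
for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:>6} to {right:>6}: {n}")

# Values above the last edge are silently left out of every bin
print("not counted:", len(values) - counts.sum())
```

So the first bin holds owners with exactly 0 stars, the second holds 1 to 5 stars, and any owner above 50,000 stars is missing from the array.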

In [63]:
df_stars.boxplot()

<AxesSubplot: ylabel='Frequency'>

## Conclusion
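The distribution of GitHub stars received in 2022 is extremely skewed. Counting stars per user (repository owner, aggregated across all of their public repositories):
- About 82% of users (14,447,127 of 17,690,503) received no stars at all.
- 90% of users received 0 or 1 star.
- The top 5% received 4 stars or more.
- The top 1% received 30 stars or more.
- The top 0.1% received 410 stars or more.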
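
As a sanity check, the figures printed in the notebook can be cross-checked with plain arithmetic. The sketch below only reuses numbers that appear in the outputs above; the variable names are illustrative.

```python
# Figures copied from the notebook outputs above
owners_total = 17_690_503       # distinct owners with any public event
owners_without = 14_447_127     # owners with no stars received
histogram_counts = [14447127, 2554952, 240252, 383093, 50603, 7506, 6629, 322]

# Share of owners who received no stars in 2022
share_zero = owners_without / owners_total
print(f"share with zero stars: {share_zero:.1%}")

# Owners not covered by any histogram bin, i.e. above the last edge of 50000
outside_bins = owners_total - sum(histogram_counts)
print("owners above 50000 stars:", outside_bins)
```

The small remainder in the second check is consistent with the maximum of 568,310 reported by `describe()`, which lies above the last histogram edge.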