The goal is to mine a dataset of GitHub users from Google and Microsoft using the GitHub API.

We will fetch the user data for each company and save it to disk.

This is purely an exercise, with no practical use in mind.

**For the script to work, you need to create a personal access token for GitHub API v4
and save it to the file ./token.txt**

See the instructions: https://developer.github.com/v4/guides/forming-calls/#authenticating-with-graphql

Watch out for the rate limits: https://developer.github.com/v4/guides/resource-limitations/
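
A minimal sketch of how the token and the remaining rate limit could be checked before a long run. The helper names `build_headers` and `remaining_points` are hypothetical and not part of this notebook; the `token` authorization scheme mirrors the one used in the code below, and the `rateLimit` object is queried as described in the resource limitations guide.

```python
# Hypothetical helpers, not part of the notebook.
RATE_LIMIT_QUERY = '{ rateLimit { limit cost remaining resetAt } }'


def build_headers(token_text):
    '''Builds the Authorization header from the raw contents of token.txt.'''
    # strip whitespace: a trailing newline would end up inside the header value
    return {'Authorization': 'token %s' % token_text.strip()}


def remaining_points(headers, url='https://api.github.com/graphql'):
    '''Asks the API how many rate-limit points are left (needs network access).'''
    import requests  # imported here so build_headers works without it
    r = requests.post(url, json={'query': RATE_LIMIT_QUERY}, headers=headers)
    r.raise_for_status()
    return r.json()['data']['rateLimit']['remaining']
```

With a valid token on disk, `remaining_points(build_headers(open('./token.txt').read()))` should return the number of points left in the current window.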

In [1]:
import numpy as np
import pandas as pd
import re
import requests
import time


USERS_PER_QUERY = 35  # chosen so as not to exceed the API limits

# the query text to use with github api,
# for reference look here:
# https://developer.github.com/v4
QUERY = '''{
  organization(login: %s) {
    members(%sfirst: %d) {
      totalCount
      edges {
        cursor
        node {
          commitComments {
            totalCount
          }
          followers {
            totalCount
          }
          following {
            totalCount
          }
          gists {
            totalCount
          }
          issues {
            totalCount
          }
          pullRequests {
            totalCount
          }
          repositories(first: 50) {
            totalCount
            nodes {
              languages(first: 5) {
                nodes {
                  name
                }
              }
            }
          }
          starredRepositories {
            totalCount
          }
          createdAt
          login
          name
          location
          isBountyHunter
          isCampusExpert
          isDeveloperProgramMember
          isHireable
        }
      }
    }
  }
}'''
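
The placeholders in `QUERY` are filled in by string formatting, so the organization name and the cursor have to be quoted by hand. GraphQL also supports query variables, which are sent next to the query text and need no manual quoting. The sketch below shows the idea on a shortened query. `PAGED_QUERY` and `build_payload` are hypothetical names, only the `login` field is requested to keep it short, and it assumes the same `members` connection that `QUERY` uses and that a null `after` is treated like an omitted one.

```python
# Hypothetical, shortened variant of QUERY that uses GraphQL variables.
PAGED_QUERY = '''
query($org: String!, $first: Int!, $after: String) {
  organization(login: $org) {
    members(first: $first, after: $after) {
      totalCount
      edges {
        cursor
        node { login }
      }
    }
  }
}'''


def build_payload(org, first, after=None):
    '''Builds the JSON body for one page; after=None requests the first page.'''
    return {
        'query': PAGED_QUERY,
        'variables': {'org': org, 'first': first, 'after': after},
    }
```

The resulting dict can be passed as the `json=` argument of `requests.post`, in the same way as the `query` dict in the next cell.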

In [2]:
def get_users(org):
    '''
    Creates a pandas DataFrame with the users of organization org
    using GitHub API data.
    '''

    def make_request(prev_user_code=''):
        '''Makes a request to GitHub API v4, retrying up to 10 times.'''
        # the cursor captured below keeps its double quotes,
        # so it can be pasted into the query as is
        after = 'after: %s, ' % prev_user_code if prev_user_code else ''
        query = {"query": QUERY % ('"%s"' % org, after, USERS_PER_QUERY)}
        for _ in range(10):
            try:
                r = requests.post(url=url, json=query, headers=headers)
                # IndexError here means the response contains no cursor
                last_user_code = re.findall(r'"cursor":("[\w=]+")', r.text)[-1]
                return r, last_user_code
            except (requests.RequestException, IndexError):
                print('Error while making request. Trying again...')
                time.sleep(1)
        print('Due to numerous errors, data receiving was aborted')
        return None, None
    
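    print('receiving users from %s ...' % org)
    url = 'https://api.github.com/graphql'
    # strip surrounding whitespace: a trailing newline would corrupt the header
    with open("./token.txt", "r") as token_file:
        api_token = token_file.read().strip()
    headers = {'Authorization': 'token %s' % api_token}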

    df = pd.DataFrame()
    r, last_user_code = make_request()
    
    while r:
        json = r.json()
        if df.empty:
            user0 = json['data']['organization']['members']['edges'][0]['node']
            cols = list(user0.keys()) + ['own_repos_langs']
            df = pd.DataFrame(columns=cols)
        
        users = json['data']['organization']['members']['edges']
        print('received', len(users))
        for user in users:
            node = user['node']
            # nested objects (followers, gists, ...) are reduced to their totalCount
            features = [node[key]['totalCount'] if isinstance(node[key], dict)
                        else node[key] for key in node]
            langs = list(set(lang['name']
                             for repo in node['repositories']['nodes']
                             for lang in repo['languages']['nodes']))
            features.append(langs)
            df.loc[len(df)] = features
            
        if len(users) < USERS_PER_QUERY:
            break
        r, last_user_code = make_request(last_user_code)
        

    df.to_csv('./%s.csv' % org, columns=df.columns)
    return df
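
The parsing step inside `get_users` can also be sketched as two small pure functions that work on the decoded JSON, which makes it testable without network access. `flatten_node` and `parse_page` are hypothetical names, and the sample response is made up. Unlike the regular expression above, this reads the cursor from the parsed JSON, so it comes back without surrounding quotes and would need quoting before being formatted into `QUERY`.

```python
def flatten_node(node):
    '''Turns one member node into a flat dict of features.'''
    row = {}
    for key, value in node.items():
        # connection objects such as followers are dicts with a totalCount
        row[key] = value['totalCount'] if isinstance(value, dict) else value
    # unique languages across the user's listed repositories, sorted for stable output
    row['own_repos_langs'] = sorted({
        lang['name']
        for repo in node['repositories']['nodes']
        for lang in repo['languages']['nodes']
    })
    return row


def parse_page(payload):
    '''Returns (rows, last_cursor) for one decoded JSON response.'''
    edges = payload['data']['organization']['members']['edges']
    rows = [flatten_node(edge['node']) for edge in edges]
    last_cursor = edges[-1]['cursor'] if edges else None
    return rows, last_cursor


# made-up sample response with a single user
sample = {'data': {'organization': {'members': {'edges': [{
    'cursor': 'Y3Vyc29yOjE=',
    'node': {
        'followers': {'totalCount': 3},
        'repositories': {'totalCount': 2, 'nodes': [
            {'languages': {'nodes': [{'name': 'Python'}, {'name': 'Shell'}]}},
            {'languages': {'nodes': [{'name': 'Python'}]}},
        ]},
        'login': 'octocat',
    },
}]}}}}
rows, cursor = parse_page(sample)
```

Collecting the rows of every page in one list and calling `pd.DataFrame(rows)` once at the end is likely faster than appending to the DataFrame row by row.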


**Warning:** This will take several minutes.

In [3]:
orgs = ['microsoft', 'google']
dfs = []
for org in orgs:
    df = get_users(org)
    df['org'] = org
    dfs.append(df)
    
df = pd.concat(dfs, axis=0, ignore_index=True)
df.to_csv('./github_users.csv', columns=df.columns)

receiving users from microsoft ...
received 35
received 35
received 35
[... "received 35" repeats once per page of results; the rest of the output is truncated ...]

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5491 entries, 0 to 5490
Data columns (total 18 columns):
commitComments              5491 non-null object
followers                   5491 non-null object
following                   5491 non-null object
gists                       5491 non-null object
issues                      5491 non-null object
pullRequests                5491 non-null object
repositories                5491 non-null object
starredRepositories         5491 non-null object
createdAt                   5491 non-null object
login                       5491 non-null object
name                        5165 non-null object
location                    3195 non-null object
isBountyHunter              5491 non-null object
isCampusExpert              5491 non-null object
isDeveloperProgramMember    5491 non-null object
isHireable                  5491 non-null object
own_repos_langs             5491 non-null object
org                         5491 non-null object
dtypes: object(18)

Let's find the most popular languages and technologies.

In [5]:
df['own_repos_langs'] = df['own_repos_langs'].apply(' '.join)

In [6]:
from sklearn.feature_extraction.text import CountVectorizer

countvect = CountVectorizer()

counts = countvect.fit(df.own_repos_langs)

In [7]:
cols = counts.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2

In [8]:
languages = pd.DataFrame(counts.transform(df.own_repos_langs).toarray(), columns=cols)

In [9]:
top50 = languages.sum().sort_values(ascending=False)[:50]
top50

shell           3437
javascript      3410
html            3265
css             3128
python          2798
powershell      2072
makefile        1997
batchfile       1987
java            1922
ruby            1782
objective       1280
typescript      1078
php              972
go               903
cmake            827
perl             824
asp              814
groovy           725
coffeescript     624
script           554
vim              548
xslt             492
assembly         380
protocol         371
buffer           371
basic            361
visual           361
tex              355
lisp             336
smalltalk        322
scala            284
emacs            283
notebook         261
jupyter          261
lua              256
apacheconf       254
m4               250
swift            235
haskell          216
smarty           213
arduino          208
glsl             206
matlab           181
awk              178
groff            170
machine          168
gcc              168
description  

Here they are. Let's save them.

In [10]:
top50.to_csv('./top50langs.csv', index=True)
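
One caveat about these counts: joining the names with spaces and tokenizing them splits multi-word names, so `Jupyter Notebook` is counted as `jupyter` and `notebook`, and `Protocol Buffer` as `protocol` and `buffer`. The default `CountVectorizer` token pattern also drops one-character tokens, which is presumably why `c` does not appear in the list. A minimal sketch of an alternative that counts whole names with `collections.Counter`, assuming `own_repos_langs` still holds lists (that is, before the `' '.join` step); `count_languages` is a hypothetical helper and the sample data is made up.

```python
from collections import Counter


def count_languages(lang_lists):
    '''Counts in how many users' lists each full language name appears.'''
    counts = Counter()
    for langs in lang_lists:
        # set() guards against a name listed twice for the same user
        counts.update(set(langs))
    return counts


# made-up sample: three users
sample = [['C', 'Jupyter Notebook'], ['Jupyter Notebook', 'Python'], ['C']]
top = count_languages(sample).most_common(2)
```

In this notebook, `count_languages(df['own_repos_langs']).most_common(50)` would give a comparable top 50 with the full language names kept intact.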