## GitHub Data Analysis


## Introduction
Most software engineers use GitHub. As a hosting platform built on Git, an open source distributed version control tool, GitHub gains thousands of new repositories every hour. It can therefore serve as a huge, dynamic data source for analyzing the current state of technology and its trends.

In this project, we look into several questions: who is the most popular person in a certain field, what is currently the hottest project, and how much different programming languages are used.


### GitHub API

We use the GitHub API, documented [here](https://developer.github.com/v3/).
All API calls are HTTPS requests, and responses are returned in JSON format.

Steps to use the GitHub API:
1. Install `pygithub` with `pip install pygithub`.
2. Generate a GitHub personal access token, which is used to authenticate against the GitHub API.
3. Test your token in a local terminal using the following command. It is expected to return a JSON object containing your account info. GitHub has deprecated passing the token as a URL query parameter, so the command sends it in a header:
   `curl -H "Authorization: token {YOUR_TOKEN}" https://api.github.com/user`
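
The same check can be sketched in Python with only the standard library. This is a minimal sketch, assuming the token is sent in an `Authorization` header; `build_user_request` and `fetch_user` are hypothetical helper names, and `YOUR_TOKEN` is a placeholder.

```python
import json
import urllib.request

API_URL = "https://api.github.com/user"


def build_user_request(token):
    """Build (but do not send) an authenticated request for the current user."""
    request = urllib.request.Request(API_URL)
    # Assumption: the personal access token travels in the Authorization header.
    request.add_header("Authorization", "token " + token)
    request.add_header("Accept", "application/vnd.github.v3+json")
    return request


def fetch_user(token):
    """Send the request and decode the JSON object describing the account."""
    with urllib.request.urlopen(build_user_request(token)) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (needs network access and a real token):
# print(fetch_user("YOUR_TOKEN")["login"])
```

Building the request separately from sending it makes the header logic easy to inspect without touching the network.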




### NetworkX
NetworkX is a Python package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks. It provides tools for working with large datasets that have a network structure. With NetworkX, we can easily load and store networks in standard data formats. It can also help us generate classic networks, analyze network structure, build network models, and much more.

You can install `NetworkX` with `pip install networkx`.
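
To get a feel for the library, here is a minimal sketch of a directed graph in the same style as the one built later in this notebook. The user names `alice` and `bob` are made up for illustration.

```python
import networkx as nx

# A tiny directed graph: nodes carry a "type" attribute,
# edges point from a user to whatever they star or follow.
toy = nx.DiGraph()
toy.add_node("findspark(repo)", type="repo")
toy.add_node("alice(user)", type="user")  # hypothetical user
toy.add_node("bob(user)", type="user")    # hypothetical user

toy.add_edge("alice(user)", "findspark(repo)", type="gazes")
toy.add_edge("bob(user)", "findspark(repo)", type="gazes")
toy.add_edge("bob(user)", "alice(user)", type="follows")

print(toy.number_of_nodes(), toy.number_of_edges())  # 3 3
print(toy.in_degree("findspark(repo)"))              # 2 (number of stargazers)
print(toy["bob(user)"]["alice(user)"]["type"])       # follows
```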


In [4]:
import sys
from github import Github
import networkx as nx
from operator import itemgetter

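# Global Variables
ACCESS_TOKEN = 'YOUR_TOKEN'  # Paste your personal access token here; never publish a real one
USER = 'minrk'  # GitHub user name of the repo owner
REPO = 'findspark'  # Name of the repo to explore
client = Github(ACCESS_TOKEN)  # Authenticated PyGithub client
graph = nx.DiGraph()  # Directed graph holding users, repos and their relations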

### Set Up NetworkX Graph


After defining the user and repo name that we are going to explore, we can set up the NetworkX graph.

We add the repo and each user who starred it as nodes, and build an edge from each user to the repo. After this, we also add edges between users and their followers, pointing from the follower to the user being followed.

In [5]:
def buildRepoRelations(REPO):
    user = client.get_user(USER)
    repo = user.get_repo(REPO)  # Get a specific repo
    stargazers = list(repo.get_stargazers())  # The list of users who starred this REPO
    # The repo is the center node; every stargazer gets a 'gazes' edge pointing to it
    graph.add_node(repo.name + '(repo)', type='repo', lang=repo.language, owner=user.login)
    for stargazer in stargazers:
        graph.add_node(stargazer.login + '(user)', type='user')
        graph.add_edge(stargazer.login + '(user)', repo.name + '(repo)', type='gazes')
    # print(len(stargazers))  # Uncomment to check that the list looks correct
    return stargazers


def buildUserRelations(stargazers):
    for stargazer in stargazers:
        try:
            # Only keep followers who are already in the graph (i.e. other stargazers)
            for follower in stargazer.get_followers():
                if follower.login + '(user)' in graph:
                    graph.add_edge(follower.login + '(user)', stargazer.login + '(user)', type='follows')
        except Exception:
            print("Encountered an error when finding followers for user: ", stargazer.login)
        # Show how many API calls remain, as (remaining, limit)
        print ("API Calls Remaining", client.rate_limiting)

In [7]:
stargazers = buildRepoRelations(REPO)
buildUserRelations(stargazers)

API Calls Remaining (3968, 5000)
API Calls Remaining (3921, 5000)
API Calls Remaining (3916, 5000)
API Calls Remaining (3912, 5000)
API Calls Remaining (3907, 5000)
API Calls Remaining (3874, 5000)
API Calls Remaining (3870, 5000)
API Calls Remaining (3868, 5000)
API Calls Remaining (3852, 5000)
API Calls Remaining (3846, 5000)
API Calls Remaining (3844, 5000)
API Calls Remaining (3835, 5000)
API Calls Remaining (3831, 5000)
API Calls Remaining (3829, 5000)
API Calls Remaining (3827, 5000)
API Calls Remaining (3825, 5000)
API Calls Remaining (3808, 5000)
API Calls Remaining (3807, 5000)
API Calls Remaining (3805, 5000)
API Calls Remaining (3801, 5000)
API Calls Remaining (3797, 5000)
API Calls Remaining (3794, 5000)
API Calls Remaining (3792, 5000)
API Calls Remaining (3790, 5000)
API Calls Remaining (3788, 5000)
API Calls Remaining (3787, 5000)
API Calls Remaining (3785, 5000)
API Calls Remaining (3779, 5000)
API Calls Remaining (3777, 5000)
API Calls Remaining (3775, 5000)
API Calls 
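
`client.rate_limiting` is a `(remaining, limit)` pair, as the output above shows. A crawl over many stargazers can use up that budget, so it may help to check it before each batch of calls. The helper below is a hypothetical sketch: the name `enough_budget` and the threshold of 100 are assumptions, not part of PyGithub.

```python
def enough_budget(rate_limiting, min_remaining=100):
    """Return True if enough API calls remain to keep crawling.

    rate_limiting is a (remaining, limit) tuple such as (3968, 5000).
    """
    remaining, _limit = rate_limiting
    return remaining >= min_remaining


print(enough_budget((3968, 5000)))  # True
print(enough_budget((42, 5000)))    # False
```

In the crawl loop, one could call `enough_budget(client.rate_limiting)` and stop or pause when it returns `False`.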

### Find Hottest User
In this step, we use the graph built above to find the hottest users. We define the hottest user as the GitHub user followed by the largest number of people who starred the repo we chose earlier. Because only followers who are already in the graph get an edge, the candidates are themselves stargazers of the repo. This can also be read as: people who starred this repo also follow ...


In [8]:
from collections import Counter
from operator import itemgetter

def getHottestUser(stargazers):
    
    temp_list = []
    for edge in graph.edges(data = True):
        if edge[2]['type'] == 'follows':
            temp_list.append(edge[1])
    counter = Counter(temp_list)
    
    popular_users = []
    for u, f in counter.most_common():
        popular_users.append((u,f))
    print ("Number of popular users", len(popular_users))
    print ("Top popular users:", popular_users[:10])
    
getHottestUser(stargazers)




Number of popular users 32
Top popular users: [('minimaxir(user)', 7), ('stared(user)', 6), ('freeman-lab(user)', 5), ('rgbkrk(user)', 4), ('nchammas(user)', 3), ('dclambert(user)', 3), ('esafak(user)', 3), ('dapurv5(user)', 2), ('rholder(user)', 2), ('xiaohan2012(user)', 2)]


The result above shows the most popular users. However, raw follower counts are only one view, and we care more about the centrality measures that NetworkX provides.
#### Degree Centrality
The degree centrality of a node v is the fraction of nodes it is connected to.
#### Betweenness Centrality
The betweenness centrality of a node v is the sum of the fraction of all-pairs shortest paths that pass through v.
#### Closeness Centrality
The closeness centrality of a node u is the reciprocal of the sum of the shortest path distances from u to all n-1 other nodes. Since the sum of distances depends on the number of nodes in the graph, closeness is normalized by the sum of minimum possible distances, n-1.
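
The three definitions can be checked by hand on a tiny graph. The sketch below uses an undirected path a - b - c with made-up node names. It is undirected only to keep the arithmetic simple. Our real graph is directed, so its closeness values also depend on edge direction.

```python
import networkx as nx

# Undirected path a - b - c (hypothetical nodes), so n = 3.
path = nx.Graph()
path.add_edges_from([("a", "b"), ("b", "c")])

dc = nx.degree_centrality(path)       # degree / (n - 1)
bc = nx.betweenness_centrality(path)  # share of shortest paths through a node
cc = nx.closeness_centrality(path)    # (n - 1) / sum of distances

# b touches both other nodes, sits on the only a-c path, and is 1 step from each.
print(dc["b"], bc["b"], cc["b"])            # 1.0 1.0 1.0
# a touches one node, lies on no path between others, distances are 1 + 2.
print(dc["a"], bc["a"], round(cc["a"], 3))  # 0.5 0.0 0.667
```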




In [9]:
def formatResult(graph):
    graph_copy = graph.copy()
    # Remove center node
    graph_copy.remove_node('findspark(repo)')

    dc = sorted(nx.degree_centrality(graph_copy).items(), 
                key=itemgetter(1), reverse=True)

    bc = sorted(nx.betweenness_centrality(graph_copy).items(), 
                key=itemgetter(1), reverse=True)
    cc = sorted(nx.closeness_centrality(graph_copy).items(), 
                key=itemgetter(1), reverse=True)
    return (dc, bc, cc)

dc, bc, cc = formatResult(graph)

print ("Degree Centrality")
print (dc[:5],'\n')

print ("Betweenness Centrality")
print (bc[:5],'\n')

print ("Closeness Centrality")
print (cc[:5])

Degree Centrality
[('fly51fly(user)', 0.08955223880597014), ('gauravssnl(user)', 0.04477611940298507), ('esafak(user)', 0.03980099502487562), ('minimaxir(user)', 0.03482587064676617), ('andrewiiird(user)', 0.03482587064676617)] 

Betweenness Centrality
[('esafak(user)', 0.0003772802653399668), ('rgbkrk(user)', 0.00028606965174129354), ('dapurv5(user)', 9.950248756218905e-05), ('andrewiiird(user)', 9.12106135986733e-05), ('nchammas(user)', 8.706467661691541e-05)] 

Closeness Centrality
[('freeman-lab(user)', 0.03655634869132597), ('minimaxir(user)', 0.03482587064676617), ('stared(user)', 0.030472636815920398), ('rgbkrk(user)', 0.026533996683250412), ('nchammas(user)', 0.01990049751243781)]


### Find Hottest Repository 

Next, we go through the starred repos of each user and add them to the network. After that, it is easy to find the most popular repositories. We can also learn the language preferences of a given user.

In [10]:
def buildRepoNet(stargazers, limit_repo):
    for i, v in enumerate(stargazers):
        print(v.login)
        try:
            for starred in v.get_starred()[:limit_repo]:  # Slice to avoid supernodes
                graph.add_node(starred.name + '(repo)', type='repo', lang=starred.language, \
                           owner=starred.owner.login)
                graph.add_edge(v.login + '(user)', starred.name + '(repo)', type='gazes')
        except Exception:  # ssl.SSLError:
            print("Encountered an error fetching starred repos for", v.login, "Skipping.")

        print("Num nodes/edges in graph", graph.number_of_nodes(), "/", graph.number_of_edges())
    print(nx.info(graph), '\n')


Sometimes a user has starred so many repos that building the network takes a long time. The `limit_repo` parameter therefore caps the number of starred repos fetched per user.
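
The benefit of capping a lazily fetched listing can be illustrated with a stand-in generator. This is a sketch under assumptions: `fake_starred` is a hypothetical simulation, not PyGithub's implementation, and it only shows that items beyond the cap are never requested.

```python
from itertools import islice


def fake_starred(fetch_log):
    """Simulate a lazily paginated listing; each item costs one 'fetch'."""
    for i in range(1000):
        fetch_log.append(i)  # record that this item was actually fetched
        yield "repo-%d" % i


log = []
first_five = list(islice(fake_starred(log), 5))

print(first_five)  # ['repo-0', 'repo-1', 'repo-2', 'repo-3', 'repo-4']
print(len(log))    # 5: the remaining 995 items were never fetched
```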

In [11]:
buildRepoNet(stargazers,5)

pchalasani
Num nodes/edges in graph 208 / 269
rgbkrk
Num nodes/edges in graph 213 / 274
esafak
Num nodes/edges in graph 218 / 279
rdhyee
Num nodes/edges in graph 223 / 284
rholder
Num nodes/edges in graph 228 / 289
freeman-lab
Num nodes/edges in graph 233 / 294
he0x
Num nodes/edges in graph 238 / 299
cbouey
Num nodes/edges in graph 243 / 304
stared
Num nodes/edges in graph 248 / 309
ryan-williams
Num nodes/edges in graph 252 / 314
nehalecky
Num nodes/edges in graph 257 / 319
jazzwang
Num nodes/edges in graph 262 / 324
paulochf
Num nodes/edges in graph 267 / 329
Erstwild
Num nodes/edges in graph 272 / 334
sandysnunes
Num nodes/edges in graph 277 / 339
d18s
Num nodes/edges in graph 282 / 344
amontalenti
Num nodes/edges in graph 287 / 349
richardskim111
Num nodes/edges in graph 292 / 354
hmourit
Num nodes/edges in graph 297 / 359
robcowie
Num nodes/edges in graph 301 / 363
ocanbascil
Num nodes/edges in graph 306 / 368
fish2000
Num nodes/edges in graph 311 / 373
jiamo
Num nodes/edges in gr

Num nodes/edges in graph 1040 / 1157
mikestaszel
Num nodes/edges in graph 1044 / 1161
sophiedophie
Num nodes/edges in graph 1048 / 1166
Nimayer
Num nodes/edges in graph 1053 / 1171
chocolocked
Num nodes/edges in graph 1057 / 1176
gabides
Num nodes/edges in graph 1058 / 1177
soumyadsanyal
Encountered an error fetching starred repos for soumyadsanyal Skipping.
Num nodes/edges in graph 1058 / 1177
mas-dse-m2ryu
Encountered an error fetching starred repos for mas-dse-m2ryu Skipping.
Num nodes/edges in graph 1058 / 1177
Boukefalos
Encountered an error fetching starred repos for Boukefalos Skipping.
Num nodes/edges in graph 1058 / 1177
saikatkumardey
Encountered an error fetching starred repos for saikatkumardey Skipping.
Num nodes/edges in graph 1058 / 1177
pluteski
Num nodes/edges in graph 1062 / 1181
rainsunny
Num nodes/edges in graph 1066 / 1185
shriharishastry
Num nodes/edges in graph 1070 / 1189
bwhitesell
Num nodes/edges in graph 1073 / 1193
fivejjs
Num nodes/edges in graph 1076 / 119

In [18]:
def getTopNRepos(n):
    print("Top "+str(n)+" Popular repositories:")
    
    repos = []
    # NetworkX 1.x API; on NetworkX 2.x and later, iterate over graph.in_degree() and index graph.nodes[v] instead
    for (v, i) in graph.in_degree_iter():
        if graph.node[v]['type'] == 'repo':
            repos.append((v,i))
    repos = sorted(repos, key = lambda x:x[1], reverse=True)
    print(repos[:n])

In [19]:
getTopNRepos(10)

Top 10 Popular repositories:
[('findspark(repo)', 202), ('kubeflow(repo)', 6), ('models(repo)', 3), ('gvisor(repo)', 3), ('hypertools(repo)', 3), ('arrow(repo)', 3), ('foundationdb(repo)', 3), ('reference(repo)', 3), ('examples(repo)', 3), ('ann-visualizer(repo)', 3)]
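
On NetworkX 2.x or later, where `in_degree_iter` is no longer available, the same ranking might be sketched as follows. The function name `top_n_repos` and the toy users are assumptions for illustration. The graph below is not the crawled data.

```python
import networkx as nx


def top_n_repos(graph, n):
    """Return the n repo nodes with the highest in-degree."""
    repos = [(node, degree)
             for node, degree in graph.in_degree()  # (node, in-degree) pairs
             if graph.nodes[node].get("type") == "repo"]
    return sorted(repos, key=lambda pair: pair[1], reverse=True)[:n]


# Toy graph with hypothetical users.
toy = nx.DiGraph()
toy.add_node("findspark(repo)", type="repo")
toy.add_node("kubeflow(repo)", type="repo")
for user in ("alice(user)", "bob(user)", "carol(user)"):
    toy.add_node(user, type="user")
    toy.add_edge(user, "findspark(repo)", type="gazes")
toy.add_edge("alice(user)", "kubeflow(repo)", type="gazes")

print(top_n_repos(toy, 2))  # [('findspark(repo)', 3), ('kubeflow(repo)', 1)]
```

Passing the graph as an argument, rather than reading a global, keeps the function easy to try on small test graphs.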


In [20]:
def getUserPreference(username):
    print("Repositories that "+ username+" has starred")
    for v in graph[username+"(user)"]:
        if graph[username+"(user)"][v]['type'] == 'gazes':
            print(v)

    print("Programming languages "+ username+" is interested in")
    
    langs = set()
    for v in graph[username+"(user)"]:
        if graph[username+"(user)"][v]['type'] == 'gazes':
            langs.add(graph.node[v]['lang'])
    print(langs)




In [21]:
getUserPreference('luzhijun')

Repositories that luzhijun has starred
findspark(repo)
Kalman-and-Bayesian-Filters-in-Python(repo)
TensorFlow-Machine-Learning-Cookbook(repo)
recharts2(repo)
fpinscala(repo)
be-a-professional-programmer(repo)
Programming languages luzhijun is interested in
{'Jupyter Notebook', 'Python', 'Scala', 'R', None}


### Find Hot Programming Language
What if we want to know which language is the most popular across all projects in the graph we generated? We can go through the projects, extract the language of each one, and feed it into a counter. That is easy to do. But what if we want to know how many users program in a particular language? That requires scanning the in-edges (the users) of every project and summing up the counts, which is expensive to repeat for each query. A better idea is to expand the graph and add each language as a node of its own. The logic diagram of the final graph is as follows:


[<img src="https://raw.githubusercontent.com/hanna1994/han1994/master/191525762986_.pic_hd.jpg">](https://raw.githubusercontent.com/hanna1994/han1994/master/191525762986_.pic_hd.jpg)
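
For comparison, the two direct counts described above can be sketched without language nodes. The star records below are made-up toy data, not the crawled graph. The second count needs a pass over every user-repo edge, which is the cost that the language nodes avoid.

```python
from collections import Counter, defaultdict

# (user, repo, language) star records; toy data for illustration only.
stars = [
    ("alice", "findspark", "Python"),
    ("bob", "findspark", "Python"),
    ("alice", "d3", "JavaScript"),
    ("bob", "spark", "Scala"),
    ("carol", "findspark", "Python"),
]

# Count 1: how many distinct repos use each language.
repo_lang = {repo: lang for _user, repo, lang in stars}
repos_per_lang = Counter(repo_lang.values())

# Count 2: how many distinct users starred a repo in each language.
users_by_lang = defaultdict(set)
for user, _repo, lang in stars:
    users_by_lang[lang].add(user)
users_per_lang = {lang: len(users) for lang, users in users_by_lang.items()}

print(repos_per_lang["Python"])  # 1
print(users_per_lang["Python"])  # 3
```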

In [25]:
repos = [n for n in graph.nodes_iter() if graph.node[n]['type'] == 'repo']

for repo in repos:
    # some empty projects may have none of the language part
    lang = (graph.node[repo]['lang'] or "") + "(lang)"
    # users who gazed repo
    stargazers = [u for (u, r, d) in graph.in_edges_iter(repo, data=True) if d['type'] == 'gazes']
    
    for sg in stargazers:
        graph.add_node(lang, type='lang')
        graph.add_edge(sg, lang, type='programs')
        graph.add_edge(lang, repo, type='implements')  

Then let's look at the language counts. First, we check which languages are present in the graph:

In [33]:
print (nx.info(graph),'\n')
# see what languages are in the map
print ([n 
       for n in graph.nodes() 
           if graph.node[n]['type'] == 'lang'])


Name: 
Type: DiGraph
Number of nodes: 1144
Number of edges: 2832
Average in degree:   2.4755
Average out degree:   2.4755 

['Python(lang)', 'JavaScript(lang)', 'Jupyter Notebook(lang)', 'Java(lang)', 'C++(lang)', 'R(lang)', 'Go(lang)', '(lang)', 'C(lang)', 'Ruby(lang)', 'HTML(lang)', 'C#(lang)', 'Scala(lang)', 'Shell(lang)', 'PLpgSQL(lang)', 'Haskell(lang)', 'CSS(lang)', 'PHP(lang)', 'Vim script(lang)', 'Swift(lang)', 'Processing(lang)', 'Makefile(lang)', 'Elm(lang)', 'TypeScript(lang)', 'Emacs Lisp(lang)', 'DIGITAL Command Language(lang)', 'TeX(lang)', 'PLSQL(lang)', 'OCaml(lang)', 'Rust(lang)', 'Erlang(lang)', 'PureBasic(lang)', 'Lua(lang)', 'Julia(lang)', 'Matlab(lang)', 'Objective-C(lang)', 'Kotlin(lang)', 'Clojure(lang)', 'Roff(lang)', 'SQLPL(lang)', 'CoffeeScript(lang)', 'NSIS(lang)', 'D(lang)', 'CMake(lang)', 'Dart(lang)', 'Fortran(lang)', 'Perl(lang)', 'GCC Machine Description(lang)', 'Pony(lang)']


Now we can see the top 10 hottest languages users are using:

In [35]:
# find hottest language
print ("Most popular languages")
print (sorted([(n, graph.in_degree(n))
 for n in graph.nodes() 
     if graph.node[n]['type'] == 'lang'], key=itemgetter(1), reverse=True)[:10])

Most popular languages
[('Python(lang)', 202), ('JavaScript(lang)', 67), ('(lang)', 62), ('Jupyter Notebook(lang)', 56), ('C++(lang)', 51), ('Java(lang)', 38), ('Go(lang)', 38), ('Shell(lang)', 28), ('HTML(lang)', 24), ('Scala(lang)', 24)]


Next, let us look at the usage of some common languages:

In [42]:
# find how many people are using a certain language
def programmers_of(lang):
    # users with an edge pointing into the given language node
    return [u for (u, l) in graph.in_edges(lang + '(lang)')
            if graph.node[u]['type'] == 'user']

python_programmers = programmers_of('Python')
print ("Number of Python programmers:", len(python_programmers))

javascript_programmers = programmers_of('JavaScript')
print ("Number of JavaScript programmers:", len(javascript_programmers))

print ("Number of R programmers:", len(programmers_of('R')))
print("--------------------------------------------------------")

print ("Number of C programmers:", len(programmers_of('C')))
print ("Number of C++ programmers:", len(programmers_of('C++')))
print("--------------------------------------------------------")

print ("Number of CSS programmers:", len(programmers_of('CSS')))
print ("Number of HTML programmers:", len(programmers_of('HTML')))
print ("Number of PHP programmers:", len(programmers_of('PHP')))


Number of Python programmers: 202
Number of JavaScript programmers: 67
Number of R programmers: 11
--------------------------------------------------------
Number of C programmers: 19
Number of C++ programmers: 51
--------------------------------------------------------
Number of CSS programmers: 13
Number of HTML programmers: 24
Number of PHP programmers: 3


For the two most-used languages, Python and JavaScript, we also analyzed the overlap between their users:

In [40]:
# people who use both languages
print ("Number of programmers who use JavaScript and Python:")
print (len(set(python_programmers).intersection(set(javascript_programmers))))

# people who use js but not python
print ("Number of programmers who use JavaScript but not Python:")
print (len(set(javascript_programmers).difference(set(python_programmers))))

Number of programmers who use JavaScript and Python:
67
Number of programmers who use JavaScript but not Python:
0


Voilà! Now we can analyze the results. They show that scripting languages such as Python and JavaScript dominate the languages of the starred repositories. Although the data we analyzed is small (the final graph has only 1144 nodes and 2832 edges), it echoes the old saying: “GitHub is the world of front end”.

What's more, Scala also makes the top 10 in our results, probably because the seed project of the crawl is findspark, a Spark-related tool.

Another interesting phenomenon is that every programmer who uses JavaScript also uses Python, while those who use Python do not necessarily use JavaScript (202 - 67 = 135 Python users are not linked to JavaScript). This is largely an artifact of how the graph was built: every user in it starred findspark, so all 202 users are linked to Python through that repo.