# Analyse GitHub archives using GraphX

_Trying to detect open source communities based on contributions_

## Set up the environment to work with GraphX and JSON data

### Import some GitHub data

In [ ]:
import sys.process._
// Download and unpack one hour of GitHub Archive events, unless the file is already there
if (!new java.io.File("/tmp/github.json").exists) {
  // `.!!` is called with an explicit dot, which avoids the postfixOps feature warning
  (new java.net.URL("http://data.githubarchive.org/2015-01-01-15.json.gz") #> new java.io.File("/tmp/github.json.gz")).!!

  Seq("gunzip", "-f", "/tmp/github.json.gz").!!
}

import sys.process._
res1: Any = ""


### The size of the data

In [ ]:
:sh du -h /tmp/github.json

25M	/tmp/github.json

import sys.process._




## First some Spark manipulation 

In [ ]:
val raw = sparkContext.textFile("/tmp/github.json")

raw: org.apache.spark.rdd.RDD[String] = /tmp/github.json MapPartitionsRDD[1] at textFile at <console>:73


### The number of lines in the file

In [ ]:
raw.count

res6: Long = 11351


### Convert line to JSON _(simple Map of Maps)_

In [ ]:
val json = raw.mapPartitions{ lines => 
  // imports and mapper live inside the closure: one ObjectMapper per partition
  import com.fasterxml.jackson._
  import com.fasterxml.jackson.core._
  import com.fasterxml.jackson.databind._
  import com.fasterxml.jackson.module.scala._
  val mapper = new ObjectMapper()
  mapper.registerModule(DefaultScalaModule)
  // each line of the file is one JSON document (one event)
  lines.map(x => mapper.readValue(x, classOf[Map[String,Any]]))
}

json: org.apache.spark.rdd.RDD[Map[String,Any]] = MapPartitionsRDD[2] at mapPartitions at <console>:75


### Let's look at the first two rows

In [ ]:
json.take(2).toList

res9: List[Map[String,Any]] = List(Map(actor -> Map(gravatar_id -> "", url -> https://api.github.com/users/petroav, id -> 665991, login -> petroav, avatar_url -> https://avatars.githubusercontent.com/u/665991?), payload -> Map(description -> Solution to homework and assignments from MIT's 6.828 (Operating Systems Engineering). Done in my spare time., ref_type -> branch, ref -> master, master_branch -> master, pusher_type -> user), public -> true, id -> 2489651045, created_at -> 2015-01-01T15:00:00Z, repo -> Map(id -> 28688495, name -> petroav/6.828, url -> https://api.github.com/repos/petroav/6.828), type -> CreateEvent), Map(actor -> Map(gravatar_id -> "", url -> https://api.github.com/users/rspt, id -> 3854017, login -> rspt, avatar_url -> https://avatars.githubusercontent.com/u/38540...
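
Since each parsed event is an untyped `Map[String, Any]`, reaching a nested field means casting at every level. Here is a minimal sketch of a safer lookup. The helper `field` is hypothetical (it is not part of this notebook), and the sample event is hand-built to mimic the shape of the first row above.

```scala
object NestedLookup {
  // Hypothetical helper: walks a path of keys through nested maps and
  // returns None as soon as a key is missing or a level is not a map.
  def field(event: Map[String, Any], path: String*): Option[Any] =
    path.toList match {
      case Nil        => None
      case key :: Nil => event.get(key)
      case key :: rest =>
        event.get(key) match {
          case Some(m: Map[_, _]) => field(m.asInstanceOf[Map[String, Any]], rest: _*)
          case _                  => None
        }
    }

  // Hand-built sample mimicking the shape of a GitHub Archive event
  val sample: Map[String, Any] = Map(
    "type"  -> "CreateEvent",
    "actor" -> Map("id" -> 665991, "login" -> "petroav"),
    "repo"  -> Map("id" -> 28688495, "name" -> "petroav/6.828")
  )

  val login: Option[Any]   = field(sample, "actor", "login")
  val missing: Option[Any] = field(sample, "actor", "email")
}
```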

## The graph part 

We could use the *actors* and the *repos* as vertices, and the *events* as the relationships between them.

Both actors and repos come with an *id*, so we can use these directly as vertex ids in GraphX.
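
One caveat worth checking before relying on the raw ids: GitHub appears to number users and repos independently, so an actor and a repo could share the same id and end up as a single vertex. Below is a minimal sketch of one way to keep the two id spaces apart. The helper names are hypothetical and the rest of this notebook does not use them.

```scala
object DisjointIds {
  // Assumption: actor ids and repo ids come from separate sequences,
  // so the same number may appear in both. Tagging the id keeps them apart.
  def actorVertexId(id: Long): Long = id * 2       // even ids for actors
  def repoVertexId(id: Long): Long  = id * 2 + 1   // odd ids for repos

  // Recover the kind (0 = actor, 1 = repo) and the original id
  def decode(vertexId: Long): (Short, Long) =
    ((vertexId % 2).toShort, vertexId / 2)
}
```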

In [ ]:
import org.apache.spark.rdd._
import org.apache.spark.graphx._

import org.apache.spark.rdd._
import org.apache.spark.graphx._


### RDD vertices {Actors U Repos}

In [ ]:
// Vertex payload is (kind, label): kind 0 = actor, kind 1 = repo
val actors:RDD[(VertexId, (Short, String))] = json.map{ x => 
  val actor = x("actor").asInstanceOf[Map[String, Any]]
  val id = actor("id").toString.toLong
  val login = actor("login").toString
  (id, (0, login))
}
val repos:RDD[(VertexId, (Short, String))] = json.map{ x => 
  val repo = x("repo").asInstanceOf[Map[String, Any]]
  val id = repo("id").toString.toLong
  val name = repo("name").toString
  (id, (1, name))
}
// One entry per event, so the same actor or repo can appear several times here
val vertices:RDD[(VertexId, (Short, String))] = actors union repos

actors: org.apache.spark.rdd.RDD[(org.apache.spark.graphx.VertexId, (Short, String))] = MapPartitionsRDD[3] at map at <console>:83
repos: org.apache.spark.rdd.RDD[(org.apache.spark.graphx.VertexId, (Short, String))] = MapPartitionsRDD[4] at map at <console>:89
vertices: org.apache.spark.rdd.RDD[(org.apache.spark.graphx.VertexId, (Short, String))] = UnionRDD[5] at union at <console>:95


### RDD of Edges 

Now an **RDD** with the edges, including the reverse ones (that is, from repo to actor).

In [ ]:
// Some("PushEvent") → actor pushed on repo (edge from actor to repo)
// None → reverse edge (from repo to actor)
val edges:RDD[Edge[Option[String]]] = json.flatMap { x =>
  val event = x.get("type").map(_.toString)
  val actor = x("actor").asInstanceOf[Map[String, Any]]("id").toString.toLong
  val repo = x("repo").asInstanceOf[Map[String, Any]]("id").toString.toLong
  // the reverse edge swaps source and destination
  List(Edge(actor, repo, event), Edge(repo, actor, None))
}

edges: org.apache.spark.rdd.RDD[org.apache.spark.graphx.Edge[Option[String]]] = MapPartitionsRDD[6] at flatMap at <console>:85
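
To make the forward/reverse convention from the comments concrete without needing Spark, here is a small sketch. `SimpleEdge` is a hypothetical stand-in for GraphX's `Edge`, and the ids are those of the first event shown earlier.

```scala
object EdgeSketch {
  // Hypothetical stand-in for GraphX's Edge(srcId, dstId, attr)
  final case class SimpleEdge(src: Long, dst: Long, attr: Option[String])

  // One event yields a labelled actor-to-repo edge and an unlabelled repo-to-actor edge
  def edgesFor(actor: Long, repo: Long, eventType: Option[String]): List[SimpleEdge] =
    List(SimpleEdge(actor, repo, eventType), SimpleEdge(repo, actor, None))

  val sample: List[SimpleEdge] = edgesFor(665991L, 28688495L, Some("CreateEvent"))
}
```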


### Graph

In [ ]:
val graph = Graph(vertices, edges)

graph: org.apache.spark.graphx.Graph[(Short, String),Option[String]] = org.apache.spark.graphx.impl.GraphImpl@32d88be9


## Open source working community 

A very simple way to extract such communities is to compute the connected components of the graph.

A component is then a set of actors and repos that are connected to each other but to no other actor or repo, where a connection stands for a collaboration.

### Computing connected components 

In [ ]:
val cc = graph.connectedComponents

cc: org.apache.spark.graphx.Graph[org.apache.spark.graphx.VertexId,Option[String]] = org.apache.spark.graphx.impl.GraphImpl@3e1e8afd


The `cc` variable is the original graph, except that each vertex's payload is now only the cluster it belongs to. A cluster is identified by the smallest `VertexId` it contains.
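
To illustrate why the smallest id ends up naming the cluster, here is a simplified, sequential sketch of min-label propagation on plain Scala collections. It is not GraphX's actual implementation, only the same idea in miniature, and the edges are made up.

```scala
object MinLabelComponents {
  // Made-up undirected edges between vertex ids
  val edges: List[(Long, Long)] =
    List((1L, 10L), (2L, 10L), (3L, 20L), (4L, 30L), (4L, 20L))

  // Every vertex starts labelled with its own id; labels then shrink to the
  // smallest id seen across each edge, round after round, until nothing changes.
  def components(edges: List[(Long, Long)]): Map[Long, Long] = {
    val vertices = edges.flatMap { case (a, b) => List(a, b) }.distinct
    var labels: Map[Long, Long] = vertices.map(v => v -> v).toMap
    var changed = true
    while (changed) {
      val next = edges.foldLeft(labels) { case (acc, (a, b)) =>
        val m = math.min(acc(a), acc(b))
        acc.updated(a, m).updated(b, m)
      }
      changed = next != labels
      labels = next
    }
    labels
  }

  val labels: Map[Long, Long] = components(edges)
  // Counting distinct labels gives the number of clusters
  val clusterCount: Int = labels.values.toSet.size
}
```

Here vertices 1, 2 and 10 share one label while 3, 4, 20 and 30 share another, which mirrors how counting the distinct payloads of `cc.vertices` counts the clusters.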

#### Number of connected components 

The number of clusters is simply the number of distinct payloads among the vertices.

In [ ]:
<strong style="color: red">{cc.vertices.map(_._2).distinct.count}</strong>

res16: scala.xml.Elem = <strong style="color: red">4774</strong>


### Clusters by language 

We can try to focus the analysis on specific languages. Since the events data doesn't carry the language information (it would take an extra call to the GitHub API), we'll take a naive approach: **we'll only consider the repos having the language in their name**, which is admittedly not 100% safe.
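
One way to make this naive match a little less noisy would be to look only at the part of the name after the owner. The sketch below assumes full names have the form `owner/repo`; the helper names are hypothetical and are not used by the code in this notebook.

```scala
object RepoNameMatch {
  // Keep only the part after "owner/"; a name without '/' is returned unchanged
  def repoPart(fullName: String): String =
    fullName.substring(fullName.indexOf('/') + 1)

  // True when the repo part (not the owner) mentions the language
  def mentions(fullName: String, lgg: String): Boolean =
    repoPart(fullName).toLowerCase.contains(lgg.toLowerCase)
}
```

With this, `apache/spark` would still match `"spark"`, while `SparkMonkay/Codes` (where only the owner mentions it) would not.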

#### Utility functions

The following function retrieves the clusters that contain at least one repo matching a given language, sorted by decreasing size.

In [ ]:
import org.apache.spark.SparkContext._
def cluster(lgg:String) = {
  // collect all repos whose name mentions the language `lgg`
  val lggRepos:List[(VertexId, (Short, String))] = vertices.filter { x => 
                    x._2._1 /*vertex type*/ == 1 /*repo*/ && 
                    x._2._2/*repo name*/.toLowerCase.contains(lgg) // ideally this would ignore the owner part before the '/'
                }.collect().toList
  // keep only the distinct ids
  // NOTE: on a real cluster this list should be a broadcast variable
  val lggRepoIds:List[Long] = lggRepos.map(_._1).distinct
  // cluster ids for these repos (another candidate for a broadcast variable)
  val clusterIds:List[Long] = cc.vertices.filter(x => lggRepoIds.contains(x._1))
                            .map(_._2)
                            .collect()
                            .toList
  // return the vertices being clustered sorted by decreasing cardinality
  val clusters:List[(Long, Iterable[Long])] = cc.vertices.filter{ x => clusterIds.contains(x._2) }
                 .groupBy(_._2)
                 .mapValues(_.map(_._1))
                 .collect().toList
                 .sortBy(_._2.size)
                 .reverse
  clusters
}

import org.apache.spark.SparkContext._
cluster: (lgg: String)List[(Long, Iterable[Long])]
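
The `contains` calls in `cluster` scan a `List` for every vertex, and the lists travel inside each task's closure. Below is a hedged sketch of the same filtering logic with `Set` membership, on plain collections so that it runs without Spark. On a real cluster the set would presumably be wrapped with `sparkContext.broadcast(...)` and read through `.value`, which appears here only as a comment. All ids are made up.

```scala
object MembershipSketch {
  // Made-up (vertexId, clusterId) pairs standing in for cc.vertices
  val ccVertices: List[(Long, Long)] =
    List((10L, 1L), (11L, 1L), (20L, 2L), (30L, 3L))

  // Ids of the repos matching the language, as a Set for fast lookups
  val lggRepoIds: Set[Long] = Set(10L, 30L)
  // With Spark: val ids = sparkContext.broadcast(lggRepoIds), then ids.value.contains(...)

  // Cluster ids owning at least one matching repo
  val clusterIds: Set[Long] =
    ccVertices.collect { case (v, c) if lggRepoIds(v) => c }.toSet

  // Members of those clusters, largest cluster first
  val clusters: List[(Long, List[Long])] =
    ccVertices.filter { case (_, c) => clusterIds(c) }
      .groupBy(_._2)
      .map { case (c, vs) => c -> vs.map(_._1) }
      .toList
      .sortBy { case (_, vs) => -vs.size }
}
```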


The next function renders the list of repos and actors included in a given cluster.

In [ ]:
def showCluster(lgg:String, clusterIds:List[Long]) = {
  val c = graph.vertices
               .filter(x => clusterIds.contains(x._1))
               .collect().toList

  <div>
  <p><strong>Repos</strong></p>
  <ul>{ 
  c.collect { case (x, (1, r)) =>
           //show the repo
           val t = if (r.toLowerCase.contains(lgg)) <strong style="color: red;">{r}</strong> else r
             <li><a href={"http://github.com/"+r}>{t}</a></li> 
          }
  }</ul>
  <p><strong>Users</strong></p>
  <ul>{ 
  c.collect { case (x, (0, n)) =>
           //show the user
             <li><a href={"http://github.com/"+n}>{n}</a></li> 
          }
  }</ul>
  </div>
}

showCluster: (lgg: String, clusterIds: List[Long])scala.xml.Elem


## JavaScript

In [ ]:
val js = cluster("js")

js: List[(Long, Iterable[Long])] = List((2844,List(315596, 670440, 23082332, 95872, 1136652, 16960472, 1415488, 21872392, 119508, 26068656, 7944140, 4507612, 26809512, 28081156, 2057932, 3444336, 23338500, 2844, 188172, 36964, 6755852, 489576, 1230048, 24049584, 15204860, 6462268, 22430020, 1642136, 208340, 8749504, 27150864, 10364944, 22975952, 304332, 170820, 3591964, 4930716, 553444, 21108956, 2829600, 19821524, 959908, 3296912, 8665740, 7652428, 9896628, 2126244, 115904, 2606236, 2595532, 16809332, 14840449, 28573641, 261237, 10364781, 892945, 774297, 178965, 4371337, 6452529, 20407433, 28517589, 362985, 28538501, 183721, 28685197, 9914221, 10849933, 28452477, 18607529, 2029169, 16179237, 3106725, 1295961, 4927517, 21315049, 1719745, 453705, 1885237, 24847217, 13645881, 1039597, 282...

**Let's look at the 3 biggest clusters**

In [ ]:
layout(3, js.take(3).map(r => html(showCluster("js", r._2.toList))))

res21: notebook.front.Widget = <widget>


0,1,2
Repos  slackhq/SlackTextViewControlleremirozer/fake2dbjosephmisiti/awesome-machine-learningzhxnlai/ZLSwipeableViewddeboer/vatinomergul123/LLSimpleCameralexrus/VPNOnclockfly/gearpumpeigengo/liftpapers-we-love/papers-we-loveongakuer/CircleIndicatorcoursera/pandas-plyJigarM/UICollectionView-Swiftsnowplow/snowplowbro/brogorhill/uBlocklexrus/vpn-deploy-playbookOpenRA/OpenRAmsiemens/PyGitUptwbs/bootstrapstrukturag/spreed-webrtcmichaelschramek/michaelschramek.github.iokozross/awesome-clrethinkdb/rethinkdbjostw/jos.twimgix/imgix-emacsSTRML/react-grid-layoutCPAN-PRC/resourcesmutualmobile/MMDrawerControllerajalt/fuckitpysolnic/transprocFamous/famousMatt-Esch/virtual-domaol/molochdongri/OAuthSwiftstatsmodels/statsmodelsLipkeGu/OpenRAdrrb/java-rust-exampleflarum/coreh5bp/Front-end-Developer-Interview-Questionsphilipwalton/solved-by-flexboxwasabeef/awesome-android-librariesbclozel/spring-resource-handlingenaqx/awesome-reactLnxPrgr3/crossfeedmyfreeweb/dotfileswasabeef/awesome-android-uijgeigerm/spotify2playmusicbegriffs/postgrestvinta/awesome-pythonschneiderandre/poppingccoenraets/PageSliderprakhar1989/awesome-coursesstaltz/cycleadrai/flowchart.jsariok/BWWalkthroughrmcgibbo/slidedeckjessesquires/JSQSystemSoundPlayerJamesking56/Cachetbramp/js-sequence-diagramscemolcay/PullToRefreshCoreTextgulpjs/gulpguardian/open-platform-sitegorbin/ASNEgorhill/uMatrixangular/angular.jsauchenberg/chrome-devtools-appmrmrs/colorsReactive-Extensions/RxJSSamyPesse/tv.jsmamaral/MAThemeKitindragiek/DominantColorjessesquires/JSQMessagesViewControllerwinebarrel/piculetdimsemenov/PhotoSwipecachethq/Cachetdockerboard/dockerboardrobbiev/devdnsyuppielabel/YPDrawSignatureViewcfpb/qustevencorona/elastic-haproxyintel-hadoop/gearpumpalvarotrigo/fullPage.jsknsv/mermaidguardian/frontendRamotion/adaptive-tab-barnvbn/submanmindd-it/pappu-pakiaarashpayan/appiraterSFTtech/openagecfpb/hmda-viz-prototype  Users  
goshakkklshooaslamjperlysmoodjoelataylorDavidHu0921LipkeGutimmyshenstemManUtopiKtheeferkaittodeskVFedykaloasutaadel112lunziimyfreewebkuehrmannpanhapandatlongrenseyhunaksirodohtGrahamCampbellGuorgMasmineyangWillDiggleaudySymphoclockflynieuwenhovenlrebolax2boolaspiressvatsanjostwLeoKudrikcemclellanmkalbmoureaufSolidorwdownearthmarkemermmccaffhemincongJamesking56toromegugeoerPrefinemwouterwcyberjar09gorhillSoufiendmeulenbluebird88airtoxinUniIslandthangngoc89ZachOrrbgruszkalemonxiao0afaviaraindevseccodehitenprataptonydalyaschepisegordorstephenwaycmrecimrjsnookrhomlmuhaha03davidmaitlandjdreesenmenembusybeaverszrenweiiPaulProtinyaonijouleCradledavidchasewildtypeollymjkxyzshoitohaosdentjosef-pktxiaodi555549mschramekOnatcermhparker23thelONE,Repos  apache/wss4jmwclient/mwclientspring-projects/spring-webflowalexz-enwp/wikitoolsspring-projects/spring-amqpspring-projects/spring-dataapache/log4jwikimedia/pywikibot-corespring-projects/spring-wsSeleniumHQ/seleniumtj/git-extrasspring-projects/spring-batcheasymock/easymockapache/axis2-javaapache/camelsass/sassapache/hadoopbouil/angular-google-chartgoogle/guavafreemarker/freemarkerspring-projects/spring-socialmariofusco/lambdajpostgres/postgresthymeleaf/thymeleafless/less-docsdjango/djangoapache/stormgwtproject/gwtapache/commons-langmoment/momentapache/lucene-solrtastejs/todomvcpython/cpythonzzzeek/sqlalchemyqos-ch/slf4japache/activemqapache/xalan-japache/mahoutapache/cxfehcache/ehcache-jcachehapijs/hapisebastianbenz/Jnariospring-projects/spring-integration  Users  altmerddekanyberkerpeksagwearp,Repos  
SteamedFish/vimrcTox/Tox-WebsiteTox/Tox-Docsrsudev/AntoxSteamedFish/configAstonex/Antoxcgeo/cgeolifetyper/FreeRouter_V2lodash/lodash-cliTox/toxicpolarssl/polarsslquantum-os/sddm-themeAstonex/ToxBoxculmor30/cgeo-wearquantum-os/qml-materialAstonex/Docssamueltardieu/cgeolifetyper/scriptsquantum-os/qml-extrasbestiejs/platform.jsLineflyer/cgeostrycore/scriptscernekee/ics-openconnectsddm/sddmRamblurr/Anki-AndroidJFreegman/toxicquantum-os/quantum-shellschwabe/cgeobestiejs/benchmark.jsisohuntto/openbaylodash/lodashbestiejs/json3SteamedFish/gfwiplistiBeliever/cross-pkgschwabe/ics-openvpnjdalton/docdownAstonex/Tox-STSquantum-os/quantum-osTox/toxme.serankjie/anyconnect-gfw-list  Users  samueltardieuARoiDjgoldfarjackkriegerFluxinatedjdalton


## Scala

In [ ]:
val scala = cluster("scala")

scala: List[(Long, Iterable[Long])] = List((474633,List(5929896, 474633, 13899590)), (4078208,List(4078208, 28688755, 28689127)), (4402043,List(28684770, 4402043)), (3208807,List(28304675, 3208807)), (287491,List(27677770, 287491)), (1345438,List(1345438, 5016982)), (24870,List(28598724, 24870)), (3648029,List(3648029, 28049195)), (1145180,List(1145180, 4350848)), (289960,List(289960, 20581297)), (6007632,List(6204600, 6007632)), (648508,List(648508, 28133279)))


In [ ]:
layout(4, scala.map(r => html(showCluster("scala", r._2.toList))))


res24: notebook.front.Widget = <widget>


| 0 | 1 | 2 | 3 |
|---|---|---|---|
| Repos: scalaz/scalaz, pavelfatin/patterns. Users: 0x414c | Repos: itsvenkis/scala, itsvenkis/scala. Users: itsvenkis | Repos: Hossein-Boka/scalarest. Users: Hossein-Boka | Repos: shasdemir/scala-book. Users: shasdemir |
| Repos: ummels/scala-prioritymap. Users: ummels | Repos: Scalarm/scalarm_information_service. Users: kliput | Repos: okomok/vim-scala. Users: okomok | Repos: nicolasstucki/scala-rrb-vector-thesis. Users: nicolasstucki |
| Repos: ornicar/scalachess. Users: Happy0 | Repos: inc-lc/ilc-scala. Users: Blaisorblade | Repos: ikeike443/Sublime-Scalariform. Users: fkennayo | Repos: shijinkui/scala-best-practices. Users: shijinkui |


## Spark ^^ 

In [ ]:
val spark = cluster("spark")

spark: List[(Long, Iterable[Long])] = List((8108735,List(9467640, 17165658, 8108735, 25984335)), (6604878,List(28689340, 6604878, 26539815)), (10364991,List(28689305, 10364991)), (292693,List(28688640, 292693)), (548180,List(548180, 1614137)), (494304,List(494304, 28136522)))


In [ ]:
layout(4, spark.map(r => html(showCluster("spark", r._2.toList))))

res27: notebook.front.Widget = <widget>


| 0 | 1 | 2 | 3 |
|---|---|---|---|
| Repos: apache/spark, tgaloppo/spark. Users: tgaloppo, SparkQA | Repos: ScruffR/photon, spark/photon. Users: ScruffR | Repos: SparkMonkay/Codes. Users: SparkMonkay | Repos: mikemoraned/spark-play-sbt. Users: mikemoraned |
| Repos: SparkDevNetwork/Rock. Users: azturner | Repos: rkuo/LearningSpark. Users: rkuo | | |
