# .NET for Apache Spark Demo

This demo shows some of the features of .NET for Apache Spark by analyzing the [arXiv dataset](https://www.kaggle.com/Cornell-University/arxiv) from Kaggle.

## Install Microsoft.Spark NuGet package

In [1]:
#r "nuget:Microsoft.Spark,0.12.1"

## Import Microsoft.Spark packages

In [2]:
open Microsoft.Spark
open Microsoft.Spark.Sql

Unhandled exception: System.ArgumentNullException: Value cannot be null. (Parameter 'directory')
   at Microsoft.DotNet.Interactive.CompositeKernelExtensionLoader.LoadFromDirectoryAsync(DirectoryInfo directory, CompositeKernel kernel, KernelInvocationContext context)
   at Microsoft.DotNet.Interactive.CompositeKernel.LoadExtensionsFromDirectory(DirectoryInfo directory, KernelInvocationContext context)
   at Microsoft.DotNet.Interactive.Commands.LoadExtensionsInDirectory.InvokeAsync(KernelInvocationContext context)
   at Microsoft.DotNet.Interactive.CompositeKernel.HandleAsync(IKernelCommand command, KernelInvocationContext context)
   at Microsoft.DotNet.Interactive.KernelCommandPipeline.<BuildPipeline>b__6_0(IKernelCommand command, KernelInvocationContext context, KernelPipelineContinuation _)
   at Microsoft.DotNet.Interactive.KernelBase.<AddSetKernelMiddleware>b__9_0(IKernelCommand command, KernelInvocationContext context, KernelPipelineContinuation next)
   at Microsoft.DotNet.Interactive.KernelCommandPipeline.SendAsync(IKernelCommand command, KernelInvocationContext context)

## Define the path to the data

In [3]:
let DATA_DIR = "C:\\Dev\\arxiv-metadata-oai-snapshot.json"

## Initialize SparkSession

`SparkSession` is the entry point of a Spark application.

In [4]:
let sparkSession = 
    SparkSession
        .Builder()
        .AppName("arxiv-analytics")
        .GetOrCreate()

[2020-09-30T17:08:43.0742586Z] [DESKTOP-HOF6587] [Info] [ConfigurationService] 'DOTNETBACKEND_PORT' environment variable is not set.
[2020-09-30T17:08:43.0984930Z] [DESKTOP-HOF6587] [Info] [ConfigurationService] Using port 5567 for connection.
[2020-09-30T17:08:43.1007873Z] [DESKTOP-HOF6587] [Info] [JvmBridge] JvMBridge port is 5567


## Load data into DataFrame

In [5]:
let arxivData = 
    sparkSession
        .Read()
        .Option("inferSchema", true)
        .Json([|DATA_DIR|])

## Display columns

In [6]:
arxivData.Columns()

index,value
0,abstract
1,authors
2,authors_parsed
3,categories
4,comments
5,doi
6,id
7,journal-ref
8,license
9,report-no


In [7]:
arxivData.PrintSchema()

root
 |-- abstract: string (nullable = true)
 |-- authors: string (nullable = true)
 |-- authors_parsed: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: string (containsNull = true)
 |-- categories: string (nullable = true)
 |-- comments: string (nullable = true)
 |-- doi: string (nullable = true)
 |-- id: string (nullable = true)
 |-- journal-ref: string (nullable = true)
 |-- license: string (nullable = true)
 |-- report-no: string (nullable = true)
 |-- submitter: string (nullable = true)
 |-- title: string (nullable = true)
 |-- update_date: string (nullable = true)
 |-- versions: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- created: string (nullable = true)
 |    |    |-- version: string (nullable = true)



## See first few rows

In [8]:
arxivData.Show(3)

+--------------------+--------------------+--------------------+--------------+--------------------+--------------------+---------+--------------------+--------------------+----------------+--------------+--------------------+-----------+--------------------+
|            abstract|             authors|      authors_parsed|    categories|            comments|                 doi|       id|         journal-ref|             license|       report-no|     submitter|               title|update_date|            versions|
+--------------------+--------------------+--------------------+--------------+--------------------+--------------------+---------+--------------------+--------------------+----------------+--------------+--------------------+-----------+--------------------+
|  A fully differe...|C. Bal\'azs, E. L...|[[Balázs, C., ], ...|        hep-ph|37 pages, 15 figu...|10.1103/PhysRevD....|0704.0001|Phys.Rev.D76:0130...|                null|ANL-HEP-PR-07-12|Pavel Nadolsky|Calculation of

## Get total number of articles

In [9]:
let totalArticles = 
    arxivData
        .Select(
            Functions.Count(Functions.Col("id")).Alias("total_articles"),
            Functions.CountDistinct(Functions.Col("id")).Alias("distinct_articles"))

In [10]:
totalArticles.Show()

+--------------+-----------------+
|total_articles|distinct_articles|
+--------------+-----------------+
|       1767485|          1767482|
+--------------+-----------------+
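
The two counts differ by three, so a few `id` values appear more than once. A minimal sketch for listing them, assuming the `arxivData` DataFrame loaded earlier is still in scope (the name `duplicateIds` is introduced here only for illustration):

```fsharp
open Microsoft.Spark.Sql

// Group rows by id, count each group, and keep only ids that occur more than once.
let duplicateIds =
    arxivData
        .GroupBy(Functions.Col("id"))
        .Count()
        .Filter(Functions.Col("count").Gt(1))

duplicateIds.Show()
```

Inspecting these rows before dropping duplicates can help confirm that the repeated entries really are the same article.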



## Get category counts in descending order

In [11]:
let distinctArticles = arxivData.DropDuplicates("id")

In [12]:
let columnSubset = 
    [|
        Functions.Col("id")
        Functions.Col("title")
        Functions.Col("abstract")
        Functions.Col("categories")
    |]

let articles = distinctArticles.Select(columnSubset)

In [13]:
let categories = 
    articles
        .Select(Functions.Col("categories"))
        .GroupBy(Functions.Col("categories"))
        .Count()
        .OrderBy(Functions.Col("count").Desc())

In [14]:
categories.Show(10)

+-----------------+-----+
|       categories|count|
+-----------------+-----+
|         astro-ph|86914|
|           hep-ph|73272|
|         quant-ph|53392|
|           hep-th|53049|
|cond-mat.mtrl-sci|29641|
|cond-mat.mes-hall|29495|
|            gr-qc|25377|
|            cs.CV|24689|
|          math.AP|23788|
|      astro-ph.SR|22750|
+-----------------+-----+
only showing top 10 rows
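
In the arXiv metadata, the `categories` field can hold several space-separated category codes, so grouping on the raw string counts each distinct combination rather than each individual category. A hedged sketch of a per-category count, assuming the `articles` DataFrame defined earlier (the names `perCategory` and `category` are hypothetical and not part of the original notebook):

```fsharp
open Microsoft.Spark.Sql

// Split the categories string on spaces and emit one row per individual category,
// then count the rows for each category in descending order.
let perCategory =
    articles
        .Select(
            Functions.Explode(
                Functions.Split(Functions.Col("categories"), " ")).Alias("category"))
        .GroupBy(Functions.Col("category"))
        .Count()
        .OrderBy(Functions.Col("count").Desc())

perCategory.Show(10)
```

With this approach, an article listed under several categories contributes to each of them, so the counts will sum to more than the number of articles.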



## Filter articles in the astro-ph category

In [19]:
let astroph = 
    articles
        .Filter(Functions.Col("categories").EqualTo("astro-ph"))

In [20]:
astroph.Show(3)

+---------+--------------------+--------------------+----------+
|       id|               title|            abstract|categories|
+---------+--------------------+--------------------+----------+
|0704.3919|On over-reflectio...|  The dynamics of...|  astro-ph|
|0705.0685|PSR J1453+1902 an...|  We present 3 yr...|  astro-ph|
|0705.0780|Massive stars and...|  We first presen...|  astro-ph|
+---------+--------------------+--------------------+----------+
only showing top 3 rows



## Write out transformed output

In [22]:
articles
    .Coalesce(1)
    .Write()
    .Mode("overwrite")
    .Csv("output")