
##Mobius API Documentation

###Microsoft.Spark.CSharp.Core.Accumulator
####Summary

        A shared variable that can be accumulated, i.e., one that has a commutative and associative "add"
        operation. Worker tasks on a Spark cluster can add values to an Accumulator with the +=
        operator, but only the driver program is allowed to access its value, using Value.
        Updates from the workers get propagated automatically to the driver program.

        While SparkContext supports accumulators for primitive data types like int and
        float, users can also define accumulators for custom types by providing a custom
        AccumulatorParam object. Refer to the doctest of this module for an example.

        See the Python implementation in accumulators.py, worker.py, PythonRDD.scala

####Methods

|Name|Description|
|---|---|
|Add|Adds a term to this accumulator's value|
|op_Addition|The += operator; adds a term to this accumulator's value|
|ToString|Creates and returns a string representation of the current accumulator|
|Zero|Provide a "zero value" for the type|
|AddInPlace|Add two values of the accumulator's data type, returning a new value|
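
A minimal usage sketch of the pattern described above, assuming an existing SparkContext `sc` (variable names are illustrative only):

```csharp
// The driver creates the accumulator; worker tasks add to it inside Foreach;
// only the driver reads the aggregated result through Value.
var counter = sc.Accumulator<int>(0);

sc.Parallelize(new[] { 1, 2, 3, 4, 5 }, 2)
  .Foreach(x => counter += x);      // += runs on the workers

Console.WriteLine(counter.Value);   // 15, read on the driver only
```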

###Microsoft.Spark.CSharp.Core.Accumulator`1
####Summary

        A generic version of Accumulator where the element type is specified by the driver program.

        The type of element in the accumulator.

####Methods

|Name|Description|
|---|---|
|Add|Adds a term to this accumulator's value|
|op_Addition|The += operator; adds a term to this accumulator's value|
|ToString|Creates and returns a string representation of the current accumulator|

###Microsoft.Spark.CSharp.Core.AccumulatorParam`1
####Summary

        An AccumulatorParam that uses the + operators to add values. Designed for simple types
        such as integers, floats, and lists. Requires the zero value for the underlying type
        as a parameter.

####Methods

|Name|Description|
|---|---|
|Zero|Provide a "zero value" for the type|
|AddInPlace|Add two values of the accumulator's data type, returning a new value|

###Microsoft.Spark.CSharp.Core.AccumulatorServer
####Summary

        A simple TCP server that intercepts shutdown() in order to interrupt
        our continuous polling on the handler.

###Microsoft.Spark.CSharp.Core.Broadcast
####Summary

        A broadcast variable created with SparkContext.Broadcast().
        Access its value through Value.

        var b = sc.Broadcast(new int[] {1, 2, 3, 4, 5})
        b.Value
        [1, 2, 3, 4, 5]
        sc.Parallelize(new int[] {0, 0}).FlatMap(x => b.Value).Collect()
        [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
        b.Unpersist()

        See the Python implementation in broadcast.py, worker.py, PythonRDD.scala

####Methods

|Name|Description|
|---|---|
|Unpersist|Delete cached copies of this broadcast on the executors.|

###Microsoft.Spark.CSharp.Core.Broadcast`1
####Summary

        A generic version of Broadcast where the element type can be specified.

        The type of element in Broadcast

####Methods

|Name|Description|
|---|---|
|Unpersist|Delete cached copies of this broadcast on the executors.|

###Microsoft.Spark.CSharp.Core.CSharpWorkerFunc
####Summary

        Function that will be executed in CSharpWorker

####Methods

|Name|Description|
|---|---|
|Chain|Used to chain functions|

###Microsoft.Spark.CSharp.Core.Option`1
####Summary

        Container for an optional value of type T. If the value of type T is present, Option.IsDefined is TRUE and GetValue() returns the value.
        If the value is absent, Option.IsDefined is FALSE and an exception is thrown when calling GetValue().

####Methods

|Name|Description|
|---|---|
|GetValue|Returns the value of the option if Option.IsDefined is TRUE; otherwise, throws an exception.|
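
As an illustrative, hypothetical sketch (assuming an existing SparkContext `sc`), Option values typically surface from operations such as LeftOuterJoin, documented under PairRDDFunctions below:

```csharp
// Sketch only: LeftOuterJoin yields (key, (value, Option<right value>)) pairs.
var left  = sc.Parallelize(new[] { new Tuple<string, int>("a", 1), new Tuple<string, int>("b", 4) }, 1);
var right = sc.Parallelize(new[] { new Tuple<string, int>("a", 2) }, 1);

foreach (var kv in left.LeftOuterJoin(right).Collect())
{
    var opt = kv.Item2.Item2;   // Option<int>; check IsDefined before calling GetValue()
    Console.WriteLine("{0}: {1}", kv.Item1, opt.IsDefined ? opt.GetValue().ToString() : "<no match>");
}
```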

###Microsoft.Spark.CSharp.Core.Partitioner
####Summary

        An object that defines how the elements in a key-value pair RDD are partitioned by key.
        Maps each key to a partition ID, from 0 to "numPartitions - 1".

####Methods

|Name|Description|
|---|---|
|Equals|Determines whether the specified object is equal to the current object.|
|GetHashCode|Serves as the default hash function.|

###Microsoft.Spark.CSharp.Core.RDDCollector
####Summary

        Used for collect operation on RDD

###Microsoft.Spark.CSharp.Core.DoubleRDDFunctions
####Summary

        Extra functions available on RDDs of Doubles through an implicit conversion. 

####Methods

|Name|Description|
|---|---|
|Sum|Add up the elements in this RDD. sc.Parallelize(new double[] {1.0, 2.0, 3.0}).Sum() 6.0|
|Stats|Return a StatCounter object that captures the mean, variance and count of the RDD's elements in one operation.|
|Histogram|Compute a histogram using the provided buckets. The buckets are all open to the right except for the last, which is closed. e.g. [1,10,20,50] means the buckets are [1,10) [10,20) [20,50], which means 1<=x<10, 10<=x<20, 20<=x<=50. And on the input of 1 and 50 we would have a histogram of 1,0,1. If your histogram is evenly spaced (e.g. [0, 10, 20, 30]), this can be switched from an O(log n) insertion to O(1) per element (where n = # buckets). Buckets must be sorted, must not contain any duplicates, and must have at least two elements. If `buckets` is a number, it will generate buckets that are evenly spaced between the minimum and maximum of the RDD. For example, if the min value is 0 and the max is 100, given buckets as 2, the resulting buckets will be [0,50) [50,100]. buckets must be at least 1. If the RDD contains infinity or NaN, an exception is thrown. If the elements in the RDD do not vary (max == min), a single bucket is always returned. Returns a tuple of buckets and histogram. >>> rdd = sc.parallelize(range(51)) >>> rdd.histogram(2) ([0, 25, 50], [25, 26]) >>> rdd.histogram([0, 5, 25, 50]) ([0, 5, 25, 50], [5, 20, 26]) >>> rdd.histogram([0, 15, 30, 45, 60]) # evenly spaced buckets ([0, 15, 30, 45, 60], [15, 15, 15, 6]) >>> rdd = sc.parallelize(["ab", "ac", "b", "bd", "ef"]) >>> rdd.histogram(("a", "b", "c")) (('a', 'b', 'c'), [2, 2])|
|Mean|Compute the mean of this RDD's elements. sc.Parallelize(new double[]{1, 2, 3}).Mean() 2.0|
|Variance|Compute the variance of this RDD's elements. sc.Parallelize(new double[]{1, 2, 3}).Variance() 0.666...|
|Stdev|Compute the standard deviation of this RDD's elements. sc.Parallelize(new double[]{1, 2, 3}).Stdev() 0.816...|
|SampleStdev|Compute the sample standard deviation of this RDD's elements (which corrects for bias in estimating the standard deviation by dividing by N-1 instead of N). sc.Parallelize(new double[]{1, 2, 3}).SampleStdev() 1.0|
|SampleVariance|Compute the sample variance of this RDD's elements (which corrects for bias in estimating the variance by dividing by N-1 instead of N). sc.Parallelize(new double[]{1, 2, 3}).SampleVariance() 1.0|
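
As an illustrative sketch (assuming an existing SparkContext `sc`), the double-specific helpers above can be combined as follows:

```csharp
// Summary statistics and a two-bucket histogram over a small RDD of doubles.
var doubles = sc.Parallelize(new double[] { 1.0, 2.0, 3.0, 4.0 }, 2);

Console.WriteLine(doubles.Sum());     // 10.0
Console.WriteLine(doubles.Mean());    // 2.5

var stats = doubles.Stats();          // count, mean and variance in one pass
Console.WriteLine(stats.ToString());

var histogram = doubles.Histogram(2); // two evenly spaced buckets between min and max
```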

###Microsoft.Spark.CSharp.Core.IRDDCollector
####Summary

        Interface for collect operation on RDD

###Microsoft.Spark.CSharp.Core.OrderedRDDFunctions
####Summary

        Extra functions available on RDDs of (key, value) pairs where the key is sortable through
        a function to sort the key.

####Methods

|Name|Description|
|---|---|
|SortByKey``2|Sorts this RDD, which is assumed to consist of Tuple pairs.|
|SortByKey``3|Sorts this RDD, which is assumed to consist of Tuples. If Item1 is of type string, the sort is case-sensitive.|
|repartitionAndSortWithinPartitions``2|Repartition the RDD according to the given partitioner and, within each resulting partition, sort records by their keys. This is more efficient than calling `repartition` and then sorting within each partition because it can push the sorting down into the shuffle machinery.|

###Microsoft.Spark.CSharp.Core.PairRDDFunctions
####Summary

        Operations that are only available on RDDs of Tuples (key-value pairs).

        See also http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions

####Methods

|Name|Description|
|---|---|
|CollectAsMap``2|Return the key-value pairs in this RDD to the master as a dictionary. var m = sc.Parallelize(new[] { new Tuple<int, int>(1, 2), new Tuple<int, int>(3, 4) }, 1).CollectAsMap() m[1] 2 m[3] 4|
|Keys``2|Return an RDD with the keys of each tuple. >>> m = sc.Parallelize(new[] { new Tuple<int, int>(1, 2), new Tuple<int, int>(3, 4) }, 1).Keys().Collect() [1, 3]|
|Values``2|Return an RDD with the values of each tuple. >>> m = sc.Parallelize(new[] { new Tuple<int, int>(1, 2), new Tuple<int, int>(3, 4) }, 1).Values().Collect() [2, 4]|
|ReduceByKey``2|Merge the values for each key using an associative reduce function. This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a "combiner" in MapReduce. Output will be hash-partitioned with the given number of partitions, or the default parallelism level if the number of partitions is not specified. sc.Parallelize(new[] { new Tuple<string, int>("a", 1), new Tuple<string, int>("b", 1), new Tuple<string, int>("a", 1) }, 2) .ReduceByKey((x, y) => x + y).Collect() [('a', 2), ('b', 1)]|
|ReduceByKeyLocally``2|Merge the values for each key using an associative reduce function, but return the results immediately to the master as a dictionary. This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a "combiner" in MapReduce. sc.Parallelize(new[] { new Tuple<string, int>("a", 1), new Tuple<string, int>("b", 1), new Tuple<string, int>("a", 1) }, 2) .ReduceByKeyLocally((x, y) => x + y) [('a', 2), ('b', 1)]|
|CountByKey``2|Count the number of elements for each key, and return the result to the master as a dictionary. sc.Parallelize(new[] { new Tuple<string, int>("a", 1), new Tuple<string, int>("b", 1), new Tuple<string, int>("a", 1) }, 2) .CountByKey() [('a', 2), ('b', 1)]|
|Join``3|Return an RDD containing all pairs of elements with matching keys in this RDD and the other RDD. Each pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in this RDD and (k, v2) is in the other. Performs a hash join across the cluster. var l = sc.Parallelize( new[] { new Tuple<string, int>("a", 1), new Tuple<string, int>("b", 4) }, 1); var r = sc.Parallelize( new[] { new Tuple<string, int>("a", 2), new Tuple<string, int>("a", 3) }, 1); var m = l.Join(r, 2).Collect(); [('a', (1, 2)), ('a', (1, 3))]|
|LeftOuterJoin``3|Perform a left outer join of this RDD and the other RDD. For each element (k, v) in this RDD, the resulting RDD will either contain all pairs (k, (v, Option)) for w in the other RDD, where Option.IsDefined is TRUE, or the pair (k, (v, Option)) if no elements in the other RDD have key k, where Option.IsDefined is FALSE. Hash-partitions the resulting RDD into the given number of partitions. var l = sc.Parallelize( new[] { new Tuple<string, int>("a", 1), new Tuple<string, int>("b", 4) }, 1); var r = sc.Parallelize( new[] { new Tuple<string, int>("a", 2) }, 1); var m = l.LeftOuterJoin(r).Collect(); [('a', (1, 2)), ('b', (4, Option))] * Option.IsDefined = FALSE|
|RightOuterJoin``3|Perform a right outer join of this RDD and the other RDD. For each element (k, w) in the other RDD, the resulting RDD will either contain all pairs (k, (Option, w)) for v in this RDD, where Option.IsDefined is TRUE, or the pair (k, (Option, w)) if no elements in this RDD have key k, where Option.IsDefined is FALSE. Hash-partitions the resulting RDD into the given number of partitions. var l = sc.Parallelize( new[] { new Tuple<string, int>("a", 2) }, 1); var r = sc.Parallelize( new[] { new Tuple<string, int>("a", 1), new Tuple<string, int>("b", 4) }, 1); var m = l.RightOuterJoin(r).Collect(); [('a', (2, 1)), ('b', (Option, 4))] * Option.IsDefined = FALSE|
|FullOuterJoin``3|Perform a full outer join of this RDD and the other RDD. For each element (k, v) in this RDD, the resulting RDD will either contain all pairs (k, (v, w)) for w in the other RDD, or the pair (k, (v, None)) if no elements in the other RDD have key k. Similarly, for each element (k, w) in the other RDD, the resulting RDD will either contain all pairs (k, (v, w)) for v in this RDD, or the pair (k, (None, w)) if no elements in this RDD have key k. Hash-partitions the resulting RDD into the given number of partitions. var l = sc.Parallelize( new[] { new Tuple<string, int>("a", 1), new Tuple<string, int>("b", 4) }, 1); var r = sc.Parallelize( new[] { new Tuple<string, int>("a", 2), new Tuple<string, int>("c", 8) }, 1); var m = l.FullOuterJoin(r).Collect(); [('a', (1, 2)), ('b', (4, None)), ('c', (None, 8))]|
|PartitionBy``2|Return a copy of the RDD partitioned using the specified partitioner. sc.Parallelize(new[] { 1, 2, 3, 4, 2, 4, 1 }, 1).Map(x => new Tuple<int, int>(x, x)).PartitionBy(3).Glom().Collect()|
|CombineByKey``3|# TODO: add control over map-side aggregation Generic function to combine the elements for each key using a custom set of aggregation functions. Turns an RDD[(K, V)] into a result of type RDD[(K, C)], for a "combined type" C. Note that V and C can be different -- for example, one might group an RDD of type (Int, Int) into an RDD of type (Int, List[Int]). Users provide three functions: - createCombiner, which turns a V into a C (e.g., creates a one-element list) - mergeValue, to merge a V into a C (e.g., adds it to the end of a list) - mergeCombiners, to combine two C's into a single one. In addition, users can control the partitioning of the output RDD. sc.Parallelize( new[] { new Tuple<string, int>("a", 1), new Tuple<string, int>("b", 1), new Tuple<string, int>("a", 1) }, 2) .CombineByKey(() => string.Empty, (x, y) => x + y.ToString(), (x, y) => x + y).Collect() [('a', '11'), ('b', '1')]|
|AggregateByKey``3|Aggregate the values of each key, using given combine functions and a neutral "zero value". This function can return a different result type, U, than the type of the values in this RDD, V. Thus, we need one operation for merging a V into a U and one operation for merging two U's. The former operation is used for merging values within a partition, and the latter is used for merging values between partitions. To avoid memory allocation, both of these functions are allowed to modify and return their first argument instead of creating a new U. sc.Parallelize( new[] { new Tuple<string, int>("a", 1), new Tuple<string, int>("b", 1), new Tuple<string, int>("a", 1) }, 2) .CombineByKey(() => string.Empty, (x, y) => x + y.ToString(), (x, y) => x + y).Collect() [('a', 2), ('b', 1)]|
|FoldByKey``2|Merge the values for each key using an associative function "func" and a neutral "zeroValue" which may be added to the result an arbitrary number of times, and must not change the result (e.g., 0 for addition, or 1 for multiplication). sc.Parallelize( new[] { new Tuple<string, int>("a", 1), new Tuple<string, int>("b", 1), new Tuple<string, int>("a", 1) }, 2) .CombineByKey(() => string.Empty, (x, y) => x + y.ToString(), (x, y) => x + y).Collect() [('a', 2), ('b', 1)]|
|GroupByKey``2|Group the values for each key in the RDD into a single sequence. Hash-partitions the resulting RDD with numPartitions partitions. Note: If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or aggregateByKey will provide much better performance. sc.Parallelize( new[] { new Tuple<string, int>("a", 1), new Tuple<string, int>("b", 1), new Tuple<string, int>("a", 1) }, 2) .GroupByKey().MapValues(l => string.Join(" ", l)).Collect() [('a', [1, 1]), ('b', [1])]|
|MapValues``3|Pass each value in the key-value pair RDD through a map function without changing the keys; this also retains the original RDD's partitioning. sc.Parallelize( new[] { new Tuple<string, string[]>("a", new[]{"apple", "banana", "lemon"}), new Tuple<string, string[]>("b", new[]{"grapes"}) }, 2) .MapValues(x => x.Length).Collect() [('a', 3), ('b', 1)]|
|FlatMapValues``3|Pass each value in the key-value pair RDD through a flatMap function without changing the keys; this also retains the original RDD's partitioning. x = sc.Parallelize( new[] { new Tuple<string, string[]>("a", new[]{"x", "y", "z"}), new Tuple<string, string[]>("b", new[]{"p", "r"}) }, 2) .FlatMapValues(x => x).Collect() [('a', 'x'), ('a', 'y'), ('a', 'z'), ('b', 'p'), ('b', 'r')]|
|MapPartitionsWithIndex``5|Explicitly converts Tuple<K, V> to Tuple<K, dynamic>, since they are incompatible types, unlike V to dynamic.|
|GroupWith``3|For each key k in this RDD or the other RDD, return a resulting RDD that contains a tuple with the list of values for that key in this RDD as well as the other RDD. var x = sc.Parallelize(new[] { new Tuple<string, int>("a", 1), new Tuple<string, int>("b", 4) }, 2); var y = sc.Parallelize(new[] { new Tuple<string, int>("a", 2) }, 1); x.GroupWith(y).Collect(); [('a', ([1], [2])), ('b', ([4], []))]|
|GroupWith``4|var x = sc.Parallelize(new[] { new Tuple<string, int>("a", 5), new Tuple<string, int>("b", 6) }, 2); var y = sc.Parallelize(new[] { new Tuple<string, int>("a", 1), new Tuple<string, int>("b", 4) }, 2); var z = sc.Parallelize(new[] { new Tuple<string, int>("a", 2) }, 1); x.GroupWith(y, z).Collect();|
|GroupWith``5|var x = sc.Parallelize(new[] { new Tuple<string, int>("a", 5), new Tuple<string, int>("b", 6) }, 2); var y = sc.Parallelize(new[] { new Tuple<string, int>("a", 1), new Tuple<string, int>("b", 4) }, 2); var z = sc.Parallelize(new[] { new Tuple<string, int>("a", 2) }, 1); var w = sc.Parallelize(new[] { new Tuple<string, int>("b", 42) }, 1); var m = x.GroupWith(y, z, w).MapValues(l => string.Join(" ", l.Item1) + " : " + string.Join(" ", l.Item2) + " : " + string.Join(" ", l.Item3) + " : " + string.Join(" ", l.Item4)).Collect();|
|SubtractByKey``3|Return each (key, value) pair in this RDD that has no pair with a matching key in the other RDD. var x = sc.Parallelize(new[] { new Tuple<string, int?>("a", 1), new Tuple<string, int?>("b", 4), new Tuple<string, int?>("b", 5), new Tuple<string, int?>("a", 2) }, 2); var y = sc.Parallelize(new[] { new Tuple<string, int?>("a", 3), new Tuple<string, int?>("c", null) }, 2); x.SubtractByKey(y).Collect(); [('b', 4), ('b', 5)]|
|Lookup``2|Return the list of values in the RDD for key `key`. This operation is done efficiently if the RDD has a known partitioner, by only searching the partition that the key maps to. >>> rdd = sc.Parallelize(Enumerable.Range(0, 1000).Zip(Enumerable.Range(0, 1000), (x, y) => new Tuple<int, int>(x, y)), 10) >>> rdd.Lookup(42) [42]|
|SaveAsNewAPIHadoopDataset``2|Output an RDD of key-value pairs (of form RDD[(K, V)]) to any Hadoop file system, using the new Hadoop OutputFormat API (mapreduce package). Keys/values are converted for output using either user-specified converters or, by default, org.apache.spark.api.python.JavaToWritableConverter.|
|SaveAsNewAPIHadoopFile``2||
|SaveAsHadoopDataset``2|Output an RDD of key-value pairs (of form RDD[(K, V)]) to any Hadoop file system, using the old Hadoop OutputFormat API (mapred package). Keys/values are converted for output using either user-specified converters or, by default, org.apache.spark.api.python.JavaToWritableConverter.|
|SaveAsHadoopFile``2|Output an RDD of key-value pairs (of form RDD[(K, V)]) to any Hadoop file system, using the old Hadoop OutputFormat API (mapred package). Key and value types will be inferred if not specified. Keys and values are converted for output using either user-specified converters or org.apache.spark.api.python.JavaToWritableConverter. The provided configuration is applied on top of the base Hadoop conf associated with the SparkContext of this RDD to create a merged Hadoop MapReduce job configuration for saving the data.|
|SaveAsSequenceFile``2|Output an RDD of key-value pairs (of form RDD[(K, V)]) to any Hadoop file system, using the org.apache.hadoop.io.Writable types that we convert from the RDD's key and value types. The mechanism is as follows: 1. Pyrolite is used to convert the pickled RDD into an RDD of Java objects. 2. Keys and values of this Java RDD are converted to Writables and written out.|
|NullIfEmpty``1|Converts a collection to a list where the element type is Option(T). If the collection is empty, just returns the empty list.|
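
A brief word-count style sketch that ties several of the pair-RDD operations above together (assumes an existing SparkContext `sc`; the input data is illustrative only):

```csharp
// Classic word count with ReduceByKey, followed by CollectAsMap on the driver.
var pairs = sc.Parallelize(new[] { "a b", "b c", "a a" }, 2)
              .FlatMap(line => line.Split(' '))
              .Map(word => new Tuple<string, int>(word, 1));

var counts = pairs.ReduceByKey((x, y) => x + y);   // merge values per key

foreach (var kv in counts.Collect())
    Console.WriteLine("{0}: {1}", kv.Item1, kv.Item2);

var asMap = counts.CollectAsMap();                 // bring the results back as a dictionary
Console.WriteLine(asMap["a"]);                     // 3
```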

###Microsoft.Spark.CSharp.Core.PipelinedRDD`1
####Summary

        Wraps C#-based transformations that can be executed within a stage. It helps avoid unnecessary Ser/De of data between
        the JVM and CLR when executing C# transformations, and pipelines them.

####Methods

|Name|Description|
|---|---|
|MapPartitionsWithIndex``1|Return a new RDD by applying a function to each partition of this RDD, while tracking the index of the original partition.|

###Microsoft.Spark.CSharp.Core.PriorityQueue`1
####Summary

        A bounded priority queue implemented with a max binary heap.

        Construction steps:
         1. Build a max heap of the first k elements.
         2. For each element after the kth element, compare it with the root of the max heap:
           a. If the element is less than the root, replace the root with this element and heapify.
           b. Otherwise, ignore it.

####Methods

|Name|Description|
|---|---|
|Offer|Inserts the specified element into this priority queue.|

###Microsoft.Spark.CSharp.Core.Profiler
####Summary

        A class that represents a profiler

###Microsoft.Spark.CSharp.Core.RDD`1
####Summary

        Represents a Resilient Distributed Dataset (RDD), the basic abstraction in Spark: an immutable,
        partitioned collection of elements that can be operated on in parallel.

        See also http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD

        Type of the RDD

####Methods

|Name|Description|
|---|---|
|Cache|Persist this RDD with the default storage level.|
|Persist|Set this RDD's storage level to persist its values across operations after the first time it is computed. This can only be used to assign a new storage level if the RDD does not have a storage level set yet. If no storage level is specified, the default is used. sc.Parallelize(new string[] {"b", "a", "c"}).Persist().isCached True|
|Unpersist|Mark the RDD as non-persistent, and remove all blocks for it from memory and disk.|
|Checkpoint|Mark this RDD for checkpointing. It will be saved to a file inside the checkpoint directory set with SparkContext.SetCheckpointDir(), and all references to its parent RDDs will be removed. This function must be called before any job has been executed on this RDD. It is strongly recommended that this RDD is persisted in memory, otherwise saving it to a file will require recomputation.|
|GetNumPartitions|Returns the number of partitions of this RDD.|
|Map``1|Return a new RDD by applying a function to each element of this RDD. sc.Parallelize(new string[]{"b", "a", "c"}, 1).Map(x => new Tuple<string, int>(x, 1)).Collect() [('a', 1), ('b', 1), ('c', 1)]|
|FlatMap``1|Return a new RDD by first applying a function to all elements of this RDD, and then flattening the results. sc.Parallelize(new int[] {2, 3, 4}, 1).FlatMap(x => Enumerable.Range(1, x - 1)).Collect() [1, 1, 1, 2, 2, 3]|
|MapPartitions``1|Return a new RDD by applying a function to each partition of this RDD. sc.Parallelize(new int[] {1, 2, 3, 4}, 2).MapPartitions(iter => new[]{iter.Sum(x => (x as decimal?))}).Collect() [3, 7]|
|MapPartitionsWithIndex``1|Return a new RDD by applying a function to each partition of this RDD, while tracking the index of the original partition. sc.Parallelize(new int[]{1, 2, 3, 4}, 4).MapPartitionsWithIndex<double>((pid, iter) => (double)pid).Sum() 6|
|Filter|Return a new RDD containing only the elements that satisfy a predicate. sc.Parallelize(new int[]{1, 2, 3, 4, 5}, 1).Filter(x => x % 2 == 0).Collect() [2, 4]|
|Distinct|Return a new RDD containing the distinct elements in this RDD. >>> sc.Parallelize(new int[] {1, 1, 2, 3}, 1).Distinct().Collect() [1, 2, 3]|
|Sample|Return a sampled subset of this RDD. var rdd = sc.Parallelize(Enumerable.Range(0, 100), 4) 6 <= rdd.Sample(false, 0.1, 81).Count() <= 14 true|
|RandomSplit|Randomly splits this RDD with the provided weights. var rdd = sc.Parallelize(Enumerable.Range(0, 500), 1) var rdds = rdd.RandomSplit(new double[] {2, 3}, 17) 150 < rdds[0].Count() < 250 250 < rdds[1].Count() < 350|
|TakeSample|Return a fixed-size sampled subset of this RDD. var rdd = sc.Parallelize(Enumerable.Range(0, 10), 2) rdd.TakeSample(true, 20, 1).Length 20 rdd.TakeSample(false, 5, 2).Length 5 rdd.TakeSample(false, 15, 3).Length 10|
|ComputeFractionForSampleSize|Returns a sampling rate that guarantees a sample of size >= sampleSizeLowerBound 99.99% of the time. How the sampling rate is determined: Let p = num / total, where num is the sample size and total is the total number of data points in the RDD. We're trying to compute q > p such that - when sampling with replacement, we're drawing each data point with prob_i ~ Pois(q), where we want to guarantee Pr[s < num] < 0.0001 for s = sum(prob_i for i from 0 to total), i.e. the failure rate of not having a sufficiently large sample < 0.0001. Setting q = p + 5 * sqrt(p/total) is sufficient to guarantee 0.9999 success rate for num > 12, but we need a slightly larger q (9 empirically determined). - when sampling without replacement, we're drawing each data point with prob_i ~ Binomial(total, fraction) and our choice of q guarantees 1-delta, or 0.9999 success rate, where success rate is defined the same as in sampling with replacement.|
|Union|Return the union of this RDD and another one. var rdd = sc.Parallelize(new int[] { 1, 1, 2, 3 }, 1) rdd.Union(rdd).Collect() [1, 1, 2, 3, 1, 1, 2, 3]|
|Intersection|Return the intersection of this RDD and another one. The output will not contain any duplicate elements, even if the input RDDs did. Note that this method performs a shuffle internally. var rdd1 = sc.Parallelize(new int[] { 1, 10, 2, 3, 4, 5 }, 1) var rdd2 = sc.Parallelize(new int[] { 1, 6, 2, 3, 7, 8 }, 1) rdd1.Intersection(rdd2).Collect() [1, 2, 3]|
|Glom|Return an RDD created by coalescing all elements within each partition into a list. var rdd = sc.Parallelize(new int[] { 1, 2, 3, 4 }, 2) rdd.Glom().Collect() [[1, 2], [3, 4]]|
|Cartesian``1|Return the Cartesian product of this RDD and another one, that is, the RDD of all pairs of elements (a, b) where a is in this RDD and b is in the other. rdd = sc.Parallelize(new int[] { 1, 2 }, 1) rdd.Cartesian(rdd).Collect() [(1, 1), (1, 2), (2, 1), (2, 2)]|
|GroupBy``1|Return an RDD of grouped items. Each group consists of a key and a sequence of elements mapping to that key. The ordering of elements within each group is not guaranteed, and may even differ each time the resulting RDD is evaluated. Note: This operation may be very expensive. If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using [[PairRDDFunctions.aggregateByKey]] or [[PairRDDFunctions.reduceByKey]] will provide much better performance. >>> rdd = sc.Parallelize(new int[] { 1, 1, 2, 3, 5, 8 }, 1) >>> result = rdd.GroupBy(x => x % 2).Collect() [(0, [2, 8]), (1, [1, 1, 3, 5])]|
|Pipe|Return an RDD created by piping elements to a forked external process. >>> sc.Parallelize(new char[] { '1', '2', '3', '4' }, 1).Pipe("cat").Collect() [u'1', u'2', u'3', u'4']|
|Foreach|Applies a function to all elements of this RDD. sc.Parallelize(new int[] { 1, 2, 3, 4, 5 }, 1).Foreach(x => Console.Write(x))|
|ForeachPartition|Applies a function to each partition of this RDD. sc.Parallelize(new int[] { 1, 2, 3, 4, 5 }, 1).ForeachPartition(iter => { foreach (var x in iter) Console.Write(x + " "); })|
|Collect|Return a list that contains all of the elements in this RDD.|
|Reduce|Reduces the elements of this RDD using the specified commutative and associative binary operator. sc.Parallelize(new int[] { 1, 2, 3, 4, 5 }, 1).Reduce((x, y) => x + y) 15|
|TreeReduce|Reduces the elements of this RDD in a multi-level tree pattern. >>> var rdd = sc.Parallelize(new int[] { -5, -4, -3, -2, -1, 1, 2, 3, 4 }, 10) >>> rdd.TreeReduce((x, y) => x + y) -5 >>> rdd.TreeReduce((x, y) => x + y, 2) -5 >>> rdd.TreeReduce((x, y) => x + y, 5) -5 >>> rdd.TreeReduce((x, y) => x + y, 10) -5|
|Fold|Aggregate the elements of each partition, and then the results for all the partitions, using a given associative and commutative function and a neutral "zero value." The function op(t1, t2) is allowed to modify t1 and return it as its result value to avoid object allocation; however, it should not modify t2. This behaves somewhat differently from fold operations implemented for non-distributed collections in functional languages like Scala. This fold operation may be applied to partitions individually, and then fold those results into the final result, rather than apply the fold to each element sequentially in some defined ordering. For functions that are not commutative, the result may differ from that of a fold applied to a non-distributed collection. >>> sc.Parallelize(new int[] { 1, 2, 3, 4, 5 }, 1).Fold(0, (x, y) => x + y) 15|
|Aggregate``1|Aggregate the elements of each partition, and then the results for all the partitions, using the given combine functions and a neutral "zero value." The function op(t1, t2) is allowed to modify t1 and return it as its result value to avoid object allocation; however, it should not modify t2. The first function (seqOp) can return a different result type, U, than the type of this RDD. Thus, we need one operation for merging a T into a U and one operation for merging two U's. >>> sc.Parallelize(new int[] { 1, 2, 3, 4 }, 1).Aggregate(0, (x, y) => x + y, (x, y) => x + y) 10|
|TreeAggregate``1|Aggregates the elements of this RDD in a multi-level tree pattern. sc.Parallelize(new int[] { 1, 2, 3, 4 }, 1).TreeAggregate(0, (x, y) => x + y, (x, y) => x + y) 10|
|Count|Return the number of elements in this RDD.|
|CountByValue|Return the count of each unique value in this RDD as a dictionary of (value, count) pairs. sc.Parallelize(new int[] { 1, 2, 1, 2, 2 }, 2).CountByValue() [(1, 2), (2, 3)]|
|Take|Take the first num elements of the RDD. It works by first scanning one partition, and using the results from that partition to estimate the number of additional partitions needed to satisfy the limit. Translated from the Scala implementation in RDD#take(). sc.Parallelize(new int[] { 2, 3, 4, 5, 6 }, 2).Cache().Take(2) [2, 3] sc.Parallelize(new int[] { 2, 3, 4, 5, 6 }, 2).Take(10) [2, 3, 4, 5, 6] sc.Parallelize(Enumerable.Range(0, 100), 100).Filter(x => x > 90).Take(3) [91, 92, 93]|
|First|Return the first element in this RDD. >>> sc.Parallelize(new int[] { 2, 3, 4 }, 2).First() 2|
|IsEmpty|Returns true if and only if the RDD contains no elements at all. Note that an RDD may be empty even when it has at least 1 partition. sc.Parallelize(new int[0], 1).IsEmpty() true sc.Parallelize(new int[] {1}).IsEmpty() false|
|Subtract|Return each value in this RDD that is not contained in the other RDD. var x = sc.Parallelize(new int[] { 1, 2, 3, 4 }, 1) var y = sc.Parallelize(new int[] { 3 }, 1) x.Subtract(y).Collect() [1, 2, 4]|
|KeyBy``1|Creates tuples of the elements in this RDD by applying the given function. sc.Parallelize(new int[] { 1, 2, 3, 4 }, 1).KeyBy(x => x * x).Collect() (1, 1), (4, 2), (9, 3), (16, 4)|
|Repartition|Return a new RDD that has exactly numPartitions partitions. Can increase or decrease the level of parallelism in this RDD. Internally, this uses a shuffle to redistribute data. If you are decreasing the number of partitions in this RDD, consider using `Coalesce`, which can avoid performing a shuffle. var rdd = sc.Parallelize(new int[] { 1, 2, 3, 4, 5, 6, 7 }, 4) rdd.Glom().Collect().Length 4 rdd.Repartition(2).Glom().Collect().Length 2|
|Coalesce|Return a new RDD that is reduced into `numPartitions` partitions. sc.Parallelize(new int[] { 1, 2, 3, 4, 5 }, 3).Glom().Collect().Length 3 >>> sc.Parallelize(new int[] { 1, 2, 3, 4, 5 }, 3).Coalesce(1).Glom().Collect().Length 1|
|Zip``1|Zips this RDD with another one, returning key-value pairs with the first element in each RDD, the second element in each RDD, etc. Assumes that the two RDDs have the same number of partitions and the same number of elements in each partition (e.g. one was made through a map on the other). var x = sc.Parallelize(Enumerable.Range(0, 5), 1) var y = sc.Parallelize(Enumerable.Range(1000, 5), 1) x.Zip(y).Collect() [(0, 1000), (1, 1001), (2, 1002), (3, 1003), (4, 1004)]|
|ZipWithIndex|Zips this RDD with its element indices. The ordering is first based on the partition index and then the ordering of items within each partition. So the first item in the first partition gets index 0, and the last item in the last partition receives the largest index. This method needs to trigger a spark job when this RDD contains more than one partition. sc.Parallelize(new string[] { "a", "b", "c", "d" }, 3).ZipWithIndex().Collect() [('a', 0), ('b', 1), ('c', 2), ('d', 3)]|
|ZipWithUniqueId|Zips this RDD with generated unique Long ids. Items in the kth partition will get ids k, n+k, 2*n+k, ..., where n is the number of partitions. So there may exist gaps, but this method won't trigger a spark job, which is different from ZipWithIndex. >>> sc.Parallelize(new string[] { "a", "b", "c", "d", "e" }, 3).ZipWithUniqueId().Collect() [('a', 0), ('b', 1), ('c', 4), ('d', 2), ('e', 5)]|
|SetName|Assign a name to this RDD. >>> rdd1 = sc.parallelize([1, 2]) >>> rdd1.setName('RDD1').name() u'RDD1'|
|ToDebugString|A description of this RDD and its recursive dependencies for debugging.|
|GetStorageLevel|Get the RDD's current storage level. >>> rdd1 = sc.parallelize([1,2]) >>> rdd1.getStorageLevel() StorageLevel(False, False, False, False, 1) >>> print(rdd1.getStorageLevel()) Serialized 1x Replicated|
|ToLocalIterator|Return an iterator that contains all of the elements in this RDD. The iterator will consume as much memory as the largest partition in this RDD. sc.Parallelize(Enumerable.Range(0, 10), 1).ToLocalIterator() [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]|
|RandomSampleWithRange|Internal method exposed for Random Splits in DataFrames. Samples an RDD given a probability range.|
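
To tie a few of the core transformations and actions above together, here is a minimal, hypothetical pipeline (assumes an existing SparkContext `sc` and `using System.Linq`):

```csharp
// Transformations are lazy; Count/Reduce/Collect trigger the actual job.
var rdd = sc.Parallelize(Enumerable.Range(1, 10), 2);

var evens = rdd.Filter(x => x % 2 == 0)           // keep even numbers
               .Map(x => x * x)                   // square them
               .Cache();                          // reuse the result across actions

Console.WriteLine(evens.Count());                 // 5
Console.WriteLine(evens.Reduce((x, y) => x + y)); // 4 + 16 + 36 + 64 + 100 = 220

foreach (var x in evens.Collect())
    Console.WriteLine(x);
```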

###Microsoft.Spark.CSharp.Core.StringRDDFunctions
####Summary

        Some useful utility functions for RDD{string}

####Methods

|Name|Description|
|---|---|
|SaveAsTextFile|Save this RDD as a text file, using string representations of elements.|

###Microsoft.Spark.CSharp.Core.ComparableRDDFunctions
####Summary

        Some useful utility functions for RDD's containing IComparable values.

####Methods

|Name|Description|
|---|---|
|Max``1|Find the maximum item in this RDD. sc.Parallelize(new double[] { 1.0, 5.0, 43.0, 10.0 }, 2).Max() 43.0|
|Min``1|Find the minimum item in this RDD. sc.Parallelize(new double[] { 2.0, 5.0, 43.0, 10.0 }, 2).Min() 2.0|
|TakeOrdered``1|Get the N elements from an RDD ordered in ascending order or as specified by the optional key function. sc.Parallelize(new int[] { 10, 1, 2, 9, 3, 4, 5, 6, 7 }, 2).TakeOrdered(6) [1, 2, 3, 4, 5, 6]|
|Top``1|Get the top N elements from an RDD. Note: It returns the list sorted in descending order. sc.Parallelize(new int[] { 2, 3, 4, 5, 6 }, 2).Top(3) [6, 5, 4]|

###Microsoft.Spark.CSharp.Core.SparkConf
####Summary

         Configuration for a Spark application. Used to set various Spark parameters as key-value pairs.
        
         Note that once a SparkConf object is passed to Spark, it is cloned and can no longer be modified
         by the user. Spark does not support modifying the configuration at runtime.
         
         See also http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkConf

####Methods

|Name|Description|
|---|---|
|SetMaster|The master URL to connect to, such as "local" to run locally with one thread, "local[4]" to run locally with 4 cores, or "spark://master:7077" to run on a Spark standalone cluster.|
|SetAppName|Set a name for your application. Shown in the Spark web UI.|
|SetSparkHome|Set the location where Spark is installed on worker nodes.|
|Set|Set the value of a string config|
|GetInt|Get an int parameter value, falling back to a default if not set|
|Get|Get a string parameter value, falling back to a default if not set|
|GetAll|Get all parameters as a list of pairs|
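
A minimal sketch of building a configuration with the setters above (the key and values are placeholders):

```csharp
// SparkConf is a simple key-value bag of Spark settings.
var conf = new SparkConf();
conf.SetMaster("local[4]");                  // run locally with 4 cores
conf.SetAppName("MobiusSample");             // name shown in the Spark web UI
conf.Set("spark.executor.memory", "1g");     // any string config key/value

// Both getters fall back to the supplied default if the key is not set
// (the exact overloads are assumed here).
Console.WriteLine(conf.Get("spark.executor.memory", "512m"));
Console.WriteLine(conf.GetInt("spark.executor.cores", 1));
```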

###Microsoft.Spark.CSharp.Core.SparkContext
####Summary

        Main entry point for Spark functionality. A SparkContext represents the 
        connection to a Spark cluster, and can be used to create RDDs, accumulators 
        and broadcast variables on that cluster.

####Methods

|Name|Description|
|---|---|
|GetActiveSparkContext|Get the existing SparkContext|
|GetConf|Return a copy of this JavaSparkContext's configuration. The configuration cannot be changed at runtime.|
|GetOrCreate|This function may be used to get or instantiate a SparkContext and register it as a singleton object. Because we can only have one active SparkContext per JVM, this is useful when applications may wish to share a SparkContext. Note: This function cannot be used to create multiple SparkContext instances even if multiple contexts are allowed.|
|TextFile|Read a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI, and return it as an RDD of Strings.|
|Parallelize``1|Distribute a local collection to form an RDD. sc.Parallelize(new int[] {0, 2, 3, 4, 6}, 5).Glom().Collect() [[0], [2], [3], [4], [6]]|
|EmptyRDD|Create an RDD that has no partitions or elements.|
|WholeTextFiles|Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file and the value is the content of each file. For example, if you have the following files: {{{ hdfs://a-hdfs-path/part-00000 hdfs://a-hdfs-path/part-00001 ... hdfs://a-hdfs-path/part-nnnnn }}} Do {{{ RDD<Tuple<string, string>> rdd = sparkContext.WholeTextFiles("hdfs://a-hdfs-path") }}} then `rdd` contains {{{ (a-hdfs-path/part-00000, its content) (a-hdfs-path/part-00001, its content) ... (a-hdfs-path/part-nnnnn, its content) }}} Small files are preferred; large files are also allowable, but may cause bad performance. minPartitions is a suggested value for the minimal number of splits for the input data.|
|BinaryFiles|Read a directory of binary files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI as a byte array. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file and the value is the content of each file. For example, if you have the following files: {{{ hdfs://a-hdfs-path/part-00000 hdfs://a-hdfs-path/part-00001 ... hdfs://a-hdfs-path/part-nnnnn }}} Do `RDD<Tuple<string, byte[]>> rdd = sparkContext.dataStreamFiles("hdfs://a-hdfs-path")`, then `rdd` contains {{{ (a-hdfs-path/part-00000, its content) (a-hdfs-path/part-00001, its content) ... (a-hdfs-path/part-nnnnn, its content) }}} @note Small files are preferred; very large files may cause bad performance. @param minPartitions A suggested value for the minimal number of splits for the input data.|
|SequenceFile|Read a Hadoop SequenceFile with an arbitrary key and value Writable class from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. The mechanism is as follows: 1. A Java RDD is created from the SequenceFile or other InputFormat, and the key and value Writable classes 2. Serialization is attempted via Pyrolite pickling 3. If this fails, the fallback is to call 'toString' on each key and value 4. PickleSerializer is used to deserialize pickled objects on the Python side|
|NewAPIHadoopFile|Read a 'new API' Hadoop InputFormat with an arbitrary key and value class from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. The mechanism is the same as for sc.sequenceFile. A Hadoop configuration can be passed in as a Python dict. This will be converted into a Configuration in Java.|
|NewAPIHadoopRDD|Read a 'new API' Hadoop InputFormat with an arbitrary key and value class, from an arbitrary Hadoop configuration, which is passed in as a Python dict. This will be converted into a Configuration in Java. The mechanism is the same as for sc.sequenceFile.|
|HadoopFile|Read an 'old' Hadoop InputFormat with an arbitrary key and value class from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. The mechanism is the same as for sc.sequenceFile. A Hadoop configuration can be passed in as a Python dict. This will be converted into a Configuration in Java.|
|HadoopRDD|Read an 'old' Hadoop InputFormat with an arbitrary key and value class, from an arbitrary Hadoop configuration, which is passed in as a Python dict. This will be converted into a Configuration in Java. The mechanism is the same as for sc.sequenceFile.|
|Union``1|Build the union of a list of RDDs. This supports unions() of RDDs with different serialized formats, although this forces them to be reserialized using the default serializer: >>> path = os.path.join(tempdir, "union-text.txt") >>> with open(path, "w") as testFile: ... _ = testFile.write("Hello") >>> textFile = sc.textFile(path) >>> textFile.collect() [u'Hello'] >>> parallelized = sc.parallelize(["World!"]) >>> sorted(sc.union([textFile, parallelized]).collect()) [u'Hello', 'World!']|
|Broadcast``1|Broadcast a read-only variable to the cluster, returning a Broadcast object for reading it in distributed functions. The variable will be sent to each cluster only once.|
|Accumulator``1|Create an Accumulator with the given initial value, using a given helper object to define how to add values of the data type if provided. Default AccumulatorParams are used for integers and floating-point numbers if you do not provide one. For other types, a custom AccumulatorParam can be used.|
|Stop|Shut down the SparkContext.|
|AddFile|Add a file to be downloaded with this Spark job on every node. The `path` passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI. To access the file in Spark jobs, use `SparkFiles.get(fileName)` to find its download location.|
|SetCheckpointDir|Set the directory under which RDDs are going to be checkpointed. The directory must be an HDFS path if running on a cluster.|
|SetJobGroup|Assigns a group ID to all the jobs started by this thread until the group ID is set to a different value or cleared. Often, a unit of execution in an application consists of multiple Spark actions or jobs. Application programmers can use this method to group all those jobs together and give a group description. Once set, the Spark web UI will associate such jobs with this group. The application can also use [[org.apache.spark.api.java.JavaSparkContext.cancelJobGroup]] to cancel all running jobs in this group. For example, {{{ // In the main thread: sc.setJobGroup("some_job_to_cancel", "some job description"); rdd.map(...).count(); // In a separate thread: sc.cancelJobGroup("some_job_to_cancel"); }}} If interruptOnCancel is set to true for the job group, then job cancellation will result in Thread.interrupt() being called on the job's executor threads. This is useful to help ensure that the tasks are actually stopped in a timely manner, but is off by default due to HDFS-1208, where HDFS may respond to Thread.interrupt() by marking nodes as dead.|
|SetLocalProperty|Set a local property that affects jobs submitted from this thread, such as the Spark fair scheduler pool.|
|GetLocalProperty|Get a local property set in this thread, or null if it is missing. See [[org.apache.spark.api.java.JavaSparkContext.setLocalProperty]].|
|SetLogLevel|Control our logLevel. This overrides any user-defined log settings. @param logLevel The desired log level as a string. Valid log levels include: ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, WARN|
|RunJob``1|Run a job on a given set of partitions of an RDD.|
|CancelJobGroup|Cancel active jobs for the specified group. See SetJobGroup for more information.|
|CancelAllJobs|Cancel all jobs that have been scheduled or are running.|
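
Putting the pieces together, a hypothetical driver program might look like the sketch below (the application name and input path are placeholders):

```csharp
// Create a context from a SparkConf, read a text file and count its lines.
var conf = new SparkConf();
conf.SetAppName("MobiusWordCount");
var sc = new SparkContext(conf);            // assumes the SparkConf-based constructor

var lines = sc.TextFile(@"hdfs://path/to/input.txt");   // placeholder path
Console.WriteLine(lines.Count());

sc.Stop();                                  // shut the context down when done
```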

###Microsoft.Spark.CSharp.Core.StatCounter
####Summary

        A class for tracking the statistics of a set of numbers (count, mean and variance) in a numerically
        robust way. Includes support for merging two StatCounters. Based on Welford and Chan's algorithms
        for running variance. 

####Methods

|Name|Description|
|---|---|
|Merge|Add a value into this StatCounter, updating the internal statistics.|
|Merge|Add multiple values into this StatCounter, updating the internal statistics.|
|Merge|Merge another StatCounter into this one, adding up the internal statistics.|
|copy|Clone this StatCounter|
|ToString|Returns a string that represents this StatCounter.|

###Microsoft.Spark.CSharp.Core.StatusTracker
####Summary

        Low-level status reporting APIs for monitoring job and stage progress.

####Methods

|Name|Description|
|---|---|
|GetJobIdsForGroup|Return a list of all known jobs in a particular job group. If `jobGroup` is None, then returns all known jobs that are not associated with a job group. The returned list may contain running, failed, and completed jobs, and may vary across invocations of this method. This method does not guarantee the order of the elements in its result.|
|GetActiveStageIds|Returns an array containing the ids of all active stages.|
|GetActiveJobsIds|Returns an array containing the ids of all active jobs.|
|GetJobInfo|Returns a SparkJobInfo object, or None if the job info could not be found or was garbage collected.|
|GetStageInfo|Returns a SparkStageInfo object, or None if the stage info could not be found or was garbage collected.|

###Microsoft.Spark.CSharp.Core.JobExecutionStatus
####Summary

        Status associated with the information of a Spark job

###Microsoft.Spark.CSharp.Core.SparkJobInfo
####Summary

        SparkJobInfo represents information about a Spark job

####Methods

|Name|Description|
|---|---|

###Microsoft.Spark.CSharp.Core.SparkStageInfo
####Summary

        SparkStageInfo represents information about a Spark stage

####Methods

|Name|Description|
|---|---|

###Microsoft.Spark.CSharp.Core.StorageLevelType
####Summary

        Defines the type of storage levels

###Microsoft.Spark.CSharp.Core.StorageLevel
####Summary

        Flags for controlling the storage of an RDD. Each StorageLevel records whether to use 
        memory, whether to drop the RDD to disk if it falls out of memory, whether to keep the 
        data in memory in a serialized format, and whether to replicate the RDD partitions 
        on multiple nodes.

####Methods

|Name|Description|
|---|---|
|ToString|Returns a readable string that represents the type|

###Microsoft.Spark.CSharp.Network.ByteBuf
####Summary

        ByteBuf delimits a section of a ByteBufChunk.
        It is the smallest unit to be allocated.

####Methods

|Name|Description|
|---|---|
|Clear|Sets the readerIndex and writerIndex of this buffer to 0.|
|IsReadable|Returns true if and only if the buffer contains at least the specified number of elements to read.|
|IsWritable|Returns true if and only if the buffer has enough Capacity to accommodate size additional bytes.|
|ReadByte|Gets a byte at the current readerIndex and increases the readerIndex by 1 in this buffer.|
|ReadBytes|Reads a block of bytes from the ByteBuf and writes the data to a buffer.|
|Release|Release the ByteBuf back to the ByteBufPool|
|WriteBytes|Writes a block of bytes to the ByteBuf using data read from a buffer.|
|GetInputRioBuf|Returns a RioBuf object for input (receive).|
|GetOutputRioBuf|Returns a RioBuf object for output (send).|
|NewErrorStatusByteBuf|Creates an empty ByteBuf with error status.|

###Microsoft.Spark.CSharp.Network.ByteBufChunk
####Summary

        ByteBufChunk represents a memory block that can be allocated from the
        .NET heap (managed code) or the process heap (unsafe code).

####Methods

|Name|Description|
|---|---|
|Finalize|Finalizer.|
|Allocate|Allocates a ByteBuf from this ByteBufChunk.|
|Dispose|Release all resources|
|Free|Releases the ByteBuf back to this ByteBufChunk|
|ToString|Returns a readable string for the ByteBufChunk|
|NewChunk|Static method to create a new ByteBufChunk with the given segment and chunk size. If isUnsafe is true, it allocates memory from the process's heap.|
|FreeToProcessHeap|Wraps HeapFree to the process heap.|
|Dispose|Implementation of the Dispose pattern.|

###Microsoft.Spark.CSharp.Network.ByteBufChunk.Segment
####Summary

        Segment struct delimits a section of a byte chunk.

###Microsoft.Spark.CSharp.Network.ByteBufChunkList
####Summary

        ByteBufChunkList represents a simple linked list used to store ByteBufChunk objects
        based on their usage.

####Methods

|Name|Description|
|---|---|
|Add|Adds the ByteBufChunk to this ByteBufChunkList linked-list based on the ByteBufChunk's usage, so it will be moved to the right ByteBufChunkList that has the correct minUsage/maxUsage.|
|Allocate|Allocates a ByteBuf from this ByteBufChunkList if it is not empty.|
|Free|Releases the segment back to its ByteBufChunk.|
|AddInternal|Adds the ByteBufChunk to this ByteBufChunkList|
|MoveInternal|Moves the ByteBufChunk down the ByteBufChunkList linked-list so it will end up in the right ByteBufChunkList that has the correct minUsage/maxUsage with respect to ByteBufChunk.Usage.|
|Remove|Removes the ByteBufChunk from this ByteBufChunkList|
|ToString|Returns a readable string for this ByteBufChunkList|

###Microsoft.Spark.CSharp.Network.ByteBufPool
####Summary

        ByteBufPool is used to manage the ByteBuf pool that allocates and frees pooled memory buffers.
        We borrow some ideas from Netty's buffer memory management.

####Methods

|Name|Description|
|---|---|
|Allocate|Allocates a ByteBuf from this ByteBufPool to use.|
|Free|Deallocates a ByteBuf back to this ByteBufPool.|
|ToString|Gets a readable string for this ByteBufPool|
|GetUsages|Returns the chunk numbers in each queue.|

###Microsoft.Spark.CSharp.Network.RioNative
####Summary

        RioNative imports and initializes RIOSock.dll for use with RIO socket APIs.
        It also provides a simple thread pool that retrieves the results from the IO completion port.

####Methods

|Name|Description|
|---|---|
|Finalize|Finalizer|
|Dispose|Release all resources.|
|SetUseThreadPool|Sets whether to use a thread pool to query RIO socket results; it must be called before calling EnsureRioLoaded()|
|EnsureRioLoaded|Ensures that the native dll of RIO socket is loaded and initialized.|
|UnloadRio|Explicitly unload the native dll of RIO socket, and release resources.|
|Init|Initializes the RIOSock native library.|

###Microsoft.Spark.CSharp.Network.RioResult
####Summary

        The RioResult structure contains data used to indicate the completion results of requests made with a RIO socket.

###Microsoft.Spark.CSharp.Network.SocketStream
####Summary

        Provides the underlying stream of data for network access.
        Just like a NetworkStream.

####Methods

|Name|Description|
|---|---|
|Flush|Flushes data in the send cache to the stream.|
|Seek|Seeks a specific position in the stream. This method is not supported by the SocketDataStream class.|
|SetLength|Sets the length of the stream. This method is not supported by the SocketDataStream class.|
|ReadByte|Reads a byte from the stream and advances the position within the stream by one byte, or returns -1 if at the end of the stream.|
|Read|Reads data from the stream.|
|Write|Writes data to the stream.|

###Microsoft.Spark.CSharp.Network.SockDataToken
####Summary

        SockDataToken class is used to associate with the SocketAsyncEventArgs object.
        Primarily, it is a way to pass state to the event handler.

####Methods

|Name|Description|
|---|---|
|Reset|Reset this token|
|DetachData|Detach the data ownership.|

###Microsoft.Spark.CSharp.Network.SocketFactory
####Summary

        SocketFactory is used to create an ISocketWrapper instance based on the configuration and OS version.

        The ISocket instance can be a RioSocket object if the configuration is set to RioSocket and
        the application is running on a Windows OS that supports Registered IO sockets.

####Methods

|Name|Description|
|---|---|
|CreateSocket|Creates an ISocket instance based on the configuration and OS version.|
|IsRioSockSupported|Indicates whether the current OS supports RIO sockets.|

###Microsoft.Spark.CSharp.Sql.Builder
####Summary

        The entry point to programming Spark with the Dataset and DataFrame API.

####Methods

|Name|Description|
|---|---|
|Master|Sets the Spark master URL to connect to, such as "local" to run locally, "local[4]" to run locally with 4 cores, or "spark://master:7077" to run on a Spark standalone cluster.|
|AppName|Sets a name for the application, which will be shown in the Spark web UI. If no application name is set, a randomly generated name will be used.|
|Config|Sets a config option. Options set using this method are automatically propagated to both SparkConf and SparkSession's own configuration.|
|Config|Sets a config option. Options set using this method are automatically propagated to both SparkConf and SparkSession's own configuration.|
|Config|Sets a config option. Options set using this method are automatically propagated to both SparkConf and SparkSession's own configuration.|
|Config|Sets a config option. Options set using this method are automatically propagated to both SparkConf and SparkSession's own configuration.|
|Config|Sets a list of config options based on the given SparkConf|
|EnableHiveSupport|Enables Hive support, including connectivity to a persistent Hive metastore, support for Hive serdes, and Hive user-defined functions.|
|GetOrCreate|Gets an existing [[SparkSession]] or, if there is no existing one, creates a new one based on the options set in this builder.|
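
A hypothetical sketch of the builder pattern described above (the option values are placeholders; the `SparkSession.Builder()` entry point and fluent chaining are assumed):

```csharp
// Configure and obtain a SparkSession through the fluent builder.
var spark = SparkSession
    .Builder()                                   // assumed entry point on SparkSession
    .Master("local[2]")
    .AppName("MobiusSqlSample")
    .Config("spark.sql.shuffle.partitions", "4") // assumed string/string Config overload
    .GetOrCreate();
```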

###Microsoft.Spark.CSharp.Sql.Catalog.Catalog
####Summary

        Catalog interface for Spark.

####Methods

|Name|Description|
|---|---|
|ListDatabases|Returns a list of databases available across all sessions.|
|ListTables|Returns a list of tables in the current database or the given database. This includes all temporary tables.|
|ListColumns|Returns a list of columns for the given table in the current database or the given temporary table.|
|ListFunctions|Returns a list of functions registered in the specified database. This includes all temporary functions.|
|SetCurrentDatabase|Sets the current default database in this session.|
|DropTempView|Drops the temporary view with the given view name in the catalog. If the view has been cached before, then it will also be uncached.|
|IsCached|Returns true if the table is currently cached in-memory.|
|CacheTable|Caches the specified table in-memory.|
|UnCacheTable|Removes the specified table from the in-memory cache.|
|RefreshTable|Invalidates and refreshes all the cached metadata of the given table. For performance reasons, Spark SQL or the external data source library it uses might cache certain metadata about a table, such as the location of blocks. When those change outside of Spark SQL, users should call this function to invalidate the cache. If this table is cached as an InMemoryRelation, drop the original cached version and make the new version cached lazily.|
|ClearCache|Removes all cached tables from the in-memory cache.|
|CreateExternalTable|Creates an external table from the given path and returns the corresponding DataFrame. It will use the default data source configured by spark.sql.sources.default.|
|CreateExternalTable|Creates an external table from the given path on a data source and returns the corresponding DataFrame.|
|CreateExternalTable|Creates an external table from the given path based on a data source and a set of options. Then, returns the corresponding DataFrame.|
|CreateExternalTable|Creates an external table from the given path based on a data source, a schema and a set of options. Then, returns the corresponding DataFrame.|
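
A short, hypothetical sketch of the catalog calls above; it assumes a SparkSession `spark` that exposes the catalog (for example through a `Catalog` property) and a table named "people" that already exists:

```csharp
// Inspect and cache a table through the catalog.
var catalog = spark.Catalog;                   // assumed accessor on SparkSession

catalog.SetCurrentDatabase("default");
catalog.CacheTable("people");                  // "people" is a placeholder table name
Console.WriteLine(catalog.IsCached("people")); // True once cached
catalog.UnCacheTable("people");
catalog.ClearCache();
```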

###Microsoft.Spark.CSharp.Sql.Catalog.Database
####Summary

        A database in Spark

###Microsoft.Spark.CSharp.Sql.Catalog.Table
####Summary

        A table in Spark

###Microsoft.Spark.CSharp.Sql.Catalog.Column
####Summary

        A column in Spark

###Microsoft.Spark.CSharp.Sql.Catalog.Function
####Summary

        A user-defined function in Spark

###Microsoft.Spark.CSharp.Sql.Column ####Summary

        A column that will be computed based on the data in a DataFrame.

####Methods

NameDescription
op_LogicalNotThe logical negation operator that negates its operand.
op_UnaryNegationNegation of itself.
op_AdditionSum of this expression and another expression.
op_SubtractionSubtraction of this expression and another expression.
op_MultiplyMultiplication of this expression and another expression.
op_DivisionDivision of this expression by another expression.
op_ModulusModulo (a.k.a. remainder) expression.
op_EqualityThe equality operator returns true if the values of its operands are equal, false otherwise.
op_InequalityThe inequality operator returns false if its operands are equal, true otherwise.
op_LessThanThe "less than" relational operator that returns true if the first operand is less than the second, false otherwise.
op_LessThanOrEqualThe "less than or equal" relational operator that returns true if the first operand is less than or equal to the second, false otherwise.
op_GreaterThanOrEqualThe "greater than or equal" relational operator that returns true if the first operand is greater than or equal to the second, false otherwise.
op_GreaterThanThe "greater than" relational operator that returns true if the first operand is greater than the second, false otherwise.
op_BitwiseOrCompute bitwise OR of this expression with another expression.
op_BitwiseAndCompute bitwise AND of this expression with another expression.
op_ExclusiveOrCompute bitwise XOR of this expression with another expression.
GetHashCodeRequired when operator == or operator != is defined
EqualsRequired when operator == or operator != is defined
LikeSQL like expression.
RLikeSQL RLIKE expression (LIKE with Regex).
StartsWithString starts with another string literal.
EndsWithString ends with another string literal.
AscReturns a sort expression based on the ascending order.
DescReturns a sort expression based on the descending order.
AliasReturns this column aliased with a new name.
AliasReturns this column aliased with new names
CastCasts the column to a different data type, using the canonical string representation of the type. The supported types are: `string`, `boolean`, `byte`, `short`, `int`, `long`, `float`, `double`, `decimal`, `date`, `timestamp`. E.g. // Casts colA to integer. df.Select(df["colA"].Cast("int"))
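
Column objects compose with the operators above into expressions that are evaluated lazily against a DataFrame. A small sketch follows; it assumes an existing DataFrame `df` with "name" and "age" columns, and the overload shapes are inferred from the method list rather than verified.

```csharp
// `df` is an existing DataFrame with "name" and "age" columns (assumption).
var adults    = df.Filter(df["age"] >= 18);                // relational operator on a Column
var renamed   = df.Select(df["name"].Alias("full_name"));  // alias a column
var ordered   = df.Sort(df["age"].Desc());                 // sort expression from a Column
var ageString = df.Select(df["age"].Cast("string"));       // cast via canonical type name
var matches   = df.Filter(df["name"].Like("A%"));          // SQL LIKE expression
```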

###Microsoft.Spark.CSharp.Sql.DataFrame ####Summary

         A distributed collection of data organized into named columns.
        
        See also http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame

####Methods

NameDescription
RegisterTempTableRegisters this DataFrame as a temporary table using the given name. The lifetime of this temporary table is tied to the SqlContext that was used to create this DataFrame.
CountNumber of rows in the DataFrame
ShowDisplays rows of the DataFrame in tabular form
ShowSchemaPrints the schema information of the DataFrame
CollectReturns all Rows in this DataFrame
ToRDDConverts the DataFrame to RDD of Row
ToJSONReturns the content of the DataFrame as RDD of JSON strings
ExplainPrints the plans (logical and physical) to the console for debugging purposes
SelectSelects a set of columns specified by column name or Column. df.Select("colA", df["colB"]) df.Select("*", df["colB"] + 10)
SelectSelects a set of columns. This is a variant of `select` that can only select existing columns using column names (i.e. cannot construct expressions). df.Select("colA", "colB")
SelectExprSelects a set of SQL expressions. This is a variant of `select` that accepts SQL expressions. df.SelectExpr("colA", "colB as newName", "abs(colC)")
WhereFilters rows using the given condition
FilterFilters rows using the given condition
GroupByGroups the DataFrame using the specified columns, so we can run aggregation on them.
RollupCreate a multi-dimensional rollup for the current DataFrame using the specified columns, so we can run aggregation on them.
CubeCreate a multi-dimensional cube for the current DataFrame using the specified columns, so we can run aggregation on them.
AggAggregates on the DataFrame for the given column-aggregate function mapping
JoinJoin with another DataFrame - Cartesian join
JoinJoin with another DataFrame - Inner equi-join using given column name
JoinJoin with another DataFrame - Inner equi-join using given column name
JoinJoin with another DataFrame, using the specified JoinType
IntersectIntersect with another DataFrame. This is equivalent to `INTERSECT` in SQL. Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, intersect(self, other)
UnionAllUnion with another DataFrame WITHOUT removing duplicated rows. This is equivalent to `UNION ALL` in SQL. Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, unionAll(self, other)
SubtractReturns a new DataFrame containing rows in this frame but not in another frame. This is equivalent to `EXCEPT` in SQL. Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, subtract(self, other)
DropReturns a new DataFrame with a column dropped. Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, drop(self, col)
DropNaReturns a new DataFrame omitting rows with null values. Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, dropna(self, how='any', thresh=None, subset=None)
NaReturns a DataFrameNaFunctions for working with missing data.
FillNaReplace null values, alias for `na.fill()`
DropDuplicatesReturns a new DataFrame with duplicate rows removed, considering only the subset of columns. Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, dropDuplicates(self, subset=None)
Replace``1Returns a new DataFrame replacing a value with another value. Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, replace(self, to_replace, value, subset=None)
ReplaceAll``1Returns a new DataFrame replacing values with other values. Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, replace(self, to_replace, value, subset=None)
ReplaceAll``1Returns a new DataFrame replacing values with another value. Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, replace(self, to_replace, value, subset=None)
RandomSplitRandomly splits this DataFrame with the provided weights. Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, randomSplit(self, weights, seed=None)
ColumnsReturns all column names as a list. Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, columns(self)
DTypesReturns all column names and their data types. Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, dtypes(self)
SortReturns a new DataFrame sorted by the specified column(s). Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, sort(self, *cols, **kwargs)
SortReturns a new DataFrame sorted by the specified column(s). Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, sort(self, *cols, **kwargs)
SortWithinPartitionsReturns a new DataFrame sorted by the specified column(s). Reference to https://github.com/apache/spark/blob/branch-1.6/python/pyspark/sql/dataframe.py, sortWithinPartitions(self, *cols, **kwargs)
SortWithinPartitionReturns a new DataFrame sorted by the specified column(s). Reference to https://github.com/apache/spark/blob/branch-1.6/python/pyspark/sql/dataframe.py, sortWithinPartitions(self, *cols, **kwargs)
AliasReturns a new DataFrame with an alias set. Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, alias(self, alias)
WithColumnReturns a new DataFrame by adding a column. Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, withColumn(self, colName, col)
WithColumnRenamedReturns a new DataFrame by renaming an existing column. Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, withColumnRenamed(self, existing, new)
CorrCalculates the correlation of two columns of a DataFrame as a double value. Currently only supports the Pearson Correlation Coefficient. Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, corr(self, col1, col2, method=None)
CovCalculate the sample covariance of two columns as a double value. Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, cov(self, col1, col2)
FreqItemsFinds frequent items for columns, possibly with false positives, using the frequent element count algorithm described in "http://dx.doi.org/10.1145/762471.762473, proposed by Karp, Schenker, and Papadimitriou". Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, freqItems(self, cols, support=None) Note: This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting DataFrame.
CrosstabComputes a pair-wise frequency table of the given columns. Also known as a contingency table. The number of distinct values for each column should be less than 1e4. At most 1e6 non-zero pair frequencies will be returned. Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, crosstab(self, col1, col2)
DescribeComputes statistics for numeric columns. This includes count, mean, stddev, min, and max. If no columns are given, this function computes statistics for all numerical columns.
LimitReturns a new DataFrame by taking the first `n` rows. The difference between this function and `head` is that `head` returns an array while `limit` returns a new DataFrame.
HeadReturns the first `n` rows.
FirstReturns the first row.
TakeReturns the first `n` rows in the DataFrame.
DistinctReturns a new DataFrame that contains only the unique rows from this DataFrame.
CoalesceReturns a new DataFrame that has exactly `numPartitions` partitions. Similar to coalesce defined on an RDD, this operation results in a narrow dependency, e.g. if you go from 1000 partitions to 100 partitions, there will not be a shuffle, instead each of the 100 new partitions will claim 10 of the current partitions.
PersistPersist this DataFrame with the default storage level (`MEMORY_AND_DISK`)
UnpersistMark the DataFrame as non-persistent, and remove all blocks for it from memory and disk.
CachePersist this DataFrame with the default storage level (`MEMORY_AND_DISK`)
RepartitionReturns a new DataFrame that has exactly `numPartitions` partitions.
RepartitionReturns a new [[DataFrame]] partitioned by the given partitioning columns into the given number of partitions. The resulting DataFrame is hash partitioned. The number of partitions is optional; if not specified, the current number of partitions is kept.
RepartitionReturns a new [[DataFrame]] partitioned by the given partitioning columns into the given number of partitions. The resulting DataFrame is hash partitioned. The number of partitions is optional; if not specified, the current number of partitions is kept.
SampleReturns a new DataFrame by sampling a fraction of rows.
FlatMap``1Returns a new RDD by first applying a function to all rows of this DataFrame, and then flattening the results.
Map``1Returns a new RDD by applying a function to all rows of this DataFrame.
MapPartitions``1Returns a new RDD by applying a function to each partition of this DataFrame.
ForeachPartitionApplies a function f to each partition of this DataFrame.
ForeachApplies a function f to all rows.
WriteInterface for saving the content of the DataFrame out into external storage.
SaveAsParquetFileSaves the contents of this DataFrame as a parquet file, preserving the schema. Files that are written out using this method can be read back in as a DataFrame using the `parquetFile` function in SQLContext.
InsertIntoAdds the rows from this RDD to the specified table, optionally overwriting the existing data.
SaveAsTableCreates a table from the contents of this DataFrame based on a given data source, SaveMode specified by mode, and a set of options. Note that this currently only works with DataFrames that are created from a HiveContext as there is no notion of a persisted catalog in a standard SQL context. Instead you can write an RDD out to a parquet file, and then register that file as a table. This "table" can then be the target of an `insertInto`. Also note that while this function can persist the table metadata into Hive's metastore, the table will NOT be accessible from Hive, until SPARK-7550 is resolved.
SaveSaves the contents of this DataFrame based on the given data source, SaveMode specified by mode, and a set of options.
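
A hedged end-to-end sketch of the common DataFrame operations listed above; `sqlContext`, the input path, and the column names are assumptions for illustration.

```csharp
using System;

// Assumes an existing SqlContext (or SparkSession) and a JSON input file.
var people = sqlContext.Read().Json("people.json");

people.Show();                                    // tabular preview of rows
Console.WriteLine(people.Count());                // total row count

var adults = people
    .Where(people["age"] > 21)                    // filter with a Column expression
    .Select("name", "age")                        // project by column name
    .DropDuplicates();

adults.RegisterTempTable("adults");               // make it queryable through SQL
foreach (var row in adults.Sort(adults["age"].Desc()).Limit(10).Collect())
{
    Console.WriteLine(row);
}
```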

###Microsoft.Spark.CSharp.Sql.JoinType ####Summary

        The type of join operation for DataFrame

###Microsoft.Spark.CSharp.Sql.GroupedData ####Summary

        A set of methods for aggregations on a DataFrame, created by DataFrame.groupBy.

####Methods

NameDescription
AggCompute aggregates by specifying a dictionary from column name to aggregate methods. The available aggregate methods are avg, max, min, sum, count.
CountCount the number of rows for each group.
MeanCompute the average value for each numeric column for each group. This is an alias for avg. When specific columns are given, only compute the average values for them.
MaxCompute the max value for each numeric column for each group. When specific columns are given, only compute the max values for them.
MinCompute the min value for each numeric column for each group.
AvgCompute the mean value for each numeric column for each group. When specific columns are given, only compute the mean values for them.
SumCompute the sum for each numeric column for each group. When specific columns are given, only compute the sum for them.
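
As described for Agg above, aggregates can be requested with a dictionary from column name to aggregate method. A brief sketch, with the column names as assumptions:

```csharp
using System.Collections.Generic;

// `df` is an existing DataFrame with "department", "salary" and "age" columns (assumption).
var grouped = df.GroupBy("department");

var counts = grouped.Count();                     // rows per group
var stats  = grouped.Agg(new Dictionary<string, string>
{
    { "salary", "avg" },                          // average salary per department
    { "age",    "max" }                           // oldest member per department
});
stats.Show();
```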

###Microsoft.Spark.CSharp.Sql.DataFrameNaFunctions ####Summary

        Functionality for working with missing data in DataFrames.

####Methods

NameDescription
DropReturns a new DataFrame that drops rows containing any null values.
DropReturns a new DataFrame that drops rows containing null values. If `how` is "any", then drop rows containing any null values. If `how` is "all", then drop rows only if every column is null for that row.
DropReturns a new [[DataFrame]] that drops rows containing null values in the specified columns. If `how` is "any", then drop rows containing any null values in the specified columns. If `how` is "all", then drop rows only if every specified column is null for that row.
DropReturns a new DataFrame that drops rows containing any null values in the specified columns.
DropReturns a new DataFrame that drops rows containing less than `minNonNulls` non-null values.
DropReturns a new DataFrame that drops rows containing less than `minNonNulls` non-null values in the specified columns.
FillReturns a new DataFrame that replaces null values in numeric columns with `value`.
FillReturns a new DataFrame that replaces null values in string columns with `value`.
FillReturns a new DataFrame that replaces null values in specified numeric columns. If a specified column is not a numeric column, it is ignored.
FillReturns a new DataFrame that replaces null values in specified string columns. If a specified column is not a string column, it is ignored.
FillReplaces values matching keys in `replacement` map with the corresponding values. Key and value of `replacement` map must have the same type, and can only be doubles or strings. The value must be of the following type: `Integer`, `Long`, `Float`, `Double`, `String`. For example, the following replaces null values in column "A" with string "unknown", and null values in column "B" with numeric value 1.0. import com.google.common.collect.ImmutableMap; df.na.fill(ImmutableMap.of("A", "unknown", "B", 1.0));
Replace``1Replaces values matching keys in `replacement` map with the corresponding values. Key and value of `replacement` map must have the same type, and can only be doubles or strings. If `col` is "*", then the replacement is applied on all string columns or numeric columns. Example: import com.google.common.collect.ImmutableMap; // Replaces all occurrences of 1.0 with 2.0 in column "height". df.replace("height", ImmutableMap.of(1.0, 2.0)); // Replaces all occurrences of "UNKNOWN" with "unnamed" in column "name". df.replace("name", ImmutableMap.of("UNKNOWN", "unnamed")); // Replaces all occurrences of "UNKNOWN" with "unnamed" in all string columns. df.replace("*", ImmutableMap.of("UNKNOWN", "unnamed"));
Replace``1Replaces values matching keys in `replacement` map with the corresponding values. Key and value of `replacement` map must have the same type, and can only be doubles or strings. If `col` is "*", then the replacement is applied on all string columns or numeric columns. Example: import com.google.common.collect.ImmutableMap; // Replaces all occurrences of 1.0 with 2.0 in column "height" and "weight". df.replace(new String[] {"height", "weight"}, ImmutableMap.of(1.0, 2.0)); // Replaces all occurrences of "UNKNOWN" with "unnamed" in column "firstname" and "lastname". df.replace(new String[] {"firstname", "lastname"}, ImmutableMap.of("UNKNOWN", "unnamed"));
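
A minimal sketch of missing-data handling through the DataFrame's Na accessor; treating Na as a method call is an assumption about the exact member shape.

```csharp
// `df` is an existing DataFrame (assumption).
var na = df.Na();

var noNulls   = na.Drop();             // drop rows containing any null values
var defaulted = na.Fill(0.0);          // replace nulls in numeric columns with 0.0
var labelled  = na.Fill("unknown");    // replace nulls in string columns with "unknown"
```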

###Microsoft.Spark.CSharp.Sql.DataFrameReader ####Summary

        Interface used to load a DataFrame from external storage systems (e.g. file systems,
        key-value stores, etc.). Use SqlContext.Read() to access this.

####Methods

NameDescription
FormatSpecifies the input data source format.
SchemaSpecifies the input schema. Some data sources (e.g. JSON) can infer the input schema automatically from data. By specifying the schema here, the underlying data source can skip the schema inference step, and thus speed up data loading.
OptionAdds an input option for the underlying data source.
OptionsAdds input options for the underlying data source.
LoadLoads input in as a [[DataFrame]], for data sources that require a path (e.g. data backed by a local or distributed file system).
LoadLoads input in as a DataFrame, for data sources that don't require a path (e.g. external key-value stores).
JdbcConstruct a [[DataFrame]] representing the database table accessible via JDBC URL, url named table and connection properties.
JdbcConstruct a DataFrame representing the database table accessible via JDBC URL url named table. Partitions of the table will be retrieved in parallel based on the parameters passed to this function. Don't create too many partitions in parallel on a large cluster; otherwise Spark might crash your external database systems.
JdbcConstruct a DataFrame representing the database table accessible via JDBC URL url named table using connection properties. The `predicates` parameter gives a list of expressions suitable for inclusion in WHERE clauses; each one defines one partition of the DataFrame. Don't create too many partitions in parallel on a large cluster; otherwise Spark might crash your external database systems.
JsonLoads a JSON file (one object per line) and returns the result as a DataFrame. This function goes through the input once to determine the input schema. If you know the schema in advance, use the version that specifies the schema to avoid the extra scan.
ParquetLoads a Parquet file, returning the result as a [[DataFrame]]. This function returns an empty DataFrame if no paths are passed in.
AvroLoads an Avro file and returns the result as a DataFrame. This function goes through the input once to determine the input schema. If you know the schema in advance, use the version that specifies the schema to avoid the extra scan.
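
Reads are configured fluently before Load (or one of the format-specific shortcuts) is called. A sketch follows; the option key shown is purely illustrative and the paths are placeholders.

```csharp
// `sqlContext` is an existing SqlContext (assumption).
var reader = sqlContext.Read();

var events = reader
    .Format("json")
    .Option("samplingRatio", "1.0")            // illustrative option key only
    .Load("hdfs:///data/events");

var logs = sqlContext.Read().Parquet("hdfs:///data/logs");   // shortcut for Parquet input
```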

###Microsoft.Spark.CSharp.Sql.DataFrameWriter ####Summary

        Interface used to write a DataFrame to external storage systems (e.g. file systems,
        key-value stores, etc). Use DataFrame.Write to access this.
        
        See also http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameWriter

####Methods

NameDescription
ModeSpecifies the behavior when data or table already exists. Options include: - `SaveMode.Overwrite`: overwrite the existing data. - `SaveMode.Append`: append the data. - `SaveMode.Ignore`: ignore the operation (i.e. no-op). - `SaveMode.ErrorIfExists`: default option, throw an exception at runtime.
ModeSpecifies the behavior when data or table already exists. Options include: - `SaveMode.Overwrite`: overwrite the existing data. - `SaveMode.Append`: append the data. - `SaveMode.Ignore`: ignore the operation (i.e. no-op). - `SaveMode.ErrorIfExists`: default option, throw an exception at runtime.
FormatSpecifies the underlying output data source. Built-in options include "parquet", "json", etc.
OptionAdds an output option for the underlying data source.
OptionsAdds output options for the underlying data source.
PartitionByPartitions the output by the given columns on the file system. If specified, the output is laid out on the file system similar to Hive's partitioning scheme. This is only applicable for Parquet at the moment.
SaveSaves the content of the DataFrame at the specified path.
SaveSaves the content of the DataFrame as the specified table.
InsertIntoInserts the content of the DataFrame to the specified table. It requires that the schema of the DataFrame is the same as the schema of the table. Because it inserts data to an existing table, format or options will be ignored.
SaveAsTableSaves the content of the DataFrame as the specified table. In the case the table already exists, behavior of this function depends on the save mode, specified by the `mode` function (default to throwing an exception). When `mode` is `Overwrite`, the schema of the DataFrame does not need to be the same as that of the existing table. When `mode` is `Append`, the schema of the DataFrame need to be the same as that of the existing table, and format or options will be ignored.
JdbcSaves the content of the DataFrame to a external database table via JDBC. In the case the table already exists in the external database, behavior of this function depends on the save mode, specified by the `mode` function (default to throwing an exception). Don't create too many partitions in parallel on a large cluster; otherwise Spark might crash your external database systems.
JsonSaves the content of the DataFrame in JSON format at the specified path. This is equivalent to: Format("json").Save(path)
ParquetSaves the content of the DataFrame in Parquet format at the specified path. This is equivalent to: Format("parquet").Save(path)
AvroSaves the content of the DataFrame in AVRO format at the specified path. This is equivalent to: Format("com.databricks.spark.avro").Save(path)
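
Writes mirror the reader: choose a format, a SaveMode, optional partitioning, then Save. A hedged sketch; the paths and the partition column are placeholders, and `Write()` as a method call is assumed from the DataFrame table above.

```csharp
using Microsoft.Spark.CSharp.Sql;

// `df` is an existing DataFrame (assumption).
df.Write()
  .Format("parquet")
  .Mode(SaveMode.Overwrite)        // replace any existing output
  .PartitionBy("year")             // lay the files out by the "year" column
  .Save("hdfs:///output/records");

df.Write().Json("hdfs:///output/records-json");   // format-specific shortcut
```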

###Microsoft.Spark.CSharp.Sql.Dataset ####Summary

         Dataset is a strongly typed collection of domain-specific objects that can be transformed
        in parallel using functional or relational operations. Each Dataset also has an untyped view
        called a DataFrame, which is a Dataset of Row.

####Methods

NameDescription
ToDFConverts this strongly typed collection of data to a generic DataFrame. In contrast to the strongly typed objects that Dataset operations work on, a DataFrame returns generic [[Row]] objects that allow fields to be accessed by ordinal or name.
PrintSchemaPrints the schema to the console in a nice tree format.
ExplainPrints the plans (logical and physical) to the console for debugging purposes.
ExplainPrints the physical plan to the console for debugging purposes.
DTypesReturns all column names and their data types as an array.
ColumnsReturns all column names as an array.
ShowDisplays the top 20 rows of the Dataset in a tabular form. Strings longer than 20 characters will be truncated, and all cells will be aligned right.
ShowSchemaPrints the schema of the Dataset

###Microsoft.Spark.CSharp.Sql.Dataset`1 ####Summary

        Dataset of specific types
        
        Type parameter

###Microsoft.Spark.CSharp.Sql.HiveContext ####Summary

        HiveContext is deprecated; use SparkSession.Builder().EnableHiveSupport() instead.
        HiveContext is a variant of Spark SQL that integrates with data stored in Hive. 
        Configuration for Hive is read from hive-site.xml on the classpath.
        It supports running both SQL and HiveQL commands.

####Methods

NameDescription
SqlExecutes a SQL query using Spark, returning the result as a DataFrame. The dialect that is used for SQL parsing can be configured with 'spark.sql.dialect'
RefreshTableInvalidate and refresh all the cached metadata of the given table. For performance reasons, Spark SQL or the external data source library it uses might cache certain metadata about a table, such as the location of blocks. When those change outside of Spark SQL, users should call this function to invalidate the cache.

###Microsoft.Spark.CSharp.Sql.PythonSerDe ####Summary

        Used for SerDe of Python objects

####Methods

NameDescription
GetUnpickledObjectsUnpickles objects from byte[]

###Microsoft.Spark.CSharp.Sql.RowConstructor ####Summary

        Used by Unpickler to unpickle pickled objects. It is also used to construct a Row (C# representation of pickled objects).

####Methods

NameDescription
ToStringReturns a string that represents the current object.
constructUsed by Unpickler - do not use to construct Row. Use GetRow() method
GetRowUsed to construct a Row

###Microsoft.Spark.CSharp.Sql.Row ####Summary

         Represents one row of output from a relational operator.

####Methods

NameDescription
SizeNumber of elements in the Row.
GetSchemaSchema for the row.
GetReturns the value at position i.
GetReturns the value of a given columnName.
GetAs``1Returns the value at position i, the return value will be cast to type T.
GetAs``1Returns the value of a given columnName, the return value will be cast to type T.
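
Collected rows can be read back by position or by column name, with GetAs<T> performing the cast. A sketch, assuming "name" and "age" columns exist:

```csharp
using System;

// `df` is an existing DataFrame with "name" and "age" columns (assumption).
foreach (Row row in df.Collect())
{
    var name = row.GetAs<string>("name");        // by column name, cast to T
    var age  = row.GetAs<int>(1);                // by ordinal position
    Console.WriteLine("{0}: {1} ({2} fields)", name, age, row.Size());
}
```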

###Microsoft.Spark.CSharp.Sql.Functions ####Summary

        DataFrame Built-in functions

####Methods

NameDescription
LitCreates a Column of any literal value.
ColReturns a Column based on the given column name.
ColumnReturns a Column based on the given column name.
AscReturns a sort expression based on ascending order of the column.
DescReturns a sort expression based on the descending order of the column.
UpperConverts a string column to upper case.
LowerConverts a string column to lower case.
SqrtComputes the square root of the specified float column.
AbsComputes the absolute value.
MaxReturns the maximum value of the expression in a group.
MinReturns the minimum value of the expression in a group.
FirstReturns the first value in a group.
LastReturns the last value in a group.
CountReturns the number of items in a group.
SumReturns the sum of all values in the expression.
AvgReturns the average of the values in a group.
MeanReturns the average of the values in a group.
SumDistinctReturns the sum of distinct values in the expression.
ArrayCreates a new array column. The input columns must all have the same data type.
CoalesceReturns the first column that is not null, or null if all inputs are null.
CountDistinctReturns the number of distinct items in a group.
StructCreates a new struct column.
ApproxCountDistinctReturns the approximate number of distinct items in a group
ExplodeCreates a new row for each element in the given array or map column.
RandGenerate a random column with i.i.d. samples from U[0.0, 1.0].
RandnGenerate a column with i.i.d. samples from the standard normal distribution.
NtileReturns the ntile group id (from 1 to n inclusive) in an ordered window partition. This is equivalent to the NTILE function in SQL.
AcosComputes the cosine inverse of the given column; the returned angle is in the range 0 through pi.
AsinComputes the sine inverse of the given column; the returned angle is in the range -pi/2 through pi/2.
AtanComputes the tangent inverse of the given column.
CbrtComputes the cube-root of the given column.
CeilComputes the ceiling of the given column.
CosComputes the cosine of the given column.
CoshComputes the hyperbolic cosine of the given column.
ExpComputes the exponential of the given column.
Expm1Computes the exponential of the given column minus one.
FloorComputes the floor of the given column.
LogComputes the natural logarithm of the given column.
Log10Computes the logarithm of the given column in base 10.
Log1pComputes the natural logarithm of the given column plus one.
RintReturns the double value that is closest in value to the argument and is equal to a mathematical integer.
SignumComputes the signum of the given column.
SinComputes the sine of the given column.
SinhComputes the hyperbolic sine of the given column.
TanComputes the tangent of the given column.
TanhComputes the hyperbolic tangent of the given column.
ToDegreesConverts an angle measured in radians to an approximately equivalent angle measured in degrees.
ToRadiansConverts an angle measured in degrees to an approximately equivalent angle measured in radians.
BitwiseNOTComputes bitwise NOT.
Atan2Returns the angle theta from the conversion of rectangular coordinates (x, y) to polar coordinates (r, theta).
HypotComputes sqrt(a^2 + b^2) without intermediate overflow or underflow.
HypotComputes sqrt(a^2 + b^2) without intermediate overflow or underflow.
HypotComputes sqrt(a^2 + b^2) without intermediate overflow or underflow.
PowReturns the value of the first argument raised to the power of the second argument.
PowReturns the value of the first argument raised to the power of the second argument.
PowReturns the value of the first argument raised to the power of the second argument.
ApproxCountDistinctReturns the approximate number of distinct items in a group.
WhenEvaluates a list of conditions and returns one of multiple possible result expressions.
LagReturns the value that is offset rows before the current row, and null if there are fewer than offset rows before the current row.
LeadReturns the value that is offset rows after the current row, and null if there are fewer than offset rows after the current row.
RowNumberReturns a sequential number starting at 1 within a window partition.
DenseRankReturns the rank of rows within a window partition, without any gaps.
RankReturns the rank of rows within a window partition.
CumeDistReturns the cumulative distribution of values within a window partition.
PercentRankReturns the relative rank (i.e. percentile) of rows within a window partition.
MonotonicallyIncreasingIdA column expression that generates monotonically increasing 64-bit integers.
SparkPartitionIdPartition ID of the Spark task. Note that this is non-deterministic because it depends on data partitioning and task scheduling.
RandGenerate a random column with i.i.d. samples from U[0.0, 1.0].
RandnGenerate a column with i.i.d. samples from the standard normal distribution.
Udf``1Defines a user-defined function (UDF) of 0 arguments. The data types are automatically inferred based on the function's signature.
Udf``2Defines a user-defined function (UDF) of 1 argument. The data types are automatically inferred based on the function's signature.
Udf``3Defines a user-defined function (UDF) of 2 arguments. The data types are automatically inferred based on the function's signature.
Udf``4Defines a user-defined function (UDF) of 3 arguments. The data types are automatically inferred based on the function's signature.
Udf``5Defines a user-defined function (UDF) of 4 arguments. The data types are automatically inferred based on the function's signature.
Udf``6Defines a user-defined function (UDF) of 5 arguments. The data types are automatically inferred based on the function's signature.
Udf``7Defines a user-defined function (UDF) of 6 arguments. The data types are automatically inferred based on the function's signature.
Udf``8Defines a user-defined function (UDF) of 7 arguments. The data types are automatically inferred based on the function's signature.
Udf``9Defines a user-defined function (UDF) of 8 arguments. The data types are automatically inferred based on the function's signature.
Udf``10Defines a user-defined function (UDF) of 9 arguments. The data types are automatically inferred based on the function's signature.
Udf``11Defines a user-defined function (UDF) of 10 arguments. The data types are automatically inferred based on the function's signature.
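
The built-in functions compose with Column expressions, and Udf wraps a lambda whose first type parameter is the return type (matching the RegisterFunction convention elsewhere in this document). A hedged sketch; the column names and the exact return type of Udf are assumptions.

```csharp
using Microsoft.Spark.CSharp.Sql;

// `df` is an existing DataFrame with "name", "a" and "b" columns (assumption).
var upperNames = df.Select(Functions.Upper(Functions.Col("name")));
var totals     = df.Select((Functions.Col("a") + Functions.Col("b")).Alias("total"));

// One-argument UDF; the delegate's types are inferred from its signature.
var isAdult = Functions.Udf<bool, int>(age => age >= 18);
var flagged = df.Select(Functions.Col("name"), isAdult(df["age"]).Alias("is_adult"));
```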

###Microsoft.Spark.CSharp.Sql.SaveMode ####Summary

        SaveMode is used to specify the expected behavior of saving a DataFrame to a data source.


###Microsoft.Spark.CSharp.Sql.SaveModeExtensions ####Summary

        Extension methods for SaveMode. For SaveMode.ErrorIfExists, the corresponding literal string in Spark is "error" or "default".

####Methods

NameDescription
GetStringValueGets the string for the value of SaveMode

###Microsoft.Spark.CSharp.Sql.SparkSession ####Summary

        The entry point to programming Spark with the Dataset and DataFrame API.

####Methods

NameDescription
BuilderBuilder for SparkSession
NewSessionStart a new session with isolated SQL configurations and temporary tables; registered functions are isolated as well, but the underlying [[SparkContext]] and cached data are shared. Note: other than the [[SparkContext]], all shared state is initialized lazily. This method will force the initialization of the shared state to ensure that parent and child sessions are set up with the same shared state. If the underlying catalog implementation is Hive, this will initialize the metastore, which may take some time.
StopStops the underlying SparkContext
ReadReturns a DataFrameReader that can be used to read non-streaming data in as a DataFrame
CreateDataFrameCreates a DataFrame from an RDD containing an array of objects, using the given schema.
TableReturns the specified table as a DataFrame
SqlExecutes a SQL query using Spark, returning the result as a DataFrame. The dialect that is used for SQL parsing can be configured with 'spark.sql.dialect'
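
A brief sketch of the session-level entry points above; `spark` is assumed to be an existing SparkSession and "people" an already-registered table.

```csharp
// `spark` is an existing SparkSession (assumption).
var fromSql  = spark.Sql("SELECT 1 AS id");   // run SQL, get a DataFrame back
var people   = spark.Table("people");         // look up a registered/temporary table
var isolated = spark.NewSession();            // separate SQL conf and temp tables, shared SparkContext
var reader   = spark.Read();                  // DataFrameReader for non-streaming input

spark.Stop();                                 // stops the underlying SparkContext
```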

###Microsoft.Spark.CSharp.Sql.SqlContext ####Summary

        The entry point for working with structured data (rows and columns) in Spark.  
        Allows the creation of [[DataFrame]] objects as well as the execution of SQL queries.

####Methods

NameDescription
GetOrCreateGet the existing SQLContext or create a new one with the given SparkContext.
NewSessionReturns a new SQLContext as new session, that has separate SQLConf, registered temporary tables and UDFs, but shared SparkContext and table cache.
GetConfReturns the value of Spark SQL configuration property for the given key. If the key is not set, returns defaultValue.
SetConfSets the given Spark SQL configuration property.
ReadReturns a DataFrameReader that can be used to read data in as a DataFrame.
ReadDataFrameLoads a DataFrame from the source path using the given schema and options
CreateDataFrameCreates a DataFrame from an RDD containing an array of objects, using the given schema.
RegisterDataFrameAsTableRegisters the given DataFrame as a temporary table in the catalog. Temporary tables exist only during the lifetime of this instance of SqlContext.
DropTempTableRemoves the temp table from the catalog.
TableReturns the specified table as a DataFrame
TablesReturns a DataFrame containing names of tables in the given database. If no database is specified, the current database will be used. The returned DataFrame has two columns: 'tableName' and 'isTemporary' (a column with bool type indicating if a table is a temporary one or not).
TableNamesReturns a list of names of tables in the database
CacheTableCaches the specified table in-memory.
UncacheTableRemoves the specified table from the in-memory cache.
ClearCacheRemoves all cached tables from the in-memory cache.
IsCachedReturns true if the table is currently cached in-memory.
SqlExecutes a SQL query using Spark, returning the result as a DataFrame. The dialect that is used for SQL parsing can be configured with 'spark.sql.dialect'
JsonFileLoads a JSON file (one object per line), returning the result as a DataFrame. It goes through the entire dataset once to determine the schema.
JsonFileLoads a JSON file (one object per line) and applies the given schema.
TextFileLoads a delimited text file using the given schema and column delimiter.
TextFileLoads a text file (one object per line), returning the result as a DataFrame
RegisterFunction``1Register UDF with no input argument, e.g: SqlContext.RegisterFunction<bool>("MyFilter", () => true); sqlContext.Sql("SELECT * FROM MyTable where MyFilter()");
RegisterFunction``2Register UDF with 1 input argument, e.g: SqlContext.RegisterFunction<bool, string>("MyFilter", (arg1) => arg1 != null); sqlContext.Sql("SELECT * FROM MyTable where MyFilter(columnName1)");
RegisterFunction``3Register UDF with 2 input arguments, e.g: SqlContext.RegisterFunction<bool, string, string>("MyFilter", (arg1, arg2) => arg1 != null && arg2 != null); sqlContext.Sql("SELECT * FROM MyTable where MyFilter(columnName1, columnName2)");
RegisterFunction``4Register UDF with 3 input arguments, e.g: SqlContext.RegisterFunction<bool, string, string, string>("MyFilter", (arg1, arg2, arg3) => arg1 != null && arg2 != null && arg3 != null); sqlContext.Sql("SELECT * FROM MyTable where MyFilter(columnName1, columnName2, columnName3)");
RegisterFunction``5Register UDF with 4 input arguments, e.g: SqlContext.RegisterFunction<bool, string, string, ..., string>("MyFilter", (arg1, arg2, ..., arg4) => arg1 != null && arg2 != null && ... && arg4 != null); sqlContext.Sql("SELECT * FROM MyTable where MyFilter(columnName1, columnName2, ..., columnName4)");
RegisterFunction``6Register UDF with 5 input arguments, e.g: SqlContext.RegisterFunction<bool, string, string, ..., string>("MyFilter", (arg1, arg2, ..., arg5) => arg1 != null && arg2 != null && ... && arg5 != null); sqlContext.Sql("SELECT * FROM MyTable where MyFilter(columnName1, columnName2, ..., columnName5)");
RegisterFunction``7Register UDF with 6 input arguments, e.g: SqlContext.RegisterFunction<bool, string, string, ..., string>("MyFilter", (arg1, arg2, ..., arg6) => arg1 != null && arg2 != null && ... && arg6 != null); sqlContext.Sql("SELECT * FROM MyTable where MyFilter(columnName1, columnName2, ..., columnName6)");
RegisterFunction``8Register UDF with 7 input arguments, e.g: SqlContext.RegisterFunction<bool, string, string, ..., string>("MyFilter", (arg1, arg2, ..., arg7) => arg1 != null && arg2 != null && ... && arg7 != null); sqlContext.Sql("SELECT * FROM MyTable where MyFilter(columnName1, columnName2, ..., columnName7)");
RegisterFunction``9Register UDF with 8 input arguments, e.g: SqlContext.RegisterFunction<bool, string, string, ..., string>("MyFilter", (arg1, arg2, ..., arg8) => arg1 != null && arg2 != null && ... && arg8 != null); sqlContext.Sql("SELECT * FROM MyTable where MyFilter(columnName1, columnName2, ..., columnName8)");
RegisterFunction``10Register UDF with 9 input arguments, e.g: SqlContext.RegisterFunction<bool, string, string, ..., string>("MyFilter", (arg1, arg2, ..., arg9) => arg1 != null && arg2 != null && ... && arg9 != null); sqlContext.Sql("SELECT * FROM MyTable where MyFilter(columnName1, columnName2, ..., columnName9)");
RegisterFunction``11Register UDF with 10 input arguments, e.g: SqlContext.RegisterFunction<bool, string, string, ..., string>("MyFilter", (arg1, arg2, ..., arg10) => arg1 != null && arg2 != null && ... && arg10 != null); sqlContext.Sql("SELECT * FROM MyTable where MyFilter(columnName1, columnName2, ..., columnName10)");
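
Registering a UDF and calling it from SQL follows the RegisterFunction examples above. A short sketch; the SparkContext, DataFrame, table, and column names are assumed to exist.

```csharp
using Microsoft.Spark.CSharp.Sql;

// `sparkContext` and `df` are assumed to exist.
var sqlContext = SqlContext.GetOrCreate(sparkContext);

sqlContext.RegisterFunction<bool, string>("NotNull", s => s != null);
sqlContext.RegisterDataFrameAsTable(df, "people");

var named = sqlContext.Sql("SELECT name FROM people WHERE NotNull(name)");
named.Show();
```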

###Microsoft.Spark.CSharp.Sql.DataType ####Summary

        The base type of all Spark SQL data types.

####Methods

NameDescription
ParseDataTypeFromJsonParses a Json string to construct a DataType.
ParseDataTypeFromJsonParses a JToken object to construct a DataType.

###Microsoft.Spark.CSharp.Sql.AtomicType ####Summary

        An internal type used to represent a simple type. 

###Microsoft.Spark.CSharp.Sql.ComplexType ####Summary

        An internal type used to represent a complex type (such as arrays, structs, and maps).

####Methods

NameDescription
FromJsonAbstract method that constructs a complex type from a Json object
FromJsonConstructs a complex type from a Json string

###Microsoft.Spark.CSharp.Sql.NullType ####Summary

        The data type representing NULL values.

###Microsoft.Spark.CSharp.Sql.StringType ####Summary

        The data type representing String values.

###Microsoft.Spark.CSharp.Sql.BinaryType ####Summary

        The data type representing binary values.

###Microsoft.Spark.CSharp.Sql.BooleanType ####Summary

        The data type representing Boolean values.

###Microsoft.Spark.CSharp.Sql.DateType ####Summary

        The data type representing Date values.

###Microsoft.Spark.CSharp.Sql.TimestampType ####Summary

        The data type representing Timestamp values. 

###Microsoft.Spark.CSharp.Sql.DoubleType ####Summary

        The data type representing Double values.

###Microsoft.Spark.CSharp.Sql.FloatType ####Summary

        The data type representing Float values.

###Microsoft.Spark.CSharp.Sql.ByteType ####Summary

        The data type representing Byte values.

###Microsoft.Spark.CSharp.Sql.IntegerType ####Summary

        The data type representing Int values.

###Microsoft.Spark.CSharp.Sql.LongType ####Summary

        The data type representing Long values.

###Microsoft.Spark.CSharp.Sql.ShortType ####Summary

        The data type representing Short values.

###Microsoft.Spark.CSharp.Sql.DecimalType ####Summary

        The data type representing Decimal values.

####Methods

NameDescription
FromJsonConstructs a DecimalType from a Json object

###Microsoft.Spark.CSharp.Sql.ArrayType ####Summary

        The data type for collections of multiple values. 

####Methods

NameDescription
FromJsonConstructs an ArrayType from a Json object

###Microsoft.Spark.CSharp.Sql.MapType ####Summary

        The data type for Maps. Not implemented yet.

####Methods

NameDescription
FromJsonConstructs a MapType from a Json object. Not implemented yet.

###Microsoft.Spark.CSharp.Sql.StructField ####Summary

        A field inside a StructType.

####Methods

NameDescription
FromJsonConstructs a StructField from a Json object

###Microsoft.Spark.CSharp.Sql.StructType ####Summary

        Struct type, consisting of a list of StructFields.
        This is the data type representing a Row.

####Methods

NameDescription
FromJsonConstructs a StructType from a Json object

###Microsoft.Spark.CSharp.Sql.UdfRegistration ####Summary

        Used for registering User Defined Functions. SparkSession.Udf is used to access instance of this type.

####Methods

NameDescription
RegisterFunction``1Register UDF with no input argument, e.g: SqlContext.RegisterFunction<bool>("MyFilter", () => true); sqlContext.Sql("SELECT * FROM MyTable where MyFilter()");
RegisterFunction``2Register UDF with 1 input argument, e.g: SqlContext.RegisterFunction<bool, string>("MyFilter", (arg1) => arg1 != null); sqlContext.Sql("SELECT * FROM MyTable where MyFilter(columnName1)");
RegisterFunction``3Register UDF with 2 input arguments, e.g: SqlContext.RegisterFunction<bool, string, string>("MyFilter", (arg1, arg2) => arg1 != null && arg2 != null); sqlContext.Sql("SELECT * FROM MyTable where MyFilter(columnName1, columnName2)");
RegisterFunction``4Register UDF with 3 input arguments, e.g: SqlContext.RegisterFunction<bool, string, string, string>("MyFilter", (arg1, arg2, arg3) => arg1 != null && arg2 != null && arg3 != null); sqlContext.Sql("SELECT * FROM MyTable where MyFilter(columnName1, columnName2, columnName3)");
RegisterFunction``5Register UDF with 4 input arguments, e.g: SqlContext.RegisterFunction<bool, string, string, ..., string>("MyFilter", (arg1, arg2, ..., arg4) => arg1 != null && arg2 != null && ... && arg4 != null); sqlContext.Sql("SELECT * FROM MyTable where MyFilter(columnName1, columnName2, ..., columnName4)");
RegisterFunction``6Register UDF with 5 input arguments, e.g: SqlContext.RegisterFunction<bool, string, string, ..., string>("MyFilter", (arg1, arg2, ..., arg5) => arg1 != null && arg2 != null && ... && arg5 != null); sqlContext.Sql("SELECT * FROM MyTable where MyFilter(columnName1, columnName2, ..., columnName5)");
RegisterFunction``7Register UDF with 6 input arguments, e.g: SqlContext.RegisterFunction<bool, string, string, ..., string>("MyFilter", (arg1, arg2, ..., arg6) => arg1 != null && arg2 != null && ... && arg6 != null); sqlContext.Sql("SELECT * FROM MyTable where MyFilter(columnName1, columnName2, ..., columnName6)");
RegisterFunction``8Register UDF with 7 input arguments, e.g: SqlContext.RegisterFunction<bool, string, string, ..., string>("MyFilter", (arg1, arg2, ..., arg7) => arg1 != null && arg2 != null && ... && arg7 != null); sqlContext.Sql("SELECT * FROM MyTable where MyFilter(columnName1, columnName2, ..., columnName7)");
RegisterFunction``9Register UDF with 8 input arguments, e.g: SqlContext.RegisterFunction<bool, string, string, ..., string>("MyFilter", (arg1, arg2, ..., arg8) => arg1 != null && arg2 != null && ... && arg8 != null); sqlContext.Sql("SELECT * FROM MyTable where MyFilter(columnName1, columnName2, ..., columnName8)");
RegisterFunction``10Register UDF with 9 input arguments, e.g: SqlContext.RegisterFunction<bool, string, string, ..., string>("MyFilter", (arg1, arg2, ..., arg9) => arg1 != null && arg2 != null && ... && arg9 != null); sqlContext.Sql("SELECT * FROM MyTable where MyFilter(columnName1, columnName2, ..., columnName9)");
RegisterFunction``11Register UDF with 10 input arguments, e.g: SqlContext.RegisterFunction<bool, string, string, ..., string>("MyFilter", (arg1, arg2, ..., arg10) => arg1 != null && arg2 != null && ... && arg10 != null); sqlContext.Sql("SELECT * FROM MyTable where MyFilter(columnName1, columnName2, ..., columnName10)");
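
The same registration pattern is reachable through SparkSession.Udf, as the summary above notes. A one-line sketch; the function body and table name are illustrative.

```csharp
// `spark` is an existing SparkSession with a "people" temp table (assumption).
spark.Udf.RegisterFunction<int, string>("StrLen", s => s == null ? 0 : s.Length);
var lengths = spark.Sql("SELECT name, StrLen(name) AS len FROM people");
```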

###Microsoft.Spark.CSharp.Streaming.ConstantInputDStream`1 ####Summary

        An input stream that always returns the same RDD on each timestep. Useful for testing.

####Methods

NameDescription

###Microsoft.Spark.CSharp.Streaming.DStream`1 ####Summary

        A Discretized Stream (DStream), the basic abstraction in Spark Streaming,
        is a continuous sequence of RDDs (of the same type) representing a
        continuous stream of data (see the Spark core documentation for more
        details on RDDs).
        
        DStreams can either be created from live data (such as data from TCP
        sockets, Kafka, Flume, etc.) using a StreamingContext, or they can be
        generated by transforming existing DStreams using operations such as
        `Map`, `Window` and `ReduceByKeyAndWindow`. While a Spark Streaming
        program is running, each DStream periodically generates an RDD, either
        from live data or by transforming the RDD generated by a parent DStream.
        
        Internally, a DStream is characterized by a few basic properties:
         - A list of other DStreams that the DStream depends on
         - A time interval at which the DStream generates an RDD
         - A function that is used to generate an RDD after each time interval

####Methods

NameDescription
CountReturn a new DStream in which each RDD has a single element generated by counting each RDD of this DStream.
FilterReturn a new DStream containing only the elements that satisfy predicate.
FlatMap``1Return a new DStream by applying a function to all elements of this DStream, and then flattening the results
Map``1Return a new DStream by applying a function to each element of DStream.
MapPartitions``1Return a new DStream in which each RDD is generated by applying mapPartitions() to each RDDs of this DStream.
MapPartitionsWithIndex``1Return a new DStream in which each RDD is generated by applying mapPartitionsWithIndex() to each RDDs of this DStream.
ReduceReturn a new DStream in which each RDD has a single element generated by reducing each RDD of this DStream.
ForeachRDDApply a function to each RDD in this DStream.
ForeachRDDApply a function to each RDD in this DStream.
PrintPrints the first num elements of each RDD generated in this DStream. @param num: the number of elements to print from the beginning of each RDD.
GlomReturn a new DStream in which RDD is generated by applying glom() to RDD of this DStream.
CachePersist the RDDs of this DStream with the default storage level.
PersistPersist the RDDs of this DStream with the given storage level
CheckpointEnable periodic checkpointing of RDDs of this DStream
CountByValueReturn a new DStream in which each RDD contains the counts of each distinct value in each RDD of this DStream.
SaveAsTextFilesSave each RDD in this DStream as text file, using string representation of elements.
Transform``1Return a new DStream in which each RDD is generated by applying a function on each RDD of this DStream. `func` can have one argument of `rdd`, or have two arguments of (`time`, `rdd`)
Transform``1Return a new DStream in which each RDD is generated by applying a function on each RDD of this DStream. `func` can have one argument of `rdd`, or have two arguments of (`time`, `rdd`)
TransformWith``2Return a new DStream in which each RDD is generated by applying a function on each RDD of this DStream and 'other' DStream. `func` can have two arguments of (`rdd_a`, `rdd_b`) or have three arguments of (`time`, `rdd_a`, `rdd_b`)
TransformWith``2Return a new DStream in which each RDD is generated by applying a function on each RDD of this DStream and 'other' DStream. `func` can have two arguments of (`rdd_a`, `rdd_b`) or have three arguments of (`time`, `rdd_a`, `rdd_b`)
RepartitionReturn a new DStream with an increased or decreased level of parallelism.
UnionReturn a new DStream by unifying data of another DStream with this DStream. @param other: Another DStream having the same interval (i.e., slideDuration) as this DStream.
SliceReturn all the RDDs between 'fromTime' to 'toTime' (both included)
WindowReturn a new DStream in which each RDD contains all the elements in seen in a sliding window of time over this DStream. @param windowDuration: width of the window; must be a multiple of this DStream's batching interval @param slideDuration: sliding interval of the window (i.e., the interval after which the new DStream will generate RDDs); must be a multiple of this DStream's batching interval
ReduceByWindowReturn a new DStream in which each RDD has a single element generated by reducing all elements in a sliding window over this DStream. if `invReduceFunc` is not None, the reduction is done incrementally using the old window's reduced value : 1. reduce the new values that entered the window (e.g., adding new counts) 2. "inverse reduce" the old values that left the window (e.g., subtracting old counts) This is more efficient than `invReduceFunc` is None.
CountByWindowReturn a new DStream in which each RDD has a single element generated by counting the number of elements in a window over this DStream. windowDuration and slideDuration are as defined in the window() operation. This is equivalent to window(windowDuration, slideDuration).count(), but will be more efficient if window is large.
CountByValueAndWindowReturn a new DStream in which each RDD contains the count of distinct elements in RDDs in a sliding window over this DStream.
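
To make the table above concrete, here is a minimal sketch chaining a few of these transformations; `lines` is assumed to be a `DStream<string>` obtained from an input source (for example `SocketTextStream`), and the argument values are illustrative.

```csharp
// lines: an assumed DStream<string> obtained from an input source elsewhere.
var words = lines.FlatMap(l => l.Split(' '));     // flatten each line into words
var longWords = words.Filter(w => w.Length > 3);  // keep only words longer than 3 chars
var counts = longWords.Count();                   // each RDD reduced to a single count
counts.Print(10);                                 // print up to 10 elements of each RDD

// Apply an arbitrary action to every RDD generated by the DStream.
longWords.ForeachRDD(rdd =>
{
    // e.g., save the batch, collect it, or push it to an external system
});
```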

###Microsoft.Spark.CSharp.Streaming.EventHubsUtils ####Summary

        Utility for creating streams from Microsoft Azure EventHubs.

####Methods

| Name | Description |
| ---- | ----------- |
| CreateUnionStream | Create a unioned EventHubs stream that receives data from Microsoft Azure EventHubs. The unioned stream will receive messages from all partitions of the EventHubs. |

###Microsoft.Spark.CSharp.Streaming.KafkaUtils ####Summary

        Utils for Kafka input stream.

####Methods

| Name | Description |
| ---- | ----------- |
| CreateStream | Create an input stream that pulls messages from a Kafka broker. |
| CreateStream | Create an input stream that pulls messages from a Kafka broker. |
| CreateDirectStream | Create an input stream that directly pulls messages from a Kafka broker at specific offsets. This is not a receiver-based Kafka input stream; it pulls messages from Kafka directly in each batch duration and processes them without storing them. Zookeeper is not used to store offsets; the consumed offsets are tracked by the stream itself. For interoperability with Kafka monitoring tools that depend on Zookeeper, you have to update Kafka/Zookeeper yourself from the streaming application. You can access the offsets used in each batch from the generated RDDs (see [[org.apache.spark.streaming.kafka.HasOffsetRanges]]). To recover from driver failures, you have to enable checkpointing in the StreamingContext; the information on consumed offsets can then be recovered from the checkpoint. See the programming guide for details (constraints, etc.). |
| CreateDirectStream``1 | Create an input stream that directly pulls messages from a Kafka broker at specific offsets. This is not a receiver-based Kafka input stream; it pulls messages from Kafka directly in each batch duration and processes them without storing them. Zookeeper is not used to store offsets; the consumed offsets are tracked by the stream itself. For interoperability with Kafka monitoring tools that depend on Zookeeper, you have to update Kafka/Zookeeper yourself from the streaming application. You can access the offsets used in each batch from the generated RDDs (see [[org.apache.spark.streaming.kafka.HasOffsetRanges]]). To recover from driver failures, you have to enable checkpointing in the StreamingContext; the information on consumed offsets can then be recovered from the checkpoint. See the programming guide for details (constraints, etc.). |
| GetOffsetRange | Create offset ranges from Kafka messages when CSharpReader is enabled. |
| GetNumPartitionsFromConfig | `topics` should contain only one topic if choosing to repartition to a configured numPartitions. TODO: move to Scala and merge into DynamicPartitionKafkaRDD.getPartitions to remove the above limitation. |
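
As a rough illustration of the direct approach described above, the sketch below assumes the shape of the parameters (a topic list, a Kafka parameter dictionary, and a starting-offset dictionary); these shapes are assumptions rather than verified Mobius signatures.

```csharp
// Hedged sketch: parameter types below are assumptions based on the description
// above, not verified signatures.
var topics = new List<string> { "myTopic" };
var kafkaParams = new Dictionary<string, string>
{
    { "metadata.broker.list", "broker1:9092,broker2:9092" }
};
var fromOffsets = new Dictionary<string, long>();   // empty: start from default offsets

var directStream = KafkaUtils.CreateDirectStream(ssc, topics, kafkaParams, fromOffsets);
directStream.ForeachRDD(rdd =>
{
    // Offsets consumed in this batch can be inspected via GetOffsetRange
    // when CSharpReader is enabled (see the table above).
});
```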

###Microsoft.Spark.CSharp.Streaming.OffsetRange ####Summary

        Kafka offset range

####Methods

| Name | Description |
| ---- | ----------- |
| ToString | OffsetRange string format. |

###Microsoft.Spark.CSharp.Streaming.MapWithStateDStream`4 ####Summary

        DStream representing the stream of data generated by the `mapWithState` operation on a pair DStream.
        Additionally, it gives access to the stream of state snapshots, that is, the state data of all keys after a batch has updated them.
        
        Type of the key
        Type of the value
        Type of the state data
        Type of the mapped data

####Methods

| Name | Description |
| ---- | ----------- |
| StateSnapshots | Return a pair DStream where each RDD is the snapshot of the state of all the keys. |

###Microsoft.Spark.CSharp.Streaming.KeyedState`1 ####Summary

        Class to hold a state instance and the timestamp when the state is updated or created.
        No need to explicitly make this class clonable, since the serialization and deserialization in Worker is already a kind of clone mechanism. 
        
        Type of the state data

###Microsoft.Spark.CSharp.Streaming.MapWithStateRDDRecord`3 ####Summary

        Record storing the keyed state of MapWithStateRDD. 
        Each record contains a stateMap and a sequence of records returned by the mapping function of MapWithState.
        Note: there is no need to explicitly make this class clonable, since the serialization and deserialization in Worker is already a kind of clone mechanism. 
        
        Type of the key
        Type of the state data
        Type of the mapped data

###Microsoft.Spark.CSharp.Streaming.StateSpec`4 ####Summary

        Represents all the specifications of the DStream `mapWithState` operation.
        
        Type of the key
        Type of the value
        Type of the state data
        Type of the mapped data

####Methods

| Name | Description |
| ---- | ----------- |
| NumPartitions | Set the number of partitions by which the state RDDs generated by `mapWithState` will be partitioned. Hash partitioning will be used. |
| Timeout | Set the duration after which the state of an idle key will be removed. A key and its state are considered idle if the key has not received any data for at least the given duration. The mapping function will be called one final time on the idle states that are going to be removed; [[org.apache.spark.streaming.State State.isTimingOut()]] is set to `true` in that call. |
| InitialState | Set the RDD containing the initial states that will be used by mapWithState. |

###Microsoft.Spark.CSharp.Streaming.State`1 ####Summary

        Class for getting and updating the state in the mapping function used in the `mapWithState` operation.
        
        Type of the state

####Methods

| Name | Description |
| ---- | ----------- |
| Exists | Returns whether the state already exists. |
| Get | Gets the state if it exists; otherwise an ArgumentException is thrown. |
| Update | Updates the state with a new value. |
| Remove | Removes the state if it exists. |
| IsTimingOut | Returns whether the state is timing out and going to be removed by the system after the current batch. |
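
To show how these accessors are typically used together with `StateSpec` and `MapWithState`, here is a hedged sketch of a per-key running count; the delegate shape `(key, value, state) => mapped` and the `StateSpec` wiring are assumptions modeled on the Scala API, while the `State` member names come from the table above.

```csharp
// Hedged sketch of a mapWithState mapping function keeping a running count per key.
// Only the State<T> member names come from the documentation above; the delegate
// shape and StateSpec wiring are assumptions.
Func<string, int, State<int>, Tuple<string, int>> mappingFunc = (word, one, state) =>
{
    if (state.IsTimingOut())
    {
        // The key has been idle and its state is about to be removed.
        return new Tuple<string, int>(word, state.Exists() ? state.Get() : 0);
    }

    var sum = one + (state.Exists() ? state.Get() : 0);
    state.Update(sum);                                // store the new running total
    return new Tuple<string, int>(word, sum);
};

// Assumed wiring (see StateSpec above and PairDStreamFunctions.MapWithState below):
// var spec = new StateSpec<string, int, int, Tuple<string, int>>(mappingFunc)
//                .NumPartitions(2)
//                .Timeout(TimeSpan.FromSeconds(30));
// var stateStream = pairs.MapWithState(spec);
// var snapshots = stateStream.StateSnapshots();      // per-batch snapshot of all keys
```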

###Microsoft.Spark.CSharp.Streaming.PairDStreamFunctions ####Summary

        Operations that are only available on DStreams of Tuple (key-value) elements.

####Methods

| Name | Description |
| ---- | ----------- |
| ReduceByKey``2 | Return a new DStream by applying ReduceByKey to each RDD. |
| CombineByKey``3 | Return a new DStream by applying combineByKey to each RDD. |
| PartitionBy``2 | Return a new DStream in which each RDD is partitioned by numPartitions. |
| MapValues``3 | Return a new DStream by applying a map function to the value of each key-value pair in this DStream without changing the key. |
| FlatMapValues``3 | Return a new DStream by applying a flatmap function to the value of each key-value pair in this DStream without changing the key. |
| GroupByKey``2 | Return a new DStream by applying groupByKey on each RDD. |
| GroupWith``3 | Return a new DStream by applying 'cogroup' between RDDs of this DStream and the `other` DStream. Hash partitioning is used to generate the RDDs with `numPartitions` partitions. |
| Join``3 | Return a new DStream by applying 'join' between RDDs of this DStream and the `other` DStream. Hash partitioning is used to generate the RDDs with `numPartitions` partitions. |
| LeftOuterJoin``3 | Return a new DStream by applying 'left outer join' between RDDs of this DStream and the `other` DStream. Hash partitioning is used to generate the RDDs with `numPartitions` partitions. |
| RightOuterJoin``3 | Return a new DStream by applying 'right outer join' between RDDs of this DStream and the `other` DStream. Hash partitioning is used to generate the RDDs with `numPartitions` partitions. |
| FullOuterJoin``3 | Return a new DStream by applying 'full outer join' between RDDs of this DStream and the `other` DStream. Hash partitioning is used to generate the RDDs with `numPartitions` partitions. |
| GroupByKeyAndWindow``2 | Return a new DStream by applying `GroupByKey` over a sliding window. Similar to `DStream.GroupByKey()`, but applies it over a sliding window. |
| ReduceByKeyAndWindow``2 | Return a new DStream by applying incremental `reduceByKey` over a sliding window. The reduced value over a new window is calculated using the old window's reduced value: 1. reduce the new values that entered the window (e.g., adding new counts); 2. "inverse reduce" the old values that left the window (e.g., subtracting old counts). `invFunc` can be None; in that case all the RDDs in the window are reduced, which can be slower than providing `invFunc`. |
| UpdateStateByKey``3 | Return a new "state" DStream where the state for each key is updated by applying the given function on the previous state of the key and the new values of the key. |
| UpdateStateByKey``3 | Return a new "state" DStream where the state for each key is updated by applying the given function on the previous state of the key and the new values of the key. |
| UpdateStateByKey``3 | Return a new "state" DStream where the state for each key is updated by applying the given function on the previous state of the key and the new values of the key. |
| MapWithState``4 | Return a new "state" DStream where the state for each key is updated by applying the given function on the previous state of the key and the new values of the key. |
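
A short, hedged sketch of the per-key and windowed operations listed above; `pairs` is assumed to be a DStream of `Tuple<string, int>` elements built earlier (for example `(word, 1)` pairs), and the duration arguments are illustrative, so check the actual overloads for their exact types and units.

```csharp
// pairs: an assumed DStream of Tuple<string, int>, e.g. (word, 1) pairs.
var counts = pairs.ReduceByKey((a, b) => a + b);      // per-batch counts per key

// Incremental windowed reduce: add counts entering the window and
// "inverse reduce" (subtract) counts leaving it. Durations are illustrative.
var windowedCounts = pairs.ReduceByKeyAndWindow(
    (a, b) => a + b,     // reduce function
    (a, b) => a - b,     // inverse reduce function
    30,                  // window duration
    10);                 // slide duration
```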

###Microsoft.Spark.CSharp.Streaming.CSharpInputDStreamUtils ####Summary

        Utils for C# input streams.

####Methods

| Name | Description |
| ---- | ----------- |
| CreateStream``1 | Create an input stream whose data injection is controlled by user C# code. |
| CreateStream``1 | Create an input stream whose data injection is controlled by user C# code. |

###Microsoft.Spark.CSharp.Streaming.StreamingContext ####Summary

        Main entry point for Spark Streaming functionality. It provides methods used to create
        [[org.apache.spark.streaming.dstream.DStream]]s from various input sources. It can be
        created either by providing a Spark master URL and an appName, or from an
        org.apache.spark.SparkConf configuration (see core Spark documentation), or from an
        existing org.apache.spark.SparkContext. The associated SparkContext can be accessed
        using `context.sparkContext`. After creating and transforming DStreams, the streaming
        computation can be started and stopped using `context.start()` and `context.stop()`,
        respectively. `context.awaitTermination()` allows the current thread to wait for the
        termination of the context by `stop()` or by an exception.

####Methods

| Name | Description |
| ---- | ----------- |
| GetOrCreate | Either recreate a StreamingContext from checkpoint data or create a new StreamingContext. If checkpoint data exists in the provided `checkpointPath`, the StreamingContext will be recreated from the checkpoint data. If the data does not exist, the provided setupFunc will be used to create a new StreamingContext. |
| Start | Start the execution of the streams. |
| Stop | Stop the execution of the streams. |
| Remember | Set each DStream in this context to remember the RDDs it generated in the last given duration. DStreams remember RDDs only for a limited duration of time and release them for garbage collection. This method allows the developer to specify how long to remember the RDDs (if the developer wishes to query old data outside the DStream computation). |
| Checkpoint | Set the context to periodically checkpoint the DStream operations for driver fault-tolerance. |
| SocketTextStream | Create an input stream from a TCP source hostname:port. Data is received using a TCP socket and the received bytes are interpreted as UTF8-encoded, `\n`-delimited lines. |
| TextFileStream | Create an input stream that monitors a Hadoop-compatible file system for new files and reads them as text files. Files must be written to the monitored directory by "moving" them from another location within the same file system. File names starting with . are ignored. |
| AwaitTermination | Wait for the execution to stop. |
| AwaitTerminationOrTimeout | Wait for the execution to stop, or for the specified timeout to elapse. |
| Transform``1 | Create a new DStream in which each RDD is generated by applying a function on the RDDs of the DStreams. The order of the RDDs in the transform function parameter will be the same as the order of the corresponding DStreams in the list. |
| Union``1 | Create a unified DStream from multiple DStreams of the same type and same slide duration. |
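
Putting the pieces together, here is a minimal, hedged end-to-end sketch; `sparkContext`, `checkpointPath`, the batch-interval value and units, and the omitted optional arguments (such as storage levels) are assumptions for illustration.

```csharp
// Hedged end-to-end sketch; adjust values and overloads to the actual API.
var ssc = StreamingContext.GetOrCreate(checkpointPath, () =>
{
    var context = new StreamingContext(sparkContext, 2000L);   // batch interval (assumed ms)
    context.Checkpoint(checkpointPath);                        // enable driver fault-tolerance

    var lines = context.SocketTextStream("localhost", 9999);   // \n-delimited UTF8 lines
    lines.Print(10);

    return context;
});

ssc.Start();             // start the streaming computation
ssc.AwaitTermination();  // block until stopped or an exception occurs
```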

###Microsoft.Spark.CSharp.Streaming.TransformedDStream`1 ####Summary

        TransformedDStream is a DStream generated by a C# function
        transforming each RDD of a DStream into another RDD.
        
        Multiple continuous transformations of a DStream can be combined into
        one transformation.
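
Since transformed DStreams are produced by the `Transform` and `TransformWith` operations listed under DStream above, a single hedged line illustrates the idea; `numbers` is an assumed `DStream<int>` and the RDD operation inside the lambda is illustrative.

```csharp
// Hedged sketch: apply an arbitrary RDD-to-RDD function to every batch.
// "numbers" is an assumed DStream<int>.
var doubled = numbers.Transform<int>(rdd => rdd.Map(x => x * 2));
```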