**ML in Spark**
---

### Spark Context
SparkContext is the entry point that connects your application to a Spark cluster. You need a SparkContext before you can run any operation.

### Create a SparkContext

In [1]:
# First of all, you need to initiate a SparkContext.

import pyspark
from pyspark import SparkContext
sc = SparkContext()

Now that the SparkContext is ready, you can create a collection of data called an RDD (Resilient Distributed Dataset). Computation on an RDD is automatically parallelized across the cluster.

In [2]:
nums = sc.parallelize([1, 2, 3, 4])

# take(1) returns a list containing the first element
nums.take(1)
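
To make "automatically parallelized" more concrete, here is a plain-Python analogy that does not use Spark at all. It splits a list into partitions, applies a function to each partition in a separate worker thread, and concatenates the results. The helper names `split_into_partitions` and `parallel_map` are hypothetical, and this is only a rough mental model, not how Spark is implemented.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Split a list into contiguous chunks (hypothetical helper, not a Spark API)."""
    size, extra = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        # The first `extra` partitions receive one additional element.
        end = start + size + (1 if i < extra else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


def parallel_map(func, data, num_partitions=2):
    """Apply func to every element, one worker thread per partition."""
    partitions = split_into_partitions(data, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        # Each worker processes one whole partition independently.
        mapped = list(pool.map(lambda part: [func(x) for x in part], partitions))
    # Concatenate the per-partition results in partition order.
    return [x for part in mapped for x in part]


squares = parallel_map(lambda x: x * x, [1, 2, 3, 4])
print(squares)
```

In Spark the partitions live on different machines rather than in threads, but the idea of working on each partition independently is the same.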

You can apply a transformation to the data with a lambda function. The PySpark example below uses a map transformation to return the square of each element of nums.

In [4]:
squared = nums.map(lambda x: x*x).collect()
for num in squared:
    print('%i ' % (num))

1 
4 
9 
16 
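
Transformations can also be chained before the result is collected. The sketch below wraps one such chain in a function; the name `even_squares` is hypothetical, and the function assumes it receives an RDD of integers such as `nums`.

```python
def even_squares(rdd):
    """Return the squares of the even elements of an RDD of integers."""
    # filter() and map() are transformations: they only describe the computation.
    evens = rdd.filter(lambda x: x % 2 == 0)
    squared = evens.map(lambda x: x * x)
    # collect() is an action: it runs the job and returns a Python list.
    return squared.collect()
```

With the `nums` RDD created earlier, `even_squares(nums)` should return `[4, 16]`.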


### SQLContext

A more convenient way to work with data is the DataFrame. Because the SparkContext is already set up, you can use it to create a DataFrame. You also need to declare a SQLContext.

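SQLContext connects the engine to different data sources and initializes the functionality of Spark SQL. In Spark 2.0 and later, SparkSession is the recommended entry point, and SQLContext remains available for backward compatibility.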

In [5]:
from pyspark.sql import Row
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)



In [7]:
# Now let's create a list of tuples.
# Each tuple contains a person's name and age. Four steps are required:

In [8]:
# Create the list of tuples with the information
list_p = [('John',19),('Smith',29),('Adam',35),('Henry',50)]

In [10]:
# Build the RDD
rdd = sc.parallelize(list_p)

In [11]:
# Convert each tuple to a Row
ppl = rdd.map(lambda x: Row(name=x[0], age=int(x[1])))

In [12]:
# Create the DataFrame from the RDD of Rows
DF_ppl = sqlContext.createDataFrame(ppl)

In [13]:
# To inspect the type of each column, use printSchema()

DF_ppl.printSchema()

root
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)



In [14]:
DF_ppl.show(3)

+-----+---+
| name|age|
+-----+---+
| John| 19|
|Smith| 29|
| Adam| 35|
+-----+---+
only showing top 3 rows
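
The four steps above can often be shortened, because `createDataFrame` also accepts a plain Python list together with a list of column names. The sketch below assumes the object passed in is a SQLContext or SparkSession; the function name `people_to_dataframe` is hypothetical.

```python
def people_to_dataframe(sql_entry_point, people):
    """Build a two-column DataFrame from a list of (name, age) tuples.

    `sql_entry_point` is assumed to be a SQLContext or SparkSession;
    both expose createDataFrame(data, schema).
    """
    # Passing column names as the schema lets Spark infer the column types
    # from the data, so no intermediate RDD of Row objects is needed.
    return sql_entry_point.createDataFrame(people, schema=["name", "age"])
```

With the objects defined earlier, `people_to_dataframe(sqlContext, list_p).show(3)` should display the same three rows.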



# Machine Learning Example with PySpark
---

Now that you have a brief idea of Spark and SQLContext, you are ready to build your first Machine Learning program using Spark.

Following are the steps to build a Machine Learning program with PySpark:

1. Basic operations with PySpark
2. Data preprocessing
3. Build a data processing pipeline
4. Build the classifier: logistic regression
5. Train and evaluate the model
6. Tune the hyperparameters

Note that the dataset used here is small, so the computation may seem slow for its size. Spark is designed to process large amounts of data, and its performance relative to other machine learning libraries improves as the dataset grows.