
## Apache Spark

Apache Spark is a fast, general-purpose, open source cluster computing framework. Its core engine handles distributed task dispatching, scheduling, and related services, and it exposes high-level APIs in Scala, Java, Python, and R.

The APIs provided are centered on Spark's native data structure, the Resilient Distributed Dataset (RDD). RDDs are immutable data structures that are inherently fault tolerant. RDDs will be described in detail in later sections.

Spark requires a cluster manager and a distributed storage system to implement its core functionalities. Typically, Spark relies on HDFS and YARN from Hadoop 2.x for distributed storage and cluster management. Spark can also run in local mode on a single machine (for example, with just one thread). In addition, it can interface with Cassandra or Amazon S3 for storage and with Apache Mesos for cluster management.

PySpark, Spark's Python API, is used to program Spark in this guide.

## Installation

### Windows

1. Install the latest version of Java.
2. Set the JAVA_HOME environment variable to point to the Java directory.

   For example, your path could be: C:\Program Files\Java\jdk1.8.0_91
3. Download Hadoop from [here](http://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-2.7.1/).
4. Set the HADOOP_HOME environment variable to point to the Hadoop directory.
5. Download *winutils.exe* from [here](https://github.com/steveloughran/winutils/tree/master/hadoop-2.7.1/bin).
6. Copy winutils.exe to the bin folder in the Hadoop directory (HADOOP_HOME/bin).
7. Download Apache Spark pre-built for your Hadoop version from [here](http://spark.apache.org/downloads.html).
8. Set the SPARK_HOME environment variable to point to the Spark directory.
9. This is a good time to restart your computer.

##### Running Spark
10. Open the command prompt, navigate to SPARK_HOME/bin, and run `pyspark` (on Windows this launches `pyspark.cmd`).
    This starts the PySpark shell, and the command prompt should look as shown below.
    
    ![Here](https://github.com/pbskumar/spark/blob/master/images/pyspark_init.JPG?raw=true)
    
    
    
11. Instead of navigating to SPARK_HOME/bin every time, it is a good idea to add SPARK_HOME/bin to the PATH environment variable. This way, pyspark can be run from any folder.
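
If PySpark fails to start, a quick check of the environment variables set in the steps above can help narrow down the cause. The following is a minimal sketch using only the standard library. The function name `check_spark_setup` is made up for this example and is not part of Spark.

```python
import os


def check_spark_setup(env=None):
    """Return a list of problems found with the Spark-related environment variables."""
    env = os.environ if env is None else env
    problems = []

    # JAVA_HOME, HADOOP_HOME and SPARK_HOME should each point to an existing directory.
    for name in ("JAVA_HOME", "HADOOP_HOME", "SPARK_HOME"):
        path = env.get(name)
        if not path:
            problems.append(name + " is not set")
        elif not os.path.isdir(path):
            problems.append(name + " does not point to a directory: " + path)

    # winutils.exe is expected in HADOOP_HOME/bin (step 6).
    hadoop_home = env.get("HADOOP_HOME")
    if hadoop_home and os.path.isdir(hadoop_home):
        winutils = os.path.join(hadoop_home, "bin", "winutils.exe")
        if not os.path.isfile(winutils):
            problems.append("winutils.exe not found at " + winutils)
    return problems


if __name__ == "__main__":
    issues = check_spark_setup()
    print("Setup looks fine." if not issues else "\n".join(issues))
```

An empty list means all three variables point to existing directories and winutils.exe is in place.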



### Linux (Ubuntu)

1. Install OpenSSH for remote access.
   
   `sudo apt-get install openssh-server`
2. Create a folder named *software* and navigate to it.
   
   `mkdir ~/software`
   
   `cd ~/software`
3. Create a backup of the `.bashrc` file.

   `cp ~/.bashrc ~/software/bashrc_original`
4. Download Java.

   `wget --no-check-certificate --no-cookies --header "Cookie: oraclelicense=accept-securebackup-cookie" http://download.oracle.com/otn-pub/java/jdk/8u92-b14/jdk-8u92-linux-x64.tar.gz`
5. Download Hadoop 2 from [here](http://www-eu.apache.org/dist/hadoop/common/) or with the following command:

    `wget http://www-eu.apache.org/dist/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz`
6. Download Spark
    
    
    
    
To be updated soon.

[Reference](https://districtdatalabs.silvrback.com/getting-started-with-spark-in-python)

### Initializing Spark

#### Windows

1. As mentioned earlier, we can use the interactive *pyspark* shell directly.
2. Alternatively, we can use Jupyter Notebook or an IDE such as PyCharm.
   The script to initialize Spark is given below:
   ```python
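   import sys

   import findspark
   findspark.init()  # locate Spark (e.g. via SPARK_HOME) and add PySpark to sys.path

   try:
       from pyspark import SparkContext
       from pyspark import SparkConf

   except ImportError as e:
       print("Error: ", e)
       sys.exit(1)

   # Run locally with a single worker thread under the application name "spark_wc".
   conf = SparkConf()
   conf.setMaster("local")
   conf.setAppName("spark_wc")

   sc = SparkContext(conf=conf)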
   ```
   Note: If `findspark` is not installed, install it with the following command:

   `pip install findspark`
   
3. Alternatively, Spark can be initialized without `findspark` by adding its Python sources to the path manually:
    ```python
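    import glob
    import os
    import sys

    # Make the PySpark sources bundled with Spark importable.
    spark_home_folder = os.environ['SPARK_HOME']
    spark_python = os.path.join(spark_home_folder, 'python')
    sys.path.append(spark_python)
    # PySpark depends on py4j, which ships as a zip archive in SPARK_HOME/python/lib.
    sys.path.extend(glob.glob(os.path.join(spark_python, 'lib', 'py4j-*.zip')))

    try:
        from pyspark import SparkContext
        from pyspark import SparkConf

    except ImportError as e:
        print("Error: ", e)
        sys.exit(1)

    conf = SparkConf()
    conf.setMaster("local")
    conf.setAppName("spark_wc")
    sc = SparkContext(conf=conf)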
    ```
    
4. Start a Jupyter notebook using the following command:

    `jupyter notebook <path/to/directory>`

    Then run either of the scripts given above in a notebook cell.
    
5. In IDEs, write your code after the initialization script.

#### Linux (Ubuntu)

To be updated soon