## Requirements
1. Apache Spark binary (https://spark.apache.org/)
2. For Windows: winutils (see the installation guide at https://medium.com/@dvainrub/how-to-install-apache-spark-2-x-in-your-pc-e2047246ffc3)
3. Environment variables ```JAVA_HOME```, ```SPARK_HOME```, and ```HADOOP_HOME``` set to the corresponding installation directories
4. Python 3.x (from Anaconda distribution)
5. ```findspark``` (https://pypi.org/project/findspark/)
6. Jupyter Notebook (available from Anaconda installation)
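
Before starting the notebook, it can help to confirm that the environment variables from item 3 are visible to Python. The following is a minimal sketch using only the standard library; the helper name `missing_env_vars` is hypothetical and not part of Spark or findspark.

```python
import os

# Variables listed in the requirements above.
REQUIRED_VARS = ("JAVA_HOME", "SPARK_HOME", "HADOOP_HOME")


def missing_env_vars(env=None, required=REQUIRED_VARS):
    """Return the names in `required` that are unset or empty in `env`.

    `env` defaults to the current process environment (os.environ).
    """
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Not set:", ", ".join(missing))
else:
    print("All required variables are set.")
```

If a variable is reported as not set, restart the terminal or Jupyter after defining it, since a running process does not pick up changes to the system environment.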

### References
Spark SQL, DataFrames and Datasets Guide (Spark 2.3.3): https://spark.apache.org/docs/2.3.3/sql-programming-guide.html

## Spark Initialization

In [1]:
# findspark locates the Spark installation (via SPARK_HOME) and adds pyspark to sys.path
import findspark

findspark.init()

In [3]:
# Import required library
from pyspark.sql import SparkSession

# Create Spark Session
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .getOrCreate()

print(spark)

<pyspark.sql.session.SparkSession object at 0x000001E5CD818F60>


## Loading Data using Spark

In [5]:
df = spark.read.json("D:/spark-2.3.1-bin-hadoop2.7/examples/src/main/resources/people.json")

In [6]:
df.show()

+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+
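
By default `spark.read.json` expects one JSON object per line (JSON Lines), not a single JSON array. The sketch below imitates that with the standard library to show why Michael's `age` is `null`: his record has no `age` key. The sample text is an assumption about the contents of `people.json`, reconstructed from the output above, and the alphabetical column order simply mirrors that output.

```python
import json

# Assumed contents of people.json: one JSON object per line.
sample = """{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}"""

# Parse each non-empty line as an independent JSON document.
records = [json.loads(line) for line in sample.splitlines() if line.strip()]

# Union of all keys, sorted alphabetically to match the column order shown above.
columns = sorted({key for record in records for key in record})

# Keys missing from a record become None (displayed as null by Spark).
rows = [[record.get(column) for column in columns] for record in records]

print(columns)
for row in rows:
    print(row)
```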



In [8]:
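# header=True takes column names from the first row;
# inferSchema=True makes an extra pass over the data to guess column types
df2 = spark.read.csv("D:/Documents/dataset/Uber-Jan-Feb-FOIL.csv", header=True, inferSchema=True)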

In [9]:
df2.show()

+-----------------------+--------+---------------+-----+
|dispatching_base_number|    date|active_vehicles|trips|
+-----------------------+--------+---------------+-----+
|                 B02512|1/1/2015|            190| 1132|
|                 B02765|1/1/2015|            225| 1765|
|                 B02764|1/1/2015|           3427|29421|
|                 B02682|1/1/2015|            945| 7679|
|                 B02617|1/1/2015|           1228| 9537|
|                 B02598|1/1/2015|            870| 6903|
|                 B02598|1/2/2015|            785| 4768|
|                 B02617|1/2/2015|           1137| 7065|
|                 B02512|1/2/2015|            175|  875|
|                 B02682|1/2/2015|            890| 5506|
|                 B02765|1/2/2015|            196| 1001|
|                 B02764|1/2/2015|           3147|19974|
|                 B02765|1/3/2015|            201| 1526|
|                 B02617|1/3/2015|           1188|10664|
|                 B02598|1/3/20

In [11]:
df2.schema

StructType(List(StructField(dispatching_base_number,StringType,true),StructField(date,StringType,true),StructField(active_vehicles,IntegerType,true),StructField(trips,IntegerType,true)))
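
The schema shows that `date` was inferred as `StringType`, so sorting or filtering on it compares text rather than calendar dates. The sketch below illustrates the difference with the standard library, assuming the values follow a month/day/year pattern as the sample rows suggest. The helper `parse_trip_date` and the value `1/10/2015` are illustrative only. In Spark itself, a similar conversion is typically done with `pyspark.sql.functions.to_date` and a matching format string.

```python
from datetime import date, datetime


def parse_trip_date(text):
    """Parse a month/day/year string such as '1/2/2015' into a date."""
    return datetime.strptime(text, "%m/%d/%Y").date()


raw = ["1/10/2015", "1/2/2015"]

# Sorting the raw strings compares characters, so '1/10' sorts before '1/2'.
sorted_as_text = sorted(raw)

# Sorting parsed dates gives chronological order.
sorted_as_dates = sorted(parse_trip_date(value) for value in raw)

print(sorted_as_text)
print(sorted_as_dates)
```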

In [16]:
# inferSchema=True requires an extra pass over the file, which can be slow on large datasets
df3 = spark.read.csv("D:/Documents/dataset/Crimes_-_2001_to_present.csv", header=True, inferSchema=True)

In [17]:
df3.count()

6814395

In [18]:
df3.schema

StructType(List(StructField(ID,IntegerType,true),StructField(Case Number,StringType,true),StructField(Date,StringType,true),StructField(Block,StringType,true),StructField(IUCR,StringType,true),StructField(Primary Type,StringType,true),StructField(Description,StringType,true),StructField(Location Description,StringType,true),StructField(Arrest,BooleanType,true),StructField(Domestic,BooleanType,true),StructField(Beat,IntegerType,true),StructField(District,IntegerType,true),StructField(Ward,IntegerType,true),StructField(Community Area,IntegerType,true),StructField(FBI Code,StringType,true),StructField(X Coordinate,IntegerType,true),StructField(Y Coordinate,IntegerType,true),StructField(Year,IntegerType,true),StructField(Updated On,StringType,true),StructField(Latitude,DoubleType,true),StructField(Longitude,DoubleType,true),StructField(Location,StringType,true)))
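
Several column names in this schema contain spaces (for example `Case Number` and `Primary Type`), which makes them awkward to reference in SQL expressions. One possible approach, sketched here with a hypothetical helper `to_snake_case`, is to normalize the names before further processing.

```python
import re


def to_snake_case(name):
    """Lowercase a column name and replace runs of non-alphanumerics with '_'."""
    return re.sub(r"[^0-9a-zA-Z]+", "_", name.strip()).strip("_").lower()


# A few of the column names from the schema above.
original = ["ID", "Case Number", "Primary Type", "Location Description",
            "X Coordinate", "Updated On"]
renamed = [to_snake_case(column) for column in original]

print(renamed)
```

Applied to the DataFrame, this would look like `df3 = df3.toDF(*[to_snake_case(c) for c in df3.columns])`, assuming no two columns collapse to the same name.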

## Data Mining Process