# Tutorial for data analysis using PySpark

This tutorial is based on the LinkedIn Learning example "Apache PySpark by Example" by Jonathan Fernandes.

The first steps are to install PySpark, create a Spark session and download the data (to the virtual environment).

### Install PySpark and create Spark session

In [None]:
!pip install pyspark==3.5.1 findspark

Create a new Spark session and context.

In [None]:
import findspark
findspark.init()
from pyspark import SparkContext
sc = SparkContext.getOrCreate()

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
spark

### Download City of Chicago's police stations dataset

In [None]:
!wget -O police-stations.csv "https://data.cityofchicago.org/api/views/z8bn-74gv/rows.csv?accessType=DOWNLOAD"
!ls -l

Read the CSV file to create an RDD and then show the first line.

In [None]:
psrdd = sc.textFile('police-stations.csv')
psrdd.first()

Save the first line of the dataset as the header.

In [None]:
ps_header = psrdd.first()

Now create a new RDD with all the data except for the header.

In [None]:
ps_rest = psrdd.filter(lambda line: line != ps_header)
ps_rest.first()
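
The filter above compares every line with the header, so it would also drop any data line that happens to equal the header text. As a hedged alternative, the sketch below skips only the first line of the first partition. The helper name `drop_header` is hypothetical. It is written as plain Python so it can be checked without Spark, and the commented line shows how it might be passed to `mapPartitionsWithIndex`, assuming the header sits at the start of partition 0 (the usual case for a single file read with `textFile`).

```python
from itertools import islice

def drop_header(partition_index, lines):
    """Skip the first line of partition 0; pass other partitions through."""
    if partition_index == 0:
        return islice(lines, 1, None)
    return lines

# Local check with made-up lines (not rows from the real dataset).
sample = ["ID,NAME", "1,Alpha", "2,Beta"]
first_partition = list(drop_header(0, iter(sample)))
other_partition = list(drop_header(1, iter(sample[1:])))

# With the tutorial's RDD this could be used as (assumption, not run here):
# ps_rest = psrdd.mapPartitionsWithIndex(drop_header)
```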

### How many police stations are there?

To see how many police stations there are, we can use the map function to split the data line by line, and then either collect all the data (not recommended for large datasets, since it brings every row back to the driver) or use count to count the lines.

In [None]:
ps_rest.map(lambda line: line.split(',')).collect()

In [None]:
ps_rest.map(lambda line: line.split(',')).count()
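
Splitting on every comma works as long as no field contains a comma itself. If a field is quoted and has an embedded comma, `split(',')` breaks it into two pieces. The sketch below uses Python's standard `csv` module to parse one line at a time. `parse_line` is a hypothetical helper name, and the sample line is invented for illustration.

```python
import csv

def parse_line(line):
    """Parse a single CSV line, honouring double-quoted fields."""
    return next(csv.reader([line]))

# Invented sample line with a quoted field that contains a comma.
sample = '99,Example,"100 Main St, Floor 2",Chicago'

naive = sample.split(',')      # the quoted field is split in two
parsed = parse_line(sample)    # the quoted field stays whole

# Possible use with the tutorial's RDD (assumption, not run here):
# ps_rest.map(parse_line).count()
```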

### Display the District ID, District name, Address and Zip for the police station with District ID 7

The district ID, name, address and zip code correspond to columns 0, 1, 2 and 5 of the dataset. We can filter for the rows whose district ID is 7, then use the map function with a lambda that splits each line and selects only the columns needed. Use collect to display the result.

In [None]:
(ps_rest.filter(lambda line: line.split(',')[0] == '7').
 map(lambda line: (line.split(',')[0],
                   line.split(',')[1],
                   line.split(',')[2],
                   line.split(',')[5]
                   )).collect())
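
The cell above calls `split` several times per line. A minimal sketch of the same selection that splits each line once is shown below. `select_fields` is a hypothetical helper, and the sample line is made up: it only mimics the column order described above (ID, name, address, then zip at index 5).

```python
def select_fields(line, indices=(0, 1, 2, 5)):
    """Split a line once and return only the requested columns as a tuple."""
    fields = line.split(',')
    return tuple(fields[i] for i in indices)

# Made-up line; positions 3 and 4 stand in for columns the tutorial skips.
sample = '7,Example District,123 Sample St,Chicago,IL,60600'
selected = select_fields(sample)

# Possible use with the tutorial's RDD (assumption, not run here):
# (ps_rest.map(select_fields)
#         .filter(lambda row: row[0] == '7')
#         .collect())
```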

### Police stations 10 and 11 are geographically close to each other. Display the District ID, District name, address and zip code

Similarly, use filter, map, a lambda and split to show columns 0, 1, 2 and 5 for every row whose district ID is 10 or 11.

In [None]:
(ps_rest.filter(lambda line: line.split(',')[0] in ['10','11']).
 map(lambda line: (line.split(',')[0],
                   line.split(',')[1],
                   line.split(',')[2],
                   line.split(',')[5])).collect())
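
When the list of district IDs grows, the membership test can be factored into a small predicate builder. This is only a sketch, with `in_districts` as a hypothetical name and invented sample lines. It uses a set for the lookup and splits off only the first field.

```python
def in_districts(district_ids):
    """Return a predicate that is True for lines whose first field is in district_ids."""
    wanted = set(district_ids)

    def predicate(line):
        # maxsplit=1 stops after the first comma, since only column 0 is needed
        return line.split(',', 1)[0] in wanted

    return predicate

# Invented lines for a local check (not rows from the real dataset).
lines = ['9,A,addr', '10,B,addr', '11,C,addr', '110,D,addr']
matches = list(filter(in_districts(['10', '11']), lines))

# Possible use with the tutorial's RDD (assumption, not run here):
# ps_rest.filter(in_districts(['10', '11']))
```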