In [1]:
from init_SparkContext import init_SparkContext
sc, spark = init_SparkContext(appName = "NYCrimeAnalysis")

Spark found in your system !!
Spark Context and Spark session initialized !!


In [2]:
sc

In [7]:
# Load the data and get a quick sense
path = "D:\\Big Data\\Notebook\\data\\NYPD_7_Major_Felony_Incidents.csv"
data = sc.textFile(path)

In [8]:
data

D:\Big Data\Notebook\data\NYPD_7_Major_Felony_Incidents.csv MapPartitionsRDD[4] at textFile at NativeMethodAccessorImpl.java:0

In [9]:
data.take(10)

['OBJECTID,Identifier,Occurrence Date,Day of Week,Occurrence Month,Occurrence Day,Occurrence Year,Occurrence Hour,CompStat Month,CompStat Day,CompStat Year,Offense,Offense Classification,Sector,Precinct,Borough,Jurisdiction,XCoordinate,YCoordinate,Location 1',
 '1,f070032d,09/06/1940 07:30:00 PM,Friday,Sep,6,1940,19,9,7,2010,BURGLARY,FELONY,D,66,BROOKLYN,N.Y. POLICE DEPT,987478,166141,"(40.6227027620001, -73.9883732929999)"',
 '2,c6245d4d,12/14/1968 12:20:00 AM,Saturday,Dec,14,1968,0,12,14,2008,GRAND LARCENY,FELONY,G,28,MANHATTAN,N.Y. POLICE DEPT,996470,232106,"(40.8037530600001, -73.955861904)"',
 '3,716dbc6f,10/30/1970 03:30:00 PM,Friday,Oct,30,1970,15,10,31,2008,BURGLARY,FELONY,H,84,BROOKLYN,N.Y. POLICE DEPT,986508,190249,"(40.688874254, -73.9918594329999)"',
 '4,638cd7b7,07/18/1972 11:00:00 PM,Tuesday,Jul,18,1972,23,7,19,2012,GRAND LARCENY OF MOTOR VEHICLE,FELONY,F,73,BROOKLYN,N.Y. POLICE DEPT,1005876,182440,"(40.6674141890001, -73.9220463899999)"',
 '5,6e410287,05/21/1987 12:01:00

### Drawing Insights

- What is the trend in crime over the past few years?

- Which categories of crimes are the most common?

- In which boroughs is a particular category of crime most prevalent?

### Cleaning Data

- Filter the header

- Missing values

- Anomalous data

### Transforming Data

- Extracting fields

- Computing metrics

##### Spark performs all of these operations in a functional style, which differs slightly from the traditional way of working with data sets. We will look at the functional way of working with data in Spark and at the operations used in this paradigm.

### Transforming Data with Spark

- To transform data in Spark, we use the functional programming paradigm.


- As we know, an RDD is a collection of records. Any transformation or computation on this collection involves doing something with each item in it.


- One way of doing something with each item in the collection is the imperative way, which is using for loops or while loops.


- In this method, we take one element at a time, perform some transformation on it, then move on to the next element, and so on until we reach the end of the collection.


- In the imperative way, we perform an operation sequentially on each element of the collection. This lets us keep track of which element we are currently operating on, how many we have already finished, and how many are left.


- But this method does not involve any parallelism, so it may not exploit the performance that a distributed computing system can provide.


- On the other hand, we could use a functional way.


- The functional way performs an operation independently on every record in the collection, conceptually at the same time, and returns a new set of records. It does not modify the records in place.


- In this method, we take a function that defines some logic and apply that function to each record in the collection.


- This functional programming allows us to process data in parallel.


- Spark uses the functional programming way to actually perform operations on RDDs.


- The function that we might apply on each record could be an explicitly defined function.


- Such a function acts on each record, so it should take a single argument. Once it is applied, we get a new RDD whose records are the values returned by this function.


- Rather than defining an explicit function, we can also use lambda functions.


- Lambda functions are defined by an input on one side and, on the other, an expression that performs some computation on the input and returns the result.
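
As a rough illustration of the contrast described above, the sketch below uses plain Python rather than Spark. A `for` loop transforms a small list element by element, and the built-in `map` applies the same logic once with an explicitly defined function and once with a lambda. The `hours` list and the `to_int` helper are made up for this example. Python's `map` does not run in parallel; it only shows the style. On an RDD the analogous call would be `rdd.map(to_int)`.

```python
# Hypothetical sample data: hour-of-day values stored as strings
hours = ["19", "0", "15", "23"]

# Imperative style: visit each element in turn and build the result by hand
imperative_result = []
for h in hours:
    imperative_result.append(int(h))

# Functional style with an explicitly defined single-argument function
def to_int(value):
    return int(value)

functional_result = list(map(to_int, hours))

# Functional style with a lambda: input before the colon, expression after it
lambda_result = list(map(lambda value: int(value), hours))

print(imperative_result)
print(functional_result == imperative_result, lambda_result == imperative_result)
```

In all three cases the original `hours` list is left untouched and a new collection is produced.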

### Functional Programming

- Filter: A function can be used to filter records that match a certain condition.


- Map: A function can be used to map, or transform, each record to a new record.


- Reduce: A function can be used to combine the records in a specified way, for instance to compute a sum.

#### Filter:

- Filter records matching a given condition.


- The filter operation takes in a function which returns a Boolean value.


- It returns either true or false for each record that it processes.


- If the function returns true, the record is kept; otherwise, the record is dropped.


- The result of the filter operation is a new RDD from which all records that did not match the condition in your Boolean function have been dropped.


- This operation is useful for filtering out a header row in a data set, or for selecting rows that correspond to a specific value.
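
The two uses just mentioned can be sketched with Python's built-in `filter`, which follows the same keep-if-true rule. The three-line `lines` list is a made-up miniature of the data set, and `is_not_header` is a hypothetical helper name. On an RDD the corresponding call would be `rdd.filter(is_not_header)`.

```python
# Hypothetical miniature of the data set: a header line followed by two records
lines = [
    "OBJECTID,Offense,Borough",
    "1,BURGLARY,BROOKLYN",
    "2,GRAND LARCENY,MANHATTAN",
]
header = lines[0]

def is_not_header(line):
    # Boolean function: True keeps the record, False drops it
    return line != header

# Use 1: drop the header row
records = list(filter(is_not_header, lines))

# Use 2: select only the rows whose third field has a specific value
brooklyn = list(filter(lambda line: line.split(",")[2] == "BROOKLYN", records))

print(records)
print(brooklyn)
```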


#### Map:

- Takes a record and transforms it into another record.
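
A minimal sketch of this one-record-in, one-record-out behaviour, again in plain Python with made-up sample records: each comma-separated string becomes an `(offense, borough)` tuple. The field layout here is an assumption for the example, not the layout of the real file.

```python
# Hypothetical records with three comma-separated fields: id, offense, borough
records = ["1,BURGLARY,BROOKLYN", "2,GRAND LARCENY,MANHATTAN"]

# Each input string is transformed into exactly one new record (a tuple)
pairs = list(map(lambda line: tuple(line.split(",")[1:]), records))

print(pairs)
```

The output has as many records as the input; only the shape of each record changes.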


#### Reduce:

- The reduce operation is used to combine the records in an RDD in a specified way, for instance to compute the sum, the maximum, or the minimum of some values.




- The reduce operation is slightly different from the filter and map operations, which are truly applied in parallel on all records in the RDD.


- The reduce operation, on the other hand, is applied to two records at a time. Filter and map take functions with a single argument, which represents one record; the reduce operation takes a function with two arguments.
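
To make the two-argument requirement concrete, here is a small sketch using `functools.reduce` from the Python standard library on a made-up list of hours. Each function passed to `reduce` takes two values and returns one. On an RDD the analogous call would be `rdd.reduce(lambda a, b: a + b)`.

```python
from functools import reduce

# Hypothetical sample values
hours = [19, 0, 15, 23]

# Each reducing function takes two arguments and returns a single combined value
total = reduce(lambda a, b: a + b, hours)
latest = reduce(lambda a, b: a if a > b else b, hours)
earliest = reduce(min, hours)  # the built-in min also accepts two arguments

print(total, latest, earliest)
```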

#### Combining Records


- You start by applying the function to the first two records in the RDD. If the function is a sum, this gives the sum of the first two records. You then apply the same function to that result and the third record.


- In each step, you apply the function to the result from the previous step and the current record. You do this until you have combined all the records.


- RDDs, as you know, are partitioned, so the data is split across multiple nodes. In that case, the reduce operation is applied within each partition first; the results from all the partitions are then brought to a single node, and the reduce operation is applied to those results again. For this to give a consistent answer, the function should be commutative and associative.
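
The step-by-step combination and the two-stage, per-partition combination can be simulated in plain Python. This is only a sketch of the idea: the nested `partitions` list is a made-up stand-in for data spread across nodes, and no real distribution takes place.

```python
from functools import reduce
from itertools import accumulate

def add(a, b):
    return a + b

# Step-by-step view: every intermediate result of combining 1, 2, 3, 4 with add
running = list(accumulate([1, 2, 3, 4], add))

# Hypothetical partitions, standing in for data held on three different nodes
partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

# Stage 1: reduce within each partition independently
partial_results = [reduce(add, part) for part in partitions]

# Stage 2: bring the partial results together and reduce them again
total = reduce(add, partial_results)

# Same answer as reducing all records in one pass, because add is
# commutative and associative
flat = [value for part in partitions for value in part]
single_pass_total = reduce(add, flat)

print(running)
print(partial_results, total, single_pass_total)
```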

##### Filter and Map are Transformations; Reduce is an Action
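
In Spark, transformations are lazy: they only describe a new RDD, and nothing is computed until an action asks for a result. Python 3's built-in `map` and `filter` are also lazy, so they give a loose analogy, sketched below. The `calls` list and `parse_hour` helper are made up for the example, and the comparison is only an analogy; Spark's actual execution model is more involved.

```python
from functools import reduce

calls = []  # records every time the mapping function actually runs

def parse_hour(text):
    calls.append(text)
    return int(text)

# Hypothetical sample data
hours = ["19", "0", "15", "23"]

# "Transformations": building the pipeline does not run parse_hour yet
parsed = map(parse_hour, hours)
evening = filter(lambda h: h >= 15, parsed)
calls_before = len(calls)

# "Action": reducing to a single value forces the whole pipeline to run
total = reduce(lambda a, b: a + b, evening)
calls_after = len(calls)

print(calls_before, calls_after, total)
```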

In [10]:
# Filter the header row
header = data.first()

In [12]:
print(header)

OBJECTID,Identifier,Occurrence Date,Day of Week,Occurrence Month,Occurrence Day,Occurrence Year,Occurrence Hour,CompStat Month,CompStat Day,CompStat Year,Offense,Offense Classification,Sector,Precinct,Borough,Jurisdiction,XCoordinate,YCoordinate,Location 1


In [29]:
# Keep every line except the header
datawoHeader = data.filter(lambda x: x != header)

In [30]:
datawoHeader.first()

'1,f070032d,09/06/1940 07:30:00 PM,Friday,Sep,6,1940,19,9,7,2010,BURGLARY,FELONY,D,66,BROOKLYN,N.Y. POLICE DEPT,987478,166141,"(40.6227027620001, -73.9883732929999)"'

### Transforming records from strings to named tuples

- Now we have an RDD in which each record is a string. We would like to convert it to an RDD of named tuples, so that each record in the RDD is a tuple and each element of that tuple can be referred to by its field name.


- The map operation is what will help us do this transformation.

In [31]:
# Parse the rows to extract fields

In [36]:
# Naive split on commas; note that it also splits the quoted "Location 1" field in two
new_rdd_with_list = datawoHeader.map(lambda x: x.split(","))

In [37]:
new_rdd_with_list.take(10)

[['1',
  'f070032d',
  '09/06/1940 07:30:00 PM',
  'Friday',
  'Sep',
  '6',
  '1940',
  '19',
  '9',
  '7',
  '2010',
  'BURGLARY',
  'FELONY',
  'D',
  '66',
  'BROOKLYN',
  'N.Y. POLICE DEPT',
  '987478',
  '166141',
  '"(40.6227027620001',
  ' -73.9883732929999)"'],
 ['2',
  'c6245d4d',
  '12/14/1968 12:20:00 AM',
  'Saturday',
  'Dec',
  '14',
  '1968',
  '0',
  '12',
  '14',
  '2008',
  'GRAND LARCENY',
  'FELONY',
  'G',
  '28',
  'MANHATTAN',
  'N.Y. POLICE DEPT',
  '996470',
  '232106',
  '"(40.8037530600001',
  ' -73.955861904)"'],
 ['3',
  '716dbc6f',
  '10/30/1970 03:30:00 PM',
  'Friday',
  'Oct',
  '30',
  '1970',
  '15',
  '10',
  '31',
  '2008',
  'BURGLARY',
  'FELONY',
  'H',
  '84',
  'BROOKLYN',
  'N.Y. POLICE DEPT',
  '986508',
  '190249',
  '"(40.688874254',
  ' -73.9918594329999)"'],
 ['4',
  '638cd7b7',
  '07/18/1972 11:00:00 PM',
  'Tuesday',
  'Jul',
  '18',
  '1972',
  '23',
  '7',
  '19',
  '2012',
  'GRAND LARCENY OF MOTOR VEHICLE',
  'FELONY',
  'F',
  '73

In [159]:
import csv
from io import StringIO
from collections import namedtuple
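
The reason for bringing in `csv` and `StringIO` can be seen on a single line. The plain `split(",")` used earlier cuts the quoted location field at its inner comma, while `csv.reader` treats the quoted text as one field. The `row` below is a shortened, hypothetical version of the first record, kept to four fields for readability.

```python
import csv
from io import StringIO

# Shortened, hypothetical version of the first record (only four fields)
row = '1,f070032d,BROOKLYN,"(40.6227027620001, -73.9883732929999)"'

# Naive split: the quoted location is broken into two pieces
naive = row.split(",")

# csv.reader expects an iterable of lines, so the string is wrapped in StringIO;
# the quoted field is kept together and the surrounding quotes are removed
parsed = next(csv.reader(StringIO(row)))

print(len(naive), naive[-2:])
print(len(parsed), parsed[-1])
```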

In [45]:
header

'OBJECTID,Identifier,Occurrence Date,Day of Week,Occurrence Month,Occurrence Day,Occurrence Year,Occurrence Hour,CompStat Month,CompStat Day,CompStat Year,Offense,Offense Classification,Sector,Precinct,Borough,Jurisdiction,XCoordinate,YCoordinate,Location 1'

In [43]:
fields = header.replace(" ", "_").replace("/", "_").split(",")

In [44]:
fields

['OBJECTID',
 'Identifier',
 'Occurrence_Date',
 'Day_of_Week',
 'Occurrence_Month',
 'Occurrence_Day',
 'Occurrence_Year',
 'Occurrence_Hour',
 'CompStat_Month',
 'CompStat_Day',
 'CompStat_Year',
 'Offense',
 'Offense_Classification',
 'Sector',
 'Precinct',
 'Borough',
 'Jurisdiction',
 'XCoordinate',
 'YCoordinate',
 'Location_1']

In [64]:
len(fields)

20

In [47]:
crime = namedtuple('crime', fields)

In [160]:
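def parse(row):
    # csv.reader expects an iterable of lines, so wrap the single line in StringIO
    reader = csv.reader(StringIO(row))
    # The reader yields one list of fields for the one line
    row = next(reader)
    # Unpack the list into the named tuple, one value per field name
    return crime(*row)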

In [161]:
crimes=datawoHeader.map(parse)

In [162]:
crimes.first()

crime(OBJECTID='1', Identifier='f070032d', Occurrence_Date='09/06/1940 07:30:00 PM', Day_of_Week='Friday', Occurrence_Month='Sep', Occurrence_Day='6', Occurrence_Year='1940', Occurrence_Hour='19', CompStat_Month='9', CompStat_Day='7', CompStat_Year='2010', Offense='BURGLARY', Offense_Classification='FELONY', Sector='D', Precinct='66', Borough='BROOKLYN', Jurisdiction='N.Y. POLICE DEPT', XCoordinate='987478', YCoordinate='166141', Location_1='(40.6227027620001, -73.9883732929999)')

In [169]:
crimes.first().Offense

'BURGLARY'

In [170]:
crimes.first().Offense_Classification

'FELONY'

In [172]:
crimes.first().Location_1

'(40.6227027620001, -73.9883732929999)'

#### Scratch work: how the parse function works

In [163]:
import csv
# Passing a bare string makes csv.reader iterate over it character by character
row = csv.reader("Upendra,Bokaro,sunday,Bike,'(8102297061 samsumg)'")
next(row)

['U']

In [165]:
import csv
row = csv.reader(StringIO("Upendra,Bokaro,sunday,Bike,'(8102297061 samsumg)'"))

In [166]:
row1 = next(row)
row1

['Upendra', 'Bokaro', 'sunday', 'Bike', "'(8102297061 samsumg)'"]

In [175]:
f = ["Name", "Place", "Day", "Mode", "PhoneNumber"]

In [176]:
test = namedtuple("testing", f)

In [177]:
test(*row1)

testing(Name='Upendra', Place='Bokaro', Day='sunday', Mode='Bike', PhoneNumber="'(8102297061 samsumg)'")

In [178]:
test(*row1).Place

'Bokaro'