# Assignment 3

*Objectives*: Wrangle a data set using two new tools, [Trifacta Wrangler](https://www.trifacta.com/products/wrangler/) and [Apache Spark](https://spark.apache.org/).  Results should include a cleaned-up data set and summary statistics.

*Grading criteria*: The tasks should all be completed, and questions should all be answered with clear responses, with shell commands and markdown cells explaining your work as appropriate in the cells provided (add more as needed).  The notebook itself should be completely reproducible (using an AWS EC2 instance based on the class AMI) from start to finish; another person should be able to use the code to obtain the same results as yours.  Note that you will receive no more than partial credit if you do not add text/markdown cells explaining your thinking where required.

*Attestation*: **Work individually**.  At the end of your submitted notebook, state that you did all of the substantial work on this assignment yourself, and acknowledge any assistance you received.

*Deadline*: Sunday, October 22, 12pm.  Zip your notebook and wrangled dataset and submit it to Blackboard as a single file.

## Part 1 - Wrangle a dataset with Trifacta

For this part, select a dataset from the [OKFN US City Open Data Census](http://us-city.census.okfn.org/).  Choose one according to your interest, but try to choose one that's "green" and has somewhere between 10,000 and 1,000,000 rows.  Try to choose a dataset that is less than 50MB (to save your instructors some time and space during grading!).

Document your process by answering each of the following questions.

### Q1.1 - Choose your dataset

Which dataset did you choose?  What is it called, and what is it about?  Provide a link to its main web page (not its data link, which you'll include next).

**Answer**

```
I chose the Baltimore crime data. It's called "BPD_Part_1_Victim_Based_Crime_Data".
The data includes details about crimes reported by victims, for example the date, the exact time, the location, and the type of each crime report.
The links are below:
http://us-city.census.okfn.org/entry/baltimore/crime-stats
https://data.baltimorecity.gov/Public-Safety/BPD-Part-1-Victim-Based-Crime-Data/wsfq-mvij
```

### Q1.2 - Get your data

Use `wget` to download your data onto your instance. 

**Answer**

```
Get the CSV and name the file "crime.csv".
```

In [1]:
!wget "https://data.baltimorecity.gov/api/views/wsfq-mvij/rows.csv?accessType=DOWNLOAD" -O crime.csv

--2017-10-25 19:02:34--  https://data.baltimorecity.gov/api/views/wsfq-mvij/rows.csv?accessType=DOWNLOAD
Resolving data.baltimorecity.gov (data.baltimorecity.gov)... 52.206.68.26
Connecting to data.baltimorecity.gov (data.baltimorecity.gov)|52.206.68.26|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/csv]
Saving to: ‘crime.csv’

crime.csv               [        <=>         ]  40.15M  1.81MB/s    in 23s     

Last-modified header invalid -- time-stamp ignored.
2017-10-25 19:02:57 (1.75 MB/s) - ‘crime.csv’ saved [42098939]



### Q1.3 - Explore your data

Use command line tools of your choice (CSVKit, XSV, or other UNIX commands we've seen in class already) to explore your data.  How long is it?  Does it seem relatively clean, or do you see data issues that need wrangling?

**Answer**

```
See what's in the data file.
```

In [2]:
!csvcut -n crime.csv

  1: CrimeDate
  2: CrimeTime
  3: CrimeCode
  4: Location
  5: Description
  6: Inside/Outside
  7: Weapon
  8: Post
  9: District
 10: Neighborhood
 11: Longitude
 12: Latitude
 13: Location 1
 14: Premise
 15: Total Incidents


In [3]:
!head -5 crime.csv | csvlook

|-------------+-----------+-----------+----------------------------+----------------+----------------+---------+------+--------------+----------------+-----------+----------+---------------------------------+------------+------------------|
|  CrimeDate  | CrimeTime | CrimeCode | Location                   | Description    | Inside/Outside | Weapon  | Post | District     | Neighborhood   | Longitude | Latitude | Location 1                      | Premise    | Total Incidents  |
|-------------+-----------+-----------+----------------------------+----------------+----------------+---------+------+--------------+----------------+-----------+----------+---------------------------------+------------+------------------|
|  10/14/2017 | 23:00:00  | 4A        | 2700 BEETHOVEN AVE         | AGG. ASSAULT   | I              | FIREARM | 622  | NORTHWESTERN |                | -76.71131 | 39.32639 | (39.3263900000, -76.7113100000) | ROW/TOWNHO | 1                |
|  10/14/2017 | 22:15:00  | 4E  

```
Take a more readable look at all columns of the first several rows.
```

In [4]:
!csvcut -c 1-5 crime.csv | head -20 | csvlook

|-------------+-----------+-----------+----------------------------+-----------------------|
|  CrimeDate  | CrimeTime | CrimeCode | Location                   | Description           |
|-------------+-----------+-----------+----------------------------+-----------------------|
|  10/14/2017 | 23:00:00  | 4A        | 2700 BEETHOVEN AVE         | AGG. ASSAULT          |
|  10/14/2017 | 22:15:00  | 4E        | 4200 EDMONDSON AVE         | COMMON ASSAULT        |
|  10/14/2017 | 22:10:00  | 4E        | BALTIMORE ST & N HOWARD ST | COMMON ASSAULT        |
|  10/14/2017 | 22:10:00  | 4E        | BALTIMORE ST & N HOWARD ST | COMMON ASSAULT        |
|  10/14/2017 | 22:00:00  | 4E        | 400 WALTON AV, AAC         | COMMON ASSAULT        |
|  10/14/2017 | 21:42:00  | 9S        | 900 N EDEN ST              | SHOOTING              |
|  10/14/2017 | 21:28:00  | 9S        | 1400 CARROLL ST            | SHOOTING              |
|  10/14/2017 | 21:28:00  | 9S        | 1400 CARROLL ST     

In [5]:
!csvcut -c 6-10 crime.csv | head -20 | csvlook

|-----------------+---------+------+--------------+----------------------------|
|  Inside/Outside | Weapon  | Post | District     | Neighborhood               |
|-----------------+---------+------+--------------+----------------------------|
|  I              | FIREARM | 622  | NORTHWESTERN |                            |
|  I              | HANDS   | 822  | SOUTHWESTERN | Rognel Heights             |
|  O              | HANDS   | 111  | CENTRAL      | Downtown                   |
|  O              | HANDS   | 111  | CENTRAL      | Downtown                   |
|  O              | HANDS   | 123  | CENTRAL      | Upton                      |
|  Outside        | FIREARM | 313  | EASTERN      | Oldtown                    |
|  Outside        | FIREARM | 932  | SOUTHERN     | Washington Village/Pigtow  |
|  Outside        | FIREARM | 932  | SOUTHERN     | Washington Village/Pigtow  |
|  Outside        | FIREARM | 211  | SOUTHEASTERN | Perkins Homes              |
|  O              | FIREARM 

In [6]:
!csvcut -c 11-15 crime.csv | head -20 | csvlook

|------------+----------+---------------------------------+------------+------------------|
|  Longitude | Latitude | Location 1                      | Premise    | Total Incidents  |
|------------+----------+---------------------------------+------------+------------------|
|  -76.71131 | 39.32639 | (39.3263900000, -76.7113100000) | ROW/TOWNHO | 1                |
|  -76.68623 | 39.29372 | (39.2937200000, -76.6862300000) | LIQUOR STO | 1                |
|  -76.61945 | 39.28935 | (39.2893500000, -76.6194500000) | STREET     | 1                |
|  -76.61945 | 39.28935 | (39.2893500000, -76.6194500000) | STREET     | 1                |
|  -76.62471 | 39.29995 | (39.2999500000, -76.6247100000) | YARD       | 1                |
|  -76.59931 | 39.30036 | (39.3003600000, -76.5993100000) | Street     | 1                |
|  -76.63605 | 39.27848 | (39.2784800000, -76.6360500000) | Street     | 1                |
|  -76.63605 | 39.27848 | (39.2784800000, -76.6360500000) | Street    

```
I find the Total Incidents column interesting. Let's see what it contains.
```

In [7]:
!csvcut -c 15 crime.csv | csvstat

  1. Total Incidents
	<class 'int'>
	Nulls: False
	Values: 1

Row count: 282749


```
All values in Total Incidents are 1.
```

```
Let's see how many records we have.
```

In [8]:
!wc -l crime.csv

282750 crime.csv


Add any additional comments here.

```
The data seems quite clean.
However, there are some missing values. I will replace them with "unknown".
The Inside/Outside column is inconsistent: it mixes I with Inside and O with Outside. I will keep I and O and replace Inside/Outside with I/O.
Every value in the Total Incidents column is 1, so it carries no information. I will delete this column.
I noticed that values in the location columns contain ",". This may cause problems later when splitting on commas, so I will remove them.
```
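
The checks described above (mixed codes in one column, empty cells in another) can also be sketched in plain Python. This is a minimal illustration, not part of the original notebook: `SAMPLE` is a hypothetical three-row extract in the shape of crime.csv, and `profile_column` is a helper name introduced here.

```python
import csv
import io
from collections import Counter

# Hypothetical three-row sample in the shape of crime.csv (not the real file).
SAMPLE = """Inside/Outside,Weapon,Neighborhood
I,FIREARM,
Outside,HANDS,Downtown
O,,Upton
"""


def profile_column(rows, column):
    """Return (value counts, number of empty cells) for one column."""
    counts = Counter()
    missing = 0
    for row in rows:
        value = row[column].strip()
        if value:
            counts[value] += 1
        else:
            missing += 1
    return counts, missing


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
counts, missing = profile_column(rows, "Inside/Outside")
print(counts, missing)
```

Running the same helper on every column would show at a glance which columns mix spellings and which contain empty cells.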

### Q1.4 - Wrangle your data with Trifacta

Use Trifacta to import your data.  Find at least two columns you want to wrangle and clean them up - you can split values into new columns, remove bad values, whatever you like.

Execute your recipe, generating a summary you can review, and save your recipe.

Paste your recipe into the cell below using the markdown provided.

**Answer**

```

Paste your recipe here as text.  Leave the marks above and below this line to format the text.

```

```
We already have a column that combines the latitude and longitude, but I can still build my own column of crime scene coordinates.
I will then delete the redundant columns ('longitude', 'latitude', 'location_1') to make the file smaller.

Below is my final recipe (generated with "copy to clipboard"):

splitrows table: MISSING col: column1 on: '\n' quote: '\"'
split col: column1 on: ',' limit: 14 quote: '\"'
header table: MISSING
replace col: Inside_Outside with: '' on: `nside|utside` global: true
drop table: MISSING col: Total_Incidents
merge col: Latitude,Longitude with: ',' as: 'mylonglat'
drop table: MISSING col: Latitude
drop table: MISSING col: Longitude
drop table: MISSING col: Location_1
set col: CrimeDate, CrimeTime, CrimeCode, Location, Description, Inside_Outside, Weapon, Post, District, Neighborhood, mylonglat, Premise value: ifmissing($col, 'unknown')

Or, in more readable language:

Break into rows using '\n' as a delimiter
Split column1 into 15 columns on ','
Convert row 1 to header
Replace 'nside|utside' from Inside_Outside with ''
Drop Total_Incidents
Concatenate Latitude, Longitude separated by ','
Drop 'Latitude'
Drop 'Longitude'
Drop 'Location_1'
Set 12 columns to ifmissing($col,"unknown")
```
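
For readers without Trifacta, the per-row effect of the recipe can be approximated in Python. This is only a sketch under assumptions: `wrangle_row` is a hypothetical helper, the example row is an illustrative subset of columns, and the handling of a missing coordinate is a choice made here, not verified Trifacta behavior.

```python
import re


def wrangle_row(row):
    """Apply the recipe's cleaning steps to one record given as a dict."""
    out = dict(row)
    # Remove 'nside' / 'utside' so that Inside -> I and Outside -> O.
    out["Inside_Outside"] = re.sub(r"nside|utside", "", out["Inside_Outside"])
    # Drop the constant column.
    out.pop("Total_Incidents", None)
    # Merge Latitude and Longitude into one column, dropping the originals.
    # Assumption: if either coordinate is empty, the merged cell is left empty.
    lat, lon = out.pop("Latitude"), out.pop("Longitude")
    out["mylonglat"] = f"{lat},{lon}" if lat and lon else ""
    out.pop("Location_1", None)
    # Fill every empty cell with 'unknown'.
    return {key: (value if value else "unknown") for key, value in out.items()}


# Illustrative row using a subset of the columns.
example = {
    "Inside_Outside": "Outside",
    "Weapon": "FIREARM",
    "Neighborhood": "",
    "Latitude": "39.30036",
    "Longitude": "-76.59931",
    "Location_1": "(39.3003600000, -76.5993100000)",
    "Total_Incidents": "1",
}
cleaned = wrangle_row(example)
print(cleaned)
```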

### Q1.5 - Evaluate

How did it go?  Did your recipe work on the whole dataset?  Did you run into any problems?

**Answer**

```
I generated the result.
It shows 99.99% matched, 0.01% mismatched, and 0% missing.
The single mismatched value is in the CrimeTime column:
valid times range from 00:00:00 to 23:59:59, but the data contains one 24:00:00.
Otherwise the recipe works well.

I will leave the 24:00:00 unchanged because it is unclear whether it means 00:00:00 of the same day or 00:00:00 of the next day.
```
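
A quick way to reproduce this kind of mismatch check outside Trifacta is to try parsing each time string. The sketch below assumes the `HH:MM:SS` format seen in the CrimeTime column; `is_valid_time` is a hypothetical helper name.

```python
from datetime import datetime


def is_valid_time(text):
    """Return True if text parses as HH:MM:SS with hours in the range 00-23."""
    try:
        datetime.strptime(text, "%H:%M:%S")
        return True
    except ValueError:
        return False


for candidate in ["23:00:00", "00:00:00", "24:00:00"]:
    print(candidate, is_valid_time(candidate))
```

The standard library rejects 24:00:00 for the same reason Trifacta flags it: the hour field only runs from 00 to 23.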

## Part 2 - Summary statistics with Spark

Use Spark to load your data and compute basic summary statistics (counts, or average, min/max, and mean).  You may borrow liberally from the example we saw in class, just change a few things as appropriate.

This is just to give you a taste... we'll do more with Spark next week and in Project 3.

### Q2.1 - Start Spark

First, load up Spark by executing the following cells.  You can just execute them!

In [9]:
import os

In [10]:
os.environ['SPARK_HOME'] = '/usr/local/lib/spark'

In [11]:
import findspark

In [12]:
findspark.init()

In [13]:
from pyspark import SparkContext

In [14]:
spark = SparkContext(appName='assignment-3')

In [15]:
spark

If it worked, you should see the description of your **SparkContext** and a link (that you can visit by replacing its IP address with your EC2 instance host name).

### Q2.2 - Upload your wrangled data

Upload the data you wrangled with Trifacta in Part 1.  You may use Jupyter's upload function for this; it doesn't need to be captured here.  You may want to compress your data before uploading it.

In a few cells below, ensure that your data uploaded correctly, and uncompress it if necessary.  Count its lines, check its filesize, or look at the first few lines as you deem appropriate until you're confident you have all your data to use here in the notebook.

**Answer**

```
Let's check that crimenew.csv exists in the folder.
```

In [16]:
import os
assert "crimenew.csv" in os.listdir()

* Count lines and check size

In [17]:
!wc -l crimenew.csv

282750 crimenew.csv


In [18]:
!ls -lh crimenew.csv

-rw-rw-r-- 1 ubuntu ubuntu 34M Oct 25 16:27 crimenew.csv


In [19]:
!stat crimenew.csv

  File: crimenew.csv
  Size: 34899262  	Blocks: 68168      IO Block: 4096   regular file
Device: ca01h/51713d	Inode: 777441      Links: 1
Access: (0664/-rw-rw-r--)  Uid: ( 1000/  ubuntu)   Gid: ( 1000/  ubuntu)
Access: 2017-10-25 16:29:21.919970725 +0000
Modify: 2017-10-25 16:27:46.872508972 +0000
Change: 2017-10-25 16:27:46.872508972 +0000
 Birth: -


In [20]:
!csvcut -n crimenew.csv

  1: CrimeDate
  2: CrimeTime
  3: CrimeCode
  4: Description
  5: Inside_Outside
  6: Weapon
  7: Post
  8: District
  9: Neighborhood
 10: mylonglat
 11: Premise


In [21]:
!head -1 crimenew.csv

"CrimeDate","CrimeTime","CrimeCode","Description","Inside_Outside","Weapon","Post","District","Neighborhood","mylonglat","Premise"


In [22]:
!head -10 crimenew.csv| csvlook

|-------------+-----------+-----------+----------------+----------------+---------+------+--------------+---------------------------+--------------------+-------------|
|  CrimeDate  | CrimeTime | CrimeCode | Description    | Inside_Outside | Weapon  | Post | District     | Neighborhood              | mylonglat          | Premise     |
|-------------+-----------+-----------+----------------+----------------+---------+------+--------------+---------------------------+--------------------+-------------|
|  10/14/2017 | 23:00:00  | 4A        | AGG. ASSAULT   | I              | FIREARM | 622  | NORTHWESTERN | unknown                   | 39.32639,-76.71131 | ROW/TOWNHO  |
|  10/14/2017 | 22:15:00  | 4E        | COMMON ASSAULT | I              | HANDS   | 822  | SOUTHWESTERN | Rognel Heights            | 39.29372,-76.68623 | LIQUOR STO  |
|  10/14/2017 | 22:10:00  | 4E        | COMMON ASSAULT | O              | HANDS   | 111  | CENTRAL      | Downtown                  | 39.28935,-76.619

### Q2.3 - Load your data into a Spark RDD

Load up your data using the techniques we reviewed in class.  Extract the header. Get a count to verify that it's working correctly.

Modify the cells below to get started.

**Answer**

In [23]:
# Edit this cell to point to your file!
data = spark.textFile('crimenew.csv')

In [24]:
header = data.first()
header

'"CrimeDate","CrimeTime","CrimeCode","Description","Inside_Outside","Weapon","Post","District","Neighborhood","mylonglat","Premise"'

In [25]:
data.count()

282750

In [26]:
%time data.count()

CPU times: user 8 ms, sys: 0 ns, total: 8 ms
Wall time: 845 ms


282750

```
This matches the line count from the output above.
```
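
Note that 282750 is a count of lines, header included, while `csvstat` reported 282749 data rows. The difference can be illustrated with a small sketch; `SAMPLE` is a hypothetical two-record extract and `count_lines_and_records` is a helper name introduced here.

```python
import csv
import io

# Hypothetical sample: one header line followed by two records.
SAMPLE = '"CrimeDate","Post"\n"10/14/2017","622"\n"10/14/2017","822"\n'


def count_lines_and_records(text):
    """Return (newline count, as `wc -l` would report; data rows after the header)."""
    lines = text.count("\n")
    records = sum(1 for _ in csv.DictReader(io.StringIO(text)))
    return lines, records


print(count_lines_and_records(SAMPLE))
```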

### Q2.4 - Summarize your data

Choose one of the two techniques we saw in class to compute some basic numbers on one of your columns.  Your options are:

 * Use `map`, `filter`, and `reduceByKey` with `lambda` functions to find min/max values and to count frequencies in one column
 * Use the `Statistics` module to compute count, mean, min/max (don't forget to import it and numpy)
 
It's your choice.

**Answer**

```
Let's do the calculation on the Post column.
```

* First let's try the old method using csvcut.

In [27]:
!csvcut -c 7 crimenew.csv | csvstat

  1. Post
	<class 'str'>
	Nulls: False
	Unique values: 181
	5 most frequent values:
		111:	9326
		212:	4759
		913:	4636
		922:	4594
		432:	4377
	Max length: 7

Row count: 282749


* New way to do it.

In [28]:
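# Skip the header, split each line on commas, keep the Post field (index 6),
# and drop the 'unknown' placeholder values.
data_post = data.filter(lambda row: row != header) \
                .map(lambda row: row.split(",")) \
                .map(lambda cols: cols[6]) \
                .filter(lambda post: post != '"unknown"')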
data_post.take(5)

['"622"', '"822"', '"111"', '"111"', '"123"']

In [29]:
data_post.max()

'"999"'

In [30]:
data_post.min()

'"111"'
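
The `max()` and `min()` results above compare quoted strings, not numbers. That happens to work when every value has three digits, but string ordering would place `"99"` above `"111"`. A minimal sketch of a numeric conversion follows; `post_to_int` is a hypothetical helper, and the sample list reuses the values shown by `take(5)` plus one placeholder.

```python
def post_to_int(field):
    """Strip surrounding double quotes and convert to int; None if not numeric."""
    text = field.strip('"')
    return int(text) if text.isdigit() else None


# Values as they appear after splitting a line on commas.
sample = ['"622"', '"822"', '"111"', '"111"', '"123"', '"unknown"']
numbers = [n for n in map(post_to_int, sample) if n is not None]
print(min(numbers), max(numbers))
```

The same function could be passed to `map` on the RDD before calling `min()` and `max()`, so that the comparison is numeric.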

In [31]:
from operator import add
# Count occurrences of each Post value and sort by count, most frequent first.
frequency = data.filter(lambda row: row != header) \
                .map(lambda row: row.split(",")) \
                .map(lambda cols: (cols[6], 1)) \
                .filter(lambda pair: pair[0] != '"unknown"') \
                .reduceByKey(add) \
                .takeOrdered(1000, key=lambda pair: -pair[1])
print(frequency[0],frequency[-1])

('"111"', 9326) ('"216"', 1)
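
Splitting on `","` works for the Post column only because no earlier field contains a comma; the `mylonglat` field, for instance, is split in two. A hedged alternative is to parse each line with the `csv` module, which also removes the surrounding quotes. The sketch below runs locally on a few illustrative lines in the shape of crimenew.csv and uses `Counter` as a stand-in for the `map` / `reduceByKey` step; `parse_line` is a hypothetical helper.

```python
import csv
from collections import Counter


def parse_line(line):
    """Parse one CSV line, respecting quoted fields that contain commas."""
    return next(csv.reader([line]))


# Illustrative lines in the shape of crimenew.csv.
lines = [
    '"10/14/2017","23:00:00","4A","AGG. ASSAULT","I","FIREARM","622",'
    '"NORTHWESTERN","unknown","39.32639,-76.71131","ROW/TOWNHO"',
    '"10/14/2017","22:15:00","4E","COMMON ASSAULT","I","HANDS","822",'
    '"SOUTHWESTERN","Rognel Heights","39.29372,-76.68623","LIQUOR STO"',
    '"10/14/2017","22:10:00","4E","COMMON ASSAULT","O","HANDS","111",'
    '"CENTRAL","Downtown","39.28935,-76.61945","STREET"',
    '"10/14/2017","22:10:00","4E","COMMON ASSAULT","O","HANDS","111",'
    '"CENTRAL","Downtown","39.28935,-76.61945","STREET"',
]

# Count how often each Post value (field index 6) occurs.
frequency = Counter(parse_line(line)[6] for line in lines)
print(frequency.most_common())
```

In Spark, `parse_line` could replace the `row.split(",")` lambda in the `map` step, at the cost of some extra parsing work per line.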


### Q2.5 - Evaluate

How did it go?  Did it work as you expected?  Did you run into any issues?

What do you like about using Spark?  Or do you dislike it?

**Answer**

Write your answer here.

```
I definitely like Spark.
I ran into some trouble along the way, but every problem was solved.
Spark now works well without errors.
```

* Let's see the time needed using the old method (csvkit).

In [32]:
%time !csvcut -c 7 crimenew.csv | csvstat

  1. Post
	<class 'str'>
	Nulls: False
	Unique values: 181
	5 most frequent values:
		111:	9326
		212:	4759
		913:	4636
		922:	4594
		432:	4377
	Max length: 7

Row count: 282749
CPU times: user 336 ms, sys: 48 ms, total: 384 ms
Wall time: 27.9 s


* Let's see the time needed using Spark.

In [33]:
from operator import add
%time data.filter(lambda row: row != header) \
            .map(lambda row: row.split(",")) \
            .map(lambda cols: (cols[6], 1)) \
            .filter(lambda pair: pair[0] != '"unknown"') \
            .reduceByKey(add) \
            .takeOrdered(0,key=lambda pair: -pair[1])

CPU times: user 12 ms, sys: 0 ns, total: 12 ms
Wall time: 1.51 s


[]

```
28 seconds versus 1.5 seconds!
That's a huge difference!
No one likes waiting for commands to execute. A faster way is exactly what I was looking for!
```
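
A single `%time` run can be noisy, and the two commands above do not do identical work (`csvstat` also infers types and reports several statistics). For a somewhat fairer comparison, each approach could be timed over a few repeats. The helper below is a generic sketch, not from the original notebook; `best_of` is a hypothetical name, and the workload passed to it is a stand-in.

```python
import time


def best_of(func, repeats=3):
    """Call func `repeats` times and return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)


# Stand-in workload; in the notebook this would wrap the Spark or csvkit call.
elapsed = best_of(lambda: sum(range(100_000)))
print(f"best of 3: {elapsed:.6f} s")
```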