
Importing Chicago Crime Dataset into Neo4j

The following are instructions for importing the open Chicago crime data set into Neo4j.

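  • Set the CSV_FILE environment variable to the path of the crime data CSV file:

export CSV_FILE=Crimes_-_2001_to_present.csv

  • Unpack the Spark 1.3.0 distribution:

tar -xvf spark-1.3.0-bin-hadoop1.tgz

  • Generate the CSV files that the Neo4j import will use:

./create_files.sh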

You should see the following at the beginning of the output:

$ ./create_files.sh
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Using /Users/markneedham/projects/neo4j-spark-chicago/Crimes_-_2001_to_present.csv
...

The following files will be generated:

$ ls -alh /tmp/*.csv
-rwxrwxrwx  1 markneedham  wheel   3.0K 14 Apr 06:48 /tmp/beats.csv
-rwxrwxrwx  1 markneedham  wheel   217M 14 Apr 06:48 /tmp/crimes.csv
-rwxrwxrwx  1 markneedham  wheel    84M 14 Apr 06:49 /tmp/crimesBeats.csv
-rwxrwxrwx  1 markneedham  wheel   120M 14 Apr 06:49 /tmp/crimesPrimaryTypes.csv
-rwxrwxrwx  1 markneedham  wheel   912B 14 Apr 06:48 /tmp/primaryTypes.csv
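
Before moving on, you may want to confirm that all five files from the listing above were actually written. The following is a minimal sketch, not part of the repository: `missing_csv_files` is a hypothetical helper name, and the default directory of /tmp is an assumption taken from the listing.

```shell
# Hypothetical helper (not in the repository): print the name of every
# expected CSV file that is missing from the given directory.
# The directory defaults to /tmp, as in the listing above.
missing_csv_files() (
  dir="${1:-/tmp}"
  for name in beats crimes crimesBeats crimesPrimaryTypes primaryTypes; do
    # Print the file name only when the file does not exist.
    [ -f "$dir/$name.csv" ] || echo "$name.csv"
  done
)
```

Running `missing_csv_files /tmp` prints nothing when every file is present, so empty output means it should be safe to continue with the clean-up step.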

Now let’s clean up the CSV files to get rid of empty columns:

./clean.sh

  • Unpack the Neo4j 2.2.3 Enterprise distribution:

tar -xvf neo4j-enterprise-2.2.3-unix.tar.gz

  • Create a Neo4j graph from the CSV files:

./import.sh [/path/to/csv/files] [/path/to/neo4j]

FAQs

Using a different version of Spark

If you want to use a different version of Spark, you’ll need to update the references to it in create_files.sh and build.sbt.

I can’t generate the CSV files

If you forget to set the CSV_FILE environment variable, or set it to a non-existent file, you’ll see the following error message:

$ ./create_files.sh
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Exception in thread "main" java.lang.RuntimeException: Cannot find CSV file [null]
...

Set the CSV_FILE environment variable and you’re good to go:

export CSV_FILE=Crimes_-_2001_to_present.csv
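
If you would rather catch this before Spark starts, a small pre-flight check can help. This is a sketch under assumptions: `check_csv_file` is a hypothetical function that is not part of create_files.sh, and its messages are illustrative only.

```shell
# Hypothetical pre-flight check (not part of create_files.sh): fail early
# when CSV_FILE is unset or does not point at an existing file.
check_csv_file() {
  if [ -z "${CSV_FILE:-}" ]; then
    echo "CSV_FILE is not set" >&2
    return 1
  fi
  if [ ! -f "$CSV_FILE" ]; then
    echo "CSV_FILE points to a missing file: $CSV_FILE" >&2
    return 1
  fi
  echo "Using $CSV_FILE"
}
```

You could then run `check_csv_file && ./create_files.sh` so that the script only starts when the variable is usable.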

I made some changes to the project and want to build it

If you haven’t made any changes to the source code of the project there’s no need to build it - a JAR is checked in and referenced in create_files.sh. You can, however, rebuild the JAR with the following command:

sbt clean package

I can’t build this project

sbt doesn’t seem to honour JAVA_HOME, so if you’re seeing the following error:

2015-04-14 13:57:50,116 INFO  [main] spark.SparkContext (Logging.scala:logInfo(59)) - Job finished: saveAsTextFile at GenerateCSVFiles.scala:51, took 8.292283862 s
Exception in thread "main" java.lang.UnsupportedClassVersionError: MyFileUtil : Unsupported major.minor version 52.0
        at java.lang.ClassLoader.defineClass1(Native Method)
        at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
        at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)

You should set the following environment variable:

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64/

And then retry:

PATH=$JAVA_HOME/bin:$PATH sbt clean package
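
For background on the error above: the number in "Unsupported major.minor version 52.0" is the class file version, and the Java release is that major version minus 44, so 52 is Java 8 and 51 is Java 7. The error therefore suggests that the JAR was compiled with Java 8 but run on an older JVM. The sketch below spells out that arithmetic; both function names are hypothetical and nothing here is part of the project.

```shell
# Hypothetical helpers illustrating the class file version arithmetic.

# Convert a class file major version to a Java release (52 -> 8, 51 -> 7).
class_major_to_java() {
  echo $(( $1 - 44 ))
}

# Succeed when a JVM of release $2 can load class files of major version $1.
jvm_can_load() {
  [ "$2" -ge "$(class_major_to_java "$1")" ]
}
```

For example, `jvm_can_load 52 7` fails, which matches the situation in the stack trace, while `jvm_can_load 51 7` succeeds once the JAR is rebuilt with Java 7.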