Importing Chicago Crime Dataset into Neo4j
The following are instructions for importing the open Chicago crime data set into Neo4j.
Download the Chicago crime data set
Set the CSV_FILE environment variable to the path of the downloaded data set.
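As a minimal sketch, the variable could be exported as shown here. The directory is an assumption (use wherever you saved the download). The file name is the one that appears in the script output further down.

```shell
# Hypothetical location: replace with the path where you saved the download.
export CSV_FILE="$HOME/Downloads/Crimes_-_2001_to_present.csv"

# Print the value so you can confirm it before running anything else.
echo "CSV_FILE=$CSV_FILE"
```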
Download Spark and unpack it:
tar -xvf spark-1.3.0-bin-hadoop1.tgz
Generate the CSV files that Neo4j import will use:
You should see the following at the beginning of the output:
$ ./create_files.sh
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Using /Users/markneedham/projects/neo4j-spark-chicago/Crimes_-_2001_to_present.csv
...
The following files will be generated:
$ ls -alh /tmp/*.csv
-rwxrwxrwx  1 markneedham  wheel  3.0K 14 Apr 06:48 /tmp/beats.csv
-rwxrwxrwx  1 markneedham  wheel  217M 14 Apr 06:48 /tmp/crimes.csv
-rwxrwxrwx  1 markneedham  wheel   84M 14 Apr 06:49 /tmp/crimesBeats.csv
-rwxrwxrwx  1 markneedham  wheel  120M 14 Apr 06:49 /tmp/crimesPrimaryTypes.csv
-rwxrwxrwx  1 markneedham  wheel  912B 14 Apr 06:48 /tmp/primaryTypes.csv
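Before moving on, it may help to confirm that all five files were actually written. The following is only a sketch: the function name `missing_csv_files` is hypothetical and not part of the project, and the file names are taken from the listing above.

```shell
# Print the name of every expected CSV file that is missing or empty
# in the given directory (defaults to /tmp, as in the listing above).
# The function name is hypothetical; it is not part of the project.
missing_csv_files() {
  dir="${1:-/tmp}"
  for name in beats crimes crimesBeats crimesPrimaryTypes primaryTypes; do
    # -s is true only if the file exists and has a size greater than zero.
    [ -s "$dir/$name.csv" ] || echo "$name.csv"
  done
}
```

Running `missing_csv_files /tmp` prints nothing when every file is present.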
Now let’s clean up the CSV files to get rid of empty columns:
Download Neo4j and unpack it:
tar -xvf neo4j-enterprise-2.2.3-unix.tar.gz
Create a Neo4j graph from the CSV files:
./import.sh [/path/to/csv/files] [/path/to/neo4j]
Using a different version of Spark
If you want to use a different version of Spark you’ll need to update the appropriate references in
I can’t generate the CSV files
If you forget to set the CSV_FILE environment variable, or set it to a non-existent file, you’ll see the following error message:
$ ./create_files.sh
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Exception in thread "main" java.lang.RuntimeException: Cannot find CSV file [null]
...
Set the CSV_FILE environment variable to the path of the downloaded file and you’re good to go.
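A small guard along these lines could catch the problem before Spark starts. This is a sketch, not something the project ships: `check_csv_file` is a hypothetical helper name.

```shell
# Hypothetical guard: succeeds only if CSV_FILE is set and names a readable file.
check_csv_file() {
  if [ -z "${CSV_FILE:-}" ]; then
    echo "CSV_FILE is not set" >&2
    return 1
  fi
  if [ ! -r "$CSV_FILE" ]; then
    echo "CSV_FILE points to a missing or unreadable file: $CSV_FILE" >&2
    return 1
  fi
}
```

It could then be chained in front of the generation step, for example `check_csv_file && ./create_files.sh`.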
I made some changes to the project and want to build it
If you haven’t made any changes to the source code of the project there’s no need to build it - a JAR is checked in and referenced in
You can, however, rebuild the JAR with the following command:
sbt clean package
I can’t build this project
sbt doesn’t seem to honour java_home, so if you’re seeing the following error:
2015-04-14 13:57:50,116 INFO [main] spark.SparkContext (Logging.scala:logInfo(59)) - Job finished: saveAsTextFile at GenerateCSVFiles.scala:51, took 8.292283862 s
Exception in thread "main" java.lang.UnsupportedClassVersionError: MyFileUtil : Unsupported major.minor version 52.0
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
You should set the JAVA_HOME environment variable to point at a Java 8 installation (class file version 52.0 corresponds to Java 8).
And then retry:
PATH=$JAVA_HOME/bin:$PATH sbt clean package
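The version number in the error can be decoded with simple arithmetic: for Java 5 and later, the class file major version is the Java release number plus 44. The helper below is a hypothetical illustration of that rule and is not part of the project. Running `java -version` afterwards shows which JVM is actually first on your PATH.

```shell
# Hypothetical helper: map a class file major version to the Java release
# that produces it (valid for Java 5 and later, where major = release + 44).
class_major_to_java() {
  echo $(( $1 - 44 ))
}

# The error above reports major.minor version 52.0.
class_major_to_java 52   # prints 8
```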