The jars in the 'lib' directory are imported into the local .m2 repository,
because they are not available in any web-based Maven repository.
These instructions assume that you are in the 'freeeed-processing' project directory.
Q. Where is my maven in Windows?
A. In NetBeans, it's something like
"C:\Program Files\NetBeans 7.3\java\maven\bin\mvn.bat"
IMPORTANT
To run the Maven import commands in Windows, you need to be at the DOS command prompt, in the freeeed-processing directory.
If you want to use the third-party proprietary driver for PST extraction,
you need jpst.jar, which is licensed. Get the jar and install it with the JPST command for your platform below.
(The Lotus Notes import that comes first is not needed, as Lotus is removed for now.)
mvn install:install-file -DgroupId=com.ibm \
-DartifactId=notes \
-Dversion=7.3.4 \
-Dfile=lib/Notes.jar \
-Dpackaging=jar \
-DgeneratePom=true
Linux:
mvn install:install-file -DgroupId=com.independentsoft \
-DartifactId=JPST \
-Dversion=1.0 \
-Dfile=lib/jpst.jar \
-Dpackaging=jar \
-DgeneratePom=true
Windows:
install_jpst_cygwin.bat
When testing, if you plan to use JPST, run the Maven assembly:single goal first.
For PST processing, you would normally use readpst. JPST is a special case for Windows.
To install readpst, see https://github.com/shmsoft/FreeEed/wiki/FreeEed-Installation
Architectural discussion - deduplication
How it was
The Mapper read the files, and the Reducer combined all records with the same hash and output them with a master or duplicate field.
Now there is no Reducer.
Instead, deduplication uses a hash map (or any NoSQL store):
1. In the Mapper, get the file and calculate its hash
2. Check whether the file's hash has already been seen
3. If yes, the file is a duplicate, and we put the UPI of the master in its 'master' field. If not, the file is itself the master and its 'master' field is empty
4. In distributed mode, we use DynamoDB (or its local version for non-distributed mode)
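A minimal sketch of steps 1-3, using an in-memory map in place of DynamoDB. The class and method names (DedupSketch, hashOf, masterFieldFor) are hypothetical and not taken from the FreeEed code base, and SHA-256 is an assumption, since these notes do not say which hash is used.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical illustration only; not the actual FreeEed implementation.
class DedupSketch {
    // Maps a content hash to the UPI of the first file seen with that hash (the master).
    // In distributed mode this map would be replaced by a shared store such as DynamoDB.
    private final Map<String, String> seen = new ConcurrentHashMap<>();

    // Step 1: calculate the hash of the file content (SHA-256 is an assumed choice).
    static String hashOf(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(content)) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be present on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Steps 2 and 3: return the value for the 'master' field.
    // An empty string means this file is the master; otherwise it is the master's UPI.
    String masterFieldFor(String upi, byte[] content) {
        // putIfAbsent records the hash and checks for an earlier entry in one atomic step.
        String masterUpi = seen.putIfAbsent(hashOf(content), upi);
        return masterUpi == null ? "" : masterUpi;
    }
}
```

Calling masterFieldFor once per file from the Mapper would give each record its 'master' value without a Reducer. Doing the lookup and the insert as one atomic operation matters when several mappers run at once, since otherwise two copies of the same file could both be treated as masters.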