f89a6de Matteo Redaelli moved web configs at application level
matteoredaelli authored
%% -*- mode: erlang -*-

[
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 %% SASL config
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

 {sasl, [
   {sasl_error_logger, {file, "priv/log/sasl-error.log"}},
   {errlog_type, error},
   {error_logger_mf_dir, "priv/log/sasl"},      % Log directory
   {error_logger_mf_maxbytes, 10485760},        % 10 MB max file size
   {error_logger_mf_maxfiles, 5}                % 5 files max
  ]
 },
 {ebot, [

   %% !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
   %% see EBOT options in ebot.app and add your changes here!
   %% !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

   %% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
   %% CACHE
   %% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

   %% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
   %% DATABASE
   %% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
   %%
   %% you need to set the db backend (COUCHDB or RIAK)
   %% in src/ebot.hrl file
   {db_hostname, "127.0.0.1"},
   %% COUCHDB
   %%{db_port, 5984},
   %% RIAK
   {db_port, 8087},

   %% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
   %% MQ
   %% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

   %% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
   %% WEB
   %% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

   %% -------------------------------------------------------------------------------------------------
   %% normalize_url
   %% -------------------------------------------------------------------------------------------------
   %%
   %% {normalize_url, [{RE, NormalizeUrlOptions},..]}
   %%
   %% options of normalize_url:
   %%   add_final_slash
   %%   to_lower_case: urls may be case insensitive, and some web pages have links with uppercase letters
   %%   without_internal_links
   %%   without_queries
   %%   {max_depth, 2}
   %%     the url path will be truncated to at most max_depth levels
   %%     http://www.redaelli.org/matteo/blog/a/ -> http://www.redaelli.org/matteo/blog/
   %%     should be the same as "tot_new_urls_queues" in ebot_mq.conf
   %%     you should also start at least one crawler for each depth in [0,max_depth]. see "workers_pool" in this file
   %%   (TODO) {remove_filename, false}
   {normalize_url,
    [
     %% {"\\.com/",
     %%  [
     %%   {plugin, ebot_url_util, url_domain},
     %%   add_final_slash,
     %%   to_lower_case
     %%  ]
     %% },

     %% !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
     %% default setting for normalize_url
     %% !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
     %% remember, at least one regexp must match all urls
     %% "." should be used
     {".",
      [
       %% -------------------------------------------------------------------
       %% {plugin, Module, Function/1}
       %% -------------------------------------------------------------------
       %% you can call a custom module:function(Url) for normalizing urls

       %% are you interested only in domain homepages?
       %% {plugin, ebot_url_util, url_domain},

       %% removing blank characters at the beginning and end of the url string:
       %% yes, it sometimes happens!
       strip,

       %% -------------------------------------------------------------------
       %% {replace_string, [{from,to},..]}
       %% -------------------------------------------------------------------
       {replace_string, [
         %% http://www.gettyre.it/motoweb/XXX;jsessionid=250485C.sae_1
         {";[A-Za-z0-9]+=[^&;?]+", ""},
         %% some sites have newlines in url links:
         %% see http://opensource.linux-mirror.org/index.php
         %% TODO maybe it still doesn't work
         {"\n",""},
         %% http://github.com/dizzyd/ibrowse
         {"&quot\$",""}
        ]},
       %% -------------------------------------------------------------------
       %% add_final_slash
       %% -------------------------------------------------------------------
       %% example: http://www.redaelli.org => http://www.redaelli.org/
       add_final_slash,

       %% -------------------------------------------------------------------
       %% {max_depth, 3}
       %% -------------------------------------------------------------------
       %% paths deeper than max_depth are truncated to max_depth
       %% for instance, if {max_depth,0}
       %% http://www.redaelli.org/matteo/ => http://www.redaelli.org/
       {max_depth, 4},

       %% -------------------------------------------------------------------
       %% to_lower_case
       %% -------------------------------------------------------------------
       %% for some web servers urls are case insensitive.
       %% it is safer to lowercase all urls in order to avoid duplicates in the database
       %% {plugin, string, to_lower},

       %% -------------------------------------------------------------------
       %% without_internal_links
       %% -------------------------------------------------------------------
       %% internal links (#) are removed
       without_internal_links,

       %% -------------------------------------------------------------------
       %% without_queries
       %% -------------------------------------------------------------------
       %% parameters, like ?a=1&b=3, are removed from urls
       without_queries
      ]} %% end default
    ] % end list of {regexp, ListOptions}
   }, %% end normalize_url
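
   %% A minimal sketch of a custom normalizer for the {plugin, Module, Function}
   %% option described above. It assumes the plugin is called as
   %% Module:Function(Url) with Url as a string and returns the normalized
   %% string, as the {plugin, string, to_lower} example suggests.
   %% my_url_util and drop_www are hypothetical names, not part of ebot:
   %%
   %%   -module(my_url_util).
   %%   -export([drop_www/1]).
   %%
   %%   %% "http://www.example.org/a" becomes "http://example.org/a";
   %%   %% any other url is returned unchanged.
   %%   drop_www("http://www." ++ Rest) -> "http://" ++ Rest;
   %%   drop_www(Url) -> Url.
   %%
   %% It would then be enabled inside an option list of normalize_url with:
   %%   {plugin, my_url_util, drop_www},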

   %% -------------------------------------------------------------------------------------------------
   %% tobe_saved_headers
   %% -------------------------------------------------------------------------------------------------
   %% headers (if they exist) that will be saved in the database
   {tobe_saved_headers,
    [
     <<"content-length">>,
     <<"content-type">>,
     <<"server">>,
     <<"x-powered-by">>
    ]},
   %% -------------------------------------------------------------------------------------------------
   %% is_valid_image
   %% -------------------------------------------------------------------------------------------------
   %%
   %% this option is useful to check image links before converting them to absolute urls
   %% when they are relative links
   %%
   {is_valid_image,
    [
     %% the url will be analyzed only if ALL regexps are satisfied
     {validate_all_url_regexps, [
       %% the dot is escaped with a double backslash, as in is_valid_url:
       %% a single "\." in an Erlang string is just ".", which matches any character
       {nomatch, "\\.bmp$"},
       {nomatch, "\\.raw$"}
      ]
     }
    ]
   },
   %% -------------------------------------------------------------------------------------------------
   %% is_valid_link
   %% -------------------------------------------------------------------------------------------------
   %%
   %% this option is useful to check links before converting them to absolute urls
   %% when they are relative links
   %%
   {is_valid_link,
    [
     %% the url will be analyzed only if ALL regexps are satisfied
     {validate_all_url_regexps, [
       {nomatch, "feed:"},
       {nomatch, "ftp:"},
       {nomatch, "javascript:"},
       {nomatch, "mailto:"},
       {nomatch, "news:"}
      ]
     }
    ]
   },

   %% -------------------------------------------------------------------------------------------------
   %% is_valid_url
   %% -------------------------------------------------------------------------------------------------
   %%
   {is_valid_url,
    [
     %% you can call your custom function that will return true or false
     %% {plugin, Module, function},

     %% silly function : {plugin, erlang, is_list},
     %% a url is valid if its mime type satisfies any of the following regexps
     {validate_any_mime_regexps, [
       {match, "^text/"}
       %%,{match, "^image/"}
      ]
     },
     %% the url will be analyzed only if ALL the following regexps are satisfied
     {validate_all_url_regexps, [
       {match, "^http://"},
       %% {nomatch, "^https"},

       {nomatch, "//.+//"},
       {nomatch, "/bugs/"},
       {nomatch, "viewcvs"},
       %% Skipping Apache.org urls
       {nomatch, "\\.apache\\..+/dist/"},
       {nomatch, "/snapshots/"},
       {nomatch, "^http://mail-archives"},
       {nomatch, "bugs.+/.+"},
       %% apache mirror sites.. TODO
       {nomatch, "apache\\.fastbull\\.org/.+"},

       %% Skipping unwanted files
       {nomatch, "\\.deb$"},
       {nomatch, "\\.git$"},
       {nomatch, "\\.tgz$"},
       {nomatch, "\\.jar$"},
       {nomatch, "\\.rpm$"},
       {nomatch, "\\.tar$"},
       {nomatch, "\\.gz$"},
       {nomatch, "\\.makefile$"},
       {nomatch, "\\.Makefile$"},
       %% Skipping CVS repositories
       {nomatch, "/cvs/\\."},

       %% Skipping useless Github pages
       %% ("\\..+" is a literal dot followed by any characters; the previous
       %% "github\\.+/issues" only matched dots directly before "/issues")
       {nomatch, "github\\..+/issues"},
       {nomatch, "gist\\.github\\.com"},
       %% the page gives incomplete header
       {nomatch, "svn\\.github\\.com"},

       %% Skipping useless Gitorious pages
       {nomatch, "git.+/merge_requests/"},
       {nomatch, "git.+/commits/"},
       {nomatch, "git.+/trees/"},

       %% Skipping Git repositories
       {nomatch, "git.+/commit/"},
       {nomatch, "git.+/tree/"},

       %% Skipping HG repositories
       {nomatch, "/changeset/"},

       %% Skipping SVN repositories
       {nomatch, "svn.+/viewvc/.+/"},
       {nomatch, "/svn[\\./]"},
       {nomatch, "/branches"},
       {nomatch, "/trunk"},
       {nomatch, "/tags"}
      ]
     }, %% end of validate_all_url_regexps

     %% The url will be analyzed if ANY of the following regexps is satisfied.
     %% Here you should put the list of web sites to be visited by ebot.

     {validate_any_url_regexps, [
       %% at least one regexp must be defined
       %% {match,"."},
       {match, "redaelli\\.org"},
       %% Opensource projects
       {match, "apache\\.org"},
       {match, "freshmeat\\.net"},
       {match, "github\\.com"},
       {match, "code\\.google\\.com"},
       {match, "sourceforge\\.net"},
       {match, "ohloh\\.net"},
       {match, "bitbucket\\.org"},
       {match, "www\\.gettyre\\.it"},
       {match, "www\\.tyres-pneus-online\\.co\\.uk"}
      ]
     }
    ] %% end list of options
   } %% end is_valid_url
   ,
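
   %% In validate_all_url_regexps above, {match, "^http://"} restricts the
   %% crawl to plain HTTP urls. A sketch of an alternative entry that accepts
   %% both schemes, assuming the HTTP client used by ebot can fetch https
   %% pages (this file does not confirm that):
   %%
   %%   {match, "^https?://"},
   %%
   %% "s?" makes the "s" optional, so the regexp matches both "http://" and
   %% "https://" at the start of the url.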

   %% -------------------------------------------------------------------------------------------------
   %% obsolete_urls_after_days
   %% -------------------------------------------------------------------------------------------------
   %%
   %% after how many days a url stored in the DB becomes obsolete

   {obsolete_urls_after_days, 10},

   %% -------------------------------------------------------------------------------------------------
   %% save_referrals
   %% -------------------------------------------------------------------------------------------------
   %%
   %% (cumulable) values: external, domain, subdomain
   %%
   %% domain: means same domain,
   %%   e.g.: true  <= http://www.redaelli.org/a and http://www.redaelli.org/
   %%         false <= http://www.redaelli.org/a and http://redaelli.org/
   %%
   %% subdomain: means same main domain but not same domain,
   %%   e.g.: false <= http://www.redaelli.org/a and http://www.redaelli.org/
   %%         true  <= http://www.redaelli.org/a and http://redaelli.org/
   %%
   %% external: means not same domain and not same main domain
   %%   e.g.: false <= http://www.redaelli.org/a and http://www.redaelli.org/
   %%         false <= http://www.redaelli.org/a and http://redaelli.org/
   %%         true  <= http://www.redaelli.org/a and http://matteoredaelli.wordpress.com/

   {save_referrals, [external]},
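   %% Since the values are cumulable, a sketch of a setting that would save
   %% referrals of every kind (an illustration only, not the shipped default):
   %%
   %%   {save_referrals, [domain, subdomain, external]},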

   %% -------------------------------------------------------------------------------------------------
   %% workers_pool
   %% -------------------------------------------------------------------------------------------------
   %%
   %% how many crawler threads will be started for each candidate url queue/depth
   %% {workers_pool, [{0,3},{1,2},{2,1}]} means
   %%   3 crawlers will analyze urls taken from AMQP queue ebot.new.0, which contains urls with depth==0
   %%     (ex. http://www.redaelli.org, http://www.redaelli.org/index.html)
   %%   2 crawlers will analyze urls taken from AMQP queue ebot.new.1, which contains urls with depth==1
   %%     (ex. http://www.redaelli.org/matteo/, http://www.redaelli.org/matteo/index.html)
   %%   1 crawler will analyze urls taken from AMQP queue ebot.new.2, which contains urls with depth==2

   %%{workers_pool, [{0,4}, {1,2}, {2,1}] },
   {workers_pool, [{0,2}, {1,2}, {2,2}, {3,2}, {4,2} ] },
   %% -------------------------------------------------------------------------------------------------
   %% start_workers_at_boot
   %% -------------------------------------------------------------------------------------------------
   %%
   %% are the crawlers started automatically at boot time?
   {start_workers_at_boot, true},

   %% -------------------------------------------------------------------------------------------------
   %% workers_sleep_time
   %% -------------------------------------------------------------------------------------------------
   %%
   %% how many milliseconds will each crawler sleep between two url crawls?
   %% this option is useful to avoid heavy workloads on the visited websites
   %% and on the ebot system if the hardware is not powerful enough

   {workers_sleep_time, 2000}

  ]
 }
].