A few documentation improvements

JIRA: MADLIB-569

Small updates in Feature Extraction, Compatibility, Linear Regression, and Utilities
commit 16820bbbbdfcdd54a21ca5d4bf915f2f61729042 1 parent 357c7b0
Florian Schoppmann authored
84 methods/textfex_viterbi/src/pg_gp/textfex.sql_in
@@ -4,7 +4,7 @@
4 4 *
5 5 * @brief SQL function for text feature extraction
6 6 * @date February 2012
7   - *
  7 + *
8 8 * @sa For an introduction to text feature extraction, see the module
9 9 * description \ref grp_textfex_viterbi
10 10 *//* ----------------------------------------------------------------------- */
@@ -15,11 +15,11 @@ m4_include(`SQLCommon.m4')
15 15 @addtogroup grp_textfex_viterbi
16 16
17 17 @about
18   -This module provides a functionality of the feature extraction for basic text
19   -analysis tasks such as part-of-speech(POS) tagging, named entity resolution.
20   -In addition to the feature extraction, it also has a Viterbi implementation
  18 +The Feature Extraction module provides functionality for basic text-analysis
  19 +tasks such as part-of-speech (POS) tagging and named-entity resolution.
  20 +In addition to feature extraction, it also has a Viterbi implementation
21 21 to get the best label sequence and the conditional probability
22   -\f$ p(top1_label_sequence|sentence) \f$.
  22 +\f$ \Pr( \text{best label sequence} \mid \text{Sentence}) \f$.
23 23
24 24 At present, six feature types are implemented.
25 25 - Edge Feature: transition feature that encodes the transition feature
@@ -37,7 +37,7 @@ You can add your own feature type according to the training model.
37 37
38 38 Instead of scanning every token in a sentence and extracting features for
39 39 each token on the fly, we extract features for each distinct token and
40   -materialize it in the table. When we call viterbi function to get the best
  40 +materialize it in the table. When we call the Viterbi function to get the best
41 41 label sequence, we only need a single lookup to get the feature weight.
42 42
43 43 @usage
@@ -46,7 +46,7 @@ label sequence, we only need a single lookup to get the feature weight.
46 46 to convert the data in the model files to the data format required by in
47 47 this module.
48 48
49   - - Load model from local drive to database
  49 + - Load model from local drive to database
50 50 <pre>SELECT madlib.load_crf_model(
51 51 '<em>/path/to/data</em>');</pre>
52 52
@@ -110,7 +110,7 @@ label sequence, we only need a single lookup to get the feature weight.
110 110 * ...
111 111 * edgeFeature 44 a a a a a a...a
112 112 * endFeature 45 a a a a a a...a</pre>
113   - *
  113 + *
114 114 * - viterbi_r table
115 115 * is related to specific tokens. It encodes the single state features,
116 116 * e.g., wordFeature, RegexFeature for all tokens. The r table is represented
@@ -180,46 +180,46 @@ $$
180 180 rv = plpy.execute("SELECT COUNT(*) AS total_label FROM " + labeltbl + ";")
181 181 nlabel = rv[0]['total_label']
182 182
183   - plpy.execute("""INSERT INTO segment_hashtbl(seg_text)
  183 + plpy.execute("""INSERT INTO segment_hashtbl(seg_text)
184 184 SELECT DISTINCT seg_text
185 185 FROM """ + segmenttbl + """;""")
186 186
187   - plpy.execute("""INSERT INTO unknown_segment_hashtbl(seg_text)
188   - ((SELECT DISTINCT seg_text
189   - FROM segment_hashtbl)
  187 + plpy.execute("""INSERT INTO unknown_segment_hashtbl(seg_text)
  188 + ((SELECT DISTINCT seg_text
  189 + FROM segment_hashtbl)
190 190 EXCEPT
191   - (SELECT DISTINCT token
192   - FROM """ + dictionary + """
  191 + (SELECT DISTINCT token
  192 + FROM """ + dictionary + """
193 193 WHERE total>1));""")
194 194
195 195 plpy.execute("""INSERT INTO prev_labeltbl
196 196 SELECT id
197 197 FROM """ + labeltbl + """;
198   - INSERT INTO prev_labeltbl VALUES(-1);
  198 + INSERT INTO prev_labeltbl VALUES(-1);
199 199 INSERT INTO prev_labeltbl VALUES( """ + str(nlabel) + """);""")
200 200
201 201 # Generate sparse M factor table
202   - plpy.execute("""INSERT INTO mtbl(prev_label, label, value)
203   - SELECT prev_label.id, label.id, 0
204   - FROM """ + labeltbl + """ AS label,
  202 + plpy.execute("""INSERT INTO mtbl(prev_label, label, value)
  203 + SELECT prev_label.id, label.id, 0
  204 + FROM """ + labeltbl + """ AS label,
205 205 prev_labeltbl as prev_label;""")
206 206
207   - # EdgeFeature and startFeature, startFeature can be considered as a special edgeFeature
208   - plpy.execute("""INSERT INTO mtbl(prev_label, label, value)
  207 + # EdgeFeature and startFeature, startFeature can be considered as a special edgeFeature
  208 + plpy.execute("""INSERT INTO mtbl(prev_label, label, value)
209 209 SELECT prev_label_id,label_id,weight
210   - FROM """ + featuretbl + """ AS features
  210 + FROM """ + featuretbl + """ AS features
211 211 WHERE features.prev_label_id<>-1 OR features.name = 'S.';""")
212 212
213 213 # EndFeature, endFeature can be considered as a special edgeFeature
214   - plpy.execute("""INSERT INTO mtbl(prev_label, label, value)
  214 + plpy.execute("""INSERT INTO mtbl(prev_label, label, value)
215 215 SELECT """ + str(nlabel) + """, label_id, weight
216   - FROM """ + featuretbl + """ AS features
  216 + FROM """ + featuretbl + """ AS features
217 217 WHERE features.name = 'End.';""")
218 218
219 219 m4_ifdef(`__HAS_ORDERED_AGGREGATES__', `
220 220 plpy.execute("""INSERT INTO {viterbi_mtbl}
221   - SELECT array_agg(weight ORDER BY prev_label,label)
222   - FROM (SELECT prev_label, label, (SUM(value)*1000)::integer AS weight
  221 + SELECT array_agg(weight ORDER BY prev_label,label)
  222 + FROM (SELECT prev_label, label, (SUM(value)*1000)::integer AS weight
223 223 FROM mtbl
224 224 GROUP BY prev_label,label
225 225 ORDER BY prev_label,label) as TEMP_MTBL;""".format(
@@ -241,37 +241,37 @@ m4_ifdef(`__HAS_ORDERED_AGGREGATES__', `
241 241 ))
242 242 ')
243 243
244   - plpy.execute("""INSERT INTO rtbl(seg_text, label, value)
245   - SELECT segment_hashtbl.seg_text, labels.id, 0
246   - FROM segment_hashtbl segment_hashtbl,
  244 + plpy.execute("""INSERT INTO rtbl(seg_text, label, value)
  245 + SELECT segment_hashtbl.seg_text, labels.id, 0
  246 + FROM segment_hashtbl segment_hashtbl,
247 247 """ + labeltbl + """ AS labels;""")
248 248
249 249 # RegExFeature
250   - plpy.execute("""INSERT INTO rtbl(seg_text, label, value)
251   - SELECT segment_hashtbl.seg_text, features.label_id, features.weight
252   - FROM segment_hashtbl AS segment_hashtbl,
  250 + plpy.execute("""INSERT INTO rtbl(seg_text, label, value)
  251 + SELECT segment_hashtbl.seg_text, features.label_id, features.weight
  252 + FROM segment_hashtbl AS segment_hashtbl,
253 253 """ + featuretbl + """ AS features,
254 254 """ + regextbl + """ AS regex
255   - WHERE segment_hashtbl.seg_text ~ regex.pattern
  255 + WHERE segment_hashtbl.seg_text ~ regex.pattern
256 256 AND features.name||'%' ='R_' || regex.name;""")
257 257
258 258 # UnknownFeature
259   - plpy.execute("""INSERT INTO rtbl(seg_text, label, value)
260   - SELECT segment_hashtbl.seg_text, features.label_id, features.weight
261   - FROM unknown_segment_hashtbl AS segment_hashtbl,
262   - """ + featuretbl + """ AS features
  259 + plpy.execute("""INSERT INTO rtbl(seg_text, label, value)
  260 + SELECT segment_hashtbl.seg_text, features.label_id, features.weight
  261 + FROM unknown_segment_hashtbl AS segment_hashtbl,
  262 + """ + featuretbl + """ AS features
263 263 WHERE features.name = 'U';""")
264 264
265 265 # Wordfeature
266   - plpy.execute("""INSERT INTO rtbl(seg_text, label, value)
267   - SELECT seg_text, label_id, weight
268   - FROM segment_hashtbl,
269   - """ + featuretbl + """
  266 + plpy.execute("""INSERT INTO rtbl(seg_text, label, value)
  267 + SELECT seg_text, label_id, weight
  268 + FROM segment_hashtbl,
  269 + """ + featuretbl + """
270 270 WHERE name = 'W_' || seg_text;""")
271 271
272 272 # Factor table
273   - plpy.execute("""INSERT INTO """ + viterbi_rtbl + """(seg_text, label, score)
274   - SELECT seg_text,label,(SUM(value)*1000)::integer AS score
  273 + plpy.execute("""INSERT INTO """ + viterbi_rtbl + """(seg_text, label, score)
  274 + SELECT seg_text,label,(SUM(value)*1000)::integer AS score
275 275 FROM rtbl
276 276 GROUP BY seg_text,label;""")
277 277
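The hunks above build two factor tables: an M table of transition scores keyed by (prev_label, label), where prev_label -1 marks the start and prev_label = nlabel carries the end-feature weights, and an R table of per-token scores keyed by (seg_text, label). Both are scaled by 1000 and cast to integers. The following is a minimal sketch of how a Viterbi pass could consume such tables. It is not the module's implementation: the function name `viterbi` and the in-memory dictionaries `m` and `r` are hypothetical stand-ins for the database tables, and missing entries are assumed to score 0.

```python
def viterbi(tokens, nlabel, m, r):
    """Return (best_score, best_label_sequence) for a token list.

    m[(prev_label, label)] -> integer transition score (hypothetical stand-in
        for the M table); prev_label -1 is the start state and
        prev_label == nlabel holds the end-feature weight for `label`.
    r[(token, label)] -> integer per-token score (stand-in for the R table).
    Missing keys are assumed to score 0.
    """
    if not tokens:
        return 0, []
    labels = range(nlabel)

    # Best score of any path ending in each label at the first token.
    score = {l: m.get((-1, l), 0) + r.get((tokens[0], l), 0) for l in labels}
    back = []  # back[i][l] = best previous label for label l at token i + 1

    for tok in tokens[1:]:
        new_score, ptr = {}, {}
        for l in labels:
            # One dictionary lookup per (prev, label) pair and per (token, label).
            best_prev = max(labels, key=lambda p: score[p] + m.get((p, l), 0))
            new_score[l] = (score[best_prev] + m.get((best_prev, l), 0)
                            + r.get((tok, l), 0))
            ptr[l] = best_prev
        score = new_score
        back.append(ptr)

    # Add the end-feature weight, stored under prev_label == nlabel.
    final = {l: score[l] + m.get((nlabel, l), 0) for l in labels}
    last = max(labels, key=lambda l: final[l])

    # Follow the back-pointers to recover the label sequence.
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return final[last], path


# Toy data: two labels (0 and 1), scores already scaled by 1000.
m = {(-1, 0): 1000, (0, 1): 2000, (1, 0): 500, (2, 1): 300}
r = {("the", 0): 1500, ("dog", 1): 2500, ("dog", 0): 100}
best_score, best_path = viterbi(["the", "dog"], 2, m, r)
```

Because the per-token scores are materialized once per distinct token, the inner loop only performs lookups, which is the point made in the module description.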
27 src/ports/greenplum/modules/compatibility/compatibility.sql_in
... ... @@ -1,4 +1,4 @@
1   -/* ----------------------------------------------------------------------- *//**
  1 +/* ----------------------------------------------------------------------- *//**
2 2 *
3 3 * @file compatibility.sql_in
4 4 *
@@ -22,21 +22,24 @@ This module contains workarounds for the following issues:
22 22
23 23 - <tt>CREATE TABLE <em>table_name</em> AS <em>query</em></tt> statements where
24 24 <em>query</em> contains certain MADlib functions fails with the error
25   - “function cannot execute on segment because it issues a non-SELECT statement”.
  25 + “function cannot execute on segment because it issues a non-SELECT statement”
  26 + (on Greenplum before version 4.2).
26 27 The workaround is:
27 28 <pre>SELECT \ref create_table_as('<em>table_name</em>', $$
28 29 <em>query</em>
29 30 $$, 'BY (<em>column</em>, [...]) | RANDOMLY');</pre>
30 31 - <tt>INSERT INTO <em>table_name</em> <em>query</em></tt> where <em>query</em>
31 32 contains certain MADlib functions fails with the error “function cannot
32   - execute on segment because it issues a non-SELECT statement”. The workaround
33   - is:
  33 + execute on segment because it issues a non-SELECT statement” (on Greenplum
  34 + before version 4.2). The workaround is:
34 35 <pre>SELECT \ref insert_into('<em>table_name</em>', $$
35 36 <em>query</em>
36 37 $$);</pre>
37 38
38 39 @note
39   -These functions are not installed on other DBMSs (and not needed there).
  40 +These functions are not installed on other DBMSs (and not needed there). On
  41 +Greenplum 4.2 and later, they are installed only for backward compatibility,
  42 +but not otherwise needed.
40 43 Workarounds should be used only when necessary. For portability and best
41 44 performance, standard SQL should be prefered whenever possible.
42 45
@@ -77,7 +80,7 @@ $$;
77 80 * <pre>SELECT insert_into('<em>table_name</em>', $$
78 81 * <em>query</em>
79 82 *$$);</pre>
80   - *
  83 + *
81 84 * @examp
82 85 * <pre>SELECT insert_into('public.test', $$
83 86 * SELECT * FROM generate_series(1,10) AS id
@@ -109,7 +112,7 @@ BEGIN
109 112 EXECUTE 'SET client_min_messages TO warning';
110 113
111 114 PERFORM MADLIB_SCHEMA.create_schema_pg_temp();
112   -
  115 +
113 116 EXECUTE
114 117 'DROP FUNCTION IF EXISTS pg_temp._madlib_temp_function();
115 118 CREATE FUNCTION pg_temp._madlib_temp_function()
@@ -158,16 +161,16 @@ BEGIN
158 161 ELSE
159 162 whatToCreate := 'TABLE';
160 163 END IF;
161   -
  164 +
162 165 -- We separate the following EXECUTE statement because it is prone
163 166 -- to generate an exception -- e.g., if the table already exists
164 167 -- In that case we want to keep the context in the error message short
165 168 EXECUTE
166   - 'CREATE ' || whatToCreate || ' ' || "inTableName" || ' AS
  169 + 'CREATE ' || whatToCreate || ' ' || "inTableName" || ' AS
167 170 SELECT * FROM (' || "inSQL" || ') AS _madlib_ignore
168 171 WHERE FALSE
169 172 DISTRIBUTED ' || "inDistributed";
170   -
  173 +
171 174 PERFORM MADLIB_SCHEMA.insert_into("inTableName", "inSQL");
172 175 END;
173 176 $$;
@@ -189,7 +192,7 @@ $$;
189 192 * <pre>SELECT create_table_as('<em>table_name</em>', $$
190 193 * <em>query</em>
191 194 *$$, 'BY (<em>column</em>, [...]) | RANDOMLY');</pre>
192   - *
  195 + *
193 196 * @examp
194 197 * <pre>SELECT create_table_as('public.test', $$
195 198 * SELECT * FROM generate_series(1,10) AS id
@@ -204,7 +207,7 @@ $$;
204 207 * Known caveats of this workaround:
205 208 * - For queries returning a large number of rows, this function will be
206 209 * significantly slower than the <tt>CREATE TABLE AS</tt> statement.
207   - */
  210 + */
208 211 CREATE FUNCTION MADLIB_SCHEMA.create_table_as(
209 212 "inTableName" VARCHAR,
210 213 "inSQL" VARCHAR,
16 src/ports/postgres/modules/regress/linear.sql_in
... ... @@ -1,4 +1,4 @@
1   -/* ----------------------------------------------------------------------- *//**
  1 +/* ----------------------------------------------------------------------- *//**
2 2 *
3 3 * @file linear.sql_in
4 4 *
@@ -110,7 +110,7 @@ is defined as
110 110 \f[
111 111 \frac{\max_{\| z \|_2 = 1} \| X z \|_2}{\min_{\| z \|_2 = 1} \| X z \|_2} .
112 112 \f]
113   -The condition number of a problem is a worst-case measure of sensitive the
  113 +The condition number of a problem is a worst-case measure of how sensitive the
114 114 result is to small perturbations of the input. A large condition number (say,
115 115 more than 1000) indicates the presence of significant multicollinearity.
116 116
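To make the ratio in the condition-number definition concrete, here is a small sketch for the special case of a design matrix with exactly two columns. It relies on the standard fact that the maximum and minimum of the norm of X z over unit vectors z are the largest and smallest singular values of X, which are the square roots of the eigenvalues of X^T X. The helper name `condition_number_2col` is hypothetical and is not part of MADlib.

```python
import math


def condition_number_2col(rows):
    """Condition number of an n-by-2 matrix given as a list of (x, y) rows.

    Uses the closed-form eigenvalues of the symmetric 2-by-2 matrix
    X^T X = [[a, b], [b, d]].
    """
    a = sum(x * x for x, _ in rows)
    b = sum(x * y for x, y in rows)
    d = sum(y * y for _, y in rows)

    mean = (a + d) / 2.0
    radius = math.hypot((a - d) / 2.0, b)
    lam_max = mean + radius
    lam_min = mean - radius

    # Exactly (or numerically) collinear columns: the ratio is unbounded.
    if lam_min <= 0.0:
        return math.inf
    return math.sqrt(lam_max / lam_min)


well_conditioned = condition_number_2col([(3.0, 0.0), (0.0, 1.0)])
nearly_collinear = condition_number_2col([(1.0, 1.0), (1.0, 1.001)])
```

In the second call the two columns are almost identical, so the result lands well above the threshold of 1000 that the text gives as a sign of significant multicollinearity.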
@@ -175,31 +175,31 @@ sql> COPY houses FROM STDIN WITH DELIMITER '|';
175 175 -# You can call the linregr() function for an individual metric:
176 176 \verbatim
177 177 sql> SELECT (linregr(price, array[1, bedroom, bath, size])).coef FROM houses;
178   - coef
  178 + coef
179 179 ------------------------------------------------------------------------
180 180 {27923.4334170641,-35524.7753390234,2269.34393735323,130.793920208133}
181 181 (1 row)
182 182
183 183 sql> SELECT (linregr(price, array[1, bedroom, bath, size])).r2 FROM houses;
184   - r2
  184 + r2
185 185 -------------------
186 186 0.745374010140315
187 187 (1 row)
188 188
189 189 sql> SELECT (linregr(price, array[1, bedroom, bath, size])).std_err FROM houses;
190   - std_err
  190 + std_err
191 191 ----------------------------------------------------------------------
192 192 {56306.4821787474,25036.6537279169,22208.6687270562,36.208642285651}
193 193 (1 row)
194 194
195 195 sql> SELECT (linregr(price, array[1, bedroom, bath, size])).t_stats FROM houses;
196   - t_stats
  196 + t_stats
197 197 ------------------------------------------------------------------------
198 198 {0.495918628487924,-1.41891067892239,0.10218279921428,3.6122293450358}
199 199 (1 row)
200 200
201 201 sql> SELECT (linregr(price, array[1, bedroom, bath, size])).p_values FROM houses;
202   - p_values
  202 + p_values
203 203 -----------------------------------------------------------------------------
204 204 {0.629711069315512,0.183633155781461,0.920450514073051,0.00408159079312354}
205 205 (1 row)
@@ -305,7 +305,7 @@ LANGUAGE C IMMUTABLE STRICT;
305 305 CREATE AGGREGATE MADLIB_SCHEMA.linregr(
306 306 /*+ "dependentVariable" */ DOUBLE PRECISION,
307 307 /*+ "independentVariables" */ DOUBLE PRECISION[]) (
308   -
  308 +
309 309 SFUNC=MADLIB_SCHEMA.linregr_transition,
310 310 STYPE=float8[],
311 311 FINALFUNC=MADLIB_SCHEMA.linregr_final,
11 src/ports/postgres/modules/utilities/utilities.sql_in
@@ -110,11 +110,12 @@ $$;
110 110 /**
111 111 * @brief Check if a floating-point number is NaN (not a number)
112 112 *
113   - * This function exists for portability. Some DBMSs like PostgreSQL make
114   - * floating-point numbers a fully ordered set -- contrary to IEEE 754.
115   - * http://www.postgresql.org/docs/current/static/datatype-numeric.html#DATATYPE-FLOAT
116   - * For portability, MADlib code should not make use of such "features" directly,
117   - * but instead only use isnan() instead.
  113 + * This function exists for portability. Some DBMSs like PostgreSQL treat
  114 + * floating-point numbers as fully ordered -- contrary to IEEE 754. (See, e.g.,
  115 + * the <a href=
  116 + * "http://www.postgresql.org/docs/current/static/datatype-numeric.html#DATATYPE-FLOAT"
  117 + * >PostgreSQL documentation</a>. For portability, MADlib code should not make
  118 + * use of such "features" directly, but only use isnan() instead.
118 119 *
119 120 * @param number
120 121 * @returns \c TRUE if \c number is \c NaN, \c FALSE otherwise
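The portability concern in this comment can be illustrated outside the database. Under IEEE 754, every ordered comparison involving NaN is false and NaN is not equal to itself, whereas PostgreSQL documents that it treats NaN as equal to NaN and greater than all non-NaN values so that floats can be sorted and indexed. The sketch below contrasts the two behaviors. The helpers `is_nan` and `pg_float_cmp` are hypothetical names that only emulate the documented ordering; they are not MADlib or PostgreSQL APIs.

```python
import math

nan = float("nan")


def is_nan(x):
    """Portable NaN test: under IEEE 754, NaN is the only value with x != x."""
    return x != x


def pg_float_cmp(a, b):
    """Emulate a total order in which NaN equals NaN and exceeds everything else.

    Returns -1, 0, or 1, like a three-way comparison.
    """
    a_nan, b_nan = is_nan(a), is_nan(b)
    if a_nan and b_nan:
        return 0
    if a_nan:
        return 1
    if b_nan:
        return -1
    return (a > b) - (a < b)


# IEEE 754 semantics, as implemented by Python floats:
ieee_equal = (nan == nan)       # False
ieee_greater = (nan > 1e308)    # False

# Total-order semantics, as emulated above:
total_equal = pg_float_cmp(nan, nan) == 0      # True
total_greater = pg_float_cmp(nan, 1e308) == 1  # True
```

Code that tests for NaN with an equality or ordering comparison would therefore behave differently on the two systems, which is why a dedicated isnan() check is the safer choice.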
