BigQuery: Running nlp_compromise with SQL #111

Closed
fhoffa opened this issue May 10, 2016 · 4 comments

@fhoffa

fhoffa commented May 10, 2016

Hi! I'm running some experiments with BigQuery and nlp_compromise you might enjoy. Let's say we want to analyze all of /r/movies comments with a score>5 to extract the top people mentioned.

How about something like this?

Code:

SELECT word_normal, COUNT(*) c, FIRST(json_answer) sample_json_answer, 
FROM js(
(
  SELECT body
  FROM [fh-bigquery:reddit_comments.2015_09] 
  WHERE score>5 
  AND LOWER(subreddit)=LOWER('movies')
),
body,
"[
  {name: 'word_normal', type:'string'},
  {name: 'json_answer', type:'string'},
  {name: 'text', type:'string'},
  {name: 'body', type:'string'}]",

  "function(r, emit) {
   var nlp = nlp_compromise;
   try {
     var new_cnt = nlp.text(r.body).people();
   } catch (e) { return; }
   new_cnt.forEach(function(x) { 
     if (x.lastName==null) {
       return;
     }
     emit({body: r.body, 
         word_normal: x.normal,
         text: x.text,
         json_answer: JSON.stringify(x)});
   });
  }")
GROUP BY 1
ORDER BY c DESC
LIMIT 100

To try this, follow the instructions at https://www.reddit.com/r/bigquery/comments/3dg9le/analyzing_50_billion_wikipedia_pageviews_in_5/, and open the options dialog to set gs://fh-bigquery/js/nlp_compromise.min.js as the UDF source.
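For anyone who wants to poke at the row handler before uploading anything, here is a rough local sketch. It assumes the handler only needs `nlp.text(body).people()` to return objects with `text`, `normal`, and `lastName` fields, as the query above uses them. `extractPeople` and `fakeNlp` are hypothetical names, and `fakeNlp` is a hand-written stub, not the real nlp_compromise library. The emitted field names follow the schema declared in the query.

```javascript
// Hypothetical stand-in for the function(r, emit) handler passed to js().
// The parser is passed in as an argument so a stub can be used locally.
function extractPeople(r, emit, nlp) {
  var people;
  try {
    people = nlp.text(r.body).people();
  } catch (e) {
    return; // skip rows the parser cannot handle
  }
  people.forEach(function (x) {
    if (x.lastName == null) {
      return; // keep only mentions that include a last name
    }
    emit({
      body: r.body,
      word_normal: x.normal,
      text: x.text,
      json_answer: JSON.stringify(x)
    });
  });
}

// Stub parser (assumption): returns fixed records shaped like the fields
// the handler reads. It does NOT do any real language processing.
var fakeNlp = {
  text: function (body) {
    return {
      people: function () {
        return [
          { text: 'Tom Hanks', normal: 'tom hanks', firstName: 'tom', lastName: 'hanks' },
          { text: 'Tom', normal: 'tom', firstName: 'tom', lastName: null }
        ];
      }
    };
  }
};

var rows = [];
extractPeople(
  { body: 'Tom Hanks was great. Tom!' },
  function (row) { rows.push(row); },
  fakeNlp
);
console.log(rows.map(function (row) { return row.word_normal; }));
```

Only the record with a last name gets emitted, which is the same filter that keeps the GROUP BY from filling up with bare first names.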

Thoughts?

(if I don't answer here, please ping me on https://twitter.com/felipehoffa)

@spencermountain
Owner

YESSSSS~!

@fhoffa
Author

fhoffa commented Dec 15, 2017

Updated version, with Standard SQL in BigQuery now:

#standardSQL
CREATE TEMP FUNCTION myFunc(text STRING)
  RETURNS ARRAY<STRING>
  LANGUAGE js AS
"""
  try {
    //text = unescape(text.replace(/<(?:.)*?>/gm, ' ').replace('&#x27;', "'").replace('&quot;', '"'));

    var arr = nlp(text).people().data();
    arr = arr.filter(function(w) {return w.firstName && w.lastName})
    if(0==arr.length) {return;}
    return Array.from(new Set(arr.map(function(w) {return w.normal})));
    
  } catch (e) { return [e.message]; }  // wrap in an array to match RETURNS ARRAY<STRING>
"""

OPTIONS (
  library="gs://fh-bigquery/compromise.11.2.1.min.js"
);

SELECT COUNT(*) c, name, ANY_VALUE(body)
FROM (

  SELECT myFunc(body) names, body
  FROM `fh-bigquery.reddit_comments.2017_05`  
  WHERE LENGTH(body)>50
  AND subreddit='politics'
  AND score>40
), UNNEST(names) name
GROUP BY name
ORDER BY c DESC
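The filter-and-dedupe step inside `myFunc` can be tried outside BigQuery with plain JavaScript. This is a minimal sketch, assuming `nlp(text).people().data()` yields objects with `firstName`, `lastName`, and `normal` fields as the UDF above reads them. `uniqueFullNames` is a hypothetical helper name and the sample records are made up for illustration.

```javascript
// Hypothetical helper mirroring the body of myFunc, minus the nlp() call.
// Input: an array shaped like the assumed output of people().data().
function uniqueFullNames(people) {
  // Keep only records that have both a first and a last name.
  var full = people.filter(function (w) {
    return w.firstName && w.lastName;
  });
  if (full.length === 0) {
    return null; // the UDF returns nothing here, which BigQuery treats as NULL
  }
  // A Set drops repeated mentions within one comment and keeps
  // first-seen order.
  return Array.from(new Set(full.map(function (w) { return w.normal; })));
}

// Made-up sample records (assumption), not real library output.
var sample = [
  { firstName: 'bernie', lastName: 'sanders', normal: 'bernie sanders' },
  { firstName: 'bernie', lastName: '', normal: 'bernie' },
  { firstName: 'bernie', lastName: 'sanders', normal: 'bernie sanders' },
  { firstName: 'elizabeth', lastName: 'warren', normal: 'elizabeth warren' }
];

var names = uniqueFullNames(sample);
console.log(names);
```

Deduplicating per comment means the outer `COUNT(*)` counts comments that mention a person rather than raw mentions.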

@fhoffa
Author

fhoffa commented Jul 2, 2019

Now we are able to run this as a permanent public UDF:

SELECT fhoffa.x.parse_number('5 million 3 hundred 25 point zero 1')

To create this function, I just did:

CREATE OR REPLACE FUNCTION x.nlp_compromise_number(str STRING)
RETURNS NUMERIC LANGUAGE js AS '''
   return nlp(str).values(0).toNumber().out()
'''
OPTIONS (
  library="gs://fh-bigquery/js/compromise.min.11.14.0.js");

@spencermountain
Owner

captain marvel! haha!
thanks @fhoffa . This is so cool, you've made my day.
