# BDCC project - Search Cloud Function development

**[Big Data and Cloud Computing](https://www.dcc.fc.up.pt/~edrdo/aulas/bdcc), Project 1**




## GCP authentication function

In [0]:
PROJECT_ID = 'bdcc20-p1'  # TODO change to your project id

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive


In [3]:
# The authentication method 
def google_colab_authenticate(projectId, keyFile=None, debug=True):  
    import os
    from google.colab import auth
    if keyFile is None:
      keyFile='/content/bdcc-colab.json'
    if os.access(keyFile,os.R_OK):
      if debug:
        print('Using key file "%s"' % keyFile)
      os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '%s' % keyFile
      os.environ['GCP_PROJECT'] = projectId 
      os.environ['GCP_ACCOUNT'] = 'bdcc-colab@' + projectId + '.iam.gserviceaccount.com'
      !gcloud auth activate-service-account --key-file="$GOOGLE_APPLICATION_CREDENTIALS" --project="$GCP_PROJECT"
    else:
      if debug:
        print('No key file given. You may be redirected to the verification code procedure.')
      auth.authenticate_user()
      !gcloud config set project $projectId
    !gcloud info | grep -e Account -e Project

# Copy the key file from Google Drive, if available,
# to a path without spaces (spaces in paths usually cause problems)
!test -f "/content/drive/My Drive/bdcc-colab.json" && cp "/content/drive/My Drive/bdcc-colab.json" /content/bdcc-colab.json
google_colab_authenticate(PROJECT_ID)


Using key file "/content/bdcc-colab.json"
Activated service account credentials for: [bdcc-cloud@bdcc20-p1.iam.gserviceaccount.com]
Account: [bdcc-cloud@bdcc20-p1.iam.gserviceaccount.com]
Project: [bdcc20-p1]


In [4]:
import os
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/content/drive/My Drive/bdcc-colab.json' 
!echo $GOOGLE_APPLICATION_CREDENTIALS

/content/drive/My Drive/bdcc-colab.json


In [5]:
from google.cloud import storage

storage_client = storage.Client()
buckets = storage_client.list_buckets()
print('-- List of buckets in project \"' + storage_client.project + '\"')

for b in buckets:
  print(b.name)


-- List of buckets in project "bdcc20-p1"
bdcc20-movie_data


In [6]:
# To enable GPU access, go to Edit > Notebook settings and set the Hardware accelerator to GPU.

%tensorflow_version 2.x 
import tensorflow as tf

print("GPU device: " + tf.test.gpu_device_name())

from tensorflow.python.client import device_lib

tf_devices = device_lib.list_local_devices()

for x in tf_devices:
  print('------')
  print(x)

GPU device: 
------
name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 2529762464067234966

------
name: "/device:XLA_CPU:0"
device_type: "XLA_CPU"
memory_limit: 17179869184
locality {
}
incarnation: 7360113003788487155
physical_device_desc: "device: XLA_CPU device"



## TF-IDF results 

For the `tfidf_search` functionality (see the cloud function code cell):

- Start by testing the use of just __one word $w$__ in the search. In this case you simply need to yield the movies $m$ with the __highest TF-IDF values__ ${\rm TFIDF}(m,w)$.
- Generalize to __a set of multiple words $W$__ by taking the movies $m$ with the __highest average of TF-IDF values__ as follows:
 
 $$\overline{\rm TFIDF}(m,W) = \frac{1}{|W|} \sum_{w \:\in\: W} {\rm TFIDF}(m,w)$$

   Note that by definition ${\rm TFIDF}(m,w) = 0$ if the word $w$ is not associated with movie $m$, in which case no $(m,w,v)$ entry exists in the `tfidf` BigQuery table.
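
The averaging rule above can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the required BigQuery solution: the function name `average_tfidf` and the toy values are hypothetical, and the entries are represented as $(m,w,v)$ tuples like those in the `tfidf` table. Since missing pairs count as zero, summing the existing entries and dividing by $|W|$ gives the average.

```python
from collections import defaultdict

def average_tfidf(entries, words, max_results=10):
    """Rank movies by mean TF-IDF over `words`.

    `entries` is an iterable of (movie, word, value) tuples; a missing
    (movie, word) pair is treated as a TF-IDF value of 0.
    """
    wanted = set(words)
    if not wanted:
        return []
    sums = defaultdict(float)
    for movie, word, value in entries:
        if word in wanted:
            sums[movie] += value
    # Divide by |W|, not by the number of matched words
    averages = [(movie, total / len(wanted)) for movie, total in sums.items()]
    # Highest average first; ties broken by movie id for a stable order
    averages.sort(key=lambda pair: (-pair[1], pair[0]))
    return averages[:max_results]

# Toy entries, made up for illustration
toy_entries = [(1, 'Toy', 2.0), (1, 'Story', 3.0), (2, 'Jumanji', 3.0), (3, 'Toy', 2.0)]
ranking = average_tfidf(toy_entries, ['Toy', 'Story'])
print(ranking)  # [(1, 2.5), (3, 1.0)]
```

Movie 3 only matches one of the two words, so its average is halved, while movie 2 matches none and does not appear at all.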

## Weighted search results (extra work)

For the `weighted_search` functionality (see the cloud function code cell):

- The idea is to use the TF-IDF values as a weighting factor for movie search __together__ with rating information in the `movies_agg` table.
- You should return movies $m$ with the highest ${\rm WS}(m,W)$ values, defined as follows:

  $$
  {\rm WS}(m,W) = W_1 \times \frac{\overline{\rm TFIDF}(m,W)}{{\rm log}_2(|M|)} + W_2 \times \frac{{\rm avgRating}(m) \times {\rm log}_2({\rm numRatings}(m))}{5 \times {\rm log}_2({\rm MAXR})} 
  $$
   
  where:
    -  $W_1 > 0 \wedge W_2 > 0 \wedge W_1 + W_2 = 1$ are the weighting factors, for example $W_1 = W_2 = 0.5$ ;
    - $|M|$ is the size (count) of movies in the set $M$ of movies in the `movies_agg` table;
    - ${\rm MAXR}$ is the number of ratings for the movie with most ratings, i.e., 
    
     $$ 
     {\rm MAXR} = {\rm max}_{m' \in M}  {\rm numRatings}(m')
     $$

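- Observe that under these conditions ${\rm WS}(m,W) \in [0,1]$ since:
  - average movie rating values ${\rm avgRating}(m)$ are in the interval $[0,5]$;
  - ${\rm log}_2({\rm numRatings}(m)) \le {\rm log}_2({\rm MAXR})$ by the definition of ${\rm MAXR}$ (assuming every movie in `movies_agg` has at least one rating, so that the logarithm is non-negative);
  - and by definition 
    $$\overline{\rm TFIDF}(m,W) \le {\rm log}_2(|M|)$$ since for every movie $m$ and word $w$ we have 
    
    $$
      {\rm TF}(m,w) \in [0,1]
    $$ 
    
    and 
    
    $${\rm IDF}(w,M) \le {\rm log}_2(|M|).$$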
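
As a worked example of the ${\rm WS}$ formula, the sketch below evaluates it for a single movie from scalar inputs. The function name `weighted_score`, its parameter names, and the sample values are hypothetical; in the actual cloud function the inputs would come from BigQuery queries over `tfidf` and `movies_agg`.

```python
import math

def weighted_score(avg_tfidf, avg_rating, num_ratings, num_movies, max_ratings,
                   w1=0.5, w2=0.5):
    """Evaluate WS(m, W) for one movie from precomputed scalar inputs."""
    if not (w1 > 0 and w2 > 0 and math.isclose(w1 + w2, 1.0)):
        raise ValueError('weights must be positive and sum to 1')
    # Assumes num_movies > 1, max_ratings > 1 and num_ratings >= 1;
    # otherwise a logarithm is zero or undefined.
    text_part = avg_tfidf / math.log2(num_movies)
    rating_part = (avg_rating * math.log2(num_ratings)) / (5 * math.log2(max_ratings))
    return w1 * text_part + w2 * rating_part

# Made-up values, chosen so that the logarithms are whole numbers:
# text part = 2 / 4 = 0.5, rating part = (4 * 4) / (5 * 8) = 0.4
score = weighted_score(avg_tfidf=2.0, avg_rating=4.0, num_ratings=16,
                       num_movies=16, max_ratings=256)
print(score)  # about 0.45
```

With the maximum possible inputs (average TF-IDF equal to ${\rm log}_2(|M|)$, a rating of 5, and ${\rm numRatings}(m) = {\rm MAXR}$) both fractions equal 1, so the score reaches its upper bound of 1.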


  

## Cloud function code

This should be placed in a single cell to facilitate cloud function deployment.

__Important notes__:

- __Ideally__, data queries __should only be performed using SQL over BigQuery__ rather than handled through Pandas. You should not use Pandas __except__ for the purpose of __getting BigQuery results__ through the [`to_dataframe()`](https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.job.QueryJob.html#google.cloud.bigquery.job.QueryJob.to_dataframe) method __or__ to hold temporary scalar values (like the count of records in a table, or the maximum number of ratings) required for a sequence of BigQuery queries. 
- You __SHOULD NOT__ use "magic" notebook extensions in this cell such as `! shell command` or `%%bigquery`, as these are notebook extensions rather than pure Python code, hence  __your cloud function deployment will fail with parse errors if you use them__.
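
To illustrate the first note, here is one possible way to push the averaged TF-IDF ranking into SQL rather than Pandas. It is a sketch under assumptions: the helper name `build_tfidf_search_sql` is hypothetical, and the column names `movieId`, `word`, and `tfidf` are guesses about the schema of the `tfidf` table, so adjust them to your own tables. The code only builds the query text, which keeps it runnable without GCP credentials.

```python
def build_tfidf_search_sql(ds_id, num_words, max_results):
    """Return SQL text that ranks movies by average TF-IDF over @words.

    Column names (movieId, word, tfidf) are assumptions about the `tfidf`
    table; adjust them to the actual schema.
    """
    limit = int(max_results)  # rejects non-numeric input before it reaches SQL
    if num_words < 1 or limit < 1:
        raise ValueError('need at least one word and a positive limit')
    return '''
      SELECT movieId, SUM(tfidf) / %d AS avg_tfidf
      FROM `%s.tfidf`
      WHERE word IN UNNEST(@words)
      GROUP BY movieId
      ORDER BY avg_tfidf DESC
      LIMIT %d
      ''' % (num_words, ds_id, limit)

sql = build_tfidf_search_sql('bdcc20-p1.tiny1', num_words=2, max_results='10')
print(sql)
```

The search words are deliberately left as the named parameter `@words`: they could be supplied as an array query parameter (for instance an `ArrayQueryParameter` passed through `QueryJobConfig(query_parameters=[...])`), so that user input is never spliced into the SQL string. The dataset id still has to be formatted in, since table names cannot be query parameters.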



In [7]:
# Imports
import os
import pandas as pd
import google.cloud.bigquery as bq

# Parameters
PROJECT_ID = 'bdcc20-p1'  # TODO change to your project id
DEBUG = True 
RUNNING_IN_COLAB = os.environ.get('COLAB_GPU') is not None

# Debug method
def debug(message):
  if DEBUG:
     print(message)

# Authenticate to GCP if running in Colab
if RUNNING_IN_COLAB:
  google_colab_authenticate(PROJECT_ID)

# Initialize interface to BigQuery and GCS
BQ_CLIENT = bq.Client(PROJECT_ID)

def get_movies(ds_id, max_results):
  query = BQ_CLIENT.query(
      '''
      SELECT * FROM `%s.movies_agg` 
      ORDER BY movieId
      LIMIT %s
      ''' % (ds_id, max_results))
  return query.to_dataframe()

def list_movies(request):
  ds_id = '%s.%s' % (PROJECT_ID, request.args.get('dataset'))
  max_results = request.args.get('max_results')
  movies = get_movies(ds_id, max_results)
  debug('Returning result with %d rows' % len(movies))
  return movies.to_html()
# -----------------------------------------------------------------------------------
# Calculate absolute frequency
def freq(doc_df,debug=False):
  frequency = {}
  # Plain Python here, as pandas does not provide a "flatMap" operator as in Spark
  dic = doc_df.to_dict()
  for (doc,content) in dic['title'].items():
    doc = dic['movieId'][doc] # since id's may not start at index 0
    for word in content.split():
      count = frequency.get((doc,word), 0)
      frequency[(doc,word)] = count + 1

  f_df = pd.DataFrame(data = {
      'doc': [x[0] for x in frequency.keys()],
      'word': [x[1] for x in frequency.keys()],
      'f': [frequency[x] for x in frequency.keys()]
  })

  if debug:
    print("== frequency (f) ==")
    print(f_df.sort_values(['doc','f'],ascending=[True,False]))

  return f_df

def tf_idf(ds_id, doc_df, debug=False):
  import math
  f = freq(doc_df, debug)

  if debug:
    print('== f ==')
    print(f)

  max_f = f.groupby('doc').agg({'f': 'max'})\
          .rename(columns={'f': 'max_f'}).reset_index()

  if debug:
    print('== max_f ==')
    print(max_f)
  
  # Calculate TF
  TF = max_f.merge(f,on=['doc'])
  TF['tf'] = TF['f'] / TF['max_f']        
  TF = TF[['doc','word','f','max_f','tf']]
  
  if debug:
    print('== TF ==')
    print(TF.sort_values(['doc','tf','word'],ascending=[True,False,True]))
  
  # Calculate IDF
  N_DOCS = len(doc_df)
  IDF = TF.groupby('word').agg({ 'doc': 'count'}).\
     rename(columns={ 'doc': 'n'}).reset_index()
  IDF['idf'] =  (N_DOCS / IDF['n']).apply(lambda x: math.log(x,2))
  
  if debug:
    print('== IDF ==')
    print(IDF.sort_values(['idf','word','n'], ascending=[False,True,False]))

  # Finally TF-IDF
  TF_IDF = TF.merge(IDF,on='word')
  TF_IDF['tf_idf'] = TF_IDF['tf'] * TF_IDF['idf']

  # Reorganize column order
  TF_IDF = TF_IDF[['doc','word','f','max_f','tf','n','idf','tf_idf']]
  
  if debug:
    print('== TF_IDF ==')
    print(TF_IDF.sort_values(by=['tf_idf','word','doc'],ascending=[False,True,True]))
  return TF_IDF

def list_tfidf(request):
  ds_id = '%s.%s' % (PROJECT_ID, request.args.get('dataset'))
  word = 'title'
  doc = 'movieId'
  max_results = request.args.get('max_results')
  movies = get_movies(ds_id, max_results)
  tfidf = tf_idf(ds_id, movies,False)
  # implement limit max_results
  return tfidf.to_html()
# -----------------------------------------------------------------------------------
def mostSignificantDocs(TF_IDF, wordList, maxDocuments=2,debug=False):
    words_df = pd.DataFrame(data={'word': wordList})
    if debug:
        print('== words ==')
        print(words_df)
    
    m_df = TF_IDF.merge(words_df, on='word')
   
    if debug:
      print('== after merging ==')
      print(m_df)
    
    # Sum the TF-IDF values of the matched words for each document
    dic = {}
    for _, tf_idf_df in m_df.iterrows():
        val = dic.get(tf_idf_df['doc'], 0)
        val += tf_idf_df['tf_idf']
        dic[tf_idf_df['doc']] = val
    
    if debug:
      print('== after calculation ==')
      print(dic)

    # Data frame with (doc, sum_tf_idf) columns, highest scores first
    return pd.DataFrame(list(dic.items()), columns=['doc', 'sum_tf_idf']).\
              sort_values(by=['sum_tf_idf'], ascending=False).\
              head(int(maxDocuments))

def tfidf_search(request):
  ds_id = '%s.%s' % (PROJECT_ID, request.args.get('dataset'))
  wordList = request.args.get('words').split(' ')
  max_results = request.args.get('max_results')
  movies = get_movies(ds_id, max_results)

  tfidf = tf_idf(ds_id, movies)
  sum_tf_idf = mostSignificantDocs(tfidf,wordList,max_results)
  return sum_tf_idf.to_html()

def weighted_search(request):
  # TODO
  return 'NOT IMPLEMENTED'

# Bonus: list entries of the Jaccard index table
def get_jaccard_index(ds_id, max_results):
  query = BQ_CLIENT.query(
      '''
      SELECT * FROM `%s.jaccardIndex` 
      ORDER BY movie1, movie2
      LIMIT %s
      ''' % (ds_id, max_results))
  return query.to_dataframe()

def list_jaccard_index(request):
  ds_id = '%s.%s' % (PROJECT_ID, request.args.get('dataset'))
  max_results = request.args.get('max_results')
  ji = get_jaccard_index(ds_id, max_results)
  return ji.to_html()

def handle_request(request):
  print(request.args)
  if not request.args:
    debug('No arguments given!')
    return 'ERROR: No arguments'
  
  if 'dataset' not in request.args:
    debug('No dataset specified!')
    return 'ERROR: No dataset has been specified'
  
  if 'op' not in request.args:
    debug('No operation specified!')
    return 'ERROR: No operation has been specified'

  if 'max_results' not in request.args:
    debug('No result limit specified!')
    return 'ERROR: No result limit has been specified'

  operations = {
     'list_movies': list_movies,
     'list_tfidf': list_tfidf,
     'tfidf_search': tfidf_search,
     'weighted_search': weighted_search,
     'list_jaccard_index': list_jaccard_index
  }
  op = request.args.get('op')
  dataset = request.args.get('dataset')
  debug('dataset: %s, op: %s' % (dataset,op))
  func = operations.get(op, lambda req: 'Invalid operation: %s' % op)
  return func(request)


Using key file "/content/bdcc-colab.json"
Activated service account credentials for: [bdcc-cloud@bdcc20-p1.iam.gserviceaccount.com]
Account: [bdcc-cloud@bdcc20-p1.iam.gserviceaccount.com]
Project: [bdcc20-p1]


## Test cloud function locally

In [8]:
dataset = 'medium1' #@param ["tiny1", "tiny2", "tiny3", "tiny4", "medium1", "medium2", "medium3", "medium4", "large1", "large2", "large3", "large4", "large5"] {allow-input: true}
max_results = 10 #@param {type:"slider", min:10, max:1000, step:10}

from IPython.core.display import HTML

class ListMoviesReq:
   args = { 'op': 'list_movies',\
            'dataset': dataset,\
            'max_results': max_results\
           }

HTML(handle_request(ListMoviesReq()))


{'op': 'list_movies', 'dataset': 'medium1', 'max_results': 10}
dataset: medium1, op: list_movies
Returning result with 10 rows


|   | movieId | title | year | imdbId | numRatings | avgRating |
|---|---|---|---|---|---|---|
| 0 | 24 | Powder | 1995 | 114168 | 9191 | 3.179306 |
| 1 | 888 | The Land Before Time III: The Time of the Great Giving | 1995 | 113596 | 799 | 2.319775 |
| 2 | 944 | Lost Horizon | 1937 | 29162 | 1147 | 3.819965 |
| 3 | 1102 | American Strays | 1996 | 115531 | 86 | 2.610465 |
| 4 | 1176 | La double vie de Véronique | 1991 | 101765 | 1972 | 3.889452 |
| 5 | 1483 | Crash | 1996 | 115964 | 3313 | 3.126019 |
| 6 | 1687 | The Jackal | 1997 | 119395 | 5644 | 3.202339 |
| 7 | 1920 | Small Soldiers | 1998 | 122718 | 4698 | 2.826415 |
| 8 | 1939 | The Best Years of Our Lives | 1946 | 36868 | 2045 | 4.082396 |
| 9 | 1993 | Child's Play 3 | 1991 | 103956 | 1114 | 2.041293 |


In [9]:
# Calculate absolute frequency
def freqDocs(words, doc_df,debug=False):
  import re
  docFrequency = {}
  # Plain Python here as pandas do not provide a "flatMap" operator as in Spark
  dic = doc_df.to_dict()
  for (i,movieId) in dic['movieId'].items():
    docFrequency[movieId] = {}
    for (doc,content) in dic['title'].items():
      for word in content.split():
        docFrequency[movieId][word] = 1 if i == doc else 0
  return docFrequency
  
ds_id = '%s.%s' % (PROJECT_ID, 'medium1')
f = freqDocs(['Toy', 'Men', 'Waiting'], get_movies(ds_id, max_results), debug)
print(f)

{24: {'Powder': 1, 'The': 0, 'Land': 0, 'Before': 0, 'Time': 0, 'III:': 0, 'of': 0, 'the': 0, 'Great': 0, 'Giving': 0, 'Lost': 0, 'Horizon': 0, 'American': 0, 'Strays': 0, 'La': 0, 'double': 0, 'vie': 0, 'de': 0, 'Véronique': 0, 'Crash': 0, 'Jackal': 0, 'Small': 0, 'Soldiers': 0, 'Best': 0, 'Years': 0, 'Our': 0, 'Lives': 0, "Child's": 0, 'Play': 0, '3': 0}, 888: {'Powder': 0, 'The': 0, 'Land': 1, 'Before': 1, 'Time': 1, 'III:': 1, 'of': 0, 'the': 1, 'Great': 1, 'Giving': 1, 'Lost': 0, 'Horizon': 0, 'American': 0, 'Strays': 0, 'La': 0, 'double': 0, 'vie': 0, 'de': 0, 'Véronique': 0, 'Crash': 0, 'Jackal': 0, 'Small': 0, 'Soldiers': 0, 'Best': 0, 'Years': 0, 'Our': 0, 'Lives': 0, "Child's": 0, 'Play': 0, '3': 0}, 944: {'Powder': 0, 'The': 0, 'Land': 0, 'Before': 0, 'Time': 0, 'III:': 0, 'of': 0, 'the': 0, 'Great': 0, 'Giving': 0, 'Lost': 1, 'Horizon': 1, 'American': 0, 'Strays': 0, 'La': 0, 'double': 0, 'vie': 0, 'de': 0, 'Véronique': 0, 'Crash': 0, 'Jackal': 0, 'Small': 0, 'Soldiers': 0,

In [10]:
dataset = 'medium1' #@param ["tiny1", "tiny2", "tiny3", "tiny4", "medium1", "medium2", "medium3", "medium4", "large1", "large2", "large3", "large4", "large5"] {allow-input: true}
max_results = 100 #@param {type:"slider", min:100, max:1000, step:100}


class ListTFIDFReq:
   args = { 'op': 'list_tfidf',\
            'dataset': dataset,\
            'max_results': max_results\
          }

HTML(handle_request(ListTFIDFReq()))

{'op': 'list_tfidf', 'dataset': 'medium1', 'max_results': 100}
dataset: medium1, op: list_tfidf


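|   | doc | word | f | max_f | tf | n | idf | tf_idf |
|---|---|---|---|---|---|---|---|---|
| 0 | 24 | Powder | 1 | 1 | 1.0 | 1 | 6.643856 | 6.643856 |
| 1 | 888 | The | 2 | 2 | 1.0 | 21 | 2.251539 | 2.251539 |
| 2 | 1687 | The | 1 | 1 | 1.0 | 21 | 2.251539 | 2.251539 |
| 3 | 1939 | The | 1 | 1 | 1.0 | 21 | 2.251539 | 2.251539 |
| 4 | 2058 | The | 1 | 1 | 1.0 | 21 | 2.251539 | 2.251539 |
| 5 | 4090 | The | 1 | 1 | 1.0 | 21 | 2.251539 | 2.251539 |
| 6 | 4124 | The | 1 | 1 | 1.0 | 21 | 2.251539 | 2.251539 |
| 7 | 4720 | The | 1 | 1 | 1.0 | 21 | 2.251539 | 2.251539 |
| 8 | 4955 | The | 1 | 1 | 1.0 | 21 | 2.251539 | 2.251539 |
| 9 | 5942 | The | 1 | 1 | 1.0 | 21 | 2.251539 | 2.251539 |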


In [11]:
dataset = 'medium1' #@param ["tiny1", "tiny2", "tiny3", "tiny4", "medium1", "medium2", "medium3", "medium4", "large1", "large2", "large3", "large4", "large5"] {allow-input: true}
words = 'GoldenEye toy story'  #@param {type: "string"}
max_results = 15 #@param {type:"slider", min:5, max:100, step:5}

class TFIDFSearch:
   args = { 
            'op': 'tfidf_search',      \
            'dataset': dataset,        \
            'words': words,            \
            'max_results': max_results \
          }
  
HTML(handle_request(TFIDFSearch()))

{'op': 'tfidf_search', 'dataset': 'medium1', 'words': 'GoldenEye toy story', 'max_results': 15}
dataset: medium1, op: tfidf_search


|   | doc | sum_tf_idf |
|---|---|---|


In [12]:
dataset = 'medium1' #@param ["tiny1", "tiny2", "tiny3", "tiny4", "medium1", "medium2", "medium3", "medium4", "large1", "large2", "large3", "large4", "large5"] {allow-input: true}
max_results = 10 #@param {type:"slider", min:10, max:1000, step:10}

from IPython.core.display import HTML

class ListJaccardIndex:
   args = { 'op': 'list_jaccard_index',\
            'dataset': dataset,\
            'max_results': max_results\
           }

HTML(handle_request(ListJaccardIndex()))


{'op': 'list_jaccard_index', 'dataset': 'medium1', 'max_results': 10}
dataset: medium1, op: list_jaccard_index


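|   | movie1 | movie2 | user | index | jaccard_index |
|---|---|---|---|---|---|
| 0 | 24 | 888 | 3372 | 15 | 0.004448 |
| 1 | 24 | 944 | 3960 | 45 | 0.011364 |
| 2 | 24 | 1102 | 3285 | 1 | 0.000304 |
| 3 | 24 | 1176 | 4598 | 15 | 0.003262 |
| 4 | 24 | 1483 | 4509 | 43 | 0.009536 |
| 5 | 24 | 1687 | 5091 | 121 | 0.023767 |
| 6 | 24 | 1920 | 4167 | 100 | 0.023998 |
| 7 | 24 | 1939 | 4754 | 37 | 0.007783 |
| 8 | 24 | 1993 | 3346 | 9 | 0.00269 |
| 9 | 24 | 2058 | 8258 | 308 | 0.037297 |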


## Trigger cloud function once it is deployed

Now create and deploy the Google Cloud function.


When creating the cloud function, remember to:

- use an __HTTP__ trigger;
- choose a Python runtime;
- set __MAIN.PY__ by copying the code from the notebook cell containing the cloud function code;
- and finally, set __REQUIREMENTS.txt__ with:
```
pandas
google-cloud-bigquery
```

For testing the invocation, see previous examples. I will update this notebook with an HTML form generated from Colab.
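
The request URLs for the deployed function can also be composed with the standard library instead of manual string replacement. The sketch below is one way to do it: the helper name `build_request_url` and the `example.com` address are placeholders, while the parameter names `op`, `dataset`, `max_results`, and `words` are the ones `handle_request` expects.

```python
from urllib.parse import urlencode, quote

def build_request_url(base_url, op, dataset, max_results, words=None):
    """Compose the GET URL for the cloud function with proper escaping."""
    params = {'op': op, 'dataset': dataset, 'max_results': max_results}
    if words is not None:
        params['words'] = words
    # quote_via=quote encodes spaces as %20 (the default would use '+')
    return base_url + '?' + urlencode(params, quote_via=quote)

# Placeholder URL; substitute the trigger URL of your own deployment
url = build_request_url('https://example.com/search_cloud_function',
                        'tfidf_search', 'tiny1', 100, words='GoldenEye toy story')
print(url)
# https://example.com/search_cloud_function?op=tfidf_search&dataset=tiny1&max_results=100&words=GoldenEye%20toy%20story
```

Escaping through `urlencode` also covers characters other than spaces (for example `&` or accented letters in a search word), which a plain `replace(' ', '%20')` would leave untouched.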

In [13]:
url='https://europe-west1-bdcc20-p1.cloudfunctions.net/search_cloud_function'

op = 'list_movies' 
dataset = 'tiny1' 
max_results = 10 
request='%s?op=%s&dataset=%s&max_results=%s' % (url, op, dataset, max_results)
# request='https://europe-west1-bdcc20-p1.cloudfunctions.net/search_cloud_function?op=list_movies&dataset=tiny1&max_results=10'
# print(request)

# Invoke the function
from IPython.core.display import HTML
import requests
response = requests.get(request)
HTML(response.text)

# import subprocess
# response = subprocess.check_output('curl '+request,shell=True)
# HTML(response.decode('utf-8'))

|   | movieId | title | year | imdbId | numRatings | avgRating |
|---|---|---|---|---|---|---|
| 0 | 1 | Toy Story | 1995 | 114709 | 215 | 3.92093 |
| 1 | 2 | Jumanji | 1995 | 113497 | 110 | 3.431818 |
| 2 | 3 | Grumpier Old Men | 1995 | 113228 | 52 | 3.259615 |
| 3 | 4 | Waiting to Exhale | 1995 | 114885 | 7 | 2.357143 |
| 4 | 5 | Father of the Bride Part II | 1995 | 113041 | 49 | 3.071429 |
| 5 | 6 | Heat | 1995 | 113277 | 102 | 3.946078 |
| 6 | 7 | Sabrina | 1995 | 114319 | 54 | 3.185185 |
| 7 | 8 | Tom and Huck | 1995 | 112302 | 8 | 2.875 |
| 8 | 9 | Sudden Death | 1995 | 114576 | 16 | 3.125 |
| 9 | 10 | GoldenEye | 1995 | 113189 | 132 | 3.496212 |


In [14]:
url='https://europe-west1-bdcc20-p1.cloudfunctions.net/search_cloud_function'

op = 'list_movies' 
dataset = 'medium1' 
max_results = 10 
request='%s?op=%s&dataset=%s&max_results=%s' % (url, op, dataset, max_results)
# request='https://europe-west1-bdcc20-p1.cloudfunctions.net/search_cloud_function?op=list_movies&dataset=medium1&max_results=10'
# print(request)

# Invoke the function
from IPython.core.display import HTML
import requests
response = requests.get(request)
HTML(response.text)

# import subprocess
# response = subprocess.check_output('curl '+request,shell=True)
# HTML(response.decode('utf-8'))

|   | movieId | title | year | imdbId | numRatings | avgRating |
|---|---|---|---|---|---|---|
| 0 | 24 | Powder | 1995 | 114168 | 9191 | 3.179306 |
| 1 | 888 | The Land Before Time III: The Time of the Great Giving | 1995 | 113596 | 799 | 2.319775 |
| 2 | 944 | Lost Horizon | 1937 | 29162 | 1147 | 3.819965 |
| 3 | 1102 | American Strays | 1996 | 115531 | 86 | 2.610465 |
| 4 | 1176 | La double vie de Véronique | 1991 | 101765 | 1972 | 3.889452 |
| 5 | 1483 | Crash | 1996 | 115964 | 3313 | 3.126019 |
| 6 | 1687 | The Jackal | 1997 | 119395 | 5644 | 3.202339 |
| 7 | 1920 | Small Soldiers | 1998 | 122718 | 4698 | 2.826415 |
| 8 | 1939 | The Best Years of Our Lives | 1946 | 36868 | 2045 | 4.082396 |
| 9 | 1993 | Child's Play 3 | 1991 | 103956 | 1114 | 2.041293 |


In [15]:
url='https://europe-west1-bdcc20-p1.cloudfunctions.net/search_cloud_function'

op = 'list_tfidf' 
dataset = 'tiny1' 
max_results = 100 
request='%s?op=%s&dataset=%s&max_results=%s' % (url, op, dataset, max_results)
# request='https://europe-west1-bdcc20-p1.cloudfunctions.net/search_cloud_function?op=list_tfidf&dataset=tiny1&max_results=100'
# print(request)

# Invoke the function
from IPython.core.display import HTML
import requests
response = requests.get(request)
HTML(response.text)

# import subprocess
# response = subprocess.check_output('curl '+request,shell=True)
# HTML(response.decode('utf-8'))

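|   | doc | word | f | max_f | tf | n | idf | tf_idf |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Toy | 1 | 1 | 1.0 | 1 | 3.321928 | 3.321928 |
| 1 | 1 | Story | 1 | 1 | 1.0 | 1 | 3.321928 | 3.321928 |
| 2 | 2 | Jumanji | 1 | 1 | 1.0 | 1 | 3.321928 | 3.321928 |
| 3 | 3 | Grumpier | 1 | 1 | 1.0 | 1 | 3.321928 | 3.321928 |
| 4 | 3 | Old | 1 | 1 | 1.0 | 1 | 3.321928 | 3.321928 |
| 5 | 3 | Men | 1 | 1 | 1.0 | 1 | 3.321928 | 3.321928 |
| 6 | 4 | Waiting | 1 | 1 | 1.0 | 1 | 3.321928 | 3.321928 |
| 7 | 4 | to | 1 | 1 | 1.0 | 1 | 3.321928 | 3.321928 |
| 8 | 4 | Exhale | 1 | 1 | 1.0 | 1 | 3.321928 | 3.321928 |
| 9 | 5 | Father | 1 | 1 | 1.0 | 1 | 3.321928 | 3.321928 |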


In [16]:
url='https://europe-west1-bdcc20-p1.cloudfunctions.net/search_cloud_function'

op = 'list_tfidf' 
dataset = 'medium1' 
max_results = 100 
request='%s?op=%s&dataset=%s&max_results=%s' % (url, op, dataset, max_results)
# request='https://europe-west1-bdcc20-p1.cloudfunctions.net/search_cloud_function?op=list_tfidf&dataset=medium1&max_results=100'
# print(request)

# Invoke the function
from IPython.core.display import HTML
import requests
response = requests.get(request)
HTML(response.text)

# import subprocess
# response = subprocess.check_output('curl '+request,shell=True)
# HTML(response.decode('utf-8'))

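|   | doc | word | f | max_f | tf | n | idf | tf_idf |
|---|---|---|---|---|---|---|---|---|
| 0 | 24 | Powder | 1 | 1 | 1.0 | 1 | 6.643856 | 6.643856 |
| 1 | 888 | The | 2 | 2 | 1.0 | 21 | 2.251539 | 2.251539 |
| 2 | 1687 | The | 1 | 1 | 1.0 | 21 | 2.251539 | 2.251539 |
| 3 | 1939 | The | 1 | 1 | 1.0 | 21 | 2.251539 | 2.251539 |
| 4 | 2058 | The | 1 | 1 | 1.0 | 21 | 2.251539 | 2.251539 |
| 5 | 4090 | The | 1 | 1 | 1.0 | 21 | 2.251539 | 2.251539 |
| 6 | 4124 | The | 1 | 1 | 1.0 | 21 | 2.251539 | 2.251539 |
| 7 | 4720 | The | 1 | 1 | 1.0 | 21 | 2.251539 | 2.251539 |
| 8 | 4955 | The | 1 | 1 | 1.0 | 21 | 2.251539 | 2.251539 |
| 9 | 5942 | The | 1 | 1 | 1.0 | 21 | 2.251539 | 2.251539 |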


In [17]:
url='https://europe-west1-bdcc20-p1.cloudfunctions.net/search_cloud_function'

op = 'tfidf_search' 
dataset = 'tiny1' 
words = 'GoldenEye toy story'.replace(' ', '%20')
max_results = 100 
request='%s?op=%s&dataset=%s&words=%s&max_results=%s' % (url, op, dataset, words, max_results)
#request='https://europe-west1-bdcc20-p1.cloudfunctions.net/search_cloud_function?op=tfidf_search&dataset=tiny1&words=GoldenEye%20toy%20story&max_results=100'
# print(request)

# Invoke the function
from IPython.core.display import HTML
import requests
response = requests.get(request)
HTML(response.text)

# import subprocess
# response = subprocess.check_output('curl '+request,shell=True)
# HTML(response.decode('utf-8'))

|   | doc | sum_tf_idf |
|---|---|---|
| 0 | 10 | 3.321928 |


In [18]:
url='https://europe-west1-bdcc20-p1.cloudfunctions.net/search_cloud_function'

op = 'tfidf_search' 
dataset = 'medium1' 
words = 'GoldenEye toy story'.replace(' ', '%20')
max_results = 100 
request='%s?op=%s&dataset=%s&words=%s&max_results=%s' % (url, op, dataset, words, max_results)
#request='https://europe-west1-bdcc20-p1.cloudfunctions.net/search_cloud_function?op=tfidf_search&dataset=medium1&words=GoldenEye%20toy%20story&max_results=100'
# print(request)

# Invoke the function
from IPython.core.display import HTML
import requests
response = requests.get(request)
HTML(response.text)

# import subprocess
# response = subprocess.check_output('curl '+request,shell=True)
# HTML(response.decode('utf-8'))

|   | doc | sum_tf_idf |
|---|---|---|


In [19]:
url='https://europe-west1-bdcc20-p1.cloudfunctions.net/search_cloud_function'

op = 'list_jaccard_index' 
dataset = 'tiny1' 
max_results = 10 
request='%s?op=%s&dataset=%s&max_results=%s' % (url, op, dataset, max_results)
# request='https://europe-west1-bdcc20-p1.cloudfunctions.net/search_cloud_function?op=list_jaccard_index&dataset=tiny1&max_results=10'
# print(request)

# Invoke the function
from IPython.core.display import HTML
import requests
response = requests.get(request)
HTML(response.text)

# import subprocess
# response = subprocess.check_output('curl '+request,shell=True)
# HTML(response.decode('utf-8'))

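|   | movie1 | movie2 | user | index | jaccard_index |
|---|---|---|---|---|---|
| 0 | 1 | 2 | 176 | 21 | 0.119318 |
| 1 | 1 | 3 | 154 | 11 | 0.071429 |
| 2 | 1 | 5 | 152 | 7 | 0.046053 |
| 3 | 1 | 6 | 189 | 27 | 0.142857 |
| 4 | 1 | 7 | 160 | 7 | 0.04375 |
| 5 | 1 | 10 | 187 | 19 | 0.101604 |
| 6 | 2 | 3 | 63 | 5 | 0.079365 |
| 7 | 2 | 5 | 58 | 4 | 0.068966 |
| 8 | 2 | 6 | 111 | 8 | 0.072072 |
| 9 | 2 | 7 | 64 | 6 | 0.09375 |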


In [20]:
url='https://europe-west1-bdcc20-p1.cloudfunctions.net/search_cloud_function'

op = 'list_jaccard_index' 
dataset = 'medium1' 
max_results = 10 
request='%s?op=%s&dataset=%s&max_results=%s' % (url, op, dataset, max_results)
# request='https://europe-west1-bdcc20-p1.cloudfunctions.net/search_cloud_function?op=list_jaccard_index&dataset=medium1&max_results=10'
# print(request)

# Invoke the function
from IPython.core.display import HTML
import requests
response = requests.get(request)
HTML(response.text)

# import subprocess
# response = subprocess.check_output('curl '+request,shell=True)
# HTML(response.decode('utf-8'))

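|   | movie1 | movie2 | user | index | jaccard_index |
|---|---|---|---|---|---|
| 0 | 24 | 888 | 3372 | 15 | 0.004448 |
| 1 | 24 | 944 | 3960 | 45 | 0.011364 |
| 2 | 24 | 1102 | 3285 | 1 | 0.000304 |
| 3 | 24 | 1176 | 4598 | 15 | 0.003262 |
| 4 | 24 | 1483 | 4509 | 43 | 0.009536 |
| 5 | 24 | 1687 | 5091 | 121 | 0.023767 |
| 6 | 24 | 1920 | 4167 | 100 | 0.023998 |
| 7 | 24 | 1939 | 4754 | 37 | 0.007783 |
| 8 | 24 | 1993 | 3346 | 9 | 0.00269 |
| 9 | 24 | 2058 | 8258 | 308 | 0.037297 |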


### Inspect cloud function logs using `gcloud`

You can see what happened in more detail by inspecting the function invocation logs, either in the Web UI or with the `gcloud` command-line tool, as follows (__change `cloudFunctionName` to the name of your cloud function__).


In [22]:
cloudFunctionName = 'search_cloud_function'
!gcloud functions logs read $cloudFunctionName --limit=1000

Listed 0 items.
