# Purpose

This notebook demonstrates how to run a full scan of all tables and columns in a Spark warehouse to detect PI entities in each column.
The results are then saved into a DataFrame for reference.

## Intended Use

This notebook walks through how to scan a Spark DataFrame column by column to detect PI. There are many ways to accomplish this, some likely faster than doing a full collect on each column, so treat this example as a quick explainer and expect to rework it for production scale.
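
One possible direction for that rework is to bound how much text goes into each request. The helper below is a minimal sketch, not part of the Private AI client: `build_column_payloads` and its `max_chars` parameter are hypothetical names, and the default budget of 2000 characters is an arbitrary assumption. It deduplicates a column's values and packs them into strings in the same `column: v1 | v2 | ...` format used later in this notebook.

```python
from typing import Iterable, List


def build_column_payloads(col_name: str, values: Iterable[object], max_chars: int = 2000) -> List[str]:
    """Split a column's values into 'col: a | b | c' strings of bounded size.

    Hypothetical helper; max_chars is an assumed budget, not an API limit.
    """
    # Deduplicate while keeping first-seen order (assumes repeats add no new signal).
    unique_values = list(dict.fromkeys(str(v) for v in values))

    prefix = f"{col_name}: "
    payloads: List[str] = []
    current: List[str] = []
    current_len = len(prefix)

    for value in unique_values:
        # Characters this value adds, including the ' | ' separator when needed.
        added = len(value) + (3 if current else 0)
        if current and current_len + added > max_chars:
            # Budget exceeded: close the current payload and start a new one.
            payloads.append(prefix + " | ".join(current))
            current, current_len = [], len(prefix)
            added = len(value)
        # A single value longer than max_chars still gets its own payload.
        current.append(value)
        current_len += added

    if current:
        payloads.append(prefix + " | ".join(current))
    return payloads


print(build_column_payloads("email", ["a@x.com", "b@x.com", "a@x.com"], max_chars=20))
```

Each returned string could be appended to the request's `text` list in place of the single joined string. If you do that, the response would presumably carry one entity list per payload, so code that reads only `entities[0]` would need to loop over all of them.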


In [None]:
from privateai_client import PAIClient
from privateai_client import request_objects

# NOTE: if you run your own container, authenticate via the mechanism you set up instead.
api_key = 'YOUR KEY GOES HERE'
client = PAIClient(url="https://api.private-ai.com/community/", api_key=api_key)

In [None]:
client.ping()

In [None]:
final_output = []
# Scan every table in the 'privateai' database, one column at a time.
for table in spark.catalog.listTables('privateai'):
  print(f"*********** Analyzing table: {table.name} ****************")
  df = spark.sql(f"select * from {table.database}.{table.name}")
  for col in df.columns:
    print(f"************** PROCESSING {col} ************")
    # Pull every value of the column to the driver (fine for a demo, costly at scale).
    col_list = df.rdd.map(lambda x: x[col]).collect()
    # Send the whole column as one string: "<column>: v1 | v2 | ..."
    text_req = request_objects.process_text_obj(text=[])
    text_req.text.append(f"{col}: {' | '.join(str(x) for x in col_list)}")
    resp = client.process_text(text_req)
    final_output.append(
      {"database": table.database,
       "table": table.name,
       "column": col,
       "list": col_list,
       "full_resp": resp
       }
      )

In [None]:
for item in final_output:
  item['processed_text'] = item['full_resp'].processed_text

In [None]:
def get_best_label_list(entities_list):
  # Keep only the top-scoring label of each detected entity.
  return [item['best_label'] for item in entities_list]
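
`get_best_label_list` returns one label per detected entity, so a column with many values yields a long list full of repeats. As a hedged alternative, the sketch below counts how often each label occurs. `count_best_labels` is a hypothetical helper name, it assumes each entity is a dict with a `best_label` key as in the function above, and the labels in the sample data are placeholders rather than real output.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping


def count_best_labels(entities_list: Iterable[Mapping[str, object]]) -> Dict[str, int]:
    """Count how often each 'best_label' appears among detected entities."""
    return dict(Counter(entity["best_label"] for entity in entities_list))


# Placeholder entities for illustration only.
sample_entities = [
    {"best_label": "EMAIL_ADDRESS"},
    {"best_label": "NAME"},
    {"best_label": "EMAIL_ADDRESS"},
]
print(count_best_labels(sample_entities))
```

A count per label makes it easier to tell a column that is mostly one entity type from a column with a single stray detection.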

In [None]:
for item in final_output:
  item['entities_list'] = get_best_label_list(item['full_resp'].entities[0])

In [None]:
df_create_list = []
for item in final_output:
  df_create_list.append(
    {
      "database":item['database'],
      "table":item['table'],
      "column":item['column'],
      "detected_entities":item['entities_list']
    }
  )

In [None]:
output_df = spark.createDataFrame(df_create_list)

In [None]:
display(output_df.select("database","table","column","detected_entities"))

In [None]:
output_df.write.saveAsTable("detection_output_results")