# AutoML Text Classifier
This notebook uses Google Cloud AutoML to train a text classification model on a Kaggle dataset.

Install the required packages

In [3]:
!pip3 install --upgrade kaggle google-cloud-automl google-api-core google-api-python-client

Requirement already up-to-date: kaggle in /usr/local/envs/py3env/lib/python3.5/site-packages (1.4.7.1)
Requirement already up-to-date: google-cloud-automl in /usr/local/envs/py3env/lib/python3.5/site-packages (0.1.1)
Requirement already up-to-date: google-api-core in /usr/local/envs/py3env/lib/python3.5/site-packages (1.4.1)
Requirement already up-to-date: google-api-python-client in /usr/local/envs/py3env/lib/python3.5/site-packages (1.7.4)


Read the encrypted Kaggle credentials from Cloud Storage, decrypt them with Cloud KMS, and store them in the home directory (/content/.kaggle/kaggle.json).
See: https://cloud.google.com/kms/docs/store-secrets and https://cloud.google.com/kms/docs/quickstart#encrypt_data

In [1]:

from io import BytesIO
import googleapiclient.discovery
import base64
import json
import os
import zipfile
import pandas as pd
from google.cloud import automl_v1beta1 as automl
from google.api_core.exceptions import AlreadyExists
import numpy as np


In [2]:
%%storage read --object gs://mdh-secrets/kaggle.json.encrypted --variable kaggle_id

In [3]:
# Build a Cloud KMS client through the API discovery service.
kms_client = googleapiclient.discovery.build('cloudkms', 'v1')
project = "mdh-test-restricted-datalab"
location = "global"
keyring = "storage"
cryptokey = "mykey"
key_name = 'projects/{}/locations/{}/keyRings/{}/cryptoKeys/{}'.format(project, location, keyring, cryptokey)
crypto_keys = kms_client.projects().locations().keyRings().cryptoKeys()
# The request carries the ciphertext as base64 text; the reply's plaintext is base64 too.
request = crypto_keys.decrypt(
  name=key_name,
  body={'ciphertext': base64.b64encode(kaggle_id).decode('ascii')}
)
plaintext = base64.b64decode(request.execute()['plaintext'].encode('ascii'))
# Write the decrypted credentials to the file the kaggle CLI reads.
os.makedirs(os.path.dirname('/content/.kaggle/kaggle.json'), exist_ok=True)
with open('/content/.kaggle/kaggle.json', 'wb') as f:
  f.write(plaintext)
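
The cell above base64-encodes kaggle_id before the call and base64-decodes the plaintext field of the reply, because the request and response bodies are JSON and carry binary data as text. The sketch below reproduces only that encoding round trip. fake_decrypt is a hypothetical stand-in for the service call (it just reverses the bytes) and decrypt_bytes is an illustrative helper name; no request is made.

```python
import base64


def fake_decrypt(body):
    """Hypothetical stand-in for the decrypt call: reverses the bytes."""
    ciphertext = base64.b64decode(body["ciphertext"])
    plaintext = ciphertext[::-1]  # placeholder transformation, not real crypto
    # Return the bytes as base64 text inside a JSON-style dict.
    return {"plaintext": base64.b64encode(plaintext).decode("ascii")}


def decrypt_bytes(raw_ciphertext, decrypt_fn):
    """Wrap raw bytes as base64 text, call decrypt_fn, and unwrap the reply."""
    body = {"ciphertext": base64.b64encode(raw_ciphertext).decode("ascii")}
    response = decrypt_fn(body)
    return base64.b64decode(response["plaintext"].encode("ascii"))


print(decrypt_bytes(b"terces", fake_decrypt))  # prints b'secret'
```

Keeping the wrapping and unwrapping in one helper makes it harder to forget one of the two conversions.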

Download the dataset from Kaggle

In [4]:
!kaggle datasets download -p /content/kaggle rmisra/news-category-dataset

news-category-dataset.zip: Skipping, found more recently modified local copy (use --force to force download)


Open the zip archive and load the JSON file into a pandas DataFrame

In [5]:
with zipfile.ZipFile("/content/kaggle/news-category-dataset.zip") as z:
  with z.open('News_Category_Dataset.json') as f:
    data = pd.read_json(f,lines=True)
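
The lines=True argument tells pandas that the file holds one JSON object per line rather than a single JSON document. As a rough illustration of what that means, the sketch below reads the same kind of file with only the standard library. The archive is built in memory, and the member name sample.json, its two records, and the helper read_json_lines are made up for the example.

```python
import io
import json
import zipfile

# Build a small in-memory archive so the example needs no download.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as z:
    z.writestr(
        "sample.json",
        '{"category": "CRIME", "short_description": "First"}\n'
        '{"category": "ENTERTAINMENT", "short_description": "Second"}\n',
    )
buffer.seek(0)


def read_json_lines(archive, member):
    """Return one dict per non-empty line of a JSON Lines member in a zip."""
    with zipfile.ZipFile(archive) as z:
        with z.open(member) as f:
            text = io.TextIOWrapper(f, encoding="utf-8")
            return [json.loads(line) for line in text if line.strip()]


records = read_json_lines(buffer, "sample.json")
print(records[0]["category"])  # prints CRIME
```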

In [6]:
data.head()

Unnamed: 0,authors,category,date,headline,link,short_description
0,Melissa Jeltsen,CRIME,2018-05-26,There Were 2 Mass Shootings In Texas Last Week...,https://www.huffingtonpost.com/entry/texas-ama...,She left her husband. He killed their children...
1,Andy McDonald,ENTERTAINMENT,2018-05-26,Will Smith Joins Diplo And Nicky Jam For The 2...,https://www.huffingtonpost.com/entry/will-smit...,Of course it has a song.
2,Ron Dicker,ENTERTAINMENT,2018-05-26,Hugh Grant Marries For The First Time At Age 57,https://www.huffingtonpost.com/entry/hugh-gran...,The actor and his longtime girlfriend Anna Ebe...
3,Ron Dicker,ENTERTAINMENT,2018-05-26,Jim Carrey Blasts 'Castrato' Adam Schiff And D...,https://www.huffingtonpost.com/entry/jim-carre...,The actor gives Dems an ass-kicking for not fi...
4,Ron Dicker,ENTERTAINMENT,2018-05-26,Julianna Margulies Uses Donald Trump Poop Bags...,https://www.huffingtonpost.com/entry/julianna-...,"The ""Dietland"" actress said using the bags is ..."


Keep only the short description and category columns, and drop rows whose description is empty or null

In [7]:
# Copy the two columns so the in-place edits below act on an independent frame.
data_limited = data[["short_description","category"]].copy()
data_limited.replace('',np.nan,inplace=True)
data_limited.dropna(inplace=True)
data_limited.head()


Unnamed: 0,short_description,category
0,She left her husband. He killed their children...,CRIME
1,Of course it has a song.,ENTERTAINMENT
2,The actor and his longtime girlfriend Anna Ebe...,ENTERTAINMENT
3,The actor gives Dems an ass-kicking for not fi...,ENTERTAINMENT
4,"The ""Dietland"" actress said using the bags is ...",ENTERTAINMENT
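
Before uploading, it can help to check how many rows each category kept after cleaning, since a label with very few examples gives the classifier little to learn from. The following is a minimal sketch using only the standard library. The rows are made-up sample data, sparse_labels is an illustrative helper, and min_examples is a hypothetical threshold chosen for the example, not a documented AutoML limit.

```python
from collections import Counter

# Made-up sample rows in (short_description, category) order.
rows = [
    ("First description", "ENTERTAINMENT"),
    ("Second description", "ENTERTAINMENT"),
    ("Third description", "CRIME"),
]


def sparse_labels(rows, min_examples):
    """Return the sorted labels that have fewer than min_examples rows."""
    counts = Counter(label for _text, label in rows)
    return sorted(label for label, n in counts.items() if n < min_examples)


print(sparse_labels(rows, min_examples=2))  # prints ['CRIME']
```

On the real frame, data_limited["category"].value_counts() gives the same per-label counts.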


Create a GCS bucket to store the CSV file. The bucket name must be in the format PROJECT_ID-lcm

In [8]:
%%storage create -p $project -b gs://$project-lcm

Create the dataset. If it already exists, catch the error and look the dataset up by name instead

In [11]:
dataset_name = 'news_dataset_v2'
compute_region = 'us-central1'
client = automl.AutoMlClient()
project_location = client.location_path(project, compute_region)
classification_type = "MULTICLASS"
dataset_metadata = {"classification_type": classification_type}
my_dataset = {
  "display_name": dataset_name,
  "text_classification_dataset_metadata": dataset_metadata
}
# Only the create call can raise AlreadyExists, so only it sits inside the try.
try:
  dataset = client.create_dataset(project_location, my_dataset)
except AlreadyExists:
  print('dataset_exists')
# Look the dataset up by display name so the path and ID are set in both cases.
dataset_path = ''
dataset_id = ''
for element in client.list_datasets(project_location):
  if element.display_name == dataset_name:
    dataset_path = element.name
    dataset_id = element.name.split('/')[-1]
    break
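
The loop above takes the dataset ID from the last segment of the resource name with split('/')[-1]. A slightly more defensive version of that step could check the shape of the name first, as sketched below. The helper name dataset_id_from_name and the sample project and dataset values are hypothetical; the layout projects/PROJECT/locations/REGION/datasets/ID is assumed from the names this notebook works with.

```python
def dataset_id_from_name(resource_name):
    """Return the trailing ID of a projects/*/locations/*/datasets/* name."""
    parts = resource_name.split("/")
    expected = ("projects", "locations", "datasets")
    # Segments 0, 2 and 4 are fixed keywords; 1, 3 and 5 are the values.
    if len(parts) != 6 or tuple(parts[0::2]) != expected or not parts[5]:
        raise ValueError("unexpected dataset name: {!r}".format(resource_name))
    return parts[5]


# Hypothetical resource name used only for illustration.
name = "projects/my-project/locations/us-central1/datasets/TCN123"
print(dataset_id_from_name(name))  # prints TCN123
```

Raising on an unexpected name surfaces a bad lookup early instead of passing an empty or wrong ID on to the training step.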
    

Create the input data csv from the pandas dataframe

In [10]:
import csv
from tensorflow.python.lib.io import file_io
# file_io.FileIO writes straight to the gs:// path in the bucket created above.
data_csv = "gs://{}-lcm/news-categories.csv".format(project)
with file_io.FileIO(data_csv,'w') as f:
  writer = csv.writer(f,delimiter=',')
  # One row per example: the text first, then its label.
  for index,row in data_limited.iterrows():
    writer.writerow([row['short_description'],row['category']])
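
Many descriptions contain commas and double quotes, so it matters that csv.writer quotes those fields rather than writing them raw. The sketch below shows that behavior on two made-up rows, written to an in-memory buffer instead of a bucket, and then reads the text back to confirm the round trip.

```python
import csv
import io

# Made-up rows: the first description contains both a comma and double quotes.
rows = [
    ['She said, "hello", and left.', "ENTERTAINMENT"],
    ["Of course it has a song.", "ENTERTAINMENT"],
]

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=",")
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back recovers the original fields unchanged.
parsed = list(csv.reader(io.StringIO(csv_text)))
print(parsed == rows)  # prints True
```

The first field is wrapped in double quotes and its inner quotes are doubled, which is why building the lines by hand with string joins would be fragile here.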



Import the training data into the dataset

In [12]:
input_config = {'gcs_source':{'input_uris':[data_csv]}}
response = client.import_data(dataset_path,input_config)
print("Processing import...")
# response.result() blocks until the import operation has finished.
print('Data imported. {}'.format(response.result()))

Processing import...
Data imported. 


Train the model on the dataset

In [14]:
model_name = 'newsmodel_v3'
my_model  = {
  'display_name': model_name,
  'dataset_id': dataset_id,
  'text_classification_model_metadata': {}
}
response = client.create_model(project_location,my_model)
print('Training operation: {}'.format(response.operation.name))
print('Training started')

Training operation: projects/785778569209/locations/us-central1/operations/TCN3469769602761480693
Training started


Review your dataset and model at https://cloud.google.com/automl/ui/text. The documentation at https://cloud.google.com/natural-language/automl/docs/ describes how to analyze the model from Python.