[View in Colaboratory](https://colab.research.google.com/github/jagatfx/turicreate-colab/blob/master/turicreate_text_classification.ipynb)

# Text Classification
https://apple.github.io/turicreate/docs/userguide/text_classifier/

Text classification assigns labels to text. One of its most common uses is sentiment analysis, which applies natural language processing (NLP) techniques to extract subjective information such as the polarity of the text, e.g., whether the author is speaking positively or negatively about some topic.

In many cases, it can help keep a pulse on users' needs and adapt products and services accordingly. Many applications exist for this type of analysis:

*   Forum data: Find out how people feel about various products and features.
*   Restaurant and movie reviews: What are people raving about? What do people hate?
*   Social media: What is the sentiment about a hashtag, e.g. for a company, politician, etc?
*   Call center transcripts: Are callers praising or complaining about particular topics?

Text classification can also be used to identify features (or aspects) of the entities mentioned, and then estimate the sentiment for each aspect. For example, when studying reviews of mobile phones you may be interested in how people feel about aspects such as battery life, screen resolution, size, etc.


Turi Create applications for automated text analytics include:

*   detecting user sentiment regarding product reviews
*   creating features for use in other machine learning models
*   understanding large collections of documents

## Turi Create and GPU Setup

In [0]:
!apt install libnvrtc8.0
!pip uninstall -y mxnet-cu80 && pip install mxnet-cu80==1.1.0
!pip install turicreate

## Google Drive Access

You will be asked to click a link to generate a secret key to access your Google Drive.

Copy and paste the secret key into the space provided in the notebook.

In [0]:
import os.path
from google.colab import drive

# mount Google Drive to /content/drive/My Drive/
if os.path.isdir("/content/drive/My Drive"):
  print("Google Drive already mounted")
else:
  drive.mount('/content/drive')

## Fetch Example Data

*   Yelp review data: https://www.yelp.com/dataset

In [0]:
import os.path
import urllib.request
import tarfile
import zipfile
import gzip
from shutil import copy

def fetch_remote_datafile(filename, remote_url):
  if os.path.isfile("./" + filename):
    print("already have " + filename + " in workspace")
    return
  print("fetching " + filename + " from " + remote_url + "...")
  urllib.request.urlretrieve(remote_url, "./" + filename)

def cache_datafile_in_drive(filename):
  if not os.path.isfile("./" + filename):
    print("cannot cache " + filename + ", it is not in workspace")
    return
  
  data_drive_path = "/content/drive/My Drive/Colab Notebooks/data/"
  if os.path.isfile(data_drive_path + filename):
    print(filename + " has already been stored in Google Drive")
  else:
    print("copying " + filename + " to " + data_drive_path)
    copy("./" + filename, data_drive_path)
  

def load_datafile_from_drive(filename, remote_url=None):
  data_drive_path = "/content/drive/My Drive/Colab Notebooks/data/"
  if os.path.isfile("./" + filename):
    print("already have " + filename + " in workspace")
  elif os.path.isfile(data_drive_path + filename):
    print("have " + filename + " in Google Drive, copying to workspace...")
    copy(data_drive_path + filename, ".")
  elif remote_url is not None:
    fetch_remote_datafile(filename, remote_url)
  else:
    print("error: you need to manually download " + filename + " and put it in Google Drive")
    
def extract_datafile(filename, expected_extract_artifact=None):
  if expected_extract_artifact is not None and (os.path.isfile(expected_extract_artifact) or os.path.isdir(expected_extract_artifact)):
    print("files in " + filename + " have already been extracted")
  elif not os.path.isfile("./" + filename):
    print("error: cannot extract " + filename + ", it is not in the workspace")
  else:
    extension = filename.split('.')[-1]
    if extension == "zip":
      print("extracting " + filename + "...")
      # the context manager closes the archive even if extraction fails
      with zipfile.ZipFile(filename) as z:
        for name in z.namelist():
          print("    extracting file", name)
          z.extract(name, "./")
    elif extension == "gz":
      print("extracting " + filename + "...")
      if filename.split('.')[-2] == "tar":
        with tarfile.open(filename) as tar:
          tar.extractall()
      else:
        # plain .gz: decompress to the same name without the .gz suffix
        with gzip.open(filename, 'rb') as data_zip_file:
          data = data_zip_file.read()
        with open('.'.join(filename.split('.')[0:-1]), 'wb') as extracted_file:
          extracted_file.write(data)
    elif extension == "tar":
      print("extracting " + filename + "...")
      with tarfile.open(filename) as tar:
        tar.extractall()
    elif extension == "csv":
      print("do not need to extract csv")
    else:
      print("cannot extract " + filename)
      
def load_cache_extract_datafile(filename, expected_extract_artifact=None, remote_url=None):
  load_datafile_from_drive(filename, remote_url)
  extract_datafile(filename, expected_extract_artifact)
  cache_datafile_in_drive(filename)
  

In [2]:
load_cache_extract_datafile("yelp-data.csv.zip", "yelp-data.csv", "https://static.turi.com/datasets/regression/yelp-data.csv")

already have yelp-data.csv.zip in workspace
files in yelp-data.csv.zip have already been extracted
yelp-data.csv.zip has already been stored in Google Drive


In [4]:
load_cache_extract_datafile("w16.csv.zip", "w16.csv", "https://static.turi.com/datasets/wikipedia/raw/w16.csv")

already have w16.csv.zip in workspace
files in w16.csv.zip have already been extracted
w16.csv.zip has already been stored in Google Drive


## Setup Turi Create

In [0]:
import mxnet as mx
import turicreate as tc

In [0]:
# Use all GPUs (default)
tc.config.set_num_gpus(-1)

# Use only 1 GPU
#tc.config.set_num_gpus(1)

# Use CPU
#tc.config.set_num_gpus(0)

## Text Classifier Example - Yelp Sentiment

In [45]:
# Load the data
ydata = tc.SFrame('yelp-data.csv')
print(ydata)

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,int,str,str,str,dict,int,int,int,list,str,str,float,float,str,int,int,float,str,str,float,str,int,str,int,int,int,dict]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


+------------------------+------------+------------------------+-------+
|      business_id       |    date    |       review_id        | stars |
+------------------------+------------+------------------------+-------+
| 9yKzy9PApeiPPOUJEtnvkg | 2011-01-26 | fWKvX83p0-ka4JS3dc6E5A |   5   |
| ZRJwVLyzEJq1VAihDhYiow | 2011-07-27 | IjZ33sJrzXqU-0X6U8NwyA |   5   |
| 6oRAC4uyJCsJl1X0WZpVSA | 2012-06-14 | IESLBzqUCLdSzSqm0eCSxQ |   4   |
| _1QQZuf4zZOyFCvXc0o6Vg | 2010-05-27 | G-WvGaISbqqaMHlNnByodA |   5   |
| 6ozycU1RpktNG2-1BroVtw | 2012-01-05 | 1uJFq2r5QfJG_6ExMRCaGw |   5   |
| -yxfBYGB6SEqszmxJxd97A | 2007-12-13 | m2CKSsepBCoRYWxiRUsxAg |   4   |
| zp713qNhx8d9KCJJnrw1xA | 2010-02-12 | riFQ3vxNpP4rWLk_CSri2A |   5   |
| hW0Ne_HTHEAgGF1rAdmR-g | 2012-07-12 | JL7GXJ9u4YMx7Rzs05NfiQ |   4   |
| wNUea3IXZWD63bbOQaOH-g | 2012-08-17 | XtnfnYmnJYi71yIuGsXIUA |   4   |
| nMHhuYan8e3cONo3PornJA | 2010-08-11 | jJAIXA46pU1swYyRCdfXtQ |   5   |
+------------------------+------------+------------

The text classifier in Turi Create is currently a simple combination of two components:

*   feature engineering: a bag-of-words transformation
*   statistical model: a LogisticClassifier is used to classify text based on the above features

Bag-of-words features combined with a logistic regression classifier form a very strong baseline for this task and work well on a wide variety of datasets.
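
To make the two components concrete, here is a minimal pure-Python sketch of the same idea: word counts as features, fed to a logistic model. It is an illustration only, not Turi Create's implementation. The toy reviews, the binary labels, the helper names (`bag_of_words`, `predict_proba`) and the stochastic gradient descent loop are all assumptions made for this example; the notebook's model predicts five star classes and is trained with a different solver.

```python
import math
from collections import Counter

# Hypothetical toy reviews with binary labels (1 = positive, 0 = negative).
train = [
    ("great food great service", 1),
    ("loved the friendly staff", 1),
    ("terrible food rude staff", 0),
    ("awful service never again", 0),
]

def bag_of_words(text):
    """Feature engineering: map a document to {word: count}."""
    return Counter(text.lower().split())

def predict_proba(weights, bias, features):
    """Statistical model: sigmoid of the weighted sum of word counts."""
    z = bias + sum(weights.get(word, 0.0) * count for word, count in features.items())
    return 1.0 / (1.0 + math.exp(-z))

# Stochastic gradient descent on the log loss, one example at a time.
weights, bias, learning_rate = {}, 0.0, 0.5
for _ in range(200):
    for text, label in train:
        features = bag_of_words(text)
        error = predict_proba(weights, bias, features) - label
        bias -= learning_rate * error
        for word, count in features.items():
            weights[word] = weights.get(word, 0.0) - learning_rate * error * count

# Probability that an unseen review is positive.
print(predict_proba(weights, bias, bag_of_words("great friendly service")))
```

Words that only ever appear in positive reviews end up with positive weights, and vice versa, which is the same effect visible in the coefficient listing printed by `model.classifier` later in this notebook.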

In [11]:
# Create a model
model = tc.text_classifier.create(ydata, 'stars', features=['text'])

PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.



In [46]:
# Make predictions
predictions = model.predict(ydata)
print(predictions)

[5, 3, 4, 5, 5, 4, 5, 2, 5, 5, 5, 4, 5, 5, 4, 2, 3, 5, 3, 1, 4, 4, 5, 1, 5, 4, 5, 4, 4, 4, 5, 2, 5, 5, 3, 3, 4, 4, 4, 4, 5, 4, 4, 4, 4, 5, 4, 4, 4, 4, 4, 4, 4, 5, 4, 2, 5, 4, 5, 4, 1, 3, 1, 3, 5, 5, 4, 5, 4, 2, 4, 4, 5, 4, 4, 5, 4, 1, 4, 5, 5, 5, 1, 4, 5, 2, 1, 4, 5, 5, 5, 3, 5, 5, 5, 4, 5, 5, 5, 5, ... ]


In [76]:
test_sentence = tc.SFrame({'text': ['this place is okay, we could go somewhere else if you want. this was good. the other thing was bad'], 'stars': [3]})
test_result = model.evaluate(test_sentence)
test_result['confusion_matrix']

target_label,predicted_label,count
3,4,1


In [75]:
test_sentence = tc.SFrame({'text': ['this place is terrible, I will never go back here'], 'stars': [1]})
test_result = model.evaluate(test_sentence)
test_result['confusion_matrix']

target_label,predicted_label,count
1,1,1


In [13]:
classifier = model.classifier
print(classifier)

Class                          : LogisticClassifier

Schema
------
Number of coefficients         : 251104
Number of examples             : 204936
Number of classes              : 5
Number of feature columns      : 1
Number of unpacked features    : 62775

Hyperparameters
---------------
L1 penalty                     : 0.0
L2 penalty                     : 0.2

Training Summary
----------------
Solver                         : lbfgs
Solver iterations              : 10
Solver status                  : Completed (Iteration limit reached).
Training time (sec)            : 1256.6885

Settings
--------
Log-likelihood                 : 146535.9458

Highest Positive Coefficients
-----------------------------
text[endorses]                 : 7.9638
text[lovelier]                 : 7.1328
text[urinalysis]               : 7.1111
text[vegen]                    : 6.9885
text[limy]                     : 6.8805

Lowest Negative Coefficients
----------------------------
text[mussamon]                

In [19]:
# Evaluate the model
results = model.evaluate(ydata[0:10])
print(results)

{'accuracy': 0.7, 'auc': nan, 'confusion_matrix': Columns:
	target_label	int
	predicted_label	int
	count	int

Rows: 5

Data:
+--------------+-----------------+-------+
| target_label | predicted_label | count |
+--------------+-----------------+-------+
|      4       |        4        |   2   |
|      5       |        5        |   5   |
|      5       |        3        |   1   |
|      4       |        2        |   1   |
|      4       |        5        |   1   |
+--------------+-----------------+-------+
[5 rows x 3 columns]
, 'f1_score': 0.375, 'log_loss': 0.9071786727194706, 'precision': 0.45833333333333337, 'recall': 0.6666666666666667, 'roc_curve': Columns:
	threshold	float
	fpr	float
	tpr	float
	p	int
	n	int
	class	int

Rows: 500005

Data:
+-----------+-----+-----+---+----+-------+
| threshold | fpr | tpr | p | n  | class |
+-----------+-----+-----+---+----+-------+
|    0.0    | 1.0 | nan | 0 | 10 |   0   |
|   1e-05   | 0.9 | nan | 0 | 10 |   0   |
|   2e-05   | 0.9 | nan | 0 

## Save and Export Model

In [0]:
# Save the model for later use in Turi Create
model.save('TextClassifier.model')

In [0]:
model = tc.load_model('TextClassifier.model')

In [0]:
# Export for use in Core ML
model.export_coreml('TextClassifier.mlmodel')

In [0]:
# download mlmodel locally
from google.colab import files
files.download("TextClassifier.mlmodel")

In [0]:
# copy model to Google Drive
from shutil import copy
copy("/content/TextClassifier.mlmodel", "/content/drive/My Drive/Colab Notebooks/data/models/TextClassifier.mlmodel")

In [0]:
from shutil import copytree
copytree("/content/TextClassifier.model", "/content/drive/My Drive/Colab Notebooks/data/models/TextClassifier.model")

## Text Analysis Example - Wikipedia

Each line of w16.csv contains all of the text in a single document.

In [12]:
wdata = tc.SFrame.read_csv('w16.csv', header=False)
print(wdata[0])

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


{'X1': 'alainconnes alain connes is one of the leading specialists on operator algebras  in his early work on von neumann algebras in the 1970s he succeeded in obtaining the almost complete classification of injective factors  following this he made contributions in operator ktheory and index theory which culminated in the baumconnes conjecture he also introduced cyclic cohomology in the early 1980s as a first step in the study of noncommutative differential geometry connes has applied his work in areas of mathematics and theoretical physics including number theory differential geometry and particle physics connes was awarded the fields medal in 1982 the crafoord prize in 2001 and the gold medal of the cnrs in 2004   he is a member of the french academy of sciences and several foreign academies and societies including the danish academy of sciences norwegian academy of sciences russian academy of sciences and us national academy of sciences'}


### Bag-of-words

Both SFrames and SArrays expose functionality that can be very useful for manipulating text data. For example, one common preprocessing task for text data is to transform it into "bag-of-words" format: each document is represented by a map where the words are keys and the values are the number of occurrences. So a document containing the text "hello goodbye hello" would be represented by a dict type element containing the value {"hello": 2, "goodbye":1}. This transformation can be accomplished with the following code.

In [15]:
bow = tc.text_analytics.count_words(wdata['X1'])
bow

dtype: dict
Rows: 72269
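
For intuition, the same transformation can be sketched in plain Python. This is a simplified stand-in, assuming lowercasing and whitespace tokenization; the local `count_words` helper is hypothetical and Turi Create's own function may tokenize differently.

```python
from collections import Counter

def count_words(text):
    """Return a {word: count} dict for one document (lowercased, whitespace-split)."""
    return dict(Counter(text.lower().split()))

docs = ["hello goodbye hello", "good morning"]
bow = [count_words(doc) for doc in docs]
print(bow[0])  # {'hello': 2, 'goodbye': 1}

# Rough analogue of dict_has_any_keys: which documents contain any of the given words?
has_goodbye = [any(key in b for key in ["goodbye"]) for b in bow]
print(has_goodbye)  # [True, False]
```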

In [16]:
# print five words in first document
list(bow[0].keys())[:5]

['national', 'norwegian', 'societies', 'academies', 'which']

In [21]:
# find the documents that contain the word "gold"
wdata[bow.dict_has_any_keys(['gold'])][0]

{'X1': 'alainconnes alain connes is one of the leading specialists on operator algebras  in his early work on von neumann algebras in the 1970s he succeeded in obtaining the almost complete classification of injective factors  following this he made contributions in operator ktheory and index theory which culminated in the baumconnes conjecture he also introduced cyclic cohomology in the early 1980s as a first step in the study of noncommutative differential geometry connes has applied his work in areas of mathematics and theoretical physics including number theory differential geometry and particle physics connes was awarded the fields medal in 1982 the crafoord prize in 2001 and the gold medal of the cnrs in 2004   he is a member of the french academy of sciences and several foreign academies and societies including the danish academy of sciences norwegian academy of sciences russian academy of sciences and us national academy of sciences'}

In [24]:
# save this representation of the documents as another column of the original SFrame
wdata['bow'] = bow
wdata.head()

X1,bow
alainconnes alain connes is one of the leading ...,"{'national': 1, 'norwegian': 1, ..."
americannationalstandards institute the american ...,"{'industry': 1, 'current': 1, 'nescc' ..."
alberteinstein near the beginning of his career ...,"{'50000': 1, 'winners': 1, 'peace': 2, ..."
austriangerman as german is a pluricentric ...,"{'spite': 1, 'rhythm': 1, 'markedly': 1, 'border': ..."
arsenic arsenic is a metalloid it can exis ...,"{'coke': 1, 'metal': 1, 'nonferrous': 1, 'gla ..."
alps the alps alpen alpi alp alpes aupsalps alps ...,"{'ranunculus': 1, 'convert4000': 1, ..."
alexiscarrel born in saintefoylslyon rhne ...,"{'accepted': 1, 'brief': 1, 'intention': 1, ..."
adelaide adelaide is a coastal city situated on ...,"{'desalination': 1, 'pumping': 1, 'demand': ..."
artist an artist is a person engaged in one or ...,"{'complete': 1, 'purposes': 1, 'menti ..."
abdominalsurgery the three most common ...,"{'leakage': 1, 'thus': 1, 'as': 1, 'bowel': 2, ..."


### TF-IDF

Another useful representation for text data is called TF-IDF (term frequency - inverse document frequency). This is a modification of the bag-of-words format where the counts are transformed into scores: words that are common across the document corpus are given low scores, and rare words occurring often in a document are given high scores.

In [0]:
wdata['tfidf'] = tc.text_analytics.tf_idf(wdata['bow'])

In [26]:
wdata.head()

X1,bow,tfidf
alainconnes alain connes is one of the leading ...,"{'national': 1, 'norwegian': 1, ...","{'national': 1.9620352560896113, ..."
americannationalstandards institute the american ...,"{'industry': 1, 'current': 1, 'nescc' ...","{'industry': 3.314172167576655, ..."
alberteinstein near the beginning of his career ...,"{'50000': 1, 'winners': 1, 'peace': 2, ...","{'50000': 5.22457120356271, ..."
austriangerman as german is a pluricentric ...,"{'spite': 1, 'rhythm': 1, 'markedly': 1, 'border': ...","{'spite': 5.103651134105985, ..."
arsenic arsenic is a metalloid it can exis ...,"{'coke': 1, 'metal': 1, 'nonferrous': 1, 'gla ...","{'coke': 6.98345792779019, ..."
alps the alps alpen alpi alp alpes aupsalps alps ...,"{'ranunculus': 1, 'convert4000': 1, ...","{'ranunculus': 9.578712634747056, ..."
alexiscarrel born in saintefoylslyon rhne ...,"{'accepted': 1, 'brief': 1, 'intention': 1, ...","{'accepted': 3.734588675537783, ..."
adelaide adelaide is a coastal city situated on ...,"{'desalination': 1, 'pumping': 1, 'demand': ...","{'desalination': 8.88556545418711, ..."
artist an artist is a person engaged in one or ...,"{'complete': 1, 'purposes': 1, 'menti ...","{'complete': 3.27024996085324, ..."
abdominalsurgery the three most common ...,"{'leakage': 1, 'thus': 1, 'as': 1, 'bowel': 2, ...","{'leakage': 7.604631608725046, ..."
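
The scoring idea can be sketched in a few lines of plain Python. The sketch assumes the common weighting `count * log(N / df)`, where `N` is the number of documents and `df` is the number of documents containing the word; the exact formula Turi Create uses may differ, and the small `bows` corpus below is made up for illustration.

```python
import math
from collections import Counter

def tf_idf(bows):
    """Score each word as count * log(N / df) over a list of bag-of-words dicts."""
    n_docs = len(bows)
    # Iterating over a dict yields its keys, so each word counts once per document.
    doc_freq = Counter(word for bow in bows for word in bow)
    return [
        {word: count * math.log(n_docs / doc_freq[word]) for word, count in bow.items()}
        for bow in bows
    ]

# Hypothetical three-document corpus.
bows = [
    {"the": 3, "beatles": 2},
    {"the": 1, "arsenic": 4},
    {"the": 2, "alps": 1},
]
scores = tf_idf(bows)
print(scores[0])
```

A word such as "the" that appears in every document gets a score of zero, while a word that is frequent in one document and absent elsewhere gets a high score.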


### BM25

The BM25 score is yet another useful representation for text data. It scores each document in a corpus according to the document's relevance to a particular query.

https://apple.github.io/turicreate/docs/api/generated/turicreate.text_analytics.bm25.html
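
As a rough illustration of how such a score is computed, the sketch below implements one common variant of Okapi BM25 in plain Python. The IDF form, the parameter values `k1=1.5` and `b=0.75`, the local `bm25_scores` helper, and the three example documents are assumptions for this example; consult the API page above for what Turi Create actually computes.

```python
import math
from collections import Counter

def bm25_scores(docs, query, k1=1.5, b=0.75):
    """Return (doc_id, score) pairs, best first, for documents matching any query term."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    results = []
    for doc_id, tokens in enumerate(tokenized):
        counts = Counter(tokens)
        score = 0.0
        for term in query:
            tf = counts[term]
            if tf == 0:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            # Longer-than-average documents are penalized via the length normalization.
            norm = tf + k1 * (1.0 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1.0) / norm
        if score > 0.0:
            results.append((doc_id, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)

# Hypothetical mini-corpus.
docs = [
    "the beatles recorded the song with john and paul",
    "arsenic is a metalloid",
    "john wrote a letter",
]
ranked = bm25_scores(docs, ["beatles", "john", "paul"])
print(ranked)
```

Documents that contain none of the query terms are left out of the result, which mirrors the library output in this notebook, where only a subset of the corpus receives a score.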

In [32]:
query = ['beatles', 'john', 'paul']
bm25_scores = tc.text_analytics.bm25(wdata["X1"], query)

+--------+--------------------+
| doc_id |        bm25        |
+--------+--------------------+
| 14579  | 20.858418102812646 |
| 38137  | 17.356521355914538 |
|  9384  | 16.039850349384903 |
|  2355  | 15.798258743981112 |
| 57034  | 15.789714879105889 |
| 14556  | 15.78412222842356  |
| 14555  | 15.55229034937392  |
| 59546  | 15.414278860004568 |
| 68926  | 15.305128039009258 |
|  2768  | 15.175833663227923 |
+--------+--------------------+
[7751 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.


In [33]:
print(bm25_scores.sort("bm25", ascending = False))

+--------+--------------------+
| doc_id |        bm25        |
+--------+--------------------+
| 14579  | 20.858418102812646 |
| 38137  | 17.356521355914538 |
|  9384  | 16.039850349384903 |
|  2355  | 15.798258743981112 |
| 57034  | 15.789714879105889 |
| 14556  | 15.78412222842356  |
| 14555  | 15.55229034937392  |
| 59546  | 15.414278860004568 |
| 68926  | 15.305128039009258 |
|  2768  | 15.175833663227923 |
+--------+--------------------+
[7751 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.


In [34]:
wdata[14579]

{'X1': 'dizzymisslizzie the song has been covered many times including  most famously  by the beatles on the 1965 help album though the recording was initially intended for the 1965 american compilation beatles vi along with the larry williams cover bad boy recorded by the beatles on the same day  paul mccartney has stated that he believes this song to be one of the beatles best recordings it features loud rhythmic instrumentation along with john lennons particularly rousing vocals the song also appeared in a live solo version by lennon on the plastic ono bands live peace in toronto 1969 in the united kingdom the beatles version first appeared on the album help misspelled dizzy miss lizzy in north america it was included on beatles vi the song was originally thought about by band manager brian epstein and was later introduced to ringo starr the bands drummer he made sure that the band recorded it after loving its upbeat rhythm and interesting lyrics',
 'bow': {'1965': 2,
  '1969': 1,
 

Turi Create also contains a helper function called stop_words that returns a list of common words. We can use SArray.dict_trim_by_keys to remove these words from the documents as a preprocessing step. The cell below also uses SArray.dict_trim_by_values to keep only words that occur at least twice in a document. NB: Currently only English words are available.

In [30]:
docs = wdata['bow'].dict_trim_by_values(2)
docs = docs.dict_trim_by_keys(tc.text_analytics.stop_words(), exclude=True)
docs

dtype: dict
Rows: 72269
[{'theory': 2, 'operator': 2, 'physics': 2, 'connes': 3, 'medal': 2, 'early': 2, 'algebras': 2, 'work': 2, 'geometry': 2, 'differential': 2, 'academy': 5, 'including': 2, 'sciences': 5}, {'nescc': 2, 'usnc': 2, 'iso': 3, 'administers': 2, 'process': 2, 'procedures': 2, 'iec': 4, 'commission': 2, 'electrotechnical': 2, 'panels': 2, 'identify': 2, 'annual': 2, 'organizations': 8, 'accredits': 2, 'consensus': 3, 'technical': 2, 'develop': 2, 'people': 2, 'developed': 2, '1918': 2, 'membership': 2, 'national': 9, 'oversees': 2, 'institute': 8, 'personnel': 2, 'agencies': 3, 'american': 7, 'ansis': 2, 'ansi': 13, 'government': 4, 'societies': 2, 'accreditation': 2, 'united': 2, 'companies': 2, 'organization': 5, 'staff': 2, 'standards': 38, 'voluntary': 3, 'asa': 3, 'states': 2, 'carry': 2, 'international': 11, 'products': 6, 'bodies': 2, 'developing': 3, 'requirements': 3, 'engineering': 3, 'nuclear': 2, 'programs': 2, 'services': 2, 'budget': 2, 'formed': 3, 'aesc'
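
The two trimming steps used above can be pictured with plain dictionaries. In this sketch, `STOP_WORDS`, `trim_by_values` and `trim_by_keys` are hypothetical stand-ins written for illustration, not the Turi Create functions, and the stop-word set is a tiny made-up sample.

```python
# Hypothetical stand-in for the stop word list; the real list is much longer.
STOP_WORDS = {"the", "of", "in", "and", "he"}

def trim_by_values(bow, lower):
    """Keep only words whose count is at least `lower`."""
    return {word: count for word, count in bow.items() if count >= lower}

def trim_by_keys(bow, keys, exclude=True):
    """Drop the given keys when exclude=True; otherwise keep only those keys."""
    return {word: count for word, count in bow.items() if (word in keys) != exclude}

bow = {"the": 9, "academy": 5, "of": 7, "medal": 2, "gold": 1}
doc = trim_by_keys(trim_by_values(bow, 2), STOP_WORDS, exclude=True)
print(doc)  # {'academy': 5, 'medal': 2}
```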

### Topic Models

"Topic models" are a class of statistical models for text data. These models typically assume documents can be described by a small set of topics, and there is a probability of any word occurring for a given "topic".

For example, suppose we are given documents where the first document begins with the text "The burrito was terrible. I..." and continues with a long description of the eater's woes. A topic model attempts to do two things:

*   Learn "topics": collections of words that co-occur in a meaningful way
*   Learn how much each document pertains to each topic
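
The two quantities in this list can be pictured as two probability tables. The sketch below uses hand-picked, hypothetical numbers (a real topic model learns them from data) and assumes the usual mixture view, in which the probability of a word in a document is the topic-weighted sum of the per-topic word probabilities.

```python
# Hypothetical "topics": each maps words to probabilities that sum to 1.
topics = {
    "food":    {"burrito": 0.5, "salsa": 0.3, "terrible": 0.1, "waiter": 0.1},
    "service": {"burrito": 0.1, "salsa": 0.0, "terrible": 0.4, "waiter": 0.5},
}

# Hypothetical topic proportions for one document (how much it pertains to each topic).
doc_topic_weights = {"food": 0.7, "service": 0.3}

def word_probability(word):
    """Probability of `word` in the document under the mixture of topics."""
    return sum(weight * topics[name][word] for name, weight in doc_topic_weights.items())

probs = {word: word_probability(word) for word in topics["food"]}
print(probs)
```

A document that is mostly about the "food" topic is more likely to contain "burrito" than "waiter", even though both words are possible under either topic.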