This code example illustrates the use of PROC TEXTMINE for identifying important terms and topics in a document collection.     
                                                                      
PROC TEXTMINE parses the news data set to                            
1. generate a dictionary of important terms                        
2. generate a collection of important topics                       
                                                                      
The OUTTERMS= option specifies the terms dictionary to be created.
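
To make the idea of a terms dictionary concrete, here is a minimal pure-Python sketch. It is not how PROC TEXTMINE parses text (the real parser also handles stemming, entities, and parts of speech); the toy documents, the stop list, and the helper name `build_terms` are all assumptions made for illustration. It only shows the two counts a terms table typically carries: how often a term occurs and in how many documents.

```python
from collections import Counter

# Toy document collection and stop list (illustrative, not the news data).
docs = {
    1: "The goalie stopped the puck",
    2: "The keyboard and the mouse",
    3: "A goalie and a keyboard",
}
stop_list = {"the", "and", "a"}


def build_terms(docs, stop_list):
    """Return (frequency, document count) counters for non-stop-list terms."""
    freq = Counter()      # total occurrences of each term
    num_docs = Counter()  # number of documents that contain each term
    for text in docs.values():
        tokens = [t for t in text.lower().split() if t not in stop_list]
        freq.update(tokens)
        num_docs.update(set(tokens))  # count each term once per document
    return freq, num_docs


freq, num_docs = build_terms(docs, stop_list)

# List terms by document count, most widespread first.
for term, n in sorted(num_docs.items(), key=lambda kv: (-kv[1], kv[0])):
    print(term, freq[term], n)
```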

The OUTTOPICS= option specifies the SAS data set that receives the discovered topics; the K= option sets how many topics are generated. You can browse the TERMS and TOPICS data sets to gain insight into the document collection.
                                                                      
PROC TMSCORE allows the user to score new document collections based on training performed by a previous PROC TEXTMINE analysis.    

### Import packages

In [1]:
from swat import CAS  # only the CAS connection class is used below
from pprint import pprint

### CAS Server connection details

In [2]:
cashost='localhost'
casport=5570
casauth='~/.authinfo'
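
If the same notebook has to run against different servers, the connection details could be read from the environment rather than hardcoded. The sketch below is one possible approach: the variable names `CASHOST`, `CASPORT`, and `CAS_AUTHINFO` and the helper `cas_connection_settings` are assumptions chosen for this example, and the defaults simply mirror the values used above.

```python
import os


def cas_connection_settings(env=os.environ):
    """Read CAS connection details from a mapping, falling back to defaults.

    The variable names are assumptions made for this sketch.
    """
    host = env.get("CASHOST", "localhost")
    port = int(env.get("CASPORT", "5570"))  # the port must be an integer
    authinfo = os.path.expanduser(env.get("CAS_AUTHINFO", "~/.authinfo"))
    return host, port, authinfo


cashost, casport, casauth = cas_connection_settings()
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to check without touching the real environment.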

### Start CAS session

In [3]:
sess = CAS(cashost, casport, authinfo=casauth, caslib="casuser")

### Import action sets

In [4]:
sess.loadactionset(actionset="textMining")

NOTE: Added action set 'textMining'.


### Load data into CAS 

In [5]:
indata_dir = "/home/viyauser/casuser/data"
if not sess.table.tableExists(table="news").exists:
    sess.upload_file(indata_dir + "/news.sas7bdat", casout={"name":"news"})

if not sess.table.tableExists(table="engstop").exists:
    sess.upload_file(indata_dir+"/engstop.sas7bdat", casout={"name":"engstop"})

NOTE: Cloud Analytic Services made the uploaded file available as table NEWS in caslib CASUSER(viyauser).
NOTE: The table NEWS has been created in caslib CASUSER(viyauser) from binary data uploaded to Cloud Analytic Services.
NOTE: Cloud Analytic Services made the uploaded file available as table ENGSTOP in caslib CASUSER(viyauser).
NOTE: The table ENGSTOP has been created in caslib CASUSER(viyauser) from binary data uploaded to Cloud Analytic Services.


1. Parse the documents in table news and generate the term-by-document matrix.
2. Perform dimensionality reduction via SVD.
3. Perform topic discovery based on the SVD results.
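
The three steps above can be illustrated on a toy matrix. The following sketch is pure Python and makes no claim about how the action computes its SVD or derives topics internally: the matrix, the helper names, and the use of power iteration to find only the largest singular triplet are all assumptions for illustration. It shows how the terms with the largest loadings on a singular vector form a rough "topic".

```python
import math

# Toy term-by-document matrix: rows are terms, columns are documents.
terms = ["hockey", "goal", "keyboard", "mouse"]
A = [
    [2.0, 1.0, 0.0, 0.0],  # hockey
    [1.0, 2.0, 0.0, 0.0],  # goal
    [0.0, 0.0, 3.0, 1.0],  # keyboard
    [0.0, 0.0, 1.0, 3.0],  # mouse
]


def matvec(M, v):
    """Multiply matrix M by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def transpose(M):
    return [list(col) for col in zip(*M)]


def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def top_singular_triplet(M, iters=200):
    """Power iteration on M^T M; returns (u, sigma, v) for the largest sigma."""
    Mt = transpose(M)
    # Slightly uneven start vector so it is not orthogonal to the solution.
    v = normalize([1.0 + 0.1 * i for i in range(len(M[0]))])
    for _ in range(iters):
        v = normalize(matvec(Mt, matvec(M, v)))
    Mv = matvec(M, v)
    sigma = math.sqrt(sum(x * x for x in Mv))
    u = [x / sigma for x in Mv]  # term loadings on the first dimension
    return u, sigma, v


u, sigma, v = top_singular_triplet(A)

# The two terms with the largest absolute loadings describe the "topic".
ranked = sorted(zip(terms, u), key=lambda tu: -abs(tu[1]))
top_terms = [t for t, _ in ranked[:2]]
print(round(sigma, 4), top_terms)
```

In this toy matrix the keyboard and mouse documents carry the most weight, so the first dimension loads on those two terms and the hockey terms get loadings near zero.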

In [6]:
sess.textMining.tmMine(
  documents={"name":"news"},
  stopList={"name":"engstop"},
  docId="key",
  text="text",
  reduce=2,
  entities="STD",
  k=10,
  norm="DOC",
  u={"name":"svdu", "replace":True},
  terms={"name":"terms", "replace":True},
  parent={"name":"parent", "replace":True},
  parseConfig={"name":"config", "replace":True},
  docPro={"name":"docpro", "replace":True},
  topics={"name":"topics", "replace":True}  
)

| | casLib | Name | Label | Rows | Columns | casTable |
|---|---|---|---|---|---|---|
| 0 | CASUSER(viyauser) | config | | 1 | 11 | CASTable('config', caslib='CASUSER(viyauser)') |
| 1 | CASUSER(viyauser) | terms | | 9657 | 11 | CASTable('terms', caslib='CASUSER(viyauser)') |
| 2 | CASUSER(viyauser) | parent | | 31467 | 3 | CASTable('parent', caslib='CASUSER(viyauser)') |
| 3 | CASUSER(viyauser) | svdu | | 6024 | 11 | CASTable('svdu', caslib='CASUSER(viyauser)') |
| 4 | CASUSER(viyauser) | docpro | | 598 | 11 | CASTable('docpro', caslib='CASUSER(viyauser)') |
| 5 | CASUSER(viyauser) | topics | | 10 | 3 | CASTable('topics', caslib='CASUSER(viyauser)') |


### Print results

In [7]:
allRows=20000  # Assuming max rows in terms table is <= 20,000
terms_sorted=sess.CASTable("terms").fetch(to=allRows)['Fetch'].sort_values(by="_NumDocs_", ascending=False)

print("10 Topics found by PROC TEXTMINE".center(80, '-'))
pprint(sess.CASTable("topics").fetch(to=10))

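# terms_sorted is a client-side DataFrame, so filter it with boolean indexing;
# assigning a "where" attribute to it would not filter any rows.
# Assumption: the columns are named _Attribute_ and _Role_, following the
# underscore naming of _NumDocs_. Check terms_sorted.columns if this fails.
print("Top 10 entities that appear in the news".center(80, '-'))
pprint(terms_sorted[terms_sorted["_Attribute_"] == "Entity"].head(n=10))

print("Top 10 noun terms that appear in the news".center(80, '-'))
pprint(terms_sorted[terms_sorted["_Role_"] == "Noun"].head(n=10))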

print("Structured representation of first 5 documents".center(80, '-'))
pprint(sess.CASTable("docpro").fetch(to=5))

------------------------10 Topics found by PROC TEXTMINE------------------------
CASResults([('Fetch',
             Selected Rows from Table TOPICS

   _TopicId_                                           _Name_  _TermCutOff_
0        1.0  league, +defenseman, hockey, tampa, +draft pick         0.021
1        2.0            +keyboard, pc, +price, +mouse, +thumb         0.021
2        3.0             +flyer, amour, +goal, tommy, lindros         0.020
3        4.0              period, scorer g, scorer, power, pp         0.021
4        5.0     gif, +injury, +muscle, +keyboard, +condition         0.020
5        6.0    +tool, +break, +exercise, +type, +description         0.022
6        7.0                +cancer, +day, +bath, water, +eat         0.022
7        8.0                   +versus, tor, mon, van, series         0.023
8        9.0        business, political, college, +event, dr.         0.020
9       10.0        +system, sgi, virtual, graphics, +reality         0.024)])
------------
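
Each `_Name_` value in the topics output is a comma-separated list of descriptive terms, some prefixed with `+`. The sketch below splits two of the names shown above into individual terms. The helper `parse_topic_name` is hypothetical, and reading the leading `+` as a marker for a parent term that stands for several surface forms is an assumption, not something this notebook states.

```python
# Two topic names copied from the output above.
topic_names = {
    1: "league, +defenseman, hockey, tampa, +draft pick",
    2: "+keyboard, pc, +price, +mouse, +thumb",
}


def parse_topic_name(name):
    """Split a topic name into (term, has_plus_prefix) pairs."""
    parsed = []
    for raw in name.split(","):
        term = raw.strip()
        # Assumption: "+" marks a parent term; strip it but remember the flag.
        parsed.append((term.lstrip("+"), term.startswith("+")))
    return parsed


for topic_id, name in topic_names.items():
    print(topic_id, parse_topic_name(name))
```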

### Score new text data  

In [8]:
sess.textMining.tmScore(
  documents={"name":"news"},
  u={"name":"svdu"},
  parseConfig={"name":"config"},
  terms={"name":"terms"},
  docPro={"name":"score_docpro", "replace":True},
  parent={"name":"score_parent", "replace":True},
  text="text",
  docId="key"
)

| | casLib | Name | Label | Rows | Columns | casTable |
|---|---|---|---|---|---|---|
| 0 | CASUSER(viyauser) | score_parent | | 31467 | 3 | CASTable('score_parent', caslib='CASUSER(viyau... |
| 1 | CASUSER(viyauser) | score_docpro | | 598 | 11 | CASTable('score_docpro', caslib='CASUSER(viyau... |
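
Conceptually, scoring reuses what training produced: new documents are mapped into the same reduced space using the term loadings in the U matrix. The following pure-Python sketch shows that idea only. The tiny `U` dictionary, the `project` helper, the whitespace tokenizer, and the unit-length scaling of the result are assumptions for illustration and do not reproduce the action's actual computation.

```python
import math

# Hypothetical U matrix from training: one row per term, one column per
# SVD dimension. Values are made up for illustration.
U = {
    "hockey":   [0.7, 0.0],
    "goal":     [0.7, 0.0],
    "keyboard": [0.0, 0.7],
    "mouse":    [0.0, 0.7],
}


def project(text, U, normalize=True):
    """Project a document onto the SVD dimensions by summing term loadings."""
    k = len(next(iter(U.values())))
    proj = [0.0] * k
    for token in text.lower().split():
        row = U.get(token)
        if row is None:
            continue  # terms not seen during training contribute nothing
        for j in range(k):
            proj[j] += row[j]
    if normalize:
        norm = math.sqrt(sum(x * x for x in proj))
        if norm > 0:
            proj = [x / norm for x in proj]  # scale to unit length
    return proj


print(project("hockey goal goal", U))
print(project("new keyboard and mouse", U))
```

A document made up entirely of unseen terms projects to the zero vector, which is why scoring is only meaningful for text that shares vocabulary with the training collection.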


### End CAS session

In [9]:
sess.close()