### Text Mining in CAS

This example follows the PROC TEXTMINE workflow for identifying important terms and topics in a document collection. The cells below call the corresponding CAS actions (`textMining.tmMine` and `textMining.tmScore`) from Python.

PROC TEXTMINE parses the news data set to
1. generate a dictionary of important terms
2. generate a collection of important topics

The OUTTERMS= option names the terms dictionary to be created.

The OUTTOPICS= option names the SAS data set that holds the topics; the K= option sets how many topics are generated. You can browse the TERMS and TOPICS data sets to gain insight into the document collection.

PROC TMSCORE scores new document collections using the training performed by a previous PROC TEXTMINE analysis.
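
To make the idea of a terms dictionary concrete, here is a minimal pure-Python sketch that counts, for each term, its total frequency and the number of documents it appears in. The `build_terms` helper, the regex tokenizer, and the two sample sentences are illustrative assumptions; the CAS action also performs parsing steps (part-of-speech tagging, stemming, entity detection) that this sketch does not attempt.

```python
from collections import Counter
import re

def build_terms(docs):
    """Return {term: (frequency, num_docs)} for a list of document strings."""
    freq = Counter()      # total occurrences across the collection
    num_docs = Counter()  # number of documents that contain the term
    for text in docs:
        tokens = re.findall(r"[a-z]+", text.lower())  # naive tokenizer (assumption)
        freq.update(tokens)
        num_docs.update(set(tokens))
    return {term: (freq[term], num_docs[term]) for term in freq}

# Hypothetical two-document collection
docs = ["The goalie stopped the puck", "The doctor wrote a prescription"]
terms = build_terms(docs)
```

The two counts play the same role as the `_Frequency_` and `_NumDocs_` columns shown later in the terms table.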

### Import packages and Connect to CAS

In [39]:
import swat
from swat import CAS
#swat.options.cas.print_messages = False
swat.options.trace_actions = True

In [40]:
#gtpviyaea22.unx.sas.com
# Connect to the session
cashost='racesx11093.demo.sas.com'
casport=5570
casauth='C:/.authinfo_w12_race2'

s = CAS(cashost, casport, authinfo=casauth, caslib="casuser")

# Define directory and data file name
indata_dir="/opt/sasinside/DemoData"
dataset='news'
stopw='engstop'
textvar="TEXT"

# Create a CAS library called DMLib pointing to the defined directory
#s.table.addCaslib(datasource={'srctype':'path'}, name='DMlib', path=indata_dir)

# Load table into CAS
s.loadTable(caslib='DemoData', path=dataset+'.sas7bdat', casout={'name':dataset})
s.loadTable(caslib='DemoData', path=stopw+'.sas7bdat', casout={'name':stopw})

# Load the action sets used in this notebook
actionsets = ['textMining', 'FedSQL', 'sampling', 'decisionTree']
for name in actionsets:
    s.builtins.loadactionset(name)

[table.loadtable]
   casout.name = "news" (string)
   path   = "news.sas7bdat" (string)
   caslib = "DemoData" (string)

NOTE: Cloud Analytic Services made the file news.sas7bdat available as table NEWS in caslib CASUSER(sasdemo).
[table.loadtable]
   casout.name = "engstop" (string)
   path   = "engstop.sas7bdat" (string)
   caslib = "DemoData" (string)

NOTE: Cloud Analytic Services made the file engstop.sas7bdat available as table ENGSTOP in caslib CASUSER(sasdemo).
[builtins.loadactionset]
   actionset = "textMining" (string)

NOTE: Added action set 'textMining'.
[builtins.loadactionset]
   actionset = "FedSQL" (string)

NOTE: Added action set 'FedSQL'.
[builtins.loadactionset]
   actionset = "sampling" (string)

NOTE: Added action set 'sampling'.
[builtins.loadactionset]
   actionset = "decisionTree" (string)

NOTE: Added action set 'decisionTree'.


## Investigate our text data

In [41]:
reviews=s.CASTable(dataset)
reviews.head()

Unnamed: 0,TEXT,graphics,hockey,medical,newsgroup,key
0,I have a few reprints left of chapters from my...,1.0,0.0,0.0,graphics,1.0
1,"gnuplot, etc. make it easy to plot real valued...",1.0,0.0,0.0,graphics,2.0
2,Article-I.D.: snoopy.1pqlhnINN8k1 References: ...,1.0,0.0,0.0,graphics,3.0
3,"Hello, I am looking to add voice input capabil...",1.0,0.0,0.0,graphics,4.0
4,I recently got a file describing a library of ...,1.0,0.0,0.0,graphics,5.0


### What are the text reviews about?

In [42]:
reviews['newsgroup'].value_counts()

hockey      200
medical     200
graphics    198
dtype: int64

### Number of Rows

In [43]:
len(reviews)

598

### See complete reviews
The texts appear to be an email chain among university researchers.

In [44]:
reviews_df=reviews.to_frame() #Pandas Data Frame

for i in range(5):
    print(reviews_df['TEXT'][i], "\n")

I have a few reprints left of chapters from my book " Visions of the Future" . These include reprints of 3 chapters probably of interest to readers of this forum, including: 1. Current Techniques and Development of Computer Art, by Franz Szabo 2. Forging a Career as a Sculptor from a Career as Computer Programmer, by Stewart Dickson 3. Fractals and Genetics in the Future by H. Joel Jeffrey I'd be happy to send out free reprints to researchers for scholarly purposes, until the reprints run out. Just send me your name and address. Thanks, Cliff cliff@watson.ibm.com 

gnuplot, etc. make it easy to plot real valued functions of 2 variables but I want to plot functions whose values are 2-vectors. I have been doing this by plotting arrays of arrows (complete with arrowheads) but before going further, I thought I would ask whether someone has already done the work. Any pointers?? thanx in advance Tom Weston | USENET: weston@ucssun1.sdsu.edu Department of Philosophy | (619) 594-6218 (office) S

### Build Model

In [45]:
def c_dict(name):
    """Return output-table options: the table name, replacing any existing table."""
    return dict(name=name, replace=True)

In [47]:
s.textMining.tmMine(
  documents=dataset,
  stopList=stopw,
  docId="key",
  copyVars=['text', 'newsgroup'],
  text=textvar,
  reduce=1,
  entities="STD",
  k=10,  # 10 topics, matching the _Col1_ to _Col10_ columns used in later cells
  norm="DOC",
  topicDecision=True,
  u=c_dict("svdu"),
  terms=c_dict("terms"),
  parent=c_dict("parent"),
  child=c_dict("child"),
  parseConfig=c_dict("config"),
  docPro=c_dict("docpro"),
  topics=c_dict("topics")
)


### 15 Most Frequent Terms
Stemming automatically identifies parent-child relationships

In [75]:
terms = s.CASTable("terms").sort_values(by="_NumDocs_", ascending=False)
terms.head(15)

Unnamed: 0,_Term_,_Role_,_Attribute_,_Frequency_,_NumDocs_,_Keep_,_Termnum_,_Parent_,_ParentId_,_IsPar_,_Weight_
0,write,Verb,Alpha,394.0,341.0,Y,2576.0,,2576.0,+,0.098328
1,writes,Verb,Alpha,325.0,308.0,Y,8154.0,2576.0,2576.0,.,0.106965
2,article,Noun,Alpha,277.0,257.0,Y,285.0,,285.0,+,0.137203
3,article,Noun,Alpha,274.0,255.0,Y,285.0,285.0,285.0,.,0.137203
4,know,Verb,Alpha,191.0,142.0,Y,5926.0,,5926.0,+,0.239833
5,know,Verb,Alpha,159.0,125.0,Y,5926.0,5926.0,5926.0,.,0.239833
6,good,Adj,Alpha,175.0,112.0,Y,459.0,,459.0,+,0.298098
7,time,Noun,Alpha,166.0,102.0,Y,2965.0,,2965.0,+,0.316837
8,ca,Abbr,Alpha,185.0,101.0,Y,408.0,,408.0,,0.295679
9,year,Noun,Alpha,193.0,96.0,Y,4172.0,,4172.0,+,0.331244
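
In the output above, rows with `_IsPar_` equal to `+` look like parent terms and rows with `.` look like child forms that point to a parent through `_Parent_`. As a rough illustration of that parent-child idea, the sketch below sums child counts into their parent. The `PARENT_OF` mapping and the counts are hypothetical; in the notebook the mapping comes from the stemmer, not from a hand-written dictionary.

```python
from collections import defaultdict

# Hypothetical surface-form -> parent mapping (the real one is produced by stemming)
PARENT_OF = {"writes": "write", "wrote": "write", "articles": "article"}

def roll_up(counts, parent_of):
    """Sum each child term's count into its parent term."""
    totals = defaultdict(int)
    for term, n in counts.items():
        totals[parent_of.get(term, term)] += n  # terms without a parent map to themselves
    return dict(totals)

# Made-up counts for illustration
child_counts = {"write": 5, "writes": 3, "wrote": 1, "article": 2, "articles": 4}
rolled = roll_up(child_counts, PARENT_OF)
```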


### Different Attributes

In [76]:
terms['_Attribute_'].value_counts()

Alpha     8192
Entity    1204
Mixed      240
Abbr        21
dtype: int64

In [77]:
terms[terms['_Attribute_']=='Entity'].sort_values(by="_NumDocs_", ascending=False).head(10)

Unnamed: 0,_Term_,_Role_,_Attribute_,_Frequency_,_NumDocs_,_Keep_,_Termnum_,_Parent_,_ParentId_,_IsPar_,_Weight_
0,article-i.d.,PROP_MISC,Entity,85.0,85.0,Y,882.0,,882.0,,0.30514
1,gordon banks,PERSON,Entity,68.0,57.0,Y,2562.0,,2562.0,,0.375116
2,sender,PROP_MISC,Entity,46.0,46.0,Y,5634.0,,5634.0,,0.401175
3,n3jxp,PROP_MISC,Entity,45.0,45.0,Y,3151.0,,3151.0,,0.404613
4,geb@cadre.dsl.pitt.edu,INTERNET,Entity,45.0,45.0,Y,2228.0,,2228.0,,0.404613
5,lines,PROP_MISC,Entity,44.0,43.0,Y,4909.0,,4909.0,,0.413055
6,pittsburgh,LOCATION,Entity,78.0,38.0,Y,4377.0,,4377.0,,0.483375
7,well,LOCATION,Entity,43.0,38.0,Y,3999.0,,3999.0,,0.441978
8,nntp-posting-host,PROP_MISC,Entity,29.0,29.0,Y,3518.0,,3518.0,,0.473333
9,nhl,ORGANIZATION,Entity,54.0,29.0,Y,4463.0,,4463.0,,0.550269


### Raw document-term-matrix
The matrix is stored in sparse form, so for each document we only see terms that appear at least once in that document.
<br>
Counts are at the child level; stemmed forms are not merged into the parents defined above.
<br>
The table can be queried with SQL.

In [78]:
s.FedSQL.execDirect(''' 
                    SELECT * 
                    FROM child
                    ORDER BY _Document_
                    LIMIT 10
                    ''')

Unnamed: 0,_Termnum_,_Document_,_Count_
0,6557.0,1.0,2.0
1,7555.0,1.0,1.0
2,7938.0,1.0,1.0
3,2349.0,1.0,1.0
4,4401.0,1.0,1.0
5,2807.0,1.0,1.0
6,7176.0,1.0,1.0
7,1102.0,1.0,1.0
8,204.0,1.0,1.0
9,5842.0,1.0,1.0
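
The `(_Termnum_, _Document_, _Count_)` layout above is a coordinate-style sparse matrix: one row per nonzero cell. A minimal sketch of producing such triples from raw text follows; `to_triples`, the whitespace tokenizer, and the tiny `term_ids` vocabulary are assumptions for illustration only.

```python
from collections import Counter

def to_triples(docs, term_ids):
    """Return (term_id, doc_id, count) triples; zero counts are never stored."""
    triples = []
    for doc_id, text in enumerate(docs, start=1):
        counts = Counter(w for w in text.lower().split() if w in term_ids)
        triples.extend((term_ids[w], doc_id, n) for w, n in counts.items())
    # Order by document, then term, similar to ORDER BY _Document_
    return sorted(triples, key=lambda t: (t[1], t[0]))

# Hypothetical vocabulary and documents
term_ids = {"puck": 1, "goal": 2, "pixel": 3}
triples = to_triples(["goal goal puck", "pixel"], term_ids)
```

A term that occurs once still gets a row with count 1; only absent terms are omitted.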


### Scaled document-term-matrix
The matrix is stored in sparse form, so for each document we only see terms that appear at least once in that document.
<br>
Counts are at the parent level: child counts are merged into their parent term.
<br>
The counts are scaled (weighted) rather than raw.

In [79]:
s.CASTable("parent").sort_values(by="_Document_").head(10)

Unnamed: 0,_Termnum_,_Document_,_Count_
0,3439.0,1.0,1.116642
1,1239.0,1.0,0.837381
2,849.0,1.0,0.396056
3,4105.0,1.0,0.615504
4,3418.0,1.0,0.624953
5,753.0,1.0,0.488286
6,3268.0,1.0,0.891587
7,211.0,1.0,0.891587
8,204.0,1.0,0.837381
9,4192.0,1.0,0.900445
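
The notebook does not state which weighting produces the scaled counts. One common scheme in text mining is a log cell weight combined with an entropy term weight; the sketch below implements that scheme purely as an assumption, and it may not match what the action actually applied here. The function names and sample counts are hypothetical.

```python
import math

def entropy_weights(counts, n_docs):
    """counts: {term: {doc_id: freq}} -> {term: weight between 0 and 1}.

    A term spread evenly over all documents gets weight 0;
    a term concentrated in a single document gets weight 1.
    Requires n_docs > 1.
    """
    weights = {}
    for term, per_doc in counts.items():
        total = sum(per_doc.values())
        h = sum((f / total) * math.log2(f / total) for f in per_doc.values())
        weights[term] = 1.0 + h / math.log2(n_docs)
    return weights

def scaled_cell(freq, weight):
    """Log-damped frequency multiplied by the term weight."""
    return math.log2(freq + 1) * weight

# Made-up counts: "the" is everywhere, "lindros" appears in one document only
sample = {"the": {1: 2, 2: 2, 3: 2, 4: 2}, "lindros": {3: 5}}
weights = entropy_weights(sample, n_docs=4)
```

Under this scheme, frequent but uninformative terms are pushed toward zero, which is in line with the small `_Weight_` values seen for terms such as "write" in the terms table.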


### Descriptive terms for each topic

In [80]:
s.CASTable("topics").fetch(to=10)

Unnamed: 0,_TopicId_,_Name_,_TermCutOff_
0,1.0,"league, +defenseman, hockey, tampa, +draft pick",0.021
1,2.0,"+keyboard, pc, +price, +mouse, +thumb",0.021
2,3.0,"+flyer, amour, +goal, tommy, lindros",0.02
3,4.0,"period, scorer g, scorer, power, pp",0.021
4,5.0,"gif, +injury, +muscle, +keyboard, +condition",0.02
5,6.0,"+tool, +break, +exercise, +type, +description",0.022
6,7.0,"+cancer, +day, +bath, water, +eat",0.022
7,8.0,"+versus, tor, mon, van, series",0.023
8,9.0,"business, political, college, +event, dr.",0.02
9,10.0,"+system, sgi, virtual, graphics, +reality",0.024


### See structured representation of first 5 documents
Similar to a PCA, this structured representation reduces the document-term matrix to 10 new variables that can be used in a predictive model.
<br>
Our goal is to predict whether a review is about hockey, medical, or graphics.

In [81]:
s.CASTable("docpro").fetch(to=5)

Unnamed: 0,key,_Col1_,_Col2_,_Col3_,_Col4_,_Col5_,_Col6_,_Col7_,_Col8_,_Col9_,_Col10_,TEXT,newsgroup
0,94.0,0.347776,0.153176,0.0,0.0,0.595036,0.0,0.211529,0.0,0.0,0.675853,Has anyone successfully converted Interleaf gr...,graphics
1,95.0,0.05993,0.179273,0.025497,0.027616,0.065927,0.10891,0.083279,0.021661,0.058935,0.967353,"Sorry I missed you Raymond, I was just out in ...",graphics
2,96.0,0.036863,0.282517,0.34079,0.0,0.384383,0.370562,0.256907,0.102496,0.20655,0.63123,In article < jonas-y.734802983@gouraud> jonas-...,graphics
3,97.0,0.096503,0.437356,0.066794,0.008742,0.284029,0.347536,0.23438,0.065123,0.190443,0.705671,"Hi everyone, I thought that some people may be...",graphics
4,98.0,0.0,0.317446,0.0,0.0,0.0,0.441664,0.628794,0.0,0.555679,0.0,Update on location!! Directory should be: publ...,graphics
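
The squared `_Col*_` values in each row above sum to about 1 (for key 98: 0.317² + 0.442² + 0.629² + 0.556² ≈ 1.0), which is consistent with the `norm="DOC"` setting scaling each document's projection to unit length. The sketch below shows that idea on toy numbers: project a sparse document vector onto per-topic term weights, then normalize. The `project` helper and the topic weights are hypothetical and stand in for the SVD-based projection.

```python
import math

def project(doc_vector, topic_weights):
    """doc_vector: {term_id: weight}; topic_weights: one {term_id: weight} dict per topic.

    Returns the document's topic coordinates scaled to unit length.
    """
    raw = [sum(w * topic.get(t, 0.0) for t, w in doc_vector.items())
           for topic in topic_weights]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw] if norm else raw

# Hypothetical topics, each loading on a single term
topics = [{1: 1.0}, {2: 1.0}, {3: 1.0}]
coords = project({1: 3.0, 2: 4.0}, topics)
```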


### Split into training and validation

In [82]:
# Create a 70/30 stratified split
s.sampling.stratified(
    table   = dict(name = "docpro", groupBy = 'newsgroup'),
    samppct = 70,
    partind = True,
    seed    = 12345,
    output  = dict(casOut = dict(name = 'docpro' + '_sampled', replace = True), copyVars = 'ALL')
)
s.fetch('docpro_sampled', to=5)

NOTE: Using SEED=12345 for sampling.


Unnamed: 0,key,_Col1_,_Col2_,_Col3_,_Col4_,_Col5_,_Col6_,_Col7_,_Col8_,_Col9_,_Col10_,TEXT,newsgroup,_PartInd_
0,404.0,0.373458,0.296889,0.225395,0.0,0.150264,0.701464,0.285108,0.0,0.0,0.354494,"In article < ng4.733990422@husc.harvard.edu> ,...",medical,1.0
1,405.0,0.094883,0.14155,0.251029,0.0,0.320791,0.068929,0.883073,0.122674,0.0,0.073627,How long does it take a smoker's lungs to clea...,medical,0.0
2,406.0,0.17412,0.235961,0.049507,0.142531,0.080435,0.468806,0.608487,0.073573,0.480988,0.240773,"In article < 1pka0uINNnqa@mojo.eng.umd.edu> , ...",medical,0.0
3,407.0,0.073311,0.403314,0.127127,0.0,0.65853,0.191441,0.20785,0.0373,0.308831,0.453344,ls8139@albnyvms.bitnet (larry silverberg) writ...,medical,1.0
4,408.0,0.050722,0.144778,0.201284,0.0,0.376264,0.510399,0.680269,0.080077,0.161381,0.19659,I second what Spenser Aden said in reply. Addi...,medical,1.0
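
For intuition, stratified sampling draws the requested percentage separately within each class, so class proportions are preserved in both partitions. The following pure-Python sketch mimics a 70% stratified flag; `stratified_flags` and its rounding rule are assumptions, and the rows it selects will not match those chosen by the CAS action, even with the same seed.

```python
import random
from collections import defaultdict

def stratified_flags(labels, pct, seed):
    """Return a 0/1 flag per row so that about pct% of each label gets flag 1."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    flags = [0] * len(labels)
    for indices in by_label.values():
        k = round(len(indices) * pct / 100)  # rounding rule is an assumption
        for idx in rng.sample(indices, k):
            flags[idx] = 1
    return flags

# Class sizes taken from the value_counts() output earlier in the notebook
labels = ["hockey"] * 200 + ["medical"] * 200 + ["graphics"] * 198
flags = stratified_flags(labels, pct=70, seed=12345)
```

With these class sizes the sketch flags 140 + 140 + 139 rows as training data.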


### Modeling Shortcuts

In [83]:
# Input variables: the topic columns _Col1_ ... _Col10_
input_vars = ['_Col' + str(i + 1) + '_' for i in range(10)]

#model
params = dict(
    table    = dict(name = 'docpro_sampled', where = '_partind_ = 1'), 
    target   = 'newsgroup', 
    inputs   = input_vars, 
    nominals = 'newsgroup',
)

### Build Decision Tree

In [84]:
#Model
s.decisionTree.dtreeTrain(**params, varImp = True, casOut = dict(name = 'dt_model', replace = True))

Unnamed: 0,Descr,Value
0,Number of Tree Nodes,23.0
1,Max Number of Branches,2.0
2,Number of Levels,6.0
3,Number of Leaves,12.0
4,Number of Bins,20.0
5,Minimum Size of Leaves,5.0
6,Maximum Size of Leaves,269.0
7,Number of Variables,10.0
8,Confidence Level for Pruning,0.25
9,Number of Observations Used,419.0

Unnamed: 0,Variable,Importance,Std,Count
0,_Col8_,63.860242,0.0,1.0
1,_Col1_,22.347081,4.006096,4.0
2,_Col3_,10.281426,0.0,1.0
3,_Col5_,7.029275,2.581304,2.0
4,_Col4_,5.986783,0.0,1.0
5,_Col7_,1.891692,0.0,1.0
6,_Col2_,0.934066,0.0,1.0

Unnamed: 0,casLib,Name,Rows,Columns,casTable
0,CASUSER(sasdemo),dt_model,23,24,"CASTable('dt_model', caslib='CASUSER(sasdemo)')"


### Score Model on validation

In [85]:
def score_model(model, partition):
    
    # If partition is True, score only the validation rows; otherwise score the whole table
    if partition:
        table_dct = dict(name = 'docpro_sampled', where = '_partind_ = 0')
    else:
        table_dct = dict(name = 'docpro_sampled') 
        
    score = dict(
        table      = table_dct,
        modelTable = model + '_model',
        copyVars   = ['newsgroup', '_partind_', 'TEXT'],
        casOut     = dict(name = '_scored_' + model, replace = True)
    )
    return score

s.decisionTree.dtreeScore(**score_model('dt', True))

Unnamed: 0,casLib,Name,Rows,Columns,casTable
0,CASUSER(sasdemo),_scored_dt,179,15,"CASTable('_scored_dt', caslib='CASUSER(sasdemo)')"

Unnamed: 0,Descr,Value
0,Number of Observations Read,179.0
1,Number of Observations Used,179.0
2,Misclassification Error (%),38.547486034
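
The misclassification error is the share of rows whose predicted label differs from the actual label. The reported 38.547486034% corresponds to 69 wrong predictions out of 179 validation rows. A minimal sketch of the calculation, with a hypothetical helper name and toy labels:

```python
def misclassification_pct(actual, predicted):
    """Percentage of rows where the prediction differs from the actual label."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    wrong = sum(a != p for a, p in zip(actual, predicted))
    return 100.0 * wrong / len(actual)

# Toy labels: one of four rows is misclassified
actual = ["hockey", "medical", "graphics", "graphics"]
predicted = ["hockey", "medical", "medical", "graphics"]
toy_rate = misclassification_pct(actual, predicted)

# Check against the validation output: 69 of 179 rows wrong
reported_rate = 100.0 * 69 / 179
```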


### Score whole dataset and compare predicted (medical, hockey, graphics) vs actual
Use Python syntax to add a computed column, then load the result back into CAS.

In [86]:
s.decisionTree.dtreeScore(**score_model('dt', False))

Unnamed: 0,casLib,Name,Rows,Columns,casTable
0,CASUSER(sasdemo),_scored_dt,598,15,"CASTable('_scored_dt', caslib='CASUSER(sasdemo)')"

Unnamed: 0,Descr,Value
0,Number of Observations Read,598.0
1,Number of Observations Used,598.0
2,Misclassification Error (%),37.62541806


In [87]:
text_pred = s.CASTable('_scored_dt')[['newsgroup','_DT_PredName_','TEXT']]
text_pred['Correct']= text_pred['newsgroup']==text_pred['_DT_PredName_']
text_pred.head()

Unnamed: 0,newsgroup,_DT_PredName_,TEXT,Correct
0,graphics,medical,merkelbd@sage.cc.purdue.edu (Brian Merkel) wri...,0.0
1,graphics,medical,There is a new product for the (IBM'ers) out t...,0.0
2,graphics,medical,"In a previous article, trb3@Ra.MsState.Edu (To...",0.0
3,graphics,medical,In article < 734553308snx@rjck.UUCP> rob@rjck....,0.0
4,graphics,medical,Hi! I am working on a project that needs to cr...,0.0


### Accuracy by Category

In [88]:
text_pred.groupby(['newsgroup'])['Correct'].mean()

newsgroup
graphics    0.010101
hockey      0.910000
medical     0.945000
Name: Correct, dtype: float64
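
Per-class accuracy shows that the model is wrong mostly on one class, but not which class it confuses it with. Counting (actual, predicted) pairs answers that question. The sketch below is a pure-Python stand-in for the groupby above plus a confusion count; the helper names and the toy pairs are made up for illustration.

```python
from collections import Counter

def per_class_accuracy(pairs):
    """pairs: iterable of (actual, predicted) -> {actual_label: accuracy}."""
    hits, totals = Counter(), Counter()
    for actual, predicted in pairs:
        totals[actual] += 1
        hits[actual] += actual == predicted  # True counts as 1
    return {label: hits[label] / totals[label] for label in totals}

def confusion_counts(pairs):
    """Count how often each (actual, predicted) combination occurs."""
    return Counter(pairs)

# Toy predictions: graphics is usually mistaken for medical
pairs = [("graphics", "medical"), ("graphics", "medical"), ("graphics", "graphics"),
         ("hockey", "hockey"), ("medical", "medical")]
accuracy = per_class_accuracy(pairs)
confusion = confusion_counts(pairs)
```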

### Load data back into CAS

In [89]:
text_pred.table.partition(casOut='final_text_score1')
s.CASTable('final_text_score1').head(10)

Unnamed: 0,newsgroup,_PartInd_,TEXT,_DT_PredName_,_DT_PredP_,_DT_PredLevel_,_LeafID_,_MissIt_,_NumNodes_,_NodeList0_,_NodeList1_,_NodeList2_,_NodeList3_,_NodeList4_,_NodeList5_,Correct
0,graphics,1.0,merkelbd@sage.cc.purdue.edu (Brian Merkel) wri...,medical,0.483271,1.0,22.0,1.0,6.0,0.0,2.0,6.0,12.0,16.0,22.0,0.0
1,graphics,1.0,There is a new product for the (IBM'ers) out t...,medical,0.483271,1.0,22.0,1.0,6.0,0.0,2.0,6.0,12.0,16.0,22.0,0.0
2,graphics,0.0,"In a previous article, trb3@Ra.MsState.Edu (To...",medical,0.483271,1.0,22.0,1.0,6.0,0.0,2.0,6.0,12.0,16.0,22.0,0.0
3,graphics,1.0,In article < 734553308snx@rjck.UUCP> rob@rjck....,medical,0.483271,1.0,22.0,1.0,6.0,0.0,2.0,6.0,12.0,16.0,22.0,0.0
4,graphics,1.0,Hi! I am working on a project that needs to cr...,medical,0.483271,1.0,22.0,1.0,6.0,0.0,2.0,6.0,12.0,16.0,22.0,0.0
5,graphics,0.0,Toronto Siggraph ================ What: ``Chan...,medical,0.483271,1.0,22.0,1.0,6.0,0.0,2.0,6.0,12.0,16.0,22.0,0.0
6,graphics,1.0,zyeh@caspian.usc.edu (zhenghao yeh) writes: Th...,medical,0.483271,1.0,22.0,1.0,6.0,0.0,2.0,6.0,12.0,16.0,22.0,0.0
7,graphics,0.0,"wing the suggestion of Stu Lynne, I have poste...",medical,0.483271,1.0,22.0,1.0,6.0,0.0,2.0,6.0,12.0,16.0,22.0,0.0
8,graphics,1.0,In article < 1993Apr11.132604.13400@ornl.gov> ...,medical,0.483271,1.0,22.0,1.0,6.0,0.0,2.0,6.0,12.0,16.0,22.0,0.0
9,graphics,1.0,In article < 1993Apr13.025240.8884@nwnexus.WA....,medical,0.483271,1.0,22.0,1.0,6.0,0.0,2.0,6.0,12.0,16.0,22.0,0.0


### Score new text data
We can then score new incoming documents with the trained text mining model. Here the original table is reused as the input.

In [90]:
s.textMining.tmScore(
  documents=dataset,
  u='svdu',
  parseConfig='config',
  terms='terms',
  docPro=c_dict('score_docpro'),
  parent=c_dict('score_parent'),
  text=textvar,
  docId="key"
)

Unnamed: 0,casLib,Name,Label,Rows,Columns,casTable
0,CASUSER(sasdemo),score_parent,,31467,3,"CASTable('score_parent', caslib='CASUSER(sasde..."
1,CASUSER(sasdemo),score_docpro,,598,11,"CASTable('score_docpro', caslib='CASUSER(sasde..."


In [91]:
s.CASTable('score_docpro').sort_values(by='key').head()

Unnamed: 0,key,_Col1_,_Col2_,_Col3_,_Col4_,_Col5_,_Col6_,_Col7_,_Col8_,_Col9_,_Col10_
0,1.0,0.050279,0.265522,0.12504,-0.004726,0.100388,0.370671,0.353446,0.066813,0.629267,0.488337
1,2.0,0.522268,0.3615,-0.027332,0.182182,0.21132,0.190645,0.217793,-0.059781,0.304305,0.581382
2,3.0,0.054938,0.119717,0.242587,0.04513,0.508865,0.233717,0.619396,0.170805,-0.042527,0.439958
3,4.0,0.030875,0.688525,-0.025504,-0.002427,0.160521,0.371034,0.105233,-0.017701,0.100524,0.582578
4,5.0,0.157317,0.16747,-0.004709,-0.014886,0.233048,0.449574,0.242501,0.007526,-0.011406,0.794694


### End CAS session

In [38]:
s.close()