### Text Mining in CAS

This code example illustrates the use of PROC TEXTMINE for identifying important terms and topics in a document collection. In this notebook, the same steps are run from Python through the `tmMine` and `tmScore` CAS actions.

PROC TEXTMINE parses the news data set to
1. generate a dictionary of important terms
2. generate a collection of important topics

The OUTTERMS= option specifies the terms dictionary to be created.

The OUTTOPICS= option specifies the SAS data set that will contain the topics; the K= option sets the number of topics. The user can peruse the TERMS and TOPICS data sets to gain insight into the document collection.

PROC TMSCORE allows the user to score new document collections based on training performed by a previous PROC TEXTMINE analysis.
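
To make the idea of a terms dictionary concrete, here is a minimal sketch in plain Python rather than SAS. It assumes a naive whitespace tokenizer, a hypothetical three-document corpus, and a hypothetical stop list; the real procedure also performs part-of-speech tagging, stemming, and entity detection, none of which is reproduced here.

```python
from collections import Counter

# Hypothetical toy corpus; the real input is the news table.
docs = {
    1: "the goalie made a save",
    2: "the doctor made a diagnosis",
    3: "the goalie was injured",
}
stop_words = {"the", "a", "was"}  # hypothetical stand-in for a stop list

frequency = Counter()  # total occurrences of each term
num_docs = Counter()   # number of documents that contain each term

for text in docs.values():
    tokens = [t for t in text.lower().split() if t not in stop_words]
    frequency.update(tokens)
    num_docs.update(set(tokens))  # count each term once per document

# One row per term: (term, frequency, number of documents),
# most widespread terms first, ties broken alphabetically.
term_table = sorted(
    ((term, frequency[term], num_docs[term]) for term in frequency),
    key=lambda row: (-row[2], row[0]),
)
for row in term_table:
    print(row)
```

The two counters play roughly the same role as the frequency and document-count columns that appear in the terms table later in this notebook.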

### Import packages

In [1]:
from swat import *

In [2]:
# Connect to the session
cashost='sasserver.demo.sas.com'
casport=5570
casauth='~/.authinfo'

s = CAS(cashost, casport, authinfo=casauth, caslib="casuser")

# Define directory and data file name
indata_dir="/opt/sasinside/DemoData"
dataset='news'
stopw='engstop'
textvar="TEXT"

# Create a CAS library called DMlib pointing to the defined directory
s.table.addCaslib(datasource={'srctype':'path'}, name='DMlib', path=indata_dir)

# Load table into CAS
s.loadTable(caslib='DMlib', path=dataset+'.sas7bdat', casout={'name':dataset})
s.loadTable(caslib='DMlib', path=stopw+'.sas7bdat', casout={'name':stopw})

# Load the action sets used in this notebook
actionsets = ['textMining', 'FedSQL', 'sampling', 'decisionTree']
for actionset in actionsets:
    s.builtins.loadactionset(actionset)

NOTE: 'DMlib' is now the active caslib.
NOTE: Cloud Analytic Services added the caslib 'DMlib'.
NOTE: Cloud Analytic Services made the file news.sas7bdat available as table NEWS in caslib DMlib.
NOTE: Cloud Analytic Services made the file engstop.sas7bdat available as table ENGSTOP in caslib DMlib.
NOTE: Added action set 'textMining'.
NOTE: Added action set 'FedSQL'.
NOTE: Added action set 'sampling'.
NOTE: Added action set 'decisionTree'.


## Investigate our text data

In [3]:
reviews=s.CASTable(dataset)
reviews.head()

Unnamed: 0,TEXT,graphics,hockey,medical,newsgroup,key
0,I have a few reprints left of chapters from my...,1.0,0.0,0.0,graphics,1.0
1,"gnuplot, etc. make it easy to plot real valued...",1.0,0.0,0.0,graphics,2.0
2,Article-I.D.: snoopy.1pqlhnINN8k1 References: ...,1.0,0.0,0.0,graphics,3.0
3,"Hello, I am looking to add voice input capabil...",1.0,0.0,0.0,graphics,4.0
4,I recently got a file describing a library of ...,1.0,0.0,0.0,graphics,5.0


### What are the text reviews about?

In [4]:
reviews['newsgroup'].value_counts()

hockey      200
medical     200
graphics    198
dtype: int64

### Number of Rows

In [5]:
len(reviews)

598

### See complete reviews
The documents appear to be email-style newsgroup posts, many from university researchers.

In [6]:
reviews_df=reviews.to_frame() #Pandas Data Frame

for i in range(5):
    print(reviews_df['TEXT'][i], "\n")

I have a few reprints left of chapters from my book " Visions of the Future" . These include reprints of 3 chapters probably of interest to readers of this forum, including: 1. Current Techniques and Development of Computer Art, by Franz Szabo 2. Forging a Career as a Sculptor from a Career as Computer Programmer, by Stewart Dickson 3. Fractals and Genetics in the Future by H. Joel Jeffrey I'd be happy to send out free reprints to researchers for scholarly purposes, until the reprints run out. Just send me your name and address. Thanks, Cliff cliff@watson.ibm.com 

gnuplot, etc. make it easy to plot real valued functions of 2 variables but I want to plot functions whose values are 2-vectors. I have been doing this by plotting arrays of arrows (complete with arrowheads) but before going further, I thought I would ask whether someone has already done the work. Any pointers?? thanx in advance Tom Weston | USENET: weston@ucssun1.sdsu.edu Department of Philosophy | (619) 594-6218 (office) S

### Build Model

In [7]:
def c_dict(name):
    # Options for an output table: its name, replacing any existing table of that name
    training_options = dict(name    = name,
                            replace = True)
    return training_options

In [8]:
s.textMining.tmMine(
  documents=dataset,
  stopList=stopw,
  docId="key",
  copyVars=['text', 'newsgroup'],
  text=textvar,
  reduce=2,
  entities="STD",
  k=10,
  norm="DOC",
  u=c_dict("svdu"),
  terms=c_dict("terms"),
  parent=c_dict("parent"),
  child=c_dict("child"),
  parseConfig=c_dict("config"),
  docPro=c_dict("docpro"),
  topics=c_dict("topics"),
)

Unnamed: 0,casLib,Name,Label,Rows,Columns,casTable
0,DMlib,config,,1,7,"CASTable('config', caslib='DMlib')"
1,DMlib,terms,,9674,11,"CASTable('terms', caslib='DMlib')"
2,DMlib,parent,,31519,3,"CASTable('parent', caslib='DMlib')"
3,DMlib,child,,32648,3,"CASTable('child', caslib='DMlib')"
4,DMlib,topics,,10,3,"CASTable('topics', caslib='DMlib')"
5,DMlib,svdu,,6039,11,"CASTable('svdu', caslib='DMlib')"
6,DMlib,docpro,,598,13,"CASTable('docpro', caslib='DMlib')"


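### 15 Most Frequent Terms
Terms are sorted by the number of documents they appear in (`_NumDocs_`). Stemming automatically identifies parent-child relationships; for example, `writes` is a child of `write`.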

In [9]:
terms = s.CASTable("terms").sort_values(by="_NumDocs_", ascending=False)
terms.head(15)

Unnamed: 0,_Term_,_Role_,_Attribute_,_Frequency_,_NumDocs_,_Keep_,_Termnum_,_Parent_,_ParentId_,_IsPar_,_Weight_
0,write,Verb,Alpha,394.0,341.0,Y,2585.0,,2585.0,+,0.098328
1,writes,Verb,Alpha,325.0,308.0,Y,8170.0,2585.0,2585.0,.,0.106965
2,article,Noun,Alpha,277.0,257.0,Y,286.0,,286.0,+,0.137203
3,article,Noun,Alpha,274.0,255.0,Y,286.0,286.0,286.0,.,0.137203
4,know,Verb,Alpha,191.0,142.0,Y,5941.0,,5941.0,+,0.239833
5,know,Verb,Alpha,159.0,125.0,Y,5941.0,5941.0,5941.0,.,0.239833
6,good,Adj,Alpha,175.0,112.0,Y,463.0,,463.0,+,0.298098
7,time,Noun,Alpha,166.0,102.0,Y,2974.0,,2974.0,+,0.316837
8,ca,Abbr,Alpha,185.0,101.0,Y,411.0,,411.0,,0.295679
9,year,Noun,Alpha,193.0,96.0,Y,4182.0,,4182.0,+,0.331244
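
The parent-child relationship can be pictured as a roll-up of child frequencies into the parent term. The sketch below is only an illustration: the rows and frequencies are hypothetical, and it assumes (as the `_Parent_` and `_IsPar_` columns suggest) that a parent row aggregates the counts of all of its surface forms.

```python
from collections import defaultdict

# Hypothetical child-level rows: (term, term number, parent term number, frequency).
# A row whose parent number equals its own term number is the base form itself.
child_rows = [
    ("write",    10, 10, 40),
    ("writes",   11, 10, 25),
    ("wrote",    12, 10, 5),
    ("article",  20, 20, 30),
    ("articles", 21, 20, 3),
]

# Sum the frequency of every surface form into its parent term number.
parent_frequency = defaultdict(int)
for term, termnum, parent_id, freq in child_rows:
    parent_frequency[parent_id] += freq

# Map each parent term number back to a readable name.
parent_names = {
    termnum: term
    for term, termnum, parent_id, _ in child_rows
    if termnum == parent_id
}
rolled_up = {parent_names[pid]: total for pid, total in parent_frequency.items()}
print(rolled_up)  # {'write': 70, 'article': 33}
```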


### Different Attributes

In [10]:
terms['_Attribute_'].value_counts()

Alpha     8210
Entity    1191
Mixed      252
Abbr        21
dtype: int64

In [11]:
terms[terms['_Attribute_']=='Entity'].sort_values(by="_NumDocs_", ascending=False).head(10)

Unnamed: 0,_Term_,_Role_,_Attribute_,_Frequency_,_NumDocs_,_Keep_,_Termnum_,_Parent_,_ParentId_,_IsPar_,_Weight_
0,article-i.d.,PROP_MISC,Entity,85.0,85.0,Y,886.0,,886.0,,0.30514
1,gordon banks,PERSON,Entity,68.0,57.0,Y,2571.0,,2571.0,,0.375116
2,sender,PROP_MISC,Entity,46.0,46.0,Y,5648.0,,5648.0,,0.401175
3,n3jxp,PROP_MISC,Entity,45.0,45.0,Y,3160.0,,3160.0,,0.404613
4,geb@cadre.dsl.pitt.edu,INTERNET,Entity,45.0,45.0,Y,2233.0,,2233.0,,0.404613
5,lines,PROP_MISC,Entity,44.0,43.0,Y,4920.0,,4920.0,,0.413055
6,pittsburgh,LOCATION,Entity,78.0,38.0,Y,4388.0,,4388.0,,0.483375
7,well,LOCATION,Entity,43.0,38.0,Y,4009.0,,4009.0,,0.441978
8,nntp-posting-host,PROP_MISC,Entity,29.0,29.0,Y,3528.0,,3528.0,,0.473333
9,nhl,ORGANIZATION,Entity,54.0,29.0,Y,4474.0,,4474.0,,0.550269


### Raw document-term-matrix
The matrix is stored in a compressed (sparse) form, so for each document we only see terms that appear at least once in that document.
<br>
Counts are at the child level: stemmed words are not combined as defined by the parent/child relationships above.
<br>
The table can be queried with SQL.

In [12]:
s.FedSQL.execDirect(''' 
                    SELECT * 
                    FROM child
                    ORDER BY _Document_
                    LIMIT 10
                    ''')

Unnamed: 0,_Termnum_,_Document_,_Count_
0,4115.0,1.0,1.0
1,5219.0,1.0,1.0
2,212.0,1.0,1.0
3,5594.0,1.0,2.0
4,6323.0,1.0,1.0
5,205.0,1.0,1.0
6,2356.0,1.0,1.0
7,5353.0,1.0,1.0
8,853.0,1.0,1.0
9,4204.0,1.0,1.0
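
The three columns above are a sparse (term, document, count) layout: only non-zero cells of the document-term matrix are stored. A minimal sketch of how such triples can be built, using a hypothetical two-document corpus and term numbers assigned in order of first appearance (not the numbering that CAS uses):

```python
from collections import Counter

# Hypothetical toy documents keyed by document id.
docs = {1: "save save goal", 2: "goal assist"}

term_ids = {}  # term -> integer id, assigned in order of first appearance
triples = []   # (term number, document id, count); zero counts are never stored

for doc_id, text in docs.items():
    for term, count in Counter(text.split()).items():
        term_num = term_ids.setdefault(term, len(term_ids) + 1)
        triples.append((term_num, doc_id, count))

for triple in triples:
    print(triple)

# A dense matrix would need one cell per (document, term) pair.
dense_cells = len(docs) * len(term_ids)
print(len(triples), "stored cells instead of", dense_cells)
```

With thousands of distinct terms and short documents, almost every dense cell would be zero, which is why the sparse layout is the natural choice.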


### Scaled document-term-matrix
The matrix is stored in a compressed (sparse) form, so for each document we only see terms that appear at least once in that document.
<br>
Counts are at the parent level: the counts for a parent and its children are merged into the parent.
<br>
The results are scaled (weighted) rather than raw counts.

In [13]:
s.CASTable("parent").sort_values(by="_Document_").head(10)

Unnamed: 0,_Termnum_,_Document_,_Count_
0,2356.0,1.0,0.891587
1,212.0,1.0,0.891587
2,3448.0,1.0,1.116642
3,3375.0,1.0,0.712095
4,4115.0,1.0,0.615504
5,4204.0,1.0,0.900445
6,4285.0,1.0,0.719757
7,847.0,1.0,1.974851
8,5219.0,1.0,0.612682
9,2594.0,1.0,0.593452


### Descriptive terms for each topic

In [14]:
s.CASTable("topics").fetch(to=10)

Unnamed: 0,_TopicId_,_TermCutOff_,_Name_
0,1.0,0.021,"league, +defenseman, hockey, tampa, +draft pick"
1,2.0,0.021,"+keyboard, pc, +price, +mouse, +thumb"
2,3.0,0.02,"+flyer, amour, +goal, tommy, lindros"
3,4.0,0.021,"period, scorer g, scorer, power, pp"
4,5.0,0.02,"gif, +injury, +muscle, +keyboard, +condition"
5,6.0,0.022,"+tool, +break, +exercise, +type, +description"
6,7.0,0.022,"+cancer, +day, +bath, water, +eat"
7,8.0,0.023,"+versus, tor, mon, van, series"
8,9.0,0.02,"business, political, college, +event, dr."
9,10.0,0.024,"+system, sgi, virtual, graphics, +reality"


### See structured representation of first 5 documents
Similar to a PCA analysis, this structured representation reduces the document-term matrix to 10 new variables that can be used in a predictive model.
<br>
Our goal is to predict whether a review is about hockey, medical, or graphics.

In [15]:
s.CASTable("docpro").fetch(to=5)

Unnamed: 0,key,_Col1_,_Col2_,_Col3_,_Col4_,_Col5_,_Col6_,_Col7_,_Col8_,_Col9_,_Col10_,TEXT,newsgroup
0,1.0,0.09685,0.357367,0.086975,0.0,0.114574,0.39771,0.272656,0.078418,0.693112,0.350925,I have a few reprints left of chapters from my...,graphics
1,2.0,0.539289,0.389104,0.0,0.233961,0.225581,0.237213,0.143625,0.0,0.390849,0.471678,"gnuplot, etc. make it easy to plot real valued...",graphics
2,3.0,0.049127,0.211392,0.340236,0.056811,0.546535,0.286506,0.583523,0.095591,0.0,0.321697,Article-I.D.: snoopy.1pqlhnINN8k1 References: ...,graphics
3,4.0,0.010792,0.712185,0.006393,0.0,0.229538,0.406102,0.113645,0.0,0.143324,0.491499,"Hello, I am looking to add voice input capabil...",graphics
4,5.0,0.114912,0.22702,0.04733,0.0,0.334205,0.519938,0.233447,0.049982,0.053838,0.700781,I recently got a file describing a library of ...,graphics
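
As a quick sanity check on the projections above, the sketch below computes the Euclidean length of the first `docpro` row. The ten values are copied from the output; reading `norm="DOC"` as "rescale each document's projection to unit length" is our assumption, and `normalize` is a hypothetical helper, not part of SWAT.

```python
import math

# _Col1_ .. _Col10_ for key 1, copied from the docpro output above.
row = [0.09685, 0.357367, 0.086975, 0.0, 0.114574,
       0.39771, 0.272656, 0.078418, 0.693112, 0.350925]

# Euclidean length of the 10-dimensional projection.
length = math.sqrt(sum(v * v for v in row))
print(round(length, 4))  # very close to 1


def normalize(vector):
    """Scale a vector to unit Euclidean length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(v * v for v in vector))
    return list(vector) if norm == 0 else [v / norm for v in vector]


# Hypothetical unnormalized projection, only to exercise the helper.
unit = normalize([3.0, 4.0])
print(unit)  # [0.6, 0.8]
```

Because every row has roughly the same length, the ten columns describe the direction of a document in topic space rather than its size, which keeps long and short posts comparable.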


### Split into training and validation

In [16]:
# Create a 70/30 stratified split
s.sampling.stratified(
    table   = dict(name = "docpro", groupBy = 'newsgroup'),
    samppct = 70,
    partind = True,
    seed    = 12345,
    output  = dict(casOut = dict(name = 'docpro' + '_sampled', replace = True), copyVars = 'ALL')
)
s.fetch('docpro_sampled', to=5)

NOTE: Using SEED=12345 for sampling.


Unnamed: 0,key,_Col1_,_Col2_,_Col3_,_Col4_,_Col5_,_Col6_,_Col7_,_Col8_,_Col9_,_Col10_,TEXT,newsgroup,_PartInd_
0,1.0,0.09685,0.357367,0.086975,0.0,0.114574,0.39771,0.272656,0.078418,0.693112,0.350925,I have a few reprints left of chapters from my...,graphics,1.0
1,2.0,0.539289,0.389104,0.0,0.233961,0.225581,0.237213,0.143625,0.0,0.390849,0.471678,"gnuplot, etc. make it easy to plot real valued...",graphics,0.0
2,3.0,0.049127,0.211392,0.340236,0.056811,0.546535,0.286506,0.583523,0.095591,0.0,0.321697,Article-I.D.: snoopy.1pqlhnINN8k1 References: ...,graphics,1.0
3,4.0,0.010792,0.712185,0.006393,0.0,0.229538,0.406102,0.113645,0.0,0.143324,0.491499,"Hello, I am looking to add voice input capabil...",graphics,0.0
4,5.0,0.114912,0.22702,0.04733,0.0,0.334205,0.519938,0.233447,0.049982,0.053838,0.700781,I recently got a file describing a library of ...,graphics,0.0
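
For readers unfamiliar with stratified sampling, here is a minimal pure-Python sketch of the idea: sample the requested fraction separately within each class so that the class proportions are preserved. `stratified_partition` is a hypothetical helper; it will not reproduce the exact rows that CAS selects, even with the same seed.

```python
import random
from collections import defaultdict


def stratified_partition(labels, train_fraction=0.7, seed=12345):
    """Return 1/0 flags (1 = training) with about train_fraction of each label flagged."""
    rng = random.Random(seed)

    # Collect the row positions that belong to each class.
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    # Sample the training rows independently within each class.
    flags = [0] * len(labels)
    for indices in by_label.values():
        n_train = round(train_fraction * len(indices))
        for index in rng.sample(indices, n_train):
            flags[index] = 1
    return flags


# Class sizes taken from the value_counts() output earlier in the notebook.
labels = ["hockey"] * 200 + ["medical"] * 200 + ["graphics"] * 198
part_ind = stratified_partition(labels)
print(sum(part_ind), len(part_ind) - sum(part_ind))
```

With these class sizes the sketch flags 140 + 140 + 139 = 419 rows for training and leaves 179 for validation.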


### Modeling Shortcuts

In [17]:
# Input variables: _Col1_ through _Col10_
input_vars = ['_Col' + str(i + 1) + '_' for i in range(10)]

#model
params = dict(
    table    = dict(name = 'docpro_sampled', where = '_partind_ = 1'), 
    target   = 'newsgroup', 
    inputs   = input_vars, 
    nominals = 'newsgroup',
)

### Build Decision Tree

In [18]:
#Model
s.decisionTree.dtreeTrain(**params, varImp = True, casOut = dict(name = 'dt_model', replace = True))

Unnamed: 0,Descr,Value
0,Number of Tree Nodes,25.0
1,Max Number of Branches,2.0
2,Number of Levels,6.0
3,Number of Leaves,13.0
4,Number of Bins,20.0
5,Minimum Size of Leaves,5.0
6,Maximum Size of Leaves,207.0
7,Number of Variables,10.0
8,Confidence Level for Pruning,0.25
9,Number of Observations Used,419.0

Unnamed: 0,Variable,Importance,Std,Count
0,_Col8_,71.690048,0.0,1.0
1,_Col7_,32.609998,13.848893,3.0
2,_Col1_,26.772125,8.452729,2.0
3,_Col4_,4.899271,1.806778,2.0
4,_Col3_,4.588084,1.494042,2.0
5,_Col6_,2.986667,0.0,1.0
6,_Col5_,0.376923,0.0,1.0

Unnamed: 0,casLib,Name,Rows,Columns,casTable
0,DMlib,dt_model,25,24,"CASTable('dt_model', caslib='DMlib')"


### Score Model on validation

In [19]:
def score_model(model, partition):
    
    # If partition is True, score only the validation rows; otherwise score the whole data set
    if partition:
        table_dct = dict(name = 'docpro_sampled', where = '_partind_ = 0')
    else:
        table_dct = dict(name = 'docpro_sampled') 
        
    score = dict(
        table      = table_dct,
        modelTable = model + '_model',
        copyVars   = ['newsgroup', '_partind_', 'TEXT'],
        casOut     = dict(name = '_scored_' + model, replace = True)
    )
    return score

s.decisionTree.dtreeScore(**score_model('dt', True))

Unnamed: 0,casLib,Name,Rows,Columns,casTable
0,DMlib,_scored_dt,179,15,"CASTable('_scored_dt', caslib='DMlib')"

Unnamed: 0,Descr,Value
0,Number of Observations Read,179.0
1,Number of Observations Used,179.0
2,Misclassification Error (%),24.581005587
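
The misclassification error is simply the share of rows whose predicted label differs from the actual one. A minimal sketch, where `misclassification_pct` is a hypothetical helper and the label lists are made up for illustration:

```python
def misclassification_pct(actual, predicted):
    """Percentage of positions where the predicted label differs from the actual one."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one observation is required")
    wrong = sum(a != p for a, p in zip(actual, predicted))
    return 100.0 * wrong / len(actual)


# Hypothetical labels, only to exercise the function.
actual    = ["hockey", "medical", "graphics", "medical"]
predicted = ["hockey", "hockey",  "graphics", "medical"]
print(misclassification_pct(actual, predicted))  # 25.0

# The validation figure reported above is consistent with 44 errors in 179 rows.
validation_pct = round(100.0 * 44 / 179, 9)
print(validation_pct)
```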


### Score whole dataset and compare predicted (medical, hockey, graphics) vs actual
Use Python syntax to modify the data and load it back into CAS.

In [20]:
s.decisionTree.dtreeScore(**score_model('dt', False))

Unnamed: 0,casLib,Name,Rows,Columns,casTable
0,DMlib,_scored_dt,598,15,"CASTable('_scored_dt', caslib='DMlib')"

Unnamed: 0,Descr,Value
0,Number of Observations Read,598.0
1,Number of Observations Used,598.0
2,Misclassification Error (%),24.581939799


In [21]:
text_pred = s.CASTable('_scored_dt')[['newsgroup','_DT_PredName_','TEXT']]
text_pred['Correct']= text_pred['newsgroup']==text_pred['_DT_PredName_']
text_pred.head()

Unnamed: 0,newsgroup,_DT_PredName_,TEXT,Correct
0,graphics,graphics,I have a few reprints left of chapters from my...,1.0
1,graphics,hockey,"gnuplot, etc. make it easy to plot real valued...",0.0
2,graphics,graphics,Article-I.D.: snoopy.1pqlhnINN8k1 References: ...,1.0
3,graphics,graphics,"Hello, I am looking to add voice input capabil...",1.0
4,graphics,graphics,I recently got a file describing a library of ...,1.0


### Accuracy by Category

In [22]:
text_pred.groupby(['newsgroup'])['Correct'].mean()

newsgroup
graphics    0.909091
hockey      0.900000
medical     0.455000
Name: Correct, dtype: float64
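
The per-class accuracies can be reproduced without pandas, and they can be cross-checked against the overall error reported earlier. In this sketch `accuracy_by_group` is a hypothetical helper exercised on made-up labels; the class sizes and accuracies in the cross-check are copied from the outputs in this notebook.

```python
from collections import defaultdict


def accuracy_by_group(actual, predicted):
    """Fraction of correct predictions within each actual class."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for a, p in zip(actual, predicted):
        totals[a] += 1
        correct[a] += int(a == p)
    return {group: correct[group] / totals[group] for group in totals}


# Hypothetical labels, only to exercise the function.
actual    = ["hockey", "hockey", "medical", "medical"]
predicted = ["hockey", "hockey", "hockey",  "medical"]
print(accuracy_by_group(actual, predicted))  # {'hockey': 1.0, 'medical': 0.5}

# Cross-check with the notebook's own numbers.
sizes = {"graphics": 198, "hockey": 200, "medical": 200}
accuracy = {"graphics": 0.909091, "hockey": 0.900000, "medical": 0.455000}
n_correct = sum(round(sizes[g] * accuracy[g]) for g in sizes)
n_total = sum(sizes.values())
n_wrong = n_total - n_correct
print(n_wrong, round(100.0 * n_wrong / n_total, 4))
```

The cross-check gives 147 misclassified rows out of 598, which agrees with the 24.58% error on the whole data set and shows that most of the errors come from the medical class.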

### Load data back into CAS

In [23]:
text_pred.table.partition(casOut='final_text_score1')
s.CASTable('final_text_score1').head(10)

Unnamed: 0,newsgroup,_PartInd_,TEXT,_DT_PredName_,_DT_PredP_,_DT_PredLevel_,_LeafID_,_MissIt_,_NumNodes_,_NodeList0_,_NodeList1_,_NodeList2_,_NodeList3_,_NodeList4_,_NodeList5_,Correct
0,graphics,1.0,I have a few reprints left of chapters from my...,graphics,0.599034,1.0,24.0,0.0,6.0,0.0,2.0,6.0,12.0,20.0,24.0,1.0
1,graphics,0.0,"gnuplot, etc. make it easy to plot real valued...",hockey,1.0,2.0,17.0,1.0,5.0,0.0,2.0,5.0,10.0,17.0,,0.0
2,graphics,1.0,Article-I.D.: snoopy.1pqlhnINN8k1 References: ...,graphics,0.599034,1.0,24.0,0.0,6.0,0.0,2.0,6.0,12.0,20.0,24.0,1.0
3,graphics,0.0,"Hello, I am looking to add voice input capabil...",graphics,0.599034,1.0,24.0,0.0,6.0,0.0,2.0,6.0,12.0,20.0,24.0,1.0
4,graphics,0.0,I recently got a file describing a library of ...,graphics,0.599034,1.0,24.0,0.0,6.0,0.0,2.0,6.0,12.0,20.0,24.0,1.0
5,graphics,1.0,d9hh@dtek.chalmers.se (Henrik Harmsen) writes:...,graphics,0.599034,1.0,24.0,0.0,6.0,0.0,2.0,6.0,12.0,20.0,24.0,1.0
6,graphics,1.0,Article-I.D.: cs.1993Apr6.020751.13389 Sender:...,graphics,0.599034,1.0,24.0,0.0,6.0,0.0,2.0,6.0,12.0,20.0,24.0,1.0
7,graphics,1.0,I am looking for publically accessible sources...,graphics,0.599034,1.0,24.0,0.0,6.0,0.0,2.0,6.0,12.0,20.0,24.0,1.0
8,graphics,1.0,The HumBio Project: Call for Data and Visualiz...,graphics,0.599034,1.0,24.0,0.0,6.0,0.0,2.0,6.0,12.0,20.0,24.0,1.0
9,graphics,0.0,Hello everybody ! If you are using PIXAR'S Ren...,graphics,0.599034,1.0,24.0,0.0,6.0,0.0,2.0,6.0,12.0,20.0,24.0,1.0


### Score new text data
We can then use our text mining solution to score new data as it comes in. Here the original news table is scored again to illustrate the call.

In [24]:
s.textMining.tmScore(
  documents=dataset,
  u='svdu',
  parseConfig='config',
  terms='terms',
  docPro=c_dict('score_docpro'),
  parent=c_dict('score_parent'),
  text=textvar,
  docId="key"
)

Unnamed: 0,casLib,Name,Label,Rows,Columns,casTable
0,DMlib,score_parent,,31519,3,"CASTable('score_parent', caslib='DMlib')"
1,DMlib,score_docpro,,598,11,"CASTable('score_docpro', caslib='DMlib')"


In [25]:
s.CASTable('score_docpro').sort_values(by='key').head()

Unnamed: 0,key,_Col1_,_Col2_,_Col3_,_Col4_,_Col5_,_Col6_,_Col7_,_Col8_,_Col9_,_Col10_
0,1.0,0.045274,0.262227,0.124965,-0.006553,0.098686,0.370008,0.353609,0.065083,0.631025,0.489301
1,2.0,0.52124,0.358958,-0.031114,0.182669,0.215457,0.189721,0.216669,-0.06497,0.305267,0.58168
2,3.0,0.051122,0.11652,0.310052,0.040493,0.502558,0.230435,0.606129,0.162678,-0.046764,0.429337
3,4.0,0.030576,0.688631,-0.026268,-0.002048,0.161236,0.372678,0.105735,-0.01708,0.099294,0.581328
4,5.0,0.160315,0.167704,-0.010341,-0.014588,0.227844,0.450432,0.240869,0.007719,-0.018137,0.795391


### End CAS session

In [26]:
s.close()