# Fine Foods - Data Overview & Text Analysis
We will use Amazon food reviews to build a recommendation engine with a Factorization Machine in SAS Viya.

Factorization Machine (FM) is one of the newer algorithms in the machine learning space, and it is available in SAS Viya through the `factMac` action set. FM is a general prediction algorithm, similar to Support Vector Machines, that can model very sparse data, an area where many traditional machine learning techniques struggle.

Since FM is a general prediction algorithm, it accepts real-valued input vectors of any size. Because of this, we will use SAS Viya text analytics capabilities to represent text as numeric vectors that we can use as inputs to our FM model.
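
To make the idea of a general prediction algorithm concrete, here is a minimal sketch of the standard second-order factorization machine prediction in plain Python. It is not the SAS `factMac` implementation; the function name `fm_predict` and the toy weights are assumptions chosen purely for illustration.

```python
def fm_predict(x, w0, w, V):
    """Second-order factorization machine prediction for one input vector.

    x  : list of n real-valued features (mostly zeros for sparse data)
    w0 : global bias
    w  : list of n linear weights
    V  : n x k list of lists; row i is the latent factor vector of feature i
    """
    n, k = len(x), len(V[0])
    linear = sum(w[i] * x[i] for i in range(n))
    # Pairwise term sum_{i<j} <V[i], V[j]> * x[i] * x[j], computed in O(n*k)
    # with the identity 0.5 * sum_f ((sum_i v_if x_i)^2 - sum_i (v_if x_i)^2).
    pairwise = 0.0
    for f in range(k):
        s = sum(V[i][f] * x[i] for i in range(n))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(n))
        pairwise += 0.5 * (s * s - s_sq)
    return w0 + linear + pairwise


# Hypothetical toy example: 3 features, 2 latent factors.
x = [1.0, 0.0, 2.0]
w0 = 0.5
w = [0.1, 0.2, 0.3]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
prediction = fm_predict(x, w0, w, V)
print(prediction)
```

Because every pair of features interacts through low-dimensional factor vectors rather than a separate weight per pair, the model can still estimate interactions for feature pairs that rarely or never occur together, which is what makes it a good fit for sparse user and product data.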


This notebook has **four** parts:
1. Load Data
2. Overview Data & Prepare for text analytics
3. Perform text analysis
4. Promote Text Analytics Dataset into public memory

We will use the dataset promoted to public memory to train our FM model in SAS Studio.

## 1. Load Data
In this step, we will connect to our CAS server and load the relevant table, which we prepared in Python, into memory.


In [3]:
import swat
from swat import CAS
#swat.options.cas.print_messages = False

# Connect to the session
cashost='racesx12013.demo.sas.com'
casport=5570
# Raw string so the backslash in the Windows path is not read as an escape
casauth=r'U:\.authinfo_w12_race'

s = CAS(cashost, casport, authinfo=casauth, caslib="casuser")

#Load Data
f='foods_prepped'
s.loadTable(caslib='DemoData', path=f+'.csv', casout=f);

#Load action sets
actionsets=['fedSQL', 'autoTune', 'factMac', 'textMining']
for actionset in actionsets:
    s.builtins.loadactionset(actionset)


#Create shortcuts
food = s.CASTable(f)
target = 'score'
class_inputs = ['helpfulness','productid','time','userid']

NOTE: Cloud Analytic Services made the file foods_prepped.csv available as table FOODS_PREPPED in caslib CASUSER(sasdemo).
NOTE: Added action set 'fedSQL'.
NOTE: Added action set 'autoTune'.
NOTE: Added action set 'factMac'.
NOTE: Added action set 'textMining'.


## 2. Overview Data & Prepare for text analytics
In this step, we will add a `key` column that holds the row number. The text analytics step needs it as a document identifier. We will also look at the data to make sure everything looks right.

In [2]:
#Add a column Identifier in-memory
s.dataStep.runCode('''data ''' + f + '''; 
                      set '''  + f + ''';
                      key = _n_; run;''')

#Print Number of reviews
print(len(food), "Reviews")

#Validate first few rows
food.head()

568454 Reviews


Unnamed: 0,helpfulness,productId,score,summary,text,time,userId,key
0,2/2,B000HEA964,5.0,Dog's Favorite Snack,These chicken chips are devored daily by my 2 ...,1212883000.0,A2E61OQYIVB55P,67425.0
1,2/2,B000HEA964,5.0,"Better Than ""Cookies""",These crunchy treats are irresistable to my Co...,1208304000.0,A2UCGE4EQZ0P4A,67426.0
2,2/2,B000HEA964,4.0,Good for small dogs.,"I have two American Eskimo dogs, and so these ...",1204157000.0,A304WL23L6EDML,67427.0
3,2/2,B000HEA964,5.0,great,My little dog loved these. Were first sent to ...,1176163000.0,A287Z78FJTTT27,67428.0
4,1/1,B000HEA964,5.0,"Cost more than steak, but my dogs love them!",My two Havanese really love these! They are v...,1285114000.0,A18UVHCREY2RE2,67429.0
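
The `key` column matters because the text mining step returns per-document results that have to be matched back to the reviews they came from. A minimal sketch of that idea in plain Python, using made-up rows (the names `add_row_key`, `reviews`, and `projections` are hypothetical and not part of SWAT):

```python
def add_row_key(rows):
    """Return copies of the rows with a 1-based 'key', mirroring key = _n_."""
    return [dict(row, key=i) for i, row in enumerate(rows, start=1)]


# Made-up rows for illustration only.
reviews = add_row_key([
    {"text": "great snack", "score": 5.0},
    {"text": "too salty", "score": 2.0},
])

# Hypothetical per-document output that carries only the key.
projections = {1: [0.8, 0.6], 2: [0.6, 0.8]}

# The key lets us attach each document's vector to its source row.
joined = [dict(row, vector=projections[row["key"]]) for row in reviews]
print(joined[1]["text"], joined[1]["vector"])
```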


## 3. Perform text analysis
The code below performs several kinds of text analytics, including:
1. Creating a document-term matrix
2. Creating parent-child relationships
3. Text topics and important terms per topic
4. Creating a structured representation of the text data

For this FM model, we will reduce the document-term matrix to three structured representations (`k=3`) that describe latent differences among the text data. We will use these numeric vectors, in addition to our original inputs, in our FM model.
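
As a rough illustration of the first item in the list above, the sketch below builds a tiny document-term matrix in plain Python after removing stop words. It only mimics the idea: `tmMine` also handles parsing, entities, term weighting, and the reduction to `k` dimensions. The sample documents, the `STOP_WORDS` set, and the function name are made up for this example.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "my", "are", "these", "and", "is"}  # stand-in for engstop


def document_term_matrix(docs, stop_words):
    """Return (vocabulary, matrix) where matrix[d][t] counts term t in doc d."""
    counts = []
    for doc in docs:
        tokens = re.findall(r"[a-z']+", doc.lower())
        counts.append(Counter(t for t in tokens if t not in stop_words))
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix


docs = [
    "My dogs love these treats",
    "These treats are crunchy and the dogs love crunchy treats",
]
vocabulary, matrix = document_term_matrix(docs, STOP_WORDS)
print(vocabulary)
for row in matrix:
    print(row)
```

With hundreds of thousands of reviews and a vocabulary of this kind, most entries in the matrix are zero, which is why we reduce it to a few dense columns before passing it to the FM model.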

In [4]:
#Load stop list into memory
s.loadTable(caslib='DemoData', path='engstop'+'.sas7bdat', casout='engstop');

#Helper that builds the output-table options (name, replace) passed to tmMine
def c_dict(name):
    return dict(name=name, replace=True)

#Perform text mining (arguments marked #* are optional)
s.textMining.tmMine(
  documents=f,
  stopList="engstop",
  docId="key",
  copyVars=class_inputs + [target],
  text='text',
  reduce=10,
  entities="STD",
  k=3,
  norm="DOC",
  u=c_dict("svdu"),
  terms=c_dict("terms"), #*
  parent=c_dict("parent"), #*
  child=c_dict("child"), #*
  parseConfig=c_dict("config"), #*
  docPro=c_dict("docpro"), 
  topics=c_dict("topics"), #*
)

Unnamed: 0,casLib,Name,Label,Rows,Columns,casTable
0,CASUSER(sasdemo),config,,1,11,"CASTable('config', caslib='CASUSER(sasdemo)')"
1,CASUSER(sasdemo),terms,,141740,11,"CASTable('terms', caslib='CASUSER(sasdemo)')"
2,CASUSER(sasdemo),parent,,15670231,3,"CASTable('parent', caslib='CASUSER(sasdemo)')"
3,CASUSER(sasdemo),child,,16503075,3,"CASTable('child', caslib='CASUSER(sasdemo)')"
4,CASUSER(sasdemo),svdu,,72708,4,"CASTable('svdu', caslib='CASUSER(sasdemo)')"
5,CASUSER(sasdemo),docpro,,568454,9,"CASTable('docpro', caslib='CASUSER(sasdemo)')"
6,CASUSER(sasdemo),topics,,3,3,"CASTable('topics', caslib='CASUSER(sasdemo)')"


## 4. Promote Text Analytics Dataset into public memory
We will first take a look at the dataset output from our text analytics. We can see that 3 columns have been added: `_Col1_`, `_Col2_`, and `_Col3_`. These columns are numerical summaries of how each text review relates to each of the 3 latent text topics.
<br>

We will then promote this dataset into public memory, where we will use it to build an FM model in SAS Studio. Alternatively, you could save the file to the server as a sashdat file and load it into memory in SAS Studio.

In [5]:
s.CASTable("docpro").fetch(to=5)

Unnamed: 0,key,_Col1_,_Col2_,_Col3_,helpfulness,productId,time,userId,score
0,8408.0,0.793594,0.302304,0.528035,0/0,B00146K7MU,1288829000.0,AYYACIDP5I4V6,5.0
1,8409.0,0.774742,0.256919,0.577726,4/4,B001ESKSPY,1294618000.0,A3SQJCRXHOQ8GF,5.0
2,8410.0,0.835524,0.256906,0.485694,2/2,B001ESKSPY,1308269000.0,A1XUX4HFY8F7YW,5.0
3,8411.0,0.836214,0.289241,0.465924,6/6,B004749DY4,1327018000.0,A216NSW58Q3SCJ,4.0
4,8412.0,0.795012,0.366726,0.483184,6/7,B004749DY4,1324426000.0,ACJT8MUC0LRF0,4.0
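
As a quick sanity check on these columns, the sketch below recomputes the Euclidean length of each of the five rows shown above. The values are copied from that output. The lengths come out at approximately 1, which would be consistent with the `norm="DOC"` setting normalizing each document's projection, though that interpretation is our assumption rather than something the output states.

```python
import math

# _Col1_, _Col2_, _Col3_ copied from the five rows printed above.
rows = [
    (0.793594, 0.302304, 0.528035),
    (0.774742, 0.256919, 0.577726),
    (0.835524, 0.256906, 0.485694),
    (0.836214, 0.289241, 0.465924),
    (0.795012, 0.366726, 0.483184),
]

# Euclidean (L2) length of each document's 3-dimensional projection.
norms = [math.sqrt(sum(v * v for v in row)) for row in rows]
for norm in norms:
    print(round(norm, 4))
```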


In [7]:
#Load the previously saved text analytics table into session memory
s.loadTable(caslib='DemoData', path='Foods_prep_text'+'.sashdat', casout='docpro')

NOTE: Cloud Analytic Services made the file Foods_prep_text.sashdat available as table DOCPRO in caslib CASUSER(sasdemo).


In [15]:
#Promote the table to global scope so other sessions (e.g. SAS Studio) can use it
s.table.promote(table='docpro')

#Save file to Server
s.table.save(caslib='DemoData', name='Foods_prep_text.sashdat', table="docpro")

NOTE: Cloud Analytic Services saved the file Foods_prep_text.sashdat in caslib DemoData.
