# Text - Decision Tree

In [8]:
emails=read.csv("emails.csv",stringsAsFactors=FALSE)

In [9]:
str(emails)

'data.frame':	5728 obs. of  2 variables:
 $ text: chr  "Subject: naturally irresistible your corporate identity  lt is really hard to recollect a company : the  market is full of suqg"| __truncated__ "Subject: the stock trading gunslinger  fanny is merrill but muzo not colza attainder and penultimate like esmark perspicuous ra"| __truncated__ "Subject: unbelievable new homes made easy  im wanting to show you this  homeowner  you have been pre - approved for a $ 454 , 1"| __truncated__ "Subject: 4 color printing special  request additional information now ! click here  click here for a printable version of our o"| __truncated__ ...
 $ spam: int  1 1 1 1 1 1 1 1 1 1 ...


In [10]:
table(emails$spam)


   0    1 
4360 1368 
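
The counts above imply that about 76% of the messages are non-spam and about 24% are spam. A quick check, with the two counts typed in by hand from the output (the names `counts` and `props` are only for illustration):

```r
# Class counts copied from the table(emails$spam) output above
counts <- c("0" = 4360, "1" = 1368)

# Share of each class in the full data set
props <- counts / sum(counts)
print(round(props, 3))
```

Any classifier built later has to beat the share of the larger class to be useful.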

In [4]:
library(tm)

Loading required package: NLP


In [5]:
library(SnowballC)

A corpus is a collection of documents. Building one makes the preprocessing steps easier to apply.

In [11]:
corpus = Corpus(VectorSource(emails$text))

In [12]:
corpus

<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 5728

Convert all text to lower case

In [None]:
corpus = tm_map(corpus, content_transformer(tolower))

Remove punctuation

In [10]:
corpus = tm_map(corpus, removePunctuation)

Remove stopwords such as "is" and "the"

In [11]:
corpus = tm_map(corpus, removeWords, stopwords("english"))

Stem the documents, so that words such as "removed" and "remove" are both reduced to "remov"

In [12]:
corpus = tm_map(corpus, stemDocument)
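
The effect of the first three cleaning steps can be seen on a single string. The sketch below imitates them in base R; it is not how `tm` implements them, and the message text and the three-word stopword list are made up for the example. Stemming is left out because it needs the Snowball stemmer.

```r
# Toy message and a tiny hypothetical stopword list
msg <- "Subject: The offer IS really hard to resist, click here!"
toy_stopwords <- c("the", "is", "to")

# 1. lower case
clean <- tolower(msg)
# 2. strip punctuation
clean <- gsub("[[:punct:]]", "", clean)
# 3. drop stopwords, working word by word
words <- strsplit(clean, "[[:space:]]+")[[1]]
words <- words[!(words %in% toy_stopwords)]

print(words)
```

Seven words remain: subject, offer, really, hard, resist, click, here.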

Create the document-term matrix (one row per document, one column per term)

In [13]:
dfm = DocumentTermMatrix(corpus)

In [14]:
dfm

<<DocumentTermMatrix (documents: 5728, terms: 28687)>>
Non-/sparse entries: 481719/163837417
Sparsity           : 100%
Maximal term length: 24
Weighting          : term frequency (tf)
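
The "100%" sparsity in the summary is a rounded figure. Recomputing it from the numbers printed above gives about 99.7% (variable names are illustrative):

```r
# Figures copied from the DocumentTermMatrix summary above
n_docs     <- 5728
n_terms    <- 28687
non_sparse <- 481719
sparse     <- 163837417

# Every document/term pair is one cell of the matrix
total_cells <- n_docs * n_terms
stopifnot(total_cells == non_sparse + sparse)

# Share of cells that are zero, as a percentage
sparsity <- sparse / total_cells
print(round(100 * sparsity, 2))
```

Almost every cell is zero, which is why the next step trims rare terms.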

Keep only the more frequent terms: drop terms that are absent from 99.5% or more of the documents

In [16]:
sdfm = removeSparseTerms(dfm, 0.995)
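
To see what the threshold does, the sketch below applies the same idea to a small hand-made count matrix in base R: a term is kept only when the share of documents lacking it is below the threshold. This imitates the idea behind `removeSparseTerms` and is not tm's code; the matrix `toy` and the threshold 0.5 are invented for the example.

```r
# Hypothetical 5-document, 3-term count matrix (filled column by column)
toy <- matrix(c(1, 2, 1, 3, 1,    # "offer": present in all 5 documents
                0, 1, 0, 2, 1,    # "click": present in 3 documents
                0, 0, 4, 0, 0),   # "colza": present in 1 document
              nrow = 5,
              dimnames = list(NULL, c("offer", "click", "colza")))

# Share of documents in which each term is absent
absent_share <- colMeans(toy == 0)

# Keep terms whose absent share is below the threshold
threshold <- 0.5
kept <- toy[, absent_share < threshold, drop = FALSE]
print(colnames(kept))
```

Here "offer" and "click" survive and "colza" is dropped. By the same reasoning, a threshold of 0.995 on 5728 documents would keep terms that occur in more than about 28.6 documents, that is, in 29 or more.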

Convert the matrix to a data frame, the format the modelling functions expect

In [17]:
emailSparse = as.data.frame(as.matrix(sdfm))

In [18]:
colnames(emailSparse) = make.names(colnames(emailSparse))

Find the word with the maximum count

In [19]:
which.max(colSums(emailSparse))

In [20]:
emailSparse$spam = emails$spam

Find the words with a large count in spam

In [21]:
colnames(emailSparse[colSums(subset(emailSparse,spam==1))>1000])

Find the words with a large count in non-spam

In [22]:
colnames(emailSparse[colSums(subset(emailSparse,spam==0))>=5000])

In [23]:
colnames(emailSparse[colSums(subset(emailSparse,spam==1))>=1000])
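
In the cells above the `spam` label column is part of the column sums. With 1368 spam rows its own sum exceeds 1000, so `spam` would be listed among the frequent spam words. A minimal sketch of the same comparison with the label left out, on a made-up four-row data frame (all names and counts here are hypothetical):

```r
# Hypothetical miniature version of emailSparse
toy <- data.frame(offer = c(3, 2, 0, 0),
                  meet  = c(0, 0, 2, 4),
                  spam  = c(1, 1, 0, 0))

# Sum term counts separately for spam and non-spam rows,
# leaving the spam label out of the sums
term_cols   <- setdiff(colnames(toy), "spam")
spam_counts <- colSums(toy[toy$spam == 1, term_cols])
ham_counts  <- colSums(toy[toy$spam == 0, term_cols])

print(names(spam_counts)[spam_counts >= 3])
print(names(ham_counts)[ham_counts >= 3])
```

The first line prints "offer" and the second prints "meet".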

In [24]:
library(caTools)

In [25]:
emailSparse$spam = as.factor(emailSparse$spam)

In [26]:
table(emailSparse$spam)


   0    1 
4360 1368 

In [27]:
set.seed(123)

In [29]:
spl = sample.split(emailSparse$spam, SplitRatio = 0.7)

In [30]:
table(spl)

spl
FALSE  TRUE 
 1718  4010 

In [31]:
trainSparse = subset(emailSparse, spl==TRUE)

In [32]:
table(trainSparse$spam)


   0    1 
3052  958 
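
`sample.split` splits within each class, so the training set keeps the class balance of the full data. The arithmetic can be checked against the counts printed above (variable names are illustrative):

```r
# Class counts in the full data and the split ratio used above
full_counts <- c("0" = 4360, "1" = 1368)
split_ratio <- 0.7

# Expected training counts if each class is split 70/30
expected_train <- round(full_counts * split_ratio)
print(expected_train)

# Spam share in the full data and in the training set
full_share  <- full_counts[["1"]] / sum(full_counts)
train_share <- 958 / (3052 + 958)
print(round(c(full = full_share, train = train_share), 4))
```

The expected counts, 3052 and 958, match the table above, and the spam share stays at roughly 23.9% in both sets.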

In [33]:
testSparse = subset(emailSparse, spl==FALSE)

In [34]:
library(rpart)

In [35]:
library(rpart.plot)

Build the decision tree model

In [57]:
spamCART = rpart(spam ~ ., data=trainSparse, method="class",cp=0.1)
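
The `cp` argument is the complexity parameter: a split is only kept if it improves the fit by at least that amount, so a larger `cp` gives a smaller tree. The sketch below shows this on the `kyphosis` data that ships with `rpart`, used here as a stand-in for the e-mail data. The helper `n_leaves` is a hypothetical name, not an rpart function.

```r
library(rpart)  # kyphosis ships with rpart and stands in for the e-mail data

# Count the terminal nodes of a fitted tree
n_leaves <- function(fit) sum(fit$frame$var == "<leaf>")

big_cp   <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                  method = "class", cp = 0.1)
small_cp <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                  method = "class", cp = 0.01)

# The tree grown with the smaller cp has at least as many leaves
print(c(cp_0.1 = n_leaves(big_cp), cp_0.01 = n_leaves(small_cp)))
```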

Predict on the test set

In [58]:
predictCART = predict(spamCART, newdata=testSparse, type="class")

In [40]:
table(predictCART)

predictCART
   0    1 
1258  460 

In [59]:
table(testSparse$spam, predictCART)

   predictCART
       0    1
  0 1105  203
  1    8  402
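
The confusion matrix can be summarised as accuracy, sensitivity and specificity, and compared with always predicting the majority class. The sketch below retypes the four cells from the output above (rows are actual classes, columns are predictions); variable names are illustrative.

```r
# Confusion matrix copied from the output above, filled column by column
cm <- matrix(c(1105, 8, 203, 402), nrow = 2,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

accuracy    <- sum(diag(cm)) / sum(cm)        # share of correct predictions
sensitivity <- cm["1", "1"] / sum(cm["1", ])  # spam correctly flagged
specificity <- cm["0", "0"] / sum(cm["0", ])  # non-spam correctly passed
baseline    <- max(rowSums(cm)) / sum(cm)     # always predicting the majority class

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, baseline = baseline), 3))
```

For this table the accuracy is about 0.877 against a baseline of about 0.761. Nearly all spam is caught (sensitivity about 0.980), while 203 legitimate messages are flagged as spam (specificity about 0.845).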

In [60]:
library(caret)

Use 10-fold cross-validation

In [43]:
numFolds = trainControl( method = "cv", number = 10 )

In [44]:
cpGrid = expand.grid( .cp = seq(0.01,0.5,0.01)) 

In [45]:
train(spam ~.,data = trainSparse, method = "rpart", trControl = numFolds, tuneGrid = cpGrid )

CART 

4010 samples
2330 predictors
   2 classes: '0', '1' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 3608, 3609, 3609, 3609, 3609, 3609, ... 
Resampling results across tuning parameters:

  cp    Accuracy   Kappa      Accuracy SD   Kappa SD  
  0.01  0.9334076  0.8231324  0.0121730798  0.03124879
  0.02  0.9224331  0.7933381  0.0130369334  0.03527999
  0.03  0.9224337  0.7931073  0.0132967404  0.03577457
  0.04  0.9094680  0.7637460  0.0136241503  0.03138312
  0.05  0.9062298  0.7592390  0.0142500583  0.03126693
  0.06  0.9009904  0.7578630  0.0134705600  0.02987518
  0.07  0.9009904  0.7578630  0.0134705600  0.02987518
  0.08  0.9009904  0.7578630  0.0134705600  0.02987518
  0.09  0.8920023  0.7402830  0.0282912217  0.05779572
  0.10  0.8733033  0.7026357  0.0205268631  0.04116617
  0.11  0.8733033  0.7026357  0.0205268631  0.04116617
  0.12  0.8733033  0.7026357  0.0205268631  0.04116617
  0.13  0.8688146  0.6942909  0.0235171995  0.04596020
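
Among the rows shown, the cross-validated accuracy is highest at the smallest `cp` tried. A short sketch that picks the best row from the printed values, retyped by hand (the data frame `cv` is only for illustration):

```r
# cp and cross-validated accuracy copied from the rows of the table above
cv <- data.frame(
  cp       = seq(0.01, 0.13, by = 0.01),
  accuracy = c(0.9334076, 0.9224331, 0.9224337, 0.9094680, 0.9062298,
               0.9009904, 0.9009904, 0.9009904, 0.8920023, 0.8733033,
               0.8733033, 0.8733033, 0.8688146)
)

# Row with the highest cross-validated accuracy
best <- cv[which.max(cv$accuracy), ]
print(best)
```

This selects `cp = 0.01`, the value used for the second model.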


In [55]:
spamCART2 = rpart(spam ~ ., data=trainSparse, method="class",cp=0.01)

In [56]:
predictCART = predict(spamCART2, newdata=testSparse, type="class")

In [54]:
table(testSparse$spam, predictCART)

   predictCART
       0    1
  0 1105  203
  1    8  402