### Crude dataset

The crude data set holds 20 news articles, with additional meta information, taken from the Reuters-21578 data set. All documents belong to the topic "crude", which deals with crude oil.

In [3]:
# load the required library
# NOTE: Please ignore the warning message
library(tm)

# Load data
data(crude)

# Look at the help file of the dataset
?crude

In [4]:
# Display the raw text of the first document in the corpus
writeLines(as.character(crude[[1]]))

Diamond Shamrock Corp said that
effective today it had cut its contract prices for crude oil by
1.50 dlrs a barrel.
    The reduction brings its posted price for West Texas
Intermediate to 16.00 dlrs a barrel, the copany said.
    "The price reduction today was made in the light of falling
oil product prices and a weak crude oil market," a company
spokeswoman said.
    Diamond is the latest in a line of U.S. oil companies that
have cut its contract, or posted, prices over the last two days
citing weak oil markets.
 Reuter


### Pre-processing

In this step, we'll convert the documents to lower case and remove punctuation, stopwords, and numbers. We'll then perform stemming.

In [5]:
# Transform document words to lower case
crude <- tm_map(crude, content_transformer(tolower))
writeLines(as.character(crude[[1]]))

diamond shamrock corp said that
effective today it had cut its contract prices for crude oil by
1.50 dlrs a barrel.
    the reduction brings its posted price for west texas
intermediate to 16.00 dlrs a barrel, the copany said.
    "the price reduction today was made in the light of falling
oil product prices and a weak crude oil market," a company
spokeswoman said.
    diamond is the latest in a line of u.s. oil companies that
have cut its contract, or posted, prices over the last two days
citing weak oil markets.
 reuter


In [6]:
# Remove punctuation from documents
crude <- tm_map(crude, removePunctuation)
writeLines(as.character(crude[[1]]))

diamond shamrock corp said that
effective today it had cut its contract prices for crude oil by
150 dlrs a barrel
    the reduction brings its posted price for west texas
intermediate to 1600 dlrs a barrel the copany said
    the price reduction today was made in the light of falling
oil product prices and a weak crude oil market a company
spokeswoman said
    diamond is the latest in a line of us oil companies that
have cut its contract or posted prices over the last two days
citing weak oil markets
 reuter


In [7]:
# Remove stopwords from the corpus
crude <- tm_map(crude, removeWords, stopwords("english"))
writeLines(as.character(crude[[1]]))

diamond shamrock corp said 
effective today   cut  contract prices  crude oil 
150 dlrs  barrel
     reduction brings  posted price  west texas
intermediate  1600 dlrs  barrel  copany said
     price reduction today  made   light  falling
oil product prices   weak crude oil market  company
spokeswoman said
    diamond   latest   line  us oil companies 
 cut  contract  posted prices   last two days
citing weak oil markets
 reuter


In [8]:
# Remove numbers from the corpus
crude <- tm_map(crude, removeNumbers)
writeLines(as.character(crude[[1]]))

diamond shamrock corp said 
effective today   cut  contract prices  crude oil 
 dlrs  barrel
     reduction brings  posted price  west texas
intermediate   dlrs  barrel  copany said
     price reduction today  made   light  falling
oil product prices   weak crude oil market  company
spokeswoman said
    diamond   latest   line  us oil companies 
 cut  contract  posted prices   last two days
citing weak oil markets
 reuter


In [9]:
# Load lsa, which provides cosine() for the similarity step further below
# NOTE: Please ignore the warning message
library(lsa)
# Stem the corpus
crude <- tm_map(crude, stemDocument, language = "english")
writeLines(as.character(crude[[1]]))

Loading required package: SnowballC


diamond shamrock corp said effect today cut contract price crude oil dlrs barrel reduct bring post price west texa intermedi dlrs barrel copani said price reduct today made light fall oil product price weak crude oil market compani spokeswoman said diamond latest line us oil compani cut contract post price last two day cite weak oil market reuter
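
To make the individual cleaning steps concrete, here is a minimal base R sketch that applies them to a single sentence without `tm`. The function name `clean_text` and its short stopword list are hypothetical and only for illustration; the real `stopwords("english")` list is much longer, and stemming is left out because it needs SnowballC.

```r
# Hypothetical helper: lower-case, strip punctuation, drop stopwords and numbers
clean_text <- function(x, stop_words = c("the", "a", "its", "to", "for", "by")) {
  x <- tolower(x)                                  # lower case
  x <- gsub("[[:punct:]]+", "", x)                 # remove punctuation
  words <- strsplit(x, "[[:space:]]+")[[1]]        # split into words
  words <- words[!(words %in% stop_words)]         # remove stopwords
  words <- gsub("[[:digit:]]+", "", words)         # remove numbers
  words <- words[nzchar(words)]                    # drop words that became empty
  paste(words, collapse = " ")
}

cleaned <- clean_text("The reduction brings its posted price to 16.00 dlrs a barrel.")
print(cleaned)
```

Unlike the `tm` transformations, this sketch also collapses the leftover whitespace, so the result is a single-spaced string.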


### TF-IDF

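TF-IDF scores how important a term is to a document by combining how often the term occurs in that document (term frequency) with how rare the term is across the whole corpus (inverse document frequency). A term that appears in almost every document gets a low weight, even if it is frequent.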

In [10]:
# Build a document-term matrix using TF-IDF and inspect
crude.dt <- DocumentTermMatrix(crude, control=list(weighting=weightTfIdf))
crude.dt
inspect(crude.dt[1:8, 1:8])

<<DocumentTermMatrix (documents: 20, terms: 782)>>
Non-/sparse entries: 1502/14138
Sparsity           : 90%
Maximal term length: 16
Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)

<<DocumentTermMatrix (documents: 8, terms: 8)>>
Non-/sparse entries: 5/59
Sparsity           : 92%
Maximal term length: 9
Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)
Sample             :
     Terms
Docs  abdulaziz       abil abl     abroad accept accord across      activ
  127         0 0.00000000   0 0.00000000      0      0      0 0.00000000
  144         0 0.02081343   0 0.00000000      0      0      0 0.00000000
  191         0 0.00000000   0 0.00000000      0      0      0 0.00000000
  194         0 0.00000000   0 0.00000000      0      0      0 0.00000000
  211         0 0.00000000   0 0.00000000      0      0      0 0.00000000
  236         0 0.03170230   0 0.01668698      0      0      0 0.00000000
  237         0 0.00000000   0 0.00000000      0      0      0 0.01277665
  242         0 0.00000000   0 0.00000000      0      0      0 0.03818308
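
The weighting can also be worked through by hand on a toy corpus. The sketch below uses base R only and assumes the normalized variant: term frequency is the count divided by the document length, and inverse document frequency is the base-2 log of the number of documents over the number of documents containing the term. We believe this matches what `weightTfIdf` does by default, but treat the exact formula as an assumption. The three toy documents are made up for illustration.

```r
# Toy corpus of three already-tokenised documents (hypothetical data)
docs <- list(
  d1 = c("oil", "price", "cut", "oil"),
  d2 = c("oil", "market", "weak"),
  d3 = c("opec", "meeting", "price")
)
terms <- sort(unique(unlist(docs)))

# Raw counts: rows are documents, columns are terms
counts <- t(vapply(
  docs,
  function(d) tabulate(match(d, terms), nbins = length(terms)),
  integer(length(terms))
))
colnames(counts) <- terms

tf  <- counts / rowSums(counts)          # term frequency, normalized by document length
df  <- colSums(counts > 0)               # number of documents containing each term
idf <- log2(nrow(counts) / df)           # inverse document frequency
tfidf <- sweep(tf, 2, idf, `*`)          # multiply each term column by its idf

print(round(tfidf, 3))
```

In document `d1`, "oil" occurs twice but also appears in `d2`, so its weight ends up below that of "cut", which occurs once and only in `d1`.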


### Cosine similarity

Cosine similarity scores how similar two term vectors are, for example a query and a document in the collection, by measuring the angle between them. Here we use it to compare every pair of documents in the corpus.

In [11]:
# Compute a matrix of cosine similarity scores between each document pair and inspect
crude.cos <- cosine(as.matrix(t(crude.dt)))
crude.cos[1:8, 1:8]

            127         144         191         194         211         236         237         242
127         1.0 0.044101264 0.171502012 0.234109138 0.017396426  0.04490304 0.021734896 0.008863279
144 0.044101264         1.0  0.00549563 0.006553802 0.014369444  0.16843578  0.05666018 0.064470093
191 0.171502012  0.00549563         1.0 0.293581671 0.015709368  0.02321779 0.006743755 0.001942854
194 0.234109138 0.006553802 0.293581671         1.0 0.019544632  0.02257741  0.03487625 0.002659559
211 0.017396426 0.014369444 0.015709368 0.019544632         1.0  0.03027528 0.021429912 0.005676107
236  0.04490304  0.16843578 0.023217787 0.022577411  0.03027528         1.0 0.073052963 0.124905313
237 0.021734896  0.05666018 0.006743755  0.03487625 0.021429912  0.07305296         1.0 0.046109463
242 0.008863279 0.064470093 0.001942854 0.002659559 0.005676107  0.12490531 0.046109463         1.0
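
For reference, the score itself is just the dot product of two vectors divided by the product of their lengths. The following base R sketch computes it for hypothetical term-weight vectors; `cosine_sim` and the matrix `m` are illustrative names, not part of `lsa`. Note that `cosine()` from `lsa` compares the columns of a matrix, which is why the document-term matrix is transposed in the cell above.

```r
# Cosine similarity between two numeric vectors
cosine_sim <- function(x, y) {
  sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

# Hypothetical term weights: one row per document, one column per term
m <- rbind(
  d1 = c(oil = 1, price = 2, opec = 0),
  d2 = c(oil = 2, price = 4, opec = 0),
  d3 = c(oil = 0, price = 0, opec = 3)
)

s12 <- cosine_sim(m["d1", ], m["d2", ])  # same direction: maximal similarity
s13 <- cosine_sim(m["d1", ], m["d3", ])  # no shared terms: no similarity

# All document pairs at once: dot products divided by the product of the norms
norms <- sqrt(rowSums(m^2))
sim <- tcrossprod(m) / outer(norms, norms)
print(round(sim, 3))
```

Because the score depends only on direction, `d1` and `d2` count as identical even though `d2` has twice the weights. This is why longer documents are not automatically more similar to everything.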


### Additional pre-processing steps

Improve the analysis using the suggested steps below! Re-run the analysis after further refining your data.

In [12]:
# Add "said" to the stopword list and remove the stopwords again
crude <- tm_map(crude, removeWords, c(stopwords("english"), "said"))
writeLines(as.character(crude[[1]]))

diamond shamrock corp  effect today cut contract price crude oil dlrs barrel reduct bring post price west texa intermedi dlrs barrel copani  price reduct today made light fall oil product price weak crude oil market compani spokeswoman  diamond latest line us oil compani cut contract post price last two day cite weak oil market reuter


In [13]:
# load the required library
# NOTE: Please ignore the warning message
library(spelling)
# Check for misspellings/non-standard English words
# Reload the raw corpus so the check runs on the original, unprocessed text
data(crude)
for (i in seq_along(crude)) {
  misspelled <- spell_check_text(as.character(crude[[i]]), ignore = character(), lang = "en_US")
  print(i)
  print(misspelled)
}

[1] 1
    word found
1 copany     1
2   dlrs     1
3 Reuter     1
[1] 2
        word found
1      Bijan     1
2        bpd     1
3       CERA     1
4    Mizrahi     1
5        mln     1
6     Mlotok     1
7  Moussavar     1
8    Rahmani     1
9     Reuter     1
10   Salomon     1
11   Spriggs     1
12    Yergin     1
[1] 3
    word found
1    cts     1
2   dlrs     1
3 Reuter     1
4  Swann     1
[1] 4
    word found
1    dlr     1
2   dlrs     1
3 Reuter     1
[1] 5
         word found
1        dlrs     1
2         mln     1
3      Reuter     1
4 unitholders     1
[1] 6
           word found
1            al     1
2        Alvite     1
3           bpd     1
4          dlrs     1
5       Khalifa     1
6           mln     1
7          opec     1
8  organisation     1
9         Qabas     1
10        qatar     1
11       REUTER     1
12       rumour     1
13        Sabah     1
[1] 7
         word found
1        dlrs     1
2     favours     1
3 liberalised     1
4         mln     1
5      R

In [14]:
# Replace misspelled/non-standard words from above with the correct words
pattern <- c("copany", "organisation")
replacement <- c("company", "organization")
for (i in seq_along(pattern)) {
  crude <- tm_map(crude, content_transformer(gsub), pattern = pattern[i], replacement = replacement[i])
}
writeLines(as.character(crude[[1]]))

Diamond Shamrock Corp said that
effective today it had cut its contract prices for crude oil by
1.50 dlrs a barrel.
    The reduction brings its posted price for West Texas
Intermediate to 16.00 dlrs a barrel, the company said.
    "The price reduction today was made in the light of falling
oil product prices and a weak crude oil market," a company
spokeswoman said.
    Diamond is the latest in a line of U.S. oil companies that
have cut its contract, or posted, prices over the last two days
citing weak oil markets.
 Reuter
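
One thing to watch with a plain `gsub` pattern is that it matches anywhere in the text, including inside longer words. A minimal base R sketch of a stricter variant, using word boundaries and a named lookup vector, could look like this. The helper `fix_words` and the `corrections` vector are hypothetical names introduced for illustration.

```r
# Hypothetical helper: replace whole words only, using a named lookup vector
fix_words <- function(x, corrections) {
  for (wrong in names(corrections)) {
    # \\b anchors the match at word boundaries so substrings are left alone
    x <- gsub(paste0("\\b", wrong, "\\b"), corrections[[wrong]], x, perl = TRUE)
  }
  x
}

corrections <- c(copany = "company", organisation = "organization")

fixed <- fix_words("the copany said the organisation agreed", corrections)
print(fixed)
```

A function like this could be passed to `content_transformer()` in the same way as `gsub` above, though we have not run that combination here.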


In [15]:
# Save your text files in a directory, then read them in with the `tm` function `Corpus`
crude <- Corpus(DirSource("CorpusDataSource"), readerControl = list(language = "en"))
writeLines(as.character(crude[[1]]))

Diamond Shamrock Corp said that
effective today it had cut its contract prices for crude oil by
1.50 dlrs a barrel.
    The reduction brings its posted price for West Texas
Intermediate to 16.00 dlrs a barrel, the copany said.
    "The price reduction today was made in the light of falling
oil product prices and a weak crude oil market," a company
spokeswoman said.
    Diamond is the latest in a line of U.S. oil companies that
have cut its contract, or posted, prices over the last two days
citing weak oil markets.
 Reuter
