# Detecting Vandalism on Wikipedia

Wikipedia is a free online encyclopedia that anyone can edit and contribute to. It is available in many languages and is growing all the time. On the English language version of Wikipedia:

- There are currently 4.7 million pages.
- There have been over 760 million edits (also called revisions) in total over its lifetime.
- There are approximately 130,000 edits per day.

One of the consequences of being editable by anyone is that some people vandalize pages. This can take the form of removing content, adding promotional or inappropriate content, or more subtle shifts that change the meaning of the article. With this many articles and edits per day it is difficult for humans to detect all instances of vandalism and revert (undo) them. As a result, Wikipedia uses bots - computer programs that automatically revert edits that look like vandalism. In this assignment we will attempt to develop a vandalism detector that uses machine learning to distinguish between a valid edit and vandalism.

The data for this problem is based on the revision history of the page Language. Wikipedia provides a history for each page that consists of the state of the page at each revision. Rather than manually considering each revision, a script was run that checked whether edits stayed or were reverted. If a change was eventually reverted then that revision is marked as vandalism. This may result in some misclassifications, but the script performs well enough for our needs.

As a result of this preprocessing, some common processing tasks have already been done, including lower-casing and punctuation removal. The columns in the dataset are:

- Vandal = 1 if this edit was vandalism, 0 if not.
- Minor = 1 if the user marked this edit as a "minor edit", 0 if not.
- Loggedin = 1 if the user made this edit while using a Wikipedia account, 0 if they did not.
- Added = The unique words added.
- Removed = The unique words removed.

Notice the repeated use of the word "unique". The data we have available is not the traditional bag of words; rather, it is the set of words that were removed or added. For example, if a word was removed multiple times in a revision, it will appear only once in the "Removed" column.
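
To make the distinction concrete, here is a minimal sketch in base R. The token vector is made up for illustration, and it assumes the words in a column are separated by single spaces; neither comes from the actual dataset.

```R
# Hypothetical words removed in a single revision (not from wiki.csv)
removedTokens = c("language", "is", "language", "human", "is", "language")

# Traditional bag of words: every occurrence is counted
bagOfWords = table(removedTokens)

# Set of words, as stored in Added/Removed: each distinct word appears once
wordSet = unique(removedTokens)

# Assumed storage format: the distinct words joined by spaces
removedColumn = paste(wordSet, collapse = " ")
```

Here `bagOfWords` records that "language" occurred three times, whereas `removedColumn` keeps only the fact that it occurred.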

## Bags of Words

In [4]:
wiki = read.csv('./dataset/wiki.csv', stringsAsFactors = FALSE)
wiki$Vandal = as.factor(wiki$Vandal)

In [5]:
summary(wiki)

      X.1             X        Vandal       Minor           Loggedin     
 Min.   :   1   Min.   :   1   0:2061   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:1001   1st Qu.:1184   1:1815   1st Qu.:0.0000   1st Qu.:0.0000  
 Median :2016   Median :2318            Median :0.0000   Median :1.0000  
 Mean   :2040   Mean   :2322            Mean   :0.2853   Mean   :0.6659  
 3rd Qu.:3069   3rd Qu.:3467            3rd Qu.:1.0000   3rd Qu.:1.0000  
 Max.   :4135   Max.   :4639            Max.   :1.0000   Max.   :1.0000  
    Added             Removed         
 Length:3876        Length:3876       
 Class :character   Class :character  
 Mode  :character   Mode  :character  
                                      
                                      
                                      

We will now use the bag of words approach to build a model. We have two columns of textual data with different meanings: for example, adding rude words means something different from removing rude words. We'll start like we did in class by building a document term matrix from the Added column. The text is already lowercase and stripped of punctuation, so pre-processing the data requires only the following four steps:

1) Create the corpus for the Added column, and call it "corpusAdded".

2) Remove the English-language stopwords.

3) Stem the words.

4) Build the DocumentTermMatrix, and call it dtmAdded.

In [6]:
library("tm")
library("SnowballC")

Loading required package: NLP



In [19]:
# Create corpus
corpus = VCorpus(VectorSource(wiki$Added))

# Look at corpus
corpus
corpus[[1]]$content

<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 3876

In [8]:
# Remove the English-language Stopwords
corpus = tm_map(corpus, removeWords, stopwords("english"))

In [9]:
# Stem the words
corpus = tm_map(corpus, stemDocument)

In [10]:
dtmAdded = DocumentTermMatrix(corpus)

In [13]:
dtmAdded

<<DocumentTermMatrix (documents: 3876, terms: 6675)>>
Non-/sparse entries: 15368/25856932
Sparsity           : 100%
Maximal term length: 784
Weighting          : term frequency (tf)

In [14]:
sparseAdded = removeSparseTerms(dtmAdded, 0.997)
sparseAdded

<<DocumentTermMatrix (documents: 3876, terms: 166)>>
Non-/sparse entries: 2681/640735
Sparsity           : 100%
Maximal term length: 28
Weighting          : term frequency (tf)
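
The `0.997` passed to `removeSparseTerms` above is a sparsity threshold. As far as we understand tm's behavior, a term is kept only when the number of documents containing it exceeds `nDocs * (1 - sparse)`. The arithmetic below is a small sketch of what that implies for this matrix; the variable names are ours.

```R
nDocs = 3876     # documents reported in the matrix summary
sparse = 0.997   # threshold passed to removeSparseTerms

# Assumed rule: keep a term if its document count is strictly above this cutoff
cutoff = nDocs * (1 - sparse)

# Smallest whole number of documents that clears the cutoff
minDocs = floor(cutoff) + 1
```

Under that assumption, the cutoff is about 11.6, so the remaining terms are those that appear in at least 12 revisions.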

Convert sparseAdded to a data frame called wordsAdded, and then prepend all the words with the letter A, by using the command:
```R
colnames(wordsAdded) = paste("A", colnames(wordsAdded))
```
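
Note that `paste` inserts a space between its arguments by default, so the resulting column names contain a space. The sketch below uses a few illustrative stems to compare this with `paste0` and `make.names`. It is only meant to show the naming behavior, not to suggest that the assignment's command should change.

```R
# A few example stems (for illustration only)
terms = c("accord", "actual", "ago")

# Default separator is a single space: "A accord", "A actual", "A ago"
withSpace = paste("A", terms)

# paste0 joins without a separator: "Aaccord", "Aactual", "Aago"
noSpace = paste0("A", terms)

# make.names turns the spaced names into syntactic ones: "A.accord", ...
syntactic = make.names(withSpace)
```

Names containing spaces must be quoted with backticks when referenced directly, for example in a formula, which is worth keeping in mind when the columns are later used in a model.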

In [16]:
wordsAdded = as.data.frame(as.matrix(sparseAdded))
colnames(wordsAdded) = paste("A", colnames(wordsAdded))

In [17]:
wordsAdded

Unnamed: 0_level_0,A accord,A actual,A ago,A agre,A analog,A appar,A arbitrari,A believ,A biolog,A biologyanalog,...,A utter,A verb,A want,A wide,A will,A work,A write,A writer,A xmlspacepreserveotheruses4th,A year
Unnamed: 0_level_1,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,...,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Now repeat all of the steps we've done so far (create a corpus, remove stop words, stem the document, create a sparse document term matrix, and convert it to a data frame) to create a Removed bag-of-words dataframe, called wordsRemoved, except this time, prepend all of the words with the letter R:
```R
colnames(wordsRemoved) = paste("R", colnames(wordsRemoved))
```
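
Since the same steps are applied to both columns, one option is to wrap them in a small helper. The function below is a sketch only: the name `buildWords` and its `prefix` and `sparse` parameters are our own, and it simply chains the tm calls already used above (stemming also relies on the SnowballC package being installed).

```R
# Hypothetical helper: text column -> sparse bag-of-words data frame
buildWords = function(text, prefix, sparse = 0.997) {
  # Create the corpus from a character vector
  corpus = tm::VCorpus(tm::VectorSource(text))

  # Remove English stopwords, then stem
  corpus = tm::tm_map(corpus, tm::removeWords, tm::stopwords("english"))
  corpus = tm::tm_map(corpus, tm::stemDocument)

  # Build the document term matrix and drop rare terms
  dtm = tm::removeSparseTerms(tm::DocumentTermMatrix(corpus), sparse)

  # Convert to a data frame and prefix the column names
  words = as.data.frame(as.matrix(dtm))
  colnames(words) = paste(prefix, colnames(words))
  words
}
```

With this helper, the Removed data frame could be produced with `buildWords(wiki$Removed, "R")`, and the Added one with `buildWords(wiki$Added, "A")`.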

In [20]:
# Create corpus
corpus = VCorpus(VectorSource(wiki$Removed))

# Look at corpus
corpus
corpus[[1]]$content

<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 3876

In [21]:
# Remove the English-language Stopwords
corpus = tm_map(corpus, removeWords, stopwords("english"))

# Stem the words
corpus = tm_map(corpus, stemDocument)

In [22]:
dtmRemoved = DocumentTermMatrix(corpus)
sparseRemoved = removeSparseTerms(dtmRemoved, 0.997)
wordsRemoved = as.data.frame(as.matrix(sparseRemoved))
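colnames(wordsRemoved) = paste("R", colnames(wordsRemoved))

In [23]:
wordsRemoved

| | R 2000000 | R 40000 | R accord | R actual | R ago | R agre | R analog | R appar | R arbitrari | R believ | ... | R unit | R use | R verb | R want | R wide | R will | R work | R writer | R xmlspacepreserveotheruses4th | R year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | ... | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |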


Combine the two data frames into a data frame called wikiWords with the following line of code:
```R
wikiWords = cbind(wordsAdded, wordsRemoved)
```
The cbind function combines two sets of variables for the same observations into one data frame. Then add the Vandal column (HINT: remember how we added the dependent variable back into our data frame in the Twitter lecture). Set the random seed to 123 and then split the data set using sample.split from the "caTools" package to put 70% in the training set.
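
The steps just described can be sketched on toy data as follows. The `toy*` data frames and values are made up for illustration; in the assignment the inputs would be `wordsAdded`, `wordsRemoved` and `wiki$Vandal`. The split is guarded so that the sketch still runs if caTools is not installed.

```R
# Toy stand-ins for wordsAdded, wordsRemoved and wiki$Vandal (hypothetical)
toyAdded   = data.frame(A_word = c(1, 0, 0, 1, 0, 1, 0, 0, 1, 0))
toyRemoved = data.frame(R_word = c(0, 1, 0, 0, 1, 0, 0, 1, 0, 0))
toyVandal  = factor(c(1, 0, 0, 1, 0, 1, 0, 0, 1, 0))

# Combine the two sets of columns for the same observations
toyWords = cbind(toyAdded, toyRemoved)

# Add the dependent variable back in
toyWords$Vandal = toyVandal

# Split with roughly 70% of each class in the training set
if (requireNamespace("caTools", quietly = TRUE)) {
  set.seed(123)
  spl = caTools::sample.split(toyWords$Vandal, SplitRatio = 0.7)
  toyTrain = subset(toyWords, spl == TRUE)
  toyTest  = subset(toyWords, spl == FALSE)
}
```

`sample.split` is given the outcome column so that both sets keep a similar proportion of vandalism and non-vandalism cases.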