# Censorship Analysis using Doc2Vec

## Setup

### Modules

We use `gensim`, since it has a much more readable implementation of Word2Vec (and Doc2Vec). We also use `numpy` for general array manipulation and the `stopwords` corpus from `nltk` for removing insignificant words from sentences.

In [48]:
# gensim modules
from gensim import utils
from gensim.models.doc2vec import LabeledSentence
from gensim.models import Doc2Vec

# numpy
import numpy

# random
import random
from random import shuffle

# stop words from nltk
from nltk.corpus import stopwords
stopset = set(stopwords.words('english'))

### Input Format

TODO: clean the input files by converting everything to lower case and removing punctuation.

The result should be two documents:

- `train-blocked.txt`
- `train-nonblocked.txt`

Each sentence should be formatted as follows:

```
Fang Lizhi was born on 12 February 1936 in Peking
In 1948, one year before the PLA took over the city, as a student of the Beijing No.4 High School, Fang Lizhi joined an underground youth organization that was associated to CCP
One of Fang Lizhi's extracurricular activities was assembling radio receivers from used parts

```

The sample above contains three sentences, each one taking up one entire line. Yes, **each sentence should be on its own line, separated by newlines**. This is extremely important, because our parser depends on it to identify sentences.
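
The cleanup from the TODO above could look roughly like the sketch below. `clean_line` is a hypothetical helper, not part of the notebook, and it assumes that "removing punctuation" means deleting ASCII punctuation outright, so `Lizhi's` becomes `lizhis` and `No.4` becomes `no4`.

```python
import string

# Hypothetical helper: lower-case a line and delete ASCII punctuation.
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def clean_line(line):
    """Return the line lower-cased, without punctuation, single-spaced."""
    stripped = line.lower().translate(_PUNCT_TABLE)
    return ' '.join(stripped.split())

raw_lines = [
    "Fang Lizhi was born on 12 February 1936 in Peking",
    "One of Fang Lizhi's extracurricular activities was assembling radio receivers from used parts",
]

# One cleaned sentence per line, ready to be written to a training file.
for raw in raw_lines:
    print(clean_line(raw))
```

Replacing punctuation with a space instead of deleting it is an equally reasonable choice; it keeps `lizhi` intact at the cost of a stray `s` token.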

### Feeding Data to Doc2Vec

Doc2Vec (the portion of `gensim` that implements the Doc2Vec algorithm) does a great job at word embedding. It takes an iterable, such as our `LabeledIndividualSentence` class below, that yields `LabeledSentence` objects. `LabeledSentence` is a class from `gensim.models.doc2vec` representing a single sentence. Why the word "Labeled"? Well, here's how Doc2Vec differs from Word2Vec.

Word2Vec simply converts a word into a vector.

Doc2Vec not only does that, but also aggregates all the words in a sentence into a vector. To do that, it simply treats a sentence label as a special word and creates a vector for that special word. Hence, that special word is a label for a sentence.

So we have to format sentences into

```python
[['word1', 'word2', 'word3', 'lastword'], ['label1']]
```

`LabeledSentence` is simply a tidier way to do that. It contains a list of words and a label for the sentence. We don't really need to care about how `LabeledSentence` works exactly; we just have to know that it stores those two things: a list of words and a label.
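
To make the "two things" concrete without importing `gensim`, here is a minimal stand-in. `LabeledExample` is a hypothetical name used only for this sketch; the one thing it assumes about the real class is that it exposes `words` and `tags` attributes, which is how the rest of this notebook reads it.

```python
from collections import namedtuple

# Hypothetical stand-in for gensim's LabeledSentence: a list of words plus
# a list of tags (labels). Not the real class, just the same shape.
LabeledExample = namedtuple('LabeledExample', ['words', 'tags'])

example = LabeledExample(
    words=['word1', 'word2', 'word3', 'lastword'],
    tags=['label1'],
)

print(example.words)    # the tokens of the sentence
print(example.tags[0])  # the label attached to the sentence
```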

However, we need a way to convert our newline-separated corpus into a collection of `LabeledSentence`s. The line-based reader that ships with `gensim` (`LabeledLineSentence` in older releases, `TaggedLineDocument` in later ones) can do that for a single text file, but not for multiple files.

So we write our own `LabeledIndividualSentence` class. The constructor takes in a dictionary that maps each file to read to the label prefix that sentences from that file should take on. Then, Doc2Vec can either read the collection directly via the iterator, or we can access the array directly through `to_array`. We also need to shuffle the array of `LabeledSentence`s between training passes; here we do that on a copy of the array rather than inside the class.

In [49]:

class LabeledIndividualSentence(object):
    def __init__(self, sources):
        self.sources = sources

        # make sure that the prefixes (the dictionary values) are unique
        seen = set()
        for prefix in sources.values():
            if prefix in seen:
                raise Exception('Non-unique prefix encountered')
            seen.add(prefix)

    def __iter__(self):
        for source, prefix in self.sources.items():
            with utils.smart_open(source) as fin:
                for item_no, line in enumerate(fin):
                    # LabeledSentence accepts only unicode tokens, so decode
                    # first, then lower-case for better results
                    tokens = utils.to_unicode(line).lower().split()
                    tokens = [w for w in tokens if w not in stopset]  # remove stopwords
                    yield LabeledSentence(tokens, [prefix + '_%s' % item_no])

    def to_array(self):
        # re-read the source files and collect every sentence into a list
        self.sentences = list(self)
        return self.sentences

    def getTag_words(self, words):
        # tag of the first sentence whose filtered tokens, joined by spaces, equal `words`
        return [s for s in self.to_array() if ' '.join(s.words) == words][0].tags[0]

    def getWords_tag(self, tag):
        # filtered text of the first sentence labeled `tag`
        return ' '.join([s for s in self.to_array() if s.tags[0] == tag][0].words)

Now we can feed the data files to `LabeledIndividualSentence`. As we mentioned earlier, `LabeledIndividualSentence` simply takes a dictionary with the file names as keys and the prefixes for sentences from each document as values. The prefixes need to be unique, so that there is no ambiguity between sentences from different documents.

The prefixes will have a line number appended to them to label individual sentences in the documents.
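
A small sketch of that labeling scheme follows. `make_tag` mirrors the string formatting used in the class above; `split_tag` is a hypothetical helper (not in the notebook) for going back from a tag to its prefix, which is handy when checking whether a similar sentence came from the blocked or the non-blocked file. Note that the numbering starts at 0, because it comes from `enumerate`.

```python
def make_tag(prefix, item_no):
    # Same formatting as in the class: prefix, underscore, line number.
    return prefix + '_%s' % item_no

def split_tag(tag):
    # Hypothetical helper: split on the LAST underscore only, because
    # prefixes such as TRAIN_BL contain underscores themselves.
    prefix, item_no = tag.rsplit('_', 1)
    return prefix, int(item_no)

tags = [make_tag('TRAIN_BL', i) for i in range(3)]
print(tags)
print(split_tag('TRAIN_NBL_5'))
```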

In [50]:
sources = {'train-blocked.txt':'TRAIN_BL', 'train-nonblocked.txt':'TRAIN_NBL'}
sentences = LabeledIndividualSentence(sources)
alldocs = sentences.to_array()
doc_list = alldocs[:]  # for reshuffling per pass

## Model

### Building the Vocabulary Table

Doc2Vec requires us to build the vocabulary table (simply digesting all the words, filtering out the unique words, and doing some basic counts on them). So we feed it the array of sentences. `model.build_vocab` takes an array of `LabeledSentence`s, hence our `to_array` function in the `LabeledIndividualSentence` class.
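
Conceptually, the counting part looks something like the sketch below. This is not what `gensim` does internally, only an illustration of "collect the unique words and count them"; `build_vocab_counts` is a hypothetical name, and its `min_count` argument plays the same filtering role as the parameter of that name in the rundown below.

```python
from collections import Counter

def build_vocab_counts(token_lists, min_count=1):
    """Count every token and keep those seen at least min_count times."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    return {tok: n for tok, n in counts.items() if n >= min_count}

docs = [
    ['fang', 'lizhi', 'born', 'peking'],
    ['fang', 'lizhi', 'joined', 'underground', 'youth', 'organization'],
]

print(build_vocab_counts(docs, min_count=1))  # every unique word survives
print(build_vocab_counts(docs, min_count=2))  # only words seen twice survive
```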

The Word2Vec documentation covers the parameters in more detail. Otherwise, here's a quick rundown:

- `min_count`: ignore all words with total frequency lower than this. We set it to 1. In early `gensim` releases the sentence labels lived in the vocabulary and each appears only once, so anything higher dropped them. In releases that expose `model.docvecs` (as used below) the labels are stored separately, and 1 mainly keeps the rare words of this small corpus.
- `window`: the maximum distance between the current and predicted word within a sentence. Word2Vec's skip-gram and CBOW models both use it as the context window size.
- `size`: dimensionality of the feature vectors in output. 100 is a good number. If you're extreme, you can go up to around 400.
- `sample`: threshold for configuring which higher-frequency words are randomly downsampled.
- `negative`: how many "noise words" are drawn per training example when negative sampling is used.
- `workers`: use this many worker threads to train the model.
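
To get a feel for `window`, the sketch below lists, for each word, the neighbors that fall within the window. It is a simplification written for this note (`context_windows` is a hypothetical helper) and ignores details of the real training loop, such as downsampling.

```python
def context_windows(tokens, window):
    """Pair each token with the neighbors at most `window` positions away."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        pairs.append((target, left + right))
    return pairs

tokens = ['fang', 'lizhi', 'born', '12', 'february', '1936', 'peking']
for target, context in context_windows(tokens, window=2):
    print(target, context)
```

With `window=10`, as used below, most of our short sentences fit entirely inside a single window.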

### Training Doc2Vec

Now we train the model. The model is better trained if **in each training interval, the sequence of sentences fed to the model is randomized**. This is the reason for the `shuffle` call before each training pass.

We train it for 20 intervals.

In [53]:
model = Doc2Vec(min_count=1, window=10, size=100, sample=1e-4, negative=5, workers=7)
model.build_vocab(alldocs)
for interval in range(20):
    shuffle(doc_list) # reshuffling for better results
    model.train(doc_list)
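
The reshuffling pattern on its own, without the model, looks like the sketch below. The strings are placeholders standing in for `LabeledSentence` objects, and the fixed seed is there only to make the sketch repeatable. The point is that `shuffle` works in place on the copy, so the original `alldocs` order is never disturbed.

```python
import random

random.seed(0)  # fixed seed only so the sketch is repeatable

# Placeholder items standing in for LabeledSentence objects.
alldocs = ['TRAIN_BL_0', 'TRAIN_BL_1', 'TRAIN_NBL_0', 'TRAIN_NBL_1']
doc_list = alldocs[:]  # shallow copy, as in the cell that builds doc_list

orders = []
for interval in range(3):
    random.shuffle(doc_list)       # shuffles the copy in place
    orders.append(list(doc_list))  # a real loop would train on doc_list here

print(alldocs)  # still in the original order
```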

### Inspecting the Model

Querying the model with a sentence should produce sentence alternatives, i.e. the sentences closest to it in the vector space. One more interesting observation: blocked sentences move down the vector space.

In [63]:
wrd = "fang lizhi"
print("Query word : ",wrd, "\n")
sims = model.docvecs.most_similar(sentences.getTag_words(wrd), topn=4)
print(sims)
tags_sims = [s[0] for s in sims]
for t in tags_sims:
    print(sentences.getWords_tag(t),'\n')


Query word :  fang lizhi 

[('TRAIN_NBL_5', 0.027962500229477882), ('TRAIN_BL_1', -0.03432030230760574), ('TRAIN_NBL_7', -0.1078491061925888), ('TRAIN_NBL_3', -0.13212865591049194)]
fang lizhi born 12 february 1936 peking 

tiananmen square massacre 

one fang lizhi's extracurricular activities assembling radio receivers used parts 

chinese government condemned tiananmen square protests counter-revolutionary riot, largely prohibited discussion remembrance events 



In [62]:
model.most_similar('lizhi', topn=10)

[('one', 0.31396618485450745),
 ('deposed', 0.27445733547210693),
 ('demonstrators', 0.27362653613090515),
 ('initiated', 0.25295835733413696),
 ('june', 0.2520868480205536),
 ('fang', 0.23459796607494354),
 ('protests', 0.2315969616174698),
 ('residents,', 0.22675630450248718),
 ('secretary', 0.2199401557445526),
 ('place', 0.21217812597751617)]
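
As far as we know, the scores in both listings above are cosine similarities, which range from -1 (opposite direction) through 0 (unrelated) to 1 (same direction). The sketch below computes that measure by hand on toy vectors; `cosine_similarity` is written for this note and is not a `gensim` function.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))   # orthogonal
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

Read that way, sentence scores such as 0.028 and -0.034 are close to 0, so the ranking in the sentence query above is weak and probably reflects how little training data we have.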