# Word clouds

#### July 2016

Work in a Conda environment.

```
$ conda create -n py27 python=2.7 anaconda
$ source activate py27

$ pip install wordcloud
```

Download an English and a French translation of Dostoyevsky's "The Possessed" from [Project Gutenberg](http://www.gutenberg.org/wiki/Main_Page) in plain text format (UTF-8):

```
$ wget http://www.gutenberg.org/ebooks/8117.txt.utf-8 -O ThePossessed.txt
$ wget http://www.gutenberg.org/ebooks/16824.txt.utf-8 -O LesPossedes.txt
```

Generate both word clouds using the example provided in the [GitHub repo](https://github.com/amueller/word_cloud) (with slight modifications).

In [1]:
import io

from PIL import Image
import numpy as np
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

In [2]:
def make_word_cloud(input_text, mask, output_file_name, stopwords=None, extra_stopwords=None,
                    background_color="white", max_words=2000):
    """
    Uses 'word_cloud' to generate a word cloud and write it to a .png file.
    """

    # Read the whole text, decoding it as UTF-8 (the format downloaded above)
    with io.open(input_text, encoding="utf-8") as f:
        text = f.read()
    # Read the mask image into a NumPy array
    mask = np.array(Image.open(mask))

    # Copy into a new set so the caller's collection is left unchanged;
    # fall back to the built-in STOPWORDS when no collection is given
    stopwords = set(STOPWORDS if stopwords is None else stopwords)

    if extra_stopwords is not None:
        stopwords.update(extra_stopwords)

    wc = WordCloud(background_color=background_color, max_words=max_words,
                   mask=mask, stopwords=stopwords)

    # Generate the word cloud
    wc.generate(text)
    # Store it to file
    wc.to_file(output_file_name)


## Stop words

Before running the function to generate the word clouds, it is worth taking a closer look at the _stop word_ collections. A basic step in natural language processing is to remove common, uninformative words (such as "the" or "and") that would otherwise dominate the counts while carrying little or no information.
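To make the effect of stop-word removal concrete, here is a minimal sketch that counts words with and without a stop-word list. The helper `count_words`, the sample sentence, and the short stop-word list are all made up for illustration; this is not how `word_cloud` tokenizes internally, and the simple regular expression splits contractions such as "don't" into two tokens.

```python
import re
from collections import Counter

def count_words(text, stopwords=()):
    """Count lowercase word tokens in `text`, skipping any in `stopwords`."""
    stop = set(w.lower() for w in stopwords)
    # Runs of letters only (no digits, underscores or punctuation)
    tokens = re.findall(r"[^\W\d_]+", text.lower(), flags=re.UNICODE)
    return Counter(t for t in tokens if t not in stop)

# Hypothetical sample text and stop-word list
sample = "The devil is in the details, and the details are in the text."
raw = count_words(sample)
filtered = count_words(sample, stopwords=["the", "is", "in", "and", "are"])

print(raw.most_common(1))       # most frequent token before filtering
print(filtered.most_common(1))  # most frequent token after filtering
```

Without filtering, the most frequent token is a common word; once the stop words are removed, a content word comes out on top.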

The English _stop word_ collection included with `word_cloud` and the collections provided by the `NLTK` package are rather small (see the counts below). While the generated cloud works well for English, the French cloud is still dominated by common words. It is worth getting a more comprehensive collection from a specialized website in order to bring out more informative words.
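One way to use a larger collection is to save it as a plain text file with one word per line and merge it into an existing set. The sketch below assumes that layout; the helper `parse_stopwords`, the file name `french_stopwords.txt`, and the sample words are hypothetical, and an in-memory `io.StringIO` stands in for the downloaded file so the example runs on its own.

```python
import io

def parse_stopwords(lines, base=None):
    """Merge one-word-per-line stop words into a copy of `base`."""
    words = set(base or [])
    for line in lines:
        word = line.strip().lower()
        # Skip blank lines and '#' comment lines
        if word and not word.startswith("#"):
            words.add(word)
    return words

# Stand-in for a downloaded list; a real file could be opened with
# io.open("french_stopwords.txt", encoding="utf-8")  (hypothetical file name)
downloaded = io.StringIO(u"# sample list\nalors\nainsi\nLes\n\n")
french_extended = parse_stopwords(downloaded, base=[u"le", u"les"])
print(sorted(french_extended))
```

The resulting set could then be passed as the `stopwords` argument of `make_word_cloud` in place of the smaller collection.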

In [21]:
# word_cloud's STOPWORDS is English only, so French stop words must come from elsewhere.
# Let's get NLTK's collections (requires a one-time nltk.download('stopwords')).
from nltk.corpus import stopwords
english_sw = stopwords.words('english')
french_sw = stopwords.words('french')
print(len(STOPWORDS))
print(len(english_sw))
print(len(french_sw))

183
153
155


In [4]:
make_word_cloud(input_text="ThePossessed.txt", mask="devil_stencil.jpg",
                output_file_name="ThePossessed.png", stopwords=STOPWORDS)

In [20]:
make_word_cloud(input_text="LesPossedes.txt", mask="devil_stencil.jpg",
                output_file_name="LesPossedes.png", stopwords=french_sw, extra_stopwords=["les"])