# Word Count
You've likely seen word-clouds before, if not, please check [here](https://www.google.com/search?site=&tbm=isch&source=hp&biw=1536&bih=799&q=word+cloud&oq=word+cloud&gs_l=img.3..0l10.981.2160.0.2280.11.11.0.0.0.0.95.704.9.9.0....0...1.1.64.img..2.9.695.0.NtfMDYloQTw) for examples. In order to create word clouds, the software finds the most frequently occurring words in a text file. Our mini-programming assignment will ask you to do just that. We'll use the text of the famous novel by Charles Dickens, A Tale of Two Cities, in our example, but you can use any text you'd like.
## Part 1: Writing and running python code
1. Make sure you have the environment for the course already setup.  If not, please see the instructions at the end of Week 1.

2. Feel free to create a python program using a text editor (nano) or you can do all the work in python shell.  Or, if you already have some experience in Jupyter, feel free to do your work there instead. We don't want to be prescriptive here, however you'd like to get started programming is fine. We'll be working in Jupyter from this week on, so this assignment is just to get a little practice in python programming.

## Part 2: Grab source files
[Download the source files here.](https://prod-edxapp.edx-cdn.org/assets/courseware/v1/b373c297be36d3ac1cc9452cfa3807e8/asset-v1:UCSanDiegoX+DSE200x+1T2018+type@asset+block/word_cloud.zip)
Included in the source files are:

1. word_cloud.py <- Starter file if you wish to use it

2. 98-0.txt <- Tale of Two Cities, by Charles Dickens. Credit to [Project Gutenberg](https://www.gutenberg.org/).

3. stopwords <- common words to exclude. Credit to [Andreas Mueller](https://github.com/amueller/word_cloud/). 

Note that we could use the nltk stopwords instead of those provided. You should feel free to do so if you wish.

## Part 3: Word Count
To complete this assignment, you will want to read and clean the input, then count the frequencies of each word. Remember that the data science process involves some pre-processing, then consists of some analysis itself. <br>

Optionally, you can also filter out common words (“the”, “this”, “and”, etc.) by excluding words which appear in the stopwords file.

Overall, your approach will be:

- Create a data structure to store the words and the number of occurrences of the word.

- Read in each word from the file, making it lower case and removing punctuation. (Optionally, skip common words).

          For each remaining word, add the word to the data structure or update your count for the word

- Extract the top ten most frequently occurring words from your data structure and print them, along with their frequencies.

### Checking your solution:
You will get different counts on words depending on what punctuation you remove, what stop words you use, etc.  So don't worry too much about getting the exact count we have.  But if you want to see what we found, here are two examples:<br>

Without using stop words and removing the punctuation (. , " “ ), the top 10 most common words should be:<br>

the : 8177 <br>
and : 4984 <br>
of : 4122 <br>
to : 3536 <br>
a : 2976 <br>
in : 2612 <br>
his : 1998 <br>
it : 1879 <br>
i : 1872 <br>
that : 1861 <br>

Using the stop words and removing the punctuation (. , " “ ), the top 10 most common words should be:

said : 642 <br>
mr : 616 <br>
one : 420 <br>
lorry : 313 <br>
will : 290 <br>
upon : 289 <br>
little : 264 <br>
man : 259 <br>
defarge : 259 <br>
time : 236 <br>

*Note, at least "said" and "mr" seem to be common words. Feel free to add more to your stopwords file if you wish to get to less common words.

## Hints
1. **Which Data Structure?** If you aren't sure which data structure to use, remember that we discussed a data structure this week that gives us a key and a value at that key (dictionaries). This could be really useful here.

2.  **Stripping off Punctuation.**  The command "replace" on a string will replace one letter with another.  For example:

          word = word.replace(".","")

          Will remove any periods from the word.

3.  **Sorting the data structure.**  If you used an unordered data structure like a dictionary, you might need get the values out of it (into a list) to sort it.  You could also use "collections.Counter" to help with this step.

In [11]:
# importing the libraries
import collections

In [27]:
# importing the file
file = open('Assignment Files/Word Cloud/98-0.txt',
            encoding='utf-8')
wordCount = {}
for word in file.read().lower().split():
    word = word.replace(".","")
    word = word.replace(",","")
    word = word.replace("\"","")
    word = word.replace('"',"")
    word = word.replace("“","")
    word = word.replace("”","")
    if word not in wordCount:
        wordCount[word] = 1
    else:
        wordCount[word] += 1
d = collections.Counter(wordCount)
for word, count in d.most_common(10):
    print(word,":",count)    


the : 8177
and : 4984
of : 4125
to : 3536
a : 2976
in : 2617
his : 1999
it : 1928
i : 1876
that : 1876


In [31]:
# importing the file
file = open('Assignment Files/Word Cloud/98-0.txt',
            encoding='utf-8')
stopwords = open('Assignment Files/Word Cloud/stopwords'\
                 ,encoding='utf-8')
stopwords = set(stopwords.read().strip())

wordCount = {}
for word in file.read().lower().split():
    word = word.replace(".","")
    word = word.replace(",","")
    word = word.replace("\"","")
    word = word.replace('"',"")
    word = word.replace("“","")
    word = word.replace("”","")
    if word not in stopwords:
        if word not in wordCount:
            wordCount[word] = 1
        else:
            wordCount[word] += 1
d = collections.Counter(wordCount)
for word, count in d.most_common(10):
    print(word,":",count)    



the : 8177
and : 4984
of : 4125
to : 3536
in : 2617
his : 1999
it : 1928
that : 1876
he : 1809
was : 1754


In [33]:
stopwords = open('Assignment Files/Word Cloud/stopwords'\
                 ,encoding='utf-8')

In [36]:
stopwords.read()

''