# Moby Dick Challenge

#### Count the number of words in the full text of Moby Dick! 

- How many times does the word `whale` appear? 
- What are the top ten most used words? 
- Save the dictionary of word frequencies as a `json` file
- Bonus: Create a line plot of the ordered word frequencies.

Clean the text, store each unique word as a key in a dictionary and the corresponding word frequency as its value. The resulting data structure should look like this:

```python
{
    'the': 827,
    'python': 34,
    ...
}    
```
---
## Hints

These functions might be useful to look at:

```python
open('myfile')
txt = "hallo"
txt.replace()
txt.split()
file.read()
json.dump()
```

---

## More Hints

1. Read the text file `./data/mobydick.txt` in as a single string (use the `f.read()` method of the file object `f`). Store the string in a variable named `txt`.
2. Convert everything to lowercase.
3. Remove the line breaks `\n` (Hint: Use the `txt.replace(old, new)` function to replace substrings in `txt` with `' '`). Are there any other characters that you could clean from the text?
4. Split the cleaned text using whitespace as separator (look at the `txt.split()` function). You get a list of single words.
5. Create an empty dictionary. Loop over the list of words:
    - Check if the word is already in the dictionary.
        - If `yes`, increase the counter for this word by `1`.
        - If `no`, add it as a key to the dictionary and assign it the value `1`.
6. To save a `json` file look here: https://stackabuse.com/reading-and-writing-json-to-a-file-in-python/
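
Steps 2 to 5 can be sketched on a short sample string as follows. The function name `count_words` and the sample sentence are illustrative assumptions, not part of the challenge. Punctuation is removed as one possible answer to the question in step 3, and a newly seen word starts with a count of `1` so that its first occurrence is counted.

```python
import string

def count_words(txt):
    """Return a dict mapping each cleaned word in txt to its frequency."""
    # Step 2: lowercase; step 3: turn line breaks into spaces
    txt = txt.lower().replace('\n', ' ')
    # Also drop punctuation so that 'whale,' and 'whale' count as one word
    txt = txt.translate(str.maketrans('', '', string.punctuation))
    # Step 4: split on whitespace
    words = txt.split()
    # Step 5: count with a plain dictionary
    freq = {}
    for word in words:
        if word in freq:
            freq[word] += 1
        else:
            freq[word] = 1
    return freq

sample = "The whale!\nThe white whale, the WHALE."
print(count_words(sample))
```

For the real task, `sample` would be replaced by the string read from `./data/mobydick.txt`.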

In [156]:
filename = './data/mobydick.txt'
# read the whole book into a single string; the file is closed automatically
with open(filename, 'rt') as file:
    text = file.read()
# replace line breaks with spaces and keep the result
text = text.replace('\n', ' ')
# split into words by white space
words = text.split()
# convert to lower case
words = [word.lower() for word in words]
print(words[:100])

['the', 'project', 'gutenberg', 'ebook', 'of', 'moby', 'dick;', 'or', 'the', 'whale,', 'by', 'herman', 'melville', 'this', 'ebook', 'is', 'for', 'the', 'use', 'of', 'anyone', 'anywhere', 'at', 'no', 'cost', 'and', 'with', 'almost', 'no', 'restrictions', 'whatsoever.', 'you', 'may', 'copy', 'it,', 'give', 'it', 'away', 'or', 're-use', 'it', 'under', 'the', 'terms', 'of', 'the', 'project', 'gutenberg', 'license', 'included', 'with', 'this', 'ebook', 'or', 'online', 'at', 'www.gutenberg.org', 'title:', 'moby', 'dick;', 'or', 'the', 'whale', 'author:', 'herman', 'melville', 'release', 'date:', 'december', '25,', '2008', '[ebook', '#2701]', 'last', 'updated:', 'december', '3,', '2017', 'language:', 'english', 'character', 'set', 'encoding:', 'utf-8', '***', 'start', 'of', 'this', 'project', 'gutenberg', 'ebook', 'moby', 'dick;', 'or', 'the', 'whale', '***', 'produced', 'by', 'daniel']


In [157]:
import string
table = str.maketrans('', '', string.punctuation)
words = [w.translate(table) for w in words]
print(words[:100])

['the', 'project', 'gutenberg', 'ebook', 'of', 'moby', 'dick', 'or', 'the', 'whale', 'by', 'herman', 'melville', 'this', 'ebook', 'is', 'for', 'the', 'use', 'of', 'anyone', 'anywhere', 'at', 'no', 'cost', 'and', 'with', 'almost', 'no', 'restrictions', 'whatsoever', 'you', 'may', 'copy', 'it', 'give', 'it', 'away', 'or', 'reuse', 'it', 'under', 'the', 'terms', 'of', 'the', 'project', 'gutenberg', 'license', 'included', 'with', 'this', 'ebook', 'or', 'online', 'at', 'wwwgutenbergorg', 'title', 'moby', 'dick', 'or', 'the', 'whale', 'author', 'herman', 'melville', 'release', 'date', 'december', '25', '2008', 'ebook', '2701', 'last', 'updated', 'december', '3', '2017', 'language', 'english', 'character', 'set', 'encoding', 'utf8', '', 'start', 'of', 'this', 'project', 'gutenberg', 'ebook', 'moby', 'dick', 'or', 'the', 'whale', '', 'produced', 'by', 'daniel']


In [158]:
import collections

def top10_words(words):
    """Return the ten most frequent words together with their counts."""
    counts = collections.Counter(words)
    return counts.most_common(10)

print('The top 10 words: ', top10_words(words))

The top 10 words:  [('the', 14524), ('of', 6708), ('and', 6410), ('a', 4668), ('to', 4652), ('in', 4187), ('that', 2918), ('his', 2516), ('it', 2323), ('i', 1845)]


In [159]:
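# NOTE: str.strip() treats this string as a set of characters. The placeholder
# text "(and so on)" is included literally, so the letters a, n, d, s, o, the
# space and the parentheses are also stripped from both ends of every word
# (for example 'of' -> 'f'). This is why keys in the results look truncated.
unwanted_chars = ".,-_ (and so on)"
wordfreq = {}
for word in words:
    word = word.strip(unwanted_chars)
    # first occurrence: create the key, then count it
    if word not in wordfreq:
        wordfreq[word] = 0
    wordfreq[word] += 1
print(wordfreq)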
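
`str.strip(chars)` removes any of the listed characters from both ends of a string; it does not match the argument as a phrase. The sketch below compares the `unwanted_chars` string from the cell above with a punctuation-only set. Using `string.punctuation` as the definition of unwanted characters is an assumption, and the variable names and sample words are illustrative.

```python
import string

placeholder_chars = ".,-_ (and so on)"   # also contains the letters a, n, d, s, o
punctuation_only = string.punctuation    # punctuation characters only

for word in ["of", "dick;", "herman", "whales,"]:
    with_placeholder = word.strip(placeholder_chars)
    with_punctuation = word.strip(punctuation_only)
    print(repr(word), '->', repr(with_placeholder), 'vs', repr(with_punctuation))
```

With the placeholder string, `'of'` becomes `'f'` and `'herman'` becomes `'herm'`, whereas the punctuation-only set leaves the letters intact.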



In [172]:
import pandas as pd

df = pd.DataFrame(list(wordfreq.items()), columns = ['key','value'])
df

| | key | value |
|---|---|---|
| 0 | the | 15127 |
| 1 | project | 88 |
| 2 | gutenberg | 24 |
| 3 | ebook | 17 |
| 4 | f | 6733 |
| ... | ... | ... |
| 17971 | riginator | 1 |
| 17972 | confirme | 1 |
| 17973 | pg | 1 |
| 17974 | httpwwwgutenbergorg | 1 |


In [196]:
# index by word so that the resulting dictionary maps word -> count
df.set_index('key')['value'].to_dict()

{'the': 15127,
 'project': 88,
 'gutenberg': 24,
 'ebook': 17,
 'f': 6733,
 'moby': 82,
 'ick': 81,
 'r': 1114,
 'whale': 1194,
 'by': 1216,
 'herm': 4,
 'melville': 4,
 'thi': 1416,
 'i': 8329,
 'for': 1614,
 'use': 106,
 'yone': 6,
 'ywhere': 16,
 't': 7621,
 '': 16671,
 'cost': 6,
 'with': 1764,
 'lmost': 195,
 'restricti': 2,
 'whatsoever': 7,
 'you': 914,
 'may': 247,
 'copy': 19,
 'it': 2722,
 'give': 133,
 'way': 444,
 'reuse': 2,
 'under': 121,
 'term': 42,
 'license': 19,
 'include': 19,
 'line': 156,
 'wwwgutenbergorg': 3,
 'title': 6,
 'uthor': 13,
 'release': 1,
 'te': 86,
 'ecember': 6,
 '25': 3,
 '2008': 1,
 '2701': 1,
 'last': 274,
 'update': 2,
 '3': 11,
 '2017': 1,
 'language': 7,
 'english': 48,
 'character': 16,
 'et': 99,
 'encoding': 1,
 'utf8': 1,
 'tart': 38,
 'produce': 16,
 'iel': 4,
 'lazaru': 7,
 'jonesey': 2,
 'vi': 7,
 'widger': 2,
 'mobydick': 2,
 'content': 26,
 'etymology': 2,
 'extract': 8,
 'upplie': 16,
 'ubsublibrari': 2,
 'chapter': 314,
 '1': 7,
 ...

In [168]:
# Turn two parallel lists into a dict:
# dict(zip(test_keys, test_values))
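
A minimal sketch of the `dict(zip(...))` idea noted above, using two short hypothetical lists (the names `keys` and `values` and their contents are made up for illustration):

```python
keys = ['the', 'whale', 'white']   # hypothetical example words
values = [3, 3, 1]                 # their counts, in the same order

# zip() pairs the items position by position; dict() turns the pairs into entries
freq = dict(zip(keys, values))
print(freq)
```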

In [169]:
print('whale appeared: ', wordfreq.get('whale'))

whale appeared:  1194


In [194]:

word = input().strip().lower()
try:
    # a direct lookup raises KeyError for words that are not in the dictionary
    print(wordfreq[word])
except KeyError:
    print(f'"{word}" was not found')

you
914


In [163]:
wordfreq['whale']

1194

In [133]:
import json
with open('./data/mobydick.json', 'w') as json_file:
    json.dump(wordfreq, json_file)
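
To check that a saved file can be read back unchanged, a round trip along the following lines may help. This is a sketch under assumptions: the small `freq` dictionary stands in for `wordfreq`, and a temporary directory is used so that nothing under `./data` is touched.

```python
import json
import os
import tempfile

freq = {'the': 3, 'whale': 3, 'white': 1}   # small stand-in for wordfreq

# Write to a temporary directory so the sketch does not touch ./data
path = os.path.join(tempfile.mkdtemp(), 'freq.json')
with open(path, 'w', encoding='utf-8') as json_file:
    json.dump(freq, json_file, indent=2)

# Read the file back and confirm that nothing was lost
with open(path, encoding='utf-8') as json_file:
    loaded = json.load(json_file)

print(loaded == freq)
```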


In [207]:
a = 'this is an example of a strings and just for fun! there is a list of a numner a good number there is not a a good number of a list'
a = a.split()
count = dict()
for name in a:
    # get() returns 0 for a word not seen before, so no membership test is needed
    count[name] = count.get(name, 0) + 1
print(a)

print(count)

['this', 'is', 'an', 'example', 'of', 'a', 'strings', 'and', 'just', 'for', 'fun!', 'there', 'is', 'a', 'list', 'of', 'a', 'numner', 'a', 'good', 'number', 'there', 'is', 'not', 'a', 'a', 'good', 'number', 'of', 'a', 'list']
{'this': 1, 'is': 3, 'an': 1, 'example': 1, 'of': 3, 'a': 7, 'strings': 1, 'and': 1, 'just': 1, 'for': 1, 'fun!': 1, 'there': 2, 'list': 2, 'numner': 1, 'good': 2, 'number': 2, 'not': 1}


In [208]:
count['is']

3

In [209]:
w = input()
try:
    j = count[w]
    print(j)
except KeyError:
    # only a missing key is expected here, so catch nothing broader
    print('try another word')

fun!
1
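
For the bonus task, the ordered word frequencies can be obtained by sorting the dictionary values. The sketch below is one possible approach, not part of the notebook above: the helper names `ordered_frequencies` and `plot_frequencies` are hypothetical, the small `freq` dictionary stands in for `wordfreq`, and the plot assumes that matplotlib is installed.

```python
def ordered_frequencies(freq):
    """Return the counts in freq sorted from most to least frequent."""
    return sorted(freq.values(), reverse=True)

def plot_frequencies(counts):
    """Draw a line plot of counts against their rank (needs matplotlib)."""
    import matplotlib.pyplot as plt   # imported here so the rest runs without it
    plt.plot(range(1, len(counts) + 1), counts)
    plt.xlabel('rank')
    plt.ylabel('frequency')
    plt.yscale('log')   # word counts span several orders of magnitude
    plt.show()

freq = {'the': 7, 'whale': 3, 'white': 1, 'sea': 3}   # stand-in for wordfreq
counts = ordered_frequencies(freq)
print(counts)
```

Calling `plot_frequencies(ordered_frequencies(wordfreq))` would then draw the line plot for the full text.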
