## Assignment: Onegram Counter

You probably know about Google Books' __[Ngram Viewer](https://books.google.com/ngrams)__: when you enter phrases into it, it displays a graph showing how often those phrases have occurred in a corpus of books (e.g., "British English", "English Fiction", "French") over the selected years.

Your assignment for this course is something similar: build a Python function that can take the file `data/corpus.txt` (UTF-8 encoded) from this repo as an argument and print a count of the 100 most frequent 1-grams (i.e. *single words*).

In essence the job is to do this:

In [26]:
from collections import Counter
import os

def onegrams(file):
    with open(file, 'r') as corpus:
        text = corpus.read()
        # .casefold() is better than .lower() here
        # https://www.programiz.com/python-programming/methods/string/casefold
        normalize = text.casefold()
        words = normalize.split(' ')
        count = Counter(words) 
        return count

# Path relative to the repo root; os.path.join discards 'data' if the second argument is absolute
ngram_viewer = onegrams(os.path.join('data', 'corpus.txt'))
print(ngram_viewer.most_common(100))

[('the', 11852), ('', 5952), ('of', 5768), ('and', 5264), ('to', 4027), ('a', 3980), ('in', 3548), ('that', 2336), ('his', 2061), ('it', 1517), ('as', 1490), ('i', 1488), ('with', 1460), ('he', 1448), ('is', 1400), ('was', 1393), ('for', 1337), ('but', 1319), ('all', 1148), ('at', 1116), ('this', 1063), ('by', 1042), ('from', 944), ('not', 933), ('be', 863), ('on', 850), ('so', 763), ('you', 718), ('one', 694), ('have', 658), ('had', 647), ('or', 638), ('were', 551), ('they', 547), ('are', 504), ('some', 498), ('my', 484), ('him', 480), ('which', 478), ('their', 478), ('upon', 475), ('an', 473), ('like', 470), ('when', 458), ('whale', 456), ('into', 452), ('now', 437), ('there', 415), ('no', 414), ('what', 413), ('if', 404), ('out', 397), ('up', 380), ('we', 379), ('old', 365), ('would', 350), ('more', 348), ('been', 338), ('over', 324), ('only', 322), ('then', 312), ('its', 307), ('such', 307), ('me', 307), ('other', 301), ('will', 300), ('these', 299), ('down', 270), ('any', 269), ...]

However, there is a twist: you can't use the `collections` library...

Moreover, try to think about what else may be suboptimal in this example. For instance, this code loads all of the text into memory at once (with the `read()` method). What would happen if we tried this on a really big text file?
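One way to avoid holding the whole file in memory is to iterate over the file object itself, which yields one line at a time. The following is only a minimal sketch: the helper name `iter_words` is hypothetical, and it assumes that words are separated by whitespace.

```python
def iter_words(path):
    """Yield casefolded, whitespace-delimited words from a UTF-8 text file."""
    with open(path, 'r', encoding='utf-8') as corpus:
        # Iterating over the file object reads one line at a time,
        # so memory use does not grow with the size of the file.
        for line in corpus:
            for word in line.casefold().split():
                yield word
```

Because `iter_words` is a generator, a caller can loop over it (`for word in iter_words(path): ...`) without ever building a list of all words.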

**Most importantly, the count is also wrong**. Check by counting in an editor, for instance, and try to find out why.
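One likely culprit is `split(' ')`. Splitting on a single space produces an empty string for every repeated space, and because a newline is not a space, the last word of one line stays glued to the first word of the next. The short experiment below, on a made-up sample string, illustrates the difference with `split()` called without arguments. Punctuation attached to words (e.g. `whale,`) is a further possible source of miscounts that this sketch does not address.

```python
sample = "the whale  and the\nsea of the\n"

# Splitting on a single space keeps empty strings and leaves newlines inside tokens.
naive = sample.casefold().split(' ')
print(naive)

# Splitting without arguments treats any run of whitespace as one separator.
clean = sample.casefold().split()
print(clean)
```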

If this is an easy task for you, you can also think about the graphical representation of the 1-gram count.
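As a starting point for a graphical representation, a plain-text bar chart needs nothing beyond the standard library. This is only a sketch: the function name `bar_chart_lines` and its parameters are hypothetical, and a plotting library would be the more usual choice for a real graph.

```python
def bar_chart_lines(counts, top=10, width=50):
    """Return one text line per word, with a bar scaled to the largest count."""
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)[:top]
    if not ranked:
        return []
    largest = ranked[0][1]
    lines = []
    for word, count in ranked:
        # Scale each bar relative to the most frequent word; draw at least one mark.
        bar = '#' * max(1, round(count / largest * width))
        lines.append(f'{word:>10} {bar} {count}')
    return lines

# A few counts copied from the first output above, for illustration only
for line in bar_chart_lines({'the': 11852, 'of': 5768, 'and': 5264, 'to': 4027}, top=4):
    print(line)
```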

In [44]:

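counts = {}

# Iterate over the file object so that only one line is in memory at a time.
with open("/Users/lorenverreyen/Documents/GitHub/InformationScience/course/data/corpus.txt", "r") as corpus:
    for line in corpus:
        normalized = line.casefold()
        # Note: splitting on a single space leaves the trailing '\n' attached
        # to the last word of each line and yields '' for repeated spaces.
        words = normalized.split(" ")
        for word in words:
            # dict.get supplies 0 for words that have not been seen before
            counts[word] = counts.get(word, 0) + 1

# Show the counts sorted from most to least frequent
dict(sorted(counts.items(), key=lambda item: item[1], reverse=True))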



{'the': 12825,
 'of': 6077,
 '': 5983,
 'and': 5663,
 'to': 4248,
 'a': 4128,
 'in': 3774,
 '\n': 3361,
 'that': 2546,
 'his': 2214,
 'it': 1668,
 'i': 1603,
 'as': 1597,
 'with': 1581,
 'but': 1570,
 'he': 1535,
 'the\n': 1478,
 'is': 1470,
 'was': 1459,
 'for': 1427,
 'all': 1236,
 'at': 1197,
 'this': 1167,
 'by': 1105,
 'from': 1022,
 'not': 1012,
 'be': 908,
 'on': 896,
 'so': 816,
 'you': 764,
 'one': 739,
 'have': 702,
 'had': 689,
 'or': 662,
 'and\n': 614,
 'they': 592,
 'were': 591,
 'some': 550,
 'their': 537,
 'are': 534,
 'of\n': 534,
 'which': 524,
 'when': 520,
 'upon': 520,
 'like': 517,
 'him': 512,
 'my': 512,
 'a\n': 507,
 'an': 504,
 'whale': 498,
 'into': 485,
 'now': 477,
 'there': 474,
 'what': 451,
 'no': 448,
 'if': 438,
 'out': 424,
 'we': 407,
 'up': 395,
 'old': 392,
 'more': 390,
 'would': 390,
 'been': 361,
 'then': 347,
 'over': 342,
 'only': 339,
 'these': 337,
 'such': 334,
 'other': 334,
 'its': 330,
 'to\n': 328,
 'will': 327,
 'me': 323,
 'in\n': 302
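
Keys such as `'the\n'`, `'\n'` and `''` in the output above come from splitting on a single space: the newline stays attached to the last word of each line, and repeated spaces produce empty strings. The following is a minimal sketch of one possible fix, not a definitive solution. The function name `count_onegrams` is hypothetical, and it assumes that words are whitespace-delimited and that leading and trailing ASCII punctuation should be stripped (typographic quotes and dashes are not handled).

```python
import string

def count_onegrams(path, top=100):
    """Return the `top` most frequent words in a UTF-8 text file as (word, count) pairs."""
    counts = {}
    with open(path, 'r', encoding='utf-8') as corpus:
        for line in corpus:
            # split() without arguments discards newlines and empty strings
            for token in line.casefold().split():
                # Assumption: "whale," and "whale" should count as the same word
                word = token.strip(string.punctuation)
                if word:
                    counts[word] = counts.get(word, 0) + 1
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)[:top]
```

Calling `count_onegrams(os.path.join('data', 'corpus.txt'))` (after `import os`) would then return at most 100 pairs, none of which contain whitespace.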