In [144]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer

In [177]:
raw_words = """ Christina Pagel is professor of operational research and director of the Clinical Operational Research Unit at University College London. She is also member of the Independent SAGE group, providing analysis and advice on the covid-19 pandemic.

In mid-July, British Prime Minister Boris Johnson promised the country a “significant return to normality” by Christmas. But two weeks after Christmas, that is a far cry from reality, as England enters its third national lockdown in an attempt to contain the rapid spread of covid-19.

This week, England admitted the highest number of people with covid-19 to hospitals since the start of the pandemic. Surgeries are being canceled, ambulances are waiting for hours outside hospitals because there are no beds, and health-care staff are exhausted and traumatized. Oxygen infrastructure is already being stressed, as it is in California. Cases keep rising relentlessly, with an overwhelmed National Health Service possible within days.
The new variant found in the U.K. has now been identified in the United States, Denmark, France, Spain, Canada and many other countries. What we are going through might be your future. How did we end up in this desperate situation? And how do we get out?

In September, cases started rising rapidly in England. On Sept. 21, the government’s scientific advisory committee recommended an immediate short lockdown. Instead, authorities spent six weeks tinkering with regional restrictions of differing severity. We entered our second national lockdown in November, with schools and universities remaining open. Cases, hospital admissions and deaths all fell in the second half of November. Yet by early December, cases were rapidly rising in the southeast corner of England and neighboring London, among the least affected English regions in October.
By Dec. 18, the Covid-19 Genomics UK Consortium had identified a new variant of the coronavirus that has since proved to be 40 percent to 70 percent more infectious. On Dec. 22, scientific advisers once again urged immediate restrictions to control its spread, including keeping schools and universities shut in January. But much of England was allowed to mix with up to three households on Christmas Day, and schools for students under 11 were urged to open this week. After schools had been open for just one day, Johnson announced the third lockdown. He has also announced a plan to give nearly 14 million people — people over 70 years old and front-line clinical workers — their first dose of vaccine by mid-February, an exercise that requires us to increase current vaccination rates tenfold.
The hope is that these measures will slow the spread of cases, buying us time to vaccinate the most vulnerable. During the first lockdown in March, cases reduced by a factor of 10. Will it work this time?

The new lockdown should be enough to slow growth, but given the higher infectiousness of the new variant, it is very possible this will be not be enough to actually drive new cases down. Even if cases level off, we cannot cope with such sustained high numbers of people requiring hospital care. For this lockdown to be effective, England must combine severe restrictions and existing behavioral changes — such as social distancing, avoiding indoor spaces, mask-wearing and frequent hand-washing — with the following measures."""

In [178]:
raw_words2 = """ 
Christina Pagel is professor of operational research and director of the Clinical Operational Research Unit at University College London. She is also member of the Independent SAGE group, providing analysis and advice on the covid-19 pandemic.

In mid-July, British Prime Minister Boris Johnson promised the country a “significant return to normality” by Christmas. But two weeks after Christmas, that is a far cry from reality, as England enters its third national lockdown in an attempt to contain the rapid spread of covid-19.

This week, England admitted the highest number of people with covid-19 to hospitals since the start of the pandemic. Surgeries are being canceled, ambulances are waiting for hours outside hospitals because there are no beds, and health-care staff are exhausted and traumatized. Oxygen infrastructure is already being stressed, as it is in California. Cases keep rising relentlessly, with an overwhelmed National Health Service possible within days.
The new variant found in the U.K. has now been identified in the United States, Denmark, France, Spain, Canada and many other countries. What we are going through might be your future. How did we end up in this desperate situation? And how do we get out?

In September, cases started rising rapidly in England. On Sept. 21, the government’s scientific advisory committee recommended an immediate short lockdown. Instead, authorities spent six weeks tinkering with regional restrictions of differing severity. We entered our second national lockdown in November, with schools and universities remaining open. Cases, hospital admissions and deaths all fell in the second half of November. Yet by early December, cases were rapidly rising in the southeast corner of England and neighboring London, among the least affected English regions in October.
By Dec. 18, the Covid-19 Genomics UK Consortium had identified a new variant of the coronavirus that has since proved to be 40 percent to 70 percent more infectious. On Dec. 22, scientific advisers once again urged immediate restrictions to control its spread, including keeping schools and universities shut in January. But much of England was allowed to mix with up to three 


"""

In [72]:
def get_freq(words):
    # Tokenize on runs of word characters, which drops punctuation
    tokenizer = RegexpTokenizer(r'\w+')
    words = tokenizer.tokenize(words)
    
    # Remove single-character tokens
    words = [word for word in words if len(word) > 1]
    
    # Remove numbers
    words = [word for word in words if not word.isnumeric()]

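    # Lowercase all words (the NLTK stopword lists are lowercase too)
    words = [word.lower() for word in words]

    # Remove stopwords. Build the set once, outside the comprehension, so the
    # full list is not rebuilt for every token. Note that stopwords.words()
    # with no argument returns the stopwords of every language NLTK ships;
    # pass 'english' to restrict it to English.
    stop_set = set(stopwords.words())
    words = [word for word in words if word not in stop_set]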
    
    # Stemming words - should this be optional?
    stemmer = nltk.stem.snowball.SnowballStemmer('english')
    words = [stemmer.stem(word) for word in words]

    # Calculate frequency distribution
    fdist = nltk.FreqDist(words)

    # Output top 50 words
    output_list = []
    for word, frequency in fdist.most_common(50):
        print(u'{}:{}'.format(word, frequency))
        output_list.append((word, frequency))
        
    return output_list
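
The same tokenize, filter, lowercase, and count steps can be sketched with the standard library alone, which makes the pipeline easy to try without downloading any NLTK data. This is only an illustration, not a replacement for get_freq: `simple_freq` and `DEMO_STOPWORDS` are hypothetical names, the stopword set is deliberately tiny, and stemming is left out.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword set for illustration only;
# the notebook itself uses NLTK's stopword corpus.
DEMO_STOPWORDS = frozenset({"the", "is", "in", "of", "and", "to", "a"})


def simple_freq(text, stop_set=DEMO_STOPWORDS, top_n=5):
    """Return the top_n (word, count) pairs after basic cleaning."""
    # Tokenize on runs of word characters, which drops punctuation
    tokens = re.findall(r"\w+", text)
    # Drop single-character tokens and pure numbers, then lowercase
    tokens = [t.lower() for t in tokens if len(t) > 1 and not t.isnumeric()]
    # Set membership is a constant-time lookup per token
    tokens = [t for t in tokens if t not in stop_set]
    return Counter(tokens).most_common(top_n)


sample = "Cases in England: cases rose to 19 in the UK, and England is in lockdown."
print(simple_freq(sample))
```

Passing the stopword collection in as a set keeps each membership check cheap, which matters once the input grows beyond a few paragraphs.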

In [74]:
#res = get_freq(raw_words)

In [175]:
def get_freq_chunks(words, chunk_len, min_cnt, stem = False):
    # Tokenize on runs of word characters, which drops punctuation
    tokenizer = RegexpTokenizer(r'\w+')
    words = tokenizer.tokenize(words)
    
    # Remove single-character tokens
    words = [word for word in words if len(word) > 1]
    
    # Remove numbers
    words = [word for word in words if not word.isnumeric()]

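    # Lowercase all words (the NLTK stopword lists are lowercase too)
    words = [word.lower() for word in words]

    # Remove stopwords. Build the set once, outside the comprehension, so the
    # full list is not rebuilt for every token. Note that stopwords.words()
    # with no argument returns the stopwords of every language NLTK ships;
    # pass 'english' to restrict it to English.
    stop_set = set(stopwords.words())
    words = [word for word in words if word not in stop_set]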

    # Stemming words - optional
    if stem:
        print("\nStemming words...")
        stemmer = nltk.stem.snowball.SnowballStemmer('english')
        words = [stemmer.stem(word) for word in words]
    else:
        print("\nNOT stemming words...")
        
    print("\nLength of words: ", len(words))
        
    #Chunk the text
    chunks = [words[i:i + chunk_len] for i in range(0, len(words), chunk_len)]  
    
    ck_output_list = []
    for ck in chunks:
        # Calculate frequency distribution
        #print(ck)
        print("\nLength of this chunk: ", len(ck))
        print("Displaying words with a count greater than: ", min_cnt)
        #print("Displaying top "+ str(display_cnt) +" words")
        fdist = nltk.FreqDist(ck)
        print("\n")
        
        # Output every word in this chunk whose count is strictly greater than min_cnt
        output_list = []
        for word, frequency in fdist.most_common():
            if frequency > min_cnt:
                print(u'{}:{}'.format(word, frequency))
                output_list.append((word, frequency))
        ck_output_list.append(output_list)
        print("\n")
        
    return ck_output_list
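
To see how the chunking and the strict `frequency > min_cnt` threshold interact, here is a minimal sketch that uses `collections.Counter` in place of `nltk.FreqDist` and takes an already-cleaned token list. `chunked_counts` is a hypothetical helper written for this illustration, not part of the notebook.

```python
from collections import Counter


def chunked_counts(tokens, chunk_len, min_cnt):
    """Count words per fixed-size chunk, keeping counts strictly above min_cnt."""
    # Slice into consecutive chunks; the last chunk may be shorter than chunk_len
    chunks = [tokens[i:i + chunk_len] for i in range(0, len(tokens), chunk_len)]
    results = []
    for chunk in chunks:
        counts = Counter(chunk)
        results.append([(w, c) for w, c in counts.most_common() if c > min_cnt])
    return results


tokens = ["cases", "cases", "england", "cases", "england", "lockdown", "cases"]
print(chunked_counts(tokens, chunk_len=3, min_cnt=1))
```

With these seven tokens and `chunk_len=3`, "cases" appears four times overall but clears the threshold only in the first chunk, because counts are taken per chunk rather than over the whole text. When the text is shorter than `chunk_len`, as in the call below with 201 words and a chunk length of 250, there is a single chunk and the counts match the whole-text counts.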

In [182]:
res = get_freq_chunks(raw_words2, 250, 2, stem=False)


NOT stemming words...

Length of words:  201

Length of this chunk:  201
Displaying words with a count greater than:  2


england:5
cases:4
covid:4
rising:3
lockdown:3
national:3




In [188]:
testw = """hand handed hands walked run running"""
res = get_freq_chunks(testw, 100, 0, stem=True)


Stemming words...

Length of words:  6

Length of this chunk:  6
Displaying words with a count greater than:  0


hand:3
run:2
walk:1




In [191]:
stopwords.words()

['og',
 'i',
 'jeg',
 'det',
 'at',
 'en',
 'den',
 'til',
 'er',
 'som',
 'på',
 'de',
 'med',
 'han',
 'af',
 'for',
 'ikke',
 'der',
 'var',
 'mig',
 'sig',
 'men',
 'et',
 'har',
 'om',
 'vi',
 'min',
 'havde',
 'ham',
 'hun',
 'nu',
 'over',
 'da',
 'fra',
 'du',
 'ud',
 'sin',
 'dem',
 'os',
 'op',
 'man',
 'hans',
 'hvor',
 'eller',
 'hvad',
 'skal',
 'selv',
 'her',
 'alle',
 'vil',
 'blev',
 'kunne',
 'ind',
 'når',
 'være',
 'dog',
 'noget',
 'ville',
 'jo',
 'deres',
 'efter',
 'ned',
 'skulle',
 'denne',
 'end',
 'dette',
 'mit',
 'også',
 'under',
 'have',
 'dig',
 'anden',
 'hende',
 'mine',
 'alt',
 'meget',
 'sit',
 'sine',
 'vor',
 'mod',
 'disse',
 'hvis',
 'din',
 'nogle',
 'hos',
 'blive',
 'mange',
 'ad',
 'bliver',
 'hendes',
 'været',
 'thi',
 'jer',
 'sådan',
 'de',
 'en',
 'van',
 'ik',
 'te',
 'dat',
 'die',
 'in',
 'een',
 'hij',
 'het',
 'niet',
 'zijn',
 'is',
 'was',
 'op',
 'aan',
 'met',
 'als',
 'voor',
 'had',
 'er',
 'maar',
 'om',
 'hem',
 'dan',
 'z