In [1]:
from string import punctuation
from collections import Counter
from heapq import nlargest
import pathlib
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

---

This notebook analyzes a text document to give a quick idea of its topic: it extracts the most frequent keywords and the sentences in which those keywords are most concentrated.

----

### Read and tokenize document

In [2]:
"""
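This cell:
    loads a spaCy English model
    reads the document from a file
    runs the document through the model
"""
# The 'en' shortcut was removed in spaCy 3; this assumes en_core_web_sm is installed
nlp = spacy.load("en_core_web_sm")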
file_name = "doc.txt"
doc = nlp(pathlib.Path(file_name).read_text(encoding="utf-8"))

### Preprocessing

In [3]:
"""
This cell:
    skips stop words and punctuation
    keeps only tokens with selected part-of-speech tags
    appends their lemmas to a list
"""

keyword = []
pos_tag = ['PROPN', 'ADJ', 'NOUN', 'VERB']

for token in doc:
    # Skip stop words, punctuation, and the "|" separator found in the source text
    if token.text in STOP_WORDS or token.text in punctuation or token.text == "|":
        continue
    # Keep the lemma of proper nouns, adjectives, nouns, and verbs
    if token.pos_ in pos_tag:
        keyword.append(token.lemma_)

### 20 Most Common Keywords

In [4]:
"""
This cell:
    counts keyword frequencies
    divides each count by the highest count, so the top keyword scores 1.0
    shows the 20 most common keywords with their normalized frequencies
"""
freq_word = Counter(keyword)
max_freq = freq_word.most_common(1)[0][1]  # count of the most frequent keyword

for word in freq_word:
    freq_word[word] = freq_word[word] / max_freq
freq_word.most_common(20)

[('water', 1.0),
 ('cool', 0.28),
 ('site', 0.28),
 ('reuse', 0.24),
 ('wastewater', 0.24),
 ('treatment', 0.24),
 ('clean', 0.2),
 ('facility', 0.2),
 ('plant', 0.16),
 ('process', 0.16),
 ('treat', 0.16),
 ('instal', 0.12),
 ('kao', 0.12),
 ('product', 0.12),
 ('new', 0.12),
 ('production', 0.12),
 ('implement', 0.08),
 ('technology', 0.08),
 ('example', 0.08),
 ('chiller', 0.08)]
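
The normalization above can be illustrated on a tiny list. This is a minimal sketch, not part of the original notebook: the `keyword` list here is hypothetical data standing in for the lemmas collected during preprocessing.

```python
from collections import Counter

# Hypothetical lemma list standing in for the `keyword` list built above
keyword = ["water", "water", "water", "water", "cool", "cool", "site"]

freq_word = Counter(keyword)
max_freq = max(freq_word.values())  # highest raw count in the list

# Divide each raw count by the highest count, so the top keyword scores 1.0
normalized = {word: count / max_freq for word, count in freq_word.items()}
print(normalized)  # {'water': 1.0, 'cool': 0.5, 'site': 0.25}
```

Dividing by the highest count keeps every score between 0 and 1, which makes the scores comparable across documents of different lengths.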

### Top 5 Sentences by Strength

In [5]:
"""
This cell:
    calculates the strength of each sentence by summing its keywords' normalized frequencies
    prints the 5 strongest sentences
"""
sent_strength = {}
for sent in doc.sents:
    for word in sent:
        # freq_word is keyed by lemma, so only tokens whose text equals a keyword lemma contribute
        if word.text in freq_word:
            sent_strength[sent] = sent_strength.get(sent, 0) + freq_word[word.text]
for sent in nlargest(5, sent_strength, key=sent_strength.get):
    print(sent, "\n")

| This concept requires all the water necessary for production processes (cleaning equipment   producing steam etc) to be entirely derived from water recycled in a loop on site  with no water sourced from public water supplies. 

Kao Vietnam Introduced a spray technique for washing and sanitizing tanks  resulting in reducing its use of water and steam Kao Industrial (Thailand) Returns cooling water overflow to a cooling water pool to help eliminate unnecessary water consumption Quimi -Kao  S.A. de C .V. 

The plant has also implemented other water saving technologies that allow it to reuse condensate water in the cooling towers  ultrasonic cleaning in the canteen  and reusing wastewater instead of withdrawing more city water. 

| Several wastewater treatment approaches are being implemented at our site in Maribor   including using treated rainwater to clean our machines and containers  as well as using river water that has been softened to cool down water samples taken from our boiler.
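
The scoring cell matches tokens by their surface text, while the keyword scores are keyed by lemma, so inflected forms such as "cooling" do not count toward a sentence. A possible variation, sketched below under stated assumptions, matches on the lemma instead. This is not the notebook's method: the scores, the sentences, and the `score` helper are all hypothetical, and plain `(text, lemma)` pairs stand in for spaCy tokens so the example runs without a model.

```python
from heapq import nlargest

# Hypothetical normalized keyword scores, keyed by lemma
freq_word = {"water": 1.0, "cool": 0.5, "reuse": 0.25}

# Hypothetical sentences as lists of (text, lemma) pairs, standing in for spaCy tokens
sentences = {
    "The plant reuses cooling water.": [
        ("The", "the"), ("plant", "plant"), ("reuses", "reuse"),
        ("cooling", "cool"), ("water", "water"),
    ],
    "Water is cooled on site.": [
        ("Water", "water"), ("is", "be"), ("cooled", "cool"),
        ("on", "on"), ("site", "site"),
    ],
    "A new chiller was installed.": [
        ("A", "a"), ("new", "new"), ("chiller", "chiller"),
        ("was", "be"), ("installed", "install"),
    ],
}


def score(tokens, use_lemma):
    """Sum keyword scores over the tokens, matching on lemma or on surface text."""
    return sum(freq_word.get(lemma if use_lemma else text, 0.0) for text, lemma in tokens)


by_text = {s: score(t, use_lemma=False) for s, t in sentences.items()}
by_lemma = {s: score(t, use_lemma=True) for s, t in sentences.items()}

# Rank sentences by their lemma-based strength
top_by_lemma = nlargest(2, by_lemma, key=by_lemma.get)
print(by_text)
print(by_lemma)
print(top_by_lemma)
```

In this toy data, text matching credits only the lowercase token "water", while lemma matching also credits "reuses", "cooling", "cooled", and the capitalized "Water". Applying the same change to the notebook would likely alter the ranking shown above.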