In [2]:
pip install nltk

Note: you may need to restart the kernel to use updated packages.


In [3]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [5]:
#Here we enter the raw or the original article
DOCUMENT = """
China net cafe culture crackdown

Chinese authorities closed 12,575 net cafes in the closing months of 2004, the country's government said.

According to the official news agency most of the net cafes were closed down because they were operating illegally. Chinese net cafes operate under a set of strict guidelines and
many of those most recently closed broke rules that limit how close they can be to schools. The move is the latest in a series of steps the Chinese government has taken to crack
down on what it considers to be immoral net use.

The official Xinhua News Agency said the crackdown was carried out to create a "safer environment for young people in China". Rules introduced in 2002 demand that net cafes be at
least 200 metres away from middle and elementary schools. The hours that children can use net cafes are also tightly regulated. China has long been worried that net cafes are an
unhealthy influence on young people. The 12,575 cafes were shut in the three months from October to December. China also tries to dictate the types of computer games people can
play to limit the amount of violence people are exposed to.

Net cafes are hugely popular in China because the relatively high cost of computer hardware means that few people have PCs in their homes. This is not the first time that the
Chinese government has moved against net cafes that are not operating within its strict guidelines. All the 100,000 or so net cafes in the country are required to use software
that controls what websites users can see. Logs of sites people visit are also kept. Laws on net cafe opening hours and who can use them were introduced in 2002 following a fire
at one cafe that killed 25 people. During the crackdown following the blaze authorities moved to clean up net cafes and demanded that all of them get permits to operate. In August
2004 Chinese authorities shut down 700 websites and arrested 224 people in a crackdown on net porn. At the same time it introduced new controls to block overseas sex sites. The
Reporters Without Borders group said in a report that Chinese government technologies for e-mail interception and net censorship are among the most highly developed in the world.

""";

print("The Original Document:",DOCUMENT)
print(' ')
L1=len(DOCUMENT)
c0=0
c1=0
for i in range (0,L1):
    if(DOCUMENT[i]==' '):
        c1=c1+1
    elif(DOCUMENT[i]=='.'):
        c0=c0+1
print("Length of original document=",c1," words")
print("The number of sentences in the original article=",c0,' ')

The Original Document: 
China net cafe culture crackdown

Chinese authorities closed 12,575 net cafes in the closing months of 2004, the country's government said.

According to the official news agency most of the net cafes were closed down because they were operating illegally. Chinese net cafes operate under a set of strict guidelines and
many of those most recently closed broke rules that limit how close they can be to schools. The move is the latest in a series of steps the Chinese government has taken to crack
down on what it considers to be immoral net use.

The official Xinhua News Agency said the crackdown was carried out to create a "safer environment for young people in China". Rules introduced in 2002 demand that net cafes be at
least 200 metres away from middle and elementary schools. The hours that children can use net cafes are also tightly regulated. China has long been worried that net cafes are an
unhealthy influence on young people. The 12,575 cafes were shut in the 

In [8]:
import re #re is abuilt-in package which can be used to work with Regular Expressions. This module provides regular expression matching operations.
DOCUMENT = re.sub(r'\n|\r', ' ', DOCUMENT)#removal of the line breaks or paragraph seperators
DOCUMENT = re.sub(r' +', ' ', DOCUMENT)
DOCUMENT = DOCUMENT.strip()#removes spaces at the begining and end of the input article
sentences = nltk.sent_tokenize(DOCUMENT)#forms an array of sentences. These sentences arre those which are present in the input article seperated by a '.'


In [10]:
import numpy as np
stop_words = nltk.corpus.stopwords.words('english')#returns the array of designated stopwords defined by nltk

def normalize_document(doc):
    
    doc = re.sub(r'[^a-zA-Z\s]', '', doc, re.I|re.A)#removes special characters (mainly the puncuation marks)
    doc = doc.lower()#entire document is converted into lower case
    doc = doc.strip()#removes whitespace from the beginning/end of the document
    tokens = nltk.word_tokenize(doc)#tokenize the document. Returns the array of all the words present in the input article 
    filtered_tokens = [token for token in tokens if token not in stop_words]#filtering of the stopwords from the array of tokenized document
    doc = ' '.join(filtered_tokens)# re-create document from filtered tokens
    return doc
normalize_corpus = np.vectorize(normalize_document)#normalization of the document

norm_sentences = normalize_corpus(sentences)
print("The processed text:",' ')
print(norm_sentences)

The processed text:  
['china net cafe culture crackdown chinese authorities closed net cafes closing months countrys government said'
 'according official news agency net cafes closed operating illegally'
 'chinese net cafes operate set strict guidelines many recently closed broke rules limit close schools'
 'move latest series steps chinese government taken crack considers immoral net use'
 'official xinhua news agency said crackdown carried create safer environment young people china'
 'rules introduced demand net cafes least metres away middle elementary schools'
 'hours children use net cafes also tightly regulated'
 'china long worried net cafes unhealthy influence young people'
 'cafes shut three months october december'
 'china also tries dictate types computer games people play limit amount violence people exposed'
 'net cafes hugely popular china relatively high cost computer hardware means people pcs homes'
 'first time chinese government moved net cafes operating within str

In [13]:
#TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer#Used for geerating TF-IDF scores
import pandas as pd
tv = TfidfVectorizer(min_df=0., max_df=1., use_idf=True)#Convert a collection of sentences in the normalized document to a matrix of TF-IDF features
dt_matrix = tv.fit_transform(norm_sentences)
dt_matrix = dt_matrix.toarray()
vocab = tv.get_feature_names()#Array of words (in the normalized document) sorted on the basis of their respective TF-IDF features
td_matrix = dt_matrix.T#Transpose of matrix dt_matrix
td_matrix.shape
m=int(input("Enter the value of 'm' ( any number between 10-20 ):"))
print("The selection matrix on vectorization for top 'm' words:,':",' ')
print(pd.DataFrame(np.round(td_matrix, 2), index=vocab).head(m))

Enter the value of 'm':10
The selection matrix on vectorization for top 'm' words:,':  
               0     1    2    3     4     5     6    7    8     9    10   11  \
according    0.00  0.41  0.0  0.0  0.00  0.00  0.00  0.0  0.0  0.00  0.0  0.0   
agency       0.00  0.36  0.0  0.0  0.28  0.00  0.00  0.0  0.0  0.00  0.0  0.0   
also         0.00  0.00  0.0  0.0  0.00  0.00  0.34  0.0  0.0  0.23  0.0  0.0   
among        0.00  0.00  0.0  0.0  0.00  0.00  0.00  0.0  0.0  0.00  0.0  0.0   
amount       0.00  0.00  0.0  0.0  0.00  0.00  0.00  0.0  0.0  0.29  0.0  0.0   
arrested     0.00  0.00  0.0  0.0  0.00  0.00  0.00  0.0  0.0  0.00  0.0  0.0   
august       0.00  0.00  0.0  0.0  0.00  0.00  0.00  0.0  0.0  0.00  0.0  0.0   
authorities  0.26  0.00  0.0  0.0  0.00  0.00  0.00  0.0  0.0  0.00  0.0  0.0   
away         0.00  0.00  0.0  0.0  0.00  0.34  0.00  0.0  0.0  0.00  0.0  0.0   
blaze        0.00  0.00  0.0  0.0  0.00  0.00  0.00  0.0  0.0  0.00  0.0  0.0   

              12    

In [16]:
#Latent Semantic Analysis
from scipy.sparse.linalg import svds   
def low_rank_svd(matrix, singular_count=2):
    u, s, vt = svds(matrix, k=singular_count)
    return u, s, vt
n=int(input("Enter the number of segments for summary (Enter 3 or 5 as an input):"))
num_sentences = int(input("Enter the number of sentences you require in the summary:"))
num_topics = n
u, s, vt = low_rank_svd(td_matrix, singular_count=num_topics)  
u.shape, s.shape, vt.shape
term_topic_mat, singular_values, topic_document_mat = u, s, vt
sv_threshold = 0.5
min_sigma_value = max(singular_values) * sv_threshold
singular_values[singular_values < min_sigma_value] = 0
salience_scores = np.sqrt(np.dot(np.square(singular_values), np.square(topic_document_mat)))
print("Saline scores:",' ')
print(salience_scores)

Enter the number of segments for summary (Enter 3 or 5 as an input):3
Enter the number of sentences you require in the summary:8
Saline scores:  
[0.65271468 0.56394579 0.53252843 0.38211541 0.72249958 0.36200433
 0.56475884 0.55658241 0.23711151 0.60158722 0.456053   0.60475891
 0.44482743 0.58003323 0.5063153  0.42602088 0.44140093 0.42300557
 0.30244031]


In [18]:
top_sentence_indices = (-salience_scores).argsort()[:num_sentences]
print(top_sentence_indices.sort())
similarity_matrix = np.matmul(dt_matrix, dt_matrix.T)
similarity_matrix.shape
print("The similarity matrix:",' ')
print(np.round(similarity_matrix, 3))

None
The similarity matrix:  
[[1.    0.153 0.152 0.13  0.167 0.058 0.074 0.131 0.149 0.043 0.096 0.176
  0.062 0.    0.182 0.185 0.238 0.    0.159]
 [0.153 1.    0.121 0.02  0.296 0.05  0.064 0.06  0.039 0.    0.044 0.172
  0.054 0.    0.018 0.049 0.024 0.    0.016]
 [0.152 0.121 1.    0.052 0.    0.196 0.047 0.045 0.029 0.067 0.033 0.258
  0.04  0.    0.014 0.115 0.063 0.    0.042]
 [0.13  0.02  0.052 1.    0.    0.017 0.095 0.02  0.    0.    0.015 0.128
  0.08  0.    0.065 0.017 0.068 0.    0.089]
 [0.167 0.296 0.    0.    1.    0.    0.    0.202 0.    0.102 0.075 0.
  0.    0.049 0.032 0.056 0.106 0.    0.052]
 [0.058 0.05  0.196 0.017 0.    1.    0.054 0.051 0.033 0.    0.037 0.046
  0.045 0.    0.079 0.042 0.02  0.081 0.014]
 [0.074 0.064 0.047 0.095 0.    0.054 1.    0.065 0.042 0.079 0.048 0.059
  0.141 0.125 0.187 0.053 0.026 0.    0.017]
 [0.131 0.06  0.045 0.02  0.202 0.051 0.065 1.    0.039 0.133 0.142 0.056
  0.055 0.064 0.06  0.05  0.078 0.    0.016]
 [0.149 0.039 0.029 0

In [19]:
#text rank sorting an the final summary 
import networkx
similarity_graph = networkx.from_numpy_array(similarity_matrix)
similarity_graph
scores = networkx.pagerank(similarity_graph)
ranked_sentences = sorted(((score, index) for index, score in scores.items()), reverse=True)
ranked_sentences[:10]
top_sentence_indices = [ranked_sentences[index][1] for index in range(num_sentences)]
top_sentence_indices.sort()

print(' ')
print(' ')
print("The summary:",' ')
print(' ')
final_summary='\n'.join(np.array(sentences)[top_sentence_indices])
print(final_summary)
print(' ')
L2=len(final_summary)
c2=0
for i in range (0,L2):
    if(DOCUMENT[i]==' '):
        c2=c2+1
print("Length of summary=",c2," words")

 
 
The summary:  
 
China net cafe culture crackdown Chinese authorities closed 12,575 net cafes in the closing months of 2004, the country's government said.
According to the official news agency most of the net cafes were closed down because they were operating illegally.
Chinese net cafes operate under a set of strict guidelines and many of those most recently closed broke rules that limit how close they can be to schools.
The hours that children can use net cafes are also tightly regulated.
China has long been worried that net cafes are an unhealthy influence on young people.
This is not the first time that the Chinese government has moved against net cafes that are not operating within its strict guidelines.
Laws on net cafe opening hours and who can use them were introduced in 2002 following a fire at one cafe that killed 25 people.
In August 2004 Chinese authorities shut down 700 websites and arrested 224 people in a crackdown on net porn.
 
Length of summary= 162  words


In [22]:
#Rouge score calculation. For Rouge score calculation, the human generated summary is needed for comparison with the proposed architecture generated one.

!git clone https://github.com/pltrdy/rouge
%cd rouge
!python setup.py install 

Cloning into 'rouge'...
remote: Enumerating objects: 269, done.[K
remote: Counting objects: 100% (22/22), done.[K
remote: Compressing objects: 100% (17/17), done.[K
remote: Total 269 (delta 7), reused 15 (delta 5), pack-reused 247[K
Receiving objects: 100% (269/269), 71.66 KiB | 4.48 MiB/s, done.
Resolving deltas: 100% (124/124), done.
/home/jovyan/binder/rouge/rouge/rouge
running install
running bdist_egg
running egg_info
creating rouge.egg-info
writing rouge.egg-info/PKG-INFO
writing dependency_links to rouge.egg-info/dependency_links.txt
writing entry points to rouge.egg-info/entry_points.txt
writing requirements to rouge.egg-info/requires.txt
writing top-level names to rouge.egg-info/top_level.txt
writing manifest file 'rouge.egg-info/SOURCES.txt'
reading manifest file 'rouge.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
writing manifest file 'rouge.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py

In [23]:
from rouge import Rouge
#in order to measure the Rouge scores, it is neccessary to obtain the corresponding human summary
human_summary=""" 
Chinese authorities closed 12,575 net cafes in the closing months of 2004, the country's government said.
Chinese net cafes operate under a set of strict guidelines and many of those most recently closed broke rules that limit how close they can be to schools.
This is not the first time that the Chinese government has moved against net cafes that are not operating within its strict guidelines.
""";
hypothesis =final_summary ;
reference =human_summary
rouge = Rouge()
scores = rouge.get_scores(hypothesis, reference)
print(scores)


[{'rouge-1': {'r': 0.98, 'p': 0.5, 'f': 0.6621621576880935}, 'rouge-2': {'r': 0.953125, 'p': 0.4066666666666667, 'f': 0.5700934537514194}, 'rouge-l': {'r': 0.98, 'p': 0.5, 'f': 0.6621621576880935}}]


In [34]:
#To measure the cosine similarity
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize 

X = DOCUMENT.lower() 
Y = final_summary.lower() 

X_list = word_tokenize(X)  
Y_list = word_tokenize(Y) 

sw = stopwords.words('english')  
l1 =[];l2 =[] 

X_set = {w for w in X_list if not w in sw}  
Y_set = {w for w in Y_list if not w in sw} 

rvector = X_set.union(Y_set)  
for w in rvector: 
    if w in X_set: l1.append(1)
    else: l1.append(0) 
    if w in Y_set: l2.append(1) 
    else: l2.append(0) 
c = 0

for i in range(len(rvector)): 
        c+= l1[i]*l2[i] 
cosine = c / float((sum(l1)*sum(l2))*0.5) 
print("Cosine Similarity: ", cosine)

Cosine Similarity:  0.013986013986013986


In [2]:
! pip install --upgrade language-check

Collecting language-check
  Using cached language-check-1.1.tar.gz (33 kB)
Building wheels for collected packages: language-check
  Building wheel for language-check (setup.py) ... [?25lerror
[31m  ERROR: Command errored out with exit status 1:
   command: /srv/conda/envs/notebook/bin/python3.6 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-ao8ntm0f/language-check_1a933e9d83c54f5fa5862fcd4c0c54b1/setup.py'"'"'; __file__='"'"'/tmp/pip-install-ao8ntm0f/language-check_1a933e9d83c54f5fa5862fcd4c0c54b1/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-dxhpn_v1
       cwd: /tmp/pip-install-ao8ntm0f/language-check_1a933e9d83c54f5fa5862fcd4c0c54b1/
  Complete output (35 lines):
  Traceback (most recent call last):
    File "<string>", line 1, in <module>
    File "/tmp/pip-install-ao8ntm0f/language-check_1a933

In [None]:
#To check the number of gramatical errors in the generated summary
import language_check
# Mention the language keyword
tool = language_check.LanguageTool('en-US')
i = 0
# Path of file which needs to be checked
with open(r'C:\Users\win10\Desktop\Summaries\Tech-No.2.txt', 'r') as fin:            
    for line in fin:
        matches = tool.check(line)
        i = i + len(matches)     
        pass
# prints total mistakes which are found 
# from the document
print("No. of mistakes found in document is ", i)
print() 
# prints mistake one by one 
for mistake in matches:
    print(mistake)
    print()    