# Attempting to automate the analysis creation process

### Text Similarity Approaches

In [None]:
#@title Upload plain text file of excerpts from local machine { display-mode: "form" }
import re
import csv
import pandas as pd
from google.colab import files
uploaded = files.upload()

In [None]:
#@title Using regex to parse TXT file { display-mode: "form" }
# create csv out of txt file
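# expected line layout: [ID] [optional TAG] text (source)
rx = r"\[(.*?)\]\s*(?:\[(.*?)\])?\s*(.*?)\((.*?)\)"
with open('input.txt', 'r') as f:
    # newline='' lets the csv module control line endings
    with open('output.csv', 'w', newline='') as f1:
        writer = csv.writer(f1)
        writer.writerow(('ID', 'TAG', 'TEXT', 'SOURCE'))
        for line in f:
            line = line.strip()
            matches = re.findall(rx, line)
            # skip blank lines and lines that do not match the pattern
            if matches:
                excerpt_id, tag, text, source = matches[0]
                writer.writerow([excerpt_id, tag, text, source])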
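
To see what the pattern captures, here is a minimal sketch that runs it on two made-up lines. The sample lines and the helper name `parse_line` are hypothetical; the only assumption is the `[ID] [TAG] text (source)` layout implied by the regex, with the tag being optional.

```python
import re

# same pattern as in the parsing cell
rx = r"\[(.*?)\]\s*(?:\[(.*?)\])?\s*(.*?)\((.*?)\)"

def parse_line(line):
    """Return (id, tag, text, source) for a matching line, else None."""
    match = re.search(rx, line)
    if match is None:
        return None
    return match.groups()

# hypothetical input lines, one with a tag and one without
with_tag = "[00000-000001] [Kiev City] Prices went up. (Example Source, 01/01/2022)"
without_tag = "[00000-000002] Prices went up. (Example Source, 01/01/2022)"

for line in (with_tag, without_tag):
    print(parse_line(line))
```

With `re.search` a missing tag comes back as `None`, whereas `re.findall` reports it as an empty string. Because the text group is lazy and the pattern is not anchored to the end of the line, an excerpt that itself contains parentheses would be cut at its first `(`.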

In [None]:
#@title TXT to CSV file { vertical-output: true, display-mode: "form" }
display_output = pd.read_csv('output.csv')
print(display_output)

              ID                                                       TAG  \
0   85930-610858                                                       NaN   
1   85930-610870                                                       NaN   
2   85930-610873                                                 Kiev City   
3   85930-610883                       Migrants, refugees, stateless, IDPS   
4   85930-610891                                                       NaN   
5   85930-610954  Migrants, refugees, stateless, IDPS, Luhanska; Kiev City   
6   85930-610957                                  Female head of household   
7   85930-610976                                  Female head of household   
8   85930-610986                                                 Kiev City   
9   85930-612000                  Kharkivska; Luhanska; Khersonska; Sumska   
10  85930-632237                 Luhans'k; Kherson; Donets'k; Zaporizhzhya   
11  85923-607141                                                

# Sentence Clustering

In [None]:
#@title Setting the environment for sentence clustering { display-mode: "form" }
from configparser import ConfigParser, ExtendedInterpolation
from collections import defaultdict
from itertools import combinations, product
import warnings
warnings.filterwarnings('ignore')
from sklearn.feature_extraction.text import TfidfVectorizer


# increase display of columns in pandas
pd.set_option('display.max_colwidth', 200)

In [None]:
#@title Read text and remove duplicates { display-mode: "form" }
# read text and source columns of excerpts
EXCERPTS_CSV = r'output.csv'
cols = ['TEXT', 'SOURCE']
excerpts = pd.read_csv(EXCERPTS_CSV, encoding='latin-1', usecols=cols)

# remove duplicates
excerpts = excerpts[~excerpts.TEXT.duplicated()]

excerpts.shape

(15, 2)
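
As a small illustration of the duplicate-removal step, the sketch below uses a made-up three-row frame (the rows and the name `toy` are hypothetical) to show which rows `duplicated()` flags and that the boolean mask keeps the same rows as `drop_duplicates`.

```python
import pandas as pd

# hypothetical excerpts; the first two rows share the same TEXT
toy = pd.DataFrame({
    'TEXT': ['prices went up', 'prices went up', 'households were displaced'],
    'SOURCE': ['report A', 'report B', 'report A'],
})

# duplicated() flags every repeat after the first occurrence
mask = toy.TEXT.duplicated()
print(mask.tolist())

# keeping the unflagged rows is the same as drop_duplicates on TEXT
deduped = toy[~mask]
same = deduped.equals(toy.drop_duplicates(subset='TEXT'))
print(deduped.shape, same)
```

Only the first occurrence of a repeated text survives, so the source attached to a later duplicate is dropped along with it.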

In [None]:
#@title Vectorizing the text { display-mode: "form" }
# vectorize the text
vect = TfidfVectorizer(min_df=.01, max_df=.95, norm='l2', stop_words='english', max_features=1000, ngram_range=(1,2))
titles_vect = vect.fit_transform(excerpts['TEXT'])

In [None]:
#@title Clustering text using KMeans { vertical-output: true, display-mode: "form" }
%%time

# cluster the document using KMeans

# step 1 - import the model
from sklearn.cluster import KMeans

# step 2 - instantiate the model
number_of_clusters = 5

km = KMeans(n_clusters=number_of_clusters, random_state=42)

# step 3 - fit the model with data
# clustering is unsupervised so we do not have labels to add during .fit()
km.fit(titles_vect)

# step 4 - predict the cluster of each excerpt
excerpts['clusters'] = km.predict(titles_vect)

CPU times: user 111 ms, sys: 954 µs, total: 112 ms
Wall time: 65.7 ms


In [None]:
#@title Clustered text: { vertical-output: true, display-mode: "form" }
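def review_clusters(df, n_clusters):
  # print up to 10 excerpts per cluster and append their text to clustered.txt
  # (mode 'a' appends, so re-running this cell adds the same excerpts again)
  with open('clustered.txt', 'a') as f:
    for cl_num in range(n_clusters):
      cluster_rows = df[df.clusters == cl_num]
      print(cl_num)
      print(cluster_rows[['TEXT', 'SOURCE']].values[0:10])
      print()
      f.write(str(cluster_rows['TEXT'].values[0:10]))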

review_clusters(excerpts, n_clusters=number_of_clusters)

0
[['The survey findings also reflect increasing prices. Three quarters of all respondents reported having experienced significant increases in prices in the two weeks before the survey. Respondents in the eastern, southern, and northern parts of the country were more likely to report price increases, with the figure going up to 90% in the Sumska oblast and over 80% in Donetska, Luhanska, Odeska, Khersonska, Kharkivska, and Chernihivska oblasts. According to data from the Ukrainian Statistical Service, bread prices went up by 3.5-4.3% in March nationwide, with increases up to around 30 percent in the Kherson oblast. '
  'World Food Programme, UN Country Team in Ukraine, Ukraine Food Security Report, 13/05/2022']
 ['Oblasts in the eastern and southern parts of the country were found to have the highest estimate levels of food insecurity, with one in every two households being food insecure. '
  'World Food Programme, UN Country Team in Ukraine, Ukraine Food Security Report, 13/05/2022']

### Evaluation and summary (Using Extractive methods)

In [None]:
#@title Creating extractive summary that covers all clusters. { vertical-output: true, display-mode: "form" }
import nltk
from nltk.corpus import stopwords
from nltk.cluster.util import cosine_distance
import numpy as np
import networkx as nx
 
def read_article(file_name):
    # read the clustered excerpts; the file is closed when the block ends
    with open(file_name, "r") as file:
        article = file.readlines()
    sentences = []

    for sentence in article:
        # drop quote characters and split on the trailing newline
        sentences.append(sentence.replace("'", "").split("\n"))
    # discard the last entry
    sentences.pop()

    return sentences

def sentence_similarity(sent1, sent2, stopwords=None):
    if stopwords is None:
        stopwords = []
 
    sent1 = [w.lower() for w in sent1]
    sent2 = [w.lower() for w in sent2]
 
    all_words = list(set(sent1 + sent2))
 
    vector1 = [0] * len(all_words)
    vector2 = [0] * len(all_words)
 
    # build the vector for the first sentence
    for w in sent1:
        if w in stopwords:
            continue
        vector1[all_words.index(w)] += 1
 
    # build the vector for the second sentence
    for w in sent2:
        if w in stopwords:
            continue
        vector2[all_words.index(w)] += 1
 
    return 1 - cosine_distance(vector1, vector2)
 
def build_similarity_matrix(sentences, stop_words):
    # Create an empty similarity matrix
    similarity_matrix = np.zeros((len(sentences), len(sentences)))
 
    for idx1 in range(len(sentences)):
        for idx2 in range(len(sentences)):
            if idx1 == idx2: #ignore if both are same sentences
                continue 
            similarity_matrix[idx1][idx2] = sentence_similarity(sentences[idx1], sentences[idx2], stop_words)

    return similarity_matrix


def generate_summary(file_name, top_n=number_of_clusters + 5):
    nltk.download("stopwords")
    stop_words = stopwords.words('english')
    summarize_text = []

    # Step 1 - Read text and split it
    sentences = read_article(file_name)

    # Step 2 - Generate similarity matrix across sentences
    sentence_similarity_matrix = build_similarity_matrix(sentences, stop_words)

    # Step 3 - Rank sentences in similarity matrix
    sentence_similarity_graph = nx.from_numpy_array(sentence_similarity_matrix)
    scores = nx.pagerank(sentence_similarity_graph)

    # Step 4 - Sort the rank and pick top sentences
    ranked_sentence = sorted(((scores[i],s) for i,s in enumerate(sentences)), reverse=True)
    print("Indexes of top ranked_sentence order are ", ranked_sentence)

    # guard against top_n exceeding the number of sentences
    for i in range(min(top_n, len(ranked_sentence))):
      summarize_text.append(" ".join(ranked_sentence[i][1]))

    # Step 5 - Output the summarized text
    summary =  " ".join(summarize_text)
    print("Summarize Text: \n", summary)

    # Write output to file
    with open('extractive.txt', 'w') as f:
      f.write(summary)

# let's begin
generate_summary( "clustered.txt", number_of_clusters)


Indexes of top ranked_sentence order are  [(0.03001374981180622, [' Since the onset of the war on 24 February 2022, the humanitarian situation in Ukraine has continued to deteriorate. The war has pushed millions of people from their homes, creating the fastest-growing displacement crisis since the second world war. Eastern and southern oblasts in Ukraine continue to see active fighting. The destruction of civilian infrastructure is affecting essential services such as electricity, heating, and clean water and disrupting access to food and healthcare. ', '']), (0.03001374981180622, [' Since the onset of the war on 24 February 2022, the humanitarian situation in Ukraine has continued to deteriorate. The war has pushed millions of people from their homes, creating the fastest-growing displacement crisis since the second world war. Eastern and southern oblasts in Ukraine continue to see active fighting. The destruction of civilian infrastructure is affecting essential services such as elec

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
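
The ranking idea behind the cell above can be sketched without any third-party packages. This is a simplified stand-in, not the notebook's implementation: it uses `collections.Counter` for the word-count cosine similarity instead of `nltk`'s `cosine_distance`, a plain power iteration instead of `nx.pagerank`, and no stop-word filtering. Unlike `nx.pagerank`, it does not redistribute the score of a sentence that shares no words with any other. The sentences and the function names are hypothetical.

```python
import math
from collections import Counter

def cosine_similarity(words_a, words_b):
    """Cosine similarity between two token lists, using raw word counts."""
    counts_a, counts_b = Counter(words_a), Counter(words_b)
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_sentences(sentences, damping=0.85, iterations=50):
    """Score sentences by power iteration over their similarity graph."""
    tokens = [s.lower().split() for s in sentences]
    n = len(tokens)
    # pairwise similarities, with the diagonal left at zero
    weights = [[0.0 if i == j else cosine_similarity(tokens[i], tokens[j])
                for j in range(n)] for i in range(n)]
    out_weight = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # each neighbour passes on its score in proportion to edge weight
            incoming = sum(scores[j] * weights[j][i] / out_weight[j]
                           for j in range(n) if out_weight[j] > 0)
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores

# hypothetical sentences: the first two overlap heavily, the third barely
toy_sentences = [
    "food prices increased in the east",
    "food prices increased in the south",
    "displaced households need food",
]
toy_scores = rank_sentences(toy_sentences)
for score, sentence in sorted(zip(toy_scores, toy_sentences), reverse=True):
    print(round(score, 3), sentence)
```

Sentences that resemble many others collect the highest scores, so near-identical or repeated input lines tend to reinforce each other and rise to the top of the ranking together.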


**After this checkpoint, the summary looks like this:**

Since the onset of the war on 24 February 2022, the humanitarian situation in Ukraine has continued to deteriorate. The war has pushed millions of people from their homes, creating the fastest-growing displacement crisis since the second world war. Eastern and southern oblasts in Ukraine continue to see active fighting. The destruction of civilian infrastructure is affecting essential services such as electricity, heating, and clean water and disrupting access to food and healthcare.    Since the onset of the war on 24 February 2022, the humanitarian situation in Ukraine has continued to deteriorate. The war has pushed millions of people from their homes, creating the fastest-growing displacement crisis since the second world war. Eastern and southern oblasts in Ukraine continue to see active fighting. The destruction of civilian infrastructure is affecting essential services such as electricity, heating, and clean water and disrupting access to food and healthcare.    Since the onset of the war on 24 February 2022, the humanitarian situation in Ukraine has continued to deteriorate. The war has pushed millions of people from their homes, creating the fastest-growing displacement crisis since the second world war. Eastern and southern oblasts in Ukraine continue to see active fighting. The destruction of civilian infrastructure is affecting essential services such as electricity, heating, and clean water and disrupting access to food and healthcare.    Oblasts in the eastern and southern parts of the country were found to have the highest estimate levels of food insecurity, with one in every two households being food insecure.    Households with a female decision-maker were more likely to have a poor or borderline consumption  


### Evaluation and summary (Using Abstractive methods)

In [None]:
#@title Setting up the environment for abstractive summary { display-mode: "form" }
!pip install torch
!pip install transformers
!pip install bert-extractive-summarizer
!pip install sentencepiece
from summarizer import TransformerSummarizer


In [None]:
#@title Creating extractive summary with XLNet model. { vertical-output: true, display-mode: "form" }
with open('extractive.txt', 'r') as f:
  sentence_bag = f.read()

  model = TransformerSummarizer(transformer_type="XLNet",transformer_model_key="xlnet-base-cased")
  key_findings = ''.join(model(sentence_bag, min_length=60))
  
  print(key_findings)
  # Write output to file
  with open('abstractive.txt', 'w') as f1:
    f1.write(key_findings)

Some weights of the model checkpoint at xlnet-base-cased were not used when initializing XLNetModel: ['lm_loss.weight', 'lm_loss.bias']
- This IS expected if you are initializing XLNetModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLNetModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Since the onset of the war on 24 February 2022, the humanitarian situation in Ukraine has continued to deteriorate. The war has pushed millions of people from their homes, creating the fastest-growing displacement crisis since the second world war. The destruction of civilian infrastructure is affecting essential services such as electricity, heating, and clean water and disrupting access to food and healthcare.


**Final summary:**

Since the onset of the war on 24 February 2022, the humanitarian situation in Ukraine has continued to deteriorate. The war has pushed millions of people from their homes, creating the fastest-growing displacement crisis since the second world war. The destruction of civilian infrastructure is affecting essential services such as electricity, heating, and clean water and disrupting access to food and healthcare.