In [33]:
# !pip install -r requirements.txt

## Semantic search
- In model 1, we took the n most frequent bigrams and compared the cosine similarity of their SBERT embeddings.
- Now we could try embedding every sentence of the corpus and then finding the k nearest neighbours of an embedded query.
- Question: do we want to keep stop words, punctuation, etc. when embedding?
- Can we combine this with named entity recognition?
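
The nearest-neighbour idea in the second bullet can be sketched without loading a model. This is a minimal illustration, assuming the embeddings are rows of a NumPy array; the helper name `top_k_cosine` and the toy 2-D vectors are hypothetical stand-ins for real `model.encode(...)` output.

```python
import numpy as np

def top_k_cosine(query_vec, corpus_vecs, k=3):
    """Return (indices, scores) of the k corpus rows most similar to query_vec."""
    # L2-normalise so that a dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    scores = c @ q
    # argsort is ascending, so take the last k entries and reverse them.
    idx = np.argsort(scores)[-k:][::-1]
    return idx, scores[idx]

# Toy 2-D "embeddings" (hypothetical), standing in for sentence embeddings.
corpus_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
query_vec = np.array([2.0, 0.1])

idx, scores = top_k_cosine(query_vec, corpus_vecs, k=2)
print(idx, scores)
```

With real embeddings, `sentence_transformers.util.semantic_search` does a similar top-k lookup and may be the more convenient choice.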

In [1]:
import faculty.datasets as datasets
import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import spacy
import re
from utils import clean2, is_pua
from sentence_transformers import SentenceTransformer
import pickle
%matplotlib inline



## Create DataFrame
Create a DataFrame containing only the plans whose PDF yielded text. This leaves 502 of the 686 plans.

In [9]:
df = pd.read_excel('../data/raw/plans.xlsx')
drop_cols = ['search_link','type','charset','unfound','credit','date_retrieved','time_period','scope','status',
            'well_presented', 'baseline_analysis', 'notes', 'plan_path','file_type','website_url',
            'authority_code', 'authority_type', 'wdtk_id',
           'mapit_area_code', 'country', 'gss_code', 'county', 'region',
           'population', 'unfound','credit','homepage_mention', 'dedicated_page',
            'plan_due','title','title_checked','twitter_url','twitter_name','url','council']
df = df.drop(drop_cols,axis=1)
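# Keep only plans with extracted text; .copy() avoids SettingWithCopyWarning on later column assignments
corpus = df[df['text'].notna()].copy()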

## Better cleaning

In [10]:
corpus['text'] = corpus['text'].apply(clean2)



In [4]:
corpus.to_csv('../data/processed/text-and-sentences.csv')

## Sentence tokenization

It is worth removing digits (and any other character that is not a lowercase letter, full stop or space) before sentence tokenization.
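
As a quick illustration of what a character whitelist of this kind does to a string, here is a minimal sketch. The helper name `keep_letters_and_stops` and the sample text are hypothetical; the pattern keeps only `a-z`, full stops and spaces.

```python
import re

def keep_letters_and_stops(text):
    """Drop every character that is not a-z, a full stop, or a space."""
    return re.sub(r"[^a-z. ]", "", text)

sample = "net zero by 2030. cost is 2.5m."
cleaned = keep_letters_and_stops(sample)
print(cleaned)  # net zero by . cost is .m.
```

The digits disappear but the full stops around them remain, so a decimal such as `2.5` leaves a stray `.` that a sentence tokenizer may treat as a boundary.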

In [11]:
def clean3(text):
    """Keep only lowercase letters, full stops and spaces (drops digits and other punctuation)."""
    text = re.sub(r"[^a-z. ]", '', text)
    return text

corpus['text'] = corpus['text'].apply(clean3)



In [12]:
corpus

Unnamed: 0,text
0,aberdeen city council energy and climate plan ...
1,aberdeen adapts aberdeens climate adaptation f...
2,council climate change plan towards a net ze...
3,a climate positive city at the heart of the gl...
5,carbon neutral plan working towards the targe...
...,...
668,wychavon intelligently green plan contents w...
670,climate change action plan council operations ...
671,climate change action plan wider borough our g...
680,restore revive thrive our environment climate ...


In [13]:
# nltk.download('punkt')
# nltk.download("stopwords")
from nltk.tokenize import sent_tokenize, word_tokenize

corpus['sentences'] = corpus['text'].apply(sent_tokenize)



In [17]:
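# Flatten the per-plan sentence lists into one list (avoids repeated list concatenation via .sum())
sentence_corpus = [sentence for plan in corpus['sentences'] for sentence in plan]
len(sentence_corpus)  # there are ~55000 sentences in the corpus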

In [47]:
corpus.sentences.sample(n=10).iloc[0]

['one city climate strategystrategy forcarbon neutral climate resilient bristol by 2030 bristol one city climate strategystrategy forcarbon neutral climate resilient bristol by 2030 foreword from the one city environmental sustainability board we are facingclimate in the one city plan bristol this strategy sets the vision for where this strategy iscall to action.',
 'committed to becoming carbon we need to be in 2030 based on we call on you as people who live emergency.',
 'ascity neutral and climate resilient by 2030. sound science.',
 'we would like to thank work visit and invest in bristol to join we need to act now to to achieve this over the next decade our colleagues on bristols advisory with us on this exciting decade of we need to radically rethink how we committee on climate change for transformation.',
 'reduce direct and indirect live work and invest in the city.',
 'their review and challenge of the we will engage widely to understand evidence for bristol.',
 'carbon emissi

## Storing and loading embeddings with SBERT
Start with a general-purpose pretrained model, `all-MiniLM-L6-v2`.

In [56]:
from sentence_transformers import SentenceTransformer
import pickle

model = SentenceTransformer('all-MiniLM-L6-v2')
sentences = sentence_corpus


# encode() returns a NumPy array by default; pass convert_to_tensor=True for a torch tensor
embeddings = model.encode(sentences)

# Store sentences & embeddings on disk
with open('../data/processed/embeddings.pkl', "wb") as fOut:
    pickle.dump({'sentences': sentences, 'embeddings': embeddings}, fOut, protocol=pickle.HIGHEST_PROTOCOL)


In [6]:
# Load sentences & embeddings from disk
with open('../data/processed/embeddings.pkl', "rb") as fIn:
    stored_data = pickle.load(fIn)
    stored_sentences = stored_data['sentences']
    stored_embeddings = stored_data['embeddings']

In [7]:
from sentence_transformers import util

cosine_scores = util.cos_sim(stored_embeddings, stored_embeddings)

In [8]:
cosine_scores

tensor([[1.0000, 0.5324, 0.4346,  ..., 0.2506, 0.4048, 0.2367],
        [0.5324, 1.0000, 0.5131,  ..., 0.1765, 0.2188, 0.2319],
        [0.4346, 0.5131, 1.0000,  ..., 0.2407, 0.2976, 0.3638],
        ...,
        [0.2506, 0.1765, 0.2407,  ..., 1.0000, 0.3145, 0.1501],
        [0.4048, 0.2188, 0.2976,  ..., 0.3145, 1.0000, 0.1995],
        [0.2367, 0.2319, 0.3638,  ..., 0.1501, 0.1995, 1.0000]])

In [None]:
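# Collect every unique pair (i < j) with its cosine score.
# This is quadratic in the number of sentences.
pairs = []
for i in range(len(cosine_scores) - 1):
    for j in range(i + 1, len(cosine_scores)):
        pairs.append({'index': [i, j], 'score': cosine_scores[i][j]})

# Sort scores in decreasing order
pairs = sorted(pairs, key=lambda x: x['score'], reverse=True)

# Print the ten most similar pairs, using the sentences loaded alongside the embeddings
for pair in pairs[0:10]:
    i, j = pair['index']
    print("{} \t\t {} \t\t Score: {:.4f}".format(stored_sentences[i], stored_sentences[j], pair['score']))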
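
The double loop above can also be written without Python-level iteration. The following is a minimal sketch, assuming the similarity matrix is available as a NumPy array (a torch tensor from `util.cos_sim` could be converted with `.cpu().numpy()`); the helper name `top_pairs` and the 3x3 toy matrix are hypothetical.

```python
import numpy as np

def top_pairs(sim, n=10):
    """Return the n highest-scoring (score, i, j) pairs with i < j."""
    # Indices of the strict upper triangle: each unordered pair once, no diagonal.
    rows, cols = np.triu_indices(sim.shape[0], k=1)
    scores = sim[rows, cols]
    # Sort descending and keep the first n.
    order = np.argsort(-scores)[:n]
    return [(float(scores[t]), int(rows[t]), int(cols[t])) for t in order]

# Toy symmetric similarity matrix (hypothetical values).
sim = np.array([
    [1.0, 0.2, 0.9],
    [0.2, 1.0, 0.5],
    [0.9, 0.5, 1.0],
])
best = top_pairs(sim, n=2)
print(best)
```

This removes the slow Python loop but still materialises every pair score, so the quadratic memory cost remains for a corpus of this size.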

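## Paraphrase mining
- We have ~50 000 sentences, so the exhaustive pairwise comparison is too slow. `util.paraphrase_mining` processes the corpus in chunks and keeps only the `top_k` best matches per sentence.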

In [21]:
model = SentenceTransformer('all-MiniLM-L6-v2')
sentences = sentence_corpus


paraphrases = util.paraphrase_mining(model, sentences, corpus_chunk_size=10000, query_chunk_size=1000, top_k=20)
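
The corpus contains fragments that are repeated verbatim (for example `note.` in the results further down), and these show up as trivial pairs with a score of 1.0. One option, sketched here as an assumption rather than something this notebook does, is to drop exact duplicates before mining. The helper name `dedupe_keep_order` and the sample list are hypothetical.

```python
def dedupe_keep_order(sentences):
    """Remove exact duplicate strings while keeping first-seen order."""
    # dict preserves insertion order, so fromkeys keeps the first occurrence of each string.
    return list(dict.fromkeys(sentences))

sample = ["note.", "report to.", "note.", "we need to act now.", "report to."]
unique = dedupe_keep_order(sample)
print(unique)
```

Passing the de-duplicated list to the mining step would leave the top of the ranking for near-paraphrases rather than identical strings, at the cost of losing information about how often a sentence recurs across plans.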

In [33]:
for paraphrase in paraphrases[20:30]:
    score, i, j = paraphrase
    print("{} \t {} \t Score: {:.4f}".format(sentences[i], sentences[j], score))

note. 	 note. 	 Score: 1.0000
note. 	 note. 	 Score: 1.0000
the for high quality sustainable force dec suffolk design project design. 	 the for high quality sustainable force dec suffolk design project design. 	 Score: 1.0000
challengechallengechallengeulev infrastructure. 	 challengechallengechallengeulev infrastructure. 	 Score: 1.0000
report to. 	 report to. 	 Score: 1.0000
the principal conurbation is bracknell itself with secondary population centres built up around the historic towns and villages of sandhurst crowthorne to the south binfield warfield and winkfield to the north and north ascot to the east. 	 the principal conurbation is bracknell itself with secondary population centres built up around the historic towns and villages of sandhurst crowthorne to the south binfield warfield and winkfield to the north and north ascot to the east. 	 Score: 1.0000
such partners will not only help us to deliver but will take the responsibility for achieving targets to help close the gree