This notebook shows the workflow used to extract ethical sentences. It consists of four main phases:
1) Load all ethical text files into a single string
2) Tokenize the text from 1) and extract the most common words
3) Manually review the common words to filter out ethics-irrelevant ones
4) Extend the word list with synonyms of each word

In [None]:
import os
import collections
import nltk
import re
import itertools
import json

Load ethical text files as a whole
The ethical articles are stored as .txt files in the directory [path]

In [None]:
path = '/Ethics and DH/files/'
file_string = ''
for filename in os.listdir(path):
    with open(os.path.join(path, filename), encoding="utf-8") as file:
        file_string += file.read()
        
file_string = file_string.lower()

Function to get the most common words (based on Python's collections.Counter)

In [None]:
def most_common(counter, quantity=None, minimum=None):
    g = None
    if (quantity is not None) and (minimum is not None):
        g=(e for e in counter.most_common(quantity) if e[1]>=minimum)
    elif (quantity is None) and (minimum is not None):
        g=(e for e in counter.most_common() if e[1]>=minimum)
    elif (quantity is not None) and (minimum is None):
        g=counter.most_common(quantity)
    else:
        g=counter.most_common()
    return list(g)
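
The interaction between [quantity] and [minimum] can be checked on a toy counter. The sketch below uses a compact formulation, most_common_compact, which is a hypothetical name and not part of the original notebook. It is intended to behave like the function above: take the top [quantity] entries first, then drop those whose count is below [minimum].

```python
import collections

def most_common_compact(counter, quantity=None, minimum=None):
    # Counter.most_common(None) returns every entry, ordered by count
    pairs = counter.most_common(quantity)
    if minimum is not None:
        # Keep only entries that occur at least `minimum` times
        pairs = [pair for pair in pairs if pair[1] >= minimum]
    return pairs

# Toy data: 'a' occurs 3 times, 'b' twice, 'c' and 'd' once each
demo_counter = collections.Counter("a b a c b a d".split())

top_two = most_common_compact(demo_counter, quantity=2)
frequent = most_common_compact(demo_counter, minimum=2)
top_three_frequent = most_common_compact(demo_counter, quantity=3, minimum=2)
```

Note that [minimum] is applied after [quantity], so the result can be shorter than [quantity].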

Tokenize words using a regular expression
Use the NLTK stopword list to filter out irrelevant words

In [None]:
word_tokens = re.findall(r'(?!\d)\w+', file_string)
stop_words = set(nltk.corpus.stopwords.words('english'))
filtered_words = [w for w in word_tokens if w not in stop_words and len(w) > 1]
counter = collections.Counter(filtered_words)
common_words = [w[0] for w in most_common(counter, 300)]
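
To see what the tokenizing pattern keeps and drops, the sketch below runs it on a made-up sentence. The sample text and the two-word stopword set are assumptions chosen so the example runs without downloading NLTK data. It also shows one edge case: a token such as "3rd" leaves the fragment "rd". A stricter pattern that anchors on a word boundary is shown as one possible alternative, not as the notebook's method.

```python
import collections
import re

sample = "In 2019 the 3rd ethics board reviewed the data ethics policy."
text = sample.lower()

# Pattern from the notebook: a run of word characters that does not start on a digit
tokens = re.findall(r'(?!\d)\w+', text)

# Alternative (assumption): require a word boundary, so "3rd" is skipped entirely
strict_tokens = re.findall(r'\b[^\W\d]\w*', text)

# Tiny stand-in for the NLTK stopword list
demo_stop_words = {"in", "the"}
filtered = [w for w in strict_tokens if w not in demo_stop_words and len(w) > 1]
counts = collections.Counter(filtered)
```

Here tokens contains the fragment "rd", while strict_tokens does not.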

Function to check if a word is in a text

In [None]:
def in_sentence(word, text):
    # Compare against whole tokens, not substrings
    tokens = nltk.word_tokenize(text.lower())
    return word.lower() in tokens
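
The difference between a token match and a plain substring match can be illustrated as follows. This is a sketch only: tokens_of is a hypothetical regex-based stand-in for nltk.word_tokenize, used so the example runs without NLTK data, and the sentences are made up.

```python
import re

def tokens_of(text):
    # Stand-in tokenizer (assumption): runs of word characters, lowercased
    return re.findall(r"\w+", text.lower())

def in_sentence_demo(word, text):
    # Same idea as in_sentence: membership in the token list
    return word.lower() in tokens_of(text)

sentence = "Unethical sharing of data raises privacy concerns."

substring_hit = "ethical" in sentence.lower()        # matches inside "unethical"
token_hit = in_sentence_demo("ethical", sentence)    # no standalone token "ethical"
privacy_hit = in_sentence_demo("Privacy", sentence)  # case-insensitive token match

# A multi-word entry never equals a single token
phrase_hit = in_sentence_demo("moral code", "A moral code guides research.")
```

One consequence is that any multi-word synonym in the word list cannot match under a token-based check like this one.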

Get synonyms of each ethical word in the word list
The synonyms are scraped from each word's page on www.thesaurus.com

In [None]:
import urllib.request
from bs4 import BeautifulSoup
from urllib.error import URLError, HTTPError

common_words_extended = []

for w in common_words:
    common_words_extended.append(w)
    url = 'https://www.thesaurus.com/browse/' + w
    req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
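    try:
        resource = urllib.request.urlopen(req)
    except URLError:
        # HTTPError is a subclass of URLError; skip words whose page cannot be fetched
        continue
    # Fall back to UTF-8 when the response does not declare a charset
    charset = resource.headers.get_content_charset() or "utf-8"
    uf = resource.read().decode(charset)
    soup = BeautifulSoup(uf,"html.parser")
    # These generated CSS class names are specific to the site's markup and may change
    links = soup.select("a.css-3kshty.etbu2a31")
    synonyms = [a.find(text=True) for a in links]
    # Drop links without text so the list contains only plain strings
    common_words_extended.extend(str(s) for s in synonyms if s)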

common_words_extended = list(set(common_words_extended))

with open("/Ethics and DH/filtered_wordlist_extended.txt", "w", encoding="utf-8") as text_file:
    # Write valid JSON so the list can be read back with the json module
    json.dump(common_words_extended, text_file)
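
Deduplicating with set() discards the original frequency order of the word list. If that order matters, one option is an order-preserving merge. The sketch below is an assumption about how this could be done; merge_unique and the sample words are hypothetical and not part of the notebook.

```python
def merge_unique(words, synonyms):
    # Combine both lists, skipping missing entries such as None
    candidates = list(words) + list(synonyms)
    cleaned = (w.strip().lower() for w in candidates if w)
    # dict.fromkeys keeps the first occurrence of each key, in insertion order
    return list(dict.fromkeys(w for w in cleaned if w))

merged = merge_unique(
    ["ethics", "privacy"],
    ["morality", None, "Ethics", " privacy ", "confidentiality"],
)
```

The original words stay at the front of the list, followed by the new synonyms in the order they were found.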

The word list is saved as a JSON text file after each phase as a backup, so it can be reloaded here

In [None]:
with open("/Ethics and DH/filtered_wordlist_extended.txt",encoding="utf-8") as file:
    common_words = json.loads(file.read())
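
The backup step can be sketched as a JSON round trip. The file name and the sample words below are hypothetical, and a temporary directory is used so the example does not touch the project files. The last part shows why the list has to be written with the json module: the str() form of a Python list uses single quotes, which json.loads does not accept.

```python
import json
import os
import tempfile

words = ["ethics", "privacy", "consent"]

# Hypothetical backup location inside a temporary directory
backup_path = os.path.join(tempfile.mkdtemp(), "wordlist_backup.json")

with open(backup_path, "w", encoding="utf-8") as f:
    json.dump(words, f, ensure_ascii=False)

with open(backup_path, encoding="utf-8") as f:
    restored = json.load(f)

# str(words) gives "['ethics', 'privacy', 'consent']", which is not valid JSON
try:
    json.loads(str(words))
    repr_is_json = True
except json.JSONDecodeError:
    repr_is_json = False
```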

Extract the sentences that contain at least [weight] of the ethical words

In [None]:
weight = 2
paragraphs = [p for p in file_string.split('\n') if p]
sentence_tokens = []
for paragraph in paragraphs:
    sentences = nltk.sent_tokenize(paragraph)
    sentence_tokens += sentences
output_sentences=[]
for sentence in sentence_tokens:
    sentence = sentence.replace('\n','')
    count=0
    for word in common_words:
        if in_sentence(word, sentence):
            count+=1
            # Keep the sentence as soon as it reaches [weight] matching words
            if count >= weight:
                output_sentences.append(sentence)
                break
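
Because the loop above tokenizes the same sentence once for every word in the list, it can become slow on large corpora. A possible alternative is to tokenize each sentence once and intersect the token set with the word list. The sketch below makes two assumptions: select_sentences is a hypothetical helper, and a simple regex replaces nltk.word_tokenize so the example runs without NLTK data.

```python
import re

def select_sentences(sentences, words, weight):
    # Lowercased vocabulary for constant-time membership checks
    vocabulary = {w.lower() for w in words}
    selected = []
    for sentence in sentences:
        # Tokenize once per sentence (regex stand-in for nltk.word_tokenize)
        tokens = set(re.findall(r"\w+", sentence.lower()))
        # Count distinct vocabulary words present in the sentence
        if len(tokens & vocabulary) >= weight:
            selected.append(sentence)
    return selected

demo_sentences = [
    "Informed consent protects privacy.",
    "Privacy matters.",
    "The weather is nice.",
]
demo_words = ["privacy", "consent", "ethics"]
picked = select_sentences(demo_sentences, demo_words, weight=2)
```

Only the first demo sentence contains two distinct words from the list, so it is the only one selected with weight=2.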

Save the final ethical sentences as a plain-text file, one sentence per line, to serve as the training resource

In [None]:
output_sentences = list(set(output_sentences))
# One sentence per line, each followed by a newline
output = "".join(s.strip() + "\n" for s in output_sentences)

with open("/Ethics and DH/sentences_new_extended.txt", "w", encoding="utf-8") as text_file:
    text_file.write(output)