<a href="https://colab.research.google.com/github/jingyiw114/DemoKin/blob/main/Copy_of_Shared_Keyphrase_network_analysis_scottish_accounts.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Keyphrase Network Analysis

This notebook demonstrates the workflow of keyphrase network analysis by applying it to the digitized text of *The Statistical Accounts of Scotland 1791–1845*.

<center><img src="https://stataccscot.edina.ac.uk/static/statacc/dist/img/carousel/carousel2.jpg" alt="drawing" width="50%"/></center>


> There are accounts for all of Scotland’s 938 parishes and considerable additional material from the original publications, such as illustrations, maps, county observations and appendixes on topics of special interest.
>
> Source: https://stataccscot.edina.ac.uk/static/statacc/dist/support/about#content

# Importing and preparing the data

In [1]:
!git clone https://github.com/DCS-training/ScottishAccounts/

Cloning into 'ScottishAccounts'...
remote: Enumerating objects: 29125, done.
remote: Counting objects: 100% (23/23), done.
remote: Compressing objects: 100% (17/17), done.
remote: Total 29125 (delta 4), reused 15 (delta 3), pack-reused 29102
Receiving objects: 100% (29125/29125), 101.75 MiB | 3.73 MiB/s, done.
Resolving deltas: 100% (121/121), done.
Updating files: 100% (29115/29115), done.


In [2]:
import os
import pandas as pd
import chardet

# Function to detect the encoding of a file
def detect_encoding(file_path):
    with open(file_path, 'rb') as f:
        result = chardet.detect(f.read())
    return result['encoding']

# Function to read files from a directory into a Pandas DataFrame
def read_files_into_dataframe(directory_path):
    # Get a list of file names in the specified directory
    files = [f for f in os.listdir(directory_path) if os.path.isfile(os.path.join(directory_path, f))]

    # Initialize an empty list to store DataFrames for each file
    df_list = []

    # Iterate through each file in the directory
    for file in files:
        file_path = os.path.join(directory_path, file)
        encoding = detect_encoding(file_path)

        # Open the file, read its content, and create a DataFrame
        with open(file_path, 'r', encoding=encoding, errors='replace') as f:
            content = f.read()
            df_list.append(pd.DataFrame({'filename': [file], 'content': [content]}))

    # Concatenate all DataFrames in the list into a single DataFrame
    df = pd.concat(df_list, ignore_index=True)
    return df


directory_path = '/content/ScottishAccounts/Accounts'
df = read_files_into_dataframe(directory_path)

## Sorting pages

In [3]:
import re
def find_three_numbers_substring(input_string):
    # Match the first "<digits>.<digits>" pair in the file name, e.g. "2.12"
    pattern = re.compile(r'\b\d+\.\d+\b')
    matches = re.findall(pattern, input_string)

    try:
        return matches[0]
    except IndexError:
        # No match: report the offending file name and return None
        print("error", input_string)
        return None


df["n_index"] = df['filename'].apply(find_three_numbers_substring)

def extract_substring_after_second_dot(input_string):
    # Define the regular expression pattern
    pattern = re.compile(r'\b\d+\.\d+\.(\w+)\b')

    # Use the pattern to find the match in the input string
    match = re.search(pattern, input_string)

    # Extract the substring after the second dot
    if match:
        substring_after_second_dot = match.group(1)
        return substring_after_second_dot
    else:
        return None

def get_first(t):
    return t.split(".")[0]

def get_second(t):
    return t.split(".")[1]

df["n_index3"] = df['filename'].apply(extract_substring_after_second_dot)

df["n_index1"] = df['n_index'].apply(get_first)
df["n_index2"] = df['n_index'].apply(get_second)

def extract_substring_after_fourth_dot(input_string):
    return ".".join(input_string.split(".")[4:])

df["name"] = df['filename'].apply(extract_substring_after_fourth_dot)

def extract_components(df, column_name):
    # Define regular expression patterns for extracting prefix, number, and suffix
    pattern = re.compile(r'([a-zA-Z]+)?(\d+)([a-zA-Z]+)?')

    # Apply the regular expression pattern to the specified column
    df[['prefix', 'number', 'suffix']] = df[column_name].str.extract(pattern)

    # Fill NaN values with an empty string
    df[['prefix', 'number', 'suffix']] = df[['prefix', 'number', 'suffix']].fillna('')

    return df


df = extract_components(df, "n_index3")

def convert_to_numeric(df, column_name):
    try:
        # Attempt to convert all values in the specified column to numeric
        df[column_name] = pd.to_numeric(df[column_name])
    except (ValueError, TypeError):
        # If any value cannot be converted, leave the column unchanged
        pass
    return df

df = convert_to_numeric(df, "number")

def split_column_inplace(df, column_name):
    # Use the str.split method to split the column based on dots
    df[['type', 'Parish', 'subregion']] = df[column_name].str.split('.', n=2, expand=True)
    return df

df = split_column_inplace(df, "name")

df = df.sort_values(['Parish', 'number', "prefix", "suffix"])
df

Unnamed: 0,filename,content,n_index,n_index3,n_index1,n_index2,name,prefix,number,suffix,type,Parish,subregion
7378,StAS.2.12.1.P.Aberdeen.Aberdeen.txt,\n\t\t CITY OF ABERDEEN.*\n\n\t\t PR...,2.12,1,2,12,P.Aberdeen.Aberdeen.txt,,1,,P,Aberdeen,Aberdeen.txt
8913,StAS.2.12.1.C.Aberdeen.Contents.txt,"\n \t\t\t C O N T E N T S.\n\nABERDEEN,...",2.12,1,2,12,C.Aberdeen.Contents.txt,,1,,C,Aberdeen,Contents.txt
12826,StAS.2.12.1.F.Aberdeen.Contents.txt,\n\t\t\t\tA B E R D E E N.\n\n\n,2.12,1,2,12,F.Aberdeen.Contents.txt,,1,,F,Aberdeen,Contents.txt
14822,StAS.1.6.1.P.Aberdeen.Fraserburgh.txt,\n STATISTICAL ACCOUNT\n\n OF\n\n ...,1.6,1,1,6,P.Aberdeen.Fraserburgh.txt,,1,,P,Aberdeen,Fraserburgh.txt
21131,StAS.2.12.1.M.Aberdeen.Contents.txt,\n<MAP=ABERDEEN SHIRE>\n\n\n,2.12,1,2,12,M.Aberdeen.Contents.txt,,1,,M,Aberdeen,Contents.txt
...,...,...,...,...,...,...,...,...,...,...,...,...,...
17564,StAS.1.17.591.P.Wigton.Glasserton.txt,\n of Glasserton.\n\nit a stately mansion-...,1.17,591,1,17,P.Wigton.Glasserton.txt,,591,,P,Wigton,Glasserton.txt
8735,StAS.1.17.592.P.Wigton.Glasserton.txt,\n Statistical Account\n\nand fattened for...,1.17,592,1,17,P.Wigton.Glasserton.txt,,592,,P,Wigton,Glasserton.txt
17739,StAS.1.17.593.P.Wigton.Glasserton.txt,\n of Glasserton.\n\n Mr Stewart of Cast...,1.17,593,1,17,P.Wigton.Glasserton.txt,,593,,P,Wigton,Glasserton.txt
19088,StAS.1.17.594.P.Wigton.Glasserton.txt,"\n Statistical Account\n\nthe church yard,...",1.17,594,1,17,P.Wigton.Glasserton.txt,,594,,P,Wigton,Glasserton.txt
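
The columns above are all derived from the file name. As a compact illustration, the same decomposition can be sketched with a single regular expression. The helper `parse_account_filename` and the pattern `FILENAME_PATTERN` are hypothetical names, not part of the notebook, and the sketch assumes every file name follows the `StAS.<n>.<n>.<page>.<type>.<Parish>.<subregion>.txt` shape seen in the table. Unlike the notebook's `subregion` column, it strips the `.txt` extension.

```python
import re

# Hypothetical helper, not part of the notebook: one pattern that yields the
# same fields the cells above derive step by step. Group names mirror the
# notebook's column names.
FILENAME_PATTERN = re.compile(
    r"^StAS\.(?P<n_index1>\d+)\.(?P<n_index2>\d+)\.(?P<n_index3>\w+)"
    r"\.(?P<type>\w+)\.(?P<parish>[^.]+)\.(?P<subregion>.+)\.txt$"
)

def parse_account_filename(filename):
    """Return the file-name components as a dict, or None if the name does not fit."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        return None
    return match.groupdict()

parts = parse_account_filename("StAS.2.12.1.P.Aberdeen.Aberdeen.txt")
print(parts)
```

Returning `None` for names that do not fit makes it easy to list the files that need manual inspection before sorting.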


In [4]:
def aggregate_numbers_and_unique_values(df):
    # Group by "Parish" and aggregate "number" column
    result_df = df.groupby('Parish')['number'].agg(['min', 'max']).reset_index()

    # Group by "Parish" and collect unique values for "type", "prefix", and "suffix"
    unique_values_df = df.groupby('Parish')[['type', 'prefix', 'suffix']].agg(lambda x: list(set(x))).reset_index()

    # Merge the two DataFrames on "Parish"
    result_df = pd.merge(result_df, unique_values_df, on='Parish')

    # Rename the columns for clarity
    result_df.columns = ['parish', 'min_number', 'max_number', 'unique_types', 'unique_prefixes', 'unique_suffixes']

    return result_df

# Apply the function to aggregate numbers
summary_df = aggregate_numbers_and_unique_values(df)
summary_df

Unnamed: 0,parish,min_number,max_number,unique_types,unique_prefixes,unique_suffixes
0,Aberdeen,1,1214,"[A, I, C, F, P, M, G]",[],[]
1,Argyle,1,728,"[A, I, C, F, P, M, G]",[],[]
2,Ayrshire,1,871,"[I, C, F, P, M, G]",[],"[, b]"
3,Banff,1,647,"[I, C, F, P, M, G]",[],[]
4,Berwick,1,593,"[I, C, F, P, M, G]",[],"[, b]"
5,Bute,1,582,"[I, C, F, P, M, G]",[],[]
6,Caithness,1,579,"[I, C, F, P, M, G]",[],"[, b, a]"
7,Clackmannan,1,650,"[I, C, F, P, G]",[],"[, b, a]"
8,Dumbarton,1,464,"[I, C, F, P, M, G]",[],[]
9,Dumfries,1,616,"[G, P, I, M]",[],[]


# Clean content

In [5]:
df['content2'] = df['content']

import re

def dehyphenate(text):
    # Rejoin words that were hyphenated across a line break
    return re.sub(r"-\s+", '', text)


df['content2'] = df['content2'].apply(dehyphenate)

def delete_newlines_except_specific_cases(input_string):
    # Define a regular expression pattern to match newline characters not followed by specified conditions
    pattern = re.compile(r'(?<=[^\n])\n(?!\n|\s{3,})')

    # Use re.sub to replace matches with a space
    result_string = re.sub(pattern, ' ', input_string)

    return result_string

df['content2'] = df['content2'].apply(delete_newlines_except_specific_cases)

def remove_trailing_newlines(input_string):
    # Define a regular expression pattern to match three or more consecutive newlines at the end of the string
    pattern = re.compile(r'\n{3,}$')

    # Use re.sub to replace matches with a space
    result_string = re.sub(pattern, ' ', input_string)

    return result_string

df['content2'] = df['content2'].apply(remove_trailing_newlines)
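
To see what the three substitutions do, here is a minimal sketch that applies the same regular expressions, in the same order, to a short made-up string. The function `clean_page` and the sample text are illustrative assumptions. The patterns are copied from the cell above.

```python
import re

def clean_page(text):
    # Same three substitutions as the cleaning cell, in the same order.
    text = re.sub(r"-\s+", "", text)                        # rejoin hyphenated words
    text = re.sub(r"(?<=[^\n])\n(?!\n|\s{3,})", " ", text)  # unwrap single line breaks
    text = re.sub(r"\n{3,}$", " ", text)                    # collapse trailing blank lines
    return text

# Made-up sample: a word hyphenated across a line break, a wrapped line,
# and three trailing newlines.
sample = "The manufac-\ntures of this parish\nare considerable.\n\n\n"
cleaned = clean_page(sample)
print(repr(cleaned))
```

One side effect worth knowing: the first pattern also removes a genuine hyphen whenever whitespace follows it, so `"well- known"` becomes `"wellknown"`.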

## Divide into paragraphs

In [6]:
def replace_pattern_with_newline(text):
    # Define the regular expression pattern with capturing groups
    pattern = re.compile(r'(\.)\s*(\t+)\s*([A-Z])')

    # Replace the pattern with the desired format
    modified_text = re.sub(pattern, r'\1\n\3', text)

    return modified_text


def divide_into_paragraphs(text):
    text = replace_pattern_with_newline(text)
    # Split the text into paragraphs at every remaining line break
    paragraphs = text.split('\n')
    return paragraphs

def is_footnote(text):
    return "<FOOTNOTE>" in text

def contains_only_special_characters(input_string):
    # Check if all characters in the string are non-alphanumeric
    return all(not char.isalnum() for char in input_string)



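def is_indices_text(input_text):
    # Heuristic for index-like lines (a name followed by page numbers).
    # Caution: it also matches ordinary prose whose opening run of letters,
    # digits, spaces and commas contains a number, e.g. "There are 3 mills."
    pattern = re.compile(r'^\s*[\w\s,]+(?:\d+-\w+,\s*)*\d+\s*.*$')

    # Check if the input text matches the pattern
    return bool(re.match(pattern, input_text))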

paragraphs = []

for page in df["content2"]:
    for par in divide_into_paragraphs(page)[2:]:
        if contains_only_special_characters(par) or is_footnote(par) or is_indices_text(par):
            continue
        subdivisions = divide_into_paragraphs(par)
        if subdivisions != []:
            paragraphs += subdivisions
        else: paragraphs.append(par)
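
The splitting rule can be checked on a small made-up page. The sketch below reuses the notebook's tab-based pattern and two of its three filters. It leaves out `is_indices_text`, and the names `split_paragraphs` and `only_special` are illustrative. As in the loop above, the first two pieces of the page are skipped. The sample assumes these are a running header and a blank line.

```python
import re

def split_paragraphs(text):
    # A full stop, then whitespace containing at least one tab, then a
    # capital letter is treated as a paragraph boundary.
    text = re.sub(r"(\.)\s*(\t+)\s*([A-Z])", r"\1\n\3", text)
    return text.split("\n")

def only_special(text):
    # True for empty strings and for lines without letters or digits.
    return all(not ch.isalnum() for ch in text)

# Made-up page: header, blank line, two paragraphs separated by a tab,
# an ornament line and a footnote.
page = (
    "running header\n\n"
    "The soil is light. \t Manufactures are few.\n"
    "* * *\n"
    "<FOOTNOTE> See vol. 3."
)

kept = [
    piece
    for piece in split_paragraphs(page)[2:]
    if not only_special(piece) and "<FOOTNOTE>" not in piece
]
print(kept)
```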

# Search paragraphs for keywords

In [7]:
relevant_paragraphs = []

words = ["manufactur", "machin", "factory", "factories", "industrial", "industries", "engine"]
#words = ["moral", "manner", "habit", "ethic", 'honest'] #, 'values', 'decency', 'decent']


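for pi, par in enumerate(paragraphs):
    if any(w in par.lower() for w in words):
        if par[0].isupper() and par.endswith('.'):
            # Looks like a complete paragraph
            relevant_paragraphs.append(par)
        elif not par.endswith('.') and pi + 1 < len(paragraphs):
            # Probably continues in the next piece: append it
            relevant_paragraphs.append(par + paragraphs[pi + 1])
        elif par.endswith('.') and pi > 0:
            # Probably starts in the previous piece: prepend it
            relevant_paragraphs.append(paragraphs[pi - 1] + par)
        else:
            # No neighbour available at either end of the list
            relevant_paragraphs.append(par)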

relevant_sentences = []

for par in relevant_paragraphs:
  for s in par.split("."):
    if any([w in s.lower() for w in words]):
      relevant_sentences.append(s)


len(relevant_sentences)


7770
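
The keyword search relies on plain substring tests, which is what lets a stem such as `manufactur` catch "manufactures" and "manufacturing". The sketch below shows the behaviour on two made-up sentences and contrasts it with a stricter variant that requires the stem to start a word. The stricter variant is an assumption added for comparison, not something the notebook uses, and the function names are illustrative.

```python
import re

stems = ["manufactur", "machin", "factory"]

def substring_match(text):
    # Notebook approach: a stem counts if it occurs anywhere in the text.
    lowered = text.lower()
    return any(stem in lowered for stem in stems)

# Assumed alternative: the stem must begin at a word boundary.
word_start = re.compile(
    r"\b(?:" + "|".join(re.escape(stem) for stem in stems) + r")",
    re.IGNORECASE,
)

def word_start_match(text):
    return word_start.search(text) is not None

for text in ["Machinery is driven by water.", "The crops were satisfactory."]:
    print(text, substring_match(text), word_start_match(text))
```

The second sentence is selected by the substring test because "satisfactory" contains "factory". The word-start variant rejects it. Which behaviour is preferable depends on how much noise the later keyphrase step can tolerate.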

# Generate Keyphrase networks

In [8]:
!pip install keybert keyphrase_vectorizers


In [9]:
%%time

from keybert import KeyBERT
from keyphrase_vectorizers import KeyphraseCountVectorizer, KeyphraseTfidfVectorizer

# Init KeyBERT
kw_model = KeyBERT(model = "all-MiniLM-L6-v2")

keyphrases = []

docs = relevant_paragraphs

ndocs = len(docs)
max_df_prop = 0.5
min_df_prop = 0.0025
min_df_docs = max(1, int(min_df_prop * ndocs))  # keep the threshold at one document or more
top_n = 15

#kp_vectorizer = KeyphraseTfidfVectorizer
kp_vectorizer = KeyphraseCountVectorizer

for d, kp in zip(docs, kw_model.extract_keywords(docs=docs,
                                                 vectorizer=kp_vectorizer(
                                                        max_df = int(max_df_prop*ndocs),
                                                        min_df = min_df_docs),
                                                 top_n=top_n, highlight=True)):
    keyphrases.append([d] + [kp])



CPU times: user 7min 30s, sys: 9.69 s, total: 7min 40s
Wall time: 7min 51s
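
The `min_df` and `max_df` arguments passed to the vectorizer are document counts derived from proportions of the corpus size. The sketch below shows that arithmetic and what such a filter does to candidate phrases, using only the standard library. It assumes both thresholds are inclusive bounds on the number of documents containing a phrase. The functions, the corpus size of 1000 and the toy phrase lists are all hypothetical, and the vectorizer's own filtering happens inside the library.

```python
from collections import Counter

def df_thresholds(ndocs, min_df_prop, max_df_prop):
    # Turn proportions into absolute document counts, as the cell above does.
    return max(1, int(min_df_prop * ndocs)), int(max_df_prop * ndocs)

def filter_by_document_frequency(docs_phrases, min_df, max_df):
    # Count each phrase once per document, then keep phrases whose document
    # frequency lies within [min_df, max_df] (assumed inclusive).
    doc_freq = Counter(
        phrase for phrases in docs_phrases for phrase in set(phrases)
    )
    return {p for p, n in doc_freq.items() if min_df <= n <= max_df}

# Hypothetical corpus size, using the notebook's proportions.
print(df_thresholds(1000, 0.0025, 0.5))

# Toy candidate phrases for four documents.
toy_docs = [
    ["cotton mill", "parish"],
    ["cotton mill", "parish", "steam engine"],
    ["parish", "linen"],
    ["parish", "cotton mill"],
]
kept_phrases = filter_by_document_frequency(toy_docs, min_df=2, max_df=3)
print(kept_phrases)
```

In the toy data "parish" is dropped for being too common and "steam engine" and "linen" for being too rare, which is the intended effect of the two thresholds.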


In [10]:
for k in keyphrases:
  if "manners" in [kp[0] for kp in k[1]]:
    print(k[0])
    print(k[1])

In [11]:
kp_edges = []
for d in keyphrases:
    for kp in d[1]:
        kp_edges.append([d[0], kp[0]])

print(len(kp_edges))

63514


In [12]:
from itertools import combinations
from collections import defaultdict

edge_count_kph = defaultdict(int)

for doc in keyphrases:
    for c in combinations([kph[0] for kph in doc[1]], 2):
        # Sort each pair so (a, b) and (b, a) count as the same edge
        edge_count_kph[tuple(sorted(c))] += 1

In [13]:
min_count_edge = 3

edge_count_kph_popular = {k: v for k, v in edge_count_kph.items() if v >= min_count_edge}
len(edge_count_kph_popular)

30686
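
The edge counting can be followed by hand on a toy example. The sketch below counts how often two keyphrases are extracted from the same document and then keeps only the pairs that reach a minimum count. The keyphrase lists and the threshold of 2 are made up for illustration. Sorting before pairing has the same effect as sorting each pair in the cell above.

```python
from collections import Counter
from itertools import combinations

# Toy input: the keyphrases extracted from three documents.
toy_keyphrases = [
    ["mill", "cotton", "water"],
    ["cotton", "mill"],
    ["water", "mill"],
]

pair_counts = Counter()
for phrases in toy_keyphrases:
    # Sorting first gives every pair a canonical order, so (a, b) and
    # (b, a) land on the same key.
    for pair in combinations(sorted(set(phrases)), 2):
        pair_counts[pair] += 1

min_count = 2  # illustrative threshold
popular_pairs = {pair: n for pair, n in pair_counts.items() if n >= min_count}
print(popular_pairs)
```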

In [14]:
import networkx as nx

Gkwp = nx.Graph()

for (source, target), count in edge_count_kph_popular.items():
    Gkwp.add_edge(source, target, weight=count)

Gcc = sorted(nx.connected_components(Gkwp), key=len, reverse=True)
G0 = Gkwp.subgraph(Gcc[0])
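
Keeping only the largest connected component discards small islands of keyphrases that never co-occur with the main body of the network. To make the idea concrete without depending on networkx, here is a plain breadth-first search over a toy edge dictionary. It is a standard-library stand-in for illustration, not the routine networkx runs, and the edge data is made up.

```python
from collections import defaultdict, deque

def largest_component(weighted_edges):
    """Return the node set of the largest connected component."""
    adjacency = defaultdict(set)
    for a, b in weighted_edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    best = set()
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search from an unvisited node collects one component.
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in component:
                    component.add(neighbour)
                    queue.append(neighbour)
        seen |= component
        if len(component) > len(best):
            best = component
    return best

# Toy edges in the same shape as edge_count_kph_popular: (a, b) -> count.
toy_edges = {("cotton", "mill"): 3, ("mill", "water"): 4, ("river", "salmon"): 5}
main_nodes = largest_component(toy_edges)
print(sorted(main_nodes))
```

Edge weights play no role in connectivity, so the isolated "river" and "salmon" pair is dropped even though it has the highest count.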

## Export to Gephi

In [15]:
from google.colab import files

# Encode the run parameters in the file name so exports can be told apart
filename = (
    f"SA_{kp_vectorizer.__name__}_w-{words[0]}_ndocs-{ndocs}"
    f"_maxdf-{max_df_prop}_mindf-{min_df_docs}_topn-{top_n}"
    f"_mincountedge-{min_count_edge}.gexf"
)
nx.write_gexf(G0, filename)
files.download(filename)


In [16]:
[k for k in keyphrases if any([p[0] == "game" for p in k[1]])]

[["state of crime at the present time with that in any other portion of our parish history during the course of the seventeenth century.  The superrtitions which, from the same authority, we find to have then infected both clergy and people, are now generally ridiculed.  If any trace of superstition still remain, it is rather practical than speculative, as in observing festival days, or concealing a child's name until the baptism, and seems rather the result of habit than of any religious prepossession.   Poaching in game prevails to a considerable extent, but much more among quarriers and manufacturers than the permanent inha. bitants of the district.  There is no poaching on the salmon-fisheries, which in this parish are of very little value.",
  [('salmon', 0.3985),
   ('superstition', 0.3613),
   ('clergy', 0.2521),
   ('seventeenth century', 0.2472),
   ('considerable extent', 0.1545),
   ('authority', 0.1515),
   ('game', 0.118),
   ('habit', 0.1174),
   ('century', 0.1148),
   (