<a href="https://colab.research.google.com/github/jazoza/cultural-data-analysis/blob/main/04_CDA_HH_midterm_synthesis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Cultural Data Analysis

Introduction to working with datasets

In [None]:
# import necessary libraries
import os, re, csv
import numpy as np
import pandas as pd

## Loading the datasets: heritage homes websites

The dataset is stored in a shared Google Drive folder:
https://drive.google.com/drive/folders/11Shm0edDOiWrOe56fzJQRZi-v_BPSW8E?usp=drive_link

Add it to your drive.

To access it, mount your Google Drive under 'Files' (see the left pane of the notebook in Google Colab) and navigate to the shared folder. You may need to click 'refresh' to make it appear in the list.

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Clone the GitHub repository where this notebook is stored so that its files are available in the Colab session. This makes it easier to import the stopwords, URL lists and other additional data we need.

In [None]:
!git clone https://github.com/jazoza/cultural-data-analysis

## Import all datasets (4 countries)

You will have all datasets available for analysis and comparison, mapped in the following way:

> df0 - Dutch dataset

> df1 - UK dataset

> df2 - German dataset

> df3 - French dataset

In [None]:
# Country codes for the four datasets, in the order df0-df3: Dutch, UK, German, French
cc_list = ['NL', 'UK', 'DE', 'FR']

In [None]:
gdrive_path = '/content/gdrive/MyDrive/CDA/'

In [None]:
# Import scraped json data into 4 separate dataframes
df0=pd.read_json(gdrive_path+cc_list[0]+'_dataset_website-content-crawler.json')
# select columns for analysis: url, text, metadata
df0=df0[['url','text','metadata']]

df1=pd.read_json(gdrive_path+cc_list[1]+'_dataset_website-content-crawler.json')
# select columns for analysis: url, text, metadata
df1=df1[['url','text','metadata']]

df2=pd.read_json(gdrive_path+cc_list[2]+'_dataset_website-content-crawler.json')
# select columns for analysis: url, text, metadata
df2=df2[['url','text','metadata']]

df3=pd.read_json(gdrive_path+cc_list[3]+'_dataset_website-content-crawler.json')
# select columns for analysis: url, text, metadata
df3=df3[['url','text','metadata']]

df0.head()
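
The four loading blocks above differ only in the country code. As a minimal sketch of how the file paths follow from `cc_list`, the snippet below builds the same paths in a loop. The dictionary name `dataset_paths` is an assumption introduced for illustration; it does not appear elsewhere in this notebook.

```python
# Sketch: derive the four dataset paths from the country codes.
# 'dataset_paths' is a hypothetical helper name, not part of the notebook.
cc_list = ['NL', 'UK', 'DE', 'FR']
gdrive_path = '/content/gdrive/MyDrive/CDA/'

dataset_paths = {
    f"df{i}": f"{gdrive_path}{cc}_dataset_website-content-crawler.json"
    for i, cc in enumerate(cc_list)
}

for name, path in dataset_paths.items():
    # each path could be passed to pd.read_json(path) as in the cell above
    print(name, '->', path)
```

The keys `df0` to `df3` follow the same mapping as listed above (Dutch, UK, German, French).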


To group all pages from the same domain into one entry in the analysis, add a new column that contains only the main domain name.

In [None]:
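# function to extract the main domain from the url in the dataset
def extract_main_domain(url):
    # skip missing or non-string values (e.g. NaN)
    if not isinstance(url, str):
        print('NOT VALID', url)
        return None
    # match the first dotted host name in the url, e.g. 'www.example.nl'
    match = re.findall(r'(?:\w+\.)*\w+\.\w*', url)
    # remove only a leading 'www.' prefix; str.lstrip('www.') would strip any leading 'w' and '.' characters
    return re.sub(r'^www\.', '', match[0]) if match else None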
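
As an alternative to a regular expression, the host name can also be read with Python's standard `urllib.parse` module. The following is a minimal sketch under that assumption, not the method used in this notebook; the function name `domain_from_url` and the example URLs are hypothetical.

```python
from urllib.parse import urlparse

def domain_from_url(url):
    """Return the host name of a URL without a leading 'www.', or None."""
    if not isinstance(url, str) or not url:
        return None
    # urlparse only recognises the host when a scheme or '//' is present
    parsed = urlparse(url if '//' in url else '//' + url)
    host = parsed.hostname
    if not host:
        return None
    # remove the prefix explicitly; str.lstrip removes a set of characters, not a prefix
    return host[4:] if host.startswith('www.') else host

print(domain_from_url('https://www.example.nl/tuin?x=1'))
print(domain_from_url('web.example.org/page'))
```

Removing the prefix with `startswith` avoids a known pitfall: `'www.web.nl'.lstrip('www.')` returns `'eb.nl'`, because `lstrip` strips every leading `w` and `.` character.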

In [None]:
# Add a new column 'domain' and fill it by applying the extract_main_domain function to the 'url' column

# first, create a mapping of dataframes which could be addressed in a loop
df_dict = {'0':df0, '1':df1, '2':df2, '3':df3}

# then, loop through the df_dict to update each dataframe
for k, v in df_dict.items():
  cc_column = cc_list[int(k[-1])]+' domains'
  cc = cc_list[int(k[-1])]
  # print(cc_column, cc)
  urls = pd.read_csv(gdrive_path+'url_lists/'+cc_list[int(k[-1])]+'_urls.csv')[cc_column].values.tolist()
  domains = {extract_main_domain(url) for url in urls if extract_main_domain(url) is not None}
  matching_links = [link for link in v.url if extract_main_domain(link) in domains]
  # update the dataframe
  v['domain'] = v['url'].apply(extract_main_domain)

# check one of the dataframes
df1.head()

## Prepare the analysis

Import stopword dictionaries for the 4 languages we work with.
It is good to import all of them in our case, because many websites have sections in English, German or French even when this is not the main language of the website.

In [None]:
# function to load a list of stopwords from a text file (one word per line)
def get_stopwords_list(stop_file_path):
    """load stop words """
    with open(stop_file_path, 'r', encoding="utf-8") as f:
        stopwords = f.readlines()
        stop_set = set(m.strip() for m in stopwords)
        return list(frozenset(stop_set))

In [None]:
# Get the stopwords list for all languages (using cc_list previously defined)
# cc_list = ['NL', 'UK', 'DE', 'FR'] # remove the hashtag from this line to uncomment this code and make it run

stopwords = [] # empty list to which a list of stopwords will be appended in loop

for i in range(len(cc_list)):
  stopwords_cc_path = "/content/cultural-data-analysis/stopwords_archive/"+cc_list[i]+".txt"
  stopwords_cc = get_stopwords_list(stopwords_cc_path)
  #print(len(stopwords_cc)) # print how many words are in the list
  stopwords.extend(stopwords_cc)

#print(len(stopwords)) # print how many words are in all stopwords lists

In [None]:
# you may need to include additional words which you notice as too frequent
special_stop_words = ['nbsp', 'nl', 'fr', 'de', 'uk', 'com', 'www', 'lit'] # these might appear frequently as 'terms' in the corpus, so it's good to filter them
stopwords_ext = stopwords+special_stop_words

## 1. Term frequency

The cells below compute a document-term matrix and calculate the frequency of each unique word (token) in the corpus.

This can be done for ALL words in the corpus, or only for ALL MEANINGFUL words (without so-called stop-words like 'the' or 'het').
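
To make the difference between the two counts concrete, here is a minimal sketch that uses only the Python standard library. The toy corpus and the small stopword set are assumptions made up for illustration; the cells below use `CountVectorizer` on the real dataset instead.

```python
import re
from collections import Counter

# hypothetical toy corpus and stopword set, for illustration only
toy_corpus = [
    "Het kasteel en de tuin",
    "De tuin van het kasteel is open",
    "Het park",
]
toy_stopwords = {"het", "de", "en", "van", "is"}

# tokenize: lowercase words made of letters, at least 2 characters long
tokens = [
    token
    for doc in toy_corpus
    for token in re.findall(r"\b[A-Za-z]{2,}\b", doc.lower())
]

freq_all = Counter(tokens)                                              # all terms
freq_stopped = Counter(t for t in tokens if t not in toy_stopwords)     # without stopwords

print(freq_all.most_common(3))
print(freq_stopped.most_common(3))
```

In the unfiltered count the function words dominate; after filtering, only content words such as 'kasteel' and 'tuin' remain.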

In [None]:
# CALCULATE TERM FREQUENCY OF ALL TERMS
from sklearn.feature_extraction.text import CountVectorizer

# convert the text documents into a matrix of token (word) counts
cvec_all = CountVectorizer().fit(df0.text) #### CHANGE df0 TO THE DATAFRAME YOU ANALYSE
df_matrix_all = cvec_all.transform(df0.text) #### CHANGE df0 TO THE DATAFRAME YOU ANALYSE
df_all = np.sum(df_matrix_all,axis=0)
terms = np.squeeze(np.asarray(df_all))
# print the 'shape' of the matrix - it should indicate the number of unique terms
print(terms.shape)
term_freq_df_all = pd.DataFrame([terms],columns=cvec_all.get_feature_names_out()).transpose() #term_freq_df is with stopwords
term_freq_df_all.columns = ['terms']
# show the first ten words [:10];
# change the values in the brackets to show 30th-40th words [30:40]
# or show the last ten words [-10:]
term_freq_df_all.sort_values(by='terms', ascending=False).iloc[:10]

In [None]:
# CALCULATE TERM FREQUENCY WITHOUT STOP-WORDS

#cvec_stopped = CountVectorizer(max_df=0.5, token_pattern=r'(?u)\b[A-Za-z]{2,}\b') # max_df could in theory automatically filter stopwords
cvec_stopped = CountVectorizer(stop_words=stopwords_ext, token_pattern=r'(?u)\b[A-Za-z]{2,}\b') # token pattern recognizes only words which are made of letters, and longer than 1 character
cvec_stopped.fit(df0.text) #### CHANGE df0 TO THE DATAFRAME YOU ANALYSE
document_matrix = cvec_stopped.transform(df0.text) #### CHANGE df0 TO THE DATAFRAME YOU ANALYSE
term_batches = np.linspace(0,document_matrix.shape[0],10).astype(int)
i=0
df_stopped = []
while i < len(term_batches)-1:
    batch_result = np.sum(document_matrix[term_batches[i]:term_batches[i+1]].toarray(),axis=0)
    df_stopped.append(batch_result)
    print(term_batches[i+1],"entries' term frequency calculated")
    i += 1

terms_stopped = np.sum(df_stopped,axis=0)
#print(terms_stopped.shape)
term_freq_df_stopped = pd.DataFrame([terms_stopped],columns=cvec_stopped.get_feature_names_out()).transpose()
term_freq_df_stopped.columns = ['terms']
term_freq_df_stopped.sort_values(by='terms', ascending=False).iloc[:10]


### 1.1 Term frequency for specific terms

In [None]:
search_word = 'kasteel' # Change this to the word you want to search for

if search_word in term_freq_df_stopped.index: # check if the word exists
    frequency = term_freq_df_stopped.loc[search_word, 'terms']
    print(f"The word '{search_word}' appears {frequency} times in the current corpus.")
elif search_word in stopwords_ext and search_word in term_freq_df_all.index:
    # If not found in stopped, it was filtered as a stop word, so look it up among all terms
    frequency = term_freq_df_all.loc[search_word, 'terms']
    print(f"The word '{search_word}' was filtered out as a stopword. Its total frequency is {frequency} times.")
else:
    print(f"The word '{search_word}' was not found in the corpus.")

### 1.2 TF-IDF vectorization

- What is TF/IDF (term frequency / inverse document frequency)? https://en.wikipedia.org/wiki/Tf%E2%80%93idf
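
As a worked illustration of the idea, the sketch below computes the textbook form of TF-IDF (relative term frequency multiplied by the logarithm of the inverse document frequency) on a made-up three-document corpus. The document names and texts are assumptions for illustration. Note that scikit-learn's `TfidfVectorizer`, used in the next cell, applies by default a smoothed variant of the idf and normalises each row, so its scores will differ from these numbers.

```python
import math
from collections import Counter

# hypothetical toy documents, for illustration only
docs = {
    "site-a": "kasteel tuin kasteel museum",
    "site-b": "tuin park tuin",
    "site-c": "museum kasteel tuin",
}

tokenized = {name: text.split() for name, text in docs.items()}
n_docs = len(tokenized)

# document frequency: in how many documents does each term occur?
doc_freq = Counter()
for tokens in tokenized.values():
    doc_freq.update(set(tokens))

def tfidf(term, tokens):
    tf = tokens.count(term) / len(tokens)        # relative frequency in this document
    idf = math.log(n_docs / doc_freq[term])      # rarer across documents -> higher weight
    return tf * idf

scores = {
    name: {term: round(tfidf(term, tokens), 3) for term in set(tokens)}
    for name, tokens in tokenized.items()
}

for name, term_scores in scores.items():
    print(name, sorted(term_scores.items(), key=lambda kv: kv[1], reverse=True))
```

In this toy corpus 'tuin' occurs in every document, so its textbook TF-IDF is zero everywhere, while 'park', which occurs in only one document, gets the highest score there.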

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the TfidfVectorizer
vectorizer = TfidfVectorizer(stop_words=stopwords_ext, token_pattern=r'(?u)\b[A-Za-z]{2,}\b')
# Fit and transform the text data
tfidf_matrix = vectorizer.fit_transform(df0['text']) #### CHANGE df0 TO THE DATAFRAME YOU ANALYSE
# Convert the TF-IDF matrix to a DataFrame
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())
# Use the domain of each page as index
tfidf_df.index = df0['domain'] #### CHANGE df0 TO THE DATAFRAME YOU ANALYSE
# Print the TF-IDF DataFrame
tfidf_df.head()

In [None]:
# Function to transform the wide TF-IDF DataFrame to a long format
def create_long_tfidf_df_efficiently(tfidf_wide_df):
    data = []
    for domain, row in tfidf_wide_df.iterrows():
        # Get non-zero TF-IDF scores and their corresponding terms to reduce data processing
        active_terms = row[row > 0]
        for term, tfidf_score in active_terms.items():
            data.append({'document': domain, 'term': term, 'tfidf': tfidf_score})
    return pd.DataFrame(data)

# Reorganize the DataFrame from wide to long format using the efficient function
tfidf_df = create_long_tfidf_df_efficiently(tfidf_df)
tfidf_df.head()

In [None]:
import altair as alt

# Terms in this list will get a red dot in the visualization
term_list = ['kasteel', 'huis', 'children'] # write key terms here

In [None]:
import altair as alt

# Calculate top 10 TF-IDF terms per domain
top_tfidf_plus = tfidf_df.groupby('document').apply(lambda x: x.nlargest(10, 'tfidf')).reset_index(drop=True)

# Add 'rank' based on tfidf score within each document
top_tfidf_plus['rank'] = top_tfidf_plus.groupby('document')['tfidf'].rank(method='first', ascending=False).astype(int)

# adding a little randomness to break ties in term ranking
top_tfidf_plusRand = top_tfidf_plus.copy()
top_tfidf_plusRand['tfidf'] = top_tfidf_plusRand['tfidf'] + np.random.rand(top_tfidf_plus.shape[0])*0.0001

# Define the base Altair chart
base = alt.Chart(top_tfidf_plusRand).encode(
    x=alt.X('rank:O', axis=alt.Axis(title='Rank (Top 10 Terms)')),
    y=alt.Y('document:N', axis=alt.Axis(title='Domain'))
).transform_window(
    rank="rank()",
    sort=[alt.SortField("tfidf", order="descending")],
    groupby=["document"]
)

# Create the heatmap layer
heatmap = base.mark_rect().encode(
    color=alt.Color('tfidf:Q', scale=alt.Scale(scheme='yellowgreenblue'), title='TF-IDF Score')
)

# Create the text layer with white text
text = base.mark_text(baseline='middle').encode(
    text='term:N',
    color=alt.value('white') # Explicitly set text color to white
)

# Combine the heatmap and text layers and set properties
chart = (heatmap + text).properties(
    title='Top 10 TF-IDF Terms per Domain',
    width=600,
    height=alt.Step(25) # Adjust height based on number of documents
)

chart

In [None]:
#inspect problems by printing the entire website (document)

document = 'artland.top' # replace with one of the website domains on the left

print("TF-IDF entries for ", document, ": ")
display(tfidf_df[tfidf_df['document'] == document].sort_values(by='tfidf', ascending=False))

print("\nOriginal text entries for", document, ": ")
# Filter the original DataFrame (df0) for the selected domain and join the text of all its pages
domain_text = df0[df0['domain'] == document]['text'].str.cat(sep=' ')
print(domain_text)

### 1.3 Word2Vec model

Vectorize the corpus with a Word2Vec model:
https://en.wikipedia.org/wiki/Word2vec

In [None]:
!pip install gensim

In [None]:
import nltk
nltk.download('punkt_tab')

In [None]:
import gensim
from nltk.tokenize import word_tokenize

# X is a list of tokenized texts (i.e. list of lists of tokens)
X = [word_tokenize(item) for item in df0.text.tolist()] # replace df0 with a dataframe you are analysing
#print(X[0:3])
model = gensim.models.Word2Vec(X, min_count=6, vector_size=200) # min_count: ignore words that appear fewer than this many times in the corpus; vector_size: number of dimensions

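Observe keywords that may be characteristic of the corpus on heritage homes, such as 'castle', 'garden', 'party', 'princess'; also try words related to less obvious themes, like 'servant'.

You can ask for 'positive' or 'negative' similarity. A 'positive' query returns the terms whose vectors are closest to the query word; a 'negative' query returns terms whose vectors point away from it. These are not necessarily opposites in meaning, but they can differ from the query word in a variety of ways worth exploring.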
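
Similarity between word vectors is measured with cosine similarity, which is the score gensim reports in `most_similar`. The sketch below shows the calculation on three made-up 3-dimensional vectors; the vectors and the helper names `cosine_similarity` and `rank_by_similarity` are assumptions for illustration, whereas the real model above uses 200 dimensions learned from the corpus.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# hypothetical toy vectors, for illustration only
toy_vectors = {
    "kasteel": [0.9, 0.1, 0.0],
    "slot":    [0.8, 0.2, 0.1],
    "fiets":   [0.0, 0.1, 0.9],
}

def rank_by_similarity(word, vectors, topn=2):
    """Rank all other words by cosine similarity to the given word."""
    target = vectors[word]
    scored = [
        (other, cosine_similarity(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

ranking = rank_by_similarity("kasteel", toy_vectors)
print(ranking)
```

With these toy vectors 'slot' ranks above 'fiets', because its vector points in nearly the same direction as the one for 'kasteel'.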


In [None]:
model.wv.most_similar(positive=["kasteel"], topn=12)

In [None]:
model.wv.most_similar(positive=["tuin"], topn=12)

In [None]:
model.wv.most_similar(negative=["baron"], topn=12)

## 2. Collocations

### 2.1 Analyze specific collocations

In [None]:
# define vectorization functions

from sklearn.feature_extraction.text import CountVectorizer
from collections import Counter
from itertools import islice

# SCIKIT-LEARN method, produces lists of co-occurrences for specific terms
def vectorize_text(df):
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(df['text'])
    return X, vectorizer

def find_collocations(text, target_words):
    words = text.split()
    collocations = []
    for i in range(len(words) - 1):
        if words[i] in target_words:
            collocations.append((words[i], words[i + 1]))
        if words[i + 1] in target_words:
            collocations.append((words[i + 1], words[i]))
    return collocations

def get_frequent_collocations(df, most_frequent_words):
    collocations = []
    for text in df['text']:
        collocations.extend(find_collocations(text, most_frequent_words))
    collocation_counts = Counter(collocations)
    frequent_collocations = {}
    for word in most_frequent_words:
        word_collocations = {collocation: count for collocation, count in collocation_counts.items() if word in collocation}
        frequent_collocations[word] = dict(islice(Counter(word_collocations).most_common(20), 20)) # change these two values to get more or less terms
    return frequent_collocations

def analyze_word_collocations(df):
    X, vectorizer = vectorize_text(df)
    most_frequent_words = search_words
    frequent_collocations = get_frequent_collocations(df, most_frequent_words)
    return frequent_collocations
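
To see what this neighbour-based approach counts, here is a minimal self-contained sketch on one made-up sentence. The helper name `neighbour_pairs` and the sample sentence are assumptions for illustration; the logic mirrors `find_collocations` above, pairing each target word with the word directly before and after it.

```python
from collections import Counter

def neighbour_pairs(text, target_words):
    """Return (target, neighbour) pairs for words directly next to a target word."""
    words = text.split()
    pairs = []
    for left, right in zip(words, words[1:]):
        if left in target_words:
            pairs.append((left, right))   # neighbour on the right
        if right in target_words:
            pairs.append((right, left))   # neighbour on the left
    return pairs

# hypothetical sample sentence, for illustration only
sample = "het kasteel heeft een tuin bij het kasteel"
counts = Counter(neighbour_pairs(sample, {"kasteel"}))
print(counts.most_common())
```

Because the text is only split on whitespace, frequent function words such as 'het' show up as the most common neighbours unless they are filtered out separately.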

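Define the search term here, then analyse whether it appears in the corpus and which words appear directly next to it. Note that this simple approach splits the text on whitespace and does not filter out stopwords.

In [None]:
# search for words from this list or use another list
# (this cell must run before the analysis below, which reads search_words)
search_words = ['kasteel']

In [None]:
collocations = analyze_word_collocations(df0) # CHANGE df0 TO DATAFRAME YOU ARE ANALYSING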

In [None]:
data = []
for word, colloc_dict in collocations.items():
   for collocation, count in colloc_dict.items():
       #collocation_str = ' '.join(collocation)  # Join collocation words into a single string
       data.append([word, collocation[1], count])
collocations_df = pd.DataFrame(data, columns=['Word', 'Collocation', 'Count'])
print(collocations_df.to_markdown(index=True))

#### 2.1.1. Analyze collocation in page titles

In [None]:
# add a column 'page_title' to the dataframe (df0 or df1-3) extracting the value of 'title' key in metadata dictionary in each entry (lambda function)

df0['page_title'] = df0['metadata'].apply(lambda x: x.get('title'))
df0['page_title'].head()

In [None]:
search_term = 'baron'
total_occurrences = 0

print("Searching for", search_term, ":")
for index, title in df0['page_title'].items():
    if isinstance(title, str):
        # Convert to lowercase for case-insensitive search
        title_lower = title.lower()
        # Count occurrences in the current title
        occurrences_in_title = title_lower.count(search_term)
        if occurrences_in_title > 0:
            print(title)
            total_occurrences += occurrences_in_title

print("\nTotal occurrences of", search_term, "across all titles: ", total_occurrences)

### 2.2 Analyse collocations in sentences

In [None]:
#function to remove non-ascii characters
def _removeNonAscii(s): return "".join(i for i in s if ord(i)<128)

In [None]:
# import the advanced Natural Language Processing (NLP) library
# which we will use to analyze the grammatical structure
import spacy

In [None]:
# download the suitable language pipeline
# Dutch: nl_core_news_sm
# French: fr_core_news_sm
# German: de_core_news_sm
# English: en_core_web_sm (available by default)
!python -m spacy download nl_core_news_sm

In [None]:
nlp = spacy.load('nl_core_news_sm') # change to FR/DE/EN code module, see names above

In [None]:
import string

#function to clean and lemmatize documents
def clean_documents(text):
    #replace punctuation, tabs and line breaks with spaces
    regex = re.compile('[' + re.escape(string.punctuation) + '\\r\\t\\n]')
    nopunct = regex.sub(" ", str(text))
    #use spacy to lemmatize the document (parser and NER are not needed here)
    doc = nlp(nopunct, disable=['parser','ner'])
    lemma = [token.lemma_ for token in doc]
    return lemma
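
The punctuation-removal step can be tried on its own, without loading a spaCy pipeline. The sketch below is a minimal illustration using only the standard library; the helper name `strip_punctuation` and the sample sentence are assumptions, and no lemmatization is performed here.

```python
import re
import string

# same character class as in clean_documents: all ASCII punctuation plus \r, \t, \n
PUNCT_RE = re.compile('[' + re.escape(string.punctuation) + r'\r\t\n]')

def strip_punctuation(text):
    """Replace punctuation, tabs and line breaks with spaces."""
    return PUNCT_RE.sub(" ", str(text))

# hypothetical sample text, for illustration only
sample = "Het kasteel (gebouwd in 1650) is open;\nkom langs!"
tokens = strip_punctuation(sample).lower().split()
print(tokens)
```

Replacing punctuation with a space rather than deleting it keeps words apart that were separated only by a punctuation mark or a line break.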

In [None]:
#apply function to clean and lemmatize documents
lemmatized = df0.text.map(clean_documents) # CHANGE df0 TO DATAFRAME YOU ARE ANALYSING
#make sure to lowercase everything
lemmatized = lemmatized.map(lambda x: [word.lower() for word in x])
lemmatized.head()

In [None]:
unlist_documents = [item for items in lemmatized for item in items]

In [None]:
# You would use these commands to save lemmatized text into a 'pickle' for later use
# The current setup does not enable you to overwrite existing files in the CDA/jar folder,
# so you would have to save the 'pickle' files elsewhere (for example in sample_data folder)
# If you want to reactivate this code, remove the triple quotes ''' from the beginning and end
'''
import pickle

# save these outputs for later
with open(gdrive_path+'jar/lemmatized.pickle', 'wb') as handle_l:
    pickle.dump(lemmatized, handle_l, protocol=pickle.HIGHEST_PROTOCOL)

with open(gdrive_path+'jar/unlist_documents.pickle', 'wb') as handle_u:
    pickle.dump(unlist_documents, handle_u, protocol=pickle.HIGHEST_PROTOCOL)
'''

In [None]:
'''
import pickle

# load saved pickles
with open(gdrive_path+'jar/'+cc+'_lemmatized.pickle', 'rb') as handle_l:
    lemmatized = pickle.load(handle_l)

with open(gdrive_path+'jar/'+cc+'_unlist_documents.pickle', 'rb') as handle_u:
    unlist_documents = pickle.load(handle_u)
'''

In [None]:
nltk.download('averaged_perceptron_tagger_eng')

In [None]:
# initiate bigrams and trigrams
bigrams = nltk.collocations.BigramAssocMeasures()
trigrams = nltk.collocations.TrigramAssocMeasures()

In [None]:
# identify all collocations in the flat list of words from all documents
bigramFinder = nltk.collocations.BigramCollocationFinder.from_words(unlist_documents)
trigramFinder = nltk.collocations.TrigramCollocationFinder.from_words(unlist_documents)

Calculate basic frequency

In [None]:
bigram_freq = bigramFinder.ngram_fd.items()

In [None]:
bigramFreqTable = pd.DataFrame(list(bigram_freq), columns=['bigram','freq']).sort_values(by='freq', ascending=False)

In [None]:
bigramFreqTable.head().reset_index(drop=True)

In [None]:
# compute basic trigrams frequency
trigram_freq = trigramFinder.ngram_fd.items()
trigramFreqTable = pd.DataFrame(list(trigram_freq), columns=['trigram','freq']).sort_values(by='freq', ascending=False)
trigramFreqTable[:10]

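Find meaningful bi- and tri-grams by keeping only combinations of adjectives and nouns, using NLTK's part-of-speech tagger. Note that the tagger downloaded above is the English one, so its tags are likely to be less reliable for Dutch, German or French text.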

In [None]:
#function to filter for ADJ/NN bigrams
def rightTypes(ngram):
    for word in ngram:
        if word in stopwords_ext:
            return False
    acceptable_types = ('JJ', 'JJR', 'JJS', 'NN', 'NNS', 'NNP', 'NNPS')
    second_type = ('NN', 'NNS', 'NNP', 'NNPS')
    tags = nltk.pos_tag(ngram)
    if tags[0][1] in acceptable_types and tags[1][1] in second_type:
        return True
    else:
        return False

In [None]:
#filter bigrams
filtered_bi = bigramFreqTable[bigramFreqTable.bigram.map(lambda x: rightTypes(x))]

In [None]:
filtered_bi[:10]

Use statistical measures like the chi-square test to identify meaningful collocations:
https://en.wikipedia.org/wiki/Chi-squared_test
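
To illustrate what such a score measures, the sketch below computes Pearson's chi-square for one bigram from a 2x2 contingency table (how often the two words occur together versus apart). The helper name `bigram_chi_square` and the toy token list are assumptions for illustration; NLTK's `chi_sq` measure is based on the same kind of table, although its implementation may differ in detail.

```python
def bigram_chi_square(tokens, w1, w2):
    """Pearson chi-square for the bigram (w1, w2) over all adjacent word pairs."""
    bigrams = list(zip(tokens, tokens[1:]))
    n = len(bigrams)
    a = sum(1 for x, y in bigrams if x == w1 and y == w2)   # w1 followed by w2
    b = sum(1 for x, y in bigrams if x == w1 and y != w2)   # w1 followed by another word
    c = sum(1 for x, y in bigrams if x != w1 and y == w2)   # another word followed by w2
    d = n - a - b - c                                        # neither w1 first nor w2 second
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator

# hypothetical toy token list, for illustration only
toy_tokens = ["oude", "kasteel", "grote", "tuin", "oude", "kasteel",
              "kleine", "tuin", "oude", "kasteel"]

print(bigram_chi_square(toy_tokens, "oude", "kasteel"))
print(bigram_chi_square(toy_tokens, "kasteel", "grote"))
```

In this toy list 'oude' is always followed by 'kasteel' and 'kasteel' is always preceded by 'oude', so the pair receives the maximum score (equal to the number of bigrams), while 'kasteel grote' scores lower because 'kasteel' is also followed by another word.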

In [None]:
# filter bigrams using chi-square
bigramChiTable = pd.DataFrame(list(bigramFinder.score_ngrams(bigrams.chi_sq)), columns=['bigram','chi-sq']).sort_values(by='chi-sq', ascending=False)
bigramChiTable.head()

In [None]:
# find meaningful trigrams by filtering basic frequency table
# function to filter trigrams
def rightTypesTri(ngram):
    if '-pron-' in ngram or '' in ngram or ' ' in ngram or '  ' in ngram or 't' in ngram:
        return False
    for word in ngram:
        if word in stopwords_ext:
            return False
    first_type = ('JJ', 'JJR', 'JJS', 'NN', 'NNS', 'NNP', 'NNPS')
    third_type = ('JJ', 'JJR', 'JJS', 'NN', 'NNS', 'NNP', 'NNPS')
    tags = nltk.pos_tag(ngram)
    if tags[0][1] in first_type and tags[2][1] in third_type:
        return True
    else:
        return False

In [None]:
filtered_tri = trigramFreqTable[trigramFreqTable.trigram.map(lambda x: rightTypesTri(x))]
filtered_tri[:10]

In [None]:
# Chi-square score calculation for trigrams
trigramChiTable = pd.DataFrame(list(trigramFinder.score_ngrams(trigrams.chi_sq)), columns=['trigram','chi-sq']).sort_values(by='chi-sq', ascending=False)
trigramChiTable.head(20)