In [1]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer

In [31]:
letters_raw = pd.read_excel('cleaned_van_gogh_letters.xlsx')
letters_raw.head()

Unnamed: 0,Letter,From,To,Location,Date,Original Text,Translation,Note 1,Note 2,Note 3,...,Note 119,Note 120,Note 121,Note 122,Note 123,Note 124,Note 125,Note 126,Note 127,Note 128
0,191,Vincent van Gogh,Theo van Gogh,The Hague,"between Thursday, 1 and Saturday, 3 December 1881","den Haag Dec 1881. Waarde Theo, Zoo als ge zie...","The Hague, Dec. 1881. My dear Theo, As you see...",Mauve the Strickers Regarding ’s plan to visit...,"F 63 / JH 920 (). See cat. Amsterdam 1999, p. ...",Mauve The boarding-house must therefore have b...,...,,,,,,,,,,
1,554,Vincent van Gogh,Theo van Gogh,Antwerp,"on or about Friday, 22 January 1886","Waarde Theo, Een paar dagen heb ik nu ginder g...","My dear Theo, I’ve been painting there for a f...",It emerges from that Van Gogh painted a large ...,Vinck In the ‘Classical Statues’ class with :,See for this painting class: cat. Amsterdam 20...,...,,,,,,,,,,
2,238,Vincent van Gogh,Theo van Gogh,The Hague,"Friday, 9 June 1882","Waarde Theo, Weinig dingen hebben in den laats...","My dear Theo, Few things have given me so much...",Sien father When exactly left is not known. On...,Means: ‘het ziekenhuis’ (the hospital).,Armand Cassagne Van Gogh might be referring he...,...,,,,,,,,,,
3,361,Vincent van Gogh,Theo van Gogh,The Hague,"on or about Wednesday, 11 July 1883","Waarde Theo, Naar Uw brief had ik reeds min of...","My dear Theo, I had already been looking out f...",The exhibition at Galerie Georges Petit. At th...,Jules Dupré Rousseau Troyon There were four wo...,Constant Troyon Which particular painting of a...,...,,,,,,,,,,
4,445,Vincent van Gogh,Theo van Gogh,Nuenen,"Wednesday, 30 April 1884","Waarde Theo, Hartelijk gefeliciteerd met Uw ve...","My dear Theo, Many happy returns of the day. I...",Theo was 27 on 1 May 1884.,"Boussod, Valadon & Cie Mr van Gogh This remark...","F 30 / JH 479 Most probably (), which is indee...",...,,,,,,,,,,


In [42]:
letters = letters_raw[letters_raw['From'] == 'Vincent van Gogh']

In [43]:
# .copy() avoids a SettingWithCopyWarning when a new column is added later
letters = letters[['Letter', 'Date', 'Translation']].copy()
letters.head()

Unnamed: 0,Letter,Date,Translation
0,191,"between Thursday, 1 and Saturday, 3 December 1881","The Hague, Dec. 1881. My dear Theo, As you see..."
1,554,"on or about Friday, 22 January 1886","My dear Theo, I’ve been painting there for a f..."
2,238,"Friday, 9 June 1882","My dear Theo, Few things have given me so much..."
3,361,"on or about Wednesday, 11 July 1883","My dear Theo, I had already been looking out f..."
4,445,"Wednesday, 30 April 1884","My dear Theo, Many happy returns of the day. I..."


In [44]:
nltk.download('stopwords')
nltk.download('wordnet')
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    """
    Preprocess text: lowercase, tokenize on whitespace, drop stop words and
    non-alphanumeric tokens, then lemmatize. Returns None for non-string input.
    """
    if not isinstance(text, str):
        # Missing translations (NaN) stay null
        return None
    # Lowercase
    text = text.lower()
    # Tokenize on whitespace; isalnum() also discards tokens with attached
    # punctuation (e.g. "theo,"), not only stop words
    tokens = [word for word in text.split() if word.isalnum() and word not in stop_words]
    # Lemmatize
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    # Join tokens back into a string
    return ' '.join(tokens)

letters['cleaned_text'] = letters['Translation'].apply(preprocess_text)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
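
A note on the `word.isalnum()` filter in `preprocess_text`: because tokens come from a plain whitespace split, any word with attached punctuation fails the check and is removed along with the stop words. The minimal sketch below shows the effect on one sentence and compares it with a regex tokenizer. The sample sentence and the tiny stop list are made up for illustration, and the regex variant is only one possible alternative, not what this notebook uses.

```python
import re

# Hypothetical sample sentence and stop list, for illustration only
sample = "My dear Theo, as you see I am drawing."
toy_stop_words = {"my", "as", "you", "i", "am"}

# Same logic as preprocess_text (without lemmatizing): whitespace split + isalnum()
split_tokens = [
    w for w in sample.lower().split()
    if w.isalnum() and w not in toy_stop_words
]

# Alternative: extract alphanumeric runs, so punctuation is stripped, not the word
regex_tokens = [
    w for w in re.findall(r"[a-z0-9]+", sample.lower())
    if w not in toy_stop_words
]

print(split_tokens)  # ['dear', 'see']
print(regex_tokens)  # ['dear', 'theo', 'see', 'drawing']
```

With the whitespace split, "theo," and "drawing." disappear entirely, which may remove content words that would otherwise help separate topics.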


## Latent Dirichlet Allocation (LDA)

To identify themes from patterns in word usage, we use a topic modeling algorithm such as LDA.
LDA takes word counts as input, so we first drop letters with no cleaned text and then vectorize the rest:

In [45]:
print(letters['cleaned_text'].isnull().sum())

# dropping nulls
letters = letters.dropna(subset=['cleaned_text'])

2


In [46]:
from sklearn.feature_extraction.text import CountVectorizer

custom_stop_words = list(set(stopwords.words('english')).union({'like', 'know', 'thing', 'shall', 'make'}))
# Initialize CountVectorizer with custom stop words
vectorizer = CountVectorizer(max_df=0.95, min_df=5, stop_words=custom_stop_words)
# Transform the text data
doc_term_matrix = vectorizer.fit_transform(letters['cleaned_text'])

print(doc_term_matrix.shape)  # (number of letters, number of unique terms)
print(doc_term_matrix.toarray()[:5])

(818, 4726)
[[0 1 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [1 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


In [47]:
# applying LDA to the document-term matrix
from sklearn.decomposition import LatentDirichletAllocation

n_topics = 3  # I start with 3
lda = LatentDirichletAllocation(n_components=n_topics, random_state=42)
lda.fit(doc_term_matrix)
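
Since 3 is only a starting value, one rough way to compare candidate topic counts is the model's perplexity (lower is better). The sketch below runs on a made-up toy corpus so that it is self-contained; in the notebook the same loop could be pointed at `doc_term_matrix`. Perplexity measured on the training data tends to be optimistic, so treat it as a hint rather than a decision rule.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, for illustration only
toy_docs = [
    "drawing painting figure study",
    "painting drawing colour sketch",
    "god sermon bible church",
    "sermon church god prayer",
    "walk meadow sky tree",
    "meadow tree walk bird",
    "drawing sky meadow colour",
    "bible prayer walk church",
]
toy_matrix = CountVectorizer().fit_transform(toy_docs)

perplexity_by_k = {}
for k in (2, 3, 4):
    model = LatentDirichletAllocation(n_components=k, random_state=42)
    model.fit(toy_matrix)
    # Perplexity of the fitted model on the same data it was trained on
    perplexity_by_k[k] = model.perplexity(toy_matrix)

for k, value in perplexity_by_k.items():
    print(f"{k} topics: perplexity = {value:.1f}")
```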

In [48]:
def display_topics(model, feature_names, no_top_words):
    topics = {}
    for topic_idx, topic in enumerate(model.components_):
        topic_words = [feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]]
        topics[f'Topic {topic_idx+1}'] = topic_words
    return topics

no_top_words = 10
feature_names = vectorizer.get_feature_names_out()
topics = display_topics(lda, feature_names, no_top_words)
print(topics)

{'Topic 1': ['one', 'would', 'think', 'see', 'say', 'much', 'good', 'something', 'even', 'go'], 'Topic 2': ['one', 'something', 'drawing', 'see', 'would', 'work', 'also', 'think', 'painting', 'figure'], 'Topic 3': ['one', 'also', 'see', 'good', 'day', 'go', 'old', 'god', 'give', 'life']}
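
The three lists above share several generic words, which makes the topics harder to tell apart. The short sketch below counts how many topics each top word appears in, using the printed output copied in as a literal. Words shared across topics are candidates for the custom stop word list, although whether to remove them is a judgment call.

```python
from collections import Counter

# Top words copied from the output above
topics = {
    'Topic 1': ['one', 'would', 'think', 'see', 'say', 'much', 'good', 'something', 'even', 'go'],
    'Topic 2': ['one', 'something', 'drawing', 'see', 'would', 'work', 'also', 'think', 'painting', 'figure'],
    'Topic 3': ['one', 'also', 'see', 'good', 'day', 'go', 'old', 'god', 'give', 'life'],
}

# Number of topics whose top-10 list contains each word
topic_count = Counter(word for words in topics.values() for word in set(words))

shared_by_all = sorted(w for w, n in topic_count.items() if n == len(topics))
shared_by_some = sorted(w for w, n in topic_count.items() if n >= 2)

print(shared_by_all)   # ['one', 'see']
print(shared_by_some)  # ['also', 'go', 'good', 'one', 'see', 'something', 'think', 'would']
```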


In [49]:
#assigning topics to letters
letter_topics = lda.transform(doc_term_matrix)
letters['Topic'] = letter_topics.argmax(axis=1) + 1
letters.head()

Unnamed: 0,Letter,Date,Translation,cleaned_text,Topic
0,191,"between Thursday, 1 and Saturday, 3 December 1881","The Hague, Dec. 1881. My dear Theo, As you see...",dear writing since last planned come stay ette...,1
1,554,"on or about Friday, 22 January 1886","My dear Theo, I’ve been painting there for a f...",dear painting day suit sort painter see workin...,2
2,238,"Friday, 9 June 1882","My dear Theo, Few things have given me so much...",dear thing given much pleasure recently hearin...,2
3,361,"on or about Wednesday, 11 July 1883","My dear Theo, I had already been looking out f...",dear already looking glad thank find write exh...,2
4,445,"Wednesday, 30 April 1884","My dear Theo, Many happy returns of the day. I...",dear many happy return really important news l...,2


In [50]:
# counting letters per topic
letters['Topic'].value_counts()

Topic,count
1,423
2,238
3,157
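
The counts above come from `argmax`, which keeps only the strongest topic per letter and hides how mixed a letter is. Each row returned by `lda.transform` is a probability distribution over topics, so the size of the largest value shows how confident the assignment is. The sketch below uses a made-up array in place of `letter_topics`, and the 0.5 cut-off is an arbitrary assumption.

```python
import numpy as np

# Hypothetical stand-in for letter_topics: one row per letter, rows sum to 1
toy_letter_topics = np.array([
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],
    [0.10, 0.80, 0.10],
    [0.34, 0.33, 0.33],
])

dominant_topic = toy_letter_topics.argmax(axis=1) + 1  # 1-based, as in the notebook
dominant_share = toy_letter_topics.max(axis=1)         # weight of the winning topic
is_mixed = dominant_share < 0.5                         # arbitrary threshold

print(dominant_topic)  # [1 1 2 1]
print(is_mixed)        # [False  True False  True]
```

Here three letters are assigned to Topic 1, but two of them are close to an even split, which a plain count per topic would not reveal.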


In [52]:
topic_2_letters = letters[letters['Topic'] == 2]
print(topic_2_letters.sample(1)['Translation'].values[0])

Cuesmes, 20 August 1880 Dear Theo, If I’m not mistaken, you should still have ’s ‘The labours of the fields’. Would you be so kind as to lend them to me for a short while, and to send them to me by post? You should know that I’m sketching large drawings after , and that I’ve done The four times of the day, as well as The sower. Well, if you saw them perhaps you wouldn’t be too unhappy with them. Now, if you’d like to send me The labours of the fields, perhaps you could also add some other sheets by or after , , , &c. Don’t buy any specially, but lend me what you may have. Send me what you can, and don’t have any fears on my account. If only I can go on working, I’ll recover somehow. But you’d be a great help to me by doing this. If you pay a visit to Holland sooner or later, I hope you won’t pass by without coming to see the scratches. I’m writing to you while drawing and I’m in a hurry to get back to it, so good-night, and send the sheets as soon as possible, and believe me Ever yours

So far, it seems that Topic 2 is about art.

In [56]:
topic_3_letters = letters[letters['Topic'] == 3]
print(topic_3_letters.sample(1)['Translation'].values[0])

Welwyn, 17 June 1876 My dear Theo, Last Monday I left Ramsgate for London. That’s a long walk indeed, and when I left it was awfully hot and it remained so until the evening, when I arrived at Canterbury. That same evening I walked a bit further until I came to a couple of large beeches and elms next to a small pond, where I rested for a while. In the morning at half past 3 the birds began to sing upon seeing the morning twilight, and I continued on my way. It was good to walk then. In the afternoon I arrived at Chatham, where, in the distance, past partly flooded, low-lying meadows, with elms here and there, one sees the Thames full of ships. It’s always grey weather there, I think. There I met a cart that brought me a couple of miles further, but then the driver went into an inn and I thought he might stay there a long time, so I walked on and arrived towards evening in the well-known suburbs of London and walked on towards the city down the long, long ‘Roads’. I stayed in London for

Topic 3 is a mix of religion (mentions of sermons, the Bible, God) and descriptions of nature (meadows, parks, etc.).

In [58]:
topic_1_letters = letters[letters['Topic'] == 1]
print(topic_1_letters.sample(1)['Translation'].values[0])

My dear Theo, I’ve just sent off 3 large drawings, as well as some other, smaller ones and the two lithographs by . The vertical small farmhouse garden is, it seems to me, the best of the three large ones. The one with the sunflowers is the little garden of a bathhouse. The third, horizontal, garden is the one of which I’ve also done some painted studies. Under the blue sky, the orange, yellow, red patches of flowers take on an amazing brilliance, and in the limpid air there’s something happier and more suggestive of love than in the north. It vibrates — like the bouquet by that you have. I’m annoyed with myself for not painting flowers here. Anyway, even having already produced about fifty drawings or painted studies here, I feel as though I’ve done absolutely nothing at all. I’d gladly content myself with being nothing but a pioneer for other, future painters who’ll come to work in the south. Now the harvest, the garden, the sower and the two seascapes are croquis after painted studi

Topic 1 mentions travel but also a lot of art, much like Topic 3, so the topics are not cleanly separated. LDA gives no direct control over which words end up in which topic, so I will try Seeded or Anchored Topic Modeling next.