The Mailchimp archive is interesting because the newsletters from December 2014 through Spring 2017 differ from all the ones that followed. In this period there is a clear effort to make the newsletter interesting and useful. Later, from Fall 2018 onwards, the newsletter merely repeats the announcements and events that we already have on the website. That is why we decided to extract information from the Mailchimp newsletters differently:
1. We keep the first 10 newsletters (Dec 2014-Spring 2017) and remove the rest from our analysis.
2. We divide the newsletters into text chunks following their internal subheadings, which are marked differently depending on the period:
    Dec 2014-Spring 2016: [TITLE]
    Summer 2016, Fall-Winter 2016, Spring 2017: ** TITLE followed by a row of dashes (-----)
3. We identify the people who appear in these text chunks using named-entity recognition (NER), in order to count people for the dataviz project.
4. We read these newsletters to find definitions of CESTA and curious artefacts to share alongside the dataviz.
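
Because step 2 depends on which of the two heading conventions a newsletter uses, a small helper could guess the convention before splitting. The sketch below is only an illustration: `detect_heading_style` and its return labels are hypothetical names, and counting matches is a rough heuristic that assumes bracketed text and `**` markers appear mainly in headings.

```python
import re

# The same two heading conventions used by the splitting functions in this notebook.
BRACKET_HEADING = re.compile(r'\[(.*?)\]')
STARRED_HEADING = re.compile(r'\*\*\s*(.*)\s-*')


def detect_heading_style(text):
    """Guess the heading convention of a newsletter (hypothetical helper).

    Returns 'bracket', 'starred', or 'unknown', based on which pattern
    matches more often. This is a heuristic, not a guarantee.
    """
    bracket_count = len(BRACKET_HEADING.findall(text))
    starred_count = len(STARRED_HEADING.findall(text))
    if bracket_count == 0 and starred_count == 0:
        return "unknown"
    return "bracket" if bracket_count >= starred_count else "starred"
```

The result can then be used to choose between the bracket-based and the dash-based splitting function for each file.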


In [1]:
import os
import re
import pandas as pd

In [26]:
def divide_newsletter(text):
    # Define the pattern to split the text at the headings
    pattern = re.compile(r'(\[.*?\])')
    sections = pattern.split(text)
    sections = [section.strip() for section in sections if section.strip()]

    # Pair headings with their respective content
    paired_sections = []
    for i in range(len(sections)):
        if pattern.match(sections[i]):
            if i + 1 < len(sections) and not pattern.match(sections[i + 1]):
                paired_sections.append((sections[i], sections[i + 1]))
            else:
                paired_sections.append((sections[i], ""))
    return paired_sections

file_path = '/Users/mervetekgurler/Desktop/PhD/CESTA/cesta-events/mailchimp/relevant_messages/632073_cesta-monthly-newsletter.txt'
with open(file_path, 'r') as file:
    newsletter_text = file.read()

divided_newsletter = divide_newsletter(newsletter_text)



In [34]:
len(divided_newsletter)

13
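
The parentheses in the pattern matter: `re.split` discards whatever it splits on unless the pattern contains a capture group, in which case the matched headings are kept in the result list. A minimal sketch on a made-up string (the sample text is invented purely for illustration):

```python
import re

sample = "intro [ONE] first body [TWO] second body"

# With a capture group, the headings stay in the output between the bodies.
with_group = re.compile(r'(\[.*?\])')
pieces = [p.strip() for p in with_group.split(sample) if p.strip()]
print(pieces)       # ['intro', '[ONE]', 'first body', '[TWO]', 'second body']

# Without a capture group, the headings are dropped and only the bodies remain.
without_group = re.compile(r'\[.*?\]')
bodies_only = [p.strip() for p in without_group.split(sample) if p.strip()]
print(bodies_only)  # ['intro', 'first body', 'second body']
```

Note that any text before the first heading ('intro' here) is also returned, which is why the pairing loop checks whether each piece is a heading before using it.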

In [32]:
import pandas as pd

# Sample filename
filename = "632073_cesta-monthly-newsletter.txt"

# Function to divide the text into sections based on headings
def divide_newsletter_to_csv(text, filename):
    # Define the pattern to split the text at the headings
    pattern = re.compile(r'(\[.*?\])')
    sections = pattern.split(text)
    sections = [section.strip() for section in sections if section.strip()]
    
    # Prepare the data for the CSV
    data = []
    section_count = 0
    for i in range(len(sections)):
        if pattern.match(sections[i]):
            section_count += 1
            section_name = f"{filename}_{section_count}"
            section_heading = sections[i].strip('[]')
            section_text = sections[i + 1] if i + 1 < len(sections) and not pattern.match(sections[i + 1]) else ""
            data.append([section_name, section_heading, section_text])
    
    # Create a DataFrame
    df = pd.DataFrame(data, columns=["File Name", "Section Heading", "Section Text"])
    return df

# Divide the newsletter and create the CSV data
df = divide_newsletter_to_csv(newsletter_text, filename)

# # Save the DataFrame to a CSV file
# csv_filename = "/mnt/data/newsletter_sections.csv"
# df.to_csv(csv_filename, index=False)

df.head()

Unnamed: 0,File Name,Section Heading,Section Text
0,632073_cesta-monthly-newsletter.txt_1,NEWS YOU CAN USE,Welcome to the new monthly newsletter from the...
1,632073_cesta-monthly-newsletter.txt_2,WILL IT PLAY IN PEORIA?,CESTA is proud to announce that one of our lon...
2,632073_cesta-monthly-newsletter.txt_3,OTHER PEOPLE'S MONEY,Spatial History Project (https://web.stanford....
3,632073_cesta-monthly-newsletter.txt_4,GOING POSTAL,We are proud to announce the recent publicatio...
4,632073_cesta-monthly-newsletter.txt_5,PALLADIO 8.0,CESTA's Humanities + Design (http://hdlab.stan...


In [33]:
len(df)

13

In [60]:
def divide_newsletter_with_custom_delimiters(text, filename):
    # Define the pattern to identify the section headings and text
    pattern = re.compile(r'\*\*\s*(.*)\s-*')
    matches = pattern.split(text)
    matches = [match.strip() for match in matches if match.strip()]
    
    # Prepare the data for the CSV
    data = []
    for i in range(1, len(matches), 2):
        section_name = f"{filename}_{(i//2) + 1}"
        section_heading = matches[i-1]
        section_text = matches[i]
        data.append([section_name, section_heading, section_text])
    
    # Create a DataFrame
    df = pd.DataFrame(data, columns=["File Name", "Section Heading", "Section Text"])
    return df


In [61]:
text_path = "/Users/mervetekgurler/Desktop/PhD/CESTA/cesta-events/mailchimp/relevant_messages/1815413_cesta-newsletter-summer-2016.txt"
with open(text_path, 'r') as file:
    text = file.read()

df = divide_newsletter_with_custom_delimiters(text, "1815413_cesta-newsletter-summer-2016.txt") 

In [None]:
mailchimp_folder = '/Users/mervetekgurler/Desktop/PhD/CESTA/cesta-events/mailchimp/relevant_messages'
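
To apply a splitter to every file in a folder such as `mailchimp_folder`, a loop along the following lines could work. This is a sketch, not the code we ran: `split_folder` is a hypothetical helper, and it assumes that the splitter returns (heading, text) pairs the way `divide_newsletter` does and that the files are UTF-8 text.

```python
import os


def split_folder(folder, splitter, suffix=".txt"):
    """Run `splitter` on every file in `folder` ending with `suffix`.

    `splitter` is any function that takes the file text and returns a
    list of (heading, body) pairs. Returns rows of
    [section name, heading, body], numbered per file.
    """
    rows = []
    for name in sorted(os.listdir(folder)):
        if not name.endswith(suffix):
            continue
        # Assumption: the exported newsletters are UTF-8 encoded.
        with open(os.path.join(folder, name), "r", encoding="utf-8") as handle:
            text = handle.read()
        for index, (heading, body) in enumerate(splitter(text), start=1):
            rows.append([f"{name}_{index}", heading, body])
    return rows
```

The rows use the same three fields as the DataFrames in this notebook, so they can be passed to `pd.DataFrame(rows, columns=["File Name", "Section Heading", "Section Text"])`.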

In [62]:
df

Unnamed: 0,File Name,Section Heading,Section Text
0,1815413_cesta-newsletter-summer-2016.txt_1,CESTA Stanford Monthly Newsletter\nView this e...,A Message from the Director
1,1815413_cesta-newsletter-summer-2016.txt_2,Welcome to the first CESTA Newsletter of the n...,About that logo ...
2,1815413_cesta-newsletter-summer-2016.txt_3,You may have noticed that the header image for...,Introducing CESTA's New Website
3,1815413_cesta-newsletter-summer-2016.txt_4,CESTA is pleased to announce the launch of its...,Literary Lab Launches Techne Blog
4,1815413_cesta-newsletter-summer-2016.txt_5,Building on the success of their longstanding ...,Text Technologies News
5,1815413_cesta-newsletter-summer-2016.txt_6,Text Technologies was delighted to welcome int...,Poetic Media Project News
6,1815413_cesta-newsletter-summer-2016.txt_7,The Poetic Media Lab's Lacuna annotation syste...,CESTA in Leeds
7,1815413_cesta-newsletter-summer-2016.txt_8,The International Medieval Congress is no ordi...,CESTA at DH 2016
8,1815413_cesta-newsletter-summer-2016.txt_9,"For five hectic days, digital humanists from a...",Mapping Rome ... Not in a Day
9,1815413_cesta-newsletter-summer-2016.txt_10,"After more than ten years of field work, digit...",Summer RA Activities
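
In the table above the two columns appear shifted by one: "A Message from the Director" is a heading but lands under Section Text, while the Section Heading column holds the text that precedes it. This is consistent with how `pattern.split` returns its pieces: the text before the first heading comes first, followed by alternating headings and bodies. A possible fix is sketched below, under the assumption that every heading has the form `** Heading` followed by a row of dashes (`pair_headings_with_text` is a hypothetical name, and the sample string is invented):

```python
import re

# Same pattern as in divide_newsletter_with_custom_delimiters.
HEADING_PATTERN = re.compile(r'\*\*\s*(.*)\s-*')


def pair_headings_with_text(text):
    """Return (preamble, [(heading, body), ...]) for a dash-delimited newsletter."""
    parts = HEADING_PATTERN.split(text)
    # parts[0] is whatever precedes the first heading; after that the list
    # alternates heading, body, heading, body, ...
    preamble, rest = parts[0], parts[1:]
    pairs = []
    for i in range(0, len(rest) - 1, 2):
        pairs.append((rest[i].strip(), rest[i + 1].strip()))
    return preamble.strip(), pairs


sample = (
    "Header line\n"
    "** First Heading\n"
    "------------\n"
    "First body.\n"
    "** Second Heading\n"
    "------------\n"
    "Second body.\n"
)
preamble, pairs = pair_headings_with_text(sample)
print(preamble)  # Header line
print(pairs)     # [('First Heading', 'First body.'), ('Second Heading', 'Second body.')]
```

Unlike the version above, this sketch does not filter out empty pieces before pairing, so a heading with no text under it keeps an empty body instead of shifting every later pair.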
