The Mailchimp archive is notable because the newsletters from December 2014 through Spring 2017 differ from all those that followed. In this period there is a visible effort to make the newsletter interesting and useful. Later, from Fall 2018 onwards, the newsletter only repeats announcements and events that are already on the website. That is why we decided to extract information from the Mailchimp newsletters differently.
1. We kept the first 10 newsletters (Dec 2014-Spring 2017) and removed the rest from our analysis.
2. We divided the newsletters into text chunks following their internal subheadings, which are marked in two ways:
    Dec 2014-Spring 2016: [TITLE]
    Summer 2016, Fall-Winter 2016, Spring 2017: ** TITLE -----
3. We identified the people who appear in these text chunks using named entity recognition (NER), so that they can be counted for the people part of the dataviz project.
4. We read these newsletters to find definitions of CESTA and curious artefacts to share alongside the dataviz.
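
Step 3 is not shown in the cells below, so here is a minimal sketch of how the counting could work. It assumes a spaCy-style pipeline: `nlp` is any callable whose result has an `.ents` list, and each entity exposes `.text` and `.label_`. The function name `count_people` and the whitespace normalisation are illustrative choices, not the project's actual code.

```python
from collections import Counter


def count_people(section_texts, nlp):
    """Count PERSON entities across text chunks.

    Assumption: `nlp` behaves like a spaCy pipeline, i.e. nlp(text).ents
    is an iterable of entities with `.text` and `.label_` attributes.
    """
    counts = Counter()
    for text in section_texts:
        for ent in nlp(text).ents:
            if ent.label_ == "PERSON":
                # Collapse internal whitespace so line-wrapped names merge
                counts[" ".join(ent.text.split())] += 1
    return counts
```

With spaCy and an English model installed, `nlp = spacy.load("en_core_web_sm")` could be passed in, and `section_texts` could be the "Section Text" column of the DataFrames built below. Name variants (first name only, titles, initials) would still need manual merging.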


In [1]:
import os
import re
import pandas as pd

In [16]:
# Splitter for the "[TITLE]" format, which works for all but the last 3 newsletters. Tested here on the first newsletter.

file_path = '/Users/mervetekgurler/Desktop/PhD/CESTA/cesta-events/mailchimp/relevant_messages/632073_cesta-monthly-newsletter.txt'
with open(file_path, 'r') as file:
    newsletter_text = file.read()

filename = "632073_cesta-monthly-newsletter"

def divide_newsletter_to_csv(text, filename):
    """Split a newsletter on its [TITLE] headings and return one DataFrame row per section."""
    # The capture group makes re.split keep the headings in the result
    pattern = re.compile(r'(\[.*?\])')
    sections = pattern.split(text)
    sections = [section.strip() for section in sections if section.strip()]

    data = []
    section_count = 0
    for i, section in enumerate(sections):
        if pattern.match(section):
            section_count += 1
            section_name = f"{filename}_{section_count}"
            section_heading = section.strip('[]')
            # The body is the next item, unless that item is itself a heading
            has_body = i + 1 < len(sections) and not pattern.match(sections[i + 1])
            section_text = sections[i + 1] if has_body else ""
            data.append([section_name, section_heading, section_text])

    df = pd.DataFrame(data, columns=["File Name", "Section Heading", "Section Text"])
    return df

# Divide the newsletter and create the CSV data
df = divide_newsletter_to_csv(newsletter_text, filename)

# # Save the DataFrame to a CSV file
# csv_filename = "/mnt/data/newsletter_sections.csv"
# df.to_csv(csv_filename, index=False)

df.head()

Unnamed: 0,File Name,Section Heading,Section Text
0,632073_cesta-monthly-newsletter_1,NEWS YOU CAN USE,Welcome to the new monthly newsletter from the...
1,632073_cesta-monthly-newsletter_2,WILL IT PLAY IN PEORIA?,CESTA is proud to announce that one of our lon...
2,632073_cesta-monthly-newsletter_3,OTHER PEOPLE'S MONEY,Spatial History Project (https://web.stanford....
3,632073_cesta-monthly-newsletter_4,GOING POSTAL,We are proud to announce the recent publicatio...
4,632073_cesta-monthly-newsletter_5,PALLADIO 8.0,CESTA's Humanities + Design (http://hdlab.stan...
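
The splitter above relies on one property of `re.split`: when the pattern is wrapped in a capture group, the matched headings are kept in the result rather than discarded. A small illustration on a made-up string (the sample text is hypothetical, not taken from a newsletter):

```python
import re

# Hypothetical toy text in the "[TITLE]" convention
sample = "Intro blurb\n[FIRST HEADING]\nBody one.\n[SECOND HEADING]\nBody two.\n"

heading = re.compile(r'(\[.*?\])')

# The capture group keeps each heading as its own list item
parts = [p.strip() for p in heading.split(sample) if p.strip()]
print(parts)

# Pair each heading with the chunk that follows it.
# Unlike the function above, this toy version skips headings with no body.
pairs = [
    (parts[i].strip('[]'), parts[i + 1])
    for i in range(len(parts) - 1)
    if heading.match(parts[i]) and not heading.match(parts[i + 1])
]
print(pairs)
```

One caveat of this pattern is that any bracketed text in the body is also treated as a heading, so the resulting sections are worth a quick manual check.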


In [14]:
# Splitter for the "** Title -----" format used by the last 3 newsletters. Tested here on the first of them (Summer 2016).

def divide_newsletter_with_custom_delimiters(text, filename):
    # Define the pattern to identify the section headings and text
    pattern = re.compile(r'\*\*\s*(.*?)\s*-{2,}')
    matches = pattern.split(text)
    
    # Ignore the text before the first heading
    matches = matches[1:]
    
    # Prepare the data for the CSV
    data = []
    for i in range(0, len(matches) - 1, 2):
        section_name = f"{filename}_{(i//2) + 1}"
        section_heading = matches[i].strip()
        section_text = matches[i + 1].strip()
        data.append([section_name, section_heading, section_text])
    
    # Create a DataFrame
    df = pd.DataFrame(data, columns=["File Name", "Section Heading", "Section Text"])
    return df

text_path = "/Users/mervetekgurler/Desktop/PhD/CESTA/cesta-events/mailchimp/relevant_messages/1815413_cesta-newsletter-summer-2016.txt"
with open(text_path, 'r') as file:
    text = file.read()

df = divide_newsletter_with_custom_delimiters(text, "1815413_cesta-newsletter-summer-2016") 

df.head()

Unnamed: 0,File Name,Section Heading,Section Text
0,1815413_cesta-newsletter-summer-2016_1,A Message from the Director,Welcome to the first CESTA Newsletter of the n...
1,1815413_cesta-newsletter-summer-2016_2,About that logo ...,You may have noticed that the header image for...
2,1815413_cesta-newsletter-summer-2016_3,Introducing CESTA's New Website,CESTA is pleased to announce the launch of its...
3,1815413_cesta-newsletter-summer-2016_4,Literary Lab Launches Techne Blog,Building on the success of their longstanding ...
4,1815413_cesta-newsletter-summer-2016_5,Text Technologies News,Text Technologies was delighted to welcome int...


In [18]:
# Unified splitter that handles both heading formats. The loop below runs it on every file and saves all sections into one CSV.

def divide_newsletter_unified(text, filename):
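    # Unified pattern: "** Title -----" (title captured by the inner group) or "[TITLE]"
    pattern = re.compile(r'(\*\*\s*(.*?)\s*-{2,}|\[.*?\])')
    # With two capture groups, split() yields
    # [preamble, full_match, inner_title_or_None, body, full_match, inner_title_or_None, body, ...]
    matches = pattern.split(text)

    # Ignore the text before the first heading
    matches = matches[1:]

    data = []
    section_count = 0
    # Step through the (full_match, inner_title, body) triples
    for i in range(0, len(matches), 3):
        if pattern.match(matches[i]):
            section_count += 1
            section_name = f"{filename}_{section_count}"
            # inner_title is None for "[TITLE]" headings, so fall back to stripping the brackets
            section_heading = matches[i+1] if matches[i+1] else matches[i].strip('[]').strip()
            section_text = matches[i+2].strip() if i+2 < len(matches) else ""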
            data.append([section_name, section_heading, section_text])
    
    df = pd.DataFrame(data, columns=["File Name", "Section Heading", "Section Text"])
    return df

mailchimp_folder = '/Users/mervetekgurler/Desktop/PhD/CESTA/cesta-events/mailchimp/relevant_messages'
output_csv = '/Users/mervetekgurler/Desktop/PhD/CESTA/cesta-events/mailchimp/all_newsletter_sections.csv'

all_data = []

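# Sort for a reproducible row order (os.listdir returns entries in arbitrary order)
for filename in sorted(os.listdir(mailchimp_folder)):
    if filename.endswith('.txt'):
        file_path = os.path.join(mailchimp_folder, filename)
        with open(file_path, 'r') as file:
            text = file.read()

        # Drop the ".txt" extension so section names match the earlier cells
        # (e.g. "632073_cesta-monthly-newsletter_1")
        newsletter_name = os.path.splitext(filename)[0]
        df = divide_newsletter_unified(text, newsletter_name)
        all_data.append(df)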

# Combine all data into a single DataFrame
combined_df = pd.concat(all_data, ignore_index=True)

# Save the combined DataFrame to a single CSV file
combined_df.to_csv(output_csv, index=False)
print(f"Processed all files and saved to {output_csv}")

print("Processing complete.")



Processed all files and saved to /Users/mervetekgurler/Desktop/PhD/CESTA/cesta-events/mailchimp/all_newsletter_sections.csv
Processing complete.
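
The unified function above steps through its split result three items at a time. The reason is that its pattern has two capture groups, so `re.split` emits two group values for every heading, and the inner group is `None` whenever the "[TITLE]" alternative matched. A short sketch on a made-up string (the sample text is hypothetical) shows the shape of that list:

```python
import re

unified = re.compile(r'(\*\*\s*(.*?)\s*-{2,}|\[.*?\])')

# Hypothetical toy text mixing both heading conventions
sample = "preamble [OLD STYLE] first body ** New Style\n----- second body"

parts = unified.split(sample)
print(parts)

# Skip the preamble, then read (full_match, inner_title, body) triples
sections = []
for i in range(1, len(parts), 3):
    full_match, inner_title, body = parts[i:i + 3]
    # inner_title is None for bracketed headings
    title = inner_title if inner_title else full_match.strip('[]').strip()
    sections.append((title, body.strip()))
print(sections)
```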
