# Exploratory Analysis of Parsed Data

In [1]:
import os
import pandas as pd

df_articles = pd.read_pickle("../data/processed/parsed_articles.pkl")

<a id="sections"></a>
## 3. Sections Analysis



In [2]:
set(df_articles["section"])

{'METRO; Pg. N-1',
 'Section B; Column 0; Business/Financial Desk; Pg. 5; SHORTCUTS',
 'THE WASHINGTON POST NATIONAL WEEKLY; Pg. K20',
 'Section SR; Column 0; Editorial Desk; Pg. 8; LETTERS',
 'FEATURES; Pg. 6',
 'OPINION; Pg. A20',
 'Section 2; Column 1; Arts and Leisure Desk; Pg. 28; CD REVIEWS',
 'Section F;\xa0; Section F;\xa0Page 7;\xa0Column 1;\xa0House & Home/Style Desk\xa0; Column 1;\xa0',
 'FEATURES MAGAZINE: ENTERTAINMENT; Pg. D04',
 'Section A; Column 0; Editorial Desk; Pg. 23; GUEST ESSAY',
 'SPORTS; Inq Sports; Pg. E06',
 'Pg. B01; news',
 'Section 6;\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0; Section 6;\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0 Page 100;\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0 Column 1;\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0 Magazine Desk\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0; Column 1;\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0',
 'Section 6; Column 1; Magazine Desk; Pg. 42; STYLE',
 'Se

In [3]:
# Count the unique raw section strings and show a sample
from itertools import islice

sections_raw = df_articles["section"].dropna().astype(str)

unique_raw = set(sections_raw)
print("Unique count (raw):", len(unique_raw))
list(islice(unique_raw, 20))  # show 20 samples


Unique count (raw): 12695


['METRO; Pg. N-1',
 'Section B; Column 0; Business/Financial Desk; Pg. 5; SHORTCUTS',
 'THE WASHINGTON POST NATIONAL WEEKLY; Pg. K20',
 'Section SR; Column 0; Editorial Desk; Pg. 8; LETTERS',
 'FEATURES; Pg. 6',
 'OPINION; Pg. A20',
 'Section 2; Column 1; Arts and Leisure Desk; Pg. 28; CD REVIEWS',
 'Section F;\xa0; Section F;\xa0Page 7;\xa0Column 1;\xa0House & Home/Style Desk\xa0; Column 1;\xa0',
 'FEATURES MAGAZINE: ENTERTAINMENT; Pg. D04',
 'Section A; Column 0; Editorial Desk; Pg. 23; GUEST ESSAY',
 'SPORTS; Inq Sports; Pg. E06',
 'Pg. B01; news',
 'Section 6;\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0; Section 6;\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0 Page 100;\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0 Column 1;\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0 Magazine Desk\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0; Column 1;\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0',
 'Section 6; Column 1; Magazine Desk; Pg. 42; STYLE',
 'Se

In [4]:
# Clean NBSP and extra spaces
sections_clean = (
    sections_raw
      .str.replace("\xa0", " ", regex=False)
      .str.replace(r"\s+", " ", regex=True)
      .str.strip()
)

unique_clean = set(sections_clean)
print("Unique count (cleaned):", len(unique_clean))
list(islice(unique_clean, 20))  # show 20 samples after cleaning


Unique count (cleaned): 12389


['METRO; Pg. N-1',
 'Section B; Column 0; Business/Financial Desk; Pg. 5; SHORTCUTS',
 'THE WASHINGTON POST NATIONAL WEEKLY; Pg. K20',
 'Section SR; Column 0; Editorial Desk; Pg. 8; LETTERS',
 'FEATURES; Pg. 6',
 'OPINION; Pg. A20',
 'Section 2; Column 1; Arts and Leisure Desk; Pg. 28; CD REVIEWS',
 'FEATURES MAGAZINE: ENTERTAINMENT; Pg. D04',
 'Section A; Column 0; Editorial Desk; Pg. 23; GUEST ESSAY',
 'SPORTS; Inq Sports; Pg. E06',
 'Pg. B01; news',
 'Section 1; ; Section 1; Page 27; Column 5; Metropolitan Desk; Second Front; Column 5; ; Second Front',
 'Section 6; Column 1; Magazine Desk; Pg. 42; STYLE',
 'Section A; ; Section A; Page 27; Column 1; Editorial Desk ; Column 1; ; Op-Ed',
 'Section 1;; Section 1; Part 1; Page 10; Column 5; National Desk; Part 1;; Column 5;',
 'ARTS AND ENTERTAINMENT; FLASH!; Pg. C-2',
 'NATIONAL; Pg. A-2',
 'Section 2;; Section 2; Page 25; Column 1; Arts and Leisure Desk; Column 1;',
 'NEWS; Pg. 17',
 'Section 6; Page 70, Column 1; Magazine Desk; Lette

In [5]:
import re

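# Split on semicolon or slash.
# Note: splitting on "/" also breaks compound names such as
# "Business/Financial Desk" into "Business" and "Financial Desk".
parts_series = sections_clean.apply(lambda x: re.split(r"[;/]", x))

# Flatten tokens, dropping empty or whitespace-only pieces
tokens = [
    t.strip()
    for parts in parts_series
    for t in parts
    if t.strip()
]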

print("Total tokens:", len(tokens))
print("Unique tokens:", len(set(tokens)))
list(islice(set(tokens), 20))  # show 20 samples


Total tokens: 141116
Unique tokens: 4227


['Pg. 4OP',
 'DUBLIN JOURNAL',
 'Section BR',
 'Page 30',
 'Pg. C10',
 'OTHER NEWS',
 'PHENOMENON',
 'None',
 'WEEKEND FEEDBACK',
 'Pg. 100',
 'Inq Col David Patrick Stearns',
 'Pg. 25A',
 'VIDEO REVIEW',
 'Pg. SPD10',
 "Critic's Choice: New DVD's",
 'NEW HOMES',
 'SPRING BREAK MEXICO CITY',
 'tennis',
 'Pg. D03',
 'Pg. A22']

In [6]:
from collections import Counter

counts = Counter(tokens)
df_section_token_freq = (
    pd.DataFrame(counts.items(), columns=["Section_Token", "Frequency"])
      .sort_values("Frequency", ascending=False)
      .reset_index(drop=True)
)

# Top 50 tokens
df_section_token_freq.head(50)


| | Section_Token | Frequency |
|---|---|---|
| 0 | Column 0 | 10590 |
| 1 | Section A | 8037 |
| 2 | Column 1 | 6033 |
| 3 | National Desk | 3440 |
| 4 | Pg. 1 | 3223 |
| 5 | NEWS | 2768 |
| 6 | Metropolitan Desk | 2517 |
| 7 | Foreign Desk | 2398 |
| 8 | Section B | 2384 |
| 9 | Section C | 2312 |


#### Findings from Top 50 Section Tokens

The `section` field, analyzed across tens of thousands of newspaper articles, mixes four kinds of tokens: **layout references**, **section codes**, **editorial desks**, and **topic/genre labels**. Because a single field carries all four, it should be cleaned and split into separate components before any targeted analysis.

##### 1. Layout References
Tokens describing **physical placement** within the newspaper are frequent but not semantically meaningful for content analysis:
- Examples: `Column 0`, `Column 1`, `Column 2`, `Column 3`, `Column 4`, `Column 5`, `Pg. 1`, `Page 1`, `Pg. 4`, `Pg. 6`, `Pg. 3`, `Pg. 8`, `Pg.`, `Pg. 2`, `Pg. 10`.
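
One possible way to set layout references aside is to drop them from each section string before further analysis. The sketch below is illustrative only: `LAYOUT_RE` and `drop_layout` are hypothetical names, and the pattern is an assumption that covers just the `Column N`, `Pg. X`, and `Page N` shapes listed above, so other layout variants in the data would need extra rules.

```python
import re

# Assumed pattern for layout-only segments: "Column 0", "Pg. A22", "Pg. N-1", "Pg.", "Page 30".
LAYOUT_RE = re.compile(r"^(?:Column \d+|Pg\.(?: ?[\w-]+)?|Page \d+)$", re.IGNORECASE)

def drop_layout(section: str) -> str:
    """Remove layout-only segments from a semicolon-separated section string."""
    parts = [p.strip() for p in section.split(";")]
    # Keep non-empty segments that are not pure layout references.
    kept = [p for p in parts if p and not LAYOUT_RE.match(p)]
    return "; ".join(kept)

print(drop_layout("Section A; Column 0; Editorial Desk; Pg. 23; GUEST ESSAY"))
# Section A; Editorial Desk; GUEST ESSAY
```

On the full data this could be applied with something like `sections_clean.map(drop_layout)`, assuming `sections_clean` is the cleaned Series built earlier.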

##### 2. Section Codes
These indicate **broad newspaper sections** and may be useful for high-level categorization:
- Examples: `Section A`, `Section B`, `Section C`, `Section 1`, `Section E`, `Section 7`, `Section 2`, `Section BR`, `Section 6`, `Section SR`.

##### 3. Editorial Desks
Tokens representing **department names** provide insight into the editorial source:
- Examples: `National Desk`, `Metropolitan Desk`, `Foreign Desk`, `Editorial Desk`, `Cultural Desk`, `Book Review Desk`, `Weekend Desk`, `Arts and Leisure Desk`, `Financial Desk`, `Magazine Desk`.
- Potential action: Analyze separately to understand departmental output.
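
As a rough sketch of that separate analysis, the desk name can be pulled out of each full section string and counted. The names `DESK_RE`, `extract_desk`, and `sections_demo` are hypothetical, and the pattern assumes a desk is the semicolon-delimited segment ending in the word `Desk`. Working on the whole string rather than on the slash-split tokens keeps compound names such as `Business/Financial Desk` intact.

```python
import re
from collections import Counter

# Assumed rule: a desk is the segment between semicolons that ends in "Desk".
DESK_RE = re.compile(r"(?:^|;)\s*([^;]*?\bDesk)\s*(?:;|$)")

def extract_desk(section):
    """Return the first desk name found in a section string, or None."""
    match = DESK_RE.search(section)
    return match.group(1) if match else None

# Small stand-in for the cleaned section strings, copied from the samples above.
sections_demo = [
    "Section B; Column 0; Business/Financial Desk; Pg. 5; SHORTCUTS",
    "Section A; Column 0; Editorial Desk; Pg. 23; GUEST ESSAY",
    "Section SR; Column 0; Editorial Desk; Pg. 8; LETTERS",
    "METRO; Pg. N-1",  # no desk in this string
]

desk_counts = Counter(d for d in map(extract_desk, sections_demo) if d)
print(desk_counts.most_common())
```

With the real data, `sections_clean.map(extract_desk).value_counts()` would give the same kind of per-desk tally, again assuming `sections_clean` from the cleaning step.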

##### 4. Topic/Genre Labels
These tokens reflect **content themes or genres**:
- Examples: `NEWS`, `OPINION`, `US`, `The Arts`, `politics`, `Review`, `Movies, Performing Arts`, `BUSINESS`, `NATIONAL`, `LIFE`, `WORLD`, `EDITORIAL`, `LOCAL`, `Business`.
- Potential action: Use in thematic analysis or topic modeling as topic labels.
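
The four groups above can be approximated with a small rule-based classifier. This is a minimal sketch, not a definitive scheme: `CATEGORY_RULES` and `categorize` are hypothetical names, the regular expressions are assumptions based only on the example tokens listed here, and anything that matches no rule falls into a catch-all `topic_or_other` bucket that would still need manual review. The frequencies in `token_counts` are the ten rows shown in the table above.

```python
import re
from collections import Counter

# Assumed rules, checked in order; the first matching rule wins.
CATEGORY_RULES = [
    ("layout", re.compile(r"^(?:Column \d+|Pg\.(?: ?[\w-]+)?|Page \d+)$", re.IGNORECASE)),
    ("section_code", re.compile(r"^Section \w+$", re.IGNORECASE)),
    ("desk", re.compile(r"\bDesk$", re.IGNORECASE)),
]

def categorize(token: str) -> str:
    """Assign a token to one of the four groups described above."""
    token = token.strip()
    for label, pattern in CATEGORY_RULES:
        if pattern.search(token):
            return label
    return "topic_or_other"  # catch-all; not every such token is a real topic label

# Token frequencies taken from the top rows of the frequency table.
token_counts = {
    "Column 0": 10590, "Section A": 8037, "Column 1": 6033,
    "National Desk": 3440, "Pg. 1": 3223, "NEWS": 2768,
    "Metropolitan Desk": 2517, "Foreign Desk": 2398,
    "Section B": 2384, "Section C": 2312,
}

# Sum frequencies per category.
category_freq = Counter()
for token, freq in token_counts.items():
    category_freq[categorize(token)] += freq

print(category_freq.most_common())
```

Applied to the full frequency table, a new column could be derived with `df_section_token_freq["Section_Token"].map(categorize)`, assuming the DataFrame built in the earlier cell.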
