# 🧹 Coding Tips

## 1. Use Relative Paths Instead of Absolute Paths
Relative paths are resolved against the current working directory rather than a fixed location on one machine. As long as the code is run from the same place within the project, this keeps it portable and lets collaborators run it without errors caused by machine-specific paths.

### ✅ Example: Relative Path

In [10]:
folder_path = "../data/raw/NYT/"  # relative path based on project structure

### ❌ Example: Absolute Path (machine-specific)

In [11]:
# This will only work on Guo's computer
# folder_path = '/Users/jingguo/Desktop/OPT/NLP/NYT/Archive'
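
As a minimal sketch, the same relative path can be built with the standard library's `pathlib`, which joins path parts with the correct separator on each operating system. The layout here is an assumption that mirrors the example above (a `data` directory one level above the working directory).

```python
from pathlib import Path

# Assumed layout: the working directory sits one level below the project
# root, next to a sibling "data" directory (mirrors "../data/raw/NYT/").
folder_path = Path("..") / "data" / "raw" / "NYT"
doc_path = folder_path / "1.DOCX"

print(doc_path.as_posix())     # ../data/raw/NYT/1.DOCX
print(doc_path.is_absolute())  # False
```

Because the path is still relative, it resolves against whatever directory the code is run from.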

## 2. Word Document Paragraphs Include Metadata
In .docx files, each paragraph carries metadata, such as its style name, that helps identify its content type (for example, titles styled as "Heading 1"). You can use this to extract structured data.

### Example: Extracting Titles

In [12]:
from docx import Document

doc_path = "../data/raw/NYT/1.DOCX"
doc = Document(doc_path)

titles = [para.text for para in doc.paragraphs if "Heading 1" in para.style.name]

print("Number of titles found:", len(titles))
print(titles[:5])


Number of titles found: 500
['White Working Class Shunning Democrats', 'A Real Working-Class Hero', 'Working Class Proves Elusive For Democrats', 'Strong Showing Spurs Midwest Mechanic to Empower Working Class', 'Is This the End of the White Working-Class Democrat?']
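
Before filtering on a style such as "Heading 1", it can help to see which style names a document actually uses. The sketch below tallies them. `count_paragraph_styles` is a hypothetical helper, and the stand-in paragraphs are invented so the example runs without a .docx file; it assumes only that each paragraph exposes `.style.name`, as in the code above.

```python
from collections import Counter
from types import SimpleNamespace


def count_paragraph_styles(paragraphs):
    """Tally style names over objects that expose `.style.name`."""
    return Counter(para.style.name for para in paragraphs)


# Stand-in paragraphs (hypothetical data). With python-docx, pass
# `Document(doc_path).paragraphs` instead.
sample = [
    SimpleNamespace(text="A headline", style=SimpleNamespace(name="Heading 1")),
    SimpleNamespace(text="Body text", style=SimpleNamespace(name="Normal")),
    SimpleNamespace(text="More body text", style=SimpleNamespace(name="Normal")),
]

style_counts = count_paragraph_styles(sample)
print(style_counts.most_common())  # [('Normal', 2), ('Heading 1', 1)]
```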


## 3. Avoid Repeating Code: Use Functions
If a block of code needs to be reused (e.g., to process different datasets), wrap it in a function. This improves maintainability and reduces errors.

### Example:

Instead of repeating code for NYT data and other publishers:

In [13]:
from docx import Document

def process_docx_titles(doc_path):
    """Return the text of every paragraph styled as 'Heading 1' in a .docx file."""
    doc = Document(doc_path)
    titles = [para.text for para in doc.paragraphs if "Heading 1" in para.style.name]
    return titles

In [14]:
nyt_titles = process_docx_titles("../data/raw/NYT/1.DOCX")
print("Number of titles found:", len(nyt_titles))
print(nyt_titles[:5])

Number of titles found: 500
['White Working Class Shunning Democrats', 'A Real Working-Class Hero', 'Working Class Proves Elusive For Democrats', 'Strong Showing Spurs Midwest Mechanic to Empower Working Class', 'Is This the End of the White Working-Class Democrat?']
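
Once the logic lives in a function, applying it to many files becomes a short loop. The following is a sketch only: `collect_titles` is a hypothetical helper, and it takes the extractor as an argument so it works with `process_docx_titles` or any similar function.

```python
from pathlib import Path


def collect_titles(folder, extract_titles):
    """Run `extract_titles` on every .docx file in `folder`.

    Returns a dict mapping each file name to the list the extractor returned.
    """
    results = {}
    for path in sorted(Path(folder).iterdir()):
        if path.suffix.lower() == ".docx":  # matches both .docx and .DOCX
            results[path.name] = extract_titles(str(path))
    return results


# Usage, assuming process_docx_titles from above and that the folder exists:
# all_titles = collect_titles("../data/raw/NYT/", process_docx_titles)
```

Passing the extractor in keeps the loop independent of any one publisher's file format.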


## 4. Use pickle Instead of .csv for Text-Heavy or Complex Data

While .csv files are a common format for tabular data, they can become fragile when columns contain large amounts of text with punctuation, line breaks, or special characters. If quoting or escaping is handled differently by the tool that writes the file and the one that reads it, parsers can misinterpret such text, resulting in broken rows or misaligned columns. CSV also stores every value as plain text, so data types and nested objects are not preserved. In these cases, Python's pickle module offers a more robust option: it preserves the exact Python object structure, avoids these formatting issues, and often loads faster. The trade-off is that pickle files are Python-specific and should only be loaded from sources you trust.

In [15]:
import pandas as pd
import pickle

# Sample DataFrame with a text-heavy field
df = pd.DataFrame({
    "id": [1, 2],
    "title": ["Headline A", "Headline B"],
    "content": [
        "This is a long paragraph. It has punctuation, commas, and even new\nlines.",
        "Another example! Includes: colons, quotes ('like this'), and more..."
    ]
})

# Save DataFrame to pickle file
df.to_pickle("articles_df.pkl")

# Later, load the DataFrame from the pickle file
df = pd.read_pickle("articles_df.pkl")
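
One concrete difference between the two formats is what happens to Python objects on a round trip. The sketch below uses only the standard library and an invented record (the `tokens` field is hypothetical) to show that a list comes back from CSV as a string but comes back from pickle unchanged.

```python
import csv
import io
import pickle

# Hypothetical record with a nested list and multi-line text
records = [
    {"id": 1, "tokens": ["white", "working", "class"], "content": "Line one,\nline two."},
]

# CSV round trip (in memory): every value is read back as a string
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "tokens", "content"])
writer.writeheader()
writer.writerows(records)
buffer.seek(0)
csv_rows = list(csv.DictReader(buffer))

# Pickle round trip (in memory): the original objects are restored
pickled_rows = pickle.loads(pickle.dumps(records))

print(type(csv_rows[0]["tokens"]).__name__)      # str
print(type(pickled_rows[0]["tokens"]).__name__)  # list
```

Here the quoted multi-line text survives the CSV round trip, but the list is now the string `"['white', 'working', 'class']"` and would need to be parsed again before use.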

## Clean-Up

In [17]:
import os

# Delete the file if it exists
file_path = "articles_df.pkl"
if os.path.exists(file_path):
    os.remove(file_path)
    print(f"{file_path} deleted.")

articles_df.pkl deleted.
