In [1]:
import json
import re

### What is good data? 

* **Domain-Specific Jargon** – "benchmark federal funds rate," "monetary policy," "yield on 10-year Treasury bonds"
* **Contextually Rich** – Explains the relationship between interest rates, bonds, and equity markets
* **Realistic Scenario** – Discusses Federal Reserve actions, investor sentiment, etc.

In [2]:
example1 = '''
The Federal Reserve's decision to raise the benchmark federal funds rate by 25 basis points to 5.25% 
signals a tightening monetary policy aimed at curbing inflationary pressures.The yield on 10-year 
Treasury bonds climbed to 4.1%, reflecting investor expectations of prolonged restrictive policy.
Equity markets experienced a downturn, with the S&P 500 declining 1.2% as growth stocks faced valuation
adjustments.

'''

In [3]:
example2 = '''
"Sartre’s notion of radical freedom, as articulated in Being and Nothingness, posits that existence
precedes essence, rejecting any predetermined human nature. This existentialist framework implies that
individuals bear absolute responsibility for their choices, a premise that leads to what Sartre describes
as 'anguish'—the weight of freedom in an indifferent universe. Similarly, Heidegger’s concept of Dasein
emphasizes being-toward-death, underscoring how authentic existence emerges only through an awareness of
finitude. In contrast, Camus’ Myth of Sisyphus rejects metaphysical meaning, framing human life as an absurd
confrontation between the search for meaning and an indifferent cosmos, where rebellion against nihilism 
becomes the ultimate assertion of freedom."

'''

### Preprocessing
* It is a good idea to break free-form text into chunks (e.g. sentences)
* I used a regex here, but consider an NLP library such as NLTK or spaCy for chunking, clean-up, etc.
* Put the result in a format that works with your fine-tuning pipeline (I used a list of dicts here, which works with the Hugging Face training pipeline)
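The clean-up and chunking steps above can be sketched with the standard library alone. This is a minimal illustration, not a replacement for a proper sentence tokenizer: the helper names `clean_text`, `split_sentences`, and `to_records` and the `sample` string are hypothetical, and the splitter will still break on abbreviations such as "Dr." that are followed by a space.

```python
import re

def clean_text(raw):
    """Collapse every run of whitespace (including newlines) into one space."""
    return re.sub(r"\s+", " ", raw).strip()

def split_sentences(text):
    """Split after '.', '!' or '?' when that character is followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

def to_records(raw):
    """Return one {"text": sentence} dict per sentence (hypothetical helper)."""
    return [{"text": s} for s in split_sentences(clean_text(raw))]

# Made-up sample with stray newlines and repeated spaces
sample = "Rates rose again.\nBond yields   followed!  Did equities\nfall?"
records = to_records(sample)
print(records)
```

Normalizing whitespace before splitting keeps line breaks from the source text out of the individual records.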

In [4]:
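# Split on whitespace that follows sentence-ending punctuation (., ! or ?).
# A period with no space after it (e.g. "pressures.The" in example1) is not
# treated as a boundary, so those two sentences stay together in element 0.
sentences = re.split(r'(?<=[.!?])\s+', example1.strip())

# Wrap each sentence in a dict with a "text" key
financial_data = [{"text": sentence} for sentence in sentences]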
financial_data[1]

{'text': 'Equity markets experienced a downturn, with the S&P 500 declining 1.2% as growth stocks faced valuation\nadjustments.'}
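One possible way to hand records like `financial_data` to a training pipeline is to store them as JSON Lines, one record per line. The sketch below assumes that layout; the `records` list, the helpers `write_jsonl` and `read_jsonl`, and the file name are all made up for illustration.

```python
import json
import os
import tempfile

# Stand-in records in the same {"text": ...} shape used above (made-up content)
records = [
    {"text": "Rates rose by 25 basis points."},
    {"text": "Equity markets declined 1.2%."},
]

def write_jsonl(rows, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")

def read_jsonl(path):
    """Read the file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

out_path = os.path.join(tempfile.mkdtemp(), "financial_data.jsonl")
write_jsonl(records, out_path)
round_trip = read_jsonl(out_path)
print(round_trip == records)

# If the Hugging Face `datasets` library is installed, a file like this can be
# loaded with: load_dataset("json", data_files=out_path)
```

Keeping one record per line makes the file easy to inspect, append to, and stream.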

### **Some Domain-Specific Datasets (not tested)**

### **1. Finance**
- **Dataset**: [SEC EDGAR Filings](https://www.sec.gov/edgar.shtml)
- **Description**: Public company reports (10-K, 10-Q), regulatory filings, and financial statements.
- **Use Case**: Training financial NLP models for understanding earnings reports and financial jargon.

### **2. Law**
- **Dataset**: [Harvard Caselaw Access Project](https://case.law/)
- **Description**: Published U.S. state and federal court decisions.
- **Use Case**: Legal document processing, contract analysis, and legal text summarization.

### **3. Healthcare & Medical**
- **Dataset**: [PubMed Open Access](https://pubmed.ncbi.nlm.nih.gov/)
- **Description**: Biomedical research papers, clinical trial results, and medical abstracts.
- **Use Case**: Training models for medical literature comprehension, biomedical NLP, and clinical text processing.

### **4. Cybersecurity**
- **Dataset**: [MITRE CVE Database](https://cve.mitre.org/)
- **Description**: Catalog of publicly disclosed cybersecurity vulnerabilities, each with a CVE identifier and a short description.
- **Use Case**: Threat intelligence NLP, cybersecurity risk analysis, and automated vulnerability tracking.

### **5. Science & Research**
- **Dataset**: [ArXiv Open Access](https://arxiv.org/)
- **Description**: AI, physics, mathematics, and computational research papers.
- **Use Case**: Scientific paper summarization, technical NLP applications, and AI research modeling.

### **6. News & Journalism**
- **Dataset**: [CNN/DailyMail Dataset](https://huggingface.co/datasets/cnn_dailymail)
- **Description**: News articles with human-written summaries.
- **Use Case**: Training NLP models for news summarization and information retrieval.

### **7. Business & Economy**
- **Dataset**: [S&P 500 Earnings Call Transcripts](https://seekingalpha.com/)
- **Description**: CEO, CFO, and analyst discussions on financial markets and economic trends.
- **Use Case**: Sentiment analysis, financial trend prediction, and business NLP applications.

### **8. Technology & Code**
- **Dataset**: [GitHub Code Corpus](https://github.com/EGI-Federation/ai-code-analysis)
- **Description**: Open-source repositories, code comments, and discussions.
- **Use Case**: Training code-aware BERT models, AI-powered coding assistants, and bug detection.

### **9. Social Media & Sentiment Analysis**
- **Dataset**: [PushShift Reddit Dataset](https://files.pushshift.io/reddit/)
- **Description**: Large-scale Reddit comments and discussions across various topics.
- **Use Case**: Conversational AI, chatbot training, and sentiment analysis on social media.

### **10. Legal & Government Regulations**
- **Dataset**: [EUR-Lex](https://eur-lex.europa.eu/)
- **Description**: European Union legal documents, regulations, and legislative texts.
- **Use Case**: Law-focused BERT training, policy document analysis, and regulatory compliance automation.