# Earnings Call Transcript Statistics

This notebook computes sample statistics for:

- **Structured calls** (number of firms, region, section lengths, number of speakers)
- **Paragraph-level data** from the JSONL paragraphs file

It assumes the project layout and data pipeline described in the README.

In [1]:
from pathlib import Path
import json
import pandas as pd

from semantic_measurement.config import DATA_ROOT  # uses your existing config

# ---- User parameters ----
# Set the index you want to analyze: 'SP500' or 'STOXX600'
index = 'SP500'

DATA_ROOT = Path(DATA_ROOT)
structured_root = DATA_ROOT / 'dataset' / 'intermediaries' / 'structured_calls' / index
paragraphs_path = DATA_ROOT / 'dataset' / 'intermediaries' / 'paragraphs' / f'{index}_paragraphs.jsonl'

structured_root, paragraphs_path

(PosixPath('/Users/marleendejonge/Desktop/semantic_measurement_project/data/dataset/intermediaries/structured_calls/SP500'),
 PosixPath('/Users/marleendejonge/Desktop/semantic_measurement_project/data/dataset/intermediaries/paragraphs/SP500_paragraphs.jsonl'))

## 1. Load structured calls

Each `structured_calls_*.json` file contains a list of call dicts. We load and concatenate them into a single DataFrame.

In [2]:
records = []
for path in sorted(structured_root.glob('structured_calls_*.json')):
    with open(path, 'r', encoding='utf-8') as f:
        data = json.load(f)  # list of call dicts
        records.extend(data)

calls_df = pd.DataFrame(records)

# Add region variable (same as index here)
calls_df['region'] = index

calls_df.head()

,file,filename,company_name,ticker,quarter,year,date,management_discussion_full,qa_section_full,speaker_segments_management,speaker_segments_qa,management_paragraphs,qa_paragraphs,region
0,CORRECTED TRANSCRIPT 3M CoMMMUS Q1 2010 Earnin...,CORRECTED TRANSCRIPT 3M CoMMMUS Q1 2010 Earnin...,3M Co.,MMM,Q1,2010,2010-04-27,"Operator: Ladies and gentlemen, thank you for...",Operator: [Operator Instructions] Our first q...,"[{'speaker': 'Operator', 'profession': 'Operat...","[{'speaker': 'Operator', 'profession': 'Operat...","[Operator: Ladies and gentlemen, thank you fo...",[Operator: [Operator Instructions] Our first ...,SP500
1,CORRECTED TRANSCRIPT 3M CoMMMUS Q1 2011 Earnin...,CORRECTED TRANSCRIPT 3M CoMMMUS Q1 2011 Earnin...,3M Co.,MMM,Q1,2011,2011-04-26,"Operator: Ladies and gentlemen, thank you for...",Operator: [Operator Instructions] our first q...,"[{'speaker': 'Operator', 'profession': 'Operat...","[{'speaker': 'Operator', 'profession': 'Operat...","[Operator: Ladies and gentlemen, thank you fo...",[Operator: [Operator Instructions] our first ...,SP500
2,CORRECTED TRANSCRIPT 3M CoMMMUS Q1 2012 Earnin...,CORRECTED TRANSCRIPT 3M CoMMMUS Q1 2012 Earnin...,3M Co. (MMM),MMM,Q1,2012,2012-04-24,MANAGEMENT DISCUSSION SECTION\nMANAGEMENT DISC...,R SECTION\nR SECTION\nR SECTION \n \nOpera...,"[{'speaker': 'MANAGEMENT DISCUSSION SECTION', ...","[{'speaker': 'R SECTION', 'profession': 'R SEC...",[MANAGEMENT DISCUSSION SECTION\nMANAGEMENT DIS...,[R SECTION\nR SECTION\nR SECTION \n \nOper...,SP500
3,CORRECTED TRANSCRIPT The Goldman Sachs Group I...,CORRECTED TRANSCRIPT The Goldman Sachs Group I...,"The Goldman Sachs Group, Inc. (GS)",GS,Q3,2021,2021-10-15,"Operator: Good morning. My name is Erica, and ...",Operator: [Operator Instructions] Your first q...,"[{'speaker': 'Operator', 'profession': 'Operat...","[{'speaker': 'Operator', 'profession': 'Operat...","[Operator: Good morning. My name is Erica, and...",[Operator: [Operator Instructions] Your first ...,SP500
4,CORRECTED TRANSCRIPT The Goldman Sachs Group I...,CORRECTED TRANSCRIPT The Goldman Sachs Group I...,"The Goldman Sachs Group, Inc. (GS)",GS,Q3,2022,2022-10-18,"Operator: Good morning. My name is Katie, and ...",Operator: Thank you. [Operator Instructions] W...,"[{'speaker': 'Operator', 'profession': 'Operat...","[{'speaker': 'Operator', 'profession': 'Operat...","[Operator: Good morning. My name is Katie, and...",[Operator: Thank you. [Operator Instructions] ...,SP500


## 2. Structured call statistics

### 2.1 Number of calls and firms

In [3]:
n_calls = len(calls_df)
n_firms_ticker = calls_df['ticker'].nunique()
n_firms_company = calls_df['company_name'].nunique()

print('Index:', index)
print('Number of calls:', n_calls)
print('Unique firms (ticker):', n_firms_ticker)
print('Unique firms (company_name):', n_firms_company)

print('\nCalls per firm (ticker) – summary:')
calls_per_firm = calls_df.groupby('ticker')['file'].nunique().describe()
calls_per_firm

Index: SP500
Number of calls: 10
Unique firms (ticker): 2
Unique firms (company_name): 5

Calls per firm (ticker) – summary:


count    2.000000
mean     5.000000
std      2.828427
min      3.000000
25%      4.000000
50%      5.000000
75%      6.000000
max      7.000000
Name: file, dtype: float64
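
The gap between 2 unique tickers and 5 unique company names suggests that the same firm appears under several name variants; the preview above shows both `3M Co.` and `3M Co. (MMM)`. Below is a minimal sketch for collapsing one such variant, a trailing ticker in parentheses. The helper `normalize_company_name` and the toy frame are illustrative assumptions, not part of the pipeline, and other variants may need further rules.

```python
import re
import pandas as pd

def normalize_company_name(name) -> str:
    """Strip a trailing '(TICKER)' suffix and surrounding whitespace (hypothetical helper)."""
    return re.sub(r"\s*\([A-Z.]+\)\s*$", "", str(name)).strip()

# Toy frame mirroring the name variants visible in the preview above
toy = pd.DataFrame({
    "ticker": ["MMM", "MMM", "GS"],
    "company_name": ["3M Co.", "3M Co. (MMM)", "The Goldman Sachs Group, Inc. (GS)"],
})
toy["company_name_clean"] = toy["company_name"].map(normalize_company_name)

# Number of distinct names per ticker before and after cleaning
variants_before = toy.groupby("ticker")["company_name"].nunique()
variants_after = toy.groupby("ticker")["company_name_clean"].nunique()
print(variants_before.to_dict(), variants_after.to_dict())
```

Until names are cleaned, counting firms by `ticker` is probably the safer choice.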

### 2.2 Section lengths (management and Q&A)

We compute:
- Length in characters of the full management and Q&A sections.
- Number of paragraphs in each section (using the precomputed `*_paragraphs` lists).

In [4]:
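def list_len_safe(x):
    """Return len(x), or 0 for values without a length (e.g. NaN)."""
    try:
        return len(x)
    except TypeError:
        return 0

# Character length of full sections.
# fillna('') makes missing sections count as 0 instead of 3 (the length of 'nan').
calls_df['mgmt_len_chars'] = calls_df['management_discussion_full'].fillna('').astype(str).str.len()
calls_df['qa_len_chars'] = calls_df['qa_section_full'].fillna('').astype(str).str.len()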

# Paragraph counts
calls_df['mgmt_n_paragraphs'] = calls_df['management_paragraphs'].apply(list_len_safe)
calls_df['qa_n_paragraphs'] = calls_df['qa_paragraphs'].apply(list_len_safe)

calls_df[['mgmt_len_chars', 'qa_len_chars', 'mgmt_n_paragraphs', 'qa_n_paragraphs']].describe()

,mgmt_len_chars,qa_len_chars,mgmt_n_paragraphs,qa_n_paragraphs
count,10.0,10.0,10.0,10.0
mean,21602.5,61907.3,3.7,6.0
std,8336.777092,18967.601717,3.945462,6.63325
min,13861.0,35099.0,1.0,1.0
25%,16106.0,46399.25,1.0,1.0
50%,19472.0,65073.0,1.0,1.0
75%,21327.75,68686.5,5.0,11.75
max,40303.0,101121.0,11.0,17.0


You can also inspect how these evolve by year:

In [5]:
calls_df.groupby('year')[['mgmt_len_chars', 'qa_len_chars', 'mgmt_n_paragraphs', 'qa_n_paragraphs']].mean()

year,mgmt_len_chars,qa_len_chars,mgmt_n_paragraphs,qa_n_paragraphs
2010,29113.5,45479.0,8.0,14.0
2011,23860.0,43231.5,7.5,13.0
2012,17641.5,84694.5,1.0,1.0
2021,20484.0,67243.0,1.0,1.0
2022,21609.0,73291.0,1.0,1.0
2023,18841.0,68826.0,1.0,1.0
2024,13861.0,62903.0,1.0,1.0
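
The paragraph-count means of exactly 1.0 from 2012 onward, next to section lengths of tens of thousands of characters, may indicate that some `*_paragraphs` lists hold a whole section as a single element. The sketch below flags such rows. The helper `flag_unsplit`, the `min_chars` threshold, and the toy values are assumptions for illustration only.

```python
import pandas as pd

def flag_unsplit(df: pd.DataFrame, n_col: str, len_col: str, min_chars: int = 2000) -> pd.Series:
    """True where a section has at most one paragraph yet is long (illustrative heuristic)."""
    return (df[n_col] <= 1) & (df[len_col] >= min_chars)

# Toy rows with the same column names as calls_df
toy = pd.DataFrame({
    "year": [2010, 2012, 2021],
    "mgmt_n_paragraphs": [8, 1, 1],
    "mgmt_len_chars": [29000, 17000, 20000],
})
toy["mgmt_unsplit"] = flag_unsplit(toy, "mgmt_n_paragraphs", "mgmt_len_chars")

# Share of flagged calls per year
print(toy.groupby("year")["mgmt_unsplit"].mean())
```

If many calls are flagged, the paragraph counts in this section describe the upstream splitting step more than the calls themselves.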


### 2.3 Number of speakers per call

We count unique speakers in management and Q&A segments, and the union across both sections.

In [6]:
def count_unique_speakers(segments):
    if not isinstance(segments, list):
        return 0
    return len({seg.get('speaker') for seg in segments if seg.get('speaker')})

calls_df['mgmt_n_speakers'] = calls_df['speaker_segments_management'].apply(count_unique_speakers)
calls_df['qa_n_speakers'] = calls_df['speaker_segments_qa'].apply(count_unique_speakers)

def count_total_speakers(row):
    speakers = set()
    for seg_list in [row['speaker_segments_management'], row['speaker_segments_qa']]:
        if isinstance(seg_list, list):
            speakers |= {seg.get('speaker') for seg in seg_list if seg.get('speaker')}
    return len(speakers)

calls_df['total_n_speakers'] = calls_df.apply(count_total_speakers, axis=1)

calls_df[['mgmt_n_speakers', 'qa_n_speakers', 'total_n_speakers']].describe()

,mgmt_n_speakers,qa_n_speakers,total_n_speakers
count,10.0,10.0,10.0
mean,2.6,9.6,10.0
std,1.505545,7.441625,7.81736
min,1.0,1.0,1.0
25%,1.0,1.0,1.0
50%,3.0,14.0,14.5
75%,3.75,16.0,16.0
max,5.0,16.0,18.0
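
The counts above include every distinct `speaker` label, and the preview in Section 1 shows labels such as `Operator`, `MANAGEMENT DISCUSSION SECTION`, and `R SECTION` that are not people. Below is a sketch of a variant that ignores such labels. The set `NON_PERSON_LABELS` is an assumed, incomplete list and `count_person_speakers` is a hypothetical helper.

```python
# Assumed list of non-person labels; extend after inspecting the data
NON_PERSON_LABELS = {"operator", "management discussion section", "r section"}

def count_person_speakers(segments, exclude=NON_PERSON_LABELS):
    """Count unique speakers, ignoring labels in `exclude` (case-insensitive)."""
    if not isinstance(segments, list):
        return 0
    names = set()
    for seg in segments:
        if not isinstance(seg, dict):
            continue
        speaker = (seg.get("speaker") or "").strip()
        if speaker and speaker.lower() not in exclude:
            names.add(speaker)
    return len(names)

# Toy segment list; "Jane Doe" is a made-up name
toy_segments = [
    {"speaker": "Operator"},
    {"speaker": "MANAGEMENT DISCUSSION SECTION"},
    {"speaker": "Jane Doe"},
    {"speaker": "Jane Doe"},
    {"speaker": None},
]
print(count_person_speakers(toy_segments))
```

It could be applied in the same way as `count_unique_speakers`, for example with `calls_df['speaker_segments_qa'].apply(count_person_speakers)`.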


## 3. Paragraph-level statistics

Now we load the paragraph JSONL file and compute paragraph-level statistics by section and year.

In [7]:
# Read the whole file at once; if pandas raises ValueError, retry in chunks below.
# Note: running out of memory typically raises MemoryError, which this block does not catch.

try:
    para_df = pd.read_json(paragraphs_path, lines=True)
except ValueError:
    # Fallback: chunked loading for very large files
    chunks = pd.read_json(paragraphs_path, lines=True, chunksize=200_000)
    para_df = pd.concat(chunks, ignore_index=True)

para_df.head()

,file,filename,company,ticker,date,quarter,year,section,speaker,profession,paragraph
0,CORRECTED TRANSCRIPT 3M CoMMMUS Q1 2010 Earnin...,CORRECTED TRANSCRIPT 3M CoMMMUS Q1 2010 Earnin...,3M Co.,MMM,2010-04-27,Q1,2010,management,Operator,Operator,"Ladies and gentlemen, thank you for standing b..."
1,CORRECTED TRANSCRIPT 3M CoMMMUS Q1 2010 Earnin...,CORRECTED TRANSCRIPT 3M CoMMMUS Q1 2010 Earnin...,3M Co.,MMM,2010-04-27,Q1,2010,management,Operator,Operator,I would now like to turn the call over to Matt...
2,CORRECTED TRANSCRIPT 3M CoMMMUS Q1 2010 Earnin...,CORRECTED TRANSCRIPT 3M CoMMMUS Q1 2010 Earnin...,3M Co.,MMM,2010-04-27,Q1,2010,management,Operator,Operator,"Matthew J. Ginter, Vice President, Investor Re..."
3,CORRECTED TRANSCRIPT 3M CoMMMUS Q1 2010 Earnin...,CORRECTED TRANSCRIPT 3M CoMMMUS Q1 2010 Earnin...,3M Co.,MMM,2010-04-27,Q1,2010,management,Operator,Operator,"Before I turn things over to George, please ta..."
4,CORRECTED TRANSCRIPT 3M CoMMMUS Q1 2010 Earnin...,CORRECTED TRANSCRIPT 3M CoMMMUS Q1 2010 Earnin...,3M Co.,MMM,2010-04-27,Q1,2010,management,Operator,Operator,So let’s begin today’s review. Turn to slide n...
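
For files too large to hold in memory, concatenating all chunks brings back the same memory problem. Simple counts can instead be accumulated chunk by chunk. The sketch below does this for paragraphs per section on a small temporary file. The toy rows, the file name, and `chunksize=2` are assumptions for illustration.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path

import pandas as pd

# Toy rows with the same keys as the paragraphs file (values are made up)
rows = [
    {"file": "call_a", "section": "management", "paragraph": "one two three"},
    {"file": "call_a", "section": "qa", "paragraph": "four five"},
    {"file": "call_b", "section": "qa", "paragraph": "six"},
]

section_counts = Counter()
with tempfile.TemporaryDirectory() as tmp:
    toy_path = Path(tmp) / "toy_paragraphs.jsonl"
    toy_path.write_text("\n".join(json.dumps(r) for r in rows), encoding="utf-8")

    # With chunksize set, read_json yields DataFrames of at most that many lines
    for chunk in pd.read_json(toy_path, lines=True, chunksize=2):
        section_counts.update(chunk["section"].value_counts().to_dict())

print(dict(section_counts))
```

The same pattern works for sums and counts, so means can be derived afterwards. Quantiles need all values or an approximation.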


### 3.1 Basic counts

In [8]:
n_paragraphs = len(para_df)
print('Total paragraphs:', n_paragraphs)

print('\nParagraphs by section:')
print(para_df['section'].value_counts())

print('\nParagraphs by year and section:')
para_df.groupby(['year', 'section'])['paragraph'].count().unstack(fill_value=0)

Total paragraphs: 2157

Paragraphs by section:
section
qa            1607
management     550
Name: count, dtype: int64

Paragraphs by year and section:


year,management,qa
2010,149,467
2011,162,397
2012,79,312
2021,43,113
2022,45,111
2023,42,118
2024,30,89


### 3.2 Paragraph length distributions

We compute two length measures for each paragraph: the character count and an approximate token count obtained by splitting on whitespace.

In [9]:
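# fillna('') makes missing paragraphs count as 0 characters / 0 tokens instead of the string 'nan'
para_df['paragraph_len_chars'] = para_df['paragraph'].fillna('').astype(str).str.len()
para_df['paragraph_len_tokens_approx'] = para_df['paragraph'].fillna('').astype(str).str.split().str.len()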

para_df[['paragraph_len_chars', 'paragraph_len_tokens_approx']].describe()

,paragraph_len_chars,paragraph_len_tokens_approx
count,2157.0,2157.0
mean,290.896616,50.029207
std,245.614201,42.472643
min,5.0,1.0
25%,76.0,13.0
50%,240.0,41.0
75%,458.0,78.0
max,1706.0,304.0


By section:


In [10]:
para_df.groupby('section')[['paragraph_len_chars', 'paragraph_len_tokens_approx']].describe()

paragraph_len_chars
section,count,mean,std,min,25%,50%,75%,max
management,550.0,373.398182,223.686825,16.0,200.25,379.0,521.0,1296.0
qa,1607.0,262.660236,246.480959,5.0,68.0,175.0,424.5,1706.0

paragraph_len_tokens_approx
section,count,mean,std,min,25%,50%,75%,max
management,550.0,59.698182,35.622333,2.0,32.0,61.0,83.0,195.0
qa,1607.0,46.719975,44.101603,1.0,11.0,32.0,76.0,304.0


### 3.3 Linking back to calls (optional)

To combine paragraph-level and call-level information, aggregate the paragraphs per call and merge on `file` (or `filename`), which both tables contain.
For example, you can compute the average paragraph length per call or per firm-year.
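
A minimal sketch of that join on toy data follows. The frames `calls_toy` and `para_toy` stand in for `calls_df` and `para_df`, and the file keys `call_a` and `call_b` are made up. It assumes `file` identifies a call uniquely in the call-level table, which `validate="one_to_one"` checks.

```python
import pandas as pd

# Toy stand-ins for calls_df and para_df with only the columns used here
calls_toy = pd.DataFrame({
    "file": ["call_a", "call_b"],
    "ticker": ["MMM", "GS"],
    "year": [2010, 2021],
})
para_toy = pd.DataFrame({
    "file": ["call_a", "call_a", "call_b"],
    "section": ["management", "qa", "qa"],
    "paragraph": ["one two three", "four five", "six"],
})
para_toy["paragraph_len_tokens_approx"] = para_toy["paragraph"].str.split().str.len()

# Aggregate paragraphs to one row per call
per_call = (
    para_toy.groupby("file")["paragraph_len_tokens_approx"]
    .agg(n_paragraphs="count", mean_tokens="mean")
    .reset_index()
)

# Attach call-level columns; raises if `file` is duplicated on either side
merged = calls_toy.merge(per_call, on="file", how="left", validate="one_to_one")

# Firm-year average of the per-call means
firm_year = merged.groupby(["ticker", "year"])["mean_tokens"].mean()
print(merged)
print(firm_year)
```

Aggregating before merging keeps the merged frame at one row per call and avoids copying the long text columns of `calls_df` onto every paragraph row.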