In [39]:
import json
import glob

import pandas as pd
import numpy as np

from pydantic import BaseModel
from openai import OpenAI

from tqdm import tqdm
tqdm.pandas()

In [2]:
client = OpenAI()

In [48]:
context_prompt = """
You are a QUD parsing expert.
Your task is to identify the implicit question that the speaker is answering in the following statement.
The question should reflect the main purpose of the statement.

Examples for this question generation are:

Statement: Well, as a matter of fact there is. If we assume the ah – uh – a rate of growth of our economy, equivalent to what it was during President Johnson, President Kennedy, even before the – the – the – uh wa uh – Vietnese- namese War, and if we assume that at the end of the four- year period we can cut our unemployment rate down to 4 to 4 and a half percent – under those circumstances, even assuming no elimination of unnecessary programs and assuming an increase in the ad- in the allotment of money to finance programs, increasing as the inflation rate does – my economic projections, I think confirmed by the House uh – and the Senate committees, have been with the $60 billion extra amount of money that can be spent in fiscal year 81 which will be the last year of this next term. Within that sixty- billion dollars increase there would be fit the programs that I promised the American people. I might say too, that – that if we see that these goals cannot be reached – and I believe theyre reasonable goals – then I would cut back on the rate of implement- implementation of new programs in order to accommodate a balanced budget by fiscal year 81 which is the last year of the next term. I believe that we ought to have a balanced budget during normal economic circumstances. And uh – these projections have been very carefully made. I stand behind them. And if they should be in error slightly on the down side, then Ill phase in the programs that weve uh – advocated, more slowly.
Question: How can the proposed programs be funded while achieving a balanced budget by fiscal year 1981?

Statement: Well, the first thing we have to do is get spending under control in Washington. Its completely out of control. Its gone — we have now presided over the largest increase in the size of government since the Great Society. We Republicans came to power to change government, and government changed us. And the — the worst symptom on this disease is what my friend, Tom Coburn, calls earmarking as a gateway drug, because its a gateway. Its a gateway to out- of- control spending and corruption. And we have former members of Congress now residing in federal prison because of the evils of this earmarking and pork- barrel spending. You know, we spent $3 million to study the DNA of bears in Montana. I dont know if that was a criminal issue or a paternal issue, but the fact is that it was $3 million of our taxpayers money. And it has got to be brought under control. As president of the United States, I want to assure you, Ive got a pen. This ones kind of old. Ive got a pen, and Im going to veto every single spending bill that comes across my desk. I will make them famous. You will know their names. Now, Senator Obama, you wanted to know one of the differences. He has asked for $932 million of earmark pork- barrel spending, nearly a million dollars for every day that hes been in the United States Senate. I suggest that people go up on the Web site of Citizens Against Government Waste, and theyll look at those projects. That kind of thing is not the way to rein in runaway spending in Washington, D. C. Thats one of the fundamental differences that Senator Obama and I have.
Question: What steps need to be taken to address out-of-control government spending in Washington?

Statement: Well, Sander, thats a good question, and the answer is; for 40- some years we kept the peace. If you look at the cost of not keeping the peace in Europe, it would be exorbitant. We have reduced the number of troops that are deployed and going to be deployed. I have cut defense spending. And the reason we could do that is because of our fantastic success in winning the Cold War. We never would have got there if we had gone for the nuclear freeze crowd; we never would have got there if we had listened to those that wanted to cut defense spending. I think it is important that the US stay in Europe and continue to guarantee the peace. We simply cannot pull back. Now, when anybody has a spending program they want to spend money on at home, they say, well, lets cut money out of the Defense Dept. I will accept and have accepted the recommendations of 2 proven leaders, General Colin Powell and Secretary Dick Cheney. They feel that the levels were operating at and the reductions that I have proposed are proper. And so I simply do not think we should go back to the isolation days and starting blaming foreigners. We are the sole remaining superpower, and we should be that. And we have a certain disproportionate responsibility. But I would ask the American people to understand that if we make imprudent cuts, if we go too far, we risk the peace. And I dont want to do that. Ive seen what it is like to see a war, to see the burdens of a war, and I dont want to see us make reckless cuts. Because of our programs we have been able to significantly cut defense spending. But lets not cut into the muscle, and lets not cut down our insurance policy, which is participation of American forces in NATO, the greatest peace- keeping organization ever made. Today youve got problems in Europe, still bubbling along even though Europes gone democracys route. But we are there, and I think this insurance policy is necessary. 
I think it goes with world leadership, and I think the levels weve come up with are just about right.
Question: Why is it important for the United States to maintain its current level of defense spending and military presence in Europe?

Statement: Well, I think theyre serious. I think its a matter that we should continue to uh – give uh – great care and attention to. We should support uh – the laws which the United States has passed in order to protect us from uh – those who would destroy us from within. We should sustain uh – the Department of Justice in its efforts and the F. B. I. , and we should be continually alert. I think if the United States is maintaining a strong society here in the United States, I think that we can meet any internal threat. The major threat is external and will continue.
Question: What measures should the United States take to address internal and external threats?

Please generate one question and one question only without any prefaces.
"""

In [49]:
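def get_qud(prompt):
    """Return the implicit question (QUD) that the statement in `prompt` answers."""
    # No structured response_format is requested, so the plain chat
    # completions endpoint is sufficient here.
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": context_prompt},  # instructions + few-shot examples
            {"role": "user", "content": prompt},  # statement to analyse
        ],
    )
    return completion.choices[0].message.content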
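
get_qud makes one network call per statement, so a single transient error would abort a whole progress_apply pass. A minimal sketch of a retry wrapper with exponential backoff follows. The name call_with_retries and its parameters are hypothetical; they are not part of the notebook or of the OpenAI SDK, and the demo uses a stand-in function instead of a real API call.

```python
import time


def call_with_retries(fn, *args, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(*args), retrying with exponential backoff on any exception.

    Hypothetical helper for illustration only.
    """
    for attempt in range(attempts):
        try:
            return fn(*args)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)  # wait base_delay, then 2x, 4x, ...


# Demo with a stand-in function that fails twice, then succeeds.
calls = {"n": 0}


def flaky(text):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return f"QUD for: {text}"


delays = []  # record the requested delays instead of actually sleeping
result = call_with_retries(flaky, "some statement", sleep=delays.append)
print(result, delays)
```

In the notebook this would be used as call_with_retries(get_qud, row['text']). Catching every Exception is a simplification; a real run would probably narrow it to the error types worth retrying.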

In [50]:
get_qud("The Justice Department is in the process of trying to gain control over a law that federal Judge David Sentelle recently called a 'monster.' Needless to say, he was talking about RICO.With its recently revised guidelines for RICO, Justice makes it clear that the law currently holds too many incentives for abuse by prosecutors.The text of the 'new policy' guidelines from the Criminal Division are reprinted nearby.They strongly suggest that Justice's prosecutions of Drexel Burnham Lambert, Michael Milken and Princeton/Newport violated notions of fundamental fairness. Justice is attempting to avoid a replay of these tactics.This amounts to an extraordinary repudiation of the tenure of New York mayoral candidate and former U.S. Attorney Rudolph Giuliani, who was more inclined to gathering scalps than understanding markets.The new guidelines limit the pretrial forfeitures of assets of RICOed defendants and their investors, clients, bankers and others.This follows earlier new guidelines from the Tax Division prohibiting Princeton/Newport-like tax cases from masquerading as RICO cases.")

'What changes is the Justice Department implementing regarding the use of RICO to prevent prosecutorial abuse?'

In [51]:
data_base_dir = "data/by_date"

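dates = []
dfs = {}  # maps debate date -> DataFrame holding that debate's turns
for fpath in glob.glob(data_base_dir + "/*"):
    df = pd.read_csv(fpath)
    date = df["date"].iloc[0]  # use the date in the first row as the key
    dfs[date] = df
    dates.append(date)

dates.sort()  # ISO-formatted dates sort chronologically as strings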

In [52]:
dfs[dates[0]]

Unnamed: 0,speaker,text,type,election_year,date,candidate,qud,question
0,Howard Smith,Good evening. The television and radio station...,Pres,1960,1960-09-26,0,What are the rules and structure for the upcom...,
1,John Kennedy,"Mr. Smith, Mr. Nixon. In the election of 1860,...",Pres,1960,1960-09-26,1,What is the responsibility of the United State...,
2,Howard Smith,And now the opening statement by Vice Presiden...,Pres,1960,1960-09-26,0,What key points will Vice President Nixon addr...,
3,Richard Nixon,"Mr. Smith, Senator Kennedy. The things that Se...",Pres,1960,1960-09-26,1,Question: What is the basis for the disagreeme...,
4,Howard Smith,"Thank you, Mr. Nixon. That completes the openi...",Pres,1960,1960-09-26,0,What will be the format for the candidates' re...,
...,...,...,...,...,...,...,...,...
63,Howard Smith,Three minutes and twenty seconds for each cand...,Pres,1960,1960-09-26,0,What is Vice President Nixon's concluding stat...,
64,Richard Nixon,"Thank you, Mr. Smith. Senator Kennedy. First o...",Pres,1960,1960-09-26,1,What are the differences between your economic...,
65,Howard Smith,"Senator Kennedy, your conclusion.",Pres,1960,1960-09-26,0,What is your final assessment on the topic dis...,
66,John Kennedy,The point was made by Mr. Nixon that the Sovie...,Pres,1960,1960-09-26,1,Question: What is the key question facing the ...,
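
Rows 3 and 66 above show that the model sometimes echoes the "Question:" label from the few-shot examples, despite the instruction to answer without prefaces. A small post-processing step can normalise this. The sketch below is an assumption about how one might clean the column; strip_question_prefix is a hypothetical name and the sample strings are made up.

```python
import re

# Matches a leading "Question:" label (any case) plus surrounding whitespace.
_PREFIX = re.compile(r"^\s*question\s*:\s*", flags=re.IGNORECASE)


def strip_question_prefix(qud):
    """Remove a leading 'Question:' label; pass non-strings (None/NaN) through."""
    if not isinstance(qud, str):
        return qud
    return _PREFIX.sub("", qud).strip()


# Made-up sample values, not taken from the data.
cleaned = [
    strip_question_prefix("Question: What should be done about taxes?"),
    strip_question_prefix("What should be done about taxes?"),
    strip_question_prefix(None),
]
print(cleaned)
```

Applied to a loaded frame, this would look like df['qud'] = df['qud'].map(strip_question_prefix).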


In [53]:
sum([len(df) for df in dfs.values()])

2716

In [54]:
for i, d in enumerate(dates):
    print(f"{i}\t{d}\t{len(dfs[d])}\t{len(dfs[d].columns)}")

0	1960-09-26	68	8
1	1976-09-23	94	8
2	1980-09-21	76	8
3	1984-10-07	134	8
4	1988-09-25	161	8
5	1992-10-11	92	8
6	1996-10-06	144	8
7	2000-10-03	166	8
8	2004-09-30	142	8
9	2008-09-26	189	8
10	2012-10-03	210	8
11	2016-09-26	308	8
12	2020-09-29	932	8


In [68]:
i = 1

In [69]:
pd.read_csv(f"{data_base_dir}/{dates[i]}.csv")

Unnamed: 0,speaker,text,type,election_year,date,candidate,qud,question
0,Edwin Newman,"Good evening. Im Edwin Newman, moderator of th...",Pres,1976,1976-09-23,0,,
1,Frank Reynolds,"Mr. President, Governor Carter. Governor, in a...",Pres,1976,1976-09-23,0,,
2,Jimmy Carter,Yes. First of all is to recognize a tremendous...,Pres,1976,1976-09-23,1,What strategies can be implemented to reduce u...,"Mr. President, Governor Carter. Governor, in a..."
3,Frank Reynolds,"Governor, uh – in the event you are successful...",Pres,1976,1976-09-23,0,,
4,Jimmy Carter,"Yes, in unemployment that is likely to create ...",Pres,1976,1976-09-23,1,,
...,...,...,...,...,...,...,...,...
89,Edwin Newman,"It is now time for the closing statements, whi...",Pres,1976,1976-09-23,0,,
90,Jimmy Carter,"Well, tonight weve had a chance to talk a lot ...",Pres,1976,1976-09-23,1,,
91,Edwin Newman,President Ford.,Pres,1976,1976-09-23,0,,
92,Gerald Ford,On November second all of you will make a very...,Pres,1976,1976-09-23,1,,


In [64]:
dfs[dates[i]]['qud'] = dfs[dates[i]].progress_apply(lambda row: get_qud(row['text']) if not pd.isna(row['question']) else None, axis=1)

100%|██████████| 94/94 [00:15<00:00,  6.00it/s]
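
The progress bar counts all 94 rows, but get_qud only runs for rows whose question is not missing. An alternative sketch selects those rows with a boolean mask first, so the function is mapped over the relevant texts only. fill_qud and the toy frame are hypothetical, and a stand-in function replaces the API call.

```python
import pandas as pd


def fill_qud(df, qud_fn):
    """Set 'qud' for rows with a non-missing 'question'; other rows get None."""
    out = df.copy()
    has_question = out["question"].notna()
    out["qud"] = None  # reset, mirroring the `else None` branch of the lambda
    out.loc[has_question, "qud"] = out.loc[has_question, "text"].map(qud_fn)
    return out


# Toy frame with the same column names as the debate files (values made up).
demo = pd.DataFrame(
    {
        "text": ["a moderator turn", "an answer"],
        "question": [None, "a moderator question"],
    }
)

seen = []


def fake_qud(text):
    seen.append(text)  # track which texts would have triggered an API call
    return "Q? " + text


result = fill_qud(demo, fake_qud)
print(result)
```

Like the original cell, this overwrites any existing qud value on rows without a question.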


In [70]:
dfs[dates[i]]

Unnamed: 0,speaker,text,type,election_year,date,candidate,qud,question
0,Edwin Newman,"Good evening. Im Edwin Newman, moderator of th...",Pres,1976,1976-09-23,0,,
1,Frank Reynolds,"Mr. President, Governor Carter. Governor, in a...",Pres,1976,1976-09-23,0,,
2,Jimmy Carter,Yes. First of all is to recognize a tremendous...,Pres,1976,1976-09-23,1,What strategies can be implemented to reduce u...,"Mr. President, Governor Carter. Governor, in a..."
3,Frank Reynolds,"Governor, uh – in the event you are successful...",Pres,1976,1976-09-23,0,,
4,Jimmy Carter,"Yes, in unemployment that is likely to create ...",Pres,1976,1976-09-23,1,,
...,...,...,...,...,...,...,...,...
89,Edwin Newman,"It is now time for the closing statements, whi...",Pres,1976,1976-09-23,0,,
90,Jimmy Carter,"Well, tonight weve had a chance to talk a lot ...",Pres,1976,1976-09-23,1,,
91,Edwin Newman,President Ford.,Pres,1976,1976-09-23,0,,
92,Gerald Ford,On November second all of you will make a very...,Pres,1976,1976-09-23,1,,


In [66]:
dfs[dates[i]].to_csv(f"{data_base_dir}/{dates[i]}.csv", index=False)

In [71]:
pd.read_csv(f"{data_base_dir}/{dates[i]}.csv")

Unnamed: 0,speaker,text,type,election_year,date,candidate,qud,question
0,Edwin Newman,"Good evening. Im Edwin Newman, moderator of th...",Pres,1976,1976-09-23,0,,
1,Frank Reynolds,"Mr. President, Governor Carter. Governor, in a...",Pres,1976,1976-09-23,0,,
2,Jimmy Carter,Yes. First of all is to recognize a tremendous...,Pres,1976,1976-09-23,1,What strategies can be implemented to reduce u...,"Mr. President, Governor Carter. Governor, in a..."
3,Frank Reynolds,"Governor, uh – in the event you are successful...",Pres,1976,1976-09-23,0,,
4,Jimmy Carter,"Yes, in unemployment that is likely to create ...",Pres,1976,1976-09-23,1,,
...,...,...,...,...,...,...,...,...
89,Edwin Newman,"It is now time for the closing statements, whi...",Pres,1976,1976-09-23,0,,
90,Jimmy Carter,"Well, tonight weve had a chance to talk a lot ...",Pres,1976,1976-09-23,1,,
91,Edwin Newman,President Ford.,Pres,1976,1976-09-23,0,,
92,Gerald Ford,On November second all of you will make a very...,Pres,1976,1976-09-23,1,,


In [72]:
for d in dates:  # one pass per debate file
    dfs[d]['qud'] = dfs[d].progress_apply(
        lambda row: get_qud(row['text']) if pd.notna(row['question']) else None,
        axis=1,
    )
    dfs[d].to_csv(f"{data_base_dir}/{d}.csv", index=False)

100%|██████████| 68/68 [00:08<00:00,  8.12it/s]
100%|██████████| 94/94 [00:20<00:00,  4.68it/s]
100%|██████████| 76/76 [00:10<00:00,  7.08it/s]
100%|██████████| 134/134 [00:33<00:00,  4.05it/s]
100%|██████████| 161/161 [00:17<00:00,  9.33it/s]
100%|██████████| 92/92 [00:08<00:00, 10.90it/s]
100%|██████████| 144/144 [00:18<00:00,  7.64it/s]
100%|██████████| 166/166 [00:23<00:00,  7.21it/s]
100%|██████████| 142/142 [00:14<00:00,  9.55it/s]
100%|██████████| 189/189 [00:14<00:00, 13.21it/s]
100%|██████████| 210/210 [00:20<00:00, 10.44it/s]
100%|██████████| 308/308 [00:21<00:00, 14.19it/s]
100%|██████████| 932/932 [00:38<00:00, 24.47it/s]
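
Re-running the loop above repeats every API call, including those for the 1976 file that was already processed. One way to make reruns cheap is an on-disk cache keyed by statement text. The following is a sketch under assumptions: load_cache, cached_qud and the cache file name are hypothetical, and a counter function stands in for get_qud.

```python
import json
import os
import tempfile


def load_cache(cache_path):
    """Load a {text: qud} mapping from disk, or start with an empty one."""
    if os.path.exists(cache_path):
        with open(cache_path, encoding="utf-8") as f:
            return json.load(f)
    return {}


def cached_qud(text, qud_fn, cache, cache_path):
    """Return the cached QUD for text, calling qud_fn and saving on a miss."""
    if text not in cache:
        cache[text] = qud_fn(text)
        # Rewrite the whole file on each miss: simple, fine for a few thousand rows.
        with open(cache_path, "w", encoding="utf-8") as f:
            json.dump(cache, f, ensure_ascii=False, indent=2)
    return cache[text]


# Demo with a stand-in function and a temporary cache file.
n_calls = {"n": 0}


def fake_qud(text):
    n_calls["n"] += 1
    return "Q? " + text


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "qud_cache.json")  # hypothetical file name
    cache = load_cache(path)
    first = cached_qud("some statement", fake_qud, cache, path)
    second = cached_qud("some statement", fake_qud, cache, path)  # served from cache
    reloaded = load_cache(path)  # what a later session would see

print(first, n_calls["n"], reloaded)
```

In the loop, get_qud(row['text']) would become cached_qud(row['text'], get_qud, cache, path). The trade-off is that a changed prompt requires deleting the cache file.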
