AI assisted data collection for ecosystem mapping and sensemaking #1044
Replies: 2 comments 1 reply
-
Thanks for this @mattosborn 🙏 I think this is super interesting, particularly the names and descriptions of values and anti-values generated from transcripts. Interesting also to see the AI experiment in categorising the values generated. That kind of sensemaking and clustering I would be interested to work on more, and interested to see how different thinkers or organisations relate to those clusters of values/anti-values. I also wonder whether it would be interesting to look at where different language is used to express essentially the same value, or even where the same language is used to describe different values. Can you tell from the AI output to what extent it is quoting directly from the data input vs paraphrasing?
-
This is a question, not an answer! I've noticed that automatic transcripts are often full of inaccuracies that affect the meaning or get key terms wrong. For the McGilchrist/Schmachtenberger/Vervaeke conversation they did a pretty good edited transcript, but I still found plenty of mistakes and inaccuracies. My current version with corrections to date is here: https://www.simongrant.org/philosophy/2023_metacrisis-conversation.html (which you're welcome to use of course). Given the common awareness of the inaccuracy of automatic transcript tech, would you perhaps include some comparison between results obtained from a typical raw transcript and results from a carefully cleaned-up one?
-
Overview
The motivation here is that there is a vast amount of knowledge related to this movement in existing and freely available media such as videos and podcasts. The issue is that this is unstructured data, and manual processing across hundreds of hours of content is not feasible for a small research project.
The proposed solution is to make use of the current generation of large language models that have (1) a context window large enough to accept a full, longform dialog transcript (2) the capability to generate useful insights from this on arbitrary topics and (3) the ability to format their response as structured data, e.g. JSON.
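As a minimal sketch of point (3): the prompt asks the model to reply with JSON only, and the reply is then machine-parseable. The prompt wording and field names below are illustrative assumptions, not the exact ones used in the experiment:

```python
import json

# Illustrative instruction appended to the extraction prompt
# (the field names here are hypothetical).
PROMPT_SUFFIX = (
    "Respond with JSON only, in the form "
    '{"values": [{"name": "...", "description": "..."}]}'
)

def parse_model_reply(reply: str) -> dict:
    """Parse the model's JSON reply; raise ValueError if it is not valid JSON."""
    try:
        return json.loads(reply)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model did not return valid JSON: {exc}") from exc

# Example with a simulated model reply:
reply = '{"values": [{"name": "Epistemic humility", "description": "..."}]}'
data = parse_model_reply(reply)
print(data["values"][0]["name"])  # Epistemic humility
```

In practice replies sometimes arrive wrapped in extra prose or markdown fences, so the parsing step is also a cheap validity check before the data enters the pipeline.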
In essence, we automate the process of harvesting transcripts and feeding them to a capable AI model for processing into useful structured data. The potential of this is the ability to conduct in-depth analysis of many thousands of hours of conversations and presentations from people related to this movement, discussing topics related to this movement.
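The harvesting loop just described can be sketched as follows; `fetch_transcript` and `extract_metadata` are placeholders standing in for the real steps (transcript download and the GPT-4 extraction prompt), not actual functions from the prototype:

```python
def harvest(video_ids, fetch_transcript, extract_metadata):
    """Run the automated pipeline over a manually selected list of videos.

    The two callables stand in for the real steps: downloading the
    transcript, then sending it to the model for structured extraction.
    """
    results = {}
    for vid in video_ids:
        transcript = fetch_transcript(vid)           # step: fetch video transcript
        results[vid] = extract_metadata(transcript)  # step: LLM extraction
    return results
```

The manual video-selection step described below happens before this loop runs, so only relevant `video_ids` are passed in.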
Process in more detail
This is the current state of the prototype. All of these steps are automated except for video selection, which is done manually through an interface (so we don't process irrelevant videos).
Channel level
YouTube API
- Fetch metadata for all videos uploaded to a given channel
Local
- Select a subset of videos for further processing (manual step)
Per video
Local
- Extract guest/speaker name(s) from video titles. Ideally only videos with a single guest, for simplicity
YouTube / OSS library
- Fetch video transcript
Open source AI model (local)
- Punctuation/formatting restoration if the transcript was auto-generated (to improve the quality of data extraction in the next step)
OpenAI API (GPT-4)
- Pass a prompt with the transcript to extract metadata from the video transcript
What kind of data are we talking about here?
This process is capable of generating the kind of data a human could extract from a video transcript. In the tests so far it has been surprisingly successful at extracting data of the following kinds.
Expressed by the speaker
Conversation metadata
Can I see some data?
Here's an example of the values and anti-values a speaker expressed, extracted from a single video transcript. This data comes directly from the AI model in response to a single prompt, and underwent no further processing:
Recursive data analysis
Given that this method yields thousands of data points, further analysis would be extremely useful for gaining deeper insights. Thankfully, we can just feed the aggregated output back in as input and ask more questions. The experiment run here was to (1) generate a small number of higher-level categories for the values that were expressed and (2) assign each value one or more of these categories. The following image shows the result of this process, with the name of each value coloured by its assigned category.
There are some questionable category assignments here, but on the whole it performed pretty well. This could likely be improved with further prompt tuning, and manual intervention is always an option here where the dataset is small enough for it to be manageable.
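The recursive pass can be sketched as building a second prompt from the aggregated first-pass output. The wording below is illustrative, not the exact prompt used in the experiment:

```python
def build_categorisation_prompt(value_names):
    """Turn the aggregated list of extracted value names into a second-pass
    prompt asking the model to cluster them (illustrative wording)."""
    listing = "\n".join(f"- {name}" for name in value_names)
    return (
        "Below is a list of values extracted from many transcripts.\n"
        "1. Propose a small number of higher-level categories.\n"
        "2. Assign each value to one or more of these categories.\n"
        "Respond with JSON only.\n\n" + listing
    )

prompt = build_categorisation_prompt(["Epistemic humility", "Collective sensemaking"])
```

Because the second pass sees only short value names rather than full transcripts, it is cheap relative to the extraction step, so it can be re-run with tuned prompts until the category assignments look reasonable.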
Potential issues
Errors
These models aren't perfect, and much like humans they are prone to making mistakes (such as so-called hallucinations). Two points on this:
Pricing
This experiment was conducted using GPT-4's API. The cost of a single prompt with a 1-2 hour video transcript is probably in the region of $0.10-0.20 currently. If we analyzed 1000 hours of content, that puts the cost at roughly $100 per extracted attribute. Some ways this could be brought down:
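For reference, the ~$100 figure follows from back-of-envelope arithmetic, assuming the stated 1-2 hour transcripts at $0.10-$0.20 per prompt:

```python
# 1000 hours of content split into videos of 1-2 hours each
hours = 1000
videos_low, videos_high = hours // 2, hours // 1  # 500 to 1000 transcripts

cost_low = videos_low * 0.10    # cheapest case: 500 prompts at $0.10 each
cost_high = videos_high * 0.20  # dearest case: 1000 prompts at $0.20 each

print(f"${cost_low:.0f} to ${cost_high:.0f} per extracted attribute")
# roughly $50-$200, i.e. on the order of $100
```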
Further ideas
Podcasts
The prototype above works well for YouTube videos. Podcasts would require some extra steps:
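One likely extra step (an assumption on my part, since podcast setups vary) is locating each episode's audio URL from the show's RSS feed before any speech-to-text can run. That step needs only the standard library; the feed below is a minimal fabricated example, and real feeds would be fetched over HTTP:

```python
import xml.etree.ElementTree as ET

def episode_audio_urls(rss_xml: str) -> list:
    """Pull the audio enclosure URL out of each <item> in a podcast RSS feed."""
    root = ET.fromstring(rss_xml)
    urls = []
    for item in root.iter("item"):
        enclosure = item.find("enclosure")
        if enclosure is not None:
            urls.append(enclosure.get("url"))
    return urls

# Minimal fabricated feed for illustration:
FEED = """<rss><channel>
  <item><title>Ep 1</title>
    <enclosure url="https://example.com/ep1.mp3" type="audio/mpeg"/></item>
</channel></rss>"""
print(episode_audio_urls(FEED))  # ['https://example.com/ep1.mp3']
```

Unlike YouTube, there is no ready-made transcript here, so each audio file would then go through a transcription step before joining the same extraction pipeline.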
Projects
It's possible we could apply this same method to generate metadata about projects by using text extracted through crawling project websites as prompt input. This would just require a dataset of projects that included their website URL.
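A minimal sketch of that text-extraction step, using only the standard library; a real crawler would also fetch the pages, follow internal links, and respect robots.txt:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text from an HTML page, skipping script/style blocks."""

    def __init__(self):
        super().__init__()
        self._skip = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

def page_text(html: str) -> str:
    """Return the visible text of a page, ready to use as prompt input."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)

print(page_text("<html><script>x=1</script><p>A project about sensemaking.</p></html>"))
# A project about sensemaking.
```

The extracted text would then be passed to the same extraction prompts used for transcripts, with the project dataset supplying the website URLs to crawl.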
Can I try this out?
If you have access to GPT Plus you can experiment with this manually. Here is a zip file with processed video transcripts pulled from thestoa. Here are the prompts I used in the experiment. Set the model to GPT-4 and paste the prompt, along with the video transcript, into a new/empty chat window:
Values
Actions
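For anyone who wants to automate this instead of pasting into the chat window, the same prompt-plus-transcript pairing maps directly onto a chat-completion message list. The sketch below only builds the payload; the separator is an assumption, and the actual API call and key handling are left out:

```python
def build_messages(prompt: str, transcript: str) -> list:
    """Combine one of the extraction prompts with a transcript, mirroring
    what you would paste into a fresh GPT-4 chat window."""
    return [{"role": "user", "content": prompt + "\n\n---\n\n" + transcript}]

messages = build_messages(
    "Extract the values the speaker expresses...",  # one of the linked prompts
    "TRANSCRIPT TEXT",
)
# pass `messages` to the chat completions endpoint with the GPT-4 model
```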