AI assisted data collection for ecosystem mapping and sensemaking #1044
Replies: 2 comments 1 reply
-
Thanks for this @mattosborn 🙏 I think this is super interesting, particularly the names and descriptions of values and anti-values generated from transcripts. Interesting also to see the AI experiment in categorising the values generated. That kind of sensemaking and clustering I would be interested to work on more, and interested to see how different thinkers or organisations relate to those clusters of values/anti-values. I also wonder whether it would be interesting to look at where different language is used to express essentially the same value, or even where the same language is used to describe different values. Can you tell from the AI output to what extent it is quoting directly from the data input vs paraphrasing?
-
This is a question, not an answer! I've noticed that automatic transcripts are often full of inaccuracies that affect the meaning or get key terms wrong. For the McGilchrist/Schmachtenberger/Vervaeke conversation they did a pretty good edited transcript, but I still found plenty of mistakes and inaccuracies. My current version with corrections to date is here: https://www.simongrant.org/philosophy/2023_metacrisis-conversation.html (which you're welcome to use of course). Given the common awareness of the inaccuracy of automatic transcript tech, would you perhaps include some comparison between results obtained from a typical raw transcript and results from a carefully cleaned-up one?
-
Overview
The motivation here is that there is a vast amount of knowledge related to this movement in existing and freely available media such as videos and podcasts. The issue is that this is unstructured data, and manual processing across hundreds of hours of content is not feasible for a small research project.
The proposed solution is to make use of the current generation of large language models that have (1) a context window large enough to accept a full, longform dialog transcript (2) the capability to generate useful insights from this on arbitrary topics and (3) the ability to format their response as structured data, e.g. JSON.
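As a minimal sketch of point (3): the prompt asks the model to reply with JSON only, and the reply is then machine-parseable. The prompt wording and field names below are illustrative assumptions, not the exact ones used in the experiment:

```python
import json

# Illustrative instruction appended to the extraction prompt
# (the field names here are hypothetical).
PROMPT_SUFFIX = (
    "Respond with JSON only, in the form "
    '{"values": [{"name": "...", "description": "..."}]}'
)

def parse_model_reply(reply: str) -> dict:
    """Parse the model's JSON reply; raise ValueError if it is not valid JSON."""
    try:
        return json.loads(reply)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model did not return valid JSON: {exc}") from exc

# Example with a simulated model reply:
reply = '{"values": [{"name": "Epistemic humility", "description": "..."}]}'
data = parse_model_reply(reply)
print(data["values"][0]["name"])  # Epistemic humility
```

In practice replies sometimes arrive wrapped in extra prose or markdown fences, so the parsing step is also a cheap validity check before the data enters the pipeline.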
In essence, we automate the process of harvesting transcripts and feeding them to a capable AI model for processing into useful structured data. The potential of this is the ability to conduct in-depth analysis of many thousands of hours of conversations and presentations from people related to this movement, discussing topics related to this movement.
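The harvesting loop just described can be sketched as follows; `fetch_transcript` and `extract_metadata` are placeholders standing in for the real steps (transcript download and the GPT-4 extraction prompt), not actual functions from the prototype:

```python
def harvest(video_ids, fetch_transcript, extract_metadata):
    """Run the automated pipeline over a manually selected list of videos.

    The two callables stand in for the real steps: downloading the
    transcript, then sending it to the model for structured extraction.
    """
    results = {}
    for vid in video_ids:
        transcript = fetch_transcript(vid)           # step: fetch video transcript
        results[vid] = extract_metadata(transcript)  # step: LLM extraction
    return results
```

The manual video-selection step described below happens before this loop runs, so only relevant `video_ids` are passed in.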
Process in more detail
This is the current state of the prototype. All of these steps are automated except for video selection, which is done manually through an interface (so we don't process irrelevant videos).
Channel level
YouTube API
- Fetch metadata for all videos uploaded to a given channel
Local
- Select a subset of videos for further processing (manual step)
Per video
Local
- Extract guest/speaker name(s) from video titles. Ideally only videos with a single guest, for simplicity
YouTube / OSS library
- Fetch video transcript
Open source AI model (local)
- Punctuation/formatting restoration if the transcript was auto-generated (to improve the quality of data extraction in the next step)
OpenAI API (GPT-4)
- Pass a prompt with the transcript to extract metadata from the video transcript
What kind of data are we talking about here?
This process is capable of generating the kind of data a human could extract from a video transcript. In the tests so far it has been surprisingly successful at extracting data of the following kinds.
Expressed by the speaker
Conversation metadata
Can I see some data?
Here's an example of the values and anti-values a speaker expressed, extracted from a single video transcript. This data comes directly from the AI model in response to a single prompt, and underwent no further processing:
Recursive data analysis
Given that this method yields thousands of data points, further analysis would be extremely useful for gaining deeper insights. Thankfully, we can just feed the aggregated output back in as input and ask more questions. The experiment run here was to (1) generate a small number of higher-level categories for the values that were expressed and (2) assign each value one or more of these categories. The following image shows the result of this process, with the name of each value coloured by its assigned category.
There are some questionable category assignments here, but on the whole it performed pretty well. This could likely be improved with further prompt tuning, and manual intervention is always an option here where the dataset is small enough for it to be manageable.
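The recursive pass can be sketched as building a second prompt from the aggregated first-pass output. The wording below is illustrative, not the exact prompt used in the experiment:

```python
def build_categorisation_prompt(value_names):
    """Turn the aggregated list of extracted value names into a second-pass
    prompt asking the model to cluster them (illustrative wording)."""
    listing = "\n".join(f"- {name}" for name in value_names)
    return (
        "Below is a list of values extracted from many transcripts.\n"
        "1. Propose a small number of higher-level categories.\n"
        "2. Assign each value to one or more of these categories.\n"
        "Respond with JSON only.\n\n" + listing
    )

prompt = build_categorisation_prompt(["Epistemic humility", "Collective sensemaking"])
```

Because the second pass sees only short value names rather than full transcripts, it is cheap relative to the extraction step, so it can be re-run with tuned prompts until the category assignments look reasonable.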
Potential issues
Errors
These models aren't perfect, and much like humans they are prone to making mistakes (such as so-called hallucinations). Two points on this:
Pricing
This experiment was conducted using GPT-4's API. The cost of a single prompt with a 1-2 hour video transcript is probably in the region of $0.10-0.20 currently. If we analyzed 1000 hours of content, that puts the cost at roughly $100 per extracted attribute. Some ways this could be brought down:
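For reference, the ~$100 figure follows from back-of-envelope arithmetic, assuming the stated 1-2 hour transcripts at $0.10-$0.20 per prompt:

```python
# 1000 hours of content split into videos of 1-2 hours each
hours = 1000
videos_low, videos_high = hours // 2, hours // 1  # 500 to 1000 transcripts

cost_low = videos_low * 0.10    # cheapest case: 500 prompts at $0.10 each
cost_high = videos_high * 0.20  # dearest case: 1000 prompts at $0.20 each

print(f"${cost_low:.0f} to ${cost_high:.0f} per extracted attribute")
# roughly $50-$200, i.e. on the order of $100
```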
Further ideas
Podcasts
The prototype above works well for YouTube videos. Podcasts would require some extra steps:
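One likely extra step (an assumption on my part, since podcast setups vary) is locating each episode's audio URL from the show's RSS feed before any speech-to-text can run. That step needs only the standard library; the feed below is a minimal fabricated example, and real feeds would be fetched over HTTP:

```python
import xml.etree.ElementTree as ET

def episode_audio_urls(rss_xml: str) -> list:
    """Pull the audio enclosure URL out of each <item> in a podcast RSS feed."""
    root = ET.fromstring(rss_xml)
    urls = []
    for item in root.iter("item"):
        enclosure = item.find("enclosure")
        if enclosure is not None:
            urls.append(enclosure.get("url"))
    return urls

# Minimal fabricated feed for illustration:
FEED = """<rss><channel>
  <item><title>Ep 1</title>
    <enclosure url="https://example.com/ep1.mp3" type="audio/mpeg"/></item>
</channel></rss>"""
print(episode_audio_urls(FEED))  # ['https://example.com/ep1.mp3']
```

Unlike YouTube, there is no ready-made transcript here, so each audio file would then go through a transcription step before joining the same extraction pipeline.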
Projects
It's possible we could apply this same method to generate metadata about projects by using text extracted through crawling project websites as prompt input. This would just require a dataset of projects that included their website URL.
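A minimal sketch of that text-extraction step, using only the standard library; a real crawler would also fetch the pages, follow internal links, and respect robots.txt:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text from an HTML page, skipping script/style blocks."""

    def __init__(self):
        super().__init__()
        self._skip = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

def page_text(html: str) -> str:
    """Return the visible text of a page, ready to use as prompt input."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)

print(page_text("<html><script>x=1</script><p>A project about sensemaking.</p></html>"))
# A project about sensemaking.
```

The extracted text would then be passed to the same extraction prompts used for transcripts, with the project dataset supplying the website URLs to crawl.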
Can I try this out?
If you have access to GPT Plus you can experiment with this manually. Here is a zip file with processed video transcripts pulled from thestoa. Here are the prompts I used in the experiment. Set the model to GPT-4 and paste the prompt, along with the video transcript, into a new/empty chat window:
Values
Actions
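For anyone who wants to automate this instead of pasting into the chat window, the same prompt-plus-transcript pairing maps directly onto a chat-completion message list. The sketch below only builds the payload; the separator is an assumption, and the actual API call and key handling are left out:

```python
def build_messages(prompt: str, transcript: str) -> list:
    """Combine one of the extraction prompts with a transcript, mirroring
    what you would paste into a fresh GPT-4 chat window."""
    return [{"role": "user", "content": prompt + "\n\n---\n\n" + transcript}]

messages = build_messages(
    "Extract the values the speaker expresses...",  # one of the linked prompts
    "TRANSCRIPT TEXT",
)
# pass `messages` to the chat completions endpoint with the GPT-4 model
```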