# Florida free-speech project

The goal of this project is to collect transcripts of town hall meetings in Florida cities and towns for research purposes.

In practice, given a list of towns in Florida, I first used the YouTube API to search for each town's or city's official channel. Then I asked ChatGPT to evaluate whether the channel seems official based on its title and description. Lastly, I called the YouTube API to get all videos from the channel and get the transcript for each video.

## Setup

The repo is called [transcripts](https://github.com/nesaboz/transcripts):

In [None]:
try:
    from google.colab import drive
    IS_COLAB = True
except ModuleNotFoundError:
    IS_COLAB = False


if IS_COLAB: 
    response = input("Do you want to setup everything? ([yes]/no): ").lower().strip()
    if response != "no":
        # clear /content (including the default sample_data folder) so the repo can be cloned into it
        !rm -rf /content/* /content/.[!.]* /content/..?*

        !git clone https://github.com/nesaboz/transcripts.git /content

        # mount google drive
        drive.mount('/content/drive')
        
        # install_packages
        !pip install -r requirements.txt

        !cp "drive/MyDrive/.env" .

    # set outside the inner `if` so DATA_FOLDER is defined even when setup is skipped
    DATA_FOLDER = "drive/MyDrive/PN"
else:
    !pip install -r requirements.txt
    
    DATA_FOLDER = "data"

## Imports

Import the helper classes and enable autoreload:

In [None]:
from utils import ChannelCrawler, ChannelAnalyzer, aggregate_analysis_files, Channel, VideoInfo

%load_ext autoreload
%autoreload 2

## Search for YT channels

We go over the list of all cities in Florida and search YouTube for "town of XYZ, Florida" and "city of XYZ, Florida". This is what the `ChannelCrawler` class does; see its docstring for details.

In [None]:
crawler = ChannelCrawler(search_query_fns=[lambda x: f"town of {x}, Florida", lambda x: f"city of {x}, Florida"], data_folder=DATA_FOLDER)
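
To make the role of `search_query_fns` concrete, here is a minimal sketch of how a list of query templates could be expanded over city names. The helper `build_queries` is hypothetical and is not part of `utils`; how `ChannelCrawler` actually combines the templates internally is not shown here.

```python
from typing import Callable, Iterable, List


# Hypothetical helper for illustration only; not part of the repo's `utils` module.
def build_queries(cities: Iterable[str], query_fns: List[Callable[[str], str]]) -> List[str]:
    """Apply every query template to every city name, city by city."""
    return [fn(city) for city in cities for fn in query_fns]


query_fns = [
    lambda x: f"town of {x}, Florida",
    lambda x: f"city of {x}, Florida",
]

queries = build_queries(["Miami", "Tampa"], query_fns)
print(queries)
```

Each city yields one search per template, so two templates double the number of API calls.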

Now start crawling. The limit is infinite by default, though you will eventually hit the YouTube API quota limit:

In [None]:
crawler.start(limit=3)
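
As a rough guide for choosing `limit`, the quota arithmetic can be sketched as below. The numbers are assumptions based on the publicly documented defaults of the YouTube Data API (10,000 units per day, 100 units per `search.list` call); your project's quota may differ, and this ignores any other calls the crawler makes.

```python
# Assumed defaults; check the quota page of your own Google Cloud project.
DAILY_QUOTA_UNITS = 10_000  # assumed default daily quota
SEARCH_COST_UNITS = 100     # assumed cost of one search.list request
QUERIES_PER_CITY = 2        # "town of ..." and "city of ..."


def cities_per_day(daily_quota: int = DAILY_QUOTA_UNITS,
                   search_cost: int = SEARCH_COST_UNITS,
                   queries_per_city: int = QUERIES_PER_CITY) -> int:
    """Number of cities that can be searched before the daily quota runs out."""
    return daily_quota // (search_cost * queries_per_city)


print(cities_per_day())
```

Under these assumptions a full list of Florida cities takes several days of crawling.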

If all goes well, there should be a folder called `responses` in the data folder.

## Analysis

For each JSON response in `responses`, we now ask ChatGPT to determine whether the channel is official or not. This creates a new folder `analysis` with CSV files containing yes/no answers, and updates `status.csv`. We first create the analyzer and then run it:

In [None]:
analyzer = ChannelAnalyzer(
    model_name="gpt-4",
    prompt_fn=lambda x: f"Your job will be to analyze a short text, \
comprised of a title and a description of a YouTube channel, to assess whether this \
text corresponds to an official YouTube channel of a city {x}, in Florida. Your answer should be 'Yes' or 'No' only",
    data_folder=DATA_FOLDER
)

In [None]:
analyzer.start()
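
Even with a "Yes or No only" prompt, a model may add quotes, punctuation, or extra words. Below is a minimal sketch of how such answers could be normalized. `parse_yes_no` is a hypothetical helper, not something `ChannelAnalyzer` is known to use.

```python
from typing import Optional


# Hypothetical helper for illustration; not part of the repo's `utils` module.
def parse_yes_no(answer: str) -> Optional[bool]:
    """Map a model reply to True/False, or None if it is not a clean yes/no."""
    cleaned = answer.strip().strip("'\".!").lower()
    if cleaned == "yes":
        return True
    if cleaned == "no":
        return False
    return None  # anything else should be reviewed by hand


print(parse_yes_no(" 'Yes'. "))
print(parse_yes_no("No"))
print(parse_yes_no("Not sure"))
```

Returning `None` for unexpected replies keeps ambiguous channels out of the positive set without silently treating them as negatives.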

## Aggregation

We now aggregate the results into an Excel file, very similar to `assets/cities_to_collect.xlsx`, storing only positive results:

In [None]:
aggregate_analysis_files(crawler, analyzer, DATA_FOLDER)
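
The idea of "storing only positive results" can be sketched with the standard library alone. The column names (`city`, `channel_id`, `answer`) and the sample rows are made up for illustration; the real layout of the files in `analysis` and the logic of `aggregate_analysis_files` may differ.

```python
import csv
import io
from typing import Dict, List


def keep_positive(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only rows whose (assumed) `answer` column is a yes."""
    return [row for row in rows if row["answer"].strip().lower() == "yes"]


# Made-up sample standing in for one analysis CSV file.
sample = io.StringIO(
    "city,channel_id,answer\n"
    "Miami,UC_example_1,Yes\n"
    "Tampa,UC_example_2,No\n"
)

rows = list(csv.DictReader(sample))
positives = keep_positive(rows)
print([row["city"] for row in positives])
```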

## Extract info from one video

Now we shift focus to video metadata and transcripts. Each video has an ID, and we can extract all of its info knowing just that ID:

In [None]:
video = VideoInfo("thGB9IILDOw", DATA_FOLDER)

In [None]:
video.get_all_video_info()

In [None]:
video.get_only_transcript()
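
Transcripts from YouTube typically come as a list of timed segments. Assuming each segment is a dict with a `text` field (an assumption; the exact return format of `get_only_transcript` is not shown here), the segments could be flattened into plain text like this:

```python
from typing import Dict, List


# Hypothetical helper; the segment format below is an assumption.
def segments_to_text(segments: List[Dict]) -> str:
    """Join the non-empty `text` fields of transcript segments into one string."""
    parts = (segment["text"].strip() for segment in segments)
    return " ".join(part for part in parts if part)


# Made-up segments for illustration.
example_segments = [
    {"text": "Good evening everyone,", "start": 0.0},
    {"text": "  ", "start": 2.1},
    {"text": "the meeting is called to order.", "start": 2.5},
]

print(segments_to_text(example_segments))
```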

## Extract info from all videos of a channel

We now know how to extract one transcript, so we just need to get a list of all videos of a channel (given its ID) and repeat the extraction:

In [None]:
channel = Channel('UCm9YZSpPqHckVrtDdrL3isw', DATA_FOLDER)

Get all videos and create a file `data/channels/<channel_id>/videos.json`:

In [None]:
channel.get_videos()
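
Listing every video of a channel requires paging, since YouTube Data API list responses return a limited number of `items` plus a `nextPageToken` when more results exist. The loop below is a minimal sketch of that pattern against a fake page source; `iter_items` and `fake_fetch` are hypothetical names, and this is not necessarily how `Channel.get_videos` is implemented.

```python
from typing import Callable, Dict, Iterator, Optional


def iter_items(fetch_page: Callable[[Optional[str]], Dict]) -> Iterator[Dict]:
    """Yield items from every page, following `nextPageToken` until it is absent."""
    token: Optional[str] = None
    while True:
        page = fetch_page(token)
        yield from page.get("items", [])
        token = page.get("nextPageToken")
        if not token:
            break


# Fake two-page response standing in for real API calls.
FAKE_PAGES = {
    None: {"items": [{"id": "a"}, {"id": "b"}], "nextPageToken": "p2"},
    "p2": {"items": [{"id": "c"}]},
}


def fake_fetch(token: Optional[str]) -> Dict:
    return FAKE_PAGES[token]


video_ids = [item["id"] for item in iter_items(fake_fetch)]
print(video_ids)
```

In real use, `fetch_page` would wrap the API request and pass the token as the `pageToken` parameter.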

Extract transcripts from all of the channel's videos:

In [None]:
channel.extract_all()