In [None]:
import logging 
from finder.main import Finder, DATADIR
from typing import List, Tuple, Union

There are two main ways to initialize the `Finder` object, depending on whether you have already downloaded your query audio. 
1. If you have, set `query` to the path of your audio file. 
2. If not, pass the video's URL as `query` along with the download options, as in this example:

```python
myfinder = Finder(
    source=r"https://youtu.be/BmT8nFctYko",             
    query=r"https://youtu.be/jJd98IWDn54",
    start="5:32", 
    stop="5:50", 
    fmt=139, 
    loc=DATADIR,
    logname="jJd98IWDn54"
)
```

In [None]:
# set up the finder object 
myfinder = Finder(
    source=r"https://youtu.be/BmT8nFctYko",
    query="./data/8KWnymfSczU_2.m4a",
    logname="8KWnymfSczU_2c"
)

# create the bins 
myfinder.get_bins(
    max_binwidth=150,
    clip_edges=220,
    binorder="linear"
)

# save bins to log file 
logging.info(myfinder._bins_str)


The cell above downloads the query if it's not already available, then sets up the clips of the source video to be downloaded. These clips are called 'bins' because they discretize the source video into equal-length segments, controlled by several adjustable parameters. 

In this case, I've set 
- `max_binwidth=150` : the maximum clip duration is 150 s (2:30) 
- `clip_edges=220` : ignore the first and last 220 s (3:40) of the source video
- `binorder='linear'` : once the bins are set up, the clips will be downloaded in chronological order. 
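
As a rough illustration of the discretization described above, the sketch below splits a source of known duration into equal-length bins. The helper `make_bins`, its arguments, the 1-hour duration, and the rule for choosing the number of bins are all assumptions made for illustration; `Finder.get_bins` may well work differently.

```python
import math
from typing import List, Tuple


def make_bins(duration: int, max_binwidth: int, clip_edges: int) -> List[Tuple[int, int]]:
    """Split the usable part of a source into equal-length (start, stop) bins, in seconds.

    Hypothetical helper, not part of `finder`.
    """
    start, stop = clip_edges, duration - clip_edges
    usable = stop - start
    if usable <= 0:
        raise ValueError("clip_edges removes the entire source")
    # Smallest number of bins such that no bin is longer than max_binwidth.
    n_bins = math.ceil(usable / max_binwidth)
    width = usable / n_bins
    return [
        (round(start + i * width), round(start + (i + 1) * width))
        for i in range(n_bins)
    ]


# A hypothetical 1-hour source with the settings used in this notebook.
bins = make_bins(duration=3600, max_binwidth=150, clip_edges=220)
print(len(bins), bins[0], bins[-1])
```

Under this rule the bins come out slightly shorter than `max_binwidth`, since the usable duration is divided evenly rather than leaving a short final bin.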

For now, `binorder` makes little practical difference, but I originally planned to stop the program once some threshold number of candidate timepoints had been identified, in which case the search order would matter. The other options are `random`, which randomly shuffles the bins, and `mirrored`, which 'mirrors' the bins around the midpoint of the source video.
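
To make the three orderings concrete, here is a minimal sketch. The function `order_bins` and its `seed` argument are hypothetical, and the `mirrored` branch is only one possible reading of "mirroring around the midpoint" (start in the middle and alternate outward); the actual `finder` behavior may differ.

```python
import random
from typing import List, Optional, Sequence, Tuple

Bin = Tuple[int, int]


def order_bins(bins: Sequence[Bin], binorder: str = "linear",
               seed: Optional[int] = None) -> List[Bin]:
    """Return the bins in download order (hypothetical helper, not part of `finder`)."""
    if binorder == "linear":
        return list(bins)  # chronological order
    if binorder == "random":
        shuffled = list(bins)
        random.Random(seed).shuffle(shuffled)  # seeded, so the order is repeatable
        return shuffled
    if binorder == "mirrored":
        # Assumed interpretation: start at the midpoint and alternate outward.
        mid = len(bins) // 2
        ordered: List[Bin] = []
        for offset in range(len(bins)):
            for idx in (mid + offset, mid - offset - 1):
                if 0 <= idx < len(bins):
                    ordered.append(bins[idx])
        return ordered
    raise ValueError(f"unknown binorder: {binorder!r}")


example = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50)]
print(order_bins(example, "mirrored"))
# [(20, 30), (10, 20), (30, 40), (0, 10), (40, 50)]
```

With an early-stopping search, an ordering like this would favor the middle of the source; a fixed seed would be one way to keep a shuffled order stable between runs.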

In [None]:
# each candidate pairs a bin (assumed to be a pair of ints) with its similarity score
candidates: List[Tuple[Tuple[int, int], float]] = []

max_dl = 20
start_bin = 39
max_bin = 58
max_wait_time = 180

run_dict = dict(
    max_dl=max_dl,
    fmt=139,
    loc=DATADIR,
    keepfiles=True,
    max_wait_time=max_wait_time
)

The cell above sets up the main program, which iterates through the bins and compares the query to each bin's audio. If a bin is similar enough to the query, it is considered a *candidate* and is added, along with its similarity score, to the list `candidates`. 

The other options are:
- `max_dl` : the maximum number of source clips to download
- `start_bin` : the index of the starting bin. This is the main reason why we saved the list of bins, which could number in the hundreds, to our log file! 
  - If we are unsuccessful with one part of the video, we may be interested in scanning another section.
  - Or, we may have some prior knowledge about where we should begin our search. 
  - Whatever the case, if `start_bin = i`, we will be starting our search from the $(i-1)$-th bin.
  - Note: if you are resuming/restarting a search, make sure to use the same `binorder` (and *not* `binorder=random`)! 
- `max_bin` : the counterpart of `start_bin`; this is the index of the last bin that will be downloaded. Depending on `max_dl`, the bin duration, the number of bins, and, of course, whether the search terminates early, we may never reach `max_bin`. However, it can be set as an extra safeguard. 
- `max_wait_time` : this is the maximum amount of time to wait for the first clip to finish downloading. 
  - This only applies to the first clip, however, as clips are downloaded in batches of five. 
  - If your connection is slow, you may want to increase this. 
  - If your bins are very small/large, you will want to adjust this accordingly. 
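
To see how `start_bin`, `max_bin`, and `max_dl` interact, the sketch below lists the bin-index windows that successive calls would cover if each call scans at most `max_dl` bins and the caller then advances `start_bin` by `max_dl`. The helper `scan_windows` is hypothetical, and treating `max_bin` as inclusive is an assumption; neither comes from `finder` itself.

```python
from typing import List, Tuple


def scan_windows(start_bin: int, max_bin: int, max_dl: int) -> List[Tuple[int, int]]:
    """List the (first, last) bin indices covered by each successive call.

    Hypothetical helper; assumes max_bin is inclusive.
    """
    windows = []
    current = start_bin
    while current <= max_bin:
        last = min(current + max_dl - 1, max_bin)  # never go past max_bin
        windows.append((current, last))
        current += max_dl
    return windows


# The values used in this notebook fit in a single call.
print(scan_windows(start_bin=39, max_bin=58, max_dl=20))  # [(39, 58)]

# Scanning from the beginning would take three calls.
print(scan_windows(start_bin=0, max_bin=58, max_dl=20))  # [(0, 19), (20, 39), (40, 58)]
```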

In [None]:
while len(candidates) < 1:
    candidates = myfinder.run(
        start_bin=start_bin,
        **run_dict
    )

    start_bin += max_dl
    if start_bin > max_bin:
        break

The cell above runs the main program, which stops once at least one candidate has been identified. This does *not* mean that the remaining clips in the batch are skipped. In fact, the whole batch of clips containing the candidate bin will be downloaded, as downloading precedes processing. 

If you would like to instead do an exhaustive search, simply remove the reassignment of `candidates`. For example:

```python
while len(candidates) < 1:
    # no reassignment: `candidates` stays empty, so the loop runs until max_bin
    myfinder.run(
        start_bin=start_bin,
        **run_dict
    )

    start_bin += max_dl
    if start_bin > max_bin:
        break
```

Of course, you may also want to adjust the threshold from 1 to some other number/condition. 

In [None]:
from finder.postplot import ReadLog
from pathlib import Path 

p = r"./8KWnymfSczU_2c.log"

ReadLog(p).read(
    save_csv=True, 
    save_fig=True,
    outpath=Path.cwd() / 'output'
)

While the main script includes visualization, I've included a post-processing script that visualizes and saves candidates that were identified. 

The graph below shows the output of the program as written above. It has the source video's timestamps on the x-axis and the cross-correlation (similarity) between the query and each of the downloaded bins on the y-axis. Other features:
- The threshold for candidacy is a cross-correlation score of 0.5, indicated by the dashed line. 
- The bin with the highest similarity to the query is enclosed by a purple circle. 

<img src="./output/8KWnymfSczU_2c_log.png">

Note, however, that *the most similar point is **not** always correct.* In fact, in this case, the correct bin was at (2:21:52, 1.48), whereas the most-similar bin was apparently at (1:41:58, 1.81). 

This discrepancy happens for a number of reasons, e.g. if the query video does not perfectly match the source. This is common if the query is from a clip, as clippers often remove silence and add various sound effects or background music. 

The cross-correlation method is quite sensitive. On the one hand, this means even short audio clips can be sufficient; on the other, it means success is more likely when the query is as close to the original audio as possible. 
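
For intuition about why a modified query scores lower, here is a minimal sketch of a normalized cross-correlation search written in plain Python. It is not the method used by `finder`: the scores reported in this notebook exceed 1, so the project evidently scales its similarity differently, and a practical implementation would use NumPy or SciPy for speed. The function name `best_match` and the synthetic signals are assumptions for illustration.

```python
import math
import random
from typing import Sequence, Tuple


def best_match(query: Sequence[float], clip: Sequence[float]) -> Tuple[int, float]:
    """Slide `query` over `clip`; return the (offset, score) of the best match.

    The score is the Pearson correlation of the query with each window (-1 to 1).
    """
    n = len(query)
    if n == 0 or len(clip) < n:
        raise ValueError("clip must be at least as long as a non-empty query")
    q_mean = sum(query) / n
    q = [x - q_mean for x in query]
    q_norm = math.sqrt(sum(x * x for x in q))
    best_offset, best_score = 0, -1.0
    for offset in range(len(clip) - n + 1):
        window = clip[offset:offset + n]
        w_mean = sum(window) / n
        w = [x - w_mean for x in window]
        w_norm = math.sqrt(sum(x * x for x in w))
        if q_norm == 0 or w_norm == 0:
            continue  # flat signal: correlation is undefined
        score = sum(a * b for a, b in zip(q, w)) / (q_norm * w_norm)
        if score > best_score:
            best_offset, best_score = offset, score
    return best_offset, best_score


rng = random.Random(0)
clip = [rng.uniform(-1, 1) for _ in range(500)]  # stand-in for a source bin
query = clip[200:260]                            # an unmodified excerpt
offset, score = best_match(query, clip)

# Simulate an edited clip (e.g. added sound effects) with extra noise.
noisy_query = [x + rng.uniform(-0.3, 0.3) for x in query]
noisy_offset, noisy_score = best_match(noisy_query, clip)

print(offset, round(score, 3))
print(noisy_offset, round(noisy_score, 3))
```

The unmodified excerpt matches almost perfectly at its true offset, while the noisy version still peaks there but with a lower score, which mirrors how edits to a query can let an unrelated bin outrank the true one.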

The code above also creates a `.csv` file that shows the results in a much more reader-friendly format than the original log file:

In [3]:
import pandas as pd 

pd.read_csv(
    './output/8KWnymfSczU_2c_parsed.csv', 
    header=0, 
    index_col=0
)

| | Time | Correlation |
|---|---|---|
| 0 | 01:41:58 | 1.812128 |
| 14 | 02:21:52 | 1.482511 |
| 12 | 02:16:43 | 1.341701 |
| 6 | 01:57:09 | 1.077711 |
| 3 | 01:50:35 | 1.023812 |
| 1 | 01:44:21 | 0.981022 |
| 2 | 01:46:40 | 0.947678 |
| 9 | 02:08:10 | 0.905017 |
| 7 | 01:58:50 | 0.715186 |
| 8 | 02:02:41 | 0.714531 |


Note how the true bin ranked only second in terms of correlation with the query. This is why it is often important to 
- look for multiple candidates, 
- have a relatively low threshold for detection (here, 0.5), and 
- consider large sections of the source video.
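
As a small sketch of the first two points, the snippet below keeps every row above a threshold and returns the top few candidates instead of only the best one. A handful of rows are copied from the results above; the CSV header layout, the helper names `top_candidates` and `to_seconds`, and the choice of `k=3` are assumptions for illustration.

```python
import csv
import io
from typing import List, Tuple

# A few rows copied from the parsed results above (header layout assumed).
PARSED = """\
,Time,Correlation
7,01:58:50,0.715186
14,02:21:52,1.482511
0,01:41:58,1.812128
12,02:16:43,1.341701
9,02:08:10,0.905017
"""


def to_seconds(timestamp: str) -> int:
    """Convert an HH:MM:SS timestamp to seconds."""
    hours, minutes, seconds = (int(part) for part in timestamp.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def top_candidates(text: str, threshold: float = 0.5, k: int = 3) -> List[Tuple[str, float]]:
    """Return up to k (time, score) pairs at or above the threshold, best first."""
    rows = csv.DictReader(io.StringIO(text))
    scored = [(row["Time"], float(row["Correlation"])) for row in rows]
    kept = [pair for pair in scored if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)[:k]


for time, score in top_candidates(PARSED):
    print(time, to_seconds(time), score)
```

Reviewing the top three here would surface the true bin at 02:21:52 even though it is not the highest-scoring one.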