I just can't make it sync. Any ideas would be helpful. Hours upon hours, day after day, and nothing. #670

Open
search620 opened this issue Jan 19, 2024 · 0 comments

Comments

@search620

I am using this audio file for example:
https://www.thepodcastexchange.ca/s/McDonalds_LNG_061019.wav

There is no way to make the end of it sync with the audio.

Original Transcription:
Start: 00:00:00,009, End: 00:00:19,940, Text: Hey, Canadians, this is for you guys and for McDonald's. And so I'm going to read it in my most Canadian voice possible. It's impossible to be bummed out when you think about a McDonald's Happy Meal, eh? I think everyone has a nostalgic attachment to them, including me and Christian. I can't tell you how many of the little toys we collected. Right, Christian?
Start: 00:00:20,606, End: 00:00:48,659, Text: All the time. And I love your accent so far. Thanks. And well, Canadians are very sarcastic, so I appreciate that. And we fought over the toys even with our fists were flying. You broke my nose on one occasion. I did. Oh man, that was actually really bad. Really dark. McDonald's cares about families, especially in Canada. So they care about the ingredients in their happy meals. The hamburgers are made with 100% beef from where else? Canada. Canada.
Start: 00:00:49,683, End: 00:01:05,793, Text: from the plains of Canada with no artificial preservatives, additives, or fillers, and Happy Meal chicken McNuggets. I wonder if the chickens are Canadian. And grilled chicken snack wraps. That's hard to say. And made with 100%... Yes, they are. Canadian-raised seasoned chicken.
Start: 00:01:05,793, End: 00:01:24,939, Text: No artificial flavors, colors, or preservatives. Enjoy quality ingredients while spending quality time with your family, whether they're Canadian or friends visiting from out of the country. Happy Meals start at just $3.99. $3.99. That's not that many loonies or toonies. Only at McDonald's.

Aligned Transcription:
Start: 00:00:00,029, End: 00:00:03,611, Text: Hey, Canadians, this is for you guys and for McDonald's.
Start: 00:00:03,770, End: 00:00:06,932, Text: And so I'm going to read it in my most Canadian voice possible.
Start: 00:00:07,493, End: 00:00:11,074, Text: It's impossible to be bummed out when you think about a McDonald's Happy Meal, eh?
Start: 00:00:11,654, End: 00:00:15,678, Text: I think everyone has a nostalgic attachment to them, including me and Christian.
Start: 00:00:16,138, End: 00:00:18,638, Text: I can't tell you how many of the little toys we collected.
Start: 00:00:19,039, End: 00:00:19,559, Text: Right, Christian?
Start: 00:00:20,725, End: 00:00:21,486, Text: All the time.
Start: 00:00:22,027, End: 00:00:23,507, Text: And I love your accent so far.
Start: 00:00:23,928, End: 00:00:24,288, Text: Thanks.
Start: 00:00:24,908, End: 00:00:28,070, Text: And well, Canadians are very sarcastic, so I appreciate that.
Start: 00:00:28,769, End: 00:00:32,351, Text: And we fought over the toys even with our fists were flying.
Start: 00:00:32,491, End: 00:00:33,953, Text: You broke my nose on one occasion.
Start: 00:00:34,353, End: 00:00:34,612, Text: I did.
Start: 00:00:34,713, End: 00:00:36,334, Text: Oh man, that was actually really bad.
Start: 00:00:36,374, End: 00:00:36,933, Text: Really dark.
Start: 00:00:38,634, End: 00:00:41,095, Text: McDonald's cares about families, especially in Canada.
Start: 00:00:41,496, End: 00:00:43,798, Text: So they care about the ingredients in their happy meals.
Start: 00:00:44,357, End: 00:00:45,859, Text: The hamburgers are made with 100% beef from where else?
Start: 00:00:45,899, End: 00:00:46,139, Text: Canada.
Start: 00:00:46,219, End: 00:00:46,439, Text: Canada.
Start: 00:00:49,704, End: 00:00:57,469, Text: from the plains of Canada with no artificial preservatives, additives, or fillers, and Happy Meal chicken McNuggets.
Start: 00:00:57,488, End: 00:00:58,670, Text: I wonder if the chickens are Canadian.
Start: 00:00:59,090, End: 00:01:00,511, Text: And grilled chicken snack wraps.
Start: 00:01:00,670, End: 00:01:01,350, Text: That's hard to say.
Start: 00:01:01,750, End: 00:01:03,332, Text: And made with 100%... Yes, they are.
Start: 00:01:03,371, End: 00:01:05,073, Text: Canadian-raised seasoned chicken.
Start: 00:01:05,933, End: 00:01:08,194, Text: No artificial flavors, colors, or preservatives.
Start: 00:01:08,594, End: 00:01:16,016, Text: Enjoy quality ingredients while spending quality time with your family, whether they're Canadian or friends visiting from out of the country.
Start: 00:01:16,637, End: 00:01:17,617, Text: Happy Meals start at just $3.99.
Start: 00:01:17,658, End: 00:01:17,617, Text: $3.99.
Start: 00:01:17,677, End: 00:01:19,778, Text: That's not that many loonies or toonies.
Start: 00:01:19,798, End: 00:01:20,259, Text: Only at McDonald's.
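A quick consistency check can help pinpoint where the timing goes wrong in a list like the one above. The following is only a minimal sketch: it assumes segments are dicts with `start` and `end` in seconds (as in `aligned_result['segments']` in the script further down), and the helper name `find_timing_problems` is hypothetical. The sample values are the last four aligned lines above, converted to seconds.

```python
def find_timing_problems(segments):
    """Return (index, reason) pairs for segments with suspicious timing."""
    problems = []
    for i, seg in enumerate(segments):
        # A cue whose end precedes its start cannot be displayed correctly
        if seg["end"] < seg["start"]:
            problems.append((i, "end before start"))
        # A cue that runs past the start of the next one overlaps it
        if i + 1 < len(segments) and seg["end"] > segments[i + 1]["start"]:
            problems.append((i, "overlaps next segment"))
    return problems


# Last four aligned lines from the output above, converted to seconds
sample = [
    {"start": 76.637, "end": 77.617, "text": "Happy Meals start at just $3.99."},
    {"start": 77.658, "end": 77.617, "text": "$3.99."},
    {"start": 77.677, "end": 79.778, "text": "That's not that many loonies or toonies."},
    {"start": 79.798, "end": 80.259, "text": "Only at McDonald's."},
]

for index, reason in find_timing_problems(sample):
    print(index, reason)  # prints: 1 end before start
```

On this sample the only flagged entry is the second `$3.99.` segment, whose end time is earlier than its start time.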

As you can see, something here is not working. I don't even understand how the "alignment" works, or where it gets its timestamps from, if the original transcription doesn't provide this information.
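One possible way to investigate is to look at per-word timing rather than whole segments. The sketch below rests on assumptions that are not confirmed anywhere in this issue: that each aligned segment carries a `words` list, that each entry has a `word` key, and that words the alignment model could not match (perhaps tokens such as `$3.99` or `100%`) come back without `start`/`end`. The helper name `words_missing_timestamps` and the sample data are hypothetical, not real output.

```python
def words_missing_timestamps(segments):
    """Collect (segment_index, word) for words that have no start or end time."""
    missing = []
    for i, seg in enumerate(segments):
        # Assumption: aligned segments expose a "words" list of dicts
        for w in seg.get("words", []):
            if "start" not in w or "end" not in w:
                missing.append((i, w["word"]))
    return missing


# Hypothetical data shaped like an aligned segment; the times are made up
example = [
    {
        "start": 0.0,
        "end": 1.0,
        "text": "just $3.99.",
        "words": [
            {"word": "just", "start": 0.0, "end": 0.4},
            {"word": "$3.99."},  # no timing attached to this word
        ],
    }
]

for seg_index, word in words_missing_timestamps(example):
    print(f"segment {seg_index}: no timestamp for {word!r}")
```

If untimed words do show up around the lines that drift, that would suggest the problem lies in what the alignment model can match rather than in the SRT writing step.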

This is the script I am trying to build to produce an SRT file:

import whisperx
import gc
import os

# Setup
device = "cuda"
audio_file = r"C:\Users\HOME\Desktop\Testing Whisper\7.wav"
batch_size = 12
compute_type = "float16"
language_code = "en"
model = whisperx.load_model("large-v3", device, compute_type=compute_type)
print_progress = True

# Load and transcribe audio
audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=batch_size, print_progress=print_progress, language=language_code)

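# Function to format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)
def format_time(seconds):
    # Round to whole milliseconds first so float error cannot drop a millisecond
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millisec = divmod(rem, 1000)
    return f"{hours:02}:{minutes:02}:{secs:02},{millisec:03}"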

# Print Original Transcription
print("Original Transcription:")
for segment in result['segments']:
    print(f"Start: {format_time(segment['start'])}, End: {format_time(segment['end'])}, Text: {segment['text']}")

# Align whisper output
model_a, metadata = whisperx.load_align_model(language_code=language_code, device=device)
aligned_result = whisperx.align(result["segments"], model_a, metadata, audio, device, return_char_alignments=False)

# Print Aligned Transcription
print("\nAligned Transcription:")
for segment in aligned_result['segments']:
    print(f"Start: {format_time(segment['start'])}, End: {format_time(segment['end'])}, Text: {segment['text']}")

# Function to convert transcription result to SRT format
def convert_to_srt(transcription_segments, srt_file_path):
    with open(srt_file_path, 'w', encoding='utf-8') as file:
        for i, segment in enumerate(transcription_segments, start=1):
            start_time = format_time(segment['start'])
            end_time = format_time(segment['end'])
            # Strip surrounding whitespace so each cue starts cleanly
            text = segment['text'].strip()
            file.write(f"{i}\n{start_time} --> {end_time}\n{text}\n\n")

# Main script
srt_file_name = os.path.splitext(audio_file)[0] + '.srt'
convert_to_srt(aligned_result['segments'], srt_file_name)

# Clean up if necessary
gc.collect()