Gradio. Built-in example - Cat Video. FileNotFoundError: [Errno 2] No such file or directory #61

Closed
ekiwi111 opened this issue Apr 6, 2023 · 7 comments

ekiwi111 commented Apr 6, 2023

Server is up and running.
Commit SHA: bc66e5a
Running the Gradio demo with python run_gradio_demo.py --config config.gradio.yaml:

Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.

Running the built-in example "based on the /examples/a.jpg, please generate a video and audio". Gradio terminal output:

2023-04-06 16:33:43,346 - awesome_chat - INFO - ********************************************************************************
2023-04-06 16:33:43,352 - awesome_chat - INFO - input: based on the /examples/a.jpg, please generate a video and audio
2023-04-06 16:33:47,719 - awesome_chat - INFO - [{"task": "image-to-text", "id": 0, "dep": [-1], "args": {"image": "/examples/a.jpg" }}, {"task": "text-to-video", "id": 1, "dep": [0], "args": {"text": "<GENERATED>-0" }}, {"task": "text-to-speech", "id": 2, "dep": [0], "args": {"text": "<GENERATED>-0" }}]
2023-04-06 16:34:36,874 - awesome_chat - INFO - response: I have carefully considered your request and I have generated a video and audio based on the image you provided. The workflow I used is as follows: First, I used the image-to-text model 'nlpconnect/vit-gpt2-image-captioning' to generate the text 'a cat sitting on a window sill looking out'. Then, I used the text-to-video model 'damo-vilab/text-to-video-ms-1.7b' to generate the video '/videos/293f.mp4'. Finally, I used the text-to-speech model 'facebook/fastspeech2-en-ljspeech' to generate the audio '/audios/19fc.flac'. The complete path of the video and audio are '/videos/293f.mp4' and '/audios/19fc.flac' respectively. I hope this answer your request. Is there anything else I can help you with?
Traceback (most recent call last):
  File "/home/user/anaconda3/envs/jarvis/lib/python3.8/site-packages/gradio/routes.py", line 393, in run_predict
    output = await app.get_blocks().process_api(
  File "/home/user/anaconda3/envs/jarvis/lib/python3.8/site-packages/gradio/blocks.py", line 1111, in process_api
    data = self.postprocess_data(fn_index, result["prediction"], state)
  File "/home/user/anaconda3/envs/jarvis/lib/python3.8/site-packages/gradio/blocks.py", line 1045, in postprocess_data
    prediction_value = block.postprocess(prediction_value)
  File "/home/user/anaconda3/envs/jarvis/lib/python3.8/site-packages/gradio/components.py", line 4333, in postprocess
    self._postprocess_chat_messages(message_pair[1]),
  File "/home/user/anaconda3/envs/jarvis/lib/python3.8/site-packages/gradio/components.py", line 4296, in _postprocess_chat_messages
    filepath = self.make_temp_copy_if_needed(filepath)
  File "/home/user/anaconda3/envs/jarvis/lib/python3.8/site-packages/gradio/components.py", line 245, in make_temp_copy_if_needed
    temp_dir = self.hash_file(file_path)
  File "/home/user/anaconda3/envs/jarvis/lib/python3.8/site-packages/gradio/components.py", line 217, in hash_file
    with open(file_path, "rb") as f:
FileNotFoundError: [Errno 2] No such file or directory: 'public//videos/293f.mp4'
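
For context, the failing path is just the demo's public output directory concatenated with the path mentioned in the chat response, and the lookup fails because no video file exists there. A minimal sketch of what Gradio runs into (assumed names, not the repo's actual code):

import os
public_dir = "public"                 # assumed local output directory used by the demo
generated = "/videos/293f.mp4"        # path string taken from the chat response
local_path = public_dir + generated   # -> 'public//videos/293f.mp4' (the double slash is harmless on POSIX)
print(os.path.exists(local_path))     # False here, so Gradio's hash_file() raises FileNotFoundError
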
tricktreat (Contributor) commented Apr 6, 2023

Hi, maybe ffmpeg is not installed. As noted in the README, in order to display the video properly in HTML you need to compile ffmpeg manually with H.264 support. If ffmpeg is not properly installed, the H.264 video files will not be generated.
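If you want to verify which codecs the ffmpeg on your PATH was built with, here is a quick check (not part of the repo, just a sanity test assuming ffmpeg is installed system-wide):

import subprocess
# Lists the codecs compiled into the ffmpeg binary; libx264 must appear for H.264 encoding to work.
codecs = subprocess.run(["ffmpeg", "-codecs"], capture_output=True, text=True).stdout
print("libx264 available" if "libx264" in codecs else "libx264 NOT available")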

ekiwi111 (Author) commented Apr 6, 2023

I don't believe that's the issue. When the command LD_LIBRARY_PATH=/usr/local/lib /usr/local/bin/ffmpeg -i input.mp4 -vcodec libx264 output.mp4 is executed (with the appropriate input mp4 file), it runs correctly.

Here's the content of debug.log:

2023-04-06 17:18:19,832 - awesome_chat - INFO - ********************************************************************************
2023-04-06 17:18:19,833 - awesome_chat - INFO - input: based on the /examples/a.jpg, please generate a video and audio
2023-04-06 17:18:19,834 - awesome_chat - DEBUG - [{'role': 'system', 'content': '#1 Task Planning Stage: The AI assistant can parse user input to several tasks: [{"task": task, "id": task_id, "dep": dependency_task_id, "args": {"text": text or <GENERATED>-dep_id, "image": image_url or <GENERATED>-dep_id, "audio": audio_url or <GENERATED>-dep_id}}]. The special tag "<GENERATED>-dep_id" refer to the one genereted text/image/audio in the dependency task (Please consider whether the dependency task generates resources of this type.) and "dep_id" must be in "dep" list. The "dep" field denotes the ids of the previous prerequisite tasks which generate a new resource that the current task relies on. The "args" field must in ["text", "image", "audio"], nothing else. The task MUST be selected from the following options: "token-classification", "text2text-generation", "summarization", "translation", "question-answering", "conversational", "text-generation", "sentence-similarity", "tabular-classification", "object-detection", "image-classification", "image-to-image", "image-to-text", "text-to-image", "text-to-video", "visual-question-answering", "document-question-answering", "image-segmentation", "depth-estimation", "text-to-speech", "automatic-speech-recognition", "audio-to-audio", "audio-classification", "canny-control", "hed-control", "mlsd-control", "normal-control", "openpose-control", "canny-text-to-image", "depth-text-to-image", "hed-text-to-image", "mlsd-text-to-image", "normal-text-to-image", "openpose-text-to-image", "seg-text-to-image". There may be multiple tasks of the same type. Think step by step about all the tasks needed to resolve the user\'s request. Parse out as few tasks as possible while ensuring that the user request can be resolved. Pay attention to the dependencies and order among tasks. If the user input can\'t be parsed, you need to reply empty JSON [].'}, {'role': 'user', 'content': 'Give you some pictures e1.jpg, e2.png, e3.jpg, help me count the number of sheep?'}, {'role': 'assistant', 'content': '[{"task": "image-to-text", "id": 0, "dep": [-1], "args": {"image": "e1.jpg" }}, {"task": "object-detection", "id": 1, "dep": [-1], "args": {"image": "e1.jpg" }}, {"task": "visual-question-answering", "id": 2, "dep": [1], "args": {"image": "<GENERATED>-1", "text": "How many sheep in the picture"}} }}, {"task": "image-to-text", "id": 3, "dep": [-1], "args": {"image": "e2.png" }}, {"task": "object-detection", "id": 4, "dep": [-1], "args": {"image": "e2.png" }}, {"task": "visual-question-answering", "id": 5, "dep": [4], "args": {"image": "<GENERATED>-4", "text": "How many sheep in the picture"}} }}, {"task": "image-to-text", "id": 6, "dep": [-1], "args": {"image": "e3.jpg" }},  {"task": "object-detection", "id": 7, "dep": [-1], "args": {"image": "e3.jpg" }}, {"task": "visual-question-answering", "id": 8, "dep": [7], "args": {"image": "<GENERATED>-7", "text": "How many sheep in the picture"}}]'}, {'role': 'user', 'content': 'Look at /e.jpg, can you tell me how many objects in the picture? Give me a picture and video similar to this one.'}, {'role': 'assistant', 'content': '[{"task": "image-to-text", "id": 0, "dep": [-1], "args": {"image": "/e.jpg" }}, {"task": "object-detection", "id": 1, "dep": [-1], "args": {"image": "/e.jpg" }}, {"task": "visual-question-answering", "id": 2, "dep": [1], "args": {"image": "<GENERATED>-1", "text": "how many objects in the picture?" 
}}, {"task": "text-to-image", "id": 3, "dep": [0], "args": {"text": "<GENERATED-0>" }}, {"task": "image-to-image", "id": 4, "dep": [-1], "args": {"image": "/e.jpg" }}, {"task": "text-to-video", "id": 5, "dep": [0], "args": {"text": "<GENERATED-0>" }}]'}, {'role': 'user', 'content': 'given a document /images/e.jpeg, answer me what is the student amount? And describe the image with your voice'}, {'role': 'assistant', 'content': '{"task": "document-question-answering", "id": 0, "dep": [-1], "args": {"image": "/images/e.jpeg", "text": "what is the student amount?" }}, {"task": "visual-question-answering", "id": 1, "dep": [-1], "args": {"image": "/images/e.jpeg", "text": "what is the student amount?" }}, {"task": "image-to-text", "id": 2, "dep": [-1], "args": {"image": "/images/e.jpg" }}, {"task": "text-to-speech", "id": 3, "dep": [2], "args": {"text": "<GENERATED>-2" }}]'}, {'role': 'user', 'content': 'Given an image /example.jpg, first generate a hed image, then based on the hed image generate a new image where a girl is reading a book'}, {'role': 'assistant', 'content': '[{"task": "openpose-control", "id": 0, "dep": [-1], "args": {"image": "/example.jpg" }},  {"task": "openpose-text-to-image", "id": 1, "dep": [0], "args": {"text": "a girl is reading a book", "image": "<GENERATED>-0" }}]'}, {'role': 'user', 'content': "please show me a video and an image of (based on the text) 'a boy is running' and dub it"}, {'role': 'assistant', 'content': '[{"task": "text-to-video", "id": 0, "dep": [-1], "args": {"text": "a boy is running" }}, {"task": "text-to-speech", "id": 1, "dep": [-1], "args": {"text": "a boy is running" }}, {"task": "text-to-image", "id": 2, "dep": [-1], "args": {"text": "a boy is running" }}]'}, {'role': 'user', 'content': 'please show me a joke and an image of cat'}, {'role': 'assistant', 'content': '[{"task": "conversational", "id": 0, "dep": [-1], "args": {"text": "please show me a joke of cat" }}, {"task": "text-to-image", "id": 1, "dep": [-1], "args": {"text": "a photo of cat" }}]'}, {'role': 'user', 'content': 'The chat log [ [] ] may contain the resources I mentioned. Now I input { based on the /examples/a.jpg, please generate a video and audio }. Pay attention to the input and output types of tasks and the dependencies between tasks.'}]
2023-04-06 17:18:23,834 - awesome_chat - DEBUG - {"id":"cmpl-72CImEORT89oJrJWiLw2acE0X57SX","object":"text_completion","created":1680758300,"model":"text-davinci-003","choices":[{"text":"\n[{\"task\": \"image-to-text\", \"id\": 0, \"dep\": [-1], \"args\": {\"image\": \"/examples/a.jpg\" }}, {\"task\": \"text-to-video\", \"id\": 1, \"dep\": [0], \"args\": {\"text\": \"<GENERATED>-0\" }}, {\"task\": \"text-to-speech\", \"id\": 2, \"dep\": [0], \"args\": {\"text\": \"<GENERATED>-0\" }}]","index":0,"logprobs":null,"finish_reason":"stop"}],"usage":{"prompt_tokens":1919,"completion_tokens":113,"total_tokens":2032}}
2023-04-06 17:18:23,834 - awesome_chat - INFO - [{"task": "image-to-text", "id": 0, "dep": [-1], "args": {"image": "/examples/a.jpg" }}, {"task": "text-to-video", "id": 1, "dep": [0], "args": {"text": "<GENERATED>-0" }}, {"task": "text-to-speech", "id": 2, "dep": [0], "args": {"text": "<GENERATED>-0" }}]
2023-04-06 17:18:23,834 - awesome_chat - DEBUG - [{'task': 'image-to-text', 'id': 0, 'dep': [-1], 'args': {'image': '/examples/a.jpg'}}, {'task': 'text-to-video', 'id': 1, 'dep': [0], 'args': {'text': '<GENERATED>-0'}}, {'task': 'text-to-speech', 'id': 2, 'dep': [0], 'args': {'text': '<GENERATED>-0'}}]
2023-04-06 17:18:23,853 - awesome_chat - DEBUG - Run task: 0 - image-to-text
2023-04-06 17:18:23,853 - awesome_chat - DEBUG - Deps: []
2023-04-06 17:18:23,853 - awesome_chat - DEBUG - parsed task: {'task': 'image-to-text', 'id': 0, 'dep': [-1], 'args': {'image': 'public//examples/a.jpg'}}
2023-04-06 17:18:25,089 - awesome_chat - DEBUG - avaliable models on image-to-text: {'local': ['nlpconnect/vit-gpt2-image-captioning'], 'huggingface': ['microsoft/trocr-base-printed', 'kha-white/manga-ocr-base', 'nlpconnect/vit-gpt2-image-captioning', 'Salesforce/blip-image-captioning-base']}
2023-04-06 17:18:25,089 - awesome_chat - DEBUG - [{'role': 'system', 'content': '#2 Model Selection Stage: Given the user request and the parsed tasks, the AI assistant helps the user to select a suitable model from a list of models to process the user request. The assistant should focus more on the description of the model and find the model that has the most potential to solve requests and tasks. Also, prefer models with local inference endpoints for speed and stability.'}, {'role': 'user', 'content': 'based on the /examples/a.jpg, please generate a video and audio'}, {'role': 'assistant', 'content': "{'task': 'image-to-text', 'id': 0, 'dep': [-1], 'args': {'image': 'public//examples/a.jpg'}}"}, {'role': 'user', 'content': 'Please choose the most suitable model from [{\'id\': \'nlpconnect/vit-gpt2-image-captioning\', \'inference endpoint\': [\'nlpconnect/vit-gpt2-image-captioning\'], \'likes\': 219, \'description\': \'\\n\\n# nlpconnect/vit-gpt2-image-captioning\\n\\nThis is an image captioning model trained by @ydshieh in [\', \'language\': None, \'tags\': None}, {\'id\': \'microsoft/trocr-base-printed\', \'inference endpoint\': [\'microsoft/trocr-base-printed\', \'kha-white/manga-ocr-base\', \'nlpconnect/vit-gpt2-image-captioning\', \'Salesforce/blip-image-captioning-base\'], \'likes\': 56, \'description\': \'\\n\\n# TrOCR (base-sized model, fine-tuned on SROIE) \\n\\nTrOCR model fine-tuned on the [SROIE dataset](ht\', \'language\': None, \'tags\': None}, {\'id\': \'Salesforce/blip-image-captioning-base\', \'inference endpoint\': [\'microsoft/trocr-base-printed\', \'kha-white/manga-ocr-base\', \'nlpconnect/vit-gpt2-image-captioning\', \'Salesforce/blip-image-captioning-base\'], \'likes\': 44, \'description\': \'\\n\\n# BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Ge\', \'language\': None, \'tags\': None}, {\'id\': \'kha-white/manga-ocr-base\', \'inference endpoint\': [\'microsoft/trocr-base-printed\', \'kha-white/manga-ocr-base\', \'nlpconnect/vit-gpt2-image-captioning\', \'Salesforce/blip-image-captioning-base\'], \'likes\': 24, \'description\': \'\\n\\n# Manga OCR\\n\\nOptical character recognition for Japanese text, with the main focus being Japanese m\', \'language\': None, \'tags\': None}] for the task {\'task\': \'image-to-text\', \'id\': 0, \'dep\': [-1], \'args\': {\'image\': \'public//examples/a.jpg\'}}. The output must be in a strict JSON format: {"id": "id", "reason": "your detail reasons for the choice"}.'}]
2023-04-06 17:18:27,771 - awesome_chat - DEBUG - {"id":"cmpl-72CIr730hJuLfN3gOFo3iKItF29Dp","object":"text_completion","created":1680758305,"model":"text-davinci-003","choices":[{"text":"\n{\"id\": \"nlpconnect/vit-gpt2-image-captioning\", \"reason\": \"This model is the most suitable for the task of image-to-text as it is trained by @ydshieh and has the highest number of likes\"}","index":0,"logprobs":null,"finish_reason":"stop"}],"usage":{"prompt_tokens":763,"completion_tokens":59,"total_tokens":822}}
2023-04-06 17:18:27,772 - awesome_chat - DEBUG - chosen model: {"id": "nlpconnect/vit-gpt2-image-captioning", "reason": "This model is the most suitable for the task of image-to-text as it is trained by @ydshieh and has the highest number of likes"}
2023-04-06 17:18:28,151 - awesome_chat - DEBUG - inference result: {'generated text': 'a cat sitting on a window sill looking out '}
2023-04-06 17:18:28,370 - awesome_chat - DEBUG - Run task: 1 - text-to-video
2023-04-06 17:18:28,370 - awesome_chat - DEBUG - Deps: [{"task": {"task": "image-to-text", "id": 0, "dep": [-1], "args": {"image": "public//examples/a.jpg"}}, "inference result": {"generated text": "a cat sitting on a window sill looking out "}, "choose model result": {"id": "nlpconnect/vit-gpt2-image-captioning", "reason": "This model is the most suitable for the task of image-to-text as it is trained by @ydshieh and has the highest number of likes"}}]
2023-04-06 17:18:28,370 - awesome_chat - DEBUG - Detect the generated text of dependency task (from results):a cat sitting on a window sill looking out 
2023-04-06 17:18:28,370 - awesome_chat - DEBUG - Detect the image of dependency task (from args): public//examples/a.jpg
2023-04-06 17:18:28,370 - awesome_chat - DEBUG - parsed task: {'task': 'text-to-video', 'id': 1, 'dep': [0], 'args': {'text': 'a cat sitting on a window sill looking out '}}
2023-04-06 17:18:28,371 - awesome_chat - DEBUG - Run task: 2 - text-to-speech
2023-04-06 17:18:28,371 - awesome_chat - DEBUG - Deps: [{"task": {"task": "image-to-text", "id": 0, "dep": [-1], "args": {"image": "public//examples/a.jpg"}}, "inference result": {"generated text": "a cat sitting on a window sill looking out "}, "choose model result": {"id": "nlpconnect/vit-gpt2-image-captioning", "reason": "This model is the most suitable for the task of image-to-text as it is trained by @ydshieh and has the highest number of likes"}}]
2023-04-06 17:18:28,372 - awesome_chat - DEBUG - Detect the generated text of dependency task (from results):a cat sitting on a window sill looking out 
2023-04-06 17:18:28,372 - awesome_chat - DEBUG - Detect the image of dependency task (from args): public//examples/a.jpg
2023-04-06 17:18:28,372 - awesome_chat - DEBUG - parsed task: {'task': 'text-to-speech', 'id': 2, 'dep': [0], 'args': {'text': 'a cat sitting on a window sill looking out '}}
2023-04-06 17:18:29,585 - awesome_chat - DEBUG - avaliable models on text-to-video: {'local': ['damo-vilab/text-to-video-ms-1.7b'], 'huggingface': []}
2023-04-06 17:18:29,586 - awesome_chat - DEBUG - chosen model: {'id': 'damo-vilab/text-to-video-ms-1.7b', 'reason': 'Only one model available.'}
2023-04-06 17:18:29,635 - awesome_chat - DEBUG - avaliable models on text-to-speech: {'local': ['espnet/kan-bayashi_ljspeech_vits'], 'huggingface': ['facebook/unit_hifigan_mhubert_vp_en_es_fr_it3_400k_layer11_km1000_lj_dur']}
2023-04-06 17:18:29,635 - awesome_chat - DEBUG - [{'role': 'system', 'content': '#2 Model Selection Stage: Given the user request and the parsed tasks, the AI assistant helps the user to select a suitable model from a list of models to process the user request. The assistant should focus more on the description of the model and find the model that has the most potential to solve requests and tasks. Also, prefer models with local inference endpoints for speed and stability.'}, {'role': 'user', 'content': 'based on the /examples/a.jpg, please generate a video and audio'}, {'role': 'assistant', 'content': "{'task': 'text-to-speech', 'id': 2, 'dep': [0], 'args': {'text': 'a cat sitting on a window sill looking out '}}"}, {'role': 'user', 'content': 'Please choose the most suitable model from [{\'id\': \'espnet/kan-bayashi_ljspeech_vits\', \'inference endpoint\': [\'espnet/kan-bayashi_ljspeech_vits\'], \'likes\': 70, \'description\': \'\\n## ESPnet2 TTS pretrained model \\n### `kan-bayashi/ljspeech_vits`\\n♻️ Imported from https://zenodo.or\', \'language\': None, \'tags\': None}, {\'id\': \'facebook/unit_hifigan_mhubert_vp_en_es_fr_it3_400k_layer11_km1000_lj_dur\', \'inference endpoint\': [\'facebook/unit_hifigan_mhubert_vp_en_es_fr_it3_400k_layer11_km1000_lj_dur\'], \'likes\': 14, \'description\': \'\\n## unit_hifigan_mhubert_vp_en_es_fr_it3_400k_layer11_km1000_lj_dur\\n\\nSpeech-to-speech translation mo\', \'language\': None, \'tags\': None}] for the task {\'task\': \'text-to-speech\', \'id\': 2, \'dep\': [0], \'args\': {\'text\': \'a cat sitting on a window sill looking out \'}}. The output must be in a strict JSON format: {"id": "id", "reason": "your detail reasons for the choice"}.'}]
2023-04-06 17:18:32,377 - awesome_chat - DEBUG - {"id":"cmpl-72CIveauNwMazGxHs0hEKlH44qlE4","object":"text_completion","created":1680758309,"model":"text-davinci-003","choices":[{"text":"\n{\"id\": \"espnet/kan-bayashi_ljspeech_vits\", \"reason\": \"This model is a pretrained model from ESPnet2 TTS and has an inference endpoint for local speed and stability\"}","index":0,"logprobs":null,"finish_reason":null}],"usage":{"prompt_tokens":525,"completion_tokens":48,"total_tokens":573}}
2023-04-06 17:18:32,377 - awesome_chat - DEBUG - chosen model: {"id": "espnet/kan-bayashi_ljspeech_vits", "reason": "This model is a pretrained model from ESPnet2 TTS and has an inference endpoint for local speed and stability"}
2023-04-06 17:18:53,582 - awesome_chat - DEBUG - inference result: {'generated audio': '/audios/7540.wav'}
2023-04-06 17:19:06,242 - awesome_chat - DEBUG - inference result: {'generated video': '/videos/da6b.mp4'}
2023-04-06 17:19:06,258 - awesome_chat - DEBUG - {0: {'task': {'task': 'image-to-text', 'id': 0, 'dep': [-1], 'args': {'image': 'public//examples/a.jpg'}}, 'inference result': {'generated text': 'a cat sitting on a window sill looking out '}, 'choose model result': {'id': 'nlpconnect/vit-gpt2-image-captioning', 'reason': 'This model is the most suitable for the task of image-to-text as it is trained by @ydshieh and has the highest number of likes'}}, 2: {'task': {'task': 'text-to-speech', 'id': 2, 'dep': [0], 'args': {'text': 'a cat sitting on a window sill looking out '}}, 'inference result': {'generated audio': '/audios/7540.wav'}, 'choose model result': {'id': 'espnet/kan-bayashi_ljspeech_vits', 'reason': 'This model is a pretrained model from ESPnet2 TTS and has an inference endpoint for local speed and stability'}}, 1: {'task': {'task': 'text-to-video', 'id': 1, 'dep': [0], 'args': {'text': 'a cat sitting on a window sill looking out '}}, 'inference result': {'generated video': '/videos/da6b.mp4'}, 'choose model result': {'id': 'damo-vilab/text-to-video-ms-1.7b', 'reason': 'Only one model available.'}}}
2023-04-06 17:19:06,259 - awesome_chat - DEBUG - [{'role': 'system', 'content': '#4 Response Generation Stage: With the task execution logs, the AI assistant needs to describe the process and inference results.'}, {'role': 'user', 'content': 'based on the /examples/a.jpg, please generate a video and audio'}, {'role': 'assistant', 'content': "Before give you a response, I want to introduce my workflow for your request, which is shown in the following JSON data: [{'task': {'task': 'image-to-text', 'id': 0, 'dep': [-1], 'args': {'image': 'public//examples/a.jpg'}}, 'inference result': {'generated text': 'a cat sitting on a window sill looking out '}, 'choose model result': {'id': 'nlpconnect/vit-gpt2-image-captioning', 'reason': 'This model is the most suitable for the task of image-to-text as it is trained by @ydshieh and has the highest number of likes'}}, {'task': {'task': 'text-to-video', 'id': 1, 'dep': [0], 'args': {'text': 'a cat sitting on a window sill looking out '}}, 'inference result': {'generated video': '/videos/da6b.mp4'}, 'choose model result': {'id': 'damo-vilab/text-to-video-ms-1.7b', 'reason': 'Only one model available.'}}, {'task': {'task': 'text-to-speech', 'id': 2, 'dep': [0], 'args': {'text': 'a cat sitting on a window sill looking out '}}, 'inference result': {'generated audio': '/audios/7540.wav'}, 'choose model result': {'id': 'espnet/kan-bayashi_ljspeech_vits', 'reason': 'This model is a pretrained model from ESPnet2 TTS and has an inference endpoint for local speed and stability'}}]. Do you have any demands regarding my response?"}, {'role': 'user', 'content': "Yes. Please first think carefully and directly answer my request based on the inference results. Then please detail your workflow step by step including the used models and inference results for my request in your friendly tone. Please filter out information that is not relevant to my request. If any generated files of images, audios or videos in the inference results, must tell me the complete path. If there is nothing in the results, please tell me you can't make it. }"}]
2023-04-06 17:19:12,879 - awesome_chat - DEBUG - {"id":"cmpl-72CJWrjDQwz22fi9HNmLVbojR8X9j","object":"text_completion","created":1680758346,"model":"text-davinci-003","choices":[{"text":"\nI have carefully considered your request and I can generate a video and audio based on the image you provided. I have used three models in my workflow, which are image-to-text, text-to-video and text-to-speech. \n\nFor the image-to-text task, I used the model nlpconnect/vit-gpt2-image-captioning, which is trained by @ydshieh and has the highest number of likes. The inference result is a text description of the image: \"a cat sitting on a window sill looking out\". \n\nFor the text-to-video task, I used the model damo-vilab/text-to-video-ms-1.7b, which is the only model available. The inference result is a video file located at /videos/da6b.mp4. \n\nFor the text-to-speech task, I used the model espnet/kan-bayashi_ljspeech_vits, which is a pretrained model from ESPnet2 TTS and has an inference endpoint for local speed and stability. The inference result is an audio file located at /audios/7540.wav.\n\nI hope this information is helpful. Please let me know if you have any other questions. ","index":0,"logprobs":null,"finish_reason":"stop"}],"usage":{"prompt_tokens":579,"completion_tokens":277,"total_tokens":856}}
2023-04-06 17:19:12,879 - awesome_chat - INFO - response: I have carefully considered your request and I can generate a video and audio based on the image you provided. I have used three models in my workflow, which are image-to-text, text-to-video and text-to-speech. 

For the image-to-text task, I used the model nlpconnect/vit-gpt2-image-captioning, which is trained by @ydshieh and has the highest number of likes. The inference result is a text description of the image: "a cat sitting on a window sill looking out". 

For the text-to-video task, I used the model damo-vilab/text-to-video-ms-1.7b, which is the only model available. The inference result is a video file located at /videos/da6b.mp4. 

For the text-to-speech task, I used the model espnet/kan-bayashi_ljspeech_vits, which is a pretrained model from ESPnet2 TTS and has an inference endpoint for local speed and stability. The inference result is an audio file located at /audios/7540.wav.

I hope this information is helpful. Please let me know if you have any other questions.

tricktreat (Contributor) commented:

All right. I'll be back later to address that after a short meeting.

tricktreat (Contributor) commented:

Hi, has the problem been solved? I'm having a hard time reproducing it in my environment.

ekiwi111 (Author) commented Apr 6, 2023

No, it's still there. It's a clean install on Ubuntu 22.04. Is there anything I can do to narrow down the scope of the bug?

tricktreat (Contributor) commented Apr 6, 2023

FileNotFoundError: [Errno 2] No such file or directory: 'public//videos/293f.mp4'

Is this file actually generated and can it be found in public/videos?
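To check from the directory where the server is running, a quick sketch (the file name is just the one from the traceback above):

import os
print(os.path.isdir("public/videos"))            # does the output directory exist at all?
print(os.path.exists("public/videos/293f.mp4"))  # was this specific video actually written?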

tricktreat (Contributor) commented Apr 6, 2023

You can add these lines at the beginning of run_gradio_demo.py:

import os
# Ensure the demo's output directories exist before any generated files are copied into them.
os.makedirs("public/images", exist_ok=True)
os.makedirs("public/audios", exist_ok=True)
os.makedirs("public/videos", exist_ok=True)
