In [1]:
from utils.download_pdfs_from_url import download_papers
from utils import icml_parser 
from pathlib import Path
import pickle

## Project Parameters

In [2]:
PROJECT_DIR = './examples/icml_2024'

paper_pdf_dir = Path(PROJECT_DIR, 'paper_pdfs')
paper_parsed_dir = Path(PROJECT_DIR, 'paper_parsed.pkl')

## Download and Parse all ICML papers

In [3]:
_ = download_papers('https://proceedings.mlr.press/v235/', paper_pdf_dir)

all_papers = icml_parser.parse_folder(paper_pdf_dir)
with open(paper_parsed_dir, 'wb') as f:
    pickle.dump(all_papers, f)
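
Downloading and parsing the full proceedings is slow, so it can help to skip that step when the pickle file is already on disk. The following is a minimal sketch of such a guard. `load_or_build` is a hypothetical helper introduced here for illustration; it is not part of `utils`.

```python
import pickle
from pathlib import Path


def load_or_build(cache_path, build_fn):
    """Return the object pickled at cache_path, or build it and cache it.

    Hypothetical helper: build_fn is any zero-argument callable that
    produces a picklable object.
    """
    cache_path = Path(cache_path)
    if cache_path.exists():
        # Cache hit: reuse the earlier result without calling build_fn.
        with open(cache_path, 'rb') as f:
            return pickle.load(f)
    # Cache miss: build the result, then store it for the next run.
    result = build_fn()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with open(cache_path, 'wb') as f:
        pickle.dump(result, f)
    return result
```

In this notebook it could be called as `all_papers = load_or_build(paper_parsed_dir, lambda: icml_parser.parse_folder(paper_pdf_dir))`, assuming `parse_folder` returns a picklable object, as the cell above suggests.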

## Load ICML papers from pkl

In [3]:
with open(paper_parsed_dir, 'rb') as f:
    all_papers = pickle.load(f)

In [4]:
len(all_papers)

2610

## Test single paper summary

In [9]:
# Google AI Studio parameters 
from utils.paper_ontology import *
import os
from IPython.display import Markdown

genai.configure(api_key='<>')
flash = genai.GenerativeModel('gemini-1.5-flash')



In [21]:
# Test single paper summary
po = PaperOntology(all_papers[81], flash)
ps = po.create_summary()
Markdown(ps)

**Overview of VideoPoet: A Large Language Model for Zero-Shot Video Generation**

This paper introduces VideoPoet, a groundbreaking model for generating high-quality videos from various inputs like text, images, and audio.  For business stakeholders, this is significant because it opens doors to numerous applications requiring automated video creation, improving efficiency and potentially reducing costs.

**Problem Statement:**

Current video generation models primarily rely on diffusion-based methods. These methods typically start with a pre-trained image model and then fine-tune it for video, often resulting in a complex and less flexible system.  They also often struggle with generating high-fidelity motion and require significant computational resources. VideoPoet addresses these limitations by leveraging the power and efficiency of Large Language Models (LLMs).

**Use Cases Impacted:**

VideoPoet's impact spans several business sectors:

* **Marketing and Advertising:** Generate marketing videos quickly and easily from text descriptions or existing images, reducing production time and costs.
* **Entertainment:** Create short video clips, animations, and even longer videos for social media, games, and other entertainment applications.
* **Education and Training:**  Develop engaging educational videos from text scripts or images, simplifying the creation of instructional material.
* **E-commerce:** Generate product demonstration videos directly from product descriptions and images, streamlining the creation of online product showcases.
* **Film and Animation:** While still in early stages,  the technology shows potential to assist in creating initial drafts or concept videos for larger productions.


**Proposed Approach:**

VideoPoet utilizes a decoder-only transformer architecture, similar to successful LLMs used for text and code generation.  It's trained in two stages:

1. **Pretraining:** VideoPoet learns to generate videos from a diverse set of multimodal inputs and tasks (e.g., text-to-video, image-to-video, video inpainting). This is unlike diffusion models which usually focus on a single task.
2. **Task-Specific Adaptation:** The pre-trained model can then be fine-tuned for specific applications to enhance quality or add new capabilities.

The key difference from existing diffusion models is the use of a unified LLM architecture. This allows for easier integration of various input modalities and tasks, unlike diffusion models which often require architectural modifications for each new task.


**Fundamental Techniques:**

VideoPoet leverages several fundamental techniques:

* **Large Language Model (LLM) Architecture:**  The core of the model is a transformer-based LLM, known for its ability to handle sequential data and learn complex relationships between different modalities.
* **Autoregressive Generation:** The model generates videos sequentially, predicting the next frame based on the previous ones and any conditioning signals (text, images, etc.).
* **Multimodal Input:**  The model processes images, videos, text, and audio as discrete tokens, allowing for flexible input combinations.


**Existing Methods Used:**

VideoPoet employs existing methods and frameworks:

* **MAGVIT-v2 Tokenizer:**  For efficient encoding of images and videos into discrete tokens.
* **SoundStream Tokenizer:** For encoding audio into discrete tokens.
* **T5 XL Encoder:**  For processing text input.
* **Windowed Local Attention:**  To manage the computational complexity of processing high-resolution videos.
* **Alternating Gradient Descent (AGD):**  For efficient multi-task training.


**Benchmarks and Metrics:**

VideoPoet was evaluated on standard video generation benchmarks (MSR-VTT, UCF-101, Kinetics-600, Something-Something V2) using metrics like Fréchet Video Distance (FVD), CLIP similarity score, and Inception Score (IS).  These metrics assess the realism, consistency, and fidelity of the generated videos.

**Outperformance:**

VideoPoet demonstrates competitive performance and outperforms other methods in certain aspects, particularly in generating videos with complex and realistic motion, as demonstrated in both automated metrics and human evaluations.  It shows comparable or superior performance to state-of-the-art diffusion models on certain metrics.  Human evaluations confirm its strength in generating more interesting and realistic motion compared to other models.


**Main Conclusion and Future Impact:**

VideoPoet successfully demonstrates the feasibility and advantages of using LLMs for high-quality video generation.  This approach offers improved flexibility, efficiency, and the potential for zero-shot video generation and diverse editing capabilities.  The research opens new avenues for future research in video generation, and for businesses, it promises more efficient and cost-effective methods for video content creation across various industries.  However, ethical considerations regarding misuse (deepfakes etc.) need to be addressed.


## Test Single Ontology Creation

In [None]:
# For Neo4j Python driver installation, see https://neo4j.com/docs/python-manual/current/install/
# See also the setup comments in paper_ontology.py

URI = "neo4j://localhost:7687"
AUTH = ("neo4j", "secretgraph")
with GraphDatabase.driver(URI, auth=AUTH) as driver:
    driver.verify_connectivity()
    print("Connection established.")

In [18]:
#Test single ontology creation
pj = po.create_ontology_json()

In [11]:
kg = OntologyKG(URI, AUTH[0], AUTH[1])

kg.clean()
kg.insert(pj)

# Open http://localhost:7474 in a browser and click * to inspect the graph

In [28]:
po.create_ontology_str()
print(po.ontology_str)

::Model::VideoPoet
---->HAS_ARCHITECTURE::Architecture::Decoder-only transformer
---->PROCESSES::Modality::Image
---->PROCESSES::Modality::Video
---->PROCESSES::Modality::Text
---->PROCESSES::Modality::Audio
---->IS_A::Model::Large Language Model (LLM)
---->UNDERGOES::Training Stage::Pretraining
---->---->USES::Objective::Multimodal generative objectives
---->UNDERGOES::Training Stage::Task-specific adaptation
---->PERFORMS::Task::Zero-shot video generation
---->PERFORMS::Task::Text-to-video
---->PERFORMS::Task::Image-to-video
---->PERFORMS::Task::Video editing
---->PERFORMS::Task::Video-to-video stylization
---->USES::Tokenizer::MAGVIT-v2
---->USES::Tokenizer::SoundStream
---->USES::Embedding::T5 XL
---->HAS_MODULE::Module::Super-resolution module
---->EVALUATED_ON::Dataset::MSR-VTT
---->EVALUATED_ON::Dataset::UCF-101
---->EVALUATED_ON::Dataset::Kinetics 600 (K600)
---->EVALUATED_ON::Dataset::Something-Something V2 (SSv2)
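
The printed ontology appears to encode nesting by repeating the `---->` prefix, with each line holding `RELATION::Type::Name`. Under that assumption, the sketch below turns the text into flat tuples that are easier to inspect or compare. The format is inferred only from the output above, and `parse_ontology_str` is a hypothetical helper, not a function from `paper_ontology.py`.

```python
def parse_ontology_str(text, arrow='---->'):
    """Parse the indented ontology text into
    (parent_name, relation, node_type, node_name) tuples.

    Assumes the layout seen in the printed output: a root line such as
    '::Model::VideoPoet', then child lines whose depth is the number of
    leading arrows.
    """
    triples = []
    stack = []  # stack[d] holds the most recent node name at depth d
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        # Count and strip the leading arrows to get the nesting depth.
        depth = 0
        while line.startswith(arrow):
            line = line[len(arrow):]
            depth += 1
        parts = line.split('::')
        if depth == 0:
            # Root line: '::Model::VideoPoet' -> ['', 'Model', 'VideoPoet']
            stack = [parts[-1]]
            continue
        relation, node_type = parts[0], parts[1]
        name = '::'.join(parts[2:])
        parent = stack[depth - 1]
        triples.append((parent, relation, node_type, name))
        # Drop deeper entries, then record this node at its depth.
        del stack[depth:]
        stack.append(name)
    return triples
```

With the notebook state above, `parse_ontology_str(po.ontology_str)` would be the natural call, assuming `ontology_str` holds exactly the text that was printed.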



## Test QA Engine using neo4j

In [6]:
#RAG Demo https://streamlit-aicamp-1069753422075.us-central1.run.app
from utils.paper_QA import *

#kg.clean()

docs = paper2doc([all_papers[0]])
len(docs)

10

In [7]:
db = EmbeddingDB("neo4j://localhost:7687", "neo4j", "secretgraph", docs)

In [8]:
query = all_papers[0]['sections'][0]['section_content']
_ = db.query(query)

--------------------------------------------------------------------------------
Score:  0.9593460559844971
Creation of nanomaterials with specific morphol-
ogy remains a complex experimental process,
even though there is a growing demand for these
materials in various industry sectors. This study
explores the potential of AI to predict the morphol-
ogy of nanoparticles within the data availability
constraints. For that, we first generated a new
multi-modal dataset that is double the size of anal-
ogous studies. Then, we systematically evaluated
performance of classical machine learning and
large language models in prediction of nanoma-
terial shapes and sizes. Finally, we prototyped
a text-to-image system, discussed the obtained
empirical results, as well as the limitations and
promises of existing approaches.

--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Score:  0.9230