In [1]:
from utils.download_pdfs_from_url import download_papers
from utils import icml_parser 
from pathlib import Path
import pickle

## Project Parameters

In [2]:
PROJECT_DIR = './examples/icml_2024'

paper_pdf_dir = Path(PROJECT_DIR, 'paper_pdfs')
paper_parsed_dir = Path(PROJECT_DIR, 'paper_parsed.pkl')  # a pickle file (not a directory) holding the parsed papers

## Download and Parse all ICML papers

In [3]:
_ = download_papers('https://proceedings.mlr.press/v235/', paper_pdf_dir)

all_papers = icml_parser.parse_folder(paper_pdf_dir)
with open(paper_parsed_dir, 'wb') as f:
    pickle.dump(all_papers, f)

## Load ICML papers from pkl

In [3]:
with open(paper_parsed_dir, 'rb') as f:
    all_papers = pickle.load(f)
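
The download/parse cell and the load cell above can be folded into one cached step, so that re-running the notebook does not re-parse every PDF. A minimal sketch, assuming a hypothetical helper named `load_or_build` (it is not part of `utils`); note that pickle files should only be loaded from sources you trust.

```python
import pickle
from pathlib import Path


def load_or_build(cache_path, build_fn):
    """Return the cached object if the pickle exists, else build it and cache it.

    `load_or_build` is a hypothetical helper for illustration only.
    """
    cache_path = Path(cache_path)
    if cache_path.exists():
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    result = build_fn()
    # Create the parent folder on first use so the dump cannot fail on a missing dir.
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with open(cache_path, "wb") as f:
        pickle.dump(result, f)
    return result
```

In this notebook it could be called as `all_papers = load_or_build(paper_parsed_dir, lambda: icml_parser.parse_folder(paper_pdf_dir))`.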

In [4]:
len(all_papers)

2610

## Test single paper summary

In [6]:
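# Google AI Studio parameters
import os

import google.generativeai as genai  # imported explicitly rather than relying on the star import below
from IPython.display import Markdown
from utils.paper_ontology import *

genai.configure(api_key='<>')  # replace '<>' with your Google AI Studio API key
flash = genai.GenerativeModel('gemini-1.5-flash')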

In [7]:
# Test single paper summary
po = PaperOntology(all_papers[81], flash)
ps = po.create_summary()
Markdown(ps)

Overview of VideoPoet: A Large Language Model for Zero-Shot Video Generation

**For Business Stakeholders:**

This research paper introduces VideoPoet, a groundbreaking AI model capable of generating high-quality videos from various inputs like text, images, and audio.  This has significant implications for several businesses.

**1. Problem Statement:**

Current video generation methods primarily rely on diffusion models. These models often require extensive fine-tuning for each specific task (e.g., text-to-video, image-to-video) and struggle with generating videos containing complex, high-fidelity motion.  They also lack the flexibility to handle multiple tasks within a single model. VideoPoet addresses these limitations by leveraging the power of Large Language Models (LLMs).

**2. Use Cases Impacted:**

VideoPoet's capabilities impact a broad range of businesses, including:

* **Marketing and Advertising:** Generate engaging video ads quickly and efficiently from text descriptions or existing images.  Imagine creating personalized video ads tailored to specific customer segments without extensive video production costs.
* **Film and Entertainment:**  Accelerate the creation of movie trailers, animated sequences, or even entire short films with text-based prompts, reducing production time and costs dramatically.  
* **Education and Training:** Create dynamic and engaging educational videos easily.  Imagine generating illustrative videos for complex concepts based on simple text descriptions.
* **Gaming:**  Generate in-game cinematics and procedural content, improving the efficiency of game development.
* **Social Media:** Create short, high-quality videos for various social media platforms from simple text descriptions.
* **Virtual and Augmented Reality:** Generate realistic videos for use in VR/AR applications.


**3. Proposed Approach:**

VideoPoet uses a decoder-only transformer architecture, similar to successful LLMs used for text and code generation.  It's trained in two stages:

* **Pretraining:**  The model learns from a massive dataset of images, videos, and audio, using a mix of multimodal generative tasks. This includes generating videos from text, images, or audio; predicting future video frames; and inpainting/outpainting missing parts of videos. This allows the model to learn a rich understanding of various video elements and their relationships.
* **Task-specific adaptation:** After pretraining, the model can be further fine-tuned for specific tasks to improve its performance and address specialized requirements.


**How it's different:** Unlike existing diffusion models, VideoPoet uses a unified architecture for various video generation tasks, eliminating the need for extensive retraining for each new application. This significantly reduces development time and cost.


**4. Fundamental Techniques:**

The approach is based on the fundamental concepts of:

* **Large Language Models (LLMs):**  Utilizing the power of LLMs to handle diverse modalities and tasks within a single architecture.
* **Transformer Networks:**  Leveraging the efficiency and effectiveness of transformer architectures for processing sequential data like videos.
* **Tokenization:** Representing various input modalities (text, images, audio) as discrete tokens for unified processing.
* **Autoregressive Generation:**  Generating video frames sequentially, conditioned on the previously generated frames and other input signals.

**5. Existing Methods, Algorithms, Frameworks:**

VideoPoet utilizes:

* **MAGVIT-v2 tokenizer:** For efficient image and video encoding.
* **SoundStream tokenizer:** For audio encoding.
* **T5 XL encoder:** For text embeddings.
* **Custom super-resolution transformer:** To efficiently upscale the generated videos to higher resolutions.
* **Alternating Gradient Descent (AGD):**  An optimization technique for efficient multi-task training.


**6. Benchmarks and Metrics:**

The model was evaluated on several standard video generation benchmarks like MSR-VTT and UCF-101 (text-to-video), Kinetics-600 (frame prediction), and Something-Something V2 (inpainting/outpainting).  Metrics reported include:

* **Fréchet Video Distance (FVD):** Measures the perceptual similarity between generated and real videos.
* **CLIP Similarity Score:**  Evaluates the semantic similarity between generated video and its corresponding text description.
* **Inception Score (IS):**  A measure of the quality and diversity of the generated videos.


**7. Outperformance:**

VideoPoet outperforms several state-of-the-art methods, particularly in generating videos with high-fidelity motions and in zero-shot settings (meaning it can handle tasks not explicitly seen during training).  Human evaluations also show its competitive performance against leading diffusion-based models.  In particular, it excelled in generating videos with interesting and realistic motion.

**8. Main Conclusion and Impact:**

VideoPoet demonstrates the significant potential of LLMs in high-quality video generation. Its ability to handle multiple tasks in a unified framework, combined with its strong zero-shot performance, has substantial implications for research and business.  Future research can focus on improving visual fidelity, handling finer details, and further exploring its editing capabilities. For businesses, it promises to significantly streamline video creation workflows, reduce production costs, and enable new creative applications. The ease of generating videos from text descriptions alone opens up numerous opportunities for various industries to adopt this technology for faster, more cost-effective video content creation.


## Test Single Ontology Creation

In [8]:
# Neo4j setup: see https://neo4j.com/docs/python-manual/current/install/
# and the comments in paper_ontology.py
from neo4j import GraphDatabase

URI = "neo4j://localhost:7687"
AUTH = ("neo4j", "secretgraph")
with GraphDatabase.driver(URI, auth=AUTH) as driver:
    driver.verify_connectivity()
    print("Connection established.")

Connection established.


In [9]:
#Test single ontology creation
pj = po.create_ontology_json()

In [11]:
kg = OntologyKG(URI, AUTH[0], AUTH[1])

kg.clean()
kg.insert(pj)

# Go to http://localhost:7474 and click * to check the graph
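
For readers who want to see what inserting one ontology edge might look like in Cypher, here is a minimal sketch. It is not how `OntologyKG.insert` is implemented (that code lives in `paper_ontology.py`); `triple_to_cypher` and the `name` property are assumptions made for illustration. Labels and relationship types cannot be passed as Cypher parameters, so they are validated and interpolated, while node names go in as parameters.

```python
import re

# Labels and relationship types are interpolated into the query text,
# so only plain identifiers are accepted.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def triple_to_cypher(src_label, src_name, rel, dst_label, dst_name):
    """Build a parameterized MERGE statement for one ontology edge.

    Hypothetical helper; the `name` property is an assumed schema.
    """
    for ident in (src_label, rel, dst_label):
        if not _IDENT.match(ident):
            raise ValueError(f"unsafe identifier: {ident!r}")
    query = (
        f"MERGE (a:{src_label} {{name: $src}}) "
        f"MERGE (b:{dst_label} {{name: $dst}}) "
        f"MERGE (a)-[:{rel}]->(b)"
    )
    return query, {"src": src_name, "dst": dst_name}
```

With a recent Neo4j Python driver (5.x), the result could be run as `driver.execute_query(query, parameters_=params)`. `MERGE` is used so that re-inserting the same edge does not create duplicates.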

In [10]:
po.create_ontology_str()
print(po.ontology_str)

::Model::VideoPoet
---->USES::Architecture::Decoder-only transformer architecture
---->---->PROCESSES::Modality::Image
---->---->PROCESSES::Modality::Video
---->---->PROCESSES::Modality::Text
---->---->PROCESSES::Modality::Audio
---->IS_A::Model::Large Language Model (LLM)
---->PERFORMS::Task::Video Generation
---->---->INCLUDES::Task::Text-to-video
---->---->INCLUDES::Task::Image-to-video
---->---->INCLUDES::Task::Video editing
---->---->INCLUDES::Task::Video-to-video stylization
---->---->USES::Method::Diffusion-based methods
---->---->---->USES::Model::Stable Diffusion
---->USES::Tokenizer::MAGVIT-v2
---->USES::Tokenizer::SoundStream
---->USES::Embedding::T5 XL embeddings
---->USES::Module::Super-resolution module
---->EVALUATED_ON::Dataset::MSR-VTT
---->EVALUATED_ON::Dataset::UCF-101
---->EVALUATED_ON::Dataset::Kinetics 600 (K600)
---->EVALUATED_ON::Dataset::Something-Something V2 (SSv2)
---->EVALUATED_WITH::Metric::Fréchet Video Distance (FVD)
---->EVALUATED_WITH::Metric::CLIP si
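
The printout above encodes a tree: each `---->` prefix is one level of nesting, and each line reads `RELATION::Type::Name`. A minimal sketch of turning that text into (parent, relation, child) triples follows. `parse_ontology` is a hypothetical helper, not part of `paper_ontology.py`, and the format is inferred only from the output shown here.

```python
ARROW = "---->"


def parse_ontology(text):
    """Turn the indented ontology printout into (parent, relation, child) triples."""
    triples = []
    stack = []  # stack[d] holds the most recent node name seen at depth d
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Count and strip the leading arrows to get the nesting depth.
        depth = 0
        while line.startswith(ARROW):
            line = line[len(ARROW):]
            depth += 1
        parts = line.split("::", 2)
        if len(parts) != 3:
            continue  # skip lines that do not have the RELATION::Type::Name shape
        relation, _node_type, name = parts
        del stack[depth:]  # drop nodes deeper than (or at) the current level
        if len(stack) != depth:
            continue  # a level was skipped; ignore the orphaned line
        if depth > 0:
            triples.append((stack[depth - 1], relation, name))
        stack.append(name)
    return triples
```

Calling `parse_ontology(po.ontology_str)` would give a flat list of edges, which is easier to filter or count than the raw string. The root line has an empty relation and produces no triple of its own.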

## Test QA Engine using neo4j

In [12]:
#RAG Demo https://streamlit-aicamp-1069753422075.us-central1.run.app
from utils.paper_QA import *

#kg.clean()

docs = paper2doc([all_papers[0]])  # indexes only the first paper; pass all_papers to index every paper
len(docs)

10
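
One paper becomes 10 documents here, so the text is evidently split into smaller pieces before indexing. How `paper2doc` does this is not shown in this notebook; the sketch below illustrates one common approach, fixed-size character chunks with a small overlap. `chunk_text` and its default sizes are assumptions for illustration only.

```python
def chunk_text(text, chunk_size=1000, overlap=100):
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters so that a sentence cut at a
    boundary still appears whole in one of them. Hypothetical helper.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks
```

Smaller chunks tend to give more focused retrieval hits, while larger ones keep more context per hit; the right size depends on the embedding model and the documents.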

In [13]:
import os

#if "GOOGLE_API_KEY" not in os.environ:
os.environ["GOOGLE_API_KEY"] = "<>"

db = EmbeddingDB("neo4j://localhost:7687", "neo4j", "secretgraph", docs)

In [14]:
query = "What is the main idea?"
r_docs = db.query(query, n=5)

--------------------------------------------------------------------------------
Score:  0.6931540966033936
Statistical tests proved certain associations between the pa-
rameters of syntheses and the morphologies of the resulting
nanomaterials. Therefore, we attempted to exploit them in
predicting shapes and sizes of nanomaterials using classical
machine learning algorithms.
In some cases, several nanoparticles of different shapes and
sizes were present on the same image, so initial 215 syn-
theses produced 314 training examples of nanoparticles of
different types. Following the logic of the statistical eval-
uation, we formulated a set of binary classification tasks,
one for each type of shape or a combination of shape and
size. In this formulation, we first trained a separate model
to distinguish nanoparticles of each particular shape. Then,
we ran multiple predictions for each sample during the infer-
ence to establish what shapes of nanoparticles were present
on the corresponding i
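
The score printed above is a similarity between the query embedding and a stored chunk embedding. Which metric `EmbeddingDB` uses is not stated in this notebook; the sketch below assumes cosine similarity and shows the ranking step on toy vectors in plain Python. `cosine` and `top_n` are hypothetical names.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_n(query_vec, doc_vecs, n=5):
    """Return the n (doc_id, score) pairs most similar to query_vec, best first."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]
```

A real vector index avoids this brute-force scan by using an approximate nearest-neighbour structure, but the ranking it returns has the same meaning.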

In [15]:
# Concatenate the retrieved passages, one per line; the similarity score is not used here
context = ''.join(doc.page_content + '\n' for doc, _score in r_docs)

prompt = f'Answer question based on given context. Question: {query} \n Context: {context}'
answer = flash.generate_content(prompt)

Markdown(answer.text)

This study explores the use of AI, specifically machine learning and large language models (LLMs), to predict the morphology (shape and size) of nanoparticles.  The researchers created a multimodal dataset of nanoparticle syntheses and corresponding SEM images,  performed statistical analysis to identify relevant synthesis parameters, and then trained and evaluated classical machine learning models and LLMs on this dataset.  Finally, they prototyped a text-to-image system to generate nanoparticle images from synthesis descriptions.  The results demonstrate the potential of AI, particularly LLMs, to accurately predict nanoparticle morphology, offering a faster and less expensive alternative to traditional experimental methods.
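
The prompt above concatenates every retrieved passage. When indexing all papers, the combined context can grow large, so it may help to cap it. A minimal sketch, assuming a hypothetical `build_prompt` helper and an arbitrary character budget; neither is part of `utils.paper_QA`.

```python
def build_prompt(query, passages, max_chars=8000):
    """Join retrieved passages into a prompt, stopping before the budget is exceeded.

    `passages` is a list of strings ordered best match first. Hypothetical helper;
    max_chars is an arbitrary illustrative default.
    """
    kept, used = [], 0
    for text in passages:
        if used + len(text) > max_chars:
            break  # keep the best-ranked passages and drop the rest
        kept.append(text)
        used += len(text)
    context = "\n".join(kept)
    return (
        "Answer the question using only the context below.\n"
        f"Question: {query}\n"
        f"Context:\n{context}"
    )
```

In this notebook it could be called as `build_prompt(query, [doc.page_content for doc, _ in r_docs])` and the result passed to `flash.generate_content`.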
