In [1]:
from utils.download_pdfs_from_url import download_papers
from utils import icml_parser 
from pathlib import Path
import pickle

## Project Parameters

In [2]:
PROJECT_DIR = './examples/icml_2024'

paper_pdf_dir = Path(PROJECT_DIR, 'paper_pdfs')
paper_parsed_dir = Path(PROJECT_DIR, 'paper_parsed.pkl')

## Download and Parse all ICML papers

In [3]:
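# One-time step: uncomment the lines below to download the PDFs and parse them.
# The parsed result is pickled to paper_parsed_dir, so later runs can load it instead.
#_ = download_papers('https://proceedings.mlr.press/v235/', paper_pdf_dir)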

#all_papers = icml_parser.parse_folder(paper_pdf_dir)
#with open(paper_parsed_dir, 'wb') as f:
#    pickle.dump(all_papers, f)
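
The commented-out cell above and the load cell that follows amount to a "parse once, then reuse the pickle" pattern. A minimal sketch of that pattern as a single helper is shown here; `load_or_parse` is a hypothetical name that does not exist in `utils`, and the parsing step is passed in as a callable so the sketch makes no assumption about how `icml_parser.parse_folder` behaves.

```python
import pickle
from pathlib import Path
from typing import Any, Callable


def load_or_parse(cache_path: Path, parse_fn: Callable[[], Any]) -> Any:
    """Return the pickled object if the cache file exists, else parse and cache it.

    Hypothetical helper for illustration; not part of the project's utils.
    """
    cache_path = Path(cache_path)
    if cache_path.exists():
        # Cache hit: skip the expensive download/parse step entirely.
        with open(cache_path, 'rb') as f:
            return pickle.load(f)

    # Cache miss: run the parser, then store the result for the next run.
    result = parse_fn()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with open(cache_path, 'wb') as f:
        pickle.dump(result, f)
    return result
```

In this notebook it could be called as `all_papers = load_or_parse(paper_parsed_dir, lambda: icml_parser.parse_folder(paper_pdf_dir))`, which would replace both the commented-out cell and the load cell.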

## Load ICML papers from pkl

In [3]:
with open(paper_parsed_dir, 'rb') as f:
    all_papers = pickle.load(f)

In [4]:
len(all_papers)

2610

## Test single paper summary

In [9]:
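# Google AI Studio parameters
# `genai` (and `GraphDatabase`, used further down) are picked up via this star import.
from utils.paper_ontology import *
import os
from IPython.display import Markdown

# Replace '<>' with your own Google AI Studio API key.
genai.configure(api_key='<>')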
flash = genai.GenerativeModel('gemini-1.5-flash')



In [10]:
# Test single paper summary
po = PaperOntology(all_papers[81], flash)
ps = po.create_summary()
Markdown(ps)

## VideoPoet: A Large Language Model for Zero-Shot Video Generation - An Overview for Business Stakeholders

This paper introduces VideoPoet, a groundbreaking AI model capable of generating high-quality videos from various inputs like text, images, and audio.  This has significant implications for several businesses.

**1. The Problem:**

Current methods for generating videos rely heavily on diffusion models, which are complex, computationally expensive, and often require significant fine-tuning for each specific task (e.g., turning text into video, modifying an existing video).  These models struggle with generating videos featuring high-fidelity, complex motion.  The process is also inflexible, requiring separate models for different video manipulation tasks.


**2. Use Cases Impacted:**

VideoPoet's capabilities impact a wide array of businesses:

* **Marketing and Advertising:** Generate engaging video ads tailored to specific demographics or product features directly from text descriptions.
* **Film and Entertainment:** Create realistic and stylized videos, potentially streamlining animation and visual effects processes.
* **Education and Training:** Produce instructional videos quickly and easily from text scripts or existing images.
* **Gaming:**  Generate dynamic in-game content or cut-scenes more efficiently.
* **Virtual and Augmented Reality:** Create immersive and interactive experiences by producing high-quality video content on the fly.


**3. Proposed Approach:**

Unlike existing diffusion-based models, VideoPoet uses a Large Language Model (LLM) architecture.  Think of it like a highly advanced text prediction model, but instead of predicting words, it predicts video frames and audio. It works by converting all inputs (text, images, audio) into a universal code (tokens) that the LLM can understand and process.  This allows for a unified approach to various video generation tasks, eliminating the need for separate models.

**4. Fundamental Techniques:**

VideoPoet is based on:

* **Large Language Models (LLMs):** The core of the model, leveraging the power of LLMs to handle multiple modalities (text, image, audio) in a unified way.
* **Transformer Architecture:** The underlying neural network architecture known for its ability to process sequential data like text and video effectively.
* **Autoregressive Generation:** The model predicts future frames based on preceding frames, similar to how we predict the next word in a sentence.


**5. Existing Methods Used:**

The model utilizes:

* **MAGVIT-v2 tokenizer:**  A highly efficient method for converting images and videos into the numerical codes (tokens) that the LLM understands.
* **SoundStream tokenizer:**  Converts audio into tokens.
* **T5 XL encoder:**  A pre-trained model for converting text into embeddings suitable for the LLM.
* **Bidirectional Transformer with Windowed Attention:** A technique to efficiently upscale low-resolution videos to high resolution.


**6. Benchmarks and Metrics:**

The researchers tested VideoPoet on several established video generation benchmarks (MSR-VTT, UCF-101, Kinetics-600, Something-Something V2), comparing it to state-of-the-art methods.  They used metrics like Fréchet Video Distance (FVD – measures the similarity between generated and real videos), CLIP similarity (measures how well the generated video matches a text description), and Inception Score (measures the diversity and quality of generated videos).


**7. Outperformance:**

VideoPoet showed competitive or superior performance compared to other leading video generation models, especially in generating videos with complex and realistic motions.  Human evaluations confirmed that VideoPoet often surpasses existing models in terms of motion realism and "interestingness."


**8. Main Conclusion and Future Impact:**

The paper demonstrates that LLMs can be a highly effective approach to video generation, outperforming traditional diffusion models in several key areas.  This opens up new avenues of research and business opportunities:

* **Increased efficiency:**  Generating videos becomes faster and cheaper.
* **Enhanced creativity:** Easier and more intuitive video creation tools for businesses.
* **New possibilities:** Enables previously impossible video applications.

However, ethical considerations surrounding deepfakes and misinformation need to be addressed through measures like digital watermarking.  The research strongly suggests that LLM-based video generation is a promising path forward for many industries.


## Test Single Ontology Creation

In [None]:
# To install the Neo4j Python driver, see https://neo4j.com/docs/python-manual/current/install/
# See also the setup comments in paper_ontology.py.

URI = "neo4j://localhost:7687"
AUTH = ("neo4j", "secretgraph")
with GraphDatabase.driver(URI, auth=AUTH) as driver:
    driver.verify_connectivity()
    print("Connection established.")
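
The cell above hard-codes the connection URI and password. A minimal sketch of reading them from the environment instead is shown below, falling back to the values used in this notebook. The variable names `NEO4J_URI`, `NEO4J_USER`, and `NEO4J_PASSWORD` and the helper `neo4j_settings` are assumptions made for illustration; nothing in the project defines them.

```python
import os
from typing import Mapping, Tuple


def neo4j_settings(env: Mapping[str, str] = os.environ) -> Tuple[str, Tuple[str, str]]:
    """Build (URI, AUTH) from environment variables, with this notebook's defaults.

    The environment variable names are hypothetical, chosen for this sketch.
    """
    uri = env.get("NEO4J_URI", "neo4j://localhost:7687")
    user = env.get("NEO4J_USER", "neo4j")
    password = env.get("NEO4J_PASSWORD", "secretgraph")
    return uri, (user, password)


# Same shape as the URI / AUTH pair defined in the cell above.
URI, AUTH = neo4j_settings()
```

This keeps credentials out of a shared notebook while leaving the later cells, which index `AUTH[0]` and `AUTH[1]`, unchanged.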

In [12]:
#Test single ontology creation
pj = po.create_ontology_json()

In [11]:
kg = OntologyKG(URI, AUTH[0], AUTH[1])

kg.clean()
kg.insert(pj)

# Go to http://localhost:7474 and click * to check the graph

In [13]:
po.create_ontology_str()
print(po.ontology_str)

::Model::VideoPoet
---->HAS_ARCHITECTURE::Architecture::Decoder-only transformer
---->PROCESSES::Modality::Images
---->PROCESSES::Modality::Videos
---->PROCESSES::Modality::Text
---->PROCESSES::Modality::Audio
---->IS_A::Model::Large Language Model (LLM)
---->UNDERGOES::Training Stage::Pretraining
---->---->USES::Objective::Multimodal generative objectives
---->UNDERGOES::Training Stage::Task-specific adaptation
---->PERFORMS::Task::Zero-shot video generation
---->PERFORMS::Task::Text-to-video
---->PERFORMS::Task::Image-to-video
---->PERFORMS::Task::Video editing
---->PERFORMS::Task::Video-to-video stylization
---->USES::Tokenizer::MAGVIT-v2
---->USES::Tokenizer::SoundStream
---->HAS_MODULE::Module::Super-resolution module
---->EVALUATED_ON::Dataset::MSR-VTT
---->EVALUATED_ON::Dataset::UCF-101
---->EVALUATED_ON::Dataset::Kinetics 600 (K600)
---->EVALUATED_ON::Dataset::Something-Something V2 (SSv2)
---->EVALUATED_WITH::Metric::Fréchet Video Distance (FVD)
---->EVALUATED_WITH::Metric::C
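
The printed listing above follows a regular layout: each `---->` prefix adds one level of nesting, and the rest of the line reads `RELATION::Type::Name` (the root line has an empty relation). A minimal sketch of turning such a string back into (parent, relation, child) triples follows. The layout is inferred only from the output shown here, and `parse_ontology_str` is a hypothetical helper, not a function in `paper_ontology.py`.

```python
from typing import List, Tuple

ARROW = "---->"


def parse_ontology_str(text: str) -> List[Tuple[str, str, str]]:
    """Convert the indented listing into (parent_name, relation, child_name) triples."""
    triples: List[Tuple[str, str, str]] = []
    stack: List[str] = []  # stack[d] holds the most recent node name at depth d

    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue

        # Count and strip the leading arrows to get the nesting depth.
        depth = 0
        while line.startswith(ARROW):
            line = line[len(ARROW):]
            depth += 1

        # Expect RELATION::Type::Name; anything after the second '::' stays in Name.
        parts = line.split("::", 2)
        if len(parts) != 3:
            continue  # skip truncated or malformed lines
        relation, _node_type, name = parts

        del stack[depth:]      # drop nodes at this depth or deeper
        if len(stack) < depth:
            continue           # orphan line: its parent level is missing
        if depth > 0:
            triples.append((stack[-1], relation, name))
        stack.append(name)

    return triples
```

Calling `parse_ontology_str(po.ontology_str)` would give a flat list of edges that is easy to inspect or compare across papers without opening the Neo4j browser.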

## Test QA Engine using neo4j

In [6]:
#RAG Demo https://streamlit-aicamp-1069753422075.us-central1.run.app
from utils.paper_QA import *

#kg.clean()

docs = paper2doc([all_papers[0]])
len(docs)

10

In [7]:
db = EmbeddingDB("neo4j://localhost:7687", "neo4j", "secretgraph", docs)

In [8]:
query = all_papers[0]['sections'][0]['section_content']
_ = db.query(query)

--------------------------------------------------------------------------------
Score:  0.9593460559844971
Creation of nanomaterials with specific morphol-
ogy remains a complex experimental process,
even though there is a growing demand for these
materials in various industry sectors. This study
explores the potential of AI to predict the morphol-
ogy of nanoparticles within the data availability
constraints. For that, we first generated a new
multi-modal dataset that is double the size of anal-
ogous studies. Then, we systematically evaluated
performance of classical machine learning and
large language models in prediction of nanoma-
terial shapes and sizes. Finally, we prototyped
a text-to-image system, discussed the obtained
empirical results, as well as the limitations and
promises of existing approaches.

--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Score:  0.9230