## Questions for Ahmad:

- What is the best way to store and process large amounts of text data?
    - how to amalgamate the different stored files
    - what should my approach be (work on work corpus first?) because different data have different royalty rates
    
- Goal is to study royaltyrate (a continuous variable) from text data
- Validate current methods (NER & text summarisation) to extract relevant features and then predict royalty rate
    - NER on BERT : **location (LOC)**, **organizations (ORG)**, person (PER) and Miscellaneous (MISC) (https://huggingface.co/dslim/bert-base-NER)
    - Legal BERT: (https://huggingface.co/nlpaueb/legal-bert-base-uncased)
    - Text Summarisation: 16k token count limit, seems plenty (https://huggingface.co/nsi319/legal-led-base-16384)
    - topic segmentation/clustering
- Otherwise, what steps would you suggest? 
    - traditional models?
    - specific forms of data cleaning (lack of legal domain)
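
A possible starting point for the storage question, as a minimal sketch rather than a recommendation: stream every `.txt` file into one JSON Lines file, one record per agreement, so the corpus can be processed lazily instead of loaded all at once. The function names, the `doc_id`/`source`/`text` fields and the output file name are all assumptions made for illustration.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_documents(corpus_dir: str) -> Iterator[dict]:
    """Yield one record per .txt file, reading files one at a time."""
    for path in sorted(Path(corpus_dir).rglob("*.txt")):
        # errors="replace" keeps going if a file contains stray non-UTF-8 bytes
        text = path.read_text(encoding="utf-8", errors="replace")
        # doc_id is the file name without extension; source is the parent folder
        yield {"doc_id": path.stem, "source": path.parent.name, "text": text}


def write_jsonl(corpus_dir: str, out_path: str) -> int:
    """Merge every document into a single JSON Lines file; return the count."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for record in iter_documents(corpus_dir):
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Keeping the folder name in `source` means the different stored file sets stay distinguishable after merging, and `doc_id` could later serve as a join key to a table of royalty rates if one exists in that form.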


- **Summarisation:**
    - document is organised in chunks, which can already provide us with a good structure/flow
    - 

- **NER:**
    - extract information such as
        - 1) licensor 2) licensee 3) date of transaction

- **Features for Predicting Royalty Rates**:
    - Sales Periods: Differentiate rates based on the stage of the product (Beta Testing, Initial Launch, Secondary Launch, Extension Periods).
    - Net Sales: Net Sales can directly influence the royalty rate based on brackets (1st 1 million, 2nd 1 million, above 2 million).
    - Minimum Amounts: These are the royalty floors for various periods.
    - Patent Status: Depending on the jurisdiction and IP considerations, the royalty rate can change (e.g., 50% reduction).
    - Sub-licensing: If sublicensing occurs, royalty needs to be considered.
    - Competing or Next-Generation Products: Could lead to renegotiation or changes in royalty rates. 

**Summarised Royalty rate segment:**
The paragraph details the royalty agreement between a Licensor and Licensee for Covered Products. The Licensee must pay royalties to the Licensor starting from the Covered Product Launch Date, with rates and terms described in specific sections. There are various phases such as Beta Testing, Initial Launch, Secondary Launch, and Extension Periods, each with its own Minimum Amount of royalty. Net Sales Brackets are defined with rates of 20%, 15%, and 10%. Sales in jurisdictions without patent recognition are subject to a 50% reduced royalty rate. Royalties are to be paid quarterly in US dollars. The Licensee must maintain records for verification and has the right to sublicense. If the Licensor develops a competing product, the Licensee has a right of first negotiation. If the Licensee fails to maintain the Minimum Amounts, they may rectify this in the last report of the applicable period to maintain their license rights.

**Extracted Features of Interest:**
- Sales Periods:

        Beta Testing
        Initial Launch
        Secondary Launch
        Extension Periods
        
- **Net Sales:**

        Defined with rates based on Net Sales Brackets
        20% for the 1st 1 million
        15% for the 2nd 1 million
        10% for above 2 million
        
- **Minimum Amounts:**

        Different for each phase (though exact amounts are not provided in the summary)
- **Patent Status:**

        Sales in jurisdictions without patent recognition are subject to a 50% reduced royalty rate.
- **Sub-licensing:**

        The Licensee has the right to sublicense.
- **Competing or Next-Generation Products:**

        If the Licensor develops a competing product, the Licensee has a right of first negotiation.
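
To make the Net Sales and Patent Status features concrete, here is a small worked sketch of the royalty arithmetic. It assumes the brackets are marginal (each rate applies only to the sales that fall inside its bracket), which the summary suggests but does not state outright, and it leaves out Minimum Amounts because the summary gives no figures. `BRACKETS` and `royalty_due` are illustrative names.

```python
# (bracket width in USD, rate); None marks the open-ended top bracket.
# Assumption: brackets are marginal, as in "20% for the 1st 1 million".
BRACKETS = [(1_000_000, 0.20), (1_000_000, 0.15), (None, 0.10)]


def royalty_due(net_sales: float, patent_recognised: bool = True) -> float:
    """Royalty owed on net sales under the summarised bracket structure."""
    if net_sales < 0:
        raise ValueError("net_sales must be non-negative")
    remaining = net_sales
    total = 0.0
    for width, rate in BRACKETS:
        # Take as much of the remaining sales as fits in this bracket
        portion = remaining if width is None else min(remaining, width)
        total += portion * rate
        remaining -= portion
        if remaining <= 0:
            break
    if not patent_recognised:
        # 50% reduction for jurisdictions without patent recognition
        total *= 0.5
    return total
```

Under these assumptions, net sales of 2.5 million give 200,000 + 150,000 + 50,000 = 400,000, an effective rate of 16%, or 200,000 where the patent is not recognised. An effective rate of this kind may be a cleaner prediction target than any single bracket rate.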

### Concerns

- each agreement is structured differently
- large amounts of data (which can be good but not sure if my processing can handle them)
- many different segments, lack knowledge on which features to keep or pay attention to

### Notes:

- Bin the royaltyrate
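
A minimal sketch of the binning idea, turning the continuous rate into ordered classes. The bin edges and labels are hypothetical placeholders, not values taken from the data; they would need to be chosen after looking at the actual distribution.

```python
from bisect import bisect_right

# Hypothetical edges in percent; each bin includes its lower edge.
BIN_EDGES = [5.0, 10.0, 20.0]
BIN_LABELS = ["<5%", "5-10%", "10-20%", ">=20%"]


def bin_royalty_rate(rate_percent: float) -> str:
    """Map a royalty rate in percent to its bin label."""
    # bisect_right counts how many edges are <= rate_percent
    return BIN_LABELS[bisect_right(BIN_EDGES, rate_percent)]
```

On a whole DataFrame column, `pandas.cut` provides the same kind of binning.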

- EDA of corpus (tokens)
- Regex
- function to read in data (tokens)
- vector databases
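
A rough sketch of what a first EDA pass over tokens could look like. It uses a simple regex tokenizer, so the counts are only an approximation of what a model tokenizer would produce (subword tokenizers usually yield more tokens). The default `limit` of 16,384 is taken from the name of the summarisation model linked above; `tokenize` and `corpus_stats` are illustrative names.

```python
import re
import statistics
from collections import Counter
from typing import Dict, List

# Words and numbers, allowing internal apostrophes or hyphens
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*")


def tokenize(text: str) -> List[str]:
    """Lowercased word-level tokens; an approximation, not model tokens."""
    return [tok.lower() for tok in TOKEN_RE.findall(text)]


def corpus_stats(docs: Dict[str, str], limit: int = 16_384) -> dict:
    """Summarise document lengths and vocabulary for a {doc_id: text} mapping."""
    lengths = {doc_id: len(tokenize(text)) for doc_id, text in docs.items()}
    vocab = Counter(tok for text in docs.values() for tok in tokenize(text))
    return {
        "n_docs": len(docs),
        "median_tokens": statistics.median(lengths.values()),
        # Documents that would likely need chunking before summarisation
        "over_limit": sorted(d for d, n in lengths.items() if n > limit),
        "top_tokens": vocab.most_common(5),
    }
```

Note that `statistics.median` raises an error on an empty corpus, so `docs` must contain at least one document.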

- LLM (langchain)
- Claude 2 (feature engineering) 

ahmad.ammari@unimelb.edu.au

In [4]:
import pandas as pd
from docx import Document

In [5]:
def read_txt(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        text = file.read()
    return text

file_path = '/home/kprasath/DS/NLPInternship/txt_data/ref_1_txt/8683.txt'
text = read_txt(file_path)

print(text)

Top of Form







                                                                    EXHIBIT 10.2

                                LICENSE AGREEMENT
                                -----------------

         This LICENSE AGREEMENT ("Agreement") is made this 30th day of March,
2006 by and between AzurTec, Inc. ("AzurTec" or the "Licensor"), a Pennsylvania
corporation, and PhotoMedex, Inc. ("PHMD" or the "Licensee"), a Delaware
corporation.

                                   WITNESSETH:
                                   ----------

         WHEREAS, PHMD and AzurTec are parties to that certain Investment
Agreement (the "Investment Agreement"), that certain Security Agreement (the
"Security Agreement") and that certain Development Agreement (the "Development
Agreement"), as amended, all of even date herewith;

         WHEREAS, pursuant to the Investment Agreement, PHMD and AzurTec have
agreed to enter into this License Agreement.

         NOW, THEREFORE, in consideration of the pre
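
The preamble printed above already names both parties and the date in a fairly regular form, so a regex baseline (not NER, just a cheap check to compare NER output against) might look like the sketch below. The patterns only cover the phrasings seen in these sample agreements, and every name here (`normalise`, `extract_roles`, `extract_date`) is illustrative.

```python
import re
from datetime import date
from typing import Dict, Optional

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
_MONTH = "|".join(MONTHS)

# Three phrasings: "30th day of March, 2006", "May 17, 2006", "19 June 2006"
DATE_PATTERNS = [
    re.compile(rf"(?P<day>\d{{1,2}})(?:st|nd|rd|th)? day of (?P<month>{_MONTH}),? (?P<year>\d{{4}})"),
    re.compile(rf"(?P<month>{_MONTH}) (?P<day>\d{{1,2}}),? (?P<year>\d{{4}})"),
    re.compile(rf"(?P<day>\d{{1,2}}) (?P<month>{_MONTH}) (?P<year>\d{{4}})"),
]
# Matches e.g. ("AzurTec" or the "Licensor")
ROLE_PATTERN = re.compile(r'\("(?P<alias>[^"]+)" or the "(?P<role>Licensor|Licensee)"\)')


def normalise(text: str) -> str:
    """Straighten curly quotes and collapse all whitespace runs to one space."""
    text = text.replace("\u201c", '"').replace("\u201d", '"')
    return " ".join(text.split())


def extract_roles(text: str) -> Dict[str, str]:
    """Map 'Licensor'/'Licensee' to the short alias defined in the preamble."""
    flat = normalise(text)
    return {m["role"]: m["alias"] for m in ROLE_PATTERN.finditer(flat)}


def extract_date(text: str) -> Optional[date]:
    """Return the earliest-positioned date matching any known phrasing."""
    flat = normalise(text)
    matches = [m for p in DATE_PATTERNS for m in [p.search(flat)] if m]
    if not matches:
        return None
    first = min(matches, key=lambda m: m.start())
    return date(int(first["year"]), MONTHS.index(first["month"]) + 1, int(first["day"]))
```

Agreements that never label the parties as "Licensor" or "Licensee" return an empty mapping from `extract_roles`, which is one reason a regex pass alone is unlikely to replace NER here.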

In [6]:
file_path = '/home/kprasath/DS/NLPInternship/txt_data/ref_1_txt/8866.txt' 
text = read_txt(file_path)
print(text)

Top of Form







                             ANDA OWNERSHIP TRANSFER
                          AND PRODUCT LICENSE AGREEMENT

         This license agreement (the "Agreement")  dated as of May 17, 2006 (the
"Effective Date") is entered into by and between Nostrum  Pharmaceuticals,  Inc.
("Nostrum"), a Delaware corporation, and Synovics Laboratories,  Inc. ("Synovics
Labs"), a Nevada corporation.

WHEREAS,

         A. Nostrum has  developed,  and is in the process of  developing,  time
release and other technology for pharmaceutical products.

         B. Synovics Labs is researching, developing,  manufacturing,  marketing
and selling pharmaceutical products.

         C. Synovics Labs desires to license from Nostrum the exclusive right to
develop,  manufacture,  market and sell the Product (as hereinafter  defined) in
the United  States of  America,  and  Nostrum  desires to grant such  license to
Synovics Labs, all on the terms and conditions hereinafter set forth.

         NOW, 

In [8]:
file_path = '/home/kprasath/DS/NLPInternship/txt_data/ref_1_txt/9740.txt'
text = read_txt(file_path)
print(text)

Top of Form


  Type: EX-10.2 Description: No Description  





  
    
      Exhibit 10.4 Joint HIV Barrel Product Marketing Agreement






  
    

      

        

          


        

        

          


        

        
 

        

          
FINAL

          
 

          
Joint
            HIV Barrel Product Commercialization Agreement

          
 

          
PREAMBLE

          
 

          
This
            Joint HIV Barrel Product Commercialization Agreement (the “Agreement”)
            is
            made as of September 29, 2006 (“Effective
            Date”),
            by
            and between Chembio
            Diagnostic Systems, Inc.,
            a
            Delaware corporation having its principal place of business at 3661 Horseblock
            Road, Medford, NY 11763 (“Chembio”),
            and
StatSure
            Diagnostic Systems, Inc.,
            (f/k/a
            Saliva Diagnostic Systems) a Delaware corporation having its principal
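
The output above shows the kind of cleaning this file needs: a leading "Top of Form" line, long runs of whitespace-only lines, and sentences broken across many short indented lines. A minimal cleaning sketch, assuming blank lines mark paragraph boundaries and that "Top of Form" is always page residue (`clean_agreement_text` is an illustrative name):

```python
import re


def clean_agreement_text(raw: str) -> str:
    """Drop page residue and rejoin broken lines into paragraphs."""
    lines = [line.strip() for line in raw.splitlines()]
    # "Top of Form" appears at the start of each printed file above
    lines = [line for line in lines if line != "Top of Form"]

    paragraphs, current = [], []
    for line in lines:
        if line:
            current.append(line)
        elif current:
            # A blank line closes the paragraph collected so far
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))

    # Collapse any remaining internal whitespace runs to a single space
    return "\n\n".join(re.sub(r"\s+", " ", p) for p in paragraphs)
```

This also squeezes the double spaces found in the fixed-width files, at the cost of losing the centred layout of headings.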
  

In [9]:
file_path = '/home/kprasath/DS/NLPInternship/txt_data/ref_1_txt/10899.txt'
text = read_txt(file_path)
print(text)

Top of Form







                            ASSET PURCHASE AGREEMENT

         This Asset Purchase Agreement  ("Agreement"),  is entered into as of 19
June  2006  (the  "Effective   Date")  by  and  between  the  following  parties
(collectively, the "Parties"):

     Implantable  Vision,  Inc. a Utah  Corporation  with a principal  office at
25730 Lorain Rd., North Olmsted, OH 44070 ("IMPLANTABLE VISION"), and,

     CIBA  Vision AG, a Swiss  corporation  with  offices at  Hardhoftrasse  15,
CH-8424, Embrach, Switzerland ("CIBA")

         Whereas,  CIBA desires to convey to IMPLANTABLE  VISION and IMPLANTABLE
VISION desires to acquire from CIBA certain intangible assets relating to CIBAs
ophthalmic surgical products business;

         Now, therefore,  in consideration of the obligations undertaken by each
party and other good and  valuable  consideration,  and  intending to be legally
bound, the Parties hereby agree as follows:

1.       Definitions

         As used in this Agre

In [10]:
file_path = '/home/kprasath/DS/NLPInternship/txt_data/ref_1_txt/11206.txt'
text = read_txt(file_path)
print(text)

Top of Form


  Type: EX-10.15 Description: SETTLEMENT, LICENSE AND DEVELOPMENT AGREEMENT  








 

 
[ * ] = CERTAIN CONFIDENTIAL INFORMATION CONTAINED
IN THIS DOCUMENT, MARKED BY BRACKETS, HAS BEEN OMITTED AND FILED
SEPARATELY WITH THE SECURITIES AND EXCHANGE COMMISSION PURSUANT TO RULE 24B-2
OF THE SECURITIES EXCHANGE ACT OF 1934, AS AMENDED. 
 
Exhibit 10.15 
 
SETTLEMENT, LICENSE AND DEVELOPMENT AGREEMENT 
 
THIS SETTLEMENT, LICENSE AND DEVELOPMENT AGREEMENT (the
“Agreement”), is entered into as of March 5, 2007 (the “Execution Date”) by and between Tercica, Inc., a company incorporated under the laws of Delaware with offices at 2000 Sierra Point Parkway, Suite 400,
Brisbane, CA 94005, United States of America (“Tercica”), Insmed Incorporated, a company incorporated under the laws of Virginia with offices at 8720 Stony Point Parkway, Suite 200, Richmond, VA 23235, Insmed Therapeutic Proteins,
Inc., a company incorporated under the laws of Colorado with offices at 2590 Central A