This sheet answers the question: among the best models, how similar are the top X% of their results?  The best models are mpnet_base_v2, roberta, and scispacy, since these three models have the greatest z-scores relative to the noise.  See Noise_to_Related_Claims_Histogram_DrugLabelsandPatent.ipynb.

In [85]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)
zip_path='/content/drive/MyDrive/Colab Notebooks/zip/db2file.zip'
!cp "{zip_path}" .
!cp "/content/drive/MyDrive/Colab Notebooks/requirements.txt" .
!unzip -q db2file.zip
!rm db2file.zip
!rm -r en_core_sci_lg-0.4.0.zip
!cp "/content/drive/MyDrive/Colab Notebooks/zip/en_core_sci_lg-0.4.0.zip" .
!unzip -q en_core_sci_lg-0.4.0.zip
!rm en_core_sci_lg-0.4.0.zip

Mounted at /content/drive
replace db2file/10402/87E2DA8D-432C-4ED5-67A1-DC26294B2295/additions_with_context/2007-08-27? [y]es, [n]o, [A]ll, [N]one, [r]ename: A
rm: cannot remove 'en_core_sci_lg-0.4.0.zip': No such file or directory
replace en_core_sci_lg-0.4.0/MANIFEST.in? [y]es, [n]o, [A]ll, [N]one, [r]ename: A


In [86]:
!pip install -r '/content/requirements.txt'



In [87]:
import random
import os
random.seed(30)
db2files = "/content/db2file/"
NDA_list=[f for f in os.listdir(db2files)]

In [88]:
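def get_lines_in_file(file_name):
  """Return the non-blank lines of a file, decoded with 'unicode_escape'."""
  if not os.path.exists(file_name):
    return []
  with open(file_name, "rb") as f:
    # decode each line once and skip lines that are empty after stripping
    decoded_lines = (line.decode('unicode_escape') for line in f)
    return [line for line in decoded_lines if line.strip()]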

In [89]:
def flat_list(lst):
  return [item for sublist in lst for item in sublist]

def get_additions(NDA, additions_folder_name):
  """
  Return all additions as a list for the set-id with the most additions for an NDA,
  excluding the first additions file (in sorted order).
  Parameters:
      NDA (string): NDA dir
      additions_folder_name (string): either 'just_additions' or 'additions_with_context'
  """
  if additions_folder_name not in ['just_additions', 'additions_with_context']:
    print(f"Parameter {additions_folder_name} not in ['just_additions', 'additions_with_context']")
    return []
  NDA_dir=db2files+str(NDA)+'/'
  set_id_dirs=[f for f in os.listdir(NDA_dir)]
  try:
    set_id_dirs.remove('patents')
  except ValueError:
    pass
  additions_list=[]
  for set_id_dir in set_id_dirs:
    additions_dir=NDA_dir+set_id_dir+'/'+additions_folder_name+'/'
    if os.path.exists(additions_dir):
      additions_files=sorted([additions_dir+f for f in os.listdir(additions_dir)])[1:]
      additions_list_tmp=flat_list([get_lines_in_file(file) for file in additions_files])
      if len(additions_list_tmp)> len(additions_list):
        additions_list=additions_list_tmp
  return additions_list

def get_patent_claims(NDA, patents_folder_name):
  """Return a list of patent claims for an NDA
  Parameters:
      NDA (string): NDA dir
      patents_folder_name (string): either 'patents' or 'patents_longhand'
  """
  patent_dir=db2files+str(NDA)+'/'+patents_folder_name+'/'
  if os.path.exists(patent_dir):
    patent_files=[patent_dir+f for f in os.listdir(patent_dir)]
    return flat_list([get_lines_in_file(file) for file in patent_files])
  return []
    

In [90]:
random_NDA_list=random.sample(NDA_list, int(len(NDA_list)*.33))
print(len(NDA_list), len(random_NDA_list), random_NDA_list[1])

1606 529 202895-21976


In [91]:
# narrow the random 1/3 sample to NDAs that have both patents and additions.  If either is missing, we cannot check the quality of additions against related patents.
random_NDA_list=[x for x in random_NDA_list if get_patent_claims(x, 'patents') and get_additions(x, 'additions_with_context')]

In [92]:
print(len(random_NDA_list))

292


In [100]:
def return_match(NDA, additions_folder_name, patent_folder_name, scoring_method_list, cutoff_percentage=.1):
  """
  This method returns {("method_name", NDA, len(additions), len(claims)):[[claim_num, claim_num],],}
  Parameters:
    NDA (string): NDA number
    additions_folder_name (string): either 'just_additions' or 'additions_with_context'
    patent_folder_name (string): either 'patents' or 'patents_longhand'
    scoring_method_list (list): list of [[function that is used to score similarity, optional_scoring_method_field, optional_scoring_method_field_name],..]
    cutoff_percentage (float): fraction of top-scoring claims to keep for each addition, e.g. .1 for the top 10%
  """

  # scoring_method_result_dict={ "scoring_method.__name__": {"additions_len": X, "claims_len": X, "matches":[(addition, claim), }}
  scoring_method_result_dict={}

  for i in range(len(scoring_method_list)):
    scoring_method=scoring_method_list[i][0]
    optional_scoring_method_field=scoring_method_list[i][1]
    optional_scoring_method_field_name=scoring_method_list[i][2]

    claims = get_patent_claims(NDA, patent_folder_name)
    additions = get_additions(NDA, additions_folder_name)

    matrix=scoring_method(additions, claims, optional_scoring_method_field)
    # get top 10 percent of each row/addition of the matrix.
    top_10 = top_10_percent_matrix(matrix, cutoff_percentage)
    method_name = optional_scoring_method_field_name if optional_scoring_method_field else scoring_method.__name__
    key = (method_name, NDA, len(additions), len(claims))
    scoring_method_result_dict[key]=top_10
  return scoring_method_result_dict



In [101]:
import math
def top_10_percent_matrix(matrix, cutoff_percentage):
  """ Return a matrix, i.e. [[claim_num_index, claim_num_index],], of the top-scoring claim indexes for each addition.
        The rows represent additions.

  Parameters:
    matrix (list of list): scores of all additions against all claims; rows represent additions and columns represent claims
    cutoff_percentage (float): fraction of claims to keep per row, e.g. .1 for the top 10%
  """
  return_matrix_of_index=[]
  for row in matrix:
    length_10percent=math.ceil(len(row)*cutoff_percentage)
    # get indices of the highest scores in the row, best first
    indexes=sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:length_10percent]
    return_matrix_of_index.append(indexes)
  return return_matrix_of_index
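
As a quick check of the selection rule above, the following sketch applies the same `math.ceil` cutoff and descending sort to one made-up row of scores. The helper name `top_fraction_indices` and the score values are illustrative assumptions, not part of the notebook.

```python
import math

def top_fraction_indices(row, cutoff):
    """Return indices of the highest scores, keeping ceil(len(row) * cutoff) of them."""
    keep = math.ceil(len(row) * cutoff)
    return sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:keep]

# One addition scored against five claims (made-up values).
scores = [0.12, 0.91, 0.40, 0.77, 0.05]

top_20 = top_fraction_indices(scores, 0.2)  # keeps ceil(5 * 0.2) = 1 index
top_50 = top_fraction_indices(scores, 0.5)  # keeps ceil(5 * 0.5) = 3 indices
print(top_20, top_50)
```

Because of the ceiling, every addition keeps at least one claim, even when the patent has fewer claims than 1 / cutoff.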


In [102]:
from sentence_transformers import SentenceTransformer, util

def scoring_method_bert(additions, claims, model):
  """ Returns a list of [[cosine_score,cosine_score,],], where rows indicate addition, columns are claims.
  """
  # Compute embedding for both lists
  additions_embeddings = model.encode(
      additions,
      convert_to_tensor=True,
  )
  claims_embeddings = model.encode(
      claims,
      convert_to_tensor=True,
  )
  # Compute cosine similarity of every addition to every claim
  cosine_scores = util.pytorch_cos_sim(
      additions_embeddings, claims_embeddings
  ).tolist()
  return cosine_scores
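
For readers unfamiliar with the score, `util.pytorch_cos_sim` returns the cosine similarity between every pair of embeddings. The sketch below reproduces that layout in plain Python on tiny made-up vectors; `cosine_similarity`, `cosine_matrix`, and the vectors are hypothetical stand-ins for the real model embeddings.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_matrix(rows, cols):
    """Rows correspond to additions, columns to claims."""
    return [[cosine_similarity(r, c) for c in cols] for r in rows]

# Made-up 2-dimensional "embeddings": two additions, three claims.
addition_vecs = [[1.0, 0.0], [1.0, 1.0]]
claim_vecs = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]

matrix = cosine_matrix(addition_vecs, claim_vecs)
for row in matrix:
    print([round(score, 3) for score in row])
```

The score depends only on direction, not length, which is why `[1.0, 0.0]` and `[2.0, 0.0]` score as identical.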


In [96]:
import spacy
gpu = spacy.prefer_gpu()
print('GPU:', gpu)
!pip install -U spacy[cuda101]
_N_PROCESS = 1
_en_core_sci_lg_nlp = spacy.load("/content/en_core_sci_lg-0.4.0/en_core_sci_lg/en_core_sci_lg-0.4.0")
!python -m spacy download en_core_web_trf
nlp_en = spacy.load('en_core_web_trf')

GPU: True
Requirement already up-to-date: spacy[cuda101] in /usr/local/lib/python3.7/dist-packages (3.0.6)
2021-05-06 02:04:12.501568: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_trf')


In [103]:

def preprocess_with_spacy_nlp(text_list, steps, nlp=nlp_en):
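    """
    Optionally remove punctuation, lemmatize, and remove stop words.
    Parameters:
        text_list (list): list of strings
        steps (list): any of ["punct", "lemma", "stopwords"]
        nlp: spaCy pipeline used for the preprocessing
    """
    # start from text_list; it is returned unchanged if no step applies
    return_list = text_list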
    if any(item in ["punct", "lemma", "stopwords"] for item in steps):
        # 'lemmatizer' required 'tagger' and 'attribute_ruler'
        nlp_list = list(
            nlp.pipe(
                return_list,
                disable=["tok2vec", "ner"],
                n_process=_N_PROCESS,
            )
        )
        return_list = [
            " ".join(
                [
                    token.lemma_ if "lemma" in steps else token.text
                    for token in doc
                    if (
                        (
                            ("punct" in steps and not token.is_punct)
                            or "punct" not in steps
                        )
                        and (
                            ("stopwords" in steps and not token.is_stop)
                            or "stopwords" not in steps
                        )
                    )
                ]
            )
            for doc in nlp_list
        ]
    return return_list


def similarity_matrix(embed_A_list, embed_B_list):
    """
    This method returns a matrix such as:
        [[X, X, X],
        [X, X, X]]
    wherein each row represents the similarity measurement between an embedding
    from embed_A_list to each of the embeddings in embed_B_list.

    Parameters:
        embed_A_list (list): list of NLP object generated by spaCy
        embed_B_list (list): list of NLP object generated by spaCy to be
                             compared to embed_A
    """
    matrix = [[0] * len(embed_B_list) for y in range(len(embed_A_list))]
    for i in range(len(embed_A_list)):
        for j in range(len(embed_B_list)):
            matrix[i][j] = embed_A_list[i].similarity(embed_B_list[j])
    return matrix


def scoring_method_spacy(additions, claims, nlp=nlp_en):
  """ Scores with spaCy; returns a matrix of similarity scores
  """
  additions=preprocess_with_spacy_nlp(additions, ["punct", "lemma", "stopwords"], nlp)
  claims=preprocess_with_spacy_nlp(claims, ["punct", "lemma", "stopwords"], nlp)
  # Compute embedding for both lists
  # tokenization only requires tok2vec
  disabled_list = ["tagger", "attribute_ruler", "lemmatizer", "parser", "ner"]
  additions_embeddings = list(
      nlp.pipe(
          additions, disable=disabled_list, n_process=_N_PROCESS
      )
  )
  claims_embeddings = list(
      nlp.pipe(
          claims, disable=disabled_list, n_process=_N_PROCESS
      )
  )
  # Compute cosine similarity of every addition to every claim
  return similarity_matrix(additions_embeddings, claims_embeddings)


In [104]:
_device = None

model_mpnet_base_v2 = SentenceTransformer("stsb-mpnet-base-v2", device=_device)
model_mpnet_base_v2.zero_grad()
# Will limit size since CUDA runs out of memory
model_mpnet_base_v2.max_seq_length=512

model_roberta_base_v2 = SentenceTransformer("stsb-roberta-base-v2", device=_device)
model_roberta_base_v2.zero_grad()
# Will limit size since CUDA runs out of memory
model_roberta_base_v2.max_seq_length=512



In [105]:

scoring_method_list=[
                     [scoring_method_bert, model_mpnet_base_v2, "model_mpnet_base_v2"],
                     [scoring_method_bert,model_roberta_base_v2, "model_roberta_base_v2"],
                     [scoring_method_spacy,nlp_en, "en_core_web_trf"],
                     [scoring_method_spacy,_en_core_sci_lg_nlp, "_en_core_sci_lg_nlp"],]

NDA=random_NDA_list[3]

# return_match_result={("method_name", NDA, len(additions), len(claims)):[[claim_num, claim_num],],}, 
# where the value are (row) additions and (col) index/specific line in the patent file when all files are joined together
return_match_result=return_match(NDA, "additions_with_context", "patents", scoring_method_list, cutoff_percentage=.1)
print(return_match_result)




{('model_mpnet_base_v2', '208051', 8, 176): [[101, 112, 109, 164, 22, 142, 39, 51, 0, 114, 118, 110, 103, 175, 113, 115, 111, 159], [92, 78, 68, 65, 154, 79, 97, 64, 34, 91, 60, 70, 49, 81, 61, 87, 82, 50], [101, 112, 109, 142, 22, 164, 39, 51, 0, 114, 118, 115, 110, 175, 111, 103, 113, 159], [131, 91, 61, 65, 96, 86, 76, 31, 81, 23, 62, 119, 78, 35, 27, 82, 60, 85], [128, 129, 170, 130, 42, 138, 168, 167, 86, 87, 139, 26, 165, 140, 23, 35, 27, 55], [101, 109, 112, 22, 142, 39, 0, 164, 51, 110, 111, 114, 115, 175, 118, 75, 159, 73], [142, 0, 22, 109, 101, 39, 112, 164, 51, 175, 111, 114, 2, 144, 159, 75, 162, 143], [0, 142, 101, 112, 39, 109, 51, 22, 164, 175, 111, 114, 143, 75, 110, 35, 96, 159]], ('model_roberta_base_v2', '208051', 8, 176): [[101, 109, 112, 142, 118, 115, 0, 103, 22, 39, 51, 164, 75, 175, 117, 143, 59, 1], [87, 97, 92, 54, 82, 62, 84, 94, 52, 51, 99, 30, 55, 38, 59, 40, 53, 89], [101, 109, 142, 112, 118, 115, 0, 22, 103, 39, 164, 117, 51, 75, 143, 175, 116, 1], [119,

In [106]:
# Remove en_core_web_trf, since results indicate that there are no vectors
scoring_method_list=[
                     [scoring_method_bert, model_mpnet_base_v2, "model_mpnet_base_v2"],
                     [scoring_method_bert, model_roberta_base_v2, "model_roberta_base_v2"],
                     [scoring_method_spacy, _en_core_sci_lg_nlp, "_en_core_sci_lg_nlp"],]

multi_return_match_result=[]
for i in range(len(random_NDA_list)):
  print(f"processing {str(i)} of {str(len(random_NDA_list))}")
  NDA=random_NDA_list[i]
  multi_return_match_result.append(return_match(NDA, "additions_with_context", "patents", scoring_method_list, cutoff_percentage=.1))

processing 0 of 292
processing 1 of 292
processing 2 of 292
processing 3 of 292
processing 4 of 292
processing 5 of 292
processing 6 of 292
processing 7 of 292
processing 8 of 292
processing 9 of 292
processing 10 of 292
processing 11 of 292
processing 12 of 292
processing 13 of 292
processing 14 of 292
processing 15 of 292
processing 16 of 292
processing 17 of 292
processing 18 of 292
processing 19 of 292
processing 20 of 292
processing 21 of 292
processing 22 of 292
processing 23 of 292
processing 24 of 292
processing 25 of 292
processing 26 of 292
processing 27 of 292
processing 28 of 292
processing 29 of 292
processing 30 of 292
processing 31 of 292
processing 32 of 292
processing 33 of 292
processing 34 of 292
processing 35 of 292
processing 36 of 292
processing 37 of 292
processing 38 of 292
processing 39 of 292
processing 40 of 292
processing 41 of 292
processing 42 of 292
processing 43 of 292
processing 44 of 292
processing 45 of 292
processing 46 of 292
processing 47 of 292
pr

In [107]:
# print(multi_return_match_result[0])
# import json
# json=json.dumps(multi_return_match_result)
# f=open("multi_match_result.json", "w")
# f.write(json)
# f.close()

{('model_mpnet_base_v2', '202895-21976', 52, 121): [[58, 31, 71, 63, 59, 56, 32, 50, 52, 54, 55, 15, 67], [58, 31, 63, 53, 56, 71, 55, 32, 51, 59, 50, 54, 60], [63, 58, 53, 51, 0, 31, 56, 54, 15, 33, 60, 71, 52], [58, 53, 51, 31, 56, 54, 71, 52, 59, 55, 32, 63, 50], [84, 2, 119, 75, 105, 114, 104, 117, 116, 3, 115, 15, 46], [84, 2, 114, 104, 111, 119, 102, 116, 107, 110, 117, 81, 115], [58, 31, 71, 63, 53, 32, 67, 15, 59, 51, 60, 0, 56], [30, 28, 25, 95, 58, 31, 27, 0, 63, 53, 26, 91, 108], [74, 72, 63, 20, 104, 102, 33, 15, 92, 60, 58, 51, 57], [49, 104, 5, 102, 111, 92, 74, 84, 116, 43, 72, 80, 34], [117, 104, 116, 49, 119, 111, 74, 102, 39, 92, 15, 114, 33], [104, 102, 15, 33, 119, 4, 74, 114, 31, 60, 63, 0, 58], [104, 102, 107, 58, 114, 15, 74, 110, 33, 31, 119, 113, 120], [20, 57, 60, 19, 92, 91, 39, 75, 105, 58, 79, 109, 50], [74, 72, 104, 102, 15, 33, 0, 107, 120, 90, 63, 103, 20], [104, 102, 58, 15, 110, 112, 74, 72, 33, 31, 107, 63, 87], [58, 31, 91, 92, 63, 51, 60, 90, 87, 53

TypeError: ignored
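
The `TypeError` above is most likely raised because the result dictionaries use tuples as keys, which `json.dumps` does not accept. Below is a minimal sketch of one workaround, assuming it is acceptable to join each key tuple into a single string. The helper `to_json_safe` and the `|` separator are hypothetical choices, and the sample is a shortened stand-in with the same shape as the results above.

```python
import json

def to_json_safe(result_list):
    """Replace each tuple key with a '|'-joined string so json can encode it."""
    return [
        {"|".join(str(part) for part in key): value for key, value in result.items()}
        for result in result_list
    ]

# Shortened sample in the shape {(method_name, NDA, len(additions), len(claims)): matches}.
sample = [{("model_mpnet_base_v2", "208051", 8, 176): [[101, 112], [92, 78]]}]

try:
    json.dumps(sample)
    raised = False
except TypeError:
    raised = True  # tuple keys are rejected by json.dumps

encoded = json.dumps(to_json_safe(sample))
print(raised, encoded)
```

Note also that the commented-out cell assigns the dumped string to a variable named `json`, which would shadow the module for any later call in the same session.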

In [108]:
def mean(lst):
  return sum(lst)/len(lst)

def return_mean_percent_same_for_each_nda(multi_return_match_result):
  # calculate percentage that match
  # percent_same_for_each_nda holds, for each NDA, the mean fraction of top claims that are the same across models over its additions
  percent_same_for_each_nda=[]
  for NDA in multi_return_match_result:
    # matches = [[[claim_index1,claim_index2,],],] wherein the outermost list represents list for each model, the second outermost list represented the additions
    matches=[]
    model_names=[]
    # breakup dict into lists; the format of the dict is {(method_name, NDA, len(additions), len(claims)):matches, ...}
    for key, value in NDA.items():
      model_names.append(key[0])
      matches.append(value)
    if not matches or not matches[0] or not matches[0][0]:
      continue

    models_len=len(matches)
    additions_len=len(matches[0])
    claims_len=len(matches[0][0])

    # additions_percent_same holds, for each addition in an NDA, the fraction of its top claim indexes that are the same across all models
    additions_percent_same=[]

    for a in range(additions_len):
      # addition_matches=[[claim_index1,claim_index2,],] wherein each row represents a different model
      addition_matches=[]
      for m in range(models_len):
        addition_matches.append(matches[m][a])
      # count number of claim_index that is the same across multiple models for same addition
      same_claims=addition_matches[0]
      same_claims_orig_len=len(same_claims)
      assert same_claims_orig_len==claims_len
      for i in range(1,len(addition_matches)):
        same_claims=list(set(same_claims).intersection(addition_matches[i]))
      percent_same=len(same_claims)/same_claims_orig_len
      additions_percent_same.append(percent_same)
    
    percent_same_for_each_nda.append(mean(additions_percent_same))
  return mean(percent_same_for_each_nda)

print("Mean percent_same_for_each_nda: ", return_mean_percent_same_for_each_nda(multi_return_match_result))
  

Mean percent_same_for_each_nda:  0.3022827958802241


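The above indicates that when three models are used, on average about 30% of the top 10% of claims for an addition are selected by all three models.  The following two cells repeat the comparison with two models: mpnet_base_v2 and scispacy share about 42% of their top claims, and mpnet_base_v2 and roberta share about 55%.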
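
To make the shared-fraction measure concrete, here is a small sketch of the per-addition calculation on made-up claim indices. The function name `shared_fraction` and the index lists are hypothetical; the logic mirrors the set intersection used in `return_mean_percent_same_for_each_nda`.

```python
def shared_fraction(top_lists):
    """Fraction of the first model's top claims that every other model also selected."""
    shared = set(top_lists[0])
    for other in top_lists[1:]:
        shared &= set(other)
    return len(shared) / len(top_lists[0])

# Hypothetical top-4 claim indices for one addition, from three models.
model_a = [3, 7, 9, 12]
model_b = [7, 3, 20, 12]
model_c = [12, 5, 7, 1]

three_way = shared_fraction([model_a, model_b, model_c])  # only 7 and 12 are common
two_way = shared_fraction([model_a, model_b])             # 3, 7 and 12 are common
one_way = shared_fraction([model_a])                      # a model always agrees with itself
print(three_way, two_way, one_way)
```

Adding a model can only shrink the intersection, so a three-model score is never higher than the score of any pair drawn from the same three models.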

In [109]:
scoring_method_list=[
                     [scoring_method_bert, model_mpnet_base_v2, "model_mpnet_base_v2"],
                     [scoring_method_spacy,_en_core_sci_lg_nlp, "_en_core_sci_lg_nlp"],
                     ]

multi_return_match_result=[]
for i in range(len(random_NDA_list)):
  # print(f"processing {str(i)} of {str(len(random_NDA_list))}")
  NDA=random_NDA_list[i]
  multi_return_match_result.append(return_match(NDA, "additions_with_context", "patents", scoring_method_list, cutoff_percentage=.1))

print("Mean percent_same_for_each_nda: ", return_mean_percent_same_for_each_nda(multi_return_match_result))

Mean percent_same_for_each_nda:  0.4189151577329994


In [110]:
scoring_method_list=[
                     [scoring_method_bert, model_mpnet_base_v2, "model_mpnet_base_v2"],
                     [scoring_method_bert,model_roberta_base_v2, "model_roberta_base_v2"],
                     ]

multi_return_match_result=[]
for i in range(len(random_NDA_list)):
  # print(f"processing {str(i)} of {str(len(random_NDA_list))}")
  NDA=random_NDA_list[i]
  multi_return_match_result.append(return_match(NDA, "additions_with_context", "patents", scoring_method_list, cutoff_percentage=.1))

print("Mean percent_same_for_each_nda: ", return_mean_percent_same_for_each_nda(multi_return_match_result))

Mean percent_same_for_each_nda:  0.5502248546898962


If a single model is used, the following should report 1, since a model's matches are always identical to themselves.

In [111]:
scoring_method_list=[
                     [scoring_method_bert, model_mpnet_base_v2, "model_mpnet_base_v2"],
                     ]

multi_return_match_result=[]
for i in range(len(random_NDA_list)):
  # print(f"processing {str(i)} of {str(len(random_NDA_list))}")
  NDA=random_NDA_list[i]
  multi_return_match_result.append(return_match(NDA, "additions_with_context", "patents", scoring_method_list, cutoff_percentage=.1))

print("Mean percent_same_for_each_nda: ", return_mean_percent_same_for_each_nda(multi_return_match_result))

Mean percent_same_for_each_nda:  1.0


If the cutoff percentage increases, the fraction of claims shared across the models should increase.  The following two cells raise the cutoff to 20% and 30% for the same three models.
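
One way to see why the shared fraction tends to grow with the cutoff: at a cutoff of 1.0 every model keeps every claim, so the overlap is trivially 1. The sketch below sweeps the cutoff over two made-up score rows; the helper names and scores are assumptions for illustration, and the increase is not guaranteed to be monotonic at every step on real data.

```python
import math

def top_indices(row, cutoff):
    """Set of indices of the top ceil(len(row) * cutoff) scores in a row."""
    keep = math.ceil(len(row) * cutoff)
    return set(sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:keep])

def overlap_at(row_a, row_b, cutoff):
    """Fraction of model A's top claims that model B also keeps at this cutoff."""
    top_a = top_indices(row_a, cutoff)
    top_b = top_indices(row_b, cutoff)
    return len(top_a & top_b) / len(top_a)

# Made-up scores from two models for one addition against eight claims.
scores_a = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
scores_b = [0.2, 0.9, 0.8, 0.7, 0.3, 0.4, 0.6, 0.1]

sweep = {cutoff: overlap_at(scores_a, scores_b, cutoff) for cutoff in (0.125, 0.25, 0.5, 1.0)}
print(sweep)
```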

In [112]:
scoring_method_list=[
                      [scoring_method_bert, model_mpnet_base_v2, "model_mpnet_base_v2"],
                     [scoring_method_bert,model_roberta_base_v2, "model_roberta_base_v2"],
                     [scoring_method_spacy,_en_core_sci_lg_nlp, "_en_core_sci_lg_nlp"]
                     ]

multi_return_match_result=[]
for i in range(len(random_NDA_list)):
  # print(f"processing {str(i)} of {str(len(random_NDA_list))}")
  NDA=random_NDA_list[i]
  multi_return_match_result.append(return_match(NDA, "additions_with_context", "patents", scoring_method_list, cutoff_percentage=.2))

print("Mean percent_same_for_each_nda: ", return_mean_percent_same_for_each_nda(multi_return_match_result))

Mean percent_same_for_each_nda:  0.38887973947123505


In [113]:
scoring_method_list=[
                      [scoring_method_bert, model_mpnet_base_v2, "model_mpnet_base_v2"],
                     [scoring_method_bert,model_roberta_base_v2, "model_roberta_base_v2"],
                     [scoring_method_spacy,_en_core_sci_lg_nlp, "_en_core_sci_lg_nlp"]
                     ]

multi_return_match_result=[]
for i in range(len(random_NDA_list)):
  # print(f"processing {str(i)} of {str(len(random_NDA_list))}")
  NDA=random_NDA_list[i]
  multi_return_match_result.append(return_match(NDA, "additions_with_context", "patents", scoring_method_list, cutoff_percentage=.3))

print("Mean percent_same_for_each_nda: ", return_mean_percent_same_for_each_nda(multi_return_match_result))

Mean percent_same_for_each_nda:  0.45128598488193755


In [152]:
scoring_method_list=[
                      [scoring_method_bert, model_mpnet_base_v2, "model_mpnet_base_v2"],
                     [scoring_method_bert,model_roberta_base_v2, "model_roberta_base_v2"],
                     [scoring_method_spacy,_en_core_sci_lg_nlp, "_en_core_sci_lg_nlp"]
                     ]

multi_return_match_result=[]
for i in range(len(random_NDA_list)):
  # print(f"processing {str(i)} of {str(len(random_NDA_list))}")
  NDA=random_NDA_list[i]
  multi_return_match_result.append(return_match(NDA, "additions_with_context", "patents", scoring_method_list, cutoff_percentage=.1))

In [167]:
def print_top_percentage_multi(multi_return_match_result, NDA_, additions_folder_name, patent_folder_name):
  """
  Prints, for each addition of one NDA, the claims that all models selected
  Parameters:
    multi_return_match_result=[{ ("method_name", NDA, len(additions), len(claims)): [[claim_num, claim_num],], },] for a list of NDA
    NDA_ (string): NDA whose matches are printed
    additions_folder_name (string): either 'just_additions' or 'additions_with_context'
    patent_folder_name (string): either 'patents' or 'patents_longhand'
  """
  # find the result entry for NDA_ and print the matches shared by all models
  for NDA_set in multi_return_match_result:
    # print(str(NDA)[:50])
    key_=list(NDA_set.keys())[0]
    value_ = NDA_set.values()
    if key_[1]!=NDA_:
      continue
    # use the NDA_ parameter rather than the global NDA
    claims = get_patent_claims(NDA_, patent_folder_name)
    additions = get_additions(NDA_, additions_folder_name)

    # matches = [[[claim_index1,claim_index2,],],] wherein the outermost list represents list for each model, the second outermost list represented the additions
    matches=[]

    # breakup dict into lists; the format of the dict is {(method_name, NDA, len(additions), len(claims)):matches, ...}
    for key, value in NDA_set.items():
      matches.append(value)
    if not matches or not matches[0] or not matches[0][0]:
      continue

    models_len=len(matches)
    additions_len=len(matches[0])
    claims_len=len(matches[0][0])

    print(f"==NDA: {NDA_}")
    patent_dir=db2files+str(NDA_)+'/'+patent_folder_name+'/'
    # list patent file names; stays empty if the patent folder is missing
    patent_files=[]
    if os.path.exists(patent_dir):
      patent_files=os.listdir(patent_dir)
    print(f"==patents: {patent_files}")

    for a in range(additions_len):

      # addition_matches=[[claim_index1,claim_index2,],] wherein each row represents a different model
      addition_matches=[]
      for m in range(models_len):
        addition_matches.append(matches[m][a])
      # count number of claim_index that is the same across multiple models for same addition
      same_claims=addition_matches[0]
      same_claims_orig_len=len(same_claims)
      assert same_claims_orig_len==claims_len
      for i in range(1,len(addition_matches)):
        same_claims=list(set(same_claims).intersection(addition_matches[i]))

      if same_claims:
        print(f"==addition: {additions[a]}")
        print(f'==same_claims_list: {same_claims}')
        
        for c in same_claims:
          print(f"********claim: \n{claims[c]}")
  
    
    break
    
  

In [168]:
# Change i to see the addition-to-claim matches for another NDA
i=0
NDA=random_NDA_list[i]

print_top_percentage_multi(multi_return_match_result, NDA, "additions_with_context", "patents")

==NDA: 202895-21976
==patents: ['8597876', '9889115', '7470506', '7700645', '6987102', '8518987']
==addition: PREZISTA®, co-administered with ritonavir (PREZISTA/ritonavir), and with other antiretroviral agents, is indicated for the treatment of human immunodeficiency virus (HIV-1) infection.

==same_claims_list: [32, 71, 50, 55, 58, 31]
********claim: 
33. The method of claim 32, wherein the at least one other antiviral agent is ritonavir.

********claim: 
9. The method of claim 1, wherein the at least one antiviral agent is ritonavir.

********claim: 
51. The method of claim 50, wherein the at least one other antiviral agent is ritonavir.

********claim: 
56. The method of claim 15, which comprises further administration of at least one other antiviral agent selected from the group consisting of ritonavir, indinavir, amprenavir, and saquinavir.

********claim: 
2. The method of claim 1, which comprises further administration of at least one other antiviral agent selected from the gro

In [171]:
i=1
NDA=random_NDA_list[i]

print_top_percentage_multi(multi_return_match_result, NDA, "additions_with_context", "patents")

==NDA: 22425
==patents: ['8318800', '5223510', '9107900', '8410167', '7323493', '8602215']
==addition: MULTAQ® is indicated to reduce the risk of hospitalization for atrial fibrillation in patients in sinus rhythm with a history of paroxysmal or persistent atrial fibrillation (AF) [see Clinical Studies (14)].

==same_claims_list: [64, 65, 96, 69, 44, 15, 49, 52, 58, 59]
********claim: 
7. A method of treating a patient with a recent history of or current atrial fibrillation or flutter, said method comprising administrating to said patient a therapeutically effective amount of dronedarone, or a pharmaceutically acceptable salt thereof, twice a day with a morning and an evening meal, wherein said patient does not have severe heart failure, (i) wherein severe heart failure is indicated by: a) NYHA Class IV heart failure or b) hospitalization for heart failure within the last month, and (ii) wherein said atrial fibrillation or flutter is non-permanent and is paroxysmal or persistent; and (

In [172]:
i=2
NDA=random_NDA_list[i]

print_top_percentage_multi(multi_return_match_result, NDA, "additions_with_context", "patents")

==NDA: 18602
==patents: ['3562257']
==addition: Also contains: colloidal silicon dioxide, D&C Yellow #10 Aluminum Lake, FD&C Blue #1 Aluminum Lake (30 mg and 90 mg), FD&C Yellow #6 Aluminum Lake (60 mg and 120 mg), hydroxypropyl cellulose, hypromellose, lactose, magnesium stearate, methylparaben, microcrystalline cellulose, and polyethylene glycol.

==same_claims_list: [1]
********claim: 
2. A process for obtaining an optically active salt of a cis-3-hydroxy-1,5-benzothiazepine compound of formula (I) from a racemic salt of the 1,5-benzothiazepine compound of formula (I): ##STR9## wherein each of Ring A and Ring B is a substituted or unsubstituted benzene ring, and R.sup.1 and R.sup.2 are the same or different and each is a lower alkyl group, the process comprising: dissolving a 1,5-benzothiazepine compound and 1-naphthalenesulfonic acid in an appropriate solvent under heating; cooling the solution; subjecting the resulting solution to preferential crystallization by obtaining a supers

In [173]:
i=3
NDA=random_NDA_list[i]

print_top_percentage_multi(multi_return_match_result, NDA, "additions_with_context", "patents")

==NDA: 208051
==patents: ['8669273', '9630946', '10035788', '8518446', '9139558', '9211291', '8790708', '9265784', '7399865', '7982043']
==addition: NERLYNX is indicated for the extended adjuvant treatment of adult patients with early stage HER2-overexpressed/amplified breast cancer, to follow adjuvant trastuzumab based therapy [
  see Clinical Studies (
   14)
  ].

==same_claims_list: [0, 101, 103, 109, 142, 112, 115, 118, 22]
********claim: 
1. A method for treating an ErbB-2 positive metastatic breast cancer in a subject; comprising administering to the subject neratinib and capecitabine; wherein neratinib and capecitabine act synergistically.

********claim: 
1. A regimen for treatment of early stage HER-2 positive breast cancer comprising delivering neratinib therapy to early stage HER-2 positive breast cancer patients at the completion of at least twelve cycles of trastuzumab adjuvant therapy; wherein the neratinib therapy is started about 2 weeks to about one year from the comp

The following is the result of a bad data grab: all claims are grouped together as one unit in the database.  The data grab is part of an unrelated module.
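
If the grouped claims ever need to be separated after the fact, one rough option is to split the text at lines that begin with a claim number. This is only a sketch under the assumption that each claim starts on its own line as `<number>. `; the helper `split_claims` and the sample text are hypothetical and are not part of the data-grab module.

```python
import re

def split_claims(blob):
    """Split a text blob at line starts that look like '<number>. '."""
    parts = re.split(r"^(?=\d+\.\s)", blob, flags=re.MULTILINE)
    return [part.strip() for part in parts if part.strip()]

# Made-up blob: claim 1 spans two lines, claim 2 is a single line.
blob = (
    "1. A method of treating a disease comprising administering compound X\n"
    "in which R1 is hydrogen.\n"
    "2. The method of claim 1, wherein the dose is 10 mg.\n"
)

claims = split_claims(blob)
for claim in claims:
    print(repr(claim))
```

Continuation lines stay attached to the claim they follow, but any body line that happens to start with a number and a period would be split off incorrectly, so the result should be spot-checked.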

In [174]:
i=4
NDA=random_NDA_list[i]

print_top_percentage_multi(multi_return_match_result, NDA, "additions_with_context", "patents")

==NDA: 21822-21882
==patents: ['6596750']
==addition:   In these patients, Exjade has been shown to reduce liver iron concentration and serum ferritin levels.  Clinical trials to demonstrate increased survival or to confirm clinical benefit have not been completed [see Clinical Studies (14)]. 
The safety and efficacy of Exjade when administered with other iron chelation therapy have not been established.

==same_claims_list: [0]
********claim: 
1. A method of treating diseases which cause an excess of metal in a human or animal body or are caused by an excess of metal in a human or animal body comprising administering to a subject in need of such treatment a therapeutically effective amount of a compound of formula I 








in which
R1 and R5 simultaneously or independently of one another are hydrogen, halogen, hydroxyl, lower alkyl, halo-lower alkyl, lower alkoxy, halo-lower alkoxy, carboxyl, carbamoyl, N-lower alkylcarbamoyl, N,N-di-lower alkylcarbamoyl or nitrile; 
R2 and R4 simul

In [175]:
i=5
NDA=random_NDA_list[i]

print_top_percentage_multi(multi_return_match_result, NDA, "additions_with_context", "patents")

==NDA: 22517
==patents: ['7560429', '7947654', '9504647', '9974826', '10137167', '9919025', '10307459', '8802624', '9220747']
==addition: It is chemically defined as follows:
Molecular weight of 1183.34 with the following empirical formula:
   C
  46H
  64N
  14O
  12S
  2∙C
  2H
  4O
  2∙3H
  2O
1-(3-mercaptopropionic acid)-8-D-arginine vasopressin monoacetate (salt) trihydrate.

==same_claims_list: [108, 76, 14, 53, 62]
********claim: 
1. An orodispersible pharmaceutical dosage form of desmopressin comprising:
desmopressin in a form selected from one or more of the free base of desmopressin and a pharmaceutically acceptable salt thereof, in an amount measured as the free base, selected from 25 μg and 50 μg, and
one or more carriers, wherein at least one carrier is gelatin in an open matrix network structure,
wherein the dosage form exhibits a mean elimination half-life of desmopressin after administration of about 2.8 to 3 hours after the maximum plasma concentration is reached.

***

In [176]:
i=6
NDA=random_NDA_list[i]

print_top_percentage_multi(multi_return_match_result, NDA, "additions_with_context", "patents")

==NDA: 20785
==patents: ['8315886', '7874984', '8626531', '7959566', '6561977', '7230012', '7435745', '8143283', '6869399', '7141018', '6561976', '8589188', '7723361', '6908432', '8204763']
==addition: Chemical Structure of thalidomide
Thalidomide is an off-white to white, odorless, crystalline powder that is soluble at 25°C in dimethyl sulfoxide and sparingly soluble in water and ethanol.

==same_claims_list: [116, 111]
********claim: 
6. The method of claim 1, wherein the epoxide hydrolase inhibitor is administered to said human together with thalidomide.

********claim: 
1. A method for inhibiting undesired angiogenesis in a human having a blood born tumor comprising orally administering to said human a capsule comprising an angiogenesis-inhibiting amount of thalidomide and administering an amount of a compound that is an epoxide hydrolase inhibitor.

==addition: Multiple Myeloma
THALOMID® (thalidomide) in combination with dexamethasone is indicated for the treatment of patients wit

In [177]:
i=7
NDA=random_NDA_list[i]

print_top_percentage_multi(multi_return_match_result, NDA, "additions_with_context", "patents")

==NDA: 205831
==patents: ['10039719', '9801823', '7438930', '9066869', '7247318', '8580310', '7083808']
==addition: APTENSIO XR is indicated for the treatment of Attention Deficit Hyperactivity Disorder (ADHD) in patients 6 years and older [see Clinical Studies (14)].

==same_claims_list: [74]
********claim: 
20. A method of treating Attention Deficit Hyperactivity Disorder in a child comprising: administering to the child once-a-day in the morning a formulation according to claim 1.

==addition: APTENSIO XR contains methylphenidate hydrochloride, a central nervous system (CNS) stimulant.

==same_claims_list: [1]
********claim: 
2. The method of claim 1, wherein said formulation comprises methylphenidate hydrochloride.

==addition:  
Inactive Ingredients: ammonio methacrylate copolymer, type B; colloidal silicon dioxide (added if necessary); gelatin; hypromelloses; methacrylic acid copolymer, type C; polyethylene glycol; sugar spheres; talc; titanium oxide; and triethyl citrate.

==sa

In [178]:
i=8
NDA=random_NDA_list[i]

print_top_percentage_multi(multi_return_match_result, NDA, "additions_with_context", "patents")

==NDA: 209241
==patents: ['10857148', '8039627', '10906903', '10952997', '10919892', '10912771', '10851103', '8357697', '10940141', '10851104', '10874648', '10065952', '10906902', '10844058', '10857137']
==addition: INGREZZA capsules are available in the following strengths:
40 mg capsules with a white opaque body and purple cap, printed with ‘VBZ’ and ‘40’ in black ink.

==same_claims_list: [163, 8, 176, 17, 124, 125]
********claim: 
2. The method of claim 1, wherein the therapeutically effective amount is an amount equivalent to about 40 mg of (S)-2-amino-3-methyl-butyric acid (2R,3R,11bR)-3-isobutyl-9,10-dimethoxy-1,3,4,6,7,11b-hexahydro-2H-pyrido[2,1-a]isoquinolin-2-yl ester free base once daily.

********claim: 
9. The method of claim 1, wherein the VMAT2 inhibitor is administered in the form of a tablet or capsule.

********claim: 
15. The method of claim 14, wherein the VMAT2 inhibitor is administered in the form of a tablet or capsule.

********claim: 
18. The method of claim 1

In [179]:
i=9
NDA=random_NDA_list[i]

print_top_percentage_multi(multi_return_match_result, NDA, "additions_with_context", "patents")

==NDA: 21964-208271
==patents: ['10376505', '9492445', '10376584', '9180125', '8822490', '6559158', '9314461', '8524276', '10307417', '8956651', '8420663', '8552025', '9669096', '9724343', '8247425']
==addition: 
Limitation of use: Use of RELISTOR beyond four months has not been studied in the advanced illness population.

==same_claims_list: [145, 355, 47]
********claim: 
1. A method of preventing or treating an opioid-induced side effect in a chronic opioid patient, the method comprising administering a quaternary derivative of noroxymorphone in an amount sufficient to prevent or treat the side effect in the patient, wherein said amount is insufficient to treat the side effect in a patient to whom opioids have not been chronically administered and wherein said amount is such that peak plasma concentrations do not exceed 100 ng/ml.


2. The method of claim 1 wherein the quaternary derivative is methylnaltrexone.


3. The method of claim 2 wherein the side effect to be prevented or tre

In [180]:
i=10
NDA=random_NDA_list[i]

print_top_percentage_multi(multi_return_match_result, NDA, "additions_with_context", "patents")

==NDA: 22406
==patents: ['9415053', '7592339', '7157456', '9539218', '7585860', '10828310']

==same_claims_list: [16, 40, 75]
********claim: 
17. A method for the prophylaxis and/or treatment of thromboembolic diseases comprising administering an effective amount of the pharmaceutical composition of claim 6 to a patient in need thereof.

********claim: 
17. A method for treating a disorder selected from the group consisting of myocardial infarct, unstable angina, stroke, transitory ischaemic attacks, peripheral arterial occlusive disorders, pulmonary embolisms and deep venous thromboses comprising administering an effective amount of a compound of formula:




or a hydrate thereof, to a patient in need of said method.

********claim: 
1. A method of treating a thromboembolic disorder comprising
administering a direct factor Xa inhibitor that is 5-Chloro-N-({(5S)-2-oxo-3-[4-(3-oxo-4-morpholinyl)phenyl]-1,3-oxazolidin-5-yl}methyl)-2-thiophenecarboxamide no more than once daily for at lea