# Patent Number and Publication Number Mining

---

This experiment compares different text-mining approaches and libraries and their accuracy in extracting patent numbers and patent publication numbers from PDF documents.

**Text-Mining approaches:**
1. Parse the text from the PDF file using a PDF parsing library.
2. Convert the first page of the PDF to an image and run it through an OCR engine to extract the text.

*In both approaches, the extracted text is searched with a simple regular expression to match the number.*

**Libraries used in this experiment:**
1. PyPDF2: PDF processing and text parsing
2. textract (pdftotext, pdfminer, tesseract): text parsing and OCR
3. fitz (PyMuPDF): PDF-to-image conversion

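The final schema of the document is highlighted below. We are only concerned with `docId` in this experiment. The numbers in the trailing comments are the INID codes (WIPO ST.9) that label each field on a patent's front page.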


```json
{
    "id": "",                                                  // db doc id
    "jobId": "",                                               // the upload jobId
    "docId": "US008080243B2",                                  // 10
    "docType": "patent|patent_app",                            // 12
    "appNums": ["12/474,176"],                                 // 21
    "appDates": [""],                                          // 22
    "preGrantDate": "",                                        // 44
    "claimsDate": "",                                          // 46
    "grantDate": "",                                           // 45
    "title": "",                                               // 54
    "relatedDocs": [{"number":"", "date":""}],                 // 60-68
    "applicants": [""],                                        // 71
    "inventors": [""],                                         // 72
    "assignees": [""],                                         // 73
    "patentAttorney": "",                                      // 74
    "sequences": [{"seqNoId": "", "seqRef": ""}],              // mined
    "proteinId": "PCSK9",                                      // added by user
    "epitopes": [{"seqNoId": 53, "numbers": [1,3,5,7]}],       // mined
    "legalOpinion":[{"txt": "", "date":"", "userId":""}],      // added by lawyers
    "legalStatus": [{"status":"", "date":""}],                 // [allowed,reissued,invalidated,challenged]
    "created": ""                                              // db created time
}

```

### Import the Dependencies 

In [1]:
import textract, PyPDF2, fitz, re, os

### Define the main processing function

In [2]:
def process_file(file_name, file_dir, pwd):
    file_path = os.path.join(file_dir, file_name)
    tmp_file = os.path.join(pwd,'tmp', file_name)
    tmp_img = os.path.join(pwd, 'tmp' ,file_name.replace('.pdf', '.png'))
    with open(file_path, mode='rb') as f:
        reader = PyPDF2.PdfFileReader(f)
        page = reader.getPage(0)
        
        # Write the first page to the tmp dir (to save time)
        page_writer = PyPDF2.PdfFileWriter()
        page_writer.addPage(page)
        with open(tmp_file, "wb") as writer_stream:
            page_writer.write(writer_stream)

        text = page.extractText()

        # If the page has no usable text layer, skip the parsers and use OCR only
        if text != "" and not text.startswith('\n\n\n'):
            mined_info = mine_doc_text(text)
            print('PyPdF2:        ', mined_info)
            doc = textract.process(tmp_file, method='pdfminer')
            text = doc.decode('utf8')
            mined_info = mine_doc_text(text)
            print('pdfminer:      ', mined_info)
            doc = textract.process(tmp_file)
            text = doc.decode('utf8')
            mined_info = mine_doc_text(text)
            print('pdftotext:     ', mined_info)
            
        else:
            print('....OCR PROCESSING ONLY....')
 
        # Render the first page to a PNG and OCR it
        doc = fitz.open(file_path)
        page = doc.loadPage(0)  # first page (0-based index)
        pix = page.getPixmap(matrix=fitz.Matrix(5, 5))  # 5x zoom in both directions
        pix.writePNG(tmp_img)
        doc = textract.process(tmp_img, method='tesseract')
        text = doc.decode('utf8')
        img_mined_info = mine_doc_text(text)
        print('tesseract img: ', img_mined_info)
        # Also test OCRing the extracted first-page PDF directly
        doc = textract.process(tmp_file, method='tesseract')
        text = doc.decode('utf8')
        mined_info = mine_doc_text(text)
        print('tesseract pdf: ', mined_info)
        print('doc_id: ', process_doc_id(img_mined_info))
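
The check `text != "" and not text.startswith('\n\n\n')` decides whether the page has a usable text layer. A slightly more tolerant variant, sketched here under the assumption that a real front page contains at least a couple of dozen letters or digits, counts alphanumeric characters instead. The name `has_text_layer` and its threshold are hypothetical and not part of the notebook.

```python
def has_text_layer(text, min_alnum=20):
    """Guess whether extracted text came from a real text layer.

    min_alnum is an arbitrary threshold chosen for illustration.
    """
    # Count letters and digits; whitespace-only or near-empty text scores low.
    alnum = sum(ch.isalnum() for ch in text)
    return alnum >= min_alnum


print(has_text_layer(""))
print(has_text_layer("\n\n\n \n"))
print(has_text_layer("(12) United States Patent (10) Patent No.: US 9,724,411 B2"))
```

Only the last call should report a text layer; the first two inputs would be routed to OCR.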

### Define a helper function to extract the patent # or pub # from the extracted text

In [3]:
debug = False
# Raw string so that \s, \d and \/ reach the regex engine unchanged.
# Note: inside [...] the '|' is a literal pipe character, not alternation.
DOC_NUMBER_REGEX = r'((US|us)\s?([,|\/|\s|\d|&])+\s?([a-zA-Z]\d))'
def mine_doc_text(txt):
    if debug:
        print('DEBUG: mined text: ', txt[:300])
    res = re.search(DOC_NUMBER_REGEX, txt)
    if res is not None:
        return res.group()
    return 'NO MATCH FOUND. Dumping the first 200 chars... \n' + txt[:200].replace('\n','')
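
To see what the pattern accepts, the sketch below runs it on a few strings taken from the output further down. The pattern is the one defined in the cell above, written as a raw string; `find_doc_number` is a hypothetical name used only for this illustration. Because `\s` also matches newlines, a number that OCR splits across lines still comes back as one match, line breaks included.

```python
import re

# Same pattern as DOC_NUMBER_REGEX, written as a raw string.
PATTERN = re.compile(r'((US|us)\s?([,|\/|\s|\d|&])+\s?([a-zA-Z]\d))')


def find_doc_number(text):
    """Return the first match, or None (hypothetical helper for this demo)."""
    match = PATTERN.search(text)
    return match.group() if match else None


samples = [
    "(10) Patent No.: US 8,062,640 B2",      # granted patent, comma-grouped
    "(10) Pub. No.: US 2011/0268745 A1",     # pre-grant publication
    "US 20\n\n \n\n120195910A1",             # OCR split the number across lines
    "IIIII US 20090326202AI (19) United",    # kind code read as 'AI', not 'A1'
]
for sample in samples:
    print(repr(find_doc_number(sample)))
```

The last sample returns `None`: the pattern requires a letter followed by a digit as the kind code, so a text layer that encodes `A1` as `AI` defeats it.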

### Define a helper function to clean the extracted patent numbers

In [4]:
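def process_doc_id(txt):
    # Drop the country code letters and separators, then the zero padding
    txt = re.sub(r'[usUS,&\s/|]', '', txt).strip('0')
    # Drop the trailing kind code (e.g. A1, B2)
    txt = re.sub(r'\w\d$', '', txt)
    return txt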
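
The extractors return the same number in several layouts, so the clean-up step has to map all of them to one value. The sketch below illustrates this with strings taken from the output further down. It assumes the input is always a string that the match pattern has already accepted; `normalize_doc_number` is a hypothetical name and a slightly stricter variant, not the notebook's own function.

```python
import re


def normalize_doc_number(matched):
    """Reduce a matched document number to its bare digits (illustrative only)."""
    # Remove the leading country code and every separator the match pattern allows.
    digits_and_kind = re.sub(r'^(US|us)|[,/&|\s]', '', matched)
    # Remove the trailing kind code, e.g. A1 or B2.
    digits = re.sub(r'[A-Za-z]\d$', '', digits_and_kind)
    # Granted patents are zero-padded on the front page (US008062640B2).
    return digits.lstrip('0')


variants = [
    "US008062640B2",
    "US 8,062,640 B2",
    "US2012/0195910A1",
    "US 2012/0195910 A1",
    "US 20\n\n \n\n120195910A1",
]
for variant in variants:
    print(normalize_doc_number(variant))
```

The first two variants reduce to `8062640` and the last three to `20120195910`.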

### Define a helper function to rename the patent files, removing spaces and special chars

In [5]:
def clean_file_name(file):
    """
    Utility function to rename the patent files locally.
    Relies on the global `root` set by the os.walk loop below.
    """
    clean_name = file.replace(' ', '_').replace('[','').replace(']','').lower()
    if clean_name != file:
        os.rename(os.path.join(root, file), os.path.join(root, clean_name))    
        print('renamed ', file)
    return clean_name

### Run the processing function for all the patents in the patents/* directory

In [6]:
for root, dirs, files in os.walk('../../patents'):
    for file in files:
        # Ignore .DS_Store files (compare case-insensitively, before renaming)
        if file.lower().startswith('.ds'):
            continue
        print('************************************************')
        file_name = clean_file_name(file)
        print('root_dir: ', root)
        print('file_name: ', file_name)
        process_file(file_name, root, os.getcwd())

************************************************
root_dir:  ../../patents/Genentech
file_name:  us2012195910_claim_45.pdf
PyPdF2:         US2012/0195910A1
pdfminer:       US 2012/0195910 A1
pdftotext:      US 2012/0195910 A1
tesseract img:  US 20120195910A1
tesseract pdf:  US 20

 

120195910A1
doc_id:  20120195910
************************************************
root_dir:  ../../patents/TFPI antibodies (tissue factor pathway inhibitor)/Baxter
file_name:  us9046536_claim_8.pdf
PyPdF2:         US009046536B2
pdfminer:       US009046536B2
pdftotext:      US009046536B2
tesseract img:  US009046536B2
tesseract pdf:  US009046536B2
doc_id:  9046536
************************************************
root_dir:  ../../patents/TFPI antibodies (tissue factor pathway inhibitor)/Novo Nordisk
file_name:  us2011268745_claim_25.pdf
PyPdF2:         US2011/0268745A1
pdfminer:       US 2011/0268745 A1
pdftotext:      US 2011/0268745 A1
tesseract img:  US 20110268745A1
tesseract pdf:  NO MATCH FOUND. Dumping the first 200 chars... 
 US 5. 2011026874541as) United Statesa2) Patent Application Publication co) Pub. No.: US 2011/0268745 AlHilden et al. (43) Pub. Dat



pdfminer:       US 20100166768A1
pdftotext:      US 20100166768A1
tesseract img:  US 20100166768A1
tesseract pdf:  US 20100166768A1
doc_id:  20100166768
************************************************
root_dir:  ../../patents/Regeneron
file_name:  us9724411.pdf
PyPdF2:         US009724411B2
pdfminer:       US009724411B2
pdftotext:      US009724411B2
tesseract img:  US009724411B2
tesseract pdf:  US009724411B2
doc_id:  9724411
************************************************
root_dir:  ../../patents/Regeneron
file_name:  us8062640.pdf
PyPdF2:         US008062640B2
pdfminer:       US008062640B2
pdftotext:      US008062640B2
tesseract img:  US008062640B2
tesseract pdf:  US 8,062,640 B2
doc_id:  8062640
************************************************
root_dir:  ../../patents/Regeneron
file_name:  us10023654_epitope.pdf
PyPdF2:         US010023654B2
pdfminer:       US010023654B2
pdftotext:      US010023654B2
tesseract img:  US010023654B2
tesseract pdf:  US 10,023,654 B2
doc_id:  10023654
************************************************
root_dir:  ../../patents/Regeneron
file_name:  us9550837.pdf
PyPdF2:         US009550837B2
pdfminer:       US009550837B2
pdftotext:      US009550837B2
tesseract img:  US009550837B2
tesseract pdf:  US009550837B2
doc_id:  9550837
************************************************
root_dir:  ../../patents/Amgen
file_name:  us8563698_aa123-132.pdf
PyPdF2:         US008563698B2
pdfminer:       US008563698B2
pdftotext:      US008563698B2
tesseract img:  US008563698B2
tesseract pdf:  US008563698B2
doc_id:  8563698
************************************************
root_dir:  ../../patents/Amgen
file_name:  us2009326202_issued_as_698_patent.pdf
PyPdF2:         NO MATCH FOUND. Dumping the first 200 chars... 
llIIlIlllIlIlIllIlIllIllIllIIlllIIllIlIlllIIlIIllllIllIlIIlllIllIllIIllIllI



pdfminer:       NO MATCH FOUND. Dumping the first 200 chars... 
llIIlIlllIlIlIllIlIllIllIllIIlllIIllIlIlllIIlIIllllIllIlIIlllIllIllIIllIllIIIIIIIIIIIIIIII US 20090326202AI (19) United  States (12) Patent  Application  Publication Jackson  et al. (10) Pub.
pdftotext:      NO MATCH FOUND. Dumping the first 200 chars... 
llIIlIlllIlIlIllIlIllIllIllIIlllIIllIlIlllIIlIIllllIllIlIIlllIllIllIIllIllIIIIIIIIIIIIIIIIUS 20090326202AIUnited StatesPublication(12) Patent Application(19)(10)Jackson et al.(54)(43)AN
tesseract img:  US 20090326202A1
tesseract pdf:  US 2009/0326202 A1
doc_id:  20090326202
************************************************
root_dir:  ../../patents/Amgen
file_name:  us8829165.pdf
PyPdF2:         US008829165B2
pdfminer:       US008829165B2
pdftotext:      US008829165B2
tesseract img:  US008829165B2
tesseract pdf:  US008829165B2
doc_id:  8829165
************************************************
root_dir:  ../../patents/Amgen
file_name:  us8859741.pdf
PyPdF2:         US00885974



pdfminer:       NO MATCH FOUND. Dumping the first 200 chars... 
llIIlIlllIlIlIllIlIllIllIllIIlllIIllIlIlllIIlIIllllIllIlIIlllIllIllIIllIllIIIIIIIIIIIIIIII US 20090326202AI (19) United  States (12) Patent  Application  Publication Jackson  et al. (10) Pub.
pdftotext:      NO MATCH FOUND. Dumping the first 200 chars... 
llIIlIlllIlIlIllIlIllIllIllIIlllIIllIlIlllIIlIIllllIllIlIIlllIllIllIIllIllIIIIIIIIIIIIIIIIUS 20090326202AIUnited StatesPublication(12) Patent Application(19)(10)Jackson et al.(54)(43)AN
tesseract img:  US 20090326202A1
tesseract pdf:  US 2009/0326202 A1
doc_id:  20090326202
************************************************
root_dir:  ../../patents/Amgen_Marked_AR
file_name:  us8829165.pdf
....OCR PROCESSING ONLY....
tesseract img:  US008829165B2
tesseract pdf:  US008829165B2
doc_id:  8829165
************************************************
root_dir:  ../../patents/Amgen_Marked_AR
file_name:  us8859741.pdf
PyPdF2:         US008859741B2
pdfminer:       US008859741B2
pdftotex



pdfminer:       US008080243B2
pdftotext:      US008080243B2
tesseract img:  US008080243B2
tesseract pdf:  US008080243B2
doc_id:  8080243
************************************************
root_dir:  ../../patents/Pfizer
file_name:  us9175093.pdf
PyPdF2:         US009175093B2
pdfminer:       US009175093B2
pdftotext:      US009175093B2
tesseract img:  US009175093B2
tesseract pdf:  US009

175093B2
doc_id:  9175093
************************************************
root_dir:  ../../patents/Pfizer
file_name:  us8426363.pdf
....OCR PROCESSING ONLY....
tesseract img:  US008426363B2
tesseract pdf:  US008426363B2
doc_id:  8426363
************************************************
root_dir:  ../../patents/Pfizer
file_name:  us8399646.pdf
PyPdF2:         US008399646B2
pdfminer:       US008399646B2
pdftotext:      US008399646B2
tesseract img:  US008399646B2
tesseract pdf:  US008399646B2
doc_id:  8399646
************************************************
root_dir:  ../../patents/Pfizer
file_name:  us201006



pdfminer:       US 20100068199A1
pdftotext:      US 20100068199A1
tesseract img:  US 20100068199A1
tesseract pdf:  US 20100068199A1
doc_id:  20100068199


# Observation
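*In the runs shown above, OCR on a rendered image of the first page (`tesseract img`) returned a document number for every file, while direct PDF parsing failed on files whose text layer was missing or mis-encoded (for example `us2009326202_issued_as_698_patent.pdf`). Extracting document numbers from a PDF screen capture is therefore more accurate and less prone to encoding issues than parsing the PDF files directly.*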
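
One rough way to put numbers on this is to count how often each extractor produced a match at all. The sketch below does that for three of the files above. The rows are copied from the output (with `None` standing in for `NO MATCH FOUND`), while the `results` layout and the hit-count scoring are assumptions made for this illustration. Three files are far too few to generalize from, and a hit is not the same as a correct number, so this only shows the bookkeeping.

```python
from collections import Counter

METHODS = ["PyPDF2", "pdfminer", "pdftotext", "tesseract img", "tesseract pdf"]

# (file name, results in METHODS order); None means "NO MATCH FOUND".
results = [
    ("us9046536_claim_8.pdf",
     ["US009046536B2", "US009046536B2", "US009046536B2",
      "US009046536B2", "US009046536B2"]),
    ("us2011268745_claim_25.pdf",
     ["US2011/0268745A1", "US 2011/0268745 A1", "US 2011/0268745 A1",
      "US 20110268745A1", None]),
    ("us2009326202_issued_as_698_patent.pdf",
     [None, None, None, "US 20090326202A1", "US 2009/0326202 A1"]),
]

# Count, per method, the files for which a document number was matched.
hits = Counter()
for _file_name, values in results:
    for method, value in zip(METHODS, values):
        if value is not None:
            hits[method] += 1

for method in METHODS:
    print(f"{method:14s} {hits[method]}/{len(results)}")
```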

### Toggle any warnings in the last command output
The snippet below toggles the warnings emitted by the PyPDF2 library so that the Jupyter notebook output stays clean. The script is a modified version of the code published at https://stackoverflow.com/a/59863323/4106075

In [7]:
%%javascript
(function(on){
    const e=$( "<button style='font-size:18px; color:white;'>Setup failed</button>" );
    const ns="js_jupyter_suppress_warnings";
    var cssrules=$("#"+ns);
    if(!cssrules.length) 
        cssrules = $("<style id='"+ns+"' type='text/css'>div.output_stderr { } </style>").appendTo("head");
    
    e.click(function() {
        var s='Hide';  
        cssrules.empty()
        if(on) {
            s='Show hiding';
            cssrules.append("div.output_stderr, div[data-mime-type*='.stderr'] { display:none; }");
            console.log(e)
            e.css('background-color', 'red');
        }else{
            e.css('background-color', 'blue');
        }
        e.text(s+' warnings');
        on=!on;
    }).click();
    $(element).append(e);
})(true);


### Patent Doc References

- How to read a patent: https://guides.library.queensu.ca/c.php?g=501420&p=3436527
- WIPO ST.9 doc: https://www.wipo.int/export/sites/www/standards/en/pdf/03-09-01.pdf