# i/o.vate

## Chapter 1 - Environment; Sourcing and Scraping our Data.

Given the tight deadline, the project has to restrict the heaps of available data at some point. The target for the showcase is a proof of concept that produces valid outcomes and sparks the motivation to continue producing findings. We found that WIPO (the World Intellectual Property Organization), as the governing body, should provide a solid body of data to investigate. Hong Kong, being part of the Paris Convention, is included in the territories covered by WIPO, which allows us to define and easily locate patents valid in Hong Kong by their WIPO numbers.

These patents certainly require some sort of categorization method, and one that works in English. As we have seen, the Hong Kong Intellectual Property Department provides a weekly journal, in PDF format, with updates on patent filings in the Special Administrative Region.

Now we are facing an initial problem:

### What environment to use?

Since this is a two-person effort and we did not want to rely on tools such as Dropbox for file exchange, we needed a solution that fulfills several requirements:

- decentralized
- remotely accessible
- ideally bringing enough computational power to the table

This way we made sure that we would both work in the same environment.

Eventually we chose a Linux environment on cloud servers and deployed a machine learning suite to it. Specifications:

- Ubuntu 18.04
- 4 GB RAM, 80 GB disk and 2 virtual CPUs, which proved to be a good compromise
- Headless operation kept it lean and quick
- SSH connection / FTP for file exchange and running scripts

Software:

- Python 3
- pdf parsers
- Selenium, headless Firefox
- Pandas, Seaborn
- NLTK
- Keras / Word2Vec
- Gensim / Doc2Vec


### Getting our Hong Kong Data

***This section refers to the file scrape_convert.py.
This notebook explains our method step by step, while the script file is ready to execute.***

The Hong Kong Intellectual Property Department releases a weekly journal with patent information. As these are all PDF files, we can handle them easily by downloading them and extracting the text on our machine. Conceptually, the following elements were necessary:

- a downloader for a large number of PDFs. Challenge: the publication dates follow no clear pattern.
- file handling: save a large quantity of files under distinct names for further processing
- data extraction. Challenge: separate and structure the information, find extraction patterns
- parsing: create concise dataframes

As a side note, we noticed that while the journal publishes the titles of these patents, it lacks the abstracts, which will make up our actual corpus.

Let's import some libraries. Since we are using Python 3.x, note that pdfminer is installed via pip under the name "pdfminer.six".

In [12]:
import urllib.request
import io
import datetime
import os
import pdfminer
import pickle
import re
import numpy as np
import pandas as pd

One peculiarity is that not every issue was published on a Friday, so we want a hassle-free way of trying different dates. Let's create a list of the past 10,000 days.

In [11]:
today = datetime.date.today()
date_list = [today - datetime.timedelta(days=x) for x in range(0, 10000)]

We now have to convert the dates so that they fit our links. Obtaining the PDFs was pleasantly simple, since the URL of each issue contains its publication date:

http://ipsearch.ipd.gov.hk/hkipjournal/29112013/Patent_29112013.pdf
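A minimal sketch of how a date could be mapped onto that URL pattern and tried in turn. The helper names `build_url` and `try_download` are ours for illustration and are not taken from scrape_convert.py; the sketch also assumes that every issue follows the `ddmmyyyy` pattern of the example link, which does not hold for all years.

```python
import datetime
import urllib.error
import urllib.request

# Base address taken from the example link; everything else here is illustrative.
BASE = "http://ipsearch.ipd.gov.hk/hkipjournal"

def build_url(day):
    """Return the journal URL for a date, using the ddmmyyyy pattern."""
    stamp = day.strftime("%d%m%Y")
    return f"{BASE}/{stamp}/Patent_{stamp}.pdf"

def try_download(day, timeout=30):
    """Return the PDF bytes for `day`, or None if the request fails."""
    try:
        with urllib.request.urlopen(build_url(day), timeout=timeout) as response:
            return response.read()
    except urllib.error.URLError:
        # URLError also covers HTTPError, e.g. a 404 on a day without an issue.
        return None

print(build_url(datetime.date(2013, 11, 29)))
# http://ipsearch.ipd.gov.hk/hkipjournal/29112013/Patent_29112013.pdf
```

Calling `try_download` for each entry of `date_list` and keeping only the results that are not `None` sidesteps the question of which weekday an issue appeared on.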

The way these files are stored has changed over the years, as have their encodings. In this documentation we follow only one path, since the others merely required adjustments to the method.

### Removing Chinese characters

Since WIPO and the Basic Law in Hong Kong require publication in English, we treat the Chinese text as a supplement in this exercise. English is, all things considered, also easier to process, given the proven approaches to NLP on English corpora. Here is our function that separates the Chinese from the English:

In [34]:
def getnonChinese(context):
    filtrate = re.compile(u'[\u4e00-\u9fff]') # Chinese unicode range
    context = filtrate.sub(r'', context) # remove all Chinese characters
    return context
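A short illustration of what this character range does and does not remove. The function name `strip_cjk` and the sample string are made up for the example. The range U+4E00 to U+9FFF covers the CJK Unified Ideographs block, so full-width punctuation such as `，` (U+FF0C) is left in place and may need a separate clean-up step.

```python
import re

# Same character range as in getnonChinese: the CJK Unified Ideographs block.
CJK_IDEOGRAPHS = re.compile(r'[\u4e00-\u9fff]')

def strip_cjk(text):
    """Remove CJK ideographs; anything outside the range is kept."""
    return CJK_IDEOGRAPHS.sub('', text)

sample = "Patent 專利，Journal"  # hypothetical input
print(repr(strip_cjk(sample)))
# 'Patent ，Journal'
```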

### Chapter-based Division of text

Furthermore, the text is divided into three parts: requests to record, granted standard patents, and granted short-term patents (for HK first applications). The formatting differs between them, so we extract these parts separately:

In [35]:
def part_divider(text):
    # Skip everything before the first section heading
    text = text[text.find('Requests to Record Published (Arranged by International\nPatent Classification)'):]

    # Part 1 : Requests to Record Designated Patent Applications published under section 20 of the Patents Ordinance
    part1 = text[:text.find("Requests to Record Published (Arranged by Publication No.)")]

    # Part 2 : Granted Standard Patents published under section 27 of the Patents Ordinance
    part2 = text[text.find("Standard Patents Granted (Arranged by International Patent\nClassification)"):text.find("Standard Patents Granted (Arranged by Publication No.)")]

    # Part 3 : Granted Short-term Patents published under section 118 of the Patents Ordinance
    part3 = text[text.find("Short-term Patents Granted (Arranged by International Patent\nClassification)"):text.find("Short-term Patents Granted (Arranged by Publication No.)")]

    return part1, part2, part3
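One caveat with slicing on `str.find`: it returns -1 when a heading is missing, so `text[:-1]` or `text[-1:]` silently yields a wrong slice instead of an error. A possible guard, sketched here with a hypothetical helper `slice_between` and toy markers, is to use `str.index`, which raises `ValueError` instead.

```python
def slice_between(text, start_marker, end_marker):
    """Return text from start_marker up to (not including) end_marker.

    Raises ValueError if either marker is missing.
    """
    start = text.index(start_marker)
    end = text.index(end_marker, start)  # search only after the start marker
    return text[start:end]

# Toy stand-ins for the section headings of a journal issue.
toy = "intro A-by-class body of A A-by-number tail"
print(repr(slice_between(toy, "A-by-class", "A-by-number")))
# 'A-by-class body of A '
```

Whether a missing section should abort the run or just be skipped depends on how uniform the issues are; catching the `ValueError` per issue keeps both options open.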

### Structuring INID codes

If we look at how every single patent is annotated, we see that each item carries a set of features flagged via INID codes ("Internationally agreed Numbers for the Identification of (bibliographic) Data"). Since we want to grab this information for each patent, it is a good idea to parse it into a dictionary:

In [36]:
# Create empty dictionary with keys representing the INID codes
def generate_allindex():
    d = {}
    d['51'] = ''
    d['11'] = ''
    d['11A'] = ''
    d['13'] = ''
    d['25'] = ''
    d['21'] = ''
    d['22'] = ''
    d['86'] = ''
    d['86A'] = ''
    d['87'] = ''
    d['30'] = ''
    d['62'] = ''
    d['54'] = ''
    d['73N'] = ''
    d['73C'] = ''
    d['71N'] = ''
    d['71C'] = ''
    d['72'] = ''
    d['74N'] = ''
    d['74C'] = ''
    return d
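For reference, the same empty record can be built more compactly with `dict.fromkeys`. This is an equivalent sketch rather than the code used in the script; the names `INID_KEYS` and `generate_allindex_compact` are ours.

```python
# The 20 field keys used above, in the same order.
INID_KEYS = ['51', '11', '11A', '13', '25', '21', '22', '86', '86A', '87',
             '30', '62', '54', '73N', '73C', '71N', '71C', '72', '74N', '74C']

def generate_allindex_compact():
    """Return a fresh dict mapping every INID key to an empty string."""
    return dict.fromkeys(INID_KEYS, '')

record = generate_allindex_compact()
print(len(record), repr(record['54']))
# 20 ''
```

Keeping the keys in one list also allows the column order of the final dataframe to be derived from the same source.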

### Parsing INID features

The tricky part follows, as each field's values are represented in a different way. One convenient observation: our patents are clearly separated by a line of underscores. This will help us later.

In [37]:
def parser51(entry):
    if not entry:
        return np.nan
    else:
        # Collect every standalone 4-character code (capitals/digits);
        # the trailing |$ only adds an empty match, which join() ignores
        fiftyone = "".join(re.findall(r'\b[A-Z0-9]{4}\b|$', entry))
        return fiftyone

def parser11(entry):
    if not entry:
        return np.nan
    return re.findall(r'[\d]{7}[*]?|$', entry)[0]

def parser11A(entry):
    if not entry:
        return np.nan
    return re.findall(r"[A-Z]{2}[\d]+.+[\d]*[A-Z]|$", entry)[0]

def parser13(entry):
    if not entry:
        return np.nan
    return re.findall(r'\b[A-Z]{1}\b|$' ,entry)[0]

def parser25(entry):
    if not entry:
        return np.nan
    return "".join(entry.split('\n')[0].strip())

def parser21(entry):
    if not entry:
        return np.nan
    return "".join(entry.strip())

def parser22(entry):
    if not entry:
        return np.nan
    return "".join(entry.split("\n")[0]).strip()

def parser86(entry):
    if not entry:
        return np.nan
    return "".join(entry.split('\n')[0].strip())

def parser86A(entry):
    if not entry:
        return np.nan
    return "".join(re.findall(r'[A-Z]+.[A-Z]{2}\d+.\d+', entry))

def parser87(entry):
    if not entry:
        return np.nan
    return "".join(entry.split('\n')[0].strip())

def parser30(entry):
    if not entry:
        return np.nan
    return ",".join(re.findall(r'\d{2}.\d{2}.\d{4}\s+\w{2}.+\d+.[A-Z0-9,]+', entry))

def parser62(entry):
    if not entry:
        return np.nan
    return ",".join(re.findall(r'\d{2}.\d{2}.\d{4}\s+\w{2}.\d+.[A-Z0-9]+',entry))

def parser54(entry):
    if not entry:
        return np.nan
    return ''.join(re.findall(r'[A-Z]+[^A-Za-z]', entry)).strip()

def parser73N(entry):
    if not entry:
        return np.nan
    else:
        CompanyName = entry.split('\n')[0]
        return CompanyName

def parser73C(entry):
    if not entry:
        return np.nan
    else:
        Country = re.sub(r'[^a-zA-Z\s]','', entry)
        Country = "".join(Country.strip().split('\n')[-2:])
        return Country

def parser71N(entry):
    if not entry:
        return np.nan
    else:
        CompanyName = entry.split('\n')[0]
        return CompanyName

def parser71C(entry):
    if not entry:
        return np.nan
    else:
        Country = re.sub(r'[^a-zA-Z\s]','', entry)
        Country = "".join(Country.strip().split("\n")[-2:])
        return Country

def parser72(entry):
    if not entry:
        return np.nan
    else:
        Name = re.sub(r'[^a-zA-Z\s,]','', entry)
        Name = "".join(Name.split('\n')).strip()
        return Name

def parser74N(entry):
    if not entry:
        return np.nan
    else:
        CompanyName = entry.split('\n')[0].strip()
        return CompanyName

def parser74C(entry):
    if not entry:
        return np.nan
    else:
        Country = re.sub(r'[^a-zA-Z\s]','', entry)
        Country = Country.strip().split('\n')[-1].strip()
        return Country
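Several of these parsers end their pattern with `|$` and then take element `[0]` of `re.findall`. The sketch below, using made-up entries and a hypothetical helper `first_match`, shows why: the `$` alternative always contributes an empty match at the end of the string, so the list is never empty and `[0]` cannot raise an `IndexError`.

```python
import re

# Same pattern as parser11, written without the redundant brackets.
PUBLICATION_NO = re.compile(r'\d{7}\*?|$')

def first_match(pattern, entry):
    """Return the first match, or '' when only the end-of-string alternative hits."""
    return pattern.findall(entry)[0]

print(repr(first_match(PUBLICATION_NO, "[11] 1234567 A")))  # hypothetical entry
# '1234567'
print(repr(first_match(PUBLICATION_NO, "[11] no number")))
# ''
```

A parser of this kind therefore returns an empty string rather than NaN when nothing matches; the final dataframe step converts such blank cells to NaN.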

### Creating our row data

The following function creates a list of patents and their INID/Information pairs:

In [38]:
def get_list_patent(part):
    dlist =[]
    for patent in part:
        d = generate_allindex()
        for entry in patent:
            if entry.startswith('[51]'):
                d['51'] = parser51(entry)
            elif entry.startswith('[11]'): 
                d['51'] += parser51(entry)
                d['11'] = parser11(entry)
                d['11A'] = parser11A(entry)
            elif entry.startswith('[13]'):
                d['13'] = parser13(entry[5:])
            elif entry.startswith('[25]'):
                d['25'] = parser25(entry[5:])
            elif entry.startswith('[21]'):
                d['21'] = parser21(entry[5:])
            elif entry.startswith('[22]'):
                d['22'] = parser22(entry[5:])
            elif entry.startswith('[86]'):
                d['86'] = parser86(entry[5:])
                d['86A'] = parser86A(entry)
            elif entry.startswith('[87]'):
                d['87'] = parser87(entry[5:])
            elif entry.startswith('[30]'):
                d['30'] = parser30(entry[5:])
            elif entry.startswith('[62]'):
                d['62'] = parser62(entry)
            elif entry.startswith('[54]'):
                d['54'] = parser54(entry)
            elif entry.startswith('[73]'):
                d['73N'] = parser73N(entry[5:])
                d['73C'] = parser73C(entry)
            elif entry.startswith('[71]'):
                d['71N'] = parser71N(entry[5:])
                d['71C'] = parser71C(entry)
            elif entry.startswith('[72]'):
                d['72'] = parser72(entry[5:])
            elif entry.startswith('[74]'):
                d['74N'] = parser74N(entry[5:])
                d['74C'] = parser74C(entry)
        dlist.append(d)
    return dlist
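`get_list_patent` expects `part` to be a list of patents, each of which is a list of strings that start with a bracketed INID code. The sketch below, on a made-up block of text, shows how one underscore-delimited block can be split into such entries with the regular expression used in the part creator functions further down.

```python
import re

# An entry is a bracketed number followed by everything up to the next '['.
ENTRY = re.compile(r'\[\d+\][^\[]*', re.S)

# Hypothetical block of text for a single patent.
block = "[11] 1234567\n[54] WIDGET\nASSEMBLY\n[72] DOE, Jane\n"

entries = ENTRY.findall(block)
codes = [entry[1:entry.index(']')] for entry in entries]

for code, entry in zip(codes, entries):
    print(code, repr(entry))
# 11 '[11] 1234567\n'
# 54 '[54] WIDGET\nASSEMBLY\n'
# 72 '[72] DOE, Jane\n'
```

Because `[^\[]*` stops at the next opening bracket, a literal `[` inside a field value would cut that entry short, which is worth checking for when titles look truncated.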

### Push it into a data frame

Lastly, we want to populate a dataframe with this information:

In [30]:
def dataframes(part):
    df = pd.DataFrame(get_list_patent(part))
    df = df[['51','11','11A','13','25','21','22','86','86A','87','30','62','54','73N','73C','71N','71C','72','74N','74C']]
    df = df[:-1]
    return df

Remember that the layout differs depending on the patent type (request, grant, or short-term grant). We need to treat each type differently before we call this function, so here goes:

In [32]:
# PART 1:  "Requests to Record Designated Patent Applications published under section 20 of the Patents Ordinance"
def part_1_creator(part1):
    # get rid of the Header on each page
    part1 = part1.replace("Journal No.:",'')
    part1 = part1.replace('\x0c','')
    part1 = part1.replace("Section Name:","")
    part1 = part1.replace('Publication Date:','')
    part1 = part1.replace('Requests to Record Published (Arranged by International\nPatent Classification)','')
    part1 = part1.replace('Arranged by International Patent Classification', '')
    # patents are separated by a line of underscores
    line = "_______________________________________________________________________"
    part1 = part1.split(line)
    patent_part1 = []
    for patent in part1:
        # one string per INID entry: '[NN]' plus everything up to the next '['
        patent_part1.append(re.findall(r'\[\d+\][^\[]*', patent, re.S))
    df1 = dataframes(patent_part1)
    return df1

In [40]:
# PART 2:  "Granted Standard Patents published under section 27 of the Patents Ordinance"
def part_2_creator(part2):
    # get rid of the Header on each page
    part2 = part2.replace("Journal No.:",'')
    part2 = part2.replace('\x0c','')
    part2 = part2.replace("Section Name: ()","")
    part2 = part2.replace('Publication Date:','')
    part2 = part2.replace('Standard Patents Granted (Arranged by International Patent\nClassification)','')
    part2 = part2.replace('Arranged by International Patent Classification', '')
    part2 = part2.replace('Section Name: ()  Standard Patents Granted ()', '')

    # patents are separated by a line of underscores
    line = "_______________________________________________________________________"
    part2 = part2.split(line)
    patent_part2 = []
    for patent in part2:
        patent_part2.append(re.findall(r'\[\d+\][^\[]*', patent, re.S))

    df2 = dataframes(patent_part2)
    return df2

In [43]:
# PART 3:  "Granted Short-term Patents published under section 118 of the Patents Ordinance"
def part_3_creator(part3):
    # get rid of the Header on each page
    # note: journal number and publication date are hard-coded for one issue here
    part3 = part3.replace("Journal No.: 803",'')
    part3 = part3.replace('\x0c','')
    part3 = part3.replace("Section Name: ()","")
    part3 = part3.replace('Publication Date: 17-08-2018','')
    part3 = part3.replace('Short-term Patents Granted (Arranged by International Patent\nClassification)','')
    part3 = part3.replace('Arranged by International Patent Classification', '')
    part3 = part3.replace('Section Name: ()  Short-term Patents Granted ()', '')

    # patents are separated by a line of underscores
    line = "_______________________________________________________________________"
    part3 = part3.split(line)
    patent_part3 = []
    for patent in part3:
        patent_part3.append(re.findall(r'\[\d+\][^\[]*', patent, re.S))

    df3 = dataframes(patent_part3)
    return df3

Now that we have the three parts, we glue them together and save our dataframe for further processing:

In [None]:
# CREATE FINAL DATAFRAME :
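def final_Dataframe(df1, df2, df3):
    # Stack the three parts; the keys become the outer index level
    final1 = pd.concat([df1,df2,df3],keys=['Type1','Type2','Type3'])
    final = final1.reset_index()
    final.drop(['level_1'], inplace = True, axis=1)
    # Turn empty or whitespace-only cells into NaN
    final = final.replace(r'^\s*$', np.nan, regex=True)
    final.rename(columns={'level_0': 'Patent Type'}, inplace=True)
    return final

# df1, df2, df3: the frames built by part_1_creator, part_2_creator and part_3_creator
df_final = final_Dataframe(df1, df2, df3)
print(df_final)
# keydate is not defined in this notebook; it is assumed to hold the
# date string of the issue currently being processed
with open('%s.pickle' % keydate, 'wb') as save_frame:
    pickle.dump(df_final, save_frame)
print("file saved")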
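To make the index handling in `final_Dataframe` easier to follow, here is a small sketch on two made-up frames. With `keys=...`, `pd.concat` builds a two-level index; `reset_index` turns those levels into columns named `level_0` (the key) and `level_1` (the original row number), which is why the function drops one and renames the other.

```python
import numpy as np
import pandas as pd

# Two tiny, hypothetical frames standing in for df1 and df2.
df_a = pd.DataFrame({'11': ['1000001'], '54': ['WIDGET']})
df_b = pd.DataFrame({'11': ['1000002'], '54': ['  ']})

combined = pd.concat([df_a, df_b], keys=['Type1', 'Type2'])
flat = combined.reset_index()  # adds the columns level_0 and level_1

flat = flat.drop(columns=['level_1']).rename(columns={'level_0': 'Patent Type'})
flat = flat.replace(r'^\s*$', np.nan, regex=True)  # blank cells become NaN

print(list(flat.columns))
# ['Patent Type', '11', '54']
print(flat['Patent Type'].tolist())
# ['Type1', 'Type2']
```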

We are now ready to pack all these elements into a for loop that fetches the PDF for each issue and performs these operations. We end up with one big dataframe that holds all patents in a structured way. A script consisting of all the above elements handled this task on our server, which took about one night.