# Parsing Text Files

## 1. Import Libraries
This task uses only two libraries: `re` (regular expressions) and pandas.

In [1]:
import re
import pandas as pd

## 2. Read File
First, we need to read the file. The `file_reader()` function below takes the text file and splits it into a list of strings, one per XML document.

In [2]:
# read the text file
# return a list of XML text
def file_reader(file_name):
    # open text file
    with open(file_name) as f:
        xml_tag_pattern = r'<\/us-patent-grant>'
        xml_lst = []
        xml_txt = ""
        # loop for every line
        for line in f:
            # add text
            xml_txt += line
            # a closing tag ends the current document; search() also finds it after leading whitespace
            if re.search(xml_tag_pattern, line):
                xml_lst.append(xml_txt)
                xml_txt = ""
        # end for loop line
    # end with open
    
    print("Total XML detected: {}".format(len(xml_lst)))
    
    return xml_lst

In [None]:
xml_lst = file_reader("xml_text.txt")

Total XML detected: 150
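
To see the splitting logic in isolation, the sketch below applies the same idea to a small in-memory string instead of a file. The sample text and the helper name `split_xml_documents` are made up for illustration; only the closing tag `</us-patent-grant>` is taken from `file_reader()`.

```python
import io
import re

def split_xml_documents(lines):
    """Group lines into documents, closing one at each </us-patent-grant> line."""
    closing_tag = re.compile(r'</us-patent-grant>')
    documents, current = [], []
    for line in lines:
        current.append(line)
        if closing_tag.search(line):
            # the closing tag ends the current document
            documents.append("".join(current))
            current = []
    return documents

# hypothetical sample: two tiny documents concatenated in one text stream
sample = io.StringIO(
    '<?xml version="1.0"?>\n'
    '<us-patent-grant file="A">\n'
    '</us-patent-grant>\n'
    '<?xml version="1.0"?>\n'
    '<us-patent-grant file="B">\n'
    '</us-patent-grant>\n'
)
docs = split_xml_documents(sample)
print(len(docs))
```

Collecting the lines in a list and joining them once per document avoids repeated string concatenation, which can matter for large input files.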


## 3. Extract XML Information
### 3.1 Information Extraction Function
Several functions extract the individual fields from each XML document.
* `get_title()` returns the title of the patent.
* `type_a()` extracts "type A" data: a value stored inline inside an opening tag. The function first locates the tag and then reads the inline value. The grant_id and patent_kind fields are of this type.
* `type_b()` extracts "type B" data: text that spans several lines and contains multiple HTML tags such as \<p\>. The function removes the HTML tags and joins the lines into a single line. The abstract is of this type.
* `type_c()` extracts "type C" data: values obtained by counting occurrences. The examiner and applicant citation counts are of this type.
* `get_inventors()` returns the first and last names of the inventors.
* `get_claim()` captures each claim in the patent and also returns the number of claims.
* `get_pub_stat()` returns True if the patent has a published application and False otherwise. The result is only used when the patent kind is "utility" or "plant".

In [4]:
def get_title(xml_txt, pat):
    # check for pattern
    found_pat = re.search(pat, xml_txt)
    if found_pat:
        # remove the HTML entities
        return re.sub(r'&#[\w\d]+;',"",found_pat[0])
    # end if
    return "NA"

def type_a(xml_txt, tag, inline):
    # locate the opening tag
    found_tag = re.search(tag, xml_txt)
    if found_tag:
        # search for the inline value inside the matched tag
        found_inline = re.search(inline, found_tag[0])
        if found_inline:
            # remove the HTML entities
            return re.sub(r'&#[\w\d]+;',"",found_inline[0])
        # endif
    # endif
    return "NA"

def type_b(xml_txt, pat):
    # check for pattern
    found_pat = re.search(pat, xml_txt)
    if found_pat:
        html_tag = r'<.*?>' # html tag pattern
        # split by newline, drop the first element (the rest of the opening tag), join using space, remove html tags, remove the HTML entities
        text = re.sub(r'&#[\w\d]+;',"",re.sub(html_tag, "", " ".join(found_pat[0].split("\n")[1:])))
        
        return text
    # end if
    return "NA"

def type_c(xml_txt, pat, ct_p):
    # check for pattern
    found_pat = re.search(pat, xml_txt)
    if found_pat:
        # search for all matches
        found_text = re.findall(ct_p, found_pat[0])
        if found_text:
            return len(found_text)
        # end if found text
    # end if found pat
    return 0

def get_inventors(xml_txt, pat, fname_p, lname_p):
    # check for pattern
    found_pat = re.search(pat, xml_txt)
    if found_pat:
        # find first and lastname, zip, make into list, loop for each pair, combine with space
        inv_name = "[" + ", ".join([" ".join(name) for name in list(zip(re.findall(fname_p, found_pat[0]), re.findall(lname_p, found_pat[0])))]) + "]"
        
        return inv_name
    # end if
    return "NA"

def get_claim(xml_txt, pat, clm_p):
    # check for pattern
    found_pat = re.search(pat, xml_txt)
    if found_pat:
        html_tag = r'<.*?>' # html tag pattern
        # search for claim text
        found_claim = re.findall(clm_p, found_pat[0])
        # for every claim text, remove html tag, replace new line with space, after that remove the HTML entities
        claim_lst = re.sub(r'&#[\w\d]+;',"","[" + ", ".join([re.sub(r'\n', " ", re.sub(html_tag, "",claim)) for claim in found_claim]) + "]")
        
        return claim_lst, len(found_claim)
    # end if
    return "NA", 0

def get_pub_stat(xml_txt, pat):
    found_pat = re.search(pat, xml_txt)
    if found_pat:
        return True
    else:
        return False
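
To make the three extraction types concrete, here is a minimal sketch that applies each idea to a toy fragment. The fragment is invented and heavily simplified (a real patent grant is much longer), and the code inlines the steps rather than calling the functions above. The regular expressions follow the ones used by `fetch_data()` in section 3.2.

```python
import re

# hypothetical fragment, heavily simplified from a real patent grant
fragment = (
    '<us-patent-grant lang="EN" file="US10361428-20190723.XML">\n'
    '<abstract id="abstract">\n'
    '<p id="p-0001" num="0000">An anode active material including\n'
    'a porous silicon.</p>\n'
    '</abstract>\n'
    '<us-references-cited>\n'
    '<category>cited by examiner</category>\n'
    '<category>cited by applicant</category>\n'
    '<category>cited by applicant</category>\n'
    '</us-references-cited>\n'
)

# Type A: find the opening tag, then read an inline value from it
tag_line = re.search(r'<us-patent-grant.+', fragment)[0]
grant_id = re.search(r'(?<=file=")[A-Z0-9]+', tag_line)[0]

# Type B: take a multi-line block, drop its first line, join the rest, strip tags
block = re.search(r'(?<=<abstract)[\w\W]+(?=</abstract>)', fragment)[0]
joined = " ".join(block.split("\n")[1:])
abstract = re.sub(r'<.*?>', "", joined).strip()  # strip() is an addition for tidy output

# Type C: isolate a block, then count how often a phrase occurs in it
cited = re.search(r'(?<=us-references-cited>\s)[\w\W]+?(?=</us-references-cited>)', fragment)[0]
examiner_count = len(re.findall(r'cited by examiner', cited))
applicant_count = len(re.findall(r'cited by applicant', cited))

print(grant_id, abstract, examiner_count, applicant_count, sep=" | ")
```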

### 3.2 Information Fetching
`fetch_data()` takes the XML text of one patent as input, uses the functions from 3.1 to extract every required field, and returns a dictionary of that information.

In [5]:
def fetch_data(xml_txt):
    # grant id tag, grant id inline value
    gid_tagp = r'<us-patent-grant.+'
    gid_inlp = r'(?<=file=\")[A-Z0-9]+'
    # patent kind tag, patent kind inline value
    pk_tagp = r'<application-reference.+'
    pk_inlp = r'(?<=appl-type=\")[A-Za-z0-9 ]+'
    # publication status
    pub_tagp = r'(?<=<related-publication>)[\s\S]'
    # patent title pattern
    pt_tagp = r'(?<=<invention-title id=\"[\d\w]{5}\">)[\w\W]+(?=</invention-title>)'
    # abstract pattern
    abs_tagp = r'(?<=<abstract)[\w\W]+(?=</abstract>)'
    # claim text tag, claim text inline tag
    ct_tagp = r'(?<=<claims id=\"claims\">\s)[\w\W]+?(?=<\/claims>)'
    ct_inlp = r'(?<=<claim id=\"[A-Z\-0-9]{9}\" num=\"[\d]{5}\">\s)[\w\W]+?(?=<\/claim>)'
    # citation
    citation_tag = r'(?<=us-references-cited>\s)[\w\W]+?(?=<\/us-references-cited>)'
    cec_tagp = r'cited by examiner' # citation examiner count
    cac_tagp = r'cited by applicant' # citation applicant count
    # inventors, first name, last name
    inv_tagp = r'(?<=<inventors>\s)[\w\W]+?(?=<\/inventors>)'
    fname_tagp = r'(?<=<first-name>)[^<>]+(?=</first-name>)'
    lname_tagp = r'(?<=<last-name>)[^<>]+(?=</last-name>)'
    
    grant_id = type_a(xml_txt, gid_tagp, gid_inlp) # grant id
    patent_kind = type_a(xml_txt, pk_tagp, pk_inlp) # patent kind
    pub_stat = get_pub_stat(xml_txt, pub_tagp) # publication status
    patent_title = get_title(xml_txt, pt_tagp) # patent title
    abstract = type_b(xml_txt, abs_tagp) # abstract
    inventors = get_inventors(xml_txt, inv_tagp, fname_tagp, lname_tagp) # inventors
    claim_text, number_of_claims = get_claim(xml_txt, ct_tagp, ct_inlp) # claim text
    citations_examiner_count = type_c(xml_txt, citation_tag, cec_tagp) # num of examiner citation
    citations_applicant_count = type_c(xml_txt, citation_tag, cac_tagp) # num of applicant citation
    
    # patent kind
    if patent_kind in ["utility", "plant"]:
        if pub_stat:
            patent_kind = patent_kind.capitalize() + " Patent Grant (with a published application) issued on or after January 2, 2001."
        else:
            patent_kind = patent_kind.capitalize() + " Patent Grant (no published application) issued on or after January 2, 2001."
        # end if
    elif patent_kind != "NA":
        # leave the "NA" placeholder untouched instead of turning it into "Na Patent"
        patent_kind = patent_kind.capitalize() + " Patent"
    # end if
    
    # return a dictionary of fetched data
    return {'grant_id':grant_id, 'patent_kind':patent_kind, 'patent_title':patent_title, 'abstract':abstract,
            'inventors':inventors, 'claim_text':claim_text, 'number_of_claims':number_of_claims,
            'citations_examiner_count':citations_examiner_count,'citations_applicant_count':citations_applicant_count}
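
Several patterns in `fetch_data()` use look-behind assertions with exact lengths such as `{5}` and `{9}`. Python's `re` module only accepts fixed-width look-behind, so these patterns stop matching when an id has a different length. The sketch below compares the title pattern with a capture-group alternative that tolerates any id length. The alternative is my own suggestion, not part of the original solution, and the id values in the sample strings are invented.

```python
import re

# hypothetical fragments; the id values are invented for illustration
short_id = '<invention-title id="d2e43">Hand rotator for a swivel lift</invention-title>'
long_id = '<invention-title id="d2e4300">Hand rotator for a swivel lift</invention-title>'

# look-behind pattern from fetch_data(): requires an id of exactly five characters
lookbehind = r'(?<=<invention-title id=\"[\d\w]{5}\">)[\w\W]+(?=</invention-title>)'

# alternative (not the original pattern): a capture group accepts any attributes
capture = r'<invention-title[^>]*>([\w\W]+?)</invention-title>'

def title_lookbehind(text):
    found = re.search(lookbehind, text)
    return found[0] if found else "NA"

def title_capture(text):
    found = re.search(capture, text)
    # group 1 holds the text between the opening and closing tags
    return found[1] if found else "NA"

for sample in (short_id, long_id):
    print(title_lookbehind(sample), "|", title_capture(sample))
```

Both approaches return the same title when the id has five characters; only the capture group still finds it for the longer id.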

With the functions from 3.1 and 3.2 in place, I can extract all of the information. I create an empty dictionary, loop over every XML string in `xml_lst`, call `fetch_data()` on it, and store the returned dictionary under the loop index.

After the loop finishes, I use pandas to create a DataFrame from the collected dictionaries.

In [None]:
dict_lst = {}
for i, xml in enumerate(xml_lst):
    dict_lst[i] = fetch_data(xml)
# end for
df = pd.DataFrame.from_dict(dict_lst, orient="index")
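
As a small illustration of what `orient="index"` does, the sketch below builds a DataFrame from a toy dictionary of dictionaries: each outer key becomes a row label and each inner key becomes a column. The records are placeholders, not real patent data. Passing a plain list of dictionaries to `pd.DataFrame()` is shown as an equivalent alternative for this case.

```python
import pandas as pd

# hypothetical records standing in for the dictionaries returned by fetch_data()
records = {
    0: {"grant_id": "US0000001", "number_of_claims": 15},
    1: {"grant_id": "US0000002", "number_of_claims": 1},
}

# outer keys -> row labels, inner keys -> columns
by_index = pd.DataFrame.from_dict(records, orient="index")

# alternative: a list of dictionaries yields the same rows with a default index
from_list = pd.DataFrame(list(records.values()))

print(by_index)
print(from_list)
```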

Below is a preview of the first few rows of the extracted dataset.

In [7]:
print(f"Shape: {df.shape}")
print(f"Unique Grant ID: {len(df.grant_id.unique())}")
df.head()

Shape: (150, 9)
Unique Grant ID: 150


| | grant_id | patent_kind | patent_title | abstract | inventors | claim_text | number_of_claims | citations_examiner_count | citations_applicant_count |
|---|---|---|---|---|---|---|---|---|---|
| 0 | US10361428 | Utility Patent Grant (with a published applica... | Anode active material, method of preparing the... | An anode active material including a porous si... | [Young-Ugk Kim, Seung-Uk Kwon, Jae-Hyuk Kim, C... | [1. An anode active material comprising a sili... | 15 | 1 | 7 |
| 1 | USD0854461 | Design Patent | Hand rotator for a swivel lift | | [Michael A. Haessly] | [The ornamental design for a hand rotator for ... | 1 | 9 | 0 |
| 2 | US10360817 | Utility Patent Grant (with a published applica... | Wearable partial task surgical simulator | A wearable device for simulating wounds and in... | [Stuart Charles Segall] | [1. A prosthetic internal organ module compris... | 7 | 25 | 12 |
| 3 | US10357652 | Utility Patent Grant (with a published applica... | System and method for using impedance to deter... | A method for implanting a neurostimulation lea... | [Kerry Bradley] | [1. A method performed using a neurostimulatio... | 20 | 4 | 30 |
| 4 | US10360279 | Utility Patent Grant (with a published applica... | Computer networking system and method with pre... | An apparatus, method, and non-transitory compu... | [Douglas M. Dillon] | [1. An apparatus-implemented method comprising... | 25 | 22 | 99 |


## 4. Convert to Document Format
### 4.1 CSV
To convert the dataset into a CSV file, I use the pandas method `pandas.DataFrame.to_csv()`, which writes the DataFrame to a CSV file.

In [None]:
# to CSV
df.to_csv("result.csv")
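
By default `to_csv()` also writes the row index as a first column without a header, and pandas names that column `Unnamed: 0` when the file is read back. The sketch below shows the difference that `index=False` makes, using a toy DataFrame and in-memory buffers instead of real files. The data and the name `df_demo` are placeholders.

```python
import io
import pandas as pd

# hypothetical two-row frame standing in for the real dataset
df_demo = pd.DataFrame(
    {"grant_id": ["US0000001", "US0000002"], "number_of_claims": [15, 1]}
)

with_index = io.StringIO()
df_demo.to_csv(with_index)                   # default: index written as an unnamed column

without_index = io.StringIO()
df_demo.to_csv(without_index, index=False)   # index left out

# rewind the buffers and read both versions back
with_index.seek(0)
without_index.seek(0)
cols_with = list(pd.read_csv(with_index).columns)
cols_without = list(pd.read_csv(without_index).columns)

print(cols_with)
print(cols_without)
```

Whether to keep the index is a matter of preference; here it only repeats the row number, so leaving it out gives a cleaner file.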

### 4.2 JSON
To convert the dataset into a JSON file, I build the output manually: iterate over the rows of the dataset, append each row to one long string in JSON format, and then write that string to a file using `with open() as f`.

In [None]:
# to JSON
json_txt = "{"
for idx, row in df.iterrows():
    json_txt += (
        '"' + row["grant_id"] + '":{'
        + '"patent_title": "' + row["patent_title"] + '",'
        + '"patent_kind": "' + row["patent_kind"] + '",'
        + '"number_of_claims": ' + str(row["number_of_claims"]) + ','
        + '"inventors": "' + row["inventors"] + '",'
        + '"citations_applicant_count": ' + str(row["citations_applicant_count"]) + ','
        + '"citations_examiner_count": ' + str(row["citations_examiner_count"]) + ','
        + '"claim_text": "' + row["claim_text"] + '",'
        + '"abstract": "' + row["abstract"] + '"'
        + '},'
    )
    # end txt
# end for loop
# remove comma at the end and add closing bracket
json_txt = json_txt[:-1] + "}"
# create file
with open('result.json', 'w') as f:
    f.write(json_txt)
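
One caveat with building the string by hand is that a double quote or backslash inside a title, claim, or abstract is written unescaped, which produces invalid JSON. As an alternative sketch, not the approach used above, the standard `json` module can do the escaping. The example assumes the rows are available as a list of dictionaries, for instance from `df.to_dict(orient="records")`. The sample records and the helper name `records_to_json` are hypothetical.

```python
import json

# hypothetical rows, shaped like the output of df.to_dict(orient="records")
records = [
    {"grant_id": "US0000001", "patent_title": 'Title with "quotes"', "number_of_claims": 15},
    {"grant_id": "US0000002", "patent_title": "Back\\slash title", "number_of_claims": 1},
]

def records_to_json(records, key="grant_id"):
    """Nest every record under its key field and serialise the result."""
    payload = {
        rec[key]: {name: value for name, value in rec.items() if name != key}
        for rec in records
    }
    # json.dumps escapes quotes, backslashes and newlines inside string values
    return json.dumps(payload)

json_txt = records_to_json(records)
parsed = json.loads(json_txt)  # parses back without an error
print(parsed["US0000001"]["patent_title"])
```

With real data, numeric values taken from a DataFrame may need converting to plain `int` before `json.dumps` accepts them.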