# Web scraper to collect abstracts from Google Patents

Google Patents provides a platform for searching published patents. However, the search-result downloads exclude abstracts; they do include the web address of each document.

The abstract contains critical text that our model will need to predict whether a document is relevant. A third-party API for Google Patents exists at https://serpapi.com, but it does not return the full abstract.

The following code uses a downloaded CSV of search results from Google Patents to scrape the full abstract of each document. It then creates a new CSV with title and abstract that can be used to build a labeled training set for downstream NLP models.

In [2]:
from bs4 import BeautifulSoup
import requests
import numpy as np
import pandas as pd
import time

## Testing a single URL to extract an abstract

In [5]:
base_url = 'https://patents.google.com/patent/US20230125819A1/en'

In [6]:
response = requests.get(base_url)
response.status_code

200

In [8]:
html = response.content
html[:100]

b'<!DOCTYPE html>\n<html lang="en">\n  <head>\n    <title>US20230125819A1 - Curable film-forming composit'

In [9]:
soup = BeautifulSoup(html, 'html.parser')

In [8]:
with open('DTM_response.html', 'wb') as file:
    file.write(soup.prettify('utf-8'))

In [10]:
abstract = soup.find('abstract')
abstract

<abstract lang="EN" load-source="patent-office" mxw-id="PA588727294">
<div class="abstract" id="p-0001" num="0000">Methods of coating a substrate are disclosed. The methods comprise applying shear force to a coating composition either before or during application of the coating composition to the substrate. The coating composition comprises a water-borne or solvent-borne film-forming resin and a catalyst associated with a carrier, wherein at least some of the catalyst can be released from the carrier upon application of the shear force. Also provided are coated articles prepared by the methods.</div>
</abstract>

In [11]:
abstract.text

'\nMethods of coating a substrate are disclosed. The methods comprise applying shear force to a coating composition either before or during application of the coating composition to the substrate. The coating composition comprises a water-borne or solvent-borne film-forming resin and a catalyst associated with a carrier, wherein at least some of the catalyst can be released from the carrier upon application of the shear force. Also provided are coated articles prepared by the methods.\n'
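
The `.text` output above keeps the newlines that surround the inner `<div>`. As a minimal sketch, the helper below returns the abstract with that whitespace stripped, or `None` when the page has no `<abstract>` tag. The name `extract_abstract` is hypothetical and does not appear elsewhere in this notebook.

```python
from bs4 import BeautifulSoup


def extract_abstract(html):
    """Return the cleaned abstract text from a patent page, or None if absent.

    `html` may be a str or bytes, e.g. `response.content` from requests.
    """
    soup = BeautifulSoup(html, 'html.parser')
    abstract = soup.find('abstract')
    if abstract is None:
        return None
    # strip=True trims each text fragment and drops the empty ones;
    # the " " separator joins the remaining fragments with single spaces.
    return abstract.get_text(' ', strip=True)
```

Calling `extract_abstract(response.content)` on the page fetched above should give the same abstract without the leading and trailing `\n`.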

## Importing and cleaning the full search results

The full search results of 3935 documents were downloaded from Google Patents and saved as TAC_2.xlsx.

In [13]:
data = pd.read_excel('TAC_2.xlsx')
data.head()

Unnamed: 0,Master patent number,title,Abstract,assignee,inventor/author,priority date,filing/creation date,publication date,grant date,result link,representative figure link,Column1
0,WO-2023125317-A1,Chromium-free anticorrosive coating compositio...,The present application is directed to chromiu...,"Guangdong Huarun Paints Co., Ltd.","Kai He, Rong Xiong, Xi Zhao, Yu Zhang, Wenbin ...",2021-12-28,2022-12-23,2023-07-06,,https://patents.google.com/patent/WO2023125317...,https://patentimages.storage.googleapis.com/be...,A two-part epoxy adhesive comprises a Part A a...
1,WO-2023125320-A1,Chromium-free anticorrosive coating compositio...,The present application is directed to chromiu...,"Guangdong Huarun Paints Co., Ltd.","Kai He, Rong Xiong, Xi Zhao, Yu Zhang, Wenbin ...",2021-12-28,2022-12-23,2023-07-06,,https://patents.google.com/patent/WO2023125320...,https://patentimages.storage.googleapis.com/63...,
2,WO-2022256945-A1,Coatings for marine vessels that reduce cavita...,Disclosed are compositions for coating substra...,Graphite Innovation And Technologies Inc,"Marciel GAIER, Ilia RODIONOV, Mohammed ALGERMOZI",2021-06-10,2022-06-10,2022-12-15,,https://patents.google.com/patent/WO2022256945...,,
3,WO-2022149157-A1,Multifunctional polymer hybrid for direct to m...,A uniquely designed multifunctional polymer hy...,Asian Paints Ltd.,"Vrijeshkumar SINGH, Rajeev Kumar Jain, Devchan...",2021-01-11,2021-12-15,2022-07-14,,https://patents.google.com/patent/WO2022149157...,,
4,US-2023279238-A1,Chromium-free anticorrosive coating compositio...,The present application is directed to chromiu...,"Guangdong Huarun Paints Co., Ltd.",Tingyu HU,2020-08-03,2021-08-03,2023-09-07,,https://patents.google.com/patent/US2023027923...,https://patentimages.storage.googleapis.com/e3...,


In [15]:
data.describe()

Unnamed: 0,Master patent number,title,Abstract,assignee,inventor/author,priority date,filing/creation date,publication date,grant date,result link,representative figure link,Column1
count,3936,3936,71,3928,3910,3922,3932,3935,1127,3935,372,1
unique,3933,3367,68,1558,3602,2839,2978,2795,988,3932,371,1
top,WO-2004033565-A1,Coating composition,The purpose of the present invention is to pro...,"Kansai Paint Co Ltd, 関西ペイント株式会社","Shigeru Nakamura, 茂 中村, Yasushi Nakao, 泰志 中尾",2013-03-15,2019-11-01,2004-04-22,2006-12-27,https://patents.google.com/patent/US2018037125...,https://patentimages.storage.googleapis.com/e3...,A two-part epoxy adhesive comprises a Part A a...
freq,2,60,2,250,9,8,6,7,4,2,2,1


## Cleaning the data and testing the extraction of a URL

In [14]:
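# Drop the mostly empty 'Abstract' column and the stray 'Column1' column
data_A_removed = data.drop(['Abstract', 'Column1'], axis=1)
# Keep only rows that have a URL to scrape; .copy() avoids a
# SettingWithCopyWarning when a column is added to this frame later
data_A_na_removed = data_A_removed.dropna(subset=['result link']).copy()
data_A_na_removed.shape[0]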

3935

In [15]:
link = data_A_na_removed['result link'].iloc[0]  # first URL of the cleaned data
link

'https://patents.google.com/patent/WO2023125317A1/en'

## Creating a for loop to scrape the first 100 abstracts

In [16]:
abstract_list = []

In [None]:
i = 0  # running row counter, shared with the batches in the cells below
for link in data_A_na_removed['result link'][:100]:
    response = requests.get(link)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        abstract = soup.find('abstract')
        if abstract is None:
            print(i, 'Status:', response.status_code, link, 'abstract not found')
            abstract_list.append('None')  # placeholder keeps the list aligned with the rows
        else:
            print(i, 'Status:', response.status_code, link)
            abstract_list.append(abstract.text)
    else:
        print(i, 'Status:', response.status_code, link, 'not found')
        abstract_list.append('None')
    i += 1
    time.sleep(1)  # pause after every request, including failures, to avoid overloading the website

abstract_df = pd.DataFrame(abstract_list)
abstract_df.to_csv('ScrappedAbstracts.csv')  # save the abstracts collected so far
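
The same loop body is repeated for each batch of links in the cells that follow. As a hedged sketch, it could be wrapped in one function and called once per slice. The names `fetch_html` and `scrape_abstracts`, the 10-second timeout, and the injectable `fetch` argument are assumptions for illustration and are not part of the original notebook.

```python
import time

import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Return the page bytes, or None on a non-200 status or a network error."""
    try:
        response = requests.get(url, timeout=timeout)
    except requests.RequestException:
        return None
    return response.content if response.status_code == 200 else None


def scrape_abstracts(links, fetch=fetch_html, delay=1.0):
    """Return one abstract string per link, in order.

    Links that fail to load or lack an <abstract> tag get the placeholder
    string 'None', matching the convention used in this notebook.
    """
    results = []
    for i, link in enumerate(links):
        html = fetch(link)
        abstract = None
        if html is not None:
            abstract = BeautifulSoup(html, 'html.parser').find('abstract')
        print(i, link, 'ok' if abstract is not None else 'abstract not found')
        results.append(abstract.text if abstract is not None else 'None')
        time.sleep(delay)  # pause after every request, successful or not
    return results
```

For example, `abstract_list.extend(scrape_abstracts(data_A_na_removed['result link'][100:1000]))` would process one batch. Passing a different `fetch` callable makes the function usable without network access.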

In [None]:
# Second batch: rows 100-999. Continues the counter i and abstract_list from the cell above.
for link in data_A_na_removed['result link'][100:1000]:
    response = requests.get(link)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        abstract = soup.find('abstract')
        if abstract is None:
            print(i, 'Status:', response.status_code, link, 'abstract not found')
            abstract_list.append('None')  # placeholder keeps the list aligned with the rows
        else:
            print(i, 'Status:', response.status_code, link)
            abstract_list.append(abstract.text)
    else:
        print(i, 'Status:', response.status_code, link, 'not found')
        abstract_list.append('None')
    i += 1
    time.sleep(1)  # pause after every request, including failures

# save the results collected so far
abstract_df = pd.DataFrame(abstract_list)
abstract_df.to_csv('ScrappedAbstracts.csv')
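
Each batch ends by rewriting the CSV, so the file on disk can serve as a checkpoint if the kernel is restarted mid-run. The sketch below shows one way to reload it and work out where to resume. The helper names, the `abstract` column name, and the `checkpoint.csv` path in the usage note are hypothetical. `keep_default_na=False` is passed so that pandas does not convert stored placeholder strings into `NaN` on reload.

```python
import os

import pandas as pd


def save_checkpoint(abstracts, path):
    """Write the abstracts collected so far to a one-column CSV."""
    pd.DataFrame({'abstract': abstracts}).to_csv(path, index=False)


def load_checkpoint(path):
    """Return the previously saved abstracts, or an empty list if no file exists."""
    if not os.path.exists(path):
        return []
    # dtype=str and keep_default_na=False keep every cell as the literal
    # string that was written, including the 'None' placeholder.
    saved = pd.read_csv(path, dtype=str, keep_default_na=False)
    return saved['abstract'].tolist()
```

After a restart, `abstract_list = load_checkpoint('checkpoint.csv')` followed by slicing the links with `[len(abstract_list):]` would skip the rows that were already scraped.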

In [None]:
# Third batch: rows 1000-1999. Continues the counter i and abstract_list from the cells above.
for link in data_A_na_removed['result link'][1000:2000]:
    response = requests.get(link)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        abstract = soup.find('abstract')
        if abstract is None:
            print(i, 'Status:', response.status_code, link, 'abstract not found')
            abstract_list.append('None')  # placeholder keeps the list aligned with the rows
        else:
            print(i, 'Status:', response.status_code, link)
            abstract_list.append(abstract.text)
    else:
        print(i, 'Status:', response.status_code, link, 'not found')
        abstract_list.append('None')
    i += 1
    time.sleep(1)  # pause after every request, including failures

# save the results collected so far
abstract_df = pd.DataFrame(abstract_list)
abstract_df.to_csv('ScrappedAbstracts.csv')

In [None]:
# Final batch: row 2000 to the end. Continues the counter i and abstract_list from the cells above.
for link in data_A_na_removed['result link'][2000:]:
    response = requests.get(link)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        abstract = soup.find('abstract')
        if abstract is None:
            print(i, 'Status:', response.status_code, link, 'abstract not found')
            abstract_list.append('None')  # placeholder keeps the list aligned with the rows
        else:
            print(i, 'Status:', response.status_code, link)
            abstract_list.append(abstract.text)
    else:
        print(i, 'Status:', response.status_code, link, 'not found')
        abstract_list.append('None')
    i += 1
    time.sleep(1)  # pause after every request, including failures

# save the results and add the full abstract list to the original dataframe
abstract_df = pd.DataFrame(abstract_list)
abstract_df.to_csv('ScrappedAbstracts.csv', index=False)
data_A_na_removed['Abstracts'] = abstract_list  # requires one abstract per row
data_A_na_removed.to_csv('TAC_scraped.csv', index=False)

## Additional option to save

In [21]:
abstract_df.to_csv('ScrappedAbstracts.csv')

In [None]:
data_A_na_removed['Abstracts'] = abstract_list
data_A_na_removed.to_csv('TAC_scraped.csv', index = False)
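
Because failed requests and pages without an abstract are stored as the string `'None'`, it may be worth counting them before the file is used for labeling. The following is a rough sketch; `missing_abstract_report` is a hypothetical helper, and it assumes the `Abstracts` and `result link` column names used above.

```python
import pandas as pd


def missing_abstract_report(df, column='Abstracts', placeholder='None'):
    """Return (count, links) for rows whose abstract is missing.

    A row counts as missing if it holds the placeholder string or a null value.
    """
    missing = df[column].eq(placeholder) | df[column].isna()
    return int(missing.sum()), df.loc[missing, 'result link'].tolist()
```

For example, `count, links = missing_abstract_report(data_A_na_removed)` gives the number of rows to re-scrape and their URLs.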