# **Patent Data Scraping Notebook**


## **Introduction**
This Jupyter notebook contains Python code for scraping patent data from three sources: Google Patents, FPO (Free Patents Online), and WIPO (World Intellectual Property Organization). It demonstrates how to retrieve patent information programmatically using a search API and web scraping techniques.

## **Overview**
- **FPO (Free Patents Online):** This section scrapes US patent search results (title, patent number, and abstract) from the Free Patents Online website. We use Playwright to load each results page and openpyxl to save the data to an Excel file.

- **WIPO (World Intellectual Property Organization):** This section collects patent links from WIPO's PATENTSCOPE search. We use Selenium WebDriver to load and page through the search results and BeautifulSoup to extract the result links.

- **Google Patents:** This section retrieves patent metadata through SerpApi's Google Patents engine, scrapes individual patent pages with requests and BeautifulSoup, and downloads patent PDFs with requests_html and wget.

## **Prerequisites**
Before running the code in this notebook, ensure that you have the following dependencies installed:
- Python 3.x
- Jupyter Notebook
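- A SerpApi API key (for Google Patents metadata)
- Requests and BeautifulSoup (for Google Patents scraping)
- pandas and openpyxl (for handling metadata and Excel files)
- progressbar (for showing scraping progress)
- requests_html and wget (for downloading patent PDFs)
- boto3 (for uploading PDFs to AWS S3)
- Playwright (for FPO scraping)
- Selenium WebDriver and BeautifulSoup (for WIPO scraping)

## **Usage**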
**FPO Scraping:**

Execute the code in the FPO scraping section to scrape patent data from Free Patents Online.
Customize the search query and the page limit as needed.

**WIPO Scraping:**

Run the code in the WIPO scraping section to collect patent links from WIPO's PATENTSCOPE search with Selenium.
Ensure that Chrome is installed, because the code launches a Chrome browser.

**Google Patents Scraping:**

Execute the code in the Google Patents scraping section to retrieve patent metadata and PDFs from Google Patents.
Set your SerpApi API key and customize the search query to match specific criteria or keywords.

# Google Patents

#### 1. Load patent metadata using SerpApi

In [2]:
import requests
import pandas as pd
import json

In [6]:
import requests

# Set your SerpApi API key
API_KEY = ""

# Define the search query
search_query = "(Virus engineering)"

# Construct the API request URL
url = f"https://serpapi.com/search.json?engine=google_patents&q={search_query}&api_key={API_KEY}"

try:
    # Send GET request
    response = requests.get(url)
    response.raise_for_status()  # Raise an exception for any HTTP error status codes

    # Parse response JSON
    data = response.json()

    # Extract total number of results
    total_results = data.get('search_information', {}).get('total_results', 0)
    print("Total Results:", total_results)

    # Extract patent data
    patents = data.get('organic_results', [])

    # Print titles of patents
    for patent in patents:
        print("Title:", patent.get('title', 'N/A'))
        print("Inventor:", patent.get('inventor', 'N/A'))
        print("Assignee:", patent.get('assignee', 'N/A'))
        print("Publication Date:", patent.get('publication_date', 'N/A'))
        print("Patent ID:", patent.get('patent_id', 'N/A'))
        print()

except requests.exceptions.HTTPError as err:
    print(f"HTTP error occurred: {err}")

except Exception as e:
    print(f"An error occurred: {e}")


Total Results: 183736
Title: Glycosylation engineering of antibodies for improving antibody-dependent …
Inventor: Pablo Umaña
Assignee: Roche Glycart Ag
Publication Date: 2017-08-01
Patent ID: patent/US9718885B2/en

Title: Compositions and methods for in vitro viral genome engineering
Inventor: K·C·凯蒂
Assignee: C3J治疗公司
Publication Date: 2021-05-28
Patent ID: patent/CN107278227B/en

Title: Attenuated viruses useful for vaccines
Inventor: Eckard Wimmer
Assignee: The Research Foundation for The State of University New york
Publication Date: 2022-09-22
Patent ID: patent/US20220298492A1/en

Title: Inducible adeno -associated virus vector mediated transgene ablation system
Inventor: James M. Wilson
Assignee: The Trustees of The University of Pennsylvania
Publication Date: 2015-09-09
Patent ID: patent/EP2761009B1/en

Title: Engineering and optimization of improved systems, methods and enzyme …
Inventor: フェン・ジャン
Assignee: ザ・ブロード・インスティテュート・インコーポレイテッド
Publication Date: 2022-08-24
Patent ID: pate
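
The cell above places the search query directly into the URL string. A minimal sketch of building the same URL with explicit percent-encoding is shown here; the helper name `build_search_url` is hypothetical, and only the endpoint and the three parameters already used in the notebook are assumed.

```python
from urllib.parse import urlencode

# Same endpoint as the cell above
SERPAPI_ENDPOINT = "https://serpapi.com/search.json"


def build_search_url(query, api_key):
    """Return a SerpApi Google Patents URL with every parameter percent-encoded."""
    params = {"engine": "google_patents", "q": query, "api_key": api_key}
    return f"{SERPAPI_ENDPOINT}?{urlencode(params)}"


# Spaces and parentheses in the query are encoded instead of being sent raw
print(build_search_url("(Virus engineering)", "YOUR_KEY"))
```

Passing a `params` dictionary to `requests.get` would achieve the same encoding; the explicit helper only makes the final URL visible.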

In [7]:
import csv
import requests

# Set your SerpApi API key
API_KEY = ""

# Define the search query
search_query = "(Virus engineering)"

# Construct the API request URL
url = f"https://serpapi.com/search.json?engine=google_patents&q={search_query}&api_key={API_KEY}"

try:
    # Send GET request
    response = requests.get(url)
    response.raise_for_status()  # Raise an exception for any HTTP error status codes

    # Parse response JSON
    data = response.json()

    # Extract patent data
    patents = data.get('organic_results', [])

    # Specify the CSV file name
    csv_filename = "patents.csv"

    # Write patent data to CSV file
    with open(csv_filename, 'w', newline='', encoding='utf-8') as csvfile:
        writer = csv.writer(csvfile)
        # Write header row
        writer.writerow(["Title", "Inventor", "Assignee", "Publication Date", "Patent ID"])
        # Write data rows
        for patent in patents:
            title = patent.get('title', 'N/A')
            inventor = patent.get('inventor', 'N/A')
            assignee = patent.get('assignee', 'N/A')
            publication_date = patent.get('publication_date', 'N/A')
            patent_id = patent.get('patent_id', 'N/A')
            writer.writerow([title, inventor, assignee, publication_date, patent_id])

    print(f"Total {len(patents)} patents scraped and saved to {csv_filename}.")

except requests.exceptions.HTTPError as err:
    print(f"HTTP error occurred: {err}")

except Exception as e:
    print(f"An error occurred: {e}")


Total 10 patents scraped and saved to patents.csv.
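
The CSV cell above reads each field with `patent.get(..., 'N/A')`. As a sketch of an alternative, `csv.DictWriter` can fill missing keys itself. The helper name `patents_to_csv` and the sample record are hypothetical, and the header row here uses the raw key names rather than the display names written above.

```python
import csv
import io

# Keys read from each result in the cell above
FIELDS = ["title", "inventor", "assignee", "publication_date", "patent_id"]


def patents_to_csv(patents, fields=FIELDS):
    """Serialize result dicts to CSV text, writing 'N/A' for missing keys."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fields, restval="N/A", extrasaction="ignore"
    )
    writer.writeheader()
    writer.writerows(patents)
    return buffer.getvalue()


# Hypothetical record: two known keys plus one key that is not exported
sample = [{"title": "Example patent", "patent_id": "EXAMPLE-ID", "snippet": "ignored"}]
print(patents_to_csv(sample))
```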


In [1]:
import csv
import requests
import time

# Set your SerpApi API key
API_KEY = ""

# Define the search query
search_query = "(Virus engineering)"

# Initialize variables
patents_data = []

# Define function to scrape patents data
def scrape_patents(url):
    try:
        # Send GET request
        response = requests.get(url)
        response.raise_for_status()  # Raise an exception for any HTTP error status codes

        # Parse response JSON
        data = response.json()

        # Extract patent data
        patents = data.get('organic_results', [])
        patents_data.extend(patents)

        # Check for next page
        pagination = data.get('pagination', {})
        next_page = pagination.get('next')
        if next_page:
            time.sleep(1)  # Adding a delay to avoid rate limiting
            scrape_patents(next_page)

    except requests.exceptions.HTTPError as err:
        print(f"HTTP error occurred: {err}")

    except Exception as e:
        print(f"An error occurred: {e}")

# Construct the initial API request URL
initial_url = f"https://serpapi.com/search.json?engine=google_patents&q={search_query}&api_key={API_KEY}"

# Start scraping
print("Starting scraping...")
scrape_patents(initial_url)

# Specify the CSV file name
csv_filename = "patents.csv"

# Write patent data to CSV file
with open(csv_filename, 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile)
    # Write header row
    writer.writerow(["Title", "Inventor", "Assignee", "Publication Date", "Patent ID", "Patent Link", "PDF Link"])
    # Write data rows
    for patent in patents_data:
        title = patent.get('title', 'N/A')
        inventor = patent.get('inventor', 'N/A')
        assignee = patent.get('assignee', 'N/A')
        publication_date = patent.get('publication_date', 'N/A')
        patent_id = patent.get('patent_id', 'N/A')
        patent_link = patent.get('patent_link', 'N/A')
        pdf_link = patent.get('pdf', 'N/A')
        writer.writerow([title, inventor, assignee, publication_date, patent_id, patent_link, pdf_link])

print(f"Total {len(patents_data)} patents scraped and saved to {csv_filename}.")


Starting scraping...
Total 10 patents scraped and saved to patents.csv.


### Save the data as a JSON file:

In [3]:
# Set up your API key and construct the request URL
API_KEY = ""  # Set your SerpApi API key here; do not commit a real key with the notebook
search_query = 'virus engineering'
url = f'https://serpapi.com/search?engine=google_patents&q={search_query}&api_key={API_KEY}'

# Make the GET request
response = requests.get(url)
response.raise_for_status()  # Raise an exception for any HTTP error status codes

# Parse the JSON response
data = response.json()

# Specify the filename to save the JSON data
filename = 'patents_in_virus_engineering24.json'

# Write the first page of JSON data to the file
with open(filename, 'w') as file:
    json.dump(data, file, indent=4)

# Collect the results of every page under a single 'patents' key
# (created here so that extend() below cannot raise a KeyError)
data.setdefault('patents', list(data.get('organic_results', [])))

# Follow the 'next' link for as long as the response provides one
while 'pagination' in data and 'next' in data['pagination']:
    next_page_url = data['pagination']['next']
    next_response = requests.get(next_page_url)
    next_data = next_response.json()

    if 'patents' in next_data:
        data['patents'].extend(next_data['patents'])
    elif 'organic_results' in next_data:
        # Assuming patents are within 'organic_results'; keep entries that carry a patent ID
        patents = [result for result in next_data['organic_results'] if result.get('patent_id')]
        data['patents'].extend(patents)

    if 'pagination' in next_data and 'next' in next_data['pagination']:
        data['pagination'] = next_data['pagination']
    else:
        break

# Write the updated JSON data to the file
with open(filename, 'w') as file:
    json.dump(data, file, indent=4)
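
The cell above assigns the API key directly in the code. A minimal sketch of reading it from the environment instead follows; the variable name `SERPAPI_API_KEY` and the helper `load_api_key` are hypothetical and not part of SerpApi.

```python
import os


def load_api_key(env_var="SERPAPI_API_KEY"):
    """Read the key from the environment; SERPAPI_API_KEY is a hypothetical name."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before running this cell")
    return key


# Usage inside a cell, once the variable has been exported in the shell:
# API_KEY = load_api_key()
```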

To load the complete metadata from Google Patents:

In [10]:
# This code can also be run from the terminal, which is the best way to follow its progress.

In [None]:
"""
Follow these steps in order to run this code:
    * Use Search_Url_Finder.py to download the CSV file that contains the URL of each patent
    * Copy the CSV file to the directory where this code is located
    * Rename it to gp-search.csv
    
This code extracts the following information from each patent page on Google Patents and stores it in a DataFrame:
    - ID
    - Title
    - Abstract
    - Description
    - Claims
    - Inventors
    - Patent Office
    - Publication Date
    - URL
    
The code can resume from the last run, so don't worry if something unexpected happens (e.g. a power outage).

This code creates two files in the code directory:
    patents_data.csv --> Contains all information scraped from the patent pages
    not_scrap_pickle --> Contains all patents from gp-search.csv that weren't scraped
    
@author: zil.ink/anvaari
"""

# Import required packages
import pandas as pd
import requests
import progressbar
import time
import os
from os.path import join
from bs4 import BeautifulSoup
import pickle

script_path=os.path.dirname(os.path.abspath(__file__))

# Make sure gp-search.csv exists
while not os.path.isfile(join(script_path,'gp-search.csv')):
    print('\nFollow these steps in order to run this code:\n\t* Use Search_Url_Finder.py to download the CSV file that contains the URL of each patent\n\t* Copy the CSV file to the directory where this code is located\n\t* Rename it to gp-search.csv\n')
    print("\ngp-search.csv was not found. It should be in the same directory as this code\n")
    temp_=input('\nPlease copy the file and press Enter\n')
# Import gp-search.csv as a DataFrame
search_df=pd.read_csv(join(script_path,'gp-search.csv'),skiprows=[0])

# This part adds the resume capability
# Load the previous result (if it exists) from the code directory and keep only the rows of gp-search.csv after the last saved index
if os.path.isfile(join(script_path,'patents_data.csv')):
    result=pd.read_csv(join(script_path,'patents_data.csv'),index_col=0)
    search_df = search_df.loc[int(result.index[-1]) + 1:, :]
else:
    result=pd.DataFrame(columns=['ID','Title','Abstract','Description','Claims','Inventors','Current Assignee','Patent Office','Publication Date','URL'])
# Load the list of links that were not scraped, if it exists
if os.path.isfile(join(script_path,'not_scrap_pickle')):
    with open(join(script_path,'not_scrap_pickle'),'rb') as fp:
        not_scraped=pickle.load(fp)
else:
    not_scraped=[]

# Set the user agent for every request sent to Google
h={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36'}

# Iterate over gp-search.csv and send a request for each patent page
for (index,row),i in zip(search_df.iterrows(),progressbar.progressbar(range(len(search_df)))):
    link=row['result link']
    # Send a request to Google Patents and fetch the source of the patent page
    # try/except is used to handle connection errors
    try:
        r=requests.get(link,headers=h)
    except requests.exceptions.ConnectionError as e:
        not_scraped.append(link)
        print(e,'\n\n')
        # Stop the program if the error rate reaches 20% (the check is skipped at index 0 to avoid dividing by zero)
        if int(index)>0 and len(not_scraped)/int(index) >=0.2:
            print('\nAt least 20% of requests resulted in errors. Please read the output to investigate why this happened\n')
            break
        continue
    # Use BeautifulSoup to extract information from the HTML
    bs=BeautifulSoup(r.content,'html.parser')
    # Find the claims, description and abstract sections
    sections={}
    for name in ('claims','description','abstract'):
        section=bs.find('section',{'itemprop':name})
        # Handle the case where the section does not exist
        if section is None:
            sections[name]='Not Found'
            continue
        # Handle non-English paragraphs: remove the original-language source text and keep the translation
        for span in section.find_all('span',class_='notranslate'):
            src=span.find(class_='google-src-text')
            if src is not None:
                src.extract()
        sections[name]=section.text.strip()
    claims,desc,abst=sections['claims'],sections['description'],sections['abstract']
      
    
    patent_office=bs.find('dd',{'itemprop':'countryName'})
    # Handle the case where the patent office name does not exist
    if patent_office is None:
        patent_office='Not Found'
    else:
        patent_office=patent_office.text
    # Add information to result dataframe
    result.at[index,'ID']=search_df.at[index,'id']
    result.at[index,'Title']=search_df.at[index,'title']
    result.at[index,'Abstract']=abst
    result.at[index,'Description']=desc
    result.at[index,'Claims']=claims
    result.at[index,'Inventors']=search_df.at[index,'inventor/author']
    result.at[index,'Current Assignee']=search_df.at[index,'assignee']
    result.at[index,'Publication Date']=search_df.at[index,'publication date']
    result.at[index,'Patent Office']=patent_office
    result.at[index,'URL']=search_df.at[index,'result link']
    
    # Save the result DataFrame and the list of unscraped links every 5 iterations
    if i%5==0:
        result.to_csv(join(script_path,'patents_data.csv'))
        with open(join(script_path,'not_scrap_pickle'),'wb') as fp:
            pickle.dump(not_scraped, fp)
    # Wait 70 seconds every 10 iterations to avoid being blocked by Google
    if i%10==0 and i!=0:
        time.sleep(70)
    
result.to_csv(join(script_path,'patents_data.csv'))
with open(join(script_path,'not_scrap_pickle'),'wb') as fp:
            pickle.dump(not_scraped, fp)
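
The script above stops once the share of failed requests reaches 20%, measured against the row index. A small sketch of that check as a standalone helper that cannot divide by zero is given below; `should_abort` and its `min_attempts` warm-up are hypothetical additions, not part of the original script.

```python
def should_abort(failed, attempted, threshold=0.2, min_attempts=10):
    """Return True when the failure ratio reaches the threshold.

    min_attempts is an assumed warm-up so that one early failure
    cannot stop the whole run.
    """
    if attempted < min_attempts:
        return False
    return failed / attempted >= threshold


print(should_abort(failed=1, attempted=1))    # False: still inside the warm-up
print(should_abort(failed=3, attempted=20))   # False: 15% of requests failed
print(should_abort(failed=4, attempted=20))   # True: 20% of requests failed
```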
  



#### 2. Download patents in PDF format

In [13]:
!pip install wget

Collecting wget
  Downloading wget-3.2.zip (10 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Building wheels for collected packages: wget
  Building wheel for wget (setup.py): started
  Building wheel for wget (setup.py): finished with status 'done'
  Created wheel for wget: filename=wget-3.2-py3-none-any.whl size=9680 sha256=394f014a5dd8892f9eaa99429488260166365336e23b8e78185cc36b925f6ae3
  Stored in directory: c:\users\leila\appdata\local\pip\cache\wheels\40\b3\0f\a40dbd1c6861731779f62cc4babcb234387e11d697df70ee97
Successfully built wget
Installing collected packages: wget
Successfully installed wget-3.2


In [1]:
# This script was run from the terminal; see the file script.py.txt

In [27]:
from requests_html import HTMLSession
import wget
from time import sleep


s = HTMLSession()
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36'
}

with open('Google_patents_links.txt', 'r') as file:
    try:
        for url in file:
            url = url.strip()  # Remove leading/trailing whitespaces
            print(url)
            r = s.get(url, headers=headers)
            # render the JavaScript
            r.html.render(sleep=3, timeout=50)
            pdf_url = r.html.find('a.style-scope.patent-result', first=True).attrs['href']
            wget.download(pdf_url)
            sleep(1)
    except Exception as e:
        print("An error occurred:", e)


https://patents.google.com/patent/US20220298492A1/en
An error occurred: 'coroutine' object is not callable


  r.html.arender()(sleep=3, timeout=50)


15 PDFs were downloaded to the local drive.
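
In the download cell above, the `try` block wraps the whole loop, so the first failing URL ends the run. A minimal sketch of handling errors per URL instead is shown here. `process_links` and `fake_download` are hypothetical names, and the fake handler replaces the real render-and-download step so the example runs offline.

```python
def process_links(lines, handler):
    """Call handler(url) for each non-empty line and collect the failures."""
    failures = []
    for line in lines:
        url = line.strip()
        if not url:
            continue  # skip blank lines in the links file
        try:
            handler(url)
        except Exception as exc:  # keep going so one bad page does not end the run
            failures.append((url, str(exc)))
    return failures


def fake_download(url):
    """Stand-in for the render-and-download step; fails on one URL on purpose."""
    if "bad" in url:
        raise ValueError("no PDF link found")


links = ["https://example.org/ok-1\n", "\n", "https://example.org/bad\n", "https://example.org/ok-2\n"]
print(process_links(links, fake_download))
```

The returned list of failures can be saved and retried later, similar to the `not_scrap_pickle` file used by the metadata scraper.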

In [13]:
# Patents metadata

In [1]:
import pandas as pd
df = pd.read_excel('gp-search-20240303-102418.xlsx')

In [2]:
df

Unnamed: 0,id,title,assignee,inventor/author,priority date,filing/creation date,publication date,grant date,result link,representative figure link
0,US-2022298492-A1,Attenuated viruses useful for vaccines,The Research Foundation for The State of Unive...,"Eckard Wimmer, Steve Skiena, Steffen Mueller, ...",2007-03-30,2021-10-28,2022-09-22,,https://patents.google.com/patent/US2022029849...,https://patentimages.storage.googleapis.com/d6...
1,CN-107278227-B,Compositions and methods for in vitro viral ge...,C3J治疗公司,"K·C·凯蒂, E·M·巴尔布, C·G·迪皮特里洛",2014-12-16,2015-12-15,2021-05-28,2021-05-28,https://patents.google.com/patent/CN107278227B/en,https://patentimages.storage.googleapis.com/a5...
2,US-6971019-B1,Histogram-based virus detection,Symantec Corporation,Carey S. Nachenberg,2000-03-14,2000-03-14,2005-11-29,2005-11-29,https://patents.google.com/patent/US6971019B1/en,https://patentimages.storage.googleapis.com/c2...
3,ES-2558138-T3,Newcastle disease recombinant virus RNA expres...,The Mount Sinai School Of Medicine Of New York...,"Adolfo Garcia-Sastre, Peter Palese",1998-09-14,1999-09-14,2016-02-02,2016-02-02,https://patents.google.com/patent/ES2558138T3/en,
4,US-8679785-B2,Knobs and holes heteromeric polypeptides,"Genentech, Inc.","Paul J. Carter, Leonard G. Presta, John B. Rid...",1995-03-01,2012-06-12,2014-03-25,2014-03-25,https://patents.google.com/patent/US8679785B2/en,https://patentimages.storage.googleapis.com/63...
...,...,...,...,...,...,...,...,...,...,...
20125,CN-111655044-B,灭菌方法,新南创新私人有限公司,"R·M·帕什利, A·G·桑切斯, 巴里·尼哈姆",2017-11-28,2018-11-28,2024-02-23,2024-02-23,https://patents.google.com/patent/CN111655044B/zh,
20126,DE-69938059-T2,"Chlamydia antigene, entsprechende dna-fragment...","Sanofi Pasteur Ltd., Toronto","Andrew D. Richmond Hill Murdin, Raymond P. Sch...",1998-12-28,1999-12-22,2009-01-08,2009-01-08,https://patents.google.com/patent/DE69938059T2/de,
20127,MX-PA00006184-A,Rhabdovirus recombinante que contiene una prot...,Univ Tennessee Res Corp,Michael A Whitt,1997-12-22,1998-12-22,2003-02-11,,https://patents.google.com/patent/MXPA00006184...,
20128,SK-108193-A3,Patent SK108193A3,Sandoz Ltd,"Jorn D Mikkelsen, Kirsten Bojsen, Klaus K Niel...",1991-04-08,1992-04-07,1994-04-06,,https://patents.google.com/patent/SK108193A3/sk,


**Extract the links used to download each patent PDF.**

In [11]:
df['result link']

0        https://patents.google.com/patent/US2022029849...
1        https://patents.google.com/patent/CN107278227B/en
2         https://patents.google.com/patent/US6971019B1/en
3         https://patents.google.com/patent/ES2558138T3/en
4         https://patents.google.com/patent/US8679785B2/en
                               ...                        
20125    https://patents.google.com/patent/CN111655044B/zh
20126    https://patents.google.com/patent/DE69938059T2/de
20127    https://patents.google.com/patent/MXPA00006184...
20128      https://patents.google.com/patent/SK108193A3/sk
20129     https://patents.google.com/patent/ES2405848T3/es
Name: result link, Length: 20130, dtype: object

In [12]:
df.isna().sum()

id                               0
title                            0
assignee                        17
inventor/author                 46
priority date                   44
filing/creation date             4
publication date                 0
grant date                    7526
result link                      0
representative figure link    8546
dtype: int64

Filter the data by filing date, keeping only the period from 2020 through 2024.

In [4]:
import pandas as pd

# Assuming your data is stored in a DataFrame called 'df'
df['filing/creation date'] = pd.to_datetime(df['filing/creation date'])
filtered_data = df[(df['filing/creation date'] >= '2020-01-01') & (df['filing/creation date'] <= '2024-12-31')]


In [5]:
filtered_data

Unnamed: 0,id,title,assignee,inventor/author,priority date,filing/creation date,publication date,grant date,result link,representative figure link
0,US-2022298492-A1,Attenuated viruses useful for vaccines,The Research Foundation for The State of Unive...,"Eckard Wimmer, Steve Skiena, Steffen Mueller, ...",2007-03-30,2021-10-28,2022-09-22,,https://patents.google.com/patent/US2022029849...,https://patentimages.storage.googleapis.com/d6...
7,JP-2023073245-A,"Delivery, engineering and optimization of syst...","ザ・ブロード・インスティテュート・インコーポレイテッド, Broad Institute I...","フェン・ジャン, Feng Zhang, マティアス・ハイデンライク, HEIDENREIC...",2012-12-12,2023-02-10,2023-05-25,,https://patents.google.com/patent/JP2023073245...,https://patentimages.storage.googleapis.com/06...
8,JP-2023072038-A,Human monoclonal antibodies to programmed deat...,"イー・アール・スクイブ・アンド・サンズ・リミテッド・ライアビリティ・カンパニー, E R S...","アラン ジェイ． コーマン, Alan J Korman, マーク ジェイ． セルビー, M...",2005-07-01,2023-03-14,2023-05-23,,https://patents.google.com/patent/JP2023072038...,https://patentimages.storage.googleapis.com/12...
9,JP-7269990-B2,"CRISPR-Cas Component Systems, Methods and Comp...","ザ・ブロード・インスティテュート・インコーポレイテッド, マサチューセッツ・インスティトュー...","ジャン フェン, デイヴィッド・オリヴァー・バイカード, レ・コン, デイヴィッド・ベンジャ...",2012-12-12,2021-06-09,2023-05-09,2023-05-09,https://patents.google.com/patent/JP7269990B2/en,https://patentimages.storage.googleapis.com/5e...
11,JP-7125440-B2,Engineering and optimization of improved syste...,"ザ・ブロード・インスティテュート・インコーポレイテッド, マサチューセッツ・インスティトュー...","フェン・ジャン, フェイ・ラン, オフィール・シャレム",2012-12-12,2020-02-18,2022-08-24,2022-08-24,https://patents.google.com/patent/JP7125440B2/en,https://patentimages.storage.googleapis.com/e5...
...,...,...,...,...,...,...,...,...,...,...
20098,KR-20220136057-A,바이러스 표면 엔지니어링 기반의 면역 증강된 바이러스 백신,충남대학교산학협력단,"신현진, 유지훈, 박정은",2021-03-31,2021-10-26,2022-10-07,,https://patents.google.com/patent/KR2022013605...,https://patentimages.storage.googleapis.com/27...
20111,CN-114672589-A,一种用于检测对虾hinv病毒的靶序列、引物及其应用,"中山大学, 南方海洋科学与工程广东省实验室(珠海)","何建国, 邓恒为, 翁少萍, 曹昶政, 何心怡, 周丹丹",2021-10-25,2021-10-25,2022-06-28,,https://patents.google.com/patent/CN114672589A/zh,https://patentimages.storage.googleapis.com/64...
20115,KR-20230072150-A,강력한 적응성 면역반응 유도 및 모체이행항체의 간섭을 극복하는 재조합 구제역 a형 ...,대한민국(농림축산식품부 농림축산검역본부장),"이민자, 김현미, 신세희, 김수미, 박종현",2021-11-17,2021-11-17,2023-05-24,,https://patents.google.com/patent/KR2023007215...,https://patentimages.storage.googleapis.com/d7...
20116,KR-20230072149-A,강력한 적응성 면역반응 유도 및 모체이행항체의 간섭을 극복하는 재조합 구제역 o형 ...,대한민국(농림축산식품부 농림축산검역본부장),"이민자, 김현미, 신세희, 김수미, 박종현",2021-11-17,2021-11-17,2023-05-24,,https://patents.google.com/patent/KR2023007214...,https://patentimages.storage.googleapis.com/4e...


In [6]:
# Save the data to an .xlsx file

In [7]:
filtered_data.to_excel('filtered_data.xlsx', index=False)


#### This code should be run from the terminal.

In [None]:
import boto3
import os
import wget
from requests_html import HTMLSession
from time import sleep
import pandas as pd

# Initialize HTMLSession
s = HTMLSession()

# Define headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36'
}

# Read Excel file
df = pd.read_excel('filtered_data.xlsx')

# AWS S3 credentials
aws_access_key_id = ''
aws_secret_access_key = ''
aws_bucket_name = 'pdfpatents'

# Initialize S3 client
s3 = boto3.client('s3', aws_access_key_id=aws_access_key_id, aws_secret_access_key=aws_secret_access_key)

# Maximum total size allowed in bytes (4 GB)
max_total_size_bytes = 4 * 1024 * 1024 * 1024
total_uploaded_size_bytes = 0

# List to keep track of uploaded filenames
uploaded_filenames = []

# Loop through the existing files in the S3 bucket and add their names to the list
response = s3.list_objects_v2(Bucket=aws_bucket_name)
if 'Contents' in response:
    for obj in response['Contents']:
        uploaded_filenames.append(obj['Key'])

try:
    for url in df['result link']:  # Assuming 'result link' is the column containing the URLs
        url = url.strip()  # Remove leading/trailing whitespaces
        print(url)
        r = s.get(url, headers=headers)
        # render the JavaScript
        r.html.render(sleep=3, timeout=100)  # Increase timeout value to 100 seconds
        pdf_url = r.html.find('a.style-scope.patent-result', first=True).attrs['href']
        # Extract the filename from the URL
        filename = pdf_url.split('/')[-1]
        # Check if the file has already been uploaded
        if filename in uploaded_filenames:
            print(f"{filename} has already been uploaded. Skipping.")
            continue
        # Define the directory to save the PDFs
        save_directory = r'C:\Users\leila\Desktop\ID2\S4\Big Data Avancee\PatentsData_pdf'
        # Download the PDF
        pdf_path = os.path.join(save_directory, filename)
        os.makedirs(save_directory, exist_ok=True)
        wget.download(pdf_url, out=pdf_path)
        # Get the size of the downloaded PDF
        pdf_size_bytes = os.path.getsize(pdf_path)
        # Check if adding the size of this PDF exceeds the limit
        if total_uploaded_size_bytes + pdf_size_bytes > max_total_size_bytes:
            print("Total size limit reached. Stopping further uploads.")
            break
        # Upload the PDF to S3
        with open(pdf_path, 'rb') as f:
            s3.upload_fileobj(f, aws_bucket_name, filename)
        total_uploaded_size_bytes += pdf_size_bytes
        # Add the filename to the list of uploaded filenames
        uploaded_filenames.append(filename)
        print("Uploaded:", filename)
        sleep(1)
        
except Exception as e:
    print("An error occurred:", e)
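
The script above builds `uploaded_filenames` from a single `list_objects_v2` call, which returns at most 1,000 keys per response. The sketch below gathers keys from several response pages. `collect_keys` is a hypothetical helper, and the fake pages stand in for real S3 responses so the example runs without AWS access.

```python
def collect_keys(pages):
    """Gather object keys from an iterable of list_objects_v2-style response pages."""
    keys = set()
    for page in pages:
        # An empty bucket or prefix yields a page without a 'Contents' entry
        for obj in page.get("Contents", []):
            keys.add(obj["Key"])
    return keys


# Fake pages with hypothetical file names, shaped like S3 listing responses
fake_pages = [
    {"Contents": [{"Key": "first.pdf"}, {"Key": "second.pdf"}]},
    {"Contents": [{"Key": "third.pdf"}]},
    {},
]
print(sorted(collect_keys(fake_pages)))
```

With boto3, the pages could come from `s3.get_paginator('list_objects_v2').paginate(Bucket=aws_bucket_name)`. Using a set also makes the `filename in uploaded_filenames` check faster than a list lookup.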


### Data Lake Setup:

We used AWS cloud storage (S3) to store the PDF files and Snowflake, linked with Azure, to store the metadata.

# FPO Patents

This section scrapes patents from the Free Patents Online (FPO) website using Playwright.

In [2]:
!pip install playwright

Collecting playwright
  Downloading playwright-1.42.0-py3-none-win_amd64.whl.metadata (3.5 kB)
Collecting greenlet==3.0.3 (from playwright)
  Downloading greenlet-3.0.3-cp311-cp311-win_amd64.whl.metadata (3.9 kB)
Collecting pyee==11.0.1 (from playwright)
  Downloading pyee-11.0.1-py3-none-any.whl.metadata (2.7 kB)
Downloading playwright-1.42.0-py3-none-win_amd64.whl (29.4 MB)

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
pyppeteer 1.0.2 requires pyee<9.0.0,>=8.1.0, but you have pyee 11.0.1 which is incompatible.


In [7]:
!pip install pyppeteer

^C


#### **This code should also be run from the terminal**

Here you can also limit the number of result pages that you want to scrape.

In [None]:
import asyncio
from openpyxl import Workbook
from playwright.async_api import async_playwright

async def scrape_patents():
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()

        # Initialize Excel workbook and worksheet
        wb = Workbook()
        ws = wb.active
        ws.append(["Patent Title", "Patent Number", "Abstract"])

        # Start scraping from the first page
        page_number = 1
        while True:
            # Visit the URL
            url = f"https://www.freepatentsonline.com/result.html?p={page_number}&sort=relevance&srch=top&query_txt=virus+engineering&patents_us=on"
            print(f"Scraping page {page_number}: {url}")
            await page.goto(url)

            # Extract patent data
            patents = await page.query_selector_all('#results > div.legacy-container > div > div > table tr')

            # Check if there are no patents on the page
            if not patents:
                print("No patents found on this page. Exiting...")
                break

            for patent in patents:
                # Extract patent title, number, and abstract
                title = await patent.query_selector('td:nth-child(3) a')
                title_text = await title.text_content() if title else ""
                number = await patent.query_selector('td:nth-child(2)')
                number_text = await number.text_content() if number else ""
                abstract = await patent.query_selector('td:nth-child(3)')
                abstract_text = await abstract.text_content() if abstract else ""

                # Append data to the worksheet
                ws.append([title_text.strip(), number_text.strip(), abstract_text.strip()])
                print(f"Row appended: {title_text.strip()} - {number_text.strip()} - {abstract_text.strip()}")

            # Check whether a link to the next page exists
            next_button = await page.query_selector(f'.paginate_spacing a[href="result.html?p={page_number+1}&sort=relevance&srch=top&query_txt=virus+engineering&patents_us=on"]')
            if not next_button:
                print("No more pages to scrape. Exiting...")
                break  # Exit the loop if there's no next link

            # Optional: stop after scraping a fixed number of pages
            if page_number == 10:
                print("Scraping interrupted after 10 pages.")
                break

            # The next iteration navigates with page.goto(), so no click is needed
            page_number += 1

        # Save the Excel file
        wb.save("patents_data.xlsx")
        print("Excel file saved successfully.")

        # Close the browser
        await browser.close()

asyncio.run(scrape_patents())
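
The search URL in the loop above is written out by hand in two places (the `page.goto` call and the next-link selector). As a minimal sketch, the same address could be built from its parts with the standard library. The helper name `build_search_url` and the constant `BASE_URL` are hypothetical and not part of the original notebook; the parameter names are copied from the URL used above.

```python
from urllib.parse import urlencode

# Base address copied from the scraping loop above.
BASE_URL = "https://www.freepatentsonline.com/result.html"


def build_search_url(query: str, page_number: int) -> str:
    """Return the result-page URL for a query and a 1-based page number."""
    params = {
        "p": page_number,
        "sort": "relevance",
        "srch": "top",
        "query_txt": query,
        "patents_us": "on",
    }
    # urlencode escapes values with quote_plus, so spaces become '+'
    return f"{BASE_URL}?{urlencode(params)}"


print(build_search_url("virus engineering", 1))
```

Changing the search terms then only requires a different `query` argument instead of editing the string in several places.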



## WIPO Patents

### For the total pages

In [None]:
from selenium import webdriver
import time
from urllib.parse import quote, urljoin
from bs4 import BeautifulSoup

url = 'https://patentscope.wipo.int/search/en/result.jsf?_vid=P21-LTT2MI-53213'

mots_cles = 'FP:(virus engineering)'
mots_cles_encodes = quote(mots_cles)

# Create the driver before the try block so that finally can always quit it
driver = webdriver.Chrome()

try:

    driver.get(f'{url}&query={mots_cles_encodes}')

    time.sleep(15)

    page_content = driver.page_source
    soup = BeautifulSoup(page_content, 'html.parser')
    total_pages_info = soup.find('span', {'class': 'ps-paginator--page--value'}).text.strip()
    total_pages = int(total_pages_info.split('/')[-1])

    all_links = []

    for page_num in range(total_pages):
        page_content = driver.page_source

        soup = BeautifulSoup(page_content, 'html.parser')

        tbody = soup.find('tbody', {'id': 'resultListForm:resultTable_data'})

        liens_target_self = tbody.find_all('a', {'target': '_self'}, href=True)

        urls_completes = [urljoin(url, lien['href']) for lien in liens_target_self]

        all_links.extend(urls_completes)

        print(f"Links for page {page_num + 1}:")
        for url_complete in urls_completes:
            print(url_complete)

        # Move to the next page unless this was the last one
        if page_num < total_pages - 1:
            next_page_button = driver.find_element('css selector', '.js-paginator-next')

            if next_page_button.is_enabled():
                next_page_button.click()
                time.sleep(5)

finally:
    driver.quit()

print("All links retrieved:")
for link in all_links:
    print(link)
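
The cell above reads the page count by splitting the paginator text on `/` and converting the last part with `int()`. The sketch below isolates that step in a small function. It assumes the paginator text has the form `current / total`, which is what the `split('/')` call implies. Stripping non-digit characters is an extra precaution, on the assumption that large counts might be displayed with a thousands separator. The function name `parse_total_pages` is hypothetical.

```python
import re


def parse_total_pages(paginator_text: str) -> int:
    """Extract the page count from text such as '1 / 25'."""
    # Keep what follows the last '/', then drop everything that is not a digit
    # so that separators such as ',' or spaces do not break int()
    last_part = paginator_text.split("/")[-1]
    digits = re.sub(r"\D", "", last_part)
    if not digits:
        raise ValueError(f"No page count found in {paginator_text!r}")
    return int(digits)


print(parse_total_pages("1 / 25"))
```

Raising `ValueError` on unexpected text makes a changed page layout visible immediately instead of failing later inside the loop.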

### Extracting data from two pages for testing

In [1]:
from selenium import webdriver
import time
from urllib.parse import quote, urljoin
from bs4 import BeautifulSoup

url = 'https://patentscope.wipo.int/search/en/result.jsf?_vid=P21-LTT2MI-53213'

mots_cles = 'FP:(virus engineering)'
mots_cles_encodes = quote(mots_cles)

# Create the driver before the try block so that finally can always quit it
driver = webdriver.Chrome()

try:

    driver.get(f'{url}&query={mots_cles_encodes}')

    time.sleep(15)

    page_content = driver.page_source
    soup = BeautifulSoup(page_content, 'html.parser')
    total_pages_info = soup.find('span', {'class': 'ps-paginator--page--value'}).text.strip()
    total_pages = min(int(total_pages_info.split('/')[-1]), 2)  # Extract data from at most 2 pages

    all_links = []

    for page_num in range(total_pages):  # Run the loop only twice
        page_content = driver.page_source

        soup = BeautifulSoup(page_content, 'html.parser')

        tbody = soup.find('tbody', {'id': 'resultListForm:resultTable_data'})

        liens_target_self = tbody.find_all('a', {'target': '_self'}, href=True)

        urls_completes = [urljoin(url, lien['href']) for lien in liens_target_self]

        all_links.extend(urls_completes)

        print(f"Links for page {page_num + 1}:")
        for url_complete in urls_completes:
            print(url_complete)

        if page_num < total_pages - 1:  # Click next page only if there's more than one page left
            next_page_button = driver.find_element('css selector', '.js-paginator-next')

            if next_page_button.is_enabled():
                next_page_button.click()
                time.sleep(5)

finally:
    driver.quit()

print("All links retrieved:")
for link in all_links:
    print(link)


Links for page 1:
https://patentscope.wipo.int/search/en/detail.jsf?docId=WO2022211482&_cid=P20-LTVSP4-52847-1
https://patentscope.wipo.int/search/en/detail.jsf?docId=CN133674310&_cid=P20-LTVSP4-52847-1
https://patentscope.wipo.int/search/en/detail.jsf?docId=WO2012122649&_cid=P20-LTVSP4-52847-1
https://patentscope.wipo.int/search/en/detail.jsf?docId=CN132798540&_cid=P20-LTVSP4-52847-1
https://patentscope.wipo.int/search/en/detail.jsf?docId=CN133413931&_cid=P20-LTVSP4-52847-1
https://patentscope.wipo.int/search/en/detail.jsf?docId=CN308335805&_cid=P20-LTVSP4-52847-1
https://patentscope.wipo.int/search/en/detail.jsf?docId=CN137631414&_cid=P20-LTVSP4-52847-1
https://patentscope.wipo.int/search/en/detail.jsf?docId=CN82798930&_cid=P20-LTVSP4-52847-1
https://patentscope.wipo.int/search/en/detail.jsf?docId=JP273943820&_cid=P20-LTVSP4-52847-1
https://patentscope.wipo.int/search/en/detail.jsf?docId=CN194517224&_cid=P20-LTVSP4-52847-1
Links for page 2:
https://patentscope.wipo.int/search/en/deta
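
Each link printed above carries a `docId` parameter and a `_cid` parameter. The `_cid` value looks like a per-search token (an assumption based on its shape, not something the notebook confirms), so the `docId` is probably the more stable key for spotting duplicates across runs. A minimal sketch using only the standard library follows; the function names are hypothetical.

```python
from urllib.parse import parse_qs, urlparse


def extract_doc_id(link):
    """Return the docId query parameter of a detail link, or None if absent."""
    values = parse_qs(urlparse(link).query).get("docId")
    return values[0] if values else None


def unique_by_doc_id(links):
    """Keep the first link seen for each docId, preserving the input order."""
    seen = set()
    unique = []
    for link in links:
        # Fall back to the full link when no docId is present
        key = extract_doc_id(link) or link
        if key not in seen:
            seen.add(key)
            unique.append(link)
    return unique


sample = [
    "https://patentscope.wipo.int/search/en/detail.jsf?docId=WO2022211482&_cid=P20-LTVSP4-52847-1",
    "https://patentscope.wipo.int/search/en/detail.jsf?docId=CN133674310&_cid=P20-LTVSP4-52847-1",
]
print([extract_doc_id(link) for link in sample])
```

Applying `unique_by_doc_id(all_links)` before visiting the detail pages would avoid opening the same document twice.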

In [6]:
liens_finaux = all_links

In [7]:
from selenium import webdriver
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import time
import pandas as pd

def extraire_informations(lien_final, liste_dictionnaires=None):
    """Scrape the bibliographic fields of one PATENTSCOPE detail page."""
    if liste_dictionnaires is None:
        liste_dictionnaires = []

    driver = webdriver.Chrome()

    try:
        driver.get(lien_final)
        time.sleep(10)  # Wait for the page to finish rendering
        page_content = driver.page_source
        soup = BeautifulSoup(page_content, 'html.parser')

        div_ps_panel = soup.find('div', {'class': 'ps-panel--content font-size--small'})

        if div_ps_panel:
            data = {}
            balises_ps_field = div_ps_panel.find_all('div', {'class': 'ps-field ps-biblio-field'})

            for balise in balises_ps_field:
                label = balise.find('span', {'class': 'ps-field--label ps-biblio-field--label'})
                valeur = balise.find('span', {'class': 'ps-field--value ps-biblio-field--value'})

                if label and valeur:
                    label_text = label.text.strip()

                    if label_text == 'Related patent documents':
                        # Store related documents as a list of absolute URLs
                        liens_documents = valeur.find_all('a')
                        data[label_text] = [urljoin(lien_final, lien_document['href']) for lien_document in liens_documents]
                    else:
                        data[label_text] = valeur.text.strip()

            liste_dictionnaires.append(data)

    finally:
        driver.quit()

    return liste_dictionnaires


liste_resultats = []

# Visit every detail link and accumulate one dictionary per patent
for lien_final in liens_finaux:
    liste_resultats = extraire_informations(lien_final, liste_dictionnaires=liste_resultats)

for result in liste_resultats:
    print(result)

{'Publication Number': 'WO/2022/211482', 'Publication Date': '06.10.2022', 'International Application No.': 'PCT/KR2022/004491', 'International Filing Date': '30.03.2022', 'Applicants': '충남대학교 산학협력단 THE INDUSTRY & ACADEMIC COOPERATION IN CHUNGNAM NATIONAL UNIVERSITY (IAC)\n\t\t\t\t\t\t\t[KR]/[KR]', 'Inventors': '신현진 SHIN, Hyun-Jin\n\t\t\t\t\t\t\t\n\n\n유지훈 RYU, Jihoon\n\t\t\t\t\t\t\t\n\n\n박정은 PARK, Jungeun', 'Agents': '리앤목특허법인 Y.P.LEE, MOCK & PARTNERS', 'Priority Data': '10-2021-004202731.03.2021KR10-2021-014399926.10.2021KR', 'Publication Language': 'Korean (ko)', 'Filing Language': 'Korean (ko)', 'Designated States': 'View all\n\t\t\t\t\t\t\t\n\n\n\nAE, AG, AL, AM, AO, AT, AU, AZ, BA, BB, BG, BH, BN, BR, BW, BY, BZ, CA, CH, CL, CN, CO, CR, CU, CZ, DE, DJ, DK, DM, DO, DZ, EC, EE, EG, ES, FI, GB, GD, GE, GH, GM, GT, HN, HR, HU, ID, IL, IN, IR, IS, IT, JM, JO, JP, KE, KG, KH, KN, KP, KW, KZ, LA, LC, LK, LR, LS, LU, LY, MA, MD, ME, MG, MK, MN, MW, MX, MY, MZ, NA, NG, NI, NO, NZ, OM, PA, P
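
The scraped values above still contain runs of newlines and tabs left over from the page layout (for example between inventor names). A minimal sketch of a cleanup step, assuming each record is a dictionary whose values are either strings or lists, as produced by `extraire_informations`. The helper names `clean_value` and `clean_record` and the `sep` parameter are hypothetical.

```python
def clean_value(value, sep=" "):
    """Collapse layout whitespace in a scraped string; leave other types as-is."""
    if not isinstance(value, str):
        return value
    # Split on line breaks, trim each piece, and drop the empty pieces
    parts = [part.strip() for part in value.splitlines()]
    return sep.join(part for part in parts if part)


def clean_record(record, sep=" "):
    """Apply clean_value to every value of one scraped record."""
    return {key: clean_value(value, sep) for key, value in record.items()}


raw = {"Inventors": "SHIN, Hyun-Jin\n\t\t\t\t\t\t\t\n\n\nRYU, Jihoon"}
print(clean_record(raw, sep="; "))
```

A separator such as `"; "` keeps multi-valued fields like inventors distinguishable, while the default single space suits fields where the line break is only cosmetic.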

In [9]:
cles_distinctes = list(set(key for dictionnaire in liste_resultats for key in dictionnaire.keys()))

print("Distinct keys extracted: ", cles_distinctes)

Distinct keys extracted:  ['', 'Application Date', 'F-term', 'Agents', 'Grant Number', 'Publication Language', 'Filing Language', 'Abstract', 'FI', 'Application Number', 'Publication Kind', 'Grant Date', 'Inventors', 'Title', 'CPC', 'Publication Number', 'Applicants', 'International Filing Date', 'Designated States', 'Publication Date', 'Related patent documents', 'Office', 'Priority Data', 'International Application No.']


In [12]:
import pandas as pd

# cles_distinctes and liste_resultats come from the previous cells.
# Building the frame in one call avoids repeated pd.concat inside a loop;
# keys missing from a record are filled with NaN.
Wipodata = pd.DataFrame(liste_resultats, columns=cles_distinctes)


In [13]:
Wipodata

Unnamed: 0,Unnamed: 1,Application Date,F-term,Agents,Grant Number,Publication Language,Filing Language,Abstract,FI,Application Number,...,CPC,Publication Number,Applicants,International Filing Date,Designated States,Publication Date,Related patent documents,Office,Priority Data,International Application No.
0,,,,"리앤목특허법인 Y.P.LEE, MOCK & PARTNERS",,Korean (ko),Korean (ko),(EN) The present invention relates to a virus ...,,,...,,WO/2022/211482,충남대학교 산학협력단 THE INDUSTRY & ACADEMIC COOPERATIO...,30.03.2022,"View all\n\t\t\t\t\t\t\t\n\n\n\nAE, AG, AL, AM...",06.10.2022,[https://patentscope.wipo.int/search/en/detail...,,10-2021-004202731.03.2021KR10-2021-014399926.1...,PCT/KR2022/004491
1,,05.12.2014,,,104610455.0,,,(EN) The objective of the invention is to prov...,,201410729749.X,...,,104610455,QINGDAO AGRICULTURAL UNIVERSITY青岛农业大学,,,13.05.2015,,China,2014105560974 20.10.2014 CN,
2,,,,"SILVER, Gail C.",,English (en),English (en),(EN) The present invention relates to recombin...,,,...,,WO/2012/122649,OTTAWA HOSPITAL RESEARCH INSTITUTE\n\t\t\t\t\t...,14.03.2012,"View all\n\t\t\t\t\t\t\t\n\n\n\nAE, AG, AL, AM...",20.09.2012,,,"61/452,85315.03.2011US",PCT/CA2012/050153
3,,24.11.2014,,jiu limeng,,,,(EN) The invention provides a set of specific ...,,201410683195.4,...,C12Q 1/6834\n\t\t\t\t\t\n\n\n\n\n\n\nC12Q 1/70...,104498622,HUBEI XINZONGKE VIRUS DISEASE ENGINEERING TECH...,,,08.04.2015,,China,,
4,,20.11.2014,,,104372013.0,,,(EN) The invention aims to provide a duck hepa...,,201410668062.X,...,,104372013,青岛宏昊生物科技有限公司,,,25.02.2015,,China,,
5,,24.07.2020,,南京利丰知识产权代理事务所(特殊普通合伙) 32256,111729078.0,,,(EN) The invention discloses a chicken infecti...,,202010720531.3,...,A61K 39/12\n\t\t\t\t\t\n\n\n\n\n\n\nA61K 2039/...,111729078,"SUZHOU SHINUO BIOTECHNOLOGY CO., LTD.苏州世诺生物技术有限公司",,,02.10.2020,,China,,
6,,06.01.2015,,,104628865.0,,,(EN) The invention relates to preparation and ...,,201510009086.9,...,,104628865,青岛明勤生物科技有限公司,,,20.05.2015,,China,,
7,,23.07.2004,,,,,,(EN) \nHuman anti-hepatitis virus gene enginee...,,200410070620.9,...,,1605628,National Institute for Viral Disease Control a...,,,13.04.2005,,China,,
8,,19.12.2014,4B024AA01\n \n\n\n4B024BA32\n ...,村山　靖彦志賀　正武渡邊　隆実広　信哉,,,,(JA) 【課題】復帰の可能性が実質的になく、したがって、速く、効率的で、かつ安全なワクチン...,A61K 39/13\n \n\n\nA61K 39/155...,2014257141,...,A61K 2039/5254\n\t\t\t\t\t\n\n\n\n\n\n\nC12N 2...,2015091247,ザ・リサーチ・ファウンデーション・フォー・ザ・ステート・ユニヴァーシティー・オブ・ニュー・ヨーク,,,14.05.2015,[https://patentscope.wipo.int/search/en/detail...,Japan,"60/909,389 30.03.2007 US61/068,666 07.03.2008 US",
9,,28.10.2016,,北京科亿知识产权代理事务所(普通合伙) 11350,106526108.0,,,(EN) The invention aims to provide a test meth...,,201610962733.2,...,G01N 33/15\n\t\t\t\t\t\n\n\n\n\n\n\n\n\nG01N 3...,106526108,"YEBIO BIOENGINEERING CO., LTD. OF QINGDAO青岛易邦生...",,,22.03.2017,,China,,
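
The date columns in the table above hold strings in day.month.year form (for example `30.03.2022`), which do not sort chronologically as text. A minimal sketch of a converter to ISO format using only the standard library follows. The function name `to_iso_date` is hypothetical, and the `%d.%m.%Y` format is inferred from the values shown in the output.

```python
from datetime import datetime


def to_iso_date(value):
    """Convert 'dd.mm.yyyy' to 'yyyy-mm-dd'; return None when that is not possible."""
    if not isinstance(value, str) or not value.strip():
        return None
    try:
        return datetime.strptime(value.strip(), "%d.%m.%Y").date().isoformat()
    except ValueError:
        # Leave unparseable values empty rather than guessing
        return None


print(to_iso_date("06.10.2022"))
```

It could be applied column by column, for instance with `Wipodata["Publication Date"].map(to_iso_date)`.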


In [14]:
Wipodata.isna().sum()

                                  0
Application Date                  2
F-term                           19
Agents                            4
Grant Number                     10
Publication Language             18
Filing Language                  18
Abstract                          0
FI                               19
Application Number                2
Publication Kind                  2
Grant Date                       10
Inventors                         0
Title                             0
CPC                               8
Publication Number                0
Applicants                        0
International Filing Date        18
Designated States                16
Publication Date                  0
Related patent documents         15
Office                            2
Priority Data                    13
International Application No.    18
dtype: int64

In [16]:
Wipodata.to_excel('WIPO_PatentsData.xlsx', index=False)
print("The data was successfully saved to the Excel file 'WIPO_PatentsData.xlsx'.")

The data was successfully saved to the Excel file 'WIPO_PatentsData.xlsx'.


## **Conclusion**
This notebook provides a practical demonstration of how to scrape patent data from various sources using Python. By leveraging web scraping techniques and APIs, researchers, analysts, and intellectual property professionals can gather valuable insights from patent databases for analysis and decision-making purposes.