<a href="https://colab.research.google.com/github/paulowiz/uff_engenharia_de_dados_com_python/blob/master/uff_engenharia_de_dados_com_python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hands-on ETL Tutorial (Extraction, Transformation, Loading)

## Data Extraction, Transformation, and Loading


### Problem to solve:

A company's innovation office needs a data engineer to build a new data pipeline that gives the team access to the new patent files published by the United States Patent and Trademark Office (USPTO). With this data, the team can download each file easily through its link and check how many files are processed daily.

# Data extraction

Extraction is the step that analyzes the data provider and the types of data to be processed, then connects to that provider through a REST API, web scraping, SOAP, and so on.

In our case the provider is the USPTO, and we have no API from them for consuming this data, so we will use web scraping to collect the data directly from the website.
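To make the idea of web scraping concrete before the full class, here is a minimal sketch that pulls the cells out of an HTML table. It uses only the standard library's `html.parser` so it runs without a network connection or extra packages; the notebook itself uses BeautifulSoup. The markup in `SAMPLE_HTML` and the class name `TableRowParser` are assumptions made for this illustration, and the real USPTO page may be structured differently.

```python
from html.parser import HTMLParser

# Hypothetical listing markup, written only for this sketch.
SAMPLE_HTML = """
<table>
  <tr><th>Name</th><th>Size</th><th>Published</th></tr>
  <tr><td><a href="ipa100107.zip">ipa100107.zip</a></td><td>89258845</td><td>2010-01-07 02:00</td></tr>
  <tr><td><a href="ipa100114.zip">ipa100114.zip</a></td><td>94854797</td><td>2010-01-14 02:00</td></tr>
</table>
"""


class TableRowParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []         # finished rows, each a list of cell strings
        self._row = None       # cells of the row being read, or None
        self._in_cell = False  # True while the parser is inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            if self._row:  # header rows have no <td> cells, so skip them
                self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data


parser = TableRowParser()
parser.feed(SAMPLE_HTML)
for filename, size, published in parser.rows:
    print(filename, size, published)
```

Each scraped row is a plain list of strings, which is the same kind of raw material the class below turns into dictionaries.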

In [11]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from sqlalchemy import create_engine


class Uspto:
    """Extracts information about the patent files published by the USPTO."""

    def __init__(self, year):
        """
        Store the reference year and build the URL of that year's listing page.
        """
        try:
            self.year = int(year)
        except (TypeError, ValueError):
            # __init__ must return None, so bad input is reported with an exception.
            raise ValueError('Year must be a number')
        self.link = 'https://bulkdata.uspto.gov/data/patent/application/redbook/fulltext/' + str(self.year) + '/'

    def get_uspto_files_information(self):
        """Scrape the year's listing page and return one dict per file."""
        headers = {
            "User-Agent":
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
        }

        # Fail on HTTP errors instead of parsing an error page.
        page = requests.get(self.link, headers=headers, timeout=30)
        page.raise_for_status()

        soup = BeautifulSoup(page.text, 'html.parser')

        # The code treats the last table on the page as the file listing.
        table = soup.find_all('table')[-1]
        arr_files = []
        for tr in table.find_all('tr'):
            tds = tr.find_all('td')
            try:
                dict_temp = {'filename': tds[0].text, 'size': tds[1].text, 'publish_on': tds[2].text, 'url': self.link + tds[0].find('a')['href']}
                arr_files.append(dict_temp)
            except (IndexError, TypeError, KeyError):
                # Skip rows that lack the three expected cells or the link.
                continue

        return arr_files

    def transform_and_load(self, files):
        """Put the scraped records in a DataFrame and append them to SQLite."""
        df = pd.DataFrame(files)
        engine = create_engine('sqlite:///patent.db', echo=False)
        df.to_sql('uspto_files', con=engine, if_exists='append')

### Calling the extraction class we created above

In [12]:
for year in range(2010,2022):
    uspto = Uspto(year)
    files = uspto.get_uspto_files_information()
    print(files)
    uspto.transform_and_load(files)
    

[{'filename': 'ipa100107.zip', 'size': '89258845', 'publish_on': '2010-01-07 02:00', 'url': 'https://bulkdata.uspto.gov/data/patent/application/redbook/fulltext/2010/ipa100107.zip'}, {'filename': 'ipa100114.zip', 'size': '94854797', 'publish_on': '2010-01-14 02:00', 'url': 'https://bulkdata.uspto.gov/data/patent/application/redbook/fulltext/2010/ipa100114.zip'}, {'filename': 'ipa100121.zip', 'size': '103280203', 'publish_on': '2010-01-21 02:00', 'url': 'https://bulkdata.uspto.gov/data/patent/application/redbook/fulltext/2010/ipa100121.zip'}, {'filename': 'ipa100128.zip', 'size': '99827487', 'publish_on': '2010-01-28 02:00', 'url': 'https://bulkdata.uspto.gov/data/patent/application/redbook/fulltext/2010/ipa100128.zip'}, {'filename': 'ipa100204.zip', 'size': '116724577', 'publish_on': '2010-02-04 02:00', 'url': 'https://bulkdata.uspto.gov/data/patent/application/redbook/fulltext/2010/ipa100204.zip'}, {'filename': 'ipa100211.zip', 'size': '93250275', 'publish_on': '2010-02-11 02:00', 'ur

In [13]:
import pandas as pd

# Importing pandas and loading the collected data into a DataFrame

In [14]:
df = pd.DataFrame(files)
df.head(10)

| | filename | size | publish_on | url |
|---|---|---|---|---|
| 0 | ipa210107.zip | 147841751 | 2021-01-07 00:01 | https://bulkdata.uspto.gov/data/patent/applica... |
| 1 | ipa210114.zip | 156042926 | 2021-01-14 00:01 | https://bulkdata.uspto.gov/data/patent/applica... |
| 2 | ipa210121.zip | 146902825 | 2021-01-21 00:01 | https://bulkdata.uspto.gov/data/patent/applica... |
| 3 | ipa210128.zip | 154837377 | 2021-01-28 00:01 | https://bulkdata.uspto.gov/data/patent/applica... |
| 4 | ipa210204.zip | 154202636 | 2021-02-04 00:01 | https://bulkdata.uspto.gov/data/patent/applica... |
| 5 | ipa210211.zip | 155928214 | 2021-02-11 00:01 | https://bulkdata.uspto.gov/data/patent/applica... |
| 6 | ipa210218.zip | 136650642 | 2021-02-18 00:01 | https://bulkdata.uspto.gov/data/patent/applica... |
| 7 | ipa210225.zip | 148148145 | 2021-02-25 00:01 | https://bulkdata.uspto.gov/data/patent/applica... |
| 8 | ipa210304.zip | 176522001 | 2021-03-04 00:01 | https://bulkdata.uspto.gov/data/patent/applica... |
| 9 | ipa210311.zip | 171245107 | 2021-03-11 00:02 | https://bulkdata.uspto.gov/data/patent/applica... |
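In the preview above, `size` and `publish_on` are still the strings scraped from the page. A minimal sketch of a transformation step follows, assuming that `size` is a byte count and that `publish_on` always follows the `YYYY-MM-DD HH:MM` pattern seen in the preview. The helper name `to_typed` is hypothetical and is not part of the notebook's class.

```python
from datetime import datetime

# Two records with values copied from the preview above.
raw_files = [
    {"filename": "ipa210107.zip", "size": "147841751", "publish_on": "2021-01-07 00:01"},
    {"filename": "ipa210114.zip", "size": "156042926", "publish_on": "2021-01-14 00:01"},
]


def to_typed(record):
    """Return a copy of the record with size as int and publish_on as datetime."""
    typed = dict(record)
    typed["size"] = int(record["size"].strip())
    typed["publish_on"] = datetime.strptime(record["publish_on"].strip(), "%Y-%m-%d %H:%M")
    return typed


typed_files = [to_typed(record) for record in raw_files]

# With numeric sizes, totals become a one-line computation.
total_mib = sum(item["size"] for item in typed_files) / 2**20
print(f"{len(typed_files)} files, {total_mib:.1f} MiB in total")
```

Typed values make later steps such as sorting by date or summing sizes straightforward, whereas comparing the raw strings would give misleading results for sizes of different lengths.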


### Loading the data into SQLite

In [15]:
from sqlalchemy import create_engine
import pandas as pd

engine = create_engine('sqlite:///patent.db', echo=False)
df.to_sql('uspto_files', con=engine, if_exists='append')
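Because `if_exists='append'` adds rows on every call, loading the same data twice stores each file twice. The sketch below shows one possible way to keep the load repeatable and then count files per publication day, which is the figure the problem statement asks for. It uses the standard library's `sqlite3` with an in-memory database; the table name `uspto_files_dedup` and the choice of `filename` as a unique key are assumptions made for this illustration.

```python
import sqlite3

# Sample rows with values copied from the DataFrame preview above.
sample_rows = [
    ("ipa210107.zip", 147841751, "2021-01-07 00:01"),
    ("ipa210114.zip", 156042926, "2021-01-14 00:01"),
]

conn = sqlite3.connect(":memory:")  # in-memory database, for the sketch only
conn.execute(
    """
    CREATE TABLE uspto_files_dedup (
        filename   TEXT PRIMARY KEY,  -- assumed to identify a file uniquely
        size       INTEGER,
        publish_on TEXT
    )
    """
)

# Run the load twice: INSERT OR IGNORE skips rows whose key already exists.
insert_sql = "INSERT OR IGNORE INTO uspto_files_dedup VALUES (?, ?, ?)"
for _ in range(2):
    conn.executemany(insert_sql, sample_rows)
conn.commit()

# Count how many files were published on each day.
per_day = conn.execute(
    """
    SELECT date(publish_on) AS day, COUNT(*) AS n_files
    FROM uspto_files_dedup
    GROUP BY day
    ORDER BY day
    """
).fetchall()
print(per_day)
```

The primary key is what makes the second load a no-op; without it, the count per day would double after every rerun.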

In [16]:
df['url'][0]

'https://bulkdata.uspto.gov/data/patent/application/redbook/fulltext/2021/ipa210107.zip'

In [17]:
!wget https://bulkdata.uspto.gov/data/patent/application/redbook/fulltext/2022/ipa220106.zip

--2022-10-17 09:01:35--  https://bulkdata.uspto.gov/data/patent/application/redbook/fulltext/2022/ipa220106.zip
Resolving bulkdata.uspto.gov (bulkdata.uspto.gov)... 151.207.240.28, 2610:20:5004:1604::28
Connecting to bulkdata.uspto.gov (bulkdata.uspto.gov)|151.207.240.28|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 163074754 (156M) [application/zip]
Saving to: ‘ipa220106.zip’



In [18]:
!unzip ipa220106.zip -a

Archive:  ipa220106.zip
  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.
unzip:  cannot find zipfile directory in one of ipa220106.zip or
        ipa220106.zip.zip, and cannot find ipa220106.zip.ZIP, period.
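The error above means that `unzip` could not find the archive's central directory, which sits at the end of a ZIP file. One possible cause, which this notebook does not confirm, is an incomplete download. A minimal sketch of checking a file before extracting it follows, using the standard library's `zipfile`. The helper name `looks_complete` is hypothetical, and the demonstration runs on throwaway files rather than on the real download.

```python
import os
import tempfile
import zipfile


def looks_complete(path, expected_size=None):
    """Return True when the file has the expected size (if given) and is a ZIP archive."""
    if expected_size is not None and os.path.getsize(path) != expected_size:
        return False
    return zipfile.is_zipfile(path)


# Build a small valid archive to test against.
workdir = tempfile.mkdtemp()
good_path = os.path.join(workdir, "good.zip")
with zipfile.ZipFile(good_path, "w") as archive:
    archive.writestr("sample.xml", "<?xml version='1.0'?><doc/>")

# Simulate an interrupted download by keeping only the first half of the bytes.
truncated_path = os.path.join(workdir, "truncated.zip")
with open(good_path, "rb") as source, open(truncated_path, "wb") as target:
    data = source.read()
    target.write(data[: len(data) // 2])

print("good archive:", looks_complete(good_path))
print("truncated archive:", looks_complete(truncated_path))
```

For the file downloaded in this notebook, `expected_size` could be taken from the `Length: 163074754` line that `wget` printed.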


In [19]:
!mkdir -p xml-patents/ipa220106

In [20]:
!csplit -s -f 'xml-patents/ipa220106/ipa220106-' -b '%02d.xml' ipa220106.xml '/^<?xml /' '{*}'

csplit: cannot open 'ipa220106.xml' for reading: No such file or directory
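The `csplit` command above is meant to cut one large file into pieces, starting a new piece at every line that begins with `<?xml `. A minimal Python sketch of the same idea follows, run on a made-up string of two concatenated documents. The function name `split_xml_documents` and the tag names in `sample` are hypothetical, and the sketch is not a drop-in replacement for `csplit`.

```python
def split_xml_documents(text):
    """Split concatenated XML into one string per document.

    A new document starts at every line that begins with '<?xml '.
    """
    documents = []
    current = []
    for line in text.splitlines(keepends=True):
        # Close the piece being built when a new XML declaration appears.
        if line.startswith("<?xml ") and current:
            documents.append("".join(current))
            current = []
        current.append(line)
    if current:
        documents.append("".join(current))
    return documents


# Made-up input: two tiny XML documents written one after the other.
sample = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<doc><title>First</title></doc>\n"
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<doc><title>Second</title></doc>\n"
)

pieces = split_xml_documents(sample)
for index, document in enumerate(pieces):
    print(f"piece {index:02d}: {len(document.splitlines())} lines")
```

Each piece is then a well-formed document on its own, which is what allows a standard XML parser to read it.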


In [21]:
import pandas as pd
import xml.etree.ElementTree as et

xtree = et.parse("/content/xml-patents/ipa220106/ipa220106-05.xml")
xroot = xtree.getroot()

df_cols = ["doc-number", "invention_title", "abstract", "published_date"]
rows = []

for node in xroot:
    s_doc_number = node.attrib.get("doc-number")
    # findtext() returns None when the child element is missing,
    # so a node without one of these children does not raise.
    s_title = node.findtext("invention-title")
    s_abstract = node.findtext("abstract")
    s_publish_date = node.findtext("date")

    # The keys must match df_cols; unmatched columns would be filled with NaN.
    rows.append({"doc-number": s_doc_number, "invention_title": s_title,
                 "abstract": s_abstract, "published_date": s_publish_date})

out_df = pd.DataFrame(rows, columns=df_cols)

FileNotFoundError: ignored

In [None]:
df.head()