<a href="https://colab.research.google.com/github/patrickerson/corpus/blob/main/corpus.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Patrickerson dos Santos Veiga

Class that models the data used for web scraping. The instances below are each defined by the site name, URL, tag, and class of the content, in that order.

In [4]:
class WebscrappingModal:

    def __init__(self, name, url, content_tag, content_class):
        self.name=name
        self.url=url
        self.content_tag=content_tag
        self.content_class=content_class

    def append_df(self, df):
        self.df = df

    
    def set_sents_array(self, sents_array):
        self.sents_array = sents_array

In [5]:
Abs = WebscrappingModal(
    name="abs",
    url="https://www.abs.gov.au/websitedbs/D3310114.nsf/home/Basic+Survey+Design+-+Data+Processing",
    content_tag='div',
    content_class="content"
)

In [6]:
Helpscout = WebscrappingModal(
    name="helpscout",
    url="https://www.helpscout.com/company/legal/dpa/",
    content_tag='div',
    content_class="Contentstyles__ContentDIV-sc-7tdxle-0 dzbACc"
)

In [7]:
Integrate = WebscrappingModal(
    name="integrate",
    url="https://www.integrate.io/blog/the-5-types-of-data-processing/",
    content_tag='article',
    content_class="container-fluid integrateio-blog-post-content"
)

In [8]:
Peda = WebscrappingModal(
    name="peda",
    url="https://peda.net/kenya/ass/subjects2/computer-studies/form-3/data-processing",
    content_tag='article',
    content_class="textmodule document uuid-199b3e82-3256-11e7-bd46-d102fbf45fbc enclose"
)

In [9]:
Simplilearn = WebscrappingModal(
    name="simplilearn",
    url="https://www.simplilearn.com/what-is-data-processing-article",
    content_tag='article',
    content_class="desig_author empty-text"
)
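
The record that each instance above carries can also be sketched as a dataclass. This is only an illustration of the data structure, not the class the notebook uses: `SiteModel`, the `example.com` URL, and the `post-body` class name are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SiteModel:
    """Hypothetical stand-in for WebscrappingModal (not the notebook's class)."""
    name: str           # short site name, also used as a file name stem
    url: str            # page to scrape
    content_tag: str    # HTML tag that wraps the main content
    content_class: str  # CSS class of that tag ("" when the tag alone is enough)
    # Filled in later by the scraper, so it starts empty.
    sents_array: List[str] = field(default_factory=list)


# Illustrative values only; example.com is a placeholder URL.
example = SiteModel(
    name="example",
    url="https://example.com/article",
    content_tag="article",
    content_class="post-body",
)
example.sents_array = ["First sentence.", "Second sentence."]
```

Using `default_factory=list` gives every instance its own list, so sentences stored for one site never leak into another.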

Class that joins the models with the controller (Scrapper), holding the contents that will be analyzed later.

In [10]:
class Middleware:
    
    contents = [
        Peda,
        Simplilearn,
        Integrate,
        Helpscout,
        Abs
    ]

In [299]:
class Content:
  parser="html.parser"
  def __init__(self, content):
     """
     Load the content described by a WebscrappingModal.

     Parameters
     ----------
     content : WebscrappingModal
         Content to be loaded.
     """
     self.content = content

  def get_text(self):
        """
        Scrape the text of the page described by the content.

        The HTML is read from a local file named after the content. If that
        file does not exist, save_html issues a GET request to the URL
        defined in the content and stores the response on disk as UTF-8.

        The scrape looks for the tag and class defined in the content.

        Every script tag is removed from the document first, so scripts
        are never read as text.

        Returns
        -------
        str
            Extracted text.
        """
        
        try:
          with open(self.content.name + ".html","r", encoding='utf-8') as file:
            html_text = file.read()
        except FileNotFoundError:
          # save_html takes no arguments; it reads the URL from self.content
          self.save_html()
          with open(self.content.name + ".html","r", encoding='utf-8') as file:
            html_text = file.read()
        soup = BeautifulSoup(html_text, self.parser)
        for s in soup.select('script'):
          s.extract()
        if self.content.content_class == "":
            # No class given: take the first element with the tag
            find = soup.find(self.content.content_tag)
        else:
            find = soup.find(self.content.content_tag, class_=self.content.content_class)
        return find.text.replace("\n\n", "\n").replace("\xa0", " ")
      
  def save_html(self):
      """
      Request the page and save it to a .html file, which speeds up
      development and avoids being blocked by firewalls.
      """
      html_text = get(self.content.url)
      # Decode the response as UTF-8, matching the file written below
      html_text.encoding = 'utf-8'
      with open(self.content.name + ".html", "w",encoding='utf-8') as file:
        file.write(html_text.text)
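
The read-from-disk, fetch-on-miss pattern used by `get_text` and `save_html` can be sketched on its own as follows. This is a minimal sketch using only the standard library: `load_html`, `fake_fetch`, and the `cache_dir` parameter are hypothetical names, and the download step is passed in as a function so the example runs without network access.

```python
import tempfile
from pathlib import Path
from typing import Callable, List


def load_html(name: str, url: str, fetch: Callable[[str], str],
              cache_dir: Path = Path(".")) -> str:
    """Return cached HTML for `name`, fetching and saving it on a cache miss."""
    path = cache_dir / f"{name}.html"
    if not path.exists():
        # Cache miss: download once and store the page on disk.
        path.write_text(fetch(url), encoding="utf-8")
    return path.read_text(encoding="utf-8")


calls: List[str] = []


def fake_fetch(url: str) -> str:
    """Stand-in for a real HTTP request; records each call."""
    calls.append(url)
    return "<html><body>hello</body></html>"


with tempfile.TemporaryDirectory() as tmp:
    first = load_html("demo", "https://example.com", fake_fetch, Path(tmp))
    second = load_html("demo", "https://example.com", fake_fetch, Path(tmp))
```

The second call is served from the file, so `fake_fetch` runs only once. In the notebook's setting the fetch function would wrap `requests.get(url).text`.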

In [300]:
from bs4 import BeautifulSoup
from requests import get
import spacy
import pandas as pd
from spacy.language import Language
class Scrapper:
    nlp=spacy.load("en_core_web_sm")
    contents = Middleware.contents
    encode = 'utf-8'

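    def __init__(self):
      # nlp is shared at class level, so add the component only once
      if "set_start_setence" not in self.nlp.pipe_names:
        self.nlp.add_pipe("set_start_setence", before="parser")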


    def load_contents(self):
        """
        Load every content model listed in contents.
        """
        for content in self.contents:
            
            self.load_content(content)

    @Language.component("set_start_setence")
    def set_start_setence(doc):
      """
      Mark the token that follows each punctuation mark or newline
      as the start of a new sentence.
      """
      pouncts = ["!", "?", ",", ";", ".","\n"]
      for token in doc[:-1]:
          if token.text in pouncts:
              doc[token.i+1].is_sent_start = True
      return doc

    def load_content(self, content):
        """
        Load the content described by a WebscrappingModal and store
        its sentences on the model.

        Parameters
        ----------
        content : WebscrappingModal
            Content to be loaded.
        """
        c = Content(content)
        text = c.get_text()
        
        doc = self.nlp(text.strip())
        sents_array = [sent.text for sent in doc.sents]
        void_arg = lambda arg: arg != "" and arg != " "
        content.set_sents_array(list(filter(void_arg, sents_array)))


    def view_sents_array(self):
      """
      Print the sentence array of each content.
      """
      for i in self.contents:
         print(i.sents_array)
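
The boundary rule applied by `set_start_setence` can be illustrated without spaCy. The sketch below is a plain-Python approximation working on a pre-tokenized list, not the pipeline component itself: `split_on_boundaries` and `drop_blank` are hypothetical helpers, and `drop_blank` uses `str.strip()`, which is stricter than the notebook's filter (that one only removes `""` and `" "`).

```python
from typing import List

# Same punctuation list the component uses to open a new sentence.
BOUNDARIES = {"!", "?", ",", ";", ".", "\n"}


def split_on_boundaries(tokens: List[str]) -> List[str]:
    """Group tokens into sentences; the token after a boundary starts a new one."""
    sentences: List[str] = []
    current: List[str] = []
    for token in tokens:
        current.append(token)
        if token in BOUNDARIES:
            # The boundary token closes the current sentence.
            sentences.append(" ".join(current))
            current = []
    if current:
        # Trailing tokens without a closing boundary form the last sentence.
        sentences.append(" ".join(current))
    return sentences


def drop_blank(sentences: List[str]) -> List[str]:
    """Discard sentences that contain only whitespace, such as a lone newline."""
    return [s for s in sentences if s.strip()]


tokens = ["Data", "is", "collected", ",", "then", "processed", ".", "\n", "Done"]
sentences = drop_blank(split_on_boundaries(tokens))
print(sentences)
# ['Data is collected ,', 'then processed .', 'Done']
```

Because the comma is in the boundary list, clauses are split as well as full sentences, which matches the short fragments visible in the printed arrays.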
          

In [301]:
# The sentence-start component is added to the pipeline in Scrapper.__init__
scrapper = Scrapper()

In [302]:
scrapper.load_contents()

In [303]:
scrapper.view_sents_array()

['DATA PROCESSING CYCLE\n', 'Introduction\n Data procesing refers to the transformating raw data into meaningful output.', '\n', 'Data can be done manually using a pen and paper,', 'mechanically using simple devices eg typewritter or electronically using modern dat processing toolseg computers \n', 'Data collection involves getting the data/facts needed for processing from the point of its origin to the computer\n', 'Data Input- the collected data is converted into machine-readable form by an input device,', 'and send into the machine.', '\n', 'Processing is the transformation of the input data to a more meaningful form (information) in the CPU\n', 'Output is the production of the required information,', 'which may be input in future.', '\n', 'The difference between data collection and data capture.', '\n', 'Data capture is the process of obtaining data in a computer-sensible form for at the point of origin (the source document itself is prepared in a machine-sensible form for input)\n