# Project - Data Extraction I

## Monitoring System for Advances in Genomics

### Context:

The group works on the data engineering team at HealthGen, a company specialized in genomics and personalized medicine research. Genomics, the study of an organism's complete set of genes, plays a fundamental role in personalized medicine and biomedical research. It enables DNA analysis to identify genetic variants and mutations associated with diseases, and it makes it possible to tailor treatments to each patient's genetic characteristics.

The company needs to stay up to date on the latest advances in genomics, identify opportunities for research and development of personalized treatments, and follow genomics trends that may influence research and development strategy. With this in mind, the data team proposed building a system that collects, analyzes, and presents the latest news on genomics and personalized medicine, and that also studies how the field has advanced in recent years.

The data engineering team's goal is to build and maintain a reliable, stable data pipeline. The main activities are:

1. **Data consumption with the News API**:
    - Implement a mechanism to consume news data from reliable sources specialized in genomics and personalized medicine, using the News API:
      [https://newsapi.org/](https://newsapi.org/)

2. **Define Relevance Criteria**:
    - Develop precise relevance criteria to filter the news. For example, the team may focus on news that mentions advances in DNA sequencing, personalized gene therapies, or discoveries related to specific genetic diseases.

3. **Batch Loads**:
    - Store the relevant news in a structured format that is easy to access for later queries and analysis. This load must run once per hour. If an extracted article was already stored by a previous load, the process must skip it: the loaded data must not contain duplicates.

4. **Transformed data for end-user queries**:
    - Starting from the loaded data, apply the following transformations and store the final result for end-user queries:
        - Number of articles per year, month, and day of publication;
        - Number of articles per source and author;
        - Number of occurrences of 3 keywords per year, month, and day of publication (the same 3 keywords used for the relevance filters in item 2, Define Relevance Criteria).
    - Update the transformed data once per day.
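
Items 2 and 3 can be illustrated with a minimal sketch in plain Python. Everything named here is an assumption made for illustration: the keyword list (the project asks for three keywords but does not name them), the helper names `is_relevant` and `load_batch`, the use of the article URL as the deduplication key, and the in-memory set that stands in for whatever table the hourly load writes to.

```python
import re

# Hypothetical keyword list: the project requires three keywords but does not name them.
KEYWORDS = ("genomics", "sequencing", "gene therapy")


def is_relevant(article, keywords=KEYWORDS):
    """Return True if any keyword appears as a whole word or phrase in the article text."""
    text = " ".join(
        str(article.get(field) or "") for field in ("title", "description", "content")
    ).lower()
    return any(
        re.search(r"\b" + re.escape(kw.lower()) + r"\b", text) for kw in keywords
    )


def load_batch(articles, stored_urls, keywords=KEYWORDS):
    """Return the relevant articles that were not loaded before.

    `stored_urls` is a set standing in for the key column of the stored table
    (an assumption). It is updated in place with the URLs of the new rows.
    """
    new_rows = []
    for article in articles:
        url = article.get("url")
        if not url or url in stored_urls:
            continue  # no usable key, or already stored by an earlier batch
        if not is_relevant(article, keywords):
            continue  # does not meet the relevance criteria
        stored_urls.add(url)
        new_rows.append(article)
    return new_rows
```

Running `load_batch` twice on the same batch returns the new rows the first time and an empty list the second time, which is the behavior item 3 asks for. On Spark, a comparable effect could probably be obtained with a `left_anti` join between the incoming batch and the stored table on `url`, but that is a design choice this notebook does not settle.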


# Optional Extras

- Create an API to request the data
- Create an API to trigger a process (the creation of the tables from item 4)
- Create a webhook (two parts: the API and the event generator)




# Group 4

- Jose Marchezi
- Rafael Leite
- Mayra Alves
- Renato Freitas


In [0]:
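import requests
from pyspark.sql.functions import *  # wildcard import kept from the original; note that it shadows built-ins such as sum and max


def get_newsbase_url(base_url, key, query_params):
    """Send a GET request to the News API and return the decoded JSON body.

    `key` is not used here: the API key travels inside query_params["apiKey"].
    The parameter is kept so that existing calls keep working.
    """
    # Equivalent hand-built URL, for reference:
    # f'{base_url}/everything?q={topic}&from={trange[0]}&to={trange[1]}&sortBy=popularity&apiKey={key}'
    response = requests.get(base_url, params=query_params, timeout=30)
    return response.json()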



In [0]:
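# [from, to] publication date range; named trange because query_params reads trange[0] and trange[1]
trange = ['2024-03-23', '2024-03-26']

base_url = "https://newsapi.org/v2/everything"
# Hard-coded as in the original notebook; a key is better read from a secret store or environment variable
key = "ebe15579813549c58b1b1baafb1a4d68"
domains = 'nature.com,biomedcentral.com'  # defined but not sent with the request below

query_params = {
    "q": "genomics",
    "from": trange[0],
    "to": trange[1],
    "sortBy": "popularity",
    "apiKey": key,
    "searchIn": "title",
}

# Keep the response so the following cells can parse it, then echo it as the cell output
response_json = get_newsbase_url(base_url, key, query_params)
response_json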

Out[25]: {'status': 'ok',
 'totalResults': 10,
 'articles': [{'source': {'id': None, 'name': 'ETF Daily News'},
   'author': 'MarketBeat News',
   'title': '10x Genomics, Inc. (NASDAQ:TXG) Shares Purchased by Citigroup Inc.',
   'description': 'Citigroup Inc. grew its position in shares of 10x Genomics, Inc. (NASDAQ:TXG – Free Report) by 94.4% during the 3rd quarter, according to the company in its most recent filing with the SEC. The firm owned 10,951 shares of the company’s stock after purchasing …',
   'url': 'https://www.etfdailynews.com/2024/03/24/10x-genomics-inc-nasdaqtxg-shares-purchased-by-citigroup-inc/',
   'urlToImage': 'https://www.americanbankingnews.com/wp-content/timthumb/timthumb.php?src=https://www.marketbeat.com/logos/10x-genomics-inc-logo-1200x675.png?v=20240206085559&w=240&h=240&zc=2',
   'publishedAt': '2024-03-24T09:18:42Z',
   'content': 'Citigroup Inc. grew its position in shares of 10x Genomics, Inc. (NASDAQ:TXG – Free Report) by 94.4% during the 3rd quarter, 

In [0]:
def parse_data(response_json):
    """Flatten the News API response into a list of dicts, one per article."""
    structured = []
    for item in response_json['articles']:
        record = {
            'publisher': item['source']['name'],
            'author': item['author'],
            'title': item['title'],
            'description': item['description'],
            'url': item['url'],
            'publication_date': item['publishedAt'],
            'content': item['content'],
        }
        structured.append(record)
    return structured
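
`parse_data` indexes `response_json['articles']` directly, so it raises a `KeyError` if the API answers with an error payload instead of articles. The sketch below shows one possible guard to run before parsing. The helper name `extract_articles` is hypothetical, and the `code` and `message` fields are assumed to follow the error format described in the News API documentation.

```python
def extract_articles(response_json):
    """Return the list of articles, or raise if the API reported an error.

    Assumes an error payload shaped like
    {"status": "error", "code": "...", "message": "..."}.
    """
    if response_json.get("status") != "ok":
        code = response_json.get("code", "unknown")
        message = response_json.get("message", "no message")
        raise RuntimeError(f"News API error ({code}): {message}")
    return response_json.get("articles", [])
```

Calling `extract_articles(response_json)` before `parse_data(response_json)` turns a confusing `KeyError` into an explicit error message, which is easier to diagnose in a load that runs unattended every hour.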

In [0]:
structured = parse_data(response_json)

In [0]:
df = spark.createDataFrame(structured)

In [0]:
df.display()

author,content,description,publication_date,publisher,title,url
,"Introduction Diabetes, obesity, heart disease, cancer, and liver disease have all been linked in various ways to acetate availability, metabolism, and signaling. Acetate supplementation induces phys… [+163834 chars]","Acetate, the shortest chain fatty acid, has been implicated in providing health benefits whether it is derived from the diet or is generated from microbial f...",2024-03-26T11:46:02Z,Frontiersin.org,"Acetate Revisited: A Key Biomolecule at the Nexus of Metabolism, Epigenetics",https://www.frontiersin.org/journals/physiology/articles/10.3389/fphys.2020.580171/full
,"Humans pass on more viruses to domestic and wild animals than we catch from them, according to a major new analysis of viral genomes by UCL researchers.For the new paper published in Nature Ecology &… [+4202 chars]","Humans pass on more viruses to domestic and wild animals than we catch from them, according to a major new analysis of viral genomes.",2024-03-25T15:41:38Z,Science Daily,Humans pass more viruses to other animals than we catch from them,https://www.sciencedaily.com/releases/2024/03/240325114138.htm
UC Irvine,Researchers have discovered the key role that the APOBEC3A and APOBEC3B enzymes play in driving cancer mutations by modifying the DNA in tumor genomes. The work offers potential new targets for inte… [+2562 chars],"Two enzymes that play a role in cancer mutation offer potential new targets for intervention strategies, researchers report.",2024-03-25T14:17:49Z,Futurity: Research News,Team cracks 2 enzymes’ role in cancer mutation,https://www.futurity.org/enzymes-cancer-mutation-3197292-2/
Alli Chase,"Making life-changing, innovative advances is intrinsic to the UKs  life sciences sector, which unceasingly thrives on industry  collaboration, academic excellence, and government support. Uncoveri… [+8763 chars]",,2024-03-25T11:18:35Z,MIT Technology Review,"Backed by heritage, ready for the future",https://www.technologyreview.com/2024/03/25/1090079/backed-by-heritage-ready-for-the-future/
Jacinta Bowler,"In short: Researchers have used genetic evidence to pinpoint where ancient humans went after leaving Africa. They suggest the 'Persian Plateau' an area including modern day Iran.","Our ancestors left Africa some 70,000 years ago, but it took them thousands of years to make it to Europe or Asia. A DNA study may shed light on where they ended up during this 'long pause'.",2024-03-25T18:30:00Z,ABC News (AU),'A very big question in human evolution': Where did our ancestors go after leaving Africa?,https://www.abc.net.au/news/science/2024-03-26/out-of-africa-human-migration-persian-plateau/103614458
T0@st,"A manufacturing plant near Hsinchu, Taiwan's Silicon Valley, is among facilities worldwide boosting energy efficiency with AI-enabled digital twins. A virtual model can help streamline operations, ma… [+4275 chars]","A manufacturing plant near Hsinchu, Taiwan's Silicon Valley, is among facilities worldwide boosting energy efficiency with AI-enabled digital twins. A virtual model can help streamline operations, maximizing throughput for its physical counterpart, say engine…",2024-03-26T16:43:06Z,Techpowerup.com,(PR) NVIDIA Modulus & Omniverse Drive Physics-informed Models and Simulations,https://www.techpowerup.com/320853/nvidia-modulus-omniverse-drive-physics-informed-models-and-simulations
Investor's Business Daily,"Dow Jones futures rose slightly after hours, along with S&P 500 futures and Nasdaq futures. The stock market rally had a relatively quiet Tuesday, with the Nasdaq leading a fade into the close fo… [+7544 chars]",Trump Media broke out in its post-SPAC debut. Tesla jumped to key resistance.,2024-03-26T21:11:04Z,Investor's Business Daily,"Dow Jones Futures: Donald Trump Stock Great Again? Tesla Hits Resistance, 5 AI Stocks Near Buy Points",https://www.investors.com/market-trend/stock-market-today/dow-jones-futures-donald-trump-stock-great-tesla-stock-ai-stocks-buy-points/
"Cecilia Wieder, Juliette Cooke, Clement Frainay, Nathalie Poupin, Russell Bowler, Fabien Jourdan, Katerina J. Kechris, Rachel PJ Lai, Timothy Ebbels","Abstract As terabytes of multi-omics data are being generated, there is an ever-increasing need for methods facilitating the integration and interpretation of such data. Current multi-omics integrat… [+102316 chars]","Author summary Omics data, which provides a readout of the levels of molecules such as genes, proteins, and metabolites in a sample, is frequently generated to study biological processes and perturbations within an organism. Combining multiple omics data type…",2024-03-25T14:00:00Z,Plos.org,PathIntegrate: Multivariate modelling approaches for pathway-based multi-omics data integration,https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1011814
Rick Merritt,"A manufacturing plant near Hsinchu, Taiwans Silicon Valley, is among facilities worldwide boosting energy efficiency with AI-enabled digital twins. A virtual model can help streamline operations, ma… [+4281 chars]","A manufacturing plant near Hsinchu, Taiwan’s Silicon Valley, is among facilities worldwide boosting energy efficiency with AI-enabled digital twins. A virtual model can help streamline operations, maximizing throughput for its physical counterpart, say engine…",2024-03-26T15:00:51Z,Nvidia.com,Model Innovators: How Digital Twins Are Making Industries More Efficient,https://blogs.nvidia.com/blog/digital-twins-modulus-wistron/
,"Began global shipments for Visium HD, a new product that enables whole transcriptome spatial discovery at single cellscale resolution PLEASANTON, Calif., March 26, 2024 /PRNewswire/ -- 10x Genomics… [+6237 chars]","(marketscreener.com) Began global shipments for Visium HD, a new product that enables whole transcriptome spatial discovery at single cell–scale resolution PLEASANTON, Calif., March 26, 2024 /PRNewswire/ -- 10x Genomics, Inc. , a leader in single cell and …",2024-03-26T13:01:19Z,Marketscreener.com,10x Genomics Commercially Launches Visium HD Spatial Gene Expression Assay,https://www.marketscreener.com/quote/stock/10X-GENOMICS-INC-65338646/news/10x-Genomics-Commercially-Launches-Visium-HD-Spatial-Gene-Expression-Assay-46287534/
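
The three transformations requested in item 4 can be sketched over rows shaped like the output of `parse_data`. This is a plain-Python illustration using `collections.Counter` so that it runs without a cluster; the same groupings could be expressed with Spark aggregations on `df`. The keyword list and the function names are assumptions, since the project does not fix the three keywords.

```python
from collections import Counter

# Hypothetical keyword list: the project requires three keywords but does not name them.
KEYWORDS = ("genomics", "sequencing", "gene therapy")


def _ymd(row):
    """Split an ISO-8601 timestamp such as '2024-03-25T14:00:00Z' into (year, month, day)."""
    year, month, day = row["publication_date"][:10].split("-")
    return int(year), int(month), int(day)


def count_by_day(rows):
    """Number of articles per (year, month, day) of publication."""
    return Counter(_ymd(row) for row in rows)


def count_by_publisher_author(rows):
    """Number of articles per (publisher, author) pair; a missing author stays None."""
    return Counter((row.get("publisher"), row.get("author")) for row in rows)


def count_keywords_by_day(rows, keywords=KEYWORDS):
    """Keyword occurrences per (year, month, day, keyword).

    Counts case-insensitive substring matches in title, description, and content,
    so a keyword embedded in a longer word is also counted.
    """
    counts = Counter()
    for row in rows:
        text = " ".join(
            str(row.get(field) or "") for field in ("title", "description", "content")
        ).lower()
        for kw in keywords:
            occurrences = text.count(kw.lower())
            if occurrences:
                counts[_ymd(row) + (kw,)] += occurrences
    return counts
```

Each function returns a `Counter` keyed by the grouping columns, which can be turned into rows for a result table during the daily refresh.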
