# Project - Data Extraction I

## Monitoring System for Advances in the Field of Genomics

### Context:

The group works on the data engineering team at HealthGen, a company specializing in genomics and personalized medicine research. Genomics, the study of an organism's complete set of genes, plays a fundamental role in personalized medicine and biomedical research. It enables DNA analysis to identify genetic variants and mutations associated with diseases, and it makes it possible to tailor treatments to each patient's individual genetic characteristics.

The company needs to stay up to date on the latest advances in genomics, identify opportunities for research and development of personalized treatments, and follow genomics trends that may influence research and development strategies. With this in mind, the data team proposed building a system that collects, analyzes, and presents the latest news on genomics and personalized medicine, and that also studies how the field has advanced in recent years.

The data engineering team's goal is to develop and maintain a reliable, stable data pipeline. The main activities are:

1. **Data consumption with the News API**:
    - Implement a mechanism to consume news data from trusted sources specializing in genomics and personalized medicine, using the News API:
      [https://newsapi.org/](https://newsapi.org/)

2. **Define Relevance Criteria**:
    - Develop precise relevance criteria to filter the news. For example, the team may focus on news that mentions advances in DNA sequencing, personalized gene therapies, or discoveries related to specific genetic diseases.

3. **Batch Loads**:
    - Store the relevant news in a structured, easily accessible format for later queries and analysis. This load must run once per hour. If extracted news items were already stored by the previous load, the process must skip them and not store them again; the loaded data must not contain duplicates.

4. **Transformed data for end-user queries**:
    - From the loaded data, apply the following transformations and store the final result for end users to query:
        - Number of news items per year, month, and day of publication;
        - Number of news items per source and author;
        - Number of occurrences of 3 keywords per year, month, and day of publication (the same 3 keywords used for the relevance filters in item 2, Define Relevance Criteria).
    - Update the transformed data once per day.
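
The no-duplicates rule for the hourly batch load (item 3) could be sketched as follows. This is a minimal illustration, not the project's implementation: it assumes the article `url` is a suitable unique key and that the URLs already stored are available as a set. `filter_new_articles` and the example URLs are hypothetical.

```python
def filter_new_articles(batch, seen_urls):
    """Return only the articles whose URL has not been stored yet.

    `seen_urls` is updated in place, so the next hourly batch skips them too.
    """
    fresh = []
    for article in batch:
        url = article.get("url")
        if not url or url in seen_urls:
            continue  # skip records without a key and records already stored
        seen_urls.add(url)
        fresh.append(article)
    return fresh


# Two consecutive hourly loads that overlap on one article (illustrative data).
seen = set()
first_load = filter_new_articles(
    [{"url": "https://example.com/a"}, {"url": "https://example.com/b"}], seen
)
second_load = filter_new_articles(
    [{"url": "https://example.com/b"}, {"url": "https://example.com/c"}], seen
)
print(len(first_load), len(second_load))
```

In a persistent pipeline the set of seen URLs would have to be read from the stored data at the start of each load; how that is done depends on the storage layer, which the brief leaves open.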


# Extra Options

 - create an API to request the data
 - API to trigger a process (the process that creates the tables in item 4)
 - create a webhook (two parts: create the API, create the event generator)




# Group 4

- Jose Marchezi
- Rafael Leite
- Mayra Alves
- Renato Freitas


In [0]:
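import requests
# Wildcard import of Spark SQL functions; note that it shadows builtins such as sum, min and max.
from pyspark.sql.functions import *


def get_newsbase_url(base_url, key, query_params):
    """Call the News API endpoint and return the decoded JSON payload."""
    # Earlier hand-built URL, kept for reference:
    # url = f'{base_url}/everything?q={topic}&from={trange[0]}&to={trange[1]}&sortBy=popularity&apiKey={key}'

    # Make sure the key is sent even if the caller left it out of query_params.
    params = {**query_params, "apiKey": key}
    response = requests.get(base_url, params=params, timeout=30)
    return response.json()


def parse_data(response_json):
    """Flatten the 'articles' list of a News API response into plain dicts."""
    articles = response_json['articles']
    structured = []
    for item in articles:
        record = {'publisher': item['source']['name'],
                  'author': item['author'],
                  'title': item['title'],
                  'description': item['description'],
                  'url': item['url'],
                  'publication_date': item['publishedAt'],
                  'content': item['content']}
        structured.append(record)
    return structured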
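
The records produced by `parse_data` can then be checked against the relevance criteria from the project brief. The sketch below is one possible filter, under stated assumptions: the three keywords are taken to be the `q` terms of the request (genomic, genetic, DNA), and a case-insensitive match at the start of a word in the title, description, or content counts as relevant. `KEYWORDS`, `is_relevant`, and the sample records are hypothetical.

```python
import re

KEYWORDS = ("genomic", "genetic", "DNA")  # assumed keywords

# Match any keyword at the start of a word, so "genomics" and "genetics" also count.
_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in KEYWORDS) + r")",
    re.IGNORECASE,
)


def is_relevant(record):
    """Return True if any keyword appears in the title, description or content."""
    fields = ("title", "description", "content")
    # The API can return null for these fields, hence the `or ""` fallback.
    text = " ".join(str(record.get(f) or "") for f in fields)
    return _PATTERN.search(text) is not None


sample = [
    {"title": "New DNA sequencing method", "description": None, "content": ""},
    {"title": "Genesis unveils its take on the big luxury EV", "description": "", "content": ""},
]
relevant = [r for r in sample if is_relevant(r)]
print([r["title"] for r in relevant])
```

A plain substring test would be simpler, but it would also match keywords inside unrelated words; anchoring at a word boundary is a small step toward the "precise" criteria the brief asks for.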

In [0]:
# [from, to] publication dates; the name must match the trange[...] lookups in query_params.
trange = ['2024-02-01', '2024-04-01']

base_url = "https://newsapi.org/v2/everything"
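import os

# Read the API key from the environment instead of hard-coding it in the notebook.
# NEWSAPI_KEY is an assumed variable name; set it before running this cell.
key = os.environ["NEWSAPI_KEY"]
# Candidate source domains; this value is not passed to the request below.
domains = 'nature.com,biomedcentral.com'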

query_params = {
    "q": "genomic OR genetic OR DNA",
    "from": trange[0],
    "to": trange[1],
    "sortBy": "popularity",
    "apiKey": key
}


response_json = get_newsbase_url(base_url, key, query_params)
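
The payload returned by `get_newsbase_url` may describe an error rather than a list of articles, in which case `parse_data` would fail on the missing `articles` key. The News API reports this through a `status` field (`"ok"` or `"error"`), with `code` and `message` on errors. A minimal guard could look like the sketch below; `check_news_response` is a hypothetical helper and both sample payloads are illustrative, not real responses.

```python
def check_news_response(payload):
    """Raise if the payload reports an error; otherwise return its articles."""
    if payload.get("status") != "ok":
        raise RuntimeError(
            f"News API error {payload.get('code')}: {payload.get('message')}"
        )
    return payload.get("articles", [])


# Illustrative payloads shaped like a success and an error response.
ok_payload = {"status": "ok", "totalResults": 1, "articles": [{"title": "x"}]}
error_payload = {"status": "error", "code": "apiKeyInvalid", "message": "illustrative message"}

articles = check_news_response(ok_payload)
try:
    check_news_response(error_payload)
except RuntimeError as exc:
    error_text = str(exc)
    print(error_text)
```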

In [0]:
structured = parse_data(response_json)

In [0]:
structured

Out[85]: [{'publisher': 'The Verge',
  'author': 'Charles Pulliam-Moore',
  'title': 'The creators of 3 Body Problem want to have ‘a back and forth’ with the book',
  'description': 'The executive producers of Netflix’s 3 Body Problem discuss the challenges of adapting the book for screen and their hopes for more seasons.',
  'url': 'https://www.theverge.com/24111515/3-body-problem-interview-david-benioff-alexander-woo-db-weiss-season-2',
  'publication_date': '2024-03-25T21:00:00Z',
  'content': 'The creators of 3 Body Problem want to have a back and forth with the book\r\nThe creators of 3 Body Problem want to have a back and forth with the book\r\n / Though Netflixs 3 Body Problem is very diffe… [+5114 chars]'},
 {'publisher': 'BBC News',
  'author': 'https://www.facebook.com/bbcnews',
  'title': 'Scientists help save UK pint from climate change',
  'description': 'Researchers are identifying genes in the hop plant to produce varieties that will be more resilient to climate change.'

In [0]:
df = spark.createDataFrame(structured)

In [0]:
df.display()

author,content,description,publication_date,publisher,title,url
Charles Pulliam-Moore,The creators of 3 Body Problem want to have a back and forth with the book The creators of 3 Body Problem want to have a back and forth with the book  / Though Netflixs 3 Body Problem is very diffe… [+5114 chars],The executive producers of Netflix’s 3 Body Problem discuss the challenges of adapting the book for screen and their hopes for more seasons.,2024-03-25T21:00:00Z,The Verge,The creators of 3 Body Problem want to have ‘a back and forth’ with the book,https://www.theverge.com/24111515/3-body-problem-interview-david-benioff-alexander-woo-db-weiss-season-2
https://www.facebook.com/bbcnews,"Scientists fear climate change could ""call time"" on the great British pint and are helping the brewing industry to save it. Hops give bitter its taste but the plant doesn't like the hotter, drier co… [+3775 chars]",Researchers are identifying genes in the hop plant to produce varieties that will be more resilient to climate change.,2024-03-26T06:30:01Z,BBC News,Scientists help save UK pint from climate change,https://www.bbc.co.uk/news/science-environment-68636451
https://www.facebook.com/bbcnews,"Ministers are facing calls to prevent leaseholders being threatened with losing their home over unpaid charges. Under current laws, branded ""draconian"" by campaigners, a property can be repossessed… [+5591 chars]",Forfeiture means leaseholders can be threatened with losing their home if they do not pay service charges.,2024-03-26T02:26:36Z,BBC News,Moves to make it harder to repossess leasehold homes,https://www.bbc.co.uk/news/uk-politics-68622126
Jonathan M. Gitlin,Enlarge/ This concept points the way to a future Genesis flagship SUV. 12 Genesis provided train tickets from Washington to New York and accommodation so Ars could attend its event. Ars does not ac… [+3136 chars],Korea's luxury automaker brings five surprises to the New York International Auto Show.,2024-03-26T16:07:53Z,Ars Technica,Genesis unveils its take on the big luxury EV—the Neolun Concept,https://arstechnica.com/cars/2024/03/sporty-orange-and-big-luxury-genesis-brings-its-ev-game-to-ny/
Rachel Treisman,"Servers take off for the ""Course des Cafes"" in front of City Hall in central Paris on Sunday. Dimitar Dilkoff/AFP via Getty Images Foreign stereotypes of French restaurants tend to paint the servic… [+6508 chars]","Some 200 servers speed-walked through Paris balancing trays of beverages and croissants on Sunday. Paris hasn't held a waiters race since 2011, but brought it back ahead of the Olympics.",2024-03-25T16:13:33Z,NPR,"Hurry up and wait: Servers speed-walk through Paris, reviving a century-old race",https://www.npr.org/2024/03/25/1240667647/paris-waiters-race-tradition-cafe-olympics
,"A woman places flowers in memory of the victims of the attack in Moscow, in the center of Simferopol, in Russian-held Crimea, Sunday, March 24, 2024. AP MOSCOW Family and friends of those still mis… [+6976 chars]","Russia paused for a day of mourning Sunday for the more than 130 people killed at a Moscow concert. The attack, claimed by an affiliate of the Islamic State, is the deadliest on Russian soil in years.",2024-03-24T14:57:30Z,NPR,Russia marks a national day of mourning for victims of the concert hall attack,https://www.npr.org/2024/03/24/1240553741/russia-concert-hall-attack-day-of-mourning
Cameron Manley,"Russian Rosguardia national guard servicemen secure an area near the Crocus City Hall on the western edge of Moscow on March 22, 2024.AP Photo/Vitaly Smolnikov 115 people are reported to hav… [+5707 chars]","Gunmen who attacked Moscow's Crocus City Hall killed 115 dead, as Russian Federal Security Bureau confirms arrests.",2024-03-23T10:45:26Z,Business Insider,Death toll from Moscow concert hall attack rises to 115. The FSB confirmed 11 suspects have been arrested.,https://www.businessinsider.com/93-dead-11-arrested-terrorist-attack-moscow-russia-2024-3
Maya Posch,"Finding extraterrestrial life in any form would be truly one of the largest discoveries in humankind’s history, yet after decades of scouring the surface of Mars and investigating other bodies like a… [+1788 chars]","Finding extraterrestrial life in any form would be truly one of the largest discoveries in humankind’s history, yet after decades of scouring the surface of Mars and investigating other bodies …read more",2024-03-26T02:00:12Z,Hackaday,Complex Organic Chemistry In Sulfuric Acid and Life on Venus,https://hackaday.com/2024/03/25/complex-organic-chemistry-in-sulfuric-acid-and-life-on-venus/
Jamie Ducharme,"Just this month, two young, high-profile public figures announced that they have cancer. First, Olivia Munn, 43, disclosed that she was treated for breast cancer after catching it early. Days later, … [+5380 chars]",Kate Middleton and Olivia Munn's cancer diagnoses spotlight a troubling trend: cancer is getting more common among younger adults.,2024-03-26T15:57:11Z,Time,Why Are So Many Young People Getting Cancer? It’s Complicated,https://time.com/6960506/cancer-rates-young-people/
Associated Press,"MOSCOW Suspects in the Russia concert hall attack, which left more than 130 dead, arrived at a Moscow district court on Sunday night. There was a heavy police presence around the Basmanny District C… [+6004 chars]","Suspects in the Russia concert hall attack, which left more than 130 dead, arrived at a Moscow district court on Sunday night.",2024-03-24T20:42:08Z,Time,Russia Concert Hall Attack Suspects Appear in a Moscow Courtroom,https://time.com/6960153/russia-concert-hall-attack-suspects-moscow-courtroom/
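
The three aggregations requested in item 4 can be illustrated on records shaped like the rows above. The notebook would more likely do this with Spark on `df`; the sketch below uses plain Python only to show the grouping logic. It assumes `publication_date` follows the `YYYY-MM-DDTHH:MM:SSZ` format seen in the output and that the keywords are the `q` terms of the request. `summarize`, `publication_day`, and the sample records are hypothetical.

```python
import re
from collections import Counter
from datetime import datetime

KEYWORDS = ("genomic", "genetic", "dna")  # assumed keywords, lower-cased


def publication_day(record):
    """Return (year, month, day) parsed from the record's publication_date."""
    dt = datetime.strptime(record["publication_date"], "%Y-%m-%dT%H:%M:%SZ")
    return (dt.year, dt.month, dt.day)


def summarize(records):
    """Count articles per day, per (source, author), and keyword hits per day."""
    per_day = Counter()
    per_source_author = Counter()
    keyword_per_day = Counter()
    for rec in records:
        day = publication_day(rec)
        per_day[day] += 1
        per_source_author[(rec["publisher"], rec["author"])] += 1
        fields = ("title", "description", "content")
        text = " ".join(str(rec.get(f) or "") for f in fields).lower()
        for kw in KEYWORDS:
            # Count occurrences that start at a word boundary.
            hits = len(re.findall(r"\b" + re.escape(kw), text))
            if hits:
                keyword_per_day[(kw, *day)] += hits
    return per_day, per_source_author, keyword_per_day


# Illustrative records, not taken from the API.
sample = [
    {"publisher": "Source A", "author": "Author 1", "title": "DNA and genetic testing",
     "description": "DNA", "content": "", "publication_date": "2024-03-26T06:30:01Z"},
    {"publisher": "Source B", "author": None, "title": "Genomic medicine",
     "description": "", "content": "", "publication_date": "2024-03-26T02:00:12Z"},
]
per_day, per_source_author, keyword_per_day = summarize(sample)
print(per_day, per_source_author, keyword_per_day, sep="\n")
```

Note that the `content` field is truncated by the API (the `[+N chars]` suffix in the rows above), so keyword counts computed this way cover only the visible excerpt.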
