# PubMed Crawling

https://pubmed.ncbi.nlm.nih.gov/

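## Preparation

A reminder of the crawling steps:

1. Load libraries
2. Set the URL
3. Send the HTTP request
4. Interpret the response data (inspect the page structure)
5. Read the desired content
6. Write a collection function
7. Save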

## Writing the code

### Loading libraries

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup as bs

### Setting the target URL

In [15]:
keyword = input('Enter a search term: ')
keyword = keyword.replace(' ', '+')


base_url = 'https://pubmed.ncbi.nlm.nih.gov'
params = f'/?term={keyword}'

url = base_url + params
url

'https://pubmed.ncbi.nlm.nih.gov/?term=helth+care'

### HTTP request

In [3]:
response = requests.get(url)
# response.status_code
# confirmed 200

html = bs(response.text)
# not blocked by bot protection

### Inspecting the page structure

In [4]:
# Title location
a = html.select('div.search-results-chunks > div > article > div.docsum-wrap > div.docsum-content > a')[0].text.strip()

# Detail-page path, needed to collect the abstract
page = html.select('div.search-results-chunks > div > article > div.docsum-wrap > div.docsum-content > a')[0]['href']


# Detail page
durl = f'https://pubmed.ncbi.nlm.nih.gov{page}'
dresponse = requests.get(durl)
dhtml = bs(dresponse.text)

# Journal name
b = dhtml.select('#full-view-journal-trigger')[0]['title']
# Journal issue
c = dhtml.select('#full-view-heading > div.article-citation > div > span.cit')[0].text


# Author
d = dhtml.select('div.inline-authors > div > div > span > a')[0].text.strip()

# Abstract
e = dhtml.select('#eng-abstract')[0].text.replace("\n", '').replace("  ", '').strip()

print(a, b, c, d, e, sep='\n')

Application of internal fixation of steel-wire limited loop in early Achilles tendon rupture.
Asian Pacific journal of tropical medicine
2013 Nov;6(11):902-7.
Zhe Chen
Objective:To explore the clinical effect and safety of internal fixation of steel-wire limited loop in early Achilles tendon rupture.Methods:Seventy-six patients respectively with early transected and avulsed types of Achilles tendon rupture were selected and treated with internal fixation of steel-wire limited loop. The patients began to take exercise for their lower limbs through continous passive motion as early as possible after surgical repair, and the loops were removed after 3-5 months. Six months later, the condition of complications including Achilles tendon re-rupture, wound fistula, wound infection and skin necrosis, cutaneous sensation in sural nerve dominance region, time back to preinjury work or learning as well as time to physical activities were observed. One year later, the therapeutic effect was evalua

In [5]:
# Title location
a = html.select('div.search-results-chunks > div > article > div.docsum-wrap > div.docsum-content > a')[1].text.strip()

# Detail-page path, needed to collect the abstract
page = html.select('div.search-results-chunks > div > article > div.docsum-wrap > div.docsum-content > a')[1]['href']


# Detail page
durl = f'https://pubmed.ncbi.nlm.nih.gov{page}'
dresponse = requests.get(durl)
dhtml = bs(dresponse.text)

# Journal name
b = dhtml.select('#full-view-journal-trigger')[0]['title']
# Journal issue
c = dhtml.select('#full-view-heading > div.article-citation > div > span.cit')[0].text


# Author
d = dhtml.select('div.inline-authors > div > div > span > a')[0].text.strip()

# Abstract
e = dhtml.select('#eng-abstract')[0].text.replace("\n", '').replace("  ", '').strip()

print(a, b, c, d, e, sep='\n')

From Bioinspired Glue to Medicine: Polydopamine as a Biomedical Material.
Materials (Basel, Switzerland)
2020 Apr 7;13(7):1730.
Daniel Hauser
Biological structures have emerged through millennia of evolution, and nature has fine-tuned the material properties in order to optimise the structure-function relationship. Following this paradigm, polydopamine (PDA), which was found to be crucial for the adhesion of mussels to wet surfaces, was hence initially introduced as a coating substance to increase the chemical reactivity and surface adhesion properties. Structurally, polydopamine is very similar to melanin, which is a pigment of human skin responsible for the protection of underlying skin layers by efficiently absorbing light with potentially harmful wavelengths. Recent findings have shown the subsequent release of the energy (in the form of heat) upon light excitation, presenting it as an ideal candidate for photothermal applications. Thus, polydopamine can both be used to (i) coat na

### Reading the desired content: one page

In [4]:
# Read the search term
keyword = input('Enter a search term: ')

# Convert the search term into the form the query string expects
edit_keyword = keyword.replace(' ', '+')

# Pass the keyword to the base address
base_url = 'https://pubmed.ncbi.nlm.nih.gov'
params = f'/?term={edit_keyword}'

url = base_url + params

response = requests.get(url)
html = bs(response.text)

title_list = []
jname_list = []
jvol_list = []
author_list = []
Abstract_list = []

# Search results: one page
result = html.select('div.search-results-chunks > div > article > div.docsum-wrap > div.docsum-content > a')

for i in range(len(result)):
    # Title
    title = result[i].text.strip()
    title_list.append(title)
    
    # Detail-page path, needed to collect the abstract
    page = result[i]['href']

    # Load the detail page
    durl = f'https://pubmed.ncbi.nlm.nih.gov{page}'
    dresponse = requests.get(durl)
    dhtml = bs(dresponse.text)

    # Collect the details
    # Journal name
    journal_name = dhtml.select('#full-view-journal-trigger')[0]['title']
    jname_list.append(journal_name)
    # Journal issue
    journal_vol = dhtml.select('#full-view-heading > div.article-citation > div > span.cit')[0].text
    jvol_list.append(journal_vol)
    # First author
    authors = dhtml.select('div.inline-authors > div > div > span > a')[0].text.strip()
    author_list.append(authors)
    # Abstract
    Abstract = dhtml.select('#eng-abstract')[0].text.replace("\n", '').replace("  ", '').strip()
    Abstract_list.append(Abstract)

paper_list = pd.DataFrame({
    'title': title_list,
    'journal name': jname_list,
    'journal vol.': jvol_list,
    'author': author_list,
    'Abstract': Abstract_list
})

print(keyword)
display(paper_list)

sport injury prevention


| | title | journal name | journal vol. | author | Abstract |
|---|---|---|---|---|---|
| 0 | Current trends in sport injury prevention. | Best practice & research. Clinical rheumatology | 2019 Feb;33(1):3-15. | Carolyn A Emery | Participation in sport and recreation has impo... |
| 1 | The challenge of the sporting shoulder: From i... | Annals of physical and rehabilitation medicine | 2021 Jul;64(4):101384. | Ann M Cools | Shoulder injuries and sports-related shoulder ... |
| 2 | Understanding Injury and Injury Prevention in ... | Journal of sport rehabilitation | 2021 May 25;30(7):1053-1059. | Shana E Harrington | Context:Training loads, injury, and injury pre... |
| 3 | Hamstring Injury Prevention Practices in Elite... | Sports medicine (Auckland, N.Z.) | 2018 Mar;48(3):513-524. | Anthony J Shield | Hamstring strain injuries are endemic in runni... |
| 4 | 2022 Bern Consensus Statement on Shoulder Inju... | The Journal of orthopaedic and sports physical... | 2022 Jan;52(1):11-28. | Ariane Schwank | There is an absence of high-quality evidence t... |
| 5 | Sport-specific biomechanical responses to an A... | Journal of sports sciences | 2018 Nov;36(21):2492-2501. | Jeffrey B Taylor | Anterior cruciate ligament (ACL) injury preven... |
| 6 | Psychosocial Factors and Sport Injuries: Meta-... | Sports medicine (Auckland, N.Z.) | 2017 Feb;47(2):353-365. | Andreas Ivarsson | Background:Several studies have suggested that... |
| 7 | Injury prevention in sport: not yet part of th... | Injury prevention : journal of the Internation... | 2002 Dec;8 Suppl 4(Suppl 4):IV22-5. | D J Chalmers | Background:There is a saying in sport that "in... |
| 8 | Physical exercises for preventing injuries amo... | Journal of sport and health science | 2022 Jan;11(1):115-122. | Jorge Pérez-Gómez | Background:Football is the most practised spor... |
| 9 | Planning injury prevention training for youth ... | Injury prevention : journal of the Internation... | 2020 Apr;26(2):164-169. | Eva Ageberg | Background:Youth handball players are vulnerab... |


In [7]:
# View the full abstract
# pd.set_option('display.max_colwidth',1000)
# paper_list['Abstract'][0]

'Participation in sport and recreation has important positive implications for public health across the lifespan; however, the burden of sport-related musculoskeletal injury is significant, with the greatest risk being in youth and young adults. Moving upstream to primary prevention of injury is a public health priority that will have significant implications for reducing the long-term consequences of musculoskeletal injury including early post-traumatic osteoarthritis. The primary targets for the prevention of musculoskeletal injury in sport include neuromuscular training (NMT), rule modification, and equipment recommendations. Currently, there is significant high-quality evidence to support the widespread use of NMT warm up programs in team and youth sport, with an expected significant impact of reducing the risk of musculoskeletal injury by over 35%. Policy disallowing body checking in youth ice hockey has led to a >50% reduction in injuries, and rules limiting contact practice in youth American football has significant potential for injury prevention. There is evidence to support the use of bracing and taping in elite sport to reduce the risk of recurrent ankle sprain injury but not for use to prevent the primary injury, and wrist guards are protective of sprain injuries in snowboarding. Future research examining the maintenance of NMT programs across real-world sport and school settings, optimization of adherence, additional benefit of workload modification, and evaluation of rule changes in other sports is needed.'

### Attempt at improving code efficiency
- All authors can now be collected, not just the first author!

In [2]:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd


def get_paper_details(page):
    """Extracts paper details from the detailed page."""
    durl = f'https://pubmed.ncbi.nlm.nih.gov{page}'
    with requests.Session() as session:
        response = session.get(durl)
    dhtml = bs(response.text)
    try:
        journal_name = dhtml.select('#full-view-journal-trigger')[0]['title']
    except IndexError:
        journal_name = ''
    try:
        journal_vol = dhtml.select('#full-view-heading > div.article-citation > div > span.cit')[0].text.strip()
    except IndexError:
        journal_vol = ''
    # find_all returns an empty list when nothing matches, so no try/except is needed
    authors = [a.text.strip() for a in dhtml.find_all('a', {'class': 'full-name'})]
    try:
        abstract = dhtml.select('#eng-abstract')[0].text.replace("\n", '').replace("  ", '').strip()
    except IndexError:
        abstract = ''
    return journal_name, journal_vol, authors, abstract


keyword = input('Enter a search term: ').replace(' ', '+')
base_url = 'https://pubmed.ncbi.nlm.nih.gov'
params = f'/?term={keyword}&size=10'
url = base_url + params
with requests.Session() as session:
    response = session.get(url)
html = bs(response.text)

paper_list = []
result = html.select('div.search-results-chunks > div > article > div.docsum-wrap > div.docsum-content > a')
for r in result:
    title = r.text.strip()
    page = r['href']
    journal_name, journal_vol, authors, abstract = get_paper_details(page)
    paper_list.append({
        'title': title,
        'journal name': journal_name,
        'journal vol.': journal_vol,
        'authors': ', '.join(authors),
        'abstract': abstract
    })

df = pd.DataFrame(paper_list)
print(keyword.replace('+', ' '))
display(df)


sport injury prevention


| | title | journal name | journal vol. | authors | abstract |
|---|---|---|---|---|---|
| 0 | Current trends in sport injury prevention. | Best practice & research. Clinical rheumatology | 2019 Feb;33(1):3-15. | Carolyn A Emery, Kati Pasanen, Carolyn A Emery... | Participation in sport and recreation has impo... |
| 1 | The challenge of the sporting shoulder: From i... | Annals of physical and rehabilitation medicine | 2021 Jul;64(4):101384. | Ann M Cools, Annelies G Maenhout, Fran Vanders... | Shoulder injuries and sports-related shoulder ... |
| 2 | Understanding Injury and Injury Prevention in ... | Journal of sport rehabilitation | 2021 May 25;30(7):1053-1059. | Shana E Harrington, Sean McQueeney, Marcus Fea... | Context:Training loads, injury, and injury pre... |
| 3 | Hamstring Injury Prevention Practices in Elite... | Sports medicine (Auckland, N.Z.) | 2018 Mar;48(3):513-524. | Anthony J Shield, Matthew N Bourne, Anthony J ... | Hamstring strain injuries are endemic in runni... |
| 4 | 2022 Bern Consensus Statement on Shoulder Inju... | The Journal of orthopaedic and sports physical... | 2022 Jan;52(1):11-28. | Ariane Schwank, Paul Blazey, Martin Asker, Mer... | There is an absence of high-quality evidence t... |
| 5 | Sport-specific biomechanical responses to an A... | Journal of sports sciences | 2018 Nov;36(21):2492-2501. | Jeffrey B Taylor, Kevin R Ford, Randy J Schmit... | Anterior cruciate ligament (ACL) injury preven... |
| 6 | Psychosocial Factors and Sport Injuries: Meta-... | Sports medicine (Auckland, N.Z.) | 2017 Feb;47(2):353-365. | Andreas Ivarsson, Urban Johnson, Mark B Anders... | Background:Several studies have suggested that... |
| 7 | Injury prevention in sport: not yet part of th... | Injury prevention : journal of the Internation... | 2002 Dec;8 Suppl 4(Suppl 4):IV22-5. | D J Chalmers, D J Chalmers | Background:There is a saying in sport that "in... |
| 8 | Physical exercises for preventing injuries amo... | Journal of sport and health science | 2022 Jan;11(1):115-122. | Jorge Pérez-Gómez, José Carmelo Adsuar, Pedro ... | Background:Football is the most practised spor... |
| 9 | Planning injury prevention training for youth ... | Injury prevention : journal of the Internation... | 2020 Apr;26(2):164-169. | Eva Ageberg, Sofia Bunke, Per Nilsen, Alex Don... | Background:Youth handball players are vulnerab... |
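
Some rows in the output above list the same name twice (for example `D J Chalmers, D J Chalmers`), which suggests that the `full-name` links are matched more than once on the detail page. A minimal sketch of one way to collapse the repeats while keeping the original order; `unique_in_order` is a hypothetical helper that is not part of the notebook, and it assumes that two authors never share exactly the same name:

```python
def unique_in_order(names):
    """Drop repeated names while keeping first-seen order (hypothetical helper)."""
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(names))


# Sample taken from row 7 of the table
authors = ['D J Chalmers', 'D J Chalmers']
print(', '.join(unique_in_order(authors)))
```

If this fits the data, the call could wrap the `authors` list before it is joined into a single string.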


In [24]:
# Save (by default to_csv also writes the index as the first column)
df.to_csv('pubmed_scraping.csv')

### Notes
##### Summary of web query parameters  
- term = search term  
- page = page number of the search results  
- size = number of items shown per page  
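
The three parameters can be combined into a single query string. A minimal sketch, assuming the standard library's `urllib.parse.urlencode` is an acceptable replacement for the manual `replace(' ', '+')`; `build_search_url` and its default values are hypothetical and not part of the notebook:

```python
from urllib.parse import urlencode

BASE_URL = 'https://pubmed.ncbi.nlm.nih.gov/'


def build_search_url(term, page=1, size=10):
    """Build a PubMed search URL from term, page, and size (hypothetical helper)."""
    # urlencode turns spaces into '+' and percent-encodes reserved characters
    query = urlencode({'term': term, 'page': page, 'size': size})
    return f'{BASE_URL}?{query}'


print(build_search_url('sport injury prevention', page=2))
```

Unlike a plain `replace`, this also escapes characters such as `&` that would otherwise be read as parameter separators.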

##### abstract: difference between #eng-abstract and #abstract
> eng-abstract:  
>> Healthy aging is the ability to maintain independence, purpose, vitality, and quality of life into old age despite unexpected medical conditions, accidents, and unhelpful social determinants of health. Exercise, or physical activity, is an important component of healthy aging, preventing or mitigating falls, pain, sarcopenia, osteoporosis, and cognitive impairment. A well-balanced exercise program includes daily aerobic, strength, balance, and flexibility components. Most older adults do not meet the currently recommended minutes of regular physical activity weekly. Counseling by health care providers may help older adults improve exercise habits, but it is also important to take advantage of community-based exercise opportunities.

- Returns only the abstract text, cleanly  

> abstract:
>> AbstractHealthy aging is the ability to maintain independence, purpose, vitality, and quality of life into old age despite unexpected medical conditions, accidents, and unhelpful social determinants of health. Exercise, or physical activity, is an important component of healthy aging, preventing or mitigating falls, pain, sarcopenia, osteoporosis, and cognitive impairment. A well-balanced exercise program includes daily aerobic, strength, balance, and flexibility components. Most older adults do not meet the currently recommended minutes of regular physical activity weekly. Counseling by health care providers may help older adults improve exercise habits, but it is also important to take advantage of community-based exercise opportunities.Keywords:Exercise; Healthy aging; Older adults; Physical activity.

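- The word "Abstract" comes attached at the very front (slicing should take care of it)
- The paper's keywords also come attached at the very end (if the keywords are needed, this is the selector to process)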
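
The two observations above can be handled together. A minimal sketch, assuming the `#abstract` text has already had newlines and extra spaces removed and that `Keywords:` does not occur inside the abstract body; `split_abstract` is a hypothetical helper, and the sample string is a shortened version of the output above:

```python
def split_abstract(text):
    """Split '#abstract' text into (body, keywords). Hypothetical helper."""
    body = text.strip()
    # Drop the leading "Abstract" label if it is present
    if body.startswith('Abstract'):
        body = body[len('Abstract'):]
    # Everything after the first "Keywords:" is treated as the keyword list
    body, _, tail = body.partition('Keywords:')
    keywords = [k.strip().rstrip('.') for k in tail.split(';') if k.strip()]
    return body.strip(), keywords


sample = ('AbstractHealthy aging is the ability to maintain independence.'
          'Keywords:Exercise; Healthy aging; Older adults; Physical activity.')
body, keywords = split_abstract(sample)
print(body)
print(keywords)
```

When a page has no keyword section, `partition` leaves the tail empty and the keyword list comes back empty.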

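##### Parts I still don't understand well  
- requests.Session(): creates a Session object from the requests library so that certain parameters persist across requests
- with: why is it written every single time? I get that it is there to close things once they are no longer needed, but... how does it decide that they are no longer needed? I did open and close text files back in the very first Python basics lessons...
- join: it shows up often, but I have never tried to understand it in detail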
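
On the `with` and `join` questions: a `with` block treats the resource as finished as soon as execution leaves the indented block, and at that moment Python calls the object's `__exit__` method. A minimal sketch using a toy class (`Resource` is hypothetical and only records events in a list) followed by a one-line `join` demonstration:

```python
class Resource:
    """Toy context manager that records when it is opened and closed."""

    def __init__(self, log):
        self.log = log

    def __enter__(self):
        self.log.append('open')
        return self

    def __exit__(self, exc_type, exc, tb):
        # Called automatically when the indented block ends, even on an exception
        self.log.append('close')
        return False  # do not swallow exceptions


log = []
with Resource(log):
    log.append('work')
log.append('after')
print(log)

# str.join: the separator string is placed between the items of an iterable of strings
print(', '.join(['Ann M Cools', 'Annelies G Maenhout']))
```

`requests.Session` supports `with` in the same way and closes the session on exit. Note that a session only carries state across requests when the same object is reused for several calls; opening a new one for every request keeps nothing between them.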


In [23]:
# Keywords (this is awfully long......)
dhtml.select('#abstract')[0].text.split('Keywords')[1].replace('\n','').replace(':','').strip().split(';')

['adolescent',
 ' implementation / translation',
 ' sports / leisure facility',
 ' therapy',
 ' training.']

In [28]:
# Revised with ChatGPT
import re

text = dhtml.select('#abstract')[0].text

# Using a regular expression
re.search(r'Keywords:\n\s+(.*)\n', text).group(1).strip().split('; ')


['adolescent',
 'implementation / translation',
 'sports / leisure facility',
 'therapy',
 'training.']