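# Requirement: given a set of articles, can we extract the important sentences from the text?

Approach:
 - Score every article in the corpus and find the articles that carry the most information.
 - Score the sentences within an article and find the sentences that carry the most information.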

In [1]:
import data

In [2]:
tabulation = data.tabulation(path='resource/csv/pubmed-sample.csv')
tabulation.read()
tabulation.table[['abstract', 'keyword']].head(10)

| | abstract | keyword |
|---|---|---|
| 0 | Preventing death from malignant melanoma is th... | melanoma |
| 1 | Introduction: Type 1 diabetes (T1DM) patients ... | chest |
| 2 | Aim: To determine the age-specific prevalence ... | hypertension |
| 3 | Objectives: Investigation whether in depth cha... | covid-19 |
| 4 | Case report: A 6-year-old neutered male Britis... | chest |
| 5 | Approximately 100 million people suffer from f... | chest |
| 6 | A comprehensive repertoire of human microRNAs ... | melanoma |
| 7 | Background: SARS-CoV-2-infected subjects have ... | covid-19 |
| 8 | Perioperative derangements of fluid and electr... | diabetes |
| 9 | A female patient, aged 44, with diabetes insip... | diabetes |


In [3]:
vocabulary = data.vocabulary(content=tabulation.table['abstract'], title=tabulation.table['title'])
vocabulary.build()



In [4]:
vocabulary.transform(mode='tf-idf')
print("term(row), document(column) matrix")
vocabulary.matrix

use [self.matrix] to call the weight matrix
term(row), document(column) matrix


array([[0.1 , 0.  , 0.  , ..., 0.  , 0.  , 0.  ],
       [0.07, 0.  , 0.  , ..., 0.  , 0.  , 0.  ],
       [0.08, 0.  , 0.  , ..., 0.  , 0.  , 0.  ],
       ...,
       [0.  , 0.  , 0.  , ..., 0.  , 0.07, 0.  ],
       [0.  , 0.  , 0.  , ..., 0.  , 0.07, 0.  ],
       [0.  , 0.  , 0.  , ..., 0.  , 0.  , 0.35]])

In [5]:
vocabulary.matrix.shape

(3582, 200)
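
The internals of `vocabulary.transform(mode='tf-idf')` are not shown in this notebook. As a rough illustration of what a term(row) by document(column) TF-IDF matrix looks like, here is a minimal sketch in plain Python. It assumes one common TF-IDF variant (relative term frequency multiplied by `log(N / df) + 1`); the helper names `tokenize` and `tfidf_matrix` are hypothetical and are not part of the `data` module.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Deliberately simple tokenizer: lowercase, keep alphabetic runs only.
    return re.findall(r"[a-z]+", text.lower())


def tfidf_matrix(documents):
    """Return (terms, matrix); matrix[i][j] is the weight of term i in document j."""
    counts = [Counter(tokenize(d)) for d in documents]
    terms = sorted(set().union(*counts))
    n_docs = len(documents)
    # Document frequency: in how many documents does each term occur?
    df = {t: sum(1 for c in counts if t in c) for t in terms}
    matrix = []
    for t in terms:
        # Assumed IDF variant; the +1 keeps terms found in every document above zero.
        idf = math.log(n_docs / df[t]) + 1.0
        row = []
        for c in counts:
            total = sum(c.values())
            tf = c[t] / total if total else 0.0
            row.append(tf * idf)
        matrix.append(row)
    return terms, matrix


docs = ["melanoma skin melanoma", "diabetes insulin", "skin care"]
terms, matrix = tfidf_matrix(docs)
print(len(terms), len(matrix[0]))  # number of terms, number of documents
```

With this layout, a shape such as `(3582, 200)` reads as 3582 distinct terms across 200 documents.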

Using the TF-IDF matrix, compute an information score for each article and extract the 5 highest-scoring articles.

In [6]:
k = 5
importance = data.importance(vocabulary=vocabulary)
document = importance.document(top=k)
document

| | title | score |
|---|---|---|
| 0 | Lung incarceration after anterior mediastinal ... | 3.99 |
| 1 | Functional valvular incompetence in decompensa... | 3.94 |
| 2 | Hydranencephaly complicated by central diabete... | 3.91 |
| 3 | Impact of Confounding Thoracic Tubes and Pleur... | 3.81 |
| 4 | Body composition measurement by air displaceme... | 3.8 |
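
How `data.importance(...).document(top=k)` derives its score is not shown in this notebook. One plausible reading, offered only as an assumption, is that a document's score is the sum of the TF-IDF weights in its column. The sketch below illustrates that idea; `rank_documents` is a hypothetical helper, and the small matrix is made-up toy data rather than values from the PubMed sample.

```python
def rank_documents(matrix, titles, top=5):
    """Rank documents by the sum of their column in a term-by-document matrix."""
    n_docs = len(titles)
    # Assumption: score of document j = sum of the weights of all terms in column j.
    scores = [sum(row[j] for row in matrix) for j in range(n_docs)]
    order = sorted(range(n_docs), key=lambda j: scores[j], reverse=True)
    return [(titles[j], round(scores[j], 2)) for j in order[:top]]


# Toy weights: 3 terms (rows) by 3 documents (columns).
toy_matrix = [
    [0.1, 0.0, 0.5],
    [0.2, 0.3, 0.0],
    [0.0, 0.4, 0.6],
]
print(rank_documents(toy_matrix, ["A", "B", "C"], top=2))
```

A column sum rewards documents with many distinctive terms; a column norm or a mean would be alternative choices that penalize length differently.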


---

For the article at index k in the table above, extract the s highest-scoring sentences.

In [8]:
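k = 1  # 0-based row index into the top-5 table, so this selects the second article
s = 2  # number of sentences to extract
title = document['title'][k]
# Look up the abstract whose title matches; .item() raises unless exactly one row matches.
content = tabulation.table.loc[tabulation.table['title'] == title, 'abstract'].item()
sentence, _ = importance.sentence(title=title, content=content, top=s)
data.mark(title, content, sentence)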

Functional valvular incompetence in decompensated heart failure: noninvasive monitoring and response to medical management
**Objective: We hypothesized that functional mitral and tricuspid valvular incompetence (MR and TR, respectively) are reversible causes of reduced cardiac output in decompensated heart failure (DF) that accompanies systolic dysfunction in ischemic or nonischemic cardiomyopathy.**
--------------------------------------------------
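
The scoring rule behind `importance.sentence(...)` is not shown in this notebook. As a hedged sketch of the general idea, the code below splits an abstract into sentences, scores each sentence by summing per-term weights, and returns the top sentences in their original order. The helper names `split_sentences` and `top_sentences`, the naive splitter, and the toy weights are all assumptions for illustration; they are not the `data` module's API.

```python
import re


def split_sentences(text):
    # Naive splitter: break after '.', '!' or '?' when followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def top_sentences(content, weights, top=2):
    """Return the `top` highest-scoring sentences, kept in document order."""
    scored = []
    for position, sentence in enumerate(split_sentences(content)):
        tokens = re.findall(r"[a-z]+", sentence.lower())
        # Assumption: sentence score = sum of the weights of its terms.
        score = sum(weights.get(t, 0.0) for t in tokens)
        scored.append((score, position, sentence))
    # Highest score first; ties are broken by earlier position.
    best = sorted(scored, key=lambda item: (-item[0], item[1]))[:top]
    return [sentence for _, _, sentence in sorted(best, key=lambda item: item[1])]


toy_weights = {"mitral": 0.4, "incompetence": 0.3, "cardiac": 0.2, "output": 0.1, "results": 0.05}
toy_text = "The study was small. Mitral incompetence reduced cardiac output. Results are shown."
print(top_sentences(toy_text, toy_weights, top=1))
```

Summing weights tends to favor long sentences; dividing by the token count would score by average term weight instead.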


---

Take the second article (k = 1) and extract its 3 highest-scoring sentences.

In [6]:
k = 1
title = document['title'][k]
content = tabulation.table.loc[tabulation.table['title']==title]['abstract'].item()
sentence, _ = importance.sentence(title=title, content=content, top=3)
data.mark(title, content, sentence)

Neuropeptidomic analysis of the brain and thoracic ganglion from the Jonah crab, Cancer borealis
Mass spectrometric methods were applied to determine the peptidome of the brain and thoracic ganglion of the Jonah crab (Cancer borealis). Fractions obtained by high performance liquid chromatography were characterized using MALDI-TOF MS and ESI-Q-TOF MS/MS. **In total, 28 peptides were identified within the molecular mass range 750-3000Da.** **Comparison of the molecular masses obtained with MALDI-TOF MS with the calculated molecular masses of known crustacean peptides revealed the presence of at least nine allatostatins, three orcokinin precursor derived peptides, namely FDAFTTGFGHS, [Ala(13)]-orcokinin, and [Val(13)]-orcokinin, and two kinins, a tachykinin-related peptide and four FMRFamide-related peptides.** **Eight other peptides were de novo sequenced by collision induced dissociation on the Q-TOF system and yielded AYNRSFLRFamide, PELDHVFLRFamide or EPLDHVFLRFa**

Take the third article (k = 2) and extract its 3 highest-scoring sentences.

In [7]:
k = 2
title = document['title'][k]
content = tabulation.table.loc[tabulation.table['title']==title]['abstract'].item()
sentence, _ = importance.sentence(title=title, content=content, top=3)
data.mark(title, content, sentence)

[Resection and Reconstruction of the Diaphragm and the Pericardium in Extrapleural Pneumonectomy]
The diaphragm dissection should be started from anterior, because the portion is just under the thoracotomy incision. The diaphragmatic muscle was cut by an electric knife along the line of 1 to 2 cm from the chest wall from anterior and lateral to posterior. The diaphragm including the tendon center is dissected from the peritoneum. The peritoneum should be preserved. If the peritoneum is opened, it should be repaired by sutures. The pericardium is opened at the apex. The pericardium incision is extended from the apex to cranial side. And then, it is cut from the apex to posterior with the diaphragm. And next, the incision of the cranial side edge is extended to posterior. The lower pulmonary vein, upper pulmonary vein, and pulmonary artery are exposed. They are encircled and divided in the pericardium by autosutures. A Goretex sheet with 1 mm thickness is used to reconstruct the

Take the fourth article (k = 3) and extract its 3 highest-scoring sentences.

In [8]:
k = 3
title = document['title'][k]
content = tabulation.table.loc[tabulation.table['title']==title]['abstract'].item()
sentence, _ = importance.sentence(title=title, content=content, top=3)
data.mark(title, content, sentence)

Magnocellular hypothalamic projections to the lower brain stem and spinal cord of the rat. Immunocytochemical evidence for predominance of the oxytocin-neurophysin system compared to the vasopressin-neurophysin system
The paraventricular nucleus of the rat hypothalamus has been shown to project to the medulla and spinal cord. The proportion of oxytocin-neurophysin (OTNP) axons to vasopressin-neurophysin (VPNP) axons in these structures is unknown. A major difficulty in resolving this problem in previous immunocytochemical studies was the lack of a specific antiserum to each rat neurophysin. **In this study two approaches have been used: (1) comparison of immunostaining for neurophysin in normal versus homozygous Brattleboro rats with diabetes insipidus (HODI) which lack VPNP, and (2) application of an antiserum to both rat neurophysins absorbed with HODI rat hypothalamic-pituitary extracts which contain only OTNP.** The latter would result in an antiserum specific for VPN

Take the fifth article (k = 4) and extract its 3 highest-scoring sentences.

In [9]:
k = 4
title = document['title'][k]
content = tabulation.table.loc[tabulation.table['title']==title]['abstract'].item()
sentence, _ = importance.sentence(title=title, content=content, top=3)
data.mark(title, content, sentence)

Intra-thoracic fibrous tissue induction by polylactic acid and epsilon-caprolactone copolymer cubes, with or without slow release of basic fibroblast growth factor
**Objective: We investigated whether implantation of polylactic acid and epsilon-caprolactone copolymer (PLAC) cubes with or without basic fibroblast growth factor (b-FGF) released slowly from gelatin microspheres was able to induce fibrous tissue in the dead space remaining after pneumonectomy in the thoracic cavity.**
--------------------------------------------------
