Basic XML information extraction

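An .xml file is a plain-text file whose content is structured with tags.
You can read it directly as text, or process it with a parsing library such as lxml (a third-party package that is installed separately, e.g. with pip install lxml; Python's built-in counterpart is xml.etree.ElementTree).
Reading and processing the raw text directly is generally faster (the difference is not noticeable for small files), but it is less flexible: each project needs more custom changes and development.
This notebook shows how to use the lxml package to extract the information you want.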

In [None]:
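Example file:
    The content is copied from the example at https://www.blog.pythonlibrary.org/2010/11/20/python-parsing-xml-with-lxml/
    with a few new tags added to make it more flexible to work with.
    
    The part wrapped in <> is called a tag, and it comes as an opening and closing pair, e.g. <price> $5,000 </price>. The information is organized by tags, so we fetch what we need by following the file's structure.
Goal:
    Extract the information in the following format
    [
        {
            Query_id: Query_1,
            books: [{book_title: book1, book_catalog: XXX}, {book_title: book2, book_catalog: XXX}, {}, ... ]
        },
        {
            Query_id: Query_2,
            books: [ ... ]
        },
        .
        .
        .
    ]
    One query (an Iteration) contains several Hits (book categories), and each Hit in turn contains several Hsps (books).
    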
    
# <?xml version="1.0"?>
# <PH>
# <Iteration><Iteration_query-ID>Query_1</Iteration_query-ID><Hit><Hit_def>7 dna_sm:chromosome chromosome:GRCh37:7:1:159138663:1 REF</Hit_def>
#    <Hsp id="bk101">
#       <author>Gambardella, Matthew</author>
#       <title>XML Developer's Guide</title>
#       <genre>Computer</genre>
#       <price>44.95</price>
#       <publish_date>2000-10-01</publish_date>
#       <description>An in-depth look at creating applications 
#       with XML.</description>
#    </Hsp>
#    </Hit>
#     .
#     .
#     .

In [1]:
from lxml import etree

In [None]:
Basic steps: first target the tag you need. In this example I want to start from Iteration rather than from the PH tag.
    Start with etree.parse or etree.iterparse.
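
To make the difference between the two entry points concrete, here is a minimal sketch. It assumes a tiny inline XML snippet (invented for illustration, not the real example file) and filters on the tag by hand so that it also runs on the standard library; with lxml, passing tag='Iteration' to iterparse does that filtering for you.

```python
import io

try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the standard library if lxml is not installed;
    # the calls used below behave the same in both.
    import xml.etree.ElementTree as etree

# Hypothetical miniature of the example file.
XML = (b"<PH>"
       b"<Iteration><Iteration_query-ID>Query_1</Iteration_query-ID></Iteration>"
       b"<Iteration><Iteration_query-ID>Query_2</Iteration_query-ID></Iteration>"
       b"</PH>")

# etree.parse: build the whole tree in memory, then walk it.
root = etree.parse(io.BytesIO(XML)).getroot()
ids_from_parse = [it.find('Iteration_query-ID').text
                  for it in root.iter('Iteration')]

# etree.iterparse: receive elements one by one while the file is being read.
ids_from_iterparse = []
for event, elem in etree.iterparse(io.BytesIO(XML), events=('end',)):
    if elem.tag == 'Iteration':
        ids_from_iterparse.append(elem.find('Iteration_query-ID').text)
        elem.clear()  # drop the children we no longer need to save memory

print(ids_from_parse, ids_from_iterparse)
```

Both approaches give the same list here; iterparse mainly pays off when the file is too large to hold comfortably in memory.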

In [33]:
for event, each_Iteration in etree.iterparse('lxml_practice_example.xml', tag = 'Iteration'):
    print(event, each_Iteration)

end <Element Iteration at 0x6034208>
end <Element Iteration at 0x603d2c8>
end <Element Iteration at 0x603d188>


In [None]:
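The events are defined in the lxml documentation; each_Iteration is the element object produced by the parser. By default iterparse reports only 'end' events, which is why every output line starts with end.
https://lxml.de/parsing.html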
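
As a rough illustration of what the events mean, the sketch below asks for both 'start' and 'end' events on a small invented snippet and records the order in which they arrive. The snippet and the variable names are assumptions for demonstration only.

```python
import io

try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the standard library if lxml is not installed;
    # iterparse(source, events=...) is available in both.
    import xml.etree.ElementTree as etree

# Hypothetical three-level snippet.
XML = b"<PH><Iteration><Hit/></Iteration></PH>"

# 'start' fires when an opening tag is read, 'end' when the closing tag is read.
seen = [(event, elem.tag)
        for event, elem in etree.iterparse(io.BytesIO(XML),
                                           events=('start', 'end'))]
for event, tag in seen:
    print(event, tag)
```

An element is only guaranteed to be complete (text and children included) at its 'end' event, which is why the default of reporting only 'end' is usually what you want.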

In [34]:
for event, each_Iteration in etree.iterparse('lxml_practice_example.xml', tag = 'Iteration'):
    print(each_Iteration.tag, each_Iteration.text)

Iteration None
Iteration None
Iteration None


In [None]:
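Two key attributes are .tag and .text.
.tag gives the tag name itself, and .text gives the text enclosed in the tag. Here .text is None because each Iteration starts directly with a child tag and has no text of its own before it.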
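
A small sketch of these attributes on a single element, assuming a shortened Hsp record written inline (the record is illustrative, not copied from the example file). It also uses .get() to read the id attribute, which the notebook does not cover.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the standard library if lxml is not installed;
    # .tag, .text and .get() behave the same in both.
    import xml.etree.ElementTree as etree

# Hypothetical, shortened Hsp record.
hsp = etree.fromstring(
    '<Hsp id="bk102"><title>Midnight Rain</title><price>5.95</price></Hsp>'
)

print(hsp.tag)        # the tag name
print(hsp.text)       # None: Hsp has no text of its own before its first child
print(hsp.get('id'))  # value of the id attribute

# Iterating over an element yields its direct children.
fields = [(child.tag, child.text) for child in hsp]
print(fields)
```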

In [36]:
for event, each_Iteration in etree.iterparse('lxml_practice_example.xml', tag = 'Iteration'):
    q_id = each_Iteration.find('Iteration_query-ID').text
    print('Query_id:', q_id)

Query_id: Query_1
Query_id: Query_1
Query_id: Query_1


In [None]:
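The first thing we need is the Query_id, so we use find() to locate the corresponding tag -> Iteration_query-ID.
Each Iteration here contains only one Query_id, so find() is enough.
find() returns only the first matching tag.
If one Iteration held several Query_ids, you would probably use findall('ttt') instead, which returns a list of all 'ttt' tag objects directly under that Iteration.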
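
The difference between find() and findall() can be sketched as follows, assuming a made-up Iteration with two Hits whose Hit_def values are just placeholders.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the standard library if lxml is not installed;
    # find() and findall() behave the same in both.
    import xml.etree.ElementTree as etree

# Hypothetical Iteration with two Hit children.
iteration = etree.fromstring(
    '<Iteration>'
    '<Iteration_query-ID>Query_1</Iteration_query-ID>'
    '<Hit><Hit_def>A</Hit_def></Hit>'
    '<Hit><Hit_def>B</Hit_def></Hit>'
    '</Iteration>'
)

first_hit = iteration.find('Hit')        # only the first match
all_hits = iteration.findall('Hit')      # list of every match
missing = iteration.find('NoSuchTag')    # None when nothing matches

print(first_hit.find('Hit_def').text)
print([hit.find('Hit_def').text for hit in all_hits])
print(missing)
```

Because find() returns None for a missing tag, calling .text on its result raises an AttributeError when the tag is absent, so a check for None is worthwhile on less regular files.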

In [43]:
for event, each_Iteration in etree.iterparse('lxml_practice_example.xml', tag = 'Iteration'):

#     print(event)
    q_id = each_Iteration.find('Iteration_query-ID').text
#     print('Query_id:', q_id)
    
    # find('Hit/Hsp') returns the first Hsp; iterating over it yields its child tags
    for each_field in each_Iteration.find('Hit/Hsp'):
        print(each_field.text)

Gambardella, Matthew
XML Developer's Guide
Computer
44.95
2000-10-01
An in-depth look at creating applications 
      with XML.
Gambardella, Matthew
XML Developer's Guide
Computer
44.95
2000-10-01
An in-depth look at creating applications 
      with XML.
Gambardella, Matthew
XML Developer's Guide
Computer
44.95
2000-10-01
An in-depth look at creating applications 
      with XML.


In [None]:
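Another trick: the argument of find() can chain one tag after another, separated by '/'.
each_Iteration.find('Hit/Hsp') returns the first Hsp found under a Hit of the Iteration (here, the first Hsp of the first Hit). Iterating over that Hsp, as the loop above does, visits its child tags (author, title, ...), which is why only the fields of a single book are printed per Iteration.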
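
Here is a short sketch of the '/' path syntax, assuming an invented Iteration with two Hits and three Hsps (ids and titles are placeholders). It contrasts find() with findall() on the same path and uses findtext(), a shortcut the notebook does not mention, to read the text at the end of a path.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the standard library if lxml is not installed;
    # the path syntax used below works the same in both.
    import xml.etree.ElementTree as etree

# Hypothetical Iteration: two Hits holding three Hsps in total.
iteration = etree.fromstring(
    '<Iteration>'
    '<Hit><Hsp id="bk101"><title>T1</title></Hsp>'
    '<Hsp id="bk102"><title>T2</title></Hsp></Hit>'
    '<Hit><Hsp id="bk103"><title>T3</title></Hsp></Hit>'
    '</Iteration>'
)

first_hsp = iteration.find('Hit/Hsp')                       # first match only
all_ids = [h.get('id') for h in iteration.findall('Hit/Hsp')]  # every Hsp under every Hit
first_title = iteration.findtext('Hit/Hsp/title')           # text of the first match

print(first_hsp.get('id'), all_ids, first_title)
```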

In [38]:
for event, each_Iteration in etree.iterparse('lxml_practice_example.xml', tag = 'Iteration'):

#     print(event)
    q_id = each_Iteration.find('Iteration_query-ID').text
    print('Query_id:', q_id)
    
    for each_Hit in each_Iteration.findall('Hit'):
        cata_str = each_Hit.find('Hit_def').text
        cata_str = cata_str.split(':')[3]
        print('chr:', cata_str)
        for each_Hsp in each_Hit.findall('Hsp'):
            print(each_Hsp.find('title').text)

Query_id: Query_1
chr: 7
XML Developer's Guide
chr: X
Midnight Rain
Maeve Ascendant
Oberon's Legacy
The Sundered Grail
Lover Birds
Splish Splash
Creepy Crawlies
Paradox Lost
Microsoft .NET: The Programming Bible
MSXML3: A Comprehensive Guide
Visual Studio 7: A Comprehensive Guide
Query_id: Query_1
chr: 7
XML Developer's Guide
chr: X
Midnight Rain
Maeve Ascendant
Oberon's Legacy
The Sundered Grail
Lover Birds
Splish Splash
Creepy Crawlies
Paradox Lost
Microsoft .NET: The Programming Bible
MSXML3: A Comprehensive Guide
Visual Studio 7: A Comprehensive Guide
Query_id: Query_1
chr: 7
XML Developer's Guide
chr: X
Midnight Rain
Maeve Ascendant
Oberon's Legacy
The Sundered Grail
Lover Birds
Splish Splash
Creepy Crawlies
Paradox Lost
Microsoft .NET: The Programming Bible
MSXML3: A Comprehensive Guide
Visual Studio 7: A Comprehensive Guide


In [None]:
This approach gets all the information we want; arranging it into a Python dictionary afterwards is skipped here.
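
For readers who want the skipped step, the following is one possible sketch of collecting the results into the goal format, not the author's own code. It assumes a shortened inline copy of the file (the second Hit_def is invented for illustration), a hypothetical helper named collect_queries, and the key name books for the list of books. The tag is filtered by hand so the sketch also runs on the standard library; with lxml, iterparse(source, tag='Iteration') does the same filtering.

```python
import io

try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the standard library if lxml is not installed;
    # the calls used below behave the same in both.
    import xml.etree.ElementTree as etree

# Hypothetical, shortened stand-in for lxml_practice_example.xml.
XML = b"""<PH>
<Iteration><Iteration_query-ID>Query_1</Iteration_query-ID>
<Hit><Hit_def>7 dna_sm:chromosome chromosome:GRCh37:7:1:159138663:1 REF</Hit_def>
<Hsp id="bk101"><title>XML Developer's Guide</title></Hsp>
</Hit>
<Hit><Hit_def>X dna_sm:chromosome chromosome:GRCh37:X:1:155270560:1 REF</Hit_def>
<Hsp id="bk102"><title>Midnight Rain</title></Hsp>
</Hit>
</Iteration>
</PH>"""


def collect_queries(source):
    """Return one dict per Iteration: its Query_id plus a flat list of books."""
    results = []
    for _event, elem in etree.iterparse(source, events=('end',)):
        if elem.tag != 'Iteration':
            continue
        books = []
        for hit in elem.findall('Hit'):
            # Same rule as in the notebook: the catalog is the 4th ':' field.
            catalog = hit.findtext('Hit_def').split(':')[3]
            for hsp in hit.findall('Hsp'):
                books.append({'book_title': hsp.findtext('title'),
                              'book_catalog': catalog})
        results.append({'Query_id': elem.findtext('Iteration_query-ID'),
                        'books': books})
        elem.clear()  # free the processed Iteration
    return results


records = collect_queries(io.BytesIO(XML))
print(records)
```

Passing a file name such as 'lxml_practice_example.xml' instead of the BytesIO object should work the same way, since iterparse accepts either.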