# Session 4


## Example of writing a plain text file: Diary logging


In [None]:
import datetime

content = input("What do you want to say to Mr. Diary? ")
if len(content) > 0:
    with open('diary.txt', "a") as file_obj:
        today = datetime.date.today().isoformat()
        file_obj.write(today + ": " + content + "\n")

with open('diary.txt', "r") as file_obj:
    lines = file_obj.readlines()
    for line in lines[-3:]:
        print(line.rstrip())


What do you want to say to Mr. Diary? Hello python.
2020-06-11: Hello
2020-06-11: Hello
2020-08-10: Hello python.
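
The cell above loads every line of the diary just to print the last three. As the file grows, one possible alternative is to let `collections.deque` keep only the most recent lines. This is a minimal sketch, not part of the original session: `last_lines` and `demo_diary.txt` are hypothetical names, and the demo writes to a temporary folder so the real `diary.txt` is untouched.

```python
import os
import tempfile
from collections import deque


def last_lines(path, n=3):
    """Return the last n lines of a text file without their trailing newlines."""
    with open(path, "r", encoding="utf-8") as file_obj:
        # deque(maxlen=n) discards older lines as newer ones arrive
        return [line.rstrip("\n") for line in deque(file_obj, maxlen=n)]


# Demo on a throwaway file (hypothetical content)
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_diary.txt")
    with open(demo_path, "w", encoding="utf-8") as file_obj:
        for day in range(1, 6):
            file_obj.write(f"2020-06-0{day}: entry {day}\n")
    recent = last_lines(demo_path)

print(recent)
```

Only `n` lines are held in memory at any time, which matters little for a small diary but keeps the approach usable for long log files.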


## Writing DOCX

We can use the `python-docx` module to write content to a DOCX file.

First, we need to install the module by running `pip install python-docx` once, either in a terminal or in Jupyter.

In [None]:
pip install python-docx

Collecting python-docx
  Downloading python_docx-1.1.2-py3-none-any.whl.metadata (2.0 kB)
Downloading python_docx-1.1.2-py3-none-any.whl (244 kB)
Installing collected packages: python-docx
Successfully installed python-docx-1.1.2
Note: you may need to restart the kernel to use updated packages.


👆🏻🤔 If you're wondering how the line above works: `pip install` is normally a command run at the command-line prompt, but Jupyter is smart enough to recognise it and execute it right inside the notebook.

In [None]:
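import os
os.mkdir('Testing')  # Create a folder (raises FileExistsError if it already exists)
os.path.isfile('Testing')  # Check whether the path is an existing file; False here, because 'Testing' is a folder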
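
To make the difference between the path checks concrete, here is a small sketch run inside a temporary directory so it leaves nothing behind. The names `base`, `folder` and `checks` are illustrative only. It uses `os.makedirs(..., exist_ok=True)`, which does not fail when the folder is already there.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as base:
    folder = os.path.join(base, "Testing")
    os.makedirs(folder, exist_ok=True)  # creates the folder
    os.makedirs(folder, exist_ok=True)  # calling it again is safe

    checks = {
        "exists": os.path.exists(folder),  # True for files and folders
        "isdir": os.path.isdir(folder),    # True only for folders
        "isfile": os.path.isfile(folder),  # True only for regular files
    }

print(checks)
```

This is why the diary cell uses `os.path.isfile("diary.docx")`: it is looking for a file, not a folder.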

In [None]:
import datetime
import docx
import os

content = input("What do you want to say to Mr. Diary? ")
if len(content) > 0:
    with open('diary.txt', "a") as file_obj:
        today = str(datetime.date.today())
        file_obj.write(today + ": " + content + "\n")

if os.path.isfile("diary.docx"):
    doc = docx.Document("diary.docx")
    print('Existed')
else:
    doc = docx.Document()
    print('New')

input('Pause')

doc.add_paragraph(content)
doc.save("diary.docx")

print(f"{content} is written to diary.docx")

What do you want to say to Mr. Diary?  Testing


New


Pause 


Testing is written to diary.docx


## Reading DOCX file

Suppose we have a DOCX file named `Sample Document.docx`. We can read all the paragraphs in it.

In [None]:
var1 = 3
print(var1)

3


In [None]:
import docx

doc = docx.Document("Sample Document.docx")
print(doc.paragraphs)

[<docx.text.paragraph.Paragraph object at 0x0000025C942A7FD0>, <docx.text.paragraph.Paragraph object at 0x0000025C93E69B10>, <docx.text.paragraph.Paragraph object at 0x0000025C93F03310>, <docx.text.paragraph.Paragraph object at 0x0000025C942C0590>, <docx.text.paragraph.Paragraph object at 0x0000025C942C0450>, <docx.text.paragraph.Paragraph object at 0x0000025C942C0510>, <docx.text.paragraph.Paragraph object at 0x0000025C942C0950>, <docx.text.paragraph.Paragraph object at 0x0000025C942C0610>, <docx.text.paragraph.Paragraph object at 0x0000025C942C0650>, <docx.text.paragraph.Paragraph object at 0x0000025C942C0990>, <docx.text.paragraph.Paragraph object at 0x0000025C942C0550>]


In [None]:
for p in doc.paragraphs:
    print(p.text)

Sample Document

This is a sample paragraph.

This is the second paragraph.

Here is the result

Summary

This is the summary of the sample report document.


## Reading tables in DOCX file

We can also read the tables and the content.

In [None]:
doc.tables[0].columns[0].cells[1].text

'2020-06-01'

The following code reads the data row by row into 3 lists: `dates`, `morning_visitors`, `evening_visitors`.

In [None]:
table = doc.tables[0]

dates = []
morning_visitors = []
evening_visitors = []

for row in table.rows[1:]:
    dates.append(row.cells[0].text)
    morning_visitors.append(int(row.cells[1].text))
    evening_visitors.append(int(row.cells[2].text))

print(dates)
print(morning_visitors)
print(evening_visitors)


['2020-06-01', '2020-06-02', '2020-06-03', '2020-06-04', '2020-06-05', '2020-06-06', '2020-06-07']
[23, 25, 24, 26, 25, 24, 23]
[17, 16, 16, 15, 16, 17, 18]
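
The same row-to-column conversion can be written with `zip`. The sketch below uses plain lists of strings as a stand-in for the cell text, so it runs without a DOCX file. The header names are an assumption, and only the first three data rows from the output above are included.

```python
# Stand-in for the text of each table row (header first); in the notebook
# these strings would come from row.cells[i].text
rows = [
    ["Date", "Morning", "Evening"],  # hypothetical header labels
    ["2020-06-01", "23", "17"],
    ["2020-06-02", "25", "16"],
    ["2020-06-03", "24", "16"],
]

# zip(*rows) turns a list of rows into a sequence of columns
dates, morning, evening = (list(col) for col in zip(*rows[1:]))
morning = [int(v) for v in morning]  # cell text is always a string
evening = [int(v) for v in evening]

# Per-day totals, pairing the two columns back up
daily_total = [m + e for m, e in zip(morning, evening)]

print(dates)
print(morning)
print(evening)
print(daily_total)
```

Converting to `int` right away matters because `sum()` cannot add strings.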


In [None]:
type(table.columns[0])

docx.table._Column

In [None]:
table = doc.tables[0]

dates = []
morning_visitors = []
evening_visitors = []

for c in table.columns[0].cells[1:]:
    dates.append(c.text)

for c in table.columns[1].cells[1:]:
    morning_visitors.append(int(c.text))  # convert so the values can be summed

for c in table.columns[2].cells[1:]:
    evening_visitors.append(int(c.text))

morning_visitors

[23, 25, 24, 26, 25, 24, 23]

In [None]:
morning_visitors

[23, 25, 24, 26, 25, 24, 23]

In [None]:
sum(morning_visitors)

170

In [None]:
evening_visitors

[17, 16, 16, 15, 16, 17, 18]

In [None]:
sum(morning_visitors) + sum(evening_visitors)

285

### Exercise: Splitting a story  

You will find a `story.txt` in your folder. It contains 12 chapters; try to split each chapter into its own text file.  

e.g.: `Chapter 1 The Mysterious Key.txt` contains the chapter 1 content.

In [None]:
# Hints:
# 1. You can use `in` to check whether a string contains a substring
str1 = 'Chapter 1: Hello World!'
print('Plate' in str1)

# 2. You can slice lists and strings
print(str1[0:10])

# 3. You can replace characters
str1 = 'Chapter 1: Hello World!'
str1 = str1.replace(':','')
print(str1)

False
Chapter 1:
Chapter 1 Hello World!


In [None]:
with open('story.txt', "r") as file_obj:
    lines = file_obj.readlines()
    count=1
    for x in lines:
        if "Chapter "+str(count) in x:
            if (count!=1):
                file2.close()
            print(x)
            count+=1
            file2 = open(x.replace(':','').replace('\n','')+".txt", "w")
        file2.write(x)
    file2.close()

Chapter 1: The Mysterious Island

Chapter 2: The Call to Adventure

Chapter 3: The Gathering

Chapter 4: First Encounter

Chapter 5: The Lost Civilization

Chapter 6: Whispers in the Dark

Chapter 7: The Guardian's Challenge

Chapter 8: The Heart of the Island

Chapter 9: The Test of Knowledge

Chapter 10: The Tempest

Chapter 11: The Revelation

Chapter 12: The Departure



In [None]:
with open('story.txt', encoding="utf-8") as f:
    data = f.read().splitlines()

for d in data:
    if d[:7] == 'Chapter':
        chapter = d.replace(':','')
        with open(f"{chapter}.txt", mode='w', encoding="utf-8") as f:
            f.write(d + '\n')
    else:
        with open(f"{chapter}.txt", mode='a', encoding="utf-8") as f:
            f.write(d + '\n')
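
Both solutions above write to disk while scanning the story. As a sketch of another arrangement, the grouping can be done in memory first, so that each chapter file is written only once and the logic can be checked without `story.txt`. `split_chapters` is a hypothetical helper, and the body lines in `sample` are made up for the demo.

```python
def split_chapters(lines):
    """Group lines by chapter and return {filename: text}.

    Lines that appear before the first chapter heading are skipped.
    """
    chapters = {}
    current = None
    for line in lines:
        if line.startswith("Chapter "):
            # A heading starts a new file; ':' is dropped from the file name
            current = line.replace(":", "").strip() + ".txt"
            chapters[current] = []
        if current is not None:
            chapters[current].append(line)
    return {name: "\n".join(body) + "\n" for name, body in chapters.items()}


sample = [
    "Chapter 1: The Mysterious Island",
    "It was a dark night.",  # invented body text
    "",
    "Chapter 2: The Call to Adventure",
    "The phone rang.",  # invented body text
]
files = split_chapters(sample)
print(list(files))
```

To produce the files, loop over `files.items()` and write each text with `open(name, "w", encoding="utf-8")`.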

# Exercise:  

https://www.dsat.gov.mo/dsat/news.aspx  

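Fetch the latest news list from the news site of the Transport Bureau (DSAT) and save it as Word DOCX files according to the requirements below.  
Include every news item on the first page, with dates changed from DD-MM-YYYY to YYYY-MM-DD.  
Replace spaces and slashes in the titles (and any other special characters) with underscores.  

The output DOCX is judged only on whether the content is complete: title and body. Colours, styles and so on are not considered.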
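
The two text transformations the exercise asks for (date format and file-name clean-up) can be sketched with the standard library alone. `iso_date` and `safe_title` are hypothetical helper names, and the exact set of "special characters" is an assumption: here it is whitespace, slashes, and the characters Windows does not allow in file names.

```python
import re
from datetime import datetime


def iso_date(dd_mm_yyyy):
    """Convert 'DD-MM-YYYY' to 'YYYY-MM-DD'."""
    return datetime.strptime(dd_mm_yyyy, "%d-%m-%Y").strftime("%Y-%m-%d")


def safe_title(title):
    """Replace each run of whitespace, slashes or forbidden characters with '_'."""
    return re.sub(r'[\s/\\:*?"<>|]+', "_", title.strip())


print(iso_date("10-05-2024"))
print(safe_title("聖母像巡遊活動 多處道路周一晚間短暫實施有限度通車"))
print(safe_title("A/B test: draft"))
```

A file name could then be built as `f"{iso_date(date_text)}_{safe_title(title)}.docx"`, where `date_text` and `title` are the strings scraped from the page.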

In [None]:
import requests
import docx
from bs4 import BeautifulSoup

res = requests.get("https://www.dsat.gov.mo/dsat/news.aspx")
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text, 'html.parser')
newslist = soup.find('div', {'id': 'news_list'}).find_all('div', {'class': 'my_news_list'})
for news in newslist:
    text = news.find('div', {'class': 'news_content'}).text.strip()
    date = news.find('div', {'class': 'news_date'}).text.strip().split('-')
    link = 'https://www.dsat.gov.mo/dsat/'+news.find('a')['href']
    date.reverse()
    date="-".join(date)
    print(f"{date} {text}.docx")
    doc = docx.Document()
    res2 = requests.get(link)
    soup2 = BeautifulSoup(res2.text, 'html.parser')
    newsdetail = soup2.find('div', {'class': 'my_content'}).find_all('p')
    for i in newsdetail:
        doc.add_paragraph(i.text.strip())
    doc.save(f"{date} {text}.docx")

2024-05-10 蓮花海濱大馬路5月13日晚間實施臨時交通安排.docx
2024-05-10 青草街、涼水街及沙梨頭街鋪設電纜之臨時交通安排(轉載電力公司新聞稿).docx
2024-05-10 聖母像巡遊活動 多處道路周一晚間短暫實施有限度通車.docx
2024-05-10 光輝路環四月八 路環多處道路周二三實施臨時交管.docx
2024-05-09 交通局已完成10個8年期的士准照審標工作.docx
2024-05-08 關閘地下巴士站天花受損 未影響巴士服務運作.docx
2024-05-08 局方未發現學時不足參與駕駛考試情況.docx
2024-05-08 九澳高頂馬路部分路段5月11日起臨時封閉.docx
2024-05-08 第二回合長跑賽 路氹多處道路周六日臨時交通管制.docx
2024-05-06 下水道整治 黑沙馬路及周邊道路5月8日起臨時交管.docx
2024-05-03 重鋪路面 路氹連貫公路5月6日起臨時交通管制.docx
2024-05-03 何賢紳士大馬路行車天橋及周邊道路下周一起重鋪 晚間施工日間通車.docx
2024-05-03 卑第巷風順堂街及周邊道路5月4日起臨時交通管制.docx
2024-05-03 柯維納馬路戶外公共停車場14日起調整泊車位分配.docx
2024-05-03 安卓版“澳車北上”手機應用程式5月5日起率先試行界面優化.docx


In [None]:
import pandas as pd
import requests
import docx
from bs4 import BeautifulSoup
res = requests.get("https://www.dsat.gov.mo/dsat/news.aspx")
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text, 'html.parser')
newslist = soup.find('div', {'id': 'news_list'}).find_all('div', {'class': 'my_news_list'})
for news in newslist:
    text = news.find('div', {'class': 'news_content'}).text.strip()
    date = pd.to_datetime(news.find('div', {'class': 'news_date'}).text.strip(), format="%d-%m-%Y").strftime("%Y-%m-%d")
    link = 'https://www.dsat.gov.mo/dsat/'+news.find('a')['href']
    print(f"{date} {text}.docx")
    doc = docx.Document()
    res2 = requests.get(link)
    soup2 = BeautifulSoup(res2.text, 'html.parser')
    newsdetail = soup2.find('div', {'class': 'my_content'}).find_all('p')
    for i in newsdetail:
        if i.attrs.get('class') and i.attrs.get('class')[0]=='content_title':
            doc.add_heading(i.text.strip())
        else:
            doc.add_paragraph(i.text.strip())
    doc.save(f"{date} {text}.docx")

2024-05-10 蓮花海濱大馬路5月13日晚間實施臨時交通安排.docx
2024-05-10 青草街、涼水街及沙梨頭街鋪設電纜之臨時交通安排(轉載電力公司新聞稿).docx
2024-05-10 聖母像巡遊活動 多處道路周一晚間短暫實施有限度通車.docx
2024-05-10 光輝路環四月八 路環多處道路周二三實施臨時交管.docx
2024-05-09 交通局已完成10個8年期的士准照審標工作.docx
2024-05-08 關閘地下巴士站天花受損 未影響巴士服務運作.docx
2024-05-08 局方未發現學時不足參與駕駛考試情況.docx
2024-05-08 九澳高頂馬路部分路段5月11日起臨時封閉.docx
2024-05-08 第二回合長跑賽 路氹多處道路周六日臨時交通管制.docx
2024-05-06 下水道整治 黑沙馬路及周邊道路5月8日起臨時交管.docx
2024-05-03 重鋪路面 路氹連貫公路5月6日起臨時交通管制.docx
2024-05-03 何賢紳士大馬路行車天橋及周邊道路下周一起重鋪 晚間施工日間通車.docx
2024-05-03 卑第巷風順堂街及周邊道路5月4日起臨時交通管制.docx
2024-05-03 柯維納馬路戶外公共停車場14日起調整泊車位分配.docx
2024-05-03 安卓版“澳車北上”手機應用程式5月5日起率先試行界面優化.docx


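# Markdown to DOCX

Build a simple converter that generates a DOCX file from text. In this version we do not aim to implement every Markdown feature. We only need to convert top-level headings, text paragraphs, and page breaks.   


Markdown is a lightweight plain-text format. For example, a line that starts with # is a heading. A line that is --- or ---- is a page break. All other lines of text form paragraphs.


Paragraphs follow one special rule: a single line break does not start a new paragraph and has no effect at all, so it produces no line break in the Word output. Only two line breaks in a row (that is, a blank line) separate one paragraph from the next.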
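
Before involving `python-docx`, the rules above can be tried out on plain strings. The sketch below is one possible reading of them: `parse_blocks` is a hypothetical helper that returns `(kind, text)` tuples, and the lines of one paragraph are joined without a space, which suits Chinese text but would need a space for English.

```python
def parse_blocks(markdown_text):
    """Turn text into ('heading' | 'paragraph' | 'page_break', text) tuples."""
    blocks = []
    buffer = []  # lines of the paragraph currently being collected

    def flush():
        # Emit the collected lines as one paragraph, if there are any
        if buffer:
            blocks.append(("paragraph", "".join(buffer)))
            buffer.clear()

    for line in markdown_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            flush()
            blocks.append(("heading", stripped.lstrip("#").strip()))
        elif stripped in ("---", "----"):
            flush()
            blocks.append(("page_break", ""))
        elif stripped == "":
            flush()  # a blank line ends the paragraph
        else:
            buffer.append(stripped)  # a single line break has no effect
    flush()  # the text may not end with a blank line
    return blocks


sample = "# Title\nline one,\nstill line one.\n\nline two.\n---\npage two."
for block in parse_blocks(sample):
    print(block)
```

Each tuple maps directly onto one `python-docx` call: `add_heading`, `add_paragraph`, or `add_page_break`.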


In [None]:
import docx

# Create an instance of a word document
doc = docx.Document()

# Add a Title to the document
doc.add_heading('GeeksForGeeks', 0)

# Adding a paragraph
doc.add_heading('Page 1:', 3)
doc.add_paragraph('GeeksforGeeks is a Computer Science portal for geeks.')

# Adding a page break
doc.add_page_break()

# Adding a paragraph
doc.add_heading('Page 2:', 3)
doc.add_paragraph('GeeksforGeeks is a Computer Science portal for geeks.')

# Now save the document to a location
doc.save('gfg.docx')

In [None]:
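import docx

with open('sample.md', "r", encoding="utf-8") as f:
    data = f.readlines()

doc = docx.Document()
text = ''

def flush_paragraph():
    # Write the buffered lines as one paragraph, then empty the buffer
    global text
    if text != '':
        print(text)
        doc.add_paragraph(text)
        text = ''

for x in data:
    if x.startswith('#'):
        flush_paragraph()  # pending text belongs before the heading
        doc.add_heading(x.lstrip('#').strip())
    elif x.startswith('---'):
        flush_paragraph()
        doc.add_page_break()
    elif x != '\n':
        text += x.strip()
    else:
        flush_paragraph()  # a blank line ends the paragraph

flush_paragraph()  # the file may not end with a blank line
doc.save('testing.docx')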

段落一的文字，段落一的文字，仍然是段落一的文字，且在同一行。
段落二開始了。是段落二的文字。仍然是段落二的文字。
段落三了。
第二頁第一段文字。
第二頁第二段文字。
第三頁第一段文字。是段落一的文字。
第三頁第二段文字。


In [None]:
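import docx

with open('sample.md', "r", encoding="utf-8") as f:
    data = f.read().splitlines()

doc = docx.Document()
c = None  # paragraph being built; None means the next text line starts a new one
for d in data:
    if d.startswith('#'):
        doc.add_heading(d.lstrip('#').strip())
        c = None
    elif d.startswith('---'):  # matches both --- and ----
        doc.add_page_break()
        c = None
    elif d == '':
        c = None  # a blank line ends the current paragraph
    elif c is None:
        c = doc.add_paragraph(d.strip())
    else:
        c.text += d.strip()  # a single line break continues the paragraph
doc.save('testing2.docx')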