# Working with Word Files

## Word Documents

python-docx is a Python library for creating and updating Microsoft Word (.docx) files.

https://python-docx.readthedocs.io/en/latest/

Several Python libraries can read and write MS Word files. We will use the python-docx module because it is easy to use. Run the following pip command to install it; in a notebook cell, remove the leading # so that the !pip line is executed:

In [2]:
#!pip install python-docx

### Reading MS Word Files

In this section, you will see how to read text from MS Word files via the python-docx module.

Create a new MS Word file and rename it as "Uvod_v_Python.docx". 

To read the above file, first import the docx module and then create an object of its Document class. Pass the path of Uvod_v_Python.docx (stored here in the data folder) to the constructor of the Document class, as shown in the following script:

In [29]:
import docx

doc = docx.Document('data/Uvod_v_Python.docx')

The Document object doc can now be used to read the content of Uvod_v_Python.docx.

#### Reading Paragraphs

Once you have created a Document object from the file path, you can access all the paragraphs in the document via its paragraphs attribute. An empty line is also read as a paragraph. Let's fetch all the paragraphs from Uvod_v_Python.docx and display how many there are:

In [30]:
all_paras = doc.paragraphs
len(all_paras)

12

Now we'll print every paragraph in Uvod_v_Python.docx, one after another:

In [31]:
for para in all_paras:
    print(para.text)
    print("-------")

Uvod v Python
-------

-------
Python is an easy to learn, powerful programming language. It has efficient high-level data structures and a simple but effective approach to object-oriented programming. Python’s elegant syntax and dynamic typing, together with its interpreted nature, make it an ideal language for scripting and rapid application development in many areas on most platforms.
-------

-------
Using the Python Interpreter
-------
When commands are read from a tty, the interpreter is said to be in interactive mode. In this mode it prompts for the next command with the primary prompt, usually three greater-than signs (>>>); for continuation lines it prompts with the secondary prompt, by default three dots (...). The interpreter prints a welcome message stating its version number and a copyright notice before printing the first prompt:
-------

-------
Tabela
-------

-------

-------

-------

-------


The output shows all of the paragraphs in the Word file.

We can also access a specific paragraph by indexing the paragraphs property like a list. Indexing starts at 0, so index 2 gives the third paragraph in the file:

In [32]:
single_para = doc.paragraphs[2]
print(single_para.text)

Python is an easy to learn, powerful programming language. It has efficient high-level data structures and a simple but effective approach to object-oriented programming. Python’s elegant syntax and dynamic typing, together with its interpreted nature, make it an ideal language for scripting and rapid application development in many areas on most platforms.


In [33]:
print(single_para.text.lower().replace(',', '').replace('.','').split())

['python', 'is', 'an', 'easy', 'to', 'learn', 'powerful', 'programming', 'language', 'it', 'has', 'efficient', 'high-level', 'data', 'structures', 'and', 'a', 'simple', 'but', 'effective', 'approach', 'to', 'object-oriented', 'programming', 'python’s', 'elegant', 'syntax', 'and', 'dynamic', 'typing', 'together', 'with', 'its', 'interpreted', 'nature', 'make', 'it', 'an', 'ideal', 'language', 'for', 'scripting', 'and', 'rapid', 'application', 'development', 'in', 'many', 'areas', 'on', 'most', 'platforms']
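
The chained replace() calls above remove only commas and periods. As a possible alternative, a regular expression can pick out the words directly; this sketch uses only the standard library, and the function name tokenize and the sample sentence are assumptions made for the example.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and return its words without other punctuation."""
    # \w matches letters, digits and the underscore; apostrophes and hyphens
    # are kept so that "python’s" and "high-level" stay single tokens.
    return re.findall(r"[\w’'-]+", text.lower())


sample = "Python’s syntax (and typing); it's high-level: easy!"
tokens = tokenize(sample)
print(tokens)
```

Note that a hyphen standing on its own between spaces would also be returned as a token, so the pattern may need tightening for other texts.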


Exercise: a function that reads the entire text.



In [36]:
def read_all_text(document_path, sep='\n'):
    doc = docx.Document(document_path)
    final_text = []
    for line in doc.paragraphs:
        final_text.append(line.text)
    return sep.join(final_text)

In [38]:
read_all_text('data/Uvod_v_Python.docx', '|')

'Uvod v Python||Python is an easy to learn, powerful programming language. It has efficient high-level data structures and a simple but effective approach to object-oriented programming. Python’s elegant syntax and dynamic typing, together with its interpreted nature, make it an ideal language for scripting and rapid application development in many areas on most platforms.||Using the Python Interpreter|When commands are read from a tty, the interpreter is said to be in interactive mode. In this mode it prompts for the next command with the primary prompt, usually three greater-than signs (>>>); for continuation lines it prompts with the secondary prompt, by default three dots (...). The interpreter prints a welcome message stating its version number and a copyright notice before printing the first prompt:||Tabela||||'

#### Reading Runs

A **run in a Word document** is **a contiguous sequence of characters that share the same formatting**, such as font size, typeface, and font style.

To get all the runs in a paragraph, use the runs property of a Paragraph object, for example doc.paragraphs[2].runs.

In [34]:
single_para = doc.paragraphs[2]

for run in single_para.runs:
    print(run.text)
    print('----')

Python is an easy to learn
----
, powerful programming language. It has efficient high-level data structures and a simple but effective approach to object-oriented programming. Python’s elegant syntax and dynamic typing, together with its interpreted nature, 
----
make it an ideal language
----
 for scripting and rapid application development in many areas on most platforms.
----


#### Reading Tables

In [39]:
# Load the first table from your document. In your example file,
# there is only one table, so I just grab the first one.
doc = docx.Document('data/Uvod_v_Python.docx')
table = doc.tables[0]

In [40]:
# Data will be a list of rows represented as dictionaries
# containing each row's data.
data = []

In [48]:
keys = None
for i, row in enumerate(table.rows):
    text = (cell.text for cell in row.cells)

    # Establish the mapping based on the first row
    # headers; these will become the keys of our dictionary
    if i == 0:
        keys = tuple(text)
        continue

    # Construct a dictionary for this row, mapping
    # keys to values for this row
    row_data = dict(zip(keys, text))
    data.append(row_data)

In [49]:
data

[{'Verzija': '1.3', 'Hitrost': '58', 'Cena': '425'},
 {'Verzija': '2.2', 'Hitrost': '78', 'Cena': '526'},
 {'Verzija': '2.8', 'Hitrost': '79', 'Cena': '636'}]
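
The core of the loop above is pairing the header row with each following row via zip(). Stripped of the Word-specific parts, the idea can be sketched on plain lists; the function name rows_to_dicts is hypothetical and the rows are copied from the output above.

```python
from typing import Dict, List


def rows_to_dicts(rows: List[List[str]]) -> List[Dict[str, str]]:
    """Treat the first row as the header and map every other row to a dict."""
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


demo_rows = [
    ["Verzija", "Hitrost", "Cena"],
    ["1.3", "58", "425"],
    ["2.2", "78", "526"],
]
demo_records = rows_to_dicts(demo_rows)
print(demo_records)
```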

In [50]:
import pandas as pd

In [51]:
df = pd.DataFrame.from_dict(data)

In [54]:
df

|   | Verzija | Hitrost | Cena |
|---|---------|---------|------|
| 0 | 1.3 | 58 | 425 |
| 1 | 2.2 | 78 | 526 |
| 2 | 2.8 | 79 | 636 |
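
Text read from Word cells arrives as strings, so the columns cannot be used for arithmetic yet. One way to convert them, assuming every column holds numeric text as in this example, is pd.to_numeric; the variable names below are chosen for the illustration.

```python
import pandas as pd

demo_data = [
    {"Verzija": "1.3", "Hitrost": "58", "Cena": "425"},
    {"Verzija": "2.2", "Hitrost": "78", "Cena": "526"},
    {"Verzija": "2.8", "Hitrost": "79", "Cena": "636"},
]
demo_df = pd.DataFrame(demo_data)

# Every column holds strings at this point; convert each one to a numeric dtype.
demo_numeric = demo_df.apply(pd.to_numeric)
print(demo_numeric.dtypes)
print(demo_numeric["Cena"].mean())
```

If a table mixes text and numbers, the conversion would have to be applied only to selected columns instead of the whole DataFrame.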


Exercise: a function that converts a table from the document into a DataFrame.

In [100]:
import pandas as pd
import docx

def read_docx_table_to_df(document_path: str, table_id: int = 1) -> pd.DataFrame:
    document = docx.Document(document_path)
    
    tables = []
    for table in document.tables:
        df = [['' for i in range(len(table.columns))] for j in range(len(table.rows))]
        for i, row in enumerate(table.rows):
            for j, cell in enumerate(row.cells):
                if cell.text:
                    df[i][j] = cell.text
        
        tables.append(pd.DataFrame(df[1:], columns=df[0]))
        
    if (table_id > len(tables)) or (table_id < 1):
        raise ValueError(f'table_id {table_id} does not exist. There are {len(tables)} table(s) in {document_path}')
    
    return tables[table_id - 1]

In [105]:
df_list = read_docx_table_to_df('data/Uvod_v_Python.docx', 1)

In [106]:
df_list

|   | Verzija | Hitrost | Cena |
|---|---------|---------|------|
| 0 | 1.3 | 58 | 425 |
| 1 | 2.2 | 78 | 526 |
| 2 | 2.8 | 79 | 636 |


### Exporting Tables to Excel

In [None]:
import os
from typing import List

import docx
import pandas as pd

def read_docx_table_to_list(document_path: str) -> List:
    doc = docx.Document(document_path)
    tables = doc.tables
    final_tables = []
    for table in tables:
        data = []
        keys = None
        for i, row in enumerate(table.rows):
            text = (cell.text for cell in row.cells)
            if i == 0:
                keys = tuple(text)
                continue
            row_data = dict(zip(keys, text))
            data.append(row_data)
        final_tables.append(data)
    return final_tables


def read_docx_table_to_df(document_path: str) -> List[pd.DataFrame]:
    tables = read_docx_table_to_list(document_path)
    tables_df = [pd.DataFrame.from_dict(table) for table in tables]
    return tables_df


def convert_docx_tables_to_xlsx(document_path: str):
    tables = read_docx_table_to_df(document_path)
    for count, table in enumerate(tables):
        converted_file_name = f"{os.path.splitext(document_path)[0]}-table-{count}.xlsx"
        print(f"Converting file {converted_file_name}...")
        table.to_excel(converted_file_name)


if __name__ == "__main__":
    convert_docx_tables_to_xlsx("Del_08_Generiranje_porocil/data/Uvod_v_Python.docx")
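
The script above writes one .xlsx file per table. If a single workbook with one sheet per table is preferred, pd.ExcelWriter can be used instead. This is a sketch under assumptions: the function name write_tables_to_one_workbook, the sheet naming scheme, and the sample DataFrames are invented for the example, and pandas needs an Excel engine such as openpyxl installed.

```python
import os
import tempfile
from typing import List

import pandas as pd


def write_tables_to_one_workbook(tables: List[pd.DataFrame], xlsx_path: str) -> List[str]:
    """Write each DataFrame to its own sheet and return the sheet names used."""
    sheet_names = []
    with pd.ExcelWriter(xlsx_path) as writer:
        for count, table in enumerate(tables, start=1):
            sheet_name = f"table-{count}"
            # index=False leaves out the 0, 1, 2, ... row labels
            table.to_excel(writer, sheet_name=sheet_name, index=False)
            sheet_names.append(sheet_name)
    return sheet_names


demo_tables = [
    pd.DataFrame({"Verzija": ["1.3", "2.2"], "Cena": ["425", "526"]}),
    pd.DataFrame({"Hitrost": ["58", "78"]}),
]
out_path = os.path.join(tempfile.mkdtemp(), "tables.xlsx")
written_sheets = write_tables_to_one_workbook(demo_tables, out_path)
print(written_sheets)
```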

### Writing MS Word Files

To write MS Word files, create an object of the Document class without passing a file name. This gives you a new, empty document based on the library's default template.

In [180]:
from docx.shared import Inches

mydoc = docx.Document()
path = "data/my_written_file.docx"

To write paragraphs, use the add_paragraph() method of the Document object. The document exists only in memory until you call save() on it, passing the path of the target file as a parameter. If the file doesn't already exist, a new file is created; if it does exist, it is overwritten with the in-memory document. To add content to an existing Word file, open it with docx.Document(path) first, add the paragraph, and then save it again.

In [181]:
mydoc.add_paragraph("This is first paragraph of a MS Word file.")

<docx.text.paragraph.Paragraph at 0x7fd60e611070>

Once the document has been saved with save() (done at the end of this section), you should see a new file "my_written_file.docx" in the data directory. Inside the file, the first paragraph reads "This is first paragraph of a MS Word file."

In [182]:
from datetime import datetime
time = datetime.now().strftime("%d.%m.%Y %H:%M:%S")

In [183]:
mydoc.add_paragraph(f"Current datetime: {time}.")

<docx.text.paragraph.Paragraph at 0x7fd60d9b6250>

In [184]:
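# The second argument is the heading level: 1 is the top-level heading,
# while level 0 would produce the document title instead.
mydoc.add_heading("This is level 1 heading", 1)
mydoc.add_heading("This is level 2 heading", 2)
mydoc.add_heading("This is level 3 heading", 3)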

<docx.text.paragraph.Paragraph at 0x7fd60e61b3d0>

In [185]:
mydoc.add_paragraph('Intense quote', style='Intense Quote')

<docx.text.paragraph.Paragraph at 0x7fd5fc11ef10>

In [186]:
mydoc.add_paragraph(
    'first item in unordered list', style='List Bullet'
)

<docx.text.paragraph.Paragraph at 0x7fd60e618df0>

In [187]:
mydoc.add_page_break()

<docx.text.paragraph.Paragraph at 0x7fd60e618bb0>

In [188]:
mydoc.add_picture('data/slika.jpg', width=Inches(1.25))

<docx.shape.InlineShape at 0x7fd5fc118730>

In [189]:
from docx.enum.style import WD_STYLE_TYPE
from docx.shared import Pt

styles = mydoc.styles
style = styles.add_style('tahoma_big', WD_STYLE_TYPE.PARAGRAPH)
style.font.name = 'Tahoma'
style.font.size = Pt(25)

p = mydoc.add_paragraph('danes je lep dan')
p.style = mydoc.styles['tahoma_big']

In [191]:
mydoc.save(path)

## Exercise 1: Reading Complex Tables from Word

An example of tables in Word or PDF format that I would like to convert to Excel format, as shown in the attachment.

We often receive requirements from customers in this form (Word, PDF) and would like to bring them into the standard Excel format used in our company. From the Excel file we then generate C code (a header file), which is the basis for our further work.

In the attachment I have shown 2 examples of tables that I would like to convert to Excel in the desired format. I hope the example in the attachment makes the goal clear. If anything is unclear, we can also have a short call. We can then discuss the matter at the lecture on Wednesday.

In [93]:
import re
import docx
from openpyxl import Workbook
from openpyxl.styles import Alignment, Font, PatternFill


class DocxTableReportParser:
    def __init__(self, docx_path):
        self.docx_path = docx_path
        self.workbook = Workbook()
        self.sheet = self.workbook.active
        self.tables = self._read_docx_table_all_elements_to_list(self.docx_path)
        # styles
        self.bold_font = Font(bold=True)
        self.center_aligned_text = Alignment(horizontal="center")
        self.light_gray_bg_color = PatternFill(start_color="D3D3D3", fill_type = "solid")
        self.red_bg_color = PatternFill(start_color="FF0000", fill_type = "solid")
        self.yellow_bg_color = PatternFill(start_color="FFFF00", fill_type = "solid")
        self.cyan_bg_color = PatternFill(start_color="00FFFF", fill_type = "solid")
        self._add_first_line_template()
        self._add_second_line_template()
        
    def _append_rows(self, rows):
        for row in rows:
            self.sheet.append(row)
            
    def _get_class_id_version(self, table_id):
        data = self.tables[table_id][0][1].split(',')
        # (\d+) captures multi-digit numbers too
        class_id = int(re.search(r'Class_id\s?=\s?(\d+)', data[0], re.IGNORECASE).group(1))
        version = int(re.search(r'version\s?=\s?(\d+)', data[1], re.IGNORECASE).group(1))
        return class_id, version
    
    def _get_obis_code_daily(self, table_id):
        result = self.tables[table_id][2]
        result = [el.strip() for el in result]
        obis_code = f"{result[1]}-{result[2]}:{result[3]}.{result[4]}.{result[5]}.{result[6]}"
        return obis_code
    
    def _add_atributes_table(self, table_id):
        attributes = self.tables[table_id][3:]
        attributes_name = [[int(attr[0].split()[0].replace('.','')), attr[0].split()[1]] for attr in attributes]
        access_rights = []
        for ar in attributes:    
            access_rights.append([el.replace('R/-', 'Get_1').replace('R/W', 'Get_1,Set_1').replace('-/- ', '') for el in ar[2:]])
        final_list = []
        for an, ar in zip(attributes_name, access_rights):
            final_list.append([an[0], an[1], '', '', '', ar[0], ar[1], ar[3], ar[2]])
        self._append_rows(final_list)
        return len(attributes_name)
        
        
    def _read_docx_table_all_elements_to_list(self, document_path: str):
        doc = docx.Document(document_path)
        tables_doc = doc.tables
        final_tables = []
        for table in tables_doc:
            data = []
            for i, row in enumerate(table.rows):
                text = list((cell.text for cell in row.cells))
                data.append(text)
            final_tables.append(data)
        return final_tables
        
        
    def _add_first_line_template(self):
        '''Add first line fix template.'''
        rows = [
            ["", "Object/Attribute Name", "IC", "IC", "OBIS Object",
            "Access right", "Access right", "Access right", "Access right"]]
        
        for row in rows:
            self.sheet.append(row)
            
        # merge cells
        self.sheet.merge_cells('C1:D1')
        self.sheet.merge_cells('F1:I1')
        # style the first row
        for cell in self.sheet["1:1"]:
            cell.fill = self.light_gray_bg_color
            cell.alignment = self.center_aligned_text
            cell.font = self.bold_font
            
    def _add_second_line_template(self):
        '''Add second line fix template'''
        rows = [["#", "Object/Attribute Name", "Class ID", "Ver.", "OBIS Object Code / Default Value",
        "A.1", "A.2", "A.3", "A.4"]]
        self._append_rows(rows)
        for cell in self.sheet["2:2"]:
            cell.fill = self.light_gray_bg_color
            cell.alignment = self.center_aligned_text
            
    def add_energy_profile_daily_snapshot(self, id_class, id_attr):
        # add the red row
        rows = [["", "Energy Profile (Daily snapshot)"]]
        self._append_rows(rows)
        current_row = self.sheet._current_row
        for cell in self.sheet[f"{current_row}:{current_row}"]:
            cell.fill = self.red_bg_color
            cell.font = self.bold_font
            
        # add the yellow row
        class_id, version = self._get_class_id_version(id_attr)
        obis_code = self._get_obis_code_daily(id_class)

        rows = [["Attr", "Energy Profile (Daily snapshot)", class_id, version, obis_code]]
        self._append_rows(rows)
        current_row = self.sheet._current_row
        for cell in self.sheet[f"{current_row}:{current_row}"]:
            cell.fill = self.yellow_bg_color
            cell.font = self.bold_font
        
        # add the attributes
        table_len = self._add_atributes_table(id_attr)
        current_row = self.sheet._current_row
        for a in self.sheet[f"A{current_row-table_len+1}:B{current_row}"]:
            for cell in a:
                cell.fill = self.cyan_bg_color
                
        # add an empty row
        self.sheet._current_row += 1
        
        
    def add_total_energy_registers(self, id_class, id_attr):
        # add the red row
        rows = [["", "Total Energy Registers"]]
        self._append_rows(rows)
        current_row = self.sheet._current_row
        for cell in self.sheet[f"{current_row}:{current_row}"]:
            cell.fill = self.red_bg_color
            cell.font = self.bold_font
            
        # add the yellow row
        class_id, version = self._get_class_id_version(id_attr)
        obis_code = self.tables[0][1][-1].strip()

        rows = [["Attr", "Active energy import (+A)", class_id, version, obis_code]]
        self._append_rows(rows)
        current_row = self.sheet._current_row
        for cell in self.sheet[f"{current_row}:{current_row}"]:
            cell.fill = self.yellow_bg_color
            cell.font = self.bold_font
            
        # add the attributes
        table_len = self._add_atributes_table(id_attr)
        current_row = self.sheet._current_row
        for a in self.sheet[f"A{current_row-table_len+1}:B{current_row}"]:
            for cell in a:
                cell.fill = self.cyan_bg_color    
        
        # add an empty row
        self.sheet._current_row += 1
        
        # add the yellow row
        class_id, version = self._get_class_id_version(id_attr)
        obis_code = self.tables[0][2][-1].strip()

        rows = [["Attr", "Active energy import (-A)", class_id, version, obis_code]]
        self._append_rows(rows)
        current_row = self.sheet._current_row
        for cell in self.sheet[f"{current_row}:{current_row}"]:
            cell.fill = self.yellow_bg_color
            cell.font = self.bold_font
            
        # add the attributes
        table_len = self._add_atributes_table(id_attr)
        current_row = self.sheet._current_row
        for a in self.sheet[f"A{current_row-table_len+1}:B{current_row}"]:
            for cell in a:
                cell.fill = self.cyan_bg_color
                
        # add an empty row
        self.sheet._current_row += 1
        
    
    def save_table(self, table_path):
        self.workbook.save(filename=table_path)

In [94]:
my_parser = DocxTableReportParser("data/PRIMER_kompleksna_tabela.docx")

my_parser.add_total_energy_registers(0, 1)
my_parser.add_energy_profile_daily_snapshot(2, 3)
my_parser.save_table("data/complex_report.xlsx")
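
The parsing steps inside the class can also be written as small standalone functions, which makes them easy to check without a Word file. The sketch below is a variation and not the code used above: the function names are hypothetical, the sample inputs are invented, the regular expressions search the whole header cell instead of splitting it on the comma first, and access rights are matched exactly after stripping instead of by substring replacement.

```python
import re
from typing import List, Tuple

# Mapping taken from the replace() chain in _add_atributes_table.
ACCESS_RIGHTS = {"R/-": "Get_1", "R/W": "Get_1,Set_1", "-/-": ""}


def parse_class_id_version(header_cell: str) -> Tuple[int, int]:
    """Extract the numbers from text such as 'Class_id = 7, version = 1'."""
    class_id = re.search(r"class_id\s*=\s*(\d+)", header_cell, re.IGNORECASE)
    version = re.search(r"version\s*=\s*(\d+)", header_cell, re.IGNORECASE)
    if class_id is None or version is None:
        raise ValueError(f"Cannot find class_id/version in {header_cell!r}")
    return int(class_id.group(1)), int(version.group(1))


def format_obis_code(row: List[str]) -> str:
    """Build 'A-B:C.D.E.F' from cells 1..6 of a row, ignoring cell 0."""
    a, b, c, d, e, f = (cell.strip() for cell in row[1:7])
    return f"{a}-{b}:{c}.{d}.{e}.{f}"


def map_access_right(cell: str) -> str:
    """Translate one access-right cell; unknown values are returned unchanged."""
    value = cell.strip()
    return ACCESS_RIGHTS.get(value, value)


# Invented sample inputs, only to show the expected shapes.
print(parse_class_id_version("Class_id = 7, version = 1"))
print(format_obis_code(["Logical name", "1", "0", "99", "2", "0", "255"]))
print([map_access_right(c) for c in ["R/-", "R/W", "-/- "]])
```

Raising a ValueError with the offending text makes a malformed header cell easier to locate than the AttributeError that a failed re.search() followed by group() would produce.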