<hr>

# # OVERVIEW

#### Objective of this page (this is a Python notebook)
 - \# Convert tables in a PDF file into an Excel file easily.

#### Input & Outcome
 - \# Type the name of the PDF file in [Part.1] below.
 - \# Running the code produces an ***Excel file*** containing the tables extracted from the PDF file.

#### Significance
 - \# Cost-cutting >> Saving employees' time and resources
   - The task can take as little as about 1/40 of the time originally needed.
   - The bigger the PDF file, the more time (and cost) is saved.

<hr>

# # Usage Guide
 - This notebook is run from Google Drive.

#### Pre-requisite
 - Create a folder named 'DIGITIZATION' in your 'MyDrive' folder on Google Drive.
 - Put this notebook file (.ipynb) in the 'DIGITIZATION' folder.
   (Google Drive >> 'MyDrive' >> 'DIGITIZATION' folder)
 - Put the PDF file to convert in the same location.
   (Google Drive >> 'MyDrive' >> 'DIGITIZATION' folder)
 - If the PDF file has more than one table, each table is saved to a separate sheet of the Excel file.
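
The folder layout above can be verified before running the conversion. The following is a minimal sketch, not part of the original notebook: the helper name `check_prerequisites` is hypothetical, and the default path assumes Google Drive is mounted at `/content/drive`.

```python
from pathlib import Path

# Assumed location of the working folder once Google Drive is mounted in Colab.
DEFAULT_FOLDER = '/content/drive/MyDrive/DIGITIZATION'


def check_prerequisites(pdf_file_name, folder=DEFAULT_FOLDER):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    folder_path = Path(folder)
    if not folder_path.is_dir():
        problems.append(f'Folder not found: {folder_path}')
        return problems  # nothing else can be checked without the folder
    if not pdf_file_name.lower().endswith('.pdf'):
        problems.append(f'Not a .pdf file name: {pdf_file_name}')
    if not (folder_path / pdf_file_name).is_file():
        problems.append(f'PDF not found in folder: {pdf_file_name}')
    return problems
```

Calling `check_prerequisites('sample.pdf')` after mounting the drive returns an empty list when the folder and the PDF file are both in place.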

#### Step.1
 - Edit the name of the PDF file in [Part.1].

#### Step.2
 - Run all the cells below.
 - You can run a single cell either by clicking the '>' button on the left side of the cell or by pressing 'Shift + Enter'.

#### Step.3
 - Go to the 'DIGITIZATION' folder and check the Excel file that was generated. You can also download it.

#### Note!
 - While you run the cells below, this notebook goes through the Google Drive authentication process. Click yes to proceed; this temporarily allows the notebook to access your Google Drive folder while you are using it.
 - If something goes wrong and you want to start over, just refresh (F5) this webpage.
 - (Additional Support & CONTACT) 
   - Donghwan Kim
   - Strategy APAC - Data Science Team
   - kdonghwan@ups.com

<hr>

# # Contents

#### # Part.1. [EDITABLE] File Name Modification
 - ***YOU CAN TYPE(EDIT) THE NAME OF THE PDF FILE THAT YOU WANT TO CONVERT***

#### # Part.2. [RUN ONLY] Code Blocks
 - Prepare to convert ***tables in the PDF file*** to an ***Excel file***.
 - Convert the tables in the PDF file into an Excel file.
 - Save the Excel file in the Google Drive folder.

<hr>


# # Part.1. [EDITABLE] File Name Modification
 - ***YOU CAN TYPE(EDIT) THE NAME OF THE PDF FILE THAT YOU WANT TO CONVERT***

In [1]:
###############################################################################
##### (the only part you should modify) #######################################
##### Please type in the name of the pdf file between quotation marks. ########
##### This pdf file should be in the same google drive folder.         ########
###############################################################################
PDF_FILE_NAME = 'sample.pdf'
##### (example below) #########################################################
# PDF_FILE_NAME = 'Philippines - Schedule of Tariff Commitments for China.pdf'
###############################################################################
###############################################################################
###############################################################################

# # Part.2. [RUN ONLY] Code Blocks
 - Prepare to convert ***tables in the PDF file*** to an ***Excel file***.
 - Convert the tables in the PDF file into an Excel file.
 - Save the Excel file in the Google Drive folder.

In [2]:
!pip install -q tabula-py

In [3]:
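import warnings

import pandas as pd
import tabula
from google.colab import drive

warnings.filterwarnings('ignore')  # keep the notebook output free of library warnings
drive.mount('/content/drive/')     # starts the Google Drive authorization prompt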

Mounted at /content/drive/


In [4]:
%cd /content/drive/MyDrive/DIGITIZATION

/content/drive/MyDrive/DIGITIZATION


In [5]:
ls -al

total 853
drwx------ 2 root root   4096 Jun  6 10:38  done/
-rw------- 1 root root  12230 Jun  6 11:06  NOTEBOOK_Converting_PDF_to_EXCEL.ipynb
-rw------- 1 root root 736055 Nov 13  2020 'Philippines - Schedule of Tariff Commitments for China.pdf'
-rw------- 1 root root  59855 Jun  8 08:14 'Philippines - Schedule of Tariff Commitments for China.xlsx'
-rw------- 1 root root  50881 Jun  9 08:04  sample.pdf
-rw------- 1 root root   9370 Jun  9 08:18  tmp.ipynb
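
The listing above mixes PDF files that already have an Excel counterpart with ones that do not. As a rough sketch (the helper name `pending_pdfs` is hypothetical and not part of the original notebook), the PDF files still waiting to be converted could be listed like this. It assumes the converted file shares the PDF's name with an `.xlsx` extension, and that the extension is written in lowercase.

```python
from pathlib import Path


def pending_pdfs(folder='.'):
    """List PDF names in `folder` that have no same-named .xlsx next to them."""
    folder_path = Path(folder)
    pending = []
    for pdf_path in sorted(folder_path.glob('*.pdf')):
        # A PDF counts as converted when '<same name>.xlsx' already exists.
        if not pdf_path.with_suffix('.xlsx').exists():
            pending.append(pdf_path.name)
    return pending
```

Any name it returns can be used as `PDF_FILE_NAME` in [Part.1].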


In [6]:
# Extract every table from all pages; stream mode detects columns by whitespace rather than ruling lines.
tables = tabula.read_pdf(PDF_FILE_NAME, stream=True, pages='all')
print('***************************************************************************')
print('** How many tables in this pdf file? >> Total {} tables.'.format(len(tables)))
if tables:  # tables[0] would raise IndexError if no table was detected
    print('** How many rows and columns are there in the first table in the first page of pdf file? >> {} Rows & {} Columns'.format(tables[0].shape[0], tables[0].shape[1]))
print('***************************************************************************')

***************************************************************************
** How many tables in this pdf file? >> Total 1 tables.
** How many rows and columns are there in the first table in the first page of pdf file? >> 3 Rows & 5 Columns
***************************************************************************
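
The printout above describes only the first table. When a PDF file holds several tables, a summary of each one can help to spot extraction problems early. The sketch below is an illustration rather than part of the original notebook: `summarize_tables` is a hypothetical helper, and it relies only on each table exposing a pandas-style `.shape` tuple, which the DataFrames in `tables` do.

```python
def summarize_tables(tables):
    """Return one summary line per table, or a notice when none were found.

    `tables` is expected to be the list of DataFrames from the extraction
    step, but any objects with a `.shape` of (rows, columns) will work.
    """
    if not tables:
        return ['No tables were detected in this pdf file.']
    lines = []
    for n, table in enumerate(tables):
        rows, columns = table.shape
        lines.append(f'Table {n}: {rows} Rows & {columns} Columns')
    return lines
```

In the notebook this could be called as `print('\n'.join(summarize_tables(tables)))`. The table number matches the sheet name used in the Excel file.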


In [7]:
print('Excel file is being generated... Please wait...')
# Swap only the final extension, so a name containing '.pdf' elsewhere is not cut short.
EXCEL_FILE_NAME = PDF_FILE_NAME.rsplit('.', 1)[0] + '.xlsx'
with pd.ExcelWriter(EXCEL_FILE_NAME) as writer:
    # One sheet per table, named '0', '1', ... in extraction order.
    for n, table in enumerate(tables):
        table.to_excel(writer, sheet_name=str(n), index=False)
print('Excel file is generated. You can check your DIGITIZATION folder after some seconds.')

Excel file is being generated... Please wait...
Excel file is generated. You can check your DIGITIZATION folder after some seconds.
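
The Excel file takes its name from the PDF file, with the extension swapped. A small sketch of that naming rule using `pathlib` is shown below; the helper name `excel_name_for` is hypothetical, and `archive.pdf.copy.pdf` in the usage note is a made-up file name used only to show that everything before the final extension is kept.

```python
from pathlib import Path


def excel_name_for(pdf_file_name):
    """Replace only the final extension of the PDF name with .xlsx."""
    return str(Path(pdf_file_name).with_suffix('.xlsx'))
```

For example, `excel_name_for('sample.pdf')` gives `'sample.xlsx'`, and `excel_name_for('archive.pdf.copy.pdf')` gives `'archive.pdf.copy.xlsx'`.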


# References
 - https://github.com/chezou/tabula-py/blob/master/examples/tabula_example.ipynb