# Text Pre-Processing

## 1. Import Libraries
The common libraries used in this task are imported below.

In [1]:
import pdfminer
import re
import pandas as pd
import itertools

The libraries below are used to download the files from Google Drive.

In [2]:
# Libraries for downloading the files
import io
import os
import google.auth
from googleapiclient.discovery import build
from googleapiclient.http import MediaIoBaseDownload
from googleapiclient.errors import HttpError

## 2. Extract The Links
First, we need to extract the contents of the given PDF file. Using pdfminer, we can convert the PDF file to a text file.

In [None]:
# source: Tutor class
!pdf2txt.py -o task_2.txt google_drive_links.pdf
print("done")

done


The code above generates a text file named "task_2.txt".

Next, we open the newly generated text file to get the URL of each research PDF and collect the filename and URL pairs in a dataframe.

In [4]:
with open("task_2.txt", 'r') as pdf_txt:
    # pattern for each line
    pat = r'[\w\d\.]+ [\S]+'
    # find every match, then split each match on the space into [filename, url]
    rows = [row.split(" ") for row in re.findall(pat, pdf_txt.read())]
    
df_t2_url = pd.DataFrame(rows, columns = ['filename', 'url'])
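The matching step above can be tried on its own. The snippet below is a minimal sketch that applies the same pattern to a made-up two-line sample; the sample text and the `FAKE_ID_*` values are assumptions for illustration, not the real contents of "task_2.txt".

```python
import re

# Hypothetical sample standing in for the converted text file.
sample_text = (
    "PP3167.pdf https://drive.google.com/uc?export=download&id=FAKE_ID_1\n"
    "\n"
    "PP3210.pdf https://drive.google.com/uc?export=download&id=FAKE_ID_2\n"
)

# Same pattern as above: a filename-like token, one space, then a run of non-space characters.
pat = r'[\w\d\.]+ [\S]+'

# Each match is one "filename url" line; splitting on the space separates the two fields.
rows = [row.split(" ") for row in re.findall(pat, sample_text)]

for filename, url in rows:
    print(filename, "->", url)
```

Note that the pattern matches any two space-separated tokens, so a header line or stray text in the file would also be picked up and would need to be filtered out.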

In [5]:
print(df_t2_url.shape)
df_t2_url.head()

(200, 2)


Unnamed: 0,filename,url
0,PP3167.pdf,https://drive.google.com/uc?export=download&id...
1,PP3210.pdf,https://drive.google.com/uc?export=download&id...
2,PP3213.pdf,https://drive.google.com/uc?export=download&id...
3,PP3248.pdf,https://drive.google.com/uc?export=download&id...
4,PP3285.pdf,https://drive.google.com/uc?export=download&id...


## 3. Downloaded Files
To download the files, we first need to configure the Google Drive API, which I will not explain here. After setting up the API, you can download a JSON file that contains the key needed to interact with the API. My JSON file is named "endless-sol-308511-fff70d278fb0.json".

Then we iterate over the links in the dataframe and download each file using the ID in its URL.

Every downloaded file is saved inside the `files/` path.

In [None]:
def download_files(df):
    # Folder name where the pdf files will be stored
    folder_n = "files"
    
    # Check whether the folder exists, and create it if it does not
    if not os.path.exists(folder_n):
        os.makedirs(folder_n)
    
    # Set up the Drive API credentials
    creds, _ = google.auth.load_credentials_from_file("endless-sol-308511-fff70d278fb0.json")

    # Set up the Drive API client
    service = build("drive", "v3", credentials=creds)
    
    # Loop over every row to get each link
    for idx, row in df.iterrows():
        # Set the ID of the file to download, the sequence after 'id=' in the url
        file_id = re.search(r'(?<=id=)[\S]+', row.url)[0]

        try:
            # Download the file contents
            request = service.files().get_media(fileId=file_id)
            file = io.BytesIO()
            downloader = MediaIoBaseDownload(file, request)
            done = False
            while done is False:
                status, done = downloader.next_chunk()
                print(f"{row.filename} download progress: {int(status.progress() * 100)}.")

            # Save the downloaded file to disk
            with open(os.path.join(folder_n, row.filename), 'wb') as f:
                f.write(file.getbuffer())
            print(f"{row.filename} file downloaded successfully.")

        except HttpError as error:
            print(f"An error occurred for {row.filename}: {error}")
            file = None
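The file ID lookup inside `download_files()` can be checked in isolation. This is a small sketch using the same lookbehind pattern; the helper name `extract_file_id` and the example URLs are hypothetical and only serve the illustration.

```python
import re

def extract_file_id(url):
    """Return the text that follows 'id=' in the URL, or None if there is no 'id='."""
    match = re.search(r'(?<=id=)[\S]+', url)
    return match[0] if match else None

# Hypothetical URL in the same shape as the links in the dataframe.
example_url = "https://drive.google.com/uc?export=download&id=FAKE_FILE_ID"
print(extract_file_id(example_url))

# A URL without an 'id=' parameter yields None instead of raising an error.
print(extract_file_id("https://example.com/file"))
```

Returning None for a missing ID lets the caller skip a malformed row, whereas indexing the result of `re.search` directly raises a `TypeError` when nothing matches.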

In [7]:
# download files
download_files(df_t2_url)

PP3167.pdf download progress: 100.
PP3167.pdf file downloaded successfully.
PP3210.pdf download progress: 100.
PP3210.pdf file downloaded successfully.
PP3213.pdf download progress: 100.
PP3213.pdf file downloaded successfully.
PP3248.pdf download progress: 100.
PP3248.pdf file downloaded successfully.
PP3285.pdf download progress: 100.
PP3285.pdf file downloaded successfully.
PP3328.pdf download progress: 100.
PP3328.pdf file downloaded successfully.
PP3377.pdf download progress: 100.
PP3377.pdf file downloaded successfully.
PP3445.pdf download progress: 100.
PP3445.pdf file downloaded successfully.
PP3470.pdf download progress: 100.
PP3470.pdf file downloaded successfully.
PP3476.pdf download progress: 100.
PP3476.pdf file downloaded successfully.
PP3513.pdf download progress: 100.
PP3513.pdf file downloaded successfully.
PP3558.pdf download progress: 100.
PP3558.pdf file downloaded successfully.
PP3579.pdf download progress: 100.
PP3579.pdf file downloaded successfully.
PP3599.pdf d

PP5571.pdf download progress: 100.
PP5571.pdf file downloaded successfully.
PP5609.pdf download progress: 100.
PP5609.pdf file downloaded successfully.
PP5621.pdf download progress: 100.
PP5621.pdf file downloaded successfully.
PP5624.pdf download progress: 100.
PP5624.pdf file downloaded successfully.
PP5631.pdf download progress: 100.
PP5631.pdf file downloaded successfully.
PP5634.pdf download progress: 100.
PP5634.pdf file downloaded successfully.
PP5667.pdf download progress: 100.
PP5667.pdf file downloaded successfully.
PP5680.pdf download progress: 100.
PP5680.pdf file downloaded successfully.
PP5694.pdf download progress: 100.
PP5694.pdf file downloaded successfully.
PP5696.pdf download progress: 100.
PP5696.pdf file downloaded successfully.
PP5708.pdf download progress: 100.
PP5708.pdf file downloaded successfully.
PP5711.pdf download progress: 100.
PP5711.pdf file downloaded successfully.
PP5715.pdf download progress: 100.
PP5715.pdf file downloaded successfully.
PP5725.pdf d

## 4. Read Downloaded PDF files
After the download finishes, we read every PDF file inside `files/` and convert it into a text file, which is saved in the same path.

In [8]:
for idx, row in df_t2_url.iterrows():
    # use pdfminer to convert each PDF file to a text file
    !pdf2txt.py -o files\{row.filename}.txt files\{row.filename}
    print(f"{row.filename} has been converted to text file")

PP3167.pdf has been converted to text file
PP3210.pdf has been converted to text file
PP3213.pdf has been converted to text file
PP3248.pdf has been converted to text file
PP3285.pdf has been converted to text file
PP3328.pdf has been converted to text file
PP3377.pdf has been converted to text file
PP3445.pdf has been converted to text file
PP3470.pdf has been converted to text file
PP3476.pdf has been converted to text file
PP3513.pdf has been converted to text file
PP3558.pdf has been converted to text file
PP3579.pdf has been converted to text file
PP3599.pdf has been converted to text file
PP3617.pdf has been converted to text file
PP3622.pdf has been converted to text file
PP3653.pdf has been converted to text file
PP3661.pdf has been converted to text file
PP3693.pdf has been converted to text file
PP3716.pdf has been converted to text file
PP3723.pdf has been converted to text file
PP3784.pdf has been converted to text file
PP3796.pdf has been converted to text file
PP3826.pdf 

PP7173.pdf has been converted to text file
PP7183.pdf has been converted to text file
PP7184.pdf has been converted to text file
PP7210.pdf has been converted to text file
PP7231.pdf has been converted to text file
PP7236.pdf has been converted to text file
PP7245.pdf has been converted to text file
PP7247.pdf has been converted to text file
PP7265.pdf has been converted to text file
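The conversion cell relies on the notebook `!` shell escape and on backslash path separators. As a hedged alternative, the sketch below builds the same `pdf2txt.py -o <output> <input>` command with `os.path.join` and runs it through `subprocess`. The helper names `build_pdf2txt_command` and `convert_pdf` are hypothetical, and the sketch assumes `pdf2txt.py` can be launched directly from the PATH, which may not hold on every system.

```python
import os
import subprocess

def build_pdf2txt_command(filename, folder="files"):
    """Build the argument list for converting one PDF in `folder` to text."""
    pdf_path = os.path.join(folder, filename)
    # Same naming as the cell above: the text file is "<name>.pdf.txt".
    txt_path = pdf_path + ".txt"
    return ["pdf2txt.py", "-o", txt_path, pdf_path]

def convert_pdf(filename, folder="files"):
    """Run the conversion; assumes pdf2txt.py is available on the PATH."""
    subprocess.run(build_pdf2txt_command(filename, folder), check=True)

# Only build and show the command here; nothing is executed.
cmd = build_pdf2txt_command("PP3167.pdf")
print(cmd)
```

With `check=True`, a failed conversion raises `subprocess.CalledProcessError` rather than passing silently, so the success message is printed only for files that were actually converted.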


The functions below work as follows:
* `get_dir_file()` reads the `files/` folder and returns the path of every text file in it.
* `find_pattern()` accepts a text and a pattern as input and returns the part of the text matched by the regex, or None if there is no match.
* `read_pdf_text()` reads the text file of each PDF and uses patterns to get the file name, title, author, abstract, and body. It returns a dataframe with these values. It relies on both functions above: the paths come from `get_dir_file()` and the matching is done by `find_pattern()`.
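To make the pattern-based extraction concrete, here is a minimal sketch that runs lookbehind and lookahead patterns of the kind described above on a tiny made-up document. The sample text is an assumption that mimics the expected layout (title, "Authored by:", "Abstract", "1 Paper Body", "2 References"); it is not taken from the real files.

```python
import re

def find_pattern(text_f, pat):
    """Return the matched text, or None when the pattern is not found."""
    found = re.search(pat, text_f)
    return found[0] if found else None

# Hypothetical document with the section markers the patterns rely on.
sample = (
    "A Toy Paper\nTitle\n"
    "Authored by:\nAda Example\nBob Sample\n"
    "Abstract\nThis is the abstract.\n"
    "1 Paper Body\nThis is the body.\n"
    "2 References\n[1] ...\n"
)

# Everything before "Authored by:" is the title.
title = find_pattern(sample, r'[\S\s]+(?=\sAuthored by:\s)')
# Text between two markers: lookbehind for the opening one, lookahead for the closing one.
author = find_pattern(sample, r'(?<=\sAuthored by:\s)[\S\s]+(?=\sAbstract\s)')
abstract = find_pattern(sample, r'(?<=\sAbstract\s)[\S\s]+(?=\s1 Paper Body\s)')
body = find_pattern(sample, r'(?<=\s1 Paper Body\s)[\S\s]+(?=\s2 References\s)')

print(re.sub(r'\s+', ' ', title.strip()))   # multi-line title joined with spaces
print(re.sub(r'\n+', ';', author.strip()))  # one author per line becomes ';'-separated
print(abstract)
print(body)
```

Because the markers are matched only by the lookarounds, they are not part of the returned text. If a file lacks one of the markers, `find_pattern()` returns None and a following `.strip()` call would fail, so real files may need a check for None first.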

In [None]:
def get_dir_file():
    dir_path = "files/"
    res = []
    # Iterate directory
    for path in os.listdir(dir_path):
        # check if current path is a file
        if os.path.isfile(os.path.join(dir_path, path)):
            # keep only the converted text files and skip the PDFs
            if path.endswith(".txt"):
                res.append(dir_path + path)
            # endif
        # endif
    # endfor
    return res

def find_pattern(text_f, pat):
    found = re.search(pat, text_f)
    if found:
        return found[0]
    else:
        return None

def read_pdf_text(dir_paths):
    # create an empty dictionary
    text_dict_lst = {}
    i = 0
    
    # loop for each directory path
    for path in dir_paths:
        # open the text file
        with open(path, "r", encoding="utf-8") as file:
            txt = file.read()
            # file name: drop the "files/" prefix and the ".txt" suffix
            file_n = path[6:-4]
            # title: everything before "Authored by:"
            title_pat = r'[\S\s]+(?=\sAuthored by:\s)'
            # take everything in between authored by and abstract
            auth_pat = r'(?<=\sAuthored by:\s)[\S\s]+(?=\sAbstract\s)'
            # take everything between abstract and paper body
            abs_pat = r'(?<=\sAbstract\s)[\S\s]+(?=\s1 Paper Body\s)'
            # take everything between paper body and references
            body_pat = r'(?<=\s1 Paper Body\s)[\S\s]+(?=\s2 References\s)'
            
            # fetch the data
            title = re.sub(r'\s+', ' ', find_pattern(txt, title_pat).strip())
            author = re.sub(r'\n+', ';', find_pattern(txt, auth_pat).strip())
            abstract = re.sub(r'\n+', '\n', find_pattern(txt, abs_pat).strip())
            abstract = re.sub(r'\s+', ' ', re.sub(r'-\n(\w+ *)', r'\1\n', abstract))
            # remove extra spaces, and remove page number
            body = re.sub(r'\n[\d]\n', '', re.sub(r'\n+', '\n', find_pattern(txt, body_pat).strip()))
            body = re.sub(r'\s+', ' ', re.sub(r'-\n(\w+ *)', r'\1\n', body))
        # end open
        # add to dictionary list
        text_dict_lst[i] = {"filename":file_n, "title":title, "author":author, "abstract":abstract, "body":body}
        i += 1
    # end for
        
    print(f"Successfully fetch {i} PDF files")
        
    return pd.DataFrame.from_dict(text_dict_lst, orient="index")
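The clean-up applied to the abstract in `read_pdf_text()` is easier to follow on a short string. The sketch below repeats the same three substitutions in a hypothetical helper named `clean_block`; the input sentence is made up for the illustration.

```python
import re

def clean_block(text):
    """Apply the same three clean-up steps used for the abstract above."""
    # 1. Collapse runs of newlines into a single newline.
    text = re.sub(r'\n+', '\n', text.strip())
    # 2. Re-join words hyphenated across a line break: drop the "-\n" and
    #    move the line break to after the re-joined word.
    text = re.sub(r'-\n(\w+ *)', r'\1\n', text)
    # 3. Collapse all remaining whitespace, including newlines, into single spaces.
    return re.sub(r'\s+', ' ', text)

# Hypothetical extracted text with two words broken across lines.
raw = "Semi-supervised learn-\ning uses both labelled\n\nand unla-\nbelled data.\n"
cleaned = clean_block(raw)
print(cleaned)
```

Only a hyphen directly followed by a line break is removed, so an ordinary hyphenated word such as "Semi-supervised" inside a line is kept. A genuinely hyphenated word that happens to break at the end of a line would lose its hyphen, which is a limitation of this approach.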

Save the generated dataframe into `df_t2`:

In [10]:
df_t2 = read_pdf_text(get_dir_file())
print(df_t2.shape)
df_t2.head()

Successfully fetch 200 PDF files
(200, 5)


Unnamed: 0,filename,title,author,abstract,body
0,PP3167.pdf,Regularized Boost for Semi-Supervised Learning,Ke Chen;Shihai Wang,Semi-supervised inductive learning concerns ho...,Semi-supervised inductive learning concerns th...
1,PP3210.pdf,Conﬁguration Estimates Improve Pedestrian Finding,David A. Forsyth;Duan Tran,Fair discriminative pedestrian ﬁnders are now ...,Very accurate pedestrian detectors are an impo...
2,PP3213.pdf,Unconstrained On-line Handwriting Recognition ...,J?rgen Schmidhuber;Alex Graves;Marcus Liwicki;...,On-line handwriting recognition is unusual amo...,In online handwriting recognition the trajecto...
3,PP3248.pdf,Direct Importance Estimation with Model Select...,Masashi Sugiyama;Motoaki Kawanabe;Hisashi Kash...,When training and test samples follow diﬀerent...,A common assumption in supervised learning is ...
4,PP3285.pdf,Linear programming analysis of loopy belief pr...,Alan S. Willsky;Dmitry Malioutov;Sujay Sanghavi,Loopy belief propagation has been employed in ...,Loopy Belief Propagation (LBP) and its variant...


Finally, we write the dataframe to a CSV file.

In [None]:
# convert to csv file
df_t2.to_csv("result.csv", index = False)

Unnamed: 0,top10_items_in_abstract,top10_items_in_title,top10_author
0,data,learning,Bernhard Sch?lkopf
1,learning,models,Le Song
2,model,networks,Lawrence Carin
3,algorithm,neural,Sujay Sanghavi
4,problem,linear,Daniel D. Lee
5,show,model,Pradeep K. Ravikumar
6,models,stochastic,Doina Precup
7,method,optimization,Zoubin Ghahramani
8,methods,recurrent,Alex Graves
9,results,application,David M. Blei
