# Introduction

**This notebook reads a directory of pdf files, extracts the texts using an OCR package (pytesseract), finetunes an OpenAI Curie model, and runs the model on a subdirectory of pdf files. <br><br>The structure of the notebook goes like this:**
1. [Setup](#Setup)
2. [OCR Text Extraction](#OCR-Text-Extraction)
3. [Training Data Preparation](#Training-Data-Preparation)
4. [Fine Tuning OpenAI's Curie Model](#Fine-Tuning-OpenAI's-Curie-Model)
5. [Validation Data Preparation](#Validation-Data-Preparation)
6. [Running Fine Tuned Model](#Running-Fine-Tuned-Model)

# Setup

In [1]:
from pdf2image import convert_from_bytes
import pytesseract
import os
import pandas as pd
from tqdm import tqdm
import re
import openai
import tiktoken
import random
import json
import fitz
import time
import requests
from requests.packages.urllib3.util import ssl_
import warnings
from requests.packages.urllib3.exceptions import InsecureRequestWarning

In [2]:
os.getcwd()

'C:\\Users\\matia\\OneDrive - Universidad del Pacífico\\01-Medidas_emergencia_PE\\01-DATA_PERU\\03-CODE\\Base computadoras'

**Set own root directory.**

In [3]:
root = r'C:\Users\matia\OneDrive - Universidad del Pacífico\01-Medidas_emergencia_PE'
os.chdir(root)

**We also define other important directories.**

In [4]:
data_raw = root + r'\01-DATA_PERU\01-DATA_RAW'
data_pro = root + r'\01-DATA_PERU\02-DATA_PROCESSED'
documentation = root + r'\01-DATA_PERU\04-DATA_DOCUMENTATION'

# PDF Document Scraping

In [5]:
OSCE_computadoras_contratos = pd.read_stata(data_pro + r'\OSCE_computadoras_contratos.dta', convert_dates=True, convert_categoricals=True, index_col=None, convert_missing=False, preserve_dtypes=True, columns=None, order_categoricals=True, chunksize=None, iterator=False, compression='infer', storage_options=None)
OSCE_computadoras_contratos

Unnamed: 0,codigoconvocatoria,n_cod_contrato,urlcontrato,fecha_suscripcion_contrato,year_suscripcion,n_item1,ruc_proveedor1,ruc_destinatario_pago1,n_item2,ruc_proveedor2,ruc_destinatario_pago2,n_item3,ruc_proveedor3,ruc_destinatario_pago3,n_item4,ruc_proveedor4,ruc_destinatario_pago4
0,592583,2019652,http://zonasegura.seace.gob.pe/documentos//srv...,2020-06-29,2020.0,1,20521741555,20521741555,,,,,,,,,
1,501339,2129753,http://zonasegura.seace.gob.pe/documentos//srv...,2019-01-11,2019.0,1,C0003151303,20601090806,,,,,,,,,
2,480093,1181709,http://zonasegura.seace.gob.pe/documentos/mon\...,2018-10-11,2018.0,1,20601717396,20601717396,,,,,,,,,
3,499629,1196905,http://zonasegura.seace.gob.pe/documentos/mon\...,2018-12-13,2018.0,1,20519845017,20519845017,,,,,,,,,
4,486723,1192473,http://zonasegura.seace.gob.pe/documentos/mon\...,2018-11-21,2018.0,1,20369236874,20369236874,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
206,876694,2165662,https://prodapp2.seace.gob.pe/portalseace-uiwd...,2022-12-29,2022.0,1,20606211687,20606211687,,,,,,,,,
207,839173,2141045,https://prodapp2.seace.gob.pe/portalseace-uiwd...,2022-09-07,2022.0,1,20452458536,20452458536,,,,,,,,,
208,800859,2113493,https://prodapp2.seace.gob.pe/portalseace-uiwd...,2022-05-09,2022.0,1,20519984211,20519984211,,,,,,,,,
209,654394,2030537,https://prodapp2.seace.gob.pe/portalseace-uiwd...,2020-12-11,2020.0,1,20518446372,20518446372,,,,,,,,,


In [6]:
save_directory = documentation + r'\Computer_sample\downloaded_pdfs'

# Create the directory if it doesn't exist
if not os.path.exists(save_directory):
    os.makedirs(save_directory)

In [9]:
# Override SSL settings
# ssl_.DEFAULT_CIPHERS += ':HIGH:!DH:!aNULL'

# Disable only DH cipher
requests.packages.urllib3.util.ssl_.DEFAULT_CIPHERS = 'HIGH:!DH:!aNULL'

failed_downloads_df = pd.DataFrame(columns=['n_cod_contrato', 'urlcontrato', 'failed_download'])

# Download and save PDFs
for index, row in tqdm(OSCE_computadoras_contratos.iterrows(), total=OSCE_computadoras_contratos.shape[0]):
    url = row['urlcontrato']
    contract_code = row['n_cod_contrato']
    try:
        with warnings.catch_warnings():
            warnings.filterwarnings("ignore", category=InsecureRequestWarning)
            response = requests.get(url, verify=False)
            if response.status_code == 200:
                # Generate a name for the PDF based on the contract's code
                filename = os.path.join(save_directory, f'pdf_{contract_code}.pdf')
                with open(filename, 'wb') as f:
                    f.write(response.content)
            else:
                failed_downloads_df = pd.concat([failed_downloads_df,pd.DataFrame({'n_cod_contrato': contract_code, 'urlcontrato': url, 'failed_download': 1}, index=[0])], ignore_index=True)
                #failed_downloads_df = failed_downloads_df.append({'Contract Code': contract_code, 'URL': url}, ignore_index=True)
                print(f'Failed to download PDF {contract_code} from {url}. Status code: {response.status_code}')
    except requests.exceptions.SSLError as e:
        print(f"An SSL error occurred: {e} in contract {contract_code} with url: {url}")
        
    except Exception as e:
        print(f"An unexpected error occurred: {e} in contract {contract_code} with url: {url}")

  0%|▍                                                                                 | 1/211 [00:02<07:11,  2.05s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos//srv/nfs4/contratos/0e93aa44-0cd8-4898-a697-0308fa81bad0 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E31870>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 2019652 with url: http://zonasegura.seace.gob.pe/documentos//srv/nfs4/contratos/0e93aa44-0cd8-4898-a697-0308fa81bad0



  1%|▊                                                                                 | 2/211 [00:04<07:14,  2.08s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos//srv/nfs4/contratos/f486d0d2-a8b3-4725-b950-b67876b6526d (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E32CB0>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 2129753 with url: http://zonasegura.seace.gob.pe/documentos//srv/nfs4/contratos/f486d0d2-a8b3-4725-b950-b67876b6526d



  1%|█▏                                                                                | 3/211 [00:06<07:11,  2.08s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2018%5C10268%5C355935011102018163157.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E36590>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1181709 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2018\10268\355935011102018163157.pdf



  2%|█▌                                                                                | 4/211 [00:08<07:21,  2.13s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2018%5C10542%5C357118213122018180136.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E31C90>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1196905 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2018\10542\357118213122018180136.pdf



  2%|█▉                                                                                | 5/211 [00:10<07:26,  2.17s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2018%5C1060%5C356454027112018175628.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E33820>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1192473 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2018\1060\356454027112018175628.pdf



  3%|██▎                                                                               | 6/211 [00:12<07:18,  2.14s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2018%5C1060%5C356455627112018181024.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E31240>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1192479 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2018\1060\356455627112018181024.pdf



  3%|██▋                                                                               | 7/211 [00:14<07:12,  2.12s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2018%5C1066%5C354881908082018100349.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E30310>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1169512 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2018\1066\354881908082018100349.pdf



  4%|███                                                                               | 8/211 [00:16<07:13,  2.13s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2018%5C118%5C353911123052018183950.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E2E350>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1158394 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2018\118\353911123052018183950.pdf



  4%|███▍                                                                              | 9/211 [00:19<07:19,  2.17s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2018%5C1230%5C355143525092018155131.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E2FD30>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1177742 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2018\1230\355143525092018155131.pdf



  5%|███▊                                                                             | 10/211 [00:21<07:13,  2.16s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2018%5C1316%5C352838802032018111456.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E30730>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1146997 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2018\1316\352838802032018111456.pdf



  5%|████▏                                                                            | 11/211 [00:23<07:31,  2.26s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2018%5C1828%5C355125728082018105221.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E337F0>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1172704 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2018\1828\355125728082018105221.pdf



  6%|████▌                                                                            | 12/211 [00:26<07:26,  2.25s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2018%5C1887%5C354450825072018082000.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E319F0>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1167465 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2018\1887\354450825072018082000.pdf



  6%|████▉                                                                            | 13/211 [00:28<07:24,  2.24s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2018%5C1908%5C355299021092018125332.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E32920>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1177110 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2018\1908\355299021092018125332.pdf



  7%|█████▎                                                                           | 14/211 [00:30<07:22,  2.24s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2018%5C1946%5C356609011122018091657.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E38E50>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1195718 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2018\1946\356609011122018091657.pdf



  7%|█████▊                                                                           | 15/211 [00:32<07:20,  2.25s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2018%5C200118%5C356525414122018144114.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A64344C0>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1197153 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2018\200118\356525414122018144114.pdf



  8%|██████▏                                                                          | 16/211 [00:35<07:16,  2.24s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2018%5C200156%5C357165818122018124227.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E31C90>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1198052 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2018\200156\357165818122018124227.pdf



  8%|██████▌                                                                          | 17/211 [00:37<07:05,  2.19s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2018%5C2365%5C354728829102018100459.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E33DF0>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1185735 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2018\2365\354728829102018100459.pdf



  9%|██████▉                                                                          | 18/211 [00:39<06:59,  2.17s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2018%5C2369%5C357315027122018084931.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E33F40>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1200551 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2018\2369\357315027122018084931.pdf



  9%|███████▎                                                                         | 19/211 [00:41<06:56,  2.17s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2018%5C2408%5C356720619122018145835.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E302E0>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1198596 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2018\2408\356720619122018145835.pdf



  9%|███████▋                                                                         | 20/211 [00:43<07:09,  2.25s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2018%5C2543%5C355637811102018084049.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E2EF80>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1181412 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2018\2543\355637811102018084049.pdf



 10%|████████                                                                         | 21/211 [00:46<07:11,  2.27s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2018%5C276%5C353912613062018125815.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A64340A0>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1161380 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2018\276\353912613062018125815.pdf



 10%|████████▍                                                                        | 22/211 [00:48<07:05,  2.25s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2018%5C46%5C357325326122018090212.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E2E0B0>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1199915 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2018\46\357325326122018090212.pdf



 11%|████████▊                                                                        | 23/211 [00:50<07:10,  2.29s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2018%5C47%5C355239828092018175851.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E310F0>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1178740 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2018\47\355239828092018175851.pdf



 11%|█████████▏                                                                       | 24/211 [00:53<07:15,  2.33s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2018%5C544%5C354450311072018145838.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E33D60>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1165334 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2018\544\354450311072018145838.pdf



 12%|█████████▌                                                                       | 25/211 [00:55<07:01,  2.27s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2018%5C847%5C355123411102018161257.docx (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E31A80>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1181698 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2018\847\355123411102018161257.docx



 12%|█████████▉                                                                       | 26/211 [00:57<06:52,  2.23s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2018%5C931%5C356058318102018105707.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E31C60>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1183420 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2018\931\356058318102018105707.pdf



 13%|██████████▎                                                                      | 27/211 [00:59<06:45,  2.21s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2019%5C10249%5C358976012062019161718.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E316C0>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1222289 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2019\10249\358976012062019161718.pdf



 13%|██████████▋                                                                      | 28/211 [01:01<06:36,  2.17s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2019%5C10323%5C360725319092019095457.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E350F0>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1241653 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2019\10323\360725319092019095457.pdf



 14%|███████████▏                                                                     | 29/211 [01:03<06:35,  2.18s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2019%5C10323%5C361189430102019170851.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E31000>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1250227 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2019\10323\361189430102019170851.pdf



 14%|███████████▌                                                                     | 30/211 [01:06<06:51,  2.27s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2019%5C10408%5C361853922112019103509.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E33C10>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1254906 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2019\10408\361853922112019103509.pdf



 15%|███████████▉                                                                     | 31/211 [01:08<06:51,  2.29s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2019%5C10427%5C361455405112019110057.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E30D90>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1250664 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2019\10427\361455405112019110057.pdf



 15%|████████████▎                                                                    | 32/211 [01:10<06:49,  2.29s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2019%5C10428%5C362000729112019075258.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E30700>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1256507 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2019\10428\362000729112019075258.pdf



 16%|████████████▋                                                                    | 33/211 [01:13<06:38,  2.24s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2019%5C10429%5C362593026122019102907.PDF (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E2EF80>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1263856 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2019\10429\362593026122019102907.PDF



 16%|█████████████                                                                    | 34/211 [01:15<06:37,  2.25s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2019%5C10451%5C362207718122019152853.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E36BF0>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1261731 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2019\10451\362207718122019152853.pdf



 17%|█████████████▍                                                                   | 35/211 [01:17<06:37,  2.26s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2019%5C12632%5C361869102122019173024.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E36290>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1257095 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2019\12632\361869102122019173024.pdf



 17%|█████████████▊                                                                   | 36/211 [01:19<06:34,  2.25s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2019%5C1327%5C358703306052019114215.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E2DF60>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1216085 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2019\1327\358703306052019114215.pdf



 18%|██████████████▏                                                                  | 37/211 [01:22<06:30,  2.24s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2019%5C141%5C361398229102019082353.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E30700>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1249691 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2019\141\361398229102019082353.pdf



 18%|██████████████▌                                                                  | 38/211 [01:24<06:31,  2.26s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2019%5C1610%5C361723904122019170116.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E30790>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1257846 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2019\1610\361723904122019170116.pdf



 18%|██████████████▉                                                                  | 39/211 [01:26<06:28,  2.26s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2019%5C162%5C359675118072019143026.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E31180>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1228646 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2019\162\359675118072019143026.pdf



 19%|███████████████▎                                                                 | 40/211 [01:29<06:36,  2.32s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2019%5C1886%5C360318527082019120929.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E33CA0>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1236299 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2019\1886\360318527082019120929.pdf



 19%|███████████████▋                                                                 | 41/211 [01:31<06:31,  2.30s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2019%5C1886%5C360588030092019103212.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A64382B0>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1243930 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2019\1886\360588030092019103212.pdf



 20%|████████████████                                                                 | 42/211 [01:33<06:27,  2.29s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2019%5C1886%5C361128524102019085043.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E33460>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1248863 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2019\1886\361128524102019085043.pdf



 20%|████████████████▌                                                                | 43/211 [01:35<06:20,  2.26s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2019%5C1886%5C361679221112019154331.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E334C0>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1254700 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2019\1886\361679221112019154331.pdf



 21%|████████████████▉                                                                | 44/211 [01:38<06:13,  2.23s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2019%5C1887%5C359368205072019111520.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E31990>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1226389 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2019\1887\359368205072019111520.pdf



 21%|█████████████████▎                                                               | 45/211 [01:40<06:14,  2.26s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2019%5C1891%5C357324209012019193400.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E32500>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1203197 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2019\1891\357324209012019193400.pdf



 22%|█████████████████▋                                                               | 46/211 [01:42<06:09,  2.24s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2019%5C1906%5C361307611112019162730.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E2EE60>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1252169 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2019\1906\361307611112019162730.pdf



 22%|██████████████████                                                               | 47/211 [01:44<06:00,  2.20s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2019%5C1908%5C359359515072019115039.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E34E50>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1227813 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2019\1908\359359515072019115039.pdf



 23%|██████████████████▍                                                              | 48/211 [01:46<05:53,  2.17s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2019%5C1908%5C359469724072019170117.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E37940>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1229803 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2019\1908\359469724072019170117.pdf



 23%|██████████████████▊                                                              | 49/211 [01:48<05:55,  2.20s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2019%5C1952%5C360810105112019113819.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E2F0A0>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1250698 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2019\1952\360810105112019113819.pdf



 24%|███████████████████▏                                                             | 50/211 [01:51<05:52,  2.19s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2019%5C1952%5C360810110102019234734.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E304F0>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1246327 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2019\1952\360810110102019234734.pdf



 24%|███████████████████▌                                                             | 51/211 [01:53<05:49,  2.19s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2019%5C1959%5C361412007112019145655.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E318D0>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1251490 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2019\1959\361412007112019145655.pdf



 25%|███████████████████▉                                                             | 52/211 [01:55<05:45,  2.17s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2019%5C1961%5C359605701072019163120.PDF (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E33C70>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1225565 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2019\1961\359605701072019163120.PDF



 25%|████████████████████▎                                                            | 53/211 [01:57<05:39,  2.15s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2019%5C1962%5C361097005112019152019.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E33CA0>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1250779 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2019\1962\361097005112019152019.pdf



 26%|████████████████████▋                                                            | 54/211 [01:59<05:35,  2.13s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2019%5C200018%5C362380920122019155633.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A64380D0>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1262717 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2019\200018\362380920122019155633.pdf



 26%|█████████████████████                                                            | 55/211 [02:01<05:32,  2.13s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2019%5C200020%5C358540427122019150808.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E33460>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1264665 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2019\200020\358540427122019150808.pdf



 27%|█████████████████████▍                                                           | 56/211 [02:03<05:29,  2.13s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2019%5C20003%5C359374224062019121850.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E31D80>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1224107 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2019\20003\359374224062019121850.pdf



 27%|█████████████████████▉                                                           | 57/211 [02:06<05:27,  2.13s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2019%5C200125%5C362207718122019151350.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E30FD0>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1261696 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2019\200125\362207718122019151350.pdf



 27%|██████████████████████▎                                                          | 58/211 [02:08<05:31,  2.17s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2019%5C200207%5C360483404092019121736.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E33AC0>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1237845 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2019\200207\360483404092019121736.pdf



 28%|██████████████████████▋                                                          | 59/211 [02:10<05:26,  2.15s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2019%5C200207%5C360892230092019141651.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E2FBE0>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1243961 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2019\200207\360892230092019141651.pdf



 28%|███████████████████████                                                          | 60/211 [02:12<05:32,  2.20s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2019%5C200244%5C361974012122019174139.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E349A0>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1260164 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2019\200244\361974012122019174139.pdf



 29%|███████████████████████▍                                                         | 61/211 [02:14<05:26,  2.18s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2019%5C200265%5C361937528112019143607.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A64381F0>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1256324 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2019\200265\361937528112019143607.pdf



 29%|███████████████████████▊                                                         | 62/211 [02:16<05:21,  2.16s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2019%5C200265%5C362250616122019170225.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E34C10>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1260905 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2019\200265\362250616122019170225.pdf



 30%|████████████████████████▏                                                        | 63/211 [02:19<05:18,  2.15s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2019%5C200269%5C359542826062019164108.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E2F340>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1224728 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2019\200269\359542826062019164108.pdf



 30%|████████████████████████▌                                                        | 64/211 [02:21<05:15,  2.14s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2019%5C200269%5C360387526082019111350.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E31660>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1235843 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2019\200269\360387526082019111350.pdf



 31%|████████████████████████▉                                                        | 65/211 [02:23<05:12,  2.14s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2019%5C200324%5C361705116112019210110.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E326E0>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1253527 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2019\200324\361705116112019210110.pdf



 31%|█████████████████████████▎                                                       | 66/211 [02:25<05:13,  2.16s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2019%5C200377%5C361890611122019225822.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E30E20>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1259782 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2019\200377\361890611122019225822.pdf



 32%|█████████████████████████▋                                                       | 67/211 [02:27<05:07,  2.14s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2019%5C200382%5C360465823092019114111.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E314E0>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1242306 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2019\200382\360465823092019114111.pdf



 32%|██████████████████████████                                                       | 68/211 [02:29<05:04,  2.13s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2019%5C200419%5C360013701082019202636.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A64383A0>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1230975 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2019\200419\360013701082019202636.pdf



 33%|██████████████████████████▍                                                      | 69/211 [02:31<04:59,  2.11s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2019%5C200444%5C357422924012019174017.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E31BD0>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1204475 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2019\200444\357422924012019174017.pdf



 33%|██████████████████████████▊                                                      | 70/211 [02:33<04:54,  2.09s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2019%5C200446%5C360272019082019103627.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E32560>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1234184 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2019\200446\360272019082019103627.pdf



 34%|███████████████████████████▎                                                     | 71/211 [02:35<04:51,  2.09s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2019%5C201144%5C357934906032019094624.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E30280>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1208812 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2019\201144\357934906032019094624.pdf



 34%|███████████████████████████▋                                                     | 72/211 [02:38<04:52,  2.11s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2019%5C203066%5C361448912112019123730.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E2F5E0>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1252375 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2019\203066\361448912112019123730.pdf



 35%|████████████████████████████                                                     | 73/211 [02:40<05:01,  2.18s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2019%5C203066%5C361449012112019125807.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E37130>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1252383 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2019\203066\361449012112019125807.pdf



 35%|████████████████████████████▍                                                    | 74/211 [02:42<04:58,  2.18s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2019%5C203426%5C360828827092019183941.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A6438280>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1243769 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2019\203426\360828827092019183941.pdf



 36%|████████████████████████████▊                                                    | 75/211 [02:44<04:52,  2.15s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2019%5C21%5C361513911122019151800.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E342B0>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1259623 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2019\21\361513911122019151800.pdf



 36%|█████████████████████████████▏                                                   | 76/211 [02:46<04:46,  2.12s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2019%5C2348%5C359244007062019155235.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E2F280>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1221651 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2019\2348\359244007062019155235.pdf



 36%|█████████████████████████████▌                                                   | 77/211 [02:48<04:42,  2.11s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2019%5C2379%5C360625410102019164232.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E31900>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1246237 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2019\2379\360625410102019164232.pdf



 37%|█████████████████████████████▉                                                   | 78/211 [02:50<04:41,  2.11s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2019%5C2385%5C359732309082019121238.PDF (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E30EB0>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1232411 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2019\2385\359732309082019121238.PDF



 37%|██████████████████████████████▎                                                  | 79/211 [02:53<04:42,  2.14s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2019%5C2389%5C360121512112019142301.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E33A60>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1252416 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2019\2389\360121512112019142301.pdf



 38%|██████████████████████████████▋                                                  | 80/211 [02:55<04:39,  2.14s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2019%5C2436%5C359502616072019100228.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E32E90>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1228033 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2019\2436\359502616072019100228.pdf



 38%|███████████████████████████████                                                  | 81/211 [02:57<04:35,  2.12s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2019%5C26%5C361283729102019153756.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E316C0>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1249837 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2019\26\361283729102019153756.pdf



 39%|███████████████████████████████▍                                                 | 82/211 [02:59<04:32,  2.11s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2019%5C319%5C362398223122019115006.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E32E60>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1263079 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2019\319\362398223122019115006.pdf



 39%|███████████████████████████████▊                                                 | 83/211 [03:01<04:40,  2.19s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2019%5C35%5C359112907062019092325.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E319C0>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1221535 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2019\35\359112907062019092325.pdf



 40%|████████████████████████████████▏                                                | 84/211 [03:03<04:34,  2.16s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2019%5C35%5C359868019072019173239.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E31CC0>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1228992 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2019\35\359868019072019173239.pdf



 40%|████████████████████████████████▋                                                | 85/211 [03:06<04:30,  2.15s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2019%5C35%5C359868125072019163838.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E31870>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1230116 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2019\35\359868125072019163838.pdf



 41%|█████████████████████████████████                                                | 86/211 [03:08<04:25,  2.12s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2019%5C35%5C360313317092019130846.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E2F340>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1241139 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2019\35\360313317092019130846.pdf



 41%|█████████████████████████████████▍                                               | 87/211 [03:10<04:22,  2.12s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2020%5C10249%5C363753902062020101236.docx (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E34B50>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1282526 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2020\10249\363753902062020101236.docx



 42%|█████████████████████████████████▊                                               | 88/211 [03:12<04:19,  2.11s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2020%5C1203%5C364937404092020144602.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A64381F0>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1298715 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2020\1203\364937404092020144602.pdf



 42%|██████████████████████████████████▏                                              | 89/211 [03:14<04:16,  2.10s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2020%5C1891%5C364023803072020090312.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E2FAF0>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1286181 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2020\1891\364023803072020090312.pdf



 43%|██████████████████████████████████▌                                              | 90/211 [03:16<04:12,  2.09s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2020%5C1947%5C364656314082020110041.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E330A0>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1291713 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2020\1947\364656314082020110041.pdf



 43%|██████████████████████████████████▉                                              | 91/211 [03:18<04:09,  2.08s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2020%5C1952%5C363692307052020153520.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E31480>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1280229 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2020\1952\363692307052020153520.pdf



 44%|███████████████████████████████████▎                                             | 92/211 [03:20<04:06,  2.08s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2020%5C200049%5C364110224062020114411.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E32E90>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1285144 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2020\200049\364110224062020114411.pdf



 44%|███████████████████████████████████▋                                             | 93/211 [03:22<04:15,  2.16s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2020%5C200244%5C364390808092020032130.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E32FE0>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1299522 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2020\200244\364390808092020032130.pdf



 45%|████████████████████████████████████                                             | 94/211 [03:25<04:13,  2.17s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2020%5C201385%5C364046016072020190402.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A6438730>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1288093 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2020\201385\364046016072020190402.pdf



 45%|████████████████████████████████████▍                                            | 95/211 [03:27<04:16,  2.21s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2020%5C204590%5C365017028082020084759.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E31E70>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1296259 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2020\204590\365017028082020084759.pdf



 45%|████████████████████████████████████▊                                            | 96/211 [03:29<04:10,  2.17s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2020%5C24%5C362546717012020172439.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E319C0>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1268446 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2020\24\362546717012020172439.pdf



 46%|█████████████████████████████████████▏                                           | 97/211 [03:31<04:06,  2.16s/it]

An unexpected error occurred: HTTPConnectionPool(host='zonasegura.seace.gob.pe', port=80): Max retries exceeded with url: /documentos/mon%5Cdocs%5Ccontratos%5C2020%5C421%5C362935408042020110558.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000246A5E30FD0>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) in contract 1278024 with url: http://zonasegura.seace.gob.pe/documentos/mon\docs\contratos\2020\421\362935408042020110558.pdf


100%|████████████████████████████████████████████████████████████████████████████████| 211/211 [07:01<00:00,  2.00s/it]


In [10]:
failed_downloads_df.drop('urlcontrato',
  axis='columns', inplace=True)

# OCR Text Extraction

**OCR extraction with Tesseract in Windows requires to have the program installed from this [link](https://github.com/UB-Mannheim/tesseract/wiki).<br> After installing, we need to specify the location of the exe file as below.**

In [11]:
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

**We first define the directory where the pdfs are located.**

In [12]:
pdfs = documentation + r'\Computer_sample\downloaded_pdfs'

**We span through the files of the base pdf directory, and its subdirectories, storing the filepath of each document.**

In [13]:
filenames = []

for base, dirs, files in os.walk(pdfs):
    for filename in files:
        if (filename.lower().endswith('.pdf')):
            filenames.append(os.path.join(base, filename))

random.choices(filenames, k=5)

['C:\\Users\\matia\\OneDrive - Universidad del Pacífico\\01-Medidas_emergencia_PE\\01-DATA_PERU\\04-DATA_DOCUMENTATION\\Computer_sample\\downloaded_pdfs\\pdf_2119932.pdf',
 'C:\\Users\\matia\\OneDrive - Universidad del Pacífico\\01-Medidas_emergencia_PE\\01-DATA_PERU\\04-DATA_DOCUMENTATION\\Computer_sample\\downloaded_pdfs\\pdf_2162463.pdf',
 'C:\\Users\\matia\\OneDrive - Universidad del Pacífico\\01-Medidas_emergencia_PE\\01-DATA_PERU\\04-DATA_DOCUMENTATION\\Computer_sample\\downloaded_pdfs\\pdf_2164720.pdf',
 'C:\\Users\\matia\\OneDrive - Universidad del Pacífico\\01-Medidas_emergencia_PE\\01-DATA_PERU\\04-DATA_DOCUMENTATION\\Computer_sample\\downloaded_pdfs\\pdf_2147231.pdf',
 'C:\\Users\\matia\\OneDrive - Universidad del Pacífico\\01-Medidas_emergencia_PE\\01-DATA_PERU\\04-DATA_DOCUMENTATION\\Computer_sample\\downloaded_pdfs\\pdf_2168525.pdf']

**We open each path and extract the text using the fitz package (a simple pdf reader). <br>Then we identify if the pdf was a scanned document (an image without *"highlight-able"* text) by setting a length threshold. <br>Only if the read pdf's text has a length below the threshold, we proceed by re-extracting the text using the Tesseract OCR package. <br>Each text is stored in a dataframe containing its filename, text, extraction type and file id.**

In [14]:
# Create a dataframe to store the texts of each PDF
pdf_texts_df = pd.DataFrame(columns = ['filename', 'text', 'extraction_type'])
broken_pdfs_df = pd.DataFrame(columns=['n_cod_contrato', 'broken_pdf'])

# Loop through every file in the directory
for filename in tqdm(filenames):
    if filename.lower().endswith('.pdf'):
        try:
        
            # Read the text directly from the PDF file
            reader = fitz.open(filename)
            pdf_text = ''

            for page in reader:
                pdf_text+=page.get_text()+' '
            
            pdf_texts_df.loc[filenames.index(filename), 'extraction_type'] = 'PDF_Reader'
            
            if len(re.sub(r'[^a-zA-Z]', '', pdf_text))<1000:
            
                # Open the PDF file
                with open(filename, 'rb') as file:
                    pdf_bytes = file.read()

                # Convert the PDF to images
                images = convert_from_bytes(pdf_bytes)

                # Use OCR to extract text from each image/page
                pdf_text = ''
                for i, image in enumerate(images):
                    text = pytesseract.image_to_string(image)
                    pdf_text+=text+' '
            
                pdf_texts_df.loc[filenames.index(filename), 'extraction_type'] = 'OCR'

            # Clean the extracted text

            clean_text = re.sub('\$+', ' ', pdf_text)  # Replace multiple \$ with a space
            clean_text = re.sub('\n+', ' ', clean_text)  # Replace multiple newlines with one space
            clean_text = re.sub('\.+', '.', clean_text)  # Replace multiple . with one space
            clean_text = re.sub('\,+', ',', clean_text)  # Replace multiple newlines with one space
            clean_text = clean_text.replace(';', ' ')  # Replace semicolons with spaces
            clean_text = re.sub(' +', ' ', clean_text)  # Replace multiple spaces with one
            clean_text = re.sub(r'[^a-zA-ZÀ-ÿ0-9 \,\.\/\:]', '', clean_text)
        
            # Store the joined text in the dataframe, using the filename (without .pdf) as the key
            pdf_texts_df.loc[filenames.index(filename), 'filename'] = filename[:-4]
            pdf_texts_df.loc[filenames.index(filename), 'text'] = clean_text
            
        except Exception as e:
            error_message = str(e)
            if 'cannot open broken document' in error_message:  # Replace 'FileDataError' with the actual error message you expect
                print(f"A FileDataError occurred: {e} in {filename}")
                broken_pdfs_df = pd.concat([broken_pdfs_df, pd.DataFrame({'n_cod_contrato': [filename.replace(pdfs+'\\pdf_', '').replace('.pdf','')], 'broken_pdf': 1}, index=[0])], ignore_index=True)
                #broken_pdfs_df = pd.concat([broken_pdfs_df, pd.DataFrame({'Contract Code': filename.str.replace(pdfs+'\\pdf_', '', regex=False)}, index=[0])], ignore_index=True)
            else:
                print(f"An unspecified error occurred: {e} in {filename}")

 71%|█████████████████████████████████████████████████████████▌                       | 81/114 [20:26<05:22,  9.78s/it]

An unspecified error occurred: Image size (274270563 pixels) exceeds limit of 178956970 pixels, could be decompression bomb DOS attack. in C:\Users\matia\OneDrive - Universidad del Pacífico\01-Medidas_emergencia_PE\01-DATA_PERU\04-DATA_DOCUMENTATION\Computer_sample\downloaded_pdfs\pdf_2153720.pdf


100%|████████████████████████████████████████████████████████████████████████████████| 114/114 [31:46<00:00, 16.72s/it]


In [15]:
pdf_texts_df['n_cod_contrato'] = pdf_texts_df['filename'].str.replace(pdfs+'\\pdf_', '', regex=False)
pdf_texts_df.drop('filename', axis='columns', inplace=True)

In [16]:
computer_analysis_dfs = data_pro + r'\computer_analysis_dfs'

# Create the directory if it doesn't exist
if not os.path.exists(computer_analysis_dfs):
    os.makedirs(computer_analysis_dfs)

In [17]:
display(failed_downloads_df)
display(broken_pdfs_df)
display(pdf_texts_df)

Unnamed: 0,n_cod_contrato,failed_download


Unnamed: 0,n_cod_contrato,broken_pdf


Unnamed: 0,text,extraction_type,n_cod_contrato
0,RUC N 20187651361 JR. MIRAFLORES NRO. SIN LA L...,PDF_Reader,2000640
1,PERU Ministerio Instituto i del Ambiente Geof...,OCR,2003634
2,CONTRATO N 172020MDS ADQUISICION DE MOBILIARIO...,OCR,2010690
3,CONTRATO N 182020MDS ADQUISICION DE MOBILIARIO...,OCR,2010704
4,MUNICIPALIDAD DISTRITAL DESONDORIELO IREGION/P...,OCR,2010706
...,...,...,...
109,Universidad Nacional Hermilio Valdizan de Huan...,OCR,2175999
110,126 ORDEN DE COMPRA GUIA DE ois nes aio Sef...,OCR,2184078
111,Sistema Integrado de Gestién Administrativa Mé...,OCR,2184774
112,MUNICIPALIDAD DISTRITAL DE ANCHONGA as a sae e...,OCR,2186739


In [18]:
pdf_texts_df.to_excel(computer_analysis_dfs + r'\pdf_texts.xlsx', index = False)
failed_downloads_df.to_excel(computer_analysis_dfs + r'\failed_downloads.xlsx', index = False)
broken_pdfs_df.to_excel(computer_analysis_dfs + r'\broken_pdfs.xlsx', index = False)

# Preparing Data for Unit Price Extraction

In [27]:
extraction_df = pd.read_excel(data_pro + r'\computer_analysis_dfs\pdf_texts.xlsx')

In [28]:
extraction_df

Unnamed: 0,text,extraction_type,n_cod_contrato
0,RUC N 20187651361 JR. MIRAFLORES NRO. SIN LA L...,PDF_Reader,2000640.0
1,PERU Ministerio Instituto i del Ambiente Geof...,OCR,2003634.0
2,CONTRATO N 172020MDS ADQUISICION DE MOBILIARIO...,OCR,2010690.0
3,CONTRATO N 182020MDS ADQUISICION DE MOBILIARIO...,OCR,2010704.0
4,MUNICIPALIDAD DISTRITAL DESONDORIELO IREGION/P...,OCR,2010706.0
...,...,...,...
109,Universidad Nacional Hermilio Valdizan de Huan...,OCR,2175999.0
110,126 ORDEN DE COMPRA GUIA DE ois nes aio Sef...,OCR,2184078.0
111,Sistema Integrado de Gestién Administrativa Mé...,OCR,2184774.0
112,MUNICIPALIDAD DISTRITAL DE ANCHONGA as a sae e...,OCR,2186739.0


In [29]:
extraction_df['text'] = extraction_df['text'].astype(str)
extraction_df['text'] = extraction_df['text'].str.lower()
extraction_df['text'] = extraction_df['text'].apply(lambda x: re.sub('\s+', ' ', x).strip())
extraction_df['text'] = extraction_df['text'].apply(lambda x: re.sub(r'(\.|\,)\1+', r'\1', x).strip())
extraction_df.drop('extraction_type', axis='columns', inplace=True)

In [30]:
# Split a text into smaller chunks of size n, preferably ending at the end of a sentence
def create_chunks(text, n, tokenizer):
    tokens = tokenizer.encode(text)
    """Yield successive n-sized chunks from text."""
    i = 0
    while i < len(tokens):
        # Find the nearest end of sentence within a range of 0.9 * n and 1.1 * n tokens
        j = min(i + int(1.1 * n), len(tokens))
        while j > i + int(0.9 * n):
            # Decode the tokens and check for full stop or newline
            chunk = tokenizer.decode(tokens[i:j])
            if chunk.endswith(".") or chunk.endswith("\n"):
                break
            j -= 1
        # If no end of sentence found, use n tokens as the chunk size
        if j == i + int(0.9 * n):
            j = min(i + n, len(tokens))
        yield tokens[i:j]
        i = j

In [31]:
# Initialise tokenizer
tokenizer = tiktoken.encoding_for_model('gpt-3.5-turbo')

prompt=[]

for i in tqdm(range(len(extraction_df['text']))):
    chunks = create_chunks(extraction_df.loc[i,'text'], 1500, tokenizer)
    text_chunks = [tokenizer.decode(chunk) for chunk in chunks]
    if len(text_chunks)>=2:
        prompt.append(' '.join(text_chunks[:1]+text_chunks[-1:]))
    else:
        prompt.append(' '.join(text_chunks))
extraction_df['text'] = prompt

100%|████████████████████████████████████████████████████████████████████████████████| 114/114 [00:01<00:00, 72.59it/s]


In [32]:
extraction_df['text'] = extraction_df['text'] + '---->'

In [33]:
os.environ['OPENAI_API_KEY'] = ""

In [34]:
openai.api_key = os.getenv("OPENAI_API_KEY")

In [35]:
# Iterate through each row in DataFrame
for index, row in tqdm(extraction_df.iterrows(), total=extraction_df.shape[0]):
    user_content = row['text']
    time.sleep(2) 
    
    try:
        # Call OpenAI API
        completion = openai.ChatCompletion.create(
            model="ft:gpt-3.5-turbo-0613:personal::7wfSDkB5",
            request_timeout = 100,
            messages=[
                {"role": "system", "content": "Dado el texto extraido de un contrato, extrae los precios unitarios de los bienes comprados."},
                {"role": "user", "content": user_content}
            ]
        )
    
        # Extract the generated message and store it in DataFrame
        generated_message = completion.choices[0].message['content']
        extraction_df.at[index, 'gpt_unit_prices'] = generated_message

    except openai.error.RateLimitError as e:
        retry_time = e.retry_after if hasattr(e, 'retry_after') else 1
        print(f"Rate limit error. Retrying in {retry_time} seconds...")
        time.sleep(retry_time)
        
    except openai.error.ServiceUnavailableError as e:
        retry_time = e.retry_after if hasattr(e, 'retry_after') else 1
        print(f"Service Unavailable error. Retrying in {retry_time} seconds...")
        time.sleep(retry_time)

    except openai.error.APIError as e:
        retry_time = e.retry_after if hasattr(e, 'retry_after') else 1
        print(f"API error occurred. Retrying in {retry_time} seconds...")
        time.sleep(retry_time)

    except OSError as e:
        retry_time = 1  # Adjust the retry time as needed
        print(f"Connection error occurred: {e}. Retrying in {retry_time} seconds...")      
        time.sleep(retry_time)
        
    except requests.Timeout as e:
        retry_time = 1  # Adjust the retry time as needed
        print(f"Timeout error occurred: {e}. Retrying in {retry_time} seconds...")      
        time.sleep(retry_time)
        
    except Exception as e:
        retry_time = 1
        print(f"An unexpected error occurred: {e}. Retrying in {retry_time} seconds...")
        time.sleep(retry_time)
        
# Print the updated DataFrame to check if it worked
extraction_df

 11%|████████▌                                                                        | 12/114 [00:44<05:34,  3.28s/it]

Service Unavailable error. Retrying in 1 seconds...


 46%|████████████████████████████████████▉                                            | 52/114 [03:10<02:42,  2.63s/it]

Service Unavailable error. Retrying in 1 seconds...


100%|████████████████████████████████████████████████████████████████████████████████| 114/114 [06:52<00:00,  3.62s/it]


Unnamed: 0,text,n_cod_contrato,gpt_unit_prices
0,ruc n 20187651361 jr. miraflores nro. sin la l...,2000640.0,3300.0 \n\n###\n\n
1,peru ministerio instituto i del ambiente geofi...,2003634.0,7681.444 \n\n###\n\n
2,contrato n 172020mds adquisicion de mobiliario...,2010690.0,\n\n###\n\n
3,contrato n 182020mds adquisicion de mobiliario...,2010704.0,\n\n###\n\n
4,municipalidad distrital desondorielo iregion/p...,2010706.0,\n\n###\n\n
...,...,...,...
109,universidad nacional hermilio valdizan de huan...,2175999.0,\n\n###\n\n
110,126 orden de compra guia de ois nes aio seftor...,2184078.0,10535.0; 7499.0; 8789.0 \n\n###\n\n
111,sistema integrado de gestién administrativa mé...,2184774.0,6060.0 \n\n###\n\n
112,municipalidad distrital de anchonga as a sae e...,2186739.0,3000.0; 7450.0; 2000.0; 3900.0; 980.0; 6400.0...


In [36]:
extraction_df.to_excel(computer_analysis_dfs + r'\extraction_df.xlsx', index = False)