# Download Text

Download the text versions of the PDF files.

In [30]:
import pandas as pd
import requests as rq
from pathlib import Path

In [31]:
is_devel = True
if is_devel:
    repo_df = pd.read_csv('data/kaust_repo.csv')
else:
    repo_df = pd.read_csv('https://repository.kaust.edu.sa/bitstream/handle/10754/691066/KAUST_Affiliated_Research_Basic_Metadata.csv')

We create a series with the URLs of the extracted text, and remove the missing values (`NaN`):

In [32]:
text_url_list = repo_df["Link to Extracted Text"].dropna()
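To make the effect of `dropna()` concrete, here is a minimal sketch on a made-up three-row frame. The `example.org` URLs and the name `demo_df` are placeholders for illustration only; they are not part of the repository metadata.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame; the URLs are placeholders, not real repository links.
demo_df = pd.DataFrame({
    "Link to Extracted Text": [
        "https://example.org/a.pdf.txt",
        np.nan,  # a record without extracted text
        "https://example.org/b.pdf.txt",
    ]
})

# dropna() removes the NaN row and keeps the original index labels of the rest.
demo_urls = demo_df["Link to Extracted Text"].dropna()

print(len(demo_df), len(demo_urls))
print(list(demo_urls.index))
```

Note that the surviving rows keep their original index labels, so the index of the resulting series may have gaps.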

In [33]:
for ii in text_url_list:
    print(ii)

https://repository.kaust.edu.sa/bitstream/handle/10754/691095/383319ec-7776-4f6b-b692-fc017931f7c1.pdf.txt
https://repository.kaust.edu.sa/bitstream/handle/10754/691094/2304.04617.pdf.txt
https://repository.kaust.edu.sa/bitstream/handle/10754/691093/2304.04565%20%281%29.pdf.txt
https://repository.kaust.edu.sa/bitstream/handle/10754/691091/2304.04220.pdf.txt
https://repository.kaust.edu.sa/bitstream/handle/10754/691090/2304.04232.pdf.txt
https://repository.kaust.edu.sa/bitstream/handle/10754/691088/2304.04315%20%281%29.pdf.txt
https://repository.kaust.edu.sa/bitstream/handle/10754/691087/2023.04.06.535971v1.full%20%281%29.pdf.txt
https://repository.kaust.edu.sa/bitstream/handle/10754/691086/2304.03900.pdf.txt
https://repository.kaust.edu.sa/bitstream/handle/10754/691085/2304.03708.pdf.txt
https://repository.kaust.edu.sa/bitstream/handle/10754/691083/OptimalEvasivePathPlanning.pdf.txt
https://repository.kaust.edu.sa/bitstream/handle/10754/691082/MultimodalAsymmetric.pdf.txt
https://repos

We define a simple function to download the files. The function accepts an optional parameter `text_dir`, the directory in which to save the files; if it is not provided, the files are saved in the current working directory.

In [34]:
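def get_text(url_list, text_dir=""):
    """Download each URL in url_list into text_dir (default: current directory)."""
    out_dir = Path(text_dir)  # Path("") refers to the current working directory
    out_dir.mkdir(parents=True, exist_ok=True)  # create the target directory if needed
    for url in url_list:
        # Use the last URL segment as the local file name
        file_name = out_dir / url.split('/')[-1]
        print(f"Downloading file '{file_name}'")
        response = rq.get(url)
        response.raise_for_status()  # stop on HTTP errors instead of saving an error page
        file_name.write_bytes(response.content)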

For example, we provide a directory called `text` in which to save the files:

In [35]:
get_text(text_url_list, "text")

Downloading file 'text\383319ec-7776-4f6b-b692-fc017931f7c1.pdf.txt'
Downloading file 'text\2304.04617.pdf.txt'
Downloading file 'text\2304.04565%20%281%29.pdf.txt'
Downloading file 'text\2304.04220.pdf.txt'
Downloading file 'text\2304.04232.pdf.txt'
Downloading file 'text\2304.04315%20%281%29.pdf.txt'
Downloading file 'text\2023.04.06.535971v1.full%20%281%29.pdf.txt'
Downloading file 'text\2304.03900.pdf.txt'
Downloading file 'text\2304.03708.pdf.txt'
Downloading file 'text\OptimalEvasivePathPlanning.pdf.txt'
Downloading file 'text\MultimodalAsymmetric.pdf.txt'
Downloading file 'text\3DAutonomous.pdf.txt'
Downloading file 'text\UnsupervisedImageDataset.pdf.txt'
Downloading file 'text\AcentralizedMultistring.pdf.txt'
Downloading file 'text\acsaem.3c00292.pdf.txt'
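Some of the saved names above still contain percent-escapes such as `%20%281%29`, because the last URL segment is used verbatim. As a possible refinement, the sketch below decodes those escapes with the standard library before building the local path. The helper name `local_name` is hypothetical and is not used elsewhere in this notebook.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import unquote, urlsplit


def local_name(url, text_dir=""):
    """Return the local path for url, with percent-escapes decoded (hypothetical helper)."""
    # Decode the URL path first, then keep only its last component,
    # so an encoded slash cannot smuggle in extra directories.
    name = PurePosixPath(unquote(urlsplit(url).path)).name
    return Path(text_dir) / name


url = ("https://repository.kaust.edu.sa/bitstream/handle/10754/691093/"
       "2304.04565%20%281%29.pdf.txt")
print(local_name(url, "text").name)
```

The decoded name contains a space and parentheses, which is fine on most file systems but needs quoting on a command line; whether that trade-off is worth it depends on how the files are used later.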


Alternatively, we can save the URL list to a file and use an external tool such as `wget` or `curl` to download the files:

In [38]:
text_url_list.to_csv('files_url.txt', index=False, header=False)

The result will be similar to this:

```
(venv) PS C:\Users\garcm0b\Work\RepoDataset> cat .\files_url.txt
https://repository.kaust.edu.sa/bitstream/handle/10754/691095/383319ec-7776-4f6b-b692-fc017931f7c1.pdf.txt
https://repository.kaust.edu.sa/bitstream/handle/10754/691094/2304.04617.pdf.txt
https://repository.kaust.edu.sa/bitstream/handle/10754/691093/2304.04565%20%281%29.pdf.txt
(...)
```
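With the URL file in place, the external download can also be launched from Python. The following is a rough sketch, assuming a `wget` executable is installed and on `PATH`; the helper names `build_wget_command` and `download_with_wget` are hypothetical. It relies on two `wget` options: `-i` to read URLs from a file and `-P` to set the output directory.

```python
import shutil
import subprocess


def build_wget_command(url_file, out_dir):
    """Return the wget argument list that downloads every URL listed in url_file."""
    # -i: read URLs from a file, one per line; -P: directory prefix for saved files.
    return ["wget", "-i", str(url_file), "-P", str(out_dir)]


def download_with_wget(url_file="files_url.txt", out_dir="text"):
    """Run wget on url_file; assumes a wget executable is available on PATH."""
    if shutil.which("wget") is None:
        raise RuntimeError("wget was not found on PATH")
    subprocess.run(build_wget_command(url_file, out_dir), check=True)


print(build_wget_command("files_url.txt", "text"))
```

Calling `download_with_wget()` would then fetch all the listed files into the `text` directory.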