<h2>Building and submitting search queries to AGRIS</h2>
<p>This script <b>submits a search query</b> to the <a href="https://agris.fao.org/agris-search/biblio.action?">AGRIS database</a> and <b>retrieves the list of URLs</b> (or a subset of them) that point to the <b>search results</b>. The <b>result URLs</b> are <b>stored in a txt file</b> so that they can later be used for <b>scraping the AGRIS database</b> for relevant content (i.e., <b>abstracts</b> of publications available from the database), which is in turn <b>used for text annotation purposes</b>.</p>

<p>The <b>first step</b> is to <b>import the Python libraries and packages</b> that the script needs: <code>requests</code> for submitting the query and <code>BeautifulSoup</code> for parsing the returned pages.</p>

In [1]:
import requests
from bs4 import BeautifulSoup

<p>The <code>findNumOfTokens</code> function <b>counts the whitespace-separated tokens</b> (words) in a string. It is used further below, in <code>buildConfigurableQueryStr</code>, to decide how the <b>subject</b> of the search is turned into <b>search parameters</b>.</p>

In [2]:
def findNumOfTokens(string):
    numOfTokens = len(string.split())
    return numOfTokens
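
<p>As a quick illustration (the sample strings are made up for this sketch), <code>str.split()</code> called without arguments splits on any run of whitespace, so leading, trailing or repeated spaces do not inflate the token count:</p>

```python
def findNumOfTokens(string):
    # split() with no arguments splits on runs of whitespace
    numOfTokens = len(string.split())
    return numOfTokens

print(findNumOfTokens("agriculture"))         # 1
print(findNumOfTokens("  food   security "))  # 2
print(findNumOfTokens(""))                    # 0
```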

<p>When building a search query to submit to the AGRIS database, there is a <b>list of search parameters</b> that <b>need to be configured</b>, i.e., assigned the <b>values</b> that will be used for the <b>execution of the search</b> and the <b>retrieval of the result URLs</b>. These parameters are the following: 
<ul><li> <b>subject</b> (i.e., the <b>subject of the results</b> to be identified and returned: what the documents/abstracts to be retrieved need to be about);</li>
    <li> <b>result type</b> (AGRIS allows searches to be <b>restricted to a predefined result type</b>; these types are "<code>Publications</code>" and "<code>Datasets</code>");</li>
    <li> <b>start year</b> (i.e., the <b>earliest year</b> for which results should be <b>returned</b>);</li>
    <li> <b>end year</b> (i.e., the <b>latest year</b> for which results should be <b>returned</b>);</li>
    <li> <b>country name</b> (i.e., the <b>name of the country</b> that the <b>content of the resources</b> to be identified and retrieved <b>should relate to</b>); </li>
    <li> <b>language</b> (i.e., the <b>language of the content of the resources</b> made available from the search results);</li>
    <li> <b>content type</b> (i.e., the <b>type of the content of the resources</b>, such as theses, journal papers or reports, made available from the search results); </li>
</ul>
To build the part of the search query that carries the values of the parameters listed above (i.e., the <b>configurable part of the search query</b>), we <b>define</b> and <b>use</b> the <code>buildConfigurableQueryStr</code> function.</p> 
<p>The <b>input</b> of the function is the set of <b>values of the search parameters</b>. In addition, the function <b>takes into account</b> the <b>number of tokens</b> in the <b>subject</b> when constructing the <code>filterString</code> parameter(s) that carry the subject: each token of the subject gets its own <code>filterString</code> entry.</p>

In [3]:
def buildConfigurableQueryStr (subject, resultType, startYear, endYear, countryName, language, contentType):
    
    numOfTokensInSubj = findNumOfTokens(subject)
    if numOfTokensInSubj == 1:
        # single-token subject: one filterString parameter
        filterString = "filterString=%2Bsubject%3A%28" + subject.strip() + "%29"
    else:
        # multi-token subject: one filterString parameter per token,
        # separated by "&" so that the query string stays well-formed
        filterString = "&".join("filterString=%2Bsubject%3A%28" + subjectToken + "%29"
                                for subjectToken in subject.split())
    
    typeresultsField = "typeresultsField=" + resultType
    
    fromDate = "fromDate=" + str(startYear)
    toDate = "toDate=" + str(endYear)
    
    # a value of "0" stands for "not relevant" and is passed through unchanged
    country = "country=" + str(countryName)
    lang = "lang=" + str(language)
    typeToAdd = "typeToAdd=" + str(contentType)

    configurableQueryStr = "&".join([filterString, typeresultsField, fromDate, toDate, country, lang, typeToAdd])
    
    return configurableQueryStr
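
<p>As an alternative sketch, the same query-string fragment can be assembled with <code>urllib.parse.urlencode</code> from the standard library, which percent-encodes the values and joins the pairs with <code>&amp;</code>. The helper name <code>build_configurable_query</code> and its keyword defaults are hypothetical and not part of the script; the sketch also assumes that AGRIS accepts one <code>filterString</code> pair per subject token.</p>

```python
from urllib.parse import urlencode

def build_configurable_query(subject, result_type, start_year, end_year,
                             country="0", language="0", content_type="0"):
    # hypothetical helper, not part of the original script
    # one filterString pair per whitespace-separated subject token
    pairs = [("filterString", "+subject:(" + token + ")") for token in subject.split()]
    pairs += [
        ("typeresultsField", result_type),
        ("fromDate", start_year),
        ("toDate", end_year),
        ("country", country),
        ("lang", language),
        ("typeToAdd", content_type),
    ]
    # urlencode percent-encodes each value ("+" -> %2B, ":" -> %3A, "(" -> %28, ")" -> %29)
    return urlencode(pairs)

print(build_configurable_query("agriculture", "Publications", 2000, 2021, language="English"))
print(build_configurable_query("food security", "Publications", 2000, 2021))
```

<p>Letting the library do the encoding also covers subjects that contain characters with a special meaning in URLs (such as <code>&amp;</code> or <code>#</code>), which plain string concatenation would pass through unescaped.</p>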

<p>Apart from the configurable part of the search query to be submitted to the AGRIS database, there is also a <b>part of the search query</b> consisting of <b>parameters that are not configured by the user</b>: most of them receive <b>no value at all</b>, and a few receive a fixed default value.</p> 
<p>This part of the search query is referred to as the <b>default part of the search query</b>. The <b>parameters</b> receiving no value are: (i) <code>agrovocString</code>; (ii) <code>agrovocToRemove</code>; (iii) <code>advQuery</code>; (iv) <code>centerString</code>; (v) <code>centerToRemove</code>; (vi) <code>filterToRemove</code>; (vii) <code>typeString</code>; (viii) <code>typeToRemove</code>; and (ix) <code>filterQuery</code>. The parameters receiving a default value are <code>onlyFullText</code>, <code>operator</code>, <code>field</code>, <code>enableField</code> and <code>aggregatorField</code>.</p>

In [4]:
def AGRISqueryBuilder ():
    queryStr = ""
   
    # list of query parameters receiving no values
    paramsWithNullValues = ["agrovocString=", "agrovocToRemove=", "advQuery=", "centerString=", "centerToRemove=", 
                            "filterToRemove=", "typeString=", "typeToRemove=", "filterQuery="]

    # concatenating the parameters with no values to start assembling the AGRIS query string
    for param in paramsWithNullValues:
        queryStr = queryStr + param + "&"
        
    # list of query parameters with default values, such as onlyFullText, enableField and aggregatorField
    # onlyFullText = false --> access resources that may not provide access to a full-text version!
    # enableField = Disable --> multi-lingual search is disabled!
    # aggregatorField = Disable --> include records from aggregators!
    paramsWithDefaultValues = ["onlyFullText=false", "operator=Required", "field=0", "enableField=Disable", 
                              "aggregatorField=Disable"]
    
    for param in paramsWithDefaultValues:
        queryStr = queryStr + param + "&"
        
    return queryStr

<p>By calling the <code>AGRISqueryBuilder</code> function, we are able to <b>create the first part of the search query</b> that will be submitted to the AGRIS database (i.e., the <b>default part of the search query</b> containing the search parameters that receive default values or no value at all).</p>

In [5]:
queryStr_1 = AGRISqueryBuilder()

<h4>Assignment of values to the search parameters to be used for creating the configurable part of the search query</h4>

<p><b>Step 1</b>: Subject of the search query.</p>

In [6]:
subject = input("Type in the subject of your search in AGRIS: ")

Type in the subject of your search in AGRIS: agriculture


<p><b>Step 2</b>: <b>Type of the results</b> to be retrieved (namely: "<b>Publications</b>", "<b>Datasets</b>" or both).</p>

In [7]:
resultType = input("Type in the type of results (i.e., 'Publications', 'Datasets', 'Both') you are interested in: ")

Type in the type of results (i.e., 'Publications', 'Datasets', 'Both') you are interested in: Publications


<p><b>Step 3</b>: <b>Starting year</b> from which results should become available.</p>

In [8]:
startYear = input("Find resources that have become available from this year and on: ")

Find resources that have become available from this year and on: 2000


<p><b>Step 4</b>: <b>Year</b> till which results should become available (i.e., <b>end year</b>).</p>

In [9]:
endYear = input("Find resources that have become available up until this year: ")

Find resources that have become available up until this year: 2021


<p><b>Step 5</b>: The <b>name of the country</b> that the <b>content of the resources</b> to be retrieved <b>should relate to</b>.</p>

In [10]:
countryName = input("Type in the name of the country the resource's content relates to. If not relevant, provide 0 as a value: ")

Type in the name of the country the resource's content relates to. If not relevant, provide 0 as a value: 0


<p><b>Step 6</b>: The <b>language of the content</b> that will become available from the resources to be retrieved.</p>

In [11]:
language = input("Type in the language in which content should be made available. In the case of no particular preference provide 0 as a value: ")

Type in the language in which content should be made available. In the case of no particular preference provide 0 as a value: English


<p><b>Step 7</b>: The <b>type of the content</b> to be retrieved (pertinent to the "<b>Publications</b>" result type - potential values are: theses, journal papers, reports, etc.).</p>

In [12]:
contentType = input("Provide the type of content you are interested in (applies only to Publications). If not relevant, provide 0 as a value: ")

Provide the type of content you are interested in (applies only to Publications). If not relevant, provide 0 as a value: 0


<p>By calling the <code>buildConfigurableQueryStr</code> function, we are able to <b>create the second part of the search query</b> that will be submitted to the AGRIS database (i.e., the <b>configurable part of the search query</b> containing the values provided to the search parameters as part of the steps executed above).</p>

In [13]:
queryStr_2 = buildConfigurableQueryStr(subject, resultType, startYear, endYear, countryName, language, contentType)

<p>The <b>search query</b> (i.e., the <code>baseQueryStr</code>) is <b>built</b> by <b>concatenating</b> the <b>default</b> (i.e., <code>queryStr_1</code>) and the <b>configurable part</b> (<code>queryStr_2</code>) of it.</p>

In [14]:
baseQueryStr = queryStr_1 + queryStr_2

<p><b>Display</b> the <b>search query</b> (i.e., the <code>baseQueryStr</code>) to be <b>submitted</b> to the AGRIS database.</p>

In [15]:
baseQueryStr

'agrovocString=&agrovocToRemove=&advQuery=&centerString=&centerToRemove=&filterToRemove=&typeString=&typeToRemove=&filterQuery=&onlyFullText=false&operator=Required&field=0&enableField=Disable&aggregatorField=Disable&filterString=%2Bsubject%3A%28agriculture%29&typeresultsField=Publications&fromDate=2000&toDate=2021&country=0&lang=English&typeToAdd=0'
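
<p>A long query string like the one above is hard to read at a glance. As a small sketch, <code>urllib.parse.parse_qs</code> from the standard library can split it back into parameter names and decoded values for inspection. The query string used here is a shortened copy of the one displayed above.</p>

```python
from urllib.parse import parse_qs

# shortened copy of the query string displayed above
queryStr = ("agrovocString=&agrovocToRemove=&onlyFullText=false"
            "&filterString=%2Bsubject%3A%28agriculture%29"
            "&fromDate=2000&toDate=2021&lang=English")

# keep_blank_values=True keeps parameters such as agrovocString that carry no value
params = parse_qs(queryStr, keep_blank_values=True)

for name, values in params.items():
    print(name, "=", values)
```

<p>Each value is returned as a list because a parameter may appear more than once in a query string; the percent-encoded subject filter is decoded back to <code>+subject:(agriculture)</code>.</p>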

<p>The constructed <b>search query gets submitted</b> to the AGRIS database.</p>

In [16]:
response = requests.get("https://agris.fao.org/agris-search/biblio.do?" + baseQueryStr)

<p><b>Printing out</b> the <b><code>status code</code> of the response</b> shows whether the <b>query submission</b> has <b>been successful or not</b>: a <b>value of <code>200</code></b> indicates that the request was <b>processed successfully</b>.</p>

In [17]:
response.status_code

200
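
<p>Status codes other than <code>200</code> can be made easier to interpret with the standard library's <code>http.HTTPStatus</code>. The helper <code>describeStatus</code> below is a hypothetical sketch and not part of the script; it treats every 2xx code as a success.</p>

```python
from http import HTTPStatus

def describeStatus(statusCode):
    # hypothetical helper: any 2xx code counts as success, not only 200
    outcome = "success" if 200 <= statusCode < 300 else "failure"
    return str(statusCode) + " " + HTTPStatus(statusCode).phrase + " (" + outcome + ")"

print(describeStatus(200))  # 200 OK (success)
print(describeStatus(404))  # 404 Not Found (failure)
print(describeStatus(503))  # 503 Service Unavailable (failure)
```

<p>When working with <code>requests</code>, calling <code>response.raise_for_status()</code> is another option: it raises an exception for 4xx and 5xx responses, so a failed submission cannot go unnoticed.</p>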

<h4>Parsing content</h4> 
<p> The <b>page of the AGRIS database</b> that has been <b>retrieved</b> and <b>contains the results</b> of the submitted query is <b>parsed</b> in order to <b>fetch the number of the search results</b>.</p>
<p>To do so, the page is loaded into a <b>parsing object</b> (namely, an <code>instance</code> of the <code>BeautifulSoup</code> <b>class</b>). Calling the <code>find</code> <b>method</b> on the <b>parsing object</b> returns the first <code>div</code> element whose <code>class</code> attribute is "<code>pull-left grey-scale-1 last</code>"; this is the part of the results page in which the <b>number of the search results</b> is shown.</p>

In [18]:
soup = BeautifulSoup(response.content, "html.parser")
numOfResultsRecord = soup.find("div", class_ = "pull-left grey-scale-1 last")

<p>The <b>number of the search results</b> is <b>retrieved</b> by <b>splitting</b> the text of the <b>respective record</b> into pieces and taking the <b>last piece</b>. Any "<b>,</b>" thousands separator in that piece (e.g., in 912,242) is <b>removed</b> before the piece is <b>converted to an integer</b>.</p>

In [19]:
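# the last token of the record's text holds the total number of results
numOfResultsText = numOfResultsRecord.find("p").find("strong").text.split()[-1]
# remove any "," thousands separator before converting, so that totals below 1,000 are handled too
numOfResults = int(numOfResultsText.replace(",", ""))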
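
<p>The extraction of the total can also be sketched with a regular expression, which tolerates totals with or without a thousands separator and text without any number. The helper <code>parseResultCount</code> is hypothetical, and the sample strings are assumptions about what the record's text might look like, not copies of actual AGRIS output.</p>

```python
import re

def parseResultCount(text):
    # hypothetical helper: take the last number in the text,
    # allowing "," as a thousands separator
    matches = re.findall(r"\d[\d,]*", text)
    if not matches:
        return 0  # no number found: treat as zero results
    return int(matches[-1].replace(",", ""))

# assumed sample texts, for illustration only
print(parseResultCount("Results 1 - 10 of 912,242"))  # 912242
print(parseResultCount("Results 1 - 5 of 5"))         # 5
print(parseResultCount("No results"))                 # 0
```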

<p><b>Displaying</b> the <b>number of the search results</b> that have been retrieved.</p>

In [20]:
numOfResults

912242

<p>A <b>quick check</b> is done to <b>make sure</b> that the <b>search query</b> has actually <b>returned results</b>. If the <b>number of search results is not 0</b>, the user is asked for the <b>number of search results to keep</b> (useful when there are too many results and not all of them are needed).</p>

In [21]:
if numOfResults != 0:
    numOfResultsToKeep = int(input("Type in the number of results to keep: "))
else:
    print("No results have been found!")

Type in the number of results to keep: 1000


<p>The section of the script provided below <b>calculates the number of iterations</b> needed to <b>go through all the search results to be kept</b> (based on the number of results to keep provided above). This step is necessary because the AGRIS database returns search results in batches of 10, so the number of iterations is the number of results to keep divided by 10 and rounded up. The following cases are covered:
<ul><li>The <b>number of results to keep</b> is <b>more than 0 and less than 10</b> (one iteration).</li>
    <li>The <b>number of results to keep</b> is <b>exactly 10</b> (one iteration).</li>
    <li>The <b>number of results to keep</b> is a larger <b>multiple of 10</b> (the number divided by 10).</li>
    <li>The <b>number of results to keep</b> is <b>more than 10 but not an exact multiple of it</b> (the number of full batches plus one).</li>
    </ul></p>

In [22]:
# results come in batches of 10: one iteration per full batch, plus one for any remainder
fullBatches, remainder = divmod(numOfResultsToKeep, 10)
numOfIterations = fullBatches + (1 if remainder > 0 else 0)
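
<p>The batch arithmetic is a ceiling division. The sketch below wraps it in a hypothetical helper, <code>numOfPagesNeeded</code> (not part of the script), and prints the result for a few sample values, including 15, which needs two batches of 10.</p>

```python
def numOfPagesNeeded(numOfResultsToKeep, pageSize=10):
    # ceiling division with integers only: -(-a // b) equals ceil(a / b) for positive b
    return -(-numOfResultsToKeep // pageSize)

for n in (0, 7, 10, 15, 20, 1000):
    print(n, "->", numOfPagesNeeded(n))
# 0 -> 0
# 7 -> 1
# 10 -> 1
# 15 -> 2
# 20 -> 2
# 1000 -> 100
```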

<p><b>Printing out</b> the <b>number of the iterations</b> that are <b>needed to retrieve</b> the <b>required number</b> of the <b>search result URLs</b>.</p>

In [23]:
numOfIterations

100

<p><b>Creating</b> a <b>text file</b> to <b>store</b> the <b>search result URLs</b>.</p>

In [24]:
fileName = input("Type in the name of the file to use for storing the query result URLs: ")

Type in the name of the file to use for storing the query result URLs: URLS_for_the_AGRIS_dataset


In [25]:
fullFileName = fileName + ".txt"

In [26]:
file = open (fullFileName, "w")

<p>The next steps <b>iterate</b> over the search results, <b>retrieve</b> the <b>search result URLs</b>, and <b>write</b> them into the text file. Before the iteration starts, the user is asked for the <b>index</b> of the first search result to retrieve.</p>

In [27]:
startIndex = int(input("Index to start the retrieval of search results from: "))

Index to start the retrieval of search results from: 0


<p><b>Iteration over the search results</b> (from the index that has been provided and on) and <b>storage of the result URLs</b> that get retrieved into the text file.</p>

In [28]:
baseUrl = "https://agris.fao.org/agris-search/biblio.do?" + baseQueryStr + "&" + "startIndexSearch="
numOfUrlsWritten = 0

for iteration in range(numOfIterations):
    # each results page holds 10 results, so the start index advances by 10 per iteration
    pageIndex = startIndex + 10 * iteration
    # index 0 is sent as an empty startIndexSearch value
    pageUrl = baseUrl + (str(pageIndex) if pageIndex != 0 else "")
    response = requests.get(pageUrl)
    soup = BeautifulSoup(response.content, "html.parser")
    resultUrls = soup.find_all("div", class_="col-md-10 col-sm-10 col-xs-12 inner")
    for resultUrl in resultUrls:
        # stop as soon as the requested number of URLs has been stored
        if numOfUrlsWritten >= numOfResultsToKeep:
            break
        url = resultUrl.find("a")
        file.write(url["href"] + "\n")
        numOfUrlsWritten += 1
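
<p>The pagination logic can be tried out without contacting AGRIS by separating it from the download step. In the sketch below, <code>iterResultUrls</code> and <code>fakeFetchPage</code> are hypothetical names: <code>fakeFetchPage</code> stands in for the request and parsing of one results page and serves placeholder URLs, and <code>itertools.islice</code> caps the number of URLs that are kept.</p>

```python
from itertools import islice

def iterResultUrls(fetchPage, startIndex=0, pageSize=10):
    # fetchPage is a hypothetical callable: it takes a start index and returns
    # the list of result URLs found on that page (empty when past the end)
    index = startIndex
    while True:
        pageUrls = fetchPage(index)
        if not pageUrls:
            return  # no more results: stop instead of requesting empty pages
        yield from pageUrls
        index += pageSize

# stand-in for the request + parsing step, backed by 34 placeholder URLs
allUrls = ["https://example.org/record/" + str(i) for i in range(34)]

def fakeFetchPage(index):
    return allUrls[index:index + 10]

# keep at most 25 URLs, starting from index 0
kept = list(islice(iterResultUrls(fakeFetchPage), 25))
print(len(kept), kept[0], kept[-1])
```

<p>Because the generator stops at the first empty page, asking for more URLs than exist (for example 100 out of the 34 placeholders) simply returns all that are available.</p>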

<p><b>Closing</b> the text file.</p>

In [29]:
file.close()
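
<p>As a sketch of an alternative to opening and closing the file in separate steps, a <code>with</code> block closes the file automatically, even if an error occurs while the URLs are being written. The URLs and the temporary directory used below are placeholders for illustration only.</p>

```python
import os
import tempfile

# placeholder URLs standing in for the retrieved search result URLs
urls = ["https://example.org/record/1", "https://example.org/record/2"]

path = os.path.join(tempfile.mkdtemp(), "URLS_for_the_AGRIS_dataset.txt")

# the file is closed automatically when the with block ends
with open(path, "w", encoding="utf-8") as urlFile:
    for url in urls:
        urlFile.write(url + "\n")

# read the file back to confirm that one URL was stored per line
with open(path, encoding="utf-8") as urlFile:
    storedUrls = urlFile.read().splitlines()

print(storedUrls)
```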