# Project: Strategic document monitoring from https://www.theses.fr

## Website presentation

The website [theses.fr](https://www.theses.fr/) centralizes all theses in France since 1985. This is why a strategic document monitoring tool is needed to stay informed about theses on our subject.

This notebook is the tool used to monitor [theses.fr](https://www.theses.fr/).

## How to use this tool?
Each time you run this notebook:  
1. The notebook asks you for the keywords/searches to run.
  - You can enter multiple searches if you separate them with ';'. <br>
  <u>Example</u>: <i>immunology; infectious diseases</i> corresponds to 2 different searches. The results of these 2 searches will be aggregated.

2. The notebook extracts the information and returns a file with the link, title and abstract of the theses found.

For weekly monitoring, you need to run this Python tool each week, using the same keywords/searches each time. Results from previous extractions will not be shown again.

If you run this notebook with different keywords/searches, this is considered a new monitoring task, and results from previous extractions will be shown.
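
The weekly behaviour described above amounts to keeping, for each search, a record of the thesis links already returned and filtering them out on the next run. The following is a minimal sketch of that idea, not the notebook's own code: the function name `split_new_and_seen` and the example links are hypothetical and only serve as an illustration.

```python
def split_new_and_seen(found_ids, seen_ids):
    """Separate the links found by a search into new ones and already-seen ones."""
    seen = set(seen_ids)  # set membership tests are fast, even for thousands of links
    new_ids, already_seen = [], []
    for thesis_id in found_ids:
        if thesis_id in seen:
            already_seen.append(thesis_id)
        else:
            new_ids.append(thesis_id)
            seen.add(thesis_id)  # a link repeated within the same run counts as seen
    return new_ids, already_seen


# Hypothetical links, for illustration only
previous_run = ["https://www.theses.fr/A", "https://www.theses.fr/B"]
this_run = ["https://www.theses.fr/B", "https://www.theses.fr/C"]

new_ids, already_seen = split_new_and_seen(this_run, previous_run)
print("New:", new_ids)
print("Already seen:", already_seen)
```

Running the same search twice in a row would therefore return nothing new the second time, which matches the weekly monitoring use case.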

# Import libraries

In [1]:
import pandas as pd
import os
import sys
import re
from unidecode import unidecode
import openpyxl

from bs4 import BeautifulSoup
import requests

from typing import List

from datetime import datetime

import tkinter as tk
from tkinter import simpledialog
from tkinter import messagebox

# Legal aspects

The website [theses.fr](https://www.theses.fr/) forbids scraping some theses. These theses are listed here: [https://www.theses.fr/robots.txt](https://www.theses.fr/robots.txt).

## Load robots.txt from website

In [2]:
# Load robots.txt file
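# A multi-character separator requires the python parsing engine; setting it explicitly avoids a ParserWarning
robots_df = pd.read_csv("https://www.theses.fr/robots.txt", sep = ": ", engine = "python").rename(columns = {"User-agent":"col", "*":"id_thesis"})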
# display(robots_df.head(5))



## Extract disallowed URLs from robots.txt

In [3]:
robots_df = robots_df[ (robots_df["col"] != "Crawl-delay") & (robots_df["col"] != "Sitemap") ] # Delete Site map and Crawl-delay rows

illegal_url_list = robots_df.id_thesis.apply(lambda x: "https://www.theses.fr"+x).tolist() # List of disallowed URLs

# robots_df.to_csv("illegal_urls.csv")
# display(illegal_url_list) # Check : OK
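
As a point of comparison, the Python standard library ships `urllib.robotparser`, which can answer the same "may I fetch this URL?" question without building a DataFrame. The sketch below is an alternative to the cells above, not a description of them. It parses a small made-up rule set: the `Disallow` paths and thesis identifiers are invented for the example, and the real rules would have to be loaded from the site.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content, for illustration only
sample_robots_lines = [
    "User-agent: *",
    "Disallow: /2010EXAMPLE0001",
    "Crawl-delay: 10",
]

parser = RobotFileParser()
parser.parse(sample_robots_lines)  # parse() takes an iterable of lines

blocked_url = "https://www.theses.fr/2010EXAMPLE0001"
allowed_url = "https://www.theses.fr/2011EXAMPLE0002"

# can_fetch(useragent, url) returns True when the rules allow the URL
print(blocked_url, "->", parser.can_fetch("*", blocked_url))
print(allowed_url, "->", parser.can_fetch("*", allowed_url))
```

Note that `can_fetch` matches rules by path prefix, whereas the list built above is compared by exact equality, so the two approaches may disagree on URLs that merely start with a disallowed path.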

# App

## User input

In [60]:
user_input = ""
result = False

while user_input == "" : 
    # Hidden root window required by the tkinter dialogs
    tk_window = tk.Tk()
    tk_window.withdraw()

    # The input dialog (returns None if the user cancels)
    user_input = simpledialog.askstring(title="Request",
                                      prompt="Please enter your keywords : \n "+
                                        "(you can run multiple searches at the same time if you separate them with a semicolon (;) )")

    if user_input is None :
        tk_window.destroy()
        raise SystemExit("Request cancelled by the user")

    # User confirmation
    if user_input != "" :
        result = messagebox.askyesno("Request confirmation", "Your request is : " + user_input) # Returns True if yes, False if no
        if not result :
            user_input = "" # Request not confirmed: ask again

    tk_window.destroy()
 
print("Raw user input : '" + user_input + "'")

Raw user input : 'test ; pathologie digitale ; transformation chimique'


In [5]:
# user_input = "      prout    dsfdsqf    ;      tesqfdsq   fdsqfdsqf   efdsq;           " # user-input for testing

## Parsing user request

In [61]:
# print(user_input)
# user_input = "    prîut    dsfdèqf;   tésqfdsq fdsùfdàqf      efééééésq   ;                " # user_input for testing

input_list = user_input.split(";")
mask = []

for count, element in enumerate(input_list) : 
    print("element n°", count+1, " on ", len(input_list))
    element = re.sub(' +', ' ', element) # Collapse multiple spaces into one
    element = re.sub('^ +', '', element) # Delete spaces before the search terms
    element = re.sub(' +$', '', element) # Delete spaces after the search terms
    element = re.sub(' ', '+', element) # Join the words with '+' for the URL
    element = unidecode(element) # Remove accents
    
    print("element : '" + element + "'")
    
    if len(element) == 0 :
        print("1 element removed because it is empty")
    else : 
        mask = mask + [element]

input_list = mask

# Verifying output
# print("Liste finale : ")
print("Final list of elements: ",str(input_list))

if len(input_list) == 0 :
  sys.exit("No user input")

element n° 1  on  3
element : 'test'
element n° 2  on  3
element : 'pathologie+digitale'
element n° 3  on  3
element : 'transformation+chimique'
Final list of elements:  ['test', 'pathologie+digitale', 'transformation+chimique']
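
The parsing steps above (split on `;`, normalise spaces, join words with `+`, remove accents, drop empty elements) can also be wrapped in a single function, which makes them easier to test. This is only a sketch under stated assumptions: the name `parse_request` is hypothetical, `str.split()` is used instead of the regular expressions (so it also treats tabs and newlines as separators), and the standard library's `unicodedata` stands in for `unidecode`. The two are not equivalent: `unicodedata` only strips combining accents, while `unidecode` also transliterates characters such as `œ`.

```python
import unicodedata


def strip_accents(text):
    """Remove combining accents (é -> e, î -> i) using Unicode decomposition."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(char for char in decomposed if not unicodedata.combining(char))


def parse_request(raw_request):
    """Turn 'a b ; c' into ['a+b', 'c'], dropping empty searches."""
    queries = []
    for part in raw_request.split(";"):
        words = part.split()  # splits on any whitespace and ignores leading/trailing spaces
        if not words:
            continue  # nothing between two semicolons
        queries.append(strip_accents("+".join(words)))
    return queries


print(parse_request("test ; pathologie digitale ; transformation chimique"))
print(parse_request("   prîut    dsfdèqf;   ;  "))
```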


## Website extraction functions

In [4]:
# Functions to extract results.
# theses.fr only allows 1000 results per export, so several exports are needed and then stacked.

def scraping_number_results(url_short) :
  # Read the number of results displayed on the search results page
  html = requests.get(url_short)
  soup = BeautifulSoup(html.content, "html.parser")
  number_results = int(soup.find("div", attrs={"id":"resumR"}).find("span", attrs={"id":"sNbRes"}).text)

  return number_results

def result_scraping(n_res, url) :
  # url must contain a "{}" placeholder, replaced by the start offset of each export
  number_results = n_res

  # Variables
  definitive_df  = pd.DataFrame()

  # Loop extraction, 1000 results per export
  start = 0

  # Stop as soon as the offset reaches the number of results (no extra empty export)
  while start < number_results :
    # print("url : " + str(url.format(start)))
    # Note: "element" is the global loop variable of the calling cell
    print("For element : " + element + ", extraction from " + str(start) + " to " + str(min(number_results, start+1000)))
    df_temp = pd.read_csv(url.format(start), sep = ";")

    definitive_df = pd.concat([definitive_df, df_temp], ignore_index = True)
    
    start += 1000

  # Initial version: df = pd.read_csv(url)
  # Problem: with more than 1000 results, the CSV only contains the first 1000 results

  return definitive_df
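
To make the pagination explicit, here is a small sketch of how the start offsets of the successive exports can be computed for a given number of results. The helper name `export_offsets` and its `page_size` parameter are hypothetical. The 1000-result limit is the one stated in the comment above.

```python
def export_offsets(number_results, page_size=1000):
    """Return the start offset of each export needed to cover all results."""
    return list(range(0, number_results, page_size))


# 20905 results need 21 exports: offsets 0, 1000, ..., 20000
offsets = export_offsets(20905)
print(len(offsets), "exports, last offset:", offsets[-1])

# Exactly 1000 results fit in a single export
print(export_offsets(1000))
```

Using `range` with a step avoids the off-by-one case where a result count that is an exact multiple of 1000 would trigger one extra, empty export.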

In [17]:
# input_list = ["transformation+chimique", "pathologie+digestive+numerique"] # input list for testing

thesis_df_temp = []
thesis_df = pd.DataFrame(columns = ['keywords', 'id_thesis'])

for element in input_list : 
  print("\n-------------------------------------------------")
  print("Element : " + element)

  # Load the results of previous searches for this element
  try : 
    seen_df = pd.read_csv(element + ".csv")["seen_id_thesis"].tolist()
    print("File from previous researches found.\n")

    print("Number of thesis already seen in preceent researches : " + str(len(seen_df)))
  except : 
    print("This request has no precedent.\n")
    seen_df = []
  
  # Verifying number of results
  try : 
    number_results = scraping_number_results("https://www.theses.fr/?q=" + element)
    print(number_results, " results for element : ", element)
  except : 
    number_results = 0
    print("No results for this element : " + str(element))

  # Extract results
  if (number_results > 0) : 
    try : 
      # Alternative (disabled): use the current date and time as the upper bound of the date filter
      # scrap_results = result_scraping(number_results, "https://www.theses.fr/?q=" + element + "&fq=dateSoutenance:[1965-01-01T23:59:59Z%2BTO%2B" + 
      # datetime.now().strftime("%Y-%m-%d") + "T" + datetime.now().strftime("%H:%M:%S") + "Z" + 
      # "]&checkedfacets=&start={}&sort=none&status=&access=&prevision=&filtrepersonne=&zone1=titreRAs&val1=&op1=AND&zone2=auteurs&val2=&op2=AND&zone3=etabSoutenances&val3=&op3=AND&zone4=dateSoutenance&val4a=&val4b=&type=&lng=fr/&checkedfacets=&format=csv")
      
      scrap_results = result_scraping( number_results,
        "https://www.theses.fr/?q="+ element + 
        "&fq=dateSoutenance:[1965-01-01T23:59:59Z%2BTO%2B2023-12-31T23:59:59Z]&checkedfacets=&start={}&sort=none&status=&access=&prevision=&filtrepersonne=&zone1=titreRAs&val1=&op1=AND&zone2=auteurs&val2=&op2=AND&zone3=etabSoutenances&val3=&op3=AND&zone4=dateSoutenance&val4a=&val4b=&type=&lng=fr/&checkedfacets=&format=csv"  )
      
      # if element == "transformation+chimique" :
      #  df.to_csv("transformation+chimique_raw.csv")
      
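      # Keep only the useful columns; .copy() avoids a SettingWithCopyWarning when a column is modified below
      scrap_results = scrap_results[["Statut", "Identifiant de la these", "Accessible en ligne", "Titre", "Auteur", "Directeur de these (nom prenom)", "Etablissement de soutenance", "Discipline"]].copy()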
      scrap_results["Identifiant de la these"] = scrap_results["Identifiant de la these"].apply(lambda x : "https://www.theses.fr/" + x)
      
      seen_precedent_researches = 0
      illegal_thesis = 0
      redundant_thesis = 0
      new_thesis = [] # New theses found for this element only

      for id_thesis in scrap_results["Identifiant de la these"] : 
        # print("ID thesis evaluated : " + id_thesis)
        
        if id_thesis in seen_df : 
          seen_precedent_researches += 1
        else : 
          seen_df = seen_df + [id_thesis]
          if id_thesis in illegal_url_list :
            illegal_thesis += 1
          else : 
            if id_thesis in thesis_df_temp : # Already found with a previous element of this run
              redundant_thesis += 1
            else : 
              thesis_df_temp = thesis_df_temp + [id_thesis]
              new_thesis = new_thesis + [id_thesis]
      
      # Save already seen theses
      pd.DataFrame(seen_df, columns = ["seen_id_thesis"]).to_csv(str(element) + ".csv", index = False)
      
      # Append the new search results of this element (not those of previous elements)
      thesis_df = pd.concat([thesis_df,  
                            pd.DataFrame({'keywords':element, 'id_thesis':new_thesis}).merge(scrap_results.rename(columns = {"Identifiant de la these":"id_thesis"}), 
                                                                                on="id_thesis", 
                                                                                how = "left")],
                            ignore_index = True)
      print("Theses already seen in previous searches : " + str(seen_precedent_researches))
      print("Duplicates : " + str(redundant_thesis))
      print("Disallowed theses : " + str(illegal_thesis))

      print("Extraction from " + str(element) + " finished! ")

    except Exception as error : 
      # Report the cause instead of hiding it
      print("Error in element : ", element, "-", error)  

# Save the newly extracted theses
thesis_df.to_excel("extract_" + str(input_list) + "_" + str(datetime.now())[:10] + ".xlsx", sheet_name = "extraction", index = False)
# thesis_df.to_csv("extract_" + str(input_list) + "_" + str(datetime.now())[:10] + ".csv", index = False)

print("\n---------------------------\nExecution finished !")


-------------------------------------------------
Element : transformation+chimique
File from previous researches found.

Number of thesis already seen in preceent researches : 20905
20905  results for element :  transformation+chimique
For element : transformation+chimique, extraction from 0 to 1000
For element : transformation+chimique, extraction from 1000 to 2000
For element : transformation+chimique, extraction from 2000 to 3000
For element : transformation+chimique, extraction from 3000 to 4000
For element : transformation+chimique, extraction from 4000 to 5000
For element : transformation+chimique, extraction from 5000 to 6000
For element : transformation+chimique, extraction from 6000 to 7000
For element : transformation+chimique, extraction from 7000 to 8000
For element : transformation+chimique, extraction from 8000 to 9000
For element : transformation+chimique, extraction from 9000 to 10000
For element : transformation+chimique, extraction from 10000 to 11000
For element : 