# Keyword Popularity POC

- Yacine seba
- Amine Arouk

- Proof of Concept: popularity of the keyword pairs built from the management ("gestion") and innovation lexicons
 - Objectives: identify the most relevant keyword pairs in order to reduce the number of daily / weekly queries, and provide the scraper group with the pairs that return the most interesting data to analyze
 - Method: our method has several steps, detailed throughout this notebook. It relies mainly on two metrics: the occurrence of the pairs in the titles and summaries of the returned links, and the evolution of this occurrence over time
 - Analyzing these two measures lets us compute a weighted score for each pair.
 - A threshold is then chosen to categorize the keywords (daily, weekly, or monthly queries)
 - Why?
 - The number of keyword pairs is very high (more than 15,000), so it is impossible to run that many queries over a short period. To reduce the number of queries to run, we decided to select only the most important keywords.
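
The scoring and thresholding idea described above can be sketched as follows. This is only an illustration, not the notebook's implementation: the weights `w_occ` and `w_evo`, the two threshold values, the function names, and the sample pairs are all assumptions made for the example.

```python
def weighted_score(occurrence_pct, evolution_pct, w_occ=0.5, w_evo=0.5):
    """Combine the two metrics (both expressed in percent) into one score."""
    return w_occ * occurrence_pct + w_evo * evolution_pct


def categorize(score, daily_threshold=60.0, weekly_threshold=30.0):
    """Map a score to a query frequency using two hypothetical thresholds."""
    if score >= daily_threshold:
        return "daily"
    if score >= weekly_threshold:
        return "weekly"
    return "monthly"


# Hypothetical pairs: (occurrence in % of the maximum, evolution in %)
pairs = {
    "budget blockchain": (100.0, 40.0),
    "mairie 5g": (50.0, 20.0),
    "trésorerie iot": (10.0, 0.0),
}
schedule = {pair: categorize(weighted_score(occ, evo)) for pair, (occ, evo) in pairs.items()}
print(schedule)
```

With equal weights, the three sample pairs score 70, 35 and 5, so they fall into the daily, weekly and monthly groups respectively.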

### 4 STEPS:
# Step 1 
- Import the libraries and lexicons, and create the links containing the keywords

# Step 2
- Crawl and parse the data to retrieve the title / summary of each link

# Step 3
- Analyze the titles and summaries to list the lexicon words that appear in them (tokenization and stop word removal)

# Step 4
- List the keyword pairs via a Cartesian product, count their occurrences and their evolution over time, and compute the score









In [1]:
pip install scrapy



In [2]:
########## Module import ##########

# Files
from google.colab import files
import json
import pandas as pd

# Maths
import random

# Extraction
import re

# Scraping
import scrapy
#from scrapy import Selector
from requests import get

# Parsing
from urllib.parse import urlencode
from urllib.parse import urlparse

# Format
import datetime
import time


#import nltk functions and french stopwords
from nltk.corpus import stopwords
from nltk.stem import LancasterStemmer

import numpy as np
import nltk
from nltk.stem import SnowballStemmer 
nltk.download('stopwords')
stop_words=nltk.corpus.stopwords.words('french')


import ast

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [3]:
# Upload files (lex)
#Import the innovation lexicon, the management lexicon and listCouple.json
uploaded = files.upload()

Saving Lexique_Gammes_Gestion.txt to Lexique_Gammes_Gestion (2).txt
Saving Lexique_Innovation.txt to Lexique_Innovation (2).txt
Saving listCouple.json to listCouple (2).json


In [4]:
# Get the list of word combinations
df = pd.read_json('listCouple.json', orient='index')
df_t = df.T
listCouple=df_t.values.tolist()

**Convert the lexicon txt files into lists**
- One list for the innovation lexicon
- One list for the management lexicon

In [5]:
#Read each lexicon file and store its entries in a list (one entry per line)

with open('Lexique_Gammes_Gestion.txt') as f:
    # rstrip removes the line break only, so a last line without one is kept whole
    gestion = [line.rstrip('\n') for line in f]
with open('Lexique_Innovation.txt') as f:
    innovation = [line.rstrip('\n') for line in f]
print(gestion)
print(innovation)

['Collectivité territoriale', 'Administration', 'Mairie', 'Commune', "Communauté d'agglomération", 'Intercommunalités', 'Collectivités locales', 'Agglomération', 'Organisation territoriale', 'ville', 'élus', 'Finance', 'Finances publiques', 'Gestion financière', 'Performance publique', 'Analyse financière', 'évaluation des politiques publiques', 'Rapport de la Cour des comptes', 'CAP22', 'Comité action publique', 'Pilotage des dépenses', 'Flux financiers', 'Gestion comptable', 'Gestion de trésorerie', 'Budget', 'Processus achat', 'management opérationnel', 'aide à la décision', 'programmation pluriannuelle', 'PPI', 'PPF', 'trajectoire financière', 'AP', 'APCP', 'CP', 'AP/CP', 'autorisation de programme', 'crédit de paiement', 'gestion budgétaire', 'immobilisation', 'subvention', 'suivi financier et technique des marchés', 'SIGF', 'simulation', 'arbitrage', 'maquette budgétaire', 'bons de commandes', 'factures', 'liquidation', 'signature électronique', 'trésorerie', 'archivage électroni

- **Set the API key**
- **Set the word pairs to use**
- **Set the number of pairs and the number of requests**
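
As a side note on the API key: rather than writing it in the notebook, it can be read from the environment. The sketch below is one possible way to do this. The variable name `SCRAPERAPI_KEY` and the helper `load_api_key` are assumptions made for the example, and the value used is a placeholder, not a real key.

```python
import os


def load_api_key(env_var="SCRAPERAPI_KEY"):
    """Return the API key stored in an environment variable.

    The variable name is an assumption made for this example.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before running the crawl")
    return key


# Example with a placeholder value (not a real key)
os.environ["SCRAPERAPI_KEY"] = "placeholder-key"
API_KEY = load_api_key()
```

This keeps the key out of the saved notebook and of any cell output that prints the generated URLs.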

In [39]:
########## Parameters to change ##########

# API Key (created on Scraper API)
API_KEY = '6e2f13665a14d50e37acdfc9636b4877'

# List of couples to test (restrict it to a subset for a quick run)
p_listCouple = listCouple
# Number of couples combined with OR in each query
p_length = 1
# Maximum number of queries to generate
p_requestNumber = 2

In [40]:
#list of sites we don't want to scrape
banned_links = ['www.linkedin.com','fr.linkedin.com','www.linguee.com','www.researchgate.net','docplayer.org','docplayer.net','docplayer.fr','books.google.com','books.google.fr','www.cairn.info','www.pinterest.com','www.pinterest.fr','fr.indeed.com','www.indeed.fr','www.indeed.com','www.facebook.com','viadeo.journaldunet.com','fr.wikipedia.org','www.senat.fr','www.amazon.com','www.amazon.fr','www.youtube.com','twitter.com','www.slideshare.net','fr.slideshare.net','www.calameo.com','www.talent.com','fr.talent.com','issuu.com','www.aijobs.tech','www.economie.gouv.fr','theses.fr','www.theses.fr','hal.archives-ouvertes.fr','fr.scribd.com','www.scribd.com','www.legifrance.gouv.fr','www2.assemblee-nationale.fr','www.assemblee-nationale.fr','www2.assemblee-nationale.fr','www.ccomptes.fr','www.vie-publique.fr','www.banquedesterritoires.fr','core.ac.uk','www.cnfpt.fr','www.malt.fr','www.malt.com','link.springer.com','www.sciencedirect.com','slideplayer.fr','slideplayer.org','slideplayer.com']

#stopword=nltk.corpus.stopwords.words('english')

#this section is detailed in audrey's notebook 
def ban_links(links_banned = banned_links) :
  a=' '
  for i in links_banned :
    a += '-inurl:'+i+' '

  return a[:-1]


def remove_stopwords(tokenized_list):
    # Drop French stop words and empty tokens (the comparison is case-sensitive)
    text=[word for word in tokenized_list if word and word not in stop_words]
    return text
    
def tokenize(text):
    # Split on whitespace and apostrophes, e.g. "l'achat" -> ['l', 'achat']
    tokens = re.split(r"\s|'", text)
    return tokens


def around_query(lexique) :
  # Join the remaining words of each entry with the AROUND(2) proximity operator
  liste = [remove_stopwords(tokenize(i)) for i in lexique]
  # A single-word entry is returned unchanged; an entry made only of stop words becomes ''
  return [' AROUND(2) '.join(mots) for mots in liste]

  
lexique_gestion_tokenize = around_query(gestion)
lexique_innovation_tokenize = around_query(innovation)




 


**Functions used to build the Google crawler from the list of word pairs initialized above**
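
Before the full set of functions, here is a condensed sketch of how one query is assembled: each pair becomes an `AND` group, groups are joined with `|`, banned sites are excluded with `-inurl:`, and the result is URL-encoded. The helper names `build_query` and `build_google_url` are hypothetical and only summarize the logic; they are not the functions used by the notebook.

```python
from urllib.parse import urlencode, parse_qs, urlparse


def build_query(pairs, banned_sites):
    """Build one Google query from keyword pairs and excluded sites."""
    # Each pair becomes (word1 AND word2); groups are separated by | (OR)
    groups = "|".join(f"({a} AND {b})" for a, b in pairs)
    # Each banned site is excluded with the -inurl: operator
    exclusions = " ".join(f"-inurl:{site}" for site in banned_sites)
    return f"{groups} {exclusions}".strip()


def build_google_url(query, num_results=20):
    """URL-encode the query into a Google search URL."""
    return "http://www.google.com/search?" + urlencode({"q": query, "num": num_results})


query = build_query([('"Finance"', '"cybersécurité"')], ["www.linkedin.com"])
url = build_google_url(query)
# Decoding the URL gives the original query back
decoded = parse_qs(urlparse(url).query)["q"][0]
print(query)
print(url)
```

Decoding the `q` parameter is a quick way to check that accents, quotes and operators survive the encoding step.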

In [41]:

def create_google_url(query):
    """ 
    Allows you to create a Google URL from a keyword

    Parameter :
        query : keyword to enter in the search bar
    
    Out :
        google_url : google URL created from the keyword
    """
    # num = number of results to be scraped
    google_dict = {'q': query, 'num': 20,}
    google_url = 'http://www.google.com/search?' + urlencode(google_dict)
    return google_url

def combAND(couple):
    """Documentation
    Parameters:
        couple: a list of 2 Strings
    Out :
        list : a combination of the 2 members of a couple with AND between them
    """

    return str(couple[0])+' '+'AND'+' '+str(couple[1])

def listToAND(listCouple):
    """Documentation
    Parameters:
        listCouple: a list of couple

    Out :
        list : a list of the combination of the couple of listCouple
    """

    #we use combAND
    return [combAND(i) for i in listCouple]


def combOR(items):
    """Documentation

    Parameters:
        items: a list of String

    Out :
        final : a combination of the members of the list, each framed with () and separated by | (OR)
    """

    #every member is wrapped in parentheses, then the groups are joined with |
    return '|'.join('(' + str(i) + ')' for i in items)

#applying the previous function to a list of tuple
def listToOR(listTuple):
    """Documentation
    Parameters:
        listTuple: a list of tuples

    Out :
        list : a list of the combination of the tuple of listTuple
    """

    #we use combOR
    return [combOR(i) for i in listTuple]

#making a list of random tuples
def listComb(listAND,numbT=2,iteration=int(1000)): #we have to limit the number of request, by default 1000, and we make couples (2)
    """Documentation

    Parameters:
        listAND: a list Strings with AND
        numbT : the length of the tuple we want to create
        iteration : the maximum number of combination we want to create

    Out :
        finalList : a list of the combination of the tuple of listTuple
    """

    finalList=[]
    i=0
    #Step 1 : we loop until we have enough tuples or the list is empty
    while ((len(listAND)>=numbT) and (i<iteration)):
        i+=1
        #Step 2 : at each loop, we take some random elements of listAND and create a tuple with them
        listRand=random.sample(listAND,numbT)
        #Step 3 : we remove the elements from listAND
        for j in listRand :
            listAND.remove(j)
        #Step 4 :we add the tuple we created to our finalList
        finalList.append(listRand)
    return finalList

#function that allows to add a "limit date" parameter in the link
def link_sub_month(link, nbmonths):
    today = datetime.date.today()
    
    nbdays=nbmonths*31

    days_to_substract = datetime.timedelta(days=nbdays)

    date = today - days_to_substract
    #limit date
    
    jour=str(date.day)
    mois=str(date.month)
    annee=str(date.year)
    
    link = link+"&source=lnt&tbs=cdr%3A1%2Ccd_min%3A"+mois+"%2F"+jour+"%2F"+annee+"%2Ccd_max%3A&tbm="
    return link

def get_url(url):
    """ 
    Creation of the URL that will allow the legal scraping of Google results (use of the API key). 
    This URL is equivalent to a Google search.

    Parameter :
        url : google URL created from the keyword
    
    Out :
        proxy_url : URLs built using the API
    """

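    payload = {'api_key': API_KEY, 'url': url, 'autoparse': 'true', 'country_code': 'fr', 'pws': 0}
    proxy_url = 'http://api.scraperapi.com/?' + urlencode(payload)
    # Note: no date filter is applied here. To restrict the results by date,
    # link_sub_month() has to be applied to the Google `url` before it is
    # encoded into the payload, not to the proxy URL.
    return proxy_url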

class GoogleSpider(scrapy.Spider):
    """ 
    This class lists functions for scraping Google results from a list of keywords
    """

    # GoogleSpider class name
    name = 'google'
    # Name of the site to be scraped
    allowed_domains = ['www.google.com']
    # Settings
    custom_settings = {
                        # Criticality level at which the log is displayed
                        'LOG_LEVEL': 'INFO', 
                        # Maximum number of simultaneous requests 
                        'CONCURRENT_REQUESTS_PER_DOMAIN': 1, 
                        # Maximum number of retries to be made if the query fails
                        'RETRY_TIMES': 0}

    def start_requests(self, listCouple, length, requestNumber,banned_links=banned_links):
        #Initialisation of DataFrame
        df = pd.DataFrame(columns=['URL','Query'])
        #Adding quotes
        #listCouple = listExpression(listCouple)
        #Format changeover 
        lWork=listToAND(listCouple)
        #Selection of queries 
        lWork=listComb(lWork,numbT=length, iteration=requestNumber)
        #We change the format of Keywords
        lWork=listToOR(lWork)

        lURL=[]
        #We loop the keywords to generate the queries 
        #for query in ['emprunt écologie'] :
        for query in lWork:
            
            # Only the first 10 banned sites are excluded in the query (list still to be sorted)
            url = create_google_url(query+ban_links(banned_links[:10]))
            lURL.append(scrapy.Request(get_url(url), callback=self.parse, meta={'pos': 0}).url)
        # url = create_google_url('innovation ville')
        # lURL.append(str(scrapy.Request(get_url(url), callback=self.parse, meta={'pos': 0}))[5:-1])
        #column generation
        df['Query'] = lWork
        df['URL'] = lURL

        yield df

    def parse(self, response):
        di = json.loads(response.text)
        pos = response.meta['pos']
        # `datetime` is imported as a module, hence datetime.datetime
        dt = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        for result in di['organic_results']:
            title = result['title']
            snippet = result['snippet']
            link = result['link']
            item = {'title': title, 'snippet': snippet, 'link': link, 'position': pos, 'date': dt}
            pos += 1
            yield item
        # Follow the next results page when the response provides one
        next_page = di.get('pagination', {}).get('nextPageUrl')
        if next_page:
            yield scrapy.Request(get_url(next_page), callback=self.parse, meta={'pos': pos})

**Generating the Scraper API links containing the keywords**
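
The URLs returned by `get_url` carry no date restriction. If one is wanted, a possible approach is to add the custom date range to the Google URL itself and only then wrap it for the proxy, since parameters appended to the proxy URL would presumably not reach the Google search. The sketch below reuses the `tbs=cdr:1,cd_min:...` format and the 31-days-per-month approximation found in `link_sub_month`. The function names, the `today` argument and the placeholder key are assumptions made for the example.

```python
import datetime
from urllib.parse import urlencode


def google_url_since(query, nb_months, today=None, num_results=20):
    """Google search URL limited to results newer than `nb_months` months."""
    today = today or datetime.date.today()
    # Same approximation as link_sub_month: one month counts as 31 days
    start = today - datetime.timedelta(days=31 * nb_months)
    tbs = f"cdr:1,cd_min:{start.month}/{start.day}/{start.year},cd_max:"
    params = {"q": query, "num": num_results, "source": "lnt", "tbs": tbs}
    return "http://www.google.com/search?" + urlencode(params)


def proxy_url(google_url, api_key):
    """Wrap the complete Google URL (filter included) in a Scraper API URL."""
    payload = {"api_key": api_key, "url": google_url, "autoparse": "true",
               "country_code": "fr", "pws": 0}
    return "http://api.scraperapi.com/?" + urlencode(payload)


google_url = google_url_since("innovation ville", 6, today=datetime.date(2021, 1, 15))
print(google_url)
print(proxy_url(google_url, "placeholder-key"))
```

Passing `today` explicitly makes the result reproducible; when it is omitted, the current date is used.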

In [42]:
# Start building URLs
if __name__ == '__main__':
    df_result = list(GoogleSpider().start_requests(listCouple = p_listCouple,length=p_length,requestNumber=p_requestNumber))[0]
df_result['URL'].values

array(['http://api.scraperapi.com/?api_key=6e2f13665a14d50e37acdfc9636b4877&url=http%3A%2F%2Fwww.google.com%2Fsearch%3Fq%3D%2528%2522Finance%2522%2BAND%2B%2522cybers%25C3%25A9curit%25C3%25A9%2522%2529%2B-inurl%253Awww.linkedin.com%2B-inurl%253Afr.linkedin.com%2B-inurl%253Awww.linguee.com%2B-inurl%253Awww.researchgate.net%2B-inurl%253Adocplayer.org%2B-inurl%253Adocplayer.net%2B-inurl%253Adocplayer.fr%2B-inurl%253Abooks.google.com%2B-inurl%253Abooks.google.fr%2B-inurl%253Awww.cairn.info%26num%3D20&autoparse=true&country_code=fr&pws=0',
       'http://api.scraperapi.com/?api_key=6e2f13665a14d50e37acdfc9636b4877&url=http%3A%2F%2Fwww.google.com%2Fsearch%3Fq%3D%2528%2522pilotage%2Bde%2Bl%2527achat%2522%2BAND%2B%2522augmented%2Bintelligence%2522%2529%2B-inurl%253Awww.linkedin.com%2B-inurl%253Afr.linkedin.com%2B-inurl%253Awww.linguee.com%2B-inurl%253Awww.researchgate.net%2B-inurl%253Adocplayer.org%2B-inurl%253Adocplayer.net%2B-inurl%253Adocplayer.fr%2B-inurl%253Abooks.google.com%2B-inurl%253Ab

In [43]:
# Crawling start time (clock shifted by +1 hour)
print((datetime.datetime.now()+datetime.timedelta(hours=1)).strftime("%H:%M:%S"))

18:59:33


#Transition between the first and second steps

**Crawl the API links generated above**
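
The crawl below skips any request that does not return status 200. As a possible variation, the loop can retry a failed request and keep track of the queries that never succeeded. This is only a sketch: the function `crawl`, its parameters and the stand-in `fake_fetch` are hypothetical, and the fetcher is passed in as an argument so that the loop can be tried without network access.

```python
import time


def crawl(requests_to_send, fetch, pause_seconds=60, max_attempts=2, sleep=time.sleep):
    """Fetch each (url, query) pair, pausing before every call.

    `fetch(url)` must return (status_code, text).
    """
    collected, failed = [], []
    for url, query in requests_to_send:
        for attempt in range(max_attempts):
            sleep(pause_seconds)  # pause before every call to spare the API
            status, text = fetch(url)
            if status == 200:
                collected.append([text, query])
                break
        else:
            # No attempt succeeded: record the query instead of dropping it silently
            failed.append(query)
    return collected, failed


# Stand-in fetcher for the example: the second URL always fails
def fake_fetch(url):
    return (200, '{"organic_results": []}') if url.endswith("ok") else (500, "")


collected, failed = crawl(
    [("http://example.test/ok", "q1"), ("http://example.test/ko", "q2")],
    fake_fetch,
    pause_seconds=0,
)
print(len(collected), failed)
```

With the `requests` library, `fetch` could be a small wrapper around `get` that returns `(response.status_code, response.text)`.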

In [44]:
list_source = []
i = 0

for index, row in df_result.iterrows():
    link = row['URL']
    query = row['Query']
    # 1 minute break to avoid API overloading
    time.sleep(60)
    # URL scraping
    response = get(link)
    # Test if the request was successful
    if response.status_code == 200:
        # Addition of the scraped google results and the corresponding query
        list_source.append([response.text, query])

        i+=1
        # Saving the results every 20 queries
        if (i%20==0):
            with open('etape'+str(i)+'.json', 'w') as jsonfile:
                json.dump(list_source, jsonfile)
            files.download('etape'+str(i)+'.json')


In [45]:
list_source

[['{"search_information":{"total_results":113,"time_taken_displayed":7.14,"query_displayed":"(\\"pilotage de l\'achat\\" AND \\"augmented intelligence\\") -inurl:www.linkedin.com -inurl:fr.linkedin.com -inurl:www.linguee.com -inurl:www.researchgate.net -inurl:docplayer.org -inurl:docplayer.net -inurl:docplayer.fr -inurl:books.google.com -inurl:books.google.fr -inurl:www.cairn.info"},"ads":[],"knowledge_graph":{"title":"","description":"","source":{"name":""},"related":[],"related_link":null,"social_media":[],"see_more_about":[]},"related_questions":[],"answer_box":null,"organic_results":[{"position":1,"title":"La révolution de l\'intelligence - Préparer l\'avenir de ... - Deloitte","snippet":"apporté son lot de richesse et d\'innovation; elle est à l\'origine de ... automatiquement les achats au compte des ... Analyste de transport à pilotage ... ://dupress.deloitte.com/dup-us-en/deloitte-review/issue-20/augmented-intelligence-human-.","link":"https://www2.deloitte.com/content/dam/Delo

In [46]:
# Crawling end time (clock shifted by +1 hour)
print((datetime.datetime.now()+datetime.timedelta(hours=1)).strftime("%H:%M:%S"))

19:03:05


**Extracting the information we want to keep:**
- Article links
- Article titles
- Article summaries
- Article positions
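
Since the crawled responses are JSON, the four fields listed above can also be read by parsing the text with `json.loads`. This avoids a weakness of patterns that stop at the first comma, which cut short any title or summary containing one. The sketch below is an alternative under assumptions, not the notebook's method: the function `extract_results` and the sample response are hypothetical, while the field names (`organic_results`, `position`, `title`, `snippet`, `link`) are those visible in the crawled output.

```python
import json


def extract_results(raw_json, query):
    """Return one dict per organic result found in an autoparsed response."""
    data = json.loads(raw_json)
    results = data.get("organic_results", [])
    rows = []
    for rank, result in enumerate(results, start=1):
        position = result.get("position", rank)
        rows.append({
            "URL": result.get("link", ""),
            "Query": query,
            "title": result.get("title", ""),
            "resume": result.get("snippet", ""),
            "position": position,
            # Rank-based score: 1 - position / number of results for the query
            "score": 1 - position / len(results),
        })
    return rows


# Hypothetical response reduced to the fields used here
sample = json.dumps({"organic_results": [
    {"position": 1, "title": "Budget, finance", "snippet": "Nov 18, 2019 ...",
     "link": "https://example.org/a"},
    {"position": 2, "title": "Innovation", "snippet": "Texte",
     "link": "https://example.org/b"},
]})
rows = extract_results(sample, "budget AND innovation")
print(rows[0]["title"], rows[0]["score"])
```

The list of dicts can be passed directly to `pd.DataFrame(rows)` to obtain the same columns as `df_sources`.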

In [47]:
#Creation of lists that will contain the data we will extract
links = []
title = []
query = []
resume = []
position = []

#The general pattern to extract blocks of information
pattern = re.compile("\"title\"[^}]+")
#The pattern that extracts url from the articles
pattern_link = re.compile("\"link\"[^,]+")
#The pattern that extracts the article titles
pattern_title = re.compile("\"title\"[^,]+")
#The pattern that extracts the summary of the articles
pattern_resume = re.compile("\"snippet\"[^,]+")
#The pattern that extracts the position of the article in the Google search page
pattern_pos = re.compile("\"position\"[^,]+")

#Creation of the output dataframe which will contain for each article its url, keywords, title, its position in the google page and its relevance score
df_sources = pd.DataFrame(columns=['URL','Query','title','resume','position','score'])


for source in list_source:
    #Extraction of blocks from the variable source[0] which contains the source code crawled by the Google API crawler
    bloc = pattern.findall(str(source[0]))
    #Storing these blocks in a primary list
    my_bloc = list([x for x in bloc])
    #Processing blocks to obtain the required information
    for i in my_bloc:
      #Extracting the link from the block
      link = pattern_link.findall(i)
      #Extracting the title from the block
      titre = pattern_title.findall(i)
      #Extracting the summary from the block
      snippet = pattern_resume.findall(i)
      #Skip blocks where a field is missing so that the lists stay aligned
      if not (link and titre and snippet):
        continue
      #Remove all special characters that are not needed
      links.append(link[0][8:-1])
      title.append(titre[0][9:-1])
      query.append(source[1])
      resume.append(snippet[0][11:])
    #Extracting the position 
    pos = pattern_pos.findall(str(source[0]))
    #Storage of the position in the first list
    my_position = list([x for x in pos])
    for i in my_position :
      #Deletion of additional information
      position.append(int(i[11:]))


#Implementation of the dataframe
df_sources['URL'] = links
df_sources['Query'] = query
df_sources['title'] = title
df_sources['resume'] = resume
df_sources['position'] = position
#Calculation of relevance score of each link from its rank (position) in the Google search page,
#relative to the number of results returned for the same query
df_sources['score'] = 1 - df_sources['position']/df_sources.groupby('Query')['position'].transform('size')

df_sources

Unnamed: 0,URL,Query,title,resume,position,score
0,https://www2.deloitte.com/content/dam/Deloitte...,"(""pilotage de l'achat"" AND ""augmented intellig...",La révolution de l'intelligence - Préparer l'a...,apporté son lot de richesse et d'innovation; e...,1,0.95
1,https://www.oezratty.net/wordpress/wp-content/...,"(""pilotage de l'achat"" AND ""augmented intellig...",Les usages de l'intelligence artificielle - Ol...,Nov 18,2,0.9
2,https://www.ebg.net/publications/pdf/100063.pdf,"(""pilotage de l'achat"" AND ""augmented intellig...",Banques / Assurances - l'EBG,2// Banques / Assurances : Comment la data lea...,3,0.85
3,https://www.people-doc.fr/transformation-digit...,"(""pilotage de l'achat"" AND ""augmented intellig...",Définitions RH - Transformation digitale et Fu...,Toutes les définitions RH du Future of Work su...,4,0.8
4,https://espas.secure.europarl.europa.eu/orbis/...,"(""pilotage de l'achat"" AND ""augmented intellig...",Données médicales - European Strategy and Poli...,B. Poursuivre la consolidation de l'infrastruc...,5,0.75
5,https://www.aijobs.tech/fr/j/t/emplois-chef-de...,"(""pilotage de l'achat"" AND ""augmented intellig...",Emplois : Chef De Projet - janvier 2021 - AI J...,A la recherche d'un emploi : Chef De Projet ? ...,6,0.7
6,https://fr.indeed.com/Emplois-Business-Intelli...,"(""pilotage de l'achat"" AND ""augmented intellig...",Emplois : Business Intelligence Manager - janv...,Vous êtes à la recherche d'un emploi : Busines...,7,0.65
7,https://www.tableau.com/fr-fr/partners/search?...,"(""pilotage de l'achat"" AND ""augmented intellig...",Trouver un partenaire - Tableau Software,DATA SEMANTICS has the ability to work with Mu...,8,0.6
8,https://www.tableau.com/fr-fr/partners/allianc...,"(""pilotage de l'achat"" AND ""augmented intellig...",Trouver un partenaire | Tableau Software,DECIDEOM est expert dans le domaine de l'infor...,9,0.55
9,http://lacgt44.fr/IMG/pdf/20180214_numerique_l...,"(""pilotage de l'achat"" AND ""augmented intellig...",20180214_numerique_livret_v2_sr_rc_ca_v3_1_.pd...,18. V. DE L'UTILITÉ DU SYNDICALISME CGT DANS L...,10,0.5


**Storing the output dataframe in a json file**

In [48]:
#Storing the dataframe in a Json file
df_sources.to_json("df_sources.json")

In [49]:
  #This section presents an alternative method of determining the popularity of each couple.
  #the popularity calculation is based on the number of google results 
  
  """
def popCouple(couples = listCouple ,nbreCouples = 3) :
  random.seed(41)
  listcouples =random.sample(couples,nbreCouples)
  listeAJeter = [i for i in couples if i not in listcouples ]
  list_source =[]
  pattern_results = re.compile("\"total_results\"[^,]+")
  
  
  
  df = (list(GoogleSpider().start_requests(listCouple=listcouples,length=p_length,requestNumber=nbreCouples)))
  df = df[0]  
  link = df['URL']
  query = df['Query']
  cpt = 0
  for liens in link :
    print(liens)
    print(query[cpt])
    # URL scraping
    response = get(liens)
    print(response.text)
    # Test if the request was successful
    if response.status_code == 200:
        # Addition of the scraped google results and the corresponding query
        total_results = pattern_results.findall(response.text)
        total_results = list([x for x in total_results])
        print(total_results)
        list_source.append([response.text, query])
    cpt = cpt+1
    time.sleep(60)


    

  
  return listcouples,listeAJeter
a = popCouple()
print(len(a[0]))
print(len(a[1]))
"""


#Transition between the second and third steps

In [50]:
#Case where we reuse the data from the second step
df_work = df_sources
df_work
#-----------------------------------------------#
#Case where we import json data without the words column

#df_work = pd.read_json('df_crawling_clean.json',orient='split')
#df_work = df_work.rename(columns={"Title": "title", "Snippet": 'resume'})
#df_work = df_work.iloc[:10000]

Unnamed: 0,URL,Query,title,resume,position,score
0,https://www2.deloitte.com/content/dam/Deloitte...,"(""pilotage de l'achat"" AND ""augmented intellig...",La révolution de l'intelligence - Préparer l'a...,apporté son lot de richesse et d'innovation; e...,1,0.95
1,https://www.oezratty.net/wordpress/wp-content/...,"(""pilotage de l'achat"" AND ""augmented intellig...",Les usages de l'intelligence artificielle - Ol...,Nov 18,2,0.9
2,https://www.ebg.net/publications/pdf/100063.pdf,"(""pilotage de l'achat"" AND ""augmented intellig...",Banques / Assurances - l'EBG,2// Banques / Assurances : Comment la data lea...,3,0.85
3,https://www.people-doc.fr/transformation-digit...,"(""pilotage de l'achat"" AND ""augmented intellig...",Définitions RH - Transformation digitale et Fu...,Toutes les définitions RH du Future of Work su...,4,0.8
4,https://espas.secure.europarl.europa.eu/orbis/...,"(""pilotage de l'achat"" AND ""augmented intellig...",Données médicales - European Strategy and Poli...,B. Poursuivre la consolidation de l'infrastruc...,5,0.75
5,https://www.aijobs.tech/fr/j/t/emplois-chef-de...,"(""pilotage de l'achat"" AND ""augmented intellig...",Emplois : Chef De Projet - janvier 2021 - AI J...,A la recherche d'un emploi : Chef De Projet ? ...,6,0.7
6,https://fr.indeed.com/Emplois-Business-Intelli...,"(""pilotage de l'achat"" AND ""augmented intellig...",Emplois : Business Intelligence Manager - janv...,Vous êtes à la recherche d'un emploi : Busines...,7,0.65
7,https://www.tableau.com/fr-fr/partners/search?...,"(""pilotage de l'achat"" AND ""augmented intellig...",Trouver un partenaire - Tableau Software,DATA SEMANTICS has the ability to work with Mu...,8,0.6
8,https://www.tableau.com/fr-fr/partners/allianc...,"(""pilotage de l'achat"" AND ""augmented intellig...",Trouver un partenaire | Tableau Software,DECIDEOM est expert dans le domaine de l'infor...,9,0.55
9,http://lacgt44.fr/IMG/pdf/20180214_numerique_l...,"(""pilotage de l'achat"" AND ""augmented intellig...",20180214_numerique_livret_v2_sr_rc_ca_v3_1_.pd...,18. V. DE L'UTILITÉ DU SYNDICALISME CGT DANS L...,10,0.5


In [51]:
#words cleaning
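#The stemmer is created once instead of at every call
stemmer = SnowballStemmer('french')

def cleandesc(desc):
    """ 
    Text cleaning: lower case, non-alphabetic characters replaced by spaces,
    French stop words removed and remaining words stemmed

    Parameter :
        desc : text we want to clean up  
    
    Out :
        sent : cleaned text
    """
    #Lower case, every non-alphabetic character becomes a space
    sent = "".join([x.lower() if x.isalpha()  else " " for x in desc])
    #Remove stop words and stem the remaining words
    sent = " ".join([stemmer.stem(x) for x in sent.split() if x not in stop_words])
    sent = " ".join(sent.split())
    
    return sent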


- About 60 iterations per second for the computation
- The function takes about 28 minutes for a dataframe of 28,000 rows

In [52]:
def lexique_finder_row(row,lexique_gestion,lexique_innovation): 
    """Documentation
    This function is taken from @Corentin PM code
    """
    #handle derived forms (plural, feminine, compound words, etc.)
    dico={}  #{word: [gestion or innovation, number of occurrences in the title, number of occurrences in the summary]}
    for mot_gestion in (lexique_gestion) :
      #We search management words on resume and title 

      mot=cleandesc(mot_gestion)
      nresume=row['resume'].count(mot+' ')
      ntitre=row['title'].count(mot+' ')
      dico[mot_gestion]=[0,ntitre,nresume]
    for mot_innovation in lexique_innovation :
      #We search innovation words on resume and title 
      mot=cleandesc(mot_innovation)
      nresume=row['resume'].count(mot+' ')
      ntitre=row['title'].count(mot+' ')
      dico[mot_innovation]=[0,ntitre,nresume]
    final_df=pd.DataFrame.from_dict(dico, orient='index', columns=['Gestion/Innovation', 'nb_in_title', 'nb_in_resume'])
    return final_df

def lexique_finder_google(df,lexique_gestion,lexique_innovation): 
    """Documentation
    This function is taken from @Corentin PM code
    """
    #list coll
    list_df=[]
    for index, row in (df.iterrows()):
      
      list_df.append(lexique_finder_row(row,lexique_gestion,lexique_innovation))
    return list_df

df_work['title']= [cleandesc(x.title) for x in (df_work.itertuples())]
df_work['resume']= [cleandesc(x.resume) for x in df_work.itertuples()]

listeDf = lexique_finder_google(df_work,gestion,innovation) 

listeMots = []
for i in listeDf :
  listeMots.append(list(i[(i['nb_in_title']>0) | (i['nb_in_resume']>0 )].index))


df_work['mots'] = listeMots
#listeMots


In [53]:
listeMots

[['M14',
  'M832',
  'M22',
  'pilotage',
  'innovation',
  '3d',
  'ar',
  'augmented intelligence',
  '5g',
  '4.0'],
 ['M14', 'M832', 'M22', 'intelligence artificielle', '3d', '5g', '4.0'],
 ['M14',
  'M4X',
  'M832',
  'M22',
  'pilotage',
  'activités',
  '3d',
  'augmented intelligence',
  '5g',
  '4.0'],
 ['M14', 'M832', 'M22', 'RH', '3d', '5g', '4.0', 'du futur'],
 ['Administration',
  'M14',
  'M832',
  'M22',
  'pilotage',
  '3d',
  'vr',
  'augmented intelligence',
  '5g',
  '4.0'],
 ['M14', 'M832', 'M22', 'emplois', '3d', '4.0'],
 ['M14', 'M832', 'M22', 'emplois', '3d', '5g', '4.0'],
 ['M14', 'M4X', 'M832', 'M22', '3d', '5g', '4.0'],
 ['M14', 'M832', 'M22', 'formation', '3d', '4.0'],
 ['M14', 'M832', 'M22', 'iot', '3d', '4.0'],
 ['M14', 'M832', 'M22', '3d', '4.0'],
 ['M14', 'M832', 'M22', '3d', '4.0'],
 ['M14', 'M832', 'M22', '3d', '5g', '4.0'],
 ['M14', 'M832', 'M22', '3d', 'ar', 'big data', 'big-data', '5g', '4.0'],
 ['M14', 'M832', 'M22', '3d', '4.0'],
 ['M14', 'M832', '
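
In the output above, entries such as 'M14', 'M832', 'M22', '3d' and '4.0' appear in every row. One plausible explanation is that `cleandesc` turns them into an empty string (the digits are dropped and the remaining single letter is a stop word), so the search `count(mot+' ')` looks for a single space and succeeds on any text. The sketch below shows one possible way to guard against this, by cleaning each lexicon entry once, skipping the entries that become empty, and matching whole words only. The function `clean` is a simplified stand-in for `cleandesc` (no stemming, tiny stop word set), and all names are assumptions made for the example.

```python
import re

# Tiny stand-in for the NLTK French stop word list (assumption for this example)
STOP_WORDS = {"de", "la", "le", "d", "l", "m"}


def clean(text):
    """Simplified stand-in for cleandesc: lowercase, keep letters, drop stop words."""
    letters_only = "".join(c.lower() if c.isalpha() else " " for c in text)
    return " ".join(w for w in letters_only.split() if w not in STOP_WORDS)


def build_matchers(lexicon):
    """Clean each lexicon entry once and compile a whole-word pattern for it."""
    matchers = {}
    for term in lexicon:
        cleaned = clean(term)
        if not cleaned:
            # Entries such as 'M14' or '3d' clean to '' and would match any text
            continue
        matchers[term] = re.compile(r"(?<!\w)" + re.escape(cleaned) + r"(?!\w)")
    return matchers


def find_terms(cleaned_text, matchers):
    """Return the lexicon entries found as whole words in an already cleaned text."""
    return [term for term, pattern in matchers.items() if pattern.search(cleaned_text)]


matchers = build_matchers(["M14", "3d", "pilotage", "big data"])
print(find_terms(clean("Le pilotage du big data en mairie"), matchers))
```

Building the patterns once, outside the loop over rows, also avoids cleaning the whole lexicon again for every article. Entries made of letters and digits would still need a cleaning step that keeps digits in order to be detected at all, which this sketch does not attempt.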

#Transition between the third and fourth steps

In [54]:
#-----------------------------------------#
#Case where the data is imported from a csv or json file: the df must contain a 'mots' column holding the list of words found in the titles and summaries

#df_work = pd.read_csv('df7000.csv') 

**Implementation of the function that computes the Cartesian product of the lexicon words**
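
For reference, the same Cartesian product can be written with the standard library. The sketch below is an alternative formulation, not the function used in this notebook: `cartesian_pairs` is a hypothetical name, and the sample words are only for illustration.

```python
from itertools import product


def cartesian_pairs(management_terms, innovation_terms):
    """Return every 'management innovation' pair as a single string."""
    # dict.fromkeys removes duplicates while keeping the original order
    management_unique = dict.fromkeys(management_terms)
    innovation_unique = dict.fromkeys(innovation_terms)
    return [f"{g} {i}" for g, i in product(management_unique, innovation_unique)]


pairs = cartesian_pairs(["Budget", "Mairie", "Budget"], ["5g", "iot"])
print(pairs)
```

Unlike `set`, `dict.fromkeys` keeps the entries in their original order, so the list of pairs is identical from one run to the next.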

In [55]:
def produitCartesien(liste1: list,liste2: list) -> list :
    """Documentation
    Parameters:
      liste1: list of strings containing words for queries
      liste2: same.

    Out:
      listeRetour: list containing cartesian product of arguments
    """
    listeRetour: list = []
    for i in liste1 :
      for j in liste2:
        listeRetour.append([i,j])
    return listeRetour

listCouples = produitCartesien(list(set(gestion)),innovation)
#change list of 2words into a string separated with a space
vraiListeCouples = [' '.join(i) for i in listCouples] 

**Implementation of the function that computes:**
- The occurrence of each keyword pair in the title and the summary
- The evolution of these pairs 
- The relevance score of these pairs
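
The three quantities listed above can be illustrated on a toy example. The sketch below follows the rules used in `popularite_motscles` (percentage change for the evolution, +100 for a pair that did not appear before, and a score equal to the mean of the normalized occurrence and the evolution), but it works on plain Python structures. The function names and the sample data are assumptions made for the example.

```python
from collections import Counter


def count_pairs(rows):
    """Count in how many rows each (management, innovation) pair co-occurs."""
    counts = Counter()
    for management_words, innovation_words in rows:
        for g in management_words:
            for i in innovation_words:
                counts[f"{g} {i}"] += 1
    return counts


def evolution_pct(current, previous):
    """Percentage change, with the special cases used when the old count is 0."""
    if previous != 0:
        return (current / previous - 1) * 100
    return 100 if current != 0 else 0


def scores(current_counts, previous_counts):
    """Score = mean of the occurrence (in % of the maximum) and the evolution."""
    highest = max(current_counts.values())
    return {
        pair: (n / highest * 100 + evolution_pct(n, previous_counts.get(pair, 0))) / 2
        for pair, n in current_counts.items()
    }


# Hypothetical data: words found in each article, and counts from an earlier run
rows = [(["budget"], ["5g", "iot"]), (["budget", "mairie"], ["5g"])]
current = count_pairs(rows)
previous = {"budget 5g": 1, "budget iot": 2}
result = scores(current, previous)
print(result)
```

Here "budget 5g" doubles (evolution +100) and has the highest occurrence, so it scores 100. "budget iot" is halved (evolution -50) with half the maximum occurrence, so it scores 0. "mairie 5g" is new, so it scores 75.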

In [56]:
def popularite_motscles(data: pd.DataFrame,dataAnterieur=False,
                        lexique_gestion: list = gestion,
                        lexique_innovation: list = innovation,
                        listCouple: list = vraiListeCouples) -> pd.DataFrame :
    """ Documentation
      Calculation of popularity for each keyword 

      Parameter :
          data : dataframe containing the keywords to analyze 
          dataAnterieur = old dataframe containing the data to be compared 
          lexique_gestion = management lexicon 
          lexique_innovation = innovation lexixon 
          listCouple = list of keyword pairs 
      
      Out :
          dfPopularite : dataframe containing the list of keywords, 
          their occurrence and their evolution
    """

    dfPopularite = pd.DataFrame(
        columns =['couples','occurence','evolution','score']
        )
    dfPopularite['couples'] = listCouple
    dfPopularite.index = dfPopularite['couples']
    motsgestions: list = []
    motsinnovations: list = []
    couples: list = []
    # separation of the words management and innovation 
    for i in range((data.shape[0])):
      mots = data['mots'].iloc[i]
      #When the dataframe was loaded from a file, the list is stored as a string
      if not isinstance(mots, list):
        mots = ast.literal_eval(mots)
      motsgestions.append([j for j in mots if j in lexique_gestion])
      motsinnovations.append([j for j in mots if j in lexique_innovation])
  
        
    #creation of colums
    data['mots innovation'] = motsinnovations
    data['mots gestion'] = motsgestions
    
    #generation of couples 
    for i in range(data.shape[0]) :
      couples.append(
          produitCartesien(data['mots gestion'].iloc[i],
                           data['mots innovation'].iloc[i])
          )
    data['couples'] = couples
    dictOccurences: dict = {}
    for ligne in range(data.shape[0]) :
      for couple in data['couples'].iloc[ligne] :
        if ' '.join(couple) in dictOccurences.keys() :
          dictOccurences[' '.join(couple)] +=  1
        else :
          dictOccurences[' '.join(couple)] =  1
    
    for couple, nb in dictOccurences.items() :
      #.loc[row, column] writes into dfPopularite itself rather than into a temporary copy
      dfPopularite.loc[couple, 'occurence'] = nb
    dfPopularite['occurence'] = dfPopularite['occurence'].fillna(0)
    #Calculation of evolution (percentage change relative to the previous dataframe)
    if type(dataAnterieur) != bool:
      for i in (dfPopularite.index) :
        try :
          ancien = dataAnterieur.loc[i, 'occurence']
          nouveau = dfPopularite.loc[i, 'occurence']
          if ancien != 0:
            dfPopularite.loc[i, 'evolution'] = (nouveau / ancien - 1) * 100
          elif nouveau != 0 :
            #Couple that did not appear before
            dfPopularite.loc[i, 'evolution'] = 100
          else :
            dfPopularite.loc[i, 'evolution'] = 0
        except (KeyError, ValueError) :
          #Couple missing from, or duplicated in, the previous dataframe
          print(i)

    dfPopularite['score'] = ((dfPopularite['occurence']/dfPopularite['occurence'].max())*100 + dfPopularite['evolution'])/2

    return dfPopularite.sort_values('occurence',ascending=False)
anciendf = popularite_motscles(df_work)  #Case with no history
#nouveaudf = popularite_motscles(dfTest2,anciendf) #Case where a history is available

In [57]:
anciendf.head(40).sort_values('occurence',ascending=False)


couples (index),couples,occurence,evolution,score
M22 4.0,M22 4.0,20,,
M14 3d,M14 3d,20,,
M832 4.0,M832 4.0,20,,
M22 3d,M22 3d,20,,
M832 3d,M832 3d,20,,
M14 4.0,M14 4.0,20,,
M832 5g,M832 5g,10,,
M22 5g,M22 5g,10,,
M14 5g,M14 5g,10,,
pilotage 5g,pilotage 5g,4,,


Implementation of a machine learning algorithm to categorize the different pairs

In [73]:
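from sklearn.cluster import KMeans
#Cluster the popularity dataframe computed above (df only holds the raw list of couples)
#'evolution' is NaN when no history is available, so missing values are replaced by 0
X = anciendf[['occurence','evolution']].fillna(0).astype(float)



kmeans = KMeans(n_clusters=3).fit(X)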