<h1>EMPLOYEE EXPERIENCE ANALYSIS IN LABOR SOCIAL MEDIA</h1>
<h2>APPLYING NLP TECHNIQUES</h2>



# **Data Source: Glassdoor**

<p>The data on employees' opinions about their experience and job satisfaction at companies across Spain will be obtained from the 'Glassdoor' website, which was originally dedicated to company reviews worldwide and also works as a job portal.</p>

<a href='https://www.glassdoor.es/index.htm'>link</a>

<img src='https://www.glassdoor.com/about-us//app/uploads/sites/2/2018/06/Group-7.png'>





To see how reviews are displayed on Glassdoor, I take one of Amazon's reviews as an example.

This review comes from an employee with a Data Analyst role, based in Madrid, and is dated 2019:
<img src= 'Images/Screenshot_2021-03-14 Amazon.png'>



# **Getting the data: Web Scraping**

To get the data from the Glassdoor website, web scraping is necessary. I will use five Python libraries for that:

* Selenium: to automate browser access to the website
* Requests: to send HTTP requests to the different pages
* Beautiful Soup: to parse the downloaded HTML content into an object tree
* Pandas: to create tables (dataframes) with the data obtained
* NumPy: to provide the numerical array support that Pandas builds on

### Importing Data Science libraries:

In [2]:
import numpy as np
import pandas as pd

### Installing and importing Selenium library:

The steps I will take are:
1. Install Selenium
2. Download the Firefox driver ('geckodriver') from https://github.com/mozilla/geckodriver/releases
(saving it to the local disk and remembering the folder where it is)
3. Import the required libraries for the scraping

In [None]:
!pip install selenium

In [3]:
from selenium import webdriver
#from selenium.common.exceptions import WebDriverException
from selenium.webdriver.common.keys import Keys
import time

### Importing Beautiful Soup & Requests libraries:

In [4]:
from bs4 import BeautifulSoup
import requests
from datetime import datetime

### Web Scraping:

First, I will check what content I can get with a quick first scrape of Glassdoor:

In [10]:
#check if we capture content in a first scraping
url = "https://www.glassdoor.es/Explorar/buscar-empresas.htm?overall_rating_low=0&page=1&isHiringSurge=0&locId=2664239&locType=C&locName=Madrid&sector=10013"
r = requests.get(url)
print(r.content)

b'<!DOCTYPE html>\n            <!DOCTYPE html>\n            <html lang="es" xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://opengraph.org/schema/">\n                <head>\n                <title>Apache Tomcat - Error report</title>\n\n                \n                \n                \n                \n\n                \n                \n\n                \n                \n\n                <link href="/explore/static/gd-companyExplorer.bundle.css" rel="stylesheet" media="all"/>\n                <script defer type="text/javascript" src="/explore/static/js/gd-vendor.bundle.js?v=e806a3ca142ba8002197"></script><script defer type="text/javascript" src="/explore/static/js/gd-companyExplorer.bundle.js?v=e806a3ca142ba8002197"></script>\n\n                \n                \n                \n            </head>\n                <body  ><h1>HTTP Status 403 - Bots not allowed</h1><div class="line"></div><p><b>type</b> Status report</p><p><b>message</b> <u>Bots not allowed<

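The content could not be retrieved: the server answered with 'HTTP Status 403 - Bots not allowed', probably because the request does not look like it comes from a browser. For that reason, I will repeat the request, this time sending the 'user-agent' header that my browser shows under 'Network -> Headers':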

In [11]:
#Glassdoor rejects the plain request, so I send a browser 'user-agent' in the headers. Having a look:

headers = {"user-agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:85.0) Gecko/20100101 Firefox/85.0"}

url = "https://www.glassdoor.es/Explorar/buscar-empresas.htm?overall_rating_low=0&page=1&isHiringSurge=0&locId=2664239&locType=C&locName=Madrid&sector=10013"
r = requests.get(url, headers=headers)
print(r.content)




Since we have to scrape a large number of consecutive pages, before scraping we can check whether all of them can be requested or whether Glassdoor is blocking any of them, by defining the following function:

In [3]:
#Checking if the request to a given page was successful, or if Glassdoor is blocking it

headers = {"user-agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:85.0) Gecko/20100101 Firefox/85.0"}

def Searchsuccess (i):
    url = ("https://www.glassdoor.es/Opiniones/IBM-Espa%C3%B1a-Opiniones-EI_IE354.0,3_IL.4,10_IN219_IP" + str(i) + ".htm?filter.iso3Language=spa")
    print(url)
    page = requests.get(url, headers=headers)
    if page.status_code==200:
        return page
    else:
        return "Error"

In [4]:
Searchsuccess (100)

https://www.glassdoor.es/Opiniones/IBM-Espa%C3%B1a-Opiniones-EI_IE354.0,3_IL.4,10_IN219_IP100.htm?filter.iso3Language=spa


<Response [200]>

I can see that it works now, so I will request those pages, process the HTML through a 'Beautiful Soup' object, and create dataframes saved as csv files with all the data captured.
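The single-page check can be extended to a whole range of pages. The following is only a minimal sketch: `find_blocked_pages` and `fake_status` are hypothetical names that do not appear in the notebook, and the status function is passed in as a parameter so that the example runs offline with made-up responses.

```python
from typing import Callable, Dict, Iterable

# URL pattern copied from Searchsuccess; only the page number changes.
BASE_URL = ("https://www.glassdoor.es/Opiniones/IBM-Espa%C3%B1a-Opiniones-"
            "EI_IE354.0,3_IL.4,10_IN219_IP{page}.htm?filter.iso3Language=spa")


def find_blocked_pages(pages: Iterable[int],
                       get_status: Callable[[str], int]) -> Dict[int, int]:
    """Return {page_number: status_code} for every page that did not answer 200."""
    blocked = {}
    for page in pages:
        status = get_status(BASE_URL.format(page=page))
        if status != 200:
            blocked[page] = status
    return blocked


# Offline demonstration with a fake status function (made up: page 3 is blocked).
def fake_status(url: str) -> int:
    return 403 if "_IP3.htm" in url else 200


blocked = find_blocked_pages(range(1, 6), fake_status)
print(blocked)  # {3: 403}
```

Against the real site, the status function could be something like `lambda url: requests.get(url, headers=headers).status_code`, ideally with a pause between requests.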


Looking at the Glassdoor website, there is one review page per company, and it is impossible to filter the reviews by city or zone across all companies. That is, there are filters only for country and city, for language, and for employment status or type of contract (which I do not need), but I cannot select the reviews of all the companies I am interested in at once.

The scraping phase is therefore going to be more complicated than I thought, because I will need to scrape the reviews of each company separately, putting each company's data into a different dataframe and finally gathering all this data in a single .csv file.

For example, I will have one dataframe per company, with names such as 'IBM_reviews', 'Accenture_reviews', 'Amadeus_reviews', 'Adevinta_reviews', etc. Finally, I will build a single dataframe by appending all the companies' reviews and create a unified .csv file.
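The final unification step could look roughly like the sketch below. It uses `pd.concat` rather than `DataFrame.append`, because `append` was removed in pandas 2.0. The two one-row dataframes are made-up placeholders standing in for the scraped tables, and `all_reviews` is a hypothetical name.

```python
import pandas as pd

col = ['Company', 'Date', 'Title', 'Rating', 'Role', 'Location', 'Pros', 'Contras', 'Language']

# Made-up one-row tables standing in for the scraped per-company dataframes.
IBM_reviews = pd.DataFrame(
    [["IBM", "NaN", "NaN", "NaN", "NaN", "NaN", "NaN", "NaN", "Español"]], columns=col)
Accenture_reviews = pd.DataFrame(
    [["Accenture", "NaN", "NaN", "NaN", "NaN", "NaN", "NaN", "NaN", "English"]], columns=col)

# Stack the per-company tables and renumber the rows from 0.
all_reviews = pd.concat([IBM_reviews, Accenture_reviews], ignore_index=True)

# With no path argument, to_csv returns the CSV text; pass a filename to write to disk.
csv_text = all_reviews.to_csv(index=False)
print(csv_text)
```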

The features/columns of all these dataframes will be the same:
* 'Company'
* 'Date'
* 'Title'
* 'Rating'
* 'Role'
* 'Location'
* 'Pros'
* 'Contras'
* 'Language'

Each of these company dataframes needs to be built from several web pages that share the same page structure. For this reason, the steps I will take are:
1. Collect the URLs
2. Loop over the list of URLs
3. Select the data to extract
4. Run the task to get the data
5. Create separate dataframes for the Spanish and English reviews with the data captured, to later unify them
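Steps 1 and 2 can be sketched as a small helper that builds the list of page URLs from a review count. This is an illustration, not the notebook's own code: `build_review_urls` is a hypothetical name, and the value of 10 reviews per page is an assumption inferred from the review counts and page ranges used in the scraping cells (for example, 205 Spanish IBM reviews over 21 pages).

```python
import math

REVIEWS_PER_PAGE = 10  # assumption inferred from the counts and page ranges in this notebook


def build_review_urls(base_url: str, total_reviews: int, language: str) -> list:
    """Build one URL per result page for a company and a language filter."""
    n_pages = math.ceil(total_reviews / REVIEWS_PER_PAGE)
    return [f"{base_url}_IP{page}.htm?filter.iso3Language={language}"
            for page in range(1, n_pages + 1)]


ibm_base = "https://www.glassdoor.es/Opiniones/IBM-Espa%C3%B1a-Opiniones-EI_IE354.0,3_IL.4,10_IN219"
ibm_spa_urls = build_review_urls(ibm_base, 205, "spa")  # 205 Spanish reviews
ibm_eng_urls = build_review_urls(ibm_base, 150, "eng")  # 150 English reviews

print(len(ibm_spa_urls), len(ibm_eng_urls))
print(ibm_spa_urls[0])
```

Deriving the page count from the review count avoids hard-coding a different `range` for each company and language.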

In [32]:
headers = {"user-agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:85.0) Gecko/20100101 Firefox/85.0"}
col = ['Company','Date','Title','Rating','Role','Location','Pros','Contras', 'Language']

In [16]:
#Calling to webdriver

driver = webdriver.Firefox(executable_path = "/home/dsc/Selenium/Firefox_driver/geckodriver")

driver.get("https://www.glassdoor.es/Opiniones/IBM-Espa%C3%B1a-Opiniones-EI_IE354.0,3_IL.4,10_IN219.htm?filter.iso3Language=spa")

time.sleep(5)  #wait 5 seconds for the page to load

cookies_button = driver.find_element_by_id("onetrust-accept-btn-handler").click() 

log_in = driver.find_element_by_class_name("sign-in").click()
username = driver.find_element_by_name("username").send_keys("micasbot@gmail.com")
userpwd = driver.find_element_by_name("password").send_keys("Fofito88")
login_enter = driver.find_element_by_name("submit").click()

#Scraping IBM Spain: Spanish reviews (205 in total) + English reviews (150 in total)
#Creating a dataframe for all IBM reviews across Spain

IBM_reviews = pd.DataFrame(columns = col)

#Iterate over the page numbers to build each URL and then scrape the pages one by one, in order

#Spanish reviews:
IBMrev_esp = pd.DataFrame(columns = col)
url_list1 = []
for i in range(1, 22):
    #The goal is to iterate over every page to access all of its reviews
    url = ("https://www.glassdoor.es/Opiniones/IBM-Espa%C3%B1a-Opiniones-EI_IE354.0,3_IL.4,10_IN219_IP" + str(i) + ".htm?filter.iso3Language=spa")
    page = requests.get(url, headers=headers)  #request to the website
    soup = BeautifulSoup(page.content, "html.parser")  #parse the downloaded content
    IBMrev_esp_containers = soup.find_all("div", {"class": "gdReview"})   #collect the review containers from the object tree
    
    #Run the queries for each review:
    for container in IBMrev_esp_containers:
        
        com = "IBM"
        
        Date = container.find_all("time", {"class": "date subtle small"})
        if len(Date) != 0:
          da = Date[0].text.strip()
        else:
          da = "NaN"

        Title = container.find_all("h2", {"class": "h2 summary strong mb-xsm mt-0"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        except:
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        
        Rating =  container.find_all("div", {"class": "v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small"})
        if len(Rating) != 0:
          rat = Rating[0].text.strip()
        else:
          rat = "NaN"  
    
        Role = container.find_all("span", {"class": "authorJobTitle middle "})
        if len(Role) != 0:
          rol = Role[0].text.strip()
        else:
          rol = "NaN"
                                           
        Location = container.find_all("span", {"class": "authorLocation"})
        if len(Location) != 0:
          loc = Location[0].text.strip()
        else:
          loc = "NaN"                                   
        
        Pros = container.find_all("span", {"data-test": "pros"}, {"lang": "es-x-mtfrom-en"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        except:
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        
        Contras = container.find_all("span", {"data-test": "cons"}, {"lang": "es-x-mtfrom-en"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        except:
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        
        lan = "Español"

        data1 = pd.DataFrame([[com, da, ti, rat, rol, loc, pro, con, lan]]) 
        data1.columns = col
        IBMrev_esp = pd.concat([IBMrev_esp, data1], ignore_index=True)  #add all the data obtained to the dataframe

        
#English reviews:
IBMrev_eng = pd.DataFrame(columns = col)
url_list2 = []
for i in range(1, 16):
    url = ("https://www.glassdoor.es/Opiniones/IBM-Espa%C3%B1a-Opiniones-EI_IE354.0,3_IL.4,10_IN219_IP" + str(i) + ".htm?filter.iso3Language=eng")
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.content, "html.parser")
    IBMrev_eng_containers = soup.find_all("div", {"class": "gdReview"})
                                                  
    for container in IBMrev_eng_containers:
        
        com = "IBM"
        
        Date = container.find_all("time", {"class": "date subtle small"})
        if len(Date) != 0:
          da = Date[0].text.strip()
        else:
          da = "NaN"

        Title = container.find_all("h2", {"class": "h2 summary strong mb-xsm mt-0"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        except:
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        
        Rating =  container.find_all("div", {"class": "v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small"})
        if len(Rating) != 0:
          rat = Rating[0].text.strip()
        else:
          rat = "NaN"  
    
        Role = container.find_all("span", {"class": "authorJobTitle middle "})
        if len(Role) != 0:
          rol = Role[0].text.strip()
        else:
          rol = "NaN"
                                           
        Location = container.find_all("span", {"class": "authorLocation"})
        if len(Location) != 0:
          loc = Location[0].text.strip()
        else:
          loc = "NaN"                                   
        
        Pros = container.find_all("span", {"data-test": "pros"}, {"lang": "es-x-mtfrom-en"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        except:
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        
        Contras = container.find_all("span", {"data-test": "cons"}, {"lang": "es-x-mtfrom-en"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        except:
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        
        lan = "English"

        data2 = pd.DataFrame([[com, da, ti, rat, rol, loc, pro, con, lan]])
        data2.columns = col
        IBMrev_eng = pd.concat([IBMrev_eng, data2], ignore_index=True)
        
#Creating a dataframe for all IBM reviews across Spain
IBM_reviews = pd.concat([IBMrev_esp, IBMrev_eng], ignore_index=True)

#Creating a 'csv' file with the 'IBM_reviews' dataframe
IBM_reviews.to_csv('IBM_reviews.csv')   

driver.quit()
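The "find the tag, take its first match, otherwise use 'NaN'" pattern is repeated for every field in the cell above. As a sketch only, it could be factored into one helper. `first_text` is a hypothetical name, and the HTML fragment is made up to mimic the class names queried above; the real Glassdoor markup may differ.

```python
from bs4 import BeautifulSoup


def first_text(container, tag, attrs, default="NaN"):
    """Return the stripped text of the first matching tag, or `default` if none is found."""
    match = container.find(tag, attrs)
    return match.text.strip() if match is not None else default


# Made-up HTML fragment that mimics the class names used in the scraping cell.
html = """
<div class="gdReview">
  <time class="date subtle small"> 12 de abril de 2021 </time>
  <span class="authorLocation">Madrid</span>
</div>
"""
container = BeautifulSoup(html, "html.parser").find("div", {"class": "gdReview"})

da = first_text(container, "time", {"class": "date subtle small"})
loc = first_text(container, "span", {"class": "authorLocation"})
rol = first_text(container, "span", {"class": "authorJobTitle middle "})  # not present in the fragment
print(da, loc, rol)
```

With such a helper, each field in the loop becomes a single line, and the same function can serve every company and both languages.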

In [17]:
IBM_reviews['Pros'].head(20)

0             Skills sueldo experiencia contactos lider
1     Sueldo Proyección Beneficios Equipo Últimas te...
2               Mentalidad europea y flexibilidad total
3     Puedes preguntar información sobre cualquier c...
4     Buena cultura de empresa, americana, muy moder...
5     Marca. Research. Presencia global. Oportunidad...
6                Gran ambiente de trabajo en la empresa
7     Aprendes mucho, proyección profesional, inclusión
8                     Buen ambiente y buenos compañeros
9     - Equipo espectacular\r\n- Sueldo competitivo\...
10    Salario competitivo, trabajar para IBM y ser p...
11    Buen sueldo, grandes clientes, muchas oportuni...
12             Muy buen ámbito de trabajo, flexibilidad
13    Opción de trabajar en diferentes áreas de la c...
14               buena ambiente en general a dia de hoy
15              Salario medio en consultoría en España.
16    Estabilidad economica y laboral. muchos client...
17    Marca reconocida, posibilidad de Teletraba

In [18]:
IBM_reviews.head(), len(IBM_reviews)

(  Company                     Date  \
 0     IBM      12 de abril de 2021   
 1     IBM      12 de abril de 2021   
 2     IBM      16 de marzo de 2021   
 3     IBM  10 de diciembre de 2020   
 4     IBM    23 de febrero de 2021   
 
                                                Title Rating  \
 0                   «Gran multinacional informática»    5,0   
 1                      «Gran empresa donde trabajar»    5,0   
 2                                       «Me encanta»    5,0   
 3  «Gente talentosa, tradicional pero innovandose...    5,0   
 4                                        «Excelente»    5,0   
 
                                             Role   Location  \
 0           Exempleado - Representante Comercial     Madrid   
 1                    Empleado actual - Ingeniero     Madrid   
 2   Empleado actual - Senior Front End Developer  Barcelona   
 3                  Exempleado - Business Analyst     Madrid   
 4  Empleado actual - Digital Strategy Consultant     Madr
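The output shows that dates arrive as Spanish text ('12 de abril de 2021') and ratings use a decimal comma ('5,0'). A possible cleaning step, sketched here with the hypothetical helpers `parse_spanish_date` and `parse_rating`, converts them into proper date and float values. It assumes every date follows the 'day de month de year' pattern seen above.

```python
from datetime import date

SPANISH_MONTHS = {
    "enero": 1, "febrero": 2, "marzo": 3, "abril": 4, "mayo": 5, "junio": 6,
    "julio": 7, "agosto": 8, "septiembre": 9, "octubre": 10,
    "noviembre": 11, "diciembre": 12,
}


def parse_spanish_date(text: str) -> date:
    """Convert a string like '12 de abril de 2021' into a datetime.date."""
    # Drop the connecting word 'de' and keep day, month name and year.
    day, month_name, year = [part for part in text.strip().lower().split() if part != "de"]
    return date(int(year), SPANISH_MONTHS[month_name], int(day))


def parse_rating(text: str) -> float:
    """Convert a rating with a decimal comma, like '5,0', into a float."""
    return float(text.strip().replace(",", "."))


print(parse_spanish_date("12 de abril de 2021"))
print(parse_rating("5,0"))
```

Rows where the scraper stored the 'NaN' placeholder as the date would need to be filtered out or handled before calling `parse_spanish_date`.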

In [5]:
#Calling to webdriver

driver = webdriver.Firefox(executable_path = "/home/dsc/Selenium/Firefox_driver/geckodriver")

driver.get("https://www.glassdoor.es/Opiniones/Accenture-Espa%C3%B1a-Opiniones-EI_IE4138.0,9_IL.10,16_IN219.htm?filter.iso3Language=spa")

time.sleep(5)  #wait 5 seconds for the page to load

cookies_button = driver.find_element_by_id("onetrust-accept-btn-handler").click() 

log_in = driver.find_element_by_class_name("sign-in").click()
username = driver.find_element_by_name("username").send_keys("micasbot@gmail.com")
userpwd = driver.find_element_by_name("password").send_keys("Fofito88")
login_enter = driver.find_element_by_name("submit").click()


#Scraping Accenture Spain: Spanish reviews (594 in total) + English reviews (353 in total)
#Creating a dataframe for all Accenture reviews across Spain

Accenture_reviews = pd.DataFrame(columns = col)

#Iterate over the page numbers to build each URL and then scrape the pages one by one, in order

#Spanish reviews:
Accenturerev_esp = pd.DataFrame(columns = col)
url_list1 = []
for i in range(1, 62):
    #The goal is to iterate over every page to access all of its reviews
    url = ("https://www.glassdoor.es/Opiniones/Accenture-Espa%C3%B1a-Opiniones-EI_IE4138.0,9_IL.10,16_IN219_IP" + str(i) + ".htm?filter.iso3Language=spa")
    page = requests.get(url, headers=headers)  #request to the website
    soup = BeautifulSoup(page.content, "html.parser")  #parse the downloaded content
    Accenturerev_esp_containers = soup.find_all("div", {"class": "gdReview"})   #collect the review containers from the object tree
    
    #Run the queries for each review:
    for container in Accenturerev_esp_containers:
        
        com = "Accenture"
        
        Date = container.find_all("time", {"class": "date subtle small"})
        if len(Date) != 0:
          da = Date[0].text.strip()
        else:
          da = "NaN"

        Title = container.find_all("h2", {"class": "h2 summary strong mb-xsm mt-0"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        except:
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        
        Rating =  container.find_all("div", {"class": "v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small"})
        if len(Rating) != 0:
          rat = Rating[0].text.strip()
        else:
          rat = "NaN"  
    
        Role = container.find_all("span", {"class": "authorJobTitle middle "})
        if len(Role) != 0:
          rol = Role[0].text.strip()
        else:
          rol = "NaN"
                                           
        Location = container.find_all("span", {"class": "authorLocation"})
        if len(Location) != 0:
          loc = Location[0].text.strip()
        else:
          loc = "NaN"                                   
        
        Pros = container.find_all("span", {"data-test": "pros"}, {"lang": "es-x-mtfrom-en"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        except:
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        
        Contras = container.find_all("span", {"data-test": "cons"}, {"lang": "es-x-mtfrom-en"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        except:
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        
        lan = "Español"

        data1 = pd.DataFrame([[com, da, ti, rat, rol, loc, pro, con, lan]]) 
        data1.columns = col
        Accenturerev_esp = pd.concat([Accenturerev_esp, data1], ignore_index=True)  #add all the data obtained to the dataframe

        
#English reviews:
Accenturerev_eng = pd.DataFrame(columns = col)
url_list2 = []
for i in range(1, 37):
    url = ("https://www.glassdoor.es/Opiniones/Accenture-Espa%C3%B1a-Opiniones-EI_IE4138.0,9_IL.10,16_IN219_IP" + str(i) + ".htm?filter.iso3Language=eng")
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.content, "html.parser")
    Accenturerev_eng_containers = soup.find_all("div", {"class": "gdReview"})
                                                  
    for container in Accenturerev_eng_containers:
        
        com = "Accenture"
        
        Date = container.find_all("time", {"class": "date subtle small"})
        if len(Date) != 0:
          da = Date[0].text.strip()
        else:
          da = "NaN"

        Title = container.find_all("h2", {"class": "h2 summary strong mb-xsm mt-0"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        except:
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        
        Rating =  container.find_all("div", {"class": "v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small"})
        if len(Rating) != 0:
          rat = Rating[0].text.strip()
        else:
          rat = "NaN"  
    
        Role = container.find_all("span", {"class": "authorJobTitle middle "})
        if len(Role) != 0:
          rol = Role[0].text.strip()
        else:
          rol = "NaN"
                                           
        Location = container.find_all("span", {"class": "authorLocation"})
        if len(Location) != 0:
          loc = Location[0].text.strip()
        else:
          loc = "NaN"                                   
        
        Pros = container.find_all("span", {"data-test": "pros"}, {"lang": "es-x-mtfrom-en"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        except:
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        
        Contras = container.find_all("span", {"data-test": "cons"}, {"lang": "es-x-mtfrom-en"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        except:
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        
        lan = "English"

        data2 = pd.DataFrame([[com, da, ti, rat, rol, loc, pro, con, lan]])
        data2.columns = col
        Accenturerev_eng = pd.concat([Accenturerev_eng, data2], ignore_index=True)
        
#Creating a dataframe for all Accenture reviews across Spain
Accenture_reviews = pd.concat([Accenturerev_esp, Accenturerev_eng], ignore_index=True)

#Creating a 'csv' file with the 'Accenture_reviews' dataframe
Accenture_reviews.to_csv('Accenture_reviews.csv')

driver.quit()

In [6]:
#Calling to webdriver

driver = webdriver.Firefox(executable_path = "/home/dsc/Selenium/Firefox_driver/geckodriver")

driver.get("https://www.glassdoor.es/Opiniones/Indra-Espa%C3%B1a-Opiniones-EI_IE9757.0,5_IL.6,12_IN219.htm?filter.iso3Language=spa")

time.sleep(5)  #wait 5 seconds for the page to load

cookies_button = driver.find_element_by_id("onetrust-accept-btn-handler").click() 

log_in = driver.find_element_by_class_name("sign-in").click()
username = driver.find_element_by_name("username").send_keys("micasbot@gmail.com")
userpwd = driver.find_element_by_name("password").send_keys("Fofito88")
login_enter = driver.find_element_by_name("submit").click()


#Scraping Indra Spain: Spanish reviews (476 in total) + English reviews (208 in total)
#Creating a dataframe for all Indra reviews across Spain

Indra_reviews = pd.DataFrame(columns = col)

#Iterate over the page numbers to build each URL and then scrape the pages one by one, in order

#Spanish reviews:
Indrarev_esp = pd.DataFrame(columns = col)
url_list1 = []
for i in range(1, 49):
    #The goal is to iterate over every page to access all of its reviews
    url = ("https://www.glassdoor.es/Opiniones/Indra-Espa%C3%B1a-Opiniones-EI_IE9757.0,5_IL.6,12_IN219_IP" + str(i) + ".htm?filter.iso3Language=spa")
    page = requests.get(url, headers=headers)  #request to the website
    soup = BeautifulSoup(page.content, "html.parser")  #parse the downloaded content
    Indrarev_esp_containers = soup.find_all("div", {"class": "gdReview"})   #collect the review containers from the object tree
    
    #Run the queries for each review:
    for container in Indrarev_esp_containers:
        
        com = "Indra"
        
        Date = container.find_all("time", {"class": "date subtle small"})
        if len(Date) != 0:
          da = Date[0].text.strip()
        else:
          da = "NaN"

        Title = container.find_all("h2", {"class": "h2 summary strong mb-xsm mt-0"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        except:
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        
        Rating =  container.find_all("div", {"class": "v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small"})
        if len(Rating) != 0:
          rat = Rating[0].text.strip()
        else:
          rat = "NaN"  
    
        Role = container.find_all("span", {"class": "authorJobTitle middle "})
        if len(Role) != 0:
          rol = Role[0].text.strip()
        else:
          rol = "NaN"
                                           
        Location = container.find_all("span", {"class": "authorLocation"})
        if len(Location) != 0:
          loc = Location[0].text.strip()
        else:
          loc = "NaN"                                   
        
        Pros = container.find_all("span", {"data-test": "pros"}, {"lang": "es-x-mtfrom-en"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        except:
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        
        Contras = container.find_all("span", {"data-test": "cons"}, {"lang": "es-x-mtfrom-en"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        except:
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        
        lan = "Español"

        data1 = pd.DataFrame([[com, da, ti, rat, rol, loc, pro, con, lan]]) 
        data1.columns = col
        Indrarev_esp = pd.concat([Indrarev_esp, data1], ignore_index=True)  #add all the data obtained to the dataframe

        
#English reviews:
Indrarev_eng = pd.DataFrame(columns = col)
url_list2 = []
for i in range(1, 22):
    url = ("https://www.glassdoor.es/Opiniones/Indra-Espa%C3%B1a-Opiniones-EI_IE9757.0,5_IL.6,12_IN219_IP" + str(i) + ".htm?filter.iso3Language=eng")
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.content, "html.parser")
    Indrarev_eng_containers = soup.find_all("div", {"class": "gdReview"})
                                                  
    for container in Indrarev_eng_containers:
        
        com = "Indra"
        
        Date = container.find_all("time", {"class": "date subtle small"})
        if len(Date) != 0:
          da = Date[0].text.strip()
        else:
          da = "NaN"

        Title = container.find_all("h2", {"class": "h2 summary strong mb-xsm mt-0"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        except:
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        
        Rating =  container.find_all("div", {"class": "v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small"})
        if len(Rating) != 0:
          rat = Rating[0].text.strip()
        else:
          rat = "NaN"  
    
        Role = container.find_all("span", {"class": "authorJobTitle middle "})
        if len(Role) != 0:
          rol = Role[0].text.strip()
        else:
          rol = "NaN"
                                           
        Location = container.find_all("span", {"class": "authorLocation"})
        if len(Location) != 0:
          loc = Location[0].text.strip()
        else:
          loc = "NaN"                                   
        
        Pros = container.find_all("span", {"data-test": "pros"}, {"lang": "es-x-mtfrom-en"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        except:
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        
        Contras = container.find_all("span", {"data-test": "cons"}, {"lang": "es-x-mtfrom-en"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        except:
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        
        lan = "English"

        data2 = pd.DataFrame([[com, da, ti, rat, rol, loc, pro, con, lan]])
        data2.columns = col
        Indrarev_eng = pd.concat([Indrarev_eng, data2], ignore_index=True)
        
#Creating a dataframe for all Indra reviews across Spain
Indra_reviews = pd.concat([Indrarev_esp, Indrarev_eng], ignore_index=True)

#Creating a 'csv' file with the 'Indra_reviews' dataframe
Indra_reviews.to_csv('Indra_reviews.csv')

driver.quit()

In [7]:
#Calling to webdriver

driver = webdriver.Firefox(executable_path = "/home/dsc/Selenium/Firefox_driver/geckodriver")

driver.get("https://www.glassdoor.es/Opiniones/everis-Espa%C3%B1a-Opiniones-EI_IE141439.0,6_IL.7,13_IN219.htm?filter.iso3Language=spa")

time.sleep(5)  # wait 5 seconds for the page to load

driver.find_element_by_id("onetrust-accept-btn-handler").click()  # accept the cookie banner

# Log in. The credentials are read from environment variables (hypothetical names) rather than hardcoded.
import os
driver.find_element_by_class_name("sign-in").click()
driver.find_element_by_name("username").send_keys(os.environ["GLASSDOOR_USER"])
driver.find_element_by_name("password").send_keys(os.environ["GLASSDOOR_PASSWORD"])
driver.find_element_by_name("submit").click()


# Scraping Everis Spain: Spanish reviews (561 in total) + English reviews (203 in total)
# Creating a dataframe for all Everis reviews across Spain

Everis_reviews = pd.DataFrame(columns = col)

# Build the URL of each results page in turn so the pages can be scraped one after another

# Spanish reviews:
Everisrev_esp = pd.DataFrame(columns = col)
url_list1 = []
for i in range(1, 58):
    # Visit every results page to reach all of its reviews
    url = ("https://www.glassdoor.es/Opiniones/everis-Espa%C3%B1a-Opiniones-EI_IE141439.0,6_IL.7,13_IN219_IP" + str(i) + ".htm?filter.iso3Language=spa")
    page = requests.get(url, headers=headers)  # request the page
    soup = BeautifulSoup(page.content, "html.parser")  # parse the downloaded HTML into an object tree
    Everisrev_esp_containers = soup.find_all("div", {"class": "gdReview"})   # one container per review
    
    # Extract the fields of each review:
    for container in Everisrev_esp_containers:
        
        com = "Everis"
        
        Date = container.find_all("time", {"class": "date subtle small"})
        if len(Date) != 0:
          da = Date[0].text.strip()
        else:
          da = "NaN"

        Title = container.find_all("h2", {"class": "h2 summary strong mb-xsm mt-0"})
        try:
            # Try to expand the review in the browser; either way the title is read from the parsed page
            driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
        except Exception:
            pass
        ti = Title[0].text.strip() if len(Title) != 0 else "NaN"
        
        Rating =  container.find_all("div", {"class": "v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small"})
        if len(Rating) != 0:
          rat = Rating[0].text.strip()
        else:
          rat = "NaN"  
    
        Role = container.find_all("span", {"class": "authorJobTitle middle "})
        if len(Role) != 0:
          rol = Role[0].text.strip()
        else:
          rol = "NaN"
                                           
        Location = container.find_all("span", {"class": "authorLocation"})
        if len(Location) != 0:
          loc = Location[0].text.strip()
        else:
          loc = "NaN"                                   
        
        # find_all takes a single attrs dict; a second dict would be read as `recursive`, so the lang filter is dropped
        Pros = container.find_all("span", {"data-test": "pros"})
        try:
            driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
        except Exception:
            pass
        pro = Pros[0].text.strip() if len(Pros) != 0 else "NaN"

        Contras = container.find_all("span", {"data-test": "cons"})
        try:
            driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
        except Exception:
            pass
        con = Contras[0].text.strip() if len(Contras) != 0 else "NaN"
        
        lan = "Español"

        data1 = pd.DataFrame([[com, da, ti, rat, rol, loc, pro, con, lan]]) 
        data1.columns = col
        # Add the review's row to the dataframe (DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent)
        Everisrev_esp = pd.concat([Everisrev_esp, data1], ignore_index=True)

        
#English reviews:
Everisrev_eng = pd.DataFrame(columns = col)
url_list2 = []
for i in range(1, 22):
    url = ("https://www.glassdoor.es/Opiniones/everis-Espa%C3%B1a-Opiniones-EI_IE141439.0,6_IL.7,13_IN219_IP" + str(i) + ".htm?filter.iso3Language=eng")
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.content, "html.parser")
    Everisrev_eng_containers = soup.find_all("div", {"class": "gdReview"})
                                                  
    for container in Everisrev_eng_containers:
        
        com = "Everis"
        
        Date = container.find_all("time", {"class": "date subtle small"})
        if len(Date) != 0:
          da = Date[0].text.strip()
        else:
          da = "NaN"

        Title = container.find_all("h2", {"class": "h2 summary strong mb-xsm mt-0"})
        try:
            # Try to expand the review in the browser; either way the title is read from the parsed page
            driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
        except Exception:
            pass
        ti = Title[0].text.strip() if len(Title) != 0 else "NaN"
        
        Rating =  container.find_all("div", {"class": "v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small"})
        if len(Rating) != 0:
          rat = Rating[0].text.strip()
        else:
          rat = "NaN"  
    
        Role = container.find_all("span", {"class": "authorJobTitle middle "})
        if len(Role) != 0:
          rol = Role[0].text.strip()
        else:
          rol = "NaN"
                                           
        Location = container.find_all("span", {"class": "authorLocation"})
        if len(Location) != 0:
          loc = Location[0].text.strip()
        else:
          loc = "NaN"                                   
        
        # find_all takes a single attrs dict; a second dict would be read as `recursive`, so the lang filter is dropped
        Pros = container.find_all("span", {"data-test": "pros"})
        try:
            driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
        except Exception:
            pass
        pro = Pros[0].text.strip() if len(Pros) != 0 else "NaN"

        Contras = container.find_all("span", {"data-test": "cons"})
        try:
            driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
        except Exception:
            pass
        con = Contras[0].text.strip() if len(Contras) != 0 else "NaN"
        
        lan = "English"

        data2 = pd.DataFrame([[com, da, ti, rat, rol, loc, pro, con, lan]])
        data2.columns = col
        # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
        Everisrev_eng = pd.concat([Everisrev_eng, data2], ignore_index=True)

# Creating a dataframe for all Everis reviews across Spain
Everis_reviews = pd.concat([Everisrev_esp, Everisrev_eng], ignore_index=True)

#Creating a 'csv' file with the 'Everis_reviews' dataframe
Everis_reviews.to_csv('Everis_reviews.csv')

driver.quit()
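
The upper bounds of the page loops in the cell above (`range(1, 58)` for 561 Spanish reviews, `range(1, 22)` for 203 English ones) are consistent with ten reviews per results page. Assuming that page size, which is inferred from the numbers in the comments rather than stated by Glassdoor, the bounds and page URLs could be derived instead of typed by hand. `REVIEWS_PER_PAGE`, `page_count` and `page_urls` are hypothetical names introduced for this sketch.

```python
import math

REVIEWS_PER_PAGE = 10  # assumption inferred from the loop bounds in this notebook


def page_count(total_reviews, per_page=REVIEWS_PER_PAGE):
    """Number of results pages needed to list `total_reviews` reviews."""
    return math.ceil(total_reviews / per_page)


def page_urls(base, total_reviews, language):
    """Build the URL of every results page for one company and language."""
    return [
        base + "_IP" + str(i) + ".htm?filter.iso3Language=" + language
        for i in range(1, page_count(total_reviews) + 1)
    ]


everis_base = "https://www.glassdoor.es/Opiniones/everis-Espa%C3%B1a-Opiniones-EI_IE141439.0,6_IL.7,13_IN219"
urls = page_urls(everis_base, 561, "spa")
print(len(urls))  # number of Spanish results pages for Everis
print(urls[0])    # URL of the first page
```

If the review counts change, only the totals need updating; the loop bounds follow from them.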

In [8]:
#Calling to webdriver

driver = webdriver.Firefox(executable_path = "/home/dsc/Selenium/Firefox_driver/geckodriver")

driver.get("https://www.glassdoor.es/Opiniones/Capgemini-Espa%C3%B1a-Opiniones-EI_IE3803.0,9_IL.10,16_IN219.htm?filter.iso3Language=spa")

time.sleep(5)  # wait 5 seconds for the page to load

driver.find_element_by_id("onetrust-accept-btn-handler").click()  # accept the cookie banner

# Log in. The credentials are read from environment variables (hypothetical names) rather than hardcoded.
import os
driver.find_element_by_class_name("sign-in").click()
driver.find_element_by_name("username").send_keys(os.environ["GLASSDOOR_USER"])
driver.find_element_by_name("password").send_keys(os.environ["GLASSDOOR_PASSWORD"])
driver.find_element_by_name("submit").click()

# Scraping Capgemini Spain: Spanish reviews (200 in total) + English reviews (66 in total)
# Creating a dataframe for all Capgemini reviews across Spain

Capgemini_reviews = pd.DataFrame(columns = col)

# Build the URL of each results page in turn so the pages can be scraped one after another

# Spanish reviews:
Capgeminirev_esp = pd.DataFrame(columns = col)
url_list1 = []
for i in range(1, 21):
    # Visit every results page to reach all of its reviews
    url = ("https://www.glassdoor.es/Opiniones/Capgemini-Espa%C3%B1a-Opiniones-EI_IE3803.0,9_IL.10,16_IN219_IP" + str(i) + ".htm?filter.iso3Language=spa")
    page = requests.get(url, headers=headers)  # request the page
    soup = BeautifulSoup(page.content, "html.parser")  # parse the downloaded HTML into an object tree
    Capgeminirev_esp_containers = soup.find_all("div", {"class": "gdReview"})   # one container per review
    
    # Extract the fields of each review:
    for container in Capgeminirev_esp_containers:
        
        com = "Capgemini"
        
        Date = container.find_all("time", {"class": "date subtle small"})
        if len(Date) != 0:
          da = Date[0].text.strip()
        else:
          da = "NaN"

        Title = container.find_all("h2", {"class": "h2 summary strong mb-xsm mt-0"})
        try:
            # Try to expand the review in the browser; either way the title is read from the parsed page
            driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
        except Exception:
            pass
        ti = Title[0].text.strip() if len(Title) != 0 else "NaN"
        
        Rating =  container.find_all("div", {"class": "v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small"})
        if len(Rating) != 0:
          rat = Rating[0].text.strip()
        else:
          rat = "NaN"  
    
        Role = container.find_all("span", {"class": "authorJobTitle middle "})
        if len(Role) != 0:
          rol = Role[0].text.strip()
        else:
          rol = "NaN"
                                           
        Location = container.find_all("span", {"class": "authorLocation"})
        if len(Location) != 0:
          loc = Location[0].text.strip()
        else:
          loc = "NaN"                                   
        
        # find_all takes a single attrs dict; a second dict would be read as `recursive`, so the lang filter is dropped
        Pros = container.find_all("span", {"data-test": "pros"})
        try:
            driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
        except Exception:
            pass
        pro = Pros[0].text.strip() if len(Pros) != 0 else "NaN"

        Contras = container.find_all("span", {"data-test": "cons"})
        try:
            driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
        except Exception:
            pass
        con = Contras[0].text.strip() if len(Contras) != 0 else "NaN"
        
        lan = "Español"

        data1 = pd.DataFrame([[com, da, ti, rat, rol, loc, pro, con, lan]]) 
        data1.columns = col
        # Add the review's row to the dataframe (DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent)
        Capgeminirev_esp = pd.concat([Capgeminirev_esp, data1], ignore_index=True)

        
#English reviews:
Capgeminirev_eng = pd.DataFrame(columns = col)
url_list2 = []
for i in range(1, 8):
    url = ("https://www.glassdoor.es/Opiniones/Capgemini-Espa%C3%B1a-Opiniones-EI_IE3803.0,9_IL.10,16_IN219_IP" + str(i) + ".htm?filter.iso3Language=eng")
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.content, "html.parser")
    Capgeminirev_eng_containers = soup.find_all("div", {"class": "gdReview"})
                                                  
    for container in Capgeminirev_eng_containers:
        
        com = "Capgemini"
        
        Date = container.find_all("time", {"class": "date subtle small"})
        if len(Date) != 0:
          da = Date[0].text.strip()
        else:
          da = "NaN"

        Title = container.find_all("h2", {"class": "h2 summary strong mb-xsm mt-0"})
        try:
            # Try to expand the review in the browser; either way the title is read from the parsed page
            driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
        except Exception:
            pass
        ti = Title[0].text.strip() if len(Title) != 0 else "NaN"
        
        Rating =  container.find_all("div", {"class": "v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small"})
        if len(Rating) != 0:
          rat = Rating[0].text.strip()
        else:
          rat = "NaN"  
    
        Role = container.find_all("span", {"class": "authorJobTitle middle "})
        if len(Role) != 0:
          rol = Role[0].text.strip()
        else:
          rol = "NaN"
                                           
        Location = container.find_all("span", {"class": "authorLocation"})
        if len(Location) != 0:
          loc = Location[0].text.strip()
        else:
          loc = "NaN"                                   
        
        # find_all takes a single attrs dict; a second dict would be read as `recursive`, so the lang filter is dropped
        Pros = container.find_all("span", {"data-test": "pros"})
        try:
            driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
        except Exception:
            pass
        pro = Pros[0].text.strip() if len(Pros) != 0 else "NaN"

        Contras = container.find_all("span", {"data-test": "cons"})
        try:
            driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
        except Exception:
            pass
        con = Contras[0].text.strip() if len(Contras) != 0 else "NaN"
        
        lan = "English"

        data2 = pd.DataFrame([[com, da, ti, rat, rol, loc, pro, con, lan]])
        data2.columns = col
        # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
        Capgeminirev_eng = pd.concat([Capgeminirev_eng, data2], ignore_index=True)

# Creating a dataframe for all Capgemini reviews across Spain
Capgemini_reviews = pd.concat([Capgeminirev_esp, Capgeminirev_eng], ignore_index=True)

#Creating a 'csv' file with the 'Capgemini_reviews' dataframe
Capgemini_reviews.to_csv('Capgemini_reviews.csv')

driver.quit()
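
Each cell types the same Glassdoor account details into the login form. One way to keep them out of the notebook is to read them from the environment. The sketch below assumes two hypothetical variable names, `GLASSDOOR_USER` and `GLASSDOOR_PASSWORD`, and fails early with a clear message if either is missing; the function name `glassdoor_credentials` is likewise an illustration only.

```python
import os


def glassdoor_credentials(env=os.environ):
    """Return (username, password) read from the environment.

    GLASSDOOR_USER and GLASSDOOR_PASSWORD are hypothetical variable names
    chosen for this sketch; they are not defined by Glassdoor or Selenium.
    """
    missing = [name for name in ("GLASSDOOR_USER", "GLASSDOOR_PASSWORD") if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env["GLASSDOOR_USER"], env["GLASSDOOR_PASSWORD"]


# Demonstration with an explicit mapping instead of the real environment
user, password = glassdoor_credentials({"GLASSDOOR_USER": "someone@example.com",
                                        "GLASSDOOR_PASSWORD": "not-a-real-password"})
print(user)
```

In the login step, the two returned values would be passed to `send_keys` in place of literal strings.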

In [9]:
#Calling to webdriver

driver = webdriver.Firefox(executable_path = "/home/dsc/Selenium/Firefox_driver/geckodriver")

driver.get("https://www.glassdoor.es/Opiniones/HP-Inc.-Espa%C3%B1a-Opiniones-EI_IE1093161.0,7_IL.8,14_IN219.htm?filter.iso3Language=spa")

time.sleep(5)  # wait 5 seconds for the page to load

driver.find_element_by_id("onetrust-accept-btn-handler").click()  # accept the cookie banner

# Log in. The credentials are read from environment variables (hypothetical names) rather than hardcoded.
import os
driver.find_element_by_class_name("sign-in").click()
driver.find_element_by_name("username").send_keys(os.environ["GLASSDOOR_USER"])
driver.find_element_by_name("password").send_keys(os.environ["GLASSDOOR_PASSWORD"])
driver.find_element_by_name("submit").click()

# Scraping HP Inc Spain: Spanish reviews (187 in total) + English reviews (141 in total)
# Creating a dataframe for all HP Inc reviews across Spain

HPInc_reviews = pd.DataFrame(columns = col)

# Build the URL of each results page in turn so the pages can be scraped one after another

# Spanish reviews:
HPIncrev_esp = pd.DataFrame(columns = col)
url_list1 = []
for i in range(1, 20):
    # Visit every results page to reach all of its reviews
    url = ("https://www.glassdoor.es/Opiniones/HP-Inc.-Espa%C3%B1a-Opiniones-EI_IE1093161.0,7_IL.8,14_IN219_IP" + str(i) + ".htm?filter.iso3Language=spa")
    page = requests.get(url, headers=headers)  # request the page
    soup = BeautifulSoup(page.content, "html.parser")  # parse the downloaded HTML into an object tree
    HPIncrev_esp_containers = soup.find_all("div", {"class": "gdReview"})   # one container per review
    
    # Extract the fields of each review:
    for container in HPIncrev_esp_containers:
        
        com = "HP Inc"
        
        Date = container.find_all("time", {"class": "date subtle small"})
        if len(Date) != 0:
          da = Date[0].text.strip()
        else:
          da = "NaN"

        Title = container.find_all("h2", {"class": "h2 summary strong mb-xsm mt-0"})
        try:
            # Try to expand the review in the browser; either way the title is read from the parsed page
            driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
        except Exception:
            pass
        ti = Title[0].text.strip() if len(Title) != 0 else "NaN"
        
        Rating =  container.find_all("div", {"class": "v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small"})
        if len(Rating) != 0:
          rat = Rating[0].text.strip()
        else:
          rat = "NaN"  
    
        Role = container.find_all("span", {"class": "authorJobTitle middle "})
        if len(Role) != 0:
          rol = Role[0].text.strip()
        else:
          rol = "NaN"
                                           
        Location = container.find_all("span", {"class": "authorLocation"})
        if len(Location) != 0:
          loc = Location[0].text.strip()
        else:
          loc = "NaN"                                   
        
        # find_all takes a single attrs dict; a second dict would be read as `recursive`, so the lang filter is dropped
        Pros = container.find_all("span", {"data-test": "pros"})
        try:
            driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
        except Exception:
            pass
        pro = Pros[0].text.strip() if len(Pros) != 0 else "NaN"

        Contras = container.find_all("span", {"data-test": "cons"})
        try:
            driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
        except Exception:
            pass
        con = Contras[0].text.strip() if len(Contras) != 0 else "NaN"
        
        lan = "Español"

        data1 = pd.DataFrame([[com, da, ti, rat, rol, loc, pro, con, lan]]) 
        data1.columns = col
        # Add the review's row to the dataframe (DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent)
        HPIncrev_esp = pd.concat([HPIncrev_esp, data1], ignore_index=True)

        
#English reviews:
HPIncrev_eng = pd.DataFrame(columns = col)
url_list2 = []
for i in range(1, 16):
    url = ("https://www.glassdoor.es/Opiniones/HP-Inc.-Espa%C3%B1a-Opiniones-EI_IE1093161.0,7_IL.8,14_IN219_IP" + str(i) + ".htm?filter.iso3Language=eng")
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.content, "html.parser")
    HPIncrev_eng_containers = soup.find_all("div", {"class": "gdReview"})
                                                  
    for container in HPIncrev_eng_containers:
        
        com = "HP Inc"
        
        Date = container.find_all("time", {"class": "date subtle small"})
        if len(Date) != 0:
          da = Date[0].text.strip()
        else:
          da = "NaN"

        Title = container.find_all("h2", {"class": "h2 summary strong mb-xsm mt-0"})
        try:
            # Try to expand the review in the browser; either way the title is read from the parsed page
            driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
        except Exception:
            pass
        ti = Title[0].text.strip() if len(Title) != 0 else "NaN"
        
        Rating =  container.find_all("div", {"class": "v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small"})
        if len(Rating) != 0:
          rat = Rating[0].text.strip()
        else:
          rat = "NaN"  
    
        Role = container.find_all("span", {"class": "authorJobTitle middle "})
        if len(Role) != 0:
          rol = Role[0].text.strip()
        else:
          rol = "NaN"
                                           
        Location = container.find_all("span", {"class": "authorLocation"})
        if len(Location) != 0:
          loc = Location[0].text.strip()
        else:
          loc = "NaN"                                   
        
        # find_all takes a single attrs dict; a second dict would be read as `recursive`, so the lang filter is dropped
        Pros = container.find_all("span", {"data-test": "pros"})
        try:
            driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
        except Exception:
            pass
        pro = Pros[0].text.strip() if len(Pros) != 0 else "NaN"

        Contras = container.find_all("span", {"data-test": "cons"})
        try:
            driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
        except Exception:
            pass
        con = Contras[0].text.strip() if len(Contras) != 0 else "NaN"
        
        lan = "English"

        data2 = pd.DataFrame([[com, da, ti, rat, rol, loc, pro, con, lan]])
        data2.columns = col
        # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
        HPIncrev_eng = pd.concat([HPIncrev_eng, data2], ignore_index=True)

# Creating a dataframe for all HP Inc reviews across Spain
HPInc_reviews = pd.concat([HPIncrev_esp, HPIncrev_eng], ignore_index=True)

#Creating a 'csv' file with the 'HPInc_reviews' dataframe
HPInc_reviews.to_csv('HPInc_reviews.csv')

driver.quit()
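
The loops grow each dataframe one review at a time. An alternative that avoids repeated copying, and does not depend on `DataFrame.append` (removed in pandas 2.0), is to collect the rows in a plain list and build the dataframe once at the end. The sketch below uses made-up sample values and an assumed set of column names, since `col` is defined outside this section.

```python
import pandas as pd

# Column names are an assumption: `col` is defined earlier in the notebook and not shown in this section
col = ["Company", "Date", "Title", "Rating", "Role", "Location", "Pros", "Cons", "Language"]

rows = []  # one list per review, in the same order as `col`
for title, rating in [("Good place to learn", "4.0"), ("Long hours", "2.0")]:  # made-up sample values
    rows.append(["HP Inc", "NaN", title, rating, "NaN", "NaN", "NaN", "NaN", "English"])

# Build the dataframe once, after the loop, instead of growing it row by row
reviews = pd.DataFrame(rows, columns=col)
print(reviews.shape)
```

The same list could hold the Spanish and the English rows, which would also make the final step of joining the two dataframes unnecessary.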

In [16]:
#Calling to webdriver

driver = webdriver.Firefox(executable_path = "/home/dsc/Selenium/Firefox_driver/geckodriver")

driver.get("https://www.glassdoor.es/Opiniones/Hewlett-Packard-Enterprise-%7C-HPE-Espa%C3%B1a-Opiniones-EI_IE1093046.0,32_IL.33,39_IN219.htm?filter.iso3Language=spa")

time.sleep(5)  # wait 5 seconds for the page to load

driver.find_element_by_id("onetrust-accept-btn-handler").click()  # accept the cookie banner

# Log in. The credentials are read from environment variables (hypothetical names) rather than hardcoded.
import os
driver.find_element_by_class_name("sign-in").click()
driver.find_element_by_name("username").send_keys(os.environ["GLASSDOOR_USER"])
driver.find_element_by_name("password").send_keys(os.environ["GLASSDOOR_PASSWORD"])
driver.find_element_by_name("submit").click()

# Scraping HPE Spain: Spanish reviews (91 in total) + English reviews (91 in total)
# Creating a dataframe for all HPE reviews across Spain

HPE_reviews = pd.DataFrame(columns = col)

# Build the URL of each results page in turn so the pages can be scraped one after another

# Spanish reviews:
HPErev_esp = pd.DataFrame(columns = col)
url_list1 = []
for i in range(1, 11):
    # Visit every results page to reach all of its reviews
    url = ("https://www.glassdoor.es/Opiniones/Hewlett-Packard-Enterprise-%7C-HPE-Espa%C3%B1a-Opiniones-EI_IE1093046.0,32_IL.33,39_IN219_IP" + str(i) + ".htm?filter.iso3Language=spa")
    page = requests.get(url, headers=headers)  # request the page
    soup = BeautifulSoup(page.content, "html.parser")  # parse the downloaded HTML into an object tree
    HPErev_esp_containers = soup.find_all("div", {"class": "gdReview"})   # one container per review
    
    # Extract the fields of each review:
    for container in HPErev_esp_containers:
        
        com = "HPE"
        
        Date = container.find_all("time", {"class": "date subtle small"})
        if len(Date) != 0:
          da = Date[0].text.strip()
        else:
          da = "NaN"

        Title = container.find_all("h2", {"class": "h2 summary strong mb-xsm mt-0"})
        try:
            # Try to expand the review in the browser; either way the title is read from the parsed page
            driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
        except Exception:
            pass
        ti = Title[0].text.strip() if len(Title) != 0 else "NaN"
        
        Rating =  container.find_all("div", {"class": "v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small"})
        if len(Rating) != 0:
          rat = Rating[0].text.strip()
        else:
          rat = "NaN"  
    
        Role = container.find_all("span", {"class": "authorJobTitle middle "})
        if len(Role) != 0:
          rol = Role[0].text.strip()
        else:
          rol = "NaN"
                                           
        Location = container.find_all("span", {"class": "authorLocation"})
        if len(Location) != 0:
          loc = Location[0].text.strip()
        else:
          loc = "NaN"                                   
        
        # find_all takes a single attrs dict; a second dict would be read as `recursive`, so the lang filter is dropped
        Pros = container.find_all("span", {"data-test": "pros"})
        try:
            driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
        except Exception:
            pass
        pro = Pros[0].text.strip() if len(Pros) != 0 else "NaN"

        Contras = container.find_all("span", {"data-test": "cons"})
        try:
            driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
        except Exception:
            pass
        con = Contras[0].text.strip() if len(Contras) != 0 else "NaN"
        
        lan = "Español"

        data1 = pd.DataFrame([[com, da, ti, rat, rol, loc, pro, con, lan]]) 
        data1.columns = col
        # Add the review's row to the dataframe (DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent)
        HPErev_esp = pd.concat([HPErev_esp, data1], ignore_index=True)

        
#English reviews:
HPErev_eng = pd.DataFrame(columns = col)
url_list2 = []
for i in range(1, 11):
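    # Use the HPE page stem here; the HP Inc URL had been carried over into this loop
    url = ("https://www.glassdoor.es/Opiniones/Hewlett-Packard-Enterprise-%7C-HPE-Espa%C3%B1a-Opiniones-EI_IE1093046.0,32_IL.33,39_IN219_IP" + str(i) + ".htm?filter.iso3Language=eng")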
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.content, "html.parser")
    HPErev_eng_containers = soup.find_all("div", {"class": "gdReview"})
                                                  
    for container in HPErev_eng_containers:
        
        com = "HPE"
        
        Date = container.find_all("time", {"class": "date subtle small"})
        if len(Date) != 0:
          da = Date[0].text.strip()
        else:
          da = "NaN"

        Title = container.find_all("h2", {"class": "h2 summary strong mb-xsm mt-0"})
        try:
            # Try to expand the review in the browser; either way the title is read from the parsed page
            driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
        except Exception:
            pass
        ti = Title[0].text.strip() if len(Title) != 0 else "NaN"
        
        Rating =  container.find_all("div", {"class": "v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small"})
        if len(Rating) != 0:
          rat = Rating[0].text.strip()
        else:
          rat = "NaN"  
    
        Role = container.find_all("span", {"class": "authorJobTitle middle "})
        if len(Role) != 0:
          rol = Role[0].text.strip()
        else:
          rol = "NaN"
                                           
        Location = container.find_all("span", {"class": "authorLocation"})
        if len(Location) != 0:
          loc = Location[0].text.strip()
        else:
          loc = "NaN"                                   
        
        # find_all takes a single attrs dict; a second dict would be read as `recursive`, so the lang filter is dropped
        Pros = container.find_all("span", {"data-test": "pros"})
        try:
            driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
        except Exception:
            pass
        pro = Pros[0].text.strip() if len(Pros) != 0 else "NaN"

        Contras = container.find_all("span", {"data-test": "cons"})
        try:
            driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
        except Exception:
            pass
        con = Contras[0].text.strip() if len(Contras) != 0 else "NaN"
        
        lan = "English"

        data2 = pd.DataFrame([[com, da, ti, rat, rol, loc, pro, con, lan]])
        data2.columns = col
        # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
        HPErev_eng = pd.concat([HPErev_eng, data2], ignore_index=True)

# Creating a dataframe for all HPE reviews across Spain
HPE_reviews = pd.concat([HPErev_esp, HPErev_eng], ignore_index=True)

#Creating a 'csv' file with the 'HPE_reviews' dataframe
HPE_reviews.to_csv('HPE_reviews.csv')

driver.quit()
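
The cells for the different companies differ only in the company label, the URL stem, and the number of reviews per language. As a sketch of how that could be expressed once, the table below holds those values, copied from the four cells above, and a generator walks over them. `COMPANIES`, `BASE` and `scrape_plan` are hypothetical names, and the figure of ten reviews per page is inferred from the loop bounds rather than documented.

```python
import math

# (company label, URL stem, Spanish review count, English review count), as given in the cells above
COMPANIES = [
    ("Everis", "everis-Espa%C3%B1a-Opiniones-EI_IE141439.0,6_IL.7,13_IN219", 561, 203),
    ("Capgemini", "Capgemini-Espa%C3%B1a-Opiniones-EI_IE3803.0,9_IL.10,16_IN219", 200, 66),
    ("HP Inc", "HP-Inc.-Espa%C3%B1a-Opiniones-EI_IE1093161.0,7_IL.8,14_IN219", 187, 141),
    ("HPE", "Hewlett-Packard-Enterprise-%7C-HPE-Espa%C3%B1a-Opiniones-EI_IE1093046.0,32_IL.33,39_IN219", 91, 91),
]

BASE = "https://www.glassdoor.es/Opiniones/"


def scrape_plan(companies, per_page=10):
    """Yield (company, language label, page URL) for every results page to visit."""
    for name, stem, n_spa, n_eng in companies:
        for code, label, total in (("spa", "Español", n_spa), ("eng", "English", n_eng)):
            for i in range(1, math.ceil(total / per_page) + 1):
                yield name, label, BASE + stem + "_IP" + str(i) + ".htm?filter.iso3Language=" + code


plan = list(scrape_plan(COMPANIES))
print(len(plan))  # total number of pages across the four companies
print(plan[0])    # first page to visit
```

Keeping each stem in a single place also rules out the kind of slip where one company's URL ends up in another company's loop.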

In [25]:
#Calling to webdriver

driver = webdriver.Firefox(executable_path = "/home/dsc/Selenium/Firefox_driver/geckodriver")

driver.get("https://www.glassdoor.es/Opiniones/DXC-Technology-Espa%C3%B1a-Opiniones-EI_IE1603125.0,14_IL.15,21_IN219.htm?filter.iso3Language=spa")

time.sleep(5)  # wait 5 seconds for the page to load

driver.find_element_by_id("onetrust-accept-btn-handler").click()  # accept the cookie banner

# Log in. Here the sign-in button is located by id rather than by the "sign-in" class used in the previous cells.
# The credentials are read from environment variables (hypothetical names) rather than hardcoded.
import os
driver.find_element_by_id("SignInButton").click()
driver.find_element_by_name("username").send_keys(os.environ["GLASSDOOR_USER"])
driver.find_element_by_name("password").send_keys(os.environ["GLASSDOOR_PASSWORD"])
driver.find_element_by_name("submit").click()

# Scraping DXC Technology Spain: Spanish reviews (66 in total) + English reviews (33 in total)
# Creating a dataframe for all DXC reviews across Spain

DXC_reviews = pd.DataFrame(columns = col)

# Build the URL of each results page in turn so the pages can be scraped one after another

# Spanish reviews:
DXCrev_esp = pd.DataFrame(columns = col)
url_list1 = []
for i in range(1, 8):
    # Visit every results page to reach all of its reviews
    url = ("https://www.glassdoor.es/Opiniones/DXC-Technology-Espa%C3%B1a-Opiniones-EI_IE1603125.0,14_IL.15,21_IN219_IP" + str(i) + ".htm?filter.iso3Language=spa")
    page = requests.get(url, headers=headers)  # request the page
    soup = BeautifulSoup(page.content, "html.parser")  # parse the downloaded HTML into an object tree
    DXCrev_esp_containers = soup.find_all("div", {"class": "gdReview"})   # one container per review
    
    # Extract the fields of each review:
    for container in DXCrev_esp_containers:
        
        com = "DXC Technology"
        
        Date = container.find_all("time", {"class": "date subtle small"})
        if len(Date) != 0:
          da = Date[0].text.strip()
        else:
          da = "NaN"

        Title = container.find_all("h2", {"class": "h2 summary strong mb-xsm mt-0"})
        try:
            #Expand the truncated review in the browser (this does not change `soup`, which was downloaded with requests)
            driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
        except Exception:
            pass
        ti = Title[0].text.strip() if len(Title) != 0 else "NaN"
        
        Rating =  container.find_all("div", {"class": "v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small"})
        if len(Rating) != 0:
          rat = Rating[0].text.strip()
        else:
          rat = "NaN"  
    
        Role = container.find_all("span", {"class": "authorJobTitle middle "})
        if len(Role) != 0:
          rol = Role[0].text.strip()
        else:
          rol = "NaN"
                                           
        Location = container.find_all("span", {"class": "authorLocation"})
        if len(Location) != 0:
          loc = Location[0].text.strip()
        else:
          loc = "NaN"                                   
        
        #The extra {"lang": ...} dict was being passed as find_all()'s `recursive` argument and never filtered anything, so it is dropped
        Pros = container.find_all("span", {"data-test": "pros"})
        try:
            driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
        except Exception:
            pass
        pro = Pros[0].text.strip() if len(Pros) != 0 else "NaN"
        
        Contras = container.find_all("span", {"data-test": "cons"})
        try:
            driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
        except Exception:
            pass
        con = Contras[0].text.strip() if len(Contras) != 0 else "NaN"
        
        lan = "Español"

        data1 = pd.DataFrame([[com, da, ti, rat, rol, loc, pro, con, lan]]) 
        data1.columns = col
        DXCrev_esp = DXCrev_esp.append(data1, ignore_index=True)  #add this review's row to the dataframe

        
#English reviews:
DXCrev_eng = pd.DataFrame(columns = col)
url_list2 = []
for i in range(1, 5):
    url = ("https://www.glassdoor.es/Opiniones/DXC-Technology-Espa%C3%B1a-Opiniones-EI_IE1603125.0,14_IL.15,21_IN219_IP" + str(i) + ".htm?filter.iso3Language=eng")
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.content, "html.parser")
    DXCrev_eng_containers = soup.find_all("div", {"class": "gdReview"})
                                                  
    for container in DXCrev_eng_containers:
        
        com = "DXC Technology"
        
        Date = container.find_all("time", {"class": "date subtle small"})
        if len(Date) != 0:
          da = Date[0].text.strip()
        else:
          da = "NaN"

        Title = container.find_all("h2", {"class": "h2 summary strong mb-xsm mt-0"})
        try:
            #Expand the truncated review in the browser (this does not change `soup`, which was downloaded with requests)
            driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
        except Exception:
            pass
        ti = Title[0].text.strip() if len(Title) != 0 else "NaN"
        
        Rating =  container.find_all("div", {"class": "v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small"})
        if len(Rating) != 0:
          rat = Rating[0].text.strip()
        else:
          rat = "NaN"  
    
        Role = container.find_all("span", {"class": "authorJobTitle middle "})
        if len(Role) != 0:
          rol = Role[0].text.strip()
        else:
          rol = "NaN"
                                           
        Location = container.find_all("span", {"class": "authorLocation"})
        if len(Location) != 0:
          loc = Location[0].text.strip()
        else:
          loc = "NaN"                                   
        
        #The extra {"lang": ...} dict was being passed as find_all()'s `recursive` argument and never filtered anything, so it is dropped
        Pros = container.find_all("span", {"data-test": "pros"})
        try:
            driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
        except Exception:
            pass
        pro = Pros[0].text.strip() if len(Pros) != 0 else "NaN"
        
        Contras = container.find_all("span", {"data-test": "cons"})
        try:
            driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
        except Exception:
            pass
        con = Contras[0].text.strip() if len(Contras) != 0 else "NaN"
        
        lan = "English"

        data2 = pd.DataFrame([[com, da, ti, rat, rol, loc, pro, con, lan]])
        data2.columns = col
        DXCrev_eng = DXCrev_eng.append(data2, ignore_index=True)
        
#Creating a dataframe for all DXC reviews across Spain
DXC_reviews = DXCrev_esp.append(DXCrev_eng, ignore_index=True)

#Creating a 'csv' file with the 'DXC_reviews' dataframe
DXC_reviews.to_csv('DXC_reviews.csv')

driver.quit()
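
The upper bounds in `range(1, 8)` and `range(1, 5)` above line up with the review totals noted in the comment (66 Spanish, 33 English) if each results page holds 10 reviews. That page size is inferred from the numbers in this notebook, not taken from any Glassdoor documentation. Under that assumption, the page range could be derived from the total instead of being typed by hand:

```python
import math

REVIEWS_PER_PAGE = 10  # assumption inferred from the ranges used in this notebook


def page_numbers(total_reviews, per_page=REVIEWS_PER_PAGE):
    """Return the 1-based page numbers needed to cover `total_reviews`."""
    return range(1, math.ceil(total_reviews / per_page) + 1)


spanish_pages = page_numbers(66)  # DXC Spanish reviews
english_pages = page_numbers(33)  # DXC English reviews
```

The loops would then read `for i in page_numbers(66):`, so a change in the review count only needs one number updated.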

In [11]:
#Launching the webdriver

driver = webdriver.Firefox(executable_path = "/home/dsc/Selenium/Firefox_driver/geckodriver")

driver.get("https://www.glassdoor.es/Opiniones/Adevinta-Espa%C3%B1a-Opiniones-EI_IE2558314.0,8_IL.9,15_IN219.htm?filter.iso3Language=spa")

time.sleep(5)  #wait 5 seconds for the page to load

cookies_button = driver.find_element_by_id("onetrust-accept-btn-handler").click() 

log_in = driver.find_element_by_class_name("sign-in").click()
username = driver.find_element_by_name("username").send_keys("micasbot@gmail.com")
userpwd = driver.find_element_by_name("password").send_keys("Fofito88")
login_enter = driver.find_element_by_name("submit").click()

#Scraping Adevinta: Spanish reviews (27 in total) + English reviews (44 in total)
#Creating a dataframe for all Adevinta reviews across Spain

Adevinta_reviews = pd.DataFrame(columns = col)

#Iterate over the page URLs so that each one can then be scraped in sequence

#Spanish reviews:
Adevintarev_esp = pd.DataFrame(columns = col)
url_list1 = []
for i in range(1, 4):
    #Loop over the result pages to reach every review
    url = ("https://www.glassdoor.es/Opiniones/Adevinta-Espa%C3%B1a-Opiniones-EI_IE2558314.0,8_IL.9,15_IN219_IP" + str(i) + ".htm?filter.iso3Language=spa")
    page = requests.get(url, headers=headers)  #HTTP request for the page
    soup = BeautifulSoup(page.content, "html.parser")  #parse the downloaded HTML into an object tree
    Adevintarev_esp_containers = soup.find_all("div", {"class": "gdReview"})   #one container per review
    
    #Extract the fields review by review:
    for container in Adevintarev_esp_containers:
        
        com = "Adevinta"
        
        Date = container.find_all("time", {"class": "date subtle small"})
        if len(Date) != 0:
          da = Date[0].text.strip()
        else:
          da = "NaN"

        Title = container.find_all("h2", {"class": "h2 summary strong mb-xsm mt-0"})
        try:
            #Expand the truncated review in the browser (this does not change `soup`, which was downloaded with requests)
            driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
        except Exception:
            pass
        ti = Title[0].text.strip() if len(Title) != 0 else "NaN"
        
        Rating =  container.find_all("div", {"class": "v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small"})
        if len(Rating) != 0:
          rat = Rating[0].text.strip()
        else:
          rat = "NaN"  
    
        Role = container.find_all("span", {"class": "authorJobTitle middle "})
        if len(Role) != 0:
          rol = Role[0].text.strip()
        else:
          rol = "NaN"
                                           
        Location = container.find_all("span", {"class": "authorLocation"})
        if len(Location) != 0:
          loc = Location[0].text.strip()
        else:
          loc = "NaN"                                   
        
        #The extra {"lang": ...} dict was being passed as find_all()'s `recursive` argument and never filtered anything, so it is dropped
        Pros = container.find_all("span", {"data-test": "pros"})
        try:
            driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
        except Exception:
            pass
        pro = Pros[0].text.strip() if len(Pros) != 0 else "NaN"
        
        Contras = container.find_all("span", {"data-test": "cons"})
        try:
            driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
        except Exception:
            pass
        con = Contras[0].text.strip() if len(Contras) != 0 else "NaN"
        
        lan = "Español"

        data1 = pd.DataFrame([[com, da, ti, rat, rol, loc, pro, con, lan]]) 
        data1.columns = col
        Adevintarev_esp = Adevintarev_esp.append(data1, ignore_index=True)  #add this review's row to the dataframe

        
#English reviews:
Adevintarev_eng = pd.DataFrame(columns = col)
url_list2 = []
for i in range(1, 6):
    url = ("https://www.glassdoor.es/Opiniones/Adevinta-Espa%C3%B1a-Opiniones-EI_IE2558314.0,8_IL.9,15_IN219_IP" + str(i) + ".htm?filter.iso3Language=eng")
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.content, "html.parser")
    Adevintarev_eng_containers = soup.find_all("div", {"class": "gdReview"})
                                                  
    for container in Adevintarev_eng_containers:
        
        com = "Adevinta"
        
        Date = container.find_all("time", {"class": "date subtle small"})
        if len(Date) != 0:
          da = Date[0].text.strip()
        else:
          da = "NaN"

        Title = container.find_all("h2", {"class": "h2 summary strong mb-xsm mt-0"})
        try:
            #Expand the truncated review in the browser (this does not change `soup`, which was downloaded with requests)
            driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
        except Exception:
            pass
        ti = Title[0].text.strip() if len(Title) != 0 else "NaN"
        
        Rating =  container.find_all("div", {"class": "v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small"})
        if len(Rating) != 0:
          rat = Rating[0].text.strip()
        else:
          rat = "NaN"  
    
        Role = container.find_all("span", {"class": "authorJobTitle middle "})
        if len(Role) != 0:
          rol = Role[0].text.strip()
        else:
          rol = "NaN"
                                           
        Location = container.find_all("span", {"class": "authorLocation"})
        if len(Location) != 0:
          loc = Location[0].text.strip()
        else:
          loc = "NaN"                                   
        
        #The extra {"lang": ...} dict was being passed as find_all()'s `recursive` argument and never filtered anything, so it is dropped
        Pros = container.find_all("span", {"data-test": "pros"})
        try:
            driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
        except Exception:
            pass
        pro = Pros[0].text.strip() if len(Pros) != 0 else "NaN"
        
        Contras = container.find_all("span", {"data-test": "cons"})
        try:
            driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
        except Exception:
            pass
        con = Contras[0].text.strip() if len(Contras) != 0 else "NaN"
        
        lan = "English"

        data2 = pd.DataFrame([[com, da, ti, rat, rol, loc, pro, con, lan]])
        data2.columns = col
        Adevintarev_eng = Adevintarev_eng.append(data2, ignore_index=True)
        
#Creating a dataframe for all Adevinta reviews across Spain
Adevinta_reviews = Adevintarev_esp.append(Adevintarev_eng, ignore_index=True)

#Creating a 'csv' file with the 'Adevinta_reviews' dataframe
Adevinta_reviews.to_csv('Adevinta_reviews.csv')

driver.quit()
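
Every field in the loops above follows the same pattern: call `find_all`, take the first match, strip its text, and fall back to the string `"NaN"` when nothing matches. A minimal sketch of that pattern as one function is shown below. The name `first_text` is hypothetical, and the HTML snippet is made up for illustration, reusing only class names that appear in the selectors above.

```python
from bs4 import BeautifulSoup


def first_text(container, tag, attrs, default="NaN"):
    """Return the stripped text of the first matching element, or `default`."""
    matches = container.find_all(tag, attrs)
    return matches[0].text.strip() if matches else default


# Made-up HTML for illustration; it is not a real Glassdoor page.
html = """
<div class="gdReview">
  <span class="authorLocation"> Madrid </span>
</div>
"""
container = BeautifulSoup(html, "html.parser").find("div", {"class": "gdReview"})

loc = first_text(container, "span", {"class": "authorLocation"})          # element present
rol = first_text(container, "span", {"class": "authorJobTitle middle "})  # element missing
```

With such a helper, each four-line `if`/`else` block in the loops would shrink to a single call.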

In [12]:
#Launching the webdriver

driver = webdriver.Firefox(executable_path = "/home/dsc/Selenium/Firefox_driver/geckodriver")

driver.get("https://www.glassdoor.es/Opiniones/Vistaprint-Espa%C3%B1a-Opiniones-EI_IE21719.0,10_IL.11,17_IN219.htm?filter.iso3Language=spa")

time.sleep(5)  #wait 5 seconds for the page to load

cookies_button = driver.find_element_by_id("onetrust-accept-btn-handler").click() 

log_in = driver.find_element_by_class_name("sign-in").click()
username = driver.find_element_by_name("username").send_keys("micasbot@gmail.com")
userpwd = driver.find_element_by_name("password").send_keys("Fofito88")
login_enter = driver.find_element_by_name("submit").click()

#Scraping Vistaprint: Spanish reviews (62 in total) + English reviews (66 in total)
#Creating a dataframe for all Vistaprint reviews across Spain

Vistaprint_reviews = pd.DataFrame(columns = col)

#Iterate over the page URLs so that each one can then be scraped in sequence

#Spanish reviews:
Vistaprintrev_esp = pd.DataFrame(columns = col)
url_list1 = []
for i in range(1, 8):
    #Loop over the result pages to reach every review
    url = ("https://www.glassdoor.es/Opiniones/Vistaprint-Espa%C3%B1a-Opiniones-EI_IE21719.0,10_IL.11,17_IN219_IP" + str(i) + ".htm?filter.iso3Language=spa")
    page = requests.get(url, headers=headers)  #HTTP request for the page
    soup = BeautifulSoup(page.content, "html.parser")  #parse the downloaded HTML into an object tree
    Vistaprintrev_esp_containers = soup.find_all("div", {"class": "gdReview"})   #one container per review
    
    #Extract the fields review by review:
    for container in Vistaprintrev_esp_containers:
        
        com = "Vistaprint"
        
        Date = container.find_all("time", {"class": "date subtle small"})
        if len(Date) != 0:
          da = Date[0].text.strip()
        else:
          da = "NaN"

        Title = container.find_all("h2", {"class": "h2 summary strong mb-xsm mt-0"})
        try:
            #Expand the truncated review in the browser (this does not change `soup`, which was downloaded with requests)
            driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
        except Exception:
            pass
        ti = Title[0].text.strip() if len(Title) != 0 else "NaN"
        
        Rating =  container.find_all("div", {"class": "v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small"})
        if len(Rating) != 0:
          rat = Rating[0].text.strip()
        else:
          rat = "NaN"  
    
        Role = container.find_all("span", {"class": "authorJobTitle middle "})
        if len(Role) != 0:
          rol = Role[0].text.strip()
        else:
          rol = "NaN"
                                           
        Location = container.find_all("span", {"class": "authorLocation"})
        if len(Location) != 0:
          loc = Location[0].text.strip()
        else:
          loc = "NaN"                                   
        
        #The extra {"lang": ...} dict was being passed as find_all()'s `recursive` argument and never filtered anything, so it is dropped
        Pros = container.find_all("span", {"data-test": "pros"})
        try:
            driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
        except Exception:
            pass
        pro = Pros[0].text.strip() if len(Pros) != 0 else "NaN"
        
        Contras = container.find_all("span", {"data-test": "cons"})
        try:
            driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
        except Exception:
            pass
        con = Contras[0].text.strip() if len(Contras) != 0 else "NaN"
        
        lan = "Español"

        data1 = pd.DataFrame([[com, da, ti, rat, rol, loc, pro, con, lan]]) 
        data1.columns = col
        Vistaprintrev_esp = Vistaprintrev_esp.append(data1, ignore_index=True)  #add this review's row to the dataframe

        
#English reviews:
Vistaprintrev_eng = pd.DataFrame(columns = col)
url_list2 = []
for i in range(1, 8):
    url = ("https://www.glassdoor.es/Opiniones/Vistaprint-Espa%C3%B1a-Opiniones-EI_IE21719.0,10_IL.11,17_IN219_IP" + str(i) + ".htm?filter.iso3Language=eng")
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.content, "html.parser")
    Vistaprintrev_eng_containers = soup.find_all("div", {"class": "gdReview"})
                                                  
    for container in Vistaprintrev_eng_containers:
        
        com = "Vistaprint"
        
        Date = container.find_all("time", {"class": "date subtle small"})
        if len(Date) != 0:
          da = Date[0].text.strip()
        else:
          da = "NaN"

        Title = container.find_all("h2", {"class": "h2 summary strong mb-xsm mt-0"})
        try:
            #Expand the truncated review in the browser (this does not change `soup`, which was downloaded with requests)
            driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
        except Exception:
            pass
        ti = Title[0].text.strip() if len(Title) != 0 else "NaN"
        
        Rating =  container.find_all("div", {"class": "v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small"})
        if len(Rating) != 0:
          rat = Rating[0].text.strip()
        else:
          rat = "NaN"  
    
        Role = container.find_all("span", {"class": "authorJobTitle middle "})
        if len(Role) != 0:
          rol = Role[0].text.strip()
        else:
          rol = "NaN"
                                           
        Location = container.find_all("span", {"class": "authorLocation"})
        if len(Location) != 0:
          loc = Location[0].text.strip()
        else:
          loc = "NaN"                                   
        
        #The extra {"lang": ...} dict was being passed as find_all()'s `recursive` argument and never filtered anything, so it is dropped
        Pros = container.find_all("span", {"data-test": "pros"})
        try:
            driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
        except Exception:
            pass
        pro = Pros[0].text.strip() if len(Pros) != 0 else "NaN"
        
        Contras = container.find_all("span", {"data-test": "cons"})
        try:
            driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
        except Exception:
            pass
        con = Contras[0].text.strip() if len(Contras) != 0 else "NaN"
        
        lan = "English"

        data2 = pd.DataFrame([[com, da, ti, rat, rol, loc, pro, con, lan]])
        data2.columns = col
        Vistaprintrev_eng = Vistaprintrev_eng.append(data2, ignore_index=True)
        
#Creating a dataframe for all Vistaprint reviews across Spain
Vistaprint_reviews = Vistaprintrev_esp.append(Vistaprintrev_eng, ignore_index=True)

#Creating a 'csv' file with the 'Vistaprint_reviews' dataframe
Vistaprint_reviews.to_csv('Vistaprint_reviews.csv')

driver.quit()
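
The login step in each cell types the account e-mail and password as string literals. A minimal sketch of keeping them out of the notebook is to read them from environment variables instead. The variable names `GLASSDOOR_USER` and `GLASSDOOR_PASSWORD` and the function `load_credentials` are hypothetical, chosen only for this illustration.

```python
import os


def load_credentials(environ=os.environ):
    """Read the login e-mail and password from environment variables.

    GLASSDOOR_USER and GLASSDOOR_PASSWORD are hypothetical names chosen for
    this sketch; nothing in the notebook defines them.
    """
    try:
        return environ["GLASSDOOR_USER"], environ["GLASSDOOR_PASSWORD"]
    except KeyError as missing:
        raise RuntimeError(
            f"Set the environment variable {missing} before running the cell"
        ) from missing


# Demonstration with an explicit mapping instead of the real environment.
user, password = load_credentials(
    {"GLASSDOOR_USER": "user@example.com", "GLASSDOOR_PASSWORD": "not-a-real-password"}
)
```

The returned values could then be passed to `send_keys(user)` and `send_keys(password)` in place of the literals.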

In [13]:
#Launching the webdriver

driver = webdriver.Firefox(executable_path = "/home/dsc/Selenium/Firefox_driver/geckodriver")

driver.get("https://www.glassdoor.es/Opiniones/Typeform-Espa%C3%B1a-Opiniones-EI_IE991912.0,8_IL.9,15_IN219.htm?filter.iso3Language=spa")

time.sleep(5)  #wait 5 seconds for the page to load

cookies_button = driver.find_element_by_id("onetrust-accept-btn-handler").click() 

log_in = driver.find_element_by_class_name("sign-in").click()
username = driver.find_element_by_name("username").send_keys("micasbot@gmail.com")
userpwd = driver.find_element_by_name("password").send_keys("Fofito88")
login_enter = driver.find_element_by_name("submit").click()

#Scraping Typeform: Spanish reviews (72 in total) + English reviews (72 in total)
#Creating a dataframe for all Typeform reviews across Spain

Typeform_reviews = pd.DataFrame(columns = col)

#Iterate over the page URLs so that each one can then be scraped in sequence

#Spanish reviews:
Typeformrev_esp = pd.DataFrame(columns = col)
url_list1 = []
for i in range(1, 9):
    #Loop over the result pages to reach every review
    url = ("https://www.glassdoor.es/Opiniones/Typeform-Espa%C3%B1a-Opiniones-EI_IE991912.0,8_IL.9,15_IN219_IP" + str(i) + ".htm?filter.iso3Language=spa")
    page = requests.get(url, headers=headers)  #HTTP request for the page
    soup = BeautifulSoup(page.content, "html.parser")  #parse the downloaded HTML into an object tree
    Typeformrev_esp_containers = soup.find_all("div", {"class": "gdReview"})   #one container per review
    
    #Extract the fields review by review:
    for container in Typeformrev_esp_containers:
        
        com = "Typeform"
        
        Date = container.find_all("time", {"class": "date subtle small"})
        if len(Date) != 0:
          da = Date[0].text.strip()
        else:
          da = "NaN"

        Title = container.find_all("h2", {"class": "h2 summary strong mb-xsm mt-0"})
        try:
            #Expand the truncated review in the browser (this does not change `soup`, which was downloaded with requests)
            driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
        except Exception:
            pass
        ti = Title[0].text.strip() if len(Title) != 0 else "NaN"
        
        Rating =  container.find_all("div", {"class": "v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small"})
        if len(Rating) != 0:
          rat = Rating[0].text.strip()
        else:
          rat = "NaN"  
    
        Role = container.find_all("span", {"class": "authorJobTitle middle "})
        if len(Role) != 0:
          rol = Role[0].text.strip()
        else:
          rol = "NaN"
                                           
        Location = container.find_all("span", {"class": "authorLocation"})
        if len(Location) != 0:
          loc = Location[0].text.strip()
        else:
          loc = "NaN"                                   
        
        #The extra {"lang": ...} dict was being passed as find_all()'s `recursive` argument and never filtered anything, so it is dropped
        Pros = container.find_all("span", {"data-test": "pros"})
        try:
            driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
        except Exception:
            pass
        pro = Pros[0].text.strip() if len(Pros) != 0 else "NaN"
        
        Contras = container.find_all("span", {"data-test": "cons"})
        try:
            driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
        except Exception:
            pass
        con = Contras[0].text.strip() if len(Contras) != 0 else "NaN"
        
        lan = "Español"

        data1 = pd.DataFrame([[com, da, ti, rat, rol, loc, pro, con, lan]]) 
        data1.columns = col
        Typeformrev_esp = Typeformrev_esp.append(data1, ignore_index=True)  #add this review's row to the dataframe

        
#English reviews:
Typeformrev_eng = pd.DataFrame(columns = col)
url_list2 = []
for i in range(1, 9):
    url = ("https://www.glassdoor.es/Opiniones/Typeform-Espa%C3%B1a-Opiniones-EI_IE991912.0,8_IL.9,15_IN219_IP" + str(i) + ".htm?filter.iso3Language=eng")
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.content, "html.parser")
    Typeformrev_eng_containers = soup.find_all("div", {"class": "gdReview"})
                                                  
    for container in Typeformrev_eng_containers:
        
        com = "Typeform"
        
        Date = container.find_all("time", {"class": "date subtle small"})
        if len(Date) != 0:
          da = Date[0].text.strip()
        else:
          da = "NaN"

        Title = container.find_all("h2", {"class": "h2 summary strong mb-xsm mt-0"})
        try:
            #Expand the truncated review in the browser (this does not change `soup`, which was downloaded with requests)
            driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
        except Exception:
            pass
        ti = Title[0].text.strip() if len(Title) != 0 else "NaN"
        
        Rating =  container.find_all("div", {"class": "v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small"})
        if len(Rating) != 0:
          rat = Rating[0].text.strip()
        else:
          rat = "NaN"  
    
        Role = container.find_all("span", {"class": "authorJobTitle middle "})
        if len(Role) != 0:
          rol = Role[0].text.strip()
        else:
          rol = "NaN"
                                           
        Location = container.find_all("span", {"class": "authorLocation"})
        if len(Location) != 0:
          loc = Location[0].text.strip()
        else:
          loc = "NaN"                                   
        
        #The extra {"lang": ...} dict was being passed as find_all()'s `recursive` argument and never filtered anything, so it is dropped
        Pros = container.find_all("span", {"data-test": "pros"})
        try:
            driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
        except Exception:
            pass
        pro = Pros[0].text.strip() if len(Pros) != 0 else "NaN"
        
        Contras = container.find_all("span", {"data-test": "cons"})
        try:
            driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
        except Exception:
            pass
        con = Contras[0].text.strip() if len(Contras) != 0 else "NaN"
        
        lan = "English"

        data2 = pd.DataFrame([[com, da, ti, rat, rol, loc, pro, con, lan]])
        data2.columns = col
        Typeformrev_eng = Typeformrev_eng.append(data2, ignore_index=True)
        
#Creating a dataframe for all Typeform reviews across Spain
Typeform_reviews = Typeformrev_esp.append(Typeformrev_eng, ignore_index=True)

#Creating a 'csv' file with the 'Typeform_reviews' dataframe
Typeform_reviews.to_csv('Typeform_reviews.csv')

driver.quit()
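
The company cells above differ only in the company name, the URL stem, and the number of result pages per language. A minimal sketch of collecting those values in one table and generating the page URLs from it follows. The URL stems and page counts are copied from the cells above, while the names `COMPANIES` and `page_urls` are hypothetical.

```python
BASE = "https://www.glassdoor.es/Opiniones/"

# (URL stem, Spanish pages, English pages), copied from the cells above.
COMPANIES = {
    "DXC Technology": ("DXC-Technology-Espa%C3%B1a-Opiniones-EI_IE1603125.0,14_IL.15,21_IN219", 7, 4),
    "Adevinta": ("Adevinta-Espa%C3%B1a-Opiniones-EI_IE2558314.0,8_IL.9,15_IN219", 3, 5),
    "Vistaprint": ("Vistaprint-Espa%C3%B1a-Opiniones-EI_IE21719.0,10_IL.11,17_IN219", 7, 7),
    "Typeform": ("Typeform-Espa%C3%B1a-Opiniones-EI_IE991912.0,8_IL.9,15_IN219", 8, 8),
}


def page_urls(stem, pages, language):
    """Return the review-page URLs 1..pages for one company and language code."""
    return [
        f"{BASE}{stem}_IP{i}.htm?filter.iso3Language={language}"
        for i in range(1, pages + 1)
    ]


stem, spa_pages, eng_pages = COMPANIES["Typeform"]
typeform_spa = page_urls(stem, spa_pages, "spa")
typeform_eng = page_urls(stem, eng_pages, "eng")
```

A single loop over `COMPANIES.items()` could then drive the same scraping code for every company, so a fix to the extraction logic would be made once rather than in each cell.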

In [14]:
#Launching the webdriver

driver = webdriver.Firefox(executable_path = "/home/dsc/Selenium/Firefox_driver/geckodriver")

driver.get("https://www.glassdoor.es/Opiniones/GFT-Technologies-Espa%C3%B1a-Opiniones-EI_IE935116.0,16_IL.17,23_IN219.htm?filter.iso3Language=spa")

time.sleep(5)  #wait 5 seconds for the page to load

cookies_button = driver.find_element_by_id("onetrust-accept-btn-handler").click() 

log_in = driver.find_element_by_class_name("sign-in").click()
username = driver.find_element_by_name("username").send_keys("micasbot@gmail.com")
userpwd = driver.find_element_by_name("password").send_keys("Fofito88")
login_enter = driver.find_element_by_name("submit").click()

#Scraping GFT: Spanish reviews (77 in total) + English reviews (56 in total)
#Creating a dataframe for all GFT reviews across Spain

GFT_reviews = pd.DataFrame(columns = col)

#Iterate over the page URLs so that each one can then be scraped in sequence

#Spanish reviews:
GFTrev_esp = pd.DataFrame(columns = col)
url_list1 = []
for i in range(1, 9):
    #Loop over the result pages to reach every review
    url = ("https://www.glassdoor.es/Opiniones/GFT-Technologies-Espa%C3%B1a-Opiniones-EI_IE935116.0,16_IL.17,23_IN219_IP" + str(i) + ".htm?filter.iso3Language=spa")
    page = requests.get(url, headers=headers)  #HTTP request for the page
    soup = BeautifulSoup(page.content, "html.parser")  #parse the downloaded HTML into an object tree
    GFTrev_esp_containers = soup.find_all("div", {"class": "gdReview"})   #one container per review
    
    #Extract the fields review by review:
    for container in GFTrev_esp_containers:
        
        com = "GFT"
        
        Date = container.find_all("time", {"class": "date subtle small"})
        if len(Date) != 0:
          da = Date[0].text.strip()
        else:
          da = "NaN"

        Title = container.find_all("h2", {"class": "h2 summary strong mb-xsm mt-0"})
        try:
            #Expand the truncated review in the browser (this does not change `soup`, which was downloaded with requests)
            driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
        except Exception:
            pass
        ti = Title[0].text.strip() if len(Title) != 0 else "NaN"
        
        Rating =  container.find_all("div", {"class": "v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small"})
        if len(Rating) != 0:
          rat = Rating[0].text.strip()
        else:
          rat = "NaN"  
    
        Role = container.find_all("span", {"class": "authorJobTitle middle "})
        if len(Role) != 0:
          rol = Role[0].text.strip()
        else:
          rol = "NaN"
                                           
        Location = container.find_all("span", {"class": "authorLocation"})
        if len(Location) != 0:
          loc = Location[0].text.strip()
        else:
          loc = "NaN"                                   
        
        Pros = container.find_all("span", {"data-test": "pros"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        except:
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        
        Contras = container.find_all("span", {"data-test": "cons"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        except:
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        
        lan = "Español"

        data1 = pd.DataFrame([[com, da, ti, rat, rol, loc, pro, con, lan]]) 
        data1.columns = col
        GFTrev_esp = GFTrev_esp.append(data1, ignore_index=True)  #add the extracted row to the dataframe

        
#English reviews:
GFTrev_eng = pd.DataFrame(columns = col)
url_list2 = []
for i in range(1, 7):
    url = ("https://www.glassdoor.es/Opiniones/GFT-Technologies-Espa%C3%B1a-Opiniones-EI_IE935116.0,16_IL.17,23_IN219_IP" + str(i) + ".htm?filter.iso3Language=eng")
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.content, "html.parser")
    GFTrev_eng_containers = soup.find_all("div", {"class": "gdReview"})
                                                  
    for container in GFTrev_eng_containers:
        
        com = "GFT"
        
        Date = container.find_all("time", {"class": "date subtle small"})
        if len(Date) != 0:
          da = Date[0].text.strip()
        else:
          da = "NaN"

        Title = container.find_all("h2", {"class": "h2 summary strong mb-xsm mt-0"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        except:
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        
        Rating =  container.find_all("div", {"class": "v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small"})
        if len(Rating) != 0:
          rat = Rating[0].text.strip()
        else:
          rat = "NaN"  
    
        Role = container.find_all("span", {"class": "authorJobTitle middle "})
        if len(Role) != 0:
          rol = Role[0].text.strip()
        else:
          rol = "NaN"
                                           
        Location = container.find_all("span", {"class": "authorLocation"})
        if len(Location) != 0:
          loc = Location[0].text.strip()
        else:
          loc = "NaN"                                   
        
        Pros = container.find_all("span", {"data-test": "pros"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        except:
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        
        Contras = container.find_all("span", {"data-test": "cons"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        except:
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        
        lan = "English"

        data2 = pd.DataFrame([[com, da, ti, rat, rol, loc, pro, con, lan]])
        data2.columns = col
        GFTrev_eng = GFTrev_eng.append(data2, ignore_index=True)
        
#Creating a dataframe for all GFT reviews across Spain
GFT_reviews = GFTrev_esp.append(GFTrev_eng, ignore_index=True)

#Creating a 'csv' file with the 'GFT_reviews' dataframe
GFT_reviews.to_csv('GFT_reviews.csv')

driver.quit()
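
The `if len(...) != 0: ... else: ... "NaN"` pattern in the cell above repeats for every field of every review. A minimal sketch of how it could be factored into one helper follows. `first_text` is a hypothetical name that does not appear in the notebook, and the stand-in objects only mimic the `.text` attribute of a BeautifulSoup tag so that the snippet runs without a network request.

```python
from types import SimpleNamespace


def first_text(matches, default="NaN"):
    """Return the stripped text of the first match, or `default` when there is none.

    `matches` is expected to be a list-like result such as the one returned by
    `container.find_all(...)`; only the `.text` attribute of its first item is used.
    """
    if len(matches) != 0:
        return matches[0].text.strip()
    return default


# Stand-ins for BeautifulSoup tags (illustrative data, not real Glassdoor content).
found = [SimpleNamespace(text="  Data Analyst  ")]
missing = []

print(first_text(found))    # Data Analyst
print(first_text(missing))  # NaN
```

With a helper like this, a field could be read in one line, for example `rol = first_text(container.find_all("span", {"class": "authorJobTitle middle "}))`.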

In [15]:
#Launching the webdriver

driver = webdriver.Firefox(executable_path = "/home/dsc/Selenium/Firefox_driver/geckodriver")

driver.get("https://www.glassdoor.es/Opiniones/Sopra-Steria-Espa%C3%B1a-Opiniones-EI_IE466295.0,12_IL.13,19_IN219.htm?filter.iso3Language=spa")

time.sleep(5)  #wait 5 seconds for the page to load

cookies_button = driver.find_element_by_id("onetrust-accept-btn-handler").click() 

log_in = driver.find_element_by_class_name("sign-in").click()
#log_in = driver.find_element_by_id("SignInButton").click() 
username = driver.find_element_by_name("username").send_keys("micasbot@gmail.com")
userpwd = driver.find_element_by_name("password").send_keys("Fofito88")
login_enter = driver.find_element_by_name("submit").click()

#Scraping Sopra Steria: Spanish reviews (107 in total) + English reviews (41 in total)
#Creating a dataframe for all Sopra Steria reviews across Spain

Sopra_reviews = pd.DataFrame(columns = col)

#Iterate over the page URLs so that each page can be scraped in turn

#Spanish reviews:
Soprarev_esp = pd.DataFrame(columns = col)
url_list1 = []
for i in range(1, 12):
    #The goal is to iterate over every page and reach all of its reviews
    url = ("https://www.glassdoor.es/Opiniones/Sopra-Steria-Espa%C3%B1a-Opiniones-EI_IE466295.0,12_IL.13,19_IN219_IP" + str(i) + ".htm?filter.iso3Language=spa")
    page = requests.get(url, headers=headers)  #request the page
    soup = BeautifulSoup(page.content, "html.parser")  #parse the downloaded HTML into an object tree
    Soprarev_esp_containers = soup.find_all("div", {"class": "gdReview"})   #select one container per review
    
    #Extract the fields of each review:
    for container in Soprarev_esp_containers:
        
        com = "Sopra Steria"
        
        Date = container.find_all("time", {"class": "date subtle small"})
        if len(Date) != 0:
          da = Date[0].text.strip()
        else:
          da = "NaN"

        Title = container.find_all("h2", {"class": "h2 summary strong mb-xsm mt-0"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        except:
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        
        Rating =  container.find_all("div", {"class": "v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small"})
        if len(Rating) != 0:
          rat = Rating[0].text.strip()
        else:
          rat = "NaN"  
    
        Role = container.find_all("span", {"class": "authorJobTitle middle "})
        if len(Role) != 0:
          rol = Role[0].text.strip()
        else:
          rol = "NaN"
                                           
        Location = container.find_all("span", {"class": "authorLocation"})
        if len(Location) != 0:
          loc = Location[0].text.strip()
        else:
          loc = "NaN"                                   
        
        Pros = container.find_all("span", {"data-test": "pros"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        except:
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        
        Contras = container.find_all("span", {"data-test": "cons"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        except:
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        
        lan = "Español"

        data1 = pd.DataFrame([[com, da, ti, rat, rol, loc, pro, con, lan]]) 
        data1.columns = col
        Soprarev_esp = Soprarev_esp.append(data1, ignore_index=True)  #add the extracted row to the dataframe

        
#English reviews:
Soprarev_eng = pd.DataFrame(columns = col)
url_list2 = []
for i in range(1, 6):
    url = ("https://www.glassdoor.es/Opiniones/Sopra-Steria-Espa%C3%B1a-Opiniones-EI_IE466295.0,12_IL.13,19_IN219_IP" + str(i) + ".htm?filter.iso3Language=eng")
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.content, "html.parser")
    Soprarev_eng_containers = soup.find_all("div", {"class": "gdReview"})
                                                  
    for container in Soprarev_eng_containers:
        
        com = "Sopra Steria"
        
        Date = container.find_all("time", {"class": "date subtle small"})
        if len(Date) != 0:
          da = Date[0].text.strip()
        else:
          da = "NaN"

        Title = container.find_all("h2", {"class": "h2 summary strong mb-xsm mt-0"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        except:
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        
        Rating =  container.find_all("div", {"class": "v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small"})
        if len(Rating) != 0:
          rat = Rating[0].text.strip()
        else:
          rat = "NaN"  
    
        Role = container.find_all("span", {"class": "authorJobTitle middle "})
        if len(Role) != 0:
          rol = Role[0].text.strip()
        else:
          rol = "NaN"
                                           
        Location = container.find_all("span", {"class": "authorLocation"})
        if len(Location) != 0:
          loc = Location[0].text.strip()
        else:
          loc = "NaN"                                   
        
        Pros = container.find_all("span", {"data-test": "pros"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        except:
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        
        Contras = container.find_all("span", {"data-test": "cons"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        except:
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        
        lan = "English"

        data2 = pd.DataFrame([[com, da, ti, rat, rol, loc, pro, con, lan]])
        data2.columns = col
        Soprarev_eng = Soprarev_eng.append(data2, ignore_index=True)
        
#Creating a dataframe for all Sopra Steria reviews across Spain
Sopra_reviews = Soprarev_esp.append(Soprarev_eng, ignore_index=True)

#Creating a 'csv' file with the 'Sopra_reviews' dataframe
Sopra_reviews.to_csv('Sopra_reviews.csv')

driver.quit()
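
The upper bounds of the `range(...)` calls in these cells are consistent with the review totals quoted in the comments if each results page holds ten reviews: 107 Spanish Sopra Steria reviews need 11 pages and 41 English ones need 5. That page size is inferred from the numbers in this notebook, not stated by Glassdoor. Under that assumption, the bound could be computed instead of typed by hand. `REVIEWS_PER_PAGE` and `page_numbers` are hypothetical names.

```python
import math

REVIEWS_PER_PAGE = 10  # assumption inferred from the totals and ranges in this notebook


def page_numbers(total_reviews, per_page=REVIEWS_PER_PAGE):
    """Return the 1-based page numbers needed to cover `total_reviews` reviews."""
    n_pages = math.ceil(total_reviews / per_page)
    return range(1, n_pages + 1)


# Sopra Steria totals quoted in the cell above
print(list(page_numbers(107))[-1])  # 11, matching range(1, 12)
print(list(page_numbers(41))[-1])   # 5, matching range(1, 6)
```

The loop header would then read `for i in page_numbers(107):`, which keeps the page count tied to the review total it comes from.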

In [15]:
#Launching the webdriver

driver = webdriver.Firefox(executable_path = "/home/dsc/Selenium/Firefox_driver/geckodriver")

driver.get("https://www.glassdoor.es/Opiniones/Glovo-Espa%C3%B1a-Opiniones-EI_IE1080730.0,5_IL.6,12_IN219.htm?filter.iso3Language=spa")

time.sleep(5)  #wait 5 seconds for the page to load

cookies_button = driver.find_element_by_id("onetrust-accept-btn-handler").click() 

#log_in = driver.find_element_by_class_name("sign-in").click()
log_in = driver.find_element_by_id("SignInButton").click() 
username = driver.find_element_by_name("username").send_keys("micasbot@gmail.com")
userpwd = driver.find_element_by_name("password").send_keys("Fofito88")
login_enter = driver.find_element_by_name("submit").click()

#Scraping Glovo: Spanish reviews (148 in total) + English reviews (160 in total)
#Creating a dataframe for all Glovo reviews across Spain

Glovo_reviews = pd.DataFrame(columns = col)

#Iterate over the page URLs so that each page can be scraped in turn

#Spanish reviews:
Glovorev_esp = pd.DataFrame(columns = col)
url_list1 = []
for i in range(1, 16):
    #The goal is to iterate over every page and reach all of its reviews
    url = ("https://www.glassdoor.es/Opiniones/Glovo-Espa%C3%B1a-Opiniones-EI_IE1080730.0,5_IL.6,12_IN219_IP" + str(i) + ".htm?filter.iso3Language=spa")
    page = requests.get(url, headers=headers)  #request the page
    soup = BeautifulSoup(page.content, "html.parser")  #parse the downloaded HTML into an object tree
    Glovorev_esp_containers = soup.find_all("div", {"class": "gdReview"})   #select one container per review
    
    #Extract the fields of each review:
    for container in Glovorev_esp_containers:
        
        com = "Glovo"
        
        Date = container.find_all("time", {"class": "date subtle small"})
        if len(Date) != 0:
          da = Date[0].text.strip()
        else:
          da = "NaN"

        Title = container.find_all("h2", {"class": "h2 summary strong mb-xsm mt-0"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        except:
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        
        Rating =  container.find_all("div", {"class": "v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small"})
        if len(Rating) != 0:
          rat = Rating[0].text.strip()
        else:
          rat = "NaN"  
    
        Role = container.find_all("span", {"class": "authorJobTitle middle "})
        if len(Role) != 0:
          rol = Role[0].text.strip()
        else:
          rol = "NaN"
                                           
        Location = container.find_all("span", {"class": "authorLocation"})
        if len(Location) != 0:
          loc = Location[0].text.strip()
        else:
          loc = "NaN"                                   
        
        Pros = container.find_all("span", {"data-test": "pros"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        except:
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        
        Contras = container.find_all("span", {"data-test": "cons"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        except:
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        
        lan = "Español"

        data1 = pd.DataFrame([[com, da, ti, rat, rol, loc, pro, con, lan]]) 
        data1.columns = col
        Glovorev_esp = Glovorev_esp.append(data1, ignore_index=True)  #add the extracted row to the dataframe

        
#English reviews:
Glovorev_eng = pd.DataFrame(columns = col)
url_list2 = []
for i in range(1, 17):
    url = ("https://www.glassdoor.es/Opiniones/Glovo-Espa%C3%B1a-Opiniones-EI_IE1080730.0,5_IL.6,12_IN219_IP" + str(i) + ".htm?filter.iso3Language=eng")
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.content, "html.parser")
    Glovorev_eng_containers = soup.find_all("div", {"class": "gdReview"})
                                                  
    for container in Glovorev_eng_containers:
        
        com = "Glovo"
        
        Date = container.find_all("time", {"class": "date subtle small"})
        if len(Date) != 0:
          da = Date[0].text.strip()
        else:
          da = "NaN"

        Title = container.find_all("h2", {"class": "h2 summary strong mb-xsm mt-0"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        except:
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        
        Rating =  container.find_all("div", {"class": "v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small"})
        if len(Rating) != 0:
          rat = Rating[0].text.strip()
        else:
          rat = "NaN"  
    
        Role = container.find_all("span", {"class": "authorJobTitle middle "})
        if len(Role) != 0:
          rol = Role[0].text.strip()
        else:
          rol = "NaN"
                                           
        Location = container.find_all("span", {"class": "authorLocation"})
        if len(Location) != 0:
          loc = Location[0].text.strip()
        else:
          loc = "NaN"                                   
        
        Pros = container.find_all("span", {"data-test": "pros"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        except:
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        
        Contras = container.find_all("span", {"data-test": "cons"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        except:
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        
        lan = "English"

        data2 = pd.DataFrame([[com, da, ti, rat, rol, loc, pro, con, lan]])
        data2.columns = col
        Glovorev_eng = Glovorev_eng.append(data2, ignore_index=True)
        
#Creating a dataframe for all Glovo reviews across Spain
Glovo_reviews = Glovorev_esp.append(Glovorev_eng, ignore_index=True)

#Creating a 'csv' file with the 'Glovo_reviews' dataframe
Glovo_reviews.to_csv('Glovo_reviews.csv')

driver.quit()
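
The page URLs built in these cells differ only in the company-specific path, the page index after `_IP`, and the `filter.iso3Language` value. A small sketch of a builder that makes those three parts explicit is given below. The function name `review_page_url` and the constants are hypothetical, while the URL pieces are copied from the Glovo cell above.

```python
BASE = "https://www.glassdoor.es/Opiniones/"


def review_page_url(slug, page, language):
    """Build the URL of one results page.

    slug     -- company-specific part of the path, copied from the notebook
    page     -- 1-based page index appended after "_IP"
    language -- value of the filter.iso3Language parameter ("spa" or "eng")
    """
    return f"{BASE}{slug}_IP{page}.htm?filter.iso3Language={language}"


GLOVO_SLUG = "Glovo-Espa%C3%B1a-Opiniones-EI_IE1080730.0,5_IL.6,12_IN219"
print(review_page_url(GLOVO_SLUG, 1, "spa"))
```

The same function would cover both language loops, since only the `language` argument changes between them.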

In [17]:
#Launching the webdriver

driver = webdriver.Firefox(executable_path = "/home/dsc/Selenium/Firefox_driver/geckodriver")

driver.get("https://www.glassdoor.es/Opiniones/eDreams-ODIGEO-Espa%C3%B1a-Opiniones-EI_IE12822.0,14_IL.15,21_IN219.htm?filter.iso3Language=spa")

time.sleep(5)  #wait 5 seconds for the page to load

cookies_button = driver.find_element_by_id("onetrust-accept-btn-handler").click() 

#log_in = driver.find_element_by_class_name("sign-in").click()
log_in = driver.find_element_by_id("SignInButton").click() 
username = driver.find_element_by_name("username").send_keys("micasbot@gmail.com")
userpwd = driver.find_element_by_name("password").send_keys("Fofito88")
login_enter = driver.find_element_by_name("submit").click()

#Scraping eDreams: Spanish reviews (156 in total) + English reviews (137 in total)
#Creating a dataframe for all eDreams reviews across Spain

eDreams_reviews = pd.DataFrame(columns = col)

#Iterate over the page URLs so that each page can be scraped in turn

#Spanish reviews:
eDreamsrev_esp = pd.DataFrame(columns = col)
url_list1 = []
for i in range(1, 17):
    #The goal is to iterate over every page and reach all of its reviews
    url = ("https://www.glassdoor.es/Opiniones/eDreams-ODIGEO-Espa%C3%B1a-Opiniones-EI_IE12822.0,14_IL.15,21_IN219_IP" + str(i) + ".htm?filter.iso3Language=spa")
    page = requests.get(url, headers=headers)  #request the page
    soup = BeautifulSoup(page.content, "html.parser")  #parse the downloaded HTML into an object tree
    eDreamsrev_esp_containers = soup.find_all("div", {"class": "gdReview"})   #select one container per review
    
    #Extract the fields of each review:
    for container in eDreamsrev_esp_containers:
        
        com = "eDreams"
        
        Date = container.find_all("time", {"class": "date subtle small"})
        if len(Date) != 0:
          da = Date[0].text.strip()
        else:
          da = "NaN"

        Title = container.find_all("h2", {"class": "h2 summary strong mb-xsm mt-0"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        except:
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        
        Rating =  container.find_all("div", {"class": "v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small"})
        if len(Rating) != 0:
          rat = Rating[0].text.strip()
        else:
          rat = "NaN"  
    
        Role = container.find_all("span", {"class": "authorJobTitle middle "})
        if len(Role) != 0:
          rol = Role[0].text.strip()
        else:
          rol = "NaN"
                                           
        Location = container.find_all("span", {"class": "authorLocation"})
        if len(Location) != 0:
          loc = Location[0].text.strip()
        else:
          loc = "NaN"                                   
        
        Pros = container.find_all("span", {"data-test": "pros"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        except:
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        
        Contras = container.find_all("span", {"data-test": "cons"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        except:
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        
        lan = "Español"

        data1 = pd.DataFrame([[com, da, ti, rat, rol, loc, pro, con, lan]]) 
        data1.columns = col
        eDreamsrev_esp = eDreamsrev_esp.append(data1, ignore_index=True)  #add the extracted row to the dataframe

        
#English reviews:
eDreamsrev_eng = pd.DataFrame(columns = col)
url_list2 = []
for i in range(1, 15):
    url = ("https://www.glassdoor.es/Opiniones/eDreams-ODIGEO-Espa%C3%B1a-Opiniones-EI_IE12822.0,14_IL.15,21_IN219_IP" + str(i) + ".htm?filter.iso3Language=eng")
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.content, "html.parser")
    eDreamsrev_eng_containers = soup.find_all("div", {"class": "gdReview"})
                                                  
    for container in eDreamsrev_eng_containers:
        
        com = "eDreams"
        
        Date = container.find_all("time", {"class": "date subtle small"})
        if len(Date) != 0:
          da = Date[0].text.strip()
        else:
          da = "NaN"

        Title = container.find_all("h2", {"class": "h2 summary strong mb-xsm mt-0"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        except:
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        
        Rating =  container.find_all("div", {"class": "v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small"})
        if len(Rating) != 0:
          rat = Rating[0].text.strip()
        else:
          rat = "NaN"  
    
        Role = container.find_all("span", {"class": "authorJobTitle middle "})
        if len(Role) != 0:
          rol = Role[0].text.strip()
        else:
          rol = "NaN"
                                           
        Location = container.find_all("span", {"class": "authorLocation"})
        if len(Location) != 0:
          loc = Location[0].text.strip()
        else:
          loc = "NaN"                                   
        
        Pros = container.find_all("span", {"data-test": "pros"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        except:
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        
        Contras = container.find_all("span", {"data-test": "cons"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        except:
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        
        lan = "English"

        data2 = pd.DataFrame([[com, da, ti, rat, rol, loc, pro, con, lan]])
        data2.columns = col
        eDreamsrev_eng = eDreamsrev_eng.append(data2, ignore_index=True)
        
#Creating a dataframe for all eDreams reviews across Spain
eDreams_reviews = eDreamsrev_esp.append(eDreamsrev_eng, ignore_index=True)

#Creating a 'csv' file with the 'eDreams_reviews' dataframe
eDreams_reviews.to_csv('eDreams_reviews.csv')

driver.quit()
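
These cells grow each dataframe with `DataFrame.append`, one row at a time. That method was deprecated in pandas 1.4 and removed in pandas 2.0, so on a recent installation the cells would need an alternative. One possible sketch: collect the rows in a plain list, build the dataframe once after the loop, and join the per-language frames with `pd.concat`. The column names below are placeholders (the notebook's own `col` list is defined in an earlier cell) and the rows are made-up examples, not scraped data.

```python
import pandas as pd

# Placeholder column names; the notebook's own `col` list is defined in an earlier cell.
columns = ["Company", "Date", "Title", "Rating", "Role", "Location", "Pros", "Cons", "Language"]

rows = []  # one list per review, appended inside the scraping loop
rows.append(["eDreams", "NaN", "Example title", "4.0", "NaN", "NaN", "Example pro", "Example con", "Español"])
rows.append(["eDreams", "NaN", "Another title", "3.0", "NaN", "NaN", "Another pro", "Another con", "English"])

# Build the dataframe once, after the loop, instead of appending row by row.
reviews = pd.DataFrame(rows, columns=columns)

# Joining two dataframes with concat plays the role of
# `esp.append(eng, ignore_index=True)` in the cells above.
combined = pd.concat([reviews.iloc[:1], reviews.iloc[1:]], ignore_index=True)
print(combined.shape)
```

Building the frame once also avoids copying the whole table on every review, which matters little at a few hundred rows but keeps the loop body shorter.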

In [18]:
#Launching the webdriver

driver = webdriver.Firefox(executable_path = "/home/dsc/Selenium/Firefox_driver/geckodriver")

driver.get("https://www.glassdoor.es/Opiniones/Amadeus-Espa%C3%B1a-Opiniones-EI_IE6940.0,7_IL.8,14_IN219.htm?filter.iso3Language=spa")

time.sleep(5)  #wait 5 seconds for the page to load

cookies_button = driver.find_element_by_id("onetrust-accept-btn-handler").click() 

#log_in = driver.find_element_by_class_name("sign-in").click()
log_in = driver.find_element_by_id("SignInButton").click() 
username = driver.find_element_by_name("username").send_keys("micasbot@gmail.com")
userpwd = driver.find_element_by_name("password").send_keys("Fofito88")
login_enter = driver.find_element_by_name("submit").click()

#Scraping Amadeus: Spanish reviews (135 in total) + English reviews (136 in total)
#Creating a dataframe for all Amadeus reviews across Spain

Amadeus_reviews = pd.DataFrame(columns = col)

#Iterate over the page URLs so that each page can be scraped in turn

#Spanish reviews:
Amadeusrev_esp = pd.DataFrame(columns = col)
url_list1 = []
for i in range(1, 15):
    #The goal is to iterate over every page and reach all of its reviews
    url = ("https://www.glassdoor.es/Opiniones/Amadeus-Espa%C3%B1a-Opiniones-EI_IE6940.0,7_IL.8,14_IN219_IP" + str(i) + ".htm?filter.iso3Language=spa")
    page = requests.get(url, headers=headers)  #request the page
    soup = BeautifulSoup(page.content, "html.parser")  #parse the downloaded HTML into an object tree
    Amadeusrev_esp_containers = soup.find_all("div", {"class": "gdReview"})   #select one container per review
    
    #Extract the fields of each review:
    for container in Amadeusrev_esp_containers:
        
        com = "Amadeus"
        
        Date = container.find_all("time", {"class": "date subtle small"})
        if len(Date) != 0:
          da = Date[0].text.strip()
        else:
          da = "NaN"

        Title = container.find_all("h2", {"class": "h2 summary strong mb-xsm mt-0"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        except:
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        
        Rating =  container.find_all("div", {"class": "v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small"})
        if len(Rating) != 0:
          rat = Rating[0].text.strip()
        else:
          rat = "NaN"  
    
        Role = container.find_all("span", {"class": "authorJobTitle middle "})
        if len(Role) != 0:
          rol = Role[0].text.strip()
        else:
          rol = "NaN"
                                           
        Location = container.find_all("span", {"class": "authorLocation"})
        if len(Location) != 0:
          loc = Location[0].text.strip()
        else:
          loc = "NaN"                                   
        
        Pros = container.find_all("span", {"data-test": "pros"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        except:
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        
        Contras = container.find_all("span", {"data-test": "cons"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        except:
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        
        lan = "Español"

        data1 = pd.DataFrame([[com, da, ti, rat, rol, loc, pro, con, lan]]) 
        data1.columns = col
        Amadeusrev_esp = Amadeusrev_esp.append(data1, ignore_index=True)  #add the extracted row to the dataframe

        
#English reviews:
Amadeusrev_eng = pd.DataFrame(columns = col)
url_list2 = []
for i in range(1, 15):
    url = ("https://www.glassdoor.es/Opiniones/Amadeus-Espa%C3%B1a-Opiniones-EI_IE6940.0,7_IL.8,14_IN219_IP" + str(i) + ".htm?filter.iso3Language=eng")
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.content, "html.parser")
    Amadeusrev_eng_containers = soup.find_all("div", {"class": "gdReview"})
                                                  
    for container in Amadeusrev_eng_containers:
        
        com = "Amadeus"
        
        Date = container.find_all("time", {"class": "date subtle small"})
        if len(Date) != 0:
          da = Date[0].text.strip()
        else:
          da = "NaN"

        Title = container.find_all("h2", {"class": "h2 summary strong mb-xsm mt-0"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        except:
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        
        Rating =  container.find_all("div", {"class": "v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small"})
        if len(Rating) != 0:
          rat = Rating[0].text.strip()
        else:
          rat = "NaN"  
    
        Role = container.find_all("span", {"class": "authorJobTitle middle "})
        if len(Role) != 0:
          rol = Role[0].text.strip()
        else:
          rol = "NaN"
                                           
        Location = container.find_all("span", {"class": "authorLocation"})
        if len(Location) != 0:
          loc = Location[0].text.strip()
        else:
          loc = "NaN"                                   
        
        Pros = container.find_all("span", {"data-test": "pros"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        except:
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        
        Contras = container.find_all("span", {"data-test": "cons"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        except:
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        
        lan = "English"

        data2 = pd.DataFrame([[com, da, ti, rat, rol, loc, pro, con, lan]])
        data2.columns = col
        Amadeusrev_eng = Amadeusrev_eng.append(data2, ignore_index=True)
        
#Creating a dataframe for all Amadeus reviews across Spain
Amadeus_reviews = Amadeusrev_esp.append(Amadeusrev_eng, ignore_index=True)

#Creating a 'csv' file with the 'Amadeus_reviews' dataframe
Amadeus_reviews.to_csv('Amadeus_reviews.csv')

driver.quit()
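
Every field in the loops above follows the same pattern: take the first element that `find_all` returned, or fall back to the string "NaN". A minimal sketch of how that pattern could be factored into one helper follows. `first_text` is a hypothetical name that does not appear in the notebook, and the stand-in objects only mimic the `.text` attribute of Beautiful Soup results.

```python
from types import SimpleNamespace


def first_text(matches, default="NaN"):
    """Return the stripped text of the first match, or `default` if there is none."""
    if len(matches) != 0:
        return matches[0].text.strip()
    return default


# Stand-in objects with a .text attribute, mimicking what find_all returns
found = [SimpleNamespace(text="  Data Analyst  ")]
print(first_text(found))
print(first_text([]))
```

With such a helper, a field extraction would reduce to something like `rat = first_text(Rating)`.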

In [19]:
#Launching the webdriver

driver = webdriver.Firefox(executable_path = "/home/dsc/Selenium/Firefox_driver/geckodriver")

driver.get("https://www.glassdoor.es/Opiniones/Amazon-Espa%C3%B1a-Opiniones-EI_IE6036.0,6_IL.7,13_IN219.htm?filter.iso3Language=spa")

time.sleep(5)  #wait 5 seconds for the page to load

cookies_button = driver.find_element_by_id("onetrust-accept-btn-handler").click() 

#log_in = driver.find_element_by_class_name("sign-in").click()
log_in = driver.find_element_by_id("SignInButton").click() 
username = driver.find_element_by_name("username").send_keys("micasbot@gmail.com")
userpwd = driver.find_element_by_name("password").send_keys("Fofito88")
login_enter = driver.find_element_by_name("submit").click()

#Scraping Amazon: Spanish reviews (288 in total) + English reviews (261 in total)
#Creating a dataframe for all Amazon reviews across Spain

Amazon_reviews = pd.DataFrame(columns = col)

#Iterate over each page URL in turn so that every page can be scraped one by one, in order

#Spanish reviews:
Amazonrev_esp = pd.DataFrame(columns = col)
url_list1 = []
for i in range(1, 30):
    #Iterate over every page to reach all of its reviews
    url = ("https://www.glassdoor.es/Opiniones/Amazon-Espa%C3%B1a-Opiniones-EI_IE6036.0,6_IL.7,13_IN219_IP" + str(i) + ".htm?filter.iso3Language=spa")
    page = requests.get(url, headers=headers)  #request the page
    soup = BeautifulSoup(page.content, "html.parser")  #parse the downloaded HTML
    Amazonrev_esp_containers = soup.find_all("div", {"class": "gdReview"})   #collect the review containers from the parsed tree
    
    #Extract the fields for each review:
    for container in Amazonrev_esp_containers:
        
        com = "Amazon"
        
        Date = container.find_all("time", {"class": "date subtle small"})
        if len(Date) != 0:
          da = Date[0].text.strip()
        else:
          da = "NaN"

        Title = container.find_all("h2", {"class": "h2 summary strong mb-xsm mt-0"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        except:
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        
        Rating =  container.find_all("div", {"class": "v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small"})
        if len(Rating) != 0:
          rat = Rating[0].text.strip()
        else:
          rat = "NaN"  
    
        Role = container.find_all("span", {"class": "authorJobTitle middle "})
        if len(Role) != 0:
          rol = Role[0].text.strip()
        else:
          rol = "NaN"
                                           
        Location = container.find_all("span", {"class": "authorLocation"})
        if len(Location) != 0:
          loc = Location[0].text.strip()
        else:
          loc = "NaN"                                   
        
        Pros = container.find_all("span", {"data-test": "pros"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        except:
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        
        Contras = container.find_all("span", {"data-test": "cons"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        except:
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        
        lan = "Español"

        data1 = pd.DataFrame([[com, da, ti, rat, rol, loc, pro, con, lan]]) 
        data1.columns = col
        Amazonrev_esp = Amazonrev_esp.append(data1, ignore_index=True)  #add this review's row to the dataframe

        
#English reviews:
Amazonrev_eng = pd.DataFrame(columns = col)
url_list2 = []
for i in range(1, 28):
    url = ("https://www.glassdoor.es/Opiniones/Amazon-Espa%C3%B1a-Opiniones-EI_IE6036.0,6_IL.7,13_IN219_IP" + str(i) + ".htm?filter.iso3Language=eng")
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.content, "html.parser")
    Amazonrev_eng_containers = soup.find_all("div", {"class": "gdReview"})
                                                  
    for container in Amazonrev_eng_containers:
        
        com = "Amazon"
        
        Date = container.find_all("time", {"class": "date subtle small"})
        if len(Date) != 0:
          da = Date[0].text.strip()
        else:
          da = "NaN"

        Title = container.find_all("h2", {"class": "h2 summary strong mb-xsm mt-0"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        except:
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        
        Rating =  container.find_all("div", {"class": "v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small"})
        if len(Rating) != 0:
          rat = Rating[0].text.strip()
        else:
          rat = "NaN"  
    
        Role = container.find_all("span", {"class": "authorJobTitle middle "})
        if len(Role) != 0:
          rol = Role[0].text.strip()
        else:
          rol = "NaN"
                                           
        Location = container.find_all("span", {"class": "authorLocation"})
        if len(Location) != 0:
          loc = Location[0].text.strip()
        else:
          loc = "NaN"                                   
        
        Pros = container.find_all("span", {"data-test": "pros"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        except:
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        
        Contras = container.find_all("span", {"data-test": "cons"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        except:
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        
        lan = "English"

        data2 = pd.DataFrame([[com, da, ti, rat, rol, loc, pro, con, lan]])
        data2.columns = col
        Amazonrev_eng = Amazonrev_eng.append(data2, ignore_index=True)
        
#Creating a dataframe for all Amazon reviews across Spain
Amazon_reviews = Amazonrev_esp.append(Amazonrev_eng, ignore_index=True)

#Creating a 'csv' file with the 'Amazon_reviews' dataframe
Amazon_reviews.to_csv('Amazon_reviews.csv')

driver.quit()
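
The loop bounds in the cell above (`range(1, 30)` for 288 Spanish reviews, `range(1, 28)` for 261 English ones) are consistent with 10 reviews per results page. That page size is an inference from these numbers, not something the notebook states, so the sketch below treats it as an assumption. `page_range` is a hypothetical helper.

```python
import math

REVIEWS_PER_PAGE = 10  # assumption: inferred from the review counts and loop bounds


def page_range(total_reviews, per_page=REVIEWS_PER_PAGE):
    """Return the 1-based page numbers needed to cover `total_reviews`."""
    n_pages = math.ceil(total_reviews / per_page)
    return range(1, n_pages + 1)


print(page_range(288))
print(page_range(261))
```

Deriving the bounds this way would avoid editing each `range(...)` by hand when the review counts change.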

In [20]:
#Launching the webdriver

driver = webdriver.Firefox(executable_path = "/home/dsc/Selenium/Firefox_driver/geckodriver")

driver.get("https://www.glassdoor.es/Opiniones/Nestl%C3%A9-Espa%C3%B1a-Opiniones-EI_IE3492.0,6_IL.7,13_IN219.htm?filter.iso3Language=spa")

time.sleep(5)  #wait 5 seconds for the page to load

cookies_button = driver.find_element_by_id("onetrust-accept-btn-handler").click() 

#log_in = driver.find_element_by_class_name("sign-in").click()
log_in = driver.find_element_by_id("SignInButton").click() 
username = driver.find_element_by_name("username").send_keys("micasbot@gmail.com")
userpwd = driver.find_element_by_name("password").send_keys("Fofito88")
login_enter = driver.find_element_by_name("submit").click()

#Scraping Nestle: Spanish reviews (76 in total) + English reviews (47 in total)
#Creating a dataframe for all Nestle reviews across Spain

Nestle_reviews = pd.DataFrame(columns = col)

#Iterate over each page URL in turn so that every page can be scraped one by one, in order

#Spanish reviews:
Nestlerev_esp = pd.DataFrame(columns = col)
url_list1 = []
for i in range(1, 9):
    #Iterate over every page to reach all of its reviews
    url = ("https://www.glassdoor.es/Opiniones/Nestl%C3%A9-Espa%C3%B1a-Opiniones-EI_IE3492.0,6_IL.7,13_IN219_IP" + str(i) + ".htm?filter.iso3Language=spa")
    page = requests.get(url, headers=headers)  #request the page
    soup = BeautifulSoup(page.content, "html.parser")  #parse the downloaded HTML
    Nestlerev_esp_containers = soup.find_all("div", {"class": "gdReview"})   #collect the review containers from the parsed tree
    
    #Extract the fields for each review:
    for container in Nestlerev_esp_containers:
        
        com = "Nestle"
        
        Date = container.find_all("time", {"class": "date subtle small"})
        if len(Date) != 0:
          da = Date[0].text.strip()
        else:
          da = "NaN"

        Title = container.find_all("h2", {"class": "h2 summary strong mb-xsm mt-0"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        except:
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        
        Rating =  container.find_all("div", {"class": "v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small"})
        if len(Rating) != 0:
          rat = Rating[0].text.strip()
        else:
          rat = "NaN"  
    
        Role = container.find_all("span", {"class": "authorJobTitle middle "})
        if len(Role) != 0:
          rol = Role[0].text.strip()
        else:
          rol = "NaN"
                                           
        Location = container.find_all("span", {"class": "authorLocation"})
        if len(Location) != 0:
          loc = Location[0].text.strip()
        else:
          loc = "NaN"                                   
        
        Pros = container.find_all("span", {"data-test": "pros"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        except:
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        
        Contras = container.find_all("span", {"data-test": "cons"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        except:
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        
        lan = "Español"

        data1 = pd.DataFrame([[com, da, ti, rat, rol, loc, pro, con, lan]]) 
        data1.columns = col
        Nestlerev_esp = Nestlerev_esp.append(data1, ignore_index=True)  #add this review's row to the dataframe

        
#English reviews:
Nestlerev_eng = pd.DataFrame(columns = col)
url_list2 = []
for i in range(1, 6):
    url = ("https://www.glassdoor.es/Opiniones/Nestl%C3%A9-Espa%C3%B1a-Opiniones-EI_IE3492.0,6_IL.7,13_IN219_IP" + str(i) + ".htm?filter.iso3Language=eng")
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.content, "html.parser")
    Nestlerev_eng_containers = soup.find_all("div", {"class": "gdReview"})
                                                  
    for container in Nestlerev_eng_containers:
        
        com = "Nestle"
        
        Date = container.find_all("time", {"class": "date subtle small"})
        if len(Date) != 0:
          da = Date[0].text.strip()
        else:
          da = "NaN"

        Title = container.find_all("h2", {"class": "h2 summary strong mb-xsm mt-0"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        except:
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        
        Rating =  container.find_all("div", {"class": "v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small"})
        if len(Rating) != 0:
          rat = Rating[0].text.strip()
        else:
          rat = "NaN"  
    
        Role = container.find_all("span", {"class": "authorJobTitle middle "})
        if len(Role) != 0:
          rol = Role[0].text.strip()
        else:
          rol = "NaN"
                                           
        Location = container.find_all("span", {"class": "authorLocation"})
        if len(Location) != 0:
          loc = Location[0].text.strip()
        else:
          loc = "NaN"                                   
        
        Pros = container.find_all("span", {"data-test": "pros"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        except:
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        
        Contras = container.find_all("span", {"data-test": "cons"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        except:
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        
        lan = "English"

        data2 = pd.DataFrame([[com, da, ti, rat, rol, loc, pro, con, lan]])
        data2.columns = col
        Nestlerev_eng = Nestlerev_eng.append(data2, ignore_index=True)
        
#Creating a dataframe for all Nestle reviews across Spain
Nestle_reviews = Nestlerev_esp.append(Nestlerev_eng, ignore_index=True)

#Creating a 'csv' file with the 'Nestle_reviews' dataframe
Nestle_reviews.to_csv('Nestle_reviews.csv')

driver.quit()
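
The login step in these cells writes the account e-mail and password directly into the notebook. One possible alternative, sketched below, is to read them from environment variables. The variable names `GLASSDOOR_USER` and `GLASSDOOR_PASSWORD` and the helper `load_credentials` are assumptions made for this example, not part of the original notebook.

```python
import os


def load_credentials(env=None):
    """Read the login e-mail and password from a mapping (os.environ by default)."""
    env = os.environ if env is None else env
    user = env.get("GLASSDOOR_USER")          # hypothetical variable name
    password = env.get("GLASSDOOR_PASSWORD")  # hypothetical variable name
    if not user or not password:
        raise RuntimeError("Set GLASSDOOR_USER and GLASSDOOR_PASSWORD first")
    return user, password


# Demonstration with an explicit mapping instead of the real environment
user, password = load_credentials({
    "GLASSDOOR_USER": "me@example.com",
    "GLASSDOOR_PASSWORD": "not-a-real-password",
})
print(user)
```

The returned values could then be passed to `send_keys(user)` and `send_keys(password)` in place of the string literals.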

In [21]:
#Launching the webdriver

driver = webdriver.Firefox(executable_path = "/home/dsc/Selenium/Firefox_driver/geckodriver")

driver.get("https://www.glassdoor.es/Opiniones/Novartis-Espa%C3%B1a-Opiniones-EI_IE6667.0,8_IL.9,15_IN219.htm?filter.iso3Language=spa")

time.sleep(5)  #wait 5 seconds for the page to load

cookies_button = driver.find_element_by_id("onetrust-accept-btn-handler").click() 

#log_in = driver.find_element_by_class_name("sign-in").click()
log_in = driver.find_element_by_id("SignInButton").click() 
username = driver.find_element_by_name("username").send_keys("micasbot@gmail.com")
userpwd = driver.find_element_by_name("password").send_keys("Fofito88")
login_enter = driver.find_element_by_name("submit").click()

#Scraping Novartis: Spanish reviews (43 in total) + English reviews (22 in total)
#Creating a dataframe for all Novartis reviews across Spain

Novartis_reviews = pd.DataFrame(columns = col)

#Iterate over each page URL in turn so that every page can be scraped one by one, in order

#Spanish reviews:
Novartisrev_esp = pd.DataFrame(columns = col)
url_list1 = []
for i in range(1, 6):
    #Iterate over every page to reach all of its reviews
    url = ("https://www.glassdoor.es/Opiniones/Novartis-Espa%C3%B1a-Opiniones-EI_IE6667.0,8_IL.9,15_IN219_IP" + str(i) + ".htm?filter.iso3Language=spa")
    page = requests.get(url, headers=headers)  #request the page
    soup = BeautifulSoup(page.content, "html.parser")  #parse the downloaded HTML
    Novartisrev_esp_containers = soup.find_all("div", {"class": "gdReview"})   #collect the review containers from the parsed tree
    
    #Extract the fields for each review:
    for container in Novartisrev_esp_containers:
        
        com = "Novartis"
        
        Date = container.find_all("time", {"class": "date subtle small"})
        if len(Date) != 0:
          da = Date[0].text.strip()
        else:
          da = "NaN"

        Title = container.find_all("h2", {"class": "h2 summary strong mb-xsm mt-0"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        except:
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        
        Rating =  container.find_all("div", {"class": "v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small"})
        if len(Rating) != 0:
          rat = Rating[0].text.strip()
        else:
          rat = "NaN"  
    
        Role = container.find_all("span", {"class": "authorJobTitle middle "})
        if len(Role) != 0:
          rol = Role[0].text.strip()
        else:
          rol = "NaN"
                                           
        Location = container.find_all("span", {"class": "authorLocation"})
        if len(Location) != 0:
          loc = Location[0].text.strip()
        else:
          loc = "NaN"                                   
        
        Pros = container.find_all("span", {"data-test": "pros"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        except:
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        
        Contras = container.find_all("span", {"data-test": "cons"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        except:
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        
        lan = "Español"

        data1 = pd.DataFrame([[com, da, ti, rat, rol, loc, pro, con, lan]]) 
        data1.columns = col
        Novartisrev_esp = Novartisrev_esp.append(data1, ignore_index=True)  #add this review's row to the dataframe

        
#English reviews:
Novartisrev_eng = pd.DataFrame(columns = col)
url_list2 = []
for i in range(1, 4):
    url = ("https://www.glassdoor.es/Opiniones/Novartis-Espa%C3%B1a-Opiniones-EI_IE6667.0,8_IL.9,15_IN219_IP" + str(i) + ".htm?filter.iso3Language=eng")
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.content, "html.parser")
    Novartisrev_eng_containers = soup.find_all("div", {"class": "gdReview"})
                                                  
    for container in Novartisrev_eng_containers:
        
        com = "Novartis"
        
        Date = container.find_all("time", {"class": "date subtle small"})
        if len(Date) != 0:
          da = Date[0].text.strip()
        else:
          da = "NaN"

        Title = container.find_all("h2", {"class": "h2 summary strong mb-xsm mt-0"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        except:
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        
        Rating =  container.find_all("div", {"class": "v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small"})
        if len(Rating) != 0:
          rat = Rating[0].text.strip()
        else:
          rat = "NaN"  
    
        Role = container.find_all("span", {"class": "authorJobTitle middle "})
        if len(Role) != 0:
          rol = Role[0].text.strip()
        else:
          rol = "NaN"
                                           
        Location = container.find_all("span", {"class": "authorLocation"})
        if len(Location) != 0:
          loc = Location[0].text.strip()
        else:
          loc = "NaN"                                   
        
        Pros = container.find_all("span", {"data-test": "pros"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        except:
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        
        Contras = container.find_all("span", {"data-test": "cons"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        except:
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        
        lan = "English"

        data2 = pd.DataFrame([[com, da, ti, rat, rol, loc, pro, con, lan]])
        data2.columns = col
        Novartisrev_eng = Novartisrev_eng.append(data2, ignore_index=True)
        
#Creating a dataframe for all Novartis reviews across Spain
Novartis_reviews = Novartisrev_esp.append(Novartisrev_eng, ignore_index=True)

#Creating a 'csv' file with the 'Novartis_reviews' dataframe
Novartis_reviews.to_csv('Novartis_reviews.csv')

driver.quit()
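
Each cell rebuilds the paginated URLs by string concatenation inside the loop. The sketch below shows the same construction as a small function. `review_urls` is a hypothetical helper, and the base URL and page counts are the ones used for Novartis in the cell above.

```python
def review_urls(base, n_pages, lang):
    """Build the paginated review URLs for one company and one language filter."""
    return [
        base + "_IP" + str(i) + ".htm?filter.iso3Language=" + lang
        for i in range(1, n_pages + 1)
    ]


NOVARTIS_BASE = "https://www.glassdoor.es/Opiniones/Novartis-Espa%C3%B1a-Opiniones-EI_IE6667.0,8_IL.9,15_IN219"

spanish_urls = review_urls(NOVARTIS_BASE, 5, "spa")  # matches range(1, 6) in the cell above
english_urls = review_urls(NOVARTIS_BASE, 3, "eng")  # matches range(1, 4) in the cell above
print(spanish_urls[0])
```

Keeping the URL pattern in one place would mean that only the base URL and page count differ between companies.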

In [22]:
#Launching the webdriver

driver = webdriver.Firefox(executable_path = "/home/dsc/Selenium/Firefox_driver/geckodriver")

driver.get("https://www.glassdoor.es/Opiniones/Roche-Espa%C3%B1a-Opiniones-EI_IE3480.0,5_IL.6,12_IN219.htm?filter.iso3Language=spa")

time.sleep(5)  #wait 5 seconds for the page to load

cookies_button = driver.find_element_by_id("onetrust-accept-btn-handler").click() 

#log_in = driver.find_element_by_class_name("sign-in").click()
log_in = driver.find_element_by_id("SignInButton").click() 
username = driver.find_element_by_name("username").send_keys("micasbot@gmail.com")
userpwd = driver.find_element_by_name("password").send_keys("Fofito88")
login_enter = driver.find_element_by_name("submit").click()

#Scraping Roche: Spanish reviews (72 in total) + English reviews (55 in total)
#Creating a dataframe for all Roche reviews across Spain

Roche_reviews = pd.DataFrame(columns = col)

#Iterate over each page URL in turn so that every page can be scraped one by one, in order

#Spanish reviews:
Rocherev_esp = pd.DataFrame(columns = col)
url_list1 = []
for i in range(1, 9):
    #Iterate over every page to reach all of its reviews
    url = ("https://www.glassdoor.es/Opiniones/Roche-Espa%C3%B1a-Opiniones-EI_IE3480.0,5_IL.6,12_IN219_IP" + str(i) + ".htm?filter.iso3Language=spa")
    page = requests.get(url, headers=headers)  #request the page
    soup = BeautifulSoup(page.content, "html.parser")  #parse the downloaded HTML
    Rocherev_esp_containers = soup.find_all("div", {"class": "gdReview"})   #collect the review containers from the parsed tree
    
    #Extract the fields for each review:
    for container in Rocherev_esp_containers:
        
        com = "Roche"
        
        Date = container.find_all("time", {"class": "date subtle small"})
        if len(Date) != 0:
          da = Date[0].text.strip()
        else:
          da = "NaN"

        Title = container.find_all("h2", {"class": "h2 summary strong mb-xsm mt-0"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        except:
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        
        Rating =  container.find_all("div", {"class": "v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small"})
        if len(Rating) != 0:
          rat = Rating[0].text.strip()
        else:
          rat = "NaN"  
    
        Role = container.find_all("span", {"class": "authorJobTitle middle "})
        if len(Role) != 0:
          rol = Role[0].text.strip()
        else:
          rol = "NaN"
                                           
        Location = container.find_all("span", {"class": "authorLocation"})
        if len(Location) != 0:
          loc = Location[0].text.strip()
        else:
          loc = "NaN"                                   
        
        Pros = container.find_all("span", {"data-test": "pros"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        except:
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        
        Contras = container.find_all("span", {"data-test": "cons"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        except:
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        
        lan = "Español"

        data1 = pd.DataFrame([[com, da, ti, rat, rol, loc, pro, con, lan]]) 
        data1.columns = col
        Rocherev_esp = Rocherev_esp.append(data1, ignore_index=True)  #add this review's row to the dataframe

        
#English reviews:
Rocherev_eng = pd.DataFrame(columns = col)
url_list2 = []
for i in range(1, 7):
    url = ("https://www.glassdoor.es/Opiniones/Roche-Espa%C3%B1a-Opiniones-EI_IE3480.0,5_IL.6,12_IN219_IP" + str(i) + ".htm?filter.iso3Language=eng")
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.content, "html.parser")
    Rocherev_eng_containers = soup.find_all("div", {"class": "gdReview"})
                                                  
    for container in Rocherev_eng_containers:
        
        com = "Roche"
        
        Date = container.find_all("time", {"class": "date subtle small"})
        if len(Date) != 0:
          da = Date[0].text.strip()
        else:
          da = "NaN"

        Title = container.find_all("h2", {"class": "h2 summary strong mb-xsm mt-0"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        except:
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        
        Rating =  container.find_all("div", {"class": "v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small"})
        if len(Rating) != 0:
          rat = Rating[0].text.strip()
        else:
          rat = "NaN"  
    
        Role = container.find_all("span", {"class": "authorJobTitle middle "})
        if len(Role) != 0:
          rol = Role[0].text.strip()
        else:
          rol = "NaN"
                                           
        Location = container.find_all("span", {"class": "authorLocation"})
        if len(Location) != 0:
          loc = Location[0].text.strip()
        else:
          loc = "NaN"                                   
        
        Pros = container.find_all("span", {"data-test": "pros"}, {"lang": "es-x-mtfrom-en"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        except:
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        
        Contras = container.find_all("span", {"data-test": "cons"}, {"lang": "es-x-mtfrom-en"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        except:
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        
        lan = "English"

        data2 = pd.DataFrame([[com, da, ti, rat, rol, loc, pro, con, lan]])
        data2.columns = col
        Rocherev_eng = pd.concat([Rocherev_eng, data2], ignore_index=True)
        
#Creating a dataframe for all Roche reviews across Spain
Roche_reviews = pd.concat([Rocherev_esp, Rocherev_eng], ignore_index=True)

#Creating a 'csv' file with the 'Roche_reviews' dataframe
Roche_reviews.to_csv('Roche_reviews.csv')

driver.quit()
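
The page ranges hard-coded in these loops appear to follow from the review totals noted in the comments. For BBVA, 321 Spanish reviews go with `range(1, 34)` and 158 English reviews with `range(1, 17)`, which is consistent with ten reviews per page. Below is a minimal sketch of that arithmetic, assuming ten reviews per page. The names `pages_for` and `REVIEWS_PER_PAGE` are hypothetical and not part of the notebook.

```python
import math

# Assumption: inferred from the totals and ranges used in this notebook,
# not taken from any Glassdoor documentation.
REVIEWS_PER_PAGE = 10


def pages_for(total_reviews, per_page=REVIEWS_PER_PAGE):
    """Return the 1-based page numbers needed to cover total_reviews."""
    return range(1, math.ceil(total_reviews / per_page) + 1)


# 321 Spanish BBVA reviews need pages 1..33, matching range(1, 34) in the loop.
print(pages_for(321))
```

Deriving the range from the total avoids editing the loop bound by hand whenever the number of reviews changes.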

In [23]:
#Starting the webdriver

driver = webdriver.Firefox(executable_path = "/home/dsc/Selenium/Firefox_driver/geckodriver")

driver.get("https://www.glassdoor.es/Opiniones/BBVA-Espa%C3%B1a-Opiniones-EI_IE1237.0,4_IL.5,11_IN219.htm?filter.iso3Language=spa")

time.sleep(5)  #wait 5 seconds for the page to load

cookies_button = driver.find_element_by_id("onetrust-accept-btn-handler").click() 

#log_in = driver.find_element_by_class_name("sign-in").click()
log_in = driver.find_element_by_id("SignInButton").click() 
username = driver.find_element_by_name("username").send_keys("micasbot@gmail.com")
userpwd = driver.find_element_by_name("password").send_keys("Fofito88")
login_enter = driver.find_element_by_name("submit").click()

#Scraping BBVA: Spanish reviews (321 in total) + English reviews (158 in total)
#Creating a dataframe for all BBVA reviews across Spain

BBVA_reviews = pd.DataFrame(columns = col)

#Iterate over the page numbers to build each URL, then scrape the URLs one by one in order

#Spanish reviews:
BBVArev_esp = pd.DataFrame(columns = col)
url_list1 = []
for i in range(1, 34):
    #The goal is to iterate over every page so that all of its reviews are accessed
    url = ("https://www.glassdoor.es/Opiniones/BBVA-Espa%C3%B1a-Opiniones-EI_IE1237.0,4_IL.5,11_IN219_IP" + str(i) + ".htm?filter.iso3Language=spa")
    page = requests.get(url, headers=headers)  #request the page
    soup = BeautifulSoup(page.content, "html.parser")  #parse the downloaded HTML into an object tree
    BBVArev_esp_containers = soup.find_all("div", {"class": "gdReview"})   #collect the review containers
    
    #Run the queries for each review:
    for container in BBVArev_esp_containers:
        
        com = "BBVA"
        
        Date = container.find_all("time", {"class": "date subtle small"})
        if len(Date) != 0:
          da = Date[0].text.strip()
        else:
          da = "NaN"

        Title = container.find_all("h2", {"class": "h2 summary strong mb-xsm mt-0"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        except:
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        
        Rating =  container.find_all("div", {"class": "v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small"})
        if len(Rating) != 0:
          rat = Rating[0].text.strip()
        else:
          rat = "NaN"  
    
        Role = container.find_all("span", {"class": "authorJobTitle middle "})
        if len(Role) != 0:
          rol = Role[0].text.strip()
        else:
          rol = "NaN"
                                           
        Location = container.find_all("span", {"class": "authorLocation"})
        if len(Location) != 0:
          loc = Location[0].text.strip()
        else:
          loc = "NaN"                                   
        
        Pros = container.find_all("span", {"data-test": "pros"}, {"lang": "es-x-mtfrom-en"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        except:
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        
        Contras = container.find_all("span", {"data-test": "cons"}, {"lang": "es-x-mtfrom-en"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        except:
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        
        lan = "Español"

        data1 = pd.DataFrame([[com, da, ti, rat, rol, loc, pro, con, lan]]) 
        data1.columns = col
        BBVArev_esp = pd.concat([BBVArev_esp, data1], ignore_index=True)  #add the scraped row to the dataframe

        
#English reviews:
BBVArev_eng = pd.DataFrame(columns = col)
url_list2 = []
for i in range(1, 17):
    url = ("https://www.glassdoor.es/Opiniones/BBVA-Espa%C3%B1a-Opiniones-EI_IE1237.0,4_IL.5,11_IN219_IP" + str(i) + ".htm?filter.iso3Language=eng")
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.content, "html.parser")
    BBVArev_eng_containers = soup.find_all("div", {"class": "gdReview"})
                                                  
    for container in BBVArev_eng_containers:
        
        com = "BBVA"
        
        Date = container.find_all("time", {"class": "date subtle small"})
        if len(Date) != 0:
          da = Date[0].text.strip()
        else:
          da = "NaN"

        Title = container.find_all("h2", {"class": "h2 summary strong mb-xsm mt-0"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        except:
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        
        Rating =  container.find_all("div", {"class": "v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small"})
        if len(Rating) != 0:
          rat = Rating[0].text.strip()
        else:
          rat = "NaN"  
    
        Role = container.find_all("span", {"class": "authorJobTitle middle "})
        if len(Role) != 0:
          rol = Role[0].text.strip()
        else:
          rol = "NaN"
                                           
        Location = container.find_all("span", {"class": "authorLocation"})
        if len(Location) != 0:
          loc = Location[0].text.strip()
        else:
          loc = "NaN"                                   
        
        Pros = container.find_all("span", {"data-test": "pros"}, {"lang": "es-x-mtfrom-en"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        except:
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        
        Contras = container.find_all("span", {"data-test": "cons"}, {"lang": "es-x-mtfrom-en"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        except:
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        
        lan = "English"

        data2 = pd.DataFrame([[com, da, ti, rat, rol, loc, pro, con, lan]])
        data2.columns = col
        BBVArev_eng = pd.concat([BBVArev_eng, data2], ignore_index=True)
        
#Creating a dataframe for all BBVA reviews across Spain
BBVA_reviews = pd.concat([BBVArev_esp, BBVArev_eng], ignore_index=True)

#Creating a 'csv' file with the 'BBVA_reviews' dataframe
BBVA_reviews.to_csv('BBVA_reviews.csv')

driver.quit()
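
Every cell rebuilds its URLs by string concatenation around the page number and the language filter. A small helper can make that pattern explicit. This is only a sketch: `review_url` and `BBVA_BASE` are hypothetical names, and the URL layout is simply copied from the loops in this notebook, so it may stop working if the site changes.

```python
def review_url(base, page, language):
    """Build the URL of one page of reviews.

    base     -- everything before "_IP<n>.htm", copied from the company's review page
    page     -- 1-based page number
    language -- code used by the language filter in this notebook: "spa" or "eng"
    """
    return f"{base}_IP{page}.htm?filter.iso3Language={language}"


# Base URL copied from the BBVA cell of this notebook.
BBVA_BASE = "https://www.glassdoor.es/Opiniones/BBVA-Espa%C3%B1a-Opiniones-EI_IE1237.0,4_IL.5,11_IN219"

print(review_url(BBVA_BASE, 2, "eng"))
```

With such a helper, the Spanish and English loops differ only in the `language` argument and the number of pages.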

In [24]:
#Starting the webdriver

driver = webdriver.Firefox(executable_path = "/home/dsc/Selenium/Firefox_driver/geckodriver")

driver.get("https://www.glassdoor.es/Opiniones/CaixaBank-Espa%C3%B1a-Opiniones-EI_IE100259.0,9_IL.10,16_IN219.htm?filter.iso3Language=spa")

time.sleep(5)  #wait 5 seconds for the page to load

cookies_button = driver.find_element_by_id("onetrust-accept-btn-handler").click() 

#log_in = driver.find_element_by_class_name("sign-in").click()
log_in = driver.find_element_by_id("SignInButton").click() 
username = driver.find_element_by_name("username").send_keys("micasbot@gmail.com")
userpwd = driver.find_element_by_name("password").send_keys("Fofito88")
login_enter = driver.find_element_by_name("submit").click()

#Scraping CaixaBank: Spanish reviews (91 in total) + English reviews (28 in total)
#Creating a dataframe for all CaixaBank reviews across Spain

Caixabank_reviews = pd.DataFrame(columns = col)

#Iterate over the page numbers to build each URL, then scrape the URLs one by one in order

#Spanish reviews:
Caixabankrev_esp = pd.DataFrame(columns = col)
url_list1 = []
for i in range(1, 11):
    #The goal is to iterate over every page so that all of its reviews are accessed
    url = ("https://www.glassdoor.es/Opiniones/CaixaBank-Espa%C3%B1a-Opiniones-EI_IE100259.0,9_IL.10,16_IN219_IP" + str(i) + ".htm?filter.iso3Language=spa")
    page = requests.get(url, headers=headers)  #request the page
    soup = BeautifulSoup(page.content, "html.parser")  #parse the downloaded HTML into an object tree
    Caixabankrev_esp_containers = soup.find_all("div", {"class": "gdReview"})   #collect the review containers
    
    #Run the queries for each review:
    for container in Caixabankrev_esp_containers:
        
        com = "Caixabank"
        
        Date = container.find_all("time", {"class": "date subtle small"})
        if len(Date) != 0:
          da = Date[0].text.strip()
        else:
          da = "NaN"

        Title = container.find_all("h2", {"class": "h2 summary strong mb-xsm mt-0"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        except:
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        
        Rating =  container.find_all("div", {"class": "v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small"})
        if len(Rating) != 0:
          rat = Rating[0].text.strip()
        else:
          rat = "NaN"  
    
        Role = container.find_all("span", {"class": "authorJobTitle middle "})
        if len(Role) != 0:
          rol = Role[0].text.strip()
        else:
          rol = "NaN"
                                           
        Location = container.find_all("span", {"class": "authorLocation"})
        if len(Location) != 0:
          loc = Location[0].text.strip()
        else:
          loc = "NaN"                                   
        
        Pros = container.find_all("span", {"data-test": "pros"}, {"lang": "es-x-mtfrom-en"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        except:
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        
        Contras = container.find_all("span", {"data-test": "cons"}, {"lang": "es-x-mtfrom-en"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        except:
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        
        lan = "Español"

        data1 = pd.DataFrame([[com, da, ti, rat, rol, loc, pro, con, lan]]) 
        data1.columns = col
        Caixabankrev_esp = pd.concat([Caixabankrev_esp, data1], ignore_index=True)  #add the scraped row to the dataframe

        
#English reviews:
Caixabankrev_eng = pd.DataFrame(columns = col)
url_list2 = []
for i in range(1, 4):
    url = ("https://www.glassdoor.es/Opiniones/CaixaBank-Espa%C3%B1a-Opiniones-EI_IE100259.0,9_IL.10,16_IN219_IP" + str(i) + ".htm?filter.iso3Language=eng")
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.content, "html.parser")
    Caixabankrev_eng_containers = soup.find_all("div", {"class": "gdReview"})
                                                  
    for container in Caixabankrev_eng_containers:
        
        com = "Caixabank"
        
        Date = container.find_all("time", {"class": "date subtle small"})
        if len(Date) != 0:
          da = Date[0].text.strip()
        else:
          da = "NaN"

        Title = container.find_all("h2", {"class": "h2 summary strong mb-xsm mt-0"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        except:
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        
        Rating =  container.find_all("div", {"class": "v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small"})
        if len(Rating) != 0:
          rat = Rating[0].text.strip()
        else:
          rat = "NaN"  
    
        Role = container.find_all("span", {"class": "authorJobTitle middle "})
        if len(Role) != 0:
          rol = Role[0].text.strip()
        else:
          rol = "NaN"
                                           
        Location = container.find_all("span", {"class": "authorLocation"})
        if len(Location) != 0:
          loc = Location[0].text.strip()
        else:
          loc = "NaN"                                   
        
        Pros = container.find_all("span", {"data-test": "pros"}, {"lang": "es-x-mtfrom-en"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        except:
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        
        Contras = container.find_all("span", {"data-test": "cons"}, {"lang": "es-x-mtfrom-en"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        except:
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        
        lan = "English"

        data2 = pd.DataFrame([[com, da, ti, rat, rol, loc, pro, con, lan]])
        data2.columns = col
        Caixabankrev_eng = pd.concat([Caixabankrev_eng, data2], ignore_index=True)
        
#Creating a dataframe for all Caixabank reviews across Spain
Caixabank_reviews = pd.concat([Caixabankrev_esp, Caixabankrev_eng], ignore_index=True)

#Creating a 'csv' file with the 'Caixabank_reviews' dataframe
Caixabank_reviews.to_csv('Caixabank_reviews.csv')

driver.quit()
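
Each scraping cell types the login into the page with the account details written directly in the source. One common alternative is to read them from environment variables so they never appear in the notebook. The sketch below assumes two variables, `GLASSDOOR_USER` and `GLASSDOOR_PASSWORD`. Those names, and the helper `load_credentials`, are hypothetical and would have to be set up before running the notebook.

```python
import os


def load_credentials(env=os.environ):
    """Return (user, password) read from environment variables.

    The variable names are assumptions made for this sketch.
    """
    try:
        return env["GLASSDOOR_USER"], env["GLASSDOOR_PASSWORD"]
    except KeyError as missing:
        # Fail early with a readable message instead of sending empty credentials.
        raise RuntimeError(
            f"Set the {missing.args[0]} environment variable before running"
        ) from None


# Intended use in a scraping cell (not executed here):
# user, password = load_credentials()
# driver.find_element_by_name("username").send_keys(user)
# driver.find_element_by_name("password").send_keys(password)
```

The `env` parameter defaults to the real environment but accepts any mapping, which keeps the helper easy to check without touching actual settings.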

In [5]:
#Starting the webdriver

driver = webdriver.Firefox(executable_path = "/home/dsc/Selenium/Firefox_driver/geckodriver")

driver.get("https://www.glassdoor.es/Opiniones/Atos-Espa%C3%B1a-Opiniones-EI_IE10686.0,4_IL.5,11_IN219.htm?filter.iso3Language=spa")

time.sleep(5)  #wait 5 seconds for the page to load

cookies_button = driver.find_element_by_id("onetrust-accept-btn-handler").click() 

#log_in = driver.find_element_by_class_name("sign-in").click()
log_in = driver.find_element_by_id("SignInButton").click() 
username = driver.find_element_by_name("username").send_keys("micasbot@gmail.com")
userpwd = driver.find_element_by_name("password").send_keys("Fofito88")
login_enter = driver.find_element_by_name("submit").click()

#Scraping Atos: Spanish reviews (76 in total) + English reviews (58 in total)
#Creating a dataframe for all Atos reviews across Spain

Atos_reviews = pd.DataFrame(columns = col)

#Iterate over the page numbers to build each URL, then scrape the URLs one by one in order

#Spanish reviews:
Atosrev_esp = pd.DataFrame(columns = col)
url_list1 = []
for i in range(1, 9):
    #The goal is to iterate over every page so that all of its reviews are accessed
    url = ("https://www.glassdoor.es/Opiniones/Atos-Espa%C3%B1a-Opiniones-EI_IE10686.0,4_IL.5,11_IN219_IP" + str(i) + ".htm?filter.iso3Language=spa")
    page = requests.get(url, headers=headers)  #request the page
    soup = BeautifulSoup(page.content, "html.parser")  #parse the downloaded HTML into an object tree
    Atosrev_esp_containers = soup.find_all("div", {"class": "gdReview"})   #collect the review containers
    
    #Run the queries for each review:
    for container in Atosrev_esp_containers:
        
        com = "Atos"
        
        Date = container.find_all("time", {"class": "date subtle small"})
        if len(Date) != 0:
          da = Date[0].text.strip()
        else:
          da = "NaN"

        Title = container.find_all("h2", {"class": "h2 summary strong mb-xsm mt-0"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        except:
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        
        Rating =  container.find_all("div", {"class": "v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small"})
        if len(Rating) != 0:
          rat = Rating[0].text.strip()
        else:
          rat = "NaN"  
    
        Role = container.find_all("span", {"class": "authorJobTitle middle "})
        if len(Role) != 0:
          rol = Role[0].text.strip()
        else:
          rol = "NaN"
                                           
        Location = container.find_all("span", {"class": "authorLocation"})
        if len(Location) != 0:
          loc = Location[0].text.strip()
        else:
          loc = "NaN"                                   
        
        Pros = container.find_all("span", {"data-test": "pros"}, {"lang": "es-x-mtfrom-en"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        except:
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        
        Contras = container.find_all("span", {"data-test": "cons"}, {"lang": "es-x-mtfrom-en"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        except:
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        
        lan = "Español"

        data1 = pd.DataFrame([[com, da, ti, rat, rol, loc, pro, con, lan]]) 
        data1.columns = col
        Atosrev_esp = pd.concat([Atosrev_esp, data1], ignore_index=True)  #add the scraped row to the dataframe

        
#English reviews:
Atosrev_eng = pd.DataFrame(columns = col)
url_list2 = []
for i in range(1, 7):
    url = ("https://www.glassdoor.es/Opiniones/Atos-Espa%C3%B1a-Opiniones-EI_IE10686.0,4_IL.5,11_IN219_IP" + str(i) + ".htm?filter.iso3Language=eng")
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.content, "html.parser")
    Atosrev_eng_containers = soup.find_all("div", {"class": "gdReview"})
                                                  
    for container in Atosrev_eng_containers:
        
        com = "Atos"
        
        Date = container.find_all("time", {"class": "date subtle small"})
        if len(Date) != 0:
          da = Date[0].text.strip()
        else:
          da = "NaN"

        Title = container.find_all("h2", {"class": "h2 summary strong mb-xsm mt-0"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        except:
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        
        Rating =  container.find_all("div", {"class": "v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small"})
        if len(Rating) != 0:
          rat = Rating[0].text.strip()
        else:
          rat = "NaN"  
    
        Role = container.find_all("span", {"class": "authorJobTitle middle "})
        if len(Role) != 0:
          rol = Role[0].text.strip()
        else:
          rol = "NaN"
                                           
        Location = container.find_all("span", {"class": "authorLocation"})
        if len(Location) != 0:
          loc = Location[0].text.strip()
        else:
          loc = "NaN"                                   
        
        Pros = container.find_all("span", {"data-test": "pros"}, {"lang": "es-x-mtfrom-en"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        except:
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        
        Contras = container.find_all("span", {"data-test": "cons"}, {"lang": "es-x-mtfrom-en"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        except:
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        
        lan = "English"

        data2 = pd.DataFrame([[com, da, ti, rat, rol, loc, pro, con, lan]])
        data2.columns = col
        Atosrev_eng = pd.concat([Atosrev_eng, data2], ignore_index=True)
        
#Creating a dataframe for all Atos reviews across Spain
Atos_reviews = pd.concat([Atosrev_esp, Atosrev_eng], ignore_index=True)

#Creating a 'csv' file with the 'Atos_reviews' dataframe
Atos_reviews.to_csv('Atos_reviews.csv')

driver.quit()
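
The field extraction in these cells repeats one pattern: call `find_all`, take the text of the first match, and fall back to the string "NaN" when nothing matches. That pattern could be captured once in a helper. The function `first_text` below is a hypothetical sketch, not part of the notebook. It relies only on the container exposing `find_all(name, attrs)` and on the matches having a `.text` attribute, as BeautifulSoup tags do.

```python
def first_text(container, name, attrs, default="NaN"):
    """Return the stripped text of the first matching tag, or `default` if none matches.

    `container` is expected to be a BeautifulSoup Tag, but anything with a
    compatible find_all(name, attrs) method works.
    """
    matches = container.find_all(name, attrs)
    return matches[0].text.strip() if matches else default


# Intended use inside the per-review loop (not executed here):
# da  = first_text(container, "time", {"class": "date subtle small"})
# rol = first_text(container, "span", {"class": "authorJobTitle middle "})
# loc = first_text(container, "span", {"class": "authorLocation"})
```

Because the `try` and `except` branches in the cells extract the same value either way, the helper returns the same value as both of them; the `driver` click inside the `try` is a separate step that it does not perform.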

In [6]:
#Starting the webdriver

driver = webdriver.Firefox(executable_path = "/home/dsc/Selenium/Firefox_driver/geckodriver")

driver.get("https://www.glassdoor.es/Opiniones/Criteo-Espa%C3%B1a-Opiniones-EI_IE426672.0,6_IL.7,13_IN219.htm?filter.iso3Language=spa")

time.sleep(5)  #wait 5 seconds for the page to load

cookies_button = driver.find_element_by_id("onetrust-accept-btn-handler").click() 

#log_in = driver.find_element_by_class_name("sign-in").click()
log_in = driver.find_element_by_id("SignInButton").click() 
username = driver.find_element_by_name("username").send_keys("micasbot@gmail.com")
userpwd = driver.find_element_by_name("password").send_keys("Fofito88")
login_enter = driver.find_element_by_name("submit").click()

#Scraping Criteo: Spanish reviews (92 in total) + English reviews (129 in total)
#Creating a dataframe for all Criteo reviews across Spain

Criteo_reviews = pd.DataFrame(columns = col)

#Iterate over the page numbers to build each URL, then scrape the URLs one by one in order

#Spanish reviews:
Criteorev_esp = pd.DataFrame(columns = col)
url_list1 = []
for i in range(1, 11):
    #The goal is to iterate over every page so that all of its reviews are accessed
    url = ("https://www.glassdoor.es/Opiniones/Criteo-Espa%C3%B1a-Opiniones-EI_IE426672.0,6_IL.7,13_IN219_IP" + str(i) + ".htm?filter.iso3Language=spa")
    page = requests.get(url, headers=headers)  #request the page
    soup = BeautifulSoup(page.content, "html.parser")  #parse the downloaded HTML into an object tree
    Criteorev_esp_containers = soup.find_all("div", {"class": "gdReview"})   #collect the review containers
    
    #Run the queries for each review:
    for container in Criteorev_esp_containers:
        
        com = "Criteo"
        
        Date = container.find_all("time", {"class": "date subtle small"})
        if len(Date) != 0:
          da = Date[0].text.strip()
        else:
          da = "NaN"

        Title = container.find_all("h2", {"class": "h2 summary strong mb-xsm mt-0"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        except:
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        
        Rating =  container.find_all("div", {"class": "v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small"})
        if len(Rating) != 0:
          rat = Rating[0].text.strip()
        else:
          rat = "NaN"  
    
        Role = container.find_all("span", {"class": "authorJobTitle middle "})
        if len(Role) != 0:
          rol = Role[0].text.strip()
        else:
          rol = "NaN"
                                           
        Location = container.find_all("span", {"class": "authorLocation"})
        if len(Location) != 0:
          loc = Location[0].text.strip()
        else:
          loc = "NaN"                                   
        
        Pros = container.find_all("span", {"data-test": "pros"}, {"lang": "es-x-mtfrom-en"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        except:
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        
        Contras = container.find_all("span", {"data-test": "cons"}, {"lang": "es-x-mtfrom-en"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        except:
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        
        lan = "Español"

        data1 = pd.DataFrame([[com, da, ti, rat, rol, loc, pro, con, lan]]) 
        data1.columns = col
        Criteorev_esp = pd.concat([Criteorev_esp, data1], ignore_index=True)  #add the scraped row to the dataframe

        
#English reviews:
Criteorev_eng = pd.DataFrame(columns = col)
url_list2 = []
for i in range(1, 14):
    url = ("https://www.glassdoor.es/Opiniones/Criteo-Espa%C3%B1a-Opiniones-EI_IE426672.0,6_IL.7,13_IN219_IP" + str(i) + ".htm?filter.iso3Language=eng")
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.content, "html.parser")
    Criteorev_eng_containers = soup.find_all("div", {"class": "gdReview"})
                                                  
    for container in Criteorev_eng_containers:
        
        com = "Criteo"
        
        Date = container.find_all("time", {"class": "date subtle small"})
        if len(Date) != 0:
          da = Date[0].text.strip()
        else:
          da = "NaN"

        Title = container.find_all("h2", {"class": "h2 summary strong mb-xsm mt-0"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        except:
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        
        Rating =  container.find_all("div", {"class": "v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small"})
        if len(Rating) != 0:
          rat = Rating[0].text.strip()
        else:
          rat = "NaN"  
    
        Role = container.find_all("span", {"class": "authorJobTitle middle "})
        if len(Role) != 0:
          rol = Role[0].text.strip()
        else:
          rol = "NaN"
                                           
        Location = container.find_all("span", {"class": "authorLocation"})
        if len(Location) != 0:
          loc = Location[0].text.strip()
        else:
          loc = "NaN"                                   
        
        Pros = container.find_all("span", {"data-test": "pros"}, {"lang": "es-x-mtfrom-en"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        except:
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        
        Contras = container.find_all("span", {"data-test": "cons"}, {"lang": "es-x-mtfrom-en"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        except:
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        
        lan = "English"

        data2 = pd.DataFrame([[com, da, ti, rat, rol, loc, pro, con, lan]])
        data2.columns = col
        Criteorev_eng = Criteorev_eng.append(data2, ignore_index=True)
        
#Creating a dataframe for all Criteo reviews across Spain
Criteo_reviews = Criteorev_esp.append(Criteorev_eng, ignore_index=True)

#Creating a 'csv' file with the 'Criteo_reviews' dataframe
Criteo_reviews.to_csv('Criteo_reviews.csv')

driver.quit()
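
Each field in the loops above repeats one pattern: take the first match if there is one, otherwise fall back to the string "NaN". As a minimal sketch, that pattern could be factored into a small helper. The name `first_text` is hypothetical and not part of this notebook; the `SimpleNamespace` objects stand in for the elements returned by `find_all`, on the assumption that only their `.text` attribute is needed.

```python
from types import SimpleNamespace


def first_text(matches, default="NaN"):
    """Return the stripped text of the first match, or `default` if there is none."""
    # Hypothetical helper: `matches` is any sequence of objects exposing `.text`,
    # such as the result of container.find_all(...).
    if len(matches) != 0:
        return matches[0].text.strip()
    return default


# Stand-ins for parsed elements (assumption: only `.text` is used).
found = [SimpleNamespace(text="  Data Analyst \n")]
print(first_text(found))  # Data Analyst
print(first_text([]))     # NaN
```

With such a helper, a field like the role would reduce to `rol = first_text(container.find_all("span", {"class": "authorJobTitle middle "}))`.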

In [7]:
#Launching the webdriver

driver = webdriver.Firefox(executable_path = "/home/dsc/Selenium/Firefox_driver/geckodriver")

driver.get("https://www.glassdoor.es/Opiniones/SEAT-Espa%C3%B1a-Opiniones-EI_IE10808.0,4_IL.5,11_IN219.htm?filter.iso3Language=spa")

time.sleep(5)  #wait 5 seconds for the page to load

cookies_button = driver.find_element_by_id("onetrust-accept-btn-handler").click() 

#log_in = driver.find_element_by_class_name("sign-in").click()
log_in = driver.find_element_by_id("SignInButton").click() 
username = driver.find_element_by_name("username").send_keys("micasbot@gmail.com")
userpwd = driver.find_element_by_name("password").send_keys("Fofito88")
login_enter = driver.find_element_by_name("submit").click()

#Scraping SEAT: Spanish reviews (77 in total) + English reviews (33 in total)
#Creating a dataframe for all SEAT reviews across Spain

SEAT_reviews = pd.DataFrame(columns = col)

#Iterate over the page URLs in order so that each page can be scraped in turn

#Spanish reviews:
SEATrev_esp = pd.DataFrame(columns = col)
url_list1 = []
for i in range(1, 9):
    #Iterate over the result pages to reach all the reviews on each page
    url = ("https://www.glassdoor.es/Opiniones/SEAT-Espa%C3%B1a-Opiniones-EI_IE10808.0,4_IL.5,11_IN219_IP" + str(i) + ".htm?filter.iso3Language=spa")
    page = requests.get(url, headers=headers)  #request the page
    soup = BeautifulSoup(page.content, "html.parser")  #parse the HTML into an object tree
    SEATrev_esp_containers = soup.find_all("div", {"class": "gdReview"})   #select the review containers
    
    #Run the queries for each review:
    for container in SEATrev_esp_containers:
        
        com = "SEAT"
        
        Date = container.find_all("time", {"class": "date subtle small"})
        if len(Date) != 0:
          da = Date[0].text.strip()
        else:
          da = "NaN"

        Title = container.find_all("h2", {"class": "h2 summary strong mb-xsm mt-0"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        except:
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        
        Rating =  container.find_all("div", {"class": "v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small"})
        if len(Rating) != 0:
          rat = Rating[0].text.strip()
        else:
          rat = "NaN"  
    
        Role = container.find_all("span", {"class": "authorJobTitle middle "})
        if len(Role) != 0:
          rol = Role[0].text.strip()
        else:
          rol = "NaN"
                                           
        Location = container.find_all("span", {"class": "authorLocation"})
        if len(Location) != 0:
          loc = Location[0].text.strip()
        else:
          loc = "NaN"                                   
        
        Pros = container.find_all("span", {"data-test": "pros"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        except:
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        
        Contras = container.find_all("span", {"data-test": "cons"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        except:
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        
        lan = "Español"

        data1 = pd.DataFrame([[com, da, ti, rat, rol, loc, pro, con, lan]]) 
        data1.columns = col
        SEATrev_esp = SEATrev_esp.append(data1, ignore_index=True)  #add the collected data to the dataframe

        
#English reviews:
SEATrev_eng = pd.DataFrame(columns = col)
url_list2 = []
for i in range(1, 5):
    url = ("https://www.glassdoor.es/Opiniones/SEAT-Espa%C3%B1a-Opiniones-EI_IE10808.0,4_IL.5,11_IN219_IP" + str(i) + ".htm?filter.iso3Language=eng")
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.content, "html.parser")
    SEATrev_eng_containers = soup.find_all("div", {"class": "gdReview"})
                                                  
    for container in SEATrev_eng_containers:
        
        com = "SEAT"
        
        Date = container.find_all("time", {"class": "date subtle small"})
        if len(Date) != 0:
          da = Date[0].text.strip()
        else:
          da = "NaN"

        Title = container.find_all("h2", {"class": "h2 summary strong mb-xsm mt-0"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        except:
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        
        Rating =  container.find_all("div", {"class": "v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small"})
        if len(Rating) != 0:
          rat = Rating[0].text.strip()
        else:
          rat = "NaN"  
    
        Role = container.find_all("span", {"class": "authorJobTitle middle "})
        if len(Role) != 0:
          rol = Role[0].text.strip()
        else:
          rol = "NaN"
                                           
        Location = container.find_all("span", {"class": "authorLocation"})
        if len(Location) != 0:
          loc = Location[0].text.strip()
        else:
          loc = "NaN"                                   
        
        Pros = container.find_all("span", {"data-test": "pros"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        except:
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        
        Contras = container.find_all("span", {"data-test": "cons"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        except:
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        
        lan = "English"

        data2 = pd.DataFrame([[com, da, ti, rat, rol, loc, pro, con, lan]])
        data2.columns = col
        SEATrev_eng = SEATrev_eng.append(data2, ignore_index=True)
        
#Creating a dataframe for all SEAT reviews across Spain
SEAT_reviews = SEATrev_esp.append(SEATrev_eng, ignore_index=True)

#Creating a 'csv' file with the 'SEAT_reviews' dataframe
SEAT_reviews.to_csv('SEAT_reviews.csv')

driver.quit()
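
The upper bounds of the two loops above (`range(1, 9)` and `range(1, 5)`) are consistent with the review counts in the comment (77 and 33) if each results page holds 10 reviews. That page size is an assumption inferred from the numbers, not something the notebook states. Under that assumption, the bounds could be derived rather than typed by hand; `pages_needed` is a hypothetical name.

```python
import math

REVIEWS_PER_PAGE = 10  # assumption inferred from the loop bounds, not confirmed by the source


def pages_needed(total_reviews, per_page=REVIEWS_PER_PAGE):
    """Number of result pages required to list `total_reviews` reviews."""
    return math.ceil(total_reviews / per_page)


# SEAT: 77 Spanish and 33 English reviews
spanish_pages = range(1, pages_needed(77) + 1)
english_pages = range(1, pages_needed(33) + 1)
print(list(spanish_pages))  # [1, 2, 3, 4, 5, 6, 7, 8]
print(list(english_pages))  # [1, 2, 3, 4]
```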

In [8]:
#Launching the webdriver

driver = webdriver.Firefox(executable_path = "/home/dsc/Selenium/Firefox_driver/geckodriver")

driver.get("https://www.glassdoor.es/Opiniones/Deloitte-Espa%C3%B1a-Opiniones-EI_IE2763.0,8_IL.9,15_IN219.htm?filter.iso3Language=spa")

time.sleep(5)  #wait 5 seconds for the page to load

cookies_button = driver.find_element_by_id("onetrust-accept-btn-handler").click() 

#log_in = driver.find_element_by_class_name("sign-in").click()
log_in = driver.find_element_by_id("SignInButton").click() 
username = driver.find_element_by_name("username").send_keys("micasbot@gmail.com")
userpwd = driver.find_element_by_name("password").send_keys("Fofito88")
login_enter = driver.find_element_by_name("submit").click()

#Scraping Deloitte: Spanish reviews (575 in total) + English reviews (211 in total)
#Creating a dataframe for all Deloitte reviews across Spain

Deloitte_reviews = pd.DataFrame(columns = col)

#Iterate over the page URLs in order so that each page can be scraped in turn

#Spanish reviews:
Deloitterev_esp = pd.DataFrame(columns = col)
url_list1 = []
for i in range(1, 59):
    #Iterate over the result pages to reach all the reviews on each page
    url = ("https://www.glassdoor.es/Opiniones/Deloitte-Espa%C3%B1a-Opiniones-EI_IE2763.0,8_IL.9,15_IN219_IP" + str(i) + ".htm?filter.iso3Language=spa")
    page = requests.get(url, headers=headers)  #request the page
    soup = BeautifulSoup(page.content, "html.parser")  #parse the HTML into an object tree
    Deloitterev_esp_containers = soup.find_all("div", {"class": "gdReview"})   #select the review containers
    
    #Run the queries for each review:
    for container in Deloitterev_esp_containers:
        
        com = "Deloitte"
        
        Date = container.find_all("time", {"class": "date subtle small"})
        if len(Date) != 0:
          da = Date[0].text.strip()
        else:
          da = "NaN"

        Title = container.find_all("h2", {"class": "h2 summary strong mb-xsm mt-0"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        except:
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        
        Rating =  container.find_all("div", {"class": "v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small"})
        if len(Rating) != 0:
          rat = Rating[0].text.strip()
        else:
          rat = "NaN"  
    
        Role = container.find_all("span", {"class": "authorJobTitle middle "})
        if len(Role) != 0:
          rol = Role[0].text.strip()
        else:
          rol = "NaN"
                                           
        Location = container.find_all("span", {"class": "authorLocation"})
        if len(Location) != 0:
          loc = Location[0].text.strip()
        else:
          loc = "NaN"                                   
        
        Pros = container.find_all("span", {"data-test": "pros"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        except:
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        
        Contras = container.find_all("span", {"data-test": "cons"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        except:
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        
        lan = "Español"

        data1 = pd.DataFrame([[com, da, ti, rat, rol, loc, pro, con, lan]]) 
        data1.columns = col
        Deloitterev_esp = Deloitterev_esp.append(data1, ignore_index=True)  #add the collected data to the dataframe

        
#English reviews:
Deloitterev_eng = pd.DataFrame(columns = col)
url_list2 = []
for i in range(1, 23):
    url = ("https://www.glassdoor.es/Opiniones/Deloitte-Espa%C3%B1a-Opiniones-EI_IE2763.0,8_IL.9,15_IN219_IP" + str(i) + ".htm?filter.iso3Language=eng")
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.content, "html.parser")
    Deloitterev_eng_containers = soup.find_all("div", {"class": "gdReview"})
                                                  
    for container in Deloitterev_eng_containers:
        
        com = "Deloitte"
        
        Date = container.find_all("time", {"class": "date subtle small"})
        if len(Date) != 0:
          da = Date[0].text.strip()
        else:
          da = "NaN"

        Title = container.find_all("h2", {"class": "h2 summary strong mb-xsm mt-0"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        except:
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        
        Rating =  container.find_all("div", {"class": "v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small"})
        if len(Rating) != 0:
          rat = Rating[0].text.strip()
        else:
          rat = "NaN"  
    
        Role = container.find_all("span", {"class": "authorJobTitle middle "})
        if len(Role) != 0:
          rol = Role[0].text.strip()
        else:
          rol = "NaN"
                                           
        Location = container.find_all("span", {"class": "authorLocation"})
        if len(Location) != 0:
          loc = Location[0].text.strip()
        else:
          loc = "NaN"                                   
        
        Pros = container.find_all("span", {"data-test": "pros"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        except:
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        
        Contras = container.find_all("span", {"data-test": "cons"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        except:
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        
        lan = "English"

        data2 = pd.DataFrame([[com, da, ti, rat, rol, loc, pro, con, lan]])
        data2.columns = col
        Deloitterev_eng = Deloitterev_eng.append(data2, ignore_index=True)
        
#Creating a dataframe for all Deloitte reviews across Spain
Deloitte_reviews = Deloitterev_esp.append(Deloitterev_eng, ignore_index=True)

#Creating a 'csv' file with the 'Deloitte_reviews' dataframe
Deloitte_reviews.to_csv('Deloitte_reviews.csv')

driver.quit()
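
The Spanish and English loops build the same URL and differ only in the page number and the `filter.iso3Language` value. The following is a sketch of a helper that makes the three varying parts explicit. `review_page_url` and its parameter names are hypothetical; the URL layout is copied from the strings used above.

```python
BASE = "https://www.glassdoor.es/Opiniones/"


def review_page_url(company_path, page, language):
    """Build the URL of one results page.

    company_path: the company-specific part of the URL, up to and including "_IN219"
    page:         1-based page number
    language:     "spa" or "eng", the two values used in this notebook
    """
    return f"{BASE}{company_path}_IP{page}.htm?filter.iso3Language={language}"


deloitte = "Deloitte-Espa%C3%B1a-Opiniones-EI_IE2763.0,8_IL.9,15_IN219"
print(review_page_url(deloitte, 1, "spa"))
```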

In [7]:
#Launching the webdriver

driver = webdriver.Firefox(executable_path = "/home/dsc/Selenium/Firefox_driver/geckodriver")

driver.get("https://www.glassdoor.es/Opiniones/KPMG-Espa%C3%B1a-Opiniones-EI_IE2867.0,4_IL.5,11_IN219.htm?filter.iso3Language=spa")

time.sleep(5)  #wait 5 seconds for the page to load

cookies_button = driver.find_element_by_id("onetrust-accept-btn-handler").click() 

#log_in = driver.find_element_by_class_name("sign-in").click()
log_in = driver.find_element_by_id("SignInButton").click() 
username = driver.find_element_by_name("username").send_keys("micasbot@gmail.com")
userpwd = driver.find_element_by_name("password").send_keys("Fofito88")
login_enter = driver.find_element_by_name("submit").click()

#Scraping KPMG: Spanish reviews (283 in total) + English reviews (115 in total)
#Creating a dataframe for all KPMG reviews across Spain

KPMG_reviews = pd.DataFrame(columns = col)

#Iterate over the page URLs in order so that each page can be scraped in turn

#Spanish reviews:
KPMGrev_esp = pd.DataFrame(columns = col)
url_list1 = []
for i in range(1, 30):
    #Iterate over the result pages to reach all the reviews on each page
    url = ("https://www.glassdoor.es/Opiniones/KPMG-Espa%C3%B1a-Opiniones-EI_IE2867.0,4_IL.5,11_IN219_IP" + str(i) + ".htm?filter.iso3Language=spa")
    page = requests.get(url, headers=headers)  #request the page
    soup = BeautifulSoup(page.content, "html.parser")  #parse the HTML into an object tree
    KPMGrev_esp_containers = soup.find_all("div", {"class": "gdReview"})   #select the review containers
    
    #Run the queries for each review:
    for container in KPMGrev_esp_containers:
        
        com = "KPMG"
        
        Date = container.find_all("time", {"class": "date subtle small"})
        if len(Date) != 0:
          da = Date[0].text.strip()
        else:
          da = "NaN"

        Title = container.find_all("h2", {"class": "h2 summary strong mb-xsm mt-0"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        except:
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        
        Rating =  container.find_all("div", {"class": "v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small"})
        if len(Rating) != 0:
          rat = Rating[0].text.strip()
        else:
          rat = "NaN"  
    
        Role = container.find_all("span", {"class": "authorJobTitle middle "})
        if len(Role) != 0:
          rol = Role[0].text.strip()
        else:
          rol = "NaN"
                                           
        Location = container.find_all("span", {"class": "authorLocation"})
        if len(Location) != 0:
          loc = Location[0].text.strip()
        else:
          loc = "NaN"                                   
        
        Pros = container.find_all("span", {"data-test": "pros"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        except:
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        
        Contras = container.find_all("span", {"data-test": "cons"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        except:
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        
        lan = "Español"

        data1 = pd.DataFrame([[com, da, ti, rat, rol, loc, pro, con, lan]]) 
        data1.columns = col
        KPMGrev_esp = KPMGrev_esp.append(data1, ignore_index=True)  #add the collected data to the dataframe

        
#English reviews:
KPMGrev_eng = pd.DataFrame(columns = col)
url_list2 = []
for i in range(1, 13):
    url = ("https://www.glassdoor.es/Opiniones/KPMG-Espa%C3%B1a-Opiniones-EI_IE2867.0,4_IL.5,11_IN219_IP" + str(i) + ".htm?filter.iso3Language=eng")
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.content, "html.parser")
    KPMGrev_eng_containers = soup.find_all("div", {"class": "gdReview"})
                                                  
    for container in KPMGrev_eng_containers:
        
        com = "KPMG"
        
        Date = container.find_all("time", {"class": "date subtle small"})
        if len(Date) != 0:
          da = Date[0].text.strip()
        else:
          da = "NaN"

        Title = container.find_all("h2", {"class": "h2 summary strong mb-xsm mt-0"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        except:
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        
        Rating =  container.find_all("div", {"class": "v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small"})
        if len(Rating) != 0:
          rat = Rating[0].text.strip()
        else:
          rat = "NaN"  
    
        Role = container.find_all("span", {"class": "authorJobTitle middle "})
        if len(Role) != 0:
          rol = Role[0].text.strip()
        else:
          rol = "NaN"
                                           
        Location = container.find_all("span", {"class": "authorLocation"})
        if len(Location) != 0:
          loc = Location[0].text.strip()
        else:
          loc = "NaN"                                   
        
        Pros = container.find_all("span", {"data-test": "pros"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        except:
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        
        Contras = container.find_all("span", {"data-test": "cons"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        except:
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        
        lan = "English"

        data2 = pd.DataFrame([[com, da, ti, rat, rol, loc, pro, con, lan]])
        data2.columns = col
        KPMGrev_eng = KPMGrev_eng.append(data2, ignore_index=True)
        
#Creating a dataframe for all KPMG reviews across Spain
KPMG_reviews = KPMGrev_esp.append(KPMGrev_eng, ignore_index=True)

#Creating a 'csv' file with the 'KPMG_reviews' dataframe
KPMG_reviews.to_csv('KPMG_reviews.csv')

driver.quit()
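
`DataFrame.append`, used throughout these cells, was deprecated in pandas 1.4 and removed in pandas 2.0, so the loops as written need an older pandas. One common alternative is to collect the rows in a plain list and build the dataframe once at the end. The sketch below uses made-up rows and a hypothetical column list; the notebook's own `col` is defined elsewhere and may differ.

```python
import pandas as pd

# Hypothetical column names; the notebook defines its own `col` elsewhere.
col = ["Company", "Date", "Title", "Rating", "Role", "Location", "Pros", "Cons", "Language"]

rows = []
# Made-up records standing in for scraped reviews.
for title, language in [("Example title A", "Español"), ("Example title B", "English")]:
    rows.append(["KPMG", "NaN", title, "NaN", "NaN", "NaN", "NaN", "NaN", language])

# Build the dataframe in one step instead of appending row by row.
reviews = pd.DataFrame(rows, columns=col)
print(reviews.shape)  # (2, 9)
```

The two per-language dataframes could then be combined with `pd.concat([esp, eng], ignore_index=True)`, where `esp` and `eng` are placeholder names for the Spanish and English frames.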

In [9]:
#Launching the webdriver

driver = webdriver.Firefox(executable_path = "/home/dsc/Selenium/Firefox_driver/geckodriver")

driver.get("https://www.glassdoor.es/Opiniones/PwC-Espa%C3%B1a-Opiniones-EI_IE8450.0,3_IL.4,10_IN219.htm?filter.iso3Language=spa")

time.sleep(5)  #wait 5 seconds for the page to load

cookies_button = driver.find_element_by_id("onetrust-accept-btn-handler").click() 

#log_in = driver.find_element_by_class_name("sign-in").click()
log_in = driver.find_element_by_id("SignInButton").click() 
username = driver.find_element_by_name("username").send_keys("micasbot@gmail.com")
userpwd = driver.find_element_by_name("password").send_keys("Fofito88")
login_enter = driver.find_element_by_name("submit").click()

#Scraping PwC: Spanish reviews (383 in total) + English reviews (168 in total)
#Creating a dataframe for all PwC reviews across Spain

PwC_reviews = pd.DataFrame(columns = col)

#Iterate over the page URLs in order so that each page can be scraped in turn

#Spanish reviews:
PwCrev_esp = pd.DataFrame(columns = col)
url_list1 = []
for i in range(1, 40):  #383 reviews -> 39 pages, assuming 10 reviews per page as in the other loops
    #Iterate over the result pages to reach all the reviews on each page
    url = ("https://www.glassdoor.es/Opiniones/PwC-Espa%C3%B1a-Opiniones-EI_IE8450.0,3_IL.4,10_IN219_IP" + str(i) + ".htm?filter.iso3Language=spa")
    page = requests.get(url, headers=headers)  #request the page
    soup = BeautifulSoup(page.content, "html.parser")  #parse the HTML into an object tree
    PwCrev_esp_containers = soup.find_all("div", {"class": "gdReview"})   #select the review containers
    
    #Run the queries for each review:
    for container in PwCrev_esp_containers:
        
        com = "PwC"
        
        Date = container.find_all("time", {"class": "date subtle small"})
        if len(Date) != 0:
          da = Date[0].text.strip()
        else:
          da = "NaN"

        Title = container.find_all("h2", {"class": "h2 summary strong mb-xsm mt-0"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        except:
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        
        Rating =  container.find_all("div", {"class": "v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small"})
        if len(Rating) != 0:
          rat = Rating[0].text.strip()
        else:
          rat = "NaN"  
    
        Role = container.find_all("span", {"class": "authorJobTitle middle "})
        if len(Role) != 0:
          rol = Role[0].text.strip()
        else:
          rol = "NaN"
                                           
        Location = container.find_all("span", {"class": "authorLocation"})
        if len(Location) != 0:
          loc = Location[0].text.strip()
        else:
          loc = "NaN"                                   
        
        Pros = container.find_all("span", {"data-test": "pros"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        except:
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        
        Contras = container.find_all("span", {"data-test": "cons"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        except:
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        
        lan = "Español"

        data1 = pd.DataFrame([[com, da, ti, rat, rol, loc, pro, con, lan]]) 
        data1.columns = col
        PwCrev_esp = PwCrev_esp.append(data1, ignore_index=True)  #add the collected data to the dataframe

        
#English reviews:
PwCrev_eng = pd.DataFrame(columns = col)
url_list2 = []
for i in range(1, 18):  #168 reviews -> 17 pages, assuming 10 reviews per page as in the other loops
    url = ("https://www.glassdoor.es/Opiniones/PwC-Espa%C3%B1a-Opiniones-EI_IE8450.0,3_IL.4,10_IN219_IP" + str(i) + ".htm?filter.iso3Language=eng")
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.content, "html.parser")
    PwCrev_eng_containers = soup.find_all("div", {"class": "gdReview"})
                                                  
    for container in PwCrev_eng_containers:
        
        com = "PwC"
        
        Date = container.find_all("time", {"class": "date subtle small"})
        if len(Date) != 0:
          da = Date[0].text.strip()
        else:
          da = "NaN"

        Title = container.find_all("h2", {"class": "h2 summary strong mb-xsm mt-0"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        except:
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        
        Rating =  container.find_all("div", {"class": "v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small"})
        if len(Rating) != 0:
          rat = Rating[0].text.strip()
        else:
          rat = "NaN"  
    
        Role = container.find_all("span", {"class": "authorJobTitle middle "})
        if len(Role) != 0:
          rol = Role[0].text.strip()
        else:
          rol = "NaN"
                                           
        Location = container.find_all("span", {"class": "authorLocation"})
        if len(Location) != 0:
          loc = Location[0].text.strip()
        else:
          loc = "NaN"                                   
        
        Pros = container.find_all("span", {"data-test": "pros"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        except:
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        
        Contras = container.find_all("span", {"data-test": "cons"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        except:
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        
        lan = "English"

        data2 = pd.DataFrame([[com, da, ti, rat, rol, loc, pro, con, lan]])
        data2.columns = col
        PwCrev_eng = PwCrev_eng.append(data2, ignore_index=True)
        
#Creating a dataframe for all PwC reviews across Spain
PwC_reviews = PwCrev_esp.append(PwCrev_eng, ignore_index=True)

#Creating a 'csv' file with the 'PwC_reviews' dataframe
PwC_reviews.to_csv('PwC_reviews.csv')

driver.quit()

In [10]:
#Launching the webdriver

driver = webdriver.Firefox(executable_path = "/home/dsc/Selenium/Firefox_driver/geckodriver")

driver.get("https://www.glassdoor.es/Opiniones/EY-Espa%C3%B1a-Opiniones-EI_IE2784.0,2_IL.3,9_IN219.htm?filter.iso3Language=spa")

time.sleep(5)  #wait 5 seconds for the page to load

cookies_button = driver.find_element_by_id("onetrust-accept-btn-handler").click() 

#log_in = driver.find_element_by_class_name("sign-in").click()
log_in = driver.find_element_by_id("SignInButton").click() 
username = driver.find_element_by_name("username").send_keys("micasbot@gmail.com")
userpwd = driver.find_element_by_name("password").send_keys("Fofito88")
login_enter = driver.find_element_by_name("submit").click()

#Scrapping EY: spanish reviews (259 in total) + english reviews (122 in total)
#Creating a dataframe for all EY reviews across Spain

EY_reviews = pd.DataFrame(columns = col)

#Iterate over the page URLs one by one so that each page can be scraped in turn

#Spanish reviews:
EYrev_esp = pd.DataFrame(columns = col)
url_list1 = []
for i in range(1, 27):
    #Iterate over the result pages to reach every review on each page
    url = ("https://www.glassdoor.es/Opiniones/EY-Espa%C3%B1a-Opiniones-EI_IE2784.0,2_IL.3,9_IN219_IP" + str(i) + ".htm?filter.iso3Language=spa")
    page = requests.get(url, headers=headers)  #request the page
    soup = BeautifulSoup(page.content, "html.parser")  #parse the downloaded HTML
    EYrev_esp_containers = soup.find_all("div", {"class": "gdReview"})   #collect the review containers
    
    #Extract the fields of each review:
    for container in EYrev_esp_containers:
        
        com = "EY"
        
        Date = container.find_all("time", {"class": "date subtle small"})
        if len(Date) != 0:
          da = Date[0].text.strip()
        else:
          da = "NaN"

        Title = container.find_all("h2", {"class": "h2 summary strong mb-xsm mt-0"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        except:
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        
        Rating =  container.find_all("div", {"class": "v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small"})
        if len(Rating) != 0:
          rat = Rating[0].text.strip()
        else:
          rat = "NaN"  
    
        Role = container.find_all("span", {"class": "authorJobTitle middle "})
        if len(Role) != 0:
          rol = Role[0].text.strip()
        else:
          rol = "NaN"
                                           
        Location = container.find_all("span", {"class": "authorLocation"})
        if len(Location) != 0:
          loc = Location[0].text.strip()
        else:
          loc = "NaN"                                   
        
        Pros = container.find_all("span", {"data-test": "pros"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        except:
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        
        Contras = container.find_all("span", {"data-test": "cons"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        except:
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        
        lan = "Español"

        data1 = pd.DataFrame([[com, da, ti, rat, rol, loc, pro, con, lan]]) 
        data1.columns = col
        EYrev_esp = EYrev_esp.append(data1, ignore_index=True)  #add the scraped row to the EY dataframe

        
#English reviews:
EYrev_eng = pd.DataFrame(columns = col)
url_list2 = []
for i in range(1, 14):
    url = ("https://www.glassdoor.es/Opiniones/EY-Espa%C3%B1a-Opiniones-EI_IE2784.0,2_IL.3,9_IN219_IP" + str(i) + ".htm?filter.iso3Language=eng")
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.content, "html.parser")
    EYrev_eng_containers = soup.find_all("div", {"class": "gdReview"})
                                                  
    for container in EYrev_eng_containers:
        
        com = "EY"
        
        Date = container.find_all("time", {"class": "date subtle small"})
        if len(Date) != 0:
          da = Date[0].text.strip()
        else:
          da = "NaN"

        Title = container.find_all("h2", {"class": "h2 summary strong mb-xsm mt-0"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        except:
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        
        Rating =  container.find_all("div", {"class": "v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small"})
        if len(Rating) != 0:
          rat = Rating[0].text.strip()
        else:
          rat = "NaN"  
    
        Role = container.find_all("span", {"class": "authorJobTitle middle "})
        if len(Role) != 0:
          rol = Role[0].text.strip()
        else:
          rol = "NaN"
                                           
        Location = container.find_all("span", {"class": "authorLocation"})
        if len(Location) != 0:
          loc = Location[0].text.strip()
        else:
          loc = "NaN"                                   
        
        Pros = container.find_all("span", {"data-test": "pros"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        except:
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        
        Contras = container.find_all("span", {"data-test": "cons"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        except:
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        
        lan = "English"

        data2 = pd.DataFrame([[com, da, ti, rat, rol, loc, pro, con, lan]])
        data2.columns = col
        EYrev_eng = EYrev_eng.append(data2, ignore_index=True)
        
#Creating a dataframe for all EY reviews across Spain
EY_reviews = EYrev_esp.append(EYrev_eng, ignore_index=True)

#Creating a 'csv' file with the 'EY_reviews' dataframe
EY_reviews.to_csv('EY_reviews.csv')

driver.quit()
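
The page ranges in these loops are hard-coded from the review totals noted in the comments. As a rough sketch, the last page can be derived from the total instead. The figure of 10 reviews per page is an assumption inferred from the numbers in this notebook (259 reviews over 26 pages, 122 over 13), and `pages_needed` is a hypothetical helper name.

```python
import math

REVIEWS_PER_PAGE = 10  # assumption inferred from the totals and ranges in this notebook


def pages_needed(total_reviews, per_page=REVIEWS_PER_PAGE):
    """Number of result pages required to list `total_reviews` reviews."""
    return math.ceil(total_reviews / per_page)


# Totals taken from the comment in the EY cell
for label, total in [("EY, Spanish", 259), ("EY, English", 122)]:
    last_page = pages_needed(total)
    print(label, "-> range(1, %d)" % (last_page + 1))
```

The loop bound is then `range(1, pages_needed(total) + 1)`, which stays correct when the number of reviews on the site changes.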

In [11]:
#Launching the webdriver

driver = webdriver.Firefox(executable_path = "/home/dsc/Selenium/Firefox_driver/geckodriver")

driver.get("https://www.glassdoor.es/Opiniones/Santander-Espa%C3%B1a-Opiniones-EI_IE828048.0,9_IL.10,16_IN219.htm?filter.iso3Language=spa")

time.sleep(5)  #wait 5 seconds for the page to load

cookies_button = driver.find_element_by_id("onetrust-accept-btn-handler").click() 

#log_in = driver.find_element_by_class_name("sign-in").click()
log_in = driver.find_element_by_id("SignInButton").click() 
username = driver.find_element_by_name("username").send_keys("micasbot@gmail.com")
userpwd = driver.find_element_by_name("password").send_keys("Fofito88")
login_enter = driver.find_element_by_name("submit").click()

#Scraping Banco Santander: Spanish reviews (299 in total) + English reviews (158 in total)
#Creating a dataframe for all Banco Santander reviews across Spain

Santander_reviews = pd.DataFrame(columns = col)

#Iterate over the page URLs one by one so that each page can be scraped in turn

#Spanish reviews:
Santanderrev_esp = pd.DataFrame(columns = col)
url_list1 = []
for i in range(1, 31):
    #Iterate over the result pages to reach every review on each page
    url = ("https://www.glassdoor.es/Opiniones/Santander-Espa%C3%B1a-Opiniones-EI_IE828048.0,9_IL.10,16_IN219_IP" + str(i) + ".htm?filter.iso3Language=spa")
    page = requests.get(url, headers=headers)  #request the page
    soup = BeautifulSoup(page.content, "html.parser")  #parse the downloaded HTML
    Santanderrev_esp_containers = soup.find_all("div", {"class": "gdReview"})   #collect the review containers
    
    #Extract the fields of each review:
    for container in Santanderrev_esp_containers:
        
        com = "Banco Santander"
        
        Date = container.find_all("time", {"class": "date subtle small"})
        if len(Date) != 0:
          da = Date[0].text.strip()
        else:
          da = "NaN"

        Title = container.find_all("h2", {"class": "h2 summary strong mb-xsm mt-0"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        except:
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        
        Rating =  container.find_all("div", {"class": "v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small"})
        if len(Rating) != 0:
          rat = Rating[0].text.strip()
        else:
          rat = "NaN"  
    
        Role = container.find_all("span", {"class": "authorJobTitle middle "})
        if len(Role) != 0:
          rol = Role[0].text.strip()
        else:
          rol = "NaN"
                                           
        Location = container.find_all("span", {"class": "authorLocation"})
        if len(Location) != 0:
          loc = Location[0].text.strip()
        else:
          loc = "NaN"                                   
        
        Pros = container.find_all("span", {"data-test": "pros"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        except:
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        
        Contras = container.find_all("span", {"data-test": "cons"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        except:
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        
        lan = "Español"

        data1 = pd.DataFrame([[com, da, ti, rat, rol, loc, pro, con, lan]]) 
        data1.columns = col
        Santanderrev_esp = Santanderrev_esp.append(data1, ignore_index=True)  #add the scraped row to the dataframe

        
#English reviews:
Santanderrev_eng = pd.DataFrame(columns = col)
url_list2 = []
for i in range(1, 17):
    url = ("https://www.glassdoor.es/Opiniones/Santander-Espa%C3%B1a-Opiniones-EI_IE828048.0,9_IL.10,16_IN219_IP" + str(i) + ".htm?filter.iso3Language=eng")
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.content, "html.parser")
    Santanderrev_eng_containers = soup.find_all("div", {"class": "gdReview"})
                                                  
    for container in Santanderrev_eng_containers:
        
        com = "Banco Santander"
        
        Date = container.find_all("time", {"class": "date subtle small"})
        if len(Date) != 0:
          da = Date[0].text.strip()
        else:
          da = "NaN"

        Title = container.find_all("h2", {"class": "h2 summary strong mb-xsm mt-0"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        except:
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        
        Rating =  container.find_all("div", {"class": "v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small"})
        if len(Rating) != 0:
          rat = Rating[0].text.strip()
        else:
          rat = "NaN"  
    
        Role = container.find_all("span", {"class": "authorJobTitle middle "})
        if len(Role) != 0:
          rol = Role[0].text.strip()
        else:
          rol = "NaN"
                                           
        Location = container.find_all("span", {"class": "authorLocation"})
        if len(Location) != 0:
          loc = Location[0].text.strip()
        else:
          loc = "NaN"                                   
        
        Pros = container.find_all("span", {"data-test": "pros"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        except:
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        
        Contras = container.find_all("span", {"data-test": "cons"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        except:
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        
        lan = "English"

        data2 = pd.DataFrame([[com, da, ti, rat, rol, loc, pro, con, lan]])
        data2.columns = col
        Santanderrev_eng = Santanderrev_eng.append(data2, ignore_index=True)
        
#Creating a dataframe for all Banco Santander reviews across Spain
Santander_reviews = Santanderrev_esp.append(Santanderrev_eng, ignore_index=True)

#Creating a 'csv' file with the 'Santander_reviews' dataframe
Santander_reviews.to_csv('Santander_reviews.csv')

driver.quit()
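
Each cell types the Glassdoor e-mail and password directly into the login form. One possible alternative, sketched here under assumptions, is to read them from environment variables so they never appear in the notebook. The variable names `GLASSDOOR_USER` and `GLASSDOOR_PASSWORD` and the function `glassdoor_credentials` are hypothetical choices, not something the notebook defines.

```python
import os
from getpass import getpass


def glassdoor_credentials(env=os.environ):
    """Return (user, password), preferring environment variables.

    GLASSDOOR_USER and GLASSDOOR_PASSWORD are hypothetical variable names;
    when one is missing, the value is requested interactively instead.
    """
    user = env.get("GLASSDOOR_USER")
    password = env.get("GLASSDOOR_PASSWORD")
    if user is None:
        user = input("Glassdoor e-mail: ")
    if password is None:
        password = getpass("Glassdoor password: ")  # input is not echoed
    return user, password
```

The two `send_keys(...)` calls in the login step would then receive `user` and `password` from `user, password = glassdoor_credentials()`.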

In [33]:
#Launching the webdriver

driver = webdriver.Firefox(executable_path = "/home/dsc/Selenium/Firefox_driver/geckodriver")

driver.get("https://www.glassdoor.es/Opiniones/Minsait-Espa%C3%B1a-Opiniones-EI_IE1201389.0,7_IL.8,14_IN219.htm?filter.iso3Language=spa")

time.sleep(5)  #wait 5 seconds for the page to load

cookies_button = driver.find_element_by_id("onetrust-accept-btn-handler").click() 

#log_in = driver.find_element_by_class_name("sign-in").click()
log_in = driver.find_element_by_id("SignInButton").click() 
username = driver.find_element_by_name("username").send_keys("micasbot@gmail.com")
userpwd = driver.find_element_by_name("password").send_keys("Fofito88")
login_enter = driver.find_element_by_name("submit").click()

#Scraping Minsait (by Indra): Spanish reviews (164 in total) + English reviews (33 in total)
#Creating a dataframe for all Minsait reviews across Spain

Minsait_reviews = pd.DataFrame(columns = col)

#Iterate over the page URLs one by one so that each page can be scraped in turn

#Spanish reviews:
Minsaitrev_esp = pd.DataFrame(columns = col)
url_list1 = []
for i in range(1, 18):
    #Iterate over the result pages to reach every review on each page
    url = ("https://www.glassdoor.es/Opiniones/Minsait-Espa%C3%B1a-Opiniones-EI_IE1201389.0,7_IL.8,14_IN219_IP" + str(i) + ".htm?filter.iso3Language=spa")
    page = requests.get(url, headers=headers)  #request the page
    soup = BeautifulSoup(page.content, "html.parser")  #parse the downloaded HTML
    Minsaitrev_esp_containers = soup.find_all("div", {"class": "gdReview"})   #collect the review containers
    
    #Extract the fields of each review:
    for container in Minsaitrev_esp_containers:
        
        com = "Minsait"
        
        Date = container.find_all("time", {"class": "date subtle small"})
        if len(Date) != 0:
          da = Date[0].text.strip()
        else:
          da = "NaN"

        Title = container.find_all("h2", {"class": "h2 summary strong mb-xsm mt-0"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        except:
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        
        Rating =  container.find_all("div", {"class": "v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small"})
        if len(Rating) != 0:
          rat = Rating[0].text.strip()
        else:
          rat = "NaN"  
    
        Role = container.find_all("span", {"class": "authorJobTitle middle "})
        if len(Role) != 0:
          rol = Role[0].text.strip()
        else:
          rol = "NaN"
                                           
        Location = container.find_all("span", {"class": "authorLocation"})
        if len(Location) != 0:
          loc = Location[0].text.strip()
        else:
          loc = "NaN"                                   
        
        Pros = container.find_all("span", {"data-test": "pros"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        except:
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        
        Contras = container.find_all("span", {"data-test": "cons"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        except:
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        
        lan = "Español"

        data1 = pd.DataFrame([[com, da, ti, rat, rol, loc, pro, con, lan]]) 
        data1.columns = col
        Minsaitrev_esp = Minsaitrev_esp.append(data1, ignore_index=True)  #add the scraped row to the dataframe

        
#English reviews:
Minsaitrev_eng = pd.DataFrame(columns = col)
url_list2 = []
for i in range(1, 5):
    url = ("https://www.glassdoor.es/Opiniones/Minsait-Espa%C3%B1a-Opiniones-EI_IE1201389.0,7_IL.8,14_IN219_IP" + str(i) + ".htm?filter.iso3Language=eng")
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.content, "html.parser")
    Minsaitrev_eng_containers = soup.find_all("div", {"class": "gdReview"})
                                                  
    for container in Minsaitrev_eng_containers:
        
        com = "Minsait"
        
        Date = container.find_all("time", {"class": "date subtle small"})
        if len(Date) != 0:
          da = Date[0].text.strip()
        else:
          da = "NaN"

        Title = container.find_all("h2", {"class": "h2 summary strong mb-xsm mt-0"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        except:
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        
        Rating =  container.find_all("div", {"class": "v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small"})
        if len(Rating) != 0:
          rat = Rating[0].text.strip()
        else:
          rat = "NaN"  
    
        Role = container.find_all("span", {"class": "authorJobTitle middle "})
        if len(Role) != 0:
          rol = Role[0].text.strip()
        else:
          rol = "NaN"
                                           
        Location = container.find_all("span", {"class": "authorLocation"})
        if len(Location) != 0:
          loc = Location[0].text.strip()
        else:
          loc = "NaN"                                   
        
        Pros = container.find_all("span", {"data-test": "pros"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        except:
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        
        Contras = container.find_all("span", {"data-test": "cons"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        except:
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        
        lan = "English"

        data2 = pd.DataFrame([[com, da, ti, rat, rol, loc, pro, con, lan]])
        data2.columns = col
        Minsaitrev_eng = Minsaitrev_eng.append(data2, ignore_index=True)
        
#Creating a dataframe for all Minsait reviews across Spain
Minsait_reviews = Minsaitrev_esp.append(Minsaitrev_eng, ignore_index=True)

#Creating a 'csv' file with the 'Minsait_reviews' dataframe
Minsait_reviews.to_csv('Minsait_reviews.csv')

driver.quit()
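
These cells grow each dataframe one row at a time with `DataFrame.append`, which was deprecated in pandas 1.4 and removed in pandas 2.0. A minimal sketch of an alternative is to collect the rows in a plain list and build the dataframe once at the end. The column names in `col` below are hypothetical stand-ins for the list the notebook defines earlier, and the rows are placeholders rather than scraped data.

```python
import pandas as pd

# Hypothetical column names; the notebook defines its own `col` list earlier.
col = ["Company", "Date", "Title", "Rating", "Role",
       "Location", "Pros", "Cons", "Language"]

rows = []  # one list of nine values per review
for lan in ["Español", "English"]:
    # Placeholder values standing in for com, da, ti, rat, rol, loc, pro, con
    rows.append(["Minsait", "NaN", "NaN", "NaN", "NaN", "NaN", "NaN", "NaN", lan])

# Build the dataframe in a single step instead of appending row by row
reviews = pd.DataFrame(rows, columns=col)
print(reviews.shape)
```

Two finished dataframes can be stacked in the same spirit with `pd.concat([df_esp, df_eng], ignore_index=True)`.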

In [34]:
#Launching the webdriver

driver = webdriver.Firefox(executable_path = "/home/dsc/Selenium/Firefox_driver/geckodriver")

driver.get("https://www.glassdoor.es/Opiniones/Telef%C3%B3nica-Espa%C3%B1a-Opiniones-EI_IE3511.0,10_IL.11,17_IN219.htm?filter.iso3Language=spa")

time.sleep(5)  #wait 5 seconds for the page to load

cookies_button = driver.find_element_by_id("onetrust-accept-btn-handler").click() 

#log_in = driver.find_element_by_class_name("sign-in").click()
log_in = driver.find_element_by_id("SignInButton").click() 
username = driver.find_element_by_name("username").send_keys("micasbot@gmail.com")
userpwd = driver.find_element_by_name("password").send_keys("Fofito88")
login_enter = driver.find_element_by_name("submit").click()

#Scraping Telefónica: Spanish reviews (340 in total) + English reviews (181 in total)
#Creating a dataframe for all Telefónica reviews across Spain

Telefonica_reviews = pd.DataFrame(columns = col)

#Iterate over the page URLs one by one so that each page can be scraped in turn

#Spanish reviews:
Telefonicarev_esp = pd.DataFrame(columns = col)
url_list1 = []
for i in range(1, 35):
    #Iterate over the result pages to reach every review on each page
    url = ("https://www.glassdoor.es/Opiniones/Telef%C3%B3nica-Espa%C3%B1a-Opiniones-EI_IE3511.0,10_IL.11,17_IN219_IP" + str(i) + ".htm?filter.iso3Language=spa")
    page = requests.get(url, headers=headers)  #request the page
    soup = BeautifulSoup(page.content, "html.parser")  #parse the downloaded HTML
    Telefonicarev_esp_containers = soup.find_all("div", {"class": "gdReview"})   #collect the review containers
    
    #Extract the fields of each review:
    for container in Telefonicarev_esp_containers:
        
        com = "Telefonica"
        
        Date = container.find_all("time", {"class": "date subtle small"})
        if len(Date) != 0:
          da = Date[0].text.strip()
        else:
          da = "NaN"

        Title = container.find_all("h2", {"class": "h2 summary strong mb-xsm mt-0"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        except:
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        
        Rating =  container.find_all("div", {"class": "v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small"})
        if len(Rating) != 0:
          rat = Rating[0].text.strip()
        else:
          rat = "NaN"  
    
        Role = container.find_all("span", {"class": "authorJobTitle middle "})
        if len(Role) != 0:
          rol = Role[0].text.strip()
        else:
          rol = "NaN"
                                           
        Location = container.find_all("span", {"class": "authorLocation"})
        if len(Location) != 0:
          loc = Location[0].text.strip()
        else:
          loc = "NaN"                                   
        
        Pros = container.find_all("span", {"data-test": "pros"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        except:
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        
        Contras = container.find_all("span", {"data-test": "cons"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        except:
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        
        lan = "Español"

        data1 = pd.DataFrame([[com, da, ti, rat, rol, loc, pro, con, lan]]) 
        data1.columns = col
        Telefonicarev_esp = Telefonicarev_esp.append(data1, ignore_index=True)  #add the scraped row to the dataframe

        
#English reviews:
Telefonicarev_eng = pd.DataFrame(columns = col)
url_list2 = []
for i in range(1, 20):
    url = ("https://www.glassdoor.es/Opiniones/Telef%C3%B3nica-Espa%C3%B1a-Opiniones-EI_IE3511.0,10_IL.11,17_IN219_IP" + str(i) + ".htm?filter.iso3Language=eng")
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.content, "html.parser")
    Telefonicarev_eng_containers = soup.find_all("div", {"class": "gdReview"})
                                                  
    for container in Telefonicarev_eng_containers:
        
        com = "Telefonica"
        
        Date = container.find_all("time", {"class": "date subtle small"})
        if len(Date) != 0:
          da = Date[0].text.strip()
        else:
          da = "NaN"

        Title = container.find_all("h2", {"class": "h2 summary strong mb-xsm mt-0"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        except:
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        
        Rating =  container.find_all("div", {"class": "v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small"})
        if len(Rating) != 0:
          rat = Rating[0].text.strip()
        else:
          rat = "NaN"  
    
        Role = container.find_all("span", {"class": "authorJobTitle middle "})
        if len(Role) != 0:
          rol = Role[0].text.strip()
        else:
          rol = "NaN"
                                           
        Location = container.find_all("span", {"class": "authorLocation"})
        if len(Location) != 0:
          loc = Location[0].text.strip()
        else:
          loc = "NaN"                                   
        
        Pros = container.find_all("span", {"data-test": "pros"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        except:
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        
        Contras = container.find_all("span", {"data-test": "cons"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        except:
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        
        lan = "English"

        data2 = pd.DataFrame([[com, da, ti, rat, rol, loc, pro, con, lan]])
        data2.columns = col
        Telefonicarev_eng = Telefonicarev_eng.append(data2, ignore_index=True)
        
#Creating a dataframe for all Telefonica reviews across Spain
Telefonica_reviews = Telefonicarev_esp.append(Telefonicarev_eng, ignore_index=True)

#Creating a 'csv' file with the 'Telefonica_reviews' dataframe
Telefonica_reviews.to_csv('Telefonica_reviews.csv')

driver.quit()
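
The company cells above differ only in the company label, the URL slug and the number of pages per language. A possible way to express that, sketched here with hypothetical names (`BASE`, `COMPANIES`, `scraping_jobs`), is a small table that drives a single loop. The slugs and page counts are copied from the EY, Banco Santander, Minsait and Telefónica cells.

```python
BASE = "https://www.glassdoor.es/Opiniones/"

# (company label, URL slug, pages of Spanish reviews, pages of English reviews),
# all copied from the scraping cells above
COMPANIES = [
    ("EY", "EY-Espa%C3%B1a-Opiniones-EI_IE2784.0,2_IL.3,9_IN219", 26, 13),
    ("Banco Santander", "Santander-Espa%C3%B1a-Opiniones-EI_IE828048.0,9_IL.10,16_IN219", 30, 16),
    ("Minsait", "Minsait-Espa%C3%B1a-Opiniones-EI_IE1201389.0,7_IL.8,14_IN219", 17, 4),
    ("Telefonica", "Telef%C3%B3nica-Espa%C3%B1a-Opiniones-EI_IE3511.0,10_IL.11,17_IN219", 34, 19),
]


def scraping_jobs(companies=COMPANIES):
    """Yield (company, language label, url) for every review page."""
    for company, slug, pages_spa, pages_eng in companies:
        for iso, label, pages in (("spa", "Español", pages_spa),
                                  ("eng", "English", pages_eng)):
            for i in range(1, pages + 1):
                url = BASE + slug + "_IP" + str(i) + ".htm?filter.iso3Language=" + iso
                yield company, label, url


jobs = list(scraping_jobs())
print(len(jobs), "pages to request")
print(jobs[0])
```

Each tuple carries what the per-review code needs as `com`, `lan` and `url`, so the body of the existing loops could run once per job instead of being copied into every cell.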

In [35]:
#Launching the webdriver

driver = webdriver.Firefox(executable_path = "/home/dsc/Selenium/Firefox_driver/geckodriver")

driver.get("https://www.glassdoor.es/Opiniones/TravelPerk-Espa%C3%B1a-Opiniones-EI_IE1174739.0,10_IL.11,17_IN219.htm?filter.iso3Language=spa")

time.sleep(5)  #wait 5 seconds for the page to load

cookies_button = driver.find_element_by_id("onetrust-accept-btn-handler").click() 

#log_in = driver.find_element_by_class_name("sign-in").click()
log_in = driver.find_element_by_id("SignInButton").click() 
username = driver.find_element_by_name("username").send_keys("micasbot@gmail.com")
userpwd = driver.find_element_by_name("password").send_keys("Fofito88")
login_enter = driver.find_element_by_name("submit").click()

#Scraping TravelPerk: Spanish reviews (92 in total) + English reviews (89 in total)
#Creating a dataframe for all TravelPerk reviews across Spain

TravelPerk_reviews = pd.DataFrame(columns = col)

#Iterate over the page URLs one by one so that each page can be scraped in turn

#Spanish reviews:
TravelPerkrev_esp = pd.DataFrame(columns = col)
url_list1 = []
for i in range(1, 11):
    #Iterate over the result pages to reach every review on each page
    url = ("https://www.glassdoor.es/Opiniones/TravelPerk-Espa%C3%B1a-Opiniones-EI_IE1174739.0,10_IL.11,17_IN219_IP" + str(i) + ".htm?filter.iso3Language=spa")
    page = requests.get(url, headers=headers)  #request the page
    soup = BeautifulSoup(page.content, "html.parser")  #parse the downloaded HTML
    TravelPerkrev_esp_containers = soup.find_all("div", {"class": "gdReview"})   #collect the review containers
    
    #Extract the fields of each review:
    for container in TravelPerkrev_esp_containers:
        
        com = "TravelPerk"
        
        Date = container.find_all("time", {"class": "date subtle small"})
        if len(Date) != 0:
          da = Date[0].text.strip()
        else:
          da = "NaN"

        Title = container.find_all("h2", {"class": "h2 summary strong mb-xsm mt-0"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        except:
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        
        Rating =  container.find_all("div", {"class": "v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small"})
        if len(Rating) != 0:
          rat = Rating[0].text.strip()
        else:
          rat = "NaN"  
    
        Role = container.find_all("span", {"class": "authorJobTitle middle "})
        if len(Role) != 0:
          rol = Role[0].text.strip()
        else:
          rol = "NaN"
                                           
        Location = container.find_all("span", {"class": "authorLocation"})
        if len(Location) != 0:
          loc = Location[0].text.strip()
        else:
          loc = "NaN"                                   
        
        Pros = container.find_all("span", {"data-test": "pros"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        except:
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        
        Contras = container.find_all("span", {"data-test": "cons"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        except:
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        
        lan = "Español"

        data1 = pd.DataFrame([[com, da, ti, rat, rol, loc, pro, con, lan]]) 
        data1.columns = col
        TravelPerkrev_esp = TravelPerkrev_esp.append(data1, ignore_index=True)  #add all the data obtained to a dataframe

        
#English reviews:
TravelPerkrev_eng = pd.DataFrame(columns = col)
url_list2 = []
for i in range(1, 10):
    url = ("https://www.glassdoor.es/Opiniones/TravelPerk-Espa%C3%B1a-Opiniones-EI_IE1174739.0,10_IL.11,17_IN219_IP" + str(i) + ".htm?filter.iso3Language=eng")
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.content, "html.parser")
    TravelPerkrev_eng_containers = soup.find_all("div", {"class": "gdReview"})
                                                  
    for container in TravelPerkrev_eng_containers:
        
        com = "TravelPerk"
        
        Date = container.find_all("time", {"class": "date subtle small"})
        if len(Date) != 0:
          da = Date[0].text.strip()
        else:
          da = "NaN"

        Title = container.find_all("h2", {"class": "h2 summary strong mb-xsm mt-0"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        except:
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        
        Rating =  container.find_all("div", {"class": "v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small"})
        if len(Rating) != 0:
          rat = Rating[0].text.strip()
        else:
          rat = "NaN"  
    
        Role = container.find_all("span", {"class": "authorJobTitle middle "})
        if len(Role) != 0:
          rol = Role[0].text.strip()
        else:
          rol = "NaN"
                                           
        Location = container.find_all("span", {"class": "authorLocation"})
        if len(Location) != 0:
          loc = Location[0].text.strip()
        else:
          loc = "NaN"                                   
        
        Pros = container.find_all("span", {"data-test": "pros"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        except:
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        
        Contras = container.find_all("span", {"data-test": "cons"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        except:
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        
        lan = "English"

        data2 = pd.DataFrame([[com, da, ti, rat, rol, loc, pro, con, lan]])
        data2.columns = col
        TravelPerkrev_eng = TravelPerkrev_eng.append(data2, ignore_index=True)
        
#Creating a dataframe for all TravelPerk reviews across Spain
TravelPerk_reviews = TravelPerkrev_esp.append(TravelPerkrev_eng, ignore_index=True)

#Creating a 'csv' file with the 'TravelPerk_reviews' dataframe
TravelPerk_reviews.to_csv('TravelPerk_reviews.csv')

driver.quit()
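
Every field in the loops above follows the same pattern: run `find_all`, take the first match if there is one, and otherwise fall back to the string "NaN". A minimal sketch of how that pattern could be factored into a single helper is shown here. The name `first_text` is hypothetical (it belongs neither to the notebook nor to Beautiful Soup), and the `SimpleNamespace` objects are stand-ins for the `Tag` objects that `find_all` returns, so that the sketch runs without a network connection.

```python
from types import SimpleNamespace


def first_text(matches, default="NaN"):
    """Return the stripped text of the first match, or `default` if there is none."""
    if len(matches) != 0:
        return matches[0].text.strip()
    return default


# Stand-ins for the lists returned by container.find_all(...);
# real Tag objects expose the same `.text` attribute.
found = [SimpleNamespace(text="  Madrid \n")]
missing = []

loc = first_text(found)    # "Madrid"
rol = first_text(missing)  # "NaN"
print(loc, rol)
```

Inside the scraping loop, a call could then read, for instance, `loc = first_text(container.find_all("span", {"class": "authorLocation"}))`.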

In [36]:
#Calling to webdriver

driver = webdriver.Firefox(executable_path = "/home/dsc/Selenium/Firefox_driver/geckodriver")

driver.get("https://www.glassdoor.es/Opiniones/King-Espa%C3%B1a-Opiniones-EI_IE597128.0,4_IL.5,11_IN219.htm?filter.iso3Language=spa")

time.sleep(5)  #wait 5 seconds for the page to load

cookies_button = driver.find_element_by_id("onetrust-accept-btn-handler").click() 

#log_in = driver.find_element_by_class_name("sign-in").click()
log_in = driver.find_element_by_id("SignInButton").click() 
username = driver.find_element_by_name("username").send_keys("micasbot@gmail.com")
userpwd = driver.find_element_by_name("password").send_keys("Fofito88")
login_enter = driver.find_element_by_name("submit").click()

#Scraping King: Spanish reviews (56 in total) + English reviews (59 in total)
#Creating a dataframe for all King reviews across Spain

King_reviews = pd.DataFrame(columns = col)

#Iterate over the page URLs so that each one can be scraped in turn

#Spanish reviews:
Kingrev_esp = pd.DataFrame(columns = col)
url_list1 = []
for i in range(1, 7):
    #The goal is to iterate over every page to reach all of its reviews
    url = ("https://www.glassdoor.es/Opiniones/King-Espa%C3%B1a-Opiniones-EI_IE597128.0,4_IL.5,11_IN219_IP" + str(i) + ".htm?filter.iso3Language=spa")
    page = requests.get(url, headers=headers)  #request to the website
    soup = BeautifulSoup(page.content, "html.parser")  #download the content
    Kingrev_esp_containers = soup.find_all("div", {"class": "gdReview"})   #create the object tree
    
    #Run the queries for each review:
    for container in Kingrev_esp_containers:
        
        com = "King"
        
        Date = container.find_all("time", {"class": "date subtle small"})
        if len(Date) != 0:
          da = Date[0].text.strip()
        else:
          da = "NaN"

        Title = container.find_all("h2", {"class": "h2 summary strong mb-xsm mt-0"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        except:
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        
        Rating =  container.find_all("div", {"class": "v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small"})
        if len(Rating) != 0:
          rat = Rating[0].text.strip()
        else:
          rat = "NaN"  
    
        Role = container.find_all("span", {"class": "authorJobTitle middle "})
        if len(Role) != 0:
          rol = Role[0].text.strip()
        else:
          rol = "NaN"
                                           
        Location = container.find_all("span", {"class": "authorLocation"})
        if len(Location) != 0:
          loc = Location[0].text.strip()
        else:
          loc = "NaN"                                   
        
        Pros = container.find_all("span", {"data-test": "pros"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        except:
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        
        Contras = container.find_all("span", {"data-test": "cons"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        except:
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        
        lan = "Español"

        data1 = pd.DataFrame([[com, da, ti, rat, rol, loc, pro, con, lan]]) 
        data1.columns = col
        Kingrev_esp = Kingrev_esp.append(data1, ignore_index=True)  #add all the data obtained to a dataframe

        
#English reviews:
Kingrev_eng = pd.DataFrame(columns = col)
url_list2 = []
for i in range(1, 7):
    url = ("https://www.glassdoor.es/Opiniones/King-Espa%C3%B1a-Opiniones-EI_IE597128.0,4_IL.5,11_IN219_IP" + str(i) + ".htm?filter.iso3Language=eng")
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.content, "html.parser")
    Kingrev_eng_containers = soup.find_all("div", {"class": "gdReview"})
                                                  
    for container in Kingrev_eng_containers:
        
        com = "King"
        
        Date = container.find_all("time", {"class": "date subtle small"})
        if len(Date) != 0:
          da = Date[0].text.strip()
        else:
          da = "NaN"

        Title = container.find_all("h2", {"class": "h2 summary strong mb-xsm mt-0"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        except:
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        
        Rating =  container.find_all("div", {"class": "v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small"})
        if len(Rating) != 0:
          rat = Rating[0].text.strip()
        else:
          rat = "NaN"  
    
        Role = container.find_all("span", {"class": "authorJobTitle middle "})
        if len(Role) != 0:
          rol = Role[0].text.strip()
        else:
          rol = "NaN"
                                           
        Location = container.find_all("span", {"class": "authorLocation"})
        if len(Location) != 0:
          loc = Location[0].text.strip()
        else:
          loc = "NaN"                                   
        
        Pros = container.find_all("span", {"data-test": "pros"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        except:
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        
        Contras = container.find_all("span", {"data-test": "cons"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        except:
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        
        lan = "English"

        data2 = pd.DataFrame([[com, da, ti, rat, rol, loc, pro, con, lan]])
        data2.columns = col
        Kingrev_eng = Kingrev_eng.append(data2, ignore_index=True)
        
#Creating a dataframe for all King reviews across Spain
King_reviews = Kingrev_esp.append(Kingrev_eng, ignore_index=True)

#Creating a 'csv' file with the 'King_reviews' dataframe
King_reviews.to_csv('King_reviews.csv')

driver.quit()
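
The upper bound of each `range(...)` above is consistent with the review totals noted in the comments if Glassdoor serves 10 reviews per page: 56 and 59 reviews both need 6 pages, hence `range(1, 7)`. The sketch below derives the range from the total instead of hard-coding it. The page size of 10 is an assumption inferred from those numbers, and `page_range` is a hypothetical helper name.

```python
import math

REVIEWS_PER_PAGE = 10  # assumption inferred from the totals and ranges used above


def page_range(total_reviews, per_page=REVIEWS_PER_PAGE):
    """Return the 1-based page numbers needed to cover `total_reviews`."""
    pages = math.ceil(total_reviews / per_page)
    return range(1, pages + 1)


king_spanish_pages = page_range(56)  # range(1, 7)
king_english_pages = page_range(59)  # range(1, 7)
print(list(king_spanish_pages))
```

The same function reproduces the TravelPerk loops as well: 92 reviews give `range(1, 11)` and 89 reviews give `range(1, 10)`.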

In [37]:
#Calling to webdriver

driver = webdriver.Firefox(executable_path = "/home/dsc/Selenium/Firefox_driver/geckodriver")

driver.get("https://www.glassdoor.es/Opiniones/Mango-Espa%C3%B1a-Opiniones-EI_IE315079.0,5_IL.6,12_IN219.htm?filter.iso3Language=spa")

time.sleep(5)  #wait 5 seconds for the page to load

cookies_button = driver.find_element_by_id("onetrust-accept-btn-handler").click() 

#log_in = driver.find_element_by_class_name("sign-in").click()
log_in = driver.find_element_by_id("SignInButton").click() 
username = driver.find_element_by_name("username").send_keys("micasbot@gmail.com")
userpwd = driver.find_element_by_name("password").send_keys("Fofito88")
login_enter = driver.find_element_by_name("submit").click()

#Scraping Mango: Spanish reviews (72 in total) + English reviews (37 in total)
#Creating a dataframe for all Mango reviews across Spain

Mango_reviews = pd.DataFrame(columns = col)

#Iterate over the page URLs so that each one can be scraped in turn

#Spanish reviews:
Mangorev_esp = pd.DataFrame(columns = col)
url_list1 = []
for i in range(1, 9):
    #The goal is to iterate over every page to reach all of its reviews
    url = ("https://www.glassdoor.es/Opiniones/Mango-Espa%C3%B1a-Opiniones-EI_IE315079.0,5_IL.6,12_IN219_IP" + str(i) + ".htm?filter.iso3Language=spa")
    page = requests.get(url, headers=headers)  #request to the website
    soup = BeautifulSoup(page.content, "html.parser")  #download the content
    Mangorev_esp_containers = soup.find_all("div", {"class": "gdReview"})   #create the object tree
    
    #Run the queries for each review:
    for container in Mangorev_esp_containers:
        
        com = "Mango"
        
        Date = container.find_all("time", {"class": "date subtle small"})
        if len(Date) != 0:
          da = Date[0].text.strip()
        else:
          da = "NaN"

        Title = container.find_all("h2", {"class": "h2 summary strong mb-xsm mt-0"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        except:
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        
        Rating =  container.find_all("div", {"class": "v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small"})
        if len(Rating) != 0:
          rat = Rating[0].text.strip()
        else:
          rat = "NaN"  
    
        Role = container.find_all("span", {"class": "authorJobTitle middle "})
        if len(Role) != 0:
          rol = Role[0].text.strip()
        else:
          rol = "NaN"
                                           
        Location = container.find_all("span", {"class": "authorLocation"})
        if len(Location) != 0:
          loc = Location[0].text.strip()
        else:
          loc = "NaN"                                   
        
        Pros = container.find_all("span", {"data-test": "pros"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        except:
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        
        Contras = container.find_all("span", {"data-test": "cons"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        except:
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        
        lan = "Español"

        data1 = pd.DataFrame([[com, da, ti, rat, rol, loc, pro, con, lan]]) 
        data1.columns = col
        Mangorev_esp = Mangorev_esp.append(data1, ignore_index=True)  #add all the data obtained to a dataframe

        
#English reviews:
Mangorev_eng = pd.DataFrame(columns = col)
url_list2 = []
for i in range(1, 5):
    url = ("https://www.glassdoor.es/Opiniones/Mango-Espa%C3%B1a-Opiniones-EI_IE315079.0,5_IL.6,12_IN219_IP" + str(i) + ".htm?filter.iso3Language=eng")
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.content, "html.parser")
    Mangorev_eng_containers = soup.find_all("div", {"class": "gdReview"})
                                                  
    for container in Mangorev_eng_containers:
        
        com = "Mango"
        
        Date = container.find_all("time", {"class": "date subtle small"})
        if len(Date) != 0:
          da = Date[0].text.strip()
        else:
          da = "NaN"

        Title = container.find_all("h2", {"class": "h2 summary strong mb-xsm mt-0"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        except:
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        
        Rating =  container.find_all("div", {"class": "v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small"})
        if len(Rating) != 0:
          rat = Rating[0].text.strip()
        else:
          rat = "NaN"  
    
        Role = container.find_all("span", {"class": "authorJobTitle middle "})
        if len(Role) != 0:
          rol = Role[0].text.strip()
        else:
          rol = "NaN"
                                           
        Location = container.find_all("span", {"class": "authorLocation"})
        if len(Location) != 0:
          loc = Location[0].text.strip()
        else:
          loc = "NaN"                                   
        
        Pros = container.find_all("span", {"data-test": "pros"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        except:
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        
        Contras = container.find_all("span", {"data-test": "cons"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        except:
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        
        lan = "English"

        data2 = pd.DataFrame([[com, da, ti, rat, rol, loc, pro, con, lan]])
        data2.columns = col
        Mangorev_eng = Mangorev_eng.append(data2, ignore_index=True)
        
#Creating a dataframe for all Mango reviews across Spain
Mango_reviews = Mangorev_esp.append(Mangorev_eng, ignore_index=True)

#Creating a 'csv' file with the 'Mango_reviews' dataframe
Mango_reviews.to_csv('Mango_reviews.csv')

driver.quit()
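
The page URLs in each loop differ only in the page number after `_IP` and in the language code at the end. The sketch below builds them with one small generator, which is also a way to fill the `url_list1` and `url_list2` lists that the cells create and leave empty. `review_page_urls` and `MANGO_BASE` are hypothetical names; the base string is copied from the Mango cell above.

```python
def review_page_urls(base, language, pages):
    """Yield one review-page URL per page number for the given language code."""
    for page in pages:
        yield f"{base}_IP{page}.htm?filter.iso3Language={language}"


# Base path copied from the Mango cell above (everything before "_IP<n>.htm").
MANGO_BASE = "https://www.glassdoor.es/Opiniones/Mango-Espa%C3%B1a-Opiniones-EI_IE315079.0,5_IL.6,12_IN219"

spanish_urls = list(review_page_urls(MANGO_BASE, "spa", range(1, 9)))
english_urls = list(review_page_urls(MANGO_BASE, "eng", range(1, 5)))

print(len(spanish_urls), len(english_urls))
print(spanish_urls[0])
```

Keeping the URL construction in one place means a change in the URL scheme only has to be fixed once rather than in every company cell.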

In [38]:
#Calling to webdriver

driver = webdriver.Firefox(executable_path = "/home/dsc/Selenium/Firefox_driver/geckodriver")

driver.get("https://www.glassdoor.es/Opiniones/Socialpoint-Espa%C3%B1a-Opiniones-EI_IE662378.0,11_IL.12,18_IN219.htm?filter.iso3Language=spa")

time.sleep(5)  #wait 5 seconds for the page to load

cookies_button = driver.find_element_by_id("onetrust-accept-btn-handler").click() 

#log_in = driver.find_element_by_class_name("sign-in").click()
log_in = driver.find_element_by_id("SignInButton").click() 
username = driver.find_element_by_name("username").send_keys("micasbot@gmail.com")
userpwd = driver.find_element_by_name("password").send_keys("Fofito88")
login_enter = driver.find_element_by_name("submit").click()

#Scraping Socialpoint: Spanish reviews (57 in total) + English reviews (72 in total)
#Creating a dataframe for all Socialpoint reviews across Spain

Socialpoint_reviews = pd.DataFrame(columns = col)

#Iterate over the page URLs so that each one can be scraped in turn

#Spanish reviews:
Socialpointrev_esp = pd.DataFrame(columns = col)
url_list1 = []
for i in range(1, 7):
    #The goal is to iterate over every page to reach all of its reviews
    url = ("https://www.glassdoor.es/Opiniones/Socialpoint-Espa%C3%B1a-Opiniones-EI_IE662378.0,11_IL.12,18_IN219_IP" + str(i) + ".htm?filter.iso3Language=spa")
    page = requests.get(url, headers=headers)  #request to the website
    soup = BeautifulSoup(page.content, "html.parser")  #download the content
    Socialpointrev_esp_containers = soup.find_all("div", {"class": "gdReview"})   #create the object tree
    
    #Run the queries for each review:
    for container in Socialpointrev_esp_containers:
        
        com = "Socialpoint"
        
        Date = container.find_all("time", {"class": "date subtle small"})
        if len(Date) != 0:
          da = Date[0].text.strip()
        else:
          da = "NaN"

        Title = container.find_all("h2", {"class": "h2 summary strong mb-xsm mt-0"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        except:
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        
        Rating =  container.find_all("div", {"class": "v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small"})
        if len(Rating) != 0:
          rat = Rating[0].text.strip()
        else:
          rat = "NaN"  
    
        Role = container.find_all("span", {"class": "authorJobTitle middle "})
        if len(Role) != 0:
          rol = Role[0].text.strip()
        else:
          rol = "NaN"
                                           
        Location = container.find_all("span", {"class": "authorLocation"})
        if len(Location) != 0:
          loc = Location[0].text.strip()
        else:
          loc = "NaN"                                   
        
        Pros = container.find_all("span", {"data-test": "pros"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        except:
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        
        Contras = container.find_all("span", {"data-test": "cons"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        except:
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        
        lan = "Español"

        data1 = pd.DataFrame([[com, da, ti, rat, rol, loc, pro, con, lan]]) 
        data1.columns = col
        Socialpointrev_esp = Socialpointrev_esp.append(data1, ignore_index=True)  #add all the data obtained to a dataframe

        
#English reviews:
Socialpointrev_eng = pd.DataFrame(columns = col)
url_list2 = []
for i in range(1, 9):
    url = ("https://www.glassdoor.es/Opiniones/Socialpoint-Espa%C3%B1a-Opiniones-EI_IE662378.0,11_IL.12,18_IN219_IP" + str(i) + ".htm?filter.iso3Language=eng")
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.content, "html.parser")
    Socialpointrev_eng_containers = soup.find_all("div", {"class": "gdReview"})
                                                  
    for container in Socialpointrev_eng_containers:
        
        com = "Socialpoint"
        
        Date = container.find_all("time", {"class": "date subtle small"})
        if len(Date) != 0:
          da = Date[0].text.strip()
        else:
          da = "NaN"

        Title = container.find_all("h2", {"class": "h2 summary strong mb-xsm mt-0"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        except:
            if len(Title) != 0:
              ti = Title[0].text.strip()
            else:
              ti = "NaN"
        
        Rating =  container.find_all("div", {"class": "v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small"})
        if len(Rating) != 0:
          rat = Rating[0].text.strip()
        else:
          rat = "NaN"  
    
        Role = container.find_all("span", {"class": "authorJobTitle middle "})
        if len(Role) != 0:
          rol = Role[0].text.strip()
        else:
          rol = "NaN"
                                           
        Location = container.find_all("span", {"class": "authorLocation"})
        if len(Location) != 0:
          loc = Location[0].text.strip()
        else:
          loc = "NaN"                                   
        
        Pros = container.find_all("span", {"data-test": "pros"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        except:
            if len(Pros) != 0:
              pro = Pros[0].text.strip()
            else:
              pro = "NaN"
        
        Contras = container.find_all("span", {"data-test": "cons"})
        try:
            followreading = driver.find_element_by_css_selector(".v2__EIReviewDetailsV2__clickable").click()
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        except:
            if len(Contras) != 0:
              con = Contras[0].text.strip()
            else:
              con = "NaN"
        
        lan = "English"

        data2 = pd.DataFrame([[com, da, ti, rat, rol, loc, pro, con, lan]])
        data2.columns = col
        Socialpointrev_eng = Socialpointrev_eng.append(data2, ignore_index=True)
        
#Creating a dataframe for all Socialpoint reviews across Spain
Socialpoint_reviews = Socialpointrev_esp.append(Socialpointrev_eng, ignore_index=True)

#Creating a 'csv' file with the 'Socialpoint_reviews' dataframe
Socialpoint_reviews.to_csv('Socialpoint_reviews.csv')

driver.quit()
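
The cells above grow each dataframe one review at a time with `DataFrame.append`, which copies the whole dataframe on every call and is no longer available in pandas 2.0 and later. A minimal alternative sketch is to collect the rows in a plain list and build the dataframe once at the end. The column list is assumed to match the notebook's `col` variable (the names are taken from the `info()` output further down), and the row values are made-up placeholders.

```python
import pandas as pd

# Assumed to match the notebook's `col`; names taken from the info() output.
col = ["Company", "Date", "Title", "Rating", "Role", "Location", "Pros", "Contras", "Language"]

rows = []  # one list of field values per review

# Placeholder values standing in for the fields scraped inside the loop.
rows.append(["Socialpoint", "NaN", "title A", "NaN", "NaN", "Barcelona", "pros A", "cons A", "Español"])
rows.append(["Socialpoint", "NaN", "title B", "NaN", "NaN", "Barcelona", "pros B", "cons B", "English"])

# Build the dataframe once, after all rows have been collected.
reviews = pd.DataFrame(rows, columns=col)
print(reviews.shape)
```

With this approach the Spanish and English rows can share one list, so the final `append` that joins the two language dataframes is not needed either.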

### Importing the '.csv' files:

While scraping each company, I built one dataframe per company and saved it as a .csv file. Working directly from those in-memory dataframes would mean re-running the whole scraping process every time I open the notebook, so instead I import the previously saved .csv files and combine them into a single dataframe to use throughout the project.

In [39]:
Accenture_reviews = pd.read_csv("Data/Accenture_reviews.csv")
Adevinta_reviews = pd.read_csv("Data/Adevinta_reviews.csv")
Amadeus_reviews = pd.read_csv("Data/Amadeus_reviews.csv")
Amazon_reviews = pd.read_csv("Data/Amazon_reviews.csv")
Atos_reviews = pd.read_csv("Data/Atos_reviews.csv")
BBVA_reviews = pd.read_csv("Data/BBVA_reviews.csv")
Caixabank_reviews = pd.read_csv("Data/Caixabank_reviews.csv")
Capgemini_reviews = pd.read_csv("Data/Capgemini_reviews.csv")
Criteo_reviews = pd.read_csv("Data/Criteo_reviews.csv")
Deloitte_reviews = pd.read_csv("Data/Deloitte_reviews.csv")
DXC_reviews = pd.read_csv("Data/DXC_reviews.csv")
eDreams_reviews = pd.read_csv("Data/eDreams_reviews.csv")
Everis_reviews = pd.read_csv("Data/Everis_reviews.csv")
EY_reviews = pd.read_csv("Data/EY_reviews.csv")
GFT_reviews = pd.read_csv("Data/GFT_reviews.csv")
Glovo_reviews = pd.read_csv("Data/Glovo_reviews.csv")
HPE_reviews = pd.read_csv("Data/HPE_reviews.csv")
HPInc_reviews = pd.read_csv("Data/HPInc_reviews.csv")
IBM_reviews = pd.read_csv("Data/IBM_reviews.csv")
Indra_reviews = pd.read_csv("Data/Indra_reviews.csv")
King_reviews = pd.read_csv("Data/King_reviews.csv")
KPMG_reviews = pd.read_csv("Data/KPMG_reviews.csv")
Mango_reviews = pd.read_csv("Data/Mango_reviews.csv")
Minsait_reviews = pd.read_csv("Data/Minsait_reviews.csv")
Nestle_reviews = pd.read_csv("Data/Nestle_reviews.csv")
Novartis_reviews = pd.read_csv("Data/Novartis_reviews.csv")
PwC_reviews = pd.read_csv("Data/PwC_reviews.csv")
Roche_reviews = pd.read_csv("Data/Roche_reviews.csv")
Santander_reviews = pd.read_csv("Data/Santander_reviews.csv")
SEAT_reviews = pd.read_csv("Data/SEAT_reviews.csv")
Socialpoint_reviews = pd.read_csv("Data/Socialpoint_reviews.csv")
Sopra_reviews = pd.read_csv("Data/Sopra_reviews.csv")
Telefonica_reviews = pd.read_csv("Data/Telefonica_reviews.csv")
TravelPerk_reviews = pd.read_csv("Data/TravelPerk_reviews.csv")
Typeform_reviews = pd.read_csv("Data/Typeform_reviews.csv")
Vistaprint_reviews = pd.read_csv("Data/Vistaprint_reviews.csv")
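
Since every file follows the same `<Company>_reviews.csv` naming scheme, the reads above could also be written as a loop over company names. The sketch below shows the idea with a shortened company list and a temporary folder filled with placeholder files, so that it runs without the real `Data` folder. The dictionary `frames` and the placeholder contents are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Shortened list for illustration; the notebook loads 36 companies.
companies = ["Accenture", "Adevinta", "Amadeus"]

with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)  # stands in for the notebook's "Data" folder

    # Create tiny placeholder files so the sketch runs anywhere.
    for name in companies:
        pd.DataFrame({"Company": [name], "Title": ["placeholder"]}).to_csv(
            data_dir / f"{name}_reviews.csv", index=False
        )

    # One dataframe per company, keyed by company name.
    frames = {
        name: pd.read_csv(data_dir / f"{name}_reviews.csv") for name in companies
    }

print(sorted(frames))
```

A dictionary keyed by company name also makes the later merge shorter, because `frames.values()` can be passed to `pd.concat` directly.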

### Merging it all into a single dataframe:

Next, I will join all the dataframes to create a single one to work with. Since all the dataframes hold the same type of information, with the columns in the same order, I will use the 'pd.concat' function. Because the dataframes are stacked one below another (joined along the rows), I will set 'axis = 0'.

In [57]:
Companies_reviews = pd.concat([Accenture_reviews, Adevinta_reviews, Amadeus_reviews, Amazon_reviews, Atos_reviews, BBVA_reviews, Caixabank_reviews, Capgemini_reviews, Criteo_reviews, Deloitte_reviews, DXC_reviews, eDreams_reviews, Everis_reviews, EY_reviews, GFT_reviews, Glovo_reviews, HPE_reviews, HPInc_reviews, IBM_reviews, Indra_reviews, King_reviews, KPMG_reviews, Mango_reviews, Minsait_reviews, Nestle_reviews, Novartis_reviews, PwC_reviews, Roche_reviews, Santander_reviews, SEAT_reviews, Socialpoint_reviews, Sopra_reviews, Telefonica_reviews, TravelPerk_reviews, Typeform_reviews, Vistaprint_reviews], axis=0, ignore_index = True)
Companies_reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10488 entries, 0 to 10487
Data columns (total 10 columns):
Unnamed: 0    10488 non-null int64
Company       10488 non-null object
Date          10483 non-null object
Title         10488 non-null object
Rating        10488 non-null object
Role          10382 non-null object
Location      10486 non-null object
Pros          10487 non-null object
Contras       10486 non-null object
Language      10488 non-null object
dtypes: int64(1), object(9)
memory usage: 819.5+ KB
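
To illustrate what `axis=0` does in the concatenation above, here is a minimal sketch with two toy dataframes. The names `reviews_a`, `reviews_b`, `stacked` and `side_by_side` and the values are invented for the example; they are not part of the scraped data.

```python
import pandas as pd

# Two toy dataframes with the same columns, standing in for two company files
reviews_a = pd.DataFrame({"Company": ["A", "A"], "Rating": [50, 40]})
reviews_b = pd.DataFrame({"Company": ["B"], "Rating": [30]})

# axis=0 stacks the rows one below the other; ignore_index renumbers them 0..n-1
stacked = pd.concat([reviews_a, reviews_b], axis=0, ignore_index=True)
print(stacked.shape)  # (3, 2)

# axis=1 would instead place the dataframes side by side, as extra columns
side_by_side = pd.concat([reviews_a, reviews_b], axis=1)
print(side_by_side.shape)  # (2, 4)
```

With `axis=0` the number of columns stays the same and the number of rows is the sum of the inputs, which is what we want when every file holds reviews with the same columns.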


### Creating a unified .csv file:

In [58]:
Companies_reviews.to_csv('Companies_reviews.csv')

### Every time we open the notebook we only need to read the unified .csv file:

In [4]:
Companies_reviews = pd.read_csv('Companies_reviews.csv')
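
By default, `to_csv` also writes the row index into the file, and `read_csv` then reads it back as a column named 'Unnamed: 0'. This is where the 'Unnamed' columns removed in the Data Cleaning section come from. A minimal sketch with toy data, held in memory instead of on disk (the dataframe `df` and the buffer names are invented for the example), shows the difference that `index=False` makes:

```python
import io
import pandas as pd

df = pd.DataFrame({"Company": ["Accenture", "IBM"], "Rating": [50, 40]})

# Default behaviour: the index is written as an unnamed first column...
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
# ...and comes back as a column called 'Unnamed: 0'
reloaded_with_index = pd.read_csv(with_index)
print(list(reloaded_with_index.columns))  # ['Unnamed: 0', 'Company', 'Rating']

# index=False leaves the index out of the file
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)
print(list(reloaded_clean.columns))  # ['Company', 'Rating']
```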

# **Data Cleaning**

### Having a look at the data I have:

In [106]:
Companies_reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10488 entries, 0 to 10487
Data columns (total 11 columns):
Unnamed: 0      10488 non-null int64
Unnamed: 0.1    10488 non-null int64
Company         10488 non-null object
Date            10483 non-null object
Title           10488 non-null object
Rating          10488 non-null object
Role            10382 non-null object
Location        10486 non-null object
Pros            10487 non-null object
Contras         10486 non-null object
Language        10488 non-null object
dtypes: int64(2), object(9)
memory usage: 901.4+ KB


In [107]:
Companies_reviews.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Company,Date,Title,Rating,Role,Location,Pros,Contras,Language
0,0,0,Accenture,15 de abril de 2021,«Prácticas consultoría Accenture»,50,Exbecario - Internship,Madrid,excelente ambiente de trabajo y proyectos inte...,"mucho trabajo, rara vez sales a la hora predef...",Español
1,1,1,Accenture,21 de abril de 2021,«Salario Especialista Accenture»,40,Empleado actual - Especialista,Madrid,Posibilidad de crecimiento. Estabilidad laboral,En ciertas areas los horarios pueden ser exces...,Español
2,2,2,Accenture,18 de abril de 2021,«Buen ambiente de trabajo»,40,Empleado actual - Consulting Analyst,Bilbao,"Bien ambiente, contrato indefinido, buen traba...","Carrera lenta, en algunos clientes la carga de...",Español
3,3,3,Accenture,19 de abril de 2021,«Becario»,50,Empleado actual - Consulting Intern,Madrid,Buen trato y buen ambiente dentro de la empresa.,"Ninguna por el momento, si eso el no poder ele...",Español
4,4,4,Accenture,15 de abril de 2021,«Sueldo accenture»,50,Empleado actual - Data Analyst,Madrid,Cuidan al empleado ayudandolea a encajar y bus...,Ir de proyecto en proyecto,Español


We have an initial dataframe with 10488 reviews.

### Removing the columns that are not necessary:

In [5]:
Companies_reviews.columns

Index(['Unnamed: 0', 'Unnamed: 0.1', 'Company', 'Date', 'Title', 'Rating',
       'Role', 'Location', 'Pros', 'Contras', 'Language'],
      dtype='object')

In [6]:
Companies_reviews = Companies_reviews.drop(['Unnamed: 0', 'Unnamed: 0.1'], axis = 1)
Companies_reviews.head()

Unnamed: 0,Company,Date,Title,Rating,Role,Location,Pros,Contras,Language
0,Accenture,15 de abril de 2021,«Prácticas consultoría Accenture»,50,Exbecario - Internship,Madrid,excelente ambiente de trabajo y proyectos inte...,"mucho trabajo, rara vez sales a la hora predef...",Español
1,Accenture,21 de abril de 2021,«Salario Especialista Accenture»,40,Empleado actual - Especialista,Madrid,Posibilidad de crecimiento. Estabilidad laboral,En ciertas areas los horarios pueden ser exces...,Español
2,Accenture,18 de abril de 2021,«Buen ambiente de trabajo»,40,Empleado actual - Consulting Analyst,Bilbao,"Bien ambiente, contrato indefinido, buen traba...","Carrera lenta, en algunos clientes la carga de...",Español
3,Accenture,19 de abril de 2021,«Becario»,50,Empleado actual - Consulting Intern,Madrid,Buen trato y buen ambiente dentro de la empresa.,"Ninguna por el momento, si eso el no poder ele...",Español
4,Accenture,15 de abril de 2021,«Sueldo accenture»,50,Empleado actual - Data Analyst,Madrid,Cuidan al empleado ayudandolea a encajar y bus...,Ir de proyecto en proyecto,Español


### Checking if there are 'NaN' values in the dataframe:

NaN ("Not a Number") values mark missing data. They cannot be used in computations, so we must deal with them before working with the data set.

In [110]:
Companies_reviews.isnull().values.any()

#This code will return 'True' if there is any NaN value in the DataFrame. 

True

In [111]:
Companies_reviews.isnull().any()

#This code shows which columns contain NaN values.

Company     False
Date         True
Title       False
Rating      False
Role         True
Location     True
Pros         True
Contras      True
Language    False
dtype: bool

In [7]:
Companies_reviews.isnull().sum()

#This code returns how many NaN values there are in each column

Company       0
Date          5
Title         0
Rating        0
Role        106
Location      2
Pros          1
Contras       2
Language      0
dtype: int64

The most relevant finding is that 106 reviews do not specify the role of the employee. Since the analysis focuses on reviews from technical roles (we want to know the opinions of the IT market), we will remove these reviews, as well as the few that lack positive or negative comments or have no location indicated.

In [8]:
Companies_reviews.isnull().sum().sum()

#This code returns how many NaN values we have in total

116

In [9]:
nan_rows = Companies_reviews[Companies_reviews.isnull().any(axis=1)]
nan_rows.info()

#This code will create a new DataFrame with only those rows that contain NaN values, to have a look into the reviews with NaN values

<class 'pandas.core.frame.DataFrame'>
Int64Index: 115 entries, 339 to 10486
Data columns (total 9 columns):
Company     115 non-null object
Date        110 non-null object
Title       115 non-null object
Rating      115 non-null object
Role        9 non-null object
Location    113 non-null object
Pros        114 non-null object
Contras     113 non-null object
Language    115 non-null object
dtypes: object(9)
memory usage: 9.0+ KB


To remove the NaN values, we have two options:

1. Delete the entire row or review that contains the NaN value:

In [None]:
#df_sin_nan = df.dropna(how='any')

2. Fill in the NaN value with the mean or with the value that we want:

In [None]:
#df = df.fillna(df.mean())

If we were dealing with numerical data, I would choose the second option, that is, replace the NaN with the mean. But when dealing with text we cannot compute an average, and I do not consider it appropriate to invent a value or make an assumption. So I will choose the first option and remove every review that contains a NaN. The columns that matter are location, role, pros and cons, but note that `dropna(how='any')` checks all the columns, so the five reviews without a date are dropped as well.

In [10]:
Companies_reviews = Companies_reviews.dropna(how='any')

#Removing the entire rows (reviews) that contain NaN values, which leaves a dataframe without NaN values
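
The `dropna(how='any')` call above looks at every column. If only some columns should decide whether a review is kept, `dropna` accepts a `subset` argument. A minimal sketch on toy data (the dataframe `toy` and its values are invented for the example):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Date": ["15 de abril de 2021", np.nan, "7 de abril de 2021"],
    "Role": ["Data Analyst", "Consultant", np.nan],
    "Pros": ["Buen ambiente", "Buen sueldo", "Estabilidad"],
})

# how='any' on the whole dataframe drops a row if ANY column is missing
all_columns = toy.dropna(how="any")
print(len(all_columns))  # 1

# subset= restricts the check to the columns listed
relevant_only = toy.dropna(subset=["Role", "Pros"])
print(len(relevant_only))  # 2
```

In this notebook both choices lead to a usable dataframe; the `subset` variant would simply keep the reviews whose only missing value is the date.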

In [11]:
Companies_reviews.isnull().values.any()

#Checking once again if there is any NaN value in the new DataFrame

False

In [12]:
len(Companies_reviews)

10373

Now, after removing the NaN values, the dataframe has 10373 reviews in total.

### Removing day and month from the date (only the year is of interest):

The steps I will take are specified below:
1. Removing the word "de" from the original 'Date' column (using the `replace()` function)
2. Converting the type of the 'Date' column from Series to String
3. Creating a datetime object from the string (using the `datetime` library and its `strptime()` function)
4. Replacing the 'str' of the original column 'Date' with the 'datetime' object created
5. Removing day and month from the date, and creating a new column named 'Year' with only the year

In [113]:
#from datetime import datetime (already imported)

In the dataframe I have, the format of the Date column is something like "15 de abril de 2021". As I don't need the word "de" for anything, I will first remove it from all the dates using the `replace()` function. Because the pandas `.str` methods work on the whole column at once, a single call cleans every review and no 'for loop' is needed.

In [13]:
#1. Removing the word "de" from the original 'Date' column:

date_column = Companies_reviews['Date']

#Remove the "de" words from the original date format (one vectorised call covers all rows)
date_column = date_column.str.replace(" de ", " ", regex=False)

date_column

0            15 abril 2021
1            21 abril 2021
2            18 abril 2021
3            19 abril 2021
4            15 abril 2021
5            12 abril 2021
6            12 abril 2021
7             7 abril 2021
8             9 abril 2021
9            30 marzo 2021
10           30 marzo 2021
11            7 abril 2021
12            7 abril 2021
13           20 abril 2021
14           30 marzo 2021
15           15 marzo 2021
16           20 abril 2021
17            2 abril 2021
18           29 marzo 2021
19           24 marzo 2021
20           26 marzo 2021
21            9 marzo 2021
22           24 marzo 2021
23           25 marzo 2021
24           23 marzo 2021
25           15 marzo 2021
26           20 marzo 2021
27           17 marzo 2021
28           16 marzo 2021
29           12 abril 2021
               ...        
10456      30 octubre 2015
10457    17 noviembre 2013
10458         13 mayo 2013
10459        22 enero 2018
10460        10 julio 2018
10461        17 julio 2015
1

In [14]:
Companies_reviews['Date'] = date_column

In [15]:
date_columnsplit = date_column.str.split(expand=True)

date_columnsplit.columns = ['Date_day', 'Date_month', 'Date_year']

date_columnsplit.head(), print(type(date_columnsplit))

<class 'pandas.core.frame.DataFrame'>


(  Date_day Date_month Date_year
 0       15      abril      2021
 1       21      abril      2021
 2       18      abril      2021
 3       19      abril      2021
 4       15      abril      2021, None)

In [16]:
Companies_reviews = pd.concat([Companies_reviews, date_columnsplit], axis=1)

Companies_reviews.head()

Unnamed: 0,Company,Date,Title,Rating,Role,Location,Pros,Contras,Language,Date_day,Date_month,Date_year
0,Accenture,15 abril 2021,«Prácticas consultoría Accenture»,50,Exbecario - Internship,Madrid,excelente ambiente de trabajo y proyectos inte...,"mucho trabajo, rara vez sales a la hora predef...",Español,15,abril,2021
1,Accenture,21 abril 2021,«Salario Especialista Accenture»,40,Empleado actual - Especialista,Madrid,Posibilidad de crecimiento. Estabilidad laboral,En ciertas areas los horarios pueden ser exces...,Español,21,abril,2021
2,Accenture,18 abril 2021,«Buen ambiente de trabajo»,40,Empleado actual - Consulting Analyst,Bilbao,"Bien ambiente, contrato indefinido, buen traba...","Carrera lenta, en algunos clientes la carga de...",Español,18,abril,2021
3,Accenture,19 abril 2021,«Becario»,50,Empleado actual - Consulting Intern,Madrid,Buen trato y buen ambiente dentro de la empresa.,"Ninguna por el momento, si eso el no poder ele...",Español,19,abril,2021
4,Accenture,15 abril 2021,«Sueldo accenture»,50,Empleado actual - Data Analyst,Madrid,Cuidan al empleado ayudandolea a encajar y bus...,Ir de proyecto en proyecto,Español,15,abril,2021


In [17]:
Companies_reviews = Companies_reviews.drop(['Date', 'Date_day', 'Date_month'], axis = 1)
Companies_reviews.head()

Unnamed: 0,Company,Title,Rating,Role,Location,Pros,Contras,Language,Date_year
0,Accenture,«Prácticas consultoría Accenture»,50,Exbecario - Internship,Madrid,excelente ambiente de trabajo y proyectos inte...,"mucho trabajo, rara vez sales a la hora predef...",Español,2021
1,Accenture,«Salario Especialista Accenture»,40,Empleado actual - Especialista,Madrid,Posibilidad de crecimiento. Estabilidad laboral,En ciertas areas los horarios pueden ser exces...,Español,2021
2,Accenture,«Buen ambiente de trabajo»,40,Empleado actual - Consulting Analyst,Bilbao,"Bien ambiente, contrato indefinido, buen traba...","Carrera lenta, en algunos clientes la carga de...",Español,2021
3,Accenture,«Becario»,50,Empleado actual - Consulting Intern,Madrid,Buen trato y buen ambiente dentro de la empresa.,"Ninguna por el momento, si eso el no poder ele...",Español,2021
4,Accenture,«Sueldo accenture»,50,Empleado actual - Data Analyst,Madrid,Cuidan al empleado ayudandolea a encajar y bus...,Ir de proyecto en proyecto,Español,2021


In [33]:
#2. Converting the type of the 'Date_year' column from Series to String:

Year = Companies_reviews['Date_year']
print("Type of date column:", type(Year))

Year_string = str(Year)
print("Type of date string:", type(Year_string))

Type of date column: <class 'pandas.core.frame.DataFrame'>
Type of date string: <class 'str'>
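
As an alternative to the split used above, the year can also be obtained by building real datetime values from the original "15 de abril de 2021" strings. `strptime()` depends on the system locale for month names, so this sketch maps the Spanish month names to numbers by hand. The dictionary `MONTHS_ES`, the sample `dates` and the other variable names are assumptions made for the example; this is not the method used in the notebook.

```python
import pandas as pd

# Spanish month names mapped to month numbers (lower case)
MONTHS_ES = {
    "enero": 1, "febrero": 2, "marzo": 3, "abril": 4, "mayo": 5, "junio": 6,
    "julio": 7, "agosto": 8, "septiembre": 9, "octubre": 10,
    "noviembre": 11, "diciembre": 12,
}

dates = pd.Series(["15 de abril de 2021", "30 de octubre de 2015", "13 de mayo de 2013"])

# Split "15 de abril de 2021" into day / month name / year
parts = dates.str.extract(r"(?P<day>\d{1,2}) de (?P<month>\w+) de (?P<year>\d{4})")

# pd.to_datetime can assemble dates from 'year', 'month' and 'day' columns
parsed = pd.to_datetime(pd.DataFrame({
    "year": parts["year"].astype(int),
    "month": parts["month"].str.lower().map(MONTHS_ES),
    "day": parts["day"].astype(int),
}))

years = parsed.dt.year
print(years.tolist())  # [2021, 2015, 2013]
```

Keeping the year as a number (instead of text) makes it easier to sort and to group the reviews by year later on.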


### Removing the first part of the text, before the "-", from the original 'Role' column:

In [24]:
role_column = Companies_reviews['Role']

#Remove the words before the "-" from the original role column format.
#Each .str.replace() call acts on the whole column, so no loop is needed.
role_column = role_column.str.replace("Empleado", "", regex=False)
role_column = role_column.str.replace("Exempleado", "", regex=False)
role_column = role_column.str.replace("Becario", "", regex=False)
role_column = role_column.str.replace("Exbecario", "", regex=False)
role_column = role_column.str.replace("actual", "", regex=False)
role_column = role_column.str.replace("-", "", regex=False)
    
role_column.head(20)

0                               Internship
1                             Especialista
2                       Consulting Analyst
3                        Consulting Intern
4                             Data Analyst
5                       Associate Director
6                    Internship Technology
7             Technology Consulting Intern
8                        Senior Consultant
9                    Consultor Estratégico
10                   Consultor Estratégico
11              Desarrollador web frontend
12                              Consultant
13                               Consultor
14                     ABOGADO CORPORATIVO
15       SW/App/Cloud Tech Support Analyst
16                                Analista
17                     IT Security Manager
18                       Big Data Engineer
19                       Junior Consultant
Name: Role, dtype: object
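
Replacing words one by one also removes them when they belong to the job title itself (for example a hyphen inside the title). A possible alternative, sketched here on toy values, is to split each string once at the " - " separator and keep the second part. The sample roles are invented for the example, and the sketch assumes the separator is always written as a hyphen surrounded by spaces.

```python
import pandas as pd

roles = pd.Series([
    "Exbecario - Internship",
    "Empleado actual - Data Analyst",
    "Exempleado - Front-End Developer",
])

# Keep only what follows the first " - " (n=1 splits at most once) and trim spaces
clean_roles = roles.str.split(" - ", n=1).str[-1].str.strip()
print(clean_roles.tolist())
# ['Internship', 'Data Analyst', 'Front-End Developer']
```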

In [25]:
Companies_reviews['Role'] = role_column

Companies_reviews.head()

Unnamed: 0,Company,Title,Rating,Role,Location,Pros,Contras,Language,Date_year
0,Accenture,«Prácticas consultoría Accenture»,50,Internship,Madrid,excelente ambiente de trabajo y proyectos inte...,"mucho trabajo, rara vez sales a la hora predef...",Español,2021
1,Accenture,«Salario Especialista Accenture»,40,Especialista,Madrid,Posibilidad de crecimiento. Estabilidad laboral,En ciertas areas los horarios pueden ser exces...,Español,2021
2,Accenture,«Buen ambiente de trabajo»,40,Consulting Analyst,Bilbao,"Bien ambiente, contrato indefinido, buen traba...","Carrera lenta, en algunos clientes la carga de...",Español,2021
3,Accenture,«Becario»,50,Consulting Intern,Madrid,Buen trato y buen ambiente dentro de la empresa.,"Ninguna por el momento, si eso el no poder ele...",Español,2021
4,Accenture,«Sueldo accenture»,50,Data Analyst,Madrid,Cuidan al empleado ayudandolea a encajar y bus...,Ir de proyecto en proyecto,Español,2021


### Removing **locations** outside of Spain:

In [26]:
locations_count = Companies_reviews.groupby(['Location']).count().sort_values(['Company'], ascending=False)
locations_count


Unnamed: 0_level_0,Company,Title,Rating,Role,Pros,Contras,Language,Date_year
Location,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Madrid,4918,4918,4918,4918,4918,4918,4918,4918
Barcelona,3619,3619,3619,3619,3619,3619,3619,3619
Sant Cugat del Vallès,197,197,197,197,197,197,197,197
"Valencia, Comunidad Valenciana, Comunidad Valenciana",161,161,161,161,161,161,161,161
Sevilla,120,120,120,120,120,120,120,120
Bilbao,69,69,69,69,69,69,69,69
Alicante,57,57,57,57,57,57,57,57
Málaga,52,52,52,52,52,52,52,52
Alcobendas,44,44,44,44,44,44,44,44
Murcia,44,44,44,44,44,44,44,44


Counting the number of reviews by location shows that the scraping also picked up reviews from employees located outside of Spain, which are not of interest for this analysis.

We can also observe that almost all of these locations outside of Spain have 12 reviews or fewer. Therefore, by 'slicing' the counts with the condition of 12 reviews or fewer, we can see what these locations are, and then I will drop them from the dataframe used for the analysis.

In [27]:
type(locations_count)

pandas.core.frame.DataFrame

In [28]:
#Making sure the review counts are stored as integers
locations_count['Company'] = locations_count['Company'].astype(int)

print(type(locations_count['Company']))

<class 'pandas.core.series.Series'>


In [29]:
#Slicing by condition:

locations_count.loc[(locations_count.Company <= 12)]
                                    

Unnamed: 0_level_0,Company,Title,Rating,Role,Pros,Contras,Language,Date_year
Location,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Badalona,12,12,12,12,12,12,12,12
Bangalore,12,12,12,12,12,12,12,12
Gerona,11,11,11,11,11,11,11,11
"London, England, England",11,11,11,11,11,11,11,11
"León, Castilla y León, Castilla y León",11,11,11,11,11,11,11,11
Ciudad Real,11,11,11,11,11,11,11,11
Aranjuez,11,11,11,11,11,11,11,11
"Vigo, La Coruña, Galicia, Galicia",10,10,10,10,10,10,10,10
"New York, NY",10,10,10,10,10,10,10,10
Getafe,10,10,10,10,10,10,10,10
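
The same kind of count can be obtained more compactly with `value_counts()`, which returns a single column of counts instead of one repeated count per column. A minimal sketch on toy data (the dataframe `toy` and the threshold of 2 are invented for the example):

```python
import pandas as pd

toy = pd.DataFrame({"Location": ["Madrid"] * 3 + ["Barcelona"] * 2 + ["New York, NY"]})

# value_counts() returns one count per location, sorted in descending order
counts = toy["Location"].value_counts()

# Boolean mask: keep only the locations with at most 2 reviews
rare = counts[counts <= 2]
print(rare.index.tolist())  # ['Barcelona', 'New York, NY']
```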


In [30]:
locations_count.loc[(locations_count.Company <= 4)]


Unnamed: 0_level_0,Company,Title,Rating,Role,Pros,Contras,Language,Date_year
Location,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Tarragona,4,4,4,4,4,4,4,4
"Grand Rapids, MI",4,4,4,4,4,4,4,4
"San Ramon, CA",4,4,4,4,4,4,4,4
Vitoria-Gasteiz,4,4,4,4,4,4,4,4
"Pasco, WA",4,4,4,4,4,4,4,4
"Windsor, NH",4,4,4,4,4,4,4,4
"Chicago, IL",4,4,4,4,4,4,4,4
Toledo,4,4,4,4,4,4,4,4
Makati City,4,4,4,4,4,4,4,4
Badajoz,4,4,4,4,4,4,4,4


In [31]:
locations_count.loc[(locations_count.Company <= 2)]


Unnamed: 0_level_0,Company,Title,Rating,Role,Pros,Contras,Language,Date_year
Location,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
New Delhi,2,2,2,2,2,2,2,2
"Verona, WI",2,2,2,2,2,2,2,2
"Miami, FL",2,2,2,2,2,2,2,2
"Ashburn, VA",2,2,2,2,2,2,2,2
Martorelles,2,2,2,2,2,2,2,2
"Bothell, WA",2,2,2,2,2,2,2,2
Lugo,2,2,2,2,2,2,2,2
"Ada, MI",2,2,2,2,2,2,2,2
Logroño,2,2,2,2,2,2,2,2
Vargem Grande Paulista,2,2,2,2,2,2,2,2


In [32]:
locations_count.loc[(locations_count.Company <= 1)]


Unnamed: 0_level_0,Company,Title,Rating,Role,Pros,Contras,Language,Date_year
Location,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Castellón de la Plana,1,1,1,1,1,1,1,1
Arteixo,1,1,1,1,1,1,1,1
"Scarborough, ON",1,1,1,1,1,1,1,1
"Eden Prairie, MN",1,1,1,1,1,1,1,1
"Scottsdale, AZ",1,1,1,1,1,1,1,1
Castellbisbal,1,1,1,1,1,1,1,1
"Arden Hills, MN",1,1,1,1,1,1,1,1
Dos Hermanas,1,1,1,1,1,1,1,1
"Shorewood, IL",1,1,1,1,1,1,1,1
Siero,1,1,1,1,1,1,1,1


I see that these places are mainly in different American states. As there are quite a few of them, instead of removing them from the dataframe I consider it faster and easier to select the locations that I will keep, which are those based in Spain.

To do this, I will first collect all the locations based in Spain.

In [33]:
locations_count.index.values

array(['Madrid', 'Barcelona', 'Sant Cugat del Vallès',
       'Valencia, Comunidad Valenciana, Comunidad Valenciana', 'Sevilla',
       'Bilbao', 'Alicante', 'Málaga', 'Alcobendas', 'Murcia',
       'Las Rozas de Madrid', 'Zaragoza', 'Torrejón de Ardoz',
       'Pozuelo de Alarcón', 'Martorell', 'Santa Cruz, Canarias',
       'A Coruña', 'Boadilla del Monte', 'San Fernando de Henares',
       'Lérida', 'Oviedo', 'Salamanca', 'Avilés, Asturias, Asturias',
       'Madison, WI', 'Gijón', 'El Prat de Llobregat', 'Valladolid',
       'Esplugues de Llobregat', 'Badalona', 'Bangalore', 'Gerona',
       'London, England, England',
       'León, Castilla y León, Castilla y León', 'Ciudad Real',
       'Aranjuez', 'Vigo, La Coruña, Galicia, Galicia', 'New York, NY',
       'Getafe', 'Hyderābād', 'San Francisco, CA',
       'Las Palmas de Gran Canaria, Canarias', 'Palma', 'Pamplona',
       'Santiago, Galicia', 'Barberà del Vallès', 'Granada',
       'Washington, DC', 'São Paulo, São Paulo, São P

In [41]:
#Collecting the locations across Spain with the most reviews (12 or more); the two non-Spanish ones are dropped by name:
locations_spain_topreviews = locations_count.loc[locations_count.Company >= 12].drop(['Bangalore', 'Madison, WI'])
locations_spain_topreviews

Unnamed: 0_level_0,Company,Title,Rating,Role,Pros,Contras,Language,Date_year
Location,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Madrid,4918,4918,4918,4918,4918,4918,4918,4918
Barcelona,3619,3619,3619,3619,3619,3619,3619,3619
Sant Cugat del Vallès,197,197,197,197,197,197,197,197
"Valencia, Comunidad Valenciana, Comunidad Valenciana",161,161,161,161,161,161,161,161
Sevilla,120,120,120,120,120,120,120,120
Bilbao,69,69,69,69,69,69,69,69
Alicante,57,57,57,57,57,57,57,57
Málaga,52,52,52,52,52,52,52,52
Alcobendas,44,44,44,44,44,44,44,44
Murcia,44,44,44,44,44,44,44,44


In [42]:
#Collecting all the locations across Spain with fewer reviews (less than 12):

rows = ['Gerona', 'León, Castilla y León, Castilla y León', 'Ciudad Real', 'Aranjuez', 'Vigo, La Coruña, Galicia, Galicia', 'Getafe', 'Las Palmas de Gran Canaria, Canarias',
        'Palma', 'Pamplona', 'Santiago, Galicia', 'Barberà del Vallès', 'Granada', 'Córdoba', 'Langreo', 'Burgos', 'Hospitalet, Barcelona, Cataluña, Cataluña',
        'Donostia / San Sebastián, Basque Country', 'Palau-solita i Plegamans', 'Tarragona', 'Vitoria-Gasteiz', 'Toledo', 'Badajoz', 'Almería', 'Cerdanyola del Vallès',
        'Sant Joan Despí', 'Pontevedra', 'Cartagena', 'Noia', 'Ibiza', 'Marbella', 'San Juan de Alicante', 'Martorelles', 'Lugo', 'Logroño', 'Alcorcón', 'Santa Coloma de Gramenet',
        'Benidorm', 'Peal de Becerro', 'Sevilla La Nueva', 'Orense', 'Illescas', 'Castellón de la Plana', 'Arteixo', 'Castellbisbal', 'Dos Hermanas', 'Castro-Urdiales',
        'Sant Esteve Sesrovires', 'Sabadell', 'Barcelone', 'Figueres', 'Fuengirola', 'Porriño', 'Basauri, País Vasco', 'Quart de les Valls', 
        'Oleiros, La Coruña, Galicia, Galicia', 'Palafolls']
locations_spain_lessreviews = locations_count.loc[rows]
locations_spain_lessreviews

Unnamed: 0_level_0,Company,Title,Rating,Role,Pros,Contras,Language,Date_year
Location,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Gerona,11,11,11,11,11,11,11,11
"León, Castilla y León, Castilla y León",11,11,11,11,11,11,11,11
Ciudad Real,11,11,11,11,11,11,11,11
Aranjuez,11,11,11,11,11,11,11,11
"Vigo, La Coruña, Galicia, Galicia",10,10,10,10,10,10,10,10
Getafe,10,10,10,10,10,10,10,10
"Las Palmas de Gran Canaria, Canarias",9,9,9,9,9,9,9,9
Palma,9,9,9,9,9,9,9,9
Pamplona,9,9,9,9,9,9,9,9
"Santiago, Galicia",7,7,7,7,7,7,7,7


In [43]:
#Joining both with pd.concat(): locations_spain_topreviews + locations_spain_lessreviews
#(DataFrame.append() was removed in pandas 2.0; pd.concat() gives the same result)

locations_spain = pd.concat([locations_spain_topreviews, locations_spain_lessreviews])
locations_spain

Unnamed: 0_level_0,Company,Title,Rating,Role,Pros,Contras,Language,Date_year
Location,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Madrid,4918,4918,4918,4918,4918,4918,4918,4918
Barcelona,3619,3619,3619,3619,3619,3619,3619,3619
Sant Cugat del Vallès,197,197,197,197,197,197,197,197
"Valencia, Comunidad Valenciana, Comunidad Valenciana",161,161,161,161,161,161,161,161
Sevilla,120,120,120,120,120,120,120,120
Bilbao,69,69,69,69,69,69,69,69
Alicante,57,57,57,57,57,57,57,57
Málaga,52,52,52,52,52,52,52,52
Alcobendas,44,44,44,44,44,44,44,44
Murcia,44,44,44,44,44,44,44,44


In [44]:
#Creating a list with only the locations in Spain that I have previously checked are in the dataframe

locations_spain_list = list(locations_spain.index)
locations_spain_list

['Madrid',
 'Barcelona',
 'Sant Cugat del Vallès',
 'Valencia, Comunidad Valenciana, Comunidad Valenciana',
 'Sevilla',
 'Bilbao',
 'Alicante',
 'Málaga',
 'Alcobendas',
 'Murcia',
 'Las Rozas de Madrid',
 'Zaragoza',
 'Torrejón de Ardoz',
 'Pozuelo de Alarcón',
 'Martorell',
 'Santa Cruz, Canarias',
 'A Coruña',
 'Boadilla del Monte',
 'San Fernando de Henares',
 'Lérida',
 'Oviedo',
 'Salamanca',
 'Avilés, Asturias, Asturias',
 'Gijón',
 'El Prat de Llobregat',
 'Valladolid',
 'Esplugues de Llobregat',
 'Badalona',
 'Gerona',
 'León, Castilla y León, Castilla y León',
 'Ciudad Real',
 'Aranjuez',
 'Vigo, La Coruña, Galicia, Galicia',
 'Getafe',
 'Las Palmas de Gran Canaria, Canarias',
 'Palma',
 'Pamplona',
 'Santiago, Galicia',
 'Barberà del Vallès',
 'Granada',
 'Córdoba',
 'Langreo',
 'Burgos',
 'Hospitalet, Barcelona, Cataluña, Cataluña',
 'Donostia / San Sebastián, Basque Country',
 'Palau-solita i Plegamans',
 'Tarragona',
 'Vitoria-Gasteiz',
 'Toledo',
 'Badajoz',
 'Almería',


In [130]:
#Filtering the original dataframe 'Companies_reviews' to keep only the reviews located across Spain

filtering = Companies_reviews['Location'].isin(locations_spain_list)
Companies_reviews = Companies_reviews.loc[filtering]
Companies_reviews.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9989 entries, 0 to 10487
Data columns (total 9 columns):
Company      9989 non-null object
Title        9989 non-null object
Rating       9989 non-null object
Role         9989 non-null object
Location     9989 non-null object
Pros         9989 non-null object
Contras      9989 non-null object
Language     9989 non-null object
Date_year    9989 non-null object
dtypes: object(9)
memory usage: 780.4+ KB


Starting from a dataframe without NaN values and a total of 10373 reviews, I am left with a final one of 9989 reviews. That is, the dataframe contained 384 reviews from employees located outside of Spain, which I have removed.
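
A quick way to double-check a filter like this one is to negate the `isin()` mask and look at what was left out. The sketch below uses toy data; the dataframe `toy`, the list `spain` and the variable names are invented for the example.

```python
import pandas as pd

toy = pd.DataFrame({
    "Company": ["Accenture", "Amazon", "IBM", "IBM"],
    "Location": ["Madrid", "Bangalore", "Bilbao", "Madison, WI"],
})
spain = ["Madrid", "Bilbao"]

in_spain = toy["Location"].isin(spain)

kept = toy.loc[in_spain]
removed = toy.loc[~in_spain]  # ~ negates the mask

# The two parts must add up to the original number of rows
assert len(kept) + len(removed) == len(toy)
print(removed["Location"].tolist())  # ['Bangalore', 'Madison, WI']
```

Listing the removed locations makes it easy to spot a Spanish town that was left out of the list by mistake.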

In [136]:
Companies_reviews.groupby(['Location']).count().sort_values(['Company'], ascending=False)

Unnamed: 0_level_0,Company,Title,Rating,Role,Pros,Contras,Language,Date_year
Location,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Madrid,4918,4918,4918,4918,4918,4918,4918,4918
Barcelona,3619,3619,3619,3619,3619,3619,3619,3619
Sant Cugat del Vallès,197,197,197,197,197,197,197,197
"Valencia, Comunidad Valenciana, Comunidad Valenciana",161,161,161,161,161,161,161,161
Sevilla,120,120,120,120,120,120,120,120
Bilbao,69,69,69,69,69,69,69,69
Alicante,57,57,57,57,57,57,57,57
Málaga,52,52,52,52,52,52,52,52
Las Rozas de Madrid,44,44,44,44,44,44,44,44
Murcia,44,44,44,44,44,44,44,44


In [160]:
#Checking all the reviews based in 'Las Rozas de Madrid' for example:
Companies_reviews[(Companies_reviews['Location'] == 'Las Rozas de Madrid')]

Unnamed: 0,Company,Title,Rating,Role,Location,Pros,Contras,Language,Date_year
58,Accenture,«profesional»,50,Consultoria,Las Rozas de Madrid,"Buen ambiente, buen sueldo y se preocupan por ...",La verdad que en términos todo bien,Español,2021
492,Accenture,«Excelencia de procesos y Habilitación de camb...,50,Anonymous,Las Rozas de Madrid,Es una gran oportunidad de crecer,"Demasiadas horas de trabajo, pero merece la pena",Español,2015
546,Accenture,«Reseña de Accenture»,20,Asociado de operaciones comerciales Líder de e...,Las Rozas de Madrid,Nombre de marca.\nPosibilidad de conocer a nue...,"No se motiva lo suficiente a los empleados, ha...",Español,2015
573,Accenture,«Buen punto de partida»,30,Analista De Negocios,Las Rozas de Madrid,"Gente agradable, ambiente agradable, buena vis...","Opiniones muy políticas, muchas horas",Español,2016
600,Accenture,«A meat grinder»,10,Exproveedor de servicios Human Resources Admi...,Las Rozas de Madrid,I guess you get paid (just about minimum wage ...,Basically fraudulent practices in employment c...,English,2021
602,Accenture,«Good»,40,Help Desk Analyst,Las Rozas de Madrid,The possibility of working from home,Working long hours during the week,English,2021
604,Accenture,«Not coordination client-comapany-worker»,20,QA Language Tester,Las Rozas de Madrid,The facilities are really good.,Depending on the project you will be not provi...,English,2021
644,Accenture,«Unmature puppies - Overtime (not compensated)...,10,anónimo,Las Rozas de Madrid,Nice brand which may look nice in a CV during ...,"Exagerated internal competition culture, lack ...",English,2020
898,Accenture,«Accenture review»,20,Business Operations Associate Team Lead,Las Rozas de Madrid,Brand name\nAbility to meet new clients\nFace ...,"Not enough to motivate employees, poor soft sk...",English,2015
925,Accenture,«Manager accenture»,20,Manager,Las Rozas de Madrid,Carreer opportunities. Projects in big companies.,Compensation no very high. Poor worklife balance.,English,2012


### Joining or grouping locations by **regions**:

Once the locations outside of Spain have been removed from the original dataframe, I will group those that belong to the same region (province).

I will also add a new column with the region of each review, because it will help with the analysis and visualization that I will do once the data cleaning is finished.

As a reminder, the dataframe has reviews based in the locations listed below: 

In [48]:
locations_spain_list, len(locations_spain_list)

(['Madrid',
  'Barcelona',
  'Sant Cugat del Vallès',
  'Valencia, Comunidad Valenciana, Comunidad Valenciana',
  'Sevilla',
  'Bilbao',
  'Alicante',
  'Málaga',
  'Alcobendas',
  'Murcia',
  'Las Rozas de Madrid',
  'Zaragoza',
  'Torrejón de Ardoz',
  'Pozuelo de Alarcón',
  'Martorell',
  'Santa Cruz, Canarias',
  'A Coruña',
  'Boadilla del Monte',
  'San Fernando de Henares',
  'Lérida',
  'Oviedo',
  'Salamanca',
  'Avilés, Asturias, Asturias',
  'Gijón',
  'El Prat de Llobregat',
  'Valladolid',
  'Esplugues de Llobregat',
  'Badalona',
  'Gerona',
  'León, Castilla y León, Castilla y León',
  'Ciudad Real',
  'Aranjuez',
  'Vigo, La Coruña, Galicia, Galicia',
  'Getafe',
  'Las Palmas de Gran Canaria, Canarias',
  'Palma',
  'Pamplona',
  'Santiago, Galicia',
  'Barberà del Vallès',
  'Granada',
  'Córdoba',
  'Langreo',
  'Burgos',
  'Hospitalet, Barcelona, Cataluña, Cataluña',
  'Donostia / San Sebastián, Basque Country',
  'Palau-solita i Plegamans',
  'Tarragona',
  'Vitor

In [132]:
Companies_reviews.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9989 entries, 0 to 10487
Data columns (total 9 columns):
Company      9989 non-null object
Title        9989 non-null object
Rating       9989 non-null object
Role         9989 non-null object
Location     9989 non-null object
Pros         9989 non-null object
Contras      9989 non-null object
Language     9989 non-null object
Date_year    9989 non-null object
dtypes: object(9)
memory usage: 780.4+ KB


In [154]:
BARCELONA = ['Barcelona', 'Sant Cugat del Vallès', 'Barcelone', 'El Prat de Llobregat', 'Esplugues de Llobregat', 'Badalona', 'Barberà del Vallès', 'Hospitalet, Barcelona, Cataluña, Cataluña', 'Palau-solita i Plegamans', 'Cerdanyola del Vallès', 'Martorell', 'Martorelles', 'Santa Coloma de Gramenet', 'Sant Joan Despí', 'Castellbisbal', 'Sant Esteve Sesrovires', 'Sabadell', 'Palafolls']
MADRID = ['Madrid', 'Alcobendas', 'Las Rozas de Madrid', 'Torrejón de Ardoz', 'Pozuelo de Alarcón', 'Boadilla del Monte', 'San Fernando de Henares', 'Aranjuez', 'Getafe', 'Alcorcón', 'Sevilla La Nueva']
GERONA = ['Gerona', 'Figueres']
TARRAGONA = ['Tarragona']
LERIDA = ['Lérida']
VALENCIA = ['Valencia, Comunidad Valenciana, Comunidad Valenciana', 'Quart de les Valls']
ALICANTE = ['Alicante', 'San Juan de Alicante', 'Benidorm']
CASTELLON = ['Castellón de la Plana']
LA_CORUÑA = ['A Coruña', 'Santiago, Galicia', 'Noia', 'Arteixo', 'Oleiros, La Coruña, Galicia, Galicia']
LUGO = ['Lugo']
OURENSE = ['Orense']
PONTEVEDRA = ['Vigo, La Coruña, Galicia, Galicia', 'Pontevedra', 'Porriño']
CANTABRIA = ['Castro-Urdiales']
ASTURIAS = ['Oviedo', 'Avilés, Asturias, Asturias', 'Gijón', 'Langreo']
VIZCAYA = ['Bilbao', 'Basauri, País Vasco']
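# Note: Vitoria-Gasteiz belongs to the province of Álava; it is grouped here with Guipúzcoa (both are in PAIS VASCO, so 'Region' is unaffected)
GUIPUZCUA = ['Donostia / San Sebastián, Basque Country', 'Vitoria-Gasteiz']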
LA_RIOJA = ['Logroño']
NAVARRA = ['Pamplona']
SEVILLA = ['Sevilla', 'Dos Hermanas']
MALAGA = ['Málaga', 'Marbella', 'Fuengirola']
CORDOBA = ['Córdoba']
GRANADA = ['Granada']
ALMERIA = ['Almería']
JAEN = ['Peal de Becerro']
BADAJOZ = ['Badajoz']
MURCIA = ['Murcia', 'Cartagena']
TOLEDO = ['Toledo', 'Illescas']
CIUDAD_REAL = ['Ciudad Real']
ISLAS_BALEARES = ['Palma', 'Ibiza']
TENERIFE = ['Santa Cruz, Canarias']
LAS_PALMAS = ['Las Palmas de Gran Canaria, Canarias']
SALAMANCA = ['Salamanca']
VALLADOLID = ['Valladolid']
LEON = ['León, Castilla y León, Castilla y León']
BURGOS = ['Burgos']
ZARAGOZA = ['Zaragoza']

In [156]:

# Map each province name to the list of raw locations that belong to it
# (the lists are defined in the previous cell)
PROVINCES = {
    'BARCELONA': BARCELONA,
    'MADRID': MADRID,
    'GERONA': GERONA,
    'TARRAGONA': TARRAGONA,
    'LERIDA': LERIDA,
    'VALENCIA': VALENCIA,
    'ALICANTE': ALICANTE,
    'CASTELLON': CASTELLON,
    'LA CORUÑA': LA_CORUÑA,
    'LUGO': LUGO,
    'OURENSE': OURENSE,
    'PONTEVEDRA': PONTEVEDRA,
    'CANTABRIA': CANTABRIA,
    'ASTURIAS': ASTURIAS,
    'VIZCAYA': VIZCAYA,
    'GUIPUZCUA': GUIPUZCUA,
    'LA RIOJA': LA_RIOJA,
    'NAVARRA': NAVARRA,
    'SEVILLA': SEVILLA,
    'MALAGA': MALAGA,
    'CORDOBA': CORDOBA,
    'GRANADA': GRANADA,
    'ALMERIA': ALMERIA,
    'JAEN': JAEN,
    'BADAJOZ': BADAJOZ,
    'MURCIA': MURCIA,
    'TOLEDO': TOLEDO,
    'CIUDAD REAL': CIUDAD_REAL,
    'ISLAS BALEARES': ISLAS_BALEARES,
    'TENERIFE': TENERIFE,
    'LAS PALMAS': LAS_PALMAS,
    'SALAMANCA': SALAMANCA,
    'VALLADOLID': VALLADOLID,
    'LEON': LEON,
    'BURGOS': BURGOS,
    'ZARAGOZA': ZARAGOZA,
}

# For each province, filter the dataframe by the locations in that province
# and rename all of them to the province name.
# A single vectorized isin() per province replaces the row-by-row loops, and
# .copy() makes each subset independent, so no SettingWithCopyWarning is raised.
province_frames = []
for province, locations in PROVINCES.items():
    subset = Companies_reviews.loc[Companies_reviews['Location'].isin(locations)].copy()
    subset['Location'] = province
    province_frames.append(subset)

# Stack the per-province subsets in the same order as the dictionary
Companies_reviews_v2 = pd.concat(province_frames, axis=0, ignore_index=True)
Companies_reviews_v2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9989 entries, 0 to 9988
Data columns (total 9 columns):
Company      9989 non-null object
Title        9989 non-null object
Rating       9989 non-null object
Role         9989 non-null object
Location     9989 non-null object
Pros         9989 non-null object
Contras      9989 non-null object
Language     9989 non-null object
Date_year    9989 non-null object
dtypes: object(9)
memory usage: 702.4+ KB
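
The province grouping can also be sketched as a single lookup: invert the province lists into one dictionary and apply it with `Series.map`. This is only a minimal illustration on a toy frame, not the notebook's method; the names `PROVINCE_LISTS`, `location_to_province` and `toy_reviews` are hypothetical, and only two provinces are included.

```python
import pandas as pd

# Hypothetical subset of the province lists defined above
PROVINCE_LISTS = {
    'BARCELONA': ['Barcelona', 'Badalona', 'Sabadell'],
    'MADRID': ['Madrid', 'Getafe', 'Alcobendas'],
}

# Invert to {raw location: province name}
location_to_province = {
    location: province
    for province, locations in PROVINCE_LISTS.items()
    for location in locations
}

# Toy data standing in for Companies_reviews
toy_reviews = pd.DataFrame({
    'Company': ['A', 'B', 'C', 'D'],
    'Location': ['Badalona', 'Getafe', 'Madrid', 'Lisboa'],
})

# Locations without a province become NaN and are dropped,
# mirroring the isin() filter used in the notebook
toy_reviews['Location'] = toy_reviews['Location'].map(location_to_province)
toy_grouped = toy_reviews.dropna(subset=['Location']).reset_index(drop=True)
print(toy_grouped)
```

One difference to keep in mind: this variant preserves the original row order, whereas concatenating per-province subsets groups the rows by province.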


In [157]:
#Creating a new column called 'Region'

Companies_reviews_v2['Region'] = ''

#Assigning each review the Spanish region (autonomous community) of its province

Companies_reviews_v2.loc[Companies_reviews_v2['Location'].isin(['BARCELONA', 'GERONA', 'TARRAGONA', 'LERIDA']), 'Region'] = 'CATALUÑA'
Companies_reviews_v2.loc[Companies_reviews_v2['Location'].isin(['MADRID']), 'Region'] = 'MADRID'
Companies_reviews_v2.loc[Companies_reviews_v2['Location'].isin(['VALENCIA', 'ALICANTE', 'CASTELLON']), 'Region'] = 'COMUNIDAD VALENCIANA'
Companies_reviews_v2.loc[Companies_reviews_v2['Location'].isin(['LA CORUÑA', 'LUGO', 'OURENSE', 'PONTEVEDRA']), 'Region'] = 'GALICIA'
Companies_reviews_v2.loc[Companies_reviews_v2['Location'].isin(['CANTABRIA']), 'Region'] = 'CANTABRIA'
Companies_reviews_v2.loc[Companies_reviews_v2['Location'].isin(['ASTURIAS']), 'Region'] = 'ASTURIAS'
Companies_reviews_v2.loc[Companies_reviews_v2['Location'].isin(['VIZCAYA', 'GUIPUZCUA']), 'Region'] = 'PAIS VASCO'
Companies_reviews_v2.loc[Companies_reviews_v2['Location'].isin(['LA RIOJA']), 'Region'] = 'LA RIOJA'
Companies_reviews_v2.loc[Companies_reviews_v2['Location'].isin(['NAVARRA']), 'Region'] = 'NAVARRA'
Companies_reviews_v2.loc[Companies_reviews_v2['Location'].isin(['SEVILLA', 'CORDOBA', 'MALAGA', 'JAEN', 'ALMERIA', 'GRANADA']), 'Region'] = 'ANDALUCIA'
Companies_reviews_v2.loc[Companies_reviews_v2['Location'].isin(['BADAJOZ']), 'Region'] = 'EXTREMADURA'
Companies_reviews_v2.loc[Companies_reviews_v2['Location'].isin(['MURCIA']), 'Region'] = 'MURCIA'
Companies_reviews_v2.loc[Companies_reviews_v2['Location'].isin(['TOLEDO', 'CIUDAD REAL']), 'Region'] = 'CASTILLA LA MANCHA'
Companies_reviews_v2.loc[Companies_reviews_v2['Location'].isin(['ISLAS BALEARES']), 'Region'] = 'ISLAS BALEARES'
Companies_reviews_v2.loc[Companies_reviews_v2['Location'].isin(['TENERIFE', 'LAS PALMAS']), 'Region'] = 'ISLAS CANARIAS'
Companies_reviews_v2.loc[Companies_reviews_v2['Location'].isin(['SALAMANCA', 'VALLADOLID', 'LEON', 'BURGOS']), 'Region'] = 'CASTILLA Y LEON'
#The province of Zaragoza belongs to the region of Aragón
Companies_reviews_v2.loc[Companies_reviews_v2['Location'].isin(['ZARAGOZA']), 'Region'] = 'ARAGON'

Companies_reviews_v2.head(50)

Unnamed: 0,Company,Title,Rating,Role,Location,Pros,Contras,Language,Date_year,Region
0,Accenture,«Aprendes mucho»,40,Internship Technology,BARCELONA,"Para el primer empleo esta bien, ya que aprendes",A veces haces horas extras,Español,2021,CATALUÑA
1,Accenture,«Buen lugar para aprender»,50,Technology Consulting Intern,BARCELONA,Aprendes mucho acerca de proyectos importantes...,Puede ser un poco burocratico para tener acces...,Español,2021,CATALUÑA
2,Accenture,«Excelente empresa»,50,Consultant,BARCELONA,8 dias de vacaciones adicionales\nFormacion co...,No encuentro desventaja al analisis,Español,2021,CATALUÑA
3,Accenture,«muchas horas proyectos aburridos»,30,Consultor,BARCELONA,"Buen ambiente, buen sueldo para ser una consul...","Muchas horas, poyectos aburridos e impuestos",Español,2021,CATALUÑA
4,Accenture,«.»,50,Junior Consultant,BARCELONA,"Aprendes mucho, buenos beneficios, buena geren...",hay mucha burocracia para los procesos,Español,2021,CATALUÑA
5,Accenture,«Intensivo»,50,consultora,BARCELONA,Buenas condiciones salariales y proyectos inte...,Mucho trabajo a destajo y horas extra,Español,2021,CATALUÑA
6,Accenture,«Una buena empresa en la que crecer profesiona...,40,IT Consultant,BARCELONA,Beneficios propios de una multinacional; segur...,Hay algunas áreas (cada vez menos) que todavía...,Español,2021,CATALUÑA
7,Accenture,«Un trabajo normal»,50,Consultor SAP Senior,BARCELONA,Es una consultora como cualquier otra,Pocas posibilidades de crecimiento real.,Español,2021,CATALUÑA
8,Accenture,«Muy buena experiencia y buen nombre para el c...,40,anónimo,BARCELONA,En poco tiempo puedes llegar a aprender muchas...,No es para todo el mundo. Si es algo que te gu...,Español,2020,CATALUÑA
9,Accenture,«Está bien para conocer una gran empresa en lo...,40,Technology Consulting Analyst,BARCELONA,Los recursos y la infraestructuras que te ofre...,Al ser una empresa tan grande es difícil termi...,Español,2021,CATALUÑA
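
Because the 'Region' column starts as an empty string, any province left out of the assignments silently keeps that empty value. A minimal sketch of how this could be checked, using a dictionary lookup on a toy frame; `PROVINCE_TO_REGION` is a hypothetical excerpt of the province-to-region table and `toy_v2` stands in for `Companies_reviews_v2`.

```python
import pandas as pd

# Hypothetical excerpt of a province -> region table
PROVINCE_TO_REGION = {
    'BARCELONA': 'CATALUÑA',
    'GERONA': 'CATALUÑA',
    'MADRID': 'MADRID',
    'SEVILLA': 'ANDALUCIA',
}

# Toy data standing in for Companies_reviews_v2
toy_v2 = pd.DataFrame({'Location': ['BARCELONA', 'MADRID', 'SEVILLA', 'VALENCIA']})

# Unmapped provinces get '' so they match the notebook's default value
toy_v2['Region'] = toy_v2['Location'].map(PROVINCE_TO_REGION).fillna('')

# Provinces that still have no region assigned
unmapped = sorted(toy_v2.loc[toy_v2['Region'] == '', 'Location'].unique())
print(unmapped)
```

Running a check like this after the assignments would list every province whose reviews have no region.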


### Creating a new .csv file with the **updated dataframe**:

In [158]:
#index=False avoids saving the row index as an extra 'Unnamed: 0' column
Companies_reviews_v2.to_csv('Companies_reviews_v2.csv', index=False)

### **Importing** the updated dataframe every time we open the notebook:

In [5]:
Companies_reviews_v2 = pd.read_csv('Companies_reviews_v2.csv')
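
When a dataframe is written with the default `to_csv()` settings, the row index is saved as an unnamed first column, which reappears as 'Unnamed: 0' when the file is read back. The sketch below shows the difference on a toy frame, using an in-memory buffer instead of a file; the names `toy`, `with_index` and `without_index` are hypothetical.

```python
import io
import pandas as pd

toy = pd.DataFrame({'Company': ['A', 'B'], 'Rating': [40, 50]})

# Default to_csv() also writes the index as an unnamed first column
with_index = pd.read_csv(io.StringIO(toy.to_csv()))

# index=False writes only the data columns
without_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```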

### Selecting a **date** range by slicing and condition:

In [80]:
year_count = Companies_reviews.groupby(['Date_year']).count().sort_values(['Company'], ascending=False)
year_count

Date_year,Company,Date,Title,Rating,Role,Location,Pros,Contras,Language,Date_day,Date_month
2020,2883,2883,2883,2883,2883,2883,2883,2883,2883,2883,2883
2021,2058,2058,2058,2058,2058,2058,2058,2058,2058,2058,2058
2019,1532,1532,1532,1532,1532,1532,1532,1532,1532,1532,1532
2018,998,998,998,998,998,998,998,998,998,998,998
2017,913,913,913,913,913,913,913,913,913,913,913
2016,807,807,807,807,807,807,807,807,807,807,807
2015,474,474,474,474,474,474,474,474,474,474,474
2014,248,248,248,248,248,248,248,248,248,248,248
2013,207,207,207,207,207,207,207,207,207,207,207
2012,129,129,129,129,129,129,129,129,129,129,129


Counting the reviews per year shows that the dataframe contains reviews dating back to 2008 and that most of them were written in the last 6 years. This is most likely because Glassdoor was founded only a year earlier, in 2007, so in its first years the number of registrations on the platform was much lower and grew over time.

It is true that, over a decade, the situation in a company can change considerably. In other circumstances I would therefore keep only the reviews from the last 6 years, from 2015 onwards. However, doing so would discard thousands of reviews, as shown below, and since the dataset is not very large, I will keep all of them.
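
The trade-off behind a cutoff year can be sketched by counting, for each year, how many reviews would remain if older ones were dropped. This is a minimal illustration on made-up data: the `years` series is hypothetical and does not reproduce the real counts of the dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for Companies_reviews['Date_year']
years = pd.Series([2012, 2014, 2015, 2018, 2019, 2020, 2020, 2021], name='Date_year')

# Reviews per year, oldest first
per_year = years.value_counts().sort_index()

# For each candidate cutoff year: number of reviews with Date_year >= cutoff
kept_from = per_year.iloc[::-1].cumsum().iloc[::-1]

cutoff = 2015  # must be a year present in the data for this simple lookup
kept = int(kept_from.loc[cutoff])
dropped = len(years) - kept
print(kept, dropped)
```

Comparing `kept` and `dropped` for a few cutoff years makes it easier to judge how much data a given date filter would cost.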

In [57]:
#Converting the 'string' data to 'integer' data
Companies_reviews['Date_year'] = Companies_reviews['Date_year'].astype(int)

print(type(Companies_reviews['Date_year']))

<class 'pandas.core.series.Series'>
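
Note that `type()` applied to a column only reports the container (a pandas Series), not the type of the values it holds. A minimal sketch of how the conversion could be verified instead, on a hypothetical toy frame, using the column's `dtype` and `pd.api.types.is_integer_dtype`:

```python
import pandas as pd

# Toy data: years stored as strings, as they were before the conversion
toy = pd.DataFrame({'Date_year': ['2019', '2020', '2021']})

toy['Date_year'] = toy['Date_year'].astype(int)

# dtype describes the elements of the column; type() would only say "Series"
print(toy['Date_year'].dtype)
is_integer = pd.api.types.is_integer_dtype(toy['Date_year'])
print(is_integer)
```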


In [60]:
#Slicing by column labels
cols = ['Company', 'Rating', 'Role', 'Location', 'Pros', 'Contras', 'Language', 'Date_year']

#Condition: keep only the reviews from 2015 onwards
Date_slicing = Companies_reviews.loc[(Companies_reviews.Date_year >= 2015), cols]
len(Date_slicing)

7471

<h1>Summary</h1>

### Web Scraping:

Tasks I have done:
    - Getting reviews and creating a dataframe per company
    - Merging them all into a single dataframe
    - Creating a unified .csv file
    
Issues I have found:
    - Using the network headers path to log in to the website
    - Checking whether all the pages can be scraped or whether Glassdoor is blocking any of them
    - Using Selenium to automate access, click the 'cookies' and 'log in' buttons, enter credentials, etc.
    - The reviews are not shown by location for all companies, which would have made the scraping easier,
      so I have had to scrape by company and by language

### Data Cleaning:

Tasks I have done:

    - Removing columns that are not necessary
    - Removing 'NaN' values
    - Removing words from the 'Date' column. Splitting the date into different columns. Dropping day and month. Converting to a datetime object
    - Removing the first part of the text, before the "-", from the original 'Role' column
    - Removing locations outside of Spain
    - Joining locations by provinces
    - Adding a new column grouping provinces by regions
    - Creating a new .csv file with the updated dataframe