<a href="https://colab.research.google.com/github/ndescussebrown/SecondColabrepo/blob/main/webscraping_withbeautifulsoup.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook, we work with web scraping tools such as BeautifulSoup.

An example of how to work with BeautifulSoup: [In 10 minutes: web scraping with Beautiful Soup and Selenium for data professionals](http://webcache.googleusercontent.com/search?q=cache:o1CFsRJ-eMEJ:https://towardsdatascience.com/in-10-minutes-web-scraping-with-beautiful-soup-and-selenium-for-data-professionals-8de169d36319&hl=en&gl=fr&strip=1&vwsrc=0)

In [6]:
#import the beautifulsoup package (bs4) and requests; beautifulsoup is already
#installed in colab, so no pip install is needed
import bs4, requests
from bs4 import BeautifulSoup



In [7]:
# Sending a request to the url. The header is used to spoof the request so that
# it looks like it comes from a legitimate browser.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36'}
url = "https://www.rungineer.com/blog-1/2021/5/24/jurassic-coast-half-challenge-or-the-race-that-nearly-killed-my-love-of-hills"
r = requests.get(url, headers=headers)


In [8]:
#we check the response from the website (200 OK, 404 Not Found, etc.)
print(r)

<Response [200]>
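Printing the response works for a quick look, but the code can also be tested in the program itself (with requests, the integer is available as `r.status_code`). Below is a minimal sketch using only the standard library to turn a status code into a readable label; the helper names `describe_status` and `is_success` are made up for this illustration.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a readable label such as '200 OK' for an HTTP status code."""
    status = HTTPStatus(code)  # raises ValueError for codes the stdlib does not know
    return f"{status.value} {status.phrase}"

def is_success(code):
    """True for 2xx codes, which is what we want before parsing a page."""
    return 200 <= code < 300

for code in (200, 404):
    outcome = "success" if is_success(code) else "failure"
    print(describe_status(code), "->", outcome)
```

In the notebook this could be called as `describe_status(r.status_code)` before parsing the page.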


In [9]:
#the code below parses the html and retrieves only the img tags
soup = BeautifulSoup(r.content, "html.parser")
images = soup.find_all('img')

But how do we access the attributes of those tags?
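As a small illustration, here are the three usual ways to read attributes from a tag. The HTML snippet and the variable names are made up for this sketch; they are not taken from the website scraped above.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, made up for this illustration.
html = '<img src="a.png" alt="first"><img data-src="b.png">'
sample_soup = BeautifulSoup(html, "html.parser")
first, second = sample_soup.find_all("img")

print(first.attrs)             # dictionary with all attributes of the tag
print(first["src"])            # dictionary-style access, raises KeyError if missing
print(second.get("src"))       # .get() returns None when the attribute is missing
print(second.get("data-src"))  # attribute names with a dash work the same way
```

Using `.get()` is the safer choice when not every tag carries the attribute, which is the case for the images in this notebook.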

In [None]:
#checking parsing of website
soup

In [None]:
#checking contents of the images variable defined previously
images

In [None]:
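#looping over the images variable to collect and print the url of each image on
#our website (None is printed when an image has no 'data-src' attribute)
all_images_url=[]
for element in images:
  all_images_url.append(element.get('data-src'))
  print(element.get('data-src'))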




In [13]:
#defining a function to return urls of the images on our website
def extract_url(result_set):
  return [element.get('data-src') for element in result_set]

In [None]:
#The below only prints the url of a picture if it exists
for element in images:
  if element.get('data-src') is not None:
    print(element.get('data-src'))

In [None]:
#the function defined above returns a value for each element, which is None
#when the attribute doesn't exist.
extract_url(images)

In [16]:
#improving on the previous function to return urls only for the elements of images where they exist
#(note that this version reads the 'src' attribute rather than 'data-src'):
def extract_url_2(result_set):
  return [element.get('src') for element in result_set if (element.get('src') is not None) ]

In [17]:
url2_images=extract_url_2(images)

In [19]:
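#note: r_apec is only defined further down, in the APEC section, so that request
#has to be run before this cell, otherwise it raises a NameError
soup_apec = BeautifulSoup(r_apec.content, "html.parser")
jobs = soup_apec.find_all('h2')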

Let's apply the above to the fake-jobs website: [realpython.github.io/fake-jobs](https://realpython.github.io/fake-jobs/)

In [20]:
url_fakejobs = 'https://realpython.github.io/fake-jobs/'
r_fakejobs = requests.get(url_fakejobs, headers=headers)

In [218]:
soup_fakejobs = BeautifulSoup(r_fakejobs.content, "html.parser")
fake_jobs_h2 = soup_fakejobs.find_all('h2')
fake_jobs_h3 = soup_fakejobs.find_all('h3')

In [188]:
#check response from the fake-jobs website
print(r_fakejobs)

<Response [200]>


In [None]:
fake_jobs_h2

In [211]:
#despite its name, this function returns the text of each element (here the job
#titles), not urls
def extract_url_fakejobs(result_set):
  return [element.string for element in result_set]

In [None]:
extract_url_fakejobs(fake_jobs_h2)

We now want to put the results of our scraping (jobs and companies) in a dictionary.

In [251]:
my_dict={}

In [252]:
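#each key is the job title (h2) plus its position in the list, which keeps keys
#unique if a title appears more than once; each value is the company name (h3)
for i in range(len(fake_jobs_h2)):
  key=fake_jobs_h2[i].text + '_' + str(i)
  value=fake_jobs_h3[i].text
  my_dict[key]=value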

In [None]:
my_dict
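To see why the index is appended to each key, here is a minimal sketch with made-up titles and companies standing in for the scraped h2 and h3 text. It also shows `zip` and `enumerate` as an alternative to indexing with `range(len(...))`.

```python
# Hypothetical titles and companies standing in for the scraped h2/h3 text.
titles = ["Data Engineer", "Energy engineer", "Data Engineer"]
companies = ["Company A", "Company B", "Company C"]

# Without a suffix, the repeated title overwrites the first entry.
plain = dict(zip(titles, companies))

# With the index appended, every job keeps its own key.
indexed = {f"{title}_{i}": company
           for i, (title, company) in enumerate(zip(titles, companies))}

print(len(plain))    # 2 entries: one job was lost
print(len(indexed))  # 3 entries: all jobs kept
```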

In [254]:
fake_jobs_img = soup_fakejobs.find_all('img')

In [None]:
#We can call the function previously defined to find all pictures on the site
extract_url_2(fake_jobs_img)

In [None]:
#Here is how to list all keys for a dictionary
my_dict.keys()

We now want to put our web scraping output into a dataframe.

In [3]:
import pandas as pd

In [4]:
rungineer_df=pd.DataFrame(url2_images)


In [284]:
rungineer_df.columns=['image_url']

In [285]:
rungineer_df

                                           image_url
0  https://images.squarespace-cdn.com/content/v1/...
1  https://images.squarespace-cdn.com/content/v1/...
2  https://images.squarespace-cdn.com/content/v1/...
3  https://images.squarespace-cdn.com/content/v1/...
4  https://images.squarespace-cdn.com/content/v1/...
5  https://images.squarespace-cdn.com/content/v1/...
6  https://images.squarespace-cdn.com/content/v1/...
7  https://images.squarespace-cdn.com/content/v1/...
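The column name can also be given when the dataframe is created, which avoids the separate renaming step. The sketch below uses made-up urls and a made-up dictionary in place of `url2_images` and `my_dict`, so it runs on its own.

```python
import pandas as pd

# Hypothetical urls standing in for url2_images.
sample_urls = ["https://example.com/a.jpg", "https://example.com/b.jpg"]
sample_df = pd.DataFrame(sample_urls, columns=["image_url"])

# A dictionary like my_dict can be turned into a two-column dataframe as well.
sample_jobs = {"Data Engineer_0": "Company A", "Energy engineer_1": "Company B"}
jobs_df = pd.DataFrame(list(sample_jobs.items()), columns=["job", "company"])

print(sample_df)
print(jobs_df)
```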


#### Applying the above to the APEC website: https://www.apec.fr

We want to build a function that returns all jobs from the apec site that
match specific keywords as defined below.

In [46]:
#import required libraries
import bs4,requests
from bs4 import BeautifulSoup

1. The first step is to build the url of the page we want to access from the
specified keywords.

In [22]:
#We look for a specific job within the apec page and search for specific key words
keywords=['data','engineer']
str_keywords = '%20'.join(keywords)

In [27]:
#we build the url required
url_apec_base='https://www.apec.fr/candidat/recherche-emploi.html/emploi?motsCles='
url_apec_end='&page='
url_apec=url_apec_base + str_keywords + url_apec_end

In [28]:
#we check we have the right url:
url_apec

'https://www.apec.fr/candidat/recherche-emploi.html/emploi?motsCles=data%20engineer&page='
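Joining with `'%20'` works for plain words, but keywords containing characters such as `&` or `+` would need encoding too. A minimal sketch of the same url construction with `urllib.parse.quote` from the standard library follows; the function name `build_search_url` is made up for this illustration, and the url pattern is the one used above.

```python
from urllib.parse import quote

def build_search_url(keywords, page_number):
    """Build the search url for a list of keywords and a page number."""
    base = 'https://www.apec.fr/candidat/recherche-emploi.html/emploi'
    # quote() encodes the space as %20 and escapes other special characters
    query = quote(' '.join(keywords))
    return f"{base}?motsCles={query}&page={page_number}"

print(build_search_url(['data', 'engineer'], 0))
print(build_search_url(['R&D', 'engineer'], 0))
```

The first call gives the same url as the manual construction with page 0 appended; the second shows `&` being escaped as `%26` so it is not mistaken for a parameter separator.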

In [26]:
# Sending a request to the url. The header is used to spoof the request so that
# it looks like it comes from a legitimate browser.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36'}


In [41]:
#we use the page number to complete the url. Careful: on this website the
#first page is actually page 0, not page 1. Here we test the construction of
#the url to check that it works.
page_number=1
url=url_apec + str(page_number)

In [40]:
#we check that we are landing on the page we want
url

'https://www.apec.fr/candidat/recherche-emploi.html/emploi?motsCles=data%20engineer&page=1'

2. We now want to loop over the pages returned for our search to collect all results.
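One possible shape for such a loop is sketched below. It assumes that a page without results means we have reached the end, and it takes the fetching and parsing steps as arguments; `collect_all_pages`, `fetch_page` and `parse_jobs` are hypothetical names, and the stand-in functions only exist so the sketch runs without a network connection.

```python
def collect_all_pages(fetch_page, parse_jobs, max_pages=50):
    """Visit pages 0, 1, 2, ... and stop at the first page without results."""
    all_jobs = []
    for page_number in range(max_pages):  # this site starts counting at page 0
        jobs = parse_jobs(fetch_page(page_number))
        if not jobs:
            break
        all_jobs.extend(jobs)
    return all_jobs

# Stand-ins for the real request and parsing steps (made up for this sketch).
fake_pages = {0: "job A;job B", 1: "job C"}

def fetch(page_number):
    return fake_pages.get(page_number, "")

def parse(text):
    return [job for job in text.split(";") if job]

print(collect_all_pages(fetch, parse))
```

In the notebook, `fetch_page` would wrap `requests.get(url_apec + str(page_number), headers=headers)` and `parse_jobs` would wrap the parsing of the returned page.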

In [50]:
#we create a request to the url defined above.
r_apec = requests.get(url, headers=headers)



In [51]:
#we check the response from the website and check it is 200
r_apec

<Response [200]>

In [52]:
soup_apec = BeautifulSoup(r_apec.content, "html.parser")

In [56]:
#The below doesn't find the job titles, even though inspecting the website
#shows that each job title is in an h2 tag. The reason is that the website is
#dynamic: the results are added to the page after the initial html is served,
#so beautifulsoup alone will not find what we need. We need a different tool.
soup_apec.find_all('h2')

[]
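The empty list can be reproduced on a small scale. In the sketch below, both HTML strings are made up for illustration: the first mimics what the server sends when the content is filled in by a script, and the second mimics what the browser shows once the script has run.

```python
from bs4 import BeautifulSoup

# Hypothetical page: the h2 only exists once the browser has run the script.
static_html = """
<html><body>
<div id="results"></div>
<script>
document.getElementById("results").innerHTML = "<h2>Data Engineer</h2>";
</script>
</body></html>
"""
# What the page could look like after the script has run in a browser.
rendered_html = '<html><body><div id="results"><h2>Data Engineer</h2></div></body></html>'

static_titles = [h.text for h in BeautifulSoup(static_html, "html.parser").find_all("h2")]
rendered_titles = [h.text for h in BeautifulSoup(rendered_html, "html.parser").find_all("h2")]

print(static_titles)    # the h2 inside the script is plain text, not a tag
print(rendered_titles)  # the h2 is found once it is part of the html
```

requests only ever receives the first kind of page, which is why a tool that drives a real browser is needed to get the second.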

In [None]:
#So we will try to use another tool adapted to dynamic websites, selenium.
#First we need to install the package
!pip install selenium

In [60]:
#Now we import selenium.
import selenium