# Web Scraping Project: Extracting Company Data  


This project extracts company names and details from a website using **Selenium** and saves them in CSV files.  

## Tools & Methods I used
- **Selenium**: Automates browser actions  
- **Pandas**: Stores data in CSV format  
- **Chrome WebDriver**: Controls the browser  
- **Explicit Waits**: Ensures elements load before interacting  
- **Try-Except Blocks**: Handles errors  


I divided the project into **Two steps:**


## Step 1: Extracting Company Names  

## Step 2: Extracting Company Details  

## Issues I faced and fixed:

- **Page reset after clicking next:** Clicking the "previous page" button sent the site back to the first page.  
- **Missing data:** Used `try-except` to prevent errors  
- **Search issues:** Cleared and re-entered search terms  

## Final Output  
- **Company names** in a text file  
- **Company details** in CSV files  
- **Fully automated process**  


### First part of the project: Extracting all the companies' names and saving them to a CSV file

In [2]:
#The libraries

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time


All of the code for this step is in one cell because the program has to run these steps in one go:


- Starts the WebDriver and opens Chrome at the target URL
- Waits for the page to load completely
- Creates a list in which the company names will be saved
- Uses the CSS selector we provided to find all elements containing company names on the current page
- For each company element found, extracts the company name (`company_element.text`) and appends it to the `company_names` list
- After extracting the names from the current page, looks for the "Next Page" button
- If the button is found and clickable, scrolls to it, clicks it, and waits for the next page to load
- Stops (breaks out of the loop) when there are no more pages


In [None]:
#Using webdriver:
driver = webdriver.Chrome()
driver.get("https://example.com/path-to-data")

#Waiting for the page to load:
time.sleep(5)  


#Extracting companies' names:
company_names = []

while True:
    #Finding companies' names on the page:
    company_elements = driver.find_elements(By.CSS_SELECTOR, "p.MuiTypography-root.MuiTypography-body1.css-8uynmr")
    for company_element in company_elements:
        company_names.append(company_element.text)

    try:
        #Next page button:
        next_button = WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable((By.XPATH, "//button[contains(., 'Seguente')]"))
        )
        driver.execute_script("arguments[0].scrollIntoView();", next_button)
        next_button.click()
        
        #Waiting for the next page to load:
        time.sleep(5)  

    except Exception:
        print("No more pages or error while clicking 'Next Page'. Stopping.")
        #Break if there are no more pages:
        break  
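
The control flow of the loop above can also be looked at separately from Selenium. The sketch below is only an illustration: `collect_all_pages`, `read_page`, `go_to_next_page` and `max_pages` are hypothetical names that are not part of the original script, and the browser is replaced by a plain list of pages so that the loop can be run without a website.

```python
def collect_all_pages(read_page, go_to_next_page, max_pages=1000):
    # read_page() returns the names found on the current page.
    # go_to_next_page() returns False when there is no further page.
    # max_pages is a safety limit in case the site keeps returning pages.
    names = []
    for _ in range(max_pages):
        names.extend(read_page())
        if not go_to_next_page():
            break
    return names


# Stand-in for the browser: three made-up pages of names.
pages = [["A", "B"], ["C"], ["D", "E"]]
state = {"index": 0}


def read_page():
    return pages[state["index"]]


def go_to_next_page():
    if state["index"] + 1 >= len(pages):
        return False
    state["index"] += 1
    return True


all_names = collect_all_pages(read_page, go_to_next_page)
print(all_names)
```

In the real script, `read_page` would wrap the `driver.find_elements` call and `go_to_next_page` would wrap the `WebDriverWait` click, returning `False` when the wait times out.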


Then we save the company names from the list to a CSV file so they can be used in the second part of the project.

In [None]:
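#Saving the companies' names in a CSV file.
#The "Company Name" header is needed because the second part reads df["Company Name"]:
import csv

with open("company_names.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Company Name"])
    for name in company_names:
        #csv.writer quotes names that contain commas
        writer.writerow([name])

driver.quit()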


Now we move on to the **Second part** of the project.


### Second part of the project: 
### To use the company_names.csv and extract the required info of the companies one by one

We import the libraries we need:

In [None]:
#These libraries were imported in the first part; they are listed again (commented out) as a reminder that they are needed here too:
#from selenium import webdriver
#from selenium.webdriver.common.by import By
#from selenium.webdriver.support.ui import WebDriverWait
#from selenium.webdriver.support import expected_conditions as EC
#import time


#Keys is new here: it is needed for the CONTROL+A, BACKSPACE and RETURN key presses in the search bar
from selenium.webdriver.common.keys import Keys
import pandas as pd

During the project we came across some phone numbers that ended with a slash ("/"), which the bot could not extract cleanly, so we defined a function to remove it:

In [None]:
def clean_phone_number(phone):
    return phone.replace('/', '').strip()
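
As a quick check of the helper, here is a small usage sketch. The sample numbers are made up for illustration and are not taken from the site. Note that `replace` removes every slash in the string, not only a trailing one.

```python
def clean_phone_number(phone):
    # Same helper as above: drop slash characters, then trim surrounding whitespace.
    return phone.replace('/', '').strip()


# Hypothetical inputs, for illustration only.
samples = ["+39 02 1234567/", " +39 06 7654321 / ", "+39 011 555000"]
cleaned = [clean_phone_number(s) for s in samples]
print(cleaned)
```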

Another issue we found was that if the program crashes before the CSV file is written, all the data extracted up to that point is lost. So we defined another function that saves the collected data to a CSV file, so that results can be written out as the scraping progresses instead of only once at the very end:

In [None]:
def save_to_csv(data, filename):
    pd.DataFrame(data).to_csv(filename, index=False)
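
`save_to_csv` rewrites the whole file from the data collected so far. A possible alternative, shown here only as a sketch, is to append one row per company so that every finished company is already on disk if a crash happens. `append_row_to_csv` and `FIELDS` are hypothetical names that do not appear in the original script, and the example uses the standard `csv` module instead of pandas.

```python
import csv
import os
import tempfile

# Column order assumed to match the dictionaries built in the scraping loop.
FIELDS = ["Company Name", "Phone 1", "Phone 2", "Address", "Website"]


def append_row_to_csv(row, filename, fieldnames=FIELDS):
    # Hypothetical helper: append one company's record to the file.
    # The header is written only when the file is new or empty.
    write_header = not os.path.exists(filename) or os.path.getsize(filename) == 0
    with open(filename, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if write_header:
            writer.writeheader()
        writer.writerow(row)


# Usage with placeholder data, written to a temporary folder.
demo_file = os.path.join(tempfile.mkdtemp(), "company_details_demo.csv")
for i in (1, 2):
    append_row_to_csv(
        {
            "Company Name": f"company {i}",
            "Phone 1": "#####",
            "Phone 2": "#####",
            "Address": f"address {i}",
            "Website": "www.#####.com",
        },
        demo_file,
    )
```

In the scraping loop, such a helper would be called right after each company's dictionary is built.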

We start the Chrome WebDriver again, this time through `ChromeOptions` so that browser settings can be customized if needed, and open the URL.

This step loads the webpage to begin the scraping process.


In [None]:
url = "https://example.com/path-to-data"
options = webdriver.ChromeOptions()
driver = webdriver.Chrome(options=options)
driver.get(url)

To reduce the risk of a crash and data loss, I divided `company_names.csv` into 3 smaller CSV files:

In [None]:
company_files = ["company_names.csv", "company_names2.csv", "company_names3.csv"]
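
The notebook does not show how the original list was divided. One way to do it, sketched here under that assumption, is to cut the list of names into consecutive parts of nearly equal size. `split_into_chunks` is a hypothetical helper and the names are placeholders.

```python
def split_into_chunks(items, n_chunks):
    # Hypothetical helper: split a list into n_chunks consecutive parts
    # whose sizes differ by at most one.
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


names = [f"company {i}" for i in range(1, 8)]  # placeholder names
parts = split_into_chunks(names, 3)
print([len(p) for p in parts])
```

Each part would then be written to its own file with a `Company Name` header, so that the scraping loop can read it with `pd.read_csv`.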

Again, all of the code is in one cell because the steps have to run together.

- Loop through the company files: read each CSV file of company names, starting with `company_names.csv`, and scrape the data for each company in it.


 - For each company name:
   - Wait for the search bar to load on the webpage.
   - Enter the company name in the search bar and press Enter.
   - Wait for the company result to appear on the page.
   - Click on the first search result to open the company details page.
   - Extract the company's address, 2 phone numbers, and website. Each address is written on 3 lines: the first line is the street name and number, the second line is the city, and the third line is the CAP (the 5-digit postal code) followed by "Italia". Each company has 2 phone numbers with 2 different CSS selectors, so I extracted them separately and saved them in 2 different columns (as the customer wanted).
   - If a piece of data is missing, a placeholder such as `"Address not found"` is saved instead.
  
- After extracting the data for one company, reload the page to prepare for the next search.

- After processing all companies in the current file, save the data to a CSV file whose name is based on the original input file.

- Continue this process for each CSV file in the `company_files` list until all files have been processed.

- End of scraping: once all files have been processed, close the browser and print a success message.


In [None]:
for company_file in company_files:
    df = pd.read_csv(company_file)
    company_names = df["Company Name"].tolist()
    #Store the extracted details:
    data = []  

    for company in company_names:
        try:
            #Loading search bar:
            WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.CSS_SELECTOR, "input[placeholder='Cerca espositore per nome']"))
            )

            #Search for the company name:
            search_bar = driver.find_element(By.CSS_SELECTOR, "input[placeholder='Cerca espositore per nome']")
            driver.execute_script("arguments[0].scrollIntoView(true);", search_bar)
            time.sleep(1)
            search_bar.click()
            #Select all text at once because if not, the search bar starts searching the remaining characters:
            search_bar.send_keys(Keys.CONTROL + "a")
            #Delete the selected text:
            search_bar.send_keys(Keys.BACKSPACE)
            #Type the next company name:
            search_bar.send_keys(company)
            time.sleep(2)
            search_bar.send_keys(Keys.RETURN) 

            #Waiting for the search result:
            WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.XPATH, f"//p[contains(text(), '{company}')]"))
            )

            #Click on the first search result:
            search_result = driver.find_element(By.XPATH, f"//p[contains(text(), '{company}')]")
            search_result.click()

            #Wait for the company page to load:
            WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CLASS_NAME, "MuiStack-root")))

            #Extract Address:
            try:
                address_lines = driver.find_elements(By.XPATH, "//p[contains(@class, 'MuiTypography-body2 css-jpwjwb')]")
                #find_elements returns an empty list instead of raising, so fall back to the placeholder explicitly:
                address = ", ".join([line.text.strip() for line in address_lines[:3]]) or "Address not found"  #First 3 lines
            except Exception:
                address = "Address not found"

            #Extract Phone 1:
            try:
                phone1_element = driver.find_element(By.XPATH, "//p[contains(@class, 'MuiTypography-body2')]/span")
                phone1 = clean_phone_number(phone1_element.text)
            except:
                phone1 = "Phone 1 not found"

            #Extract Phone 2:
            try:
                phone2_elements = driver.find_elements(By.XPATH, "//p[contains(@class, 'MuiTypography-body2')]")
                phone2 = None
                for element in phone2_elements:
                    text = element.text.strip()
                    if text.startswith("+") and "Tel:" not in text:
                        phone2 = text
                        break
                phone2 = phone2 if phone2 else "Phone 2 not found"
            except:
                phone2 = "Phone 2 not found"

            #Extract Website:
            try:
                website = driver.find_element(By.XPATH, "//span[contains(@class, 'MuiTypography-body2-bold') and contains(text(), 'www')]").text
            except:
                website = "Website not found"

            #Store the data in 5 columns:
            data.append({
                "Company Name": company,
                "Phone 1": phone1,
                "Phone 2": phone2,
                "Address": address,
                "Website": website
            })

            #Reload the page to avoid errors:
            driver.get(url)
            time.sleep(3)


        #Print the error (if there was one) and the name of the company where it occurred:
        except Exception as e:
            print(f"Error processing {company}: {e}")

    #Saving the extracted data to CSV:
    output_filename = company_file.replace("company_names", "company_details")
    save_to_csv(data, output_filename)

    
    data.clear()

driver.quit()
print("Scraping completed for all files!")
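
One thing to watch in the search step: the XPath is built with `'{company}'`, so a company name that contains an apostrophe would produce an invalid expression. The helper below is a sketch of one way to quote arbitrary text for XPath 1.0. `xpath_literal` is a hypothetical name that is not part of the original script, and the sample name is made up.

```python
def xpath_literal(text):
    # Hypothetical helper: build an XPath 1.0 string literal for arbitrary text.
    if "'" not in text:
        return f"'{text}'"
    if '"' not in text:
        return f'"{text}"'
    # The text contains both quote types: glue the pieces together with concat().
    parts = text.split("'")
    return "concat(" + ", \"'\", ".join(f"'{p}'" for p in parts) + ")"


company = "L'Azienda Srl"  # placeholder name with an apostrophe
xpath = f"//p[contains(text(), {xpath_literal(company)})]"
print(xpath)
```

The resulting string can be passed to `By.XPATH` in place of the f-string used in the loop.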


### Data Cleaning

For the last part of the project I had to clean the extracted data.

I had to:
- Merge the 3 result CSV files, one for each of the 3 company names files
- Check for duplicate rows and remove them
- Compare the "Company Name" column of the result file with the original list of company names, and save any missing companies to a `.txt` file
- Add the missing companies' data to the result file
- Extract the CAP (the 5-digit code) from the Address column and store it in a new column
- Finally, group the result file so that CAPs starting with the same digit are together

### Merging the 3 result files:

In [None]:
#There is no need to import Pandas as we did it before

#Loading the three result CSV files:
df1 = pd.read_csv("company_details.csv")
df2 = pd.read_csv("company_details2.csv")
df3 = pd.read_csv("company_details3.csv")

#Merging:
merged_df = pd.concat([df1, df2, df3], ignore_index=True)

#Saving:
merged_df.to_csv("merged_company_details.csv", index=False)


### Removing duplicate data (if there is any)

In [None]:
merged_df = pd.read_csv("merged_company_details.csv")

#Removing duplicates from the "Company Name" column:
merged_df_cleaned = merged_df.drop_duplicates(subset="Company Name", keep='first')

#Saving:
merged_df_cleaned.to_csv("merged_company_details_cleaned.csv", index=False)


### Finding and handling missing values:

In [None]:
#Loading the merged file:
merged_df = pd.read_csv("merged_company_details.csv")

#Reading the original company names from the text file:
with open("company_names.txt", "r", encoding="latin1") as file:
    company_names = [line.strip() for line in file.readlines()]


#Getting the company names and make a list of them:
scraped_companies = merged_df["Company Name"].tolist()

#Finding missing company names:
missing_companies = [name for name in company_names if name not in scraped_companies]

#Saving missing company names to a text file:
with open("missing_companies.txt", "w", encoding="utf-8") as file:
    for company in missing_companies:
        file.write(company + "\n")
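
The comparison above is an exact string match, so a difference in letter case or extra spaces makes a company look missing, and blank lines in the text file are reported too. The sketch below shows a more tolerant comparison. `normalize` and `find_missing` are hypothetical helpers and the names are placeholders.

```python
def normalize(name):
    # Collapse repeated whitespace and ignore letter case when comparing names.
    return " ".join(name.split()).casefold()


def find_missing(original_names, scraped_names):
    # Return the original names with no normalized match among the scraped ones.
    # A set makes each lookup fast; blank lines are skipped.
    scraped = {normalize(n) for n in scraped_names}
    return [n for n in original_names if n.strip() and normalize(n) not in scraped]


original = ["Alpha S.p.A.", "Beta  Srl", "Gamma SRL", ""]  # placeholder names
scraped = ["alpha s.p.a.", "Beta Srl"]
missing = find_missing(original, scraped)
print(missing)
```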


### Adding missing values 
We did it manually because there were just 4 of them

In [None]:
#Loading the merged company details CSV:
merged_csv = "merged_company_details_cleaned.csv"
df = pd.read_csv(merged_csv)

#List of missing companies
missing_companies = [
    {
        "Company Name": "company 1",
        "Address": "address 1",
        "Phone 1": "###########",
        "Phone 2": "##########",
        "Website": "www.#####.com"
    },
    {
        "Company Name": "company 2",
        "Address": "address 2",
        "Phone 1": "##########",
        "Phone 2": "##########",
        "Website": "www.#######.it"
    },
    {
        "Company Name": "company 3",
        "Address": "address 3",
        "Phone 1": "###########",
        "Phone 2": "###########",
        "Website": "www.#######.it"
    },
    {
        "Company Name": "company 4",
        "Address": "address 4",
        "Phone 1": "############",
        "Phone 2": "############",
        "Website": "www.########.com"
    }
]

#Convert list to a DataFrame:
missing_df = pd.DataFrame(missing_companies)

#Appending missing companies to the main DataFrame:
df = pd.concat([df, missing_df], ignore_index=True)

#Saving:
df.to_csv("merged_company_details_cleaned_updated.csv", index=False)


### Extracting the CAP from the addresses:

In [None]:
import re

#Loading the merged and cleaned CSV that also contains the manually added companies:
merged_df_cleaned = pd.read_csv("merged_company_details_cleaned_updated.csv")

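#Function to extract CAP from the address:
def extract_cap(address):
    #regex (AI did this part): \b\d{5}\b matches the first standalone 5-digit number
    #str() avoids a TypeError when the address cell is empty (NaN)
    match = re.search(r'\b\d{5}\b', str(address))
    return match.group(0) if match else None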

#Applying the CAP extraction function to the 'Address' column:
merged_df_cleaned['CAP'] = merged_df_cleaned['Address'].apply(extract_cap)

#Group by the first digit of the CAP:
merged_df_cleaned['CAP Group'] = merged_df_cleaned['CAP'].apply(lambda x: x[0] if x else None)

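#Grouping the data by the new 'CAP Group' column:
grouped_df = merged_df_cleaned.groupby('CAP Group')

#Concatenating the groups so that rows with the same first CAP digit sit together
#(groupby leaves out rows whose 'CAP Group' is missing):
grouped_df_list = [group for _, group in grouped_df]
final_df = pd.concat(grouped_df_list, ignore_index=True)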

#Save the final grouped data to a CSV file:
final_df.to_csv("final_grouped_company_details.csv", index=False)
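
To make the CAP step easier to follow, here is a small worked example that needs no pandas. The addresses are placeholders in the "street, city, CAP Italia" shape described earlier, and grouping with a dictionary only stands in for the `groupby` call.

```python
import re
from collections import defaultdict


def extract_cap(address):
    # Same idea as above; str() guards against non-string values such as NaN.
    match = re.search(r'\b\d{5}\b', str(address))
    return match.group(0) if match else None


# Placeholder addresses, for illustration only.
addresses = [
    "Via Roma 10, Milano, 20121 Italia",
    "Corso Italia 5, Roma, 00187 Italia",
    "Via Verdi 3, Monza, 20900 Italia",
    "Address not found",
]

# Group the CAPs by their first digit; addresses without a CAP go under None.
groups = defaultdict(list)
for address in addresses:
    cap = extract_cap(address)
    groups[cap[0] if cap else None].append(cap)

print(dict(groups))
```

A CAP such as `00187` keeps its leading zeros only while it is handled as text. When the final CSV is read back with pandas, passing `dtype={"CAP": str}` to `pd.read_csv` should prevent it from being parsed as a number.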


### The End :]

Done by `Hassan Mansouri` 
Linkedin: www.linkedin.com/in/hassan-mansourii/  
Github: www.github.com/hasanmansouri