# Car Web Scraping Data Project

![web scraping using python!](web-scraping-in-python.png "web scraping using python")

**Description:** 

In this data science project, I used Python with Beautiful Soup and urllib3 to scrape information from the car listing website cars45.com. The goal of the project was to gather data about various car models, their specifications, prices, and other relevant details.

**Scope:**
The scope of the project encompassed identifying and scraping data from a car listing website, where each listing represents a particular car model or category. The scraped data included text descriptions and numeric specifications. I cleaned and structured the scraped data so that it is usable for further analysis.

**Challenges:**
While scraping the data, I encountered challenges related to website structure, varying page layouts, and potential rate limiting or IP blocking by the website in response to excessive requests. To mitigate these challenges, I implemented measures such as setting a User-Agent header and adding time delays between requests.
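
The two mitigations described above can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: `polite_get`, `DEFAULT_HEADERS`, the User-Agent string, and the one-second default delay are all assumptions made for the example. The `http` argument is assumed to be a `urllib3.PoolManager`, whose `request` method accepts a `headers` argument.

```python
import time

# Hypothetical User-Agent string; the value used in the project is not shown.
DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; car-scraper-sketch)"}


def polite_get(http, url, delay_seconds=1.0, headers=None):
    """Pause, then send a GET request that carries a User-Agent header.

    `http` is assumed to be a urllib3.PoolManager, or any object exposing a
    compatible `request(method, url, headers=...)` method.
    """
    time.sleep(delay_seconds)  # spread requests out to reduce load on the site
    return http.request("GET", url, headers=headers or DEFAULT_HEADERS)
```

A fixed delay is the simplest option; the right value depends on the site's tolerance and is not something the source specifies.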

**Outcome:**
The project resulted in a well-organized dataset of car listings with their prices, locations, and specifications. This dataset can support analyses such as price trends and feature comparisons. The scraped data can also be integrated into machine learning models for predictive analytics related to car pricing.

**Key Skills:**
Web scraping, HTML parsing, data cleaning, data structuring, Python (Beautiful Soup, urllib3), data analysis, data visualization.

**Future Directions:**
In the future, this project can be expanded to include data from more websites, encompassing a broader range of car models and brands. Additionally, advanced techniques such as implementing automated scraping scripts, using proxies, and handling dynamic web content (JavaScript-rendered pages) could enhance the project's scope and capabilities.

## Import Libraries

In [101]:
from bs4 import BeautifulSoup
import requests
import urllib3
import certifi


import pandas as pd

In [102]:
# car data
car_dict = {
    'car_id': [],
    'description': [],
    'amount': [],
    'region': [],
    'make': [],
    'model': [],
    'year_of_man': [],
    'color': [],
    'condition': [],
    'mileage': [],
    'engine_size': [],
    'selling_cond': [],
    'bought_cond': [],
    'trim': [],
    'drive_train': [],
    'reg_city': [],
    'seat': [],
    'num_cylinder': [],
    'horse_power': []
}

In [103]:
# Getting other data

# Shared connection pool used by get_details() (certificate verification enabled)
http = urllib3.PoolManager(cert_reqs='CERT_REQUIRED', ca_certs=certifi.where())

# Maps each output key to the label text shown in the listing's <span> tag
DETAIL_LABELS = {
    'make': 'Make',
    'model': 'Model',
    'year_of_man': 'Year of manufacture',
    'color': 'Colour',
    'condition': 'Condition',
    'mileage': 'Mileage',
    'engine_size': 'Engine Size',
    'selling_cond': 'Selling Condition',
    'bought_cond': 'Bought Condition',
    'trim': 'Trim',
    'drive_train': 'Drivetrain',
    'reg_city': 'Registered city',
    'seat': 'Seats',
    'num_cylinder': 'Number of Cylinders',
    'horse_power': 'Horse Power',
}

def get_details(car_id):
    """Fetch one listing page and return its specification fields as a dict."""
    res = http.request('GET', f"https://www.cars45.com/{car_id}")

    soup = BeautifulSoup(res.data, "html.parser")

    car_details = {}
    for key, label in DETAIL_LABELS.items():
        # Find the <span> tag with the label text, then the preceding <p> tag
        label_span = soup.find('span', string=label)
        value_tag = label_span.find_previous_sibling('p') if label_span else None
        # Store '' when either the label or its value is missing
        car_details[key] = value_tag.get_text().strip() if value_tag else ''

    return car_details
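
To make the label lookup in `get_details` easier to follow, here is a small self-contained sketch that runs the same `find` / `find_previous_sibling` pattern on an inline HTML fragment. The fragment is hypothetical and only imitates the layout the function assumes (value in a `<p>`, label in the following `<span>`); the real page markup may differ. `read_value` is an illustrative helper name, not part of the project.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the assumed layout: value in <p>, label in <span>.
SAMPLE_HTML = """
<div><p>Toyota</p><span>Make</span></div>
<div><p>Corolla</p><span>Model</span></div>
"""


def read_value(soup, label):
    """Return the text of the <p> that precedes the <span> named `label`."""
    span = soup.find("span", string=label)
    value_tag = span.find_previous_sibling("p") if span else None
    return value_tag.get_text().strip() if value_tag else ""


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(read_value(soup, "Make"))   # label present in the fragment
print(read_value(soup, "Trim"))   # label absent, so an empty string is printed
```

Returning an empty string for a missing label keeps every record the same shape, which matters later when the records are turned into DataFrame columns.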
    

In [104]:
# function to collect every car listing into car_dict
def get_info(car_listings):
    # fields read from the listing card; the remaining keys come from get_details()
    listing_keys = ('car_id', 'description', 'amount', 'region')

    # loop through all listings
    for car in car_listings:
        car_id = car['href'].replace('/', '')
        description = car.find('p', class_="car-feature__name").get_text().strip()
        amount = car.find('p', class_="car-feature__amount").get_text().strip()
        region = car.find('p', class_="car-feature__region").get_text().strip()

        car_details = get_details(car_id)

        car_dict['car_id'].append(car_id)
        car_dict['description'].append(description)
        car_dict['amount'].append(amount)
        car_dict['region'].append(region)

        # a missing detail becomes '' so every column keeps the same length
        for key in car_dict:
            if key not in listing_keys:
                car_dict[key].append(car_details.get(key, ''))

In [105]:
def main():
    # loop through the pages
    
    http = urllib3.PoolManager(cert_reqs='CERT_REQUIRED', ca_certs=certifi.where())
    
    for page in range(1,201):
        res = http.request('GET', f'https://www.cars45.com/listing?page={page}')
    
        soup = BeautifulSoup(res.data, 'html.parser')
        car_listings = soup.find_all('a', class_='car-feature car-feature--wide-mobile')
            
        # invoke the get_info()
        get_info(car_listings)
    
    return car_dict
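
The loop in `main` always requests a fixed number of pages. As a hedged alternative, the sketch below stops as soon as a page returns no listings. `collect_pages` and `fetch_listings` are hypothetical names introduced for this example; `fetch_listings(page)` stands in for the request-and-parse step and is assumed to return a list of listings for that page.

```python
def collect_pages(fetch_listings, max_pages=200):
    """Call fetch_listings(page) for pages 1..max_pages, stopping at the first empty page."""
    collected = []
    for page in range(1, max_pages + 1):
        listings = fetch_listings(page)
        if not listings:
            # An empty page is treated as the end of the results (an assumption).
            break
        collected.extend(listings)
    return collected


# Usage with a stand-in fetcher that only has data for three pages.
fake_site = {1: ["car-a", "car-b"], 2: ["car-c"], 3: ["car-d"]}
all_listings = collect_pages(lambda page: fake_site.get(page, []))
print(len(all_listings))
```

Passing the fetcher in as a function keeps the paging logic testable without touching the network.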

In [106]:
car_data = main()


## Convert Data to a DataFrame

In [100]:
car_df = pd.DataFrame(car_data)
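
`pd.DataFrame` raises a `ValueError` when the lists in a dictionary have different lengths, which can happen if scraping stops partway through a listing. The sketch below shows one way to check for that before building the DataFrame. The helper names `column_lengths` and `is_rectangular` are hypothetical and not part of the project.

```python
def column_lengths(columns):
    """Return the number of collected values per column name."""
    return {name: len(values) for name, values in columns.items()}


def is_rectangular(columns):
    """True when every column holds the same number of values."""
    return len(set(column_lengths(columns).values())) <= 1


# Usage with a deliberately uneven dictionary of lists.
partial = {"car_id": ["a1", "b2"], "make": ["Toyota"]}
print(column_lengths(partial))
print(is_rectangular(partial))
```

If the check fails, the per-column lengths show which field stopped being appended first.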

In [65]:
car_df

| | car_id | description | amount | region | make | model | year_of_man | color | condition | mileage | engine_size | selling_cond | bought_cond | trim | drive_train | reg_city | seat | num_cylinder | horse_power |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | hy7yBdW3a3EbwkZU7WGHqUPr | Lexus ES 350 2008 White | ₦ 4,162,500 | Abuja (FCT), Kubwa | Lexus | ES | 2008 | White | Nigerian Used | 126841 | 3500 | Registered | Registered | 350 | Front | ABUJA | 5.0 | 6.0 | |
| 1 | x2Uz0Pm5w43ConBBsmAOQPOV | Ford Escape XLS 4x4 2005 Ivory | ₦ 1,721,250 | Lagos State, Ikeja | Ford | Escape | 2005 | Ivory | Nigerian Used | 246930 | 3000 | Registered | Imported | XLS 4x4 | All Wheel | LAGOS | 5.0 | 4.0 | 156.0 |
| 2 | n8PHuxWTdbg0HvvedjTf5CJZ | Acura MDX 2011 Blue | ₦ 4,590,000 | Lagos State, Ikotun/Igando | Acura | MDX | 2011 | Blue | Nigerian Used | 122734 | 3700 | Registered | Registered | | | Lagos | | | |
| 3 | xEGLEYM0Nt5xNQJatGMewIVr | Toyota Highlander 2015 Teal | ₦ 18,000,000 | Lagos State, Lekki | Toyota | Highlander | 2015 | Teal | Nigerian Used | 130078 | 3500 | Registered | Imported | | | | | | |
| 4 | nhqZPJ6HvZfoeOT9u0mNppQV | Lexus GX 2003 Black | ₦ 4,050,000 | Lagos State, Amuwo-Odofin | Lexus | GX | 2003 | Black | Nigerian Used | 115676 | 4700 | Registered | Imported | | | | | | |
| 5 | xCKl4TrUzfxyQZUzuxkkttGq | BMW 323i 2008 Gray | ₦ 3,937,500 | Abuja (FCT), Jabi | BMW | 323i | 2008 | Gray | Nigerian Used | 143258 | 2500 | Registered | Registered | | | KADUNA | | | |
| 6 | pGdxpP788sdLQXZoNAQYRMlz | Toyota Corolla 2006 Gray | ₦ 2,925,000 | Abuja (FCT), Jabi | Toyota | Corolla | 2006 | Gray | Nigerian Used | 299999 | 1800 | Registered | Registered | | | ABUJA | | | |
| 7 | mPa4MnSwhQp4wzOAB8SFudOg | Toyota Corolla 2014 Silver | ₦ 6,187,500 | Oyo State, Ibadan | Toyota | Corolla | 2014 | Silver | Nigerian Used | 101872 | 1800 | Registered | Registered | | | Lagos State | | | |
| 8 | Bok4bZ3DUH2NppYYGMHDkQe0 | Lexus GS 2008 Gray | ₦ 2,953,125 | Lagos State, Lekki | Lexus | GS | 2008 | Gray | Nigerian Used | 98229 | 3500 | Registered | Imported | | | LAGOS | | | |
| 9 | kLIpsBxMp7ExegbbzB29OHVN | Honda Accord 2010 Black | ₦ 2,025,000 | Abuja (FCT), Lugbe District | Honda | Accord | 2010 | Black | Nigerian Used | 82091 | 3300 | Registered | Registered | | | | | | |


## Data Inspection and Wrangling

In [48]:
car_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2250 entries, 0 to 2249
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   car_id     2250 non-null   object
 1   name       2250 non-null   object
 2   amount     2250 non-null   object
 3   region     2250 non-null   object
 4   condition  2250 non-null   object
 5   mileage    2250 non-null   object
dtypes: object(6)
memory usage: 105.6+ KB


In [54]:
# check number of duplicate
car_df.duplicated().sum()

119

In [57]:
# remove duplicate
car_df.drop_duplicates(inplace=True)
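
`duplicated()` with no arguments only flags rows that match in every column, and `drop_duplicates` keeps the original index labels, so the index can have gaps afterwards. The sketch below, on a small made-up DataFrame, shows a variant that treats `car_id` as the identifying column and renumbers the index. Using `car_id` as the key is an assumption about the data, not something the project states.

```python
import pandas as pd

# Made-up rows for illustration only.
sample = pd.DataFrame({
    "car_id": ["a1", "a1", "b2"],
    "amount": ["₦ 1,000", "₦ 1,000", "₦ 2,000"],
})

# Rows identical in every column.
full_dupes = int(sample.duplicated().sum())

# Keep the first row per car_id, then renumber the index from zero.
deduped = sample.drop_duplicates(subset="car_id").reset_index(drop=True)

print(full_dupes)
print(deduped)
```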

In [59]:
car_df.head()

| | car_id | name | amount | region | condition | mileage |
|---|---|---|---|---|---|---|
| 0 | /pyU1lsSHiGI7uIroUCL9BWmq | Toyota Camry 2010 Gray | ₦ 3,937,500 | Abuja (FCT), Lugbe District | Nigerian Used | 146335 km |
| 1 | /BvePFLZUaFkEdfYL7PzPaVoS | Peugeot 307 2004 Silver | ₦ 2,475,000 | Kaduna State, Kaduna / Kaduna State | Nigerian Used | 437514 km |
| 2 | /axF2TH1vmfLooOtgVwo0jZFy | Honda Accord 2006 Black | ₦ 1,575,000 | Kaduna State, Kaduna / Kaduna State | Nigerian Used | 389165 km |
| 3 | /Bkct9yfKgQEadVTwXZuKKS0w | Hyundai Elantra 2010 Silver | ₦ 2,205,000 | Rivers State, Port-Harcourt | Nigerian Used | 119715 km |
| 4 | /a7bZ0ZLY8RtsoTroijAcFtKd | Mazda 3 2008 Blue | ₦ 2,250,000 | Lagos State, Yaba | Nigerian Used | 463776 km |


In [62]:
# replace '/' with ''
car_df['car_id'] = car_df['car_id'].str.replace('/','')

In [64]:
# remove the '₦' symbol and the space that follows it
car_df['amount'] = car_df['amount'].str.replace('₦', '', regex=False).str.strip()

In [66]:
# replace ',' with ''
car_df['amount'] = car_df['amount'].str.replace(',','')

In [68]:
# remove the 'km' unit and the space before it
car_df['mileage'] = car_df['mileage'].str.replace('km', '', regex=False).str.strip()
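
After the replacements above, `amount` and `mileage` still hold strings. The following sketch, on a small made-up DataFrame with the same formatting, shows one possible way to finish the job by converting both columns to numbers with `pd.to_numeric`. The regular expression that strips every non-digit character is a choice made for this example, not the project's method, and it assumes the amounts have no decimal part.

```python
import pandas as pd

# Made-up rows that imitate the scraped formatting.
sample = pd.DataFrame({
    "amount": ["₦ 3,937,500", "₦ 2,475,000"],
    "mileage": ["146335 km", "437514 km"],
})

# Drop everything that is not a digit, then convert to integers.
sample["amount"] = pd.to_numeric(
    sample["amount"].str.replace(r"[^\d]", "", regex=True)
)

# Remove the unit and surrounding whitespace, then convert.
sample["mileage"] = pd.to_numeric(
    sample["mileage"].str.replace("km", "", regex=False).str.strip()
)

print(sample.dtypes)
```

Numeric columns make later steps such as sorting by price or plotting mileage behave as expected, which string columns do not.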

In [69]:
car_df.head()

| | car_id | name | amount | region | condition | mileage |
|---|---|---|---|---|---|---|
| 0 | pyU1lsSHiGI7uIroUCL9BWmq | Toyota Camry 2010 Gray | 3937500 | Abuja (FCT), Lugbe District | Nigerian Used | 146335 |
| 1 | BvePFLZUaFkEdfYL7PzPaVoS | Peugeot 307 2004 Silver | 2475000 | Kaduna State, Kaduna / Kaduna State | Nigerian Used | 437514 |
| 2 | axF2TH1vmfLooOtgVwo0jZFy | Honda Accord 2006 Black | 1575000 | Kaduna State, Kaduna / Kaduna State | Nigerian Used | 389165 |
| 3 | Bkct9yfKgQEadVTwXZuKKS0w | Hyundai Elantra 2010 Silver | 2205000 | Rivers State, Port-Harcourt | Nigerian Used | 119715 |
| 4 | a7bZ0ZLY8RtsoTroijAcFtKd | Mazda 3 2008 Blue | 2250000 | Lagos State, Yaba | Nigerian Used | 463776 |


In [70]:
car_df

| | car_id | name | amount | region | condition | mileage |
|---|---|---|---|---|---|---|
| 0 | pyU1lsSHiGI7uIroUCL9BWmq | Toyota Camry 2010 Gray | 3937500 | Abuja (FCT), Lugbe District | Nigerian Used | 146335 |
| 1 | BvePFLZUaFkEdfYL7PzPaVoS | Peugeot 307 2004 Silver | 2475000 | Kaduna State, Kaduna / Kaduna State | Nigerian Used | 437514 |
| 2 | axF2TH1vmfLooOtgVwo0jZFy | Honda Accord 2006 Black | 1575000 | Kaduna State, Kaduna / Kaduna State | Nigerian Used | 389165 |
| 3 | Bkct9yfKgQEadVTwXZuKKS0w | Hyundai Elantra 2010 Silver | 2205000 | Rivers State, Port-Harcourt | Nigerian Used | 119715 |
| 4 | a7bZ0ZLY8RtsoTroijAcFtKd | Mazda 3 2008 Blue | 2250000 | Lagos State, Yaba | Nigerian Used | 463776 |
| ... | ... | ... | ... | ... | ... | ... |
| 2244 | hYnWah7Qg2EhFpE6OkMo019s | Toyota Camry 2012 Green | 7280000 | Oyo State, Ibadan | Foreign Used | 112997 |
| 2246 | wQzoEV0VjBLtQYQvdD7X9JpH | Nissan Frontier 2003 Red | 5229000 | Lagos State, Amuwo-Odofin | Foreign Used | 129459 |
| 2247 | Cx5sfWBoRRXtefZhylUD2sG7 | Hyundai Santa Fe 3.3 Limited AWD 2007 Blue | 3120000 | Abuja (FCT), Garki 2 | Nigerian Used | 170529 |
| 2248 | r4WdVDoq4lU2oqGTcGgYocQu | Nissan Maxima QX 3.5 2002 Blue | 2730000 | Abuja (FCT), Garki 2 | Nigerian Used | 241248 |


## Export into CSV

In [71]:
# index=False avoids an extra "Unnamed: 0" column when the file is read back
car_df.to_csv("car45_data.csv", index=False)