# Project Presentation 1 - Nov 7

##### In this project, we aim to collect and filter data on cars available for purchase, and recommend cars to users based on their preferences. We will demonstrate the following steps:
1. Web scraping using BeautifulSoup to gather car information from cars.com.
2. An example of filtering the data based on user-defined criteria.

# Here is the website we scraped: 

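https://www.cars.com/shopping/results/?dealer_id=&keyword=&list_price_max=&list_price_min=&maximum_distance=all&mileage_max=&monthly_payment=&page=2&page_size=20&sort=best_match_desc&stock_type=cpo&year_max=&year_min=&zip=

The scraper below requests this same results page with a Mercedes-Benz make filter (`makes[]=mercedes_benz`) and a 20-mile maximum distance, and steps the `page` parameter from 1 to 10.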
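
The query string in that URL can also be assembled from a dictionary instead of by hand, which makes it easier to change the page number or the make. The following is a minimal sketch using only the standard library. The helper name `build_results_url` is hypothetical, and it assumes (not verified here) that the site treats omitted parameters the same way as the empty ones in the URL above.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.cars.com/shopping/results/"


def build_results_url(page, make="mercedes_benz", page_size=20,
                      stock_type="cpo", maximum_distance="all"):
    """Return the search-results URL for one page of listings (hypothetical helper)."""
    query = {
        "page": page,
        "page_size": page_size,
        "makes[]": make,
        "maximum_distance": maximum_distance,
        "sort": "best_match_desc",
        "stock_type": stock_type,
    }
    # safe="[]" keeps the brackets in "makes[]" literal instead of percent-encoding them
    return BASE_URL + "?" + urlencode(query, safe="[]")


print(build_results_url(2))
```

Only parameters that already appear in the URLs in this notebook are used; empty ones such as `zip=` are left out under the assumption stated above.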

In [2]:
# Import necessary libraries
from bs4 import BeautifulSoup
import requests
import pandas as pd

In [3]:
# Initialize lists to store scraped data
year = []
name = []
mileage = []
rating = []
review_count = []
price = []

# Loop through multiple pages of car listings
for page in range(1, 11):
    # Construct the URL for each page
    website = 'https://www.cars.com/shopping/results/?page=' + str(page) + '&page_size=20&dealer_id=&list_price_max=&list_price_min=&makes[]=mercedes_benz&maximum_distance=20&mileage_max=&sort=best_match_desc&stock_type=cpo&year_max=&year_min=&zip='

    # Send a request to the website (time out rather than hang on a stalled connection)
    response = requests.get(website, timeout=30)

    # Create a BeautifulSoup object
    soup = BeautifulSoup(response.content, 'html.parser')

    # Find all car listings on the page
    results = soup.find_all('div', {'class' : 'vehicle-card'})

    # Loop through the car listings to extract data
    for result in results:
        # Get the full name from the result
        full_name = result.find('h2').get_text()

        # Initialize variables for year and name
        year_result = 'n/a'
        name_result = full_name

        # Find the first occurrence of four consecutive digits (representing the year)
        # idx is the character position within full_name
        for idx in range(len(full_name) - 3):
            if full_name[idx:idx + 4].isdigit():
                year_result = full_name[idx:idx + 4]
                name_result = full_name[:idx] + full_name[idx + 4:].strip()
                break

        # Name and year are already plain strings, so appending them cannot fail
        name.append(name_result)
        year.append(year_result)

        # For the remaining fields, find() returns None when the element is missing,
        # so get_text() raises AttributeError; record 'n/a' in that case
        try:
            mileage.append(result.find('div', {'class':'mileage'}).get_text())
        except AttributeError:
            mileage.append('n/a')

        try:
            rating.append(result.find('span', {'class':'sds-rating__count'}).get_text())
        except AttributeError:
            rating.append('n/a')

        try:
            # str.strip() removes a set of characters, not a word, so drop the
            # parentheses first and then the word 'reviews'
            review_count.append(result.find('span', {'class':'sds-rating__link'}).get_text().strip().strip('()').replace('reviews', '').strip())
        except AttributeError:
            review_count.append('n/a')

        try:
            price.append(result.find('span', {'class':'primary-price'}).get_text())
        except AttributeError:
            price.append('n/a')
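
The year-extraction step in the loop above takes the first run of four digits in the listing title. As an alternative, the same split could be written with a regular expression that only accepts a standalone 19xx or 20xx number. This is a sketch rather than the code used for the results in this notebook, and the function name `split_year_and_name` is hypothetical.

```python
import re

# A four-digit number from 1900 to 2099 that stands alone as a word
YEAR_PATTERN = re.compile(r"\b(?:19|20)\d{2}\b")


def split_year_and_name(full_name):
    """Return (year, name); year is None when the title has no model year."""
    match = YEAR_PATTERN.search(full_name)
    if match is None:
        return None, full_name.strip()
    # Cut the year out of the title, then collapse the leftover whitespace
    name = full_name[:match.start()] + full_name[match.end():]
    return int(match.group()), " ".join(name.split())


print(split_year_and_name("2024 Mercedes-Benz AMG GLE 53 Base"))
print(split_year_and_name("Certified 2022 Mercedes-Benz GLB 250 Base 4MATIC"))
```

Returning `None` instead of `'n/a'` for a missing year is a design choice of this sketch; it lets the year be stored as an integer from the start.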


In [4]:
# Create a DataFrame to store the scraped data
car_dealer = pd.DataFrame({'Year': year, 'Name': name, 'Mileage':mileage, 'Rating': rating, 
                           'Review Count': review_count, 'Price': price})

# Data Cleaning
# Convert the 'Year' column to integers and handle missing values ('n/a')
car_dealer['Year'] = pd.to_numeric(car_dealer['Year'], errors='coerce').astype('Int64')

# Display the DataFrame
car_dealer

| | Year | Name | Mileage | Rating | Review Count | Price |
|---|---|---|---|---|---|---|
| 0 | 2024 | Mercedes-Benz AMG GLE 53 Base | 4,128 mi. | 4.7 | 795 | $89,610 |
| 1 | 2020 | Mercedes-Benz GLS 450 Base 4MATIC | 42,186 mi. | 4.7 | 373 | $57,995 |
| 2 | 2022 | Mercedes-Benz AMG CLA 35 Base 4MATIC | 23,804 mi. | 4.7 | 795 | $45,784 |
| 3 | 2022 | Mercedes-Benz S-Class S 580 4MATIC | 10,908 mi. | 4.6 | 476 | $98,760 |
| 4 | 2020 | Mercedes-Benz AMG GT 63 S 4-Door | 16,335 mi. | 4.8 | 647 | $113,892 |
| ... | ... | ... | ... | ... | ... | ... |
| 205 | 2022 | Mercedes-Benz GLE 350 Base 4MATIC | 18,335 mi. | 4.8 | 485 | $57,980 |
| 206 | 2023 | Mercedes-Benz EQB 250 Base | 5,135 mi. | 4.9 | 6925 | $43,988 |
| 207 | 2022 | Mercedes-Benz AMG GLE 53 Base | 11,438 mi. | 4.9 | 2327 | $84,990 |
| 208 | 2021 | Mercedes-Benz AMG GLE 53 Base | 17,998 mi. | 4.9 | 2327 | $72,490 |
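
In the table above, Mileage, Review Count, and Price are still display strings such as `4,128 mi.` and `$89,610`. One possible way to turn them into numbers is a small parsing helper like the sketch below. The name `parse_number` is hypothetical and this step is not part of the cleaning code above.

```python
import re


def parse_number(text):
    """Return the first number found in text as a float, or None if there is none."""
    # One digit, then any mix of digits and thousands separators, then optional decimals
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    return float(match.group().replace(",", ""))


for raw in ["$89,610", "4,128 mi.", "4.7", "n/a"]:
    print(raw, "->", parse_number(raw))
```

A helper like this could then be applied to a column, for example with `car_dealer['Price'].map(parse_number)`.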


# 2. Example of App function

##### Here is an example of what our app does:

In [7]:
# Example of App function
# Example of user input (you can prompt users for their desired filters)
min_year = 2020
max_year = 2023
min_price = 0
max_price = 40000
min_rating = 4.5

# Filter the DataFrame based on user input
# Strip '$' and ',' from the prices; values that are not numbers (e.g. 'n/a') become NaN
price_value = pd.to_numeric(car_dealer['Price'].str.replace(r'[$,]', '', regex=True), errors='coerce')
filtered_cars = car_dealer[
    (car_dealer['Year'].between(min_year, max_year, inclusive='both')) &
    (price_value.between(min_price, max_price, inclusive='both')) &
    (pd.to_numeric(car_dealer['Rating'], errors='coerce') >= min_rating)
]

# Display the filtered DataFrame
filtered_cars

| | Year | Name | Mileage | Rating | Review Count | Price |
|---|---|---|---|---|---|---|
| 16 | 2021 | Mercedes-Benz A-Class A 220 | 22,886 mi. | 4.7 | 342 | $32,900 |
| 20 | 2022 | Mercedes-Benz GLB 250 Base 4MATIC | 8,116 mi. | 4.9 | 106 | $38,997 |
| 22 | 2022 | Certified Mercedes-Benz GLB 250 Base 4MATIC | 8,116 mi. | 4.9 | 106 | $38,997 |
| 56 | 2021 | Mercedes-Benz CLA 250 Base 4MATIC | 30,220 mi. | 4.6 | 810 | $36,500 |
| 64 | 2021 | Mercedes-Benz GLB 250 Base | 25,901 mi. | 4.8 | 1374 | $38,496 |
| 79 | 2021 | Mercedes-Benz GLC 300 Base 4MATIC | 20,542 mi. | 4.7 | 1227 | $38,538 |
| 83 | 2022 | Mercedes-Benz GLB 250 Base 4MATIC | 20,635 mi. | 4.6 | 633 | $39,490 |
| 98 | 2021 | Mercedes-Benz GLB 250 Base 4MATIC | 17,451 mi. | 4.9 | 2352 | $36,399 |
| 111 | 2020 | Certified Mercedes-Benz GLA 250 Base 4MATIC | 37,267 mi. | 4.9 | 98 | $29,197 |
| 113 | 2021 | Certified Mercedes-Benz A-Class A 220 4MATIC | 2,970 mi. | 4.9 | 106 | $32,997 |


In [8]:
filtered_cars.shape

(17, 6)

##### As shown, the user's inputs narrow the data down to a list of cars that the user could consider buying.
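
For the app, the filter could be wrapped in a function that takes the user's bounds as arguments. The sketch below is one possible shape for it, not the final app code: the function name `filter_cars` is hypothetical, and the third sample row is made up purely to show how unparseable values are handled (the first two rows are taken from the scraped table).

```python
import pandas as pd


def filter_cars(cars, min_year, max_year, min_price, max_price, min_rating):
    """Return the rows of `cars` that satisfy every bound (all bounds inclusive)."""
    # Convert display strings such as "$38,997" to numbers; anything else becomes NaN
    price = pd.to_numeric(
        cars["Price"].str.replace(r"[$,]", "", regex=True), errors="coerce"
    )
    rating = pd.to_numeric(cars["Rating"], errors="coerce")
    year = pd.to_numeric(cars["Year"], errors="coerce")
    # Comparisons against NaN are False, so rows with missing values are dropped
    mask = (
        year.between(min_year, max_year)
        & price.between(min_price, max_price)
        & (rating >= min_rating)
    )
    return cars[mask]


sample = pd.DataFrame({
    "Year": [2021, 2024, 2022],
    "Name": [
        "Mercedes-Benz A-Class A 220",
        "Mercedes-Benz AMG GLE 53 Base",
        "Hypothetical listing without price or rating",  # made-up row
    ],
    "Rating": ["4.7", "4.7", "n/a"],
    "Price": ["$32,900", "$89,610", "n/a"],
})

print(filter_cars(sample, min_year=2020, max_year=2023,
                  min_price=0, max_price=40000, min_rating=4.5))
```

Keeping the conversion inside the function leaves the original string columns untouched for display.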

##### In this presentation, we demonstrated the first steps of our project: scraping data from cars.com, cleaning it, and filtering it based on user-defined criteria. We created a DataFrame from the scraped data and filtered it down to the subset of cars that meets the specified requirements.

# Plan of Action:

#### Next, we will add more parameters (e.g. number of doors, interior color, exterior color, fuel type) to give more accurate recommendations. We will also develop an interactive web app using Flask.


#### We are switching from BeautifulSoup to Scrapy (in progress) to support a larger dataset. We will also scrape additional websites (autotrader.com, truecar.com, etc.) to expand it further.
