<a href="https://colab.research.google.com/github/miguel-peralta/cars_ista322/blob/main/cars.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Cars Relational Databases
ISTA 322 Final Project, Spring 2024 <br>
Miguel Candido Aurora Peralta <br>
## Extract
### KBB Web Scraping
KBB separates cars into new and used categories, and the list for each category is on its own page. This project combines the two lists into one.
#### Used Cars



In [2]:
# Imports
import requests
from bs4 import BeautifulSoup
import pandas as pd

First we need to get a list of all of the URLs for the car models on KBB so that more information can be extracted from those pages.

In [8]:
def get_html_doc(url):
  '''Returns the HTML document for the given URL as a string'''
  # Request the page at the given URL
  response = requests.get(url)
  # Return the response body as text (HTML, not JSON)
  return response.text

In [44]:
def get_models_df():
  '''
  Returns a dataframe containing the url, make, model, and year from the
  relative URLs listed on the car models list pages.
  Returns:
    car_info (DataFrame): make, model, year, and url of each model
  '''
  # Create dataframe and lists to store info
  car_info = pd.DataFrame()
  urls = []
  make = []
  model = []
  year = []
  base_url = 'https://www.kbb.com'

  for page in ['new', 'used']:
    url = f'https://www.kbb.com/car-make-model-list/{page}'
    # Create HTML object from url
    html = get_html_doc(url)
    soup = BeautifulSoup(html, 'html.parser')
    # Create list to store relative URLs from the page
    links = []
    # Get all links from the page (the links to models all have the same style)
    for link in soup.find_all('a', attrs={'style':"padding:12px 8px;display:inline-block"}):
      # Add links to the list
      links.append(link.get('href'))
    # Split links using / as delimiter and add information to lists
    for car in links:
      urls.append(base_url+car)
      link_split = car.split('/')
      make.append(link_split[1])
      model.append(link_split[2])
      year.append(link_split[3])

  # Use lists to populate dataframe
  car_info['url'] = urls
  car_info['make'] = make
  car_info['model'] = model
  car_info['year'] = year

  return car_info
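
The link-splitting step inside `get_models_df` can be checked on its own, without requesting anything from KBB. The sketch below assumes, as the loop above does, that every relative link has the form `/make/model/year/`. `parse_model_href` is a hypothetical helper name used only for illustration; it is not part of the notebook.

```python
def parse_model_href(href, base_url='https://www.kbb.com'):
  '''Split a relative link such as /audi/a3/2022/ into its parts.'''
  # strip('/') removes the leading and trailing slashes so that
  # split('/') yields no empty strings
  parts = href.strip('/').split('/')
  # Assumption: a model link has exactly three parts (make, model, year)
  if len(parts) != 3:
    raise ValueError(f'unexpected link format: {href}')
  make, model, year = parts
  return {'url': base_url + href, 'make': make, 'model': model, 'year': year}

print(parse_model_href('/audi/a3/2022/'))
```

Failing loudly on links that do not have three parts is one possible design choice. Indexing into the split list, as `get_models_df` does, would instead raise an `IndexError` on short links or silently accept longer ones.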


In [45]:
cars = get_models_df()

In [32]:
url = 'https://www.kbb.com/car-make-model-list/new'
# Create HTML object from url
html = get_html_doc(url)
soup = BeautifulSoup(html, 'html.parser')
print(soup.text)

New Car Model/Make Reference List


Car ValuesPrice New/UsedMy Car's ValueInstant Cash OfferCars for SaleCars for SaleFree Dealer Price QuoteVehicle History ReportFind Local DealersPrivate Seller ExchangePrivate Seller CarsSell Your CarCar ReviewsBest CarsDealer ReviewsKBB Expert ReviewsElectric Vehicle GuideKBB AwardsLatest Car NewsCar RepairAuto Repair PricesCar RecallsMaintenance PricingFind an Auto ShopService AdvisorOBD-II CodesResearch ToolsCar ResearchCar FinderCompare CarsVehicle History ReportCar ValuesCar LoansInsuranceCheck My CreditExtended WarrantyRecallsHomeAll CarsNew Car Model/Make Reference ListSee Used Car ListNew Car Model/Make Reference ListSort by:ModelMakeJump to:ABCDEFGHIJKLMNOPQRSTUVWXYZNumberView AllModelMakeYearsA3Audi2025, 2024, 2023A4Audi2024, 2023A4 allroadAudi2024, 2023A5Audi2024, 2023A6Audi2024, 2023A6 allroadAudi2024, 2023A7Audi2024, 2023A8Audi2024, 2023AcadiaGMC2024, 2023AccordHonda2024, 2023Accord HybridHonda2024, 2023ADXAcura2025AirLucid2024, 2023Alas

Next, we need to retrieve the URLs of the styles for each car model. Some models are too new to have any styles listed yet. In that case, we just return the model's URL, since the model page already contains all of the information that the individual style pages would.
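
The fallback rule described above can be sketched separately from the scraping itself. This is a minimal sketch, assuming the relative style links have already been collected from the model page into a list. `resolve_style_urls` and its parameters are hypothetical names, not the notebook's final implementation.

```python
from urllib.parse import urljoin

def resolve_style_urls(model_url, style_hrefs):
  '''Return absolute style URLs, or the model URL if no styles are listed.'''
  # Drop missing entries (link.get('href') returns None when no href exists)
  hrefs = [h for h in style_hrefs if h]
  if not hrefs:
    # No styles listed yet: fall back to the model page itself
    return [model_url]
  # urljoin resolves root-relative links against the site of model_url
  return [urljoin(model_url, h) for h in hrefs]

model_url = 'https://www.kbb.com/audi/a3/2022/'
print(resolve_style_urls(model_url, ['/audi/a3/2022/premium-sedan-4d/']))
print(resolve_style_urls(model_url, []))
```

Returning a one-element list in the fallback case keeps the return type the same either way, so the calling code can loop over the result without a special case.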

In [None]:
def get_styles_urls(url):
  raise NotImplementedError  # stub: the body is not written yet in this cell

In [16]:
# Create HTML object from url
url = 'https://www.kbb.com/audi/a3/2022/'
html = get_html_doc(url)
soup = BeautifulSoup(html, 'html.parser')
# Create list to store relative URLs from the page
links = []
# Get all style links from the page (they all have the attribute width="220px")
for link in soup.find_all('a', attrs={'width': '220px'}):
  # Add links to the list
  links.append(link.get('href'))

for i in links:
  print(i)

/audi/a3/2022/premium-plus-sedan-4d/
/audi/a3/2022/premium-sedan-4d/
/audi/a3/2022/prestige-sedan-4d/
