<a href="https://colab.research.google.com/github/miguel-peralta/cars_ista322/blob/main/cars.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Cars Relational Databases
ISTA 322 Final Project, Spring 2024 <br>
Miguel Candido Aurora Peralta <br>
## Extract
### KBB Web Scraping
KBB separates cars into new and used categories, and the lists of new and used cars are on separate pages. These lists will be combined into one for this project.
#### Used Cars



In [2]:
# Imports
import requests
from bs4 import BeautifulSoup
import pandas as pd

First we need to get a list of all of the URLs for the car models on KBB so that more information can be extracted from those pages.

In [8]:
def get_html_doc(url):
  '''Returns the HTML document for the given URL as a string'''
  # Request the page at the given URL
  response = requests.get(url)
  # Return the response body as text (this is HTML, not JSON)
  return response.text

In [51]:
def get_kbb_df():
  '''
  Returns a dataframe containing the url, make, model, and year from the
  relative URLs listed on the car models list pages.
  Returns:
    car_info (DataFrame): make, model, year, and url of each model
  '''
  # Create dataframe and lists to store info
  car_info = pd.DataFrame()
  urls = []
  make = []
  model = []
  year = []
  base_url = 'https://www.kbb.com'

  for page in ['new', 'used']:
    url = f'https://www.kbb.com/car-make-model-list/{page}'
    # Create HTML object from url
    html = get_html_doc(url)
    soup = BeautifulSoup(html, 'html.parser')
    # Create list to store relative URLS from the page
    links = []
    # Get all links from the page (the links to models all have the same style)
    for link in soup.find_all('a', attrs={'style':"padding:12px 8px;display:inline-block"}):
      # Add links to the list
      links.append(link.get('href'))
    # Split links using / as the delimiter and add information to lists
    for car in links:
      urls.append(base_url+car)
      link_split = car.split('/')
      make.append(link_split[1])
      model.append(link_split[2])
      year.append(link_split[3])

  # Use lists to populate dataframe
  car_info['url'] = urls
  car_info['make'] = make
  car_info['model'] = model
  car_info['year'] = year

  return car_info


In [52]:
kbb = get_kbb_df()
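
The link-splitting step inside `get_kbb_df` can be checked offline, without requesting any pages. The sketch below is a hypothetical helper (`parse_model_link` is not part of the original notebook) and it assumes the relative links have the form `/make/model/year/`, which is what the indexing in `get_kbb_df` implies.

```python
def parse_model_link(href, base_url='https://www.kbb.com'):
    '''
    Hypothetical helper: splits a relative link such as '/audi/a3/2022/'
    into its make, model, and year. Returns None if the link has fewer
    than three path segments.
    '''
    # Drop the leading and trailing slashes so no empty strings are produced
    parts = href.strip('/').split('/')
    if len(parts) < 3:
        return None
    make, model, year = parts[:3]
    return {'url': base_url + href, 'make': make, 'model': model, 'year': year}


row = parse_model_link('/audi/a3/2022/')
print(row['make'], row['model'], row['year'])
```

Returning `None` for short links is one possible design choice; it lets a caller skip malformed links instead of raising an `IndexError` partway through a page.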

In [None]:
def get_styles_urls(url):
    '''
    Given the URL to a year's model of a car, returns a list of the urls to the
    styles of that model. If there is no style information available, returns a
    1-element list with just the model page URL.
    Args:
      url (string): url to a year's model of a car
    Returns:
      styles (list): list of urls for that model's styles
    '''
    # Create HTML object from the url argument
    html = get_html_doc(url)
    soup = BeautifulSoup(html, 'html.parser')
    styles = []
    # The elements containing the style links are always 220px wide
    for style in soup.find_all('a', attrs={'width': '220px'}):
      # Add links to the list
      styles.append(url+style.get('href'))
    if len(styles) < 1:
      styles.append(url)
    return styles

In [None]:
# Look up the style URLs for each model page (one request per row)
kbb['styles'] = kbb['url'].apply(get_styles_urls)

## Creating make and model tables

In [None]:
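def create_make_table(kbb):
  '''Returns a make table with one row per unique make and a make_id key'''
  make = pd.DataFrame(columns = ['make'])
  make_list = kbb['make'].unique()
  make['make'] = make_list
  # reset_index returns a new dataframe; with inplace=True it returns None
  make = make.reset_index()
  make = make.rename(columns={'index':'make_id'})
  return make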

In [None]:
make = create_make_table(kbb)
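
To illustrate what the make table encodes, here is a minimal sketch in plain Python that assigns an integer id to each make in order of first appearance, which is the same order `unique()` followed by `reset_index()` produces. The function name `assign_make_ids` and the sample list are made up for illustration.

```python
def assign_make_ids(makes):
    '''
    Hypothetical helper: maps each distinct make to an integer id,
    numbered from 0 in order of first appearance.
    '''
    ids = {}
    for name in makes:
        # Only the first occurrence of a make receives a new id
        if name not in ids:
            ids[name] = len(ids)
    return ids


sample_makes = ['audi', 'bmw', 'audi', 'acura']  # made-up sample data
print(assign_make_ids(sample_makes))
```

A models table can then store the integer `make_id` instead of repeating the make name on every row.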

In [None]:
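def create_models_table(kbb, make):
  '''Returns a models table that references each make by make_id'''
  # Attach make_id to every row by joining on the make name
  models = kbb.merge(make, on='make', how='left')
  # Keep the planned columns ('model' added as an assumption); skip any not built yet
  wanted = ['make_id', 'model', 'year', 'styles']
  models = models[[c for c in wanted if c in models.columns]]
  return models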

Next we need to retrieve the URLs for the styles of each car model. Some cars that are too new don't have any styles listed yet. In that case, `get_styles_urls` returns a one-element list containing only the model page URL, because all of the information that would appear on the individual style pages is already on the model page.
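
The fallback for models without styles can be tried on a small HTML string instead of a live page. This sketch swaps BeautifulSoup for the standard library's `html.parser` so it runs without extra packages, and it follows the docstring of `get_styles_urls` by falling back to the model page URL. The names `StyleLinkParser` and `style_urls_or_model_url` and the sample HTML are assumptions made for illustration, not KBB's real markup.

```python
from html.parser import HTMLParser


class StyleLinkParser(HTMLParser):
    '''Collects the href of every <a> tag whose width attribute is 220px'''

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'a' and attrs.get('width') == '220px' and attrs.get('href'):
            self.hrefs.append(attrs['href'])


def style_urls_or_model_url(html, model_url):
    '''
    Hypothetical helper: returns the style URLs found in html, or a
    one-element list with model_url when no style links are present.
    '''
    parser = StyleLinkParser()
    parser.feed(html)
    # Assumes each href is a path segment relative to the model page
    styles = [model_url + href for href in parser.hrefs]
    return styles or [model_url]


# Made-up HTML fragments standing in for a model page
with_styles = '<a width="220px" href="premium/">Premium</a><a href="/other/">Other</a>'
without_styles = '<p>No styles listed yet</p>'
model_url = 'https://www.kbb.com/audi/a3/2022/'

print(style_urls_or_model_url(with_styles, model_url))
print(style_urls_or_model_url(without_styles, model_url))
```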