# Activity: Clean data using SQL

## Overview

As a data analyst working with a used car dealership startup venture, you need to find out which cars are most popular with customers so that the investors can make sure to stock accordingly. For this activity, we will:

- create a custom dataset in BigQuery,
- import a .csv file as a new table in the BigQuery dataset, and
- use SQL queries to clean automobile data.

## Dataset

The data comes from an external source and contains historical sales data on car prices and their features. It can be downloaded from the [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/dataset/10/automobile), from [Google Sheets](https://docs.google.com/spreadsheets/d/1GRu_BUz4T6GcsQQindn_pkltkmA94M-HU6BWCfpjMWE/edit?usp=sharing), or directly as a [.csv file](/activities/sql/c04m03-clean-data-using-sql/c04m03-automobile-data.csv). A preview of the comma-delimited file is shown below.

![Automobile data in csv](c04m03-automobile-data-csv.png 'Automobile data in csv')

## Importing the data in BigQuery

The following steps import the automobile data into BigQuery:

- **Create dataset** with **Dataset ID** `cars`
- In the **Dataset info** window, select the **CREATE TABLE** button
- In the **Source** section, select the ***Upload*** option in **Create table from**
- Browse to the `c04m03-automobile-data.csv` file and open
- Set the file format to `CSV`
- In the **Destination** section, name the table as `car_info`
- In the **Schema** section, select **Auto detect**

Finally, select **Create table**. A new table `car_info` is created and appears in the explorer pane under the dataset `cars`. A preview of the data is shown below.

![Automobile data in BigQuery](c04m03-automobile-data-bigquery.png 'Automobile data in BigQuery')
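
Before cleaning, it can help to confirm that the upload brought in every row. The query below is a minimal sketch of such a check; the alias `total_rows` is only an illustrative name, and the result should be compared against the number of data rows in the .csv file (excluding the header row).

```sql
-- Count the rows in the imported table.
-- Compare the result with the number of data rows in the source .csv file.
SELECT
  COUNT(*) AS total_rows
FROM `plucky-aegis-427011-v5.cars.car_info`;
```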

## Data cleaning

The [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/dataset/10/automobile) describes the data in full (the **"data description"**), including the acceptable values or ranges for every variable. It also notes which variables contain missing values.

### Variable | make

This column should contain only one of these values: alfa-romero, audi, bmw, chevrolet, dodge, honda, isuzu, jaguar, mazda, mercedes-benz, mercury, mitsubishi, nissan, peugot, plymouth, porsche, renault, saab, subaru, toyota, volkswagen, volvo. According to the data description, there are no missing values. I run the following query to confirm this:

In [None]:
SELECT *
FROM `plucky-aegis-427011-v5.cars.car_info`
WHERE
  make NOT IN ('alfa-romero', 'audi', 'bmw', 'chevrolet', 'dodge', 'honda', 'isuzu', 'jaguar', 'mazda', 'mercedes-benz', 'mercury', 'mitsubishi', 'nissan', 'peugot', 'plymouth', 'porsche', 'renault', 'saab', 'subaru', 'toyota', 'volkswagen', 'volvo')
  OR
  make IS NULL;

The above query returns no rows, so this variable is clean. If I were able to execute `UPDATE` queries in a BigQuery sandbox account, however, I would run this query to correct the spelling of "peugot" to "peugeot" in the dataset:

In [None]:
UPDATE `plucky-aegis-427011-v5.cars.car_info`
SET
  make = 'peugeot'
WHERE
  make = 'peugot';
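
Where the `UPDATE` statement cannot be run, one possible workaround is to apply the same correction at query time. The sketch below assumes that BigQuery's `SELECT * REPLACE` syntax is usable in the sandbox; it changes only the query result and leaves the stored table untouched.

```sql
-- Return every column, substituting the corrected spelling in make.
-- The stored table is not modified; only the query result changes.
SELECT
  * REPLACE (IF(make = 'peugot', 'peugeot', make) AS make)
FROM `plucky-aegis-427011-v5.cars.car_info`;
```

The result could then be saved as a new table from the query results pane, if a corrected copy is needed for later steps.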

### Variable | fuel_type

According to the data description, the fuel type can only be diesel or gas, and there should be no missing values. Running the query below confirms that diesel and gas are the only values and that there are no null values (a null would appear as its own row in the result):

In [None]:
SELECT
  DISTINCT fuel_type
FROM `plucky-aegis-427011-v5.cars.car_info`;
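
A variation on the same check, shown here as a sketch rather than as part of the original activity, groups the rows by `fuel_type` and counts them. This shows how many cars fall into each category, and any null values would be grouped into a row of their own. The alias `row_count` is only an illustrative name.

```sql
-- Count the rows for each distinct fuel_type value.
-- NULL values, if any, are grouped into a single row.
SELECT
  fuel_type,
  COUNT(*) AS row_count
FROM `plucky-aegis-427011-v5.cars.car_info`
GROUP BY fuel_type
ORDER BY row_count DESC;
```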

### Variable | num_of_doors

The data description indicates that the number of doors should be either two or four, but that there are missing values present.