# From spreadsheets to BigQuery

## Scenario

I have been working for a national store chain as a data analyst. Management is interested in the amount of inventory being kept in storage at regional sites. My supervisor has requested an analysis of inventory and sales data to support recommendations for changes to inventory management practices.

## Dataset

To complete my analysis, I have received three datasets containing information about:

- **inventory** that can be viewed in [Google Sheets](https://drive.google.com/file/d/1FcntnrgBf2jY67FIwlAU4RiRDflSS9tY/view?usp=drive_link) or the [.csv file](/activities/combined/c05m04-spreadsheets-to-bigquery/c05m04-data-inventory.csv),
- **products** that can be viewed in [Google Sheets](https://drive.google.com/file/d/1Q6WjypLQCAJa8aw_9TlVB-fgMu07qgqs/view?usp=drive_link) or the [.csv file](/activities/combined/c05m04-spreadsheets-to-bigquery/c05m04-data-products.csv), and
- **sales** that can be viewed in [Google Sheets](https://drive.google.com/file/d/1h6yhN7ccy-cszOavRv1BcbzSMdh6T_1I/view?usp=drive_link) or the [.csv file](/activities/combined/c05m04-spreadsheets-to-bigquery/c05m04-data-sales.csv).

## Preparation

To inspect the three datasets, I import all the .csv files into a Google Sheets spreadsheet by following these steps:

1. In Google Sheets, open a new blank spreadsheet.
2. Select *Import* from the **File** menu.
3. Navigate to the location of the .csv files on my computer, select `c05m04-data-inventory.csv` and choose *Insert*.
4. In the **Import file** window, select the following options and then choose *Import data*:
   - *Import location* drop-down: Replace current sheet
   - *Separator type* drop-down: Detect automatically
   - "Convert text to numbers, dates and formulas" is checked
5. Right-click on the worksheet tab, select *Rename* and change the name to *inventory*.

I repeat these steps for the products and sales datasets so that each dataset sits in its own worksheet. Now that all the datasets have been combined in one spreadsheet, I use the Google Sheets *Convert to table* tool to apply automatic formatting to every dataset and to enable quick filtering and sorting.
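The same first look at a file can also be scripted. The sketch below is a minimal Python example that uses only the standard library to report the column names and row count of a CSV. The `summarize_csv` helper and the sample rows are made up for illustration and are not taken from the actual datasets.

```python
import csv
import io


def summarize_csv(text):
    """Return the column names and the number of data rows in CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)  # reading the rows also populates reader.fieldnames
    return {"columns": reader.fieldnames, "row_count": len(rows)}


# Hypothetical sample shaped like the inventory file (values are invented).
sample = (
    "ProductId,StoreId,StoreName,QuantityAvailable\n"
    "1,100,Example Store,12\n"
    "2,100,Example Store,7\n"
)

summary = summarize_csv(sample)
print(summary["columns"])
print(summary["row_count"])
```

To try it on one of the downloaded files, the file contents could be passed in instead of `sample`, for example `summarize_csv(open("c05m04-data-inventory.csv", newline="").read())`.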

## Cleaning

Below is a log of all the changes I made in cleaning the datasets:

### Worksheet: **inventory**

  - ProductId: No blank values, all types number
  - StoreId: No blank values, all types number
  - StoreName:
    - Blank for ProductId 748 with StoreId 21791
    - Every other product with the StoreId 21791 has the StoreName "Dollar Tree"
    - Updated StoreName for ProductId 748 to "Dollar Tree"
    - Now no blank values, all types text
  - Address: No blank values, all types text
  - neighborhood:
    - No blank values, all types text
    - Capitalize column name for consistency
  - QuantityAvailable: No blank values, all types number
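The StoreName fix above relies on one rule: a blank name can be recovered from other rows that share the same StoreId. A minimal Python sketch of that rule follows. The `fill_store_names` helper is hypothetical, and apart from StoreId 21791, ProductId 748 and "Dollar Tree", the row values are invented.

```python
def fill_store_names(rows):
    """Fill blank StoreName values using other rows with the same StoreId."""
    # Build a StoreId -> StoreName lookup from rows that do have a name.
    known = {}
    for row in rows:
        name = row["StoreName"].strip()
        if name:
            known.setdefault(row["StoreId"], name)

    filled = []
    for row in rows:
        fixed = dict(row)  # copy so the input rows are left untouched
        if not fixed["StoreName"].strip():
            # Stay blank if no other row names this store.
            fixed["StoreName"] = known.get(fixed["StoreId"], "")
        filled.append(fixed)
    return filled


# ProductId 101 and 102 are invented; the other values come from the log above.
rows = [
    {"ProductId": "101", "StoreId": "21791", "StoreName": "Dollar Tree"},
    {"ProductId": "748", "StoreId": "21791", "StoreName": ""},
    {"ProductId": "102", "StoreId": "21791", "StoreName": "Dollar Tree"},
]

cleaned = fill_store_names(rows)
print(cleaned[1]["StoreName"])
```

The sketch takes the first name it finds for a StoreId, so it assumes each StoreId maps to a single store name, as was the case for StoreId 21791.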
  
### Worksheet: **products**

  - ProductId:
    - No blank values, but not all types number
    - The last line of the table has the value NA assigned for ProductId, ProductName, Supplier and ProductCost
    - Dataset owner confirms row can be deleted as it was entered by mistake
    - Row deleted; all remaining values are type number
  - ProductName: No blank values, all types text
  - Supplier: No blank values, all types text
  - ProductCost: No blank values, all types number
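The check that caught the stray NA row can be sketched in Python as follows. The `split_valid_products` helper and the sample products are hypothetical. Rejected rows are returned rather than silently dropped, mirroring the step of confirming with the dataset owner before deleting anything.

```python
def split_valid_products(rows):
    """Separate rows with a numeric ProductId from rows that need review."""
    valid, rejected = [], []
    for row in rows:
        if row["ProductId"].strip().isdigit():
            valid.append(row)
        else:
            rejected.append(row)
    return valid, rejected


# Invented sample rows; the last one imitates the NA row described above.
products = [
    {"ProductId": "1", "ProductName": "Example A", "Supplier": "S1", "ProductCost": "2.50"},
    {"ProductId": "2", "ProductName": "Example B", "Supplier": "S2", "ProductCost": "4.00"},
    {"ProductId": "NA", "ProductName": "NA", "Supplier": "NA", "ProductCost": "NA"},
]

valid, rejected = split_valid_products(products)
print(len(valid), "valid rows;", len(rejected), "row(s) to review")
```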

### Worksheet: **sales**

  - SalesId: No blank values, all types number
  - StoreId: No blank values, all types number
  - ProductId: No blank values, all types number
  - Date: No blank values, all types number
  - UnitPrice: No blank values, all types number
  - Quantity: No blank values, all types number
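The "no blank values, all types number" checks in this log could also be automated. The following Python sketch counts blank and non-numeric values per column. The `profile_columns` helper is hypothetical, and the sample rows are invented and deliberately contain one blank and one non-numeric value to show what the report looks like.

```python
def profile_columns(rows):
    """Count blank and non-numeric values for every column."""
    report = {}
    for row in rows:
        for column, value in row.items():
            stats = report.setdefault(column, {"blank": 0, "non_numeric": 0})
            text = (value or "").strip()
            if not text:
                stats["blank"] += 1
                continue
            try:
                float(text)  # succeeds for values such as "3" or "4.25"
            except ValueError:
                stats["non_numeric"] += 1
    return report


# Invented rows with two deliberate problems: a blank Quantity and a bad UnitPrice.
sales = [
    {"SalesId": "1", "UnitPrice": "4.25", "Quantity": "3"},
    {"SalesId": "2", "UnitPrice": "n/a", "Quantity": "1"},
    {"SalesId": "3", "UnitPrice": "1.00", "Quantity": ""},
]

report = profile_columns(sales)
for column, stats in report.items():
    print(column, stats)
```

A column passes the check in the log when both counts are zero.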

The data has now been cleaned and verified. The cleaned data combined into one spreadsheet can be viewed in [Google Sheets](https://docs.google.com/spreadsheets/d/1SEQvvrrhy7lOMcXIf3FkE2D5Qx38PSbwQTbGOxkEXeY/edit?usp=sharing) or the [Excel file](/activities/combined/c05m04-spreadsheets-to-bigquery/c05m04-product-sales-data.xlsx).

## Import in BigQuery

To create a BigQuery dataset with tables, I first export each worksheet of the cleaned Google Sheets spreadsheet: with the worksheet open, I select *Download* from the **File** menu and choose the "Comma-separated values (.csv)" option. In BigQuery, I take the following steps to import the data:

- **Create dataset** with **Dataset ID** `product_sales`
- In the **Dataset info** window, select the **CREATE TABLE** button
- In the **Source** section, select the ***Upload*** option in **Create table from**
- Browse to the newly downloaded inventory .csv file and open it
- Set the file format to CSV
- In the **Destination** section, name the table as `inventory`
- In the **Schema** section, select **Auto detect**
- Finally, select **Create table**

A new table `inventory` has been created and appears in the explorer pane under the dataset `product_sales`. I repeat the above steps to create the `products` and `sales` tables from the newly downloaded .csv files.
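The console steps above could also be scripted. The sketch below is one possible approach, not the method used in this write-up. It assumes the `google-cloud-bigquery` Python client library is installed, credentials are configured, and the `product_sales` dataset already exists. The project name `my-project` and both helper functions are placeholders.

```python
def build_table_id(project, dataset, table):
    """Return a fully qualified BigQuery table ID: project.dataset.table."""
    return f"{project}.{dataset}.{table}"


def load_csv_to_bigquery(csv_path, table_id):
    """Upload a local CSV file into a BigQuery table with schema auto-detection."""
    # Imported here so the helper above works without the client library.
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,  # assumes the first row holds the column names
        autodetect=True,      # same idea as "Auto detect" in the console
    )
    with open(csv_path, "rb") as source_file:
        job = client.load_table_from_file(
            source_file, table_id, job_config=job_config
        )
    job.result()  # wait for the load job to finish
    return client.get_table(table_id).num_rows


# "my-project" is a placeholder; dataset and table names follow the steps above.
table_id = build_table_id("my-project", "product_sales", "inventory")
print(table_id)
```

Calling `load_csv_to_bigquery("inventory.csv", table_id)` with a real project and file path would then start the upload and return the number of rows in the resulting table.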

## Query: How many years of sales data is included?

To determine the oldest and newest sales data dates, I execute the following query:

```sql
-- Find the earliest and latest sale dates in the sales table
SELECT
  MIN(Date) AS oldest,
  MAX(Date) AS newest
FROM
  `plucky-aegis-427011-v5.customer_data.sales`;
```

The output indicates that the sales data runs from 1 January 2017 up to and including 30 December 2020, which covers four calendar years, as shown below:

![Query to determine start and end date](c05m04-query-start-end.png 'Query to determine start and end date')
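To turn the two dates into an answer to the question in the heading, the span can be worked out with a few lines of Python. This is a small illustrative sketch; the `describe_date_span` helper is hypothetical, and the two dates are the ones reported by the query.

```python
from datetime import date


def describe_date_span(oldest, newest):
    """Summarize how much time lies between two dates, both included."""
    days = (newest - oldest).days + 1  # +1 counts the final day as well
    return {
        "days": days,
        "calendar_years": newest.year - oldest.year + 1,
        "approx_years": round(days / 365.25, 2),
    }


# Dates reported by the query output above.
span = describe_date_span(date(2017, 1, 1), date(2020, 12, 30))
print(span)
```

For these dates the sketch reports four calendar years (2017 to 2020) and a span of 1,460 days.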