Skip to content

isis-santos-costa/kaggle-datasets-in-bigquery

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

89 Commits
 
 
 
 

Repository files navigation

pull requests commit activity Data Analyst

Kaggle datasets into BigQuery

☁️ loading and troubleshooting • 2023   Google-BigQuery

This repository shows how to load publicly available datasets from Kaggle into BigQuery.

Tags: dataset, kaggle, bigquery


Steps

📍 Kaggle Dataset used in this repository: « Marketing Funnel by Olist »
Step   1 • Download the dataset from Kaggle
Step   2 • On BigQuery: create a project / select a project
Step   3 • Click [ +ADD ]
Step   4 • Click [ Local file | Upload a local file ]
Step   5 • CREATE NEW DATASET / Choose from loaded datasets
Step   6 • Create dataset | Set a Dataset ID | Set expiration if applicable
Step   7 • CREATE TABLE --> Select file
Step   8 • CREATE TABLE --> Set a Table Name
Step   9 • CREATE TABLE --> Header rows to skip = 1
Step 10 • CREATE TABLE --> Add fields, defining the SCHEMA
🧐 TROUBLESHOOTING
✔️ TABLE CREATED


Troubleshooting items

Troubleshooting 1 • « Could not parse 'field_value' as DATE for field @field_name »
Troubleshooting 2 • « CSV table references column position N+1, but line starting at position:X contains only N columns »


📍 Kaggle Datasets used in this repository: « Marketing Funnel by Olist »

Kaggle Datasets used in this repository (License: CC BY-NC-SA 4.0):
  ↳ « Brazilian E-Commerce Public Dataset by Olist » 100k orders with product, customer and reviews info, and
  ↳ « Marketing Funnel by Olist » 8k leads, closed deals and connection to 100k orders
      ↳ related to « Joining Marketing Funnel with Brazilian E-Commerce » Python


Step 1 • Download the dataset from Kaggle

01---download-from-kaggle


Step 2 • On BigQuery: create a project / select a project

02---create-project---or---select-a-project


Step 3 • Click [ +ADD ]

03---add


Step 4 • Click [ Local file | Upload a local file ]

04---local-file---upload-a-local-file


Step 5 • CREATE NEW DATASET / Choose from loaded datasets

05---create-new-dataset---or---choose-from-loaded-datasets


Step 6 • Create dataset | Set a Dataset ID | Set expiration if applicable

06---create-dataset---set-id---set-expiration-if-applicable


Step 7 • CREATE TABLE --> Select file

07---create-table---select-file


Step 8 • CREATE TABLE --> Set a Table Name

08---create-table---set-a-table-name


Step 9 • CREATE TABLE --> Header rows to skip = 1

09---create-table---header-rows-to-skip-equals-1


Step 10 • CREATE TABLE --> Add fields, defining the SCHEMA

10---add-fields-defining-the-schema

10---a---schema-is-available-on-kaggle-data-card--6MB.mp4
10---b---add-fields+---6MB.mp4

10---CHECK---create-table


🧐 TROUBLESHOOTING

Troubleshooting 1 • « Could not parse 'field_value' as DATE for field @field_name »

Message:
« Could not parse '2018-01-18 0:00:00' as DATE for field review_creation_date (position 5) starting at location 122 with message 'Unable to parse' »

↳ Try changing Field Type from DATE to DATETIME

Troubleshooting 1 • « Could not parse 'field_value' as DATE for field @field_name » (warning)

11---TROUBLESHOOTING---1

11---TROUBLESHOOTING---1---date-to-datetime

Troubleshooting 2 • « CSV table references column position N+1, but line starting at position:X contains only N columns »

Message:
« CSV table references column position 6, but line starting at position:1765 contains only 5 columns »
➥ A possible reason for it is the confusing of rich text commas , with delimiting commas in a comma-separated value (.csv) table.
➥ Some approaches have been tried, none having worked so far ( if you know and would like to share the answer, would appreciate to learn it 💬 )

   ↳ ✅ Attempt 8 (finally: worked!): regex deleting on Google Sheets: paragraph breaks, commas, and double quote marks: \n|,|")
   ↳ ❌ Attempt 1 (didn't work): tab separated values (.tsv) table (set as csv with tab delimiter on BigQuery)
   ↳ ❌ Attempt 2 (didn't work): deleting commas on Google Sheets
   ↳ ❌ Attempt 3 (didn't work): find and replace blank values by 'unknown' (didn't find)
   ↳ ❌ Attempt 4 (didn't work): auxiliary columns to replace blanks by 'unknown': replaced, but didn't solve
   ↳ ❌ Attempt 5 (didn't work): change col type from number to text, where appropriate
   ↳ ❌ Attempt 6 (didn't work): formatting datetime as YYYY-MM-DD HH:MM:SS
   ↳ ❌ Attempt 7 (didn't work): relieving the requirement for a value on the id field

Troubleshooting 2 • « CSV table references column position N+1, but line starting at position:X contains only N columns » (warning)

11---TROUBLESHOOTING---2

✅ Attempt 8 (finally: worked!): regex deleting on Google Sheets: paragraph breaks, commas, and double quote marks: \n|,|")

11---TROUBLESHOOTING---2---w---regex-delete-paragraph-breaks-and-commas-and-double-quotes-on-gsheets

11---TROUBLESHOOTING---2---w---1
11---TROUBLESHOOTING---2---w---2
11---TROUBLESHOOTING---2---w---3

❌ Attempt 1 (didn't work): tab separated values (.tsv) table (set as csv with tab delimiter on BigQuery)

11---TROUBLESHOOTING---2---dw-1-tsv---1-gsheets 11---TROUBLESHOOTING---2---dw-1-tsv---2-bq 11---TROUBLESHOOTING---2---dw-1-tsv---3-bq-field-delimiter-tab 11---TROUBLESHOOTING---2---dw-1-tsv---4-didnt-work

❌ Attempt 2 (didn't work): deleting commas on Google Sheets

11---TROUBLESHOOTING---2---dw-2-no-commas---1-gsheets 11---TROUBLESHOOTING---2---dw-2-no-commas---2-didnt-work

❌ Attempt 3 (didn't work): find and replace blank values by 'unknown' (didn't find)

11---TROUBLESHOOTING---2---dw-3-find-replace---1-gsheets 11---TROUBLESHOOTING---2---dw-3-find-replace---1-gsheets---didnt-find

❌ Attempt 4 (didn't work): auxiliary columns to replace blanks by 'unknown': replaced, but didn't solve

11---TROUBLESHOOTING---2---dw-4-unknown-aux-col---1-gsheets---PASTE-VALUES

❌ Attempt 5 (didn't work): change col type from number to text, where appropriate

11---TROUBLESHOOTING---2---dw-5-col-type-text---1-gsheets 11---TROUBLESHOOTING---2---dw-5-col-type-text---2-gsheets

❌ Attempt 6 (didn't work): formatting datetime as `YYYY-MM-DD HH:MM:SS`

11---TROUBLESHOOTING---2---dw-6-col-type-datetime---1-gsheets 11---TROUBLESHOOTING---2---dw-6-col-type-datetime---2-gsheets

❌ Attempt 7 (didn't work): relieving the requirement for a value on the id field

11---TROUBLESHOOTING---2---dw-7-id-nullable---1-bq


✔️ TABLE CREATED

CHECK---table-created---1---schema

CHECK---table-created---2---preview


SQL:

SELECT
  origin
  , count(origin) 
FROM `ecommerce-olist.olist.olist_marketing_qualified_leads_dataset` 
GROUP BY origin 
ORDER BY 2 desc;

File | On BigQuery

CHECK---table-created---3---query-results