# Introduction to E-Commerce Data Analysis Project
This notebook documents my exploration of an e-commerce dataset as part of a self-guided learning project. My goal is to develop and refine my skills in data analysis, focusing on practical application of various tools and techniques. What follows is a comprehensive record of my process, including the challenges I encounter and the insights I gain. The full repository for this project can be found at https://github.com/michael-patsko/uk-ecommerce-analysis.

## Project Overview
The focus of this analysis is the **E-Commerce Analysis - UK** dataset from **Atharva Arya** on Kaggle, available at https://www.kaggle.com/datasets/atharvaarya25/e-commerce-analysis-uk/data. This dataset is licensed under the [Community Data License Agreement – Sharing, Version 1.0 (CDLA-Sharing-1.0)](https://cdla.dev/sharing-1-0/). More details can be found at the link provided, or in the README of the GitHub repository for this project.

Through this project, I aim to enhance my data analysis capabilities and gain hands-on experience with relevant tools. Specifically, I intend to develop proficiency with Python for data analysis, improve my skills in data cleaning and preprocessing, explore various data visualisation techniques, and refine my abilities with Jupyter Notebooks, PowerBI, and SQL in the context of data analysis.

### Tools
For this analysis, I'm planning to use:

- Python: The primary programming language for data analysis
- Pandas: For data manipulation and analysis
- Matplotlib and Seaborn: For data visualisation
- Jupyter Notebook: The environment for conducting and documenting the analysis
- PowerBI: For creating interactive visualisations and dashboards
- SQL: For database querying and data manipulation

## Analysis
With the preliminaries out of the way, I can begin the analysis.

First, I install Pandas and NumPy:

In [1]:
%%capture
%pip install pandas numpy

Then, I can import them as `pd` and `np`.

In [2]:
import pandas as pd
import numpy as np

When attempting to load the dataset using `pd.read_csv` with default options, I obtained the following **UnicodeDecodeError**:

> `UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 79780: invalid start byte`
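The same error can be reproduced in miniature without the dataset. This is only a sketch: the byte string `raw` is a made-up sample, not taken from `data.csv`, and the variable names are my own.

```python
# Hypothetical sample: "Price: £50" as a single-byte encoding would store it,
# with the pound sign as the lone byte 0xA3.
raw = b"Price: \xa350"

# Decoding as UTF-8 fails, because 0xA3 cannot start a UTF-8 sequence.
try:
    raw.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc

print(utf8_error)

# UTF-8 stores the pound sign as two bytes instead of one.
utf8_pound = "£".encode("utf-8")
print(utf8_pound)

# Decoding the same bytes as ISO-8859-1 recovers the intended text.
latin1_text = raw.decode("iso-8859-1")
print(latin1_text)
```

If the real file behaves like this sample, the failing byte points to a single-byte encoding such as ISO-8859-1 or Windows-1252 rather than UTF-8.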

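Looking up the byte 0xa3, I could see that it represents the pound sign (£) in single-byte encodings such as ISO-8859-1 and Windows-1252. In UTF-8, the pound sign is stored as the two bytes 0xc2 0xa3, so a lone 0xa3 is not a valid start byte. This suggests that the file was saved in one of those single-byte encodings rather than in UTF-8. In this case, I could have tried determining the encoding scheme used, or attempted to use a common encoding scheme like ISO-8859-1. Instead, I opted to use the Python codec `unicode_escape`, which decodes non-ASCII bytes as their Latin-1 characters and so reads the file without raising an error. It also interprets backslash escape sequences such as `\n`, so any text containing literal backslashes could be altered. With that caveat in mind, I load the file as follows: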

In [3]:
df = pd.read_csv('data.csv', encoding='unicode_escape')

This code executes successfully, so the decoding error is resolved, although successful execution does not by itself confirm that every character was decoded as intended.
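To get a feel for how `unicode_escape` differs from ISO-8859-1, here is a minimal sketch that decodes the same bytes with both codecs. The `sample` bytes are invented for illustration and do not come from the dataset, and I'm assuming that `pd.read_csv` applies the codec in the same way as `bytes.decode` does here.

```python
# Hypothetical sample containing literal backslashes and a Latin-1 pound sign.
sample = b"C:\\new\\table costs \xa35"

# unicode_escape maps 0xA3 to "£", but it also treats the backslash followed
# by "n" and "t" as newline and tab escape sequences.
escaped = sample.decode("unicode_escape")

# ISO-8859-1 maps every byte to one character and leaves backslashes alone.
latin1 = sample.decode("iso-8859-1")

print(repr(escaped))
print(repr(latin1))
```

Both codecs recover the pound sign, but only ISO-8859-1 keeps the backslashes intact. If the dataset contains no backslashes, the two should give the same result, which I have not verified here.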