# Introduction to E-Commerce Data Analysis Project
This notebook documents my exploration of an e-commerce dataset as part of a self-guided learning project. My goal is to develop and refine my skills in data analysis, focusing on practical application of various tools and techniques. What follows is a comprehensive record of my process, including the challenges I encounter and the insights I gain. The full repository for this project can be found [here](https://github.com/michael-patsko/uk-ecommerce-analysis).

## Project Overview
The focus of this analysis is the **E-Commerce Analysis - UK** dataset from **Atharva Arya** on Kaggle, found [here](https://www.kaggle.com/datasets/atharvaarya25/e-commerce-analysis-uk/data). This dataset is licensed under the [Community Data License Agreement – Sharing, Version 1.0 (CDLA-Sharing-1.0)](https://cdla.dev/sharing-1-0/). More details can be found at the link provided, or in the README of the GitHub repository for this project.

Through this project, I aim to enhance my data analysis capabilities and gain hands-on experience with relevant tools. Specifically, I plan to:
- Develop proficiency with Python for data analysis
- Improve my skills in data cleaning and preprocessing
- Explore various data visualisation techniques
- Refine my abilities with Jupyter Notebooks, PowerBI, and SQL in the context of data analysis

### Tools
For this analysis, I plan to use:

- Python: The primary programming language for data analysis
- Pandas: For data manipulation and analysis
- Matplotlib and Seaborn: For data visualisation
- Jupyter Notebook: The environment for conducting and documenting the analysis
- PowerBI: For creating interactive visualisations and dashboards
- SQL: For database querying and data manipulation

## Analysis
With the preliminaries out of the way, I can begin the analysis.

First, I install Pandas and NumPy:

In [1]:
%%capture
%pip install pandas numpy

Then, I can import them as `pd` and `np`.

In [2]:
import pandas as pd 
import numpy as np # Import both packages under sensibly chosen aliases

When attempting to load the dataset using `pd.read_csv` with default options, I obtained the following **UnicodeDecodeError**:

> `UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 79780: invalid start byte`

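Looking up the byte 0xa3, I could see that it is the pound sign (£) in single-byte encodings such as ISO-8859-1 (Latin-1) and Windows-1252. In UTF-8 the same character is encoded as the two bytes 0xc2 0xa3, so a lone 0xa3 is invalid, which indicates that the file is not UTF-8 encoded. In this case, I could have tried determining the encoding scheme actually used, or attempted a common single-byte encoding like ISO-8859-1. Instead, I opted to use the Python codec `unicode_escape`, which decodes bytes such as 0xa3 in the same way as Latin-1. One caveat is that it also interprets backslash escape sequences, so any literal backslashes in the data would be altered: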

In [3]:
df = pd.read_csv('data.csv', encoding='unicode_escape') # Read the CSV data into a Pandas DataFrame

The file now loads without raising an error, which suggests that the encoding problem is resolved.
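To illustrate why the default decoding failed, here is a minimal sketch using made-up byte strings (`raw` and `with_backslash` are hypothetical samples, not taken from the dataset). It shows that a lone 0xa3 byte is rejected by UTF-8, that ISO-8859-1 and `unicode_escape` both decode it to £, and that `unicode_escape` additionally rewrites backslash sequences.

```python
# Hypothetical sample bytes containing a Latin-1 encoded pound sign (0xa3).
raw = b"PRICE \xa35"

# UTF-8 rejects a lone 0xa3 byte, which is the error seen with the default options.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Both codecs below map the byte 0xa3 to the pound sign.
latin1_text = raw.decode("ISO-8859-1")
escape_text = raw.decode("unicode_escape")

# Hypothetical bytes holding a literal backslash followed by the letter n.
with_backslash = b"A\\nB"
latin1_bs = with_backslash.decode("ISO-8859-1")      # backslash is kept as-is
escape_bs = with_backslash.decode("unicode_escape")  # backslash + n becomes a newline

print(utf8_ok, latin1_text, escape_text)
print(len(latin1_bs), len(escape_bs))
```

If the data might contain literal backslashes, passing `encoding='ISO-8859-1'` to `pd.read_csv` would likely be the more conservative choice, since it leaves them untouched.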

Now, I can print the first 5 rows of our data to get an idea of what we're working with.

In [4]:
print(df.head()) # Look at the first 5 rows of our data

  InvoiceNo StockCode                          Description  Quantity  \
0    536365    85123A   WHITE HANGING HEART T-LIGHT HOLDER         6   
1    536365     71053                  WHITE METAL LANTERN         6   
2    536365    84406B       CREAM CUPID HEARTS COAT HANGER         8   
3    536365    84029G  KNITTED UNION FLAG HOT WATER BOTTLE         6   
4    536365    84029E       RED WOOLLY HOTTIE WHITE HEART.         6   

      InvoiceDate  UnitPrice  CustomerID         Country  
0  12/1/2010 8:26       2.55     17850.0  United Kingdom  
1  12/1/2010 8:26       3.39     17850.0  United Kingdom  
2  12/1/2010 8:26       2.75     17850.0  United Kingdom  
3  12/1/2010 8:26       3.39     17850.0  United Kingdom  
4  12/1/2010 8:26       3.39     17850.0  United Kingdom  


Doing so, we can see we have fields for **InvoiceNo**, **StockCode**, **Description**, **Quantity**, **InvoiceDate**, **UnitPrice**, **CustomerID**, and **Country**. We can see that multiple entries share the same InvoiceNo, suggesting that a single purchase can contain multiple items, each with its own quantity. We can also use `df.info()` and `df.describe()` to get some additional details about our data.

In [5]:
print(df.info(), df.describe()) # df.info() prints its own summary and returns None, which is why "None" appears in the output

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   InvoiceNo    541909 non-null  object 
 1   StockCode    541909 non-null  object 
 2   Description  540455 non-null  object 
 3   Quantity     541909 non-null  int64  
 4   InvoiceDate  541909 non-null  object 
 5   UnitPrice    541909 non-null  float64
 6   CustomerID   406829 non-null  float64
 7   Country      541909 non-null  object 
dtypes: float64(2), int64(1), object(5)
memory usage: 33.1+ MB
None             Quantity      UnitPrice     CustomerID
count  541909.000000  541909.000000  406829.000000
mean        9.552250       4.611114   15287.690570
std       218.081158      96.759853    1713.600303
min    -80995.000000  -11062.060000   12346.000000
25%         1.000000       1.250000   13953.000000
50%         3.000000       2.080000   15152.000000
75%        10.000000       4.130000

From this, we can see that we have a dataset with **541909** entries. It is structured with 8 columns: two of type `float64`, one of type `int64`, and five of type `object`, which Pandas uses for strings and mixed data. Comparing the non-null counts with the total, 1,454 entries have no value for Description, and 135,080 entries (about 25%) have no value for CustomerID. We will investigate this in more detail later.
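The missing-value counts can also be computed directly rather than read off the `df.info()` output. The sketch below uses a small made-up frame (`toy` is hypothetical and only mimics the two columns with gaps); the same `isna()` calls should work on the full `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame that mimics the two columns with missing values.
toy = pd.DataFrame({
    "Description": ["WHITE METAL LANTERN", None,
                    "CREAM CUPID HEARTS COAT HANGER", "RED WOOLLY HOTTIE WHITE HEART."],
    "CustomerID": [17850.0, np.nan, np.nan, 17850.0],
})

missing_counts = toy.isna().sum()   # number of missing values per column
missing_share = toy.isna().mean()   # fraction of missing values per column

print(missing_counts)
print(missing_share)
```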

We know from the data card on Kaggle that this dataset contains erroneous duplicates, and it is our job to remove them. We can first identify duplicates using `df.duplicated()`. This returns a [Pandas Series](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html) of booleans, one per row, telling us whether that row is an exact duplicate of an earlier row in our original data frame. By default, the first occurrence of each row is not flagged.

In [6]:
print(df.duplicated().sum()) # Return the number of duplicate entries
print(df.duplicated().mean()) # Return the fraction of entries that are duplicates

5268
0.009721189350979592


From this, we see we have 5268 duplicate entries, making up about 0.97% of our dataset, or just under 1%.
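To make the behaviour of `duplicated` concrete, here is a minimal sketch on a made-up frame (`toy` and its values are hypothetical). It compares the default `keep="first"` with `keep=False`, and shows `drop_duplicates` as one possible way to remove the flagged rows.

```python
import pandas as pd

# Hypothetical frame: rows 0 and 1 are identical, rows 2 and 3 are unique.
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "536365", "536365", "536366"],
    "StockCode": ["71053", "71053", "84406B", "71053"],
    "Quantity": [6, 6, 8, 6],
})

first_kept = toy.duplicated()             # default keep="first": only later copies are flagged
all_flagged = toy.duplicated(keep=False)  # every member of a duplicate group is flagged
deduped = toy.drop_duplicates()           # keeps the first occurrence of each row

print(first_kept.tolist())
print(all_flagged.tolist())
print(len(deduped))
```

With the default setting, the count reported by `duplicated().sum()` is the number of rows that `drop_duplicates()` would remove.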