# Data Analytics Project

First, the necessary modules are imported.

In [24]:
import pandas as pd 
import numpy as np 
import seaborn as sns 
from matplotlib import pyplot as plt 
import warnings 
warnings.simplefilter('ignore', category=UserWarning)

%matplotlib inline

Second, a DataFrame is created using the ecommerce data.

In [25]:
# The file containing the dataset uses a different encoding than the default 'utf-8', so this is specified
df = pd.read_csv('ecommerce.csv', encoding='cp1252') 
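
When the encoding of a file is not known in advance, one option is to try a few candidates in order. The sketch below illustrates the idea on an in-memory sample; the helper name `read_with_fallback`, the candidate list, and the sample row are assumptions made for illustration and are not part of the project.

```python
import io
import pandas as pd

# Hypothetical one-row sample. The pound sign is the single byte 0xA3 in cp1252,
# which is not valid UTF-8 on its own.
raw = "Description,UnitPrice\nGIFT VOUCHER £10,10.0\n".encode("cp1252")

def read_with_fallback(data, encodings=("utf-8", "cp1252")):
    """Try each encoding in turn and return (DataFrame, encoding that worked)."""
    for enc in encodings:
        try:
            return pd.read_csv(io.BytesIO(data), encoding=enc), enc
        except UnicodeDecodeError:
            continue  # this encoding cannot decode the bytes; try the next one
    raise ValueError("none of the candidate encodings could decode the data")

sample, used = read_with_fallback(raw)
print(used)
```

A fallback like this only shows that the bytes decode without error, not that the resulting characters are the intended ones, so the decoded text is still worth checking by eye.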

## Exploratory Data Analysis

A few basic pandas functions give an initial understanding of the data. The head() function displays the first five rows.

In [56]:
df.head() # returns the first five rows of the dataframe

|   | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 12/1/2010 8:26 | 2.55 | 17850.0 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 12/1/2010 8:26 | 2.75 | 17850.0 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |


At first glance, the data looks correct. On closer inspection, however, the 'CustomerID' column is displayed as a float (for example, 17850.0). Customer IDs are identifiers rather than measurements, so a float type is probably unintended, and it is worth checking that every variable has the correct type. The info() function is useful here.

In [57]:
df.info() # prints each column (variable) with its non-null count and dtype (data type)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   InvoiceNo    541909 non-null  object 
 1   StockCode    541909 non-null  object 
 2   Description  540455 non-null  object 
 3   Quantity     541909 non-null  int64  
 4   InvoiceDate  541909 non-null  object 
 5   UnitPrice    541909 non-null  float64
 6   CustomerID   406829 non-null  float64
 7   Country      541909 non-null  object 
dtypes: float64(2), int64(1), object(5)
memory usage: 33.1+ MB


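As suspected, the 'CustomerID' variable has a dtype of 'float64'. The likely cause is the missing values: only 406,829 of the 541,909 rows have a non-null 'CustomerID', and pandas falls back to 'float64' for a numeric column that contains NaN. To preserve the null values rather than fill them with a placeholder ID, which would attribute those sales to a pseudocustomer, the column is converted to the nullable 'Int64' dtype (the standard 'int64' dtype does not accept null values).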

In [61]:
df['CustomerID'] = df['CustomerID'].astype('Int64')
df.dtypes

InvoiceNo       object
StockCode       object
Description     object
Quantity         int64
InvoiceDate     object
UnitPrice      float64
CustomerID       Int64
Country         object
dtype: object
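
The difference between the two integer dtypes can be seen on a small example. This is a minimal sketch on a hypothetical three-value Series, not on the project's DataFrame; the variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample mirroring the CustomerID column: floats with one missing value.
ids = pd.Series([17850.0, np.nan, 13047.0], name="CustomerID")

# NumPy's int64 has no representation for a missing value, so this cast fails.
try:
    ids.astype("int64")
    plain_cast_ok = True
except ValueError:
    plain_cast_ok = False

# The nullable extension dtype 'Int64' stores the gap as pd.NA instead.
nullable_ids = ids.astype("Int64")
print(plain_cast_ok, nullable_ids.dtype, nullable_ids.isna().sum())
```

The missing entry survives the conversion, so later steps can still distinguish identified customers from anonymous ones.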

The 'InvoiceNo' variable also needs attention. Its dtype is 'object', even though most values look like integers. Attempts to convert the column to 'int64' raise errors because some values contain alphabetic characters, such as 'C558901'. This may be an error in the dataset or a deliberate prefix, and requires further investigation.
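
One way to start that investigation is to isolate the values that are not purely numeric. The sketch below runs on a hypothetical four-value sample (only 'C558901' is taken from the dataset's output); on the real data, the same steps would apply to `df['InvoiceNo']`.

```python
import pandas as pd

# Hypothetical sample of invoice numbers, including one with a letter prefix.
invoices = pd.Series(["536365", "573585", "C558901", "581219"], name="InvoiceNo")

# Cast to str first, in case the column mixes strings with parsed integers.
as_text = invoices.astype(str)

# str.isdigit() is True only when every character in the value is a digit.
non_numeric = as_text[~as_text.str.isdigit()]
print(non_numeric.tolist())

# Tally the offending values by first character to see which prefixes occur.
print(non_numeric.str[0].value_counts().to_dict())
```

If a single prefix accounts for all non-numeric values, that points towards a deliberate convention rather than scattered data-entry errors.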

### Categorical Variables

For each categorical variable, describe() reports the non-null count, the number of unique values, the most frequent value (top), and how often that value occurs (freq):

In [40]:
df.describe(include='object')

|   | InvoiceNo | StockCode | Description | InvoiceDate | Country |
|---|---|---|---|---|---|
| count | 541909 | 541909 | 540455 | 541909 | 541909 |
| unique | 25900 | 4070 | 4223 | 23260 | 38 |
| top | 573585 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 10/31/2011 14:41 | United Kingdom |
| freq | 1114 | 2313 | 2369 | 1114 | 495478 |
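
If only the unique counts are of interest, they can be computed directly. This is a minimal sketch on a hypothetical three-row DataFrame; it selects the non-numeric columns with `select_dtypes(exclude="number")`, which is an alternative to the `include='object'` filter used above.

```python
import pandas as pd

# Hypothetical sample with two text columns and one numeric column.
sample = pd.DataFrame({
    "StockCode": ["85123A", "71053", "85123A"],
    "Country": ["United Kingdom", "United Kingdom", "France"],
    "Quantity": [6, 6, 8],
})

# Drop the numeric columns, then count the distinct values in each remaining column.
unique_counts = sample.select_dtypes(exclude="number").nunique()
print(unique_counts.to_dict())
```

Note that nunique() ignores missing values by default, so a column with nulls is counted on its non-null entries only.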


In [43]:
df['InvoiceNo'].value_counts()

573585     1114
581219      749
581492      731
580729      721
558475      705
           ... 
554023        1
554022        1
554021        1
554020        1
C558901       1
Name: InvoiceNo, Length: 25900, dtype: int64