# Exploratory Data Analysis (EDA) for Online Retail Store Data

In [15]:
import pandas as pd 
import numpy as np
import datetime as dt

In [20]:
data_url = "https://raw.githubusercontent.com/nyangweso-rodgers/Data_Analytics/main/Datasets/online-retail.csv"
original_df = pd.read_csv(data_url, encoding= 'unicode_escape', parse_dates=['InvoiceDate'])
original_df.shape

(541909, 8)

### DataFrame Preview

In [21]:
original_df.head()

|   | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850.0 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |


### Check For Missing Values
* We check for missing values with the `isnull()` method (an alias of `isna()`), which returns a DataFrame of boolean values indicating whether each field is null.
* Calling `sum()` on that result counts the missing values in each column.

In [22]:
# sum missing values in each column
original_df.isnull().sum()

InvoiceNo           0
StockCode           0
Description      1454
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     135080
Country             0
dtype: int64

### We can see the following:
1. Only the `Description` (1,454 rows) and `CustomerID` (135,080 rows) fields contain `null` values.

### Proportion of Missing Values
* We can also check the proportion (%) of missing values in each column:

In [23]:
original_df.isnull().sum() * 100/original_df.shape[0]

InvoiceNo       0.000000
StockCode       0.000000
Description     0.268311
Quantity        0.000000
InvoiceDate     0.000000
UnitPrice       0.000000
CustomerID     24.926694
Country         0.000000
dtype: float64
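
The count and the percentage can also be produced in one table. The following is a minimal sketch, assuming a hypothetical helper named `missing_summary` and a small toy frame standing in for `original_df`; it relies on the fact that the mean of a boolean column equals the fraction of `True` values.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    is_missing = df.isna()
    counts = is_missing.sum()
    # The mean of a boolean column is the fraction of True values.
    percent = is_missing.mean() * 100
    return pd.DataFrame({"missing": counts, "percent": percent})


# Toy frame standing in for original_df (values are illustrative only).
toy_df = pd.DataFrame({
    "Description": ["LANTERN", None, "HOLDER", "HANGER"],
    "CustomerID": [17850.0, None, None, 13047.0],
    "Quantity": [6, 2, 8, 1],
})

summary = missing_summary(toy_df)
print(summary)
```

On the real data, `missing_summary(original_df)` should reproduce the two outputs above side by side.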

### Extract Rows with `null` values

In [24]:
# keep every row that has a null in at least one column
null_values_df = original_df[original_df.isna().any(axis=1)]
# save these rows to a csv file
null_values_df.to_csv("null_values_df.csv", index=False)

# check the shape of the null values df
null_values_df.shape

(135080, 8)

### Drop Rows with `null` values

In [25]:
# drop every row that contains at least one null value
non_null_values_df = original_df.dropna()
# save the cleaned dataframe to a csv file
non_null_values_df.to_csv("non_null_values_df.csv", index=False)
# confirm the shape of the new dataframe
non_null_values_df.shape

(406829, 8)
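
Called without arguments, `dropna()` removes a row when any column is null. In this dataset the number of rows with any null (135,080) equals the number of rows with a null `CustomerID`, so every row missing a `Description` is also missing a `CustomerID`. Where that is not guaranteed, the `subset` parameter limits the check to chosen columns. A small sketch on toy data (the frame and variable names are illustrative assumptions):

```python
import pandas as pd

# Toy frame: one row lacks a Description, another lacks a CustomerID.
toy_df = pd.DataFrame({
    "Description": ["LANTERN", None, "HOLDER"],
    "CustomerID": [17850.0, 13047.0, None],
    "Quantity": [6, 2, 8],
})

# dropna() with no arguments removes a row if ANY column is null.
all_cols = toy_df.dropna()

# subset= limits the check to the named columns, so the row with a
# missing Description but a known CustomerID is kept.
customers_only = toy_df.dropna(subset=["CustomerID"])

print(len(all_cols), len(customers_only))
```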

## Data Filtering
1. Check for zero and negative values in the `UnitPrice` and `Quantity` columns.
2. If such records exist, we might treat them as items sold on credit.

In [26]:
# count rows with negative Quantity values
print("Negative Qty rows:", len(non_null_values_df[non_null_values_df['Quantity'] < 0]))

Negative Qty rows: 8905


In [27]:
# generate a new df, v3, keeping only rows with a positive Quantity
v3_df = non_null_values_df[non_null_values_df['Quantity'] > 0]
# save the records with positive Quantity values to a new file
v3_df.to_csv("v3_df.csv", index = False)

# check the new shape of the dataframe
v3_df.shape

(397924, 8)
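
The cells above filter on `Quantity` only, while the first item in the list also names `UnitPrice`; the descriptive statistics later in this notebook report a minimum `UnitPrice` of 0.0. A hedged sketch of checking both columns at once, using a toy frame in place of `non_null_values_df` (the values and the name `valid_df` are illustrative assumptions):

```python
import pandas as pd

# Toy frame standing in for non_null_values_df.
toy_df = pd.DataFrame({
    "InvoiceNo": ["536365", "536379", "536400", "536401"],
    "Quantity": [6, -1, 3, 2],
    "UnitPrice": [2.55, 27.50, 0.0, 1.25],
})

# Count zero or negative values in each column of interest.
non_positive = (toy_df[["Quantity", "UnitPrice"]] <= 0).sum()
print(non_positive)

# Keep only rows where both Quantity and UnitPrice are strictly positive.
valid_df = toy_df[(toy_df["Quantity"] > 0) & (toy_df["UnitPrice"] > 0)].copy()
print(valid_df.shape)
```

Whether zero-priced rows should be dropped depends on what they represent in the business, so this filter is a choice to make deliberately rather than a required step.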

## Data Engineering - Generate Additional Fields/Features from Data
* From our data preview, we can generate the following fields:
    - `TotalAmount` = `Quantity` x `UnitPrice`
    - `Date` - extracted from the `InvoiceDate` field
    - `MonthYear` - extracted from the `InvoiceDate` field

In [28]:
# work on an explicit copy so the assignments below do not raise SettingWithCopyWarning
v3_df = v3_df.copy()

# get a date column from InvoiceDate
v3_df['Date'] = v3_df["InvoiceDate"].dt.date

# get MonthYear column from InvoiceDate
v3_df['MonthYear'] = v3_df['InvoiceDate'].dt.to_period('M')

# get TotalAmount from Quantity and UnitPrice
v3_df['TotalAmount'] = v3_df['Quantity'] * v3_df['UnitPrice']


In [29]:
v3_df.head()

|   | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | Date | MonthYear | TotalAmount |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom | 2010-12-01 | 2010-12 | 15.3 |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 2010-12-01 | 2010-12 | 20.34 |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850.0 | United Kingdom | 2010-12-01 | 2010-12 | 22.0 |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 2010-12-01 | 2010-12 | 20.34 |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 2010-12-01 | 2010-12 | 20.34 |
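
Assigning new columns to a DataFrame that was produced by filtering another one can trigger pandas' `SettingWithCopyWarning`. One common way to avoid it is to take an explicit `.copy()` after filtering and build the derived columns with `assign()`. The sketch below uses a toy frame; the names `positive_df` and `features_df` are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy frame with the three columns the derived features depend on.
toy_df = pd.DataFrame({
    "Quantity": [6, -2, 8],
    "UnitPrice": [2.55, 3.39, 2.75],
    "InvoiceDate": pd.to_datetime(
        ["2010-12-01 08:26:00", "2010-12-01 09:00:00", "2011-01-04 10:00:00"]
    ),
})

# .copy() makes the filtered frame independent of toy_df, so later
# column assignments are unambiguous.
positive_df = toy_df[toy_df["Quantity"] > 0].copy()

# assign() returns a new DataFrame with the derived columns added.
features_df = positive_df.assign(
    Date=positive_df["InvoiceDate"].dt.date,
    MonthYear=positive_df["InvoiceDate"].dt.to_period("M"),
    TotalAmount=positive_df["Quantity"] * positive_df["UnitPrice"],
)
print(features_df)
```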


### Validate Column Data Types

In [30]:
# basic information about the new data
v3_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 397924 entries, 0 to 541908
Data columns (total 11 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   InvoiceNo    397924 non-null  object        
 1   StockCode    397924 non-null  object        
 2   Description  397924 non-null  object        
 3   Quantity     397924 non-null  int64         
 4   InvoiceDate  397924 non-null  datetime64[ns]
 5   UnitPrice    397924 non-null  float64       
 6   CustomerID   397924 non-null  float64       
 7   Country      397924 non-null  object        
 8   Date         397924 non-null  object        
 9   MonthYear    397924 non-null  period[M]     
 10  TotalAmount  397924 non-null  float64       
dtypes: datetime64[ns](1), float64(3), int64(1), object(5), period[M](1)
memory usage: 36.4+ MB


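From the above, `CustomerID` is stored as `float64` although it is an identifier rather than a quantity, so we convert it to a string (`object` dtype). `InvoiceNo` is cast to `str` in the same step so that both identifier columns are consistently strings.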

In [31]:
convert_dict = {
    'CustomerID': str,
    'InvoiceNo': str
}
v3_df = v3_df.astype(convert_dict)
v3_df.dtypes

InvoiceNo              object
StockCode              object
Description            object
Quantity                int64
InvoiceDate    datetime64[ns]
UnitPrice             float64
CustomerID             object
Country                object
Date                   object
MonthYear           period[M]
TotalAmount           float64
dtype: object
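
Casting a `float64` column straight to `str` keeps the float formatting, so an ID such as `17850.0` becomes the string `"17850.0"`. If clean integer-like identifiers are preferred, one option is to cast to `int64` first. This is a sketch on a toy Series, and it assumes the column contains no nulls, which holds here only because null rows were dropped earlier.

```python
import pandas as pd

# Toy Series standing in for v3_df["CustomerID"].
customer_ids = pd.Series([17850.0, 13047.0, 12583.0], name="CustomerID")

# Direct cast keeps the float formatting, including the trailing ".0".
direct = customer_ids.astype(str)

# Casting to int64 first yields clean identifiers. This fails if the
# column still contains NaN, so drop or fill nulls beforehand.
clean = customer_ids.astype("int64").astype(str)

print(direct.tolist())
print(clean.tolist())
```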

### Descriptive Statistics

In [32]:
# datetime_is_numeric=True summarizes datetime columns numerically (mean, min, max) rather than as categorical; the argument was removed in pandas 2.0, where this is the default
v3_df.describe(include='all', datetime_is_numeric=True)

|   | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | Date | MonthYear | TotalAmount |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 397924.0 | 397924 | 397924 | 397924.0 | 397924 | 397924.0 | 397924.0 | 397924 | 397924 | 397924 | 397924.0 |
| unique | 18536.0 | 3665 | 3877 |  |  |  | 4339.0 | 37 | 305 | 13 |  |
| top | 576339.0 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER |  |  |  | 17841.0 | United Kingdom | 2011-11-06 | 2011-11 |  |
| freq | 542.0 | 2035 | 2028 |  |  |  | 7847.0 | 354345 | 3423 | 64545 |  |
| mean |  |  |  | 13.021823 | 2011-07-10 23:43:36.912475648 | 3.116174 |  |  |  |  | 22.394749 |
| min |  |  |  | 1.0 | 2010-12-01 08:26:00 | 0.0 |  |  |  |  | 0.0 |
| 25% |  |  |  | 2.0 | 2011-04-07 11:12:00 | 1.25 |  |  |  |  | 4.68 |
| 50% |  |  |  | 6.0 | 2011-07-31 14:39:00 | 1.95 |  |  |  |  | 11.8 |
| 75% |  |  |  | 12.0 | 2011-10-20 14:33:00 | 3.75 |  |  |  |  | 19.8 |
| max |  |  |  | 80995.0 | 2011-12-09 12:50:00 | 8142.75 |  |  |  |  | 168469.6 |
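
The `datetime_is_numeric` argument is not accepted by every pandas version, so the call above can fail with a `TypeError` on a newer installation. A minimal sketch of a wrapper that tries the argument and falls back to a plain call; `describe_all` is a hypothetical helper name and the toy frame is illustrative.

```python
import pandas as pd


def describe_all(df: pd.DataFrame) -> pd.DataFrame:
    """Run describe(include='all'), passing datetime_is_numeric only if accepted."""
    try:
        return df.describe(include="all", datetime_is_numeric=True)
    except TypeError:
        # The installed pandas does not accept the argument (for example
        # pandas 2.x, which already summarizes datetimes numerically).
        return df.describe(include="all")


# Toy frame with one column of each kind: object, numeric, datetime.
toy_df = pd.DataFrame({
    "Country": ["United Kingdom", "United Kingdom", "France"],
    "Quantity": [6, 8, 2],
    "InvoiceDate": pd.to_datetime(["2010-12-01", "2010-12-02", "2010-12-03"]),
})

stats = describe_all(toy_df)
print(stats)
```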


### Preliminary Insights
1. Transaction dates range from `2010-12-01` to `2011-12-09`.
2. There are 397,924 item-level transactions (invoice lines).
3. 4,339 unique customers were invoiced across a total of 18,536 invoices.
4. Transactions span 37 distinct countries, with the United Kingdom accounting for 354,345 of the rows.
5. The mean line-item value (`TotalAmount`) is about 22.39, in the same currency units as `UnitPrice`.
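
These figures can also be computed directly instead of being read off the `describe()` table. The sketch below uses a toy frame with the same column names; the `insights` dictionary and its keys are illustrative assumptions. It also shows the difference between the mean per line item, which is what `describe()` reports, and the mean per invoice.

```python
import pandas as pd

# Toy frame with the columns used by the insights above.
toy_df = pd.DataFrame({
    "InvoiceNo": ["536365", "536365", "536366", "536367"],
    "CustomerID": ["17850", "17850", "17850", "13047"],
    "Country": ["United Kingdom", "United Kingdom", "United Kingdom", "France"],
    "InvoiceDate": pd.to_datetime(
        ["2010-12-01 08:26", "2010-12-01 08:26", "2010-12-01 08:28", "2010-12-02 09:00"]
    ),
    "TotalAmount": [15.0, 20.0, 10.0, 35.0],
})

insights = {
    "first_date": toy_df["InvoiceDate"].min().date(),
    "last_date": toy_df["InvoiceDate"].max().date(),
    "line_items": len(toy_df),
    "customers": toy_df["CustomerID"].nunique(),
    "invoices": toy_df["InvoiceNo"].nunique(),
    "countries": toy_df["Country"].nunique(),
    # Mean over individual rows (line items).
    "mean_line_amount": toy_df["TotalAmount"].mean(),
    # Mean over invoices: sum the line items of each invoice first.
    "mean_invoice_amount": toy_df.groupby("InvoiceNo")["TotalAmount"].sum().mean(),
}

for name, value in insights.items():
    print(f"{name}: {value}")
```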

## Next Steps
1. Save the `v3_df` DataFrame to a CSV file for later use in the following steps:
   1. [Data Visualization]()
   2. [Customer Retention Analysis]()
   3. [Customer Segmentation Analysis]()

In [33]:
# save the resultant data frame to a csv file
v3_df.to_csv("v3_df.csv", index = False)
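
A CSV file stores no dtype information, so the string identifiers and the parsed dates have to be restated when the file is read back in a later notebook. The sketch below demonstrates the round trip with an in-memory buffer instead of a file on disk; the toy frame and the name `restored` are illustrative assumptions.

```python
import io

import pandas as pd

# Toy frame standing in for v3_df.
toy_df = pd.DataFrame({
    "InvoiceNo": ["536365", "536366"],
    "CustomerID": ["17850", "13047"],
    "InvoiceDate": pd.to_datetime(["2010-12-01 08:26:00", "2010-12-01 08:28:00"]),
    "TotalAmount": [15.3, 22.0],
})

# Write to an in-memory buffer rather than a file on disk.
buffer = io.StringIO()
toy_df.to_csv(buffer, index=False)
buffer.seek(0)

# Restate the identifier dtypes and the date column when reading back;
# otherwise the IDs would be inferred as integers and the dates as text.
restored = pd.read_csv(
    buffer,
    dtype={"InvoiceNo": str, "CustomerID": str},
    parse_dates=["InvoiceDate"],
)
print(restored.dtypes)
```

For the real file, the same `dtype` and `parse_dates` arguments would be passed to `pd.read_csv("v3_df.csv", ...)`.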