# Exploratory Data Analysis (EDA) for Online Retail Store Data

In [1]:
import pandas as pd 
import numpy as np
import datetime as dt

In [2]:
data_url = "https://raw.githubusercontent.com/nyangweso-rodgers/Data_Analytics/main/Datasets/online-retail.csv"
original_df = pd.read_csv(data_url, encoding= 'unicode_escape', parse_dates=['InvoiceDate'])
original_df.shape

(541909, 8)

### DataFrame Preview

In [3]:
original_df.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom


### Check for Summary Information
#### Check for `null` Values

In [47]:
original_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   InvoiceNo    541909 non-null  object        
 1   StockCode    541909 non-null  object        
 2   Description  540455 non-null  object        
 3   Quantity     541909 non-null  int64         
 4   InvoiceDate  541909 non-null  datetime64[ns]
 5   UnitPrice    541909 non-null  float64       
 6   CustomerID   406829 non-null  float64       
 7   Country      541909 non-null  object        
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 33.1+ MB


* From the `original_df.info()` output above:
   1. The `CustomerID` and `Description` fields contain null values.
   2. We can check for missing values with the `isna()` method, which returns a dataframe of boolean values indicating whether each value is null.
   3. Chaining the `sum()` method counts the missing values in each column.
   4. We can also compute the proportion (%) of missing values per column.

* We can create a function, `process_df(df)`, that processes the dataframe and returns the `count` and `proportion` of missing values per column.

In [41]:
# function to get the count and proportion of missing values
def process_df(df):
    # count of null values in the dataframe
    null_features = df.isnull().sum()
    null_features_df = null_features.to_frame().reset_index().rename(columns = {'index': 'Feature', 0: 'NullValues'})
    
    # percent of null values in the dataframe
    percent_null_features_array = (df.isnull().sum() * 100) / df.shape[0]
    percent_null_features_array = percent_null_features_array.round(1)
    # convert the above series to DataFrame
    percent_null_features_df = percent_null_features_array.to_frame().reset_index().rename(columns = {'index': 'Feature', 0: 'PercentNull'})
    
    # merge the two dfs
    null_features_df = pd.merge(null_features_df, percent_null_features_df, on='Feature', how='left' )
    
    return null_features_df

process_df(original_df)

Unnamed: 0,Feature,NullValues,PercentNull
0,InvoiceNo,0,0.0
1,StockCode,0,0.0
2,Description,1454,0.3
3,Quantity,0,0.0
4,InvoiceDate,0,0.0
5,UnitPrice,0,0.0
6,CustomerID,135080,24.9
7,Country,0,0.0
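
The same summary can be sketched more compactly, because the mean of a boolean mask is the fraction of `True` values. This is only an illustrative alternative under stated assumptions: `null_summary` and `toy_df` are hypothetical names, and the toy frame stands in for the retail data.

```python
import numpy as np
import pandas as pd


def null_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column null counts and percentages (hypothetical helper)."""
    mask = df.isna()  # boolean frame: True where a value is missing
    summary = pd.DataFrame({
        "NullValues": mask.sum(),                      # missing values per column
        "PercentNull": (mask.mean() * 100).round(1),   # share of missing values per column
    })
    # move the column names out of the index into a 'Feature' column
    return summary.rename_axis("Feature").reset_index()


# toy data standing in for the retail dataset
toy_df = pd.DataFrame({
    "Description": ["LANTERN", None, "HOLDER", "HANGER"],
    "CustomerID": [17850.0, np.nan, np.nan, 13047.0],
})
summary = null_summary(toy_df)
print(summary)
```

Building both columns from one mask avoids the separate merge step, at the cost of being slightly less explicit about each intermediate result.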


##### Remarks:
1. We can extract the rows that contain `null` values and save them to a new file, `null_values_df.csv`.
2. Additionally, we can create another dataset without `null` values and save it as `non_null_values_df.csv`.

In [50]:
def handling_missing_values(df):
    # null values df
    null_values_df = df[df.isna().any(axis=1)]
    # save the results to a file
    null_values_df.to_csv("null_values_df.csv", index = False)
    
    # drop null values from the dataframe
    non_null_values_df = df.dropna()
    # save the rows without null values to a new file
    non_null_values_df.to_csv("non_null_values_df.csv", index = False)
    
    
    print("Count Of Null Values: ", null_values_df.shape) 
    print("Count Of Non-Null Values: ", non_null_values_df.shape)
    
    # return clean dataframe
    return non_null_values_df

non_null_values_df = handling_missing_values(original_df)

Count Of Null Values:  (135080, 8)
Count Of Non-Null Values:  (406829, 8)
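
The two shapes above add up to the original row count (135,080 + 406,829 = 541,909), which is what we expect when a single row-level mask splits the data. A minimal sketch of that split, using a hypothetical toy frame (`toy_df`) rather than the retail data:

```python
import numpy as np
import pandas as pd

# toy data standing in for the retail dataset
toy_df = pd.DataFrame({
    "InvoiceNo": ["536365", "536366", "536367", "536368"],
    "Description": ["LANTERN", None, "HOLDER", "HANGER"],
    "CustomerID": [17850.0, np.nan, np.nan, 13047.0],
})

# True for every row that has at least one missing value
has_null = toy_df.isna().any(axis=1)

rows_with_nulls = toy_df[has_null]   # same rows as df[df.isna().any(axis=1)]
complete_rows = toy_df[~has_null]    # same rows as df.dropna()

# the two partitions must cover the original frame exactly once
assert len(rows_with_nulls) + len(complete_rows) == len(toy_df)
print(len(rows_with_nulls), len(complete_rows))
```

Reusing one mask for both halves makes the partition check trivial and keeps the two outputs consistent with each other.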


### Data Filtering
Here, we:
1. Check for `zero` and `negative` values within the `UnitPrice` and `Quantity` columns.
2. If such records exist, we can either treat them as items sold on credit, or
3. Filter them out of the data.

In [55]:
# define another function that takes the non-null values df and checks for negative Quantity values
def filter_df(non_null_values_df):
    # count Rows with negative Quantity values
    print("Negative Qty values count:", non_null_values_df[non_null_values_df['Quantity'] < 0].shape)
    
    # generate a new df, v3, keeping only rows with a positive Quantity (drops negative and zero values)
    v3_df = non_null_values_df[non_null_values_df['Quantity'] > 0]
    # save the records with positive Quantity values to a new file
    v3_df.to_csv("v3_df.csv", index = False)
    print("Non-negative DataFrame Shape: " ,v3_df.shape)
    
    return v3_df
    
    
v3_df = filter_df(non_null_values_df)

Negative Qty values count: (8905, 8)
Non-negative DataFrame Shape:  (397924, 8)
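
The filter above inspects only `Quantity`, while the descriptive statistics further down report a minimum `UnitPrice` of 0.0, so zero-priced rows remain in `v3_df`. A hedged sketch of how both columns could be checked together; `toy_df`, `checks` and `positive_df` are hypothetical names, and whether zero-priced rows should be dropped is an analysis decision this notebook does not make.

```python
import pandas as pd

# toy data standing in for the retail dataset
toy_df = pd.DataFrame({
    "Quantity": [6, -1, 8, 0, 3],
    "UnitPrice": [2.55, 3.39, 0.0, 1.25, -4.0],
})

numeric_cols = ["Quantity", "UnitPrice"]

# count negative and zero values per column
checks = pd.DataFrame({
    "negative": (toy_df[numeric_cols] < 0).sum(),
    "zero": (toy_df[numeric_cols] == 0).sum(),
})
print(checks)

# one possible filter: keep rows where both columns are strictly positive
positive_df = toy_df[(toy_df["Quantity"] > 0) & (toy_df["UnitPrice"] > 0)]
print(positive_df.shape)
```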


## Feature Engineering - Generate Additional Fields/Features from Data
* From our data preview, we can generate the following fields:
    - `TotalAmount` = `Quantity` x `UnitPrice`
    - `Date` - extracted from the `InvoiceDate` field
    - `MonthYear` - extracted from the `InvoiceDate` field
    - `DayOfWeek` - extracted from the `InvoiceDate` field

* Additionally, we convert `CustomerID` from `float64` to `string`; `InvoiceNo` is cast to `string` as well.

In [72]:
def feature_engineering(v3_df):
    copy_v3_df = v3_df.copy()
    # get a date column from InvoiceDate
    copy_v3_df['Date'] = copy_v3_df["InvoiceDate"].dt.date
    
    # get Day of Week 
    copy_v3_df['DayOfWeek'] = copy_v3_df['InvoiceDate'].dt.day_name()
    
    # get MonthYear column from InvoiceDate
    copy_v3_df['MonthYear'] = copy_v3_df['InvoiceDate'].dt.to_period('M')
    
    # get TotalAmount from Quantity and Unit Price
    copy_v3_df['TotalAmount'] = copy_v3_df['Quantity'] * copy_v3_df['UnitPrice']
    
    # convert to appropriate data types
    convert_dict = {
        'CustomerID': str,
        'InvoiceNo': str
    }
    copy_v3_df = copy_v3_df.astype(convert_dict)
    
    # save the resultant data frame to a csv file
    copy_v3_df.to_csv("v3_df.csv", index = False)
    
    #print("Data Shape:", copy_v3_df.shape)
    print(copy_v3_df.info())
    
    return copy_v3_df

v3_df = feature_engineering(v3_df)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 397924 entries, 0 to 541908
Data columns (total 12 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   InvoiceNo    397924 non-null  object        
 1   StockCode    397924 non-null  object        
 2   Description  397924 non-null  object        
 3   Quantity     397924 non-null  int64         
 4   InvoiceDate  397924 non-null  datetime64[ns]
 5   UnitPrice    397924 non-null  float64       
 6   CustomerID   397924 non-null  object        
 7   Country      397924 non-null  object        
 8   Date         397924 non-null  object        
 9   MonthYear    397924 non-null  period[M]     
 10  TotalAmount  397924 non-null  float64       
 11  DayOfWeek    397924 non-null  object        
dtypes: datetime64[ns](1), float64(2), int64(1), object(7), period[M](1)
memory usage: 39.5+ MB
None
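
One detail worth checking: casting a `float64` column straight to `str` keeps the trailing `.0` (for example `'17850.0'`). If plain integer-looking IDs are preferred, an intermediate integer cast is one option. This is a sketch under the assumption that null IDs have already been dropped, as they were above; the series below is toy data.

```python
import pandas as pd

# toy CustomerID values as they appear after read_csv (float64)
customer_ids = pd.Series([17850.0, 13047.0, 12583.0], name="CustomerID")

# direct cast: the float formatting survives in the string
as_plain_str = customer_ids.astype(str)

# cast to integer first (safe only when there are no nulls), then to string
as_clean_str = customer_ids.astype("int64").astype(str)

print(as_plain_str.tolist())
print(as_clean_str.tolist())
```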


### Descriptive Statistics
* We proceed to get summary statistics and insights from the above output.

In [70]:
# descriptive statistics function
def descriptive_statistic(v3_df):
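    # datetime_is_numeric=True treats datetime columns as numeric (mean, min, quantiles, max) rather than categorical
    # note: this argument exists in pandas 1.1 to 1.x only; it was removed in pandas 2.0, where this behaviour is the default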
    descriptive_statistic_df = v3_df.describe(include='all', datetime_is_numeric=True)
    return descriptive_statistic_df

descriptive_statistic(v3_df)

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,Date,MonthYear,TotalAmount,DayOfWeek
count,397924.0,397924,397924,397924.0,397924,397924.0,397924.0,397924,397924,397924,397924.0,397924
unique,18536.0,3665,3877,,,,4339.0,37,305,13,,6
top,576339.0,85123A,WHITE HANGING HEART T-LIGHT HOLDER,,,,17841.0,United Kingdom,2011-11-06,2011-11,,Thursday
freq,542.0,2035,2028,,,,7847.0,354345,3423,64545,,80052
mean,,,,13.021823,2011-07-10 23:43:36.912475648,3.116174,,,,,22.394749,
min,,,,1.0,2010-12-01 08:26:00,0.0,,,,,0.0,
25%,,,,2.0,2011-04-07 11:12:00,1.25,,,,,4.68,
50%,,,,6.0,2011-07-31 14:39:00,1.95,,,,,11.8,
75%,,,,12.0,2011-10-20 14:33:00,3.75,,,,,19.8,
max,,,,80995.0,2011-12-09 12:50:00,8142.75,,,,,168469.6,


### Preliminary Insights
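1. Transaction dates run from `2010-12-01` to `2011-12-09`, covering 305 distinct days across 13 calendar months. Only 6 of the 7 weekdays appear in `DayOfWeek`; the missing day is not Sunday, since the busiest date, `2011-11-06`, falls on a Sunday.
2. There are 397,924 item-level transactions (invoice lines).
3. 4,339 unique customers were invoiced, across a total of 18,536 invoices.
4. Transactions span 37 distinct countries.
5. The mean `TotalAmount` is about 22.39 (in the currency units of `UnitPrice`); this is the mean per invoice line, not per invoice.
6. The most frequent day of the week is `Thursday`, with 80,052 invoice lines.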
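
Figures like these can be cross-checked directly rather than read off the `describe()` table. The sketch below shows one way to do it, including listing which weekdays never appear; `toy_df`, `insights` and `missing_days` are hypothetical names and the rows are toy data, so the printed values are not the retail results.

```python
import pandas as pd

# toy data standing in for v3_df
toy_df = pd.DataFrame({
    "InvoiceNo": ["536365", "536365", "536366", "536367"],
    "CustomerID": ["17850", "17850", "17850", "13047"],
    "InvoiceDate": pd.to_datetime([
        "2010-12-01 08:26", "2010-12-01 08:26",
        "2010-12-01 08:28", "2010-12-05 11:02",
    ]),
})

all_days = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# weekdays that never occur in the data
present_days = set(toy_df["InvoiceDate"].dt.day_name())
missing_days = [day for day in all_days if day not in present_days]

insights = {
    "first_date": toy_df["InvoiceDate"].min().date(),
    "last_date": toy_df["InvoiceDate"].max().date(),
    # normalize() drops the time part, so this counts distinct calendar days
    "trading_days": toy_df["InvoiceDate"].dt.normalize().nunique(),
    "invoices": toy_df["InvoiceNo"].nunique(),
    "customers": toy_df["CustomerID"].nunique(),
}
print(insights)
print(missing_days)
```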

## Next Steps
1. Save `v3_df` to a CSV file for later use in the following steps:
   1. [Data Visualization]()
   2. [Customer Retention Analysis]()
   3. [Customer Segmentation Analysis]()