# Portfolio Project: Online Retail Exploratory Data Analysis with Python

## Overview

In this project, you will step into the shoes of an entry-level data analyst at an online retail company, interpreting real-world data to support a key business decision.

Source files can be found here: https://www.coursera.org/learn/perform-exploratory-data-analysis-on-retail-data-with-python/home/week/1

## Case Study
In this project, you will be working with transactional data from an online retail store. The dataset contains information about customer purchases, including product details, quantities, prices, and timestamps. Your task is to explore and analyze this dataset to gain insights into the store's sales trends, customer behavior, and popular products. 

By conducting exploratory data analysis, you will identify patterns, outliers, and correlations in the data, allowing you to make data-driven decisions and recommendations to optimize the store's operations and improve customer satisfaction. Through visualizations and statistical analysis, you will uncover key trends, such as the busiest sales months, best-selling products, and the store's most valuable customers. Ultimately, this project aims to provide actionable insights that can drive strategic business decisions and enhance the store's overall performance in the competitive online retail market.

## Project Objectives
1. Describe data to answer key questions to uncover insights
2. Gain valuable insights that will help improve online retail performance
3. Provide analytic insights and data-driven recommendations

## Dataset

The dataset you will be working with is the "Online Retail" dataset. It contains transactional data of an online retail store from 2010 to 2011. The dataset is available as a .xlsx file named `Online Retail.xlsx`. This data file is already included in the Coursera Jupyter Notebook environment; if you are working off-platform, it can also be downloaded [here](https://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx).

The dataset contains the following columns:

- InvoiceNo: Invoice number of the transaction
- StockCode: Unique code of the product
- Description: Description of the product
- Quantity: Quantity of the product in the transaction
- InvoiceDate: Date and time of the transaction
- UnitPrice: Unit price of the product
- CustomerID: Unique identifier of the customer
- Country: Country where the transaction occurred

## Tasks

You may explore this dataset in any way you like. If you'd like some help getting started, here are a few ideas:

1. Load the dataset into a Pandas DataFrame and display the first few rows to get an overview of the data.
2. Perform data cleaning by handling missing values, if any, and removing any redundant or unnecessary columns.
3. Explore the basic statistics of the dataset, including measures of central tendency and dispersion.
4. Perform data visualization to gain insights into the dataset. Generate appropriate plots, such as histograms, scatter plots, or bar plots, to visualize different aspects of the data.
5. Analyze the sales trends over time. Identify the busiest months and days of the week in terms of sales.
6. Explore the top-selling products and countries based on the quantity sold.
7. Identify any outliers or anomalies in the dataset and discuss their potential impact on the analysis.
8. Draw conclusions and summarize your findings from the exploratory data analysis.
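
As a rough illustration of ideas 5 and 6, here is a minimal sketch on a small made-up DataFrame that borrows the dataset's column names. The rows are invented for illustration, and the `Revenue` column (`Quantity` times `UnitPrice`) is an assumed measure of "sales" that is not part of the original data.

```python
import pandas as pd

# Toy stand-in for the retail data; every value here is made up for illustration.
toy = pd.DataFrame({
    "Description": ["MUG", "MUG", "LAMP", "LAMP", "CLOCK"],
    "Quantity": [6, 2, 1, 3, 4],
    "UnitPrice": [2.0, 2.0, 10.0, 10.0, 5.0],
    "InvoiceDate": pd.to_datetime([
        "2010-12-01 08:26", "2010-12-02 09:00", "2011-01-04 10:15",
        "2011-01-05 11:30", "2011-01-05 12:45",
    ]),
    "Country": ["United Kingdom", "France", "United Kingdom", "France", "United Kingdom"],
})

# Assumed derived column: revenue per line item.
toy["Revenue"] = toy["Quantity"] * toy["UnitPrice"]

# Idea 5: total revenue per calendar month and per day of the week.
sales_by_month = toy.groupby(toy["InvoiceDate"].dt.to_period("M"))["Revenue"].sum()
sales_by_weekday = toy.groupby(toy["InvoiceDate"].dt.day_name())["Revenue"].sum()

# Idea 6: products and countries ranked by quantity sold.
top_products = toy.groupby("Description")["Quantity"].sum().sort_values(ascending=False)
top_countries = toy.groupby("Country")["Quantity"].sum().sort_values(ascending=False)

print(sales_by_month)
print(sales_by_weekday.sort_values(ascending=False))
print(top_products.head(10))
print(top_countries.head(10))
```

On the real data, the same grouping calls should apply once `toy` is swapped for the loaded DataFrame. Negative quantities or zero prices, if present, would distort the totals and are worth checking first.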

## Task 1: Load the Data

In [14]:
import pandas as pd
# Show up to 400 characters per cell so long Description values aren't truncated
pd.set_option('display.max_colwidth', 400)

In [15]:
retail_data = pd.read_excel('Online Retail.xlsx', sheet_name='Online Retail')

In [16]:
retail_data.head(500)

|     | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850.0 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 495 | 536409 | 20669 | RED HEART LUGGAGE TAG | 1 | 2010-12-01 11:45:00 | 1.25 | 17908.0 | United Kingdom |
| 496 | 536409 | 90129F | RED GLASS TASSLE BAG CHARM | 1 | 2010-12-01 11:45:00 | 2.95 | 17908.0 | United Kingdom |
| 497 | 536409 | 90210B | CLEAR ACRYLIC FACETED BANGLE | 1 | 2010-12-01 11:45:00 | 2.95 | 17908.0 | United Kingdom |
| 498 | 536409 | 90199C | 5 STRAND GLASS NECKLACE CRYSTAL | 1 | 2010-12-01 11:45:00 | 6.35 | 17908.0 | United Kingdom |


In [17]:
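retail_data.describe()       # summary statistics (not rendered: only a cell's last expression is shown)
retail_data.info()           # prints dtypes and non-null counts per column
retail_data.isnull().sum()   # null count per column (not rendered, same reason)
retail_data.nunique()        # distinct values per column, rendered as the cell result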

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   InvoiceNo    541909 non-null  object        
 1   StockCode    541909 non-null  object        
 2   Description  540455 non-null  object        
 3   Quantity     541909 non-null  int64         
 4   InvoiceDate  541909 non-null  datetime64[ns]
 5   UnitPrice    541909 non-null  float64       
 6   CustomerID   406829 non-null  float64       
 7   Country      541909 non-null  object        
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 33.1+ MB


InvoiceNo      25900
StockCode       4070
Description     4223
Quantity         722
InvoiceDate    23260
UnitPrice       1630
CustomerID      4372
Country           38
dtype: int64

This dataset contains 541,909 rows. Notably, 135,080 of them (about 25%) have a null `CustomerID`, which is important to report back to the store owner. Another 1,454 rows are missing a `Description`, although every row still has a `StockCode`, so the product can still be identified. `Description` helps readability but probably isn't needed for EDA.

For now, I am keeping the rows with a missing `CustomerID`, because some analyses (e.g., product, time, and geospatial analysis) don't require it.
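
The counts quoted above follow directly from the `info()` output. As a hedged sketch, the helper below (`missing_summary` is a name introduced here, not part of the notebook) tabulates missing values per column. It is shown on a tiny made-up DataFrame, followed by the arithmetic for the real `CustomerID` figures.

```python
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values for each column."""
    missing = df.isnull().sum()
    return pd.DataFrame({
        "missing": missing,
        "percent": (missing / len(df) * 100).round(1),
    })

# Toy rows, invented for illustration only.
toy = pd.DataFrame({
    "StockCode": ["A1", "A2", "A3", "A4"],
    "Description": ["ITEM ONE", None, "ITEM THREE", "ITEM FOUR"],
    "CustomerID": [17850.0, None, None, 17850.0],
})
summary = missing_summary(toy)
print(summary)

# The same arithmetic with the real counts from the info() output above.
total_rows, customer_non_null = 541_909, 406_829
customer_missing = total_rows - customer_non_null
customer_missing_pct = round(customer_missing / total_rows * 100, 1)
print(customer_missing, customer_missing_pct)
```

Calling `missing_summary(retail_data)` on the loaded data should reproduce the per-column gaps in one table.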

## Task 2: Clean the Data


To-do ideas:
1. Identify blank `Description` values and populate them from the `StockCode` (the product code), if possible.
2. Identify blank `CustomerID` values and populate them from the `InvoiceNo`, if possible, on the assumption that an invoice belongs to a single customer.
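
The second idea is not implemented in this notebook. A minimal sketch of how it might work is shown here on made-up rows. It fills a blank `CustomerID` only when the invoice has exactly one known customer, and it leaves invoices with no known customer or with conflicting customers untouched. The `CustomerID_filled` column name and the invoice labels are hypothetical, and whether this recovers anything on the real data is untested.

```python
import pandas as pd

# Toy rows, invented for illustration only.
toy = pd.DataFrame({
    "InvoiceNo": ["A1", "A1", "A2", "A2", "A3", "A4", "A4", "A4"],
    "CustomerID": [17850.0, None, None, None, 13047.0, 12583.0, 14688.0, None],
})

# Number of distinct known customers on each invoice (nunique ignores NaN).
customers_per_invoice = toy.groupby("InvoiceNo")["CustomerID"].nunique()

# Only invoices with exactly one known customer are safe to fill from.
fillable = customers_per_invoice[customers_per_invoice == 1].index

# Lookup: InvoiceNo -> its single known CustomerID.
known = (
    toy.dropna(subset=["CustomerID"])
       .drop_duplicates("InvoiceNo")
       .set_index("InvoiceNo")["CustomerID"]
)
known = known[known.index.isin(fillable)]

# Fill blanks from the lookup; anything without a match stays NaN.
toy["CustomerID_filled"] = toy["CustomerID"].fillna(toy["InvoiceNo"].map(known))
print(toy)
```

In this toy case only the blank row on invoice `A1` is filled. `A2` has no known customer and `A4` has two, so their blanks remain.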
   

### Fix Descriptions

Possible process:

- Create a lookup of each `StockCode` and its most frequent `Description`. E.g., `10080` has three variants: "check" (1 row), "GROOVY CACTUS INFLATABLE" (22 rows), and blank (1 row). The most common `Description` is likely the valid one.
- Replace the existing value with the most frequent one.
- Filtering down to the `StockCode` values that have more than one unique `Description` probably won't make this any easier, but it might.
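
To make the `10080` example concrete, here is a small sketch that rebuilds those 24 rows as a standalone Series and picks the most frequent value. It assumes "blank" means a missing value (`NaN`), which is why pandas reports two distinct descriptions for this code, not three.

```python
import pandas as pd

# The 24 Description values described above for StockCode 10080,
# assuming the "blank" entry is a missing value.
descriptions = pd.Series(
    ["GROOVY CACTUS INFLATABLE"] * 22 + ["check", None],
    name="Description",
)

# Frequency of every variant, including the missing one.
counts = descriptions.value_counts(dropna=False)

# nunique() ignores missing values, so only two variants are counted.
distinct_non_null = descriptions.nunique()

# mode() also ignores missing values and returns the most frequent entry.
most_frequent = descriptions.mode().iloc[0]

print(counts)
print(distinct_non_null, most_frequent)
```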

### Count unique Descriptions per StockCode

In [63]:
# Count the distinct (non-null) Descriptions per StockCode. A count > 1 indicates that the
# StockCode has more than one Description. Based on the dataset's column definitions,
# this shouldn't be possible.
unique_description_counts = retail_data.groupby('StockCode')['Description'].nunique().reset_index()
display(unique_description_counts[unique_description_counts['Description'] > 1])

|     | StockCode | Description |
| --- | --- | --- |
| 1 | 10080 | 2 |
| 4 | 10133 | 2 |
| 12 | 16008 | 2 |
| 21 | 16045 | 2 |
| 50 | 20622 | 2 |
| ... | ... | ... |
| 3974 | 90195A | 2 |
| 4008 | 90210D | 2 |
| 4043 | DCGS0003 | 2 |
| 4050 | DCGS0069 | 2 |


In [37]:
# Define a function to find the most common value or return None if the Series is empty
def most_common(series):
    if series.dropna().empty:
        return None
    return series.mode().iloc[0]

# Creates a Lookup of all StockCode and hopefully-correct Descriptions. Group by 'StockCode' and apply the most_common function to 'Description'
stockcode_description_mode = retail_data.groupby('StockCode')['Description'].apply(most_common).reset_index()
display(stockcode_description_mode)

# Convert DataFrame to dictionary
stockcode_description_dict = dict(zip(stockcode_description_mode['StockCode'], stockcode_description_mode['Description']))
# display(stockcode_description_dict)

|     | StockCode | Description |
| --- | --- | --- |
| 0 | 10002 | INFLATABLE POLITICAL GLOBE |
| 1 | 10080 | GROOVY CACTUS INFLATABLE |
| 2 | 10120 | DOGGY RUBBER |
| 3 | 10125 | MINI FUNKY DESIGN TAPES |
| 4 | 10133 | COLOURING PENCILS BROWN TUBE |
| ... | ... | ... |
| 4065 | gift_0001_20 | Dotcomgiftshop Gift Voucher £20.00 |
| 4066 | gift_0001_30 | Dotcomgiftshop Gift Voucher £30.00 |
| 4067 | gift_0001_40 | Dotcomgiftshop Gift Voucher £40.00 |
| 4068 | gift_0001_50 | Dotcomgiftshop Gift Voucher £50.00 |


In [None]:
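# Merge the StockCode -> most-common-Description lookup onto retail_data.
# how="left" keeps every original row, in its original order.
retail_data_new = pd.merge(retail_data, stockcode_description_mode, on="StockCode", how="left")

# Reorder and rename the columns: Description_y (from the lookup) becomes the new
# Description, and Description_x (the original value) is kept as Description_old.
retail_data_new = retail_data_new[['InvoiceNo', 'StockCode','Description_y', 'Quantity','InvoiceDate','UnitPrice','CustomerID','Country','Description_x']]
retail_data_new = retail_data_new.rename(columns={
    'Description_y': 'Description',
    'Description_x': 'Description_old'
})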
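
The `stockcode_description_dict` built earlier is not used by the merge. As an alternative sketch, the same replacement can be done with `Series.map` and a dictionary, which avoids the `_x`/`_y` suffixes. The example runs on made-up rows, with `StockCode` stored as strings for simplicity, and reuses the notebook's `most_common` helper.

```python
import pandas as pd

def most_common(series):
    """Return the most frequent non-null value, or None if there is none."""
    if series.dropna().empty:
        return None
    return series.mode().iloc[0]

# Toy rows, invented for illustration only.
toy = pd.DataFrame({
    "StockCode": ["10080", "10080", "10080", "10080", "85123A", "85123A"],
    "Description": [
        "GROOVY CACTUS INFLATABLE", "GROOVY CACTUS INFLATABLE", "check", None,
        "WHITE HANGING HEART T-LIGHT HOLDER", None,
    ],
})

# Lookup: StockCode -> most frequent Description, as a plain dictionary.
lookup_dict = toy.groupby("StockCode")["Description"].apply(most_common).to_dict()

# Keep the original values for comparison, then overwrite from the lookup.
toy["Description_old"] = toy["Description"]
toy["Description"] = toy["StockCode"].map(lookup_dict)
print(toy)
```

To fill only the blanks and leave existing values alone, `toy["Description_old"].fillna(toy["StockCode"].map(lookup_dict))` would be a gentler variant.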

### Reviewing the fix

In [61]:
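# Rows for StockCode 10080, the example discussed earlier
# (not rendered as written; wrap in display() to show it)
filt = retail_data_new['StockCode'] == 10080
retail_data_new.loc[filt]

retail_data_new.describe()       # summary statistics (not rendered, same reason)
retail_data_new.info()           # prints dtypes and non-null counts per column
retail_data_new.isnull().sum()   # null count per column (not rendered)
retail_data_new.nunique()        # distinct values per column, rendered as the cell result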

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 9 columns):
 #   Column           Non-Null Count   Dtype         
---  ------           --------------   -----         
 0   InvoiceNo        541909 non-null  object        
 1   StockCode        541909 non-null  object        
 2   Description      541797 non-null  object        
 3   Quantity         541909 non-null  int64         
 4   InvoiceDate      541909 non-null  datetime64[ns]
 5   UnitPrice        541909 non-null  float64       
 6   CustomerID       406829 non-null  float64       
 7   Country          541909 non-null  object        
 8   Description_old  540455 non-null  object        
dtypes: datetime64[ns](1), float64(2), int64(1), object(5)
memory usage: 37.2+ MB


InvoiceNo          25900
StockCode           4070
Description         3822
Quantity             722
InvoiceDate        23260
UnitPrice           1630
CustomerID          4372
Country               38
Description_old     4223
dtype: int64
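
The output above shows 541,797 non-null `Description` values out of 541,909 rows, so 112 rows are still blank after the fix. Presumably these belong to `StockCode` values that never have a `Description`, for which `most_common` returns `None`. The sketch below shows one way to list such codes and to count the rows that were repaired. It uses made-up rows, and the code `99999` is hypothetical.

```python
import pandas as pd

# Toy rows in the shape of retail_data_new, invented for illustration only.
toy = pd.DataFrame({
    "StockCode": ["10080", "10080", "99999", "99999"],
    "Description": ["GROOVY CACTUS INFLATABLE", "GROOVY CACTUS INFLATABLE", None, None],
    "Description_old": ["GROOVY CACTUS INFLATABLE", None, None, None],
})

# StockCodes whose Description is still blank after the fix, with row counts.
still_missing = toy[toy["Description"].isnull()]
codes_without_description = still_missing["StockCode"].value_counts()

# Rows that were blank before the fix and now have a Description.
repaired = toy["Description_old"].isnull() & toy["Description"].notnull()

print(codes_without_description)
print(repaired.sum())

# The same bookkeeping with the real counts from the two info() outputs.
total_rows = 541_909
blank_before = total_rows - 540_455   # blank Descriptions before the fix
blank_after = total_rows - 541_797    # blank Descriptions after the fix
print(blank_before, blank_after, blank_before - blank_after)
```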