<h2 align="center"> Exploratory Data Analysis on E-commerce Data </h2>

**Dataset:** The dataset used here for the analysis is downloaded from:
https://www.kaggle.com/carrie1/ecommerce-data

**About the Dataset:** This is a transnational dataset containing all transactions that occurred between `01/12/2010` and `09/12/2011` for a UK-based, registered non-store online retailer. The company mainly sells unique all-occasion gifts, and many of its customers are wholesalers.

**Column Description:**
* InvoiceNo: A number assigned to each transaction
* StockCode: Product code
* Description: Product name
* Quantity: No. of products purchased for each transaction
* InvoiceDate: Timestamp for each transaction
* UnitPrice: Product price per unit
* CustomerID: Unique identifier of each customer
* Country: Country name

**Queries**

1. How to find the total number of records?
2. How to find the total number of columns?
3. Display all column names.
4. How to find the number of missing values in each column?
5. Show all records with a negative Quantity.
6. Find the shape of the records with a negative Quantity.
7. How to remove all records with a negative Quantity?
8. Find the new total number of records.
9. Insert a new column AmountSpent after the UnitPrice column and assign its formula (Quantity * UnitPrice).
10. Convert the InvoiceDate column from string format to datetime format.
11. Add columns after InvoiceDate containing the Year, Month, Day, and Hour of each transaction, for use in the analysis.
12. How to find the top 5 records with the highest amount spent? Show only 3 columns: ['CustomerID','Country','AmountSpent']
13. Plot a bar chart: How many orders per month?
14. Plot a bar chart: How many orders per day of the month?
15. Plot a bar chart: How many orders per hour?

In [None]:
# Basic EDA Tools:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt

In [None]:
# Importing the dataset
df = pd.read_csv("data.csv", encoding='cp1252')
df.head()

In [None]:
df.describe().round(2) # round to 2 decimal places

In [None]:
df.info()

In [None]:
# 1. How to find out total no. of records?
print(f"Total Records: {len(df)}")

In [None]:
# 2. How to find out total no. of columns?
print(f"Total no. of columns: {len(df.columns)}")

In [None]:
# 3. Display all column names.
print(f"Different columns are: \n{df.columns.tolist()}")

In [None]:
# 4. How to find out all missing values for each column?
df.isna().sum()

**Column `Description` has 1454 missing values & `CustomerID` has 135080 missing values**
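Raw counts are easier to judge next to the share of rows they represent. The following is a minimal sketch on a small made-up frame (the `toy` data and the `summary` name are illustrative assumptions, not part of the dataset) that reports both the count and the percentage of missing values per column:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data (values are made up for illustration).
toy = pd.DataFrame({
    "Description": ["MUG", None, "LAMP", "CLOCK"],
    "CustomerID": [17850.0, np.nan, np.nan, 13047.0],
    "Quantity": [6, 2, 1, 4],
})

missing = toy.isna().sum()                         # missing count per column
missing_pct = (toy.isna().mean() * 100).round(2)   # missing share in percent

# Combine both views into one small table.
summary = pd.DataFrame({"missing": missing, "percent": missing_pct})
print(summary)
```

Applied to the real `df`, the same two expressions would show how large a share of rows lack a `CustomerID`, which matters for any per-customer analysis.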

The `Quantity` column contains some negative values. Since a purchased quantity cannot be negative, we can either drop all records with a negative quantity or replace those quantities with a value of our choice.
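The two options can be sketched side by side on a small made-up frame. The `toy` data, and the choice of 0 as the replacement value, are assumptions made only for illustration:

```python
import pandas as pd

# Toy stand-in for the real data (values are made up for illustration).
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "C536379", "536380", "C536383"],
    "Quantity": [6, -1, 3, -2],
})

# Option 1: keep only the rows whose quantity is not negative.
dropped = toy[toy["Quantity"] >= 0].reset_index(drop=True)

# Option 2: keep every row but replace negative quantities with 0.
replaced = toy.assign(Quantity=toy["Quantity"].clip(lower=0))

print(len(dropped), replaced["Quantity"].tolist())
```

Neither option modifies `toy` itself, so the original rows remain available for comparison.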

In [None]:
# 5. Show all records with a negative Quantity.
df[df.Quantity < 0]

In [None]:
# 6. Find the shape of the records with a negative Quantity.
df[df.Quantity < 0].shape

In [None]:
print(f"Total no. of records before dropping the -ve Qty records: {len(df)}")

In [None]:
# df_new = df[~(df.Quantity < 0)] # alternative: keep only records with a non-negative Qty, leaving df unchanged
# df_new.head()

In [None]:
# 7. How to remove all Quantity records with negative values?
index = df[df.Quantity < 0].index
df.drop(index = index, inplace=True)
df

In [None]:
# 8. Find the new total number of records.
print(f"Total no. of records after dropping the -ve Qty records: {len(df)}")

In [None]:
# 9. Insert a new column AmountSpent after the UnitPrice column (AmountSpent = Quantity * UnitPrice).
AmountSpent = df.Quantity * df.UnitPrice
df.insert(loc=6,column='AmountSpent',value=AmountSpent)
df.head()
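The hard-coded position `6` only works while the column order stays as loaded. A hedged alternative is to look the position up by name with `Index.get_loc`. The sketch below uses a small made-up frame (`toy` and `pos` are illustrative names):

```python
import pandas as pd

# Toy stand-in with a subset of the real columns (values are made up).
toy = pd.DataFrame({
    "Quantity": [6, 2],
    "InvoiceDate": ["12/1/2010 8:26", "12/1/2010 8:28"],
    "UnitPrice": [2.55, 3.39],
    "CustomerID": [17850.0, 13047.0],
})

# Position right after UnitPrice, looked up by name instead of hard-coded.
pos = toy.columns.get_loc("UnitPrice") + 1
toy.insert(loc=pos, column="AmountSpent", value=toy["Quantity"] * toy["UnitPrice"])

print(toy.columns.tolist())
```

Note that `DataFrame.insert` raises a `ValueError` if the column already exists, so re-running the cell fails unless the column is dropped first.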

In [None]:
# 10. Convert the InvoiceDate column from string format to datetime format.
df.InvoiceDate = pd.to_datetime(df.InvoiceDate)
df.head()
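Without a `format` argument, `pd.to_datetime` has to infer the layout of the strings. If the layout is known, passing it explicitly avoids ambiguity between day-first and month-first dates. The sketch below assumes the strings look like `12/1/2010 8:26` (month/day/year hour:minute); that layout is an assumption and should be checked against `df.InvoiceDate.head()` before use:

```python
import pandas as pd

# Toy stand-in; the string layout is an assumption about the raw file.
toy = pd.DataFrame({"InvoiceDate": ["12/1/2010 8:26", "12/9/2011 12:50"]})

# Assumed layout: month/day/year hour:minute
toy["InvoiceDate"] = pd.to_datetime(toy["InvoiceDate"], format="%m/%d/%Y %H:%M")

print(toy["InvoiceDate"].dt.day.tolist())
```

With the explicit format, the second value is read as 9 December 2011 rather than 12 September 2011.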

In [None]:
# 11. Add columns after InvoiceDate containing the Year, Month, Day, and Hour of each transaction.
Year = df.InvoiceDate.dt.year
Month = df.InvoiceDate.dt.month
Day = df.InvoiceDate.dt.day
Hour = df.InvoiceDate.dt.hour
df.insert(5,"Year",Year)
df.insert(6,"Month",Month)
df.insert(7,"Day",Day)
df.insert(8,"Hour",Hour)
df.head()

In [None]:
# df.assign(Year = df.InvoiceDate.dt.year, Month = df.InvoiceDate.dt.month,
#           Day = df.InvoiceDate.dt.day, Hour = df.InvoiceDate.dt.hour )
# df.head()

In [None]:
# df.nlargest(5,'AmountSpent')

In [None]:
# 12. Top 5 records with the highest amount spent (row-level, not aggregated per customer). Show only 3 columns: ['CustomerID','Country','AmountSpent']
df.loc[df.sort_values("AmountSpent", ascending=False).index[:5],['CustomerID','Country','AmountSpent']]
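The cell above ranks individual rows, so one customer can appear several times. If the goal is the top customers by total spend, the amounts can be summed per customer first. This is a minimal sketch on a made-up frame (`toy` and `top` are illustrative names, and it shows the top 2 rather than the top 5 to keep the data small):

```python
import pandas as pd

# Toy stand-in for the real data (values are made up for illustration).
toy = pd.DataFrame({
    "CustomerID": [1.0, 1.0, 2.0, 3.0, 3.0],
    "Country": ["UK", "UK", "France", "UK", "UK"],
    "AmountSpent": [10.0, 15.0, 20.0, 5.0, 30.0],
})

# Sum the spend per customer, then keep the largest totals.
top = (
    toy.groupby(["CustomerID", "Country"], as_index=False)["AmountSpent"]
       .sum()
       .nlargest(2, "AmountSpent")
       .reset_index(drop=True)
)

print(top)
```

By default `groupby` drops rows whose key is missing, so on the real data the records without a `CustomerID` would be left out of this ranking.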

**Note:** December appears in both years, 2010 and 2011.

In [None]:
df.Month.value_counts()

In [None]:
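# 13. Plot a bar chart: How many orders per month?
# Select Quantity before summing so that only this column is aggregated
# (the plotted value is the total Quantity per month).
df.groupby(["Year","Month"])['Quantity'].sum().plot.bar(figsize=(10,5),color="red")
plt.ylabel("Orders")
plt.title("Orders per Month");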
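The chart above sums `Quantity`, which measures units rather than orders. If an order is taken to be one distinct `InvoiceNo` (an assumption about how the dataset identifies orders), the two measures can be compared as in this sketch on made-up data (`toy`, `orders_per_month` and `units_per_month` are illustrative names):

```python
import pandas as pd

# Toy stand-in for the real data (values are made up for illustration).
toy = pd.DataFrame({
    "InvoiceNo": ["A1", "A1", "A2", "B1", "B1", "C1"],
    "Year": [2010, 2010, 2010, 2011, 2011, 2011],
    "Month": [12, 12, 12, 1, 1, 12],
    "Quantity": [6, 2, 1, 4, 4, 3],
})

# Orders: number of distinct invoices in each (Year, Month) group.
orders_per_month = toy.groupby(["Year", "Month"])["InvoiceNo"].nunique()

# Units: total quantity in each (Year, Month) group.
units_per_month = toy.groupby(["Year", "Month"])["Quantity"].sum()

print(orders_per_month)
print(units_per_month)
```

Either series can be passed to `.plot.bar()` in the same way as the cell above; grouping by both `Year` and `Month` keeps December 2010 and December 2011 in separate bars.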

In [None]:
# 14. Plot a bar chart: How many orders per day of the month?
# Select Quantity before summing so that only this column is aggregated.
df.groupby("Day")['Quantity'].sum().plot.bar(figsize=(10,5),color="gold")
plt.xticks(rotation=0)
plt.ylabel("Orders")
plt.title("Orders per Day");

In [None]:
# 15. Plot a bar chart: How many orders per hour?
# Select Quantity before summing so that only this column is aggregated.
df.groupby("Hour")['Quantity'].sum().plot.bar(figsize=(10,5),color="black")
plt.xticks(rotation=0)
plt.ylabel("Orders")
plt.title("Orders per Hour");