# Introduction
Many things have changed since Michael Aldrich invented the earliest form of e-commerce in 1979. Due to the COVID-19 lockdowns introduced by governments in 2020, many customers began shopping online and are now turning into loyal online shoppers.

To make the right decisions about your products and marketing strategies, it is essential to build a better understanding of your customers, old and new.

We will explore this dataset, hosted on [Kaggle](https://www.kaggle.com/datasets/ytgangster/online-sales-in-usa), and use analytical tools to identify purchase patterns and predict future sales so that we can focus our marketing efforts.

## Exploratory Data Analysis
EDA is used to understand the underlying structure of the dataset, possible relationships or correlations between covariates, distributions that may affect the choice of an algorithm, and so on.

### Import Data
The data covers online sales of different products, including general merchandise and electronics, in the USA.

Large retailers are actively searching for ways to increase their profit. Sales analysis is one of the key techniques they use to increase sales by understanding customers' purchasing behavior and patterns. Market basket analysis examines collections of items to find relationships between items that go together within the business context.
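
To make the idea of market basket analysis concrete, here is a minimal sketch that counts how often two categories appear in the same order. The `orders` frame is a hypothetical toy example (only the category names are borrowed from the dataset), and `pair_counts` is an illustrative helper, not part of any library. Real analyses often use dedicated algorithms such as Apriori; this only shows the co-occurrence counting underneath.

```python
from collections import Counter
from itertools import combinations

import pandas as pd

# Hypothetical toy orders: one row per ordered item, as in the sales data.
orders = pd.DataFrame({
    "order_id": ["A", "A", "A", "B", "B", "C"],
    "category": ["Mobiles & Tablets", "Appliances", "Men's Fashion",
                 "Appliances", "Mobiles & Tablets", "Men's Fashion"],
})


def pair_counts(df, basket_col="order_id", item_col="category"):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for _, items in df.groupby(basket_col)[item_col]:
        # Deduplicate and sort so each pair has one canonical ordering.
        unique_items = sorted(set(items))
        counts.update(combinations(unique_items, 2))
    return counts


pairs = pair_counts(orders)
for pair, n in pairs.most_common():
    print(pair, n)
```

In this toy example, "Appliances" and "Mobiles & Tablets" appear together in two orders, while every other pair appears in one. Orders with a single item contribute no pairs.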

In [10]:
# Importing the libraries

from __future__ import print_function, division
import numpy as np
import pandas as pd

In [12]:
# Loading datasets
onSales_df = pd.read_csv(
    './data/sales.csv'
)

In [13]:
print('{0} observations and {1} characteristics'.format(onSales_df.shape[0], onSales_df.shape[1]))
print('First 5 rows:')
pd.set_option('display.max_columns', None)
onSales_df.head()

286392 observations and 36 characteristics
First 5 rows:


Unnamed: 0,order_id,order_date,status,item_id,sku,qty_ordered,price,value,discount_amount,total,category,payment_method,bi_st,cust_id,year,month,ref_num,Name Prefix,First Name,Middle Initial,Last Name,Gender,age,full_name,E Mail,Customer Since,SSN,Phone No.,Place Name,County,City,State,Zip,Region,User Name,Discount_Percent
0,100354678,2020-10-01,received,574772.0,oasis_Oasis-064-36,21.0,89.9,1798.0,0.0,1798.0,Men's Fashion,cod,Valid,60124.0,2020,Oct-2020,987867,Drs.,Jani,W,Titus,F,43.0,"Titus, Jani",jani.titus@gmail.com,8/22/2006,627-31-5251,405-959-1129,Vinson,Harmon,Vinson,OK,73571,South,jwtitus,0.0
1,100354678,2020-10-01,received,574774.0,Fantastic_FT-48,11.0,19.0,190.0,0.0,190.0,Men's Fashion,cod,Valid,60124.0,2020,Oct-2020,987867,Drs.,Jani,W,Titus,F,43.0,"Titus, Jani",jani.titus@gmail.com,8/22/2006,627-31-5251,405-959-1129,Vinson,Harmon,Vinson,OK,73571,South,jwtitus,0.0
2,100354680,2020-10-01,complete,574777.0,mdeal_DMC-610-8,9.0,149.9,1199.2,0.0,1199.2,Men's Fashion,cod,Net,60124.0,2020,Oct-2020,987867,Drs.,Jani,W,Titus,F,43.0,"Titus, Jani",jani.titus@gmail.com,8/22/2006,627-31-5251,405-959-1129,Vinson,Harmon,Vinson,OK,73571,South,jwtitus,0.0
3,100354680,2020-10-01,complete,574779.0,oasis_Oasis-061-36,9.0,79.9,639.2,0.0,639.2,Men's Fashion,cod,Net,60124.0,2020,Oct-2020,987867,Drs.,Jani,W,Titus,F,43.0,"Titus, Jani",jani.titus@gmail.com,8/22/2006,627-31-5251,405-959-1129,Vinson,Harmon,Vinson,OK,73571,South,jwtitus,0.0
4,100367357,2020-11-13,received,595185.0,MEFNAR59C38B6CA08CD,2.0,99.9,99.9,0.0,99.9,Men's Fashion,cod,Valid,60124.0,2020,Nov-2020,987867,Drs.,Jani,W,Titus,F,43.0,"Titus, Jani",jani.titus@gmail.com,8/22/2006,627-31-5251,405-959-1129,Vinson,Harmon,Vinson,OK,73571,South,jwtitus,0.0


In [14]:
onSales_df.tail()

Unnamed: 0,order_id,order_date,status,item_id,sku,qty_ordered,price,value,discount_amount,total,category,payment_method,bi_st,cust_id,year,month,ref_num,Name Prefix,First Name,Middle Initial,Last Name,Gender,age,full_name,E Mail,Customer Since,SSN,Phone No.,Place Name,County,City,State,Zip,Region,User Name,Discount_Percent
286387,100562365,2021-09-30,paid,905179.0,APPCHA5AF14939B8F8A,2.0,4419.9,4419.9,0.0,4419.9,Appliances,Easypay,Valid,115323.0,2021,Sep-2021,967309,Prof.,Brady,K,Latham,M,51.0,"Latham, Brady",brady.latham@gmail.com,3/21/2007,613-87-0361,212-772-7404,Rushville,Yates,Rushville,NY,14544,Northeast,bklatham,0.0
286388,100562376,2021-09-30,cod,905191.0,MEFCOT5A8D1E973B886,2.0,39.9,39.9,0.0,39.9,Men's Fashion,cod,Valid,115324.0,2021,Sep-2021,335358,Prof.,Bennie,M,Brunetti,M,52.0,"Brunetti, Bennie",bennie.brunetti@gmail.com,10/24/2011,101-02-1040,229-817-9451,Lawrenceville,Gwinnett,Lawrenceville,GA,30044,South,bmbrunetti,0.0
286389,100562383,2021-09-30,cod,905200.0,WOFVAL59D5EA84167F9-M,2.0,40.0,40.0,0.0,40.0,Women's Fashion,cod,Valid,115325.0,2021,Sep-2021,675384,Mrs.,Francesca,N,Giusti,F,38.0,"Giusti, Francesca",francesca.giusti@btinternet.com,7/25/1987,399-31-7238,252-414-8396,Durham,Durham,Durham,NC,27701,South,fngiusti,0.0
286390,100562384,2021-09-30,cod,905202.0,WOFNIG5B4D7EB0E9FDD-L,2.0,49.9,49.9,0.0,49.9,Women's Fashion,cod,Valid,115325.0,2021,Sep-2021,675384,Mrs.,Francesca,N,Giusti,F,38.0,"Giusti, Francesca",francesca.giusti@btinternet.com,7/25/1987,399-31-7238,252-414-8396,Durham,Durham,Durham,NC,27701,South,fngiusti,0.0
286391,100562386,2021-09-30,processing,905205.0,MATHUA5AF70A7D1E50A,2.0,3559.9,3559.9,0.0,3559.9,Mobiles & Tablets,bankalfalah,Gross,115326.0,2021,Sep-2021,489455,Mr.,Rolf,E,Schlosser,M,28.0,"Schlosser, Rolf",rolf.schlosser@comcast.net,1/28/2015,320-11-8748,423-276-2699,Knoxville,Knox,Knoxville,TN,37920,South,reschlosser,0.0


In [15]:
onSales_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 286392 entries, 0 to 286391
Data columns (total 36 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   order_id          286392 non-null  object 
 1   order_date        286392 non-null  object 
 2   status            286392 non-null  object 
 3   item_id           286392 non-null  float64
 4   sku               286392 non-null  object 
 5   qty_ordered       286392 non-null  float64
 6   price             286392 non-null  float64
 7   value             286392 non-null  float64
 8   discount_amount   286392 non-null  float64
 9   total             286392 non-null  float64
 10  category          286392 non-null  object 
 11  payment_method    286392 non-null  object 
 12  bi_st             286392 non-null  object 
 13  cust_id           286392 non-null  float64
 14  year              286392 non-null  int64  
 15  month             286392 non-null  object 
 16  ref_num           28

Inspecting the data, we quickly notice that there are 12 numeric variables (9 float and 3 integer), 24 object variables, and 286392 rows. There are no missing values: the non-null count of every feature equals the number of rows, 286392. However, some columns have content issues. For example, age, item_id, cust_id, and qty_ordered are stored as decimal numbers even though they should be whole numbers, and value does not equal price times qty_ordered (in the first row, 21 × 89.9 = 1887.9, but value is 1798.0).
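
The issues noted above can be checked programmatically. The sketch below uses a small `sample` frame with values copied from the first three rows of the `head()` output; the helper names `integer_valued_float_columns` and `value_mismatches`, as well as the tolerance `tol`, are assumptions made for illustration. To run the same checks on the full data, pass `onSales_df` instead of `sample`.

```python
import pandas as pd

# Values copied from the first three rows shown by head(); a subset of columns only.
sample = pd.DataFrame({
    "item_id": [574772.0, 574774.0, 574777.0],
    "qty_ordered": [21.0, 11.0, 9.0],
    "price": [89.9, 19.0, 149.9],
    "value": [1798.0, 190.0, 1199.2],
    "cust_id": [60124.0, 60124.0, 60124.0],
    "age": [43.0, 43.0, 43.0],
})


def integer_valued_float_columns(df):
    """Return the float64 columns whose non-null values are all whole numbers."""
    float_cols = df.select_dtypes(include="float64").columns
    return [c for c in float_cols if (df[c].dropna() % 1 == 0).all()]


def value_mismatches(df, tol=0.01):
    """Return rows where value differs from price * qty_ordered by more than tol."""
    expected = df["price"] * df["qty_ordered"]
    return df[(df["value"] - expected).abs() > tol]


# Count of columns per dtype, and total number of missing cells.
print(sample.dtypes.value_counts())
print("missing cells:", sample.isna().sum().sum())

# Float columns that could safely be cast to an integer dtype.
whole = integer_valued_float_columns(sample)
print("integer-valued float columns:", whole)

# Rows where value is inconsistent with price * qty_ordered.
mismatched = value_mismatches(sample)
print("rows with inconsistent value:", len(mismatched))
```

On these three rows, `item_id`, `qty_ordered`, `cust_id`, and `age` are flagged as integer-valued, and all three rows have a `value` that differs from `price * qty_ordered`. Whether the same holds across the full dataset needs to be confirmed by running the checks on `onSales_df`.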