# Data Exploration (Sample Super Store)
This notebook gives a brief overview of how to explore a dataset, in this case the Sample SuperStore dataset, which I believe describes a fictional e-commerce company. With that said, let's explore the data from a few angles using the [pandas](https://pandas.pydata.org/) library for Python in a [Jupyter Notebook](https://jupyter.org/).

## Importing the required libraries into the Jupyter Notebook.
First and foremost, we need to load the library into our Jupyter Notebook environment. This exploration assumes you have a working installation of the [Python](https://www.python.org/) programming language, together with [Jupyter Notebook](https://jupyter.org/) and the [pandas](https://pandas.pydata.org/) library, all of which are used throughout the exploration.

In [24]:
# Importing packages
import pandas as pd

### Let's try to read the CSV record 

The following code uses these pieces:
- *`df_orders`* = the name of the variable that will be used throughout this tutorial.
- *`pd`* = the conventional alias the community uses for the pandas library.
- *`.read_csv`* = the pandas function that reads a CSV file into a DataFrame.
- *`index_col`* = an optional `read_csv` parameter that names the column to use as the row index, for example `index_col='Order ID'`. It is not used in the cell below.

In [2]:
# Let's try to read from the superstore.csv
df_orders = pd.read_csv('superstore.csv')
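
To see what `index_col` changes, here is a minimal sketch that reads a tiny in-memory CSV instead of `superstore.csv`. The three-column sample and the names `csv_text` and `sample` are made up for illustration; only the column headers follow the real file.

```python
import io

import pandas as pd

# Hypothetical two-row sample standing in for superstore.csv.
csv_text = """Row ID,Order ID,Sales
1,CA-2017-152156,261.96
2,CA-2017-152156,731.94
"""

# index_col must match a column name exactly as it is written in the file.
sample = pd.read_csv(io.StringIO(csv_text), index_col="Row ID")

print(sample.index.name)     # the column that became the row index
print(list(sample.columns))  # the remaining data columns
```

The chosen column moves out of the regular columns and becomes the row label, so it no longer counts towards the column total.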

In [3]:
# Let's try to display the datasets rows and columns.
df_orders.head(3)

Unnamed: 0,Row ID,Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Customer Name,Segment,Country,City,...,Postal Code,Region,Product ID,Category,Sub-Category,Product Name,Sales,Quantity,Discount,Profit
0,1,CA-2017-152156,11/8/2017,11/11/2017,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,...,42420.0,South,FUR-BO-10001798,Furniture,Bookcases,Bush Somerset Collection Bookcase,261.96,2.0,0.0,41.9136
1,2,CA-2017-152156,11/8/2017,11/11/2017,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,...,42420.0,South,FUR-CH-10000454,Furniture,Chairs,"Hon Deluxe Fabric Upholstered Stacking Chairs,...",731.94,3.0,0.0,219.582
2,3,CA-2017-138688,6/12/2017,6/16/2017,Second Class,DV-13045,Darrin Van Huff,Corporate,United States,Los Angeles,...,90036.0,West,OFF-LA-10000240,Office Supplies,Labels,Self-Adhesive Address Labels for Typewriters b...,14.62,2.0,0.0,6.8714


### Setting the default records to display.

Let's try to set the maximum number of columns and rows to display. By default, pandas in a notebook shows at most 60 rows (`display.max_rows`) and 20 columns (`display.max_columns`). A DataFrame with more rows than that is truncated to **10 records** in total: the first 5 come from the top of the dataset and the remaining 5 come from the bottom.

- *`pd.set_option('display.max_columns', 21)`* = show up to 21 columns, which is enough for every column in this dataset.
- *`pd.set_option('display.max_rows', 21)`* = show up to 21 rows before the output is truncated.
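
If you only want different display limits for a single cell, a possible alternative to `pd.set_option` is `pd.option_context`, which restores the previous value afterwards. This is a small sketch on a made-up 100-row frame called `demo`, not on the real dataset.

```python
import pandas as pd

# Hypothetical 100-row frame, long enough to be truncated when displayed.
demo = pd.DataFrame({"n": range(100)})

before = pd.get_option("display.max_rows")

# The new limit applies only inside the with-block.
with pd.option_context("display.max_rows", 21):
    inside = pd.get_option("display.max_rows")
    print(demo)  # still truncated, because 100 rows exceed the limit of 21

after = pd.get_option("display.max_rows")
print(inside, after == before)
```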

In [4]:
# Let's raise the display limits and show the DataFrame.
pd.set_option('display.max_columns', 21)
pd.set_option('display.max_rows', 21)
df_orders

Unnamed: 0,Row ID,Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Customer Name,Segment,Country,City,State,Postal Code,Region,Product ID,Category,Sub-Category,Product Name,Sales,Quantity,Discount,Profit
0,1,CA-2017-152156,11/8/2017,11/11/2017,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420.0,South,FUR-BO-10001798,Furniture,Bookcases,Bush Somerset Collection Bookcase,261.9600,2.0,0.00,41.9136
1,2,CA-2017-152156,11/8/2017,11/11/2017,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420.0,South,FUR-CH-10000454,Furniture,Chairs,"Hon Deluxe Fabric Upholstered Stacking Chairs,...",731.9400,3.0,0.00,219.5820
2,3,CA-2017-138688,6/12/2017,6/16/2017,Second Class,DV-13045,Darrin Van Huff,Corporate,United States,Los Angeles,California,90036.0,West,OFF-LA-10000240,Office Supplies,Labels,Self-Adhesive Address Labels for Typewriters b...,14.6200,2.0,0.00,6.8714
3,4,US-2016-108966,10/11/2016,10/18/2016,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,Florida,33311.0,South,FUR-TA-10000577,Furniture,Tables,Bretford CR4500 Series Slim Rectangular Table,957.5775,5.0,0.45,-383.0310
4,5,US-2016-108966,10/11/2016,10/18/2016,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,Florida,33311.0,South,OFF-ST-10000760,Office Supplies,Storage,Eldon Fold 'N Roll Cart System,22.3680,2.0,0.20,2.5164
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10795,Yes,US-2018-147886,,,,,,,,,,,,,,,,,,,
10796,Yes,US-2018-147998,,,,,,,,,,,,,,,,,,,
10797,Yes,US-2018-151127,,,,,,,,,,,,,,,,,,,
10798,Yes,US-2018-155999,,,,,,,,,,,,,,,,,,,


### Understanding The Dataset Structure
The following code uses these pieces:
- *`df_orders`* = the name of the variable that will be used throughout this tutorial.
- *`.shape`* = an attribute (written without parentheses) that holds the number of rows and columns as a tuple.

In [5]:
df_orders.shape

(10800, 21)
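
Because `.shape` is a plain tuple, it can be unpacked into two variables. A minimal sketch on a made-up three-row frame (the names `demo`, `n_rows` and `n_cols` are only for illustration):

```python
import pandas as pd

# Hypothetical frame with 3 rows and 2 columns.
demo = pd.DataFrame({
    "Sales": [261.96, 731.94, 14.62],
    "Quantity": [2, 3, 2],
})

# shape is (rows, columns), in that order.
n_rows, n_cols = demo.shape
print(f"{n_rows} rows, {n_cols} columns")
```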

### So now we know the size of the dataset:
- 10800 rows
- 21 columns

### Let's try to have a short glimpse of the first few rows of the records.
The following code uses these pieces:
- *`df_orders`* = the name of the variable that will be used throughout this tutorial.
- *`.head()`* = the method that displays the first rows of the dataset. It returns five rows by default; here we pass `3` to get only the first three.

In [6]:
df_orders.head(3)

Unnamed: 0,Row ID,Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Customer Name,Segment,Country,City,State,Postal Code,Region,Product ID,Category,Sub-Category,Product Name,Sales,Quantity,Discount,Profit
0,1,CA-2017-152156,11/8/2017,11/11/2017,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420.0,South,FUR-BO-10001798,Furniture,Bookcases,Bush Somerset Collection Bookcase,261.96,2.0,0.0,41.9136
1,2,CA-2017-152156,11/8/2017,11/11/2017,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420.0,South,FUR-CH-10000454,Furniture,Chairs,"Hon Deluxe Fabric Upholstered Stacking Chairs,...",731.94,3.0,0.0,219.582
2,3,CA-2017-138688,6/12/2017,6/16/2017,Second Class,DV-13045,Darrin Van Huff,Corporate,United States,Los Angeles,California,90036.0,West,OFF-LA-10000240,Office Supplies,Labels,Self-Adhesive Address Labels for Typewriters b...,14.62,2.0,0.0,6.8714


### Let's try to have a short glimpse of the last 5 rows of the records.
The following code uses these pieces:
- `df_orders` = the name of the variable that will be used throughout this tutorial.
- `.tail()` = the method that displays the last rows of the dataset. It returns five rows by default, so `tail(5)` and `tail()` give the same result.

In [7]:
df_orders.tail(5)

Unnamed: 0,Row ID,Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Customer Name,Segment,Country,City,State,Postal Code,Region,Product ID,Category,Sub-Category,Product Name,Sales,Quantity,Discount,Profit
10795,Yes,US-2018-147886,,,,,,,,,,,,,,,,,,,
10796,Yes,US-2018-147998,,,,,,,,,,,,,,,,,,,
10797,Yes,US-2018-151127,,,,,,,,,,,,,,,,,,,
10798,Yes,US-2018-155999,,,,,,,,,,,,,,,,,,,
10799,Yes,US-2018-155999,,,,,,,,,,,,,,,,,,,
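
The behaviour of `head` and `tail` with and without an argument can be checked on a small made-up frame. This is only a sketch; `demo` and its ten `RowID` values are illustrative.

```python
import pandas as pd

# Hypothetical frame with RowID running from 1 to 10.
demo = pd.DataFrame({"RowID": range(1, 11)})

first_default = demo.head()   # no argument: first 5 rows
first_three = demo.head(3)    # first 3 rows
last_two = demo.tail(2)       # last 2 rows

print(len(first_default))
print(list(first_three["RowID"]))
print(list(last_two["RowID"]))
```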


## Columns (Features) Modification
### Renaming Columns For Clarity.

The following code uses these pieces:
- *`df_orders`* = the name of the variable that will be used throughout this tutorial.
- *`.columns`* = the attribute that holds the column names. Assigning a list to it renames every column at once, so the list needs one name per column, in the same order as the existing columns.

In [8]:
# Let's try to rename the column.
df_orders.columns = ['RowID', 'OrderID', 'OrderDate', 'ShipDate', 'ShipMode', 'CustomerID', 'CustomerName', 'Segment' , 'Country', 'City', 'State', 'PostalCode', 'Region', 'ProductID', 'Category', 'SubCategory', 'ProductName' , 'Sales', 'Quantity', 'Discount', 'Profit']
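
Assigning a full list works, but it depends on getting the order and the count exactly right. As a hedged alternative, `DataFrame.rename` takes an old-name to new-name mapping, and the string methods on `.columns` can strip spaces and hyphens in one go. The four-column `demo` frame below is made up for illustration.

```python
import pandas as pd

# Hypothetical empty frame that reuses a few of the original headers.
demo = pd.DataFrame(columns=["Row ID", "Order ID", "Sub-Category", "Sales"])

# Option 1: rename selected columns through an explicit mapping.
renamed = demo.rename(columns={
    "Row ID": "RowID",
    "Order ID": "OrderID",
    "Sub-Category": "SubCategory",
})

# Option 2: remove spaces and hyphens from every column name.
cleaned = demo.copy()
cleaned.columns = cleaned.columns.str.replace(r"[ -]", "", regex=True)

print(list(renamed.columns))
print(list(cleaned.columns))
```

Both options leave `demo` itself unchanged, since `rename` returns a new DataFrame and the second option works on a copy.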

In [9]:
df_orders

Unnamed: 0,RowID,OrderID,OrderDate,ShipDate,ShipMode,CustomerID,CustomerName,Segment,Country,City,State,PostalCode,Region,ProductID,Category,SubCategory,ProductName,Sales,Quantity,Discount,Profit
0,1,CA-2017-152156,11/8/2017,11/11/2017,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420.0,South,FUR-BO-10001798,Furniture,Bookcases,Bush Somerset Collection Bookcase,261.9600,2.0,0.00,41.9136
1,2,CA-2017-152156,11/8/2017,11/11/2017,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420.0,South,FUR-CH-10000454,Furniture,Chairs,"Hon Deluxe Fabric Upholstered Stacking Chairs,...",731.9400,3.0,0.00,219.5820
2,3,CA-2017-138688,6/12/2017,6/16/2017,Second Class,DV-13045,Darrin Van Huff,Corporate,United States,Los Angeles,California,90036.0,West,OFF-LA-10000240,Office Supplies,Labels,Self-Adhesive Address Labels for Typewriters b...,14.6200,2.0,0.00,6.8714
3,4,US-2016-108966,10/11/2016,10/18/2016,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,Florida,33311.0,South,FUR-TA-10000577,Furniture,Tables,Bretford CR4500 Series Slim Rectangular Table,957.5775,5.0,0.45,-383.0310
4,5,US-2016-108966,10/11/2016,10/18/2016,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,Florida,33311.0,South,OFF-ST-10000760,Office Supplies,Storage,Eldon Fold 'N Roll Cart System,22.3680,2.0,0.20,2.5164
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10795,Yes,US-2018-147886,,,,,,,,,,,,,,,,,,,
10796,Yes,US-2018-147998,,,,,,,,,,,,,,,,,,,
10797,Yes,US-2018-151127,,,,,,,,,,,,,,,,,,,
10798,Yes,US-2018-155999,,,,,,,,,,,,,,,,,,,


### Trying to get a sense of the data types of the columns.

The following code uses these pieces:

- *`df_orders`* = the name of the variable that will be used throughout this tutorial.
- *`.columns`* = the attribute that lists all the columns available in the dataset.
- *`.dtypes`* = the attribute (written without parentheses) that lists the data type of each column.

In [10]:
# Let's print the columns (features) names.
df_orders.columns

Index(['RowID', 'OrderID', 'OrderDate', 'ShipDate', 'ShipMode', 'CustomerID',
       'CustomerName', 'Segment', 'Country', 'City', 'State', 'PostalCode',
       'Region', 'ProductID', 'Category', 'SubCategory', 'ProductName',
       'Sales', 'Quantity', 'Discount', 'Profit'],
      dtype='object')

In [11]:
# Let's print the columns data types.
df_orders.dtypes

RowID            object
OrderID          object
OrderDate        object
ShipDate         object
ShipMode         object
CustomerID       object
CustomerName     object
Segment          object
Country          object
City             object
State            object
PostalCode      float64
Region           object
ProductID        object
Category         object
SubCategory      object
ProductName      object
Sales           float64
Quantity        float64
Discount        float64
Profit          float64
dtype: object

### We discovered that some columns (features) have the wrong data type.

The two date columns were read as plain text (`object`), and the postal code was read as a number (`float64`) even though it is a label rather than a quantity.

The following code uses these pieces:

- *df_orders* = the name of the variable that will be used throughout this tutorial.
- *.astype* = the method that changes the data type of a column in the dataset.

In [12]:
# Let's try to change the data types of the following columns in the dataset.
df_orders['OrderDate'] = df_orders['OrderDate'].astype('datetime64[ns]')
df_orders['ShipDate'] = df_orders['ShipDate'].astype('datetime64[ns]')
df_orders['PostalCode'] = df_orders['PostalCode'].astype('object')

And now let's check the data types again to see if the code has worked as intended.

In [13]:
df_orders.dtypes

RowID                   object
OrderID                 object
OrderDate       datetime64[ns]
ShipDate        datetime64[ns]
ShipMode                object
CustomerID              object
CustomerName            object
Segment                 object
Country                 object
City                    object
State                   object
PostalCode              object
Region                  object
ProductID               object
Category                object
SubCategory             object
ProductName             object
Sales                  float64
Quantity               float64
Discount               float64
Profit                 float64
dtype: object
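
For the date columns, `pd.to_datetime` is another way to do the conversion, and it lets you state the expected format explicitly. The sketch below assumes the month/day/year layout seen in the raw output (for example `11/8/2017`). The `dates` series, including its deliberately bad value, is made up for illustration.

```python
import pandas as pd

# Hypothetical sample: two dates in the dataset's layout and one bad value.
dates = pd.Series(["11/8/2017", "6/12/2017", "not a date"])

# format states the expected layout; errors="coerce" turns values that
# cannot be parsed into NaT instead of raising an error.
parsed = pd.to_datetime(dates, format="%m/%d/%Y", errors="coerce")

print(parsed)
```

Stating the format avoids any ambiguity about whether `6/12/2017` means June 12 or December 6.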

### Let's try to have a different perspective on the row index.
The following code uses these pieces:
- df_orders = the name of the variable that will be used throughout this tutorial.
- index_col = the `read_csv` parameter that selects which column becomes the row index.

In [14]:
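# df_orders = pd.read_csv('superstore.csv', index_col='Row ID')
# index_col='RowID' did not work here, most likely because the column is still named
# 'Row ID' in the CSV file; the rename to 'RowID' only happened after loading.
# The line stays commented out because re-reading the file would also discard
# the renaming and data type changes made above.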
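
One way to get `RowID` as the index without re-reading the file is `set_index`, which works on a DataFrame that is already loaded and renamed. This is a sketch on a made-up three-row frame, not a change to `df_orders`.

```python
import pandas as pd

# Hypothetical frame that already uses the renamed column.
demo = pd.DataFrame({
    "RowID": [1, 2, 3],
    "Sales": [261.96, 731.94, 14.62],
})

# set_index returns a new DataFrame; demo itself is left unchanged.
indexed = demo.set_index("RowID")

# Rows can now be looked up by their RowID label.
print(indexed.loc[2, "Sales"])
```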

In [15]:
df_orders.head(3)

Unnamed: 0,RowID,OrderID,OrderDate,ShipDate,ShipMode,CustomerID,CustomerName,Segment,Country,City,State,PostalCode,Region,ProductID,Category,SubCategory,ProductName,Sales,Quantity,Discount,Profit
0,1,CA-2017-152156,2017-11-08,2017-11-11,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420,South,FUR-BO-10001798,Furniture,Bookcases,Bush Somerset Collection Bookcase,261.96,2.0,0.0,41.9136
1,2,CA-2017-152156,2017-11-08,2017-11-11,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420,South,FUR-CH-10000454,Furniture,Chairs,"Hon Deluxe Fabric Upholstered Stacking Chairs,...",731.94,3.0,0.0,219.582
2,3,CA-2017-138688,2017-06-12,2017-06-16,Second Class,DV-13045,Darrin Van Huff,Corporate,United States,Los Angeles,California,90036,West,OFF-LA-10000240,Office Supplies,Labels,Self-Adhesive Address Labels for Typewriters b...,14.62,2.0,0.0,6.8714


In [16]:
# len() returns the number of rows in a DataFrame
len(df_orders)

10800
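
Note that `len` counts every row, including rows where most fields are empty, like the trailing rows shown by `tail` earlier. A column's `.count()` only counts the non-missing values. A minimal sketch on made-up data (the `demo` frame and its order IDs are hypothetical):

```python
import pandas as pd

# Hypothetical frame where the last row has no Sales value.
demo = pd.DataFrame({
    "OrderID": ["CA-1", "CA-2", "US-3"],
    "Sales": [261.96, 731.94, None],
})

total_rows = len(demo)                     # every row
non_null_sales = demo["Sales"].count()     # rows with a Sales value
missing_sales = demo["Sales"].isna().sum() # rows without one

print(total_rows, non_null_sales, missing_sales)
```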

### Let's try to get some decent statistical figures from the dataset.
The following code uses these pieces:
- df_orders = the name of the variable that will be used throughout this tutorial.
- .describe = the method that pulls summary statistics out of the dataset.
- A short note: by default, .describe only summarizes the numerical columns, not the categorical ones.
- With (include='all'), it summarizes both the numerical and the categorical columns.

In [17]:
# Describing statistical information on the dataset
df_orders.describe()

Unnamed: 0,Sales,Quantity,Discount,Profit
count,9994.0,9994.0,9994.0,9994.0
mean,229.858001,3.789574,0.156203,28.656896
std,623.245101,2.22511,0.206452,234.260108
min,0.444,1.0,0.0,-6599.978
25%,17.28,2.0,0.0,1.72875
50%,54.49,3.0,0.2,8.6665
75%,209.94,5.0,0.2,29.364
max,22638.48,14.0,0.8,8399.976


In [18]:
# Describing more statistical information on the dataset
df_orders.describe(include='all')

Unnamed: 0,RowID,OrderID,OrderDate,ShipDate,ShipMode,CustomerID,CustomerName,Segment,Country,City,State,PostalCode,Region,ProductID,Category,SubCategory,ProductName,Sales,Quantity,Discount,Profit
count,10800,10800,9994,9994,9994,9994,9994,9994,9994,9994,9994,9983.0,9994,9994,9994,9994,9994,9994.0,9994.0,9994.0,9994.0
unique,10001,5015,1236,1334,4,793,793,3,1,531,49,630.0,4,1862,3,17,1850,,,,
top,Yes,CA-2018-100111,2017-09-05 00:00:00,2016-12-16 00:00:00,Standard Class,WB-21850,William Brown,Consumer,United States,New York City,California,10035.0,West,OFF-PA-10001970,Office Supplies,Binders,Staple envelope,,,,
freq,800,28,38,35,5968,37,37,5191,9994,915,2001,263.0,3203,19,6026,1523,48,,,,
first,,,2015-01-03 00:00:00,2015-01-07 00:00:00,,,,,,,,,,,,,,,,,
last,,,2018-12-30 00:00:00,2019-01-05 00:00:00,,,,,,,,,,,,,,,,,
mean,,,,,,,,,,,,,,,,,,229.858001,3.789574,0.156203,28.656896
std,,,,,,,,,,,,,,,,,,623.245101,2.22511,0.206452,234.260108
min,,,,,,,,,,,,,,,,,,0.444,1.0,0.0,-6599.978
25%,,,,,,,,,,,,,,,,,,17.28,2.0,0.0,1.72875
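
The difference between the two calls is easier to see on a small frame with one text column and one numeric column. The sketch below uses made-up data. For the text column, `top` is the most frequent value and `freq` is how often it occurs.

```python
import pandas as pd

# Hypothetical frame with one categorical and one numerical column.
demo = pd.DataFrame({
    "Segment": ["Consumer", "Consumer", "Corporate"],
    "Sales": [261.96, 731.94, 14.62],
})

numeric_summary = demo.describe()            # numerical columns only
full_summary = demo.describe(include="all")  # every column

print(list(numeric_summary.columns))
print(full_summary.loc[["top", "freq"], "Segment"])
```

Statistics that do not apply to a column, such as `mean` for text or `top` for numbers, appear as `NaN` in the combined table.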


### Let's try to get the same statistical figures for a single column.

The following code uses these pieces:

- *df_orders* = the name of the variable that will be used throughout this tutorial.
- *.count* = the number of non-missing values in a specific column.
- *.mean* = the mean (average) value of a specific column.
- *.std* = the standard deviation of a specific column.
- *.min* = the minimum value of a specific column.

In [19]:
df_orders["Sales"].count()

9994

In [20]:
df_orders["Sales"].mean()

229.8580008304938

In [21]:
df_orders["Sales"].std()

623.2451005086818

In [22]:
df_orders["Sales"].min()

0.444
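
The four separate calls can also be combined into one with `.agg`, which accepts a list of statistic names. A small sketch on three made-up sales values:

```python
import pandas as pd

# Hypothetical Sales column with three values.
sales = pd.Series([261.96, 731.94, 14.62], name="Sales")

# agg with a list returns one result per requested statistic.
summary = sales.agg(["count", "mean", "std", "min"])

print(summary)
```

The result is a Series indexed by statistic name, so `summary["mean"]` gives the same number as `sales.mean()`.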

### Exporting Dataset

Once we're satisfied with our results, let's export them to a new CSV dataset, so we can work with them in the next notebook.

- *df_orders* = the name of the variable that will be used throughout this tutorial.
- *.to_csv* = the method that exports the DataFrame to a CSV file.
- *index = False* = we need to pass this parameter set to False, since we don't want the index column written to the file.

### Let's export df_orders, originally loaded from superstore.csv, to df_orders.csv

In [26]:
df_orders.to_csv('df_orders.csv', index = False)
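
One thing to keep in mind for the next notebook: a CSV file stores only text, so the datetime type set earlier is not kept in the file. The sketch below round-trips a made-up one-row frame through an in-memory buffer and uses the `parse_dates` parameter of `read_csv` to restore the date type on load.

```python
import io

import pandas as pd

# Hypothetical one-row frame with a proper datetime column.
demo = pd.DataFrame({
    "OrderID": ["CA-2017-152156"],
    "OrderDate": pd.to_datetime(["2017-11-08"]),
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
demo.to_csv(buffer, index=False)

# Plain read: OrderDate comes back as text.
buffer.seek(0)
plain = pd.read_csv(buffer)

# Read with parse_dates: OrderDate is converted back to a datetime.
buffer.seek(0)
restored = pd.read_csv(buffer, parse_dates=["OrderDate"])

print(plain.dtypes)
print(restored.dtypes)
```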