# Data Exploration (Sample SuperStore)

![unsplash.jpg](attachment:unsplash.jpg)

This notebook gives a brief overview of how to explore a dataset, in this case the Sample SuperStore. To the best of my knowledge, it contains sales data from a fictional e-commerce company. With that said, let's explore the dataset using the built-in functionality of the pandas library for Python, with [Jupyter Notebook](https://jupyter.org/) as the working environment.

## Installation
This exploration assumes you have a working installation of the [Python](https://www.python.org/) programming language, along with [Jupyter Notebook](https://jupyter.org/) and the [pandas](https://pandas.pydata.org/) library, available on your workstation throughout the exploration.

Installing each of these pieces of software is beyond the scope of this tutorial. I suggest heading over to the official sites, where you'll find steps for downloading and installing the required language and libraries on your operating system of choice.

1. [Python Official Website](https://www.python.org/).
2. [pandas Official Website](https://pandas.pydata.org/).
3. [Jupyter Official Website](https://jupyter.org/).

## Importing Library
First, we need to load the library into our Jupyter Notebook environment. As noted in the installation section, this assumes that [pandas](https://pandas.pydata.org/) is already installed alongside Python and Jupyter Notebook.


In [1]:
# Importing packages
import pandas as pd

## Reading Datasets

The following code uses these pieces:
- *`df_orders`* = the name of the variable that will be used throughout the examples in this tutorial.
- *`pd`* = the alias for the pandas library; it's the convention the community uses.
- *`.read_csv`* = the pandas function that reads a CSV file into a DataFrame.
- *`index_col`* = an optional parameter that picks the column to use as the index. It is left out here and introduced later with `'Order ID'`.

In [2]:
# Let's try to read from the superstore.csv
df_orders = pd.read_csv('data/superstore.csv')

In [3]:
df_orders

Unnamed: 0,Row ID,Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Customer Name,Segment,Country,City,...,Postal Code,Region,Product ID,Category,Sub-Category,Product Name,Sales,Quantity,Discount,Profit
0,1,CA-2017-152156,11/8/2017,11/11/2017,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,...,42420.0,South,FUR-BO-10001798,Furniture,Bookcases,Bush Somerset Collection Bookcase,261.9600,2.0,0.00,41.9136
1,2,CA-2017-152156,11/8/2017,11/11/2017,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,...,42420.0,South,FUR-CH-10000454,Furniture,Chairs,"Hon Deluxe Fabric Upholstered Stacking Chairs,...",731.9400,3.0,0.00,219.5820
2,3,CA-2017-138688,6/12/2017,6/16/2017,Second Class,DV-13045,Darrin Van Huff,Corporate,United States,Los Angeles,...,90036.0,West,OFF-LA-10000240,Office Supplies,Labels,Self-Adhesive Address Labels for Typewriters b...,14.6200,2.0,0.00,6.8714
3,4,US-2016-108966,10/11/2016,10/18/2016,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,...,33311.0,South,FUR-TA-10000577,Furniture,Tables,Bretford CR4500 Series Slim Rectangular Table,957.5775,5.0,0.45,-383.0310
4,5,US-2016-108966,10/11/2016,10/18/2016,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,...,33311.0,South,OFF-ST-10000760,Office Supplies,Storage,Eldon Fold 'N Roll Cart System,22.3680,2.0,0.20,2.5164
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10795,Yes,US-2018-147886,,,,,,,,,...,,,,,,,,,,
10796,Yes,US-2018-147998,,,,,,,,,...,,,,,,,,,,
10797,Yes,US-2018-151127,,,,,,,,,...,,,,,,,,,,
10798,Yes,US-2018-155999,,,,,,,,,...,,,,,,,,,,


By default, pandas truncates the display once a dataset exceeds its display limits, which in a notebook are 60 rows (`display.max_rows`) and 20 columns (`display.max_columns`). A truncated view shows only 10 rows (`display.min_rows`). In the tabular data above, the truncation is marked with a triple-dot sign **'...'** for both the rows and the columns. Since this dataset has 10800 rows and 21 columns, pandas shows only the first 5 and the last 5 records, and only 20 of the 21 columns.
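If you'd rather check these limits than take them on faith, the display options can be read back with `pd.get_option`. The snippet below is a small sketch; the values it prints depend on your pandas version and on any options already changed in the session.

```python
import pandas as pd

# Read the current display settings instead of guessing them.
max_rows = pd.get_option("display.max_rows")        # rows allowed before the view is truncated
min_rows = pd.get_option("display.min_rows")        # rows shown once the view is truncated
max_columns = pd.get_option("display.max_columns")  # columns allowed before the view is truncated

print(f"max_rows={max_rows}, min_rows={min_rows}, max_columns={max_columns}")
```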

### Dropping The Row ID

The "Row ID" column is not very informative, so it should be safe to simply delete it. That gives us a cleaner dataset to work with.

- *`df_orders`* = the name of the variable that will be used throughout the examples in this tutorial.
- *`.drop`* = the method used to drop a column.
- *`axis=1`* = tells `.drop` to look for the label among the columns rather than the rows.
- *`inplace=True`* = applies the change to `df_orders` itself instead of returning a new DataFrame.

In [4]:
df_orders.drop("Row ID", axis=1, inplace=True)

In [5]:
# Let's call the previously defined data variable, the 'df_orders'
df_orders

Unnamed: 0,Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Customer Name,Segment,Country,City,State,Postal Code,Region,Product ID,Category,Sub-Category,Product Name,Sales,Quantity,Discount,Profit
0,CA-2017-152156,11/8/2017,11/11/2017,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420.0,South,FUR-BO-10001798,Furniture,Bookcases,Bush Somerset Collection Bookcase,261.9600,2.0,0.00,41.9136
1,CA-2017-152156,11/8/2017,11/11/2017,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420.0,South,FUR-CH-10000454,Furniture,Chairs,"Hon Deluxe Fabric Upholstered Stacking Chairs,...",731.9400,3.0,0.00,219.5820
2,CA-2017-138688,6/12/2017,6/16/2017,Second Class,DV-13045,Darrin Van Huff,Corporate,United States,Los Angeles,California,90036.0,West,OFF-LA-10000240,Office Supplies,Labels,Self-Adhesive Address Labels for Typewriters b...,14.6200,2.0,0.00,6.8714
3,US-2016-108966,10/11/2016,10/18/2016,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,Florida,33311.0,South,FUR-TA-10000577,Furniture,Tables,Bretford CR4500 Series Slim Rectangular Table,957.5775,5.0,0.45,-383.0310
4,US-2016-108966,10/11/2016,10/18/2016,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,Florida,33311.0,South,OFF-ST-10000760,Office Supplies,Storage,Eldon Fold 'N Roll Cart System,22.3680,2.0,0.20,2.5164
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10795,US-2018-147886,,,,,,,,,,,,,,,,,,,
10796,US-2018-147998,,,,,,,,,,,,,,,,,,,
10797,US-2018-151127,,,,,,,,,,,,,,,,,,,
10798,US-2018-155999,,,,,,,,,,,,,,,,,,,


As the table above shows, the 'Row ID' column is no longer there.
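As an aside, `.drop` doesn't have to modify the DataFrame in place. The sketch below uses a tiny made-up frame (`toy`, with only three of the real columns) to show that without `inplace=True` the method returns a new DataFrame and leaves the original alone.

```python
import pandas as pd

# Hypothetical three-column frame standing in for the SuperStore data.
toy = pd.DataFrame({
    "Row ID": [1, 2],
    "Order ID": ["CA-2017-152156", "CA-2017-138688"],
    "Sales": [261.96, 14.62],
})

# Without inplace=True, drop() returns a new DataFrame; `toy` keeps all its columns.
trimmed = toy.drop(columns="Row ID")

print(list(toy.columns))
print(list(trimmed.columns))
```

Writing `columns="Row ID"` is equivalent to `"Row ID", axis=1`; which one you use is mostly a matter of taste.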

### Change The Index Column

Each time you call `pd.read_csv('somedata.csv')`, you get the dataset's actual rows and columns, and as the previous output shows, we have quite an extensive set of records.

As you may notice, the first column isn't the actual `'Row ID'` column; it's the default integer index that pandas adds to the dataset. Let's change that into something more useful by going back to `pd.read_csv` and adding another parameter, `index_col`.

Since we want to start again from a clean dataset, let's load it once more with <br>
`df_orders = pd.read_csv('data/superstore.csv', index_col='Order ID')`.<br>

Then display `df_orders` again. Don't be surprised to see the `'Row ID'` column reappear, since we reloaded the file from scratch. This time, though, the index values come from the `'Order ID'` column.

In [6]:
# Let's try to read again from the superstore.csv
df_orders = pd.read_csv('data/superstore.csv', index_col='Order ID') 
# added the index_col='Order ID', parameter.

In [7]:
# Let's call the previously defined data variable, the 'df_orders'
df_orders 

Order ID,Row ID,Order Date,Ship Date,Ship Mode,Customer ID,Customer Name,Segment,Country,City,State,Postal Code,Region,Product ID,Category,Sub-Category,Product Name,Sales,Quantity,Discount,Profit
CA-2017-152156,1,11/8/2017,11/11/2017,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420.0,South,FUR-BO-10001798,Furniture,Bookcases,Bush Somerset Collection Bookcase,261.9600,2.0,0.00,41.9136
CA-2017-152156,2,11/8/2017,11/11/2017,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420.0,South,FUR-CH-10000454,Furniture,Chairs,"Hon Deluxe Fabric Upholstered Stacking Chairs,...",731.9400,3.0,0.00,219.5820
CA-2017-138688,3,6/12/2017,6/16/2017,Second Class,DV-13045,Darrin Van Huff,Corporate,United States,Los Angeles,California,90036.0,West,OFF-LA-10000240,Office Supplies,Labels,Self-Adhesive Address Labels for Typewriters b...,14.6200,2.0,0.00,6.8714
US-2016-108966,4,10/11/2016,10/18/2016,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,Florida,33311.0,South,FUR-TA-10000577,Furniture,Tables,Bretford CR4500 Series Slim Rectangular Table,957.5775,5.0,0.45,-383.0310
US-2016-108966,5,10/11/2016,10/18/2016,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,Florida,33311.0,South,OFF-ST-10000760,Office Supplies,Storage,Eldon Fold 'N Roll Cart System,22.3680,2.0,0.20,2.5164
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
US-2018-147886,Yes,,,,,,,,,,,,,,,,,,,
US-2018-147998,Yes,,,,,,,,,,,,,,,,,,,
US-2018-151127,Yes,,,,,,,,,,,,,,,,,,,
US-2018-155999,Yes,,,,,,,,,,,,,,,,,,,


After adding the extra parameter, the first column is now 'Order ID' rather than the default pandas index. The other change is in the fine print below the table, which now reports 20 columns instead of 21.
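To see the effect of `index_col` in isolation, here is a small sketch that reads the same CSV text twice. The three rows are a cut-down stand-in for `superstore.csv` (only three of its columns), fed in through `io.StringIO` so that no file is needed.

```python
import io

import pandas as pd

# A tiny stand-in for superstore.csv with only three of its columns.
csv_text = """Row ID,Order ID,Sales
1,CA-2017-152156,261.96
2,CA-2017-152156,731.94
3,CA-2017-138688,14.62
"""

# Default behaviour: pandas adds its own 0, 1, 2, ... index.
default_index = pd.read_csv(io.StringIO(csv_text))

# With index_col, 'Order ID' becomes the index and stops counting as a column.
order_index = pd.read_csv(io.StringIO(csv_text), index_col="Order ID")

print(default_index.shape)
print(order_index.shape)
print(order_index.index.is_unique)
```

Note that an order can span several rows, so an index built from 'Order ID' is not guaranteed to be unique.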

### Drop the Row ID & Change The Index

For our next objective, what if we wish to do both at once: drop the `'Row ID'` column and change the index column, so that we get an even leaner dataset to work with? Note that `.drop()` has no `index_col` parameter, so the two steps can't be merged into a single `.drop()` call. Instead, we set the index while reading the file and drop the column right after.

`df_orders = pd.read_csv('data/superstore.csv', index_col='Order ID')`

`df_orders.drop("Row ID", axis=1, inplace=True)`

- `index_col` = a `pd.read_csv` parameter that makes the given column the index of the dataset.
- `.drop()` = the method that drops a column from the dataset.
- `axis=1` = tells `.drop()` to look for the label among the columns rather than the rows.
- `inplace=True` = applies the change to `df_orders` itself instead of returning a new DataFrame.

In [19]:
df_orders = pd.read_csv('data/superstore.csv', index_col='Order ID')
df_orders.drop("Row ID", axis=1, inplace=True)

In [27]:
# Let's call the dataset again.
df_orders

Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Customer Name,Segment,Country,City,State,Postal Code,Region,Product ID,Category,Sub-Category,Product Name,Sales,Quantity,Discount,Profit
CA-2017-152156,11/8/2017,11/11/2017,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420.0,South,FUR-BO-10001798,Furniture,Bookcases,Bush Somerset Collection Bookcase,261.9600,2.0,0.00,41.9136
CA-2017-152156,11/8/2017,11/11/2017,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420.0,South,FUR-CH-10000454,Furniture,Chairs,"Hon Deluxe Fabric Upholstered Stacking Chairs,...",731.9400,3.0,0.00,219.5820
CA-2017-138688,6/12/2017,6/16/2017,Second Class,DV-13045,Darrin Van Huff,Corporate,United States,Los Angeles,California,90036.0,West,OFF-LA-10000240,Office Supplies,Labels,Self-Adhesive Address Labels for Typewriters b...,14.6200,2.0,0.00,6.8714
US-2016-108966,10/11/2016,10/18/2016,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,Florida,33311.0,South,FUR-TA-10000577,Furniture,Tables,Bretford CR4500 Series Slim Rectangular Table,957.5775,5.0,0.45,-383.0310
US-2016-108966,10/11/2016,10/18/2016,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,Florida,33311.0,South,OFF-ST-10000760,Office Supplies,Storage,Eldon Fold 'N Roll Cart System,22.3680,2.0,0.20,2.5164
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
US-2018-147886,,,,,,,,,,,,,,,,,,,
US-2018-147998,,,,,,,,,,,,,,,,,,,
US-2018-151127,,,,,,,,,,,,,,,,,,,
US-2018-155999,,,,,,,,,,,,,,,,,,,


As the column count printed below the table shows, we have only 19 columns left of the initial 21. You may ask, "How is it down to 19 columns when we only dropped one?" The answer lies in the `index_col` parameter: we set `'Order ID'` as the index of the dataset, and pandas doesn't count the index as a column.
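The same two steps can also be written as one chained expression, which avoids `inplace=True` altogether. This is only a sketch on a made-up three-column CSV, not the full SuperStore file.

```python
import io

import pandas as pd

# A tiny stand-in for superstore.csv with only three of its columns.
csv_text = """Row ID,Order ID,Sales
1,CA-2017-152156,261.96
2,CA-2017-138688,14.62
"""

# Set the index while reading, then drop 'Row ID' from the result in the same expression.
orders = pd.read_csv(io.StringIO(csv_text), index_col="Order ID").drop(columns="Row ID")

print(orders.index.name)
print(list(orders.columns))
```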

### Default Number Rows & Columns 
Let's set the maximum number of columns and rows to display. By default, once a dataset is too long to show in full, pandas displays **10 records**: the first 5 from the top and the last 5 from the bottom. That's still a lot to digest at a glance, so let's cap the view at 5 rows instead. Likewise, pandas displays up to 20 columns, and since our current dataset has only 19 of them, that setting is fine as it is.

- *`pd.set_option('display.max_columns', 20)`* = sets the maximum number of columns displayed.
- *`pd.set_option('display.max_rows', 5)`* = sets the maximum number of rows displayed.

In [35]:
# Let's set the display limits for columns and rows.
pd.set_option('display.max_columns', 20)
pd.set_option('display.max_rows', 5)

In [36]:
# Let's try to give it a go with the new setting.
df_orders

Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Customer Name,Segment,Country,City,State,Postal Code,Region,Product ID,Category,Sub-Category,Product Name,Sales,Quantity,Discount,Profit
CA-2017-152156,11/8/2017,11/11/2017,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420.0,South,FUR-BO-10001798,Furniture,Bookcases,Bush Somerset Collection Bookcase,261.96,2.0,0.0,41.9136
CA-2017-152156,11/8/2017,11/11/2017,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420.0,South,FUR-CH-10000454,Furniture,Chairs,"Hon Deluxe Fabric Upholstered Stacking Chairs,...",731.94,3.0,0.0,219.5820
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
US-2018-155999,,,,,,,,,,,,,,,,,,,
US-2018-155999,,,,,,,,,,,,,,,,,,,
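`pd.set_option` changes the setting for the rest of the session. If you only want a compact view for a single cell, `pd.option_context` applies the option temporarily and restores the previous value afterwards. A small sketch, using a made-up 100-row frame:

```python
import pandas as pd

# Hypothetical frame that is long enough to be truncated.
toy = pd.DataFrame({"value": range(100)})

before = pd.get_option("display.max_rows")

# Inside the with-block the limit is 5; it is restored when the block ends.
with pd.option_context("display.max_rows", 5):
    inside = pd.get_option("display.max_rows")
    print(toy)

after = pd.get_option("display.max_rows")
print(before, inside, after)
```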


### The First 5 rows of The Dataset
Here's another built-in pandas method that may come in handy. When you feel like taking a quick peek at the first 5 records from the top, the following code gives you that output.
<br>
- *`df_orders`* = the name of the variable that will be used throughout the examples in this tutorial.
- *`.head()`* = the method that displays the first five records of the dataset.

In [37]:
df_orders.head()

Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Customer Name,Segment,Country,City,State,Postal Code,Region,Product ID,Category,Sub-Category,Product Name,Sales,Quantity,Discount,Profit
CA-2017-152156,11/8/2017,11/11/2017,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420.0,South,FUR-BO-10001798,Furniture,Bookcases,Bush Somerset Collection Bookcase,261.96,2.0,0.0,41.9136
CA-2017-152156,11/8/2017,11/11/2017,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420.0,South,FUR-CH-10000454,Furniture,Chairs,"Hon Deluxe Fabric Upholstered Stacking Chairs,...",731.94,3.0,0.0,219.582
CA-2017-138688,6/12/2017,6/16/2017,Second Class,DV-13045,Darrin Van Huff,Corporate,United States,Los Angeles,California,90036.0,West,OFF-LA-10000240,Office Supplies,Labels,Self-Adhesive Address Labels for Typewriters b...,14.62,2.0,0.0,6.8714
US-2016-108966,10/11/2016,10/18/2016,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,Florida,33311.0,South,FUR-TA-10000577,Furniture,Tables,Bretford CR4500 Series Slim Rectangular Table,957.5775,5.0,0.45,-383.031
US-2016-108966,10/11/2016,10/18/2016,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,Florida,33311.0,South,OFF-ST-10000760,Office Supplies,Storage,Eldon Fold 'N Roll Cart System,22.368,2.0,0.2,2.5164


### The Last 5 Rows of The Dataset
Much like the previous method, the same idea applies to the bottom 5 records of your dataset. As you may have guessed, the method is `.tail()`, and it gives you the last 5 records of the dataset.
<br>
- `df_orders` = the name of the variable that will be used throughout the examples in this tutorial.
- `.tail()` = the method that displays the last five records of the dataset.

In [25]:
df_orders.tail(5)

Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Customer Name,Segment,Country,City,State,Postal Code,Region,Product ID,Category,Sub-Category,Product Name,Sales,Quantity,Discount,Profit
US-2018-147886,,,,,,,,,,,,,,,,,,,
US-2018-147998,,,,,,,,,,,,,,,,,,,
US-2018-151127,,,,,,,,,,,,,,,,,,,
US-2018-155999,,,,,,,,,,,,,,,,,,,
US-2018-155999,,,,,,,,,,,,,,,,,,,
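Both `.head()` and `.tail()` accept the number of rows you want, so you aren't tied to five. The sketch below uses a made-up `Sales` column with seven values to show a few variations.

```python
import pandas as pd

# Hypothetical seven-row frame; the values are illustrative only.
toy = pd.DataFrame({"Sales": [261.96, 731.94, 14.62, 957.5775, 22.368, 48.86, 7.28]})

first_two = toy.head(2)          # first 2 rows
last_three = toy.tail(3)         # last 3 rows
all_but_last_two = toy.head(-2)  # a negative n means "everything except the last 2 rows"

print(first_two)
print(last_three)
print(len(all_but_last_two))
```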


# The Dataset Structure

Now that you have a better understanding of importing the library, loading the dataset, and adjusting how rows and columns are displayed, let's move on to the structure of the dataset. It's an important area to cover before exploring the dataset any further.<br>

A dataset you get from the wild might not always have the structure and data types you need. Before doing further analysis and manipulation, let's make sure that both the structure and the data types have been taken care of properly.

## Rows & Columns

As we go deeper into the analysis of the dataset, let's find out how many rows and columns it has. We already know this from the previous part, but luckily pandas also provides a direct way to display the information.

- *`df_orders`* = the name of the variable that will be used throughout the examples in this tutorial.
- *`.shape`* = the attribute that holds the number of rows and columns as a tuple. It's an attribute, not a method, so it's written without parentheses.

In [41]:
df_orders.shape

(10800, 19)

So rather than reading the numbers off the table output, we can ask pandas for them directly. We now know that the dataset has the following dimensions.
<br>
- 10800 rows
- 19 columns
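Because `.shape` is a plain tuple, it can be unpacked into two variables. The sketch below uses a small made-up frame to show this, along with two related shortcuts.

```python
import pandas as pd

# Hypothetical frame with 3 rows and 2 columns.
toy = pd.DataFrame({"Sales": [261.96, 731.94, 14.62], "Quantity": [2, 3, 2]})

# shape is a (rows, columns) tuple, so it unpacks directly.
n_rows, n_cols = toy.shape
print(f"{n_rows} rows x {n_cols} columns")

print(len(toy))   # number of rows only
print(toy.size)   # total number of cells (rows * columns)
```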

## Columns in Dataset

Before exploring the columns any further, it helps to know exactly what they are called, since every later selection or operation refers to a column by its name.


- *`df_orders`* = the name of the variable that will be used throughout the examples in this tutorial.
- *`.columns`* = the attribute that holds all the column names available in the dataset.

In [45]:
# Let's print the columns (features) names.
df_orders.columns

Index(['Order Date', 'Ship Date', 'Ship Mode', 'Customer ID', 'Customer Name',
       'Segment', 'Country', 'City', 'State', 'Postal Code', 'Region',
       'Product ID', 'Category', 'Sub-Category', 'Product Name', 'Sales',
       'Quantity', 'Discount', 'Profit'],
      dtype='object')
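The `.columns` attribute behaves much like a list, which makes it handy for quick checks before you touch a column. A small sketch, using an empty made-up frame with three of the column names:

```python
import pandas as pd

# Hypothetical empty frame that only carries column names.
toy = pd.DataFrame(columns=["Order Date", "Ship Date", "Sales"])

# Convert the Index object to a plain Python list.
column_names = toy.columns.tolist()

# Check whether a column exists before using it.
has_profit = "Profit" in toy.columns

# List the columns we expected but did not find.
missing = [name for name in ["Sales", "Profit"] if name not in toy.columns]

print(column_names, has_profit, missing)
```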

## Columns Data Type

Before applying operators to the columns of a dataset, you often need to make sure you're working with the correct data types. For example, you can't do date arithmetic on a date column while it is still stored as plain strings.

- *`df_orders`* = the name of the variable that will be used throughout the examples in this tutorial.
- *`.dtypes`* = the attribute that holds the data type of each column (no parentheses).
- *`.info()`* = the method used below; it prints the data types together with the non-null count of each column.

In [53]:
# Let's print the columns data types.
df_orders.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10800 entries, CA-2017-152156 to US-2018-155999
Data columns (total 19 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Order Date     9994 non-null   object 
 1   Ship Date      9994 non-null   object 
 2   Ship Mode      9994 non-null   object 
 3   Customer ID    9994 non-null   object 
 4   Customer Name  9994 non-null   object 
 5   Segment        9994 non-null   object 
 6   Country        9994 non-null   object 
 7   City           9994 non-null   object 
 8   State          9994 non-null   object 
 9   Postal Code    9983 non-null   float64
 10  Region         9994 non-null   object 
 11  Product ID     9994 non-null   object 
 12  Category       9994 non-null   object 
 13  Sub-Category   9994 non-null   object 
 14  Product Name   9994 non-null   object 
 15  Sales          9994 non-null   float64
 16  Quantity       9994 non-null   float64
 17  Discount       9994 non-null   fl
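If you only need the data types without the rest of the `.info()` report, the `.dtypes` attribute returns them as a Series, and `.select_dtypes` filters columns by type. A sketch on a made-up three-column frame:

```python
import pandas as pd

# Hypothetical frame: one text column holding dates and two float columns.
toy = pd.DataFrame({
    "Order Date": ["11/8/2017", "6/12/2017"],
    "Postal Code": [42420.0, 90036.0],
    "Sales": [261.96, 14.62],
})

# One data type per column, indexed by column name.
print(toy.dtypes)

# Keep only the numeric columns.
numeric_columns = toy.select_dtypes(include="number").columns.tolist()
print(numeric_columns)
```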

## Columns Modification
### Renaming Columns For Clarity.

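The following code uses these pieces:
- *`df_orders_edit`* = a working copy of `df_orders` in which `'Order ID'` is a regular column again, so that it has 20 columns to rename.
- *`.reset_index()`* = the method that moves the index back into a regular column and returns a new DataFrame.
- *`.columns`* = the attribute we assign to in order to rename all the columns at once. The list must contain exactly one name per column.

In [None]:
# Let's rename the columns on a working copy, with 'Order ID' back as a regular column.
df_orders_edit = df_orders.reset_index()
df_orders_edit.columns = ['OrderID', 'OrderDate', 'ShipDate', 'ShipMode', 'CustomerID', 'CustomerName', 'Segment' , 'Country', 'City', 'State', 'PostalCode', 'Region', 'ProductID', 'Category', 'SubCategory', 'ProductName' , 'Sales', 'Quantity', 'Discount', 'Profit']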

In [51]:
df_orders

Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Customer Name,Segment,Country,City,State,Postal Code,Region,Product ID,Category,Sub-Category,Product Name,Sales,Quantity,Discount,Profit
CA-2017-152156,11/8/2017,11/11/2017,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420.0,South,FUR-BO-10001798,Furniture,Bookcases,Bush Somerset Collection Bookcase,261.96,2.0,0.0,41.9136
CA-2017-152156,11/8/2017,11/11/2017,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420.0,South,FUR-CH-10000454,Furniture,Chairs,"Hon Deluxe Fabric Upholstered Stacking Chairs,...",731.94,3.0,0.0,219.5820
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
US-2018-155999,,,,,,,,,,,,,,,,,,,
US-2018-155999,,,,,,,,,,,,,,,,,,,
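Assigning a full list to `.columns` only works when the list has exactly one name per column, in the right order. As an alternative sketch, `.rename` changes just the columns you name and leaves the rest alone. The frame below is a made-up three-column example.

```python
import pandas as pd

# Hypothetical one-row frame with three of the original column names.
toy = pd.DataFrame({
    "Order Date": ["11/8/2017"],
    "Sub-Category": ["Bookcases"],
    "Sales": [261.96],
})

# Rename selected columns with an old-name -> new-name mapping.
renamed = toy.rename(columns={"Order Date": "OrderDate", "Sub-Category": "SubCategory"})

# Or apply one rule to every column: strip spaces and hyphens.
stripped = toy.rename(columns=lambda name: name.replace(" ", "").replace("-", ""))

print(list(renamed.columns))
print(list(stripped.columns))
```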


### Some Columns Have the Wrong Data Type

The following code uses these pieces:

- *`df_orders_edit`* = the working copy with the renamed columns.
- *`.astype`* = the method that changes the data type of a column in the dataset.

In [None]:
# Let's change the data types of the following columns in the dataset.
# Both sides use df_orders_edit, since the renamed columns only exist there.
df_orders_edit['OrderDate'] = df_orders_edit['OrderDate'].astype('datetime64[ns]')
df_orders_edit['ShipDate'] = df_orders_edit['ShipDate'].astype('datetime64[ns]')
df_orders_edit['PostalCode'] = df_orders_edit['PostalCode'].astype('object')

Now let's recheck the data types, for example with `df_orders_edit.dtypes` or `df_orders_edit.info()`, to see if the code worked as intended.
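Here is a sketch of such a recheck on a made-up frame. It swaps in `pd.to_datetime` with an explicit `format`, which is an alternative to `.astype('datetime64[ns]')` rather than the approach used above. The month/day/year format is an assumption based on the sample rows shown earlier.

```python
import pandas as pd

# Hypothetical frame with date strings and one missing value, like the empty rows in the dataset.
toy = pd.DataFrame({"OrderDate": ["11/8/2017", "6/12/2017", None]})

# Parse the strings as month/day/year; missing values become NaT.
toy["OrderDate"] = pd.to_datetime(toy["OrderDate"], format="%m/%d/%Y")

# Recheck: the column should now report a datetime data type.
print(toy.dtypes)
is_datetime = pd.api.types.is_datetime64_any_dtype(toy["OrderDate"])
n_missing = int(toy["OrderDate"].isna().sum())
print(is_datetime, n_missing)
```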

### Let's try to get some decent statistic figures from the dataset.
The following code uses these pieces:
- `df_orders` = the name of the variable that will be used throughout the examples in this tutorial.
- `.describe()` = the method that pulls summary statistics out of the dataset.
- By default, `.describe()` summarizes only the numerical columns, not the categorical ones.
- With `include='all'`, it covers both numerical and categorical columns.

In [54]:
# Describing statistical information on the dataset
df_orders.describe()

Unnamed: 0,Postal Code,Sales,Quantity,Discount,Profit
count,9983.000000,9994.000000,9994.000000,9994.000000,9994.000000
mean,55245.233297,229.858001,3.789574,0.156203,28.656896
...,...,...,...,...,...
75%,90008.000000,209.940000,5.000000,0.200000,29.364000
max,99301.000000,22638.480000,14.000000,0.800000,8399.976000


In [55]:
# Describing more statistical information on the dataset
df_orders.describe(include='all')

Unnamed: 0,Order Date,Ship Date,Ship Mode,Customer ID,Customer Name,Segment,Country,City,State,Postal Code,Region,Product ID,Category,Sub-Category,Product Name,Sales,Quantity,Discount,Profit
count,9994,9994,9994,9994,9994,9994,9994,9994,9994,9983.0,9994,9994,9994,9994,9994,9994.00,9994.0,9994.0,9994.000
unique,1236,1334,4,793,793,3,1,531,49,,4,1862,3,17,1850,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
75%,,,,,,,,,,90008.0,,,,,,209.94,5.0,0.2,29.364
max,,,,,,,,,,99301.0,,,,,,22638.48,14.0,0.8,8399.976
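To make the difference between numerical and categorical summaries easier to see, here is a sketch on a made-up two-column frame. For a text column, `.describe()` reports `count`, `unique`, `top` (the most frequent value), and `freq` (how often it appears).

```python
import pandas as pd

# Hypothetical frame with one categorical and one numerical column.
toy = pd.DataFrame({
    "Category": ["Furniture", "Furniture", "Office Supplies"],
    "Sales": [261.96, 731.94, 14.62],
})

# Default: only the numerical column is summarized.
numeric_summary = toy.describe()
print(numeric_summary)

# Describing a single text column gives count / unique / top / freq.
category_summary = toy["Category"].describe()
print(category_summary)
```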


### Let's try to get individual statistics for a single column.

The following code uses these pieces:

- *`df_orders`* = the name of the variable that will be used throughout the examples in this tutorial.
- *`.count()`* = the number of non-null values in a specific column.
- *`.mean()`* = the mean value of a specific column.
- *`.std()`* = the standard deviation of a specific column.
- *`.min()`* = the minimum value of a specific column.

In [56]:
df_orders["Sales"].count()

9994

In [57]:
df_orders["Sales"].mean()

229.8580008304938

In [58]:
df_orders["Sales"].std()

623.2451005086818

In [59]:
df_orders["Sales"].min()

0.444
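The four calls above can also be collected into a single result with `.agg`, which takes a list of statistic names. A sketch on a made-up `Sales` Series of five values:

```python
import pandas as pd

# Hypothetical Sales values taken from the first rows shown earlier.
sales = pd.Series([261.96, 731.94, 14.62, 957.5775, 22.368], name="Sales")

# Compute several statistics in one call; the result is a Series indexed by statistic name.
summary = sales.agg(["count", "mean", "std", "min"])
print(summary)
```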

# Exporting Dataset

Once we're satisfied with our results, let's export them to a new CSV dataset, so we can work with them in the next notebook.

- *`df_orders_edit`* = the edited DataFrame with the renamed columns and corrected data types.
- *`.to_csv`* = the method that exports the DataFrame to a CSV file.
- *`index=False`* = we pass this so that the index is not written to the file as an extra column.

### Let's export the edited data to df_orders_exported.csv

In [None]:
df_orders_edit.to_csv('data/df_orders_exported.csv', index=False)
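To see what `index=False` changes, the sketch below writes a made-up two-row frame to an in-memory buffer instead of a file, then reads it back. With the default `index=True`, the header gains an unnamed first column for the index.

```python
import io

import pandas as pd

# Hypothetical two-row frame standing in for df_orders_edit.
toy = pd.DataFrame({
    "OrderID": ["CA-2017-152156", "CA-2017-138688"],
    "Sales": [261.96, 14.62],
})

# Export without the index, then read the result back.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)
print(list(restored.columns))

# With the default index=True, the first header field is empty (the unnamed index).
with_index_header = toy.to_csv().splitlines()[0]
print(with_index_header)
```

Keep in mind that `index=False` also discards a meaningful index, so if a column such as the order ID is serving as the index, move it back into a regular column before exporting.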