## Setting up the environment

Before we can use any of the tools Python has to offer, we need to set up the environment. I will assume you have installed Python and are ready to go. If you have not, look at Anaconda, which offers an easy way to install Python along with the tools you will need.

Once these are installed, we need to import the specific tools we will use, as I have below.

Pandas lets us import and export Excel and CSV files, and gives us **dataframes**, Python's version of spreadsheet-like data.

NumPy gives us additional tools and methods for analyzing numbers and arrays. This will prove helpful as we move forward in our analysis.

In [74]:
import pandas as pd
import numpy as np

## Load the data

Loading the data is often the first step in analyzing it. With tools such as Excel that have graphical interfaces, this task is easy: you browse your directories for the file you need and, once you find it, double-click to open it.

When using command-line tools or programmable notebooks, you need to locate your file using text.

First I will find out which directory Python is currently in.

In [77]:
! pwd

/Users/joshua


It is currently in my home directory. I keep all of my projects in a folder called *Development*. This particular project is *business-analysis*. Within that folder I put all the data files I will be using in a *Data* folder.

So all I have to do is move Python to where my data is.

In [78]:
%cd ./Development/business-analysis/Data

/Users/joshua/Development/business-analysis/Data
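
The `!pwd` and `%cd` commands above work inside a notebook on macOS or Linux. As a hedged alternative, the same locations can be described in plain Python with the standard-library `pathlib` module. This is only a sketch: the folder layout mirrors the one described here and is an assumption about your own machine, and the names `current_dir`, `data_dir`, and `csv_path` are illustrative.

```python
from pathlib import Path

# Where is Python running right now? (the plain-Python equivalent of `!pwd`)
current_dir = Path.cwd()

# Assumed layout, mirroring the folders used in this walkthrough.
# Adjust these names to match your own project.
data_dir = Path.home() / "Development" / "business-analysis" / "Data"
csv_path = data_dir / "sales_data_sample.csv"

print(current_dir)
print(csv_path)
```

Building a full path like `csv_path` means you can pass it straight to pandas without changing directories first.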


Now that I am there, let's see what files I have available.

In [79]:
!ls

sales_data_sample.csv  sales_data_sample.xlsx
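
Listing files can also be done without shell commands. The sketch below uses a temporary folder with two empty placeholder files standing in for my *Data* folder, so it runs anywhere. The file names match the ones above, but the folder itself is made up for the example.

```python
import tempfile
from pathlib import Path

# Stand-in for the Data folder: a temporary directory with two empty files.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    (data_dir / "sales_data_sample.csv").touch()
    (data_dir / "sales_data_sample.xlsx").touch()

    # Equivalent of `!ls`: collect every file name, sorted alphabetically.
    file_names = sorted(p.name for p in data_dir.iterdir())

    # Only the CSV files, using a glob pattern.
    csv_files = sorted(p.name for p in data_dir.glob("*.csv"))

print(file_names)
print(csv_files)
```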


I have a CSV file and an Excel file available, both full of sales data. Listing them above makes them easier to reference when I import them below.

I will show how to import both a CSV and an Excel file. With the Excel file there is an extra step of selecting which sheet I want to load into the dataframe.

To load a CSV file, you would do the following:

In [80]:
df_2 = pd.read_csv('sales_data_sample.csv')

df_2 now has the sample data loaded. 
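
`pd.read_csv` also accepts optional arguments that control how columns are interpreted. The sketch below reads two rows of made-up CSV text from memory instead of the real file, so the column layout and the date format are assumptions rather than a description of `sales_data_sample.csv`.

```python
import io

import pandas as pd

# Two illustrative rows standing in for the real file.
csv_text = (
    "ORDERNUMBER,QUANTITYORDERED,PRICEEACH,ORDERDATE\n"
    "10107,30,95.70,2003-02-24\n"
    "10121,34,81.35,2003-05-07\n"
)

sample = pd.read_csv(
    io.StringIO(csv_text),       # a file path works here too
    parse_dates=["ORDERDATE"],   # convert this column to datetimes
)

print(sample.dtypes)
```

If a real file is not UTF-8 encoded, `read_csv` also takes an `encoding` argument. Which value you need depends on how the file was saved.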

For Excel, you would load the file like this:

In [52]:
xlsx = pd.ExcelFile('sales_data_sample.xlsx')
xlsx

<pandas.io.excel.ExcelFile at 0x1a1ed4e908>

You can then see which sheets the file contains.

In [53]:
sheets = xlsx.sheet_names
sheets

['sales_data_sample']

To create the dataframe, we simply read the Excel file and specify the sheet.

By calling `.head()` on the dataframe, we can see the first five rows of the data. This gives us some context as to what we are looking at, much like Excel would if you opened a file.

In [95]:
df = pd.read_excel(xlsx, sheets[0])
df.head()

| | ORDERNUMBER | QUANTITYORDERED | PRICEEACH | ORDERLINENUMBER | SALES | ORDERDATE | STATUS | QTR_ID | MONTH_ID | YEAR_ID | ... | ADDRESSLINE1 | ADDRESSLINE2 | CITY | STATE | POSTALCODE | COUNTRY | TERRITORY | CONTACTLASTNAME | CONTACTFIRSTNAME | DEALSIZE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10107 | 30 | 95.7 | 2 | 2871.0 | 2003-02-24 | Shipped | 1 | 2 | 2003 | ... | 897 Long Airport Avenue | | NYC | NY | 10022.0 | USA | | Yu | Kwai | Small |
| 1 | 10121 | 34 | 81.35 | 5 | 2765.9 | 2003-05-07 | Shipped | 2 | 5 | 2003 | ... | 59 rue de l'Abbaye | | Reims | | 51100.0 | France | EMEA | Henriot | Paul | Small |
| 2 | 10134 | 41 | 94.74 | 2 | 3884.34 | 2003-07-01 | Shipped | 3 | 7 | 2003 | ... | 27 rue du Colonel Pierre Avia | | Paris | | 75508.0 | France | EMEA | Da Cunha | Daniel | Medium |
| 3 | 10145 | 45 | 83.26 | 6 | 3746.7 | 2003-08-25 | Shipped | 3 | 8 | 2003 | ... | 78934 Hillside Dr. | | Pasadena | CA | 90003.0 | USA | | Young | Julie | Medium |
| 4 | 10159 | 49 | 100.0 | 14 | 5205.27 | 2003-10-10 | Shipped | 4 | 10 | 2003 | ... | 7734 Strong St. | | San Francisco | CA | | USA | | Brown | Julie | Medium |


One of the reasons I have moved to doing business analysis in Python rather than Excel is the speed with which I can gather and understand the data. In short: no scrolling, clicking, or searching!

Let's get a quick view of what information our sales data contains.

In [58]:
df.columns

Index(['ORDERNUMBER', 'QUANTITYORDERED', 'PRICEEACH', 'ORDERLINENUMBER',
       'SALES', 'ORDERDATE', 'STATUS', 'QTR_ID', 'MONTH_ID', 'YEAR_ID',
       'PRODUCTLINE', 'MSRP', 'PRODUCTCODE', 'CUSTOMERNAME', 'PHONE',
       'ADDRESSLINE1', 'ADDRESSLINE2', 'CITY', 'STATE', 'POSTALCODE',
       'COUNTRY', 'TERRITORY', 'CONTACTLASTNAME', 'CONTACTFIRSTNAME',
       'DEALSIZE'],
      dtype='object')

And how big it is (how many rows and columns):

In [59]:
df.shape

(2823, 25)
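
Since `.shape` returns a `(rows, columns)` tuple, it can be unpacked into two named values. The sketch below uses a three-row toy dataframe (not the real sales data) and also counts the missing entries per column, which is a quick way to spot blank cells like the ones visible in the address columns above.

```python
import pandas as pd

# Toy stand-in for the sales dataframe.
toy = pd.DataFrame({
    "ORDERNUMBER": [10107, 10121, 10134],
    "QUANTITYORDERED": [30, 34, 41],
    "STATE": ["NY", None, None],
})

# .shape is a (rows, columns) tuple, so it unpacks cleanly.
n_rows, n_cols = toy.shape
print(f"{n_rows} rows x {n_cols} columns")

# Number of missing values in each column.
missing = toy.isna().sum()
print(missing)
```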

From a business perspective, some of the first things I am wondering are:

1. How many customers are in this database?
2. How many products are covered here?
3. How many individual orders have been placed in total?

These questions can all be answered by counting the number of unique entries in each column.

In [87]:
df['CUSTOMERNAME'].nunique()

92

In [88]:
df['PRODUCTCODE'].nunique()

109

In [89]:
df['ORDERNUMBER'].nunique()

307
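
The three counts above can also be gathered in one call, because `.nunique()` works on a dataframe as well as on a single column. Here is a minimal sketch on toy data. The rows are illustrative, so the counts will not match the real figures.

```python
import pandas as pd

# Toy rows: two customers, two products, two orders.
toy = pd.DataFrame({
    "CUSTOMERNAME": ["Land of Toys Inc.", "Reims Collectables", "Land of Toys Inc."],
    "PRODUCTCODE": ["S10_1678", "S10_1678", "S10_1949"],
    "ORDERNUMBER": [10107, 10121, 10107],
})

# Select several columns and count the distinct values in each one.
counts = toy[["CUSTOMERNAME", "PRODUCTCODE", "ORDERNUMBER"]].nunique()
print(counts)
```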

Next I want to pare the dataframe down to the data I want for my analysis. For this round, I am going to assume I want the customer, order, product, quantity, and price.

I can create a new dataframe called `data` that contains just a few of the columns from the original dataframe, `df`.

In [91]:
data = df[['CUSTOMERNAME','ORDERNUMBER','PRODUCTCODE','QUANTITYORDERED','PRICEEACH']].copy()
data.head()

| | CUSTOMERNAME | ORDERNUMBER | PRODUCTCODE | QUANTITYORDERED | PRICEEACH |
|---|---|---|---|---|---|
| 0 | Land of Toys Inc. | 10107 | S10_1678 | 30 | 95.7 |
| 1 | Reims Collectables | 10121 | S10_1678 | 34 | 81.35 |
| 2 | Lyon Souveniers | 10134 | S10_1678 | 41 | 94.74 |
| 3 | Toys4GrownUps.com | 10145 | S10_1678 | 45 | 83.26 |
| 4 | Corporate Gift Ideas Co. | 10159 | S10_1678 | 49 | 100.0 |
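
The `.copy()` at the end of the selection makes the new dataframe independent of the original, so later changes to one do not touch the other. It also avoids the `SettingWithCopyWarning` that some pandas versions emit when you add columns to a selection. The sketch below shows the independence on toy data, with `original` and `subset` as illustrative names.

```python
import pandas as pd

original = pd.DataFrame({
    "CUSTOMERNAME": ["Land of Toys Inc.", "Reims Collectables"],
    "QUANTITYORDERED": [30, 34],
    "STATUS": ["Shipped", "Shipped"],
})

# Keep two of the three columns as an independent copy.
subset = original[["CUSTOMERNAME", "QUANTITYORDERED"]].copy()

# Changing the copy leaves the original untouched.
subset["QUANTITYORDERED"] = subset["QUANTITYORDERED"] * 2

print(original["QUANTITYORDERED"].tolist())
print(subset["QUANTITYORDERED"].tolist())
```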


I also want to create a column that tells me the total spent on each product for each order. So if they ordered 2 of a $10 product, I want to know that it was a total of $20 for that product on that order.

It is quite easy to create columns on the fly and add them. The simplicity of Python compared with dragging, dropping, ctrl-ing, and dealing with other problems in Excel makes it a much easier tool for analysis.


In [93]:
data['extended_price'] = (data.QUANTITYORDERED * data.PRICEEACH)
data.head()

| | CUSTOMERNAME | ORDERNUMBER | PRODUCTCODE | QUANTITYORDERED | PRICEEACH | extended_price |
|---|---|---|---|---|---|---|
| 0 | Land of Toys Inc. | 10107 | S10_1678 | 30 | 95.7 | 2871.0 |
| 1 | Reims Collectables | 10121 | S10_1678 | 34 | 81.35 | 2765.9 |
| 2 | Lyon Souveniers | 10134 | S10_1678 | 41 | 94.74 | 3884.34 |
| 3 | Toys4GrownUps.com | 10145 | S10_1678 | 45 | 83.26 | 3746.7 |
| 4 | Corporate Gift Ideas Co. | 10159 | S10_1678 | 49 | 100.0 | 4900.0 |
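
To check the arithmetic against the 2 x $10 example, here is a small sketch on a toy dataframe. The first row is the made-up example and the second reuses the quantity and price from row 0 above. It uses bracket notation for the columns, which behaves the same as the dot notation and also works for column names that contain spaces.

```python
import pandas as pd

# Row 0: the "2 of a $10 product" example. Row 1: values from the sample data.
lines = pd.DataFrame({
    "QUANTITYORDERED": [2, 30],
    "PRICEEACH": [10.0, 95.7],
})

# Multiply the two columns row by row to get the line total.
lines["extended_price"] = lines["QUANTITYORDERED"] * lines["PRICEEACH"]
print(lines)
```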


The last thing I will do in prepping the data for analysis is to rename all the columns to names I am more familiar with and that I use for all my consulting projects.

In [94]:
data.columns = ['customer','order_number','part_number','quantity_sold','sell_price','item_total']
data.head()

| | customer | order_number | part_number | quantity_sold | sell_price | item_total |
|---|---|---|---|---|---|---|
| 0 | Land of Toys Inc. | 10107 | S10_1678 | 30 | 95.7 | 2871.0 |
| 1 | Reims Collectables | 10121 | S10_1678 | 34 | 81.35 | 2765.9 |
| 2 | Lyon Souveniers | 10134 | S10_1678 | 41 | 94.74 | 3884.34 |
| 3 | Toys4GrownUps.com | 10145 | S10_1678 | 45 | 83.26 | 3746.7 |
| 4 | Corporate Gift Ideas Co. | 10159 | S10_1678 | 49 | 100.0 | 4900.0 |
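
Assigning a list to `data.columns` relies on the new names being in exactly the same order as the existing columns. A hedged alternative is `DataFrame.rename` with an explicit old-to-new mapping, which does not depend on column order. The sketch below uses a one-row toy dataframe, and `column_names` and `renamed` are illustrative names.

```python
import pandas as pd

# One-row stand-in for the `data` dataframe built above.
data = pd.DataFrame({
    "CUSTOMERNAME": ["Land of Toys Inc."],
    "ORDERNUMBER": [10107],
    "PRODUCTCODE": ["S10_1678"],
    "QUANTITYORDERED": [30],
    "PRICEEACH": [95.7],
    "extended_price": [2871.0],
})

# Explicit mapping from each old column name to its new name.
column_names = {
    "CUSTOMERNAME": "customer",
    "ORDERNUMBER": "order_number",
    "PRODUCTCODE": "part_number",
    "QUANTITYORDERED": "quantity_sold",
    "PRICEEACH": "sell_price",
    "extended_price": "item_total",
}

# rename returns a new dataframe and leaves `data` unchanged.
renamed = data.rename(columns=column_names)
print(renamed.columns.tolist())
```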
