First, we import our libraries as usual.

In [1]:
import numpy as np
import pandas as pd

Next, we need to find out where we are and locate our data.

In [2]:
pwd

'/Users/joshua/Development/business-analysis/Tutorials'

In [3]:
ls ../Data

BlackFriday.csv*		sales_data_sample.csv
brazilian-ecommerce/		sales_data_sample.xlsx
customer_data.xlsx		sales_data_sample_no_customer.xlsx
government_purchase_orders.csv	san_francisco_purchase_data.csv
part_usage_trailing_12.xlsx	total-business-inventories-and-sales-data/
purchases_by_vendor.xlsx


As we showed in the "Loading Data for Analysis" tutorial, we now need to load the data we will be working with.

In [5]:
sales_sheet = pd.read_excel('../Data/sales_data_sample.xlsx')

... and let's take a look to make sure we got what we were expecting...

In [6]:
sales_sheet.head()

Unnamed: 0,ORDERNUMBER,QUANTITYORDERED,PRICEEACH,ORDERLINENUMBER,SALES,ORDERDATE,STATUS,QTR_ID,MONTH_ID,YEAR_ID,...,ADDRESSLINE1,ADDRESSLINE2,CITY,STATE,POSTALCODE,COUNTRY,TERRITORY,CONTACTLASTNAME,CONTACTFIRSTNAME,DEALSIZE
0,10107,30,95.7,2,2871.0,2003-02-24,Shipped,1,2,2003,...,897 Long Airport Avenue,,NYC,NY,10022.0,USA,,Yu,Kwai,Small
1,10121,34,81.35,5,2765.9,2003-05-07,Shipped,2,5,2003,...,59 rue de l'Abbaye,,Reims,,51100.0,France,EMEA,Henriot,Paul,Small
2,10134,41,94.74,2,3884.34,2003-07-01,Shipped,3,7,2003,...,27 rue du Colonel Pierre Avia,,Paris,,75508.0,France,EMEA,Da Cunha,Daniel,Medium
3,10145,45,83.26,6,3746.7,2003-08-25,Shipped,3,8,2003,...,78934 Hillside Dr.,,Pasadena,CA,90003.0,USA,,Young,Julie,Medium
4,10159,49,100.0,14,4900.0,2003-10-10,Shipped,4,10,2003,...,7734 Strong St.,,San Francisco,CA,,USA,,Brown,Julie,Medium


To keep the examples simple, I am going to isolate just the **SALES** column of the dataframe and create a numpy array from it, so that we can call numpy's aggregation functions on it directly.

In [7]:
sales = np.array(sales_sheet['SALES'])
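As a side note, pandas also offers a **to_numpy** method on a column, which should give the same kind of array as wrapping the column in **np.array**. The sketch below uses a tiny made-up dataframe (`demo_sheet` is a hypothetical stand-in, not the real sales sheet) to compare the two.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for sales_sheet; the values are for illustration only
demo_sheet = pd.DataFrame({"SALES": [2871.0, 2765.9, 3884.34]})

# Two ways to pull a column out as a numpy array
from_np_array = np.array(demo_sheet["SALES"])
from_to_numpy = demo_sheet["SALES"].to_numpy()

print(type(from_to_numpy))
print(np.array_equal(from_np_array, from_to_numpy))
```

Either form works for the aggregation calls in this tutorial.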

In order to **sum** the column...

In [30]:
np.sum(sales)

8290886.789999999

In order to find the **max** of the column...

In [31]:
np.max(sales)

9048.16

In order to find the **min** of the column...

In [8]:
np.min(sales)

482.13

Some of the aggregation functions can also be called directly on the array itself, as methods, like this:

In [9]:
sales.sum()

8290886.789999999

In [10]:
sales.max()

9048.16

In [53]:
sales.min()

482.13

You can also get *where* the min and max are by prefixing the function name with arg (**argmin**, **argmax**). These return the index at which the min or max was found.

In [54]:
sales.argmin()

2249

... and just to test it, let's call the sales figure at that location...

In [55]:
sales[2249]

482.13

Yup!
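Here is the same round trip on a small made-up array, so the indices are easy to check by eye. `demo_sales` is a hypothetical array for illustration, not the real column. If the smallest (or largest) value appears more than once, numpy returns the index of its first occurrence.

```python
import numpy as np

# Hypothetical sales figures, for illustration only
demo_sales = np.array([2871.0, 482.13, 9048.16, 3884.34])

lowest_idx = demo_sales.argmin()    # position of the smallest value
highest_idx = demo_sales.argmax()   # position of the largest value

# Indexing with the result gives back the min and max themselves
print(lowest_idx, demo_sales[lowest_idx])
print(highest_idx, demo_sales[highest_idx])
```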

Let's look at some more complex aggregation functions.

Let's say you didn't have the **sales** column, only the quantity and price for each line item. Numpy combines arrays "element-wise", so you can just multiply a **quantity** array by a **price** array to get the sales for each line item.

In [11]:
quantity = np.array(sales_sheet['QUANTITYORDERED'])
prices = np.array(sales_sheet['PRICEEACH'])

In [12]:
sales_2 = quantity * prices

... and just to test, we can take the sum to see if it is the same as above.

In [13]:
sales_2.sum()

8290886.789999999
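To see the element-wise behaviour on its own, here is a minimal sketch using the quantities and prices from the first three rows shown in the head of the sheet (the `demo_` names are hypothetical). Because these are floating-point numbers, a tolerance-based check such as **np.allclose** is usually a safer way to compare results than exact equality.

```python
import numpy as np

# Quantities and unit prices from the first three rows shown earlier
demo_quantity = np.array([30, 34, 41])
demo_prices = np.array([95.70, 81.35, 94.74])

# Each position is multiplied with the matching position in the other array
demo_line_totals = demo_quantity * demo_prices
print(demo_line_totals)

# Compare against the SALES values from those rows, allowing for float rounding
expected_sales = np.array([2871.0, 2765.9, 3884.34])
print(np.allclose(demo_line_totals, expected_sales))
```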

You can also grab the **mean**, **median**, and **standard deviation**.

In [14]:
sales.mean()

2936.9064080765143

In [15]:
np.median(sales)

2800.0

In [52]:
np.std(sales)

1105.44843850813
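One thing to keep in mind: by default **np.std** computes the population standard deviation (it divides by n). If you want the sample standard deviation (dividing by n - 1), you can pass `ddof=1`. The sketch below shows the difference on a small made-up array (`demo_values` is hypothetical).

```python
import numpy as np

# Hypothetical values chosen so the population result is a round number
demo_values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

population_std = np.std(demo_values)        # default ddof=0, divides by n
sample_std = np.std(demo_values, ddof=1)    # divides by n - 1

print(population_std)
print(sample_std)
```

This matters if you compare against pandas, whose **std** method on a column defaults to `ddof=1`.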

In order to grab the cutoff for a percentile, you can simply use the **percentile** function for numpy arrays.

In [44]:
np.percentile(sales, 20)

1995.9279999999999
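The cutoff does not have to be a value that actually appears in the data. With its default settings, **np.percentile** interpolates linearly between the two nearest values. A minimal sketch on a made-up array (`demo_sales` is hypothetical):

```python
import numpy as np

# Hypothetical, evenly spaced sales figures
demo_sales = np.array([1000.0, 2000.0, 3000.0, 4000.0, 5000.0])

# The 20th percentile falls between the first two values
cutoff_20 = np.percentile(demo_sales, 20)
print(cutoff_20)

# You can also ask for several percentiles in one call
cutoffs = np.percentile(demo_sales, [20, 50, 80])
print(cutoffs)
```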

Then, to keep every sale at or above that 20th-percentile cutoff (roughly the top 80% of sales figures), you could run...

In [46]:
sales[sales >= np.percentile(sales,20)]

array([2871.  , 2765.9 , 3884.34, ..., 4300.  , 2116.16, 3079.44])
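The comparison inside the brackets builds a boolean mask, and indexing with that mask keeps only the elements where it is True. The sketch below, on ten made-up values (`demo_sales` is hypothetical), shows how many elements each comparison keeps, and how you might use the 80th percentile instead if you want only the top 20% or so.

```python
import numpy as np

# Ten hypothetical sales figures: 100, 200, ..., 1000
demo_sales = np.array([100.0, 200.0, 300.0, 400.0, 500.0,
                       600.0, 700.0, 800.0, 900.0, 1000.0])

cutoff_20 = np.percentile(demo_sales, 20)
at_or_above = demo_sales[demo_sales >= cutoff_20]   # upper part of the data
below = demo_sales[demo_sales < cutoff_20]          # lower part of the data
print(len(at_or_above), len(below))

# For the highest values only, compare against the 80th percentile
cutoff_80 = np.percentile(demo_sales, 80)
top_20 = demo_sales[demo_sales >= cutoff_80]
print(top_20)
```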

... and to get the sales below the 20th-percentile cutoff (roughly the bottom 20%)...

In [47]:
sales[sales < np.percentile(sales,20)]

array([1451.  ,  733.11, 1329.9 , 1822.17, 1201.25, 1962.22, 1900.  ,
       1735.3 , 1200.  , 1888.26, 1466.91, 1809.5 , 1000.  , 1800.24,
       1340.64, 1762.08, 1628.  , 1939.2 , 1484.2 , 1685.42, 1627.92,
       1517.88, 1749.79, 1958.88, 1960.14, 1167.25, 1516.62, 1746.6 ,
       1742.4 , 1423.29, 1504.12, 1164.4 , 1500.75, 1557.36, 1345.68,
       1795.24, 1105.25, 1237.95, 1593.02, 1320.75, 1293.75, 1449.76,
       1611.4 , 1364.25, 1262.8 , 1626.24, 1339.8 , 1930.39, 1832.6 ,
        974.1 , 1746.63, 1463.7 , 1608.  , 1496.64, 1879.74, 1495.26,
       1643.12, 1322.16, 1423.8 , 1574.  , 1729.  , 1834.5 , 1331.1 ,
       1236.84,  728.4 ,  600.  , 1189.98, 1667.4 , 1859.7 , 1476.6 ,
       1732.  , 1674.17, 1500.98, 1797.58, 1701.7 , 1694.  ,  935.18,
       1490.16, 1824.72, 1570.17, 1482.6 , 1853.4 , 1490.1 , 1605.  ,
       1944.3 , 1986.8 , 1695.96, 1281.56, 1774.22,  981.2 , 1550.72,
       1771.06, 1864.8 , 1142.41, 1172.6 , 1264.08,  990.78, 1687.4 ,
       1308.  , 1448