[View in Colaboratory](https://colab.research.google.com/github/ming-zhao/Business-Analytics-Course/blob/master/Python_Basics.ipynb)

# Python Basics for Business Analytics
Prepared by: Ming Zhao, Ph.D.

This notebook provides an introduction to the programming language Python, primarily oriented towards functions that are useful for business analytic applications. 

> What is the primary programming language used in artificial intelligence (data analytics, data mining, etc.)?

>Python is an interpreted high-level programming language for general-purpose programming. 

>More about Python language can be found in https://en.wikipedia.org/wiki/Python_(programming_language%29

The following table of contents lists the topics discussed in this notebook. Clicking on any topic will advance the notebook to the associated area.

#### Disclaimer

This notebook by no means represents a *comprehensive* overview of the Python programming language. Instead, it provides basic details on libraries and operations that are useful for addressing problems in operations management. Ultimately, this notebook aims to provide enough details so that a beginning user can familiarize themselves with the capabilities of the Python programming language. 

Also, it is important to realize that the Python language and the available libraries will continue to evolve. As a result, the objects, functions, and methods described in this notebook may one day change. If changes occur, areas of this notebook that use deprecated features may cease to work and will need to be revised or omitted.

# Python Basics
<a id="Python Basics"> </a>

### Getting Help

You can find information regarding Python functions using the built-in `help()` function.

In [0]:
help(print)

You can also find information regarding functions and attributes of imported libraries. The following cell block provides an example that shows how to find help for the `std()` function that is part of the NumPy library. We will explore the NumPy library in more detail later in this notebook.

In [0]:
import numpy as np
help(np.std)

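Finally, in a Jupyter notebook, pressing `<SHIFT>+<TAB>` with the cursor inside the argument list of a Python function will bring up help on the associated function. Some environments, such as Colab, may instead use `<TAB>` or show the help automatically.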

### The importance of spacing 

If you are familiar with other programming languages, you are likely used to using some form of braces to indicate nesting. For example, a loop that prints the integers in the interval [1, 10] in C++ may be written as:



```
#include <iostream>
using namespace std;

int main() {
    for (int i = 1; i < 11; i++) {
        int value = i;          // declare the variable before using it
        cout << value << endl;
    }
    return 0;
}
```

The key things to note are the use of braces to define the statements that are nested in the `for` loop and the use of semicolons to indicate the end of a statement. In Python, nesting is indicated by indentation. Most Python editors will attempt to *anticipate* the indentation that is needed. However, if you get an `IndentationError` (for example, *unexpected indent*), you should double-check your spacing. The following code performs the same function as the C++ loop.

In [0]:
for i in range(1,11):
  value = i
  print(value)

### Lists

Lists are a versatile Python datatype. Lists can be initialized as empty, with sequences, or with comma-separated initialization values. Lists can be appended to or deleted from in loops and do not require that all values be of the same type. The following code blocks provide several examples of list initialization.

In [0]:
list1 = list(range(1,11))
print(list1)

In [0]:
list1 = [0, 1, 2, 3, 4, 5, 6, 7, 8]
print(list1)

In [0]:
list1 = [] # Creates an empty list
for i in range(20,31):
    list1.append(i)

print(list1)
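
The cells above only append to lists. As a minimal sketch of the other two properties mentioned earlier (deleting elements and mixing types), the following block uses a hypothetical list named `mixed` that is not part of the original examples.

```python
# Hypothetical example list holding values of different types.
mixed = [1, 'two', 3.0, [4, 5]]

mixed.append('six')      # add an element to the end
mixed.remove('two')      # delete the first element equal to 'two'
del mixed[0]             # delete the element at position 0
last = mixed.pop()       # remove and return the last element

print(mixed)
print(last)
```

Note that `remove()` deletes by value, whereas `del` and `pop()` delete by position.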

Lists can be multi-dimensional, but it is generally better to use other storage objects such as dictionaries or Pandas dataframes (both covered later in this notebook). For completeness, the following code block shows one method to create a simple two-dimensional list.

In [0]:
list1 = [[1,2],[3,4],[5,6],[7,8]]
print(list1)

### List comprehensions

In mathematics, it is common to see sets described as follows:
$$ S = \{x^{2} : x \in \{1, \ldots, 10\}\}.$$
This notation defines a set $S$ that contains the squares of the integers 1 through 10. In Python, we can use similar syntax to define the set as a list. The following code block demonstrates such syntax, which is referred to as a *list comprehension*.

**Note that the `range` function we use takes two arguments, *start* and *stop*, and returns all integers in the interval [*start*, *stop*).**

In [0]:
S = [x**2 for x in range(1,11)]
print(S)
del(S)
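
As a small sketch of the half-open interval used by `range`, and of how a comprehension can also filter, the block below adds an `if` clause. The name `even_squares` is an assumption made for illustration only.

```python
# range(start, stop) includes start but excludes stop.
first_four = list(range(1, 5))
print(first_four)

# Hypothetical extension: keep only the squares of even numbers.
even_squares = [x**2 for x in range(1, 11) if x % 2 == 0]
print(even_squares)
```

This mirrors the set-builder notation with an added condition on $x$.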

### Dictionaries

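Like lists, dictionaries are a versatile Python data object that can easily be changed. With respect to the differences between lists and dictionaries: 1) lists are ordered sequences of objects, whereas dictionaries are collections of key-value pairs whose order is ignored when two dictionaries are compared (since Python 3.7, dictionaries do preserve insertion order when iterated), 2) items in dictionaries are accessed via keys and not via their position, and 3) the values of a dictionary can be any Python data type, while the keys must be hashable (for example, strings or numbers). In short, dictionaries map keys to values.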

The following code block provides an example that is adapted from https://automatetheboringstuff.com/chapter5/ (accessed 1/9/2018) that clearly demonstrates the key differences between lists and dictionaries.

In [0]:
list1 = ['cats', 'dogs', 'moose']
list2 = ['dogs', 'moose', 'cats']
if (list1 == list2):
    print("The two lists are the same.\n")
else:
    print("The two lists are different.\n")

dict1 = {'name': 'Zophie', 'species': 'cat', 'age': '8'}
dict2 = {'species': 'cat', 'age': '8', 'name': 'Zophie'}
if (dict1 == dict2):
    print("The two dictionaries are the same.\n")
else:
    print("The two dictionaries are different.\n")

The important thing to note in the previous example is that the two lists are comprised of the same items, just in a different order, and the two dictionaries are also comprised of the same key-value pairs, just in different orders. However, Python interprets the lists as being different and the dictionaries as being equal. This clearly demonstrates the ordering differences between the two structures.

The following code block shows how to access elements of a dictionary by key.

In [0]:
dict1['name']
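
Beyond reading a value by key, dictionaries can be updated in place. The following sketch works on a hypothetical copy named `pet` so that `dict1` is left unchanged. The `color` and `owner` keys are invented for illustration.

```python
# Hypothetical copy of the example data, so dict1 itself is not modified.
pet = {'name': 'Zophie', 'species': 'cat', 'age': '8'}

pet['age'] = '9'                          # overwrite the value of an existing key
pet['color'] = 'gray'                     # add a new key-value pair
missing = pet.get('owner')                # get() returns None instead of raising KeyError
fallback = pet.get('owner', 'unknown')    # or a default value that you supply

print(pet)
print(missing, fallback)
```

Using `pet['owner']` directly would raise a `KeyError`, which is why `get()` is often convenient when a key may be absent.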

Dictionaries have three methods that allow for easy iteration over a dictionary, i.e., the `keys`, `values`, and `items` methods. The following code block demonstrates these methods.

In [0]:
print("The keys in dict1 are:")
for key in dict1.keys():
    print(key)
    
print("\nThe values in dict1 are:")
for value in dict1.values():
    print(value)
    
print("\nThe items (key-value pairs) in dict1 are:")
for item in dict1.items():
    print(item)

You can also use these methods with the `in` operator to easily search for keys and values in a dictionary as shown below.

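**Note that the `\"` escape sequences print literal quotation marks. Because the strings below are delimited by single quotes, the backslashes are optional here. They would be required if the strings were delimited by double quotes, since Python would otherwise interpret the inner quotes as the end of the string and produce an error.**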

In [0]:
print('Is the key \"name\" in dict1?','name' in dict1.keys())
print('Is the key \"cat\" in dict1?','cat' in dict1.keys())
print('Is the value \"cat\" in dict1?','cat' in dict1.values())

### 'New' string formatting style

The `format()` string method, introduced with Python 3 (and backported to Python 2.6), is a newer alternative to `%`-style formatting for mixing variables and static string content. It allows users to insert placeholders in a string and to fill these placeholders with the arguments passed to `format()`, which are referenced by position. The following code block demonstrates this formatting. Specifically, the code block defines two Python variables, one a string and the other an integer. The values of these variables are inserted into two strings according to their index in the argument list of the `format()` function. The escape sequence `\n` starts a new line.

In [0]:
first_variable = 'arg1'
second_variable = 1

mystring = 'My first variable is {0} and my second is {1}.\n'.format(first_variable,second_variable)
print(mystring)

mystring = 'My second variable is {1} and my first is {0}.'.format(first_variable,second_variable)
print(mystring)

del(mystring)

# NumPy Basics


From https://en.wikipedia.org/wiki/NumPy (accessed on 1/6/2018):

>NumPy is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays. The ancestor of NumPy, Numeric, was originally created by Jim Hugunin with contributions from several other developers. In 2005, Travis Oliphant created NumPy by incorporating features of the competing Numarray into Numeric, with extensive modifications. NumPy is open-source software and has many contributors.

This section of the notebook will demonstrate several features of the NumPy multi-dimensional array.

### Motivation for NumPy

The first example illustrates the motivation for NumPy. Specifically, NumPy was developed to support scientific computations via the efficient implementation of a multi-dimensional array. In addition to an efficient array implementation, NumPy also includes functions for performing operations on NumPy arrays that are optimized for computational efficiency. The following code block illustrates the substantial increase in efficiency that NumPy provides in comparison to a standard Python list. Specifically, the example considers the task of adding two vectors of a specified size using both standard Python lists and NumPy arrays. The time of the addition and the size of the resulting objects are reported for comparison purposes.

In [0]:
import numpy as np 
import time
import sys

SIZE = 250000

list1 = list(range(SIZE))  # list() turns the range object into an actual Python list
list2 = list(range(SIZE))

start = time.time()
result = [(x+y) for x,y in zip(list1,list2)]
print("Using Python lists, the addition took",(time.time() - start)*1000,"milliseconds.")
print("The size of the result object based on Python lists is",sys.getsizeof(result),"bytes.\n")

del(list1,list2,result)

nparray1 = np.arange(SIZE)
nparray2 = np.arange(SIZE)
start = time.time()
result = nparray1 + nparray2
print("Using NumPy arrays, the addition took",(time.time() - start)*1000,"milliseconds.")
print("The size of the result object based on NumPy arrays is",sys.getsizeof(result),"bytes.\n")

del(nparray1, nparray2, result)

Although the exact speed differences will vary based on the machine that is being used, it should be obvious that NumPy performs the addition much faster than standard Python lists. NumPy arrays also consume less system memory, although the reported sizes understate the difference: for a list, `sys.getsizeof()` counts only the list's internal array of references and not the integer objects that those references point to. **In general, whenever it is possible to use NumPy arrays for a task, they should be used!!!**

In addition to demonstrating the substantial performance gains offered by NumPy, the previous code block also illustrates some of the subtle differences between working with Python lists and NumPy arrays, and a method for checking the computation time that is required to execute a portion of code. Specifically:

- The `time.time()` function, from the `time` library, returns the current system time in seconds. Saving the value of the current time in a variable `start` and then computing the difference `time.time() - start` returns the number of seconds elapsed between the two calls to `time.time()`. Multiplying by 1000 converts the elapsed time to milliseconds.

- When working with Python lists, the `range()` function generates a sequence of integers starting at zero and ending just before the argument passed to `range()`. In our example, we pass a variable `SIZE` to the `range()` function. Thus, the sequence used in the example is 0, 1, ..., `SIZE`-2, `SIZE`-1.

- When working with NumPy arrays, the `np.arange()` function returns a sequence of integers starting at zero and ending just before the argument passed to `np.arange()`. In our example, we pass a variable `SIZE` to the `np.arange()` function. Thus, the sequence stored in the NumPy array is 0, 1, ..., `SIZE`-2, `SIZE`-1.

- The `sys.getsizeof()` function, from the `sys` library, returns the size of an object in bytes, not counting other objects that it merely references.

- When working with Python lists, the `zip()` function pairs up the elements of two or more sequences by position, which allows element-wise operations to be performed in a loop or comprehension.

- When working with NumPy arrays, there is no need to *zip* arrays. Instead, element-wise operations are performed using standard mathematical operators.
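
The points above about `range()`, `zip()`, and element-wise operators can be sketched on a small scale as follows. All variable names in this block are assumptions chosen for illustration.

```python
import numpy as np

small_list1 = list(range(5))      # [0, 1, 2, 3, 4]
small_list2 = list(range(5))

pairs = list(zip(small_list1, small_list2))   # pairs up elements by position
list_sum = [x + y for x, y in pairs]          # element-wise addition for lists

array_sum = np.arange(5) + np.arange(5)       # element-wise addition, no zip needed

# Caution: + applied to two plain lists concatenates them instead of adding.
concatenated = small_list1 + small_list2

print(pairs)
print(list_sum)
print(array_sum)
print(concatenated)
```

The last line is the reason `zip()` is needed for lists: the `+` operator has a different meaning for lists than for NumPy arrays.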

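Another approach for timing operations is the special `%timeit` command (an IPython *magic* that is available in notebooks), shown in the following code block. Note that specifying `-r 10` tells `%timeit` to repeat the timing 10 times. The number of loops within each repeat is chosen automatically unless `-n` is given.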

In [0]:
import numpy as np 

SIZE = 250000

list1 = list(range(SIZE))  # list() turns the range object into an actual Python list
list2 = list(range(SIZE))

print("Time statistics for Python lists:")
%timeit -r 10 [(x+y) for x,y in zip(list1,list2)]

del(list1,list2)

nparray1 = np.arange(SIZE)
nparray2 = np.arange(SIZE)

print("\nTime statistics for NumPy arrays:")
%timeit -r 10 nparray1 + nparray2

del(nparray1, nparray2)
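
The `%timeit` command only works inside IPython and notebooks. In a plain Python script, a comparable measurement can be sketched with the standard library's `timeit` module. The array size `N` and the values chosen for `repeat` and `number` below are assumptions picked to keep the sketch fast, not settings from the original notebook.

```python
import timeit
import numpy as np

N = 10000  # hypothetical size, smaller than SIZE so the sketch runs quickly
py_list1 = list(range(N))
py_list2 = list(range(N))
arr1 = np.arange(N)
arr2 = np.arange(N)

RUNS = 20  # executions per repeat

# repeat=10 plays the role of `-r 10`; number=RUNS is the loop count per repeat.
list_times = timeit.repeat(lambda: [x + y for x, y in zip(py_list1, py_list2)],
                           repeat=10, number=RUNS)
array_times = timeit.repeat(lambda: arr1 + arr2, repeat=10, number=RUNS)

print("Best list time per execution (s):", min(list_times) / RUNS)
print("Best array time per execution (s):", min(array_times) / RUNS)
```

Each entry of the returned lists is the total time for `RUNS` executions, so dividing by `RUNS` gives a per-execution estimate.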

### Looking for NumPy Methods to Achieve a Task

In addition to the help functionality built into Python, the NumPy library also allows you to search for functions that are included in the library using keywords. This search capability is achieved using the `lookfor` function as shown in the following cell, which searches for functions associated with the search phrase *Standard deviation*. Note that `lookfor` was removed in NumPy 2.0, so the cell requires an earlier NumPy version.

In [0]:
np.lookfor('Standard deviation')

### Operations Involving NumPy Arrays
<a id="Operations Involving NumPy Arrays"> </a>

The following code blocks demonstrate several functions that perform operations on NumPy arrays and the reshape function. Since it is rather easy to deduce the operations that the functions perform, no additional explanation is provided.

[Back to Table of Contents](#Table of Contents)<br>

In [0]:
import numpy as np
a = np.arange(12).reshape((3,4)) # Uses the numbers 0,...,11 to create an array with 3 rows and 4 columns
a

In [0]:
a*a # Multiplies the array a by itself

In [0]:
np.sqrt(a) # Takes the square root of all elements in a

In [0]:
a/2 # Divides all elements in a by 2

In [0]:
a.mean() # Mean of elements in a

In [0]:
a.std() # Standard deviation of elements in a

In [0]:
a.sum() # Sums the elements in the array a

In [0]:
a.argmin() # Returns the index of the minimum element in the flattened array a

In [0]:
a.argmax() # Returns the index of the maximum element in the flattened array a

In [0]:
a.shape # Returns the shape of the NumPy array a

In [0]:
a.dtype # Returns the type of the NumPy array a
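
The reductions above operate on the whole array. As a hedged sketch of two related ideas, the block below uses the `axis` argument to reduce along one dimension and `np.unravel_index` to turn the flat index returned by `argmax()` into a (row, column) pair. The name `grid` is an assumption used to avoid overwriting `a`.

```python
import numpy as np

grid = np.arange(12).reshape((3, 4))   # hypothetical name; same values as `a` above

column_sums = grid.sum(axis=0)   # collapse the rows: one sum per column
row_means = grid.mean(axis=1)    # collapse the columns: one mean per row

flat_index = grid.argmax()                            # position in the flattened array
row_col = np.unravel_index(flat_index, grid.shape)    # convert to (row, column)

print(column_sums)
print(row_means)
print(flat_index, row_col)
```

Here `axis=0` runs down the rows and `axis=1` runs across the columns, which is the same convention that Pandas uses later in this notebook.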

### Special Operations of NumPy Arrays
<a id="Special Operations Involving NumPy Arrays"> </a>

This subsection looks at special operations that may be performed on NumPy arrays. The first special operation that we illustrate is the ability to filter NumPy arrays based on conditions. The following code block shows how we may use an inequality to create an array of boolean values based on another NumPy array.

[Back to Table of Contents](#Table of Contents)<br>

In [0]:
b = a>4
print(b)

Multiplying the boolean array with the original array allows us to set all values not meeting the specified condition to zero.

In [0]:
a*b

We may also use the boolean array to specify user-defined values for the entries in the original array that do not meet the condition.

In [0]:
a[~b]=999 # ~b inverts the boolean mask, selecting the entries that do not meet the condition
print(a)
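
The assignment above modifies the array in place. If the original values should be kept, one alternative is `np.where`, which builds a new array. The following is a minimal sketch under the assumption that a fresh array named `source` is used rather than `a`.

```python
import numpy as np

source = np.arange(12).reshape((3, 4))   # hypothetical name, so that `a` is left untouched

replaced = np.where(source > 4, source, 999)   # keep values > 4, otherwise use 999
selected = source[source > 4]                  # boolean indexing returns a 1-D array of matches

print(replaced)
print(selected)
print(source)   # the source array is unchanged
```

`np.where(condition, x, y)` picks from `x` where the condition holds and from `y` elsewhere.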

The following cell shows two NumPy methods that allow us to flatten a multi-dimensional array to a one-dimensional array.

In [0]:
a = np.arange(12).reshape((3,4)) # Uses the numbers 0,...,11 to create an array with 3 rows and 4 columns
print("The original array is\n\n",a,"\n")
print("Using ravel(), we get",a.ravel(),"\n")
print("Using flatten(), we get",a.flatten(),"\n")
print("It is important to note that the array is still defined as the original\n\n",a,"\n")
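
Although `ravel()` and `flatten()` print the same values, they differ in whether the result shares memory with the original array. The sketch below illustrates this for a contiguous array. The names `base`, `raveled`, and `flattened` are assumptions for illustration.

```python
import numpy as np

base = np.arange(12).reshape((3, 4))

raveled = base.ravel()      # a view here, because `base` is contiguous in memory
flattened = base.flatten()  # always a copy

raveled[0] = -1      # also changes base[0, 0]
flattened[1] = -2    # leaves base untouched

print(base[0, 0], base[0, 1])
print(np.shares_memory(base, raveled), np.shares_memory(base, flattened))
```

In other words, `flatten()` is the safer choice when the flattened result will be modified, while `ravel()` avoids a copy when one is not needed.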

### Accessing Elements of NumPy Arrays
<a id="Accessing Elements of NumPy Arrays"> </a>

This subsection looks at how to access elements of a NumPy array. The following code block shows basic methods that can be used to access elements or portions of a NumPy array. It is important to keep in mind that Python and NumPy both consider 0 to be the first index. In other words, the first row of an array is row 0.

[Back to Table of Contents](#Table of Contents)<br>

In [0]:
a[1,1] # Access the element in the second row, second column of array a.

In [0]:
a[:,1] # Access the second column of the array a.

In [0]:
a[1,:] # Access the second row of the array a.
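
The same indexing syntax extends to ranges of rows and columns. As a brief sketch, using a hypothetical array named `table`:

```python
import numpy as np

table = np.arange(12).reshape((3, 4))   # hypothetical name; same values as `a`

block = table[0:2, 1:3]   # rows 0 and 1, columns 1 and 2 (the stop index is excluded)
last_row = table[-1, :]   # negative indices count from the end

print(block)
print(last_row)
```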

The following cell illustrates the `nditer` method that allows us to iterate over all values in a NumPy array. When using this method, we can specify the `order` to control whether we iterate by columns or by rows. Specifying `order = 'C'` tells NumPy to use *C-style* order that processes records across rows. Specifying `order = 'F'` tells NumPy to use *Fortran-style* order that processes records down columns.

In [0]:
a = np.arange(12).reshape((3,4)) # Uses the numbers 0,...,11 to create an array with 3 rows and 4 columns
print("The original array is\n\n",a,"\n")

print("We will now iterate over the array using C-style order.")
for x in np.nditer(a,order = 'C'):
    print(x)
    
print("\nWe will now iterate over the array using Fortran-style order.")
for x in np.nditer(a,order = 'F'):
    print(x)

## Random Number Generation
<a id="Random Number Generation"> </a>

Although NumPy has several more interesting capabilities, we will conclude this section by looking at NumPy's random number generation functionalities. The following code block shows how we can generate an array of random numbers drawn from the interval [0, 1) with a user-specified shape.

[Back to Table of Contents](#Table of Contents)<br>

In [0]:
np.random.rand(3,4) # Generates an array of random numbers with three rows and four columns

The following code block shows how we can generate an array of random integers drawn from a user-specified interval.

[Back to Table of Contents](#Table of Contents)<br>

In [0]:
np.random.randint(1,101,25) # Randomly draws 25 integers from the interval [1,101).

The following code block shows how we can randomly select elements from an existing array (in this example *a*). The function arguments specify the array to select from, the number of elements to select, and whether or not the sampling should be with replacement.

[Back to Table of Contents](#Table of Contents)<br>

In [0]:
a = np.random.randint(1,101,25)
print("The original array is",a)

print("The randomly drawn values are",np.random.choice(a,10,True))
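
The cells above produce different values on every run. When reproducible results are needed, a seed can be fixed. The sketch below assumes NumPy 1.17 or later and uses the `np.random.default_rng` generator interface rather than the `np.random.*` functions shown above. The seed value 42 is arbitrary.

```python
import numpy as np

rng_a = np.random.default_rng(42)   # 42 is an arbitrary seed chosen for illustration
rng_b = np.random.default_rng(42)

draws_a = rng_a.integers(1, 101, size=5)   # integers from the interval [1, 101)
draws_b = rng_b.integers(1, 101, size=5)

print(draws_a)
print(np.array_equal(draws_a, draws_b))   # same seed, same draws
```

Two generators created with the same seed produce the same sequence, which makes simulation results repeatable.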

Using two arrays, one for values and one for probabilities, we can define a custom probability distribution by passing the probability array as an additional argument to the `choice` function. This is demonstrated below.

[Back to Table of Contents](#Table of Contents)<br>

In [0]:
support = np.arange(5)
probabilities = [0.3, 0.2, 0.1, 0.2, 0.2]

distribution = np.random.choice(support,1000,True,probabilities)

import matplotlib.pyplot as plt
tnrfont = {'fontname':'Times New Roman'}

fig, ax = plt.subplots(figsize=(12,8))
ax.hist(distribution)
ax.set_title('Histogram of Randomly Generated Values',fontsize=25)
ax.set_xlabel('Value',fontsize = 20)
ax.set_ylabel('Occurrences',fontsize = 20)
ax.xaxis.set_tick_params(labelsize=20)
ax.yaxis.set_tick_params(labelsize=20)
plt.show() 
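
As a numerical complement to the histogram, the observed frequencies can be compared with the specified probabilities. This is a sketch only: the sample size of 100000 and the variable names are assumptions, and the frequencies will differ slightly from run to run.

```python
import numpy as np

values = np.arange(5)
target_probs = [0.3, 0.2, 0.1, 0.2, 0.2]

sample = np.random.choice(values, 100000, True, target_probs)

# np.unique with return_counts=True tallies how often each value was drawn.
observed, counts = np.unique(sample, return_counts=True)
frequencies = counts / counts.sum()

for v, f, p in zip(observed, frequencies, target_probs):
    print(v, round(float(f), 3), p)
```

With a large sample, each observed frequency should be close to the corresponding entry of `target_probs`.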

In addition to custom distributions, there are several well-known distributions that may be selected. The following code block demonstrates randomly sampling from a Normal distribution.

[Back to Table of Contents](#Table of Contents)<br>

In [0]:
mean = 1000
sd = 100
distribution = np.random.normal(mean,sd,10000)

import matplotlib.pyplot as plt
tnrfont = {'fontname':'Times New Roman'}

fig, ax = plt.subplots(figsize=(12,8))
ax.hist(distribution,bins = 50)
ax.set_title('Histogram of Randomly Generated Values',fontsize=25)
ax.set_xlabel('Value',fontsize = 20)
ax.set_ylabel('Occurrences',fontsize = 20)
ax.xaxis.set_tick_params(labelsize=20)
ax.yaxis.set_tick_params(labelsize=20)
plt.show()

The following code block demonstrates randomly sampling from a Poisson distribution.

[Back to Table of Contents](#Table of Contents)<br>

In [0]:
mean = 2
distribution = np.random.poisson(mean,10000)

import matplotlib.pyplot as plt
tnrfont = {'fontname':'Times New Roman'}

fig, ax = plt.subplots(figsize=(12,8))
ax.hist(distribution,bins = 50)
ax.set_title('Histogram of Randomly Generated Values',fontsize=25)
ax.set_xlabel('Value',fontsize = 20)
ax.set_ylabel('Occurrences',fontsize = 20)
ax.xaxis.set_tick_params(labelsize=20)
ax.yaxis.set_tick_params(labelsize=20)
plt.show()

When employing heuristic solution procedures, a common task is generating a permutation of a vector that represents a solution. NumPy allows random permutations of an array to be generated very simply using the `permutation` routine. The following code block shows an example.  

[Back to Table of Contents](#Table of Contents)<br>

In [0]:
mysequence = np.arange(11)
print("The original sequence is:", mysequence)
print("A permutation of the original sequence is:", np.random.permutation(mysequence))
print("Another permutation of the original sequence is:", np.random.permutation(mysequence))

print("\n")
mysequence = np.array([1,3,5,7,9,11,13])
print("The original sequence is:", mysequence)
print("A permutation of the original sequence is:", np.random.permutation(mysequence))
print("Another permutation of the original sequence is:", np.random.permutation(mysequence))
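
Note that `permutation` returns a shuffled copy and leaves its argument alone. When the original array itself should be reordered, `np.random.shuffle` works in place. A small sketch, with the hypothetical name `route` standing in for a solution vector:

```python
import numpy as np

route = np.arange(6)   # hypothetical solution vector

new_order = np.random.permutation(route)                # returns a shuffled copy
original_intact = np.array_equal(route, np.arange(6))   # True: route was not modified

np.random.shuffle(route)   # shuffles `route` itself and returns None

print(new_order)
print(original_intact)
print(route)
```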

NumPy also allows us to determine percentiles of data stored in a NumPy array. The following example shows how to determine several percentile values for a random sample drawn from a Normal distribution.

[Back to Table of Contents](#Table of Contents)<br>

In [0]:
mean = 1000
sd = 100
distribution = np.random.normal(mean,sd,10000)

print("The 50th and 75th percentiles of the defined distribution are",np.percentile(distribution,[50,75]),"respectively")

print("The 99th percentile of the defined distribution is",np.percentile(distribution,99))
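
Because the sample above is random, its percentiles change from run to run. To see how `np.percentile` behaves on known data, the following sketch uses a small hypothetical array named `scores`. By default, NumPy interpolates linearly between neighboring data points.

```python
import numpy as np

scores = np.array([10, 20, 30, 40, 50])   # hypothetical data

median, upper_quartile = np.percentile(scores, [50, 75])
p99 = np.percentile(scores, 99)   # falls between 40 and 50 by linear interpolation

print(median, upper_quartile, p99)
```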


# Pandas Basics
<a id="Pandas Basics"> </a>

From https://en.wikipedia.org/wiki/Pandas_(software) (accessed 1/7/2018):

>Pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. It is free software released under the three-clause BSD license. The name is derived from the term "panel data", an econometrics term for multidimensional, structured data sets. Major features of the library are:

> - DataFrame object for data manipulation with integrated indexing.
> - Tools for reading and writing data between in-memory data structures and different file formats.
> - Data alignment and integrated handling of missing data.
> - Reshaping and pivoting of data sets.
> - Label-based slicing, fancy indexing, and subsetting of large data sets.
> - Data structure column insertion and deletion.
> - Group by engine allowing split-apply-combine operations on data sets.
> - Data set merging and joining.
> - Hierarchical axis indexing to work with high-dimensional data in a lower-dimensional data structure.
> - Time series-functionality: Date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging.
> The library is highly optimized for performance, with critical code paths written in Cython or C.

[Back to Table of Contents](#Table of Contents)<br>

### Importing Data Using Pandas
<a id="Importing Data Using Pandas"> </a>

The following code block imports the Pandas library, defines a small dataframe, and uses the `head()` function to print the first five rows. The data is first defined as a dictionary, which is then converted to a dataframe by the `pd.DataFrame()` call.

[Back to Table of Contents](#Table of Contents)<br>

In [0]:
import pandas as pd

df = pd.DataFrame({'Column 1': [1, 2, 3, 4, 5, 6, 7, 8, 9], 
     'Column 2': [11, 12, 13, 14, 15, 16, 17, 18, 19],
     'Column 3': [21, 22, 23, 24, 25, 26, 27, 28, 29]})

print(df.head())
del(df)

In practice, it is most common to create dataframes by reading in data from other file formats or the internet. One of the applications presented later in this notebook illustrates how to read data from a *csv* (comma-separated values) file into a Pandas dataframe. The following code block shows how to read a *csv* file from the internet. Although the file extension indicates that the file uses the *csv* format, semicolons are the delimiter that separates data in the file. Thus, we pass the argument `sep=';'` to the `read_table()` function to let Pandas know that the data is separated by semicolons.

The dataset was collected by a large Brazilian logistics company over 60 days and was stored on the UC Irvine repository of datasets for machine learning at the time of writing (1/8/2018). To avoid errors arising if the data file is moved, we include the data as a `pickle` file in the *data* folder and execute the dataframe import in a `try-except` block. Essentially, Python will attempt to download the data file. If the download succeeds, the data will be stored as a Pandas dataframe, the data will be used to update the pickle file, and the first five rows of the dataframe will be printed. If the data link is no longer valid, a message will print letting the user know that the data file no longer exists and that the data is being read from the existing pickle file.

The pickle format is Python-specific and is designed for storing and retrieving Python objects. In general, if you ever need to store data in some intermediate form after performing analysis in Python, pickle files are a good way to store it since they will include the most up-to-date content of the object along with information on the data structure.

[Back to Table of Contents](#Table of Contents)<br>

In [0]:
import pandas as pd
try:
    df = pd.read_table('http://archive.ics.uci.edu/ml/machine-learning-databases/00409/'
                            'Daily_Demand_Forecasting_Orders.csv', sep=';')
    df.to_pickle('data/pandas_basics_data.pkl')
    print(df.head())
except Exception:  # catches ordinary errors, but not KeyboardInterrupt or SystemExit
    print('The data file no longer exists!\nReading from pickle file.')
    df = pd.read_pickle('data/pandas_basics_data.pkl')
    print(df.head())
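
The effect of the `sep=';'` argument can be tried without an internet connection by reading from an in-memory string. The data in `raw_text` below is invented for illustration and is not taken from the actual dataset. `pd.read_csv` with `sep=';'` is used here, which parses the text the same way as `read_table` with that separator.

```python
import io
import pandas as pd

# Hypothetical semicolon-delimited content standing in for a downloaded file.
raw_text = "week;day;orders\n1;4;316.3\n1;5;128.6\n2;6;43.6\n"

orders = pd.read_csv(io.StringIO(raw_text), sep=';')

print(orders)
print(orders.dtypes)
```

Without `sep=';'`, each line would be read as a single column, because the default delimiter for `read_csv` is a comma.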

In addition to the `head()` function, Pandas also has a `tail()` function that allows you to view records at the end of a dataframe. The following code block uses the `tail()` function to view the last 10 records of our dataframe.

[Back to Table of Contents](#Table of Contents)<br>

In [0]:
df.tail(10)

### Modifying a Pandas Dataframe
<a id="Modifying a Pandas Dataframe"> </a>

We can rename the columns of a Pandas dataframe by accessing its `columns` attribute as shown below.

[Back to Table of Contents](#Table of Contents)<br>

In [0]:
df.columns = ['Week of Month', 'Day of Week', 'Non-urgent Order', 'Urgent Order',
              'Order Type A','Order Type B','Order Type C', 'Fiscal Sector Orders', 
              'Traffic Controller Orders', 'Banking Orders (1)', 'Banking Orders (2)','Banking Orders (3)', 'Total Orders']
df.head()

We can drop columns from a Pandas dataframe using the `drop` function as shown in the following code block. The `axis=1` argument specifies that we are dropping columns, whereas an `axis=0` argument would be used in cases where we want to drop rows.

[Back to Table of Contents](#Table of Contents)<br>

In [0]:
df.drop(['Fiscal Sector Orders', 'Traffic Controller Orders', 'Banking Orders (1)', 'Banking Orders (2)',\
         'Banking Orders (3)'], axis = 1)

It is very important to note that our previous operation results in no permanent change to the dataframe object because we did not specify an assignment. This can be seen by executing the following code block that prints the head of the df object. In particular, note that all of the columns we specified to be dropped still exist.

[Back to Table of Contents](#Table of Contents)<br>

In [0]:
df.head()

If we wanted the drop column operation to be permanent, we could have overwritten the existing dataframe using `df = df.drop(['Fiscal Sector Orders', 'Traffic Controller Orders', 'Banking Orders (1)', 'Banking Orders (2)','Banking Orders (3)'], axis = 1)`.

Another approach is to supply the additional argument `inplace = True`. This latter approach is demonstrated in the next code block.

**Note: Once you run the following cell, the dataframe will no longer contain the dropped columns. Thus, if you rerun any cells that reference the dropped columns you will get an error unless you import the dataframe and rename the columns as is done earlier.**

[Back to Table of Contents](#Table of Contents)<br>

In [0]:
df.drop(['Fiscal Sector Orders', 'Traffic Controller Orders', 'Banking Orders (1)', 'Banking Orders (2)',\
         'Banking Orders (3)'], axis = 1,inplace=True)

df.head()
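
The difference between discarding and keeping the result of `drop` can be sketched on a tiny dataframe. The dataframe `toy` and its column names are assumptions made for illustration, so that `df` is not affected.

```python
import pandas as pd

toy = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6]})   # hypothetical data

toy.drop(['B'], axis=1)                     # result is discarded, `toy` is unchanged
columns_after_discard = list(toy.columns)

toy = toy.drop(['B'], axis=1)               # assignment keeps the result
columns_after_assign = list(toy.columns)

fewer_rows = toy.drop([0], axis=0)          # axis=0 drops rows by index label

print(columns_after_discard)
print(columns_after_assign)
print(fewer_rows)
```

Assigning the result back to the same name has the same visible effect as passing `inplace=True`.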

### Manipulating Data in a Pandas Dataframe
<a id="Manipulating Data in a Pandas Dataframe"> </a>

This section looks at a limited selection of the methods that are available to slice and wrangle data that is stored as a Pandas dataframe (these are not my terms, but instead are commonly used in the field of data science). Before beginning such efforts, we often must investigate the data to better understand its size and the formats used. The following cell shows how to determine the unique values in a specific dataframe column.

[Back to Table of Contents](#Table of Contents)<br>

In [0]:
df['Day of Week'].unique()

If we wanted to extract the values in a particular column, we can use the `values` attribute.

[Back to Table of Contents](#Table of Contents)<br>

In [0]:
df['Day of Week'].values

The `describe()` function provides several summary statistics for each column in a Pandas dataframe.

[Back to Table of Contents](#Table of Contents)<br>

In [0]:
df.describe()

We can select individual columns and rows of a Pandas dataframe in many ways. The following cell shows how to select columns by column name using the `loc` method. In this case, we return all rows for the 'Week of Month', 'Day of Week', and 'Total Orders' columns, and only print the first five rows.

[Back to Table of Contents](#Table of Contents)<br>

In [0]:
df.loc[:,['Week of Month','Day of Week','Total Orders']].head()

The same slicing can be achieved with column index values using the `iloc` method. Recall that in our dataframe with dropped columns, 'Week of Month' is the first column (column 0), 'Day of Week' is the second column (column 1), and 'Total Orders' is the eighth column (column 7).

[Back to Table of Contents](#Table of Contents)<br>

In [0]:
df.iloc[:,[0,1,7]].head()

The following cell shows how to pull the first three rows of the designated column slice. 

[Back to Table of Contents](#Table of Contents)<br>

In [0]:
df.iloc[0:3,[0,1,7]]

The following cell shows how to pull select non-consecutive rows of the designated column slice.

[Back to Table of Contents](#Table of Contents)<br>

In [0]:
df.iloc[[0,3,6,8,54],[0,1,7]]

The following code block shows how we can limit the records based on a condition. Specifically, the following code block returns entries in the dataframe where the value in the 'Total Orders' column exceeds 500.

[Back to Table of Contents](#Table of Contents)<br>

In [0]:
df[df['Total Orders']>500]

The following code block shows how we can use a list to limit the records in a Pandas dataframe based on whether or not the row values are in the list.

[Back to Table of Contents](#Table of Contents)<br>

In [0]:
df[df['Day of Week'].isin([2,4])]

The following code block shows that we can use multiple criteria to filter a Pandas dataframe.

**Note: The parentheses around each condition are necessary!!!**

[Back to Table of Contents](#Table of Contents)<br>

In [0]:
df[(df['Total Orders']>500) & (df['Day of Week'].isin([2,4]))]
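
The parentheses are needed because `&` binds more tightly than comparison operators such as `>`, so an unparenthesized condition would be grouped incorrectly. One way to sidestep the issue is to name each condition first. The sketch below does this on a hypothetical dataframe `sample_orders` with invented values, and also shows the `|` (or) and `~` (not) operators.

```python
import pandas as pd

# Hypothetical data with the same column names as the example dataframe.
sample_orders = pd.DataFrame({'Day of Week': [2, 3, 4, 5, 2],
                              'Total Orders': [520, 610, 480, 300, 450]})

high = sample_orders['Total Orders'] > 500          # boolean Series
on_day = sample_orders['Day of Week'].isin([2, 4])  # boolean Series

both = sample_orders[high & on_day]          # rows meeting both conditions
either = sample_orders[high | on_day]        # rows meeting at least one condition
neither = sample_orders[~(high | on_day)]    # rows meeting neither condition

print(both)
print(either)
print(neither)
```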

### Aggregating and Summarizing a Pandas Dataframe
<a id="Aggregating and Summarizing a Pandas Dataframe"> </a>

In addition to slicing data that exists in a Pandas dataframe object, we can aggregate the data using the `groupby()` function. The following code block aggregates the data by 'Day of Week' and immediately applies the `sum()` function to the grouped object.

[Back to Table of Contents](#Table of Contents)<br>

In [0]:
df.groupby('Day of Week').sum()
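
Summing every column is not always meaningful. As a sketch on made-up data (the `toy` dataframe is illustrative, not the course data), a single column can be selected after grouping and several summary statistics requested at once with `agg()`.

```python
import pandas as pd

# Hypothetical data, made up for illustration.
toy = pd.DataFrame({
    'Day of Week': [2, 3, 2, 3, 4],
    'Total Orders': [100, 200, 300, 400, 500],
})

# Group by day, keep only 'Total Orders', and compute three statistics.
summary = toy.groupby('Day of Week')['Total Orders'].agg(['sum', 'mean', 'count'])
print(summary)
```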

Of interest to Microsoft Excel users, Pandas also includes a `pivot_table()` function that mimics an Excel pivot table. Information on the function, accessed using the `help()` function, follows.

[Back to Table of Contents](#Table of Contents)<br>

In [0]:
help(pd.pivot_table)

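The following code block creates a pivot table that sums the total orders by 'Day of Week' and 'Week of Month'. Note that the sum function is from NumPy! Recent Pandas releases may emit a warning for `np.sum` and suggest passing the string `'sum'` instead, which gives the same result here.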

In [0]:
import numpy as np
pd.pivot_table(df, values='Total Orders', index=['Week of Month'], columns=['Day of Week'],aggfunc=np.sum)

Note in the previous example that we had missing data when the 'Day of Week' is two and the 'Week of Month' is one. The `pivot_table()` function allows you to define a default fill value for cases where data is missing. The following code block creates the same pivot table, but fills missing values with 0.

[Back to Table of Contents](#Table of Contents)<br>

In [0]:
import numpy as np
pd.pivot_table(df, values='Total Orders', index=['Week of Month'], columns=['Day of Week'],aggfunc=np.sum,fill_value=0)
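
To make the role of `fill_value` concrete, the sketch below builds a pivot table from a small made-up dataframe in which one combination (week 1, day 2) has no rows. It also builds the same table with `groupby()` followed by `unstack()`, which is one way to think about what a pivot table does. The `toy` data is illustrative only, and the aggregation is passed as the string `'sum'`.

```python
import pandas as pd

# Hypothetical data with no record for week 1, day 2.
toy = pd.DataFrame({
    'Week of Month': [1, 1, 2, 2, 2],
    'Day of Week': [3, 4, 2, 3, 4],
    'Total Orders': [10, 20, 30, 40, 50],
})

# Pivot table: weeks as rows, days as columns, missing cells filled with 0.
pivot = pd.pivot_table(toy, values='Total Orders', index='Week of Month',
                       columns='Day of Week', aggfunc='sum', fill_value=0)

# The same table built by grouping on both keys and moving the days to columns.
same = (toy.groupby(['Week of Month', 'Day of Week'])['Total Orders']
           .sum()
           .unstack(fill_value=0))

print(pivot)
print(same)
```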

Plotting a Pandas pivot table can be done very simply by appending `.plot()` to the pivot table statement and passing it the axis to draw on. The following code block provides an example.

**Note: Pandas plotting functionality utilizes the `matplotlib` library. Although this notebook does not discuss `matplotlib`, it receives much attention in published books and online communities.**

[Back to Table of Contents](#Table of Contents)<br>

In [0]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(12,8))
pd.pivot_table(df, values='Total Orders', index=['Week of Month'],
               columns=['Day of Week'],aggfunc=np.sum,fill_value=0).plot(ax=ax)
ax.set_title('Plot of Pandas Pivot Table',fontsize=25)
ax.set_xlabel('Week of Month',fontsize = 20)
ax.set_ylabel('Sum of Total Orders',fontsize = 20)
ax.xaxis.set_tick_params(labelsize=20)
ax.yaxis.set_tick_params(labelsize=20)


plt.show()

Although we will not elaborate on these capabilities further, Pandas also allows users to define custom functions that can be applied to entire dataframes. The following code block shows an example where the custom function determines the range of values (maximum minus minimum) in each column of a Pandas dataframe.

[Back to Table of Contents](#Table of Contents)<br>

In [0]:
df.apply(lambda x: x.max() - x.min())
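
The sketch below applies the same range calculation to a small made-up dataframe so the result can be checked by hand. It uses a named function in place of the `lambda`, and also shows the `axis=1` argument, which applies the function to each row instead of each column. The `toy` dataframe and the `value_range` name are illustrative only.

```python
import pandas as pd

# Hypothetical data, made up for illustration.
toy = pd.DataFrame({'A': [1, 5, 3], 'B': [10, 40, 20]})

def value_range(values):
    """Return the difference between the largest and smallest value."""
    return values.max() - values.min()

# Default (axis=0): the function receives one column at a time.
col_ranges = toy.apply(value_range)

# axis=1: the function receives one row at a time.
row_ranges = toy.apply(value_range, axis=1)

print(col_ranges)
print(row_ranges)
```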

# Applications
<a id="Applications"> </a>

This section of the notebook uses Python to solve problems in different application areas that fall under the umbrella of operations management.

[Back to Table of Contents](#Table of Contents)<br>

## Time Series Forecasting
<a id="Time Series Forecasting"> </a>

The following section looks at some of the time series forecasting capabilities of Python. The following code block reads the data that we will be using for the forecasting demonstration into a `pandas` dataframe. The data provides quarterly sales figures for a French company. Once read, the year and quarter are combined into a single column, converted to the date on which each quarter ends, and set as the index. Once the index is set, the first 15 rows are displayed.

[Back to Table of Contents](#Table of Contents)<br>

In [0]:
import pandas as pd
import numpy as np

data = pd.read_csv('data/French_Company.csv')
data['Index'] = data.Year.map(str)+data.Quarter
data['Index'] = pd.to_datetime(data['Index'].str.replace(' ', '')) + pd.offsets.QuarterEnd(0)
data.set_index("Index",inplace=True)
data.head(15)
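
The idea of turning a year and a quarter label into a quarter-end date can be tried on a few made-up rows. The sketch below is an alternative formulation using `pd.PeriodIndex`, not the approach in the cell above. It assumes quarter labels of the form `'Q1'` to `'Q4'`, and the `toy` dataframe and its sales values are hypothetical.

```python
import pandas as pd

# Hypothetical data; the real file's layout and values may differ.
toy = pd.DataFrame({
    'Year': [2012, 2012, 2012, 2012],
    'Quarter': ['Q1', 'Q2', 'Q3', 'Q4'],
    'Sales': [362, 385, 432, 341],
})

# Build labels such as '2012Q1' and interpret them as quarterly periods.
labels = (toy['Year'].astype(str) + toy['Quarter']).tolist()
quarters = pd.PeriodIndex(labels, freq='Q')

# Use the last day of each quarter (time set to midnight) as the index.
toy.index = quarters.to_timestamp(how='end').normalize()
print(toy)
```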

The following code block extracts the sales column from the `pandas` dataframe and stores it in an object named *ts*.

[Back to Table of Contents](#Table of Contents)<br>

In [0]:
ts = data['Sales']
ts.head(10)

The following code block plots the time series data stored in the *ts* object.

[Back to Table of Contents](#Table of Contents)<br>

In [0]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(12,8))
ax.plot(ts)
ax.set_title('Time Series',fontsize=25)
ax.set_xlabel('Time',fontsize = 20)
ax.set_ylabel('Value',fontsize = 20)
ax.xaxis.set_tick_params(labelsize=20)
ax.yaxis.set_tick_params(labelsize=20)
plt.show() 

The following code block plots the time series data stored in the *ts* object along with a moving average forecast. The plot is interactive so that users can see how the forecast changes as the number of periods included in the moving average ($n$) varies.

[Back to Table of Contents](#Table of Contents)<br>

In [0]:
from ipywidgets import interact
@interact(n = (2, 18, 1))
def simulate(n = 2):
    moving_avg = ts.rolling(window=n,center=False).mean()
    
    fig, ax = plt.subplots(figsize=(12,8))
    ax.plot(ts)
    ax.plot(moving_avg, color='red')
    ax.set_title('Moving Average Forecast',fontsize=25)
    ax.set_xlabel('Time',fontsize = 20)
    ax.set_ylabel('Value',fontsize = 20)
    ax.xaxis.set_tick_params(labelsize=20)
    ax.yaxis.set_tick_params(labelsize=20)
    plt.show()   
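
A short series makes the rolling calculation easy to verify by hand. In the sketch below (made-up numbers, illustrative variable names), the rolling mean at each position includes the observation at that position. Shifting the result by one period is one common way to obtain a value for each period that relies only on earlier observations.

```python
import pandas as pd

# Hypothetical sales figures, made up for illustration.
sales = pd.Series([10, 20, 30, 40, 50])

# Mean of the current value and the two before it (window of 3).
# The first two entries are NaN because a full window is not yet available.
smoothed = sales.rolling(window=3).mean()

# Shift by one period so each value uses only earlier observations.
forecast = smoothed.shift(1)

print(pd.DataFrame({'sales': sales, 'smoothed': smoothed, 'forecast': forecast}))
```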

The following code block plots the time series data stored in the *ts* object along with an exponentially weighted moving average forecast. The plot is interactive so that users can see how the forecast changes as the half-life of the weighting ($n$) varies.

[Back to Table of Contents](#Table of Contents)<br>

In [0]:
from ipywidgets import interact
@interact(n = (2, 18, 1))
def simulate(n = 2):
    expweighted_avg = ts.ewm(halflife=n,min_periods=0,adjust=True,ignore_na=False).mean()
    
    fig, ax = plt.subplots(figsize=(12,8))
    ax.plot(ts)
    ax.plot(expweighted_avg, color='red')
    ax.set_title('Exponentially Weighted Moving Average Forecast',fontsize=25)
    ax.set_xlabel('Time',fontsize = 20)
    ax.set_ylabel('Value',fontsize = 20)
    ax.xaxis.set_tick_params(labelsize=20)
    ax.yaxis.set_tick_params(labelsize=20)
    plt.show()   
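
The half-life is the number of periods after which an observation's weight falls to one half. The sketch below reproduces the last value of an exponentially weighted mean by hand for three made-up observations and compares it with the Pandas result for `adjust=True`. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical observations, made up for illustration.
values = pd.Series([1.0, 2.0, 3.0])
halflife = 1

# Smoothing factor implied by the half-life (0.5 when the half-life is 1).
alpha = 1 - 0.5 ** (1 / halflife)

# Weights for the newest, second newest, ... observation: 1, 0.5, 0.25.
weights = (1 - alpha) ** np.arange(len(values))

# Weighted average of the observations, newest first.
manual = (weights * values.to_numpy()[::-1]).sum() / weights.sum()

# The same quantity computed by Pandas.
from_pandas = values.ewm(halflife=halflife, adjust=True).mean().iloc[-1]

print(manual, from_pandas)
```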

The following code block decomposes the time series into trend, seasonal, and residual components using a multiplicative model. These components are plotted with the original time series.

[Back to Table of Contents](#Table of Contents)<br>

In [0]:
from statsmodels.tsa.seasonal import seasonal_decompose
decomposition = seasonal_decompose(ts, model='multiplicative')

trend = decomposition.trend
seasonal = decomposition.seasonal
residual = decomposition.resid

fig, ((ax1,ax2),(ax3,ax4)) = plt.subplots(2,2,figsize=(16,9))
ax1.plot(ts, label='Original')
ax1.legend(loc='best',fontsize='xx-large')
ax1.xaxis.set_tick_params(labelsize=12)
ax1.yaxis.set_tick_params(labelsize=12)

ax2.plot(trend, label='Trend')
ax2.legend(loc='best',fontsize='xx-large')
ax2.xaxis.set_tick_params(labelsize=12)
ax2.yaxis.set_tick_params(labelsize=12)

ax3.plot(seasonal,label='Seasonality')
ax3.legend(loc='best',fontsize='xx-large')
ax3.xaxis.set_tick_params(labelsize=12)
ax3.yaxis.set_tick_params(labelsize=12)

ax4.plot(residual, label='Residuals')
ax4.legend(loc='best',fontsize='xx-large')
ax4.xaxis.set_tick_params(labelsize=12)
ax4.yaxis.set_tick_params(labelsize=12)
plt.show()    
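
With the multiplicative model requested above, each observation is treated as the product of the three components. In the notation below (introduced here for illustration), $y_t$ is the observed value, $T_t$ the trend, $S_t$ the seasonal factor, and $R_t$ the residual:

$$
y_t = T_t \times S_t \times R_t
$$

Under this model, a seasonal factor of 1.10 means that the period typically runs about 10% above the trend. An additive model would instead write the observation as the sum $T_t + S_t + R_t$.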

### Triple Exponential Smoothing (Holt-Winters Method)
<a id="Triple Exponential Smoothing"> </a>

In this section, we will apply triple exponential smoothing (Holt-Winters method) to the time series data to generate a forecast. The following code block provides three functions that will be used to generate the forecast. The functions were mostly copied from those found at https://grisha.org/blog/2016/02/17/triple-exponential-smoothing-forecasting-part-iii/ on 1/5/2018. 

[Back to Table of Contents](#Table of Contents)<br>

In [0]:
def initial_trend(series, slen):
    # average per-period change between the first and second season
    total = 0.0
    for i in range(slen):
        total += float(series[i+slen] - series[i]) / slen
    return total / slen

def initial_seasonal_components(series, slen):
    seasonals = {}
    season_averages = []
    n_seasons = int(len(series)/slen)
    # compute season averages
    for j in range(n_seasons):
        season_averages.append(sum(series[slen*j:slen*j+slen])/float(slen))
    # compute initial values
    for i in range(slen):
        sum_of_vals_over_avg = 0.0
        for j in range(n_seasons):
            sum_of_vals_over_avg += series[slen*j+i]-season_averages[j]
        seasonals[i] = sum_of_vals_over_avg/n_seasons
    return seasonals

def triple_exponential_smoothing(series, slen, alpha, beta, gamma, n_preds):
    # work on a plain list so that series[i] is always positional,
    # even when a pandas Series with a date index is passed in
    series = list(series)
    result = []
    seasonals = initial_seasonal_components(series, slen)
    for i in range(len(series)+n_preds):
        if i == 0: # initial values
            smooth = series[0]
            trend = initial_trend(series, slen)
            result.append(series[0])
            continue
        if i >= len(series): # we are forecasting
            m = i - len(series) + 1
            result.append((smooth + m*trend) + seasonals[i%slen])
        else:
            # update level, trend, and seasonal component (additive form)
            val = series[i]
            last_smooth, smooth = smooth, alpha*(val-seasonals[i%slen]) + (1-alpha)*(smooth+trend)
            trend = beta * (smooth-last_smooth) + (1-beta)*trend
            seasonals[i%slen] = gamma*(val-smooth) + (1-gamma)*seasonals[i%slen]
            result.append(smooth+trend+seasonals[i%slen])
    return result
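
The update step inside `triple_exponential_smoothing` can be written as the following equations. The notation is introduced here for illustration: $y_t$ is the observation, $\ell_t$ the level (`smooth`), $b_t$ the trend, $s_t$ the seasonal component, $L$ the season length (`slen`), and $T$ the index of the last observation. For $t < L$, $s_{t-L}$ refers to the initial seasonal components.

$$
\begin{aligned}
\ell_t &= \alpha\,(y_t - s_{t-L}) + (1-\alpha)\,(\ell_{t-1} + b_{t-1}) \\
b_t &= \beta\,(\ell_t - \ell_{t-1}) + (1-\beta)\,b_{t-1} \\
s_t &= \gamma\,(y_t - \ell_t) + (1-\gamma)\,s_{t-L} \\
\hat{y}_t &= \ell_t + b_t + s_t \\
\hat{y}_{T+m} &= \ell_T + m\,b_T + s_{T+m-L} \qquad (1 \le m \le L)
\end{aligned}
$$

The seasonal component is added to the level rather than multiplied by it, so the code implements the additive form of the method. For forecasts more than one season ahead ($m > L$), the code reuses the most recent seasonal estimate for the same position in the cycle.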

The following code block plots the time series data stored in the *ts* object along with the Holt-Winters forecast. The plot is interactive so that users can see how the forecast changes as the smoothing parameters vary.

[Back to Table of Contents](#Table of Contents)<br>

In [0]:
@interact(alphav = (0.00, 1.01, 0.05), betav = (0.00, 1.01, 0.05), gammav = (0.00, 1.01, 0.05))
def myfunc(alphav = 0.00, betav = 0.00, gammav = 0.00):
    myforecast = triple_exponential_smoothing(ts, 4, alphav, betav, gammav, 4)
    fig, ax = plt.subplots(figsize=(12,8))
    ax.plot(ts.values,label='Data')
    ax.plot(myforecast,label='Forecast',color='green')
    ax.set_title('Triple Exponential Smoothing Forecast (Holt-Winters)',fontsize=25)
    ax.set_xlabel('Time',fontsize = 20)
    ax.set_ylabel('Value',fontsize = 20)
    ax.xaxis.set_tick_params(labelsize=20)
    ax.yaxis.set_tick_params(labelsize=20)
    ax.legend(fontsize='large')
    plt.show()