## MIS780 - Advanced Artificial Intelligence for Business

## Week 1 - Part 2: Getting Started with Machine Learning

In this session, you will get familiar with the essential libraries and the procedure for developing machine learning solutions.

Make sure that the following packages are installed in your Anaconda environment: `numpy`, `scipy`, `matplotlib`, `ipython`, `scikit-learn` and `pandas`.


## Table of Contents


1. [Essential Libraries](#cell_Essential)
    - [NumPy](#cell_NumPy)
    - [SciPy](#cell_SciPy)
    - [matplotlib](#cell_matplotlib)
    - [pandas](#cell_pandas)
    

2. [An Application: Classifying Iris Species](#cell_Iris)
    - [Iris dataset](#cell_dataset)
    - [Training and Testing Data](#cell_TrainingTesting)
    - [Data Splitting Exercise](#cell_DataSplitingExercise)
    - [Examine the Data](#cell_Examine)
    - [Building k-Nearest Neighbors Model](#cell_k-Nearest)
    - [Making Predictions](#cell_Predictions)
    - [Evaluating the Model](#cell_Evaluating)
    
    
3. [Exercise: k-fold cross-validation](#cell_Exercise1)

<a id = "cell_Essential"></a>
### <font color="blue">1. Essential Libraries</font>

<a id = "cell_NumPy"></a>
### NumPy

`numpy` is one of the fundamental packages for scientific computing in Python. It contains functionality for multidimensional arrays, high-level mathematical functions such as linear algebra operations, and random number generators.

https://numpy.org/
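
As a minimal sketch of the functionality listed above (not one of the exercises), the cell below multiplies a matrix by a vector, solves a small linear system, and draws random numbers. The names `A`, `b`, `rng` and the seed `42` are illustrative choices, not part of the course material.

```python
import numpy as np

# A small 2x2 matrix and a vector (illustrative values)
A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([3.0, 5.0])

# Linear algebra: matrix-vector product and solving A @ x = b
product = A @ b
solution = np.linalg.solve(A, b)
print("A @ b: {}".format(product))
print("Solution of A @ x = b: {}".format(solution))

# Random numbers: a seeded generator gives reproducible draws
rng = np.random.default_rng(42)
samples = rng.normal(loc=0.0, scale=1.0, size=5)
print("Random samples: {}".format(samples))
```

Fixing the seed is only for reproducibility; any integer would do.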


In [None]:
import numpy as np
x = np.array([[1, 2, 3], [4, 5, 6]])
print("x:\n{}".format(x))

In [None]:
#index the first row of x
print("x[0]: {}".format(x[0]))


In [None]:
# Exercise
# index the second row of x

# index the first column of x

# append a new row

# append a new column


<a id = "cell_SciPy"></a>
### SciPy

`scipy` is a collection of functions for scientific computing in Python. It provides, among other functionality, advanced linear algebra routines, mathematical function optimization, signal processing, special mathematical functions, and statistical distributions.

https://scipy.org/

Sparse data consists mostly of zeros. Usually it is not possible to create dense representations of sparse data (as they would not fit into memory), so we need to create sparse representations directly.

In [None]:
# Create a 2D NumPy array with ones on the diagonal and zeros everywhere else
eye = np.eye(4)  # the argument (4) is the number of rows; the number of columns defaults to the same value
print("NumPy array:\n{}".format(eye))

In [None]:
from scipy import sparse

data = np.ones(4)
print("data :\n{}".format(data))
row_indices = np.arange(4)
print("row_indices :\n{}".format(row_indices))
col_indices = np.arange(4)

# Create a sparse matrix in the COO (coordinate) format
eye_coo = sparse.coo_matrix((data, (row_indices, col_indices)))
print("\nCOO representation:\n{}".format(eye_coo))
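
To make the memory argument concrete, here is a rough sketch that builds a larger identity matrix directly in COO format and compares the number of stored values with the number of entries a dense array would need. The size `n = 1000` is an arbitrary assumption chosen for illustration.

```python
import numpy as np
from scipy import sparse

# Build a 1000x1000 identity matrix directly in COO format (n is illustrative)
n = 1000
values = np.ones(n)
indices = np.arange(n)
eye_coo = sparse.coo_matrix((values, (indices, indices)), shape=(n, n))

# The sparse format stores only the non-zero entries
stored = eye_coo.nnz
total = n * n
print("Stored values: {} out of {} entries".format(stored, total))

# Converting back to a dense NumPy array is possible when it fits in memory
dense = eye_coo.toarray()
print("Dense shape: {}".format(dense.shape))
```

For a matrix this small the dense conversion is harmless; with millions of rows it would not be.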

In [None]:
# Exercise
# Generate a 4x4 array with random integers

# make 2 elements zero in each row

# get the non zero indices of the array as row, col
row, col = np.nonzero(data)

#convert the sparse array into COO

# Print the Sparse Array

# Print the COO Matrix


<a id = "cell_matplotlib"></a>
### matplotlib

`matplotlib` is the primary scientific plotting library in Python. It provides functions for making publication-quality visualizations such as line charts, histograms, scatter plots, and so on. Visualizing your data and different aspects of your analysis can give you important insights.

https://matplotlib.org/

In [None]:
import matplotlib.pyplot as plt #imports pyplot module from matplotlib

# Generate a sequence of 100 numbers from -10 to 10 (inclusive)
x = np.linspace(-10, 10, 100)

# Create a second array using sine function for each element in the (x) array
y = np.sin(x)

# The plot function makes a line chart of one array against another
plt.plot(x, y, marker="x")
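
Histograms are another of the chart types mentioned above. The following sketch, which assumes randomly generated data rather than any course dataset, draws one and labels its axes. The seed and the number of bins are arbitrary.

```python
import numpy as np
import matplotlib.pyplot as plt

# Draw 1000 values from a standard normal distribution (seed is arbitrary)
rng = np.random.default_rng(0)
samples = rng.normal(size=1000)

# Create a figure with one set of axes and draw a histogram with 20 bins
fig, ax = plt.subplots()
counts, bin_edges, _ = ax.hist(samples, bins=20)

# Label the axes so the chart can be read on its own
ax.set_xlabel("value")
ax.set_ylabel("count")
ax.set_title("Histogram of 1000 normal samples")
```

In a Jupyter notebook the figure is displayed when the cell finishes; in a plain script you would typically call `plt.show()`.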

<a id = "cell_pandas"></a>
### pandas

`pandas` is a Python library for data wrangling and analysis. It is built around a data structure called the **DataFrame**. A pandas **DataFrame** is a table, similar to an **Excel** spreadsheet. pandas provides a great range of methods to modify and operate on this table; in particular, it allows **SQL**-like queries and joins of tables.

In [None]:
import pandas as pd

# create a simple dataset of people
data = {'Name': ["John", "Anna", "Peter", "Linda"],
        'Location' : ["New York", "Paris", "Berlin", "London"],
        'Age' : [24, 13, 53, 33]}

data_pandas = pd.DataFrame(data)
# IPython.display allows "pretty printing" of DataFrames
# in the Jupyter notebook
from IPython.display import display
display(data_pandas)

There are several possible ways to query this table. For example:

In [None]:
# Select all rows that have an age column greater than 30
display(data_pandas[data_pandas.Age > 30])
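
The SQL-like queries and joins mentioned earlier could look roughly like the sketch below. The `countries` lookup table is hypothetical and exists only to give the join something to match against.

```python
import pandas as pd

# The same people table as above
people = pd.DataFrame({
    'Name': ["John", "Anna", "Peter", "Linda"],
    'Location': ["New York", "Paris", "Berlin", "London"],
    'Age': [24, 13, 53, 33],
})

# Hypothetical lookup table, invented for this illustration
countries = pd.DataFrame({
    'Location': ["New York", "Paris", "Berlin", "London"],
    'Country': ["USA", "France", "Germany", "UK"],
})

# SQL-like filter: similar to WHERE Age >= 18
adults = people.query("Age >= 18")

# SQL-like join: similar to LEFT JOIN ... ON Location
joined = adults.merge(countries, on='Location', how='left')
print(joined)
```

`query` takes the condition as a string, while `merge` matches rows on the shared `Location` column.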

In [None]:
# Exercise - display names starting with P



In [None]:
# Exercise - select all records with the Location "Paris"


In [None]:
# Exercise - add another data row to data_pandas

# Create a new row as a DataFrame named new_row
#new_row =

# add the new row to data_pandas
data_pandas = pd.concat([data_pandas, new_row], ignore_index=True)


<a id = "cell_Iris"></a>
### <font color="blue">2. An Application: Classifying Iris Species</font>

In this section, we will go through a simple machine learning application and create our first model. Our goal is to build a machine learning model that can learn from the measurements of iris flowers whose species are known, so that we can predict the species of a new iris.

<a id = "cell_dataset"></a>
### Iris dataset
The data we will use for this example is the Iris dataset, a classical dataset in machine learning and statistics. We can load it by calling the `load_iris` function.

The `iris_dataset` object that is returned by `load_iris` is a `Bunch` object, which is very similar to a dictionary. It contains keys and values:

In [None]:
from sklearn.datasets import load_iris

iris_dataset = load_iris()
print("Keys of iris_dataset: \n{}".format(iris_dataset.keys()))

The value of the key `DESCR` is a short description of the dataset.

In [None]:
print(iris_dataset['DESCR'][:193] + "\n...")

The value of the key `target_names` is an array of strings, containing the species of flower that we want to predict.

In [None]:
print("Target names: {}".format(iris_dataset['target_names']))

The value of `feature_names` is a list of strings, giving the description of each feature:

In [None]:
print("Feature names: \n{}".format(iris_dataset['feature_names']))

The data itself is contained in the `target` and `data` fields. `data` contains the numeric measurements of sepal length, sepal width, petal length, and petal width in a `numpy` array:

In [None]:
print("Type of data: {}".format(type(iris_dataset['data'])))

The rows in the data array correspond to flowers, while the columns represent the four measurements that were taken for each flower.

The array contains measurements for 150 different flowers. Individual items are called **samples** in machine learning, and their properties are called **features**. The shape of the data array is the number of samples by the number of features.

In [None]:
print("Shape of data: {}".format(iris_dataset['data'].shape))
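
As a small illustration of the rows-are-samples, columns-are-features layout, the sketch below unpacks the shape and computes one mean per feature. The variable names `X`, `n_samples`, `n_features` and `feature_means` are illustrative.

```python
from sklearn.datasets import load_iris

iris_dataset = load_iris()
X = iris_dataset['data']

# Unpack the shape: rows are samples, columns are features
n_samples, n_features = X.shape
print("Samples: {}, features: {}".format(n_samples, n_features))

# axis=0 aggregates over the rows (samples), giving one value per feature
feature_means = X.mean(axis=0)
for name, mean in zip(iris_dataset['feature_names'], feature_means):
    print("{}: mean = {:.2f}".format(name, mean))
```

Using `axis=1` instead would aggregate across the four measurements of each flower, which is rarely what you want here.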

The feature values for the first five samples are shown below.

From this data, we can see that all of the first five flowers have a petal width of 0.2 cm and that, among these five, the first flower has the longest sepal, at 5.1 cm.

In [None]:
print("First five rows of data:\n{}".format(iris_dataset['data'][:5]))

# Exercises
# Select the first two columns of the first five rows

# Select every other row

# Select the 1st and 3rd columns of the first five rows


The target array contains the species of each of the flowers that were measured:

In [None]:
print("Type of target: {}".format(type(iris_dataset['target'])))

print("Shape of target: {}".format(iris_dataset['target'].shape))

The species are encoded as integers from 0 to 2. The meanings of the numbers are given by the `iris_dataset['target_names']` array: **0** means **setosa**, **1** means **versicolor**, and **2** means **virginica**.

In [None]:
print("Target:\n{}".format(iris_dataset['target']))
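
One way to translate the integer codes back into species names is to index `target_names` with the `target` array, as sketched below. The variable names `species`, `labels` and `counts` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris

iris_dataset = load_iris()

# Index the array of names with the array of integer codes:
# each code 0, 1 or 2 is replaced by the name stored at that position
species = iris_dataset['target_names'][iris_dataset['target']]
print("First five species: {}".format(species[:5]))

# Count how many flowers belong to each species
labels, counts = np.unique(species, return_counts=True)
for label, count in zip(labels, counts):
    print("{}: {} flowers".format(label, count))
```

The same indexing trick is used later to turn a predicted class number into a species name.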

<a id = "cell_TrainingTesting"></a>
### Training and Testing Data

We want to build a machine learning model from this data that can predict the species of iris for a new set of measurements. But before we can apply our model to new measurements, we need to know its performance.

This is usually done by splitting the labeled data (here, our 150 flower measurements) into two parts. One part of the data is used to build our machine learning model, and is called the **training data**. The rest of the data will be used to assess how well the model works; this is called the **test data** or **hold-out set**.

We use the `train_test_split` function, which shuffles and splits the dataset.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0)

print("X_train shape: {}".format(X_train.shape))
print("y_train shape: {}".format(y_train.shape))

print("X_test shape: {}".format(X_test.shape))
print("y_test shape: {}".format(y_test.shape))

# Class counts in each split (this split does not use the stratify parameter)
print("\nTraining labels distribution without stratify:\n", np.bincount(y_train))
print("\nTest labels distribution without stratify:\n", np.bincount(y_test))
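
The `random_state` argument fixes the shuffling so that the split is reproducible. The sketch below, which is an illustration rather than part of the exercise, splits the data twice with the same seed and once with a different one. The seed values are arbitrary.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_dataset = load_iris()
X, y = iris_dataset['data'], iris_dataset['target']

# Two calls with the same seed produce identical splits
_, X_test_a, _, y_test_a = train_test_split(X, y, random_state=0)
_, X_test_b, _, y_test_b = train_test_split(X, y, random_state=0)
print("Same seed, same test set:", np.array_equal(X_test_a, X_test_b))

# A different seed shuffles the rows differently
_, X_test_c, _, y_test_c = train_test_split(X, y, random_state=1)
print("Different seed, same test set:", np.array_equal(X_test_a, X_test_c))
```

Fixing the seed matters when you want classmates or reviewers to obtain the same numbers as you.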

<a id = "cell_DataSplitingExercise"></a>

#### <font color="blue">Data Splitting Exercise</font>

Work in pairs and discuss answers to the following questions.

a. When you do not explicitly specify **`test_size`** in **`train_test_split()`**, scikit-learn uses the default value of `0.25`, which may work in many cases. However, what is the disadvantage of this approach?

b. Read the documentation of the `train_test_split` function and examine the **`stratify`** parameter. What is its purpose? Based on your understanding of the `Iris` dataset, should we use `stratify` when splitting the data? Justify and implement your answer.

In [None]:
#Place your solution here

<a id = "cell_Examine"></a>
### Examine the Data

Before building a machine learning model it is often a good idea to inspect the data, to see if the task is easily solvable without machine learning, or if the desired information might not be contained in the data.

One of the best ways to inspect data is to visualize it using a scatter plot. Unfortunately, computer screens have only two dimensions, which allows us to plot only two (or maybe three) features at a time. One way around this problem is to do a pair plot, which looks at all possible pairs of features.

In [None]:
from pandas.plotting import scatter_matrix

# create dataframe from data in X_train
# label the columns using the strings in iris_dataset.feature_names
iris_dataframe = pd.DataFrame(
    X_train, columns=iris_dataset.feature_names)
print(iris_dataframe.head())
#
# create a scatter matrix from the dataframe, color by y_train
grr = scatter_matrix(iris_dataframe, c=y_train, figsize=(15, 15), marker='o',
    hist_kwds={'bins': 20}, s=60, alpha=.8)

From the plots, we can see that the three classes seem to be relatively well separated using the sepal and petal measurements. This means that a machine learning model will likely be able to learn to separate them.

<a id = "cell_k-Nearest"></a>
### Building k-Nearest Neighbors Model

Now we can start building the actual machine learning model. There are many classification algorithms in `scikit-learn` that we could use. Here we will use a k-nearest neighbors classifier, which is easy to understand.

In [None]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=1)

To build the model on the training set, we call the `fit` method of the `knn` object. The `fit` method returns the `knn` object itself (and modifies it in place), so the notebook displays a representation of our classifier.

In [None]:
knn.fit(X_train, y_train)

<a id = "cell_Predictions"></a>
### Making Predictions

We can now make predictions using this model on new data for which we might not know the correct labels.  Imagine we found an iris in the wild with a sepal length of **5 cm**, a sepal width of **2.9 cm**, a petal length of **1 cm**, and a petal width of **0.2 cm**. What species of iris would this be? We can put this data into a NumPy array.

In [None]:
X_new = np.array([[5, 2.9, 1, 0.2]])
#X_new = np.array([[5, 2.9, 1, 0.2],[5.1, 2.8, 1.1, 0.3]])

print("X_new.shape: {}".format(X_new.shape))

To make a prediction, we call the `predict` method of the `knn` object:

In [None]:
prediction = knn.predict(X_new)
print("Prediction: {}".format(prediction))
print("Predicted target name: {}".format(
    iris_dataset['target_names'][prediction]))

Our model predicts that this new iris belongs to the class 0, meaning its species is **setosa**.
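
To show what a one-neighbor classifier does under the hood, here is a rough sketch that finds the closest training flower by hand and compares the result with scikit-learn. It assumes Euclidean distance, which is the default metric of `KNeighborsClassifier`; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0)
X_new = np.array([[5, 2.9, 1, 0.2]])

# Euclidean distance from the new flower to every training flower
distances = np.sqrt(((X_train - X_new) ** 2).sum(axis=1))

# The prediction is the label of the closest training flower
nearest_index = np.argmin(distances)
manual_prediction = y_train[nearest_index]

# Compare with scikit-learn's classifier using a single neighbor
knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
sklearn_prediction = knn.predict(X_new)[0]

print("Manual prediction: {}".format(iris_dataset['target_names'][manual_prediction]))
print("scikit-learn prediction: {}".format(iris_dataset['target_names'][sklearn_prediction]))
```

With more than one neighbor, the classifier would take a majority vote among the closest training flowers instead of copying a single label.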

<a id = "cell_Evaluating"></a>
### Evaluating the Model

We can make a prediction for each iris in the test data and compare it against its label (the known species). We can measure how well the model works by computing the **accuracy**, which is the fraction of flowers for which the right species was predicted.

In [None]:
y_pred = knn.predict(X_test)

print("Test set predictions:\n {}".format(y_pred))

In [None]:
print("Test set score: {:.2f}".format(np.mean(y_pred == y_test)))

We can also use the `score` method of the `knn` object, which will compute the test set **accuracy** for us:

In [None]:
print("Test set score: {:.2f}".format(knn.score(X_test, y_test)))
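
Accuracy is a single number and does not show which species get confused with each other. As an optional extra that is not part of the original walkthrough, the sketch below uses `accuracy_score` and `confusion_matrix` from `sklearn.metrics` on the same split.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0)

knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Same quantity as np.mean(y_pred == y_test) and knn.score(X_test, y_test)
acc = accuracy_score(y_test, y_pred)
print("Accuracy: {:.2f}".format(acc))

# Rows are the true species, columns the predicted species;
# correct predictions lie on the diagonal
cm = confusion_matrix(y_test, y_pred)
print("Confusion matrix:\n{}".format(cm))
```

Off-diagonal entries, if any, indicate which pairs of species the model mixes up.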

<a id = "cell_Exercise1"></a>
### <font color="blue">3. Exercise: k-fold cross-validation</font>

Evaluate the performance of a Support Vector Machine classifier on the Iris dataset using 10-fold cross-validation.
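
Before attempting the exercise, it may help to see how k-fold cross-validation partitions the data. The sketch below uses ten toy samples and five folds, both chosen only for illustration, and deliberately leaves out the classifier. Each sample ends up in the test set exactly once.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten toy samples, identified by their index (illustrative data)
X_toy = np.arange(10).reshape(-1, 1)

# Five folds; shuffle=True with a fixed seed (both are arbitrary choices)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

test_indices_seen = []
for fold, (train_index, test_index) in enumerate(cv.split(X_toy), start=1):
    print("Fold {}: train={}, test={}".format(fold, train_index, test_index))
    test_indices_seen.extend(test_index.tolist())

# Every sample index appears in exactly one test fold
print("Test indices across folds:", sorted(test_indices_seen))
```

In the exercise, a model is trained on the training indices of each fold and scored on the corresponding test indices, and the scores are then averaged.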

<details><summary><font color="blue"><b>Click here for solution:</b></font></summary>
from sklearn.model_selection import KFold
from sklearn.datasets import load_iris
from sklearn.svm import SVC
import numpy as np

dataset = load_iris()
X = dataset['data']
y = dataset['target']
# shuffle=True is needed for random_state to take effect; recent scikit-learn
# versions raise an error when random_state is set with shuffle=False.
# Shuffling also matters here because the Iris samples are ordered by species.
cv = KFold(n_splits=10, random_state=1, shuffle=True)

scores = []
for train_index, test_index in cv.split(X):
    # Select the rows that belong to this fold's training and test parts
    X_train, X_test, y_train, y_test = X[train_index], X[test_index], y[train_index], y[test_index]
    svclassifier = SVC(kernel='linear')
    svclassifier.fit(X_train, y_train)
    # Accuracy on the held-out fold
    scores.append(svclassifier.score(X_test, y_test))

print('Cross validation accuracy: \n', scores)
print('Overall accuracy: ', np.mean(scores))
</details>

In [None]:
#Place your solution here

### References:

- Müller, A. C., & Guido, S. (2017). Introduction to Machine Learning with Python: A Guide for Data Scientists. O'Reilly Media. https://www.oreilly.com/library/view/introduction-to-machine/9781449369880/