## CIS 9
## Introduction to Machine Learning



Reading
<br>The Data Science Handbook, Chapter 5:
- What is machine learning
- Introducing Scikit-learn

Machine learning (ML) is a field that works with algorithms that can "learn," or improve themselves, through their experience with input data rather than through a programmer writing a better algorithm. As the algorithm processes more and more input data, it adjusts itself to be more accurate when it encounters new data.

Machine learning can be divided into three main categories: supervised learning, unsupervised learning, and reinforcement learning. Although there is a wide spectrum of machine learning algorithms, most of them fall into one of these three categories. The following discusses each type of learning, in order from simplest to most complex.

In __supervised learning__ there are: 
- a known set of inputs called _features_. The standard variable name for the features is X.
- a known set of outputs called _labels_. The standard variable name for the labels is y.

The goal of the algorithm is to learn the mapping function that maps the input to the output, so that when given new samples of X features, the machine can correctly predict the corresponding y labels.

Examples of common supervised learning applications:
- Given features of a car, a truck, a bus, the algorithm can determine that a new vehicle is a truck. This is known as a _classification_ problem.
- Given features and popularity ratings of past movies, the algorithm can predict the rating that a new movie is likely to receive. This is known as a _regression_ problem, because the output is a number rather than a category.
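
The difference between the two problem types comes down to the labels: a classifier predicts a category, while a regressor predicts a number. The following is a minimal sketch of both using Scikit-learn. All the vehicle and movie numbers are made up for illustration and do not come from a real dataset.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression

# Hypothetical vehicle features: [number of wheels, length in meters]
X_vehicles = [[4, 4.2], [4, 4.5], [6, 7.5], [6, 8.0], [6, 12.0], [6, 12.5]]
y_vehicles = ["car", "car", "truck", "truck", "bus", "bus"]  # categorical labels

# Classification: learn to map features to a category
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_vehicles, y_vehicles)
vehicle_prediction = clf.predict([[6, 7.8]])[0]
print(vehicle_prediction)

# Hypothetical movie feature: [marketing budget in $ millions]
X_movies = [[1], [2], [3], [4]]
y_movies = [10.0, 20.0, 30.0, 40.0]  # numeric labels: revenue in $ millions

# Regression: learn to map features to a number
reg = LinearRegression()
reg.fit(X_movies, y_movies)
revenue_prediction = reg.predict([[5]])[0]
print(round(revenue_prediction, 2))
```

In both cases the pattern is the same: call `fit` with known X and y, then call `predict` with new X.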

In __unsupervised learning__ there are:
- a set of features X 
- no corresponding labels y

The goal of the algorithm is to find previously unknown patterns in X. These patterns are often meaningful clusters of similar samples of X, which can reveal the categories or attributes intrinsic to the data. When given new samples of X, the machine can then assign them to the group they most resemble.

An example of a common unsupervised learning application:
<br>Given features (such as buying and browsing habits) of customers, the algorithm can group customers with similar tendencies together for marketing purposes. This is known as a _clustering_ problem.
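
As a rough sketch of clustering, the code below groups a handful of customers with Scikit-learn's `KMeans`, which is one of several clustering algorithms. The customer numbers and the choice of 2 clusters are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customer features: [store visits per month, average spend in $]
X_customers = np.array([
    [2, 20], [3, 25], [2, 22],        # occasional, low-spend shoppers
    [20, 200], [22, 210], [19, 190],  # frequent, high-spend shoppers
])

# No labels are given; we only ask the algorithm to find 2 groups
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X_customers)
print(cluster_ids)

# Assign a new customer to the closest existing cluster
new_customer_cluster = kmeans.predict(np.array([[21, 205]]))[0]
print(new_customer_cluster)
```

The cluster numbers themselves (0 or 1) are arbitrary. What matters is which customers end up sharing the same number.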

In __reinforcement learning__ there are:
- an initial state as input
- criticism or reward

The algorithm is given an initial state but not given any training data or features X. The algorithm iteratively explores the solution space, and when it reaches a conclusion it receives either criticism or reward (a weighted score). Based on this score, the algorithm continues to improve, and the best solution is the one with the most reward.

Examples of common reinforcement learning applications:
- An algorithm playing a complicated game such as chess. The initial state is the starting state of the game and a set of rules. The algorithm explores the solution space, which is dependent on the other player's moves, and uses the learned criticism or reward to try to win the game.
- A robot that navigates terrains to perform a task. The initial state comes from the environment around the robot, which is changeable. The algorithm uses its learned criticism or reward to respond to the changing environment and perform the task.
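
Chess and robotics are too large to show here, but the reward-driven loop can be sketched with a much simpler stand-in: a "multi-armed bandit" solved with an epsilon-greedy strategy. This is a simplified form of reinforcement learning with no changing state, and the reward probabilities, step count, and `epsilon` value below are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical environment: 3 actions, each paying a reward of 1 with a hidden probability
reward_probs = [0.1, 0.5, 0.9]
n_actions = len(reward_probs)

estimates = np.zeros(n_actions)  # learned average reward per action
counts = np.zeros(n_actions)     # how many times each action was tried
epsilon = 0.1                    # fraction of steps spent exploring at random

for step in range(5000):
    if rng.random() < epsilon:
        action = int(rng.integers(n_actions))  # explore: try a random action
    else:
        action = int(np.argmax(estimates))     # exploit: use the best-known action
    reward = 1.0 if rng.random() < reward_probs[action] else 0.0
    counts[action] += 1
    # incremental average: nudge the estimate toward the observed reward
    estimates[action] += (reward - estimates[action]) / counts[action]

best_action = int(np.argmax(estimates))
print(estimates.round(2), best_action)
```

The algorithm is never told which action is best. It only sees rewards, and its estimates improve as it collects more of them.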

0. Recall the ice cream sales and temperature problem from the module 2 (matplotlib) exercise, in which the algorithm advises the ice cream truck owner how much ice cream to stock in the truck, based on the week's temperature. What type of learning algorithm would this be?

In [None]:
# This is supervised learning 

The main steps of machine learning are:
- Gather and prepare the training data
- Choose an algorithm
- Train the algorithm
- Test the algorithm with new data

To illustrate these steps, we start with a simple supervised learning example. The algorithm will learn some basic differences between an apple and an orange, and then identify if a new fruit is an apple or an orange.

1a. Gather data
<br>Write code to read the fruits.xlsx file into a DataFrame, then print the DataFrame to view the data.

In [2]:
import pandas as pd
fruitsDF = pd.read_excel("fruits.xlsx")
print(fruitsDF)

   Weight  Texture   Fruit
0     143    0.870  Orange
1      90    0.200   Apple
2      82    0.125   Apple
3      93    0.120   Apple
4      87    0.140   Apple
5      90    0.880  Orange
6     123    0.890  Orange
7     116    0.900  Orange


1b. Prepare data
<br>The fruit _features_ are the weight (in grams) and the skin texture (a smooth texture is closer to 0.0, the rough texture is closer to 1.0). 
<br>The fruit _labels_ are the fruit names: apple or orange
<br>Write code to separate the DataFrame into:
- a DataFrame named _features_ that has the weight and texture
- a Series named _labels_ that has the fruit names

In [25]:
# Method 1
features = fruitsDF[["Weight", "Texture"]].copy()
labels = pd.Series(fruitsDF.Fruit)
print(features, "\n")
print(type(features), "\n")
print(labels, "\n")
print(type(labels), "\n")


# Method 2
features = fruitsDF.loc[:, ["Weight","Texture"]]
print(features)
print()
print(type(features))
print()
labels = pd.Series(fruitsDF["Fruit"])
print(labels)
print()
print(type(labels))

   Weight  Texture
0     143    0.870
1      90    0.200
2      82    0.125
3      93    0.120
4      87    0.140
5      90    0.880
6     123    0.890
7     116    0.900 

<class 'pandas.core.frame.DataFrame'> 

0    Orange
1     Apple
2     Apple
3     Apple
4     Apple
5    Orange
6    Orange
7    Orange
Name: Fruit, dtype: object 

<class 'pandas.core.series.Series'> 

   Weight  Texture
0     143    0.870
1      90    0.200
2      82    0.125
3      93    0.120
4      87    0.140
5      90    0.880
6     123    0.890
7     116    0.900

<class 'pandas.core.frame.DataFrame'>

0    Orange
1     Apple
2     Apple
3     Apple
4     Apple
5    Orange
6    Orange
7    Orange
Name: Fruit, dtype: object

<class 'pandas.core.series.Series'>


2. Choose an algorithm
<br>This is often the most difficult step: finding the correct algorithm to do the job.
Scikit-learn provides a diagram of some of the common algorithms, along with the steps to determine which algorithm or _estimator_ to try: https://scikit-learn.org/stable/tutorial/machine_learning_map/

For this example, since the algorithm needs to decide or classify if a fruit is an apple or an orange, the algorithm is called a _classifier_. 
<br>The classifier we'll use is a decision tree. A decision tree is like the Scikit-learn diagram above. In a decision tree the path is continuously split according to each feature. The algorithm makes a decision when it reaches the end of a path.

In [None]:
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier()  

3. Train the classifier
<br>In Scikit-learn, the classifier is trained with the _fit_ method. The inputs to the fit method, the features and labels, are called the _training data_.

In [None]:
classifier = classifier.fit(features, labels)  

4. Test the classifier with new data
<br>After training, the classifier is given test data with the _predict_ method, and we can observe the output of the algorithm.

In [None]:
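new_fruit = pd.DataFrame([[85, 0.29]], columns=["Weight", "Texture"])  # same column names as the training features
print(classifier.predict(new_fruit))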
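
To see the path splits that the decision tree learned, Scikit-learn's `export_text` function can print them as indented rules. The sketch below retypes the fruit table from step 1a so that it runs without the Excel file; the variable names and `random_state=0` are choices made for this illustration.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Same values as the fruits table printed in step 1a
fruits = pd.DataFrame({
    "Weight":  [143, 90, 82, 93, 87, 90, 123, 116],
    "Texture": [0.870, 0.200, 0.125, 0.120, 0.140, 0.880, 0.890, 0.900],
    "Fruit":   ["Orange", "Apple", "Apple", "Apple", "Apple", "Orange", "Orange", "Orange"],
})
X_fruit = fruits[["Weight", "Texture"]]
y_fruit = fruits["Fruit"]

tree = DecisionTreeClassifier(random_state=0).fit(X_fruit, y_fruit)

# Print the learned splits as indented if/else rules
rules = export_text(tree, feature_names=["Weight", "Texture"])
print(rules)

# Classify the same new fruit as in step 4
new_fruit = pd.DataFrame([[85, 0.29]], columns=["Weight", "Texture"])
fruit_prediction = tree.predict(new_fruit)[0]
print(fruit_prediction)
```

With this small dataset, a single split on texture is enough to separate the apples from the oranges, so weight does not appear in the printed rules.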

The second example is a more substantial example of supervised learning. It introduces a few more classifiers and shows some common steps that are used in ML.

In [None]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
#from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestClassifier

5a. Gather data

In [None]:
# Scikit-learn has many sample datasets available. We will use the wine dataset as an example.
wine = load_wine()

# Let's take a look at the general format of the sample datasets.
print(type(wine))
print("keys:\n", wine.keys())  # components of dataset
print("data:\n", wine.data, type(wine.data), wine.data.shape)  # features
print("target:\n", wine.target, type(wine.target), wine.target.shape)  # labels
print("frame:\n", wine.frame, type(wine.frame))
print("feature_name:\n", wine.feature_names, type(wine.feature_names))  # feature description
print("target_name:\n", wine.target_names, type(wine.target_names), wine.target_names.shape) # label description
#print("description:\n", wine.DESCR)
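
When `load_wine()` is called with its default arguments, `wine.frame` is `None`. As an alternative sketch, passing `as_frame=True` asks Scikit-learn to return pandas objects, which are often easier to explore. The variable names below are chosen for illustration.

```python
from sklearn.datasets import load_wine

# as_frame=True returns pandas objects instead of NumPy arrays
wine_frame = load_wine(as_frame=True)

wine_df = wine_frame.frame  # all features plus a "target" column
print(wine_df.shape)
print(wine_df.head(3))

# Number of samples in each wine class (0, 1, 2)
class_counts = wine_df["target"].value_counts().sort_index()
print(class_counts)
```

Checking the class counts early is useful, because a dataset with very uneven classes can make some scores look better than they are.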

5b. Prepare data

In [None]:
# find X
X = pd.DataFrame(wine.data, columns=wine.feature_names)
X.head(4)

In [None]:
# find y
y = pd.Series(wine.target)
print(y)

In [None]:
# divide the X and y data into 2 parts: training data and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
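
Because `train_test_split` shuffles the data randomly, the cell above produces a different split, and therefore slightly different scores, each time it runs. The sketch below shows two optional parameters that make the split repeatable and balanced. The value 42 is an arbitrary choice, and the variable names are for illustration only.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X_w, y_w = load_wine(return_X_y=True)

# random_state fixes the shuffle; stratify keeps the class proportions similar in both parts
split_a = train_test_split(X_w, y_w, test_size=0.3, random_state=42, stratify=y_w)
split_b = train_test_split(X_w, y_w, test_size=0.3, random_state=42, stratify=y_w)

X_train_a, X_test_a, y_train_a, y_test_a = split_a
print(X_train_a.shape, X_test_a.shape)

# The same random_state gives the same split on every run
same_split = bool((split_a[3] == split_b[3]).all())
print(same_split)
```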

6. Choose a classifier

In [None]:
# We try a few common classifiers to see which one works best
classifiers = [
    KNeighborsClassifier(3),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    GaussianNB()
    ]

7. Train and then test the classifier 

In [None]:
# go through each classifier
for classifier in classifiers:
    # train the classifier
    classifier.fit(X_train, y_train)  
    # test the trained classifier
    y_output = classifier.predict(X_test)
    # compare the predicted output with the actual output
    print(classifier)
    print(f"score: {f1_score(y_test, y_output, average='weighted'):.3f}")
    
# The F1 score combines precision and recall to measure how closely the classifier output matches the actual labels.
# A score of 1.0 is a perfect match; a score of 0.0 is the worst possible score.
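
To show where the F1 score comes from, the sketch below computes it by hand for a small two-class case and compares the result with `f1_score`. The labels are made up for illustration.

```python
from sklearn.metrics import f1_score

# Hypothetical binary labels: 1 = positive class, 0 = negative class
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]

# Count the outcomes for the positive class
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

precision = tp / (tp + fp)  # of the predicted positives, how many were right
recall = tp / (tp + fn)     # of the actual positives, how many were found
f1_manual = 2 * precision * recall / (precision + recall)

f1_sklearn = f1_score(y_true, y_pred)
print(round(f1_manual, 3), round(f1_sklearn, 3))
```

For a dataset with more than two classes, such as the wine data, `average='weighted'` computes an F1 score for each class and then averages them, weighting each class by its number of samples.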

Which classifier performs best for the wine dataset?

In [None]:
#

In [3]:
import numpy as np
A = np.arange(1,16)
print(A.max(), A.min(), np.percentile(A,25)) 

15 1 4.5
