# Data Programming in Python | BAIS:6040
# Advanced Data Analytics: Machine Learning with Scikit-Learn

Instructor: Jeff Hendricks 

Topics to be covered:
- Supervised learning - classification and regression (+ exercises)
- Unsupervised learning - clustering (+ exercises)

References: 
- scikit-learn documentation (http://scikit-learn.org/stable/documentation.html)
- Introduction to Machine Learning with Python (http://shop.oreilly.com/product/0636920030515.do)
- Python Data Science Handbook by Jake VanderPlas (http://shop.oreilly.com/product/0636920034919.do)
- Python for Data Analysis by Wes McKinney (https://www.oreilly.com/library/view/python-for-data/9781491957653/)
- Confusion Matrix by Geeks for Geeks (https://www.geeksforgeeks.org/confusion-matrix-machine-learning/)

## Importing Modules

In [None]:
import pandas as pd                                       # dataframes
from seaborn import load_dataset                          # Titanic dataset
from sklearn.cluster import KMeans                        # k-means clustering 
from sklearn.model_selection import train_test_split      # train/test data
from sklearn.neighbors import KNeighborsClassifier        # k-NN classification 
from sklearn.linear_model import LogisticRegression       # logistic regression 

## Loading the Dataset into a Pandas Dataframe

In [None]:
df = load_dataset("titanic")
df.head()

In [None]:
df.shape

In [None]:
df.info()

## Filtering Out Unnecessary Data

In [None]:
df = df[["survived", "pclass", "sex", "age", "sibsp", "parch", "fare", "embarked"]]

In [None]:
df.info()

## Converting Categorical Columns into Numeric Columns

As most machine learning libraries will only accept numbers as input, every categorical column in a dataset must be replaced with a numerical column. 

In [None]:
df.sex.head()

In [None]:
df.sex = pd.Categorical(df.sex)   # Step 1: declare the column is categorical 

pandas.Categorical: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Categorical.html

In [None]:
df.sex = df.sex.cat.codes         # Step 2: convert each category to its corresponding code

pandas.Series.cat.codes: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.cat.codes.html

In [None]:
df.sex.head()

In [None]:
df.info()

#### Non-binary Codes - What's the issue?

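Category codes imply an ordering (for example, C = 0 < Q = 1 < S = 2) that the categories do not actually have, so a learning algorithm might pick up a spurious relationship or overfit to it. Missing values are also silently encoded as -1.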

In [None]:
df.embarked = pd.Categorical(df.embarked)

In [None]:
df.embarked = df.embarked.cat.codes

In [None]:
df.embarked.head(10)
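To make the ordering issue concrete, here is a minimal sketch on a toy Series. The values are made up for illustration and only stand in for the `embarked` column. It shows that the codes follow the sorted category order and that a missing value becomes -1.

```python
import pandas as pd

# Toy stand-in for the `embarked` column (illustrative values, not the real data)
embarked = pd.Series(["S", "C", "Q", None, "S"])

cat = pd.Categorical(embarked)
codes = pd.Series(cat.codes)

print(list(cat.categories))   # categories are stored in sorted order
print(codes.tolist())         # each value replaced by the position of its category
```

A model that treats these codes as numbers would see "S" as larger than "C", and the missing entry as smaller than both, which is not a meaningful relationship.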

#### Let's try a different approach

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html

In [None]:
df = load_dataset("titanic")
df = df[["survived", "pclass", "sex", "age", "sibsp", "parch", "fare", "embarked"]]

In [None]:
df2=pd.get_dummies(df.embarked, prefix_sep = "::", drop_first = True)

In [None]:
df2.head()

In [None]:
df = pd.concat([df.drop('embarked',axis=1), pd.get_dummies(df.embarked, prefix_sep = "::", drop_first = False)], axis = 1)

In [None]:
df.head()

In [None]:
def createCategoricalDummies(df, categoricalList):
    return pd.get_dummies(df[categoricalList], prefix_sep = "::", drop_first = True)

In [None]:
df = load_dataset("titanic")
df = df[["survived", "pclass", "sex", "age", "sibsp", "parch", "fare", "embarked"]]

categoricalList = ['embarked','sex']

In [None]:
df = pd.concat([df.drop(categoricalList,axis=1), createCategoricalDummies(df,categoricalList)], axis = 1)
df.head()
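The dummy column names produced by `prefix_sep` and `drop_first` matter later, when new passengers are described by keys such as `embarked::Q`. The sketch below uses a small hypothetical frame (`toy` is not part of the notebook) to show how those names come about.

```python
import pandas as pd

# Hypothetical four-row frame with the same two categorical columns
toy = pd.DataFrame({
    "embarked": ["S", "C", "Q", "S"],
    "sex": ["male", "female", "female", "male"],
})

# Same call as in createCategoricalDummies: the first category of each column is dropped
dummies = pd.get_dummies(toy[["embarked", "sex"]], prefix_sep="::", drop_first=True)

print(list(dummies.columns))
print(dummies)
```

With `drop_first=True`, a row whose dummy columns for `embarked` are all zero belongs to the dropped category (here "C").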

## Handling Missing Data

As with categorical variables, most machine learning libraries will not accept null values as input. Every null value in a dataset must be removed or replaced with a numerical value. 

In [None]:
df.info()

In [None]:
df[df.isnull().any(axis=1)]

In [None]:
df = df.dropna()        # Drop all rows with any missing values

In [None]:
df.info()
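The notebook removes rows with nulls. The other option mentioned above, replacing nulls with a numerical value, could look like the following sketch. The toy frame and the choice of the column median as the replacement are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value per column (illustrative numbers)
toy = pd.DataFrame({"age": [22.0, np.nan, 35.0, 58.0],
                    "fare": [7.25, 8.05, np.nan, 26.55]})

dropped = toy.dropna()                               # option 1: remove rows with any null
filled = toy.fillna(toy.median(numeric_only=True))   # option 2: replace nulls with column medians

print(len(toy), len(dropped))
print(filled)
```

Dropping is simpler but discards rows; filling keeps every row at the cost of inventing values, so which one is appropriate depends on the dataset.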

# Supervised Learning - Classification

## Set the Goal

Let's aim to build a classification model using the Titanic dataset that is able to predict whether an imaginary passenger with a certain class, sex, age, company, fare, and port of embarkation would have survived the accident or not. This is a binary classification problem. 

For example, suppose a 25-year-old man bought a third-class ticket for £7 and was on board by himself. Would he more likely have died or survived?

## Preparing Data for Modeling

In [None]:
df.columns

In [None]:
features = list(df.columns)
features.remove('survived')
features

In [None]:
target = "survived"

According to the goal description above, we predict <i>survived</i> using <i>pclass</i>, <i>sex</i>, <i>age</i>, <i>sibsp</i>, <i>parch</i>, <i>fare</i>, and <i>embarked</i> (<i>sex</i> and <i>embarked</i> are now represented by their dummy columns). 

In [None]:
X = df[features]
y = df[target]

For supervised learning tasks, you need a feature dataset <i>X</i> and a target dataset <i>y</i>.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

sklearn.model_selection.train_test_split: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

You need to randomly split the feature and target datasets <i>X</i> and <i>y</i> into two training datasets <i>X_train</i> and <i>y_train</i> and two test datasets <i>X_test</i> and <i>y_test</i>. Setting the parameter `test_size` to 0.25 splits the data into 25% test data and 75% training data. 
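As a quick check of what the split does, the sketch below applies the same call to a small synthetic array. `X_demo` and `y_demo` are hypothetical names, not the Titanic data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(40).reshape(20, 2)   # 20 rows, 2 features
y_demo = np.arange(20) % 2              # alternating 0/1 targets

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)
print(X_tr.shape, X_te.shape)           # 75% of the rows go to training, 25% to test

# Using the same random_state reproduces exactly the same split
X_tr2, X_te2, _, _ = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)
print(np.array_equal(X_te, X_te2))
```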

In [None]:
X_train.head()

In [None]:
y_train.head()

In [None]:
X_test.head()

In [None]:
y_test.head()

## Modeling with k-Nearest Neighbors (k-NN)

In [None]:
knn = KNeighborsClassifier(n_neighbors=3)     # Build a new k-NN classification model with k set to 3

class sklearn.neighbors.KNeighborsClassifier(`n_neighbors`=5, `weights`='uniform', `algorithm`='auto', `leaf_size`=30, `p`=2, `metric`='minkowski', `metric_params`=None, `n_jobs`=None, **kwargs)

sklearn.neighbors.KNeighborsClassifier: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

In [None]:
knn.fit(X_train, y_train)                     # Fit the model using the two training datasets 

In [None]:
knn.score(X_train, y_train)                   # Get the training score of the model 

In [None]:
knn.score(X_test, y_test)                     # Get the test score of the model 

### Confusion Matrix Explained

- True Positive (TP) : Observation is positive, and is predicted to be positive.
- False Negative (FN) : Observation is positive, but is predicted negative.
- True Negative (TN) : Observation is negative, and is predicted to be negative.
- False Positive (FP) : Observation is negative, but is predicted positive.

#### Classification Rate or Accuracy is given by the relation:
- (TP + TN) / (TP + TN + FN + FP) 

#### Recall
- Recall is the number of correctly classified positive examples divided by the total number of positive examples. 
- High recall indicates that most examples of the positive class are recognized (small number of FN).
- Recall is given by the relation: TP / (TP + FN)

#### Precision
- For precision we divide the total number of correctly classified positive examples by the total number of predicted positive examples. 
- High Precision indicates an example labeled as positive is indeed positive (small number of FP).
- Precision is given by the relation: TP / (TP + FP)

High recall, low precision: Most of the positive examples are correctly recognized (low FN) but there are a lot of false positives.

Low recall, high precision: Miss a lot of positive examples (high FN) but those we predict as positive are indeed positive (low FP)

#### F-measure
- The F-measure combines precision and recall using the harmonic mean instead of the arithmetic mean, because the harmonic mean punishes extreme values more.
- The F-Measure will always be nearer to the smaller value of Precision or Recall.
- F-Measure : (2 * Recall * Precision) / (Recall + Precision)
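The formulas above can be checked by hand against scikit-learn. The following is a minimal sketch with made-up labels (`y_true` and `y_hat` are illustrative, not Titanic predictions).

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Made-up labels for illustration (1 = positive, 0 = negative)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# For binary labels, ravel() yields the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fn + fp)
recall = tp / (tp + fn)
precision = tp / (tp + fp)
f_measure = 2 * recall * precision / (recall + precision)

print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("manual :", accuracy, recall, precision, round(f_measure, 3))
print("sklearn:", accuracy_score(y_true, y_hat), recall_score(y_true, y_hat),
      precision_score(y_true, y_hat), round(f1_score(y_true, y_hat), 3))
```

In this example recall is 0.75 and precision is 0.6; the F-measure lands slightly below their arithmetic mean, closer to the smaller of the two.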

In [None]:
from IPython.display import Image
Image(url="https://media.geeksforgeeks.org/wp-content/uploads/Confusion_Matrix1_1.png")

In [None]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, recall_score, precision_score, fbeta_score, classification_report

# Make predictions against the test set
pred = knn.predict(X_test)

# Show the confusion matrix
print("confusion matrix:")
print(confusion_matrix(y_test, pred))

# Find the accuracy scores of the predictions against the true classes
print("accuracy: %0.3f" % accuracy_score(y_test, pred))
print("recall: %0.3f" % recall_score(y_test, pred))
print("precision: %0.3f" % precision_score(y_test, pred))
print("f-measure: %0.3f" % fbeta_score(y_test, pred, beta=1))
print(classification_report(y_test,pred))


In [None]:
person1 = {"pclass": 3, 
           "age": 25,
           "sibsp": 0,
           "parch": 0,
           "fare": 7,
           "embarked::Q":0,
           "embarked::S":0,
           "sex::male":1}

person2 = {"pclass": 1,
           "age": 8,
           "sibsp": 1,
           "parch": 2,
           "fare": 40,
           "embarked::Q":1,
           "embarked::S":0,
           "sex::male":0}

person3 = {"pclass": 2,
           "age": 20,
           "sibsp": 0,
           "parch": 0,
           "fare": 15,
           "embarked::Q":0,
           "embarked::S":1,
           "sex::male":0}

Suppose there were three imaginary passengers. 

In [None]:
X_new = []                                    # X_new contains new data items 
for person in [person1, person2, person3]:
    new_person = [person["pclass"], person["age"], person["sibsp"], person["parch"]
                  ,person["fare"], person["embarked::Q"], person["embarked::S"], person["sex::male"]]
    X_new.append(new_person)

In [None]:
knn.predict(X_new)

#### The columns of the dataframe sent to predict() have to be in the same order as X_train

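- Notice the different prediction when the columns are out of order. Recent scikit-learn versions check the feature names of a dataframe and may raise an error instead of returning a prediction.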

In [None]:
X_train.columns

In [None]:
# create a new person as a dataframe
person1a = {"pclass": 3, 
           "sibsp": 0,
           "parch": 0,
           "fare": 7,
           "embarked::Q":0,
           "embarked::S":0,
           "sex::male":1,
           "age": 25}

X_new2 = pd.DataFrame(person1a,index=[0])

In [None]:
knn.predict(X_new2)
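One way to avoid the ordering problem is to reorder the new dataframe by the training columns before predicting. The sketch below uses a shortened, hypothetical column list (`train_columns`) in place of `X_train.columns`.

```python
import pandas as pd

# Stand-in for X_train.columns (shortened for illustration)
train_columns = ["pclass", "age", "sibsp", "parch", "fare"]

# Keys deliberately given in a different order than the training columns
person = {"pclass": 3, "sibsp": 0, "parch": 0, "fare": 7, "age": 25}
X_one = pd.DataFrame(person, index=[0])

# Selecting with the training column list reorders the columns to match
X_one_aligned = X_one[train_columns]

print(list(X_one.columns))
print(list(X_one_aligned.columns))
```

With the notebook's objects, the equivalent would be `knn.predict(X_new2[X_train.columns])`.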

The k-NN model predicts that persons 1 and 3 would have died, whereas person 2 would have survived.

## Modeling with Logistic Regression

In [None]:
lr = LogisticRegression(solver="liblinear")   # Build a new logistic regression model 

class sklearn.linear_model.LogisticRegression(`penalty`='l2', `dual`=False, `tol`=0.0001, `C`=1.0, `fit_intercept`=True, `intercept_scaling`=1, `class_weight`=None, `random_state`=None, `solver`='warn', `max_iter`=100, `multi_class`='warn', `verbose`=0, `warm_start`=False, `n_jobs`=None, `l1_ratio`=None)

sklearn.linear_model.LogisticRegression: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

In [None]:
lr.fit(X_train, y_train)

In [None]:
lr.score(X_train, y_train)

In [None]:
lr.score(X_test, y_test)

In [None]:
# Make predictions against the test set
pred = lr.predict(X_test)

# Show the confusion matrix
print("confusion matrix:")
print(confusion_matrix(y_test, pred))

# Find the accuracy scores of the predictions against the true classes
print("accuracy: %0.3f" % accuracy_score(y_test, pred))
print("recall: %0.3f" % recall_score(y_test, pred))
print("precision: %0.3f" % precision_score(y_test, pred))
print("f-measure: %0.3f" % fbeta_score(y_test, pred, beta=1))

In [None]:
lr.predict(X_new)

The logistic regression model predicts that person 3 would have survived, unlike the prediction of the above k-NN model. 

Note that different models could make different predictions. 
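To see how often two models disagree, their predictions can be compared row by row. The sketch below does this on synthetic data from `make_classification`; the dataset and the names `knn_demo` and `lr_demo` are assumptions for illustration, not the Titanic models.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data (not the Titanic dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

knn_demo = KNeighborsClassifier(n_neighbors=3).fit(X_tr, y_tr)
lr_demo = LogisticRegression().fit(X_tr, y_tr)

# Boolean array: True where the two models predict different classes
disagree = knn_demo.predict(X_te) != lr_demo.predict(X_te)
print("rows where the models disagree:", int(disagree.sum()), "of", len(X_te))
```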

# Supervised Learning - Regression

In [None]:
weatherDf = pd.read_csv('data/weather.csv', index_col=0).dropna()        # Drop all rows with any missing values
weatherDf.head()

In [None]:
features = ['MinTemp','MaxTemp','Sunshine','Humidity3pm']
target = 'Rainfall'

## set the independent and dependent variables
X=weatherDf[features]
y=weatherDf[target]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

## Modeling with Linear Regression

sklearn.linear_model.LinearRegression: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

In [None]:
from sklearn.linear_model import LinearRegression #linear regression

lr=LinearRegression()

In [None]:
lr.fit(X_train, y_train)

In [None]:
## score for linear regression is the R2
lr.score(X_train, y_train)

### Other Accuracy Measures

In [None]:
import math
from sklearn.metrics import explained_variance_score, mean_absolute_error, r2_score, mean_squared_error

print(lr.score(X_test, y_test))

preds = lr.predict(X_test)

score = explained_variance_score(y_test, preds)
mae = mean_absolute_error(y_test, preds)
rmse = math.sqrt(mean_squared_error(y_test, preds))
r2 = r2_score(y_test, preds)
    
print("score = {:.5f} | MAE = {:.3f} | RMSE = {:.3f} | R2 = {:.5f}"
          .format(score, mae, rmse, r2))
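The measures printed above can also be written out by hand, which may help when interpreting them. This is a minimal sketch on four made-up values (`y_true` and `y_hat` are illustrative, not the weather data).

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 0.0, 2.0, 7.0])   # made-up observed values
y_hat = np.array([2.5, 0.0, 2.0, 8.0])    # made-up predictions

residuals = y_true - y_hat

mae_manual = np.mean(np.abs(residuals))            # mean absolute error
rmse_manual = np.sqrt(np.mean(residuals ** 2))     # root mean squared error
ss_res = np.sum(residuals ** 2)                    # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # total sum of squares
r2_manual = 1 - ss_res / ss_tot                    # coefficient of determination

print(mae_manual, rmse_manual, r2_manual)
print(mean_absolute_error(y_true, y_hat),
      np.sqrt(mean_squared_error(y_true, y_hat)),
      r2_score(y_true, y_hat))
```

MAE and RMSE are in the units of the target, while R2 is unitless and equals 1 for a perfect fit.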

In [None]:
print(lr.intercept_)
print(lr.coef_)

In [None]:
obs1 = {   "MinTemp": 6, 
           "MaxTemp": 32,
           "Sunshine": 5,
           "Humidity3pm": 30}

obs2 = {   "MinTemp": 16, 
           "MaxTemp": 42,
           "Sunshine": 10,
           "Humidity3pm": 35}

obs3 = {   "MinTemp": 10, 
           "MaxTemp": 25,
           "Sunshine": 7,
           "Humidity3pm": 60}

In [None]:
X_new = []                                    # X_new contains new data items 
for obs in [obs1, obs2, obs3]:
    new_obs = [obs["MinTemp"], obs["MaxTemp"], obs["Sunshine"], obs["Humidity3pm"]]
    X_new.append(new_obs)

In [None]:
lr.predict(X_new)

## Regression Modeling with Ridge

Least Squares with l2 Regularization

sklearn.linear_model.Ridge: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html
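Ridge minimizes the usual sum of squared errors plus `alpha` times the squared l2 norm of the coefficients, so larger `alpha` values pull the coefficients toward zero. The sketch below illustrates this on synthetic data; the data, the true coefficients, and the `alpha` values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))                                   # synthetic features
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

ols = LinearRegression().fit(X_demo, y_demo)        # no regularization
weak = Ridge(alpha=1.0).fit(X_demo, y_demo)         # mild l2 penalty
strong = Ridge(alpha=100.0).fit(X_demo, y_demo)     # heavy l2 penalty

for name, model in [("OLS", ols), ("Ridge alpha=1", weak), ("Ridge alpha=100", strong)]:
    print(name, model.coef_.round(3), "norm:", round(float(np.linalg.norm(model.coef_)), 3))
```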

In [None]:
from sklearn.linear_model import Ridge

rr=Ridge(solver='svd')

In [None]:
rr.fit(X_train, y_train)

In [None]:
## score for ridge regression is the R2
rr.score(X_train, y_train)

In [None]:
## Other accuracy measures
print(rr.score(X_test, y_test))

preds = rr.predict(X_test)

score = explained_variance_score(y_test, preds)
mae = mean_absolute_error(y_test, preds)
rmse = math.sqrt(mean_squared_error(y_test, preds))
r2 = r2_score(y_test, preds)
    
print("score = {:.5f} | MAE = {:.3f} | RMSE = {:.3f} | R2 = {:.5f}"
          .format(score, mae, rmse, r2))

In [None]:
rr.predict(X_new)

# Exercises for Supervised Learning (8 questions)

Let's build another classification model for titanic survivors. This time, build a logistic regression model using pclass, age, and fare as the features.

In [None]:
df = load_dataset("titanic")

1\. You need two variables: X as a feature dataset and y as a target dataset. Select the appropriate features in <i>df</i> and assign them to a variable called <i>X</i>. Likewise, select the target in <i>df</i> and assign it to a variable called <i>y</i>.

In [None]:
# Your answer here


2\. Split <i>X</i> and <i>y</i> into two training datasets <i>X_train</i> and <i>y_train</i> and two test datasets <i>X_test</i> and <i>y_test</i>. Set the `test_size` to 0.25 and `random_state` to 0.

In [None]:
# Your answer here


You can build a new logistic regression model <i>lgr</i> as follows. The solver is set to <i>liblinear</i>. 

In [None]:
lgr = LogisticRegression(solver="liblinear")   # Build a new logistic regression model 

3\. Fit the logistic regression model <i>lgr</i> using the two training datasets <i>X_train</i> and <i>y_train</i>.

In [None]:
# Your answer here


4\. Get the training score and test score, confusion matrix, and classification report. 

In [None]:
# Your answer here


Let's aim to build a <b>regression</b> model using the Major League Baseball dataset that is able to predict the number of home runs (HR) a batter would hit in a single season based on statistics such as number of games (G), number of at bats (AB), runs scored (R), number of hits (H), number of doubles (2B), number of triples (3B), number of stolen bases (SB), and number of bases on balls (BB). 

In [None]:
dfb = pd.read_csv("MLB_Batting.csv")
dfb18 = dfb[(dfb.yearID == 2018) & ((dfb.lgID == "NL") | (dfb.lgID == "AL"))]
dfb18.info()

According to the goal description above, the features to be used include G, AB, R, H, 2B, 3B, SB, and BB, while the target is HR. 

In [None]:
features = ["G", "AB", "R", "H", "2B", "3B", "SB", "BB"]
target = "HR"

5\. You need two variables: X as a feature dataset and y as a target dataset. Select the features in <i>dfb18</i> and assign them to a variable called <i>X</i>. Likewise, select the target in <i>dfb18</i> and assign it to a variable called <i>y</i>.

Split <i>X</i> and <i>y</i> into two training datasets <i>X_train</i> and <i>y_train</i> and two test datasets <i>X_test</i> and <i>y_test</i>. Set the `test_size` to 0.25 and `random_state` to 0.

In [None]:
# Your answer here


You can build a new least squares linear regression model <i>lr</i> as follows.

In [None]:
from sklearn.linear_model import LinearRegression     # linear regression

lr = LinearRegression()

6\. Fit the linear regression model <i>lr</i> using the training datasets.

In [None]:
# Your answer here


7\. Get the training score and test score, MAE, and RMSE, respectively. 

In [None]:
# Your answer here


8\. Suppose there is a new batter with the following record. According to your model, how many home runs would the batter hit?

In [None]:
batter = {"G": 130,
          "AB": 450,
          "R": 100,
          "H": 170,
          "2B": 60,
          "3B": 10,
          "SB": 5,
          "BB": 80}

new_batter = [batter["G"], batter["AB"], batter["R"], batter["H"], batter["2B"], batter["3B"], batter["SB"], batter["BB"]]
X_new = [new_batter]

In [None]:
# Your answer here


# Unsupervised Learning - Clustering

## Set the Goal

Let's aim to build a clustering model that is able to group, or cluster, all passengers on board the Titanic into several groups, or clusters, of similar passengers. 

## Prepare Data for Modeling

In [None]:
df = load_dataset("titanic")

df = df[["survived", "pclass", "sex", "age", "sibsp", "parch", "fare", "embarked"]]

categoricalList = ['embarked','sex']

df = pd.concat([df.drop(categoricalList,axis=1), createCategoricalDummies(df,categoricalList)], axis = 1).dropna()

In [None]:
X = df

Note that there is no <i>y</i> in unsupervised learning. All you need is an input dataset <i>X</i>. Also, you do not have to split the data into training and test sets. 

## Modeling with k-Means Clustering

In [None]:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=5, random_state=0)     # Create a new k-means clustering model with k set to 5

class sklearn.cluster.KMeans(`n_clusters`=8, `init`='k-means++', `n_init`=10, `max_iter`=300, `tol`=0.0001, `precompute_distances`='auto', `verbose`=0, `random_state`=None, `copy_x`=True, `n_jobs`=None, `algorithm`='auto')

sklearn.cluster.KMeans: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

In [None]:
kmeans.fit(X)

In [None]:
kmeans.cluster_centers_                          # Store the values of centroids 

In [None]:
kmeans.labels_                                   # Store the cluster labels of data items 

Each data item in <i>X</i> is assigned a cluster label, which is a number between 0 and k-1. 
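The label range can be seen on a small synthetic example. The sketch below builds three well-separated blobs and fits k-means with k = 3; the blob centers and the explicit `n_init=10` are choices made for this illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Three well-separated blobs of 20 points each (synthetic data)
centers = np.array([[0, 0], [10, 10], [-10, 10]])
X_demo = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)

print(sorted(set(km.labels_.tolist())))   # labels run from 0 to k-1
print(km.cluster_centers_.shape)          # one centroid per cluster, one column per feature
print(np.bincount(km.labels_))            # number of points per cluster
```

The label numbers themselves are arbitrary: which blob is called 0, 1, or 2 depends on the initialization, not on any property of the data.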

In [None]:
df["label"] = kmeans.labels_                    # Add a new column label with the clustering labels 

In [None]:
df.head(10)

In [None]:
df.label.value_counts()                          # Count the number of values for each label 

In [None]:
df[df.label == 2].sample(n=10, replace=False, random_state=0)  # Select a random sample with 10 rows that have the label 2

In [None]:
df[df.label == 0].sample(n=10, replace=False, random_state=0)  # Select a random sample with 10 rows that have the label 0

In [None]:
df.groupby("label").mean()

# Exercises for Clustering (6 questions)

Using the same baseball data, let's aim to build a clustering model that is able to group all batters into 5 clusters of similar ones by looking at the same __8 features__ used in the above regression __exercises__ along with the __target__. 

We need a copy of <i>dfb18</i> for clustering. Use <i>dfb18c</i> for your clustering.

In [None]:
dfb18c = dfb18.copy()
dfb18c.head()

1\. For clustering, all you need is an input dataset <i>X</i>. Select the 9 columns (the 8 features plus the target) in <i>dfb18c</i> and assign them to <i>X</i>.

In [None]:
# Your answer here


2\. Build a new k-means clustering model <i>kmeans</i>. Set `n_clusters` to 5 and `random_state` to 0.

In [None]:
# Your answer here


3\. Fit the clustering model <i>kmeans</i> using the input dataset <i>X</i>.

In [None]:
# Your answer here


4\. Assign the resulting labels of <i>kmeans</i> to the new column of <i>dfb18c</i> called <i>label</i>.

In [None]:
# Your answer here


5\. Check the number of values for each label. 

In [None]:
# Your answer here


6\. Select a random sample of <i>dfb18c</i> with 10 rows that have the label 2. For random sampling, set `replace` to False and `random_state` to 0.

In [None]:
# Your answer here
