# Data Programming in Python | BAIS:6040
# Advanced Data Analytics: Machine Learning with Scikit-Learn

Instructor: Jeff Hendricks 

Topics to be covered:
- Supervised learning - classification and regression (+ exercises)
- Unsupervised learning - clustering (+ exercises)

References: 
- Documentation scikit-learn (http://scikit-learn.org/stable/documentation.html)
- Introduction to Machine Learning with Python (http://shop.oreilly.com/product/0636920030515.do)
- Python Data Science Handbook by Jake VanderPlas (http://shop.oreilly.com/product/0636920034919.do)
- Python for Data Analysis by Wes McKinney (https://www.oreilly.com/library/view/python-for-data/9781491957653/)
- Confusion Matrix by Geeks for Geeks (https://www.geeksforgeeks.org/confusion-matrix-machine-learning/)

## Prerequisites

In [1]:
# ! pip install --upgrade numpy pandas scikit-learn seaborn

## Importing Modules

In [2]:
import pandas as pd                                       # dataframes
from seaborn import load_dataset                          # Titanic dataset
from sklearn.cluster import KMeans                        # k-means clustering 
from sklearn.model_selection import train_test_split      # train/test data
from sklearn.neighbors import KNeighborsClassifier        # k-NN classification 
from sklearn.linear_model import LogisticRegression       # logistic regression 

## Loading the Dataset into a Pandas Dataframe

In [3]:
df = load_dataset("titanic")
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [4]:
df.shape

(891, 15)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB


## Filtering Out Unnecessary Data

In [6]:
df = df[["survived", "pclass", "sex", "age", "sibsp", "parch", "fare", "embarked"]]

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   survived  891 non-null    int64  
 1   pclass    891 non-null    int64  
 2   sex       891 non-null    object 
 3   age       714 non-null    float64
 4   sibsp     891 non-null    int64  
 5   parch     891 non-null    int64  
 6   fare      891 non-null    float64
 7   embarked  889 non-null    object 
dtypes: float64(2), int64(4), object(2)
memory usage: 55.8+ KB


## Converting Categorical Columns into Numeric Columns

As most machine learning libraries will only accept numbers as input, every categorical column in a dataset must be replaced with a numerical column. 

In [8]:
df.sex.head()

0      male
1    female
2    female
3    female
4      male
Name: sex, dtype: object

In [9]:
df.sex = pd.Categorical(df.sex)   # Step 1: declare the column is categorical 

pandas.Categorical: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Categorical.html

In [10]:
df.sex = df.sex.cat.codes         # Step 2: convert each category to its corresponding code

pandas.Series.cat.codes: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.cat.codes.html

In [11]:
df.sex.head()

0    1
1    0
2    0
3    0
4    1
Name: sex, dtype: int8

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   survived  891 non-null    int64  
 1   pclass    891 non-null    int64  
 2   sex       891 non-null    int8   
 3   age       714 non-null    float64
 4   sibsp     891 non-null    int64  
 5   parch     891 non-null    int64  
 6   fare      891 non-null    float64
 7   embarked  889 non-null    object 
dtypes: float64(2), int64(4), int8(1), object(1)
memory usage: 49.7+ KB


#### Non-binary Codes - What's the issue?

Category codes imply an ordering (here C = 0, Q = 1, S = 2) that the ports of embarkation do not actually have. A learning algorithm may treat that ordering as meaningful and learn a spurious relationship.

In [13]:
df.embarked = pd.Categorical(df.embarked)

In [14]:
df.embarked = df.embarked.cat.codes

In [15]:
df.embarked.head(10)

0    2
1    0
2    2
3    2
4    2
5    1
6    2
7    2
8    2
9    0
Name: embarked, dtype: int8
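
To make the ordering issue concrete, here is a minimal sketch on a toy column (the `ports` series is made up for illustration and is not part of the Titanic data). It shows that the codes simply follow the alphabetical order of the categories, and that a missing value is silently encoded as -1.

```python
import pandas as pd

# Toy column with the same three ports as `embarked`, plus one missing value.
ports = pd.Series(["S", "C", "Q", None, "S"])

cat = pd.Categorical(ports)
print(list(cat.categories))   # categories are sorted alphabetically
print(cat.codes.tolist())     # position of each value in that sorted list; missing -> -1
```

Nothing about Cherbourg, Queenstown, or Southampton justifies treating S as "twice" Q, which is why the next cells switch to dummy columns.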

#### Let's try a different approach

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html

In [16]:
df = load_dataset("titanic")
df = df[["survived", "pclass", "sex", "age", "sibsp", "parch", "fare", "embarked"]]

In [17]:
df2=pd.get_dummies(df.embarked, prefix_sep = "::", drop_first = True)

In [18]:
df2.head()

Unnamed: 0,Q,S
0,0,1
1,0,0
2,0,1
3,0,1
4,0,1


In [19]:
df = pd.concat([df.drop('embarked',axis=1), pd.get_dummies(df.embarked, prefix_sep = "::", drop_first = False)], axis = 1)

In [20]:
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,C,Q,S
0,0,3,male,22.0,1,0,7.25,0,0,1
1,1,1,female,38.0,1,0,71.2833,1,0,0
2,1,3,female,26.0,0,0,7.925,0,0,1
3,1,1,female,35.0,1,0,53.1,0,0,1
4,0,3,male,35.0,0,0,8.05,0,0,1


In [21]:
def createCategoricalDummies(df, categoricalList):
    return pd.get_dummies(df[categoricalList], prefix_sep = "::", drop_first = True)

In [22]:
df = load_dataset("titanic")
df = df[["survived", "pclass", "sex", "age", "sibsp", "parch", "fare", "embarked"]]

categoricalList = ['embarked','sex']

In [23]:
df = pd.concat([df.drop(categoricalList,axis=1), createCategoricalDummies(df,categoricalList)], axis = 1)
df.head()

Unnamed: 0,survived,pclass,age,sibsp,parch,fare,embarked::Q,embarked::S,sex::male
0,0,3,22.0,1,0,7.25,0,1,1
1,1,1,38.0,1,0,71.2833,0,0,0
2,1,3,26.0,0,0,7.925,0,1,0
3,1,1,35.0,1,0,53.1,0,1,0
4,0,3,35.0,0,0,8.05,0,1,1
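
The effect of `drop_first=True` is easier to see on a tiny frame. The sketch below uses a made-up four-row `toy` DataFrame; `dtype=int` is passed only so that the output is 0/1 regardless of the pandas version (newer versions may return booleans by default).

```python
import pandas as pd

toy = pd.DataFrame({
    "embarked": ["S", "C", "Q", None],
    "sex": ["male", "female", "female", "male"],
})

# drop_first=True drops the first category of each column (C and female),
# which then act as the implicit baseline (all dummy columns equal to 0).
dummies = pd.get_dummies(toy, prefix_sep="::", drop_first=True, dtype=int)
print(dummies)
```

Note that the row with a missing port also gets zeros in both `embarked` columns, so after encoding it cannot be told apart from a passenger who embarked at C.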


## Handling Missing Data

As with categorical variables, most machine learning libraries will not accept null values as input. Every null value in a dataset must be removed or replaced with a numerical value. 

In [24]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   survived     891 non-null    int64  
 1   pclass       891 non-null    int64  
 2   age          714 non-null    float64
 3   sibsp        891 non-null    int64  
 4   parch        891 non-null    int64  
 5   fare         891 non-null    float64
 6   embarked::Q  891 non-null    uint8  
 7   embarked::S  891 non-null    uint8  
 8   sex::male    891 non-null    uint8  
dtypes: float64(2), int64(4), uint8(3)
memory usage: 44.5 KB


In [25]:
df[df.isnull().any(axis=1)]

Unnamed: 0,survived,pclass,age,sibsp,parch,fare,embarked::Q,embarked::S,sex::male
5,0,3,,0,0,8.4583,1,0,1
17,1,2,,0,0,13.0000,0,1,1
19,1,3,,0,0,7.2250,0,0,0
26,0,3,,0,0,7.2250,0,0,1
28,1,3,,0,0,7.8792,1,0,0
...,...,...,...,...,...,...,...,...,...
859,0,3,,0,0,7.2292,0,0,1
863,0,3,,8,2,69.5500,0,1,0
868,0,3,,0,0,9.5000,0,1,1
878,0,3,,0,0,7.8958,0,1,1


In [26]:
df = df.dropna()        # Drop all rows with any missing values

In [27]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 714 entries, 0 to 890
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   survived     714 non-null    int64  
 1   pclass       714 non-null    int64  
 2   age          714 non-null    float64
 3   sibsp        714 non-null    int64  
 4   parch        714 non-null    int64  
 5   fare         714 non-null    float64
 6   embarked::Q  714 non-null    uint8  
 7   embarked::S  714 non-null    uint8  
 8   sex::male    714 non-null    uint8  
dtypes: float64(2), int64(4), uint8(3)
memory usage: 41.1 KB
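
Dropping rows is only one of the two options mentioned above. As a hedged sketch of the other option, the made-up `toy` frame below replaces missing ages with the median age instead; this is an illustration, not the approach used in the rest of this notebook.

```python
import pandas as pd

toy = pd.DataFrame({"age": [22.0, None, 26.0, 35.0, None],
                    "fare": [7.25, 8.46, 7.93, 53.10, 13.00]})

dropped = toy.dropna()                               # removes the 2 incomplete rows
filled = toy.fillna({"age": toy["age"].median()})    # keeps all 5 rows

print(len(dropped), len(filled))
print(filled["age"].tolist())
```

Dropping costs data (here 177 of 891 Titanic rows), while filling keeps the rows but adds values that were never observed. When imputing before modeling, it is usually safer to compute the fill value from the training data only.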


# Supervised Learning - Classification

## Set the Goal

Let's aim to build a classification model using the Titanic dataset that is able to predict whether an imaginary passenger with a certain class, sex, age, company, fare, and embark location would have survived the accident or not. This is a binary classification problem. 

For example, suppose there was a man of age 25 who purchased a third class ticket at £7 and was on board by himself, would he probably have died or survived?

## Preparing Data for Modeling

In [28]:
df.columns

Index(['survived', 'pclass', 'age', 'sibsp', 'parch', 'fare', 'embarked::Q',
       'embarked::S', 'sex::male'],
      dtype='object')

In [29]:
features = list(df.columns)
features.remove('survived')
features

['pclass',
 'age',
 'sibsp',
 'parch',
 'fare',
 'embarked::Q',
 'embarked::S',
 'sex::male']

In [30]:
target = "survived"

According to the goal description above, we predict <i>survived</i> using <i>pclass</i>, <i>sex</i>, <i>age</i>, <i>sibsp</i>, <i>parch</i>, <i>fare</i>, and <i>embarked</i>, where <i>sex</i> and <i>embarked</i> are represented by their dummy columns. 

In [31]:
X = df[features]
y = df[target]

For supervised learning tasks, you need a feature dataset <i>X</i> and a target dataset <i>y</i>.

In [32]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

sklearn.model_selection.train_test_split: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

You need to randomly split the feature and target datasets <i>X</i> and <i>y</i> into two training datasets <i>X_train</i> and <i>y_train</i> and two test datasets <i>X_test</i> and <i>y_test</i>. Setting the parameter `test_size` to 0.25 puts 25% of the rows in the test data and the remaining 75% in the training data. 
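
The split proportions can be checked on a small synthetic array. In this sketch `X_toy` and `y_toy` are made up, and the `stratify` argument is an optional extra that the notebook itself does not use; it keeps the class proportions the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(200).reshape(100, 2)   # 100 rows, 2 features
y_toy = np.array([0] * 60 + [1] * 40)    # 60% class 0, 40% class 1

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0, stratify=y_toy)

print(X_tr.shape, X_te.shape)            # 75 training rows, 25 test rows
print(y_tr.mean(), y_te.mean())          # share of class 1 in each part
```

Fixing `random_state` makes the split reproducible, so everyone running the notebook gets the same training and test rows.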

In [33]:
X_train.head()

Unnamed: 0,pclass,age,sibsp,parch,fare,embarked::Q,embarked::S,sex::male
535,2,7.0,0,2,26.25,0,1,0
129,3,45.0,0,0,6.975,0,1,1
491,3,21.0,0,0,7.25,0,1,1
703,3,25.0,0,0,7.7417,1,0,1
313,3,28.0,0,0,7.8958,0,1,1


In [34]:
y_train.head()

535    1
129    0
491    0
703    0
313    0
Name: survived, dtype: int64

In [35]:
X_test.head()

Unnamed: 0,pclass,age,sibsp,parch,fare,embarked::Q,embarked::S,sex::male
423,3,28.0,1,1,14.4,0,1,0
177,1,50.0,0,0,28.7125,0,0,0
305,1,0.92,1,2,151.55,0,1,1
292,2,36.0,0,0,12.875,0,0,1
889,1,26.0,0,0,30.0,0,0,1


In [36]:
y_test.head()

423    0
177    0
305    1
292    0
889    1
Name: survived, dtype: int64

## Modeling with k-Nearest Neighbors (k-NN)

In [37]:
knn = KNeighborsClassifier(n_neighbors=3)     # Build a new k-NN classification model with k set to 3

class sklearn.neighbors.KNeighborsClassifier(`n_neighbors`=5, `weights`='uniform', `algorithm`='auto', `leaf_size`=30, `p`=2, `metric`='minkowski', `metric_params`=None, `n_jobs`=None, **kwargs)

sklearn.neighbors.KNeighborsClassifier: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

In [38]:
knn.fit(X_train, y_train)                     # Fit the model using the two training datasets 

KNeighborsClassifier(n_neighbors=3)

In [39]:
knn.score(X_train, y_train)                   # Get the training score of the model 

0.8355140186915888

In [40]:
knn.score(X_test, y_test)                     # Get the test score of the model 

0.6480446927374302
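
The gap between the training score and the test score depends strongly on k. The following sketch, which uses synthetic data from `make_classification` rather than the Titanic data, shows one way to compare a few values of `n_neighbors`; the specific values of k are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data (not the Titanic dataset).
X_syn, y_syn = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.25, random_state=0)

scores = {}
for k in (1, 3, 5, 11, 21):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    scores[k] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print(k, scores[k])                  # (training score, test score)
```

With k = 1 every training point is its own nearest neighbor, so the training score is perfect while the test score usually is not. A large gap between the two scores, as in the cells above, is a sign of overfitting.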

### Confusion Matrix Explained

- True Positive (TP) : Observation is positive, and is predicted to be positive.
- False Negative (FN) : Observation is positive, but is predicted negative.
- True Negative (TN) : Observation is negative, and is predicted to be negative.
- False Positive (FP) : Observation is negative, but is predicted positive.

#### Classification Rate or Accuracy is given by the relation:
- (TP + TN) / (TP + TN + FN + FP) 

#### Recall
- Recall can be defined as the ratio of the total number of correctly classified positive examples divided by the total number of positive examples. 
- High Recall indicates the class is correctly recognized (small number of FN).
- Recall is given by the relation: TP / (TP + FN)

#### Precision
- For precision we divide the total number of correctly classified positive examples by the total number of predicted positive examples. 
- High Precision indicates an example labeled as positive is indeed positive (small number of FP).
- Precision is given by the relation: TP / (TP + FP)

High recall, low precision: Most of the positive examples are correctly recognized (low FN) but there are a lot of false positives.

Low recall, high precision: Miss a lot of positive examples (high FN) but those we predict as positive are indeed positive (low FP)

#### F-measure
- The F-measure uses the harmonic mean of precision and recall in place of the arithmetic mean, because the harmonic mean punishes extreme values more.
- The F-Measure will always be nearer to the smaller value of Precision or Recall.
- F-Measure : (2 * Recall * Precision) / (Recall + Precision)

In [41]:
from IPython.display import Image
Image(url="https://media.geeksforgeeks.org/wp-content/uploads/Confusion_Matrix1_1.png")

In [42]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, recall_score, precision_score, fbeta_score, classification_report

# Make predictions against the test set
pred = knn.predict(X_test)

# Show the confusion matrix
print("confusion matrix:")
print(confusion_matrix(y_test, pred))

# Find the accuracy scores of the predictions against the true classes
print("accuracy: %0.3f" % accuracy_score(y_test, pred))
print("recall: %0.3f" % recall_score(y_test, pred))
print("precision: %0.3f" % precision_score(y_test, pred))
print("f-measure: %0.3f" % fbeta_score(y_test, pred, beta=1))
print(classification_report(y_test,pred))


confusion matrix:
[[79 24]
 [39 37]]
accuracy: 0.648
recall: 0.487
precision: 0.607
f-measure: 0.540
              precision    recall  f1-score   support

           0       0.67      0.77      0.71       103
           1       0.61      0.49      0.54        76

    accuracy                           0.65       179
   macro avg       0.64      0.63      0.63       179
weighted avg       0.64      0.65      0.64       179
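
The formulas from the "Confusion Matrix Explained" section can be checked by hand against the output above. This sketch plugs in the four counts of the k-NN confusion matrix, assuming scikit-learn's layout in which rows are the true classes and columns are the predicted classes.

```python
# Confusion matrix printed above for the k-NN model on the test set.
# Layout: [[TN, FP], [FN, TP]] (rows = true class, columns = predicted class).
tn, fp = 79, 24
fn, tp = 39, 37

accuracy = (tp + tn) / (tp + tn + fn + fp)
recall = tp / (tp + fn)
precision = tp / (tp + fp)
f_measure = 2 * recall * precision / (recall + precision)

print(f"accuracy={accuracy:.3f} recall={recall:.3f} "
      f"precision={precision:.3f} f-measure={f_measure:.3f}")
# accuracy=0.648 recall=0.487 precision=0.607 f-measure=0.540
```

The results match the values reported by `accuracy_score`, `recall_score`, `precision_score`, and `fbeta_score`, and the F-measure (0.540) is indeed closer to the smaller of precision and recall.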



In [43]:
person1 = {"pclass": 3, 
           "age": 25,
           "sibsp": 0,
           "parch": 0,
           "fare": 7,
           "embarked::Q":0,
           "embarked::S":0,
           "sex::male":1}

person2 = {"pclass": 1,
           "age": 8,
           "sibsp": 1,
           "parch": 2,
           "fare": 40,
           "embarked::Q":1,
           "embarked::S":0,
           "sex::male":0}

person3 = {"pclass": 2,
           "age": 20,
           "sibsp": 0,
           "parch": 0,
           "fare": 15,
           "embarked::Q":0,
           "embarked::S":1,
           "sex::male":0}

Suppose there were three imaginary passengers. 

In [44]:
X_new = []                                    # X_new contains new data items 
for person in [person1, person2, person3]:
    new_person = [person["pclass"], person["age"], person["sibsp"], person["parch"]
                  ,person["fare"], person["embarked::Q"], person["embarked::S"], person["sex::male"]]
    X_new.append(new_person)

In [45]:
knn.predict(X_new)

array([0, 1, 0])

#### The columns of the dataframe sent to predict() have to be in the same order as X_train

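- Notice that the prediction for the same passenger changes (0 with the training column order, 1 with `age` moved to the last column)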

In [46]:
X_train.columns

Index(['pclass', 'age', 'sibsp', 'parch', 'fare', 'embarked::Q', 'embarked::S',
       'sex::male'],
      dtype='object')

In [47]:
# create a new person as a dataframe
person1a = {"pclass": 3, 
           "sibsp": 0,
           "parch": 0,
           "fare": 7,
           "embarked::Q":0,
           "embarked::S":0,
           "sex::male":1,
           "age": 25}

X_new2 = pd.DataFrame(person1a,index=[0])

In [48]:
knn.predict(X_new2)

array([1])
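
One way to guard against this problem is to reorder the columns by name before calling predict(). The sketch below is a minimal illustration that only needs pandas; `train_columns` is a hypothetical variable holding the column order copied from `X_train.columns`.

```python
import pandas as pd

# Column order the model was trained on (copied from X_train.columns above).
train_columns = ["pclass", "age", "sibsp", "parch", "fare",
                 "embarked::Q", "embarked::S", "sex::male"]

# Same passenger as person1a: "age" is listed last.
person1a = {"pclass": 3, "sibsp": 0, "parch": 0, "fare": 7,
            "embarked::Q": 0, "embarked::S": 0, "sex::male": 1, "age": 25}

X_misordered = pd.DataFrame(person1a, index=[0])
X_aligned = X_misordered[train_columns]      # reorder the columns by name

print(X_misordered.iloc[0].tolist())         # age sits in the last position
print(X_aligned.iloc[0].tolist())            # age is back in the second position
```

In the notebook itself this would be `X_new2[X_train.columns]`. Recent scikit-learn releases also check DataFrame column names at prediction time and may warn or raise an error instead of silently returning a different prediction, so the output of the cell above can depend on the installed version.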

With the correctly ordered input <i>X_new</i>, the k-NN model predicts that persons 1 and 3 would have died, whereas person 2 would have survived.

## Modeling with Logistic Regression

In [49]:
lr = LogisticRegression(solver="liblinear")   # Build a new logistic regression model 

class sklearn.linear_model.LogisticRegression(`penalty`='l2', `dual`=False, `tol`=0.0001, `C`=1.0, `fit_intercept`=True, `intercept_scaling`=1, `class_weight`=None, `random_state`=None, `solver`='warn', `max_iter`=100, `multi_class`='warn', `verbose`=0, `warm_start`=False, `n_jobs`=None, `l1_ratio`=None)

sklearn.linear_model.LogisticRegression: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

In [50]:
lr.fit(X_train, y_train)

LogisticRegression(solver='liblinear')

In [51]:
lr.score(X_train, y_train)

0.7925233644859813

In [52]:
lr.score(X_test, y_test)

0.7932960893854749

In [53]:
# Make predictions against the test set
pred = lr.predict(X_test)

# Show the confusion matrix
print("confusion matrix:")
print(confusion_matrix(y_test, pred))

# Find the accuracy scores of the predictions against the true classes
print("accuracy: %0.3f" % accuracy_score(y_test, pred))
print("recall: %0.3f" % recall_score(y_test, pred))
print("precision: %0.3f" % precision_score(y_test, pred))
print("f-measure: %0.3f" % fbeta_score(y_test, pred, beta=1))

confusion matrix:
[[86 17]
 [20 56]]
accuracy: 0.793
recall: 0.737
precision: 0.767
f-measure: 0.752


In [54]:
lr.predict(X_new)

array([0, 1, 1])

The logistic regression model predicts that person 3 would have survived, unlike the prediction of the above k-NN model. 

Note that different models could make different predictions. 
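
When two models disagree, it can help to look at how confident a model is rather than only at the predicted class. The following sketch uses a made-up one-feature dataset (not the Titanic data) to show `predict_proba`, which returns one probability per class for each row.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy one-feature problem: small values are mostly class 0, large values mostly class 1.
X_toy = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0], [8.0]])
y_toy = np.array([0, 0, 0, 1, 0, 1, 1, 1])

clf = LogisticRegression(solver="liblinear").fit(X_toy, y_toy)

X_query = np.array([[1.0], [4.5], [8.0]])
proba = clf.predict_proba(X_query)   # columns follow clf.classes_, here [0, 1]
print(clf.predict(X_query))
print(proba.round(3))
```

In the notebook, `lr.predict_proba(X_new)` would give the analogous probabilities for the three imaginary passengers. A prediction close to 0.5 is one that another model could easily flip.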

# Supervised Learning - Regression

In [55]:
weatherDf = pd.read_csv('../../Data/weather.csv', index_col=0).dropna()        # Drop all rows with any missing values
weatherDf.head()

Date,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,WindDir3pm,...,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RISK_MM,RainTomorrow
2007-11-01,Canberra,8.0,24.3,0.0,3.4,6.3,NW,30.0,SW,NW,...,29,1019.7,1015.0,7,7,14.4,23.6,No,3.6,Yes
2007-11-02,Canberra,14.0,26.9,3.6,4.4,9.7,ENE,39.0,E,W,...,36,1012.4,1008.4,5,3,17.5,25.7,Yes,3.6,Yes
2007-11-03,Canberra,13.7,23.4,3.6,5.8,3.3,NW,85.0,N,NNE,...,69,1009.5,1007.2,8,7,15.4,20.2,Yes,39.8,Yes
2007-11-04,Canberra,13.3,15.5,39.8,7.2,9.1,NW,54.0,WNW,W,...,56,1005.5,1007.0,2,7,13.5,14.1,Yes,2.8,Yes
2007-11-05,Canberra,7.6,16.1,2.8,5.6,10.6,SSE,50.0,SSE,ESE,...,49,1018.3,1018.5,7,7,11.1,15.4,Yes,0.0,No


In [56]:
features = ['MinTemp','MaxTemp','Sunshine','Humidity3pm']
target = 'Rainfall'

## set the independent and dependent variables
X=weatherDf[features]
y=weatherDf[target]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

## Modeling with Linear Regression

sklearn.linear_model.LinearRegression: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

In [57]:
from sklearn.linear_model import LinearRegression #linear regression

lr=LinearRegression()

In [58]:
lr.fit(X_train, y_train)

LinearRegression()

In [59]:
## score for linear regression is the R2
lr.score(X_train, y_train)

0.1585020526818326

### Other Accuracy Measures

In [60]:
import math
from sklearn.metrics import explained_variance_score, mean_absolute_error, r2_score, mean_squared_error

print(lr.score(X_test, y_test))

preds = lr.predict(X_test)

score = explained_variance_score(y_test, preds)
mae = mean_absolute_error(y_test, preds)
rmse = math.sqrt(mean_squared_error(y_test, preds))
r2 = r2_score(y_test, preds)
    
print("score = {:.5f} | MAE = {:.3f} | RMSE = {:.3f} | R2 = {:.5f}"
          .format(score, mae, rmse, r2))

0.22861720735022106
score = 0.23192 | MAE = 1.739 | RMSE = 3.034 | R2 = 0.22862
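
To see what these measures compute, here is a small worked example in plain Python. The four `y_true` and `y_pred` values are made up for illustration and have nothing to do with the weather data.

```python
import math

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
n = len(y_true)

errors = [t - p for t, p in zip(y_true, y_pred)]
mae = sum(abs(e) for e in errors) / n                 # mean absolute error
rmse = math.sqrt(sum(e * e for e in errors) / n)      # root mean squared error

mean_true = sum(y_true) / n
ss_res = sum(e * e for e in errors)                   # residual sum of squares
ss_tot = sum((t - mean_true) ** 2 for t in y_true)    # total sum of squares
r2 = 1 - ss_res / ss_tot

print(f"MAE = {mae:.3f} | RMSE = {rmse:.3f} | R2 = {r2:.5f}")
# MAE = 0.500 | RMSE = 0.612 | R2 = 0.94861
```

MAE and RMSE are in the units of the target (millimeters of rainfall in the weather example), while R2 is unitless. RMSE is never smaller than MAE because squaring gives large errors more weight.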


In [61]:
print(lr.intercept_)
print(lr.coef_)

-1.145411145538914
[ 0.36335719 -0.26837655  0.29186769  0.06800184]


In [62]:
obs1 = {   "MinTemp": 6, 
           "MaxTemp": 32,
           "Sunshine": 5,
           "Humidity3pm": 30}

obs2 = {   "MinTemp": 16, 
           "MaxTemp": 42,
           "Sunshine": 10,
           "Humidity3pm": 35}

obs3 = {   "MinTemp": 10, 
           "MaxTemp": 25,
           "Sunshine": 7,
           "Humidity3pm": 60}

In [63]:
X_new = []                                    # X_new contains new data items 
for obs in [obs1, obs2, obs3]:
    new_obs = [obs["MinTemp"], obs["MaxTemp"], obs["Sunshine"], obs["Humidity3pm"]]
    X_new.append(new_obs)

In [64]:
lr.predict(X_new)

array([-4.05392388, -1.30476981,  1.90193144])
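
A linear regression prediction is just the intercept plus the weighted sum of the feature values. The sketch below reproduces the three predictions above from the printed `lr.intercept_` and `lr.coef_` values; the clipping step at the end is an added assumption and only one simple way to handle the negative rainfall predictions.

```python
# Intercept and coefficients as printed by lr.intercept_ and lr.coef_ above
# (the coefficients are rounded to 8 decimals in that output).
intercept = -1.145411145538914
coefs = [0.36335719, -0.26837655, 0.29186769, 0.06800184]   # MinTemp, MaxTemp, Sunshine, Humidity3pm

observations = [
    [6, 32, 5, 30],     # obs1
    [16, 42, 10, 35],   # obs2
    [10, 25, 7, 60],    # obs3
]

manual = [intercept + sum(c * x for c, x in zip(coefs, obs)) for obs in observations]
clipped = [max(0.0, p) for p in manual]    # rainfall cannot be negative

print([round(p, 4) for p in manual])
print([round(p, 4) for p in clipped])
```

The manual values agree with `lr.predict(X_new)` up to rounding. The negative results for the first two observations show that a plain linear model does not know that rainfall is bounded below by zero.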

## Regression Modeling with Ridge

Least Squares with l2 Regularization

sklearn.linear_model.Ridge: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html

In [65]:
from sklearn.linear_model import Ridge

rr=Ridge(solver='svd')

In [66]:
rr.fit(X_train, y_train)

Ridge(solver='svd')

In [67]:
## score for ridge regression is the R2
rr.score(X_train, y_train)

0.15850199803975518

In [68]:
## Other accuracy measures
print(rr.score(X_test, y_test))

preds = rr.predict(X_test)

score = explained_variance_score(y_test, preds)
mae = mean_absolute_error(y_test, preds)
rmse = math.sqrt(mean_squared_error(y_test, preds))
r2 = r2_score(y_test, preds)
    
print("score = {:.5f} | MAE = {:.3f} | RMSE = {:.3f} | R2 = {:.5f}"
          .format(score, mae, rmse, r2))

0.2286000266839302
score = 0.23190 | MAE = 1.739 | RMSE = 3.034 | R2 = 0.22860


In [69]:
rr.predict(X_new)

array([-4.04929829, -1.30156542,  1.90298592])
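
The ridge results above are almost identical to the linear regression results because the default penalty strength (`alpha=1.0`) is small relative to this dataset. The sketch below uses synthetic data (not the weather data) to show how larger values of `alpha` shrink the coefficients; the chosen alphas are arbitrary.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic regression data with known coefficients plus noise.
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(200, 4))
y_syn = X_syn @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=200)

ols = LinearRegression().fit(X_syn, y_syn)
ols_norm = float(np.linalg.norm(ols.coef_))

norms = {}
for alpha in (1.0, 100.0, 10000.0):
    ridge = Ridge(alpha=alpha, solver="svd").fit(X_syn, y_syn)
    norms[alpha] = float(np.linalg.norm(ridge.coef_))   # length of the coefficient vector
    print(alpha, round(norms[alpha], 4))

print("OLS", round(ols_norm, 4))
```

The l2 penalty trades a slightly worse fit on the training data for smaller coefficients, which can help when features are strongly correlated or the training set is small.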

# Exercises for Supervised Learning (8 questions)

Let's build another classification model for titanic survivors. This time, build a logistic regression model using pclass, age, and fare as the features.

In [70]:
df = load_dataset("titanic")

1\. You need two variables: X as a feature dataset and y as a target dataset. Select the appropriate features in <i>df</i> and assign them to a variable called <i>X</i>. Likewise, select the target in <i>df</i> and assign it to a variable called <i>y</i>.

In [71]:
# Your answer here


2\. Split <i>X</i> and <i>y</i> into two training datasets <i>X_train</i> and <i>y_train</i> and two test datasets <i>X_test</i> and <i>y_test</i>. Set the `test_size` to 0.25 and `random_state` to 0.

In [72]:
# Your answer here


You can build a new logistic regression model <i>lgr</i> as follows. The solver is set to <i>liblinear</i>. 

In [73]:
lgr = LogisticRegression(solver="liblinear")   # Build a new logistic regression model 

3\. Fit the logistic regression model <i>lgr</i> using the two training datasets <i>X_train</i> and <i>y_train</i>.

In [74]:
# Your answer here


4\. Get the training score and test score, confusion matrix, and classification report. 

In [75]:
# Your answer here


Let's aim to build a <b>regression</b> model using the Major League Baseball dataset that is able to predict the number of home runs (HR) a batter would hit in a single season based on statistics such as number of games (G), number of at bats (AB), runs scored (R), number of hits (H), number of doubles (2B), number of triples (3B), number of stolen bases (SB), and number of bases on balls (BB). 

In [76]:
dfb = pd.read_csv("../../Data/MLB_Batting.csv")
dfb18 = dfb[(dfb.yearID == 2018) & ((dfb.lgID == "NL") | (dfb.lgID == "AL"))]
dfb18.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1535 entries, 104326 to 105860
Data columns (total 22 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   playerID  1535 non-null   object 
 1   yearID    1535 non-null   int64  
 2   stint     1535 non-null   int64  
 3   teamID    1535 non-null   object 
 4   lgID      1535 non-null   object 
 5   G         1535 non-null   int64  
 6   AB        1535 non-null   int64  
 7   R         1535 non-null   int64  
 8   H         1535 non-null   int64  
 9   2B        1535 non-null   int64  
 10  3B        1535 non-null   int64  
 11  HR        1535 non-null   int64  
 12  RBI       1535 non-null   float64
 13  SB        1535 non-null   float64
 14  CS        1535 non-null   float64
 15  BB        1535 non-null   int64  
 16  SO        1535 non-null   float64
 17  IBB       1535 non-null   float64
 18  HBP       1535 non-null   float64
 19  SH        1535 non-null   float64
 20  SF        1535 non-null

According to the goal description above, the features to be used include G, AB, R, H, 2B, 3B, SB, and BB, while the target is HR. 

In [77]:
features = ["G", "AB", "R", "H", "2B", "3B", "SB", "BB"]
target = "HR"

5\. You need two variables: X as a feature dataset and y as a target dataset. Select the features in <i>dfb18</i> and assign them to a variable called <i>X</i>. Likewise, select the target in <i>dfb18</i> and assign it to a variable called <i>y</i>.

Split <i>X</i> and <i>y</i> into two training datasets <i>X_train</i> and <i>y_train</i> and two test datasets <i>X_test</i> and <i>y_test</i>. Set the `test_size` to 0.25 and `random_state` to 0.

In [78]:
# Your answer here


You can build a new least squares linear regression model <i>lr</i> as follows.

In [79]:
from sklearn.linear_model import LinearRegression     # linear regression

lr = LinearRegression()

6\. Fit the linear regression model <i>lr</i> using the training datasets.

In [80]:
# Your answer here


7\. Get the training score and test score, MAE, and RMSE, respectively. 

In [81]:
# Your answer here


8\. Suppose there is a new batter who has the following record. How many home runs does your model predict the batter would hit?

In [82]:
batter = {"G": 130,
          "AB": 450,
          "R": 100,
          "H": 170,
          "2B": 60,
          "3B": 10,
          "SB": 5,
          "BB": 80}

new_batter = [batter["G"], batter["AB"], batter["R"], batter["H"], batter["2B"], batter["3B"], batter["SB"], batter["BB"]]
X_new = [new_batter]

In [83]:
# Your answer here


# Unsupervised Learning - Clustering

## Set the Goal

Let's aim to build a clustering model that is able to group, or cluster, all passengers on board the Titanic into several groups, or clusters, of similar passengers. 

## Prepare Data for Modeling

In [84]:
df = load_dataset("titanic")

df = df[["survived", "pclass", "sex", "age", "sibsp", "parch", "fare", "embarked"]]

categoricalList = ['embarked','sex']

df = pd.concat([df.drop(categoricalList,axis=1), createCategoricalDummies(df,categoricalList)], axis = 1).dropna()

In [85]:
X = df

Note that there is no <i>y</i> in unsupervised learning. All you need is an input dataset <i>X</i>. Also, you do not have to split the data into training and test sets. 

## Modeling with k-Means Clustering

In [86]:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=5, random_state=0)     # Create a new k-means clustering model with k set to 5

class sklearn.cluster.KMeans(`n_clusters`=8, `init`='k-means++', `n_init`=10, `max_iter`=300, `tol`=0.0001, `precompute_distances`='auto', `verbose`=0, `random_state`=None, `copy_x`=True, `n_jobs`=None, `algorithm`='auto')

sklearn.cluster.KMeans: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

In [87]:
kmeans.fit(X)

KMeans(n_clusters=5, random_state=0)

In [88]:
kmeans.cluster_centers_                          # Coordinates of the cluster centroids 

array([[3.35106383e-01, 2.51595745e+00, 2.84171099e+01, 4.37943262e-01,
        3.81205674e-01, 1.56840344e+01, 4.60992908e-02, 8.28014184e-01,
        6.79078014e-01],
       [1.00000000e+00, 1.00000000e+00, 3.53333333e+01, 0.00000000e+00,
        3.33333333e-01, 5.12329200e+02, 0.00000000e+00, 0.00000000e+00,
        6.66666667e-01],
       [7.33333333e-01, 1.00000000e+00, 3.03333333e+01, 1.00000000e+00,
        1.33333333e+00, 2.39991940e+02, 6.93889390e-18, 4.66666667e-01,
        2.66666667e-01],
       [7.33333333e-01, 1.00000000e+00, 3.24306667e+01, 6.00000000e-01,
        8.66666667e-01, 1.31183883e+02, 6.93889390e-18, 5.00000000e-01,
        3.66666667e-01],
       [6.37254902e-01, 1.27450980e+00, 3.57254902e+01, 8.43137255e-01,
        4.50980392e-01, 6.71931804e+01, 1.96078431e-02, 6.37254902e-01,
        5.19607843e-01]])

In [89]:
kmeans.labels_                                   # Cluster label assigned to each data item 

array([0, 4, 0, 4, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 2, 0, 0, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 4, 0, 0, 0,
       4, 0, 4, 4, 0, 0, 0, 0, 0, 0, 4, 4, 0, 4, 0, 0, 0, 0, 0, 4, 0, 0,
       0, 2, 0, 0, 0, 4, 0, 0, 0, 4, 0, 0, 0, 4, 0, 0, 0, 0, 0, 4, 0, 0,
       0, 0, 0, 0, 0, 2, 0, 4, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4,
       0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 4, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 4,
       0, 0, 0, 0, 4, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4,
       0, 0, 4, 0, 0, 0, 0, 0, 0, 4, 1, 0, 0, 4, 0, 0, 0, 0, 3, 3, 0, 0,
       0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 4, 0, 0, 0, 0, 3, 2,
       0, 3, 3, 0, 4, 4, 2, 0, 0, 0, 0, 0, 0, 3, 3, 0, 0, 0, 0, 3, 0, 0,
       0, 4, 0, 3, 0, 4, 3, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       4, 0, 0, 0, 0, 0, 0, 4, 4, 4, 0, 0, 3, 0, 0,

Each data item in <i>X</i> is assigned a cluster label, which is a number between 0 and k-1. 

In [90]:
df["label"] = kmeans.labels_                    # Add a new column label with the clustering labels 

In [91]:
df.head(10)

Unnamed: 0,survived,pclass,age,sibsp,parch,fare,embarked::Q,embarked::S,sex::male,label
0,0,3,22.0,1,0,7.25,0,1,1,0
1,1,1,38.0,1,0,71.2833,0,0,0,4
2,1,3,26.0,0,0,7.925,0,1,0,0
3,1,1,35.0,1,0,53.1,0,1,0,4
4,0,3,35.0,0,0,8.05,0,1,1,0
6,0,1,54.0,0,0,51.8625,0,1,1,4
7,0,3,2.0,3,1,21.075,0,1,1,0
8,1,3,27.0,0,2,11.1333,0,1,0,0
9,1,2,14.0,1,0,30.0708,0,0,0,0
10,1,3,4.0,1,1,16.7,0,1,0,0


In [92]:
df.label.value_counts()                          # Count the number of values for each label 

0    564
4    102
3     30
2     15
1      3
Name: label, dtype: int64

In [93]:
df[df.label == 2].sample(n=10, replace=False, random_state=0)  # Select a random sample with 10 rows that have the label 2

Unnamed: 0,survived,pclass,age,sibsp,parch,fare,embarked::Q,embarked::S,sex::male,label
88,1,1,23.0,3,2,263.0,0,1,0,2
377,0,1,27.0,0,2,211.5,0,0,1,2
438,0,1,64.0,1,4,263.0,0,1,1,2
689,1,1,15.0,0,1,211.3375,0,1,0,2
779,1,1,43.0,0,1,211.3375,0,1,0,2
311,1,1,18.0,2,2,262.375,0,0,0,2
118,0,1,24.0,0,1,247.5208,0,0,1,2
742,1,1,21.0,2,2,262.375,0,0,0,2
700,1,1,18.0,1,0,227.525,0,0,0,2
380,1,1,42.0,0,0,227.525,0,0,0,2


In [94]:
df[df.label == 0].sample(n=10, replace=False, random_state=0)  # Select a random sample with 10 rows that have the label 0

Unnamed: 0,survived,pclass,age,sibsp,parch,fare,embarked::Q,embarked::S,sex::male,label
379,0,3,19.0,0,0,7.775,0,1,1,0
500,0,3,17.0,0,0,8.6625,0,1,1,0
285,0,3,33.0,0,0,8.6625,0,0,1,0
503,0,3,37.0,0,0,9.5875,0,1,0,0
662,0,1,47.0,0,0,25.5875,0,1,1,0
442,0,3,25.0,1,0,7.775,0,1,1,0
670,1,2,40.0,1,1,39.0,0,1,0,0
619,0,2,26.0,0,0,10.5,0,1,1,0
823,1,3,27.0,0,1,12.475,0,1,0,0
387,1,2,36.0,0,0,13.0,0,1,0,0


In [95]:
df.groupby("label").mean()

label,survived,pclass,age,sibsp,parch,fare,embarked::Q,embarked::S,sex::male
0,0.335106,2.515957,28.41711,0.437943,0.381206,15.684034,0.046099,0.828014,0.679078
1,1.0,1.0,35.333333,0.0,0.333333,512.3292,0.0,0.0,0.666667
2,0.733333,1.0,30.333333,1.0,1.333333,239.99194,0.0,0.466667,0.266667
3,0.733333,1.0,32.430667,0.6,0.866667,131.183883,0.0,0.5,0.366667
4,0.637255,1.27451,35.72549,0.843137,0.45098,67.19318,0.019608,0.637255,0.519608
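
The per-label means above are the same numbers as `kmeans.cluster_centers_`, because each centroid is the mean of the points assigned to it. The sketch below checks this on a small made-up dataset of three well-separated groups; `n_init=10` is set explicitly only to keep the behavior the same across scikit-learn versions.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Three well-separated toy groups of 2-D points (hypothetical data).
toy = pd.DataFrame({
    "x": [1.0, 1.2, 0.8, 10.0, 10.3, 9.7, 20.0, 19.8, 20.2],
    "y": [1.0, 0.9, 1.1, 5.0, 5.2, 4.8, 1.0, 1.2, 0.8],
})

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(toy)
toy["label"] = km.labels_                  # one label in {0, 1, 2} per row

group_means = toy.groupby("label").mean()  # rows are sorted by label
print(group_means)
print(np.allclose(group_means.to_numpy(), km.cluster_centers_))
```

In the Titanic table above, the clusters differ mainly in <i>fare</i>, the column with by far the largest range. k-means relies on Euclidean distance, so columns with large values tend to dominate the grouping unless the features are rescaled first.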


# Exercises for Clustering (6 questions)

Using the same baseball data, let's aim to build a clustering model that is able to group all batters into 5 clusters of similar ones by looking at the same __8 features__ used in the above regression __exercises__ along with the __target__. 

We need a copy of <i>dfb18</i> for clustering. Use <i>dfb18c</i> for your clustering.

In [96]:
dfb18c = dfb18.copy()
dfb18c.head()

Unnamed: 0,playerID,yearID,stint,teamID,lgID,G,AB,R,H,2B,...,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP
104326,abreujo02,2018,1,CHA,AL,128,499,68,132,36,...,78.0,2.0,0.0,37,109.0,7.0,11.0,0.0,6.0,14.0
104327,acunaro01,2018,1,ATL,NL,111,433,78,127,26,...,64.0,16.0,5.0,45,123.0,2.0,6.0,0.0,3.0,4.0
104328,adamewi01,2018,1,TBA,AL,85,288,43,80,7,...,34.0,6.0,5.0,31,95.0,3.0,1.0,1.0,2.0,6.0
104329,adamja01,2018,1,KCA,AL,31,0,0,0,0,...,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0
104330,adamsau02,2018,1,WAS,NL,2,0,0,0,0,...,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0


1\. For clustering, all you need is an input dataset <i>X</i>. Select the 9 columns in <i>dfb18c</i> (the 8 features plus the target HR) and assign them to <i>X</i>.

In [97]:
# Your answer here


2\. Build a new k-means clustering model <i>kmeans</i>. Set `n_clusters` to 5 and `random_state` to 0.

In [98]:
# Your answer here


3\. Fit the clustering model <i>kmeans</i> using the input dataset <i>X</i>.

In [99]:
# Your answer here


4\. Assign the resulting labels of <i>kmeans</i> to the new column of <i>dfb18c</i> called <i>label</i>.

In [100]:
# Your answer here


5\. Check the number of values for each label. 

In [101]:
# Your answer here


6\. Select a random sample of <i>dfb18c</i> with 10 rows that have the label 2. For random sampling, set `replace` to False and `random_state` to 0.

In [102]:
# Your answer here
