# Introduction to Machine Learning
Grant Glass

TAP Institute


<br>Day 01:<br>Key Concepts and Terms




## Preface

Welcome to Machine Learning, part of the 2021 TAP Institute's summer courses. In this first-day notebook, we will go over the core concepts of machine learning and start to get our feet wet with a simple example. I will not give you a complete overview, but rather a quick way to build a general understanding of what machine learning is and how it works. On days two and three, we will explore different machine learning techniques in more depth.


## Covered in this Notebook

1) What is Machine Learning?<br>
2) What is a Statistical Model?<br>
3) A framework for understanding ML<br>
4) Simple Example of ML

## Before we Begin!

Head to the Google Teachable Machine Website: https://teachablemachine.withgoogle.com

The Teachable Machine website provides an easy-to-use interface for training image, sound, and pose classification models. No login is required to get started. Training data files can be loaded directly from your computer or from your computer's webcam or microphone. Models can be exported to use in other projects, and the FAQ (https://cloud.google.com/inclusive-ml/) includes links to read more about fairness and inclusion in ML.

Take a look at this project involving training a model to detect how ripe a piece of fruit is: https://medium.com/@warronbebster/teachable-machine-tutorial-bananameter-4bfffa765866


How do you think the computer figures out ripeness?

What exactly are we teaching the machine?

What other humanistic data could we use for this type of machine learning?




## Part One - What is MACHINE LEARNING?

**The field itself:** ML is a field of study that harnesses principles of computer science and statistics to create statistical models. These models are generally used to do two things:

**Prediction:** make predictions about the future based on data about the past.

**Inference:** discover patterns in data.

**Difference between ML and AI:** There is no universally agreed-upon distinction between ML and artificial intelligence (AI). AI usually concentrates on programming computers to make decisions (based on ML models and sets of logical rules), whereas ML focuses more on making predictions about the future.

They are highly interconnected fields, and, for most non-technical purposes, they are the same.

## Part Two - What is A STATISTICAL MODEL?

**Models:** Teaching a computer to make predictions involves feeding data into machine learning models, which are representations of how the world supposedly works. If I tell a statistical model that the world works a certain way (say, that two-story homes are more expensive than one-story homes), then the model can tell me which would be more expensive: a ranch-style home or a Cape Cod.

What does a model actually look like? Surely the concept of a model makes sense in the abstract, but knowing this is just half the battle. You should also know how it's represented inside of a computer, or what it would look like if you wrote it down on paper.

A model is just a mathematical function, which is merely a relationship between a set of inputs and a set of outputs. Here's an example:

f(x) = x²

This is a function that takes as input a number and returns that number squared. So, f(1) = 1, f(2) = 4, f(3) = 9.

Let's briefly return to the example of the model that predicts home price from the number of stories. I may believe, based on what I've seen in the world, that a home's price is, on average, equal to its number of stories times $100,000.

This model can be represented mathematically as follows:

Price = Stories × $100,000

In other words, price is a function of stories.

**Here's the main point:** Machine learning refers to a set of techniques for estimating functions (like the one involving price) based on datasets (pairs of story counts and their associated prices). These functions, which are called models, can then be used to make predictions about future data.
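To make the idea of estimating a function from data concrete, here is a minimal sketch in plain Python. The story counts and prices are made-up numbers for illustration only, and `estimate_price_per_story` and `predict_price` are hypothetical helpers (not part of any library). The sketch fits the one-parameter model Price = Stories × w by least squares.

```python
# Made-up training data (an assumption for illustration): stories and prices.
stories = [1, 2, 3, 1, 2]
prices = [100_000, 210_000, 290_000, 95_000, 205_000]

def estimate_price_per_story(xs, ys):
    """Least-squares estimate of w in the model price = w * stories."""
    numerator = sum(x * y for x, y in zip(xs, ys))
    denominator = sum(x * x for x in xs)
    return numerator / denominator

def predict_price(n_stories, w):
    """The fitted model: a function from number of stories to price."""
    return w * n_stories

w = estimate_price_per_story(stories, prices)
print("estimated price per story:", round(w))
print("predicted price of a 2-story home:", round(predict_price(2, w)))
```

The estimated weight lands close to the hand-picked $100,000, but here it comes from the data rather than from my beliefs about the world.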

**Algorithms:** These functions are estimated using algorithms. In this context, an algorithm is a predefined set of steps that takes a bunch of data as input and transforms it through mathematical operations. You can think of an algorithm as a recipe: first do this, then do that, then do this. Done.

Machine learning of all types uses models and algorithms as its building blocks to make predictions and inferences about the world.

Now I'll show you how models actually work by breaking them apart, component by component. This next part is important.

## Part Three - A Framework for understanding ML

**Inputs:** Statistical models learn from the past, formatted as structured tables of data (called **training data**). These datasets, such as those you might find in Excel sheets, tend to be formatted in a very structured, easy-to-understand way: each row in the dataset represents an individual **observation**, also called a datum or measurement, and each column represents a different **feature**, also called a predictor, of an observation.

For example, you might imagine a dataset about people, in which each row represents a different person, and each column represents a different feature about that person: profession, age, income, etc.

Most traditional models accept data formatted in the way I've just described. We call this structured data.

Because one common goal of ML is to make predictions, training data also includes a column containing the data you want to predict. This feature is called the response variable (or output variable, or dependent variable) and looks just like any other feature in the table.
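As a small illustration of structured data, the sketch below builds a tiny table with pandas and separates the features from the response variable. The people and numbers are invented for this example, and treating `income` as the response is just an assumption made for the sketch.

```python
import pandas as pd

# Made-up structured data: one row per observation, one column per feature.
people = pd.DataFrame({
    "profession": ["teacher", "nurse", "engineer"],
    "age": [34, 45, 29],
    "income": [52000, 61000, 78000],
})

# Assumption for this sketch: we want to predict income from the other columns.
feature_columns = ["profession", "age"]
X = people[feature_columns]  # features (predictors)
y = people["income"]         # response variable

print(X)
print(y)
```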

Most common statistical models are constructed using a technique called supervised learning, which uses data that includes a response variable to make predictions or do inference. There is also a branch of ML called unsupervised learning, which doesn't require a response variable and which is generally used just to find interesting patterns between variables (this pattern-finding process is known as inference). It is just as important as supervised learning, but it is usually much harder to understand and also less common in practice. This document won't talk much about the latter subfield. The takeaway from this paragraph is simply that there are two "types" of learning, and that supervised learning is more common.


**Model selection:** We have our data, and we've decided that there's probably a relationship between our predictors and our response. We're ready to make predictions.

As an aside, we don't actually need to know if there's a relationship between these variables. We could, in fact, just throw all of our data into an algorithm and see if the resulting model is able to make valid predictions.

Now we need to pick which model to use. Naturally, there are many different types of models that explain how the data actually works, and we'd like to choose the one that most accurately describes the relationship between the predictors and the response variable.

Models generally fall into one of two categories:

**Regression models**, which are used when the response variable (i.e., the variable that you're predicting) is continuous. For example, height, age, and income are all continuous. That is, they can be placed and ordered on a number line.

**Classification models**, which are used when the response variable is categorical, that is, when it doesn't have a numerical ordering. For example, you may want to predict, based on an image of a flower, the species of that flower. Or you may want to predict whether a student is a psychology major or a math major.

The first step in picking a model is deciding whether your response variable is quantitative or categorical.
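The distinction can be sketched with scikit-learn on a toy dataset. The numbers and the "short"/"tall" labels below are made up for illustration; `LinearRegression` and `LogisticRegression` are used here only as one familiar example of each category, not as a recommendation.

```python
from sklearn.linear_model import LinearRegression, LogisticRegression

# Made-up data (an assumption for illustration): one numeric feature per row.
X = [[1], [2], [3], [4]]

# Continuous response -> regression model.
y_continuous = [1.0, 2.0, 3.0, 4.0]
regressor = LinearRegression().fit(X, y_continuous)
print(regressor.predict([[5]]))   # a number on the number line

# Categorical response -> classification model.
y_categorical = ["short", "short", "tall", "tall"]
classifier = LogisticRegression().fit(X, y_categorical)
print(classifier.predict([[5]]))  # one of the category labels
```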


Why is model selection an important concept for non-technical people? 

Well, if a model is chosen poorly, then its predictions will be inaccurate!


Below, I'll walk you through an example of a popular, powerful, simple model that can be used for prediction.

## Part Four - Let's Look at an example!

In [None]:
# Import Libraries

In [36]:
import pandas as pd
import numpy as np
import scipy
import sklearn
from sklearn import linear_model
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer


# Load data, split into training and validation sets

In [3]:
filepath = 'data/train.csv'
dataframe = pd.read_csv(filepath)
print(len(dataframe))
# print(dataframe)

7239


In [4]:
dataframe.head()

Unnamed: 0,id,original,edit,grades,meanGrade
0,10070,Lawmaker Who Assaulted Reporter Fights Court-O...,Shaving,22110,1.2
1,1062,Trump rolls back Obama 's rule requiring emplo...,pets,33100,1.4
2,12796,' Who the hell is <Dana Rohrabacher/> ? ' Seth...,batman,11110,0.8
3,1745,House Republicans just voted to gut the indepe...,laundry,33200,1.6
4,13366,The Coca-Cola invasion is causing Mexico ’s sl...,mail,21000,0.6


This is a text-based regression task. Every training document (a line in the data file) contains the following columns (csv format):

**id:** document identifier;

**original:** original news headline, in which one word is tagged between < and />;

**edit:** the new word to replace the tagged word in the original headline;

**grades:** a list of funniness grades (0="Not Funny", 1="Slightly Funny", 2="Moderately Funny", 3="Funny") concatenated into a single string. For instance, '1101' means four human judges looked at the edited headline and submitted funniness grades {1, 1, 0, 1};

**meanGrade:** the average funniness value. In the previous example, meanGrade = (1+1+0+1)/4 = 0.75 .

Your goal is to predict the average funniness value of an edited headline. More concretely, to predict the meanGrade (a real value) given an original headline, a tagged word, and an edit word.
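The relationship between `grades` and `meanGrade` described above can be checked with a few lines of Python. `mean_grade` is a hypothetical helper written for this sketch, and it assumes the grades are available as a string of digits (the printed notebook output shows some grades as plain integers such as `0`, so a real pipeline may need extra care).

```python
def mean_grade(grades):
    """Average the per-judge grades packed into a string such as '1101'."""
    scores = [int(ch) for ch in grades]
    return sum(scores) / len(scores)

print(mean_grade("1101"))   # (1 + 1 + 0 + 1) / 4
print(mean_grade("22110"))  # five judges, as in the first row shown above
```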


In [6]:
train_ratio = 0.7 # 70% for training, 30% for validation
random_seed = 100 # a fixed random seed makes the split reproducible (for controlled debugging). set to None to be random.

train_dataframe = dataframe.sample(frac=train_ratio, random_state=random_seed)
valid_dataframe = dataframe.drop(train_dataframe.index)
print('training set size:', len(train_dataframe))
print('validation set size:', len(valid_dataframe))
# print(train_dataframe)

training set size: 5067
validation set size: 2172
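The sample-then-drop pattern used for the split can be tried on a tiny made-up table to confirm that the two pieces never overlap. The `toy` dataframe below is invented for this sketch and is not the headline data.

```python
import pandas as pd

# Made-up stand-in for the real data: ten rows with an id and a score.
toy = pd.DataFrame({"id": range(10), "meanGrade": [0.2 * i for i in range(10)]})

# Same pattern as above: sample 70% for training, keep the rest for validation.
train = toy.sample(frac=0.7, random_state=100)
valid = toy.drop(train.index)

print("training rows:", sorted(train.index))
print("validation rows:", sorted(valid.index))
```

Because the validation rows are whatever remains after dropping the sampled index labels, every row ends up in exactly one of the two sets.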


## Also load test data (no splitting needed here)

In [7]:
test_filepath = 'data/test.csv'
test_dataframe = pd.read_csv(test_filepath)
print('test set size:', len(test_dataframe))
# print(test_dataframe)

test set size: 2413


In [31]:
test_dataframe.head()

Unnamed: 0,id,original,edit
0,13109,Hillary Clinton warns LGBT progress may not be...,bridge
1,3435,Gaza violence : Israel defends actions as 55 <...,newts
2,3794,Germany ’s SPD Is Open to <Talks/> on New Merk...,fights
3,8136,"North Korea Signals Olympics Truce , Seeks <Ta...",rave
4,11655,Trump On North Korea : ‘ We Have No Road Left ...,dinner


# Try the trivial baseline: always predicting the average meanGrade (of training data)

In [24]:
# take out prediction targets: mean grades 
train_Y = train_dataframe['meanGrade']
valid_Y = valid_dataframe['meanGrade']

Root Mean Squared Error (RMSE) is our evaluation metric and is calculated as

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$

where $y_i$ is the actual funniness value of document $i$, $\hat{y}_i$ is the predicted value for that document, and $n$ is the number of documents, so $(y_i - \hat{y}_i)^2$ is the squared error of a single prediction. The lower the RMSE, the more accurate the predictions.
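As a quick sanity check of the RMSE definition, here is a minimal sketch in plain Python. The `rmse` helper and the three made-up values are assumptions for illustration; the notebook itself computes the same quantity with `np.sqrt(mean_squared_error(...))`.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length lists of numbers."""
    squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up values for illustration.
actual = [1.0, 2.0, 3.0]
predicted = [1.0, 2.0, 5.0]
print(rmse(actual, predicted))  # sqrt((0 + 0 + 4) / 3)
```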

In [25]:
# compute average of a list of numbers: np.mean
train_Y_avg = np.mean(train_dataframe['meanGrade'])
print('average meanGrade on training set:', train_Y_avg)

# make a list filled with train_Y_avg, essentially predicting the same number for all lines in validation set
avg_pred_valid = [train_Y_avg for i in range(len(valid_dataframe))]
# print (avg_pred_valid)

# compute root mean squared error (RMSE) of this prediction on validation set
rmse = np.sqrt(mean_squared_error(valid_Y, avg_pred_valid))
print('RMSE on validation set:', rmse)

# baseline: the training-set mean is used as the prediction for every validation example

average meanGrade on training set: 0.9418327741595908
RMSE on validation set: 0.5862528391481608


In [26]:
# helper function: write out prediction values into a csv format file
# params:
#     df: dataframe, where each row is a test example, with column 'id' as data id
#     pred: a list or 1-d array of prediction values
#     filepath: the output file path
# return:
#     None

def write_test_prediction(df, pred, filepath):
    with open(filepath, 'w') as outfile:
        outfile.write('{},{}\n'.format('id', 'pred'))
        # pair ids with predictions by position, so this also works when df's index is not 0..n-1
        for row_id, value in zip(df['id'], pred):
            outfile.write('{},{}\n'.format(row_id, value))

In [27]:
# make a list filled with train_Y_avg, essentially predicting the same number for all lines in test set
avg_pred_test = [train_Y_avg for i in range(len(test_dataframe))]
write_test_prediction(test_dataframe, avg_pred_test, './average_constant_baseline_new-tf.csv')

# Build feature extractor from training data (here we use a CountVectorizer or TfidfVectorizer )

In [28]:
# get entire raw text in training corpus, including title and edit words (for learning vocabulary and IDF)
# params:
#     df: dataframe, with 'original' and 'edit' columns
# return:
#     corpus: a list of text strings, each is a concatenation of original text and edit word on each line

def get_raw_text(df):
    corpus = []
    for index, row in df.iterrows():
        title = row['original'].replace('<', '').replace('/>', '')
        edit = row['edit']
        corpus.append( title + ' ' + edit )
    return corpus

TF-IDF (Term Frequency (TF), Inverse Document Frequency (IDF)) is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. It is computed by multiplying two metrics: how many times a word appears in a document, and the inverse document frequency of the word across a set of documents.

It has many uses, most importantly in automated text analysis, and is very useful for scoring words in machine learning algorithms for Natural Language Processing (NLP).

TF-IDF was invented for document search and information retrieval. A word's score increases in proportion to the number of times the word appears in a document, but is offset by the number of documents that contain the word. So words that are common in every document, such as "this", "what", and "if", rank low even though they may appear many times, since they don't mean much to that document in particular.

However, if the word "bug" appears many times in a document, while not appearing many times in others, it is probably very relevant to that document. For example, if we are trying to find out which topics some NPS responses belong to, the word "bug" would probably end up being tied to the topic Reliability, since most responses containing that word would be about that topic.

In [30]:
train_corpus = get_raw_text(train_dataframe)
print(train_corpus)

vectorizer = TfidfVectorizer(stop_words = None).fit(train_corpus)
print(vectorizer.vocabulary_)
#vectorizer = CountVectorizer(stop_words = None).fit(train_corpus)
#print(vectorizer.vocabulary_)



TF-IDF for a word in a document is calculated by multiplying two different metrics:

The **term frequency** of a word in a document. There are several ways of calculating this frequency, the simplest being a raw count of the times a word appears in a document. There are also ways to adjust the frequency, for example by the length of the document or by the raw frequency of the most frequent word in the document.

The **inverse document frequency** of the word across a set of documents. This measures how common or rare a word is in the entire document set. The closer it is to 0, the more common the word is. It can be calculated by taking the total number of documents, dividing it by the number of documents that contain the word, and taking the logarithm.

So, if the word is very common and appears in many documents, this number will approach 0. Otherwise, it grows larger the rarer the word is.

Multiplying these two numbers results in the TF-IDF score of a word in a document. The higher the score, the more relevant that word is in that particular document.
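The two metrics can be worked through by hand on a tiny made-up corpus. This is a minimal sketch of the textbook formula described above (raw count times log of total documents over documents containing the word); the three "documents" and the helper names are invented for illustration. Note that scikit-learn's `TfidfVectorizer` uses a smoothed variant and normalizes each row by default, so its numbers will differ from these.

```python
import math

# Three made-up "documents" (hypothetical data, not from the headline corpus).
docs = [
    "the app has a bug",
    "the bug crashed the app",
    "the support team was helpful",
]
tokenized = [d.split() for d in docs]

def term_frequency(word, tokens):
    """Raw count of the word in one document."""
    return tokens.count(word)

def inverse_document_frequency(word, all_tokens):
    """log(total documents / documents containing the word)."""
    n_docs = len(all_tokens)
    n_containing = sum(1 for tokens in all_tokens if word in tokens)
    return math.log(n_docs / n_containing)

def tf_idf(word, tokens, all_tokens):
    return term_frequency(word, tokens) * inverse_document_frequency(word, all_tokens)

# "the" is in every document, so its score is zero even where it appears twice.
print("the:", tf_idf("the", tokenized[1], tokenized))
# "bug" is in two of three documents; "helpful" is in only one.
print("bug:", tf_idf("bug", tokenized[1], tokenized))
print("helpful:", tf_idf("helpful", tokenized[2], tokenized))
```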

# Extract features of both training and validation data

In [32]:
# helper function: separate each title into (original_word, context), where context = title text without original word 
# params:
#     df: dataframe, with 'original' and 'edit' columns
# return:
#     original_words: a list of original word strings before editing
#     contexts:       a list of context strings 

def separate_original_word_from_title(df):
    original_words = []
    contexts = []
    for index, row in df.iterrows():
        title = row['original']
        start_position = title.find('<')
        end_position = title.find('/>')
        original_words.append(title[start_position+1 : end_position])
        contexts.append(title[:start_position] + title[end_position+2 :])
    return original_words, contexts
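To see what this separation does to a single headline, here is a stand-alone sketch that applies the same string logic to one string. `split_headline` is a hypothetical helper written for this illustration; the headline is one that appears in the notebook's output further down.

```python
def split_headline(title):
    """Return (tagged_word, context) for one headline string."""
    start_position = title.find('<')
    end_position = title.find('/>')
    tagged_word = title[start_position + 1 : end_position]
    context = title[:start_position] + title[end_position + 2 :]
    return tagged_word, context

word, context = split_headline('Starbucks <encourages/> bipartisan coffee-drinking')
print(repr(word))
print(repr(context))
```

The context keeps the spaces on both sides of the removed tag, so it contains a double space where the word used to be.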

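Here we construct a sparse feature matrix. Most entries are zero (each text contains only a handful of the words in the vocabulary), so storing only the nonzero entries makes the task much cheaper computationally. More information can be found here: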

https://machinelearningmastery.com/sparse-matrices-for-machine-learning/
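A tiny sketch of the idea, using SciPy's `csr_matrix` on a made-up matrix: three rows over a six-word vocabulary, each row containing a single word. The values are invented for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Made-up dense matrix: 3 texts x 6 vocabulary words, one nonzero per row.
dense = np.array([
    [0.0, 0.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
])
sparse = csr_matrix(dense)

print("shape:", sparse.shape)
print("stored (nonzero) entries:", sparse.nnz)
print("entries in the dense version:", dense.size)
```

With a vocabulary of thousands of words, the gap between stored entries and total entries becomes far larger than in this toy case.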


In [33]:
# construct sparse feature matrix
# params:
#     df: dataframe, with 'original' and 'edit' columns
#     vectorizer: sklearn text vectorizer, either TfidfVectorizer or Countvectorizer 
# return:
#     M: a sparse feature matrix that represents df's textual information (used by a predictive model)

def construct_feature_matrix(df, vectorizer):
    edit_words = df['edit'].tolist()
    # original_words and contexts are extracted but not used below;
    # only the edit word is turned into features
    original_words, contexts = separate_original_word_from_title(df)

    # here the dimensionality of X_edit is len(df) x |V|
    X_edit = vectorizer.transform(edit_words)

    return X_edit

In [56]:
# Construct feature matrices for training and validation data
train_X = construct_feature_matrix(train_dataframe, vectorizer)
valid_X = construct_feature_matrix(valid_dataframe, vectorizer)
test_X = construct_feature_matrix(test_dataframe, vectorizer)

# Train model on training set, evaluate model on validation set

You could use a number of different models here. Look at this list and see which other models you might want to use:

https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model
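As one illustration of swapping models, the sketch below fits both `LinearRegression` and `Ridge` (a linear model with a penalty on large weights) to the same made-up data. The data and the choice of `alpha=1.0` are assumptions for this example, not a claim about which model suits the headline task.

```python
from sklearn.linear_model import LinearRegression, Ridge

# Made-up data (an assumption for illustration): y equals x exactly.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0.0, 1.0, 2.0, 3.0]

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# Ridge shrinks the learned weight toward zero compared with plain least squares.
print('LinearRegression weight:', ols.coef_[0])
print('Ridge weight:', ridge.coef_[0])
```

Both models share the same `fit` and `predict` interface, so trying another one in the cell below only requires changing the line that creates the model.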


In [37]:
# train a linear regression model
lm = LinearRegression()
model = lm.fit(train_X, train_Y)
print (model.intercept_)
print (model.coef_.shape)

0.8836506400461821
(8900,)


In [38]:
# Evaluate model on validation set
valid_Y_hat = model.predict(valid_X)
rmse = np.sqrt(sklearn.metrics.mean_squared_error(valid_Y, valid_Y_hat))
print('RMSE on validation set:', rmse)

RMSE on validation set: 0.6115503493627421


In [39]:
# Evaluate model on training set: 
# expect to see unrealistically good performance! (for RMSE: lower is better)
# unrealistic because YOUR MODEL IS TRAINED ON EXACTLY THESE DATA!
# It gives the best validation/test performance you could hope to achieve using this model.

train_Y_hat = model.predict(train_X)
rmse = np.sqrt(sklearn.metrics.mean_squared_error(train_Y, train_Y_hat))
print('RMSE on training set:', rmse)

RMSE on training set: 0.3349698511504976


In [40]:
# apply the model on test data, write out prediction results to a csv file
test_Y_hat = model.predict(test_X)
write_test_prediction(test_dataframe, test_Y_hat, './linear-regression_baseline_new-tf.csv')

# Investigate what the model has learned and where it failed (A.K.A. error analysis)

## Look at learned parameters (for linear model: weight of each dimension)

In [54]:
# construct a mapping: word -> learned weight of this word
feature_weight = {}
for word, idx in vectorizer.vocabulary_.items():
    feature_weight[word] = model.coef_[idx]

In [55]:
# words that correlate positively with funniness (top ones)
for k, v in sorted(feature_weight.items(), key = lambda x: x[1], reverse = True)[:10]:
     print (k, v)

bathe 1.9163493569274035
buttock 1.9163493569274035
mistresses 1.9163493569274035
biceps 1.716349356927438
midlife 1.716349356927438
sexy 1.716349356927438
dealer 1.716349356927438
spanks 1.716349356927438
santa 1.5163493569274713
assassinations 1.5163493569274713


In [47]:
# words that correlate negatively with funniness (top ones)
for k, v in sorted(feature_weight.items(), key = lambda x: x[1], reverse = False)[:10]:
     print (k, v)

opposition -0.8836506444659917
years -0.8836506444659917
border -0.8836506444659917
sale -0.8836506444659917
trump -0.8836506430721189
check -0.8836506430721189
blames -0.8836506430721189
hike -0.8836506430721189
radio -0.8836506430721189
attacks -0.8836506430721189


## Look at how the model makes predictions on individual examples
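For a linear model, each prediction is the intercept plus the sum of every feature value times its learned weight. A minimal sketch of that arithmetic, using the intercept and the "kitchen" weight copied from this notebook's printed output (the dictionaries are just an illustrative way to hold the numbers):

```python
# Numbers copied from the "kitchen" example in this notebook's printed output.
intercept = 0.8836506400461821
weights = {'kitchen': 0.26634936893631467}
feature_values = {'kitchen': 1.0}

# prediction = intercept + sum over features of (feature value * weight)
prediction = intercept + sum(
    feature_values[word] * weights[word] for word in feature_values
)
print(prediction)
```

When an edit word was never seen in training, no feature fires and the prediction is just the intercept.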

In [48]:
# We pick a set of examples from the validation set (we predicted scores for those).
# We usually don't pick from training data (since the good performance may be unrealistic).
# We cannot do error analysis on test data (because no true target value is provided).

In [49]:
def explain_linear_prediction(df, model, idx2feature, X, Y, Y_hat, idx_list):
    print('indices:', idx_list)
    for idx in idx_list:
        print ('==============', idx, '================')
        print ('original:', df.iloc[idx]['original'])
        print ('edit:', df.iloc[idx]['edit'])
        print ('grades:', df.iloc[idx]['grades'])
        print ('TRUE score:', df.iloc[idx]['meanGrade'])
        print ('PRED score:', Y_hat[idx])
        
        print ('\nPRED breakdown:')
        print ('\tINTERCEPT', model.intercept_)
        row = X[idx, :]  # a single row of the sparse matrix
        if row.nnz == 0:
            print ('\tFEATURE', '[EMPTY]')
        else:
            # loop over every nonzero feature in this row (column index, value)
            for feature_dim, feature_value in zip(row.indices, row.data):
                print ('\tFEATURE', idx2feature[feature_dim], ':', 'f_value', feature_value, '*', 'f_weight', model.coef_[feature_dim], '=', feature_value*model.coef_[feature_dim])
        

In [50]:
# construct a dictionary mapping: feature index -> word
idx2feature = dict([(v,k) for k,v in vectorizer.vocabulary_.items()])

errors = (valid_Y - valid_Y_hat)**2
# sort errors from low to high
sorted_errors = sorted(enumerate(errors.iloc[:].tolist()), key = lambda x: x[1], reverse = False)
# print(sorted_errors)

### prediction on random examples

In [51]:
# pick a random set of examples from validation set:
K = 5
random_indices = np.random.randint(0, valid_X.shape[0], K)
explain_linear_prediction(valid_dataframe, model, idx2feature, valid_X, valid_Y, valid_Y_hat, random_indices)

indices: [1088 1232 1065 1669  505]
original: Starbucks <encourages/> bipartisan coffee-drinking
edit: forces
grades: 32110
TRUE score: 1.4
PRED score: 0.8836506400461821

PRED breakdown:
	INTERCEPT 0.8836506400461821
	FEATURE forces : f_value 1.0 * f_weight 0.0 = 0.0
original: A detailed <analysis/> of the Trump-Palin-Nugent-Kid Rock photo
edit: shocker
grades: 11000
TRUE score: 0.4
PRED score: 0.8836506400461821

PRED breakdown:
	INTERCEPT 0.8836506400461821
	FEATURE [EMPTY]
original: Major Referendum Today in Turkey , Decision on Whether or Not To Expand Turkish President Erdogan 's <Power/> and Role
edit: kitchen
grades: 32111
TRUE score: 1.6
PRED score: 1.1500000089824969

PRED breakdown:
	INTERCEPT 0.8836506400461821
	FEATURE kitchen : f_value 1.0 * f_weight 0.26634936893631467 = 0.26634936893631467
original: Trump border wall : Texans receiving letters about their <land/> 
edit: barbecue
grades: 32222
TRUE score: 2.2
PRED score: 1.0666666540652607

PRED breakdown:
	INTERCEPT 0.8

### examples with closest prediction

In [52]:
K = 5
# look at data with lowest prediction error
low_error_indices  = [i for i, v in sorted_errors[:K]]
explain_linear_prediction(valid_dataframe, model, idx2feature, valid_X, valid_Y, valid_Y_hat, low_error_indices)

indices: [1428, 84, 110, 619, 955]
original: Susan Sarandon : ‘ I Do n’t Think Trump ’s Gon na Make It Through His Whole <Term/> ’
edit: sandwich
grades: 32110
TRUE score: 1.4
PRED score: 1.4000000024059092

PRED breakdown:
	INTERCEPT 0.8836506400461821
	FEATURE sandwich : f_value 1.0 * f_weight 0.5163493623597272 = 0.5163493623597272
original: FBI Director asks Justice Department to publicly <denounce/> Trump 's assertion of Obama wiretapping
edit: approve
grades: 0
TRUE score: 0.0
PRED score: -3.025936834433196e-09

PRED breakdown:
	INTERCEPT 0.8836506400461821
	FEATURE approve : f_value 1.0 * f_weight -0.8836506430721189 = -0.8836506430721189
original: Trump walks back bizarre <comments/> on funding black colleges - but this administration ’s racism is no mistake
edit: rant
grades: 0
TRUE score: 0.0
PRED score: -3.025936834433196e-09

PRED breakdown:
	INTERCEPT 0.8836506400461821
	FEATURE rant : f_value 1.0 * f_weight -0.8836506430721189 = -0.8836506430721189
original: A

### examples with worst predictions

In [53]:
K = 5
# look at data with highest prediction error
high_error_indices = [i for i, v in sorted_errors[-K:]]
explain_linear_prediction(valid_dataframe, model, idx2feature, valid_X, valid_Y, valid_Y_hat, high_error_indices)

indices: [978, 120, 2146, 310, 774]
original: Spicer defends Trump : Issues are ' evolving towards the president 's <position/> '
edit: mouth
grades: 10000
TRUE score: 0.2
PRED score: 2.199999996973688

PRED breakdown:
	INTERCEPT 0.8836506400461821
	FEATURE mouth : f_value 1.0 * f_weight 1.316349356927506 = 1.316349356927506
original: Donald Trump Endorses Keeping Senate in <Session/> Seven Days a Week to Get Nominees Approved
edit: Jail
grades: 32222
TRUE score: 2.2
PRED score: 0.19999999697402915

PRED breakdown:
	INTERCEPT 0.8836506400461821
	FEATURE jail : f_value 1.0 * f_weight -0.683650643072153 = -0.683650643072153
original: Trump to North Korean leader Kim : My ‘ Nuclear <Button/> ’ is ‘ much bigger &amp; more powerful ’
edit: Belly
grades: 33222
TRUE score: 2.4
PRED score: 0.3999999969739948

PRED breakdown:
	INTERCEPT 0.8836506400461821
	FEATURE belly : f_value 1.0 * f_weight -0.4836506430721873 = -0.4836506430721873
original: Charlotte Pence : I Bought The Gay <Bunny

## Conclusion

In this notebook, we have learned what machine learning is and, in general terms, how it works. We also began to train our first model. Please email me at grantg@live.unc.edu before our next class with a dataset you would like to work with. If you don't have one, I will provide you with a dataset.