# CORE LEARNING ALGORITHMS

## 1- Linear Regression:

The next example is a model that predicts whether Titanic passengers survived, based on certain features, via linear regression.

In [1]:
!pip install -q scikit-learn

### Working with pandas' dataframes


**Some notes**:
- ".csv": stands for comma-separated values
- We will use the "survived" column as our label (the value we are trying to predict)

In [27]:
from __future__ import absolute_import, division, print_function, unicode_literals

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import clear_output
from six.moves import urllib

import tensorflow.compat.v2.feature_column as fc
import tensorflow as tf

dftrain = pd.read_csv('https://storage.googleapis.com/tf-datasets/titanic/train.csv') # training data
dfeval = pd.read_csv('https://storage.googleapis.com/tf-datasets/titanic/eval.csv') # testing data


y_train = dftrain.pop('survived')
y_eval = dfeval.pop('survived')

clear_output()

### Some functions dealing with dataframes:

**pd.read_csv('csv_file_name'):** reads a CSV file and returns a pandas dataframe (something like a table)

**df.head():** returns the first five rows of the dataframe

**df.pop('column_name'):** removes the column named "column_name" from the dataframe and returns it

**df.loc\[i\]:** returns the row with index label i

**df.describe():** returns summary statistics (count, mean, std, min, quartiles, max) for the numeric columns

**df.shape:** the table's dimensions as (rows, columns); it is an attribute, not a method, so it takes no parentheses

**df.feature.hist(bins=i):** displays a histogram of the column "feature" with i bins

*Note*: plot some charts of your data to build intuition about what the label may depend on.
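As a quick illustration of the functions listed above, here is a minimal sketch on a tiny in-memory dataframe. The column values are made up and only stand in for the real Titanic CSV; the histogram call is left out because it needs a plotting backend.

```python
import pandas as pd

# Hypothetical toy data standing in for the Titanic CSV (values are made up).
df = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0, 28.0, 54.0],
    "fare": [7.25, 71.28, 7.92, 53.10, 8.46, 51.86],
    "survived": [0, 1, 1, 1, 0, 0],
})

first_rows = df.head()        # first five rows by default
labels = df.pop("survived")   # removes the column from df and returns it
row_two = df.loc[2]           # row whose index label is 2
stats = df.describe()         # summary statistics per numeric column

print(df.shape)               # (6, 2): shape is an attribute, not a method
print(row_two["age"])         # 26.0
print(stats.loc["mean", "age"])
```

Note that `first_rows` was taken before the `pop`, so it still contains the "survived" column, while `df` no longer does.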



### Important Notes:

- Split the data into a training set and a testing set.

- **Categorical Data:** Data that is made up of categories (e.g. Male, Female, Other). We deal with it through **one hot encoding**.

#### One hot encoding:
Every category becomes its own feature. For each entry, we put a 1 in the feature of its category and 0's in all the others.
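A minimal pure-Python sketch of the idea, assuming a hypothetical three-value vocabulary (TensorFlow's feature columns do this for us in the code below; the `one_hot` helper here is only for illustration):

```python
def one_hot(value, vocabulary):
    """Return a list with 1 at the position of `value` and 0 elsewhere."""
    return [1 if category == value else 0 for category in vocabulary]

# Hypothetical vocabulary for a categorical column.
vocabulary = ["male", "female", "other"]

encoded = [one_hot(v, vocabulary) for v in ["female", "male", "other"]]
print(encoded)  # [[0, 1, 0], [1, 0, 0], [0, 0, 1]]
```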




In [28]:
CATEGORICAL_COLUMNS = ['sex', 'n_siblings_spouses', 'parch', 'class', 'deck',
                       'embark_town', 'alone']
NUMERIC_COLUMNS = ['age', 'fare']

feature_columns = []

for feature_name in CATEGORICAL_COLUMNS:
  vocabulary = dftrain[feature_name].unique()  # gets a list of all unique values from given feature column
  feature_columns.append(tf.feature_column.categorical_column_with_vocabulary_list(feature_name, vocabulary))

for feature_name in NUMERIC_COLUMNS:
  feature_columns.append(tf.feature_column.numeric_column(feature_name, dtype=tf.float32))

#print(feature_columns)

### Training process:

- Data is fed to the model in batches; we don't pass the entire dataset to the model at once
- These batches can be fed to the model multiple times, according to the number of epochs

**Epochs:** one epoch is one pass over the entire dataset; the number of epochs is the number of times the model will see the entire dataset.
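The relationship between batches and epochs can be sketched in plain Python. This is only an illustration with made-up data and a hypothetical `stream_batches` helper, not what TensorFlow does internally:

```python
def stream_batches(data, batch_size, num_epochs):
    """Yield (epoch, batch) pairs, covering the whole dataset once per epoch."""
    for epoch in range(num_epochs):
        for start in range(0, len(data), batch_size):
            yield epoch, data[start:start + batch_size]

data = list(range(10))  # ten made-up data points
batches = list(stream_batches(data, batch_size=4, num_epochs=2))

for epoch, batch in batches:
    print(epoch, batch)
# 3 batches per epoch (sizes 4, 4, 2), so 6 batches in total
```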

**Overfitting:** the model memorizes the data points in the training set and fails to generalize to new data.

**Underfitting:** the opposite of overfitting: the model has not learned enough from the training data to predict well, even on the training set itself.


A TensorFlow model requires the data we pass to it to be in the form of a ```tf.data.Dataset``` object. So, we create an *input function* to convert a pandas dataframe to a tf Dataset.


In [29]:
def make_input_fn(data_df, label_df, num_epochs=10, shuffle=True, batch_size=32):
  def input_function():  # inner function, this will be returned
    ds = tf.data.Dataset.from_tensor_slices((dict(data_df), label_df))  # create tf.data.Dataset object with data and its label
    if shuffle:
      ds = ds.shuffle(1000)  # randomize order of data
    ds = ds.batch(batch_size).repeat(num_epochs)  # split dataset into batches of batch_size and repeat for the number of epochs
    return ds  # return the batched dataset
  return input_function  # return a function object for use

train_input_fn = make_input_fn(dftrain, y_train)  # the estimator will call this returned function to get a dataset object it can train on
eval_input_fn = make_input_fn(dfeval, y_eval, num_epochs=1, shuffle=False)


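The above is a standard input function taken from the TensorFlow documentation.

*Note*: for evaluation we use num_epochs=1 and shuffle=False, since the model only needs to see the evaluation data once and the order does not matter.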



### Creating, Training, Testing and Predicting

**Creating:** to create a model we need to create an *estimator*; there is one for every core learning algorithm.

Since we're using linear regression, we will use an estimator of type LinearClassifier, which takes the feature_columns created before.

**Training & testing:** we train and test the model by passing it the input functions created above.

**Predicting:** we pass an input function to the estimator and get back, for each entry, the probability of label 0 and of label 1.

In [30]:
linear_est = tf.estimator.LinearClassifier(feature_columns=feature_columns)

clear_output()
# We create a linear estimator by passing the feature columns we created earlier

In [31]:
linear_est.train(train_input_fn)  # train
result = linear_est.evaluate(eval_input_fn)  # get model metrics/stats by testing on testing data

clear_output()  # clears console output
print(result['accuracy'])  # the result variable is simply a dict of stats about our model

0.7537879


In [48]:
result_dicts = list(linear_est.predict(eval_input_fn))
probs = pd.Series([result['probabilities'][1] for result in result_dicts])

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from C:\Users\NOURHE~1\AppData\Local\Temp\tmpt0wzfmbg\model.ckpt-200
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.


## Summary - Linear Regression:

- Read csv files and put them into pandas dataframes
- Do one-hot encoding on categorical data: outputting *feature_columns*
- Make *input function* transforming dataframe to a tf Dataset 
- Create a LinearClassifier estimator using the feature_columns list as input
- Train estimator: ```est.train(train_input_fn)```
- Evaluate estimator (test it) & return result: ```est.evaluate(eval_input_fn)```
- result\['accuracy'\]: shows accuracy of model
- To predict: ```est.predict(input_fn)```, returns result, we can transform result to list
- If we print one item of the list result: we can see percentage of survival (probability of label)
- ```print(result[0]['probabilities'][1])``` <-- Probability that passenger survives.
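To make the last few steps concrete, here is a small sketch of how the prediction output can be read. The dictionaries below are made-up stand-ins for what `est.predict` yields, and the 0.5 cut-off for turning a probability into a label is an assumption chosen for illustration:

```python
# Hypothetical prediction output: one dict per passenger, where
# 'probabilities' holds [P(label 0), P(label 1)]. Values are made up.
result_dicts = [
    {"probabilities": [0.92, 0.08]},
    {"probabilities": [0.35, 0.65]},
    {"probabilities": [0.51, 0.49]},
]

# Probability of survival is the entry at index 1.
survival_probs = [d["probabilities"][1] for d in result_dicts]

# Assumed decision rule: predict "survived" when the probability is at least 0.5.
predicted_labels = [1 if p >= 0.5 else 0 for p in survival_probs]

print(survival_probs)    # [0.08, 0.65, 0.49]
print(predicted_labels)  # [0, 1, 0]
```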



**Notes**: changing # of epochs changes accuracy.


## 2- Classification

Classification is used to separate data points into classes with different labels. The next example classifies flowers into one of three species.


In [34]:
from __future__ import absolute_import, division, print_function, unicode_literals
import tensorflow as tf
import pandas as pd

CSV_COLUMN_NAMES = ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth', 'Species']
SPECIES = ['Setosa', 'Versicolor', 'Virginica']


Above, we defined constants for the flower features and the flower species (labels).

The dataset used is the "iris" dataset. Here, we download the csv files through Keras and load them into pandas dataframes.

In [38]:
train_path = tf.keras.utils.get_file(
    "iris_training.csv", "https://storage.googleapis.com/download.tensorflow.org/data/iris_training.csv")
test_path = tf.keras.utils.get_file(
    "iris_test.csv", "https://storage.googleapis.com/download.tensorflow.org/data/iris_test.csv")

train = pd.read_csv(train_path, names=CSV_COLUMN_NAMES, header=0)
test = pd.read_csv(test_path, names=CSV_COLUMN_NAMES, header=0)


In [40]:
train.head()

| | SepalLength | SepalWidth | PetalLength | PetalWidth | Species |
|---|---|---|---|---|---|
| 0 | 6.4 | 2.8 | 5.6 | 2.2 | 2 |
| 1 | 5.0 | 2.3 | 3.3 | 1.0 | 1 |
| 2 | 4.9 | 2.5 | 4.5 | 1.7 | 2 |
| 3 | 4.9 | 3.1 | 1.5 | 0.1 | 0 |
| 4 | 5.7 | 3.8 | 1.7 | 0.3 | 0 |


In [42]:
train.shape

(120, 5)

The shape shows that we have 120 entries, each with 4 features plus the label (the "Species" column).

Next, we pop the Species column and use it as the label.

After getting our data and labels, we make an input function (slightly different from before). This input function takes no number of epochs: in training mode it repeats the data indefinitely, and the number of training steps decides when to stop. Otherwise it does the same thing as before: it converts a pandas dataframe into a tf.data.Dataset.

In [43]:
train_y = train.pop('Species')
test_y = test.pop('Species')
train.head() # the species column is now gone

| | SepalLength | SepalWidth | PetalLength | PetalWidth |
|---|---|---|---|---|
| 0 | 6.4 | 2.8 | 5.6 | 2.2 |
| 1 | 5.0 | 2.3 | 3.3 | 1.0 |
| 2 | 4.9 | 2.5 | 4.5 | 1.7 |
| 3 | 4.9 | 3.1 | 1.5 | 0.1 |
| 4 | 5.7 | 3.8 | 1.7 | 0.3 |


In [50]:
def input_fn(features, labels, training=True, batch_size=256):
    # Convert the inputs to a Dataset.
    dataset = tf.data.Dataset.from_tensor_slices((dict(features), labels))

    # Shuffle and repeat if you are in training mode
    # Another way of saying: shuffle=True
    if training:
        dataset = dataset.shuffle(1000).repeat()
    
    return dataset.batch(batch_size)

In [51]:
my_feature_columns = []
for key in train.keys():
    my_feature_columns.append(tf.feature_column.numeric_column(key=key))
print(my_feature_columns)

[NumericColumn(key='SepalLength', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None), NumericColumn(key='SepalWidth', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None), NumericColumn(key='PetalLength', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None), NumericColumn(key='PetalWidth', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None)]


We don't need a vocabulary here because there is no categorical data to one hot encode: all four features are numeric.

**train.keys():** returns the column names of the dataframe; each key is the name of one feature column.

### Estimator

For classification tasks there is a variety of estimators/models that we can pick from:

- DNNClassifier (Deep Neural Network) - *better*
- LinearClassifier

Why is DNNClassifier better? Because we may not be able to find a linear correspondence in our data. 

In [55]:
# Build a DNN with 2 hidden layers with 30 and 10 hidden nodes each (arbitrary #).
classifier = tf.estimator.DNNClassifier(
    feature_columns=my_feature_columns,
    hidden_units=[30, 10],
    n_classes=3)

clear_output()

### DNN Classifier Use:

DNN Classifier takes 3 inputs: feature_columns, a list of hidden units and number of classes to classify data into.

**Hidden units list:** number of elements is number of hidden layers and each element's value is the number of nodes in that layer.
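To get a feel for what the hidden units list implies, the sketch below counts the trainable parameters of a network with 4 input features, hidden layers of 30 and 10 nodes, and 3 output classes. It assumes plain fully connected layers with one bias per node; `count_parameters` is a hypothetical helper written for this note, not part of TensorFlow.

```python
def count_parameters(n_inputs, hidden_units, n_classes):
    """Count weights and biases of a fully connected network."""
    layer_sizes = [n_inputs] + list(hidden_units) + [n_classes]
    total = 0
    for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]):
        total += fan_in * fan_out + fan_out  # weights + one bias per node
    return total

# 4*30+30 = 150, 30*10+10 = 310, 10*3+3 = 33
print(count_parameters(4, [30, 10], 3))  # 493
```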

### Training

- To avoid creating an inner function (as we did for linear regression), we use a lambda, which is an anonymous one-line function:

```variable = lambda arguments: expression```

- The steps parameter sets how many batches the classifier trains on. It is similar to epochs, but instead of running through the dataset a given number of times, training stops after the classifier has processed that many batches.
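The link between steps and epochs is simple arithmetic. The sketch below uses the values from this notebook's iris example (5000 steps, batch size 256, 120 training rows) and assumes every batch is full, which holds when the dataset repeats indefinitely before batching; `epochs_seen` is a hypothetical helper.

```python
def epochs_seen(steps, batch_size, n_examples):
    """Approximate number of passes over the data made in `steps` batches."""
    return steps * batch_size / n_examples

# 5000 steps * 256 examples per step = 1,280,000 examples seen in total.
print(round(epochs_seen(5000, 256, 120)))  # 10667
```

So with such a small dataset, 5000 steps correspond to roughly ten thousand passes over the training data.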


In [57]:
x = lambda a: a+10
x(5)

15

In [62]:
# We use a lambda to avoid creating an inner function like we did previously
classifier.train(
    input_fn=lambda: input_fn(train, train_y, training=True),
    steps=5000)

clear_output()

### Evaluating model (Testing)

In [64]:
eval_result = classifier.evaluate(
    input_fn=lambda: input_fn(test, test_y, training=False))

clear_output()
print('\nTest set accuracy: {accuracy:0.3f}\n'.format(**eval_result))


Test set accuracy: 0.967



### Predicting

**class_ids:** the index of the predicted class in the SPECIES list of constants

**probabilities:** a list of 3 probabilities, one for each class

**pred_dict\['probabilities'\]\[class_id\]:** the probability of the predicted class, i.e. the highest of the 3; SPECIES\[class_id\] gives its name.
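The relationship between the probabilities and the predicted class can be sketched without TensorFlow. The prediction dictionary below is made up for illustration; a real one also carries a `class_ids` entry, which should match the index computed here:

```python
SPECIES = ["Setosa", "Versicolor", "Virginica"]

# Hypothetical prediction for one flower; the probabilities are made up.
pred_dict = {"probabilities": [0.02, 0.15, 0.83]}

probs = pred_dict["probabilities"]
# Index of the largest probability, i.e. the predicted class.
class_id = max(range(len(probs)), key=lambda i: probs[i])

print(SPECIES[class_id], probs[class_id])  # Virginica 0.83
```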

In [71]:
def input_fn(features, batch_size=256):
    # Convert the inputs to a Dataset without labels.
    return tf.data.Dataset.from_tensor_slices(dict(features)).batch(batch_size)

features = ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth']
predict = {}

print("Please type numeric values as prompted.")
for feature in features:
  while True:  # keep asking until the input parses as a number
    val = input(feature + ": ")
    try:
      predict[feature] = [float(val)]
      break
    except ValueError:
      print("Please enter a numeric value.")

predictions = classifier.predict(input_fn=lambda: input_fn(predict))
for pred_dict in predictions:
    class_id = pred_dict['class_ids'][0]
    probability = pred_dict['probabilities'][class_id]

    print('Prediction is "{}" ({:.1f}%)'.format(
        SPECIES[class_id], 100 * probability))

Please type numeric values as prompted.
SepalLength: 2.3
SepalWidth: 2.3
PetalLength: 2.3
PetalWidth: 2.3
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from C:\Users\NOURHE~1\AppData\Local\Temp\tmpcabkgcd_\model.ckpt-25000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
Prediction is "Virginica" (90.7%)


## Summary - Classification:

- Load data into a pandas dataframe
- Create feature_columns for the numeric keys
- Specify the input function: batch size, shuffle, etc.
- Create a DNNClassifier estimator with inputs: feature_columns, hidden_units and n_classes
- classifier.train (with the input function and the number of steps)
- classifier.evaluate
- To predict unknown values: classifier.predict, **but** with a different input function that takes no labels

## 3- Clustering: Unsupervised Learning

There are no labels and no known outputs: we only have input data, and the goal is to group it into K clusters.

**Centroid**: the point representing the current center of a cluster


### K-Means Algorithm:

1. Choose K points to place K centroids.
2. Each point is assigned to the closest centroid by distance.
3. Move each centroid to the center of mass of the points assigned to it.
4. Repeat assignment step.
5. Repeat last two steps until none of the points are moving anymore (until centroids are directly in the middle of clusters of data).

To predict:
1. Plot the new data point.
2. Find its distance to every centroid.
3. The output is the label of the closest centroid's cluster.

Note: You need to define variable K (number of clusters).
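The steps above can be sketched in plain Python. This is a minimal illustration, assuming Euclidean distance, made-up 2-D points, and hand-picked starting centroids (in practice the starting points are often chosen at random); `closest` and `k_means` are hypothetical helper names.

```python
import math

def closest(point, centroids):
    """Index of the centroid nearest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def k_means(points, centroids, max_iters=100):
    """Alternate assignment and centroid update until the centroids stop moving."""
    for _ in range(max_iters):
        # Assignment step: group points by their nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[closest(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its points
        # (a centroid with no points stays where it is).
        new_centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids

# Made-up points forming two obvious groups; K = 2.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids = k_means(points, centroids=[(0.0, 0.0), (10.0, 10.0)])

print(centroids)
print(closest((7.0, 7.0), centroids))  # predict: index of the nearest centroid
```

Predicting for a new point is just the assignment step applied once: the point gets the label of its nearest centroid.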


## 4- Hidden Markov Models: Unsupervised Learning

We deal with probability distributions. The example used is weather forecasting, given the probabilities of certain events occurring.

We have a set of states (e.g. hot day, cold day).

In each state, we have an observation. 