<a href="https://colab.research.google.com/github/lailaashraff/BasicMachineLearning/blob/main/BasicML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CORE LEARNING ALGORITHMS

## 1- Linear Regression:

The next example trains a linear model to predict whether a Titanic passenger survived, using features such as age, sex, and fare.

In [1]:
!pip install -q scikit-learn

### Working with pandas' dataframes


**Some notes**:
- ".csv": stands for comma-separated values
- We will use the "survived" column as our label (the value we are trying to predict)

In [2]:
from __future__ import absolute_import, division, print_function, unicode_literals

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import clear_output
from six.moves import urllib

import tensorflow.compat.v2.feature_column as fc
import tensorflow as tf

dftrain = pd.read_csv('https://storage.googleapis.com/tf-datasets/titanic/train.csv') # training data
dfeval = pd.read_csv('https://storage.googleapis.com/tf-datasets/titanic/eval.csv') # testing data


y_train = dftrain.pop('survived')
y_eval = dfeval.pop('survived')

clear_output()

### Some functions dealing with dataframes:

**pd.read_csv('csv_file_name'):** returns a pandas dataframe (something like a table)

**df.head():** returns the first five rows of the dataframe

**df.pop('column_name'):** removes the column named "column_name" from the dataframe and returns it

**df.loc\[i\]:** returns the row with index label i

**df.describe():** returns summary statistics (count, mean, std, min, quartiles, max) for the numeric columns

**df.shape:** returns the table shape as (rows, columns); it is an attribute, so it takes no parentheses

**df.feature.hist(bins=i):** displays a histogram of the column "feature" with i bins.

*Note*: plot some charts of your data to build intuition about which features relate to the label.
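As a quick illustration of the functions listed above, here is a minimal sketch on a tiny hand-made dataframe. The column values are made up for the example and are not taken from the Titanic files.

```python
import pandas as pd

# Made-up rows, only to exercise the dataframe functions
df = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0, 28.0, 2.0],
    "fare": [7.25, 71.28, 7.92, 53.10, 8.46, 21.08],
    "survived": [0, 1, 1, 1, 0, 0],
})

first_rows = df.head()        # first five rows
labels = df.pop("survived")   # removes the column from df and returns it
row = df.loc[1]               # row with index label 1
table_shape = df.shape        # attribute, not a method: (rows, columns)
summary = df.describe()       # count, mean, std, min, quartiles, max

print(table_shape)
print(row)
print(summary)
```

After the `pop`, `df` holds only the feature columns and `labels` holds the values we want to predict.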



### Important Notes:

- Separate the data into a training set and a testing set.

- **Categorical Data:** Data made up of categories (e.g. Male, Female, Other). We handle it with **one hot encoding**.

#### One hot encoding:
Every category becomes its own feature: each entry gets a 1 in the feature for its category and 0s in all the others.
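To make the idea concrete, here is a minimal pure-Python sketch of one hot encoding. The `one_hot` helper and the sample values are hypothetical; in the notebook itself TensorFlow's feature columns handle this step.

```python
def one_hot(value, vocabulary):
    """Return a list with 1 at the position of `value` and 0 elsewhere."""
    return [1 if value == category else 0 for category in vocabulary]

# Hypothetical vocabulary, in the spirit of the 'class' column
vocabulary = ["First", "Second", "Third"]
passengers = ["Third", "First", "Third"]

encoded = [one_hot(p, vocabulary) for p in passengers]
for passenger, row in zip(passengers, encoded):
    print(passenger, row)
```

A value that is not in the vocabulary gets all zeros in this sketch.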




In [3]:
CATEGORICAL_COLUMNS = ['sex', 'n_siblings_spouses', 'parch', 'class', 'deck',
                       'embark_town', 'alone']
NUMERIC_COLUMNS = ['age', 'fare']

feature_columns = []

for feature_name in CATEGORICAL_COLUMNS:
  vocabulary = dftrain[feature_name].unique()  # gets a list of all unique values from given feature column
  feature_columns.append(tf.feature_column.categorical_column_with_vocabulary_list(feature_name, vocabulary))

for feature_name in NUMERIC_COLUMNS:
  feature_columns.append(tf.feature_column.numeric_column(feature_name, dtype=tf.float32))

print(feature_columns)

[VocabularyListCategoricalColumn(key='sex', vocabulary_list=('male', 'female'), dtype=tf.string, default_value=-1, num_oov_buckets=0), VocabularyListCategoricalColumn(key='n_siblings_spouses', vocabulary_list=(1, 0, 3, 4, 2, 5, 8), dtype=tf.int64, default_value=-1, num_oov_buckets=0), VocabularyListCategoricalColumn(key='parch', vocabulary_list=(0, 1, 2, 5, 3, 4), dtype=tf.int64, default_value=-1, num_oov_buckets=0), VocabularyListCategoricalColumn(key='class', vocabulary_list=('Third', 'First', 'Second'), dtype=tf.string, default_value=-1, num_oov_buckets=0), VocabularyListCategoricalColumn(key='deck', vocabulary_list=('unknown', 'C', 'G', 'A', 'B', 'D', 'F', 'E'), dtype=tf.string, default_value=-1, num_oov_buckets=0), VocabularyListCategoricalColumn(key='embark_town', vocabulary_list=('Southampton', 'Cherbourg', 'Queenstown', 'unknown'), dtype=tf.string, default_value=-1, num_oov_buckets=0), VocabularyListCategoricalColumn(key='alone', vocabulary_list=('n', 'y'), dtype=tf.string, def

### Training process:

- Data is fed to the model in batches; we don't load the entire dataset into the model at once
- These batches can be fed to the model multiple times, according to the number of epochs

**Epochs:** one epoch is one pass over the entire dataset; the number of epochs is the number of times the model sees the entire dataset.

**Overfitting:** the model memorizes the data points in the training set and fails to classify new data correctly.

**Underfitting:** the opposite of overfitting: the model has not learned enough to predict well, even on the training data.
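The relationship between batches and epochs can be sketched in plain Python. This is only an illustration of the idea, not how `tf.data` is implemented; the function name and the toy dataset are assumptions made for the example.

```python
import random

def batches_for_training(examples, batch_size, num_epochs, shuffle=True, seed=0):
    """Yield (epoch, batch) pairs: every epoch is one full pass over the data."""
    rng = random.Random(seed)
    for epoch in range(num_epochs):
        order = list(examples)
        if shuffle:
            rng.shuffle(order)  # new random order for each pass
        for start in range(0, len(order), batch_size):
            yield epoch, order[start:start + batch_size]

examples = list(range(10))  # toy dataset of 10 entries
all_batches = list(batches_for_training(examples, batch_size=4, num_epochs=2))

for epoch, batch in all_batches:
    print(epoch, batch)
```

With 10 entries and a batch size of 4, each epoch produces three batches (the last one is smaller), and every entry appears exactly once per epoch.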


A TensorFlow model requires the data we pass to it to be a ```tf.data.Dataset``` object. So, we create an *input function* that converts a pandas dataframe to a tf Dataset.


In [4]:
def make_input_fn(data_df, label_df, num_epochs=10, shuffle=True, batch_size=32):
  def input_function():  # inner function, this will be returned
    ds = tf.data.Dataset.from_tensor_slices((dict(data_df), label_df))  # create tf.data.Dataset object with data and its label
    if shuffle:
      ds = ds.shuffle(1000)  # randomize order of data
    ds = ds.batch(batch_size).repeat(num_epochs)  # split dataset into batches of batch_size (32 by default) and repeat for the number of epochs
    return ds  # return the batched dataset
  return input_function  # return a function object for use

train_input_fn = make_input_fn(dftrain, y_train)  # make_input_fn returns a function; the estimator calls it later to get a dataset object
eval_input_fn = make_input_fn(dfeval, y_eval, num_epochs=1, shuffle=False)


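Above is a standard input function copied from the TensorFlow documentation.

*Notes*: for evaluation we use num_epochs=1 and shuffle=False, because one pass over the test data is enough and its order does not affect the metrics.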



### Creating, Training, Testing and Predicting

**Creating:** to create a model we need to create an *estimator*; TensorFlow provides one for each core learning algorithm.

Since we're using a linear model to predict one of two labels, we use an estimator of type LinearClassifier, which takes the feature_columns created before.

**Training & testing:** we train and test the model by passing it the input functions created above.

**Predicting:** we pass an input function to the model and get back the probability of each label (0 or 1) for every entry.
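For intuition about where a "probability of label 1" comes from, here is a minimal sketch of a linear model followed by a sigmoid, which is the usual way a two-label linear classifier turns a weighted sum into a probability. The weights, bias and passenger values are hypothetical, and this is not the estimator's actual code.

```python
import math

def predict_probability(features, weights, bias):
    """Weighted sum of the features, squashed into (0, 1) by a sigmoid."""
    z = bias + sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical learned parameters (a trained model finds these itself)
weights = {"age": -0.02, "fare": 0.01}
bias = 0.1

passenger = {"age": 30.0, "fare": 80.0}
p_survive = predict_probability(passenger, weights, bias)
print(p_survive)      # probability of label 1
print(1 - p_survive)  # probability of label 0
```

Training adjusts the weights and bias so that these probabilities match the labels in the training set as closely as possible.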

In [5]:
linear_est = tf.estimator.LinearClassifier(feature_columns=feature_columns)

clear_output()
# We create a linear estimator by passing the feature columns we created earlier

In [6]:
linear_est.train(train_input_fn)  # train
result = linear_est.evaluate(eval_input_fn)  # get model metrics/stats by testing on testing data

clear_output()  # clears console output
print(result['accuracy'])  # the result variable is simply a dict of stats about our model

0.77272725


In [7]:
result_dicts = list(linear_est.predict(eval_input_fn))
probs = pd.Series([result['probabilities'][1] for result in result_dicts])

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmpj3hvkeog/model.ckpt-200
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.


## Summary - Linear Regression:

- Read the csv files into pandas dataframes
- Do one-hot encoding on categorical data: this produces the *feature_columns* list
- Make an *input function* that transforms a dataframe into a tf Dataset
- Create a LinearClassifier estimator, passing it the feature_columns list
- Train the estimator: ```est.train(train_input_fn)```
- Evaluate the estimator (test it) & get the result: ```est.evaluate(eval_input_fn)```
- result\['accuracy'\]: shows the accuracy of the model
- To predict: ```est.predict(input_fn)``` returns the predictions, which we can convert to a list
- If we print one item of that list, we can see the probability of each label for that entry
- ```print(result[0]['probabilities'][1])``` <-- Probability that passenger survives.



**Notes**: changing the number of epochs changes the accuracy.


## 2- Classification

Classification is used to separate data points into classes with different labels. The next example classifies flowers into one of three species.


In [8]:
from __future__ import absolute_import, division, print_function, unicode_literals
import tensorflow as tf
import pandas as pd

CSV_COLUMN_NAMES = ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth', 'Species']
SPECIES = ['Setosa', 'Versicolor', 'Virginica']


Above, we defined constants for the CSV column names (flower features plus the label) and the flower species (labels).

The dataset used is the "iris" dataset. Here, we download the csv files with a Keras utility and load them into pandas dataframes.

In [9]:
train_path = tf.keras.utils.get_file(
    "iris_training.csv", "https://storage.googleapis.com/download.tensorflow.org/data/iris_training.csv")
test_path = tf.keras.utils.get_file(
    "iris_test.csv", "https://storage.googleapis.com/download.tensorflow.org/data/iris_test.csv")

train = pd.read_csv(train_path, names=CSV_COLUMN_NAMES, header=0)
test = pd.read_csv(test_path, names=CSV_COLUMN_NAMES, header=0)


Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/iris_training.csv
Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/iris_test.csv


In [10]:
train.head()

|   | SepalLength | SepalWidth | PetalLength | PetalWidth | Species |
|---|---|---|---|---|---|
| 0 | 6.4 | 2.8 | 5.6 | 2.2 | 2 |
| 1 | 5.0 | 2.3 | 3.3 | 1.0 | 1 |
| 2 | 4.9 | 2.5 | 4.5 | 1.7 | 2 |
| 3 | 4.9 | 3.1 | 1.5 | 0.1 | 0 |
| 4 | 5.7 | 3.8 | 1.7 | 0.3 | 0 |


In [11]:
train.shape

(120, 5)

Here, the label is the "Species" column, and we have 120 entries.

Then, we pop the Species column and use it as the label.

Next, after getting our data & labels, we make an input function (slightly different from before). This input function takes no epochs argument: in training mode it repeats the data indefinitely, and the number of steps decides when training stops. Otherwise it does the same thing as before: convert a pandas dataframe into a tf.data.Dataset.

In [12]:
train_y = train.pop('Species')
test_y = test.pop('Species')
train.head() # the species column is now gone

|   | SepalLength | SepalWidth | PetalLength | PetalWidth |
|---|---|---|---|---|
| 0 | 6.4 | 2.8 | 5.6 | 2.2 |
| 1 | 5.0 | 2.3 | 3.3 | 1.0 |
| 2 | 4.9 | 2.5 | 4.5 | 1.7 |
| 3 | 4.9 | 3.1 | 1.5 | 0.1 |
| 4 | 5.7 | 3.8 | 1.7 | 0.3 |


In [13]:
def input_fn(features, labels, training=True, batch_size=256):
    # Convert the inputs to a Dataset.
    dataset = tf.data.Dataset.from_tensor_slices((dict(features), labels))

    # Shuffle and repeat if you are in training mode
    # Another way of saying: shuffle=True
    if training:
        dataset = dataset.shuffle(1000).repeat()
    
    return dataset.batch(batch_size)

In [14]:
my_feature_columns = []
for key in train.keys():
    my_feature_columns.append(tf.feature_column.numeric_column(key=key))
print(my_feature_columns)

[NumericColumn(key='SepalLength', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None), NumericColumn(key='SepalWidth', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None), NumericColumn(key='PetalLength', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None), NumericColumn(key='PetalWidth', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None)]


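We don't need a vocabulary here because there is no categorical data: every feature is numeric. (The species labels are already encoded as the integers 0, 1 and 2.)

**train.keys():** returns the column names of the dataframe; each key plays the role of feature_name from before and becomes one feature column.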

### Estimator

For classification tasks there is a variety of estimators/models that we can pick from:

- DNNClassifier (Deep Neural Network) - *usually the better choice here*
- LinearClassifier

Why is DNNClassifier better? Because the relationship between the features and the label may not be linear, and a neural network can model non-linear relationships.

In [15]:
# Build a DNN with 2 hidden layers with 30 and 10 hidden nodes each (arbitrary #).
classifier = tf.estimator.DNNClassifier(
    feature_columns=my_feature_columns,
    hidden_units=[30, 10],
    n_classes=3)

clear_output()

### DNN Classifier Use:

DNN Classifier takes 3 inputs: feature_columns, a list of hidden units and number of classes to classify data into.

**Hidden units list:** number of elements is number of hidden layers and each element's value is the number of nodes in that layer.

### Training

- To avoid writing an inner function (as in the linear regression example) we use a lambda:

```variable = lambda *arguments* : *expression*```

- The steps argument is how many batches the classifier trains on. It is similar to epochs, but instead of passing through the whole dataset a set number of times, training stops after the classifier has processed `steps` batches.
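A little arithmetic shows how steps relate to epochs. The sketch below uses this notebook's 120 training entries and the input function's default batch size of 256, and takes steps=5000 as the example value; it assumes every step consumes one full batch, which holds when the training data is repeated indefinitely.

```python
num_examples = 120   # rows in the iris training set
batch_size = 256     # default batch size of the input function
steps = 5000         # number of batches to train on

examples_seen = steps * batch_size
approx_epochs = examples_seen / num_examples

print(examples_seen)
print(round(approx_epochs))
```

So a steps value can stand for a very large number of passes over a small dataset, which is worth keeping in mind when choosing it.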


In [16]:
x = lambda a: a+10
x(5)

15

In [17]:
# We use a lambda to avoid creating an inner function as we did previously
classifier.train(
    input_fn=lambda: input_fn(train, train_y, training=True),
    steps=5000)

clear_output()

### Evaluating model (Testing)

In [18]:
eval_result = classifier.evaluate(
    input_fn=lambda: input_fn(test, test_y, training=False))

clear_output()
print('\nTest set accuracy: {accuracy:0.3f}\n'.format(**eval_result))


Test set accuracy: 0.533



### Predicting

**class_ids:** the index of the predicted class; use it to look up the name in the SPECIES list

**probabilities**: the 3 probabilities, one for each class.

**pred_dict\['probabilities'\]\[class_id\]:** the probability of the predicted class, i.e. the highest of the 3.
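Here is a minimal sketch of how these fields fit together. The `pred_dict` below is a hand-written stand-in with made-up numbers; the real items yielded by `classifier.predict` hold arrays rather than plain lists, but are indexed the same way.

```python
SPECIES = ["Setosa", "Versicolor", "Virginica"]

# Hypothetical prediction for one flower (made-up probabilities)
pred_dict = {"class_ids": [2], "probabilities": [0.02, 0.11, 0.87]}

class_id = pred_dict["class_ids"][0]                # index of the predicted class
probability = pred_dict["probabilities"][class_id]  # probability of that class

# class_id is the position of the largest probability
probs = pred_dict["probabilities"]
argmax = max(range(len(probs)), key=probs.__getitem__)

print(SPECIES[class_id], probability, argmax == class_id)
```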

In [19]:
def input_fn(features, batch_size=256):
    # Convert the inputs to a Dataset without labels.
    return tf.data.Dataset.from_tensor_slices(dict(features)).batch(batch_size)

features = ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth']
predict = {}

print("Please type numeric values as prompted.")
for feature in features:
  while True:  # keep prompting until the input parses as a number
    val = input(feature + ": ")
    try:
      predict[feature] = [float(val)]
      break
    except ValueError:
      print("Please enter a numeric value.")

predictions = classifier.predict(input_fn=lambda: input_fn(predict))
for pred_dict in predictions:
    class_id = pred_dict['class_ids'][0]
    probability = pred_dict['probabilities'][class_id]

    print('Prediction is "{}" ({:.1f}%)'.format(
        SPECIES[class_id], 100 * probability))

Please type numeric values as prompted.
SepalLength: 4.8
SepalWidth: 1.7
PetalLength: 5.6
PetalWidth: 5.4
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmpf46i0eq5/model.ckpt-5000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
Prediction is "Virginica" (87.4%)


## Summary - Classification:

- Load the data into pandas dataframes
- Create feature_columns for the numeric keys
- Write the input function: batch size, shuffle, repeat, etc.
- Create a DNNClassifier estimator with inputs: feature_columns, hidden_units and n_classes
- classifier.train (pass the input function and the number of steps)
- classifier.evaluate
- If you want to predict unknown values: classifier.predict **but** with a different input function that takes no labels

## 3- Clustering: Unsupervised Learning

No labels and no known outputs: we only have the input data and want to group it into K clusters.

**Centroid**: the current center point of a cluster


### K-Means Algorithm:

1. Randomly choose K points as the initial positions of the K centroids.
2. Assign each data point to the closest centroid by distance.
3. Move each centroid to the center of mass of the points assigned to it.
4. Repeat the assignment step.
5. Repeat the last two steps until no point changes cluster anymore (the centroids then sit in the middle of their clusters).

To predict:
1. Plot the new data point.
2. Find its distance to every centroid.
3. The output is the label of the closest centroid's cluster.

Note: You need to choose K (the number of clusters) yourself.
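The steps above can be sketched in plain Python for 2D points. This is a minimal illustration under simple assumptions (Euclidean distance, initial centroids passed in by hand instead of chosen at random, made-up points), not a production implementation.

```python
import math

def closest(point, centroids):
    """Index of the centroid nearest to `point`."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def k_means(points, centroids, max_iterations=100):
    centroids = [tuple(c) for c in centroids]
    assignments = []
    for _ in range(max_iterations):
        # Assignment step: every point joins its closest centroid
        assignments = [closest(p, centroids) for p in points]
        # Update step: move each centroid to the mean of its points
        new_centroids = []
        for i, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        if new_centroids == centroids:  # nothing moved: converged
            break
        centroids = new_centroids
    return centroids, assignments

# Made-up data with two obvious groups; K = 2
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, assignments = k_means(points, centroids=[points[0], points[3]])

# Predicting: a new point gets the label of its closest centroid
label = closest((2.0, 1.0), centroids)
print(centroids, assignments, label)
```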


## 4- Hidden Markov Models: Unsupervised Learning

Here we deal with probability distributions. The example used is weather forecasting, given the probabilities of certain events occurring.

We have a set of states (e.g. hot day, cold day).

In each state, we have an observation.

### Data
Let's start by discussing the type of data we use when we work with a hidden markov model. 

In the previous sections we worked with large datasets of hundreds of different entries. For a Markov model we are only interested in probability distributions that have to do with states.

We can find these probabilities from large datasets or may already have these values. We'll run through an example in a second that should clear some things up, but let's first discuss the components of a Markov model.

**States:** In each Markov model we have a finite set of states. These states could be something like "warm" and "cold" or "high" and "low" or even "red", "green" and "blue". These states are "hidden" within the model, which means we do not directly observe them.

**Observations:** Each state has a particular outcome or observation associated with it based on a probability distribution. An example of this is the following: *On a hot day Tim has an 80% chance of being happy and a 20% chance of being sad.*

**Transitions:** Each state has a probability defining the likelihood of transitioning to a different state. An example is the following: *a cold day has a 30% chance of being followed by a hot day and a 70% chance of being followed by another cold day.*

To create a hidden Markov model we need:
- States
- Observation Distribution
- Transition Distribution

For our purpose we will assume we already have this information available as we attempt to predict the weather on a given day.
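To see how the three components work together, here is a small pure-Python simulation. It reuses the two example probabilities quoted above (mood on a hot day, what follows a cold day); the remaining numbers, the function names and the fixed seed are assumptions added only to make the sketch complete.

```python
import random

initial = {"cold": 0.5, "hot": 0.5}                # assumed starting distribution
transition = {
    "cold": {"cold": 0.7, "hot": 0.3},             # from the example above
    "hot": {"cold": 0.2, "hot": 0.8},              # assumed
}
observation = {
    "hot": {"happy": 0.8, "sad": 0.2},             # from the example above
    "cold": {"happy": 0.4, "sad": 0.6},            # assumed
}

def sample(distribution, rng):
    """Draw one outcome from a {outcome: probability} dict."""
    outcomes = list(distribution)
    weights = [distribution[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]

def simulate(num_days, seed=0):
    rng = random.Random(seed)
    state = sample(initial, rng)
    days = []
    for _ in range(num_days):
        # The state is hidden; only the observation would be visible in practice
        days.append((state, sample(observation[state], rng)))
        state = sample(transition[state], rng)
    return days

week = simulate(7)
for state, mood in week:
    print(state, mood)
```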

### Imports and Setup

In [8]:
# this line is not required unless you are in a notebook
%tensorflow_version 2.x


TensorFlow is already loaded. Please restart the runtime to change versions.


Due to a version mismatch between TensorFlow v2 and tensorflow_probability, we need to install a newer version of tensorflow_probability (see below).

In [3]:
!pip install tensorflow_probability==0.13.rc0 --user --upgrade


Collecting tensorflow_probability==0.13.rc0
  Downloading tensorflow_probability-0.13.0rc0-py2.py3-none-any.whl (5.4 MB)
Collecting cloudpickle>=1.3
  Downloading cloudpickle-2.0.0-py3-none-any.whl (25 kB)
Installing collected packages: cloudpickle, tensorflow-probability
  Attempting uninstall: cloudpickle
    Found existing installation: cloudpickle 1.1.1
    Uninstalling cloudpickle-1.1.1:
      Successfully uninstalled cloudpickle-1.1.1
  Attempting uninstall: tensorflow-probability
    Found existing installation: tensorflow-probability 0.8.0rc0
    Uninstalling tensorflow-probability-0.8.0rc0:
      Successfully uninstalled tensorflow-probability-0.8.0rc0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
gym 0.17.3 requires cloudpickle<1.7.0,>=1.2.0, but you have cloudpickle 2.0.0 which is inc

In [1]:
import tensorflow_probability as tfp  # We are using a different module from tensorflow this time which deals with probabilities
import tensorflow as tf

### Weather Model


Taken directly from the TensorFlow documentation (https://www.tensorflow.org/probability/api_docs/python/tfp/distributions/HiddenMarkovModel). 

We will model a simple weather system and try to predict the temperature on each day given the following information.
1. Cold days are encoded by a 0 and hot days are encoded by a 1.
2. The first day in our sequence has an 80% chance of being cold.
3. A cold day has a 30% chance of being followed by a hot day.
4. A hot day has a 20% chance of being followed by a cold day.
5. On each day the temperature is normally distributed with mean and standard deviation 0 and 5 on a cold day and mean and standard deviation 15 and 10 on a hot day.

If you're unfamiliar with **standard deviation**, it measures how spread out values are around the mean. For a normal distribution, roughly 68% of values fall within one standard deviation of the mean.

In this example, on a hot day the average temperature is 15 and most days fall between 5 and 25.
Note: the range comes from subtracting and adding one standard deviation (10) to the mean (15).
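The hot-day range can be checked with Python's standard library. This is a small sketch using `statistics.NormalDist` with the hot-day mean of 15 and standard deviation of 10; the variable names are chosen for the example.

```python
from statistics import NormalDist

hot_day = NormalDist(mu=15, sigma=10)

# Probability that a hot day's temperature lands within one standard deviation (5 to 25)
within_one = hot_day.cdf(25) - hot_day.cdf(5)
# ... and within two standard deviations (-5 to 35)
within_two = hot_day.cdf(35) - hot_day.cdf(-5)

print(round(within_one, 4), round(within_two, 4))
```

So the 5 to 25 range covers about two thirds of hot days, not all of them.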

To model this in TensorFlow we will do the following.


In [2]:
tfd = tfp.distributions  # making a shortcut for later on
initial_distribution = tfd.Categorical(probs=[0.2, 0.8])  # refer to point 2 above
# probabilities for the first day, indexed by state: [cold (0), hot (1)]
# NOTE: as written this gives the first day a 20% chance of being cold; point 2 (80% cold) corresponds to probs=[0.8, 0.2]
transition_distribution = tfd.Categorical(probs=[[0.7, 0.3],
                                                 [0.2, 0.8]])  # refer to points 3 and 4 above
# 1st list: day after a cold day [cold %, hot %]; 2nd list: day after a hot day [cold %, hot %]
observation_distribution = tfd.Normal(loc=[0., 15.], scale=[5., 10.])  # refer to point 5 above

# the loc argument represents the mean and the scale is the standard deviation

We've now created distribution variables to model our system and it's time to create the hidden markov model.

In [3]:
model = tfd.HiddenMarkovModel(
    initial_distribution=initial_distribution,
    transition_distribution=transition_distribution,
    observation_distribution=observation_distribution,
    num_steps=7)

The **number of steps** represents the number of days that we would like to predict information for. In this case we've chosen 7, an entire week.

To get the expected temperatures on each day we can do the following.

In [4]:
mean = model.mean()

# TensorFlow 2.x executes eagerly, so the tensor already holds its value
# and no session is needed to read it
print(mean.numpy())

[11.999999 10.500001  9.75      9.375     9.1875    9.09375   9.046875]
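These expected temperatures can be reproduced by hand. The sketch below is plain Python, not the TensorFlow Probability implementation: it pushes the state probabilities forward one day at a time and averages the state means, using the same numbers as the distributions defined in the code above. The function name is made up for the example.

```python
initial = [0.2, 0.8]          # P(cold), P(hot) on day 1, as passed to tfd.Categorical
transition = [[0.7, 0.3],     # after a cold day: P(cold), P(hot)
              [0.2, 0.8]]     # after a hot day:  P(cold), P(hot)
state_means = [0.0, 15.0]     # mean temperature on a cold / hot day

def expected_temperatures(initial, transition, state_means, num_steps):
    probs = list(initial)
    expected = []
    for _ in range(num_steps):
        # Expected temperature = sum over states of P(state) * mean(state)
        expected.append(sum(p * m for p, m in zip(probs, state_means)))
        # Move the state probabilities forward by one day
        probs = [sum(probs[i] * transition[i][j] for i in range(len(probs)))
                 for j in range(len(probs))]
    return expected

temps = expected_temperatures(initial, transition, state_means, num_steps=7)
print([round(t, 6) for t in temps])
```

The values settle towards 9 because the chain approaches a long-run split of 40% cold days and 60% hot days, and 0.6 × 15 = 9.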
