# Working with Datasets and CSV files

### The linear regression with ```dataset```
This section revisits the week-two lab of the Coursera course Introduction to TensorFlow. It is a useful reminder of how versatile the data API is. Start by practicing with the Dataset type. At the beginning of the course you used low-level TensorFlow to obtain the weights of a line equation:

$y = w_0 x + w_1$

You are going to repeat the process with one small difference: the training loop will read its data from a ```dataset```. Here are the instructions:

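1. Create X and Y vectors with data.
2. Create a **Dataset** from these tensors with a batch size of 3, so that each batch has shape (3,).
3. Verify the result by printing it. The output should have the following form (the sample values below correspond to a line with slope 2 and intercept 10; the code that follows uses different constants):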

```
x: [0. 1. 2.] y: [10. 12. 14.]
x: [3. 4. 5.] y: [16. 18. 20.]
x: [6. 7. 8.] y: [22. 24. 26.]
```

In [8]:
# Begin by importing TensorFlow
import tensorflow as tf
import numpy as np
# Create some tensors of shape (N_DATAPOINTS,)
N_DATAPOINTS = 10 
# Define constants of the line m and b (w0, w1)
m = 4
b = 2

# These are tensor types
X = tf.constant(range(N_DATAPOINTS), dtype = tf.float32)
Y = m*X + b 
print(X, Y)


tf.Tensor([0. 1. 2. 3. 4. 5. 6. 7. 8. 9.], shape=(10,), dtype=float32) tf.Tensor([ 2.  6. 10. 14. 18. 22. 26. 30. 34. 38.], shape=(10,), dtype=float32)


In [21]:
# Now convert these into datasets
xy_ds = tf.data.Dataset.from_tensor_slices((X,Y))
# Check what type this dataset has
print(f'xy_ds is a type TensorSliceDataset:{type(xy_ds)}')

# Each item of the dataset is a tuple
print('Each element of the dataset is a tuple (X, Y)')
for item in xy_ds.take(1):
    print(type(item))

# Let's visualize it
for (feature, label) in xy_ds:
    print(f'feature: {feature} label: {label}')

xy_ds is a type TensorSliceDataset:<class 'tensorflow.python.data.ops.dataset_ops.TensorSliceDataset'>
Each element of the dataset is a tuple (X, Y)
<class 'tuple'>
feature: 0.0 label: 2.0
feature: 1.0 label: 6.0
feature: 2.0 label: 10.0
feature: 3.0 label: 14.0
feature: 4.0 label: 18.0
feature: 5.0 label: 22.0
feature: 6.0 label: 26.0
feature: 7.0 label: 30.0
feature: 8.0 label: 34.0
feature: 9.0 label: 38.0


In [22]:
# Repeat the dataset twice, then group it into batches of 3
xy_ds = xy_ds.repeat(2).batch(3)
for (feature, label) in xy_ds:
    print(f'feature: {feature} label: {label}')

feature: [0. 1. 2.] label: [ 2.  6. 10.]
feature: [3. 4. 5.] label: [14. 18. 22.]
feature: [6. 7. 8.] label: [26. 30. 34.]
feature: [9. 0. 1.] label: [38.  2.  6.]
feature: [2. 3. 4.] label: [10. 14. 18.]
feature: [5. 6. 7.] label: [22. 26. 30.]
feature: [8. 9.] label: [34. 38.]
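
The batch `[9. 0. 1.]` above mixes the end of the first pass with the start of the second, because `repeat` runs before `batch`. The plain-Python sketch below models the two possible orderings with `itertools`. It only illustrates the semantics and is not how `tf.data` is implemented; the helpers `repeat` and `batch` are hypothetical stand-ins for the `Dataset` methods of the same name.

```python
from itertools import chain, islice


def repeat(items, count):
    """Model of Dataset.repeat: replay the whole sequence `count` times."""
    return list(chain.from_iterable([list(items)] * count))


def batch(items, batch_size):
    """Model of Dataset.batch: group consecutive elements, keeping a short final batch."""
    iterator = iter(items)
    batches = []
    while True:
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            break
        batches.append(chunk)
    return batches


data = list(range(10))

# repeat -> batch: a batch may straddle the boundary between two passes.
repeat_then_batch = batch(repeat(data, 2), 3)

# batch -> repeat: each pass ends with its own short batch, so passes never mix.
batch_then_repeat = repeat(batch(data, 3), 2)

print(repeat_then_batch)
print(batch_then_repeat)
```

With the real API, the second ordering would be written as `tf.data.Dataset.from_tensor_slices((X, Y)).batch(3).repeat(2)`.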


In [27]:
# Let's wrap this in a function
def create_dataset(X, Y, epochs, batch_size):
    # drop_remainder=True discards a final batch with fewer than batch_size elements
    dataset = (tf.data.Dataset.from_tensor_slices((X, Y))
               .repeat(epochs)
               .batch(batch_size, drop_remainder=True))
    return dataset

EPOCHS = 2
BATCH_SIZE=3
dataset = create_dataset(X,Y, EPOCHS, BATCH_SIZE)
for (x,y) in dataset:
    print(f'x:{x} y:{y}')

x:[0. 1. 2.] y:[ 2.  6. 10.]
x:[3. 4. 5.] y:[14. 18. 22.]
x:[6. 7. 8.] y:[26. 30. 34.]
x:[9. 0. 1.] y:[38.  2.  6.]
x:[2. 3. 4.] y:[10. 14. 18.]
x:[5. 6. 7.] y:[22. 26. 30.]
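
Compared with the previous output, the final `[8. 9.]` batch is gone: `drop_remainder=True` discards a last batch that has fewer than `batch_size` elements. The arithmetic can be sketched in plain Python; the function name `count_batches` is made up for this illustration.

```python
def count_batches(n_datapoints, epochs, batch_size, drop_remainder):
    """Number of batches produced by repeat(epochs) followed by batch(batch_size)."""
    total = n_datapoints * epochs
    full, leftover = divmod(total, batch_size)
    if drop_remainder or leftover == 0:
        return full
    return full + 1


# 10 data points repeated twice give 20 elements: 6 full batches of 3, plus 2 left over.
print(count_batches(10, 2, 3, drop_remainder=False))  # 7
print(count_batches(10, 2, 3, drop_remainder=True))   # 6
```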


# Now let's define the functions for loss and gradient computation
$MSE = \frac{1}{n}\sum_{i=1}^{n}(\bar{y}_i - y_i)^2$

This is the Mean Squared Error, where $\bar{y}_i$ is our prediction and $y_i$ is the real value for the $i$-th of the $n$ points in a batch.
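
For reference, the gradients that `tf.GradientTape` computes automatically in the code that follows can also be derived by hand for this model. Writing the prediction for the $i$-th point as $\bar{y}_i = w_0 x_i + w_1$ and applying the chain rule to the MSE gives:

$\frac{\partial MSE}{\partial w_0} = \frac{2}{n}\sum_{i=1}^{n}(\bar{y}_i - y_i)\,x_i$

$\frac{\partial MSE}{\partial w_1} = \frac{2}{n}\sum_{i=1}^{n}(\bar{y}_i - y_i)$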

In [35]:
# Mean squared error
def mse(X, Y, w0, w1):
    y_hat = X*w0 + w1
    loss = (y_hat - Y)**2
    return tf.reduce_mean(loss)

# Compute the gradients of the loss with respect to w0 and w1
def compute_gradients(X, Y, w0, w1):
    with tf.GradientTape() as tape:
        loss = mse(X, Y, w0, w1)
    # Only the forward pass needs to be recorded; the gradients are
    # requested after the tape context closes
    return tape.gradient(loss, [w0, w1])

# Training Loop

In [36]:
EPOCHS = 250
BATCH_SIZE = 2
LEARNING_RATE = 0.02
MSG = "STEP {step}, loss:{loss}, w0:{w0}, w1:{w1}"
# Now create the dataset
training_ds = create_dataset(X, Y, EPOCHS, BATCH_SIZE)
# Initialize weights
w0 = tf.Variable(0, dtype=tf.float32)
w1 = tf.Variable(0, dtype=tf.float32)
# Now iterate through the dataset
for step, (x,y) in enumerate(training_ds):
    dw0, dw1 = compute_gradients(x,y,w0,w1)
    w0.assign_sub(dw0*LEARNING_RATE)
    w1.assign_sub(dw1*LEARNING_RATE)

    if step % 100 == 0:
        loss = mse(x,y, w0, w1)
        print(MSG.format(step=step, loss=loss, w0=w0, w1=w1))


STEP 0, loss:18.051998138427734, w0:<tf.Variable 'Variable:0' shape=() dtype=float32, numpy=0.11999999>, w1:<tf.Variable 'Variable:0' shape=() dtype=float32, numpy=0.15999998>
STEP 100, loss:0.05661267787218094, w0:<tf.Variable 'Variable:0' shape=() dtype=float32, numpy=4.0432763>, w1:<tf.Variable 'Variable:0' shape=() dtype=float32, numpy=1.7414136>
STEP 200, loss:0.008423236198723316, w0:<tf.Variable 'Variable:0' shape=() dtype=float32, numpy=4.016694>, w1:<tf.Variable 'Variable:0' shape=() dtype=float32, numpy=1.9002551>
STEP 300, loss:0.001253330148756504, w0:<tf.Variable 'Variable:0' shape=() dtype=float32, numpy=4.0064397>, w1:<tf.Variable 'Variable:0' shape=() dtype=float32, numpy=1.9615245>
STEP 400, loss:0.00018649132107384503, w0:<tf.Variable 'Variable:0' shape=() dtype=float32, numpy=4.002483>, w1:<tf.Variable 'Variable:0' shape=() dtype=float32, numpy=1.9851589>
STEP 500, loss:2.775211032712832e-05, w0:<tf.Variable 'Variable:0' shape=() dtype=float32, numpy=4.0009575>, w1:<
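
The first log line can be checked by hand. The sketch below, in plain Python with no TensorFlow, takes the first batch (`x = [0, 1]`, `y = [2, 6]`), applies the analytic MSE gradients for a line, and performs one update with the learning rate of 0.02. The helper names `mse_plain` and `analytic_gradients` are invented for this illustration.

```python
# First batch produced by the dataset when BATCH_SIZE = 2 (y = 4x + 2).
x_batch = [0.0, 1.0]
y_batch = [2.0, 6.0]
learning_rate = 0.02


def mse_plain(xs, ys, w0, w1):
    """Mean squared error of the line w0 * x + w1 on one batch."""
    return sum((w0 * x + w1 - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def analytic_gradients(xs, ys, w0, w1):
    """Hand-derived gradients of the MSE with respect to w0 and w1."""
    n = len(xs)
    errors = [w0 * x + w1 - y for x, y in zip(xs, ys)]
    dw0 = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
    dw1 = 2.0 / n * sum(errors)
    return dw0, dw1


# One gradient-descent step starting from w0 = w1 = 0.
w0, w1 = 0.0, 0.0
dw0, dw1 = analytic_gradients(x_batch, y_batch, w0, w1)
w0 -= learning_rate * dw0
w1 -= learning_rate * dw1

# The notebook reports the loss after the update, so do the same here.
loss = mse_plain(x_batch, y_batch, w0, w1)

print(f"w0={w0:.2f} w1={w1:.2f} loss={loss:.3f}")
```

This prints `w0=0.12 w1=0.16 loss=18.052`, which agrees with the `STEP 0` line up to float32 rounding. With 10 data points, 250 epochs and a batch size of 2, the loop runs 10 × 250 / 2 = 1250 steps in total.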

In [37]:
# Compare
print(f'w0 = {w0} m={m}')
print(f'w1 = {w1} b={b}')

w0 = <tf.Variable 'Variable:0' shape=() dtype=float32, numpy=4.000002> m=4
w1 = <tf.Variable 'Variable:0' shape=() dtype=float32, numpy=1.999994> b=2


# Working with CSV files and TensorFlow Datasets

In [55]:
import csv
filepath_train = "../../datasets/Taxi/taxi-train.csv"
filepath_valid = "../../datasets/Taxi/taxi-valid.csv"
filepath_test = "../../datasets/Taxi/taxi-test.csv"

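def print_first_line_of_csv(filepath):
    try:
        with open(filepath, newline='') as file:
            csv_file = csv.reader(file)
            print(next(csv_file))
    except (OSError, StopIteration):
        # OSError: the file is missing or unreadable; StopIteration: the file is empty
        print("Not able to open csv")

# Note that these CSV files do not have a header row with column titles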
print_first_line_of_csv(filepath_train)
print_first_line_of_csv(filepath_valid)
print_first_line_of_csv(filepath_test)

# Let's define the column titles and default values:
# Define the feature names in the list `CSV_COLUMNS`
CSV_COLUMNS = [
    'fare_amount',
    'pickup_datetime',
    'pickup_longitude',
    'pickup_latitude',
    'dropoff_longitude',
    'dropoff_latitude',
    'passenger_count',
    'key'
]
LABEL_COLUMN = 'fare_amount'
# Defining the default values into a list `DEFAULTS`
DEFAULTS = [[0.0], ['na'], [0.0], [0.0], [0.0], [0.0], [0.0], ['na']]

['11.3', '2011-01-28 20:42:59 UTC', '-73.999022', '40.739146', '-73.990369', '40.717866', '1', '0']
['5.3', '2012-01-03 19:21:35 UTC', '-73.962627', '40.763214', '-73.973485', '40.753353', '1', '0']
['6.0', '2013-03-27 03:35:00 UTC', '-73.977672', '40.784052', '-73.965332', '40.801025', '2', '0']
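
Each entry in `DEFAULTS` appears to play two roles: its type tells the CSV reader how to parse the column, and its value stands in for an empty field. The sketch below models that idea in plain Python with the `csv` module. It is an approximation of the behaviour, not the TensorFlow implementation; `parse_row` is a hypothetical helper, and the second sample row (with an empty `passenger_count`) is made up for the illustration.

```python
import csv
import io

CSV_COLUMNS = [
    'fare_amount', 'pickup_datetime', 'pickup_longitude', 'pickup_latitude',
    'dropoff_longitude', 'dropoff_latitude', 'passenger_count', 'key',
]
DEFAULTS = [[0.0], ['na'], [0.0], [0.0], [0.0], [0.0], [0.0], ['na']]


def parse_row(fields):
    """Name each field and cast it to the type of its default; empty fields get the default."""
    row = {}
    for name, (default,), text in zip(CSV_COLUMNS, DEFAULTS, fields):
        if text == "":
            row[name] = default
        else:
            row[name] = type(default)(text)
    return row


# Row 1 is the first line of taxi-train.csv printed above.
# Row 2 is the same line with passenger_count left empty (invented for this sketch).
sample = io.StringIO(
    "11.3,2011-01-28 20:42:59 UTC,-73.999022,40.739146,-73.990369,40.717866,1,0\n"
    "11.3,2011-01-28 20:42:59 UTC,-73.999022,40.739146,-73.990369,40.717866,,0\n"
)
rows = [parse_row(fields) for fields in csv.reader(sample)]

for row in rows:
    print(row)
```

In the first row `passenger_count` becomes the float `1.0` because its default is `0.0`, while `key` stays the string `'0'` because its default is `'na'`. In the second row the empty field falls back to `0.0`.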


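Now we can load this data into a dataset. Note that `make_csv_dataset` shuffles the rows and repeats the data indefinitely by default, which is why the code below uses `take(1)` and why the row it prints can differ from the first line of the file.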

In [63]:
def create_dataset_from_csv(pattern):
    return tf.data.experimental.make_csv_dataset(pattern, batch_size=1, column_names=CSV_COLUMNS, column_defaults=DEFAULTS)

taxi_csv_ds = create_dataset_from_csv("../../datasets/Taxi/taxi-train.csv")
print(f'type{type(taxi_csv_ds)}')

print("---------------dataset is made of a OrderedDict")
# print each dataset item
for item in taxi_csv_ds.take(1):
    print(item)

# Make it more readable by converting each tensor to NumPy
print("---------------Using numpy make it more readable")
for data in taxi_csv_ds.take(1):
    row = {k: v.numpy() for (k, v) in data.items()}  # avoid shadowing the built-in `dict`
    print(row)


type<class 'tensorflow.python.data.ops.dataset_ops.PrefetchDataset'>
---------------dataset is made of a OrderedDict
OrderedDict([('fare_amount', <tf.Tensor: shape=(1,), dtype=float32, numpy=array([6.9], dtype=float32)>), ('pickup_datetime', <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'2011-03-19 03:32:00 UTC'], dtype=object)>), ('pickup_longitude', <tf.Tensor: shape=(1,), dtype=float32, numpy=array([-74.00206], dtype=float32)>), ('pickup_latitude', <tf.Tensor: shape=(1,), dtype=float32, numpy=array([40.73046], dtype=float32)>), ('dropoff_longitude', <tf.Tensor: shape=(1,), dtype=float32, numpy=array([-73.98335], dtype=float32)>), ('dropoff_latitude', <tf.Tensor: shape=(1,), dtype=float32, numpy=array([40.756454], dtype=float32)>), ('passenger_count', <tf.Tensor: shape=(1,), dtype=float32, numpy=array([2.], dtype=float32)>), ('key', <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'2078'], dtype=object)>)])
---------------Using numpy make it more readable
{'fare_amount': arr
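
`LABEL_COLUMN` is defined above but not yet used. One plausible next step, which this section does not show, is to split each element into features and a label. The sketch below does this on a plain dict that stands in for one dataset element, with values copied from the printed output; `features_and_label` is a hypothetical helper name.

```python
LABEL_COLUMN = "fare_amount"


def features_and_label(row):
    """Return (features, label) without mutating the input row."""
    features = dict(row)  # shallow copy, so the caller's row is left untouched
    label = features.pop(LABEL_COLUMN)
    return features, label


# A plain-dict stand-in for one dataset element (a subset of the columns).
row = {"fare_amount": 6.9, "passenger_count": 2.0, "key": "2078"}
features, label = features_and_label(row)

print(features)
print(label)
```

With `tf.data`, a function of this kind is usually applied to every element through `Dataset.map`.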