# Week 1 - Data Preparation and AWS Introduction

In this lecture we'll get started with some data preparation exercises and introduce fundamental AWS concepts.

**Table of Contents**

-   [Data Preparation](#data-preparation)
    -   [Pandas DataFrame](#pandas-dataframe)
        -   [Creation](#df-creation)
            -   [pandas.DataFrame Constructor](#pandasdataframe-constructor)
            -   [IO Tools](#df-io-tools)
        -   [Exploration and Transformation](#df-exploration-and-transformation)
    -   [R data.frame](#r-dataframe)
    -   [NumPy ndarray](#numpy-array)
        -   [Creation](#ndarray-creation)
            -   [numpy.array Constructor](#nparray-constructor)
            -   [numpy.genfromtxt](#genfromtxt)
        -   [Exploration and Transformation](#np-exploration-and-transformation)
-   [Amazon Web Services](#amazon-web-services)
    -   [Virtual Private Cloud Components](#virtual-private-cloud)

<a id='data-preparation'></a>
## Data Preparation

Many machine learning tools operate on 2-dimensional, in-memory data structures. As such, there is often a non-trivial 
task of transforming data from its source representation to a representation suitable for machine-learning algorithms.

To this end, we are going to cover some of the more popular data structures and transformation techniques. 

<a id='pandas-dataframe'></a>
### Pandas DataFrame

A Pandas DataFrame is a data structure that can hold heterogeneous types of data. The official 
documentation for Pandas DataFrames can be found 
[here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html).

We can import the `pandas` package as follows:

In [None]:
import pandas as pd

<a id='df-creation'></a>
#### Creation

There are numerous ways to construct a Pandas DataFrame. We cover a couple below.

<a id='pandasdataframe-constructor'></a>
##### pandas.DataFrame Constructor

In [None]:
d = {'Height': [60, 77, 67], 'Weight': [100, 200, 150], 'Eye Color': ['Brown', 'Green', 'Blue']}
df = pd.DataFrame(data=d)
df

If we want to know the data type associated with each of the columns in our DataFrame, we can look at the 
`dtypes` attribute.

In [None]:
df.dtypes

Pandas supports NumPy data types. In short, NumPy supports `float`, `int`, `bool`, `timedelta64` and `datetime64` types. 
More information on NumPy data types can be found 
[here](https://docs.scipy.org/doc/numpy/reference/arrays.scalars.html). In addition, Pandas has extended the NumPy type 
system to include additional data types. For example, Pandas supports the `CategoricalDtype` and `DatetimeTZDtype` 
types. Some of these extended data types are useful for various data transformation tasks. More information on Pandas 
data types can be found  [here](https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html#dtypes).  
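As a small illustration of the extended types, the sketch below converts a string column to the categorical dtype with `astype('category')`. It reuses the toy data from the constructor example; the variable name `people` is only for illustration.

```python
import pandas as pd

# Same toy data as in the constructor example.
d = {'Height': [60, 77, 67], 'Weight': [100, 200, 150], 'Eye Color': ['Brown', 'Green', 'Blue']}
people = pd.DataFrame(data=d)

# Strings are stored with dtype `object` by default; convert the column
# to the pandas categorical extension type.
people['Eye Color'] = people['Eye Color'].astype('category')

print(people.dtypes)
# The distinct values are tracked as the categories of the column.
print(people['Eye Color'].cat.categories)
```

A categorical column stores each distinct value once and refers to it by code, which can save memory and makes the set of allowed values explicit.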

The DataFrame displayed above has an index for each row. In this case, the first row has an index of 0, and the second
row has an index of 1, and so on. By default, when a DataFrame is created the indexes range from 0 to n-1, where n is
the number of rows. We can explicitly define the `index` during DataFrame creation as well. For example,

In [None]:
df = pd.DataFrame(data=d, index=['A', 'B', 'C'])
df

We can achieve the same result with:

In [None]:
df = pd.DataFrame(data=[[60, 100, 'Brown'], [77, 200, 'Green'], [67, 150, 'Blue']], index=['A', 'B', 'C'], columns=['Height', 'Weight', 'Eye Color'])
df

Notice that we used the `columns` argument to explicitly define the column names of the DataFrame.

<a id='df-io-tools'></a>
##### IO Tools

We don't have to use the `pandas.DataFrame` constructor in order to create DataFrames. Often, our data set will exist
in some structured form in persistent storage, such as a file system or database.

Pandas has a bunch of IO *connectors*, including 
[CSV, JSON, SQL, and Google Big Query](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html).

When your data is persisted as a flat CSV file, you can use the `pandas.read_csv()` function to create a DataFrame. This 
function has over 50 possible parameters, which is to be expected: `pandas.read_csv()` has the unenviable task of reading
data that can come in all shapes and sizes. Refer to the 
[documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html#pandas.read_csv) to 
learn about these parameters.

Let's say we have a CSV called `iris.csv` in our working directory, which looks like this (first 10 lines, including the header):

```
"Sepal.Length","Sepal.Width","Petal.Length","Petal.Width","Species"
5.1,3.5,1.4,0.2,"setosa"
4.9,3,1.4,0.2,"setosa"
4.7,3.2,1.3,0.2,"setosa"
4.6,3.1,1.5,0.2,"setosa"
5,3.6,1.4,0.2,"setosa"
5.4,3.9,1.7,0.4,"virginica"
4.6,3.4,1.4,0.3,"virginica"
5,3.4,1.5,0.2,"setosa"
4.4,2.9,1.4,0.2,"versicolor"
```

We can read this CSV into a DataFrame object with:

In [None]:
import numpy as np
df = pd.read_csv(filepath_or_buffer="iris.csv", header=0, sep=',', dtype={'Sepal.Length': np.float64, 'Sepal.Width': np.float64, 'Petal.Length': np.float64, 'Petal.Width': np.float64, 'Species': 'category'})
df.head()

In [None]:
df.dtypes

In [None]:
df.shape

We've told `pandas.read_csv()` that there is a header in the first row by setting `header=0`. If there were no header, 
we could set this parameter to `None` and specify the column names ourselves with the `names` parameter. Also, notice 
that we explicitly specified the column types with `dtype`. Here's what our DataFrame would look like if we let the
reader figure this stuff out by itself:

In [None]:
df = pd.read_csv(filepath_or_buffer="iris.csv")
df.head()

In [None]:
df.dtypes

In [None]:
df.shape

Looks like it did a pretty good job parsing the file to our specifications, except it assumed the `Species` column is of dtype `object`, which may not be desirable. Not all data is as clean as this example, so you might need to give the parser a little bit of guidance.
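For the headerless case mentioned earlier, here is a minimal sketch of `header=None` combined with `names`. The CSV text is a hypothetical two-row stand-in read from an in-memory buffer rather than a real file, and `no_header` is just an illustrative name.

```python
import io
import pandas as pd

# Hypothetical headerless CSV content standing in for a file on disk.
csv_text = "5.1,3.5,1.4,0.2,setosa\n4.9,3.0,1.4,0.2,setosa\n"

cols = ['Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width', 'Species']

# header=None says the first line is data; names supplies the column labels.
no_header = pd.read_csv(
    io.StringIO(csv_text),
    header=None,
    names=cols,
    dtype={'Species': 'category'},
)

print(no_header)
print(no_header.dtypes)
```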

<a id='df-exploration-and-transformation'></a>
#### Exploration and Transformation

Once we've created a DataFrame and before we're ready to create a model, we might need to get more familiar with the 
data, or transform the data into a form suitable for modeling. 

DataFrames possess a number of attributes that can give us a better feel for the data.

The `shape` attribute gives the dimensions of the DataFrame.

In [None]:
df.shape

The iris DataFrame has 150 rows and 5 columns.

The `columns` attribute gives the column labels of the DataFrame.

In [None]:
df.columns

Use the `describe()` method to compute some basic summary statistics on the data.

In [None]:
df.describe()

To view the first/last 5 rows of the data use the `head()`/`tail()` methods.

In [None]:
df.head()

In [None]:
df.tail()

`matplotlib` is a common package for data visualization. We can import it with:

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline 

Let's plot the histogram of the columns.

In [None]:
df.plot.hist(alpha=0.5)

Here are some box plots:

In [None]:
df.plot.box()

`seaborn` is another popular plotting and data visualization package. You can find more information about seaborn [here](https://seaborn.pydata.org/index.html).

In [None]:
import seaborn as sns

We can generate a pairs plot from our DataFrame as follows:

In [None]:
sns.pairplot(df)

Sometimes it's useful to take a `subset` of a DataFrame. This is often referred to as `slicing`. We can slice a DataFrame row-wise, column-wise, or both.

A DataFrame has a couple of `axes`: a row axis and a column axis. Each axis has an `index` associated with it, which can be used to refer to a particular row or column. Observe the `label` indexes of the Iris DataFrame with: 

In [None]:
df.axes

The first list element is the row `label` index associated with the row axis, and the second list element is the column `label` index associated with the column axis. The row `label` index is simply a range of integers from 0 to 149. You'll notice that the column `label` index is a list of names.

In [None]:
df.axes[0].values

In [None]:
df.axes[1].values

`.loc[]` is the slicing operator for `label` indexes. `.iloc[]` is another common slicing operator, but it operates on `position` indexes, which we will introduce soon. 

In general, slice operations take the following form `df.loc[row_indexer,column_indexer]`. If the `column_indexer` is not specified, then it is assumed to be the `null` slice, or `:`, which means "all columns." So, we could ask for the first row with:

In [None]:
df.loc[0]

We can ask for the element in the first row and first column with:

In [None]:
df.loc[0,'Sepal.Length']

You might be tempted to think that you can retrieve the same value with `df.loc[0,0]`, but this raises a `KeyError`. Remember that `.loc[]` operates on `label` indexes, not `position` indexes. Refer back to `df.axes` to see the acceptable label index values for `.loc[]`. For the column axis of the Iris DataFrame, the acceptable values are `'Sepal.Length'`, `'Sepal.Width'`, `'Petal.Length'`, `'Petal.Width'`, and `'Species'`. `0` is not an acceptable column label, even though it happens to be a valid row label here. 

The `.iloc[]` operator, on the other hand, operates on `position` indexes. The position indexes are implicitly defined for DataFrames and they range from `0` to `# of rows - 1` for the row index, and from `0` to `# of columns - 1` for the column index.

Grab the element in the first row and first column with:

In [None]:
df.iloc[0,0]

`df.iloc[0,'Sepal.Length']` will fail because `'Sepal.Length'` is not an acceptable `position` index value.

A common use case of DataFrame slicing is to split a dataset into training and test sets, which consist of observations randomly drawn from the original dataset, without replacement. 

One way to do this is to decide approximately what percentage of the original data the training and test sets will each contain. Let's say that we want our training set to have about 75% of the original data and the test set 25%. We can first generate an array of 150 numbers between 0 and 1 (uniformly distributed) with: 

In [None]:
rand_arr = np.random.rand(len(df))
rand_arr

In [None]:
rand_arr.dtype

Because the values are uniformly distributed, we know that approximately 75% of them will be greater than 0.25.

In [None]:
rand_arr_bool = rand_arr > 0.25
rand_arr_bool

In [None]:
rand_arr_bool.dtype

In [None]:
sum(rand_arr_bool) / len(rand_arr_bool)

Next, we can slice the original DataFrame based on this boolean NumPy array.

In [None]:
train = df.iloc[rand_arr_bool,:]
train.shape

In [None]:
test = df.loc[~rand_arr_bool,:]
test.shape

The `.loc[]` or `.iloc[]` slicing operators work equally well here. Refer to the [documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#different-choices-for-indexing) to understand all of the allowable inputs to these operators. We used a boolean array in the above sampling example, but other inputs are possible.
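An alternative to the boolean-mask approach is `DataFrame.sample()`, which draws rows without replacement by default. The sketch below uses a hypothetical 150-row DataFrame (`data`) as a stand-in for the Iris data; the column names and the seed values are arbitrary.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the iris DataFrame: 150 rows, 2 columns.
rng = np.random.default_rng(0)
data = pd.DataFrame({'x': rng.random(150), 'y': rng.integers(0, 3, 150)})

# Draw 75% of the rows at random, without replacement.
train = data.sample(frac=0.75, random_state=42)

# The test set is everything that was not drawn, selected by row label.
test = data.drop(train.index)

print(train.shape, test.shape)
```

Unlike the mask approach, this gives a split whose sizes are fixed by `frac` rather than only approximately 75/25, and `random_state` makes the split reproducible.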

Sometimes `normalizing` your dataset may lead to superior models and/or faster training times. One common normalization technique is `standardization`. For each element, you subtract the mean and divide by the standard deviation of the column to which the element belongs.

We can compute the mean of a column with the `mean()` function.

In [None]:
df.iloc[:,0].mean()

`mean()` can also operate on the entire DataFrame.

In [None]:
col_means = df.mean(numeric_only=True)  # recent pandas versions raise on the non-numeric Species column otherwise
col_means

In [None]:
type(col_means)

In [None]:
col_means.shape

Notice that the `Species` column, which is categorical, is left off.

We can compute the standard deviation in the same manner.

In [None]:
col_s = df.std(numeric_only=True)  # skip the non-numeric Species column
col_s

In [None]:
type(df)

In [None]:
type(col_means)

We can subtract the column mean from each element in the DataFrame simply by using the `-` operator. Behind the scenes, Pandas has defined what it means to subtract a `pandas.core.series.Series` object from a `pandas.core.frame.DataFrame` object. Learn more about the semantics of these operations [here](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html#binary-operator-functions).

In [None]:
df_mean = df.iloc[:,0:4] - col_means
df_mean.head()

Here, the column means were subtracted element-wise from each row. Now, let's divide each row element-wise by the column standard deviations.

In [None]:
df_norm = df_mean / df.std(numeric_only=True)  # divide by the standard deviations of the numeric columns only
df_norm.head()

In [None]:
df_norm.mean()

In [None]:
df.head()

Finally, we need the labels attached to this normalized data. We can either update the original dataset with the new data, or simply add the label column to the new data.

In [None]:
df.iloc[:,0:4] = df_norm
df.head()

In [None]:
df_norm.head()

In [None]:
df_norm = df_norm.join(df.iloc[:,4])
df_norm.head()

Notice that we had to assign the result of `df_norm.join(df.iloc[:,4])` to `df_norm`. The `join()` operation returns a new DataFrame. It doesn't alter `df_norm` in place.
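The standardization steps can also be condensed into a single expression. The following is a minimal sketch on a hypothetical DataFrame (`toy`), using `select_dtypes` to pick out the numeric columns instead of slicing by position.

```python
import pandas as pd

# Hypothetical data: two numeric columns and one label column.
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [10.0, 20.0, 30.0, 40.0],
    'label': ['x', 'y', 'x', 'y'],
})

# Keep only the numeric columns, then subtract the column means and
# divide by the column standard deviations in one step.
num = toy.select_dtypes(include='number')
standardized = (num - num.mean()) / num.std()

# Reattach the label column; join() returns a new DataFrame.
standardized = standardized.join(toy['label'])

print(standardized)
```

Each standardized column should have a mean of (approximately) 0 and a standard deviation of 1. Note that `std()` in Pandas computes the sample standard deviation by default.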

Sometimes it's desirable to take the log of each value in a column. There are a [ton](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html) of Pandas operators for DataFrames, but we have to rely on the NumPy `log` function for this.

In [None]:
df = pd.read_csv(filepath_or_buffer="iris.csv", header=0, sep=',', dtype={'Sepal.Length': np.float64, 'Sepal.Width': np.float64, 'Petal.Length': np.float64, 'Petal.Width': np.float64, 'Species': 'category'})
df.iloc[:,0] = np.log(df.iloc[:,0])
df.head()

<a id='r-dataframe'></a>
### R data.frame


See `week1_R.ipynb`


<a id='numpy-array'></a>
### NumPy ndarray

The NumPy `ndarray` is another popular data structure. It is used to hold multidimensional, homogeneous data. `ndarray`s are homogeneous because they can only hold objects of the same [dtype](https://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html). 

<a id='ndarray-creation'></a>
#### Creation

There are numerous ways to construct a NumPy `ndarray`. We cover a couple below.

<a id='nparray-constructor'></a>
##### numpy.array Constructor

In [None]:
import numpy as np

In [None]:
arr = np.array([[1,2,3],[4,5,6]])
arr

In [None]:
type(arr)

The dimensions of a NumPy array are referred to as `axes`. The above array has 2 axes, for example. We can find this information by accessing the `ndim` attribute of the array.

In [None]:
arr.ndim

If we want to know the size of each of the axes, or dimensions, then we can access the `shape` attribute.

In [None]:
arr.shape

Here, we see that the first dimension has 2 elements, and the second dimension has 3 elements. We can think of this array as a table with 2 rows and 3 columns.

If we want to know the `dtype` of an array, then we can access that attribute as well.

In [None]:
arr.dtype
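To see what homogeneity means in practice, the sketch below builds arrays from mixed inputs. NumPy picks a single dtype that can represent every element; the variable names are only for illustration.

```python
import numpy as np

ints = np.array([1, 2, 3])        # all integers: an integer dtype
mixed = np.array([1, 2.5, 3])     # one float: everything becomes float64
strs = np.array([1, 'a'])         # one string: everything becomes a string

print(ints.dtype, mixed.dtype, strs.dtype)
print(mixed)
```

This is why a column of species names cannot live in the same `float` array as the measurements without changing the dtype of the whole array.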

<a id='genfromtxt'></a>
##### numpy.genfromtxt

In all likelihood, our data will reside on disk in CSV format. We can use the `genfromtxt()` function to read this data into an ndarray.

In [None]:
from numpy import genfromtxt
iris_arr = genfromtxt('iris.csv', delimiter=',')
iris_arr[:5]

What happened to the column headers and the `Species` classification?

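We left off the `dtype` argument to the `genfromtxt` function, so it used its default of `float`. Anything that can't be parsed as a float, such as the header strings and the species names, becomes `nan`. Specifying the dtype as `object` gives us the following array.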

In [None]:
iris_arr2 = genfromtxt('iris.csv', delimiter=',', dtype='object')
iris_arr2[:5]

We can also convert pandas `DataFrame`s into numpy `ndarray`s.

In [None]:
import pandas as pd
df = pd.read_csv(filepath_or_buffer="iris.csv", header=0, sep=',', dtype={'Sepal.Length': np.float64, 'Sepal.Width': np.float64, 'Petal.Length': np.float64, 'Petal.Width': np.float64, 'Species': 'category'})
iris_arr3 = df.values
iris_arr3[0:5]

<a id='np-exploration-and-transformation'></a>
#### Exploration and Transformation

Let's remove the first row of the original array, `iris_arr`, which holds the header that was read as `nan`. We can do this with the `delete()` function.

In [None]:
help(np.delete)

In [None]:
iris_arr = np.delete(iris_arr, (0), axis=0)
iris_arr[:5]

Now, let's delete the last column.

In [None]:
iris_arr = np.delete(iris_arr, (4), axis=1)
iris_arr[:5]

We've already seen how to slice an ndarray along the first (row) dimension with `iris_arr[:5]`. Let's slice along the second (column) dimension now too. Let's extract the first 5 rows of the first 2 columns.

In [None]:
a = iris_arr[:5,:2]
a

Arithmetic operators on arrays apply elementwise. A new array is created and filled with the result.

In [None]:
a**2

In [None]:
2*np.sin(a)

In [None]:
b = iris_arr[5:10,:2]
a+b
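Elementwise arithmetic also works between arrays of different but compatible shapes, which NumPy calls broadcasting. As a minimal sketch on hypothetical values, the column means (shape `(2,)`) are subtracted from every row of a `(3, 2)` array:

```python
import numpy as np

# Hypothetical measurements: 3 rows, 2 columns.
x = np.array([[5.1, 3.5],
              [4.9, 3.0],
              [4.7, 3.2]])

col_means = x.mean(axis=0)   # one mean per column, shape (2,)
centered = x - col_means     # col_means is applied to each row in turn

print(col_means)
print(centered)
```

This mirrors the DataFrame-minus-Series subtraction from the Pandas section, just without labels.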

There are many more operations that we can perform on NumPy arrays and we'll see a lot of them as we proceed through the course. [Here's](https://docs.scipy.org/doc/numpy/reference/) a link to the main NumPy reference.

<a id='amazon-web-services'></a>
## Amazon Web Services

The number of cloud services and providers seems to be constantly growing. [Microsoft Azure](https://azure.microsoft.com/en-us/), [Google Cloud](https://cloud.google.com/), and [Amazon Web Services](https://aws.amazon.com/) are just a few. Basic services include storage, databases, computing resources, and networking infrastructure. However, providers appear to be moving "up the stack," providing domain-specific software services, such as [IDEs](https://aws.amazon.com/cloud9/) and machine learning modeling tools. AWS [Sagemaker](https://aws.amazon.com/sagemaker/), [Azure Machine Learning](https://azure.microsoft.com/en-us/services/machine-learning-service/), and Google [AI Products](https://cloud.google.com/products/ai/) are a few cloud-based machine learning platforms.

While these services are not necessarily required to build machine learning pipelines and products, they might provide some advantages over developing custom solutions. In this course, we will cover some basic cloud concepts using AWS, which should help you get more comfortable building and interacting with cloud services.

<a id='virtual-private-cloud'></a>
### Virtual Private Cloud (VPC) Components

A [*VPC*](https://docs.aws.amazon.com/vpc/latest/userguide/what-is-amazon-vpc.html) is a logically isolated section of the AWS Cloud where you can launch AWS resources in a network that you define. You have complete control over the VPC, including the selection of your own IP address range, creation of subnets, and configuration of route tables and network gateways.

In [None]:
from IPython.display import Image
Image(filename='default-vpc-diagram.png') 

A VPC has a range of IP addresses associated with it, which you specify upon creation. The IP address range is specified using [CIDR](https://tools.ietf.org/html/rfc4632) block notation. For IPv4 addresses, the *network* portion of a 32-bit address is given by the prefix length after the `/` symbol. For example, in the CIDR block `21.39.16.0/20` the first 20 bits (`00010101.00100111.0001`) specify the network address, and the last 12 bits are zeros (`0000.00000000`). This leaves 4096 (`2^12`) values for end systems within this VPC.

Often, the VPC IP address space is further divided into *Subnets*. A subnet is simply a sub-range of IP addresses of the parent range. For example, CIDR block `21.39.24.0/21` (`00010101.00100111.00011000.00000000`) is a subnet of `21.39.16.0/20`, with 2048 (`2^11`) values for end systems within this subnet. 
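The CIDR arithmetic can be checked with Python's standard `ipaddress` module. This sketch only verifies the numbers for the two example blocks; it does not talk to AWS.

```python
import ipaddress

vpc = ipaddress.ip_network('21.39.16.0/20')
subnet = ipaddress.ip_network('21.39.24.0/21')

# Number of addresses in each block: 2^(32 - prefix length).
print(vpc.num_addresses)      # 4096
print(subnet.num_addresses)   # 2048

# Is the /21 block contained in the /20 block?
print(subnet.subnet_of(vpc))  # True

# The 32-bit network address in binary; the first 20 bits are the network portion.
bits = format(int(vpc.network_address), '032b')
print(bits[:20], bits[20:])
```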

When you create a VPC, you must specify an IPv4 CIDR block for the VPC. Blocks can have a netmask between `/16` and `/28`. Amazon recommends using CIDR blocks from the private IPv4 address ranges as specified in RFC 1918:
```
10.0.0.0 - 10.255.255.255 (10/8 prefix)
172.16.0.0 - 172.31.255.255 (172.16/12 prefix)
192.168.0.0 - 192.168.255.255 (192.168/16 prefix)
```
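As a rough sketch, both conditions (a netmask between `/16` and `/28`, and membership in one of the RFC 1918 ranges) can be checked locally with the `ipaddress` module. The helper name `check_vpc_cidr` is hypothetical and is not part of any AWS tool.

```python
import ipaddress

# The three private ranges listed in RFC 1918.
RFC1918 = [
    ipaddress.ip_network(cidr)
    for cidr in ('10.0.0.0/8', '172.16.0.0/12', '192.168.0.0/16')
]

def check_vpc_cidr(cidr):
    """Return (size_ok, is_rfc1918) for a candidate VPC CIDR block."""
    net = ipaddress.ip_network(cidr)
    size_ok = 16 <= net.prefixlen <= 28
    is_rfc1918 = any(net.subnet_of(private) for private in RFC1918)
    return size_ok, is_rfc1918

print(check_vpc_cidr('172.31.0.0/16'))   # (True, True)
print(check_vpc_cidr('21.39.16.0/20'))   # (True, False)
print(check_vpc_cidr('10.0.0.0/8'))      # (False, True)
```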

A VPC exists within an AWS *Region*, which corresponds to a geographical area. Not all regions support all AWS services! Check this [chart](https://aws.amazon.com/about-aws/global-infrastructure/regional-product-services/). Regions are further split into [*Availability Zones*](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html). A VPC spans all availability zones within a region. Availability zones are isolated from each other. To enhance reliability, you can place backup services in subnets in different availability zones. A subnet must exist in a single availability zone.

Each subnet has a [*Route Table*](https://docs.aws.amazon.com/vpc/latest/userguide/VPC_Route_Tables.html), which specifies the route for traffic leaving the subnet. In the above picture, for example, any traffic with a destination within the VPC (`172.31.0.0/16`) is routed locally, but traffic destined for *any* (`0.0.0.0/0`) other destination is routed to the *Internet Gateway*. The most specific route that matches the traffic is used to determine how to route the traffic. When you create a VPC, it automatically has a main route table.
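The "most specific route wins" rule can be sketched as a longest-prefix match. The route table below is a simplified, hypothetical model of the two routes described in the text (the target names `'local'` and `'internet-gateway'` are placeholders), not a representation of how AWS stores routes.

```python
import ipaddress

# Simplified route table: (destination CIDR, target).
routes = [
    (ipaddress.ip_network('172.31.0.0/16'), 'local'),
    (ipaddress.ip_network('0.0.0.0/0'), 'internet-gateway'),
]

def lookup(destination, routes):
    """Return the target of the most specific route matching the destination."""
    addr = ipaddress.ip_address(destination)
    matches = [(net, target) for net, target in routes if addr in net]
    # The longest prefix is the most specific match.
    net, target = max(matches, key=lambda pair: pair[0].prefixlen)
    return target

print(lookup('172.31.5.9', routes))   # local
print(lookup('8.8.8.8', routes))      # internet-gateway
```

An address inside the VPC matches both routes, but the `/16` route is more specific than `/0`, so the traffic stays local.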

An [*Internet Gateway*](https://docs.aws.amazon.com/vpc/latest/userguide/VPC_Internet_Gateway.html) is a VPC component that allows communication between instances in your VPC and the internet. To enable access to or from the internet for instances in a VPC subnet, you must do the following:

- Attach an internet gateway to your VPC.
- Ensure that your subnet's route table points to the internet gateway.
- Ensure that instances in your subnet have a globally unique IP address (public IPv4 address, Elastic IP address, or IPv6 address).
- Ensure that your network access control and security group rules allow the relevant traffic to flow to and from your instance.

A [*Security Group*](https://docs.aws.amazon.com/vpc/latest/userguide/VPC_SecurityGroups.html) acts as a virtual firewall for your instance to control inbound and outbound traffic. Security groups act at the instance level, not the subnet level. Therefore, each instance in a subnet in your VPC could be assigned to a different set of security groups. If you don't specify a particular group at launch time, the instance is automatically assigned to the default security group for the VPC.

For each security group, you add rules that control the inbound traffic to instances, and a separate set of rules that control the outbound traffic. 

In [None]:
Image(filename='default-security-group.png') 

Here are some more security group example rules.

In [None]:
Image(filename='security-group-examples.png')
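To make the idea of inbound rules concrete, here is a simplified sketch of how a set of allow rules might be evaluated. The rules, the tuple layout, and the function name `inbound_allowed` are all hypothetical; real security groups have more features (for example, referencing other security groups as sources) that are not modeled here.

```python
import ipaddress

# Hypothetical inbound rules: (protocol, first port, last port, allowed source CIDR).
inbound_rules = [
    ('tcp', 22, 22, '203.0.113.0/24'),   # SSH from one example network
    ('tcp', 443, 443, '0.0.0.0/0'),      # HTTPS from anywhere
]

def inbound_allowed(protocol, port, source_ip, rules):
    """Return True if any rule allows the traffic; with no match it is denied."""
    src = ipaddress.ip_address(source_ip)
    return any(
        protocol == proto
        and first <= port <= last
        and src in ipaddress.ip_network(cidr)
        for proto, first, last, cidr in rules
    )

print(inbound_allowed('tcp', 443, '8.8.8.8', inbound_rules))      # True
print(inbound_allowed('tcp', 22, '8.8.8.8', inbound_rules))       # False
print(inbound_allowed('tcp', 22, '203.0.113.7', inbound_rules))   # True
```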

In our first lab assignment I'll ask you to create a simple VPC with the above components, along with an EC2 instance, using [CloudFormation](https://aws.amazon.com/cloudformation/).